
Spiders & Web-Indexing Robots

I began working on spiders in mid-1998, while building my first search engine. What started as a simple project to build a basic search engine for my site had grown into a powerful, large-scale search engine by the end of 2000.

During 1998, I found myself with a lot of free time on my hands and an uncanny creative urge, which proved to be the beginning of a long adventure in business during the idyllic technology bubble of 1999.

NetBreach Family Friendly Search was the front for our adventure, as my cousin Shone and I began the long process of building a complex, intricate search engine from the ground up.

Even though NetBreach only reached about 250,000 documents in its lifespan, it proved to be the springboard into the worlds of business and technology, and eventually spun off into its own business entity, known as iaNett.com.

NetBreach also proved to be the foundation of all of the spiders I have written over the years, so I've documented the major ones here.

  :: Sven
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 1998  Defunct  WinNT  VB 5  None
Sven was my first experience writing a web indexing robot. I learned a lot of lessons over the course of its development that played a large role in the indexers I wrote later. Developed as the primary indexer for the website NetBreach Family-Safe Search, Sven was replaced during 1999 by Origin, a complete rewrite from the ground up.
 
  :: Origin
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 1999  Defunct  WinNT  VB 5  None
Origin was based on many of the lessons I had learned while writing Sven. It included a very robust parser and could be stopped and restarted at will. Origin, like Sven, was a single-threaded indexer and as such took a long time to index a large number of documents. I had solved this problem somewhat by allowing multiple instances of Origin to be run simultaneously; however, this required extensive post-index parsing once the index was complete. During its lifetime, Origin indexed approximately 400,000 documents.
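
As a rough illustration of the stop-and-restart idea, here is a minimal sketch in Python rather than the original VB 5. It is not Origin's code: the function names, the JSON file layout, and the `pending`/`done` fields are all assumptions made for the example. The crawler periodically writes its state to disk, and on startup picks up wherever the last saved state left off.

```python
import json
import os
import tempfile


def save_checkpoint(path, pending, done):
    """Write crawl state atomically so an interrupted save cannot corrupt it."""
    # Hypothetical layout: URLs still to fetch, plus URLs already indexed.
    state = {"pending": list(pending), "done": sorted(done)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp, path)  # rename over the old file in one step


def load_checkpoint(path):
    """Return (pending, done); a missing file means a fresh crawl."""
    if not os.path.exists(path):
        return [], set()
    with open(path, encoding="utf-8") as handle:
        state = json.load(handle)
    return state["pending"], set(state["done"])
```

Writing to a temporary file and renaming it means a crash in the middle of a save leaves the previous checkpoint intact, which matters for a single-threaded indexer that may run for days.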
 
  :: ParaSite
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2000-2001  Defunct  WinNT  C/C++  ParaSite/0.21
This is a spider which I worked on at iaNett, the company which I started with my cousin in 1999. ParaSite is an incredibly powerful spider which went through several different versions over the course of two years. It is designed to index a substantial portion of the web quickly. ParaSite runs using a server and multiple downloaders. Each downloader runs a number of threads, capable of indexing five to ten documents per second. Since this is a parallel implementation, multiple downloaders can be run simultaneously. The server sorts the incoming URLs into queues and hands off batches of URLs to the downloaders for indexing.
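
The server side of that design can be sketched roughly as follows, in Python rather than the original C/C++. This is only an illustration, not ParaSite's code: the `UrlServer` class, the choice of one queue per host, the "longest queue first" rule, and the batch size are all assumptions, since the description above says only that URLs are sorted into queues and handed off in batches.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class UrlServer:
    """Toy coordinator: sorts URLs into per-host queues and hands out batches."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size      # assumed value, not from ParaSite
        self.queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self.seen = set()                 # every URL ever queued

    def submit(self, url):
        """Queue a URL under its host; return False if it was seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc.lower()
        self.queues[host].append(url)
        return True

    def next_batch(self):
        """Return up to batch_size URLs from the longest queue, or [] if idle."""
        if not self.queues:
            return []
        host = max(self.queues, key=lambda h: len(self.queues[h]))
        queue = self.queues[host]
        count = min(self.batch_size, len(queue))
        batch = [queue.popleft() for _ in range(count)]
        if not queue:
            del self.queues[host]  # drop drained queues so idle checks stay cheap
        return batch
```

Each downloader would call `next_batch()` whenever its threads run dry and `submit()` for every link it discovers. Keeping a batch to a single host is one simple way to stop several downloaders from hitting the same site at once.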
 
  :: SiteSpider
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2002  Defunct  Windows/Unix  PERL  SiteSpider/1.0
SiteSpider is a multiplatform indexer which I built to spider client documents for iaNett's site search service. The indexer is capable of indexing up to 1,000 documents per site, and the information is stored to a database searchable by clients. The user can then utilize a simple search server protocol to query the database and generate a search service for their site.
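
The per-site document limit can be sketched like this, in Python rather than the original Perl. It is an illustration under stated assumptions, not SiteSpider's code: `crawl_site` and the `fetch_links` callback are hypothetical names, and the breadth-first order and same-host rule are guesses at reasonable behaviour. Only the 1,000-document cap comes from the description above.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

MAX_DOCS_PER_SITE = 1000  # the per-site limit quoted above


def crawl_site(start_url, fetch_links, max_docs=MAX_DOCS_PER_SITE):
    """Breadth-first crawl of one site, stopping once max_docs pages are indexed.

    fetch_links(url) is a hypothetical callback that returns the hrefs
    found on a page; a real indexer would download and parse the page here.
    """
    site = urlsplit(start_url).netloc
    pending = deque([start_url])
    seen = {start_url}
    indexed = []
    while pending and len(indexed) < max_docs:
        url = pending.popleft()
        indexed.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if urlsplit(link).netloc == site and link not in seen:
                seen.add(link)
                pending.append(link)
    return indexed
```

Passing the link extractor in as a callback keeps the sketch runnable without network access; off-site links are simply ignored, so the crawl stays within the client's own documents.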
 
  :: XP5 (development)
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2002-2004  Defunct  Unix  Various  Project XP5
In my quest to build the world's best web-indexing robot comes this latest crawler. Under development for almost six months as of this writing, XP5 is now nearly fully functional. Initial tests place its real-world performance at roughly 75 documents/second per indexer. XP5 can run fully in parallel, and there are no foreseeable limits on its capabilities apart from the number of available machines and bandwidth. Since XP5 is currently in development, I will wait for further progress on the project before releasing more information.
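
To put the 75 documents/second figure in perspective, here is a back-of-the-envelope calculation in Python. It assumes perfectly linear scaling across indexers and a sustained rate around the clock, neither of which has been measured; `docs_per_day` is a name made up for the example.

```python
DOCS_PER_SECOND = 75              # per-indexer figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def docs_per_day(indexers, rate=DOCS_PER_SECOND):
    """Estimated documents per day, assuming perfectly linear scaling."""
    return indexers * rate * SECONDS_PER_DAY
```

Under those assumptions a single indexer works out to 6,480,000 documents per day, and ten indexers to 64,800,000. Real numbers would be lower once bandwidth and politeness delays are taken into account.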
 
  :: Helix
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2004-2005  Defunct  Unix  Various  Helix/1.0
Helix was a crawler that I built in 2004 which grew out of the work on the XP5 project. Its goal was to build indexes for an experimental search engine I had built.
 
  :: Reaper
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2004-2005  Defunct  FreeBSD  Various  Reaper/2.0x
Reaper was a wide-scale parallel crawler that ran on a small cluster of servers. It never left the prototype stage, as it was scrapped and completely rewritten from the ground up; the rewrite was named Vortex. During its relatively short life, Reaper performed a test crawl which succeeded in downloading nearly 20 million documents, which were later used in the SiteSearch search engine. This data was also used later in the Vortex project.
 
  :: Vortex
 DATE  STATUS  PLATFORM  LANGUAGE  USER AGENT
 2005-2006  Defunct  Unix  Various  Vortex/x.y
Vortex was a discovery robot designed to perform research on internet trends, particularly link distribution and growth. It only ran for a few months in 2005-2006 and was shut down in 2006.

Recently, it seems that someone else decided to use the "vortex" name in their robot. If you have come to this page to find out more information about that other crawler, unfortunately I have no idea who is running it. However, they should change the name of their crawler, as it's very confusing for people trying to contact the operators.




Constructed entirely by hand using only TextPad and PhotoShop
Modified Thursday January 6, 2011 - 5:36 UTC

(C) Copyright 2000-2012 Marty Anstey ~~ I didn't rip you off, so don't rip me off.
