|
I began working on spiders in mid-1998 while building my first search engine. What began as a simple project to
build a basic search engine for my site had grown into a powerful, large-scale search engine by the
end of 2000.
During 1998, I found myself with a lot of free time on my hands and an uncanny creative urge, which proved
to be the beginning of a long adventure in business during the idyllic technology bubble of 1999.
NetBreach Family Friendly Search was the front for our adventure when my cousin Shone and I began the long
process of building a complex, intricate search engine from the ground up.
Even though NetBreach only reached about 250,000 documents in its lifespan, it proved to be the springboard
into the worlds of business and technology, and eventually grew into its own business entity, known as
iaNett.com.
NetBreach also proved to be the foundation of all of the spiders I have written over the years, and as such I've
documented the major ones here.
| :: Sven |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 1998 | Defunct | WinNT | VB 5 | None |
|
Sven was my first experience writing a web indexing robot. I learned a lot of lessons over the course of its development, and they played a large role in
the indexers I wrote later. Developed as the primary indexer for the website NetBreach Family-Safe Search, Sven was eventually replaced during 1999 with Origin, a
complete rewrite from the ground up.
|
| |
| :: Origin |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 1999 | Defunct | WinNT | VB 5 | None |
|
Origin was based on many of the lessons I had learned while writing Sven. It included a very robust parser and could be stopped and restarted at will. Origin, like Sven, was a single-threaded indexer and as such took a long time to index a large number of documents. I solved this problem somewhat by allowing multiple instances of Origin to run simultaneously; however, this required extensive post-index parsing once the index was complete. During its lifetime, Origin indexed approximately 400,000 documents.
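As an illustration of what "stopped and restarted at will" can look like in practice, here is a minimal Python sketch of saving and restoring a crawler's state. It is not Origin's code (Origin was written in VB 5); the function names `save_state` and `load_state`, the JSON file layout, and the frontier/visited split are all assumptions made for the example.

```python
import json
import os
from collections import deque


def save_state(path, frontier, visited):
    """Write the crawl state to a temp file, then rename it over the old one,
    so an interrupted save never leaves a half-written state file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"frontier": list(frontier), "visited": sorted(visited)}, f)
    os.replace(tmp, path)


def load_state(path, seeds):
    """Resume from a saved state if one exists, else start from the seed URLs."""
    if not os.path.exists(path):
        return deque(seeds), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["frontier"]), set(state["visited"])
```

A single-threaded indexer could call `save_state` every few hundred documents and `load_state` once at startup, so stopping it costs at most the work done since the last save.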
|
| |
| :: ParaSite |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2000-2001 | Defunct | WinNT | C/C++ | ParaSite/0.21 |
|
This is a spider which I worked on at iaNett, the company which I started with my cousin in 1999. ParaSite is an incredibly powerful spider which went through several different versions over the course of two years. It is designed to index a substantial portion of the web quickly. ParaSite runs using a server and multiple downloaders. Each downloader runs a number of threads and is capable of indexing five to ten documents per second. Since this is a parallel implementation, multiple downloaders can be run simultaneously. The server sorts the incoming URLs into queues and hands off batches of URLs to the downloaders for indexing.
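The server side of that design can be sketched roughly as follows. This is a simplified Python illustration, not ParaSite's actual C/C++ code: the class name `UrlServer`, the choice to key queues by host name, the duplicate check, and the batch size are all assumptions, since the description above only says that URLs are sorted into queues and handed out in batches.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class UrlServer:
    """Hypothetical coordinator that groups incoming URLs into per-host queues."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.queues = defaultdict(deque)  # host -> pending URLs (assumed keying)
        self.seen = set()                 # URLs that have already been queued

    def add_urls(self, urls):
        """Sort newly discovered URLs into queues, skipping duplicates."""
        for url in urls:
            if url in self.seen:
                continue
            self.seen.add(url)
            host = urlsplit(url).netloc.lower()
            self.queues[host].append(url)

    def next_batch(self):
        """Return up to batch_size URLs from one non-empty queue, or []."""
        for host in list(self.queues):
            queue = self.queues[host]
            count = min(self.batch_size, len(queue))
            batch = [queue.popleft() for _ in range(count)]
            if not queue:
                del self.queues[host]  # drop drained queues
            if batch:
                return batch
        return []
```

Each downloader would call `next_batch()` whenever its threads run dry and report any newly found links back through `add_urls()`. Keeping one host per batch is a common way to keep a single downloader responsible for pacing requests to a given site, though whether ParaSite did this is not stated.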
|
| |
| :: SiteSpider |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2002 | Defunct | Windows/Unix | PERL | SiteSpider/1.0 |
|
SiteSpider is a multiplatform indexer which I built to spider client documents for iaNett's site search service. The indexer is capable of indexing up to 1,000 documents per site, and the information is stored in a database searchable by clients. The client can then use a simple search server protocol to query the database and generate a search service for their site.
|
| |
| :: XP5 (development) |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2002-2004 | Defunct | Unix | Various | Project XP5 |
|
This latest crawler comes out of my quest to build the world's best web-indexing robot. Under development for almost six months as of this writing, XP5 is now
nearly fully functional. Initial tests place its real-world performance at roughly 75 documents/second per indexer. XP5 can run fully in parallel, and there are
no foreseeable limits on its capabilities apart from the number of available machines and bandwidth. Since XP5 is still in development,
I will wait for further progress on the project before releasing more information.
|
| |
| :: Helix |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2004-2005 | Defunct | Unix | Various | Helix/1.0 |
|
Helix was a crawler that I built in 2004 which grew out of the work on the XP5 project. Its goal was to build indexes for an experimental search engine I had built.
|
| |
| :: Reaper |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2004-2005 | Defunct | FreeBSD | Various | Reaper/2.0x |
|
Reaper was a wide-scale parallel crawler that ran on a small cluster of servers. It never left the prototype stage, as it was scrapped and completely rewritten from the ground up. This rewrite was named Vortex.
During its relatively short life, Reaper performed a test crawl which succeeded in downloading nearly 20 million documents, which were later used in the SiteSearch search engine. This data was also later used in the Vortex project.
|
| |
| :: Vortex |
| DATE | STATUS | PLATFORM | LANGUAGE | USER AGENT |
| 2005-2006 | Defunct | Unix | Various | Vortex/x.y |
Vortex was a discovery robot designed to perform research on internet trends, particularly link distribution and growth. It ran for only a few months in 2005-2006 and was shut down in 2006.
Recently, it seems that someone else decided to use the "vortex" name in their robot. If you have come to this page to find out more about
that other crawler, unfortunately I have no idea who is running it. However, they should change the name of
their crawler, as it's very confusing for people trying to contact the operators.
|
|