Larbin
Multi-purpose web crawler
Introduction
Larbin is a web crawler (also called a (web) robot, spider,
scooter...). It is intended to fetch a large number of web pages to
fill the database of a search engine. With a fast enough network,
Larbin should be able to fetch more than 100 million pages on a
standard PC.
Larbin is (just) a web crawler, NOT an indexer. You have to write
some code yourself in order to save pages or index them in a database.
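For instance, a minimal output hook could simply append every fetched
page to a flat file for later indexing. The sketch below is only
illustrative: the callback name and its plain-string parameters are
assumptions, not Larbin's exact interface (see "How to customize
Larbin" for the real one).

    // Illustrative sketch of a custom output hook; the name and the
    // plain-string parameters are assumptions, not Larbin's real API.
    #include <cstdio>

    // Assumed to be called once per successfully fetched page.
    void pageLoaded(const char *url, const char *page, int length) {
      FILE *out = fopen("pages.dump", "a");   // crude flat-file store
      if (out == NULL) return;
      fprintf(out, "URL: %s\nLength: %d\n", url, length);
      fwrite(page, 1, length, out);
      fputc('\n', out);
      fclose(out);
    }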
Larbin was initially developed for the XYLEME project in the VERSO
team at INRIA. Its goal was to fetch XML pages from the web to fill
the database of an XML-oriented search engine. Thanks to these
origins, Larbin is very general-purpose (and easy to customize).
See also:
- How to use Larbin
- How to customize Larbin
Larbin is freely available on the web, under the GPL. Comments are
welcome! Please mail me if you use Larbin; I'd be very happy to hear
about it.
However, this program is not suited for personal use and can easily
be misused (wget or ht://dig are often more appropriate).
Whatever you do with Larbin, remember that I am not at all
responsible for any damage you might cause.
Current state
The current version of Larbin can fetch 5,000,000 pages a day on a
standard PC, but this speed depends mainly on your network.
Larbin works under Linux and uses standard libraries, plus adns (included
in the distribution). The program is multithreaded but prefers select
over a large number of threads, for efficiency.
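To illustrate the idea (this is a generic sketch of the select
pattern, not Larbin's actual code): one thread registers many
non-blocking sockets and services whichever become readable, instead
of blocking a dedicated thread on each connection.

    // Generic sketch of the select() pattern (not Larbin's code):
    // one thread watches many non-blocking sockets at once instead
    // of dedicating a blocking thread to each connection.
    #include <sys/select.h>
    #include <sys/time.h>
    #include <unistd.h>

    void pollConnections(const int *socks, int n) {
      fd_set readable;
      FD_ZERO(&readable);
      int maxfd = -1;
      for (int i = 0; i < n; i++) {          // register every open socket
        FD_SET(socks[i], &readable);
        if (socks[i] > maxfd) maxfd = socks[i];
      }
      struct timeval timeout = {1, 0};       // wake up at least once a second
      if (select(maxfd + 1, &readable, 0, 0, &timeout) <= 0)
        return;                              // timeout or error: nothing to do
      for (int i = 0; i < n; i++) {
        if (FD_ISSET(socks[i], &readable)) {
          char buf[4096];
          ssize_t got = read(socks[i], buf, sizeof buf);
          (void)got;                         // hand the bytes to the parser here
        }
      }
    }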
The advantage of Larbin over wget or ht://dig is that it is much
faster when fetching files from many sites (because it opens many
connections at a time) and very general-purpose (in particular, very
easy to customize).
To do
I have a lot of improvements in mind, but if you need something
specific, mail me (sebastien@ailleret.com).
Here are the things I want to do:
- Allow the program to run on multiple hosts.
- Solaris compatibility.
Here is what you can do with it:
- A crawler for a standard search engine.
- A crawler for a specialized search engine (XML, images, MP3...);
  see the sketch after this list.
- Statistics on the web (about servers or page contents).
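As an example of the specialized case, a crawl can be restricted to
one media type by filtering inside the output hook. The helper below
is hypothetical (its name and parameters are mine, not Larbin's); it
keeps only documents that look like XML.

    // Hypothetical filter for an XML-only crawl; the helper's name
    // and parameters are illustrative, not part of Larbin's API.
    #include <cstring>

    bool wanted(const char *contentType, const char *url) {
      // Keep documents the server labels as XML...
      if (contentType != NULL && strstr(contentType, "xml") != NULL)
        return true;
      // ...or whose URL carries a telltale extension.
      size_t len = strlen(url);
      return len > 4 && strcmp(url + len - 4, ".xml") == 0;
    }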