Download Larbin
Latest version
Here is the latest version of larbin:
larbin.tar.gz
If you need another version of larbin, see
here.
Changelog
V2.6.3 (2003-07-09) :
- Add the possibility to follow only internal links (i.e. links on the
same site). See option noExternalLinks in larbin.conf.
- Correct a compilation problem with gcc 3.XX (and avoid warnings).
- Add the possibility to use "" in larbin.conf to define a token
containing blank characters.
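
As an illustration of the quoted-token feature above, here is a minimal sketch of how a configuration line could be split into tokens while keeping quoted text together. This is not larbin's actual parser: the function name splitTokens and the option name in the usage note are hypothetical, and the sketch assumes no escape sequences inside quotes.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not taken from larbin's sources): splits one
// configuration line into tokens. Blanks separate tokens, except between
// double quotes, where they are kept as part of the token.
std::vector<std::string> splitTokens(const std::string& line) {
    std::vector<std::string> tokens;
    std::string current;
    bool inQuotes = false;  // true between an opening and a closing quote
    bool hasToken = false;  // true once the current token has started

    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;
            hasToken = true;  // so that "" alone yields an empty token
        } else if (!inQuotes && (c == ' ' || c == '\t')) {
            if (hasToken) {   // a blank outside quotes ends the token
                tokens.push_back(current);
                current.clear();
                hasToken = false;
            }
        } else {
            current += c;
            hasToken = true;
        }
    }
    if (hasToken) tokens.push_back(current);  // flush the last token
    return tokens;
}
```

With this sketch, a line made of the hypothetical key someOption followed by "my crawler" in double quotes gives two tokens, the second one containing the blank.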
V2.6.2 (2002-04-14) :
- A very basic implementation of cookies has been added (see
COOKIES in options.h).
- Can now get images (see IMAGES and ANYTYPE in options.h).
- Rewrite the robots.txt and html parser (should use fewer resources and
understand tags better).
- Try to be more portable (index becomes strchr, Makefile
update...). larbin should now compile on Solaris.
- Try to rewrite things in order to make some #ifdef more readable.
- Many cleanups and efficiency updates thanks to profiling.
V2.6.1 (2002-03-09) :
- Some configurations did not compile.
- Possibility to get images with pages (follow img src).
- Correct fatal bug in proxy management.
- Improve robots.txt parser (normalize path : /./, %xx, ...).
- Cleanups.
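
The path normalization mentioned for the robots.txt parser (/./ and %xx) can be sketched as follows. This is only an illustration under stated assumptions, not the code used in larbin: the function names are hypothetical, and the choice to leave %2F encoded (so that an escaped slash is not confused with a path separator) is an assumption of this sketch.

```cpp
#include <string>

// Returns the numeric value of a hex digit, or -1 if c is not one.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: normalizes a URL path before it is compared with
// robots.txt rules. Decodes %xx escapes (except %2F) and removes "/./".
std::string normalizePath(const std::string& path) {
    std::string decoded;
    for (std::string::size_type i = 0; i < path.size(); ++i) {
        if (path[i] == '%' && i + 2 < path.size()) {
            int hi = hexValue(path[i + 1]);
            int lo = hexValue(path[i + 2]);
            if (hi >= 0 && lo >= 0) {
                char c = static_cast<char>(hi * 16 + lo);
                if (c != '/') {      // assumption: keep %2F encoded
                    decoded += c;
                    i += 2;          // skip the two hex digits
                    continue;
                }
            }
        }
        decoded += path[i];          // ordinary character or invalid escape
    }

    // Collapse every "/./" into "/" until none is left.
    std::string::size_type pos;
    while ((pos = decoded.find("/./")) != std::string::npos) {
        decoded.erase(pos, 2);
    }
    return decoded;
}
```

With such a normalization, /%7Euser/./index.html and /~user/index.html compare as the same path.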
V2.6.0 (2002-01-12) : This version does not work with a proxy
- larbin website moves to sourceforge.
- Add a new output file, which does some stats on the size of the
pages (see STATS_OUTPUT in options.h).
- Depth can be calculated for each site or for the whole search (see
DEPTHBYSITE in options.h).
- Big work on the sequencer.
- DNS requests follow CNAME chains (far fewer DNS errors).
- More static buffers to avoid allocations and fragmentation.
V2.5.9 (2001-12-12) :
- specificSearch has changed a lot and must now be configured from
"options.h" (instead of "larbin.conf").
- Try to make output interface simpler. The file to change is now
useroutput.cc. There are predefined examples
(interf/XXXuseroutput.cc).
- Try to make buffers for specific pages more flexible (see
"fetch/specbuf.cc" and "fetch/specbuf.h"). Add a dynamic buffer option.
- Can choose whether you're interested in CGI (see CGILEVEL in
options.h).
- New management of timeout.
- Crawl through a proxy works again.
- Try to avoid too many DNS calls and to increase the number of
sites held simultaneously in RAM.
- Improve the webserver.
- Possibility to totally disable the webserver (using port 0 in
larbin.conf). This way, it becomes possible to launch no thread at
all.
- Correct a small bug and enhance robots.txt parsing.
V2.5.0 (2001-11-22) :
- The old config.h is now named options.h.
- The stats page now includes fancy histograms (thanks to Laurent
Viennot). See GRAPH in options.h (set by default).
- Possibility to limit bandwidth usage (see MAXBANDWIDTH in options.h).
- Larbin now works on FreeBSD (a configure script has been added).
- Change in the RELOAD semantics. By default, it restarts from
where it last stopped. To restart from scratch, use -scratch. Also save
duplicate information if it exists.
- Possibility to manage specific files with a bigger size than
maxFileSize (they are directly stored on disk). This is SPECIFICSAVE
in options.h.
- Possibility for larbin to stop when everything has been fetched
(EXIT_AT_END in options.h).
- Many code cleanups.
V2.2.2 (2001-10-02) :
- You can now save files and respect directory structure (option
MIRROR_SITES in config.h).
- Change the way depthInSite works (more intuitive).
- Correct bug with sites closing connection before the end of headers.
V2.2.1 (2001-09-13) : This version is buggy if you try to read the
content of the pages (for instance if you use the SAVE option).
- Add the possibility to suppress duplicate pages (option NO_DUP in
config.h).
- Parse the whole page (except headers), only when totally
received. This is much better for the duplicate option.
V2.2.0 (2001-07-23) :
- Add the possibility to save fetched pages in files.
- Replace select with poll to allow an unlimited number of connections.
- Some efficiency updates (url normalizer, html parser...).
- Possibility to query the same server many times simultaneously (use
with care if the server is not yours).
V2.1.1 (2001-06-19) :
- Url parser improvement.
- Possibility to use one less thread (if the output never performs
blocking operations).
- Possibility to disable the -reload option : this avoids unnecessary
saves of the hashtable on disk (#define RELOAD in config.h).
V2.1.0 (2001-06-06) : Should be much more stable than the previous
versions : a quick test without tuning (but 24 hours long) returned 4.5
million pages without any problem (memory consumption : 50 MB).
- Improve the html parser (do not parse comments and improve cgi
filter).
- Possibility to associate tags to urls.
- Update in the input system.
- Rewrite Makefiles and the configuration system (now much more
customizable).
- Again fewer allocations in many places thanks to static buffers
(especially url.cc and PersistentFifo.cc).
- Delete stupid (and buggy) hacks.
V2.0.1 (2001-05-23) : Contains some bugs that may cause long-term
slowdowns.
- Rewrite of the input section (first step toward multi-host larbin).
- Rewrite of the robots.txt parser (fewer allocations and a fix for a
very old bug).
- Small improvements of the html parser.
V2.0.0 (2001-05-11) : Contains some bugs that may cause long-term
slowdowns.
- Big internal rewrite (it allows fewer DNS calls and no more rapid
fire with virtual hosts).
- Much less copying of data.
- Far fewer allocations : this should lead to less fragmentation, and so
to lower memory consumption.
- Small API change (headers and content of the page are now char* :
previously they were String).
- More tolerant to buggy html.
V1.2.2 (2001-04-04) :
- More tolerant to buggy html
- Correct a bug with specificSearch (parsing of the configuration
file)
- Suppress some system calls when possible (especially time)
- Use less cpu when reaching the end of the search
- Stats improvements
- New Makefile options (make stats and make bigstats), if you want
stats on stdout. "make bigstats" might decrease performance a lot.
V1.2.1 (2001-03-12) :
- More Makefile enhancements
- Suppress shutdown calls since they seem to hang some kernels
- Manage redirections as errors and follow them correctly
- Correct a bug with specificSearch (assertion failed)
- Output functions simplification
V1.2.0 (2001-02-18) :
- Use fewer threads (only for user interaction)
- Makefile enhancements (make all, debug, nodebug, crash and prof)
- Change the directory structure
- Correct a bug in robots.txt management
- Correct a bug in frame management
V1.1.4 :
- RedHat 7.0 (gcc 2.96) compatibility
- Minor bug fixes and feature enhancements
V1.1.3 :
- adns 1.0
- Small performance improvements (especially in the parser)
V1.1.2 :
- Larbin works faster and better through a proxy
V1.1.1 :
- Bug fix : no more crash after 2 days
V1.1.0 :
- Possibility to restart larbin where it last stopped
- Stats and output improvements
- Makefiles cleanup
V1.0.2 :
- Increased compatibility : gcc 2.95 (Mandrake 7.0) and alpha
processor
- Input is more powerful (see here)
- Possibility to crawl through a proxy
- HTTP headers are saved (Ira Joseph Woodhead)
- No more dynamic library to install
- Less CPU time used on startup
V1.0.1 :
- "make crash" for efficient feedback
- Stats improvements
- Configuration improvement (SpecificSearch, limitToDomain) : see
larbin.conf for details
V1.0.0 : Initial release