The Best Open Source Web Crawling Frameworks
On my hunt for the right back-end crawler for my startup I took a look at several open source systems. After some initial research I narrowed the choice down to the three systems that seemed to be the most mature and widely used: Scrapy (Python), Heritrix (Java) and Apache Nutch (Java).
What is the best open source Web Crawler that is very scalable and fast?
As a starting point I wanted the crawler services of choice to satisfy the properties described in Web crawling and Indexes:
- Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
- Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
- Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
- Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
- Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.
- Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.
- Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.
- Extensible: Crawlers should be designed to be extensible in many ways — to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
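To make the robustness property a bit more concrete, here is a minimal sketch of a guard against spider traps. It is not taken from any of the crawlers discussed here; the class name `TrapGuard` and the three limits are assumptions chosen for illustration. The idea is simply to reject URLs that are suspiciously long or deep, and to cap the number of pages fetched per host.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical limits; tune them for the sites you crawl.
MAX_PATH_DEPTH = 12
MAX_URL_LENGTH = 512
MAX_PAGES_PER_HOST = 10_000


class TrapGuard:
    """Rejects URLs that look like they come from a spider trap."""

    def __init__(self, max_depth=MAX_PATH_DEPTH, max_length=MAX_URL_LENGTH,
                 max_pages_per_host=MAX_PAGES_PER_HOST):
        self.max_depth = max_depth
        self.max_length = max_length
        self.max_pages_per_host = max_pages_per_host
        self.pages_per_host = Counter()

    def allow(self, url):
        parts = urlsplit(url)
        host = parts.hostname or ""
        segments = [s for s in parts.path.split("/") if s]
        if len(url) > self.max_length:
            return False  # very long URLs often come from generated links
        if len(segments) > self.max_depth:
            return False  # unusually deep paths, e.g. /a/b/a/b/a/b/...
        if self.pages_per_host[host] >= self.max_pages_per_host:
            return False  # per-host page budget is exhausted
        self.pages_per_host[host] += 1
        return True
```

A crawler would call `allow(url)` before adding a URL to its frontier. Simple limits like these will not catch every trap, but they bound the damage a single domain can do.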
I also had a wish list of additional features that would be nice to have. Instead of just being scalable I wanted the crawler to be dynamically scalable, so that I could add and remove machines during continuous web crawls. I also wanted the crawler to be able to export data into a variety of storage backends or data pipelines like Amazon S3, HDFS, or Kafka.
Focused vs. broad crawling
Before getting into the meat of the comparison let’s take a step back and look at two different use cases for web crawlers: Focused crawls and broad crawls.
In a focused crawl you are interested in a specific set of pages (usually a specific domain). For example, you may want to crawl all product pages on amazon.com. In a broad crawl the set of pages you are interested in is either very large or unlimited and spread across many domains. That’s usually what search engines are doing. This isn’t a black and white distinction. It’s a continuum. A focused crawl with a large number of domains (or multiple focused crawls performed simultaneously) will essentially approach the properties of a broad crawl.
Now, why is this important? Because focused crawls have a different bottleneck than broad crawls.
When crawling one domain (such as amazon.com) you are essentially limited by your politeness policy. You don’t want to overwhelm the server with thousands of requests per second or you’ll get blocked. Thus, you need to impose an artificial limit of requests per second. This limit is usually based on server response time. Due to this artificial limit most of the CPU or network resources of your server will be idle. Having a distributed crawler using thousands of machines will not make a focused crawl go any faster than running it on your laptop.
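As a rough illustration of a limit based on server response time, the sketch below waits a multiple of the last response time before hitting the same host again. The class name `PolitenessThrottle`, the factor of 10 and the minimum and maximum delays are all assumptions made for this example, not values from any particular crawler.

```python
import time


class PolitenessThrottle:
    """Per-host delay proportional to how long the server took to respond."""

    def __init__(self, factor=10.0, min_delay=1.0, max_delay=60.0):
        self.factor = factor
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.next_allowed = {}  # host -> earliest time of the next request

    def delay_for(self, response_seconds):
        # Slow servers get longer pauses, clamped to [min_delay, max_delay].
        delay = response_seconds * self.factor
        return max(self.min_delay, min(self.max_delay, delay))

    def record(self, host, response_seconds, now=None):
        # Call this after every response from the host.
        now = time.monotonic() if now is None else now
        self.next_allowed[host] = now + self.delay_for(response_seconds)

    def wait_time(self, host, now=None):
        # Seconds to wait before the next request to this host (0 if none).
        now = time.monotonic() if now is None else now
        return max(0.0, self.next_allowed.get(host, now) - now)
```

With a throttle like this in place, a server that answers in 0.5 seconds is visited at most once every 5 seconds, no matter how many machines you have. That is exactly why extra hardware does not speed up a focused crawl.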
In the case of a broad crawl the bottleneck is the performance and scalability of the crawler. Because you need to request pages from different domains you can potentially perform millions of requests per second without overwhelming a specific server. You are limited by the number of machines you have, their CPU, network bandwidth, and how well your crawler can make use of these resources.
If all you want is to scrape data from a couple of domains then looking for a web-scale crawler may be overkill. In this case take a look at services like import.io (from $299 monthly), which is great at scraping specific data items from web pages.
Scrapy (described below) is also an excellent choice for focused crawls.
Scrapy
Scrapy is a Python framework for web scraping. It does not have built-in functionality for running in a distributed environment, so its primary use case is focused crawls. That is not to say that Scrapy cannot be used for broad crawling, but other tools may be better suited for this purpose, particularly at very large scale. According to the documentation the best practice to distribute crawls is to manually partition the URLs based on domain.
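Partitioning URLs by domain could look roughly like the sketch below. The function name `partition_by_domain` is made up for this example and is not part of Scrapy; it only shows one way to split a seed list so that every domain ends up in exactly one partition.

```python
from urllib.parse import urlsplit


def partition_by_domain(urls, num_partitions):
    """Split URLs into groups so that each domain lands in exactly one group."""
    domains = sorted({urlsplit(u).hostname or "" for u in urls})
    # Round-robin over the sorted domains keeps the assignment deterministic.
    assignment = {d: i % num_partitions for i, d in enumerate(domains)}
    partitions = [[] for _ in range(num_partitions)]
    for url in urls:
        partitions[assignment[urlsplit(url).hostname or ""]].append(url)
    return partitions


urls = ["http://a.com/1", "http://b.com/1", "http://a.com/2", "http://c.com/1"]
print(partition_by_domain(urls, 2))
```

Each partition can then be handed to a separate Scrapy process or machine as its start URLs. Because a domain never appears in two partitions, the per-domain politeness limits still hold.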
What stands out about Scrapy is its ease of use and excellent documentation. If you are familiar with Python you’ll be up and running in just a couple of minutes.
Scrapy has a couple of handy built-in export formats such as JSON, JSON lines, XML and CSV. Scrapy was built for extracting specific information from websites, not necessarily for getting a full dump of the HTML and indexing it. The latter requires some manual work to avoid writing the full HTML content of all pages to one gigantic output file. You would have to chunk the files manually.
Without the ability to run in a distributed environment, scale dynamically, or run continuous crawls, Scrapy is missing some of the key features I was looking for. However, if you need an easy-to-use tool for extracting specific information from a couple of domains then Scrapy is nearly perfect. I’ve successfully used it in several projects and have been very happy with it.
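Chunking the output manually can be done with a small writer that starts a new file every N records. The sketch below is one possible approach, not a Scrapy feature; the class name `ChunkedJsonLinesWriter`, the `pages-00000.jl` naming scheme and the default of 1000 records per file are assumptions for illustration.

```python
import json
import os


class ChunkedJsonLinesWriter:
    """Writes records as JSON lines, starting a new file every N records."""

    def __init__(self, directory, records_per_file=1000, prefix="pages"):
        self.directory = directory
        self.records_per_file = records_per_file
        self.prefix = prefix
        self.file_index = 0
        self.count_in_file = 0
        self.handle = None
        os.makedirs(directory, exist_ok=True)

    def _open_next(self):
        # Close the current chunk (if any) and open the next numbered file.
        if self.handle:
            self.handle.close()
        name = f"{self.prefix}-{self.file_index:05d}.jl"
        path = os.path.join(self.directory, name)
        self.handle = open(path, "w", encoding="utf-8")
        self.file_index += 1
        self.count_in_file = 0

    def write(self, record):
        if self.handle is None or self.count_in_file >= self.records_per_file:
            self._open_next()
        self.handle.write(json.dumps(record) + "\n")
        self.count_in_file += 1

    def close(self):
        if self.handle:
            self.handle.close()
            self.handle = None
```

In a Scrapy project a writer like this could be called from an item pipeline, writing one record per page with its URL and HTML.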
Pros:
- Easy to set up and use if you know Python
- Excellent developer documentation
- Built-in JSON, JSON lines, XML and CSV export formats
Cons:
- No support for running in a distributed environment
- No support for continuous crawls
- Exporting large amounts of data is difficult
If you want to develop a custom web crawler in Python for OLX, there is a guide from Adnan that walks through it.
Heritrix
Heritrix is developed, maintained, and used by the Internet Archive. Its architecture is described in this paper and largely based on that of the Mercator research project. Heritrix has been well-maintained ever since its release in 2004 and is being used in production by various other sites.
Heritrix runs in a distributed environment by hashing the URL hosts to appropriate machines. As such it is scalable, but not dynamically scalable. This means you must decide on the number of machines before you start crawling. If one of the machines goes down during your crawl you are out of luck.
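The general idea of hashing URL hosts to machines can be sketched in a few lines. This is an illustration of the technique only, not Heritrix's actual code; the function name `machine_for` and the choice of SHA-1 are assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def machine_for(url, num_machines):
    """Map a URL to a machine index based on a stable hash of its host."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % num_machines


# All URLs of one host go to the same machine.
print(machine_for("http://example.com/a", 4))
print(machine_for("http://example.com/b", 4))
```

The sketch also shows why such a scheme is not dynamically scalable: the result depends on `num_machines`, so changing the number of machines mid-crawl reassigns most hosts to different machines.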
Heritrix writes its output as WARC files to the local file system. WARC is an efficient format for writing multiple resources (such as HTML) and their metadata into one archive file. Writing data to other data stores (or formats) is currently not supported, and it seems like doing so would require quite a few changes to the source code.
Continuous crawling is not supported, but apparently it is being worked on. However, as with many open source projects the turnaround time for new features can be quite long and I would not expect support for continuous crawls to be available anytime soon.
Heritrix is probably the most mature of the open source projects I looked at. I have found it easier to set up, configure and use than Nutch. At the same time it is more scalable and faster than Scrapy. It ships together with a web frontend that can be used for monitoring and configuring crawls.
Pros:
- Excellent user documentation and easy setup
- Mature and stable platform. It has been in production use at archive.org for over a decade
- Good performance and decent support for distributed crawls
Cons:
- Does not support continuous crawling
- Not dynamically scalable. This means you must decide on the number of servers and partitioning scheme upfront
- Exports ARC/WARC files. Adding support for custom backends would require changing the source
Further reading:
- Incremental crawling with Heritrix (2005)
- A vertical search engine for school information based on Heritrix and Lucene (2011)
Apache Nutch
Instead of building its own distributed system Nutch makes use of the Hadoop ecosystem and uses MapReduce for its processing (details). If you already have an existing Hadoop cluster you can simply point Nutch at it. If you don’t have an existing Hadoop cluster you will need to set up and configure one. Nutch inherits the advantages (such as fault-tolerance and scalability), but also the drawbacks (slow disk access between jobs due to the batch nature) of the Hadoop MapReduce architecture.
It is interesting to note that Nutch did not start out as a pure web crawler. It started as an open source search engine that handles both crawling and indexing of web content. Even though Nutch has since become more of a web crawler, it still comes bundled with deep integration for indexing systems such as Solr (default) and ElasticSearch (via plugins). The newer 2.x branch of Nutch tries to separate the storage backend from the crawling component using Apache Gora, but is still in a rather early stage. In my own experiments I have found it to be rather immature and buggy. This means that if you are considering using Nutch you will probably be limited to combining it with Solr or ElasticSearch, or writing your own plugin to support a different backend or export format.
Despite a lot of prior experience with Hadoop and Hadoop-based projects I have found Nutch quite difficult to set up and configure, mostly due to a lack of good documentation or real-world examples.
Nutch (1.x) seems to be a stable platform that is used in production by various organizations, CommonCrawl among them. It has a flexible plugin system, allowing you to extend it with custom functionality. Indeed, this seems to be necessary for most use cases. When using Nutch you can expect to spend quite a bit of time writing your own plugins or browsing through source code to make it fit your use case. If you have the time and expertise to do so then Nutch seems like a great platform to build upon.
Nutch does not currently support continuous crawls, but you could write a couple of scripts to emulate such functionality.
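Whatever scripts you write to emulate continuous crawling, they need some logic for deciding when a page is due for another visit. The sketch below is a generic illustration of that logic and is not tied to Nutch in any way; the class name `RevisitScheduler`, the halve-or-double rule and the default intervals are assumptions. It revisits pages more often when they were found changed and less often when they were not.

```python
import heapq


class RevisitScheduler:
    """Keeps a queue of URLs ordered by the time they are next due."""

    def __init__(self, initial_interval=3600.0, min_interval=600.0,
                 max_interval=7 * 86400.0):
        self.initial_interval = initial_interval
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.intervals = {}  # url -> current revisit interval in seconds
        self.queue = []      # heap of (due_time, url)

    def add(self, url, now):
        # New URLs are due immediately.
        self.intervals[url] = self.initial_interval
        heapq.heappush(self.queue, (now, url))

    def pop_due(self, now):
        """Return all URLs whose due time has passed."""
        due = []
        while self.queue and self.queue[0][0] <= now:
            due.append(heapq.heappop(self.queue)[1])
        return due

    def report(self, url, changed, now):
        """Halve the interval if the page changed, double it otherwise."""
        interval = self.intervals[url]
        interval = interval / 2 if changed else interval * 2
        interval = max(self.min_interval, min(self.max_interval, interval))
        self.intervals[url] = interval
        heapq.heappush(self.queue, (now + interval, url))
```

A driver script would call `pop_due` periodically, feed the returned URLs into a crawl round, and call `report` for each page once it knows whether the content changed.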
Pros:
- Dynamically scalable (and fault-tolerant) through Hadoop
- Flexible plugin system
- Stable 1.x branch
Cons:
- Bad documentation and confusing versioning. No examples.
- Inherits disadvantages of Hadoop (disk reads, difficult setup)
- No built-in support for continuous crawls
- Export limited to Solr/ElasticSearch (on 1.x branch)
Not surprisingly, there isn’t a “perfect” web crawler out there. It’s all about picking the right one for your use case. Do you want to do a focused or a broad crawl? Do you have an existing Hadoop cluster (and knowledge in your team)? Where do you want your data to end up?
Out of all the crawlers I have looked at Heritrix is probably my favorite, but it’s far from perfect. That’s probably why some people and organizations have opted to build their own crawler instead. That may be a viable alternative if none of the above fits your exact use case. If you’re thinking about building your own I would highly recommend reading through the academic literature on web crawling, starting with these:
- Web crawling and indexes (from the “Introduction to Information Retrieval” book)
What are your experiences with web crawlers? What are you using and are you happy with it? Did I forget any? I only had limited time to evaluate each of the above crawlers, so it is very possible that I have overlooked some important features. If so, please let me know in the comments.