The Best open-source Web Crawling Frameworks in 2021
In my search for a suitable back-end crawler for my startup, I looked at many open source solutions. After some initial research, I narrowed the choice down to the 10 systems that seemed to be the most mature and widely used:
- Scrapy (Python),
- Heritrix (Java),
- Apache Nutch (Java),
- Web-Harvest (Java),
- MechanicalSoup (Python),
- Jaunt (Java),
- PySpider (Python),
- StormCrawler (Java).
What is the best open source Web Crawler that is very scalable and fast?
As a starting point I wanted the crawler services of choice to satisfy the properties described in Web crawling and Indexes:
- Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
- Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
- Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
- Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
- Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network band- width.
- Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.
- Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.
- Extensible: Crawlers should be designed to be extensible in many ways — to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
I also had a wish list of additional features that would be nice to have. Instead of just being scalable I wanted to the crawler to be dynamically scalable, so that I could add and remove machines during continuous web crawls. I also wanted to the crawler to be able to export data into a variety of storage backends or data pipelines like Amazon S3, HDFS, or Kafka.
Focused vs. Broad Crawling
Before getting into the meat of the comparison let’s take a step back and look at two different use cases for web crawlers: Focused crawls and broad crawls.
In a focused crawl you are interested in a specific set of pages (usually a specific domain). For example, you may want to crawl all product pages on amazon.com. In a broad crawl the set of pages you are interested in is either very large or unlimited and spread across many domains. That’s usually what search engines are doing. This isn’t a black and white distinction. It’s a continuum. A focused crawl with many domains (or multiple focused crawls performed simultaneously) will essentially approach the properties of a broad crawl.
Now, why is this important? Because focused crawls have a different bottleneck than broad crawls.
When crawling one domain (such as amazon.com) you are essentially limited by your politeness policy. You don’t want to overwhelm the server with thousands of requests per second or you’ll get blocked. Thus, you need to impose an artificial limit of requests per second. This limit is usually based on server response time. Due to this artificial limit, most of the CPU or network resources of your server will be idle. Having a distributed crawler using thousands of machines will not make a focused crawl go any faster than running it on your laptop.
In the case of broad crawl, the bottleneck is the performance and scalability of the crawler. Because you need to request pages from different domains you can potentially perform millions of requests per second without overwhelming a specific server. You are limited by the number of machines you have, their CPU, network bandwidth, and how well your crawler can make use of these resources.
If all you want it scrapes data from a couple of domains then looking for a web-scale crawler may be overkill. In this case take a look at services like import.io (from $299 monthly), which is great at scraping specific data items from web pages.
Scrapy (described below) is also an excellent choice for focused crawls.
Scrapy is a Python framework for web scraping. It does not have built-in functionality for running in a distributed environment so that it’s primary use case are focused crawls. That is not to say that Scrapy cannot be used for broad crawling, but other tools may be better suited for this purpose, particularly at a very large scale. According to the documentation the best practice to distribute crawls is to manually partition the URLs based on domain.
What stands out about Scrapy is its ease of use and excellent documentation. If you are familiar with Python you’ll be up and running in just a couple of minutes.
Scrapy has a couple of handy built-in export formats such as JSON, JSON lines, XML and CSV. Scrapy was built for extracting specific information from websites, not necessarily getting for a full dump of the HTML and indexing it. The latter requires some manual work to avoid writing the full HTML content of all pages to one gigantic output file. You would have to chunk the files manually.
Without the ability to run in a distributed environment, scale dynamically, or running continuos crawls, Scrapy is missing some of the key features I was looking for. However, if you need easy to use tool for extracting specific information from a couple of domains then Scrapy is nearly perfect. I’ve successfully used in in several projects and have been very happy with it.
- Easy to setup and use if you know Python
- Excellent developer documentation
- Built-in JSON, JSON lines, XML and CSV export formats
- No support for running in a distributed environment
- No support for continuous crawls
- Exporting large amounts of data is difficult
How to develop custom web crawler on Python for OLX? You can do it by guide from Adnan.
Heritrix is developed, maintained, and used by The Internet Archive. Its architecture is described in this paper and largely based on that of the Mercator research project. Heritrix has been well-maintained ever since its release in 2004 and is being used in production by various other sites.
Heritrix runs in a distributed environment by hashing the URL hosts to appropriate machines. As such it is scalable, but not dynamically scalable. This means you must decide on the number of machines before you start crawling. If one of the machines goes down during your crawl you are out of luck.
The output format of Heritrix are WARC files that are written to the local file system. WARC is an efficient format for writing multiple resources (such as HTML) and their metadata into one archive file. Writing data to other data stores (or formats) is currently not supported, and it seems like doing so would require quite a few changes to the source code.
Continuous crawling is not supported, but apparently it is being worked on. However, as with many open source projects the turnaround time for new features can be quite long and I would not expect support for continuous crawls to be available anytime soon.
Heritrix is probably the most mature out of the source projects I looked it. I have found it easier to setup, configure and use than Nutch. At the same time it is more scalable and faster than scrapy. It ships together with a web frontend that can be used for monitoring and configuring crawls.
- Excellent user documentation and easy setup
- Mature and stable platform. It has been in production use at archive.org for over a decade
- Good performance and decent support for distributed crawls
- Does not support continuous crawling
- Not dynamically scalable. This means, you must decide on the number of servers and partitioning scheme upfront
- Exports ARC/WARC files. Adding support for custom backends would require changing the source
- An introduction to Heritrix (2004)
- Incremental crawling with Heritrix (2005)
- A vertical search engine for school information based on Heritrix and Lucene (2011)
Instead of building its own distributed system Nutch makes use of the Hadoop ecosystem and uses MapReduce for its processing ( details ). If you already have an existing Hadoop cluster you can simply point Nutch at it. If you don’t have an existing Hadoop cluster you will need to setup and configure one. Nutch inherits the advantages (such as fault-tolerance and scalability), but also the drawbacks (slow disk access between jobs due to the batch nature) of the Hadoop MapReduce architecture.
It is interesting to note that Nutch did not start out as a pure web crawler. It started as an open-source search engine that handles both crawling and indexing of web content. Even though Nutch has since become more of a web crawler, it still comes bundled with deep integration for indexing systems such as Solr (default) and ElasticSearch(via plugins). The newer 2.x branch of Nutch tries to separate the storage backend from the crawling component using Apache Gora, but is still in a rather early stage. In my own experiments I have found it to be rather immature and buggy. This means that if you are considering using Nutch you will probably be limited to combining it with Solr and ElasticSearch Web Crawler, or write your own plugin to support a different backend or export format.
Despite a lot of prior experience with Hadoop and Hadoop-based projects I have found Nutch quite difficult to setup and configure, mostly due to a lack of good documentation or real-world examples.
Nutch (1.x) seems to be a stable platform that is used in production by various organization, CommonCrawl among them. It has a flexible plugin system, allowing you to extend it with custom functionality. Indeed, this seems to be necessary for most use cases. When using Nutch you can expect spending quite a bit of time writing your own plugins or browsing through source code to make it fit your use case. If you have the time and expertise to do so then Nutch seems like a great platform to built upon.
Nutch does not currently support continuous crawls, but you could write a couple of scripts to emulate such functionality.
- Dynamically scalable (and fault-tolerant) through Hadoop
- Flexible plugin system
- Stable 1.x branch
- Bad documentation and confusing versioning. No examples.
- Inherits disadvantages of Hadoop (disk reads, difficult setup)
- No built-in support for continuous crawls
- Export limited to Solr/ElasticSearch (on 1.x branch)
Open source web scrapers are quite powerful and extensible but are limited to developers.
Not surprisingly, there isn’t a “perfect” web crawler out there. It’s all about picking the right one for your use case. Do you want to do a focused or a broad crawl? Do you have an existing Hadoop cluster (and knowledge in your team)? Where do you want your data to end up in?
Out of all the crawlers I have looked at Heritrix is probably my favorite, but it’s far from perfect. That’s probably why some people and organizations have opted to build their own crawler instead. That may be a viable alternative if none of the above fits your exact use case. If you’re thinking about building your own, I would highly recommend reading through the academic literature on web crawling, starting with these:
- Web crawling and indexes (from the “Introduction to Information Retrieval” book)
- Web crawling survey (2010)
- High-Performance web crawling (Mercator)
- UbiCrawler: A Scalable Fully Distributed Web Crawler
- How to make a Java Web Crawler?
What are your experiences with web crawlers? What are you using and are you happy with it? Did I forget any? I only had limited time to evaluate each of the above crawlers, so it is very possible that I have overlooked some important features. If so, please let me know in the comments.
Do you Know What Google use Java?
Need Top 50 open source web crawlers List for data mining?
This Feature Focus came at the request of you, the people! We had a tidal wave of (three) emails asking for a piece on our data mining functionality, so that’s what you’re getting.
Analysts use our ‘Data Mining’ tool to quickly extract valuable insights from massive datasets – without having to get tangled up in databases and spreadsheets.
This article breaks down:
- What is data mining?
- How does it work?
- Why is it useful?
Get to grips with our data mining tools here.
Once your query has returned a table of results, you can visualize that data in one of eight ways:
- Standard matrix
- Bar chart
- Line chart
- Pie chart
- Scatter graph
- World map
- Sankey chart
Understanding website crawling and how search engines crawl and index websites can be a confusing topic. Everyone does it a little bit differently, but the overall concepts are the same. Here is a quick breakdown of things you should know about how search engines crawl your website. (I’m not getting into the algorithms, keywords or any of that stuff, simply how search engines crawl sites.)
So what is website crawling?
Website Crawling is the automated fetching of web pages by a software process, the purpose of which is to index the content of websites so they can be searched. The crawler analyzes the content of a page looking for links to the next pages to fetch and index.
What types of crawls are there?
Two of the most common types of crawls that get content from a website are:
- Site crawls are an attempt to crawl an entire site at one time, starting with the home page. It will grab links from that page, to continue crawling the site to other content of the site. This is often called “Spidering”.
- Page crawls, which are the attempt by a crawler to crawl a single page or blog post.
Are there different types of crawlers?
There definitely are different types of crawlers. But one of the most important questions is, “What is a crawler?” A crawler is a software process that goes out to websites and requests the content as a browser would. After that, an indexing process actually picks out the content it wants to save. Typically the content that is indexed is any text visible on the page.
Different search engines and technologies have different methods of getting a website’s content with crawlers:
- Crawls can get a snapshot of a site at a specific point in time, and then periodically recrawl the entire site. This is typically considered a “brute force” approach as the crawler is trying to recrawl the entire site each time. This is very inefficient for obvious reasons. It does, though, allow the search engine to have an up-to-date copy of pages, so if the content of a particular page changes, this will eventually allow those changes to be searchable.
- Single page crawls allow you to only crawl or recrawl new or updated content. There are many ways to find new or updated content. These can include sitemaps, RSS feeds, syndication and ping services, or crawling algorithms that can detect new content without crawling the entire site.
Can crawlers always crawl my site?
That’s what we strive for at OutsourceIT, but isn’t always possible. Typically, any difficulty crawling a website has more to do with the site itself and less with the crawler attempting to crawl it. The following issues could cause a crawler to fail:
- The site owner denies indexing and or crawling using a robots.txt file.
- The page itself may indicate it’s not to be indexed and links not followed (directives embedded in the page code). These directives are “meta” tags that tell the crawler how it is allowed to interact with the site.
- The site owner blocked a specific crawler IP address or “user-agent”.
All of these methods are usually employed to save bandwidth for the owner of the website, or to prevent malicious crawler processes from accessing the content. Some site owners simply don’t want their content to be searchable. One would do this kind of thing, for example, if the site was primarily a personal site, and not really intended for a general audience.
I think it is also important to note here that robots.txt and meta directives are really just a “gentlemen’s agreement”, and there’s nothing to prevent a truly impolite crawler from crawling. Many Crawlers are polite, and will not request pages that have been blocked by robots.txt or meta directives.
How do I optimize my website so it is easy to crawl?
There are steps you can take to build your website in such a way that it is easier for search engines to crawl it and provide better search results. The end result will be more traffic to your site, and enabling your readers to find your content more effectively.
Search Engine Accessibility Tips:
- Having an RSS feed or feeds so that when you create new content the search software can recognize new content and crawl it faster.
- Be selective when blocking crawlers using robots.txt files or meta tag directives in your content. Most blog platforms allow you to customize this feature in some way. A good strategy to employ is to let the search engines in that you trust, and block those you don’t.
- Building a consistent document structure. This means when you construct your HTML page that the content you want crawled is consistently in the same place under the same content section.
- Having content and not just images on a page. Search engines can’t find an image unless you provide text or alt tag descriptions for that image.
- Try (within the limits of your site design) to have links between pages so the crawler can quickly learn that those pages exist. If you’re running a blog, you might, for example, have an archive page with links to every post. Most blogging platforms provide such a page. A sitemap page is another way to let a crawler know about lots of pages at once.
To learn more about configuring robots.txt and how to manage it for your site, visit http://www.robotstxt.org/. We want you to be a successful blogger, and understanding website crawling is one of the most important steps.