Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Top 3 Pros & Cons

The Best Open-Source Web Crawling Frameworks in 2021

In my search for a suitable back-end crawler for my startup, I looked at many open source solutions. After some initial research, I narrowed the choice down to the 10 systems that seemed to be the most mature and widely used: 

  • Scrapy (Python), 
  • Heritrix (Java),
  • Apache Nutch (Java),
  • Web-Harvest (Java),
  • MechanicalSoup (Python), 
  • Apify SDK (JavaScript),
  • Jaunt (Java),
  • Node-crawler (JavaScript),
  • PySpider (Python), 
  • StormCrawler (Java).

What is the best open source Web Crawler that is very scalable and fast?

As a starting point, I wanted the crawler of choice to satisfy the properties described in “Web Crawling and Indexes”:

  • Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
  • Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
  • Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
  • Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
  • Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.
  • Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.
  • Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.
  • Extensible: Crawlers should be designed to be extensible in many ways — to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

Table: comparison of the top 3 open source crawlers in terms of various parameters (IJSER)

I also had a wish list of additional features that would be nice to have. Instead of just being scalable, I wanted the crawler to be dynamically scalable, so that I could add and remove machines during continuous web crawls. I also wanted the crawler to be able to export data into a variety of storage backends or data pipelines like Amazon S3, HDFS, or Kafka.

Focused vs. Broad Crawling

Before getting into the meat of the comparison let’s take a step back and look at two different use cases for web crawlers: Focused crawls and broad crawls.

In a focused crawl you are interested in a specific set of pages (usually a specific domain). For example, you may want to crawl all product pages on amazon.com. In a broad crawl the set of pages you are interested in is either very large or unlimited and spread across many domains. That’s usually what search engines are doing. This isn’t a black and white distinction. It’s a continuum. A focused crawl with many domains (or multiple focused crawls performed simultaneously) will essentially approach the properties of a broad crawl.

Now, why is this important? Because focused crawls have a different bottleneck than broad crawls.

When crawling a single domain (such as amazon.com) you are essentially limited by your politeness policy. You don’t want to overwhelm the server with thousands of requests per second, or you’ll get blocked. Thus, you need to impose an artificial limit on requests per second, usually based on server response time. Because of this artificial limit, most of the CPU and network resources of your server will sit idle. A distributed crawler using thousands of machines will not make a focused crawl go any faster than running it on your laptop.
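To make this concrete, here is roughly what such a politeness configuration looks like in Scrapy (described below). This is only a sketch; the numbers are illustrative placeholders, not recommendations.

```python
# settings.py of a Scrapy project: a minimal politeness sketch.
# All numbers below are illustrative placeholders.

ROBOTSTXT_OBEY = True                  # respect robots.txt

# Hard per-domain limits.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 1.0                   # seconds between requests to the same domain

# AutoThrottle adjusts the delay dynamically based on observed server latency,
# which is exactly the "limit based on server response time" described above.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```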

Figure: architecture of a web crawler.

In the case of a broad crawl, the bottleneck is the performance and scalability of the crawler. Because you request pages from many different domains, you can potentially perform millions of requests per second without overwhelming any single server. Instead, you are limited by the number of machines you have, their CPU and network bandwidth, and how well your crawler can make use of these resources.

If all you want is to scrape data from a couple of domains, then looking for a web-scale crawler may be overkill. In that case, take a look at services like import.io (from $299 monthly), which is great at scraping specific data items from web pages.

Scrapy (described below) is also an excellent choice for focused crawls.

Meet Scrapy, Heritrix and Apache Nutch

Scrapy

Website: http://scrapy.org

Language: Python

Scrapy is a Python framework for web scraping. It does not have built-in functionality for running in a distributed environment, so its primary use case is focused crawls. That is not to say that Scrapy cannot be used for broad crawling, but other tools may be better suited for that purpose, particularly at a very large scale. According to the documentation, the best practice for distributing crawls is to manually partition the URLs based on domain.

What stands out about Scrapy is its ease of use and excellent documentation. If you are familiar with Python you’ll be up and running in just a couple of minutes.
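For illustration, a minimal spider for a hypothetical product listing might look like the sketch below; the URL, CSS selectors, and names are made up for the example.

```python
import scrapy


class ProductSpider(scrapy.Spider):
    """A minimal focused crawler: scrape product cards and follow pagination."""

    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical listing page

    def parse(self, response):
        # One item per product card (selectors are placeholders).
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get(default="")),
            }

        # Follow the "next page" link, if there is one.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider product_spider.py -o products.json` writes the scraped items out through one of the built-in exporters mentioned below.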

Scrapy has a couple of handy built-in export formats such as JSON, JSON lines, XML and CSV. Scrapy was built for extracting specific information from websites, not necessarily for getting a full dump of the HTML and indexing it. The latter requires some manual work to avoid writing the full HTML content of all pages to one gigantic output file: you would have to chunk the files yourself.
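One way around this, sketched below, is a small item pipeline that writes JSON lines and rotates the output file every N items. The class name, file pattern, and threshold are arbitrary; Scrapy would pick the pipeline up through the ITEM_PIPELINES setting.

```python
import json


class ChunkedJsonLinesPipeline:
    """Scrapy item pipeline: write items as JSON lines, one file per chunk."""

    chunk_size = 100_000  # arbitrary threshold; tune to taste

    def open_spider(self, spider):
        self.item_count = 0
        self.file_index = 0
        self.file = None

    def process_item(self, item, spider):
        # Start a new output file whenever the current chunk is full.
        if self.item_count % self.chunk_size == 0:
            if self.file:
                self.file.close()
            self.file = open(f"items-{self.file_index:05d}.jl", "w", encoding="utf-8")
            self.file_index += 1
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        self.item_count += 1
        return item

    def close_spider(self, spider):
        if self.file:
            self.file.close()
```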

Without the ability to run in a distributed environment, scale dynamically, or run continuous crawls, Scrapy is missing some of the key features I was looking for. However, if you need an easy-to-use tool for extracting specific information from a couple of domains, then Scrapy is nearly perfect. I’ve successfully used it in several projects and have been very happy with it.

Pros:

  • Easy to set up and use if you know Python
  • Excellent developer documentation
  • Built-in JSON, JSON lines, XML and CSV export formats

Cons:

  • No support for running in a distributed environment
  • No support for continuous crawls
  • Exporting large amounts of data is difficult

Want to develop a custom web crawler in Python for OLX? You can follow Adnan’s guide.


Heritrix

Website: webarchive.jira.com

Language: Java

Heritrix is developed, maintained, and used by The Internet Archive. Its architecture is described in this paper and largely based on that of the Mercator research project. Heritrix has been well-maintained ever since its release in 2004 and is being used in production by various other sites.

Heritrix runs in a distributed environment by hashing the URL hosts to appropriate machines. As such it is scalable, but not dynamically scalable. This means you must decide on the number of machines before you start crawling. If one of the machines goes down during your crawl you are out of luck.
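Conceptually, that partitioning works like the simplified sketch below: each host is hashed to a fixed crawler instance, which is also why adding or removing a machine mid-crawl breaks the assignment. This is an illustration of the idea, not Heritrix’s actual code.

```python
import hashlib
from urllib.parse import urlparse

NUM_MACHINES = 4  # fixed before the crawl starts


def machine_for(url: str) -> int:
    """Map a URL's host to one of the crawler machines by hashing."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MACHINES


print(machine_for("https://example.com/some/page"))  # always the same machine
```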

The output format of Heritrix is WARC files written to the local file system. WARC is an efficient format for writing multiple resources (such as HTML) and their metadata into one archive file. Writing data to other data stores (or formats) is currently not supported, and it seems like doing so would require quite a few changes to the source code.
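If you only need to get the crawled pages into another pipeline, one workaround is to post-process the WARC files outside Heritrix. Below is a rough sketch using the third-party Python library warcio; the file name is hypothetical.

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio


def iter_html_pages(warc_path):
    """Yield (url, html_bytes) for every HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()


if __name__ == "__main__":
    # From here the pages could be pushed to S3, Kafka, an index, and so on.
    for url, html in iter_html_pages("crawl-00000.warc.gz"):  # hypothetical file
        print(url, len(html))
```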

Continuous crawling is not supported, but apparently it is being worked on. However, as with many open source projects the turnaround time for new features can be quite long and I would not expect support for continuous crawls to be available anytime soon.

Heritrix is probably the most mature of the open source projects I looked at. I have found it easier to set up, configure and use than Nutch. At the same time it is more scalable and faster than Scrapy. It ships with a web frontend that can be used for monitoring and configuring crawls.

Pros:

  • Mature and stable platform. It has been in production use at archive.org for over a decade
  • Good performance and decent support for distributed crawls

Cons:

  • Does not support continuous crawling
  • Not dynamically scalable. This means you must decide on the number of servers and the partitioning scheme upfront
  • Exports ARC/WARC files. Adding support for custom backends would require changing the source

Apache Nutch

Website: nutch.apache.org

Language: Java

Instead of building its own distributed system, Nutch makes use of the Hadoop ecosystem and uses MapReduce for its processing. If you already have an existing Hadoop cluster, you can simply point Nutch at it. If you don’t, you will need to set up and configure one. Nutch inherits the advantages of the Hadoop MapReduce architecture (such as fault tolerance and scalability), but also its drawbacks (slow disk access between jobs due to the batch nature).

It is interesting to note that Nutch did not start out as a pure web crawler. It started as an open-source search engine that handles both crawling and indexing of web content. Even though Nutch has since become more of a web crawler, it still comes bundled with deep integration for indexing systems such as Solr (default) and Elasticsearch (via plugins). The newer 2.x branch of Nutch tries to separate the storage backend from the crawling component using Apache Gora, but it is still at a rather early stage. In my own experiments I have found it to be rather immature and buggy. This means that if you are considering using Nutch you will probably be limited to combining it with Solr or Elasticsearch, or writing your own plugin to support a different backend or export format.

Despite a lot of prior experience with Hadoop and Hadoop-based projects I have found Nutch quite difficult to setup and configure, mostly due to a lack of good documentation or real-world examples.

Nutch (1.x) seems to be a stable platform that is used in production by various organizations, CommonCrawl among them. It has a flexible plugin system, allowing you to extend it with custom functionality. Indeed, this seems to be necessary for most use cases. When using Nutch you can expect to spend quite a bit of time writing your own plugins or browsing through source code to make it fit your use case. If you have the time and expertise to do so, then Nutch seems like a great platform to build upon.

Nutch does not currently support continuous crawls, but you could write a couple of scripts to emulate such functionality.
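A very rough sketch of what such a wrapper might look like is shown below. The bin/crawl arguments and directory layout differ between Nutch versions, so treat this purely as an illustration and check bin/crawl's usage output first.

```python
import subprocess
import time

# Illustrative only: re-run a fixed-depth Nutch 1.x crawl on a schedule to
# approximate continuous crawling. Paths and argument order are assumptions;
# verify against your Nutch version before using anything like this.
SEED_DIR = "urls"               # directory with seed URL lists (hypothetical)
CRAWL_DIR = "crawl"             # Nutch crawl data directory (hypothetical)
ROUNDS = "2"                    # generate/fetch/parse/updatedb rounds per run
INTERVAL_SECONDS = 6 * 60 * 60  # re-crawl every six hours (arbitrary)

while True:
    subprocess.run(["bin/crawl", SEED_DIR, CRAWL_DIR, ROUNDS], check=True)
    time.sleep(INTERVAL_SECONDS)
```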

Pros:

  • Dynamically scalable (and fault-tolerant) through Hadoop
  • Flexible plugin system
  • Stable 1.x branch

Cons:

  • Bad documentation and confusing versioning. No examples.
  • Inherits disadvantages of Hadoop (disk reads, difficult setup)
  • No built-in support for continuous crawls
  • Export limited to Solr/ElasticSearch (on 1.x branch)

Conclusion

Open source web scrapers are quite powerful and extensible, but using them requires developer skills.

Not surprisingly, there isn’t a “perfect” web crawler out there. It’s all about picking the right one for your use case. Do you want to do a focused or a broad crawl? Do you have an existing Hadoop cluster (and knowledge in your team)? Where do you want your data to end up?

Out of all the crawlers I have looked at, Heritrix is probably my favorite, but it’s far from perfect. That’s probably why some people and organizations have opted to build their own crawler instead. That may be a viable alternative if none of the above fits your exact use case. If you’re thinking about building your own, I would highly recommend reading through the academic literature on web crawling first.

What are your experiences with web crawlers? What are you using and are you happy with it? Did I forget any? I only had limited time to evaluate each of the above crawlers, so it is very possible that I have overlooked some important features. If so, please let me know in the comments.

Did you know that Google uses Java?


Need a list of the top 50 open source web crawlers for data mining?

Free Social Network Analysis Software – Who is Watching You?

You spend a lot of time on social media, don’t you? Don’t worry, most of us do. And if you are like us, you have probably been at least a little bit curious about who is actually looking at your profile the most. Maybe a former friend or someone you like is secretly lurking on your profile? Well, now there is a mobile app that can show you your top profile viewers.

Check out the mobile app Social Network Analyzer.

SNA screenshot

Social Network Analyzer was created by Bilal Raad. This innovative app retrieves a full list of your social network profiles, like Facebook, Twitter, and Instagram. It then gets to work analyzing your top interactions such as comments, likes, and chats, and even goes as far as showing you your profile views. In order to find out who views your profile you have to interpret the list a little bit. But don’t worry, it is easy: all you have to do is exclude the people that you normally interact with, and the rest will be your viewers. There is a free version and a paid version. Check out the free version first and try it out; if you like what you see, you will have the option to switch to the paid version. The free version shows you part of the list, but for the complete list you will have to purchase the full version. If you do not want to pay, there are still bonuses that allow you to unlock the top spots: all you have to do is share Social Network Analyzer on your Twitter or Facebook profile.

Ready to find out who is secretly viewing your social media profiles? Go download Social Network Analyzer. The app is currently available on iOS and Android devices. Head to your app store and search for “Social Network Analyzer” to download it today.

4 Apps Like Social Network Analyzer

  • InstaGhost is a great application that helps you spot inactive users: people you interact with who have stopped posting or are fed up with using Instagram.
  • Crowdfire is an all-in-one solution for managing your social media under one platform.
  • Wish to get a huge number of likes on Instagram? Want to be as well known as VIPs like Kim Kardashian, Dan Bilzerian, and others? Try one final application: Get Likes for Instagram.
  • SocialViewer for Instagram is another intuitive app for Instagram users, which lets them track all their account activity and access data for each user who has recently interacted with their profile.

Meet Free Social Network Analysis Tools

Socilab

Socilab is an online tool that lets you visualize and analyze your LinkedIn network using methods derived from social-scientific research. It displays a number of network measures drawn from sociological research on professional networks, and percentile bars comparing your aggregate network measures to those of past users. There is also a messaging feature that allows you to type and send a message to selected LinkedIn contacts.

JUNG

JUNG stands for Java Universal Network/Graph Framework. This Java library provides an extendible language for the analysis, modeling and visualization of data that can be represented as a graph or network. JUNG supports numerous graph types (including hypergraphs) with arbitrary properties.

It enables customizable visualizations, and includes algorithms from graph theory, social network analysis and data mining. However, it is limited by the amount of memory allocated to Java.

Netlytic

Netlytic is a cloud-based text analyzer and social network visualizer that can automatically summarize large datasets of text and visualize social networks from conversations on social media sites like Twitter, YouTube, online forums, and blog comments. The tool is mainly developed for researchers to identify key and influential constituents, and to discover how information flows in a network.

NodeXL

NodeXL is an open-source template for Microsoft Excel for network analysis and visualization. It allows you to enter a network edge list in a worksheet, click a button and visualize your graph, all in the familiar environment of the Excel window.

The tool supports extracting email, YouTube, Facebook, Twitter, WWW, and Flickr social networks. You can easily manipulate and filter the underlying data in spreadsheet format.

Open-source Image Recognition Library

What is the best image recognition app?

  • Google Image Recognition. Google is renowned for creating the best search tools available. …
  • Brandwatch Image Insights. …
  • Amazon Rekognition. …
  • Clarifai. …
  • Google Vision AI. …
  • GumGum. …
  • LogoGrab. …
  • IBM Image Detection.

Is there an app that can find an item from a picture?

Google Goggles: Image-Recognition Mobile App

The Google Goggles app is an image-recognition mobile app that uses visual search technology to identify objects through a mobile device’s camera. Users can take a photo of a physical object, and Google searches and retrieves information about the image.

Virtual Churches: Is VR the Future of Religious Technology?

In the 18th century, the American colonies experienced an enormous swell of religious fervor known as the First Great Awakening.

Leaving the cold, dark churches and their solemnly delivered sermons for exuberant traveling preachers in the town square, the colonials must have felt a profound spiritual transformation.

Since that First Great Awakening, there have been many other breakthroughs in the way people worship and interact spiritually: televangelists, megachurches, and online churches, just to name a few.

They’re all part of the constantly evolving way that we experience religion, and for the church to stay relevant, it will need to continue to embrace innovation by listening for new voices coming from the town square.

VR is one of the most recent examples of technology altering the way we worship, and in this article, we’ll look at how your church could become a virtual church.

What is virtual reality?

Virtual reality uses computer-generated 360-degree images to immerse the viewer in a comprehensive, realistic experience.

When I recently came home from a summer in Washington, D.C., I brought my little sister a virtual reality viewer that let her visit the national monuments without ever leaving her room. She loved it, and spent hours that night looking around the Jefferson Memorial, the Washington Monument, the Capitol Building, and (my favorite) the Lincoln Memorial.

Virtual reality offers an escape from ordinary content consumption, but it also offers an exciting educational experience.

Students can work with virtual reality in the classroom to visit historical sites, watch a demonstration, or learn about the stars. Employees can train on the job without entering into dangerous or uncomfortable situations in real life.

Virtual reality can also be a portal to faith and spirituality for those who might not otherwise get the chance.

Virtual reality and the Church

L. Michelle Salvant, founder of Mission:VR, considers virtual reality to be vital for faith formation.

“We have a lot of people talking about faith,” she says. “But I know for a surety that when you can experience someone’s faith and their hope, when you can go inside of their life and feel it, conversion will increase.”

Mission:VR partnered with Covalent Reality to create an environment for the virtual reality platform Google Cardboard called BelieveVR.

BelieveVR uses 360-degree cameras to follow the stories of people of faith as they struggle to overcome spiritual challenges. The first story, called “Healed,” follows Florida pastor Nicky E. Collins as she finds spiritual support during her battle with breast cancer. After the premiere of the short film, Salvant said viewers were “uplifted” by the experience, which they called “awesome” and “touching.”

That’s just one example of how virtual reality is affecting the worship experience.

#1 Case: Church Online Platform

Church Online is developing a virtual reality platform to supplement its online presence.

#2 Case: Virtual Reality Church

The Virtual Reality Church uses the software platform AltspaceVR (free for most Android VR devices) to bring their congregants a 360-degree church experience.

What is VR Church?

In recent times the Church has been slow to venture into new technology, with many pastors citing allegiance to the good old pen and paper over an iPad. Indeed, technology is often met with mistrust, but this was not always the case for God’s people. In 1454 one of these new technologies, then in its infancy, was used to print the Gutenberg Bible. At one point pen and paper were a new technology, and God embraced that in order to communicate his word. As Christians, it is our mandate to communicate God’s loving kindness through whatever means we have available to us. Advanced technology and recent breakthroughs such as virtual reality present new media channels through which the Church can communicate the timeless message of Jesus.

VR Church is not so much about being a church as it is about making use of technology to engage in the mission of the Church. Virtual reality is an upcoming media channel that can be used like any other to communicate God’s word. Virtual reality is different from many channels, though, in that it presents more ways for us to communicate God’s word than other channels: it is actually most similar to theatre in that respect. An important aspect of this channel is that we can communicate God’s word using different learning modalities: kinaesthetic, audio and visual learning styles are all represented, and this has been shown to help with memorisation, as well as engaging a wide range of people.

So what specifically can we do with virtual reality that we couldn’t do already? Well, in a quick moment in our busy lives we can put on our VR headset and:

  • Be transported to a new and beautiful place which reflects the words of the Psalms
  • Be completely immersed in scenes from the bible
  • Find a quiet and still place to pray and worship
  • Join in a unique communal prayer room

There are a lot of different possibilities for how we might engage with God in virtual reality; what is left to do now is to get out there and present these ideas so that they might be experienced and enjoyed.

#3 Case: SecondLife

SecondLife, one of the earliest virtual reality software systems, offers three different churches in its platform, and you can start exploring for free.

The National Shrine of the Divine Mercy in Second Life

No one expects any of these platforms to replace traditional churches, and even a megachurch would be ill-advised to buy hundreds of high-end Oculus Rift headsets for its congregation.

But as VR devices become more and more affordable and ubiquitous (Google Cardboard can turn your smartphone into a VR headset for about $15, for example), virtual reality will become another channel for people to communicate on.

For people who can’t physically come to church, people who want to enhance their experience with new technology, or people who want to come back to the church, virtual reality offers another opportunity to do so.

Using virtual reality in the church brings a holistic spiritual experience to those who can’t or wouldn’t normally go to church. Making faith accessible to more people was one of the major goals Salvant hoped to achieve with her church software. People with disabilities are less likely to attend a service, so virtual reality quite literally opens the door for hundreds of thousands of disabled religious people across the country.

Clearly, VR is making its way into the spiritual community. But what are some ways you can use it in your own church?

1. Live stream in VR

I love watching live streams. They’re like a little window into someone else’s life. My church at home sets up a live stream at every Mass for those who can’t make it. But VR takes it to a whole new level. Your congregants can meet in a common space that requires no travel time, no cleanup, and no reservation.

VahanaVR by Orah

Software such as VahanaVR by Orah, or Facebook Spaces for Oculus Rift provides real-time hangout spaces for any kind of event, meeting, or service. Your congregants don’t necessarily need a headset. All you need to broadcast in 3D is:

  • the software (VahanaVR sells for $2,195; Facebook Spaces is free on Oculus devices)
  • a 360-degree camera (they start at about $100)
  • a connection to a livestreaming platform such as Facebook, YouTube, or Twitter

Anyone who wants to watch can simply click the link, put on a headset or watch on their computer screen, and become immersed in the experience.

2. VR small groups

You can also apply virtual reality to smaller groups of people. Instead of meeting at the church for department decisions, your staff can simply log onto an online VR platform.

A demo in Hyperfair

  • Hyperfair, which is usually used for businesses, could be used to meet in a virtual space. Pricing for Hyperfair is not available online.
  • Mozilla’s MozVR is a free, open-source virtual reality framework that allows you to create your own online VR experience and share it with your coworkers or friends (assuming you have advanced programming ability, of course).
  • vTime is a VR social network that allows people to meet in virtual space, and is free on Windows Mixed Reality devices.

Meeting in virtual space provides a chance to meet new people without the pressure of meeting in person. It gives your congregants a chance to talk to like-minded people about their faith, without having to leave home. For people who are new to the church, a virtual reality group just for them can give them a chance to get to know new people before they arrive.

3. VR group retreats

Retreats are expensive. There’s no way around it. From finding a place to accommodate everyone in your group, to catering food for each night, to planning the content and activities, retreats are labor intensive and costly.

What if you could join them from anywhere in the world using your computer?

Meeting in a virtual space (using the same devices and apps outlined in the sections above) can reduce the cost of reserving a building or campground, as well as travel to that location.

Guest speakers can give talks without ever having to leave their homes. People with disabilities can participate without having to worry about whether the space will accommodate their needs.

And the possibilities are endless. In the near future, you might be able to virtually travel all over the world on your retreat without having to pay for tickets or hotels. Because the virtual environment is computer generated, the options are only limited by what the designers and programmers can dream up. Just be prepared for a little pushback if you try to convince your congregation that a virtual Hawaiian beachfront is as good as the real thing. We’re not quite there yet.

How will you include virtual reality in your church?

Media professionals say virtual reality is the future. Classrooms, real estate, and construction are just three industries where virtual reality is already becoming a part of everyday life.

Places of faith and worship are not far behind. Will you incorporate VR in your church experience in the future? How will VR help your church? Tell us what you think in the comments!

10+2 Superior Automated Testing Tools for Mobile Apps

A collection of the best mobile test automation tools that you can use to test your mobile apps.

Testing Tools for Native or cross-platform applications

1. Appium

-free

An open-source mobile test automation tool to test Android and iOS applications. It supports C#, Java, Ruby, and many other programming languages that have WebDriver client libraries.
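For a flavor of the API, here is a minimal sketch using the Appium Python client in the older desired-capabilities style. The capability values and element id are placeholders, and the exact client API varies between versions, so treat this as illustrative only.

```python
from appium import webdriver  # pip install Appium-Python-Client
from appium.webdriver.common.appiumby import AppiumBy

# Placeholder capabilities for a hypothetical Android app under test.
desired_caps = {
    "platformName": "Android",
    "deviceName": "emulator-5554",
    "app": "/path/to/app-debug.apk",
    "automationName": "UiAutomator2",
}

# Connect to a locally running Appium server.
driver = webdriver.Remote("http://127.0.0.1:4723/wd/hub", desired_capabilities=desired_caps)
try:
    # Find a button by its accessibility id and tap it (the id is a placeholder).
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "login_button").click()
finally:
    driver.quit()
```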

2. Selendroid

-free

Being one of the leading test automation tools, Selendroid tests the UI of Android-based hybrid and native applications as well as the mobile web.

3. MonkeyTalk by Oracle

-closed

MonkeyTalk automates the functional testing of Android and iOS apps.

4. Bitbar Testing

-from $ 99 / month

It is one of the best platforms for testing your iOS and Android apps on devices with different screen resolutions, OS versions, and hardware platforms.

5. Calabash

-free

Calabash works efficiently with .NET, Ruby, Flex, Java and other programming languages.

6. SeeTest

-have trial

SeeTest Automation is a cross-platform solution. It allows you to run the same scripts on different devices.

Android Automation Testing Tools:

1. Robotium

-free
-android

Again an open-source tool to test Android applications of all versions and sub-versions.

2. monkeyrunner

-free
-android

MonkeyRunner is specifically designed for the testing of devices and applications at the framework/functional level.

3. UI Automator

-free
-android

In order to test the user interface of an app, UI Automator creates functional Android UI test cases. It has been recently expanded by Google.

iOS Automation Testing Tools:

1. Frank

-free
-iOS

Frank allows you to test iOS applications and software only. The open-source framework combines JSON and Cucumber.

2. KIF for iOS

-free, open-source

KIF stands for Keep It Functional. It is an open source framework developed for iOS mobile app UI testing. It utilizes the Accessibility APIs built into iOS in order to simulate real user interactions. Tests are written in Objective-C, which is already familiar to iOS developers, but not test teams. Apple’s switch to Swift makes its use of Objective-C a disadvantage going forward.

3. iOS Driver for iOS

-free

iOS Driver utilizes Selenium and the WebDriver API for testing iOS mobile apps. By default it runs on emulators, where execution is faster and more scalable. The current version works with real devices, but actually executes more slowly in that case. No modification of the app source code is required, and no additional apps are loaded on the device under test. iOS Driver is designed to run as a Selenium grid node, which improves test speed as it enables parallel GUI testing.


At the moment, there appear to be many test framework solutions looking for problems, but that is to be expected as mobile app development and testing tools continue to evolve at a rapid pace. Every framework has its pros and cons, each of which should be weighed against the needs of the testing organization and the software products being delivered.

