Top 8 FREE tools for automated web scraping

1. Scrapy

Scrapy is not just a Python library but an entire web scraping framework that provides spiders capable of crawling several websites at once and extracting their data. Using Scrapy, you can create your own spiders and host them on Scrapy Cloud or expose them through a Scrapy API.

A spider is defined with a name, a list of target URLs, and a parse callback (typically self.parse) that handles the HTTP responses. Once the spider is executed, it makes a request to each given webpage, fetches the HTML for the first URL in the supplied list, and parses it according to the rules you define. Scrapy indexes and navigates the HTML tree using CSS selectors along with XPath expressions.

Spiders can also extract and store links that end with a certain extension, for example (but not limited to) .jpg and .png for images, or .pdf and .docx for documents. In addition to extracting standard web content, you can write all of the extracted data out to a separate HTML document. The Requests library can then download the linked media attachments or data for further processing.
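As a minimal standard-library sketch of the idea (no Scrapy required), here is how filtering links by extension works; in a real spider you would pull the hrefs with `response.css("a::attr(href)")` instead:

```python
# Sketch: collect links ending in given extensions from raw HTML,
# using only the Python standard library.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self, extensions):
        super().__init__()
        self.extensions = tuple(extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep only <a href="..."> values with a wanted extension.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(self.extensions):
                    self.links.append(value)

html = """
<a href="/files/report.pdf">Report</a>
<a href="/img/logo.png">Logo</a>
<a href="/about.html">About</a>
"""
parser = LinkExtractor([".pdf", ".png", ".docx"])
parser.feed(html)
print(parser.links)  # ['/files/report.pdf', '/img/logo.png']
```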

Scrapy is an extensive package, and you can swap its modules for other packages like Selenium to enhance the functionality. For example, Scrapy+Splash is a popular combination: Splash is a lightweight browser that helps scrape data from websites with JavaScript content. There’s already a scrapy-splash library available for this combination.

Using the APIs, you can extract data and use Scrapy as a general-purpose web crawler. The crawling and extracting includes managing web page requests, following links, preserving user sessions, and handling output pipelines.

2. Apify SDK

Apify SDK is a universal JavaScript framework for crawling and scraping any website with high scalability and performance. The SDK is loaded into a Node.js project, which lets it browse the web using Node.js’s asynchronous capabilities. Apify SDK can run headless Chrome or Selenium, manage lists and queues of URLs, and run crawlers in parallel at maximum capacity.

While the same task can be performed in Python using Scrapy, Apify SDK works in JavaScript, the language of the web itself, and runs extensive web scraping scripts in Node.js using libraries like Puppeteer and Cheerio. These two tools provide exhaustive functions for scraping the web seamlessly, but the SDK is what ties them together for more scalable use.

Once you have installed Node.js 8.14 or a higher version, you can install Apify SDK from the command line:

npm i apify

Once you create a project, you have three crawler classes to work with: BasicCrawler, CheerioCrawler and PuppeteerCrawler. The SDK ties these classes together for more scalable use, though each can also be used independently in Node.js for web scraping.

3. Web Scraper

Web Scraper is a Chrome extension which runs directly in the Chrome browser and exports data in CSV format. Since you only need Chrome, your OS and other settings do not matter for a successful scraping operation. With its point-and-click interface, it does not rely heavily on the coding and programming skills of the user, which makes it ideal for marketers and researchers who lack coding experience.

Dynamic web pages, which rely heavily on JavaScript for their interface, are easy to scroll through manually but difficult for scraping bots to scrape. The extension still works on such websites, which are getting more popular with time, because it operates inside the browser, where the JavaScript has already executed.

You can add the extension from the Chrome Web Store and use it like any other extension in the browser.

4. Cheerio

Hard-coding a script for a website is risky when the script relies on parsed HTML data that is prone to change. Rather than making you parse data from scratch, Cheerio enables developers to work directly with the HTML downloaded from the webpage. It is open source, developed and updated by backers from the community itself, so the interface and API are user-oriented and easy to use.

Although Cheerio wraps the htmlparser2 parser, it does not interpret webpages the way a real browser, such as one driven by Selenium, does. Thus, if your application is built around the visual side of websites (rendered visuals or CSS), Cheerio cannot generate those for you. Since Cheerio isn’t a web browser, it works with a lean Document Object Model (DOM), which makes parsing and manipulating faster and simpler.

Run the following command to install Cheerio using npm. Install npm first if you don’t have it already; it usually comes bundled with Node.js.

npm install cheerio

Instead of supplying raw HTML data each time, you can combine the request library with Cheerio and load the data directly from the requested webpage.

npm install request

5. Scraper Chrome Extension

Scraper is a bare-bones Chrome extension that is easy to use for simple scraping tasks, since it relies on basic XPath and jQuery selectors. Even if you don’t know the syntax and concepts, you can use the scraper by following a tutorial. There is no comprehensive documentation, and not many tutorials are available for this extension, but the ones that exist are enough to use the tool fruitfully.

The way it works: once you decide on the kind of data you want from a webpage, select one example of it on the page, say, the price of one item out of many. Next, right-click the selection and choose the “Scrape similar” option from the menu. The extension generalises your selection into an XPath query and uses it to pick out all of the matching data (such as every price) on the page.
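To see the idea behind “Scrape similar” outside the browser, here is a small sketch using Python’s standard library: one selected element is generalised into an XPath query that matches every sibling. (ElementTree requires well-formed markup, so the snippet below is XML-clean; the extension itself works on real HTML.)

```python
# Sketch of the "Scrape similar" idea: apply the XPath built from one
# selected element to the whole page to collect all matching values.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <div class="item"><span class="price">$9.99</span></div>
  <div class="item"><span class="price">$14.50</span></div>
  <div class="item"><span class="price">$3.25</span></div>
</body></html>
"""
root = ET.fromstring(page)
# The XPath generalised from selecting one price matches all of them:
prices = [el.text for el in root.findall(".//span[@class='price']")]
print(prices)  # ['$9.99', '$14.50', '$3.25']
```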

You can export this data in CSV format or into Google Spreadsheets or Docs. And now you have your desired data.

To take this semi-automated process further, you can use Node.js or Python to script the entire selection and scraping workflow. You can also plug the exported data into OpenRefine for further cleaning and processing.

You can download and install the Scraper extension from the Chrome Web Store.

6. PySpider

PySpider is a web crawler system in Python that provides an accessible WebUI for editing scripts and monitoring ongoing tasks, along with a project manager and result viewer. Although it runs in Python, it can execute JavaScript after the page loads and save data to a wide range of databases, including but not limited to MySQL, MongoDB and Redis, with SQLAlchemy on the backend. Like Apify, it can run as a headless browser without the time and memory overhead of a full browser.

Of course, there are other libraries available, like selenium and beautifulsoup4, which are more diverse in terms of general-purpose functions than PySpider. But for dedicated functionality like scheduling scrapes and checking for errors during scraping, they need to be paired with cron jobs or external task schedulers.
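To make that concrete, here is a rough standard-library-only sketch of the plumbing you would otherwise hand-roll: retrying a scrape on failure before repeating it on a schedule. The fetch function here is a hypothetical stand-in for a real download, and a real setup would use cron or a task queue rather than a loop:

```python
# Hypothetical sketch of scheduling + error checking without PySpider.
import time

def scrape_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url); retry up to `retries` times on any exception."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)  # back off before the next attempt

# Demo with a fake fetcher that fails once, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary failure")
    return f"<html>data from {url}</html>"

result = scrape_with_retry(flaky_fetch, "https://example.com")
print(result)  # <html>data from https://example.com</html>
```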

PySpider takes care of this with built-in components: a scheduler, a fetcher, a PhantomJS fetcher (which renders JS-enabled pages and returns plain HTML), a processor, a result worker and a WebUI. This data flow, run with multiple instances of each component in parallel, makes PySpider really fast at scraping the web.

Prerequisites: Python 3.5 and PhantomJS

Installation command (use pip or pip3 depending on your setup):

pip3 install pyspider

7. UiPath

UiPath imitates a real user in the browser, reproducing almost all human actions on the web with its Robotic Process Automation (RPA) software. With UiPath Studio and its web scraping tool, you can extract everything useful from a webpage, including statistics, listings, search engine results, catalogs, reviews and particulars about specific fields like employees, products or the company.

UiPath automates the entire web scraping process and takes care of intricacies like logging in automatically, filling in forms, applying filters, navigating multiple web pages, scanning images, and storing it all in a format of your choice. Beyond storage, it can generate reports and feed dashboards and a large variety of databases, ready for you to export or integrate into another application.

The UiPath web scraping wizard guides you through selecting the data you wish to scrape from a given website. It shows you a preview of the columns that will be scraped and then applies the same extraction across the remaining pages.

Apart from the web, UiPath also offers Screen Scraping, which extracts data from scanned images, PDFs and similar documents. You can download and install the UiPath RPA Community edition from the official website, and learn more from the many comprehensive UiPath tutorials available online.

8. Puppeteer

In contrast to user-imitating software, Puppeteer is a framework that drives a headless browser bundled with Chromium; it does not load a UI like other browsers, which lets a scraper or crawler read web data from it directly.

It is similar to Cheerio, but unlike the latter it can handle websites with heavy JavaScript content that require a browser to execute the JS. In combination with the Apify SDK, each URL is represented as a Request object, and for each such object the PuppeteerCrawler opens a new Chrome page, allowing for parallel processing and scraping.

To begin, you will need Node.js 8 or above installed; some knowledge of JavaScript and the DOM also helps a great deal. Puppeteer provides an API on top of the headless browser to operate it, making it hassle-free to use.