What is Web Scraping?

Web Scraping means extracting data from other websites to serve the purposes of the application you are working on.

Web Scraping has various practical applications; notable ones include:

  • Extraction and aggregation of stock market data;
  • Price comparison of similar products on different online markets;
  • Information gathering from different news sources.
Common architectures in Web Scraping applications

In the above model:

  • Websites: the sites we need to extract data from.
  • Jobs: the Web Scraping applications themselves, either run automatically or triggered by another app, depending on the circumstances.
  • Storage: after processing the input data, the Web Scraping app produces output in a format pre-defined inside the app, and that output is stored in files, databases…
  • Data Consumers: the output data from Web Scraping is consumed by downstream services for whatever purposes we need.
What do we need to know to build a Web Scraping app?

To build a Web Scraping app, you will need to know at least one Back End programming language such as Java, Javascript (NodeJS),… and one quite important Front End concept called the Selector.

The definition of Selector can be read at https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

In simple words, a Selector describes how to identify tags (elements) inside a web page so that the desired information can be extracted from anywhere on that site.

In the above image, the author uses Chrome’s developer tools to get the selector of the title of the first news item on https://vnexpress.net/. The acquired selector is also the one needed for Web Scraping; further details are in the sample code later.

Notice: a selector’s stability depends on the source website. Some sites always display the same kind of information at a pre-defined position, while others change the content behind a given selector over time. This means the same selector may return different data on different tries.
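As a quick illustration, here is a minimal sketch of how a selector is used from code, assuming it is run in the browser’s developer console on the target page. The selector string is a hypothetical placeholder; in practice you would copy the real one from the target site’s markup using Chrome’s developer tools, as described above.

// Run in the browser's developer console on the target page.
// 'h3.title-news a' is a hypothetical selector; replace it with the one
// copied from the target site's actual markup.
const firstTitle = document.querySelector('h3.title-news a');
console.log(firstTitle ? firstTitle.textContent.trim() : 'selector not found');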

How to effectively design a Web Scraping app?

Nowadays, with strong technical support, designing a Web Scraping app is not particularly challenging, but there are still issues to consider: where will our Web Scraping app run, how much CPU and storage will it need, what will it cost per month… To answer these questions, we need to know where the data sources are, what it takes to extract the data, where to store the data, which technologies to use…

As mentioned above, there are various ways to approach this, but this article will use Javascript (NodeJS) as an example and analyze the effectiveness of this approach.

From the author’s experience with Web Scraping, there are 3 major types of web sources (data extraction sources), as follows:

  1. Static Web: the type of site whose initial download already includes all the needed information. Most news sites follow this model.
  2. Dynamic Web: this type differs from static in that the first download is not enough; data continues to be downloaded and filled in afterwards, or the data is constantly changing, as in online chat, auction, stock, or SPA sites, and so on.
  3. The final type not only involves dynamic data but also requires running in a real Browser, with the Browser’s APIs, cookie storage, temporary local data storage…

Each type has its corresponding data extraction methods and libraries, as well as deployment tactics to minimize costs.

  1. Type 1: this type requires little storage and CPU power, as all we need to do is read the source web page’s content, extract the necessary data, and store it for future use.

A usable library for this type of website is: Cheerio
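Below is a minimal sketch of a Type 1 scraper using Cheerio. The URL and selector are placeholders, and axios is assumed here only as a convenient way to download the HTML; any HTTP client would do.

// Type 1 (static web): download the HTML once and extract data with Cheerio.
const axios = require('axios');      // assumed HTTP client; any downloader works
const cheerio = require('cheerio');

async function scrapeStaticPage() {
  const { data: html } = await axios.get('https://example.com'); // placeholder URL
  const $ = cheerio.load(html);
  const titles = [];
  // 'h2.title a' is a hypothetical selector; use the one taken from DevTools.
  $('h2.title a').each((_, el) => titles.push($(el).text().trim()));
  return titles;
}

scrapeStaticPage().then(console.log).catch(console.error);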

  2. Type 2: to extract dynamic data from a website, or data that changes at a given point in time, you will need a library that can understand and execute HTML and Javascript. Because of this, the app has to load the web content into memory for analysis and script execution, which leads to higher storage and CPU requirements.

A usable library for this type of website is: JSDom
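A minimal sketch of a Type 2 scraper using JSDom might look like the following; the URL, selector, and fixed wait time are placeholder assumptions for illustration only.

// Type 2 (dynamic web): let JSDOM run the page's scripts, then query the DOM.
const { JSDOM } = require('jsdom');

async function scrapeDynamicPage() {
  const dom = await JSDOM.fromURL('https://example.com', { // placeholder URL
    runScripts: 'dangerously',  // execute the page's own Javascript
    resources: 'usable',        // load external scripts and resources
    pretendToBeVisual: true,    // provide requestAnimationFrame, etc.
  });

  // Naive wait so the page's scripts have time to fill in the data.
  await new Promise((resolve) => setTimeout(resolve, 3000));

  const items = [...dom.window.document.querySelectorAll('h2.title a')] // placeholder selector
    .map((el) => el.textContent.trim());
  dom.window.close();
  return items;
}

scrapeDynamicPage().then(console.log).catch(console.error);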

  3. Type 3: when a website requires the use of a real Browser’s APIs, we can no longer use virtual browsers, but must instead run that website in a real Browser to get the most accurate results.

A usable library for this type of website is: Puppeteer
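A minimal sketch of a Type 3 scraper using Puppeteer could look like this; the URL, selector, and wait strategy are placeholders for illustration.

// Type 3: drive a real (headless) Chrome/Chromium instance with Puppeteer.
const puppeteer = require('puppeteer');

async function scrapeWithBrowser() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' }); // placeholder URL

  // Evaluate the selector inside the real browser context.
  const titles = await page.$$eval('h2.title a', (els) =>   // placeholder selector
    els.map((el) => el.textContent.trim())
  );

  await browser.close();
  return titles;
}

scrapeWithBrowser().then(console.log).catch(console.error);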

So why does the author feel the need for a different library for each type of web? The answer is to save costs. The 3 aforementioned libraries have different attributes that suit different purposes:

  1. Cheerio: this library allows you to parse and process HTML tags as well as static CSS selectors. It is small and requires no additional special libraries, so Web Scraping apps using Cheerio are easy to package and deploy with Docker to save deployment and operational costs.
  2. JSDom: this library acts as a virtual browser that can read web content, execute Javascript, CSS… and generate the corresponding HTML. Compared to Cheerio, JSDom is more flexible and can do more, but it takes up more storage and CPU resources.

JSDom also requires no additional special libraries, so if needed, you can package it with Docker to save costs.

  3. Puppeteer: a library developed by Google, Puppeteer provides APIs to control Chrome or Chromium via the DevTools Protocol, in two modes: headless and non-headless. Since this library needs a real browser, running Puppeteer on Ubuntu or equivalent servers poses no challenge. However, doing the same in Docker or Serverless environments can run into quite a few configuration problems (see the sketch after this list). Hence, the optimal yet most costly option is to run this library on a virtual machine.
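For reference, the configuration problems mentioned above often come down to Chromium’s sandbox and shared memory inside containers. A commonly used, environment-dependent workaround is to pass extra launch flags, as sketched below; whether they are needed (and safe) depends on your specific setup.

// Launch flags often needed when running Puppeteer inside Docker-like environments.
const puppeteer = require('puppeteer');

async function launchInContainer() {
  return puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',             // Chromium's sandbox often fails in containers
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',  // work around Docker's small /dev/shm default
    ],
  });
}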
Test results for the 3 designs

To see the performance of the 3 aforementioned designs more clearly, the author ran a test trial on GCP with 3 n1-standard-1 machines at the same time, with results as illustrated in the image below:

  • webscraping: the app written with Cheerio.
  • webscraping-jsdom: the app written with JSDom.
  • webscraping-puppeteer: the app written with Puppeteer.
DEMO

This DEMO part can be run on any computer that already has a NodeJS environment.

First, you will need to get the source code from GitHub.

Note that the source code includes 3 branches corresponding to the 3 libraries mentioned above.

After checking out your desired branch, run the following commands to continue:

> npm install

> node index.js

A successful run will show results similar to the image below (for Puppeteer):

Conclusion

While extracting data from websites to serve a project’s purposes is nothing new, changes in technology have created different methods at different stages. Different technologies can be chosen for product development depending on factors such as project requirements, the dev team, budget, and so on. The author hopes that through this article, readers will gain an overview of data extraction, as well as learn some Cloud-based examples that widen the range of their options.

Tran Huu Lap, FPT Software
