What is Web Scraping?
Web Scraping means extracting data from other websites to serve the purposes of the application you are working on.
Web Scraping has various practical applications, including notable ones like:
- Extracting and aggregating stock market data;
- Comparing prices of similar products across different online marketplaces;
- Gathering information from different news sources.
Common architectures in Web Scraping applications
In the above model:
- Websites: the sources we need to extract data from.
- Jobs: the Web Scraping applications themselves, either run automatically or triggered by another app, depending on the specific circumstances.
- Storage: after processing the input data, the Web Scraping job produces data in the format pre-defined inside the app; the output data is then stored in files, databases, and so on.
- Data Consumers: the output data from Web Scraping is consumed by downstream services to serve whatever purposes we want.
What do we need to know to build a Web Scraping application?
Definitions of selectors can be found at https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
In simple words, a selector is a way to identify elements inside a website so that we can extract the desired information from anywhere on that site.
In the above image, the author uses Chrome's developer tools to get the title's selector for the first news item on https://vnexpress.net/. The acquired selector is also the one needed for Web Scraping; further details are in the sample code later.
Notice: a selector's stability depends on its source website. Some sites always display a given piece of information at the same pre-defined position, while others change what appears under the same selector. This means the same selector may return different data on different tries.
How to effectively design a Web Scraping?
Nowadays, with strong technical support available, designing a Web Scraping application is not particularly challenging, but there are still issues to consider: where to run the app, how much CPU and storage it takes, what the monthly cost is, and so on. To answer these questions, we need to know where the data sources are, what it takes to extract the data, where to store it, and which technologies to use.
Based on the author's experience with Web Scraping, there are 3 major types of web sources (data extraction sources), as follows:
- Static Web: the type of site whose initial download already includes all the information needed. Most news sites follow this model.
- Dynamic Web: this type differs from static sites in that the first download is not adequate; data continues to be downloaded and filled in afterwards. Alternatively, the data changes constantly, as on online chat, auction, stock, and SPA sites, and so on.
- The final type not only includes dynamic data but also requires a real browser running in the background, with the browser's APIs, cookie storage, temporary local data storage, and so on.
Likewise, there are respective data extraction methods and libraries, as well as deployment tactics to best minimize costs.
- Type 1: this type requires little storage and CPU power, as all we need to do is read the source page's contents, extract the necessary data, and store it for future use.
A usable library for this type of website is: Cheerio
- Type 2: this type needs an environment that can emulate a browser's DOM and run the scripts that fill in data after the first download.
A usable library for this type of website is: JSDom
- Type 3: when a website calls for real browser APIs, we can no longer use virtual browsers; instead, we run the website in a real browser to get the most accurate results.
A usable library for this type of website is: Puppeteer
So why does the author feel the need for a different library for each type of web? The answer is to save costs. The three libraries above have different attributes suited to different purposes:
- Cheerio: this library lets you traverse and process HTML elements as well as static CSS selectors. It is small and requires no special dependencies, so Web Scraping apps using Cheerio are easy to pack and deploy with Docker, saving deployment and operational costs.
- JSDom: this library also requires no special dependencies, so if needed you can deploy it with Docker to save costs.
- Puppeteer: developed by Google, Puppeteer provides APIs for controlling Chrome or Chromium via the DevTools Protocol, in two modes: headless and non-headless. As a real browser is needed to use this library, running Puppeteer on Ubuntu or an equivalent server poses no challenge; however, doing the same in Docker or serverless environments can lead to quite a few configuration problems. Hence, the optimal yet most costly option is to run this library on a virtual machine.
Test results for the 3 designs
To better illustrate the performance of the three designs, the author ran a test on GCP with three n1-standard-1 machines at the same time, with results as shown in the image below:
- webscraping: the app written with Cheerio.
- webscraping-jsdom: the app written with JSDom.
- webscraping-puppeteer: the app written with Puppeteer.
This DEMO can be run on any computer with a NodeJS environment already installed.
First, you will need to get the source code from GitHub.
Notice that the source code includes 3 branches respective to the 3 libraries mentioned above.
After checking out your desired branch, run the following commands to continue:
> npm install
> node index.js
A successful run will show results similar to the image below (for Puppeteer):
While extracting data from websites to serve a project's purposes is nothing new, changes in technology have created different methods for different phases. Different technologies can be chosen for product development depending on various factors such as the project's requirements, the dev team, the budget, and so on. The author hopes that through this article, readers will get an overview of data extraction, as well as some Cloud-based examples to widen the range of their options.
Tran Huu Lap – FPT Software