VnExpress is the leading news and digital media publisher in Vietnam, serving over 30 million daily users. To handle content selection across its channels and media at that scale, the publisher employs a comprehensive big data solution. Built on top of open-source parallel computing and data science platforms, the solution was developed at significantly lower cost and remains easy to upgrade.

The business

The whole inventory of VnExpress can be largely categorized into articles, videos, advertisements, and user-submitted classifieds. The task of content selection thus consists of several distinct but related scenarios:

  • Recommending articles or videos to a viewer reading a particular article or watching a video. Within this scenario, there are multiple use cases and strategies to consider. For example, the recommended items can follow a storyline or explore a certain topic.
  • Selecting advertisements to display on each content page (contextual advertising).
  • Displaying related classifieds to maximize revenue.

These scenarios were traditionally handled by news editors and advertising operators, an approach that quickly ran into scaling problems as the publisher’s inventory grew. The sheer number of available items made it difficult for the staff to consider anything but the most popular items, or those otherwise deemed important. Furthermore, a whole dimension of the content selection process had to be ignored: with an audience of millions, it was humanly impossible to take individual preferences into account.

The automated solution killed two birds with one stone. Now every item in the inventory has a chance to reach the viewer, and the right items are given better opportunities – the audience analysis lets us pick the most suitable content for each viewer in every scenario.

This does not supplant the editor or advertising operator in the content selection process, much like the computer did not make human intellect obsolete. It simply frees up human labor for more critical or productive tasks.

The science

The centerpiece of the whole solution is a content analytics framework, which categorizes and computes relevant statistics for every item in the publisher’s inventory. This is performed by a combination of general data mining and natural language processing techniques such as topic modeling, reinforced by human feedback. Every item in the inventory, e.g. an article, is represented by a set of categories and features. Some of the features are automatically generated by machine learning algorithms, while others are engineered by human experts. The collection of features, called the feature space, is a natural basis onto which audience behavior can be projected.
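
As a rough illustration of how the machine-generated part of these features might be produced, here is a minimal PySpark sketch that fits an MLlib topic model over article text. The column names, input path, vocabulary size, and topic count are assumptions for the example, not the publisher’s actual configuration.

    # Minimal sketch: deriving topic features for articles with Spark MLlib.
    # Column names, the input path, and all parameters are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.appName("content-analytics").getOrCreate()

    # Hypothetical input: one row per article with an id and its plain-text body.
    articles = spark.read.json("hdfs:///inventory/articles.json")  # columns: id, text

    # Tokenizer only splits on whitespace; real Vietnamese text would need
    # proper word segmentation before this step.
    tokens = Tokenizer(inputCol="text", outputCol="words").transform(articles)
    tf = CountVectorizer(inputCol="words", outputCol="tf",
                         vocabSize=50000).fit(tokens).transform(tokens)

    # Fit a topic model; each article gets a topic-distribution vector,
    # which becomes part of its machine-generated feature set.
    lda = LDA(k=100, maxIter=20, featuresCol="tf", topicDistributionCol="topics")
    article_features = lda.fit(tf).transform(tf).select("id", "topics")

The human-engineered features and editorial categories would then be joined onto the same per-item representation.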

Once the entire inventory has been analyzed and classified, each audience member is profiled by the features of the items they have interacted with and the strength of those interactions. Each individual profile can be thought of as a portion of the feature space, which allows us to select suitable items for each viewer and each scenario. This representation is also useful for comparing viewers in terms of the publisher’s content. The profiles are further enriched with demographic information, which helps to improve the recommendations. The additional information can either come from the publisher’s CRM or be inferred from the profile itself by statistical learning.
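
A minimal sketch of this profiling idea, assuming hypothetical interaction weights and per-item feature vectors (the actual weighting scheme and demographic enrichment are not described here):

    # Minimal sketch of audience profiling: a viewer's profile is the
    # interaction-strength-weighted aggregate of the feature vectors of the
    # items they touched. Event weights and feature vectors are hypothetical.
    import numpy as np

    WEIGHTS = {"view": 1.0, "read_to_end": 3.0, "share": 5.0}  # assumed strengths

    def build_profile(interactions, item_features, dim):
        """interactions: iterable of (item_id, event_type) pairs;
        item_features: dict mapping item_id -> feature vector of length dim."""
        profile = np.zeros(dim)
        for item_id, event in interactions:
            profile += WEIGHTS.get(event, 1.0) * item_features[item_id]
        norm = np.linalg.norm(profile)
        return profile / norm if norm > 0 else profile

    def viewer_similarity(profile_a, profile_b):
        # With normalized profiles, the dot product is the cosine similarity,
        # i.e. how close two viewers are in terms of the publisher's content.
        return float(np.dot(profile_a, profile_b))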

The content analysis and audience profiles can finally be used to select content automatically or act as a decision support tool for manual content selection.
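
Continuing the hypothetical sketch above, automatic selection can be as simple as ranking candidate items by how well their feature vectors align with a viewer’s profile; the same ranked list can also be surfaced to an editor as decision support.

    # Rank candidate items for a viewer by profile-feature alignment.
    # Builds on the hypothetical build_profile / item_features sketch above.
    import numpy as np

    def recommend(profile, item_features, exclude=(), top_n=10):
        scores = {
            item_id: float(np.dot(profile, vec))
            for item_id, vec in item_features.items()
            if item_id not in exclude
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_n]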

The technology

Every relevant interaction on any of the publisher’s channels is collected for analysis. The process is parallelized using Apache Spark and Spark MLlib. As a big data platform, Spark is mature and extremely well supported, with continuous additions and upgrades; the MLlib library alone had over 200 contributors from 75 organizations in 2015. The platform is streamlined and scalable, and it allows the same code to transition seamlessly between research, development, and deployment.
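
As an illustration of that last point, the sketch below aggregates raw interaction events with PySpark. The event schema and paths are assumptions; the same job can run on a laptop during research or on the production cluster at deployment, with only the submit configuration changing.

    # Sketch: aggregate interaction events into per-viewer, per-item strengths.
    # The schema (viewer_id, item_id, event) and the paths are assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("interaction-aggregation")
             # .master("local[*]")  # uncomment for laptop-scale research runs
             .getOrCreate())

    events = spark.read.parquet("hdfs:///logs/interactions/")

    strengths = (events
                 .groupBy("viewer_id", "item_id", "event")
                 .agg(F.count("*").alias("times")))

    strengths.write.mode("overwrite").parquet("hdfs:///analytics/strengths/")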

The analytics is scheduled to run every few hours, and the output is cached to a Redis cluster for fast, efficient access. An NGINX proxy takes care of load balancing, leaving the application code to the Apache servers. The entire process is illustrated in the figure below.
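
A minimal sketch of the caching step, assuming the redis-py client, an illustrative key format, and a TTL tied to the batch schedule (a real Redis Cluster deployment would use a cluster-aware client):

    # Push batch recommendation output to Redis so the web tier can fetch it
    # with a single key lookup. Key names, TTL, and the input structure are
    # illustrative assumptions.
    import json
    import redis

    r = redis.Redis(host="redis-cache", port=6379)

    def cache_recommendations(recommendations, ttl_seconds=6 * 3600):
        """recommendations: dict mapping viewer_id -> ordered list of item ids."""
        pipe = r.pipeline()
        for viewer_id, items in recommendations.items():
            # Expire shortly after the next scheduled analytics run refreshes it.
            pipe.setex(f"rec:{viewer_id}", ttl_seconds, json.dumps(items))
        pipe.execute()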

Dr. Dang Hoang Vu – FPT HO
(Published in FPT TechInsight No.1, the FPT Technology Magazine)
