Having overcome a myriad of strong competitors from around the world, the new language model from FPT.AI came first in SHINRA2020-ML, achieving Rank #1 for 25 out of 30 languages, including languages considered hard to process, such as Korean, Chinese, and Hindi.

SHINRA2020-ML is a competition co-hosted by the RIKEN Institute and Tohoku University, Japan. It is also a major part of the 15th NTCIR Conference, an esteemed annual scientific event in Japan, and has attracted participants from many universities, labs, and institutes around the world. The competition's aim is to build an open knowledge base with optimized artificial intelligence models that are freely shared within the tech community.

Wikipedia – An astounding resource yet to be optimized

Wikipedia consists of a large volume of entities (a.k.a. articles), which is a great resource of knowledge to be utilized in many NLP tasks. To maximize the use of such knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in many NLP applications. The currently structured knowledge bases, such as DBpedia, Freebase, YAGO, and Wikidata, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base.

To create cleaner and more valuable knowledge bases, the structure of the knowledge should be defined top-down rather than bottom-up. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined and fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology with about 200 hierarchical categories, and a set of attributes is defined for each category.
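To make the hierarchical structure of such an ontology concrete, here is a minimal sketch that represents categories as slash-separated paths and tests ancestor relations. The category names below are invented for illustration and are not taken from the actual ENE ver. 8.0 definition.

```python
# Illustrative sketch (not the official ENE data): hierarchical categories
# as slash-separated paths, with simple ancestor queries.

def is_ancestor(parent: str, child: str) -> bool:
    """True if `parent` is a strict ancestor of `child` in the hierarchy."""
    return child.startswith(parent + "/")

def ancestors(category: str):
    """All ancestors of a category, ordered from the root downward."""
    parts = category.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]

label = "Name/Organization/Corporation"
print(ancestors(label))                      # ['Name', 'Name/Organization']
print(is_ancestor("Name/Organization", label))  # True
```

In a hierarchical setting like ENE, such ancestor relations matter because assigning a fine-grained label implicitly asserts all of its coarser ancestor labels as well.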

SHINRA is a resource creation project, started in 2017, that aims to structure the knowledge in Wikipedia. SHINRA2020-ML is the first shared task on text categorization in the SHINRA project, tackling the problem of classifying Wikipedia entities in 30 languages into fine-grained categories.

The goal of this project is not only to compare the participating systems and see which performs best, but also to create a knowledge base from their outputs. State-of-the-art ensemble learning techniques can combine the strengths of the systems and create the knowledge base as accurately as possible.

The SHINRA problem  

SHINRA’s ultimate goal is to restructure the knowledge on Wikipedia according to its characteristics. To do that, we first need to classify Wikipedia entities (articles) in 30 languages into ENE categories (ver. 8.0). The Japanese Wikipedia pages (920,000 pages) have already been categorized; the Wikipedia pages linked from these categorized Japanese pages serve as training data for the 30 other languages, and the system classifies the remaining pages of those languages according to this training data.

The task given at SHINRA2020-ML, in particular, is to classify Wikipedia entities in 30 languages into the 219 categories defined in Extended Named Entity (ENE). The organizers provided training data for all 30 languages, created from the 920,000 categorized Japanese Wikipedia pages and the Wikipedia language links for those languages.

Wikipedia provides language links between its editions. For example, out of 2,263,000 German Wikipedia pages, 275,000 have a language link from Japanese Wikipedia, and these can serve as somewhat noisy training data for German. So, the task is “to classify the remaining 1,988,000 pages into 219 categories, based on the 275,000 categorized pages.” The same holds true for the other 29 languages, as shown in the data statistics.
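The projection of labels across language links can be sketched as follows: a German page linked from a categorized Japanese page inherits that page's label as (noisy) training data. All page titles, labels, and the `project_labels` helper below are invented for illustration; the real data comes from Wikipedia dumps and the categorized Japanese pages.

```python
# Hypothetical sketch of deriving training data from language links.
# Japanese pages already carry ENE labels; linked German pages inherit them.

ja_labels = {
    "ベルリン": "City",
    "トヨタ自動車": "Corporation",
}

# Language links: Japanese title -> German title
ja_to_de = {
    "ベルリン": "Berlin",
    "トヨタ自動車": "Toyota",
}

def project_labels(ja_labels, ja_to_de):
    """Project Japanese labels onto German pages via language links."""
    return {de: ja_labels[ja] for ja, de in ja_to_de.items() if ja in ja_labels}

de_train = project_labels(ja_labels, ja_to_de)
print(de_train)  # {'Berlin': 'City', 'Toyota': 'Corporation'}
```

The projected labels are only as reliable as the language links and the original Japanese annotations, which is why the article calls this data “a bit noisy.”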

The target data for each language is provided as a Wikipedia dump. Participants are requested to submit outputs for the entire target data. The submitted data will be made public so that anyone can apply ensemble learning to create a resource of Wikipedia page categories in 30 languages.

The 30 target languages are: English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian.

FPT.AI and the highly promising multi-language model

The problem given by SHINRA requires processing a huge volume of unstructured data, and its imbalanced nature leads to limitations in estimation, multi-label processing problems, and long training times.

In answer to these requirements and challenges, FPT.AI utilized the BERT model in combination with hierarchical multi-label classification to create a modern multilingual model. This model comprises two parts: an encoder and a decoder. The encoder uses Multilingual BERT (base, cased), a pretrained model, while the decoder uses hierarchical multi-label classification to solve the multi-label classification problem and produce multi-label output.
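The article does not detail how the decoder enforces the hierarchy, so the sketch below shows one common convention for hierarchical multi-label decoding: accept a label only when its score and all of its ancestors' scores clear a threshold. The category names, parent map, and scores are invented, and this is an illustration of the general technique, not FPT.AI's actual decoder.

```python
# Minimal sketch of hierarchical multi-label decoding over per-category
# sigmoid scores (assumed to come from a BERT-style encoder). A label is
# accepted only if it and every ancestor pass the threshold.

PARENT = {  # child -> parent; roots map to None
    "Organization": None,
    "Organization/Corporation": "Organization",
    "Location": None,
    "Location/City": "Location",
}

def decode(scores, threshold=0.5):
    """Return the set of labels consistent with the hierarchy."""
    accepted = set()
    for label, score in scores.items():
        ok = score >= threshold
        parent = PARENT[label]
        while ok and parent is not None:
            ok = scores.get(parent, 0.0) >= threshold
            parent = PARENT[parent]
        if ok:
            accepted.add(label)
    return accepted

scores = {"Organization": 0.9, "Organization/Corporation": 0.8,
          "Location": 0.3, "Location/City": 0.7}
print(decode(scores))  # {'Organization', 'Organization/Corporation'}
```

Note how “Location/City” is rejected despite its high score, because its parent “Location” falls below the threshold; this is what keeps multi-label predictions consistent with the category hierarchy.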

In the competition, FPT.AI’s model was trained in the following 3 stages:

  • Stage 1: Train a single model on all 30 languages. At this point, accuracy on the dataset is still quite low.
  • Stage 2: Fine-tune for each target language. The model generated in this stage is optimized for one specific language and thus achieves better results than the first model.
  • Stage 3: Apply a voting strategy across the per-language results. The final model generates predictions for all languages on the same input, then determines the final results based on label frequencies and language popularity.
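The stage-3 voting idea can be sketched as follows: collect each per-language model's prediction for the same entity, then pick the label with the highest weighted vote. The article only says the vote combines label frequencies with language popularity, so the weights, predictions, and the `vote` helper below are invented for illustration.

```python
# Sketch of a weighted voting strategy over per-language predictions.
from collections import defaultdict

def vote(predictions, weights):
    """predictions: {language: label}; weights: {language: popularity weight}.
    Returns the label with the highest total weighted vote."""
    tally = defaultdict(float)
    for lang, label in predictions.items():
        tally[label] += weights.get(lang, 1.0)
    return max(tally, key=tally.get)

predictions = {"en": "City", "de": "City", "ja": "Province"}
weights = {"en": 3.0, "de": 1.5, "ja": 2.0}
print(vote(predictions, weights))  # 'City' (4.5 vs 2.0)
```

Weighting by language popularity lets well-resourced languages, whose per-language models are presumably more accurate, carry more influence in the final decision.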

Successfully created in stage 3, FPT.AI’s model outperformed the others and achieved Rank #1 for 25 out of 30 languages. Among the strong competitors was Studio Ousia, a renowned Japanese tech firm, which had recently introduced “LUKE”, an NLP model topping the SQuAD v1.1 leaderboard, surpassing BERT (developed by Google in 2018) and SpanBERT (developed by Facebook in 2019).

FPT.AI’s ranking for the English language.

These results demonstrate the technical capabilities of FPT.AI, not only in Vietnam but also in the region and the world. At the moment, this model is integrated into the FPT.AI chatbot to improve on the traditional model, which often struggled with processing speed and accuracy. This new solution will considerably elevate the flexibility and quality of the FPT.AI chatbot while eliminating old challenges the traditional model faced.

With this strong foundation, FPT.AI will continue to research and develop, expanding its chatbot to other languages and gradually conquering tough markets such as Japan, Korea, and Indonesia.

Thao Nguyen – VietBT
