If the rapid development of artificial intelligence is a rocket, then data is its fuel. Experts in the field have called data "the new gold," underscoring its importance to the success of AI projects. This article briefly outlines the data annotation process for natural language processing (NLP) tasks, an important research area in artificial intelligence.

With the growth of internet communication (social networks, websites, forums, blogs, etc.), language data is becoming increasingly diverse. However, computers have difficulty making sense of raw language data. For computers to identify the characteristics of the data and "infer" more from it, the process of labeling data is of utmost significance. Depending on the specific problem, an annotator attaches one or more pieces of metadata to a piece of text. Labeling data is the first step in helping computers "understand" natural language. The annotation process can be divided up in several ways; one common division follows the project lifecycle: preparation – deployment – packaging and handover.

Figure 1. Data annotation process.

Make comprehensive preparations for a data annotation project

To implement a labeling project, we must first define a detailed specification of the annotation problem, which may be word segmentation, part-of-speech tagging, named entity labeling, syntactic annotation, classifying emails as spam or not, etc. Each problem may be addressed separately, or several may be handled simultaneously.
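To make these task types concrete, the sketch below shows one sentence annotated for two of the problems mentioned above. The sentence, the BIO entity tag scheme, and the part-of-speech tag set are illustrative assumptions, not requirements of the process described here.

```python
# Illustrative only: one sentence annotated for two common tasks.
# The tag sets (BIO entity tags, Universal POS tags) are assumptions.

sentence = ["Marie", "Curie", "was", "born", "in", "Warsaw"]

# Named-entity labels in BIO format: B- begins an entity, I- continues it,
# O marks tokens outside any entity.
ner_tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]

# Part-of-speech labels for the same tokens.
pos_tags = ["PROPN", "PROPN", "AUX", "VERB", "ADP", "PROPN"]

# A labeled example pairs each token with its metadata.
ner_example = list(zip(sentence, ner_tags))
print(ner_example[0])  # ('Marie', 'B-PER')
```

The same text can carry several layers of metadata at once, which is why the problem specification must state exactly which layers are in scope.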

After clearly defining the labeling problem, the next step is to plan and collect the corpus. The two most important factors in corpus collection are representativeness and balance. Representativeness means capturing the full range of characteristics of the data distribution found in practice; for example, when collecting movie theater reviews, we need to cover the many different ways users express their assessments. Balance means collecting data from the multiple sources that make up that distribution; for example, when building a corpus for Vietnamese word segmentation, we need data from electronic newspapers, forums, blogs, social networks, etc. Another very important step when collecting data is to check copyright, to avoid legal problems.
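One simple way to enforce the balance requirement is to cap how many documents any single source contributes. The mini-corpus and source names below are invented for illustration; this is a minimal sketch, not a prescribed collection method.

```python
import random

# Hypothetical mini-corpus: each item is (source, text); sources and
# texts are invented for illustration.
corpus = [
    ("news", "Stock markets rose today."),
    ("news", "The city council met on Monday."),
    ("news", "A new bridge opened downtown."),
    ("forum", "Has anyone tried this recipe?"),
    ("forum", "My laptop will not boot."),
    ("blog", "Five tips for learning a language."),
]

def sample_balanced(corpus, per_source, seed=0):
    """Draw up to `per_source` items from each source so that no single
    source dominates the collected material."""
    rng = random.Random(seed)
    by_source = {}
    for source, text in corpus:
        by_source.setdefault(source, []).append(text)
    sample = []
    for source, texts in sorted(by_source.items()):
        k = min(per_source, len(texts))
        sample.extend((source, t) for t in rng.sample(texts, k))
    return sample

balanced = sample_balanced(corpus, per_source=1)
print([s for s, _ in balanced])  # one item per source
```

In a real project the cap per source would itself be chosen to mirror the target distribution, not necessarily set equal for all sources.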

Once the data source is available, preprocessing steps such as removing "garbage" data and standardizing spelling should be carried out carefully, along with preparing the infrastructure that underpins the annotation project: software for storing labeling results, high-speed internet connections, and so on.
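A minimal preprocessing pass along these lines might strip leftover markup, normalize whitespace, and drop empty or duplicate lines. The regular expressions and cleanup rules below are assumptions; real projects will need language-specific normalization on top of this.

```python
import re

def preprocess(lines):
    """Minimal cleanup sketch: strip stray HTML tags, collapse
    whitespace, and drop empty or duplicate lines."""
    seen = set()
    cleaned = []
    for line in lines:
        line = re.sub(r"<[^>]+>", " ", line)      # remove stray HTML tags
        line = re.sub(r"\s+", " ", line).strip()  # normalize whitespace
        if line and line not in seen:             # skip "garbage" and repeats
            seen.add(line)
            cleaned.append(line)
    return cleaned

raw = ["<p>Hello   world</p>", "", "Hello world", "Second   line"]
print(preprocess(raw))  # ['Hello world', 'Second line']
```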

The next task is to write a data annotation guide. Depending on the complexity of the labeling problem, we can hire an annotation team (annotators) whose skills and knowledge match the problem; the annotators involved in developing the guide need to understand both the requirements of the labeling problem and the labeling guidelines. Similarly, depending on the complexity and scale of the project, the project manager should consider whether to label data in-house or to outsource it (crowd workers). When outsourcing, a data security agreement with the labeling vendor deserves particular attention.

Actual data annotation deployment

The first step is to take a small part of the corpus for a trial run. Why is a trial important? Because testing may reveal problems to fix or improve before large-scale labeling begins. Testing also helps the project manager make a fairly accurate estimate of progress and labeling quality. After completing the trial phase on the initial sample, standard statistical measures such as Cohen's kappa or Fleiss' kappa should be applied to measure the consistency of the labeling results. With these figures, the project manager can form a preliminary view of the labeling quality, decide which parts of the results need to be rechecked, and, if necessary, make changes such as updating the instructions or modifying the process to improve the quality and consistency of the labeling results.
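Cohen's kappa, mentioned above, compares the observed agreement between two annotators with the agreement expected by chance. The following is a plain-Python sketch of the standard formula, with invented spam/ham labels as sample data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    agreement corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling ten emails as spam / ham (invented data).
a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
b = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "ham", "spam"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more often than chance, a signal that the guidelines need revision. Fleiss' kappa generalizes the same idea to more than two annotators.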

After reviewing the trial phase, the next step is to deploy the annotation project at full scale. Depending on the nature of the labeling problem, it can be carried out in-house or outsourced. Labeled data needs to be rechecked by the most experienced annotators, those with the deepest insight into the labeling problem at hand; ideally, these are members who have participated since the first step of the project, in defining the problem or writing the labeling instructions. Annotation quality control is an indispensable step at this stage: weekly quality checks and quality assurance meetings with the labeling team are required. It is also possible to hire an independent unit to verify annotation quality, search for the root causes of labeling errors, and take corrective measures.
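One way the quality checks described above can be tracked is a per-annotator error rate: the fraction of each annotator's labels that reviewers had to change. The review-log format and annotator names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical review log: (annotator, original_label, reviewed_label).
review_log = [
    ("ann1", "PER", "PER"),
    ("ann1", "LOC", "ORG"),
    ("ann1", "PER", "PER"),
    ("ann2", "ORG", "ORG"),
    ("ann2", "LOC", "LOC"),
]

def error_rates(log):
    """Fraction of each annotator's labels changed during review,
    a simple signal for where retraining or guideline updates are needed."""
    total = defaultdict(int)
    errors = defaultdict(int)
    for annotator, original, reviewed in log:
        total[annotator] += 1
        errors[annotator] += original != reviewed
    return {a: errors[a] / total[a] for a in total}

print(error_rates(review_log))  # error rate per annotator
```

Reviewing such figures in the weekly quality meetings makes it easier to spot systematic errors early, rather than at handover.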

Data packaging and handover

This stage involves packaging the labeled data and handing it over to the parties that will use it. If needed, some of the annotation results can be updated after feedback from the data users. Finally, reports on the data, the annotation team, the processes, and the infrastructure should be written to capture the lessons learned for more effective data annotation projects in the future.
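For the hand-over itself, a common packaging choice is JSON Lines: one JSON object per record, easy to stream, diff, and version. The record fields below (`text`, `label`, `annotator`) are an assumed schema for illustration, not a standard the article prescribes.

```python
import io
import json

# Hypothetical final records: text plus its label and provenance.
records = [
    {"text": "Great movie!", "label": "positive", "annotator": "ann1"},
    {"text": "Too long and dull.", "label": "negative", "annotator": "ann2"},
]

# JSON Lines: one JSON object per line. ensure_ascii=False keeps
# non-ASCII text (e.g. Vietnamese) readable in the packaged file.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

package = buffer.getvalue()
print(package.splitlines()[0])
```

Shipping a short data report (sources, label definitions, agreement figures) alongside the file gives the receiving team the context they need to use it correctly.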



Nguyen The Tuyen, Big Data Research Institute
VinTech Technology Development Joint Stock Company
