In this paper, we report our work on building linguistic resources for Vietnamese social network text analysis in multiple domains. We first describe our annotation methodology, including guideline development, annotation software and quality assurance. We then present the results of the first pilot phase of the project. Finally, we outline the prospects for our ongoing corpus development and its expected results.
Index Terms—linguistic resources, natural language, social text, Vietnamese.
In the era of statistical natural language processing (NLP), manually annotated linguistic data play a vital role in training and evaluating state-of-the-art tools for most NLP tasks. However, linguistic annotation is among the most time-consuming and costly parts of many NLP research efforts. It is therefore important that linguistic data annotation be considered and funded not only by public research institutions but also by private companies working in natural language technologies, especially for under-resourced languages such as Vietnamese.
In this paper, we report an effort to build large linguistically annotated corpora for the analysis of Vietnamese social network texts in multiple domains. This data annotation project is funded by FPT Corporation and aims to build four linguistically annotated datasets, one for each of four social text domains: (1) education, (2) finance and banking, (3) fashion retail and (4) electronic devices. In each domain, about 100,000 sentences are manually annotated with four layers: word segmentation, lemmatization, part-of-speech tagging, and shallow syntactic parsing. The text considered in this project is social in nature and comes from sources such as Facebook posts and comments, discussion forums and blogs.
Several properties set this project apart from existing efforts to build linguistic resources for the Vietnamese language. First, this project focuses on social text rather than conventional newswire text like that of the VLSP (Vietnamese Language and Speech Processing) project. Second, the expected result of this project is much more ambitious than that of VLSP: four social text domains will be considered, and in each domain about 100,000 sentences will be annotated, compared to only about 20,000 sentences that were manually annotated at the syntactic level in the VLSP project. Third, this data development project is funded by a private company (FPT) rather than by a public organization, as was the case for the national VLSP project.
It is worth mentioning that this is an ongoing project, actively carried out at the Research and Development division of FPT Corporation. The project brings together a wide range of personnel in multiple groups, including linguists, natural language processing experts, domain experts, software developers, project managers and support staff. This paper reports the current results of the project after its first pilot stage, focusing on the methodology and early results.
The remainder of this paper is structured as follows. Section II outlines the methodology adopted in our project. Then Section III presents the results of the pilot phase of the project. Finally, Section IV concludes the paper and gives some directions for the future work of the project.
Nguyen The Tuyen (FPT Technology Innovation)
Luong Xuan Vu (Vietnam Lexicography Center)
Le Hong Phuong (Data Science Laboratory, University of Science, VNU Hanoi)