Over the past two years, the widespread adoption of chatbots and voicebots (collectively, bots), especially in finance and banking in Vietnam, has motivated more diverse work on text analysis and processing [1-4]. Analyzing the text generated during bot operations (either text exchanged between users and chatbots, or text transcribed from users’ voices by voicebots) gives the bot system a better contextual understanding over the course of its interaction with users. Since most services are provided directly to users in real time, the quality of conversations between bots and users heavily influences the customer’s experience and impression of the service provider.

Therefore, it is essential to respond to customers’ inquiries promptly and with high-quality information. Figure 1 shows an example of a bank chatbot that delivers an imperfect customer experience.

Figure 1: Imperfect Chatbot from a Bank.

To process customer inquiries more quickly and accurately, the bot system first has to identify whether what the customer says or types is a question. If it is a question, the customer expects an answer; if it is a declarative statement, most customers only need the system to record their feedback.


Our recent research [5] proposed an approach to this problem: identifying whether a user’s utterance is a question. The algorithm uses a library that supports the Google speech-to-text (STT) service to transcribe the user’s voice into Vietnamese text, then analyzes the transcribed text to decide whether it is a question or a statement.
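
The two-stage flow described above (transcribe, then classify) can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the speech-to-text library, so the third-party SpeechRecognition package is used here as an assumed stand-in, and the function names `transcribe_with_google` and `classify_recording` are hypothetical.

```python
from typing import Callable


def transcribe_with_google(wav_path: str, language: str = "vi-VN") -> str:
    """Transcribe a WAV file to text.

    Assumption: uses the SpeechRecognition package, which may differ
    from the library used in the study.
    """
    import speech_recognition as sr  # third-party, imported lazily

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_google(audio, language=language)


def classify_recording(
    wav_path: str,
    transcribe: Callable[[str], str],
    is_question: Callable[[str], bool],
) -> str:
    """Stage 1: speech to text. Stage 2: question vs. statement."""
    text = transcribe(wav_path)
    return "question" if is_question(text) else "statement"
```

Passing the transcriber and the question test as arguments keeps the two stages independent, so the text classifier can be tried on typed chatbot input without any audio.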

In particular, the algorithm converts a recording into text sentences. It then uses pre-defined question words (in Vietnamese) to detect whether the text was meant as a question. This can be done by two methods:

  • Method 1: Use words and phrases commonly used in questions, for example: “do you have”, “what time”, “for what”, “what”, “what can be taken”, etc. Normally, these phrases consist of two or more words, with some unusually long exceptions such as “what local dishes are special?”
  • Method 2: Use a two-layered word structure to identify questions. For example, the first layer contains words such as “have”, “can”, “probably”, “may”, “still”, while the second layer contains question phrases like “do you?”. If a first-layer word is followed by a second-layer phrase (the order first_layer – second_layer), the sentence containing them is marked as a question. Note that this research does not cover rhetorical questions.
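
The two methods can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [5]: the word lists below are the English stand-ins quoted above (the study uses Vietnamese question words), and the names `matches_phrase`, `matches_two_layer`, and `is_question` are hypothetical.

```python
import re

# Assumption: English placeholders taken from the examples above;
# the study uses Vietnamese question words.
QUESTION_PHRASES = ["do you have", "what time", "for what", "what"]
FIRST_LAYER = ["have", "can", "probably", "may", "still"]
SECOND_LAYER = ["do you"]


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and pad with spaces for whole-word search."""
    return " " + " ".join(re.findall(r"\w+", text.lower())) + " "


def matches_phrase(text: str) -> bool:
    """Method 1: the text contains a known question phrase."""
    norm = _normalize(text)
    return any(f" {phrase} " in norm for phrase in QUESTION_PHRASES)


def matches_two_layer(text: str) -> bool:
    """Method 2: a first-layer word is followed later by a second-layer phrase."""
    norm = _normalize(text)
    for word in FIRST_LAYER:
        start = norm.find(f" {word} ")
        if start == -1:
            continue
        after = start + len(word) + 1  # index of the space after the word
        if any(norm.find(f" {p} ", after) != -1 for p in SECOND_LAYER):
            return True
    return False


def is_question(text: str) -> bool:
    return matches_phrase(text) or matches_two_layer(text)
```

With these placeholder lists, `is_question("I can exchange foreign currency.")` returns `False`, because the sentence contains a first-layer word but no question phrase and no second-layer phrase after it.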

To ensure good-quality input data for the algorithm, we used a free data source from FPT Corporation, the FPT Open Speech Dataset [4]. The dataset contains over 25,000 recordings with a total duration of over 30 hours. To the best of our knowledge, this is among the first studies to use this freely available dataset since it was released in late 2018.

We then manually selected 176 recordings containing questions to test the algorithm. The result: 156 recordings were correctly identified as questions, giving the algorithm an accuracy of about 88.64%. After reviewing the generated texts, we found the two most common errors to be as follows:

  • First, the speech-to-text system generated the text incorrectly (missing words) at a rate of around 10.23%.

For example:

  • Voice: “Can I exchange foreign currency?” – Question
  • Text: “I can exchange foreign currency.” – Affirmative statement

  • Second, the algorithm still misidentified some questions due to contextual elements, at a rate of about 1.13%.

For example:

  • Voice: “Do you care about antiques?” – Question
  • Text: “You do care about antiques.” – Affirmative statement
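
The reported percentages can be checked with a short calculation. The totals (176 recordings, 156 correct) come from the text; the per-error counts of 18 and 2 are not stated in the source and are inferred here from the reported rates, so treat them as an assumption.

```python
TOTAL = 176     # recordings tested (from the text)
CORRECT = 156   # correctly identified as questions (from the text)

# Assumption: counts inferred from the reported rates, not given in the source.
STT_ERRORS = 18      # speech-to-text dropped words
CONTEXT_ERRORS = 2   # misidentified due to context


def pct(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total


accuracy = pct(CORRECT, TOTAL)
stt_rate = pct(STT_ERRORS, TOTAL)
context_rate = pct(CONTEXT_ERRORS, TOTAL)

print(f"accuracy:       {accuracy:.2f}%")
print(f"STT errors:     {stt_rate:.2f}%")
print(f"context errors: {context_rate:.2f}%")
```

Under this assumption the three counts add up to all 176 recordings. Note that 2/176 rounds to 1.14%, so the reported figure of about 1.13% appears to be truncated rather than rounded.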

On average, processing each recording takes around 0.9 seconds, with a standard deviation of around 0.458 seconds, mostly due to differences in recording length. The longest processing time is approximately 2.6 seconds, and the shortest is about 0.34 seconds.
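
Timing statistics of this kind can be collected with a small harness such as the one below. It is a sketch only: `time_processing` and its `process` argument are hypothetical names, and the source does not say whether the reported standard deviation is the sample or the population value (the sample version is used here).

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def time_processing(process: Callable, recordings: Iterable) -> Dict[str, float]:
    """Run `process` on each recording and summarize the wall-clock times."""
    durations = []
    for recording in recordings:
        start = time.perf_counter()
        process(recording)
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        # Sample standard deviation; requires at least two recordings.
        "stdev": statistics.stdev(durations),
        "min": min(durations),
        "max": max(durations),
    }
```

Here `process` would be whatever function transcribes and classifies one recording, so the measured time includes the speech-to-text call, which is likely where most of the variation with recording length arises.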

In the future, the algorithm’s accuracy can be improved by extending the library of question words, as well as by supporting languages with stable structures (for example, English). Higher accuracy can also be achieved through deeper analysis of voice features such as pitch, speaking speed, and accent. The quality of the sound input could also be improved to make speech-to-text systems more effective and avoid inaccuracies such as missing words, which can hinder question identification.






[5] T. D. Chung, H. H. Son, and K. Alexandra, “A Question Detection Algorithm for Text Analysis,” in Proc. 2020 5th International Conference on Intelligent Information Technology (ICIIT 2020), Hanoi, pp. 1-6.


The author would like to express deep gratitude to Ha Hong Son (Student ID: HE140611) for his contributions to the completion of this project, the data preparation, and the reporting of the research results [5].

Dr. Tran Duc Chung
Computing Fundamental Department, FPT University
