A Case Study of FPT Voice Search Technology for Movies on Demand
About the FPT Software Customer
FPT Software's customer is an American direct broadcast satellite service provider and broadcaster. Its satellite service transmits digital satellite television and audio to households in the United States, Latin America, and the Caribbean. Headquartered in El Segundo, California, US, the company provides a large volume of television services, with more than one million movies on demand and more than two thousand linear program channels. Developing efficient content-based search and management technologies had become an urgent need for this leading satellite TV company. Traditionally, when users want to find a television program or a movie, they have to enter appropriate keywords with a remote control. This search method has several disadvantages for users:
Operation is slow and inconvenient.
Systems can only support simple searches/queries with several keywords, such as movie name or channel name.
Systems cannot interact with users to narrow down or extend the search results.
Recently, voice recognition technologies have been developed and deployed successfully on mobile devices; Siri on iOS is one example. Using voice to control mobile phones, or using natural language to communicate with such smart electronic devices, is a dream that can now be realized. Given the rapid growth of these technologies and the needs of its users, the satellite company decided to develop a system that allows users to search and control smart devices by voice. By using natural language to search and control, users can easily explore the television programs the company provides. The new system aimed to achieve several advantages over the traditional one:
Allowing users to control mobile devices and search movies by using voice. Communicating by using natural language is the most convenient method for people.
Allowing users to search movies by different types of information, including movie names, genres, actors, directors, times, dates, channels, characters, or even descriptions.
Allowing users to interact with the search program to narrow down or extend the search results by changing the search context.
The new system would enable users to describe their needs with varied and flexible sentences/queries instead of keywords only. Here are some examples of natural language sentences that users can say to the system.
“Show me action movies with Tom Cruise tonight”
“I am looking for movies where Keanu Reeves plays samurai”
“Movie in which Tom Hanks is stupid”
“Movie which has a scene with feather falling”
“Show movies directed by Martin Scorsese about gangsters”
“Sherlock Holmes type movies.”
FPT Software Solution
Figure 1 shows the architecture of the voice search system we built for our customer. A user can choose between two input methods: typing on a keyboard or speaking. If voice is used, a speech recognition engine converts the request into text. Either way, the result is an input query in natural language (text).
Figure 1: System architecture of our voice search system.
The natural language query is then processed in three modules: Information extraction, Query formulation, and Search.
Information extraction: The important information in the input query will be extracted, including movie names, genres, actors, directors, times, dates, channels, characters, descriptions, etc.
For example, given the input query “Show me action movies with Brad Pitt tonight”, the information extraction module will identify that “Brad Pitt” is an actor name, “action” is a genre, and “tonight” is a time.
Query formulation: This module builds an SQL query from the important information extracted in the previous step. The SQL query is built by using several rules to combine extracted information.
Search: This module uses the SQL query formed in the previous step to search the movie databases and display movie information to users.
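The three modules above can be illustrated with a small Python sketch. It is not FPT Software's implementation: the lookup lists (`ACTORS`, `GENRES`, `TIMES`), the `movies` table schema, the toy rows, and the function names are all assumptions made for illustration, and the extraction step uses simple phrase matching where the real system relies on trained NLP models. The query formulation rule shown is the simplest possible one: join every extracted condition with AND in a parameterized SQL statement.

```python
import re
import sqlite3

# Hypothetical lookup lists; a production system would use trained NER models.
ACTORS = ("brad pitt", "tom cruise", "jason statham")
GENRES = ("action", "comedy", "drama")
TIMES = ("tonight", "today", "tomorrow")


def extract_info(query):
    """Information extraction: find actor, genre, and time by phrase lookup."""
    text = query.lower()
    tokens = re.findall(r"[a-z]+", text)
    slots = {}
    for actor in ACTORS:
        if actor in text:
            slots["actor"] = actor
            break
    for genre in GENRES:
        if genre in tokens:
            slots["genre"] = genre
            break
    for time_word in TIMES:
        if time_word in tokens:
            slots["air_time"] = time_word
            break
    return slots


def build_sql(slots):
    """Query formulation: AND together one condition per extracted slot."""
    columns = ("genre", "actor", "air_time")  # fixed whitelist of column names
    present = [col for col in columns if col in slots]
    sql = "SELECT title FROM movies"
    if present:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in present)
    params = [slots[col] for col in present]
    return sql, params


def search(conn, query):
    """Search: run the formulated SQL query and return the matching titles."""
    sql, params = build_sql(extract_info(query))
    return [row[0] for row in conn.execute(sql, params)]


# Toy in-memory database; the air times are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (title TEXT, genre TEXT, actor TEXT, air_time TEXT)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        ("Fury", "action", "brad pitt", "tonight"),
        ("Moneyball", "drama", "brad pitt", "tonight"),
        ("Top Gun", "action", "tom cruise", "tomorrow"),
    ],
)
results = search(conn, "Show me action movies with Brad Pitt tonight")
print(results)
```

On this toy table the example query returns only "Fury". Values are passed as SQL parameters rather than concatenated into the query string, which is a sensible default whenever the conditions come from user input.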
Because it works with natural language, our system has several advantages over the previous one. It feels familiar to people and is easy to use. Users can use it anytime and anywhere. It is also powerful in the sense that it sets no limit on expression: users can state exactly what they want.
Techniques in Detail
To develop the system, we had to address several problems.
Automatic Speech Recognition (ASR): Initially, our customer used the ASR system developed by Nuance, whose speech recognition technology is also reported to be used by Siri. They later switched to the AT&T engine. They also used an ASR system developed by FPT Software for research and demonstration purposes. The FPT Software system was based on Kaldi, an open-source speech recognition toolkit.
Natural Language Processing (NLP): All parts related to natural language processing were developed by FPT Software. NLP plays a very important role in the system: it parses the input query to “understand” the needs of users. It recognizes named entities (person names, movie names, channel names, director names, actor names, and so on), recognizes control sentences, determines dates and times, and recognizes special search keywords used to build the SQL query.
Dialogue Processing (DP): Dialogue processing allows users to communicate with the system to narrow down or extend the search results. Consider the following example:
User: Show me action movies
System: Here are 500 action movies for you. You can narrow down the search by giving additional conditions.
User: OK, which are on tonight
System: We found 50 action movies airing tonight
User: With Jason Statham
System: There are 3 action movies with Jason Statham airing tonight
In this example, the user first asks for action movies. Because there are many action movies in the database, the system suggests narrowing down the search by giving additional conditions. The user then provides more conditions (tonight, with Jason Statham) and receives the desired search result.
Knowledge Graph Search (KGS): The Knowledge Graph (KG) is a project developed by FPT Software based on metadata content provided by our customer. KG search allows users to conduct approximate and extended searches based on the relations between entities in the KG, such as directors, actors, movies, awards, and so on.
Many technologies and techniques have been applied to deal with the problems described above:
Kaldi is used for ASR.
Support vector machine (SVM) and Conditional random field (CRF) methods are used for natural language processing tasks.
MongoDB (NoSQL) and Neo4j are used to represent and store graphs in Knowledge Graph Search.
Revo R is used for data analytics.
iOS, Android development
HTML5 for responsive design.
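The dialogue processing behaviour described in this section, where each follow-up turn narrows or extends the previous search, can be sketched in Python as a context object that accumulates conditions across turns. This is only an illustration under assumptions: the `DialogueContext` class, its method names, and the four-entry `MOVIES` catalogue are hypothetical, and the turns are passed in as already-extracted conditions rather than raw speech.

```python
from dataclasses import dataclass, field

# Toy catalogue with placeholder titles; all values are illustrative only.
MOVIES = [
    {"title": "Movie A", "genre": "action", "actor": "jason statham", "air_time": "tonight"},
    {"title": "Movie B", "genre": "action", "actor": "tom cruise", "air_time": "tonight"},
    {"title": "Movie C", "genre": "action", "actor": "jason statham", "air_time": "tomorrow"},
    {"title": "Movie D", "genre": "drama", "actor": "tom hanks", "air_time": "tonight"},
]


@dataclass
class DialogueContext:
    """Keeps the search conditions gathered over the turns of one conversation."""

    constraints: dict = field(default_factory=dict)

    def narrow(self, **new_constraints):
        """Add conditions from a follow-up turn and return the matching movies."""
        self.constraints.update(new_constraints)
        return self.results()

    def extend(self, *slots):
        """Drop conditions to widen the search again."""
        for slot in slots:
            self.constraints.pop(slot, None)
        return self.results()

    def results(self):
        """Return every movie that satisfies all current conditions."""
        return [
            movie
            for movie in MOVIES
            if all(movie.get(key) == value for key, value in self.constraints.items())
        ]


ctx = DialogueContext()
turn1 = ctx.narrow(genre="action")         # "Show me action movies"
turn2 = ctx.narrow(air_time="tonight")     # "OK, which are on tonight"
turn3 = ctx.narrow(actor="jason statham")  # "With Jason Statham"
print(len(turn1), len(turn2), len(turn3))
```

With this toy catalogue the result set shrinks from three movies to two to one as conditions are added, mirroring the 500, 50, 3 progression in the example dialogue. Calling `ctx.extend("air_time")` afterwards would drop the time condition and widen the results again.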
Voice Search Technology for Other Languages
We have also developed voice search systems for other languages, including Vietnamese and Japanese. For these languages (resource-poor languages), the main difficulty is the lack of resources, i.e. a knowledge graph, a movie database, and data for training NLP models. We developed experimental systems for Vietnamese and Japanese by building a small semantic knowledge base and a small movie database. To improve the performance of the system, we employ an active learning framework, which is shown in Figure 2. The performance of the system depends heavily on the quality of the NLP models, the semantic knowledge (or knowledge graph), and the movie database. The system exploits feedback from users to enrich all three. By doing so, the NLP models can be updated frequently and automatically, and the performance of the system improves over time.
Figure 2: Method for improving the performance of the system.
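The feedback loop described above can be sketched in a few lines of Python. This is a heavily simplified stand-in, not the framework in Figure 2: the `FeedbackLearner` class and its methods are hypothetical, and a plain phrase-to-type dictionary takes the place of the NLP models and knowledge graph that the real system would retrain or enrich. It only shows the core idea that phrases the system cannot label are queued for feedback, and confirmed labels are folded back into its knowledge.

```python
class FeedbackLearner:
    """Toy semantic knowledge base that grows from user feedback."""

    def __init__(self, knowledge):
        # Maps a lowercase phrase to its entity type, e.g. "action" -> "genre".
        self.knowledge = dict(knowledge)
        # Phrases the system could not label and wants feedback on.
        self.pending = []

    def label(self, phrase):
        """Return the known entity type, or queue the phrase for feedback."""
        key = phrase.lower()
        if key in self.knowledge:
            return self.knowledge[key]
        if key not in self.pending:
            self.pending.append(key)
        return None

    def add_feedback(self, phrase, entity_type):
        """Store a user-confirmed label so that later queries benefit from it."""
        key = phrase.lower()
        self.knowledge[key] = entity_type
        if key in self.pending:
            self.pending.remove(key)


learner = FeedbackLearner({"action": "genre"})
first = learner.label("Ken Watanabe")          # unknown phrase: None, queued
learner.add_feedback("Ken Watanabe", "actor")  # a user confirms the label
second = learner.label("Ken Watanabe")         # now recognized as an actor
print(first, second, learner.pending)
```

In a fuller system the queued items would typically be the inputs on which a statistical model is least confident, and the confirmed labels would be added to its training data before the model is retrained.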