Last week, FPT Software held a virtual seminar, Solution Forum #66, to share VnExpress’s lessons on building its own copyright detection system.
The keynote speaker was Mr. Luu Xuan Viet, Manager of DX projects at FPT Online. He introduced the License Checking System, VnExpress’s current copyright detection system. It was one of FPT Online’s flagship digital transformation (DX) projects in the first quarter of 2020 and is expected to open up a new business model in the future.
At the seminar, Viet shared the specific challenges facing the anti-piracy teams at VnExpress and other electronic newspapers.
Previously, the anti-piracy team manually copied article text into Google Search, looked for similar content, took screenshots of suspected violations and saved them to an Excel file. This method has several limitations:
- The number of sites returned by Google Search varies widely. On average, VnExpress can check only 200 to 300 sites per article per day.
- VnExpress cannot check all of its articles comprehensively, only the latest ones.
- The process is manual, time-consuming and inaccurate. Storing data in Excel spreadsheets makes statistics difficult to compile.
Starting from these “pain points”, the R&D team of FPT Online has been researching and developing an AI-powered anti-piracy system that detects illegal use of VnExpress’s content, both text and images, in newspapers and other media.
The team uses image similarity techniques, such as CNN, hashing and Faiss, to detect pirated images, and text similarity measures, including cosine and Jaccard similarity, to detect copied article content. According to Mr. Luu Xuan Viet, the selected algorithms speed up detection, can store a large amount of data and can handle varied image features.
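To make the two text measures concrete, here is a minimal sketch in Python. It assumes a simple bag-of-words representation and a hypothetical `tokenize` helper; the talk did not describe how VnExpress actually tokenizes or represents text, so this only illustrates the formulas themselves.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative helper)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity: shared distinct tokens divided by all distinct tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Cosine similarity takes word frequency into account, while Jaccard similarity only looks at which distinct words are shared, so the two can disagree on texts that repeat a few words many times.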
Specifically, the input text is split into paragraphs and stripped of all special characters. The system then automatically encodes and “embeds” the data, and compares the number of plagiarized sentences with the total number of sentences in the article. Based on the resulting score and a set of filter criteria, the system assesses the level of plagiarism and takes action on the violation.
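The steps above can be sketched roughly as follows. This is not the FPT Online implementation: it swaps the embedding step for a plain token-overlap comparison, counts sentences of the original article, and every function name and threshold (`sentence_threshold`, the 0.3 and 0.7 cut-offs) is a hypothetical value chosen for illustration.

```python
import re


def split_sentences(text):
    """Split text into paragraphs and sentences, dropping special characters."""
    sentences = []
    for paragraph in text.splitlines():
        for raw in re.split(r"[.!?]+", paragraph):
            cleaned = re.sub(r"[^\w\s]", "", raw).lower()
            cleaned = " ".join(cleaned.split())
            if cleaned:
                sentences.append(cleaned)
    return sentences


def token_overlap(a, b):
    """Jaccard overlap of the word sets of two non-empty sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def plagiarism_ratio(original, suspect, sentence_threshold=0.8):
    """Share of the original's sentences that reappear in the suspect text."""
    originals = split_sentences(original)
    suspects = split_sentences(suspect)
    if not originals:
        return 0.0
    copied = sum(
        1
        for o in originals
        if any(token_overlap(o, s) >= sentence_threshold for s in suspects)
    )
    return copied / len(originals)


def violation_level(ratio):
    """Map a ratio to a coarse level (thresholds are assumptions)."""
    if ratio >= 0.7:
        return "high"
    if ratio >= 0.3:
        return "medium"
    return "low"


original = "The cat sat on the mat. Dogs bark loudly! Birds fly south."
suspect = "Birds fly south. The cat sat on the mat. Something new here."
ratio = plagiarism_ratio(original, suspect)
level = violation_level(ratio)
```

Because each sentence is matched independently of its position, a copy that reorders sentences still counts toward the ratio.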
As soon as the violations are identified, the system will automatically send an alert via Telegram to relevant parties for further monitoring and handling steps.
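An alert of this kind could be sent through the Telegram Bot API's `sendMessage` method. The sketch below is an assumption about how such a hook might look, not a description of the VnExpress system: the function names, message wording and parameters are hypothetical, and the bot token and chat ID would have to come from a bot the team has registered.

```python
import json
import urllib.request


def build_alert_request(bot_token, chat_id, article_url, infringing_url, ratio):
    """Build (but do not send) a Telegram sendMessage request for a violation."""
    text = (
        "Possible copyright violation\n"
        f"Original: {article_url}\n"
        f"Suspected copy: {infringing_url}\n"
        f"Copied sentences: {ratio:.0%}"
    )
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request):
    """Send the request and return Telegram's decoded JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the message format easy to test without network access.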
In February 2020, after 6 months of implementation, the License Checking System was officially released. The initial outcomes are positive: the system can automatically scan 104 sites that often copy VnExpress’s content. The solution also includes a fully functional real-time monitoring system, a complete set of filtering criteria for violation levels, and on-request reporting. Notably, 100% of violation alerts are sent automatically through Telegram and displayed clearly.
Beyond its main piracy detection feature, the License Checking System is considered a promising project with far-reaching benefits. It can meet all requirements for traceability. In addition, the R&D team hopes to build a new business model around copyright detection and offer it as a service to other news agencies. In the near future, it is expected to generate 2 billion per year in revenue.
Here are the 6 main features of the License Checking System’s interface:
- Realtime Statistics: Gather and extract statistics in real time.
- Historical: Categorize infringing pages by date or month, depending on the user’s needs.
- Process: Analyze detection time and handle the violation for each article.
- Infringed Articles: Consolidate all violation information across sites.
- Setting: Add or remove monitored sites, or flag sites that are temporarily idle due to technical problems.
- Debug: Check the violation level of each article, or each text in the article.
Asked by the audience whether the solution can detect images whose watermarks have been removed, Viet said that the AI can do so. It can even detect cropped images, color-corrected images and images with other watermarks by comparing the number of common points and the pixel ratio. For copied text, the system can still recognize sentences or paragraphs that have been intentionally reordered, and issue warnings to the team.
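The color-correction case can be illustrated with an average hash, one of the simplest perceptual hashes. The talk did not say which hash the system uses, so this is only a sketch of the general idea, with images represented as small grayscale grids; a real pipeline would first downscale each image (commonly to 8x8 pixels). It does not cover cropping, which the speaker attributed to matching common points.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming_distance(h1, h2):
    """Count the positions at which two equal-length hashes differ."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))


# Tiny 2x2 grayscale "images" (values 0-255), invented for illustration.
original = [[10, 200], [220, 30]]
brightened = [[p + 20 for p in row] for row in original]  # uniform color shift
different = [[200, 10], [30, 220]]

same_image_distance = hamming_distance(average_hash(original), average_hash(brightened))
other_image_distance = hamming_distance(average_hash(original), average_hash(different))
```

A uniform brightness change shifts every pixel and the mean together, so the hash is unchanged, while an unrelated image yields a large Hamming distance.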
The seminar drew many in-depth questions about technology and techniques, specifically test cases, data flow, algorithms, infrastructure and system operations. The Q&A session also opened up a discussion about video and audio infringement and piracy on social media. Viet said that these are concerns for the FPT Online R&D team as well, and that the team will continue to research and upgrade the system.
The 66th Solution Forum reflected the Organizing Committee’s thorough investment in everything from content to logistics. The speakers are experienced experts in their fields, helping participants grasp real-life problems and explore several state-of-the-art technologies.
Thao My