A technical view of FVI: End-to-end Vietnamese ID card OCR


This article shares our experience with the research, architecture, and deployment process of an OCR system, in the hope of clearing the mist around building and deploying a complete Deep Learning project in general, and an OCR task in particular.

The author is currently working on an OCR project, FVI, with several Vision researcher and engineer colleagues. The job is to extract pieces of information from a Vietnamese ID card.

During the research phase, the author searched the internet and found some useful articles (e.g. from Dropbox, Zocdoc, Facebook) about how to build an OCR system. But none of them clearly explained how to bring these research models into a production environment. So the team had a hard time (roughly 6 months) struggling to build an accurate, production-ready, and scalable OCR system.

So this post shares what we learned along the way, covering research, architecture, and the deployment process in turn.


The system consists of 3 basic components: Cropper, Detector, and Reader. Each component has its own model and its own train/validation/test process. This makes it easy to swap in an improved component, which is an advantage over the black box of a single end-to-end model.
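The plug-and-play idea can be sketched as three interchangeable callables. This is a minimal illustration, not the team's code: the names `Pipeline`, `crop`, `detect`, and `read`, the type aliases, and the stub stages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]            # placeholder image type for this sketch
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)


@dataclass
class Pipeline:
    """Chains three independently trained stages (all names are hypothetical)."""
    crop: Callable[[Image], Image]                   # Cropper: raw photo -> rectified card
    detect: Callable[[Image], Dict[str, List[Box]]]  # Detector: card -> boxes per region class
    read: Callable[[Image, Box], str]                # Reader: one word box -> text

    def __call__(self, photo: Image) -> Dict[str, str]:
        card = self.crop(photo)
        regions = self.detect(card)
        # Join the words of each region class in the order the Detector returned them.
        return {
            region: " ".join(self.read(card, box) for box in boxes)
            for region, boxes in regions.items()
        }


# Stub stages stand in for the real models; any one of them can be replaced alone.
pipeline = Pipeline(
    crop=lambda photo: photo,
    detect=lambda card: {"id_number": [(0, 0, 10, 5)], "name": [(0, 6, 4, 9), (5, 6, 9, 9)]},
    read=lambda card, box: f"w{box[0]}",
)
result = pipeline([[0]])
```

Because each stage only has to honor its input and output types, a better Detector or Reader can be dropped in without retraining the others.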

1. Cropper

This component locates the 4 corners of the ID card and then crops its quadrangular shape. The purpose is to make the subsequent word detection task easier (e.g. by reducing noise and variance).

Since a common object detection model returns only 2 corners (a rectangular box), the team uses a little trick: treat each corner as an object with its own unique class, and then detect the 4 corners. The geometric transformation after locating the 4 corners is trivial.
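The corner trick can be sketched as follows. This is an illustration under stated assumptions: the class names in `CORNER_CLASSES` are hypothetical, and taking the center of each detected box as the corner point is a guess, since the article does not say how a box is reduced to a point.

```python
from typing import Dict, List, Optional, Tuple

# One detection: (class_name, score, (x_min, y_min, x_max, y_max)).
Detection = Tuple[str, float, Tuple[float, float, float, float]]

# Hypothetical class names, one per card corner, in clockwise order.
CORNER_CLASSES = ("top_left", "top_right", "bottom_right", "bottom_left")


def corners_from_detections(detections: List[Detection]) -> Optional[List[Tuple[float, float]]]:
    """Return the four corner points in clockwise order, or None if one is missing."""
    best: Dict[str, Detection] = {}
    for det in detections:
        name, score, _ = det
        if name in CORNER_CLASSES and (name not in best or score > best[name][1]):
            best[name] = det  # keep only the highest-scoring box per corner class
    if len(best) < len(CORNER_CLASSES):
        return None
    points = []
    for name in CORNER_CLASSES:
        x_min, y_min, x_max, y_max = best[name][2]
        # Assumption: the box center approximates the corner location.
        points.append(((x_min + x_max) / 2.0, (y_min + y_max) / 2.0))
    return points


detections = [
    ("top_left", 0.9, (10, 10, 30, 30)),
    ("top_left", 0.4, (200, 200, 220, 220)),   # weaker duplicate, ignored
    ("top_right", 0.8, (410, 12, 430, 32)),
    ("bottom_right", 0.7, (408, 260, 428, 280)),
    ("bottom_left", 0.9, (12, 262, 32, 282)),
]
corners = corners_from_detections(detections)
```

The four points could then be handed to a perspective-warp routine, for example OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective`, to produce the rectified card image.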

The detection model is the Single Shot Detector, SSD (SSD: Single Shot MultiBox Detector), with MobileNet v2 (MobileNetV2: Inverted Residuals and Linear Bottlenecks) as the feature extractor.

SSD gives us fast inference speed, while MobileNet v2 reduces the number of operations and the memory footprint but still preserves good accuracy.

Image courtesy of Matthijs Hollemans, source.

2. Detector

This component extracts the rectangular boxes that contain word tokens belonging to each class (e.g. ID number, name, address, date of birth). The boxes are sorted by their coordinates and then passed to the Reader.
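One plausible way to sort word boxes by coordinates is to group them into text rows and then order each row from left to right. The article does not describe the exact rule, so the function below, including its `row_tolerance` parameter, is an assumption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def sort_reading_order(boxes: List[Box], row_tolerance: float = 0.5) -> List[Box]:
    """Sort word boxes top-to-bottom, then left-to-right within each text row.

    A box joins the current row when its vertical center lies within
    row_tolerance * (its own height) of the center of the row's first box.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2.0):
        center_y = (box[1] + box[3]) / 2.0
        height = box[3] - box[1]
        if rows:
            first = rows[-1][0]
            row_center = (first[1] + first[3]) / 2.0
            if abs(center_y - row_center) <= row_tolerance * height:
                rows[-1].append(box)
                continue
        rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


# Two rows of two words each, given out of order.
words = [(120, 12, 180, 30), (10, 10, 100, 28), (10, 50, 90, 70), (100, 52, 160, 71)]
ordered = sort_reading_order(words)
```

Grouping by row first keeps a slightly tilted line of text together, which a plain sort on the top coordinate would not guarantee.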

The team also uses SSD for this task, just like the Cropper, but with a different feature extractor: ResNet FPN (Deep Residual Learning for Image Recognition, Feature Pyramid Networks for Object Detection).

Image in the FPN paper, and courtesy of Jonathan Hui, source.

ResNet FPN gives us state-of-the-art accuracy and builds features at multiple scales of an image, so the model can deal with a variety of input situations.

3. Reader

Given the words and their order for each region class, the team runs batch inference on the Reader model to get the resulting strings.
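The batching step might look like the sketch below: all words from all region classes go to the Reader in one call, and the outputs are regrouped afterwards. The names `read_regions` and `read_batch` are hypothetical, and strings stand in for cropped word images.

```python
from typing import Callable, Dict, List

WordImage = str  # stand-in for a cropped word image in this sketch


def read_regions(
    regions: Dict[str, List[WordImage]],
    read_batch: Callable[[List[WordImage]], List[str]],
) -> Dict[str, str]:
    """Run one batched Reader call for all words, then rebuild one string per region."""
    flat: List[WordImage] = []
    owners: List[str] = []
    for region, words in regions.items():
        flat.extend(words)
        owners.extend([region] * len(words))

    texts = read_batch(flat) if flat else []

    grouped: Dict[str, List[str]] = {region: [] for region in regions}
    for region, text in zip(owners, texts):
        grouped[region].append(text)
    return {region: " ".join(parts) for region, parts in grouped.items()}


def fake_reader(batch: List[WordImage]) -> List[str]:
    """Stub Reader that upper-cases its input, used only to exercise the plumbing."""
    return [word.upper() for word in batch]


result = read_regions({"name": ["nguyen", "van", "a"], "id_number": ["012345678"]}, fake_reader)
```

Sending a single batch keeps the number of model calls constant regardless of how many words the Detector finds.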

The model is a word-level reader based on Google's Attention OCR architecture (Attention-based Extraction of Structured Information from Street View Imagery), with a few small tweaks.

Image in Attention OCR paper.

First, the team uses 1 view instead of 4 views for each word image, because the detected text is mostly upright after the Cropper phase. Feature extraction uses the same Inception-ResNet v2 network (Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning) cut at Mixed_6a, followed by an LSTM layer and then an attention decoder.

For gradient descent optimization, the team uses Adadelta with an initial learning rate of 1.0 rather than the stochastic gradient descent (SGD) with momentum described in the paper. The main reason is that Adadelta adapts the learning rate per parameter, so there is no need to tune the learning rate during training.
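To see what "adapts the learning rate" means, here is the Adadelta update rule written out for a single parameter. It is a plain-Python sketch, not the training code: the learning rate of 1.0 comes from the text, while `rho=0.95`, `eps=1e-6`, and the toy objective are assumptions chosen for illustration.

```python
import math


def adadelta_minimize(grad, x0, steps=200, lr=1.0, rho=0.95, eps=1e-6):
    """Minimize a 1-D function with Adadelta, given its gradient function."""
    x = x0
    avg_sq_grad = 0.0    # running average of squared gradients
    avg_sq_delta = 0.0   # running average of squared updates
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The effective step size is the ratio of the two running RMS values,
        # so it is derived from the parameter's own history rather than tuned by hand.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
        x += lr * delta
    return x


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 5.
x_final = adadelta_minimize(grad=lambda x: 2.0 * x, x0=5.0)
```

With a real model, each weight keeps its own pair of running averages, which is why a single global learning rate of 1.0 can be left untouched.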

This Reader model achieves more than 99% character accuracy, with word accuracy typically 1-2% lower. The combined end-to-end system (Cropper -> Detector -> Reader) achieves approximately 90% accuracy for each region class.
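Word accuracy is lower than character accuracy because a single wrong character makes the whole word count as wrong. The sketch below shows one common way to compute both metrics. The article does not define them, so the edit-distance-based character accuracy and the exact-match word accuracy used here are assumptions, and the sample pairs are made up.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def accuracies(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (character accuracy, word accuracy) over (truth, prediction) pairs."""
    total_chars = sum(len(truth) for truth, _ in pairs)
    char_errors = sum(edit_distance(truth, pred) for truth, pred in pairs)
    exact = sum(truth == pred for truth, pred in pairs)
    return 1.0 - char_errors / total_chars, exact / len(pairs)


# Made-up sample: one word has a single confused character ("I" read as "l").
pairs = [("NGUYEN", "NGUYEN"), ("VAN", "VAN"), ("HANOI", "HANOl"), ("1990", "1990")]
char_acc, word_acc = accuracies(pairs)
```

In this toy sample one wrong character out of 18 costs one word out of 4, which illustrates how the two numbers diverge.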

In theory, the team could do better with strategies such as synthetic data, curriculum training, and more. But the results were considered good enough for a first public version, so the team decided to put it into production and then update it based on client feedback.


Here is a diagram of our architecture as used in a real-life scenario.

The problem is how to bring these components into the production environment.

The naive way is to pre-load the trained checkpoints and write some additional query functions to run inference on the model. But this is not optimal: it consumes a lot of memory, has high query latency, and carries a high risk of CPU spikes.

A better way is to freeze the model first (producing a Frozen Graph; see a tutorial like this) to address the problems mentioned above.

But the team wants a more mature way of serving the trained models: low-latency, high-throughput requests, zero-downtime model updates, batched inference requests, and a consistent inference API. Luckily, Google has a solution for us: TensorFlow Serving.

TensorFlow Serving

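TensorFlow Serving uses the SavedModel format with a version tag: each version of a model lives in its own numbered subdirectory. For example, our setup has 1 version of the Cropper, 1 version of the Detector, and 2 versions of the Reader.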
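The versioned layout can be illustrated with a small script that builds an empty directory tree matching the example and lists the versions it finds. The model directory names (`cropper`, `detector`, `reader`) and the helper `list_model_versions` are hypothetical. In a real export, each version folder would hold the `saved_model.pb` file and a `variables/` directory.

```python
import os
import tempfile
from typing import Dict, List


def list_model_versions(models_root: str) -> Dict[str, List[int]]:
    """Map each model directory to the sorted numeric version folders inside it."""
    versions: Dict[str, List[int]] = {}
    for model in sorted(os.listdir(models_root)):
        model_dir = os.path.join(models_root, model)
        if not os.path.isdir(model_dir):
            continue
        versions[model] = sorted(
            int(name) for name in os.listdir(model_dir)
            if name.isdigit() and os.path.isdir(os.path.join(model_dir, name))
        )
    return versions


# Build an empty layout matching the example: cropper/1, detector/1, reader/1, reader/2.
root = tempfile.mkdtemp()
for model, version in [("cropper", 1), ("detector", 1), ("reader", 1), ("reader", 2)]:
    os.makedirs(os.path.join(root, model, str(version)))
layout = list_model_versions(root)
```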

You can load all of these models into TensorFlow Serving at once (yes, it serves multiple models) by listing their base paths in a model config file (for example, serving.config) and booting TensorFlow Serving with the flag that points to that file.
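The original config listing is not reproduced here. As a hedged reconstruction, the script below renders a `model_config_list` in protobuf text format and assembles a matching server command. The model names, the `/models/...` paths, and port 8500 are assumptions. `--model_config_file` and `--port` are flags of the `tensorflow_model_server` binary.

```python
from typing import Dict, List


def render_model_config(models: Dict[str, str]) -> str:
    """Render a model_config_list in protobuf text format.

    `models` maps a model name to its base path inside the serving container.
    """
    lines: List[str] = ["model_config_list {"]
    for name, base_path in models.items():
        lines += [
            "  config {",
            f'    name: "{name}"',
            f'    base_path: "{base_path}"',
            '    model_platform: "tensorflow"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


# Assumed model names and container paths, for illustration only.
config_text = render_model_config({
    "cropper": "/models/cropper",
    "detector": "/models/detector",
    "reader": "/models/reader",
})
# Command line that points the server at the rendered file (path is an assumption).
command = ["tensorflow_model_server", "--port=8500", "--model_config_file=/models/serving.config"]
```

Note that TensorFlow Serving serves only the latest version of each model by default. Keeping both Reader versions live at once would require an additional `model_version_policy` entry in that model's config.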

The troublesome part is that you first have to export the model, with its trained weights, to the SavedModel format.

A sample export script is here. The trick is to make use of the TensorFlow Saver.

Specifically, SavedModel wraps a TensorFlow Saver. The Saver is primarily used to generate the variable checkpoints. source

The second important thing is to know which input and output nodes your model has.
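An export along these lines could be sketched with TensorFlow's TF1-compatibility API. This is not the author's script: the function names, the `"inputs"`/`"outputs"` signature keys, and the tensor names passed in are assumptions, and the real node names depend on how the model graph was built.

```python
import os


def versioned_export_dir(base_path: str, version: int) -> str:
    """TensorFlow Serving expects each version in a numeric subdirectory."""
    return os.path.join(base_path, str(version))


def export_saved_model(checkpoint_prefix, base_path, version, input_name, output_name):
    """Restore a TF1-style checkpoint with a Saver and write it out as a SavedModel."""
    import tensorflow as tf  # imported lazily so this sketch loads without TensorFlow

    tf1 = tf.compat.v1
    export_dir = versioned_export_dir(base_path, version)
    with tf1.Session(graph=tf.Graph()) as sess:
        # Rebuild the graph from the .meta file, then load the trained weights.
        saver = tf1.train.import_meta_graph(checkpoint_prefix + ".meta")
        saver.restore(sess, checkpoint_prefix)
        graph = sess.graph
        # The tensor names must match the model's real input and output nodes.
        inputs = {"inputs": graph.get_tensor_by_name(input_name)}
        outputs = {"outputs": graph.get_tensor_by_name(output_name)}
        tf1.saved_model.simple_save(sess, export_dir, inputs=inputs, outputs=outputs)
    return export_dir
```

A call might look like `export_saved_model("ckpt/model.ckpt-10000", "/models/reader", 2, "images:0", "predictions:0")`, where every argument shown is a placeholder. Getting the two tensor names right is exactly the "know your input and output nodes" step.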

Backend service

Here is the full picture of our backend service.


There are 2 small services (components) we mainly take care of: api and serving.

serving is TensorFlow Serving, as mentioned above, and api (which provides a RESTful API for clients) connects to serving through gRPC. Luckily, we don't have to deal with much of the Protocol Buffers logic ourselves and can just use the tensorflow-serving-api library.
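A call from api to serving might look like the following sketch, built on the `tensorflow-serving-api` and `grpcio` packages. The host name `serving`, port 8500, the `serving_default` signature name, and the function name `predict` are assumptions. The model name and input key must match what was exported.

```python
def predict(batch, model_name, input_key, target="serving:8500", timeout=10.0):
    """Send one batch to TensorFlow Serving over gRPC and return its outputs as arrays."""
    # Imported lazily so this sketch loads without the client libraries installed.
    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    with grpc.insecure_channel(target) as channel:
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
        request = predict_pb2.PredictRequest()
        request.model_spec.name = model_name
        request.model_spec.signature_name = "serving_default"
        # Convert the batch (e.g. a NumPy array) into a TensorProto.
        request.inputs[input_key].CopyFrom(tf.make_tensor_proto(batch))
        response = stub.Predict(request, timeout=timeout)
    # Convert every output TensorProto back into a NumPy array.
    return {key: tf.make_ndarray(value) for key, value in response.outputs.items()}
```

The same function works for all three models, since only the model name and input key change. This is the consistent inference API that TensorFlow Serving provides.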


Deployment is an important part, yet it usually remains a mystery in Deep Learning articles. The TensorFlow Serving guide demonstrates deployment on Kubernetes with built container images. But we found that over-complicated and decided to use Docker Compose to simply run two pre-built images, api and serving.

We have two types of deployment machines: CPU-only machines and machines with GPUs.

The CPU-only machine uses the common docker-compose.yml, with Dockerfile for api and Dockerfile.serving for serving, which is based on the tensorflow/serving:latest-devel image.


The GPU machine, on the other hand, uses a custom nvidia-docker-compose.yml (which requires nvidia-docker), the same Dockerfile for api, and Dockerfile.serving.gpu for serving (which is based on tensorflow:latest-devel-gpu).


One important note for running Docker Compose with GPUs: inside nvidia-docker-compose.yml you have to specify runtime: nvidia. A troublesome detail is that Docker Compose uses all GPUs by default, so you have to set the NVIDIA_VISIBLE_DEVICES variable to the GPUs you want to dedicate. A trick the author uses to stay consistent between TensorFlow (CUDA_VISIBLE_DEVICES) and Docker Compose is to derive NVIDIA_VISIBLE_DEVICES from CUDA_VISIBLE_DEVICES.

That way, if you set CUDA_VISIBLE_DEVICES, only the GPUs you list are used. Otherwise, all GPUs are used by default.
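The fallback behavior described above can be expressed as a one-line rule. The function name is hypothetical and the sketch only mirrors the logic. It is not the author's Compose file.

```python
from typing import Mapping


def nvidia_visible_devices(env: Mapping[str, str]) -> str:
    """Mirror CUDA_VISIBLE_DEVICES, falling back to "all" when it is unset or empty."""
    return env.get("CUDA_VISIBLE_DEVICES") or "all"


selected = nvidia_visible_devices({"CUDA_VISIBLE_DEVICES": "0,1"})
default = nvidia_visible_devices({})
```

In a Compose file the same fallback could be written with variable substitution, for example `NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-all}` in the service's environment section. Whether the author's file used exactly this form is an assumption.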


This article shared our research, architecture, and deployment of a complete OCR system. It uses a state-of-the-art deep learning OCR model (Attention OCR), scales with TensorFlow Serving, and is ready for production deployment with the help of Docker Compose.

By using TensorFlow we get an entire ecosystem backed by Google; a typical benefit is TensorFlow Serving (which belongs to TFX). There is also huge support from the community for model implementations.

Hiep Pham – FPT Head Office
