In a previous post, I introduced our Vision team's product with a simple deployment flow and some of the models we use. Most of the time we train the models manually: we process data, train, evaluate, test and export, all from the command line.

These tasks quickly become repetitive and tedious as the variety of data and the number of experiments grow. They also cost time and human attention: someone has to monitor them and watch for bugs and silly mistakes. Writing Bash scripts, Makefiles or Rakefiles only partially solves the problem, and managing the processes and package dependencies remains a nightmare. Moving an existing pipeline to a new server costs a lot of time in initial setup and debugging.

Automating the process therefore became a high priority for us. We needed a tool that supports ETL-style pipelines and automates loading data, training and exporting models. Kubeflow Pipelines is intriguing since it treats Docker containers as the unit tasks to be automated, but it is heavy, over-complicated and built on top of Kubernetes, which does not suit our team size. After some research and experiments we settled on a more lightweight solution: Airflow.

In this post, we explain how we adopted Airflow and share our practices with it.

Structure

Here is the simple flow we are currently using:

Once we have an acceptable manual process, we wrap everything into Docker containers and then write instruction code for Airflow. Airflow generates a pipeline (preprocess data, train a model, test and export the model) that is ready for us to trigger. /data and /export are sample mount directories we use to store data and models.

Setting up Airflow is considered easy, but it is still time consuming given that we want a Postgres database for storing task metadata, Docker integration, etc. So we use docker-compose to boot up Airflow with one simple command: docker-compose up airflow

Here we use a custom Airflow Docker image: hiepph/docker-airflow:latest.

Airflow does not support running Docker containers with GPUs by default, so we wrote a plugin for it: DockerGPUOperator. It is a modified version of DockerOperator with GPU capability and device selection.
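The plugin itself is not reproduced here; the following is only a minimal sketch of the idea, assuming a recent apache-airflow-providers-docker and Docker SDK for Python that expose device_requests (on older Airflow versions the GPU request has to be injected into the container's host config instead):

# plugins/docker_gpu_operator.py -- minimal sketch, not the actual plugin code
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import DeviceRequest


class DockerGPUOperator(DockerOperator):
    """A DockerOperator that asks the Docker daemon for GPU access."""

    def __init__(self, gpu_device_ids=None, **kwargs):
        # Request the listed GPUs (e.g. ["0", "1"]) or, by default, all of them.
        if gpu_device_ids:
            request = DeviceRequest(device_ids=gpu_device_ids, capabilities=[["gpu"]])
        else:
            request = DeviceRequest(count=-1, capabilities=[["gpu"]])
        device_requests = kwargs.pop("device_requests", None) or []
        device_requests.append(request)
        super().__init__(device_requests=device_requests, **kwargs)

In a DAG, a task that needs a GPU then simply uses DockerGPUOperator instead of DockerOperator.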

There are many practices out there, so we defined our own standard for our engineers and scientists to follow. A sample codebase:
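One possible layout, with names that are only illustrative:

.
├── docker-compose.yml   # boots up Airflow, Postgres, etc.
├── dags/                # pipeline definitions, one Python file per pipeline
├── plugins/             # custom operators such as DockerGPUOperator
├── data/                # mounted into task containers as /data
└── export/              # mounted into task containers as /export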

plugins is where we store our plugins (i.e. DockerGPUOperator as mentioned above). It is also easy to interact with data and models by mounting the data and export directories into the task containers.

Conceptually, Airflow executes tasks based on DAG instructions. A DAG is a Python file that contains our instructions for executing tasks as a pipeline. DAG files are stored in the dags directory and are automatically reloaded by Airflow whenever they change. A DAG supports many operators such as BashOperator, PythonOperator, etc., but we decided to restrict our standard to DockerOperator and DockerGPUOperator. This avoids decision complexity and dependency hell between the many existing tasks. Writing tasks in DAG format is now simple for us, since we have the full power of Python syntax and libraries. The problem shifts from trying to automate and manage pipelines by hand to writing a dedicated instruction script.

As an example of DAG usage, suppose we have the following pipeline:

After preprocessing the data, we train the model, and then we test and export it. We define the tasks roughly as follows:
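This is a sketch rather than the exact code we run; the DAG id and image names are illustrative, and it assumes the DockerGPUOperator plugin sketched above plus a recent Docker provider.

# dags/example_pipeline.py -- illustrative sketch
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# plugins/ is on Airflow's Python path, so the plugin module can be imported directly
from docker_gpu_operator import DockerGPUOperator

with DAG(
    "example_pipeline",              # illustrative DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,          # triggered manually
    catchup=False,
) as dag:
    # CPU-only task: runs the preprocessing container
    preprocess = DockerOperator(task_id="preprocess", image="team/preprocess:latest")

    # GPU tasks: the plugin asks the Docker daemon for GPU access
    train = DockerGPUOperator(task_id="train", image="team/train:latest")
    test = DockerGPUOperator(task_id="test", image="team/test:latest")

    # CPU-only task: packages and exports the trained model
    export = DockerOperator(task_id="export", image="team/export:latest")

    # /data and /export mounts are omitted here for brevity; see the demo DAG below.
    preprocess >> train >> [test, export]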

As you can see, it is easy to mark which tasks use the GPU and which do not.

Demo

A demo is worth a thousand words. Let's look at our demo of using Airflow to automate the pipeline for training a simple mask detector!

The goal is to detect all faces in an image and identify which people are wearing masks. The model we use is EfficientDet, and the dataset is public data from AIZOOTech.

For the sake of simplicity, the data is already preprocessed and we use the following simple pipeline: the train task loads the data from the data directory, then we test and export the model. The test task writes a test image with the detection results into the results directory, and the exported model is stored in export. Our codebase follows the same layout as before, with a results directory alongside data and export.

Here is our DAG:
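What follows is a sketch under the same assumptions as before; the image names and host paths are illustrative, and bind mounts are passed as docker.types.Mount objects via the mounts argument of recent Docker provider versions.

# dags/mask_detector.py -- illustrative sketch of the demo DAG
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

from docker_gpu_operator import DockerGPUOperator

# Host directories mounted into every task container (host paths are placeholders).
MOUNTS = [
    Mount(source="/path/on/host/data", target="/data", type="bind"),
    Mount(source="/path/on/host/results", target="/results", type="bind"),
    Mount(source="/path/on/host/export", target="/export", type="bind"),
]

with DAG(
    "mask_detector",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Trains EfficientDet on the preprocessed data in /data (needs a GPU).
    train = DockerGPUOperator(
        task_id="train",
        image="team/mask-detector-train:latest",
        mounts=MOUNTS,
    )

    # Writes a test image with the detected faces and masks into /results.
    test = DockerGPUOperator(
        task_id="test",
        image="team/mask-detector-test:latest",
        mounts=MOUNTS,
    )

    # Saves the servable model into /export (no GPU needed).
    export = DockerOperator(
        task_id="export",
        image="team/mask-detector-export:latest",
        mounts=MOUNTS,
    )

    train >> [test, export]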

Now we can go to the admin console (ip:8080) and trigger the DAG:

In Graph View we can easily see our pipeline as a graph:

There is much more information you can check for each task. Here we focus only on Tree View, which shows the status of our experiments and pipeline runs.

It looks like our pipeline ran successfully without any errors. On the left you can see a previous run that did not succeed. Let's check the results directory:


We also have an exported model, ready for use with Serving!

That wraps up our simple demo of our practices with Airflow. Of course, Airflow supports much more than what we demoed here; with Docker images and DAG instructions, the creativity is yours.

Overall, Airflow has greatly reduced the time and human resources we spend by automating our pipelines. We plan to use Airflow on our platform to automate training processes for clients by interacting with its experimental API. More details will come in a future post.

Pham Hoang Hiep – FPT Head Office
