Recently I set up Apache Airflow in Docker containers. Apart from the container running the Apache Airflow backend database (PostgreSQL), there are two containers: one running the Apache Airflow Webserver and one running the Apache Airflow Scheduler.
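A minimal docker-compose sketch of such a setup might look like the following; the image names, versions, and settings here are assumptions for illustration, not the exact configuration used:

```yaml
version: "3"
services:
  postgres:
    image: postgres:13             # Airflow backend database
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  webserver:
    image: puckel/docker-airflow   # illustrative image choice
    command: webserver
    ports:
      - "8080:8080"
    depends_on:
      - postgres

  scheduler:
    image: puckel/docker-airflow
    command: scheduler
    depends_on:
      - postgres
```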
The first problem I encountered was the inability to view task logs through the Web UI (with Apache Airflow configured to use the LocalExecutor). There is a relevant issue in the puckel/docker-airflow GitHub repo. The solution was to attach a volume to each container so that the Airflow logs directory in every container points at the same place on the host machine.
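In docker-compose terms the fix is a bind mount added to both Airflow services; the container path below assumes the default `AIRFLOW_HOME` of `/usr/local/airflow` used by puckel/docker-airflow:

```yaml
  webserver:
    volumes:
      - ./logs:/usr/local/airflow/logs   # same host directory in both containers
  scheduler:
    volumes:
      - ./logs:/usr/local/airflow/logs
```

With both containers reading and writing `./logs` on the host, the Webserver can display the task logs written by the Scheduler's LocalExecutor workers.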
The second problem is cleaning up old logs. Apache Airflow does not have a built-in way of managing its logs, so we have to rely on other tools. Since this setup is intended as a local development environment, deleting files that are more than a few days old is sufficient. The simplest solution that comes to mind is to run a bash command periodically using cron. Running a background cron job inside the Docker container with the Apache Airflow Scheduler proved to be very difficult: there are articles about the problems with systemd in Docker, and people have developed replacement scripts for systemd in Docker.
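The cleanup itself can be a one-liner; the log path and the three-day retention window below are assumptions to adjust for your setup:

```sh
# Delete Airflow log files older than three days, then prune empty directories.
find /usr/local/airflow/logs -type f -mtime +3 -delete
find /usr/local/airflow/logs -mindepth 1 -type d -empty -delete
```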
In our case the solution can be as simple as creating another Docker container that runs cron in the foreground. Define cron jobs in a file (for example, `cronjobs`) and copy this file into the cron image at `/etc/crontabs/root`. In `CMD` we should specify running cron in the foreground: for Alpine Linux that is `["crond", "-f", "-d", "8"]`; on Manjaro (Arch) Linux, `crond -n -s`.
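A minimal sketch of such an image, assuming Alpine Linux and reusing the cleanup one-liner from above (the file name `cronjobs` and the schedule are illustrative):

```dockerfile
FROM alpine:3.18

# cronjobs holds the crontab entries, e.g. a daily cleanup at 03:00:
#   0 3 * * * find /usr/local/airflow/logs -type f -mtime +3 -delete
COPY cronjobs /etc/crontabs/root

# Run crond in the foreground (-f) at log level 8 (-d 8) so the container
# has a long-running foreground process and logs to stderr.
CMD ["crond", "-f", "-d", "8"]
```

The cron service then mounts the same logs volume as the Airflow containers, for example:

```yaml
  cron:
    build: ./cron                        # directory containing the Dockerfile above
    volumes:
      - ./logs:/usr/local/airflow/logs   # shared Airflow logs
```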
This StackOverflow thread provides more options and details.
Today we have learned that, when moving to containers, traditional background processes might become dedicated services in their own containers.