Apache Airflow is, according to its website, a platform created by the community to programmatically author, schedule and monitor workflows. Airflow is deployed as several components, two of which are the Scheduler and the Webserver.
DAG Serialization allows the Webserver to read DAGs entirely from the database instead of having to parse DAG files. This makes it possible to deploy the Webserver onto a serverless platform like Cloud Run, where it does not need to be running when no one is accessing it. However, a few adjustments are needed to make that work.
Build a custom Airflow container image
Airflow generates a webserver_config.py config file by default, which is used to configure features like authentication or mail sending. However, on a serverless platform like Cloud Run which uses ephemeral containers, the base Airflow container image will regenerate this file every time a new container starts.
In order to customise webserver_config.py, it should be built directly into the container image. Below is a sample Dockerfile which installs Authlib, which is needed for the AUTH_OAUTH authentication method in Airflow, then copies in a preconfigured webserver_config.py. This image can then be deployed to Cloud Run and will use the predefined config.
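# Start from the official Airflow image
FROM apache/airflow:2.1.1-python3.8
# Authlib is required by the AUTH_OAUTH authentication method
RUN pip install --no-cache-dir Authlib
# Copy the preconfigured file into the image's working directory
COPY webserver_config.py .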
Disable entrypoint database checks
Cloud Run connects to Cloud SQL via Cloud SQL Proxy, which mounts a Unix socket into the container for the application to connect to. SQLAlchemy, which is used by Airflow, supports this. However, the official Airflow container image also tries to verify the database connection in its entrypoint script, and it assumes that the database connection is always over TCP, so this check will always fail. We can bypass this check by setting the CONNECTION_CHECK_MAX_COUNT environment variable to 0.
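As a rough illustration of what a socket-based connection string can look like, the sketch below builds a SQLAlchemy URI that points at the Cloud SQL Unix socket. It assumes a PostgreSQL database with the psycopg2 driver and the /cloudsql/<instance connection name> socket directory; the function name, credentials, and instance name are hypothetical placeholders, not values from this setup.

```python
from urllib.parse import quote


def cloud_sql_socket_uri(user, password, database, instance_connection_name):
    """Build a SQLAlchemy URI that connects through a Unix socket.

    Assumes PostgreSQL with psycopg2: the host part of the URI is left
    empty and the socket directory is passed as the `host` query parameter.
    """
    socket_dir = f"/cloudsql/{instance_connection_name}"
    # Percent-encode the credentials so characters like '@' or '/' are safe.
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@/{database}?host={socket_dir}"
    )


# Hypothetical values, for illustration only.
example_uri = cloud_sql_socket_uri(
    "airflow", "p@ss/word", "airflow", "my-project:europe-west1:airflow-db"
)
```

Because there is no TCP host or port in such a URI, a TCP-based connectivity probe has nothing to check, which is why the entrypoint check is disabled with CONNECTION_CHECK_MAX_COUNT=0.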
Configure Airflow
Airflow is configured with an airflow.cfg config file by default. However, every config option can also be set using environment variables, which is more convenient on Cloud Run. The full configuration reference and corresponding environment variables can be found on the Configuration Reference page.
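Airflow's environment variables follow the pattern AIRFLOW__{SECTION}__{KEY}, with double underscores around the section name. The helper below is a small sketch of that mapping; the function name is hypothetical, and the option names used in the examples should be checked against the Configuration Reference for the Airflow version in use.

```python
def airflow_env_name(section, key):
    """Return the environment variable name for an airflow.cfg option.

    Follows the AIRFLOW__{SECTION}__{KEY} convention, upper-cased,
    with double underscores separating the three parts.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Examples using options that exist in Airflow 2.1.
db_conn_var = airflow_env_name("core", "sql_alchemy_conn")
base_url_var = airflow_env_name("webserver", "base_url")
```

Here db_conn_var is "AIRFLOW__CORE__SQL_ALCHEMY_CONN" and base_url_var is "AIRFLOW__WEBSERVER__BASE_URL"; both can be set as environment variables on the Cloud Run service instead of editing airflow.cfg.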
The webserver_config.py file can also be configured to read environment variables, for example to set OAuth credentials.
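A minimal sketch of reading OAuth credentials from the environment might look like the following. It assumes Google as the OAuth provider and follows the OAUTH_PROVIDERS layout used by Flask-AppBuilder; the environment variable names GOOGLE_OAUTH_CLIENT_ID and GOOGLE_OAUTH_CLIENT_SECRET and the helper function are hypothetical choices, not something Airflow defines.

```python
import os


def build_oauth_providers(environ):
    """Build a Flask-AppBuilder style OAUTH_PROVIDERS list from a mapping.

    Raises KeyError if a credential is missing, so a misconfigured
    container fails at startup rather than at first login.
    """
    return [
        {
            "name": "google",
            "icon": "fa-google",
            "token_key": "access_token",
            "remote_app": {
                # Hypothetical variable names; set them on the Cloud Run service.
                "client_id": environ["GOOGLE_OAUTH_CLIENT_ID"],
                "client_secret": environ["GOOGLE_OAUTH_CLIENT_SECRET"],
                "api_base_url": "https://www.googleapis.com/oauth2/v2/",
                "client_kwargs": {"scope": "email profile"},
                "request_token_url": None,
                "access_token_url": "https://accounts.google.com/o/oauth2/token",
                "authorize_url": "https://accounts.google.com/o/oauth2/auth",
            },
        }
    ]


# In webserver_config.py this would be wired up roughly as:
#   from flask_appbuilder.security.manager import AUTH_OAUTH
#   AUTH_TYPE = AUTH_OAUTH
#   OAUTH_PROVIDERS = build_oauth_providers(os.environ)
```

Keeping the secrets in environment variables means the same container image can be reused across environments, with only the Cloud Run service configuration changing.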