Deploying DAGs Using Git Diff
Deployment patterns for Airflow DAGs in multi-user Composer environments
Thanks to Casey Zakroff for the idea and Koushik Ghosh for the development support.
Overview
Continuous Integration and Deployment (CICD) is a key part of a mature system: new features are tested and integrated so that each release minimises the impact of changes on a user-facing, continuously running application.
Airflow is an orchestration system that coordinates actions between multiple systems to execute sets of tasks, specified in a DAG. GCP’s managed service for Airflow is Cloud Composer, where the DAGs for an environment are stored in a single GCS bucket.
Whilst most software is compiled and released as packages, Airflow DAGs are static files that are parsed at runtime (and can also be generated at runtime), so the effect of a change to each file is hard to understand until it is tested in a runtime environment. On GCP, there are two key patterns for deploying DAGs, described in Using Cloud Build for CICD and Testing and Deploying DAGs.
One limitation of both these patterns is that they sync the entire DAGs folder to the bucket, overwriting all DAGs within the bucket. This can cause problems in a couple of ways:
- Multiple new DAGs are being developed at the same time but deployed by different developers with different versions of the source code
- Multiple teams or team members are deploying DAGs through different pipelines into a test/dev environment, but from the same monorepo
Both are variants of the question “How do I develop a DAG without breaking the wider system?”, for which a newer alternative is Local Development.
Solution
For DAGs that live in a single git monorepo and are deployed to a single bucket, we can use git to determine which files have changed, and hence which ones to copy to the bucket. This can be run through Cloud Build.
Setup
Clone the repo and navigate to the appropriate directory:
git clone https://github.com/alexshires/google-cloud-demos
cd google-cloud-demos/git-diff-deploy
If you don’t have a Composer environment to use, you can create one using the create-composer.sh script, or follow the guides here. The key things to do are:
- Set up a custom service account and give it the right roles
- Check that the private configuration is correct
You will also need to give the Cloud Build default service account permission to:
- Copy DAGs to GCS (Storage Object Creator)
- Describe the Composer environment (Composer User)
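As a minimal sketch, the two grants above could be scripted with gcloud. The function name and its arguments are hypothetical, and the service account address assumes the legacy Cloud Build default account (PROJECT_NUMBER@cloudbuild.gserviceaccount.com); newer projects may run builds as a different account, so check which one yours uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: grant the two roles named above to the Cloud Build
# default service account. Assumes the legacy account address format.
grant_cloud_build_roles() {
  local project_id="$1"
  local project_number="$2"
  local member="serviceAccount:${project_number}@cloudbuild.gserviceaccount.com"
  local role
  for role in roles/storage.objectCreator roles/composer.user; do
    gcloud projects add-iam-policy-binding "$project_id" \
      --member="$member" \
      --role="$role"
  done
}
```

It would be called as grant_cloud_build_roles my-project 123456789012. Note that overwriting an object that already exists in the bucket also needs delete permission, which Storage Object Creator does not include, so a broader role such as Storage Object Admin may be required when updating existing DAGs.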
There are two examples of the Airflow tutorial DAG in the “dags” folder, so that we can demonstrate the differential change.
The key to the delta is in copy-diff.sh:
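# List the files changed on the branch since it diverged from main, keeping only paths that contain "dags"
git diff "main...${BRANCH_NAME}" --name-only | grep "dags" > dagfilelist.txt
cat dagfilelist.txt
# Copy each changed file to the Composer bucket
while read -r p; do
  echo "file name: $p"
  gsutil cp "$p" "$BUCKET_NAME/"
done < dagfilelist.txt
cat dagfilelist.txt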
The cloud-build.yaml defines a single step that runs copy-diff.sh:
steps:
  # run the script that copies only the changed DAG files
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: bash
    args: ["./git-diff-deploy/copy-diff.sh"]
    env:
      - "COMPOSER_NAME=${_COMPOSER_NAME}"
      - "GCP_REGION=${_GCP_REGION}"
substitutions:
  _COMPOSER_NAME: example-environment
  _GCP_REGION: us-central1
options:
  substitution_option: 'ALLOW_LOOSE'
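The COMPOSER_NAME and GCP_REGION variables are enough for a script to look up the environment's DAG folder. The following is a sketch of one way to do that, not necessarily how copy-diff.sh does it; the function name dag_folder_for is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the gs:// path of the environment's DAGs folder,
# as reported by the Composer API in config.dagGcsPrefix.
dag_folder_for() {
  local composer_name="$1"
  local region="$2"
  gcloud composer environments describe "$composer_name" \
    --location "$region" \
    --format="get(config.dagGcsPrefix)"
}

# Inside the build, the env entries from cloud-build.yaml would supply the arguments:
#   BUCKET_NAME="$(dag_folder_for "$COMPOSER_NAME" "$GCP_REGION")"
```

Resolving the path at build time means the bucket name does not have to be hard-coded per environment; only the two substitutions change.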
To upload the .git directory so that the build can run the git diff, we need to update the .gcloudignore:
#!include:.gitignore
.gitignore
.gcloudignore
The first line pulls in all the patterns excluded by the .gitignore, whereas the second and third lines ignore the ignore files themselves. Most importantly, because .git is not listed, the .git directory is uploaded, which allows us to do the branch comparison.
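One way to confirm the effect before submitting a build is to ask gcloud which files it would upload. This is a sketch that assumes the gcloud meta list-files-for-upload command is available in your SDK version and prints relative paths; the function name is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check: succeed only if files under .git/ would be uploaded
# from the given source directory (defaults to the current directory).
git_dir_is_uploaded() {
  local source_dir="${1:-.}"
  local files
  # Capture the listing first so grep exiting early cannot break the pipeline.
  files="$(gcloud meta list-files-for-upload "$source_dir")"
  grep -Eq '(^|/)\.git/' <<< "$files"
}
```

Running git_dir_is_uploaded . from the repository root and checking its exit status shows whether the branch comparison will have the history it needs.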
Step 1: change a DAG
Create a new branch, change a DAG and commit the change.
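A minimal sketch of this step, run in a throwaway repository so it can be tried safely. The branch name and DAG file names are hypothetical; in the demo repo only the checkout, edit and commit commands are needed. The last command previews what the comparison in copy-diff.sh would select.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so nothing here touches a real project.
workdir="$(mktemp -d)"
cd "$workdir"
git init -q .
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main" on any git version
git config user.email "demo@example.com"
git config user.name "Demo User"

# Two DAG files on main, standing in for the two tutorial DAGs.
mkdir dags
echo "# tutorial one" > dags/tutorial_one.py
echo "# tutorial two" > dags/tutorial_two.py
git add dags
git commit -q -m "Add two tutorial DAGs"

# Step 1: new branch, change one DAG, commit.
git checkout -q -b feature/update-dag
echo "# changed" >> dags/tutorial_two.py
git commit -q -am "Update tutorial_two"

# Preview: files changed on the branch since it diverged from main.
changed_files="$(git diff main...feature/update-dag --name-only | grep "dags" || true)"
echo "$changed_files"
```

Only the edited file is listed, so only that file would be copied to the bucket; the untouched DAG is left as it is.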
Step 2: Deploy the change
Run the Cloud Build submission script from the root of the repository, since the paths in the build configuration are relative to it:
cd ../
./submit-build.sh
Here, you should see a Cloud Build job upload the files, get the appropriate bucket, check for the change and then use gsutil to copy the DAG to the bucket.
Step 3: Check for the new DAG
In the Composer environment, check for the new DAG with your updated change.
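The check can also be made from the command line. This sketch lists the environment's DAGs folder with gcloud and looks for a file name; the function name and its arguments are hypothetical. It only confirms that the file reached the bucket, and the Airflow UI remains the place to confirm that the DAG was parsed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed if a file with the given name appears in the
# listing of the environment's DAGs folder.
dag_is_deployed() {
  local composer_name="$1"
  local region="$2"
  local dag_file="$3"
  local listing
  listing="$(gcloud composer environments storage dags list \
    --environment "$composer_name" \
    --location "$region")"
  grep -qF -- "$dag_file" <<< "$listing"
}
```

For example, dag_is_deployed example-environment us-central1 tutorial_two.py returns success once the copy has completed.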
Conclusion
To summarise, we have demonstrated a cloud build script for deploying only changed files to the Composer environment - allowing for multiple teams to work on the same Composer environment and DAGs directory whilst only updating the DAGs changed in git.
This example does not include unit testing or other ways of parameterising and partitioning DAGs for simultaneous development, and the process will need to be extended for progression through test, staging and production environments.
Code is available on GitHub at https://github.com/alexshires/google-cloud-demos