Recently, I worked on Apache Airflow for one of my projects. The first thing I needed to do was set up an Airflow cluster on Google Cloud Platform.
Note: GCP itself provides a fully managed Apache Airflow service called “Cloud Composer”.
For a single-node installation just for a demo, the instructions on the official documentation page should be enough. But when it came to setting up a cluster, even though I found this excellent post that gave me a solid understanding of the configuration, it was a bit outdated: some of the steps didn’t work for me, and others, such as RBAC, were simply missing.
So I decided to write down all the steps I used to set up my ready-to-use Airflow cluster.
Objective: By the end of this tutorial, we should have an Apache Airflow cluster running with secure web authentication enabled and the following components:
- Operating System: Ubuntu 20.04 LTS
- Python Version: Python 3.8
- Apache Airflow Version: 1.10.11 (latest stable as of August 2020)
- Backend DB: MySQL
- Executor: Celery
- Celery Backend: RabbitMQ
Note: For this tutorial, the current directory is the user’s home directory, e.g. /home/user or ~/
1. Setup Master Node/Single Node
1.1) Install System Level Requirements
# Update your system package index
sudo apt-get update
# Install required package development tools
sudo apt-get install libmysqlclient-dev python3 python3-dev build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev
# Install system level requirement for Airflow
sudo apt-get install -y --no-install-recommends freetds-bin krb5-user ldap-utils libsasl2-2 libsasl2-modules libssl1.1 locales lsb-release sasl2-bin sqlite3 unixodbc
1.2) Install & Configure MySQL for Airflow Backend
# Install MySQL to be used as the Airflow backend; it will store all the metadata, e.g. DAG-related info
sudo apt-get install mysql-server
# Open the MySQL config file, scroll down to bind-address and change its value to 0.0.0.0 so that the line reads bind-address = 0.0.0.0
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
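# Alternatively (a sketch assuming the default config path above), the same change can be made non-interactively with sed
sudo sed -i 's/^bind-address.*/bind-address = 0.0.0.0/' /etc/mysql/mysql.conf.d/mysqld.cnf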
# Restart the service for changes to take effect
sudo service mysql restart
# Create a dedicated database "airflowdb" in MySQL for airflow
sudo mysql -e "CREATE DATABASE airflowdb;"
# Create a new user "airflow" and allow this local user to connect to MySQL db
sudo mysql -e "CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'password'; GRANT ALL PRIVILEGES ON airflowdb.* TO 'airflow'@'localhost';"
1.3) Install & Configure RabbitMQ for Celery Backend
# Install RabbitMQ for the Celery backend and confirm its status
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management #Enable RabbitMQ Web Interface if needed
sudo service rabbitmq-server start
sudo rabbitmqctl status
# Configure RabbitMQ for airflow
sudo rabbitmqctl add_user airflow password
sudo rabbitmqctl set_user_tags airflow administrator
sudo rabbitmqctl add_vhost myvhost
sudo rabbitmqctl set_permissions -p myvhost airflow ".*" ".*" ".*"
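A quick way to confirm the broker setup before moving on (these are just verification commands, nothing here changes state):
# List RabbitMQ users and the permissions on the airflow vhost
sudo rabbitmqctl list_users
sudo rabbitmqctl list_permissions -p myvhost
# If you enabled the management plugin, its web UI listens on port 15672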
1.4) Install & Configure Apache Airflow in a Virtual Environment
# Install virtual environment
sudo apt-get install virtualenv
# Create and activate virtual environment
virtualenv -p python3 venv
source venv/bin/activate
# Download requirements.txt containing all the Python packages required by Airflow
wget https://gist.githubusercontent.com/skumarlabs/f9364c16668f17ec6ba7de08f63eac03/raw/4179a629921ff9c04a11bffefdb823fa92d90e01/airflow-python3-requirements.txt
# Pip install core Airflow and other required Airflow packages
pip install 'apache-airflow' --constraint airflow-python3-requirements.txt
pip install 'apache-airflow[mysql]'
pip install 'apache-airflow[celery]'
pip install 'apache-airflow[rabbitmq]'
pip install 'apache-airflow[crypto]'
pip install 'apache-airflow[password]'
# Create a new user "airflow" for the web UI, assign it the "Admin" role, and specify your email, first name, last name and password
airflow create_user -r Admin -u airflow -e your_email@domain.com -f Airflow -l Admin -p password
# Check the version and installation. This should also create a default config file at ~/airflow/airflow.cfg
airflow version
Note: Use a strong password for this web UI admin user, and wherever a password is required, rather than the placeholder "password" shown here.
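To double-check that the web UI user exists, the 1.10 CLI also ships a list_users subcommand (a quick verification step):
# List the web UI users created with create_user
airflow list_users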
1.5) Update Configuration File airflow.cfg
By default, Airflow saves connection passwords in plain text in the metadata database. To enable encryption for these passwords, generate a fernet_key by running the Python code below.
from cryptography.fernet import Fernet
fernet_key= Fernet.generate_key()
print(fernet_key.decode()) # your fernet_key, keep it in a secure place!
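If you prefer, the same key can be generated as a one-liner from inside the virtualenv (a convenience sketch; it relies on the cryptography package pulled in by apache-airflow[crypto]):
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"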
By this time, you should have an airflow.cfg file in your ~/airflow directory. Open it in your favorite text editor. We need to change a few variables here: most of them connect Airflow to MySQL, RabbitMQ, etc., and the last one enables the RBAC feature. One advantage of enabling RBAC is that you can change the time zone from the web UI.
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:password@localhost/airflowdb
broker_url = pyamqp://airflow:password@localhost:5672/myvhost
result_backend = db+mysql://airflow:password@localhost:3306/airflowdb
flower_basic_auth = user1:password1
rbac = True
fernet_key = YOUR_FERNET_KEY
Note: If you want to change the time zone of the scheduler or the UI, there are two more variables in the configuration file: search for “default_timezone” and “default_ui_timezone”. Both default to UTC, and it is recommended to keep them that way.
1.6) Start Airflow & UI Web Server
# Initialize airflow database
airflow initdb
# Start web server on port number 8080
airflow webserver -p 8080
# Start scheduler
airflow scheduler
# To start a worker
airflow worker
# To monitor workers
airflow flower
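The commands above run in the foreground, which means keeping four terminals open. A simple alternative is to push them into the background with nohup, as in the sketch below (it assumes the virtualenv is activated; the log file names are arbitrary). For production you would more likely wrap them in systemd units.
nohup airflow webserver -p 8080 > ~/airflow/webserver.log 2>&1 &
nohup airflow scheduler > ~/airflow/scheduler.log 2>&1 &
nohup airflow worker > ~/airflow/worker.log 2>&1 &
nohup airflow flower > ~/airflow/flower.log 2>&1 &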
If you followed everything up to here, you should be able to see the Airflow web interface at YOUR_IP_ADDRESS:8080. Use the username and password you configured in section 1.4 with the airflow create_user command to log in.
2. Setup Worker Node
2.1) Install System Level Requirements
# Update your system package index
sudo apt-get update
# Install required package development tools
sudo apt-get install libmysqlclient-dev python3 python3-dev build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev
# Install system level requirement for Airflow
sudo apt-get install -y --no-install-recommends freetds-bin krb5-user ldap-utils libsasl2-2 libsasl2-modules libssl1.1 locales lsb-release sasl2-bin sqlite3 unixodbc
2.2) Install & Configure Apache Airflow in a Virtual Environment
# Create and activate virtual environment
virtualenv -p python3 venv
source venv/bin/activate
# Download requirements.txt containing all the Python packages required by Airflow
wget https://gist.githubusercontent.com/skumarlabs/f9364c16668f17ec6ba7de08f63eac03/raw/4179a629921ff9c04a11bffefdb823fa92d90e01/airflow-python3-requirements.txt
# Pip install core Airflow and other required Airflow packages
pip install 'apache-airflow' --constraint airflow-python3-requirements.txt
pip install 'apache-airflow[mysql]'
pip install 'apache-airflow[celery]'
pip install 'apache-airflow[crypto]'
pip install 'apache-airflow[password]'
airflow version
2.3) Configure airflow.cfg on worker node
By this time, you should have an airflow.cfg file in your ~/airflow directory on the worker node as well. Open it in your favorite text editor. We again need to change a few variables: to connect to MySQL and RabbitMQ running on the master node, to secure the Flower web interface, and to enable the RBAC feature. Use the same fernet_key you generated on the master node so that every node can decrypt the stored connection passwords.
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:password@MASTER_NODE_IP/airflowdb
broker_url = pyamqp://airflow:password@MASTER_NODE_IP:5672/myvhost
result_backend = db+mysql://airflow:password@MASTER_NODE_IP:3306/airflowdb
rbac = True
flower_basic_auth = user1:password1
fernet_key = YOUR_FERNET_KEY
Note: If you changed the time zone variables on the master node, make the same changes on the worker node as well.
After this, go back to the master node and run the MySQL command below so that the user connecting from the worker node can reach the database on the master node.
# replace 10.1.0.3 with your remote worker's IP address
sudo mysql -e "CREATE USER 'airflow'@'10.1.0.3' IDENTIFIED BY 'password'; GRANT ALL PRIVILEGES ON airflowdb.* TO 'airflow'@'10.1.0.3';"
2.4) Start Airflow Worker
# Initialize the airflow database on worker node
airflow initdb
# Start the worker
airflow worker
You should be able to see the worker node come up in the Flower interface at YOUR_MASTER_IP_ADDRESS:5555. Use the username and password configured in the flower_basic_auth variable in airflow.cfg.
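You can also query Flower’s HTTP API to confirm the worker has registered (a sketch; user1:password1 is the flower_basic_auth value from airflow.cfg):
curl -u user1:password1 http://YOUR_MASTER_IP_ADDRESS:5555/api/workers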
We also used a bash script, sync_dags.sh, containing a gsutil rsync command to keep our dags folder in sync with a central Google Cloud Storage bucket, scheduled via cron as shown below.
#!/bin/bash
# sync_dags.sh
/snap/bin/gsutil -m rsync -r gs://YOUR_BUCKET/airflow/dags /home/user/airflow/dags
#cron schedule to sync new dags every minute
* * * * * /home/user/sync_dags.sh > /home/user/error.log 2>&1
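One way to install that schedule non-interactively (a sketch, assuming the script lives at /home/user/sync_dags.sh):
# Make the script executable and append the schedule to the current crontab
chmod +x /home/user/sync_dags.sh
(crontab -l 2>/dev/null; echo "* * * * * /home/user/sync_dags.sh > /home/user/error.log 2>&1") | crontab -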
A Few Useful Points
- The default folder for keeping dags is ~/airflow/dags and can be configured in airflow.cfg file.
- The scheduler scans the dags directory every 5 minutes by default; this can be configured with the dag_dir_list_interval variable in airflow.cfg.
- TCP ports 3306 (MySQL), 5672 (broker), 8080 (web UI) and 5555 (Flower) should be open on the master node (see the firewall sketch after this list).
- TCP port 8793 for logs access should be open on worker nodes.
- TCP port 22 should be open if you want to connect to your nodes using SSH.
- There is a slight difference between when the Airflow scheduler triggers a run and when cron would. Have a look at the scheduler documentation if you are wondering why your jobs run with a one-interval delay.
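Since the cluster runs on GCP, the ports listed above can be opened with Compute Engine firewall rules. Below is an illustrative sketch; the rule names, target tags and source range are placeholders you would adapt to your own network.
# Allow traffic to the master node (web UI, Flower, MySQL, RabbitMQ)
gcloud compute firewall-rules create airflow-master \
    --allow=tcp:8080,tcp:5555,tcp:3306,tcp:5672 \
    --target-tags=airflow-master --source-ranges=10.1.0.0/24
# Allow access to worker logs
gcloud compute firewall-rules create airflow-worker-logs \
    --allow=tcp:8793 --target-tags=airflow-worker --source-ranges=10.1.0.0/24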
Useful Resources
https://airflow.apache.org/docs/stable/installation.html
https://airflow.apache.org/docs/stable/howto/initialize-database.html
https://airflow.apache.org/docs/stable/executor/index.html
https://airflow.apache.org/docs/stable/executor/celery.html
https://docs.celeryproject.org/en/stable/getting-started/brokers/rabbitmq.html
https://airflow.readthedocs.io/en/stable/howto/secure-connections.html
https://airflow.apache.org/docs/stable/cli-ref.html?highlight=create_user#create_user
http://site.clairvoyantsoft.com/installing-and-configuring-apache-airflow/