Airflow DAG Scheduling
Airflow DAG Scheduling Deep Dive
How DAG Scheduling Works
Airflow Scheduler's Role: The Airflow scheduler diligently monitors all DAGs and their associated tasks.
Task Triggering: It initiates tasks when their scheduled time arrives and all predefined dependencies are fully satisfied.
Continuous Monitoring: The scheduler constantly observes the DAGs folder, evaluating schedules every minute.
DAG Instance Creation: If, within the current minute or the immediate future, a new DAG matches its defined schedule, the scheduler creates a corresponding DAG instance.
Defining DAG Frequency: schedule_interval
The schedule_interval parameter is used to define when a DAG should run. This parameter accepts either a cron expression for custom timings or preset values for common frequencies.
Airflow Presets for schedule_interval
Airflow provides several convenience presets:
None: Indicates the DAG is not scheduled to run automatically. It must be triggered manually.
@once: Schedules the DAG to run only a single time, immediately after being unpaused.
@hourly: Runs once every hour, at the beginning of the hour (e.g., 00:00, 01:00, 02:00 UTC, etc.), starting from midnight (UTC).
@daily: Executes once every day, at midnight (e.g., 00:00 UTC).
@weekly: Runs once every week, at midnight on Sunday (e.g., Sunday at 00:00 UTC).
@monthly: Executes once every month, at midnight on the first day of the month (e.g., 00:00 UTC on the 1st).
@yearly: Runs once every year, at midnight on January 1 (e.g., 00:00 UTC on January 1).
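As a rough illustration, the time-based presets can be read as shorthand for the cron expressions below. The name PRESET_TO_CRON is an assumption made for this sketch and is not an Airflow API. None and @once are left out because they have no cron equivalent.

```python
# Illustrative lookup table (not part of Airflow): each preset and the
# five-field cron expression it is shorthand for.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",   # minute 0 of every hour
    "@daily": "0 0 * * *",    # midnight every day
    "@weekly": "0 0 * * 0",   # midnight on Sunday
    "@monthly": "0 0 1 * *",  # midnight on the first day of the month
    "@yearly": "0 0 1 1 *",   # midnight on January 1
}

# Print the table so the mapping can be checked at a glance.
for preset, cron in PRESET_TO_CRON.items():
    print(f"{preset:<9} {cron}")
```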
Custom Scheduling with Cron Expressions
For specific custom timings not covered by presets (e.g., daily at a particular time of day, or monthly on a particular day of the month), a cron expression is used.
Cron Expression Format: A cron expression typically consists of five (or sometimes six) fields:
minute hour day_of_month month day_of_week
Examples:
To run daily at 17:00 (5 PM), the cron expression would be 0 17 * * *.
To run daily at 09:00 (9 AM), the cron expression would be 0 9 * * *.
To run at midnight on the 5th of every month, the cron expression would be 0 0 5 * *.
To run at midnight every Sunday, the cron expression would be 0 0 * * 0 (where 0 or 7 represents Sunday).
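To make the field order concrete, here is a minimal sketch of a matcher for five-field cron expressions. It is not how Airflow evaluates schedules. The function name matches is hypothetical, and the sketch only understands fields that are either * or a single integer (no ranges, lists, or step values), which is enough for the examples above.

```python
from datetime import datetime


def matches(cron: str, when: datetime) -> bool:
    """Return True if `when` satisfies a simplified five-field cron expression.

    Simplification: every field must be `*` or a single integer.
    """
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError(
            "expected five fields: minute hour day_of_month month day_of_week"
        )

    actual = [
        when.minute,
        when.hour,
        when.day,
        when.month,
        when.isoweekday() % 7,  # cron counts Sunday as 0; isoweekday() gives 7
    ]

    for position, (field, value) in enumerate(zip(fields, actual)):
        if field == "*":
            continue  # wildcard: any value is accepted for this field
        wanted = int(field)
        if position == 4:
            wanted %= 7  # accept 7 as an alias for Sunday
        if wanted != value:
            return False
    return True


# The dates below are arbitrary; 7 January 2024 is a Sunday.
print(matches("0 17 * * *", datetime(2024, 1, 1, 17, 0)))
print(matches("0 0 * * 0", datetime(2024, 1, 7, 0, 0)))
```

In practice a full cron library handles ranges, lists, and steps; the sketch is only meant to show which field controls which part of the timestamp.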
Examples of Scheduled DAGs (VSCode & Airflow UI Demos)
The demonstration highlights various schedule_interval configurations:
Daily DAG:
DAG Name: hello_world_daily_DAG
schedule_interval: @daily (runs at midnight daily).
Hourly DAG:
DAG Name: hourly_test_DAG
schedule_interval: @hourly (runs every hour at the top of the hour).
Manual DAG:
DAG Name: manual_DAG
schedule_interval: None
Behavior: This DAG must be triggered manually each time. The UI demonstrates initiating a manual run and shows its tasks executing.
Custom 6 PM Daily DAG:
DAG Name: EOD_data_pipeline_DAG
schedule_interval: 0 18 * * * (a cron expression to run at 18:00, i.e. 6 PM, every day).
Monthly DAG on the 5th:
DAG Name: monthly_DAG
schedule_interval: 0 0 5 * * (a cron expression to run at midnight on the 5th day of every month).
Weekly DAG (Sundays):
schedule_interval: Uses a preset or cron to execute only on Sundays.
Catchup and Backfill
Concept: Catchup refers to the behavior where Airflow attempts to run past DAG executions that were missed between the start_date and the current date.
Backfill Defined: When a DAG's start_date is in the past and catchup is enabled, Airflow performs a "backfill" by creating a DAG run for each missed schedule_interval up to the present.
Parameter: Controlled by the catchup parameter in the DAG definition. Setting catchup=True enables this behavior.
Example Scenario: If the current date is in July and a DAG has a start_date earlier in July with catchup=True, Airflow will execute the DAG once for each missed day between the start_date and the current date, in addition to scheduling the runs that follow.
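The counting in a scenario like this can be sketched with plain date arithmetic. This is a simplified model, not Airflow's scheduler code: the function name missed_logical_dates and the dates used are hypothetical. It assumes a fixed daily interval and that a run becomes due only once the interval it covers has fully elapsed.

```python
from datetime import datetime, timedelta


def missed_logical_dates(start_date, now, interval=timedelta(days=1)):
    """List the interval start times whose interval has already ended by `now`.

    Simplified model of catchup: one run per fully elapsed interval,
    beginning at start_date.
    """
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


# Hypothetical dates chosen only for illustration.
start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5, 10, 0)

runs = missed_logical_dates(start, now)
for run_date in runs:
    print(run_date.date())
print(len(runs), "runs to catch up")
```

With catchup disabled, the same model would skip this list entirely and only schedule runs from the most recent interval onward.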
Demonstration of Catchup/Backfill
DAG File: A new DAG file dag_backfill.py is created.
Start Date: start_date is set to a date in July that precedes the current date.
Schedule Interval: A custom cron expression 52 8 * * * is used, indicating daily execution at 08:52 UTC.
Catchup Parameter: catchup=True is explicitly set in the DAG definition.
Observation in Airflow UI:
After the scheduler picks up the dag_backfill.py file, the DAG appears in the UI.
Upon enabling (unpausing) the DAG, Airflow immediately starts creating and running DAG instances for all missed schedule intervals.
The UI shows consecutive DAG runs executing, one for each date from the start_date through the current date in July.
Checking the logs for individual task instances confirms that each run corresponds to a specific date in the backfill period (each run's logs show a different July date).
Key Scheduling Parameters
To effectively schedule Airflow DAGs, three critical parameters work in conjunction:
start_date: Defines when the DAG should logically begin its runs.
schedule_interval: Specifies the frequency and timing of DAG executions (using presets or cron).
catchup: A boolean parameter (True or False) that determines whether Airflow should execute missed runs from the start_date up to the current date.
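A minimal sketch of how the three parameters might sit together is shown below. To keep it runnable without Airflow installed, the values are collected in a plain dictionary. The dag_id and the start_date are hypothetical placeholders. In an Airflow 2.x DAG file these keyword arguments would typically be passed to the DAG constructor; newer Airflow releases prefer a schedule argument over schedule_interval, so check the version in use.

```python
from datetime import datetime

# Hypothetical values for illustration only.
dag_kwargs = {
    "dag_id": "example_daily_dag",        # hypothetical DAG name
    "start_date": datetime(2024, 1, 1),   # hypothetical logical start
    "schedule_interval": "@daily",        # preset or cron string, or None
    "catchup": False,                     # True would create runs for missed intervals
}

# With Airflow 2.x installed, a DAG file would typically do something like:
#     from airflow import DAG
#     dag = DAG(**dag_kwargs)

for name, value in dag_kwargs.items():
    print(f"{name} = {value!r}")
```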