Airflow DAG Scheduling

Airflow DAG Scheduling Deep Dive

How DAG Scheduling Works

  • Airflow Scheduler's Role: The Airflow scheduler diligently monitors all DAGs and their associated tasks.

  • Task Triggering: It initiates tasks when their scheduled time arrives and all predefined dependencies are fully satisfied.

  • Continuous Monitoring: The scheduler constantly observes the DAGs folder, evaluating schedules every minute.

  • DAG Instance Creation: If, within the current minute or the immediate future, a new DAG matches its defined schedule, the scheduler creates a corresponding DAG instance.

Defining DAG Frequency: schedule_interval

  • The schedule_interval parameter is used to define when a DAG should run.

  • This parameter accepts either a cron expression for custom timings or preset values for common frequencies.

Airflow Presets for schedule_interval

Airflow provides several convenience presets:

  • None:

    • Indicates the DAG is not scheduled to run automatically.

    • It must be triggered manually.

  • @once:

    • Schedules the DAG to run only a single time immediately after being unpaused.

  • @hourly:

    • Runs once every hour, specifically at the beginning of the hour (e.g., 00:0000:00, 01:0001:00, 02:0002:00 UTC, etc.), starting from midnight (UTC).

  • @daily:

    • Executes once every day, at midnight (e.g., 00:0000:00 UTC).

  • @weekly:

    • Runs once every week, typically at midnight on Sunday (e.g., Sunday at 00:0000:00 UTC).

  • @monthly:

    • Executes once every month, at midnight on the first day of the month (e.g., 00:0000:00 UTC on the 1st1^{\text{st}}).

  • @yearly:

    • Runs once every year, at midnight on January 1st1^{\text{st}} (e.g., 00:0000:00 UTC on January 1st1^{\text{st}}).

Custom Scheduling with Cron Expressions

  • For specific custom timings not covered by presets (e.g., daily at 5 PM5 \text{ PM}, monthly on the 5th5^{\text{th}} day), a cron expression is used.

  • Cron Expression Format: A cron expression typically consists of five (or sometimes six) fields:
    minute hour day_of_month month day_of_week

  • Examples:

    • To run daily at 5 PM5 \text{ PM}, the cron expression would be 0 17 * * *.

    • To run daily at 9 AM9 \text{ AM}, the cron expression would be 0 9 * * *.

    • To run on the 5th5^{\text{th}} of every month, the cron expression would be 0 0 5 * *.

    • To run every Sunday, the cron expression would be 0 0 * * 0 (where 00 or 77 represents Sunday).

Examples of Scheduled DAGs (VSCode & Airflow UI Demos)

The demonstration highlights various schedule_interval configurations:

  • Daily DAG:

    • DAG Name: hello_world_daily_DAG

    • schedule_interval: @daily (runs at 12:00 AM12:00 \text{ AM} midnight daily).

  • Hourly DAG:

    • DAG Name: hourly_test_DAG

    • schedule_interval: @hourly (runs every hour at the top of the hour).

  • Manual DAG:

    • DAG Name: manual_DAG

    • schedule_interval: None

    • Behavior: This DAG must be triggered manually each time. The UI demonstrates initiating a manual run, showing its tasks execute.

  • Custom 6 PM Daily DAG:

    • DAG Name: EOD_data_pipeline_DAG

    • schedule_interval: 0 18 * * * (a cron expression to run at 6:00 PM6:00 \text{ PM} every day).

  • Monthly DAG on the 5th5^{\text{th}}:

    • DAG Name: monthly_DAG

    • schedule_interval: 0 0 5 * * (a cron expression to run at midnight on the 5th5^{\text{th}} day of every month).

  • Weekly DAG (Sundays):

    • schedule_interval: Uses a preset or cron to execute only on Sundays.

Catchup and Backfill

  • Concept: Catchup refers to the behavior where Airflow attempts to run past DAG executions that were missed between the start_date and the current date.

  • Backfill Defined: When a DAG's start_date is in the past, and catchup is enabled, Airflow performs a "backfill" by creating DAG runs for each of the missed schedule_intervals up to the present.

  • Parameter: Controlled by the catchup parameter in the DAG definition. Setting catchup=True enables this behavior.

  • Example Scenario: If today is July 15th15^{\text{th}} and a DAG has a start_date of July 1st1^{\text{st}} with catchup=True, Airflow will execute the DAG for each day from July 1st1^{\text{st}} to July 14th14^{\text{th}} (a total of 1414 missed executions), in addition to scheduling for July 15th15^{\text{th}} onwards.

Demonstration of Catchup/Backfill

  • DAG File: A new DAG file dag_backfill.py is created.

  • Start Date: start_date is set to July 1st,20241^{\text{st}}, 2024.

  • Schedule Interval: A custom cron expression 52 8 * * * is used, indicating daily execution at 08:5208:52 UTC.

  • Catchup Parameter: catchup=True is explicitly set in the DAG definition.

  • Observation in Airflow UI:

    • After the scheduler picks up the dag_backfill.py file, the DAG appears in the UI.

    • Upon enabling (unpausing) the DAG, Airflow immediately starts creating and running DAG instances for all missed schedule intervals.

    • The UI shows 1414 consecutive DAG runs executing, corresponding to July 1st1^{\text{st}} through July 14th14^{\text{th}} (assuming the current date is July 15th15^{\text{th}}).

    • Checking the logs for individual task instances confirms that each run corresponds to a specific date in the backfill period (e.g., logs for one run show execution for July 1st1^{\text{st}}, another for July 5th5^{\text{th}}, another for July 12th12^{\text{th}}).

Key Scheduling Parameters

To effectively schedule Airflow DAGs, three critical parameters work in conjunction:

  • start_date: Defines when the DAG should logically begin its runs.

  • schedule_interval: Specifies the frequency and timing of DAG executions (using presets or cron).

  • catchup: A boolean parameter (True or False) that determines whether Airflow should execute missed runs from the start_date up to the current date.