
Introduction to Apache Airflow

Course Code PYTH-220
Duration 3 days
Available Formats Classroom

This Introduction to Airflow training course teaches software engineers and data engineers how to use Apache Airflow to orchestrate production-ready data pipelines. Students learn how to create pipelines as DAGs (Directed Acyclic Graphs), make pipelines predictable, and master their scheduling. In addition, participants learn how to abstract functionality, reuse components, and modularize and share data across tasks and pipelines with the TaskFlow API, Task Groups, and Branching. Finally, attendees learn how to scale Airflow with the Celery and Kubernetes executors and with KEDA.

Throughout the course, students incrementally construct a single, real-world application.
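
For orientation, the kind of pipeline built in class can be sketched as a small TaskFlow-style DAG. The DAG id and task names below are illustrative assumptions, not part of the course materials:

    # A minimal sketch of a TaskFlow-API pipeline (illustrative names only).
    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(
        dag_id="example_etl",             # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    def example_etl():
        @task
        def extract():
            # Stand-in for pulling records from a source system.
            return [1, 2, 3]

        @task
        def load(records):
            # Passing extract()'s return value both orders the tasks and
            # moves the data between them via XCom.
            print(f"Loaded {len(records)} records")

        load(extract())

    example_etl()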

Skills Gained

  • Create production-ready data pipelines in Airflow
  • Build pipelines in Airflow that scale to hundreds of tasks
  • Enforce modularization and reusability of Airflow tasks across projects
  • Scale Airflow in Kubernetes

Prerequisites

All attendees should have basic Python knowledge or object-oriented programming experience.

Course Details

Training Materials

All Airflow training students receive comprehensive courseware.

Software Requirements

  • Python 3.6 or later
  • Airflow 2.1 or later

Outline

  • Introducing Apache Airflow
    • What is Airflow and what problems does it solve?
    • Airflow architecture
    • How do we represent a Pipeline?
    • Our first DAG
    • Tasks, TaskFlow, and Operators
    • First Pipeline
  • Mastering scheduling
    • execution_date, start_date, and schedule_interval (see the sketch after this outline)
    • Handling non-default schedule_intervals
    • Playing with time
  • Abstracting functionality
    • Using custom operators
    • Creating TaskGroups vs. SubDAGs
    • Sharing data with XComs
    • Branching and Triggers
    • Sensors and SmartSensors
  • Executors and Scaling Airflow
    • Abandoning SQLite for PostgreSQL
    • Executors: Debug, Local, Celery
    • Concurrency and parallelism
    • Concurrency with Celery
    • Airflow in Kubernetes, the old and new ways
    • KEDA and the HA scheduler
    • Deploying a highly available, fault-tolerant Airflow
  • Conclusion
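
To ground the scheduling topics above, the following sketch shows how start_date and schedule_interval anchor a DAG's runs and how each run receives a logical date. The DAG id, cron expression, and command are illustrative assumptions, not course code:

    # A minimal scheduling sketch: start_date anchors the schedule,
    # schedule_interval sets the cadence, and each run gets a logical date.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_schedule",        # hypothetical DAG id
        start_date=datetime(2023, 1, 1),  # first schedulable interval starts here
        schedule_interval="0 6 * * *",    # run daily at 06:00
        catchup=False,                    # do not backfill missed intervals
    ) as dag:
        # The {{ ds }} macro expands to the run's logical date (YYYY-MM-DD).
        BashOperator(
            task_id="print_logical_date",
            bash_command="echo 'Processing data for {{ ds }}'",
        )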