
Duration: 4 Weeks | Total Time: 40 Hours
Format: Live online sessions via Google Meet or MS Teams, led by an industry expert, with hands-on coding, mini-projects, and a capstone project.
Target Audience: College students; professionals in Finance, HR, Marketing, and Operations; analysts; and entrepreneurs
Tools Required: Laptop with an internet connection
Trainer: Industry professional with hands-on expertise
Week 1: Introduction to Apache Airflow & Core Concepts
Duration: 8 hours (4 sessions × 2 hrs)
Topics:
1. Introduction to Workflow Orchestration (2 hrs)
- What is orchestration?
- Role of Airflow in Data Engineering
- Airflow vs Luigi vs Prefect comparison
- Airflow architecture: Scheduler, Executor, Worker, Web Server
2. Airflow Installation & Environment Setup (2 hrs)
- Installing Airflow using pip and Docker
- Understanding Airflow components
- Navigating Airflow UI (DAGs, Logs, Tasks, Graphs)
3. Understanding DAGs & Tasks (2 hrs)
- Creating a simple DAG in Python
- Operators: PythonOperator, BashOperator, DummyOperator (renamed EmptyOperator in Airflow 2.4+)
- Dependencies: set_upstream(), set_downstream(), and the >> and << bitshift syntax (see the sketch below)
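To make the operator and dependency syntax concrete, here is a minimal sketch of a two-task DAG; the DAG id, callable, and bash command are illustrative, and Airflow 2.x imports are assumed:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def greet():
        # Trivial callable for the PythonOperator.
        print("Hello from Airflow!")

    with DAG(
        dag_id="hello_airflow",            # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        say_hello = PythonOperator(task_id="say_hello", python_callable=greet)
        list_files = BashOperator(task_id="list_files", bash_command="ls -l /tmp")

        # The three lines below all declare the same dependency; >> is the idiomatic form.
        # say_hello.set_downstream(list_files)
        # list_files.set_upstream(say_hello)
        say_hello >> list_files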
4. Mini Project + Q&A (2 hrs)
- Build a simple ETL DAG to extract and transform CSV data
- Schedule and run it through the Airflow UI (a possible skeleton follows)
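One possible skeleton for this mini project, assuming pandas is available; the file paths and the transformation (dropping incomplete rows) are placeholders:

    from datetime import datetime

    import pandas as pd
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    RAW_PATH = "/tmp/sales_raw.csv"      # placeholder input file
    CLEAN_PATH = "/tmp/sales_clean.csv"  # placeholder output file

    def extract():
        # In the real exercise this step might download the file;
        # here it only verifies the CSV is present and readable.
        pd.read_csv(RAW_PATH, nrows=5)

    def transform():
        df = pd.read_csv(RAW_PATH)
        df = df.dropna()                 # example transformation
        df.to_csv(CLEAN_PATH, index=False)

    with DAG(
        dag_id="csv_etl_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task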
Week 2: Building & Managing Complex DAGs
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
1. Advanced DAG Design (2 hrs)
- DAG parameters, default_args, retries, SLAs
- Dynamic task generation
- Branching and SubDAGs (see the sketch below)
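A minimal sketch combining default_args (retries, SLA) with a branch; the weekend/weekday rule is purely illustrative, and EmptyOperator assumes Airflow 2.4+ (use DummyOperator on older releases):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator   # DummyOperator before 2.4
    from airflow.operators.python import BranchPythonOperator

    default_args = {
        "retries": 2,                         # retry each failed task twice
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),            # flag task runs that exceed 1 hour
    }

    def choose_branch(**context):
        # "logical_date" is the run's date in Airflow 2.2+ ("execution_date" earlier).
        if context["logical_date"].weekday() >= 5:
            return "weekend_path"
        return "weekday_path"

    with DAG(
        dag_id="branching_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
        branch >> [EmptyOperator(task_id="weekend_path"),
                   EmptyOperator(task_id="weekday_path")]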
2. Using Airflow Operators (2 hrs)
- FileSensor, EmailOperator, SimpleHttpOperator, PostgresOperator
- Working with external APIs and SQL databases (example below)
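A sketch wiring these operators together; it assumes the apache-airflow-providers-http and apache-airflow-providers-postgres packages are installed and that the fs_default, http_default, and postgres_default connections are configured. The file path, endpoint, and SQL are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.sensors.filesystem import FileSensor
    from airflow.providers.http.operators.http import SimpleHttpOperator
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    with DAG(
        dag_id="operators_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,   # run on manual trigger only
        catchup=False,
    ) as dag:
        wait_for_file = FileSensor(
            task_id="wait_for_file",
            filepath="/tmp/incoming/data.csv",  # placeholder path
            poke_interval=60,                   # re-check every 60 seconds
        )
        call_api = SimpleHttpOperator(
            task_id="call_api",
            http_conn_id="http_default",
            endpoint="v1/orders",               # placeholder endpoint
            method="GET",
        )
        load_db = PostgresOperator(
            task_id="load_db",
            postgres_conn_id="postgres_default",
            sql="INSERT INTO orders_stage SELECT * FROM orders_raw;",  # placeholder SQL
        )
        wait_for_file >> call_api >> load_db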
3. XComs and Data Sharing (2 hrs)
- Passing data between tasks
- Using XComs effectively in data pipelines (see the sketch below)
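A minimal XCom round trip: a PythonOperator's return value is pushed to XCom automatically, and a downstream task pulls it by task id. XComs live in the metadata database, so they suit small values such as row counts or file paths, not large dataframes:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # The return value is pushed to XCom under the key "return_value".
        return {"rows": 42}

    def report(ti):
        # Airflow injects "ti" (the TaskInstance) because it appears in the signature.
        payload = ti.xcom_pull(task_ids="extract")
        print(f"Extract produced {payload['rows']} rows")

    with DAG(
        dag_id="xcom_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        report_task = PythonOperator(task_id="report", python_callable=report)
        extract_task >> report_task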
4. Error Handling & Task Monitoring (2 hrs)
- Handling task failures and retries
- Alerting & notifications via Slack/Email integration (one webhook pattern is sketched below)
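The Slack provider package ships dedicated operators for this; as a simpler, provider-free illustration, one common pattern posts to a Slack incoming webhook from an on_failure_callback. The webhook URL is a placeholder:

    from datetime import datetime, timedelta

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def notify_slack(context):
        # Airflow calls this with the task context after retries are exhausted.
        ti = context["task_instance"]
        msg = f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}"
        requests.post(SLACK_WEBHOOK_URL, json={"text": msg}, timeout=10)

    def flaky():
        raise RuntimeError("simulated failure")

    with DAG(
        dag_id="alerting_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
        default_args={
            "retries": 1,
            "retry_delay": timedelta(minutes=1),
            "on_failure_callback": notify_slack,
        },
    ) as dag:
        PythonOperator(task_id="flaky", python_callable=flaky)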
5. Mini Project + Q&A (2 hrs)
- Build a multi-stage DAG integrating API extraction + data transformation + DB loading
Week 3: Airflow with Big Data & Cloud Integration
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
1. Airflow with Apache Spark (2 hrs)
- Submitting Spark jobs using Airflow
- Using SparkSubmitOperator for batch data pipelines (see the sketch below)
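A sketch of a daily Spark submission; it assumes apache-airflow-providers-apache-spark is installed, a spark_default connection points at the cluster, and the PySpark script path is a placeholder:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="spark_batch_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        SparkSubmitOperator(
            task_id="daily_aggregation",
            application="/opt/jobs/aggregate_sales.py",  # placeholder PySpark script
            conn_id="spark_default",
            application_args=["--date", "{{ ds }}"],     # pass the run date to the job
        )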
2. Airflow with Hadoop & HDFS (2 hrs)
- Managing data in HDFS
- Using Airflow for daily ingestion & transformation jobs (a BashOperator-based sketch follows)
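One straightforward pattern drives HDFS from a BashOperator with the hdfs CLI, landing each day's file in a date-partitioned directory; the paths are placeholders, and the worker is assumed to have an HDFS client configured:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="hdfs_ingest_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="ingest_to_hdfs",
            # {{ ds }} renders as the run date, giving one HDFS partition per day.
            bash_command=(
                "hdfs dfs -mkdir -p /data/raw/{{ ds }} && "
                "hdfs dfs -put -f /landing/events.csv /data/raw/{{ ds }}/"
            ),
        )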
3. Airflow with AWS / GCP / Azure (2 hrs)
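As one cloud example for this session, the AWS provider's transfer operators upload local files to S3; GCP and Azure have analogous provider packages. The sketch assumes apache-airflow-providers-amazon is installed and an aws_default connection holds credentials; the file, bucket, and key are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.transfers.local_to_s3 import (
        LocalFilesystemToS3Operator,
    )

    with DAG(
        dag_id="s3_upload_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        LocalFilesystemToS3Operator(
            task_id="upload_report",
            filename="/tmp/report.csv",              # placeholder local file
            dest_bucket="my-analytics-bucket",       # placeholder bucket
            dest_key="reports/{{ ds }}/report.csv",  # one key per run date
            aws_conn_id="aws_default",
            replace=True,
        )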
4. Airflow with Kafka & Streaming Data (2 hrs)
- Triggering workflows from Kafka topics
- Simulating near-real-time pipelines with scheduled micro-batches (sketched below)
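The apache-airflow-providers-apache-kafka package offers sensors for event-driven triggering; as a simpler, provider-free illustration, the sketch below polls a topic with the confluent-kafka client inside a frequently scheduled task, so each DAG run processes one micro-batch. The broker address, topic, and group id are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from confluent_kafka import Consumer

    def consume_batch():
        # Read a bounded batch of messages, then exit so the DAG run completes;
        # this approximates streaming with micro-batches, not a long-lived consumer.
        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",  # placeholder broker
            "group.id": "airflow-demo",
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe(["orders"])              # placeholder topic
        try:
            for _ in range(100):                    # cap the batch size
                msg = consumer.poll(timeout=1.0)
                if msg is None:
                    break                           # topic drained for now
                if msg.error():
                    continue
                print(msg.value().decode("utf-8"))
        finally:
            consumer.close()

    with DAG(
        dag_id="kafka_microbatch_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="*/5 * * * *",  # every 5 minutes, approximating real time
        catchup=False,
    ) as dag:
        PythonOperator(task_id="consume_batch", python_callable=consume_batch)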
5. Mini Project + Q&A (2 hrs)
- Build a batch pipeline integrating Airflow + Spark + S3
Week 4: Airflow in Production, Scaling & Capstone Project
Duration: 12 hours (6 sessions × 2 hrs)
Topics:
1. Scheduling, Triggers, and Backfills (2 hrs)
- Airflow scheduling and cron expressions
- Manual triggers and backfilling DAG runs (see the example below)
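A small example tying these together: a cron-scheduled DAG with catchup enabled, plus the CLI command for backfilling a specific window (the DAG id is illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator  # DummyOperator before Airflow 2.4

    with DAG(
        dag_id="schedule_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 6 * * *",  # cron expression: every day at 06:00
        catchup=True,                   # on unpause, create runs for all missed intervals
    ) as dag:
        EmptyOperator(task_id="noop")

    # Backfill a specific date range from the command line:
    #   airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-07 schedule_demo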
2. Airflow in Production Environments (2 hrs)
- Airflow Executors: Sequential, Local, Celery, Kubernetes
- Configuring Airflow for scalability and high availability
3. CI/CD and Version Control (2 hrs)
- DAG versioning using Git
- Deploying Airflow pipelines through CI/CD tools (GitHub Actions, Jenkins)
4. Monitoring, Logging & Security (2 hrs)
- Airflow metrics and logging; Prometheus & Grafana integration
- Authentication & Role-Based Access Control (RBAC)
5. Capstone Project Development (2 hrs)
- Design and build an end-to-end data pipeline using Airflow and Cloud Storage
6. Capstone Presentation & Feedback (2 hrs)
- Present final DAG and pipeline workflow
- Instructor feedback and best practices discussion
Capstone Project Example
Project Title: Automated Data Pipeline for E-Commerce Analytics
Goal:
Extract transactional data from APIs → Load into AWS S3 → Transform using Spark → Load into Redshift → Orchestrate with Airflow
Tech Stack: Airflow, Python, Spark, AWS S3, Redshift