
How Netflix Orchestrates Millions of Workflow Jobs with Maestro

ByteByteGo

Alex Xu • Published 18 days ago • 1 min read


Netflix developed Maestro, a scalable workflow orchestrator, to replace Meson, which struggled with increasing workloads due to its single-leader architecture. Maestro uses a microservices-based design, distributed queues, and CockroachDB for horizontal scalability, supporting time-based scheduling, event-driven triggers, and dynamic workflows with features like foreach loops and parameterization. It caters to diverse users via multiple DSLs (YAML, Python, Java), UI-based workflow creation, and integrations like Metaflow.
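To make these pieces concrete, the sketch below shows roughly what a workflow definition could look like in a YAML DSL of this kind. The field names (id, trigger, steps) and step options are illustrative assumptions, not Maestro's actual schema.

    # Hypothetical workflow definition; field names are assumptions, not Maestro's schema
    id: daily_member_report
    trigger:
      cron: "0 2 * * *"             # time-based trigger: run daily at 02:00 UTC
    steps:
      - spark:                      # predefined step type for a Spark job
          script: s3://jobs/aggregate_members.py
      - notebook:                   # run a Jupyter notebook after the Spark step
          path: s3://notebooks/report.ipynb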


Core Technical Concepts & Technologies

  • Workflow Orchestration (DAG-based execution)
  • Microservices Architecture (stateless services)
  • Distributed Queues (decoupled communication)
  • CockroachDB (distributed SQL for state storage)
  • Time-Based & Event-Driven Scheduling (cron, signals)
  • Dynamic Workflows (parameterization, foreach loops)
  • Execution Abstractions (predefined step types, notebooks, Docker)
  • Multi-DSL Support (YAML, Python, Java)

Key Points

  1. Meson’s Limitations

    • Single-leader architecture led to scaling bottlenecks.
    • Relied on vertical scaling and eventually hit AWS instance-size limits.
    • Struggled with peak loads (e.g., midnight UTC workflows).
  2. Maestro’s Architecture

    • Workflow Engine: Manages DAGs, step execution, and dynamic workflows (e.g., foreach loops).
    • Time-Based Scheduler: Cron-like triggers with deduplication for exactly-once execution.
    • Signal Service: Event-driven triggers (e.g., S3 updates, internal events) with lineage tracking (see the signal-trigger sketch after this list).
  3. Scalability Techniques

    • Stateless microservices + horizontal scaling.
    • Distributed queues for reliable inter-service communication.
    • CockroachDB for consistent, scalable state storage.
  4. Execution Abstractions

    • Step Types: Predefined templates (Spark, SQL, etc.).
    • Notebook Execution: Direct Jupyter notebook support.
    • Docker Jobs: Custom logic via containers.
  5. User Flexibility

    • DSLs (YAML, Python, Java) and UI for workflow creation.
    • Metaflow Integration: Pythonic DAGs for data scientists.
  6. Advanced Features

    • Parameterized Workflows: Dynamic backfills (e.g., date ranges).
    • Rollup & Aggregated Views: Unified status tracking for complex workflows.
    • Event Publishing: Internal/external (Kafka/SNS) for real-time monitoring.
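As noted in point 2, the sketch below illustrates how an event-driven trigger and a container-based step might be expressed. The signal name and Docker fields are assumptions for illustration, not the exact Maestro syntax.

    # Hypothetical event-driven workflow; signal and step fields are illustrative
    id: refresh_on_new_data
    trigger:
      signal: s3://data-ready       # fire when the upstream signal is published
    steps:
      - docker:                     # custom logic packaged as a container image
          image: registry.example.com/etl/refresh:latest
          command: ["python", "refresh.py"]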

Technical Specifications & Examples

  • Foreach Loop (see the backfill parameter sketch after this list):
    steps:
      - foreach:                    # expand one iteration per entry in date_range
          input: ${date_range}
          steps:
            - notebook:             # run the notebook once per date
                params:
                  date: ${item}     # the current element of the foreach input
  • Signal Service: Subscribes to events (e.g., s3://data-ready) to trigger workflows.
  • CockroachDB: Ensures strong consistency for workflow state across regions.
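To show how the foreach example above could drive a backfill, the snippet below passes a date range as a launch-time parameter; the params field and its shape are assumptions, not Maestro's actual interface.

    # Hypothetical launch-time parameters for a backfill run
    params:
      date_range:                   # expanded by the foreach step, one iteration per date
        - "2024-01-01"
        - "2024-01-02"
        - "2024-01-03"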

Key Takeaways

  1. Horizontal Scaling: Maestro’s stateless microservices and distributed queues overcome single-node bottlenecks.
  2. Flexible Triggers: Combines time-based and event-driven scheduling for efficiency.
  3. User-Centric Design: Supports engineers (Docker/APIs), data scientists (notebooks), and analysts (UI).
  4. Observability: Rollup views and event publishing enable real-time workflow tracking.
  5. Dynamic Workflows: Parameterization and foreach loops reduce manual definition overhead.

Limitations & Future Work

  • Complexity: Deeply nested workflows may require careful monitoring.
  • Learning Curve: Multiple DSLs/APIs could overwhelm new users.
  • Open-Source Adoption: External use cases may reveal edge cases not yet addressed.

References: Netflix Tech Blog, Maestro GitHub.


This article was originally published on ByteByteGo
