How Canva Collects 25 Billion Events a Day

ByteByteGo

Alex Xu • Published about 1 month ago • 1 min read


Canva's event collection system processes 25 billion events daily, leveraging a scalable architecture with Kafka, Flink, and S3. The system prioritizes reliability, low latency, and cost-efficiency while handling diverse event types from global users. Key optimizations include batching, compression, and intelligent routing to balance performance and resource usage.


Core Technical Concepts/Technologies

  • Event Streaming: Kafka for high-throughput data ingestion (see the producer sketch after this list)
  • Stream Processing: Flink for real-time event aggregation/enrichment
  • Storage: S3 for cost-effective long-term retention
  • Batching/Compression: Protocol Buffers (Protobuf) and Snappy for efficiency
  • Load Balancing: Regional routing to minimize latency
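
To make "Kafka for high-throughput ingestion" concrete, below is a minimal producer sketch. The bootstrap address, topic name, and exact tuning values are illustrative placeholders rather than Canva's published configuration; only the Snappy compression and the batching idea come from the article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.us-east-1.internal:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Trade a little latency for throughput: let records accumulate into larger batches.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 100 * 1024); // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);           // wait up to 5ms to fill a batch
        // Snappy shrinks payloads on the wire at low CPU cost.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // favor durability for analytics data

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[0]; // a serialized event batch would go here
            producer.send(new ProducerRecord<>("analytics-events", payload)); // hypothetical topic name
        }
    }
}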

Main Points

  • Scale Challenges:

    • 25B events/day (~300k events/sec peak) with sub-second latency requirements
    • Events vary in size (1KB–10KB) and type (e.g., clicks, edits, collaborations)
  • Architecture:

    1. Client SDKs: Lightweight collectors batch events (5s/100KB thresholds) with Protobuf+Snappy compression (see the collector sketch after this list).
    2. Ingestion Layer: Regional Kafka clusters handle traffic spikes; auto-scaling via Kubernetes.
    3. Processing: Flink jobs enrich/aggregate events (e.g., sessionization) in real time.
    4. Storage: Processed data lands in S3 (Parquet format) via hourly partitions for analytics.
  • Optimizations:

    • Batching: Reduces network overhead (e.g., 100KB batches cut TCP handshake costs).
    • Regional Proximity: Clients route to nearest AWS region (us-east-1, ap-southeast-2, etc.).
    • Dead-Letter Queues: Handle malformed events without blocking pipelines.
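
The client-side step is worth sketching: a collector that flushes either every 5 seconds or once the buffer reaches 100KB, whichever comes first, then Snappy-compresses the Protobuf batch. The article does not publish Canva's SDK code, so this is a minimal sketch; the upload transport is a hypothetical stand-in, while the thresholds come from the architecture notes above.

import com.google.protobuf.MessageLite;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.xerial.snappy.Snappy;

// Hypothetical client-side collector: buffers events until 100KB or 5s have
// elapsed, then compresses the batch with Snappy and hands it to a transport.
public class EventCollector {
    private static final int MAX_BATCH_BYTES = 100 * 1024; // 100KB size threshold
    private static final long FLUSH_INTERVAL_SEC = 5;      // 5-second time threshold

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public EventCollector() {
        scheduler.scheduleAtFixedRate(this::flush, FLUSH_INTERVAL_SEC, FLUSH_INTERVAL_SEC, TimeUnit.SECONDS);
    }

    // Events are appended length-delimited so the server can split the batch back apart.
    public synchronized void record(MessageLite event) throws IOException {
        event.writeDelimitedTo(buffer);
        if (buffer.size() >= MAX_BATCH_BYTES) {
            flush();
        }
    }

    private synchronized void flush() {
        if (buffer.size() == 0) return;
        try {
            byte[] compressed = Snappy.compress(buffer.toByteArray());
            upload(compressed); // hypothetical transport call (e.g., HTTP POST to a regional endpoint)
        } catch (IOException e) {
            // A production SDK would retry or divert to a dead-letter path rather than drop.
        } finally {
            buffer.reset();
        }
    }

    private void upload(byte[] batch) { /* placeholder */ }
}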

Technical Specifications

  • Kafka Configuration:
    • 6-node clusters per region, 32 vCPUs/node, 64GB RAM
    • Retention: 7 days (hot storage), 30 days (cold via S3)
  • Flink Jobs:
    • Checkpointing every 10s for fault tolerance
    • Parallelism tuned per event type (e.g., 32–128 tasks)
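
As a rough sketch of how these settings map onto Flink's Java API; the 10s checkpoint interval comes from the specs above, while the parallelism value of 64 is just an illustrative point in the stated 32-128 range, and the job body is elided:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 10 seconds with exactly-once semantics for fault tolerance.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        // Parallelism is tuned per event type; 64 is an illustrative mid-range value.
        env.setParallelism(64);
        // ... Kafka source, sessionization windows, and a Parquet-on-S3 sink would go here ...
        env.execute("event-enrichment");
    }
}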

Key Takeaways

  1. Batching is critical for high-volume event systems to reduce network/processing overhead.
  2. Regional routing improves latency and reliability for global user bases.
  3. Protocol Buffers + Snappy offer an optimal balance of size and speed for serialization.
  4. Separation of hot/cold storage (Kafka → S3) balances cost and accessibility.
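
The hot-tier half of takeaway 4 can be expressed as topic-level retention. Below is a sketch using Kafka's AdminClient with placeholder names and partition counts; the 7-day figure comes from the specifications above, and moving older data to S3 is handled downstream by the Flink-to-S3 sink rather than by Kafka itself.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.us-east-1.internal:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Hot tier: Kafka keeps 7 days; anything older lives only in S3 (cold tier).
            NewTopic topic = new NewTopic("analytics-events", 64, (short) 3) // partitions/replicas are illustrative
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}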

Limitations & Future Work

  • Cold Start Latency: Flink recovery from checkpoints can delay processing after failures.
  • Schema Evolution: Protobuf requires careful versioning for backward compatibility (see the sketch after this list).
  • Exploration Areas: Testing Arrow format for analytics queries on S3 data.
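
To make the schema-evolution caveat concrete: Protobuf stays backward compatible as long as new fields take fresh tag numbers and existing tags are never reused or renumbered. EventV1 and EventV2 below are hypothetical generated message classes, shown only to illustrate the rule.

import com.google.protobuf.InvalidProtocolBufferException;

public class SchemaEvolutionDemo {
    // Suppose EventV1 declares: string user_id = 1; int64 ts = 2;
    // and EventV2 adds:         string region = 3;  (new tag, old tags untouched)
    public static void main(String[] args) throws InvalidProtocolBufferException {
        byte[] oldBytes = EventV1.newBuilder()   // hypothetical generated class
                .setUserId("u-123")
                .setTs(1700000000L)
                .build()
                .toByteArray();

        // A new reader parses old bytes: the added field falls back to its default ("").
        EventV2 upgraded = EventV2.parseFrom(oldBytes); // hypothetical generated class
        System.out.println(upgraded.getRegion().isEmpty()); // prints true

        // Conversely, an old reader parsing new bytes keeps the unrecognized region
        // field in its unknown-field set, so round-tripping does not lose data.
    }
}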

This article walks through how Canva structures, collects, and distributes billions of events daily without drowning in tech debt or runaway cloud bills.

This article was originally published on ByteByteGo.
