How Canva Collects 25 Billion Events a Day

ByteByteGo

Alex Xu • Published about 1 month ago • 1 min read


Canva's event collection system processes 25 billion events daily, leveraging a scalable architecture with Kafka, Flink, and S3. The system prioritizes reliability, low latency, and cost-efficiency while handling diverse event types from global users. Key optimizations include batching, compression, and intelligent routing to balance performance and resource usage.


Core Technical Concepts/Technologies

  • Event Streaming: Kafka for high-throughput data ingestion (see the producer sketch after this list)
  • Stream Processing: Flink for real-time event aggregation/enrichment
  • Storage: S3 for cost-effective long-term retention
  • Batching/Compression: Protocol Buffers (Protobuf) and Snappy for efficiency
  • Load Balancing: Regional routing to minimize latency
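
To make "Kafka for high-throughput ingestion" concrete, below is a minimal producer sketch. The bootstrap address, topic name, and exact tuning values are illustrative placeholders rather than Canva's published configuration; only the Snappy compression and the batching idea come from the article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.us-east-1.internal:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Trade a little latency for throughput: let records accumulate into larger batches.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 100 * 1024); // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);           // wait up to 5ms to fill a batch
        // Snappy shrinks payloads on the wire at low CPU cost.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // favor durability for analytics data

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[0]; // a serialized event batch would go here
            producer.send(new ProducerRecord<>("analytics-events", payload)); // hypothetical topic name
        }
    }
}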

Main Points

  • Scale Challenges:

    • 25B events/day (~300k events/sec peak) with sub-second latency requirements
    • Events vary in size (1KB–10KB) and type (e.g., clicks, edits, collaborations)
  • Architecture:

    1. Client SDKs: Lightweight collectors batch events (5s/100KB thresholds) with Protobuf+Snappy compression (see the collector sketch after this list).
    2. Ingestion Layer: Regional Kafka clusters handle traffic spikes; auto-scaling via Kubernetes.
    3. Processing: Flink jobs enrich/aggregate events (e.g., sessionization) in real time.
    4. Storage: Processed data lands in S3 (Parquet format) via hourly partitions for analytics.
  • Optimizations:

    • Batching: Reduces network overhead (e.g., 100KB batches cut TCP handshake costs).
    • Regional Proximity: Clients route to nearest AWS region (us-east-1, ap-southeast-2, etc.).
    • Dead-Letter Queues: Handle malformed events without blocking pipelines.
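
The client-side step is worth sketching: a collector that flushes either every 5 seconds or once the buffer reaches 100KB, whichever comes first, then Snappy-compresses the Protobuf batch. The article does not publish Canva's SDK code, so this is a minimal sketch; the upload transport is a hypothetical stand-in, while the thresholds come from the architecture notes above.

import com.google.protobuf.MessageLite;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.xerial.snappy.Snappy;

// Hypothetical client-side collector: buffers events until 100KB or 5s have
// elapsed, then compresses the batch with Snappy and hands it to a transport.
public class EventCollector {
    private static final int MAX_BATCH_BYTES = 100 * 1024; // 100KB size threshold
    private static final long FLUSH_INTERVAL_SEC = 5;      // 5-second time threshold

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public EventCollector() {
        scheduler.scheduleAtFixedRate(this::flush, FLUSH_INTERVAL_SEC, FLUSH_INTERVAL_SEC, TimeUnit.SECONDS);
    }

    // Events are appended length-delimited so the server can split the batch back apart.
    public synchronized void record(MessageLite event) throws IOException {
        event.writeDelimitedTo(buffer);
        if (buffer.size() >= MAX_BATCH_BYTES) {
            flush();
        }
    }

    private synchronized void flush() {
        if (buffer.size() == 0) return;
        try {
            byte[] compressed = Snappy.compress(buffer.toByteArray());
            upload(compressed); // hypothetical transport call (e.g., HTTP POST to a regional endpoint)
        } catch (IOException e) {
            // A production SDK would retry or divert to a dead-letter path rather than drop.
        } finally {
            buffer.reset();
        }
    }

    private void upload(byte[] batch) { /* placeholder */ }
}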

Technical Specifications

  • Kafka Configuration:
    • 6-node clusters per region, 32 vCPUs/node, 64GB RAM
    • Retention: 7 days (hot storage), 30 days (cold via S3)
  • Flink Jobs:
    • Checkpointing every 10s for fault tolerance
    • Parallelism tuned per event type (e.g., 32–128 tasks)
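
As a rough sketch of how these settings map onto Flink's Java API; the 10s checkpoint interval comes from the specs above, while the parallelism value of 64 is just an illustrative point in the stated 32-128 range, and the job body is elided:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 10 seconds with exactly-once semantics for fault tolerance.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        // Parallelism is tuned per event type; 64 is an illustrative mid-range value.
        env.setParallelism(64);
        // ... Kafka source, sessionization windows, and a Parquet-on-S3 sink would go here ...
        env.execute("event-enrichment");
    }
}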

Key Takeaways

  1. Batching is critical for high-volume event systems to reduce network/processing overhead.
  2. Regional routing improves latency and reliability for global user bases.
  3. Protocol Buffers + Snappy offer an optimal balance of size and speed for serialization.
  4. Separation of hot/cold storage (Kafka → S3) balances cost and accessibility.
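
The hot-tier half of takeaway 4 can be expressed as topic-level retention. Below is a sketch using Kafka's AdminClient with placeholder names and partition counts; the 7-day figure comes from the specifications above, and moving older data to S3 is handled downstream by the Flink-to-S3 sink rather than by Kafka itself.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.us-east-1.internal:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Hot tier: Kafka keeps 7 days; anything older lives only in S3 (cold tier).
            NewTopic topic = new NewTopic("analytics-events", 64, (short) 3) // partitions/replicas are illustrative
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}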

Limitations & Future Work

  • Cold Start Latency: Flink recovery from checkpoints can delay processing after failures.
  • Schema Evolution: Protobuf requires careful versioning for backward compatibility (see the sketch after this list).
  • Exploration Areas: Testing Arrow format for analytics queries on S3 data.
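
To make the schema-evolution caveat concrete: Protobuf stays backward compatible as long as new fields take fresh tag numbers and existing tags are never reused or renumbered. EventV1 and EventV2 below are hypothetical generated message classes, shown only to illustrate the rule.

import com.google.protobuf.InvalidProtocolBufferException;

public class SchemaEvolutionDemo {
    // Suppose EventV1 declares: string user_id = 1; int64 ts = 2;
    // and EventV2 adds:         string region = 3;  (new tag, old tags untouched)
    public static void main(String[] args) throws InvalidProtocolBufferException {
        byte[] oldBytes = EventV1.newBuilder()   // hypothetical generated class
                .setUserId("u-123")
                .setTs(1700000000L)
                .build()
                .toByteArray();

        // A new reader parses old bytes: the added field falls back to its default ("").
        EventV2 upgraded = EventV2.parseFrom(oldBytes); // hypothetical generated class
        System.out.println(upgraded.getRegion().isEmpty()); // prints true

        // Conversely, an old reader parsing new bytes keeps the unrecognized region
        // field in its unknown-field set, so round-tripping does not lose data.
    }
}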

This article walks through how Canva structures, collects, and distributes billions of events daily without drowning in tech debt or runaway cloud bills.

This article was originally published on ByteByteGo.
