Building Real-time Data Pipelines at Scale: Lessons from Processing 10 Billion Records Daily
January 30, 2026 · 6 min read · By Mayank

After spending over 8 years building data infrastructure at scale, including leading the migration of legacy data warehouses to modern cloud architectures at Citi, I've learned that building real-time data pipelines isn't just about choosing the right tools—it's about understanding the trade-offs and designing for failure.

In this post, I'll share practical lessons from building systems that process over 10 billion records daily, including architecture patterns, common pitfalls, and optimization techniques that actually work in production.

The Challenge: Why Real-time Data Processing is Hard

When we talk about "real-time" data processing, we're really talking about three different things:

  1. True real-time (sub-second latency): Event streaming with immediate processing
  2. Near real-time (seconds to minutes): Micro-batch processing
  3. Operational real-time (minutes to hours): Frequent batch updates

Most organizations don't need true real-time—and building for it when you don't need it creates unnecessary complexity and cost.

Info: Key Insight: Before designing your pipeline, ask yourself: "What's the actual business requirement for data freshness?" The answer will dramatically simplify your architecture.

Architecture Pattern: The Lambda Architecture Evolution

The classic Lambda architecture (batch + speed layers) has evolved. Here's what a modern real-time pipeline looks like:

```text
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Sources   │────▶│ Apache Kafka │────▶│    Spark    │
│ (Apps, DBs) │     │ (Ingestion)  │     │  Streaming  │
└─────────────┘     └──────────────┘     └─────┬───────┘
                                               │
                           ┌───────────────────┴───────────────────┐
                           │                                       │
                           ▼                                       ▼
                   ┌──────────────┐                        ┌──────────────┐
                   │ Hot Storage  │                        │ Cold Storage │
                   │   (Redis/    │                        │  (S3/Delta   │
                   │    Druid)    │                        │    Lake)     │
                   └──────────────┘                        └──────────────┘
```

Key Components Explained

1. Apache Kafka as the Central Nervous System

Kafka isn't just a message queue—it's an immutable log that becomes your source of truth. This pattern enables:

  • Replay of events when bugs are discovered
  • Multiple consumers processing the same data differently
  • Decoupling of producers and consumers
```python
# Example: Kafka producer with idempotent writes
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka-1:9092', 'kafka-2:9092'],
    enable_idempotence=True,  # Exactly-once semantics
    acks='all',               # Wait for all replicas
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def publish_event(topic: str, event: dict, key: str):
    future = producer.send(
        topic,
        key=key.encode('utf-8'),
        value=event
    )
    # Block until sent (for critical events)
    future.get(timeout=10)
```

2. Spark Structured Streaming for Processing

Spark Structured Streaming provides exactly-once guarantees with checkpointing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("RealTimePipeline") \
    .config("spark.sql.streaming.checkpointLocation", "s3://bucket/checkpoints") \
    .getOrCreate()

# Define schema for incoming events
event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("properties", MapType(StringType(), StringType()), True)
])

# Read from Kafka
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user_events") \
    .option("startingOffsets", "latest") \
    .load()

# Parse and transform
events = df \
    .select(from_json(col("value").cast("string"), event_schema).alias("data")) \
    .select("data.*") \
    .withWatermark("timestamp", "10 minutes")  # Handle late data

# Aggregate by window
aggregated = events \
    .groupBy(
        window("timestamp", "5 minutes", "1 minute"),
        "event_type"
    ) \
    .agg(
        count("*").alias("event_count"),
        # Exact countDistinct is not supported on streaming aggregations;
        # use the approximate variant instead
        approx_count_distinct("user_id").alias("unique_users")
    )

# Write to sink
query = aggregated \
    .writeStream \
    .outputMode("update") \
    .format("delta") \
    .option("checkpointLocation", "s3://bucket/checkpoints/aggregates") \
    .start("s3://bucket/aggregates")
```

Production Lessons: What They Don't Tell You

1. Backpressure Will Happen: Plan for It

When downstream systems can't keep up, you have three options:

  • Drop messages (not usually acceptable)
  • Buffer indefinitely (you'll run out of memory)
  • Apply backpressure (pause producers)

Warning: Production Tip: Always set explicit rate limits and monitor queue depths. A 10x traffic spike shouldn't bring down your entire pipeline.

```python
# Rate limiting for the legacy DStream (spark.streaming.*) API
spark.conf.set("spark.streaming.kafka.maxRatePerPartition", "10000")
spark.conf.set("spark.streaming.backpressure.enabled", "true")

# Structured Streaming equivalent: cap records read per micro-batch
# by setting maxOffsetsPerTrigger on the Kafka source, e.g.
# .option("maxOffsetsPerTrigger", "1000000")
```
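Cluster configs throttle the consumer side; the producer side can enforce its own limit too. A minimal token-bucket sketch (the class name and rates are illustrative, not part of any Kafka API):

```python
import time

class TokenBucket:
    """Producer-side rate limiter: refuse sends when the bucket is empty."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.clock = clock            # injectable for testing
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Wrap `producer.send` in `try_acquire` and decide explicitly what happens on `False`: block, shed load, or spill to a dead-letter topic, rather than letting an unbounded buffer grow.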

2. Schema Evolution is Inevitable

Your data schema will change. Plan for it from day one:

  • Use schema registry (Confluent or AWS Glue)
  • Design schemas that are backward AND forward compatible
  • Never remove required fields—deprecate them first
Example: Avro schema with backward compatibility. Avro schemas are JSON, which has no comments, so the note about the new field lives in the field's `doc` attribute:

```json
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "session_id", "type": ["null", "string"], "default": null,
     "doc": "New optional field - backward compatible"}
  ]
}
```
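These compatibility rules can be spot-checked in CI before a schema ever reaches the registry. A deliberately simplified sketch (the function name is ours; it checks only field presence and defaults, ignoring Avro's type-promotion rules):

```python
def is_backward_compatible(old_fields: list, new_fields: list) -> bool:
    """Check that a reader using the new schema can still decode old records.

    Simplified Avro-style rule: any field added in the new schema must carry
    a default, and any field dropped from the old schema must have had one.
    """
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    added_ok = all("default" in f for name, f in new.items() if name not in old)
    removed_ok = all("default" in f for name, f in old.items() if name not in new)
    return added_ok and removed_ok
```

For example, adding `session_id` with `"default": null` passes, while adding it as a plain required `string` fails, which is exactly the mistake a registry in backward-compatibility mode would reject.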

3. Monitoring is Not Optional

You need visibility into:

  • Lag: How far behind is your consumer?
  • Throughput: Events processed per second
  • Error rate: Failed transformations/writes
  • Processing time: P50, P95, P99 latencies
```python
# Custom metrics with Prometheus
from prometheus_client import Counter, Histogram, Gauge

events_processed = Counter(
    'pipeline_events_processed_total',
    'Total events processed',
    ['event_type', 'status']
)

processing_time = Histogram(
    'pipeline_processing_seconds',
    'Time spent processing events',
    buckets=[.001, .005, .01, .05, .1, .5, 1, 5]
)

consumer_lag = Gauge(
    'pipeline_consumer_lag',
    'Kafka consumer lag',
    ['partition']
)
```
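Lag itself is just "log end offset minus committed offset" per partition. A helper like this can feed the gauge above (a sketch: offsets here are plain dicts rather than the objects a Kafka client returns):

```python
def compute_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: records written to the topic but not yet consumed.

    A partition with no committed offset counts as fully unread.
    """
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    }
```

In a periodic monitoring loop you would then call `consumer_lag.labels(partition=str(p)).set(lag)` for each partition, and alert when lag grows faster than throughput can drain it.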

Optimization Techniques That Actually Work

1. Partition Strategy Matters

Poor partitioning is the #1 cause of pipeline performance issues:

```python
# Bad: no key, so records are spread arbitrarily with no per-key ordering
producer.send(topic, value=event)

# Good: partition by business key for ordering guarantees
producer.send(topic, key=user_id.encode(), value=event)

# Better: custom partitioner for hot-key handling
HOT_KEYS = {b'celebrity_user_1'}   # keys known to dominate traffic
HOT_PARTITIONS = [0, 1, 2, 3]      # partitions reserved for hot keys

class SkewAwarePartitioner:
    def __call__(self, key, all_partitions, available_partitions):
        # Route hot keys to their dedicated partitions
        if key in HOT_KEYS:
            return HOT_PARTITIONS[hash(key) % len(HOT_PARTITIONS)]
        return all_partitions[hash(key) % len(all_partitions)]
```
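An alternative to dedicated partitions is key salting: spread a known hot key across a handful of partitions by rotating a suffix onto it. A sketch (the helper and hot-key set are illustrative; `md5` is used only for stable, non-randomized hashing):

```python
import hashlib

def partition_for(key: str, num_partitions: int,
                  hot_keys: frozenset = frozenset(),
                  salt_buckets: int = 4, seq: int = 0) -> int:
    """Stable hash routing for normal keys; salted routing for hot keys.

    Salting spreads a hot key's records over up to `salt_buckets`
    partitions, trading per-key ordering for balanced load.
    """
    if key in hot_keys:
        key = f"{key}#{seq % salt_buckets}"  # rotate the salt per message
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

The trade-off is explicit: a salted key loses strict per-key ordering, so only apply it to keys where downstream consumers can tolerate that.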

2. Batch Wisely

Micro-batching reduces overhead but increases latency. Find your sweet spot:

```python
# Spark trigger options
.trigger(processingTime='10 seconds')  # Fixed-interval micro-batches
.trigger(once=True)                    # Process all available data, then stop
.trigger(continuous='1 second')        # Experimental low-latency mode
```
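A back-of-envelope model helps locate the sweet spot: sustained throughput is the batch size divided by per-record cost plus fixed per-batch overhead. The numbers below are illustrative, not measurements from our pipeline:

```python
def effective_throughput(records_per_batch: int,
                         per_record_cost_s: float,
                         batch_overhead_s: float) -> float:
    """Sustained records/sec once fixed per-batch overhead (scheduling,
    checkpointing, offset commits) is amortized across the batch."""
    batch_duration = records_per_batch * per_record_cost_s + batch_overhead_s
    return records_per_batch / batch_duration
```

With 1 ms per record and 1 s of fixed overhead, a 10-record batch sustains under 10 records/sec while a 100k-record batch sustains nearly 1,000 — but end-to-end latency grows with the batch, which is exactly the tension to tune against your freshness requirement.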

3. Use Delta Lake for ACID Guarantees

Delta Lake gives you:

  • ACID transactions on data lakes
  • Schema enforcement and evolution
  • Time travel for debugging
  • Efficient upserts (MERGE)
```python
# Upsert pattern with Delta Lake
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://bucket/users")

# `updates` is a DataFrame of incoming user events keyed by user_id
deltaTable.alias("target").merge(
    updates.alias("source"),
    "target.user_id = source.user_id"
).whenMatchedUpdate(set={
    "last_active": "source.timestamp",
    "event_count": "target.event_count + 1"
}).whenNotMatchedInsert(values={
    "user_id": "source.user_id",
    "first_seen": "source.timestamp",
    "last_active": "source.timestamp",
    "event_count": "1"
}).execute()
```

Cost Optimization: Because Budget Matters#

At scale, small inefficiencies become expensive. Here's what worked for us:

  1. Right-size your Spark executors: More small executors often outperform fewer large ones
  2. Use spot instances: 70-90% cost savings for stateless processing
  3. Implement data lifecycle: Move cold data to cheaper storage tiers
  4. Compress aggressively: Snappy for speed, Zstd for size
```python
# Cost-optimized Spark configuration
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
```
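The speed-versus-size trade-off is easy to feel locally. Snappy and Zstd bindings are third-party packages, so this sketch stands in with stdlib `zlib` levels (level 1 playing the fast-codec role, level 9 the dense-codec role):

```python
import zlib

def compression_report(data: bytes) -> dict:
    """Compare a fast, light compression setting against a slow, dense one."""
    fast = zlib.compress(data, level=1)   # prioritize speed
    dense = zlib.compress(data, level=9)  # prioritize ratio
    # Both must round-trip losslessly
    assert zlib.decompress(fast) == data and zlib.decompress(dense) == data
    return {
        "raw_bytes": len(data),
        "fast_bytes": len(fast),
        "dense_bytes": len(dense),
    }
```

Run it over a sample of your real event payloads: repetitive JSON compresses dramatically at any level, and the measured gap between the two settings tells you whether paying CPU for the denser codec is worth the storage and network savings.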

Conclusion: Start Simple, Iterate Fast

The best data pipeline is one that:

  1. Solves the actual business problem (not a hypothetical future problem)
  2. Is observable (you know when it breaks)
  3. Is evolvable (you can change it without fear)

Don't start with the most complex architecture. Start with the simplest thing that works, measure everything, and optimize based on real data.


Have questions about building data pipelines? Feel free to reach out or connect with me on LinkedIn.

Written by Mayank Gulaty

Senior Data Engineer with 8+ years of experience at Citi and Nagarro, specializing in building petabyte-scale data pipelines and cloud-native architectures. I combine deep data engineering expertise with full-stack development skills to create end-to-end solutions.