Here's a truth that marketing materials won't tell you: most production data platforms use both batch and streaming. The question isn't "which is better?" but "which is right for this specific use case?"
Consider two scenarios from the same company: the fraud team needs to flag a suspicious card transaction within milliseconds, while the finance team needs complete, reconciled daily revenue numbers for reporting.
Same company, same data, different paradigms. Both essential. This guide examines both approaches honestly: their genuine strengths, hidden complexities, and what you should understand before choosing.
The best architecture is the one your team can operate reliably. Technical elegance means nothing if you can't debug it at 3 AM.
Two Mental Models
Batch and streaming aren't just different technologies. They represent fundamentally different philosophies about data. Understanding these mental models helps you choose wisely.
Batch Philosophy
"Collect everything, analyze thoroughly." Optimized for completeness and correctness. Process data in large chunks on a schedule. Better late than wrong.
Streaming Philosophy
"Process as it happens." Optimized for latency and reactivity. Data is a continuous flow, not discrete batches. Approximate now beats perfect later.
Batch: The Proven Workhorse
Batch processing has powered data warehousing for decades. It's the foundation of SQL, ETL, and business intelligence. The model is simple: accumulate data, process it on a schedule, serve the results.
Complete Picture
All data is present before processing begins. No worrying about late arrivals or out-of-order events. Joins across the full dataset are straightforward.
Simple Retry Logic
Job failed? Re-run it. Idempotent by design. No state to manage between runs. Full recompute is always an option.
Mature Ecosystem
Decades of accumulated tooling, from SQL and Spark to Airflow and dbt. Every engineer knows SQL. Debugging is well understood. Hiring is easier.
Predictable Costs
Efficient resource usage with on-demand compute. No always-on infrastructure for sporadic workloads.
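To make the "simple retry logic" point above concrete, here is a minimal PySpark sketch of an idempotent daily aggregation. It assumes an `orders` source table and an output location partitioned by date (both placeholders); re-running the job after a failure simply overwrites that day's partition with the same result.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

# Overwrite only the partitions this run produces, not the whole table,
# so a re-run after a failure replaces that day's data and nothing else.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

daily = (
    spark.table("orders")  # assumed source table
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date", "category")
    .agg(F.sum("amount").alias("daily_revenue"),
         F.count(F.lit(1)).alias("order_count"))
)

# Same input, same output, every time: the write is safe to repeat.
daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://warehouse/daily_revenue")  # placeholder path
```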
Streaming: The Real-Time Frontier
Streaming treats data as an infinite sequence of events. Processing happens continuously, not on a schedule. The paradigm shift enables new use cases but demands new thinking.
Millisecond Latency
React to events as they happen. Fraud detection, real-time pricing, live dashboards. No waiting for the next batch window.
Event Sourcing
The event log is the source of truth. Tables and caches are derived views. Replay from the log to rebuild any state.
Unified Processing
One codebase for real-time and batch (via replay). Kappa architecture eliminates the "two systems" problem.
Decoupled Systems
Producers and consumers evolve independently. Add new consumers without touching producers. Event-driven microservices.
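As a sketch of replay and decoupling in practice: a brand-new consumer group can start from the earliest offset and rebuild its own derived view without any change to producers. The broker address, topic name, and JSON event shape below are assumptions for illustration, using the confluent-kafka client.

```python
import json
from collections import defaultdict

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker
    "group.id": "revenue-view-rebuild",     # a new group, unknown to producers
    "auto.offset.reset": "earliest",        # replay the log from the beginning
})
consumer.subscribe(["orders"])              # assumed topic of JSON order events

revenue_by_category = defaultdict(float)    # the derived view being rebuilt

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        revenue_by_category[order["category"]] += order["amount"]
finally:
    consumer.close()
```

The same pattern backs recovery: if a derived store is corrupted, throw it away and replay the log.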
Neither is wrong. Batch optimizes for throughput and simplicity. Streaming optimizes for latency and reactivity. They serve different needs, often within the same organization.
Architecture Patterns
Let's look at how modern data platforms implement each paradigm. These aren't theoretical. They're patterns running in production at scale.
Modern Batch Architecture
The modern batch stack has evolved significantly from traditional ETL. dbt brought software engineering practices to data transformation. Open table formats like Apache Iceberg enable flexible, scalable data lakehouse architectures.
Key Components Compared
| Component | Batch Stack | Streaming Stack |
|---|---|---|
| Ingestion | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers |
| Storage | Apache Iceberg tables (queried via Trino), ClickHouse | Kafka (event log), ClickHouse (OLAP) |
| Transformation | dbt, Spark, SQL | Flink, ksqlDB, Spark Structured Streaming |
| Orchestration | Airflow, Dagster, Prefect | Always-on (Kubernetes, managed Flink) |
| Serving | Direct warehouse queries, caching | Pre-computed views in Redis/ClickHouse |
Hybrid is often right. Stream for operational use cases (fraud, alerts, live dashboards). Batch for analytical use cases (reporting, ML training, ad-hoc queries). Don't force one paradigm where the other excels.
Best-in-Class Technology
Both paradigms have mature, battle-tested tools. Here's what's powering production systems today.
dbt + Trino/Iceberg
Modern batch transformation stack
- SQL-first with software engineering practices
- Built-in testing and documentation
- Version controlled transformations
- Open table format with query engine flexibility
- Massive ecosystem of packages
Apache Flink
Stateful stream processing engine
- True event-time processing with watermarks
- Exactly-once semantics for stateful operations
- Millisecond latency at scale
- SQL and DataStream APIs
- Savepoints for zero-downtime upgrades
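A minimal PyFlink sketch of the event-time and watermark points above; the Kafka topic, schema, and connector options are placeholders, and the Kafka SQL connector is assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare event time plus a watermark tolerating 5 seconds of out-of-order data.
t_env.execute_sql("""
    CREATE TABLE orders (
        category   STRING,
        amount     DECIMAL(10, 2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Windows close when the watermark passes: results depend on event time, not arrival time.
t_env.execute_sql("""
    SELECT window_start, category, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR))
    GROUP BY window_start, window_end, category
""").print()
```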
Apache Spark
Distributed batch processing
- Process petabytes of data
- Python, Scala, SQL, R APIs
- ML pipelines with MLlib
- Structured Streaming for hybrid
- Massive community and support
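As a sketch of the "Structured Streaming for hybrid" point: the same DataFrame logic that runs in batch can be pointed at a Kafka stream with few changes. The broker, topic, and JSON schema are placeholders, and the Spark–Kafka integration package is assumed to be available.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

# Read orders as an unbounded stream instead of a static table.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "orders")                         # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "category STRING, amount DOUBLE, order_time TIMESTAMP").alias("o"))
    .select("o.*")
)

# The aggregation mirrors the batch job, but results are emitted continuously.
revenue = (
    orders
    .withWatermark("order_time", "10 minutes")
    .groupBy(F.window("order_time", "1 hour"), "category")
    .agg(F.sum("amount").alias("revenue"))
)

revenue.writeStream.outputMode("update").format("console").start().awaitTermination()
```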
Kafka / Redpanda
Distributed streaming platform
- Durable, ordered event log
- Horizontal scaling to millions of events/sec
- Replay capability for recovery
- Tiered storage for cost-effective retention
- Schema registry for data contracts
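A minimal producer sketch for the durable, ordered log: keying by user_id keeps each user's events in order within a partition, and the delivery callback confirms the broker accepted the write. Broker and topic names are placeholders; confluent-kafka again.

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def on_delivery(err, msg):
    # The delivery report says whether the broker durably accepted the event.
    if err is not None:
        print(f"delivery failed: {err}")

event = {"user_id": "u-123", "category": "books", "amount": 42.50}
producer.produce(
    "orders",                      # placeholder topic
    key=event["user_id"],          # same key -> same partition -> per-user ordering
    value=json.dumps(event),
    on_delivery=on_delivery,
)
producer.flush()                   # block until outstanding deliveries are reported
```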
Code Examples
Same problem, different solutions. Let's see how each paradigm handles real-world use cases.
Example 1: Daily Revenue by Category
A straightforward aggregation. This is where batch typically shines.
-- Batch: Simple, readable, testable
SELECT
DATE(order_timestamp) AS order_date,
category,
SUM(amount) AS daily_revenue,
COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2
-- Streaming: Continuous, requires windowing
SELECT
TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
category,
SUM(amount) AS daily_revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY
TUMBLE(order_time, INTERVAL '1' DAY),
category
Verdict: Batch wins here. The dbt version is simpler, testable with dbt tests, and historical recompute is trivial. Unless you need real-time revenue updates (rare), use batch.
Example 2: Fraud Detection (Velocity Check)
Detect when a user makes transactions from geographically distant locations within an impossible timeframe.
# Batch: Run hourly - detects fraud after the fact
import math
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@F.udf(returnType="double")
def haversine_udf(lat1, lon1, lat2, lon2):
    # Great-circle distance in km; the first event per user has no previous point
    if None in (lat1, lon1, lat2, lon2):
        return None
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

user_window = Window.partitionBy("user_id").orderBy("timestamp")
flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    # Cast timestamps to epoch seconds so the difference comes out in minutes
    .withColumn("time_diff_min", (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60)
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)
-- Streaming: Detect in real-time, block before damage
-- (ST_Point / ST_Distance assume a geospatial extension such as Apache Sedona)
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

-- Detect impossible velocity: >500 km travelled in <30 minutes
INSERT INTO fraud_alerts
SELECT p1.user_id,
       p1.txn_id AS first_txn,
       p2.txn_id AS second_txn,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS distance_km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE  -- interval join bounded in event time
  AND ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) > 500000;  -- metres
Verdict: Streaming wins here. The batch version detects fraud hours after the damage is done. The streaming version flags the transaction within milliseconds, early enough for a downstream service to block it before it completes. For fraud detection, that gap is the difference between preventing a loss and writing one off.
Monitoring & Visibility
How do you know your pipeline is healthy? The monitoring story is very different between paradigms.
Batch Monitoring
Binary outcomes: Jobs succeed or fail. dbt tests catch data quality issues. Airflow shows DAG execution history. Clear audit trail per run.
Streaming Monitoring
Continuous metrics: Consumer lag, end-to-end latency, checkpoint success rate. Requires always-on monitoring. Distributed debugging is harder.
Key Metrics Compared
| Metric | Batch | Streaming |
|---|---|---|
| Data freshness | "Last run: 2 hours ago" | "Consumer lag: 50ms" |
| Pipeline health | Job success/failure rate | Checkpoint success, backpressure |
| Data quality | dbt tests, Great Expectations | Schema registry, runtime validation |
| Lineage | dbt docs, DataHub (mature) | Event correlation IDs (manual) |
| Debugging | Re-run failed job, inspect logs | Distributed tracing, state inspection |
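To make the consumer-lag metric concrete, here is a sketch that computes offset lag (messages behind) per partition for one consumer group; time-based lag would additionally need event timestamps. The broker, group, topic, and partition count are assumptions.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "fraud-detector",           # the group whose lag we are checking
    "enable.auto.commit": False,
})

# Lag per partition = latest offset in the log minus the group's committed offset.
partitions = [TopicPartition("payments", p) for p in range(3)]  # assumed 3 partitions
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet -> treat as earliest
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()
```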
Traceability gap: Batch has mature lineage tools (dbt docs generates automatic lineage). Streaming requires manual correlation ID propagation. "Which event caused this output?" is harder to answer in streaming systems.
Operational Complexity
The 3 AM pager test: which system is easier to debug and fix when something goes wrong?
Batch Operations
Simple retry logic
Job failed? Re-run it. Full recompute is always an option. No state to corrupt.
Mature scheduling
Airflow, Dagster, Prefect are well-understood. Dependency management is explicit.
Delayed detection
Issues discovered when the next job runs. Hours or days after the problem occurred.
DAG dependency hell
Complex pipelines create cascading failures. One slow job blocks everything downstream.
Streaming Operations
Immediate detection
Consumer lag spikes instantly visible. No waiting for the next scheduled run.
Always-on processing
No scheduling to manage. No batch windows. Continuous flow of data.
State management complexity
Checkpoints, savepoints, state backends. Corrupted state may require full rebuild from Kafka.
Upgrade complexity
Stateful job upgrades require savepoints. Schema changes need careful coordination.
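For the state-management point above, a minimal PyFlink sketch of enabling checkpointing; the intervals are illustrative, and savepoints for planned upgrades are triggered separately (via the Flink CLI or REST API) rather than from job code.

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot all operator state every 60s; after a crash the job restarts
# from the last completed checkpoint instead of reprocessing everything.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)
```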
Team readiness matters. If your team knows Airflow and SQL, they can operate dbt pipelines tomorrow. Flink requires learning watermarks, event time, checkpointing, and distributed systems debugging. Budget training time.
Cost Economics
Money talks. Here's an honest look at when each paradigm is more cost-effective.
When Batch is Cheaper
Infrequent processing (daily/weekly). Serverless warehouses (pay-per-query). Small data volumes (<1TB/day). Ad-hoc analytics workloads. No always-on infrastructure needed.
When Streaming is Cheaper
High-volume continuous processing. Replacing multiple redundant batch jobs. Eliminating intermediate storage. Real-time requirements that would require over-provisioning in batch.
Migration costs are real. During transition, you run both stacks. Budget 12-18 months of parallel operation. Factor in team training, debugging unfamiliar failure modes, and the inevitable "oh, we forgot that pipeline depended on this" discoveries.
Decision Framework
Use this framework to choose the right paradigm for each use case. Don't force one approach everywhere.
Clear Wins for Batch
Historical Analytics
Ad-hoc queries over years of data. Complex joins across large datasets. Columnar storage shines here.
ML Training
Feature engineering over historical data. Model training doesn't benefit from real-time. Batch is simpler.
Compliance Reporting
Audit reports, regulatory filings. Point-in-time snapshots. Full reproducibility required.
Clear Wins for Streaming
Fraud Detection
Every millisecond counts. Block fraudulent transactions before they complete. Real-time is essential.
Operational Monitoring
System health, alerting, live dashboards. Stale data is useless. Continuous processing required.
Event-Driven Systems
Microservices choreography. Real-time notifications. User-facing features that respond instantly.
Start with batch unless you have a specific latency requirement that batch cannot meet. It's easier to add streaming later than to simplify an over-engineered streaming system.
Risks & Failure Modes
What can go wrong? Every architecture has failure modes. Understanding them helps you design for resilience.
Batch Risks
Stale data: Decisions made on outdated information. Delayed detection: Problems discovered hours later. Cascade failures: One slow job blocks everything. Resource contention: Batch windows competing for compute.
Streaming Risks
Silent data loss: Misconfigured consumers drop events. State corruption: Requires full rebuild from Kafka. Backpressure cascade: Slow consumer affects entire pipeline. Schema breaks: Producer changes break consumers.
Mitigation Strategies
| Risk | Batch Mitigation | Streaming Mitigation |
|---|---|---|
| Data quality issues | dbt tests, Great Expectations, post-run validation | Schema registry, runtime validation, dead letter queues |
| Pipeline failures | Retry policies, alerting on failure, SLAs with buffer | Checkpointing, automatic restart, consumer lag alerting |
| Data loss | Idempotent jobs, backup before overwrite | Kafka retention, exactly-once semantics, idempotent sinks |
| Schema evolution | dbt schema tests, migration scripts | Schema registry compatibility rules, versioned topics |
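As a sketch of the dead-letter-queue mitigation from the table: records that fail parsing or validation are routed to a separate topic with the error attached, instead of being dropped silently or blocking the pipeline. Topic names, the validation rule, and the broker are placeholders.

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "payments-validator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["payments"])            # placeholder input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

def is_valid(event: dict) -> bool:
    # Placeholder rule: required fields present with sane values.
    return bool(event.get("user_id")) and event.get("amount", 0) > 0

while True:  # runs until the process is stopped
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if not is_valid(event):
            raise ValueError("failed validation")
        # ... normal processing of the valid event goes here ...
    except (ValueError, json.JSONDecodeError) as exc:
        # Route the bad record to the DLQ with the reason; keep the stream moving.
        producer.produce("payments.dlq", value=msg.value(),
                         headers=[("error", str(exc).encode())])
    consumer.commit(message=msg)
```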
The Bottom Line
Neither paradigm is universally superior. They optimize for different outcomes:
- Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
- Streaming optimizes for latency, reactivity, and continuous processing
Most modern data platforms use both. Stream for operational use cases where latency matters. Batch for analytical use cases where completeness matters. Don't let ideology drive architecture decisions.
Choose wisely, not ideologically. Start with the simpler approach. Add complexity only when requirements demand it. The best architecture is the one your team can operate reliably.
My recommendation: Start with batch. It's simpler to build, test, and operate. Add streaming pipelines for specific use cases where the latency requirement justifies the complexity. Measure everything. Let data guide your architecture evolution, not hype.