Architecture Trenches • 2025

Shift Left or Stay Right?

Streaming vs Batch Processing. Two paradigms, different tradeoffs. An honest guide to choosing the right approach for your data architecture.

Boyan Balev, Software Engineer
15 min read · Data Architecture

Here's a truth that marketing materials won't tell you: most production data platforms use both batch and streaming. The question isn't "which is better?" but "which is right for this specific use case?"

Consider two scenarios from the same company:

Streaming Wins: Fraud Detection
1. Fraudulent pattern emerges at 2:47 AM
2. Flink detects within 50ms
3. Transaction blocked instantly
Result: $2.3M fraud prevented

Batch Wins: ML Model Training
1. Collect 6 months of transaction data
2. Spark processes 500TB overnight
3. Train improved fraud model
Result: 12% accuracy improvement

Same company, same data, different paradigms. Both essential. This guide examines both approaches honestly: their genuine strengths, hidden complexities, and what you should understand before choosing.

The best architecture is the one your team can operate reliably. Technical elegance means nothing if you can't debug it at 3 AM.

Two Mental Models

Batch and streaming aren't just different technologies. They represent fundamentally different philosophies about data. Understanding these mental models helps you choose wisely.

Batch Philosophy

"Collect everything, analyze thoroughly." Optimized for completeness and correctness. Process data in large chunks on a schedule. Better late than wrong.

Streaming Philosophy

"Process as it happens." Optimized for latency and reactivity. Data is a continuous flow, not discrete batches. Approximate now beats perfect later.

Batch: The Proven Workhorse

Batch processing has powered data warehousing for decades. It's the foundation of SQL, ETL, and business intelligence. The model is simple: accumulate data, process it on a schedule, serve the results.

01 Complete Picture

All data is present before processing begins. No worrying about late arrivals or out-of-order events. Joins across the full dataset are straightforward.

02 Simple Retry Logic

Job failed? Re-run it. Idempotent by design. No state to manage between runs. Full recompute is always an option (a sketch follows below).

03 Mature Ecosystem

Decades of tooling: Airflow, dbt, Spark. Every engineer knows SQL. Debugging is well-understood. Hiring is easier.

04 Predictable Cost

Efficient resource usage with on-demand compute. No always-on infrastructure for sporadic workloads.
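
The "simple retry logic" point above hinges on idempotent writes: each run recomputes its slice of data and fully replaces the output it owns, so re-running a failed job cannot double-count. Here is a minimal PySpark sketch; the source table, output bucket, and partition layout are illustrative assumptions, not a prescribed design.

rebuild_partition.py PySpark
# Recompute one day from scratch and overwrite only that day's output.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
run_date = "2025-01-15"  # injected by the scheduler in practice

daily = (
    spark.table("raw.orders")                            # assumed source table
    .filter(F.to_date("order_timestamp") == run_date)
    .groupBy("category")
    .agg(F.sum("amount").alias("daily_revenue"),
         F.count("*").alias("order_count"))
)

# Writing straight into the date-scoped partition path makes the job
# idempotent: a retry simply rewrites the same directory.
out_path = f"s3://lake/analytics/daily_revenue/order_date={run_date}"  # assumed bucket
daily.write.mode("overwrite").parquet(out_path)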

Streaming: The Real-Time Frontier

Streaming treats data as an infinite sequence of events. Processing happens continuously, not on a schedule. The paradigm shift enables new use cases but demands new thinking.

01 Millisecond Latency

React to events as they happen. Fraud detection, real-time pricing, live dashboards. No waiting for the next batch window.

02 Event Sourcing

The event log is the source of truth. Tables and caches are derived views. Replay from the log to rebuild any state (a replay sketch follows below).

03 Unified Processing

One codebase for real-time and batch (via replay). Kappa architecture eliminates the "two systems" problem.

04 Decoupled Systems

Producers and consumers evolve independently. Add new consumers without touching producers. Event-driven microservices.
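
The event-sourcing idea above is easiest to see in code: a derived view is just a fold over the log, so it can be rebuilt at any time by replaying from the earliest offset. A sketch using confluent-kafka; the topic name, JSON event shape, and the revenue-per-category view are assumptions for illustration.

replay_view.py Python
# Rebuild a derived view (revenue per category) by replaying the event log.
import json
from collections import defaultdict
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-revenue-view",   # fresh group ID, so we start from the beginning
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])

revenue_by_category = defaultdict(float)   # the state being rebuilt

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            break                          # caught up; good enough for a sketch
        if msg.error():
            continue
        event = json.loads(msg.value())
        revenue_by_category[event["category"]] += event["amount"]
finally:
    consumer.close()

print(dict(revenue_by_category))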

Neither is wrong. Batch optimizes for throughput and simplicity. Streaming optimizes for latency and reactivity. They serve different needs, often within the same organization.

Architecture Patterns

Let's look at how modern data platforms implement each paradigm. These aren't theoretical. They're patterns running in production at scale.

Modern Batch Architecture

The modern batch stack has evolved significantly from traditional ETL. dbt brought software engineering practices to data transformation. Open table formats like Apache Iceberg enable flexible, scalable data lakehouse architectures.

Batch: Modern ELT Stack
1. Sources → Airbyte (Extract & Load)
2. Apache Iceberg + Trino (Storage)
3. dbt (Transform)
4. BI Tools / Reverse ETL
Hourly/daily freshness, simple operations

Streaming: Event-Driven Stack
1. Sources → Kafka/Redpanda (Capture)
2. Apache Flink (Transform)
3. ClickHouse/Redis (Serve)
Sub-second latency, continuous processing
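
The batch stack above is driven entirely by a scheduler. A minimal Airflow sketch of that daily cadence, assuming Airflow 2.4+; the shell command and dbt selector are placeholders for whatever your Airbyte and dbt setup actually invokes.

elt_daily.py Airflow
# Daily ELT: extract/load, then transform, then test the results.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",        # the whole paradigm hinges on this line
    catchup=False,
) as dag:
    extract_load = BashOperator(task_id="airbyte_sync",
                                bash_command="./run_airbyte_sync.sh")   # placeholder script
    transform = BashOperator(task_id="dbt_run",
                             bash_command="dbt run --select daily_revenue+")
    test = BashOperator(task_id="dbt_test",
                        bash_command="dbt test --select daily_revenue+")

    extract_load >> transform >> test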

Key Components Compared

Component      | Batch Stack                          | Streaming Stack
Ingestion      | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers
Storage        | Apache Iceberg, ClickHouse, Trino    | Kafka (event log), ClickHouse (OLAP)
Transformation | dbt, Spark, SQL                      | Flink, ksqlDB, Spark Structured Streaming
Orchestration  | Airflow, Dagster, Prefect            | Always-on (Kubernetes, managed Flink)
Serving        | Direct warehouse queries, caching    | Pre-computed views in Redis/ClickHouse

Hybrid is often right. Stream for operational use cases (fraud, alerts, live dashboards). Batch for analytical use cases (reporting, ML training, ad-hoc queries). Don't force one paradigm where the other excels.

Best-in-Class Technology

Both paradigms have mature, battle-tested tools. Here's what's powering production systems today.

dbt + Trino/Iceberg

Modern batch transformation stack

  • SQL-first with software engineering practices
  • Built-in testing and documentation
  • Version controlled transformations
  • Open table format with query engine flexibility
  • Massive ecosystem of packages

Apache Flink

Stateful stream processing engine

  • True event-time processing with watermarks
  • Exactly-once semantics for stateful operations
  • Millisecond latency at scale
  • SQL and DataStream APIs
  • Savepoints for zero-downtime upgrades

Apache Spark

Distributed batch processing

  • Process petabytes of data
  • Python, Scala, SQL, R APIs
  • ML pipelines with MLlib
  • Structured Streaming for hybrid
  • Massive community and support

Kafka / Redpanda

Distributed streaming platform

  • Durable, ordered event log
  • Horizontal scaling to millions of events/sec
  • Replay capability for recovery
  • Tiered storage for cost-effective retention
  • Schema registry for data contracts

Code Examples

Same problem, different solutions. Let's see how each paradigm handles real-world use cases.

Example 1: Daily Revenue by Category

A straightforward aggregation. This is where batch typically shines.

models/daily_revenue.sql dbt
-- Batch: Simple, readable, testable
SELECT
    DATE(order_timestamp) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2
streaming_revenue.sql Flink SQL
-- Streaming: Continuous, requires windowing
SELECT
    TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM orders
GROUP BY
    TUMBLE(order_time, INTERVAL '1' DAY),
    category

Verdict: Batch wins here. The dbt version is simpler, testable with dbt tests, and historical recompute is trivial. Unless you need real-time revenue updates (rare), use batch.

Example 2: Fraud Detection (Velocity Check)

Detect when a user makes transactions from geographically distant locations within an impossible timeframe.

fraud_detection.py PySpark
# Batch: Run hourly - detects fraud after the fact
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Haversine distance in km between two lat/lon points, as a Python UDF
@F.udf("double")
def haversine_udf(lat1, lon1, lat2, lon2):
    from math import radians, sin, cos, asin, sqrt
    if None in (lat1, lon1, lat2, lon2):
        return None
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

user_window = Window.partitionBy("user_id").orderBy("timestamp")

flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    # Timestamps must be cast to seconds before subtracting
    .withColumn("time_diff_min",
                (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60)
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)
velocity_fraud.sql Flink SQL
-- Streaming: Detect in real-time, block before damage
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

-- Detect impossible velocity: >500 km in <30 min.
-- ST_Point / ST_Distance are not built into Flink SQL; this assumes a
-- registered geospatial UDF that returns metres.
INSERT INTO fraud_alerts
SELECT p1.user_id,
       p1.txn_id AS first_txn,
       p2.txn_id AS second_txn,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE
  AND ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) > 500000;

Verdict: Streaming wins here. The batch version detects fraud hours after it happened. The streaming version blocks the transaction in milliseconds. For fraud detection, those milliseconds mean millions in prevented losses.

Monitoring & Visibility

How do you know your pipeline is healthy? The monitoring story is very different between paradigms.

Batch Monitoring

Binary outcomes: Jobs succeed or fail. dbt tests catch data quality issues. Airflow shows DAG execution history. Clear audit trail per run.

Streaming Monitoring

Continuous metrics: Consumer lag, end-to-end latency, checkpoint success rate. Requires always-on monitoring. Distributed debugging is harder.
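
Consumer lag is the single most useful streaming health signal: the log end offset minus the group's committed offset, per partition. A minimal check with confluent-kafka; the broker address, group ID, and topic are assumptions.

consumer_lag.py Python
# Report consumer lag per partition: log end offset minus committed offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-detector",
    "enable.auto.commit": False,
})

topic = "payments"
partitions = consumer.list_topics(topic).topics[topic].partitions

for p in partitions:
    tp = TopicPartition(topic, p)
    low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0].offset
    lag = high - committed if committed >= 0 else high - low   # no commit yet: whole backlog
    print(f"partition {p}: lag={lag}")

consumer.close()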

Key Metrics Compared

Metric          | Batch                           | Streaming
Data freshness  | "Last run: 2 hours ago"         | "Consumer lag: 50ms"
Pipeline health | Job success/failure rate        | Checkpoint success, backpressure
Data quality    | dbt tests, Great Expectations   | Schema registry, runtime validation
Lineage         | dbt docs, DataHub (mature)      | Event correlation IDs (manual)
Debugging       | Re-run failed job, inspect logs | Distributed tracing, state inspection

Traceability gap: Batch has mature lineage tools (dbt docs generates automatic lineage). Streaming requires manual correlation ID propagation. "Which event caused this output?" is harder to answer in streaming systems.
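
One way to close part of that gap is to attach a correlation ID as a Kafka header where an event enters the system and copy it onto every derived event. A sketch with confluent-kafka; the topic names and header key are conventions you would choose yourself.

correlation_ids.py Python
# Propagate a correlation ID through Kafka headers so any output can be
# traced back to the event that caused it.
import uuid
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_order(order_bytes: bytes) -> None:
    # Generate the ID once, at the edge of the system.
    producer.produce(
        "orders",
        value=order_bytes,
        headers=[("correlation_id", uuid.uuid4().hex.encode())],
    )
    producer.flush()

def forward_alert(msg) -> None:
    # Downstream processors copy the header onto every derived event.
    headers = dict(msg.headers() or [])
    producer.produce(
        "fraud_alerts",
        value=msg.value(),
        headers=[("correlation_id", headers.get("correlation_id", b"unknown"))],
    )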

Operational Complexity

The 3 AM pager test: which system is easier to debug and fix when something goes wrong?

Batch Operations

Simple retry logic

Job failed? Re-run it. Full recompute is always an option. No state to corrupt.

Mature scheduling

Airflow, Dagster, Prefect are well-understood. Dependency management is explicit.

Delayed detection

Issues discovered when the next job runs. Hours or days after the problem occurred.

DAG dependency hell

Complex pipelines create cascading failures. One slow job blocks everything downstream.

Streaming Operations

Immediate detection

Consumer lag spikes instantly visible. No waiting for the next scheduled run.

Always-on processing

No scheduling to manage. No batch windows. Continuous flow of data.

State management complexity

Checkpoints, savepoints, state backends. Corrupted state may require a full rebuild from Kafka (a configuration sketch follows below).

Upgrade complexity

Stateful job upgrades require savepoints. Schema changes need careful coordination.
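
Much of that state-management burden is configured up front. A PyFlink sketch of the checkpointing knobs involved; the intervals are illustrative, and savepoints for upgrades are triggered from the Flink CLI rather than in application code.

checkpointing.py PyFlink
# Enable checkpointing so a restarted job resumes from consistent state.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 s

cfg = env.get_checkpoint_config()
cfg.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
cfg.set_min_pause_between_checkpoints(30_000)  # breathing room between checkpoints
cfg.set_checkpoint_timeout(120_000)            # fail loudly if a checkpoint stalls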

Team readiness matters. If your team knows Airflow and SQL, they can operate dbt pipelines tomorrow. Flink requires learning watermarks, event time, checkpointing, and distributed systems debugging. Budget training time.

Cost Economics

Money talks. Here's an honest look at when each paradigm is more cost-effective.

When Batch is Cheaper

Infrequent processing (daily/weekly). Serverless warehouses (pay-per-query). Small data volumes (<1TB/day). Ad-hoc analytics workloads. No always-on infrastructure needed.

When Streaming is Cheaper

High-volume continuous processing. Replacing multiple redundant batch jobs. Eliminating intermediate storage. Real-time requirements that would require over-provisioning in batch.

Migration costs are real. During transition, you run both stacks. Budget 12-18 months of parallel operation. Factor in team training, debugging unfamiliar failure modes, and the inevitable "oh, we forgot that pipeline depended on this" discoveries.

Decision Framework

Use this framework to choose the right paradigm for each use case. Don't force one approach everywhere.

Clear Wins for Batch

Historical Analytics

Ad-hoc queries over years of data. Complex joins across large datasets. Columnar storage shines here.

ML Training

Feature engineering over historical data. Model training doesn't benefit from real-time. Batch is simpler.

Compliance Reporting

Audit reports, regulatory filings. Point-in-time snapshots. Full reproducibility required.
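
Point-in-time reproducibility comes largely for free with an open table format. A PySpark sketch of Iceberg time travel; the table name and timestamp are assumptions, and it presumes a Spark session already configured with an Iceberg catalog.

point_in_time.py PySpark
# Re-run a report exactly as the data looked at a past moment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

snapshot = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1735689600000")   # epoch millis: 2025-01-01 00:00 UTC
    .load("lakehouse.finance.transactions")
)

snapshot.groupBy("category").sum("amount").show()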

Clear Wins for Streaming

Fraud Detection

Every millisecond counts. Block fraudulent transactions before they complete. Real-time is essential.

Operational Monitoring

System health, alerting, live dashboards. Stale data is useless. Continuous processing required.

Event-Driven Systems

Microservices choreography. Real-time notifications. User-facing features that respond instantly.

Start with batch unless you have a specific latency requirement that batch cannot meet. It's easier to add streaming later than to simplify an over-engineered streaming system.

Risks & Failure Modes

What can go wrong? Every architecture has failure modes. Understanding them helps you design for resilience.

Batch Risks

Stale data: Decisions made on outdated information. Delayed detection: Problems discovered hours later. Cascade failures: One slow job blocks everything. Resource contention: Batch windows competing for compute.

Streaming Risks

Silent data loss: Misconfigured consumers drop events. State corruption: Requires full rebuild from Kafka. Backpressure cascade: Slow consumer affects entire pipeline. Schema breaks: Producer changes break consumers.

Mitigation Strategies

Risk                | Batch Mitigation                                       | Streaming Mitigation
Data quality issues | dbt tests, Great Expectations, post-run validation     | Schema registry, runtime validation, dead letter queues
Pipeline failures   | Retry policies, alerting on failure, SLAs with buffer  | Checkpointing, automatic restart, consumer lag alerting
Data loss           | Idempotent jobs, backup before overwrite               | Kafka retention, exactly-once semantics, idempotent sinks
Schema evolution    | dbt schema tests, migration scripts                    | Schema registry compatibility rules, versioned topics
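
The "dead letter queues" entry in the streaming column deserves spelling out, since it is the difference between one bad record and a stalled pipeline. A sketch with confluent-kafka; the topic names, validation rule, and stub handler are assumptions.

dead_letter_queue.py Python
# Validate each event; park anything unparsable on a DLQ topic and keep going.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-processor",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

def process(event: dict) -> None:
    print("processing", event["txn_id"])   # stand-in for real business logic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if event["amount"] < 0:
            raise ValueError("negative amount")
        process(event)
    except (ValueError, KeyError) as exc:
        # Bad record: keep it, record why, and move on.
        producer.produce(
            "payments.dlq",
            value=msg.value(),
            headers=[("error", str(exc).encode())],
        )
        producer.flush()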

The Bottom Line

Neither paradigm is universally superior. They optimize for different outcomes:

  • Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
  • Streaming optimizes for latency, reactivity, and continuous processing

Most modern data platforms use both. Stream for operational use cases where latency matters. Batch for analytical use cases where completeness matters. Don't let ideology drive architecture decisions.

Choose wisely, not ideologically. Start with the simpler approach. Add complexity only when requirements demand it. The best architecture is the one your team can operate reliably.

My recommendation: Start with batch. It's simpler to build, test, and operate. Add streaming pipelines for specific use cases where the latency requirement justifies the complexity. Measure everything. Let data guide your architecture evolution, not hype.

Learn More

Explore the documentation for both paradigms and start building.