Here's a truth that marketing materials won't tell you: most production data platforms use both batch and streaming. The question isn't "which is better?" but "which is right for this specific use case?"
Consider two scenarios from the same company: the fraud team needs to flag a suspicious card transaction within milliseconds, while the finance team needs complete, reconciled daily revenue numbers for reporting.
Same company, same data, different paradigms. Both essential. This guide examines both approaches honestly: their genuine strengths, hidden complexities, and what you should understand before choosing.
The best architecture is the one your team can operate reliably. Technical elegance means nothing if you can't debug it at 3 AM.
Two Mental Models
Batch and streaming aren't just different technologies. They represent fundamentally different philosophies about data. Understanding these mental models helps you choose wisely.
Batch Philosophy
"Collect everything, analyze thoroughly." Optimized for completeness and correctness. Process data in large chunks on a schedule. Better late than wrong.
Streaming Philosophy
"Process as it happens." Optimized for latency and reactivity. Data is a continuous flow, not discrete batches. Approximate now beats perfect later.
Batch: The Proven Workhorse
Batch processing has powered data warehousing for decades. It's the foundation of SQL, ETL, and business intelligence. The model is simple: accumulate data, process it on a schedule, serve the results.
Complete Picture
All data is present before processing begins. No worrying about late arrivals or out-of-order events. Joins across the full dataset are straightforward.
Simple Retry Logic
Job failed? Re-run it. Idempotent by design. No state to manage between runs. Full recompute is always an option.
Mature Ecosystem
Decades of accumulated tooling, from SQL and Spark to Airflow and dbt. Every engineer knows SQL. Debugging is well understood. Hiring is easier.
Predictable Costs
Efficient resource usage with on-demand compute. No always-on infrastructure for sporadic workloads.
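To make the "simple retry logic" point above concrete, here is a minimal PySpark sketch of an idempotent daily aggregation. It assumes an `orders` source table and an output location partitioned by date (both placeholders); re-running the job after a failure simply overwrites that day's partition with the same result.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

# Overwrite only the partitions this run produces, not the whole table,
# so a re-run after a failure replaces that day's data and nothing else.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

daily = (
    spark.table("orders")  # assumed source table
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date", "category")
    .agg(F.sum("amount").alias("daily_revenue"),
         F.count(F.lit(1)).alias("order_count"))
)

# Same input, same output, every time: the write is safe to repeat.
daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://warehouse/daily_revenue")  # placeholder path
```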
Streaming: The Real-Time Frontier
Streaming treats data as an infinite sequence of events. Processing happens continuously, not on a schedule. The paradigm shift enables new use cases but demands new thinking.
Millisecond Latency
React to events as they happen. Fraud detection, real-time pricing, live dashboards. No waiting for the next batch window.
Event Sourcing
The event log is the source of truth. Tables and caches are derived views. Replay from the log to rebuild any state.
Unified Processing
One codebase for real-time and batch (via replay). Kappa architecture eliminates the "two systems" problem.
Decoupled Systems
Producers and consumers evolve independently. Add new consumers without touching producers. Event-driven microservices.
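As a sketch of replay and decoupling in practice: a brand-new consumer group can start from the earliest offset and rebuild its own derived view without any change to producers. The broker address, topic name, and JSON event shape below are assumptions for illustration, using the confluent-kafka client.

```python
import json
from collections import defaultdict

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker
    "group.id": "revenue-view-rebuild",     # a new group, unknown to producers
    "auto.offset.reset": "earliest",        # replay the log from the beginning
})
consumer.subscribe(["orders"])              # assumed topic of JSON order events

revenue_by_category = defaultdict(float)    # the derived view being rebuilt

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        revenue_by_category[order["category"]] += order["amount"]
finally:
    consumer.close()
```

The same pattern backs recovery: if a derived store is corrupted, throw it away and replay the log.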
Neither is wrong. Batch optimizes for throughput and simplicity. Streaming optimizes for latency and reactivity. They serve different needs, often within the same organization.
Architecture Patterns
Let's look at how modern data platforms implement each paradigm. These aren't theoretical. They're patterns running in production at scale.
Modern Batch Architecture
The modern batch stack has evolved significantly from traditional ETL. dbt brought software engineering practices to data transformation. Open table formats like Apache Iceberg enable flexible, scalable data lakehouse architectures.
Key Components Compared
| Component | Batch Stack | Streaming Stack |
|---|---|---|
| Ingestion | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers |
| Storage | Apache Iceberg tables (queried via Trino), ClickHouse | Kafka (event log), ClickHouse (OLAP) |
| Transformation | dbt, Spark, SQL | Flink, ksqlDB, Spark Structured Streaming |
| Orchestration | Airflow, Dagster, Prefect | Always-on (Kubernetes, managed Flink) |
| Serving | Direct warehouse queries, caching | Pre-computed views in Redis/ClickHouse |
Hybrid is often right. Stream for operational use cases (fraud, alerts, live dashboards). Batch for analytical use cases (reporting, ML training, ad-hoc queries). Don't force one paradigm where the other excels.
Best-in-Class Technology
Both paradigms have mature, battle-tested tools. Here's what's powering production systems today.
dbt + Trino/Iceberg
Modern batch transformation stack
- SQL-first with software engineering practices
- Built-in testing and documentation
- Version controlled transformations
- Open table format with query engine flexibility
- Massive ecosystem of packages
Apache Flink
Stateful stream processing engine
- True event-time processing with watermarks
- Exactly-once semantics for stateful operations
- Millisecond latency at scale
- SQL and DataStream APIs
- Savepoints for zero-downtime upgrades
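A minimal PyFlink sketch of the event-time and watermark points above; the Kafka topic, schema, and connector options are placeholders, and the Kafka SQL connector is assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare event time plus a watermark tolerating 5 seconds of out-of-order data.
t_env.execute_sql("""
    CREATE TABLE orders (
        category   STRING,
        amount     DECIMAL(10, 2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Windows close when the watermark passes: results depend on event time, not arrival time.
t_env.execute_sql("""
    SELECT window_start, category, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR))
    GROUP BY window_start, window_end, category
""").print()
```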
Apache Spark
Distributed batch processing
- Process petabytes of data
- Python, Scala, SQL, R APIs
- ML pipelines with MLlib
- Structured Streaming for hybrid
- Massive community and support
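As a sketch of the "Structured Streaming for hybrid" point: the same DataFrame logic that runs in batch can be pointed at a Kafka stream with few changes. The broker, topic, and JSON schema are placeholders, and the Spark–Kafka integration package is assumed to be available.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

# Read orders as an unbounded stream instead of a static table.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "orders")                         # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "category STRING, amount DOUBLE, order_time TIMESTAMP").alias("o"))
    .select("o.*")
)

# The aggregation mirrors the batch job, but results are emitted continuously.
revenue = (
    orders
    .withWatermark("order_time", "10 minutes")
    .groupBy(F.window("order_time", "1 hour"), "category")
    .agg(F.sum("amount").alias("revenue"))
)

revenue.writeStream.outputMode("update").format("console").start().awaitTermination()
```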
Kafka / Redpanda
Distributed streaming platform
- Durable, ordered event log
- Horizontal scaling to millions of events/sec
- Replay capability for recovery
- Tiered storage for cost-effective retention
- Schema registry for data contracts
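A minimal producer sketch for the durable, ordered log: keying by user_id keeps each user's events in order within a partition, and the delivery callback confirms the broker accepted the write. Broker and topic names are placeholders; confluent-kafka again.

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def on_delivery(err, msg):
    # The delivery report says whether the broker durably accepted the event.
    if err is not None:
        print(f"delivery failed: {err}")

event = {"user_id": "u-123", "category": "books", "amount": 42.50}
producer.produce(
    "orders",                      # placeholder topic
    key=event["user_id"],          # same key -> same partition -> per-user ordering
    value=json.dumps(event),
    on_delivery=on_delivery,
)
producer.flush()                   # block until outstanding deliveries are reported
```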
Code Examples
Same problem, different solutions. Let's see how each paradigm handles real-world use cases.
Example 1: Daily Revenue by Category
A straightforward aggregation. This is where batch typically shines.
-- Batch: Simple, readable, testable
SELECT
DATE(order_timestamp) AS order_date,
category,
SUM(amount) AS daily_revenue,
COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2
-- Streaming: Continuous, requires windowing
SELECT
TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
category,
SUM(amount) AS daily_revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY
TUMBLE(order_time, INTERVAL '1' DAY),
category
Verdict: Batch wins here. The dbt version is simpler, testable with dbt tests, and historical recompute is trivial. Unless you need real-time revenue updates (rare), use batch.
Example 2: Fraud Detection (Velocity Check)
Detect when a user makes transactions from geographically distant locations within an impossible timeframe.
# Batch: Run hourly - detects fraud after the fact
import math
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@F.udf(returnType="double")
def haversine_udf(lat1, lon1, lat2, lon2):
    # Great-circle distance in km; the first event per user has no previous point
    if None in (lat1, lon1, lat2, lon2):
        return None
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

user_window = Window.partitionBy("user_id").orderBy("timestamp")
flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    # Cast timestamps to epoch seconds so the difference comes out in minutes
    .withColumn("time_diff_min", (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60)
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)
-- Streaming: Detect in real-time, block before damage
-- (ST_Point / ST_Distance assume a geospatial extension such as Apache Sedona)
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

-- Detect impossible velocity: >500 km travelled in <30 minutes
INSERT INTO fraud_alerts
SELECT p1.user_id,
       p1.txn_id AS first_txn,
       p2.txn_id AS second_txn,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS distance_km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE  -- interval join bounded in event time
  AND ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) > 500000;  -- metres
Verdict: Streaming wins here. The batch version detects fraud hours after the damage is done. The streaming version flags the transaction within milliseconds, early enough for a downstream service to block it before it completes. For fraud detection, that gap is the difference between preventing a loss and writing one off.
Monitoring & Visibility
How do you know your pipeline is healthy? The monitoring story is very different between paradigms.
Batch Monitoring
Binary outcomes: Jobs succeed or fail. dbt tests catch data quality issues. Airflow shows DAG execution history. Clear audit trail per run.
Streaming Monitoring
Continuous metrics: Consumer lag, end-to-end latency, checkpoint success rate. Requires always-on monitoring. Distributed debugging is harder.
Key Metrics Compared
| Metric | Batch | Streaming |
|---|---|---|
| Data freshness | "Last run: 2 hours ago" | "Consumer lag: 50ms" |
| Pipeline health | Job success/failure rate | Checkpoint success, backpressure |
| Data quality | dbt tests, Great Expectations | Schema registry, runtime validation |
| Lineage | dbt docs, DataHub (mature) | Event correlation IDs (manual) |
| Debugging | Re-run failed job, inspect logs | Distributed tracing, state inspection |
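To make the consumer-lag metric concrete, here is a sketch that computes offset lag (messages behind) per partition for one consumer group; time-based lag would additionally need event timestamps. The broker, group, topic, and partition count are assumptions.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "fraud-detector",           # the group whose lag we are checking
    "enable.auto.commit": False,
})

# Lag per partition = latest offset in the log minus the group's committed offset.
partitions = [TopicPartition("payments", p) for p in range(3)]  # assumed 3 partitions
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet -> treat as earliest
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()
```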
Traceability gap: Batch has mature lineage tools (dbt docs generates automatic lineage). Streaming requires manual correlation ID propagation. "Which event caused this output?" is harder to answer in streaming systems.
Operational Complexity
The 3 AM pager test: which system is easier to debug and fix when something goes wrong?
Batch Operations
Simple retry logic
Job failed? Re-run it. Full recompute is always an option. No state to corrupt.
Mature scheduling
Airflow, Dagster, Prefect are well-understood. Dependency management is explicit.
Delayed detection
Issues discovered when the next job runs. Hours or days after the problem occurred.
DAG dependency hell
Complex pipelines create cascading failures. One slow job blocks everything downstream.
Streaming Operations
Immediate detection
Consumer lag spikes instantly visible. No waiting for the next scheduled run.
Always-on processing
No scheduling to manage. No batch windows. Continuous flow of data.
State management complexity
Checkpoints, savepoints, state backends. Corrupted state may require full rebuild from Kafka.
Upgrade complexity
Stateful job upgrades require savepoints. Schema changes need careful coordination.
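For the state-management point above, a minimal PyFlink sketch of enabling checkpointing; the intervals are illustrative, and savepoints for planned upgrades are triggered separately (via the Flink CLI or REST API) rather than from job code.

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot all operator state every 60s; after a crash the job restarts
# from the last completed checkpoint instead of reprocessing everything.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)
```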
Team readiness matters. If your team knows Airflow and SQL, they can operate dbt pipelines tomorrow. Flink requires learning watermarks, event time, checkpointing, and distributed systems debugging. Budget training time.
Cost Economics
Money talks. Here's an honest look at when each paradigm is more cost-effective.
When Batch is Cheaper
Infrequent processing (daily/weekly). Serverless warehouses (pay-per-query). Small data volumes (<1TB/day). Ad-hoc analytics workloads. No always-on infrastructure needed.
When Streaming is Cheaper
High-volume continuous processing. Replacing multiple redundant batch jobs. Eliminating intermediate storage. Real-time requirements that would require over-provisioning in batch.
Migration costs are real. During transition, you run both stacks. Budget 12-18 months of parallel operation. Factor in team training, debugging unfamiliar failure modes, and the inevitable "oh, we forgot that pipeline depended on this" discoveries.
Decision Framework
Use this framework to choose the right paradigm for each use case. Don't force one approach everywhere.
Clear Wins for Batch
Historical Analytics
Ad-hoc queries over years of data. Complex joins across large datasets. Columnar storage shines here.
ML Training
Feature engineering over historical data. Model training doesn't benefit from real-time. Batch is simpler.
Compliance Reporting
Audit reports, regulatory filings. Point-in-time snapshots. Full reproducibility required.
Clear Wins for Streaming
Fraud Detection
Every millisecond counts. Block fraudulent transactions before they complete. Real-time is essential.
Operational Monitoring
System health, alerting, live dashboards. Stale data is useless. Continuous processing required.
Event-Driven Systems
Microservices choreography. Real-time notifications. User-facing features that respond instantly.
Start with batch unless you have a specific latency requirement that batch cannot meet. It's easier to add streaming later than to simplify an over-engineered streaming system.
Risks & Failure Modes
What can go wrong? Every architecture has failure modes. Understanding them helps you design for resilience.
Batch Risks
Stale data: Decisions made on outdated information. Delayed detection: Problems discovered hours later. Cascade failures: One slow job blocks everything. Resource contention: Batch windows competing for compute.
Streaming Risks
Silent data loss: Misconfigured consumers drop events. State corruption: Requires full rebuild from Kafka. Backpressure cascade: Slow consumer affects entire pipeline. Schema breaks: Producer changes break consumers.
Mitigation Strategies
| Risk | Batch Mitigation | Streaming Mitigation |
|---|---|---|
| Data quality issues | dbt tests, Great Expectations, post-run validation | Schema registry, runtime validation, dead letter queues |
| Pipeline failures | Retry policies, alerting on failure, SLAs with buffer | Checkpointing, automatic restart, consumer lag alerting |
| Data loss | Idempotent jobs, backup before overwrite | Kafka retention, exactly-once semantics, idempotent sinks |
| Schema evolution | dbt schema tests, migration scripts | Schema registry compatibility rules, versioned topics |
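As a sketch of the dead-letter-queue mitigation from the table: records that fail parsing or validation are routed to a separate topic with the error attached, instead of being dropped silently or blocking the pipeline. Topic names, the validation rule, and the broker are placeholders.

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "payments-validator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["payments"])            # placeholder input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

def is_valid(event: dict) -> bool:
    # Placeholder rule: required fields present with sane values.
    return bool(event.get("user_id")) and event.get("amount", 0) > 0

while True:  # runs until the process is stopped
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if not is_valid(event):
            raise ValueError("failed validation")
        # ... normal processing of the valid event goes here ...
    except (ValueError, json.JSONDecodeError) as exc:
        # Route the bad record to the DLQ with the reason; keep the stream moving.
        producer.produce("payments.dlq", value=msg.value(),
                         headers=[("error", str(exc).encode())])
    consumer.commit(message=msg)
```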
The Bottom Line
Neither paradigm is universally superior. They optimize for different outcomes:
- Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
- Streaming optimizes for latency, reactivity, and continuous processing
Most modern data platforms use both. Stream for operational use cases where latency matters. Batch for analytical use cases where completeness matters. Don't let ideology drive architecture decisions.
Choose wisely, not ideologically. Start with the simpler approach. Add complexity only when requirements demand it. The best architecture is the one your team can operate reliably.
My recommendation: Start with batch. It's simpler to build, test, and operate. Add streaming pipelines for specific use cases where the latency requirement justifies the complexity. Measure everything. Let data guide your architecture evolution, not hype.