Your nightly report used to run in a database query. Then the data grew — billions of rows a day, arriving faster than any single server can write, let alone analyze. Around the same time, a colleague in research has the opposite shape of the same problem: one weather simulation that would take a single CPU three months to finish.
Neither of these fits on one machine, no matter how big. Big data and big compute are the two architecture styles for exactly this situation — both answer it by throwing many machines at the problem instead of one giant one.
The problem
A traditional database assumes the data is small enough to live on one server (or a small handful) and that queries finish in seconds. That assumption breaks in two different directions. Sometimes the dataset itself is too large or arrives too fast — terabytes of logs, clickstreams, sensor readings — to fit, index, and query the old way. Other times the dataset is modest but the computation is monstrous — a physics simulation or a frame of CGI that needs trillions of floating-point operations.
In both cases a single machine is the bottleneck. You can buy a bigger box for a while, but eventually there is no bigger box. The only way forward is to split the work and distribute it.
How it works
A big data system is best understood as a pipeline with four stages: ingest pulls raw data in from many sources, store lands it somewhere durable and cheap (often a data lake), process transforms and aggregates it, and serve exposes the results to dashboards, reports, and applications. The animation below traces this flow — data streaming in from several sources, funneling through an ingestion step, into a processing layer, and finally out to a serving and analytics layer where people actually use it.
The key to the whole thing is that data is partitioned across many nodes, so every stage can run in parallel: more data simply means more machines, not a slower system.
- ProcessingBatch and/or streaming compute spread across many nodes working in parallel.
Partitioning the data across nodes is the same sharding idea behind replication and horizontal scaling: instead of one machine holding everything, each node owns a slice. Storage and compute then grow by adding machines, which is what lets these systems handle volumes a single server never could.
Batch versus streaming
There are two ways to push data through the processing stage, and big data systems often use both. The batch path collects data over a window — an hour, a day — and crunches it all at once. Batch is throughput-friendly and great for thorough, accurate aggregates, but its results are inherently delayed: you learn about today fully only after today is over.
The streaming path processes each record (or tiny micro-batch) as it arrives, giving near-real-time answers at the cost of some precision and more fiddly machinery. Streaming is what powers a live fraud check or an up-to-the-second dashboard.
Running both at once is a recognized pattern. One approach keeps two parallel tracks — a fast streaming layer for fresh-but-approximate results and a slow batch layer that periodically corrects them — and merges the two when serving. A simpler variant treats everything as a stream and just replays history through the same code when it needs to reprocess. The first trades duplicated logic for a safety net; the second trades that net for one code path to maintain.
Don't reach for streaming just because it sounds modern. If your consumers only look at yesterday's numbers each morning, a nightly batch job is simpler, cheaper, and easier to get right. Add a streaming path only when someone genuinely needs answers measured in seconds, not hours.
Big compute (HPC)
Big compute, often called high-performance computing, flips the emphasis. Here the data may be small, but a single calculation is so heavy that you split it across thousands of cores or machines that work in parallel, then stitch their partial answers back together. Think simulating airflow over a wing, rendering every frame of an animated film, or training a large scientific model.
The classic recipe is divide-and-conquer: carve the problem into independent chunks (a region of the simulation grid, a batch of frames), hand each chunk to a different worker, and combine the results at the end. The more cleanly the chunks can run without talking to each other, the better it scales — tightly coupled chunks that constantly exchange data are where HPC gets hard.
What it costs you
Distributing work is never free. You inherit the complexity of a cluster — scheduling, partitioning, retries, and partial failures that a single machine never forced you to think about. Moving data between nodes becomes a real cost: the network is slow compared to local memory, so a careless design can spend more time shuffling bytes than computing on them.
You also accept latency — batch results are stale by design, and even streaming has lag — and you commit to specialized tooling (distributed processing frameworks, cluster schedulers, data lake formats) that your team has to learn and operate. These styles pay off only when the scale genuinely demands them.
When to use it
Reach for big data when your dataset outgrows what a single database can store or query, or when data arrives faster than one machine can keep up — large-scale analytics, log and event processing, recommendation pipelines, machine-learning feature stores. Choose the batch path for thorough periodic reporting and the streaming path when freshness matters, and combine them only when you truly need both.
Reach for big compute when you have one enormous, parallelizable calculation rather than a flood of data — scientific simulation, financial risk modeling, rendering, genomics. And if neither describes you — if a tuned database and one beefy server still cope — stay there. These styles solve scale problems, and they bring real overhead in exchange.