
The Small Data Movement: Why Your Data Stack is Probably Over-Engineered

A client calls. They need analytics for their SaaS product. The conversation starts with listing requirements: data ingestion, processing, storage, transformations, orchestration, and visualization.

The proposed solution: Kafka, Spark cluster on Databricks, data lake on S3, Snowflake warehouse, dbt, Airflow, and Looker.

Then someone asks: "How much data do we have?"

"About 50 gigabytes."

There's a simpler approach that's been gaining traction: the small data movement. It turns out that for most companies, modern hardware can handle their data needs without distributed systems.

The old maze of distributed systems

If you've worked in data engineering in the past decade, this tech stack probably sounds familiar:

  • Kafka for data ingestion (because everything needs to be real-time, right?)

  • Spark cluster on EMR or Databricks for processing ($$$$)

  • Data lake on S3 with complicated partitioning schemes

  • dbt for transformations with 200+ YAML files

  • Airflow with 15 DAGs to orchestrate everything

  • Looker or Tableau for visualization (even more $$$$)

  • 3-5 data engineers to keep it all running

This complexity creates real problems:

  • High costs: Infrastructure bills of $5-10k/month plus salaries for specialized engineers

  • Slow development: "Just add a new metric" becomes a 2-week project

  • Debugging nightmares: "It works on my cluster but fails in prod"

  • Team dependencies: Need a Spark expert, a dbt specialist, a platform engineer

  • Complexity debt: Each tool adds configuration, monitoring, and maintenance burden


What changed: Hardware caught up

The secret is that your laptop is more powerful than yesterday's data center.

Some perspective on how hardware evolved:

  • 2004: a top-end Intel Xeon server had a single core and about 4GB of RAM

  • 2025: AMD EPYC chips pack 96 cores, and AWS instances offer 1TB of RAM

  • That's 96x the cores and roughly 250x the memory

Meanwhile, most companies' data didn't grow nearly as fast. Jordan Tigani, who helped build Google BigQuery, analyzed actual query patterns and found that 99.5% of queries could run on a laptop [source]. Even companies with petabytes stored typically query only small working sets—around 64MB median query size according to Fivetran's analysis.

The math is simple: what needed a 10-node cluster in 2010 now fits comfortably on a $1000 machine.

This realization sparked the Small Data SF conference, which launched in September 2024 with 250+ attendees and speakers from Google, AWS, Fivetran, and Mode. The event's tagline says it all: "Think Small, Build Big."

Key insights from the conference:

  • George Fraser (Fivetran CEO) revealed that 99.9% of queries could run on a laptop, based on analysis of Redshift and Snowflake datasets [source]

  • Wes McKinney (creator of Pandas) discussed "Retooling for the Smaller Data Era," emphasizing how hardware evolution enables simpler architectures

  • Benn Stancil (Mode) delivered "BI's Big Lie," showing how business intelligence often over-promises and under-delivers

The toolkit

Let's look at some of the tools making small data approaches practical.

DuckDB: Analytics without the database server

Think of DuckDB as "SQLite for analytics."

What it is: An embedded analytical database that runs in-process. No server, no configuration, no dependencies.

How does it work?

  • Install with pip install duckdb (plain pip works, though uv is nicer ;)) and you're done

  • Query Parquet files directly from S3, no copying needed (see the sketch after this list)

  • Process datasets larger than RAM (automatic disk spilling)

  • Native integration with Pandas, Polars, and Arrow
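
To make that concrete, here's a minimal sketch in Python. The bucket path, file layout, and column names are hypothetical; the calls themselves (connect, the httpfs extension, read_parquet, .df()) are standard DuckDB.

```python
import duckdb

# In-process database: no server, just a file (or pure in-memory if you omit the path).
con = duckdb.connect("analytics.duckdb")

# httpfs lets DuckDB read Parquet straight from S3 without copying it down first.
# (S3 credentials/region are assumed to be configured separately, e.g. via a DuckDB secret.)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Hypothetical bucket and schema, just to show the shape of a query.
daily_signups = con.execute("""
    SELECT date_trunc('day', created_at) AS day,
           count(*)                      AS signups
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    WHERE event_type = 'signup'
    GROUP BY day
    ORDER BY day
""").df()  # result lands in a Pandas DataFrame

print(daily_signups.head())
```

That's the whole stack for a lot of analytics work: one library, one optional database file, and SQL you already know. Swap .df() for .pl() or .arrow() to get Polars or Arrow output instead.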

Real-world impact: Watershed replaced PostgreSQL with DuckDB for 17 million rows and saw 10x performance gains. Financial services firm Billigence reduced query times from 3-4 minutes to under 5 seconds.

When to use it: Analytics queries, data transformations, exploratory analysis, datasets under 1TB.

When NOT to use it: High-concurrency transactional workloads, or when you actually need distributed processing for 10+ TB datasets.

The Postgres renaissance

One of the most exciting developments is turning Postgres into a data lakehouse platform.

Three new tools—DuckLake, pg_lake, and pg_mooncake—let you build "data lakes" with just Postgres + S3. No Hive metastore, no Glue catalog, no complex infrastructure.

How does it work?

  • Store your table metadata in Postgres (fast, transactional, familiar)

  • Keep actual data as Parquet files in S3 (cheap, scalable)

  • Query with standard SQL or connect DuckDB for analytics (see the sketch below)
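
Here's a rough sketch of the DuckLake flavour from Python. Treat it as an illustration: the Postgres connection string, bucket, and table are placeholders, and pg_lake and pg_mooncake expose the same idea as Postgres extensions instead, so their syntax differs.

```python
import duckdb

con = duckdb.connect()

# DuckLake is a DuckDB extension: table metadata is stored in Postgres,
# while the rows land as Parquet files under DATA_PATH.
con.execute("INSTALL ducklake;")
con.execute("LOAD ducklake;")

# Hypothetical Postgres DSN and bucket; S3 credentials are assumed to be
# configured separately.
con.execute("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=localhost user=lake'
        AS lake (DATA_PATH 's3://my-bucket/lakehouse/')
""")

# From here it's plain SQL: Postgres tracks the catalog, S3 holds the Parquet.
con.execute("CREATE TABLE IF NOT EXISTS lake.events (id BIGINT, event_type TEXT, created_at TIMESTAMP)")
con.execute("INSERT INTO lake.events VALUES (1, 'signup', TIMESTAMP '2025-01-01 12:00:00')")
print(con.execute("SELECT count(*) FROM lake.events").fetchone())
```

pg_lake and pg_mooncake come at it from the other side: you stay inside Postgres and the extension reads and writes the Parquet files on S3 for you.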

Why this matters: You get ACID transactions, time travel, and multi-table operations without the complexity of traditional lakehouse platforms. Plus, Postgres is something every engineer already knows.

When you actually need the big stuff

Let's be honest: the small data approach isn't always the answer.

You probably need Spark/Databricks when:

  • Regularly processing 1TB+ datasets (not just storing, actually processing)

  • Running 24/7 continuous pipelines that need fault tolerance

  • High concurrency requirements (100+ simultaneous users)

  • Complex streaming use cases with exactly-once semantics

  • Team already has Spark expertise and is productive with it

Databricks serverless has real value for:

  • Cross-functional teams where data engineers, scientists, and analysts need unified tools

  • Organizations requiring Unity Catalog's governance features (row/column security, automated lineage)

  • Bursty workloads where serverless saves money vs always-on clusters

The 3-5x cost premium for managed services can be justified when operational simplicity matters more than infrastructure cost. Just be honest about whether you're in that category.

The gray zone (100GB-1TB):

This is where judgment matters. Ask yourself:

  • Will data grow past 1TB within 12 months?

  • Do we have complex multi-table joins across billions of rows?

  • Is sub-second query latency critical for the business?

  • Do we need automatic recovery from hardware failures?

If multiple answers are yes, distributed systems make sense. If most answers are no, stay simple.

At Wolk, we often recommend a blended approach: use DuckDB/Polars for ad-hoc analysis and exploration (where single-node performance excels), and Spark only for the specific ETL jobs that benefit from distribution. Don't feel like you have to pick one tool for everything.

Decision time

Here's a quick self-assessment:

  • Is your working dataset under 100GB?

  • Can you develop locally with real data?

  • Do you have fewer than 3 data engineers?

  • Is fast iteration more valuable than theoretical scale?

  • Are infrastructure costs a concern?

  • Do queries mostly hit recent data (last 90 days)?

If you checked most of these boxes: start with small data tools.

The pragmatic path:

  1. Start with DuckDB and Polars (see the sketch after this list)

  2. Develop everything locally on your laptop

  3. Deploy to a single VM (scale up the machine specs as needed)

  4. Only introduce distributed complexity when you hit concrete limits
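
To give step 1 some shape, here's a hypothetical end-to-end transform mixing Polars and DuckDB in one script. The file names and columns are invented, but the pattern is the whole pipeline: read raw data, shape it with SQL, write Parquet for whatever comes next.

```python
import duckdb
import polars as pl

# Extract: read a raw CSV export with Polars.
orders = pl.read_csv("orders.csv", try_parse_dates=True)

# Transform: DuckDB can query the Polars DataFrame by name, zero-copy via Arrow.
con = duckdb.connect()
monthly = con.execute("""
    SELECT date_trunc('month', order_date) AS month,
           sum(amount)                     AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
""").pl()  # back to a Polars DataFrame

# Load: write a Parquet file your BI tool (or the next job) can pick up.
monthly.write_parquet("monthly_revenue.parquet")
```

When a script like this outgrows your laptop, the next step in the list above is a bigger VM and a simple scheduler, not a cluster.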

Remember: you can always migrate from DuckDB to Databricks later. The reverse journey is much harder.

What's next

The small data movement isn't about never using Spark or Databricks. It's about not starting with them when DuckDB will do. It's about matching tools to actual needs instead of theoretical fears.

At Wolk, we've built both massive distributed systems and elegant single-node solutions. The elegant ones usually ship faster, cost less, and require smaller teams to maintain.

In our next post, we'll build a complete small data pipeline from scratch: extracting data, transforming it with DuckDB and Polars, scheduling it to run automatically, and connecting it to a BI tool. We'll include actual code, cost breakdowns, and deployment options.



Need help deciding?

At Wolk, we help companies right-size their data infrastructure. Whether you're over-engineered and want to simplify, or have genuinely outgrown single-node tools, we can help. Get in touch.


Stay up to date!

Subscribe to our newsletter, de Wolkskrant, to get the latest tools, trends and tips from the industry.
