In our previous post, we explored why most companies don’t need complex distributed systems for their data needs. The small data movement shows that modern hardware and smart tools can handle far more than we give them credit for.
Now it’s time to build something.
This guide walks you through creating a complete analytics platform using DuckDB and Superset—no Spark clusters, no expensive warehouses, just open-source tools that get the job done. We’ll start with local development on your laptop, then show you how to make it accessible to your entire organization.
What We’re Building
Let’s say you’re working at a company that currently does analytics in Excel spreadsheets and ad-hoc Python scripts. This doesn’t scale well: Excel struggles once the data grows even slightly, and Python scripts quickly become messy when you’re trying to explore data visually. You have:
Sales data exported monthly as CSV files
Customer data in a Postgres database
Product usage logs in Excel
A team of 5-10 people who need dashboards
Our goal: Build a platform where you can:
Combine data from multiple sources
Transform it with SQL
Create interactive dashboards
Share with colleagues without everyone needing Python knowledge
The architecture (starting simple):
Data Sources (CSV, Excel, Databases)
↓
DuckDB (transformation & query)
↓
Superset (visualization)
Later, we’ll evolve this to support multiple users safely.
Phase 1: Local Development
Let’s start on your laptop. This is where you’ll develop your transformations and build your first dashboards.
Setting Up Your Environment
First, create a project directory and install DuckDB:
# Create project directory
mkdir small-data-platform
cd small-data-platform
# Create a Python virtual environment with uv
uv init
# Install DuckDB
uv add duckdb
If you don’t have uv installed yet, read here why we recommend it and here how to install it.
Note: DuckDB also works as a standalone CLI tool (just download the binary) or even in your browser. But we’re using Python here since most data folks are already comfortable with it.
Loading Your First Data
Let’s load some CSV data. Create a file called load_data.py:
import duckdb
# Create a persistent database file
# This file will store your tables and can be reused
con = duckdb.connect('analytics.duckdb')
# Load CSV data directly - DuckDB infers the schema
con.execute("""
CREATE TABLE sales AS
SELECT * FROM read_csv('data/sales_2024.csv')
""")
# Load Excel files (yes, DuckDB can read Excel!)
# st_read comes from the spatial extension, so install and load it explicitly
con.execute("INSTALL spatial")
con.execute("LOAD spatial")
con.execute("""
CREATE TABLE product_usage AS
SELECT * FROM st_read('data/usage_logs.xlsx')
""")
# Connect to your existing Postgres database
con.execute("""
INSTALL postgres;
LOAD postgres;
CREATE TABLE customers AS
SELECT * FROM postgres_scan(
'host=localhost port=5432 dbname=production user=readonly',
'public',
'customers'
)
""")
con.close()
print("✅ Data loaded successfully!")
Run it:
uv run load_data.py
You now have an analytics.duckdb file containing all your data. This file is a complete database that you can version control, back up, or share with colleagues.
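Before moving on, it’s worth a quick sanity check that everything landed. A minimal sketch that opens the file read-only and inspects what was loaded:
import duckdb

con = duckdb.connect('analytics.duckdb', read_only=True)
# List the tables created by load_data.py
print(con.execute("SHOW TABLES").df())
# Spot-check one of them
print(con.execute("SELECT COUNT(*) AS row_count FROM sales").df())
con.close()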
Writing Transformations
Now let’s transform this data into something useful. Create transformations.py:
import duckdb
con = duckdb.connect('analytics.duckdb')
# Create a view for monthly revenue by customer segment
con.execute("""
CREATE OR REPLACE VIEW monthly_revenue AS
SELECT
DATE_TRUNC('month', s.sale_date) as month,
c.segment,
c.region,
COUNT(DISTINCT s.customer_id) as active_customers,
SUM(s.amount) as total_revenue,
AVG(s.amount) as avg_order_value
FROM sales s
JOIN customers c ON s.customer_id = c.id
GROUP BY 1, 2, 3
ORDER BY 1 DESC, 2
""")
# Create a view for product usage metrics
con.execute("""
CREATE OR REPLACE VIEW product_metrics AS
SELECT
product_id,
DATE_TRUNC('week', event_date) as week,
COUNT(DISTINCT user_id) as weekly_active_users,
COUNT(*) as total_events,
COUNT(*) / COUNT(DISTINCT user_id) as events_per_user
FROM product_usage
GROUP BY 1, 2
""")
# Create a customer summary table for dashboarding
con.execute("""
CREATE OR REPLACE TABLE customer_summary AS
SELECT
c.id,
c.name,
c.segment,
c.region,
COUNT(s.id) as total_orders,
SUM(s.amount) as lifetime_value,
MAX(s.sale_date) as last_purchase_date,
DATEDIFF('day', MAX(s.sale_date), CURRENT_DATE) as days_since_purchase
FROM customers c
LEFT JOIN sales s ON c.id = s.customer_id
GROUP BY 1, 2, 3, 4
""")
con.close()
print("✅ Transformations complete!")
Why views vs tables?
Views are computed on-the-fly when queried (good for logic that changes frequently)
Tables are materialized and stored (good for expensive computations you’ll query often)
Run your transformations:
uv run transformations.py
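If you’re curious which of these objects ended up as views and which as tables, DuckDB’s built-in catalog functions can tell you. A quick sketch:
import duckdb

con = duckdb.connect('analytics.duckdb', read_only=True)
# User-defined views (monthly_revenue, product_metrics)
print(con.execute("SELECT view_name FROM duckdb_views() WHERE NOT internal").df())
# Materialized tables (sales, customers, product_usage, customer_summary)
print(con.execute("SELECT table_name FROM duckdb_tables()").df())
con.close()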
Querying Your Data
Now you can query your transformed data. Create query.py:
import duckdb
con = duckdb.connect("analytics.duckdb", read_only=True)
# Get revenue by segment from July-December 2024
df = con.execute("""
SELECT
month,
segment,
region,
total_revenue,
active_customers
FROM monthly_revenue
WHERE month >= '2024-07-01' AND month < '2025-01-01'
ORDER BY total_revenue DESC, month DESC
""").df()
print(df.head())
con.close()
The .df() at the end converts the result to a Pandas DataFrame, which is handy for plotting or further analysis.
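From there it’s a small step to a quick local plot. A minimal sketch, assuming you’ve added matplotlib to the project (uv add matplotlib):
import duckdb
import matplotlib.pyplot as plt

con = duckdb.connect("analytics.duckdb", read_only=True)
df = con.execute("SELECT month, segment, total_revenue FROM monthly_revenue").df()
con.close()

# One revenue value per (month, segment), then one line per segment
pivot = df.pivot_table(index="month", columns="segment", values="total_revenue", aggfunc="sum")
pivot.plot(title="Monthly revenue by segment")
plt.show()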
Installing Superset Locally
Now let’s visualize this data. We’ll install Superset using Docker (the easiest way). Create a file named docker-compose.yml with the following content:
name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=superset
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=superset
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
volumes:
  superset_postgres_data:
configs:
  superset_config.py:
    content: |
      SECRET_KEY = "your_super_duper_secret_key_here"
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:superset@db:5432/superset"
This file defines a minimal Superset setup with the dependencies required to visualize data from DuckDB.
Now run the following commands:
docker compose up --build -d
# you might need to wait a bit until it starts, then run
docker exec -it superset /app/docker/docker-init.sh
Open your browser to http://localhost:8088 and log in with admin/admin!
Connecting Superset to DuckDB
In Superset, you’ll need to add a database connection:
Click Settings → Database Connections → + Database
Select “DuckDB” as the database type and click on “Connect this database with a SQLAlchemy URI string instead”
Enter this connection string:
duckdb:////app/analytics.duckdb
Click Connect
Creating Your First Chart
Now you can create visualizations:
Go to SQL Lab → SQL Editor
Write a query against your data:
SELECT month, segment, total_revenue FROM monthly_revenue ORDER BY month DESC LIMIT 100
Click Save → Save as Chart
Choose a visualization type (e.g., “Line Chart”)
Configure your chart axes and save it
You now have a working analytics setup on your laptop! You can develop transformations, query data, and build dashboards—all locally.
Phase 2: Making It Shareable
Your local setup works great for you, but what about your colleagues (or maybe your family)? They need access to the dashboards without installing DuckDB and Python.
You have two main paths forward:
The Managed Path: MotherDuck + Preset
MotherDuck is a managed DuckDB service (think “DuckDB in the cloud”). Preset is managed Superset.
This path is the fastest to get started:
Sign up for MotherDuck (free tier available)
Upload your analytics.duckdb file or connect your data sources
Sign up for Preset (free tier available)
Connect Preset to your MotherDuck database
Share dashboard links with your team
When this makes sense:
You want to get started in an afternoon
You don’t want to manage infrastructure
Your data governance requirements are met by US-based services
Budget allows for roughly $100+/month depending on usage (likely much less when sharing with fewer than 5 people)
When to consider alternatives:
Data sovereignty requirements (must stay in EU)
Cost sensitivity (self-hosting is cheaper at scale)
You want full control over your stack
Let’s focus on the self-hosted path, which gives you more control and can run on European infrastructure.
Phase 3: Self-Hosted on a VM
This is where things get interesting. We’ll set up a shared analytics platform that multiple users can access simultaneously, while keeping costs reasonable and data in your preferred region.
Architecture Overview
Here’s what we’re building:
Your Laptop (Development)
↓ (write SQL transformations)
↓ (run locally, push results)
DuckLake on VM
├── Postgres (metadata)
├── OVH Object Storage (actual data)
└── DuckDB (query engine)
↓
Superset (visualization)
Why this architecture?
DuckLake lets multiple users query the same data safely using a Postgres metadata layer
Object Storage (like S3) is cheap, reliable, and scales automatically
Separation of storage and compute means you don’t pay for idle resources
Why European Cloud Providers?
You might be less comfortable with hyperscalers these days, whether due to data sovereignty concerns, GDPR compliance, or simply wanting to support European infrastructure. Good news: this entire stack runs perfectly on providers like OVH, Hetzner, or Stackit. A brief comparison written by Stijn can be found here: https://www.wolk.work/blog/posts/european-cloud-providers-compared-a-hands-on-terraform-test-january-2026-
We’ll use OVH in our examples, but the setup is nearly identical across providers. (Stijn also wrote a short guide on how to use OVH with Terraform; find it here.)
Understanding DuckLake
DuckLake is an open-source data lake format that uses:
A Postgres database for metadata (table schemas, file locations)
Object storage for the actual data (Parquet files)
DuckDB as the query engine
Unlike Iceberg or Delta Lake, DuckLake is dead simple to set up and designed specifically for DuckDB. It allows multiple users to query and write to the same data lake without conflicts.
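If you want a feel for how DuckLake works before provisioning anything, you can try it entirely locally. A minimal sketch based on the DuckLake quickstart, using a local DuckDB file for metadata and a local folder for data (the file and folder names are just illustrative):
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
# Metadata lives in metadata.ducklake, data files (Parquet) go into lake_files/
con.execute("ATTACH 'ducklake:metadata.ducklake' AS local_lake (DATA_PATH 'lake_files/')")
con.execute("CREATE TABLE local_lake.demo AS SELECT 42 AS answer")
print(con.execute("SELECT * FROM local_lake.demo").df())
con.close()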
Setting Up Your VM
First, provision a VM. On OVH Cloud:
Instance type: B2-15 (4 vCPU, 15GB RAM) ~€30/month
Operating System: Ubuntu 22.04
Network: Ensure port 8088 is open for Superset
SSH into your VM:
ssh ubuntu@your-vm-ip
Setting Up Object Storage
Create an OVH Object Storage bucket:
Log into OVH Cloud
Navigate to Public Cloud → Object Storage
Click Create an Object Storage
Choose your region (e.g., GRA for Gravelines, France)
Name it small-data-platform
Create S3 credentials (access key + secret key)
Save these credentials—you’ll need them.
Cost: ~€0.01/GB/month for storage, plus minimal egress (~€0.01/GB for downloads)
For a 100GB data lake: ~€1/month storage + ~€5/month bandwidth = €6/month total.
Installing Dependencies on the VM
# Update system
sudo apt update && sudo apt upgrade -y
# Install Docker and the Compose v2 plugin (provides the 'docker compose' command used below)
sudo apt install -y docker.io docker-compose-v2
# Start Docker
sudo systemctl enable docker
sudo systemctl start docker
# Add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in for this to take effect
Creating the Docker Compose Setup
Create a directory for your setup:
mkdir ~/small-data-platform
cd ~/small-data-platform
Create a docker-compose.yml file (we’ll reuse the local one and add a database for the ducklake metadata):
name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - SUPERSET_SECRET_KEY=${SUPERSET_SECRET_KEY}
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=${SUPERSET_DB_PASSWORD}
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
    restart: unless-stopped
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=${SUPERSET_DB_PASSWORD}
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
    restart: unless-stopped
  ducklake-metadata:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=ducklake
      - POSTGRES_PASSWORD=${DUCKLAKE_DB_PASSWORD}
      - POSTGRES_DB=ducklake
    volumes:
      - ducklake-metadata-data:/var/lib/postgresql/data
    restart: unless-stopped
volumes:
  superset_postgres_data:
  ducklake-metadata-data:
configs:
  superset_config.py:
    content: |
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:${SUPERSET_DB_PASSWORD}@db:5432/superset"
Create a .env file with your passwords:
# Generate secure passwords
DUCKLAKE_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_SECRET_KEY=$(openssl rand -base64 42)
# Save them to .env
cat > .env << EOF
DUCKLAKE_DB_PASSWORD=$DUCKLAKE_DB_PASSWORD
SUPERSET_DB_PASSWORD=$SUPERSET_DB_PASSWORD
SUPERSET_SECRET_KEY=$SUPERSET_SECRET_KEY
EOF
echo "✅ Credentials generated and saved to .env"
Start the services:
docker compose up -d
Initialize Superset:
docker exec -it superset /app/docker/docker-init.sh
Setting Up DuckLake
On your local machine, install the DuckLake extension and initialize the lake (fill in the appropriate credentials):
import duckdb
con = duckdb.connect('analytics.duckdb')
# Install DuckLake extension
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
# Configure connection to your VM's Postgres and OVH Object Storage
con.execute("""
CREATE OR REPLACE PERSISTENT SECRET s3_ducklake (
TYPE s3,
PROVIDER config,
KEY_ID 'your-ovh-access-key',
SECRET 'your-ovh-secret-key',
REGION 'gra',
ENDPOINT 's3.gra.io.cloud.ovh.net'
);
-- Using the DuckLake credentials for Postgres
CREATE OR REPLACE PERSISTENT SECRET postgres_secret (
TYPE postgres,
HOST 'your-vm-ip',
DATABASE 'ducklake',
USER 'ducklake',
PASSWORD 'your-DUCKLAKE_DB_PASSWORD'
);
-- DuckLake config secret
CREATE OR REPLACE PERSISTENT SECRET ducklake_secret (
TYPE ducklake,
METADATA_PATH '',
DATA_PATH 's3://small-data-platform/lake',
METADATA_PARAMETERS MAP {'TYPE': 'postgres','SECRET': 'postgres_secret'}
);
-- This initializes DuckLake
ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;
""")
con.close()
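Before pushing any data, you can optionally sanity-check that the lake attaches from a fresh connection. A quick sketch:
import duckdb

con = duckdb.connect('analytics.duckdb')
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:ducklake_secret' AS my_ducklake")
# duckdb_databases() lists every attached catalog; my_ducklake should show up
print(con.execute("SELECT database_name FROM duckdb_databases()").df())
con.close()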
Pushing Data to DuckLake
Now you can push your transformed data to the shared lake:
import duckdb
con = duckdb.connect('analytics.duckdb')
# Load DuckLake and attach
con.execute("LOAD ducklake")
con.execute("""ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;""")
# Copy your tables to DuckLake
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.monthly_revenue AS
SELECT * FROM analytics.monthly_revenue
""")
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.customer_summary AS
SELECT * FROM analytics.customer_summary
""")
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.product_metrics AS
SELECT * FROM analytics.product_metrics
""")
print("✅ Data pushed to DuckLake!")
con.close()
Your data is now stored in OVH Object Storage, with metadata in Postgres. Anyone with access can query it.
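For example, a colleague can read the shared tables straight from their own laptop. A minimal sketch, assuming they have created the same persistent secrets (S3, Postgres, and ducklake_secret) as in the setup script above:
import duckdb

con = duckdb.connect()  # an in-memory connection is enough for reading
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:ducklake_secret' AS lake")
df = con.execute("""
    SELECT month, segment, SUM(total_revenue) AS revenue
    FROM lake.monthly_revenue
    GROUP BY 1, 2
    ORDER BY 1 DESC
""").df()
print(df.head())
con.close()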
Connecting Superset to DuckLake
Now you need to set up Superset on the VM to query your DuckLake. (Kudos to the guide from Danial Lewis: https://medium.com/data-analytics-at-nesta/ducklake-on-apache-superset-d1b1b00c222a. I did have to fork the duckdb-engine repo to fix some DuckLake compatibility issues and publish it under the name duckdb-driver.)
Now in Superset’s UI:
Go to Settings → Database Connections → + Database
Select “Other”
Enter this SQLAlchemy URI:
duckdb:///ducklake:postgres:dbname=ducklake host=ducklake-metadata user=ducklake password={your-ducklake-db-password}
In the Advanced tab → SQL Lab → SQL Alchemy Parameters, add:
{ "connect_args": { "config": { "s3_access_key_id": "your-ovh-access-key", "s3_secret_access_key": "your-ovh-secret-key", "s3_endpoint": "s3.gra.cloud.ovh.net", "s3_url_style": "path", "s3_use_ssl": false, "s3_region": "gra" } } }
It's connected! You can check the available tables in the SQL Lab tab or immediately create a dataset from one.
Awesome, you did it! You made it to the end of this very long piece of text, and your small data troubles are now over. I hope this helps show what's possible with modern technology.
Now you can create user accounts for your colleagues and build your first charts and dashboards!
Accessing Superset Safely
Your Superset instance is now running at http://your-vm-ip:8088. However, exposing it directly to the internet isn't secure (especially if you keep the admin/admin login 😅), so please don't do this with data you're uncomfortable sharing with the world!
There are some options to make it a bit more secure:
Host it on a VM on your internal network (if you have one)
Use a VPN service like Tailscale so the instance doesn't need to be exposed to the internet (free for personal use!)
Next Steps
You now have a complete small data platform that:
✅ Runs on your laptop for development
✅ Scales to multiple users on a shared VM
✅ Keeps data in Europe (if you chose EU cloud)
✅ Costs ~€30/month (not $3000/month)
✅ Uses open-source tools you control
Migration Path: When to Upgrade
This stack will take you surprisingly far. But you’ll eventually outgrow it. Here are the warning signs and what to do:
Warning Sign 1: Queries are getting slow (>30 seconds)
Solution: Upgrade your VM to more CPUs and RAM (vertical scaling)
When that's not enough: Consider splitting workloads across multiple machines or migrating hot tables to a proper data warehouse like ClickHouse
Warning Sign 2: Multiple people running transformations causes conflicts
Solution: Move transformation execution to the VM (one designated machine)
When that’s not enough: Add a workflow orchestrator (Dagster/Prefect)
Warning Sign 3: Data is growing past 1TB
Solution: Partition your Parquet files by date and use columnar compression (see the sketch after this list)
When that's not enough: Consider Apache Iceberg tables with proper partitioning, or migrate to Databricks/Snowflake/ClickHouse
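For the partitioning suggestion under Warning Sign 3, DuckDB can write Hive-style partitioned Parquet directly with COPY ... PARTITION_BY. A minimal sketch; the output path and partition columns are illustrative:
import duckdb

con = duckdb.connect('analytics.duckdb')
# Write sales as Parquet files partitioned by year/month, compressed with zstd
con.execute("""
    COPY (
        SELECT *, YEAR(sale_date) AS year, MONTH(sale_date) AS month
        FROM sales
    )
    TO 'export/sales' (FORMAT PARQUET, PARTITION_BY (year, month), COMPRESSION 'zstd')
""")
con.close()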
The key: Each upgrade should be motivated by an actual problem you’re experiencing, not theoretical future needs.
Where to go from here:
Add data quality checks: Use Great Expectations or simple SQL assertions
Automate and structure transformations: Use something like dbt or SQLMesh
Implement incremental loading: Only process new/changed data instead of full reloads (see the sketch after this list)
Set up monitoring: Track query performance, data freshness, pipeline success/failure
Add more data sources: APIs, log files, third-party services
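And for the incremental loading point above, a minimal sketch: it assumes new exports land as separate CSV files (the filename is illustrative) and that sale_date identifies new rows:
import duckdb

con = duckdb.connect('analytics.duckdb')
# Append only rows newer than what's already in the sales table
con.execute("""
    INSERT INTO sales
    SELECT * FROM read_csv('data/sales_2025_01.csv')
    WHERE sale_date > (SELECT COALESCE(MAX(sale_date), DATE '1970-01-01') FROM sales)
""")
con.close()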
Resources to learn more:
Need help implementing this?
At Wolk, we help companies build data infrastructure that matches their actual needs. No over-engineering, no unnecessary complexity—just solid platforms that work. Whether you’re just getting started or need to upgrade an existing setup, get in touch.
Want to stay updated?
Subscribe to our newsletter, de Wolkskrant, to get the latest tools, trends and tips from the data engineering world.
This post is Part 2 of our Small Data Platform series. Read Part 1: Why Your Data Stack is Probably Over-Engineered if you haven’t already.