Small Data, Big Impact: Your DuckDB + Superset Starter Guide

In our previous post, we explored why most companies don’t need complex distributed systems for their data needs. The small data movement shows that modern hardware and smart tools can handle far more than we give them credit for.

Now it’s time to build something.

This guide walks you through creating a complete analytics platform using DuckDB and Superset—no Spark clusters, no expensive warehouses, just open-source tools that get the job done. We’ll start with local development on your laptop, then show you how to make it accessible to your entire organization.

What We’re Building

Let’s say you’re working at a company that currently does analytics in Excel spreadsheets and ad-hoc Python scripts. This doesn’t scale well: Excel struggles as soon as the data grows a little, and Python scripts quickly become messy when you’re trying to explore data visually. You have:

  • Sales data exported monthly as CSV files

  • Customer data in a Postgres database

  • Product usage logs in Excel

  • A team of 5-10 people who need dashboards

Our goal: Build a platform where you can:

  1. Combine data from multiple sources

  2. Transform it with SQL

  3. Create interactive dashboards

  4. Share with colleagues without everyone needing Python knowledge

The architecture (starting simple):

Data Sources (CSV, Excel, Databases)
    ↓
DuckDB (transformation & query)
    ↓
Superset (visualization)

Later, we’ll evolve this to support multiple users safely.

Phase 1: Local Development

Let’s start on your laptop. This is where you’ll develop your transformations and build your first dashboards.

Setting Up Your Environment

First, create a project directory and install DuckDB:

# Create project directory
mkdir small-data-platform
cd small-data-platform

# Create a Python virtual environment with uv
uv init 

# Install DuckDB
uv add duckdb

If you don’t have uv installed yet, read here why we recommend it and here how to install it.

Note: DuckDB also works as a standalone CLI tool (just download the binary) or even in your browser. But we’re using Python here since most data folks are already comfortable with it.

Loading Your First Data

Let’s load some CSV data. Create a file called load_data.py:

import duckdb

# Create a persistent database file
# This file will store your tables and can be reused
con = duckdb.connect('analytics.duckdb')

# Load CSV data directly - DuckDB infers the schema
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM read_csv('data/sales_2024.csv')
""")

# Load Excel files (yes, DuckDB can read Excel!)
# st_read comes from the spatial extension, so install and load it first
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("""
    CREATE TABLE product_usage AS
    SELECT * FROM st_read('data/usage_logs.xlsx')
""")

# Connect to your existing Postgres database
con.execute("""
    INSTALL postgres;
    LOAD postgres;

    CREATE TABLE customers AS
    SELECT * FROM postgres_scan(
        'host=localhost port=5432 dbname=production user=readonly',
        'public',
        'customers'
    )
""")

con.close()

print("✅ Data loaded successfully!")

Run it:

uv run load_data.py

You now have an analytics.duckdb file containing all your data. This file is a complete database that you can version control, back up, or share with colleagues.
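
If you want a quick sanity check that everything landed, a short snippet like this lists the tables and row counts (the table names assume the load_data.py script above):

import duckdb

con = duckdb.connect('analytics.duckdb', read_only=True)

# List the tables created by load_data.py
print(con.execute("SHOW TABLES").df())

# Spot-check the row counts
for table in ["sales", "product_usage", "customers"]:
    rows = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {rows} rows")

con.close()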

Writing Transformations

Now let’s transform this data into something useful. Create transformations.py:

import duckdb

con = duckdb.connect('analytics.duckdb')

# Create a view for monthly revenue by customer segment
con.execute("""
    CREATE OR REPLACE VIEW monthly_revenue AS
    SELECT
        DATE_TRUNC('month', s.sale_date) as month,
        c.segment,
        c.region,
        COUNT(DISTINCT s.customer_id) as active_customers,
        SUM(s.amount) as total_revenue,
        AVG(s.amount) as avg_order_value
    FROM sales s
    JOIN customers c ON s.customer_id = c.id
    GROUP BY 1, 2, 3
    ORDER BY 1 DESC, 2
""")

# Create a view for product usage metrics
con.execute("""
    CREATE OR REPLACE VIEW product_metrics AS
    SELECT
        product_id,
        DATE_TRUNC('week', event_date) as week,
        COUNT(DISTINCT user_id) as weekly_active_users,
        COUNT(*) as total_events,
        COUNT(*) / COUNT(DISTINCT user_id) as events_per_user
    FROM product_usage
    GROUP BY 1, 2
""")

# Create a customer summary table for dashboarding
con.execute("""
    CREATE OR REPLACE TABLE customer_summary AS
    SELECT
        c.id,
        c.name,
        c.segment,
        c.region,
        COUNT(s.id) as total_orders,
        SUM(s.amount) as lifetime_value,
        MAX(s.sale_date) as last_purchase_date,
        DATEDIFF('day', MAX(s.sale_date), CURRENT_DATE) as days_since_purchase
    FROM customers c
    LEFT JOIN sales s ON c.id = s.customer_id
    GROUP BY 1, 2, 3, 4
""")

con.close()

print("✅ Transformations complete!")

Why views vs tables?

  • Views are computed on-the-fly when queried (good for logic that changes frequently)

  • Tables are materialized and stored (good for expensive computations you’ll query often; see the short sketch below)
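
If a view ever becomes too slow to recompute on every dashboard refresh, materializing it is a one-liner. A minimal sketch (the monthly_revenue view comes from the script above; the snapshot table name is just an example):

import duckdb

con = duckdb.connect('analytics.duckdb')

# Materialize the view into a stored table; rerun this whenever you refresh the data
con.execute("""
    CREATE OR REPLACE TABLE monthly_revenue_snapshot AS
    SELECT * FROM monthly_revenue
""")

con.close()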

Run your transformations:

uv run transformations.py

Querying Your Data

Now you can query your transformed data. Create query.py:

import duckdb

con = duckdb.connect("analytics.duckdb", read_only=True)

# Get revenue by segment from July-December 2024
df = con.execute("""
    SELECT
        month,
        segment,
        region,
        total_revenue,
        active_customers
    FROM monthly_revenue
    WHERE month >= '2024-07-01' AND month < '2025-01-01'
    ORDER BY total_revenue DESC, month DESC
""").df()

print(df.head())

con.close()

The .df() at the end converts the result to a Pandas DataFrame, which is handy for plotting or further analysis.
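
From here you can reach for the usual Pandas tooling. For example, a small sketch that plots revenue per month per segment (this assumes matplotlib is installed, e.g. via uv add matplotlib, and reuses the df from query.py):

import matplotlib.pyplot as plt

# One line per segment, months on the x-axis
pivot = df.pivot_table(index="month", columns="segment", values="total_revenue", aggfunc="sum")
pivot.plot(marker="o", title="Monthly revenue by segment")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()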

Installing Superset Locally

Now let’s visualize this data. We’ll install Superset using Docker, the easiest way. Create a file named docker-compose.yml with the following content:

name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=superset
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=superset
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
volumes:
  superset_postgres_data:
configs:
  superset_config.py:
    content: |
      SECRET_KEY = "your_super_duper_secret_key_here"
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:superset@db:5432/superset"

This file describes a minimal Superset setup with the dependencies required for visualizing data from DuckDB.

Now run the following commands:

docker compose up --build -d
# you might need to wait a bit until it starts, and then run
docker exec -it superset /app/docker/docker-init.sh

Open your browser to http://localhost:8088 and log in with admin/admin.

Connecting Superset to DuckDB

In Superset, you’ll need to add a database connection:

  1. Click Settings → Database Connections → + Database

  2. Select “DuckDB” as the database type and click on “Connect this database with a SQLAlchemy URI string instead”

  3. Enter this connection string:

    duckdb:////app/analytics.duckdb
    
  4. Click Connect

Creating Your First Chart

Now you can create visualizations:

  1. Go to SQL Lab → SQL Editor

  2. Write a query against your data:

    SELECT
        month,
        segment,
        total_revenue
    FROM monthly_revenue
    ORDER BY month DESC
    LIMIT 100
    
  3. Click Save → Save as Chart

  4. Choose a visualization type (e.g., “Line Chart”)

  5. Configure your chart axes and save it

You now have a working analytics setup on your laptop! You can develop transformations, query data, and build dashboards—all locally.

Phase 2: Making It Shareable

Your local setup works great for you, but what about your colleagues (or family, maybe)? They need access to the dashboards without installing DuckDB and Python.

You have two main paths forward:

The Managed Path: MotherDuck + Preset

MotherDuck is a managed DuckDB service (think “DuckDB in the cloud”). Preset is managed Superset.

This path is the fastest to get started:

  1. Sign up for MotherDuck (free tier available)

  2. Upload your analytics.duckdb file or connect your data sources

  3. Sign up for Preset (free tier available)

  4. Connect Preset to your MotherDuck database

  5. Share dashboard links with your team

When this makes sense:

  • You want to get started in an afternoon

  • You don’t want to manage infrastructure

  • Your data governance requirements are met by US-based services

  • Budget allows for roughly $100+/month depending on usage (though likely much less if you’re sharing with fewer than 5 people)

When to consider alternatives:

  • Data sovereignty requirements (must stay in EU)

  • Cost sensitivity (self-hosting is cheaper at scale)

  • You want full control over your stack

Let’s focus on the self-hosted path, which gives you more control and can run on European infrastructure.

Phase 3: Self-Hosted on a VM

This is where things get interesting. We’ll set up a shared analytics platform that multiple users can access simultaneously, while keeping costs reasonable and data in your preferred region.

Architecture Overview

Here’s what we’re building:

Your Laptop (Development)
    ↓ (write SQL transformations)
    ↓ (run locally, push results)
DuckLake on VM
    ├── Postgres (metadata)
    ├── OVH Object Storage (actual data)
    └── DuckDB (query engine)
        ↓
    Superset (visualization)

Why this architecture?

  • DuckLake lets multiple users query the same data safely using a Postgres metadata layer

  • Object Storage (like S3) is cheap, reliable, and scales automatically

  • Separation of storage and compute means you don’t pay for idle resources

Why European Cloud Providers?

You might be less comfortable with hyperscalers these days, whether due to data sovereignty concerns, GDPR compliance, or simply wanting to support European infrastructure. Good news: this entire stack runs perfectly on providers like OVH, Hetzner, or Stackit. A brief comparison written by Stijn can be found here: https://www.wolk.work/blog/posts/european-cloud-providers-compared-a-hands-on-terraform-test-january-2026-

We’ll use OVH in our examples, but the setup is nearly identical across providers. (Stijn also wrote a short guide on how to use OVH with Terraform; find it here.)

Understanding DuckLake

DuckLake is an open-source data lake format that uses:

  • A Postgres database for metadata (table schemas, file locations)

  • Object storage for the actual data (Parquet files)

  • DuckDB as the query engine

Unlike Iceberg or Delta Lake, DuckLake is dead simple to set up and designed specifically for DuckDB. It allows multiple users to query and write to the same data lake without conflicts.

Setting Up Your VM

First, provision a VM. On OVH Cloud:

  • Instance type: B2-15 (4 vCPU, 15GB RAM) ~€30/month

  • Operating System: Ubuntu 22.04

  • Network: Ensure port 8088 is open for Superset

SSH into your VM:

ssh ubuntu@your-vm-ip

Setting Up Object Storage

Create an OVH Object Storage bucket:

  1. Log into OVH Cloud

  2. Navigate to Public Cloud → Object Storage

  3. Click Create an Object Storage

  4. Choose your region (e.g., GRA for Gravelines, France)

  5. Name it small-data-platform

  6. Create S3 credentials (access key + secret key)

Save these credentials—you’ll need them.

Cost: ~€0.01/GB/month for storage, plus minimal egress (~€0.01/GB for downloads)

For a 100GB data lake: ~€1/month storage + ~€5/month bandwidth = €6/month total.

Installing Dependencies on the VM

# Update system
sudo apt update && sudo apt upgrade -y

# Install Docker and the Compose plugin (provides the docker compose command used below)
sudo apt install -y docker.io docker-compose-v2

# Start Docker
sudo systemctl enable docker
sudo systemctl start docker

# Add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in for this to take effect

Creating the Docker Compose Setup

Create a directory for your setup:

mkdir ~/small-data-platform
cd ~/small-data-platform

Create a docker-compose.yml file (we’ll reuse the local one and add a database for the DuckLake metadata):

name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - SUPERSET_SECRET_KEY=${SUPERSET_SECRET_KEY}
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=superset
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
    restart: unless-stopped
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=${SUPERSET_DB_PASSWORD}
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
    restart: unless-stopped
  ducklake-metadata:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=ducklake
      - POSTGRES_PASSWORD=${DUCKLAKE_DB_PASSWORD}
      - POSTGRES_DB=ducklake
    volumes:
      - ducklake-metadata-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  superset_postgres_data:
  ducklake-metadata-data:
configs:
  superset_config.py:
    content: |
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:${SUPERSET_DB_PASSWORD}@db:5432/superset"

Create a .env file with your passwords:

# Generate secure passwords
DUCKLAKE_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_SECRET_KEY=$(openssl rand -base64 42)

# Save them to .env
cat > .env << EOF
DUCKLAKE_DB_PASSWORD=$DUCKLAKE_DB_PASSWORD
SUPERSET_DB_PASSWORD=$SUPERSET_DB_PASSWORD
SUPERSET_SECRET_KEY=$SUPERSET_SECRET_KEY
EOF

echo "✅ Credentials generated and saved to .env"

Start the services:

docker compose up -d

Initialize Superset:

docker exec -it superset /app/docker/docker-init.sh

Setting Up DuckLake

On your local machine, install the DuckLake extension and initialize the lake (fill in the appropriate secrets):

import duckdb

con = duckdb.connect('analytics.duckdb')

# Install DuckLake extension
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Configure connection to your VM's Postgres and OVH Object Storage
con.execute("""
   
CREATE OR REPLACE PERSISTENT SECRET s3_ducklake (
  TYPE s3,
  PROVIDER config,
  KEY_ID 'your-ovh-access-key',
  SECRET 'your-ovh-secret-key',
  REGION 'gra',
  ENDPOINT 's3.gra.io.cloud.ovh.net'
);

-- Using the DuckLake credentials for Postgres
CREATE OR REPLACE PERSISTENT SECRET postgres_secret (
  TYPE postgres,
  HOST 'your-vm-ip',
  DATABASE 'ducklake',
  USER 'ducklake',
  PASSWORD 'your-DUCKLAKE_DB_PASSWORD'
);

-- DuckLake config secret
CREATE OR REPLACE PERSISTENT SECRET ducklake_secret (
  TYPE ducklake,
  METADATA_PATH '',
  DATA_PATH 's3://small-data-platform/lake',
  METADATA_PARAMETERS MAP {'TYPE': 'postgres','SECRET': 'postgres_secret'}
);

-- This initializes DuckLake
ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;
""")

con.close()

Pushing Data to DuckLake

Now you can push your transformed data to the shared lake:

import duckdb

con = duckdb.connect('analytics.duckdb')

# Load DuckLake and attach
con.execute("LOAD ducklake")
con.execute("""ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;""")

# Copy your tables to DuckLake
con.execute("""
    CREATE OR REPLACE TABLE my_ducklake.monthly_revenue AS
    SELECT * FROM analytics.monthly_revenue
""")

con.execute("""
    CREATE OR REPLACE TABLE my_ducklake.customer_summary AS
    SELECT * FROM analytics.customer_summary
""")

con.execute("""
    CREATE OR REPLACE TABLE my_ducklake.product_metrics AS
    SELECT * FROM analytics.product_metrics
""")

print("✅ Data pushed to DuckLake!")

con.close()

Your data is now stored in OVH Object Storage, with metadata in Postgres. Anyone with access can query it.
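
Colleagues who prefer notebooks over dashboards can attach the same lake straight from their own laptop. A sketch under the same assumptions as above (same OVH credentials, VM address, and DUCKLAKE_DB_PASSWORD; fill in your own values):

import duckdb

# An in-memory connection is enough for reading the shared lake
con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")
con.execute("INSTALL postgres; LOAD postgres;")

# Same S3 credentials as the s3_ducklake secret created earlier
con.execute("""
    CREATE SECRET (
        TYPE s3,
        KEY_ID 'your-ovh-access-key',
        SECRET 'your-ovh-secret-key',
        REGION 'gra',
        ENDPOINT 's3.gra.io.cloud.ovh.net'
    )
""")

# Metadata comes from Postgres on the VM, data from the OVH bucket
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake host=your-vm-ip user=ducklake password=your-DUCKLAKE_DB_PASSWORD'
    AS my_ducklake (DATA_PATH 's3://small-data-platform/lake')
""")

print(con.execute("""
    SELECT segment, SUM(total_revenue) AS revenue
    FROM my_ducklake.monthly_revenue
    GROUP BY 1
""").df())

con.close()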

Connecting Superset to DuckLake

Now you need to set up Superset on the VM to query your DuckLake. (Kudos to the guide from Danial Lewis: https://medium.com/data-analytics-at-nesta/ducklake-on-apache-superset-d1b1b00c222a. I did have to fork the duckdb-engine repo to fix some DuckLake compatibility issues and publish it under the name duckdb-driver.)

Now in Superset’s UI:

  1. Go to Settings → Database Connections → + Database

  2. Select “Other”

  3. Enter this SQLAlchemy URI:

    duckdb:///ducklake:postgres:dbname=ducklake host=ducklake-metadata user=ducklake password={your-ducklake-db-password}
    
    
  4. In the Advanced tab → SQL Lab → SQL Alchemy Parameters, add:

    {
      "connect_args": {
        "config": {
          "s3_access_key_id": "your-ovh-access-key",
          "s3_secret_access_key": "your-ovh-secret-key",
          "s3_endpoint": "s3.gra.cloud.ovh.net",
          "s3_url_style": "path",
          "s3_use_ssl": false
          "s3_region": "gra"
        }
      }
    }
    
    
  5. It’s connected! You can check the available tables in the SQL Lab tab or immediately create a dataset from one.

Awesome, you did it! You made it to the end of this very long piece of text, and all your small data troubles are now over. I hope this helps a bit in showing what’s possible with modern technology.

Now you can create user accounts for your colleagues and build your first charts and dashboards!
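
You can do this in the UI under Settings → List Users, or script it with Superset’s bundled FAB CLI. A hedged example (Gamma is Superset’s built-in read-mostly role; the names, email, and password here are placeholders):

# Create a read-mostly account for a colleague
docker exec -it superset superset fab create-user \
    --username colleague \
    --firstname Jane \
    --lastname Doe \
    --email colleague@example.com \
    --role Gamma \
    --password 'pick-a-strong-password'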

Accessing Superset Safely

Your Superset instance is now running at http://your-vm-ip:8088. However, exposing it directly to the internet isn’t secure (especially if you keep the admin/admin login 😅), so please don’t do this with data you’re uncomfortable sharing with the world!

There are some options to make it a bit more secure:

  • Host it on a VM on your internal network (if you have one)

  • Use a VPN service like Tailscale so the instance doesn’t need to be opened up to the internet (free for personal use!)

Next Steps

You now have a complete small data platform that:

  • Runs on your laptop for development

  • Scales to multiple users on a shared VM

  • Keeps data in Europe (if you chose EU cloud)

  • Costs ~€30/month (not $3000/month)

  • Uses open-source tools you control

Migration Path: When to Upgrade

This stack will take you surprisingly far. But you’ll eventually outgrow it. Here are the warning signs and what to do:

Warning Sign 1: Queries are getting slow (>30 seconds)

  • Solution: Upgrade your VM to more CPUs and RAM (vertical scaling)

  • When that’s not enough: DuckDB runs on a single machine, so consider splitting workloads across several machines or migrating hot tables to a proper data warehouse like ClickHouse

Warning Sign 2: Multiple people running transformations causes conflicts

  • Solution: Move transformation execution to the VM (one designated machine)

  • When that’s not enough: Add a workflow orchestrator (Dagster/Prefect)

Warning Sign 3: Data is growing past 1TB

  • Solution: Partition your Parquet files by date and use columnar compression (see the short sketch below)

  • When that’s not enough: Consider Apache Iceberg tables with proper partitioning, or migrate to Databricks/Snowflake/ClickHouse
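
As a reference for the partitioning idea above, DuckDB can write partitioned, compressed Parquet directly. A sketch that reuses the persistent S3 secret created earlier (the bucket path and the derived year/month columns are illustrative):

import duckdb

con = duckdb.connect('analytics.duckdb')

# Write sales out as Parquet, split into year/month folders, with zstd compression
con.execute("""
    COPY (
        SELECT *, YEAR(sale_date) AS year, MONTH(sale_date) AS month
        FROM sales
    )
    TO 's3://small-data-platform/sales_partitioned'
    (FORMAT parquet, PARTITION_BY (year, month), COMPRESSION zstd)
""")

con.close()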

The key: Each upgrade should be motivated by an actual problem you’re experiencing, not theoretical future needs.

Where to go from here:

  1. Add data quality checks: Use Great Expectations or simple SQL assertions

  2. Automate and structure transformations: Use something like dbt or SQLMesh

  3. Implement incremental loading: Only process new/changed data instead of full reloads (a small sketch follows after this list)

  4. Set up monitoring: Track query performance, data freshness, pipeline success/failure

  5. Add more data sources: APIs, log files, third-party services
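
To give a flavour of point 3, here is a minimal incremental-load sketch that only appends sales rows newer than what the table already contains (the new CSV path is a placeholder):

import duckdb

con = duckdb.connect('analytics.duckdb')

# Append only rows newer than the latest sale_date already loaded
con.execute("""
    INSERT INTO sales
    SELECT * FROM read_csv('data/sales_new.csv')
    WHERE sale_date > (SELECT COALESCE(MAX(sale_date), DATE '1970-01-01') FROM sales)
""")

con.close()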

Resources to learn more:


Need help implementing this?

At Wolk, we help companies build data infrastructure that matches their actual needs. No over-engineering, no unnecessary complexity—just solid platforms that work. Whether you’re just getting started or need to upgrade an existing setup, get in touch.

Want to stay updated?

Subscribe to our newsletter, de Wolkskrant, to get the latest tools, trends and tips from the data engineering world.


This post is Part 2 of our Small Data Platform series. Read Part 1: Why Your Data Stack is Probably Over-Engineered if you haven’t already.

