doppa-data-contribution is the data-publishing stage of the doppa benchmarking system. It is a set of Jupyter notebooks that download raw Norwegian building footprints and administrative boundaries from public sources, normalize them to a shared GeoParquet schema, and upload the result to Azure Blob Storage. The doppa benchmarking framework consumes these published datasets as its inputs.
Keeping data preparation in a separate repository means the benchmark framework never has to re-derive its inputs: every engine under test (DuckDB, PostGIS, GeoPandas + Shapefile, Apache Sedona) reads the exact same bytes, written once here and conflated/synthesized downstream into the small/medium/large tiers.
The thesis (TBA4925, NTNU Geomatics), which benchmarks cloud-native versus traditional geospatial technologies on Azure, spans three repositories that depend on each other in a pipeline:

    doppa-data-contribution ──► doppa ─────────► doppa-analytics
    (this repo)                 (benchmarks)     (stats + figures)
    publish datasets            run engines      read results, build
    to blob storage             write metrics    thesis figures/tables

| Repository | Role | Link |
|---|---|---|
| doppa-data-contribution | Downloads and normalizes source datasets; publishes GeoParquet to blob storage (this repo) | — |
| doppa | Reproducible benchmarking framework. Consumes the published datasets, runs the spatial-query benchmarks on Azure, writes per-iteration samples and cost rows back to blob storage | kartAI/doppa |
| doppa-analytics | Reads the benchmark result Parquet, runs the statistical tests, and generates the thesis figures and LaTeX tables | kartAI/doppa-analytics |
Each notebook is an independent contribution: download → normalize to the schema → upload one Parquet blob. They can be run in any order, and each is idempotent (the upload overwrites the existing blob).
| Notebook | Dataset | Source | Output blob |
|---|---|---|---|
| 01-osm-contribution.ipynb | OpenStreetMap building footprints (Norway) | Geofabrik norway-latest.osm.pbf | osm.parquet |
| 02-fkb-contribution.ipynb | FKB cadastral building polygons (selected Norwegian cities) | kartai/DX_datasett on Hugging Face (FlatGeobuf) | fkb.parquet |
| 03-microsoft-buildings-contribution.ipynb | Microsoft Global ML Building Footprints (Norway) | Microsoft GlobalMLBuildingFootprints | microsoft.parquet |
| 04-kommuner-contribution.ipynb | Norwegian municipality polygons (357 features) | Administrative enheter, kommuner (Geonorge) | municipalities.parquet |
OSM (01) streams the national PBF extract, keeps the building features and a fixed set of attribute
columns (OSM_COLUMNS_TO_KEEP in src/config.py), and processes them in batches
(OSM_FEATURE_BATCH_SIZE) to keep memory bounded.
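
The batching idea can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: `iter_batches`, the fake feature generator, and the batch size of 3 are all hypothetical, and the real value of `OSM_FEATURE_BATCH_SIZE` lives in src/config.py.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Hypothetical value for the demo; the real setting is defined in src/config.py.
OSM_FEATURE_BATCH_SIZE = 3


def iter_batches(features: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items from features."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(features)
    # islice pulls only batch_size items at a time, so one batch is held in memory.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Stand-in for a streamed feature source: a generator, so nothing is loaded up front.
fake_features = ({"id": i, "building": "yes"} for i in range(7))
batch_sizes = [len(batch) for batch in iter_batches(fake_features, OSM_FEATURE_BATCH_SIZE)]
print(batch_sizes)  # [3, 3, 1]
```

Because the source is consumed lazily, peak memory depends on the batch size rather than on the size of the national extract.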
FKB (02) downloads per-city FlatGeobuf archives, unzips the relevant building layers (Bygning,
AnnenBygning, Takkant, Bygningsdelelinje, FiktivBygningsavgrensning), and reconstructs closed
building polygons from the edge layers with DuckDB's ST_Polygonize.
Microsoft Buildings (03) reads the dataset-links.csv index, filters to the Norway tiles, and converts
the newline-delimited GeoJSON to Parquet via DuckDB, deriving a stable external_id as the MD5 of the geometry.
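
To illustrate why a geometry hash gives a stable identifier, here is a small Python sketch. It is an assumption-laden stand-in: the notebook derives the ID inside DuckDB, and which serialization of the geometry is hashed there is not stated here. The sketch assumes the WKB bytes are hashed, and `external_id_from_wkb` is a hypothetical helper.

```python
import hashlib
import struct


def external_id_from_wkb(wkb: bytes) -> str:
    """Return the MD5 hex digest of a geometry's WKB bytes (assumed input form)."""
    return hashlib.md5(wkb).hexdigest()


def point_wkb(x: float, y: float) -> bytes:
    """Build little-endian WKB for a 2D point: byte order, type code 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


first = external_id_from_wkb(point_wkb(10.75, 59.91))
again = external_id_from_wkb(point_wkb(10.75, 59.91))
other = external_id_from_wkb(point_wkb(10.75, 59.92))

# Identical geometry bytes give the same ID on every run; a changed vertex gives a new one.
print(first == again, first == other)  # True False
```

The ID is only as stable as the byte representation: if the source republishes a footprint with different coordinates or vertex order, the hash changes with it.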
Kommuner (04) downloads the municipality GeoJSON from Geonorge, strips the UTF-8 BOM before handing the
bytes to GeoPandas, and writes region (kommunenummer), name (kommunenavn), and wkb (geometry as WKB).
The region column mirrors counties.parquet in doppa so the spatial-join entrypoints need no schema changes.
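
The BOM handling and column mapping can be sketched with the standard library alone. This is an illustration under assumptions: the sample payload is invented, the property keys `kommunenummer` and `kommunenavn` are taken from the column description above rather than from the Geonorge file itself, and the notebook hands the cleaned bytes to GeoPandas instead of `json`.

```python
import codecs
import json


def strip_utf8_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark if present; otherwise return raw unchanged."""
    return raw.removeprefix(codecs.BOM_UTF8)


# Invented one-feature payload with a BOM in front, standing in for the download.
sample = codecs.BOM_UTF8 + json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"kommunenummer": "0301", "kommunenavn": "Oslo"},
        "geometry": {"type": "Point", "coordinates": [10.75, 59.91]},
    }],
}).encode("utf-8")

collection = json.loads(strip_utf8_bom(sample).decode("utf-8"))

# Map source properties onto the published column names (geometry omitted for brevity).
rows = [
    {"region": f["properties"]["kommunenummer"], "name": f["properties"]["kommunenavn"]}
    for f in collection["features"]
]
print(rows)  # [{'region': '0301', 'name': 'Oslo'}]
```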
All building contributions conform to the doppa schema published at
https://doppablobstorage.blob.core.windows.net/schema/latest/schema.yml (DOPPA_SCHEMA_PATH in
src/config.py). The data owner is free to carry extra attributes as long as the required schema columns are
present, so OSM/FKB/Microsoft can each preserve source-specific fields while remaining conflatable downstream.
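
The "required columns plus optional extras" rule amounts to a subset check. The sketch below is hypothetical throughout: the column names are placeholders chosen for the example, and the authoritative list is whatever schema.yml declares.

```python
from typing import Iterable

# Placeholder names for illustration only; read the real list from schema.yml.
REQUIRED_COLUMNS = {"external_id", "geometry"}


def missing_required_columns(columns: Iterable[str], required: Iterable[str]) -> list[str]:
    """Return the required column names that are absent from columns, sorted."""
    return sorted(set(required) - set(columns))


# Extra source-specific columns are fine; only missing required ones are reported.
osm_like = ["external_id", "geometry", "building", "name"]
incomplete = ["geometry", "height"]

print(missing_required_columns(osm_like, REQUIRED_COLUMNS))    # []
print(missing_required_columns(incomplete, REQUIRED_COLUMNS))  # ['external_id']
```

Running a check like this before upload would catch a non-conforming contribution before it reaches the downstream conflation step.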
Every notebook uploads its Parquet to the contributions container of the configured storage account, named
after the dataset (osm.parquet, fkb.parquet, microsoft.parquet, municipalities.parquet). Uploads use
overwrite=True, so re-running a notebook republishes that dataset in place.
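
A minimal sketch of the republish-in-place behaviour follows. `publish_parquet` and `InMemoryContainer` are hypothetical names; the in-memory class is a test double so the example runs without an Azure account, and it raises `FileExistsError` where the real SDK would raise its own error type. With the `azure-storage-blob` package, a real client could be obtained via `BlobServiceClient.from_connection_string(...).get_container_client("contributions")`.

```python
import tempfile
from pathlib import Path
from typing import Any


def publish_parquet(container_client: Any, local_path: Path, blob_name: str) -> None:
    """Upload local_path as blob_name, replacing any existing blob of that name."""
    with local_path.open("rb") as data:
        container_client.upload_blob(name=blob_name, data=data, overwrite=True)


class InMemoryContainer:
    """Test double that mimics the overwrite semantics of a blob container."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def upload_blob(self, name: str, data: Any, overwrite: bool = False) -> None:
        if name in self.blobs and not overwrite:
            raise FileExistsError(name)
        self.blobs[name] = data.read()


container = InMemoryContainer()
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "municipalities.parquet"
    # Two runs of the same notebook: the second upload replaces the first.
    path.write_bytes(b"first run")
    publish_parquet(container, path, path.name)
    path.write_bytes(b"second run")
    publish_parquet(container, path, path.name)

print(container.blobs)  # {'municipalities.parquet': b'second run'}
```

Because the blob name is fixed per dataset, consumers always read the latest published version and never need to discover a new path.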
- Building footprints: doppa's TestDatasetService conflates the OSM and FKB contributions into the small (~5M row) benchmark dataset, then DatasetSynthesisService clones it into the medium (~40M) and large (~100M) tiers.
- Municipality boundaries: doppa's setup_benchmarking_framework copies municipalities.parquet from contributions to the metadata container, where the RQ2 national-scale spatial join reads it. Running 04-kommuner-contribution once is a prerequisite for that benchmark; if the blob is missing, the doppa setup step fails fast with an actionable error pointing back here.
Clone the repository and create a virtual environment:

    git clone https://github.com/kartAI/doppa-data-contribution.git
    cd doppa-data-contribution
    python -m venv .venv               # Create virtual environment
    source .venv/bin/activate          # Activate venv (Linux/macOS)
    # .\.venv\Scripts\Activate.ps1     # Activate venv (Windows PowerShell)
    pip install -r requirements.txt    # Install dependencies

Add a .env file to the project root. The contribution targets the same storage account that doppa benchmarks against.

    AZURE_BLOB_STORAGE_CONNECTION_STRING=<azure-blob-storage-connection-string>
    HUGGING_FACE_API_TOKEN=<hugging-face-api-token>

- AZURE_BLOB_STORAGE_CONNECTION_STRING: connection string for the storage account that owns the contributions container.
- HUGGING_FACE_API_TOKEN: token for the kartai/DX_datasett Hugging Face dataset used by the FKB notebook.
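
Before running a notebook it can help to confirm that both settings are present. The check below is a hypothetical helper, not part of the repository; it assumes the variables have already been loaded into the process environment (how the notebooks read the .env file is not covered here).

```python
import os
from typing import Mapping

REQUIRED_SETTINGS = (
    "AZURE_BLOB_STORAGE_CONNECTION_STRING",
    "HUGGING_FACE_API_TOKEN",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required setting names that are unset or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Demo with explicit mappings instead of the real environment.
complete = {
    "AZURE_BLOB_STORAGE_CONNECTION_STRING": "<connection-string>",
    "HUGGING_FACE_API_TOKEN": "<token>",
}
partial = {"AZURE_BLOB_STORAGE_CONNECTION_STRING": "<connection-string>"}

print(missing_settings(complete))  # []
print(missing_settings(partial))   # ['HUGGING_FACE_API_TOKEN']
```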
Launch Jupyter (or run a single notebook headless) and execute the cells top to bottom:

    jupyter lab                                      # interactive
    # or run one notebook end to end:
    jupyter execute 04-kommuner-contribution.ipynb

Each notebook downloads its source into data/input/, writes the normalized Parquet to data/output/, and then uploads it to the contributions blob container.

- OpenStreetMap — Norway extract via Geofabrik. © OpenStreetMap contributors, ODbL.
- FKB building data — kartai/DX_datasett (Kartverket FKB, redistributed for the DX project).
- Microsoft Global ML Building Footprints — microsoft/GlobalMLBuildingFootprints (ODbL).
- Administrative units (municipalities) — Geonorge, Kartverket (NLOD / CC BY 4.0).