cloudify is a modular application designed to serve Earth System Model (ESM) simulation output as cloud-optimized datasets. Built on a customized version of the xpublish framework and extended with plugins, cloudify exposes data as zarr endpoints through a RESTful API powered by FastAPI. By supporting multiple storage backends and virtual zarr datasets, cloudify enables efficient, scalable access to data from heterogeneous sources via a unified interface. This approach enhances the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of ESM output and closely mimics the behavior of truly cloud-native datasets (Abernathey, 2021).
We differentiate between data providers and data consumers.
With dask support, Data as a service (DaaS) is enabled. It has to be adjusted with fastapi's async Threadpool.
- Open xarray datasets within the main function. This avoids infinite subprocess creation.
- Create a dask cluster with sufficient resources. You can use the
get_dask_clusterfunction fromcloudify.utils.daskhelper. - Set the environment variable ZARR_ADDRESS to the scheduler address. This will be used by the
/zarr-API to calculated chunks.
Minimal Example
from cloudify.utils.daskhelper import get_dask_cluster
import asyncio
import os
import xpublish as xp
import xarray as xr
import nest_asyncio
nest_asyncio.apply()
if __name__ == "__main__": #
import dask
zarrcluster = asyncio.get_event_loop().run_until_complete(get_dask_cluster())
os.environ["ZARR_ADDRESS"] = zarrcluster.scheduler._address
dsdict={}
dsdict["test"] = xr.open_dataset(DATASET,chunks="auto")
collection = xp.Rest(dsdict)
collection.serve(host="0.0.0.0", port=9000)All endpoints can be listed programmatically with python:
import requests
fast=requests.get(SERVER_URL+"/openapi.json").json()
print(list(fast["paths"].keys()))Default xpublish dataset plugin endpoints:
'/datasets',
'/datasets/{dataset_id}/',
'/datasets/{dataset_id}/keys',
'/datasets/{dataset_id}/dict',
'/datasets/{dataset_id}/info'Zarr Raw. Self developed. Endpoints:
import xarray as xr
xr.open_dataset(
f"{SERVER_URL}/datasets/{dataset_id}/kerchunk",
engine="zarr",
chunks="auto"
) Zarr Dask backed. Default plugin. Endpoints:
import xarray as xr
xr.open_dataset(
f"{SERVER_URL}/datasets/{dataset_id}/zarr",
engine="zarr",
chunks="auto"
) STAC
The cloudify.stac package generates STAC items and collections from xarray
datasets. It works in two modes:
Standalone — generate STAC items offline without running a server:
import xarray as xr
from cloudify.stac import build_stac_item, load_config
ds = xr.open_zarr("s3://my-bucket/model.zarr", consolidated=True)
config = load_config("cloudify/stac/configs/default.json")
item = build_stac_item(ds, "model-v1", config,
assets={"data": "s3://my-bucket/model.zarr"})Icechunk (regular) — chunk data lives directly in the repo:
import icechunk, xarray as xr, pystac
from cloudify.stac import build_stac_item_from_icechunk
repo = icechunk.Repository.open(icechunk.s3_storage(...))
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
item = build_stac_item_from_icechunk(
ds,
item_id="my-dataset",
icechunk_href="s3://my-bucket/my-repo/",
snapshot_id=session.snapshot_id,
storage_schemes={"aws-s3-my-bucket": {"type": "aws-s3", "bucket": "my-bucket",
"region": "us-east-1", "anonymous": False}},
virtual=False,
)Icechunk (virtual) — chunk data is referenced from external source files.
Pass virtual_hrefs so that xr.open_dataset(asset) via
xpystac can reconstruct the
VirtualChunkContainer config automatically:
item = build_stac_item_from_icechunk(
ds,
item_id="my-virtual-dataset",
icechunk_href="s3://my-bucket/virtual-repo/",
snapshot_id=session.snapshot_id,
storage_schemes={"aws-s3-my-bucket": {"type": "aws-s3", "bucket": "my-bucket",
"region": "us-west-2", "anonymous": True}},
virtual=True,
virtual_hrefs=["s3://my-bucket/source-data/"], # where chunk bytes live
)xpublish plugin — serve STAC items alongside zarr endpoints. Datasets opened
from Icechunk stores are detected automatically via _icechunk_href /
_icechunk_snapshot_id attrs:
import xpublish
from cloudify.stac import Stac
# tag icechunk datasets so the plugin routes them correctly
ds_ice.attrs["_icechunk_href"] = "s3://my-bucket/my-repo/"
ds_ice.attrs["_icechunk_snapshot_id"] = session.snapshot_id
rest = xpublish.Rest(
{"zarr-dataset": ds_zarr, "ice-dataset": ds_ice},
plugins={"stac": Stac(config_path="cloudify/stac/configs/default.json")}
)
rest.serve()
# GET /datasets/zarr-dataset/stac → STAC item, asset points at /zarr endpoint
# GET /datasets/ice-dataset/stac → STAC item, asset points at original S3 repo
# GET /stac-collection-all.json → STAC collection listing all datasetsFor institution-specific deployments (e.g. EERIE/DKRZ) use
config_path="cloudify/stac/configs/eerie.json".
Example notebooks (verified against live public stores):
| Notebook | Dataset | Store type |
|---|---|---|
examples/noaa_gfs_stac.ipynb |
NOAA GFS Forecast (dynamical.org) | Regular Icechunk, lat/lon |
examples/noaa_hrrr_stac.ipynb |
NOAA HRRR 48-Hour Forecast (dynamical.org) | Regular Icechunk, Lambert Conformal CRS |
examples/nldas3_virtual_stac.ipynb |
NLDAS-3 Forcing (NASA) | Virtual Icechunk, lat/lon |
Building a static STAC catalog (scripts/build_dynamical_stac.py)
The script auto-discovers all dynamical.org Icechunk stores from the AWS Open Data Registry, builds STAC items for each, and publishes the catalog to S3-compatible storage or GitHub Pages.
# Dry run — build locally only (no upload):
python scripts/build_dynamical_stac.py --no-upload --output-dir /tmp/stac-out
# Publish to S3-compatible storage (e.g. Cloudflare R2):
python scripts/build_dynamical_stac.py \
--catalog-bucket osc-pub \
--catalog-prefix stac/dynamical \
--profile osc-pub-r2 \
--public-domain r2-pub.openscicomp.io
# Publish to GitHub Pages (auto-commit; push manually afterward):
python scripts/build_dynamical_stac.py \
--no-upload \
--public-domain myorg.github.io/myrepo \
--catalog-prefix stac/dynamical \
--github-pages /path/to/local/gh-pages-clone
# Publish to GitHub Pages and auto-push:
python scripts/build_dynamical_stac.py \
--no-upload \
--public-domain myorg.github.io/myrepo \
--catalog-prefix stac/dynamical \
--github-pages /path/to/local/gh-pages-clone \
--github-pages-push
# Publish to both S3 and GitHub Pages in one run:
python scripts/build_dynamical_stac.py \
--public-domain r2-pub.openscicomp.io \
--github-pages /path/to/local/gh-pages-clone \
--github-pages-pushKey behaviours:
- Datasets are discovered automatically — new dynamical.org stores appear without any script changes.
- For GitHub Pages: files land at
DIR/catalog-prefix/; the commit is skipped if nothing changed (idempotent re-runs); without--github-pages-pushyou can review before publishing. - GitHub Pages has CORS enabled by default, making it a zero-config alternative to S3 for hosting read-only catalogs.
- The catalog URL printed at the end can be pasted directly into STAC Browser.
Consuming a STAC collection from an xpublish server:
import pystac
stac_collection = pystac.Collection.from_file(
f"{SERVER_URL}/stac-collection-all.json"
)
items = list(stac_collection.get_all_items())Intake
Intake-xpublish plugin.
import intake
dataset_id="test"
api="kerchunk"
xpublish_intake_uri=SERVER_URL+"/intake.yaml"
cat=intake.open_catalog(xpublish_intake_uri)
#xarray dataset:
ds=cat[dataset_id](method=api).to_dask()Server-side processing
Plotting
Self-developed plugin. Endpoints.
import requests
plot=requests.get(f"{SERVER_URL}/datasets/{dataset_id}/plot/{var_name}/{kind}/{cmap}/{selection}").contentDiagnostics
Self-developed plugin. Endpoints:
import requests
data_var_view=requests.get(f"{SERVER_URL}/datasets/{dataset_id}/groupby/{variable_name}/{groupby_coord}/{groupby_action}")