Cloudify

cloudify is a modular application designed to serve Earth System Model (ESM) simulation output as cloud-optimized datasets. Built on a customized version of the xpublish framework and extended with plugins, cloudify exposes data as zarr endpoints through a RESTful API powered by FastAPI. By supporting multiple storage backends and virtual zarr datasets, cloudify enables efficient, scalable access to data from heterogeneous sources via a unified interface. This approach enhances the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of ESM output and closely mimics the behavior of truly cloud-native datasets (Abernathey, 2021).

Technical description

Installation

Usage

We differentiate between data providers and data consumers.

Provisioning

With Dask to apply functions before hosting:

With dask support, Data as a service (DaaS) is enabled. It has to be adjusted with fastapi's async Threadpool.

Open xarray datasets within the main function. This avoids infinite subprocess creation.
Create a dask cluster with sufficient resources. You can use the get_dask_cluster function from cloudify.utils.daskhelper.
Set the environment variable ZARR_ADDRESS to the scheduler address. This will be used by the /zarr-API to calculated chunks.

Minimal Example

from cloudify.utils.daskhelper import get_dask_cluster
import asyncio
import os
import xpublish as xp
import xarray as xr
import nest_asyncio
nest_asyncio.apply()

if __name__ == "__main__":  # 
    import dask
    zarrcluster = asyncio.get_event_loop().run_until_complete(get_dask_cluster())
    os.environ["ZARR_ADDRESS"] = zarrcluster.scheduler._address
    dsdict={}
    dsdict["test"] = xr.open_dataset(DATASET,chunks="auto")
    collection = xp.Rest(dsdict)
    collection.serve(host="0.0.0.0", port=9000)

Consumption

All endpoints can be listed programmatically with python:

import requests
fast=requests.get(SERVER_URL+"/openapi.json").json()
print(list(fast["paths"].keys()))

Metadata info per xarray dataset

Default xpublish dataset plugin endpoints:

 '/datasets',
 '/datasets/{dataset_id}/',
 '/datasets/{dataset_id}/keys',
 '/datasets/{dataset_id}/dict',
 '/datasets/{dataset_id}/info'

Access

Zarr Raw. Self developed. Endpoints:

import xarray as xr
xr.open_dataset(
    f"{SERVER_URL}/datasets/{dataset_id}/kerchunk",
    engine="zarr",
    chunks="auto"
)

Zarr Dask backed. Default plugin. Endpoints:

import xarray as xr
xr.open_dataset(
    f"{SERVER_URL}/datasets/{dataset_id}/zarr",
    engine="zarr",
    chunks="auto"
)

Catalogs

STAC

The cloudify.stac package generates STAC items and collections from xarray datasets. It works in two modes:

Standalone — generate STAC items offline without running a server:

import xarray as xr
from cloudify.stac import build_stac_item, load_config

ds = xr.open_zarr("s3://my-bucket/model.zarr", consolidated=True)
config = load_config("cloudify/stac/configs/default.json")
item = build_stac_item(ds, "model-v1", config,
                       assets={"data": "s3://my-bucket/model.zarr"})

Icechunk (regular) — chunk data lives directly in the repo:

import icechunk, xarray as xr, pystac
from cloudify.stac import build_stac_item_from_icechunk

repo = icechunk.Repository.open(icechunk.s3_storage(...))
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)

item = build_stac_item_from_icechunk(
    ds,
    item_id="my-dataset",
    icechunk_href="s3://my-bucket/my-repo/",
    snapshot_id=session.snapshot_id,
    storage_schemes={"aws-s3-my-bucket": {"type": "aws-s3", "bucket": "my-bucket",
                                           "region": "us-east-1", "anonymous": False}},
    virtual=False,
)

Icechunk (virtual) — chunk data is referenced from external source files. Pass virtual_hrefs so that xr.open_dataset(asset) via xpystac can reconstruct the VirtualChunkContainer config automatically:

item = build_stac_item_from_icechunk(
    ds,
    item_id="my-virtual-dataset",
    icechunk_href="s3://my-bucket/virtual-repo/",
    snapshot_id=session.snapshot_id,
    storage_schemes={"aws-s3-my-bucket": {"type": "aws-s3", "bucket": "my-bucket",
                                           "region": "us-west-2", "anonymous": True}},
    virtual=True,
    virtual_hrefs=["s3://my-bucket/source-data/"],  # where chunk bytes live
)

xpublish plugin — serve STAC items alongside zarr endpoints. Datasets opened from Icechunk stores are detected automatically via _icechunk_href / _icechunk_snapshot_id attrs:

import xpublish
from cloudify.stac import Stac

# tag icechunk datasets so the plugin routes them correctly
ds_ice.attrs["_icechunk_href"] = "s3://my-bucket/my-repo/"
ds_ice.attrs["_icechunk_snapshot_id"] = session.snapshot_id

rest = xpublish.Rest(
    {"zarr-dataset": ds_zarr, "ice-dataset": ds_ice},
    plugins={"stac": Stac(config_path="cloudify/stac/configs/default.json")}
)
rest.serve()
# GET /datasets/zarr-dataset/stac  →  STAC item, asset points at /zarr endpoint
# GET /datasets/ice-dataset/stac   →  STAC item, asset points at original S3 repo
# GET /stac-collection-all.json    →  STAC collection listing all datasets

For institution-specific deployments (e.g. EERIE/DKRZ) use config_path="cloudify/stac/configs/eerie.json".

Example notebooks (verified against live public stores):

Notebook	Dataset	Store type
`examples/noaa_gfs_stac.ipynb`	NOAA GFS Forecast (dynamical.org)	Regular Icechunk, lat/lon
`examples/noaa_hrrr_stac.ipynb`	NOAA HRRR 48-Hour Forecast (dynamical.org)	Regular Icechunk, Lambert Conformal CRS
`examples/nldas3_virtual_stac.ipynb`	NLDAS-3 Forcing (NASA)	Virtual Icechunk, lat/lon

Building a static STAC catalog (scripts/build_dynamical_stac.py)

The script auto-discovers all dynamical.org Icechunk stores from the AWS Open Data Registry, builds STAC items for each, and publishes the catalog to S3-compatible storage or GitHub Pages.

# Dry run — build locally only (no upload):
python scripts/build_dynamical_stac.py --no-upload --output-dir /tmp/stac-out

# Publish to S3-compatible storage (e.g. Cloudflare R2):
python scripts/build_dynamical_stac.py \
    --catalog-bucket osc-pub \
    --catalog-prefix stac/dynamical \
    --profile osc-pub-r2 \
    --public-domain r2-pub.openscicomp.io

# Publish to GitHub Pages (auto-commit; push manually afterward):
python scripts/build_dynamical_stac.py \
    --no-upload \
    --public-domain myorg.github.io/myrepo \
    --catalog-prefix stac/dynamical \
    --github-pages /path/to/local/gh-pages-clone

# Publish to GitHub Pages and auto-push:
python scripts/build_dynamical_stac.py \
    --no-upload \
    --public-domain myorg.github.io/myrepo \
    --catalog-prefix stac/dynamical \
    --github-pages /path/to/local/gh-pages-clone \
    --github-pages-push

# Publish to both S3 and GitHub Pages in one run:
python scripts/build_dynamical_stac.py \
    --public-domain r2-pub.openscicomp.io \
    --github-pages /path/to/local/gh-pages-clone \
    --github-pages-push

Key behaviours:

Datasets are discovered automatically — new dynamical.org stores appear without any script changes.
For GitHub Pages: files land at DIR/catalog-prefix/; the commit is skipped if nothing changed (idempotent re-runs); without --github-pages-push you can review before publishing.
GitHub Pages has CORS enabled by default, making it a zero-config alternative to S3 for hosting read-only catalogs.
The catalog URL printed at the end can be pasted directly into STAC Browser.

Consuming a STAC collection from an xpublish server:

import pystac
stac_collection = pystac.Collection.from_file(
    f"{SERVER_URL}/stac-collection-all.json"
)
items = list(stac_collection.get_all_items())

Intake

Intake-xpublish plugin.

import intake
dataset_id="test"
api="kerchunk"
xpublish_intake_uri=SERVER_URL+"/intake.yaml"
cat=intake.open_catalog(xpublish_intake_uri)
#xarray dataset:
ds=cat[dataset_id](method=api).to_dask()

Server-side processing

Plotting

Self-developed plugin. Endpoints.

import requests
plot=requests.get(f"{SERVER_URL}/datasets/{dataset_id}/plot/{var_name}/{kind}/{cmap}/{selection}").content

Diagnostics

Self-developed plugin. Endpoints:

import requests
data_var_view=requests.get(f"{SERVER_URL}/datasets/{dataset_id}/groupby/{variable_name}/{groupby_coord}/{groupby_action}")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cloudify

Installation

Usage

Provisioning

With Dask to apply functions before hosting:

Consumption

Metadata info per xarray dataset

Access

Catalogs

Uh oh!

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Cloudify

Installation

Usage

Provisioning

With Dask to apply functions before hosting:

Consumption

Metadata info per xarray dataset

Access

Catalogs