Skip to content

Model-centric refactoring to reduce dataset creation#2646

Open
VeckoTheGecko wants to merge 48 commits into
Parcels-code:mainfrom
VeckoTheGecko:restructure
Open

Model-centric refactoring to reduce dataset creation#2646
VeckoTheGecko wants to merge 48 commits into
Parcels-code:mainfrom
VeckoTheGecko:restructure

Conversation

@VeckoTheGecko

@VeckoTheGecko VeckoTheGecko commented May 27, 2026

Copy link
Copy Markdown
Contributor

Description

Overview

The central change is the introduction of a new Model abstraction layer between raw xarray/uxarray data and the Field/FieldSet objects. Previously, Field owned its data and grid directly. Now, Field is a thin view over a Model, and FieldSet is a container of Model objects rather than Field objects.


New file: src/parcels/_core/model.py

Model (abstract base class)

Abstract class with three required attributes:

  • data: Any — the underlying dataset
  • grid: BaseGrid — the grid object
  • field_to_interpolator: dict[str, ScalarInterpolator | VectorInterpolator] — maps field names to interpolator instances

Abstract methods:

  • construct_fields() -> list[Field | VectorField] — build field objects from this model
  • scalar_field_names -> list[str] — names of scalar fields in the data
  • assert_valid_field_data(field_data) — validate a single field's data

Concrete methods on Model:

  • assert_valid_model_data() — iterates scalar_field_names and calls assert_valid_field_data on each
  • time_interval -> TimeInterval | None — computed from self.data

StructuredModel(Model)

For structured (SGRID) grid data backed by xr.Dataset.

Constructor: StructuredModel(data: xr.Dataset, mesh: Mesh)

  • Calls preprocess_sgrid_model_data(data) to transpose fields to (t, z, y, x) order
  • Creates an XGrid(data, mesh) grid
  • Initializes field_to_interpolator = {}
  • Calls assert_valid_model_data() on construction

from_sgrid_conventions(cls, ds, mesh=None) classmethod:

  • Copied/moved from FieldSet.from_sgrid_conventions — handles time axis renaming, mesh type inference
  • Sets default interpolator XLinear() on all scalar fields after construction
  • Returns a StructuredModel instance

construct_fields():

  • Creates Field("U", self), Field("V", self) etc., then wraps them in VectorField("UV", ...) if U+V present
  • Uses XLinear_Velocity() for A-grids, CGrid_Velocity() for C-grids

UnstructuredModel(Model)

For unstructured (UGRID) grid data backed by ux.UxDataset.

Constructor: UnstructuredModel(data: ux.UxDataset, grid: UxGrid)

from_ugrid_conventions(cls, ds, mesh="spherical") classmethod:

  • Validates required dimensions (time, zf, zc)
  • Creates UxGrid, calls _discover_ux_U_and_V, returns instance

construct_fields():

  • Uses _select_uxinterpolator(da) to pick the appropriate interpolator per field
  • Note: interpolator is passed as 3rd arg to Field(name, model, interp) — see Field changes below

Helper functions moved from fieldset.py to model.py

  • _discover_ux_U_and_V(ds) — unchanged logic
  • _select_uxinterpolator(da) — unchanged logic
  • _get_mesh_type_from_sgrid_dataset(ds) — unchanged logic
  • _is_coordinate_in_degrees(da) — unchanged logic
  • _get_time_interval(data) — logic adjusted: checks "time" not in data or data["time"].size == 1 (previously checked data.shape[0] == 1)
  • _assert_valid_uxdataarray(data) — unchanged logic
  • _assert_has_time_coordinate(da) — new helper extracted from old Field.__init__

New helper in model.py

  • preprocess_sgrid_model_data(ds) — transposes all non-grid-topology data vars to (t, z, y, x) using _transpose_xfield_data_to_tzyx

Changes to src/parcels/_core/field.py

Field.__init__ signature change

Before:

Field(name: str, data: xr.DataArray | ux.UxDataArray, grid: UxGrid | XGrid, interp_method: Callable)

After:

Field(name: str, model: Model)
  • data, grid, and interp_method are no longer constructor arguments
  • The constructor only sets self.name, self.model, and self.igrid = -1
  • Validation (data type checks, axis checks, time interval extraction) removed from __init__

Field properties (delegating to model)

Three new properties proxy into the model:

@property
def data(self):
    return self.model.data[self.name]

@property
def grid(self):
    return self.model.grid

@property
def time_interval(self):
    return self.model.time_interval

These preserve backward compatibility for code that reads field.data, field.grid, field.time_interval.

Field.interp_method property/setter

Before: stored as self._interp_method; validated via assert_same_function_signature against ZeroInterpolator

After: stored in self.model.field_to_interpolator[self.name]

  • Getter raises AttributeError (not KeyError) if no interpolator is set for this field
  • Setter validates isinstance(value, ScalarInterpolator) instead of checking function signature

Interpolator call convention change

Before: self._interp_method(particle_positions, grid_positions, self)

After: self.interp_method.interp(particle_positions, grid_positions, self)

Interpolators are now objects with an .interp(...) method, not plain callables.

VectorField changes

  • interp_method parameter type annotation changed from Callable | None to VectorInterpolator | None
  • Validation changed from assert_same_function_signature(...) to isinstance(interp_method, VectorInterpolator)
  • Setter similarly validates isinstance(method, VectorInterpolator)
  • Call site: self._interp_method.interp(...) instead of self._interp_method(...)

Removed from field.py

  • _assert_valid_uxdataarray — moved to model.py
  • _assert_compatible_combination — removed (validation now handled per-model)
  • _get_time_interval — moved to model.py
  • Imports: uxarray, xarray, Callable, TimeInterval, ZeroInterpolator, ZeroInterpolator_Vector, assert_same_function_signature, _transpose_xfield_data_to_tzyx, assert_all_field_dims_have_axis

Changes to src/parcels/_core/fieldset.py

FieldSet.__init__ signature change

Before: FieldSet(fields: list[Field | VectorField])

After: FieldSet(models: list[Model])

  • Now stores self.models: list[Model]
  • Calls self.reconstruct_fields() on init to build self._fields
  • assert_compatible_calendars(fields) call commented out (TODO)

New FieldSet.fields property

_fields is now the backing store; fields is a lazy property that calls reconstruct_fields() if _fields is None.

New FieldSet.reconstruct_fields() method

Iterates self.models, calls model.construct_fields() on each, flattens into self._fields dict.

New FieldSet.__add__ operator

def __add__(self, other: FieldSet) -> FieldSet:
    assert_compatible_fieldsets(self, other)
    combined = FieldSet(self.models + other.models)
    combined.constants = {**self.constants, **other.constants}
    return combined

from_ugrid_conventions simplified

Before: ~15 lines building grid, discovering U/V, creating Field objects, returning cls(list(fields.values()))

After:

model = UnstructuredModel.from_ugrid_conventions(ds, mesh)
return cls([model])

from_sgrid_conventions simplified

Before: ~50 lines handling time axis, xgcm grid creation, field creation

After:

model = StructuredModel.from_sgrid_conventions(ds, mesh)
return cls([model])

add_field constant field creation updated

The inline xgcm.Grid(...) call when adding a constant scalar field is replaced with constructing XGrid(ds, mesh=mesh) directly (after attaching SGRID metadata via sgrid._attach_sgrid_metadata).

New module-level function: assert_compatible_fieldsets

def assert_compatible_fieldsets(left: FieldSet, right: FieldSet) -> None

Raises ValueError if the two fieldsets share any field names or constant names.

Removed from fieldset.py

  • xgcm import
  • UxGrid import
  • _DEFAULT_XGCM_KWARGS import
  • logger import
  • _ds_rename_using_standard_names import
  • Most interpolator imports (only XConstantField remains)
  • _discover_ux_U_and_V — moved to model.py
  • _select_uxinterpolator — moved to model.py
  • _get_mesh_type_from_sgrid_dataset — moved to model.py
  • _is_coordinate_in_degrees — moved to model.py

Summary of architectural intent

Concern Before After
Data ownership Field (held self.data, self.grid) Model (holds self.data, self.grid)
Interpolator storage Field._interp_method (per-field callable) Model.field_to_interpolator (dict of objects)
Interpolator type Any callable matching ZeroInterpolator signature Instance of ScalarInterpolator / VectorInterpolator
Interpolator invocation interp_method(positions, grid_positions, field) interp_method.interp(positions, grid_positions, field)
FieldSet contents list[Field | VectorField] list[Model]
Field construction Done in FieldSet.from_* classmethods Delegated to Model.construct_fields()
FieldSet combination Not supported fieldset_a + fieldset_b via __add__

Open questions

  • Is Model the best name for this level of abstraction? (this question doesn't have to be answered now - talking with @erikvansebille it seemed confusing, but there wasn't a clear better alternative)

Checklist

  • Closes None
  • Tests added
  • This PR targets the correct branch (main for normal development, v3-support for v3 support)

AI Disclosure

  • This PR contains AI-generated content - None(yet)
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt): AI was used in the refactoring of tests (mainly for simple updates, and for categorisation of failures - before coming back and updating by hand). This PR description was also written by AI based on the diff.

VeckoTheGecko and others added 20 commits May 29, 2026 16:02
Split out data manipulation and grid creation from the construction of fields
Move all data validation code to the model itself
Ultimately we want to make the grid just dependent on the SGRID compliant model data since it contains all the information needed regarding staggering (we dont need xgcm anymore). I want to update the constructor to remove the xgcm grid object - so adding an adapter at the moment to help with refactoring (will be removed at a later date)
Also update calling code in model and field.py
@VeckoTheGecko

Copy link
Copy Markdown
Contributor Author

OK - now I've mapped out most of the architectural changes here. I'm going to now profile performance and memory consumption to see where we stand, and how to optimise that under the new structure.

Once i have results for that, I will populate this PR with them as well as a full description over the proposed refactoring.

Once we decide on structure and performance differences and agree, we can go through the test suite and update.

@VeckoTheGecko VeckoTheGecko mentioned this pull request Jun 16, 2026
6 tasks
@VeckoTheGecko

Copy link
Copy Markdown
Contributor Author

Just to make sure I understand it: a Model is the set of Fields that share a Grid; right?

Yes, that is correct

@VeckoTheGecko

Copy link
Copy Markdown
Contributor Author

Will do the Model renamings after lunch and re-request a review

@VeckoTheGecko

Copy link
Copy Markdown
Contributor Author

Hmmm.

I'm munching on the term Model though. Most oceanographers would probably assume that a model is e.g. NEMO, MITgcm, etc. What we here refer to as a model, they would call a dataset, simulation or modelrun; the output of a model on a specific grid.

But Dataset would be too confusing with the xarray terminology, and Simulation would be confusing with the output of Parcels itself (since Parcels simulates). Not sure if Modelrun is better than Model

How about renaming model to SharedGrid, or something like that? Or ParcelsDataset, to highlight that these are "parcels-flavoured" xarray Datasets?

Thinking more about this, I think the renaming to ParcelsDataset would be too confusing to us as devs since it would warrant replacing all mentions with models to dataset which I think makes it very easy to conflate with xarray's meaning of datasets.

I agree that Model by itself might be a bit confusing to users - but at the same time Model and mentions to it is not public API (in fact, I don't think that it will show up anywhere in public API - perhaps only error messaging in the file paths). Model here is mainly only used for internal bookkeeping and reasoning - and can be said as such in our docs.

I think having the term model in parcels is quite clear for developers and how the abstraction fits within parcels (i.e,. a model encapsulates how the spatial domain relates to the underlying data representation by defining indexing, and by performing transformations on the input data).

I think there is a degree to which we need to use language that works best for development (particularly for internal API) even if there is alternative usage in other domains (i.e., oceanography or atmosphere) - there's just not enough words.

Perhaps a middleground would be to use ModelData for the base class name, StructuredModelData and UnstructuredModelData and keep everything else the same? (i.e., keep FieldSet.models)

@VeckoTheGecko

VeckoTheGecko commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

I was looking at the docs build, and saw an example that adds a second velocity field

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

import parcels
import parcels.tutorial

# Load the CopernicusMarine data in the Agulhas region from the example_datasets
ds_fields = parcels.tutorial.open_dataset("CopernicusMarine_data_for_Argo_tutorial/data")
ds_fields.load()  # load the dataset into memory

# Create an idealised wind field and add it to the dataset
tdim, ydim, xdim = (len(ds_fields.time),len(ds_fields.latitude), len(ds_fields.longitude))
ds_fields["UWind"] = xr.DataArray(
    data=np.ones((tdim, ydim, xdim)) * np.sin(ds_fields.latitude.values)[None, :, None],
    coords=[ds_fields.time, ds_fields.latitude, ds_fields.longitude])

ds_fields["VWind"] = xr.DataArray(
    data=np.zeros((tdim, ydim, xdim)),
    coords=[ds_fields.time, ds_fields.latitude, ds_fields.longitude])

fields = {
    "U": ds_fields["uo"],
    "V": ds_fields["vo"],
    "UWind": ds_fields["UWind"],
    "VWind": ds_fields["VWind"],
}
ds_fset = parcels.convert.copernicusmarine_to_sgrid(fields=fields)
fieldset = parcels.FieldSet.from_sgrid_conventions(ds_fset)

# Create a vecorfield for the wind
windvector = parcels.VectorField(
    "Wind",
    fieldset.UWind,
    fieldset.VWind,
    interp_method=parcels.interpolators.XLinear_Velocity
)
fieldset.add_field(windvector)

We no longer have add_field - I think we need to map out exactly how we should be defining velocity fields under this new refactoring... I think that can be done in a future PR

@VeckoTheGecko

Copy link
Copy Markdown
Contributor Author

Just chatted f2f with Erik, and decided on vvvvv

Perhaps a middleground would be to use ModelData for the base class name, StructuredModelData and UnstructuredModelData and keep everything else the same? (i.e., keep FieldSet.models)

Also renamed the subclasses
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants