-
Notifications
You must be signed in to change notification settings - Fork 88
Support sharding through config and raster_write_kwargs #1106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
7039c00
63fbe9a
b7e98b9
558fe97
8eebdb0
790be0c
1a1c673
471f72c
cd48574
0187fe9
b629de0
c2b1375
bf5b910
49441fe
6334fa8
73ca72a
c6041bb
278606a
f6c0a37
36e2271
50cb6bb
6257096
5525fbf
44414c1
e974647
bd6249e
2600237
bbb4bb6
b95c710
03aafc9
f44221a
f77cff4
d614519
10ba46b
c036d49
e65c0fb
26cb40b
2642b74
302fd1a
1ce8d08
7851d94
1cd0eb3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,10 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from collections.abc import Iterable | ||
| from typing import Any | ||
|
|
||
| from anndata import AnnData | ||
| from ome_zarr.types import JSONDict | ||
|
|
||
| from spatialdata._core.spatialdata import SpatialData | ||
|
|
||
|
|
@@ -164,3 +166,36 @@ def get_unique_name(name: str, attr: str, is_dataframe_column: bool = False) -> | |
| setattr(sanitized, attr, new_dict) | ||
|
|
||
| return None if inplace else sanitized | ||
|
|
||
|
|
||
| def create_raster_element_kwargs( | ||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict]] | list[JSONDict], | ||
| element_name: str, | ||
| element_names: set[str], | ||
| ) -> dict[str, Any] | list[dict[str, Any]]: | ||
| element_raster_write_kwargs = None | ||
| if isinstance(raster_write_kwargs, dict) and (kwargs := raster_write_kwargs.get(element_name)): | ||
| element_raster_write_kwargs = kwargs | ||
|
|
||
| if not element_raster_write_kwargs: | ||
|
Comment on lines
+171
to
+180
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just |
||
| if isinstance(raster_write_kwargs, dict): | ||
| for name in element_names: | ||
| raster_write_kwargs.pop(name, None) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems to be mutating the user's input. We should make a copy instead. |
||
| if not raster_write_kwargs: | ||
| element_raster_write_kwargs = {} | ||
| elif isinstance(raster_write_kwargs, dict) and not all( | ||
| isinstance(x, (dict, list)) for x in raster_write_kwargs.values() | ||
| ): | ||
| element_raster_write_kwargs = raster_write_kwargs | ||
| elif isinstance(raster_write_kwargs, list): | ||
| if not all(isinstance(x, dict) for x in raster_write_kwargs): | ||
| raise ValueError( | ||
| "If passing raster_write_kwargs as list, it is assumed to be the storage " | ||
| "options for each scale of a multiscale raster as a dictionary." | ||
| ) | ||
| element_raster_write_kwargs = raster_write_kwargs | ||
| else: | ||
| raise ValueError( | ||
| f"Type of raster_write_kwargs should be either dict or list, got {type(raster_write_kwargs)}." | ||
| ) | ||
| return element_raster_write_kwargs | ||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -16,6 +16,7 @@ | |||
| from dask.dataframe import DataFrame as DaskDataFrame | ||||
| from dask.dataframe import Scalar | ||||
| from geopandas import GeoDataFrame | ||||
| from ome_zarr.types import JSONDict | ||||
| from shapely import MultiPolygon, Polygon | ||||
| from upath import UPath | ||||
| from xarray import DataArray, DataTree | ||||
|
|
@@ -1113,6 +1114,7 @@ def write( | |||
| update_sdata_path: bool = True, | ||||
| sdata_formats: SpatialDataFormatType | list[SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict]] | list[JSONDict] | None = None, | ||||
| raster_compressor: dict[Literal["lz4", "zstd"], int] | None = None, | ||||
| ) -> None: | ||||
| """ | ||||
|
|
@@ -1161,12 +1163,32 @@ def write( | |||
| shapes_geometry_encoding | ||||
| Whether to use the WKB or geoarrow encoding for GeoParquet. See :meth:`geopandas.GeoDataFrame.to_parquet` | ||||
| for details. If None, uses the value from :attr:`spatialdata.settings.shapes_geometry_encoding`. | ||||
| raster_write_kwargs | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. please replace with docstring parameter since this docstring is repeated later
|
||||
| Storage options for raster elements. These options are passed to the zarr storage backend for writing and | ||||
| can be provided in several formats: | ||||
|
|
||||
| 1. Single dictionary | ||||
| A dictionary containing all storage options applied globally. | ||||
| 2. Dictionary per raster element | ||||
| A dictionary where: | ||||
| - Keys = names of raster elements | ||||
| - Values = storage options for each element | ||||
| - For single-scale data: a dictionary | ||||
| - For multiscale data: a list of dictionaries (one per scale) | ||||
| 3. List of dictionaries (multiscale only) | ||||
| A list where each dictionary defines the storage options for one scale of a multiscale raster element. | ||||
|
|
||||
| Important Notes | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here I would strongly recommend adding an example for users so that they have a recommendation on what to write, at least for Zarr v3. The tests already contain this information.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean as part of the docstring or actual doc? I would provide a follow up PR with specific docs.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd put a quick example of how it's called right it in the docstring. |
||||
| - The available key–value pairs in these dictionaries depend on the Zarr format used for writing. | ||||
| - For a full list of supported storage options, refer to: | ||||
| https://zarr.readthedocs.io/en/stable/api/zarr/create/#zarr.create_array | ||||
| raster_compressor | ||||
| A lenght-1 dictionary with as key the type of compression to use for images and labels and as value the | ||||
| compression level which should be inclusive between 0 and 9. For compression, `lz4` and `zstd` are | ||||
| supported. If not specified, the compression will be `lz4` with compression level 5. Bytes are automatically | ||||
| ordered for more efficient compression. | ||||
| """ | ||||
| from spatialdata._core._utils import create_raster_element_kwargs | ||||
| from spatialdata._io._utils import _resolve_zarr_store, _validate_compressor_args | ||||
| from spatialdata._io.format import _parse_formats | ||||
|
|
||||
|
|
@@ -1185,6 +1207,13 @@ def write( | |||
| store.close() | ||||
|
|
||||
| for element_type, element_name, element in self.gen_elements(): | ||||
| element_raster_write_kwargs = None | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of calling this in |
||||
| if element_type in ("images", "labels") and raster_write_kwargs: | ||||
| element_names = set(self.images.keys()).union(self.labels.keys()) | ||||
| element_raster_write_kwargs = create_raster_element_kwargs( | ||||
| raster_write_kwargs, element_name, element_names | ||||
| ) | ||||
|
|
||||
| self._write_element( | ||||
| element=element, | ||||
| zarr_container_path=file_path, | ||||
|
|
@@ -1193,6 +1222,7 @@ def write( | |||
| overwrite=False, | ||||
| parsed_formats=parsed, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| element_raster_write_kwargs=element_raster_write_kwargs, | ||||
| raster_compressor=raster_compressor, | ||||
| ) | ||||
|
|
||||
|
|
@@ -1211,6 +1241,7 @@ def _write_element( | |||
| overwrite: bool, | ||||
| parsed_formats: dict[str, SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| element_raster_write_kwargs: JSONDict | list[JSONDict] | None = None, | ||||
| raster_compressor: dict[Literal["lz4", "zstd"], int] | None = None, | ||||
| ) -> None: | ||||
| from spatialdata._io.io_zarr import _get_groups_for_element | ||||
|
|
@@ -1250,6 +1281,7 @@ def _write_element( | |||
| group=element_group, | ||||
| name=element_name, | ||||
| element_format=parsed_formats["raster"], | ||||
| storage_options=element_raster_write_kwargs, | ||||
| raster_compressor=raster_compressor, | ||||
| ) | ||||
| elif element_type == "labels": | ||||
|
|
@@ -1258,6 +1290,7 @@ def _write_element( | |||
| group=root_group, | ||||
| name=element_name, | ||||
| element_format=parsed_formats["raster"], | ||||
| storage_options=element_raster_write_kwargs, | ||||
| raster_compressor=raster_compressor, | ||||
| ) | ||||
| elif element_type == "points": | ||||
|
|
@@ -1289,6 +1322,7 @@ def write_element( | |||
| overwrite: bool = False, | ||||
| sdata_formats: SpatialDataFormatType | list[SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict] | Any] | list[JSONDict] | None = None, | ||||
| raster_compressor: dict[Literal["lz4", "zstd"], int] | None = None, | ||||
| ) -> None: | ||||
| """ | ||||
|
|
@@ -1308,6 +1342,25 @@ def write_element( | |||
| shapes_geometry_encoding | ||||
| Whether to use the WKB or geoarrow encoding for GeoParquet. See :meth:`geopandas.GeoDataFrame.to_parquet` | ||||
| for details. If None, uses the value from :attr:`spatialdata.settings.shapes_geometry_encoding`. | ||||
| raster_write_kwargs | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Duplicated docstring. I would rather use the
|
||||
| Storage options for raster elements. These options are passed to the zarr storage backend for writing and | ||||
| can be provided in several formats: | ||||
|
|
||||
| 1. Single dictionary | ||||
| A dictionary containing all storage options applied globally. | ||||
| 2. Dictionary per raster element | ||||
| A dictionary where: | ||||
| - Keys = names of raster elements | ||||
| - Values = storage options for each element | ||||
| - For single-scale data: a dictionary | ||||
| - For multiscale data: a list of dictionaries (one per scale) | ||||
| 3. List of dictionaries (multiscale only) | ||||
| A list where each dictionary defines the storage options for one scale of a multiscale raster element. | ||||
|
|
||||
| Important Notes | ||||
| - The available key–value pairs in these dictionaries depend on the Zarr format used for writing. | ||||
| - For a full list of supported storage options, refer to: | ||||
| https://zarr.readthedocs.io/en/stable/api/zarr/create/#zarr.create_array | ||||
| raster_compressor | ||||
| A lenght-1 dictionary with as key the type of compression to use for images and labels and as value the | ||||
| compression level which should be inclusive between 0 and 9. For compression, `lz4` and `zstd` are | ||||
|
|
@@ -1319,6 +1372,7 @@ def write_element( | |||
| If you pass a list of names, the elements will be written one by one. If an error occurs during the writing of | ||||
| an element, the writing of the remaining elements will not be attempted. | ||||
| """ | ||||
| from spatialdata._core._utils import create_raster_element_kwargs | ||||
| from spatialdata._io.format import _parse_formats | ||||
|
|
||||
| parsed_formats = _parse_formats(formats=sdata_formats) | ||||
|
|
@@ -1331,6 +1385,7 @@ def write_element( | |||
| overwrite=overwrite, | ||||
| sdata_formats=sdata_formats, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| raster_write_kwargs=raster_write_kwargs, | ||||
| raster_compressor=raster_compressor, | ||||
| ) | ||||
| return | ||||
|
|
@@ -1359,6 +1414,11 @@ def write_element( | |||
|
|
||||
| self._check_element_not_on_disk_with_different_type(element_type=element_type, element_name=element_name) | ||||
|
|
||||
| element_raster_write_kwargs = None | ||||
| if element_type in ("images", "labels") and raster_write_kwargs: | ||||
| element_names = set(self.images.keys()).union(self.labels.keys()) | ||||
| element_raster_write_kwargs = create_raster_element_kwargs(raster_write_kwargs, element_name, element_names) | ||||
|
|
||||
| self._write_element( | ||||
| element=element, | ||||
| zarr_container_path=self.path, | ||||
|
|
@@ -1367,6 +1427,7 @@ def write_element( | |||
| overwrite=overwrite, | ||||
| parsed_formats=parsed_formats, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| element_raster_write_kwargs=element_raster_write_kwargs, | ||||
| raster_compressor=raster_compressor, | ||||
| ) | ||||
| # After every write, metadata should be consolidated, otherwise this can lead to IO problems like when deleting. | ||||
|
|
||||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function would strongly benefit from a docstring. In particular explaining:
In particular, if the input/output types are implicitly defined in a library, e.g. by Dask, I suggest to write that this is the format defined by X library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the input type is defined in the docstring of
write().There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I did it there because this is private API and write is public API, but can add it to private API as well if prefered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please reopen if you would prefer that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put some context at least. Right now if a contributor sees that function, it can be quite confusing.