Support sharding through config and raster_write_kwargs#1106
Conversation
- Added additional settings - Allow environment variables that overwrite config
|
Failing atm due to ome-zarr not yet being released. You can test locally with ome-zarr-py from main. Also, need to add support for zarrs to improve speed of shard io |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1106 +/- ##
========================================
Coverage 92.44% 92.44%
========================================
Files 51 51
Lines 7811 7918 +107
========================================
+ Hits 7221 7320 +99
- Misses 590 598 +8
🚀 New features to boost your workflow:
|
The reason for only supporting these versions is that they provide the proper use of the zarr api inside dask and also the possibility for setting the tune optimization. The latter is required to prevent errors due to collapsing dask partitions when reading data back in from parquet.
|
Should we also allow the control of sharding for anndata? |
|
Yes, but not as part of this PR. I will adjust the config though to accommodate. |
| "xarray>=2024.10.0", | ||
| "xarray-spatial>=0.3.5", | ||
| "zarr>=3.0.0", | ||
| "zarrs", |
|
|
||
|
|
||
| @dataclass | ||
| class Settings: |
There was a problem hiding this comment.
General comment. Sounds like this file is a duplicate of the settings mechanism from scverse-misc: https://github.com/scverse/scverse-misc/blob/main/src/scverse_misc/_settings.py.
I think scverse-misc doesn't support reading settings from .json (they support .env files), but since it's based on Pydantic, it would be a quick extension.
I'll go trough the PR and review. In a follow-up PR we should probably align to use scverse-misc.
@ilia-kats @flying-sheep, devs behind scverse-misc, may have extra comments.
There was a problem hiding this comment.
would this be something for this PR or follow up PR before release? I believe before settings were not stored at all in any case and I think in general we need a PR to move towards using scverse-misc, also with adding the context manager there.
There was a problem hiding this comment.
Since config.py works, we can do that on a follow up release to avoid delaying this PR even more. On the other hand, if there is time/energy/will, I'd go directly with scverse-misc so that we don't merge code that will be removed soon after. Up to you!
| raster_chunks | ||
| The chunksize to use for chunking an array. Length of the tuple must match | ||
| the number of dimensions. | ||
| raster_shards | ||
| The default shard size (zarr v3) to use when storing arrays. Length of the tuple | ||
| must match the number of dimensions. |
There was a problem hiding this comment.
Feels a bit weird to have these 2 variables in a class that is for global settings, since, if I am getting this right, we for cyx data we would need len(raster_chunks) == 3, while for yx data we would need len(raster_chunks) == 2.
There was a problem hiding this comment.
want to split in image_shards/chunks and label or just remove all together? I would maybe vote for the former.
There was a problem hiding this comment.
there is still the problem of 2D vs 3D. I'd remove altogether. We could use instead an helper function that computes reasonable defaults based on the input shape.
| must match the number of dimensions. | ||
| """ | ||
|
|
||
| custom_config_path: Path | None = None |
There was a problem hiding this comment.
I don't think the settings should contain this path. Feels out of place.
There was a problem hiding this comment.
depends on whether certain settings are project specific I believe. This allows for that. Otherwise global settings on hpc could be shared by more than 1 user even though there are project specific needs.
| def _config_path() -> Path: | ||
| """Return the platform-appropriate path to the user config file.""" | ||
| return Path(user_config_dir(appname="spatialdata")) / "settings.json" |
There was a problem hiding this comment.
why not moving this inside config_path()?
| assert s.large_chunk_threshold_bytes == 1_000_000_000 | ||
| assert s.raster_chunks == [512, 512] | ||
| assert s.raster_shards == [1024, 1024] | ||
| assert s.custom_config_path is None |
There was a problem hiding this comment.
Fine. But since we called save(), shouldn't custom_config_path be equal to the path we are reading from?
This is a consequence of https://github.com/melonora/spatialdata/blob/f44221ab23d6b136084157203e1d0b466979318c/src/spatialdata/config.py#L65
path is None, so it gets assigned the value from _config_path(), without saving it in the in-memory value of custom_config_path.
|
|
||
| s.reset() | ||
| s.save() | ||
| assert s.custom_config_path is None # This returns False |
There was a problem hiding this comment.
what's the comment about?
|
| assert arr.shards == write_shards | ||
|
|
||
| other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"] | ||
| assert other_arr.chunks == base_chunks |
There was a problem hiding this comment.
we could add an explicit check that shards are None (?) here.
| assert other_arr.chunks == base_chunks | ||
|
|
||
|
|
||
| def test_write_raster_elements_sharding_chunking(tmp_path: Path) -> None: |
There was a problem hiding this comment.
What are we testing that it is not tested before? That write_element() supports sharding? In that case I'd rename the test.
There was a problem hiding this comment.
yeah did that independently because of issues downstream with non matching shards / chunks. These errors are specific and not connected to our implementation per se so I kept it separate. For me it indicates that by _sharding_chunking which is seen not to be tested in the other functions.
|
Looks good in general. There are a few decisions to be made, but once done most of the requested changes can be one-shot by an agent. |
chore: fix typo in docstrings
…to support_sharding
bump ome_zarr remove distributed add platformdirs
LucaMarconato
left a comment
There was a problem hiding this comment.
Changes look good so far.
This PR adds the following:
raster_write_kwargsfor io functions like.writeand.write_element. This also adds the ability to write sharded arrays. Support for anndata sharding is to be added in a follow up PR.raster_write_kwargsargument.raster_chunksandraster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the values.Additional changes
@LucaMarconato