feat: Add DataFrame support in dy.Collection#335
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #335 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 56 56
Lines 3404 3432 +28
=========================================
+ Hits 3404 3432 +28 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds first-class support for eager dy.DataFrame[...] members in dy.Collection, enabling collections to expose Polars DataFrames directly (instead of always returning LazyFrames).
Changes:
- Extend member metadata with an
is_lazyflag and addlazy_members()/eager_members()helpers. - Introduce internal
_to_lazy_dict()and switch internal operations (join/collect_all/storage/filter-result collection) to use it. - Update collection initialization to eagerly
.collect()members annotated asdy.DataFrame[...], plus add dedicated tests for eager/lazy/mixed collections.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
dataframely/collection/_base.py |
Adds MemberInfo.is_lazy, derives eager vs lazy from annotations, introduces _to_lazy_dict(), and updates _init() to materialize eager members. |
dataframely/collection/collection.py |
Routes internal operations (join/collect_all/storage) through _to_lazy_dict() to support mixed eager/lazy collections. |
dataframely/collection/filter_result.py |
Ensures CollectionFilterResult.collect_all() operates on lazy views of members for mixed collections. |
tests/collection/test_implementation.py |
Updates annotation implementation tests to reflect dy.DataFrame[...] support. |
tests/collection/test_dataframe_members.py |
Adds new end-to-end tests for eager-only, lazy-only, mixed, and optional eager members. |
| def test_eager_member_detection() -> None: | ||
| members = EagerCollection.members() | ||
| assert not members["users"].is_lazy | ||
| assert not members["orders"].is_lazy | ||
|
|
There was a problem hiding this comment.
These member-detection tests are structurally identical and differ only in the collection class/expected member set. Per the repo testing guidelines, consider merging them into a single @pytest.mark.parametrize test to reduce duplication and make it easier to add more member-type combinations later.
| def test_eager_member_returns_dataframe(valid_data: dict[str, pl.DataFrame]) -> None: | ||
| collection = EagerCollection.validate(valid_data) | ||
| assert isinstance(collection.users, pl.DataFrame) | ||
| assert isinstance(collection.orders, pl.DataFrame) | ||
|
|
||
|
|
||
| def test_lazy_member_returns_lazyframe(valid_data: dict[str, pl.DataFrame]) -> None: | ||
| collection = LazyCollection.validate(valid_data) | ||
| assert isinstance(collection.users, pl.LazyFrame) | ||
| assert isinstance(collection.orders, pl.LazyFrame) |
There was a problem hiding this comment.
These access-pattern tests repeat the same arrange/act/assert structure for different collection types. Consider parametrizing over (CollectionType, expected_python_type_per_member) to keep coverage while reducing duplication, per the repo testing guidelines.
Oliver Borchert (borchero)
left a comment
There was a problem hiding this comment.
Sorry it took so long to review this! Generally, I think that we can support this, I left a few comments to simplify the implementation a little
| # Check that exactly one arg is None (type(None) is NoneType) | ||
| if not any(arg is type(None) for arg in union_args): | ||
| raise AnnotationImplementationError(attr, type_annotation) | ||
|
|
||
| not_none_args = [arg for arg in union_args if get_origin(arg) is not None] | ||
| if len(not_none_args) == 0 or not issubclass( | ||
| get_origin(not_none_args[0]), TypedLazyFrame | ||
| ): | ||
| # Get the non-None type (exactly one exists given prior checks) | ||
| not_none_arg = next(arg for arg in union_args if arg is not type(None)) | ||
|
|
||
| frame_origin = get_origin(not_none_arg) | ||
| if frame_origin is None: | ||
| raise AnnotationImplementationError(attr, type_annotation) |
There was a problem hiding this comment.
Why did these need to change? It feels like it's the same thing rewritten?
|
|
||
| def to_dict(self) -> dict[str, pl.LazyFrame]: | ||
| """Return a dictionary representation of this collection.""" | ||
| def to_dict(self) -> dict[str, FrameType]: |
There was a problem hiding this comment.
This signature forces the user to always specify the return type. I would keep pl.LazyFrame here and non-conditionally call .lazy() on the attribute. Wdyt?
| frame = data[member_name] | ||
| if isinstance(frame, pl.LazyFrame): | ||
| setattr(out, member_name, frame.collect()) | ||
| else: | ||
| setattr(out, member_name, frame) |
There was a problem hiding this comment.
You can just do .lazy().collect() which works on both lazy and eager frames
| maintain_order=maintain_order, | ||
| ) | ||
| for key, lf in self.to_dict().items() | ||
| for key, lf in self._to_lazy_dict().items() |
There was a problem hiding this comment.
Might not be necessary with the comment above
| dfs = pl.collect_all(lazy_dict.values()) | ||
| return self._init( | ||
| {key: dfs[i].lazy() for i, key in enumerate(self.to_dict().keys())} | ||
| {key: dfs[i].lazy() for i, key in enumerate(lazy_dict.keys())} |
There was a problem hiding this comment.
Is the .lazy() required?
Motivation
Closes #319
Changes
is_lazyattribute inMemberInfo.lazy_membersandeager_memberstody.Collection-_to_lazy_dictmethods.Collection._initto collect the dataframe is the eager case.