[FEAT][Python] Tie Python wrapper lifetime to underlying C++ FFI object#593
[FEAT][Python] Tie Python wrapper lifetime to underlying C++ FFI object#593cyx-6 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces 'PyObjectTying,' a mechanism to link the lifetime of Python wrappers to their underlying C++ FFI objects. By prepending a PyCustomizeAllocHeader to object allocations, the implementation enables stable object identity and address preservation across wrapper finalization cycles. The changes include modifications to the C++ SimpleObjAllocator, Cython bindings to utilize tp_finalize, and updates to the reflection system. Feedback from the review highlights a critical thread-safety issue in the callback used to free cached wrapper memory, which lacks GIL protection, and a potential memory leak when installing handles if a previous phantom wrapper is not correctly released.
435caeb to
1b751b2
Compare
| if self.chandle != NULL: | ||
| CHECK_CALL(TVMFFIObjectDecRef(self.chandle)) | ||
| self.chandle = NULL | ||
| def __del__(self): |
There was a problem hiding this comment.
Let's double check and test it out, making sure this __del__ method will be properly invoked. I had some really negative experience with overloading __del__ because it could possibly prevent Python GC from recycling memory in certain cases. Not 100% sure if it applies to Cython's cdef class's __del__ method though.
There was a problem hiding this comment.
a tp_finalize-free version is updated. we are now using tp_alloc + tp_free as alternative.
| cls = make_fallback_cls_for_type_index(type_index) | ||
| obj = cls.__new__(cls) | ||
| (<CObject>obj).chandle = result.v_obj | ||
| TVMFFIPyAttachPyObject(result.v_obj, <PyObject*>obj) |
There was a problem hiding this comment.
comment, this is a new class, so it must be in detached state, attach it
| dlpack.deleter(dlpack) | ||
| pass | ||
| # default return the tensor | ||
| # default return the tensor. |
There was a problem hiding this comment.
NOTE: we don't try to attach py object for tensor types as it may appear in callback and there maybe translation depending on the context
| // ---------- | ||
| // I1. When a PyObject goes out of scope (no Python var refers to it), | ||
| // its +1 on chandle is always released. | ||
| // I2. When a chandle is destroyed, its cached PyObject (if any) is |
There was a problem hiding this comment.
When a chandle is destroyed, we can only be in Detached or Inactive state, if it is in Inactive state, the PyObject is reclaimed
| // ``self.chandle`` and DecRef. | ||
| // Detached -> Detached : ``self`` is not the canonical wrapper for | ||
| // this chandle (eager-detach via move already | ||
| // cleared the binding). Just DecRef. |
There was a problem hiding this comment.
need clarification here, since Detached menas chandle == NULL
| */ | ||
| TVM_FFI_INLINE void TVMFFIPyTPFinalize(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; | ||
| if (chandle == nullptr) return; |
| * | ||
| * Detached -> Detached: ``cached_pyobj`` was already cleared by an | ||
| * eager move, or chandle was not allocated through the Python custom | ||
| * allocator. Just null ``wrapper.chandle`` and DecRef. |
There was a problem hiding this comment.
no need to DecRef here? just return and pass to dealloc
|
Btw - does this approach also work for String and Bytes? I’m asking because I wasn’t sure if small string optimization is compatible with this design |
|
@junrushao Not working for |
| TVMFFIObjectAllocHeader base; | ||
| }; | ||
|
|
||
| constexpr size_t kPyHeaderOffset = sizeof(PyCustomAllocHeader); // 16 |
| * need ``PyObject_InitVar(.., nitems)`` here and a matching basicsize check. | ||
| */ | ||
| inline PyObject* TVMFFIPyTpAlloc(PyTypeObject* type, Py_ssize_t nitems) { | ||
| void* blk = TVMFFIPyTlsReviveSlot(); |
| * count). Per-thread -> free-threading safe; strict read-and-clear. | ||
| */ | ||
| inline void*& TVMFFIPyTlsReviveSlot() { | ||
| static thread_local void* slot = nullptr; |
| * ``TVMFFIPyTpAlloc`` (which is handed only ``type`` and an item | ||
| * count). Per-thread -> free-threading safe; strict read-and-clear. | ||
| */ | ||
| inline void*& TVMFFIPyTlsReviveSlot() { |
| // stable wrapper class for the life of the process, and ``make_ret_object`` | ||
| // derives both the cached allocation (from the chandle) and ``type`` (= cls for that | ||
| // same type_index) from the very same chandle on the revival path. | ||
| assert(type->tp_itemsize == 0 && |
There was a problem hiding this comment.
static PyObject* custom_tp_alloc(PyTypeObject *type, Py_ssize_t nitems) {
PyCustomizeAllocHeader *header = get_current_header();
// Hot Path: Reuse
if (header && header->cached_mem) {
PyObject *obj = header->cached_mem;
// Inline-friendly initialization
PyObject_VarInit((PyVarObject *)obj, type, nitems);
// Only memset the delta between header and body
// This is where your real time is spent
size_t head_sz = (nitems > 0 || type->tp_itemsize > 0) ? sizeof(PyVarObject) : sizeof(PyObject);
memset((char*)obj + head_sz, 0, type->tp_basicsize - head_sz);
if (PyType_IS_GC(type)) {
PyObject_GC_Track(obj);
}
return obj;
}
// Cold Path: Slow allocation
return PyType_GenericAlloc(type, nitems);
}| * on, GC or not. | ||
| */ | ||
| inline void TVMFFIPyTpFree(void* self) { | ||
| if (reinterpret_cast<TVMFFIPyCObjectHead*>(self)->chandle == TVM_FFI_PY_INACTIVE_SENTINEL) { |
There was a problem hiding this comment.
add a runtime offset check test case
There was a problem hiding this comment.
void** TVMFFICyObjectGetCHandlePtr(PyObject* ptr)
cdef void** TVMFFICyObjectGetCHandlePtr(PyObject* ptr):
return &((<CObject>ptr).chandle)
| TVM_FFI_INLINE void TVMFFIPyTpDealloc(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; | ||
| // NULL: already detached/moved. INACTIVE_SENTINEL: an inactive cached | ||
| // allocation must never re-enter __dealloc__ while inactive (defensive). |
There was a problem hiding this comment.
move TVMFFIPyTpDealloc after TVMFFIPyTpAlloc, TVM_FFI_PY_INACTIVE_SENTINEL/TVMFFIPyTpDealloc/ TVMFFIPyTpFree/ sit together, document the overall logic in a section, TVM_FFI_PY_INACTIVE_SENTINEL=> TVM_FFI_PY_CHANDLE_INACTIVE_SENTINEL
| * header. Just null ``wrapper.chandle`` and DecRef. | ||
| */ | ||
| TVM_FFI_INLINE void TVMFFIPyTpDealloc(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; |
There was a problem hiding this comment.
Document, after dealloc, there are two possibilities:
- chandle is set to CHANDLE_INACTIVE_SENTINEL, tpfree will not be triggered
- other cases, normal free
| * time); the only cost is no stable-id-across-drop. | ||
| */ | ||
| TVM_FFI_INLINE bool TVMFFIPyIsInactiveEligible(PyObject* wrapper) { | ||
| PyTypeObject* tp = Py_TYPE(wrapper); |
| * | ||
| * The (return, ``*out``) pair has four combinations; one cannot occur: | ||
| * (true, non-NULL) : Active -- the live canonical wrapper. | ||
| * (false, NULL) : Detached / non-Python -- no wrapper bound. |
There was a problem hiding this comment.
document
\return whether returned PyObject state is active
Apply Tianqi's PR apache#593 review feedback to the pyobject-tying layer: - Drop the kPyHeaderOffset constant; use sizeof(PyCustomAllocHeader) directly at every site (and in the static_asserts). - Rename TVMFFIPyTlsReviveSlot -> TVMFFIPyTLSReviveSlot and make its sole access primitive a swap-exchange (store next, return prior). The slot is typed PyObject*, and taking the block is TVMFFIPyTLSReviveSlot(nullptr) -- read-and-clear in one step, no separate clear to forget. - Document \param/\return on TVMFFIPyTryGetAttachedPyObject. - Reorder the slot functions into execution order (TpAlloc -> TpDealloc -> TpFree) with the DEALLOC_TRANSIT marker declared between them. - Replace the hand-rolled TVMFFIPyCObjectHead layout mirror with a Cython cdef public accessor TVMFFICyObjectGetCHandlePtr, so the chandle field offset comes from the real cdef class layout instead of a hardcoded one. Build clean; full Python suite passes (2336 passed, 18 skipped, 2 xfailed).
Resolve the lint job failures on PR apache#593: - ruff-check: add one-line docstrings to test methods (D102), hoist threading/weakref imports to top-level (PLC0415). - ty: annotate the rvalue-move callback params as `Any` instead of `object` so `x._move()` resolves (matches test_function.py). - ruff-format / clang-format / cmake-format: apply formatter reflow. - taplo-format: expand the cibuildwheel `build` array to multi-line in pyproject.toml (applied from CI log; taplo has no wheel for this platform). Reproduced locally via pre-commit (taplo-format skipped).
|
Any updates on this issue? |
88b235f to
38ca72d
Compare
Make `a.x is a.x`, `id(a.x)` stable across drop+refetch, and `f(x) is x`
for FFI returns, on both the GIL and free-threaded (3.14t) builds, by
binding one Python wrapper to one C++ FFI object ("chandle") for the
object's lifetime. Implements the PyObjectTying design.
## Allocation layout
A two-layer custom-allocator hook lives in core libtvm_ffi.so:
`TVMFFIObjectAllocHeader { delete_space }`, `TVMFFICustomAllocator
{ allocate, context }`, plus `TVMFFIGetCustomAllocator` /
`TVMFFISetCustomAllocator`. libtvm_ffi installs a builtin default at
registry init, so every Object always carries the 8-byte base header and
`TVMFFIGetCustomAllocator` never returns NULL. The Python Cython module
overrides the global default at module load with `TVMFFIPyAllocate`,
which prepends a 16-byte `PyCustomAllocHeader` encoding the wrapper
binding. `make_object` / `make_inplace_array_object` / `PyClassDeleter`
and the Rust `ObjectArc::new[_with_extra_items]` paths all funnel through
the registry, so Python-defined types and Rust-allocated objects share
the layout and lifetime semantics.
## Binding state machine
State is concentrated in `tvm_ffi_python_helpers.h`. Each header word
(`tagged_pyobj`) holds the wrapper back-pointer plus low tag bits, giving
a four-state machine:
Detached: no wrapper bound
Active: wrapper bound and owns a +1 on the chandle
Inactive: wrapper dead but its allocation is cached for in-place revival
InTransit: a dealloc is mid-flight (handshake bit)
Every FFI return funnels through `make_ret_object` (C++ entry
`TVMFFIPyMakeRetObject`), which returns the canonical wrapper for a
chandle when one exists, reviving an Inactive cached allocation in place
so a re-fetched wrapper keeps a stable `id()` at the same address. The
cache-vs-free handshake spans three slots -- a pre-bump `tp_dealloc`
opens it, `tp_free` settles it, and the C++ weak deleter
(`TVMFFIPyDeleteSpace`) reclaims the block -- coordinated so a chandle
that outlives its wrapper keeps the cached bytes, and a genuinely dead
chandle frees them exactly once. Frontend-allocation is detected by
`delete_space` pointer comparison (`TVMFFIPyIsCanonical`), avoiding a
flag bit on `TVMFFIObject`; chandles created before the Python allocator
is registered (statically-initialized global functions in libtvm_ffi.so)
carry only the base header and are skipped.
## Free-threaded build
The same tying runs on Py_GIL_DISABLED. The `tagged_pyobj` word doubles
as a per-word spin-lock (a Locked tag bit, `__atomic_*` CAS) that
serializes every binding transition; the lock is held only across short,
park-free word/header edits, while alloc/revival run lock-released. The
wait/back-off (`TVMFFIPyLockYield`) detaches the thread state so a
concurrent stop-the-world GC is never starved.
The free-threaded revival hazard is Cython's generated `tp_dealloc`,
which bumps the wrapper refcount before running `__dealloc__`: that bump
makes `PyUnstable_TryIncRef` spuriously succeed on a wrapper being torn
down, so a concurrent `make_ret` Active-hit could revive a corpse
(borrowed-ref UAF). The fix replaces Cython's `tp_dealloc` on each cdef
CObject-family carrier with one hand-built `TVMFFIPyTpDeallocSlot` that
runs the binding transition (bracketed by
`PyErr_Get/SetRaisedException`) before any bump, then -- stripped of the
now-dead bump -- GC-untrack (guarded by GC-ness), a generic `__dict__`
clear (guarded by a real `tp_dictoffset`), and `tp_free`. A plain
carrier runs exactly `transition; tp_free`; Function fires both guards;
both are faithful to Cython's originals minus the bump. The six carriers
(CObject, CContainerBase, OpaquePyObject, Error, Tensor, Function) are a
closed compile-time set, each wrapped once at init via
`TVMFFIPyWrapDealloc`; every heap subtype is covered for free through
`subtype_dealloc`'s base-walk to the nearest carrier. An import-time
layout guard `Py_FatalError`s if a non-GC carrier ever gains a
`__dict__`, turning silent owned-member drift into a loud failure.
Active-hit revival uses `PyUnstable_TryIncRef` / `EnableTryIncRef` to
close the borrowed-read UAF. The GIL build is byte-identical: all
free-threaded machinery is behind `#ifdef Py_GIL_DISABLED`, and on the
GIL build the post-bump `__dealloc__` hook runs the transition directly
(safe under the GIL) while the slot installer is a no-op stub.
## Robustness
A double-free in the builtin allocator is fixed along the way: allocate
offset the body by `round_up(header, alignment)` but free subtracted a
fixed `sizeof(header)`, symmetric only for alignment <= 8 -- a 16-aligned
reflection dataclass (alignof(max_align_t)=16) was freed 8 bytes early.
Both sides now use a fixed `alignof(max_align_t)` body offset. A shutdown
guard (`TVMFFIPyMarkPythonFinalizing`, wired to atexit) flips an
atomic flag read before any `PyGILState_Ensure` to avoid acquiring the
GIL after Python finalization has begun; wrapper bytes on chandles still
alive at exit are intentionally leaked.
## `_move()` semantics
Under universal cache-on, callback args alias the caller's wrapper and
FFI returns of the same chandle alias the caller's wrapper (one wrapper,
one chandle ref). `_move()` is kept as an API: the rvalue setter on
either side eager-detaches the canonical binding before the C++
`AnyViewToOwnedAny` transfer nulls the source chandle.
## Tests
New `tests/python/test_pyobject_tying.py` covers Active/Inactive/InTransit
transitions, cache-on aliasing, `_move()` under cache-on, pickle stress,
threading stress, GC integration, multi-chandle isolation, the weakref
limitation, free-threaded concurrent carrier-type stress
(Function/Error/multi-level-heap), and OpaquePyObject roundtrip/leak.
`test_function.py::test_rvalue_ref` is refactored for the new
aliasing-aware use_count expectations. Verified: FT crash oracle clean,
FT suite 2321 passed, GIL suite 2346 passed, tying tests 43/43 on both
builds; Rust suite passes.
Summary
Bind one Python wrapper to one C++ FFI object ("chandle") for the object's lifetime, so identity is stable:
a.x is a.x,id(a.x)stable across drop+refetch, andf(x) is xfor FFI returns. Works on both the GIL and free-threaded (3.14t) builds. Implements the PyObjectTying design.Before:
After: both hold, and identity is preserved across a wrapper death-and-revive cycle whenever the C++ object outlives the wrapper.
Allocation layout
A two-layer custom-allocator hook lives in core libtvm_ffi:
TVMFFIObjectAllocHeader { delete_space }— 8-byte base header preceding every Object body.TVMFFICustomAllocator { allocate, context }— process-wide registry; libtvm_ffi installs a builtin default at registry init soTVMFFIGetCustomAllocatornever returns NULL.TVMFFIGetCustomAllocator/TVMFFISetCustomAllocator— frontends override the default at module load.The Python Cython module overrides the global default with
TVMFFIPyAllocate, which prepends a 16-bytePyCustomAllocHeaderencoding the wrapper binding. The Rust crate (ObjectArc::new[_with_extra_items]) and the Python-defined types inextra/dataclass.ccroute through the same registry, so layout and lifetime semantics are uniform across frontends.Binding state machine
State is concentrated in
python/tvm_ffi/cython/tvm_ffi_python_helpers.h. Each header word (tagged_pyobj) holds the wrapper back-pointer plus low tag bits, giving a four-state machine:Every FFI return funnels through
make_ret_object(C++ entryTVMFFIPyMakeRetObject), which returns the canonical wrapper for a chandle when one exists, reviving an Inactive cached allocation in place so a re-fetched wrapper keeps a stableid()at the same address. The cache-vs-free handshake spans three slots — a pre-bumptp_deallocopens it,tp_freesettles it, and the C++ weak deleter (TVMFFIPyDeleteSpace) reclaims the block — coordinated so a chandle that outlives its wrapper keeps the cached bytes, and a genuinely dead chandle frees them exactly once.Frontend-allocation is detected by
delete_spacepointer comparison (TVMFFIPyIsCanonical), avoiding a flag bit onTVMFFIObject. Chandles created before the Python allocator is registered (statically-initialized global functions in libtvm_ffi.so) carry only the base header and are skipped.Free-threaded build
The same tying runs on
Py_GIL_DISABLED. All free-threaded machinery is behind#ifdef Py_GIL_DISABLED; the GIL build is byte-identical (verified by a function-body-map diff).tagged_pyobjword doubles as a spin-lock (a Locked tag bit,__atomic_*CAS) serializing every binding transition. The lock is held only across short, park-free word/header edits; alloc/revival run lock-released. The back-off (TVMFFIPyLockYield) detaches the thread state so a concurrent stop-the-world GC is never starved.tp_deallocbumps the wrapper refcount before running__dealloc__. On free-threaded builds that bump makesPyUnstable_TryIncRefspuriously succeed on a wrapper being torn down, so a concurrentmake_retActive-hit could revive a corpse (borrowed-ref UAF). The fix replaces Cython'stp_deallocon each cdef CObject-family carrier with one hand-builtTVMFFIPyTpDeallocSlotthat runs the binding transition (bracketed byPyErr_Get/SetRaisedException) before any bump, then — stripped of the now-dead bump — GC-untrack (guarded by GC-ness), a generic__dict__clear (guarded by a realtp_dictoffset), andtp_free. A plain carrier runs exactlytransition; tp_free; Function fires both guards; both are faithful to Cython's originals minus the bump.TVMFFIPyWrapDealloc; every heap subtype is covered for free throughsubtype_dealloc's base-walk to the nearest carrier. An import-time layout guardPy_FatalErrors if a non-GC carrier ever gains a__dict__, turning silent owned-member drift into a loud failure.PyUnstable_TryIncRef/EnableTryIncRefto close the borrowed-read UAF (the CPython "weakmap" pattern, whosetp_deallocsupport requirement this implements).Verified safe through CPython 3.15/3.16: the
PyUnstable_TryIncRef/PyMutexcontract is unchanged, and the 3.15 rule that managed dict/weakref impliesHAVE_GCdoes not affect us — no carrier uses a managed dict.Robustness
round_up(header, alignment)but free subtracted a fixedsizeof(header), symmetric only for alignment ≤ 8. A 16-aligned reflection dataclass (alignof(max_align_t)=16) was freed 8 bytes early. Both sides now use a fixedalignof(max_align_t)body offset. It was masked on the GIL build because the symmetric custom allocator shadows the builtin one for all Python objects.TVMFFIPyMarkPythonFinalizing(wired to atexit) flips an atomic flag read before anyPyGILState_Ensure, to avoid acquiring the GIL after Python finalization has begun. Wrapper bytes on chandles still alive at exit are intentionally leaked — the process is exiting._move()semanticsUnder universal cache-on, callback args alias the caller's wrapper and FFI returns of the same chandle alias the caller's wrapper (one wrapper, one chandle ref).
_move()is kept as an API: the rvalue setter on either side eager-detaches the canonical binding before the C++AnyViewToOwnedAnytransfer nulls the source chandle, so a downstream cache lookup never sees a stale back-pointer.Test plan
tests/python/test_pyobject_tying.pycovers Active/Inactive/InTransit transitions, cache-on aliasing,_move()under cache-on, pickle stress, threading stress, GC integration, multi-chandle isolation, the weakref limitation, free-threaded concurrent carrier-type stress (Function/Error/multi-level-heap), and OpaquePyObject roundtrip/leak.test_function.py::test_rvalue_refrefactored for cache-on aliasing semantics.Out-of-scope follow-ups
_get_global_functo deliver id-stability for static-init Functions (whose chandles predate the Python allocator and so carry only the base header). Tracked as a TODO infunction.pxi::_get_global_func.