Per-worker proxy threads for UDF callbacks (#1136) #7

Open: otegami wants to merge 2 commits into main from feature/per-worker-proxy

@otegami commented May 31, 2026

Summary

Implements per-worker proxy threads for scalar and table function UDF
callbacks (refs suketa/ruby-duckdb#1136), so callbacks from different DuckDB
worker threads run concurrently instead of serializing through a single
global executor.

This is a personal-fork PR for local verification and integration; it holds
the complete picture of the feature. The work goes upstream
(suketa/ruby-duckdb) one PR at a time, in order:

  1. lifecycle primitive: merged as "Add per-worker proxy thread primitive" (suketa/ruby-duckdb#1364)
  2. dispatch wiring: merged as "feat: route the non-Ruby-thread dispatch path through an optional proxy" (suketa/ruby-duckdb#1365), so it no longer appears here
  3. scalar integration: open as "feat: per-worker proxy for scalar function execute callback" (suketa/ruby-duckdb#1366); this branch's 1st commit
  4. table integration: next (this branch's 2nd commit)

Why

With one global executor, callbacks from different workers can never overlap,
even when they release the GVL (e.g. on I/O). One proxy thread per DuckDB
worker lifts exactly that ceiling. Measured with sample/issue1136.rb
(GVL-releasing scalar callback over a 500k-row scan):

|        | SET threads=1             | SET threads=4              |
|--------|---------------------------|----------------------------|
| before | 1.379s, 1 callback thread | 0.976s, 2 callback threads |
| after  | 1.374s, 1 callback thread | 0.365s, 4 callback threads |

The before run caps at 2 threads (calling thread + global executor) no matter
how many workers DuckDB spawns. Pure-CPU callbacks stay bounded by the GVL,
so the win is specific to GVL-releasing UDFs.
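
The ceiling can be modeled in plain Ruby. This is a minimal sketch, not the extension's C code and not sample/issue1136.rb: `ModelExecutor` and `timed_run` are hypothetical names, the workers are ordinary Ruby threads (real DuckDB workers are non-Ruby threads), and `sleep` stands in for a GVL-releasing callback such as blocking I/O.

```ruby
# Hypothetical model of the dispatch ceiling; not code from this PR.
# One ModelExecutor owns one Ruby thread and runs submitted blocks on it,
# strictly one at a time.
class ModelExecutor
  def initialize
    @jobs = Queue.new
    @thread = Thread.new do
      while (job = @jobs.pop) # a nil job stops the loop
        block, done = job
        done << block.call
      end
    end
  end

  # Run the block on the executor thread and wait for its result.
  def call(&block)
    done = Queue.new
    @jobs << [block, done]
    done.pop
  end

  def stop
    @jobs << nil
    @thread.join
  end
end

# Each simulated worker issues `calls` GVL-releasing callbacks through the
# executor assigned to it. Returns elapsed wall-clock seconds.
def timed_run(executors, workers: 4, calls: 5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Array.new(workers) do |i|
    Thread.new do
      executor = executors[i % executors.size]
      calls.times { executor.call { sleep 0.02 } } # sleep releases the GVL
    end
  end.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
ensure
  executors.each(&:stop)
end

global_time = timed_run([ModelExecutor.new])                # one shared executor
proxy_time  = timed_run(Array.new(4) { ModelExecutor.new }) # one per worker

puts format("global executor: %.3fs, per-worker: %.3fs", global_time, proxy_time)
```

With a shared executor the sleeps queue up behind one thread, so the run takes roughly workers x calls x 0.02s; with one executor per worker the sleeps overlap. Replacing `sleep` with a pure-CPU loop would erase most of the difference, which matches the GVL caveat above.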

Design notes

  • Three-path dispatch ("Fix GVL-unsafe callbacks in table_function.c",
    suketa/ruby-duckdb#1280) is preserved; proxies only change Case 3
    (non-Ruby thread): route through the worker's own proxy when present, else
    fall back to the global executor.
  • DuckDB 1.4.x LTS keeps the old path byte-for-byte — all proxy code is gated
    behind HAVE_DUCKDB_H_GE_V1_5_0 (set_init / set_local_init are 1.5.0
    APIs).
  • Proxy structs use calloc/free (not xcalloc/xfree): DuckDB frees
    them from non-Ruby threads. Proxy threads are GC-protected via a global
    array.
  • No public surface changes: the Ruby API (DuckDB::*) and the
    duckdb_native artifact name are untouched.
  • An earlier revision renamed function_executor.{c,h} -> executor.{c,h};
    the rename was dropped (the name is accurate and matches the
    *_function_* file family).

Verification

  • Full suite green at each commit: 1148 (scalar) -> 1149 (table) runs,
    0 failures. rake compile clean (only the pre-existing
    -Wshorten-64-to-32 warning); RuboCop clean.
  • The lifecycle tests record which Ruby threads run callbacks and assert
    more than two distinct threads — the global executor structurally caps at
    two, so each test fails without its commit (verified red/green against a
    proxy-less build).
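
The thread-recording assertion can be sketched in plain Ruby. `ThreadRecorder` is a hypothetical helper, not the test code in this PR, and ordinary Ruby threads stand in for the per-worker proxy threads.

```ruby
require "set"

# Hypothetical helper: collects the Ruby threads that ran a callback.
class ThreadRecorder
  def initialize
    @lock = Mutex.new
    @threads = Set.new
  end

  # Call from inside the callback under test.
  def record
    @lock.synchronize { @threads << Thread.current }
  end

  def distinct_count
    @lock.synchronize { @threads.size }
  end
end

recorder = ThreadRecorder.new

# Stand-in for four proxy threads, each running the callback several times.
Array.new(4) { Thread.new { 3.times { recorder.record } } }.each(&:join)

# The calling thread plus a single global executor can contribute at most
# two distinct threads, so "more than two" separates the two designs.
raise "expected more than two callback threads" unless recorder.distinct_count > 2
```

Counting distinct threads rather than checking for overlap in time keeps the assertion independent of how the scheduler interleaves the workers.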

Wire the scalar execute path to per-worker proxy threads on DuckDB
>= 1.5.0. An init callback registered via duckdb_scalar_function_set_init
runs once per worker thread, creates a proxy (allocating its Ruby thread
under the GVL through the global executor, since init runs on a non-Ruby
thread), and stores it as per-worker state via
duckdb_scalar_function_init_set_state. The execute callback retrieves
that proxy with duckdb_scalar_function_get_state and dispatches through
it via rbduckdb_function_executor_dispatch_via_proxy, so callbacks from
different workers run concurrently instead of serializing on the single
global executor. DuckDB frees each proxy through rbduckdb_worker_proxy_destroy.
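
The init/execute flow can be modeled in plain Ruby. This sketch is an illustration under stated assumptions, not the C wiring: `ModelProxy`, `init_worker`, and `execute` are hypothetical names, per-worker state is just a Ruby object, and none of the DuckDB C API calls named above appear in it.

```ruby
# Hypothetical model of the scalar wiring; names are illustrative only.
# A proxy here is a dedicated Ruby thread fed by a queue.
class ModelProxy
  attr_reader :thread

  def initialize
    @jobs = Queue.new
    @thread = Thread.new do
      while (job = @jobs.pop) # a nil job stops the loop
        block, done = job
        done << block.call
      end
    end
  end

  # Run the block on the proxy thread and wait for its result.
  def dispatch(&block)
    done = Queue.new
    @jobs << [block, done]
    done.pop
  end

  def destroy
    @jobs << nil
    @thread.join
  end
end

GLOBAL_EXECUTOR = ModelProxy.new

# Models the init callback: runs once per worker and returns that worker's
# state. nil models a proxy that could not be created.
def init_worker(create_proxy: true)
  create_proxy ? ModelProxy.new : nil
end

# Models the execute callback: use the worker's own proxy when present,
# otherwise fall back to the global executor.
def execute(state, &callback)
  (state || GLOBAL_EXECUTOR).dispatch(&callback)
end

states = [init_worker, init_worker, init_worker(create_proxy: false)]
ran_on = states.map { |state| execute(state) { Thread.current } }

states.compact.each(&:destroy) # models the per-worker destroy callback
GLOBAL_EXECUTOR.destroy
```

The first two workers run their callback on their own proxy thread; the third, whose state is `nil`, runs on the global executor's thread.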

The proxy-creating wrapper runs rbduckdb_worker_proxy_create under
rb_protect, implementing the raise contract documented on that function:
the executor runs callbacks unprotected, so an uncaught raise would
longjmp past its done-signaling and block the waiting DuckDB worker
forever. On failure the proxy stays NULL and the execute callback falls
back to the global executor.
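
The hazard and the protective wrapper can be modeled in plain Ruby. This is a sketch, not the C implementation: `signaled_after?` and `protected_create` are hypothetical names, and `rescue` plays the role that `rb_protect` plays in C. Unlike `rb_protect`, the model rescues only `StandardError`, which is a simplification.

```ruby
# Hypothetical model of the raise contract; not the extension's C code.
# The executor thread runs a job and then signals completion. Returns true
# when the signal was sent, false when a raise skipped it.
def signaled_after?(job)
  done = Queue.new
  executor = Thread.new do
    Thread.current.report_on_exception = false
    done << job.call # a raise in job.call skips the signal
  end
  begin
    executor.join
  rescue StandardError
    # join re-raises the job's exception; what matters is that `done`
    # stayed empty, so a real waiter blocked on it would never wake up.
  end
  !done.empty?
end

# Models creating the proxy under protection: a raise becomes nil, so the
# job always returns and the completion signal is always sent.
def protected_create
  yield
rescue StandardError
  nil
end

failing_create = -> { raise "cannot create proxy thread" }

signaled_unprotected = signaled_after?(failing_create)
signaled_protected   = signaled_after?(-> { protected_create(&failing_create) })

puts "unprotected signaled: #{signaled_unprotected}, protected signaled: #{signaled_protected}"
```

In the protected case the job returns `nil`, which corresponds to the proxy staying NULL and the execute callback falling back to the global executor.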

On DuckDB < 1.5.0 the init hook is absent and the execute callback keeps
using the global executor unchanged.

The added test records which Ruby threads run the callback and asserts
more than two distinct threads, which the old implementation can never
produce (calling thread plus the single global executor), in addition to
result correctness. Simultaneity assertions are avoided as
scheduler-dependent; sample/issue1136.rb demonstrates the throughput win
with a GVL-releasing callback (about 3.8x at SET threads=4 locally).

Wire the table execute path to per-worker proxy threads on DuckDB
>= 1.5.0. A local_init callback registered via
duckdb_table_function_set_local_init runs once per worker thread, creates
a proxy (allocating its Ruby thread under the GVL through the global
executor, since local_init runs on a non-Ruby thread), and stores it as
thread-local init data via duckdb_init_set_init_data. The execute
callback retrieves that proxy with duckdb_function_get_local_init_data
and dispatches through it via
rbduckdb_function_executor_dispatch_via_proxy, so callbacks from
different workers run concurrently instead of serializing on the single
global executor. bind and init stay on the global executor. DuckDB frees
each proxy through rbduckdb_worker_proxy_destroy.

The proxy-creating wrapper runs rbduckdb_worker_proxy_create under
rb_protect, implementing the raise contract documented on that function:
the executor runs callbacks unprotected, so an uncaught raise would
longjmp past its done-signaling and block the waiting DuckDB worker
forever. On failure the proxy stays NULL and the execute callback falls
back to the global executor.

On DuckDB < 1.5.0 the local_init hook is absent and the execute callback
keeps using the global executor unchanged.

The added test records which Ruby threads run the execute callback and
asserts more than two distinct threads, which the old implementation can
never produce (calling thread plus the single global executor), in
addition to result correctness. Verified to fail against a build without
this change. Simultaneity assertions are avoided as scheduler-dependent.

@otegami force-pushed the feature/per-worker-proxy branch from 83bd94e to 89f19d8 on June 6, 2026 12:19