fix(ai-proxy-multi): guard construct_upstream in healthcheck timers#13592
fix(ai-proxy-multi): guard construct_upstream in healthcheck timers#13592shreemaan-abhishek wants to merge 3 commits into
Conversation
construct_upstream returns nil, err on its normal failure path (unresolvable endpoint, missing host/port, nil instance) and can raise a Lua error. Both healthcheck timers called it without a pcall or nil check, so one malformed instance made upstream nil and the subsequent upstream.resource_key / upstream._nodes_ver access aborted checker creation/refresh for every resource in that worker, not just the bad one. Wrap both call sites in pcall; on failure log and skip the resource (creation) or keep the existing checker (refresh) instead of crashing the timer.
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Improves robustness of ai-proxy-multi healthcheck manager timers by preventing timer callbacks from crashing when construct_upstream returns nil, err or throws, and introduces regression tests that simulate panics.
Changes:
- Wrap
construct_upstreamcalls in both healthcheck-manager timers withpcalland add failure handling/logging. - Prevent nil-indexing in timer callbacks by skipping resource creation or handling refresh failures more safely.
- Add tests covering panic scenarios during checker creation and working-pool checks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
apisix/healthcheck_manager.lua |
Adds pcall guards and conditional logic around dynamic upstream construction in healthcheck timers. |
t/plugin/ai-proxy-multi-construct-upstream-panic.t |
Adds regression tests that force construct_upstream to throw to ensure timers don’t die and proxying continues. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ok, upstream, err = pcall(plugin.construct_upstream, upstream_constructor_config) | ||
| if not ok or not upstream then | ||
| err = err or upstream | ||
| upstream = nil | ||
| local err_msg = "[checking checker] unable to construct upstream for plugin: " | ||
| .. plugin_name .. ", resource path: " .. resource_path | ||
| .. ", json path: " .. json_path .. ", error: " .. err | ||
| if not ok then | ||
| core.log.error(err_msg) | ||
| else | ||
| core.log.warn(err_msg) | ||
| end | ||
| else | ||
| upstream.resource_key = resource_path | ||
| end |
| if upstream then | ||
| local current_ver = upstream_utils.version(res_conf.modifiedIndex, | ||
| upstream._nodes_ver) | ||
| core.log.info("checking working pool for resource: ", resource_path, | ||
| " current version: ", current_ver, " item version: ", item.version) | ||
| if item.version == current_ver then | ||
| need_destroy = false | ||
| end | ||
| end |
| plugin.construct_upstream = function(instance) | ||
| local panic_check | ||
| panic_check.cnt = 1 | ||
| end |
|
|
||
| __DATA__ | ||
|
|
||
| === TEST 1: panic in construct_upstream during checker creation must not crash the timer |
…afe and non-destructive - guard nil error value before concatenation in timer_working_pool_check - retain existing checker on transient construct_upstream failure - cover nil-return failure paths in both timers
membphis
left a comment
There was a problem hiding this comment.
Approved with one follow-up comment.
| local plugin = require("apisix.plugins." .. plugin_name) | ||
| upstream = plugin.construct_upstream(upstream_constructor_config) | ||
| upstream.resource_key = resource_path | ||
| ok, upstream, err = pcall(plugin.construct_upstream, upstream_constructor_config) |
There was a problem hiding this comment.
[P2] This keeps the old checker for every construct_upstream failure. If the JSON path no longer exists because the ai-proxy-multi plugin or instance was removed, the old healthchecker can stay running. Consider treating upstream_constructor_config == nil as a real removal and allowing the checker to be destroyed.
There was a problem hiding this comment.
Good catch, thanks. You're right that retaining on every failure would leak a checker when the instance/plugin is genuinely removed. Fixed: the retain is now gated on upstream_constructor_config ~= nil, so a nil constructor config (json path no longer resolves) is treated as a real removal and the checker is destroyed, while a transient construct_upstream failure still keeps it. Added a regression test that removes the plugin from the route and asserts the stale checker is released.
Treat a nil construct_upstream config as a real removal so a stale checker is released, while still retaining it through transient failures.
7d0e2e3
Description
ai-proxy-multi'sconstruct_upstreamreturnsnil, erron its normal failure path (unresolvable endpoint, missing host/port, nil instance) and can also raise a Lua error. The two healthcheck-manager timers called it without apcallor a nil check:When
construct_upstreamreturnednil(or threw), the followingupstream.resource_key = .../upstream._nodes_veraccess raisedattempt to index a nil valueinside the timer callback that iterates all healthcheck resources. A single malformedai-proxy-multiinstance could therefore abort checker creation/refresh for every resource in that worker, not just the offending one.This wraps both call sites in
pcall. On failure the timer logs and skips the resource (creation path) or keeps the existing checker (refresh path) instead of crashing.Tests
Adds
t/plugin/ai-proxy-multi-construct-upstream-panic.tcovering both timer paths: aconstruct_upstreamoverride that raises a Lua error during checker creation and during the working-pool check. Proxying stays unaffected and the guarded message is logged instead of the worker timer dying.Checklist