Skip to content

feat(ai-aliyun-content-moderation): role-aware request_check_mode and O(n) content chunking#13598

Open
nic-6443 wants to merge 5 commits into
apache:masterfrom
nic-6443:fix-aliyun-cm-moderation-scope
Open

feat(ai-aliyun-content-moderation): role-aware request_check_mode and O(n) content chunking#13598
nic-6443 wants to merge 5 commits into
apache:masterfrom
nic-6443:fix-aliyun-cm-moderation-scope

Conversation

@nic-6443

@nic-6443 nic-6443 commented Jun 23, 2026

Copy link
Copy Markdown
Member

Description

The ai-aliyun-content-moderation plugin had hot-path inefficiencies in the moderation path, re-moderated the entire conversation on every request, and opened a new HTTP connection per moderation call. This PR fixes the performance issues and adds a moderation-scope option.

1. O(n²) → O(n) content chunking. content_moderation() splits text into *_check_length_limit-sized chunks. It used utf8.sub(content, index, ...), which locates the index-th character by scanning from the start of the string on every chunk, making the loop O(n²) in content length. Replaced with a byte cursor + utf8.offset(content, length_limit + 1, cur) (scans only the current chunk window) + byte-based string.sub, giving O(n). Output is byte-identical to the previous chunking (verified for ASCII and multibyte text).

2. Faster RFC-3986 signing. url_encoding() ran five separate Lua string.gsub passes over the percent-encoded chunk to escape the sub-delimiters ! ' ( ) *. Profiling a ~2 KB chunk showed this single function at ~148 µs — 30–40× more than every other per-chunk operation (json.encode/hmac_sha1/encode_args are all ~4–5 µs). Replaced with one JIT-compiled ngx.re.gsub(escaped, "[!'()*]", repl, "jo") pass (~7 µs, same output).

3. Schema guard. Added minimum = 1 to request_check_length_limit and response_check_length_limit. A non-positive value made the chunking loop never advance (utf8.offset(content, length_limit + 1, cur) returns cur), an infinite loop. It is now rejected at config time, consistent with stream_check_cache_size.

4. request_check_mode (last | all). Request moderation now targets user input only. By default (last) it moderates just the latest consecutive block of user messages (the newest user turn) instead of the whole conversation, so multi-turn requests no longer re-send the entire history to the moderation service every turn. all moderates every user message. Both modes ignore system/assistant/tool messages (the query moderation service is meant for user input). Note this changes the previous behavior, which moderated all messages regardless of role.

5. Reuse one HTTP connection per request. check_single_content() opened a new resty.http object and ran connect()/set_keepalive() on every moderation call. In realtime response mode a single response fires one call per stream_check_cache_size characters (tens of calls), so per-call connection churn dominated. Now one httpc is cached on ctx, reused across all of a request's moderation calls, and returned to the keepalive pool once at request end; a failed connection is dropped and re-established on the next call. This is perf-only with no behavior change. (Minor: on an aborted/errored streaming response the held connection is closed at request teardown instead of pooled — no leak; cosocket ops aren't allowed in the log phase so there is no clean release hook for those paths.)

Performance — request moderation

Single user message, ai-proxy + ai-aliyun-content-moderation, local mock moderation endpoint that always passes, wrk -t1 -c1. Baseline = ai-proxy only:

message size baseline QPS before QPS after QPS before→after
64k 1928 71.8 162.4 2.3x
128k 1377 30.3 84.6 2.8x
256k 899 11.3 43.1 3.8x
512k 508 3.75 21.9 5.9x
1M 266 1.10 11.2 10.1x

The before→after gain grows with size because the O(n²) term is removed; the single-pass encoding adds a roughly uniform ~2× on top.

Performance — streaming response moderation (connection reuse)

Single worker pinned and saturated, streaming response checked in realtime mode (one moderation call per stream_check_cache_size chars), local mock moderation endpoint:

stream_check_cache_size calls/response before tokens/s after tokens/s gain
128 ~48 59,938 62,073 +3.6%
64 ~96 45,074 47,790 +6.0%

The gain scales with the number of moderation calls per response (smaller cache_size / longer responses). final_packet and non-streaming checks are unchanged.

Tests

Added cases to t/plugin/ai-aliyun-content-moderation.t: last/all mode behavior, role-awareness (non-user messages are skipped; a non-user last message means no user turn to check), multi-chunk detection across a multibyte boundary, and schema rejection of request_check_length_limit: 0. Change 5 is perf-only with no behavior change; the existing realtime cases (run with keepalive=true) already exercise cross-call connection reuse.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have added the relevant tests
  • I have updated the documentation (en + zh)

Copilot AI review requested due to automatic review settings June 23, 2026 04:31
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request performance generate flamegraph for the current PR labels Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the ai-aliyun-content-moderation plugin’s hot path and adds role-aware request moderation scoping so multi-turn chat requests don’t re-moderate the full conversation on every call.

Changes:

  • Improve request/response chunking performance by switching from repeated utf8.sub slicing to a byte cursor + utf8.offset approach (O(n²) → O(n)).
  • Speed up RFC-3986-compatible signing by replacing multiple string.gsub passes with a single ngx.re.gsub pass.
  • Add request_check_mode (last/all) to moderate only user-role messages, plus schema guards (minimum = 1) and corresponding tests/docs updates.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
apisix/plugins/ai-aliyun-content-moderation.lua Adds request_check_mode, improves signing and chunking performance, and tightens schema constraints.
apisix/plugins/ai-protocols/openai-chat.lua Refactors request text extraction and adds user-only extraction for request moderation.
apisix/plugins/ai-protocols/anthropic-messages.lua Refactors request text extraction and adds user-only extraction for request moderation.
t/plugin/ai-aliyun-content-moderation.t Adds coverage for last/all mode behavior, role-awareness, multi-chunk paths, and schema rejection.
docs/en/latest/plugins/ai-aliyun-content-moderation.md Documents request_check_mode and updates length-limit constraints.
docs/zh/latest/plugins/ai-aliyun-content-moderation.md Documents request_check_mode and updates length-limit constraints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apisix/plugins/ai-aliyun-content-moderation.lua Outdated
Comment thread apisix/plugins/ai-aliyun-content-moderation.lua Outdated
nic-6443 added 3 commits June 23, 2026 12:39
…ents; drop all-content fallback (moderate user content only) and add extract_user_content to responses/bedrock/embeddings
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 23, 2026
…uest

check_single_content created a new resty.http object and ran
connect()/set_keepalive() on every moderation call. In realtime streaming
mode a response fires one call per stream_check_cache_size characters
(tens of calls), so the per-call connection churn dominated. Cache one
httpc on ctx, reuse it across all moderation calls of the request, and
return it to the keepalive pool once at request end; a failed connection
is dropped and re-established on the next call.

Single-worker saturated benchmark against a local moderation mock:
realtime throughput +3.6% at cache_size=128 and +6.0% at cache_size=64;
final_packet and non-streaming checks unchanged. No behaviour change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request performance generate flamegraph for the current PR size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants