Skip to content

Support /pattern/i regex in broker-protection extract afterText/beforeText/separator#2777

Merged
noisysocks merged 3 commits into
mainfrom
randerson/regex-aftertext-beforetext
Jun 23, 2026
Merged

Support /pattern/i regex in broker-protection extract afterText/beforeText/separator#2777
noisysocks merged 3 commits into
mainfrom
randerson/regex-aftertext-beforetext

Conversation

@noisysocks

@noisysocks noisysocks commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Asana Task/Github Issue: https://app.asana.com/1/137249556945/project/1209121419454298/task/1213630318520583

Description

separator already lets a broker rule pass a regex via the /pattern/ convention (#2242), but afterText and beforeText were still plain literal matches, and none of the three could match case-insensitively. We want these fields to be as flexible as possible, so that broker rules can adapt to whatever shape a page throws at them without needing a code change.

So this extends the same /pattern/ convention to afterText and beforeText, and adds an optional i flag across all three:

"age":                  { "selector": ".age", "afterText": "/age:?\\s+/i" },
"alternativeNamesList": { "selector": ".aka", "separator": "/\\s*aka\\s*/i" }

Plain strings stay literal, so there's no behaviour change for existing rules – the regex path only kicks in when a value is wrapped in /.../. Anything with an unsupported flag (i.e. not i) falls back to being treated as a literal string rather than throwing, so a mistyped config fails gently rather than breaking the whole action.

Testing Steps

There's no broker that needs this yet, so there's no good way to test it by hand. Coverage is via tests instead:

  • npm run test-int (broker-protection specs) → extract with regex afterText / beforeText / separator. Runs against a synthetic people-search fixture (results-regex.html) with several same-name results where only one is the right match, and the badge / age-label / alias-separator casing varies between cards. Asserts the matched profile resolves to John Smith, age 38, Chicago, which only holds when the i flag is honoured.
  • npm run test-unit covers the /pattern/ parser (/x/i, unsupported flags staying literal, the / and // edge cases) and the afterText/beforeText/separator behaviour, including the match-and-slice semantics that keep capture groups out of the result.

Checklist

Please tick all that apply:

  • I have tested this change locally
  • I have tested this change locally in all supported browsers
  • This change will be visible to users
  • I have added automated tests that cover this change
  • I have ensured the change is gated by config
  • This change was covered by a ship review
  • This change was covered by a tech design
  • Any dependent config has been merged

🤖 Generated with Claude Code


Note

Medium Risk
Touches shared profile scraping logic used in production broker actions; behavior is backward compatible but misconfigured regex could change extracted PII fields until rules are fixed.

Overview
Broker extract rules can now use the same /pattern/ and /pattern/i JSON convention for afterText and beforeText, not only separator, so labels like age and AKA can be matched case-insensitively without code changes.

parseRegexFromString replaces the old separator-only parser and accepts an optional i flag; unsupported flags (e.g. /x/g) stay literal strings. shapeString routes afterText / beforeText through new splitOnce, which keeps literal String#split behavior for plain strings and uses match + slice for regex so capture groups do not pollute results and “after” keeps everything past the first match.

Plain-string configs are unchanged. Coverage adds unit tests for the parser and regex shaping, plus an integration test on a synthetic results page with mixed casing on badges and labels.

Reviewed by Cursor Bugbot for commit 9966bdb. Bugbot is set up for automated code reviews on this repo. Configure here.

…eText/separator

Extends the /pattern/ separator convention (#2242) to afterText and beforeText, and
adds an optional `i` flag to all three so broker rules can match case-insensitively.
Plain strings stay literal, so existing rules are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@noisysocks noisysocks requested review from a team as code owners June 18, 2026 04:41
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Build Branch

Branch pr-releases/randerson/regex-aftertext-beforetext
Commit 1fc06c3b0f
Updated June 23, 2026 at 5:18:33 AM UTC

Static preview entry points

QR codes (mobile preview)
Entry point QR code
Docs QR for docs preview
Static pages QR for static pages preview
Integration pages QR for integration pages preview

Integration commands

npm (Android / Extension):

npm i github:duckduckgo/content-scope-scripts#pr-releases/randerson/regex-aftertext-beforetext

Swift Package Manager (Apple):

.package(url: "https://github.com/duckduckgo/content-scope-scripts.git", branch: "pr-releases/randerson/regex-aftertext-beforetext")

git submodule (Windows):

git -C submodules/content-scope-scripts fetch origin pr-releases/randerson/regex-aftertext-beforetext
git -C submodules/content-scope-scripts checkout origin/pr-releases/randerson/regex-aftertext-beforetext
Pin to exact commit

npm (Android / Extension):

npm i github:duckduckgo/content-scope-scripts#1fc06c3b0fe846aaba9fa9384a2a03afb574e052

Swift Package Manager (Apple):

.package(url: "https://github.com/duckduckgo/content-scope-scripts.git", revision: "1fc06c3b0fe846aaba9fa9384a2a03afb574e052")

git submodule (Windows):

git -C submodules/content-scope-scripts fetch origin pr-releases/randerson/regex-aftertext-beforetext
git -C submodules/content-scope-scripts checkout 1fc06c3b0fe846aaba9fa9384a2a03afb574e052

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

[Beta] Generated file diff

Time updated: Tue, 23 Jun 2026 05:19:11 GMT

Android
    - android/autofillImport.js
  • android/brokerProtection.js

File has changed

Apple
    - apple/contentScopeIsolated.js

File has changed

Integration
    - integration/contentScope.js

File has changed

Windows
    - windows/contentScope.js

File has changed

daxtheduck
daxtheduck previously approved these changes Jun 18, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

Injected PR Evaluation: Web Compatibility & Security

Review of broker-protection extract regex support (/pattern/i for afterText, beforeText, separator).

Web Compatibility Assessment

File Lines Severity Finding
injected/src/features/broker-protection/actions/extract.js 358–383 info Changes are confined to broker-protection profile extraction. No browser API overrides, prototype patches, or DOM lifecycle changes that could affect third-party scripts on arbitrary sites.
injected/src/features/broker-protection/actions/extract.js 376–383 info Literal-string afterText/beforeText preserve the prior String#split semantics via splitOnce. Existing configs using plain strings are unaffected.
injected/src/features/broker-protection/actions/extract.js 376–380 info RegExp afterText/beforeText intentionally diverge from String#split for multi-occurrence text (match + slice keeps everything after the first match). This is documented and tested; it only applies to /pattern/ or /pattern/i matchers, not literals.
injected/src/features/broker-protection/actions/extract.js 358–360 warning parseRegexFromString (renamed from parseRegexSeparator) changes how slash-wrapped separators ending in /i compile. Previously /age/i became new RegExp('age/i') (no flag); now it becomes new RegExp('age', 'i'). Any existing separator config that relied on the old parsing would behave differently. The new behavior is correct for the stated goal, but it is a config-level compatibility change worth noting in release notes.
injected/src/features/broker-protection/actions/extract.js 184–189 info afterText/beforeText gain regex parsing for the first time. A config that used a literal value like /verified/ as beforeText would previously split on that exact string; it now compiles as a regex. Unlikely in practice, but possible for edge-case configs.

Security Assessment

File Lines Severity Finding
(scope) info No changes to captured-globals.js, messaging transports, message bridge, wrapper-utils.js, shouldExemptMethod(), or platform entry points. Injected-runtime attack surface is unchanged.
injected/src/features/broker-protection/actions/extract.js 358–360 info Regex patterns originate from native broker action JSON (trusted config), not from page scripts. new RegExp(...) is appropriate for this trust boundary.
injected/src/features/broker-protection/actions/extract.js 377–378 info Regexes execute against page-derived innerText/textContent. A hostile page could supply long adversarial input against a config regex (ReDoS). This risk already existed for separator regexes; this PR extends the same model to afterText/beforeText. Mitigation is config-side pattern hygiene, not page-side.
injected/src/features/broker-protection/actions/extract.js 377 info matcher instanceof RegExp uses the ambient RegExp constructor. In the broker-protection execution context this is low risk (patterns are constructed locally via new RegExp), but captured RegExp would be marginally more consistent with hardened injected code elsewhere.
injected/src/features/broker-protection/actions/extract.js 360 info Invalid regex syntax in a /pattern/ string will throw from new RegExp at extraction time. Failure is localized to the broker-protection action (not a page-wide throw), but a try/catch fallback to literal-string mode would make misconfigured patterns fail more gracefully.

Risk Level

Low Risk — A narrowly scoped broker-protection extraction utility change with solid unit and integration test coverage; no injected API shims, messaging boundary changes, or global-capture regressions.

Recommendations

  1. Document the /pattern/i separator parsing fix in broker-protection release notes so config authors know /foo/i now enables case-insensitivity rather than matching a literal /i suffix in the pattern.
  2. Consider wrapping new RegExp in try/catch inside parseRegexFromString to fall back to the literal string when compilation fails, matching the existing tolerant behavior for unsupported flags like /x/g.
  3. ReDoS hygiene for new configs — prefer bounded, simple patterns (e.g. /age:?\\s+/i, /\\s*aka\\s*/i) over nested quantifiers when authoring broker extract rules that run against untrusted page text.

Unit tests verified locally: broker-protection-extract.js — 28 specs, 0 failures.

Open in Web View Automation 

Sent by Cursor Automation: Web compat and sec

@noisysocks noisysocks requested a review from brianhall June 18, 2026 04:44
@daxtheduck daxtheduck dismissed their stale review June 23, 2026 00:43

Dismissing stale approval — new commits pushed, awaiting Cursor re-review.

@github-actions github-actions Bot added the semver-patch Bug fix / internal — no release needed label Jun 23, 2026
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Cursor assessed this PR as Medium Risk (only Low Risk is auto-approved).

This PR requires a manual review and approval from a member of one of the following teams:

  • @duckduckgo/content-scope-scripts-owners
  • @duckduckgo/apple-devs
  • @duckduckgo/android-devs
  • @duckduckgo/team-windows-development
  • @duckduckgo/extension-owners
  • @duckduckgo/config-aor
  • @duckduckgo/breakage-aor
  • @duckduckgo/breakage

Comment thread injected/src/features/broker-protection/actions/extract.js
Comment thread injected/src/features/broker-protection/actions/extract.js

@brianhall brianhall left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks!

Resolve conflicts with the field-extractor registry refactor (#2778):
- extract.js: re-apply regex afterText/beforeText through splitOnce in
  shapeString; merge TextFieldSpec JSDoc (regex descriptions + attribute)
- broker-protection-extract.js: migrate regex tests to stringsFromElements
@noisysocks noisysocks added this pull request to the merge queue Jun 23, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Injected PR Evaluation: Web Compatibility & Security

PR: Support /pattern/i regex in broker-protection extract afterText/beforeText/separator


Web Compatibility Assessment

File Lines Severity Finding
injected/src/features/broker-protection/actions/extract.js 444–446 info parseRegexFromString only accepts /pattern/ or /pattern/i. Unsupported flag forms (/x/g, /x/gi) remain literal strings — backward compatible and avoids global/sticky split surprises.
injected/src/features/broker-protection/actions/extract.js 462–468 info splitOnce uses match + slice for RegExp matchers instead of String#split, so afterText/beforeText keep everything after/before the first match (not between occurrences). This is a deliberate improvement over literal split semantics for regex; literal string matchers preserve existing split behavior exactly.
injected/src/features/broker-protection/actions/extract.js 308–315 info When a regex does not match, splitOnce returns undefined and shapeString falls back to the original value (?.trim() || value). No silent empty-string regressions for non-matching pages.
injected/src/features/broker-protection/actions/extract.js 444–446 warning Minor behavior change for separator: previously // (length 2, starts/ends with /) was compiled as an empty regex via the old parseRegexSeparator; now it stays a literal // because .+ requires at least one pattern character. Unlikely in real broker configs but technically a delta.
injected/integration-test/... info Integration test + synthetic HTML exercise case-insensitive extraction across mixed badge/label casing. Good coverage of the motivating scenario.

No findings in API surface fidelity, prototype chain integrity, DOM safety, timing/races, platform-specific APIs, or third-party script detection — this change does not override browser APIs and only runs inside the config-gated broker-protection action executor.


Security Assessment

File Lines Severity Finding
injected/src/features/broker-protection/actions/extract.js 444–446, 462–468 info Regex patterns originate from DDG-authored broker action configs (native → content script), not from page scripts. Only the input text (innerText/textContent) is page-controlled.
injected/src/features/broker-protection/actions/extract.js 464 warning ReDoS surface (bounded): value.match(matcher) runs DDG-defined patterns against page-controlled strings. A hostile broker page could supply very long or adversarial text to amplify backtracking in a poorly chosen config pattern. Mitigated by: feature is opt-in/PIR-scoped, patterns are internally reviewed, and only the i flag is exposed (no g/m/s). Consider documenting safe-pattern guidance for config authors.
injected/src/features/broker-protection/actions/extract.js 446 info Invalid regex syntax in config (e.g. /[/) throws from new RegExp; execute() catches and returns ErrorResponse — no unhandled rejection or page-visible throw.
info No changes to captured globals, messaging/nativeData, message bridge, postMessage, iframe access, or remote-config gating. RegExp usage is in the content-script isolated realm, not a page-tamperable shim boundary.

No critical or error-level security findings.


Risk Level

Low Risk — Scoped broker-protection extraction helper change with no API overrides, messaging changes, or global capture modifications; behavior is backward compatible for literal strings and well-covered by unit + integration tests.


Recommendations

  1. (info) Document in broker-config authoring guidance that only /pattern/ and /pattern/i are supported, and that afterText/beforeText regex matchers anchor on the first match (differs from multi-split literal behavior).
  2. (info) When adding production broker configs with regex, prefer possessive/simple patterns (e.g. /age:?\\s+/i) over nested quantifiers to limit ReDoS exposure against adversarial page text.
  3. (info) The // → literal behavior change is harmless for real configs; no action required unless internal configs relied on empty-regex splitting.

Cursor automation — injected PR evaluation (web compat & security)

Open in Web View Automation 

Sent by Cursor Automation: Web compat and sec

Merged via the queue into main with commit ec0d6da Jun 23, 2026
39 of 40 checks passed
@noisysocks noisysocks deleted the randerson/regex-aftertext-beforetext branch June 23, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver-patch Bug fix / internal — no release needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants