[codex] Harden place extractor fixtures and preview enrichment by michaelmwu · Pull Request #48 · 508-dev/gmaps-scraper

michaelmwu · 2026-07-05T09:21:59Z

Summary

Add browser-backed fixture tests for the Google Maps place JS extractor across overview, reviews, about, limited-view, and search-result panel shapes.
Move the large inline _PLACE_JS_EXTRACTOR JavaScript into src/gmaps_scraper/data/place_extractor.js and load it via importlib.resources while preserving the Python symbol.
Add conservative preview-payload extraction for rating and review_count, with fixtures for positive and ambiguous payloads.

Why

The JS extractor was mostly protected by source-string assertions, which can pass even when selector behavior drifts. The new fixture harness evaluates the extractor in a real local browser page and asserts the structured snapshot. Preview payloads can also contain rating and review-count signals on limited/thin pages, so the scraper now backfills those fields when DOM data is missing.

Notes

DOM values still take precedence over preview values.
Preview rating/count extraction rejects ambiguous arrays, price-shaped evidence, year-like counts, non-adjacent values, and conflicting candidates rather than guessing.
The Task 4 handoff brief was amended locally in .context/, but .context/ is gitignored workspace memory and is not part of this PR.
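
The DOM-precedence rule can be sketched roughly as follows. `RatingSummary` and `merge_rating_summary` are hypothetical names chosen for illustration, not the PR's actual merge code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RatingSummary:
    rating: Optional[float] = None
    review_count: Optional[int] = None


def merge_rating_summary(dom: RatingSummary, preview: RatingSummary) -> RatingSummary:
    """Backfill fields missing from the DOM; a present DOM value always wins."""
    return RatingSummary(
        rating=dom.rating if dom.rating is not None else preview.rating,
        review_count=(
            dom.review_count
            if dom.review_count is not None
            else preview.review_count
        ),
    )
```

Comparing against `None` rather than using truthiness means a legitimate DOM value of `0` would not be overwritten by preview data.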

Validation

./scripts/lint.sh
./scripts/typecheck.sh
./scripts/test.sh (266 tests)
Built a wheel and verified gmaps_scraper/data/place_extractor.js is included.

Note

Medium Risk
Touches core place extraction and merge logic for rating/review_count, but changes are covered by new browser fixtures and conservative preview heuristics with DOM precedence.

Overview
Moves the large inline _PLACE_JS_EXTRACTOR script out of place_scraper.py into packaged place_extractor.js, loaded at runtime via importlib.resources so behavior stays the same while the Python module shrinks.

Replaces many source-string tests with browser-backed checks: local HTML fixtures (overview, reviews, about, limited view, search results) are opened in Playwright and the extractor (plus related tab/about/review scripts) is evaluated against real DOM snapshots.

Adds conservative preview-payload enrichment for rating and review_count when the DOM is thin or missing those fields. Parsing only accepts unambiguous compact numeric pairs and rejects price-shaped, year-like, non-adjacent, or conflicting evidence; DOM values still win when present.
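
A simplified sketch of the "accept only unambiguous compact pairs" idea, assuming the preview payload is plain JSON text. All function names here are hypothetical, and the sketch covers only two of the listed rules (exact two-element pairs and conflict rejection); it does not reproduce the PR's price-shaped or year-like checks.

```python
import json
from typing import Any, Iterator, Optional, Tuple

Pair = Tuple[float, int]


def _is_rating(value: Any) -> bool:
    # Only floats in the 1.0-5.0 range count as a rating candidate.
    return isinstance(value, float) and 1.0 <= value <= 5.0


def _is_count(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def iter_compact_pairs(node: Any) -> Iterator[Pair]:
    """Yield (rating, count) for every list that is exactly [rating, count]."""
    if isinstance(node, list):
        if len(node) == 2 and _is_rating(node[0]) and _is_count(node[1]):
            yield (node[0], node[1])
        for child in node:
            yield from iter_compact_pairs(child)
    elif isinstance(node, dict):
        for child in node.values():
            yield from iter_compact_pairs(child)


def extract_rating_pair(payload_text: str) -> Optional[Pair]:
    """Return the single distinct pair, or None when evidence is absent or conflicting."""
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        return None
    candidates = set(iter_compact_pairs(payload))
    return next(iter(candidates)) if len(candidates) == 1 else None
```

Returning `None` on any conflict, instead of picking the first or most frequent candidate, matches the stated preference for rejecting rather than guessing.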

Reviewed by Cursor Bugbot for commit 29b58ef. Bugbot is set up for automated code reviews on this repo.

coderabbitai · 2026-07-05T09:22:08Z

Warning

Review limit reached

@michaelmwu, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 25 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer to the docs for additional details.

Review details

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a8a65784-35a0-4999-8abd-4849ba77d95e

📥 Commits

Reviewing files that changed from the base of the PR and between 71beb48 and 29b58ef.

📒 Files selected for processing (10)

src/gmaps_scraper/data/place_extractor.js
src/gmaps_scraper/place_scraper.py
tests/fixtures/place_pages/about.html
tests/fixtures/place_pages/limited_view.html
tests/fixtures/place_pages/overview.html
tests/fixtures/place_pages/reviews.html
tests/fixtures/place_pages/search_result.html
tests/fixtures/preview_payloads/ambiguous_rating_review_count.txt
tests/fixtures/preview_payloads/rating_review_count.txt
tests/test_place_scraper.py


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29b58efe7d


chatgpt-codex-connector · 2026-07-05T09:25:48Z

+    if 1900 <= value <= 2100:
+        return None


Do not drop valid 1900–2100 review counts

When the DOM lacks a review count and the preview payload is used as the fallback, a valid compact rating summary such as [4.7, 2000] is discarded solely because the count falls in the year-looking range. Places with 1,900–2,100 reviews are valid and fairly common, and the later _rating_count_pair_is_ambiguous check already handles actual year/price evidence from the same node, so this leaves rating/review_count missing for those places instead of enriching them from preview data.
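
To make the concern concrete, here is a rough sketch contrasting the blanket guard from the quoted diff with a context-sensitive one. The second function and its `_is_year_or_price` helper are hypothetical illustrations; they are not the repository's `_rating_count_pair_is_ambiguous`, whose actual logic is not shown in this thread.

```python
from typing import Any, List, Optional

YEAR_RANGE = range(1900, 2101)


def count_with_blanket_guard(value: int) -> Optional[int]:
    """Same check as the quoted diff: any count in 1900-2100 is rejected."""
    if 1900 <= value <= 2100:
        return None
    return value


def _is_year_or_price(value: Any) -> bool:
    # Hypothetical evidence check: a "$"-prefixed string or a year-range integer.
    if isinstance(value, str):
        return value.strip().startswith("$")
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and value in YEAR_RANGE
    )


def count_with_context_guard(node: List[Any], index: int) -> Optional[int]:
    """Reject a year-like count only when a sibling in the same node
    also looks like a year or a price marker."""
    value = node[index]
    if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
        return None
    siblings = node[:index] + node[index + 1:]
    if value in YEAR_RANGE and any(_is_year_or_price(s) for s in siblings):
        return None
    return value
```

Under these assumptions, `[4.7, 2000]` keeps its count with the context guard, while the blanket guard drops it.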


michaelmwu wants to merge 1 commit into main from michaelmwu/improve-maps-scraping
