Add Claude and Gemini shared chat extraction by adventuremommy · Pull Request #169 · XortexAI/XMem

adventuremommy · 2026-05-11T05:49:21Z

Summary

add structured shared-chat extraction for Claude public links, including Next.js __NEXT_DATA__ payloads and existing preloaded state payloads
expand Gemini public-share DOM selectors and add structured-script fallback
report known private/missing provider pages as unavailable instead of falling through to generic extraction behavior
declare python-multipart, which FastAPI needs to import routes that use File(...)
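
For readers unfamiliar with the Next.js payload mentioned in the first item, the following is a minimal sketch of how such a payload could be read, not the PR's actual code. It assumes the page embeds its state in a `<script id="__NEXT_DATA__">` element containing plain JSON, which is the usual Next.js convention. The names `_NextDataParser` and `extract_next_data` are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the script element carrying the Next.js id is of interest.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str) -> Optional[dict[str, Any]]:
    """Return the parsed __NEXT_DATA__ object, or None if absent or invalid."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None
```

Where the conversation messages sit inside the returned object depends on the provider and is not specified in this PR description, so the sketch stops at the parsed dictionary.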

Fixes #155.

Verification

uv run --extra dev pytest tests/test_api_memory_scrape.py -q passes
uv run --extra dev ruff check src/api/routes/memory.py tests/test_api_memory_scrape.py pyproject.toml passes

Repository-wide checks currently have unrelated baseline failures:

uv run --extra dev pytest fails in tests/test_enterprise_chat.py::test_annotation_service_extracts_and_stores_project_annotations because EnterpriseAnnotationService.extract_and_store() returns (['ann_1'], '') while the test expects ['ann_1']; this also fails when run by itself.
uv run --extra dev ruff check . reports existing lint errors in files outside this PR, including server.py, src/api/dependencies.py, and scanner modules.
uv run --extra dev mypy src || true stops on an existing duplicate module name between src/prompts/judge.py and src/prompts/examples/judge.py.

gemini-code-assist

Code Review

This pull request refactors the chat extraction logic for ChatGPT, Claude, and Gemini by introducing modular helper functions for DOM and script-based parsing, and adds a check for unavailable conversations. It also includes a new test suite and adds the python-multipart dependency. Feedback was provided to address a fragile regex that could truncate JSON data and to optimize HTML normalization for better performance.

gemini-code-assist · 2026-05-11T05:50:57Z

+        match = re.search(
+            r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
+            script_text,
+            re.DOTALL,
+        )


The non-greedy regex (\{.*?\}); is fragile because it will terminate at the first occurrence of };. If the chat content contains code snippets or strings that include };, the extraction will result in truncated, invalid JSON. Using a greedy match to find the last closing brace is more robust for these state-carrying scripts.

Suggested change

-        match = re.search(
-            r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
-            script_text,
-            re.DOTALL,
-        )
+        match = re.search(
+            r"__PRELOADED_STATE__\s*=\s*(\{.*\})",
+            script_text,
+            re.DOTALL,
+        )
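
As a point of comparison, here is a sketch of a third option that does not rely on a regex to find the closing brace. It assumes the assigned value is strict JSON rather than a JavaScript object literal, which may not hold for every provider page. It uses the regex only to locate the assignment, then lets `json.JSONDecoder.raw_decode` parse exactly one JSON value from that offset. The function name `extract_preloaded_state` is hypothetical and not taken from the PR.

```python
import json
import re
from typing import Any, Optional

# Matches only the assignment prefix; the value itself is parsed by the JSON decoder.
_ASSIGNMENT = re.compile(r"__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(script_text: str) -> Optional[Any]:
    """Parse the JSON value assigned to __PRELOADED_STATE__, or return None."""
    match = _ASSIGNMENT.search(script_text)
    if match is None:
        return None
    try:
        # raw_decode reads one complete JSON value starting at the given index
        # and ignores whatever follows it in the script.
        value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return value
```

Because the decoder tracks string boundaries, a `};` inside a chat message does not cut the payload short, and a later statement that also ends in `}` is not swallowed into the capture.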

gemini-code-assist · 2026-05-11T05:50:57Z

+
+
+def _looks_unavailable(html: str) -> bool:
+    lowered = " ".join(html.lower().split())


Normalizing the entire HTML string using " ".join(html.lower().split()) is very inefficient for large documents, as it creates multiple large intermediate objects (a lowercased string, a list of all words, and a new joined string). Since the markers are simple phrases, searching directly in the lowercased HTML string is significantly more performant and sufficient for this check.

Suggested change

-    lowered = " ".join(html.lower().split())
+    lowered = html.lower()

ishaanxgupta · 2026-05-11T08:36:42Z

Hi @adventuremommy, thank you for the contribution. Please have a look at the Gemini suggestions.

Add Claude and Gemini shared chat extraction

c2b613b

adventuremommy requested review from ishaanxgupta and ved015 as code owners May 11, 2026 05:49

github-actions Bot added tests config api labels May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026

Address shared chat extraction review

a70de37

adventuremommy wants to merge 2 commits into XortexAI:main from adventuremommy:hunter-context-provider-links
