Add Claude and Gemini shared chat extraction#169

Open
adventuremommy wants to merge 2 commits into XortexAI:main from adventuremommy:hunter-context-provider-links

Conversation

@adventuremommy

Summary

  • add structured shared-chat extraction for Claude public links, including Next.js __NEXT_DATA__ payloads and existing preloaded state payloads
  • expand Gemini public-share DOM selectors and add structured-script fallback
  • report known private/missing provider pages as unavailable instead of falling through to generic extraction behavior
  • declare python-multipart, which FastAPI needs to import routes that use File(...)
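
As an illustration of the first bullet, here is a minimal sketch of reading a Next.js `__NEXT_DATA__` payload using only the standard library. It assumes the page embeds the payload in a `<script id="__NEXT_DATA__">` tag, which is the usual Next.js convention. The names `_NextDataParser` and `extract_next_data` are hypothetical and are not the helpers added in this PR.

```python
import json
from html.parser import HTMLParser


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only for the script tag carrying the Next.js payload.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._capturing = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed payload, or None if it is absent or not valid JSON."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```

Returning `None` on a missing or malformed payload would let a caller fall through to the next strategy, such as the preloaded state payload or the DOM selectors.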

Fixes #155.

Verification

  • uv run --extra dev pytest tests/test_api_memory_scrape.py -q passes
  • uv run --extra dev ruff check src/api/routes/memory.py tests/test_api_memory_scrape.py pyproject.toml passes

Repository-wide checks currently have unrelated baseline failures:

  • uv run --extra dev pytest fails in tests/test_enterprise_chat.py::test_annotation_service_extracts_and_stores_project_annotations because EnterpriseAnnotationService.extract_and_store() returns (['ann_1'], '') while the test expects ['ann_1']; this also fails when run by itself.
  • uv run --extra dev ruff check . reports existing lint errors in files outside this PR, including server.py, src/api/dependencies.py, and scanner modules.
  • uv run --extra dev mypy src || true stops on an existing duplicate module name between src/prompts/judge.py and src/prompts/examples/judge.py.

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the chat extraction logic for ChatGPT, Claude, and Gemini by introducing modular helper functions for DOM and script-based parsing, and adds a check for unavailable conversations. It also includes a new test suite and adds the python-multipart dependency. Feedback was provided to address a fragile regex that could truncate JSON data and to optimize HTML normalization for better performance.

Comment thread src/api/routes/memory.py
Comment on lines +218 to +222
match = re.search(
    r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
    script_text,
    re.DOTALL,
)

high

The non-greedy regex (\{.*?\}); is fragile because it will terminate at the first occurrence of };. If the chat content contains code snippets or strings that include };, the extraction will result in truncated, invalid JSON. Using a greedy match to find the last closing brace is more robust for these state-carrying scripts.

Suggested change
-match = re.search(
-    r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
-    script_text,
-    re.DOTALL,
-)
+match = re.search(
+    r"__PRELOADED_STATE__\s*=\s*(\{.*\})",
+    script_text,
+    re.DOTALL,
+)
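
A possible alternative to both patterns, sketched under the assumption that the value assigned to `__PRELOADED_STATE__` is strict JSON: find the assignment with a regex, then let `json.JSONDecoder.raw_decode` decide where the object ends. This avoids truncating at an inner `};`, and it also avoids over-capturing when the same script contains further braces after the state object, which a greedy `.*` would swallow. The helper name `extract_preloaded_state` is hypothetical and not taken from the PR.

```python
import json
import re

# Matches the assignment up to (but not including) the opening brace.
_STATE_ASSIGNMENT = re.compile(r"__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(script_text):
    """Return the decoded state object, or None if it cannot be parsed."""
    match = _STATE_ASSIGNMENT.search(script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it in the string.
        state, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return state
```

For example, a script such as `window.__PRELOADED_STATE__ = {"text": "f() {};"}; window.other = {"a": 1};` would decode to just the first object. If the payload is a JavaScript object literal rather than JSON (unquoted keys, `undefined`), this sketch returns `None` and a different strategy would be needed.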

Comment thread src/api/routes/memory.py Outdated


def _looks_unavailable(html: str) -> bool:
    lowered = " ".join(html.lower().split())

medium

Normalizing the entire HTML string using " ".join(html.lower().split()) is very inefficient for large documents, as it creates multiple large intermediate objects (a lowercased string, a list of all words, and a new joined string). Since the markers are simple phrases, searching directly in the lowercased HTML string is significantly more performant and sufficient for this check.

Suggested change
-    lowered = " ".join(html.lower().split())
+    lowered = html.lower()
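
One trade-off with the suggestion above is that a plain `html.lower()` no longer tolerates a marker phrase split across a line break or several spaces. A minimal sketch of a middle ground follows, assuming the markers are short fixed phrases: compile them once into a case-insensitive pattern that accepts any whitespace run between words, so the document is scanned without building normalized copies. The marker strings and the name `looks_unavailable` are hypothetical placeholders, not the values used in the PR.

```python
import re

# Hypothetical marker phrases; the real list lives in src/api/routes/memory.py.
_UNAVAILABLE_MARKERS = (
    "conversation not found",
    "this chat is private",
)

# Each phrase becomes word\s+word\s+word so that line breaks or repeated
# spaces inside the HTML do not prevent a match.
_UNAVAILABLE_PATTERN = re.compile(
    "|".join(
        r"\s+".join(re.escape(word) for word in marker.split())
        for marker in _UNAVAILABLE_MARKERS
    ),
    re.IGNORECASE,
)


def looks_unavailable(html):
    """Return True if any marker phrase appears in the raw HTML."""
    return _UNAVAILABLE_PATTERN.search(html) is not None
```

Whether this is faster than the original normalization in practice would need measuring on representative pages.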

@ishaanxgupta
Member

Hi @adventuremommy, thank you for the contribution. Please have a look at the Gemini suggestions.

Development

Successfully merging this pull request may close these issues.

add support of gemini, claude in /context route

2 participants