Add Claude and Gemini shared chat extraction #169
Code Review
This pull request refactors the chat extraction logic for ChatGPT, Claude, and Gemini by introducing modular helper functions for DOM and script-based parsing, and adds a check for unavailable conversations. It also includes a new test suite and adds the python-multipart dependency. Feedback was provided to address a fragile regex that could truncate JSON data and to optimize HTML normalization for better performance.
    match = re.search(
        r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
        script_text,
        re.DOTALL,
    )
The non-greedy regex (\{.*?\}); is fragile because it will terminate at the first occurrence of };. If the chat content contains code snippets or strings that include };, the extraction will result in truncated, invalid JSON. Using a greedy match to find the last closing brace is more robust for these state-carrying scripts.
    -match = re.search(
    -    r"__PRELOADED_STATE__\s*=\s*(\{.*?\});",
    -    script_text,
    -    re.DOTALL,
    -)
    +match = re.search(
    +    r"__PRELOADED_STATE__\s*=\s*(\{.*\})",
    +    script_text,
    +    re.DOTALL,
    +)
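One caveat on the suggestion above, offered as a sketch rather than as the PR's code: the greedy pattern runs to the last `}` in the script, so it only yields valid JSON when nothing brace-delimited follows the state object. The example below uses a made-up `script_text` and a hypothetical helper, `extract_preloaded_state`, to show both patterns failing on the same input, and how `json.JSONDecoder.raw_decode` can decode exactly one JSON value starting at the assignment.

```python
import json
import re

# Hypothetical script body: the chat text contains "};" and a second
# assignment follows the state object.
script_text = (
    'window.__PRELOADED_STATE__ = '
    '{"messages": [{"text": "function f() { return 1; };"}], "ok": true};\n'
    'window.__OTHER__ = {"later": 1};'
)

# The two patterns discussed in the review.
non_greedy = re.search(
    r"__PRELOADED_STATE__\s*=\s*(\{.*?\});", script_text, re.DOTALL
)
greedy = re.search(
    r"__PRELOADED_STATE__\s*=\s*(\{.*\})", script_text, re.DOTALL
)


def is_valid_json(text: str) -> bool:
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def extract_preloaded_state(text: str):
    """Decode one JSON value right after the assignment (assumed helper)."""
    marker = re.search(r"__PRELOADED_STATE__\s*=\s*", text)
    if marker is None:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so trailing JavaScript is ignored instead of being captured.
        state, _end = json.JSONDecoder().raw_decode(text, marker.end())
    except json.JSONDecodeError:
        return None
    return state


state = extract_preloaded_state(script_text)
```

Here the non-greedy capture stops inside the message string and the greedy capture swallows the second assignment, while the decoder-based helper returns the full state dictionary. If the state script is known to contain only the one assignment, the greedy pattern from the suggestion is enough.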
    def _looks_unavailable(html: str) -> bool:
        lowered = " ".join(html.lower().split())
Normalizing the entire HTML string using " ".join(html.lower().split()) is very inefficient for large documents, as it creates multiple large intermediate objects (a lowercased string, a list of all words, and a new joined string). Since the markers are simple phrases, searching directly in the lowercased HTML string is significantly more performant and sufficient for this check.
    -lowered = " ".join(html.lower().split())
    +lowered = html.lower()
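A point worth checking before applying this suggestion: the two forms agree only if a marker phrase never spans a line break or a run of spaces in the HTML. The sketch below compares them on a small input. The marker strings and both function names are hypothetical, since the PR's actual marker list is not shown in this thread.

```python
# Hypothetical marker phrases; the PR's real list is not shown here.
UNAVAILABLE_MARKERS = ("conversation not found", "this chat is unavailable")


def looks_unavailable_plain(html: str) -> bool:
    """Suggested form: lowercase only, no whitespace normalization."""
    lowered = html.lower()
    return any(marker in lowered for marker in UNAVAILABLE_MARKERS)


def looks_unavailable_normalized(html: str) -> bool:
    """Original form: lowercase and collapse whitespace runs to one space."""
    lowered = " ".join(html.lower().split())
    return any(marker in lowered for marker in UNAVAILABLE_MARKERS)


# The phrase is wrapped across a line break, as pretty-printed HTML often is.
wrapped_html = "<p>Conversation\n   not found</p>"

plain_result = looks_unavailable_plain(wrapped_html)
normalized_result = looks_unavailable_normalized(wrapped_html)
```

On this input the plain check misses the marker and the normalized check finds it. If the target pages always emit the phrases on a single line, the cheaper form is fine; otherwise the normalization is doing real work.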
Hi @adventuremommy, thank you for the contribution. Please have a look at the Gemini suggestions.
Summary
- `__NEXT_DATA__` payloads and existing preloaded state payloads
- `unavailable` instead of falling through to generic extraction behavior
- `python-multipart`, which FastAPI needs to import routes that use `File(...)`

Fixes #155.
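For readers unfamiliar with `__NEXT_DATA__`: Next.js pages commonly embed their initial state as JSON in a `<script id="__NEXT_DATA__">` tag. The following is a minimal sketch of reading such a payload with the standard library only. `NextDataParser`, `extract_next_data`, and the sample payload shape are assumptions for illustration, not the helpers added in this PR.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text of <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str):
    """Return the decoded payload, or None if it is missing or invalid."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Made-up page; real payloads differ by site.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "hello"}}}'
    "</script></body></html>"
)
payload = extract_next_data(sample_html)
```

Returning `None` on a missing or malformed payload lets a caller fall back to another extraction path, such as the preloaded state parsing mentioned in the summary.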
Verification
- `uv run --extra dev pytest tests/test_api_memory_scrape.py -q` passes
- `uv run --extra dev ruff check src/api/routes/memory.py tests/test_api_memory_scrape.py pyproject.toml` passes

Repository-wide checks currently have unrelated baseline failures:

- `uv run --extra dev pytest` fails in `tests/test_enterprise_chat.py::test_annotation_service_extracts_and_stores_project_annotations` because `EnterpriseAnnotationService.extract_and_store()` returns `(['ann_1'], '')` while the test expects `['ann_1']`; this also fails when run by itself.
- `uv run --extra dev ruff check .` reports existing lint errors in files outside this PR, including `server.py`, `src/api/dependencies.py`, and scanner modules.
- `uv run --extra dev mypy src || true` stops on an existing duplicate module name between `src/prompts/judge.py` and `src/prompts/examples/judge.py`.