feat: support Claude Code transcripts #168
Code Review
This pull request implements a parser for Claude Code JSONL transcripts, adding _content_to_text and _parse_claude_code_transcript functions to server.py and src/api/routes/memory.py, and includes a new test file. Feedback indicates that the parsing logic is duplicated and should be moved to a shared module to reduce maintenance overhead. Additionally, a performance optimization was suggested to include a heuristic check for JSON content before attempting to parse the transcript lines.
def _content_to_text(content: Any) -> str:
    """Extract readable text from Claude Code message content blocks."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        chunks: list[str] = []
        for item in content:
            if isinstance(item, str):
                chunks.append(item)
            elif isinstance(item, dict) and item.get("type") == "text":
                chunks.append(str(item.get("text", "")))
        return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip()
    return ""
The logic for _content_to_text and _parse_claude_code_transcript is duplicated between server.py and src/api/routes/memory.py. This increases maintenance overhead and the risk of inconsistencies as the parsing logic evolves. Consider moving these utilities to a shared module (e.g., src/utils/transcripts.py) that both files can import from.
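A minimal sketch of what the shared module could look like, assuming the path `src/utils/transcripts.py` suggested in the review. The public name `content_to_text` (without the leading underscore) is a hypothetical choice for illustration. The body mirrors the helper shown in the diff, and the import line in the trailing comment is an assumption about the package layout.

```python
# Hypothetical shared module, e.g. src/utils/transcripts.py.
from typing import Any, List


def content_to_text(content: Any) -> str:
    """Extract readable text from Claude Code message content blocks."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        chunks: List[str] = []
        for item in content:
            if isinstance(item, str):
                chunks.append(item)
            elif isinstance(item, dict) and item.get("type") == "text":
                # Only "text" blocks contribute; other block types are ignored.
                chunks.append(str(item.get("text", "")))
        return "\n".join(c.strip() for c in chunks if c.strip())
    return ""


# Both server.py and src/api/routes/memory.py could then share one definition
# (assumed import path, depends on how the package is laid out):
#   from src.utils.transcripts import content_to_text
```

With a single definition, a fix to block handling lands in both the standalone server and the memory route at once.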
    current_user_query: str | None = None
    assistant_chunks: list[str] = []

    for raw_line in text.splitlines():
The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts (e.g., standard markdown files that don't match Cursor or Antigravity formats). Since Claude Code transcripts are JSONL files, adding a quick heuristic check at the beginning of the function can avoid unnecessary processing.
Suggested change:
-    for raw_line in text.splitlines():
+    if not text.strip().startswith("{"):
+        return []
+    for raw_line in text.splitlines():
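The effect of the guard can be sketched in isolation. The function name `parse_jsonl_records` is hypothetical, and skipping malformed lines is an assumption, since the thread does not show how the actual parser handles them. Only the early return on input that does not start with `{` comes from the suggestion.

```python
import json
from typing import Any, Dict, List


def parse_jsonl_records(text: str) -> List[Dict[str, Any]]:
    """Return the JSON objects found in a JSONL string, or [] for non-JSONL input."""
    # Guard from the review: a JSONL file of objects starts with "{".
    if not text.strip().startswith("{"):
        return []
    records: List[Dict[str, Any]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: skip malformed lines instead of failing the whole file.
            continue
        if isinstance(record, dict):
            records.append(record)
    return records
```

A markdown transcript such as `"# Notes\n- item"` returns immediately without any `json.loads` calls, which is the saving the reviewer describes.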
    current_user_query: str | None = None
    assistant_chunks: List[str] = []

    for raw_line in text.splitlines():
The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts. Adding a quick heuristic check at the beginning of the function can avoid unnecessary processing for files that are clearly not in JSONL format.
Suggested change:
-    for raw_line in text.splitlines():
+    if not text.strip().startswith("{"):
+        return []
+    for raw_line in text.splitlines():
    current_user_query: str | None = None
    assistant_chunks: List[str] = []

    for raw_line in text.splitlines():
Good call. Since Claude Code transcripts are JSONL, the shared parser should first reject obvious non-JSONL input before iterating through every line. This should be fixed in the shared parser rather than separately in both files.
def _content_to_text(content: Any) -> str:
    """Extract readable text from Claude Code message content blocks."""
Agree. Since this parser is used by both the standalone server and the production memory route, please move the Claude transcript parsing into src/utils/transcripts.py and have both server.py and src/api/routes/memory.py import the shared parser from there.
Ankit-Kotnala left a comment
The feature is good, but @LoikStyle should centralize the parser and clean up the test before merge.
Summary
Test Plan
python3 -m pytest tests/test_claude_code_transcript.py -q -o addopts=''
python3 -m py_compile src/api/routes/memory.py server.py

Fixes #156