Skip to content

adding --split-si functionality to rxiv-maker#306

Merged
paxcalpt merged 2 commits into
mainfrom
export_docx_wo_si
Jun 9, 2026
Merged

adding --split-si functionality to rxiv-maker#306
paxcalpt merged 2 commits into
mainfrom
export_docx_wo_si

Conversation

@brunomsaraiva

Copy link
Copy Markdown
Contributor

This PR enables the --split-si and --split-si-pdf arguments to rxiv docx command:

  • split-si will generate _main.docx and _si.docx
  • split-si-pdf will generate _main.docx and _si.pdf

This should be helpful tor journal submissions

@brunomsaraiva brunomsaraiva requested a review from paxcalpt as a code owner June 8, 2026 10:00
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

test comment - please ignore

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Code Review: PR 306 - Good addition for journal submissions. The overall structure is clean and the core feature works logically. A few issues to address before merging.

OVERVIEW: Adds two new CLI flags to rxiv docx: --split-si (emits __main.docx + __si.docx via two separate DocxExporter runs) and --split-si-pdf (emits __main.docx + __si.pdf by building the full PDF then splitting it). DocxExporter gains content_mode ("full" / "main" / "si") and output_suffix parameters.

BUG 1 - Documentation placed under the wrong command (docs/cli-reference.md): The --split-si bullet was added to the rxiv pdf options block, not rxiv docx. The rxiv docx section added later in the same file already documents both flags correctly — the stray entry under rxiv pdf should be removed.

BUG 2 - str.replace(".docx", ...) replaces all occurrences (docx_exporter.py:93): docx_filename = docx_filename.replace(".docx", f"{self.output_suffix}.docx") — str.replace replaces every match. If the author name or title contains .docx the filename would be corrupted. Safer: stem = Path(docx_filename).stem; docx_filename = f"{stem}{self.output_suffix}.docx"

CROSS-REFERENCE CORRECTNESS: In SI-only mode (content_mode="si"), the main document is not loaded, so figure_map in export() is built only from SI content. Any @fig:label in the SI that points to a main-document figure will silently produce wrong numbers or survive as a raw marker. The existing hide_si flow handles the inverse case (main-only export loads SI for label mapping). The SI-only case needs symmetric treatment. This is a correctness gap worth at least a documented limitation in the help text.

CODE QUALITY 3 - Inline imports inside _build_and_export_si_pdf: Five imports are deferred inside the function body. Move them to module level to match the style used throughout the rest of the file.

CODE QUALITY 4 - Intermediate PDF files not cleaned up after --split-si-pdf: split_pdf() writes __main.pdf and __si.pdf to the build output directory. The function copies __si.pdf next to the manuscript but never removes the intermediates. The leftover __main.pdf in the build dir is a surprising side-effect.

CODE QUALITY 5 - Implicit hide_si mutation in init: self.hide_si = docx_config.get("hide_si", False); if self.content_mode == "main": self.hide_si = True # silently overrides config — This repurposes hide_si as a proxy for content_mode, making the logic harder to follow. Consider driving _load_markdown from content_mode directly without mutating hide_si.

TEST 6 - Fragile argument-order assertion (test_docx.py:75): mock_pdf.assert_called_once_with(str(manuscript_dir), False, False) — This pins the call to the default values of quiet and verbose. A future parameter addition breaks this test for the wrong reason. Prefer: mock_pdf.assert_called_once(); assert mock_pdf.call_args.args[0] == str(manuscript_dir)

TEST 7 - No integration test for cross-reference resolution in split modes: Existing tests verify which sections appear/disappear but not that @fig:/@sfig: markers resolve correctly when content is split. A test asserting no raw cross-reference markers survive in the output would add meaningful coverage.

SUMMARY: Feature logic is correct for the common case. --split-si was added to wrong section in cli-reference.md. str.replace on .docx is fragile. SI cross-refs to main figures are not handled — silent numbering errors possible. Build-dir leftovers after --split-si-pdf. Inline imports are a style issue. One fragile test assertion. Cross-ref test coverage is missing. The documentation placement bug and str.replace filename issue should be fixed before merging. The cross-reference gap is the most important correctness concern — needs either a fix or a documented limitation.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Code Review: PR #306

Good addition for journal submissions. The overall structure is clean and the core feature works logically. A few issues to address before merging.


Overview

Adds two new CLI flags to rxiv docx:

  • --split-si: emits __main.docx + __si.docx via two separate DocxExporter runs
  • --split-si-pdf: emits __main.docx + __si.pdf by building the full PDF then splitting it

DocxExporter gains content_mode ("full" / "main" / "si") and output_suffix parameters.


Bugs

1. Documentation placed under the wrong command (docs/cli-reference.md)

The --split-si bullet was added to the rxiv pdf options block, not rxiv docx. The rxiv docx section added later in the same file already documents both flags correctly — the stray entry under rxiv pdf should be removed.

2. str.replace(".docx", ...) replaces all occurrences (docx_exporter.py:93)

docx_filename = docx_filename.replace(".docx", f"{self.output_suffix}.docx")

str.replace replaces every match. If the author name or title contains .docx the filename would be corrupted. Safer pattern:

stem = Path(docx_filename).stem
docx_filename = f"{stem}{self.output_suffix}.docx"

Cross-Reference Correctness

In SI-only mode (content_mode="si"), the main document is not loaded, so figure_map in export() is built only from SI content. Any @fig:label in the SI that points to a main-document figure will silently produce wrong numbers or survive as a raw marker.

The existing hide_si flow handles the inverse case (main-only export loads SI for label mapping). The SI-only case needs symmetric treatment. This is a correctness gap worth at least a documented limitation in the help text.


Code Quality

3. Inline imports inside _build_and_export_si_pdf

Five imports are deferred inside the function body. Move them to module level to match the style used throughout the rest of the file.

4. Intermediate PDF files not cleaned up after --split-si-pdf

split_pdf() writes __main.pdf and __si.pdf to the build output directory. The function copies __si.pdf next to the manuscript but never removes the intermediates. The leftover __main.pdf in the build dir is a surprising side-effect.

5. Implicit hide_si mutation in __init__

self.hide_si = docx_config.get("hide_si", False)
if self.content_mode == "main":
    self.hide_si = True   # silently overrides config

This repurposes hide_si as a proxy for content_mode, making the logic harder to follow. Consider driving _load_markdown from content_mode directly without mutating hide_si.


Tests

6. Fragile argument-order assertion (test_docx.py:75)

mock_pdf.assert_called_once_with(str(manuscript_dir), False, False)

This pins the call to the default values of quiet and verbose. A future parameter addition breaks this test for the wrong reason. Prefer:

mock_pdf.assert_called_once()
assert mock_pdf.call_args.args[0] == str(manuscript_dir)

7. No integration test for cross-reference resolution in split modes

Existing tests verify which sections appear/disappear but not that @fig:/@sfig: markers resolve correctly when content is split. A test asserting no raw cross-reference markers survive in the output would add meaningful coverage.


Summary

Area Status
Feature logic Correct for the common case
Doc placement --split-si added to wrong section in cli-reference.md
Filename generation str.replace on .docx is fragile
SI cross-refs to main figures Not handled — silent numbering errors possible
Intermediate file cleanup Build-dir leftovers after --split-si-pdf
Inline imports Style issue
Test robustness One fragile assertion
Cross-ref test coverage Missing

The documentation placement bug and str.replace filename issue should be fixed before merging. The cross-reference gap is the most important correctness concern — needs either a fix or a documented limitation.

The two split-si unit tests patched the helper functions via the dotted
string "rxiv_maker.cli.commands.docx._export_docx_file". Because
rxiv_maker.cli.commands.__init__ re-exports the `docx` command, that name
shadows the `docx` submodule on the package, so unittest.mock resolves the
target to the RichCommand object instead of the module. This passed on
Python 3.11 but failed on 3.10 (a supported version per requires-python),
and `import ... as` is fooled the same way (IMPORT_FROM uses getattr).

Fetch the real module via importlib.import_module and use patch.object so
the helpers bind correctly on all supported Python versions. No production
code changes; integration tests already exercised the real export.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Code Review: PR #306--split-si / --split-si-pdf for rxiv docx

Good feature addition for journal submissions. The design is clean and the core logic is sound. A few issues need attention before merging.


Overview

Adds two new flags to rxiv docx:

  • --split-si: two DocxExporter runs — __main.docx + __si.docx
  • --split-si-pdf: one DocxExporter run + a full LaTeX build + PDF split — __main.docx + __si.pdf

DocxExporter gains content_mode ("full" / "main" / "si") and output_suffix parameters.


Bugs

1. --split-si documented under the wrong command (docs/cli-reference.md line 92)

-  - `--split-si` - Split the final PDF into `__main.pdf` and `__si.pdf`

This entry was added inside the rxiv pdf options block, not rxiv docx. The flag does not exist on the PDF command. The rxiv docx section added later in the same file documents both flags correctly — remove this stray entry.

2. str.replace(".docx", ...) is not suffix-safe (docx_exporter.py:93)

docx_filename = docx_filename.replace(".docx", f"{self.output_suffix}.docx")

str.replace replaces every match. If the auto-generated filename ever contains .docx elsewhere (author name, title slug), the result is corrupted. Use Path.stem instead:

stem = Path(docx_filename).stem
docx_filename = f"{stem}{self.output_suffix}.docx"

Correctness Concern

3. SI cross-references to main figures are unresolved in content_mode="si"

When content_mode="si", _load_markdown skips the main document entirely, so figure_map in export() is built only from SI content. Any @fig:label in the SI that references a main-document figure will either silently produce wrong numbers or survive as a raw marker in the output.

The existing hide_si flow already handles the inverse case (main-only export still loads SI for label resolution). The SI-only path needs symmetric treatment, or at minimum a clearly documented limitation in the --split-si help text.


Code Quality

4. Deferred imports inside _build_and_export_si_pdf (docx.py:105–110)

Five module-level imports are deferred inside the function body. This is inconsistent with the rest of the file. Unless there's a circular-import reason (if so, add a comment), move them to the top-level imports.

5. Intermediate build files not cleaned up after --split-si-pdf

split_pdf() writes __main.pdf and __si.pdf into the build output directory. _build_and_export_si_pdf copies __si.pdf to the manuscript directory but leaves __main.pdf behind. This is a surprising side-effect — consider removing the intermediates or at least noting it in the docstring.

6. hide_si repurposed as a content_mode proxy (docx_exporter.py:62–63)

self.hide_si = docx_config.get("hide_si", False)
if self.content_mode == "main":
    self.hide_si = True   # silently overrides user config

This conflates two distinct concepts. _load_markdown already branches on content_mode — drive it from content_mode directly rather than mutating hide_si, which makes the intent clear and avoids surprising config overrides.

7. No error fallback when YAML metadata extraction fails in _build_and_export_si_pdf

_get_output_path has a try/except fallback for metadata extraction. _build_and_export_si_pdf calls extract_yaml_metadata + get_custom_pdf_filename without any fallback — an unexpected metadata error will surface as an unformatted traceback instead of a clean CLI error.


Tests

8. Fragile positional argument assertion (test_docx.py:75)

mock_pdf.assert_called_once_with(str(manuscript_dir), False, False)

This pins the call against the specific default values and order of quiet/verbose. A signature change breaks this test for the wrong reason. Prefer:

mock_pdf.assert_called_once()
assert mock_pdf.call_args.args[0] == str(manuscript_dir)

9. No test for the default (full) mode through the refactored CLI path

The existing CLI tests only cover --split-si and --split-si-pdf. A test for the unmodified rxiv docx path (no flags) would confirm the refactoring didn't regress the default behaviour.


Summary

Area Status
Feature logic ✅ Correct for the common case
--split-si in wrong doc section ❌ Must fix
str.replace(".docx") fragile filename ❌ Must fix
SI cross-refs to main figures ⚠️ Silent numbering errors — fix or document
Intermediate PDF cleanup ⚠️ Surprising side-effect
Deferred imports ⚠️ Style inconsistency
hide_si mutation ⚠️ Logic clarity
Fragile test assertion ⚠️ Minor
Missing default-mode CLI test ⚠️ Minor

The two documentation/filename bugs should be fixed before merging. The cross-reference gap is the most significant correctness concern and needs either a fix or an explicit note in the help text.

@paxcalpt paxcalpt merged commit 15f835a into main Jun 9, 2026
9 of 10 checks passed
paxcalpt added a commit that referenced this pull request Jun 9, 2026
PR #306 added `rxiv docx --split-si` / `--split-si-pdf` but shipped without a
changelog entry. Document them under the 1.20.3 release so the feature is not
released undocumented.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants