fix: decode HTML entities before slugifying header IDs #711

Open
gaoflow wants to merge 1 commit into trentm:master from gaoflow:fix/header-ids-html-entity-decode

gaoflow commented Jun 25, 2026

Summary

Fixes #649.

When a Markdown header contains HTML entities such as &lt;, &amp;, or &gt;, the header_id_from_text method passes the still-encoded text to _slugify without decoding the entities first.

_slugify strips the & and ; characters as non-word characters, but the letters of the entity name (e.g. lt) survive and pollute the generated ID:

import markdown2
result = markdown2.markdown("# <othertext", extras={"header-ids": None})
# Actual:   <h1 id="ltothertext">&lt;othertext</h1>
# Expected: <h1 id="othertext">&lt;othertext</h1>

Fix

Call html.unescape() on the header text before slugifying, so entity characters are decoded to their actual Unicode code points first. The resulting characters are then stripped or kept by _slugify as any literal character would be:

# lib/markdown2.py — header_id_from_text()
- header_id = _slugify(text)
+ header_id = _slugify(html.unescape(text))

This is a one-line fix plus a new test case (test/tm-cases/header_ids_entity.*).
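
The effect of the change can be sketched outside the library. The `slugify` function below is a hypothetical stand-in that approximates the strip-then-hyphenate behaviour described above; it is not copied from markdown2, and the real `_slugify` may differ in detail. `header_id_before` and `header_id_after` are likewise illustrative names, not functions in the code base.

```python
import html
import re
import unicodedata

# Hypothetical approximation of a slugify helper (an assumption, not markdown2's code):
# drop every character that is not a word character, whitespace, or a hyphen,
# then collapse runs of whitespace and hyphens into a single hyphen.
_STRIP_RE = re.compile(r"[^\w\s-]")
_HYPHENATE_RE = re.compile(r"[-\s]+")


def slugify(value: str) -> str:
    # Fold to ASCII so accented letters lose their diacritics.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    value = _STRIP_RE.sub("", value).strip().lower()
    return _HYPHENATE_RE.sub("-", value)


def header_id_before(text: str) -> str:
    # Old behaviour: the entity is still encoded, so "&" and ";" are stripped
    # while the entity name letters ("lt", "amp") are kept.
    return slugify(text)


def header_id_after(text: str) -> str:
    # New behaviour: decode entities first, so "<" or "&" is stripped as a whole.
    return slugify(html.unescape(text))


for header in ("&lt;othertext", "R&amp;D"):
    print(header, header_id_before(header), header_id_after(header))
```

Decoding before slugifying means the slug logic only ever sees real characters, so no entity-specific handling is needed inside the slug function itself.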

Tested

  • # &lt;othertext → id="othertext" ✓ (was id="ltothertext")
  • # R&amp;D → id="rd" ✓ (was id="rampd")
  • # <sometext> → id="sometext" ✓ (unchanged, still correct)
  • All existing header_ids_* test cases pass.
  • No new test failures introduced (12 pre-existing failures are unchanged).

This pull request was prepared with the assistance of AI, under my direction and review.

When a markdown header contains HTML entities (e.g. `# &lt;othertext`),
`header_id_from_text` passed the raw entity string to `_slugify` before
HTML-decoding it.  The `&` and `;` were stripped as non-word characters
but the entity name letters (e.g. `lt`) were kept, silently corrupting
the generated ID (`ltothertext` instead of `othertext`).

Fix: call `html.unescape()` on the header text before slugifying so that
entity characters are resolved to their actual Unicode code points first,
then stripped (or kept) by the slug logic as any other character would be.

Closes trentm#649

Development

Successfully merging this pull request may close these issues.

header-ids contain HTML entities

1 participant