fix: decode HTML entities before slugifying header IDs by gaoflow · Pull Request #711 · trentm/python-markdown2

gaoflow · 2026-06-25T08:14:58Z

Summary

Fixes #649.

When a markdown header contains HTML entities such as &lt;, &amp;, or &gt;, the header_id_from_text method passes the raw entity string to _slugify without HTML-decoding it first.

The & and ; characters are stripped by _slugify as non-word characters, but the entity name letters (e.g. lt) survive and pollute the generated ID:

import markdown2
result = markdown2.markdown("# &lt;othertext", extras={"header-ids": None})
# Actual:   <h1 id="ltothertext">&lt;othertext</h1>
# Expected: <h1 id="othertext">&lt;othertext</h1>

Fix

Call html.unescape() on the header text before slugifying, so entity characters are decoded to their actual Unicode code points first. The resulting characters are then stripped or kept by _slugify as any literal character would be:

# lib/markdown2.py — header_id_from_text()
- header_id = _slugify(text)
+ header_id = _slugify(html.unescape(text))

This is a one-line fix plus a new test case (test/tm-cases/header_ids_entity.*).
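The effect of decoding first can be reproduced outside markdown2 with a small stand-in. The sketch below is illustrative only: `simple_slugify` is a hypothetical, simplified slug function (strip anything that is not a word character, whitespace, or hyphen, then lowercase and hyphenate) and is not the library's actual `_slugify`. Only `html.unescape()` is taken from the change itself.

```python
import html
import re


def simple_slugify(value: str) -> str:
    """Hypothetical stand-in for a slug function; NOT markdown2's _slugify."""
    # Drop everything except word characters, whitespace, and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace/hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


def header_id_raw(text: str) -> str:
    # Old behavior: the entity's '&' and ';' are stripped, its name survives.
    return simple_slugify(text)


def header_id_decoded(text: str) -> str:
    # New behavior: decode entities to real characters, then slugify.
    return simple_slugify(html.unescape(text))


if __name__ == "__main__":
    for header in ("&lt;othertext", "R&amp;D"):
        print(header, header_id_raw(header), header_id_decoded(header))
```

With this stand-in, `&lt;othertext` slugifies to `ltothertext` when passed raw and to `othertext` once decoded, matching the before/after IDs reported in this PR.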

Tested

# &lt;othertext → id="othertext" ✓ (was id="ltothertext")
# R&amp;D → id="rd" ✓ (was id="rampd")
# <sometext> → id="sometext" ✓ (unchanged, still correct)
All existing header_ids_* test cases pass.
No new test failures introduced (12 pre-existing failures are unchanged).

This pull request was prepared with the assistance of AI, under my direction and review.

When a markdown header contains HTML entities (e.g. `# &lt;othertext`), `header_id_from_text` passed the raw entity string to `_slugify` without HTML-decoding it first. The `&` and `;` were stripped as non-word characters but the entity name letters (e.g. `lt`) were kept, silently corrupting the generated ID (`ltothertext` instead of `othertext`).

Fix: call `html.unescape()` on the header text before slugifying so that entities are resolved to their actual Unicode code points first, then stripped (or kept) by the slug logic as any other character would be.

Closes trentm#649

gaoflow wants to merge 1 commit into trentm:master from gaoflow:fix/header-ids-html-entity-decode
