[Core] Possible issue: ZHA reload leaks entity states, causing "Platform zha does not generate unique IDs" rejections of new entities #77

@zigpy-review-bot

⚠️ This is an investigation, not a reproduction I have confirmed myself. Multiple independent users report the symptom on 2026.5.x. The analysis below comes from reading the affected user's debug log and the HA Core / zha source. Filing here at @TheJulianJES's request: he authored the most likely culprit PR (#169341) and is best placed to confirm or reject.

cc @TheJulianJES

Source report

home-assistant/core#170920 (exhannibal, opened 2026-05-16, 3 distinct users confirming):

"ZHA reload leaves orphaned entity registry entries: devices appear available but commands fail (unique_id already exists)"

Setup: Nabu Casa SkyConnect v1.0 / bellows / Python 3.14.2 / HAOS aarch64 / HA 2026.5.2 / zha==1.3.1.

Symptom: after a ZHA config-entry reload, a non-deterministic subset of devices ends up in a half-up state. The HA log contains, for the affected entities, messages like:

Platform zha does not generate unique IDs.
ID 84:2e:14:ff:fe:ba:5f:52-1 is already used by light.coffee_light - ignoring light.coffee_light

The UI still shows the registry-persisted entity, but no live ZHA platform entity is bound to it. Running the ZHA "Reconfigure" device action restores control until the next reload.

Repro:

  1. Run ZHA with multiple devices.
  2. Toggle disabled_by on any ZHA entity in Settings → Entities (this triggers a config-entry reload).
  3. After ZHA finishes startup, control fails for a subset of devices (different subset each reload).

Not the same as home-assistant/core#130548 (radio-side) — the coordinator is fine and other devices on the same network work normally.

Where the rejection fires

homeassistant/helpers/entity_platform.py:898-925:

if entity.unique_id is not None:
    registered_entity_id = entity_registry.async_get_entity_id(
        self.domain, self.platform_name, entity.unique_id
    )
    if registered_entity_id:
        already_exists, _ = self._entity_id_already_exists(registered_entity_id)
        if already_exists:
            entity.registry_entry = None
            msg = (
                f"Platform {self.platform_name} does not generate unique IDs. "
            )
            ...
            self.logger.error(msg)
            entity.add_to_platform_abort()
            return

_entity_id_already_exists (entity_platform.py:811-825):

already_exists = entity_id in self.entities
restored = False
if not already_exists and not self.hass.states.async_available(entity_id):
    existing = self.hass.states.get(entity_id)
    if existing is not None and ATTR_RESTORED in existing.attributes:
        restored = True
    else:
        already_exists = True
return (already_exists, restored)

After a fresh reload, entity_id in self.entities is False (the platform instance is new, so its entities dict is empty). That leaves only the else branch inside the second check, call it Path B: hass.states holds a state for the entity_id and that state is not marked ATTR_RESTORED. In other words, the state survived the unload phase even though the entity did not.
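To make the branch structure concrete, here is a minimal standalone sketch that mirrors the quoted helper using plain Python containers. It is not the HA implementation: hass.states is modelled as a dict of attribute dicts, "not async_available" is simplified to "a state object exists", and ATTR_RESTORED is redefined locally on the assumption that its value is "restored".

```python
# Assumed value of Home Assistant's ATTR_RESTORED; redefined so the sketch is standalone.
ATTR_RESTORED = "restored"


def entity_id_already_exists(entity_id, platform_entities, states):
    """Mirror the quoted branch structure with plain containers.

    platform_entities: entity_ids owned by the (freshly created) platform.
    states: entity_id -> attributes dict, a stand-in for hass.states.
    """
    already_exists = entity_id in platform_entities
    restored = False
    # Simplification: "not async_available" is modelled as "a state exists".
    if not already_exists and entity_id in states:
        if ATTR_RESTORED in states[entity_id]:
            restored = True        # registry placeholder state, safe to take over
        else:
            already_exists = True  # a live-looking state lingers: rejection path
    return already_exists, restored


# Fresh platform, no state at all: the entity is accepted.
clean = entity_id_already_exists("light.coffee_light", set(), {})

# Fresh platform, restored placeholder state: the entity is accepted.
placeholder = entity_id_already_exists(
    "light.coffee_light", set(), {"light.coffee_light": {ATTR_RESTORED: True}}
)

# Fresh platform, leaked state without the restored marker: the entity is rejected.
leaked = entity_id_already_exists(
    "light.coffee_light", set(), {"light.coffee_light": {"friendly_name": "Coffee"}}
)
```

Only the third case returns already_exists=True, which is the condition that leads to the "does not generate unique IDs" error in the caller.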

Evidence in the user's debug log

The user attached a full debug log of one failing reload. The setup-side errors look like:

button.toaster_identify            ← original (no suffix)
button.microwave_light_identify    ← original
button.toaster_light_identify_2    ← _2 suffix
button.kettle_light_identify_3     ← _3 suffix
button.coffee_light_identify_4     ← _4 suffix

That _N is HA's automatic entity-id collision avoidance during generation: when button.coffee_light_identify is already taken in the state machine, the new entity is suggested as _2; if that's taken too, _3; etc. _4 proves three previous orphans for that single entity accumulated across past reloads, all still in hass.states. So this isn't a single one-off leak — every reload adds another orphan for the affected devices.

(For unaffected devices the entity_ids come out clean. The bug is non-deterministic in which devices get hit, not which entity_ids leak.)
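The counting argument behind the _N suffixes can be illustrated with a small sketch. suggest_entity_id is a hypothetical helper written for this issue, not the HA function; it only models "append _2, _3, ... until the id is free", and the loop simulates reloads where the previous state is never removed.

```python
def suggest_entity_id(preferred, taken):
    """Return preferred if free, else preferred_2, preferred_3, ... (hypothetical model)."""
    candidate = preferred
    tries = 1
    while candidate in taken:
        tries += 1
        candidate = f"{preferred}_{tries}"
    return candidate


leaked_states = set()  # stand-in for entity_ids still present in hass.states
history = []
for _ in range(4):     # the first setup plus three reloads
    new_id = suggest_entity_id("button.coffee_light_identify", leaked_states)
    history.append(new_id)
    leaked_states.add(new_id)  # simulate the leak: the state is never cleaned up
```

Under this model, reaching the _4 suffix requires three earlier ids for the same entity to still be occupied, which is the inference drawn from the log.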

My current best theory (NOT confirmed)

The new line added in #169341 to homeassistant/components/zha/entity.py::ZHAEntity.async_will_remove_from_hass:

async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)  # ← added by #169341
    await super().async_will_remove_from_hass()                                 # ← clears hass.states
    self.remove_future.set_result(True)

self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self) is a four-deep attribute chain. If any link is None or unreachable when this entity unloads, the call raises AttributeError and the rest of the method does not run. That skipped remainder includes super().async_will_remove_from_hass(), which is responsible for hass.states.async_remove(entity_id). One way this could happen: the gateway proxy has already been disposed by the time this entity reaches async_will_remove_from_hass, depending on how the platform unload and gateway.shutdown() interleave.

remove_entity_reference itself was also rewritten in the same PR:

def remove_entity_reference(self, entity: ZHAEntity) -> None:
    ieee = entity.entity_data.device_proxy.device.ieee   # another four-deep chain
    if (entity_refs := self._ha_entity_refs.get(ieee)) is None:
        return
    ...

So either dereference chain could raise.
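The failure mode in this theory can be shown with stand-in classes. Everything below is hypothetical scaffolding (FakeBase, FakeEntity, FakeGateway, FakeDeviceProxy are not HA or zha classes); the only point is that an AttributeError raised before the super() call leaves the later cleanup undone.

```python
import asyncio


class FakeBase:
    def __init__(self):
        self.state_cleared = False

    async def async_will_remove_from_hass(self):
        # Stand-in for the cleanup the issue attributes to the base class.
        self.state_cleared = True


class FakeGateway:
    def remove_entity_reference(self, entity):
        pass  # bookkeeping only; irrelevant for this sketch


class FakeDeviceProxy:
    def __init__(self, gateway_proxy):
        self.gateway_proxy = gateway_proxy


class FakeEntity(FakeBase):
    def __init__(self, device_proxy):
        super().__init__()
        self.device_proxy = device_proxy

    async def async_will_remove_from_hass(self):
        # Same ordering as the suspect method: dereference first, cleanup second.
        self.device_proxy.gateway_proxy.remove_entity_reference(self)
        await super().async_will_remove_from_hass()


async def unload(entity):
    """Return True if the removal hook completed, False if it raised AttributeError."""
    try:
        await entity.async_will_remove_from_hass()
    except AttributeError:
        return False
    return True


healthy = FakeEntity(FakeDeviceProxy(FakeGateway()))
torn_down = FakeEntity(FakeDeviceProxy(None))  # gateway proxy already gone

healthy_ok = asyncio.run(unload(healthy))
torn_down_ok = asyncio.run(unload(torn_down))
```

The healthy entity ends with state_cleared set; the torn-down one does not, because the exception fires before the base-class call is reached.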

The non-deterministic "different devices each reload" pattern fits a race in which device-proxy teardown and platform unload interleave differently each time: entities that unload before their proxy is torn down succeed, and entities that unload afterwards fail at the new dereference.

Why I don't see the AttributeError in the log

The captured log doesn't contain any AttributeError traceback during unload. That's a gap in my theory. Possibilities:

  1. HA Core's entity_platform.async_unload_entry (or Entity.async_remove) wraps the call in try/except and logs at a level filtered out, or
  2. The exception is swallowed by add_to_platform_finish-like cleanup, or
  3. The actual leak mechanism is different and I'm chasing the wrong line.

(3) is possible. The fact that state survives unload is the load-bearing observation; the why is my best guess.

What does work: "Reconfigure"

The Reconfigure device action triggers re-interview, which goes through DeviceEntityRemovedEvent(remove=False) → SIGNAL_REMOVE_ENTITY_{platform}_{unique_id} → self.async_remove. This is a different removal path (the new soft-remove flow from #169341), and it works. So the new soft-remove is fine; the suspect is the change in async_will_remove_from_hass.

What does not trigger this

gateway.shutdown() correctly calls device.on_remove() which calls _async_teardown(emit_entity_events=False) (zha/zigbee/device.py:1226-1276) — so the new SIGNAL_REMOVE_ENTITY dispatcher signal does not fire spuriously during reload. The bug isn't in the new signal path; it's in the modified ZHAEntity.async_will_remove_from_hass body.

Suggested fix (sketch — flag if you'd prefer a different shape)

Two options that should both restore correctness:

Option A — swap order so state-clear always runs:

async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    await super().async_will_remove_from_hass()
    try:
        self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)
    except AttributeError:
        # Gateway proxy may already be torn down; bookkeeping is moot at this point
        pass
    self.remove_future.set_result(True)

Option B — keep order, just guard:

async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    if (proxy := getattr(self.entity_data, "device_proxy", None)) is not None:
        if (gw := getattr(proxy, "gateway_proxy", None)) is not None:
            gw.remove_entity_reference(self)
    await super().async_will_remove_from_hass()
    self.remove_future.set_result(True)

Either way, the existing test added in #169341 (test_dynamic_entities.py) shouldn't break, and a new test should cover "ZHAEntity removes its state from hass.states even when gateway_proxy is None at unload time".
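As a rough shape for that new test, here is a standalone sketch of Option A exercised against stand-ins. It does not use the real ZHA test fixtures: BaseEntity and OptionAEntity are hypothetical, hass.states is modelled as a set of entity_ids, and the base class is assumed to be the place where the state is dropped, as in the theory above.

```python
import asyncio
from types import SimpleNamespace


class BaseEntity:
    def __init__(self, states, entity_id):
        self.states = states  # stand-in for hass.states (a set of entity_ids)
        self.entity_id = entity_id

    async def async_will_remove_from_hass(self):
        # Assumption for this sketch: state removal happens here.
        self.states.discard(self.entity_id)


class OptionAEntity(BaseEntity):
    def __init__(self, states, entity_id, entity_data):
        super().__init__(states, entity_id)
        self.entity_data = entity_data

    async def async_will_remove_from_hass(self):
        # Option A ordering: clear state first, then best-effort bookkeeping.
        await super().async_will_remove_from_hass()
        try:
            self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)
        except AttributeError:
            pass  # gateway proxy already torn down; nothing left to update


def run_unload_with_missing_gateway():
    states = {"light.coffee_light"}
    entity_data = SimpleNamespace(
        device_proxy=SimpleNamespace(gateway_proxy=None)  # simulate late unload
    )
    entity = OptionAEntity(states, "light.coffee_light", entity_data)
    asyncio.run(entity.async_will_remove_from_hass())
    return states


remaining_states = run_unload_with_missing_gateway()
```

A real test would assert the same property through the actual config-entry unload path rather than by calling the hook directly.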

Things I have not done

  • Reproduced this locally. I read the user's debug log + the code.
  • Confirmed the AttributeError fires by adding instrumentation. The theory is consistent with the symptom but not directly observed.
  • Looked at homeassistant_hardware interactions or the older ha_entity_refs/_ha_entity_refs rename for unrelated issues.

Related work I have already filed

These three plus this one together explain a chunk of the "ZHA broken after 2026.5.x" reports across home-assistant/core#130548, #168432, #170920, and #172247. Different code paths, different tracebacks — worth keeping them separated in triage.

Happy to provide additional log slices, draft the fix PR, or close as misdiagnosed if you confirm a different cause.
