[Core] Possible issue: ZHA reload leaks entity states, causing "Platform zha does not generate unique IDs" rejections of new entities

> ⚠️ This is an **investigation, not a confirmed reproduction by me**. Multiple independent users report the symptom on 2026.5.x. The analysis below comes from reading the affected user's debug log and the HA Core / `zha` source. Filing here at @TheJulianJES's request: he authored the most likely culprit PR (#169341) and is best placed to confirm or reject it.

cc @TheJulianJES

## Source report

home-assistant/core#170920 (`exhannibal`, opened 2026-05-16, 3 distinct users confirming):

> "ZHA reload leaves orphaned entity registry entries: devices appear available but commands fail (unique_id already exists)"

Setup: Nabu Casa SkyConnect v1.0 / bellows / Python 3.14.2 / HAOS aarch64 / HA 2026.5.2 / `zha==1.3.1`.

Symptom: after a ZHA config-entry reload, a non-deterministic subset of devices ends up in a half-up state. The HA log contains, for the affected entities, messages like:

```
Platform zha does not generate unique IDs.
ID 84:2e:14:ff:fe:ba:5f:52-1 is already used by light.coffee_light - ignoring light.coffee_light
```

The UI still shows the registry-persisted entity, but no live ZHA platform entity is bound to it. Running the ZHA "Reconfigure" device action restores control until the next reload.

Repro:
1. Run ZHA with multiple devices.
2. Toggle `disabled_by` on any ZHA entity in Settings → Entities (this triggers a config-entry reload).
3. After ZHA finishes startup, control fails for a subset of devices (different subset each reload).

Not the same as home-assistant/core#130548 (radio-side) — the coordinator is fine and other devices on the same network work normally.

## Where the rejection fires

`homeassistant/helpers/entity_platform.py:898-925`:

```python
if entity.unique_id is not None:
    registered_entity_id = entity_registry.async_get_entity_id(
        self.domain, self.platform_name, entity.unique_id
    )
    if registered_entity_id:
        already_exists, _ = self._entity_id_already_exists(registered_entity_id)
        if already_exists:
            entity.registry_entry = None
            msg = (
                f"Platform {self.platform_name} does not generate unique IDs. "
            )
            ...
            self.logger.error(msg)
            entity.add_to_platform_abort()
            return
```

`_entity_id_already_exists` (`entity_platform.py:811-825`):

```python
already_exists = entity_id in self.entities
restored = False
if not already_exists and not self.hass.states.async_available(entity_id):
    existing = self.hass.states.get(entity_id)
    if existing is not None and ATTR_RESTORED in existing.attributes:
        restored = True
    else:
        already_exists = True
return (already_exists, restored)
```

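After a fresh reload, `entity_id in self.entities` is False (new platform instance, empty `entities` dict). So we're hitting the second branch, which I'll call **Path B**: `hass.states.async_available(entity_id)` returns False because there is a *state* in `hass.states` for the entity, and that state isn't marked `ATTR_RESTORED`, so `already_exists` is set to True. The state survived the unload phase even though the entity didn't.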
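
To make the two outcomes concrete, here is a minimal, self-contained model of the check quoted above. It is a sketch, not HA code: `states` is a plain dict standing in for `hass.states`, the `ATTR_RESTORED` value is an assumed stand-in for the real constant, and entity-id reservations (which the real `async_available` also considers) are ignored.

```python
# Stand-in for homeassistant.const.ATTR_RESTORED (assumed value, illustration only).
ATTR_RESTORED = "restored"


def entity_id_already_exists(
    entity_id: str,
    platform_entities: dict,
    states: dict,
) -> tuple:
    """Simplified model of `_entity_id_already_exists`.

    `states` maps entity_id -> state attributes and stands in for hass.states.
    """
    already_exists = entity_id in platform_entities
    restored = False
    available = entity_id not in states  # stand-in for hass.states.async_available
    if not already_exists and not available:
        existing = states.get(entity_id)
        if existing is not None and ATTR_RESTORED in existing:
            restored = True
        else:
            already_exists = True
    return already_exists, restored


fresh_platform: dict = {}  # new platform instance after reload, no entities yet

# A state that carries the restored marker: the new entity may take it over.
clean = entity_id_already_exists(
    "light.coffee_light", fresh_platform, {"light.coffee_light": {ATTR_RESTORED: True}}
)
# A live state without the marker that survived the unload: the Path B rejection.
leaked = entity_id_already_exists(
    "light.coffee_light", fresh_platform, {"light.coffee_light": {"brightness": 255}}
)
print(clean, leaked)
```

In this model only the second call reports `already_exists`, which is the condition that leads to the "does not generate unique IDs" error.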

## Evidence in the user's debug log

The user attached a full debug log of one failing reload. The entity ids named in the setup-side errors look like:

```
button.toaster_identify            ← original (no suffix)
button.microwave_light_identify    ← original
button.toaster_light_identify_2    ← _2 suffix
button.kettle_light_identify_3     ← _3 suffix
button.coffee_light_identify_4     ← _4 suffix
```

That `_N` suffix is HA's automatic entity-id collision avoidance during generation: when `button.coffee_light_identify` is already taken in the state machine, the new entity is suggested as `_2`; if that's taken too, `_3`; and so on. **`_4` implies three previous orphans for that single entity** accumulated across past reloads, all still in `hass.states`. So this isn't a one-off leak: every reload adds another orphan for the affected devices.

(For unaffected devices the entity_ids come out clean. The bug is non-deterministic in which devices get hit, not which entity_ids leak.)
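
The suffix arithmetic can be sketched as follows. `suggest_entity_id` is a hypothetical helper written for this issue, not HA's actual implementation; it only models the "append `_2`, `_3`, ... until free" behaviour described above, with each reload leaving its id behind.

```python
def suggest_entity_id(preferred: str, taken: set) -> str:
    """Hypothetical helper: append _2, _3, ... until the id is not taken."""
    candidate = preferred
    tries = 1
    while candidate in taken:
        tries += 1
        candidate = f"{preferred}_{tries}"
    return candidate


# The original id is already orphaned in the state machine.
taken = {"button.coffee_light_identify"}
history = []
for _ in range(3):  # three further reloads
    new_id = suggest_entity_id("button.coffee_light_identify", taken)
    history.append(new_id)
    taken.add(new_id)  # model this id also being leaked on the next reload

print(history)
```

Under that assumption, reaching `_4` requires the unsuffixed id plus `_2` and `_3` to be occupied already.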

## My current best theory (NOT confirmed)

The new line added in #169341 to `homeassistant/components/zha/entity.py::ZHAEntity.async_will_remove_from_hass`:

```python
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)  # ← added by #169341
    await super().async_will_remove_from_hass()                                 # ← clears hass.states
    self.remove_future.set_result(True)
```

`self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)` is a four-deep attribute chain. If any link is `None` or unreachable at the moment this entity unloads, the call raises `AttributeError` and the rest of the method does not run. That includes `super().async_will_remove_from_hass()`, which is responsible for `hass.states.async_remove(entity_id)`. One way this could happen: the gateway proxy has already been disposed by the time this entity reaches its `async_will_remove_from_hass`, depending on the order in which the platform unload and `gateway.shutdown()` interleave.

`remove_entity_reference` itself was also rewritten in the same PR:

```python
def remove_entity_reference(self, entity: ZHAEntity) -> None:
    ieee = entity.entity_data.device_proxy.device.ieee   # another four-deep chain
    if (entity_refs := self._ha_entity_refs.get(ieee)) is None:
        return
    ...
```

So either dereference chain could raise.
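
A minimal stand-alone sketch of the suspected failure mode, using stand-in classes rather than the real `ZHAEntity`. It takes the theory's premise as given, namely that the base-class call is what clears the state; `BaseEntity`, `DeviceProxy`, and `LeakyEntity` are hypothetical names.

```python
from __future__ import annotations

import asyncio

states: dict[str, str] = {"light.coffee_light": "on"}  # stand-in for hass.states


class BaseEntity:
    """Stand-in base class, assumed (per the theory) to clear the state."""

    def __init__(self, entity_id: str) -> None:
        self.entity_id = entity_id

    async def async_will_remove_from_hass(self) -> None:
        states.pop(self.entity_id, None)


class DeviceProxy:
    gateway_proxy = None  # models a gateway proxy that is already torn down


class LeakyEntity(BaseEntity):
    device_proxy = DeviceProxy()

    async def async_will_remove_from_hass(self) -> None:
        # Attribute lookup on None raises AttributeError here...
        self.device_proxy.gateway_proxy.remove_entity_reference(self)
        # ...so the state-clearing call below is never reached.
        await super().async_will_remove_from_hass()


async def unload() -> str | None:
    try:
        await LeakyEntity("light.coffee_light").async_will_remove_from_hass()
    except AttributeError as err:
        return str(err)
    return None


error = asyncio.run(unload())
print(error, states)
```

In this model the state for `light.coffee_light` is still present after the failed unload, which matches the Path B precondition.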

The non-deterministic "different devices each reload" pattern fits a race in which device-proxy teardown and platform unload interleave differently on each reload: entities that unload early succeed, while entities that unload late fail at the new dereference.

### Why I don't see the AttributeError in the log

The captured log doesn't contain any AttributeError traceback during unload. That's a gap in my theory. Possibilities:

1. HA Core's `entity_platform.async_unload_entry` (or `Entity.async_remove`) wraps the call in `try/except` and logs at a level filtered out, or
2. The exception is swallowed by `add_to_platform_finish`-like cleanup, or
3. The actual leak mechanism is different and I'm chasing the wrong line.

(3) is possible. The fact that state survives unload is the load-bearing observation; the *why* is my best guess.

### What does work: "Reconfigure"

The Reconfigure device action triggers re-interview, which goes through `DeviceEntityRemovedEvent(remove=False)` → `SIGNAL_REMOVE_ENTITY_{platform}_{unique_id}` → `self.async_remove`. This is a different removal path (the new soft-remove flow from #169341) — and it works. So the new soft-remove is fine; the suspect is the change in `async_will_remove_from_hass`.

### What does *not* trigger this

`gateway.shutdown()` correctly calls `device.on_remove()` which calls `_async_teardown(emit_entity_events=False)` (zha/zigbee/device.py:1226-1276) — so the new `SIGNAL_REMOVE_ENTITY` dispatcher signal does **not** fire spuriously during reload. The bug isn't in the new signal path; it's in the modified `ZHAEntity.async_will_remove_from_hass` body.

## Suggested fix (sketch — flag if you'd prefer a different shape)

Two options that should both restore correctness:

**Option A** — swap order so state-clear always runs:

```python
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    await super().async_will_remove_from_hass()
    try:
        self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)
    except AttributeError:
        # Gateway proxy may already be torn down; bookkeeping is moot at this point
        pass
    self.remove_future.set_result(True)
```

**Option B** — keep order, just guard:

```python
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    if (proxy := getattr(self.entity_data, "device_proxy", None)) is not None:
        if (gw := getattr(proxy, "gateway_proxy", None)) is not None:
            gw.remove_entity_reference(self)
    await super().async_will_remove_from_hass()
    self.remove_future.set_result(True)
```

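Either way, the existing test added in #169341 (`test_dynamic_entities.py`) shouldn't break, and a new test should cover "ZHAEntity removes its state from `hass.states` even when `gateway_proxy` is None at unload time". Note that Option B guards only the outer chain; an `AttributeError` raised inside `remove_entity_reference` would still skip the state clear, which Option A avoids by running `super()` first.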
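
As a rough shape for that new test, here is a self-contained check of the Option A ordering against stand-in classes. It does not use the real HA test fixtures; `BaseEntity`, `DeviceProxy`, and `OptionAEntity` are hypothetical, and the base class is again assumed to be what clears the state.

```python
import asyncio

states = {"light.coffee_light": "on"}  # stand-in for hass.states


class BaseEntity:
    """Stand-in base class, assumed to clear the state as the theory supposes."""

    entity_id = "light.coffee_light"

    async def async_will_remove_from_hass(self) -> None:
        states.pop(self.entity_id, None)


class DeviceProxy:
    gateway_proxy = None  # gateway proxy already torn down at unload time


class OptionAEntity(BaseEntity):
    """Hypothetical entity using the Option A ordering."""

    def __init__(self) -> None:
        self.device_proxy = DeviceProxy()
        self.remove_future = asyncio.get_running_loop().create_future()

    async def async_will_remove_from_hass(self) -> None:
        await super().async_will_remove_from_hass()  # state clear runs first
        try:
            self.device_proxy.gateway_proxy.remove_entity_reference(self)
        except AttributeError:
            pass  # bookkeeping is moot once the proxy is gone
        self.remove_future.set_result(True)


async def check_state_removed_without_gateway_proxy() -> bool:
    entity = OptionAEntity()  # created inside the running loop
    await entity.async_will_remove_from_hass()
    return entity.remove_future.done() and entity.entity_id not in states


ok = asyncio.run(check_state_removed_without_gateway_proxy())
print(ok, states)
```

A real test would presumably set up a ZHA device through the existing fixtures and null out the proxy before unloading; that part is left out here because I have not checked how the fixtures expose it.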

## Things I have not done

- Reproduced this locally. I read the user's debug log + the code.
- Confirmed the AttributeError fires by adding instrumentation. The theory is consistent with the symptom but not directly observed.
- Looked at `homeassistant_hardware` interactions or the older `ha_entity_refs`/`_ha_entity_refs` rename for unrelated issues.
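
For the missing instrumentation, one low-risk approach would be to wrap the unload hook so that any exception is logged with a traceback and then re-raised unchanged. The decorator below is a generic sketch; the logger name and the `FakeEntity` class are hypothetical, and I have not applied it to a running ZHA instance.

```python
import asyncio
import functools
import logging

_LOGGER = logging.getLogger("zha_unload_probe")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects log records in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
_LOGGER.addHandler(handler)


def log_unload_errors(method):
    """Wrap an async method: log any exception with traceback, then re-raise."""

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except Exception:
            _LOGGER.exception(
                "unload hook failed for %r", getattr(self, "entity_id", self)
            )
            raise

    return wrapper


class FakeEntity:
    entity_id = "light.coffee_light"
    gateway_proxy = None  # models the torn-down proxy

    @log_unload_errors
    async def async_will_remove_from_hass(self) -> None:
        self.gateway_proxy.remove_entity_reference(self)


async def main() -> bool:
    try:
        await FakeEntity().async_will_remove_from_hass()
    except AttributeError:
        return True  # the original exception still propagates
    return False


reraised = asyncio.run(main())
print(reraised, len(handler.records))
```

Applied to the real class, this would amount to reassigning `ZHAEntity.async_will_remove_from_hass` to its wrapped version in a custom build. If the theory holds, an ERROR-level traceback should then appear for each affected entity during reload.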

## Related work I have already filed

- zigpy/bellows#721 — `send_packet` TOCTOU race (different bug, also "possible issue" framing)
- zigpy/bellows#722 — ThreadsafeProxy silent-None on closed loop (different bug)
- zigpy/zha#769 — `async_initialize`'s `except Exception:` doesn't catch `CancelledError` (different bug, related to the same 2026.5.x bootstrap-timeout family)

These three plus this one together explain a chunk of the "ZHA broken after 2026.5.x" reports across home-assistant/core#130548, #168432, #170920, and #172247. Different code paths, different tracebacks — worth keeping them separated in triage.

Happy to provide additional log slices, draft the fix PR, or close as misdiagnosed if you confirm a different cause.
