⚠️ This is an investigation, not a confirmed reproduction by me. Multiple independent users report the symptom on 2026.5.x. The analysis below is from reading the affected user's debug log + the HA Core / zha source. Filing here per @TheJulianJES request — he authored the most-likely-culprit PR (#169341) and is best placed to confirm/reject.
cc @TheJulianJES
Source report
home-assistant/core#170920 (exhannibal, opened 2026-05-16, 3 distinct users confirming):
"ZHA reload leaves orphaned entity registry entries: devices appear available but commands fail (unique_id already exists)"
Setup: Nabu Casa SkyConnect v1.0 / bellows / Python 3.14.2 / HAOS aarch64 / HA 2026.5.2 / zha==1.3.1.
Symptom: after a ZHA config-entry reload, a non-deterministic subset of devices ends up in a half-up state. The HA log contains, for the affected entities, messages like:
Platform zha does not generate unique IDs.
ID 84:2e:14:ff:fe:ba:5f:52-1 is already used by light.coffee_light - ignoring light.coffee_light
The UI still shows the registry-persisted entity, but no live ZHA platform entity is bound to it. The user's ZHA "Reconfigure" device action restores control until the next reload.
Repro:
- Run ZHA with multiple devices.
- Toggle disabled_by on any ZHA entity in Settings → Entities (this triggers a config-entry reload).
- After ZHA finishes startup, control fails for a subset of devices (different subset each reload).
Not the same as home-assistant/core#130548 (radio-side) — the coordinator is fine and other devices on the same network work normally.
Where the rejection fires
homeassistant/helpers/entity_platform.py:898-925:
if entity.unique_id is not None:
    registered_entity_id = entity_registry.async_get_entity_id(
        self.domain, self.platform_name, entity.unique_id
    )
    if registered_entity_id:
        already_exists, _ = self._entity_id_already_exists(registered_entity_id)
        if already_exists:
            entity.registry_entry = None
            msg = (
                f"Platform {self.platform_name} does not generate unique IDs. "
            )
            ...
            self.logger.error(msg)
            entity.add_to_platform_abort()
            return
_entity_id_already_exists (entity_platform.py:811-825):
already_exists = entity_id in self.entities
restored = False
if not already_exists and not self.hass.states.async_available(entity_id):
    existing = self.hass.states.get(entity_id)
    if existing is not None and ATTR_RESTORED in existing.attributes:
        restored = True
    else:
        already_exists = True
return (already_exists, restored)
After a fresh reload, entity_id in self.entities is False (new platform instance, empty entities dict). So the second branch must be the one firing: hass.states.async_available(entity_id) returns False because a state still exists for the entity, and that state isn't marked ATTR_RESTORED. The state survived the unload phase even though the entity didn't.
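To make the three possible outcomes of that check concrete, here is a minimal sketch with stub objects. StubPlatform and StubState are hypothetical stand-ins, not HA classes; "not async_available" is simplified to "a state exists under this entity_id", and ATTR_RESTORED is assumed to be the "restored" attribute key.

```python
from dataclasses import dataclass, field

ATTR_RESTORED = "restored"  # assumed to match the key HA uses for restored states


@dataclass
class StubState:
    attributes: dict = field(default_factory=dict)


@dataclass
class StubPlatform:
    """Hypothetical stand-in for EntityPlatform with only the two lookups the check needs."""

    entities: dict = field(default_factory=dict)  # live entities on this platform
    states: dict = field(default_factory=dict)  # stand-in for hass.states

    def entity_id_already_exists(self, entity_id: str) -> tuple[bool, bool]:
        already_exists = entity_id in self.entities
        restored = False
        # Simplification: "not async_available" == "a state exists for this ID".
        if not already_exists and entity_id in self.states:
            existing = self.states[entity_id]
            if ATTR_RESTORED in existing.attributes:
                restored = True
            else:
                already_exists = True
        return (already_exists, restored)


fresh = StubPlatform()  # new platform instance after reload: entities is empty

# Clean unload: no state left behind, so the new entity can claim the ID.
clean = fresh.entity_id_already_exists("light.coffee_light")

# State rewritten with ATTR_RESTORED on unload: treated as restorable, not a duplicate.
fresh.states["light.kettle_light"] = StubState({ATTR_RESTORED: True})
restored = fresh.entity_id_already_exists("light.kettle_light")

# Leaked state without ATTR_RESTORED: reported as already existing.
fresh.states["light.toaster_light"] = StubState({"friendly_name": "Toaster"})
leaked = fresh.entity_id_already_exists("light.toaster_light")

print(clean, restored, leaked)
```

Only the third case produces the "does not generate unique IDs" rejection, which is why a leftover, non-restored state is enough to explain the log message.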
Evidence in the user's debug log
The user attached a full debug log of one failing reload. The setup-side errors look like:
button.toaster_identify ← original (no suffix)
button.microwave_light_identify ← original
button.toaster_light_identify_2 ← _2 suffix
button.kettle_light_identify_3 ← _3 suffix
button.coffee_light_identify_4 ← _4 suffix
That _N suffix is HA's automatic entity-ID collision avoidance during generation: when button.coffee_light_identify is already taken in the state machine, the new entity is suggested as _2; if that is taken too, _3; and so on. A _4 suffix therefore implies that three earlier IDs for that single entity were still occupied in hass.states, left behind by past reloads. So this isn't a one-off leak: every reload adds another orphan for the affected devices.
(For unaffected devices the entity_ids come out clean. The bug is non-deterministic in which devices get hit, not which entity_ids leak.)
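The suffix arithmetic can be sketched as follows. suggest_entity_id is a hypothetical helper that mimics the "first free suffix" behaviour described above; it is not HA's actual function, and the set named taken stands in for IDs still occupied in hass.states.

```python
def suggest_entity_id(preferred: str, taken: set[str]) -> str:
    """Return preferred, or the first of preferred_2, preferred_3, ... not in taken."""
    candidate = preferred
    tries = 1
    while candidate in taken:
        tries += 1
        candidate = f"{preferred}_{tries}"
    return candidate


taken: set[str] = set()
history: list[str] = []
for _ in range(4):  # four setups in a row where the previous state is never cleaned up
    entity_id = suggest_entity_id("button.coffee_light_identify", taken)
    taken.add(entity_id)  # the orphaned state keeps this ID occupied
    history.append(entity_id)

print(history)
```

Under these assumptions the fourth setup is the first one to receive the _4 suffix, which matches the reading that three earlier IDs were still occupied.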
My current best theory (NOT confirmed)
The new line added in #169341 to homeassistant/components/zha/entity.py::ZHAEntity.async_will_remove_from_hass:
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)  # ← added by #169341
    await super().async_will_remove_from_hass()  # ← clears hass.states
    self.remove_future.set_result(True)
self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self) is a four-deep attribute chain. If any link in that chain is None or unreachable at the moment the entity unloads (for example, if the gateway proxy has already been disposed by the time the entity reaches its async_will_remove_from_hass, depending on how the platform unload and gateway.shutdown() interleave), the line raises AttributeError and the rest of the method does not run. In that case neither super().async_will_remove_from_hass() nor the state cleanup that follows the hook (hass.states.async_remove(entity_id)) takes place.
remove_entity_reference itself was also rewritten in the same PR:
def remove_entity_reference(self, entity: ZHAEntity) -> None:
    ieee = entity.entity_data.device_proxy.device.ieee  # another four-deep chain
    if (entity_refs := self._ha_entity_refs.get(ieee)) is None:
        return
    ...
So either dereference chain could raise.
The non-deterministic "different devices each reload" pattern fits a race in which device-proxy teardown and platform unload interleave differently on each reload: entities that unload before the proxy is torn down succeed, and entities that unload after it fail at the new dereference.
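The suspected failure mode can be simulated in isolation. This is a sketch with hypothetical stub classes (StubEntity, StubGateway) and a plain dict standing in for hass.states; it assumes the state cleanup only happens once the removal hook has returned, and it says nothing about whether the real gateway proxy is ever None.

```python
import asyncio
from types import SimpleNamespace

states: dict[str, str] = {}  # stand-in for hass.states


class StubGateway:
    def remove_entity_reference(self, entity) -> None:
        pass  # bookkeeping only; irrelevant to the sketch


class StubEntity:
    def __init__(self, entity_id: str, gateway_proxy) -> None:
        self.entity_id = entity_id
        self.entity_data = SimpleNamespace(
            device_proxy=SimpleNamespace(gateway_proxy=gateway_proxy)
        )
        states[entity_id] = "on"

    async def async_will_remove_from_hass(self) -> None:
        # Mirrors the suspected line: AttributeError if gateway_proxy is None.
        self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)

    async def async_remove(self) -> None:
        await self.async_will_remove_from_hass()
        states.pop(self.entity_id, None)  # only reached if the hook returned


async def unload(entities) -> dict[str, str]:
    errors: dict[str, str] = {}
    for entity in entities:
        try:
            await entity.async_remove()
        except AttributeError as err:
            errors[entity.entity_id] = str(err)
    return errors


entities = [
    StubEntity("light.kettle_light", StubGateway()),  # unloads before teardown
    StubEntity("light.coffee_light", None),  # unloads after teardown
]
errors = asyncio.run(unload(entities))
print(sorted(states), sorted(errors))
```

The entity whose gateway proxy is already gone keeps its state, while the other one is cleaned up, which is the per-device split described above.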
Why I don't see the AttributeError in the log
The captured log doesn't contain any AttributeError traceback during unload. That's a gap in my theory. Possibilities:
1. HA Core's entity_platform.async_unload_entry (or Entity.async_remove) wraps the call in try/except and logs at a level filtered out, or
2. The exception is swallowed by add_to_platform_finish-like cleanup, or
3. The actual leak mechanism is different and I'm chasing the wrong line.
(3) is possible. The fact that state survives unload is the load-bearing observation; the why is my best guess.
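For illustration only, here is one generic asyncio pattern in which an exception leaves no traceback in the log. This is not a claim about what HA Core does during unload; failing_hook and ok_hook are hypothetical coroutines, and the point is just that gathered exceptions become return values that nobody is obliged to log.

```python
import asyncio


async def failing_hook() -> None:
    raise AttributeError("'NoneType' object has no attribute 'remove_entity_reference'")


async def ok_hook() -> None:
    return None


async def unload_all() -> list:
    # With return_exceptions=True, exceptions are returned instead of raised.
    # If the caller discards this list, nothing is ever logged.
    return await asyncio.gather(failing_hook(), ok_hook(), return_exceptions=True)


results = asyncio.run(unload_all())
print([type(result).__name__ for result in results])
```

If something like this sits between the entity hook and the logger, the absence of a traceback would not rule out the theory.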
What does work: "Reconfigure"
The Reconfigure device action triggers re-interview, which goes through DeviceEntityRemovedEvent(remove=False) → SIGNAL_REMOVE_ENTITY_{platform}_{unique_id} → self.async_remove. This is a different removal path (the new soft-remove flow from #169341) — and it works. So the new soft-remove is fine; the suspect is the change in async_will_remove_from_hass.
What does not trigger this
gateway.shutdown() correctly calls device.on_remove() which calls _async_teardown(emit_entity_events=False) (zha/zigbee/device.py:1226-1276) — so the new SIGNAL_REMOVE_ENTITY dispatcher signal does not fire spuriously during reload. The bug isn't in the new signal path; it's in the modified ZHAEntity.async_will_remove_from_hass body.
Suggested fix (sketch — flag if you'd prefer a different shape)
Two options that should both restore correctness:
Option A — swap order so state-clear always runs:
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    await super().async_will_remove_from_hass()
    try:
        self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)
    except AttributeError:
        # Gateway proxy may already be torn down; bookkeeping is moot at this point
        pass
    self.remove_future.set_result(True)
Option B — keep order, just guard:
async def async_will_remove_from_hass(self) -> None:
    for unsub in self._unsubs[:]:
        unsub()
        self._unsubs.remove(unsub)
    if (proxy := getattr(self.entity_data, "device_proxy", None)) is not None:
        if (gw := getattr(proxy, "gateway_proxy", None)) is not None:
            gw.remove_entity_reference(self)
    await super().async_will_remove_from_hass()
    self.remove_future.set_result(True)
Either way, the existing test added in #169341 (test_dynamic_entities.py) shouldn't break, and a new test should cover "ZHAEntity removes its state from hass.states even when gateway_proxy is None at unload time".
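A rough shape for that new test, written against a stub rather than the real ZHA fixtures. GuardedStubEntity is a hypothetical stand-in that applies the Option A guard, and the dict named states stands in for hass.states; a real test would use the ZHA test harness instead.

```python
import asyncio
from types import SimpleNamespace

states: dict[str, str] = {}  # stand-in for hass.states


class GuardedStubEntity:
    """Hypothetical stand-in for ZHAEntity with the Option A guard applied."""

    def __init__(self, entity_id: str, gateway_proxy) -> None:
        self.entity_id = entity_id
        self.entity_data = SimpleNamespace(
            device_proxy=SimpleNamespace(gateway_proxy=gateway_proxy)
        )
        states[entity_id] = "on"

    async def async_will_remove_from_hass(self) -> None:
        try:
            self.entity_data.device_proxy.gateway_proxy.remove_entity_reference(self)
        except AttributeError:
            pass  # gateway proxy already torn down; nothing left to unregister

    async def async_remove(self) -> None:
        await self.async_will_remove_from_hass()
        states.pop(self.entity_id, None)  # assumed to run after the hook returns


def test_state_removed_when_gateway_proxy_is_none() -> None:
    entity = GuardedStubEntity("light.coffee_light", gateway_proxy=None)
    assert "light.coffee_light" in states
    asyncio.run(entity.async_remove())
    assert "light.coffee_light" not in states


test_state_removed_when_gateway_proxy_is_none()
```

The assertion that matters is the last one: the state must be gone even though the gateway proxy was None when the hook ran.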
Things I have not done
- Reproduced this locally. I read the user's debug log + the code.
- Confirmed the AttributeError fires by adding instrumentation. The theory is consistent with the symptom but not directly observed.
- Looked at homeassistant_hardware interactions or the older ha_entity_refs/_ha_entity_refs rename for unrelated issues.
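The missing instrumentation could look roughly like this: a decorator that logs any exception from the removal hook with a full traceback and then re-raises it. log_hook_errors, the logger name, and StubEntity are all hypothetical; applying the decorator to the real ZHAEntity method would need a local patch.

```python
import asyncio
import functools
import logging

log = logging.getLogger("zha_unload_probe")  # hypothetical logger name


def log_hook_errors(hook):
    """Wrap an async hook so any exception is logged with a traceback, then re-raised."""

    @functools.wraps(hook)
    async def wrapper(self, *args, **kwargs):
        try:
            return await hook(self, *args, **kwargs)
        except Exception:
            log.exception(
                "%s failed for %s", hook.__name__, getattr(self, "entity_id", "?")
            )
            raise

    return wrapper


class StubEntity:
    entity_id = "light.coffee_light"
    gateway_proxy = None  # simulates a proxy that is already torn down

    @log_hook_errors
    async def async_will_remove_from_hass(self) -> None:
        self.gateway_proxy.remove_entity_reference(self)


caught = None
try:
    asyncio.run(StubEntity().async_will_remove_from_hass())
except AttributeError as err:
    caught = err
```

Because the wrapper re-raises, behaviour is unchanged apart from the extra log entry, so a reload with this in place should either show the AttributeError or rule it out.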
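Related work I have already filed
- send_packet TOCTOU race (different bug, also "possible issue" framing)
- async_initialize's except Exception: doesn't catch CancelledError (different bug, related to the same 2026.5.x bootstrap-timeout family)
These, plus this one, together explain a chunk of the "ZHA broken after 2026.5.x" reports across home-assistant/core#130548, #168432, #170920, and #172247. They involve different code paths and different tracebacks, so it is worth keeping them separated in triage.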
Happy to provide additional log slices, draft the fix PR, or close as misdiagnosed if you confirm a different cause.