Fix/voice stt speech events user state tracking #5797
self.update_vad(self._vad)

self._speaking = False
self._user_turn_committed = True
if not self._vad or self._last_speaking_time is None:
self._last_speaking_time = time.time()

chat_ctx = self._hooks.retrieve_chat_ctx().copy()
self._run_eou_detection(chat_ctx)
if self._turn_detection_mode == "stt":
self._user_turn_committed = True
chat_ctx = self._hooks.retrieve_chat_ctx().copy()
self._run_eou_detection(chat_ctx)
🔴 Missing _stt_end_of_speech_received flag implementation causes tests to fail and feature to not work
The tests in TestSttSpeechEndTiming assert on _stt_end_of_speech_received (lines 237, 255, 272) and expect that _last_speaking_time set by END_OF_SPEECH is preserved when FINAL_TRANSCRIPT arrives. However, the production code never declares, initializes, or sets this flag. The git history shows commit 05075f06c added the full implementation (flag in __init__, setting it in END_OF_SPEECH/START_OF_SPEECH handlers, and guarding _last_speaking_time in FINAL_TRANSCRIPT), but the final commit b0164a98c reverted all of that while keeping the tests.
Consequences:
- Test test_stt_eos_timestamp_is_preserved_for_final_transcript_without_external_vad will fail: asserts _stt_end_of_speech_received is True but the flag stays False (as set by the test itself, never modified by production code); asserts _last_speaking_time == stt_eos_time but the FINAL_TRANSCRIPT handler at audio_recognition.py:878 unconditionally overwrites it when not self._vad.
- The intended behavior (preserving END_OF_SPEECH timestamp for latency metrics when no external VAD is configured) is not implemented: _last_speaking_time is always overwritten by transcript arrival time.
(Refers to lines 954-979)
Prompt for agents
The PR is missing the _stt_end_of_speech_received flag that the tests depend on. The intermediate commit 05075f06c had the complete implementation but the final commit b0164a98c reverted it. To fix:
1. In AudioRecognition.__init__ (around line 162), add: self._stt_end_of_speech_received = False
2. In clear_user_turn() (around line 679), add: self._stt_end_of_speech_received = False
3. In the END_OF_SPEECH handler (after line 955), add: self._stt_end_of_speech_received = True
4. In the START_OF_SPEECH handler (after line 982), add: self._stt_end_of_speech_received = False
5. In the FINAL_TRANSCRIPT handler (line 878), change the condition from:
if not self._vad or self._last_speaking_time is None:
to:
if self._last_speaking_time is None or (not self._vad and not self._stt_end_of_speech_received):
6. Similarly in the PREFLIGHT_TRANSCRIPT handler (line 931), apply the same condition change.
7. In _bounce_eou_task cleanup (around line 1194), add: self._stt_end_of_speech_received = False
Refer to the intermediate commit 05075f06c for the complete implementation that matches the tests.
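As a rough illustration of the guard proposed in step 5, the following standalone sketch models only the three fields involved. The class name SpeechTimingSketch and its method names are hypothetical and are not the real AudioRecognition API. It also assumes the END_OF_SPEECH handler records _last_speaking_time, which the review implies but the excerpt does not show.

```python
from typing import Optional


class SpeechTimingSketch:
    """Hypothetical stand-in for the timing fields discussed in the review."""

    def __init__(self, has_external_vad: bool) -> None:
        # Any truthy object plays the role of an external VAD here.
        self._vad = object() if has_external_vad else None
        self._last_speaking_time: Optional[float] = None
        self._stt_end_of_speech_received = False

    def on_start_of_speech(self) -> None:
        # A new utterance invalidates any earlier STT end-of-speech marker.
        self._stt_end_of_speech_received = False

    def on_end_of_speech(self, now: float) -> None:
        # Assumption: END_OF_SPEECH records the timestamp and sets the flag.
        self._last_speaking_time = now
        self._stt_end_of_speech_received = True

    def on_final_transcript(self, now: float) -> None:
        # Condition quoted from step 5: overwrite only when no timestamp exists,
        # or when there is neither an external VAD nor an STT end-of-speech.
        if self._last_speaking_time is None or (
            not self._vad and not self._stt_end_of_speech_received
        ):
            self._last_speaking_time = now


sketch = SpeechTimingSketch(has_external_vad=False)
sketch.on_start_of_speech()
sketch.on_end_of_speech(now=10.0)
sketch.on_final_transcript(now=12.5)
print(sketch._last_speaking_time)  # the END_OF_SPEECH time survives the transcript
```

With the original condition (if not self._vad or self._last_speaking_time is None), the same sequence would overwrite the timestamp with the transcript arrival time.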
Problem
When no external VAD is configured, but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH / END_OF_SPEECH), those STT speech events are not always propagated into the session user-state machine.
This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.
This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.
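To make the failure mode concrete, here is a minimal simulation of an away timer that is cancelled only by a start-of-speech signal. UserStateSketch, simulate, and the 15-second timeout are hypothetical illustrations, not the AgentSession implementation.

```python
from typing import Optional


class UserStateSketch:
    """Hypothetical, simplified user-state tracker driven by explicit timestamps."""

    def __init__(self, away_timeout: float) -> None:
        self.away_timeout = away_timeout
        self.state = "listening"
        # The away timer starts running at t=0.
        self._away_deadline: Optional[float] = away_timeout

    def on_start_of_speech(self) -> None:
        self.state = "speaking"
        self._away_deadline = None  # cancel the away timer while the user talks

    def on_end_of_speech(self, now: float) -> None:
        self.state = "listening"
        self._away_deadline = now + self.away_timeout  # restart the timer

    def tick(self, now: float) -> None:
        # Fire the away transition once the deadline has passed.
        if self._away_deadline is not None and now >= self._away_deadline:
            self.state = "away"
            self._away_deadline = None


def simulate(forward_stt_events: bool) -> str:
    """The user starts talking at t=2 and is still talking at t=16."""
    tracker = UserStateSketch(away_timeout=15.0)
    if forward_stt_events:
        tracker.on_start_of_speech()  # STT START_OF_SPEECH reaches the tracker
    tracker.tick(now=16.0)
    return tracker.state


print(simulate(forward_stt_events=False))  # away
print(simulate(forward_stt_events=True))   # speaking
```

When the start-of-speech signal never reaches the tracker, the timer keeps running and the state flips to "away" while the user is still talking.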
Root Cause
AudioRecognition._on_stt_event previously only forwarded STT speech-boundary events into the user-state path when:
That meant STT START_OF_SPEECH / END_OF_SPEECH drove _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...) only for turn_detection="stt".
When no external VAD is configured and turn detection is:
- MultilingualModel()
then _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.
The user-state machine is updated only through _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...), which eventually call:
Without those calls, _user_away_timer is not cancelled when the user starts speaking.
Fix
Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.
The STT event handlers now run when either:
or:
This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.
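The gating described here can be sketched as two small predicates. The exact conditions in the PR are not reproduced in this description, so the function names and the boolean has_external_vad parameter below are assumptions that follow only the prose: STT events drive user state in "stt" mode or when no external VAD exists, while turn commits stay limited to "stt" mode.

```python
from typing import Optional


def stt_events_drive_user_state(
    turn_detection_mode: Optional[str], has_external_vad: bool
) -> bool:
    # Assumed reading of the fix: STT speech boundaries update user state
    # in "stt" mode, or whenever there is no external VAD to do it instead.
    return turn_detection_mode == "stt" or not has_external_vad


def stt_events_may_commit_turn(turn_detection_mode: Optional[str]) -> bool:
    # Turn-commit authority is not widened by the fix.
    return turn_detection_mode == "stt"


# None stands in for "turn detection omitted"; "model" for a turn-detector model.
for mode in ("stt", "model", "manual", None):
    print(
        mode,
        stt_events_drive_user_state(mode, has_external_vad=False),
        stt_events_may_commit_turn(mode),
    )
```

Keeping the two decisions in separate predicates mirrors the intent that user-state tracking can broaden without changing which modes are allowed to end a turn.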
At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt":
So for manual/model/omitted turn detection:
Behavior Matrix
speaking/listening  "stt"  MultilingualModel()  "manual"
What This PR Does Not Include
This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:
So this branch should not include:
- _stt_end_of_speech_received
- FINAL_TRANSCRIPT / PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics
Manual Verification
Validated the important combinations manually:
- vad=None + turn_detection="stt"
  - START_SPEECH produced User State Changed: speaking
  - END_SPEECH produced User State Changed: listening
  - away during speech
- vad=None + MultilingualModel()
  - speaking/listening
  - user_away_timeout
  - away did not fire during speech
  - away fired only after speech ended and the user was silent
- vad=None + turn_detection="manual"
  - speaking/listening
  - away mid-utterance
- vad=None + turn_detection omitted
  - speaking/listening
  - away mid-utterance
- vad=Silero + MultilingualModel()
  - away mid-utterance