Fix/voice stt speech events user state tracking #5797

Open
dhruvladia-sarvam wants to merge 3 commits into livekit:main from dhruvladia-sarvam:fix/voice-stt-speech-events-user-state-tracking
Conversation

@dhruvladia-sarvam
Contributor

Problem

When no external VAD is configured but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH / END_OF_SPEECH), those events are not always propagated into the session user-state machine.

This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.

This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.

Root Cause

AudioRecognition._on_stt_event previously only forwarded STT speech-boundary events into the user-state path when:

self._turn_detection_mode == "stt"

That meant STT START_OF_SPEECH / END_OF_SPEECH drove _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...) only for turn_detection="stt".

When no external VAD is configured and turn detection is:

  • model-based, e.g. MultilingualModel()
  • omitted / auto
  • manual

then _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.

The user-state machine is updated only through _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...), which eventually call:

self._session._update_user_state("speaking", ...)
self._session._update_user_state("listening", ...)

Without those calls, _user_away_timer is not cancelled when the user starts speaking.
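The effect of the missing calls can be illustrated with a minimal, self-contained sketch. The `UserStateTracker` class, its method names, and the 15-second timeout below are hypothetical and are not the LiveKit implementation. The sketch only models an away deadline that is cleared when the state becomes "speaking" and re-armed when the state returns to "listening", driven by a simulated clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserStateTracker:
    """Hypothetical model of the user-state machine (not the LiveKit class)."""

    away_timeout: float
    state: str = "listening"
    away_deadline: Optional[float] = None

    def __post_init__(self) -> None:
        # Arm the away timer at t=0 while the user is "listening".
        self.away_deadline = self.away_timeout

    def update_user_state(self, new_state: str, now: float) -> None:
        self.state = new_state
        if new_state == "speaking":
            # Speaking cancels the pending away timer.
            self.away_deadline = None
        elif new_state == "listening":
            # Returning to listening re-arms the away timer.
            self.away_deadline = now + self.away_timeout

    def tick(self, now: float) -> None:
        # Fire the away transition once the deadline has passed.
        if self.away_deadline is not None and now >= self.away_deadline:
            self.state = "away"
            self.away_deadline = None


def simulate(forward_stt_events: bool) -> str:
    """The user starts speaking at t=1s; return the state observed at t=16s."""
    tracker = UserStateTracker(away_timeout=15.0)  # assumed timeout value
    if forward_stt_events:
        tracker.update_user_state("speaking", now=1.0)
    tracker.tick(now=16.0)
    return tracker.state
```

In this model, `simulate(False)` leaves the timer armed and the user is marked "away" while still talking, whereas `simulate(True)` cancels the timer as soon as the start-of-speech event is forwarded.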

Fix

Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.

The STT event handlers now run when either:

self._turn_detection_mode == "stt"

or:

self._vad is None

This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.

At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt":

if self._turn_detection_mode == "stt":
    self._user_turn_committed = True
    chat_ctx = self._hooks.retrieve_chat_ctx().copy()
    self._run_eou_detection(chat_ctx)

So for manual/model/omitted turn detection:

  • STT events update user state
  • STT events do not auto-commit turns
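The two gates described above can be sketched as a pair of pure functions. This is a simplified illustration under stated assumptions: the function names and the `has_external_vad` parameter are hypothetical stand-ins for the checks on `self._turn_detection_mode` and `self._vad`, and the real handler contains more logic than shown here.

```python
from typing import Optional


def stt_drives_user_state(
    turn_detection_mode: Optional[str], has_external_vad: bool
) -> bool:
    """Mirror of the new gate: mode == "stt" or no external VAD configured."""
    return turn_detection_mode == "stt" or not has_external_vad


def stt_commits_turn(turn_detection_mode: Optional[str]) -> bool:
    """Turn-commit authority stays scoped to turn_detection_mode == "stt"."""
    return turn_detection_mode == "stt"


if __name__ == "__main__":
    # None stands in for omitted / auto turn detection in this sketch.
    for mode in ("stt", "manual", None):
        print(
            mode,
            stt_drives_user_state(mode, has_external_vad=False),
            stt_commits_turn(mode),
        )
```

Keeping the two decisions separate is what allows STT events to update user state for manual, model-based, or omitted turn detection without also committing the turn.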

Behavior Matrix

| External VAD | Turn detection | Before | After |
| --- | --- | --- | --- |
| Yes | any | External VAD drives speaking / listening | Same |
| No | "stt" | STT events drive user-state and turn commit | Same |
| No | model-based, e.g. MultilingualModel() | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; turn detector still controls turn commit |
| No | omitted / auto | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; existing turn handling remains unchanged |
| No | "manual" | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; manual commit remains required |

What This PR Does Not Include

This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:

fix/preserve-stt-eos-timestamp-for-metrics

So this branch should not include:

  • _stt_end_of_speech_received
  • STT EOS timing tests
  • changes to FINAL_TRANSCRIPT / PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics

Manual Verification

Validated the important combinations manually:

vad=None + turn_detection="stt"

  • STT START_SPEECH produced User State Changed: speaking
  • STT END_SPEECH produced User State Changed: listening
  • Long speech did not trigger away during speech

vad=None + MultilingualModel()

  • STT internal VAD drove speaking / listening
  • User spoke for ~20s, longer than user_away_timeout
  • away did not fire during speech
  • away fired only after speech ended and the user was silent

vad=None + turn_detection="manual"

  • STT internal VAD drove speaking / listening
  • Long speech did not trigger away mid-utterance
  • No automatic turn commit / agent response occurred; manual behavior preserved

vad=None + turn_detection omitted

  • STT internal VAD drove speaking / listening
  • Long speech did not trigger away mid-utterance
  • Away sequence cancellation worked when the user spoke again

vad=Silero + MultilingualModel()

  • Existing external VAD behavior remained healthy
  • External VAD continued driving user-state transitions
  • Long speech did not trigger away mid-utterance
  • No obvious regression from STT event gate changes

Contributor

devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

Comment on lines 970 to +979
  self.update_vad(self._vad)

  self._speaking = False
- self._user_turn_committed = True
  if not self._vad or self._last_speaking_time is None:
      self._last_speaking_time = time.time()

- chat_ctx = self._hooks.retrieve_chat_ctx().copy()
- self._run_eou_detection(chat_ctx)
+ if self._turn_detection_mode == "stt":
+     self._user_turn_committed = True
+     chat_ctx = self._hooks.retrieve_chat_ctx().copy()
+     self._run_eou_detection(chat_ctx)
🔴 Missing _stt_end_of_speech_received flag implementation causes tests to fail and feature to not work

The tests in TestSttSpeechEndTiming assert on _stt_end_of_speech_received (lines 237, 255, 272) and expect that _last_speaking_time set by END_OF_SPEECH is preserved when FINAL_TRANSCRIPT arrives. However, the production code never declares, initializes, or sets this flag. The git history shows commit 05075f06c added the full implementation (flag in __init__, setting it in END_OF_SPEECH/START_OF_SPEECH handlers, and guarding _last_speaking_time in FINAL_TRANSCRIPT), but the final commit b0164a98c reverted all of that while keeping the tests.

Consequences:

  1. Test test_stt_eos_timestamp_is_preserved_for_final_transcript_without_external_vad will fail: asserts _stt_end_of_speech_received is True but the flag stays False (as set by the test itself, never modified by production code); asserts _last_speaking_time == stt_eos_time but the FINAL_TRANSCRIPT handler at audio_recognition.py:878 unconditionally overwrites it when not self._vad.
  2. The intended behavior (preserving END_OF_SPEECH timestamp for latency metrics when no external VAD is configured) is not implemented: _last_speaking_time is always overwritten by transcript arrival time.

(Refers to lines 954-979)

Prompt for agents
The PR is missing the _stt_end_of_speech_received flag that the tests depend on. The intermediate commit 05075f06c had the complete implementation but the final commit b0164a98c reverted it. To fix:

1. In AudioRecognition.__init__ (around line 162), add: self._stt_end_of_speech_received = False
2. In clear_user_turn() (around line 679), add: self._stt_end_of_speech_received = False
3. In the END_OF_SPEECH handler (after line 955), add: self._stt_end_of_speech_received = True
4. In the START_OF_SPEECH handler (after line 982), add: self._stt_end_of_speech_received = False
5. In the FINAL_TRANSCRIPT handler (line 878), change the condition from:
   if not self._vad or self._last_speaking_time is None:
   to:
   if self._last_speaking_time is None or (not self._vad and not self._stt_end_of_speech_received):
6. Similarly in the PREFLIGHT_TRANSCRIPT handler (line 931), apply the same condition change.
7. In _bounce_eou_task cleanup (around line 1194), add: self._stt_end_of_speech_received = False

Refer to the intermediate commit 05075f06c for the complete implementation that matches the tests.
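
The condition change proposed in steps 5 and 6 can be compared against the current condition with a small standalone sketch. The function names and the `has_external_vad` parameter are hypothetical. Only the boolean expressions are taken from the comment above, and this is not the code from commit 05075f06c.

```python
from typing import Optional


def overwrites_timestamp_current(
    last_speaking_time: Optional[float], has_external_vad: bool
) -> bool:
    """Current condition: `not self._vad or self._last_speaking_time is None`."""
    return not has_external_vad or last_speaking_time is None


def overwrites_timestamp_proposed(
    last_speaking_time: Optional[float],
    has_external_vad: bool,
    stt_end_of_speech_received: bool,
) -> bool:
    """Proposed condition: keep the STT END_OF_SPEECH timestamp once received."""
    return last_speaking_time is None or (
        not has_external_vad and not stt_end_of_speech_received
    )
```

With no external VAD and an END_OF_SPEECH timestamp already recorded, the current condition still overwrites `_last_speaking_time` when the transcript arrives, while the proposed condition leaves it intact.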