Fix/voice stt speech events user state tracking by dhruvladia-sarvam · Pull Request #5797 · livekit/agents

dhruvladia-sarvam · 2026-05-21T09:22:50Z

Problem

When no external VAD is configured, but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH / END_OF_SPEECH), those STT speech events are not always propagated into the session user-state machine.

This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.

This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.

Root Cause

AudioRecognition._on_stt_event previously only forwarded STT speech-boundary events into the user-state path when:

self._turn_detection_mode == "stt"

That meant STT START_OF_SPEECH / END_OF_SPEECH drove _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...) only for turn_detection="stt".

When no external VAD is configured and turn detection is:

model-based, e.g. MultilingualModel()
omitted / auto
manual

then _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.

The user-state machine is updated only through _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...), which eventually call:

self._session._update_user_state("speaking", ...)
self._session._update_user_state("listening", ...)

Without those calls, _user_away_timer is not cancelled when the user starts speaking.
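The effect on the away timer can be illustrated with a toy model. This is a minimal sketch, not LiveKit's implementation: `ToySession`, its deadline-based timer, and `simulate` are hypothetical names, and the 15 s timeout is an assumed value chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for the session user-state machine."""

    user_away_timeout: float = 15.0  # assumed value, seconds
    user_state: str = "listening"
    away_deadline: Optional[float] = None

    def update_user_state(self, state: str, now: float) -> None:
        self.user_state = state
        # Assumption: "speaking" cancels the away timer, "listening" re-arms it.
        if state == "speaking":
            self.away_deadline = None
        else:
            self.away_deadline = now + self.user_away_timeout

    def tick(self, now: float) -> None:
        # Fire the away transition once the deadline has passed.
        if self.away_deadline is not None and now >= self.away_deadline:
            self.user_state = "away"
            self.away_deadline = None


def simulate(forward_stt_events: bool) -> str:
    """User starts talking at t=1s and is still talking at t=20s."""
    session = ToySession()
    session.update_user_state("listening", now=0.0)
    if forward_stt_events:
        # STT START_OF_SPEECH reaches the user-state path.
        session.update_user_state("speaking", now=1.0)
    session.tick(now=20.0)
    return session.user_state
```

With `forward_stt_events=False` the timer armed at t=0 is never cancelled, so the toy session ends up "away" while the user is still mid-utterance; with `forward_stt_events=True` it stays "speaking".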

Fix

Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.

The STT event handlers now run when either:

self._turn_detection_mode == "stt"

or:

self._vad is None

This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.
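The gate change can be expressed as two small predicates. This is a sketch under stated assumptions: the function names are hypothetical, `has_external_vad` stands in for `self._vad is not None`, and the string `"model"` is only a placeholder for any mode other than `"stt"`.

```python
from typing import Optional


def gate_before(turn_detection_mode: Optional[str], has_external_vad: bool) -> bool:
    """Old gate: STT speech events were forwarded only in "stt" mode."""
    return turn_detection_mode == "stt"


def gate_after(turn_detection_mode: Optional[str], has_external_vad: bool) -> bool:
    """New gate: also forward when there is no external VAD."""
    return turn_detection_mode == "stt" or not has_external_vad


# "model" is a placeholder label, not a real mode value.
for mode in ("stt", "model", None, "manual"):
    print(mode, gate_before(mode, False), gate_after(mode, False))
```

Without an external VAD, the new gate is open for every mode, while the old one was open only for `"stt"`.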

At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt":

if self._turn_detection_mode == "stt":
    self._user_turn_committed = True
    chat_ctx = self._hooks.retrieve_chat_ctx().copy()
    self._run_eou_detection(chat_ctx)

So for manual/model/omitted turn detection:

STT events update user state
STT events do not auto-commit turns
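The split between state updates and turn commit can be sketched with a toy END_OF_SPEECH handler. `ToyRecognition` and its fields are hypothetical and only mirror the structure described above; the real handler does more (timestamps, hooks, EOU detection).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRecognition:
    """Hypothetical, simplified model of the STT END_OF_SPEECH path."""

    turn_detection_mode: Optional[str]
    has_external_vad: bool
    user_state: str = "speaking"
    user_turn_committed: bool = False

    def on_stt_end_of_speech(self) -> None:
        # Gate: handle the event in "stt" mode or when no external VAD exists.
        if not (self.turn_detection_mode == "stt" or not self.has_external_vad):
            return
        # User-state transition happens for every mode that passes the gate.
        self.user_state = "listening"
        # Turn commit stays scoped to "stt" mode.
        if self.turn_detection_mode == "stt":
            self.user_turn_committed = True


manual = ToyRecognition(turn_detection_mode="manual", has_external_vad=False)
manual.on_stt_end_of_speech()

stt = ToyRecognition(turn_detection_mode="stt", has_external_vad=False)
stt.on_stt_end_of_speech()
```

In this sketch the manual-mode instance moves to "listening" without committing the turn, while the `"stt"` instance does both.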

Behavior Matrix

| External VAD | Turn detection | Before | After |
| --- | --- | --- | --- |
| Yes | any | External VAD drives `speaking` / `listening` | Same |
| No | `"stt"` | STT events drive user-state and turn commit | Same |
| No | model-based, e.g. `MultilingualModel()` | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; turn detector still controls turn commit |
| No | omitted / auto | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; existing turn handling remains unchanged |
| No | `"manual"` | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; manual commit remains required |

What This PR Does Not Include

This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:

fix/preserve-stt-eos-timestamp-for-metrics

So this branch should not include:

_stt_end_of_speech_received
STT EOS timing tests
changes to FINAL_TRANSCRIPT / PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics

Manual Verification

Validated the important combinations manually:

`vad=None + turn_detection="stt"`

STT START_SPEECH produced User State Changed: speaking
STT END_SPEECH produced User State Changed: listening
Long speech did not trigger away during speech

`vad=None + MultilingualModel()`

STT internal VAD drove speaking / listening
User spoke for ~20s, longer than user_away_timeout
away did not fire during speech
away fired only after speech ended and the user was silent

`vad=None + turn_detection="manual"`

STT internal VAD drove speaking / listening
Long speech did not trigger away mid-utterance
No automatic turn commit / agent response occurred; manual behavior preserved

`vad=None + turn_detection omitted`

STT internal VAD drove speaking / listening
Long speech did not trigger away mid-utterance
Away sequence cancellation worked when the user spoke again

`vad=Silero + MultilingualModel()`

Existing external VAD behavior remained healthy
External VAD continued driving user-state transitions
Long speech did not trigger away mid-utterance
No obvious regression from STT event gate changes

devin-ai-integration

Devin Review found 1 potential issue.


devin-ai-integration · 2026-05-21T09:27:13Z

                self.update_vad(self._vad)

            self._speaking = False
-            self._user_turn_committed = True
            if not self._vad or self._last_speaking_time is None:
                self._last_speaking_time = time.time()

-            chat_ctx = self._hooks.retrieve_chat_ctx().copy()
-            self._run_eou_detection(chat_ctx)
+            if self._turn_detection_mode == "stt":
+                self._user_turn_committed = True
+                chat_ctx = self._hooks.retrieve_chat_ctx().copy()
+                self._run_eou_detection(chat_ctx)


🔴 Missing _stt_end_of_speech_received flag implementation causes tests to fail and feature to not work

The tests in TestSttSpeechEndTiming assert on _stt_end_of_speech_received (lines 237, 255, 272) and expect that _last_speaking_time set by END_OF_SPEECH is preserved when FINAL_TRANSCRIPT arrives. However, the production code never declares, initializes, or sets this flag. The git history shows commit 05075f06c added the full implementation (flag in __init__, setting it in END_OF_SPEECH/START_OF_SPEECH handlers, and guarding _last_speaking_time in FINAL_TRANSCRIPT), but the final commit b0164a98c reverted all of that while keeping the tests.

Consequences:

Test test_stt_eos_timestamp_is_preserved_for_final_transcript_without_external_vad will fail: asserts _stt_end_of_speech_received is True but the flag stays False (as set by the test itself, never modified by production code); asserts _last_speaking_time == stt_eos_time but the FINAL_TRANSCRIPT handler at audio_recognition.py:878 unconditionally overwrites it when not self._vad.

The intended behavior (preserving END_OF_SPEECH timestamp for latency metrics when no external VAD is configured) is not implemented — _last_speaking_time is always overwritten by transcript arrival time.

(Refers to lines 954-979)

Prompt for agents

The PR is missing the _stt_end_of_speech_received flag that the tests depend on. The intermediate commit 05075f06c had the complete implementation but the final commit b0164a98c reverted it. To fix:

1. In AudioRecognition.__init__ (around line 162), add: `self._stt_end_of_speech_received = False`
2. In clear_user_turn() (around line 679), add: `self._stt_end_of_speech_received = False`
3. In the END_OF_SPEECH handler (after line 955), add: `self._stt_end_of_speech_received = True`
4. In the START_OF_SPEECH handler (after line 982), add: `self._stt_end_of_speech_received = False`
5. In the FINAL_TRANSCRIPT handler (line 878), change the condition from `if not self._vad or self._last_speaking_time is None:` to `if self._last_speaking_time is None or (not self._vad and not self._stt_end_of_speech_received):`
6. Similarly in the PREFLIGHT_TRANSCRIPT handler (line 931), apply the same condition change.
7. In _bounce_eou_task cleanup (around line 1194), add: `self._stt_end_of_speech_received = False`

Refer to the intermediate commit 05075f06c for the complete implementation that matches the tests.
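The condition change proposed in the review can be examined in isolation. This is a sketch, not the code from either commit: the function names are hypothetical, and `has_vad` stands in for `self._vad` being set. Each predicate answers whether the transcript handler would overwrite `_last_speaking_time` with the transcript arrival time.

```python
from typing import Optional


def overwrites_current(has_vad: bool, last_speaking_time: Optional[float]) -> bool:
    """Current condition: `not self._vad or self._last_speaking_time is None`."""
    return (not has_vad) or last_speaking_time is None


def overwrites_proposed(
    has_vad: bool,
    last_speaking_time: Optional[float],
    stt_eos_received: bool,
) -> bool:
    """Proposed condition with the _stt_end_of_speech_received guard."""
    return last_speaking_time is None or (not has_vad and not stt_eos_received)


# No external VAD; STT END_OF_SPEECH already recorded a timestamp at t=10.0.
current = overwrites_current(has_vad=False, last_speaking_time=10.0)
proposed = overwrites_proposed(
    has_vad=False, last_speaking_time=10.0, stt_eos_received=True
)
```

Under the current condition the END_OF_SPEECH timestamp is overwritten whenever there is no external VAD; under the proposed condition it is kept once STT has signalled end of speech. As the PR description notes, that change is scoped to a separate branch.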


dhruvladia-sarvam added 3 commits May 18, 2026 07:35

initial

77c064e

ruff fix

05075f0

initial

b0164a9

devin-ai-integration Bot reviewed May 21, 2026

dhruvladia-sarvam wants to merge 3 commits into livekit:main from dhruvladia-sarvam:fix/voice-stt-speech-events-user-state-tracking
