fix/preserve-stt-eos-timestamp-for-metrics#5755
Open
dhruvladia-sarvam wants to merge 3 commits into
Co-authored-by: Cursor <cursoragent@cursor.com>
Problem
When no external VAD is configured, some streaming STT plugins still emit speech-boundary events such as `SpeechEventType.START_OF_SPEECH` and `SpeechEventType.END_OF_SPEECH`. Those STT-provided boundaries should be used for EOU latency metrics.

Previously, `AudioRecognition._on_stt_event` could overwrite or fail to preserve the STT-provided speech-end timestamp. This caused `EOUMetrics.transcription_delay` and sometimes `end_of_utterance_delay` to be reported as `0.0` or based on transcript-arrival time instead of the actual end-of-speech time.

This was especially visible when:
- `turn_handling.turn_detection = "stt"` is used with no external VAD
- the STT provider emits `FINAL_TRANSCRIPT` before `END_OF_SPEECH`

Root Cause
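The `0.0` readings described under Problem follow from simple arithmetic once the speech-end time is overwritten with the transcript-arrival time. The sketch below is only an illustration: the `transcription_delay` helper and the timestamps are hypothetical, and the formula (final-transcript arrival minus recorded speech-end time, clamped at zero) is an assumption about how the metric is derived, not a copy of the library code.

```python
def transcription_delay(last_final_transcript_time: float, last_speaking_time: float) -> float:
    """Assumed formula: time from end of speech to final transcript, never negative."""
    return max(last_final_transcript_time - last_speaking_time, 0.0)


# Hypothetical timestamps, in seconds.
speech_end = 10.00          # when the STT plugin reported END_OF_SPEECH
transcript_arrival = 10.35  # when FINAL_TRANSCRIPT arrived

# Old behavior: the speech-end time was replaced by the transcript-arrival
# time, so both operands are equal and the delay collapses to 0.0.
delay_with_fallback = transcription_delay(transcript_arrival, transcript_arrival)

# With the STT-provided EOS timestamp preserved, the delay is the real gap
# between end of speech and transcript arrival (about 0.35 s here).
delay_with_eos = transcription_delay(transcript_arrival, speech_end)
```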
`AudioRecognition` used `_last_speaking_time` for multiple purposes but did not track whether that timestamp came from an actual speech-end signal.

Before this change, when `FINAL_TRANSCRIPT` or `PREFLIGHT_TRANSCRIPT` arrived and `self._vad is None`, the handler could fall back to setting `_last_speaking_time` to the transcript-arrival time. That fallback is only safe when no speech-end timestamp is available. If the STT plugin later emits `END_OF_SPEECH`, that event is the authoritative speech-end signal and should replace any transcript-arrival fallback.

A review also identified an important ordering issue: many STT providers emit `FINAL_TRANSCRIPT` before `END_OF_SPEECH`. In that ordering, the old fallback could set `_last_speaking_time` to transcript-arrival time first, and the later `END_OF_SPEECH` timestamp could then be ignored, leaving metrics based on the wrong timestamp.

Fix
Add an explicit `_stt_end_of_speech_received` flag and use it to preserve STT-provided EOS timing.

The updated behavior is:
- On `START_OF_SPEECH`, clear `_stt_end_of_speech_received`.
- On `END_OF_SPEECH`:
  - set `_stt_end_of_speech_received = True`
  - set `_last_speaking_time = time.time()`, because `END_OF_SPEECH` is the authoritative speech-end timestamp
- When `FINAL_TRANSCRIPT` or `PREFLIGHT_TRANSCRIPT` arrives, apply the transcript-arrival fallback only if no speech-end timestamp is available.

Behavior Matrix
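As a minimal sketch of the updated behavior across both event orderings, the flag logic can be modeled in isolation. This is not the actual `AudioRecognition` code: `SpeechEndTracker`, its method names, and `fake_clock` are hypothetical, and the sketch assumes no external VAD is configured. The clock is injectable so the timestamps are deterministic.

```python
import time
from typing import Callable, Iterable, Optional


class SpeechEndTracker:
    """Hypothetical stand-in for the speech-end timing state (no external VAD)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self.last_speaking_time: Optional[float] = None
        self.stt_end_of_speech_received = False

    def on_start_of_speech(self) -> None:
        # A new utterance begins, so an EOS signal from the previous one is stale.
        self.stt_end_of_speech_received = False

    def on_end_of_speech(self) -> None:
        # END_OF_SPEECH is authoritative: it replaces any transcript-arrival fallback.
        self.stt_end_of_speech_received = True
        self.last_speaking_time = self._clock()

    def on_final_transcript(self) -> None:
        # Fall back to transcript-arrival time only when no EOS timestamp exists.
        if not self.stt_end_of_speech_received:
            self.last_speaking_time = self._clock()


def fake_clock(times: Iterable[float]) -> Callable[[], float]:
    """Return a clock that yields the given timestamps in order."""
    it = iter(times)
    return lambda: next(it)


# FINAL_TRANSCRIPT arrives first (t=2.0), then END_OF_SPEECH (t=2.5).
tracker = SpeechEndTracker(clock=fake_clock([2.0, 2.5]))
tracker.on_start_of_speech()
tracker.on_final_transcript()  # fallback records 2.0
tracker.on_end_of_speech()     # EOS replaces the fallback with 2.5
```

In the opposite ordering, `on_end_of_speech` sets the flag first, so the later `on_final_transcript` call leaves the EOS timestamp untouched.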
`END_OF_SPEECH` before `FINAL_TRANSCRIPT`
`FINAL_TRANSCRIPT` before `END_OF_SPEECH`
`turn_detection="stt"`
`0.0`
`0.0` or fallback-based

Tests Added
Added focused tests in `tests/test_speech_start_time_persistence.py`:
- `test_stt_eos_timestamp_is_preserved_for_final_transcript`
- `test_stt_eos_replaces_fallback_final_transcript_time`
- `test_final_transcript_falls_back_without_stt_eos`
- `test_external_vad_timestamp_is_not_overwritten`