Voice activity events indicate when the start or end of speech has been detected in a stream. The events are sent in real time as Speech-to-Text detects them. They are useful for applications that rely on automatically detecting when a user has started or finished speaking. Speech-to-Text can also be configured to automatically close the stream based on voice activity.
Voice activity events are only available for StreamingRecognize gRPC requests.
Enable voice activity events
You can enable receiving voice activity responses by setting the enable_voice_activity_events flag to true under the streaming_features message.
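As a rough illustration of the setting above, the following Python sketch builds a plain dictionary that mirrors the streaming_features message. The helper name build_streaming_features is hypothetical, and a real gRPC client would populate the generated protobuf message rather than a dictionary.

```python
import json


def build_streaming_features(enable_events: bool = True) -> dict:
    """Return a dict mirroring the streaming_features message (hypothetical helper)."""
    return {"enable_voice_activity_events": enable_events}


# Plain-dict stand-in for the message; not the generated protobuf class.
features = build_streaming_features()
print(json.dumps({"streaming_features": features}, indent=2))
```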
Voice activity event types
Voice activity events are returned in real time as Speech-to-Text detects speech start or stop during the stream, usually before the transcription results for the corresponding segment of speech. Voice activity events can also be sent for audio that produces empty transcription results.
Speech Activity Begin
Sent when Speech-to-Text detects that speech has started.
{ "speechEventType": "SPEECH_ACTIVITY_BEGIN", "speechEventOffset": "1.070s" }
Speech Activity End
Sent when Speech-to-Text detects that speech has ended.
{ "speechEventType": "SPEECH_ACTIVITY_END", "speechEventOffset": "1.070s" }
If the stream is closed before speech ends, a SPEECH_ACTIVITY_END event will not be sent.
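One way a client might consume these events is sketched below in Python. It assumes each event arrives as a JSON string shaped like the examples above. The names parse_offset and SpeechTracker are hypothetical and are not part of any Speech-to-Text client library.

```python
import json


def parse_offset(offset: str) -> float:
    """Convert a duration string such as "1.070s" to seconds."""
    if not offset.endswith("s"):
        raise ValueError(f"unexpected offset format: {offset!r}")
    return float(offset[:-1])


class SpeechTracker:
    """Track whether the user is speaking, based on voice activity events."""

    def __init__(self):
        self.speaking = False
        self.segments = []   # completed (begin_seconds, end_seconds) pairs
        self._begin = None   # offset of the currently open segment, if any

    def handle(self, message: str) -> None:
        event = json.loads(message)
        kind = event.get("speechEventType")
        if kind == "SPEECH_ACTIVITY_BEGIN":
            self.speaking = True
            self._begin = parse_offset(event["speechEventOffset"])
        elif kind == "SPEECH_ACTIVITY_END" and self._begin is not None:
            end = parse_offset(event["speechEventOffset"])
            self.segments.append((self._begin, end))
            self.speaking = False
            self._begin = None


tracker = SpeechTracker()
tracker.handle('{"speechEventType": "SPEECH_ACTIVITY_BEGIN", "speechEventOffset": "1.070s"}')
tracker.handle('{"speechEventType": "SPEECH_ACTIVITY_END", "speechEventOffset": "3.250s"}')
print(tracker.segments)
```

Because no SPEECH_ACTIVITY_END event is sent when the stream closes mid-speech, a caller using this sketch would need to check tracker.speaking at stream close and handle the still-open segment itself.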
Enable voice activity timeouts
You can enable voice activity timeouts by setting values on the voice_activity_timeout message in streaming_features. Voice activity timeouts must be greater than 500ms and less than 60s. Speech begin and end timeouts can be set independently.
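The range check above can be sketched in Python before a configuration is sent. This is a minimal sketch using plain dictionaries. The field names speech_start_timeout and speech_end_timeout are assumptions not stated in this document, and the helper build_voice_activity_timeout is hypothetical.

```python
MIN_TIMEOUT_S = 0.5   # timeouts must be greater than 500 ms
MAX_TIMEOUT_S = 60.0  # and less than 60 s


def _check(name: str, seconds: float) -> str:
    """Validate a timeout and format it as a duration string such as "1.5s"."""
    if not (MIN_TIMEOUT_S < seconds < MAX_TIMEOUT_S):
        raise ValueError(f"{name} must be greater than 0.5 s and less than 60 s")
    return f"{seconds:g}s"


def build_voice_activity_timeout(speech_start_timeout_s=None,
                                 speech_end_timeout_s=None) -> dict:
    """Build a dict mirroring voice_activity_timeout; each timeout is optional."""
    timeout = {}
    if speech_start_timeout_s is not None:
        # Field name is an assumption, not taken from this document.
        timeout["speech_start_timeout"] = _check("speech begin timeout",
                                                 speech_start_timeout_s)
    if speech_end_timeout_s is not None:
        # Field name is an assumption, not taken from this document.
        timeout["speech_end_timeout"] = _check("speech end timeout",
                                               speech_end_timeout_s)
    return timeout


features = {
    "enable_voice_activity_events": True,
    "voice_activity_timeout": build_voice_activity_timeout(
        speech_start_timeout_s=5, speech_end_timeout_s=1.5),
}
print(features)
```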
Speech begin timeout
When a speech begin timeout is set, Speech-to-Text will automatically close the stream if speech has not started before the timeout period elapses. Once a SPEECH_ACTIVITY_BEGIN event has been detected and returned, the timeout is canceled for the duration of the stream. This feature is useful for applications that expect a user to begin speaking within a given period of time.
Speech end timeout
When a speech end timeout is set, Speech-to-Text will automatically close the stream if no further speech is detected within the timeout duration after a SPEECH_ACTIVITY_END event. Once a SPEECH_ACTIVITY_BEGIN event has been detected and returned, the timeout is canceled and will start again once a SPEECH_ACTIVITY_END event is sent.
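To make the interaction between the two timeouts concrete, the following Python sketch models the rules described above on a list of (event type, offset in seconds) pairs. It is a simplified client-side model, not the service's implementation. It assumes the begin timeout counts from the start of the audio, that events are ordered by offset, and that audio keeps flowing until any pending timeout fires. The function name stream_close_time is hypothetical.

```python
from typing import Iterable, Optional, Tuple

BEGIN = "SPEECH_ACTIVITY_BEGIN"
END = "SPEECH_ACTIVITY_END"


def stream_close_time(events: Iterable[Tuple[str, float]],
                      begin_timeout: Optional[float] = None,
                      end_timeout: Optional[float] = None) -> Optional[float]:
    """Return the audio time (seconds) at which a timeout would close the
    stream under this simplified model, or None if no timeout is pending."""
    # Assumption: the begin timeout is measured from the start of the audio.
    deadline = begin_timeout
    for kind, offset in events:
        # A pending timeout fires if the next event arrives at or after it
        # (ties go to the timeout, an arbitrary choice for this sketch).
        if deadline is not None and offset >= deadline:
            return deadline
        if kind == BEGIN:
            # Speech started: any pending timeout is canceled.
            deadline = None
        elif kind == END and end_timeout is not None:
            # Speech ended: the end timeout starts counting from here.
            deadline = offset + end_timeout
    return deadline


events = [(BEGIN, 1.07), (END, 3.25), (BEGIN, 4.0), (END, 6.5)]
print(stream_close_time(events, begin_timeout=5, end_timeout=2))
print(stream_close_time([], begin_timeout=5))
```

In the first call, speech begins before the 5 s begin timeout and resumes within 2 s of the first end event, so only the silence after the final end event triggers a close. In the second call, no speech is ever detected, so the begin timeout applies.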
Time measurement for timeouts
Time elapsed is measured by the bytes of audio sent in requests to Speech-to-Text, as opposed to server time. This preserves accuracy during variations in stream transmission. Sending very large audio chunks in requests, or sending requests in very rapid succession, will reduce accuracy in timeout measurement. Note: the size limit for audio chunks is 15360 bytes per request.
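Since elapsed time is derived from audio bytes, a client can estimate how much audio time each request represents. The Python sketch below assumes 16-bit linear PCM, mono, at 16 kHz. These encoding parameters are illustrative assumptions, and the helpers audio_seconds and chunk_audio are hypothetical.

```python
MAX_CHUNK_BYTES = 15360  # size limit for audio chunks per request


def audio_seconds(num_bytes: int, sample_rate_hz: int = 16000,
                  bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Audio duration represented by num_bytes of uncompressed PCM (assumed format)."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


def chunk_audio(audio: bytes, chunk_bytes: int = MAX_CHUNK_BYTES):
    """Yield successive chunks no larger than the per-request size limit."""
    if not 0 < chunk_bytes <= MAX_CHUNK_BYTES:
        raise ValueError("chunk_bytes must be between 1 and 15360")
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


audio = bytes(16000 * 2 * 3)  # 3 seconds of silence in the assumed format
chunks = list(chunk_audio(audio))
total_seconds = audio_seconds(sum(len(c) for c in chunks))
print(len(chunks), audio_seconds(MAX_CHUNK_BYTES), total_seconds)
```

Under these assumptions a full 15360-byte chunk carries 0.48 seconds of audio, so a timeout can only be resolved to roughly that granularity when maximum-size chunks are sent. Smaller chunks sent at about the rate the audio would play give finer resolution.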