Service that implements the Cobalt Cubic Speech Recognition API
Method Name | Request Type | Response Type | Description |
---|---|---|---|
Version | .google.protobuf.Empty | VersionResponse | Queries the Version of the Server |
ListModels | ListModelsRequest | ListModelsResponse | Retrieves a list of available speech recognition models |
Recognize | RecognizeRequest | RecognitionResponse | Performs synchronous speech recognition: results are received only after all audio has been sent and processed. This method is typically used for short audio content (less than a minute long); for longer content, the StreamingRecognize method should be preferred. |
StreamingRecognize | StreamingRecognizeRequest | RecognitionResponse | Performs bidirectional streaming speech recognition: results are received while audio is still being sent. This method is only available via gRPC, not via HTTP+JSON; however, a web browser may use websockets to access this service. |
CompileContext | CompileContextRequest | CompileContextResponse | Compiles recognition context information, such as a specialized list of words or phrases, into a compact, efficient form to send with subsequent Recognize or StreamingRecognize requests to customize speech recognition. For example, a list of contact names may be compiled in a mobile app and sent with each recognition request so that the app user’s contact names are more likely to be recognized than arbitrary names. This pre-compilation ensures that there is no added latency for the recognition request. Note that a model must support context before context can be compiled for it, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. Also, the compiled data is model-specific; data compiled for one model will generally not be usable with a different model. |
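For orientation, here is a minimal sketch of connecting to a server and calling the Version method. It assumes Go bindings generated from the Cubic proto; the package name cubicpb, its import path, the NewCubicClient constructor, and the server address are all illustrative assumptions, not part of this reference:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/emptypb"

	cubicpb "example.com/cubic/cubicpb" // assumed import path for the generated bindings
)

func main() {
	// Plaintext connection for illustration only; production servers typically require TLS.
	conn, err := grpc.Dial("localhost:2727", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := cubicpb.NewCubicClient(conn) // assumed generated constructor

	// Version takes a google.protobuf.Empty and returns a VersionResponse.
	v, err := client.Version(context.Background(), &emptypb.Empty{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cubic:", v.Cubic, "server:", v.Server)
}
```

The remaining sketches in this document continue from this one, reusing the same assumed bindings and client.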
The top-level message sent by the client for the CompileContext
request. It
contains a list of phrases or words, paired with a context token included in
the model being used. The token specifies a category such as “menu_item”,
“airport”, “contact”, “product_name” etc. The context token is used to
determine the places in the recognition output where the provided list of
phrases or words may appear. The allowed context tokens for a given model can
be found in its ModelAttributes.ContextInfo
obtained via the ListModels
method.
Field | Type | Label | Description |
---|---|---|---|
model_id | string |  | Unique identifier of the model to compile the context information for. The chosen model needs to support context, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. |
token | string |  | The token that is associated with the provided list of phrases or words (e.g. “menu_item”, “airport”, etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method. |
phrases | ContextPhrase | repeated | List of phrases and/or words to be compiled. |
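As a sketch, a CompileContextRequest for a list of contact names might be built like this (continuing the client sketch above; the model ID and token values are placeholders that must come from ListModels):

```go
// Compile a contact list against the "contact" token of a model. The token
// must appear in the model's ModelAttributes.ContextInfo.AllowedContextTokens.
req := &cubicpb.CompileContextRequest{
	ModelId: "en-us-16khz-model", // placeholder; use an ID returned by ListModels
	Token:   "contact",           // placeholder; must be allowed by the model
	Phrases: []*cubicpb.ContextPhrase{
		{Text: "Alice Zhang"},
		{Text: "Bo Xu", Boost: 4.0}, // optional boost; see ContextPhrase below
	},
}
resp, err := client.CompileContext(context.Background(), req)
if err != nil {
	log.Fatal(err)
}
compiled := resp.Context // compact form to send with later recognition requests
```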
The message returned to the client by the CompileContext
method.
Field | Type | Label | Description |
---|---|---|---|
context | CompiledContext |  | Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes. |
Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
Field | Type | Label | Description |
---|---|---|---|
data | bytes |  | The context information compiled by the CompileContext method. |
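Because the compiled form is just bytes and is reusable across requests against the same model, a client might cache it to disk; a minimal sketch continuing the example above, using only the standard library:

```go
// Persist the compiled context so later sessions can skip recompilation.
// Remember that the data is model-specific.
if err := os.WriteFile("contacts.ctx", compiled.Data, 0o644); err != nil {
	log.Fatal(err)
}
```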
An Arc inside a Confusion Network Link
Field | Type | Label | Description |
---|---|---|---|
word | string |  | Word in the recognized transcript |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized. |
A Link inside a confusion network
Field | Type | Label | Description |
---|---|---|---|
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this link |
duration | google.protobuf.Duration |  | Duration of the current link in the confusion network |
arcs | ConfusionNetworkArc | repeated | Arcs inside this link |
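To illustrate how links and arcs fit together, the sketch below walks a received confusion network, printing each link's start time and its competing word hypotheses (cnet here stands for a RecognitionConfusionNetwork taken from a result, as described later):

```go
// Each link covers a time span of the audio; its arcs are the competing
// word hypotheses for that span, each with a confidence estimate.
for _, link := range cnet.Links {
	start := link.StartTime.AsDuration()
	for _, arc := range link.Arcs {
		fmt.Printf("%8v  %-20s %.3f\n", start, arc.Word, arc.Confidence)
	}
}
```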
Model information specific to supporting recognition context.
Field | Type | Label | Description |
---|---|---|---|
supports_context | bool |  | If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig. |
allowed_context_tokens | string | repeated | A list of tokens (e.g. “name”, “airport”, etc.) that serve as placeholders in the model where a client-provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output. |
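A client can check these attributes before attempting to compile context; a sketch continuing the earlier examples:

```go
// List models and report which ones accept context, and with which tokens.
models, err := client.ListModels(context.Background(), &cubicpb.ListModelsRequest{})
if err != nil {
	log.Fatal(err)
}
for _, m := range models.Models {
	if ci := m.Attributes.ContextInfo; ci != nil && ci.SupportsContext {
		fmt.Println(m.Id, "accepts context tokens:", ci.AllowedContextTokens)
	}
}
```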
A phrase or word that is to be compiled into context information that can be
later used to improve speech recognition during a Recognize
or
StreamingRecognize
call. Along with the phrase or word itself, there is an
optional boost parameter that can be used to boost the likelihood of the
phrase or word in the recognition output.
Field | Type | Label | Description |
---|---|---|---|
text | string |  | The actual phrase or word. |
boost | float |  | This is an optional field. The boost value is a positive number used to increase the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar-sounding words, with the desired word given a bigger boost value. By default, all phrases or words are given an equal probability of 1/N (where N = total number of phrases or words). If a boost value is provided, the new probability is (boost + 1) * 1/N. The boosted probabilities for all the phrases or words are then normalized so that they sum to one. This means that the boost value only has an effect if there are relative differences in the values for different phrases or words; if all phrases or words have the same boost value, after normalization they will all still have the same probability. It also means that the boost value can be any positive value, but it is best to stay between 0 and 20. Negative values are not supported and will be treated as 0. |
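To make the normalization concrete, the sketch below reproduces the arithmetic described above for four phrases, one of which has a boost of 3 (plain Go, no API calls):

```go
// Each phrase starts at probability 1/N; a boost of b rescales it to (b+1)/N,
// and the results are then renormalized to sum to one.
boosts := []float64{3, 0, 0, 0} // phrase 0 boosted by 3, the rest unboosted
n := float64(len(boosts))
probs := make([]float64, len(boosts))
sum := 0.0
for i, b := range boosts {
	probs[i] = (b + 1) / n // 4/4, 1/4, 1/4, 1/4
	sum += probs[i]
}
for i := range probs {
	probs[i] /= sum // 4/7, 1/7, 1/7, 1/7 ≈ 0.571, 0.143, 0.143, 0.143
}
fmt.Println(probs)
```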
The top-level message sent by the client for the ListModels
method.
This message is empty and has no fields.
The message returned to the client by the ListModels
method.
Field | Type | Label | Description |
---|---|---|---|
models | Model | repeated | List of models available for use that match the request. |
Description of a Cubic Model
Field | Type | Label | Description |
---|---|---|---|
id | string |  | Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message. |
name | string |  | Model name. This is a concise name describing the model, and may be presented to the end user, for example, to help choose which model to use for their recognition task. |
attributes | ModelAttributes |  | Model attributes |
Attributes of a Cubic Model
Field | Type | Label | Description |
---|---|---|---|
sample_rate | uint32 |  | Audio sample rate supported by the model |
context_info | ContextInfo |  | Attributes specific to supporting recognition context. |
A recognition hypothesis
Field | Type | Label | Description |
---|---|---|---|
transcript | string |  | Text representing the transcription of the words that the user spoke. The transcript will be formatted according to the server’s formatting configuration. If you want the raw transcript, please see the field raw_transcript. As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”. |
raw_transcript | string |  | Text representing the transcription of the words that the user spoke, without any formatting. This field will be populated only if the config RecognitionConfig.enable_raw_transcript is set to true. As an example, if the spoken utterance was “four people”, this field would be set to “four people” even if the formatted transcript reads “4 people”. |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct. |
words | WordInfo | repeated | A list of word-specific information for each recognized word in the transcript. |
raw_words | WordInfo | repeated | A list of word-specific information for each recognized word in the raw_transcript. |
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance. |
duration | google.protobuf.Duration |  | Duration of the current utterance in the spoken audio. |
Audio to be sent to the recognizer
Field | Type | Label | Description |
---|---|---|---|
data | bytes |  |  |
Configuration for setting up a Recognizer
Field | Type | Label | Description |
---|---|---|---|
model_id | string |  | Unique identifier of the model to use, as obtained from a Model message returned by the ListModels method. |
audio_encoding | RecognitionConfig.Encoding |  | Encoding of audio data sent/streamed through the RecognitionAudio messages. If not specified, the default encoding is RAW_LINEAR16. Depending on how they are configured, server instances of this service may not support all the encodings enumerated above. They are always required to accept RAW_LINEAR16. If any other encoding is specified and is not supported by the server, the recognition request will fail with an error. |
idle_timeout | google.protobuf.Duration |  | Idle timeout of the created recognizer. If no audio data is received by the recognizer for this duration, ongoing RPC calls will result in an error, the recognizer will be destroyed, and no more audio may be sent to the same recognizer. The server may impose a limit on the maximum idle timeout that can be specified, and if the value in this message exceeds that server-side value, creation of the recognizer will fail with an error. |
enable_word_time_offsets | bool |  | This is an optional field. If this is set to true, each result will include a list of words with the start time offset (timestamp) and the duration for each of those words. If set to false, no word-level timestamps will be returned. |
enable_word_confidence | bool |  | This is an optional field. If this is set to true, each result will include a list of words and the confidence for those words. If set to false, no word-level confidence will be returned. |
enable_raw_transcript | bool |  | This is an optional field. If this is set to true, the field RecognitionAlternative.raw_transcript will be populated with the unformatted transcript. |
enable_confusion_network | bool |  | This is an optional field. If this is set to true, the results will include a confusion network. If set to false, no confusion network will be returned. |
audio_channels | uint32 | repeated | This is an optional field. If the audio has multiple channels, this field should be configured with the list of channel indices that should be transcribed. Channels are 0-indexed. Example: [0] for the first channel only, or [0, 1] for both channels of a stereo file. If this field is not set, a mono file will be assumed by default and only channel-0 will be transcribed even if the file actually has additional channels. Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order, but the same index may not be repeated in this list. BAD: [0, 1, 1]; GOOD: [0, 1]. |
metadata | RecognitionMetadata |  | This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to Cubic. The server may record this metadata when processing the request. The server does not use this field for any other purpose. |
context | RecognitionContext |  | This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext method. |
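Putting several of these fields together, a typical configuration might look like the sketch below (continuing the earlier examples; durationpb is google.golang.org/protobuf/types/known/durationpb, and all field values are placeholders):

```go
// WAV input with word timings, a raw transcript, and both stereo channels.
cfg := &cubicpb.RecognitionConfig{
	ModelId:               "en-us-16khz-model", // placeholder, from ListModels
	AudioEncoding:         cubicpb.RecognitionConfig_WAV,
	IdleTimeout:           durationpb.New(30 * time.Second),
	EnableWordTimeOffsets: true,
	EnableRawTranscript:   true,
	AudioChannels:         []uint32{0, 1}, // transcribe both channels of a stereo file
}
```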
Confusion network in recognition output
Field | Type | Label | Description |
---|---|---|---|
links | ConfusionNetworkLink | repeated |  |
A collection of additional context information that may aid speech
recognition. This can be used to add out-of-vocabulary words to
the model or to boost recognition of specific proper names or commands.
Field | Type | Label | Description |
---|---|---|---|
compiled | CompiledContext | repeated | List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method. |
Metadata associated with the audio to be recognized.
Field | Type | Label | Description |
---|---|---|---|
custom_metadata | string | Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON) |
A collection of recognition results in a portion of audio. When transcribing a single audio channel (e.g. RAW_LINEAR16 input, or a mono file), results will be ordered chronologically. When transcribing multiple channels, the results of all channels will be interleaved. Results of each individual channel will be chronological. No such promise is made for the ordering of results of different channels, as results are returned for each channel individually as soon as they are ready.
Field | Type | Label | Description |
---|---|---|---|
results | RecognitionResult | repeated |  |
A recognition result corresponding to a portion of audio.
Field | Type | Label | Description |
---|---|---|---|
alternatives | RecognitionAlternative | repeated | An n-best list of recognition hypotheses |
is_partial | bool |  | If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change. Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability. |
cnet | RecognitionConfusionNetwork |  | If enable_confusion_network was set to true in the RecognitionConfig, this field will contain the confusion network for this result. |
audio_channel | uint32 |  | Channel of the audio file that this result was transcribed from. For a mono file, or RAW_LINEAR16 input, this will be set to 0. |
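Given the interleaving described above, a client transcribing multichannel audio will usually want to regroup results by channel. A sketch, assuming only final results are wanted and that the first entry in the n-best alternatives list is the most likely one (that ordering is an assumption, not stated by this reference):

```go
// Collect final transcripts per channel, skipping interim partial results.
byChannel := map[uint32][]string{}
for _, r := range resp.Results {
	if r.IsPartial || len(r.Alternatives) == 0 {
		continue
	}
	byChannel[r.AudioChannel] = append(byChannel[r.AudioChannel], r.Alternatives[0].Transcript)
}
for ch, lines := range byChannel {
	fmt.Println("channel", ch, ":", strings.Join(lines, " "))
}
```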
The top-level message sent by the client for the Recognize
method. Both
the RecognitionConfig
and RecognitionAudio
fields are required. The
entire audio data must be sent in one request. If your audio data is larger,
please use the StreamingRecognize
call.
Field | Type | Label | Description |
---|---|---|---|
config | RecognitionConfig |  | Provides configuration to create the recognizer. |
audio | RecognitionAudio |  | The audio data to be recognized |
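A synchronous request is then a single call; the sketch below reads a short WAV file and prints the best final hypothesis for each result (continuing the earlier examples, with the file path as a placeholder):

```go
// Read the whole file up front: Recognize requires all audio in one request.
audio, err := os.ReadFile("utterance.wav") // placeholder path
if err != nil {
	log.Fatal(err)
}
rec, err := client.Recognize(context.Background(), &cubicpb.RecognizeRequest{
	Config: cfg, // the RecognitionConfig built earlier
	Audio:  &cubicpb.RecognitionAudio{Data: audio},
})
if err != nil {
	log.Fatal(err)
}
for _, r := range rec.Results {
	if !r.IsPartial && len(r.Alternatives) > 0 {
		fmt.Println(r.Alternatives[0].Transcript)
	}
}
```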
The top-level message sent by the client for the StreamingRecognize
request. Multiple StreamingRecognizeRequest
messages are sent. The first
message must contain a RecognitionConfig
message only, and all subsequent
messages must contain RecognitionAudio
only. All RecognitionAudio
messages must contain non-empty audio. If audio content is empty, the server
may interpret it as end of stream and stop accepting any further messages.
Field | Type | Label | Description |
---|---|---|---|
config | RecognitionConfig |  |  |
audio | RecognitionAudio |  |  |
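The sketch below follows that protocol: one config message, then audio chunks, then close. For simplicity it reads responses only after sending finishes; a production client would read in a separate goroutine to receive results while still sending. Note also that if the proto declares config and audio as a oneof, the generated Go wrappers would be StreamingRecognizeRequest_Config and StreamingRecognizeRequest_Audio rather than the plain fields assumed here:

```go
// Open the bidirectional stream and send the configuration first.
stream, err := client.StreamingRecognize(context.Background())
if err != nil {
	log.Fatal(err)
}
if err := stream.Send(&cubicpb.StreamingRecognizeRequest{Config: cfg}); err != nil {
	log.Fatal(err)
}

// Stream the audio in non-empty chunks; empty audio may be taken as end of stream.
f, err := os.Open("utterance.raw") // placeholder; audio in the configured encoding
if err != nil {
	log.Fatal(err)
}
defer f.Close()
buf := make([]byte, 8192)
for {
	n, readErr := f.Read(buf)
	if n > 0 {
		msg := &cubicpb.StreamingRecognizeRequest{
			Audio: &cubicpb.RecognitionAudio{Data: buf[:n]},
		}
		if err := stream.Send(msg); err != nil {
			log.Fatal(err)
		}
	}
	if readErr == io.EOF {
		break
	}
	if readErr != nil {
		log.Fatal(readErr)
	}
}
if err := stream.CloseSend(); err != nil {
	log.Fatal(err)
}

// Drain responses until the server closes its side of the stream.
for {
	rec, err := stream.Recv()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range rec.Results {
		if !r.IsPartial && len(r.Alternatives) > 0 {
			fmt.Println(r.Alternatives[0].Transcript)
		}
	}
}
```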
The message sent by the server for the Version
method.
Field | Type | Label | Description |
---|---|---|---|
cubic | string |  | Version of the Cubic library handling the recognition |
server | string |  | Version of the server handling these requests |
Word-specific information for recognized words
Field | Type | Label | Description |
---|---|---|---|
word | string |  | The actual word in the text |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized. |
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word. |
duration | google.protobuf.Duration |  | Duration of the current word in the spoken audio. |
The encoding of the audio data to be sent for recognition.
For best results, the audio source should be captured and transmitted using the RAW_LINEAR16 encoding.
Name | Number | Description |
---|---|---|
RAW_LINEAR16 | 0 | Raw (headerless) uncompressed 16-bit signed little-endian samples (linear PCM), single channel, sampled at the rate expected by the chosen Model. |
WAV | 1 | WAV (data with RIFF headers), with data sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
MP3 | 2 | MP3 data, sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
FLAC | 3 | FLAC data, sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
VOX8000 | 4 | VOX data (Dialogic ADPCM), sampled at 8 kHz. |
ULAW8000 | 5 | μ-law (8-bit) encoded RAW data, single channel, sampled at 8 kHz. |
See the protocol buffer documentation for these well-known types.
.proto Type | Notes |
---|---|
Duration | A signed, fixed-length span of time, represented as a count of seconds and fractions of seconds at nanosecond resolution |
Empty | Used to indicate a method takes or returns nothing |