Service that implements the Cobalt Cubic Speech Recognition API
Method Name | Request Type | Response Type | Description |
---|---|---|---|
Version | .google.protobuf.Empty | VersionResponse | Queries the Version of the Server |
ListModels | ListModelsRequest | ListModelsResponse | Retrieves a list of available speech recognition models |
Recognize | RecognizeRequest | RecognitionResponse | Performs synchronous speech recognition: results are received only after all audio has been sent and processed. This method is typically used for short audio content (less than a minute long); for longer content, the StreamingRecognize method should be preferred. |
StreamingRecognize | StreamingRecognizeRequest | RecognitionResponse | Performs bidirectional streaming speech recognition: results are received while audio is still being sent. This method is only available via gRPC, not via HTTP+JSON; however, a web browser may use websockets to access this service. |
CompileContext | CompileContextRequest | CompileContextResponse | Compiles recognition context information, such as a specialized list of words or phrases, into a compact, efficient form to send with subsequent Recognize or StreamingRecognize requests to customize speech recognition. For example, a list of contact names may be compiled in a mobile app and sent with each recognition request so that the app user’s contact names are more likely to be recognized than arbitrary names. This pre-compilation ensures that there is no added latency for the recognition request. Note that a model must support context before context can be compiled for it, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. Also, the compiled data is model-specific; data compiled for one model will generally not be usable with a different model. |
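For orientation, here is a minimal sketch of connecting to a server and calling the Version method. It assumes Go bindings generated from the Cubic proto; the package name cubicpb, its import path, the NewCubicClient constructor, and the server address are all illustrative assumptions, not part of this reference:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/emptypb"

	cubicpb "example.com/cubic/cubicpb" // assumed import path for the generated bindings
)

func main() {
	// Plaintext connection for illustration only; production servers typically require TLS.
	conn, err := grpc.Dial("localhost:2727", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := cubicpb.NewCubicClient(conn) // assumed generated constructor

	// Version takes a google.protobuf.Empty and returns a VersionResponse.
	v, err := client.Version(context.Background(), &emptypb.Empty{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cubic:", v.Cubic, "server:", v.Server)
}
```

The remaining sketches in this document continue from this one, reusing the same assumed bindings and client.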
The top-level message sent by the client for the CompileContext
request. It
contains a list of phrases or words, paired with a context token included in
the model being used. The token specifies a category such as “menu_item”,
“airport”, “contact”, “product_name” etc. The context token is used to
determine the places in the recognition output where the provided list of
phrases or words may appear. The allowed context tokens for a given model can
be found in its ModelAttributes.ContextInfo
obtained via the ListModels
method.
Field | Type | Label | Description |
---|---|---|---|
model_id | string |  | Unique identifier of the model to compile the context information for. The chosen model needs to support context, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. |
token | string |  | The token that is associated with the provided list of phrases or words (e.g. “menu_item”, “airport”, etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method. |
phrases | ContextPhrase | repeated | List of phrases and/or words to be compiled. |
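As a sketch, a CompileContextRequest for a list of contact names might be built like this (continuing the client sketch above; the model ID and token values are placeholders that must come from ListModels):

```go
// Compile a contact list against the "contact" token of a model. The token
// must appear in the model's ModelAttributes.ContextInfo.AllowedContextTokens.
req := &cubicpb.CompileContextRequest{
	ModelId: "en-us-16khz-model", // placeholder; use an ID returned by ListModels
	Token:   "contact",           // placeholder; must be allowed by the model
	Phrases: []*cubicpb.ContextPhrase{
		{Text: "Alice Zhang"},
		{Text: "Bo Xu", Boost: 4.0}, // optional boost; see ContextPhrase below
	},
}
resp, err := client.CompileContext(context.Background(), req)
if err != nil {
	log.Fatal(err)
}
compiled := resp.Context // compact form to send with later recognition requests
```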
The message returned to the client by the CompileContext
method.
Field | Type | Label | Description |
---|---|---|---|
context | CompiledContext |  | Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes. |
Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
Field | Type | Label | Description |
---|---|---|---|
data | bytes |  | The context information compiled by the CompileContext method. |
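Because the compiled form is just bytes and is reusable across requests against the same model, a client might cache it to disk; a minimal sketch continuing the example above, using only the standard library:

```go
// Persist the compiled context so later sessions can skip recompilation.
// Remember that the data is model-specific.
if err := os.WriteFile("contacts.ctx", compiled.Data, 0o644); err != nil {
	log.Fatal(err)
}
```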
An Arc inside a Confusion Network Link
Field | Type | Label | Description |
---|---|---|---|
word | string |  | Word in the recognized transcript |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized. |
A Link inside a confusion network
Field | Type | Label | Description |
---|---|---|---|
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this link |
duration | google.protobuf.Duration |  | Duration of the current link in the confusion network |
arcs | ConfusionNetworkArc | repeated | Arcs inside this link |
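To illustrate how links and arcs fit together, the sketch below walks a received confusion network, printing each link's start time and its competing word hypotheses (cnet here stands for a RecognitionConfusionNetwork taken from a result, as described later):

```go
// Each link covers a time span of the audio; its arcs are the competing
// word hypotheses for that span, each with a confidence estimate.
for _, link := range cnet.Links {
	start := link.StartTime.AsDuration()
	for _, arc := range link.Arcs {
		fmt.Printf("%8v  %-20s %.3f\n", start, arc.Word, arc.Confidence)
	}
}
```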
Model information specific to supporting recognition context.
Field | Type | Label | Description |
---|---|---|---|
supports_context | bool |  | If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig. |
allowed_context_tokens | string | repeated | A list of tokens (e.g. “name”, “airport”, etc.) that serve as placeholders in the model where a client-provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output. |
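A client can check these attributes before attempting to compile context; a sketch continuing the earlier examples:

```go
// List models and report which ones accept context, and with which tokens.
models, err := client.ListModels(context.Background(), &cubicpb.ListModelsRequest{})
if err != nil {
	log.Fatal(err)
}
for _, m := range models.Models {
	if ci := m.Attributes.ContextInfo; ci != nil && ci.SupportsContext {
		fmt.Println(m.Id, "accepts context tokens:", ci.AllowedContextTokens)
	}
}
```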
A phrase or word that is to be compiled into context information that can be
later used to improve speech recognition during a Recognize
or
StreamingRecognize
call. Along with the phrase or word itself, there is an
optional boost parameter that can be used to boost the likelihood of the
phrase or word in the recognition output.
Field | Type | Label | Description |
---|---|---|---|
text | string |  | The actual phrase or word. |
boost | float |  | This is an optional field. The boost value is a positive number used to increase the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar-sounding words, with the desired word given a bigger boost value. By default, all phrases or words are given an equal probability of 1/N (where N = total number of phrases or words). If a boost value is provided, the new probability is (boost + 1) * 1/N. The boosted probabilities for all the phrases or words are then normalized so that they sum to one. This means that the boost value only has an effect if there are relative differences in the values for different phrases or words; if all phrases or words have the same boost value, after normalization they will all still have the same probability. It also means that the boost value can be any positive value, but it is best to stay between 0 and 20. Negative values are not supported and will be treated as 0. |
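To make the normalization concrete, the sketch below reproduces the arithmetic described above for four phrases, one of which has a boost of 3 (plain Go, no API calls):

```go
// Each phrase starts at probability 1/N; a boost of b rescales it to (b+1)/N,
// and the results are then renormalized to sum to one.
boosts := []float64{3, 0, 0, 0} // phrase 0 boosted by 3, the rest unboosted
n := float64(len(boosts))
probs := make([]float64, len(boosts))
sum := 0.0
for i, b := range boosts {
	probs[i] = (b + 1) / n // 4/4, 1/4, 1/4, 1/4
	sum += probs[i]
}
for i := range probs {
	probs[i] /= sum // 4/7, 1/7, 1/7, 1/7 ≈ 0.571, 0.143, 0.143, 0.143
}
fmt.Println(probs)
```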
The top-level message sent by the client for the ListModels
method.
This message is empty and has no fields.
The message returned to the client by the ListModels
method.
Field | Type | Label | Description |
---|---|---|---|
models | Model | repeated | List of models available for use that match the request. |
Description of a Cubic Model
Field | Type | Label | Description |
---|---|---|---|
id | string |  | Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message. |
name | string |  | Model name. This is a concise name describing the model, and may be presented to the end user, for example, to help choose which model to use for their recognition task. |
attributes | ModelAttributes |  | Model attributes |
Attributes of a Cubic Model
Field | Type | Label | Description |
---|---|---|---|
sample_rate | uint32 |  | Audio sample rate supported by the model |
context_info | ContextInfo |  | Attributes specific to supporting recognition context. |
A recognition hypothesis
Field | Type | Label | Description |
---|---|---|---|
transcript | string |  | Text representing the transcription of the words that the user spoke. The transcript will be formatted according to the server’s formatting configuration. If you want the raw transcript, please see the field raw_transcript. As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”. |
raw_transcript | string |  | Text representing the transcription of the words that the user spoke, without any formatting. This field will be populated only if the config RecognitionConfig.enable_raw_transcript is set to true. As an example, if the spoken utterance was “four people”, this field would be set to “four people” even if the formatted transcript reads “4 people”. |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct. |
words | WordInfo | repeated | A list of word-specific information for each recognized word in the transcript. |
raw_words | WordInfo | repeated | A list of word-specific information for each recognized word in the raw_transcript. |
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance. |
duration | google.protobuf.Duration |  | Duration of the current utterance in the spoken audio. |
Audio to be sent to the recognizer
Field | Type | Label | Description |
---|---|---|---|
data | bytes |  |  |
Configuration for setting up a Recognizer
Field | Type | Label | Description |
---|---|---|---|
model_id | string |  | Unique identifier of the model to use, as obtained from a Model message returned by the ListModels method. |
audio_encoding | RecognitionConfig.Encoding |  | Encoding of audio data sent/streamed through the RecognitionAudio messages. If not specified, the default encoding is RAW_LINEAR16. Depending on how they are configured, server instances of this service may not support all the encodings enumerated above. They are always required to accept RAW_LINEAR16. If any other encoding is specified and is not supported by the server, the recognition request will fail with an error. |
idle_timeout | google.protobuf.Duration |  | Idle timeout of the created recognizer. If no audio data is received by the recognizer for this duration, ongoing RPC calls will result in an error, the recognizer will be destroyed, and no more audio may be sent to the same recognizer. The server may impose a limit on the maximum idle timeout that can be specified, and if the value in this message exceeds that server-side value, creation of the recognizer will fail with an error. |
enable_word_time_offsets | bool |  | This is an optional field. If this is set to true, each result will include a list of words with the start time offset (timestamp) and the duration for each of those words. If set to false, no word-level timestamps will be returned. |
enable_word_confidence | bool |  | This is an optional field. If this is set to true, each result will include a list of words and the confidence for those words. If set to false, no word-level confidence will be returned. |
enable_raw_transcript | bool |  | This is an optional field. If this is set to true, the field RecognitionAlternative.raw_transcript will be populated with the unformatted transcript. |
enable_confusion_network | bool |  | This is an optional field. If this is set to true, the results will include a confusion network. If set to false, no confusion network will be returned. |
audio_channels | uint32 | repeated | This is an optional field. If the audio has multiple channels, this field should be configured with the list of channel indices that should be transcribed. Channels are 0-indexed. Example: [0] for the first channel only, or [0, 1] for both channels of a stereo file. If this field is not set, a mono file will be assumed by default and only channel-0 will be transcribed even if the file actually has additional channels. Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order, but the same index may not be repeated in this list. BAD: [0, 1, 1]; GOOD: [0, 1]. |
metadata | RecognitionMetadata |  | This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to Cubic. The server may record this metadata when processing the request. The server does not use this field for any other purpose. |
context | RecognitionContext |  | This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext method. |
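Putting several of these fields together, a typical configuration might look like the sketch below (continuing the earlier examples; durationpb is google.golang.org/protobuf/types/known/durationpb, and all field values are placeholders):

```go
// WAV input with word timings, a raw transcript, and both stereo channels.
cfg := &cubicpb.RecognitionConfig{
	ModelId:               "en-us-16khz-model", // placeholder, from ListModels
	AudioEncoding:         cubicpb.RecognitionConfig_WAV,
	IdleTimeout:           durationpb.New(30 * time.Second),
	EnableWordTimeOffsets: true,
	EnableRawTranscript:   true,
	AudioChannels:         []uint32{0, 1}, // transcribe both channels of a stereo file
}
```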
Confusion network in recognition output
Field | Type | Label | Description |
---|---|---|---|
links | ConfusionNetworkLink | repeated |  |
A collection of additional context information that may aid speech
recognition. This can be used to add out-of-vocabulary words to
the model or to boost recognition of specific proper names or commands.
Field | Type | Label | Description |
---|---|---|---|
compiled | CompiledContext | repeated | List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method. |
Metadata associated with the audio to be recognized.
Field | Type | Label | Description |
---|---|---|---|
custom_metadata | string | Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON) |
A collection of recognition results in a portion of audio. When transcribing a single audio channel (e.g. RAW_LINEAR16 input, or a mono file), results will be ordered chronologically. When transcribing multiple channels, the results of all channels will be interleaved. Results of each individual channel will be chronological. No such promise is made for the ordering of results of different channels, as results are returned for each channel individually as soon as they are ready.
Field | Type | Label | Description |
---|---|---|---|
results | RecognitionResult | repeated |  |
A recognition result corresponding to a portion of audio.
Field | Type | Label | Description |
---|---|---|---|
alternatives | RecognitionAlternative | repeated | An n-best list of recognition hypotheses |
is_partial | bool |  | If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change. Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability. |
cnet | RecognitionConfusionNetwork |  | If enable_confusion_network was set to true in the RecognitionConfig, this field will contain the confusion network for this result. |
audio_channel | uint32 |  | Channel of the audio file that this result was transcribed from. For a mono file, or RAW_LINEAR16 input, this will be set to 0. |
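Given the interleaving described above, a client transcribing multichannel audio will usually want to regroup results by channel. A sketch, assuming only final results are wanted and that the first entry in the n-best alternatives list is the most likely one (that ordering is an assumption, not stated by this reference):

```go
// Collect final transcripts per channel, skipping interim partial results.
byChannel := map[uint32][]string{}
for _, r := range resp.Results {
	if r.IsPartial || len(r.Alternatives) == 0 {
		continue
	}
	byChannel[r.AudioChannel] = append(byChannel[r.AudioChannel], r.Alternatives[0].Transcript)
}
for ch, lines := range byChannel {
	fmt.Println("channel", ch, ":", strings.Join(lines, " "))
}
```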
The top-level message sent by the client for the Recognize
method. Both
the RecognitionConfig
and RecognitionAudio
fields are required. The
entire audio data must be sent in one request. If your audio data is larger,
please use the StreamingRecognize
call.
Field | Type | Label | Description |
---|---|---|---|
config | RecognitionConfig |  | Provides configuration to create the recognizer. |
audio | RecognitionAudio |  | The audio data to be recognized |
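A synchronous request is then a single call; the sketch below reads a short WAV file and prints the best final hypothesis for each result (continuing the earlier examples, with the file path as a placeholder):

```go
// Read the whole file up front: Recognize requires all audio in one request.
audio, err := os.ReadFile("utterance.wav") // placeholder path
if err != nil {
	log.Fatal(err)
}
rec, err := client.Recognize(context.Background(), &cubicpb.RecognizeRequest{
	Config: cfg, // the RecognitionConfig built earlier
	Audio:  &cubicpb.RecognitionAudio{Data: audio},
})
if err != nil {
	log.Fatal(err)
}
for _, r := range rec.Results {
	if !r.IsPartial && len(r.Alternatives) > 0 {
		fmt.Println(r.Alternatives[0].Transcript)
	}
}
```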
The top-level message sent by the client for the StreamingRecognize
request. Multiple StreamingRecognizeRequest
messages are sent. The first
message must contain a RecognitionConfig
message only, and all subsequent
messages must contain RecognitionAudio
only. All RecognitionAudio
messages must contain non-empty audio. If audio content is empty, the server
may interpret it as end of stream and stop accepting any further messages.
Field | Type | Label | Description |
---|---|---|---|
config | RecognitionConfig |  |  |
audio | RecognitionAudio |  |  |
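The sketch below follows that protocol: one config message, then audio chunks, then close. For simplicity it reads responses only after sending finishes; a production client would read in a separate goroutine to receive results while still sending. Note also that if the proto declares config and audio as a oneof, the generated Go wrappers would be StreamingRecognizeRequest_Config and StreamingRecognizeRequest_Audio rather than the plain fields assumed here:

```go
// Open the bidirectional stream and send the configuration first.
stream, err := client.StreamingRecognize(context.Background())
if err != nil {
	log.Fatal(err)
}
if err := stream.Send(&cubicpb.StreamingRecognizeRequest{Config: cfg}); err != nil {
	log.Fatal(err)
}

// Stream the audio in non-empty chunks; empty audio may be taken as end of stream.
f, err := os.Open("utterance.raw") // placeholder; audio in the configured encoding
if err != nil {
	log.Fatal(err)
}
defer f.Close()
buf := make([]byte, 8192)
for {
	n, readErr := f.Read(buf)
	if n > 0 {
		msg := &cubicpb.StreamingRecognizeRequest{
			Audio: &cubicpb.RecognitionAudio{Data: buf[:n]},
		}
		if err := stream.Send(msg); err != nil {
			log.Fatal(err)
		}
	}
	if readErr == io.EOF {
		break
	}
	if readErr != nil {
		log.Fatal(readErr)
	}
}
if err := stream.CloseSend(); err != nil {
	log.Fatal(err)
}

// Drain responses until the server closes its side of the stream.
for {
	rec, err := stream.Recv()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range rec.Results {
		if !r.IsPartial && len(r.Alternatives) > 0 {
			fmt.Println(r.Alternatives[0].Transcript)
		}
	}
}
```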
The message sent by the server for the Version
method.
Field | Type | Label | Description |
---|---|---|---|
cubic | string |  | Version of the Cubic library handling the recognition |
server | string |  | Version of the server handling these requests |
Word-specific information for recognized words
Field | Type | Label | Description |
---|---|---|---|
word | string |  | The actual word in the text |
confidence | double |  | Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized. |
start_time | google.protobuf.Duration |  | Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word. |
duration | google.protobuf.Duration |  | Duration of the current word in the spoken audio. |
The encoding of the audio data to be sent for recognition.
For best results, the audio source should be captured and transmitted using the RAW_LINEAR16 encoding.
Name | Number | Description |
---|---|---|
RAW_LINEAR16 | 0 | Raw (headerless) uncompressed 16-bit signed little-endian samples (linear PCM), single channel, sampled at the rate expected by the chosen Model. |
WAV | 1 | WAV (data with RIFF headers), with data sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
MP3 | 2 | MP3 data, sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
FLAC | 3 | FLAC data, sampled at a rate equal to or higher than the sample rate expected by the chosen Model. |
VOX8000 | 4 | VOX data (Dialogic ADPCM), sampled at 8 kHz. |
ULAW8000 | 5 | μ-law (8-bit) encoded RAW data, single channel, sampled at 8 kHz. |
See the protocol buffer documentation for these well-known types.
.proto Type | Notes |
---|---|
Duration | A signed, fixed-length span of time, represented as a count of seconds and fractions of seconds at nanosecond resolution |
Empty | Used to indicate a method takes or returns nothing |