An in-depth explanation of the various fields of the complete SDK can be found here; the sub-section RecognitionConfig is particularly important. This page discusses the more common combinations sent to the server.
Here is a quick overview of the fields:
Field | Required | Default | Description |
---|---|---|---|
model_id | Yes | - | Unique ID of the model to use. |
audio_encoding | Yes | - | Encoding format of the audio, such as RAW_LINEAR_16, WAV, MP3, etc. |
idle_timeout | No | 0s (unlimited) | Maximum time allowed between each gRPC message. The server may place further restrictions depending on its configuration. |
enable_word_time_offsets | No | false | Toggles the calculation of word-level timestamps. The specified model must also support word-level timestamps for this field to be populated. |
enable_word_confidence | No | false | Toggles the calculation of word-level confidence scores. The specified model must also support word-level confidence for this field to be populated. |
enable_raw_transcript | No | false | If true, the raw transcript will be included in the results. |
enable_confusion_network | No | false | Toggles the inclusion of a confusion network, consisting of multiple alternative transcriptions. The specified model must also support confusion networks for this field to be populated. |
audio_channels | No | [0] (mono) | Specifies which channels of a multi-channel audio file are to be transcribed, each as its own individual audio stream. |
metadata | No | "" | Can be used to send any custom metadata associated with the audio being sent. The server may record this metadata when processing the request; it does not use this field for any other purpose. |
context | No | nil | Can be used to provide any context information that can aid speech recognition, such as probable phrases or words that may appear in the recognition output, or even out-of-vocabulary words for the model being used. Currently, all context information must first be pre-compiled via CompileContext(). |
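To make the table concrete, here is a sketch of a fuller config, written as a Python dict that would be serialized to JSON before being sent. Every value is a placeholder chosen for illustration; in particular, the idle_timeout string format and the metadata value are assumptions, not server-verified settings.

```python
# A sketch of a RecognitionConfig exercising several optional fields.
# Field names follow the table above; every value here is a placeholder.
config = {
    "model_id": "1",                   # unique ID of the model to use
    "audio_encoding": "raw",           # encoding of the audio being sent
    "idle_timeout": "30s",             # allow up to 30s between gRPC messages
    "enable_word_time_offsets": True,  # word-level timestamps
    "enable_word_confidence": True,    # word-level confidence scores
    "enable_raw_transcript": True,     # include the unformatted transcript
    "enable_confusion_network": True,  # include the confusion network
    "audio_channels": [0, 1],          # transcribe both channels of a stereo file
    "metadata": "call-id=1234",        # free-form; may be recorded by the server
}
```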
The most basic use case is getting a formatted transcript. This would simply need a config such as:
```json
{
  "model_id": "1",
  "audio_encoding": "raw"
}
```
and your resulting transcript would be found at `results.alternatives[*].transcript`.
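As a sketch of how that path is read in practice, the loop below pulls the transcript out of a response that has already been deserialized from JSON. The sample response fragment is illustrative, not real server output.

```python
# Illustrative fragment following the results.alternatives[*] layout
# quoted above; not real server output.
response = {
    "results": [
        {"alternatives": [{"transcript": "Hello, world."}]},
    ]
}

for result in response.get("results", []):
    for alt in result.get("alternatives", []):
        print(alt["transcript"])  # the formatted transcript
```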
Sometimes, only the raw (unformatted) transcript is desired. In this case, there are two options:

1. If the model does not apply formatting, the raw transcript is already what appears in the `results.alternatives[*].transcript` field.
2. Set the `enable_raw_transcript = true` flag, and access the field `results.alternatives[*].raw_transcript`.
Note: Prior to `cubicsvr` v2.9.0 and SDK-Cubic v1.3.0, the field `results.alternatives[*].transcript` was populated with either the raw or the formatted transcription, depending on the `enable_raw_transcript` config. After these changes, raw transcripts have been moved to a new field, `results.alternatives[*].raw_transcript`, which is populated only when `enable_raw_transcript` is set to true.
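If client code has to run against servers on both sides of that change, a small fallback helper may be useful. This is a sketch under the assumption that a pre-v2.9.0 server simply omits the `raw_transcript` field:

```python
def raw_transcript(alternative: dict) -> str:
    """Best-effort raw transcript from a single alternative.

    On cubicsvr >= v2.9.0 (with enable_raw_transcript set) the raw text
    is in 'raw_transcript'; on older servers it may have been returned
    in 'transcript' instead. The fallback order is an assumption.
    """
    return alternative.get("raw_transcript") or alternative.get("transcript", "")

# Hypothetical usage with a pre-v2.9.0-style alternative:
print(raw_transcript({"transcript": "it is twenty one today"}))
```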
Another use case would be getting both the formatted and raw transcripts. This can be done using this config:
```json
{
  "model_id": "1",
  "audio_encoding": "raw",
  "enable_raw_transcript": true
}
```
The formatted transcript would be found at `results.alternatives[*].transcript`, and the raw transcript would be found at `results.alternatives[*].raw_transcript`.
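For instance, a sketch that prints both fields side by side from a deserialized response (the sample fragment is illustrative only):

```python
# Illustrative fragment only; a real response would carry more fields.
response = {
    "results": [{"alternatives": [
        {"transcript": "It is 21 today.", "raw_transcript": "it is twenty one today"},
    ]}]
}

for result in response["results"]:
    for alt in result["alternatives"]:
        print("formatted:", alt.get("transcript", ""))
        print("raw:      ", alt.get("raw_transcript", ""))
```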
If you need to know the timestamp for each word, to align subtitles with a video, for example, then you can use this config to enable word-level timestamps. Please note the inclusion of `enable_raw_transcript`; the word-level information corresponds to the raw transcript, since the formatter may combine multiple words into one symbol in the formatted transcript (e.g. “twenty one” to “21”):
```json
{
  "model_id": "1",
  "audio_encoding": "raw",
  "enable_raw_transcript": true,
  "enable_word_time_offsets": true
}
```
Timestamps will be found at `results.alternatives[*].words[*].start_time` and `results.alternatives[*].words[*].duration`.
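As a sketch of the subtitle use case, the snippet below turns those word entries into simple time-stamped cues. It assumes the protobuf durations are rendered in JSON as strings like "0.250s"; if your client library exposes them as numbers instead, drop the parsing step.

```python
# Turn word-level timing into simple subtitle-style cues.
def seconds(duration: str) -> float:
    """Parse a protobuf-style JSON duration such as '0.250s' (assumed format)."""
    return float(duration.rstrip("s"))

response = {  # illustrative fragment, not real server output
    "results": [{"alternatives": [{"words": [
        {"word": "hello", "start_time": "0.250s", "duration": "0.400s"},
        {"word": "world", "start_time": "0.700s", "duration": "0.350s"},
    ]}]}]
}

for result in response["results"]:
    for alt in result["alternatives"]:
        for w in alt.get("words", []):
            start = seconds(w["start_time"])
            end = start + seconds(w["duration"])
            print(f"{start:6.2f} -> {end:6.2f}  {w['word']}")
```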
Word-level confidences, e.g. for displaying a lighter color for less confident words, can be included much like the word-level timestamps. Please note the inclusion of `enable_raw_transcript`:
```json
{
  "model_id": "1",
  "audio_encoding": "raw",
  "enable_raw_transcript": true,
  "enable_word_confidence": true
}
```
Word-level confidence scores will be found at `results.alternatives[*].words[*].confidence`.
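A sketch of the display idea mentioned above: mark any word whose confidence falls below a threshold. The 0.8 cutoff is arbitrary, and the response fragment is illustrative only.

```python
LOW_CONFIDENCE = 0.8  # arbitrary threshold for "less confident" words

response = {  # illustrative fragment, not real server output
    "results": [{"alternatives": [{"words": [
        {"word": "hello", "confidence": 0.97},
        {"word": "wrold", "confidence": 0.42},
    ]}]}]
}

for result in response["results"]:
    for alt in result["alternatives"]:
        for w in alt.get("words", []):
            marker = "??" if w["confidence"] < LOW_CONFIDENCE else "  "
            print(f"{marker} {w['word']} ({w['confidence']:.2f})")
```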
For applications that need more than the one-best transcription, the most comprehensive and detailed Cubic results are found in the confusion network. Please refer to the in-depth confusion network documentation to see what is included.
To enable the confusion network, the config will look similar to this:
```json
{
  "model_id": "1",
  "audio_encoding": "raw",
  "enable_raw_transcript": true,
  "enable_confusion_network": true
}
```
The confusion network can be accessed at `result.cnet`.
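Since the structure of the confusion network is covered in its own documentation, the only safe generic sketch here is to inspect the field directly. This assumes a deserialized JSON result with `cnet` present:

```python
import json

result = {"cnet": []}  # illustrative placeholder; see the confusion
                       # network documentation for the real structure

print(json.dumps(result.get("cnet"), indent=2))
```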