Juzu API Reference

The Juzu API is specified as a proto file. This section of the documentation is auto-generated from the spec. It describes the data types and functions defined in the spec. The “messages” below correspond to the data structures to be used, and the “service” contains the methods that can be called.


Service: Juzu

Service that implements the Cobalt Juzu Diarization API.

Method Name Request Type Response Type Description
Version .google.protobuf.Empty VersionResponse Queries the Version of the Server.
ListModels .google.protobuf.Empty ListModelsResponse Retrieves a list of available diarization models.
StreamingDiarize StreamingDiarizeRequest DiarizationResponse Performs bidirectional streaming so that audio can be processed as it is sent, with the option of receiving partial transcripts of the audio along with speaker IDs. Diarization is not yet truly streaming: results are returned only after specific chunks of audio have been sent. This method is available only via gRPC, not via HTTP+JSON. However, a web browser may use WebSockets to reach this service.

Message: DiarizationAudio

Audio to be sent to the diarizer.

Field Type Label Description
data bytes

Message: DiarizationConfig

Configuration for setting up a Diarizer.

Field Type Label Description
model_id string

ID of the diarization model to use on the server. It can be obtained by first retrieving the list of models on the server via ListModels().

num_speakers uint32

The number of speakers expected in the audio. If the number of speakers is unknown, set this to 0.

sample_rate uint32

Sampling rate of the audio to process.

audio_encoding DiarizationConfig.Encoding

Encoding of audio data sent/streamed through the DiarizationAudio messages. For encodings like WAV/MP3 that have headers, the headers are expected to be sent at the beginning of the stream, not in every DiarizationAudio message.

If not specified, the default encoding is RAW_LINEAR16.

Depending on how they are configured, server instances of this service may not support all the encodings enumerated in DiarizationConfig.Encoding. They are always required to accept RAW_LINEAR16. If any other Encoding is specified and it is not available on the server being used, the diarization request will result in an appropriate error message.

cubic_model_id string

Unique identifier of the Cubic model to be used for speech recognition. If this value is specified, transcription results from the Cubic model with the given ID will also be returned alongside speaker labels. If it is omitted or blank, the results will not include transcripts, even if the Cubic server was included in the deployed image.

enable_raw_transcript bool

Returns unformatted transcript.
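
As an illustration of how the fields above fit together, the following sketch mirrors DiarizationConfig as a plain Python dataclass. It is a stand-in and not the generated protobuf class: the names `DiarizationConfigSketch` and `make_config`, and the model ID "1", are hypothetical. The defaults follow the field descriptions above (num_speakers of 0 for an unknown speaker count, RAW_LINEAR16 as the default encoding).

```python
from dataclasses import dataclass
from enum import IntEnum


class Encoding(IntEnum):
    """Mirrors the DiarizationConfig.Encoding values in this reference."""
    RAW_LINEAR16 = 0
    WAV = 1
    FLAC = 2


@dataclass
class DiarizationConfigSketch:
    """Hypothetical stand-in for the generated DiarizationConfig message."""
    model_id: str
    num_speakers: int = 0                              # 0: speaker count unknown
    sample_rate: int = 0
    audio_encoding: Encoding = Encoding.RAW_LINEAR16   # documented default
    cubic_model_id: str = ""                           # blank: no transcripts
    enable_raw_transcript: bool = False


def make_config(model_id, sample_rate, num_speakers=None, cubic_model_id=""):
    """Build a config, mapping an unknown speaker count (None) to 0."""
    return DiarizationConfigSketch(
        model_id=model_id,
        num_speakers=0 if num_speakers is None else num_speakers,
        sample_rate=sample_rate,
        cubic_model_id=cubic_model_id,
    )


config = make_config("1", sample_rate=16000)
print(config)
```

With real generated code, the same values would be set on the generated DiarizationConfig message instead of this dataclass.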

Message: DiarizationResponse

A collection of diarization results for a portion of audio. Juzu currently requires the full audio to determine which audio segments belong to which speaker.

Field Type Label Description
results DiarizationResult repeated

Message: DiarizationResult

A diarization result corresponding to a portion of audio.

Field Type Label Description
segments Segment repeated

Diarized segments containing speaker labels, timestamps and transcripts.

speaker_labels string repeated

Set of labels used to identify speakers in each segment.

is_partial bool

If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change.

Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability.
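
Because partial results may be absent or may change later, a client will often keep only final results. The sketch below shows one way to do that. It uses plain dictionaries as hypothetical stand-ins for decoded DiarizationResponse messages, and the function name `final_results` is illustrative.

```python
def final_results(responses):
    """Yield only results whose is_partial flag is unset or false."""
    for response in responses:
        for result in response.get("results", []):
            if not result.get("is_partial", False):
                yield result


# Hypothetical decoded responses: one interim result, then a final one.
responses = [
    {"results": [{"is_partial": True, "segments": []}]},
    {"results": [{"segments": [{"speaker_label": "A"}]}]},
]

finals = list(final_results(responses))
print(len(finals))
```

Treating a missing flag as false matches the description above, where an unset is_partial denotes a final result.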

Message: ListModelsResponse

The message sent by the server for the ListModels method.

Field Type Label Description
models Model repeated

List of models available for use that match the request.

Message: Model

Description of a Juzu Diarization Model.

Field Type Label Description
id string

Unique identifier of the model. This identifier is used to choose the model that should be used for diarization, and is specified in the DiarizationConfig message.

name string

Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use.

attributes ModelAttributes

Model attributes.

Message: ModelAttributes

Attributes of a Juzu Diarization Model.

Field Type Label Description
sample_rate uint32

Audio sample rate supported by the model.

segmentation_type string

The type of segmentation (fixed / variable) supported by the model.
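
The attributes above can be used to choose a model_id for DiarizationConfig. The following sketch selects the first model whose sample_rate equals the rate of the audio to be sent. The dictionaries stand in for entries of a decoded ListModelsResponse; the model IDs, names, and the exact-match selection rule are assumptions made for illustration.

```python
def pick_model(models, sample_rate):
    """Return the id of the first model with a matching sample_rate, else None."""
    for model in models:
        if model["attributes"]["sample_rate"] == sample_rate:
            return model["id"]
    return None


# Hypothetical model list, shaped like the Model message described above.
models = [
    {"id": "1", "name": "Hypothetical 8 kHz model",
     "attributes": {"sample_rate": 8000, "segmentation_type": "fixed"}},
    {"id": "2", "name": "Hypothetical 16 kHz model",
     "attributes": {"sample_rate": 16000, "segmentation_type": "variable"}},
]

print(pick_model(models, 16000))
```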

Message: Segment

A diarized segment of audio.

Field Type Label Description
speaker_label string

The identity of the speaker for this segment.

start_time google.protobuf.Duration

Time offset relative to the beginning of audio received by the diarizer and corresponding to the start of this segment.

end_time google.protobuf.Duration

Time offset relative to the beginning of audio received by the diarizer and corresponding to the end of this segment.

transcript string

Text representing the transcription of the words that the speaker spoke. Formatting options are set in cubicsvr.

words WordInfo repeated

Words in the transcript, their timestamps and confidence scores.
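
A common use of the segment timestamps is to total the speaking time per speaker. The sketch below assumes each segment has been decoded into a dictionary, with start_time and end_time given as the `seconds` and `nanos` fields of google.protobuf.Duration. The helper names `to_seconds` and `talk_time` and the sample data are hypothetical.

```python
from collections import defaultdict


def to_seconds(duration):
    """Convert a Duration-like dict (seconds, nanos) to float seconds."""
    return duration.get("seconds", 0) + duration.get("nanos", 0) / 1e9


def talk_time(segments):
    """Sum (end_time - start_time) per speaker_label."""
    totals = defaultdict(float)
    for seg in segments:
        length = to_seconds(seg["end_time"]) - to_seconds(seg["start_time"])
        totals[seg["speaker_label"]] += length
    return dict(totals)


# Hypothetical segments for two speakers.
segments = [
    {"speaker_label": "A", "start_time": {"seconds": 0},
     "end_time": {"seconds": 2, "nanos": 500000000}},
    {"speaker_label": "B", "start_time": {"seconds": 2, "nanos": 500000000},
     "end_time": {"seconds": 4}},
    {"speaker_label": "A", "start_time": {"seconds": 4},
     "end_time": {"seconds": 5}},
]

print(talk_time(segments))
```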

Message: StreamingDiarizeRequest

The top-level message sent by the client for the StreamingDiarize request. Multiple StreamingDiarizeRequest messages are sent. The first message must contain a DiarizationConfig message only, and all subsequent messages must contain DiarizationAudio only. All DiarizationAudio messages must contain non-empty audio. If audio content is empty, the server may interpret it as end of stream and stop accepting any further messages.

Field Type Label Description
config DiarizationConfig

audio DiarizationAudio
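
The message ordering described above can be sketched as a Python generator: one config-only request, followed by audio-only requests that are never empty. Dictionaries stand in for the generated StreamingDiarizeRequest messages, and the function name `request_stream`, the chunk size of 8192 bytes, and the sample config values are assumptions made for illustration.

```python
def request_stream(config, audio, chunk_size=8192):
    """Yield a config-only request, then audio-only requests with data."""
    yield {"config": config}
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        if chunk:  # guard: an empty chunk may be read as end of stream
            yield {"audio": {"data": chunk}}


config = {"model_id": "1", "sample_rate": 16000}  # hypothetical values
audio = bytes(20000)  # 20000 zero bytes standing in for real audio

requests = list(request_stream(config, audio))
print(len(requests))
```

Because the audio is sliced sequentially, any WAV or FLAC header is sent once, in the first audio message, rather than repeated in every DiarizationAudio message.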

Message: VersionResponse

The message sent by the server for the Version method.

Field Type Label Description
juzu string

Version of the Juzu library handling the diarization.

server string

Version of the server handling these requests.

Message: WordInfo

Word-specific information for recognized words.

Field Type Label Description
word string

The actual word in the text.

confidence double

Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.

start_time google.protobuf.Duration

Time offset relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.

duration google.protobuf.Duration

Duration of the current word in the spoken audio.
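
WordInfo carries a start time and a duration rather than an end time, so the end of a word is start_time + duration. The sketch below performs that addition on (seconds, nanos) pairs, which is how google.protobuf.Duration stores its value. The helper names and the sample word are hypothetical, and non-negative durations are assumed.

```python
NANOS_PER_SECOND = 1_000_000_000


def add_durations(a, b):
    """Add two (seconds, nanos) pairs, carrying nanos overflow into seconds."""
    total_nanos = (a[0] + b[0]) * NANOS_PER_SECOND + a[1] + b[1]
    return divmod(total_nanos, NANOS_PER_SECOND)


def word_end_time(word):
    """Return the end offset of a word as a (seconds, nanos) pair."""
    return add_durations(word["start_time"], word["duration"])


# Hypothetical word: starts at 1.7 s and lasts 0.45 s.
word = {
    "word": "hello",
    "confidence": 0.93,
    "start_time": (1, 700_000_000),
    "duration": (0, 450_000_000),
}

print(word_end_time(word))  # (2, 150000000)
```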

Enum: DiarizationConfig.Encoding

The encoding of the audio data to be sent for recognition.

For best results, the audio source should be captured and transmitted using the RAW_LINEAR16 encoding.

Name Number Description
RAW_LINEAR16 0 Raw (headerless) uncompressed 16-bit signed little-endian samples (linear PCM), single channel, sampled at the rate expected by the chosen Model.
WAV 1 WAV (data with RIFF headers), with data sampled at a rate equal to or higher than the sample rate expected by the chosen Model.
FLAC 2 FLAC data, sampled at a rate equal to or higher than the sample rate expected by the chosen Model.
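
To make the RAW_LINEAR16 layout concrete, the following sketch packs floating-point samples into headerless 16-bit signed little-endian bytes using the Python standard library. The function name `floats_to_linear16` is hypothetical, and scaling by 32767 with clipping to [-1.0, 1.0] is one common convention assumed here, not something this specification prescribes.

```python
import struct


def floats_to_linear16(samples):
    """Pack floats in [-1.0, 1.0] as 16-bit signed little-endian PCM."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clip out-of-range samples
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)


pcm = floats_to_linear16([0.0, 1.0, -1.0])
print(pcm.hex())  # 0000ff7f0180
```

Each sample occupies two bytes, so single-channel audio at 16000 samples per second produces 32000 bytes per second in this encoding.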

Scalar Value Types

.proto Type Notes Go Type Python Type
double float64 float
float float32 float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. int32 int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. int64 int/long
uint32 Uses variable-length encoding. uint32 int/long
uint64 Uses variable-length encoding. uint64 int/long
sint32 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. int32 int
sint64 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. int64 int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 int/long
sfixed32 Always four bytes. int32 int
sfixed64 Always eight bytes. int64 int/long
bool bool bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string str/unicode
bytes May contain any arbitrary sequence of bytes. []byte str (Python 2), bytes (Python 3)
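
The size notes in the table above follow from protobuf's variable-length (varint) encoding, which stores 7 bits of payload per byte. The sketch below is a small illustrative encoder and not the protobuf library itself; the function names are hypothetical. It shows why values of 2^28 and above need five bytes as a varint (one more than fixed32), why negative int32 values are costly, and how the zigzag mapping used by sint32 avoids that cost.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag32(n):
    """Map a signed 32-bit integer to an unsigned one, as sint32 does."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def int32_varint(n):
    """Encode an int32: negatives are sign-extended to 64 bits first."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


print(len(encode_varint(2**28 - 1)))        # 4 bytes, same as fixed32
print(len(encode_varint(2**28)))            # 5 bytes, one more than fixed32
print(len(int32_varint(-1)))                # 10 bytes as int32
print(len(encode_varint(zigzag32(-1))))     # 1 byte as sint32
```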