Streaming Diarization

The following example shows how to diarize and transcribe audio using Juzu’s Streaming Diarize Request. The stream can come from a file on disk or directly from a microphone in real time. The diarization (and transcription) results are returned only after the stream has ended and all the audio has been sent to the server.

For real-time streaming transcription without diarization, call cubicsvr directly. (See StreamingRecognize in the Cubic SDK documentation.)

Streaming from an audio file

We support several file formats including RAW, WAV and FLAC. For more details, please see the protocol buffer specification file in the SDK repository (grpc/juzu.proto). The examples below use a WAV file as input to the streaming diarization (and transcription).

<div class="highlight"><pre style="color:#d0d0d0;background-color:#202020;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-py" data-lang="py">import juzu

serverAddress = '127.0.0.1:2727'

# set insecure=True for connecting to server not using TLS
client = juzu.Client(serverAddress, insecure=False)

# get list of available models
modelResp = client.ListModels()
for model in modelResp.models:
    print("ID = {}\t Name = {}\t [SampleRate = {} Hz]".format(
        model.id, model.name, model.attributes.sample_rate))

# use the first available model
juzuModel = modelResp.models[0]

# Using cubic model to transcribe; Cubicsvr must also be
# running and the address:port provided in the Juzu server
# config file. The cubic models and their IDs on Cubicsvr can be
# found in cubicsvr.cfg.toml or be obtained via sdk-cubic.
cubicModelID = "1"

cfg = juzu.DiarizationConfig(
    model_id = juzuModel.id,
    cubic_model_id = cubicModelID,
    num_speakers = 2,         # number of speakers expected in the audio file
    audio_encoding = "WAV",   # supported : "RAW_LINEAR16", "FLAC", "WAV"
    sample_rate = 16000,      # must match juzu model's expected sample rate
)

# client.StreamingDiarize takes any binary
# stream object that has a read(nBytes) method.
# The method should return nBytes from the stream.

# open audio file stream
audio = open('test.wav', 'rb')

# helper function to convert protobuf duration objects
# (which store the time split into integer seconds
# and integer nanoseconds) into a single floating-point
# value in seconds
def protoDurToSec(dur):
    return float(dur.seconds) + float(dur.nanos) * 1e-9

# defining function to print speaker segments and transcripts to screen
def handleResults(diarizationResp):
    for result in diarizationResp.results:
        for segment in result.segments:
            print("{start:.3f} - {end:.3f}\t{speaker}:\t{transcript}\n".format(
                start = protoDurToSec(segment.start_time),
                end = protoDurToSec(segment.end_time),
                speaker = segment.speaker_label,
                transcript = segment.transcript,
            ))

# sending streaming request to Juzu and
# waiting for results to return
for resp in client.StreamingDiarize(cfg, audio):
    handleResults(resp)
</code></pre></div>
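The config comment above notes that sample_rate must match the model's expected sample rate. As a minimal sketch, assuming the input is a PCM WAV file, the rate stored in the file header can be checked with Python's standard wave module before streaming. The helpers wav_sample_rate and make_silent_wav are hypothetical names used only for illustration and are not part of the Juzu SDK; the in-memory file stands in for test.wav.

```python
import io
import wave


def wav_sample_rate(stream):
    """Return the sample rate (Hz) stored in a PCM WAV header.

    `stream` may be a file path or a seekable binary file object.
    """
    with wave.open(stream, "rb") as wav:
        return wav.getframerate()


def make_silent_wav(sample_rate, n_frames=160):
    """Build a small in-memory mono 16-bit WAV (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


# assumed value; in practice take it from model.attributes.sample_rate
expected_rate = 16000

demo = make_silent_wav(16000)
rate = wav_sample_rate(demo)
if rate != expected_rate:
    raise ValueError("file is {} Hz, model expects {} Hz".format(rate, expected_rate))
print("sample rate OK:", rate)
```

With a real file, the same check would be wav_sample_rate('test.wav'), run before opening the file for streaming.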

<h4 id="programcs">Program.cs</h4>
<div class="highlight"><pre style="color:#d0d0d0;background-color:#202020;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-csharp" data-lang="csharp">using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace JuzusvrClient {
    class Program {

        static async Task Main (string[] args) {

            var url = "127.0.0.1:2727";
            string audioFile = "test.wav";

            var insecure = true;
            var client = new Client (url, insecure);

            // Getting list of diarization models on the server
            var modelResp = client.ListModels ();
            Console.WriteLine ("\nAvailable models:\n");
            foreach (var model in modelResp.Models) {
                Console.WriteLine ("{0}\t{1}\t{2}\n",
                    model.Id, model.Name, model.Attributes.SampleRate);
            }

            // Creating config for Diarizing + Transcribing file with the first
            // Juzu Model available and the Cubic model with ID "1" (assigned by
            // cubicsvr config).
            var diarCfg = new DiarizationConfig {
                JuzuModelID = modelResp.Models[0].Id,
                CubicModelID = "1",
                NumSpeakers = 2, // use 0 if unknown
                SampleRate = 16000,
                Encoding = AudioEncoding.WAV,
            };

            // Define callback function to print results on screen; could be
            // modified to do other things with the results as well.
            ResponseHandler handleFunc = delegate (CobaltSpeech.Juzu.DiarizationResponse resp) {
                foreach (var result in resp.Results) {
                    foreach (var seg in result.Segments) {
                        Console.WriteLine ("{0} : {1}\t{2}\t{3}",
                            seg.StartTime, seg.EndTime, seg.SpeakerLabel, seg.Transcript);
                    }
                }
            };

            // StreamingDiarizeAsync takes any readable System.IO.Stream object, that is
            // only the Stream.Read method needs to be implemented.
            using (FileStream file = File.OpenRead (audioFile)) {
                await client.StreamingDiarizeAsync (file, diarCfg, handleFunc);
            }
        }
    }
}
</code></pre></div>
<h4 id="juzusvrclientcsproj">JuzusvrClient.csproj</h4>
<div class="highlight"><pre style="color:#d0d0d0;background-color:#202020;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-csharp" data-lang="csharp">&lt;Project Sdk="Microsoft.NET.Sdk"&gt;

  &lt;PropertyGroup&gt;
    &lt;OutputType&gt;Exe&lt;/OutputType&gt;
    &lt;TargetFramework&gt;netcoreapp3.0&lt;/TargetFramework&gt;
  &lt;/PropertyGroup&gt;

  &lt;ItemGroup&gt;
    &lt;PackageReference Include="Juzu-SDK" Version="0.9.3" /&gt;
  &lt;/ItemGroup&gt;

&lt;/Project&gt;
</code></pre></div>

Streaming from microphone

Streaming audio from a microphone typically requires interacting with system libraries. Several options are available; the example here uses one of them, but you may use an alternative as long as the recording audio format is chosen correctly.
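To make "chosen correctly" concrete: the microphone example in this section records 16-bit little-endian integer samples on a single channel and sends them with the RAW_LINEAR16 encoding. The sketch below uses only the Python standard library to show what that layout means in bytes. The names bytes_for_duration and pack_samples are hypothetical helpers for illustration, not SDK functions.

```python
import struct

BYTES_PER_SAMPLE = 2  # 16-bit linear PCM


def bytes_for_duration(sample_rate, seconds, channels=1):
    """Number of bytes that `seconds` of 16-bit PCM audio occupies."""
    return round(sample_rate * seconds) * BYTES_PER_SAMPLE * channels


def pack_samples(samples):
    """Pack integer samples (-32768..32767) as 16-bit little-endian PCM."""
    return struct.pack("<{}h".format(len(samples)), *samples)


chunk = pack_samples([0, 1, -1, 32767])
print(len(chunk))                      # 8 bytes for 4 samples
print(bytes_for_duration(16000, 0.1))  # 3200 bytes for 100 ms at 16 kHz
```

A recording library that delivers a different sample width, byte order, or channel count would need its output converted to this layout before it is sent as RAW_LINEAR16.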

<p>This example requires the <a href="http://people.csail.mit.edu/hubert/pyaudio/">pyaudio</a>
module to stream audio from a microphone. Instructions for installing pyaudio
for different systems are available at the link. On most platforms, this is
simply <code>pip install pyaudio</code>.</p>
<div class="highlight"><pre style="color:#d0d0d0;background-color:#202020;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-py" data-lang="py">import juzu
import pyaudio
import threading

serverAddress = '127.0.0.1:2727'

# set insecure=True for connecting to server not using TLS
client = juzu.Client(serverAddress, insecure=True)

# get list of available models
modelResp = client.ListModels()

# use the first available model
juzuModel = modelResp.models[0]

# creating diarization config to transcribe + diarize
# audio stream from microphone
cfg = juzu.DiarizationConfig(
    model_id = juzuModel.id,
    cubic_model_id = "1",
    num_speakers = 2,
    audio_encoding = "RAW_LINEAR16",
    sample_rate = juzuModel.attributes.sample_rate,
)

# client.StreamingDiarize takes any binary stream object that has a read(nBytes)
# method. The method should return nBytes from the stream. So pyaudio is a suitable
# library to use here for streaming audio from the microphone. Other libraries or
# modules may also be used as long as they have the read method or have been wrapped
# to do so.

# defining class to wrap around microphone stream from pyaudio
class MicStream(object):

    def __init__(self, sampleRate):
        self._p = pyaudio.PyAudio()
        # opening mic stream, recording 16 bit little endian integer samples, mono channel
        self._stream = self._p.open(format=pyaudio.paInt16, channels=1, rate=sampleRate, input=True)
        self._stopped = False

    def __del__(self):
        self._stream.close()
        self._p.terminate()

    # StreamingDiarize requires a read(nBytes) method
    # that returns a list of nBytes from the stream. An
    # empty list signals the end of stream.
    def read(self, nBytes):
        # if stream is stopped, return empty list to
        # signal end of stream to Juzu
        if self._stopped:
            return []
        # note: pyaudio's read() takes a count of frames; with
        # 16-bit mono audio each frame is 2 bytes
        return self._stream.read(nBytes)

    def pause(self):
        self._stream.stop_stream()

    def resume(self):
        self._stream.start_stream()

    def stop(self):
        self._stopped = True

audio = MicStream(juzuModel.attributes.sample_rate)

# helper function to convert protobuf duration objects
# (which store the time split into integer seconds
# and integer nanoseconds) into a single floating-point
# value in seconds
def protoDurToSec(dur):
    return float(dur.seconds) + float(dur.nanos) * 1e-9

# function run in a separate thread to send the streaming
# request to juzu and process results once they come back
# after the stream ends.
def streamToJuzu(cfg, audio):
    try:
        for resp in client.StreamingDiarize(cfg, audio):
            for result in resp.results:
                for segment in result.segments:
                    print("{start:.3f} - {end:.3f}\t{speaker}:\t{transcript}\n".format(
                        start = protoDurToSec(segment.start_time),
                        end = protoDurToSec(segment.end_time),
                        speaker = segment.speaker_label,
                        transcript = segment.transcript,
                    ))
    except Exception as ex:
        print("[error]: streaming diarization failed: {}".format(ex))

streamThread = threading.Thread(target=streamToJuzu, args=(cfg, audio))
streamThread.daemon = True
streamThread.start()

# waiting for user to end mic stream
print("\nStreaming audio to Juzu server …\n")
k = input("– Press Enter key to stop stream –")

print("\nStopping Stream …")
audio.stop()

print("Waiting for results …")
streamThread.join()
</code></pre></div>
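The stop-flag pattern used by MicStream can be tried without a microphone or a running server. The following is a minimal sketch under stated assumptions: StoppableStream and drain are hypothetical names, drain only stands in for whatever loop consumes the stream (how the SDK reads internally is not shown in this document), and the sketch returns an empty bytes object at end of stream where the example above returns an empty list. It uses threading.Event so that stop() can be called safely from a different thread than read().

```python
import io
import threading


class StoppableStream(object):
    """Wrap any object with read(nBytes); after stop(), read() returns b''."""

    def __init__(self, source):
        self._source = source
        self._stopped = threading.Event()  # may be set from another thread

    def read(self, nBytes):
        # once stopped, report end of stream regardless of remaining data
        if self._stopped.is_set():
            return b""
        return self._source.read(nBytes)

    def stop(self):
        self._stopped.set()


def drain(stream, chunkSize=4):
    """Stand-in consumer: read chunks until the stream returns no data."""
    chunks = []
    while True:
        data = stream.read(chunkSize)
        if len(data) == 0:
            break
        chunks.append(data)
    return chunks


stream = StoppableStream(io.BytesIO(b"0123456789"))
first = stream.read(4)  # b"0123"
stream.stop()
rest = drain(stream)    # [] because the stream was stopped
print(first, rest)
```

Any source with a read method can be wrapped this way, which keeps the end-of-stream logic separate from the audio library in use.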

<p>We do not currently have example C# code for streaming from a microphone. Pass
the bytes from the microphone in the same way as the file is passed in the
<a href="#streaming-from-an-audio-file"><code>Streaming from an audio file</code></a> example above, via
a class derived from <a href="https://docs.microsoft.com/en-us/dotnet/api/system.io.stream"><code>System.IO.Stream</code></a>
with the <a href="https://docs.microsoft.com/en-us/dotnet/api/system.io.stream.read#System_IO_Stream_Read_System_Byte___System_Int32_System_Int32_"><code>int Read(byte[] buffer, int offset, int count)</code></a> method implemented.</p>