Zound, a PlayFramework 2 audio streaming experiment using Iteratees

ZOUND

Last Friday was HackDay #7 at Zenexity, and we decided to work on a real-time audio experiment made with Play Framework. The plan was to use an audio generator (JSyn, an audio synthesizer), encode the output and stream it all using Play Iteratees to pipe everything in real-time.

First of all, let’s highlight some interesting parts of the project, then get into some of the details.

Thanks to @Sadache for his Iteratee expertise, we ended up with a simple line of code that does all of the hard work:

val chunkedAudioStream = rawStream &> chunker &> audioEncoder

You can think of the &> operator as the UNIX pipe |. So we simply take the rawStream, chunk it with a chunker and encode it with an audioEncoder.

Now, rawStream is the raw stream of audio samples (numbers between -1 and 1) generated by the audio synthesizer. Next, the chunker buffers a data stream into fixed-size chunks. For instance, if you send a data stream at 1 KB/s to a 10 KB chunker, it will output one 10 KB chunk every 10 seconds. And finally, the audioEncoder takes audio samples and outputs bytes encoded in an audio format (like WAVE).

We can then make a broadcast of the stream:

val (sharedChunkedAudioStream, _) =
  Concurrent.broadcast(chunkedAudioStream)

The sharedChunkedAudioStream is now a single stream shared by every consumer (client). All that’s left to do is to stream it over HTTP:

def stream = Action {
  Ok.stream(audioHeader >>> sharedChunkedAudioStream).
     withHeaders( ("Content-Type", audio.contentType) )
}

The >>> operator means “concatenation”, so here we’re concatenating the audio header (given by the format like WAVE) with the current chunked audio stream. We also send the right HTTP Content-Type header (like “audio/wav” for WAVE).

Another interesting part of the project is the multi-user web user interface: allowing users to interact with the sound synthesis.

Using @mrspeaker‘s audio synthesis expertise, we started creating a synthesizer generator – 3 oscillators, various wave shapes, frequencies and volumes, all finally flowing through a high-pass filter before entering our “rawStream” above.

Thanks to the Play framework goodness, this audio stream can be both consumed by the web page with an HTML audio tag, and with a stream player such as VLC! Ok, that’s the project – let’s have a closer look at some of the concepts…

What is sound?

Sound is a mechanical wave that is an oscillation of pressure transmitted through a solid, liquid, or gas, composed of frequencies within the range of hearing and of a level sufficiently strong to be heard, or the sensation stimulated in organs of hearing by such vibrations. Wikipedia

We can represent sound, like any wave, as a graph of the amplitude (the pressure oscillation) as a function of time. Here you see it in Audacity:

sound-audacity

Electricity is used to pump these amplitudes to your speakers, over time.

About primitive wave sounds

There are some patterns – some primitive waves – we can easily generate with computers (or, before that, with analog oscillators). Those are well known to mathematicians and physicists: sine wave, triangle wave, square wave,…

A sine wave produces a smooth tone, whereas triangle and square waves sound more aggressive. The shorter a wave period is, the higher the note you hear: the number of periods per second is called the frequency.
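To make these shapes concrete, here is a small sketch (not code from the project) that writes each primitive wave as a function of time in seconds and frequency in Hz. The object and function names are hypothetical, chosen only for illustration:

```scala
// Hypothetical helper, not part of Zound: primitive waves as functions
// of time (seconds) and frequency (Hz), with amplitudes in [-1.0, 1.0].
object PrimitiveWaves {
  // Position inside the current period, in [0.0, 1.0).
  def phase(t: Double, freq: Double): Double = {
    val cycles = t * freq
    cycles - math.floor(cycles)
  }

  // Smooth oscillation.
  def sine(t: Double, freq: Double): Double =
    math.sin(2 * math.Pi * phase(t, freq))

  // +1 during the first half of the period, -1 during the second half.
  def square(t: Double, freq: Double): Double =
    if (phase(t, freq) < 0.5) 1.0 else -1.0

  // Ramps linearly from -1 up to +1, then back down, over one period.
  def triangle(t: Double, freq: Double): Double = {
    val p = phase(t, freq)
    if (p < 0.5) 4 * p - 1 else 3 - 4 * p
  }
}
```

Doubling freq halves the period, which is heard as the same note one octave higher.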

How is sound represented by computer?

Whereas analog oscillators generate sounds in an almost* continuous stream of electricity, computers are not able to generate a continuous stream of data. This is why, with computers, the sound is divided into discrete samples, usually 44100 samples per second for standard CD quality audio. Each sample is a value (amplitude) for a given time position.

See Sampling (wikipedia).

If you zoom in Audacity, you can actually see each sample:

audacity-zoom

This is an “AH” timbre of my voice. A timbre is unique to everyone; it’s the pattern the sound wave takes when you speak. The amplitude of an audio sample is usually represented as a real number between -1.0 and 1.0.

* Electricity is not strictly continuous; we have electrons out there!
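As a rough illustration of sampling (again, not taken from the Zound sources), the sketch below computes the samples of a sine tone at 44100 samples per second. The names Sampling and sineSamples are made up for this example:

```scala
// Hypothetical sketch of sampling: evaluate a continuous wave at
// evenly spaced instants and keep one amplitude per instant.
object Sampling {
  val SampleRate = 44100 // samples per second (CD quality)

  // Sample number i is taken at time i / SampleRate seconds.
  // Each value is an amplitude between -1.0 and 1.0.
  def sineSamples(freq: Double, count: Int): Array[Double] =
    Array.tabulate(count) { i =>
      math.sin(2 * math.Pi * freq * i / SampleRate)
    }
}
```

Calling Sampling.sineSamples(441.0, Sampling.SampleRate) would give one second of a 441 Hz tone, where each period spans exactly 100 samples.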

OK, let’s go back to our experiment now!

The experiment

Our experiment uses Play Framework and is written in Scala. Specifically, our project takes advantage of Play Framework’s powerful Iteratees.

Take the expressivity of UNIX pipes, bring the power of Scala, mix it with Play Framework and you get a powerful framework for handling real-time and web streaming.

Iteratees (and related constructs) can take a bit of getting used to. I recommend checking out this article on Iteratees in Play and/or this presentation if you are interested in learning more about Play2 and reactive programming with Iteratees. And if you just want to see how it works – you can read the source code at Play20 Github source code.

Generating the audio stream

val (rawStream, channel) = Concurrent.broadcast[Array[Double]]
val zound = new ZoundGenerator(channel).start()

We create an Array[Double] broadcast, which returns two values: the rawStream, used to read the generated data, and the channel, used by the generator to push generated audio samples. We give this channel to the ZoundGenerator. The .start() call then starts the audio generation. All of the generation is done using the JSyn library.

Here’s a snippet from the ZoundGenerator class showing the connection between JSyn and Channel:

import com.jsyn.JSyn
import com.jsyn.io.AudioOutputStream
import com.jsyn.unitgen.MonoStreamWriter
import play.api.libs.iteratee.Concurrent.Channel

class ZoundGenerator(output: Channel[Array[Double]]) {
  val out = new MonoStreamWriter()

  val synth = {
    val synth = JSyn.createSynthesizer()
    synth.add(out)
    // Forward every sample JSyn writes to the broadcast channel
    out.setOutputStream(new AudioOutputStream(){
      def close() {}
      def write(value: Double) {
        output.push(Array(value))
      }
      def write(buffer: Array[Double]) {
        write(buffer, 0, buffer.length)
      }
      def write(buffer: Array[Double], start: Int, count: Int) {
        output.push(buffer.slice(start, start + count))
      }
    })
    synth
  }
  // ...

We have to implement the methods of AudioOutputStream – but it’s just a matter of pushing each audio sample to the channel. It’s that simple!

Encoding the raw audio stream

For now, we have only implemented the WAVE format. Basically, WAVE has 2 parts: the WAVE header, which describes important information (like the frame rate and the bits per sample), and the data.
The data is encoded in a simple manner I won’t describe here, but you can look at the encoder I made in the project source code.
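For readers who want a concrete picture anyway, here is a minimal sketch of what a mono PCM WAVE encoder could look like. This is not the project’s MonoWaveEncoder: the object name, the 16-bit sample depth and the method bodies are assumptions. Only the member names header, encodeData and contentType mirror the calls used in this post.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical stand-in for the project's encoder.
// Assumes mono, 16-bit PCM at 44100 Hz.
object SketchMonoWaveEncoder {
  val sampleRate = 44100
  val bitsPerSample = 16 // assumption, not confirmed by the post
  val bytesPerSample = bitsPerSample / 8
  val contentType = "audio/wav"

  // 44-byte RIFF/WAVE header for uncompressed PCM (little-endian).
  val header: Array[Byte] = {
    val riffSize = Int.MaxValue // largest size, so the stream looks "unfinished"
    val buf = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN)
    buf.put("RIFF".getBytes("US-ASCII"))
    buf.putInt(riffSize)
    buf.put("WAVE".getBytes("US-ASCII"))
    buf.put("fmt ".getBytes("US-ASCII"))
    buf.putInt(16)                          // size of the fmt chunk
    buf.putShort(1.toShort)                 // audio format: 1 = PCM
    buf.putShort(1.toShort)                 // channels: mono
    buf.putInt(sampleRate)
    buf.putInt(sampleRate * bytesPerSample) // bytes per second
    buf.putShort(bytesPerSample.toShort)    // block align
    buf.putShort(bitsPerSample.toShort)
    buf.put("data".getBytes("US-ASCII"))
    buf.putInt(riffSize - 36)               // size of the data chunk
    buf.array()
  }

  // Converts samples in [-1.0, 1.0] to signed 16-bit little-endian PCM.
  def encodeData(samples: Array[Double]): Array[Byte] = {
    val buf = ByteBuffer.allocate(samples.length * bytesPerSample)
      .order(ByteOrder.LITTLE_ENDIAN)
    samples.foreach { s =>
      val clamped = math.max(-1.0, math.min(1.0, s))
      buf.putShort((clamped * Short.MaxValue).toInt.toShort)
    }
    buf.array()
  }
}
```

Out-of-range samples are clamped rather than wrapped, since wrapping would produce loud clicks.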

Now, more interestingly, let’s wrap it with Play Iteratees:

val audio = MonoWaveEncoder() // instantiate the WAV encoder
val audioHeader = Enumerator(audio.header)
val audioEncoder = Enumeratee.map[Array[Double]](audio.encodeData)

N.B.: Scala is a statically typed language, but type declarations are often optional because the compiler can infer the types.

audioHeader is an Enumerator, which means it can produce data, and here the data is the audio header. More precisely, it’s an Enumerator[Array[Byte]] because audio.header is an Array[Byte]. Note that the contained data is not “consumed” like it would be for an InputStream. Each time you use this enumerator, it gives you its entire content.

audioEncoder is an Enumeratee[Array[Double], Array[Byte]]. It takes an Array[Double] as input and returns an Array[Byte] as output. The input is a raw array of audio samples (double numbers between -1.0 and 1.0). The output is the encoded array of bytes.

More formally, an Enumeratee[A, B] is an adapter which maps some data of type A to new data of type B. You can implement the way the data is transformed with the map function. Here we just give it the audio.encodeData function.
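If the Play types feel abstract, the same idea can be mimicked with plain Scala iterators, where Iterator.map plays the role of Enumeratee.map. This is only an analogy: an Iterator is pulled synchronously, whereas Play’s enumerators push data asynchronously. The names AdapterAnalogy, toyEncode and encoded are hypothetical, and toyEncode is a deliberately crude stand-in for audio.encodeData.

```scala
// Analogy only: a plain-Scala picture of "adapt a stream of A into a stream of B".
object AdapterAnalogy {
  // Toy encoder (hypothetical): one signed byte per sample.
  def toyEncode(samples: Array[Double]): Array[Byte] =
    samples.map(s => (s * 127).toInt.toByte)

  // Rough equivalent of `rawStream &> Enumeratee.map(toyEncode)`:
  // every incoming Array[Double] becomes an Array[Byte].
  def encoded(raw: Iterator[Array[Double]]): Iterator[Array[Byte]] =
    raw.map(toyEncode)
}
```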

Streaming it

We can basically stream the audio stream with Play2 like so:

def stream = Action {
  val audioStream = rawStream &> audioEncoder
  Ok.stream(audioHeader >>> audioStream).
     withHeaders( (CONTENT_TYPE, audio.contentType) )
}

The rawStream &> audioEncoder takes the raw stream and pipes it into the encoder which results in the encoded audio stream. audioHeader >>> audioStream will concatenate audioHeader with audioStream. Hence, the first thing the server will do is start sending the audio header to the client and then stream the audio in real-time.

A client can connect at any time and will hear the current stream, so it should hear the same thing as any other client at the same moment (with some delay depending on the client buffer). If the generator stops emitting audio samples, the HTTP client will stop receiving audio data – but it will still be waiting for the server, so playback will pause until the server sends new audio samples. That is pretty cool! – because of the way iteratees work, the stream doesn’t just die when all of the input is consumed.

A chunker to reduce HTTP packet numbers

Up to now we’ve been streaming the audio in very small chunks because, by default, JSyn writes out arrays of just 8 audio samples and the .stream() function consumes all data as it comes. This means a lot of HTTP chunks are sent per second – which is less efficient and takes more bandwidth.

In order to fix this, we need to use a buffer on the server side. In other words, instead of sending audio samples as they come, we need to group audio samples. We have currently grouped audio samples in arrays of 5000, which is quite reasonable (it’s about 9 chunks per second at 44100 samples/s). We can easily change this later. This logic is implemented in an Enumeratee we called “chunker”. In that sense, it is reusable and modular:

val chunker = Enumeratee.grouped(
  Traversable.take[Array[Double]](5000) &>> Iteratee.consume()
)
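To see what this regrouping does to the data, here is a plain-Scala sketch of the same idea using iterators instead of Play’s Enumeratee. It is an illustration under that assumption, not the project’s code, and PlainChunker is a made-up name:

```scala
// Hypothetical plain-Scala counterpart of the chunker: flatten the
// small arrays into single samples, then regroup them into
// arrays of `size` samples.
object PlainChunker {
  def chunk(raw: Iterator[Array[Double]], size: Int): Iterator[Array[Double]] =
    raw.flatMap(arr => arr.iterator) // sample by sample
      .grouped(size)                 // groups of `size` samples
      .map(group => group.toArray)   // the last group may be shorter
}
```

With JSyn’s arrays of 8 samples and a size of 5000, 625 incoming arrays are merged into each outgoing chunk.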

And now, we can easily plug it in like this:

def stream = Action {
  val chunkedAudioStream = rawStream &> chunker &> audioEncoder
  Ok.stream(audioHeader >>> chunkedAudioStream).
     withHeaders( (CONTENT_TYPE, audio.contentType) )
}

Broadcast

Now, another improvement we made was to factor out the chunking and encoding, so that these computations are not repeated for every stream consumer.

Basically, we move it out of the stream function:

val chunkedAudioStream = rawStream &> chunker &> audioEncoder
def stream = Action {
  Ok.stream(audioHeader >>> chunkedAudioStream).
     withHeaders( (CONTENT_TYPE, audio.contentType) )
}

But to allow broadcasting, we have to use a broadcast:

val chunkedAudioStream = rawStream &> chunker &> audioEncoder
val (sharedChunkedAudioStream, _) =
  Concurrent.broadcast(chunkedAudioStream)
def stream = Action {
  Ok.stream(audioHeader >>> sharedChunkedAudioStream).
     withHeaders( (CONTENT_TYPE, audio.contentType) )
}

Here we only care about the enumerator (the first element of the Tuple2), so we use the wildcard "_" to ignore the second element.

Using a broadcast, audio samples pushed by the audio generator can be spread to multiple consumers simultaneously. This is perfect for our needs: multiple players can connect to this web radio!
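The fan-out idea can be pictured with a tiny single-threaded sketch. It is not how Concurrent.broadcast is implemented; ToyBroadcast and its methods are hypothetical names used only to show that each pushed chunk reaches every current subscriber, and that a late subscriber only receives what is pushed after it joined:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, single-threaded illustration of a broadcast hub.
class ToyBroadcast[A] {
  private val subscribers = ArrayBuffer.empty[A => Unit]

  // Register a consumer; it only sees chunks pushed from now on.
  def subscribe(onChunk: A => Unit): Unit = {
    subscribers += onChunk
    ()
  }

  // Deliver one chunk to every registered consumer.
  def push(chunk: A): Unit =
    subscribers.foreach(consumer => consumer(chunk))
}
```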

Avoiding server overload

The last important fix we made was to keep slow clients from overloading the server:

  def stream = Action {
    Ok.stream(audioHeader >>> sharedChunkedAudioStream
      &> Concurrent.dropInputIfNotReady(50)).
       withHeaders( (CONTENT_TYPE, audio.contentType) )
  }

If a client opens the stream connection but doesn’t consume fast enough, or doesn’t consume at all (download is paused), the server buffers the pending chunks in memory and can eventually run out of memory. To avoid that, we have to drop chunks when the consumer is not ready. The client then simply loses the chunks it wasn’t ready for (in our case, we give it 50 milliseconds).

And this is what Concurrent.dropInputIfNotReady(50) is actually doing – with yet another Enumeratee! Dropping old chunks is really what we want in an audio streaming application: We want the consumer to subscribe to the current audio stream and not to continue from where they stopped.
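The policy itself can be sketched outside of Play with a bounded queue: wait a short time for room, and give up on the chunk otherwise. This is an analogy for the behaviour described above, not the implementation of Concurrent.dropInputIfNotReady; the DroppingBuffer class and its parameters are hypothetical.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// Hypothetical per-client buffer: memory use is bounded by `capacity`,
// and a chunk is dropped if no room appears within `timeoutMs`.
class DroppingBuffer(capacity: Int, timeoutMs: Long) {
  private val queue = new ArrayBlockingQueue[Array[Byte]](capacity)

  // Producer side: returns false when the chunk had to be dropped.
  def offer(chunk: Array[Byte]): Boolean =
    queue.offer(chunk, timeoutMs, TimeUnit.MILLISECONDS)

  // Consumer side: next pending chunk, or None if nothing is waiting.
  def poll(): Option[Array[Byte]] = Option(queue.poll())
}
```

A paused client therefore costs at most capacity chunks of memory, and resumes from whatever is being produced when it starts reading again.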

Client consumers

HTML5 Audio tag

In HTML5, we have the Audio tag – and we can just consume our stream like this:

<audio src="/stream.wav"></audio>

Or, if we want it to load and play automatically:

<audio src="/stream.wav" preload autoplay controls></audio>

It may be a bit “wrong” to use `<audio>` for streaming, but it works because we are using it as if the server was hosting a static audio file. The only disputable hack is having to set the ChunkSize in the WAVE header to its maximum, 2147483647 (it’s about 6 hours 45 minutes!), so the browser believes the audio is not finished.
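The 6 hours 45 minutes figure can be checked with a bit of arithmetic. The sketch below assumes 16-bit mono PCM at 44100 samples per second, i.e. 88200 bytes per second; that sample depth is an assumption, although it matches the duration quoted above. WaveDuration is a hypothetical helper name.

```scala
// Hypothetical helper: how long does a WAVE stream of a given size play?
object WaveDuration {
  // Assumption: 44100 samples/s, 1 channel, 2 bytes per sample.
  val bytesPerSecond: Long = 44100L * 2

  def seconds(chunkSize: Long): Long = chunkSize / bytesPerSecond

  def formatted(chunkSize: Long): String = {
    val total = seconds(chunkSize)
    s"${total / 3600}h ${(total % 3600) / 60}min"
  }
}
```

For 2147483647 bytes this gives 24347 seconds, which is 6h 45min.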

The issue we are currently facing is this crazy latency (a few seconds) between user actions and the produced sound. This problem is due to the browser audio cache buffer: if we were able to minimize it we would have an almost real-time audio player.

Playing it with VLC

This stream is served over HTTP, so we need an HTTP client to consume it. But an HTTP client doesn’t mean only browsers! We can also use VLC for this, as if it was a web radio! One advantage of using VLC is that it suffers far less latency (presumably because its cache buffer is smaller than the audio tag’s).

vlc

Making the real-time control UI

Our experiment mixes different oscillators to generate one sound. The web user interface allows a user to control their parameters. Two knobs control the volume and the pitch (tuned to a dorian mode scale) and you can select the oscillator wave primitive (sine, sawtooth, square, noise). It’s not fancy at the moment – but JSyn offers a lot of features for expanding our simple demo.

This interface is multi-user, so if you use it with other people, the interface will stay synchronized across multiple browsers (turn the knobs, change the wave primitive, …). All this is done with WebSockets, and on the server side it’s using, again, Iteratees!

The workflow is simple: When someone does some action on the user interface, events are sent to the server. These events are interpreted by the ZoundGenerator resulting in updates to the audio synthesis configuration. These events are then broadcast to each client, and some Javascript handlers are called in order to keep the interface synchronized.
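As a sketch of that workflow, the server-side part can be thought of as a function from the current synthesis settings and an incoming event to the new settings. Everything below is hypothetical: the event names, their fields and the Oscillator settings are invented for illustration and are not the message format used by Zound.

```scala
// Hypothetical model of UI events and the synthesis configuration.
object ControlEvents {
  sealed trait Event
  case class SetVolume(osc: Int, value: Double) extends Event
  case class SetWave(osc: Int, shape: String) extends Event

  case class Oscillator(volume: Double, shape: String)

  // Interpret one event: return the updated configuration.
  // The same event would then be re-sent to every connected client.
  def applyEvent(state: Vector[Oscillator], event: Event): Vector[Oscillator] =
    event match {
      case SetVolume(i, v) => state.updated(i, state(i).copy(volume = v))
      case SetWave(i, s)   => state.updated(i, state(i).copy(shape = s))
    }
}
```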

Source code

Fork me on Github

What’s next?

This was just a simple demo to show the power and flexibility of Play2’s Iteratee concept. Because of its modular nature, extending the demo is easy. For example, we could plug in a new audio encoder such as an OGG encoder. The code would be simple and we could even choose on a request-by-request basis which encoder to use:

import Concurrent.broadcast
val (chunkedWaveStream, _) =
  broadcast(rawStream &> chunker &> waveEncoder)
val (chunkedOggStream, _) =
  broadcast(rawStream &> chunker &> oggEncoder)

def stream(format: String) = Action {
  val stream = format match {
    case "wav" => waveHeader >>> chunkedWaveStream
    case "ogg" => oggHeader >>> chunkedOggStream
  }
  Ok.stream(stream).
     withHeaders( (CONTENT_TYPE, audio.contentType) )
}

Now it’s up to you!

Hopefully you get a feel for the possibilities of stream processing and piping with Play. You can now reuse these concepts and make your own stuff: maybe you don’t need to generate sounds on the fly, but instead you simply want to play a collection of audio files and stream them like a radio? Well, you can make a web radio engine now!

But that’s just the beginning – I would love to see someone taking the concept, and running even further… Did you know that while you are uploading a video to YouTube, YouTube is already re-encoding it and can start streaming it before the file has finished uploading? Hmm, that’s starting to sound almost simple…