27 Jun 2017 by johanr

A while back, we identified two problems while testing and measuring. This is a brief overview by Johan, our Product Owner, of how the sync problems were observed, examined and solved.

The first was a large latency (50-250 ms), primarily on iOS, that felt performance related.

The second was a drift in recordings made on iPhone 7 (and once on Windows), which seemed related to those devices recording at a 48000 Hz sample rate instead of 44100 Hz.

We tested this by recording with our app and letting the song's click track be picked up by the microphone. All tests were played against a 120 bpm drum track (cowbell on every beat). Because we compensate for the device's recording latency by removing the latency tester value from the merged audio file on the server, the drum beats should land at 0 and at every half-second step thereafter.
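To make the expected alignment concrete, here is a minimal sketch of how measured beat positions can be compared against the 120 bpm grid. The function names and the sample onset values are made up for illustration; this is not the tooling we used.

```python
def expected_beat_times(bpm, n_beats):
    """Beat positions in seconds for a steady click starting at 0."""
    beat_interval = 60.0 / bpm  # 0.5 s at 120 bpm
    return [i * beat_interval for i in range(n_beats)]


def beat_offsets_ms(onsets_s, bpm):
    """Distance (in ms) of each measured onset from its nearest grid beat.

    Positive values mean the recorded beat is late.
    """
    beat_interval = 60.0 / bpm
    offsets = []
    for onset in onsets_s:
        nearest_beat = round(onset / beat_interval) * beat_interval
        offsets.append((onset - nearest_beat) * 1000.0)
    return offsets


# Hypothetical onsets measured in an editor: each beat is roughly 50-60 ms late.
measured = [0.05, 0.55, 1.06]
print(beat_offsets_ms(measured, 120))
```

A constant offset across all beats points at latency, while an offset that grows from beat to beat points at drift.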

Step 1 – “Operation Encoder Delay”

We realized that there was a problem with the encoder delay headers on both mp3 and m4a, in both encoding and decoding.

The encoder delay is a padding of silent samples (zeros) at the beginning and end of the audio stream, added by the encoder. The reasons given for this are varied and a bit strange, but some blame fixed frame sizes and decoder startup times.

For m4a (AAC) there is the (non-ISO) iTunSMPB tag, written by the Apple encoder and some others. For mp3 there is the (non-ISO) LAME tag, written by the LAME encoder and some others. Documentation and support for both are sparse at best.

An important note is that the decoders inside Audacity and Reaper interpret the encoder delays differently, and we realized that we cannot trust Audacity to decode mp3 or m4a correctly. This caused some confusion when testing and measuring results in these editors.

Looking at the LAME tag with LameTagGUI

Looking at the iTunSMPB tag with ffprobe
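For reference, the iTunSMPB value is a string of space-separated hexadecimal fields. A minimal parsing sketch, assuming the commonly described layout (a zero field, then the encoder delay, then the end padding, then the original sample count), could look like this. The function name and the example value are made up for illustration.

```python
def parse_itunsmpb(value):
    """Parse an iTunSMPB tag value into sample counts.

    Assumed field layout (hex, space separated):
      [0] reserved zeros, [1] encoder delay, [2] end padding,
      [3] original sample count, followed by more reserved fields.
    """
    fields = value.split()
    if len(fields) < 4:
        raise ValueError("unexpected iTunSMPB value: %r" % value)
    return {
        "encoder_delay": int(fields[1], 16),
        "end_padding": int(fields[2], 16),
        "original_samples": int(fields[3], 16),
    }


# Made-up example value, not taken from one of our files.
tag = " 00000000 00000840 000001CA 0000000000015DF6 00000000 00000000"
print(parse_itunsmpb(tag))
```

In this example the three values add up to 92160 samples, which is exactly 90 AAC frames of 1024 samples. That is the kind of sanity check that helps when two decoders disagree.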

Issues

1. The instrument samples in the app were created locally and were missing the encoder delay tags.

As a result, the scheduled sounds in the app's reference playback were played too late, which increased the latency we measured in the recorded samples.

2. The m4a chunks recorded on iOS did have the encoder delay tag, but the decoder on the server (ffmpeg) only used the tag to trim away the padding at the start of the audio, not the padding at the end.

This resulted in several frames of zeros at the end of each decoded chunk, causing the merged track to drift more and more out of sync with each chunk.
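A small sketch of why this adds up: every chunk that keeps its end padding pushes all following chunks later by that many samples. The function name and the numbers below are illustrative assumptions, not measurements from our recordings.

```python
def cumulative_drift_ms(padding_per_chunk, sample_rate):
    """Drift (in ms) at the start of each chunk in the merged track.

    padding_per_chunk: untrimmed end-padding samples left in each chunk.
    The first chunk starts on time; each later chunk is delayed by the
    padding of all chunks before it.
    """
    drift = []
    leftover_samples = 0
    for padding in padding_per_chunk:
        drift.append(leftover_samples / sample_rate * 1000.0)
        leftover_samples += padding
    return drift


# Hypothetical: four chunks, each ending with 441 zero samples at 44100 Hz.
print(cumulative_drift_ms([441, 441, 441, 441], 44100))
```

With these made-up numbers each chunk adds about 10 ms, so the fourth chunk already starts roughly 30 ms late.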

3. The playback audio, encoded (by ffmpeg) from the merged track on the server, was missing the encoder delay tag.

This resulted in the playback audio buffer starting with silence, making the track sound too late in playback.

Solutions

Encoding Correctly

We needed to find an encoder that could encode mp3 and m4a and correctly write the encoder delay tags to the output.

For m4a, we thought that building ffmpeg with the Fraunhofer FDK AAC encoder would give us the correct output, but we found that ffmpeg never writes these tags properly.

We found the nu774/fdkaac repo that explicitly states that it can do this. The developer even writes in an issue:

I have developed this project only because ffmpeg doesn’t support proper gapless encoding (ffmpeg doesn’t write necessary metadata for gapless playback).

After switching from ffmpeg to fdkaac for m4a on the server, and re-creating the instrument samples locally, we could confirm that WebAudio on iOS correctly decodes the m4a files created by this encoder.

For mp3, ffmpeg with the LAME encoder should be able to write the LAME tag, and after rebuilding a new version of ffmpeg with some build flags, it did, and all was good. We don’t use mp3 for playback internally, though, so this is not extensively tested.

Decoding Correctly

Our client on iOS records m4a audio chunk files natively (with tags), which are sent to the server. When the server receives them, they are decoded by ffmpeg and then merged. We noticed that when ffmpeg decoded the chunks, it did take the encoder delay value into account and trimmed the start of the audio stream. It did not, however, use the padding value to trim the silence at the end of the file.

We knew that we could get this value from ffprobe, so the solution was to run ffprobe on the uploaded chunk and crop only the end of the audio after ffmpeg has decoded it. This is quite hacky, but we could see no other option. If this is ever patched in ffmpeg, we need to remember to remove this manual cropping.
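The cropping itself is simple once the padding value is known. Here is a minimal sketch, assuming the decoded audio is available as a flat sequence of interleaved samples and that the padding is given in frames (one sample per channel). The function name is hypothetical and this is not our server code.

```python
def crop_end_padding(samples, end_padding_frames, channels=1):
    """Drop the trailing padding from decoded, interleaved PCM samples.

    Only the end is cropped, because the decoder has already trimmed
    the encoder delay at the start.
    """
    samples_to_drop = end_padding_frames * channels
    if samples_to_drop <= 0:
        return samples
    if samples_to_drop >= len(samples):
        return samples[:0]  # nothing but padding left
    return samples[:-samples_to_drop]


# Mono chunk ending in two padding samples.
print(crop_end_padding([3, 7, 5, 0, 0], 2))

# Stereo chunk ending in one padding frame (two interleaved samples).
print(crop_end_padding([3, 3, 7, 7, 0, 0], 1, channels=2))
```

Note that the frame count only applies at the sample rate the chunk was encoded in.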

Step 2 – Scheduling Playback based on Recording Start Time

After fixing the encoder problems, there was still an unexpected 30-150 ms latency in the recorded audio. There was sometimes also a drift, where the latency increased between beats in the same recording, but we decided to try to solve the latency first.

We tested this a lot, double-checking the server functionality, the recording and the playback. What finally revealed the problem was logging the timestamps at which the played events were scheduled.

The system was designed to take into account the timestamp of when the native recorder on iOS starts recording. Because we cannot trust the time it takes to communicate between Native and JavaScript in the app, the timestamp is sent from Native to JavaScript and converted to WebAudio clock time. This time was, however, not used by the scheduler. When the scheduler scheduled the song, the time it took first to send the message to JS and then to schedule the first events was not compensated for in the recorded track.

This is what happens (only the parts relevant here):

1. User presses the rec button

2. JavaScript tells Native to start recording

3. Native starts recording

And saves the current "Native Start Rec Time (Epoch)"

4. Native returns that it has started recording

And includes the "Native Start Rec Time (Epoch)"

5. JavaScript receives the message that recording has started

And calculates a "Time Diff" between the Current JavaScript Time (Epoch) and the "Native Start Rec Time (Epoch)"

And calculates the Current WebAudio Context Time - "Time Diff", to get the "Start Rec Time (WebAudio Clock)"

6. JavaScript calls the Scheduler to start playback

And includes the "Start Rec Time (WebAudio Clock)"

7. The Scheduler schedules the song

Using the absolute "Start Rec Time (WebAudio Clock)" as the start time for scheduling

Step 7 is where the problem was. We thought that we were scheduling based on the "Start Rec Time (WebAudio Clock)", which is an absolute clock time derived from Native, but the value was actually not used. Instead, the scheduler scheduled based on the current time at Step 6, and all upcoming events were aligned with that time rather than with the absolute "Start Rec Time (WebAudio Clock)".
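The clock conversion in Step 5 and the difference between the two anchors can be sketched as plain arithmetic. It is written in Python here only for illustration (the app itself does this in JavaScript), and all names and numbers are made up.

```python
def start_rec_time_webaudio(native_start_epoch_ms, js_now_epoch_ms, context_now_s):
    """Convert the native recording start time to WebAudio clock time.

    "Time Diff" is how long ago (in seconds) the recorder started,
    measured on the shared epoch clock.
    """
    time_diff_s = (js_now_epoch_ms - native_start_epoch_ms) / 1000.0
    return context_now_s - time_diff_s


def beat_time(anchor_s, beat_index, bpm):
    """WebAudio clock time of a beat, relative to the chosen anchor."""
    return anchor_s + beat_index * 60.0 / bpm


# Hypothetical values: the message took 80 ms to reach JavaScript.
context_now = 12.5
start_rec = start_rec_time_webaudio(1000000, 1000080, context_now)

intended = beat_time(start_rec, 2, 120)    # anchored to recording start
actual = beat_time(context_now, 2, 120)    # anchored to "now" (the bug)
print(start_rec, intended, actual, actual - intended)
```

With these numbers every event ends up 80 ms late in the recording, which is the kind of latency we were chasing.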

Note that when scheduling happens, the "Start Rec Time (WebAudio Clock)" is always a time in the past, lower than (or equal to) now. This results in the events placed on, or close to, the start of playback being late or not being scheduled at all. This is by design, and follows the scheduler’s own logic, which uses a window for determining if and when a past event is scheduled. If this is perceived as a problem, the client can solve it by starting playback slightly earlier (for example with a count-in of 2.1 measures instead of 2).
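As an illustration only, a windowed decision of this kind could look like the sketch below. The rule, the function name and the window size are assumptions made for this example and do not describe the actual scheduler.

```python
def plan_event(event_time_s, now_s, late_window_s):
    """Decide when (if at all) to play an event.

    Assumed rule, for illustration:
      - future events play at their own time,
      - events slightly in the past play immediately (and are late),
      - events older than the window are dropped (returns None).
    """
    if event_time_s >= now_s:
        return event_time_s
    if now_s - event_time_s <= late_window_s:
        return now_s
    return None


now = 12.5
print(plan_event(13.0, now, 0.1))   # in the future: played on time
print(plan_event(12.45, now, 0.1))  # 50 ms in the past: played late
print(plan_event(12.0, now, 0.1))   # too old: not scheduled
```

A slightly longer count-in moves the first real events into the future branch, which is why it avoids the problem.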

Step 3 – Drifting back to encoders

After fixing the scheduling problem, we still had the drift problem.

We recorded continuous test sounds and realized that there were gaps between the merged chunks when recording at 48000 Hz on iPhone 7. Because we didn’t have the gaps when recording at 44100 Hz on iPhone 6, we realized that the problem was in the implementation of the manual cropping of the end padding when decoding m4a in Step 1. Because we resample the audio first, the number of frames to crop needs to be recalculated for the new sample rate.
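The recalculation is a simple scaling by the ratio of the two sample rates. Here is a minimal sketch, assuming the padding is read at the source rate and the crop is applied after resampling. The function name and the padding values are made up.

```python
def rescale_frames(frames, source_rate, target_rate):
    """Convert a frame count from one sample rate to another.

    The result is rounded to the nearest whole frame, so a small
    rounding error per chunk is possible.
    """
    return round(frames * target_rate / source_rate)


# Hypothetical end padding of 480 frames in a 48000 Hz chunk,
# cropped after resampling to 44100 Hz.
print(rescale_frames(480, 48000, 44100))

# A padding that does not divide evenly: 420.7875 rounds to 421.
print(rescale_frames(458, 48000, 44100))
```

Cropping the unscaled count at 44100 Hz would remove too many frames from every chunk, which matches the gaps we saw.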

After doing that it now works!!

We still see a gap of a few frames (<10) between some chunks, which might be due to rounding in the resampling. Although it would be interesting to know why, we are well within our acceptable margin.

Success!

Over and out.