Databending Part 5 β Listening to Telephone Codecs
Today we'll be talking about the VOX or Dialogic ADPCM format β a lossy algorithm from Oki Electric for digital voice telephony βΒ and using it to translate raw data (e.g., program files) into audio. As I mentioned in my first post on the topic, at a certain point,
once you listen to the βsonifiedβ data from enough files, commonalities start to become apparent. Many programs use some of the same [library] files, andβ¦[even] differently-named library files sometimes contain similar elements β likely re-used code patterns, or further library code compiled in.
One way I've found of getting more variety from the data is to change the sample format in which I import it. If I divide or group the raw bytes in different ways, or treat them as coming from audio files with different encodings, I can get different sound results from the exact same data. For example, here is the same data (libicudata.73.1.dylib
from the Calibre e-book manager macOS app) imported into Audacity first as 16-bit integer, then as VOX ADPCM:
Both very cool, and both very different, despite coming from the same data! Today, let's talk about how ADPCM and VOX formats work; how to do this yourself in Audacity; and how I incorporated these formats into the Rust tool I made in my last post to automate the process of converting data to audio. Also check out the end of today's post for some fun stuff to look forward to!
Following along in Audacity
While I will be doing this in Rust, you can make the exact same sounds (minus the convenience of automation) in Audacity. See my first post on databending for a discussion of how to find the best files to use. Once you have the file you want to convert to audio,
- in Audacity, go to File > Import > Raw Dataβ¦, choose your file, and click βOpenβ
- in the settings menu that pops up, set encoding to βVOX ADPCM,β byte order to βdefault endianness,β channels to β1 channel (mono),β and sample rate to 44100 (or change sample rate to taste)
What is ADPCM?
Tan and Jiang [1] have a helpful discussion of the basics of ADPCM, or βadaptive differential pulse-code modulation.β First, with differential pulse-code modulation (the βnon-adaptiveβ flavor),
[the] general idea is to use past recovered values as the basis to predict the current input data and then encode the difference between the current input and the predicted input. (486)
In other words, if we can predict the output to within a decent approximation, all we need to store is the difference between our rough prediction and the actual output. This difference uses less data to store than the original signal, saving bandwidth. The following diagram illustrates the signal flow for the encoder (A) and decoder (B):
Note that while we describe a βpredictor,β there isn't anything fancy here β we simply βpredictβ that the current sample will equal the previous one and take the (quantized) difference between that and the actual current sample.
The next diagram shows the adaptive version of the decoder as shown in the original VOX ADPCM paper from the Dialogic Corporation. [2] The primary difference here is the addition of an adaptive scaling factor for the difference between prediction and actual value. This scaling factor is based on the amplitude of the incoming difference, and we will discuss the specifics of the scaling in the next section.
VOX
There are a number of ADPCM algorithms β many different ways to adapt our step size based on the amplitude of the difference and/or prediction β and after testing some out while importing data as audio in Audacity, I decided VOX was by far my favorite. Unfortunately I wasn't able to find anything pre-existing in Rust for VOX βΒ the symphonia crate that was recommended to me only has Microsoft and IMA flavors of ADPCM. Looks like I need to code it myself! You can find the resulting code here.
Here's a snippet of audio databent through my resulting VOX ADPCM implementation:
The file is libQt5Core.5.dylib
which I believe I pulled from DaVinci Resolve a week or two ago. Also, just as a check, here's a voice file (8 kHz sample rate) I encoded as VOX ADPCM with Audacity [3] and decoded with this Rust tool:
Sounds just as expected β a bit crunchy and lo-fi like a telephone, but clear and comprehensible.
Reading the VOX Spec
First, we need to calculate the step size ss(n)
and use that and the 4-bit input sample L(n)
to calculate the difference d(n)
. That difference plus the previous output X(n-1)
will give our 12-bit output value. Below is the pseudocode from the Dialogic paper for calculating d(n)
given a value of ss(n)
and an incoming sample. Note the values B3βB0 β these refer to the 4 bits in the incoming sample, with B3 as the sign and the rest as the magnitude.
d(n) = (ss(n)*B2)+(ss(n)/2*B1)+(ss(n)/4*BO)+(ss(n)/8)
if (B3 = 1)
then d(n) = d(n) * (-1)
X(n) = X(n-1) + d(n)
To make this calculation, we need to get the step size ss(n)
. The pseudocode for that is shown below:
ss(n+1) = ss(n) * 1.1M(L(n))
The paper includes a pair of lookup tables to efficiently calculate this value. Here they are as I use them in my Rust code. We use the 4-bit incoming value to look up an βadjustment factorβ in the ADPCM_INDEX_TABLE
, and we move an index into the VOX_STEP_TABLE
table by that adjustment factor. This index is initialized to zero, giving the first value in that table β 16.
// duplicate values from spec; can index w/ whole nibble, incl sign bit (4th)
// increment up/down thru this table...
const ADPCM_INDEX_TABLE: [i16; 16] = [
-1, -1, -1, -1, 2, 4, 6, 8,
-1, -1, -1, -1, 2, 4, 6, 8,
];
// ...use (clamped) index table to index this array for step size
const VOX_STEP_TABLE: [i16; 49] = [
16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449,
494, 544, 598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552,
];
Note that incoming magnitudes (first 3 bits) below 4 cause the step size to decrease, and values 4 or greater cause it to increase. The values in ADPCM_INDEX_TABLE
are duplicated so I can use the whole 4-bit value (including bit 4, the sign bit) to index the table.
Implementing VOX in Rust
To start, I have a struct called VoxState
that stores the predictor and step index. Note in the diagram above that these two values are fed into single-sample delays (the blocks labeled βZ-1β), [4] so having them stored in a struct allows us to maintain state between calls to the decoder function.
pub struct VoxState {
predictor: i16,
step_index: i16,
}
I implement a vox_decode()
function for the VoxState
struct, as shown below. We get the step size from last time around, then update the step size for next time. The sign is the 4th bit of the incoming nibble, and the magnitude is the lower 3 bits. We get the difference between the current value and the prediction from last time with the line let mut delta = ((2 * (magnitude as i16) + 1) * step_size) >> 3;
β we will come back to how this relates to the pseudocode in a bit.
We either add or subtract the predictor and delta
, depending on the sign bit, and clamp the predictor to the range of a 12-bit signed integer. When we return this value from the function, we multiply it by 16, scaling it into the range of the 16 bit integer format of the .WAV file we'll write later. Before returning, we'll also update the struct's step index for next time around.
impl VoxState {
// ...
pub fn vox_decode(&mut self, in_nibble: &u8) -> i16 {
// get step size from last time's index before updating
let step_size = VOX_STEP_TABLE[self.step_index as usize];
// use in_nibble to index into adpcm step table; add to step
let mut step_index = self.step_index + ADPCM_INDEX_TABLE[*in_nibble as usize];
// clamp index to size of step table β for next time
step_index = i16::clamp(step_index, 0, 48);
// sign is 4th bit; magnitude is 3 LSBs
let sign = in_nibble & 0b1000;
let magnitude = in_nibble & 0b0111;
// magnitude; after * 2 and >> 3, equivalent to scale of 3 bits in (ss(n)*B2)+(ss(n)/2*B1)+(ss(n)/4*BO) from pseudocode
// + 1; after >> 3, corresponds to ss(n)/8 from pseudocode β bit always multiplies step, regardless of 3 magnitude bits on/off
let mut delta = ((2 * (magnitude as i16) + 1) * step_size) >> 3;
// last time's value
let mut predictor = self.predictor;
// if sign bit (4th one) is set, value is negative
if sign != 0 { delta *= -1; }
predictor += delta;
// clamp output between 12-bit signed min/max value
self.predictor = i16::clamp(predictor, -i16::pow(2, 11), i16::pow(2, 11) - 1);
// update for next time through; ss(n+1) into z-1 from block diagram
self.step_index = step_index;
// return updated predictor, which is also saved for next time; X(n) into z-1
// scale from 12-bit to 16-bit; 16 = 2^4, or 4 extra bits
self.predictor * 16
}
}
Returning to the main code file and picking up from last time, here is how we use our new code. We've opened a file as a Vec<u8>
, and we're storing the results of a match
expression in a Vec<f64>
(since the filtering will work better with floats). In the βarmβ of the match
expression for the VOX format, we iterate over the imported Vec<u8>
, and for each byte, we split the byte into two 4-bit βnibbles,β iterating over [chunk >> 4, chunk & 0b1111].iter()
and running vox_state.vox_decode()
for each nibble.
In the diagram below from the spec, note that the highest 4 bits in a byte come first, so our first nibble is chunk >> 4
, which brings those bits down into the lowest 4 positions. chunk & 0b1111
keeps only the 4 lowest bits, giving us the second nibble in the byte.
We push the decoded values into our output Vec<f64>
, ready for the next stage, which is filtering and writing to .WAV (see previous post for a discussion of that code).
// import file as Vec<u8>
let data: Vec<u8> = fs::read(entry.path()).expect("Error reading file");
// need to filter as f64 anyway, so best to do in match arms here for consistency
let converted_data: Vec<f64> = match args.format {
// ...
SampleFormat::Vox => {
let mut output: Vec<f64> = Vec::new();
let mut vox_state = vox::VoxState::new();
data
.iter()
// using for_each and...
.for_each(|chunk| {
// start with highest 4 bits (by right-shifting)
// then & 0x1111 then selects lowest 4
for nibble in [chunk >> 4, chunk & 0b1111].iter() {
output.push(vox_state.vox_decode(nibble) as f64);
}
});
// ...returning outside of pipeline since we need to handle *two* nibbles per element in iter()
output
}
};
Before we discuss the challenges, just for funsies I put the compiled binary for my databending tool back through the tool (using our new VOX codec). Here's a segment of the result:
Challenges
At this point, our code works! There were a few things in the VOX spec that tripped me up though, so let's talk about how I got my code working. First, when my attempt at implementing the spec gave me trouble, I looked at the source for FFmpeg, which Audacity uses β specifically the function adpcm_ima_oki_expand_nibble()
in libavcodec/adpcm.c
, line 553. [5] This is where I got the line let mut delta = ((2 * (magnitude as i16) + 1) * step_size) >> 3;
from vox.rs
above.
Let's consider the line of pseudocode d(n) = (ss(n)*B2)+(ss(n)/2*B1)+(ss(n)/4*BO)+(ss(n)/8)
β this is how we combine the incoming magnitude and the step size to get the difference between the current and previous samples. B2, B1, and B0 are the three magnitude bits from the incoming nibble. If, for example, B1 is zero, ss(n)/2*B1
will divide by zero. Not only will we need to check whether each bit is zero or not, but division is more costly than the other arithmetic operations. However, we can think about this another way.
With (ss(n)*B2)+(ss(n)/2*B1)+(ss(n)/4*BO)+(ss(n)/8)
, if we leave out the multiplication by ss(n)
for the time being, we have 1 or 0 times 1; 1 or 0 times 1/2; 1 or 0 times 1/4; and 1 times 1/8. That's just the ones place and first 3 binary floating point places. If we shift those values 3 places left, we have no more fractions/division, and if we shift the incoming 3 magnitude bits 1 place left (i.e., multiply by 2) and add one, our magnitude and the previous values we shifted line up the same way as before. We can multiply what we have now by the step size, and >> 3
βundoesβ the left shift we did to get rid of the fraction. Thus ((2 * (magnitude as i16) + 1) * step_size) >> 3
is equivalent to (ss(n)*B2)+(ss(n)/2*B1)+(ss(n)/4*BO)+(ss(n)/8)
, but we don't need to work around dividing by zero, and things are a bit faster to boot.
Looking Forward
Lately I've been enjoying windytan (Oona RΓ€isΓ€nen)'s blog β a βblog about sound & signalsβ where she discusses a variety of telecommunications encoding formats, both in terms of their sound and decoding them. I got an RTL-SDR software-defined radio dongle back in 2020, and greatly enjoyed tracking down and decoding interesting signals. Now that I have more programming skills, I think I'll do more discussion of and coding with different telecommunications formats β both for radio, and for telephony, as I did today.
One thing that @EveHasWords mentioned recently and that I also saw on windytan's blog is using cassette tapes to store digital data such as software or games. The general idea is that you modulate a tone to encode digital data, and then record that as audio on a regular cassette tape β this person did it with an Arduino and Python, so that could be a good starting point for a fun project.
You can follow the RSS feeds for this blog to see any future updates on such projects βΒ hope to see you then!
L. Tan and J. Jiang, Digital Signal Processing: Fundamentals and Applications. Academic Press, 2018, pp. 486β496. β©οΈ
Dialogic Corporation, Dialogic ADPCM Algorithm, 1988. [Online]. Available: https://people.cs.ksu.edu/~tim/vox/dialogic_adpcm.pdf. β©οΈ
Here's a link to the Audacity forum explaining where to find the settings to do this. β©οΈ
This notation comes from the idea of the Z-transform. β©οΈ
FFmpeg, libavcodec/adpcm.c. [Online]. Available: https://ffmpeg.org/doxygen/7.0/adpcm_8c_source.html#l00553. β©οΈ