There are lots of things to get excited about in HTML5, and the one that caught my curiosity was the HTML5 Audio / Video API. I was overwhelmed with ideas for practical applications like face-detection login or inline dictation, but I chose to start with something small: a whistle detector. Although not wholly accurate, it works quite well. I used M. Nilsson’s research paper, “Human Whistle Detection and Frequency Estimation”, to implement this. It took me a while to understand exactly what the paper describes with its mathematical notation, but luckily I wandered to the right place to get the right idea.

For the first part, I will try to explain the Successive Mean Quantization Transform (SMQT), which prepares the audio data for further processing.

## Successive Mean Quantization Transform

A transformation in mathematics is an operation that maps one set to another. SMQT is such a transformation: it removes the bias and gain that result from differences between sensors (microphones) and other factors. In SMQT, we take the mean of the data set, split the set into two parts around that mean, and then do the same on each part recursively. Data values above the mean are assigned “1” and the rest are assigned “0”. The recursion is carried out to a pre-defined depth, at the end of which we have a binary tree of 1s and 0s. Sounds confusing? Let's take an example set:

`X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32]`

`Mean(X) = 80.875`

The values above the mean are assigned “1” while the rest are assigned “0”, so the set becomes `[1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0]`. Let this procedure be called `U(X)`. Data values corresponding to “0” propagate to the left of the binary tree while those corresponding to “1” propagate to the right. So the root's left child holds `[78, 63, 45, 79, 78, 54, 34, 66, 32]` and its right child holds `[89, 202, 90, 112, 95, 87, 90]`.
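The single step described above can be sketched in JavaScript as follows. This is only an illustration: the function name `meanQuantize` and the field names are made up for this example, and sending values exactly equal to the mean to the “0” side is an assumption.

```javascript
// One SMQT step: threshold a data set at its mean.
// Names are illustrative; they do not come from the paper.
function meanQuantize(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  // U(X): 1 for values above the mean, 0 otherwise (assumption: ties go to 0).
  const bits = values.map((v) => (v > mean ? 1 : 0));
  return {
    mean,
    bits,
    left: values.filter((v) => v <= mean), // "0" values propagate left
    right: values.filter((v) => v > mean), // "1" values propagate right
  };
}

const X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32];
const step = meanQuantize(X);
console.log(step.mean);             // 80.875
console.log(step.bits.join(" "));   // 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0
console.log(step.left.join(" "));   // 78 63 45 79 78 54 34 66 32
console.log(step.right.join(" "));  // 89 202 90 112 95 87 90
```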

Continue this process recursively until you reach a depth of L. (**Note**: L = 8 in our application.)
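To make the recursion concrete, here is a rough sketch that builds the tree of bits down to a given depth. The name `buildTree` and the node shape are hypothetical, and stopping early when a subset is empty is an assumption of this sketch rather than something taken from the paper. A depth of 2 is used only to keep the output short.

```javascript
// Recursively threshold each subset at its own mean, down to `depth` levels.
// Returns null below the last level or when a subset has no values.
function buildTree(values, depth) {
  if (depth === 0 || values.length === 0) return null;
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return {
    bits: values.map((v) => (v > mean ? 1 : 0)),
    left: buildTree(values.filter((v) => v <= mean), depth - 1),  // "0" side
    right: buildTree(values.filter((v) => v > mean), depth - 1),  // "1" side
  };
}

const X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32];
const tree = buildTree(X, 2);
console.log(tree.bits.join(" "));        // 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0
console.log(tree.left.bits.join(" "));   // 1 1 0 1 1 0 0 1 0
console.log(tree.right.bits.join(" "));  // 0 1 0 1 0 0 0
```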

After this, weight each level by multiplying its bits by 2^{cur_level - 1} and add the results up to the top of the tree. So, take a tree with root A, children B and C, and grandchildren D, E, F and G.

Multiply D, E, F, G by 2^{2}, which gives `[4 0 4 0]`, `[0 0 4 0]`, `[0 0 4 4]`, `[4 0 0 0]` and so on. Let's call this procedure of weighting the individual arrays `W(X)`. After we are done weighting, we add to each node its subtrees. For example, `B = W(B) + (W(D) . W(E))`. So we now have audio data that is free of bias and gain. (Gist).
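Putting the pieces together, the whole transform might look something like the sketch below. It is not the code from the Gist: the function name `smqt`, the index bookkeeping, and sending values equal to the mean to the “0” side are all assumptions. It also assumes the levels are weighted so that the root split contributes 2^(L-1) and the deepest split contributes 2^0, so the first split acts as the most significant bit. If you number the levels the other way round, flip the exponent accordingly.

```javascript
// Sketch of SMQT: each value ends up as the weighted sum of the bits it
// collected on its way down the tree.
function smqt(values, L) {
  const out = new Array(values.length).fill(0);

  // `indices` are the positions in `values` that belong to the current node.
  function visit(indices, level) {
    if (level > L || indices.length === 0) return;
    let sum = 0;
    for (const i of indices) sum += values[i];
    const mean = sum / indices.length;
    const weight = 2 ** (L - level); // assumed weighting: root = highest bit
    const left = [];
    const right = [];
    for (const i of indices) {
      if (values[i] > mean) {
        out[i] += weight; // bit is 1 at this level
        right.push(i);
      } else {
        left.push(i);     // bit is 0 at this level
      }
    }
    visit(left, level + 1);
    visit(right, level + 1);
  }

  visit(values.map((_, i) => i), 1);
  return out;
}

const X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32];
console.log(smqt(X, 1).join(" ")); // 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0
console.log(smqt(X, 2).join(" ")); // 2 1 1 3 2 0 3 1 2 2 2 1 0 0 1 0

// Same signal with a gain of 3 and a bias of 10 applied.
const shifted = X.map((v) => 3 * v + 10);
console.log(smqt(shifted, 8).join(" ") === smqt(X, 8).join(" ")); // true
```

Every split depends only on how values compare with the mean of their own subset, so a positive gain or a constant bias should not change the result, which is what the last line checks.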