Collective for Research in Interaction, Sound, and Signal Processing

The Sonification Handbook

2012-01-26T13:28:40Z

For those who have not yet heard: The Sonification Handbook, edited by Thomas Hermann, Andy Hunt, and John G. Neuhoff, has been published. And, even better, it is freely available for download here!

SMC 2012 in Copenhagen!!

2011-09-30T05:50:30Z

9th Sound and Music Computing Conference, 12-14 July 2012
Medialogy section, Department of Architecture, Design and Media Technology, Aalborg University Copenhagen
http://smc2012.smcnetwork.org/

The SMC Conference is the forum for international exchanges around the
core interdisciplinary topics of Sound and Music Computing,
and features workshops, lectures, posters, demos, concerts, sound installations, and
satellite events. The SMC Summer School, which takes place just before the
conference, aims at giving young researchers the opportunity to
interactively learn about core topics in this interdisciplinary field from experts,
and to build a network of international contacts.
The specific theme of SMC 2012 is "Illusions", and
that of the SMC Summer School is "Multimodality".

================Important dates=================
Deadline for submissions of music and sound installations: Friday, February 3, 2012
Deadline for paper submissions: Monday 2 April, 2012
Notification of music acceptances: Friday, March 16, 2012
Deadline for applications to the Summer School: Friday March 30, 2012
Notification of acceptance to Summer School: Monday April 16, 2012
Deadline for submission of final music and sound installation materials: Friday, April 27, 2012
Notification of paper acceptances: Wednesday 2 May, 2012
Deadline for submission of camera-ready papers: Monday 4 June, 2012
SMC Summer School: Sunday 8 - Wednesday morning 11 July, 2012
SMC Workshops: Wednesday afternoon 11 July, 2012
SMC 2012: Thursday 12 - Saturday 14 July, 2012
===========================================

SMC2012 will cover topics that lie at the core of the Sound and Music Computing research and creative exploration.
We broadly group these into:
- processing sound and music data
- modeling and understanding sound and music data
- interfaces for sound and music creation
- music creation and performance with established and novel hardware and software technologies

================Call for papers==================
SMC 2012 will include paper presentations as both lectures and poster/
demos. We invite submissions examining all the core areas of the Sound
and Music Computing field. Submissions related to the theme "Illusions" are especially encouraged.
All submissions will be peer-reviewed according to their novelty, technical content, presentation, and
contribution to the overall balance of topics represented at the
conference. Paper submissions should have a maximum of 8 pages
including figures and references, and a length of 6 pages is strongly
encouraged. Accepted papers will be designated to be presented either
as posters/demos or as lectures. More details are available at
http://smc2012.smcnetwork.org/
===========================================

================Call for music works and sound installations==================
SMC 2012 will include four curated concerts addressing the conference topic "Illusions". We invite submissions of original compositions created for acoustic instruments and electronics, novel instruments and interfaces, music robots, and speakers as sound objects. Submissions of sound installations are also encouraged. See curatorial statements and call specifics at: http://smc2012.smcnetwork.org.
==============================================================

A blog devoted entirely to sparse representation

2011-07-04T12:21:45Z

As part of my research activities funded by the Danish government, I am happy to announce my new blog: Null Space Pursuits. I have copied all of my content from here to there (though the links still point to CRISSP), and will continue to document my research in varying detail over the next 30 months.

SPARS 2011, day 4

2011-07-04T09:06:06Z

The fourth and final day of SPARS 2011 served up two plenaries by two prodigious researchers: Joel Tropp and Stephen Wright. At the beginning of his talk, Tropp asked who in the room knows how MATLAB computes the SVD. Only a few out of about 200 raised their hands, and a few more gestured that they kind of knew. The problem is that the methods we use today are treated as black boxes, but are based on extremely optimized classical methods that are incapable of working with massive matrices (billions by billions and up). So, we need better tools. He presented his work on computing the SVD with a randomized algorithm ... which at first sounds scarily inaccurate, but proves to be extremely effective at a much reduced computational cost.

In the last plenary, Wright presented a lot of work on state-of-the-art methods for regularized optimization. At the beginning, he showed some fantastic pictures that he called an "Atlas of the Null Space," which showed where solutions to min l1 are the same as min l0. His talk centered around the message that though we talk a lot about exact solutions, or sparsest representations, most applications in the real world only need good algorithms that give the correct support before the whole solution. The trick is to determine when to stop an algorithm, and post-process the results to find the better solution.

In between these talks, there were plenty of others, discussing various items of interest with dictionary learning, audio inpainting (Po'D coming soon), and several posters, one of which is by CRISSP reader Graham Coleman. He presented his novel work applying l1 minimization of sound feature mixtures to drive concatenative sound synthesis, or musaicing. (I have discussed an earlier version of this work here.) Coleman's approach appears to be the next generation of concatenative synthesis.

All in all, this workshop was an excellent use of my time and money. Its duration was just perfect that after the last session I really felt as if my fuel tank was completely full. The organizers did an extremely nice job of selecting plenary speakers, assembling a wide range of quality work, and finding an accommodating venue with helpful staff. I even heard that the committee was able to raise enough funds so that many of the student participants had their accommodations paid for. I am really looking forward to the 2013 edition of SPARS (or CoSPARS).

SPARS 2011, day 3

2011-06-29T17:49:30Z

Big things today, with plenaries given by David Donoho and Martin Vetterli. Donoho answered all the questions I have regarding the variability of recovery algorithms on distributions underlying sparse vectors. I just need a few years to understand them. I also need to look more closely at approximate message passing. And Vetterli gave a great talk, discussing the tendency in algorithm development to jump to a solution before solving the outstanding problem, e.g., sampling the real continuous world on a discretized grid.

Now I need to eat dinner, and run some experiments.

SPARS 2011, day 2

2011-06-28T19:43:38Z

Though the SPARS2011 twitter feed appears miserable, this workshop is jam-packed with excellent presentations and discussions. I think too many people are having too much good discussion to have much time to twitter.

Today at SPARS 2011: Heavy hitters Francis Bach and Rémi Gribonval delivered the two plenary talks. This morning Bach talked on a new subject for me: submodular functions. In particular, he is exploring these ideas for creating sparsity-inducing norms. A motivation for this work is that while the l1 norm promotes sparsity within groups, it does not promote sparsity among groups... or vice versa (it is new to me). But I liked how he described his formalization as "the norm-design business." Someone asked him a question about analyzing greedy methods vs. convex optimization. Bach's answer made me realize that we can more completely understand the behavior of convex optimization methods than greedy methods because convex methods are decoupled from the dictionary. For greedy methods, the dictionary is involved from the get go.

This afternoon, Gribonval talked on "cosparsity", or when a signal is sparsely represented by the dual of a frame instead of the frame itself. His entire talk revolved around looking more closely at the assumption that atomic decomposition and a transform are somehow similar. Or that when we say a signal is sparse, we mean it is sparse in some dictionary; but we can also mean its projection on a frame is sparse. This is then "cosparsity", which brings with it l1-analysis. To be a little more formal, we can consider solving the "synthesis" problem $$ \min_\vz || \vz ||_1 \; \textrm{subject to} \; \vy = \MA \MD \vz $$ where we assume $\vz$ is sparse; or the "analysis" problem $$ \min_\vx || \MG \vx ||_1 \; \textrm{subject to} \; \vy = \MA \vx $$ where we assume the analysis (or transformation) of $\vx$ by $\MG$, i.e., $\MG\vx$, is sparse. Gribonval et al. have done an excellent job interpreting what is really going on with l1-analysis. Instead of wanting to minimize the number of non-zeros in the signal domain, l1-analysis wants to maximize the number of zeros in the transform domain. Later on, his student Sangnam Nam presented extraordinary results of this work with their Greedy Analysis Pursuit, which attempts to null non-zeros in the solution. This reminded me a bit of the complementary matching pursuit, but this is quantitatively different. Gribonval joked that "sparsity" may now be "over." The new hot topic is "cosparsity."
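To make the counting of zeros in the transform domain concrete, here is a minimal Python sketch. It assumes a piecewise-constant signal and uses a first-difference operator as a stand-in for $\MG$; both choices are illustrative assumptions of mine, not examples taken from the talk.

```python
def first_difference(x):
    """Apply a first-difference analysis operator: (Gx)[i] = x[i+1] - x[i]."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def count_nonzeros(v, tol=1e-12):
    """Number of entries whose magnitude exceeds a small tolerance."""
    return sum(1 for value in v if abs(value) > tol)


# Hypothetical piecewise-constant signal with two jumps.
x = [2.0] * 4 + [5.0] * 3 + [-1.0] * 5
Gx = first_difference(x)

signal_nonzeros = count_nonzeros(x)       # the signal itself is not sparse
analysis_nonzeros = count_nonzeros(Gx)    # one non-zero per jump
cosparsity = len(Gx) - analysis_nonzeros  # number of zeros in the transform domain

print(signal_nonzeros, analysis_nonzeros, cosparsity)
```

The signal has no zero samples at all, yet almost every entry of its analysis coefficients is zero, which is the kind of structure the analysis problem rewards.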

There were many other exciting talks too, showing extraordinary results; but now I must go and work on some interesting ideas that may or may not require my computer to run through the night.

SPARS 2011

2011-06-27T08:24:56Z

And so it begins! A whole week of nothing but sparsity in various forms and guises. My summer has officially started!

The proceedings collect all the accepted one-page submissions, which I find provide very tantalizing details. And for a cool down, I am reading Michael Elad's excellent book Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. It does for sparse signal processing what Hamming's book does for digital filters: completely accessible, drawing together numerous disciplines, and giving a good big picture perspective.

Today began with a bang, featuring Yi Ma. Just like Andrew Ng's Google Talk, Ma amazed me (and I am sure many others) with his examples of the incredible power of Robust PCA for everything from face and text alignment, to extracting the geometry of buildings from 2D pictures without any use of edge or corner detection. All one needs are the pixels, and the rest is done by the assumption that the image can be decomposed into a low rank texture matrix, and a sparse matrix with non-textural items, like a person moving in front of a background. One of my favorite examples was where he took 30 images of Bill Gates's face. Robust PCA aligned them all, corrected for transformations like shearing, and produced a mean image of Bill Gates. Now, I wonder, can we do the same for a piece of classical music, where we create a mean version of a particular Bach Partita from a dozen Glenn Gould recordings?

There were many other fantastic talks and conversations to be had. Because my internet access at this expensive hotel is free only for 30 minutes every 24 hours at a severely limited bandwidth, I must limit my description to that. Tomorrow will be another exciting day in "Sparseland", as Elad calls it.

CMP in MPTK: Third Results

2011-06-17T09:21:35Z

In a previous entry, I compared our results with those produced by my own implementation of CMP in MATLAB --- which did not suffer from the bug because it computes the optimal amplitude and phases in a slow way with matrix inverses. Now, with the new corrected code, I have produced the following results. Just for comparison, here are the residual energy decays of my previous experiments, detailed in my paper on CMP with time-frequency dictionaries.

Now, with the corrections, I observe the decays. The "MPold" decay is that produced by the uncorrected MPTK. "MP" shows that of the new code. Only in Attack and Sine do we see much difference; and at times in Sine the previous version of MPTK beats the corrected version. (Such is the behavior of greedy algorithms. I will write a Po'D about this soon.) Anyhow, the decays of CMP-$\ell$ (where the number denotes the largest number of possible cycles of refinement, but I suspend refinement cycles when energyAfter/energyBefore > 0.999) comport with the decays I see in my MATLAB implementation (see above). So, now I am comfortable moving on.

Below we see the decays and cycle refinements for three different CMPs for these four signals. (Note the change in the y axes.) Bimodal appears to benefit the most in the short term from the refinement cycles, after which improvement is sporadic. The modeling of Sine has a flurry of improvements. It is interesting to note that as $\ell$ increases, we do not necessarily see better models with respect to the residual energy. For instance, for Attack, the residual energy for CMP-1 beats the others.

And briefly back to the glockenspiel signal, below we see the decays and improvements using a multiscale Gabor dictionary (up to atoms with scale 512 samples).

Grab your things, I've come to take you home!

2011-06-16T22:25:14Z

I have solved the mystery that has pushed me for the past week into excruciatingly fun debugging sessions. Yes, I know I mentioned on June 9 that CMP was extremely easy to implement in MPTK. Then came second thoughts as to the behavior of the implementation. And there followed more observations, and rambling observations, and then the videos appeared. And then the music video appeared. Well, now here's another: ex1_MP_atoms_solved.mov

The columns of the matrix $\MG(\MG^H\MG)^{-1}$ actually form the "dual basis", or "biorthogonal basis", to the columns in $\MG$. Since the Gramian is just a 2x2 matrix, we can invert it easily and obtain these dual vectors: $$ \vh = \frac{\vg - \langle \vg, \vg^* \rangle \vg^* }{1 - | \langle \vg, \vg^* \rangle|^2} $$ the other being the conjugate. Note that though $||\vg||_2 = 1$, the dual may not have the same norm. What is cool about dual bases is that we can project on either, but we must reconstruct on the other. So, considering that we have a complex unit norm atom from the dictionary, and that we want to find the best real atom from it given a signal, we project $\vx$ onto the unit norm atom and its conjugate and build back with the dual: $$ \frac{1}{\gamma} \left ( \langle \vx, \vg \rangle \left [ \vg - \langle \vg, \vg^* \rangle \vg^*\right ] + \langle \vx, \vg^* \rangle \left [ \vg^* - \langle \vg^*, \vg \rangle \vg\right ] \right ) = \frac{a}{2} \left [ e^{i\phi} \vg + e^{-i\phi} \vg^* \right ] $$ where $\gamma := 1 - | \langle \vg, \vg^* \rangle|^2$. The problem now is to find $a$ and $\phi$.

If we make the following definitions: $$ \begin{align} x & := \langle \vx, \vg \rangle \\ r & := \langle \vg, \vg^* \rangle \end{align} $$ then the above becomes $$ x\vg + x^*r^*\vg - x r\vg^* + x^*\vg^* = \frac{a\gamma}{2} \left (\cos \phi + i \sin \phi\right )\vg + \frac{a\gamma}{2} \left (\cos \phi - i \sin \phi\right )\vg^*. $$ Now, one may think, as I first did, that we can group and separate the terms on each side multiplying $\vg$, and then solve for $a$ and $\phi$. But unless $\vg^*$ is orthogonal to $\vg$, that is not a good thing to do. Instead, multiply each side by $\vg^H$: $$ x + x^*r^* - x r^2 + x^*r = \frac{a\gamma}{2} \left (\cos \phi + i \sin \phi\right ) + \frac{a\gamma}{2} \left (\cos \phi - i \sin \phi\right )r. $$ since $\vg^H\vg = 1$, and $\vg^H\vg^* = r^*$. Now we can group things based on real and imaginary because those components are definitely orthogonal. We thus obtain the two equations $$ \begin{align} \text{Real}\{x + x^*r^* - x r^2 + x^*r\} & = \frac{a\gamma}{2} \left ( 1 + r \right ) \cos \phi \\ \text{Imag}\{x + x^*r^* - x r^2 + x^*r\} & = \frac{a\gamma}{2} \left (1 - r \right ) \sin \phi. \end{align} $$ Now we can solve for $a$ and $\phi$ quite simply.

What I noticed in the code of MPTK was that $\gamma$ was missing from the computation of the amplitudes. When I inserted it back in, everything worked as it should! But, as I predicted, its absence was absolutely and incredibly subtle. For most atoms with a non-negligible imaginary part, $1 - | \langle \vg, \vg^* \rangle|^2 \approx 1$. This is especially true when decomposing audio signals because only atoms with modulation frequencies close to zero and Nyquist will have a $\gamma$ that is not approximated by 1. For my length-64 Gaussian-windowed atoms with a modulation frequency 1/64 or 31/64, $\gamma = 0.8038$. When we go to a modulation frequency 2/64, or 30/64, this becomes $\gamma = 0.9985$. For length-512 atoms, atoms with a modulation frequency 1/512 or 255/512 have $\gamma = 0.7951$. When we go to a modulation frequency 2/512, or 254/512, this becomes $\gamma = 0.9982$. And for length-16384 atoms, atoms with a modulation frequency 1/16384 or 8191/16384 have $\gamma = 0.7939$. When we go to a modulation frequency 2/16384, or 8190/16384, this becomes $\gamma = 0.9982$.
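The behavior of $\gamma$ can be checked numerically with a short Python sketch. It is not the MPTK code: it assumes the Gaussian window formula quoted in the "Debugging repository" entry, the inner product convention $\langle u, v \rangle = \sum_n u[n] v^*[n]$, and a real signal. The closed form for the complex coefficient $(a/2)e^{i\phi}$ in `amplitude_and_phase` is my own algebra under those assumptions, and all function names are hypothetical.

```python
import cmath
import math


def gaussian_window(length, variance=0.02):
    """Gaussian window following the MPTK formula quoted in the debugging entry."""
    scale = 1.0 / (2.0 * variance * (length + 1) ** 2)
    return [math.exp(-((i - (length - 1) / 2.0) ** 2) * scale) for i in range(length)]


def complex_atom(length, freq):
    """Unit-norm complex atom: Gaussian window modulated at `freq` cycles per sample."""
    window = gaussian_window(length)
    atom = [w * cmath.exp(2j * math.pi * freq * n) for n, w in enumerate(window)]
    norm = math.sqrt(sum(abs(v) ** 2 for v in atom))
    return [v / norm for v in atom]


def inner(u, v):
    """Inner product <u, v> = sum_n u[n] * conj(v[n]) (assumed convention)."""
    return sum(a * complex(b).conjugate() for a, b in zip(u, v))


def gamma(atom):
    """gamma = 1 - |<g, g*>|^2 for a unit-norm complex atom g."""
    r = inner(atom, [v.conjugate() for v in atom])
    return 1.0 - abs(r) ** 2


def amplitude_and_phase(signal, atom):
    """Recover a and phi when signal = a * Re(exp(i*phi) * atom), signal real."""
    r = inner(atom, [v.conjugate() for v in atom])
    p = inner(signal, atom)
    # Solving p = coef + conj(coef)*conj(r) for coef = (a/2)*exp(i*phi);
    # the division by gamma is the factor discussed in the text.
    coef = (p - r.conjugate() * p.conjugate()) / (1.0 - abs(r) ** 2)
    return 2.0 * abs(coef), cmath.phase(coef)


# gamma for a few modulation frequencies of a length-64 atom
for freq_bin in (1, 2, 16):
    print(freq_bin, round(gamma(complex_atom(64, freq_bin / 64)), 4))

# Round trip on a synthetic real atom with known amplitude and phase
atom = complex_atom(64, 1 / 64)
signal = [0.7 * (cmath.exp(0.3j) * v).real for v in atom]
a, phi = amplitude_and_phase(signal, atom)
print(a, phi)
```

Dropping the division by $\gamma$ in the last step scales the recovered amplitude by $\gamma$, which is negligible for most frequency bins but not for those next to zero and Nyquist.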

This explains why, no matter how large I scaled my atoms, I was only seeing this effect on the second and the penultimate frequency indexes. However, the effect appears to be large. Here are some new plots. Below is the decomposition of the attack signal using a single scale dictionary. Compare these with those.

Tomorrow I will run the glock examples again. What a great way to end my time here!

Don't Give Up

2011-06-15T18:44:22Z

This is my life the past few days. And yet again, I think I have it cornered. The same thing happens for atoms at the Nyquist frequency. Now, how to fix it?

People, normalize your signals before you write wavs

2011-06-15T11:36:11Z

MPTK works!

ex1_MP_atoms.mov

In my experiments before, the MP reconstruction algorithm was hard clipping all values with magnitude greater than 1. So that is from where the spikes come. Oh, for F's sakes.

Debugging repository

2011-06-14T10:11:23Z

We have decided to get to the bottom of the unusual behavior of MPTK, since the next steps of our work on CMP depend on it. This entails comparing the results from my MATLAB implementation with those of MPTK (and CMPTK) on the same dictionary. I have decomposed one signal of dimension 1024 samples, with a dictionary of modulated Gaussian windows of scale 64 samples, with a hop size of 8 samples. The MPTK dictionary is defined as follows (couldn't use pre html tags for some reason, so I attach it as a png): This "windowOpt = 0" business means that the variance of the Gaussian window is set to the default 0.02. In my MATLAB code I have attempted to do the same. Here is how MPTK computes the window (actually only half of it is needed) of even length length:

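/* Variance parameter of the Gaussian window (the default when windowOpt = 0). */
optional = 0.02;
/* Precompute the factor that multiplies k*k in the exponent. */
optional = 1/(2*optional*(length+1)*(length+1));
/* p1 advances and p2 retreats, so both halves are filled symmetrically. */
for ( i = 0; i < length/2; i++, p1++, p2-- ) {
  k = (double) i - ( (double) (length - 1) )/2.0;
  newPoint = exp(-k*k*optional);
  *p2 = *p1 = (Dsp_Win_t) newPoint;
}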

Put together, p1 and p2 define the entire window. Here is how I compute the same Gaussian window in MATLAB:

% N is the (even) window length in samples
winVals = zeros(1,N);
optional = 0.02;                          % variance parameter, as in MPTK
optional = 1/(2*optional*(N+1)*(N+1));    % factor multiplying kk^2 in the exponent
counter = 1;
for ii=0:floor(N/2)-1
    kk = ii - (N-1)/2;
    winVals(counter) = exp(-kk.^2*optional);
    winVals(N - counter + 1) = exp(-kk.^2*optional);   % mirror into the second half
    counter = counter + 1;
end

Both dictionaries have 3993 complex atoms. So far, everything checks out.
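One accounting that reproduces the 3993 figure is sketched below in Python. It assumes that atoms are placed only where the whole window fits inside the signal, and that each position carries the frequency bins 0 through scale/2 inclusive; neither assumption is stated above, and `count_atoms` is a hypothetical helper.

```python
def count_atoms(signal_length, scale, hop):
    """Count complex atoms for one scale under the assumptions stated above."""
    positions = (signal_length - scale) // hop + 1  # window fits entirely in the signal
    frequencies = scale // 2 + 1                    # bins 0..scale/2 inclusive
    return positions * frequencies


# 1024-sample signal, scale 64, hop 8: 121 positions times 33 frequencies
print(count_atoms(1024, 64, 8))
```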

In the graph below, I show the ratios of residual energies due to my implementation to those produced by MPTK/CMPTK. Above the periwinkle dashed line, the residual is larger in MATLAB than in MPTK; and vice versa below the line. We see for the most part the differences of MP/MPTK are negligible. This could be due to some indexes being one off, some of which I did find in my code and fixed. For CMP/CMPTK however, the story is different. (I am only performing one cycle of refinement in each case.) There are no reported increases of the residual energy in CMP, but in CMPTK there are plenty, beginning with atom 11.

Let's have a look at the residual decreases of CMP in MATLAB:

2: Energy reduction of 0.27118%
3: Energy reduction of 0.13418%
4: Energy reduction of 0.37532%
5: Energy reduction of 0.073902%
6: Energy reduction of 15.5745%
7: Energy reduction of 5.8543%
8: Energy reduction of 1.0909%
9: Energy reduction of 1.764%
10: Energy reduction of 1.497%

And now of CMPTK

2: Energy reduction of 0.266262% from 1 of 1 possible cycles
3: Energy reduction of 0.131654% from 1 of 1 possible cycles
4: Energy reduction of 0.40187% from 1 of 1 possible cycles
5: Energy reduction of 0.0768828% from 1 of 1 possible cycles
6: Energy reduction of 4.58075% from 1 of 1 possible cycles <-----
7: Energy reduction of 11.134% from 1 of 1 possible cycles
8: Energy reduction of 2.4887% from 1 of 1 possible cycles
9: Energy reduction of 1.68054% from 1 of 1 possible cycles
10: Energy reduction of 1.72243% from 1 of 1 possible cycles

All looks compatible through iteration 5; the divergence begins at iteration 6 (arrow). We see that in the graph above too, when CMPTK performance takes a nose dive like my recent investments in CMP, Inc.

So what is the atom that causes trouble at iteration 11? There is nothing unusual about it. It begins at sample 80 (shouldn't that be 79 if we begin indexing at 0?), has a frequency of 0.015625*8192 = 128 Hz, an amp of 0.277851, and a phase of -1.82707. This atom causes the residual energy to increase 1.146%. Here is where we take a look at the residual just before and just after this atom comes into play.

(Editor's note: several hours passed between these two paragraphs, during which I had a long think on the way to the bus stop, and some more thinking over dinner.)

Upon returning home, I decided to perform the ultimate test of sanity: put two atoms from the dictionary in a signal, and see what happens. With this I believe I have cornered the problem. Take a look at the following movie. My MATLAB implementation vs. MPTK. One is slow, and one is speedy. Which one will win? ex1_MP_atoms.mov. Clearly, MPTK has made a very poor choice. And I believe it all comes down to the position index selecting the atom in the block. What happens when, in my MATLAB MP, I shift by one the positions at which I search for the largest magnitude projection (i.e., the atom hop size)? ex1_MP_atoms2.mov. Not only do I pick up the same wrong atoms as MPTK, my computer's battery nears empty. The 2-norm of the differences of the residuals is 1e-4 --- which is not too large, but not too small. Finally, when I shift the dictionary such that the atoms line up with how MPTK expects them, and keep my erroneous MP, I get the following: ex1_MP_atoms3.mov. Yay! It works!!!

But wait. That's not all. Often I see the following behavior: ex1_MP_atoms4.mov. That first projection was a doozy for MPTK. It left a massive dirac. And those 3 atoms aren't even close to each other to interfere. And here is one where it happens twice: ex1_MP_atoms5.mov. My MP converges after 3 steps with an SRR of 94 dB; while MPTK has a paltry 31 dB SRR. The only thing I can think of is that MPTK is so fast that it becomes extremely excited when something fits so perfectly that it twists its atoms in knots before subtraction.

So, not only do we need to change the positions at which MPTK searches for maximums, but also something in the atom creation causing those diracs. And then CMPTK will hopefully be ready for the next step.

Paper of the Day (Po'D): Encoding vs. Training with Sparse Coding Edition

2011-06-13T21:45:51Z

Hello, and welcome to Paper of the Day (Po'D): Encoding vs. Training with Sparse Coding Edition. Today's paper is A. Coates and A. Y. Ng, "The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization", In Proc. Int. Conf. on Machine Learning, Bellevue, Washington, USA, Jun. 2011. In this work, which is a follow-up to their AISTATS'11 paper, the authors explain why the dictionary used in encoding the data is not as important in providing a good representation, as the encoding algorithm itself. The authors report competitive results even when the dictionary is populated with random samples from the data.
CIFAR-10, NORB, and Caltech 101 (81.5%, 95%, and 72.6%, the first two being state-of-the-art results).

The authors experiment with six methods for populating the dictionary, and a few methods for encoding using the dictionary.

Here are the methods for learning/populating the dictionary (we are seeking dictionary $D \in \mathcal{R}^{n \times d}$ such that each atom (column) has unit $\ell_2$-norm):

1. Sparse coding (SC) with coordinate descent: $$\min_{D, s^{(i)}} \sum_i \|Ds^{(i)} - x^{(i)}\|_2^2 + \lambda\|s^{(i)}\|_1.$$ One way to solve this optimization problem is to alternate minimization between the dictionary $D$ and sparse codes $\{s^{(i)}\}$ (one is kept fixed while the objective function is minimized w.r.t. the other and so on). The authors obtain the parameter $\lambda$ by minimizing its average cross-validation error over a grid of candidate values (as is suggested in the paper they cite).
2. Orthogonal matching pursuit (OMP-k): $$\begin{align}&\min_{D, s^{(i)}} \sum_i \|Ds^{(i)} - x^{(i)}\|_2^2 \\ & \text{subject to } \|s^{(i)}\|_0 \leq k\text{, } \forall i\text{,}\end{align}$$where $k$ is an upper bound on the number of nonzero elements in $s^{(i)}$. To solve this optimization problem, one would alternate between minimizing over $D$ and $\{s^{(i)}\}$, just like above.

Coordinate descent and OMP are algorithms for obtaining sparse codes given a dictionary; the dictionary, on the other hand, can be obtained using gradient descent.
Optimizing both 1. and 2., you get the sparse codes as a byproduct of learning the dictionary (the training and encoding phases are intertwined). However, this doesn't stop us from holding on only to the dictionary obtained in this step and computing the codes by other means.

3. and 4. Sparse restricted Boltzmann machine (RBM) and sparse auto-encoder: (I will go over these in later posts.)
5. Random downsampling of data matrix $X$ containing normalized $x^{(i)}$
6. Random weights: One fills the dictionary with normalized columns sampled from the standard normal distribution.

And here are the methods for encoding:

1. SC: Same optimization problem as above, $D$ fixed, possibly different $\lambda$, and setting the elements of feature $f$ as $$\begin{align}f_j &= \max\{0, s_j\} \\ f_{j+d} &= \max\{0, -s_j\}.\end{align}$$Notice that instead of $d$ dimensions, the feature $f$ has $2d$ dimensions. The authors call this "polarity splitting" and I don't currently understand the significance of it.
2. OMP-k: Settings like 1.
3. Soft thresholding: For fixed threshold $\alpha$, they assign $f$ as follows, $$\begin{align}f_j &= \max\{0, D^{(j)T}x - \alpha\} \\ f_{j+d} &= \max\{0, -D^{(j)T}x - \alpha\}.\end{align}$$
4. The natural encoding: If the dictionary is learned with SC, then the already-learned codes are used. Same goes for OMP. For RBM and the autoencoder, one computes the activation at the hidden nodes using the logistic sigmoid function $g$: $$\begin{align}f_j &= g(W^{(j)}x +b)\\f_{j+d} &= g(-W^{(j)}x +b), \end{align}$$ where $W = D^T$ and $W^{(j)}$ is the $j$th row of $W$. For 5. and 6., the authors use the dictionary as a linear map, i.e., $f = D^Tx$ (this is like random projection, except instead of decreasing dimensionality it increases it, assuming $d > n$).
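The soft thresholding encoder with polarity splitting is simple enough to sketch in a few lines of Python. This is a minimal illustration of the two formulas above, not the authors' code; the toy dictionary, input, threshold, and the name `soft_threshold_encode` are all made up for the example.

```python
def soft_threshold_encode(dictionary, x, alpha):
    """Encode x with soft thresholding and polarity splitting.

    dictionary: list of d atoms (the columns of D), each a list of n floats.
    Returns a feature vector of length 2*d.
    """
    projections = [sum(dj * xj for dj, xj in zip(atom, x)) for atom in dictionary]
    positive = [max(0.0, p - alpha) for p in projections]    # f_j
    negative = [max(0.0, -p - alpha) for p in projections]   # f_{j+d}
    return positive + negative


# Hypothetical toy dictionary with d = 3 unit-norm atoms in n = 2 dimensions.
D = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [0.5, -1.0]
f = soft_threshold_encode(D, x, alpha=0.25)
print(len(f), f)
```

Each atom contributes two non-negative entries, at most one of which is non-zero, so the sign of the projection is kept as a position in the feature vector rather than as a sign.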

The authors obtain the best result on CIFAR-10 by using OMP-1 for training, and soft thresholding for encoding (and showing that the fatter the dictionary -- e.g., d = 6000 -- the better). They achieve the best result for NORB using random patches as the dictionary, and SC for encoding. Same goes for Caltech 101, although this result trails behind the state-of-the-art by 3.1%. The authors report the accuracies for each training/encoding pair for classification on CIFAR-10 in the first table (pasted below). In the second table, they report the best accuracies obtained by anyone on CIFAR-10.

Let's go over how the authors use the dataset in the unsupervised feature learning phase. In the case of CIFAR and NORB, they set $x^{(i)} \in \mathcal{R}^n$ to randomly-chosen, normalized, vectorized $(6 \times 6) \times 3$ patches. As for Caltech 101, the $x^{(i)}$ are the $128$-dimensional SIFT descriptors extracted from each random $16 \times 16$ patch. Before sending it over to the dictionary training algorithm, they perform ZCA-whitening on the whole dataset $X = [x^{(1)} , \ldots, x^{(1600)}]$.

Given the feature mapping parameterized by $D$, here's an overview of the pipeline that the authors set up for performing classification. First, they extract patches $\{x^{(i)}\}$ (with the size specified above) with a shift of one pixel for CIFAR-10 and NORB, and eight pixels for Caltech 101, covering the whole image. For CIFAR-10 and NORB, the $x^{(i)}$s are the raw pixel values for the patch, whereas for Caltech 101, they are the values of the single SIFT descriptor extracted from the patch. For each pair of training/encoding method, the authors use the dictionary $D$ to get feature $f^{(i)}$ for each $x^{(i)}$. So for example, for each $32 \times 32$ image in the CIFAR-10 dataset we get (given the settings specified) $27\times 27 \times 1600 \times 2 = 2{,}332{,}800$ dimensions! To reduce the dimensionality of the feature space a pooling step is carried out (differently for each dataset):

CIFAR-10: The authors average the feature values over the four quadrants of the image, yielding the final feature vector representing that image. (With pooling, we are down to $4 \times 1600 \times 2$ dimensions.)
NORB: From what I understand here, they perform two stages of downsampling on the original $108 \times 108$ images before extracting the patches. But then, they don't mention their pooling strategy after the feature mapping is done.
Caltech 101: Here, the authors perform "spatial pyramid" pooling. That is, they perform max-pooling on the features over $4 \times 4$, then $2 \times 2$, and $1 \times 1$ grids in a hierarchical manner. They concatenate the results to form the final feature vector representing the image.
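The CIFAR-10 numbers above can be reproduced with a little arithmetic, and the quadrant averaging can be sketched on a toy grid. The Python below is only an illustration: how an odd-sized grid such as $27 \times 27$ is split into quadrants is my assumption (integer halving), and the helper names are hypothetical.

```python
def patches_per_side(image_size, patch_size, stride=1):
    """Number of patch positions along one side of a square image."""
    return (image_size - patch_size) // stride + 1


def quadrant_average_pool(features):
    """Average feature vectors over four quadrants of a grid of patch locations.

    features[row][col] is the feature vector (list of floats) of one patch.
    The split at rows // 2 and cols // 2 is an assumed convention.
    """
    rows, cols = len(features), len(features[0])
    half_r, half_c = rows // 2, cols // 2
    pooled = []
    for r0, r1 in ((0, half_r), (half_r, rows)):
        for c0, c1 in ((0, half_c), (half_c, cols)):
            cells = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            dim = len(cells[0])
            pooled.extend(sum(cell[k] for cell in cells) / len(cells) for k in range(dim))
    return pooled


side = patches_per_side(32, 6)       # 6x6 patches, one-pixel shift
dims_before = side * side * 1600 * 2
dims_after = 4 * 1600 * 2
print(side, dims_before, dims_after)

# Toy 3x3 grid of 2-dimensional features, just to exercise the pooling.
toy = [[[float(r), float(c)] for c in range(3)] for r in range(3)]
pooled = quadrant_average_pool(toy)
print(pooled)
```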

Having thus obtained a single feature vector for each image in the test and training sets, the authors train a (actually many) linear SVM(s) to classify.

In my opinion, the results of this paper are interesting but not completely surprising. Consider PCA and random projection (although they don't result in overcomplete dictionaries as the methods employed in this paper do). We know that (in some tasks) random orthonormal weights are comparable to their "data-aware" and learned counterparts, i.e., the principal components. The results also explain why the k-means algorithm (employed in the AISTATS'11 paper) fared so well as the scheme for learning the atoms of the dictionary.

Some Experiments with Glockenspiel

2011-06-13T16:11:52Z

Today I have been experimenting with CMPTK and a real audio signal. With this larger signal, the energy errors by which I have been plagued this last week seem to be much rarer.

Below we see the residual energy decay of this example with MP and CMPTK using a dictionary of Gabor atoms (Gaussian window) of only two scales: 128/32 and 4906/64/8192 (scale/hop/FFTsize if different from scale). I run 200 iterations. CMP-$l$ is implemented such that all representations at each order undergo at least one cycle. When $l = 5$, more refinement cycles can be performed until the ratio of residual energies before and after a cycle is less than 1.002, or less than about 0.009 dB. I also plot in this graph, the "cycle energy decrease," which is the ratio of the residual energy before and after the entire refinement at the iteration. We find a few large spikes of improvement. At the end of 200 iterations, the models produced by CMP have an error 2.2 dB better than that produced by MP. Below, I show the time-domain residual signals resulting from both MP and CMP-1. For the most part, the CMP-1 error signal is below that of MP; but strangely the first attack causes more problems for CMP-1 than the other attacks.

Let's have a listen to the sounds. Here is the residual due to MP; and here is the residual due to CMP-1. I can't really tell much difference, except the CMP-1 residual is a bit quieter. I don't hear any significant differences in the pre-echoes of the attacks, for which I was candidly hoping. But if we take a closer look at how the attacks are being modeled by the shorter atoms, we see some promising results. Below I show each resynthesis aligned to the original signal using only the atoms of scale 128 samples (which is about 6 ms at this sampling rate of 22.05 kHz). For the MP decomposition, 109 atoms out of 200 fit this description. For both CMP decompositions, 105 atoms fit this description. Except for the third, the attacks modeled by CMP look more cleanly synthesized than those of MP --- especially the second attack, which appears delayed by MP.

Now, what if we do not use the condition that the refinement process can end if there is no significant reduction in residual energy? Using the same dictionary as above, the figure below shows residual energy decay of this example with MP and CMPTK with one or two refinement cycles (5 will take too long, but I might run it overnight). This means that MP will have 200 atom selections and 200 subtractions, and CMP-1 will have 20,100 atom selections and 40,200 additions, and CMP-2 will have 40,200 atom selections and 80,400 additions. Compared with the figure above, I don't see any benefit to forcing a certain number of refinement cycles --- which is a good thing for reducing the computational complexity. A look at the residual signals confirms this. (I will not run $l=5$ overnight.)
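The operation counts quoted above can be reproduced with a little arithmetic. This is only a sketch under a counting model I am inferring from the numbers: at iteration $n$ each forced cycle re-selects all $n$ atoms in the model, and each re-selection costs two signal additions (add the old atom back to the residual, then subtract its replacement). The function name is hypothetical.

```python
def cmp_operation_counts(iterations, cycles):
    """Count atom selections and signal additions for forced refinement cycles.

    Assumed model: at iteration n, each of the `cycles` cycles re-selects
    all n atoms, and every re-selection costs two additions.
    """
    selections = cycles * sum(range(1, iterations + 1))
    additions = 2 * selections
    return selections, additions


cmp1_counts = cmp_operation_counts(200, 1)  # (20100, 40200)
cmp2_counts = cmp_operation_counts(200, 2)  # (40200, 80400)
```

Under this model the cost grows quadratically with the number of iterations, against the 200 selections and 200 subtractions of plain MP.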

Now, we move on to a richer dictionary. Below we see the residual energy decay of this example with MP and CMPTK for a dictionary of Gabor atoms (Gaussian window) of eight scales: 32/8, 128/32, 256/64, 512/128, 1024/128, 2048/256, 4096/512/8192, 8192/1024/16384 (scale/hop/FFTsize if different from scale). For these examples, I have kept the rule that at least one refinement cycle is required, but additional ones will be performed as long as the ratio of residual energies before and after a cycle is greater than 1.002, or greater than about 0.009 dB. Compared to that of the two-scale dictionary above, we see a better decay of the residual energy, but there appears to be less maximum improvement with the cycles. Here the values extend up to nearly 0.15 dB; but for the two-scale dictionary the max improvement is twice that. Just by eyeballing the two though, I think the mean improvements are about the same.
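The stopping rule used here is easy to state in code. The sketch below converts the ratio of residual energies to decibels and applies the 1.002 threshold; the function names are hypothetical and this is not taken from CMPTK.

```python
import math


def energy_ratio_db(energy_before, energy_after):
    """Ratio of residual energies (before / after a cycle) in decibels."""
    return 10.0 * math.log10(energy_before / energy_after)


def keep_refining(energy_before, energy_after, min_ratio=1.002):
    """True if the last cycle improved enough to justify another one."""
    return energy_before / energy_after > min_ratio


threshold_db = energy_ratio_db(1.002, 1.0)  # roughly 0.0087 dB
```

So a ratio of 1.002 corresponds to about 0.009 dB, and the 0.15 dB spikes mentioned above correspond to a ratio of about 1.035.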

Below, I show the time-domain residual signals resulting from both MP and CMP-1 with this multiscale dictionary. For the most part, the CMP-1 error signal is below that of MP. Both methods appear to have the same problem with the first attack.

Let's have a listen to the sounds. Here is the residual due to MP; and the residual due to CMP-1. Again, I can't really hear the 1 dB difference.

Anyhow, the take-home messages from all my weekend experiments appear to be these:

Globally, CMP does not appear to improve a signal model enough over that of MP to warrant its significantly higher computational complexity.
We need to employ a much better strategy at localized levels to avoid as much as possible this additional overhead to have the greatest gains.
The architecture of CMP remains an attractive alternative to that of MP and OMP, where once an atom is selected, it remains a part of the model forever.
Its cyclic application of a simple procedure also permits the application of more complex criteria for atom replacement, such as perceptual weightings, and "dark energy" (my favorite!)
Thus, we must augment CMP with localized considerations; and I believe I can see at least a dozen variations.
With which one should I begin?
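For reference, here is a toy sketch of the cyclic refinement idea these experiments revolve around. It is not CMPTK: it assumes a small dictionary of real, unit-norm atoms stored as plain lists, and all names and defaults (`cyclic_mp`, `min_ratio`, `max_cycles`) are my own. After each greedy MP step, every atom in the model is added back to the residual and replaced by whichever atom then fits best; cycles repeat until the energy ratio drops below the threshold.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def energy(v):
    return dot(v, v)


def axpy(scale, atom, vector):
    """Return vector + scale * atom."""
    return [x + scale * a for x, a in zip(vector, atom)]


def best_atom(residual, dictionary):
    """Index and inner product of the unit-norm atom most correlated with residual."""
    products = [dot(residual, atom) for atom in dictionary]
    index = max(range(len(dictionary)), key=lambda i: abs(products[i]))
    return index, products[index]


def cyclic_mp(signal, dictionary, iterations, min_ratio=1.002, max_cycles=5):
    """Toy cyclic matching pursuit; max_cycles=0 reduces to plain MP."""
    residual = list(signal)
    model = []  # entries are [atom_index, coefficient]
    for _ in range(iterations):
        # Greedy MP step: pick the best atom and subtract it.
        index, coeff = best_atom(residual, dictionary)
        model.append([index, coeff])
        residual = axpy(-coeff, dictionary[index], residual)
        for _ in range(max_cycles):
            before = energy(residual)
            for entry in model:
                # Put the atom back, then re-select against the new residual.
                residual = axpy(entry[1], dictionary[entry[0]], residual)
                entry[0], entry[1] = best_atom(residual, dictionary)
                residual = axpy(-entry[1], dictionary[entry[0]], residual)
            after = energy(residual)
            if after == 0.0 or before / after < min_ratio:
                break
    return model, residual


# Tiny example: two axis atoms and one diagonal atom in the plane.
s = 2 ** -0.5
toy_dictionary = [[1.0, 0.0], [0.0, 1.0], [s, s]]
toy_signal = [1.0, 0.9]
mp_model, mp_residual = cyclic_mp(toy_signal, toy_dictionary, 2, max_cycles=0)
cmp_model, cmp_residual = cyclic_mp(toy_signal, toy_dictionary, 2)
```

In exact arithmetic with unit-norm atoms, a replacement in this sketch can never raise the residual energy, since the atom being replaced is itself a candidate. The localized variations mentioned above would amount to changing which atoms are revisited and what `best_atom` optimizes.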

A bug or a feature?

2011-06-12T21:49:40Z

I have been thinking about why CMPTK using an MDCT dictionary with Kaiser-Bessel windows produced no increases in energy, unlike all the other dictionaries I have been trying. Maybe it is a confluence of window shape, and real signals being approximated by complex atoms, as well as various approximations being made inside MPTK. So I tried something. (Warning: the following is a rather rambling record of observations, probably what happens inside Dr. House's head.)
Now let's try a dictionary of atoms created by modulating a Hann(ing) window of scale 128 samples, and hopped by 1 sample. I find no increases. Here is a picture of the residual energy decays for example 1 (attack):

The Gaussian dictionary does a little better it seems, even though CMPTK was at times increasing the residual energy. I try the same thing, but using a Cosine window --- no errors. Or a rectangular window --- no errors. But these do more poorly than Hanning (except for the sinusoidal signal example 3). Now, back to the modulated Gaussian windows. Changing the scale to 256 samples, making the hop 8, but doubling the variance of the window, I get no errors! Setting its FFT size to 512 samples (zero padding), I get NO errors. Changing its scale to 64 samples, keeping the hop and zero padding, I get lots of errors. Removing the zero padding, NO errors. Changing its scale to 130 (not a power of 2), putting zero padding back, lots of errors. Removing the FFT size option, NO errors. Changing the variance back to the original --- lots of errors.

So it seems that the window shape and/or the zero padding have a significant influence on these errors, either separately or together. The errors do not seem to be caused by the phase optimization of the real atoms using complex ones. To explore this further, let's look at a multiscale dictionary. This one is composed of modulated Gaussian windows of scale/hop: 128/8, 256/8, and 512/16. There are NO errors. If I decrease the size of the smallest scale to 64, I get errors. If I double the variance of the windows, I get NO errors. If I make the windows all Hann(ing), I get NO errors. If I give the 64-scale window zero padding to 256 samples, I get lots of big errors. Putting the zero padding on the largest window, to 1024 samples, gives errors. However, putting the zero padding on the second largest window, out to 512 samples, gives NO errors. (The signal has a length of 1024 samples, so I wonder if that is a problem.)

Let's add more to this multiscale dictionary. Putting in a Dirac basis gives no errors. Putting in a window of size/hop 64/8 produces errors. Changing that scale to 12 produces NO errors. Changing that to 45, 34, or 24 creates errors. It seems like when all the atom scales are large enough, errors are less likely. But sometimes not. Perhaps the problem is also that my test signals are 1024 samples. I am trying with the Glockenspiel example, and no errors are being produced... So, it looks as if CMPTK is implemented correctly; it just has "features." Something else is going on that is causing the refinement iteration to skip over the best atom. And that something appears more and more like it has nothing to do with what I did or didn't do.
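The "errors" tallied above are iterations at which the residual energy goes up instead of down. A minimal sketch of how one might flag them from a logged energy trace follows; the function name and the tolerance parameter are hypothetical and not part of MPTK or CMPTK.

```python
def find_energy_increases(residual_energies, tolerance=0.0):
    """Return (iteration, increase) pairs where the residual energy rose.

    `residual_energies` is a list of residual energies, one per iteration.
    `tolerance` lets tiny numerical fluctuations be ignored.
    """
    increases = []
    for i in range(1, len(residual_energies)):
        delta = residual_energies[i] - residual_energies[i - 1]
        if delta > tolerance:
            increases.append((i, delta))
    return increases


# Made-up trace for illustration: the energy rises once, at iteration 2.
flagged = find_energy_increases([1.0, 0.5, 0.6, 0.3])
```

Running such a check for each dictionary configuration would turn the "errors / NO errors" observations into counts that are easier to compare.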