Emaad Ahmed Manzoor - Blog of Emaad Ahmed Manzoor
http://www.eyeshalfclosed.com

<h1>Colors of Spring</h1>
<p><a data-flickr-embed="true" href="https://www.flickr.com/photos/emaadmanzoor/33664871130/in/photostream/" title="A Springy Ride"><img src="https://c1.staticflickr.com/3/2941/33664871130_83df4f4ceb_z.jpg" width="640" height="360" alt="A Springy Ride" /></a><script async="" src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
<a data-flickr-embed="true" href="https://www.flickr.com/photos/emaadmanzoor/33664863460/in/photostream/" title="Dino Pride"><img src="https://c1.staticflickr.com/3/2939/33664863460_2ab1915a41_z.jpg" width="640" height="360" alt="Dino Pride" /></a>
Pittsburgh, 2017.</p>
Sat, 15 Apr 2017 00:00:00 -0400
http://www.eyeshalfclosed.com/blog/2017/04/15/a-springy-ride/

<h1>Modeling Disease Trajectories</h1>
<p>My next project will (with high probability) involve modeling “disease trajectories” in some form. Though a concrete problem definition is still a few months away, I ran into some interesting recent work in this space that I will be reading soon. This post simply lists and quotes content from these papers that I found interesting. I will append to this post as I come across more.</p>
<h2 id="disease-trajectory-maps">Disease Trajectory Maps</h2>
<p><em>Peter Schulam, Raman Arora. <a href="http://pschulam.com/papers/schulam+arora_nips_2016.pdf">NIPS ‘16</a>.</em></p>
<p>Peter approaches this space from the application point of view, and has a bunch of other papers in this domain since 2015:</p>
<ul>
<li>Integrative Analysis using Coupled Latent Variable Models
for Individualizing Prognoses. <a href="http://pschulam.com/papers/schulam+saria_jmlr_2016.pdf">JMLR ‘16.</a></li>
<li>A Framework for Individualizing Predictions of Disease
Trajectories by Exploiting Multi-Resolution Structure. <a href="http://papers.nips.cc/paper/5873-a-framework-for-individualizing-predictions-of-disease-trajectories-by-exploiting-multi-resolution-structure.pdf">NIPS ‘15.</a></li>
<li>Clustering Longitudinal Clinical Marker Trajectories from Electronic Health
Data: Applications to Phenotyping and Endotype Discovery. <a href="http://pschulam.com/papers/schulam+wigley+saria_aaai_2015.pdf">AAAI ‘15.</a></li>
</ul>
<p>His coauthor <a href="http://www.suchisaria.com/">Suchi Saria</a> is also active in this space.
He links to an exciting commentary on <a href="https://rockhealth.com/reports/predictive-analytics/">predictive analytics in healthcare</a>.</p>
<p>The crux of <a href="http://pschulam.com/papers/schulam+arora_nips_2016.pdf">NIPS ‘16</a> is the following:</p>
<blockquote>
<p>[…] we propose the Disease Trajectory Map (DTM), a novel probabilistic model that learns low-dimensional representations of sparse and irregularly sampled longitudinal data.</p>
</blockquote>
<p>“Sparse and irregularly sampled longitudinal data” is essentially point process data.</p>
<p>I usually find representation-learning papers short on the evaluation front, simply because the evaluation is ill-defined. A useful proxy is to use the learned representation in a machine learning task with ground truth, such as classification. The evaluation presented in this paper does not do this and is unfamiliar to me, but I think the gist is that the DTM corroborates what is already known medically, as shown via some hypothesis and association tests.</p>
<p>It still appears to be a primarily exploratory technique, potentially useful as input to classifiers further down the learning pipeline. The usual skepticism about the lack of statistical guarantees and robustness applies.</p>
<h2 id="patient-flow-prediction-via-discriminative-learning-of-mutually-correcting-processes">Patient Flow Prediction via Discriminative Learning of Mutually-Correcting Processes</h2>
<p><em>Hongteng Xu, Weichang Wu, Shamim Nemati, and Hongyuan Zha. <a href="https://arxiv.org/pdf/1602.05112.pdf">TKDE ‘16</a>.</em></p>
<p>Hongteng approaches this space from the theoretical point of view. He has a bunch of earlier work on point processes with his advisor (and prolific point process expert) <a href="http://www.cc.gatech.edu/~zha/">Hongyuan Zha</a>.</p>
<p>This paper specifically is not about how diseases evolve, but on how patients move between different care units (CUs) in a medical facility.</p>
<blockquote>
<p>In this paper, we focus on an important problem of predicting the so-called “patient flow” from longitudinal electronic health records (EHRs). […] jointly predicting
patients’ destination CUs and duration days.</p>
</blockquote>
<h2 id="longitudinal-mixed-membership-trajectory-models-for-disability-survey-data">Longitudinal Mixed Membership Trajectory Models for Disability Survey Data</h2>
<p><em>Daniel Manrique-Vallier. <a href="https://arxiv.org/pdf/1309.2324.pdf">Annals of Applied Statistics, ‘14</a>.</em></p>
<p>This is pretty far removed from disease trajectory modeling, but has a nice survey that I want to peruse later. The paper essentially combines “Latent Trajectory Models”, which jointly model trajectories and clusters, and mixed memborship models, which permit soft-assignments of data points to clusters. It is easy to see its applicability to disease trajectories.</p><img src="http://feeds.feedburner.com/~r/eyeshalfclosed/~4/LERAE6Ccr5E" height="1" width="1" alt=""/>Sun, 08 Jan 2017 00:00:00 -0500
http://www.eyeshalfclosed.com/blog/2017/01/08/modeling-disease-trajectories/

<h1>Indexing Memories Better</h1>
<p>I realised I’ve been having memory blanks in the thoughts I have on things I read,
secondary research topics, and just about anything else that does not have to do with my
primary research. The symptom manifests as a feeling that I know I had some thoughts
about a certain matter, consequential ones that I would like to remember, but that
I am unable to retrieve.</p>
<p>So I’ve decided to write things down here before they slip away, which probably means more
frequent posts that closely resemble me on Twitter, but without the 140-character chains.</p>
Sat, 12 Nov 2016 00:00:00 -0500
http://www.eyeshalfclosed.com/blog/2016/11/12/writing-more/

<h1>The Elasticity of Consumption</h1>
<p><em>Update:</em> I found a <a href="http://web.stanford.edu/~pista/ann_rev.pdf">survey</a>
that addresses exactly the same questions as outlined in this post (even
categorizing shocks and responses similarly!).</p>
<p>I am interested in how individuals respond to changes in their income. The changes may affect
periodic income, temporarily or permanently, in the form of income shocks, such as layoffs, or
be less egregious periodic income modulations such as delayed pay-days, bonuses and raises.
They may also be completely independent from periodic income, but expected, such as receiving
your tax refund. Finally, they may be completely unexpected financial windfalls, such as having
your security deposit returned by the slumlord you used to rent from in darker times.</p>
<p>Clearly, there is heterogeneity in what income is modulated, and how this income is modulated.
I categorize effects on income as follows:</p>
<ol>
<li>Duration of effect: Permanent vs transient.</li>
<li>Direction of effect: Positive vs. negative.</li>
<li>Magnitude of effect: Large vs. small.</li>
<li>Target of effect: Primary income vs. none.</li>
</ol>
<p>There is also heterogeneity in how different individuals respond to such changes
in income; some do not respond at all, while some increase spending to unsustainable levels.
I categorize response to income (assumed to be observed as changes in spending) as follows:</p>
<ol>
<li>Duration of response: Permanent vs. transient.</li>
<li>Direction of response: Positive vs. negative.</li>
<li>Magnitude of response: Large vs. small.</li>
<li>Target of response: Overall spending vs. specific product categories.</li>
</ol>
<p>These categorizations are clearly coarse-grained: responses may be highly non-linear,
for example, starting out highly positive, then correcting themselves and going negative,
and finally stabilizing at zero. It is nevertheless useful to have some simple way to
categorize the wealth of research findings on this topic.</p>
<p>Note that I restrict this study to that of income; hence, I assume away any effects of
non-income losses such as being the unfortunate victim of a street robbery. Such events would
definitely show up as expense fluctuations, so I may be better served by incorporating
hidden causes of expense modulation into my model.</p>
<p>Why should we care about how people respond to income? The first motivation is to identify
individuals who may harm themselves financially in response to income shocks (I read that
over 70% of lottery winners go bankrupt). Once identified, planned income shocks to such
individuals (such as the disbursement of tax refunds or bonuses) could be spread out over
time to alleviate adverse responses. The second motivation is to guide setting the frequency
of pay in situations where there is a choice; my hypothesis is that different pay frequencies
lead to different levels of welfare for each individual, given that they respond to pay-days
in different ways. The third motivation is to better predict any given individual’s
future expense transactions (time and amount) from past transactions.</p>
<p>What I have in my data are the times and amounts of income, the times and amounts of expenses
and a coarse product category for each purchase. I am also reasonably certain that this data
forms a “financial sensor log”: an involuntary and automatic stream of all the income and
expense transactions of every individual under study. This sets it apart from the previously
studied streams of voluntary actions, such as Foursquare check-ins and tweets, where multiple
layers of choice (the choice to eat, the choice to use Foursquare, the choice to check-in)
add barriers between observing the activity and making claims about natural human behaviour; as
a simple example, are Facebook “friends” friends? I think this difficulty in making claims on
human behaviour from such samples of observed activity is called <a href="https://en.wikipedia.org/wiki/Selection_bias">selection bias</a>.</p>
<p>In the next post, I will survey a number of papers studying the income elasticity of consumption.
The seeds of this literature review were planted by <a href="http://www.andrew.cmu.edu/user/alm3/">Alan Montgomery</a> over at <a href="http://tepper.cmu.edu/">Tepper</a> and
<a href="https://www.andrew.cmu.edu/user/bkovak/">Brian Kovak</a> who taught me something awesome from 9.00am - 10.30am <a href="http://www.heinz.cmu.edu/academic-resources/course-results/course-details/index.aspx?cid=587">every</a> <a href="http://www.heinz.cmu.edu/academic-resources/course-results/course-details/index.aspx?cid=167">morning</a> this
semester.</p>
Sat, 12 Nov 2016 00:00:00 -0500
http://www.eyeshalfclosed.com/blog/2016/11/12/responding-to-income/

<h1>Spark Streaming Microbatch Metrics, Programmatically via the REST API</h1>
<p><em>TLDR; <a href="https://gist.github.com/emaadmanzoor/cc12763a4133ca30fad8be065846ecc4">metric collection script</a>.</em></p>
<p>The Spark Streaming web UI shows a number of interesting metrics over time.
<a href="http://www3.cs.stonybrook.edu/~tnle/">Tan</a> and I were specifically interested
in the (micro)batch start times, processing times and scheduling delays, which we
could find no reported way of obtaining programmatically. We were running Spark
2.0.0 on YARN 2.7.2 in cluster-mode.</p>
<p>All I could find was this <a href="http://stackoverflow.com/questions/34507578/how-to-fetch-spark-streaming-job-statistics-using-rest-calls-when-running-in-yar">StackOverflow question</a>
that suggested scraping the Spark UI webpage (as, horribly, <a href="https://github.com/amitsing89/pythonscripts/blob/master/sparkMonitoring.py">some have done</a>)
or hitting the JSON API endpoint at <code class="highlighter-rouge">/api/v1/</code>. This endpoint, unfortunately again,
does not provide the same metrics we see on the Spark Streaming web UI.</p>
<p>Not directly.</p>
<p>It turns out that you can use the <code class="highlighter-rouge">/jobs/</code> endpoint to reconstruct the metrics you
see on the Spark Streaming web UI: the batch start time, processing delay and
scheduling delay. The key to how this reconstruction is done lies in the
<a href="https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/streaming/src/main/scala/org/apache/spark/streaming/scheduler/BatchInfo.scala">BatchInfo class definition</a>
in the Spark codebase.</p>
<p>I wrote a <a href="https://gist.github.com/emaadmanzoor/cc12763a4133ca30fad8be065846ecc4">script</a>
that parses the JSON from this endpoint and reconstructs these metrics, given the
application ID (the one YARN generates for you on submission) and the YARN master
URL. All times are in seconds. A sample execution is:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python get_spark_streaming_batch_statistics.py <span class="se">\</span>
<span class="nt">--master</span> ec2-52-40-144-150.us-west-2.compute.amazonaws.com <span class="se">\</span>
<span class="nt">--applicationId</span> application_1469205272660_0006
</code></pre></div></div>
<p>Sample output (batch start-time, processing time, scheduling delay):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 18:36:55 3.991 3783.837
18:36:56 4.001 3786.832
18:36:57 3.949 3789.862
...
</code></pre></div></div>
<p>The script creates a map from each batch to its start time and all the jobs it
contains, along with the jobs’ start and completion times. Simple arithmetic then
generates the required metrics. It is easy to modify the script to print the
actual timestamps instead of the delay, if one wishes to.</p>
<p>Do file an <a href="https://github.com/lenhattan86/ccra/issues">issue</a> with any
questions or contributions you have.</p>
Fri, 22 Jul 2016 00:00:00 -0400
http://www.eyeshalfclosed.com/blog/2016/07/22/spark-streaming-statistics/

<h1>Coffee on the North Fork</h1>
<p><img src="/images/noforoco-front.jpg" alt="North Fork Roasting Co." /></p>
<p>I usually head west from Stony Brook every other weekend to work out of a
nice cafe in Manhattan or Brooklyn, but have become increasingly frustrated
with the tiny, crowded spaces that I’ve had to put up with. So this
weekend, I decided to head east to the <a href="https://goo.gl/maps/n6qGY23Ggoz">North Fork</a>,
and was pleasantly surprised.</p>
<p>I had been disappointed with the Long Island coffee scene in the recent past when
I visited <a href="https://www.facebook.com/localscafe/">Local’s cafe</a> in Port Jefferson;
while the cafe was quiet and spacious, there was something really wrong with either
the beans or the barista’s technique, making for a really flat, unexciting brew.
I wasn’t expecting much going all the way out to the east end of the island, but
I really wanted to find a local roaster to replenish my weekly beans supply.</p>
<p><a href="http://www.noforoastingco.com/">NoFoRoCo</a> blew me away from my first step in.
The cafe is set in a large old house, complete with rustic furniture and
a big dark house-dog. The couches are comfortable, the WiFi is good and the
staff is pleasant and interesting. There is a small door on the right side that
leads to a quaint study, with its own desk, armchair and vintage computing machinery.
The desk looks out onto the yard. You don’t get this in the cramped confines of
cafes in the City.</p>
<p>Importantly, the coffee is perfect; I would rate this almost as good as
Blue Bottle, in terms of both the beans and technique
(a shoutout to <a href="http://www.noforoastingco.com/say-hi.html">Bri</a>!). I had a latte
and a cappuccino, and they only fell slightly short in the complexity of
flavours in the cup (Blue Bottle coffee bursts into a bunch of
interesting flavours on my tongue; that’s as eloquent as my amateur palate can get).
But there must be something to be said about NoFoRoCo being small-batch roasters.</p>
<p>I’ll leave you with some pictures. I’ll definitely be visiting again.</p>
<p><img src="/images/noforoco-main.jpg" alt="NoFoRoCo Main Room" /></p>
<p><img src="/images/noforoco-study.jpg" alt="NoFoRoCo Study" /></p>
<p><img src="/images/noforoco-desk.jpg" alt="NoFoRoCo Desk" /></p>
Sun, 24 Apr 2016 00:00:00 -0400
http://www.eyeshalfclosed.com/blog/2016/04/24/coffee-on-the-north-fork/

<h1>Online Random Projections</h1>
<p><img src="/images/streamhash-feature.jpg" alt="feature" />
I recently encountered the problem of computing the similarity of pairs of
“documents” (which, in my case, were actually graphs), where the documents
arrived as a stream of individual words. The incoming stream of words could both
initiate new documents and grow existing documents. Interestingly, the words of each
document could also arrive out-of-order, but I will ignore this complication for now.</p>
<p>The problem then was to continuously compute and maintain the similarity of
every pair of documents originating from the words in the stream. Strangely, I
could not find any existing methods to do this even for the case of a single
document pair, under the constraints of the data stream model (single-pass,
bounded memory). Specifically, all the methods I surveyed needed to know
my “vocabulary”: the unique words that could possibly arrive in the stream.
In my case, this is never known.</p>
<p>In this post, I will describe the method I developed to tackle this problem,
while also discussing the background needed to understand why it works. The
method itself is fairly simple and, with some help from the code snippets in this
post, should be easy to implement and try out yourself!</p>
<h3 class="no_toc" id="contents">Contents</h3>
<ul id="markdown-toc">
<li><a href="#computing-document-similarity" id="markdown-toc-computing-document-similarity">Computing Document Similarity</a></li>
<li><a href="#sketching-document-similarity" id="markdown-toc-sketching-document-similarity">Sketching Document Similarity</a></li>
<li><a href="#streaming-document-similarity" id="markdown-toc-streaming-document-similarity">Streaming Document Similarity</a></li>
<li><a href="#streaming-heterogenous-graph-similarity" id="markdown-toc-streaming-heterogenous-graph-similarity">Streaming Heterogeneous Graph Similarity</a></li>
</ul>
<h2 id="computing-document-similarity">Computing Document Similarity</h2>
<p><img src="https://docs.google.com/drawings/d/1pypq-JElxpCNn12brQ1eW9XzD2NA0_td7aZGv7S6eyM/pub?w=1200" alt="Cosine Distance" /></p>
<p class="caption"><em><strong>Computing the cosine similarity of documents D1 and D2:</strong> The documents are
first split into their constituent words and a vocabulary index is assigned to each
word. The documents are then represented by vectors of frequencies of the words they
contain. Given these vectors, their magnitudes, dot-product and cosine-similarity
can be computed.</em></p>
<p>A common way to measure how similar two documents are is by their
<a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>. An example of this is illustrated
in the figure above. All documents in the collection are first split into
their constituent words, and a <em>vocabulary index</em> is constructed which assigns
an integer to each unique word. If the vocabulary size is <script type="math/tex">|V|</script>, each document is
represented by a <script type="math/tex">|V|</script>-sized vector of frequencies of the words it contains.</p>
<p>Given these vectors for a pair of documents, say <strong>D1</strong> and <strong>D2</strong>, their
magnitudes are given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
magn(\mathbf{D1}) &= \|\mathbf{D1}\|_2\\
magn(\mathbf{D2}) &= \|\mathbf{D2}\|_2,
\end{align*} %]]></script>
<p>and their cosine similarity is given by,</p>
<script type="math/tex; mode=display">sim(\mathbf{D1}, \mathbf{D2}) = cos(\mathbf{D1}, \mathbf{D2})
= \frac{\mathbf{D1} \cdot \mathbf{D2}}
{\|\mathbf{D1}\|_2\|\mathbf{D2}\|_2}.</script>
<h3 class="no_toc" id="a-computational-problem-with-large-v">A Computational Problem With Large |V|</h3>
<p>In practice, it is possible for the
vocabulary to be extremely large. This increases the time needed to compute the
similarity of each document pair and the memory needed to store each document.
To address this issue, a popular approach is to <em>sketch</em> the word-frequency vectors
that represent each document. Sketching replaces <script type="math/tex">|V|</script>-sized document vectors by
<script type="math/tex">k</script>-sized <em>sketch</em> vectors, where <script type="math/tex">k</script> is much smaller than <script type="math/tex">|V|</script>, while accurately
approximating the similarity of pairs of documents.</p>
<h2 id="sketching-document-similarity">Sketching Document Similarity</h2>
<p>One approach to sketching vectors that preserves their cosine similarity is
<a href="https://en.wikipedia.org/wiki/Random_projection">random projections</a>, which applies some neat tricks from linear algebra and
probability.
Let’s say we want to compute the similarity of documents with vectors <strong>D1</strong> and
<strong>D2</strong> having angle θ between them. We can visualize these vectors in the XY-plane.</p>
<p><img src="https://docs.google.com/drawings/d/15MsVhJreASVW-VnJYCIYTUGNS93pEysOvwQ1jqsZdaU/pub?w=1200" alt="Document Vectors" /></p>
<p>Let’s pick a vector <strong>R</strong> of the same dimension as <strong>D1</strong> and
<strong>D2</strong>, uniformly at random in the plane. Then define a function <em>h</em> that
operates on any such random vector <strong>R</strong> and a document vector <strong>D</strong> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
h_{\mathbf{R}}(\mathbf{D}) =
\begin{cases}
+1 ~\textrm{if}~ \mathbf{D}\cdot\mathbf{R} \geq 0,\\
-1 ~\textrm{if}~ \mathbf{D}\cdot\mathbf{R} < 0.
\end{cases} %]]></script>
<p>So <script type="math/tex">h_{\mathbf{R}}(\mathbf{D1}) = h_{\mathbf{R}}(\mathbf{D2})</script> only if
both <strong>D1</strong> and <strong>D2</strong> lie on the same side of <strong>R</strong>, and
<script type="math/tex">h_{\mathbf{R}}(\mathbf{D1}) \neq h_{\mathbf{R}}(\mathbf{D2})</script> otherwise.
Now what is the probability that
<script type="math/tex">h_{\mathbf{R}}(\mathbf{D1}) = h_{\mathbf{R}}(\mathbf{D2})</script>, over all random
choices of <strong>R</strong>? It’s easier to see what this is with another illustration.</p>
<p><img src="https://docs.google.com/drawings/d/1b7SQPGmNJ3Ig_gRfFD0jsoDIaiLrqCMV5QUwIHBuh0w/pub?w=1200" alt="Random Vectors" /></p>
<p>In the red region, <script type="math/tex">h_{\mathbf{R}}(\mathbf{D1}) \neq h_{\mathbf{R}}(\mathbf{D2})</script>
since <strong>D1</strong> and <strong>D2</strong> will fall on opposite sides of any random vector <strong>R</strong> chosen
in this region. Similarly,
<script type="math/tex">h_{\mathbf{R}}(\mathbf{D1}) = h_{\mathbf{R}}(\mathbf{D2})</script> in the yellow region.
The red region covers a fraction <em>2θ/2π = θ/π</em> of the circle of possible directions, leading to the
probability over random vectors <strong>R</strong>:</p>
<script type="math/tex; mode=display">P_{\mathbf{R}}[h_{\mathbf{R}}(\mathbf{D1}) = h_{\mathbf{R}}(\mathbf{D2})]
= 1 - θ/π.</script>
<p>If we could somehow estimate this probability, we could plug it into the above
equation to find the angle θ, and hence, find the cosine similarity of the two
document vectors:</p>
<script type="math/tex; mode=display">\begin{gather*}
θ = π (1 - P_{\mathbf{R}}[h_{\mathbf{R}}(\mathbf{D1}) =
h_{\mathbf{R}}(\mathbf{D2})])\\
sim(\mathbf{D1}, \mathbf{D2}) = cos(θ)
\end{gather*}</script>
<p>It turns out that it’s easy to estimate this probability: simply pick <script type="math/tex">k</script> random
vectors <script type="math/tex">\mathbf{R}_1, \dots, \mathbf{R}_k</script>, evaluate
<script type="math/tex">h_{\mathbf{R_i}}(\mathbf{D1})</script> and <script type="math/tex">h_{\mathbf{R_i}}(\mathbf{D2})</script> for
each of these vectors and then count the fraction of function values for
which the two documents agree (i.e. function values are both +1 or both -1),</p>
<script type="math/tex; mode=display">P_{\mathbf{R}}[h_{\mathbf{R}}(\mathbf{D1}) = h_{\mathbf{R}}(\mathbf{D2})]
\approx \frac{\sum_{i=1}^k \mathbb{1}(h_{\mathbf{R_i}}(\mathbf{D1}) =
h_{\mathbf{R_i}}(\mathbf{D2}))}
{k}.</script>
<p>In practice, the random vectors turn out to be sufficiently random if each
of their elements is chosen uniformly from <script type="math/tex">\{+1, -1\}</script>.</p>
<p>By fixing a set of random vectors <script type="math/tex">\mathbf{R}_1, \dots, \mathbf{R}_k</script>,
every <script type="math/tex">|V|</script>-sized document vector <strong>D</strong> can be replaced by a <script type="math/tex">k</script>-sized <em>sketch</em>
vector, where element <script type="math/tex">i</script> of the sketch vector is either +1 or -1 and is given by
<script type="math/tex">h_{\mathbf{R_i}}(\mathbf{D})</script>. Thus, each sketch vector is essentially a bit
vector and consumes very little space. Document similarity can then be estimated
using just these sketch vectors. The accuracy of estimation improves with increasing
<script type="math/tex">k</script>.</p>
<h3 class="no_toc" id="a-computational-problem-with-unknown-v">A Computational Problem With Unknown |V|</h3>
<p>Notice that in the random projections technique just described, we need to know
<script type="math/tex">|V|</script> to construct <script type="math/tex">|V|</script>-sized random vectors. In a streaming scenario, <script type="math/tex">|V|</script>
is never known, and may grow with time!</p>
<h2 id="streaming-document-similarity">Streaming Document Similarity</h2>
<p>Let’s say we could pick <script type="math/tex">k</script> functions <script type="math/tex">h_1, \dots, h_k</script> at random from a
family of functions <script type="math/tex">\mathcal{H}</script>, where each <script type="math/tex">h \in \mathcal{H}</script> maps a word
to an element of <script type="math/tex">\{+1, -1\}</script>. Then we could replace each random vector
<script type="math/tex">\mathbf{R}_i</script> with a function <script type="math/tex">h_i \in \mathcal{H}</script>, as long as <script type="math/tex">\mathcal{H}</script>
obeys the following properties:</p>
<ol>
<li>
<p><em>It should be equally probable for a given word to map to +1 or -1, over
randomly chosen <script type="math/tex">h \in \mathcal{H}</script>.</em></p>
<p>This captures the equivalence of the function <script type="math/tex">h_i</script> with <script type="math/tex">\mathbf{R}_i</script>
whose elements are each uniformly chosen from <script type="math/tex">\{+1, -1\}</script>.</p>
<p>However, requiring only this property permits trivial families such as
<script type="math/tex">\mathcal{H} = \{h_1(w) = +1, ~h_2(w) = -1, \forall ~\textrm{words}~ w\}</script>.
This family satisfies property (1) since the probability that any function randomly
picked from <script type="math/tex">\mathcal{H}</script> maps a word to +1 or -1 is 1/2; note that the
randomness originates from picking the function and not from computing the
mapped value of a word. But each of the functions in this trivial family maps
all words to the same value, which is not very useful.</p>
<p>So we need the following additional property:</p>
</li>
<li>
<p><em>For a given <script type="math/tex">h \in \mathcal{H}</script>, values in {+1, -1} are equally probable
over all words.</em></p>
<p>This ensures that a function in the family cannot map all words to the same
value, and should evenly distribute the mapped values across all words between
+1 and -1. This still does not prevent families such as
<script type="math/tex">\mathcal{H} = \{h_1(w), ~h_2(w) = -h_1(w), \forall ~\textrm{words}~ w\}</script>
where <script type="math/tex">h_1</script> and <script type="math/tex">h_2</script> both satisfy property (2). This family is not
very useful either, since the second function is correlated with the first.</p>
<p>This leads to our final desired property:</p>
</li>
<li>
<p><em>All the functions in <script type="math/tex">\mathcal{H}</script> are pairwise-independent.</em></p>
</li>
</ol>
<p>It turns out that a family satisfying these requirements is
a <em>2-universal (strongly universal)</em> hash function family by definition.
Many such families exist that work on strings. One example is the multilinear family:</p>
<script type="math/tex; mode=display">h(w) = m_0 + \sum_{i = 1}^n m_i w_i,</script>
<p>where <script type="math/tex">w_i</script> is the integer representation of the <script type="math/tex">i^\textrm{th}</script> character of
the word <script type="math/tex">w</script>, and <script type="math/tex">m_0, \dots, m_n</script> are uniformly random integers. The
function above returns an integer, but this can be easily mapped to an element of
<script type="math/tex">\{+1, -1\}</script> by extracting the MSB and some algebra.</p>
<h3 class="no_toc" id="implementation">Implementation</h3>
<p>To sample <script type="math/tex">k</script> functions from the strongly-universal multilinear family, it is
only necessary to estimate the maximum word length <script type="math/tex">m = |w|_{\textrm{max}}</script>.
Then, for each function <script type="math/tex">h_1, \dots, h_k</script>, <script type="math/tex">m + 1</script> random integers are generated
and stored.</p>
<p>The following C++ 11 snippet initializes <script type="math/tex">h_1, \dots, h_k</script> as <script type="math/tex">k</script> hash functions
sampled uniformly at random from the multilinear family:</p>
<figure class="highlight">
<pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include <random>
#define SEED 42
</span>
<span class="n">mt19937_64</span> <span class="n">prng</span><span class="p">(</span><span class="n">SEED</span><span class="p">);</span> <span class="cm">/* Mersenne Twister 64-bit PRNG */</span>
<span class="cm">/* m = maximum word length, k = number of hash functions */</span>
<span class="n">vector</span><span class="o"><</span><span class="n">vector</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">>></span> <span class="n">H</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span><span class="p">(</span><span class="n">m</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">k</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint32_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">m</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">H</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">prng</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span></code></pre>
</figure>
<p>Then the mapping for a word <script type="math/tex">w</script> using <script type="math/tex">h_i</script> can be computed as:</p>
<figure class="highlight">
<pre><code class="language-cpp" data-lang="cpp"><span class="cm">/* word is a string of length <= m */</span>
<span class="kt">uint64_t</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">H</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">];</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint32_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">word</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">H</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="k">static_cast</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span><span class="p">(</span><span class="n">word</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">&</span> <span class="mh">0xff</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">sign</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">((</span><span class="n">sum</span> <span class="o">>></span> <span class="mi">63</span><span class="p">)</span> <span class="o">&</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span></code></pre>
</figure>
<p>Notice the bitwise-and when computing the summation: it undoes the
sign-extension performed when a (possibly signed) 1-byte character is cast to
an 8-byte unsigned integer.</p>
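<p>Putting the two snippets together, here is a minimal self-contained sketch of the whole pipeline. The helper names <code>sample_hashes</code> and <code>hash_sign</code> are illustrative, not from any library: sample <script type="math/tex">k</script> multilinear hash functions once, then map each word to <script type="math/tex">\{+1, -1\}</script> via the MSB of its hash.</p>

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Sample k hash functions from the multilinear family: each h_i is
// defined by m + 1 random 64-bit coefficients. (Hypothetical helper.)
std::vector<std::vector<uint64_t>> sample_hashes(uint32_t k, uint32_t m,
                                                 uint64_t seed) {
  std::mt19937_64 prng(seed);  // Mersenne Twister 64-bit PRNG
  std::vector<std::vector<uint64_t>> H(k, std::vector<uint64_t>(m + 1, 0));
  for (auto& h : H)
    for (auto& c : h) c = prng();
  return H;
}

// Multilinear hash of `word` under h_i, reduced to {+1, -1} via the MSB.
// Assumes word.length() <= m; shorter words implicitly pad with zeros.
int hash_sign(const std::vector<std::vector<uint64_t>>& H, uint32_t i,
              const std::string& word) {
  uint64_t sum = H[i][0];
  for (uint32_t j = 1; j < word.length() + 1; j++)
    sum += H[i][j] * (static_cast<uint64_t>(word[j - 1]) & 0xff);
  return 2 * static_cast<int>((sum >> 63) & 1) - 1;  // MSB -> {+1, -1}
}
```

<p>With a fixed seed the signs are deterministic, so the same word always lands in the same <script type="math/tex">\pm 1</script> cell of the projection, as the streaming setting requires.</p>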
<h2 id="streaming-heterogenous-graph-similarity">Streaming Heterogeneous Graph Similarity</h2>
<p>The technique described above was used to perform streaming clustering and anomaly
detection on heterogeneous graphs arriving as a stream of
individual edges. Further details are available in the paper and code at the
<a href="http://www3.cs.stonybrook.edu/~emanzoor/streamspot">project website</a>.</p><img src="http://feeds.feedburner.com/~r/eyeshalfclosed/~4/xZoapO-VSl4" height="1" width="1" alt=""/>Thu, 25 Feb 2016 00:00:00 -0500
http://feedproxy.google.com/~r/eyeshalfclosed/~3/xZoapO-VSl4/
http://www.eyeshalfclosed.com/blog/2016/02/25/streaming-random-projections/http://www.eyeshalfclosed.com/blog/2016/02/25/streaming-random-projections/On Profanity in Science<p>I came across a slightly old paper on <a href="http://www.sigkdd.org/sites/default/files/issues/14-2-2012-12/V14-02-07-Cormode.pdf">Studying the Source Code of Scientific Research</a>. The section on <em>The Sacred and the Profane</em> had me in fits. It reminded me of the less academic but equally hilarious analogue from software development: <a href="http://www.commitlogsfromlastnight.com/">Commit Logs from Last Night</a>.</p>
<p>Quoted below for memory:</p>
<blockquote>
<p>In other cases however, it seems that profanity is used internally
to reflect the author’s true feelings: there are examples
where a particularly difficult example is given the (internal)
label “bastard”; a macro for a complexity class is given the
handle “crap”; and an initial theorem is labeled “bullshit”
with an improvement provided.</p>
<p>One particular notable example
is of the occurrence of “bollocks” (a British-English
idiom with broadly negative connotation) which occurs over
fifty times within a single paper. Closer inspection reveals
that this is because the central theorem in the paper is
given the label “dogs-bollocks” and referred to extensively
throughout; this phrase is a (coarse) British-English idiom
with a strongly <em>positive</em> connotation.</p>
<p>There are examples of profanity used in comments: the observation that “the \thanks layout looks crappy!”; the single word “bullshit”
prefacing some technical text which has been commented
out; and the comment “Who the fuck is —?” immediately
after an acknowledgment to the named individual.</p>
</blockquote>
<p>The rest of the paper is just as enjoyable and illuminating; a rare combination, but coming from Muthukrishnan (depicted below, on fire), perhaps to be expected.</p>
<p><a href="http://www.cs.rutgers.edu/~muthu/"><img src="http://www.cs.rutgers.edu/~muthu/onfire.jpg" alt="Muthukrishnan on fire" /></a></p><img src="http://feeds.feedburner.com/~r/eyeshalfclosed/~4/4sbznhWoG1w" height="1" width="1" alt=""/>Sat, 02 May 2015 00:00:00 -0400
http://feedproxy.google.com/~r/eyeshalfclosed/~3/4sbznhWoG1w/
http://www.eyeshalfclosed.com/blog/2015/05/02/on-profanity-in-science/http://www.eyeshalfclosed.com/blog/2015/05/02/on-profanity-in-science/Setting up the Oculus Rift.<p>More cables than I expected. Also uses up all my USB ports! <img src="/images/dk2_box.jpg" alt="DK2 Setup" /></p><img src="http://feeds.feedburner.com/~r/eyeshalfclosed/~4/8ACWogsuF2k" height="1" width="1" alt=""/>Sun, 05 Apr 2015 06:37:00 -0400
http://feedproxy.google.com/~r/eyeshalfclosed/~3/8ACWogsuF2k/
http://www.eyeshalfclosed.com/blog/2015/04/05/setting-up-the-oculus-rift/http://www.eyeshalfclosed.com/blog/2015/04/05/setting-up-the-oculus-rift/Interesting seminar by Jia Yuan Yu.<p>From Dublin City U. See also: dublinked.ie. Bringing in congestion control methods from TCP to resource allocation in smart cities while maintaining agent privacy. An exciting connection, useful theorems and many open problems!</p><img src="http://feeds.feedburner.com/~r/eyeshalfclosed/~4/cu8PIuKWb64" height="1" width="1" alt=""/>Sun, 05 Apr 2015 06:00:00 -0400
http://feedproxy.google.com/~r/eyeshalfclosed/~3/cu8PIuKWb64/
http://www.eyeshalfclosed.com/blog/2015/04/05/interesting-seminar-by-jia-yuan-yu/http://www.eyeshalfclosed.com/blog/2015/04/05/interesting-seminar-by-jia-yuan-yu/