<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><description></description><title>connecting the dots . . .</title><generator>Tumblr (3.0; @mikeivanov)</generator><link>https://www.mikeivanov.com/</link><item><title>Machine Learning with Clojure and Spark using Flambo</title><description>&lt;p&gt;In this short tutorial I&amp;rsquo;m going to show you how to train a logistic
regression classifier in a scalable manner with
&lt;a href="http://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; and
&lt;a href="http://clojure.org"&gt;Clojure&lt;/a&gt; using &lt;a href="https://github.com/yieldbot/flambo"&gt;Flambo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Assumptions:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;you are familiar with Clojure and &lt;a href="https://github.com/technomancy/leiningen"&gt;Leiningen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;you have heard of or, ideally, poked around in Apache Spark&lt;/li&gt;
&lt;li&gt;you possess some basic Machine Learning skills&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The goal of the tutorial is to help you familiarize yourself with
Flambo &amp;ndash; a Clojure DSL for Apache Spark. Even though Flambo is far from being complete, it already does a decent job of wrapping basic Spark APIs into idiomatic Clojure.&lt;/p&gt;

&lt;p&gt;During the course of the tutorial, we are going to train a classifier capable of predicting whether a wine would taste good given certain objective chemical characteristics.&lt;/p&gt;

&lt;h2&gt;Step 1. Create new project&lt;/h2&gt;

&lt;p&gt;Run these commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ lein new app t01spark
$ cd t01spark
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, &lt;code&gt;t01spark&lt;/code&gt; is the name of the project. You can give it any name
you like. Don&amp;rsquo;t forget to change the current directory to the project
you&amp;rsquo;ve just created.&lt;/p&gt;

&lt;h2&gt;Step 2. Update &lt;code&gt;project.clj&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;Open &lt;code&gt;project.clj&lt;/code&gt; in a text editor and update the dependency section
so it looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;:dependencies
    [[org.clojure/clojure "1.6.0"]
     [yieldbot/flambo "0.6.0"]
     [org.apache.spark/spark-mllib_2.10 "1.3.0"]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Please note that although listing Spark jars in this manner is
perfectly fine for exploratory projects, it is not suitable for
production use. For that you will need to list them as &amp;ldquo;provided&amp;rdquo;
dependencies in the profiles section, but let&amp;rsquo;s keep things simple for now.&lt;/p&gt;

&lt;p&gt;Make sure that AOT is enabled, otherwise you will see strange
&lt;code&gt;ClassNotFound&lt;/code&gt; errors. Add this to the project file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;:aot :all
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It could also make sense to give Spark some extra memory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;:jvm-opts ^:replace ["-server" "-Xmx2g"]
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Step 3. Download dataset&lt;/h2&gt;

&lt;p&gt;In this tutorial we are going to use the
&lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;Wine Quality Dataset&lt;/a&gt;. Download and save it along with the &lt;code&gt;project.clj&lt;/code&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ wget &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"&gt;http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv&lt;/a&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Step 4. Start REPL&lt;/h2&gt;

&lt;p&gt;The simplest way is running Leiningen with the &lt;code&gt;repl&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ lein repl
Clojure 1.6.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_xxx
...
user=&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, nothing prevents you from running the REPL in
&lt;a href="https://www.gnu.org/software/emacs/"&gt;Emacs&lt;/a&gt; with
&lt;a href="https://github.com/clojure-emacs/cider"&gt;Cider&lt;/a&gt;,
&lt;a href="https://www.jetbrains.com/idea/"&gt;IntelliJ IDEA&lt;/a&gt; or any other
Clojure-aware IDE.&lt;/p&gt;

&lt;h2&gt;Step 5. Require modules and import classes&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;user=&amp;gt; (require '[flambo.api :as f]
                '[flambo.conf :as cf]
                '[flambo.tuple :as ft]
                '[clojure.string :as s])

user=&amp;gt; (import '[org.apache.spark.mllib.linalg Vectors]
               '[org.apache.spark.mllib.regression LabeledPoint]
               '[org.apache.spark.mllib.classification LogisticRegressionWithLBFGS]
               '[org.apache.spark.mllib.evaluation BinaryClassificationMetrics])
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Step 6. Create Spark context&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;user=&amp;gt; (def spark
         (let [cfg (-&amp;gt; (cf/spark-conf)
                       (cf/master "local[2]")
                       (cf/app-name "t01spark")
                       (cf/set "spark.akka.timeout" "300"))]
           (f/spark-context cfg)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve just created a Spark context bound to a local, in-process Spark
server. You should see lots of INFO log messages in the
terminal. That&amp;rsquo;s normal. Again, creating a Spark context like this
will work for tutorial purposes, although in real life you&amp;rsquo;d probably
want to wrap this expression into a
&lt;a href="http://clojuredocs.org/clojure.core/memoize"&gt;memoizing function&lt;/a&gt; and
call it whenever you need a context.&lt;/p&gt;

&lt;h2&gt;Step 7. Load and parse data&lt;/h2&gt;

&lt;p&gt;The data is stored in a CSV file with a header. We don&amp;rsquo;t need that
header. To get rid of it, let&amp;rsquo;s enumerate rows and retain
only those with indexes greater than zero. Then we split each row
by the semicolon character and convert each element to float:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;user=&amp;gt; (def data
         ;; Read lines from file
         (-&amp;gt; (f/text-file spark "winequality-red.csv")
             ;; Enumerate lines.
             ;; This function is missing from Flambo,
             ;; so we call the method directly
             (.zipWithIndex)
             ;; This is here purely for convenience:
             ;; it transforms Spark tuples into Clojure vectors
             (f/map f/untuple)
             ;; Get rid of the header
             (f/filter (f/fn [[line idx]] (&amp;lt; 0 idx)))
             ;; Split lines and transform values
             (f/map (f/fn [[line _]]
                      (-&amp;gt;&amp;gt; (s/split line #";")
                           (map #(Float/parseFloat %)))))))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s verify what&amp;rsquo;s in the RDD:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;user=&amp;gt; (f/take data 3)
[(7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5.0)
 (7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 9.8 5.0)
 (7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5.0)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks legit.&lt;/p&gt;

&lt;h2&gt;Step 8. Transform data&lt;/h2&gt;

&lt;p&gt;The subjective wine quality information is contained in the Quality
variable. It takes values in the [0..10] range. Let&amp;rsquo;s transform
that into a binary variable by splitting it at the median: wines with Quality below 6 will be considered &amp;ldquo;not good&amp;rdquo;, and those with 6 and above &amp;ldquo;good&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;I explored this dataset in &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; and found that the most interesting variables are Citric Acid, Total Sulfur Dioxide and Alcohol. I encourage you to experiment with adding other variables to the model. Also, using logarithms of those variables instead of raw values might be a good idea. Please refer to
the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;Wine Quality Dataset&lt;/a&gt; for the full variable list.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (def dataset
          (f/map data
                 (f/fn [[_ _ citric-acid _ _ _
                         total-sulfur-dioxide _ _ _
                         alcohol quality]]
                   ;; Note the encoding: the label is 0.0 when quality is 6 or
                   ;; above ("good") and 1.0 when it is below 6 ("not good")
                   (let [good (if (&amp;lt;= 6 quality) 0.0 1.0)
                         ;; these will be our predictors
                         pred (double-array [citric-acid
                                             total-sulfur-dioxide
                                             alcohol])]
                     ;; Spark requires samples to be packed into LabeledPoints
                     (LabeledPoint. good (Vectors/dense pred))))))

 user=&amp;gt; (f/take dataset 3)
 [#&amp;lt;LabeledPoint (1.0,[0.0,34.0,9.399999618530273])&amp;gt;
  #&amp;lt;LabeledPoint (1.0,[0.0,67.0,9.800000190734863])&amp;gt;
  #&amp;lt;LabeledPoint (1.0,[0.03999999910593033,54.0,9.800000190734863])&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There is no order guarantee in derived RDDs, so you might get a different result.&lt;/p&gt;

&lt;h2&gt;Step 9. Prepare training and validation datasets&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (f/cache dataset) ; Temporary cache the source dataset
                          ; BTW, caching is a side effect

 user=&amp;gt; (def training
          (-&amp;gt; (f/sample dataset false 0.8 1234)
              (f/cache)))

 user=&amp;gt; (def validation
          (-&amp;gt; (.subtract dataset training)
              (f/cache)))

 user=&amp;gt; (map f/count [training validation]) ; Check the counts
 (1291 235)

 user=&amp;gt; (.unpersist dataset) ; no need to cache it anymore
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Caching is crucial for MLlib performance. Actually, Spark MLlib algorithms will complain if you feed them with uncached datasets.&lt;/p&gt;

&lt;h2&gt;Step 10. Train classifier&lt;/h2&gt;

&lt;p&gt;MLlib-related parts are completely missing from Flambo, but that&amp;rsquo;s
hopefully coming soon. For now, let&amp;rsquo;s use the Java API directly.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (def classifier
          (doto (LogisticRegressionWithLBFGS.)
            ;; Otherwise we'll need to provide it
            (.setIntercept true)))

 user=&amp;gt; (def model
          (doto (.run classifier (.rdd training))
            ;; We need the "raw" probability predictions
            (.clearThreshold)))

 user=&amp;gt; [(.intercept model) (.weights model)]
 [9.805476268219566
  #&amp;lt;DenseVector [-1.6766504448212323,0.011619041367225583,-0.9683045663615859]&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;
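
&lt;p&gt;To make these numbers less opaque, here is a minimal sketch of how a binary logistic regression model turns an intercept and weights into a score once the threshold is cleared. It is plain Python rather than Clojure, purely for illustration. The function name &lt;code&gt;score&lt;/code&gt; is made up, and the feature values are the first row printed in Step 8, with alcohol rounded to 9.4.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# Intercept and weights as printed by the REPL above
intercept = 9.805476268219566
weights = [-1.6766504448212323, 0.011619041367225583, -0.9683045663615859]

def score(features):
    """Logistic function of the margin: sigmoid(w . x + b)."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))

# citric acid, total sulfur dioxide, alcohol
p = score([0.0, 34.0, 9.4])
print(round(p, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The result is the model&amp;rsquo;s score for label 1.0, as the labels were encoded in Step 8. The negative weight on alcohol means that more alcohol pushes this score down.&lt;/p&gt;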

&lt;h2&gt;Step 11. Assess predictive power&lt;/h2&gt;

&lt;p&gt;First, let&amp;rsquo;s create a function to compute the area under the
&lt;a href="https://en.wikipedia.org/wiki/Precision_and_recall"&gt;precision-recall&lt;/a&gt; curve and the area under the
&lt;a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic"&gt;receiver operating characteristic&lt;/a&gt;
curve. These are two of the most important indicators of the predictive
power of a trained classification model.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;user=&amp;gt; (defn metrics [ds model]
         ;; Here we construct an RDD containing [prediction, label]
         ;; tuples and compute classification metrics.
         (let [pl (f/map ds (f/fn [point]
                              (let [y (.label point)
                                    x (.features point)]
                                (ft/tuple (.predict model x) y))))
               metrics (BinaryClassificationMetrics. (.rdd pl))]
           [(.areaUnderROC metrics)
            (.areaUnderPR metrics)]))
&lt;/code&gt;&lt;/pre&gt;
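
&lt;p&gt;For intuition about the first of these numbers: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here is a small Python sketch of that definition using the rank-sum identity. The function name &lt;code&gt;auc_roc&lt;/code&gt; and the toy pairs are made up for illustration and have nothing to do with the wine dataset; the sketch also assumes there are no tied scores.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def auc_roc(pairs):
    """Area under the ROC curve from (score, label) pairs.

    Uses the rank-sum (Mann-Whitney U) identity; assumes no tied scores.
    """
    ranked = sorted(pairs)  # ascending by score
    n_pos = sum(1 for _, y in ranked if y == 1.0)
    n_neg = len(ranked) - n_pos
    # 1-based ranks of the positive examples
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(ranked) if y == 1.0)
    u = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

# Made-up (score, label) pairs, not taken from the wine dataset
toy = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.6, 1.0), (0.3, 0.0), (0.2, 0.0)]
print(auc_roc(toy))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the toy data, eight of the nine positive/negative pairs are ordered correctly, so the area is 8/9. A value of 0.5 would mean the scores are no better than a coin toss.&lt;/p&gt;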

&lt;p&gt;Obtain metrics for the training dataset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (metrics training model)
 [0.7800174890996763 0.7471259498290513]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then for the validation dataset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (metrics validation model)
 [0.7785138248847928 0.7160113864756078]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;OK, let&amp;rsquo;s turn L2 regularization on and rebuild the model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (doto (.optimizer classifier)
          (.setRegParam 0.0001))

 user=&amp;gt; (def model
          (doto (.run classifier (.rdd training))
            (.clearThreshold)))

 user=&amp;gt; (metrics training model)
 [0.7794660966515655 0.748073583460006]

 user=&amp;gt; (metrics validation model)
 [0.7807459677419355 0.7200550175610565]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks good? I&amp;rsquo;m sure you can do better.&lt;/p&gt;

&lt;h2&gt;Step 12. Build predictor&lt;/h2&gt;

&lt;p&gt;As a final step, let&amp;rsquo;s define a function that we could use for
predicting wine quality:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; user=&amp;gt; (defn is-good? [model citric-acid
                        total-sulfur-dioxide alcohol]
          (let [point (-&amp;gt; (double-array [citric-acid
                                         total-sulfur-dioxide
                                         alcohol])
                          (Vectors/dense))
                prob (.predict model point)]
            (&amp;lt; 0.5 prob)))

 user=&amp;gt; (is-good? model 0.0 34.0 9.399999618530273)
 true
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have built a simple logistic regression classifier in Clojure on Apache Spark using Flambo. Some parts of the Flambo API are still missing, but it&amp;rsquo;s definitely usable. It was not terribly difficult to get it working and I hope you had fun.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/116884531461</link><guid>https://www.mikeivanov.com/post/116884531461</guid><pubDate>Mon, 20 Apr 2015 00:12:30 -0400</pubDate></item><item><title>Virtualization and Hadoop: Whys, Whats and Hows</title><description>&lt;p&gt;The idea of this article came after a series of conversations with
Hortonworks&amp;rsquo; Adam Muise and Aaron Weibe, reflecting on my own
experience and things I&amp;rsquo;ve heard from some other Hadoop practitioners.&lt;/p&gt;

&lt;p&gt;Let me start with an outrageous statement: I believe that in the Big
Data context Virtual Machines offer very little or no benefits beyond
improved security.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s start by looking at the reasons &lt;em&gt;why&lt;/em&gt; people use
virtualization at all, then talk about security aspects and
virtualization techniques, and then I&amp;rsquo;ll try to summarize this experience
along with some practical recommendations.&lt;/p&gt;

&lt;h2&gt;Virtualization: The Why&lt;/h2&gt;

&lt;h3&gt;1. Resource Utilization&lt;/h3&gt;

&lt;p&gt;Underused CPU and memory resources are a direct loss. The idea is that
you can cram disparate services into the same physical box to have
better chances that it&amp;rsquo;s being used all the time. While utilization
used to be a 100%-valid concern with Hadoop-1, nowadays it&amp;rsquo;s not. YARN
addresses this problem, and if you stay within the YARN framework, no
additional technology is required. Another aspect of this problem is cross-domain
resource utilization. The question is: if I use my Hadoop-2 cluster
only one week a month, can I run something other than Hadoop on part of the nodes
the rest of the time? YARN is obviously of no help in this situation,
so a form of virtualization would be required.&lt;/p&gt;

&lt;h3&gt;2. Deployment Convenience&lt;/h3&gt;

&lt;p&gt;Instead of reconfiguring a system each time a new application is
installed, it&amp;rsquo;s possible to just boot different pre-built OS images as
virtual machines. When its mission is accomplished, a virtual machine gets
discarded and the physical host is ready for running a new app. This
is invaluable in a development environment. Consider a
situation when you need to try &lt;a href="https://github.com/amplab-extras/SparkR-pkg"&gt;SparkR&lt;/a&gt;,
then say &lt;a href="http://0xdata.com/"&gt;H2O&lt;/a&gt; on Hadoop, then some other
weird machine learning library, etc. Many of these things
would require installing dependencies. How to manage all this stuff?
Even though both Cloudera and Hortonworks offer cluster management
tools, those are &amp;ndash; putting it mildly &amp;ndash; not extremely helpful with OS 
dependency management. At the moment, virtualization is still the best
way to achieve the push-button deployment experience.&lt;/p&gt;

&lt;h3&gt;3. Security&lt;/h3&gt;

&lt;p&gt;One of the most attractive promises of virtualization is
zero-maintenance process isolation. If your processes run in different
physical boxes, they cannot poke each other&amp;rsquo;s eyes, period. It seems
that we can reproduce this effect (to a certain extent) by running
processes in virtual machines. Some people might say that Hadoop is
integrated with Kerberos, so what&amp;rsquo;s the problem? Kerberos is simply
not enough as the apps run on the same hardware, sharing the same
memory, file handles and buffers of the same instance of an OS kernel. Besides,
if you need multitenancy, only true virtualization is the answer.&lt;/p&gt;

&lt;h2&gt;Security is Isolation&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s gloss over different levels of isolation and their
corresponding threat models.&lt;/p&gt;

&lt;h3&gt;0. No isolation&lt;/h3&gt;

&lt;p&gt;This assumes a fully cooperative security model. If
somebody, even not maliciously, does something bad or just stupid,
everybody gets hurt. The threat model assumes there are no security threats at all.&lt;/p&gt;

&lt;h3&gt;1. Imposed isolation&lt;/h3&gt;

&lt;p&gt;That&amp;rsquo;s what you see in most operating
systems. You have got users, resources and permissions. If you&amp;rsquo;re
lucky, you get Access Control Lists (ACLs). Access to shared resources is limited 
according to the permission settings.  This type of isolation is
enough when you assume no serious insider threats. That is, you kinda
can hide things from people, and most of them will not violently try
to break the locks. Think of lockers in a gym, this is it.&lt;/p&gt;

&lt;h3&gt;2. Implicit isolation&lt;/h3&gt;

&lt;p&gt;Imagine reinforced concrete walls between
different security compartments. Resources are not shared at all; they
are separated &lt;em&gt;by design&lt;/em&gt; &amp;ndash; ideally, at the physical level. All users
are adversaries and as such are considered security threats. This
level of isolation is a requirement in public multitenant systems.&lt;/p&gt;

&lt;h2&gt;Virtualization&lt;/h2&gt;

&lt;p&gt;Now let&amp;rsquo;s briefly look at different virtualization techniques and
their suitability for Hadoop.&lt;/p&gt;

&lt;h3&gt;0. Bare metal&lt;/h3&gt;

&lt;p&gt;Boot to Hadoop. This level does not exist in reality,
I made it up. When I tell people this idea, they typically go like &amp;ldquo;this 
might be interesting, but it&amp;rsquo;s not feasible&amp;rdquo;. Anyways, with bare-metal Hadoop the goal is
not to eliminate the operating system completely, but rather to make
the barrier between the hardware and the app as thin as possible. That
would mean replacing some system-level services with Hadoop-specific ones
and possibly bringing things like HDFS and YARN down to the OS Core. The
benefit is obvious – uncompromised performance. The drawback is that
it can be an insurmountable task. There&amp;rsquo;s no isolation at this level,
which means no multitenancy at all.&lt;/p&gt;

&lt;h3&gt;1. Operating System&lt;/h3&gt;

&lt;p&gt;An OS &lt;em&gt;is&lt;/em&gt; an abstraction of hardware, so to some
extent it&amp;rsquo;s a valid form of virtualization. At any point of time you
can wipe out an OS with all the apps and install it over again on the
same hardware. At this level you get some security, a very simplistic
user-based isolation model and reasonably good performance. This is
what people typically mean when they say &amp;ldquo;bare metal&amp;rdquo;. Although
possible, isolation at the OS level is absolutely not adequate for
public multitenancy. Shared hosting lessons were learned and nobody is
doing it anymore.&lt;/p&gt;

&lt;h3&gt;2. Containers&lt;/h3&gt;

&lt;p&gt;Solaris Zones, LXC-derivatives, etc. The idea is to
isolate process groups at the OS kernel level, effectively creating
security compartments. Such a compartment, along with its associated
resources, forms a container. Containers are considered a reasonably safe
technology, although their fundamental isolation level &amp;ndash; and this is
important &amp;ndash; is the same as permission-based security. This is so,
because containers, in essence, are a mechanism of restricting
access to shared resources. Containers are susceptible to all the same
attacks as regular OS users. Although containers are currently not
popular in the Hadoop world, things are changing and some Hadoop
vendors directly recommend using containers instead of Virtual Machines.&lt;/p&gt;

&lt;h3&gt;3. Virtual Machines&lt;/h3&gt;

&lt;p&gt;Instead of fiddling with OS-level security
mechanisms, you basically say &amp;ldquo;OK, let&amp;rsquo;s create a virtual machine with
a virtual CPU and totally separate memory within my host machine, and
run the app inside that VM&amp;rdquo;. Because resources are [mostly] not
shared, VMs achieve a much better level of isolation. Of course, it&amp;rsquo;s more
complicated than that and there are at least three different types of
VM-based virtualization, but roughly this is the idea. Virtual Machines are
what you use when you run Hadoop on AWS or Rackspace.&lt;/p&gt;

&lt;p&gt;So, why not use Virtual Machines all the time? There are three major
problems with this approach:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;p&gt;VMs are expensive. Expect 10-15% CPU overhead compared to &amp;ldquo;bare
metal&amp;rdquo; Hadoop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VMs are unpredictable, especially when running in parallel. The same task
would take 30 minutes today and 45 minutes tomorrow simply
because of all the funny ways the VM&amp;rsquo;s OS interacts with the host
OS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If more than one VM is running on the same box, the I/O performance
will suffer because of the host OS I/O system overload. And
it will hit hard because Hadoop is I/O-bound almost all the time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;h2&gt;Use cases and recommendations&lt;/h2&gt;

&lt;p&gt;Here is what I&amp;rsquo;ve seen working myself or have heard works for other
people.&lt;/p&gt;

&lt;h3&gt;Safe environment, stable load&lt;/h3&gt;

&lt;p&gt;It&amp;rsquo;s simple and straightforward &amp;ndash; go bare
metal. You will not be sorry about that, seriously. If you need
another security compartment, build another cluster.&lt;/p&gt;

&lt;h3&gt;Safe environment, varying load&lt;/h3&gt;

&lt;p&gt;Either rely on YARN for resource
management or consider using containers, especially if you ever need
to switch node roles. Use one container per physical box to not harm
the performance.&lt;/p&gt;

&lt;h3&gt;Unsafe environment&lt;/h3&gt;

&lt;p&gt;Only Virtual Machines are up to the task. If you cannot trust
your users, simply don&amp;rsquo;t use anything except VMs. Prepare to pay the
performance tax.&lt;/p&gt;

&lt;h2&gt;Containers: What to Use&lt;/h2&gt;

&lt;p&gt;Unfortunately, neither Hortonworks nor Cloudera currently offers a
turnkey container-based solution. Fortunately, it&amp;rsquo;s not that difficult
to build from scratch. A quick search brings up a few
Hadoop-on-Docker projects such as
&lt;a href="https://github.com/sequenceiq/docker-hoya"&gt;docker-hoya&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I don&amp;rsquo;t see why it should not be possible to achieve the same with private cloud
solutions such as Red Hat&amp;rsquo;s OpenShift or ActiveState&amp;rsquo;s Stackato.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We looked at some basic virtualization concepts in the Big Data
context. Full virtualization technologies should be avoided if
possible. Containers play nicer with Hadoop than VMs, although there
are not so many out of the box Hadoop containerization solutions.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/96940923071</link><guid>https://www.mikeivanov.com/post/96940923071</guid><pubDate>Sun, 07 Sep 2014 23:13:45 -0400</pubDate><category>hadoop</category><category>virtualization</category><category>yarn</category></item><item><title>Installing SBCL 1.1+ on RHEL/CentOS systems</title><description>&lt;p&gt;The version of SBCL available on RedHat Enterprise Linux 6.4 (and CentOS) is 1.0.38, which is quite old. If your project requires a newer SBCL, it has to be installed manually.&lt;/p&gt;

&lt;p&gt;Although &lt;a href="http://www.sbcl.org/platform-table.html"&gt;sbcl.org&lt;/a&gt; offers some Linux binaries, those are incompatible with RHEL/CentOS 6.4. Compiling from the sources, unfortunately, is the only option.&lt;/p&gt;

&lt;p&gt;This tutorial assumes a 64-bit system (x86_64). Compiling SBCL on a 32-bit platform might or might not work &amp;ndash; I never tried it.&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;p&gt;The first step is to make sure you can compile programs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo yum groupinstall "Development Tools"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then &lt;a href="http://www.rackspace.com/knowledge_center/article/installing-rhel-epel-repo-on-centos-5x-or-6x"&gt;enable EPEL&lt;/a&gt;, which is necessary for the next step:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget &lt;a href="http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm"&gt;http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm&lt;/a&gt;
sudo rpm -Uvh epel-release-6*.rpm 
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now, let&amp;rsquo;s install the &lt;em&gt;old&lt;/em&gt; SBCL. We need it because SBCL&amp;rsquo;s Lisp compiler is written in Lisp, so it requires a working Lisp compiler to compile itself. This older SBCL binary can be safely removed later.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo yum install -y sbcl.x86_64
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download SBCL source code. At the time of writing this post the latest version was 1.1.13:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget &lt;a href="http://downloads.sourceforge.net/project/sbcl/sbcl/1.1.13/sbcl-1.1.13-source.tar.bz2"&gt;http://downloads.sourceforge.net/project/sbcl/sbcl/1.1.13/sbcl-1.1.13-source.tar.bz2&lt;/a&gt;
tar xfj sbcl-1.1.13-source.tar.bz2
cd sbcl-1.1.13
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compile the sources. Expect to see a lot of diagnostic messages.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;./make.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the compiled binary. The warnings about missing &lt;code&gt;doc&lt;/code&gt; directory can be safely ignored. By default, the binary is installed in /usr/local/bin:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo sh install.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure it works. You should see &amp;ldquo;SBCL 1.1.13&amp;rdquo; in the response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sbcl --version
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove the old SBCL:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo yum remove -y sbcl
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optional: install Quicklisp. This is not strictly necessary, but having a CPAN-like &lt;a href="http://www.quicklisp.org/beta/"&gt;Lisp package manager&lt;/a&gt; around will definitely make your life easier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget &lt;a href="http://beta.quicklisp.org/quicklisp.lisp"&gt;http://beta.quicklisp.org/quicklisp.lisp&lt;/a&gt;
sbcl --load quicklisp.lisp \
     --eval '(quicklisp-quickstart:install)' \
     --eval '(ql:add-to-init-file)' \
     --eval '(quit)' 
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Enjoy your new SBCL.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/66510551125</link><guid>https://www.mikeivanov.com/post/66510551125</guid><pubDate>Sat, 09 Nov 2013 18:19:32 -0500</pubDate><category>lisp</category><category>sbcl</category><category>rhel</category><category>centos</category><category>howto</category><category>tutorial</category></item><item><title>MySQL Connector: inherited transactions</title><description>&lt;p&gt;This is what I have noticed today: if a process opens a MySQL connection and then forks, the child process inherits not just the open connection but also the transaction state. The current transaction becomes shared between the child and the parent. That is, if the child process rolls back, the parent also gets a rollback. Also, as it is the same transaction, a lock set by one process has no effect on the other.&lt;/p&gt;

&lt;p&gt;Here is a proof of concept:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;"""
Create and populate a database before running this script:

create database mytest;
grant all on mytest.* to ''@'localhost';
flush privileges;
create table foo(a int);
insert into foo (a) values (0);
"""

import time
from multiprocessing import Process
import _mysql

reconnect = False  # change to True to make the child process block (it should)

conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")

def sub():
    if reconnect:
        sub_conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")
    else:
        sub_conn = conn
    print "SUB: start", sub_conn.thread_id()
    print "SUB: do this to get the number of connections -&amp;gt; sudo lsof | grep mysql.sock"
    sub_conn.query('begin')
    sub_conn.query('select * from foo for update')
    if not reconnect:
        print "SUB: NOT BLOCKED, sleeping for 30 sec to hold the connection open"
        time.sleep(30)
    print "SUB: result", sub_conn.use_result().fetch_row()
    sub_conn.query('rollback')
    print "SUB: end"

print "HOST: start", conn.thread_id()
conn.query('begin')
conn.query('select * from foo for update')
print "HOST: result", conn.use_result().fetch_row()

process = Process(target=sub)
print "HOST: start sub"
process.start()
process.join()
print "HOST: sub joined"

conn.query('rollback')
print "HOST: end"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When &lt;code&gt;reconnect&lt;/code&gt; is set to &lt;code&gt;False&lt;/code&gt;, the parent&amp;rsquo;s thread id will be the same as in the child. The reason is that MySQL uses server-side thread ids as connection identifiers. Here&amp;rsquo;s the &lt;code&gt;mysql_thread_id&lt;/code&gt; function (mysql-connector-c-6.1.0-src/libmysql/libmysql.c:1070):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ulong STDCALL mysql_thread_id(MYSQL *mysql)
{
  ......
  return (mysql)-&amp;gt;thread_id;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And this is how it is set in &lt;code&gt;CLI_MYSQL_REAL_CONNECT&lt;/code&gt; (mysql-connector-c-6.1.0-src/sql-common/client.c:3613):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;......
server_version_end= end= strend((char*) net-&amp;gt;read_pos+1);
mysql-&amp;gt;thread_id=uint4korr(end+1);
end+=5;
......
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The direct consequence is that child processes, created for example using the &lt;code&gt;multiprocessing&lt;/code&gt; module, must close the inherited MySQL connections and then reopen them to avoid surprises.&lt;/p&gt;
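
&lt;p&gt;One way to make the reopening half of that rule hard to forget is to remember which process opened the connection. The sketch below is only an illustration: the class name &lt;code&gt;PerProcessConnection&lt;/code&gt; and its &lt;code&gt;get&lt;/code&gt; method are made up, and it assumes nothing about the driver beyond a zero-argument callable that opens a connection.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os

class PerProcessConnection(object):
    """Hands out a connection that was opened by the calling process."""

    def __init__(self, connect):
        self._connect = connect   # zero-argument factory, e.g. a lambda
        self._conn = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        if self._conn is None or self._pid != pid:
            # First use, or we are in a forked child: open our own
            # connection instead of reusing the inherited one.
            self._conn = self._connect()
            self._pid = pid
        return self._conn
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the script above, this could be used as &lt;code&gt;db = PerProcessConnection(lambda: _mysql.connect("localhost", user="mike", db="mytest", passwd=""))&lt;/code&gt;, calling &lt;code&gt;db.get()&lt;/code&gt; wherever a connection is needed. The sketch leaves the inherited connection object untouched in the child; whether it is safe to call &lt;code&gt;close()&lt;/code&gt; on an inherited handle without disturbing the parent is something to verify for your driver.&lt;/p&gt;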

&lt;p&gt;When I discovered it, I immediately thought about Django management commands splitting workload between children.&lt;/p&gt;

&lt;p&gt;Open questions:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Are Celery tasks affected by this? &amp;ndash; probably yes.&lt;/li&gt;
&lt;li&gt;What happens when two processes sharing a transaction update data at the same time?&lt;/li&gt;
&lt;/ol&gt;</description><link>https://www.mikeivanov.com/post/54042838415</link><guid>https://www.mikeivanov.com/post/54042838415</guid><pubDate>Thu, 27 Jun 2013 17:10:00 -0400</pubDate><category>python</category><category>mysql</category></item><item><title>Multimethods in Python</title><description>&lt;p&gt;So, &lt;a href="http://www.python.org/dev/peps/pep-0443/"&gt;PEP-443 aka Single-dispatch generic functions&lt;/a&gt; has made it into Python. There is a &lt;a href="http://lukasz.langa.pl/8/single-dispatch-generic-functions/"&gt;nice writeup&lt;/a&gt; of the &lt;code&gt;singledispatch&lt;/code&gt; package features by &lt;a href="http://lukasz.langa.pl"&gt;Łukasz Langa&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although I&amp;rsquo;m glad that Python is evolving in the right direction, I can&amp;rsquo;t see how single dispatch alone could be enough. In essence, PEP-443 defines a way of dynamically extending existing types with externally defined generic functions. Which is nice, of course, but too limited.&lt;/p&gt;

&lt;p&gt;What is &lt;em&gt;really&lt;/em&gt; interesting is &lt;a href="http://en.wikipedia.org/wiki/Multiple_dispatch"&gt;multiple dispatch&lt;/a&gt;. There are a few packages bringing multimethods to Python; all of them are overcomplicated for my taste.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s my take on it. I won&amp;rsquo;t talk much; I&amp;rsquo;d rather show you the code.&lt;/p&gt;

&lt;p&gt;This is the complete implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# multidispatch.py

from collections import OrderedDict

class DuplicateCondition(Exception): pass

class NoMatchingMethod(Exception): pass

class defmulti(object):
    def __init__(self, predicate):
        self.registry = OrderedDict()
        self.predicate = predicate

    def __call__(self, *args, **kw):
        method = self.dispatch(*args, **kw)
        return method(*args, **kw)

    def dispatch(self, *args, **kw):
        for condition, method in self.registry.items():
            if self.predicate(condition, *args, **kw):
                return method
        return self.notfound

    def notfound(self, *args, **kw):
        raise NoMatchingMethod()

    def when(self, condition):
        if condition in self.registry:
            raise DuplicateCondition()
        def deco(fn):
            self.registry[condition] = fn
            return fn
        return deco

    def default(self, fn):
        self.notfound = fn
        return fn

    @classmethod
    def typedispatch(cls):
        return cls(lambda type, first, *rest, **kw: isinstance(first, type))
&lt;/code&gt;&lt;/pre&gt;
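
&lt;p&gt;Since the predicate sees every argument, the same class can dispatch on more than one of them. Here is a hedged Python 3 sketch: the class is a condensed copy of the one above (without &lt;code&gt;default&lt;/code&gt; and the duplicate check) so that the snippet runs on its own, and &lt;code&gt;collide&lt;/code&gt; with its three methods is a made-up example. Conditions are tried in registration order, so the most specific ones must be registered first.&lt;/p&gt;

```python
from collections import OrderedDict


class NoMatchingMethod(Exception):
    pass


class defmulti(object):
    # Condensed copy of the class above, kept only so the sketch is
    # self-contained.
    def __init__(self, predicate):
        self.registry = OrderedDict()
        self.predicate = predicate

    def __call__(self, *args, **kw):
        for condition, method in self.registry.items():
            if self.predicate(condition, *args, **kw):
                return method(*args, **kw)
        raise NoMatchingMethod()

    def when(self, condition):
        def deco(fn):
            self.registry[condition] = fn
            return fn
        return deco


# The condition is a pair of types; the predicate checks BOTH arguments.
collide = defmulti(
    lambda sig, a, b: isinstance(a, sig[0]) and isinstance(b, sig[1]))


@collide.when((int, str))
def int_str(a, b):
    return "int hits str"


@collide.when((str, int))
def str_int(a, b):
    return "str hits int"


@collide.when((object, object))
def anything(a, b):
    # Catch-all: registered last because conditions are tried in order.
    return "something hits something"
```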

&lt;p&gt;And here&amp;rsquo;s how to use it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import types
from multidispatch import defmulti, NoMatchingMethod

# Exhibit A: Dispatch on the type of the first parameter.
#            Equivalent to `singledispatch`.

cupcakes = defmulti.typedispatch()

@cupcakes.when(types.StringType)
def str_cupcakes(ingredient):
    return "Delicious {0} cupcakes".format(ingredient)

@cupcakes.when(types.IntType)
def int_cupcakes(number):
    return "Integer cupcakes, anyone? I've got {0} of them.".format(number)

@cupcakes.default
def any_cupcakes(thing):
    return ("You can make cupcakes out of ANYTHING! "
            "Even out of {0}!").format(thing)

print cupcakes("bacon")
print cupcakes(4)
print cupcakes(cupcakes)


# Exhibit B: dispatch on the number of args, no default

@defmulti
def jolly(num, *args):
    return len(args) == num

@jolly.when(1)
def single(a):
    return "For {0}'s a jolly old fellow!".format(a)

@jolly.when(2)
def couple(a, b):
    return "{0} and {1} are such a jolly couple!".format(a, b)

print jolly("Lukasz")
print jolly("Fish", "Chips")
try:
    jolly("Good", "Bad", "Ugly")
except NoMatchingMethod:
    print "Noo! Angel Eyes!"
&lt;/code&gt;&lt;/pre&gt;</description><link>https://www.mikeivanov.com/post/52352972836</link><guid>https://www.mikeivanov.com/post/52352972836</guid><pubDate>Thu, 06 Jun 2013 23:23:00 -0400</pubDate><category>python</category><category>multimethods</category><category>pep</category></item><item><title>I will never CNAME my root domain again.

I will never CNAME my root domain again.

I will never...</title><description>&lt;p&gt;I will never CNAME my root domain again.&lt;/p&gt;

&lt;p&gt;I will never CNAME my root domain again.&lt;/p&gt;

&lt;p&gt;I will never CNAME my root domain again.&lt;/p&gt;

&lt;p&gt;NEVER. EVER.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/48254796819</link><guid>https://www.mikeivanov.com/post/48254796819</guid><pubDate>Wed, 17 Apr 2013 23:43:29 -0400</pubDate><category>i-am-so-sorry</category></item><item><title>Hello Tumblr</title><description>&lt;p&gt;Good bye, Posterous. Rest in peace.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/47920167164</link><guid>https://www.mikeivanov.com/post/47920167164</guid><pubDate>Sat, 13 Apr 2013 23:04:32 -0400</pubDate><category>blog</category></item><item><title>A simple callback chain macro for elisp</title><description>&lt;h3&gt;The Problem&lt;/h3&gt;

&lt;p&gt;As usual, it started with a tiny piece of ugly code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(bd-create-stage datafile-id
                 (lambda (stage-id)
                   (bd-insert-rows stage-id 
                                   [[10 20 30] [40 50 60]]
                                   (lambda (stage-id)
                                     (bd-commit-stage stage-id 
                                                      #'ignore)))))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The snippet above is basically a callback chain. When &lt;code&gt;bd-create-stage&lt;/code&gt;
finishes its work, it calls the first lambda, which calls
&lt;code&gt;bd-insert-rows&lt;/code&gt; with the second lambda as its callback argument and so
on, until it all stops at the ignore function.&lt;/p&gt;

&lt;p&gt;I wanted to rewrite it as something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(=&amp;gt; datafile-id
    (bd-create-stage it next)
    (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
    (bd-commit-stage it next))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here the &lt;code&gt;it&lt;/code&gt; variable would represent the current callback&amp;rsquo;s
parameter and &lt;code&gt;next&lt;/code&gt; would refer to the next callback in the chain. As
with the
&lt;a href="http://www.mikeivanov.com/thread-operator-in-elisp"&gt;&lt;code&gt;-&amp;gt;&lt;/code&gt; macro&lt;/a&gt;, I
wanted explicit anaphoric variables.&lt;/p&gt;

&lt;h3&gt;The Idea&lt;/h3&gt;

&lt;p&gt;Each line in the snippet above could be wrapped in a lambda, just like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(=&amp;gt; datafile-id
    (lambda (next it)
      (bd-create-stage it next))
    (lambda (next it)
      (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next))
    (lambda (next it)
      (bd-commit-stage it next)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then the macro should somehow call each function in the list with the
subsequent function as the first parameter and the result of the
previous function as the second parameter.&lt;/p&gt;

&lt;h3&gt;The Solution&lt;/h3&gt;

&lt;p&gt;This function chaining thing looks a lot like a binary function fold:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&amp;amp;rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)
    #'identity))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Applying &lt;code&gt;chain&lt;/code&gt; to a function list creates a new function taking one
parameter and passing it through the whole function list, much like
the &lt;code&gt;-&amp;gt;&lt;/code&gt; macro does.&lt;/p&gt;
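
&lt;p&gt;For readers who do not speak Elisp, the same right fold can be sketched in Python, with &lt;code&gt;functools.partial&lt;/code&gt; standing in for &lt;code&gt;apply-partially&lt;/code&gt;. This is only an illustration: the three stage functions, their arithmetic and the &lt;code&gt;log&lt;/code&gt; list are invented stand-ins for the &lt;code&gt;bd-*&lt;/code&gt; calls.&lt;/p&gt;

```python
from functools import partial, reduce


def chain(*fns):
    """Fold fns from the right: each one receives the rest as its callback."""
    if not fns:
        return lambda it: it  # same role as #'identity
    # reversed(fns) starts at the terminator, so the accumulator is always
    # "everything that comes after f" and becomes f's first argument.
    return reduce(lambda rest, f: partial(f, rest), reversed(fns))


log = []


def create_stage(next_cb, it):
    log.append(("create", it))
    next_cb(it + 1)  # pretend the new stage id is it + 1


def insert_rows(next_cb, it):
    log.append(("insert", it))
    next_cb(it * 2)


def commit_stage(next_cb, it):
    log.append(("commit", it))
    next_cb(it)


def ignore(*args):
    # Terminator: swallows the last callback's result.
    return None


run = chain(create_stage, insert_rows, commit_stage, ignore)
run(10)
# log is now [("create", 10), ("insert", 11), ("commit", 22)]
```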

&lt;p&gt;In fact, this is enough to start working on the macro.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defmacro =&amp;gt; (initial &amp;amp;amp;rest forms)
  `(funcall ,(build-form-chain forms) ,initial))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;build-form-chain&lt;/code&gt; function wraps each form into a lambda and then
chains them together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end it adds &lt;code&gt;ignore&lt;/code&gt; as a terminator. The terminator is necessary
because the last callback&amp;rsquo;s result is almost always ignored.&lt;/p&gt;

&lt;p&gt;The implementation of &lt;code&gt;build-form-link&lt;/code&gt; is trivial:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun build-form-link (form)
  `(lambda (next it) ,form))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Done! Here&amp;rsquo;s the full source for your convenience:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&amp;amp;rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)
    #'identity))

(defun build-form-link (form)
  `(lambda (next it) ,form))

(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))

(defmacro =&amp;gt; (initial &amp;amp;rest forms)
  `(funcall ,(build-form-chain forms) ,initial))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now let&amp;rsquo;s see how the macro expands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ELISP&amp;gt; (macroexpand
     '(=&amp;gt; datafile-id
          (bd-create-stage it next)
          (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
          (bd-commit-stage it next)))

(funcall (apply (function chain) 
                (list (lambda (next it) 
                        (bd-create-stage it next))
                      (lambda (next it)
                        (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next))
                      (lambda (next it)
                        (bd-commit-stage it next))
                      (function ignore)))
          datafile-id)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Exactly as intended.&lt;/p&gt;

&lt;p&gt;This macro covers 95% of my callback chaining needs. For the remaining 5%
there is the all-powerful &lt;a href="https://github.com/kiwanami/emacs-deferred"&gt;deferred.el&lt;/a&gt;
library.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/47902599658</link><guid>https://www.mikeivanov.com/post/47902599658</guid><pubDate>Wed, 06 Jun 2012 00:00:00 -0400</pubDate><category>lisp</category><category>emacs</category></item><item><title>Thread operator in Elisp</title><description>&lt;h3&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;ELISP&amp;gt; (-&amp;gt; 1
           (+ 2 it)
           (* 3 it))
9
ELISP&amp;gt; (macroexpand
           '(-&amp;gt; 1
                (+ 2 it)
                (* 3 it)))
(let* ((it 1) (it (+ 2 it)) (it (* 3 it))) it)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defmacro -&amp;gt; (arg &amp;amp;rest forms)
  `(let* ((it ,arg) .
      ,(mapcar (lambda (form) `(it ,form))
           forms))
     it))
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;&lt;strong&gt;The Long Story&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;When I see code like this, I frown:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun bd-search (api-key query callback)
  (send-request "GET"
            (format "search?%s"
                (make-query-string `(("api_key" . ,api-key)
                             ("query" . ,query))))
        callback))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s a very simple case, yet the parameter list is already at the fourth level of indentation. When it gets really ugly I usually wrap the whole thing into a let statement and start moving inner parts into variables.&lt;/p&gt;

&lt;p&gt;What I have noticed, however, is that constructs like this are almost always sequential by nature: the output of the innermost statement serves as input for the statement one level up, and so on. This is the very reason why Clojure has had its &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/-&amp;gt;"&gt;thread operator macro&lt;/a&gt; since the beginning.&lt;/p&gt;

&lt;p&gt;Remembering that, I started literally morphing my &lt;code&gt;bd-search&lt;/code&gt; function into something prettier. I came up with this variant:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;  (-&amp;gt; `(("api_key" . ,api-key)
    ("query" . ,query))
      (make-query-string it)
      (format "search?%s" it)
      (send-request "GET" it callback)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then I put together the -&amp;gt; macro and that was it.&lt;/p&gt;

&lt;p&gt;I decided to make the macro &lt;a href="http://www.bookshelf.jp/texi/onlisp/onlisp_15.html"&gt;anaphoric&lt;/a&gt; instead of implicitly injecting an extra parameter as in Clojure. This allowed me to put the threaded parameter at any place, not just at the beginning or at the end of the parameter list.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/47901990033</link><guid>https://www.mikeivanov.com/post/47901990033</guid><pubDate>Wed, 09 May 2012 00:00:00 -0400</pubDate><category>lisp</category><category>emacs</category><category>clojure</category></item><item><title>How much can be done in four hours</title><description>&lt;p&gt;Today I had an awesome day at the first OpenDataBC hackathon which took place at Mozilla Labs Vancouver.&lt;/p&gt;

&lt;p&gt;Tara Gibbs pitched this wonderful idea of consolidating shelter availability data and displaying it on a few window displays, so that homeless people living in the DTES would not waste their time going from one shelter to another just to find a free spot.&lt;/p&gt;

&lt;p&gt;This doesn&amp;rsquo;t solve all the problems of course, but it does solve a small yet very annoying one.&lt;/p&gt;

&lt;p&gt;So&amp;hellip; At 11:30 we had nothing but an idea. We discussed possible approaches for a while, then came David Eaves and suggested using Twitter as a message queue service.&lt;/p&gt;

&lt;p&gt;At approximately 12:00 we still had nothing but a piece of paper covered with boxes and arrows, then we started coding. Tara did the frontend, I was busy hacking the backend and the Twitter stuff.&lt;/p&gt;

&lt;p&gt;Four hours later we had a fully functional, production ready system - &lt;a href="https://github.com/mikeivanov/vanshelter"&gt;https://github.com/mikeivanov/vanshelter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How it is supposed to work:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Shelters tweet their availability data (they all have internet access)&lt;/li&gt;
&lt;li&gt;VanShelter monitors &amp;ndash; each of them independently &amp;ndash; receive Twitter updates and&lt;/li&gt;
&lt;li&gt;Refresh their displays when something changes.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;For displays we can use cheap LCD monitors, probably even donated.  The software will run on those amazing Raspberry thingies - &lt;a href="http://www.raspberrypi.org/"&gt;http://www.raspberrypi.org/&lt;/a&gt;, $25 each. This brings the full cost of installing 10 displays down to $250+.&lt;/p&gt;

&lt;p&gt;Thank you Tara and David. Also, thank you Jeff and all the people who made this hackathon possible.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/47901446886</link><guid>https://www.mikeivanov.com/post/47901446886</guid><pubDate>Sat, 27 Aug 2011 00:00:00 -0400</pubDate><category>python</category><category>shelter</category><category>dtes</category><category>data</category></item><item><title>Pure Python Paillier Homomorphic Cryptosystem Implementation</title><description>&lt;h3&gt;&lt;strong&gt;What&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;This is a very basic &lt;a href="http://en.wikipedia.org/wiki/Paillier_cryptosystem"&gt;Paillier Homomorphic Cryptosystem&lt;/a&gt; implemented in pure Python.&lt;/p&gt;

&lt;p&gt;The idea is, in short, to encrypt two numbers, perform an &amp;ldquo;add&amp;rdquo; operation on cyphertexts, decrypt the result and find it to be the sum of the original plaintext numbers.&lt;/p&gt;
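
&lt;p&gt;To make the idea concrete, here is a toy sketch of the textbook Paillier scheme with the common simplification &lt;code&gt;g = n + 1&lt;/code&gt;. It is not the code from the repository: the function names are my own for this post, the primes 61 and 53 are far too small to be secure, and &lt;code&gt;pow(x, -1, n)&lt;/code&gt; assumes Python 3.8 or newer.&lt;/p&gt;

```python
import math
import random


def keygen(p, q):
    """Toy key generation from two primes; returns (public, private)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x-1)//n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    n, g = pub
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:  # the blinding factor must be a unit mod n
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add(pub, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    n, _ = pub
    return (c1 * c2) % (n * n)


pub, priv = keygen(61, 53)  # toy primes, for illustration only
total = add(pub, encrypt(pub, 12), encrypt(pub, 30))
# decrypt(pub, priv, total) gives back 12 + 30 = 42
```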

&lt;h3&gt;&lt;strong&gt;How&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The code is loosely based on the &lt;a href="http://en.wikipedia.org/wiki/Paillier_cryptosystem"&gt;thep&lt;/a&gt; project and a few &lt;a href="http://code.activestate.com/recipes/"&gt;ActiveState recipes&lt;/a&gt;. The code is pure Python and all objects are serializable.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;Where&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Here: &lt;a href="https://github.com/mikeivanov/paillier"&gt;https://github.com/mikeivanov/paillier&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;Why&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;I was bored.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/47901014668</link><guid>https://www.mikeivanov.com/post/47901014668</guid><pubDate>Tue, 28 Jun 2011 00:00:00 -0400</pubDate><category>python</category><category>crypto</category></item><item><title>How to mount an NTFS-formatted USB drive in read-write mode on Mac OS X</title><description>&lt;p&gt;Actually, it&amp;rsquo;s very easy. No additional software is required. Just seven easy steps:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Attach your USB drive&lt;/li&gt;
&lt;li&gt;Open the Terminal app (Command-Space, then type &amp;ldquo;Terminal&amp;rdquo;, hit Enter)&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type or copy/paste these commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo sh -c "mkdir -p /mnt \
$(mount | grep ntfs | head -n 1 \
   | awk '{ print "&amp;amp;&amp;amp; umount " $3 \
               " &amp;amp;&amp;amp; mount_ntfs -o nosuid,rw " $1 " /mnt" }')"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Locate your drive in Finder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;Drag/drop files there&lt;/li&gt;
&lt;li&gt;Unmount the drive as usual&lt;/li&gt;
&lt;li&gt;DONE!&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The command breakdown, if you&amp;rsquo;re interested:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;code&gt;mkdir -p /mnt&lt;/code&gt; creates a mount point &amp;ndash; a place in the file system where the drive is going to be attached&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;mount&lt;/code&gt; command without parameters gives you a list of the currently attached drives&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grep ntfs&lt;/code&gt; filters non-NTFS drives out of the list&lt;/li&gt;
&lt;li&gt;&lt;code&gt;head -n 1&lt;/code&gt; grabs the first line (we&amp;rsquo;re assuming only one ntfs drive can be attached at a time)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;awk&lt;/code&gt; part produces two commands:

&lt;ul&gt;&lt;li&gt;&lt;code&gt;umount /Volumes/&lt;/code&gt; &amp;ndash; unmounts the drive from its original place&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mount_ntfs -o nosuid,rw /dev/ /mnt&lt;/code&gt; &amp;ndash; mounts the drive again, but this time in the read-write mode&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Now, the &lt;code&gt;sudo sh -c "..."&lt;/code&gt; thing allows code execution with superuser privileges.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;That&amp;rsquo;s it.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/45894951347</link><guid>https://www.mikeivanov.com/post/45894951347</guid><pubDate>Thu, 09 Jun 2011 00:00:00 -0400</pubDate><category>ntfs</category><category>mac</category><category>terminal</category><category>usb</category></item><item><title>Tail recursion without TCO</title><description>&lt;p&gt;Emacs lisp has no Tail Call Optimization (TCO), neither do many other
lisp dialects. The lack of TCO is not a big deal&amp;ndash;it&amp;rsquo;s always possible
to transform a tail recursive algorithm into a loop. However, it makes
functions look uglier. Here is a very simple method of enabling
Clojure-style tail call recursion in Emacs lisp:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;;; A very simple linearized Y combinator.
;; All the state management stuff is encapsulated here.
;; Don't call it directly.
(defun rloop- (body &amp;amp;rest args)
  (let ((res nil))
    (while (progn
             ;; here's the idea: we keep calling body 
             ;; while it returns the recursion marker
             (setq res (apply body args))
             (when (and (consp res)
                        (eq :loop-recur-marker (car res)))
               (progn (setq args (cdr res))
                      t))))
    res))

;; Recursion marker factory
(defun recur (&amp;amp;rest args)
  ;; instead of a real recursive call,
  ;; just signal an intention to make one
  (cons :loop-recur-marker args))

;; The form macro
(defmacro rloop (init body)
  (let ((args (mapcar 'car init)))
    ;; a little courtesy to the macro users
    `(let* ,init
       ;; make a lambda from the body and pass it 
       ;; to the combinator function
       (rloop- (function (lambda (,@args) ,body))
               ,@args))))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s how to use it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun factorial (x)
  ;; this is the recursion entry point
  (rloop ((x   x) 
          (acc 1))
         (if (&amp;lt; x 1)
             acc ;; done, just return the result
           ;; not done, start the whole rloop block again
           (recur (1- x) 
                  (* x acc)))))

ELISP&amp;gt; (factorial 10)
3628800
&lt;/code&gt;&lt;/pre&gt;
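
&lt;p&gt;The same marker-and-loop trick carries over to other languages without TCO. Below is a rough Python transliteration, offered only as an illustration: &lt;code&gt;Recur&lt;/code&gt; and this &lt;code&gt;rloop&lt;/code&gt; are hypothetical names mirroring the Elisp ones, and the factorial assumes a non-negative integer argument.&lt;/p&gt;

```python
class Recur(object):
    """Marker object: 'call the body again with these arguments'."""

    def __init__(self, *args):
        self.args = args


def rloop(body, *args):
    # Keep calling body while it returns the recursion marker.
    while True:
        res = body(*args)
        if not isinstance(res, Recur):
            return res
        args = res.args


def factorial(x):
    def body(x, acc):
        if x == 0:
            return acc  # done, just return the result
        return Recur(x - 1, x * acc)  # not done, go around again
    return rloop(body, x, 1)
```

&lt;p&gt;Because the loop runs in &lt;code&gt;rloop&lt;/code&gt; and not on the call stack, &lt;code&gt;factorial(5000)&lt;/code&gt; completes without hitting Python&amp;rsquo;s recursion limit.&lt;/p&gt;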

&lt;p&gt;The funny part is that the &lt;code&gt;defun&lt;/code&gt; is not necessary. You can have as many
sequential inlined rloops as you want. I like this approach: all the
state management stuff is out of sight. The function code is almost
identical to the underlying algorithm. Another classic example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun fibo (x)
  (rloop ((x    x)
          (curr 0)
          (next 1))
         (if (= x 0)
             curr
           (recur (1- x) 
                  next
                  (+ curr next)))))

ELISP&amp;gt; (fibo 10)
55
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nice, eh? Of course, this kind of beauty comes with a price. Here is
how the rloop macro expands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ELISP&amp;gt; (macroexpand '(rloop ((n 0)) (if (&amp;gt; n 5) n (recur (1+ n)))))

(let*
    ((n 0))
  (rloop-
   #'(lambda
       (n)
       (if
           (&amp;gt; n 5)
           n
         (recur
          (1+ n))))
   n))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&amp;hellip;which means two extra function calls on each iteration. But
realistically, it&amp;rsquo;s not such a big deal. Clarity of the code is way
more important.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/44952159596</link><guid>https://www.mikeivanov.com/post/44952159596</guid><pubDate>Fri, 20 Aug 2010 00:00:00 -0400</pubDate><category>emacs</category><category>lisp</category><category>clojure</category><category>tco</category></item><item><title>Clouds and entropy</title><description>&lt;p&gt;In a post titled &lt;a href="http://www.elasticvapor.com/2009/08/trusted-cloud-entropy-authority.html"&gt;A Trusted Cloud Entropy Authority&lt;/a&gt; Reuven Cohen writes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&amp;hellip;maybe there an opportunity to create a trusted cloud authority to 
  provide signed verified and certified entropy. Think of it like a certificate 
  authority (CA) but for chaos. Actually, Amazon Web Service itself could 
  act as this entropy authority via a simple encrypted web service call. I 
  even have a name for it, Simple Entropy Service (SES).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a really good idea. Amazon should have provided such a service a long time ago.&lt;/p&gt;

&lt;p&gt;When an SSL connection is being established, a browser and a server perform the Handshake protocol. This protocol involves exchanging random bits between the parties. The important thing is that security depends on how random those bits are. If they are not, the connection is effectively insecure.&lt;/p&gt;

&lt;p&gt;In the case of AWS, there is no source of true randomness, therefore SSL on AWS is inherently insecure. Moreover, instances running on the same physical machine can affect each other&amp;rsquo;s security by draining the shared random pool in the host system.&lt;/p&gt;

&lt;p&gt;Further he writes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;a website called &lt;a href="http://random.org"&gt;http://random.org&lt;/a&gt; [is] a true random number service 
  that generates randomness via atmospheric noise. Looks cool, maybe 
  this may help solve the problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don&amp;rsquo;t think that random.org is a good choice.&lt;/p&gt;

&lt;p&gt;One problem is the connection to such a service. It should be as secure as the most secure secret handled on your system. If the random bit connection is encrypted with 256 bit AES (and it actually is), this is the highest level of security your system can provide. Plus, there should be a guarantee that no unencrypted random bits are stored anywhere. The same is true for the proposed SES service, too.&lt;/p&gt;

&lt;p&gt;Another problem with random.org is&amp;hellip; well, randomness is perceptive. What you see as &amp;ldquo;random&amp;rdquo; can be quite deterministic to the people who run the random.org service. Even though they might not store anything, their present is your future&amp;ndash;just think about relativistic effects. A temptation to tamper with someone&amp;rsquo;s future can be, you know, very strong.&lt;/p&gt;

&lt;p&gt;The overall quality of the service is not known. There is no guarantee it is random at all. A quote from their FAQ: &amp;ldquo;Q1.2: Is the source code for the generator available? &amp;ndash; Not currently, no. Maybe I&amp;rsquo;ll make it available as open source some day.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Even though the Whois database indicates the domain name&amp;rsquo;s registrant is located in France, the SSL certificate owner is not specified. I have no reason not to believe the guy running the service, but I would not entrust my customers&amp;rsquo; data to a total stranger, even one who seems to be a nice person.&lt;/p&gt;

&lt;p&gt;So the conclusion is: while there is no trusted entropy generator on the AWS side, we, the AWS customers, are on our own.&lt;/p&gt;

&lt;p&gt;Here is a hint: entropy seeds can be generated in-house and smuggled into instances over a secure channel. Then those seeds could be fed to a cryptographically secure RNG like &lt;a href="http://www.burtleburtle.net/bob/rand/isaacafa.html"&gt;Isaac&lt;/a&gt; to produce actual &amp;ldquo;random&amp;rdquo; bits. I think there should be a way of injecting those into the instance&amp;rsquo;s random pool.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/45894023169</link><guid>https://www.mikeivanov.com/post/45894023169</guid><pubDate>Tue, 06 Apr 2010 00:00:00 -0400</pubDate><category>amazon</category><category>cloud</category><category>crypto</category><category>ssl</category><category>random</category></item><item><title>Dynamic queries: the Postgres way</title><description>&lt;p&gt;Have you ever been in a situation when you needed a query to be
generated and executed dynamically as a result of another query?&lt;/p&gt;

&lt;p&gt;If yes, read further: there is a simple and elegant way to achieve
that.&lt;/p&gt;

&lt;p&gt;First, create this function:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION dselect(varchar)
  RETURNS SETOF record AS $$
    DECLARE rec record;
    BEGIN
      FOR rec IN EXECUTE $1 LOOP
        RETURN NEXT rec;
      END LOOP;
    END
  $$ LANGUAGE 'plpgsql';
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Second, umm.. well, that&amp;rsquo;s it.&lt;/p&gt;

&lt;p&gt;Anything you pass as a parameter will be interpreted and executed as
an SQL statement.&lt;/p&gt;

&lt;p&gt;The function is SELECT-able. That is, you can use it in SELECTs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT * FROM dselect('SELECT id, name FROM users') AS t(id int, name varchar);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note the &lt;code&gt;AS t(id int, name varchar)&lt;/code&gt; part. Postgres has no idea about
what this function returns, so a column definition list should be
provided. If not, Postgres will complain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT * FROM dselect('SELECT id, name FROM users');
ERROR:  a column definition list is required for functions returning "record"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, the column definition list depends on the query and should
match the actual query results.&lt;/p&gt;

&lt;p&gt;So why is this function needed at all?&lt;/p&gt;

&lt;p&gt;Because of situations like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE SCHEMA sc_foo;
CREATE TABLE sc_foo.activity (date date, descr text);
INSERT INTO sc_foo.activity (date, descr) VALUES ('2009-08-07', 'went there');
INSERT INTO sc_foo.activity (date, descr) VALUES ('2009-06-04', 'hanging around');

CREATE SCHEMA sc_bar;
CREATE TABLE sc_bar.activity (date date, descr text);
INSERT INTO sc_bar.activity (date, descr) VALUES ('2009-10-11', 'came here');
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It allows you to do this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT 
    nspname 
FROM
    pg_namespace 
WHERE
    nspname LIKE 'sc_%'  AND
    (SELECT date FROM 
        dselect('SELECT max(date) FROM ' || nspname || '.activity') AS t(date date)) &amp;lt; '2009-09-09';

 nspname 
---------
 sc_foo
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In one pass this query goes over all the schemas whose names start
with &lt;code&gt;sc_&lt;/code&gt;, grabs the latest date from each schema&amp;rsquo;s activity table
and matches the result against the provided date.&lt;/p&gt;

&lt;p&gt;Of course, this approach is quite inefficient. Each time the function
is called, a query is parsed and executed using a separate plan. A
simple UNION of two queries would do the same, but&amp;hellip; what if there
are ten schemas? How about a hundred? I&amp;rsquo;m actually working with a
database containing thousands of them.&lt;/p&gt;

&lt;p&gt;I use this function when I need to collect statistics or do some db
administration tasks; it saves me a lot of time.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/44950520359</link><guid>https://www.mikeivanov.com/post/44950520359</guid><pubDate>Fri, 17 Jul 2009 13:00:00 -0400</pubDate><category>postgres</category><category>sql</category></item><item><title>Why Tk matters</title><description>&lt;p&gt;Tk is probably one of the most overlooked GUI toolkits. It is a nice small toolkit which is really, I mean REALLY simple to use.&lt;/p&gt;

&lt;p&gt;Tk, by the way, has something in common with Vi. Oh no, it&amp;rsquo;s not that it beeps all the time; I mean that it works everywhere. I bet you can find Tk ported to every single platform that has a GUI, and it will work consistently on all of them.&lt;/p&gt;

&lt;p&gt;Tk is ubiquitous. Guess what? Almost certainly you have Tk already installed on your computer. Got Python? Go look into /usr/lib/python2.5/lib-tk, it&amp;rsquo;s right there!&lt;/p&gt;

&lt;p&gt;Using it is very easy. Here is a &amp;lsquo;Hello, world&amp;rsquo; program in Python (using Tk bindings called Tkinter):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from Tkinter import *
Label(text='Hi there').pack()
mainloop()  # keeps the window open when run as a script
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Expectedly, this program pops up a window with some text inside. Yes, it&amp;rsquo;s that simple. And this is the area where Tk shines: quick, small GUI tools.&lt;/p&gt;

&lt;p&gt;Yet Tk is very powerful. People create big, sophisticated systems using just this toolkit. The most prominent example is definitely &lt;a href="http://www.inivis.com/"&gt;AC3D&lt;/a&gt;, a 3D modeling program.&lt;/p&gt;

&lt;p&gt;Tk, however, has some issues. I think nobody will argue that Tk looks ugly on the Linux platform. While the Windows and Mac versions got the native look, the Linux port looks unattractive. Some work is being done in this department but it&amp;rsquo;s not quite completed.&lt;/p&gt;

&lt;p&gt;Somebody might ask why use Tk if there is wxPython. Well, for the same reason why there are bicycles and airplanes. They are good for different purposes. Tk is way more lightweight and much easier to learn and use than wxPython.&lt;/p&gt;

&lt;p&gt;Tk is very well documented. Here is the official Tkinter documentation with links to tutorials and such: &lt;a href="http://docs.python.org/library/tkinter.html"&gt;http://docs.python.org/library/tkinter.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you&amp;rsquo;re going to learn Tk, it&amp;rsquo;s worth mentioning that Tk is actually a part of something called &lt;a href="http://tcl.tk"&gt;Tcl/Tk&lt;/a&gt;. Tcl stands for Tool Command Language, a shell-like scripting language that is easy to learn and fun to use. Even though all you want might be just Tkinter, some knowledge of Tcl will be rewarding.&lt;/p&gt;</description><link>https://www.mikeivanov.com/post/1052071152</link><guid>https://www.mikeivanov.com/post/1052071152</guid><pubDate>Wed, 17 Dec 2008 13:00:00 -0500</pubDate><category>tk</category><category>tcl</category><category>python</category><category>tkinter</category></item></channel></rss>
