connecting the dots . . .

Machine Learning with Clojure and Spark using Flambo

Mon, 20 Apr 2015 00:12:30 -0400

In this short tutorial I’m going to show you how to train a logistic regression classifier in a scalable manner with Apache Spark and Clojure using Flambo.

Assumptions:

you are familiar with Clojure and Leiningen
you have heard of or, ideally, poked around Apache Spark
you possess some basic Machine Learning skills

The goal of the tutorial is to help you familiarize yourself with Flambo – a Clojure DSL for Apache Spark. Even though Flambo is far from being complete, it already does a decent job of wrapping basic Spark APIs into idiomatic Clojure.

During the course of the tutorial, we are going to train a classifier capable of predicting whether a wine would taste good given certain objective chemical characteristics.

Step 1. Create new project

Run these commands:

$ lein new app t01spark
$ cd t01spark

Here, t01spark is the name of the project. You can give it any name you like. Don’t forget to change the current directory to the project you’ve just created.

Step 2. Update `project.clj`

Open project.clj in a text editor and update the dependency section so it looks like this:

:dependencies
    [[org.clojure/clojure "1.6.0"]
     [yieldbot/flambo "0.6.0"]
     [org.apache.spark/spark-mllib_2.10 "1.3.0"]]

Please note that although listing Spark jars in this manner is perfectly fine for exploratory projects, it is not suitable for production use. For that you will need to list them as “provided” dependencies in the profiles section, but let’s keep things simple for now.

Make sure that AOT is enabled, otherwise you will see strange ClassNotFound errors. Add this to the project file:

:aot :all

It could also make sense to give Spark some extra memory:

:jvm-opts ^:replace ["-server" "-Xmx2g"]

Step 3. Download dataset

In this tutorial we are going to use the Wine Quality Dataset. Download and save it along with the project.clj file:

$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Step 4. Start REPL

The simplest way is running Leiningen with the repl command:

$ lein repl
Clojure 1.6.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_xxx
...
user=>

Of course, nothing prevents you from running REPL in Emacs with Cider, IntelliJ IDEA or any other Clojure-aware IDE.

Step 5. Require modules and import classes

user=> (require '[flambo.api :as f]
                '[flambo.conf :as cf]
                '[flambo.tuple :as ft]
                '[clojure.string :as s])

user=> (import '[org.apache.spark.mllib.linalg Vectors]
               '[org.apache.spark.mllib.regression LabeledPoint]
               '[org.apache.spark.mllib.classification LogisticRegressionWithLBFGS]
               '[org.apache.spark.mllib.evaluation BinaryClassificationMetrics])

Step 6. Create Spark context

user=> (def spark
         (let [cfg (-> (cf/spark-conf)
                       (cf/master "local[2]")
                       (cf/app-name "t01spark")
                       (cf/set "spark.akka.timeout" "300"))]
           (f/spark-context cfg)))

We’ve just created a Spark context bound to a local, in-process Spark server. You should see lots of INFO log messages in the terminal. That’s normal. Again, creating a Spark context like this will work for tutorial purposes, although in real life you’d probably want to wrap this expression into a memoizing function and call it whenever you need a context.

Step 7. Load and parse data

The data is stored in a CSV file with a header. We don’t need that header. To get rid of it, let’s enumerate rows and retain only those with indexes greater than zero. Then we split each row by the semicolon character and convert each element to float:

user=> (def data
         ;; Read lines from file
         (-> (f/text-file spark "winequality-red.csv")
             ;; Enumerate lines.
             ;; This function is missing from Flambo,
             ;; so we call the method directly
             (.zipWithIndex)
             ;; This is here purely for convenience:
             ;; it transforms Spark tuples into Clojure vectors
             (f/map f/untuple)
             ;; Get rid of the header
             (f/filter (f/fn [[line idx]] (< 0 idx)))
             ;; Split lines and transform values
             (f/map (f/fn [[line _]]
                      (->> (s/split line #";")
                           (map #(Float/parseFloat %)))))))

Let’s verify what’s in the RDD:

user=> (f/take data 3)
[(7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5.0)
 (7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 9.8 5.0)
 (7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5.0)]

Looks legit.

Step 8. Transform data

The subjective wine quality information is contained in the Quality variable. It takes values in the [0..10] range. Let’s transform that into a binary variable by splitting it over the median. Wines with Quality below 6 will be considered “not good”, 6 and above - “good”.

I explored this dataset in R and found that the most interesting variables are Citric Acid, Total Sulfur Dioxide and Alcohol. I encourage you to experiment with adding other variables to the model. Also, using logarithms of those variables instead of raw values might be a good idea. Please refer to the Wine Quality Dataset for the full variable list.

 user=> (def dataset
          (f/map data
                 (f/fn [[_ _ citric-acid _ _ _
                         total-sulfur-dioxide _ _ _
                         alcohol quality]]
                   ;; NB: as written, the label is 1.0 when quality is below 6
                   ;; ("not good") and 0.0 when it is 6 or above ("good")
                   (let [good (if (<= 6 quality) 0.0 1.0)
                         ;; these will be our predictors
                         pred (double-array [citric-acid
                                             total-sulfur-dioxide
                                             alcohol])]
                     ;; Spark requires samples to be packed into LabeledPoints
                     (LabeledPoint. good (Vectors/dense pred))))))

 user=> (f/take dataset 3)
 [#<LabeledPoint (1.0,[0.0,34.0,9.399999618530273])>
  #<LabeledPoint (1.0,[0.0,67.0,9.800000190734863])>
  #<LabeledPoint (1.0,[0.03999999910593033,54.0,9.800000190734863])>]

There is no order guarantee in derived RDDs, so you might get a different result.

Step 9. Prepare training and validation datasets

 user=> (f/cache dataset) ; Temporarily cache the source dataset
                          ; BTW, caching is a side effect

 user=> (def training
          (-> (f/sample dataset false 0.8 1234)
              (f/cache)))

 user=> (def validation
          (-> (.subtract dataset training)
              (f/cache)))

 user=> (map f/count [training validation]) ; Check the counts
 (1291 235)

 user=> (.unpersist dataset) ; no need to cache it anymore

Caching is crucial for MLlib performance. Actually, Spark MLlib algorithms will complain if you feed them with uncached datasets.

Step 10. Train classifier

MLlib-related parts are completely missing from Flambo, but that’s hopefully coming soon. For now, let’s use the Java API directly.

 user=> (def classifier
          (doto (LogisticRegressionWithLBFGS.)
            ;; Otherwise we'll need to provide it
            (.setIntercept true)))

 user=> (def model
          (doto (.run classifier (.rdd training))
            ;; We need the "raw" probability predictions
            (.clearThreshold)))

 user=> [(.intercept model) (.weights model)]
 [9.805476268219566
  #<DenseVector [-1.6766504448212323,0.011619041367225583,-0.9683045663615859]>]
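To get a feel for what these numbers mean, here is a minimal plain-Python sketch (not Spark code) of how a binary logistic regression model turns an intercept and a weight vector into a probability. The intercept and weights are copied from the output above, the sample is the first row shown in Step 8, and the function name `predict_probability` is made up for illustration. The result is the probability of label 1.0, which the `(if (<= 6 quality) 0.0 1.0)` expression in Step 8 assigns to wines with quality below 6.

```python
import math

# Values copied from the REPL output above (the first, unregularized model).
intercept = 9.805476268219566
weights = [-1.6766504448212323, 0.011619041367225583, -0.9683045663615859]


def predict_probability(features):
    """Logistic regression score: sigmoid(intercept + weights . features)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# First sample from Step 8: citric acid, total sulfur dioxide, alcohol.
p = predict_probability([0.0, 34.0, 9.399999618530273])
print(round(p, 2))  # 0.75
```

Note the negative weight on alcohol: the more alcohol, the lower the probability of label 1.0.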

Step 11. Assess predictive power

First, let’s create a function to compute the area under the receiver operating characteristic (ROC) curve and the area under the precision-recall (PR) curve, returned in that order. These are two of the most important indicators of the predictive power of a trained classification model.

user=> (defn metrics [ds model]
         ;; Here we construct an RDD containing [prediction, label]
         ;; tuples and compute classification metrics.
         (let [pl (f/map ds (f/fn [point]
                              (let [y (.label point)
                                    x (.features point)]
                                (ft/tuple (.predict model x) y))))
               metrics (BinaryClassificationMetrics. (.rdd pl))]
           [(.areaUnderROC metrics)
            (.areaUnderPR metrics)]))

Obtain metrics for the training dataset:

 user=> (metrics training model)
 [0.7800174890996763 0.7471259498290513]

And then for the validation dataset:

 user=> (metrics validation model)
 [0.7785138248847928 0.7160113864756078]
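For intuition about the first of the two numbers: the area under the ROC curve can be read as the probability that a randomly chosen sample with label 1.0 gets a higher score than a randomly chosen sample with label 0.0. The plain-Python sketch below computes it that way on made-up toy data. It is not how Spark's BinaryClassificationMetrics is implemented, and the function name is hypothetical, but on the same input it should agree up to how ties and binning are handled.

```python
def area_under_roc(scored):
    """scored: list of (score, label) pairs with labels 1.0 or 0.0."""
    pos = [s for s, y in scored if y == 1.0]
    neg = [s for s, y in scored if y == 0.0]
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for the example: three of the four pairs are ordered correctly.
toy = [(0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.3, 0.0)]
print(area_under_roc(toy))  # 0.75
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so the 0.78 above is decent but leaves room for improvement.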

OK, let’s turn L2 regularization on and rebuild the model:

 user=> (doto (.optimizer classifier)
          (.setRegParam 0.0001))

 user=> (def model
          (doto (.run classifier (.rdd training))
            (.clearThreshold)))

 user=> (metrics training model)
 [0.7794660966515655 0.748073583460006]

 user=> (metrics validation model)
 [0.7807459677419355 0.7200550175610565]

Looks good? I’m sure you can do better.

Step 12. Build predictor

As a final step, let’s define a function that we could use for predicting wine quality:

 user=> (defn is-good? [model citric-acid
                        total-sulfur-dioxide alcohol]
          (let [point (-> (double-array [citric-acid
                                         total-sulfur-dioxide
                                         alcohol])
                          (Vectors/dense))
                ;; The model was trained with label 1.0 meaning
                ;; quality below 6, so this is P("not good")
                prob (.predict model point)]
            (< prob 0.5)))

 user=> (is-good? model 0.0 34.0 9.399999618530273)
 false

Conclusion

We have built a simple logistic regression classifier in Clojure on Apache Spark using Flambo. Some parts of the Flambo API are still missing, but it’s definitely usable. It was not terribly difficult to get it working and I hope you had fun.

Virtualization and Hadoop: Whys, Whats and Hows

Sun, 07 Sep 2014 23:13:45 -0400

The idea of this article came after a series of conversations with Hortonworks’ Adam Muise and Aaron Weibe, reflecting on my own experience and things I’ve heard from some other Hadoop practitioners.

Let me start with an outrageous statement: I believe that in the Big Data context Virtual Machines offer very little or no benefits beyond improved security.

Let’s start with looking at the reasons why people use virtualization at all, then talk about security aspects, virtualization techniques and then I’ll try to summarize this experience along with some practical recommendations.

Virtualization: The Why

1. Resource Utilization

Underused CPU and memory resources are a direct loss. The idea is that you can cram disparate services into the same physical box to have better chances that it’s being used all the time. While utilization used to be a 100%-valid concern with Hadoop-1, nowadays it’s not. YARN addresses this problem, and if you stay within the YARN framework, no additional technology is required. Another aspect of this problem is cross-domain resource utilization. The question is: if I use my Hadoop-2 cluster only one week a month, can I run something other than Hadoop on part of the nodes the rest of the time? YARN is obviously of no help in this situation, so a form of virtualization would be required.

2. Deployment Convenience

Instead of reconfiguring a system each time a new application is installed, it’s possible to just boot different pre-built OS images as virtual machines. When its mission is accomplished, a virtual machine gets discarded and the physical host is ready for running a new app. This is invaluable in a development environment. Consider a situation where you need to try SparkR, then say H2O on Hadoop, then some other weird machine learning library, etc. Many of these things would require installing dependencies. How to manage all this stuff? Even though both Cloudera and Hortonworks offer cluster management tools, those are – putting it mildly – not extremely helpful with OS dependency management. At this moment virtualization is still the best way to achieve the button-push deployment experience.

3. Security

One of the most attractive promises of virtualization is zero-maintenance process isolation. If your processes run in different physical boxes, they cannot poke each other’s eyes, period. It seems that we can reproduce this effect (to a certain extent) by running processes in virtual machines. Some people might say that Hadoop is integrated with Kerberos, so what’s the problem? Kerberos is simply not enough as the apps run on the same hardware, sharing the same memory, file handles and buffers of the same instance of an OS kernel. Besides, if you need multitenancy, only true virtualization is the answer.

Security is Isolation

Let’s gloss over different levels of isolation and their corresponding threat models.

0. No isolation

This assumes a fully cooperative security model. If somebody, even not maliciously, does something bad or just stupid, everybody gets hurt. The model simply assumes that there are no security threats in this world.

1. Imposed isolation

That’s what you see in most operating systems. You have got users, resources and permissions. If you’re lucky, you get Access Control Lists (ACLs). Access to shared resources is limited according to the permission settings. This type of isolation is enough when you assume no serious insider threats. That is, you kinda can hide things from people, and most of them will not violently try to break the locks. Think of lockers in a gym, this is it.

2. Implicit isolation

Imagine reinforced concrete walls between different security compartments. Resources are not shared at all; they are separated by design – ideally, at the physical level. All users are adversaries and as such are considered security threats. This level of isolation is a requirement in public multitenancy systems.

Virtualization

Now let’s briefly look at different virtualization techniques and their suitability for Hadoop.

0. Bare metal

Boot to Hadoop. This level does not exist in reality, I made it up. When I tell people this idea, they typically go like “this might be interesting, but it’s not feasible”. Anyways, with bare-metal Hadoop the goal is not to eliminate the operating system completely, but rather to make the barrier between the hardware and the app as thin as possible. That would mean replacing some system-level services with Hadoop-specific ones and possibly bringing things like HDFS and YARN down to the OS core. The benefit is obvious – uncompromised performance. The drawback is that it can be an insurmountable task. There’s no isolation at this level, which means no multitenancy at all.

1. Operating System

An OS is an abstraction of hardware, so to some extent it’s a valid form of virtualization. At any point in time you can wipe out an OS with all the apps and install it over again on the same hardware. At this level you get some security, a very simplistic user-based isolation model and reasonably good performance. This is what people typically mean when they say “bare metal”. Although possible, isolation at the OS level is absolutely not adequate for public multitenancy. Shared hosting lessons were learned and nobody is doing it anymore.

2. Containers

Solaris Zones, LXC-derivatives, etc. The idea is to isolate process groups at the OS kernel level, effectively creating security compartments. Such a compartment, along with its associated resources, forms a container. Containers are considered a reasonably safe technology, although their fundamental isolation level – and this is important – is the same as permission-based security. This is so because containers, in essence, are a mechanism for restricting access to shared resources. Containers are susceptible to all the same attacks as regular OS users. Although containers are currently not popular in the Hadoop world, things are changing and some Hadoop vendors directly recommend using containers instead of Virtual Machines.

3. Virtual Machines

Instead of fiddling with OS-level security mechanisms, you basically say “OK, let’s create a virtual machine with a virtual CPU and totally separate memory within my host machine, and run the app inside that VM”. Because resources are [mostly] not shared, VMs achieve a much better level of isolation. Of course, it’s more complicated than that and there are at least three different types of VM-based virtualization, but roughly this is the idea. Virtual Machines are what you use when you run Hadoop on AWS or Rackspace.

So, why not use Virtual Machines all the time? There are three major problems with this approach:

VMs are expensive. Expect 10-15% CPU overhead compared to “bare metal” Hadoop.
VMs are unpredictable, especially when running in parallel. The same task would take 30 minutes today and 45 minutes tomorrow simply because of all the funny ways the VM’s OS interacts with the host OS.
If more than one VM is running on the same box, the I/O performance will suffer because of the host OS I/O system overload. And it will hit hard because Hadoop is I/O-bound almost all the time.

Use cases and recommendations

Here is what I’ve seen work myself or heard from other people that it works.

Safe environment, stable load

It’s simple and straight – go bare metal. You will not be sorry about that, seriously. If you need another security compartment, build another cluster.

Safe environment, varying load

Either rely on YARN for resource management or consider using containers, especially if you ever need to switch node roles. Use one container per physical box to not harm the performance.

Unsafe environment

Only Virtual Machines are up to the task. If you cannot trust your users, simply don’t use anything except VMs. Prepare to pay the performance tax.

Containers: What to Use

Unfortunately, neither Hortonworks nor Cloudera currently offers a turnkey container-based solution. Fortunately, it’s not that difficult to build one from scratch. A quick search will bring up a few Hadoop-on-Docker projects such as docker-hoya.

I don’t see why it should not be possible to achieve the same with private cloud solutions such as RedHat’s Openshift or ActiveState’s Stackato.

Conclusion

We looked at some basic virtualization concepts in the Big Data context. Full virtualization technologies should be avoided if possible. Containers play nicer with Hadoop than VMs, although there are not so many out-of-the-box Hadoop containerization solutions.

Installing SBCL 1.1+ on RHEL/CentOS systems

Sat, 09 Nov 2013 18:19:32 -0500

The version of SBCL available on RedHat Enterprise Linux 6.4 (and CentOS) is 1.0.38, which is quite old. If your project requires a newer SBCL, it has to be installed manually.

Although sbcl.org offers some Linux binaries, those are incompatible with RHEL/CentOS 6.4. Compiling from the sources, unfortunately, is the only option.

This tutorial assumes a 64-bit system (x86_64). Compiling SBCL on a 32-bit platform might or might not work – I never tried it.

The first step is to make sure you can compile programs:
```
sudo yum groupinstall "Development Tools"
```

Then enable EPEL, which is necessary for the next step:

wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo rpm -Uvh epel-release-6*.rpm

Now, let’s install the old SBCL. We need it because SBCL’s Lisp compiler is written in Lisp, so it requires a working Lisp compiler to compile itself. This older SBCL binary can be safely removed later.
```
sudo yum install -y sbcl.x86_64
```

Download SBCL source code. At the time of writing this post the latest version was 1.1.13:

wget http://downloads.sourceforge.net/project/sbcl/sbcl/1.1.13/sbcl-1.1.13-source.tar.bz2
tar xfj sbcl-1.1.13-source.tar.bz2
cd sbcl-1.1.13

Compile the sources. Expect to see a lot of diagnostic messages.
```
./make.sh
```
Install the compiled binary. The warnings about missing doc directory can be safely ignored. By default, the binary is installed in /usr/local/bin:
```
sudo sh install.sh
```
Make sure it works. You should see “SBCL 1.1.13” in the response:
```
sbcl --version
```
Remove the old SBCL:
```
sudo yum remove -y sbcl
```

Optional: install Quicklisp. This is not strictly necessary, but having a CPAN-like Lisp package manager around will definitely make your life easier:

wget http://beta.quicklisp.org/quicklisp.lisp
sbcl --load quicklisp.lisp \
     --eval '(quicklisp-quickstart:install)' \
     --eval '(ql:add-to-init-file)' \
     --eval '(quit)'

Enjoy your new SBCL.

MySQL Connector: inherited transactions

Thu, 27 Jun 2013 17:10:00 -0400

This is what I noticed today: if a process opens a MySQL connection and then forks, the child process inherits not just the open connection but also the transaction state. The current transaction becomes shared between the child and the parent. That is, if the child process rolls back, the parent gets rolled back too. Also, as it is the same transaction, a lock set by one process has no effect on the other.

Here is a proof of concept:

"""
Create and populate a database before running this script:

create database mytest;
grant all on mytest.* to ''@'localhost';
flush privileges;
create table foo(a int);
insert into foo (a) values (0);
"""

import time
from multiprocessing import Process
import _mysql

reconnect = False  # change to True to make the child process block (it should)

conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")

def sub():
    if reconnect:
        sub_conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")
    else:
        sub_conn = conn
    print "SUB: start", sub_conn.thread_id()
    print "SUB: do this to get the number of connections -> sudo lsof | grep mysql.sock"
    sub_conn.query('begin')
    sub_conn.query('select * from foo for update')
    if not reconnect:
        print "SUB: NOT BLOCKED, sleeping for 30 sec to hold the connection open"
        time.sleep(30)
    print "SUB: result", sub_conn.use_result().fetch_row()
    sub_conn.query('rollback')
    print "SUB: end"

print "HOST: start", conn.thread_id()
conn.query('begin')
conn.query('select * from foo for update')
print "HOST: result", conn.use_result().fetch_row()

process = Process(target=sub)
print "HOST: start sub"
process.start()
process.join()
print "HOST: sub joined"

conn.query('rollback')
print "HOST: end"

When reconnect is set to False, the parent’s thread id will be the same as the child’s. The reason is that MySQL uses server-side thread ids as connection identifiers. Here’s the mysql_thread_id function (mysql-connector-c-6.1.0-src/libmysql/libmysql.c:1070):

ulong STDCALL mysql_thread_id(MYSQL *mysql)
{
  ......
  return (mysql)->thread_id;
}

And this is how it is set in CLI_MYSQL_REAL_CONNECT (mysql-connector-c-6.1.0-src/sql-common/client.c:3613):

......
server_version_end= end= strend((char*) net->read_pos+1);
mysql->thread_id=uint4korr(end+1);
end+=5;
......

The direct consequence is that child processes, created for example using the multiprocessing module, must close the inherited MySQL connections and then reopen them to avoid surprises.
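One way to make the "reopen in the child" part hard to forget is to hand out connections through a small holder that remembers which process opened them. The sketch below is only an illustration in Python 3: the class name and the zero-argument `connect` factory are hypothetical, a stand-in factory replaces the real `_mysql.connect` call so that it runs without a server, and it covers only the reopening half (what to do with the inherited handle is left to the caller).

```python
import os


class PerProcessConnection(object):
    """Hands out a connection that belongs to the current process only."""

    def __init__(self, connect):
        self._connect = connect   # hypothetical zero-argument factory
        self._owner_pid = None
        self._conn = None

    def get(self):
        pid = os.getpid()
        if self._conn is None or self._owner_pid != pid:
            # Either first use, or we are in a forked child holding the
            # parent's handle: open a fresh connection for this process.
            self._conn = self._connect()
            self._owner_pid = pid
        return self._conn


# Stand-in factory so the sketch runs without a MySQL server.
opened = []

def fake_connect():
    opened.append(object())
    return opened[-1]


holder = PerProcessConnection(fake_connect)
first = holder.get()
same = holder.get()        # same process: the connection is reused
holder._owner_pid = -1     # simulate "we are now in a forked child"
second = holder.get()      # a new connection is opened
```

In the proof of concept above, this would take the place of the `reconnect` flag: both the parent and `sub()` would call `holder.get()` instead of touching a module-level `conn`.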

When I discovered it, I immediately thought about Django management commands splitting workload between children.

Open questions:

Are Celery tasks affected by this? – probably yes.
What happens when two processes sharing a transaction update data at the same time?

Multimethods in Python

Thu, 06 Jun 2013 23:23:00 -0400

So, PEP-443 aka Single-dispatch generic functions has made it into Python. There is a nice writeup of the singledispatch package features by Łukasz Langa.

Although I’m glad that Python is evolving in the right direction, I can’t see how single dispatch alone could be enough. In essence, PEP-443 defines a way of dynamically extending existing types with externally defined generic functions. Which is nice, of course, but too limited.

What is really interesting is multiple dispatch. There are a few packages bringing multimethods to Python; all of them are overcomplicated for my taste.

Here’s my take on it. I will not talk much, better show you the code.

This is the complete implementation:

# multidispatch.py

import operator
from collections import OrderedDict

class DuplicateCondition(Exception): pass

class NoMatchingMethod(Exception): pass

class defmulti(object):
    def __init__(self, predicate):
        self.registry = OrderedDict()
        self.predicate = predicate

    def __call__(self, *args, **kw):
        method = self.dispatch(*args, **kw)
        return method(*args, **kw)

    def dispatch(self, *args, **kw):
        for condition, method in self.registry.items():
            if self.predicate(condition, *args, **kw):
                return method
        return self.notfound

    def notfound(self, *args, **kw):
        raise NoMatchingMethod()

    def when(self, condition):
        if condition in self.registry:
            raise DuplicateCondition()
        def deco(fn):
            self.registry[condition] = fn
            return fn
        return deco

    def default(self, fn):
        self.notfound = fn
        return fn

    @classmethod
    def typedispatch(cls):
        return cls(lambda type, first, *rest, **kw: isinstance(first, type))

And here’s how to use it:

import types
from multidispatch import defmulti, NoMatchingMethod

# Exhibit A: Dispatch on the type of the first parameter.
#            Equivalent to `singledispatch`.

cupcakes = defmulti.typedispatch()

@cupcakes.when(types.StringType)
def str_cupcakes(ingredient):
    return "Delicious {0} cupcakes".format(ingredient)

@cupcakes.when(types.IntType)
def int_cupcakes(number):
    return "Integer cupcakes, anyone? I've got {0} of them.".format(number)

@cupcakes.default
def any_cupcakes(thing):
    return ("You can make cupcakes out of ANYTHING! "
            "Even out of {0}!").format(thing)

print cupcakes("bacon")
print cupcakes(4)
print cupcakes(cupcakes)


# Exhibit B: dispatch on the number of args, no default

@defmulti
def jolly(num, *args):
    return len(args) == num

@jolly.when(1)
def single(a):
    return "For {0}'s a jolly old fellow!".format(a)

@jolly.when(2)
def couple(a, b):
    return "{0} and {1} are such a jolly couple!".format(a, b)

print jolly("Lukasz")
print jolly("Fish", "Chips")
try:
    jolly("Good", "Bad", "Ugly")
except NoMatchingMethod:
    print "Noo! Angel Eyes!"
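Neither exhibit above dispatches on the types of several arguments at once, which is the multiple-dispatch case proper. The predicate-based design makes that a one-liner, as this hypothetical "Exhibit C" tries to show. It is a sketch written for Python 3, it carries a trimmed copy of `defmulti` so that it runs on its own, and the `collide`, `Asteroid` and `Ship` names are invented for the example.

```python
from collections import OrderedDict


class NoMatchingMethod(Exception):
    pass


class defmulti(object):
    """Trimmed copy of the class above, just enough for this example."""

    def __init__(self, predicate):
        self.registry = OrderedDict()
        self.predicate = predicate

    def __call__(self, *args, **kw):
        for condition, method in self.registry.items():
            if self.predicate(condition, *args, **kw):
                return method(*args, **kw)
        raise NoMatchingMethod()

    def when(self, condition):
        def deco(fn):
            self.registry[condition] = fn
            return fn
        return deco


class Asteroid(object):
    pass


class Ship(object):
    pass


# The condition is a tuple of types; it matches when every argument
# is an instance of the type in the same position.
@defmulti
def collide(types, *args):
    return (len(types) == len(args)
            and all(isinstance(a, t) for a, t in zip(args, types)))


@collide.when((Asteroid, Ship))
def asteroid_ship(a, s):
    return "The asteroid smashes the ship"


@collide.when((Ship, Ship))
def ship_ship(a, b):
    return "The ships bump into each other"


print(collide(Asteroid(), Ship()))
print(collide(Ship(), Ship()))
```

Because the registry is ordered, the first matching condition wins, so more specific type tuples should be registered before more general ones.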

I will never CNAME my root domain again. I will never CNAME my root domain again. I will never...

Wed, 17 Apr 2013 23:43:29 -0400

I will never CNAME my root domain again.

NEVER. EVER.

Hello Tumblr

Sat, 13 Apr 2013 23:04:32 -0400

Good bye, Posterous. Rest in peace.

A simple callback chain macro for elisp

Wed, 06 Jun 2012 00:00:00 -0400

The Problem

As usual, it started with a tiny piece of ugly code:

(bd-create-stage datafile-id
                 (lambda (stage-id)
                   (bd-insert-rows stage-id 
                                   [[10 20 30] [40 50 60]]
                                   (lambda (stage-id)
                                     (bd-commit-stage stage-id 
                                                      #'ignore)))))

The snippet above is basically a callback chain. When bd-create-stage finishes its work, it calls the first lambda, which calls bd-insert-rows with the second lambda as its callback argument and so on, until it all stops at the ignore function.

I wanted to rewrite it as something like this:

(=> datafile-id
    (bd-create-stage it next)
    (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
    (bd-commit-stage it next))

Where the it variable would represent the current callback’s parameter and next would refer to the next callback in the chain. As with the -> macro, I wanted explicit anaphoric variables.

The Idea

Each line in the snippet above could be wrapped in a lambda, just like this:

(=> datafile-id
    (lambda (next it)
      (bd-create-stage it next))
    (lambda (next it)
      (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next))
    (lambda (next it)
      (bd-commit-stage it next)))

The macro should then somehow call each function in the list with the next function as the first parameter and the result of the previous function as the second parameter.

The Solution

This function chaining thing looks a lot like a binary function fold:

(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)
    #'identity))

Applying chain to a function list creates a new function taking one parameter and passing it through the whole function list, much like the -> macro does.
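For readers who do not speak Elisp, the same right-to-left fold can be sketched in Python, with `functools.partial` standing in for `apply-partially` and a reduce over the reversed list standing in for `:from-end t`. This is only an illustration: the stage functions are toy stand-ins for the `bd-*` calls, and they record what they see in a list instead of talking to a server.

```python
from functools import partial, reduce


def chain(*fns):
    """Bind each function's first argument (`next`) to the rest of the chain."""
    if not fns:
        return lambda x: x
    # Start from the last function and wrap it leftwards, like :from-end t.
    return reduce(lambda nxt, f: partial(f, nxt), reversed(fns))


log = []

def create_stage(next_, it):
    log.append(("create", it))
    next_(it + 1)          # pretend the new stage id is it + 1

def insert_rows(next_, it):
    log.append(("insert", it))
    next_(it)

def commit_stage(next_, it):
    log.append(("commit", it))
    next_(it)

def ignore(*args):
    """Terminator: accepts anything and does nothing."""
    return None


chain(create_stage, insert_rows, commit_stage, ignore)(41)
print(log)  # [('create', 41), ('insert', 42), ('commit', 42)]
```

Each stage receives the remainder of the chain as its first argument, so calling it plays the role of invoking the callback.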

In fact, this is enough to start working on the macro.

(defmacro => (initial &rest forms)
  `(funcall ,(build-form-chain forms) ,initial))

The build-form-chain function wraps each form into a lambda and then chains them together:

(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))

At the end it adds ignore as a terminator. The terminator is necessary because the last callback’s result is almost always ignored.

The build-form-link’s implementation is trivial:

(defun build-form-link (form)
  `(lambda (next it) ,form))

Done! Here’s the full source for your convenience:

(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)
    #'identity))

(defun build-form-link (form)
  `(lambda (next it) ,form))

(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))

(defmacro => (initial &rest forms)
  `(funcall ,(build-form-chain forms) ,initial))

Now let’s see how the macro expands:

ELISP> (macroexpand
     '(=> datafile-id
          (bd-create-stage it next)
          (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
          (bd-commit-stage it next)))

(funcall (apply (function chain) 
                (list (lambda (next it) 
                        (bd-create-stage it next))
                      (lambda (next it)
                        (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next))
                      (lambda (next it)
                        (bd-commit-stage it next))
                      (function ignore)))
          datafile-id)

Exactly as intended.

This macro covers 95% of my callback chaining needs. For the remaining 5% there is the all-powerful deferred.el library.

Thread operator in Elisp

Wed, 09 May 2012 00:00:00 -0400

TL;DR

ELISP> (-> 1
           (+ 2 it)
           (* 3 it))
9
ELISP> (macroexpand
           '(-> 1
                (+ 2 it)
                (* 3 it)))
(let* ((it 1) (it (+ 2 it)) (it (* 3 it))) it)

Implementation:

(defmacro -> (arg &rest forms)
  `(let* ((it ,arg) .
      ,(mapcar (lambda (form) `(it ,form))
           forms))
     it))

The Long Story

When I see code like this, I frown:

(defun bd-search (api-key query callback)
  (send-request "GET"
                (format "search?%s"
                        (make-query-string `(("api_key" . ,api-key)
                                             ("query" . ,query))))
                callback))

It’s a very simple case, yet the parameter list is already at the fourth level of indentation. When it gets really ugly I usually wrap the whole thing into a let statement and start moving inner parts into variables.

What I have noticed, however, is that constructs like this are almost always sequential by nature. In other words, the output of the innermost statement serves as input for the statement one level up, and so on and so forth. This is the very reason why Clojure has had its thread operator macro since the beginning.

Remembering that, I started literally morphing my bd-search function into something prettier. I came up with this variant:

(defun bd-search (api-key query callback)
  (-> `(("api_key" . ,api-key)
        ("query" . ,query))
      (make-query-string it)
      (format "search?%s" it)
      (send-request "GET" it callback)))

Then I put together the -> macro and that was it.

I decided to make the macro anaphoric instead of implicitly injecting an extra parameter as in Clojure. This allowed me to put the threaded parameter at any place, not just at the beginning or at the end of the parameter list.
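
For readers who do not use Elisp, the same anaphoric idea can be sketched in Python. This is only an illustration: thread is a hypothetical helper name, not part of any library, and each step is a one-argument function whose parameter plays the role of it.

```python
def thread(value, *steps):
    """Pass value through each step in order.

    Each step receives the previous result as its single argument,
    which is the role `it` plays in the Elisp macro.
    """
    it = value
    for step in steps:
        it = step(it)
    return it


# Mirrors the TL;DR example: (-> 1 (+ 2 it) (* 3 it))
result = thread(1,
                lambda it: 2 + it,
                lambda it: 3 * it)
print(result)  # 9
```

Because every step names its argument explicitly, the threaded value can go in any position of the call, which is the same freedom the anaphoric macro gives.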

How much can be done in four hours

Sat, 27 Aug 2011 00:00:00 -0400

Today I had an awesome day at the first OpenDataBC hackathon which took place at Mozilla Labs Vancouver.

Tara Gibbs pitched this wonderful idea of consolidating shelter availability data and displaying it on a few window displays, so that homeless people living in the DTES would not waste their time going from one shelter to another just to find a free spot.

This doesn’t solve all the problems of course, but it does solve a little yet very annoying one.

So… At 11:30 we had nothing but an idea. We discussed possible approaches for a while, then came David Eaves and suggested using Twitter as a message queue service.

At approximately 12:00 we still had nothing but a piece of paper covered with boxes and arrows, then we started coding. Tara did the frontend, I was busy hacking the backend and the Twitter stuff.

Four hours later we had a fully functional, production-ready system - https://github.com/mikeivanov/vanshelter

How it is supposed to work:

Shelters tweet their availability data (they all have internet access)
VanShelter monitors – each of them independently – receive Twitter updates and
Refresh their displays when something changes.

For displays we can use cheap LCD monitors, probably even donated. The software will run on those amazing Raspberry thingies - http://www.raspberrypi.org/, $25 each. This brings the full cost of installing 10 displays down to $250+.

Thank you Tara and David. Also, thank you Jeff and all the people who made this hackathon possible.

Pure Python Paillier Homomorphic Cryptosystem Implementation

Tue, 28 Jun 2011 00:00:00 -0400

What

This is a very basic Paillier Homomorphic Cryptosystem implemented in pure Python.

The idea is, in short, to encrypt two numbers, perform an “add” operation on the ciphertexts, decrypt the result and find it to be the sum of the original plaintext numbers.
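
To make the idea concrete, here is a minimal textbook-style sketch of the scheme. It is not the API of the repository linked below: the function names are made up for illustration, the primes are toy-sized and offer no security, and it assumes Python 3.8+ (for the modular inverse via pow).

```python
import math
import random


def generate_keypair(p, q):
    """Build a toy key pair from two primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                 # a common simple choice of generator
    mu = pow(lam, -1, n)      # with g = n + 1, mu is the inverse of lam mod n
    return (n, g), (lam, mu)


def encrypt(public_key, m):
    n, g = public_key
    n2 = n * n
    while True:               # pick a random r that is coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(public_key, private_key, c):
    n, _ = public_key
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Adding plaintexts corresponds to multiplying ciphertexts mod n^2."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


public, private = generate_keypair(17, 19)
c_sum = add_encrypted(public, encrypt(public, 12), encrypt(public, 30))
print(decrypt(public, private, c_sum))  # 42
```

The sum has to stay below n (323 here), otherwise the decrypted value wraps around modulo n.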

How

The code is loosely based on the thep project and a few ActiveState recipes. The code is pure Python and all objects are serializable.

Where

Here: https://github.com/mikeivanov/paillier

Why

I was bored.

How to mount an NTFS-formatted USB drive in read-write mode on Mac OS X

Thu, 09 Jun 2011 00:00:00 -0400

Actually, it’s very easy. No additional software is required. Just seven easy steps:

Attach your USB drive
Open the Terminal app (Command-Space, then type “Terminal”, hit Enter)

Type or copy/paste these commands:

sudo sh -c "mkdir -p /mnt \
$(mount | grep ntfs | head -n 1 \
   | awk '{ print "&& umount " $3 \
               " && mount_ntfs -o nosuid,rw " $1 " /mnt" }')"

Locate your drive in Finder
Drag/drop files there
Unmount the drive as usual
DONE!

The command breakdown, if you’re interested:

mkdir -p /mnt creates a mount point – a place in the file system where the drive is going to be attached
The mount command without parameters gives you a list of the currently attached drives
grep ntfs filters non-ntfs drives out of the list
head -n 1 grabs the first line (we’re assuming only one ntfs drive can be attached at a time)
The awk part produces two commands:
- umount /Volumes/ – unmounts the drive from its original place
- mount_ntfs -o nosuid,rw /dev/ /mnt – mounts the drive again, but this time in the read-write mode
Now, the sudo sh -c "..." thing allows code execution with superuser privileges.
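
The same pipeline can be sketched in Python to show what string the awk step builds. The build_remount_command helper and the sample mount output are hypothetical, made up for illustration; the sketch only assembles the command text and does not run anything.

```python
def build_remount_command(mount_output, mount_point="/mnt"):
    """Return the shell command the one-liner would assemble,
    or None if no NTFS volume appears in the mount output."""
    for line in mount_output.splitlines():
        if "ntfs" not in line:                  # same filter as `grep ntfs`
            continue
        fields = line.split()                   # awk-style whitespace splitting
        device, volume = fields[0], fields[2]   # awk's $1 and $3
        # Returning on the first match mirrors `head -n 1`
        return ("mkdir -p {mp} && umount {vol} && "
                "mount_ntfs -o nosuid,rw {dev} {mp}").format(
                    mp=mount_point, vol=volume, dev=device)
    return None


# Hypothetical sample of `mount` output, for illustration only
sample = (
    "/dev/disk0s2 on / (hfs, local, journaled)\n"
    "/dev/disk1s1 on /Volumes/BACKUP (ntfs, local, nodev, nosuid, read-only, noowners)\n"
)
print(build_remount_command(sample))
```

Like the awk version, this splits on whitespace, so a volume name containing spaces would be cut short.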

That’s it.

Tail recursion without TCO

Fri, 20 Aug 2010 00:00:00 -0400

Emacs Lisp has no Tail Call Optimization (TCO), and neither do many other Lisp dialects. The lack of TCO is not a big deal, since it’s always possible to transform a tail recursive algorithm into a loop. However, it makes functions look uglier. Here is a very simple method of enabling Clojure-style tail call recursion in Emacs Lisp:

;; A very simple linearized Y combinator.
;; All the state management stuff is encapsulated here.
;; Don't call it directly.
(defun rloop- (body &rest args)
  (let ((res nil))
    (while (progn
             ;; here's the idea: we keep calling body 
             ;; while it returns the recursion marker
             (setq res (apply body args))
             (when (and (consp res)
                        (eq :loop-recur-marker (car res)))
               (progn (setq args (cdr res))
                      t))))
    res))

;; Recursion marker factory
(defun recur (&rest args)
  ;; instead of a real recursive call,
  ;; just signal an intention to make one
  (cons :loop-recur-marker args))

;; The form macro
(defmacro rloop (init body)
  (let ((args (mapcar 'car init)))
    ;; a little courtesy to the macro users
    `(let* ,init
       ;; make a lambda from the body and pass it 
       ;; to the combinator function
       (rloop- (function (lambda (,@args) ,body))
               ,@args))))

Here’s how to use it:

(defun factorial (x)
  ;; this is the recursion entry point
  (rloop ((x   x) 
          (acc 1))
         (if (< x 1)
             acc ;; done, just return the result
           ;; not done, start the whole rloop block again
           (recur (1- x) 
                  (* x acc)))))

ELISP> (factorial 10)
3628800

The funny part is that defun is not necessary. You can have as many sequential inlined rloops as you want. I like this approach: all the state management stuff is out of sight. The function code is almost identical to the underlying algorithm. Another classic example:

(defun fibo (x)
  (rloop ((x    x)
          (curr 0)
          (next 1))
         (if (= x 0)
             curr
           (recur (1- x)
                  next
                  (+ curr next)))))

ELISP> (fibo 10)
55

Nice, eh? Of course, this kind of beauty comes with a price. Here is how the rloop macro expands:

ELISP> (macroexpand '(rloop ((n 0)) (if (> n 5) n (recur (1+ n)))))

(let*
    ((n 0))
  (rloop-
   #'(lambda
       (n)
       (if
           (> n 5)
           n
         (recur
          (1+ n))))
   n))

…which means two extra function calls on each iteration. But realistically, it’s not such a big deal. Clarity of the code is way more important.
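
The technique is not specific to Lisp. Python has no TCO either, so the same marker-and-loop trick can be sketched there. The names below simply mirror the Elisp ones and are not from any library; treat this as an illustration, not a port.

```python
class Recur:
    """Marker object meaning 'call the body again with these arguments'."""
    def __init__(self, *args):
        self.args = args


def recur(*args):
    # Instead of a real recursive call, just signal the intention to make one
    return Recur(*args)


def rloop(body, *args):
    # Keep calling body while it returns the recursion marker
    while True:
        result = body(*args)
        if not isinstance(result, Recur):
            return result
        args = result.args


def factorial(x):
    return rloop(lambda x, acc: acc if x < 1 else recur(x - 1, x * acc),
                 x, 1)


def fibo(x):
    return rloop(lambda x, curr, nxt: curr if x == 0 else recur(x - 1, nxt, curr + nxt),
                 x, 0, 1)


print(factorial(10))  # 3628800
print(fibo(10))       # 55
```

Since the loop never nests calls, factorial(3000) runs fine here even though a directly recursive version would exceed Python's default recursion limit.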

Clouds and entropy

Tue, 06 Apr 2010 00:00:00 -0400

In a post titled A Trusted Cloud Entropy Authority Reuven Cohen writes:

“…maybe there [is] an opportunity to create a trusted cloud authority to provide signed verified and certified entropy. Think of it like a certificate authority (CA) but for chaos. Actually, Amazon Web Service itself could act as this entropy authority via a simple encrypted web service call. I even have a name for it, Simple Entropy Service (SES).”

This is really a good idea. Amazon should have provided such a service a long time ago.

When an SSL connection is being established, a browser and a server perform the Handshake protocol. This protocol involves exchanging random bits between the parties. The important thing is that security depends on how random those bits are. If they are not, the connection is effectively insecure.

In the case of AWS, there is no source of true randomness, therefore SSL on AWS is inherently insecure. Moreover, instances running on the same physical machine can affect each other’s security by draining the shared random pool in the host system.

Further he writes:

“a website called http://random.org [is] a true random number service that generates randomness via atmospheric noise. Looks cool, maybe this may help solve the problem.”

I don’t think that random.org is a good choice.

One problem is the connection to such a service. It should be as secure as the most secure secret handled on your system. If the random bit connection is encrypted with 256-bit AES (and it actually is), this is the highest level of security your system can provide. Plus, there should be a guarantee that no unencrypted random bits are stored anywhere. The same is true for the proposed SES service, too.

Another problem with random.org is… well, randomness is perceptive. What you see as “random” can be quite deterministic to the people who run the random.org service. Even though they might not store anything, their present is your future–just think about relativistic effects. A temptation to tamper with someone’s future can be, you know, very strong.

The overall quality of the service is not known. There is no guarantee it is random at all. A quote from their FAQ: “Q1.2: Is the source code for the generator available? – Not currently, no. Maybe I’ll make it available as open source some day.”

Even though the Whois database indicates the domain name’s registrant is located in France, the SSL certificate owner is not specified. I have no reason to distrust the guy running the service, but I would not entrust my customers’ data to a total stranger, even though he or she seems to be a nice person.

So the conclusion is: while there is no trusted entropy generator on the AWS side, we, the AWS customers, are on our own.

Here is a hint: entropy seeds can be generated in-house and smuggled into instances over a secure channel. Then those seeds could be fed to a cryptographically secure RNG like Isaac to produce actual “random” bits. I think there should be a way of injecting those into the instance’s random pool.
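
A rough sketch of the first and last steps of that hint, under several assumptions: the seed is produced on a trusted in-house machine, the transfer over a secure channel is left out, and the instance runs Linux, where bytes written to /dev/urandom are mixed into the kernel pool without being credited as entropy (crediting needs a privileged ioctl). The function names are made up for illustration, and the intermediate RNG step is not shown.

```python
import secrets


def generate_seed(num_bytes=32):
    """Produce a seed on a trusted in-house machine."""
    return secrets.token_bytes(num_bytes)


def mix_into_pool(seed, pool_path="/dev/urandom"):
    """Write seed bytes to the kernel random device on the instance.

    Assumption: on Linux this stirs the pool but does not raise the
    kernel's entropy estimate.
    """
    with open(pool_path, "wb") as pool:
        pool.write(seed)
    return len(seed)


# In-house:        seed = generate_seed()
# On the instance: mix_into_pool(seed)   # after a secure transfer
```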

Dynamic queries: the Postgres way

Fri, 17 Jul 2009 13:00:00 -0400

Have you ever been in a situation when you needed a query to be generated and executed dynamically as a result of another query?

If yes, read further: there is a simple and elegant way to achieve that.

First, create this function:

CREATE OR REPLACE FUNCTION dselect(varchar)
  RETURNS SETOF record AS $$
    DECLARE rec record;
    BEGIN
      FOR rec IN EXECUTE $1 LOOP
        RETURN NEXT rec;
      END LOOP;
    END
  $$ LANGUAGE 'plpgsql';

Second, umm.. well, that’s it.

Anything you pass as a parameter will be interpreted and executed as an SQL statement.

The function is SELECT-able. That is, you can use it in SELECTs:

SELECT * FROM dselect('SELECT id, name FROM users') AS t(id int, name varchar);

Note the AS t(id int, name varchar) part. Postgres has no idea about what this function returns, so a column definition list should be provided. If not, Postgres will complain:

SELECT * FROM dselect('SELECT id, name FROM users');
ERROR:  a column definition list is required for functions returning "record"

Of course, the column definition list depends on the query and should match the actual query results.

So why is this function needed at all?

Because of situations like this:

CREATE SCHEMA sc_foo;
CREATE TABLE sc_foo.activity (date date, descr text);
INSERT INTO sc_foo.activity (date, descr) VALUES ('2009-08-07', 'went there');
INSERT INTO sc_foo.activity (date, descr) VALUES ('2009-06-04', 'hanging around');

CREATE SCHEMA sc_bar;
CREATE TABLE sc_bar.activity (date date, descr text);
INSERT INTO sc_bar.activity (date, descr) VALUES ('2009-10-11', 'came here');

It allows you to do this:

SELECT 
    nspname 
FROM
    pg_namespace 
WHERE
    nspname LIKE 'sc_%'  AND
    (SELECT date FROM 
        dselect('SELECT max(date) FROM ' || nspname || '.activity') AS t(date date)) < '2009-09-09';

 nspname 
---------
 sc_foo
(1 row)

In one pass this query goes over all the schemas whose names start with sc_, grabs the latest date from each schema’s activity table and matches the result against the provided date.

Of course, this approach is quite inefficient. Each time the function is called, a query is parsed and executed using a separate plan. A simple UNION of two queries would do the same, but… what if there are ten schemas? How about a hundred? I’m actually working with a database containing thousands of them.
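
For comparison, the UNION alternative has to be generated by some client-side script once the schema list grows. A minimal Python sketch of such a generator follows; build_union_query and quote_ident are hypothetical helper names, and the script only builds the SQL text. Schema names are assumed to come from a trusted source such as pg_namespace.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_union_query(schemas, cutoff):
    """Build one statement listing the schemas whose latest activity
    date is earlier than the cutoff."""
    parts = [
        "SELECT '{lit}' AS nspname, max(date) AS last_date "
        "FROM {ident}.activity".format(
            lit=schema.replace("'", "''"), ident=quote_ident(schema))
        for schema in schemas
    ]
    return ("SELECT nspname FROM (\n  "
            + "\n  UNION ALL\n  ".join(parts)
            + "\n) AS t WHERE last_date < '{}';".format(cutoff.replace("'", "''")))


print(build_union_query(["sc_foo", "sc_bar"], "2009-09-09"))
```

The generated statement has to be rebuilt whenever a schema is added or dropped, which is exactly the bookkeeping dselect avoids.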

I use this function when I need to collect statistics or do some db administration tasks; it saves me a lot of time.

Why Tk matters

Wed, 17 Dec 2008 13:00:00 -0500

Tk is probably one of the most overlooked GUI toolkits. It is a nice small toolkit which is really, I mean REALLY simple to use.

Tk, by the way, has something in common with Vi. Oh no, it’s not that it beeps all the time; I meant that it works everywhere. I bet you can find Tk ported to every single platform that has a GUI, and it will work consistently on all of them.

Tk is ubiquitous. Guess what? Almost certainly you have Tk already installed on your computer. Got Python? Go look into /usr/lib/python2.5/lib-tk, it’s right there!

Using it is very easy. Here is a ‘Hello, world’ program in Python (using Tk bindings called Tkinter):

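from Tkinter import *            # the module is named tkinter in Python 3
Label(text='Hi there').pack()
mainloop()                       # enter the event loop so the window stays open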

Expectedly, this program pops up a window with some text inside. Yes, it’s that simple. And this is the area where Tk shines: quick, small GUI tools.

Yet Tk is very powerful. People create big, sophisticated systems using just this toolkit. The most prominent example is definitely AC3D, a 3D modeling program.

Tk, however, has some issues. I think nobody will argue that Tk looks ugly on the Linux platform. While the Windows and Mac versions got the native look, the Linux port looks unattractive. Some work is being done in this department but it’s not quite completed.

Somebody might ask why use Tk if there is wxPython. Well, for the same reason why there are bicycles and airplanes. They are good for different purposes. Tk is way more lightweight and much easier to learn and use than wxPython.

Tk is very well documented. Here is the official Tkinter documentation with links to tutorials and such: http://docs.python.org/library/tkinter.html

If you’re going to learn Tk, it’s worth mentioning that Tk is actually a part of something called Tcl/Tk. Tcl stands for Tool Command Language, a Shell-like scripting language that is easy to learn and fun to use. Even though all you want might be just Tkinter, some knowledge of Tcl will be rewarding.
