(into blog (filter tech? thoughts))

Clojure, numbers, despair

2015-10-29T17:02:00+01:00

Warning: this is a very angry post, but most points in here are valid despite the tone.

Once upon a time, a high-level language was developed. Its beginnings were humble and the developers focused on things that mattered. Numbers were not things that mattered. Numbers were used, but how they were used mattered very little.

So...

The language is Clojure. And numbers in Clojure are this:

;; Auto-promotion is cool
user> (type (inc Integer/MAX_VALUE))
java.lang.Long
;; Except it doesn't always work!
user> (type (inc Long/MAX_VALUE))
ArithmeticException integer overflow  clojure.lang.Numbers.throwIntOverflow (Numbers.java:1501)

or this:

;; Because three ways of parsing a string as number is a Good Thing™
user> (= (Double/parseDouble "1.2") (Double/valueOf "1.2") (read-string "1.2"))
true
;; Because having function return different types based on parameters is an Even Better Thing™
user> (= (type (/ 3 2)) (type (/ 2 2)))
false

Do you think that girl was pretty?

There's no way to put it lightly: I hate Clojure number types. Java keeps leaking into it and no one cares. To add insult to injury, on top of what you have in the JVM, Clojure adds two more ways of representing numbers and then builds a huge pile of logic on top of that. Let's quickly cover what types one may find in a typical Clojure application:

clojure.lang.BigInt
clojure.lang.Ratio
java.lang.Number
java.lang.Integer
java.lang.Long
java.math.BigInteger
java.math.BigDecimal
java.lang.Float
java.lang.Double

Not surprisingly, most of these are just Java types. However, two more types are added: BigInt and Ratio. Both are weird. I'd like to focus a bit on Ratio. A Ratio can be created by integer division, but only when the division cannot produce an integer:

;; Aight
user> (type (/ 1 2))
clojure.lang.Ratio
;; Not really expecting this
user> (type (/ 1 1))
java.lang.Long
;; Yeah, well, WAIT WHAT
user> (type (/ 1N 1M))
java.math.BigDecimal

We can also just call the Ratio constructor (and fail miserably in some cases):

;; Cool
user> (type 1/2)
clojure.lang.Ratio
;; Eh?
user> (clojure.lang.Ratio. 1 1)
ClassCastException java.lang.Long cannot be cast to java.math.BigInteger  user/eval21314 (form-init5235971328632709373.clj:1)
;; Ah!
user> (clojure.lang.Ratio. (biginteger 1) (biginteger 1))
1/1

The proper way is to coerce the parameters to java.math.BigInteger. Why? Historical reasons: clojure.lang.Ratio only accepts java.math.BigInteger because back when it was written Clojure didn't have the clojure.lang.BigInt type, and no one has touched the code since, quite literally¹, forever.

The fun train doesn't stop here. For example, we may want to create a ratio with a denominator of 0. Let's try the usual way:

;; Good
user> 1/0
ArithmeticException Divide by zero  clojure.lang.Numbers.divide (Numbers.java:158)
;; Consistent!
user> (/ 1 0)
ArithmeticException Divide by zero  clojure.lang.Numbers.divide (Numbers.java:158)

Bummer. But then again it might make sense; after all, a Ratio with a denominator of 0 may result in some weird math occurring. But we haven't tried all the available constructors yet, so let's do that:

;; I hate this :/
user> (clojure.lang.Ratio. (biginteger 1) (biginteger 0))
1/0

WAIT WHAT.
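If you have to build ratios from parts, one way to stay out of this hole is to go through `/` instead of the raw constructor. A minimal sketch, where `checked-ratio` is a made-up helper name and not anything Clojure ships:

```clojure
;; Hypothetical helper: builds a ratio via `/`, which rejects a zero
;; denominator and normalizes the result, unlike the Ratio constructor.
(defn checked-ratio
  [n d]
  (/ (biginteger n) (biginteger d)))

(def half (checked-ratio 1 2))   ; a clojure.lang.Ratio
(def one  (checked-ratio 2 2))   ; collapses to an integer type, not a Ratio

;; A zero denominator is reported as an ArithmeticException.
(def zero-denominator-rejected?
  (try (checked-ratio 1 0)
       false
       (catch ArithmeticException _ true)))
```

Note that the result type still depends on the arguments, which is exactly the kind of thing this post is complaining about.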

Combining java.math.BigInteger with clojure.lang.Ratio is even more fun, especially when it comes to corner cases:

;; Alright makes sense
user> (.denominator (* 7919/7920 (/ 1 Long/MAX_VALUE)))
73049106531889824391440
user> (class (.denominator (* 7919/7920 (/ 1 Long/MAX_VALUE))))
java.math.BigInteger
;; WAIT BUT WHY
user> (/ 7919 (* 7919/7920 (/ 1 Long/MAX_VALUE)))
73049106531889824391440N
user> (class (/ 7919 (* 7919/7920 (/ 1 Long/MAX_VALUE))))
clojure.lang.BigInt

The result type differs even though logically you performed the exact same computation. And don't forget that those types do not always cooperate nicely, so you introduce more corner cases. Oh boy!

Who wears Cheetah?

Leaking abstractions is not cool. Clojure tries to present leaking abstractions as a feature. This is doubly not cool.

Number type promotion is not cool if there's no clear way to demote a type. It's doubly not cool in Clojure, because there's no clear documentation on how and when promotion works. Existing documentation is lacking at best.

Consistency is great. Clojure is not great at consistency though, and sometimes it feels like the "principle of least astonishment" is being proactively broken by Clojure's design in the numbers domain.

Here's an incomplete and perhaps redundant list of things that I find annoying, surprising or outright stupid in Clojure:

Arithmetic overflows everywhere! Multiplying java.lang.Integer values will never cause an overflow, however java.lang.Long values will fail to be auto-promoted. To be fair, this behavior is right there in the docstring for *, but then again, who reads the docstring for multiplication? There's also *', +' and -', all of which auto-promote the result, but what are the chances you ever even knew about them?
clojure.lang.Ratio uses java.math.BigInteger and not clojure.lang.BigInt for numerator and denominator. Why? Because when Ratio was created (back in 2010) clojure.lang.BigInt simply didn't exist and when it was finally created, Ratio was not updated to reflect the change. Bonus points for figuring out why clojure.lang.BigInt was created in the first place.
Floats and doubles are... Well, the same floats and doubles as in Java. There's no attempt to hide them away. So, things like infinity and NaN are there, but they're not really supported by Clojure. How does one check if the number is NaN or Infinity in Clojure? You use java.lang.Float or java.lang.Double classes for that, specifically static methods such as isNaN, isFinite, etc. Hardly a portable solution.
Documentation is bad. Like, terribad. We're talking about a language with 8 years of development history, with strong backing from commercial companies, with successful commercial and open-source products written in the language, and yet we see very little focus on documenting things, even essential things, numbers being one of them.
Unsigned math is not supported. There's nothing in Java, thus there's nothing in Clojure. Make of it what you will.
Bit operations do not belong in the core namespace. It's clutter, most programs don't need them. More than that, they're simply broken. More on that in a few bits.
*unchecked-math* is one big can of worms and can quite literally screw up your library performance or even behavior when someone using your library sets said dynamic var.
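To make two of the points above concrete, here is a small sketch, assuming a stock Clojure REPL. The primed operators promote to clojure.lang.BigInt instead of throwing, and NaN and infinity checks go straight through java.lang.Double. The names `not-a-number?` and `infinity?` are made up for this example.

```clojure
;; Primed operators promote to clojure.lang.BigInt instead of throwing.
(def above-max   (+' Long/MAX_VALUE 1))
(def below-min   (-' Long/MIN_VALUE 1))
(def incremented (inc' Long/MAX_VALUE))

;; NaN and infinity checks lean on java.lang.Double static methods,
;; so they only make sense on the JVM.
(defn not-a-number? [x] (Double/isNaN (double x)))
(defn infinity?     [x] (Double/isInfinite (double x)))
```

Both helpers coerce their argument with `double` first, so they are only meaningful for values that fit in a double.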

So, bit operations. Clojure really lets you down here, and you as a programmer have to be extremely careful to avoid the common pitfalls. Most recovering C and C++ addicts would say that a bit shift to the left by one bit is equal to multiplying by 2. Clojure says NO. Unless you multiply by a different kind of two:

user> (bit-shift-left Long/MAX_VALUE 1)
-2
user> (* 2 Long/MAX_VALUE)
ArithmeticException integer overflow  clojure.lang.Numbers.throwIntOverflow (Numbers.java:1501)
user> (* 2N Long/MAX_VALUE)
18446744073709551614N

The first behavior is the result of lacking a proper unsigned, modular number type. The exception in the second is the result of "protecting" the users from overflowing, instead of promoting the type (as expected). And then the third one does the right thing. Or maybe a wrong thing, but in any case you would expect all 3 expressions to do the same thing. What's worse is that there are plenty of similar examples. Predictability is important, people!
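For the record, each behavior does have a matching partner, you just have to know where to look. A sketch, assuming a stock Clojure REPL on the JVM: `unchecked-multiply` wraps around the same way the 64-bit shift does, and `*'` agrees with an arbitrary-precision shift done on java.math.BigInteger.

```clojure
;; Wrap-around pair: 64-bit shift and unchecked multiplication.
(def shifted (bit-shift-left Long/MAX_VALUE 1))
(def wrapped (unchecked-multiply 2 Long/MAX_VALUE))

;; Promoting pair: primed multiplication and a BigInteger shift.
(def promoted  (*' 2 Long/MAX_VALUE))
(def big-shift (bigint (.shiftLeft (biginteger Long/MAX_VALUE) 1)))
```

So the "shift equals multiply by 2" intuition holds within each pair, but never with plain `*`.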

I wanna look tan

Even though no-one asked me, I'll try to imagine a better world of Clojure math. First off, the number types. There should be only two ways to represent numbers in Clojure: integers and reals. Integers should be signed and unbounded. Integer division always produces reals WITHOUT EXCEPTIONS. Integers can be promoted to reals, but reals can never be demoted to integers. Reals can follow the same approach as java.math.BigDecimal, Python's decimal module or MPFR. Now, obviously you'll immediately find a problem with this approach, namely that you need a proper context for all decimal operations. I say, default to large, and I mean LARGE precision. As in, precision that doesn't even make sense anymore, like 2^20. Let people control the precision through a context. Leave only basic math operations in the core namespace: addition, subtraction, multiplication and division. Define those operations clearly, make sure that division always produces reals and truncate where need be.
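A rough sketch of what "division always produces reals" could look like with today's tools, using java.math.BigDecimal and the existing `with-precision` macro as the context. `real-div` is a made-up name, the precision of 50 digits is arbitrary, and the sketch assumes integer or decimal arguments.

```clojure
;; Hypothetical "real" division: always returns a BigDecimal, rounded to
;; 50 significant digits. The precision is an arbitrary choice for this sketch.
(defn real-div
  [a b]
  (with-precision 50
    (/ (bigdec a) (bigdec b))))

(def exact   (real-div 2 2))   ; BigDecimal, not a Long
(def half    (real-div 1 2))   ; BigDecimal, not a Ratio
(def rounded (real-div 1 3))   ; rounded instead of throwing
```

Without the surrounding `with-precision`, dividing 1M by 3M throws an ArithmeticException because the decimal expansion does not terminate, which is why some kind of context is unavoidable here.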

Then, introduce a math namespace. Put modular math operations in math.modular. math.binary for binary math and bit shifting. math.real containing functions and macros that help with handling the real context, rounding, etc. math.ratio for, well, Ratio. math.float for IEEE 754-2008 floating point numbers. math.platform.jvm and math.platform.js for exposing platform-specific numbers.

But most importantly, write documentation. Everything has to be documented extensively and clearly, without exceptions. Great code and great design are only half the battle; clear documentation is the other.

As far as the negative impact of said change goes, I can only think of performance. But only a small minority of Clojure users type-hint everything or use Zachary Tellman's primitive-math. Everyone else? They get to enjoy a math setup that has very questionable decisions baked into it without worrying much about the performance.

Let me take a selfie

I tend to complain. A lot. The math in Clojure is just one of my complaint targets. However, it's a valid target. The math is neglected in Clojure: I see no attention being paid to it by core developers, there's no organized effort to make it better, there have been zero calls to the community to ask for improvement ideas. And what's worse is that this math is completely ingrained into the Clojure core namespace and you can't replace it easily. There's no way to fix the numbers in Clojure from the outside.

I could take this on, spend plenty of time writing the proposal, pushing it to Clojure core, writing the code afterwards and pushing for a solution, but what are the chances that it's ever going to get accepted? The upfront cost of this work is tremendous and there's very little chance that such work would ever end up in Clojure core.

¹ Where "literally forever" is used in terms of Internet age.

I've befriended `use-package` and `init.el`

2015-09-28T12:00:00+02:00

I could never relate to the concept of Emacs bankruptcy. It's an unnecessary waste of time and resources; not only do you throw out years of carefully written code, but you also have to go through everything and write it from scratch. I've tried it a few times, but those few attempts ended with me arriving at the state I was previously at.

The problem with Emacs bankruptcy is that it's a concept that comes from the deep past, when every Emacs configuration was a collection of random bits of ELisp code, combined together without any organizational effort. Since there was no notion of packages or libraries besides the built-in ones, Emacs users would often resort to collecting snippets of ELisp code in their .emacs file without trying to organize them in any way. I'd call it James Wood's method of Emacs configuration. The end result of this process was always the same -- an unmanageable, multi-thousand-line configuration with varying levels of code quality and no consistency whatsoever.

Nowadays you have the tools to organize your configuration without paying the ultimate price of clutter. package.el has introduced proper package management to Emacs and el-get has shown that package management doesn't have to be complicated. I've used a combination of the latter and some hand-crafted stuff for the past few years, then switched to a "modular setup" with package.el for package management. Recently, however, I've switched to use-package, and that significantly simplified my configuration.

Previous versions of my configuration have gone through several waves of major improvements, but I was never quite satisfied with it. This is where use-package comes into play. I've managed to untangle some mess into pretty manageable code. I still don't understand how the underlying mechanisms work, what lazy loading does and how in hell :init and :config differ from each other, but I can navigate my config in a much simpler way. Obviously, the current setup can be improved, since it mostly consists of modules which don't have to be modules. In fact, all of my configuration can now neatly fit into a single init.el and it will hardly take more than a few hundred lines of code. This is a dramatic improvement over what I had before: it enables very quick refactoring and gives a much cleaner overview of the setup of each bit of ELisp I have installed.

As a bonus, use-package can be used to decrease the load time of Emacs. This doesn't really concern me, since most of the time I run only one copy of Emacs for prolonged periods of time. Case in point, M-x emacs-uptime shows 4 days, 18 hours, 30 minutes, 31 seconds and that's simply because I've rebooted my laptop recently.

Overall I'm really happy with use-package, however things could still be improved. For example, as a Helm user I often find myself setting up various bits of Helm functionality all over the place. Hardly an ideal situation, but it seems that most use-package users deal with it by simply ignoring the issue. I'm not a big fan of ignoring issues like these, but I guess there's no way to avoid this. :)

Implementing BK-tree in Clojure

2015-07-31T17:00:00+02:00

As part of returning to blogging, or at least attempting to, I've decided to implement a data structure that can be quite useful yet can be considered somewhat obscure: the BK-tree. Named after its inventors, W. A. Burkhard and R. M. Keller, it can provide a significant performance boost to applications in need of fuzzy string search.

What is BK-tree?

The idea underlying BK-tree is relatively simple, however just by skimming the depths of the Internet I've seen a lot of people struggling to understand it.

The most common use case for BK-tree is spell-checking applications, however most spell-checkers out there use simpler and, to be frank, often more efficient approaches. Nevertheless, we'd

Before we go further, we need to understand how spell-checking is commonly used. I'd split spell-checking in modern applications into 3 parts:

Check whether the word exists in the dictionary
Find possible fixes for the misspelled word
Order suggestions based on some sort of heuristic

While BK-tree can be used to address the first point, it's often faster and easier to use a set/bag implementation to achieve close to constant time for checks. The second point can be tackled in a variety of ways, but for the purpose of the article we'll consider only the simplest ones:

naively in linear time by scanning all words in the dictionary and calculating edit distance
by generating all possible edits of the word being checked, as described at http://norvig.com/spell-correct.html
BK-tree (duh)
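To give a feel for the second approach on that list, here is a minimal sketch of generating every string one edit away from a word. It follows the idea from Norvig's article but is my own Clojure rendering, assumes a lowercase ASCII alphabet, and uses the made-up name `edits1`.

```clojure
;; Assumption: words only use lowercase ASCII letters.
(def alphabet "abcdefghijklmnopqrstuvwxyz")

(defn edits1
  "Set of all strings one delete, transpose, replace or insert away from `word`."
  [word]
  (let [splits (for [i (range (inc (count word)))]
                 [(subs word 0 i) (subs word i)])]
    (set
     (concat
      ;; deletions: drop the first character of the right half
      (for [[a b] splits :when (seq b)]
        (str a (subs b 1)))
      ;; transpositions: swap the first two characters of the right half
      (for [[a b] splits :when (> (count b) 1)]
        (str a (nth b 1) (nth b 0) (subs b 2)))
      ;; replacements: substitute the first character of the right half
      (for [[a b] splits :when (seq b), c alphabet]
        (str a c (subs b 1)))
      ;; insertions: put a letter between the two halves
      (for [[a b] splits, c alphabet]
        (str a c b))))))

(def cat-edits (edits1 "cat"))
```

Candidates are then whichever of these strings exist in the dictionary set. Applying `edits1` to each result again gives distance 2, and the set grows very quickly from there.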

With this in mind, let's write some code.

Implementing BK-tree

Since BK-tree is an index, we have two separate steps -- building said index and querying it. I've decided to illustrate it instead of trying to use an algorithm description, to make it easier. Well, at least I found it easier to understand by using drawings.

Building the tree

Building BK-tree is pretty easy; our algorithm is basically this:

Calculate the Levenshtein distance between the root of the tree and the word being indexed
If there is a free slot for the resulting distance, insert a new node with the word being indexed
Otherwise, take the node under said slot and continue recursively

Let's imagine that our dictionary consists of words squirrel, square, shard and circus. We start our process by setting squirrel at the root of the tree:

As this is the root, no calculation had to be done. Next in queue is the word shard:

Since the Levenshtein distance between squirrel and shard is 6, we check whether there's any child node with the same distance, and since there's not, we simply insert shard as a child node of squirrel. Next up, square:

Again, we calculate the Levenshtein distance (which is 3 this time) and happily insert a new child node. Now to the next word, circus:

Oops, this time we can't just insert the child node, since there's already a child node with a Levenshtein distance of 6. Following the algorithm, we simply try to insert the word under the child node with the same Levenshtein distance, which happens to be the shard node. In the end we end up with something that looks like this:

That should show the basic principle of building the tree.
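The walkthrough above can be sketched in a few lines of Clojure. This is a minimal illustration rather than the implementation that was benchmarked for this post: the nested-map node layout (`:word`, `:children`) and the names `levenshtein` and `bk-insert` are assumptions made for the example.

```clojure
(defn levenshtein
  "Dynamic-programming Levenshtein distance between strings `a` and `b`."
  [a b]
  (peek
   (reduce
    (fn [prev [i ca]]
      ;; Build the row for the first (inc i) characters of `a`.
      (reduce
       (fn [row [j cb]]
         (conj row (min (inc (peek row))                         ; insertion
                        (inc (nth prev (inc j)))                 ; deletion
                        (+ (nth prev j) (if (= ca cb) 0 1)))))   ; substitution
       [(inc i)]
       (map-indexed vector b)))
    (vec (range (inc (count b))))
    (map-indexed vector a))))

(defn bk-insert
  "Inserts `word` into `tree`; a node is {:word w, :children {distance node}}."
  [tree word]
  (if (nil? tree)
    {:word word :children {}}
    (let [d (levenshtein (:word tree) word)]
      (if (zero? d)
        tree                                              ; already indexed
        ;; Free slot: a new node is created. Taken slot: recurse into it.
        (update-in tree [:children d] bk-insert word)))))

;; Same insertion order as in the walkthrough.
(def tree (reduce bk-insert nil ["squirrel" "shard" "square" "circus"]))
```

With this layout, `(get-in tree [:children 6 :word])` is the shard node and circus ends up one level below it.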

Querying the tree

It's a bad, bad, bad, bad math!

Querying the tree is straightforward, and its efficiency is guaranteed by a simple property of Levenshtein distance, namely the triangle inequality. Warning, what follows is math which may be a bit off. So, given the root of the tree $R$, children $C_i$, where $i = Levenshtein(R, C_i)$ and the query word $W$ we can say that

$$Levenshtein(W, R) + Levenshtein(W, C_i) \ge Levenshtein(R, C_i)$$

and, by the same property,

$$Levenshtein(W, R) - Levenshtein(W, C_i) \le Levenshtein(R, C_i)$$

Since $i = Levenshtein(R, C_i)$ and $Levenshtein(W, C_i) \le d_{max}$ where $d_{max}$ is our maximum tolerable distance between query word and dictionary words, we end with a simple filter, where we only need to recursively query children nodes $C_i, \forall i \in [Levenshtein(W, R) - d_{max}, Levenshtein(W, R) + d_{max}]$.

FIXME Make an illustration based on circles

Illustrating the query

Given this overly complicated and most definitely flawed explanation of the math behind BK-tree, let's illustrate how this works in the real world. For the purpose of illustration I've created a pretty simple tree.

We'd like to find all words in the tree that are within a Levenshtein distance of at most 1 of the word "brine". The first step is to match the root, which in our tree is "trine".

Since the Levenshtein distance between those words is 1, we've got our first match. Now the trick that makes BK-tree efficient comes into play: of all the children of this node we only need to query those within the distance range $[0..2]$, that is $[1 - 1, 1 + 1]$. While in this simple example it seems like a small improvement, in real-world scenarios this would effectively cut off a significant number of branches.

We then continue using the same approach recursively, until we either reach a leaf node or all edges fail the triangle inequality.

Voila! Basically, searching in BK-tree is a BFS with a small filter on top of it.
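Here is that "BFS with a small filter" as a runnable sketch. It is not the code that was benchmarked for this post. The distance and insert functions are included so the snippet runs on its own, the node layout (`:word`, `:children`) and all function names are assumptions, and the word list is a made-up dictionary rather than the tree from the illustrations.

```clojure
(defn levenshtein [a b]
  (peek
   (reduce (fn [prev [i ca]]
             (reduce (fn [row [j cb]]
                       (conj row (min (inc (peek row))
                                      (inc (nth prev (inc j)))
                                      (+ (nth prev j) (if (= ca cb) 0 1)))))
                     [(inc i)]
                     (map-indexed vector b)))
           (vec (range (inc (count b))))
           (map-indexed vector a))))

(defn bk-insert [tree word]
  (if (nil? tree)
    {:word word :children {}}
    (let [d (levenshtein (:word tree) word)]
      (if (zero? d)
        tree
        (update-in tree [:children d] bk-insert word)))))

(defn bk-query
  "Set of words in `tree` within `d-max` edits of `word`, found breadth-first."
  [tree word d-max]
  (loop [queue (into clojure.lang.PersistentQueue/EMPTY (when tree [tree]))
         found #{}]
    (if-let [node (peek queue)]
      (let [d (levenshtein word (:word node))
            ;; Triangle inequality filter: only edges labelled within
            ;; [d - d-max, d + d-max] can lead to a match.
            candidates (for [[edge child] (:children node)
                             :when (<= (- d d-max) edge (+ d d-max))]
                         child)]
        (recur (into (pop queue) candidates)
               (if (<= d d-max) (conj found (:word node)) found)))
      found)))

;; Hypothetical dictionary with "trine" at the root.
(def words ["trine" "brine" "train" "brain" "twine" "shrine"])
(def tree (reduce bk-insert nil words))
(def matches (bk-query tree "brine" 1))
```

A quick sanity check is to compare `bk-query` against a plain linear scan of `words` with the same `d-max`; both must return the same set, the tree just visits fewer nodes.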

Performance

Theory and pictures are all nice and dandy, however I couldn't just write about BK-tree without actually implementing it. It's a pretty bare implementation, with very little thought put into optimizations and such, thus its performance cannot be directly compared to other existing approaches without embarrassing the author of this article. This of course didn't stop me from doing exactly that. For the performance test I've used a standard dictionary found in OS X, under /usr/share/dict/words. For no reason other than to entertain myself I've decided to look into the word length distribution of said dictionary, so I proudly present to you the graphical representation of my research:

Benchmark

Armed with that information, I went through and benchmarked my implementation of BK-tree and compared it to a linear dictionary check and Peter Norvig's approach. The methodology was pretty simple: each benchmark was executed with the same list of misspelled words with only the maximum edit distance changing. Below you can see the results of said test:

One glaring omission is Norvig's spellchecker for an edit distance of 3. This is due to the fact that for longer words the number of all misspellings at an edit distance of 3 is humongous (we are talking on the order of tens of millions).

The actual results show that while BK-tree is in fact faster than the brute-force approach, it loses significantly to Norvig's approach for edit distances of 1 and 2. The latter is also helped by the extremely well-tuned implementation of clojure.lang.PersistentHashSet. Obviously, proper real-world spell checkers use more sophisticated solutions, such as suffix tries, phonetic algorithms, bitap, etc.

Conclusion

BK-tree is an interesting application of a seemingly unrelated concept (the triangle inequality) to a well-established problem space. Implementing BK-tree is easy, however I'm still on the fence regarding the practicality of using BK-trees. Some smart people before me considered using BK-tree in Lucene but ultimately decided that it is simply not worth it. I still liked writing it. :)