The Julia Blog

Technical preview: Native GPU programming with CUDAnative.jl

2017-03-14T00:00:00+00:00

After 2 years of slow but steady development, we would like to announce the first preview release of native GPU programming capabilities for Julia. You can now write your CUDA kernels in Julia, albeit with some restrictions, making it possible to use Julia’s high-level language features to write high-performance GPU code.

The programming support we’re demonstrating here today consists of the low-level building blocks, sitting at the same abstraction level as CUDA C. You should be interested if you know (or want to learn) how to program a parallel accelerator like a GPU, while dealing with tricky performance characteristics and communication semantics.

You can easily add GPU support to your Julia installation (see below for detailed instructions) by installing CUDAnative.jl. This package is built on top of experimental interfaces to the Julia compiler, and the purpose-built LLVM.jl and CUDAdrv.jl packages to compile and execute code. All this functionality is brand-new and thoroughly untested, so we need your help and feedback in order to improve and finalize the interfaces before Julia 1.0.

How to get started

CUDAnative.jl is tightly integrated with the Julia compiler and the underlying LLVM framework, which complicates version and platform compatibility. For this preview we only support Julia 0.6 built from source, on Linux or macOS. Luckily, installing Julia from source is well documented in the main repository’s README. Most of the time it boils down to the following commands:

$ git clone https://github.com/JuliaLang/julia.git
$ cd julia
$ git checkout v0.6.0-pre.alpha  # or any later tag
$ make                           # add -jN for N parallel jobs
$ ./julia

From the Julia REPL, installing CUDAnative.jl and its dependencies is just a matter of using the package manager. Do note that you need to be using the NVIDIA binary driver, and have the CUDA toolkit installed.

> Pkg.add("CUDAnative")

# Optional: test the package
> Pkg.test("CUDAnative")

At this point, you can start writing kernels and execute them on the GPU using CUDAnative’s @cuda macro. Be sure to check out the examples, or continue reading for a more textual introduction.

Hello World Vector addition

A typical small demo of GPU programming capabilities (think of it as the GPU Hello World) is to perform a vector addition. The snippet below does exactly that using Julia and CUDAnative.jl:

using CUDAdrv, CUDAnative

function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type,
    #                  and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]

    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

# generate some data
len = 512
a = rand(Int, len)
b = rand(Int, len)

# allocate & upload on the GPU
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)

# execute and fetch results
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl
c = Array(d_c)

using Base.Test
@test c == a + b

destroy(ctx)

How does it work?

Most of this example does not rely on CUDAnative.jl, but uses functionality from CUDAdrv.jl. This package makes it possible to interact with CUDA hardware through user-friendly wrappers of CUDA’s driver API. For example, it provides an array type CuArray, takes care of memory management, integrates with Julia’s garbage collector, implements @elapsed using GPU events, etc. It is meant to form a strong foundation for all interactions with the CUDA driver, and does not require a bleeding-edge version of Julia. A slightly higher-level alternative is available under CUDArt.jl, building on the CUDA runtime API instead, but hasn’t been integrated with CUDAnative.jl yet.

Meanwhile, CUDAnative.jl takes care of all things related to native GPU programming. The most significant part of that is generating GPU code, which essentially consists of three phases:

interfacing with Julia: repurpose the compiler to emit GPU-compatible LLVM IR (no calls to CPU libraries, simplified exceptions, …)
interfacing with LLVM (using LLVM.jl): optimize the IR, and compile to PTX
interfacing with CUDA (using CUDAdrv.jl): compile PTX to SASS, and upload it to the GPU

All this is hidden behind the call to @cuda, which generates code to compile our kernel upon first use. Every subsequent invocation will re-use that code, convert and upload arguments¹, and finally launch the kernel. And much like we’re used to on the CPU, you can introspect this code using runtime reflection:

# CUDAnative.jl provides alternatives to the @code_ macros,
# looking past @cuda and converting argument types
julia> CUDAnative.@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
define void @julia_kernel_vadd_68711 {
    [LLVM IR]
}

# ... but you can also invoke without @cuda
julia> @code_ptx kernel_vadd(d_a, d_b, d_c)
.visible .func julia_kernel_vadd_68729(...) {
    [PTX CODE]
}

# or manually specify types (this is error prone!)
julia> code_sass(kernel_vadd, (CuDeviceArray{Float32,2},CuDeviceArray{Float32,2},CuDeviceArray{Float32,2}))
code for sm_20
        Function : julia_kernel_vadd_68481
[SASS CODE]

Another important part of CUDAnative.jl is its set of intrinsics: special functions and macros that provide functionality that is hard or impossible to express using normal functions. For example, the {thread,block,grid}{Idx,Dim} functions provide access to the size and index of each level of work. Local shared memory can be created using the @cuStaticSharedMem and @cuDynamicSharedMem macros, while @cuprintf can be used to display a formatted string from within a kernel function. Many math functions are also available; these should be used instead of similar functions in the standard library.
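To make the index arithmetic from kernel_vadd concrete, here is a CPU-only sketch that needs neither a GPU nor CUDAnative.jl. The helper global_index and the launch shape (4 blocks of 128 threads) are assumptions for illustration; on the GPU, the three numbers would come from blockIdx().x, blockDim().x and threadIdx().x.

```julia
# Hypothetical helper (not part of CUDAnative.jl): the 1-based global index
# that kernel_vadd computes from the block index, block size and thread index.
global_index(block, blockdim, thread) = (block - 1) * blockdim + thread

# Emulate a launch of 4 blocks with 128 threads each, visiting the threads
# block by block. Every thread should end up with a distinct element index.
indices = [global_index(b, 128, t) for b in 1:4 for t in 1:128]
```

Each of the 512 emulated threads gets its own index between 1 and 512, which is why the kernel can write c[i] without any synchronization.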

What is missing?

As I’ve already hinted, we don’t support all features of the Julia language yet. For example, it is currently impossible to call any function from the Julia C runtime library (a.k.a. libjulia.so). This makes dynamic allocations impossible, cripples exceptions, etc. As a result, large parts of the standard library are unavailable for use on the GPU. We will obviously try to improve this in the future, but for now the compiler will error when it encounters unsupported language features:

julia> nope() = println(42)
nope (generic function with 1 method)

julia> @cuda (1,1) nope()
ERROR: error compiling nope: emit_builtin_call for REPL[1]:1 requires the runtime language feature, which is disabled

Another big gap is documentation. Most of CUDAnative.jl mimics or copies CUDA C, while CUDAdrv.jl wraps the CUDA driver API. But we haven’t documented what parts of those APIs are covered, or how the abstractions behave, so you’ll need to refer to the examples and tests in the CUDAnative and CUDAdrv repositories.

Another example: parallel reduction

For a more complex example, let’s have a look at a parallel reduction for Kepler-generation GPUs. This is a typical well-optimized GPU implementation, using fast communication primitives at each level of execution. For example, threads within a warp execute together on a SIMD-like core, and can share data through each other’s registers. At the block level, threads are allocated on the same core but don’t necessarily execute together, which means they need to communicate through core local memory. Another level up, only the GPU’s DRAM memory is a viable communication medium.

The Julia version of this algorithm looks pretty similar to the CUDA original: this is as intended, because CUDAnative.jl is a counterpart to CUDA C. The new version is much more generic though, specializing both on the reduction operator and value type. And just like we’re used to with regular Julia code, the @cuda macro will just-in-time compile and dispatch to the correct specialization based on the argument types.
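The genericity mentioned above can be illustrated without a GPU. The following is a CPU-only sketch of a pairwise (tree-shaped) reduction, not the Kepler-optimized kernel from the post; the name tree_reduce is hypothetical, and the sketch assumes a non-empty input and an operator that is associative and commutative, since elements are combined out of order.

```julia
# CPU-only sketch: repeatedly fold the upper half of the buffer onto the
# lower half, the same access pattern a block of GPU threads would use.
function tree_reduce(op, values::AbstractVector)
    buf = copy(values)               # do not modify the caller's data
    n = length(buf)
    while n > 1
        half = (n + 1) ÷ 2           # size of the lower half, rounded up
        for i in 1:(n - half)
            buf[i] = op(buf[i], buf[i + half])
        end
        n = half
    end
    return buf[1]
end

total   = tree_reduce(+, collect(1:100))
largest = tree_reduce(max, [3, 9, 2])
product = tree_reduce(*, [1.5, 2.0])
```

Because Julia compiles a separate specialization for each combination of operator and element type, the same source serves +, max and * on integers or floating-point values.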

So how does it perform? Turns out, pretty good! The chart below compares the performance of both the CUDAnative.jl and CUDA C implementations², using BenchmarkTools.jl to measure the execution time. The small constant overhead (note the logarithmic scale) is due to a deficiency in argument passing, and will be fixed.

We also aim to be compatible with tools from the CUDA toolkit. For example, you can profile Julia kernels using the NVIDIA Visual Profiler, or use cuda-memcheck to detect out-of-bound accesses³:

$ cuda-memcheck julia examples/oob.jl
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
=========     at 0x00000148 in examples/oob.jl:14:julia_memset_66041
=========     by thread (10,0,0) in block (0,0,0)
=========     Address 0x1020b000028 is out of bounds

Full debug information is not available yet, so cuda-gdb and friends will not work very well.

Try it out!

If you have experience with GPUs or CUDA development, or maintain a package which could benefit from GPU acceleration, please have a look or try out CUDAnative.jl! We need all the feedback we can get, in order to prioritize development and finalize the infrastructure before Julia hits 1.0.

I want to help

Even better! There are many ways to contribute, for example by looking at the issue trackers of the individual packages making up this support: CUDAnative.jl, CUDAdrv.jl, and LLVM.jl.

Each of those packages is also in perpetual need of better API coverage, and of documentation to cover and explain what has already been implemented.

Thanks

This work would not have been possible without Viral Shah and Alan Edelman arranging my stay at MIT. I’d like to thank everybody at Julia Central and around, it has been a blast! I’m also grateful to Bjorn De Sutter, and IWT Vlaanderen, for supporting my time at Ghent University.

¹ See the README for a note on how expensive this currently is.
² The measurements include memory transfer time, which is why a CPU implementation was not included (realistically, data would be kept on the GPU as long as possible, making it an unfair comparison).
³ Bounds-checked arrays are not supported yet, due to a bug in the NVIDIA PTX compiler.

More Dots: Syntactic Loop Fusion in Julia

2017-01-21T00:00:00+00:00

After a lengthy design process and preliminary foundations in Julia 0.5, Julia 0.6 includes new facilities for writing code in the “vectorized” style (familiar from Matlab, Numpy, R, etcetera) while avoiding the overhead that this style of programming usually imposes: multiple vectorized operations can now be “fused” into a single loop, without allocating any extraneous temporary arrays.

This is best illustrated with an example (in which we get order-of-magnitude savings in memory and time, as demonstrated below). Suppose we have a function f(x) = 3x^2 + 5x + 2 that evaluates a polynomial, and we want to evaluate f(2x^2 + 6x^3 - sqrt(x)) for a whole array X, storing the result in-place in X. You can now do:

X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))

or, equivalently:

@. X = f(2X^2 + 6X^3 - sqrt(X))

and the whole computation will be fused into a single loop, operating in-place, with performance comparable to the hand-written “devectorized” loop:

for i in eachindex(X)
    x = X[i]
    X[i] = f(2x^2 + 6x^3 - sqrt(x))
end

(Of course, like all Julia code, to get good performance both of these snippets should be executed inside some function, not in global scope.) To see the details of a variety of performance experiments with this example code, follow along in the attached IJulia/Jupyter notebook: we find that the X .= ... code has performance within 10% of the hand-devectorized loop (which itself is within 5% of the speed of C code), except for very small arrays where there is a modest overhead (e.g. 50% overhead for a length-1 array X).
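As a small, self-contained sketch of the comparison described above, both variants can be wrapped in functions and checked against each other. The function names fused! and devectorized! and the sample data are made up for illustration; the benchmark notebook mentioned above remains the place for timings.

```julia
# Scalar polynomial from the text.
f(x) = 3x^2 + 5x + 2

# Fused, in-place version: a single loop, no temporary arrays.
function fused!(X)
    X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
    return X
end

# Hand-written "devectorized" loop computing the same thing.
function devectorized!(X)
    for i in eachindex(X)
        x = X[i]
        X[i] = f(2x^2 + 6x^3 - sqrt(x))
    end
    return X
end

X0 = [0.0, 0.25, 1.0, 4.0]          # arbitrary non-negative sample data
A = fused!(copy(X0))
B = devectorized!(copy(X0))
```

Both functions overwrite their argument, so each one is given its own copy of the sample data.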

In this blog post, we delve into some of the details of this new development, in order to answer questions that often arise when this feature is presented:

What is the overhead of traditional “vectorized” code? Isn’t vectorized code supposed to be fast already?
Why are all these dots necessary? Couldn’t Julia just optimize “ordinary” vector code?
Is this something unique to Julia, or can other languages do the same thing?

The short answers are:

Ordinary vectorized code is fast, but not as fast as a hand-written loop (assuming loops are efficiently compiled, as in Julia) because each vectorized operation generates a new temporary array and executes a separate loop, leading to a lot of overhead when multiple vectorized operations are combined.
The dots allow Julia to recognize the “vectorized” nature of the operations at a syntactic level (before e.g. the type of x is known), and hence the loop fusion is a syntactic guarantee, not a compiler optimization that may or may not occur for carefully written code. They also allow the caller to “vectorize” any function, rather than relying on the function author. (The @. macro lets you add dots to every operation in an expression, improving readability for expressions with lots of dots.)
Other languages have implemented loop fusion for vectorized operations, but typically for only a small set of types and operations/functions that are known to the compiler or vectorization library. Julia’s ability to do it generically, even for user-defined array types and functions/operators, is unusual and relies in part on the syntax choices above and on its ability to efficiently compile higher-order functions.

Finally, we’ll review why, since these dots actually correspond to broadcast operations, they can combine arrays and scalars, or combine containers of different shapes and kinds, and we’ll compare broadcast and map. Moreover, Julia 0.6 expanded and clarified the notion of a “scalar” for broadcast, so that it is not limited to numerical operations: you can use broadcast and fusing “dot calls” for many other tasks (e.g. string processing).

Isn’t vectorized code already fast?

To explore this question (also discussed in this blog post), let’s begin by rewriting the code above in a more traditional vectorized style, without so many dots, such as you might use in Julia 0.4 or in other languages (most famously Matlab, Python/Numpy, or R).

X = f(2 * X.^2 + 6 * X.^3 - sqrt(X))

Of course, this assumes that the functions sqrt and f are “vectorized,” i.e. that they accept vector arguments X and compute the function elementwise. This is true of sqrt in Julia 0.4, but it means that we have to rewrite our function f from above in a vectorized style, as e.g. f(x) = 3x.^2 + 5x + 2 (changing f to use the elementwise operator .^ because vector^scalar is not defined). (If we were using Julia 0.4 and cared a lot about efficiency, we might have instead used the @vectorize_1arg f Number macro to generate more specialized elementwise code.)

Which functions are vectorized?

As an aside, this example illustrates an annoyance with the vectorized style: you have to decide in advance whether a given function f(x) will also be applied elementwise to arrays, and either write it specially or define a corresponding elementwise method.

(Our function f accepts any x type, and in Matlab or R there is no distinction between a scalar and a 1-element array. However, even if a function accepts an array argument x, that doesn’t mean it will work elementwise for an array unless you write the function with that in mind.)

For library functions like sqrt, this means that the library authors have to guess at which functions should have vectorized methods, and users have to guess at what vaguely defined subset of library functions work for vectors.

One possible solution is to vectorize every function automatically. The language Chapel does this: every function f(x...) implicitly defines a function f(x::Array...) that evaluates map(f, x...) (Chamberlain et al, 2011). This could be implemented in Julia as well via function-call overloading (Bezanson, 2015: chapter 4), but we chose to go in a different direction.

Instead, starting in Julia 0.5, any function f(x) can be applied elementwise to an array X with the “dot call” syntax f.(X). Thus, the caller decides which functions to vectorize. In Julia 0.6, “traditionally” vectorized library functions like sqrt(X) are deprecated in favor of sqrt.(X), and dot operators like x .+ y are now equivalent to dot calls (+).(x,y). Unlike Chapel’s implicit vectorization, Julia’s f.(x...) syntax corresponds to broadcast(f, x...) rather than map, allowing you to combine arrays and scalars or arrays of different shapes/dimensions. (broadcast and map are compared at the end of this post; each has its own unique capabilities.) From the standpoint of the programmer, this adds a certain amount of clarity because it indicates explicitly when an elementwise operation is occurring. From the standpoint of the compiler, dot-call syntax enables the syntactic loop fusion optimization described in more detail below, which we think is an overwhelming advantage of this style.
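A minimal sketch of the dot-call syntax, using a made-up scalar function g that knows nothing about arrays:

```julia
# An ordinary scalar function; nothing about it is "vectorized".
g(x) = x^2 + 1

X = [1.0, 2.0, 3.0]

a = g.(X)                  # dot call: apply g to each element of X
b = broadcast(g, X)        # the broadcast call that g.(X) corresponds to
c = X .+ 10                # dot operator combining an array and a scalar
d = (+).(X, 10)            # the same operation written as a dot call
```

The caller, not the author of g, decided that g should be applied elementwise here.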

Why vectorized code is fast

In many dynamically typed languages popular for interactive technical computing (Matlab, Python, R, etc.), vectorization is seen as a key (often the key) performance optimization. It allows your code to take advantage of highly optimized (perhaps even parallelized) library routines for basic operations like scalar*array or sqrt(array). Those functions, in turn, are usually implemented in a low-level language like C or Fortran. Writing your own “devectorized” loops, in contrast, is too slow, unless you are willing to drop down to a low-level language yourself, because the semantics of those dynamic languages make it hard to compile them to efficient code in general.

Thanks to Julia’s design, a properly written devectorized loop in Julia has performance within a few percent of C or Fortran, so there is no necessity of vectorizing; this is explicitly demonstrated for the devectorized loop above in the accompanying notebook. However, vectorization may still be convenient for some problems. And vectorized operations like scalar*array or sqrt(array) are still fast in Julia (calling optimized library routines, albeit ones written in Julia itself).

Furthermore, if your problem involves a function that does not have a pre-written, highly optimized, vectorized library routine in Julia, and that does not decompose easily into existing vectorized building blocks like scalar*array, then you can write your own building block without dropping down to a low-level language. (If all the performance-critical code you will ever need already existed in the form of optimized library routines, programming would be a lot easier!)

Why vectorized code is not as fast as it could be

There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions. This is illustrated nicely by the traditional vectorized version of our code above:

f(x) = 3x.^2 + 5x + 2
X = f(2 * X.^2 + 6 * X.^3 - sqrt(X))

Each of the operations like X.^2 and 5*X individually calls highly optimized functions, but their combination leaves a lot of performance on the table when X is an array. To see that, you have to realize that this code is equivalent to:

tmp1 = X.^2
tmp2 = 2*tmp1
tmp3 = X.^3
tmp4 = 6 * tmp3
tmp5 = tmp2 + tmp4
tmp6 = sqrt(X)
tmp7 = tmp5 - tmp6
X = f(tmp7)

That is, each of these vectorized operations allocates a separate temporary array, and is a separate library call with its own inner loop. Both of these properties are bad for performance.

First, eight arrays are allocated (tmp1 through tmp7, plus another for the result of f(tmp7)), and another four are allocated internally by f(tmp7) for the same reasons, for 12 arrays in all. The resulting X = ... expression does not update X in-place, but rather makes the variable X “point” to a new array returned by f(tmp7), discarding the old array X. All of these extra arrays are eventually deallocated by Julia’s garbage collector, but in the meantime they waste a lot of memory (an order of magnitude!).

By itself, allocating/freeing memory can take a significant amount of time compared to our other computations. This is especially true if X is very small so that the allocation overhead matters (in our benchmark notebook, we pay a 10× cost for a 6-element array and a 6× cost for a 36-element array), or if X is very large so that the memory churn matters (see below for numbers). Furthermore, you pay a different performance price from the fact that you have 12 loops (12 passes over memory) compared to one, in part because of the loss of memory locality.
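The allocation cost can be observed directly with @allocated. The sketch below is an approximation rather than the benchmark from the notebook: the names unfused and fused! are made up, each temporary is produced by its own single-operation broadcast to mimic traditional vectorized evaluation, and f itself is applied with one dot call, so only eight of the twelve arrays counted above are reproduced.

```julia
f(x) = 3x^2 + 5x + 2

# One broadcast per line: every line allocates its own temporary array.
function unfused(X)
    tmp1 = X .^ 2
    tmp2 = 2 .* tmp1
    tmp3 = X .^ 3
    tmp4 = 6 .* tmp3
    tmp5 = tmp2 .+ tmp4
    tmp6 = sqrt.(X)
    tmp7 = tmp5 .- tmp6
    return f.(tmp7)
end

# Fused version: a single loop writing into a preallocated output array.
function fused!(Y, X)
    Y .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
    return Y
end

X = rand(10^5)
Y = similar(X)
unfused(X); fused!(Y, X)             # warm up so compilation is not measured

bytes_unfused = @allocated unfused(X)
bytes_fused   = @allocated fused!(Y, X)
```

The exact byte counts depend on the Julia version, but the unfused variant has to allocate at least eight arrays the size of X, while the fused one reuses Y.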

In particular, reading or writing data in main computer memory (RAM) is much slower than performing scalar arithmetic operations like + and *, so computer hardware stores recently used data in a cache: a small amount of much faster memory. Furthermore, there is a hierarchy of smaller, faster caches, culminating in the register memory of the CPU itself. This means that, for good performance, you should load each datum x = X[i] once (so that it goes into cache, or into a register for small enough types), and then perform several operations like f(2x^2 + 6x^3 - sqrt(x)) on x while you still have fast access to it, before loading the next datum; this is called “temporal locality.” The traditional vectorized code discards this potential locality: each X[i] is loaded once for a single small operation like 2*X[i], writing the result out to a temporary array before immediately reading the next X[i].

In typical performance benchmarks (see notebook), therefore, the traditional vectorized code X = f(2 * X.^2 + 6 * X.^3 - sqrt(X)) turns out to be about 10× slower than the devectorized or fused-vectorized versions of the same code at the beginning of this article for X = zeros(10^6). Even if we pre-allocate all of the temporary arrays (completely eliminating the allocation cost), our benchmarks show that performing a separate loop for each operation still is about 4–5× slower for a million-element X. This is not unique to Julia! Vectorized code is suboptimal in any language unless the language’s compiler can automatically fuse all of these loops (even ones that appear inside function calls), which rarely happens for the reasons described below.

Why does Julia need dots to fuse the loops?

You might look at an expression like 2 * X.^2 + 6 * X.^3 - sqrt(X) and think that it is “obvious” that it could be combined into a single loop over X. Why can’t Julia’s compiler be smart enough to recognize this?

The thing that you need to realize is that, in Julia, there is nothing particularly special about + or sqrt — they are arbitrary functions and could do anything. X + Y could send an email or open a plotting window, for all the compiler knows. To figure out that it could fuse e.g. 2*X + Y into a single loop, allocating a single array for the result, the compiler would need to:

Deduce the types of X and Y and figure out what * and + functions to call. (Julia already does this, at least when type inference succeeds.)
Look inside of those functions, realize that they are elementwise loops over X and Y, and realize that they are pure (e.g. 2*X has no side-effects like modifying Y).
Analyze expressions like X[i] (which are calls to a function getindex(X, i) that is “just another function” to the compiler), to detect that they are memory reads/writes and determine what data dependencies they imply (e.g. to figure out that 2*X allocates a temporary array that can be eliminated).

The second and third steps pose an enormous challenge: looking at an arbitrary function and “understanding” it at this level turns out to be a very hard problem for a computer. If fusion is viewed as a compiler optimization, then the compiler is only free to fuse if it can prove that fusion won’t change the results, which requires the detection of purity and other data-dependency analyses.
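To illustrate why the compiler cannot simply assume that + is a pure elementwise operation, here is a sketch with a made-up wrapper type whose + method has a side effect. The names Noisy and CALL_LOG are hypothetical and exist only for this example.

```julia
# Hypothetical array wrapper whose + is not pure: it records every call.
struct Noisy
    data::Vector{Float64}
end

const CALL_LOG = String[]

function Base.:+(a::Noisy, b::Noisy)
    push!(CALL_LOG, "added two Noisy values")   # side effect visible to the caller
    return Noisy(a.data .+ b.data)
end

s = Noisy([1.0, 2.0]) + Noisy([3.0, 4.0])
```

Fusing or reordering calls to this + as an optimization could change how often the log is written, which is the kind of observable difference the compiler would have to rule out first.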

In contrast, when the Julia compiler sees an expression like 2 .* X .+ Y, it knows just from the syntax (the “spelling”) that these are elementwise operations, and Julia guarantees that the code will always fuse into a single loop, freeing it from the need to prove purity. This is what we term syntactic loop fusion, described in more detail below.

A halfway solution: Loop fusion for a few operations/types

One approach that may occur to you, and which has been implemented in a variety of languages (e.g. Kennedy & McKinley, 1993; Lewis et al., 1998; Chakravarty & Keller, 2001; Manjikian & Abdelrahman, 2002; Sarkar, 2010; Prasad et al., 2011; Wu et al., 2012), is to only perform loop fusion for a few “built-in” types and operations that the compiler can be designed to recognize. The same idea has also been implemented as libraries (e.g. template libraries in C++: Veldhuizen, 1995) or domain-specific languages (DSLs) as extensions of existing languages; in Python, for example, loop fusion for a small set of vector operations and array/scalar types can be found in the Theano, PyOP2, and Numba software. Likewise, in Julia we could potentially build the compiler to recognize that it can fuse *, +, .^, and similar operations for the built-in Array type (and perhaps only for a few scalar types). This has, in fact, already been implemented in Julia as a macro-based DSL (you add @vec or @acc decorators to a vectorized expression) in the Devectorize and ParallelAccelerator packages.

However, even though Julia will certainly implement additional compiler optimizations as time passes, one of the key principles of Julia’s design is to “build in” as little as possible into the core language, implementing as much as possible of Julia in Julia itself (Bezanson, 2015). Put another way, the same optimizations should be just as available to user-defined types and functions as to the “built-in” functions of Julia’s standard library (Base). You should be able to define your own array types (e.g. via the StaticArrays package or PETSc arrays) and functions (such as our f above), and have them be capable of fusing vectorized operations.

Moreover, a difficulty with fancy compiler optimizations is that, as a programmer, you are often unsure whether they will occur. You have to learn to avoid coding styles that accidentally prevent the compiler from recognizing the fusion opportunity (e.g. because you called a “non-built-in” function), you need to learn to use additional compiler-diagnostic tools to identify which optimizations are taking place, and you need to continually check these diagnostics as new versions of the compiler and language are released. With vectorized code, losing a fusion optimization may mean wasting an order of magnitude in memory and time, so you have to worry much more than you would for a typical compiler micro-optimization.

Syntactic loop fusion in Julia

In contrast, Julia’s approach is quite simple and general: the caller indicates, by adding dots, which function calls and operators are intended to be applied elementwise (specifically, as broadcast calls). The compiler notices these dots at parse time (or technically at “lowering” time, but in any case long before it knows the types of the variables etc.), and transforms them into calls to broadcast. Moreover, it guarantees that nested “dot calls” will always be fused into a single broadcast call, i.e. a single loop.

Put another way, f.(g.(x .+ 1)) is treated by Julia as merely syntactic sugar for broadcast(x -> f(g(x + 1)), x). An assignment y .= f.(g.(x .+ 1)) is treated as sugar for the in-place operation broadcast!(x -> f(g(x + 1)), y, x). The compiler need not prove that this produces the same result as a corresponding non-fused operation, because the fusion is a mandatory transformation defined as part of the language, rather than an optional optimization.
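The equivalence can be checked directly. In this sketch f and g are arbitrary made-up scalar functions; the explicit broadcast and broadcast! calls spell out what the dotted expressions are described as being sugar for.

```julia
f(x) = 2x
g(x) = x^2
x = [1.0, 2.0, 3.0]

# Nested dot calls ...
fused = f.(g.(x .+ 1))
# ... and the single broadcast call with one combined anonymous function.
explicit = broadcast(v -> f(g(v + 1)), x)

# In-place variants, each writing into a preallocated output array.
y = similar(x)
y .= f.(g.(x .+ 1))
z = similar(x)
broadcast!(v -> f(g(v + 1)), z, x)
```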

Arbitrary user-defined functions f(x) work with this mechanism, as do arbitrary user-defined collection types for x, as long as you define broadcast methods for your collection. (The default broadcast already works for any subtype of AbstractArray.)

Moreover, dotted operators are now available for not just the familiar ASCII operators like .+, but for any character that Julia parses as a binary operator. This includes a wide array of Unicode symbols like ⊗, ∪, and ⨳, most of which are undefined by default. So, for example, if you define ⊗(x,y) = kron(x,y) for the Kronecker product, you can immediately do [A, B] .⊗ [C, D] to compute the “elementwise” operation [A ⊗ C, B ⊗ D], because x .⊗ y is sugar for broadcast(⊗, x, y).
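A sketch of a dotted user-defined operator. To stay within Base functions, ⊗ is defined here as ordinary matrix multiplication rather than the Kronecker product from the text; that substitution, and the small matrices, are assumptions made for illustration.

```julia
# Hypothetical definition: ⊗ stands in for any binary function on matrices.
⊗(x, y) = x * y

A = [1 2; 3 4]
B = [0 1; 1 0]
C = [1 0; 0 1]
D = [2 0; 0 2]

# The containers are vectors of matrices; ⊗ is applied to matching elements.
dotted   = [A, B] .⊗ [C, D]
explicit = broadcast(⊗, [A, B], [C, D])
```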

Note that “side-by-side” binary operations are actually equivalent to nested calls, and hence they fuse for dotted operations. For example 3 .* x .+ y is equivalent to (+).((*).(3, x), y), and hence it fuses into broadcast((x,y) -> 3*x+y, x, y). Note also that the fusion stops only when a “non-dot” call is encountered, e.g. sqrt.(abs.(sort!(x.^2))) fuses the sqrt and abs operations into a single loop, but x.^2 occurs in a separate loop (producing a temporary array) because of the intervening non-dot function call sort!(...).
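Both statements can be spelled out in a few lines. The sample vectors are arbitrary; the last three lines show one way to write by hand what the sort! example computes, with the temporary array made explicit.

```julia
x = [-3.0, 1.0, -2.0]
y = [10.0, 20.0, 30.0]

# "Side-by-side" dotted operators are nested dot calls, hence a single loop.
side_by_side = 3 .* x .+ y
nested       = (+).((*).(3, x), y)
single_loop  = broadcast((a, b) -> 3 * a + b, x, y)

# Fusion stops at the non-dot call sort!: x.^2 is materialized first.
result = sqrt.(abs.(sort!(x.^2)))
tmp = x.^2                                        # separate loop, temporary array
sort!(tmp)                                        # ordinary, non-dot call
spelled_out = broadcast(t -> sqrt(abs(t)), tmp)   # sqrt and abs fused
```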

Should other languages implement syntactic loop fusion?

Obviously, Julia’s approach of syntactic loop fusion relies partly on the fact that, as a young language, we are still relatively free to redefine core syntactic elements like f.(x) and x .+ y. But suppose you were willing to add this or similar syntax to an existing language, like Python or Go, or create a DSL add-on on top of those languages as discussed above; would you then be able to implement the same fusing semantics efficiently?

There is a catch: 2 .* x .+ x .^ 2 is sugar for broadcast(x -> 2*x + x^2, x) in Julia, but for this to be fast we need the higher-order function broadcast to be very fast as well. First, this requires that arbitrary user-defined scalar (non-vectorized!) functions like x -> 2*x + x^2 be compiled to fast code, which is often a challenge in high-level dynamic languages. Second, it ideally requires that higher-order functions like broadcast be able to inline the function argument x -> 2*x + x^2, and this facility is even less common. (It wasn’t available in Julia until version 0.5.)

Also, the ability of broadcast to combine arrays and scalars or arrays of different shapes (see below) turns out to be subtle to implement efficiently without losing generality. The current implementation relies on a metaprogramming feature that Julia provides called generated functions in order to get compile-time specialization on the number and types of the arguments. An alternative solution to the inlining and specialization issues would be to build the broadcast function into the compiler, but then broadcast might no longer be overloadable for user-defined containers, and users could not write their own higher-order functions with similar functionality.

The importance of higher-order inlining

In particular, consider a naive implementation of broadcast (only for one-argument functions):

function naivebroadcast(f, x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = f(x[i])
    end
    return y
end

In Julia, as in other languages, f must be some kind of function pointer or function object. Normally, a call f(x[i]) to a function object f must figure out where the actual machine code for the function is (in Julia, this involves dispatching on the type of x[i]; in object-oriented languages, it might involve dispatching on the type of f), push the argument x[i] etcetera to f via a register and/or a call stack, jump to the machine instructions to execute them, jump back to the caller naivebroadcast, and extract the return value. That is, calling a function argument f involves some overhead beyond the cost of the computations inside f.

If f(x) is expensive enough, then the overhead of the function call may be negligible, but for a cheap function like f(x) = 2*x + x^2 the overhead can be very significant: with Julia 0.4, the overhead is roughly a factor of two compared to a hand-written loop that evaluates z = x[i]; y[i] = 2*z + z^2. Since lots of vectorized code in practice evaluates relatively cheap functions like this, it would be a big problem for a generic vectorization method based on broadcast. (The function call also inhibits SIMD optimization by the compiler, which prevents computations in f(x) from being applied simultaneously to several x[i] elements.)

However, in Julia 0.5, every function has its own type. And, in Julia, whenever you call a function like naivebroadcast(f, x), a specialized version of naivebroadcast is compiled for typeof(f) and typeof(x). Since the compiled code is specific to typeof(f), i.e. to the specific function being passed, the Julia compiler is free to inline f(x) into the generated code if it wants to, and all of the function-call overhead can disappear.
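The point about function types can be checked directly. The following is a minimal sketch that repeats naivebroadcast from above so it runs on its own; the helper names poly and triple are illustrative and not part of the original text.

```julia
# naivebroadcast as defined in the text, repeated so the block is self-contained.
function naivebroadcast(f, x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = f(x[i])
    end
    return y
end

poly(x)   = 2*x + x^2   # the cheap function used as an example in the text
triple(x) = 3*x         # a second, purely illustrative function

# Each function has its own type, so naivebroadcast(poly, x) and
# naivebroadcast(triple, x) are compiled as separate specializations.
@show typeof(poly) == typeof(triple)        # false
@show naivebroadcast(poly, [1.0, 2.0, 3.0])
```

Because the two calls are compiled separately, the compiler knows exactly which function is being called inside each loop and may choose to inline it.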

Julia is neither the first nor the only language that can inline higher-order functions; e.g. it is reportedly possible in Haskell and in the Kotlin language. Nevertheless, it seems to be a rare feature, especially in imperative languages. Fast higher-order functions are a key ingredient of Julia that allows a function like broadcast to be written in Julia itself (and hence be extensible to user-defined containers), rather than having to be built in to the compiler (and probably limited to “built-in” container types).

Not just elementwise math: The power of broadcast

Dot calls correspond to the broadcast function in Julia. Broadcasting is a powerful concept (also found, for example, in NumPy and Matlab) in which the concept of “elementwise” operations is extended to encompass combining arrays of different shapes or arrays and scalars. Moreover, this is not limited to arrays of numbers, and starting in Julia 0.6 a “scalar” in a broadcast context can be an object of an arbitrary type.

Combining containers of different shapes

You may have noticed that the examples above included expressions like 6 .* X.^3 that combine an array (X) with scalars (6 and 3). Conceptually, in X.^3 the scalar 3 is “expanded” (or “broadcasted”) to match the size of X, as if it became an array [3,3,3,...], before performing ^ elementwise. In practice of course, no array of 3s is ever explicitly constructed.

More generally, if you combine two arrays of different dimensions or shapes, any “singleton” (length 1) or missing dimension of one array is “broadcasted” across that dimension of the other array. For example, A .+ [1,2,3] adds [1,2,3] to each column of a 3×n matrix A. Another typical example is to combine a row vector (or a 1×n array) and a column vector to make a matrix (2d array):

julia> [1 2 3] .+ [10,20,30]
3×3 Array{Int64,2}:
 11  12  13
 21  22  23
 31  32  33

(If x is a row vector, and y is a column vector, then A = x .+ y makes a matrix with A[i,j] = x[j] + y[i].)
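As a quick check of the indexing rule, here is a small sketch that verifies A[i,j] = x[j] + y[i] entry by entry and also shows a length-3 vector being added to every column of a 3×n matrix. The variable names M, cols and rule_holds are illustrative.

```julia
x = [1 2 3]          # 1×3 row
y = [10, 20, 30]     # 3-element column vector
A = x .+ y           # 3×3 matrix

# Check the rule A[i,j] = x[j] + y[i] for every entry.
rule_holds = all(A[i, j] == x[j] + y[i] for i in 1:3, j in 1:3)

# A vector of length 3 is added to every column of a 3×n matrix.
M = zeros(Int, 3, 4)
cols = M .+ [1, 2, 3]
```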

Although other languages have also implemented similar broadcast semantics, Julia is unusual in being able to support such operations for arbitrary user-defined functions and types with performance comparable to hand-written C loops, even though its broadcast function is written entirely in Julia with no special support from the compiler. This not only requires efficient compilation and higher-order inlining as mentioned above, but also the ability to efficiently iterate over arrays of arbitrary dimensionalities determined at compile-time for each caller.

Not just numbers

Although the examples above were all for numeric computations, in fact neither the broadcast function nor the dot-call fusion syntax is limited to numeric data. For example:

julia> s = ["The QUICK Brown", "fox     jumped", "over the LAZY dog."];

julia> s .= replace.(lowercase.(s), r"\s+", "-")
3-element Array{String,1}:
 "the-quick-brown"   
 "fox-jumped"        
 "over-the-lazy-dog."

Here, we take an array s of strings, we convert each string to lower case, and then we replace any sequence of whitespace (the regular expression r"\s+") with a hyphen "-". Since these two dot calls are nested, they are fused into a single loop over s and are written in-place in s thanks to the s .= ... (temporary strings are allocated in this process, but not temporary arrays of strings). Furthermore, notice that the arguments r"\s+" and "-" are treated as “scalars” and are “broadcasted” to every element of s.

The general rule (starting in Julia 0.6) is that, in broadcast, arguments of any type are treated as scalars by default. The main exceptions are arrays (subtypes of AbstractArray) and tuples, which are treated as containers and are iterated over. (If you define your own container type that is not a subtype of AbstractArray, you can tell broadcast to treat it as a container to be iterated over by overloading Base.Broadcast.containertype and a couple of other functions.)
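To make the scalar-versus-container distinction concrete, here is a minimal sketch with illustrative data. It only uses a string argument (treated as a scalar) together with an array and a tuple (treated as containers); it does not try to demonstrate how other types are handled.

```julia
words = ["apple", "avocado", "banana"]

# The string "a" is treated as a scalar and reused for every element of words.
starts_with_a = startswith.(words, "a")

# Arrays and tuples are containers: they are matched up element by element.
sums = [1, 2, 3] .+ (10, 20, 30)
```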

Not just containers

Since the dot-call syntax corresponds to broadcast, and broadcast is just an ordinary Julia function to which you can add your own methods (as opposed to some kind of privileged compiler built-in), many possibilities open up. Not only can you extend fusing dot calls to your own data structures (e.g. DistributedArrays extends broadcast to work for arrays distributed across multiple computers), but you can apply the same syntax to data types that are hardly “containers” at all.

For example, the ApproxFun package defines an object called a Fun that represents a numerical approximation of a user-defined function (essentially, a Fun is a fancy polynomial fit). By defining broadcast methods for Fun, you can now take an f::Fun and do, for example, exp.(f.^2 .+ f.^3) and it will translate to broadcast(y -> exp(y^2 + y^3), f). This broadcast call, in turn, will evaluate exp(y^2 + y^3) for y = f(x) at cleverly selected x points, construct a polynomial fit, and return a new Fun object representing the fit. (Conceptually, this replaces elementwise operations on containers with pointwise operations on functions.) In contrast, ApproxFun also allows you to compute the same result using exp(f^2 + f^3), but in this case it will go through the fitting process four times (constructing four Fun objects), once for each operation like f^2, and is more than an order of magnitude slower due to this lack of fusion.

broadcast vs. map

Finally, it is instructive to compare broadcast with map, since map also applies a function elementwise to one or more arrays. (The dot-call syntax invokes broadcast, not map.) The basic differences are:

broadcast handles only containers with “shapes” M×N×⋯ (i.e., a size and dimensionality), whereas map handles “shapeless” containers like Set or iterators of unknown length like eachline(file).
map requires all arguments to have the same length (and hence cannot combine arrays and scalars) and (for array containers) the same shape, whereas broadcast does not (it can “expand” smaller containers to match larger ones).
map treats all arguments as containers by default, and in particular expects its arguments to act as iterators. In contrast, broadcast treats its arguments as scalars by default (i.e., as 0-dimensional arrays of one element), except for a few types like AbstractArray and Tuple that are explicitly declared to be broadcast containers.

Sometimes, of course, their behavior coincides, e.g. map(sqrt, [1,2,3]) and sqrt.([1,2,3]) give the same result. But, in general, neither map nor broadcast generalizes the other — each has things they can do that the other cannot.
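The following sketch illustrates one case where the two coincide and one thing each can do on its own. The names shifted, odds and doubled are illustrative; the filtered generator stands in for an iterator whose length is not known in advance.

```julia
# On same-shape arrays the two agree.
same = map(sqrt, [1, 4, 9]) == sqrt.([1, 4, 9])

# broadcast expands the scalar 10 to match the array.
shifted = [1, 2, 3] .+ 10

# map accepts an iterator whose length is not known up front,
# here a filtered generator.
odds = (k for k in 1:10 if isodd(k))
doubled = map(k -> 2k, odds)
```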

Julia 0.5 Highlights

2016-10-11T00:00:00+00:00

To follow along with the examples in this blog post and run them live, you can go to JuliaBox, create a free login, and open the “Julia 0.5 Highlights” notebook under “What’s New in 0.5”. The notebook can also be downloaded from here.

Julia 0.5 is a pivotal release. It introduces more transformative features than any release since the first official version. Moreover, several of these features set the stage for even more to come in the lead up to Julia 1.0. In this post, we’ll go through some of the major changes in 0.5, including improvements to functional programming, comprehensions, generators, arrays, strings, and more.

Functions

Julia has always supported functional programming features:

anonymous functions (lambdas),
inner functions that close over local variables (closures),
functions passed to and from other functions (first-class and higher-order functions).

Before this release, however, these features all came with a significant performance cost. In a language that targets high-performance technical computing, that’s a serious limitation. So the Julia standard library and ecosystem have been rife with work-arounds to get the expressiveness of functional programming without the performance problems. But the right solution, of course, is to make functional programming fast – ideally just as fast as the optimal hand-written version of your code would be. In Julia 0.5, it is. And that changes everything.

This change is so important that there will be a separate blog post about it in the coming weeks, explaining how higher-order functions, closures and lambdas have been made so efficient, as well as detailing the kinds of zero-cost abstractions these changes enable. But for now, I’ll just tease with a little timing comparison. First, some definitions – they’re the same in both 0.4 and 0.5:

v = rand(10^7);                   # 10 million random numbers
double_it_vec(v) = 2v             # vectorized doubling of input
double_it_map(v) = map(x->2x, v)  # map a lambda over input

First, a timing comparison in Julia 0.4:

julia> VERSION
v"0.4.7"

julia> mean([@elapsed(double_it_vec(v)) for _=1:100])
0.024444888209999998

julia> mean([@elapsed(double_it_map(v)) for _=1:100])
0.5515606454499999

On 0.4, the functional version using map is 22 times slower than the vectorized version, which uses specialized generated code for maximal speed. Now, the same comparison in Julia 0.5:

julia> VERSION
v"0.5.0"

julia> mean([@elapsed(double_it_vec(v)) for _=1:100])
0.024549842180000003

julia> mean([@elapsed(double_it_map(v)) for _=1:100])
0.023871925960000002

The version using map is as fast as the vectorized one in 0.5. In this case, writing 2v happens to be more convenient than writing map(x->2x, v), so we may choose not to use map here, but there are many cases where functional constructs are clearer, more general, and more convenient. Now, they are also fast.

Ambiguous methods

One design decision that any multiple dispatch language must make is how to handle dispatch ambiguities: cases where none of the methods applicable to a given set of arguments is more specific than the rest. Suppose, for example, that a generic function, f, has the following methods:

f(a::Int, b::Real) = 1
f(a::Real, b::Int) = 2

In Julia 0.4 and earlier, the second method definition causes an ambiguity warning:

WARNING: New definition
    f(Real, Int64) at none:1
is ambiguous with:
    f(Int64, Real) at none:1.
To fix, define
    f(Int64, Int64)
before the new definition.

This warning is clear and gets right to the point: the case f(a,b) where a and b are of type Int (aka Int64 on 64-bit systems) is ambiguous. Evaluating f(3,4) calls the first method of f – but this behavior is undefined. Giving a warning whenever methods could be ambiguous is a fairly conservative choice: it urges people to define a method covering the ambiguous intersection before even defining the methods that overlap. When we decided to give warnings for potentially ambiguous methods, we hoped that people would avoid ambiguities and all would be well in the world.

Warning about method ambiguities turns out to be both too strict and too lenient. It’s far too easy for ambiguities to arise when shared generic functions serve as extension points across unrelated packages. When many packages extend the same generic functions, it’s common for the methods added to have some ambiguous overlap. This happens even when each package has no ambiguities on its own. Worse still, slight changes to one package can introduce ambiguities elsewhere, resulting in the least fun game of whack-a-mole ever. At the same time, the fact that ambiguities only cause warnings means that people learn to ignore them, which is annoying at best, and dangerous at worst: it’s far too easy for a real problem to be hidden by a barrage of insignificant ambiguity warnings. In particular, on 0.4 and earlier if an ambiguous method is actually called, no error occurs. Instead, one of the possible methods is called, based on the order in which methods were defined – which is essentially arbitrary when they come from different packages. Usually the method works – it does apply, after all – but this is clearly not the right thing to do.

The solution is simple: in Julia 0.5 the existence of potential ambiguities is fine, but actually calling an ambiguous method is an immediate error. The above method definitions for f, which previously triggered a warning, are now silent, but calling f with two Int arguments is a method dispatch error:

julia> f(3,4)
ERROR: MethodError: f(::Int64, ::Int64) is ambiguous. Candidates:
  f(a::Real, b::Int64) at REPL[2]:1
  f(a::Int64, b::Real) at REPL[1]:1
 in eval(::Module, ::Any) at ./boot.jl:231
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

This improves the experience of using the Julia package ecosystem considerably, while also making Julia safer and more reliable. No more torrent of insignificant ambiguity warnings. No more playing ambiguity whack-a-mole when someone else refactors their code and accidentally introduces ambiguities in yours. No more risk that a method call could be silently broken because of warnings that we’ve all learned to ignore.
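The behavior described above can be sketched end to end as follows, including the fix that the old warning used to suggest: defining a method for the ambiguous intersection. The variable err is illustrative, and the block is meant to be run at top level (REPL or script).

```julia
f(a::Int, b::Real) = 1
f(a::Real, b::Int) = 2

# Unambiguous calls dispatch as usual.
f(1, 2.0)   # 1
f(1.0, 2)   # 2

# The ambiguous intersection throws a MethodError only when it is called.
err = try
    f(3, 4)
catch e
    e
end
isa(err, MethodError)   # true

# Defining a method for the intersection resolves the ambiguity.
f(a::Int, b::Int) = 3
f(3, 4)     # 3
```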

Return type annotations

A long-requested feature has been the ability to annotate method definitions with an explicit return type. This aids the clarity of code, serves as self-documentation, helps the compiler reason about code, and ensures that return types are what programmers intend them to be. In 0.5, you can annotate method definitions with a return type like so:

function clip{T<:Real}(x::T, lo::Real, hi::Real)::T
    if x < lo
        return lo
    elseif x > hi
        return hi
    else
        return x
    end
end

This function is similar to the built-in clamp function, but let’s consider this definition for the sake of example. The return annotation on clip has the effect of inserting implicit calls to x->convert(T, x) at each return point of the method. It has no effect on any other method of clip, only the one where the annotation occurs. In this case, the annotation ensures that this method always returns a value of the same type as x, regardless of the types of lo and hi:

julia> clip(0.5, 1, 2) # convert(T, lo)
1.0

julia> clip(1.5, 1, 2) # convert(T, x)
1.5

julia> clip(2.5, 1, 2) # convert(T, hi)
2.0
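To see what the annotation amounts to, here is a hand-expanded sketch in which every return point calls convert explicitly. The name clip_explicit is hypothetical, and typeof(x) plays the role of the type parameter T. One consequence worth noticing is that the conversion can fail when a bound has no exact representation in the type of x.

```julia
# Hand-expanded version of clip: each return point converts to typeof(x).
function clip_explicit(x::Real, lo::Real, hi::Real)
    T = typeof(x)
    if x < lo
        return convert(T, lo)
    elseif x > hi
        return convert(T, hi)
    else
        return convert(T, x)
    end
end

clip_explicit(0.5, 1, 2)   # 1.0: the Int bound is converted to Float64

# The conversion can fail: 1.5 cannot be converted exactly to an Int.
failed = try
    clip_explicit(1, 1.5, 3)
    false
catch e
    isa(e, InexactError)
end
```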

You’ll note that the annotated return type here is T, which is a type parameter of the clip method. Not only is that allowed, but the return type can be an arbitrary expression of argument values, type parameters, and values from outer scopes. For example, here is a variation that promotes its arguments:

function clip2(x::Real, lo::Real, hi::Real)::promote_type(typeof(x), typeof(lo), typeof(hi))
    if x < lo
        return lo
    elseif x > hi
        return hi
    else
        return x
    end
end

julia> clip2(2, 1, 3)
2

julia> clip2(2, 1, 13//5)
2//1

julia> clip2(2.5, 1, 13//5)
2.5

Return type annotations are a fairly simple syntactic transformation, but they make it easier to write methods with consistent and predictable return types. If different branches of your code can lead to slightly different types, the fix is now as simple as putting a single type annotation on the entire method.

Vectorized function calls

Julia 0.5 introduces the syntax f.(A1, A2, ...) for vectorized function calls. This syntax translates to broadcast(f, A1, A2, ...), where broadcast is a higher-order function (introduced in 0.2), which generically implements the kind of broadcasting behavior found in Julia’s “dotted operators” such as .+, .-, .*, and ./. Since higher-order functions are now efficient, writing broadcast(f,v,w) and f.(v,w) are both about as fast as loops specialized for the operation f and the shapes of v and w. This syntax lets you vectorize your scalar functions the way built-in vectorized functions like log, exp, and atan2 work. In fact, in the future, this syntax will likely replace the pre-vectorized methods of functions like exp and log, so that users will write exp.(v) to exponentiate a vector of values. This may seem a little bit uglier, but it’s more consistent than choosing an essentially arbitrary set of functions to pre-vectorize, and as I’ll explain below, this approach can also have significant performance benefits.
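As a small illustration of the translation, the sketch below applies an ordinary scalar function both ways. The function hyp and the input vectors are made up for this example.

```julia
hyp(a, b) = sqrt(a^2 + b^2)   # an ordinary scalar function

v = [3.0, 5.0, 8.0]
w = [4.0, 12.0, 15.0]

dotted   = hyp.(v, w)             # vectorized call syntax
explicit = broadcast(hyp, v, w)   # what the syntax translates to
```

No vectorized method of hyp has to be written for either form to work.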

To give a more concrete sense of what this syntax can be used for, consider the clip function defined above for real arguments. This scalar function can be applied to vectors using vectorized call syntax without any further method definitions:

julia> v = randn(10)
10-element Array{Float64,1}:
 -0.868996
  1.79301
 -0.309632
  1.16802
 -1.57178
 -0.223385
 -0.608423
 -1.54862
 -1.33672
  0.864448

julia> clip(v, -1, 1)
ERROR: MethodError: no method matching clip(::Array{Float64,1}, ::Int64, ::Int64)
Closest candidates are:
  clip{T<:Real}(::T<:Real, ::Real, ::Real) at REPL[2]:2

julia> clip.(v, -1, 1)
10-element Array{Float64,1}:
 -0.868996
  1.0
 -0.309632
  1.0
 -1.0
 -0.223385
 -0.608423
 -1.0
 -1.0
  0.864448

The second and third arguments don’t need to be scalars – as with dotted operators, they can be vectors as well, and the clip operation will be applied to each corresponding triple of values:

julia> clip.(v, repmat([-1,0.5],5), repmat([-0.5,1],5))
10-element Array{Float64,1}:
 -0.868996
  1.0
 -0.5
  1.0
 -1.0
  0.5
 -0.608423
  0.5
 -1.0
  0.864448

From these examples, it may be unclear why this operation is called “broadcast”. The function gets its name from the following behavior: wherever one of its arguments has a singleton dimension (i.e. dimension of size 1), it “broadcasts” that value along the corresponding dimension of the other arguments when applying the operator. Broadcasting allows dotted operations to easily do handy tricks like mean-centering the columns of a matrix:

julia> A = rand(3,4);

julia> B = A .- mean(A,1)
3×4 Array{Float64,2}:
  0.343976   0.427378  -0.503356  -0.00448691
 -0.210096  -0.531489   0.168928  -0.128212
 -0.13388    0.104111   0.334428   0.132699

julia> mean(B,1)
1×4 Array{Float64,2}:
 0.0  0.0  0.0  0.0

The matrix A is 3×4 and mean(A,1) is 1×4 so the .- operator broadcasts the subtraction of each mean value along the corresponding column of A, thereby mean-centering each column. Combining this broadcasting behavior with vectorized call syntax lets us write some fairly fancy custom array operations very concisely:

julia> clip.(B, [-0.3, -0.2, -0.1], [0.4, 0.3, 0.2, 0.1]')
3×4 Array{Float64,2}:
  0.343976   0.3       -0.3       -0.00448691
 -0.2       -0.2        0.168928  -0.128212
 -0.1        0.104111   0.2        0.1

This expression clips each element of B with its own specific (lo,hi) pair from this matrix:

julia> [(lo,hi) for lo=[-0.3, -0.2, -0.1], hi=[0.4, 0.3, 0.2, 0.1]]
3×4 Array{Tuple{Float64,Float64},2}:
 (-0.3,0.4)  (-0.3,0.3)  (-0.3,0.2)  (-0.3,0.1)
 (-0.2,0.4)  (-0.2,0.3)  (-0.2,0.2)  (-0.2,0.1)
 (-0.1,0.4)  (-0.1,0.3)  (-0.1,0.2)  (-0.1,0.1)

Vectorized call syntax avoids ever materializing this array of pairs, however, and the messy code to apply clip to each element of B with the corresponding lo and hi values doesn’t have to be written. When B is larger than a toy example, not constructing a temporary matrix of (lo,hi) pairs can be a big efficiency win.
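For comparison, here is a sketch of the same kind of computation written out by hand. It uses a simplified stand-in named clip_simple (no return type annotation) and rounded, illustrative values for B, so the numbers are not those of the session above.

```julia
# Simplified stand-in for clip, without the return type annotation.
clip_simple(x, lo, hi) = x < lo ? lo : (x > hi ? hi : x)

B  = [0.34 0.43 -0.50 0.0; -0.21 -0.53 0.17 -0.13; -0.13 0.10 0.33 0.13]
lo = [-0.3, -0.2, -0.1]       # one lower bound per row
hi = [0.4 0.3 0.2 0.1]        # one upper bound per column (1×4)

clipped = clip_simple.(B, lo, hi)

# The same result spelled out index by index.
by_hand = [clip_simple(B[i, j], lo[i], hi[j]) for i in 1:3, j in 1:4]
```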

There is a bit more to the story about vectorized call syntax. It’s common to write expressions applying multiple vectorized functions to some arrays. For example, one might write something like:

max(abs(X), abs(Y))

This computes the absolute values of each element of X and Y and takes the larger of the corresponding elements from abs(X) and abs(Y). In this traditional vectorized form, the code allocates two temporary intermediate arrays – one to store each of abs(X) and abs(Y). If we use the new vectorized function call syntax, however, these calls are syntactically fused into a single call to broadcast with an anonymous function. In other words, we write this:

max.(abs.(X), abs.(Y))

which internally becomes this:

broadcast((x, y)->max(abs(x), abs(y)), X, Y)

This version of the computation avoids allocating any intermediate arrays and performs the entire vectorized computation all at once, directly into the result array. We can see this difference in memory usage and speed when we benchmark these expressions:

julia> using BenchmarkTools

julia> X, Y = rand(1000,1000), rand(1000,1000);

julia> @benchmark max(abs(X), abs(Y))
BenchmarkTools.Trial:
  memory estimate:  22.89 mb
  minimum time:     13.95 ms (1.77% GC)
  median time:      14.17 ms (1.76% GC)
  mean time:        14.32 ms (1.78% GC)
  maximum time:     17.15 ms (3.47% GC)

julia> @benchmark max.(abs.(X), abs.(Y))
BenchmarkTools.Trial:
  memory estimate:  7.63 mb
  minimum time:     2.84 ms (0.00% GC)
  median time:      2.98 ms (0.00% GC)
  mean time:        3.27 ms (18.26% GC)
  maximum time:     5.96 ms (65.68% GC)

julia> 22.89/7.63, 16.63/3.84
(3.0,4.330729166666667)

I’m using the BenchmarkTools package here instead of hand-rolled timing loops. BenchmarkTools has been carefully designed to avoid many of the common pitfalls of benchmarking code and to provide sound statistical estimates of how much time and memory your code uses. For the sake of brevity, I’m omitting some of the less relevant output from @benchmark.

As you can see, the dotted form uses 3 times less memory and is 4.3 times faster. These improvements come from avoiding temporary allocations and performing the entire computation in a single pass over the arrays. Even greater reduction in allocation can occur when we use the new .= operator to also do vectorized assignment:

julia> Z = zeros(X); # matrix of zeros similar to X

julia> @benchmark Z .= max.(abs.(X), abs.(Y))
BenchmarkTools.Trial:
  memory estimate:  96.00 bytes
  minimum time:     1.76 ms (0.00% GC)
  median time:      1.82 ms (0.00% GC)
  mean time:        1.89 ms (0.00% GC)
  maximum time:     4.24 ms (0.00% GC)

With in-place vectorized assignment, we can fill the pre-allocated array, Z, without doing any allocation (the 96 bytes is an artifact), and do so 7.3 times faster than the old-style vectorized computation. This can be a big win in situations where we can reuse the same output array for multiple computations.
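Conceptually, the fused in-place form behaves like a single hand-written loop. The sketch below is an illustration of that idea on tiny inputs, not the code that broadcast actually generates; the function name maxabs! is hypothetical.

```julia
X = [-1.0 2.0; 3.0 -4.0]
Y = [0.5 -2.5; -3.5 1.0]

# One loop over all elements, writing straight into Z, with no temporaries.
function maxabs!(Z, X, Y)
    for i in eachindex(Z, X, Y)
        Z[i] = max(abs(X[i]), abs(Y[i]))
    end
    return Z
end

Z1 = similar(X)
Z1 .= max.(abs.(X), abs.(Y))     # fused dot calls with in-place assignment

Z2 = maxabs!(similar(X), X, Y)   # hand-written equivalent
```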

The last major missing piece of vectorized call syntax is yet to come – it will be implemented in the next version of Julia. Dotted operators like .+ and .* will cease to be their own independent operators and simply become the vectorized forms of the corresponding scalar operators, + and *. In other words, instead of .+ being a function as it is now, with its own behavior independent of +, when you write X .+ Y it will mean broadcast(+, X, Y). Furthermore, dotted operators will participate in the same syntax-level fusion as other vectorized calls, so an expression like exp.(log.(X) .+ log.(Y)) will translate into a single call to broadcast:

broadcast((x, y)->exp(log(x) + log(y)), X, Y)

This change will complete the transition to a generalized approach to vectorized function application (including syntax-level loop fusion), which will make Julia’s story for writing allocation-free array code much stronger.
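Whether or not the dotted operators are fused on a given Julia version, the dotted expression and the explicit broadcast call should compute the same values. A minimal check with illustrative inputs:

```julia
X = [1.0, 2.0, 4.0]
Y = [3.0, 0.5, 2.0]

dotted   = exp.(log.(X) .+ log.(Y))
explicit = broadcast((x, y) -> exp(log(x) + log(y)), X, Y)

# Both are, up to rounding, the elementwise product of X and Y.
isapprox(dotted, X .* Y)
```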

Comprehensions

Julia’s array comprehensions have always supported some advanced features such as iterating with several variables to produce multidimensional arrays. This release rounds out the functionality of comprehensions with two additional features: nested generation with multiple for clauses, and filtering with a trailing if clause. To demonstrate these features, consider making a dollar (100¢) using quarters (25¢), dimes (10¢), nickels (5¢) and pennies (1¢). We can generate an array of tuples of total values in each kind of coin by using a comprehension with nested for clauses:

julia> change = [(q,d,n,p) for q=0:25:100 for d=0:10:100-q for n=0:5:100-q-d for p=100-q-d-n]
242-element Array{NTuple{4,Int64},1}:
 (0,0,0,100)
 (0,0,5,95)
 (0,0,10,90)
 (0,0,15,85)
 (0,0,20,80)
 (0,0,25,75)
 ⋮
 (75,10,5,10)
 (75,10,10,5)
 (75,10,15,0)
 (75,20,0,5)
 (75,20,5,0)
 (100,0,0,0)

There are a few notable differences from the multidimensional array syntax:

Each iteration is a new for clause, rather than a single compound iteration separated by commas;
Each successive for clause can refer to variables from the previous clauses;
The result is a single flat vector regardless of how many nested for clauses there are.
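A smaller sketch makes these differences easy to see side by side; the names grid and flat are illustrative.

```julia
# Comma-separated iteration: a 3×3 matrix; the range of j cannot depend on i.
grid = [(i, j) for i in 1:3, j in 1:3]

# Nested for clauses: a flat vector, and the range of j depends on i.
flat = [(i, j) for i in 1:3 for j in 1:i]
```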

The tuple (q,d,n,p) in the comprehension body is a breakdown of monetary value into quarters, dimes, nickels and pennies. Note that the iteration range for p isn’t a range at all, it’s a single value, 100-q-d-n, the unique number guaranteeing that each tuple adds up to a dollar. (This relies on the fact that a number behaves like an immutable zero-dimensional container, holding only itself, a behavior which is sometimes convenient but which has been the subject of significant debate. As of 0.5 it still works.) We can verify that each tuple adds up to 100:

julia> extrema([sum(t) for t in change])
(100,100)

Since 100 is both the minimum and maximum of all the tuple sums, we know they are all exactly 100. So, there are 242 ways to make a dollar with common coins. But suppose we want to ensure that the value in pennies is less than the value in nickels, and so forth. By adding a filter clause, we can do this easily too:

julia> [(q,d,n,p) for q=0:25:100 for d=0:10:100-q for n=0:5:100-q-d for p=100-q-d-n if p < n < d < q]
4-element Array{NTuple{4,Int64},1}:
 (50,30,15,5)
 (50,30,20,0)
 (50,40,10,0)
 (75,20,5,0)

The only difference here is the if p < n < d < q clause at the end of the comprehension, which has the effect that the result only contains cases where this predicate holds true. There are exactly four ways to make a dollar with strictly increasing value from pennies to nickels to dimes to quarters.
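For comparison, the filtered comprehension can be written as explicit nested loops. This is a sketch of an equivalent formulation; the function name increasing_breakdowns is hypothetical.

```julia
# Explicit-loop equivalent of the filtered comprehension.
function increasing_breakdowns()
    out = NTuple{4,Int}[]
    for q in 0:25:100, d in 0:10:100-q, n in 0:5:100-q-d
        p = 100 - q - d - n          # pennies make up the remainder
        if p < n < d < q
            push!(out, (q, d, n, p))
        end
    end
    return out
end

increasing_breakdowns()
```

The comprehension expresses the same thing in one line, without the temporary vector and the push! bookkeeping.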

Nested and filtered comprehensions aren’t earth-shattering features – everything you can do with them can be done in a variety of other ways – but they are expressive and convenient, found in other languages, and they allow you to try more things with your data quickly and easily, with less pointless refactoring.

Generators

In the previous section we used an array comprehension to take the sum of each tuple, save the sums as an array, and then pass that array of sums to the extrema function to find the largest and smallest sum (they’re all 100):

julia> @time extrema([sum(t) for t in change])
  0.000072 seconds (8 allocations: 2.203 KB)
(100,100)

Wrapping this in the @time macro shows that this expression allocates 2.2 KB of memory – mostly for the array of sums, which is thrown away after the computation. But allocating an array just to find its extrema is unnecessary: the minimum and maximum can be computed over streamed data by keeping the largest and smallest values seen so far. In other words, this calculation could be expressed with constant memory overhead by interleaving the production of values with computation of extrema. Previously, expressing this interleaved computation required some amount of refactoring, and many approaches were considerably less efficient. In 0.5, if you simply omit the square brackets around an array comprehension, you get a generator expression, which instead of producing an array of values, can be iterated over, yielding one value at a time. Since extrema works with arbitrary iterable objects – including generators – expressing an interleaved calculation using constant memory is now as simple as deleting [ and ]:

julia> @time extrema(sum(t) for t in change)
  0.000066 seconds (6 allocations: 208 bytes)
(100,100)

This avoids allocating a temporary array of sums entirely, instead computing the next tuple’s sum only when the extrema function is ready to accept a new value. Using a generator reduces the memory overhead to 208 bytes – the size of the return value. More importantly, the memory usage doesn’t depend on the size of the change array anymore – it will always be just 208 bytes, even if change holds a trillion tuples. It’s not hard to imagine situations where such a reduction in asymptotic memory usage is crucial. The similar syntax between array comprehensions and generator expressions makes it trivial to move back and forth between the two styles of computation as needed.
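A generator is an ordinary iterable object, so it can be bound to a variable and passed to any function that accepts an iterator. A minimal sketch with made-up data (the names coin_pairs and gen are illustrative):

```julia
coin_pairs = [(25, 75), (50, 50), (100, 0)]   # illustrative data

gen = (sum(t) for t in coin_pairs)   # no array of sums is built here
isa(gen, Base.Generator)             # true

extrema(gen)                         # consumes the sums one at a time
sum(k^2 for k in 1:10)               # reductions accept generators too
```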

Initializing collections

The new generator syntax dovetails particularly nicely with Julia’s convention for constructing collections – to make a new collection, you call the constructor with a single iterable argument, which yields the values you want in the new collection. In its simplest form, this looks something like:

julia> IntSet([1, 4, 9, 16, 25, 36, 49, 64])
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

In this expression, an array of integers is passed to the IntSet constructor to create an object representing that set, whose elements in this case happen to be small squares. Once constructed, the IntSet object no longer refers to the original array of integers. Instead, it uses a bitmask to efficiently store and operate on sets. It displays itself as you would construct it from an array, but that’s merely for convenience – there’s no actual array anymore.

Now, I’m a human (no blogbots here) and I find typing out even short sequences of perfect squares tedious and error prone – despite a math degree, I’m awful at arithmetic. It would be much easier to generate squares with an array comprehension:

julia> IntSet([k^2 for k = 1:8])
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

This comprehension produces the same array of integers that I typed manually above. As before, creating this array object is unnecessary – it would be even better to generate the desired squares as they are inserted into the new IntSet. Which, of course, is precisely what generator expressions allow:

julia> IntSet(k^2 for k = 1:8)
IntSet([1, 4, 9, 16, 25, 36, 49, 64])

Using a generator here is just as clear, more concise, and significantly more efficient:

julia> using BenchmarkTools

julia> @benchmark IntSet([k^2 for k = 1:8])
BenchmarkTools.Trial:
  memory estimate:  320.00 bytes
  minimum time:     163.00 ns (0.00% GC)
  median time:      199.00 ns (0.00% GC)
  mean time:        245.18 ns (12.95% GC)
  maximum time:     5.36 μs (92.47% GC)

julia> @benchmark IntSet(k^2 for k = 1:8)
BenchmarkTools.Trial:
  memory estimate:  160.00 bytes
  minimum time:     114.00 ns (0.00% GC)
  median time:      139.00 ns (0.00% GC)
  mean time:        165.74 ns (11.48% GC)
  maximum time:     4.82 μs (93.20% GC)

As you can see from this benchmark, the version with an array comprehension uses twice as much memory and is 50% slower than constructing the same IntSet using a generator expression.
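The same convention applies to other collection constructors. As a sketch, here it is with the general-purpose Set type standing in for IntSet; the variable names are illustrative.

```julia
from_array     = Set([k^2 for k = 1:8])   # builds a temporary array first
from_generator = Set(k^2 for k = 1:8)     # squares are produced on demand

from_array == from_generator
```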

Constructing dictionaries

Generators can be used to construct dictionaries too, and this use case deserves some special attention since it completes a multi-release process of putting user-defined dictionary types on an equal footing with the built-in Dict type. In Julia 0.3, the => operator only existed as part of syntax for constructing Dict objects: [k₁ => v₁, k₂ => v₂] and [k(i) => v(i) for i = c]. This design was based on other dynamic languages where dictionaries are among a small set of built-in types with special syntax that are deeply integrated into the language. As Julia’s ecosystem has matured, however, it has become apparent that Julia is actually more like Java or C++ in this respect than it is like Python or Lua: the Dict type isn’t that special – it happens to be defined in the standard library, but is otherwise quite ordinary. Many programs use other dictionary implementations: for example, the tree-based SortedDict type, which sorts values by key, or OrderedDict, which maintains keys in the order they are inserted. Having special syntax only for Dict makes using other dictionary implementations problematic. In 0.3, there was no good syntax for constructing values of these dictionaries – the best one could do was to invoke a constructor with an array of two-tuples:

SortedDict([(k₁, v₁), (k₂, v₂)])        # fixed-size dictionaries
SortedDict([(k(i), v(i)) for i in c])   # dictionary comprehensions

Not only are these constructions inconvenient and ugly, they’re also inefficient since they create temporary heap-allocated arrays of heap-allocated tuples of key-value pairs. With much relief, we can now instead write:

SortedDict(k₁ => v₁, k₂ => v₂)          # fixed-size dictionaries, since 0.4
SortedDict(k(i) => v(i) for i = c)      # dictionary comprehensions, since 0.5

This last syntax combines two orthogonal features introduced in 0.4 and 0.5, respectively:

k => v as a standalone syntax for a Pair object, and
generator expressions, particularly to initialize collections.

The Dict type is now constructed in exactly the same way:

julia> Dict("foo" => 1, "bar" => 2)
Dict{String,Int64} with 2 entries:
  "bar" => 2
  "foo" => 1

julia> Dict("*"^k => k for k = 1:10)
Dict{String,Int64} with 10 entries:
  "**********" => 10
  "***"        => 3
  "*******"    => 7
  "********"   => 8
  "*"          => 1
  "**"         => 2
  "****"       => 4
  "*********"  => 9
  "*****"      => 5
  "******"     => 6

This generalization makes the syntax for constructing a Dict slightly longer, but we feel that the increased consistency, ability to change dictionary implementations with a simple search-and-replace, and putting user-defined dictionary-like types on the same level as the built-in Dict type make this change well worthwhile.
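To make the "search-and-replace" point concrete, here is a minimal sketch in which the dictionary type is just a parameter. It assumes a later Julia version that ships IdDict in Base (used here only because it is a second dictionary type available without installing packages), and the helper name squares is hypothetical:

```julia
# `k => v` is an ordinary Pair value, independent of any dictionary type.
p = "foo" => 1

# Hypothetical helper: D can be any dictionary type whose constructor
# accepts an iterator of pairs, such as a generator expression.
squares(D) = D(k => k^2 for k = 1:5)

d1 = squares(Dict)     # hash-based dictionary
d2 = squares(IdDict)   # identity-keyed dictionary from Base

println(typeof(p))
println(d1[3], " ", d2[3])
```

Swapping in SortedDict or OrderedDict from a package should work the same way, provided their constructors accept an iterator of pairs.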

Arrays

The 0.5 release was originally intended to include a large number of disruptive array changes, collectively dubbed “Arraymageddon”. After much discussion, experimentation and benchmarking, this set of breaking changes was significantly reduced for a variety of reasons:

Some changes were deemed not to be good ideas after all;
Others were of unclear benefit, so it was decided to reconsider them in the future once there is more information to support a decision;
A few didn’t get implemented due to lack of developer time, including some cases where everyone agrees there’s a problem but there is not yet any complete design for a solution.

Although not many breaking changes happened in 0.5, this was a major release for Julia’s array infrastructure. The code to implement various complex polymorphic indexing operations for generic arrays and array-like structures was majorly refactored, and in the process it shrank by 40% while becoming more complete, more general, and faster. You can read more about the very cool things you can now do with array-like types in an excellent pair of blog posts published here earlier in the year: Multidimensional algorithms and iteration and Generalizing AbstractArrays. In the next two subsections, I’ll go over some of the array changes that did happen in 0.5.

Dimension sum slices

The most significant breaking change in the 0.5 cycle affects multidimensional array slicing. To explain it we’ll need a little terminology. A singleton dimension of a multidimensional array is a dimension whose size is 1. For example, a 5x1 matrix has a trailing singleton dimension and may be called a “column matrix”, and a 1x5 matrix has a leading singleton dimension and may be called a “row matrix”. A scalar slice refers to a dimension in a multidimensional slice expression where the index is a scalar integer (considered to be zero-dimensional), rather than a 1-dimensional range or vector, or some higher-dimensional collection of indices. For example, in A[1,:] the first slice is scalar, the second is not; in A[:,2] the second slice is scalar, the first is not; in A[3,4] both slices are scalar.

All previous versions of Julia have dropped trailing scalar slices when performing multidimensional array slicing. That is, when an array was sliced with multiple indices, the resulting array had the number of dimensions of the original array minus the number of trailing scalar slices. So when you sliced a column out of a matrix the result was a 1-dimensional vector, but when you sliced a row the result was a 2-dimensional row matrix:

julia> VERSION
v"0.4.7"

julia> M = [i+j^2 for i=1:3, j=1:4]
3x4 Array{Int64,2}:
 2  5  10  17
 3  6  11  18
 4  7  12  19

julia> M[:,1] # vector
3-element Array{Int64,1}:
 2
 3
 4

julia> M[3,:] # row matrix
1x4 Array{Int64,2}:
 4  7  12  19

This rule is handy for linear algebra since row and column slices have distinct types and different orientations, but its complexity, asymmetry, and lack of generality make it less than ideal for arrays as general purpose containers. With more dimensions, the asymmetry of this behavior can be seen even in a single slice operation:

julia> VERSION
v"0.4.7"

julia> T = [i+j^2+k^3 for i=1:3, j=1:4, k=1:2]
3x4x2 Array{Int64,3}:
[:, :, 1] =
 3  6  11  18
 4  7  12  19
 5  8  13  20

[:, :, 2] =
 10  13  18  25
 11  14  19  26
 12  15  20  27

julia> T[2,:,2]
1x4 Array{Int64,2}:
 11  14  19  26

The leading dimension of this slice is retained while the trailing dimension is discarded – even though both are scalar slices. The result array is neither 3-dimensional like the original, nor 1-dimensional like the collective indexes (0 + 1 + 0); instead, it’s 2-dimensional – apropos of nothing. Here, in another fairly similar slice, all dimensions are kept:

julia> T[:,4,:]
3x1x2 Array{Int64,3}:
[:, :, 1] =
 18
 19
 20

[:, :, 2] =
 25
 26
 27

By comparison, the new slicing behavior in 0.5 is simple, systematic, and symmetrical. (And not original by any means – APL pioneered this array slicing scheme in the 1960s.) In Julia 0.5, when an array is sliced, the dimension of the result is the sum of the dimensions of the slices, and the dimension sizes of the result are the concatenation of the sizes of the slices. Thus, row slices and column slices both produce vectors:

julia> VERSION
v"0.5.0"

julia> M[:,1] # vector: 1 + 0 = 1
3-element Array{Int64,1}:
 2
 3
 4

julia> M[1,:] # vector: 0 + 1 = 1
4-element Array{Int64,1}:
  2
  5
 10
 17

Similarly, slicing a 3-dimensional array with scalars in all but one dimension also produces a vector:

julia> T[2,:,2] # vector: 0 + 1 + 0 = 1
4-element Array{Int64,1}:
 11
 14
 19
 26

The only example from above that doesn’t produce a vector is the last one:

julia> T[:,4,:] # matrix: 1 + 0 + 1 = 2
3×2 Array{Int64,2}:
 18  25
 19  26
 20  27

The result is a matrix since the leading and trailing slices are ranges, while the middle slice disappears since it is scalar. The 0.5 slicing behavior naturally generalizes to higher dimensional slices:

julia> I = [1 2 1; 1 3 2]
2×3 Array{Int64,2}:
 1  2  1
 1  3  2

julia> J = [4 2 1 3]
1×4 Array{Int64,2}:
 4  2  1  3

julia> M[I,J]
2×3×1×4 Array{Int64,4}:
[:, :, 1, 1] =
 17  18  17
 17  19  18

[:, :, 1, 2] =
 5  6  5
 5  7  6

[:, :, 1, 3] =
 2  3  2
 2  4  3

[:, :, 1, 4] =
 10  11  10
 10  12  11

Here we have the following natural identity on dimensions:

size(M[I,J]) == (size(I,1), size(I,2), size(J,1), size(J,2))

In addition to being more systematic and symmetrical, this new behavior allows many complex indexing operations to be expressed concisely.
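The dimension-sum rule can be checked mechanically. The sketch below rebuilds the matrix M from the examples above and compares each slice against the rule; the helper expected_ndims and the names Iidx and Jidx (playing the roles of I and J) are introduced here for illustration only:

```julia
M = [i + j^2 for i = 1:3, j = 1:4]
Iidx = [1 2 1; 1 3 2]   # 2×3 index matrix
Jidx = [4 2 1 3]        # 1×4 index matrix

# Dimensionality of a slice = sum of the dimensionalities of its indices:
# an integer has ndims 0, a range has ndims 1, a matrix has ndims 2.
expected_ndims(indices...) = sum(ndims, indices)

row = M[1, 1:4]         # 0 + 1 = 1, a vector
col = M[1:3, 1]         # 1 + 0 = 1, a vector
block = M[Iidx, Jidx]   # 2 + 2 = 4, a 4-dimensional array

println(ndims(row), " ", ndims(col), " ", ndims(block))
println(size(block) == (size(Iidx)..., size(Jidx)...))
```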

Although the change to multidimensional slicing behavior is a significant breaking change, it has caused surprisingly little havoc in the Julia package ecosystem. It tends to primarily affect linear algebra code, and when code does break, it’s usually fairly clear what is broken and what needs to be done to fix it. When updating your code, if you need to keep a dimension that is dropped under the new indexing behavior, you can write M[1:1,:]:

julia> M[1:1,:]
1×4 Array{Int64,2}:
 2  5  10  17

Since integer range construction can be eliminated by Julia’s compiler, writing this is free but has the effect of keeping a dimension which would otherwise be dropped under the new rules. Unfortunately, there’s no way to make this change without breaking some code – we apologize in advance for the inconvenience, and we hope you find the improvement to be worthwhile.

Array views

One of the major news items of 0.5 is a non-change: array slices still create copies of array data. The motivation for views by default was to improve performance drastically in a variety of slow cases, but after a lot of discussion, experiments, and benchmarks, we decided against changing the default and kept the old behavior. The conversation about this decision is long, so I’ll summarize the major points:

Slicing should either consistently produce views or copies. Unpredictably doing one or the other depending on types – or worse still, on runtime values – would be a disaster for writing reliable, generic code.
Guaranteeing view semantics for all abstract arrays – especially sparse and custom array types – is hard and can be quite slow and/or expensive in general cases.
Even in the case of dense arrays with cheap array views, it’s not clear that views are always a performance win. In some cases they definitely are, but in others the fact that a copied slice is contiguous and has optimal memory ordering for iteration overwhelms the benefit of not copying.
Copied slices are easier to reason about and less likely to lead to subtle bugs than views. Views can lead to situations where someone modifies the view, not realizing that it’s a view, thereby unintentionally modifying the original array. These kinds of bugs are hard to track down and even harder to notice.
There is no clear transition or deprecation strategy. Changing from copying slices to views would be a major compatibility issue. We generally give programmers deprecation warnings when some behavior is going to break or change in the next release. Sometimes we can’t do that so we just bite the bullet and break code with an error. But changing slices to views wouldn’t break code with an error, it would just silently cause code to produce different, incorrect results. There’s no clear way to make this transition safely.

Taken together this makes a compelling case against changing the default slicing behavior to returning views. That said, even if they’re not the default, views are a crucial tool for performance in some situations. Accordingly, a huge amount of work went into improving the ergonomics of views in 0.5, including:

Renaming the function for view construction from “sub” to “view”, which seems like a much better name.
Array views now support all forms of indexing supported by arrays. Previously, views did not support some of the more complex forms of array indexing.
The @view macro was introduced, allowing the use of natural slicing syntax for views. In other words you can now write @view A[1:end-1,2:end] instead of view(A, 1:size(A,1)-1, 2:size(A,2)).
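The difference between a copied slice and a view is easiest to see when something is mutated. A small sketch, using only view and @view as described above (the array contents are arbitrary):

```julia
A = collect(reshape(1:12, 3, 4))          # 3×4 matrix, column-major, so A[1, 2] == 4

c = A[1:end-1, 2:end]                     # slice: copies the data
v = @view A[1:end-1, 2:end]               # view: shares memory with A
w = view(A, 1:size(A,1)-1, 2:size(A,2))   # the same view, written without the macro

v[1, 1] = 100                             # writes through to A[1, 2]

println(A[1, 2])   # modified through the view
println(c[1, 1])   # the copy keeps the old value
```

This is exactly the aliasing behavior that makes views powerful for in-place algorithms and, as noted in the list of arguments above, a potential source of subtle bugs.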

Since views are such an important tool for both performance and for expressing complex mutating operations on arrays (especially with higher order functions), we may introduce a special syntax for view slices in the future. In particular, the syntax A@[I...] had a fair amount of popular support. Stay tuned!

And more…

This is far from the full extent of the improvements introduced in Julia 0.5, but this blog post is already getting quite long, so I’ll just summarize a few of the other big ticket items:

The set of string types and operations has been significantly simplified and streamlined. The ASCIIString and UTF8String types have been merged into a single String type, and the UTF16String and UTF32String and related functions have been moved into the LegacyStrings package, which keeps the same implementations as 0.4. In the future, better support for different string encodings will be developed under the StringEncodings package.
Most functionality related to prime generation, primality checking and combinatorics, has been moved into two external packages: Primes and Combinatorics. To use these functions, you’ll need to install these packages and do using Primes or using Combinatorics as necessary.
Julia’s LLVM version was upgraded from 3.3 to 3.7.1. This may not seem like a big deal, but the transition required herculean effort by many core Julia contributors. For a series of different and impossibly annoying reasons, LLVM versions 3.4, 3.5 and 3.6 were not usable for Julia, so we’re very happy to be back to using current versions of our favorite compiler framework.
Support for compiling and running on ARM chips is much improved since 0.4. Julia 0.5 also introduced initial support for Power systems, a development which has been supported and driven by IBM. We will be expanding and improving support for many architectures going forward. With support for ARM and Power, Julia is already a productive platform for technical computing from embedded systems to big iron.

The 0.5 release has experimental multithreading support. This isn’t ready for production usage, but it’s fun to play around with and you can already get impressive performance gains – scalability is a key focus. Julia’s threading provides true concurrent execution like C++, Go or Java: different threads can do work at the same time, up to the number of physical cores available.
Interactive debugging support has been a weak spot in the Julia ecosystem for some time, but not any more. On a vanilla build of Julia 0.5, you can install the Gallium package to get a full-fledged, high-performance debugger: set breakpoints, step through code, examine variables, and inspect stack frames.

I hope you’ve enjoyed this overview of highlights from the new release of Julia, and that you enjoy the release itself even more. Julia 0.5 is easily the strongest release to date, but of course the next one will be even better :)

Happy coding!

Julia 0.5 Release Announcement

2016-10-10T00:00:00+00:00

After over a year of development, the Julia community is proud to announce the release of version 0.5 of the Julia language and standard library. This release contains major language refinements and numerous standard library improvements. A long list of changes is available in the NEWS log found in our main repository, with a summary reproduced below. A separate blog post detailing some of the highlights of the new release has also been posted.

We’ll be releasing regular bugfix backports from the 0.5.x line, which is recommended for users requiring a stable language and API. Major feature work is ongoing on master for 0.6-dev.

The Julia ecosystem continues to grow, and there are now over one thousand registered packages! The third annual JuliaCon took place in Cambridge, MA in the summer of 2016, with an exciting line up of talks and keynotes. Most of them are available to view.

Binaries are available from the main download page or visit JuliaBox to try this release from the comfort of your browser. Happy Coding!

Notable compiler and language changes:

The major focus of this release has been the ability to write fast functional code, removing the earlier performance penalty for anonymous functions and closures. This has been achieved via each function and closure now being its own type, and the captured variables of a closure are fields of its type. All functions, including anonymous functions, are now generic and support all features.
Experimental support for multithreading.
All dimensions indexed by scalars are now dropped, whereas previously only trailing scalar dimensions would be omitted from the result. This is a major breaking change, but has been made to make the indexing rules much more consistent.
Generator expressions now can create iterators that are computed only on demand.
Experimental support for arrays whose indexing starts from values other than 1. Standard Julia arrays are still 1-based, but external packages can implement array types with indexing from arbitrary indices.
Major simplification of the string types, unifying ASCIIString and UTF8String as String, as well as moving types and functions related to different encodings out of the standard library.
Package operations now use the libgit2 library rather than shelling out to command line git. This makes these calls to package related functions much faster, and more reliable, especially on Windows.
And many many more changes and improvements…

Ports

Julia now runs on the ARM and Power architectures, making it possible to use it on the widest variety of hardware, from the smallest embedded machines to the largest HPC systems. Porting a language to a new architecture is never easy, so special thanks to the people who made it possible. Part of the work to create the Power port was supported by IBM, for which we are grateful.

Developing with Julia

The Julia debugger, Gallium, is now ready to use. It allows for a full, multi-language debug experience, debugging Julia and C code with ease. The debugger is also integrated with Juno, the Julia IDE that is now fully featured and ready to use.

StructuredQueries.jl - A generic data manipulation framework

2016-10-03T00:00:00+00:00

This post describes my work conducted this summer at the Julia Lab to develop StructuredQueries.jl, a generic data manipulation framework for Julia.

Our initial vision for this work was much inspired by Hadley Wickham’s dplyr R package, which provides data manipulation verbs that are generic over in-memory R tabular data structures and SQL databases, and DataFramesMeta (begun by Tom Short), which provides metaprogramming facilities for working with Julia DataFrames.

While a generic querying interface is a worthwhile end in itself (and has been discussed elsewhere), it may also be useful for solving problems specific to in-memory Julia tabular data structures. We will discuss how a query interface suggests solutions to two important problems facing the development of tabular data structures in Julia: the column-indexing and nullable semantics problems. So, the present post will describe both the progress of my work and also discuss a wider scope of issues concerning support for tabular data structures in Julia. I will provide some context for these issues; the reader should feel free to skip over any uninteresting details.

Recall that the primary shortcoming of DataArrays.jl is that it does not allow for type-inferable indexing. That is, the means by which missing values are represented in DataArrays – i.e. with a token NA::NAtype object – entails that the most specific return type inferable from Base.getindex(df::DataArray{T}, i) is Union{T, NAtype}. This means that until Julia’s compiler can better handle small Union types, code that naively indexes into a DataArray will perform unnecessarily poorly.

NullableArrays.jl remedied this shortcoming by representing both missing and present values of type T as objects of type Nullable{T}. However, this solution has limitations in other respects. First, use of NullableArrays does nothing to support type inference in column-indexing of DataFrames. That is, the return type of Base.getindex(df::DataFrame, field::Symbol) is not straightforwardly inferable, even if DataFrames are built over NullableArrays. Call this first problem the column-indexing problem. Second, NullableArrays introduces certain difficulties centered around the Nullable type. Call this second problem the nullable semantics problem.

The column-indexing problem is well-documented. To see the difficulty, consider the following function

function f(df::DataFrame)
    A = df[:A]
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

where df[:A] retrieves the column named :A from df. A user might reasonably expect the above to be idiomatic Julia: the work is written in a for loop that is wrapped inside a function. However, this code will not be (ahead-of-time) compiled to efficient machine instructions because the type of the object that df[:A] returns cannot be inferred during static analysis. This is because there is nothing the DataFrame type can do to communicate the eltypes of its columns to the compiler.
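The inference failure can be reproduced with a toy container. The sketch below is not the DataFrame implementation; SketchFrame and getcolumn are hypothetical stand-ins that store columns behind a single non-parametric type, which is the property that matters here. It uses later-Julia struct syntax and Base.return_types to ask what the compiler can infer from argument types alone:

```julia
# Hypothetical stand-in for a DataFrame: columns of arbitrary element type
# stored behind a single, non-parametric container type.
struct SketchFrame
    columns::Dict{Symbol,Any}
end

getcolumn(df::SketchFrame, name::Symbol) = df.columns[name]

df = SketchFrame(Dict{Symbol,Any}(:A => [1.0, 2.0, 3.0]))

# What can be inferred from the argument types alone?
inferred = Base.return_types(getcolumn, (SketchFrame, Symbol))

println(inferred)                    # nothing more specific than Any is known statically
println(typeof(getcolumn(df, :A)))   # the concrete column type is only known at run time
```

Because the loop in f only sees a value of statically unknown type, every A[i] and += in it has to be dispatched at run time.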

The nullable semantics problem is described throughout a dispersed series of GitHub issues (the interested reader can start here and here) (and at least one mailing list post). To my knowledge, a self-contained treatment has not been given (I don’t necessarily claim to be giving one now). The problem has two parts, which I’ll call the “easy question” and the “hard question”, respectively:

What should the semantics of f(x::Nullable{T}) be given a definition of f(x::T)?
How should we implement these semantics in a sufficiently general and user-friendly way?

In most cases, the answer to the easy question is clear: f(x::Nullable{T}) should return an empty Nullable{U} if x is null and Nullable(f(x.value)) if x is not null. There is a question of how to choose the type parameter U, but a solution involving Julia’s type inference facilities seems to be about right. (The discussion of 0.5-style comprehensions and one or two discussions about the return type of map over an empty array, were all influential on this matter.) We will refer to these semantics as the standard lifting semantics. It is worth noting that there is at least one considerable alternative to standard lifting semantics, at least in the realm of binary operators on Nullable{Bool} arguments: three-valued logic. But whether to use three-valued logic or standard lifting semantics is usually clear from the context of the program and the intention of the programmer.
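As a minimal sketch of standard lifting semantics, the following defines a toy SketchNullable type (hypothetical, standing in for Nullable, and written in later-Julia syntax) together with a lift function. Choosing U via Base.promote_op is just one inference-based option and should be read as an assumption, not as the settled answer:

```julia
# Hypothetical stand-in for Nullable{T}.
struct SketchNullable{T}
    hasvalue::Bool
    value::T
    SketchNullable{T}() where {T} = new{T}(false)      # empty: value left uninitialized
    SketchNullable{T}(v) where {T} = new{T}(true, v)
end
SketchNullable(v::T) where {T} = SketchNullable{T}(v)

sketch_isnull(x::SketchNullable) = !x.hasvalue

# Standard lifting: an empty input gives an empty result, otherwise f is
# applied to the wrapped value. U comes from type inference (an assumption).
function lift(f, x::SketchNullable{T}) where {T}
    U = Base.promote_op(f, T)
    return sketch_isnull(x) ? SketchNullable{U}() : SketchNullable{U}(f(x.value))
end

present = lift(sqrt, SketchNullable(4.0))
absent = lift(sqrt, SketchNullable{Float64}())

println(sketch_isnull(present), " ", present.value)
println(sketch_isnull(absent))
```

Three-valued logic differs from this only for Boolean operators: for example, true | null would yield true rather than null.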

On the other hand, the hard question is still unresolved. There are a number of possible solutions, and it’s difficult to know how to weigh their costs and benefits.

We’ll return to the column-indexing problem and the hard question of nullable semantics after we’ve described the present query interface. Before we dive in, I want to emphasize that this blog post is a status update, not a release notice (though StructuredQueries is registered so that you can play with it if you like). StructuredQueries (SQ) is a work in progress, and it will likely remain that way for some time. I hope to convince the reader that SQ nonetheless represents an interesting and worthwhile direction for the development of tabular data facilities in Julia.

The query framework

The StructuredQueries package provides a framework for representing the structure of a query without assuming any specific corresponding semantics. By the structure of a query, we mean the series of particular manipulation verbs invoked and the respective arguments passed to these verbs. By the semantics of a query, we mean the actual behavior of executing a query with a particular structure against a particular data source. A query semantics thus depends both on the structure of the query and on the type of the data source against which the query is executed. We will refer to the implementation of a particular query semantics as a collection machinery.

Decoupling the representation of a query’s structure from the collection machinery helps to make the present query framework

generic – the framework should be able to support multiple backends.
modular – the framework should encourage modularity of collection machinery.
extensible – the framework should be easily extensible to represent (relatively) arbitrary manipulations.

These desiderata are interrelated. For instance, modularity of collection machinery allows the latter to be re-used in support for different data backends, thereby supporting generality as well.

In this section we’ll describe how SQ represents query structures. In the following sections we’ll see how SQ’s query representation framework suggests solutions to the column-indexing and nullable semantics problems described above.

To express a query in SQ, one uses the @query macro:

@query qry

where qry is Julia code that follows a certain structure that we will describe below. qry is parsed according to what we’ll call a query context. By a context we mean a general semantics for Julia code that may differ from the semantics of the standard Julia environment. That is to say: though qry must be valid Julia syntax, the code is not run as it would were it executed outside of the @query macro. Rather, code such as qry that occurs inside of a query context is subject to a number of transformations before it is run. @query uses these transformations to produce a graphical representation of the structure of qry. An @query qry invocation returns a Query object, which wraps the query graph produced as a result of processing qry.

We said above that SQ represents queries in terms of their structure but does not itself guarantee any particular semantics. This allows packages to implement their own semantics for a given query structure. To demonstrate this design, I’ve put together (i) an abstract tabular data type, AbstractTable; (ii) an interface to support a collection machinery against what I call column-indexable types T <: AbstractTable; and (iii) a concrete tabular data type, Table <: AbstractTable that satisfies the column-indexable interface and therefore inherits a collection machinery to support SQ queries.

The following behavior mimics that which one would expect from querying against a DataFrame. The main reason for putting together a demonstration using Tables and not DataFrames has to do with ease of experimentation: I can modify the AbstractTable/Table types and interfaces more easily than I can the DataFrame type and interface. Indeed, this project has become just as much about designing an in-memory Julia tabular data type that is most compatible with a Julia query framework as it is about designing a query framework compatible with an in-memory Julia tabular data type. Fortunately, the implementation of backend support for Tables will be straightforward to port to support for DataFrames once we decide where such support should live.

Let’s dive into the query interface by considering examples using the iris data set. (Though the package TablesDemo.jl is intended solely as a demonstration, it is registered so that readers can easily install it with Pkg.add("TablesDemo") and follow along.)

julia> iris = Table(CSV.Source(joinpath(Pkg.dir("Tables"), "csv/iris.csv")))
Tables.Table
│ Row │ sepal_length │ sepal_width │ petal_length │ petal_width │ species  │
├─────┼──────────────┼─────────────┼──────────────┼─────────────┼──────────┤
│ 1   │ 5.1          │ 3.5         │ 1.4          │ 0.2         │ "setosa" │
│ 2   │ 4.9          │ 3.0         │ 1.4          │ 0.2         │ "setosa" │
│ 3   │ 4.7          │ 3.2         │ 1.3          │ 0.2         │ "setosa" │
│ 4   │ 4.6          │ 3.1         │ 1.5          │ 0.2         │ "setosa" │
│ 5   │ 5.0          │ 3.6         │ 1.4          │ 0.2         │ "setosa" │
│ 6   │ 5.4          │ 3.9         │ 1.7          │ 0.4         │ "setosa" │
│ 7   │ 4.6          │ 3.4         │ 1.4          │ 0.3         │ "setosa" │
│ 8   │ 5.0          │ 3.4         │ 1.5          │ 0.2         │ "setosa" │
│ 9   │ 4.4          │ 2.9         │ 1.4          │ 0.2         │ "setosa" │
│ 10  │ 4.9          │ 3.1         │ 1.5          │ 0.1         │ "setosa" │
⋮
with 140 more rows.

We can then use @query to express a query against this data set – say, filtering rows according to a condition on sepal_length:

julia> q = @query filter(iris, sepal_length > 5.0)
Query with Tables.Table source

This produces a Query{S} object, where S is the type of the data source:

julia> typeof(q)
StructuredQueries.Query{Tables.Table}

The structure of the query passed to @query consists of a manipulation verb (e.g. filter) that in turn takes a data source (e.g. iris) for its first argument and any number of query arguments (e.g. sepal_length > 5.0) for its latter arguments. These are the three different “parts” of a query: (1) data sources (or just “sources”), (2) manipulation verbs (or just “verbs”), and (3) query arguments.

Each part of a query induces its own context in which code is evaluated. The most significant aspect of such contexts is name resolution. That is to say, names resolve differently depending on which part of a query they appear in and in what capacity they appear:

In a data source specification context – e.g., as the first argument to a verb such as filter above – names are evaluated in the enclosing scope of the @query invocation. Thus, iris in the query used to define q above refers precisely to the Table object to which the name is bound in the top level of Main.
Names of manipulation verbs are not resolved to objects but rather merely signal how to construct the graphical representation of the query. (Indeed, in what follows there is no such function filter that is ever invoked in the execution of a query involving a filter clause.)
Names of functions called within a query argument context, such as > in sepal_length > 5.0 are evaluated in the enclosing scope of the @query invocation.
Names that appear as arguments to function calls within a query argument context, such as sepal_length in sepal_length > 5.0, are not resolved to objects but are rather parsed as “attributes” of the data source (in this case, iris). When the data source is a tabular data structure, such attributes are taken to be column names, but such behavior is just a feature of a particular query semantics (see the section “Roadmap and open questions” below). The attributes that are passed as arguments to a given function call in a query argument are stored as data in the graphical query representation.
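These resolution rules can be illustrated with a toy macro. The sketch below is not how StructuredQueries implements @query; the names @sketchquery and SketchQuery are hypothetical, and it handles only the verb(source, args...) form. The source is escaped so that it evaluates in the caller's scope, while the verb and the query arguments are kept as unevaluated syntax:

```julia
# Hypothetical, simplified query object: evaluated source plus unevaluated structure.
struct SketchQuery{S}
    source::S
    verb::Symbol
    args::Vector{Any}
end

macro sketchquery(ex)
    ex isa Expr && ex.head == :call || error("expected verb(source, args...)")
    verb = QuoteNode(ex.args[1])                      # verb name: kept as a symbol, never called
    source = esc(ex.args[2])                          # source: evaluated in the enclosing scope
    args = [QuoteNode(a) for a in ex.args[3:end]]     # query arguments: stored as syntax
    return :(SketchQuery($source, $verb, Any[$(args...)]))
end

iris = (sepal_length = [5.1, 4.9, 5.4],)              # toy stand-in for the iris table
q = @sketchquery filter(iris, sepal_length > 5.0)

println(q.verb)
println(q.args[1])
```

Note that no variable named sepal_length exists anywhere in this script; the name survives only as data inside the stored expression, which is the sense in which it is an "attribute" rather than a binding.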

One can pipe arguments to verbs inside an @query context. For instance, the Query above is equivalent to that produced by

@query iris |> filter(sepal_length > 5.0)

In this case, the data source specification (iris) is the first argument to |> rather than to the verb, so the first argument to filter is the query argument (sepal_length > 5.0).

Query objects represent the structure of a query composed of the three building blocks above. To see how, let’s take a look at the internals of a Query:

julia> fieldnames(q)
2-element Array{Symbol,1}:
 :source
 :graph

The first field, :source, just contains the data source specified in the query – in this case, the Table object that was bound to the name iris when the query was specified. The second field, :graph contains a(n admittedly not very interesting) graphical representation of the query structure:

julia> q.graph
FilterNode
  arguments:
      1)  sepal_length > 5.0
  inputs:
      1)  DataNode
            source:  unset source

We see that the filter verb from the original qry expression passed to @query is represented in the graph by a FilterNode object and that the data source is represented by a DataNode object. Both FilterNode and DataNode are leaf subtypes of the abstract QueryNode type. The FilterNode is connected to the DataNode via the :input field of the former. In general, these connections constitute directed acyclic graphs. We may refer to such graphs as QueryNode graphs or query graphs.
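A graph of this shape is straightforward to model with a small type hierarchy. The following sketch uses hypothetical types (SketchNode, DataSketchNode, VerbSketchNode) rather than SQ's actual QueryNode definitions, and a single generic verb node instead of one node type per verb:

```julia
abstract type SketchNode end

# Leaf node standing in for the data source.
struct DataSketchNode <: SketchNode end

# A verb node stores its unevaluated arguments and points at its input node.
struct VerbSketchNode <: SketchNode
    verb::Symbol
    arguments::Vector{Any}
    input::SketchNode
end

# Walk from the outermost node down to the source, listing verbs in execution order.
verbs(::DataSketchNode) = Symbol[]
verbs(n::VerbSketchNode) = push!(verbs(n.input), n.verb)

graph = VerbSketchNode(:summarize, Any[:(avg = mean(petal_width))],
            VerbSketchNode(:filter, Any[:(sepal_length > 5.0)],
                DataSketchNode()))

println(verbs(graph))
```

The outermost node is the last verb applied, so traversing the input links recovers the order in which the verbs would have to be executed.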

SQ currently recognizes the following verbs out of the box – that is, it properly incorporates them into a QueryNode graph:

select
filter
groupby
summarize
orderby
innerjoin (or just join)
leftjoin
outerjoin
crossjoin

One uses collect(q::Query) to materialize q as a concrete results set – hence the term “collection machinery”. Note that the set of verbs that receive support from the column-indexable interface – that is, the verbs that may be collected against a column-indexable data source – currently only includes the first four: select, filter, groupby, and summarize. This is what such support currently looks like:

julia> q = @query iris |>
           filter(sepal_length > 5.0) |>
           groupby(species, log(petal_length) > .5) |>
           summarize(avg = mean(digamma(petal_width)))
Query with Tables.Table source

julia> q.graph
SummarizeNode
  arguments:
      1)  avg=mean(digamma(petal_width))
  inputs:
      1)  GroupbyNode
            arguments:
                1)  species
                2)  log(petal_length) > 0.5
            inputs:
                1)  FilterNode
                      arguments:
                          1)  sepal_length > 5.0
                      inputs:
                          1)  DataNode
                                source:  unset source


julia> collect(q)
Grouped Tables.Table
Groupings by:
    species
    log(petal_length) > 0.5 (with alias :pred_1)

Source: Tables.Table
│ Row │ species      │ pred_1 │ avg       │
├─────┼──────────────┼────────┼───────────┤
│ 1   │ "virginica"  │ true   │ 0.428644  │
│ 2   │ "setosa"     │ true   │ -3.17557  │
│ 3   │ "versicolor" │ true   │ -0.136551 │
│ 4   │ "setosa"     │ false  │ -4.7391   │

We hope to include support for the other verbs in the near future.

Again we emphasize that this collection machinery is provided by the AbstractTables package, not StructuredQueries. As we see above, the latter provides a framework for representing a query structure, whereas packages such as AbstractTables (i) decide what it means to execute a query with a particular structure against a particular backend, and (ii) provide the implementation of the behavior in (i).

We provide a convenience macro, @collect(qry), which is equivalent to collect(@query(qry)), for when one wishes to query and collect in the same command:

julia> @collect iris |>
           filter(erf(petal_length) / petal_length > log(sepal_width) / 1.5) |>
           summarize(sum = sum(ifelse(rand() > .5, sin(petal_width), 0.0)))
Tables.Table
│ Row │ sum       │
├─────┼───────────┤
│ 1   │ 0.0998334 │

Again, note the patterns of name resolution: names of functions (e.g. erf) invoked within the context of a query argument are evaluated within the enclosing scope of the @query invocation, whereas names in the arguments of such functions (e.g. petal_length) are taken to be attributes of the data source (i.e., iris).

Dummy sources

We saw above how there are three parts to a query structure: verbs, sources and query arguments. A Query object represents the verbs and query arguments together in the QueryNode graph and wraps the data source separately. This suggests that one ought to be able to generate query graphs using @query even if one does not specify a particular data source. One can do precisely this by using dummy sources, which are essentially placeholders that can be “filled in” with particular data sources later, when one calls collect. To indicate a source as a dummy source, simply prepend it with a :. For instance:

julia> q = @query select(:src, twice_sepal_length = 2 * sepal_length)
Query with dummy source src

julia> collect(q, src = iris)
Tables.Table
│ Row │ twice_sepal_length │
├─────┼────────────────────┤
│ 1   │ 10.2               │
│ 2   │ 9.8                │
│ 3   │ 9.4                │
│ 4   │ 9.2                │
│ 5   │ 10.0               │
│ 6   │ 10.8               │
│ 7   │ 9.2                │
│ 8   │ 10.0               │
│ 9   │ 8.8                │
│ 10  │ 9.8                │
⋮
with 140 more rows.

Whatever the name of the dummy source (minus the :) was in the query must be the key in the kwarg passed to collect. Otherwise, the method will fail:

julia> collect(q, tbl = iris)
ERROR: ArgumentError: Undefined source: tbl. Check spelling in query.
 in #collect#5(::Array{Any,1}, ::Function, ::StructuredQueries.Query{Symbol}) at /Users/David/.julia/v0.6/StructuredQueries/src/query/collect.jl:23
 in (::Base.#kw##collect)(::Array{Any,1}, ::Base.#collect, ::StructuredQueries.Query{Symbol}) at ./<missing>:0
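The binding of a dummy source at collection time can be sketched as follows. This is only an illustration in current Julia syntax: DummyQuery and collect_with are hypothetical names, a plain function stands in for the query graph, and a vector stands in for a table.

```julia
# Hypothetical miniature of a query over a dummy source.
struct DummyQuery{F}
    source::Symbol   # placeholder name, e.g. :src
    kernel::F        # what to do with the source once it is supplied
end

# Look for a keyword whose name matches the placeholder; fail otherwise.
function collect_with(q::DummyQuery; sources...)
    for (name, data) in sources
        name == q.source && return q.kernel(data)
    end
    throw(ArgumentError("Undefined source: expected keyword `$(q.source)`"))
end

q = DummyQuery(:src, col -> 2 .* col)
println(collect_with(q, src = [1, 2, 3]))
```

Passing any other keyword, such as tbl, reaches the throw, mirroring the ArgumentError shown above.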

The two problems

Now that we’ve seen what the SQ query framework itself consists of, we can discuss how such a framework may help to solve the column-indexing and nullable semantics problems.

Type-inferability

Recall that the column-indexing problem consists in the inability of type inference to detect the return type of

function f(df::DataFrame)
    A = df[:A]
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

What would make f above amenable to type inference is to pass A = df[:A] above to an inner function that executes the loop, for instance

function f_inner(A)
    x = zero(eltype(A))
    for i in 1:length(A)
        x += A[i]
    end
    return x
end

As long as f_inner does not get inlined, type inference will run “at” the point at which the body of f calls f_inner and will have access to the eltype of df[:A], since the latter is passed as an argument to f_inner.
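The effect of the barrier can be checked directly with inference queries. The sketch below uses current Julia syntax and a Dict{Symbol,Any} as a stand-in for a DataFrame (an assumption made to keep the example self-contained); colsum and colsum_inner are hypothetical names.

```julia
# A Dict with untyped values stands in for a DataFrame: the column's
# element type is invisible to the compiler at the lookup site.
tbl = Dict{Symbol,Any}(:A => [1.0, 2.0, 3.0])

# Inner function: receives the column itself, so eltype(A) is known.
function colsum_inner(A)
    x = zero(eltype(A))
    for i in eachindex(A)
        x += A[i]
    end
    return x
end

# Outer function: only does the lookup, then crosses the barrier.
colsum(t) = colsum_inner(t[:A])

# Ask inference what each function returns for the given argument types.
inner_type = first(Base.return_types(colsum_inner, (Vector{Float64},)))
outer_type = first(Base.return_types(colsum, (Dict{Symbol,Any},)))
println((inner_type, outer_type))
```

The inner function should be inferred with a concrete return type, while the outer one should not be; the loop itself nevertheless runs in the specialized inner method.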

This strategy of introducing a function barrier also works when one requires multiple columns. For instance, suppose I wanted to generate a new column C where C[i] = g(A[i], B[i]). The following solution is type-inferable since the type parameters of the zipped iterator zip(A, B) reflect the eltypes of A and B:

function f(g, df)
    A, B = df[:A], df[:B]
    C = similar(A)
    f_inner!(C, g, zip(A, B))
    return DataFrame(C = C)
end

function f_inner!(C, g, itr) # bang because mutates C
    # enumerate supplies the row index i alongside each (a, b) pair
    for (i, (a, b)) in enumerate(itr)
        C[i] = g(a, b)
    end
    return C
end

In other words: If one intends to iterate over the rows of some subset of columns of a DataFrame, then at some point there must be a function barrier through which is passed an argument whose signature reflects the eltypes of the relevant columns.

The manipulation described above could be expressed for a column-indexable table (e.g. a Table object) as

@query select(tbl, C = A * B)

The collection machinery that supports this query against, say, a Table source essentially follows the above pattern of f and f_inner. That is, an outer function passes a “scalar kernel” (here, row -> row[1] * row[2]) that reflects the structure of A * B and a “row iterator” (here zip(tbl[:A], tbl[:B])) to an inner function that computes the value of the scalar kernel applied to the “rows” returned by iterating over the row iterator. (Note that the argument to the scalar kernel is assumed to be a Tuple whose individual elements assume the positions of named attributes (such as A and B) in the body of the “value expression” (here A * B) from which the scalar kernel is generated).

The scalar kernel and the information about which columns to extract from tbl and zip together are all stored in the QueryNode graph produced by @query. Much of the work in producing such a graph consists in extracting such information from the qry expression (here select(tbl, C = A * B)) and processing it to produce (i) a lambda that captures the form of the transformation (A * B), (ii) a Symbol that names the resultant column (C), and (iii) a Vector{Symbol} that lists the relevant argument column names ([:A, :B]) in the order they are encountered during the production of the lambda.
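The extraction step can be sketched as a small expression walker. Everything below is a hypothetical illustration in current Julia syntax, not the package's implementation: process_arg, build_kernel_expr and replace_names are made-up names, and eval is used at the end only so the result can be checked interactively, whereas the package does this work at macroexpand-time.

```julia
# Literals and anything else that is not a name or expression pass through.
replace_names(x, fields) = x

# A bare name is taken to be a column: record it and substitute row[i].
function replace_names(s::Symbol, fields)
    i = findfirst(isequal(s), fields)
    if i === nothing
        push!(fields, s)
        i = length(fields)
    end
    return :(row[$i])
end

# In a call, keep the function name and rewrite only its arguments.
function replace_names(ex::Expr, fields)
    if ex.head == :call
        args = [replace_names(a, fields) for a in ex.args[2:end]]
        return Expr(:call, ex.args[1], args...)
    end
    return Expr(ex.head, [replace_names(a, fields) for a in ex.args]...)
end

# Turn a value expression into a lambda expression plus the column names.
function build_kernel_expr(value)
    fields = Symbol[]
    body = replace_names(value, fields)
    return :(row -> $body), fields
end

# Split `C = A * B` into the result name, the argument fields and the kernel.
function process_arg(ex::Expr)
    ex.head == :(=) || error("expected `name = expression`")
    kernel_expr, fields = build_kernel_expr(ex.args[2])
    return ex.args[1], fields, kernel_expr
end

result, fields, kernel_expr = process_arg(:(C = A * B))
kernel = eval(kernel_expr)   # for demonstration only
println((result, fields, kernel((2, 3))))
```

Leaving the function name untouched while rewriting its arguments is what produces the name-resolution pattern described earlier: functions come from the enclosing scope, their arguments from the data source.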

Note that these data (a scalar kernel and result and argument fields) are not necessary to generate SQL code from a raw query argument, say the Expr object :( C = A * B ). Thus, one might argue that it is somewhat wasteful to compute such data and store it in the QueryNode graph when one might be able to compute the data at run-time dispatch of collect on a Query{S} where S is a type that satisfies the column-indexable interface. This is a good point, but there are two considerations to take into account. The first is that computing the scalar kernel and extracting the result and argument fields from the query argument is probably not prohibitively expensive. The second is that generating the scalar kernel at run-time (i) involves use of eval, which is to be avoided, and (ii) may involve a lot of work to re-incorporate the module information of names appearing in the expression to be eval’d into a scalar kernel. For now, it is easiest to generate scalar kernels at macroexpand-time and let them come along for the ride in the QueryNode graph even if the latter is to be collected against a data source (e.g. a SQL connection) that doesn’t need such data.

The use of metaprogramming to circumvent problems of type-inferability is not a new strategy. Indeed, it is the basis for the DataFramesMeta manipulation framework. The interested reader is referred here and here for more on the history and motivation for these endeavors.

The hard question of nullable semantics

Recall the hard question of nullable semantics involves implementing a given lifting semantics – that is, a given behavior for f(x::Nullable{T}) given a defined method f(x::T) – in a “general” way.

One solution – perhaps the most obvious, and which I have previously endorsed – involves defining the method f(x::Nullable{T}) as something like

function f{T}(x::Nullable{T})
    if isnull(x)
        # U stands for the return type of f(x::T)
        return Nullable{U}()
    else
        return Nullable(f(x.value))
    end
end

with natural analogues for methods with n-ary arguments. This process is a bit cumbersome, but it would not be difficult to automate with a macro with which one could annotate the original definition f(x::T). Call this approach the “method extension lifting” approach.

The method extension lifting approach is very flexible. However, it does face some difficulties. One must somehow decide which functions should be lifted in this manner, and it’s not clear how this line (between lifted and non-lifted functions) ought to be drawn. And if one cannot edit the definition of a function then a macro is of no use; one must manually introduce the lifted variant.

There is a further problem. If one wants to support lifting over arguments with “mixed” signatures – i.e. signatures in which some argument types are Nullable and some are not – then one has either to extend the promotion machinery or to define methods for mixed signatures, e.g. +{T}(x, y::Nullable{T}). That may end up being a lot of methods. Even if their definition can be automated with metaprogramming, the compilation costs associated with method proliferation may be considerable (but I haven’t tested this).

Finally, there is the problem described in NullableArrays.jl#148. I won’t repeat the entire argument here. The summary of this problem is: if one is going to rely on a minimal set of lifted operators to support generic lifting of user-defined functions, those user-defined functions essentially have to give up much of multiple dispatch.

The difficulties associated with method extension lifting are not insurmountable, but the solution – namely, keeping a repository of lifted methods – requires an undetermined amount of maintenance and coordination.

Another way to implement standard lifting semantics is by means of a higher-order function – that is, on Julia 0.5 where higher-order functions are performant. Such a function – call it lift – might look like the following:

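function lift(f, x)
    if hasvalue(x)
        # apply f to the wrapped value, not to the Nullable itself
        return Nullable(f(x.value))
    else
        # infer the result type from the wrapped element type
        U = Core.Inference.return_type(f, (eltype(typeof(x)),))
        return Nullable{U}()
    end
end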

This definition can naturally be extended to methods with more than one argument. The primary advantage of this approach over method extension lifting is its generality: one needs only to define one (two, three) higher-order lift method to support lifting of all functions of one (two, n) argument(s), as opposed to having to define a lifted version for each such function. Note that as long as hasvalue has some generic fallback method for non-Nullable arguments, such lift functions cover both standard and mixed-signature lifting. (Ideally one would ensure that the code is optimized for when types are non-Nullable; in particular, one would ensure that the dead branch is removed – cf. julia#18484.) Call this approach the “higher-order lifting” approach.

So, with the higher-order lifting approach we might better avoid method proliferation and generality worries, which is nice. However, now we require users to invoke lift everywhere. In particular, to lift f(g(x)) over a Nullable argument x, one needs to write lift(f, lift(g, x)). The least we could do in this case is provide an @lift macro that, say, traverses the AST of f(g(x)) and replaces each function call f(...) by an invocation of lift(f, ...). That might be reasonable, but it’s still an artifact of implementation details of support for missing values, and ideally it would not be exposed to users.
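A rough sketch of such an @lift macro follows, written in current Julia syntax. Since Nullable is no longer part of current Julia, a small hypothetical Maybe type stands in for it; hasvalue, unwrap and valtype_of are illustrative helpers with generic fallbacks for plain values, and Base.promote_op (an internal helper that may change) is used here in place of Core.Inference.return_type.

```julia
# Hypothetical stand-in for Nullable{T}.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)      # null: value left unset
    Maybe{T}(v) where {T} = new{T}(true, v)
end
Maybe(v::T) where {T} = Maybe{T}(v)

# Generic fallbacks let plain values mix with Maybe arguments.
hasvalue(x::Maybe) = x.hasvalue
hasvalue(x) = true
unwrap(x::Maybe) = x.value
unwrap(x) = x
valtype_of(::Maybe{T}) where {T} = T
valtype_of(x) = typeof(x)

# One n-ary lift covers standard and mixed signatures.
function lift(f, xs...)
    if all(hasvalue, xs)
        return Maybe(f(map(unwrap, xs)...))
    else
        U = Base.promote_op(f, map(valtype_of, xs)...)
        return Maybe{U}()
    end
end

# Rewrite every call f(a, b) in an expression into lift(f, a, b).
lift_calls(x) = x
function lift_calls(ex::Expr)
    args = map(lift_calls, ex.args)
    ex.head == :call && return Expr(:call, :lift, args...)
    return Expr(ex.head, args...)
end

macro lift(ex)
    return esc(lift_calls(ex))
end

x = Maybe(2)
y = @lift sqrt(abs(x) + 2)   # lift(sqrt, lift(+, lift(abs, x), 2))
println((y.hasvalue, y.value))
```

The sketch ignores keyword arguments, dot-broadcast calls and similar syntax, which a fuller traversal would have to handle.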

Recall that the present query framework extracts the “value expression” of a query argument (for instance, A * B in the query argument C = A * B) and generates a lambda that mimics the former’s structure (in this case, row -> row[1] * row[2]). A proposed modification (see AbstractTables#2) to this process is to modify the AST of the value expression (A * B) by appropriately inserting calls to lift, e.g.

row -> lift(*, row[1], row[2])

While this is a simple way to achieve standard lifting semantics, the approach (which is currently employed by the column-indexing collection machinery) does not easily support non-standard lifting semantics such as three-valued logic.

The higher-order lifting approach is not without its own drawbacks. Most notably, non-standard lifting semantics, such as three-valued logic, are more difficult to implement and are subject to restrictions that do not apply to the method extension lifting approach. The details of this difficulty are the proper subject of another blog post. The summary of the problem is: higher-order lifting (via code transformation, such as within @query) can only give non-standard lifting semantics to methods called explicitly within the expression passed to @query. That is,

@query filter(tbl, A | B)

can be given, say, three-valued logic semantics via higher-order lifting, but

f(x, y) = x | y
@query filter(tbl, f(A, B))

cannot.
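The restriction can be reproduced with a toy code transformation. In this sketch (current Julia syntax) nothing plays the role of a null Bool, or3 is a hypothetical three-valued “or”, and @tvl is a made-up macro that rewrites only the calls to | that appear literally in the expression it receives.

```julia
# Three-valued "or": true wins, otherwise an unknown operand stays unknown.
or3(a::Bool, b::Bool) = a | b
or3(a::Bool, ::Nothing) = a ? true : nothing
or3(::Nothing, b::Bool) = b ? true : nothing
or3(::Nothing, ::Nothing) = nothing

# Replace explicit calls to | with or3; leave every other call alone.
rewrite_or(x) = x
function rewrite_or(ex::Expr)
    args = map(rewrite_or, ex.args)
    if ex.head == :call && args[1] == Symbol("|")
        return Expr(:call, :or3, args[2:end]...)
    end
    return Expr(ex.head, args...)
end

macro tvl(ex)
    return esc(rewrite_or(ex))
end

A, B = true, nothing
println(@tvl A | B)      # the | is visible to the macro and gets rewritten

f(x, y) = x | y          # this | lives inside f, out of the macro's reach
```

Calling @tvl f(A, B) leaves the call to f untouched, so the ordinary | runs inside f and fails for a null operand, whereas a method extension approach would have changed the behavior of | itself.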

Which approach to solving the hard question of Nullable semantics is better? It really is not clear. Right now, the Julia statistics community is trying out both solutions. I am hopeful time and experimentation will yield new insights.

SQL backends

Above we have seen (i) how the implementation of a generic querying interface suggested a solution to the column-indexing and the Nullable semantics problems and (ii) how these latter solutions may be implemented in a manner generic over so-called column-indexable in-memory Julia tabular data structures. But we haven’t said anything about how the interface is generic over tables other than in-memory Julia objects. In particular, we desire that the above framework be applicable to SQL database connections as well.

Yeesian Ng, who provided invaluable feedback and ideas during the development of SQ, also began to develop such an extension in a package called SQLQuery. We are working to further integrate it with StructuredQueries in SQLQuery.jl#2, and we encourage the reader to stay tuned for updates concerning this endeavor.

Roadmap and open questions

There is a general roadmap available at StructuredQueries.jl#19. I’ll briefly describe some of what I believe are the most pressing/interesting open questions.

Interpolation syntax and implementation are both significant open questions. Suppose I wish to refer to a name in the enclosing scope of an @query invocation. A straightforward syntax would be to prepend the interpolated variable with $, as in

c = .5
q = @query filter(tbl, A > $c)

How should this be implemented? For full generality, we would like to be able to “capture” c from the enclosing scope and store it in q. One way to do so is to include c in the closure of a lambda () -> c that we store in q. However, there is the question of how to deal with problems of type-inferability. Solving this problem may either require or strongly suggest some sort of “parametrized queries” API by which one can designate a name inside of a query argument context as a parameter that can then be bound after the @query invocation, e.g. specified as kwargs to collect or to a function like bind!(q::Query[; kwargs...]).
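One possible shape for such a parametrized query is sketched below in current Julia syntax. ParamFilter and collect_filter are hypothetical names, not part of any package, and a plain vector stands in for a column. The kernel takes the parameters as a second argument, and they are supplied as keyword arguments at collection time.

```julia
# Hypothetical filter whose threshold is a named parameter, not a captured value.
struct ParamFilter{F}
    kernel::F   # (value, params) -> Bool
end

function collect_filter(q::ParamFilter, column; params...)
    # values(params) is a NamedTuple, so the parameter types are visible
    # to the compiler when the kernel is called.
    p = values(params)
    return [x for x in column if q.kernel(x, p)]
end

# Roughly what `filter(tbl, A > $c)` might lower to under this design.
q = ParamFilter((a, p) -> a > p.c)
println(collect_filter(q, [0.2, 0.7, 0.9]; c = 0.5))
```

Because the parameter arrives through a typed argument rather than through a captured global, the loop inside collect_filter sits behind the same kind of function barrier discussed earlier.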

We are also still deciding what the general syntax within a query context should look like. A big part of this decision concerns how aliasing and related functionality ought to work. See StructuredQueries.jl#21 for more details. This issue is similar to that of interpolation syntax insofar as both involve name resolution within different query contexts (e.g. in a data source specification context vs. a query argument context).

Finally, extensibility of not only collect but also of the graph generation facilities is an important issue, of which we hope to say more in a later post.

As mentioned above, DataFramesMeta is a pioneering approach to enhancing tabular data support in Julia via metaprogramming. Another exciting (and slightly more mature than the presently discussed package) endeavor in the realm of generic data manipulation facilities support is Query.jl by David Anthoff. Query.jl and SQ are very similar in their objectives, though different in important respects. A comparison of these packages is the proper topic of a separate blog post.

Conclusion

The foregoing post has described a work in progress: not just the StructuredQueries package, but also the Julia statistical ecosystem. Though it will likely take a while for this ecosystem to mature, the general trend I’ve observed over the past two years is encouraging. It’s also worth noting that much of what is described above would have been difficult to conceive without developments of the Julia language. In particular, performant higher-order functions and type-inferable map have both allowed us to explore solutions that were previously made difficult by the amount of metaprogramming required to ensure type-inferability. It will be interesting to see what we can come up with given the improvements to Julia in 0.6 and beyond.

I’m very grateful to John Myles White for his guidance on this project, to Yeesian Ng at MIT for his collaboration, to Viral Shah and Alan Edelman for arranging this opportunity, and to many others at Julia Central and elsewhere for their help and insight.

A Personal Perspective On JuliaCon 2016

2016-09-21T00:00:00+00:00

The gentle breeze brushed my face and the mild sunshine warmed an otherwise chilly morning. I was standing in front of a large building that can only be described as unique: a series of metal plates jutting out at odd angles, whose dull resplendence cast an instant impression. It was the Ray and Maria Stata Centre, a towering monolith and the venue for an event that people from all over the world came to attend and participate in. I was in town for the third edition of JuliaCon, the annual Julia Conference at MIT.

On the eve of JuliaCon, a series of workshops were organised on some important areas people use Julia for. I was conducting the Parallel Computing workshop along with some other members of the JuliaLab. The key idea in our workshop was to show users the many different ways of writing and executing parallel code in Julia. I was talking about easy GPU computing using my package called ArrayFire and achieving acceleration using Julia’s multi-threading.

Day 1 started off with the first keynote speaker - Guy Steele, a stalwart in the software industry and an expert in programming language design. He spoke about his adventures designing Fortress, a language that was intended to be good at mathematical programming. He went through the key design principles and tradeoffs: from the type hierarchy, to their model for parallelism (automatic work-stealing), and interesting choices (such as non-transitive operator precedence). My colleague Keno Fischer was up next with a tour of the new Julia Debugger: Gallium! Gallium was quite breathtaking in its complexity and versatility, so much so that Keno himself uses it to debug code in C and C++! A powerful debugger becomes even better with GUI-integration, which Mike Innes very usefully pitched in with during his demo of Juno-Gallium integration. Stepping, printing and breakpoints promised a powerful package development experience.

The next session was all about data science. Simon Byrne spoke about the data science ecosystem in Julia and future plans. He touched on the famous problem with DataFrames, and then laid out a roadmap for the ecosystem. The rest of the session featured an interesting demo in music processing, while Arch Robison showed us how to use Julia as a code generator.

The evening had two sessions in parallel in different rooms. This is a recurrent feature of JuliaCon, and it’s always hard to decide which session to attend. This time, I chose to attend the sessions on automatic differentiation in JuMP and forward differentiation using ForwardDiff.jl. I also didn’t want to miss the talk on iterative methods for sparse linear systems. The performance of different techniques and approaches was compared and evaluated, which made for a compelling presentation that I really enjoyed.

The evening session featured Jeffrey Sarnoff, one of the sponsors of JuliaCon 2016. Mr. Sarnoff had some very interesting thoughts on extended precision floating point numbers. And so ended the first day at JuliaCon. Now it was time to head to the JuliaHouse! The JuliaHouse was an AirBnb that a bunch of Julia contributors rented out. They had a yard and a barbecue and it was the ideal place for people to go relax, unwind and network with the other Julia folks. People chilled there till the wee hours of the morning, and somehow made it on time for the next day’s session.

The second day started with a keynote speech by Professor Tim Holy, a prolific contributor to the Julia language and ecosystem. He spoke about the state of arrays in Julia and showed us a few of his ideas for iterators. I saw that Professor Holy is widely admired in the entire Julia community due to his involvement in various packages and the key issues on the language. I noticed that he asked some pretty neat insightful questions at various earlier sessions too. Stefan was up next with his super-important Julia 1.0 talk. It was quite a comprehensive list of things that needed to be done before Julia would be 1.0 ready and he touched on a variety of areas such as the compiler, the type system, the runtime, multi-threading, strings and so on.

The next session saw a team from UC Berkeley show off their autonomous racing car that uses some optimization packages (JuMP and Ipopt in particular) to solve real-time optimization problems. Julia was running on an ARM chip with Ubuntu 14.04 installed. Julia can also run on the Raspberry Pi, and my colleague Avik took some time to show off a cool Minecraft demo running on the Pi. The talk after that was about JuliaBox. Nishanth, another colleague of mine, has been hard at work porting JuliaBox to Google Cloud from AWS, and he spoke about his exciting plans for JuliaBox.

Post lunch, I had to choose again between parallel sessions, but I couldn’t quite resist the session with stochastic PDEs and Finite Elements. Kristoffer Carlsson reviewed the state of FEM in Julia, talking about the packages and ecosystem for every FEM step from assembly to the conjugate gradient. The next talk was given by a professor at TU Vienna whose group conducts research on nano-biosensors, and the group uses Julia to solve the stochastic PDEs that come up when trying to model noise and fluctuations. The next talk on astrodynamics was very interesting in that it gave me an insight into the kinds of computational challenges faced by scientists in the field. There were also some interesting demos which I enjoyed, particularly the one where we modelled and visualized a target orbit, which was superimposed upon a visual of the earth in space.

In the afternoon, after much consideration, I went to the session that featured statistical modelling and least squares. The first talk on sparse least squares optimization problems gave me a flavor of the kinds of models and problems economists need to solve, and how the Julia ecosystem helps them. The next talk on computational neuroscience focussed on dealing with tens of terabytes of brain data coming from both animals and human surgery patients. I had a very interesting discussion with John earlier about his work, and I was able to get a keen sense of why the package he was talking about (VinDsl.jl) was important for his work. And so ended Day 2 at JuliaCon, a highly educational day for me personally, with insights into astrodynamics, finite elements and computational neuroscience.

I would contend that one of the best ways to begin your day is to listen to a speech by a Nobel Laureate. It was quite a surreal experience listening to Professor Tom Sargent, and to see him excited by Julia. He gave us a flavor of macroeconomics research and introduced dynamic programming squared problems that were “a walking advertisement for Julia”. As a case in point, the next session on DSGE models in Julia highlighted the benefits Julia can bring to macroeconomics research and analysis.

The next session had a bunch of Julia Summer of Code (JSOC) students present their projects. Some couldn’t make it to the conference so they presented their work through Google Hangouts or through pre-recorded video. Unfortunately, I couldn’t catch all of them because I wanted to catch my colleague Jameson’s Machine Code talk which was in another room. The material he spoke about was very interesting, and got me thinking about the Julia compiler. I also had a very enlightening discussion with him later about the Julia parser.

It turned out that in the afternoon, I was crunched for time. I was helping Shashi plug ArrayFire into Dagger.jl for his talk that was due in a couple of hours, while also working on my own ArrayFire notebooks for later that evening. But we managed to pull through in time. So the afternoon session had Shashi presenting Dagger, his out-of-core framework, followed by a tour of ParallelAccelerator from the IntelLabs team. I have been following ParallelAccelerator for a while, and I’m excited by how certain aspects of it (such as automatic elimination of bounds checking) can be incorporated into Base Julia.

The evening session showed people how they can accelerate their code in Julia. The speaker before me covered vectorization with Yeppp before I covered GPU acceleration with ArrayFire. It was quite overwhelming to be speaking in front of a bunch of experts, but I think I did okay. But I did finish 5 minutes faster than my allotted time. As it turned out, both parallel sessions actually ended up concluding a few minutes early.

Finally, Andreas came up to the podium for the concluding remarks and closed off a very important JuliaCon for me personally. I was able to appreciate the various kinds of people involved in the Julia community: from those who worked on the core language to those who worked on their own packages as part of their research; from those who worked on Julia part-time to those (like myself) who worked full-time; from the relatively uninitiated JSOC students to experienced old-timers in the community. One thing tied them all together though: a quite thorough appreciation of a new language whose flexibility and power enabled people to solve important problems, whose community’s openness and sense of democracy welcomed more smart people, and the idea that a group of individuals on different time zones and from different walks of life can drive a revolution in scientific computing.

BioJulia 2016 - online sequence search, sequence demultiplexing, new readers and much more!

2016-09-10T00:00:00+00:00

We are pleased to announce the release of Bio.jl 0.4, a minor release that includes significant functionality improvements, as I promised in the previous blog post.

The following features have been added since that post:

Online sequence search algorithms.
Sequence data structure for reference genomes.
Data reader and writer for the .2bit file format.
Data reader and writer for the SAM and BAM file formats.
Sequence demultiplexing tool.
Package to handle BGZF files.

And many other miscellaneous performance and usability improvements! Tutorial notebooks are available at https://github.com/BioJulia/BioTutorials. Here I briefly introduce you to these new features one by one.

Online sequence search algorithms

Sequence search is an indispensable tool in sequence analysis. Since the last post, I have added exact, approximate and regex search algorithms. The search interface of Bio.jl mimics that of Julia’s standard library.

julia> using Bio.Seq

julia> seq = dna"ACAGCGTAGCT"
11nt DNA Sequence:
ACAGCGTAGCT

# Exact search.
julia> search(seq, dna"AGCG")
3:6

# Approximate search with one error or less.
julia> approxsearch(seq, dna"AGGG", 1)
3:6

# Regular expression search.
julia> search(seq, biore"AGN*?G"d)
3:6

Sequence data structure for reference genomes

In Bio.jl, DNA sequences are encoded using 4 bits per base by default in order to store ambiguous nucleotides, and this encoding works well in most cases. However, some biological sequences, such as the chromosomal sequences of eukaryotic organisms, are so long that the default encoding wastes memory. ReferenceSequence is a new type introduced in Bio.jl that compresses positions of ambiguous nucleotides using a sparse bit vector. This type can achieve almost 2-bit encoding in common reference sequences because most of the ambiguous nucleotides are clustered in a sequence and their number is small compared to the number of unambiguous nucleotides.

# Converting a DNASequence object to ReferenceSequence.
julia> ReferenceSequence(dna"ACGT"^10000)
40000nt Reference Sequence:
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG…CGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

# Reading chromosome 1 of human from a FASTA file.
julia> open(first, FASTAReader{ReferenceSequence}, "hg38.fa")
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Bio.Seq.FASTAMetadata}:
  name: chr1
  sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  metadata: Bio.Seq.FASTAMetadata("")

julia> sequence(ans)
248956422nt Reference Sequence:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
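The idea behind the compact encoding can be sketched as follows. This is not Bio.jl's implementation: TwoBitSeq, encode and base_at are hypothetical names, and for brevity the positions of N are kept as a list of ranges rather than in the sparse bit vector that ReferenceSequence uses.

```julia
# Hypothetical layout: 2 bits per base plus a side list of N positions.
struct TwoBitSeq
    data::Vector{UInt8}               # four bases packed into each byte
    nranges::Vector{UnitRange{Int}}   # runs of ambiguous (N) positions
    len::Int
end

const CODES = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)
const BASES = ('A', 'C', 'G', 'T')

function encode(s::AbstractString)
    data = zeros(UInt8, cld(length(s), 4))
    nranges = UnitRange{Int}[]
    for (i, c) in enumerate(s)
        if c == 'N'
            # Extend the current run of Ns or start a new one.
            if !isempty(nranges) && last(nranges[end]) == i - 1
                nranges[end] = first(nranges[end]):i
            else
                push!(nranges, i:i)
            end
        else
            byte, slot = divrem(i - 1, 4)
            data[byte + 1] |= CODES[c] << (2 * slot)
        end
    end
    return TwoBitSeq(data, nranges, length(s))
end

function base_at(seq::TwoBitSeq, i::Integer)
    any(r -> i in r, seq.nranges) && return 'N'
    byte, slot = divrem(i - 1, 4)
    return BASES[((seq.data[byte + 1] >> (2 * slot)) & 0x03) + 1]
end

seq = encode("ACGTNNNNAC")
println(join(base_at(seq, i) for i in 1:seq.len))
```

Because ambiguous positions tend to come in long runs, the side structure stays tiny and the total cost approaches 2 bits per base.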

Data reader and writer for the 2bit file format

2bit is a binary file format to store reference sequences. This is a kind of binary counterpart of FASTA but specialized for DNA reference sequences to enable smaller file size and faster loading. Reference sequences of various organisms are distributed from the download page of UCSC in this file format. An important advantage of 2bit is that sequences are indexed by name and can be accessed immediately.

# Opening a sequence file of yeast (S.cerevisiae).
julia> reader = open(TwoBitReader, "sacCer3.2bit");

# Loading a chromosome VI using random access index.
julia> reader["chrVI"]
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Array{UnitRange{Int64},1}}:
  name: chrVI
  sequence: GATCTCGCAAGTGCATTCCTAGACTTAATTCATATCTGC…GTGTGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGG
  metadata: UnitRange{Int64}[]

Data reader and writer for the SAM and BAM file formats

The SAM and BAM file formats are designed for storing sequences aligned to reference sequences. SAM is a line-oriented text file format and easy to handle with UNIX command line tools. BAM is a compressed binary version of SAM, suitable for storing data on disk and processing with purpose-built software like samtools. The BAM data reader is carefully tuned so that users can use it in real analysis with large files. It is also feasible to read a CRAM file by combining the BAM reader with the samtools view command.

An experimental feature is parallel processing using multiple threads. Multi-threading support was introduced in Julia 0.5 and we use it to parallelize decompression of BAM files. Here is a simple benchmark script to show how much reading speed can be improved with multiple threads:

using Bio.Align

# Count the number of mapped records.
function countmapped(reader)
    ret = 0
    record = BAMRecord()
    while !eof(reader)
        # in-place reading
        read!(reader, record)
        if ismapped(record)
            ret += 1
        end
    end
    return ret
end

println(open(countmapped, BAMReader, ARGS[1]))

The JULIA_NUM_THREADS environment variable controls the number of worker threads. The result below shows that the elapsed time drops from about 29 seconds to about 17 seconds (roughly 40% less) with two threads:

~/.j/v/Bio $ time julia countmapped.jl SRR1238088.sort.bam
28040186
       29.27 real        28.64 user         0.66 sys
~/.j/v/Bio $ env JULIA_NUM_THREADS=2 time julia countmapped.jl SRR1238088.sort.bam
28040186
       17.40 real        32.31 user         0.63 sys
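As a small worked calculation on the numbers above, the speedup and parallel efficiency can be computed directly from the two "real" times. The variable names are only illustrative.

```julia
# Wall-clock ("real") times copied from the benchmark output above.
t1 = 29.27   # seconds with the default single thread
t2 = 17.40   # seconds with JULIA_NUM_THREADS=2

speedup = t1 / t2          # how many times faster the two-thread run is
efficiency = speedup / 2   # fraction of the ideal 2x speedup achieved

println("speedup ≈ ", round(speedup, digits=2))
println("parallel efficiency ≈ ", round(efficiency, digits=2))
```

The "user" time rises from 28.64 to 32.31 seconds in the same output, so the two-thread run does slightly more total CPU work while finishing sooner.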

Package to handle BGZF files

BGZF (Blocked GZip Format) is a gzip-compliant file format commonly used in bioinformatics. BGZF can be read using standard gzip tools but files in the format are compressed block by block and special metadata are added to index the compressed files for random access. BAM files are compressed in this file format and sequence alignments in a specific genomic region can be retrieved efficiently. BGZFStreams.jl is a new package to handle BGZF files like usual I/O streams and it is built on top of our Libz.jl package. Parallel decompression mentioned above is implemented in this package layer.

julia> using BGZFStreams

julia> stream = BGZFStream("/Users/kenta/.julia/v0.5/BGZFStreams/test/bar.bgz")
BGZFStreams.BGZFStream{IOStream}(<mode=read>)

julia> readstring(stream)
"bar"

https://github.com/BioJulia/BGZFStreams.jl
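The "special metadata" that makes block-by-block access possible is a gzip extra subfield tagged `BC`, which stores the total size of the block minus one. Below is a minimal sketch, for current Julia (1.x), that reads this size from the start of a block. It is not BGZFStreams.jl's implementation: `bgzf_block_size` and `le16` are hypothetical helper names, and the byte layout is assumed from the BGZF section of the SAM/BAM specification. The test input is the standard 28-byte empty block that marks the end of a BGZF file.

```julia
# The 28-byte empty BGZF block that marks end-of-file (per the SAM/BAM specification).
const BGZF_EOF = UInt8[
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]

# Read a little-endian 16-bit integer starting at 1-based offset i.
le16(b, i) = Int(b[i]) | (Int(b[i + 1]) << 8)

# Return the total size in bytes of the BGZF block that starts at b[1].
function bgzf_block_size(b::AbstractVector{UInt8})
    # gzip magic bytes, deflate method, and the FEXTRA flag must be present
    (b[1] == 0x1f && b[2] == 0x8b && b[3] == 0x08 && (b[4] & 0x04) != 0) ||
        error("not a BGZF block")
    xlen = le16(b, 11)            # length of the extra field
    pos = 13                      # first extra subfield starts after the 12-byte header
    stop = 12 + xlen
    while pos + 3 <= stop
        si1, si2, slen = b[pos], b[pos + 1], le16(b, pos + 2)
        if si1 == UInt8('B') && si2 == UInt8('C') && slen == 2
            return le16(b, pos + 4) + 1   # BSIZE stores the block size minus one
        end
        pos += 4 + slen           # skip to the next subfield
    end
    error("BC subfield not found")
end

println(bgzf_block_size(BGZF_EOF))
```

Because each block announces its own size, a reader can hop from block to block without inflating anything, which is what makes it possible to hand independent blocks to different threads.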

Sequence demultiplexing tool

Sequence demultiplexing is a technique to distinguish the origin of a sequence using its artificially-attached “barcode” sequence. This is often used at a preprocessing phase after multiplexed sequencing, a common technique to sequence multiple samples simultaneously. A barcode sequence, however, may be corrupted due to sequencing error, and we need to find the best matching barcode from a barcode set. The demultiplexer algorithm implemented in Bio.jl is based on a trie-like data structure, and efficiently finds the optimal barcode from the prefix of a given DNA sequence.

# Set DNA barcode pool.
julia> barcodes = DNASequence["ATGG", "CAGA", "GGAA", "TACG"];

# Create a sequence demultiplexer that allows at most one error.
julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:hamming)
Bio.Seq.Demultiplexer{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 4
  number of correctable errors: 1

# Demultiplex a given sequence from its prefix.
julia> demultiplex(dplxr, dna"ATGGCGNT")  # 1st barcode with no errors
(1,0)

julia> demultiplex(dplxr, dna"CAGGCGNT")  # 2nd barcode with one error
(2,1)
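For comparison, the same answers can be reproduced with a brute-force scan over all barcodes. This is a deliberately naive sketch in current Julia (1.x), not the trie-based algorithm used in Bio.jl. The names `hamming` and `naive_demultiplex` are hypothetical, plain strings stand in for DNA sequences, and returning `(0, -1)` when no barcode is close enough is an assumption made only for this example.

```julia
# Number of mismatching positions between two equal-length strings.
hamming(a, b) = count(x != y for (x, y) in zip(a, b))

# Compare the prefix of `seq` against every barcode and keep the closest one.
# Returns (barcode index, distance), or (0, -1) if nothing is within n_max_errors.
function naive_demultiplex(barcodes, seq; n_max_errors=1)
    best, bestdist = 0, typemax(Int)
    for (i, bc) in enumerate(barcodes)
        length(seq) >= length(bc) || continue
        d = hamming(bc, seq[1:length(bc)])
        if d < bestdist
            best, bestdist = i, d
        end
    end
    return bestdist <= n_max_errors ? (best, bestdist) : (0, -1)
end

barcodes = ["ATGG", "CAGA", "GGAA", "TACG"]
println(naive_demultiplex(barcodes, "ATGGCGNT"))
println(naive_demultiplex(barcodes, "CAGGCGNT"))
```

The scan costs one comparison per barcode for every read, which is the work a trie-like structure avoids by sharing common prefixes.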

Next step

This is only the first half of my project this year. The next term will bring:

Supporting more file formats including GFF3, VCF and BCF.
Integration with databases.
Integration with genome browsers.

And, of course, improving existing features of Bio.jl and other packages. We welcome any contributions and feature requests from you all. The Gitter chat channel is the best place to communicate with developers and other users. If you love Julia and/or biology, is there any reason not to join us?

Acknowledgements

I gratefully acknowledge the Moore Foundation and the Julia project for supporting the BioJulia project. I also would like to thank Ben J. Ward and Kevin Murray for comments on my program code and other contributions.

Graft.jl - General purpose graph analytics for Julia

2016-08-22T00:00:00+00:00

This blog post describes my work on Graft.jl, a general purpose graph analysis package for Julia. For those unfamiliar with graph algorithms, a quick introduction might help.

Proposal

My proposal, titled ParallelGraphs, was to develop a parallelized/distributed graph algorithms library. However, in the first month or so, we decided to work towards a more general framework that supports data analysis on networks (graphs with attributes defined on vertices and edges). Our change in direction was mainly motivated by:

The challenges associated with distributed graph computations. This blog post was an eye opener.
Only very large graphs, of the order of terabytes or petabytes, require distributed execution. Most useful graphs can be analyzed on a single compute node.
Multi-threading is under heavy development, and we decided to wait for the full multi-threaded programming model to be available.
As we looked at public datasets, we felt that the ability to combine graph theoretic analyses with real world data was the missing piece in Julia. LightGraphs.jl already provides fast implementations for most graph algorithms, so we decided to target graph data analysis.

The modified proposal could be summarized as the development of a package that supports:

Vertex and edge metadata : Key value pairs for vertices and edges.
Vertex labelling : Allow vertices to be referenced, externally, through arbitrary Julia types.
SQL like queries for edge data and metadata.
Compatibility with LightGraphs

Graft

ParallelGraphs turned out to be a misnomer, since we were moving towards a more general purpose data analysis framework. So we chose the name Graft, a kind of abbreviation for Graph Toolkit. The following sections detail Graft's features:

Vertex and Edge Metadata

Graphs are often representations of real world entities, and the relationships between them. Such entities (and their relationships), often have data attached to them. While it is quite straightforward to store vertex data (a simple table will suffice), storing edges and their data is very tricky. The data should be structured on the source and target vertices, should support random access and should be vectorized for queries.

At first we tried placing the edge data in a SparseMatrixCSC. This turned out to be a bad idea, because sparse matrices are designed for numeric storage. A simpler solution is to store edge metadata in a DataFrame, and have a SparseMatrixCSC map edges onto indices for the DataFrame. This strategy needed a lot less code, and the benchmarks were more promising. Mutations such as the addition or removal of vertices and edges become more complicated however.

Vertex Labelling

Most graph libraries do not support vertex labelling. It can be very confusing to refer to a vertex by its (often long) integer identifier. It is also computationally expensive to use non-integer labels in the implementation of the package (any such implementation would involve dictionaries). There is no reason, however, for the user to have to use integer labels externally. Graft supports two modes of vertex labelling. By default, a vertex is identified by its internal identifier. A user can assign labels of any arbitrary Julia type to identify vertices, overriding the internal identifiers. This strategy, we feel, makes a reasonable compromise between user experience and performance.

If vertex labels were used in the internal implementation, the graph data structure would probably look like this:

  Dict(
     "Alice" => Dict(
        "age" => 34,
        "occupation"  => "Doctor",
        "adjacencies" => Dict("Bob" => Dict("relationship" => "follow"))
     ),
     "Bob" => Dict(
        "age" => 36,
        "occupation"  => "Software Engineer",
        "adjacencies" => Dict("Charlie" => Dict("relationship" => "friend"))
     ),
     "Charlie" => Dict(
        "age" => 30,
        "occupation"  => "Lawyer",
        "adjacencies" => Dict("David" => Dict("relationship" => "follow"))
     ),
     "David" => Dict(
        "age" => 29,
        "occupation" => "Athlete",
        "adjacencies" => Dict("Alice" => Dict("relationship" => "friend"))
     )
  )

Clearly, using labels internally is a very bad idea. Any sort of data access would set off multiple dictionary look-ups. Instead, if a bidirectional map could be used to translate labels into vertex identifiers and back, the number of dictionary lookups could be reduced to one. The data would also be better structured for query processing.

  # Label Map to resolve queries
  LabelMap(
     # Forward map : labels to vertex identifiers
     Dict("Alice" => 1, "David" => 4, "Charlie" => 3, "Bob" => 2),

     # Reverse map : vertex identifiers to labels
     String["Alice", "Bob", "Charlie", "David"]
  )

  # Vertex DataFrame
  4×2 DataFrames.DataFrame
  │ Row │ age │ occupation          │
  ├─────┼─────┼─────────────────────┤
  │ 1   │ 34  │ "Doctor"            │
  │ 2   │ 36  │ "Software Engineer" │
  │ 3   │ 30  │ "Lawyer"            │
  │ 4   │ 29  │ "Athlete"           │

  # SparseMatrixCSC : maps edges onto indices into Edge DataFrame
  4×4 sparse matrix with 4 Int64 nonzero entries:
     [4, 1]  =  1
     [1, 2]  =  2
     [2, 3]  =  3
     [3, 4]  =  4

  # Edge DataFrame
  4×1 DataFrames.DataFrame
  │ Row │ relationship │
  ├─────┼──────────────┤
  │ 1   │ "friend"     │
  │ 2   │ "follow"     │
  │ 3   │ "friend"     │
  │ 4   │ "follow"     │
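The layout above can be sketched in a few lines of current Julia (1.x). This is an illustration rather than Graft's actual code: the `LabelMap` struct here is a simplified stand-in, `edge_property` is a hypothetical helper, and plain vectors replace the DataFrame columns so that only the standard library is needed. The relationship values are taken from the nested dictionary example (Alice follows Bob, Bob is a friend of Charlie, and so on), with edge rows numbered as in the sparse matrix shown above.

```julia
using SparseArrays

# Bidirectional label map: labels to vertex identifiers and back.
struct LabelMap
    forward::Dict{String,Int}
    reverse::Vector{String}
end
LabelMap(labels::Vector{String}) =
    LabelMap(Dict(l => i for (i, l) in enumerate(labels)), labels)

labels = LabelMap(["Alice", "Bob", "Charlie", "David"])

# Vertex data, one entry per vertex identifier (stand-in for the vertex DataFrame).
age = [34, 36, 30, 29]

# Edge index: edgeindex[u, v] is the row of edge u -> v in the edge columns.
src = [4, 1, 2, 3]
dst = [1, 2, 3, 4]
edgeindex = sparse(src, dst, [1, 2, 3, 4], 4, 4)

# Edge data, one entry per edge row (stand-in for the edge DataFrame).
relationship = ["friend", "follow", "friend", "follow"]

# Look up an edge property by labels: one dictionary lookup per label,
# then plain integer indexing.
function edge_property(lm::LabelMap, index, column, from, to)
    u, v = lm.forward[from], lm.forward[to]
    row = index[u, v]
    return row == 0 ? nothing : column[row]
end

println(edge_property(labels, edgeindex, relationship, "Alice", "Bob"))
println(age[labels.forward["David"]])
```

Labels are touched only at the boundary. Everything behind the label map is integer-indexed, which keeps the columns contiguous and easy to scan in queries.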

SQL Like Queries

Graft’s query notation is borrowed from Jplyr. The @query macro simplifies the query syntax and accepts a pipeline of stages separated by the pipe operator |>. Each stage is one of the following abstractions:

eachvertex

Accepts an expression that is run over every vertex. Vertex properties can be expressed using dot notation. Some reserved properties are v.id, v.label, v.adj, v.indegree and v.outdegree. Examples:

  # Check if the user has overridden the default labels
  julia> @query(g |> eachvertex(v.id == v.label)) |> all

  # Kirchhoff's law :P
  julia> @query(g |> eachvertex(v.outdegree - v.indegree)) .== 0
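To show what a per-vertex expression such as the degree balance computes, here is a sketch that derives the same quantities from a sparse adjacency matrix in current Julia (1.x). It does not use Graft: the matrix `A` is a hypothetical four-vertex cycle, and the variable names are illustrative.

```julia
using SparseArrays

# A[u, v] = 1 when there is an edge u -> v (a directed 4-cycle here).
A = sparse([4, 1, 2, 3], [1, 2, 3, 4], [1, 1, 1, 1], 4, 4)

outdegree = vec(sum(A, dims=2))   # row sums: edges leaving each vertex
indegree  = vec(sum(A, dims=1))   # column sums: edges entering each vertex

balance = outdegree .- indegree
println(all(balance .== 0))
```

Every vertex in a cycle has one incoming and one outgoing edge, so the balance is zero everywhere.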

eachedge

Accepts an expression that is run over every edge. The symbol s denotes the source vertex, t denotes the target vertex, and e denotes the edge itself. Edge properties can be expressed using dot notation. Some reserved properties are e.source, e.target, e.mutualcount, and e.mutual. Examples:

  # Arithmetic expression on edge, source and target properties
  julia> @query g |> eachedge(e.p1 - s.p1 - t.p1)


  # Check if constituent vertices have the same outdegree
  julia> @query g |> eachedge(s.outdegree == t.outdegree)


  # Count the number of "mutual friends" between the source and target vertices in each edge
  julia> @query g |> eachedge(e.mutualcount)
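As a rough picture of what a "mutual friends" count involves, the sketch below intersects the neighbor sets of the two endpoints. It assumes that a mutual friend is any vertex adjacent to both the source and the target. That definition, the toy graph `adj`, and the standalone `mutualcount` function are assumptions for illustration and may differ from how Graft computes e.mutualcount.

```julia
# Neighbor sets of a small undirected toy graph.
adj = Dict(
    1 => Set([2, 3, 4]),
    2 => Set([1, 3]),
    3 => Set([1, 2, 4]),
    4 => Set([1, 3]),
)

# Number of vertices adjacent to both s and t.
mutualcount(adj, s, t) = length(intersect(adj[s], adj[t]))

println(mutualcount(adj, 1, 3))
println(mutualcount(adj, 1, 2))
```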

filter

Accepts vertex or edge expressions and computes subgraphs with a subset of vertices, or a subset of edges, or both. Examples:

  # Remove vertices where property p1 equals property p2
  @query g |> filter(v.p1 != v.p2)

  # Remove self loops from the graph
  @query g |> filter(e.source != e.target)

select

Returns a subgraph with a subset of vertex properties, or a subset of edge properties or both. Examples:

  # Preserve vertex properties p1, p2 and nothing else
  @query g |> select(v.p1, v.p2)

  # Preserve vertex property p1 and edge property p2
  @query g |> select(v.p1, e.p2)

Demonstration

The typical workflow we hope to support with Graft is:

Load a graph from memory
Use the query abstractions to construct new vertex/edge properties or obtain subgraphs.
Run complex queries on the subgraphs, or export data to LightGraphs and run computationally expensive algorithms there.
Bring the data back into Graft as a new property, or use it to modify the graph's structure.

The following examples should demonstrate this workflow:

Google+: This demo uses a real, somewhat large, dataset with plenty of text data.
Baseball Players: Two separate datasets spliced together, a table on baseball players and a trust network. The resulting data is quite absurd, but does a good job of showing the quantitative queries Graft can run.

Future Work

Graph IO : Support more graph file formats.
Improve the query interface: The current pipelined, macro-based syntax has a learning curve, and the macro itself does some eval at runtime. We would like to move towards a cleaner, composable syntax that reads like regular Julia code.
New abstractions, such as Group-by, sort, and table output.
Database backends : An RDBMS can be used instead of the DataFrames. Or Graft can serve as a wrapper on a GraphDB such as Neo4j.
Integration with ComputeFramework for out of core processing. Support for parallelized IO, traversals and queries.

More information can be found here

Acknowledgements

This work was carried out as part of the Google Summer of Code program, under the guidance of mentors: Viral B Shah and Shashi Gowda.

Announcing support for complex-domain linear programs in Convex.jl

2016-08-17T00:00:00+00:00

I am pleased to announce support for complex-domain linear programs (LPs) in Convex.jl. As one of the Google Summer of Code students under The Julia Language, I proposed to implement support for complex semidefinite programming. In the first phase of the project, I started by tackling complex-domain LPs: in the first subphase I announced support for complex coefficients at JuliaCon’16, and I now take this opportunity to announce support for complex variables in LPs.

Complex-domain LPs consist of a real linear objective function, real linear inequality constraints, and real and complex linear equality constraints.

In order to enable complex-domain LPs, we came up with these ideas:

We redefined the conic_form! of every affine atom to accept complex arguments.
Every complex variable z was internally represented as z = z1 + i*z2, where z1 and z2 are real.
We introduced two new affine atoms real and imag which return the real and the imaginary parts of the complex variable respectively.
transpose and ctranspose perform differently on complex variables so a new atom CTransposeAtom was created.
A complex equality constraint RHS = LHS can be decomposed into two corresponding real equality constraints, real(RHS) = real(LHS) and imag(RHS) = imag(LHS).
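The decomposition in the last two points can be checked numerically without a solver. The sketch below, in current Julia (1.x), writes x = xr + i*xi and verifies that the complex system A*x = b is equivalent to a real system of twice the size. The matrix, vector and variable names are made up for the check and are not part of Convex.jl.

```julia
# A small complex system with known solution x.
A = [1+2im 3-1im; 0+1im 2+0im]
x = [1-1im, 2+3im]
b = A * x

# Split everything into real and imaginary parts.
Ar, Ai = real(A), imag(A)
xr, xi = real(x), imag(x)

# real(A*x) = Ar*xr - Ai*xi   and   imag(A*x) = Ai*xr + Ar*xi
M = vcat(hcat(Ar, -Ai), hcat(Ai, Ar))
lhs = M * vcat(xr, xi)
rhs = vcat(real(b), imag(b))

println(lhs == rhs)
```

These are the same two real constraints that appear when the complex-variable LP is rewritten by hand in terms of xr and xi.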

After the above changes were made to the codebase, we wrote two use cases to demonstrate the usability and correctness of our idea, which I present below:

# Importing Packages
Pkg.clone("https://github.com/Ayush-iitkgp/Convex.jl")
Pkg.checkout("Convex", "gsoc2")  # switch to the development branch
using Convex
 
# Complex LP with real variable
n = 10 # variable dimension (parameter)
m = 5 # number of constraints (parameter)
xo = rand(n)
A = randn(m,n) + im*randn(m,n)
b = A * xo 
# Declare a real variable
x = Variable(n)
p1 = minimize(sum(x), A*x == b, x>=0) 
# Notice that A*x == b is a complex equality constraint
solve!(p1)
x1 = x.value

# Let's now solve by decomposing the complex equality constraint into its real and imaginary parts.
p2 = minimize(sum(x), real(A)*x == real(b), imag(A)*x==imag(b), x>=0)
solve!(p2)
x2 = x.value
x1==x2 # should return true


# Let's now consider an example using a complex variable
# Complex LP with complex variable
n = 10 # variable dimension (parameter)
m = 5 # number of constraints (parameter)
xo = rand(n)+im*rand(n)
A = randn(m,n) + im*randn(m,n)
b = A * xo

# Declare a complex variable
x = ComplexVariable(n)
p1 = minimize(real(sum(x)), A*x == b, real(x)>=0, imag(x)>=0)
solve!(p1)
x1 = x.value

xr = Variable(n)
xi = Variable(n)
p2 = minimize(sum(xr), real(A)*xr-imag(A)*xi == real(b), imag(A)*xr+real(A)*xi == imag(b), xr>=0, xi>=0)
solve!(p2)
x1== xr.value + im*xi.value # should return true

The affine atoms are as follows:

addition, subtraction, multiplication, division
indexing and slicing
k-th diagonal of a matrix
construct diagonal matrix
transpose and ctranspose
stacking
sum
trace
conv
real and imag

Now, I am working towards implementing complex-domain second order conic programming. Meanwhile, I invite the Julia community to play around with the complex-domain LPs. The link to the development branch is here.

Looking forward to your suggestions!

Special thanks to my mentors Madeleine Udell and Dvijotham Krishnamurthy!

An invitation to JuliaCon 2016

2016-05-08T00:00:00+00:00

For the third year in a row we are happy to invite you to JuliaCon, the annual meeting of the Julia programming language community. JuliaCon 2016 will be held at the Massachusetts Institute of Technology from June 21st to 25th and, as a first, this year we will have several high-profile keynote speakers, as well as the top-notch tutorials and talks you have come to expect over the years. Please purchase your tickets before May 13th to take advantage of the early-bird pricing. We look forward to seeing you in June!

JuliaCon, just like the Julia language, has come a long way over the last three years. In 2014 we were roughly 75 attendees meeting in a medium-sized conference room at the University of Chicago to great success, in 2015 we had about 225 attendees and enough content to cover four full days at the Massachusetts Institute of Technology, and this year we hope that you will join us for the greatest JuliaCon yet!

From June 21st to the 25th JuliaCon 2016 will be held at the Massachusetts Institute of Technology, for a full five days of Julia-related content. On Tuesday 21st we will hold several workshops, on topics ranging from intermediate Julia programming to more advanced topics such as writing high-performance code and parallel programming. From Wednesday 22nd to Friday 24th we will start each day with a keynote by a high-profile speaker, followed by talks on a great variety of subjects: macroeconomics, machine learning, astrophysics, visualisation, and more! On Saturday 25th, the final day of the conference, we will hold a hackathon where attendees are encouraged to team up based on personal interests to either create new Julia projects or contribute to existing ones. All these details are now in the JuliaCon poster.

Without further ado, please allow us to introduce our keynote speakers:

Timothy E. Holy is an Associate Professor of Neuroscience at Washington University in St. Louis. In 2009 he received the NIH Director’s Pioneer award for innovations in optics and microscopy. His lab, which studies how the brain detects pheromones and develops new optical methods for imaging neuronal activity, was one of the first to adopt Julia for scientific research. He is a long time Julia contributor and a lead developer of Julia’s multidimensional array capabilities as well as the author of far too many Julia packages.
Thomas J. Sargent is a Professor of Economics at New York University and Senior Fellow at the Hoover Institution. In 2011 the Royal Swedish Academy of Sciences awarded him the Nobel Memorial Prize in Economic Sciences for his work on macroeconomics. Together with John Stachurski, he founded quant-econ.net, a Julia and Python based learning platform for quantitative economics focusing on algorithms and numerical methods for studying economic problems as well as coding skills.
Guy L. Steele, Jr. is a Software Architect for Oracle Labs and Principal Investigator of the Programming Language Research project. In 1994, he was made a fellow of the Association for Computing Machinery after receiving the Grace Murray Hopper Award in 1988. He is an experienced designer of programming languages, like Scheme, Fortress and Java, and many of his ideas have had an impact on the design of Julia.

We hope that our invitation entices you to join us – new, intermediate, and experienced Julia users – for five days of fun at MIT this June and remember to purchase your tickets before May 13th to receive a 33% early-bird discount!

We need your help to spread this message far and wide! Post the JuliaCon poster and this blog post to your local email lists. Print the poster and post it on your local message board. In addition, please tweet, retweet, and post on Facebook, LinkedIn and other social media. This is the biggest JuliaCon ever, and we need your help in making it a huge success.

BioJulia Project in 2016

2016-04-30T00:00:00+00:00

I am pleased to announce that the next phase of BioJulia is starting! In the next several months, I’m going to implement many crucial features for bioinformatics that will motivate you to use Julia and BioJulia libraries in your work. But before going into the details of the project, let me briefly introduce the BioJulia project. This project is supported by the Moore Foundation and the Julia project.

The BioJulia project is a collaborative open source project to create an infrastructure for bioinformatics in the Julia programming language. It aims to provide fast and accessible software libraries. Julia’s Just-In-Time (JIT) compiler makes this ambitious goal achievable without resorting to other compiled languages like C/C++. The central package developed under the project is Bio.jl, which provides fundamental features including biological symbols/sequences, file format parsers, alignment algorithms, wrappers for external software, etc. It also supports several common file formats such as FASTA, FASTQ, BED, PDB, and so on. Last year I made the FMIndexes.jl package to build a full-text search index for large genomes as a Julia Summer of Code (JSoC) student, and we released the first development version of Bio.jl. While the BioJulia project is getting more active and the number of contributors is growing, we still lack some important features for realistic applications. Filling in the gaps between our current libraries and actual use cases is the purpose of my new project.

So, what will be added in it? Here is the summary of my plan:

Sequence analysis:
- Online sequence search algorithms
- Data structure for reference genomes
- Error-correcting algorithms for DNA barcodes
- Parsers for BAM and CRAM file formats
Integration with data viewers and databases:
- Genome browser backend
- Parsers for GFF3 and VCF/BCF
- Database access through web APIs

These things are of crucial importance for writing analysis programs because they connect software components (e.g. programs, archives, databases, viewers, etc.); data analysis software in bioinformatics usually reads and writes formatted data from and to other tools. The figure below shows a common workflow for detecting genetic variants; underlined deliverables will connect software, archives and databases so that you can write your analysis software in the Julia language.

Sequence Analysis

The online sequence search algorithms will come in three flavors: exact, approximate, and regular expression search algorithms. The exact sequence search literally means finding exactly matching positions of a query sequence in another sequence. The approximate search is similar to the exact search but allows up to a specified number of errors: mismatches, insertions, and deletions. The regular expression search accepts a query written as a regular expression, which enables flexible description of query patterns like motifs. For these algorithms, there are already half-done pull requests I’m working on: #152, #153, #143.
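The difference between exact and approximate search can be illustrated on plain strings in current Julia (1.x). The exact case uses the built-in `findfirst`. The approximate case is a textbook dynamic-programming scan (often attributed to Sellers) that reports every end position where the query matches with at most k mismatches, insertions or deletions. This is a sketch under those assumptions, not the algorithm from the pull requests mentioned above, and `approx_search` is a hypothetical name.

```julia
# Report every end position in `text` where `query` matches with at most k edits.
function approx_search(query::AbstractString, text::AbstractString, k::Int)
    q = collect(query)
    m = length(q)
    # col[i + 1] holds the fewest edits needed to match q[1:i] against some
    # substring of the text ending at the current position.
    col = collect(0:m)
    hits = Int[]
    for (j, c) in enumerate(text)
        prev_diag = col[1]
        col[1] = 0                      # a match may start anywhere in the text
        for i in 1:m
            cur = col[i + 1]
            col[i + 1] = min(prev_diag + (q[i] == c ? 0 : 1),  # match or mismatch
                             cur + 1,                          # extra character in the text
                             col[i] + 1)                       # query character left out
            prev_diag = cur
        end
        col[m + 1] <= k && push!(hits, j)
    end
    return hits
end

text = "TTACGTTT"
println(findfirst("ACGT", text))        # exact search: range of the first match
println(approx_search("ACGT", text, 0))
println(approx_search("ACGT", text, 1))
```

With k = 0 the scan reports only the end of the exact match. Allowing one error also accepts the neighboring end positions, where one query character is left out or one extra text character is absorbed.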

After the last release of Bio.jl v0.1.0, the sequence data structure has been significantly rewritten to make biological sequence types coherent and extensible. But because we chose an encoding that requires 4 bits per base to represent DNA sequences, the DNA sequence type consumes more memory than necessary to store a reference genome, which is usually composed of four kinds of DNA nucleotides (denoted by A/C/G/T) and a relatively small number of consecutive undetermined nucleotides (denoted by N). After trying some data structures, I found that the memory used for N positions can be dramatically reduced using IndexableBitVectors.jl, which is a package I created in JSoC 2015. I’m developing a separate package for reference genomes, ReferenceSequences.jl, and am going to improve its functionality and performance to handle huge genomes like the human genome.
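A back-of-the-envelope calculation shows why the encoding width matters at genome scale. The genome length used here (3.1 billion bases) is an assumed round figure for the human genome, and `packed_bytes` is a hypothetical helper. The estimate ignores the extra space needed to record N positions.

```julia
# Bytes needed to store n bases at a fixed number of bits per base.
packed_bytes(n, bits_per_base) = cld(n * bits_per_base, 8)

n = 3_100_000_000   # assumed approximate length of the human genome in bases

for bits in (8, 4, 2)
    gb = packed_bytes(n, bits) / 1e9
    println(bits, " bits per base: ", round(gb, digits=3), " GB")
end
```

Moving from 4 bits to 2 bits per base halves the footprint, provided the N positions can be stored compactly on the side.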

If you are a researcher or an engineer who handles high-throughput sequencing data, BAM and CRAM parsers are probably the most longed-for additions in the list. BAM is the de facto standard file format for aligned sequences, and most sequence mappers generate alignments in this format. CRAM is a storage-efficient alternative to BAM and is becoming popular as the amount of accumulated sequence data explodes. Since these files contain massive amounts of DNA sequences from high-throughput sequencing machines, high-speed parsing is a practical necessity. I’m going to concentrate on speed through careful tuning and multi-threaded parallel computation, which is planned to be introduced in the next Julia release.

Integration with Data Viewers and Databases

Genome browsers make it possible to interactively visualize genetic features found in individuals and/or populations. For example, using the UCSC Genome Browser, you can investigate genetic regions along with sequence annotations around the ABO gene in a window. The genome browser is one of the most common visualizations and hence many tools have been developed, but unfortunately there is no standardized interface. So, we will need to select a promising one that is open source and supports interaction with other software. The first candidate is JBrowse, which is built with modern JavaScript and HTML5 technologies. It also supports RESTful APIs and hence it can fetch data from a backend server via HTTP. I’m planning to make an API server that responds to queries from a genome browser to interactively visualize in-memory data.

Many databases distribute their data in standardized file formats. For genetic annotations and variants, GFF3 and VCF are the most common formats. If you are using data from human or mouse, you should know that various annotations are available from the GENCODE project, which offers data in the GTF and GFF3 file formats. NCBI provides human variation sets in the VCF file format here. These file formats are text, so you may think it is trivial to write parsers when you need them. That is partially true, as long as you don’t care about completeness and performance. Parsing a text file format in a naive way (for example, splitting a line on tab characters) allocates many temporary objects and often degrades performance, while careful tuning of a parser leads to complicated code that is hard to maintain. @dcjones took on this problem and did great work adding Julia support to Ragel, which generates Julia code that executes a finite state machine. Daniel’s talk at JuliaCon 2015 is helpful for the details if you are interested.
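To illustrate the allocation point about naive parsing made above, the sketch below compares `split`, which builds an array holding every field, with a hand-written scan that returns only the requested field as a view into the original line. It targets current Julia (1.x) and is neither the Ragel-based approach nor Bio.jl code. `nth_field` is a hypothetical helper and the sample line is a made-up GFF3-style record.

```julia
# Return the n-th tab-separated field of `line` as a SubString (a view, not a copy),
# or `nothing` if the line has fewer than n fields.
function nth_field(line::AbstractString, n::Integer; delim::Char='\t')
    start = firstindex(line)
    for _ in 1:n-1
        i = findnext(isequal(delim), line, start)
        i === nothing && return nothing
        start = nextind(line, i)        # first character after the delimiter
    end
    stop = findnext(isequal(delim), line, start)
    stop_index = stop === nothing ? lastindex(line) : prevind(line, stop)
    return SubString(line, start, stop_index)
end

line = "chrVI\texample\tgene\t100\t200\t.\t+\t.\tID=gene1"

# Naive approach: allocates an array with all nine fields to read one of them.
println(parse(Int, split(line, '\t')[4]))

# Scanning approach: walks the line once and touches only what it needs.
println(parse(Int, nth_field(line, 4)))
```

Even this small helper has to deal with delimiter positions and missing fields by hand. A complete parser multiplies that bookkeeping, which is why generating it from a state-machine description is attractive.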

Sometimes you may need only part of the data provided by a database. In such a case, web-based APIs are handy for fetching the necessary data on demand. BioMart Central Portal offers a unified access point to a range of biological databases that is programmatically accessible via REST and SOAP APIs. A Julia wrapper for BioMart will make it much easier to access data by automatically converting responses to Julia objects. In the R language, the biomaRt package is one of the most downloaded Bioconductor packages: https://www.bioconductor.org/packages/stats/.

Try BioJulia!

We need users and collaborators for our libraries. Feedback from real-world users is the most valuable input for improving the quality of our libraries. We welcome feature requests and discussions that will make bioinformatics easier and faster. Tools for phylogenetics and structural biology, which I didn’t mention in this post, are also under active development. You can post issues here: https://github.com/BioJulia/Bio.jl/issues; if you want to get in touch with us more casually, this Gitter room may be more convenient: https://gitter.im/BioJulia/Bio.jl.

Google Summer of Code 2016

2016-04-14T00:00:00+00:00

We’re pleased to announce that the Julia Language is taking part in this year’s Google Summer of Code. This means that interested students will have the opportunity to spend their summers getting paid to write code on a project of their choice.

Student applications are open from March 14th – 25th on the SoC website, but there’s no reason not to get going right away! To get you started thinking about what you’d like to work on, there are a bunch of interesting projects on our ideas page. At this stage, it’s also a good idea to start getting involved with the community around your area of interest by opening issues, sending PRs and speaking to developers on relevant packages. Finding a good mentor for your project will be a big help for most applications, and showing mentors your enthusiasm is a great way to get them on board. Once you’re ready to start writing an application, check out our guidelines, which give some hints on what to include.

To give an idea of the kind of projects we’d like to support:

Parallel and distributed computing
Support for data science and analysis
Compiler optimisations and work on Julia on Android
Numerical and scientific computing – ODEs, matrix library functions, optimisation…
IDEs, tooling and 2D/3D visualisation
GPUs for graphics and numerical computing
Web tooling and networking
… and much more.

We welcome involvement in our summer frivolities even if you’re not a student. Firstly, if you happen to know any students, please let them know! We’d also like to encourage people to step up as mentors, so if you’re interested then please contact us (see below) and let us know what areas you’d like to help with. Please also feel free to give technical feedback on proposals that come up on our mailing lists.

The primary point of contact for the community is our mailing list, julia-users@googlegroups.com. For more administrative questions you can also reach out to us privately at juliasoc@googlegroups.com. Feel free to start discussions about projects and ideas, although note that it’s easier for us to answer broad questions about the process than to give specific technical feedback.

Our participation in previous years has resulted in some great projects, so we’re really looking forward to working with you this year and seeing what you can do. Good luck!

Generalizing AbstractArrays: opportunities and challenges

2016-03-27T00:00:00+00:00

Introduction: generic algorithms with AbstractArrays

Somewhat unusually, this blog post is future-looking: it mostly focuses on things that don’t yet exist. Its purpose is to lay out the background for community discussion about possible changes to the core API for AbstractArrays, and serves as background reading and reference material for a more focused “julep” (a julia enhancement proposal). Here, often I’ll use the shorthand “array” to mean AbstractArray, and use Array if I explicitly mean julia’s concrete Array type.

As the reader is likely aware, in julia it’s possible to write algorithms for which one or more inputs are only assumed to be AbstractArrays. This is “generic” code, meaning it should work (i.e., produce a correct result) on any specific concrete array type. In an ideal world—which julia approaches rather well in many cases—generality of code should not have a negative impact on its performance: a generic implementation should be approximately as fast as one restricted to specific array type(s). This implies that generic algorithms should be written using lower-level operations that give good performance across a wide variety of array types.

Providing efficient low-level operations is a different kind of design challenge than one experiences with programming languages that “vectorize” everything. When successful, it promotes much greater reuse of code, because efficient, generic low-level parts allow you to write a wide variety of efficient, generic higher-level functions.

Naturally, as the diversity of array types grows, the more careful we have to be about our abstractions for these low-level operations.

Examples of arrays

In discussing general operations on arrays, it’s useful to have a diverse collection of concrete arrays in mind.

In core julia, some types we support fairly well are:

Array: the prototype for all arrays
Ranges: a good example of what I often consider a “computed” array, where essentially none of the values are stored in memory. Since there is no storage, these are immutable containers: you can’t set values in individual slots.
BitArrays: arrays that can only store 0 or 1 (false or true), and for which the internal storage is packed so that each entry requires only one bit.
SubArrays: the problems this type introduced, and the resolution we adopted, probably serve as the best model for the generalizations considered here. Therefore, this case is discussed in greater detail below.

Another important class of array types in Base are sparse arrays: SparseMatrixCSC and SparseVector, as well as other sparse representations like Diagonal, Bidiagonal, and Tridiagonal. These are good examples of array types where access patterns deserve careful thought. Notably, despite many commonalities in “strategy” among the 5 or so sparse parametrizations we have, implementations of core algorithms (e.g., matrix multiplication) are specialized for each sparse-like type—in other words, these mimic the “high level vectorized functions” strategy common to other languages. What we lack is a “sparse iteration API” that lets you write the main algorithms of sparse linear algebra efficiently in a generic way. Our current model is probably fine for SparseLike*Dense operations, but gets to be harder to manage if you want to efficiently compute, e.g., Bidiagonal*SparseMatrixCSC: the number of possible combinations you have to support grows rapidly with more sparse types, and thus represents a powerful incentive for developing efficient, generic low-level operations.

Outside of Base, there are some other mind-stretching examples of arrays, including:

DataFrames: indexing arrays with symbols rather than integers. Other related types include NamedArrays, AxisArrays.
Interpolations: indexing arrays with non-integer floating-point numbers
DistributedArrays: another great example of a case in which you need to think through access patterns carefully

SubArrays: a case study

For arrays of fixed dimension, one can write algorithms that index arrays as A[i,j,k,...] (good examples can be found in our linear algebra code, where everything is a vector or matrix). For algorithms that have to support arbitrary dimensionality, for a long time our fallback was linear indexing, A[i] for integer i. However, in general SubArrays cannot be efficiently accessed by a linear index because it results in call(s) to div, and div is slow. This is a CPU problem, not a Julia-specific problem. The slowness of div is still true despite the recent addition of infrastructure to make it much faster—now one can make it merely “really bad” rather than “Terrible, Horrible, No Good, and Very Bad”.

The way we (largely) resolved this problem was to make it possible to do cartesian indexing, A[i,j,k,...], for arrays of arbitrary dimensionality (the CartesianIndex type). To leverage this in practical code, we also had to extend our iterators with the for I in eachindex(A) construct. This allows one to select an iterator that optimizes the efficiency of access to elements of A. In generic algorithms, the performance gains were not small, sometimes on the scale of ten- to fifty-fold. These types were described in a previous blog post.
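The effect of eachindex can be seen in a short sketch. It assumes a recent Julia 1.x, where views are created with view; the name generic_sum is hypothetical. The same loop receives integer indexes for an Array and CartesianIndex values for a non-contiguous view:

```julia
function generic_sum(A::AbstractArray)
    s = zero(eltype(A))
    for I in eachindex(A)    # the array decides which index type is efficient
        s += A[I]
    end
    return s
end

A = reshape(collect(1:12), 3, 4)
V = view(A, 1:2, [1, 3])     # rows 1:2 of columns 1 and 3

first(eachindex(A))          # an Int (linear index)
first(eachindex(V))          # a CartesianIndex{2}
generic_sum(A), generic_sum(V)
```

The loop body never mentions the index type, which is what lets one implementation serve both cases.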

To my knowledge, this approach has given Julia one of the most flexible yet efficient “array view” types in any programming language. Many languages base views on array strides, meaning situations in which the memory offset is regular along each dimension. Among other things, this requires that the underlying array is dense. In contrast, in Julia we can easily handle non-strided arrays (e.g., sampling at [1,3,17,428,...] along one dimension, or creating a view of a SparseMatrixCSC). We can also handle arrays for which there is no underlying storage (e.g., Ranges). Being able to do this with a common infrastructure is part of what makes different optimized array types useful in generic programming.

It’s also worth pointing out some problems:

Most importantly, it requires that one adopt a slightly different programming style. Despite being well into another release cycle, this transition is still not complete, even in Base.
For algorithms that involve two or more arrays, there’s a possibility that their “best” iterators will be of different types. In principle, this is a big problem. Consider matrix-vector multiplication, A[i,j]*v[j], where j needs to be in-sync for both A and v, yet you’d also like all of these accesses to be maximally-efficient. In practice, right now this isn’t a burning problem: even if our arrays don’t all have efficient linear indexing, to my knowledge all of our (dense) array types have efficient cartesian indexing. Since indexing by N integers (where N is equal to the dimensionality of the array) is always performant, this serves as a reliable default for generic code. (It’s worth noting that this isn’t true for sparse arrays, and the lack of a corresponding generic solution is probably the main reason we lack a generic API for writing sparse algorithms.)

Unfortunately, I suspect that if we want to add support for certain new operations or types (specific examples below), it will force us to set the latter problem on fire.

Challenging examples

Some possible new AbstractArray types pose novel challenges.

ReshapedArrays (#15449)

These are the front-and-center motivation for this post. These are motivated by a desire to ensure that reshape(A, dims) always returns a “view” of A rather than allocating a copy of A. (Much of the urgency of this julep goes away if we decide to abandon this goal, in which case for consistency we should always return a copy of A.) It’s worth noting that besides an explicit reshape, we have some mechanisms for reshaping that currently cause a copy to be created, notably A[:] or A[:, :] applied to a 3D array.

Similar to SubArrays, the main challenge for ReshapedArrays is getting good performance. If A is a 3D array, and you reshape it to a 2D array B, then B[i,j] must be expanded to A[k,l,m]. The problem is that computing the correct k,l,m might result in a call to div. So ReshapedArrays violate a crutch of our current ecosystem, in that indexing with N integers might not be the fastest way to access elements of B. From a performance perspective, this problem is substantial (see #15449, about five- to ten-fold).
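To make the cost concrete, here is a sketch of the index arithmetic for this 2D-over-3D case. It is only an illustration of where the divisions come from, not the implementation in #15449, and the function name parent_index is made up:

```julia
# Map index (i, j) of a reshaped 2D array B to index (k, l, m) of its 3D parent A.
# szB and szA are the sizes of B and A.
function parent_index(szB::NTuple{2,Int}, szA::NTuple{3,Int}, i::Int, j::Int)
    lin = (i - 1) + szB[1] * (j - 1)           # zero-based linear offset
    m, rest = divrem(lin, szA[1] * szA[2])     # first integer division
    l, k = divrem(rest, szA[1])                # second integer division
    return (k + 1, l + 1, m + 1)
end

A = reshape(collect(1:24), 2, 3, 4)
B = reshape(A, 6, 4)
parent_index(size(B), size(A), 5, 2)           # index into A for B[5, 2]
```

Every call pays for two integer divisions, which is why doing this on each element access inside an inner loop is expensive.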

In simple cases, there’s an easy way to circumvent this performance problem: define a new iterator type that (internally) iterates over the parent A’s indexes directly. In other words, create an iterator so that B[I] immediately expands to A[I'], and so that the latter has “ideal” performance.

Unfortunately, this strategy runs into a lot of trouble when you need to keep two arrays in sync: if you want to adopt this strategy, you simply can’t write B[i,j]*v[j] for matrix-vector multiplication anymore. A potential way around this problem is to define a new class of iterators that operate on specific dimensions of an array (#15459), writing B[ii,jj]*v[j]. jj (whatever that is) and j need to be in-sync, but they don’t necessarily need to both be integers. Using this kind of strategy, matrix-vector multiplication

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

might be written in a more performant manner like this:

for (jj, vj) in zip(eachindex(B, Dimension{2}), v)
    for (i, ii) in zip(eachindex(dest), eachindex(B, (:, jj)))
        dest[i] += B[ii,jj]*vj
    end
end

It’s not too hard to figure out what eachindex(B, Dimension{2}) and eachindex(B, (:, jj)) should do: ii, for example, could be a CartesianInnerIndex (a type that does not yet exist) that for a particular column of B iterates from A[3,7,4] to A[5,8,4], where the dth index component wraps around at size(A, d). The big performance advantage of this strategy is that you only have to compute a div to set the bounds of the iterator on each column; the inner loop doesn’t require a div on each element access. No doubt, given a suitable definition of jj one could be even more clever and avoid calculating div altogether. To the author, this strategy seems promising as a way to resolve the majority of the performance concerns about ReshapedArrays: only if you needed “random access” would you require slow (integer-based) operations.

However, a big problem is that compared to the “naive” implementation, this is rather ugly.

Row-major matrices, PermutedDimensionArrays, and “taking transposes seriously”

Julia’s Array type stores its entries in column-major order, meaning that A[i,j] and A[i+1,j] are in adjacent memory locations. For certain applications—or for interfacing with certain external code bases—it might be convenient to support row-major arrays, where instead A[i,j] and A[i,j+1] are in adjacent memory locations. More fundamentally, this is partially related to one of the most commented-on issues in all of julia’s development history, known as “taking transposes seriously” aka #4774. There have been at least two attempts at implementation, #6837 and the mb/transpose branch, and for the latter a summary of benefits and challenges was posted.

One of the biggest challenges mentioned was the huge explosion of methods that one would need to support. Can generic code come to the rescue here? There are two related concerns. The first is linear indexing: oftentimes this is conflated with “storage order,” i.e., given two linear indexes i and j for the same array, the offset in memory is proportional to i-j. For row-major arrays, this notion is not viable, because otherwise a loop

function copy!(dest, src)
    for i = 1:length(src)
        dest[i] = src[i]  # trouble if `i` means "memory offset"
    end
    dest
end

would end up taking a transpose if src and dest don’t use the same storage order. Consequently, a linear index has to be defined in terms of the corresponding cartesian (full-dimensionality) index. This isn’t much of a real problem, because it’s one we know how to solve: use ind2sub (which is slow) when you have to, but for efficiency make row major arrays belong to the category (LinearSlow) of arrays that defaults to iteration with cartesian indexes. Doing so will ensure that if one uses generic constructs like eachindex(src) rather than 1:length(src), then the loop above can be fast.
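A minimal sketch of such a row-major type is shown here, assuming a recent Julia 1.x (where the LinearSlow category is spelled IndexCartesian and is the default for new array types). RowMajorMatrix is a hypothetical type written for this example, not something from Base:

```julia
# Hypothetical row-major matrix wrapper.
struct RowMajorMatrix{T} <: AbstractMatrix{T}
    data::Vector{T}    # entries stored row by row
    m::Int
    n::Int
end
Base.size(A::RowMajorMatrix) = (A.m, A.n)
Base.getindex(A::RowMajorMatrix, i::Int, j::Int) = A.data[(i - 1) * A.n + j]

R = RowMajorMatrix([1, 2, 3, 4, 5, 6], 2, 3)   # rows are [1 2 3] and [4 5 6]

# The linear-index loop from copy! above, applied to R:
dest = zeros(Int, 2, 3)
for i = 1:length(R)
    dest[i] = R[i]     # R[i] is resolved through the cartesian index
end
```

Because the fallback converts the linear index i to a cartesian index before calling the two-argument getindex, R[2] is R[2,1] rather than data[2], and dest ends up equal to R instead of a transposed copy. The conversion is the slow part, which is the reason to prefer eachindex in generic loops.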

The far more challenging problem concerns cache-efficiency: it’s much slower to access elements of an array in anything other than storage-order. Some reasonably fast ways to write matrix-vector multiplication are

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

for a column-major matrix B, and

for i = 1:size(B, 1)
    for j = 1:size(B, 2)
        dest[i] += B[i,j] * v[j]
    end
end

for a row-major matrix. (One can do even better than this by using a scalar temporary accumulator, but let’s not worry about that here.) The key point to note is that the order of the loops has been switched.
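For completeness, the scalar-accumulator variant mentioned in passing might look like the following sketch. The function name is made up, and unlike the loops above it overwrites dest rather than adding to it:

```julia
function matvec_rowwise!(dest, B, v)
    for i = 1:size(B, 1)
        acc = zero(eltype(dest))     # scalar temporary for row i
        for j = 1:size(B, 2)
            acc += B[i, j] * v[j]
        end
        dest[i] = acc                # one write to dest per row
    end
    return dest
end

matvec_rowwise!(zeros(Int, 2), [1 2; 3 4], [1, 1])
```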

One could generalize this by defining a RowMajorRange iterator that’s a lot like our CartesianRange iterator, but traverses the array in row-major order. eachindex claims to return an “efficient iterator,” and without a doubt the RowMajorRange is a (much) more efficient iterator than a CartesianRange iterator for row-major arrays. So let’s imagine that eachindex does what it says, and returns a RowMajorRange iterator. Using this strategy, the two algorithms above can be combined into a single generic implementation:

for I in eachindex(B)
    dest[I[1]] += B[I]*v[I[2]]
end

Yay! Score one for efficient generic implementations.
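Since RowMajorRange is imagined, the loop above cannot be run as written. A runnable approximation in a recent Julia 1.x asks for cartesian indexes explicitly with CartesianIndices (eachindex on a plain Array returns integer linear indexes, for which I[2] would not be meaningful). The function name matvec! is chosen for this sketch:

```julia
function matvec!(dest, B::AbstractMatrix, v)
    fill!(dest, zero(eltype(dest)))
    for I in CartesianIndices(B)     # column-major traversal order
        dest[I[1]] += B[I] * v[I[2]]
    end
    return dest
end

matvec!(zeros(Int, 2), [1 2; 3 4], [1, 1])
```

The result does not depend on traversal order; only the cache behavior does, which is why an array-specific iterator is attractive here.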

But our triumph is short-lived. Let’s return to the example of copy! above, and realize that dest and src might be two different array types, and therefore might be most-efficiently indexed with different iterator types. We’re tempted to write this as

function copy!(dest, src)
    for (idest, isrc) in zip(eachindex(dest), eachindex(src))
        dest[idest] = src[isrc]
    end
    dest
end

Up until we introduced our RowMajorRange return-type for eachindex, this implementation would have been fine. But we just broke it, because now this will incorrectly take a transpose in certain situations.
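The failure can be reproduced today by standing in for the hypothetical RowMajorRange with a generator that visits src row by row. This is a sketch assuming a recent Julia 1.x; the generator is only an approximation of what such an iterator would produce:

```julia
src  = [1 2 3; 4 5 6]
dest = zeros(Int, 2, 3)

# Stand-in for a row-major iterator over src: (1,1), (1,2), (1,3), (2,1), ...
rowmajor = (CartesianIndex(i, j) for i in axes(src, 1) for j in axes(src, 2))

for (idest, isrc) in zip(eachindex(dest), rowmajor)
    dest[idest] = src[isrc]
end

dest    # [1 3 5; 2 4 6], which is not a copy of src
```

Each iterator is individually valid, but pairing them by position silently scrambles the data.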

In other words, without careful design the goals of “maximally-efficient iteration” and “keeping accesses in-sync” are in conflict.

OffsetArrays and the meaning of AbstractArray

Julia’s arrays are indexed starting at 1, whereas some other languages start numbering at 0. If you take comments on various blog posts at face value, there are vast armies of programmers out there eagerly poised to adopt julia, but who won’t even try it because of this difference in indexing. Since recruiting those armies will lead to world domination, this is clearly a problem of the utmost urgency.

More seriously, there are algorithms which simplify if you can index outside of the range from 1:size(A,d). In my own lab’s internal code, we’ve long been using a CenterIndexedArray type, in which such arrays (all of which have odd sizes) are indexed over the range -n:n and for which 0 refers to the “center” element. One package which generalizes this notion is OffsetArrays. Unfortunately, in practice both of these array types produce segfaults (due to built-in assumptions about when @inbounds is appropriate) for many of julia’s core functions; over time my lab has had to write implementations specialized for CenterIndexedArrays for quite a few julia functions.

OffsetArrays illustrates another conceptual challenge, which can easily be demonstrated by copy!. When dest is a 1-dimensional OffsetArray and src is a standard Vector, what should copy! do? In particular, where does src[1] go? Does it go in the first element of dest, or does it get stored in dest[1] (which may not be the first element)?

Such examples force us to think a little more deeply about what an array really is. There seem to be two potential conceptions. One is that arrays are lists, and multidimensional arrays are lists-of-lists-of-lists-of… In such a world view, the right thing to do is to put src[1] into the first slot of dest, because 1 is just a synonym for first. However, this world view doesn’t really endow any kind of “meaning” to the index-tuple of an array, and in that sense doesn’t even include the distinction conveyed by an OffsetArray. In other words, in this world an OffsetArray is simply nonsensical, and shouldn’t exist.

If instead one thinks OffsetArrays should exist, this essentially forces one to adopt a different world view: arrays are effectively associative containers, where each index-tuple is the “key” by which one retrieves a value. With this mode of thinking, src[1] should be stored in dest[1].
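The two world views lead to two different copy operations. The sketch below uses a deliberately minimal, hypothetical OffsetVec type (not the one from the OffsetArrays package, and not an AbstractArray subtype) just to make the difference visible:

```julia
# Hypothetical offset vector whose valid indexes start at firstidx.
struct OffsetVec{T}
    data::Vector{T}
    firstidx::Int
end
Base.getindex(v::OffsetVec, i::Int) = v.data[i - v.firstidx + 1]
Base.setindex!(v::OffsetVec, x, i::Int) = (v.data[i - v.firstidx + 1] = x)
indexrange(v::OffsetVec) = v.firstidx:(v.firstidx + length(v.data) - 1)

# "List" view: the k-th element of src goes into the k-th slot of dest.
function copy_by_position!(dest::OffsetVec, src::Vector)
    for (idest, isrc) in zip(indexrange(dest), eachindex(src))
        dest[idest] = src[isrc]
    end
    return dest
end

# "Associative" view: src[i] is stored in dest[i].
function copy_by_key!(dest::OffsetVec, src::Vector)
    for i in eachindex(src)
        dest[i] = src[i]
    end
    return dest
end

a = copy_by_position!(OffsetVec(zeros(Int, 5), -2), [10, 20])   # dest indexed -2:2
b = copy_by_key!(OffsetVec(zeros(Int, 5), -2), [10, 20])
```

With dest indexed over -2:2, the first version fills dest[-2] and dest[-1], while the second fills dest[1] and dest[2].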

Formalizing AbstractArray

These examples suggest a formalization of AbstractArray:

AbstractArrays are specialized associative containers, in that the allowable “keys” may be restricted by more than just their julia type. Specifically, the allowable keys must be representable as a cartesian product of one-dimensional lists of values. The allowed keys may depend not just on the array type but also the specific array (e.g., its size). Attempted access by a key that cannot be converted to one of the allowed keys, for that specific array, results in a BoundsError.
For any given array, one must be able to generate a finite-dimensional parametrization of the full domain of valid keys from the array itself. This might only require knowledge of the array size, or the keys might depend on some internal storage (think DataFrames and OffsetArrays). In some cases, just the array type might be sufficient (e.g., FixedSizeArrays). By this definition, note that a Dict{ASCII5,Int}, where ASCII5 is a type that means an ASCII string with 5 characters, would qualify as a 5-dimensional (sparse) array, but that a Dict{ASCIIString,Int} would not (because there is no length limit to an ASCIIString, and hence no finite dimensionality).
An array may be indexed by more than one key type (i.e., keys may have multiple parametrizations). Different key parametrizations are equivalent when they refer to the same element of a given array. Linear indexes and cartesian indexes are simple examples of interconvertible representations, but specialized iterators can produce other key types as well.
Arrays may support multiple iterators that produce non-equivalent key sequences. In other words, a row-major matrix may support both CartesianRange and RowMajorRange iterators that access elements in different orders.
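The notion of equivalent key parametrizations can be checked directly for the linear/cartesian pair. This sketch assumes a recent Julia 1.x, where LinearIndices and CartesianIndices provide the conversions:

```julia
A = reshape(collect(1:12), 3, 4)

lin  = LinearIndices(A)[2, 3]      # cartesian (2, 3) -> linear index 8
cart = CartesianIndices(A)[8]      # linear index 8  -> CartesianIndex(2, 3)

A[lin] == A[cart] == A[2, 3]       # all three keys name the same element
```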

Finding a way forward

Resolving these conflicting demands is not easy. One approach might be to decree that some of these array types simply can’t be supported with generic code. It is possible that this is the right strategy. Alternatively, one can attempt to devise an array API that handles all of these types (and hopefully more).

In GitHub issue #15648, we are discussing APIs that may resolve these challenges. Readers are encouraged to contribute to this discussion.

An introduction to ParallelAccelerator.jl

2016-03-01T00:00:00+00:00

The High Performance Scripting team at Intel Labs recently released ParallelAccelerator.jl, a Julia package for high-performance, high-level array-style programming. The goal of ParallelAccelerator is to make high-level array-style programs run as efficiently as possible in Julia, with a minimum of extra effort required from the programmer. In this post, we’ll take a look at the ParallelAccelerator package and walk through some examples of how to use it to speed up some typical array-style programs in Julia.

Introduction

Ideally, high-level array-style Julia programs should run as efficiently as possible on high-performance parallel hardware, with a minimum of extra programmer effort required, and with performance reasonably close to that of an expert implementation in C or C++. There are three main things that ParallelAccelerator does to move us toward this goal:

First, we identify implicit parallel patterns in array-style code the user writes. We’ll say more about these parallel patterns shortly.
Second, we compile these parallel patterns to explicit parallel loops.
Third, we minimize runtime overheads incurred by things like array bounds checks and intermediate array allocations.

The key user-facing feature that the ParallelAccelerator package provides is a Julia macro called @acc, which is short for “accelerate”. Annotating functions or blocks of code with @acc lets you designate the parts of your Julia program that you want to compile to optimized native code. Here’s a toy example of using @acc to annotate a function:

julia> using ParallelAccelerator

julia> @acc f(x) = x .+ x .* x
f (generic function with 1 method)

julia> f([1,2,3,4,5])
5-element Array{Int64,1}:
2
6
12
20
30

Under the hood, ParallelAccelerator is essentially a compiler – itself implemented in Julia – that intercepts the usual Julia JIT compilation process for @acc-annotated functions. It compiles @acc-annotated code to C++ OpenMP code, which can then be compiled to a native library by an external C++ compiler such as GCC or ICC. (This intermediate C++ generation step isn’t essential to the design of ParallelAccelerator, though – instead, the compiler could target Julia’s own forthcoming native threading backend. [1]) On the Julia side, ParallelAccelerator generates a proxy function that calls into that native library, and replaces calls to @acc-annotated functions, like f in the above example, with calls to the appropriate proxy function.

We’ll say more shortly about the parallel patterns that ParallelAccelerator targets and about how the ParallelAccelerator compiler works, but before we do, let’s look at some code and some performance results.

A quick preview of results: Black-Scholes option pricing benchmark

Let’s see how to use ParallelAccelerator to speed up a classic high-performance computing benchmark: an implementation of the Black-Scholes formula for option pricing. The following code is a Julia implementation of the Black-Scholes formula.

function cndf2(in::Array{Float64,1})
    out = 0.5 .+ 0.5 .* erf(0.707106781 .* in)
    return out
end

function blackscholes(sptprice::Array{Float64,1},
                      strike::Array{Float64,1},
                      rate::Array{Float64,1},
                      volatility::Array{Float64,1},
                      time::Array{Float64,1})
    logterm = log10(sptprice ./ strike)
    powterm = .5 .* volatility .* volatility
    den = volatility .* sqrt(time)
    d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
    d2 = d1 .- den
    NofXd1 = cndf2(d1)
    NofXd2 = cndf2(d2)
    futureValue = strike .* exp(- rate .* time)
    c1 = futureValue .* NofXd2
    call = sptprice .* NofXd1 .- c1
    put  = call .- futureValue .+ sptprice
end

function run(iterations)
    sptprice   = Float64[ 42.0 for i = 1:iterations ]
    initStrike = Float64[ 40.0 + (i / iterations) for i = 1:iterations ]
    rate       = Float64[ 0.5 for i = 1:iterations ]
    volatility = Float64[ 0.2 for i = 1:iterations ]
    time       = Float64[ 0.5 for i = 1:iterations ]

    tic()
    put = blackscholes(sptprice, initStrike, rate, volatility, time)
    t = toq()
    println("checksum: ", sum(put))
    return t
end

Here, the blackscholes function takes five arguments, each of which is an array of Float64s. The run function initializes these five arrays and passes them to blackscholes, which, along with the cndf2 (cumulative normal distribution) function that it calls, does several computations involving pointwise addition (.+), subtraction (.-), multiplication (.*), and division (./) on the arrays. It’s not necessary to understand the details of the Black-Scholes formula; the important thing to notice about the code is that we are doing lots of pointwise array arithmetic. Using Julia 0.4.4-pre on a 4-core Ubuntu 14.04 desktop machine with 8 GB of memory, the blackscholes call takes about 11 seconds when run is called with an argument of 40,000,000 (meaning that we are dealing with 40-million-element arrays):

julia> @time run(40_000_000)
checksum: 8.381928525856283e8
 12.885293 seconds (458.51 k allocations: 9.855 GB, 2.95% gc time)
11.297714183

Here, the 11.297714183 being returned from run is the number of seconds it takes the blackscholes call alone to return. The 12.885293 seconds reported by @time is a little longer, because it’s the running time of the entire run call.

The many pointwise array operations in this code make it a great candidate for speeding up with ParallelAccelerator (as we’ll discuss more shortly). Doing so requires only minor changes to the code: we import the ParallelAccelerator library with using ParallelAccelerator, then wrap the cndf2 and blackscholes functions in an @acc block, as follows:

using ParallelAccelerator

@acc begin

function cndf2(in::Array{Float64,1})
    out = 0.5 .+ 0.5 .* erf(0.707106781 .* in)
    return out
end

function blackscholes(sptprice::Array{Float64,1},
                      strike::Array{Float64,1},
                      rate::Array{Float64,1},
                      volatility::Array{Float64,1},
                      time::Array{Float64,1})
    logterm = log10(sptprice ./ strike)
    powterm = .5 .* volatility .* volatility
    den = volatility .* sqrt(time)
    d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
    d2 = d1 .- den
    NofXd1 = cndf2(d1)
    NofXd2 = cndf2(d2)
    futureValue = strike .* exp(- rate .* time)
    c1 = futureValue .* NofXd2
    call = sptprice .* NofXd1 .- c1
    put  = call .- futureValue .+ sptprice
end

end

The definition of run stays the same. With the addition of the @acc wrapper, we now have much better performance:

julia> @time run(40_000_000)
checksum: 8.381928525856283e8
  4.010668 seconds (1.90 M allocations: 1.584 GB, 2.06% gc time)
3.503281464

This time, blackscholes returns in about 3.5 seconds, and the entire run call finishes in about 4 seconds. This is already an improvement, but on subsequent calls to run, we do even better:

julia> @time run(40_000_000)
checksum: 8.381928525856283e8
  1.418709 seconds (158 allocations: 1.490 GB, 8.98% gc time)
1.007861068

julia> @time run(40_000_000)
checksum: 8.381928525856283e8
  1.410865 seconds (154 allocations: 1.490 GB, 7.93% gc time)
1.012813958

In subsequent calls, blackscholes returns in about a second, with the entire run call taking about 1.4 seconds. The reason for this additional improvement is that ParallelAccelerator has already compiled the blackscholes and cndf2 functions and doesn’t need to do so again on subsequent runs.

These results were collected on an ordinary desktop machine, but we can scale up further. The following figure reports the time it takes blackscholes to run on arrays of 100 million elements, this time on a 36-core machine with 128 GB of RAM [2]:

The first three bars of the above figure show performance results for ParallelAccelerator using different numbers of threads. Since ParallelAccelerator compiles Julia to OpenMP C++, we can use the OMP_NUM_THREADS environment variable to control the number of threads that the code runs with. Here, with OMP_NUM_THREADS set to 18, blackscholes runs in 0.27 seconds; with 36 threads (matching the number of cores on the machine), running time drops to 0.16 seconds. The third bar shows results for ParallelAccelerator with OMP_NUM_THREADS set to 1, which clocks in at about 3 seconds. For comparison, the rightmost bar shows results for “plain Julia”, that is, a version of the code without @acc, which runs in about 21 seconds.

Because Julia doesn’t (yet) have native multithreading support, the plain Julia results shown in the rightmost bar are for one thread. But it is interesting to note that the ParallelAccelerator implementation of Black-Scholes outperforms plain Julia by a factor of about seven, even when running on just one core. The reason for this speedup is that ParallelAccelerator (despite its name!) does more than just parallelize code. The ParallelAccelerator compiler is able to do away with much of the runtime overhead incurred by array bounds checks and allocation of intermediate arrays. After that, with the addition of parallelism, we’re able to do even better, for a total speedup of more than 100x over plain Julia.

To see how ParallelAccelerator accomplishes this, we’ll discuss the parallel patterns that ParallelAccelerator handles in a bit more detail, and then we’ll take a closer look at the ParallelAccelerator compiler pipeline.

Parallel patterns

ParallelAccelerator works by identifying implicit parallel patterns in source code and making the parallelism explicit. These patterns include map, reduce, array comprehension, and stencil.

Map

As we saw in the Black-Scholes example above, the .+, .-, .*, and ./ operations in Julia are pointwise array operations that take input arrays as arguments and produce an output array. ParallelAccelerator translates these pointwise array operations into data-parallel map operations. (See the ParallelAccelerator documentation for a complete list of all the pointwise array operations that it knows how to parallelize.) Furthermore, ParallelAccelerator translates array assignments into in-place map operations. For instance, assigning a = a .* b where a and b are arrays would map .* over a and b and update a in place with the result. For both standard map and in-place map, it is possible for ParallelAccelerator to avoid any array bounds checking once we’ve established that the input arrays and the output arrays are the same size.
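Conceptually, an in-place map with bounds checks hoisted out of the loop corresponds to something like the following hand-written Julia. This is only a sketch of the idea in a recent Julia 1.x, not the code ParallelAccelerator generates, and inplace_mul! is a made-up name:

```julia
function inplace_mul!(a::Vector{Float64}, b::Vector{Float64})
    # One size check up front replaces a bounds check on every access.
    length(a) == length(b) || throw(DimensionMismatch("a and b must have equal length"))
    @inbounds for i in eachindex(a)
        a[i] = a[i] * b[i]      # each iteration is independent of the others
    end
    return a
end

inplace_mul!([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Because no iteration reads a value written by another, the loop body can be distributed across threads without changing the result.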

Reduce

Reduce operations take an array argument and produce a scalar result by combining all the elements of an array with an associative and commutative operation. ParallelAccelerator translates the Julia functions minimum, maximum, sum, prod, any, and all into data-parallel reduce operations when they are called on arrays.
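The reason associativity and commutativity matter is that a parallel reduction combines partial results in an order that differs from the sequential one. A minimal sketch of the idea, assuming a recent Julia 1.x with Base.Threads (this is not how ParallelAccelerator itself is implemented, and chunked_sum is a made-up name):

```julia
function chunked_sum(x::Vector{Int}, nchunks::Int)
    n = length(x)
    partial = zeros(Int, nchunks)
    Threads.@threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1    # first index of chunk c
        hi = div(c * n, nchunks)              # last index of chunk c
        s = 0
        for i in lo:hi
            s += x[i]
        end
        partial[c] = s                        # each chunk writes its own slot
    end
    return sum(partial)                       # combine the partial results
end

chunked_sum(collect(1:100), 4)
```

With integer addition the regrouping is exact; with floating-point values the result can differ slightly from a strictly sequential sum.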

Array comprehension

Julia supports array comprehensions, a convenient and concise way to construct arrays. For example, the expressions that initialize the five input arrays in the Black-Scholes example above are all array comprehensions. As a more sophisticated example, the following avg function, taken from the Julia manual, takes a one-dimensional input array x of length n and uses an array comprehension to construct an output array of length n-2, in which each element is a weighted average of the corresponding element in the original array and its two neighbors:

avg(x) = [ 0.25*x[i-1] + 0.5*x[i] + 0.25*x[i+1] for i = 2:length(x) - 1 ]

Comprehensions like this one can also be parallelized by ParallelAccelerator: in a nutshell, ParallelAccelerator can transform array comprehensions to code that first allocates an output array and then performs an in-place map that can write to each element of the output array in parallel.

Array comprehensions differ from map and reduce operations in that they involve explicit array indexing. But it is still possible to parallelize array comprehensions in Julia, as long as there are no side effects in the comprehension body (everything before the for). [3] ParallelAccelerator uses a conservative static analysis to try to identify and reject side-effecting operations in comprehensions.
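Written out by hand, the allocate-then-map transformation of avg looks roughly like this. It is a sketch in a recent Julia 1.x of what the transformation amounts to, not ParallelAccelerator's actual output; avg_loop is a made-up name:

```julia
avg(x) = [ 0.25*x[i-1] + 0.5*x[i] + 0.25*x[i+1] for i = 2:length(x) - 1 ]

function avg_loop(x::Vector{Float64})
    n = length(x)
    out = Vector{Float64}(undef, n - 2)      # step 1: allocate the output array
    for i = 2:n-1                            # step 2: fill it; writes are independent
        out[i-1] = 0.25*x[i-1] + 0.5*x[i] + 0.25*x[i+1]
    end
    return out
end

avg_loop([1.0, 2.0, 3.0, 4.0])
```

Each iteration writes a distinct element of out and only reads from x, so the loop iterations can run in parallel.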

Stencil

In addition to map, reduce, and comprehension, ParallelAccelerator targets a fourth parallel pattern: stencil computations. A stencil computation updates the elements of an array according to a fixed pattern called a stencil. In fact, the avg comprehension example above could also be thought of as a stencil computation, because it updates the contents of an array based on each element’s neighbors. However, stencil computations differ from the other patterns that ParallelAccelerator targets, because there’s not a built-in, user-facing language feature in Julia that expresses stencil computations specifically. So, ParallelAccelerator introduces a new user-facing language construct called runStencil for expressing stencil computations in Julia. Next, we’ll look at an example that illustrates how runStencil works.

Example: Blurring an image with runStencil

Let’s consider a stencil computation that blurs an image using a Gaussian blur. The image is represented as a two-dimensional array of pixels. To blur the image, we set the value of each output pixel to a particular weighted average of the corresponding input pixel’s value and the values of its neighboring input pixels. By repeating this process multiple times, we can get an increasingly blurred image. [4]

The following code implements a Gaussian blur in Julia. It operates on a 2D array of Float32s: the pixels of the source image. It’s easy to obtain such an array using, for instance, the load function from the Images.jl library, followed by a call to convert to get an array of type Array{Float32,2}. (For simplicity, we’re assuming that the input image is a grayscale image, so each pixel has just one value instead of red, green, and blue values. However, it would be straightforward to use the same approach for RGB pixels.)

function blur(img::Array{Float32,2}, iterations::Int)
    w, h = size(img)
    for i = 1:iterations
      img[3:w-2,3:h-2] = 
           img[3-2:w-4,3-2:h-4] * 0.0030 + img[3-1:w-3,3-2:h-4] * 0.0133 + img[3:w-2,3-2:h-4] * 0.0219 + img[3+1:w-1,3-2:h-4] * 0.0133 + img[3+2:w,3-2:h-4] * 0.0030 +
           img[3-2:w-4,3-1:h-3] * 0.0133 + img[3-1:w-3,3-1:h-3] * 0.0596 + img[3:w-2,3-1:h-3] * 0.0983 + img[3+1:w-1,3-1:h-3] * 0.0596 + img[3+2:w,3-1:h-3] * 0.0133 +
           img[3-2:w-4,3+0:h-2] * 0.0219 + img[3-1:w-3,3+0:h-2] * 0.0983 + img[3:w-2,3+0:h-2] * 0.1621 + img[3+1:w-1,3+0:h-2] * 0.0983 + img[3+2:w,3+0:h-2] * 0.0219 +
           img[3-2:w-4,3+1:h-1] * 0.0133 + img[3-1:w-3,3+1:h-1] * 0.0596 + img[3:w-2,3+1:h-1] * 0.0983 + img[3+1:w-1,3+1:h-1] * 0.0596 + img[3+2:w,3+1:h-1] * 0.0133 +
           img[3-2:w-4,3+2:h-0] * 0.0030 + img[3-1:w-3,3+2:h-0] * 0.0133 + img[3:w-2,3+2:h-0] * 0.0219 + img[3+1:w-1,3+2:h-0] * 0.0133 + img[3+2:w,3+2:h-0] * 0.0030
    end
    return img
end

Here, to compute the value of a pixel in the output image, we use the corresponding input pixel as well as all its neighboring pixels, to a depth of two pixels out from the input pixel – so, twenty-four neighbors. In all, there are twenty-five pixel values to examine. We add all these pixel values together, each multiplied by a weight – in this case 0.0030 for the cornermost pixels, 0.1621 for the center pixel, and for all the other pixels, something in between – and the total is the value of the output pixel. At the borders of the image, we don’t have enough neighboring pixels to compute an output pixel value, so we simply skip those pixels and don’t assign to them. [5]

Notice that the blur function explicitly loops over the number of iterations, that is, the number of times to apply the blur to the image, but it does not explicitly loop over pixels in the image. Instead, the code is written in array style: it performs just one assignment to the array img, using the ranges 3:w-2 and 3:h-2 to avoid assigning to the borders of the image. On a large grayscale input image of 7095 by 5322 pixels, this code takes about 10 minutes to run for 100 iterations.
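For comparison, here is a minimal sketch of a single blur pass written with explicit loops over pixels rather than in array style. The helper name blur_once and the weights matrix are our own illustrative choices, not part of the original code; the weight values are copied from the blur function above.

```julia
# 5x5 weights copied from the blur function above (the kernel is symmetric).
weights = Float32[
    0.0030 0.0133 0.0219 0.0133 0.0030
    0.0133 0.0596 0.0983 0.0596 0.0133
    0.0219 0.0983 0.1621 0.0983 0.0219
    0.0133 0.0596 0.0983 0.0596 0.0133
    0.0030 0.0133 0.0219 0.0133 0.0030
]

# One blur pass with explicit loops. `blur_once` is a hypothetical helper,
# not part of the original post or of ParallelAccelerator.
function blur_once(img::Array{Float32,2}, weights::Array{Float32,2})
    w, h = size(img)
    out = copy(img)                      # border pixels keep their input values
    for y in 3:h-2, x in 3:w-2           # skip the two-pixel border
        acc = 0.0f0
        for dy in -2:2, dx in -2:2       # the 25 pixels around (x, y)
            acc += weights[dx+3, dy+3] * img[x+dx, y+dy]
        end
        out[x, y] = acc
    end
    return out
end

img = ones(Float32, 8, 8)
out = blur_once(img, weights)
```

The weights add up to roughly 0.9997, so an all-ones image comes back with interior pixels just under 1, while the two-pixel border is left untouched.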

Using ParallelAccelerator, we can get much better performance. Let’s look at a version of blur that uses runStencil:

@acc function blur(img::Array{Float32,2}, iterations::Int)
    buf = Array(Float32, size(img)...) 
    runStencil(buf, img, iterations, :oob_skip) do b, a
       b[0,0] = 
            (a[-2,-2] * 0.003  + a[-1,-2] * 0.0133 + a[0,-2] * 0.0219 + a[1,-2] * 0.0133 + a[2,-2] * 0.0030 +
             a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * 0.0983 + a[1,-1] * 0.0596 + a[2,-1] * 0.0133 +
             a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * 0.1621 + a[1, 0] * 0.0983 + a[2, 0] * 0.0219 +
             a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * 0.0983 + a[1, 1] * 0.0596 + a[2, 1] * 0.0133 +
             a[-2, 2] * 0.003  + a[-1, 2] * 0.0133 + a[0, 2] * 0.0219 + a[1, 2] * 0.0133 + a[2, 2] * 0.0030)
       return a, b
    end
    return img
end

Here, we again have a function called blur – now annotated with @acc – that takes the same arguments as the original code. This version of blur allocates a new 2D array called buf that is the same size as the original img array. The allocation of buf is followed by a call to runStencil. Let’s take a closer look at the runStencil call.

runStencil has the following signature:

runStencil(kernel :: Function, buffer1, buffer2, ..., iteration :: Int, boundaryHandling :: Symbol)

In blur, the call to runStencil uses Julia’s do-block syntax for function arguments, so the do b, a ... end block is actually the first argument to the runStencil call. The do block creates an anonymous function that binds the variables b and a. The arguments buffer1, buffer2, ... that are passed to runStencil become the arguments to the anonymous function. In this case, we are passing two buffers, buf and img, to runStencil, and so the anonymous function takes two arguments.
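To make the do-block syntax concrete, here is a small plain-Julia sketch that does not use ParallelAccelerator at all. The function apply_pair is a hypothetical stand-in for any function that, like runStencil, takes a function as its first argument.

```julia
# Hypothetical function whose first argument is a function, like runStencil's kernel.
apply_pair(f, x, y) = f(x, y)

# The anonymous function passed explicitly as the first argument...
r1 = apply_pair((b, a) -> a - b, 1, 10)

# ...and the same call written with a do block, which binds b and a.
r2 = apply_pair(1, 10) do b, a
    a - b
end
```

Both calls are equivalent: the do block is turned into an anonymous function and passed as the first argument, with 1 bound to b and 10 bound to a.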

Aside from the anonymous function and the two buffers, runStencil takes two other arguments. The first of these is a number of iterations that we want to run the stencil computation for. In this case, we simply pass along the iterations argument that is passed to blur. Finally, the last argument to runStencil is a symbol indicating how stencil boundaries are to be handled. Here, we’re using the :oob_skip symbol, short for “out-of-bounds skip”. It means that when input indices are out of bounds – for instance, in the situation where the input pixel is one of those on the two-pixel border of the image, and there aren’t enough neighbor pixels to compute the output pixel value – then we simply skip writing to the output pixel. This has the same effect as the careful indexing in the original version of blur.

Next, let’s look at the body of the do block that we’re passing to runStencil. It contains an assignment to b, using values computed from a. As we’ve said, b and a here are buf and img: our newly-allocated buffer, and the original image. The code here is similar to that of the original implementation of blur, but here we’re using relative rather than absolute indexing into arrays. The index 0,0 in b[0,0] doesn’t refer to any particular element of b, but instead to the current position of a cursor that can be thought of as traversing all the elements of b. On the right side of the assignment, a[-2,-1] refers to the element in a that is two elements to the left and one element up from the 0,0 element of a. In this way, we can express a stencil computation more concisely than the original version of blur did, and we don’t have to worry about getting the indices correct for boundary handling as we had to do before, because the :oob_skip argument tells runStencil everything it needs to know to handle boundaries correctly.

Finally, at the end of the do block, we return a, b. They were bound as b, a, but we return them in the opposite order so that for each iteration of the stencil, we’ll be using the already-blurred buffer as the input for another round of blurring. This continues for however many iterations we’ve specified. There’s therefore no need to write an explicit for loop for stencil iterations when using runStencil; one just passes an argument saying how many iterations should occur.
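The buffer swapping described above can be sketched in plain Julia. This is only an illustration of the double-buffering idea under our own assumptions, not how runStencil is implemented: the function stencil_iterate, its border argument, and the kernel(src, x, y) calling convention are all hypothetical, and the swap is hard-coded rather than expressed through the kernel’s return value.

```julia
# Hypothetical double-buffering driver; this is NOT ParallelAccelerator's implementation.
# `kernel(src, x, y)` returns the new value for position (x, y);
# `border` is how many edge pixels to skip on each side.
function stencil_iterate(kernel, dst, src, iterations, border)
    w, h = size(src)
    dst[:] = src                  # skipped border pixels then match in both buffers
    for _ in 1:iterations
        for y in 1+border:h-border, x in 1+border:w-border
            dst[x, y] = kernel(src, x, y)
        end
        src, dst = dst, src       # swap roles: this round's output is next round's input
    end
    return src                    # after the last swap, `src` is the newest result
end

# A 3x3 grid with a one-pixel border: only the centre is ever written.
a = zeros(3, 3)
b = similar(a)
result = stencil_iterate((s, x, y) -> s[x, y] + 1, b, a, 3, 1)
```

After three iterations the centre element has been incremented three times, once per round, while the skipped border elements are still zero.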

Therefore runStencil enables us to write more concise code than plain Julia, as we’d expect from a language extension. But where runStencil really shines is in the performance it enables. The following figure compares performance results for plain Julia and ParallelAccelerator implementations of blur, each running for 100 iterations on the aforementioned 7095x5322 source image, run using the same machine as for the previous Black-Scholes benchmark.

The rightmost column shows the results for plain Julia, using the first implementation of blur shown above. The three columns to the left show results for the ParallelAccelerator version that uses runStencil. As we can see, even when running on just one thread, ParallelAccelerator enables a speedup of about 15x: from about 600 seconds to about 40 seconds. Running on 36 threads provides a further parallel speedup of more than 26x, resulting in a total speedup of nearly 400x over plain single-threaded Julia.

An overview of the ParallelAccelerator compiler architecture

Now that we’ve talked about the parallel patterns that ParallelAccelerator speeds up and seen some code examples, let’s take a look at how the ParallelAccelerator compiler works.

The standard Julia JIT compiler parses Julia source code into the Julia abstract syntax tree (AST) representation. It performs type inference on the AST, then transforms the AST to LLVM IR, and finally generates native assembly code. ParallelAccelerator intercepts this process at the level of the AST. It introduces new AST nodes for the parallel patterns we discussed above. It then does various optimizations on the resulting AST. Finally, it generates C++ code that can be compiled by an external C++ compiler. The following figure shows an overview of the ParallelAccelerator compilation process:

As many readers of this blog will know, Julia has good support for inspecting and manipulating its own ASTs. Its built-in code_typed function will return the AST of any function after Julia’s type inference has taken place. This is very convenient for ParallelAccelerator, which is able to use the output from code_typed as the input to the first pass of its compiler, which is called “Domain Transformations”. The Domain Transformations pass produces ParallelAccelerator’s Domain AST intermediate representation.
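As a quick illustration of the starting point for that first pass, code_typed can be called directly on any function. The toy function scale below is our own example; the exact form of what code_typed returns differs between Julia versions, so no output is shown.

```julia
# Hypothetical toy function, only here to give code_typed something to inspect.
scale(x) = 2x + 1

# code_typed takes a function and a tuple of argument types, and returns one
# entry per matching method. The printed form of each entry varies by Julia version.
typed = code_typed(scale, (Float64,))
```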

Domain AST is similar to Julia’s AST, except it introduces new AST nodes for parallel patterns that it identifies. We call these nodes “domain nodes”, collectively. The Domain Transformations pass replaces certain parts of the AST with domain nodes.

The Domain Transformations pass is followed by the Parallel Transformations pass, which replaces domain nodes with “parfor” nodes, each of which represents one or more nested parallel for loops. Loop fusion also takes place during the Parallel Transformations pass. We call the result of Parallel Transformations Parallel AST. [6]

The compiler hands off Parallel AST code to the last pass of the compiler, CGen, which generates C++ code and converts parfor nodes into OpenMP loops. Finally, an external C++ compiler creates an executable which is linked to OpenMP and to a small array runtime component written in C that manages the transfer of arrays back and forth between Julia and C++.

Caveats

ParallelAccelerator is still a proof of concept at this stage. Users should be aware of two issues that can stand in the way of being able to make effective use of ParallelAccelerator. Those issues are, first, package load time, and second, limitations in what Julia programs ParallelAccelerator is able to handle. We discuss each of these issues in turn.

Package load time

Because ParallelAccelerator is a large Julia package (it’s a compiler, after all), it takes a long time (perhaps 20 or 25 seconds on a 4-core desktop machine) for using ParallelAccelerator to run. This long pause is not the time that ParallelAccelerator is taking to compile your @acc-annotated code; it’s the time that Julia is taking to compile ParallelAccelerator itself. After this initial pause, the first call to an @acc-annotated function will incur a brief compilation pause (this time from the ParallelAccelerator compiler, not Julia itself) of perhaps a couple of seconds. Subsequent calls to the same function won’t incur the compilation pause.

Let’s see what these compilation pauses look like in practice. The ParallelAccelerator package comes with a collection of example programs that print timing information, including the Black-Scholes and Gaussian blur examples shown in this post. All the examples print timing information for two calls to an @acc-annotated function: first a “warm-up” call with trivial arguments to measure compilation time, and then a more realistic call. In the output printed by each example, timing information for the more realistic call is preceded by the string "SELFTIMED", while timing information for the warm-up call is preceded by "SELFPRIMED". Let’s run the Black-Scholes example and time it using the time shell command:

$ time julia ParallelAccelerator/examples/black-scholes/black-scholes.jl 
iterations = 10000000
SELFPRIMED 1.766323497
checksum: 2.0954821257116848e8
rate = 1.9205394841503927e8 opts/sec
SELFTIMED 0.052068703

real	0m26.454s
user	0m31.027s
sys	0m0.874s

Here, we’re running Black-Scholes for 10,000,000 iterations on our 4-core desktop machine. The total wall-clock time of 26.454 seconds consists mostly of the time it takes for using ParallelAccelerator to run. Once that’s done, Julia reports a SELFPRIMED time of about 1.8 seconds, which is dominated by the time it takes for ParallelAccelerator to compile the @acc-annotated code, and finally the SELFTIMED time is about 0.05 seconds for this problem size.
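The warm-up-then-measure pattern used by the examples can be reproduced in plain Julia, where the first call to any function also pays a one-time compilation cost. The sketch below only illustrates that pattern; it does not use ParallelAccelerator, and the function sumsq is a hypothetical stand-in for an @acc-annotated kernel.

```julia
# Hypothetical kernel; any not-yet-compiled function shows the same effect.
sumsq(v) = sum(x -> x * x, v)

v = rand(1000)
t_warmup = @elapsed sumsq(v)   # first call: includes compiling sumsq for this argument type
t_timed  = @elapsed sumsq(v)   # second call: reuses the compiled code
println("warm-up: ", t_warmup, " s, timed: ", t_timed, " s")
```

On most machines the first measurement is noticeably larger than the second, which is the same effect that separates the SELFPRIMED and SELFTIMED numbers above.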

As Julia’s compilation speed improves, we expect that package load time will be less of a problem for ParallelAccelerator.

Compiler limitations

ParallelAccelerator is able to handle only a limited subset of Julia language features, and it only supports a limited subset of Julia’s Base library functions. In other words, you cannot yet put an @acc annotation on arbitrary Julia code and expect it to go faster out of the box. The examples in this post give an idea of what kinds of programs are supported currently; for more, check out the full collection of ParallelAccelerator examples. However, if ParallelAccelerator can’t compile some code in an @acc-annotated function, it will simply fall back to running the function under regular Julia. So your code will run, regardless of whether ParallelAccelerator can speed it up.

One reason why an @acc-annotated function might fail to compile is that ParallelAccelerator tries to transitively compile every Julia function that is called by the @acc-annotated function. So, if an @acc-annotated function makes several Julia library calls, ParallelAccelerator will attempt to compile those functions as well – and every Julia function that they call, and so on. If any of the code in the call chain contains a feature that ParallelAccelerator doesn’t currently support, ParallelAccelerator will fail to compile the original @acc-annotated function. It is therefore a good idea to begin by annotating small (but expensive) computational kernels with @acc, rather than wrapping an entire program in an @acc block. The ParallelAccelerator documentation has many more details on which Julia features we don’t support and why.

These limitations explain why the kind of performance improvements that ParallelAccelerator provides aren’t already the default in Julia. Supporting all of Julia would be a major undertaking; however, in many cases, there’s not a fundamental reason why ParallelAccelerator couldn’t support a particular Julia feature or a function in Base, and supporting it is a matter of realizing that it is a problem for users and putting in the necessary engineering effort to fix it. So, when you come across code that ParallelAccelerator can’t handle, please do file bugs!

Conclusion

In this post, we’ve introduced ParallelAccelerator.jl, a package for speeding up array-style Julia programs. It works by identifying implicit parallel patterns in source code and compiling them to efficient, explicitly parallel executables, along the way getting rid of many of the usual overheads of high-level array-style programming.

ParallelAccelerator is an open source project in its early stages, and we enthusiastically encourage comments, questions, bug reports, and contributions from the Julia community. We welcome everyone’s participation, and we are especially interested in how ParallelAccelerator can be used to speed up real-world Julia programs.

[1] Starting with Julia 0.5, Julia will have its own native threading support, which means that ParallelAccelerator can target Julia’s own native threads instead of generating C++ OpenMP code for parallelism. We’ve begun work on implementing a native-threading-based backend for ParallelAccelerator, but we still target C++ by default.

[2] Detailed machine and benchmarking specifications: We use a machine with two Intel Xeon E5-2699 v3 processors (2.3 GHz) with 18 physical cores each and 128 GB RAM, running the CentOS 6.7 Linux distribution. We use the Intel C++ Compiler (ICC) v15.0.2 with “-O3” for compilation of the generated C++ code. The Julia version is 0.4.4-pre+26. The results shown are the average of three runs (we run each version of a benchmark five times and discard the first and last runs).

[3] In Julia, it is not possible to index into a comprehension’s output array in the body of the comprehension. (The avg example indexes only into the input array, not the output array.) Therefore, it’s not necessary to do any bounds checking on writes to the output array. However, we still need to bounds-check reads from the input array (for instance, in the avg example, if we’d written 0.25*x[i-2], that would be out of bounds), so we cannot avoid all array bounds checking for comprehensions in the way that we can for map operations.

[4] In practice, rather than applying successive Gaussian blurs to an image, we’d probably apply a single, larger Gaussian blur, which, as Wikipedia notes, is at least as efficient computationally. Nevertheless, we’ll use it here as an example of a stencil computation that can be iterated.

[5] A more sophisticated implementation of Gaussian blur might do a fancier form of border handling, using only the pixels it has available at the borders.

[6] The names “Domain AST” and “Parallel AST” are inspired by the Domain IR and Parallel IR of the Delite compiler framework.

Multidimensional algorithms and iteration

2016-02-01T00:00:00+00:00

Starting with release 0.4, Julia makes it easy to write elegant and efficient multidimensional algorithms. The new capabilities rest on two foundations: a new type of iterator, called CartesianRange, and sophisticated array indexing mechanisms. Before I explain, let me emphasize that developing these capabilities was a collaborative effort, with the bulk of the work done by Matt Bauman (@mbauman), Jutho Haegeman (@Jutho), and myself (@timholy).

These new iterators are deceptively simple, so much so that I’ve never been entirely convinced that this blog post is necessary: once you learn a few principles, there’s almost nothing to it. However, like many simple concepts, the implications can take a while to sink in. There also seems to be some widespread confusion about the relationship between these iterators and Base.Cartesian, which is a completely different (and much more painful) approach to solving the same problem. There are still a few occasions where Base.Cartesian is necessary, but for many problems these new capabilities represent a vastly simplified approach.

Let’s introduce these iterators with an extension of an example taken from the manual.

eachindex, CartesianIndex, and CartesianRange

You may already know that, in julia 0.4, there are two recommended ways to iterate over the elements in an AbstractArray: if you don’t need an index associated with each element, then you can use

for a in A    # A is an AbstractArray
    # Code that does something with the element a
end

If instead you also need the index, then use

for i in eachindex(A)
    # Code that does something with i and/or A[i]
end

In some cases, the first line of this loop expands to for i = 1:length(A), and i is just an integer. However, in other cases, this will expand to the equivalent of

for i in CartesianRange(size(A))
    # i is now a CartesianIndex
    # Code that does something with i and/or A[i]
end

Let’s see what these objects are:

julia> A = rand(3,2)

julia> for i in CartesianRange(size(A))
          @show i
       end
i = CartesianIndex{2}((1,1))
i = CartesianIndex{2}((2,1))
i = CartesianIndex{2}((3,1))
i = CartesianIndex{2}((1,2))
i = CartesianIndex{2}((2,2))
i = CartesianIndex{2}((3,2))

A CartesianIndex{N} represents an N-dimensional index. CartesianIndexes are based on tuples, and indeed you can access the underlying tuple with i.I. However, they also support certain arithmetic operations, treating their contents like a fixed-size Vector{Int}. Since the length is fixed, julia/LLVM can generate very efficient code (without introducing loops) for operations with N-dimensional CartesianIndexes.
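A few of these operations, sketched on small hand-written indexes (the variable names are ours):

```julia
I = CartesianIndex((2, 3))
J = CartesianIndex((1, 1))

t = I.I        # the underlying tuple, (2, 3)
K = I + J      # elementwise addition
L = I - J      # elementwise subtraction
m = min(I, CartesianIndex((5, 1)))   # elementwise minimum

A = reshape(collect(1:12), 3, 4)
a = A[K]       # same element as A[3, 4]
```

Here K is the index (3, 4), L is (1, 2), and m is (2, 1), so each operation acts independently on every coordinate.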

A CartesianRange is just a pair of CartesianIndexes, encoding the start and stop values along each dimension, respectively:

julia> CartesianRange(size(A))
CartesianRange{CartesianIndex{2}}(CartesianIndex{2}((1,1)),CartesianIndex{2}((3,2)))

You can construct these manually: for example,

julia> CartesianRange(CartesianIndex((-7,0)), CartesianIndex((7,15)))
CartesianRange{CartesianIndex{2}}(CartesianIndex{2}((-7,0)),CartesianIndex{2}((7,15)))

constructs a range that will loop over -7:7 along the first dimension and 0:15 along the second.
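A note for readers on newer Julia versions: this post targets Julia 0.4, and in Julia 0.7 and later CartesianRange was replaced by CartesianIndices, which is built from a tuple of ranges. Assuming such a version, the same range can be sketched as:

```julia
# Julia 0.7 and later: CartesianIndices takes the place of CartesianRange.
R = CartesianIndices((-7:7, 0:15))

lo = first(R)   # the index (-7, 0)
hi = last(R)    # the index (7, 15)
sz = size(R)    # (15, 16): 15 values along dimension 1, 16 along dimension 2
```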

One reason that eachindex is recommended over for i = 1:length(A) is that some AbstractArrays cannot be indexed efficiently with a linear index; in contrast, a much wider class of objects can be efficiently indexed with a multidimensional iterator. (SubArrays are, generally speaking, a prime example.) eachindex is designed to pick the most efficient iterator for the given array type. You can even use

for i in eachindex(A, B)
    ...
end

to increase the likelihood that i will be efficient for accessing both A and B.
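A complete, if trivial, version of such a loop might look like this (the arrays and the doubling operation are just placeholders):

```julia
A = rand(3, 2)
B = similar(A)
for i in eachindex(A, B)   # one iterator that is valid for both arrays
    B[i] = 2 * A[i]
end
```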

As we’ll see below, these iterators have another purpose: independent of whether the underlying arrays have efficient linear indexing, multidimensional iteration can be a powerful ally when writing algorithms. The rest of this blog post will focus on this latter application.

Writing multidimensional algorithms with CartesianIndex iterators

A multidimensional boxcar filter

Let’s suppose we have a multidimensional array A, and we want to compute the “moving average” over a 3-by-3-by-… block around each element. From any given index position, we’ll want to sum over a region offset by -1:1 along each dimension. Edge positions have to be treated specially, of course, to avoid going beyond the bounds of the array.

In many languages, writing a general (N-dimensional) implementation of this conceptually-simple algorithm is somewhat painful, but in Julia it’s a piece of cake:

function boxcar3(A::AbstractArray)
    out = similar(A)
    R = CartesianRange(size(A))
    I1, Iend = first(R), last(R)
    for I in R
        n, s = 0, zero(eltype(out))
        for J in CartesianRange(max(I1, I-I1), min(Iend, I+I1))
            s += A[J]
            n += 1
        end
        out[I] = s/n
    end
    out
end

Let’s walk through this line by line:

out = similar(A) allocates the output. In a “real” implementation, you’d want to be a little more careful about the element type of the output (what if the input array element type is Int?), but we’re cutting a few corners here for simplicity.
R = CartesianRange(size(A)) creates the iterator for the array, ranging from CartesianIndex((1, 1, 1, ...)) to CartesianIndex((size(A,1), size(A,2), size(A,3), ...)). We don’t use eachindex, because we can’t be sure whether that will return a CartesianRange iterator, and here we explicitly need one.
I1 = first(R) and Iend = last(R) return the lower (CartesianIndex((1, 1, 1, ...))) and upper (CartesianIndex((size(A,1), size(A,2), size(A,3), ...))) bounds of the iteration range, respectively. We’ll use these to ensure that we never access out-of-bounds elements of A.

Conveniently, I1 can also be used to compute the offset range.
for I in R: here we loop over each entry of A.
n = 0 and s = zero(eltype(out)) initialize the accumulators. s will hold the sum of neighboring values. n will hold the number of neighbors used; in most cases, after the loop we’ll have n == 3^N, but for edge points the number of valid neighbors will be smaller.
for J in CartesianRange(max(I1, I-I1), min(Iend, I+I1)) is probably the most “clever” line in the algorithm. I-I1 is a CartesianIndex that is lower by 1 along each dimension, and I+I1 is higher by 1. Therefore, for interior points, this constructs a range that extends by an offset of 1 in either direction along each dimension.

However, when I represents an edge point, either I-I1 or I+I1 (or both) might be out-of-bounds. max(I1, I-I1) ensures that each coordinate of J is 1 or larger, while min(Iend, I+I1) ensures that J[d] <= size(A,d).
The inner loop accumulates the sum in s and the number of visited neighbors in n.
Finally, we store the average value in out[I].
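To see the clamping at work, here is a small sketch that evaluates the two bounds by hand for one edge position of a 3-by-2 array. The bounds are written out explicitly rather than taken from an iterator.

```julia
# Bounds of a hypothetical 3x2 array, written out by hand.
I1   = CartesianIndex((1, 1))
Iend = CartesianIndex((3, 2))

I  = CartesianIndex((1, 2))   # an edge position: first row, last column
lo = max(I1, I - I1)          # I - I1 is (0, 1); clamped up to (1, 1)
hi = min(Iend, I + I1)        # I + I1 is (2, 3); clamped down to (2, 2)
```

For this position the inner loop would therefore visit the 2-by-2 block from (1, 1) to (2, 2), that is 4 elements instead of the 9 visited for an interior point of a two-dimensional array.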

Not only is this implementation simple, but it is surprisingly robust: for edge points it computes the average of whatever nearest-neighbors it has available. It even works if size(A, d) < 3 for some dimension d; we don’t need any error checking on the size of A.

Computing a reduction

For a second example, consider the implementation of multidimensional reductions. A reduction takes an input array, and returns an array (or scalar) of smaller size. A classic example would be summing along particular dimensions of an array: given a three-dimensional array, you might want to compute the sum along dimension 2, leaving dimensions 1 and 3 intact.

The core algorithm

An efficient way to write this algorithm requires that the output array, B, is pre-allocated by the caller (later we’ll see how one might go about allocating B programmatically). For example, if the input A is of size (l,m,n), then when summing along just dimension 2 the output B would have size (l,1,n).

Given this setup, the implementation is shockingly simple:

function sumalongdims!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing
    fill!(B, 0)
    Bmax = CartesianIndex(size(B))
    for I in CartesianRange(size(A))
        B[min(Bmax,I)] += A[I]
    end
    B
end

The key idea behind this algorithm is encapsulated in the single statement B[min(Bmax,I)]. For our three-dimensional example where A is of size (l,m,n) and B is of size (l,1,n), the inner loop is essentially equivalent to

B[i,1,k] += A[i,j,k]

because min(1,j) = 1.
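The same mapping can be checked directly on a single index. The sizes below are arbitrary illustrative values for l, m, and n.

```julia
# A has size (4, 3, 5); summing along dimension 2 gives B of size (4, 1, 5).
Bmax = CartesianIndex((4, 1, 5))
I = CartesianIndex((2, 3, 5))      # some position in A
target = min(Bmax, I)              # position in B that accumulates A[I]
```

Here target is the index (2, 1, 5): coordinates along the retained dimensions pass through unchanged, and the coordinate along the summed dimension collapses to 1.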

The wrapper, and handling type-instability using function barriers

As a user, you might prefer an interface more like sumalongdims(A, dims) where dims specifies the dimensions you want to sum along. dims might be a single integer, like 2 in our example above, or (should you want to sum along multiple dimensions at once) a tuple or Vector{Int}. This is indeed the interface used in sum(A, dims); here we want to write our own (somewhat simpler) implementation.

A bare-bones implementation of the wrapper is straightforward:

function sumalongdims(A, dims)
    sz = [size(A)...]
    sz[[dims...]] = 1
    B = Array(eltype(A), sz...)
    sumalongdims!(B, A)
end

Obviously, this simple implementation skips all relevant error checking. However, here the main point I wish to explore is that the allocation of B turns out to be type-unstable: sz is a Vector{Int}, and because the length (number of elements) of a specific Vector{Int} is not encoded by the type itself, the dimensionality of B cannot be inferred.

Now, we could fix that in several ways, for example by annotating the result:

B = Array(eltype(A), sz...)::typeof(A)

However, this isn’t really necessary: in the remainder of this function, B is not used for any performance-critical operations. B simply gets passed to sumalongdims!, and it’s the job of the compiler to ensure that, given the type of B, an efficient version of sumalongdims! gets generated. In other words, the type instability of B’s allocation is prevented from “spreading” by the fact that B is henceforth used only as an argument in a function call. This trick, using a function-call to separate a performance-critical step from a potentially type-unstable precursor, is sometimes referred to as introducing a function barrier.

As a general rule, when writing multidimensional code you should ensure that the main iteration is in a separate function from type-unstable precursors. Even when you take appropriate precautions, there’s a potential “gotcha”: if your inner loop is small, julia’s ability to inline code might eliminate the intended function barrier, and you get dreadful performance. For this reason, it’s recommended that you annotate function-barrier callees with @noinline:

@noinline function sumalongdims!(B, A)
    ...
end

Of course, in this example there’s a second motivation for making this a standalone function: if this calculation is one you’re going to repeat many times, re-using the same output array can reduce the amount of memory allocation in your code.
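The function-barrier pattern is not specific to multidimensional code. Here is a minimal, self-contained sketch of it; the names make_data, total, and process are hypothetical and chosen only for this illustration.

```julia
# Type-unstable precursor: the element type of the result depends on a runtime value.
make_data(use_float::Bool) = use_float ? rand(100) : rand(1:10, 100)

# Kernel behind the barrier. @noinline keeps the call from being inlined away.
@noinline function total(v)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

function process(use_float::Bool)
    v = make_data(use_float)   # the compiler cannot infer one concrete type for v here
    return total(v)            # one dynamic dispatch, then a specialized, fast loop
end
```

Calling process(true) runs a version of total compiled for Vector{Float64}, and process(false) runs one compiled for a vector of integers; in both cases the loop itself is type-stable even though its caller is not.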

Filtering along a specified dimension (exploiting multiple indexes)

One final example illustrates an important new point: when you index an array, you can freely mix CartesianIndexes and integers. To illustrate this, we’ll write an exponential smoothing filter. An efficient way to implement such filters is to have the smoothed output value s[i] depend on a combination of the current input x[i] and the previous filtered value s[i-1]; in one dimension, you can write this as

function expfilt1!(s, x, α)
    0 < α <= 1 || error("α must be between 0 and 1")
    s[1] = x[1]
    for i = 2:length(x)
        s[i] = α*x[i] + (1-α)*s[i-1]
    end
    s
end

This would result in an approximately-exponential decay with timescale 1/α.
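To see where the exponential decay comes from, one can unroll the recurrence. The following is a short derivation sketch using the same symbols as the code, with s_1 = x_1 as the initial condition; the final approximation assumes α is small.

```latex
s_i = \alpha x_i + (1-\alpha)\, s_{i-1}, \qquad s_1 = x_1
\;\;\Longrightarrow\;\;
s_i = \alpha \sum_{k=0}^{i-2} (1-\alpha)^k\, x_{i-k} + (1-\alpha)^{i-1} x_1,
\qquad
(1-\alpha)^k = e^{k \ln(1-\alpha)} \approx e^{-\alpha k} \ \text{for small } \alpha .
```

An input that lies k steps in the past is thus weighted by roughly exp(-αk), which is a decay with timescale 1/α.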

Here, we want to implement this algorithm so that it can be used to exponentially filter an array along any chosen dimension. Once again, the implementation is surprisingly simple:

function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    Rpre = CartesianRange(size(x)[1:dim-1])
    Rpost = CartesianRange(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

@noinline function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end

Note once again the use of the function barrier technique. In the core algorithm (_expfilt!), our strategy is to use two CartesianIndex iterators, Ipre and Ipost, where the first covers dimensions 1:dim-1 and the second dim+1:ndims(x); the filtering dimension dim is handled separately by an integer-index i. Because the filtering dimension is specified by an integer input, there is no way to infer how many entries will be within each index-tuple Ipre and Ipost. Hence, we compute the CartesianRanges in the type-unstable portion of the algorithm, and then pass them as arguments to the core routine _expfilt!.

What makes this implementation possible is the fact that we can index x as x[Ipre, i, Ipost]. Note that the total number of indexes supplied is (dim-1) + 1 + (ndims(x)-dim), which is just ndims(x). In general, you can supply any combination of integer and CartesianIndex indexes when indexing an AbstractArray in Julia.
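A small sketch of such mixed indexing on a three-dimensional array (the array contents and index values are arbitrary):

```julia
x = reshape(collect(1:24), 2, 3, 4)

Ipre  = CartesianIndex((2,))   # covers dimension 1
Ipost = CartesianIndex((4,))   # covers dimension 3
v = x[Ipre, 3, Ipost]          # same element as x[2, 3, 4]

# A CartesianIndex can also stand in for several dimensions at once.
Ipair = CartesianIndex((2, 3))
u = x[Ipair, 4]                # also x[2, 3, 4]
```

In both cases the coordinates supplied add up to three, matching ndims(x).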

The AxisAlgorithms package makes heavy use of tricks such as these, and in turn provides core support for high-performance packages like Interpolations that require multidimensional computation.

Additional issues

It’s worth noting one point that has thus far remained unstated: all of the examples here are relatively cache efficient. This is a key property to observe when writing efficient code. In particular, julia arrays are stored in first-to-last dimension order (for matrices, “column-major” order), and hence you should nest iterations from last-to-first dimensions. For example, in the filtering example above we were careful to iterate in the order

for Ipost ...
    for i ...
        for Ipre ...
            x[Ipre, i, Ipost] ...

so that x would be traversed in memory-order.
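For an ordinary matrix the same rule means putting the column index in the outer loop and the row index in the inner loop. A minimal sketch, with a hypothetical helper name:

```julia
# Hypothetical helper: sums a matrix while walking it in memory order.
function sum_memory_order(A::Matrix{Float64})
    s = 0.0
    for j in 1:size(A, 2)        # last dimension in the outer loop
        for i in 1:size(A, 1)    # first dimension innermost: consecutive memory
            s += A[i, j]
        end
    end
    return s
end

s34 = sum_memory_order(ones(3, 4))
```

Swapping the two loops would give the same result but would stride through memory instead of reading it consecutively.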

Summary

As is hopefully clear by now, much of the pain of writing generic multidimensional algorithms is eliminated by Julia’s elegant iterators. The examples here just scratch the surface, but the underlying principles are very simple; it is hoped that these examples will make it easier to write your own algorithms.

Julia IDE work in Atom

2016-01-07T00:00:00+00:00

A PL designer used to be able to design some syntax and semantics for their language, implement a compiler, and then call it a day. – Sean McDirmid

In the few years since its initial release, the Julia language has made wonderful progress. Over four hundred contributors – and counting – have donated their time developing exciting and modern language features like channels for concurrency, a native documentation system, staged functions, compiled packages, threading, and tons more. In the lead up to 1.0 we have a faster and more stable runtime, a more comprehensive standard library, and a more enthusiastic community than ever before.

However, a programming language isn’t just a compiler or spec in a vacuum. More and more, the ecosystem around a language (the packages, tooling, and community that support you) is a huge determining factor in where a language can be used, and who it can be used by. Making Julia accessible to everybody means facing these issues head-on. In particular, we’ll be putting a lot of effort into building a comprehensive IDE, Juno, which supports users with features like smart autocompletion, plotting and data handling, interactive live coding and debugging, and more.

Julia users aren’t just programmers – they’re engineers, scientists, data mungers, financiers, statisticians, researchers, and many other things, so it’s vital that our IDE is flexible and extensible enough to support all their different workflows fluidly. At the same time, we want to avoid reinventing well-oiled wheels, and don’t want to compromise on the robust and powerful core editing experience that people have come to expect. Luckily enough, we think we can have our cake and eat it too by building on top of the excellent Atom editor.

The Atom community has done an amazing job of building an editor that’s powerful and flexible without sacrificing a pleasant and intuitive experience. Web technologies not only make hacking on the editor extremely accessible for new contributors, but also make it easy for us to experiment with exciting and modern features like live coding, making it a really promising option for our work.

Our immediate priorities will be to get basic interactive usage working really well, including strong multimedia support for display and graphics. Before long we’ll have a comprehensive IDE bundle which includes Juno, Julia, and a bunch of useful packages for things like plotting – with the aim that anyone can get going productively with Julia within a few minutes. Once the basics are in place, we’ll integrate the documentation system and the up-and-coming debugger, implement performance linting, and make sure that there’s help and tutorials in place so that it’s easy for everyone to get started.

Juno is implemented as a large collection of independent modules and plugins; although this adds some development overhead, we think it’s well worthwhile to make sure that other projects can benefit from our work. For example, our collection of IDE components for Atom, Ink, is completely language-agnostic and should be reusable by other languages.

New contributions are always welcome, so if you’re interested in helping to push this exciting project forward, check out the developer install instructions and send us a PR!

JSoC 2015 project: DataStreams.jl

2015-10-25T00:00:00+00:00

Data processing got ya down? Good news! The DataStreams.jl package, er, framework, has arrived!

The DataStreams processing framework provides a consistent interface for working with data, from source to sink and eventually every step in-between. It’s really about putting forth an interface (specific types and methods) to go about ingesting and transferring data sources that hopefully makes for a consistent experience for users, no matter what kind of data they’re working with.

How does it work?

DataStreams is all about creating “sources” (Julia types that represent true data sources; e.g. csv files, database backends, etc.), “sinks” or data destinations, and defining the appropriate Data.stream!(source, sink) methods to actually transfer data from source to sink. Let’s look at a quick example.

Say I have a table of data in a CSV file on my local machine and need to do a little cleaning and aggregation on the data before building a model with the GLM.jl package. Let’s see some code in action:

using CSV, SQLite, DataStreams, DataFrames, GLM

# let's create a Julia type that understands our data file
csv_source = CSV.Source("datafile.csv")

# let's also create an SQLite destination for our data
# according to its structure
db = SQLite.DB() # create an in-memory SQLite database

# creates an SQLite table
sqlite_sink = SQLite.Sink(Data.schema(csv_source), db, "mydata")

# parse the CSV data directly into our SQLite table
Data.stream!(csv_source, sqlite_sink)

# now I can do some data cleansing/aggregation
# ...various SQL statements on the "mydata" SQLite table...

# now I'm ready to get my data out and ready for model fitting
sqlite_source = SQLite.Source(sqlite_sink)

# stream our data into a Julia structure (Data.Table)
dt = Data.stream!(sqlite_source, Data.Table)

# convert to DataFrame (non-copying)
df = DataFrame(dt)

# do model-fitting
OLS = glm(Y~X,df,Normal(),IdentityLink())

Here we see it’s quite simple to create a Source type by wrapping a true datasource (our CSV file), a destination for that data (an SQLite table), and to transfer the data. We can then turn our SQLite.Sink into an SQLite.Source for getting the data back out again.
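To make the source/sink idea concrete, here is a minimal sketch of the pattern in plain Julia (1.x syntax). The names AbstractSource, AbstractSink, VectorSource, CollectSink and this stream! function are hypothetical stand-ins invented for illustration; they are not the DataStreams.jl types or methods, and the real interface may differ.

```julia
# Hypothetical stand-ins for illustration; these are NOT the DataStreams.jl types.
abstract type AbstractSource end
abstract type AbstractSink end

# A source that yields rows from an in-memory vector of tuples.
struct VectorSource <: AbstractSource
    rows::Vector{Tuple{String,Int}}
end

# A sink that collects whatever rows it is handed.
struct CollectSink <: AbstractSink
    rows::Vector{Tuple{String,Int}}
end
CollectSink() = CollectSink(Tuple{String,Int}[])

# One method per (source, sink) pair; dispatch selects the right one.
function stream!(src::VectorSource, sink::CollectSink)
    for row in src.rows
        push!(sink.rows, row)
    end
    return sink
end

source = VectorSource([("a", 1), ("b", 2)])
sink = stream!(source, CollectSink())
println(length(sink.rows))
```

Supporting a new kind of data then comes down to adding a type and the stream! methods that connect it to existing sources or sinks.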

So What Have You Really Been Working On?

Well, a lot actually. Even though the DataStreams framework is currently simple and minimalistic, it took a lot of back and forth on the design, including several discussions at this year’s JuliaCon at MIT. Even with a tidy little framework, however, the bulk of the work still lies in actually implementing the interface in various packages. The two that are ready for release today are CSV.jl and SQLite.jl. They are currently available for Julia 0.4+ only.

Quick rundown of each package:

CSV: provides types and methods for working with CSV and other delimited files. Aims to be (and currently is) the fastest and most flexible CSV reader in Julia.
SQLite: an interface to the popular SQLite local-machine database. Provides methods for creating/managing database files, along with executing SQL statements and viewing the results of such.

So What’s Next?

ODBC.jl: the next package to get the DataStreams makeover is ODBC. I’ve already started work on this, and it should hopefully be ready soon.
Other packages: I’m always on the hunt for new ways to spread the framework; if you’d be interested in implementing DataStreams for your own package or want to collaborate, just ping me and I’m happy to discuss!
transforms: an important part of data processing tasks is not just connecting to and moving the data to somewhere else: often you need to clean/transform/aggregate the data in some way in-between. Right now, that’s up to users, but I have some ideas around creating DataStreams-friendly ways to easily incorporate transform steps as data is streamed from one place to another.
DataStreams for chaining pipelines + transforms: I’m also excited about the idea of creating entire DataStreams, which would define entire data processing tasks end-to-end. Setting up a pipeline that could consistently move and process data gets even more powerful as we start looking into automatic-parallelism and extensibility.
DataStream scheduling/management: I’m also interested in developing capabilities around scheduling and managing DataStreams.

The work on DataStreams.jl was carried out as part of the Julia Summer of Code program, made possible thanks to the generous support of the Gordon and Betty Moore Foundation, and MIT.

JSoC 2015 project: Automatic Differentiation in Julia with ForwardDiff.jl

2015-10-23T00:00:00+00:00

This summer, I’ve had the good fortune to be able to participate in the first ever Julia Summer of Code (JSoC), generously sponsored by the Gordon and Betty Moore Foundation. My JSoC project was to explore the use of Julia for automatic differentiation (AD), a topic with a wide array of applications in the field of optimization.

Under the mentorship of Miles Lubin and Theodore Papamarkou, I completed a major overhaul of ForwardDiff.jl, a Julia package for calculating derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions (or any callable Julia type, really).

By the end of this post, you’ll hopefully know a little bit about how ForwardDiff.jl works, why it’s useful, and why Julia is uniquely well-suited for AD compared to other languages.

What is Automatic Differentiation?

In broad terms, automatic differentiation describes a class of algorithms for automatically taking exact derivatives of user-provided functions. In addition to producing more accurate results, AD methods are also often faster than other common differentiation methods (such as finite differencing).
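The accuracy point can be illustrated with a small experiment. The sketch below (Julia 1.x syntax, Base only) uses a hand-written central difference, which is an assumption chosen for illustration and not necessarily the scheme any particular package uses. It shows how the finite-difference error depends on the step size h, whereas a derivative propagated exactly is limited only by floating-point rounding.

```julia
# Central finite difference: an approximation whose error depends on the step h.
central_diff(f, x, h) = (f(x + h) - f(x - h)) / (2 * h)

f(x) = sin(x)
exact = cos(1.0)   # the derivative of sin is known in closed form

# A large h suffers from truncation error, a tiny h from rounding error.
for h in (1e-2, 1e-5, 1e-12)
    approx = central_diff(f, 1.0, h)
    println("h = ", h, "  error = ", abs(approx - exact))
end
```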

The two main flavors of AD are called forward mode and reverse mode. As you might’ve guessed, this post only discusses forward mode, which is the kind of AD implemented by ForwardDiff.jl.

Seeing ForwardDiff.jl In Action

Before we get down to the nitty-gritty details, it might be helpful to see a simple example that illustrates various methods from ForwardDiff.jl’s API.

The snippet below is a somewhat contrived example, but works well enough as an introduction to the package. First, we define a target function we’d like to differentiate, then use ForwardDiff.jl to calculate some derivatives of the function at a given input:

julia> using ForwardDiff

julia> f(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

julia> x = rand(5)
5-element Array{Float64,1}:
 0.986403
 0.140913
 0.294963
 0.837125
 0.650451

julia> g = ForwardDiff.gradient(f); # g = ∇f

julia> g(x)
5-element Array{Float64,1}:
 1.01358
 2.50014
 1.72574
 1.10139
 1.2445

julia> j = ForwardDiff.jacobian(g); # j = J(∇f)

julia> j(x)
5x5 Array{Float64,2}:
 0.585111  3.48083  1.7706    0.994057  1.03257
 3.48083   1.06079  5.79299   3.25245   3.37871
 1.7706    5.79299  0.423981  1.65416   1.71818
 0.994057  3.25245  1.65416   0.251396  0.964566
 1.03257   3.37871  1.71818   0.964566  0.140689

julia> ForwardDiff.hessian(f, x) # H(f)(x) == J(∇f)(x), as expected
5x5 Array{Float64,2}:
 0.585111  3.48083  1.7706    0.994057  1.03257
 3.48083   1.06079  5.79299   3.25245   3.37871
 1.7706    5.79299  0.423981  1.65416   1.71818
 0.994057  3.25245  1.65416   0.251396  0.964566
 1.03257   3.37871  1.71818   0.964566  0.140689

Tada!

Okay, that’s not too exciting - I could’ve just done the same thing with Calculus.jl. Why would I ever want to use ForwardDiff.jl?

The simple answer is that ForwardDiff.jl’s AD-based methods are, in many cases, much more performant than the finite differencing methods implemented in other packages.

How ForwardDiff.jl Works - An Overview

The key technique leveraged by ForwardDiff.jl is the implementation of several different ForwardDiffNumber types, each of which allocates storage space for both normal values and derivative values. Elementary numerical functions on a ForwardDiffNumber are then overloaded to evaluate both the original function and the function’s derivative, returning the results in the form of a new ForwardDiffNumber.

Thus, we can pass these number types into a general function $f$ (which is assumed to be composed of the overloaded elementary functions), and the derivative information is naturally propagated at each step of the calculation by way of the chain rule. The final result of the evaluation (usually a ForwardDiffNumber or an array of them) then contains both $f(x)$ and $f'(x)$ , where $x$ was the original point of evaluation.
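As a worked illustration, using standard dual-number notation (which this post does not otherwise introduce), write each input as $a + a'\epsilon$ with $\epsilon^2 = 0$, where $a$ is the value and $a'$ the derivative carried along with it. Multiplication and an elementary function such as sine then give:

$(a + a'\epsilon)(b + b'\epsilon) = ab + (a'b + ab')\epsilon$

$\sin(a + a'\epsilon) = \sin(a) + a'\cos(a)\epsilon$

In both cases the coefficient of $\epsilon$ is what the product rule and the chain rule prescribe, which is why overloading each elementary operation this way is enough to propagate derivatives through a whole function.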

Simple Forward Mode AD in Julia

The easiest way to write actual Julia code demonstrating this technique is to implement a simple dual number type. Note that there is already a Julia package dedicated to such an implementation, but we’re going to roll our own here for pedagogical purposes.

Here’s how we’ll define our DualNumber type:

immutable DualNumber{T} <: Number
    value::T
    deriv::T
end

value(d::DualNumber) = d.value
deriv(d::DualNumber) = d.deriv

Next, we can start defining functions on DualNumber. Here are a few examples to give you a feel for the process:

function Base.sqrt(d::DualNumber)
    new_value = sqrt(value(d))
    new_deriv = 0.5 / new_value
    return DualNumber(new_value, new_deriv*deriv(d))
end

function Base.sin(d::DualNumber)
    new_value = sin(value(d))
    new_deriv = cos(value(d))
    return DualNumber(new_value, new_deriv*deriv(d))
end

function Base.(:+)(a::DualNumber, b::DualNumber)
    new_value = value(a) + value(b)
    new_deriv = deriv(a) + deriv(b)
    return DualNumber(new_value, new_deriv)
end

function Base.(:*)(a::DualNumber, b::DualNumber)
    val_a, val_b = value(a), value(b)
    new_value = val_a * val_b
    new_deriv = val_b * deriv(a) + val_a * deriv(b)
    return DualNumber(new_value, new_deriv)
end

We can now evaluate the derivative of any scalar function composed of the above elementary functions. To do so, we simply pass an instance of our DualNumber type into the function, and extract the derivative from the result. For example:

julia> f(x) = sqrt(sin(x * x)) + x
f (generic function with 1 method)

julia> f(1.0)
1.8414709848078965

julia> d = f(DualNumber(1.0, 1.0))
DualNumber{Float64}(1.8414709848078965,1.589002649374538)

julia> deriv1 = deriv(d)
1.589002649374538

julia> using Calculus; deriv2 = Calculus.derivative(f, 1.0)
1.5890026493377403

julia> deriv1 - deriv2
3.679767601738604e-11

Notice that our dual number result comes close to the result obtained from Calculus.jl, but is actually slightly different. That slight difference is due to the approximation error inherent to the finite differencing method employed by Calculus.jl.
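For readers on a current Julia release, here is a self-contained sketch of the same idea in Julia 1.x syntax (struct instead of immutable, Base.:+ instead of Base.(:+)). The type name Dual and the helper fprime are illustrative choices, not part of ForwardDiff.jl. Instead of comparing against finite differences, it checks the result against the closed-form derivative worked out by hand.

```julia
# Julia 1.x syntax; `Dual` is an illustrative name, not a ForwardDiff.jl type.
struct Dual{T<:Real} <: Number
    value::T
    deriv::T
end

Base.:+(a::Dual, b::Dual) = Dual(a.value + b.value, a.deriv + b.deriv)
Base.:*(a::Dual, b::Dual) = Dual(a.value * b.value,
                                 b.value * a.deriv + a.value * b.deriv)
Base.sin(d::Dual) = Dual(sin(d.value), cos(d.value) * d.deriv)
function Base.sqrt(d::Dual)
    root = sqrt(d.value)
    return Dual(root, d.deriv / (2 * root))
end

f(x) = sqrt(sin(x * x)) + x

# Seed the derivative slot with 1.0 to differentiate with respect to x.
d = f(Dual(1.0, 1.0))

# Closed-form derivative of f, obtained by hand with the chain rule.
fprime(x) = x * cos(x^2) / sqrt(sin(x^2)) + 1

println(abs(d.deriv - fprime(1.0)))
```

The printed difference should be at the level of floating-point rounding, far below the finite-differencing error discussed above.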

In reality, the number types that ForwardDiff.jl provides are quite a bit more complicated than DualNumber. Instead of simple dual numbers, the various ForwardDiffNumber types behave like ensembles of dual numbers and hyper-dual numbers (the higher-order analog of dual numbers). This ensemble-based approach allows for simultaneous calculation of multiple higher-order partial derivatives in a single evaluation of the target function. For an in-depth examination of ForwardDiff.jl’s number type implementation, see this section of the developer documentation.
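A rough way to picture the ensemble idea is a dual number that carries a tuple of partial derivatives instead of a single one. The sketch below (Julia 1.x syntax) is only a simplified, first-order illustration under that assumption; MultiDual, seed and g are hypothetical names, and ForwardDiff.jl’s actual number types are more elaborate.

```julia
# `MultiDual` is a hypothetical name used for illustration only.
struct MultiDual{N,T<:Real} <: Number
    value::T
    partials::NTuple{N,T}
end

Base.:+(a::MultiDual{N}, b::MultiDual{N}) where {N} =
    MultiDual(a.value + b.value, a.partials .+ b.partials)
Base.:*(a::MultiDual{N}, b::MultiDual{N}) where {N} =
    MultiDual(a.value * b.value,
              b.value .* a.partials .+ a.value .* b.partials)
Base.sin(d::MultiDual) = MultiDual(sin(d.value), cos(d.value) .* d.partials)

# Seed input i with the i-th unit vector so that partials[i] tracks d/dx_i.
function seed(x::Vector{Float64})
    n = length(x)
    return [MultiDual(x[i], ntuple(j -> j == i ? 1.0 : 0.0, n)) for i in 1:n]
end

g(x) = x[1] * x[2] + sin(x[3])

# One evaluation of g yields the value and all three partial derivatives,
# which for this g are (x2, x1, cos(x3)).
result = g(seed([2.0, 3.0, 0.5]))
println(result.partials)
```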

Performance Comparison: The Ackley Function

The best way to illustrate the performance gains that can be achieved using ForwardDiff.jl is to do some benchmarking. Let’s compare the time to calculate the gradient of a function using ForwardDiff.jl, Calculus.jl, and a Python-based AD tool, AlgoPy.

The function we’ll be using in our test is the Ackley function, which is mathematically defined as

$f(\vec{x}) = -a \exp\left( -b \sqrt{\frac{1}{k} \sum_{i=1}^k x^{2}_{i}} \right) - \exp\left(\frac{1}{k} \sum_{i=1}^k \cos(cx_{i})\right) + a + \exp(1)$

Here’s the definition of the function in Julia:

function ackley(x)
    a, b, c = 20.0, -0.2, 2.0*π
    len_recip = inv(length(x))
    sum_sqrs = zero(eltype(x))
    sum_cos = sum_sqrs
    for i in x
        sum_cos += cos(c*i)
        sum_sqrs += i^2
    end
    return (-a * exp(b * sqrt(len_recip*sum_sqrs)) -
            exp(len_recip*sum_cos) + a + e)
end

…and here’s the corresponding Python definition:

def ackley(x):
    a, b, c = 20.0, -0.2, 2.0*numpy.pi
    len_recip = 1.0/len(x)
    sum_sqrs, sum_cos = 0.0, 0.0
    for i in x:
        sum_cos += algopy.cos(c*i)
        sum_sqrs += i*i
    return (-a * algopy.exp(b*algopy.sqrt(len_recip*sum_sqrs)) -
            algopy.exp(len_recip*sum_cos) + a + numpy.e)

Performance Comparison: The Results

The benchmarks were performed with input vectors of length 16, 1600, and 16000, taking the best time out of 5 trials for each test. I ran them on a late 2013 MacBook Pro (macOS 10.9.5, 2.6 GHz Intel Core i5, 8 GB 1600 MHz DDR3) with the following versions of the relevant libraries: Julia v0.4.1-pre+15, Python v2.7.9, ForwardDiff.jl v0.1.2, Calculus.jl v0.1.13, and AlgoPy v0.5.1.

Let’s start by looking at the evaluation times of ackley(x) in both Python and Julia:

length(x)	Python time (s)	Julia time (s)	Speed-Up vs. Python
16	0.00011	2.3e-6	47.83x
1600	0.00477	4.0269e-5	118.45x
16000	0.04747	0.00037	128.30x

As you can see, there’s already a significant performance difference between the languages. We’ll have to keep that in mind when comparing our Julia differentiation tools with AlgoPy, in order to avoid confusing the languages’ performance characteristics with those of the libraries (though there is obviously a solid coupling between the two concepts).

The below table shows the evaluation times of ∇ackley(x) using various libraries (the chunk_size column denotes a configuration option passed to the ForwardDiff.gradient method, see the chunk-mode docs for details.):

length(x)	AlgoPy time (s)	Calculus.jl time (s)	ForwardDiff time (s)	chunk_size
16	0.00212	2.2e-5	3.5891e-5	16
1600	0.53439	0.10259	0.01304	10
16000	101.55801	11.18762	1.35411	10

From the above tables, we can calculate the speed-up ratio of ForwardDiff.jl over the other libraries:

length(x)	Speed-Up vs. AlgoPy	Speed-Up vs. Calculus.jl
16	59.07x	0.61x
1600	40.98x	7.86x
16000	74.99x	8.26x

As you can see, Python + AlgoPy falls pretty short of the speeds achieved by Julia + ForwardDiff.jl, or even Julia + Calculus.jl. While Calculus.jl is actually about 1.6 times as fast as ForwardDiff.jl for the lowest input dimension vector, it is ~8 times slower than ForwardDiff.jl for the higher input dimension vectors.

Another metric that might be useful to look at is the “slowdown ratio” between the gradient evaluation time and the function evaluation time, defined as:

$\text{slowdown ratio} = \frac{\text{gradient time}}{\text{function time}}$

Here are the results (lower is better):

length(x)	AlgoPy ratio	Calculus.jl ratio	ForwardDiff.jl ratio
16	19.27	9.56	15.60
1600	112.03	2547.61	323.82
16000	2139.41	30236.81	3659.77
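The ratios can be recomputed from the two timing tables. The sketch below (Julia 1.x syntax) does so for the length(x) = 1600 row, assuming that each gradient time is divided by the function time measured in the same language. Because it starts from the rounded timings printed above, the results may differ from the table in the last digit.

```julia
# Timings in seconds for length(x) = 1600, copied from the tables above.
python_function_time = 0.00477
julia_function_time  = 4.0269e-5

algopy_gradient_time      = 0.53439
calculus_gradient_time    = 0.10259
forwarddiff_gradient_time = 0.01304

# Each gradient time is divided by the function time in the same language.
algopy_ratio      = algopy_gradient_time / python_function_time
calculus_ratio    = calculus_gradient_time / julia_function_time
forwarddiff_ratio = forwarddiff_gradient_time / julia_function_time

println(round.((algopy_ratio, calculus_ratio, forwarddiff_ratio), digits=2))
```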

Both AlgoPy and ForwardDiff.jl beat out Calculus.jl for evaluation at higher input dimensions, which isn’t too surprising. AlgoPy beating ForwardDiff.jl, though, might catch you off guard - ForwardDiff.jl had the fastest absolute runtimes, after all! One explanation for this outcome is that AlgoPy falls back to vectorized Numpy methods when calculating the gradient, while the ackley function itself uses your usual, slow Python scalar arithmetic. Julia’s scalar arithmetic performance is much faster than Python’s, so ForwardDiff.jl doesn’t have as much “room for improvement” as AlgoPy does.

Julia’s AD Advantage

At the beginning of this post, I promised I would give the reader an answer to the question: “Why is Julia uniquely well-suited for AD compared to other languages?”

There are several good answers, but the chief reason for Julia’s superiority is its efficient implementation of multiple dispatch.

Unlike many other languages, Julia’s type-based operator overloading is fast and natural, as it’s one of the central design tenets of the language. Since Julia is JIT-compiled, the bytecode representation of a Julia function can be tied directly to the types with which the function is called. This allows the compiler to optimize every Julia method for the specific input type at runtime.

This ability is phenomenally useful for implementing forward mode AD, which relies almost entirely on operator overloading in order to work. In most other scientific computing languages, operator overloading is either very slow (e.g. MATLAB), fraught with weird edge cases (e.g. Python), arduous to implement generally (e.g. C++) or some combination of all three. In addition, very few languages allow operator overloading to naturally extend to native, black-box, user-written code. Julia’s multiple dispatch is the secret weapon leveraged by ForwardDiff.jl to overcome these hurdles.

Future Directions

The new version of ForwardDiff.jl has just been released, but development of the package is still ongoing! Here’s a list of things I’d like to see ForwardDiff.jl support in the future:

More elementary function definitions on ForwardDiffNumber types
More optimized versions of existing elementary function definitions on ForwardDiffNumber types
Methods for evaluating Jacobian-matrix products (highly useful in conjunction with reverse mode AD).
Parallel/shared-memory/distributed-memory versions of current API methods for handling problems with huge input/output dimensions
A more robust benchmarking suite for catching performance regressions

If you have any ideas on how to make ForwardDiff.jl more useful, feel free to open a pull request or issue in the package’s GitHub repository.

JSoC 2015 project: Interactive Visualizations in Julia with GLVisualize.jl

2015-10-22T00:00:00+00:00

GLVisualize is an interactive visualization library that supports 2D and 3D rendering as well as building of basic GUIs. It’s written entirely in Julia and OpenGL. I’m really glad that I could continue working on this project with the support of Julia Summer of Code.

During JSoC, my main focus was on advancing GLVisualize, but also improving the surrounding infrastructure like GeometryTypes, FileIO, ImageMagick, MeshIO and FixedSizeArrays. All recorded gifs in this blog post suffer from lossy compression. You can click on most of them to see the code that produced them.

One of the most interesting parts of GLVisualize is that it combines GUIs and visualizations, instead of relying on a 3rd party library like Qt for GUIs. This has many advantages and disadvantages. The main advantage is that interactive visualizations share a lot of infrastructure with GUI libraries. By combining these two, new features are possible, e.g. text editing of labels in 3D space, or making elements of a visualization work like a button. These features should end up being pretty snappy, since GLVisualize was created with high performance in mind.

Obviously, the biggest downside is that it is really hard to reach the maturity and feature completeness of e.g. Qt.

So to really get the best of both worlds a lot of work is needed.

Current status of GLVisualize, and what I’ve been doing during JSoC

A surprisingly large amount of time went into improving FileIO together with Tim Holy. The selling point of FileIO is that one can just hand a file to FileIO, and it will recognize the format and load the respective IO library. This makes it a lot easier to start working with files in Julia, since no prior knowledge about formats and loading files in Julia is needed. This is perfect for a visualization library, since most visualizations start from data that comes in some format, which might even be unknown initially.
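One common way to recognize a format is to inspect the first few "magic" bytes of a file. The sketch below (Julia 1.x syntax) illustrates that general idea only; detect_format is a hypothetical helper, not part of FileIO’s API, and whether FileIO proceeds exactly like this is an assumption.

```julia
# Hypothetical helper for illustration; this is not FileIO's API.
function detect_format(bytes::Vector{UInt8})
    # Leading bytes that identify two well-known image formats.
    signatures = [
        (:PNG, UInt8[0x89, 0x50, 0x4e, 0x47]),   # "\x89PNG"
        (:GIF, UInt8[0x47, 0x49, 0x46, 0x38]),   # "GIF8"
    ]
    for (format, magic) in signatures
        n = length(magic)
        if length(bytes) >= n && bytes[1:n] == magic
            return format
        end
    end
    return :unknown
end

# The first six bytes of a GIF89a file.
println(detect_format(UInt8[0x47, 0x49, 0x46, 0x38, 0x39, 0x61]))
```

Once the format is known, a single entry point can dispatch to whichever package knows how to read it, which is what makes a generic drag and drop handler short.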

Since all files are loaded with the same function, it becomes much easier to implement functionality like drag and drop of any file supported by FileIO. To give you an example, the implementation of the drag and drop feature in GLVisualize only needs a few lines of code thanks to FileIO:

Another feature I’ve been working on is better 2D support. I’ve implemented different anti-aliased markers, text rendering and line types. Apart from the image markers, they all use the distance field technique to achieve view-independent anti-aliasing. Here are a few examples:

In the last example all the markers move together. This is actually one of the core features of GLVisualize. The markers share the same memory for the positions on the GPU without any overhead. Each marker then just has a different offset to that shared position. This is easily achieved in GLVisualize, since all visualization methods are defined on the GPU objects. This also works for GPU objects which come from some simulation calculated on the GPU.

During JSoC, I also implemented sliders and line editing widgets for GLVisualize. One can use them to add interactivity to parameters of a visualization:

I have also worked with David P. Sanders to visualize his billiard model, which demonstrates the particle system and a new camera type.

The particle system can use any mesh primitive. To make it easy to load and create meshes, Steve Kelly and I rewrote the Meshes package to include more features and have a better separation of mesh IO and manipulation. The IO is now in MeshIO, which supports the FileIO interface. The mesh types are in GeometryTypes and meshing algorithms are in different packages in the JuliaGeometry org.

In this example one can see that there are also some GUI widgets to interact with the camera. The small rectangles in the corner are for switching between orthographic and perspective projection. The cube can be used to center the camera on a particular side. These kinds of widgets are easy to implement in GLVisualize, as it has been built for GUIs and interactivity from the beginning. Better camera controls are a big usability win, and I will put more time into improving these even further.

I recorded one last demo to give you some more ideas of what GLVisualize is currently capable of:

The demo shows different kinds of animations, 3D text editing and pop-ups that are all relatively easy to include in any visualization created with GLVisualize.

All of this looks promising, but there is still a lot of work needed! First of all, there is still no tagged version of GLVisualize that will just install via Julia’s package manager. This is because Reactive.jl and Images.jl are currently not tagged on a version that works with GLVisualize.

On the other hand, the API is not that well thought out yet. It is planned to use more ideas from Escher.jl and Compose.jl to improve the API. The goal is to fully support the Compose interface at some point. That way, GLVisualize can be used as a backend for Gadfly. This will make Gadfly much better suited to large, animated data sets. In the next weeks, I will need to work on tutorials, documentation and handling edge cases better.

Big thanks go to the Julia team and everyone involved to make this possible!

JSoC 2015 project: Efficient data structures and algorithms for sequence analysis in BioJulia

2015-10-21T00:00:00+00:00

Participant: Kenta Sato (@bicycle1885)
Mentor: Daniel C. Jones (@dcjones)

Thanks to a grant from the Gordon and Betty Moore Foundation, I’ve enjoyed the Julia Summer of Code 2015 program administered by NumFOCUS and a trip to JuliaCon 2015 in Boston. During this program, I have created several packages providing data structures and algorithms for sequence analysis, mainly targeted at bioinformatics. Even though Julia had lots of practical packages for numerical computing on floating-point numbers, it lacked efficient and compact data structures that are fundamental in bioinformatics.

Recent development of high-throughput DNA sequencers has made it possible to sequence massive numbers of DNA fragments (known as reads) from biological samples within a day. The first step of sequence analysis is locating the positions of these fragments in another, long reference sequence; then we can detect genetic variants or gene expression based on the result. This step is called sequence mapping or aligning, and because reference sequences are most commonly genome-scale (about 3.2 billion characters long for human), a full-text search index is used to speed up this alignment process. This kind of full-text search index is implemented in many bioinformatics tools, most notably bowtie2 and BWA, whose papers have been cited thousands of times.

The main focus of my project was creating a full-text search index in Julia that is easy to use and efficient in practical applications. On the way to this goal, I’ve created several packages that are useful as building blocks for other data structures. I’m going to introduce these packages in this post.

IntArrays.jl

IntArrays.jl is a package for arrays of unsigned integers. So, is it useful? Yes, it is! This is because the IntArray type implemented in this package can store integers in as little space as possible. The IntArray type has a type parameter w that represents the number of bits required to encode elements in an array. For example, if each element is an integer between 0 and 3, you only need two bits to encode it and w can be set to 2 or greater. These 2-bit integers are packed into a buffer and therefore the array consumes only one fourth of the space of a usual array of bytes. The following is the case of a byte sequence of [0x01, 0x03, 0x02, 0x00]:

    index:                           1          2          3          4
    byte sequence (hex):          0x01       0x03       0x02       0x00
    byte sequence (bin):    0b00000001 0b00000011 0b00000010 0b00000000
    packed sequence (w=2):          01         11         10         00
    in-memory layout:         00101101
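The packing shown above can be sketched in a few lines of plain Julia (1.x syntax). The functions pack2 and unpack2 are hypothetical helpers written for this illustration, assuming the first element occupies the lowest two bits, which is what the layout above suggests; they are not the IntArrays.jl implementation.

```julia
# Pack up to four 2-bit values into one byte, first element in the lowest bits.
function pack2(values::Vector{UInt8})
    packed = 0x00
    for (k, v) in enumerate(values)
        packed |= (v & 0x03) << (2 * (k - 1))
    end
    return packed
end

# Read the i-th 2-bit element back out of the packed byte.
unpack2(packed::UInt8, i::Integer) = (packed >> (2 * (i - 1))) & 0x03

packed = pack2(UInt8[0x01, 0x03, 0x02, 0x00])
println(bitstring(packed))   # prints 00101101, matching the layout above
```

Every read or write needs a shift and a mask like this, which is where the extra cost compared to a plain Array comes from.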

The full type definition is IntArray{w,T,n}, where w is the number of bits for each element as I explained, T is the type of elements, and n is the dimension of the array. This type is a subtype of the AbstractArray{T,n} and will behave like a familiar array; allocation, random access and update are supported. IntVector and IntMatrix are also defined as type aliases like Vector and Matrix, respectively.

Here is an example:

julia> IntArray{2,UInt8}(2, 3)
2x3 IntArrays.IntArray{2,UInt8,2}:
 0x00  0x00  0x01
 0x00  0x00  0x03

julia> array = IntVector{2,UInt8}(6)
6-element IntArrays.IntArray{2,UInt8,1}:
 0x00
 0x00
 0x03
 0x03
 0x02
 0x00

julia> array[1] = 0x02
0x02

julia> array
6-element IntArrays.IntArray{2,UInt8,1}:
 0x02
 0x00
 0x03
 0x03
 0x02
 0x00

julia> sort!(array)
6-element IntArrays.IntArray{2,UInt8,1}:
 0x00
 0x00
 0x02
 0x02
 0x03
 0x03

And the memory footprint of IntArray is much smaller:

julia> sizeof(IntVector{2,UInt8}(1_000_000))
250000

julia> sizeof(Vector{UInt8}(1_000_000))
1000000

Since packing and unpacking integers in a buffer require additional operations, there are overheads in operations and IntArray is often slower than Array. I’ve tried to keep this discrepancy as small as possible, but the IntArray is about 4-5 times slower when sorting it:

julia> array = rand(0x00:0x03, 2^24);

julia> sort(array); @time sort(array);
  0.488779 seconds (8 allocations: 16.000 MB)

julia> iarray = IntVector{2}(array);

julia> sort(iarray); @time sort(iarray);
  2.290878 seconds (18 allocations: 4.001 MB)

If you have a great idea to improve the performance, please let me know!

IndexableBitVectors.jl

The next package is IndexableBitVectors.jl. You are probably familiar with the BitVector type in the standard library; the types defined in my package are static but indexable versions of it. Here “indexable” means that a query asking for the number of 1 bits within an arbitrary range can be answered in constant time. If you are already familiar with succinct data structures, you may know this is an important building block of other succinct data structures like wavelet trees, LOUDS, etc.

The package exports two variants of such bit vectors: SucVector and RRR. SucVector is simpler and faster than RRR, but RRR is compressible and will be smaller if 0/1 bits are localized in a bit vector. Both types split a bit vector into blocks and cache the number of 1 bits up to the start of each block. In SucVector, the extra space is about 1/4 bits per bit, so it will become ~25% larger than the original bit vector.

The most important query operation over these data structures would be the rank1(bv, i) query, which counts the number of 1 bits within bv[1:i]. Owing to the cached bit counts, we can finish the rank operation in constant time:

julia> using IndexableBitVectors

julia> bv = bitrand(2^30);

julia> function myrank1(bv, i)  # count ones by loop
           r = 0
           for j in 1:i
               r += bv[j]
           end
           return r
       end
myrank1 (generic function with 1 method)

julia> myrank1(bv, 2^29); @time myrank1(bv, 2^29);
  0.714866 seconds (6 allocations: 192 bytes)

julia> sbv = SucVector(bv);

julia> rank1(sbv, 2^29); @time rank1(sbv, 2^29);  # much faster!
  0.000003 seconds (6 allocations: 192 bytes)

julia> rrr = RRR(bv);

julia> rank1(rrr, 2^29); @time rank1(rrr, 2^29);  # much faster, too!
  0.000004 seconds (6 allocations: 192 bytes)

The select1(bv, j) query, which locates the j-th 1 bit in the bit vector bv, is also useful in many cases. For example, if a set of positive integers is represented in this bit vector, you can efficiently query the j-th smallest member in the set.
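To pin down what select1 returns, here is a naive reference version in plain Julia (1.x syntax). select1_naive is a hypothetical helper that scans linearly, so it shows the semantics only, not the efficient algorithm; returning 0 when there are fewer than j set bits is an assumption made for this sketch.

```julia
# Naive reference for select1: position of the j-th 1 bit, found by a linear scan.
function select1_naive(bv::BitVector, j::Integer)
    seen = 0
    for i in eachindex(bv)
        if bv[i]
            seen += 1
            seen == j && return i
        end
    end
    return 0   # assumption: 0 signals "fewer than j set bits"
end

# Represent the set {2, 3, 7} as a bit vector of length 8.
members = falses(8)
members[[2, 3, 7]] .= true

println(select1_naive(members, 3))   # third smallest member of the set: 7
```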

Let’s see the internal representation of SucVector to understand the magic. A bit vector is separated into large blocks:

type SucVector <: AbstractIndexableBitVector
    blocks::Vector{Block}
    len::Int
end

Each large block contains 256 bits and consists of four small blocks of 64 bits each. A large block stores the global count of 1s up to its starting position, and a small block stores the local count of 1s starting from the beginning of its parent large block. The bits themselves are stored in four bit chunks corresponding to the small blocks:

immutable Block
    # large block
    large::UInt32
    # small blocks
    #   the first small block is used for 8-bit extension of the large block
    #   hence, 40 (= 32 + 8) bits are available in total
    smalls::NTuple{4,UInt8}
    # bit chunks (64bits × 4 = 256bits)
    chunks::NTuple{4,UInt64}
end

Since the bit count of the first small block is always zero, we can exploit this space to extend the cache of the large block (red frame). When running the rank1(bv, i) query, it first picks the large and small block pair that the i-th bit belongs to, then adds their cached bit counts, and finally counts the remaining 1 bits in a chunk on the fly.
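The cached-count idea can be sketched with a single level of blocks. The code below (Julia 1.x syntax) is a deliberately simplified illustration: CachedRank is a hypothetical type with one cached count per 64-bit chunk, whereas SucVector uses the two-level large/small layout described above. The bit order within a chunk (first bit in the lowest position) is also an assumption.

```julia
# `CachedRank` is a hypothetical, simplified stand-in for SucVector.
struct CachedRank
    chunks::Vector{UInt64}   # 64 bits per chunk, first bit in the lowest position
    counts::Vector{Int}      # counts[k] = number of 1 bits before chunk k
end

function CachedRank(chunks::Vector{UInt64})
    counts = Vector{Int}(undef, length(chunks))
    total = 0
    for (k, chunk) in enumerate(chunks)
        counts[k] = total
        total += count_ones(chunk)
    end
    return CachedRank(chunks, counts)
end

# Number of 1 bits in positions 1:i, using one cached count and one popcount.
function rank1(r::CachedRank, i::Integer)
    i == 0 && return 0
    k, offset = divrem(i - 1, 64)   # 0-based chunk index and bit offset
    mask = offset == 63 ? typemax(UInt64) : (UInt64(1) << (offset + 1)) - UInt64(1)
    return r.counts[k + 1] + count_ones(r.chunks[k + 1] & mask)
end

r = CachedRank(UInt64[0xffffffffffffffff, 0x0000000000000005])
println(rank1(r, 67))
```

The work per query does not depend on i: one table lookup, one mask and one popcount, which is why the timings above stay in the microsecond range.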

As I mentioned, this data structure can be used as a building block of various data structures. The next package I’m going to introduce is one of them.

WaveletMatrices.jl

You may already know about the wavelet tree, which supports the rank and select queries like SucVector and RRR, but elements are not restricted to 0/1 bits. In fact, the rank and select queries are available on arbitrary unsigned integers. The wavelet tree can be thought of as a generalization of indexable bit vectors in this respect. What I’ve implemented is not the well-known wavelet tree, but a variant of it called the “wavelet matrix”. You can find an implementation and a link to a paper at WaveletMatrices.jl. According to the authors of the paper, the wavelet matrix is “simpler to build, simpler to query, and faster in practice than the levelwise wavelet tree”.

The WaveletMatrix type takes three type parameters: w, T, and B. w and T are analogous to those of IntArray{w,T,n}, and B is a type of indexable bit vector.

julia> using WaveletMatrices

julia> wm = WaveletMatrix{2}([0x00, 0x01, 0x02, 0x03])
4-element WaveletMatrices.WaveletMatrix{2,UInt8,IndexableBitVectors.SucVector}:
 0x00
 0x01
 0x02
 0x03

julia> wm[3]
0x02

julia> rank(0x02, wm, 2)
0

julia> rank(0x02, wm, 3)
1

julia> xs = rand(0x00:0x03, 2^16);

julia> wm = WaveletMatrix{2}(xs);  # 2-bit encoding

julia> sum(xs[1:2^15] .== 0x03)
8171

julia> rank(0x03, wm, 2^15)
8171

The details of the data structure and algorithms are relatively simple but beyond the scope of this post. For people who are interested in this data structure, the paper I mentioned above and my implementation would be helpful. There are more operations that the wavelet matrix can run efficiently and those operations will be added in the future.
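For readers who want a feel for the idea without reading the paper, here is a rough sketch of how rank can be answered on a wavelet matrix. It is written for current Julia (1.x) and is not the WaveletMatrices.jl code: ToyWaveletMatrix and toyrank are hypothetical names, and the per-level rank is computed by a linear scan where a real implementation would use an indexable bit vector such as SucVector.

```julia
# Toy wavelet matrix over w-bit values stored in UInt8 (illustration only).
struct ToyWaveletMatrix
    bits::Vector{BitVector}   # one bit vector per level, most significant bit first
    nzeros::Vector{Int}       # number of 0 bits on each level
    w::Int                    # number of bits per element
end

function ToyWaveletMatrix(xs::Vector{UInt8}, w::Int)
    bits = BitVector[]
    nzeros = Int[]
    cur = copy(xs)
    for level in (w - 1):-1:0
        b = BitVector(Bool[((x >> level) & 0x01) == 0x01 for x in cur])
        push!(bits, b)
        push!(nzeros, count(!, b))
        # stable partition: elements with a 0 bit first, then elements with a 1 bit
        cur = vcat(cur[.!b], cur[b])
    end
    return ToyWaveletMatrix(bits, nzeros, w)
end

# Number of occurrences of `a` among the first `i` elements
function toyrank(a::UInt8, wm::ToyWaveletMatrix, i::Integer)
    p, e = 0, Int(i)          # prefix lengths delimiting the current range
    for (k, level) in enumerate((wm.w - 1):-1:0)
        b = wm.bits[k]
        ones_p = count(view(b, 1:p))   # naive rank1; O(1) with an indexable bit vector
        ones_e = count(view(b, 1:e))
        if ((a >> level) & 0x01) == 0x01
            p, e = wm.nzeros[k] + ones_p, wm.nzeros[k] + ones_e
        else
            p, e = p - ones_p, e - ones_e
        end
    end
    return e - p
end
```

With the four-element example from the session above, toyrank(0x02, ToyWaveletMatrix(UInt8[0, 1, 2, 3], 2), 3) follows the same steps as rank(0x02, wm, 3): one bit of the query value is consumed per level, so a query needs w bit-vector rank operations.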

FMIndexes.jl

80% of sequence analysis in bioinformatics is about sequence search, which includes pattern search, homologous gene search, genome comparison, short-read mapping, and so on. The FM-Index is often regarded as one of the most efficient indices for full-text search, and I’ve implemented it in the FMIndexes.jl package. Thanks to the packages I’ve introduced so far, its code looks really simple. For example, counting the number of occurrences of a given pattern in a text can be written as follows (slightly simplified for explanatory purposes):

function count(query, index::FMIndex)
    sp, ep = 1, length(index)
    # backward search
    i = length(query)
    while sp ≤ ep && i ≥ 1
        char = convert(UInt8, query[i])
        c = index.count[char+1]
        sp = c + rank(char, index.bwt, sp - 1) + 1
        ep = c + rank(char, index.bwt, ep)
        i -= 1
    end
    return length(sp:ep)
end

A unique property of the FM-Index is that the index itself is just a permutation of the characters of the original text plus counts of the characters contained in it. This permutation is called the Burrows-Wheeler transform (also known as BWT), and the permuted text is stored in a wavelet matrix (or a wavelet tree) in order to efficiently count the number of characters within a specific region. Therefore, the space required to index a text is often smaller than that of other full-text indices (actually, in practice, efficiently finding positions of a query needs auxiliary data as well). Moreover, this transform is bijective, and thus the original text can be restored from an index.
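To make the relationship between the BWT, the character counts, and backward search concrete, here is a deliberately naive sketch for current Julia (1.x). It is not how FMIndexes.jl works internally: ToyFMIndex, occ, and countoccurrences are hypothetical names, the suffixes are sorted by direct comparison, the rank is a linear scan instead of a wavelet matrix query, and a 0x00 byte is assumed to be usable as an end-of-text sentinel.

```julia
# Naive FM-Index sketch: slow construction and O(n) rank, for illustration only.
struct ToyFMIndex
    bwt::Vector{UInt8}     # Burrows-Wheeler transform of the text plus sentinel
    count::Vector{Int}     # count[c + 1] = number of symbols smaller than byte c
end

function ToyFMIndex(text::AbstractString)
    # assumption: the text contains no NUL byte, so 0x00 can act as the sentinel
    bytes = vcat(Vector{UInt8}(codeunits(text)), 0x00)
    n = length(bytes)
    # suffix array by direct (lexicographic) comparison of suffixes
    sa = sort(collect(1:n), by = i -> bytes[i:end])
    # the BWT is the character preceding each sorted suffix
    bwt = [bytes[i == 1 ? n : i - 1] for i in sa]
    smaller = zeros(Int, 256)
    for b in bytes, c in (Int(b) + 1):255
        smaller[c + 1] += 1
    end
    return ToyFMIndex(bwt, smaller)
end

# occurrences of byte c in bwt[1:i] (a wavelet matrix would answer this quickly)
occ(c::UInt8, bwt::Vector{UInt8}, i::Integer) = count(==(c), view(bwt, 1:i))

function countoccurrences(query::AbstractString, index::ToyFMIndex)
    sp, ep = 1, length(index.bwt)
    q = codeunits(query)
    i = length(q)
    # backward search: narrow the suffix-array range one character at a time
    while sp <= ep && i >= 1
        char = q[i]
        c = index.count[char + 1]
        sp = c + occ(char, index.bwt, sp - 1) + 1
        ep = c + occ(char, index.bwt, ep)
        i -= 1
    end
    return length(sp:ep)
end
```

The loop body mirrors the count function shown earlier; only the rank query and the construction are replaced by slow stand-ins.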

Building an index for full-text search is ridiculously simple: just pass a sequence to the constructor:

julia> using FMIndexes

julia> fmindex = FMIndex("abracadabra");

The FMIndex type supports two main queries: count and locate. The count(query, index) query literally counts the number of occurrences of the query string, and the locate(query, index) query returns the starting positions of the query. In order to restore the original text, you can use the restore function. Here is a simple usage example:

julia> count("a", fmindex)
5

julia> count("abra", fmindex)
2

julia> locate("a", fmindex) |> collect
5-element Array{Any,1}:
 11
  8
  1
  4
  6

julia> locate("abra", fmindex) |> collect
2-element Array{Any,1}:
 8
 1

julia> bytestring(restore(fmindex))
"abracadabra"

As an example for bioinformaticians, let’s try several queries on a chromosome. You also need to install the Bio.jl package to efficiently parse a FASTA file. The next script reads a chromosome from a FASTA file, builds an FM-Index, and then serializes it into a file for later use (I love the serializers of Julia, they are available for free!):

index.jl

using Bio.Seq
using IntArrays
using FMIndexes

# encode a DNA sequence with 3-bit unsigned integers;
# this is because a reference genome has five nucleotides: A/C/G/T/N.
function encode(seq)
    encoded = IntVector{3,UInt8}(length(seq))
    for i in 1:endof(seq)
        encoded[i] = convert(UInt8, seq[i])
    end
    return encoded
end

# read a chromosome from a FASTA file
filepath = ARGS[1]
record = first(open(filepath, FASTA))
println(record.name, ": ", length(record.seq), "bp")
# build an FM-Index
fmindex = FMIndex(encode(record.seq))
# save it in a file
open(string(filepath, ".index"), "w+") do io
    serialize(io, fmindex)
end

OK, then create an index for human chromosome 22 (you can download it from here):

$ julia4 index.jl chr22.fa
chr22: 50818468bp
$ ls -lh chr22.fa.index
-rw-r--r--+ 1 kenta  staff    74M  9 26 06:30 chr22.fa.index

After construction finishes (this will take several minutes), read the index in the REPL:

julia> using FMIndexes

julia> fmindex = open(deserialize, "chr22.fa.index");

Now you can execute queries to search for a DNA fragment:

julia> using Bio.Seq

julia> count(dna"GACTTTCAC", fmindex)  # this DNA fragment hits at 111 locations
111

julia> count(dna"GACTTTCACTTT", fmindex)  # this hits at 3 locations
3

julia> locate(dna"GACTTTCACTTT", fmindex) |> collect  # the loci of these hits
3-element Array{Any,1}:
 36253071
 47308573
 34159872

julia> count(dna"GACTTTCACTTTCCC", fmindex)  # found a unique hit!
1

julia> locate(dna"GACTTTCACTTTCCC", fmindex) |> collect
1-element Array{Any,1}:
 36253071

julia> @time locate(dna"GACTTTCACTTTCCC", fmindex);  # this can be located in 32 μs!
  0.000032 seconds (5 allocations: 192 bytes)

This locus, chr22:36253071, is the starting position of the APOL1 gene.

Applications

My aim in creating these packages was to prove that it is practicable to implement high-performance data structures for bioinformatics in Julia. I’m pretty sure that it is, but others may be skeptical. So, I’m going to prove it by writing useful and performant applications using these packages. Now I’m working on FMM.jl, which aligns massive amounts of DNA fragments to a genome sequence using the FM-Index and other algorithms. This is still a work in progress, and there are probably many bugs and unusual cases I should take care of, but its performance is not so bad compared to other implementations.

The BioJulia project is also under active development. The packages I made are intended to work with the Bio.jl package. If you are interested in the BioJulia project, we really welcome your contributions!

JSoC 2015 project: Interactive 3D Graphics in the Browser with Compose3D

2015-10-20T00:00:00+00:00

Over the last three months, I’ve been working on Compose3D, which is an extension of the amazing Compose package to 3D. My work on Compose3D began as a project for my Computer Graphics course along with Pranav T Bhat, and by the end of the course, we had a working prototype for Compose3D with support for contexts and geometries and a very basic WebGL backend.

It has been my pleasure to have been able to continue this work under the guidance of Shashi Gowda and Simon Danisch as a part of the first ever Julia Summer of Code, generously sponsored by the Gordon and Betty Moore Foundation. While I’ve been able to add quite a lot of functionality to Compose3D, it isn’t totally ready for release yet. Hopefully, in some time it will be ready. But as a happy side effect, I have been able to abstract out the WebGL rendering functionality provided by the original prototype (and a lot more!) to a separate package called ThreeJS.jl, which can now be used to render 3D graphics in browsers using Julia, opening up possibilities of displaying such scenes in IJulia notebooks and Escher.

ThreeJS.jl

ThreeJS is now responsible for all the WebGL rendering done by Compose3D. It can also be used as a standalone package for other graphics packages to use as a backend.

Initially, my approach to rendering scenes in Compose3D was to just emit the corresponding JavaScript code into the IJulia notebook, which would then run it! This worked pretty well in IJulia notebooks, but it was soon apparent that there were several flaws with this approach.

It was hard to extend.
Did not play well with Escher.
Nor did it work with Interact to provide interactivity.

So Shashi suggested implementing a Polymer wrapper around the excellent three.js library, to create threejs web components. The Polymer team had done some work on creating threejs components and had a basic implementation of the same ready, which I promptly forked and tweaked to add functionality I needed. It’s quite safe to say that I’ve spent more time writing JavaScript than Julia during JSoC!

Switching over to using web components suddenly opened up 2 major avenues. Compose3D could now work with Escher and also provide interactivity. ThreeJS outputs Patchwork elements, which lets it use Patchwork’s clever diffing capabilities, thereby updating only the required DOM elements and helping performance.

On the other hand, web components introduced issues with IJulia notebooks regarding serving the files required by ThreeJS. I’m still working on finding a good solution for this problem, but for now, a hack gets ThreeJS working in IJulia, albeit with some limitations.

Drawing stuff!

Anyway, now we were all set to draw 3D scenes in browsers! The below code snippet, for example, would draw a red cube illuminated from a corner. The camera in the scenes drawn by ThreeJS can be rotated, zoomed and panned using your mouse or trackpad, allowing you to explore the scene.

import ThreeJS
ThreeJS.outerdiv() << (ThreeJS.initscene() <<
    [
        ThreeJS.mesh(0.0, 0.0, 0.0) <<
        [
            ThreeJS.box(1.0,1.0,1.0),
            ThreeJS.material(Dict(:kind=>"lambert",:color=>"red"))
        ],
        ThreeJS.pointlight(3.0, 3.0, 3.0),
        ThreeJS.camera(0.0, 0.0, 10.0)
    ])

Making them interactive

Currently, interactivity is broken in IJulia (a side effect of the switch to Polymer 1.0, and the new sneaky DOM), so Escher is the way to go if you want to interact with your 3D scene. As an example, we can take the same scene as before, add a slider, and make the slider control the size of the cube.

import ThreeJS
function main(window)
  push!(window.assets, "widgets")
  push!(window.assets, ("ThreeJS", "threejs"))
  side = Input(1.0)
  vbox(
    slider(1.0:5.0) >>> side,
    lift(side) do val
      ThreeJS.outerdiv() << (ThreeJS.initscene() <<
      [
          ThreeJS.mesh(0.0, 0.0, 0.0) <<
          [
              ThreeJS.box(val, val, val),
              ThreeJS.material(Dict(:kind=>"lambert",:color=>"red"))
          ],
          ThreeJS.pointlight(3.0, 3.0, 3.0),
          ThreeJS.camera(0.0, 0.0, 10.0)
      ])
    end
  )
end

You can also do animations!

Small-scale animations can also be created using Escher. Instead of using sliders to update the elements, we just update them at certain intervals using the every or fpswhen functions. A scene with a rotating cube can be drawn with just a couple of modifications to the above code.

import ThreeJS
function main(window)
  push!(window.assets, "widgets")
  push!(window.assets, ("ThreeJS", "threejs"))
  rx = 0.0
  ry = 0.0
  rz = 0.0
  delta = fpswhen(window.alive, 60) #Update at 60 FPS
  lift(delta) do _
      rx += 0.5
      ry += 0.5
      rz += 0.5
      ThreeJS.outerdiv() << (ThreeJS.initscene() <<
      [
          ThreeJS.mesh(0.0, 0.0, 0.0) <<
          [
              ThreeJS.box(2.0, 2.0, 2.0, rx = rx, ry = ry, rz = rz),
              ThreeJS.material(Dict(:kind=>"lambert",:color=>"red"))
          ],
          ThreeJS.pointlight(3.0, 3.0, 3.0),
          ThreeJS.camera(0.0, 0.0, 10.0)
      ])
    end
end

Surf and mesh plots! (Sort of)

ThreeJS has support to render parametric surfaces, which are basically the kind of surfaces drawn by a typical surf plot. It also has support for drawing lines like a typical mesh plot. Colormaps can be applied to these surfaces by passing in an array of colors to be used. Colors to be applied are calculated and chosen by ThreeJS. These come into effect when put together with materials using the colorkind property of vertex. Screenshots of such surfaces drawn by ThreeJS are shown below.

Compose3D

Compose3D provides an abstraction over the rendering library and lets you compose together primitives to build scenes, just like its inspiration, the Compose library. This lets you create very interesting structures with very little code! Compose3D has similar features to Compose, with users being able to create 3D contexts, and then use relative and absolute measures inside them and compose other primitives together.

My favorite example to showcase Compose3D would be the Sierpinski pyramid example. Here, we split the parent context into the sections that we want and then just draw the pyramid in them! So the bottom half of the 3D space is split into 4, and then, a pyramid is arranged on top of them.

using Compose3D

function sierpinski(n)
    if n == 0
        compose(Context(0w,0h,0d,1w,1h,1d),pyramid(0w,0h,0d,1w,1h)) #The basic unit
    else
        t = sierpinski(n - 1)
        compose(Context(0w,0h,0d,1w,1h,1d),
        (Context(0w,0h,0d,(1/2)w,(1/2)h,(1/2)d), t),
        (Context(0w,0h,0.5d,(1/2)w,(1/2)h,(1/2)d), t),
        (Context(0.5w,0h,0.5d,(1/2)w,(1/2)h,(1/2)d), t),
        (Context(0.5w,0h,0d,(1/2)w,(1/2)h,(1/2)d), t),
        (Context(0.25w,0.5h,0.25d,(1/2)w,(1/2)h,(1/2)d), t)) #The top one
    end
end
compose(Context(-5mm,-5mm,-5mm,10mm,10mm,10mm),sierpinski(3))

And voila! You have a Sierpinski pyramid of level 3 like in the figure below.

The switch to ThreeJS gives Compose3D all the advantages that come with ThreeJS. This includes interactivity and animations!

For example, the same Sierpinski example can have some interactive elements, say a slider defining the number of levels of recursion and maybe some controlling the colors of the pyramid. This can be done easily in Escher just like it was done with ThreeJS. After defining the sierpinski function given above, just creating a slider and hooking it up to the sierpinski function will set this up!

function main(window)
    push!(window.assets, ("ThreeJS", "threejs")) #Push the threejs static assets
    push!(window.assets, "widgets")
    n = Input(0.0)

    vbox(
        slider(0.0:3.0) >>> n, #Set up the slider
        lift(n) do i
            #Draw the composed figure!
            draw(
                Patchable3D(100,100),
                compose(
                    Context(-5mm,-5mm,-5mm,10mm,10mm,10mm), sierpinski(i)
                )
            )
        end
    )
end

As an example of animations, I ported the Escher boids example by Iain Dunning from 2D to 3D, and a screencast of the same can be found below.

Future directions

Several new primitives have been added in ThreeJS which don’t yet have corresponding primitives in Compose3D.
Add support for text in ThreeJS allowing use of labels in plots.
Being able to use surf and mesh that will automatically draw scaled surface plots in browsers and a WebGL based plotting library around ThreeJS.
Actually get Compose3D ready for public use!

JSoC 2015 project: NullableArrays.jl

2015-10-16T00:00:00+00:00

My project under the 2015 Julia Summer of Code program has been to develop the NullableArrays package, which provides the NullableArray data type and its respective interface. I first encountered Julia earlier this year as a suggestion for which language I ought to learn as a matriculating PhD student in statistics. This summer has been an incredible opportunity for me both to develop as a young programmer and to contribute to an open-source community as full of possibility as Julia’s. I’d be remiss not to thank Alan Edelman’s group at MIT, NumFocus, and the Gordon & Betty Moore Foundation for their financial support, John Myles White for his mentorship and guidance, and many others of the Julia community who have helped to contribute both to the package and to my edification as a programmer over the summer. Much of my work on this project was conducted at the Recurse Center, where I received the support of an amazing community of self-directed learners.

The `NullableArray` data structure

NullableArrays are array structures that efficiently represent missing values without incurring the performance difficulties that face DataArray objects, which have heretofore been used to store data that include missing values. The core issue responsible for DataArrays’ performance woes concerns the way in which they represent missing values, i.e. through a token NA object of token type NAType. In particular, indexing into, say, a DataArray{Int} can return an object either of type Int or of type NAType. This design does not provide sufficient information to Julia’s type inference system at JIT-compilation time to support the sort of static analysis that Julia’s compiler can otherwise leverage to emit efficient machine code. We can illustrate as much through the following example, in which we calculate the sum of five million random Float64s stored in a DataArray:

julia> using DataArrays
# warnings suppressed…

julia> A = rand(5_000_000);

julia> D = DataArray(A);

julia> function f(D::AbstractArray)
           x = 0.0
           for i in eachindex(D)
               x += D[i]
           end
           x
       end
f (generic function with 1 method)

julia> f(D);

julia> @time f(D)
  0.163567 seconds (10.00 M allocations: 152.598 MB, 9.21% gc time)
2.500102419334644e6

Looping through and summing the elements of D is over twenty times slower and allocates far more memory than running the same loop over A:

julia> f(A);

julia> @time f(A)
  0.007465 seconds (5 allocations: 176 bytes)
2.500102419334644e6

This is because the code generated for f(D) must assume that getindex(D, i) for an arbitrary index i may return an object either of type Float64 or of type NAType and hence must “box” every object returned from indexing into D. The performance penalty incurred by this requirement is reflected in the comparison above. (The interested reader can find more about these issues here.)

On the other hand, NullableArrays are designed to support the sort of static analysis used by Julia’s type inference system to generate efficient machine code. The crux of the strategy is to use a single type, Nullable{T}, to represent both missing and present values. Nullable{T} objects are specialized containers that hold precisely either one or zero values. A Nullable that wraps, say, 5 can be taken to represent a present value of 5, whereas an empty Nullable{Int} can represent a missing value that, if it had been present, would have been of type Int. Crucially, both such objects are of the same type, i.e. Nullable{Int}. Interested readers can hear a bit more on these design considerations in my JuliaCon 2015 lightning talk.
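The "one type for both cases" idea can be illustrated with a small stand-in container. The sketch below targets current Julia (1.x), where Nullable is no longer part of Base, so it defines a hypothetical Maybe{T} type with the same shape (a flag plus a payload); the names Maybe, add, and total are illustrative and are not part of NullableArrays.

```julia
# Hypothetical stand-in for Nullable{T}: a flag plus a payload of fixed type T.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)        # empty: payload left uninitialized
    Maybe{T}(x) where {T} = new{T}(true, x)     # present value
end
Maybe(x::T) where {T} = Maybe{T}(x)

# Addition that propagates emptiness; the result type is always Maybe{T}.
add(a::Maybe{T}, b::Maybe{T}) where {T} =
    (a.hasvalue && b.hasvalue) ? Maybe{T}(a.value + b.value) : Maybe{T}()

function total(xs::Vector{Maybe{Float64}})
    acc = Maybe(0.0)
    for x in xs
        acc = add(acc, x)    # acc keeps the same concrete type on every iteration
    end
    return acc
end

xs = [Maybe(1.0), Maybe(2.5)]
present = total(xs)                               # holds 3.5
empty_total = total(vcat(xs, Maybe{Float64}()))   # empty, but still a Maybe{Float64}
```

Because a present and an empty Maybe{Float64} share one concrete type, the element type of xs is concrete and the accumulator in total never changes type, which is the property the paragraph above attributes to Nullable{T}.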

Here is the result of running the same loop over a comparable NullableArray:

julia> using NullableArrays

julia> X = NullableArray(A);

julia> function f(X::NullableArray)
           x = Nullable(0.0)
           for i in eachindex(X)
               x += X[i]
           end
           x
       end
f (generic function with 1 method)

julia> f(X);

julia> @time f(X)
  0.009812 seconds (5 allocations: 192 bytes)
Nullable(2.500102419334644e6)

As can be seen, naively looping over a NullableArray is on the same order of magnitude as naively looping over a regular Array in terms of both time elapsed and memory allocated. Below is a set of plots (drawn with Gadfly.jl) that visualize the results of running 20 benchmark samples of f over both NullableArray and DataArray arguments each consisting of 5,000,000 random Float64 values and containing either zero null entries or approximately half randomly chosen null entries.

Of course, it is possible to bring the performance of such a loop over a DataArray up to par with that of a loop over an Array. But such optimizations generally introduce additional complexity that oughtn’t to be required to achieve acceptable performance in such a simple task. Considerably more complex code can be required to achieve performance in more involved implementations, such as that of broadcast!. We intend for NullableArrays to perform well under involved tasks involving missing data while requiring as little interaction with NullableArray internals as possible. This includes allowing users to leverage extant implementations without sacrificing performance. Consider for instance the results of relying on Base’s implementation of broadcast! for DataArray and NullableArray arguments (i.e., having omitted the respective src/broadcast.jl from each package’s source code). Below are plots that visualize the results of running 20 benchmark samples of broadcast!(dest, src1, src2), where dest and src2 are 5_000_000 x 2 Arrays, NullableArrays or DataArrays, and src1 is a 5_000_000 x 1 Array, NullableArray or DataArray. As above, the NullableArray and DataArray arguments are tested in cases with either zero or approximately half null entries:

We have designed the NullableArray type to feel as much like a regular Array as possible. However, that NullableArrays return Nullable objects is a significant departure from both Array and DataArray behavior. Arguably the most important issue is to support user-defined functions that lack methods for Nullable arguments as they interact with Nullable and NullableArray objects. Throughout my project I have also worked to develop interfaces that make dealing with Nullable objects user-friendly and safe.

Given a method f defined on an argument signature of types (U1, U2, …, UN), we would like to provide an accessible, safe and performant way for a user to call f on arguments of signature (Nullable{U1}, Nullable{U2}, …, Nullable{UN}) without having to extend f herself. Doing so should return Nullable(f(get(u1), get(u2), …, get(un))) if each argument is non-null, and should return an empty Nullable if any argument is null. Systematically extending an arbitrary method f over Nullable argument signatures is often referred to as “lifting” f over the Nullable arguments.
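The lifting rule just described can be written down directly. The following is a sketch for current Julia (1.x), not the NullableArrays implementation: since Nullable is no longer in Base, it uses a hypothetical Maybe{T} stand-in, and the hypothetical lift helper takes the result type R as an explicit argument rather than inferring it.

```julia
# Hypothetical stand-in for Nullable{T}.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)        # empty: payload left uninitialized
    Maybe{T}(x) where {T} = new{T}(true, x)     # present value
end
Maybe(x::T) where {T} = Maybe{T}(x)

# Lift f over Maybe arguments. Assumption: the caller supplies the result type R.
function lift(f, R::Type, args::Maybe...)
    # any empty argument makes the result empty
    all(a -> a.hasvalue, args) || return Maybe{R}()
    # otherwise unwrap every argument, call f, and wrap the result
    return Maybe{R}(f(map(a -> a.value, args)...))
end

double(x::Int) = 2x                             # has no method for Maybe arguments

lifted = lift(double, Int, Maybe(4))            # present, holds 8
skipped = lift(+, Int, Maybe(1), Maybe{Int}())  # empty, because one argument is empty
```

Passing R explicitly sidesteps the question of how to pick the type of an empty result, which a real implementation has to answer through type inference or a similar mechanism.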

NullableArrays offers keyword arguments for certain methods such as broadcast and map that direct the latter methods to lift passed function arguments over NullableArray arguments:

julia> X = NullableArray(collect(1:10), rand(Bool, 10))
10-element NullableArray{Int64,1}:
 #NULL
 #NULL
 #NULL
     4
     5
     6
     7
     8
 #NULL
    10

julia> f(x::Int) = 2x
f (generic function with 2 methods)

julia> map(f, X)
ERROR: MethodError: `f` has no method matching f(::Nullable{Int64})
Closest candidates are:
  f(::Any, ::Any)
 [inlined code] from /Users/David/.julia/v0.4/NullableArrays/src/map.jl:93
 in _F_ at /Users/David/.julia/v0.4/NullableArrays/src/map.jl:124
 in map at /Users/David/.julia/v0.4/NullableArrays/src/map.jl:172

julia> map(f, X; lift=true)
10-element NullableArray{Int64,1}:
 #NULL
 #NULL
 #NULL
     8
    10
    12
    14
    16
 #NULL
    20

I also plan to release shortly a small package that will offer a more flexible “lift” macro, which will be able to lift function calls over Nullable arguments within a variety of expression types.

We hope that the new NullableArrays package will help to support not only Julia’s statistical computing ecosystem as it moves forward but also any endeavor that requires an efficient, developed interface for handling arrays of Nullable objects. Please do try the package, submit feature requests, report bugs, and, if you’re interested, submit a PR or two. Happy coding!

Julia 0.4 Release Announcement

2015-10-09T00:00:00+00:00

We are pleased to announce the release of Julia 0.4.0. This release contains major language refinements and numerous standard library improvements. A summary of changes is available in the NEWS log found in our main repository. We will be making regular 0.4.x bugfix releases from the release-0.4 branch of the codebase, and we recommend the 0.4.x line for users requiring a more stable Julia environment.

The Julia ecosystem continues to grow, and there are now over 700 registered packages! (highlights below). JuliaCon 2015 was held in June, and >60 talks are available to view. JuliaCon India will be held in Bangalore on 9 and 10 October.

We welcome bug reports on our GitHub tracker, and general usage questions on the users mailing list, StackOverflow, and several community forums.

Binaries are available from the main download page, or visit JuliaBox to try 0.4 from the comfort of your browser. Happy Coding!

Notable compiler and language news:

Incremental code caching for packages, resulting in a major reduction in loading time for Gadfly and other large, inter-dependent packages.
Generational garbage collector which greatly reduces GC overhead for many common workloads.
Function call overloading for arbitrary objects
Generated functions (sometimes known as “staged functions”) introduce finer control over compile-time specialization. Docs and related JuliaCon talk.
Support for documenting user functions and other objects and retrieving the documentation via the help system.
Improvements in the performance and flexibility of multidimensional abstract arrays, SubArrays (array views), and efficient multidimensional iterators.
Inter-task channels for faster communication between parallel tasks
Tuple type improvements: the type tuple (A,B) is now written Tuple{A,B}. This change has improved the performance of many tuple-related operations, and allows one to write fixed-size aggregate fields as field::NTuple{N,T} (Number of elements of given Type).
Major improvements in Julia’s test coverage and the ability to analyze the test coverage of packages
The command line (REPL) now supports tab-completion of emoji characters (common LaTeX symbols have been supported since 0.3!)
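A few of the language features listed above are easy to show in a handful of lines. The sketch below uses the syntax of current Julia (1.x), which differs in places from the 0.4 spelling (for example, struct and the (p::T)(x) call syntax); Polynomial and tuplesum are hypothetical names chosen for illustration.

```julia
# Fixed-size aggregate field via NTuple: coefficients of c0 + c1*x + c2*x^2.
struct Polynomial
    coeffs::NTuple{3,Float64}
end

# Function call overloading: instances of Polynomial become callable.
(p::Polynomial)(x) = p.coeffs[1] + x * (p.coeffs[2] + x * p.coeffs[3])

# Generated function: the body is built from the argument *types*,
# here unrolling the sum over a tuple of known length N.
@generated function tuplesum(t::NTuple{N,T}) where {N,T}
    ex = :(zero($T))
    for i in 1:N
        ex = :($ex + t[$i])
    end
    return ex
end

p = Polynomial((1.0, 2.0, 3.0))
y = p(2.0)                  # evaluates 1 + 2*2 + 3*2^2
s = tuplesum((1, 2, 3))     # expands to zero(Int) + t[1] + t[2] + t[3]
```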

Upcoming work for 0.5

Nightly builds will use the versioning scheme 0.5.0-dev.

A major focus of 0.5 will be further (breaking) improvements to core array functionality, as detailed in this issue.
We plan to merge the threading branch, but the functionality will be considered experimental and only available as a compile-time flag for the near future.

Community News

The Julia ecosystem continues to grow, and there are now over 700 registered packages! (highlights below)

The second JuliaCon was held in Cambridge (USA) in June, 2015. Over 60 talks were recorded and are available for viewing.

JuliaCon India will be held in Bangalore on 9 and 10 October.

JuliaBloggers is going strong! A notable recent feature is the #MonthOfJulia series exploring the core language and a number of packages.

Topical Highlights

JuliaStats - statistical and machine learning community.
JuliaOpt - optimization community.
JuliaQuantum - Julia libraries for quantum-science and technology.
JuliaGPU - GPU libraries and tooling.
IJulia - notebook interface built on IPython.
Images - image processing and i/o library.
Gadfly - Grammar of Graphics-inspired statistical plotting.
Winston - 2D plotting.
JunoLab - LightTable-based interactive environment.

JuliaCon 2015 Preview - Deep Learning, 3D Printing, Parallel Computing, and so much more

2015-05-30T00:00:00+00:00

JuliaCon 2015 is being held at the Massachusetts Institute of Technology from June 24th to the 28th. Get your tickets and book your hotel before June 4th to take advantage of early bird pricing.

The first ever JuliaCon was held in Chicago last year and was a great success. JuliaCon is back for 2015, this time in Cambridge, Massachusetts at MIT’s architecturally-delightful Stata Center, the home of computer science at MIT. Last year we had a single-track format, but this year we’ve expanded into a four-day extravaganza:

On Wednesday 24th there will be an introduction to Julia workshop run by David P. Sanders (@dpsanders) as well as a Julia hackathon - a great chance to get some help for your new Julia projects, or to begin contributing to Julia or its many packages.
On Thursday 25th and Friday 26th we will be having speakers talking about a range of topics - we were fortunate to have so many fantastic submissions that we had to open up a second track of talks. The near-final schedule is on the main page. We’ll be alternating between ~40 minute long “regular” talks, and ~10 minute long “lightning” talks across all the sessions.
On Saturday 27th we will finish with a series of workshops on a range of topics: data wrangling and visualization, optimization, high-performance computing and more. These workshops run from 1.5 to 3 hours and will be a great way to rapidly boost your Julia skills.

Thursday’s Talks

After getting everyone settled in, we’ll start the conference proper with a session about the use of Julia in a wide variety of scientific applications. Many of the talks at the conference focus on Julia package organizations: groupings of similar packages that promote interoperability and focussing of efforts. In the session Daniel C. Jones (@dcjones), the creator of the visualization package Gadfly, will discuss the advances being made in the BioJulia bioinformatics organization, and Kyle Barbary (@kbarbary) will present JuliaAstro, a home for astronomy and astrophysics packages. There’s something for everyone: quantitative economic modeling (QuantEcon.jl), quantum statistical simulations, and how to fit Julia into a pre-existing body of code in other languages.

After lunch we’ll be splitting into two tracks: visualization and interactivity, and statistics. The visualization track will be demonstrating some of the exciting advances being made that enable Julia not only to produce high-quality visualizations, but also to share them. Mike Innes (@one-more-minute), creator of the Juno IDE for Julia, will be sharing his work on building web-powered apps in Julia, while Viral B. Shah (@ViralBShah), one of the Julia founders, will be discussing more about the inner workings of and plans for JuliaBox. For a different take on “visualization”, Jack Minardi of Voxel8 will be sharing how Julia is powering their 3D printing work.

The statistics session covers some hot topics in the field, including two talks from researchers at MIT about how Julia is playing a big part: probabilistic programming (Sigma.jl) and deep learning (Mocha.jl). Facebooker John Myles White, author of “Machine Learning for Hackers” and a variety of packages in R and Julia, will share his thoughts on how statistics in Julia can be taken to the next stage in development, and Pontus Stenetorp (@ninjin) will educate and entertain in his talk “Suitably Naming a Child with Multiple Nationalities using Julia”.

We’ll come together at the end of Thursday to learn more about how to write good Julia code, how to write packages that Just Work on Windows, and how wrappers around C libraries can be made easier than you might think through the magic of Clang.jl. Iain Dunning (@IainNZ), maintainer of Julia’s package listing and test infrastructure, will follow up on last year’s talk by giving a brief history and updated status report on Julia’s package ecosystem. Finally, current Googler Leah Hanson (@astrieanna) will share some of her tips for people looking to get started with contributing to Julia and to open-source projects.

Whatever you get up to after the talks end on Thursday, make sure you are up in time for…

Friday’s talks

If you are interested in learning how Julia works from the people who work on it every day, then Friday morning’s session is for you. The morning will kick off with newly-minted-PhD and Julia co-founder Jeff Bezanson (@JeffBezanson), who is still recovering from his defense and will be updating us on the title of his talk soon. We’ll be learning more about different stages of the compilation process from contributors Jake Bolewski (@jakebolewski) and Jacob Quinn (@quinnj), and we’ll be covering a miscellany of other cutting-edge topics for Julia like tuning LLVM, debugging, and interfaces.

In the afternoon we’ll have four sessions split across two rooms. In the second scientific applications session we’ll be learning more about how Julia is being used to prevent airborne collisions from Lincoln Lab’s Robert Moss, and Iain Dunning (@IainNZ) will give a sequel to last year’s JuliaOpt talk to update us on how Julia is becoming the language of choice for many for optimization. We’ll also hear how Julia is enabling rapid development of advanced algorithms for simulating quantum systems, evolving graphs, and analyzing seismic waves.

The numerical computing track kicks off with Stanford’s Prof. Jack Poulson (@poulson), creator of the Elemental library for distributed-memory linear algebra. Right after, the linear algebra wizard Zhang Xianyi (@xianyi) will give a talk about OpenBLAS, the high-performance linear algebra library Julia ships with. After a break, we’ll hear Viral’s thoughts on how sparse matrices currently work and how they should work in Julia, before finishing off with lightning talks about validated numerics and Taylor series.

We’ll see out the day with two sessions that hit some topics of interest to people deploying Julia into larger systems: data and parallel computing. In the data session we’ll learn about the nuts and bolts of sharing and storing data in Julia and hear more about plans for the future from the contributors working in these areas. Make sure to check out the talk by Avik Sengupta (@aviks) about his real-world industry experience putting Julia code behind a web-accessible API.

The parallel computing session will tackle parallelism at all levels. Contributor Amit Murthy (@amitmurthy) will open the session with a discussion of his recent work and plans for managing Julia in a cluster. We’ll also hear about work being done to make Julia multithreaded at Intel, and about running Julia on a Cray supercomputer.

After all that you will surely be inspired to hack on Julia projects all night, but make sure to wake up for a full day of workshops on Saturday!

Remember to get your tickets and book your hotel before June 4th to take advantage of early bird pricing. We’d also like to thank our platinum sponsors: the Gordon and Betty Moore Foundation, BlackRock, and Julia Computing. We can’t forget our silver sponsors either: Intel and Invenia. We’re looking forward to seeing you there!

Julia Summer of Code 2015

2015-05-23T00:00:00+00:00

Thanks to a generous grant from the Moore Foundation, we are happy to announce the 2015 Julia Summer of Code (JSoC) administered by NumFocus. We realize that this announcement comes quite late in the summer internship process, but we are hoping to fund six projects. JSoC 2015 will run from June 15 to September 15. The last date for submitting applications is June 1.

Stipends will match those of the Google Summer of Code (GSoC) at $5500 for the summer plus travel support to attend this year’s JuliaCon at MIT. Some amazing work from last year’s GSoC includes the Juno IDE, the Interact.jl package, and GLPlot; we hope to support another round of fun and useful projects.

If you are looking for a project, first, find a mentor. You may want to contact your favorite core developer, package author, or look through some of the previously proposed projects. Mentors will be looking for some evidence that you have experience using Julia and contributing to open source projects, but you are not expected to be an expert in the proposed project area. In fact, JSoC could be a great opportunity to explore an entirely new subject. If you’re already a contributor to Julia or a Julia package and want to get paid to continue an existing project, that’s okay too! In this case we still ask you to find a mentor who’s familiar with your field of work.

If you are a mentor looking for a student, advertise the project! Post it on julia-users and relevant community forums. Keep in mind that project proposals should be concrete but flexible enough to adapt to the interests of a broad range of potential applicants.

Once a mentor and student have agreed on a project, send an email to juliasoc@googlegroups.com for feedback and approval. We ask for this to be done by June 1st at the latest (yes that’s soon!).

Note that we use student in the broad sense. Participation is open to all, in accordance with applicable regulations. Participants do not need to demonstrate student status in any formal way. Contact juliasoc@googlegroups.com with any questions regarding eligibility.

Happy coding!

Julia 0.3 Release Announcement

2014-08-20T00:00:00+00:00

We are pleased to announce the release of Julia 0.3.0. This release contains numerous improvements across the board from standard library changes to pure performance enhancements as well as an expanded ecosystem of packages as compared to the 0.2 releases. A summary of changes is available in NEWS.md found in our main repository, and binaries are now available on our main download page.

A few notable changes:

System image caching for fast startup.
A pure-Julia REPL was introduced, replacing readline and providing expanded functionality and customization.
The workspace() function was added, to clear the environment without restarting.
Tab substitution of LaTeX character codes is now supported in the REPL, IJulia, and several editor environments.
Unicode improvements including expanded operators and NFC normalization.
Multi-process shared memory support (multi-threading support is in progress and has been a major summer focus).
Improved hashing and floating point range support.
Better tuple performance.

We are now transitioning into the 0.4 development cycle and encourage users to use the 0.3.X line if they need a stable Julia environment. Many breaking changes will be entering the environment over the course of the next few months. To reflect this period of change, nightly builds will use the versioning scheme 0.4.0-dev. Once the major breaking changes have been merged and the development cycle progresses towards a stable release, the version will shift to 0.4.0-pre, at which point package authors and users should start to think about transitioning their codebases over to the 0.4.X line.

The release-0.3 branch of the codebase will remain open for bugfixes during this time. We encourage users facing problems to open issues on our GitHub tracker, or email the julia-users mailing list.

Happy coding.

News

JuliaBloggers and the searchable package listing were recently introduced.

The first ever JuliaCon was held in Chicago in June, 2014. Several session recordings are available, and the others will be released soon:

The Julia community participated in Google Summer of Code 2014. Wrap-up blog posts will be coming soon from the participants:

Topical highlights

“The colors of chemistry” notebook by Jiahao Chen demonstrating IJulia, Gadfly, dimensional computation with SIUnits, and more.

JuliaStats - statistical and machine learning community.
JuliaOpt - optimization community.
IJulia - notebook interface built on IPython.
Images - image processing and i/o library.
Gadfly - Grammar of Graphics-inspired statistical plotting.
Winston - 2D plotting.

JuliaCon 2014 Optimization Presentations

2014-08-09T00:00:00+00:00

Optimization Session

Iain Dunning / Joey Huchette — JuliaOpt - Optimization Packages for Julia

Iain Dunning and Joey Huchette are both doctoral students in the Massachusetts Institute of Technology Operations Research Center, where they study constrained continuous and combinatorial numerical optimization methods and theory. In this session they present the JuliaOpt suite of optimization packages and how they interoperate. They also discuss how various Julia features enable exciting functionality in these packages.

Video: http://youtu.be/VwZvUvXX-vY
Slides: http://goo.gl/RwUdOI
GitHub: https://github.com/IainNZ https://github.com/joehuchette https://github.com/mlubin

Madeleine Udell — Convex Optimization in Julia

Madeleine Udell is a PhD candidate in Computational & Mathematical Engineering at Stanford University, where she works with Professor Stephen Boyd. Madeleine’s work focuses on modeling and solving large-scale optimization problems and in finding and exploiting structure in high dimensional data. She is the lead developer of the CVX.jl package.

Video: http://youtu.be/SoI0lEaUvTs
Slides: http://goo.gl/Nfy14D
Website: http://web.stanford.edu/~udell/

JuliaCon 2014 Opening Session Presentations

2014-08-09T00:00:00+00:00

Scientific Applications Session

Tim Holy — Image Representation and Analysis

Tim Holy is a Professor in the Department of Anatomy and Neurobiology at Washington University in St. Louis. He’s been involved with Julia development for over 2 years. In this presentation, Tim describes how Images.jl can be used for rapid inquiry and dissection of biomedical imaging data.

Video: http://youtu.be/FA-1B_amwt8
Slides: https://github.com/JuliaCon/presentations/tree/master/Images
GitHub: https://github.com/timholy

Pontus Stenetorp — Natural Language Processing with Julia

Pontus Stenetorp is a Japan Society for the Promotion of Science Postdoctoral Research Fellow at the University of Tokyo working in the areas of machine learning and natural language processing (NLP). In this talk, Pontus describes his recent experience in learning Julia and how Julia and its community have helped in his implementing a transition-based dependency parser in Julia.

Video: http://youtu.be/OrFxjE44COc
Slides: https://github.com/JuliaCon/presentations/blob/master/JuliaNLP/JuliaNLP.pdf
GitHub: https://github.com/ninjin

Speed vs. Correctness (led by Arch Robison)

Arch Robison is a Senior Principal Engineer at Intel and is an expert in parallel programming, being the original designer of the widely used Intel Threading Building Blocks library. In this session, Arch discusses the tradeoffs between instruction-level correctness and its implications for compiler optimizations.

Video: http://youtu.be/GFTCQNYddhs
GitHub: https://github.com/ArchRobison

Fast Numeric Computation in Julia

2013-09-04T00:00:00+00:00

Working on numerical problems daily, I have always dreamt of a language that provides an elegant interface while allowing me to write codes that run blazingly fast on large data sets. Julia is a language that turns this dream into a reality. With Julia, you can focus on your problem, keep your codes clean, and more importantly, write fast codes without diving into lower level languages such as C or Fortran even when performance is critical.

However, you should not take this potential speed for granted. To get your codes fast, you should keep performance in mind and follow general best practice guidelines. Here, I would like to share with you my experience in writing efficient codes for numerical computation.

First, make it correct

As in any language, the foremost goal when you implement your algorithm is to make it correct. An algorithm that doesn’t work correctly is useless no matter how fast it runs. One can always optimize the codes afterwards when necessary. When there are different approaches to a problem, you should choose the one that is asymptotically more efficient. For example, an unoptimized quick-sort implementation can easily beat a carefully optimized bubble-sort when sorting even moderately large arrays. Given a particular choice of algorithm, however, implementing it carefully and observing common performance guidelines can still make a big difference in performance – I will focus on this in the remaining part.

Devectorize expressions

Users of other high level languages such as MATLAB® or Python are often advised to vectorize their codes as much as possible to get performance, because loops are slow in those languages. In Julia, on the other hand, loops can run as fast as those written in C and you no longer have to count on vectorization for speed. Actually, turning vectorized expressions into loops, which we call devectorization, often results in even higher performance.

Consider the following:

r = exp(-abs(x-y))

Very simple expression, right? Behind the scenes, however, it takes a lot of steps and temporary arrays to get you the results of this expression. The following sequence of temporary array constructions is what is done to compute the above expression:

n = length(x)

tmp1 = Array(Float64, n)
for i = 1:n
    tmp1[i] = x[i]-y[i]
end

tmp2 = Array(Float64, n)
for i = 1:n
    tmp2[i] = abs(tmp1[i])
end

tmp3 = Array(Float64, n)
for i = 1:n
    tmp3[i] = -tmp2[i]
end

r = Array(Float64, n)
for i = 1:n
    r[i] = exp(tmp3[i])
end

We can see that this procedure creates three temporary arrays and it takes four passes to complete the computation. This introduces significant overhead:

It takes time to allocate memory for the temporary arrays;
It takes time to reclaim the memory of these arrays during garbage collection;
It takes time to traverse the memory – generally, fewer passes means higher efficiency.

Such overhead is significant in practice, often leading to a 2x to 3x slowdown. To get optimal performance, one should devectorize this code like so:

r = similar(x) 
for i = 1:length(x)
    r[i] = exp(-abs(x[i]-y[i]))
end

This version finishes the computation in one pass, without introducing any temporary arrays. Moreover, if r is pre-allocated, one can even omit the statement that creates r. The Devectorize.jl package provides a macro @devec that can automatically translate vectorized expressions into loops:

using Devectorize

@devec r = exp(-abs(x-y))

The comprehension syntax also provides a concise syntax for devectorized computation:

r = [exp(-abs(x[i]-y[i])) for i = 1:length(x)]

Note that comprehension always creates new arrays to store the results. Hence, to write results to pre-allocated arrays, you still have to devectorize the computation manually or use the @devec macro.
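For readers on a current Julia release (1.x), where elementwise operations are written with dot syntax, the following is a minimal sketch of writing the devectorized result into a pre-allocated array. The function name expabsdiff! is hypothetical and does not come from any package.

```julia
# Hypothetical helper (not from any package): computes exp(-abs(x - y))
# elementwise into the pre-allocated array r, in one pass, with no temporaries.
function expabsdiff!(r::AbstractVector, x::AbstractVector, y::AbstractVector)
    # eachindex with several arrays throws if their index ranges differ
    for i in eachindex(r, x, y)
        r[i] = exp(-abs(x[i] - y[i]))
    end
    return r
end

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 0.5]
r = similar(x)          # allocate once, reuse across calls
expabsdiff!(r, x, y)
```

In current Julia versions, the fused broadcast r .= exp.(-abs.(x .- y)) is another way to get a single pass without temporaries.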

Merge computations into a single loop

Traversing arrays, especially large ones, may incur cache misses or even page faults, both of which can cause significant latency. Thus, it is desirable to minimize the number of round trips to memory as much as possible. For example, you may compute multiple maps with one loop:

for i = 1:length(x)
    a[i] = x[i] + y[i]
    b[i] = x[i] - y[i]
end

This is usually faster than writing a = x + y; b = x - y.

The following example shows how you can compute multiple statistics (e.g. sum, max, and min) over a dataset efficiently.

n = length(x)
rsum = rmax = rmin = x[1]
for i = 2:n
    xi = x[i]
    rsum += xi
    if xi > rmax
        rmax = xi
    elseif xi < rmin
        rmin = xi
    end
end
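Wrapped in a function, the single-pass statistics loop might look like the sketch below. It uses current Julia (1.x) syntax, and the name sum_max_min is made up for illustration.

```julia
# Hypothetical helper: returns (sum, max, min) of a non-empty vector in one pass.
function sum_max_min(x::AbstractVector{<:Real})
    isempty(x) && throw(ArgumentError("x must be non-empty"))
    rsum = rmax = rmin = first(x)
    for xi in Iterators.drop(x, 1)   # skip the first element, already counted
        rsum += xi
        if xi > rmax
            rmax = xi
        elseif xi < rmin
            rmin = xi
        end
    end
    return rsum, rmax, rmin
end

s, hi, lo = sum_max_min([3.0, -1.0, 4.0, 1.5])
```

The data is traversed once, whereas calling sum, maximum, and minimum separately would traverse it three times.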

Write cache-friendly codes

Modern computer systems have a complicated heterogeneous memory structure that combines registers, multiple levels of caches, and RAM. Data are accessed through the cache hierarchy – a smaller and much faster memory that stores copies of frequently used data.

Most systems do not provide ways to directly control the cache system. However, you can take steps to make it much easier for the automated cache management system to help you if you write cache-friendly codes. In general, you don’t have to understand every detail about how a cache system works. It is often sufficient to observe the simple rule below:

Access data in a pattern similar to how the data resides in memory – don’t jump around between non-contiguous locations in memory.

This is sometimes referred to as the principle of locality. For example, if x is a contiguous array, then after reading x[i], it is much more likely that x[i+1] is already in the cache than it is that x[i+1000000] is, in which case it will be much faster to access x[i+1] than x[i+1000000].

Julia arrays are stored in column-major order, which means that the rows of a column are contiguous, but the columns of a row are generally not. It is therefore generally more efficient to access data column-by-column than row-by-row. Consider the problem of computing the sum of each row in a matrix. It is natural to implement this as follows:

m, n = size(a)
r = Array(Float64, m)

for i = 1:m
    s = 0.
    for j = 1:n
        s += a[i,j]
    end
    r[i] = s
end

The loop here accesses the elements row-by-row, as a[i,1], a[i,2], ..., a[i,n]. The interval between these elements is m. Intuitively, in each inner loop it jumps with a stride of length m from the beginning of a row to its end, and then jumps back to the beginning of the next row. This is not very efficient, especially when m is large.

This procedure can be made much more cache-friendly by changing the order of computation:

for i = 1:m
    r[i] = a[i,1]
end

for j = 2:n, i = 1:m
    r[i] += a[i,j]
end

Some benchmarking shows that this version can be 5-10 times faster than the one above for large matrices.
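To compare the two access patterns on your own machine, both can be wrapped in functions. This is a sketch in current Julia (1.x) syntax with hypothetical function names. It only checks that the two versions agree and makes no timing claims.

```julia
# Row sums with the inner loop over columns: strided access (stride m)
# in a column-major array.
function rowsums_strided(a::Matrix{Float64})
    m, n = size(a)
    r = zeros(m)
    for i in 1:m
        s = 0.0
        for j in 1:n
            s += a[i, j]
        end
        r[i] = s
    end
    return r
end

# Row sums with the inner loop over rows: each column is walked contiguously.
function rowsums_contiguous(a::Matrix{Float64})
    m, n = size(a)
    r = zeros(m)
    for j in 1:n, i in 1:m   # the last index listed (i) is the innermost loop
        r[i] += a[i, j]
    end
    return r
end

a = [1.0 2.0 3.0; 4.0 5.0 6.0]
```

Timing both functions on a large matrix (for example with @elapsed) shows how much the access order matters on a given system.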

Avoid creating arrays in loops

Creating arrays requires memory allocation and adds to the workload of the garbage collector. Reusing the same array is a good way to reduce the cost of memory management.

It is not uncommon that you want to update arrays in an iterative algorithm. For example, in K-means, you may want to update both the cluster means and distances in each iteration. A straightforward way to do this might look like:

while !converged && t < maxiter
    means = compute_means(x, labels)
    dists = compute_distances(x, means)
    labels = assign_labels(dists)
    ...
end

In this implementation of K-means, the arrays means, dists, and labels are recreated at each iteration. This reallocation of memory on each step is unnecessary. The sizes of these arrays are fixed, and their storage can be reused across iterations. The following alternative code is a more efficient way to implement the same algorithm:

d, n = size(x)

# pre-allocate storage
means = Array(Float64, d, K)
dists = Array(Float64, K, n)
labels = Array(Int, n)

while !converged && t < maxiter
    update_means!(means, x, labels)
    update_distances!(dists, x, means)
    update_labels!(labels, dists)
    ...
end

In this version, the functions invoked in the loop update the pre-allocated arrays in-place.

If you are writing a package, it is recommended that you provide two versions of each function that outputs arrays: one that performs the update in-place, and another that returns a new array. The latter can usually be implemented as a lightweight wrapper of the former that allocates the output array (or copies the input array) and then calls the in-place version on it. A good example is the Distributions.jl package, which provides both logpdf and logpdf!, so that one can write lp = logpdf(d,x) when a new array is needed, or logpdf!(lp,d,x) when lp has been pre-allocated.
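One common arrangement is to write the in-place function first and let the allocating one delegate to it. The sketch below uses current Julia (1.x) syntax, and the names rescale! and rescale are hypothetical, chosen only for illustration.

```julia
# In-place version: writes x / sum(x) into the pre-allocated array out.
function rescale!(out::AbstractVector, x::AbstractVector)
    s = sum(x)
    for i in eachindex(out, x)
        out[i] = x[i] / s
    end
    return out
end

# Allocating version: a thin wrapper that creates the output and delegates.
rescale(x::AbstractVector) = rescale!(similar(x, Float64), x)

x = [1.0, 3.0, 4.0]
p = rescale(x)        # returns a new array, x is untouched
buf = zeros(3)
rescale!(buf, x)      # reuses buf, no new array is created
```

Callers in a hot loop can use the version ending in ! with a buffer allocated once, while casual callers keep the simpler allocating form.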

Identify opportunities to use BLAS

Julia wraps a large number of BLAS routines for linear algebraic computation. These routines are the result of decades of research and optimization by many of the world’s top experts in fast numerical computation. As a result, using them where possible can provide performance boosts that seem almost magical – BLAS routines are often orders of magnitude faster than the simple loop implementations they replace.

For example, consider accumulating weighted versions of vectors as follows:

r = zeros(size(x,1))
for j = 1:size(x,2)
    r += x[:,j] * w[j]
end

You can replace the statement r += x[:,j] * w[j] with a call to the BLAS axpy! function to get better performance:

for j = 1:size(x,2)
    axpy!(w[j], x[:,j], r)
end

This, however, is still far from optimal. If you are familiar with linear algebra, you have probably noticed that this is just matrix-vector multiplication, and can be written as r = x * w, which is not only shorter, simpler and clearer than either of the above loops – it also runs much faster than either version.
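The three formulations can be checked against each other. The sketch below assumes a current Julia release (1.x), where axpy! lives in the LinearAlgebra standard library. The function names are hypothetical, and view is used so that the axpy! version does not copy each column.

```julia
using LinearAlgebra   # provides axpy! and BLAS-backed matrix-vector products

# Loop with temporaries: x[:, j] copies the column, and * and + allocate.
function weighted_sum_loop(x, w)
    r = zeros(size(x, 1))
    for j in 1:size(x, 2)
        r += x[:, j] * w[j]
    end
    return r
end

# Loop using axpy!: r += w[j] * x[:, j], updated in place.
function weighted_sum_axpy(x, w)
    r = zeros(size(x, 1))
    for j in 1:size(x, 2)
        axpy!(w[j], view(x, :, j), r)
    end
    return r
end

# Matrix-vector product: one call instead of a loop.
weighted_sum_matvec(x, w) = x * w

x = [1.0 2.0 3.0; 4.0 5.0 6.0]
w = [1.0, 0.5, 2.0]
```

All three return the same vector, so the choice between them is a matter of speed and clarity.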

Our next example is a subtler application of BLAS routines to computing pairwise Euclidean distances between columns in two matrices. Below is a straightforward implementation that directly computes pairwise distances:

m = size(a, 2)   # number of columns in a
n = size(b, 2)   # number of columns in b
r = Array(Float64, m, n)

for j = 1:n, i = 1:m
    r[i,j] = sqrt(sum(abs2(a[:,i] - b[:,j])))
end

This is clearly suboptimal – a lot of temporary arrays are created in evaluating the expression in the inner loop. To speed this up, we can devectorize the inner expression:

d, m = size(a)
n = size(b,2)
r = Array(Float64, m, n)

for j = 1:n, i = 1:m
    s = 0.
    for k = 1:d
        s += abs2(a[k,i] - b[k,j])
    end
    r[i,j] = sqrt(s)
end

This version is much more performant than the vectorized form. But is it the best we can do? By employing an alternative strategy, we can write an even faster algorithm for computing pairwise distances. The trick is that the squared Euclidean distance between two vectors can be expanded as:

sum(abs2(x-y)) == sum(abs2(x)) + sum(abs2(y)) - 2*dot(x,y)

If we evaluate these three terms separately, the computation can be mapped to BLAS routines perfectly. Below, we have a new implementation of pairwise distances written using only BLAS routines, including the norm calls that are wrapped by the NumericExtensions.jl package:

using NumericExtensions   # for sqsum
using Base.LinAlg.BLAS    # for gemm!

n = size(b, 2)   # number of columns in b

sa = sqsum(a, 1)   # sum(abs2(x)) for each column in a
sb = sqsum(b, 1)   # sum(abs2(y)) for each column in b

r = sa .+ reshape(sb, 1, n)          # first two terms
gemm!('T', 'N', -2.0, a, b, 1.0, r)  # add (-2.0) * a' * b to r

for i = 1:length(r)
    r[i] = sqrt(r[i])
end

This version is over 100 times faster than our original implementation — the gemm function in BLAS has been optimized to the extreme by many talented developers and engineers over the past few decades.
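For readers on a current Julia release, here is a sketch of the same expansion using only Base and the LinearAlgebra standard library. It swaps in sum(abs2, ...; dims=1) for sqsum and the five-argument mul! (available from Julia 1.3) for the direct gemm! call. The clamp at zero before the square root is an addition of this sketch, guarding against tiny negative values caused by rounding. The function names are hypothetical.

```julia
using LinearAlgebra   # for mul!

# Pairwise Euclidean distances between the columns of a (d×m) and b (d×n).
function pairwise_euclidean(a::Matrix{Float64}, b::Matrix{Float64})
    size(a, 1) == size(b, 1) || throw(DimensionMismatch("a and b need the same number of rows"))
    sa = vec(sum(abs2, a; dims=1))   # length m: squared norm of each column of a
    sb = vec(sum(abs2, b; dims=1))   # length n: squared norm of each column of b
    r = sa .+ sb'                    # m×n matrix holding the first two terms
    mul!(r, a', b, -2.0, 1.0)        # r = -2 * a' * b + r, computed in place
    r .= sqrt.(max.(r, 0.0))         # clamp rounding noise below zero, then take the root
    return r
end

# Direct reference implementation, used only for comparison.
pairwise_naive(a, b) =
    [sqrt(sum(abs2, a[:, i] .- b[:, j])) for i in 1:size(a, 2), j in 1:size(b, 2)]

a = [0.0 3.0; 0.0 4.0]             # columns (0,0) and (3,4)
b = [0.0 3.0 6.0; 0.0 4.0 8.0]     # columns (0,0), (3,4) and (6,8)
d = pairwise_euclidean(a, b)
```

Note that the expansion can lose accuracy when two columns are nearly identical, because it subtracts large, almost equal numbers.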

We should mention that you don’t have to implement this yourself if you really want to compute pairwise distances: the Distance.jl package provides optimized implementations of a broad variety of distance metrics, including this one. We presented this optimization trick as an example to illustrate the substantial performance gains that can be achieved by writing code that uses BLAS routines wherever possible.

Explore available packages

Julia has a very active open source ecosystem. A variety of packages have been developed that provide optimized algorithms for high performance computation. Look for a package that does what you need before you decide to roll your own – and if you don’t find what you need, consider contributing it! Here are a couple of packages that might be useful for those interested in high performance computation:

NumericExtensions.jl – extensions to Julia’s base functionality for high-performance support for a variety of common computations (many of these will gradually get moved into base Julia).
Devectorize.jl – macros and functions to de-vectorize vector expressions. With this package, users can write computations in high-level vectorized way while enjoying the high run-time performance of hand-coded de-vectorized loops.

Check out the Julia package list for many more packages. Julia also ships with a sampling profiler to measure where your code is spending most of its time. When in doubt, measure, don’t guess!

Building GUIs with Julia, Tk, and Cairo, Part II

2013-05-23T00:00:00+00:00

Drawing, painting, and plotting

In this installment, we’ll cover both low-level graphics (using Cairo) and plotting graphs inside GUIs (using Winston). Here again we’re relying on infrastructure built by many people, including Jeff Bezanson, Mike Nolta, and Keno Fischer.

Cairo

The basics

The display of the image is handled by Cairo, a C library for two-dimensional drawing. Julia’s Cairo wrapper isn’t currently documented, so let’s walk through a couple of basics first.

If you’re new to graphics libraries like Cairo, there are a few concepts that may not be immediately obvious but are introduced in the Cairo tutorial. The key concept is that the Cairo API works like “stamping,” where a source gets applied to a destination in a region specified by a path. Here, the destination will be the pixels corresponding to a region of a window on the screen. We’ll control the source and the path to achieve the effects we want.

Let’s play with this. First, inside a new window we create a Cairo-enabled Canvas for drawing:

using Base.Graphics
using Cairo
using Tk

win = Toplevel("Test", 400, 200)
c = Canvas(win)
pack(c, expand=true, fill="both")

We’ve created a window 400 pixels wide and 200 pixels high. c is our Canvas, a type defined in the Tk package. Later we’ll dig into the internals a bit, but for now suffice it to say that a Canvas is a multi-component object that you can often treat as a black box. The initial call creating the canvas leaves a lot of its fields undefined, because you don’t yet know crucial details like the size of the canvas. The call to pack specifies that this canvas fills the entire window, and simultaneously fills in the missing information in the Canvas object itself.

Note that the window is currently blank, because we haven’t drawn anything to it yet, so you can see whatever was lying underneath. In my case it captured a small region of my desktop:

Now let’s do some drawing. Cairo doesn’t know anything about Tk Canvases, so we have to pull out the part of it that works directly with Cairo:

ctx = getgc(c)

getgc means “get graphics context,” returning an object (here ctx) that holds all relevant information about the current state of drawing to this canvas.

One nice feature of Cairo is that the coordinates are abstracted; ultimately we care about screen pixels, but we can set up user coordinates that have whatever scaling is natural to the problem. We just have to tell Cairo how to convert user coordinates to device (screen) coordinates. We set up a coordinate system using set_coords, defined in base/graphics.jl:

function set_coords(ctx::GraphicsContext, x, y, w, h, l, r, t, b)

x (horizontal) and y (vertical) specify the upper-left corner of the drawing region in device coordinates, and w and h its width and height, respectively. (Note Cairo uses (0,0) for the top-left corner of the window.) l, r, t, and b are the user coordinates corresponding to the left, right, top, and bottom, respectively, of this region. Note that set_coords will also clip any drawing that occurs outside the region defined by x, y, w, and h; however, the coordinate system you’ve specified extends to infinity, and you can draw all the way to the edge of the canvas by calling reset_clip().
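To make the mapping concrete, here is a small sketch of the arithmetic, assuming set_coords establishes a plain linear map between user and device coordinates, as the description suggests. The helper user_to_device is hypothetical and is not part of Base.Graphics, Cairo, or Tk. The region values are chosen for illustration.

```julia
# Hypothetical helper: maps a point (ux, uy) in user coordinates to device
# coordinates, for a region whose upper-left device corner is (x, y) with
# size w×h, and whose user extents run l..r horizontally and t..b vertically.
function user_to_device(x, y, w, h, l, r, t, b, ux, uy)
    dx = x + (ux - l) / (r - l) * w
    dy = y + (uy - t) / (b - t) * h
    return dx, dy
end

# A 300×100 pixel region with upper-left corner at (50, 50);
# user coordinates run from 0 to 10 on both axes.
region = (50, 50, 300, 100, 0, 10, 0, 10)

topleft     = user_to_device(region..., 0, 0)
center      = user_to_device(region..., 5, 5)
bottomright = user_to_device(region..., 10, 10)
```

Under this assumption, one user unit corresponds to 30 pixels horizontally and 10 pixels vertically, which is why shapes drawn in user coordinates can look stretched.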

Let’s fill the drawing region with a color, so we can see it:

# Set coordinates to go from 0 to 10 within a 300x100 centered region
set_coords(ctx, 50, 50, 300, 100, 0, 10, 0, 10)
set_source_rgb(ctx, 0, 0, 1)   # set color to blue
paint(ctx)                     # paint the entire clip region

Perhaps surprisingly, nothing happened. The reason is that the Tk Canvas implements a technique called double buffering, which means that you do all your drawing to a back (hidden) surface, and then blit the completed result to the front (visible) surface. We can see this in action simply by bringing another window over the top of the window we’re using to draw, and then bringing our window back to the top; suddenly you’ll see a nice blue rectangle within the window, surrounded by whatever is in the background window(s):

Fortunately, to display your graphics you don’t have to rely on users changing the stacking order of windows: call reveal(c) to update the front surface with the contents of the back surface, followed by update() (or perhaps better, Tk.update() since update is a fairly generic name) to give Tk a chance to expose the front surface to the OS’s window manager.

Now let’s draw a red line:

move_to(ctx, -1, 5)
line_to(ctx, 7, 6)
set_source_rgb(ctx, 1, 0, 0)
set_line_width(ctx, 5)
stroke(ctx)
reveal(c)
Tk.update()

We started at a position outside the coordinate region (we’ll get to see the clipping in action this way). The next command, line_to, creates a segment of a path, the way that regions are defined in Cairo. The stroke command draws a line along the trajectory of the path, after which the path is cleared. (You can use stroke_preserve if you want to re-use this path for another purpose later.)

Let’s illustrate this by adding a solid green rectangle with a magenta border, letting it spill over the edges of the previously-defined coordinate region:

reset_clip(ctx)
rectangle(ctx, 7, 5, 4, 4)
set_source_rgb(ctx, 0, 1, 0)
fill_preserve(ctx)
set_source_rgb(ctx, 1, 0, 1)
stroke(ctx)
reveal(c)
Tk.update()

fill differs from paint in that fill works inside the currently-defined path, whereas paint fills the entire clip region.

Here is our masterpiece, where the “background” may differ for you (mine was positioned over the bottom of a wikipedia page):

Rendering an image

Images are rendered in Cairo inside a rectangle (controlling placement of the image) followed by fill. So far this is just like the simple drawing above. The difference is the source, which now will be a surface instead of an RGB color. If you’re drawing from Julia, chances are that you want to display an in-memory array. The main trick is that Cairo requires this array to be a matrix of type Uint32 encoding the color. The scheme is that the least significant byte is the blue value (ranging from 0x00 to 0xff), the next is green, and the next red. (The most significant byte can encode the alpha value, or transparency, if you specify that transparency is to be used in your image surface.)
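The byte layout described above can be sketched as follows. This uses current Julia (1.x) syntax, where the type is spelled UInt32 rather than Uint32. The helpers pack_rgb and unpack_rgb are hypothetical names and are not part of Cairo.jl or Images.

```julia
# Pack 8-bit red, green, blue values into one UInt32 laid out as 0x00RRGGBB:
# blue in the least significant byte, then green, then red.
pack_rgb(r::Integer, g::Integer, b::Integer) =
    (UInt32(r) << 16) | (UInt32(g) << 8) | UInt32(b)

# Recover the three channels from a packed value.
unpack_rgb(c::UInt32) =
    (UInt8((c >> 16) & 0xff), UInt8((c >> 8) & 0xff), UInt8(c & 0xff))

blue   = pack_rgb(0x00, 0x00, 0xff)
orange = pack_rgb(255, 165, 0)
```

A matrix of such values is the kind of buffer the text describes. The most significant byte is left at zero here because no transparency is used.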

Both Winston and Images can generate a buffer of Uint32 for you. Let’s try the one in Images:

using Images
img = imread("some_photo.jpg")
buf = uint32color(img)'
image(ctx, CairoRGBSurface(buf), 0, 0, 10, 10)
reveal(c)
Tk.update()

Rather than manually calling rectangle and fill, we use the convenience method image(ctx, surf, x, y, w, h) (defined in Cairo.jl). Here x, y, w, h are user-coordinates of your canvas, not pixels on the screen or pixels in your image; being able to express location in user coordinates is the main advantage of using image().

The image should now be displayed within your window (squashed, because we haven’t worried about aspect ratio):

It fills only part of the window because of the coordinate system we’ve established, where the range 0:10 corresponds to an inset region in the center of the window.

While it’s a minor point, note that CairoRGBSurface takes a transpose for you, to convert from the column-major order of matrices in Julia to the row-major convention of Cairo. Images avoids taking transposes unless necessary, and is capable of handling images with any storage order. Here we do a transpose in preparation to have it be converted back to its original shape by CairoRGBSurface. If performance is critical, you can avoid the default behavior of CairoRGBSurface by calling CairoImageSurface directly (see the Cairo.jl code).

Redrawing & resize support

A basic feature of windows is that they should behave properly under resizing operations. This doesn’t come entirely for free, although the grid (and pack) managers of Tk take care of many details for us. However, for Canvases we need to do a little bit of extra work; to see what I mean, just try resizing the window we created above.

The key is to have a callback that gets activated whenever the canvas changes size, and to have this callback capable of redrawing the window at arbitrary size. Canvases make this easy by having a field, resize, that you assign the callback to. This function will receive a single argument, the canvas itself, but as always you can provide more information. Taking our image example, we could set

c.resize = c->redraw(c, buf)

and then define

function redraw(c::Canvas, buf)
    ctx = getgc(c)
    set_source_rgb(ctx, 1, 0, 0)
    paint(ctx)
    set_coords(ctx, 50, 50, Tk.width(c)-100, Tk.height(c)-100, 0, 10, 0, 10)
    image(ctx, CairoRGBSurface(buf), 0, 0, 10, 10)
    reveal(c)
    Tk.update()
end

Here you can see that we’re aiming to be a bit more polished, and want to avoid seeing bits of the desktop around the borders of our drawing region. So we fill the window with a solid color (but choose a garish red, to make sure we notice it) before displaying the image. We also have to re-create our coordinate system, because that too was destroyed, and in this case we dynamically adjust the coordinates to the size of the canvas. Finally, we redraw the image. Note we didn’t have to go through the process of converting to Uint32-based color again. Obviously, you can use this redraw function even for the initial rendering of the window, so there’s really no extra work in setting up your code this way.

If you grab the window handle and resize it, now you should see something like this:

Voila! We’re really getting somewhere now.

Unlike the complete GUI, this implementation doesn’t have the option to preserve the image’s aspect ratio. However, there’s really no magic there; it all comes down to computing sizes and controlling the drawing region and coordinate system.
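
As a rough idea of what “computing sizes” means for aspect-ratio preservation, here is a minimal sketch. fit_rect is a hypothetical helper, not the function ImageView uses; it returns the largest centered rectangle with the image’s aspect ratio that fits inside the canvas.

```julia
# Largest rectangle with the image's aspect ratio that fits in the canvas,
# centered, returned as (x, y, w, h) in pixels.
function fit_rect(canvas_w, canvas_h, img_w, img_h)
    scale = min(canvas_w / img_w, canvas_h / img_h)
    w = img_w * scale
    h = img_h * scale
    x = (canvas_w - w) / 2
    y = (canvas_h - h) / 2
    return (x, y, w, h)
end

wide = fit_rect(400, 100, 200, 100)   # canvas wider than the image
tall = fit_rect(200, 300, 200, 100)   # canvas taller than the image
```

In the first case the leftover space ends up on the left and right; in the second it ends up above and below. The result could then serve as the device rectangle for the coordinate system.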

One important point: resizing the window causes the existing Cairo context(s) to be destroyed, and creates new ones suitable for the new canvas size. One consequence is that your old ctx variable is now invalid, and trying to use it for drawing will cause a segfault. For this reason, you shouldn’t ever store a ctx object on its own; always begin drawing by calling getgc(c) again.

Canvases and the mouse

A Canvas already comes with a set of fields prepared for mouse events. For example, in the complete GUI we have the equivalent of the following:

selectiondonefunc = (c, bb) -> zoombb(imgc, img2, bb)
c.mouse.button1press = (c, x, y) -> rubberband_start(c, x, y, selectiondonefunc)

rubberband_start, a function defined in rubberband.jl, will now be called whenever the user presses the left mouse button. selectiondonefunc is a callback that we supply; it will be executed when the user releases the mouse button, and it needs to implement whatever it is we want to achieve with the selected region (in this case, a zoom operation). Part of what rubberband_start does is to bind selectiondonefunc to the release of the mouse button, via c.mouse.button1release. bb is a BoundingBox (a type defined in base/graphics.jl) that will store the region selected by the user, and this gets passed to selectiondonefunc. (The first two inputs to zoombb, imgc and img2, store settings that are relevant to this particular GUI but will not be described in detail here.)
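
The pattern of a press callback that installs the matching release callback can be tried without any GUI. The sketch below is a headless imitation in current Julia syntax: MockMouse, start_selection, and the tuple used as a bounding box are all hypothetical stand-ins, not the real MouseHandler, rubberband_start, or BoundingBox.

```julia
# Hypothetical stand-in for the canvas's mouse handler: two callback fields.
mutable struct MockMouse
    button1press::Function
    button1release::Function
end

events = String[]                      # log of what happened, for inspection
noop = (x, y) -> nothing
mouse = MockMouse(noop, noop)

# Supplied by the application: what to do with the finished selection.
selectiondone = bb -> push!(events, "selected $bb")

# Press handler: remember the start corner and install a release handler
# that builds the bounding box and hands it to the application's callback.
function start_selection(mouse, x0, y0, donefunc)
    push!(events, "start at ($x0, $y0)")
    mouse.button1release = (x1, y1) -> begin
        bb = (min(x0, x1), max(x0, x1), min(y0, y1), max(y0, y1))
        mouse.button1release = noop    # one selection per press
        donefunc(bb)
    end
end

mouse.button1press = (x, y) -> start_selection(mouse, x, y, selectiondone)

# Simulate a drag from (30, 40) to (10, 90).
mouse.button1press(30, 40)
mouse.button1release(10, 90)
```

The release handler is a closure over the start corner, which is how the selection “remembers” where the drag began without any global state.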

The mouse inside a Canvas is an object of type MouseHandler, which has fields for press and release of all 3 mouse buttons and additional ones for motion. However, a few cases (which happen to be relevant to this GUI) are not available in MouseHandler. Here are some examples of how to configure these actions:

# Bind double-clicks
bind(c.c, "<Double-Button-1>", (path,x,y)->zoom_reset(imgc, img2))
# Bind Shift-scroll (using the wheel mouse)
bindwheel(c.c, "Shift", (path,delta)->panhorz(imgc,img2,int(delta)))

The delta argument for the wheel mouse will encode the direction of scrolling.

The rubber band (region selection)

Support for the rubber band is provided in the file rubberband.jl. Like navigation.jl, this is a stand-alone set of functions that you should be able to incorporate into other projects. It draws a dashed rectangle employing the same machinery we described at the top of this page, with slight modifications to create the dashes (through the set_dash function). By now, this should all be fairly straightforward.

However, these functions use one additional trick worth mentioning. Let’s finally look at the Tk Canvas object:

type Canvas
    c::TkWidget
    front::CairoSurface  # surface for window
    back::CairoSurface   # backing store
    frontcc::CairoContext
    backcc::CairoContext
    mouse::MouseHandler
    redraw
    
    function ...

Here we can explicitly see the two buffers, used in double-buffering, and their associated contexts. getgc(c), where c is a Canvas, simply returns backcc. This is why all drawing occurs on the back surface. For the rubber band, we choose instead to draw on the front surface, and then (as the size of the rubber band changes) “repair the damage” by copying from the back surface. Since we only have to modify the pixels along the band itself, this is fast. You can see these details in rubberband.jl.
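
To illustrate the “repair the damage” idea, here is a toy version that uses plain matrices in place of Cairo surfaces. Everything in it (outline, draw_band!, repair!, the colors) is hypothetical and only mimics the strategy; the real code in rubberband.jl works with Cairo contexts.

```julia
# Back buffer holds the finished drawing; front is what the user sees.
back  = fill(0x00336699, 6, 8)       # 6 rows x 8 columns of one color
front = copy(back)
band_color = 0x00ffffff

# Pixels on the outline of the rectangle rows r0:r1, columns c0:c1.
function outline(r0, r1, c0, c1)
    pts = Tuple{Int,Int}[]
    for c in c0:c1
        push!(pts, (r0, c))
        push!(pts, (r1, c))
    end
    for r in r0:r1
        push!(pts, (r, c0))
        push!(pts, (r, c1))
    end
    return pts
end

# Draw the band on the front buffer only.
function draw_band!(front, pts, color)
    for (r, c) in pts
        front[r, c] = color
    end
end

# Undo a band by copying just those pixels from the untouched back buffer.
function repair!(front, back, pts)
    for (r, c) in pts
        front[r, c] = back[r, c]
    end
end

band1 = outline(2, 4, 2, 5)
draw_band!(front, band1, band_color)
changed = count(front .!= back)      # only the outline differs

repair!(front, back, band1)          # band moved: erase the old one...
band2 = outline(2, 5, 2, 7)
draw_band!(front, band2, band_color) # ...and draw the new one
```

Only the pixels on the outline are ever written, and the back buffer is never modified, so the drawing underneath survives any number of band updates.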

Winston

For many GUIs in Julia, an important component will be the ability to display data graphically. While we could draw graphs directly with Cairo, it would be a lot of work to build from scratch; fortunately, there’s an excellent package, Winston, that already does this.

Since there’s a nice set of examples of some of the things you can do with Winston, here our focus is very narrow: how to integrate Winston plots into GUIs built with Tk. Fortunately, this is quite easy. Let’s walk through an example:

using Tk
using Winston

win = Toplevel("Testing", 400, 200)
fwin = Frame(win)
pack(fwin, expand=true, fill="both")

We chose to fill the entire window with a frame fwin, so that everything inside this GUI will have a consistent background. All other objects will be placed inside fwin.

Next, let’s set up the elements, a Canvas on the left and a single button on the right:

c = Canvas(fwin, 300, 200)
grid(c, 1, 1, sticky="nsew")
fctrls = Frame(fwin)
grid(fctrls, 1, 2, sticky="sw", pady=5, padx=5)
grid_columnconfigure(fwin, 1, weight=1)
grid_rowconfigure(fwin, 1, weight=1)

ok = Button(fctrls, "OK")
grid(ok, 1, 1)

Finally, let’s plot something inside the Canvas:

x = linspace(0.0,10.0,1001)
y = sin(x)
p = FramedPlot()
add(p, Curve(x, y, "color", "red"))

Winston.display(c, p)
reveal(c)
Tk.update()

You’ll note that you can resize this window, and the plot grows or shrinks accordingly.

Easy, huh? The only part of this code that is specific to GUIs is the line Winston.display(c, p), where we specified that we wanted our plot to appear inside a particular Canvas. Of course, there’s a lot of magic behind the scenes in Winston, but covering its internals is beyond our scope here.

Conclusions

There’s more one could cover, but most of the rest is fairly specific to this particular GUI. A fair amount of code is needed to handle coordinates: selecting specific regions within the 4d image, and rendering to specific regions of the output canvas. If you want to dive into these details, your best bet is to start reading through the ImageView code, but it’s not going to be covered in any more detail here.

Hopefully by this point you have a pretty good sense for how to produce on-screen output with Tk, Cairo, and Winston. It takes a little practice to get comfortable with these tools, but the end result is quite powerful. Happy hacking!

Building GUIs with Julia, Tk, and Cairo, Part I

2013-05-23T00:00:00+00:00

This is the first of two blog posts designed to walk users through the process of creating GUIs in Julia. Those following Julia development will know that plotting in Julia is still evolving, and one could therefore expect that it might be premature to build GUIs with Julia. My own recent experience has taught me that this expectation is wrong: compared with building GUIs in Matlab (my only previous GUI-writing experience), Julia already offers a number of quite compelling advantages. We’ll see some of these advantages on display below.

We’ll go through the highlights needed to create an image viewer GUI. Before getting into how to write this GUI, first let’s play with it to get a sense for how it works. It’s best if you just try these commands yourself, because it’s difficult to capture things like interactivity with static text and pictures.

You’ll need the ImageView package:

Pkg.add("ImageView")

It’s worth pointing out that this package is expected to evolve over time; however, if things have changed from what’s described in this blog, try checking out the “blog” branch directly from the repository. I should also point out that this package was developed on the author’s Linux system, and it’s possible that things may not work as well on other platforms.

First let’s try it with a photograph. Load one this way:

using Images
using ImageView
img = imread("my_photo.jpg")

Any typical image format should be fine; it doesn’t have to be a jpg. Now display the image this way:

display(img, pixelspacing = [1,1])

The basic command to view the image is display. The optional pixelspacing input tells display that this image has a fixed aspect ratio, and that this needs to be honored when displaying the image. (Alternatively, you could set img["pixelspacing"] = [1,1] and then you wouldn’t have to tell this to the display function.)

You should get a window with your image:

OK, nice. But we can start to have some fun if we resize the window, which causes the image to get bigger or smaller:

Note the black perimeter; that’s because we’ve specified the aspect ratio through the pixelspacing input, and when the window doesn’t have the same aspect ratio as the image you’ll have a perimeter either horizontally or vertically. Try it without specifying pixelspacing, and you’ll see that the image stretches to fill the window, but it looks distorted:

display(img)

(This won’t work if you’ve already defined "pixelspacing" for img; if necessary, use delete!(img, "pixelspacing") to remove that setting.)

Next, click and drag somewhere inside the image. You’ll see the typical rubberband selection, and once you let go the image display will zoom in on the selected region.

Again, the aspect ratio of the display is preserved. Double-clicking on the image restores the display to full size.

If you have a wheel mouse, zoom in again and scroll the wheel, which should cause the image to pan vertically. If you scroll while holding down Shift, it pans horizontally; hold down Ctrl and you affect the zoom setting. Note as you zoom via the mouse, the zoom stays focused around the mouse pointer location, making it easy to zoom in on some small feature simply by pointing your mouse at it and then Ctrl-scrolling.
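
The “zoom stays focused around the mouse pointer” behavior comes down to a little arithmetic along each axis. Here is a sketch of one way to do it; zoom_about is a hypothetical helper, not the function ImageView uses.

```julia
# Zoom the visible interval (lo, hi) by `factor` while keeping the data
# coordinate under the pointer, `p`, at the same relative screen position.
# factor < 1 zooms in; factor > 1 zooms out.
function zoom_about(lo, hi, p, factor)
    newlo = p - (p - lo) * factor
    newhi = p + (hi - p) * factor
    return (newlo, newhi)
end

# Visible range 0..100, pointer a quarter of the way across, zoom in 2x.
lo, hi = zoom_about(0.0, 100.0, 25.0, 0.5)
frac = (25.0 - lo) / (hi - lo)   # pointer's relative position afterwards
```

After the zoom the visible range is 12.5 to 62.5 and the pointer is still a quarter of the way across, so the feature under the mouse does not slide away.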

Long-time users of Matlab may note a number of nice features about this behavior:

The resizing and panning are much smoother than Matlab’s
Matlab doesn’t expose modifier keys in conjunction with the wheel mouse, making it difficult to implement this degree of interactivity
In Matlab, zooming with the wheel mouse is always centered on the middle of the display, requiring you to alternate between zooming and panning to magnify a particular small region of your image or plot.

These already give a taste of some of the features we can achieve quite easily in Julia.

However, there’s more to this GUI than meets the eye. You can display the image upside-down with

display(img, pixelspacing = [1,1], flipy=true)

or switch the x and y axes with

display(img, pixelspacing = [1,1], xy=["y","x"])

To experience the full functionality, you’ll need a “4D image,” a movie (time sequence) of 3D images. If you don’t happen to have one lying around, you can create one via include("test/test4d.jl"), where test means the test directory in ImageView. (Assuming you installed ImageView via the package manager, you can say include(joinpath(Pkg.dir(), "ImageView", "test", "test4d.jl")).) This creates a solid cone that changes color over time, again in the variable img. Then, type display(img). You should see something like this:

The green circle is a “slice” from the cone. At the bottom of the window you’ll see a number of buttons and our current location, z=1 and t=1, which correspond to the base of the cone and the beginning of the movie, respectively. Click the upward-pointing green arrow, and you’ll “pan” through the cone in the z dimension, making the circle smaller. You can go back with the downward-pointing green arrow, or step frame-by-frame with the black arrows. Next, clicking the “play forward” button moves forward in time, and you’ll see the color change through gray to magenta. The black square is a stop button. You can, of course, type a particular z, t location into the entry boxes, or grab the sliders and move them.

If you have a wheel mouse, Alt-scroll changes the time, and Ctrl-Alt-scroll changes the z-slice.

You can change the playback speed by right-clicking in an empty space within the navigation bar, which brings up a popup (context) menu:

By default, display will show you slices in the xy-plane. You might want to see a different set of slices from the 4d image:

display(img, xy=["x","z"])

Initially you’ll see nothing, but that’s because this edge of the image is black. Type 151 into the y: entry box (note its name has changed) and hit enter, or move the “y” slider into the middle of its range; now you’ll see the cone from the side.

This GUI is also useful for “plain movies” (2d images with time), in which case the z controls will be omitted and it will behave largely as a typical movie-player. Likewise, the t controls will be omitted for 3d images lacking a temporal component, making this a nice viewer for MRI scans.

Again, we note a number of improvements over Matlab:

When you resize the window, note that the controls keep their initial size, while the image fills the window. With some effort this behavior is possible to achieve in Matlab, but (as you’ll see later in these posts) it’s essentially trivial with Julia and Tk.
When we move the sliders, the display updates while we drag it, not just when we let go of the mouse button.
If you try this with a much larger 3d or 4d image, you may also notice that the display feels snappy and responsive in a way that’s sometimes hard to achieve with Matlab.

Altogether, advantages such as these combine to give a substantially more polished feel to GUI applications written in Julia.

This completes our tour of the features of this GUI. Now let’s go through a few of the highlights needed to create it. We’ll tackle this in pieces; not only will this make it easier to learn, but it also illustrates how to build re-useable components. Let’s start with the navigation frame.

First, let me acknowledge that this GUI is built on the work of many people who have contributed to Julia’s Cairo and Tk packages. For this step, we’ll make particular use of John Verzani’s contribution of a huge set of convenience wrappers for most of Tk’s widget functionality. John wrote up a nice set of examples that demonstrate many of the things you can do with it; this first installment is essentially just a “longer” example, and won’t surprise anyone who has read his documentation.

Let’s create a couple of types to hold the data we’ll need. We need a type that stores “GUI state,” which here consists of the currently-viewed location in the image and information needed to implement the “play” functionality:

type NavigationState
    # Dimensions:
    zmax::Int          # number of frames in z, set to 1 if only 2 spatial dims
    tmax::Int          # number of frames in t, set to 1 if only a single image
    z::Int             # current position in z-stack
    t::Int             # current moment in time
    # Other state data:
    timer              # nothing if not playing, TimeoutAsyncWork if we are
    fps::Float64       # playback speed in frames per second
end

Next, let’s create a type to hold “handles” to all the widgets:

type NavigationControls
    stepup                            # z buttons...
    stepdown
    playup
    playdown
    stepback                          # t buttons...
    stepfwd
    playback
    playfwd
    stop
    editz                             # edit boxes
    editt
    textz                             # static text (information)
    textt
    scalez                            # scale (slider) widgets
    scalet
end

It might not be strictly necessary to hold handles to all the widgets (you could do everything with callbacks), but having them available is convenient. For example, if you don’t like the icons I created, you can initialize the GUI and then use the handles to replace the icons with something better.

We’ll talk about initialization later; for now, assume that we have a variable state of type NavigationState that holds the current position in the (possibly) 4D image, and ctrls which contains a fully-initialized set of widget handles.

Each button needs a callback function to be executed when it is clicked. Let’s go through the functions for controlling t. First there is a general utility not tied to any button, but it affects many of the controls:

function updatet(ctrls, state)
    set_value(ctrls.editt, string(state.t))
    set_value(ctrls.scalet, state.t)
    enableback = state.t > 1
    set_enabled(ctrls.stepback, enableback)
    set_enabled(ctrls.playback, enableback)
    enablefwd = state.t < state.tmax
    set_enabled(ctrls.stepfwd, enablefwd)
    set_enabled(ctrls.playfwd, enablefwd)
end

The first two lines synchronize the entry box and slider to the current value of state.t; the currently-selected time can change by many different mechanisms (one of the buttons, typing into the entry box, or moving the slider), so we make state.t be the “authoritative” value and synchronize everything to it. The remaining lines of this function control which of the t navigation buttons are enabled (if t==1, we can’t go any earlier in the movie, so we gray out the backwards buttons).

A second utility function modifies state.t:

function incrementt(inc, ctrls, state, showframe)
    state.t += inc
    updatet(ctrls, state)
    showframe(state)
end

Note the call to updatet described above. The new part of this is the showframe function, whose job it is to display the image frame (or any other visual information) to the user. Typically, the actual showframe function will need additional information such as where to render the image, but you can provide this information using anonymous functions. We’ll see how that works in the next installment; below we’ll just create a simple “stub” function.

Now we get to callbacks which we’ll “bind” to the step and play buttons:

function stept(inc, ctrls, state, showframe)
    if 1 <= state.t+inc <= state.tmax
        incrementt(inc, ctrls, state, showframe)
    else
        stop_playing!(state)
    end
end

function playt(inc, ctrls, state, showframe)
    if !(state.fps > 0)
        error("Frame rate is not positive")
    end
    stop_playing!(state)
    dt = 1/state.fps
    state.timer = TimeoutAsyncWork(i -> stept(inc, ctrls, state, showframe))
    start_timer(state.timer, iround(1000*dt), iround(1000*dt))
end

stept() increments the t frame by the specified amount (typically 1 or -1), while playt() starts a timer that will call stept at regular intervals. The timer is stopped if play reaches the beginning or end of the movie. The stop_playing! function checks to see whether we have an active timer, and if so stops it:

function stop_playing!(state::NavigationState)
    if !is(state.timer, nothing)
        stop_timer(state.timer)
        state.timer = nothing
    end
end
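
If you want to poke at the stepping logic without Tk or a timer, a headless imitation is enough. The sketch below uses current Julia syntax (mutable struct is today’s spelling of type); MiniState and tick! are hypothetical, and a Bool stands in for “the timer is active”.

```julia
# Minimal headless state: only the fields the t-stepping logic touches.
mutable struct MiniState
    tmax::Int
    t::Int
    playing::Bool      # stands in for "timer is active"
end

shown = Int[]                              # frames "displayed" so far
showframe = state -> push!(shown, state.t)

# One tick: advance if the next frame exists, otherwise stop playback.
function tick!(state, inc, showframe)
    if 1 <= state.t + inc <= state.tmax
        state.t += inc
        showframe(state)
    else
        state.playing = false
    end
    return state
end

state = MiniState(3, 1, true)
# Drive the ticks by hand; in the GUI a timer would do this.
while state.playing
    tick!(state, 1, showframe)
end
```

Starting at t=1 in a three-frame movie, frames 2 and 3 are shown, and the next tick finds no frame 4 and switches playback off, mirroring the else branch of stept.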

An alternative way to handle playback without a timer would be in a loop, like this:

function stept(inc, ctrls, state, showframe)
    if 1 <= state.t+inc <= state.tmax
        incrementt(inc, ctrls, state, showframe)
    end
end

function playt(inc, ctrls, state, showframe)
    state.isplaying = true
    while 1 <= state.t+inc <= state.tmax && state.isplaying
        tcl_doevent()    # allow the stop button to take effect
        incrementt(inc, ctrls, state, showframe)
    end
    state.isplaying = false
end

With this version we would use a single Boolean value to signal whether there is active playback. A key point here is the call to tcl_doevent(), which allows Tk to interrupt the execution of the loop to handle user interaction (in this case, clicking the stop button). But with the timer that’s not necessary, and moreover the timer gives us control over the speed of playback.

Finally, there are callbacks for the entry and slider widgets:

function sett(ctrls,state, showframe)
    tstr = get_value(ctrls.editt)
    try
        val = int(tstr)
        state.t = val
        updatet(ctrls, state)
        showframe(state)
    catch
        updatet(ctrls, state)
    end
end

function scalet(ctrls, state, showframe)
    state.t = get_value(ctrls.scalet)
    updatet(ctrls, state)
    showframe(state)
end

sett runs when the user types an entry into the edit box; if the user types in nonsense like “foo”, it will gracefully reset it to the current position.
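
The same “fall back to the current position” idea can be written without try/catch in current Julia, where tryparse returns nothing on bad input. parse_frame is a hypothetical helper, and its range check is an extra assumption that sett above does not make.

```julia
# Return the frame to show: the parsed value if it is a valid integer inside
# 1:tmax, otherwise the current frame (so the entry box can be reset to it).
function parse_frame(str::AbstractString, current::Int, tmax::Int)
    val = tryparse(Int, strip(str))
    if val === nothing || !(1 <= val <= tmax)
        return current
    end
    return val
end

ok       = parse_frame("12", 5, 1000)     # valid entry
nonsense = parse_frame("foo", 5, 1000)    # not a number: keep current
toolarge = parse_frame("5000", 5, 1000)   # out of range: keep current
```

The caller would assign the result to state.t and then synchronize the widgets, so a bad entry simply snaps back to the frame already on screen.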

There’s a complementary set of these functions for the z controls.

These callbacks implement the functionality of this “navigation” GUI. The other main task is initialization. We won’t cover this in gory detail (you are invited to browse the code), but let’s hit a few highlights.

Creating the buttons

You can use image files (e.g., .png files) for your icons, but the ones here are created programmatically. To do this, specify two colors, the “foreground” and “background”, as strings. One also needs the data array (of type Bool), which is true for the pixels that should be colored by the foreground color and false for the ones to be set to the background. There’s also the mask array, which can prevent the data array from taking effect in any pixels marked as false in the mask.

Given suitable data and mask arrays (here we just set the mask to trues), and color strings, we create the icon and assign it to a button like this:

icon = Tk.image(data, mask, "gray70", "black")  # background=gray70, foreground=black
ctrls.stop = Button(f, icon)

Here f is the “parent frame” that the navigation controller will be rendered in. A frame is a container that organizes a collection of related GUI elements. Later we’ll find out how to create one.
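
For a feel of what suitable data and mask arrays might look like, here is a sketch that builds two small icons from Bool arrays using only Base Julia. The sizes and shapes are made up for illustration and are not the icons shipped with ImageView; whether rows map to the vertical direction in Tk.image is also an assumption.

```julia
n = 9

# "Stop" icon: a filled square with a one-pixel margin.
stopdata = falses(n, n)
stopdata[2:n-1, 2:n-1] .= true

# "Play" icon: a right-pointing triangle, tallest at the left edge and
# narrowing to a single pixel at the middle column.
playdata = falses(n, n)
mid = div(n + 1, 2)
for col in 1:mid, row in 1:n
    if abs(row - mid) <= mid - col
        playdata[row, col] = true
    end
end

# No masking: every pixel of the data array takes effect.
mask = trues(n, n)
```

Arrays like these would then be passed, together with the two color strings, to the icon-creation call shown above.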

Assigning callbacks to widgets

The “stop” and “play backwards” buttons look like this:

bind(ctrls.stop, "command", path -> stop_playing!(state))
bind(ctrls.playback, "command", path -> playt(-1, ctrls, state, showframe))

The path input is generated by Tk/Tcl, but we don’t have to use it. Instead, we use anonymous functions to pass the arguments relevant to this particular GUI instantiation. Note that these two buttons share state; that means that any changes made by one callback will affect the other.

Placing the buttons in the frame (layout management)

Here our layout needs are quite simple, but I recommend that you read the excellent tutorial on Tk’s grid layout engine. grid provides a great deal of functionality missing in Matlab, and in particular allows flexible and polished GUI behavior when resizing the window.

We position the stop button this way:

grid(ctrls.stop, 1, stopindex, padx=3*pad, pady=pad)

After the handle for the button itself, the next two inputs determine the row, column position of the widget. Here the column position is set using a variable (an integer) whose value will depend on whether the z controls are present. The pad settings just apply a bit of horizontal and vertical padding around the button.

To position the slider widgets, we could do something like this:

ctrls.scalez = Slider(f, 1:state.zmax)
grid(ctrls.scalez, 2, start:stop, sticky="we", padx=pad)

This positions them in row 2 of the frame’s grid, and has them occupy the range of columns (indicated by start:stop) used by the button controls for the same z or t axis. The sticky setting means that it will stretch to fill from West to East (left to right).

In the main GUI we’ll use one more feature of grid, so let’s cover it now. This feature controls how regions of the window expand or shrink when the window is resized:

grid_rowconfigure(win, 1, weight=1)
grid_columnconfigure(win, 1, weight=1)

This says that row 1, column 1 will expand at a rate of 1 when the figure is made larger. You can set different weights for different GUI components. The default value is 0, indicating that it shouldn’t expand at all. That’s what we want for this navigation frame, so that the buttons keep their size when the window is resized. Larger weight values indicate that the given component should expand (or shrink) at faster rates.
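
To see what “expand at a rate” means numerically, here is an idealized sketch of proportional sharing. distribute is a hypothetical helper; Tk itself works in whole pixels and has more rules than this, so treat it as a mental model rather than Tk’s actual algorithm.

```julia
# Split `extra` pixels among rows (or columns) in proportion to their weights.
# Entries with weight 0 receive nothing.
function distribute(extra, weights)
    total = sum(weights)
    total == 0 && return zeros(length(weights))
    return [extra * w / total for w in weights]
end

# Row 1 holds the image (weight 1); row 2 holds the controls (weight 0).
growth = distribute(120, [1, 0])
# Two stretchable columns with weights 1 and 3.
shares = distribute(120, [1, 3])
```

With weights [1, 0] the image row absorbs all 120 extra pixels and the controls keep their size; with weights [1, 3] the second column grows three times as fast as the first.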

Putting it all together and testing it out

We’ll place the navigation controls inside a Tk frame. Let’s create one from the command line:

using Tk
win = Toplevel()
f = Frame(win)
pack(f, expand=true, fill="both")

The first three lines create the window and the frame. pack is an alternative layout engine to grid, and slightly more convenient when all you want is to place a single item so that it fills its container. (You can mix pack and grid as long as they are operating on separate containers. Here we’ll have a frame packed in the window, and the widgets will be gridded inside the frame.) After that fourth line, the window is rather tiny; the call to pack causes the frame to expand to fill the whole window, but at the moment the frame has no contents, so the window is as small as it can be.

We need a showframe callback; for now let’s create a very simple one that will help in testing:

showframe = x -> println("showframe z=", x.z, ", t=", x.t)

Next, load the GUI code (using ImageView.Navigation) and create the NavigationState and NavigationControls objects:

ctrls = NavigationControls()
state = NavigationState(40, 1000, 2, 5)

Here we’ve set up a fake movie with 40 image slices in z, and 1000 image stacks in t.

Finally, we initialize the widgets:

init_navigation!(f, ctrls, state, showframe)

Now when you click on buttons, or change the text in the entry boxes, you’ll see the GUI in action. You can tell from the command line output, generated by showframe, what’s happening internally:

Hopefully this demonstrates another nice feature of developing GUIs in Julia: it’s straightforward to build re-usable components. This navigation frame can be added as an element to any window, and the grid layout manager takes care of the rest. All you need to do is to include ImageView/src/navigation.jl into your module, and you can make use of it with just a few lines of code.

Not too hard, right? The next step is to render the image, which brings us into the domain of Cairo.

Passing Julia Callback Functions to C

2013-05-10T00:00:00+00:00

One of the great strengths of Julia is that it is so easy to call C code natively, with no special “glue” routines or overhead to marshal arguments and convert return values. For example, if you want to call GNU GSL to compute a special function like a Debye integral, it is as easy as:

debye_1(x) = ccall((:gsl_sf_debye_1,:libgsl), Cdouble, (Cdouble,), x)

at which point you can compute debye_1(2), debye_1(3.7), and so on. (Even easier would be to use Jiahao Chen’s GSL package for Julia, which has already created such wrappers for you.) This makes a vast array of existing C libraries accessible to you in Julia (along with Fortran libraries and other languages with C-accessible calling conventions).
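
If you don’t have GSL installed, the same one-line pattern can be tried with a function from the C standard library, which is already loaded into the Julia process on common platforms. This sketch wraps strlen; c_strlen is just a name chosen here.

```julia
# Call the C standard library's strlen directly; no wrapper code is needed.
# Arguments: (function name), return type, tuple of argument types, values.
c_strlen(s::String) = ccall(:strlen, Csize_t, (Cstring,), s)

len = c_strlen("Debye")
```

The result is a Csize_t (an unsigned integer) equal to 5. Julia converts the String to a NUL-terminated C string for the duration of the call.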

In fact, you can even go the other way around, passing Julia routines to C, so that C code is calling Julia code in the form of callback functions. For example, a C library for numerical integration might expect you to pass the integrand as a function argument, which the library will then call to evaluate the integrand as many times as needed to estimate the integral. Callback functions are also natural for optimization, root-finding, and many other numerical tasks, as well as in many non-numerical problems. The purpose of this blog post is to illustrate the techniques for passing Julia functions as callbacks to C routines, which is straightforward and efficient but requires some lower-level understanding of how functions and other values are passed as arguments.

The code in this post requires Julia 0.2 (or a recent git facsimile thereof); the key features needed for callback functions (especially unsafe_pointer_to_objref) are not available in Julia 0.1.

Sorting with `qsort`

Perhaps the most well-known example of a callback parameter is provided by the qsort function, part of the ANSI C standard library and declared in C as:

void qsort(void *base, size_t nmemb, size_t size,
           int(*compare)(const void *a, const void *b));

The base argument is a pointer to an array of length nmemb, with elements of size bytes each. compare is a callback function which takes pointers to two elements a and b and returns an integer less/greater than zero if a should appear before/after b (or zero if any order is permitted). Now, suppose that we have a 1d array A of values in Julia that we want to sort using the qsort function (rather than Julia’s built-in sort function). Before we worry about calling qsort and passing arguments, we need to write a comparison function that works for some arbitrary type T, e.g.

function mycompare{T}(a_::Ptr{T}, b_::Ptr{T})
    a = unsafe_load(a_)
    b = unsafe_load(b_)
    return a < b ? cint(-1) : a > b ? cint(+1) : cint(0)
end
cint(n) = convert(Cint, n)

Notice that we use the built-in function unsafe_load to fetch the values pointed to by the arguments a_ and b_ (which is “unsafe” because it will crash if these are not valid pointers, but qsort will always pass valid pointers). Also, we have to be a little careful about return values: qsort expects a function returning a C int, so we must be sure to return Cint (the corresponding type in Julia) via a call to convert.

Now, how do we pass this to C? A function pointer in C is essentially just a pointer to the memory location of the machine code implementing that function, whereas a function value mycompare (of type Function) in Julia is quite different. Thanks to Julia’s JIT compilation approach, a Julia function may not even be compiled until the first time it is called, and in general the same Julia function may be compiled into multiple machine-code instantiations, which are specialized for arguments of different types (e.g. different T in this case). So, you can imagine that mycompare must internally point to a rather complicated data structure (a jl_function_t in julia.h, if you are interested), which holds information about the argument types, the compiled versions (if any), and so on. In general, it must store a closure with information about the environment in which the function was defined; we will talk more about this below. In any case, it is a very different object than a simple pointer to machine code for one set of argument types. Fortunately, we can get the latter simply by calling a built-in Julia function called cfunction:

const mycompare_c = cfunction(mycompare, Cint, (Ptr{Cdouble}, Ptr{Cdouble}))

Here, we pass cfunction three arguments: the function mycompare, the return type Cint, and a tuple of the argument types, in this case to sort an array of Cdouble (Float64) elements. Julia compiles a version of mycompare specialized for these argument types (if it has not done so already), and returns a Ptr{Void} holding the address of the machine code, exactly what we need to pass to qsort. We are now ready to call qsort on some sample data:

A = [1.3, -2.7, 4.4, 3.1]
ccall(:qsort, Void, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Void}),
      A, length(A), sizeof(eltype(A)), mycompare_c)

After this executes, A is changed to the sorted array [-2.7, 1.3, 3.1, 4.4]. Note that Julia knows how to convert an array A::Vector{Cdouble} into a Ptr{Cdouble}, how to compute the sizeof a type in bytes (identical to C’s sizeof operator), and so on. For fun, try inserting a println("mycompare($a,$b)") line into mycompare, which will allow you to see the comparisons that qsort is performing (and to verify that it is really calling the Julia function that you passed to it).
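
For readers on a current Julia release, where the code above no longer runs as written, here is a sketch of the same qsort call in today’s syntax. It assumes a recent Julia in which the cfunction function has been replaced by the @cfunction macro, Void is spelled Cvoid, and Ref{Cdouble} arguments let the callback receive the values directly instead of calling unsafe_load.

```julia
# Comparison callback: the ::Cint annotation converts the result to a C int.
function mycompare(a, b)::Cint
    return a < b ? -1 : (a > b ? +1 : 0)
end

# Obtain a C-callable pointer specialized for Cdouble elements.
mycompare_c = @cfunction(mycompare, Cint, (Ref{Cdouble}, Ref{Cdouble}))

A = [1.3, -2.7, 4.4, 3.1]
# qsort sorts A in place, calling back into Julia for every comparison.
ccall(:qsort, Cvoid, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}),
      A, length(A), sizeof(eltype(A)), mycompare_c)
```

As in the original version, mycompare must be an ordinary top-level function for this form of @cfunction to work.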

The problem with closures

We aren’t done yet, however. If you start passing callback functions to C routines, it won’t be long before you discover that cfunction doesn’t always work. For example, suppose we tried to declare our comparison function inline, via:

mycomp = cfunction((a_,b_) -> unsafe_load(a_) < unsafe_load(b_) ? 
                              cint(-1) : cint(+1),
                   Cint, (Ptr{Cdouble}, Ptr{Cdouble}))

Julia barfs on this, printing ERROR: function is not yet c-callable. In general, cfunction only works for “top-level” functions: named functions defined in the top-level (global or module) scope, but not anonymous (args -> value) functions and not functions defined within other functions (“nested” functions). The reason for this stems from one important concept in computer science: a closure.

To understand the need for closures, and the difficulty they pose for callback functions, suppose that we wanted to provide a nicer interface for qsort, one which permitted the user to simply pass a lessthan function returning true or false while hiding all of the low-level business with pointers, Cint, and so on. We might like to do something of the form:

function qsort!{T}(A::Vector{T}, lessthan::Function)
    function mycompare(a_::Ptr{T}, b_::Ptr{T})
        a = unsafe_load(a_)
        b = unsafe_load(b_)
        return lessthan(a, b) ? cint(-1) : cint(+1)
    end
    mycompare_c = cfunction(mycompare, Cint, (Ptr{T}, Ptr{T}))
    ccall(:qsort, Void, (Ptr{T}, Csize_t, Csize_t, Ptr{Void}),
          A, length(A), sizeof(T), mycompare_c)
    A
end

Then we could simply call qsort!([1.3, -2.7, 4.4, 3.1], <) to sort in ascending order using the built-in < comparison, or any other comparison function we wanted. Unfortunately cfunction will again barf when you try to call qsort!, and it is no longer so difficult to understand why. Notice that the nested mycompare function is no longer self-contained: it uses the variable lessthan from the surrounding scope. This is a common pattern for nested functions and anonymous functions: often, they are parameterized by local variables in the environment where the function is defined. Technically, the ability to have this kind of dependency is provided by lexical scoping in a programming language like Julia, and is typical of any language in which functions are “first-class” objects. In order to support lexical scoping, a Julia Function object needs to internally carry around a pointer to the variables in the enclosing environment, and this encapsulation is called a closure.
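To make the idea of a closure concrete without involving C at all, here is a minimal sketch in plain Julia. The names make_lessthan and by_abs are made up for illustration; the point is only that the returned anonymous function keeps using the local variable key after make_lessthan has returned.

```julia
# `make_lessthan` returns an anonymous function that captures the local
# variable `key` from the enclosing scope: that captured environment is
# what makes the result a closure rather than a bare function pointer.
make_lessthan(key) = (a, b) -> key(a) < key(b)

by_abs = make_lessthan(abs)      # compares elements by absolute value
first_is_smaller = by_abs(1.3, -2.7)

# Julia's own sort accepts the closure directly.
sorted = sort([1.3, -2.7, 4.4, 3.1], lt = by_abs)
```

A C function pointer has nowhere to store key, which is exactly the difficulty described above.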

In contrast, a C function pointer is not a closure. It doesn’t enclose a pointer to the environment in which the function was defined, or anything else for that matter; it is just the address of a stream of instructions. This makes it hard, in C, to write functions to transform other functions (higher-order functions) or to parameterize functions by local variables. This apparently leaves us with two options, neither of which is especially attractive:

We could store lessthan in a global variable, and reference that from a top-level mycompare function. (This is the traditional solution for C programmers calling qsort with parameterized comparison functions.) The problem with this strategy is that it is not re-entrant: it prevents us from calling qsort! recursively (e.g. if the comparison function itself needs to do a sort, for some complicated data structure), or from calling qsort! from multiple threads (when a future Julia version supports shared-memory parallelism). Still, this is better than nothing.
Every time qsort! is called, Julia could JIT-compile a new version of mycompare, which hard-codes the reference to the lessthan argument passed on that call. This is technically possible and has been implemented in some languages (e.g. reportedly GNU Guile and Lua do something like this). However, this strategy comes at a price: it requires that callbacks be recompiled every time a parameter in them changes, which is not true of the global-variable strategy. Anyway, it is not implemented yet in Julia.
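The first option, the global variable, might look roughly like the following sketch. It assumes Julia 1.x syntax (@cfunction, Cvoid), and the names CURRENT_LESSTHAN, global_compare and qsort_global! are hypothetical. It is deliberately not re-entrant: every call overwrites the same global slot.

```julia
# Global slot holding whichever comparison function is currently in use.
const CURRENT_LESSTHAN = Ref{Any}(isless)

# Top-level callback, so it can be turned into a C function pointer.
# It reads the comparison function from the global slot on every call.
function global_compare(a::Cdouble, b::Cdouble)::Cint
    lt = CURRENT_LESSTHAN[]
    return lt(a, b) ? -1 : (lt(b, a) ? 1 : 0)
end

function qsort_global!(A::Vector{Cdouble}, lessthan)
    CURRENT_LESSTHAN[] = lessthan   # not re-entrant: shared mutable state
    cmp_c = @cfunction(global_compare, Cint, (Ref{Cdouble}, Ref{Cdouble}))
    ccall(:qsort, Cvoid, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}),
          A, length(A), sizeof(Cdouble), cmp_c)
    return A
end

descending = qsort_global!([1.3, -2.7, 4.4, 3.1], >)
```

If lessthan itself called qsort_global!, the inner call would replace the outer call's comparison function, which is the re-entrancy problem in a nutshell.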

Fortunately, there is often a third option, because C programmers long ago recognized these limitations of function pointers, and devised a workaround: most modern C callback interfaces allow arbitrary data to be passed through to the callback via a “pass-through” (or “thunk”) pointer parameter. As explained in the next section, we can exploit this technique in Julia to pass a “true” closure as a callback.

Passing closures via pass-through pointers

The qsort interface is nowadays considered rather antiquated. Years ago, it was supplemented on BSD-Unix systems, and eventually in GNU libc, by a function called qsort_r that solves the problem of passing parameters to the callback in a re-entrant way. This is how the BSD (e.g. MacOS) qsort_r function is defined:

void qsort_r(void *base, size_t nmemb, size_t size, void *thunk,
             int (*compare)(void *thunk, const void *a, const void *b));

Compared to qsort, there is an extra thunk parameter, and this is passed through to the compare function as its first argument. In this way, you can pass a pointer to arbitrary data through to your callback, and we can exploit this to pass a closure through for an arbitrary Julia callback.

All we need is a way to convert a Julia Function into an opaque Ptr{Void} so that we can pass it through to our callback, and then a way to convert the opaque pointer back into a Function. The former is automatic if we simply declare the ccall argument as type Any (which passes the argument as an opaque Julia object pointer), and the latter is accomplished by the built-in function unsafe_pointer_to_objref. (Technically, we could use type Function or an explicit call to pointer_from_objref instead of Any.) Using these, we can now define a working high-level qsort! function that takes an arbitrary lessthan comparison-function argument:

function qsort!_compare{T}(lessthan_::Ptr{Void}, a_::Ptr{T}, b_::Ptr{T})
    a = unsafe_load(a_)
    b = unsafe_load(b_)
    lessthan = unsafe_pointer_to_objref(lessthan_)::Function
    return lessthan(a, b) ? cint(-1) : cint(+1)
end

function qsort!{T}(A::Vector{T}, lessthan::Function=<)
    compare_c = cfunction(qsort!_compare, Cint, (Ptr{Void}, Ptr{T}, Ptr{T}))
    ccall(:qsort_r, Void, (Ptr{T}, Csize_t, Csize_t, Any, Ptr{Void}),
          A, length(A), sizeof(T), lessthan, compare_c)
    return A
end

qsort!_compare is a top-level function, so cfunction has no problem with it, and it will only be compiled once per type T to be sorted (rather than once per call to qsort! or per lessthan function). We use the explicit ::Function assertion to tell the compiler that we will only pass Function pointers in lessthan_. Note that we gave the lessthan argument a default value of < (default arguments being a recent feature added to Julia).

We can now do qsort!([1.3, -2.7, 4.4, 3.1]) and it will return the array sorted in ascending order, or qsort!([1.3, -2.7, 4.4, 3.1], >) to sort in descending order.

Warning: `qsort_r` is not portable

The example above has one major problem that has nothing to do with Julia: the qsort_r function is not portable. The above example won’t work on Windows, since the Windows C library doesn’t define qsort_r (instead, it has a function called qsort_s, which of course uses an argument order incompatible with both the BSD and GNU qsort_r functions). Worse, it will crash on GNU/Linux systems, which do provide qsort_r but with an incompatible argument order. As a result, it is difficult to use qsort_r in a way that works on both GNU/Linux and BSD (e.g. MacOS) systems without crashing on one of them. This is how glibc’s qsort_r is defined:

void qsort_r(void *base, size_t nmemb, size_t size,
             int (*compare)(const void *a, const void *b, void *thunk),
             void *thunk);

Note that the position of the thunk argument is moved, both in qsort_r itself and in the comparison function. So, the corresponding qsort! Julia code on GNU/Linux systems should be:

function qsort!_compare{T}(a_::Ptr{T}, b_::Ptr{T}, lessthan_::Ptr{Void})
    a = unsafe_load(a_)
    b = unsafe_load(b_)
    lessthan = unsafe_pointer_to_objref(lessthan_)::Function
    return lessthan(a, b) ? cint(-1) : cint(+1)
end

function qsort!{T}(A::Vector{T}, lessthan::Function=<)
    compare_c = cfunction(qsort!_compare, Cint, (Ptr{T}, Ptr{T}, Ptr{Void}))
    ccall(:qsort_r, Void, (Ptr{T}, Csize_t, Csize_t, Ptr{Void}, Any),
          A, length(A), sizeof(T), compare_c, lessthan)
    return A
end

If you really needed to call qsort_r from Julia, you could use the above definitions if OS_NAME == :Linux and the BSD definitions otherwise, with a third version using qsort_s on Windows, but fortunately there is not much need as Julia comes with its own perfectly adequate sort and sort! routines.
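As a rough sketch of what such a platform switch could look like on a recent Julia release, the code below picks the glibc argument order on Linux, the BSD order on macOS, and falls back to Julia's own sort! elsewhere. It assumes Julia 1.x (Sys.islinux, Sys.isapple, @cfunction) and a C library that exports qsort_r with the signatures shown above; the names compare_gnu, compare_bsd and qsort_r! are made up for illustration. Instead of passing the function as Any, this version boxes it in a Ref{Any} and passes the box's address as the pass-through pointer.

```julia
# Recover the comparison function from the pass-through pointer.
function unbox_lessthan(thunk::Ptr{Cvoid})
    box = unsafe_pointer_to_objref(thunk)::Base.RefValue{Any}
    return box[]
end

# glibc order: compare(a, b, thunk)
function compare_gnu(a::Cdouble, b::Cdouble, thunk::Ptr{Cvoid})::Cint
    lt = unbox_lessthan(thunk)
    return lt(a, b) ? -1 : (lt(b, a) ? 1 : 0)
end

# BSD order: compare(thunk, a, b)
compare_bsd(thunk::Ptr{Cvoid}, a::Cdouble, b::Cdouble)::Cint =
    compare_gnu(a, b, thunk)

function qsort_r!(A::Vector{Cdouble}, lessthan = isless)
    box = Ref{Any}(lessthan)
    GC.@preserve box begin            # keep the box alive during the C call
        thunk = pointer_from_objref(box)
        if Sys.islinux()
            cmp = @cfunction(compare_gnu, Cint,
                             (Ref{Cdouble}, Ref{Cdouble}, Ptr{Cvoid}))
            ccall(:qsort_r, Cvoid,
                  (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}, Ptr{Cvoid}),
                  A, length(A), sizeof(Cdouble), cmp, thunk)
        elseif Sys.isapple()
            cmp = @cfunction(compare_bsd, Cint,
                             (Ptr{Cvoid}, Ref{Cdouble}, Ref{Cdouble}))
            ccall(:qsort_r, Cvoid,
                  (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}, Ptr{Cvoid}),
                  A, length(A), sizeof(Cdouble), thunk, cmp)
        else
            sort!(A, lt = lessthan)   # no known-compatible qsort_r: use Julia's sort
        end
    end
    return A
end
```

Only the position of the thunk differs between the two branches, both in the ccall and in the callback signature.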

Passing closures in data structures

As another example that is oriented more towards numerical computations, we’ll examine how we might call the numerical integration routines in the GNU Scientific Library (GSL). There is already a GSL package that handles the wrapper work below for you, but it is instructive to look at how this is implemented because GSL simulates closures in a slightly different way, with data structures.

Like most modern C libraries accepting callbacks, GSL uses a void* pass-through parameter to allow arbitrary data to be passed through to the callback routine, and we can use that to support arbitrary closures in Julia. Unlike qsort_r, however, GSL wraps both the C function pointer and the pass-through pointer in a data structure called gsl_function:

typedef struct {
    double (*function)(double x, void *params);
    void *params;
} gsl_function;

Using the techniques above, we can easily declare a GSL_Function type in Julia that mirrors this C type, and with a constructor GSL_Function(f::Function) that creates a wrapper around an arbitrary Julia function f:

function gsl_function_wrap(x::Cdouble, params::Ptr{Void})
    f = unsafe_pointer_to_objref(params)::Function
    convert(Cdouble, f(x))::Cdouble
end
const gsl_function_wrap_c = cfunction(gsl_function_wrap,
                                      Cdouble, (Cdouble, Ptr{Void}))

type GSL_Function
    func::Ptr{Void}
    params::Any
    GSL_Function(f::Function) = new(gsl_function_wrap_c, f)
end

One subtlety with the above code is that we need to explicitly convert the return value of f to a Cdouble (in case the caller’s code returns some other numeric type for some x, such as an Int). Moreover, we need to explicitly assert (::Cdouble) that the result of the convert was a Cdouble. As with the qsort example, this is because cfunction only works if Julia can guarantee that gsl_function_wrap returns the specified Cdouble type, and Julia cannot infer the return type of convert since it does not know the return type of f(x).
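The convert-then-assert pattern can be tried in isolation, without GSL. In this sketch call_as_cdouble is a hypothetical stand-in for the callback wrapper; the anonymous function returns an Int, yet the wrapper always hands back a Float64.

```julia
# Whatever numeric type `f` returns, convert it to Cdouble and assert the
# result type so the compiler can rely on it.
function call_as_cdouble(f, x::Cdouble)
    return convert(Cdouble, f(x))::Cdouble
end

from_int   = call_as_cdouble(x -> 1, 2.0)   # f returns an Int
from_float = call_as_cdouble(cos, 0.0)      # f already returns a Float64
```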

Given the above definitions, it is a simple matter to pass this to the GSL adaptive-integration routines in a wrapper function gsl_integration_qag:

function gsl_integration_qag(f::Function, a::Real, b::Real, epsrel::Real=1e-12,
                             maxintervals::Integer=10^7)
    s = ccall((:gsl_integration_workspace_alloc,:libgsl), Ptr{Void}, (Csize_t,),
              maxintervals)
    result = Array(Cdouble,1)
    abserr = Array(Cdouble,1)
    ccall((:gsl_integration_qag,:libgsl), Cint,
          (Ptr{GSL_Function}, Cdouble,Cdouble, Cdouble, Csize_t, Cint, Ptr{Void}, 
           Ptr{Cdouble}, Ptr{Cdouble}),
          &GSL_Function(f), a, b, epsrel, maxintervals, 1, s, result, abserr)
    ccall((:gsl_integration_workspace_free,:libgsl), Void, (Ptr{Void},), s)
    return (result[1], abserr[1])
end

Note that &GSL_Function(f) passes a pointer to a GSL_Function “struct” containing a pointer to gsl_function_wrap_c and f, corresponding to the gsl_function* argument in C. The return value is a tuple of the estimated integral and an estimated error.

For example, gsl_integration_qag(cos, 0, 1) returns (0.8414709848078965,9.34220461887732e-15), which computes the correct integral sin(1) to machine precision.

Taking out the trash (or not)

In the above examples, we pass an opaque pointer (object reference) to a Julia Function into C. Whenever one passes pointers to Julia data into C code, one has to ensure that the Julia data is not garbage-collected until the C code is done with it, and functions are no exception to this rule. An anonymous function that is no longer referred to by any Julia variable may be garbage collected, at which point any C pointers to it become invalid.

This sounds scary, but in practice you don’t need to worry about it very often, because Julia guarantees that ccall arguments won’t be garbage-collected until the ccall exits. So, in all of the above examples, we are safe: the Function only needs to live as long as the ccall.

The only danger arises when you pass a function pointer to C and the C code saves the pointer in some data structure which it will use in a later ccall. In that case, you are responsible for ensuring that the Function variable lives (is referred to by some Julia variable) as long as the C code might need it.

For example, in the GSL one-dimensional minimization interface, you don’t simply pass your objective function to a minimization routine and wait until it is minimized. Instead, you call a GSL routine to create a “minimizer object”, store your function pointer in this object, call routines to iterate the minimization, and then deallocate the minimizer when you are done. The Julia function must not be garbage-collected until this process is complete. The easiest way to ensure this is to create a Julia wrapper type around the minimizer object that stores an explicit reference to the Julia function, like this:

type GSL_Minimizer
    m::Ptr{Void} # the gsl_min_fminimizer pointer
    f::Any  # explicit reference to objective, to prevent garbage-collection
    function GSL_Minimizer(t)
       m = ccall((:gsl_min_fminimizer_alloc,:libgsl), Ptr{Void}, (Ptr{Void},), t)
       p = new(m, nothing)
       finalizer(p, p -> ccall((:gsl_min_fminimizer_free,:libgsl),
                               Void, (Ptr{Void},), p.m))
       p
    end
end

This wraps around a gsl_min_fminimizer object of type t, with a placeholder f to store a reference to the objective function (once it is set below), including a finalizer to deallocate the GSL object when the GSL_Minimizer is garbage-collected. The parameter t is used to specify the minimization algorithm, which could default to Brent’s algorithm via:

const gsl_brent = unsafe_load(cglobal((:gsl_min_fminimizer_brent,:libgsl), Ptr{Void}))
GSL_Minimizer() = GSL_Minimizer(gsl_brent)

(The call to cglobal yields a pointer to the gsl_min_fminimizer_brent global variable in GSL, which we then dereference to get the actual pointer via unsafe_load.)
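The wrapper-type pattern itself does not depend on GSL. The following sketch, assuming Julia 1.x syntax (mutable struct, finalizer(f, obj)), uses Libc.malloc and Libc.free as stand-ins for the GSL allocation and deallocation routines; CResource and set_callback! are hypothetical names.

```julia
mutable struct CResource
    ptr::Ptr{Cvoid}   # stand-in for a pointer to a C-side object
    f::Any            # explicit reference that keeps the callback alive
    function CResource(nbytes::Integer)
        r = new(Libc.malloc(nbytes), nothing)
        # Release the C memory when the wrapper itself is garbage-collected.
        finalizer(x -> Libc.free(x.ptr), r)
        return r
    end
end

# Store the callback in the wrapper: as long as `r` is reachable, so is `f`.
function set_callback!(r::CResource, f)
    r.f = f
    return r
end

r = set_callback!(CResource(16), x -> sin(x) + 1)
```

The anonymous function has no other name bound to it, so the f field is the only thing keeping it alive.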

Then, when we set the function to minimize (the “objective”), we store an extra reference to it in the GSL_Minimizer to prevent garbage-collection for the lifetime of the GSL_Minimizer, again using the GSL_Function type defined above to wrap the callback:

function gsl_minimizer_set!(m::GSL_Minimizer, f, x0, xmin, xmax)
    ccall((:gsl_min_fminimizer_set,:libgsl), Cint,
          (Ptr{Void}, Ptr{GSL_Function}, Cdouble, Cdouble, Cdouble),
          m.m, &GSL_Function(f), x0, xmin, xmax)
    m.f = f
    m
end

There are then various GSL routines to iterate the minimizer and to check the current x, objective value, or bounds on the minimum, which are convenient to wrap:

gsl_minimizer_iterate!(m::GSL_Minimizer) =
    ccall((:gsl_min_fminimizer_iterate,:libgsl), Cint, (Ptr{Void},), m.m)

gsl_minimizer_x(m::GSL_Minimizer) =
    ccall((:gsl_min_fminimizer_x_minimum,:libgsl), Cdouble, (Ptr{Void},), m.m)

gsl_minimizer_f(m::GSL_Minimizer) =
    ccall((:gsl_min_fminimizer_f_minimum,:libgsl), Cdouble, (Ptr{Void},), m.m)

gsl_minimizer_xmin(m::GSL_Minimizer) =
    ccall((:gsl_min_fminimizer_x_lower,:libgsl), Cdouble, (Ptr{Void},), m.m)
gsl_minimizer_xmax(m::GSL_Minimizer) =
    ccall((:gsl_min_fminimizer_x_upper,:libgsl), Cdouble, (Ptr{Void},), m.m)

Putting all of these together, we can minimize a simple function sin(x) in the interval [-3,1], with a starting guess -1, via:

m = GSL_Minimizer()
gsl_minimizer_set!(m, sin, -1, -3, 1)
while gsl_minimizer_xmax(m) - gsl_minimizer_xmin(m) > 1e-6
    println("iterating at x = $(gsl_minimizer_x(m))")
    gsl_minimizer_iterate!(m)
end
println("found minimum $(gsl_minimizer_f(m)) at x = $(gsl_minimizer_x(m))")

After a few iterations, it prints found minimum -1.0 at x = -1.5707963269964016, which is the correct minimum (−π/2) to about 10 digits.

At this point, I will shamelessly plug my own NLopt package for Julia, which wraps around my free/open-source NLopt library to provide many more optimization algorithms than GSL, with perhaps a nicer interface. However, the techniques used to pass callback functions to NLopt are actually quite similar to those used for GSL.

An even more complicated version of these techniques can be found in the PyCall package to call Python from Julia. In order to pass a Julia function to Python, we again use cfunction on a wrapper function that handles the type conversions and so on, and pass the actual Julia closure through via a pass-through pointer. But in that case, the pass-through pointer consists of a Python object that has been created with a new type that allows it to wrap a Julia object, and garbage-collection is deferred by storing the Julia object in a global dictionary of saved objects (removing it via the Python destructor of the new type). That is all somewhat tricky stuff and beyond the scope of this blog post; I only mention it to illustrate the fact that it is possible to implement quite complex inter-language calling behaviors purely in Julia by building on the above techniques.

Put This In Your Pipe

2013-04-08T00:00:00+00:00

In a previous post, I talked about why “shelling out” to spawn a pipeline of external programs via an intermediate shell is a common cause of bugs, security holes, unnecessary overhead, and silent failures. But it’s so convenient! Why can’t running pipelines of external programs be convenient and safe? Well, there’s no real reason, actually. The shell itself manages to construct and execute pipelines quite well. In principle, there’s nothing stopping high-level languages from doing it at least as well as shells do – the common ones just don’t by default, instead requiring users to make the extra effort to use external programs safely and correctly. There are two major impediments:

Some moderately tricky low-level UNIX plumbing using the pipe, dup2, fork, close, and exec system calls;
The UX problem of designing an easy, flexible programming interface for commands and pipelines.

This post describes the system we designed and implemented for Julia, and how it avoids the major flaws of shelling out in other languages. First, I’ll present the Julia version of the previous post’s example – counting the number of lines in a given directory containing the string “foo”. The fact that Julia provides complete, specific diagnostic error messages when pipelines fail turns out to reveal a surprising and subtle bug, lurking in what appears to be a perfectly innocuous UNIX pipeline. After fixing this bug, we go into details of how Julia’s external command execution and pipeline construction system actually works, and why it provides greater flexibility and safety than the traditional approach of using an intermediate shell to do all the heavy lifting.

Simple Pipeline, Subtle Bug

Here’s how you write the example of counting the number of lines in a directory containing the string “foo” in Julia (you can follow along at home if you have Julia installed from source by changing directories into the Julia source directory and doing cp -a src "source code"; mkdir tmp and then firing up the Julia REPL):

julia> dir = "src";

julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
5

This Julia command looks suspiciously similar to the naïve Ruby version we started with in the previous post:

`find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i

However, it isn’t susceptible to the same problems:

julia> dir = "source code";

julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
5

julia> dir = "nonexistent";

julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
find: `nonexistent': No such file or directory
ERROR: failed processes:
  Process(`find nonexistent -type f -print0`, ProcessExited(1)) [1]
  Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
 in pipeline_error at process.jl:412
 in readall at process.jl:365
 in readchomp at io.jl:172

julia> dir = "foo'; echo MALICIOUS ATTACK; echo '";

julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
find: `foo\'; echo MALICIOUS ATTACK; echo \'': No such file or directory
ERROR: failed processes:
  Process(`find "foo'; echo MALICIOUS ATTACK; echo '" -type f -print0`, ProcessExited(1)) [1]
  Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
 in pipeline_error at process.jl:412
 in readall at process.jl:365
 in readchomp at io.jl:172

The default, simplest-to-achieve behavior in Julia is:

not susceptible to any kind of metacharacter breakage,
reliably detects all subprocess failures,
automatically raises an exception if any subprocess fails,
prints error messages including exactly which commands failed.

In the above examples, we can see that even when dir contains spaces or quotes, the expression still behaves exactly as intended – the value of dir is interpolated as a single argument to the find command. When dir is not the name of a directory that exists, find fails – as it should – and this failure is detected and automatically converted into an informative exception, including the fully expanded command-lines that failed.

In the previous post, we observed that using the pipefail option for Bash allows detection of pipeline failures, like this one, occurring before the last process in the pipeline. However, it only allows us to detect that at least one thing in the pipeline failed. We still have to guess at what parts of the pipeline actually failed. In the Julia example, on the other hand, there is no guessing required: when a non-existent directory is given, we can see that both find and xargs fail. While it is unsurprising that find fails in this case, it is unexpected that xargs also fails. Why does xargs fail?

One possibility to check for is that the xargs program fails with no input. We can use Julia’s success predicate to try it out:

julia> success(`cat /dev/null` |> `xargs true`)
true

Ok, so xargs seems perfectly happy with no input. Maybe grep doesn’t like not getting any input?

julia> success(`cat /dev/null` |> `grep foo`)
false

Aha! grep returns a non-zero status when it doesn’t get any input. Good to know. It turns out that grep indicates whether it matched anything or not with its return status. Most programs use their return status to indicate success or failure, but some, like grep, use it to indicate some other boolean condition – in this case “found something” versus “didn’t find anything”:

julia> success(`echo foo` |> `grep foo`)
true

julia> success(`echo bar` |> `grep foo`)
false

Now we know why grep is “failing” – and xargs too, since it returns a non-zero status if the program it runs returns non-zero. This means that our Julia pipeline and the “responsible” Ruby version are both susceptible to bogus failures when we search an existing directory that happens not to contain the string “foo” anywhere:

julia> dir = "tmp";

julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
ERROR: failed process: Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
 in error at error.jl:22
 in pipeline_error at process.jl:394
 in pipeline_error at process.jl:407
 in readall at process.jl:365
 in readchomp at io.jl:172

Since grep indicates not finding anything using a non-zero return status, the readall function concludes that its pipeline failed and raises an error to that effect. In this case, this default behavior is undesirable: we want the expression to just return 0 without raising an error. The simple fix in Julia is this:

julia> dir = "tmp";

julia> int(readchomp(`find $dir -type f -print0` |> ignorestatus(`xargs -0 grep foo`) |> `wc -l`))
0

This works correctly in all cases. Next I’ll explain how all of this works, but for now it’s enough to note that the detailed error message provided when our pipeline failed exposed a rather subtle bug that would eventually cause hard-to-debug problems in production. Without such detailed error reporting, this bug would be pretty difficult to track down.
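The same experiment can be repeated on a recent Julia release, with some spelling changes. The sketch below assumes Julia 1.x, where pipelines are built with the pipeline function instead of |> and output is captured with read(cmd, String), and it assumes a Unix-like system with echo, grep and wc on the PATH. The helper count_matches is a made-up name.

```julia
# `success` is true only if every process in the chain exits with status 0.
found     = success(pipeline(`echo foo`, `grep foo`))
not_found = success(pipeline(`echo bar`, `grep foo`))

# Count matching lines; ignorestatus keeps grep's "no match" exit status
# from being treated as a pipeline failure.
function count_matches(text, pattern)
    cmd = pipeline(`echo $text`, ignorestatus(`grep $pattern`), `wc -l`)
    return parse(Int, strip(read(cmd, String)))
end

zero_matches = count_matches("bar", "foo")
one_match    = count_matches("foo", "foo")
```

Without the ignorestatus wrapper, the read call in count_matches("bar", "foo") would raise an error naming the grep process, just as in the session above.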

Do-Nothing Backticks

Julia borrows the backtick syntax for external commands from Perl and Ruby, both of which in turn got it from the shell. Unlike in these predecessors, however, in Julia backticks don’t immediately run commands, nor do they necessarily indicate that you want to capture the output of the command. Instead, backticks just construct an object representing a command:

julia> `echo Hello`
`echo Hello`

julia> typeof(ans)
Cmd

(In the Julia REPL, ans is automatically bound to the value of the last evaluated input.) In order to actually run a command, you have to do something with a command object. To run a command and capture its output into a string – what other languages do with backticks automatically – you can apply the readall function:

julia> readall(`echo Hello`)
"Hello\n"

Since it’s very common to want to discard the trailing line break at the end of a command’s output, Julia provides the readchomp(x) function, which is equivalent to writing chomp(readall(x)):

julia> readchomp(`echo Hello`)
"Hello"

To run a command without capturing its output, letting it just print to the same stdout stream as the main process – i.e. what the system function does when given a command as a string in other languages – use the run function:

julia> run(`echo Hello`)
Hello

The "Hello\n" after the readall command is a returned value, whereas the Hello after the run command is printed output. (If your terminal supports color, these are colored differently so that you can easily distinguish them visually.) Nothing is returned by the run command, but if something goes wrong, an exception is raised:

julia> run(`false`)
ERROR: failed process: Process(`false`, ProcessExited(1)) [1]
 in error at error.jl:22
 in pipeline_error at process.jl:394
 in run at process.jl:384

julia> run(`notaprogram`)
execvp(): No such file or directory
ERROR: failed process: Process(`notaprogram`, ProcessExited(-1)) [-1]
 in error at error.jl:22
 in pipeline_error at process.jl:394
 in run at process.jl:384

As with xargs and grep above, this may not always be desirable. In such cases, you can use ignorestatus to indicate that the command returning a non-zero value should not be considered an error:

julia> run(ignorestatus(`false`))

julia> run(ignorestatus(`notaprogram`))
execvp(): No such file or directory
ERROR: failed process: Process(`notaprogram`, ProcessExited(-1)) [-1]
 in error at error.jl:22
 in pipeline_error at process.jl:394
 in run at process.jl:384

In the latter case, an error is still raised in the parent process since the problem is that the executable doesn’t even exist, rather than merely that it ran and returned a non-zero status.

Although Julia’s backtick syntax intentionally mimics the shell as closely as possible, there is an important distinction: the command string is never passed to a shell to be interpreted and executed; instead it is parsed in Julia code, using the same rules the shell uses to determine what the command and arguments are. Command objects allow you to see what the program and arguments were determined to be by accessing the .exec field:

julia> cmd = `perl -e 'print "Hello\n"'`
`perl -e 'print "Hello\n"'`

julia> cmd.exec
3-element Union(UTF8String,ASCIIString) Array:
 "perl"
 "-e"
 "print \"Hello\\n\""

This field is a plain old array of strings that can be manipulated like any other Julia array.

Constructing Commands

The purpose of the backtick notation in Julia is to provide a familiar, shell-like syntax for making objects representing commands with arguments. To that end, quotes and spaces work just as they do in the shell. The real power of backtick syntax doesn’t emerge, however, until we begin constructing commands programmatically. Just as in the shell (and in Julia strings), you can interpolate values into commands using the dollar sign ($):

julia> dir = "src";

julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
 "find"
 "src"
 "-type"
 "f"

Unlike in the shell, however, Julia values interpolated into commands are interpolated as a single verbatim argument – no characters inside the value are interpreted as special after the value has been interpolated:

julia> dir = "two words";

julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
 "find"
 "two words"
 "-type"
 "f"

julia> dir = "foo'bar";

julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
 "find"
 "foo'bar"
 "-type"
 "f"

This works no matter what the contents of the interpolated value are, allowing simple interpolation of characters that are quite difficult to pass as parts of command-line arguments even in the shell (for the following examples, tmp/a.tsv and tmp/b.tsv can be created in the shell with echo -e "foo\tbar\nbaz\tqux" > tmp/a.tsv; echo -e "foo\t1\nbaz\t2" > tmp/b.tsv):

julia> tab = "\t";

julia> cmd = `join -t$tab tmp/a.tsv tmp/b.tsv`;

julia> cmd.exec
4-element Union(UTF8String,ASCIIString) Array:
 "join"
 "-t\t"
 "tmp/a.tsv"
 "tmp/b.tsv"

julia> run(cmd)
foo     bar     1
baz     qux     2

Moreover, what comes after the $ can actually be any valid Julia expression, not just a variable name:

julia> `join -t$"\t" tmp/a.tsv tmp/b.tsv`.exec
4-element Union(UTF8String,ASCIIString) Array:
 "join"
 "-t\t"
 "tmp/a.tsv"
 "tmp/b.tsv"

A tab character is somewhat harder to pass in the shell, requiring command interpolation and some tricky quoting:

bash-3.2$ join -t"$(printf '\t')" tmp/a.tsv tmp/b.tsv
foo     bar     1
baz     qux     2

While interpolating values with spaces and other strange characters is great for non-brittle construction of commands, there was a reason why the shell split values on spaces in the first place: to allow interpolation of multiple arguments. Most modern shells have first-class array types, but older shells used space-separation to simulate arrays. Thus, if you interpolate a value like “foo bar” into a command in the shell, it’s treated as two separate words by default. In languages with first-class array types, however, there’s a much better option: consistently interpolate single values as single arguments and interpolate arrays as multiple values. This is precisely what Julia’s backtick interpolation does:

julia> dirs = ["foo", "bar", "baz"];

julia> `find $dirs -type f`.exec
6-element Union(UTF8String,ASCIIString) Array:
 "find"
 "foo"
 "bar"
 "baz"
 "-type"
 "f"

And of course, no matter how strange the strings contained in an interpolated array are, they become verbatim arguments, without any shell interpretation. Julia’s backticks have one more fancy trick up their sleeve. We saw earlier (without really remarking on it) that you could interpolate single values into a larger argument:

julia> x = "bar";

julia> `echo foo$x`
`echo foobar`

What happens if x is an array? Only one way to find out:

julia> x = ["bar", "baz"];

julia> `echo foo$x`
`echo foobar foobaz`

Julia does what the shell would do if you wrote echo foo{bar,baz}. This even works correctly for multiple values interpolated into the same shell word:

julia> dir = "/data"; names = ["foo","bar"]; exts=["csv","tsv"];

julia> `cat $dir/$names.$exts`
`cat /data/foo.csv /data/foo.tsv /data/bar.csv /data/bar.tsv`

This is the same Cartesian product expansion that the shell does if multiple {...} expressions are used in the same word.
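The interpolation rules from this section can be checked without running any external program, since they only affect how the command object is built. This sketch assumes Julia 1.x, where the .exec field is a plain Vector{String}; the variable names are chosen for illustration.

```julia
# A single value becomes exactly one argument, whatever it contains.
dir = "two words"
single = `find $dir -type f`.exec

# An array becomes one argument per element.
dirs = ["foo", "bar", "baz"]
multiple = `find $dirs -type f`.exec

# Several arrays in the same word expand as a Cartesian product.
base = "/data"; stems = ["foo", "bar"]; exts = ["csv", "tsv"]
product = `cat $base/$stems.$exts`.exec

# Special characters are passed through verbatim.
tab = "\t"
joined = `join -t$tab tmp/a.tsv tmp/b.tsv`.exec
```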

Distributed Numerical Optimization

2013-04-05T00:00:00+00:00

This post walks through the parallel computing functionality of Julia to implement an asynchronous parallel version of the classical cutting-plane algorithm for convex (nonsmooth) optimization, demonstrating the complete workflow including running on both Amazon EC2 and a large multicore server. I will quickly review the cutting-plane algorithm and will be focusing primarily on parallel computation patterns, so don’t worry if you’re not familiar with the optimization side of things.

Cutting-plane algorithm

The cutting-plane algorithm is a method for solving the optimization problem

$\min_{x \in \mathbb R^d} \sum_{i=1}^n f_i(x)$

where the functions $ f_i $ are convex but not necessarily differentiable. The absolute value function $ |x| $ and the 1-norm $ \|x\|_1 $ are typical examples. Important applications also arise from Lagrangian relaxation. The idea of the algorithm is to approximate the functions $ f_i $ with piecewise linear models $ m_i $ which are built up from information obtained by evaluating $ f_i $ at different points. We iteratively minimize over the models to generate candidate solution points.
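For concreteness, one standard way to write such a model is sketched below. It assumes that each evaluation of $ f_i $ at a point $ x^{(k)} $ also yields a subgradient $ g_i^{(k)} $ there; the symbols $ x^{(k)} $, $ g_i^{(k)} $ and $ K $ (the number of points evaluated so far) are introduced here for illustration only.

```latex
m_i(x) = \max_{k = 1, \ldots, K} \left\{ f_i(x^{(k)}) + (g_i^{(k)})^\top (x - x^{(k)}) \right\}
```

Because $ f_i $ is convex, every linear piece lies below it, so $ m_i(x) \le f_i(x) $ for all $ x $, and each new evaluation can only tighten the model.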

We can state the algorithm as

Choose starting point $ x $.
For $i = 1,\ldots,n$, evaluate $ f_i(x) $ and update corresponding model $ m_i $.
Let the next candidate $ x $ be the minimizer of $ \sum_{i=1}^n m_i(x) $.
If not converged, goto step 2.

If it is costly to evaluate $ f_i(x) $, then the algorithm is naturally parallelizable at step 2. The minimization in step 3 can be computed by solving a linear optimization problem, which is usually very fast. (Let me point out here that Julia has interfaces to linear programming and other optimization solvers under JuliaOpt.)

Abstracting the math, we can write the algorithm using the following Julia code.

# functions initialize, isconverged, solvesubproblem, and process implemented elsewhere
state, subproblems = initialize()
while !isconverged(state)
    results = map(solvesubproblem,subproblems)
    state, subproblems = process(state, results)
end

The function solvesubproblem corresponds to evaluating $ f_i(x) $ for a given $ i $ and $ x $ (the elements of subproblems could be tuples (i,x)). The function process corresponds to minimizing the model in step 3, and it produces a new state and a new set of subproblems to solve.

Note that the algorithm looks much like a map-reduce that would be easy to parallelize using many existing frameworks. Indeed, in Julia we can simply replace map with pmap (parallel map). Let’s consider a twist that makes the parallelism not so straightforward.
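As a minimal sketch of that substitution in current Julia (1.x), where pmap lives in the Distributed standard library: slow_square is a hypothetical stand-in for solvesubproblem and is not part of the original code.

```julia
using Distributed  # standard library in Julia 1.x

# Hypothetical stand-in for solvesubproblem; @everywhere defines it on
# every process so that workers can call it.
@everywhere slow_square(x) = (sleep(0.01); x^2)

serial   = map(slow_square, 1:8)
parallel = pmap(slow_square, 1:8)  # uses workers if any were added with -p or addprocs
```

With no extra workers pmap simply runs on the master process, so the snippet behaves the same either way; only the wall-clock time changes.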

Asynchronous variant

Variability in the time taken by the solvesubproblem function can lead to load imbalance and limit parallel efficiency as workers sit idle waiting for new tasks. Such variability arises naturally if solvesubproblem itself requires solving an optimization problem, or if the workers and network are shared, as is often the case with cloud computing.

We can consider a new variant of the cutting-plane algorithm to address this issue. The key point is

When proportion $0 < \alpha \le 1 $ of subproblems for a given candidate have been solved, generate a new candidate and corresponding set of subproblems by using whatever information is presently available.

In other words, we generate new tasks to feed to workers without needing to wait for all current tasks to complete, making the algorithm asynchronous. The algorithm remains convergent, although the total number of iterations may increase. For more details, see this paper by Jeff Linderoth and Stephen Wright.

By introducing asynchronicity we can no longer use a nice black-box pmap function and have to dig deeper into the parallel implementation. Fortunately, this is easy to do in Julia.

Parallel implementation in Julia

Julia implements distributed-memory parallelism based on one-sided message passing, where processes push work onto others (via remotecall) and the results are retrieved (via fetch) by the process which requires them. Macros such as @spawn and @parallel provide pretty syntax around this low-level functionality. This model of parallelism is very different from the typical SPMD style of MPI. Both approaches are useful in different contexts, and I expect an MPI wrapper for Julia will appear in the future (see also here).

Reading the manual on parallel computing is highly recommended, and I won’t try to reproduce it in this post. Instead, we’ll dig into and extend one of the examples it presents.

The implementation of pmap in Julia is

function pmap(f, lst)
    np = nprocs()  # determine the number of processors available
    n = length(lst)
    results = cell(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    next_idx() = (idx=i; i+=1; idx)
    @sync begin
        for p=1:np
            if p != myid() || np == 1
                @spawnlocal begin
                    while true
                        idx = next_idx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(p, f, lst[idx])
                    end
                end
            end
        end
    end
    results
end

At first sight, this code is not particularly intuitive. The @spawnlocal macro creates a task on the master process (e.g. process 1). Each task feeds work to a corresponding worker; the call remotecall_fetch(p, f, lst[idx]) runs f on process p and returns the result when finished. Tasks are uninterruptible and only surrender control at specific points such as remotecall_fetch. Tasks cannot directly modify variables from the enclosing scope, but the same effect can be achieved by using the next_idx function to access and mutate i. The task idiom takes the place of a loop that polls each worker process for results.
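The task idiom can be tried without any worker processes. The sketch below is written for current Julia (1.x) and makes several assumptions: it uses @async where the listing uses @spawnlocal, it calls f locally instead of through remotecall_fetch, and the names task_map and ntasks are hypothetical.

```julia
function task_map(f, lst; ntasks = 3)
    n = length(lst)
    results = Vector{Any}(undef, n)
    i = 1
    # Shared work queue: each call hands out the next unclaimed index.
    next_idx() = (idx = i; i += 1; idx)
    @sync for t in 1:ntasks
        @async while true
            idx = next_idx()
            idx > n && break
            results[idx] = f(lst[idx])
            yield()  # give the other tasks a turn, as a remote call would
        end
    end
    return results
end

task_map(x -> x^2, 1:5)
```

Because tasks are scheduled cooperatively, no two of them are inside next_idx at the same time, so every index is handed out exactly once.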

Implementing our asynchronous algorithm is not much more than a modification of the above code:

# given constants n and 0 < alpha <= 1
# functions initialize and solvesubproblem defined elsewhere
np = nprocs()
state, subproblems = initialize()
converged = false
isconverged() = converged
function updatemodel(mysubproblem, result)
    # store result
    ...
    # decide whether to generate new subproblems
    state.numback[mysubproblem.parent] += 1
    if state.numback[mysubproblem.parent] >= alpha*n && !state.didtrigger[mysubproblem.parent]
        state.didtrigger[mysubproblem.parent] = true
        # generate newsubproblems by solving linear optimization problem
        ...
        if ... # convergence test
            converged = true
        else
            append!(subproblems, newsubproblems)
            push!(state.didtrigger, false)
            push!(state.numback, 0)
            # ensure that for s in newsubproblems, s.parent == length(state.numback)
        end
    end
end

@sync begin
    for p=1:np
        if p != myid() || np == 1
            @spawnlocal begin
                while !isconverged()
                    if length(subproblems) == 0
                        # no more subproblems but haven't converged yet
                        yield()
                        continue
                    end
                    mysubproblem = shift!(subproblems) # pop subproblem from queue
                    result = remotecall_fetch(p, solvesubproblem, mysubproblem)
                    updatemodel(mysubproblem, result)
                end
            end
        end
    end
end

where state is an instance of a type defined as

type State
    didtrigger::Vector{Bool}
    numback::Vector{Int}
    ...
end

There is little difference in the structure of the code inside the @sync blocks, and the asynchronous logic is encapsulated in the local updatemodel function which conditionally generates new subproblems. A strength of Julia is that functions like pmap are implemented in Julia itself, so that it is particularly straightforward to make modifications like this.

Running it

Now for the fun part. The complete cutting-plane algorithm (along with additional variants) is implemented in JuliaBenders. The code is specialized for stochastic programming where the cutting-plane algorithm is known as the L-shaped method or Benders decomposition and is used to decompose the solution of large linear optimization problems. Here, solvesubproblem entails solving a relatively small linear optimization problem. Test instances are taken from the previously mentioned paper.

We’ll first run on a large multicore server. The runals.jl (asynchronous L-shaped) file contains the algorithm we’ll use. Its usage is

julia runals.jl [data source] [num subproblems] [async param] [block size]

where [num subproblems] is the $n$ as above and [async param] is the proportion $\alpha$. By setting $\alpha = 1$ we obtain the synchronous algorithm. For the asynchronous version we will take $\alpha = 0.6$. The [block size] parameter controls how many subproblems are sent to a worker at once (in the previous code, this value was always 1). We will use 4000 subproblems in our experiments.

To run multiple Julia processes on a shared-memory machine, we pass the -p N option to the julia executable, which will start up N system processes. To execute the asynchronous version with 10 workers, we run

julia -p 12 runals.jl Data/storm 4000 0.6 30

Note that we start 12 processes. These are the 10 workers, the master (which distributes tasks), and another process to perform the master’s computations (an additional refinement which was not described above). Results from various runs are presented in the table below.

	Synchronous		Asynchronous
No. Workers	Speed	Efficiency	Speed	Efficiency
10	154	Baseline	166	Baseline
20	309	100.3%	348	105%
40	517	84%	654	98%
60	674	73%	918	92%

Table: Results on a shared-memory 8x Xeon E7-8850 server. Workers correspond to individual cores. Speed is the rate of subproblems solved per second. Efficiency is calculated as the percent of ideal parallel speedup obtained. The superlinear scaling observed with 20 workers is likely a system artifact.
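The efficiency column can be reproduced from the speed column. The helper below is a sketch that assumes efficiency is measured against the 10-worker row of the same column, which is what the "Baseline" entries suggest; the function and keyword names are illustrative.

```julia
# Percent of ideal speedup relative to a baseline run.
efficiency(speed, workers; base_speed, base_workers) =
    100 * (speed / base_speed) / (workers / base_workers)

# Synchronous runs from the table above.
sync20 = efficiency(309, 20; base_speed = 154, base_workers = 10)
sync60 = efficiency(674, 60; base_speed = 154, base_workers = 10)

round(sync20, digits = 1), round(sync60, digits = 1)
```

Rounded to one decimal place these come out at about 100.3 and 72.9, in line with the 100.3% and 73% reported in the table.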

There are a few more hoops to jump through in order to run on EC2. First we must build a system image (AMI) with Julia installed. Julia connects to workers over ssh, so I found it useful to put my EC2 ssh key on the AMI and also set StrictHostKeyChecking no in /etc/ssh/ssh_config to disable the authenticity prompt when connecting to new workers. Someone will likely correct me on whether this is the right approach.

Assuming we have an AMI in place, we can fire up the instances. I used an m3.xlarge instance for the master and m1.medium instances for the workers. (Note: you can save a lot of money by using the spot market.)

To add remote workers on startup, Julia accepts a file with a list of host names through the --machinefile option. We can generate this easily enough by using the EC2 API Tools (Ubuntu package ec2-api-tools) with the command

ec2-describe-instances | grep running | awk '{ print $5; }' > mfile

On the master instance we can then run

julia --machinefile mfile runatr.jl Data/storm 4000 0.6 30

Results from various runs are presented in the table below.

	Synchronous		Asynchronous
No. Workers	Speed	Efficiency	Speed	Efficiency
10	149	Baseline	151	Baseline
20	289	97%	301	99.7%
40	532	89%	602	99.5%

Table: Results on Amazon EC2. Workers correspond to individual m1.medium instances. The master process is run on an m3.xlarge instance.

On both architectures the asynchronous version solves subproblems at a higher rate and has significantly better parallel efficiency. Scaling is better on EC2 than on the shared-memory server likely because the subproblem calculation is memory bound, and so performance is better on the distributed-memory architecture. Anyway, with Julia we can easily experiment on both.

Videos from the Julia tutorial at MIT

2013-03-30T00:00:00+00:00

We held a two-day Julia tutorial at MIT in January 2013, which included 10 sessions. MIT Open Courseware and MIT-X graciously provided support for recording these lectures, so that the wider Julia community can benefit from them.

Julia Lightning Round (slides)

This session is a rapid introduction to Julia in a series of lightning rounds. It uses short examples to demonstrate syntax and features, and gives a quick feel for the language.

Rationale behind Julia and the Vision (slides)

This session discusses the rationale and vision behind Julia, along with its design principles.

Data Analysis with DataFrames (slides)

DataFrames is one of the most widely used Julia packages. This session is an introduction to data analysis with Julia using DataFrames.

Statistical Models in Julia (slides)

This session demonstrates Julia’s statistics capabilities, which are provided by these packages: Distributions, GLM, and LM.

Fast Fourier Transforms

Julia provides a built-in interface to the FFTW library. This session demonstrates Julia’s signal processing capabilities, such as FFTs and DCTs. Also see the Hadamard package.

Optimization (slides)

This session focuses largely on using Julia for solving linear programming problems. The algebraic modeling language discussed was later released as JuMP. Benchmarks are shown evaluating the performance of Julia for implementing low-level optimization code. Optimization software in Julia has been grouped under the JuliaOpt project.

Metaprogramming and Macros

Julia is homoiconic: it represents its own code as a data structure of the language itself. Since code is represented by objects that can be created and manipulated from within the language, it is possible for a program to transform and generate its own code. Metaprogramming is described in detail in the Julia manual.
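A small illustration of code as data, using only core language features (the variable names are arbitrary):

```julia
ex = :(1 + 2 * 3)      # quote an expression instead of evaluating it

ex.head                # :call
ex.args[1]             # :+

# Build a new expression around the old one, then evaluate it.
doubled = Expr(:call, :*, 2, ex)
eval(doubled)          # 14
```

The quoted expression is an ordinary Expr object, so it can be inspected, taken apart, and wrapped in new code before anything is run.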

Parallel and Distributed Computing (Lab, Solution)

Parallel and distributed computing have been an integral part of Julia’s capabilities from an early stage. This session describes existing basic capabilities, which can be used as building blocks for higher level parallel libraries.

Networking

Julia provides asynchronous networking I/O using the libuv library. Libuv is a portable networking library created as part of the Node.js project.

Grid of Resistors (Lab, Solution)

The Grid of Resistors is a classic numerical problem to compute the voltages and the effective resistance of a 2n+1 by 2n+2 grid of 1 ohm resistors if a battery is connected to the two center points. As part of this lab, the problem is solved in Julia in a number of different ways such as a vectorized implementation, a devectorized implementation, and using comprehensions, in order to study the performance characteristics of various methods.

Efficient Aggregates in Julia

2013-03-05T00:00:00+00:00

We recently introduced an exciting feature that has been in planning for some time: immutable aggregate types. In fact, we have been planning to do this for so long that this feature is the subject of our issue #13 on GitHub, out of more than 2400 total issues so far.

Essentially, this feature drastically reduces the overhead of user-defined types that represent small number-like values, or that wrap a small number of other objects. Consider an RGB pixel type:

immutable Pixel
    r::Uint8
    g::Uint8
    b::Uint8
end

Instances of this type can now be packed efficiently into arrays, using exactly 3 bytes per object. In all other respects, these objects continue to act like normal first-class objects. To see how we might use this, here is a function that converts an RGB image in standard 24-bit framebuffer format to grayscale:

function rgb2gray!(img::Array{Pixel})
    for i=1:length(img)
        p = img[i]
        v = uint8(0.30*p.r + 0.59*p.g + 0.11*p.b)
        img[i] = Pixel(v,v,v)
    end
end

This code will run blazing fast, performing no memory allocation. We have not done thorough benchmarking, but this is in fact likely to be the fastest way to write this function in Julia from now on.
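The size claim is easy to check. The sketch below restates the example in current Julia (1.x) syntax, which is an adaptation rather than the code from the post: struct replaces immutable, UInt8 replaces Uint8, and round(UInt8, x) replaces uint8(x).

```julia
struct Pixel          # `struct` is the Julia 1.x spelling of `immutable`
    r::UInt8
    g::UInt8
    b::UInt8
end

function rgb2gray!(img::Array{Pixel})
    for i in eachindex(img)
        p = img[i]
        v = round(UInt8, 0.30 * p.r + 0.59 * p.g + 0.11 * p.b)
        img[i] = Pixel(v, v, v)
    end
    return img
end

img = [Pixel(255, 0, 0), Pixel(0, 255, 0), Pixel(10, 10, 10)]

sizeof(Pixel)        # 3
isbitstype(Pixel)    # true: plain data, no references to other objects
rgb2gray!(img)
```

A pixel that is already gray, such as Pixel(10, 10, 10), comes back unchanged because the three weights sum to one.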

The key to this behavior is the new immutable keyword, which means instances of the type cannot be modified. At first this sounds like a mere restriction — how come I’m not allowed to modify one? — but what it really means is that the object is identified with its contents, rather than its memory address. A mutable object has “behavior”; it changes over time, and there may be many references to the object, all of which can observe those changes. An immutable object, on the other hand, has only a value, and no time-varying behavior. Its location does not matter. It is “just some bits”.

Julia has always had some immutable values, in the form of bits types, which are used to represent fixed-bit-width numbers. It is highly intuitive that numbers are immutable. If x equals 2, you might later change the value of x, but it is understood that the value of 2 itself does not change. The immutable keyword generalizes this idea to structured data types with named fields. Julia variables and containers, including arrays, are all still mutable. While a Pixel object itself can’t change, a new Pixel can be written over an old one within an array, since the array is mutable.

Let’s take a look at the benefits of this feature.

The compiler and GC have a lot of freedom to move and copy these objects around. This flexibility can be used to store data more efficiently, for example keeping the real and imaginary parts of a complex number in separate registers, or keeping only one part in a register.
Immutable objects are easy to reason about. Some languages, such as C++ and C#, provide “value types”, which have many of the benefits of immutable objects. However, their behavior can be confusing. Consider code like the following:

item = lookup(collection, index)
modify!(item)

The question here is whether we have modified the same item that is in the collection, or if we have modified a local copy. In Julia there are only two possibilities: either item is mutable, in which case we modified the one and only copy of it, or it is immutable, in which case modifying it is not allowed.
No-overhead data abstractions become possible. It is often useful to define a new type that simply wraps a single value, and modifies its behavior in some way. Our favorite modular integer example type fits this description:

immutable ModInt{n} <: Integer
    k::Int
    ModInt(k) = new(mod(k,n))
end

Since a given ModInt doesn’t need to exist at a particular address, it can be passed to functions, stored in arrays, and so on, as efficiently as a single Int, with no wrapping overhead. But, in Julia, the overhead will not always be zero. The ModInt type information will “follow the data around” at compile time to the extent possible, but heap-allocated wrappers will be added as needed at run time. Typically these wrappers will be short-lived; if the final destination of a ModInt is in a ModInt array, for example, the wrapper can be discarded when the value is assigned. But if the value is only used locally inside a function, there will most likely be no wrappers at all.
Abstractions are fully enforced. If a custom constructor is written for an immutable type, then all instances will be created by it. Since the constructed objects are never modified, the invariants provided by the constructor cannot be violated. At this time, uninitialized arrays are an exception to this rule. New arrays of “plain data” immutable types have unspecified contents, so it is possible to obtain an invalid value from one. This is usually harmless in practice, since arrays must be initialized anyway, and are often created through functions like zeros that do so.
We can automatically type-specialize fields. Since field values at construction time are final, their types are too, so we learn everything about the type of an immutable object when it is constructed.
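The ModInt wrapper from the list above can be used to check the no-overhead and enforced-abstraction points. This sketch uses current Julia (1.x) syntax, so the inner constructor is spelled ModInt{n}(k) where {n}; the + method is added here only so there is something to compute with.

```julia
struct ModInt{n} <: Integer
    k::Int
    # Every instance goes through this constructor, so 0 <= k < n always holds.
    ModInt{n}(k) where {n} = new(mod(k, n))
end

Base.:+(a::ModInt{n}, b::ModInt{n}) where {n} = ModInt{n}(a.k + b.k)

x = ModInt{5}(3) + ModInt{5}(4)
x.k                               # 2
sizeof(ModInt{5}) == sizeof(Int)  # the wrapper adds no storage
```

The modulus lives in the type parameter, so it costs nothing per value, and the field k can never hold an out-of-range number because no other constructor exists.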

There are many potential optimizations here, and we have not implemented all of them yet. But having this feature in place provides another lever to help us improve performance over time.

For now though, we at least have a much simpler implementation of complex numbers, and will be able to take advantage of efficient rational matrices and other similar niceties.

Addendum: Under the hood

For purposes of calling C and writing reflective code, it helps to know a bit about how immutable types are implemented. Before this change, we had types AbstractKind, BitsKind, and CompositeKind, for separating which types are abstract, which are represented by immutable bit strings, and which are mutable aggregates. It was sometimes convenient that the type system reflected these differences, but also a bit unwarranted since all these types participate in the same hierarchy and follow the same subtyping rules.

Now, the type landscape is both simpler and more complex. The three Kinds have been merged into a single kind called DataType. The type of every value in Julia is now either a DataType, or else a tuple type (union types still exist, but of course are always abstract). To find out the details of a DataType’s physical representation, you must query its properties. DataTypes have three boolean properties abstract, mutable, and pointerfree, and an integer property size. The CompositeKind properties names and types are still there to describe fields.

The abstract property indicates that the type was declared with the abstract keyword and has no direct instances. mutable indicates, for concrete types, whether instances are mutable. pointerfree means that instances contain “just data” and no references to other Julia values. size gives the size of an instance in bytes.

What used to be BitsKinds are now DataTypes that are immutable, concrete, have no fields, and have non-zero size. The former CompositeKinds are mutable and concrete, and either have fields or are zero size if they have zero fields. Clearly, new combinations are now possible. We have already mentioned immutable types with fields. We could have the equivalent of mutable BitsKinds, but this combination is not exposed in the language, since it is easily emulated using mutable fields. Another new combination is abstract types with fields, which would allow you to declare that all subtypes of some abstract type should have certain fields. That one is definitely useful, and we plan to provide syntax for it.

Typically, the only time you need to worry about these things is when calling native code, when you want to know whether some array or struct has C-compatible data layout. This is handled by the type predicate isbits(T).
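In current Julia (1.x) the predicate on types is spelled isbitstype(T), while isbits(x) takes a value. A short sketch with two hypothetical types:

```julia
struct Point2D            # immutable, fields are plain data
    x::Float64
    y::Float64
end

mutable struct Counter    # mutable, so it lives behind a reference
    n::Int
end

isbitstype(Point2D)       # true
isbitstype(Counter)       # false
isbitstype(Vector{Int})   # false: arrays are mutable containers
isbits(Point2D(1.0, 2.0)) # true
```

A type for which the predicate is true has a C-compatible, pointer-free layout, which is the property that matters when passing data to native code.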

Design and implementation of Julia

2012-08-16T00:00:00+00:00

We describe the design and implementation of Julia in our first paper - Julia: A Fast Dynamic Language for Technical Computing. This is work in progress and comments are appreciated.

New York Open Stats Meetup

2012-04-18T00:00:00+00:00

I’ll be giving a talk on Julia at the New York Open Statistical Programming Meetup on May 1st. After my presentation, John Myles White and Shane Conway are going to give followup demos of statistical applications using Julia. Then we’re going to hang out and grab drinks nearby. Thanks to Harlan Harris and Drew Conway for setting the whole thing up!

Announcement:

After a brief hiatus, we are very excited to announce our May meetup will feature one of the hottest new languages in statistical computing: Julia. We are delighted to welcome Stefan Karpinski, one of the creators of Julia, to give an introduction to the language and his perspective on statistical computing.

Julia is a general-purpose, high-level, dynamic language in the tradition of Lisp, Perl, Python and Ruby. It is designed to take advantage of modern techniques for executing dynamic languages with statically-compiled performance. As part of this design, the language has an expressive type system, which programmers may leverage for dispatch and error checking, incidentally providing the compiler with useful type information. Using types is entirely optional, however: “typeless Julia” is a valid and useful subset of the language, similar to traditional dynamic languages, which nevertheless runs at statically compiled speeds.

Julia is especially good at running Matlab and R-style programs. Given its level of performance, we envision a new era of technical computing where libraries can be developed in a high-level language instead of C or Fortran. We have also experimented with cloud API integration, and begun to develop a web-based interactive computing environment. The ultimate goal is to make cloud-based supercomputing as easy and accessible as Google Docs.

We will also hear from a mix of people who have already started developing in Julia and see some examples of what they have developed.

The meetup will follow our typical schedule: pizza will begin at 6:15pm, Stefan will begin promptly at 7pm, and we will head to The Central Bar around 8:30pm.

Update: You can see the slides for the talk here. There was no video of the talk, but hopefully the slides are informative — there are, among other things, a lot of code examples that should just work if pasted into the Julia repl.

Lang.NEXT Announcement

2012-03-24T00:00:00+00:00

Jeff and I will be giving a presentation on Julia at the upcoming Lang.NEXT conference, a gathering of “programming language design experts and enthusiasts” featuring “talks, panels and discussion on leading programming language work from industry and research.” We are honored and excited to have been invited to speak at an event alongside so many programming language luminaries.

Abstract:

Julia is a dynamic language in the tradition of Lisp, Perl, Python and Ruby. It aims to advance expressiveness and convenience for scientific and technical computing beyond that of environments like Matlab and NumPy, while simultaneously closing the performance gap with compiled languages like C, C++, Fortran and Java.

Most high-performance dynamic language implementations have taken an existing interpreted language and worked to accelerate its execution. In creating Julia, we have reconsidered the basic language design, taking into account the capabilities of modern JIT compilers and the specific needs of technical computing. Our design includes:

Multiple dispatch as the core language paradigm.
Exposing a sophisticated type system including parametric dependent types.
Dynamic type inference to generate fast code from programs with no declarations.
Aggressive specialization of generated code for types encountered at run-time.

Julia feels light and natural for data exploration and algorithm prototyping, but has performance that lets you deploy your prototypes.

Update: You can see the slides for our talk here. Video of the presentation is available here.

Shelling Out Sucks

2012-03-11T00:00:00+00:00

Spawning a pipeline of connected programs via an intermediate shell — a.k.a. “shelling out” — is a really convenient and effective way to get things done. It’s so handy that some “glue languages,” like Perl and Ruby, even have special syntax for it (backticks). However, shelling out is also a common source of bugs, security holes, unnecessary overhead, and silent failures. Here are the three reasons why shelling out is problematic:

Metacharacter brittleness. When commands are constructed programmatically, the resulting code is almost always brittle: if a variable used to construct the command contains any shell metacharacters, including spaces, the command will likely break and do something very different than what was intended — potentially something quite dangerous.
Indirection and inefficiency. When shelling out, the main program forks and execs a shell process just so that the shell can in turn fork and exec a series of commands with their inputs and outputs appropriately connected. Not only is starting a shell an unnecessary step, but since the main program is not the parent of the pipeline commands, it cannot be notified when they terminate — it can only wait for the pipeline to finish and hope the shell indicates what happened.
Silent failures by default. Errors in shelled out commands don’t automatically become exceptions in most languages. This default leniency leads to code that fails silently when shelled out commands don’t work. Worse still, because of the indirection problem, there are many cases where the failure of a process in a spawned pipeline cannot be detected by the parent process, even if errors are fastidiously checked for.

In the rest of this post, I’ll go over examples demonstrating each of these problems. At the end, I’ll talk about better alternatives to shelling out, and in a followup post, I’ll demonstrate how Julia makes these better alternatives dead simple to use. Examples below are given in Ruby which shells out to Bash, but the same problems exist no matter what language one shells out from: it’s the technique of using an intermediate shell process to spawn external commands that’s at fault, not the language.

Metacharacter Brittleness

Let’s start with a simple example of shelling out from Ruby. Suppose you want to count the number of lines containing the string “foo” in all the files under a directory given as an argument. One option is to write Ruby code that reads the contents of the given directory, finds all the files, opens them and iterates through them looking for the string “foo”. However, that’s a lot of work and it’s going to be much slower than using a pipeline of standard UNIX commands, which are written in C and heavily optimized. The most natural and convenient thing to do in Ruby is to shell out, using backticks to capture output:

`find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i

This expression interpolates the dir variable into a command, spawns a Bash shell to execute the resulting command, captures the output into a string, and then converts that string to an integer. The command uses the -print0 and -0 options to correctly handle strange characters in file names piped from find to xargs (these options cause file names to be delimited by NULs instead of whitespace). Even with extra-careful options, this code for shelling out is simple and clear. Here it is in action:

irb(main):001:0> dir="src"
=> "src"
irb(main):002:0> `find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i
=> 5

Great. However, this only works as expected if the directory name dir doesn’t contain any characters that the shell considers special. For example, the shell decides what constitutes a single argument to a command using whitespace. Thus, if the value of dir is a directory name containing a space, this will fail:

irb(main):003:0> dir="source code"
=> "source code"
irb(main):004:0> `find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i
find: `source': No such file or directory
find: `code': No such file or directory
=> 0

The simple solution to the problem of spaces is to surround the interpolated directory name in quotes, telling the shell to treat spaces inside as normal characters:

irb(main):005:0> `find '#{dir}' -type f -print0 | xargs -0 grep foo | wc -l`.to_i
=> 5

Excellent. So what’s the problem? While this solution addresses the issue of file names with spaces in them, it is still brittle with respect to other shell metacharacters. What if a file name has a quote character in it? Let’s try it. First, let’s create a very weirdly named directory:

bash-3.2$ mkdir "foo'bar"
bash-3.2$ echo foo > "foo'bar"/test.txt
bash-3.2$ ls -ld foo*bar
drwxr-xr-x 3 stefan staff 102 Feb  3 16:17 foo'bar/

That’s an admittedly strange directory name, but it’s perfectly legal in UNIXes of all flavors. Now back to Ruby:

irb(main):006:0> dir="foo'bar"
=> "foo'bar"
irb(main):007:0> `find '#{dir}' -type f -print0  | xargs -0 grep foo | wc -l`.to_i
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
=> 0

Doh. Although this may seem like an unlikely corner case that one needn’t realistically worry about, there are serious security ramifications. Suppose the name of the directory came from an untrusted source — like a web submission, or an argument to a setuid program from an untrusted user. Suppose an attacker could arrange for any value of dir they wanted:

irb(main):008:0> dir="foo'; echo MALICIOUS ATTACK 1>&2; echo '"
=> "foo'; echo MALICIOUS ATTACK 1>&2; echo '"
irb(main):009:0> `find '#{dir}' -type f -print0  | xargs -0 grep foo | wc -l`.to_i
find: `foo': No such file or directory
MALICIOUS ATTACK
grep:  -type f -print0
: No such file or directory
=> 0

Your box is now owned. Of course, you could sanitize the value of the dir variable, but there’s a fundamental tug-of-war between security (as limited as possible) and flexibility (as unlimited as possible). The ideal behavior is to allow any directory name, no matter how bizarre, as long as it actually exists, but “defang” all shell metacharacters.

The only way to fully protect against these sorts of metacharacter attacks, whether malicious or accidental, while still using an external shell to construct the pipeline, is to do full shell metacharacter escaping:

irb(main):010:0> require 'shellwords'
=> true
irb(main):011:0> `find #{Shellwords.shellescape(dir)} -type f -print0  | xargs -0 grep foo | wc -l`.to_i
find: `foo\'; echo MALICIOUS ATTACK 1>&2; echo \'': No such file or directory
=> 0

With shell escaping, this safely attempts to search a very oddly named directory instead of executing the malicious attack. Although shell escaping does work (assuming that there aren’t any mistakes in the shell escaping implementation), realistically, no one actually bothers — it’s too much trouble. Instead, code that shells out with programmatically constructed commands is typically riddled with potential bugs in the best case and massive security holes in the worst case.
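To make the escaping step concrete, here is a minimal sketch using Ruby’s standard Shellwords module. The helper name build_find_command is hypothetical and exists only for illustration. The round-trip check at the end relies on Shellwords.split, which splits a string the way a POSIX shell splits words; the assumption is that your actual shell parses the escaped text the same way.

```ruby
require 'shellwords'

# Hypothetical helper (not from the original post): builds the command string
# for the "foo" count pipeline, escaping the directory name so that the shell
# treats it as one literal word.
def build_find_command(dir)
  "find #{Shellwords.shellescape(dir)} -type f -print0 | xargs -0 grep foo | wc -l"
end

hostile = "foo'; echo MALICIOUS ATTACK 1>&2; echo '"
escaped = Shellwords.shellescape(hostile)

# Splitting the escaped text with shell word-splitting rules should give back
# exactly one word, identical to the original string. If it does, nothing in
# the name is left over to be interpreted as a separate command.
round_trip = Shellwords.split(escaped)

puts build_find_command("foo'bar")
puts "escaped name survives as one word: #{round_trip == [hostile]}"
```

The point of the sketch is that escaping has to be applied to every interpolated value, every time, which is exactly the discipline that tends to get skipped.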

Indirection and Inefficiency

If we were using the above code to count the number of lines with the string “foo” in a directory, we would want to check to see if everything worked and respond appropriately if something went wrong. In Ruby, you can check if a shelled out command was successful using the bizarrely named $?.success? indicator:

irb(main):012:0> dir="src"
=> "src"
irb(main):013:0> `find #{Shellwords.shellescape(dir)} -type f -print0  | xargs -0 grep foo | wc -l`.to_i
=> 5
irb(main):014:0> $?.success?
=> true

Ok, that correctly indicates success. Let’s make sure that it can detect failure:

irb(main):015:0> dir="nonexistent"
=> "nonexistent"
irb(main):016:0> `find #{Shellwords.shellescape(dir)} -type f -print0  | xargs -0 grep foo | wc -l`.to_i
find: `nonexistent': No such file or directory
=> 0
irb(main):017:0> $?.success?
=> true

Wait. What?! That wasn’t successful. What’s going on?

The heart of the problem is that when you shell out, the commands in the pipeline are not immediate children of the main program, but rather its grandchildren: the program spawns a shell, which creates a set of UNIX pipes, forks child processes, connects their inputs and outputs to the pipes using the dup2 system call, and then execs the appropriate commands. Because your main program is the grandparent of those commands rather than their parent, it doesn’t know their process IDs, and it can neither wait on them nor collect their exit statuses when they terminate. The shell process, which is their parent, has to do all of that. Your program can only wait for the shell to finish and see whether the shell itself reports success. If the shell is only executing a single command, this is fine:

irb(main):018:0> `cat /dev/null`
=> ""
irb(main):019:0> $?.success?
=> true
irb(main):020:0> `cat /dev/nada`
cat: /dev/nada: No such file or directory
=> ""
irb(main):021:0> $?.success?
=> false

Unfortunately, by default the shell is quite lenient about what it considers to be a successful pipeline:

irb(main):022:0> `cat /dev/nada | sort`
cat: /dev/nada: No such file or directory
=> ""
irb(main):023:0> $?.success?
=> true

As long as the last command in a pipeline succeeds — in this case sort — the entire pipeline is considered a success. Thus, even when one or more of the earlier programs in a pipeline fails spectacularly, the last command may not, leading the shell to consider the entire pipeline to be successful. This is probably not what you meant by success.

Bash’s notion of pipeline success can fortunately be made stricter with the pipefail option. This option causes the shell to consider a pipeline successful only if all of its commands are successful:

irb(main):024:0> `set -o pipefail; cat /dev/nada | sort`
cat: /dev/nada: No such file or directory
=> ""
irb(main):025:0> $?.success?
=> false

Since shelling out spawns a new shell every time, this option has to be set for every multi-command pipeline in order to be able to determine its true success status. Of course, just like shell-escaping every interpolated variable, setting pipefail at the start of every command is simply something that no one actually does. Moreover, even with the pipefail option, your program has no way of determining which commands in a pipeline were unsuccessful; it just knows that something somewhere went wrong. While that’s better than silently failing and continuing as if there were no problem, it’s not very helpful for postmortem debugging: many programs are not as well-behaved as cat and don’t actually identify themselves or the specific problem when printing error messages before going belly up.

Given the other problems caused by the indirection of shelling out, it seems like a barely relevant afterthought to mention that execing a shell process just to spawn a bunch of other processes is inefficient. However, it is a real source of unnecessary overhead: the main process could just do the work the shell does itself. Asking the kernel to fork a process and exec a new program is a non-trivial amount of work. The only reason to have the shell do this work for you is that it’s complicated and hard to get right. The shell makes it easy. So programming languages have traditionally relied on the shell to set up pipelines for them, regardless of the additional overhead and problems caused by indirection.
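As a rough illustration of what “doing the work the shell does” involves, here is a sketch of a two-command pipeline wired up by hand in Ruby. The helper names spawn_argv and run_two_stage are hypothetical, and the sketch assumes cat and sort are on the PATH. It uses only IO.pipe, Process.spawn, and Process.wait2 from the core library.

```ruby
# Hypothetical helper: spawn one command given as an argument array.
# Passing [path, argv0] as the first argument guarantees that Ruby execs the
# program directly and never falls back to a shell.
def spawn_argv(argv, redirects)
  Process.spawn([argv[0], argv[0]], *argv.drop(1), redirects)
end

# Hypothetical helper: run `producer | consumer` with the calling process as
# the direct parent of both commands.
def run_two_stage(producer, consumer)
  reader, writer = IO.pipe

  # The :out and :in options connect the children's standard streams to the
  # pipe ends, which is the job dup2 does inside a shell.
  pid1 = spawn_argv(producer, out: writer)
  pid2 = spawn_argv(consumer, in: reader, out: File::NULL)

  # The parent must close its own copies, or the consumer never sees EOF.
  writer.close
  reader.close

  # As the direct parent, we can wait on each child and get its own status.
  [pid1, pid2].map { |pid| Process.wait2(pid).last }
end

statuses = run_two_stage(["cat", "/dev/nada"], ["sort"])
statuses.each_with_index do |status, i|
  puts "command #{i + 1}: #{status.success? ? 'ok' : 'failed'}"
end
```

Unlike the shelled-out version, the failure of cat is visible here even though sort succeeds, because each exit status is collected separately.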

Silent Failures by Default

Let’s return to our example of shelling out to count “foo” lines. Here’s the full expression we need in order to shell out without being susceptible to metacharacter breakage while still being able to tell whether the entire pipeline succeeded:

`set -o pipefail; find #{Shellwords.shellescape(dir)} -type f -print0  | xargs -0 grep foo | wc -l`.to_i

However, an error isn’t raised by default when a shelled out command fails. To avoid silent errors, we need to explicitly check $?.success? every time we shell out and raise an exception if it indicates failure. Of course, doing this manually is tedious, and as a result, it largely isn’t done. The default behavior (and therefore the easiest and most common behavior) is to assume that shelled out commands worked and completely ignore failures. To make our “foo” counting example well-behaved, we would have to wrap it in a function like so:

def foo_count(dir)
  n = `set -o pipefail;
       find #{Shellwords.shellescape(dir)} -type f -print0  | xargs -0 grep foo | wc -l`.to_i
  raise("pipeline failed") unless $?.success?
  return n
end

This function behaves the way we would like it to:

irb(main):026:0> foo_count("src")
=> 5
irb(main):027:0> foo_count("source code")
=> 5
irb(main):028:0> foo_count("nonexistent")
find: `nonexistent': No such file or directory
RuntimeError: pipeline failed
	from (irb):5:in `foo_count'
	from (irb):13
	from :0
irb(main):029:0> foo_count("foo'; echo MALICIOUS ATTACK; echo '")
find: `foo\'; echo MALICIOUS ATTACK; echo \'': No such file or directory
RuntimeError: pipeline failed
	from (irb):5:in `foo_count'
	from (irb):14
	from :0

However, this 6-line, 200-character function is a far cry from the clarity and brevity we started with:

`find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i

If most programmers saw the longer, safer version of this in a program, they’d probably wonder why someone was writing such verbose, cryptic code to get something so simple and straightforward done.

Summary and Remedy

To sum it up, shelling out is great, but making code that shells out bug-free, secure, and not prone to silent failures requires three things that typically aren’t done:

Shell-escaping all values used to construct commands
Prefixing each multi-command pipeline with “set -o pipefail;”
Explicitly checking for failure after each shelled out command

The trouble is that after doing all of these things, shelling out is no longer terribly convenient, and the code becomes annoyingly verbose. In short, shelling out responsibly kind of sucks.

As is so often the case, the root of all of these problems is relying on a middleman rather than doing things yourself. If a program constructs and executes pipelines itself, it remains in control of all the subprocesses, can determine their individual exit conditions, automatically handle errors appropriately, and give accurate, comprehensive diagnostic messages when things go wrong. Moreover, without a shell to interpret commands, there is also no shell to treat metacharacters specially, and therefore no danger of metacharacter brittleness. Python gets this right: using os.popen to shell out is officially deprecated, and the recommended way to call external programs is to use the subprocess module, which spawns external programs without using a shell. Constructing pipelines using subprocess can be a little verbose, but it is safe and avoids all the problems that shelling out is prone to. In my followup post, I will describe how Julia makes constructing and executing pipelines of external commands as safe as Python’s subprocess and as convenient as shelling out.
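For comparison with the Ruby examples above, the same “no middleman” approach can be sketched with Ruby’s standard Open3 library, whose pipeline_r method runs a list of commands connected by pipes and hands back one wait thread per command. This is only an illustrative rewrite of foo_count under that assumption, not the approach the followup post describes.

```ruby
require 'open3'

# Hypothetical rewrite of foo_count: no shell is involved, so the directory
# name needs no escaping, and every command's exit status is checked.
def foo_count(dir)
  commands = [
    ["find", dir, "-type", "f", "-print0"],
    ["xargs", "-0", "grep", "foo"],
    ["wc", "-l"],
  ]

  Open3.pipeline_r(*commands) do |last_stdout, wait_threads|
    output = last_stdout.read
    # Each thread's value is the Process::Status of one pipeline command.
    statuses = wait_threads.map(&:value)
    commands.zip(statuses).each do |argv, status|
      # Caveat: grep exits with status 1 when nothing matches, and xargs
      # typically reports that as a failure too, so "zero matches" raises
      # here. A real implementation would have to decide how to treat it.
      raise "#{argv.first} failed (#{status})" unless status.success?
    end
    output.to_i
  end
end
```

Calling foo_count("src") should then behave like the shelled-out version, except that the error message names the command that failed, and a directory name such as foo'bar is passed to find as-is.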

Stanford Talk Video

2012-03-01T00:00:00+00:00

Jeff gave his previously announced, invited talk at Stanford yesterday and the video is available here. Congrats, Jeff!

Stanford Talk Announcement

2012-02-27T00:00:00+00:00

I will be speaking about Julia at the Stanford EE Computer Systems Colloquium on Wednesday, February 29 at 4:15PM PST. The title of the talk is Julia: A Fast Dynamic Language For Technical Computing.

Abstract:

Julia is a general-purpose, high-level, dynamic language, designed from the start to take advantage of techniques for executing dynamic languages at statically-compiled language speeds. As a result the language has a more powerful type system, and generally provides better type information to the compiler.

Julia is especially good at running MATLAB and R-style programs. Given its level of performance, we envision a new era of technical computing where libraries can be developed in a high-level language instead of C or FORTRAN. We have also experimented with cloud API integration, and begun to develop a web-based, language-neutral platform for visualization and collaboration. The ultimate goal is to make cloud-based supercomputing as easy and accessible as Google Docs.

Speaker Bio:

Jeff Bezanson has been developing the Julia language for two and a half years with a small distributed team of collaborators. Previously, he worked as a software engineer at Interactive Supercomputing, which developed the Star-P parallel extension to MATLAB. At the company, Jeff was a principal developer of “M#”, an implementation of the MATLAB language running on .NET. He is now a second-year graduate student at MIT. Jeff received an A.B. in Computer Science from Harvard University in 2004, and has experience with applications of technical computing in medical imaging.

The talk will be webcast live.

Edit: the video of the talk can be found here.

Why We Created Julia

2012-02-14T00:00:00+00:00

In short, because we are greedy.

We are power Matlab users. Some of us are Lisp hackers. Some are Pythonistas, others Rubyists, still others Perl hackers. There are those of us who used Mathematica before we could grow facial hair. There are those who still can’t grow facial hair. We’ve generated more R plots than any sane person should. C is our desert island programming language.

We love all of these languages; they are wonderful and powerful. For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — each one is perfect for some aspects of the work and terrible for others. Each one is a trade-off.

We are greedy: we want more.

We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

(Did we mention it should be as fast as C?)

While we’re being demanding, we want something that provides the distributed power of Hadoop — without the kilobytes of boilerplate Java and XML; without being forced to sift through gigabytes of log files on hundreds of machines to find our bugs. We want the power without the layers of impenetrable complexity. We want to write simple scalar loops that compile down to tight machine code using just the registers on a single CPU. We want to write A*B and launch a thousand computations on a thousand machines, calculating a vast matrix product together.

We never want to mention types when we don’t feel like it. But when we need polymorphic functions, we want to use generic programming to write an algorithm just once and apply it to an infinite lattice of types; we want to use multiple dispatch to efficiently pick the best method for all of a function’s arguments, from dozens of method definitions, providing common functionality across drastically different types. Despite all this power, we want the language to be simple and clean.

All this doesn’t seem like too much to ask for, does it?

Even though we recognize that we are inexcusably greedy, we still want to have it all. About two and a half years ago, we set out to create the language of our greed. It’s not complete, but it’s time for a 1.0 release — the language we’ve created is called Julia. It already delivers on 90% of our ungracious demands, and now it needs the ungracious demands of others to shape it further. So, if you are also a greedy, unreasonable, demanding programmer, we want you to give it a try.
