<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Torsten's STDOUT</title>
    <author>
        <name>Torsten Becker</name>
    </author>
    <rights>Copyright 2009-2015 Torsten Becker</rights>
    <link type="text/html" href="https://torsten.io" rel="alternate"/>
    <link type="application/atom+xml" href="https://torsten.io/stdout.atom" rel="self"/>
    <generator>Jekyll</generator>
    <id>https://torsten.io/stdout.atom</id>
    <updated>2015-06-16T12:46:38-04:00</updated>

	
	    <entry>
            <id>https://torsten.io/stdout/how-to-profile-clojure-code</id>
            <link type="text/html" href="https://torsten.io/stdout/how-to-profile-clojure-code" rel="alternate"/>
            <updated>2015-05-22T00:00:00-04:00</updated>
			<title>How to Profile Clojure Code</title>
            <content type="html">&lt;p&gt;I had some trouble to find a straight answer on how to profile a Clojure application to find CPU hotspots. After some searching and tinkering, I ended up with these steps that worked for me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1&amp;#x2e;&lt;/strong&gt; Download, install, and launch &lt;a href=&quot;http://visualvm.java.net/download.html&quot;&gt;VisualVM&lt;/a&gt;. VisualVM is definitely &lt;em&gt;good enough&lt;/em&gt; for basic profiling. If you want more detail, you might want to give &lt;a href=&quot;https://www.yourkit.com/&quot;&gt;YourKit&lt;/a&gt; a try. You&amp;#39;ll have to pay for a YourKit license, though.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2&amp;#x2e;&lt;/strong&gt; Add the following to your Leiningen &lt;code&gt;project.clj&lt;/code&gt; file, either in a profile or at the top level:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;clojure language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;ss&quot;&gt;:jvm-opts&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;-Dcom.sun.management.jmxremote&amp;quot;&lt;/span&gt;
           &lt;span class=&quot;s&quot;&gt;&amp;quot;-Dcom.sun.management.jmxremote.ssl=false&amp;quot;&lt;/span&gt;
           &lt;span class=&quot;s&quot;&gt;&amp;quot;-Dcom.sun.management.jmxremote.authenticate=false&amp;quot;&lt;/span&gt;
           &lt;span class=&quot;s&quot;&gt;&amp;quot;-Dcom.sun.management.jmxremote.port=43210&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This enables Java Management Extensions (JMX) for the Clojure JVM process, which is what VisualVM uses to profile it. The port &lt;code&gt;43210&lt;/code&gt; is arbitrary; pick whatever you like.&lt;/p&gt;
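&lt;p&gt;Before attaching VisualVM, it can be worth checking that something is actually listening on the JMX port. A minimal Python sketch (the host and port mirror the settings above; &lt;code&gt;jmx_port_open&lt;/code&gt; is a hypothetical helper, and the check is just a plain TCP connect):&lt;/p&gt;

```python
import socket

def jmx_port_open(host="127.0.0.1", port=43210, timeout=2.0):
    """Return True if something accepts TCP connections on the JMX port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(jmx_port_open())
```

&lt;p&gt;If this prints &lt;code&gt;False&lt;/code&gt;, the JVM was probably started without the &lt;code&gt;:jvm-opts&lt;/code&gt; from above.&lt;/p&gt;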

&lt;p&gt;&lt;strong&gt;3a&amp;#x2e;&lt;/strong&gt; If the Clojure application is running on your local host, &lt;a href=&quot;#step6&quot;&gt;continue with step 6&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3b&amp;#x2e;&lt;/strong&gt; If the Clojure application is running on a remote host:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4&amp;#x2e;&lt;/strong&gt; Connect to the remote host and &lt;a href=&quot;http://linux.die.net/man/1/ssh&quot;&gt;open a SOCKS proxy&lt;/a&gt; with ssh:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;bash language-bash&quot; data-lang=&quot;bash&quot;&gt;ssh remote_host -D 9191
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This allows VisualVM to connect to JMX on the remote host. Strangely, just forwarding a single port (with &lt;code&gt;-L&lt;/code&gt;) did not work for me. Apparently VisualVM needs more than one connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5&amp;#x2e;&lt;/strong&gt; Enable the SOCKS proxy in VisualVM&lt;/p&gt;

&lt;p&gt;On Mac OS X, go to &lt;em&gt;VisualVM&lt;/em&gt; → &lt;em&gt;Preferences...&lt;/em&gt; and select the &lt;em&gt;Network&lt;/em&gt; tab.&lt;/p&gt;

&lt;p&gt;Enable &lt;em&gt;Manual proxy settings&lt;/em&gt;, then set the SOCKS Proxy to &lt;code&gt;localhost&lt;/code&gt; with port &lt;code&gt;9191&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Change the &lt;em&gt;No Proxy hosts&lt;/em&gt; to &lt;code&gt;local, *.local, 169.254/16, *.169.254/16, localhost, *.localhost&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is important that &lt;code&gt;127.0.0.1&lt;/code&gt; is not part of that list.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/profiling-clojure/settings.png&quot; alt=&quot;Preferences Screen Shot&quot;&gt;&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;step6&quot;&gt;&lt;strong&gt;6&amp;#x2e;&lt;/strong&gt;&lt;/a&gt; Attach to the running Clojure process:&lt;/p&gt;

&lt;p&gt;Go to &lt;em&gt;File&lt;/em&gt; → &lt;em&gt;Add JMX Connection...&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Enter &lt;code&gt;127.0.0.1:43210&lt;/code&gt; as the &lt;em&gt;Connection&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/profiling-clojure/connect.png&quot; alt=&quot;Add JMX Connection Screen Shot&quot;&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7&amp;#x2e;&lt;/strong&gt; Start profiling&lt;/p&gt;

&lt;p&gt;Select the &lt;em&gt;Sampler&lt;/em&gt; tab and click the &lt;em&gt;CPU&lt;/em&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/profiling-clojure/profile.png&quot; alt=&quot;Profiling Screen Shot&quot;&gt;&lt;/p&gt;

&lt;p&gt;Happy optimizing!&lt;/p&gt;
</content>
        </entry>
	
	    <entry>
            <id>https://torsten.io/stdout/expanding-json-arrays-to-rows</id>
            <link type="text/html" href="https://torsten.io/stdout/expanding-json-arrays-to-rows" rel="alternate"/>
            <updated>2013-12-12T00:00:00-05:00</updated>
			<title>Expanding JSON arrays to rows with SQL on RedShift</title>
            <content type="html">&lt;p&gt;Amazon&amp;#39;s RedShift is a really neat product that solves a lot of our problems at work.  However, its SQL dialect has some limitations when compared to Hive or PostgresSQL. I hit a limit when I needed table-generating functions but found a work-around.&lt;/p&gt;

&lt;p&gt;Some of the data we store in RedShift contains JSON arrays. However, when running analytical queries, there is no out-of-the-box way to join on &amp;quot;nested data&amp;quot; inside of arrays, so until now this data was very hard to use.&lt;/p&gt;

&lt;p&gt;In October, RedShift added new &lt;a href=&quot;http://docs.aws.amazon.com/redshift/latest/dg/json-functions.html&quot;&gt;functions to work with JSON&lt;/a&gt;&lt;a id=&quot;fn-1&quot; title=&quot;How I used to parse JSON in plain SQL before the addition of JSON functions would be another blog post.&quot;&gt;&lt;sup&gt;&amp;thinsp;1&amp;thinsp;&lt;/sup&gt;&lt;/a&gt;, but the set is still missing something like &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-BuiltinTableGeneratingFunctions%28UDTF%29&quot;&gt;Hive&amp;#39;s explode()&lt;/a&gt; or &lt;a href=&quot;http://www.postgresql.org/docs/9.2/static/functions-array.html#ARRAY-FUNCTIONS-TABLE&quot;&gt;Postgres&amp;#39; unnest()&lt;/a&gt; to expand an array from one column into one row per element.&lt;/p&gt;

&lt;p&gt;As a work-around, I came up with a simple hack: Joining the JSON array with a predefined sequence of integers and then extracting the element at each index into a new relation through that join.&lt;/p&gt;
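&lt;p&gt;Outside of SQL, the same join logic can be sketched in a few lines of Python; this is only an illustration of the idea (the table contents match the example below), not RedShift code:&lt;/p&gt;

```python
import json

# Dummy rows mirroring the clusters table: (id, node_sizes JSON string).
clusters = [(1, "[1, 2]"), (2, "[5, 1, 3]"), (3, "[2]")]
seq = range(101)  # stands in for the seq_0_to_100 relation

# Cross join every row with the integer sequence, keeping only indexes
# that fall inside the array, like the JSON_ARRAY_LENGTH guard in SQL.
exploded = [
    (cid, json.loads(arr)[i])
    for cid, arr in clusters
    for i in seq
    if len(json.loads(arr)) > i
]
print(exploded)  # [(1, 1), (1, 2), (2, 5), (2, 1), (2, 3), (3, 2)]
```

&lt;p&gt;Each array element becomes its own &lt;code&gt;(id, element)&lt;/code&gt; row, which is exactly the shape the SQL version below produces.&lt;/p&gt;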

&lt;!--

-- Related blog post to this Gist:
-- https://torsten.io/stdout/expanding-json-arrays-to-rows

-- Run these commands on a interactive RedShift session:

CREATE TEMP TABLE clusters AS (
    SELECT 1 AS id, '[1, 2]' AS node_sizes UNION ALL
    SELECT 2 AS id, '[5, 1, 3]' AS node_sizes UNION ALL
    SELECT 3 AS id, '[2]' AS node_sizes
);


-- This is just a temp table to leave no trace after the session
-- In practice I am using a `CREATE VIEW` and more rows.

CREATE TEMP TABLE seq_0_to_100 AS (
    SELECT 0 AS i UNION ALL
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4 UNION ALL
    SELECT 5
    -- I am stopping here, you could easily generate this as a VIEW with 100 real rows...
);


-- To see the intermediate relation:

SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(node_sizes, seq.i) AS size
FROM clusters, seq_0_to_100 AS seq
WHERE seq.i &lt; JSON_ARRAY_LENGTH(node_sizes)
ORDER BY 1, 2;


-- To fetch the maximum size:

WITH exploded_array AS (
    SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(node_sizes, seq.i) AS size
    FROM clusters, seq_0_to_100 AS seq
    WHERE seq.i &lt; JSON_ARRAY_LENGTH(node_sizes)
  )
SELECT max(size)
FROM exploded_array;

--&gt;

&lt;p&gt;If you want to follow along with the queries and play with the data, I &lt;a href=&quot;https://gist.github.com/torsten/7935120&quot;&gt;created a Gist&lt;/a&gt; with all the queries to create the dummy tables and fill them with test&amp;nbsp;data.&lt;/p&gt;

&lt;p&gt;In this example, I am assuming a table &lt;code&gt;clusters&lt;/code&gt; where each row represents a cluster of &amp;quot;things&amp;quot; and each cluster consists of many nodes, modeled as a JSON array. Each node has its size stored in this array – so you could ask: &lt;em&gt;&amp;quot;What is the maximum node size over all clusters?&amp;quot;&lt;/em&gt;&lt;/p&gt;

&lt;table&gt;&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;node_sizes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&amp;#39;[1, 2]&amp;#39;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&amp;#39;[5, 1, 3]&amp;#39;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&amp;#39;[2]&amp;#39;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;Assuming the above data in the table &lt;code&gt;clusters&lt;/code&gt;, you can use the following SQL query in RedShift to extract the maximum node size from all arrays:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;sql language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exploded_array&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON_EXTRACT_ARRAY_ELEMENT_TEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;size&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_0_to_100&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON_ARRAY_LENGTH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exploded_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The example query above uses a &lt;a href=&quot;http://www.postgresql.org/docs/9.1/static/queries-with.html&quot;&gt;Common Table Expression&lt;/a&gt; to create an intermediate relation &lt;code&gt;exploded_array&lt;/code&gt; which looks like this:&lt;/p&gt;

&lt;table&gt;&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;As you can see, each array was expanded into many rows, but the &lt;code&gt;id&lt;/code&gt; is still the same for each element.&lt;/p&gt;

&lt;p&gt;So where does the magical &lt;code&gt;seq_0_to_100&lt;/code&gt; in the above queries come from?&lt;/p&gt;

&lt;p&gt;Since RedShift currently lacks any kind of sequence-generating function, I had to emulate that as well. For this I created a view like the following:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;sql language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_0_to_100&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- You get the idea...&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;99&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
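&lt;p&gt;Writing the 101 &lt;code&gt;SELECT ... UNION ALL&lt;/code&gt; lines by hand is tedious, so one option is to generate the statement with a small script. A sketch (&lt;code&gt;seq_view_sql&lt;/code&gt; is a hypothetical helper; only the view name comes from the queries above):&lt;/p&gt;

```python
def seq_view_sql(name="seq_0_to_100", upto=100):
    """Generate a CREATE VIEW statement with one row per integer 0..upto."""
    selects = ["SELECT 0 AS i"]
    for n in range(1, upto + 1):
        selects.append("SELECT %d" % n)
    # Join the single-row SELECTs into one big UNION ALL chain.
    return "CREATE VIEW %s AS (\n    %s\n);" % (name, " UNION ALL\n    ".join(selects))

print(seq_view_sql())
```

&lt;p&gt;Run the printed statement once in your RedShift session and the view is ready for the joins above.&lt;/p&gt;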
&lt;p&gt;One constraint of the presented technique is that the maximum length of the longest array has to be known upfront to generate the sequence of integers. In practice this has not been an obstacle for me, since the size can simply be queried in most cases:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;sql language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;JSON_ARRAY_LENGTH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Of course, this example is not all you can do. Once the data is in the shape of &lt;code&gt;exploded_array&lt;/code&gt;, you can work with the resulting relation in any other way and join it with everything else in your data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Thanks to &lt;a href=&quot;https://twitter.com/twojing&quot;&gt;Jenny&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/i0rek&quot;&gt;Hans&lt;/a&gt; for reading drafts of this post!&lt;/small&gt;&lt;/p&gt;
</content>
        </entry>
	
	    <entry>
            <id>https://torsten.io/stdout/patching-pgbouncer</id>
            <link type="text/html" href="https://torsten.io/stdout/patching-pgbouncer" rel="alternate"/>
            <updated>2013-09-17T00:00:00-04:00</updated>
			<title>Patching PgBouncer to Drop Slow Queries</title>
            <content type="html">&lt;p&gt;The other day my friend and colleague Hans was troubleshooting problems with one of our production PostgreSQL databases at &lt;a href=&quot;http://6wunderkinder.com/&quot;&gt;6Wunderkinder&lt;/a&gt;.  The master was under high load and slowed down the whole system.  So we needed to figure out why this happened and fix the problem.&lt;/p&gt;

&lt;p&gt;Our setup includes one master PostgreSQL database and two slaves for reading. These databases are connected to a &lt;a href=&quot;http://wiki.postgresql.org/wiki/PgBouncer&quot;&gt;PgBouncer&lt;/a&gt;, which pools connections and is accessed by our app servers.&lt;/p&gt;

&lt;p&gt;The problem was that something was running a bad query on the master:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;sql language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tasks&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* complex condition */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This query is very problematic for the master because it does not usually handle these kinds of reads with very complex &lt;code&gt;WHERE&lt;/code&gt; clauses; it caused high I/O wait times which slowed down the master noticeably.&lt;/p&gt;

&lt;p&gt;After searching the whole codebase and logs, we could not find any plausible line where this query could come from. Eventually we had to stop and concede that ActiveRecord’s ORM methods (un)fortunately hide most raw SQL queries.&lt;/p&gt;

&lt;p&gt;We tried getting more information from PostgreSQL by running:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;sql language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;procpid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client_addr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current_query&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_stat_activity&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current_query&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;SELECT COUNT(*) FROM tasks%&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The above query can help identify the machine that is running a matching query. However, since we have PgBouncer in our setup, &lt;code&gt;client_addr&lt;/code&gt; always points to the PgBouncer, not to the original client.&lt;/p&gt;

&lt;p&gt;So what’s the solution? Of course: Patching PgBouncer to kill all bad queries. This would trigger exceptions in the Ruby code and would then make the bad code visible in the logs.  Causing a few 500s would also not be a problem because all our clients have offline support.&lt;/p&gt;

&lt;p&gt;So we put on our C-heads, fired up gdb, and dug through the PgBouncer source for a couple of minutes. Eventually we found a neat spot: &lt;code&gt;safe_send()&lt;/code&gt;. This function was called for every packet being forwarded from the client to the PostgreSQL server.&lt;/p&gt;

&lt;p&gt;While this is not a very clean place to patch (it sits in the generic network library), we rewrote all task-count queries into invalid ones just to test our hack:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;c language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;safe_send&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;SELECT COUNT(*) FROM tasks&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;SELECT * FROM 1337;--&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
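&lt;p&gt;The magic &lt;code&gt;buf + 5&lt;/code&gt; in the snippet skips the header of a simple-query message in the PostgreSQL wire protocol: one byte for the message type (&lt;code&gt;Q&lt;/code&gt;) plus four bytes for the length, after which the NUL-terminated query text begins. A Python sketch of such a packet (illustration only, not PgBouncer code):&lt;/p&gt;

```python
import struct

def query_packet(sql):
    """Build a PostgreSQL simple-query ('Q') frontend message."""
    payload = sql.encode() + b"\x00"
    # The length field counts itself (4 bytes) plus the payload,
    # but not the 1-byte message type.
    return b"Q" + struct.pack("!I", 4 + len(payload)) + payload

pkt = query_packet("SELECT COUNT(*) FROM tasks WHERE id = 1;")
print(pkt[5:31])  # prints b'SELECT COUNT(*) FROM tasks'
```

&lt;p&gt;Those 26 bytes starting at offset 5 are exactly what the &lt;code&gt;memcmp&lt;/code&gt; in the patch compares against.&lt;/p&gt;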
&lt;p&gt;And woohoo, this actually worked smooth as butter on our development machine! Rolling it out to production was also low-risk, since PgBouncer has a very useful &lt;a href=&quot;http://pgbouncer.projects.pgfoundry.org/doc/usage.html&quot;&gt;-R&amp;nbsp;flag&lt;/a&gt; to perform an online restart.&lt;/p&gt;

&lt;p&gt;So after some cleanup we ended up with a small patch for &lt;a href=&quot;https://github.com/i0rek/pgbouncer-dev/commit/af2f47035eff6ded27fad61b6db0f0c5eeeb7ca5&quot;&gt;PgBouncer&lt;/a&gt; and &lt;a href=&quot;https://github.com/i0rek/libusual/commit/e911be84a1539dc2aa6212cf2e04554bd42e9391&quot;&gt;LibUsual&lt;/a&gt;. The snippet above is the heart of the patch, but we added more: logging of the killed queries and a config option that allows turning the feature on and off without recompiling or even restarting PgBouncer.&lt;/p&gt;

&lt;p&gt;If you ever end up with untraceable slow queries in PostgreSQL, apply this patch without any guarantees and at your own risk.  And then grep the logs for &lt;code&gt;1337&lt;/code&gt;!&lt;/p&gt;
</content>
        </entry>
	
</feed>
