antirez weblog

Game Over

2012-10-14T23:06:44+00:00

Yesterday I wrote a blog post about sexism. In the blog post I expressed a few of the ideas I have about this matter, as a private citizen should be able to do in his own blog.

The blog post, and my tweets about this subject, received an extreme reaction including tons of rage and insults. Somebody will say: welcome, you understand at 35 that there are people it is not worth having an exchange with? Not at all. I'm not very positive about humanity in general, but even in the worst mailing list thread, or in the hottest flame war on Usenet, I have never seen what I saw yesterday.

Basically 140 chars are a perfect fit for just rage without arguments. In general the non linearity of Twitter made it impossible to have any kind of discussion that made sense even with the sane part of the people involved in the discussion. That's not ok for me, for a number of reasons.

The first is that I believe that the hacking culture is not like that: it is a culture that was traditionally able to deal with different and extreme points of view. In the last months what I saw instead was Linus Torvalds attacked in a brutal way for criticising a GitHub feature, and RMS ridiculed for his ideas on Steve Jobs. In both cases I publicly stated I was extremely shocked by what was happening... really, not everybody that can write 2 lines of JavaScript is supposed to insult RMS or Linus in my opinion.

While I'm not even remotely comparable to Linus or RMS, I'm still an individual that deserves respect for his ideas.

I also love open source, and guess what? It's not a license. It's a process of exchanging ideas, code, and information, freely. In short, I don't want to be part of what I saw yesterday, ever. For me open source is a lot more than a job. For me the ability to express my ideas is more important than smiling to the community and accepting the new rules I'm seeing in place.

As you may guess, not everybody reacted like that. Actually most of my 10000+ followers either said nothing or encouraged me by private email or direct messages. Thank you, I don't want to claim that everybody is like what I saw yesterday. However among the people that overreacted there were also well known figures of the programming community.

So what happens now? I'm done with Twitter. I'm going to close my accounts, and I'll use only the @redisfeed account to provide information to Redis users about what happens with Redis: releases, critical bugs, previews. It will be low traffic, which should make it possible for more people to subscribe to that account.

I'll still write about everything I do with Redis, and about anything else I like or think and want to share with the world, here on my blog. I'll modify the blog code in the next days to make it better for short posts, which will be presented as short messages with a date directly on the front page. I'll study a bit what the best solution is to have an easy to follow blog about development, with small continuous updates and bigger posts from time to time.

I'll close the comments on my blog every time I post about things I want other people to reply to only with another blog post, not with some rage put in the comments.

Please consider that this decision is not based solely on what happened yesterday: this allowed me to reflect on the value and the level of discussion I had on Twitter in three years. I think I can stay more focused on my work, that is to hack on Redis, if I stay away from tweets, as I decided to stay away from IRC some time ago, and as I don't want to have phone calls. It is the logical next step for me.

Note: the problem is not Twitter itself. Some cultural shifting is happening inside the community lately in my opinion.

Btw I'll be more active in the Redis Google Group, and here on my blog of course. You'll probably end up reading a lot more about me, but instead of sometimes hard to understand 140-char tweets, I hope to write in a clearer and more extensive form about Redis and what I'm doing.

Thanks to everybody that supported my work so far, I want to share my work with you, as I enjoy the work that other people are doing for me.

See you here in the next weeks!

A different take on sexism in IT

2012-10-13T13:42:42+00:00

More than ten years ago I started to understand that sexism in IT was not an easy topic. Talking with my female coworkers I discovered they were deeply upset and offended by other women that were too quick to ask for respect using sexism as a flag. At first I was a bit shocked about that, but then I realised how obvious it is.

As a woman you want respect because you are capable and smart. Not because you are a woman.

In the course of my life I started to develop a higher and higher intolerance for topics like political correctness and protection of minorities unless this was clearly put in general terms. If you are a human being you need to be respected because you deserve respect like any other. I don't care if you are black, white, yellow or woman, you are an individual.

Similarly, I will not care who you are if you do something silly at work. Nothing is more offensive for you than me going too easy on you because you are part of some minority. This is, basically, a masked form of reverse-sexism, and is deeply offensive. This is what my female coworkers meant when they were so upset with other women talking about sexism too easily.

In general if there is a problem at the work place between individual A and B, I think it is always an error to talk about sexism, even when the root cause is some asshole not respecting you because you are a woman. Instead the problem should be addressed in a sex-agnostic way. Why is A not paid like B even if they have similar responsibilities and tasks? Why is A not respected by B as she deserves as an individual?

Trying to protect women in tech since they are women is like moving a cultural problem (the sexism) into an individual domain. A woman in tech has nothing less than a male in tech, and as such does not need special care or protection. She needs to be respected as everybody else.

Another naive way to consider the problem is to think that sexism is a state of mind of men. Actually the problem is more complex than that, and a lot of women don't consider themselves or other women as capable as men.

Blog posts about this topic that try to make people aware of sexism or try to send the message "we should be all kind so that women will feel great in our industry" are not the solution, nor is stressing political correctness going to help at all. As proof, in the United States, where political correctness and protection of minorities is a topic always over-discussed, the condition of women is worse than in Northern Europe, where such an obsession does not exist.

It's silly to try to protect all the minorities because they are minorities. We should protect individuals as they have equal dignity, without resorting to sex, race, and other discriminatory attributes.

Every minute spent on discussing gender issues in technology is a wasted minute that should be better spent building free software -- Randi Harper a.k.a. FreeBSD Girl

Reply to an open minded reader

2012-10-08T15:42:52+00:00

Today I was at the hospital for the usual Greta heartbeat trace that is used to monitor the baby's status when the birth is near (all ok btw): my sole escape from the boredom of the hospital and its deep disorganisation was my iPhone, which I was using to read the Twitter timeline for the "Redis" search, when I stumbled upon an article written by @soveran about Redis used as a data store, an open minded reader.

Maybe it's because Michael has followed Redis development since its start and has worked a lot with it (and contributed the redis.io site as well!), as I do, but he really used almost the same words I would use to write his blog post, or at least this was my feeling. What he says is that you can use Redis as a data store (and we have an interesting story about persistence to recount, if safety is your concern), as long as you accept two major tradeoffs.

One is the fact that you are limited to memory size, and for extremely write heavy applications Redis could use as much as 2x memory while persisting, so in certain cases you are limited to 50% of the memory you have available.

The second is even more important, and is about the data model. Redis is not the kind of system where you can insert data and then argue about how to fetch those data in creative ways. Not at all: the whole idea of its data model, and part of the reason it is so fast to retrieve your data, is that you need to think in terms of organising your data for fetching. You need to design with the query patterns in mind. In short, most of the time your data inside Redis is stored in a way that is natural to query for your use case. That's why it is so fast, apart from being in memory: there is no query analysis, optimisation, or data reordering. You are just operating on data structures via primitive capabilities offered by those data structures. End of the story.

However, I would add the following perspective on that.

If you make judicious use of your memory, and exploit the fact that Redis can sometimes do a lot with little (see for instance Redis bit operations), and instead of just having a panic attack about your data growing outside the limits of the known universe you try to do your math and consider how much memory commodity servers have nowadays, you'll discover that there are a ton of use cases where it's ok to be limited by your computer's memory. Often this means tens of millions of active users in one server.

And another point about the data model: remember all those stories about DB denormalisation, and hordes of memcached or Redis farms to cache stuff, and things like that? The reality is that fancy queries are an awesome SQL capability (so incredible that it was hard for all of us to escape this warm and comfortable paradigm), but not at scale. So anyway if your data needs to be composed to be served, you are not in good waters.
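To make the "do your math" step concrete, here is a minimal Python sketch that models a Redis-style bitmap with one bit per user ID. The helpers `setbit` and `bitcount` only imitate the Redis commands of the same name on a local `bytearray`; the 10 million users figure and the user IDs are assumptions made up for the example.

```python
# Local model of a Redis bitmap: one bit per user ID, set when the user was active.
# setbit/bitcount imitate the Redis commands; they are illustrative helpers only.

def setbit(bitmap: bytearray, offset: int, value: int) -> None:
    """Set or clear the bit at `offset` (bit 0 is the most significant bit of byte 0)."""
    byte_index, bit_index = divmod(offset, 8)
    mask = 0x80 >> bit_index
    if value:
        bitmap[byte_index] |= mask
    else:
        bitmap[byte_index] &= ~mask & 0xFF


def bitcount(bitmap: bytearray) -> int:
    """Count the bits set to 1 in the whole bitmap."""
    return bin(int.from_bytes(bitmap, "big")).count("1")


USERS = 10_000_000                        # assumed number of registered users
daily_active = bytearray((USERS + 7) // 8)

for user_id in (7, 42, 9_999_999):        # hypothetical users seen today
    setbit(daily_active, user_id, 1)

print(len(daily_active))                  # bytes needed for one day of activity
print(bitcount(daily_active))             # number of active users recorded
```

At one bit per user, 10 million users fit in 1,250,000 bytes per day, so even a full year of daily bitmaps stays under half a gigabyte.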

If you know what you are doing you can understand if Redis is for you, even as your sole, main, store.

P.s. if you want to watch @soveran speaking live, go to Redis Conf to meet him. There is also our core hacker Pieter Noordhuis that will rewrite Redis in COBOL or whatever you may need if you ask politely, and a lot more great Redis hackers.

Redis 2.6 RC8 is out

2012-10-05T17:33:23+00:00

I just released Redis 2.6.0-RC8, which is likely our last RC release. Plans are to release 2.6.0 before the end of October.

There are a few new things in RC8 that are worth mentioning.

SRANDMEMBER <count>

The first is the SRANDMEMBER command that now accepts an optional argument to return multiple random elements.

If you want non-repeating elements, use a positive count:

redis 127.0.0.1:6379> sadd myset 1 2 3 4 5 6 7 8 9 10
(integer) 10
redis 127.0.0.1:6379> srandmember myset 3
1) "6"
2) "4"
3) "7"

If you want random elements in a "put the extracted element back in the box before next extraction" fashion (so elements may be repeated), just use a negative count.

redis 127.0.0.1:6379> srandmember myset -3
1) "1"
2) "5"
3) "1"

Long story short, the command has a non-trivial implementation to bring you the best performance even in the non-repeating elements case, and even when the number of requested elements is near the number of elements in the set. Hint: sometimes it is much faster to copy the set and remove random elements.
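The semantics of the count argument can be sketched in a few lines of Python. This is only a model over an in-memory set, not the actual C implementation: the function name mirrors the command, and the "more than half of the set" threshold for switching to the copy-and-remove strategy is a hypothetical value chosen for illustration.

```python
import random


def srandmember(members, count, rng=random):
    """Model of SRANDMEMBER with a count, over an in-memory Python set."""
    items = list(set(members))
    if count == 0 or not items:
        return []
    if count < 0:
        # Negative count: sample with replacement, so repeats are allowed.
        return [rng.choice(items) for _ in range(-count)]
    if count >= len(items):
        # Asking for at least the whole set: return every element once.
        return items
    if count * 2 > len(items):
        # Hypothetical threshold: when most of the set is requested, copy it
        # and remove random elements instead of picking them one by one.
        to_remove = set(rng.sample(items, len(items) - count))
        return [item for item in items if item not in to_remove]
    # Few elements requested: pick distinct random elements directly.
    return rng.sample(items, count)


myset = {str(n) for n in range(1, 11)}
print(srandmember(myset, 3))    # three distinct elements
print(srandmember(myset, -3))   # three elements, possibly repeated
```

Removing `len(items) - count` elements is cheaper than drawing `count` distinct ones when the two numbers are far apart, which is the intuition behind the hint above.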

SORT by nosort

Expert Redis users already know that SORT can be used in a special way when we are not interested in sorting anything, but just want to use the accessory capabilities of SORT like its GET option.

In this mode SORT used to shuffle sorted sets at random. Not in RC8, where the order of elements in the sorted set is respected, LIMIT is optimised using an O(log(N)) algorithm to seek to the right index, and DESC and ASC will do the right thing. Unfortunately this is not yet explicitly documented in the SORT man page, but I'll do it in the next days.

New hash function: murmurhash2

The hash table implementation is now using the murmurhash2 hash function together with a randomised seed to prevent attacks. We observed some speed regression with synthetic benchmarks, but in workloads with random access to keys we observed an improvement in performance. Long story short, if there is no proof that this change is a bad idea, it is better to go for the added security and the better distribution of the new hash table, even if sometimes better distribution means worse cache locality, so there are benchmarks where it will always perform poorly compared to djbhash.

On the other hand it is very likely that the new hash function will improve the quality of the distribution of elements returned by SPOP, RANDOMKEY, SRANDMEMBER, and other similar commands.
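To show why a randomised seed helps, here is a Python sketch of the commonly published 32-bit MurmurHash2 algorithm. It is a transcription written for this example, not code taken from the Redis source: the little-endian word reads, the `bucket_for` helper and the modulo-based bucket selection are all assumptions.

```python
import os

MASK32 = 0xFFFFFFFF


def murmurhash2(data: bytes, seed: int) -> int:
    """32-bit MurmurHash2 of `data`, assuming little-endian 4-byte words."""
    m = 0x5BD1E995
    r = 24
    length = len(data)
    h = (seed ^ length) & MASK32

    # Mix the input four bytes at a time.
    full = length - (length % 4)
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = (h * m) & MASK32
        h ^= k

    # Mix the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & MASK32

    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15
    return h


# Seed picked at process start, so bucket placement differs between runs.
SEED = int.from_bytes(os.urandom(4), "little")


def bucket_for(key: bytes, table_size: int, seed: int = SEED) -> int:
    """Hypothetical helper: map a key to a hash table bucket."""
    return murmurhash2(key, seed) % table_size
```

Because the seed is unknown to clients, an attacker cannot precompute a set of keys that all land in the same bucket, which is the kind of attack the randomised seed is meant to prevent.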

More resistance to clock skews

When your system clock dances, it is now a lot less likely that this will result in something odd, even if we still can't claim that Redis is clock skew resistant.

Sentinel is inside 2.6!

This may sound strange, but the thing is, Sentinel is going to be central in future developments, and it lives in a completely separate C file with almost no interaction with the rest of the system. So basically it will be trivial to keep it in sync with the progress made on the Sentinel side in the 2.8 / unstable branches.

Why, you may ask. Because we want people to be able to try Sentinel with minimal effort in the future, in order to enlarge the user base of testers and speed up the process of making the system mature.

In Sentinel land there will be more emphasis in 2.8 on how to integrate Redis with Sentinel, how client libraries should interact with Sentinel, and so forth. We need to make all the components Sentinel-aware to guarantee the best monitoring / failover experience.

Milestones

The plan is to release Redis 2.6.0 before the end of October, so that we can start hacking on the new things, which are:

Redis 2.8, that is, while hacking on unstable we'll try to merge as much as possible into 2.8 and release it. We did this with 2.6 and it worked well, so I'm going to do it again.
Redis Sentinel: we need to improve the system to make it more reliable, documented, and integrated with Redis instances (auto configuration and the like), client libraries, and so forth.
Redis Cluster: once 2.6.0 is stable enough, my plan is to do what I did with Sentinel and focus on Cluster for one or two months full time in order to bring it from alpha to beta stage, so that it is something users can start testing / using, more or less like it is happening now with Sentinel.

More news soon, have a splendid weekend :-)

EDIT: we are looking for Redis talents! VMware has an open position for a great Redis Engineer. If you have solid C and POSIX skills and want to hack on Redis, this job is a good fit! If you prefer, just contact me directly.

An update about Redis 2.6 and Sentinel

2012-08-23T13:49:25+00:00

I'm back to work after two good weeks of vacation, and I hope you enjoyed your time as well during your days of rest. It's a bit hard to start after many days of pause, but at the same time it feels good to work again on Redis 2.6 and Redis Sentinel. This blog post is an update about the state of these two projects.

Redis 2.6 ETA

The time at which Redis 2.6 will ship as a stable release is one of the most frequently asked questions in the Redis community recently. Actually there is no ETA for Redis 2.6 stable because the idea is to do a stable release only once the test coverage improves, and once a few more known bugs are fixed.

You can check the open issues with the 2.6 milestone for a complete list.

About the coverage, that's the current state of affairs.

But the reality is that, as a Redis user, my suggestion is not to care about labels like release candidate or stable. What matters, after all, is just the actual level of stability. And Redis 2.6 will reach a production level of stability before it goes out of release candidate, and I'll make sure to make this information public.

What is production ready?

There are probably many more software engineering books than there should be ;) So despite the general title of this section, my goal here is to state what stable means for me and for the Redis project. After all there are a number of possible metrics, but mine is very simple and in the last three years it worked very reliably.

A Redis branch without active development (major changes to the source code that are not just additions without impacts to what we already have) is stable when the number of critical bugs discovered in the last 8 weeks is near to zero.

In this context a critical bug is a bug that corrupts data, that crashes the server, that creates an inconsistency between master and slave, and that, at the same time, is not triggered by a crazy edge-case, so edge that it is unlikely to run into it involuntarily.

Note that bugs discovered both in the beta release and in the current stable release are not counted here, as they are not specific to the beta release.

Usually in a new Redis beta release there is a moment where bugs are rarely discovered because too few users are actually using the release. At some point users start to switch to the new release more and more, so that the rate of reporting rises. Later we fix enough of the major things that the remaining bugs are hard enough to discover that for months no critical bug is found at all. This is when something can be considered production ready for me.

Is 2.6 production ready?

Redis 2.6 is still not production ready because critical bugs are still found at too high a rate, even if this rate is slowing down consistently. We need more weeks of work.

Also there are critical known issues such as Issue #614 that need to be addressed by changing a few things in the 2.6 internals.

Once I feel that the discovery of a critical bug is as likely in 2.4 as in the current 2.6 release candidate, I'll make sure to inform the community. I hope that this will happen in one or two months at max, as the rate at which bugs are discovered in the 2.6 branch is encouraging, but I can't predict the future, so take this ETA with a grain of salt.

About issue #614

Issue #614 is a pretty interesting affair IMHO, as it shows how an error in the design, that was, not picking the simplest approach that could possibly work, caused a number of issues in the history of the Redis development.

Basically while Redis is a simple system, once you start mixing blocking operations such as BRPOP / BLPOP / BRPOPLPUSH with replication and scripting, things can easily get pretty convoluted.

In the old days we already had blocking operations such as BRPOP. Operations that block the client if there is no data to fetch from the list (as it is empty, aka non existing). Once an element is pushed on the list, the first client waiting in the queue for this list is unblocked, as there is finally data to pick.

However in the past there were only two ways to push elements into a list: LPUSH or RPUSH, but guess what, if a list is empty these two operations are the same, so actually there was conceptually only one way to push elements into a list.

In this simplified world, if you push an element and there is another client waiting for it, you can just pass the element to the client! So it's a non operation from the point of view of the data set, replication link, append only file. What a great premature optimization, eh? So my mistake was to handle blocking operations in a synchronous way: Redis currently unblocks and serves the client waiting for push directly in the context of the execution of the push operation.

But then we introduced BRPOPLPUSH, which can push as a side effect of waiting for an element. Later we also introduced variadic list push operations. Things were not as simple as usual: for instance if there is one client waiting for elements in mylist, and another client does LPUSH mylist a b c d, then we should replicate only LPUSH mylist b c d, as the first element was consumed by the blocking client. Well, that and a zillion other more complex cases, as BRPOPLPUSH can in turn push an element to another list that has clients waiting with blocked operations, and so forth.

Eventually I reworked the core so that everything worked, in the form of a few recursive functions and a more complex replication layer that was able to alter the commands in the replication and AOF link, and that was also able to replicate multiple commands for a single call.

But guess what? This does not work with scripting. We can't rewrite scripts to exclude a few elements from their effects. Ultimately the reality is that my implementation of this stuff sucked.

The fix is conceptually easy, and is just the simplest thing that could possibly work. I simply need to rewrite that code to avoid serving blocking clients in the context of the push operation. On push we'll just mark the keys that had clients waiting for data, and that actually received data. Then once the command returns we can serve those clients, in two easy separate steps. So there is no longer any need to alter commands in the replication link to half-push stuff. We just push everything, so that the effects of the push operation (or script) will be fully reflected in the dataset, in the replication link and in the AOF file. Then there is the pop stage.

Long story short I'll fix this in the next days by rewriting part of the core of Redis 2.6. This will not help stability as it is touching proven (but partially broken) code, but will give us fewer bugs in the future.
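The two-step idea can be modelled with a toy single-threaded server in Python. This is a sketch of the design described above, not Redis code: the class `TinyListServer`, its attributes, and the choice to log the pop stage as a plain RPOP are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class TinyListServer:
    """Toy model: pushes are applied in full, blocked clients are served afterwards."""

    def __init__(self):
        self.lists = defaultdict(deque)      # key -> list values (left is the head)
        self.blocked = defaultdict(deque)    # key -> clients waiting in BRPOP
        self.ready_keys = []                 # keys that got data while clients waited
        self.replication_log = []            # commands as a replica would see them
        self.delivered = []                  # (client, value) pairs served so far

    def brpop(self, client, key):
        """Pop from the tail, or block the client if the list is empty."""
        if self.lists[key]:
            self.replication_log.append(("RPOP", key))
            return self.lists[key].pop()
        self.blocked[key].append(client)
        return None

    def lpush(self, key, *values):
        # Step 1: apply and replicate the whole push, unmodified.
        for value in values:
            self.lists[key].appendleft(value)
        self.replication_log.append(("LPUSH", key) + values)
        if self.blocked[key] and key not in self.ready_keys:
            self.ready_keys.append(key)      # only mark the key, do not serve yet
        # Step 2: the push is complete, now serve the clients that were waiting.
        self._serve_ready_keys()

    def _serve_ready_keys(self):
        while self.ready_keys:
            key = self.ready_keys.pop(0)
            while self.blocked[key] and self.lists[key]:
                client = self.blocked[key].popleft()
                value = self.lists[key].pop()
                self.replication_log.append(("RPOP", key))
                self.delivered.append((client, value))


server = TinyListServer()
server.brpop("client-1", "mylist")           # list is empty: client-1 blocks
server.lpush("mylist", "a", "b", "c", "d")
print(server.delivered)
print(server.replication_log)
```

In this model the replication log always contains the full LPUSH followed by a separate pop, so nothing has to be rewritten into a partial push, and the same approach would apply to pushes performed inside a script.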

Redis Sentinel

I couldn't be happier with Redis Sentinel: it was just an idea a few weeks ago, and now it's a working system. Because Redis Sentinel is completely self-contained, I'm also tempted to merge it into 2.6 once it is stable and tested enough. We need a few more weeks and many more users to check the real degree of stability of Sentinel.

However there are a few more things that should be addressed ASAP, and this is my short term TODO list of things you'll see committed in the next days and weeks:

Support slave priority. Sentinel actually already has this concept internally, but Redis slaves don't publish a priority in INFO output currently. The lower the priority, the more suitable for promotion a slave is. A priority of zero however means: never promote me as a master.
SLAVEOF sentinel://mastername/ip:port,ip:port,ip:port... Now that we have sentinel we can use it to make configuration simpler. This form of SLAVEOF will simply query a Sentinel among the listed ip:port pairs to discover what the master is currently (with the specified name) and replicate with it.
Support for AUTH in Sentinel.
Check for -BUSY and send SCRIPT KILL before going into ODOWN condition.
INFO command for Sentinel.
A number of minor but important changes to the state machine that can improve reliability of Sentinel under unexpected conditions.
Update the documentation.
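The slave priority rule from the first item of the list can be sketched as follows. The dictionary layout and the function name are hypothetical, and since the text does not say how ties are broken, this sketch simply keeps the first slave with the lowest priority.

```python
def pick_promotion_candidate(slaves):
    """Return the slave best suited for promotion, or None if none is eligible.

    Lower priority means more suitable; a priority of zero means
    "never promote me as a master".
    """
    eligible = [slave for slave in slaves if slave["priority"] != 0]
    if not eligible:
        return None
    return min(eligible, key=lambda slave: slave["priority"])


# Hypothetical slaves as a Sentinel might track them.
slaves = [
    {"addr": "10.0.0.2:6379", "priority": 100},
    {"addr": "10.0.0.3:6379", "priority": 0},    # excluded from promotion
    {"addr": "10.0.0.4:6379", "priority": 10},
]
print(pick_promotion_candidate(slaves)["addr"])
```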

Well, I should also start to blog more, even at the cost of getting a few less things done every week. Please, if I don't post an update on Redis every week, ping me on Twitter and tell me I'm a charlatan ;)

Redis Sentinel beta released

2012-07-23T12:56:26+00:00

On June 5th I started writing the first line of code for Redis Sentinel, and after more or less six weeks I'm happy to release the first public beta. This has been a fun programming sprint where I did something a bit unusual in the Redis development history: to focus on just one aspect of Redis for a number of weeks, almost ignoring everything else that was not an important bug report. The reason for following this new methodology was very simple: Redis missed a failover solution in a desperate way, users needed it, VMware encouraged me to go along this path (thanks), so the only possibility to go from zero to something working was to focus on doing just that, for enough time to reach a beta that I'm sure we, as the Redis Community, will be able to push forward to production quality in a very short time.

Before continuing, it's better to say where the code is ;) It's merged into the unstable branch on GitHub. Redis Sentinel in fact is a special execution mode of Redis itself.

But you may wonder what Redis Sentinel is exactly. It is a distributed monitoring system for Redis. On top of the monitoring layer it also implements a notification system with a simple to use API, and an automatic failover solution.

Well, this is a pretty cold description of what Redis Sentinel is. Actually it is a system that also tries to make monitoring fun! In short you have this monitoring unit, the Sentinel. The idea is that this monitoring unit is extremely chatty, it speaks the Redis protocol, and you can ask it many things about how it is seeing the Redis instances it is monitoring, what are the attached slaves, what the other Sentinels monitoring the same system and so forth. Sentinel is designed to interact with other programs a lot.

The other idea is that a Sentinel alone can be used, but is often not enough to ensure that you can do things in a reliable way, so instead you take a number of Sentinels placing them across your network infrastructure. One in this computer, one in another, and so forth.

Sentinels are trivial to configure, this is the Redis Way. Point a Sentinel to a master and it will auto-discover the other Sentinels and the attached slaves. Sentinels will agree with other Sentinels on whether the master should be considered down, according to the quorum you required in the configuration. They'll also select which Sentinel should perform the failover, decide if during the failover some other Sentinel should restart it because the previous one appears to be dead, and so forth.

Sentinel is implemented as a state machine in a completely non blocking environment where the monitoring is performed continuously in the background. Then 10 times every second every Sentinel evaluates what it sees, to take decisions.
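The quorum rule can be pictured with a very small Python sketch. It only models the counting step: the function name, the report format and the Sentinel names are assumptions, and the real agreement between Sentinels involves more state than a single count.

```python
def master_considered_down(reports, quorum):
    """Return True when at least `quorum` Sentinels report the master as down.

    `reports` maps a Sentinel name to True if that Sentinel currently
    sees the master as down.
    """
    down_votes = sum(1 for sees_down in reports.values() if sees_down)
    return down_votes >= quorum


# Hypothetical view collected during one of the periodic evaluations.
reports = {"sentinel-a": True, "sentinel-b": True, "sentinel-c": False}
print(master_considered_down(reports, quorum=2))
print(master_considered_down(reports, quorum=3))
```

With a quorum of 2 a single Sentinel that loses its own network link cannot trigger a failover alone, which is why several Sentinels are placed across the infrastructure.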

The design is conceived so that a Sentinel should do its work following a small set of fixed rules, and these rules should be enough to perform the work in a reasonable way, but at the same time the set of rules is easy enough that you can explain it to a newcomer in five minutes, and easy enough that a system administrator can understand what is going to happen during the failover, and what exactly can trigger it: easy to understand also means easy to predict.

So that's what we have. Is it perfect? I guess it is not, but it is a very good start in my opinion. And I'm sure that with the help of the Redis Community we'll add what is missing and we'll fix what is not good enough. I've already a TODO list of things that I need to improve that's pretty long, but the current implementation is already something that you can try and even use (not in production environments for the first weeks maybe, just to be safe).

So please join the effort :)

Read the documentation, use the Redis Google Group to comment on the design and suggest your ideas, and report the bugs you find in our issues system at GitHub.

Note: you need to use Sentinel with Redis instances compiled from the latest commit of the 2.4, 2.6 or unstable branch. Redis 2.4.16 will be the first stable release with support for Sentinel.

Also note that there is a known bug in the hiredis library that can make Sentinel crash from time to time, but it's not a problem with Sentinel itself, and we are fixing the library in the next days (the issue happens when SUBSCRIBE returns an error for some reason).

Thank you to everybody that helped in the design process, that showed enthusiasm for this work and encouraged me, and to the great guys at VMware that supported me during this time.

Special thanks to Dvir Volk, who tested a few previews of Sentinel and provided very useful feedback.

More updates in the next weeks with new blog posts. Stay tuned!

Sentinel and slaves in INFO output

2012-06-26T16:44:15+00:00

As every Redis user patient enough to follow me on Twitter knows, my focus is all on implementing Redis Sentinel lately, and I'm making good progress. What we have right now is Sentinel implemented as a special mode of the normal Redis server (invoked using redis-server --sentinel or simply renaming the redis-server executable to redis-sentinel), so I'm reusing everything that was inside Redis, which is a big advantage in terms of development time, code reuse, stability.

Sentinel already implements the connection with all the masters and slaves monitored, the ability to ping instances, and to collect information with INFO. You can query Sentinel using redis-cli, but it only accepts a different set of commands, especially the SENTINEL command that can be used to get information about the monitored masters and in general to inspect the Sentinel state.

redis 127.0.0.1:26379> sentinel masters
1) 1) "mymaster"
   2) "127.0.0.1"
   3) "6379"
   4) "om"
   5) "966"
   6) "966"
   7) "0"

I lost almost two days in my schedule because on Sunday I was silly enough to take too much sun in a single day, getting a fever :) But now I'm back to work. So the idea is to have a beta of Sentinel as soon as possible in order to get user feedback, probably in the first days of July (this was supposed to be end of June but probably I'll not be able to do that).

Sentinel is conceived to be easy to set up. Like Redis itself, or Redis replication, I try hard to avoid that users need to read a lot of documentation or to perform boring tasks in order to use the system. One of the key points in making it non boring is that the configuration is easy, and consists mainly of the description of what masters to monitor.

Sentinels auto discover the other sentinels that are monitoring the same master using Redis Pub/Sub. But another important step is to auto discover slaves attached to the master.

Redis already shows attached slaves in INFO output, but guess what, the port number exported was broken, and nobody noticed. Slaves were not listed with their own listening port but with the TCP port they used to connect (as clients) to the Redis master.

In order to fix this I introduced a new command called REPLCONFIG that is used to set some state before starting the replication, so now a slave does something like this when performing the first synchronization with a master:

REPLCONFIG listening-port 6380
SYNC

This means that the new system is backward compatible, as we ignore errors returned by REPLCONFIG. Probably we'll use REPLCONFIG in the future to implement partial replication as well. Imagine something like REPLCONFIG lastoffset ...

This is useful for Sentinel, but also fixes the problem with master's INFO output. This way clients can discover slaves attached to a master, and their exact address and port, just by querying the master.
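The backward compatible handshake can be sketched in Python as follows. Only the REPLCONFIG and SYNC command names come from the text above: the `send_command` callable, the two fake masters and their reply strings are hypothetical stand-ins, relying just on the Redis protocol convention that error replies start with "-".

```python
def start_replication(send_command, listening_port):
    """Announce the slave listening port if possible, then start the sync.

    Returns True when the master accepted REPLCONFIG, False when it replied
    with an error (for instance an older master that lacks the command).
    """
    reply = send_command("REPLCONFIG", "listening-port", str(listening_port))
    port_announced = not reply.startswith("-")
    # Errors from REPLCONFIG are ignored on purpose: replication starts anyway.
    send_command("SYNC")
    return port_announced


def old_master(*args):
    """Fake master that predates REPLCONFIG."""
    if args[0] == "REPLCONFIG":
        return "-ERR unknown command 'REPLCONFIG'"
    return "+OK"    # stand-in reply; a real master would start sending its data


class NewMaster:
    """Fake master that records the listening port announced by each slave."""

    def __init__(self):
        self.slave_ports = []

    def __call__(self, *args):
        if args[0] == "REPLCONFIG" and args[1] == "listening-port":
            self.slave_ports.append(int(args[2]))
        return "+OK"


new_master = NewMaster()
print(start_replication(old_master, 6380))
print(start_replication(new_master, 6380))
print(new_master.slave_ports)
```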

I hope to have more news about Sentinel soon.

The Redis Community Survey

2012-06-13T14:15:47+00:00

I'm a fan of mood-driven development. Among all the stuff I can work on in order to improve Redis, I tend to favor things that for some reason make me more excited than others: because they are interesting to implement, or because I think the user base needs those new features or fixes a lot, or because I would love to have such features as a user.

Mood-driven development is cool because when you do what you think makes sense, what you like, you are, simply, 10 times better at it. Fortunately my mood is not too lunatic, but is instead influenced by one thing more than everything else: the needs of the Redis community. I love that Redis is used more and more, this makes me feel good because there are other programmers that are trying to build things on top of Redis, so my work is not useless.

But look, Redis has many different moods as well: it's used for analytics, as a cache, messaging system, primary storage system, additional storage, and in many other ways. It's not always trivial to figure out how people are using Redis, what are their top priorities. So we started talking about this at VMware and the result is the Redis Community Survey.

It's a survey you can compile in 5 minutes. But your 5 minutes will mean a lot for us, and can provide insights about what to do first, in what directions we can evolve the system to make sure it is more useful for our users. So if you are a Redis user please take a few minutes to compile and submit the survey.

Your help is very appreciated! Thanks in advance.

Redis persistence demystified

2012-03-26T10:08:08+00:00

Part of my work on Redis is reading blog posts, forum messages, and the Twitter timeline for the "Redis" search. It is very important for a developer to have a feeling for what the community of users, and the community of non users, think about the product he is developing. And my feeling is that there is no Redis feature that is as misunderstood as its persistence.

In this blog post I'll make an effort to be truly impartial: no advertising of Redis, no attempt to skip the details that may put Redis in a bad light. All I want is simply to provide a clear, understandable picture of how Redis persistence works, how reliable it is, and how it compares to other database systems.

The OS and the disk

The first thing to consider is what we can expect from a database in terms of durability. In order to do so we can visualize what happens during a simple write operation:

1: The client sends a write command to the database (data is in client's memory).
2: The database receives the write (data is in server's memory).
3: The database calls the system call that writes the data on disk (data is in the kernel's buffer).
4: The operating system transfers the write buffer to the disk controller (data is in the disk cache).
5: The disk controller actually writes the data to a physical medium (a magnetic disk, a NAND chip, ...).

Note: the above is an oversimplification in many ways, because there are more levels of caching and buffers than that.

Step 2 is often implemented as a complex caching system inside the database implementation, and sometimes writes are handled by a different thread or process. However, sooner or later the database will have to write data to disk, and this is what matters from our point of view. That is, data from memory has to be transmitted to the kernel (step 3) at some point.

Another big omission of detail concerns step 3. The reality is more complex, since most advanced kernels implement different layers of caching: usually file system level caching (called the page cache in Linux) and a smaller buffer cache that contains the data waiting to be committed to the disk. Using special APIs it is possible to bypass both (see for instance the O_DIRECT and O_SYNC flags of the open system call on Linux). However from our point of view we can consider this as a single layer of opaque caching (that is, we don't know the details). It is enough to say that the page cache is often disabled when the database already implements its own caching, to avoid the database and the kernel both trying to do the same work at the same time (with bad results). The buffer cache is usually never turned off, because then every write to the file would result in data committed to the disk, which is too slow for almost all applications.

What databases usually do instead is to invoke system calls that commit the buffer cache to the disk only when absolutely needed, as we'll see in more detail later.

When is our write safe along the line?

If we consider a failure that involves just the database software (the process gets killed by the system administrator or crashes) and does not touch the kernel, the write can be considered safe just after step 3 completes successfully, that is, after the write(2) system call (or any other system call used to transfer data to the kernel) returns successfully. After this step, even if the database process crashes, the kernel will still take care of transferring data to the disk controller.

If we consider instead a more catastrophic event like a power outage, we are safe only at step 5 completion, that is, when data is actually transferred to the physical device storing it.

We can summarize that the important stages for data safety are steps 3, 4, and 5. That is:

How often will the database software transfer its user-space buffers to the kernel buffers using the write (or equivalent) system call?
How often will the kernel flush the buffers to the disk controller?
And how often will the disk controller write data to the physical media?

Note: when we talk about disk controller we actually mean the caching performed by the controller or the disk itself. In environments where durability is important system administrators usually disable this layer of caching.

On most systems disk controllers by default only perform write-through caching (i.e. only reads are cached, not writes). It is safe to enable the write-back mode (caching of writes) only when you have batteries or a super-capacitor device protecting the data in case of power loss.

POSIX API

From the point of view of the database developer the path that the data follows before being actually written to the physical device is interesting, but even more interesting is the amount of control the programming API provides along the path.

Let's start from step 3. We can use the write system call to transfer data to the kernel buffers, so from this point of view we have good control using the POSIX API. However we don't have much control over how much time this system call will take before returning successfully. The kernel write buffer is limited in size: if the disk is not able to cope with the application write bandwidth, the kernel write buffer will reach its maximum size and the kernel will block our write. When the disk is able to receive more data, the write system call will finally return. After all, the goal is to eventually reach the physical media.

Step 4: in this step the kernel transfers data to the disk controller. By default it will try to avoid doing it too often, because it is faster to transfer it in bigger pieces. For instance Linux by default will actually commit writes after 30 seconds. This means that if there is a failure, all the data written in the last 30 seconds can potentially be lost.

The POSIX API provides a family of system calls to force the kernel to write buffers to the disk: the most famous of the family is probably the fsync system call (see also msync and fdatasync for more information). Using fsync the database system has a way to force the kernel to actually commit data on disk, but as you can guess this is a very expensive operation: fsync will initiate a write operation every time it is called and there is some data pending in the kernel buffer. fsync also blocks the process for all the time needed to complete the write, and if this is not enough, on Linux it will also block all the other threads that are writing against the same file.
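
To make steps 3 and 4 concrete, here is a minimal sketch in Python, whose os module is a thin wrapper around write(2) and fsync(2). The function name durable_append and its sync parameter are made up for this illustration; this is not code from any database.

```python
import os

def durable_append(path, payload, sync=True):
    """Append payload (bytes) to path, optionally forcing it to disk."""
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # Step 3: write(2) copies data into the kernel buffers.
            # It may accept only part of the data, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        if sync:
            # Step 4: fsync(2) blocks until the kernel has handed the
            # pending data for this file to the disk.
            os.fsync(fd)
    finally:
        os.close(fd)
```

Calling durable_append(path, data, sync=False) stops after step 3, which is enough to survive a crash of the process but not a power outage.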

What we can't control

So far we learned that we can control steps 3 and 4, but what about 5? Well, formally speaking we don't have control over it using the POSIX API. Maybe some kernel implementation will try to tell the drive to actually commit data on the physical media, or maybe the controller will instead re-order writes for the sake of speed, and will not really write data on disk ASAP, but will wait a couple of milliseconds more. This is simply out of our control.

In the rest of this article we'll thus simplify our scenario to two data safety levels:

Data written to kernel buffers using the write(2) system call (or equivalent) that gives us data safety against process failure.
Data committed to the disk using the fsync(2) system call (or equivalent) that gives us, virtually, data safety against complete system failure like a power outage. We actually know that there is no guarantee because of the possible disk controller caching, but we'll not consider this aspect because this is an invariant among all the common database systems. Also system administrators often can use specialized tools in order to control the exact behavior of the physical device.

Note: not all the databases use the POSIX API. Some proprietary databases use a kernel module that allows a more direct interaction with the hardware. However the main shape of the problem remains the same. You can use user-space buffers and kernel buffers, but sooner or later you have to write data to disk to make it safe (and this is a slow operation). A notable example of a database using a kernel module is Oracle.

Data corruption

In the previous paragraphs we analyzed the problem of ensuring data is actually transferred to the disk by the higher level layers: the application and the kernel. However this is not the only concern about durability. Another one is the following: is the database still readable after a catastrophic event, or can its internal structure get corrupted in some way so that it may no longer be read correctly, or requires a recovery step in order to reconstruct a valid representation of data?

For instance many SQL and NoSQL databases implement some form of on-disk tree data structure that is used to store data and indexes. This data structure is manipulated on writes. If the system stops working in the middle of a write operation, is the tree representation still valid?

In general there are three levels of safety against data corruption:

Databases that write to the disk representation not caring about what happens in the event of failure, asking the user to use a replica for data recovery, and/or providing tools that will try to reconstruct a valid representation if possible.
Database systems that use a log of operations (a journal) so that they'll be able to recover to a consistent state after a failure.
Database systems that never modify already written data, but only work in append only mode, so that no corruption is possible.

Now we have all the elements we need to evaluate a database system in terms of reliability of its persistence layer. It's time to check how Redis scores in this regard. Redis provides two different persistence options, so we'll examine both one after the other.

Snapshotting

Redis snapshotting is the simplest Redis persistence mode. It produces point-in-time snapshots of the dataset when specific conditions are met, for instance if the previous snapshot was created more than 2 minutes ago and there are already at least 100 new writes, a new snapshot is created. These conditions can be controlled by the user configuring the Redis instance, and can also be modified at runtime without restarting the server. Snapshots are produced as a single .rdb file that contains the whole dataset.
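
The save point condition can be sketched as a small Python function. This is only an illustration of the rule described above, not the Redis implementation; the function name and the representation of save points as (seconds, changes) pairs are assumptions.

```python
def should_snapshot(save_points, seconds_since_last_save, changes_since_last_save):
    """Return True if at least one (seconds, changes) save point is satisfied."""
    return any(
        seconds_since_last_save > seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )

# The example from the text: more than 2 minutes and at least 100 writes.
EXAMPLE_SAVE_POINTS = [(120, 100)]
```

With these save points, 121 seconds and 100 writes trigger a snapshot, while 121 seconds and 99 writes do not.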

The durability of snapshotting is limited to what the user specified as save points. If the dataset is saved every 15 minutes, then in the event of a Redis instance crash or a more catastrophic event, up to 15 minutes of writes can be lost. From the point of view of Redis transactions snapshotting guarantees that either a MULTI/EXEC transaction is fully written into a snapshot, or it is not present at all (as already said RDB files represent exactly point in time images of the dataset).

The RDB file cannot get corrupted, because it is produced by a child process in an append-only way, starting from the image of data in the Redis memory. A new rdb snapshot is created as a temporary file, and gets renamed into the destination file using the atomic rename(2) system call once it has been successfully generated by the child process (and only after it gets synced to disk using the fsync system call).
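
The temporary file, fsync, rename sequence is a general pattern, and a minimal Python sketch of it looks like the following. The function name and the temporary file naming are made up, and the sketch leaves out the child process entirely; it only shows why readers never see a half-written file.

```python
import os
import tempfile

def write_snapshot(path, data):
    """Replace the file at path with data (bytes) without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system as the destination,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(prefix="temp-", suffix=".rdb", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> kernel buffers
            os.fsync(tmp.fileno())  # kernel buffers -> disk
        # os.replace maps to rename(2) on POSIX systems: the destination
        # is either the old snapshot or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```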

Redis snapshotting does NOT provide good durability guarantees if up to a few minutes of data loss is not acceptable in case of incidents, so its usage is limited to applications and environments where losing recent data is not very important.

However even when using the more advanced persistence mode that Redis provides, called "AOF", it is still advisable to also turn snapshotting on, because having a single compact RDB file containing the whole dataset is extremely useful to perform backups, to send data to remote data centers for disaster recovery, or to easily roll back to an old version of the dataset in the event of a dramatic software bug that compromised the content of the database in a serious way.

It's worth noting that RDB snapshots are also used by Redis when performing a master -> slave synchronization.

One of the additional benefits of RDB is the fact that, for a given database size, the number of I/Os on the system is bounded, whatever the activity on the database is. This is a property that most traditional database systems (and the other Redis persistence mode, the AOF) do not have.

Append only file

The Append Only File, usually called simply AOF, is the main Redis persistence option. The way it works is extremely simple: every time a write operation that modifies the dataset in memory is performed, the operation gets logged. The log is produced exactly in the same format used by clients to communicate with Redis, so the AOF can be even piped via netcat to another instance, or easily parsed if needed. At restart Redis re-plays all the operations to reconstruct the dataset.

To show how the AOF works in practice I'll do a simple experiment, setting up a new Redis 2.6 instance with append only file enabled:

./redis-server --appendonly yes

Now it's time to send a few write commands to the instance:

redis 127.0.0.1:6379> set key1 Hello
OK
redis 127.0.0.1:6379> append key1 " World!"
(integer) 12
redis 127.0.0.1:6379> del key1
(integer) 1
redis 127.0.0.1:6379> del non_existing_key
(integer) 0

The first three operations actually modified the dataset, the fourth did not, because there was no key with the specified name. This is what our append only file looks like:

$ cat appendonly.aof 
*2
$6
SELECT
$1
0
*3
$3
set
$4
key1
$5
Hello
*3
$6
append
$4
key1
$7
 World!
*2
$3
del
$4
key1

As you can see the final DEL is missing, because it did not produce any modification to the dataset.
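
Since the file uses the same format as the client protocol, reading it back takes only a few lines. The following Python sketch parses the "*<argument count>" and "$<argument length>" framing shown above, assuming that lines are terminated by CRLF as in the Redis protocol. The function name is made up, and this is not the loader Redis itself uses.

```python
import io

def read_aof_commands(stream):
    """Yield every command found in a binary AOF stream as a list of bytes arguments."""
    while True:
        header = stream.readline()
        if not header:
            return  # clean end of file
        if not header.startswith(b"*"):
            raise ValueError("expected '*<argc>', got %r" % header)
        argc = int(header[1:].strip())
        args = []
        for _ in range(argc):
            size_line = stream.readline()
            if not size_line.startswith(b"$"):
                raise ValueError("expected '$<length>', got %r" % size_line)
            size = int(size_line[1:].strip())
            # Arguments are read by length, so they can contain any byte.
            payload = stream.read(size + 2)  # argument plus trailing CRLF
            if len(payload) != size + 2:
                raise ValueError("truncated argument")
            args.append(payload[:size])
        yield args

# The same content as the appendonly.aof shown above.
SAMPLE = (
    b"*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n"
    b"*3\r\n$3\r\nset\r\n$4\r\nkey1\r\n$5\r\nHello\r\n"
    b"*3\r\n$6\r\nappend\r\n$4\r\nkey1\r\n$7\r\n World!\r\n"
    b"*2\r\n$3\r\ndel\r\n$4\r\nkey1\r\n"
)

commands = list(read_aof_commands(io.BytesIO(SAMPLE)))
```

Here commands holds four entries, one per logged operation, starting with the SELECT and ending with the DEL of key1.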

It is as simple as that, new commands received will get logged into the AOF, but only if they have some effect on actual data. However not all the commands are logged as they are received. For instance blocking operations on lists are logged for their final effects as normal non blocking commands. Similarly INCRBYFLOAT is logged as SET, using the final value after the increment as payload, so that differences in the way floating points are handled by different architectures will not lead to different results after reloading an AOF file.

So far we know that the Redis AOF is an append only business, so no corruption is possible. However this desirable feature can also be a problem: in the above example after the DEL operation our instance is completely empty, still the AOF is a few bytes worth of data. The AOF is an always growing file, so how to deal with it when it gets too big?

AOF rewrite

When an AOF is too big Redis will simply rewrite it from scratch in a temporary file. The rewrite is NOT performed by reading the old one, but directly accessing data in memory, so that Redis can create the shortest AOF that is possible to generate, and will not require read disk access while writing the new one.

Once the rewrite is complete, the temporary file is synced to disk with fsync and is used to replace the old AOF file.

You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.
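
The interplay between the old AOF, the new AOF and the in-memory buffer can be modeled with plain Python lists standing in for files. This is a toy model under strong assumptions (only SET commands, no child process, no real I/O), and all the names are invented for the example.

```python
class ToyAOF:
    """In-memory model of an append-only log that can be rewritten."""

    def __init__(self):
        self.data = {}              # the dataset as it lives in memory
        self.aof = []               # the current (old) AOF
        self.rewrite_buffer = None  # a list only while a rewrite is running
        self._new_aof = None

    def set(self, key, value):
        self.data[key] = value
        command = ("SET", key, value)
        self.aof.append(command)  # always logged to the current AOF
        if self.rewrite_buffer is not None:
            self.rewrite_buffer.append(command)  # and queued for the new one

    def start_rewrite(self):
        # The new AOF is built from memory, not by reading the old file.
        self._new_aof = [("SET", k, v) for k, v in self.data.items()]
        self.rewrite_buffer = []

    def finish_rewrite(self):
        # Append what arrived in the meantime, then swap the files.
        self._new_aof.extend(self.rewrite_buffer)
        self.aof = self._new_aof
        self._new_aof = None
        self.rewrite_buffer = None


def replay(aof):
    """Rebuild a dataset by re-playing the logged commands."""
    data = {}
    for _, key, value in aof:
        data[key] = value
    return data
```

Whatever the order of writes and rewrites, replay(toy.aof) should always give back toy.data, and after a rewrite the log contains one SET per key plus the commands received while the rewrite was running.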

As you can see everything is still append only, and when we rewrite the AOF we still write everything inside the old AOF file, for all the time needed for the new one to be created. This means that for our analysis we can simply avoid considering the fact that the AOF in Redis gets rewritten at all. So the real question is: how often do we call write(2), and how often do we call fsync(2)?

AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshots generation. The complete lack of Random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for a sequential loading on restart.

AOF durability

This whole article was written to reach this paragraph. I'm glad I'm here, and I'm even more glad you are still here with me.

The Redis AOF uses a user-space buffer that is populated with new data as new commands are executed. The buffer is usually flushed to disk every time we return back into the event loop, using a single write(2) call against the AOF file descriptor, but actually there are three different configurations that will change the exact behavior of write(2), and especially, of fsync(2) calls.

These three configurations are controlled by the appendfsync configuration directive, that can have three different values: no, everysec, always. This configuration can also be queried or modified at runtime using the CONFIG SET command, so you can alter it any time you want without stopping the Redis instance.

appendfsync no

In this configuration Redis does not perform fsync(2) calls at all. However it will make sure that clients not using pipelining, that is, clients that wait to receive the reply of a command before sending the next one, will receive an acknowledgment that the command was executed correctly only after the change is transferred to the kernel by writing the command to the AOF file descriptor, using the write(2) system call.

Because in this configuration fsync(2) is not called at all, data will be committed to disk at kernel's wish, that is, every 30 seconds in most Linux systems.

appendfsync everysec

In this configuration data will be both written to the file using write(2) and flushed from the kernel to the disk using fsync(2) one time every second. Usually the write(2) call will actually be performed every time we return to the event loop, but this is not guaranteed.

However if the disk can't cope with the write speed, and the background fsync(2) call is taking longer than 1 second, Redis may delay the write up to an additional second (in order to avoid the write blocking the main thread because of an fsync(2) running in the background thread against the same file descriptor). If a total of two seconds elapses without fsync(2) being able to complete, Redis finally performs a (likely blocking) write(2) to transfer data to the disk at any cost.

So in this mode Redis guarantees that, in the worst case, within 2 seconds everything you write is going to be committed to the operating system buffers and transferred to the disk. In the average case data will be committed every second.
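
The postponing rule can be written down as a tiny decision function. This is a sketch of the behavior as described here, not a copy of the Redis source: the function name, the action labels and the way the "postponed since" time is passed around are all assumptions.

```python
MAX_POSTPONE_SECONDS = 2.0  # the two seconds worst case described in the text

def flush_decision(now, fsync_in_progress, postponed_since):
    """Decide what to do with the AOF buffer at time now (in seconds).

    Returns (action, postponed_since), where action is one of
    "write", "postpone" or "blocking-write".
    """
    if not fsync_in_progress:
        return "write", None  # normal case: write(2) right away
    if postponed_since is None:
        return "postpone", now  # a background fsync is busy: start waiting
    if now - postponed_since < MAX_POSTPONE_SECONDS:
        return "postpone", postponed_since  # keep waiting for the fsync
    # Waited long enough: write anyway, even if the call may block.
    return "blocking-write", None
```

The caller is expected to pass back the postponed_since value it received from the previous call, so the two seconds are counted from the first postponed flush.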

appendfsync always

In this mode, and if the client does not use pipelining but waits for the replies before issuing new commands, data is both written to the file and synced to disk using fsync(2) before an acknowledgment is returned to the client.

This is the highest level of durability that you can get, but it is slower than the other modes.

The default Redis configuration is appendfsync everysec, which provides a good balance between speed (it is almost as fast as appendfsync no) and durability.

What Redis implements when appendfsync is set to always is usually called group commit. This means that instead of using an fsync call for every write operation performed, Redis is able to group these commits in a single write+fsync operation performed before sending the reply to the group of clients that issued a write operation during the latest event loop iteration.

In practical terms it means that you can have hundreds of clients performing write operations at the same time: the fsync operations will be factorized - so even in this mode Redis should be able to support a thousand concurrent transactions per second while a rotational device can only sustain 100-200 write op/s.

This feature is usually hard to implement in a traditional database, but Redis makes it remarkably simpler.
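
The idea of group commit fits in a few lines of Python. This is a sketch, assuming the writes collected during one event loop iteration are available as a list of (client id, payload) pairs; the function name and this representation are made up for the example.

```python
import os

def group_commit(fd, pending):
    """Persist all pending writes with one write(2) loop and a single fsync(2).

    pending is a list of (client_id, payload) pairs; the return value is the
    list of client ids that can now receive their acknowledgment.
    """
    if not pending:
        return []
    buffer = b"".join(payload for _, payload in pending)
    view = memoryview(buffer)
    while view:
        view = view[os.write(fd, view):]  # write(2) may be partial
    os.fsync(fd)  # one fsync covers every client in the group
    return [client_id for client_id, _ in pending]
```

The cost of the fsync is paid once per iteration and not once per client, which is why the number of acknowledged writes per second can be much higher than the number of fsync calls the disk can sustain.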

Why is pipelining different?

The reason for handling clients using pipelining in a different way is that clients using pipelining with writes are sacrificing the ability to read what happened with a given command, before executing the next one, in order to gain speed. There is no point in committing data before replying to a client that seems not interested in the replies before going forward, the client is asking for speed. However even if a client is using pipelining, writes and fsyncs (depending on the configuration) always happen when returning to the event loop.

AOF and Redis transactions

AOF guarantees correct MULTI/EXEC transaction semantics, and will refuse to reload a file that contains a broken transaction at the end of the file. A utility shipped with the Redis server can trim the AOF file to remove the partial transaction at the end.

Note: since the AOF file is populated using a single write(2) call at the end of every event loop iteration, an incomplete transaction can only appear if the disk where the AOF resides gets full while Redis is writing.

Comparison with PostgreSQL

So how durable is Redis, with its main persistence engine (AOF) in its default configuration?

Worst case: It guarantees that write(2) and fsync(2) are performed within two seconds.
Normal case: it performs write(2) before replying to client, and performs an fsync(2) every second.

What is interesting is that in this mode Redis is still extremely fast, for a few reasons. One is that fsync is performed on a background thread, the other is that Redis only writes in append only mode, which is a big advantage.

However if you need maximum data safety and your write load is not high, you can still have the best durability that is possible to obtain in any database system using appendfsync always.

How does this compare to PostgreSQL, which is (with good reasons) considered a good and very reliable database?

Let's read some PostgreSQL documentation together (note, I'm only citing the interesting pieces, you can find the full documentation here in the PostgreSQL official site)

fsync (boolean)

If this parameter is on, the PostgreSQL server will try to make sure that updates are physically written to disk, by issuing fsync() system calls or various equivalent methods (see wal_sync_method). This ensures that the database cluster can recover to a consistent state after an operating system or hardware crash.

[snip]

In many situations, turning off synchronous_commit for noncritical transactions can provide much of the potential performance benefit of turning off fsync, without the attendant risks of data corruption.

So PostgreSQL needs to fsync data in order to avoid corruptions. Fortunately with Redis AOF we don't have this problem at all, no corruption is possible. So let's check the next parameter, that is the one that more closely compares with Redis fsync policy, even if the name is different:

synchronous_commit (enum)

Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a "success" indication to the client. Valid values are on, local, and off. The default, and safe, value is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly.

Here we have something very similar to what we can tune with Redis. Basically the PostgreSQL guys are telling you: want speed? Probably it is a good idea to disable synchronous commits. That's like in Redis: want speed? Don't use appendfsync always.

Now if you disable synchronous commits in PostgreSQL you are in a situation very similar to Redis with appendfsync everysec, because by default wal_writer_delay is set to 200 milliseconds, and the documentation states that you need to multiply it by three to get the actual delay of writes, that is thus 600 milliseconds, very near to the 1 second Redis default.

MySQL InnoDB has similar parameters the user can tune. From the documentation:

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues.

You can read more here.

Long story short: even if Redis is an in memory database it offers good durability compared to other on disk databases.

From a more practical point of view Redis provides both AOF and RDB snapshots, that can be enabled simultaneously (this is the advised setup, when in doubt), offering at the same time ease of operations and data durability.

Everything we said about Redis durability applies not only when Redis is used as a datastore but also when it is used to implement queues that need to persist on disk with good durability.

Credits

Didier Spezia provided very useful ideas and insights for this blog post. The topic is huge and I'm sure I overlooked a lot of things, but surely thanks to Didier the current post is much better compared to the first draft.

Addendum: a note about restart time

I received a few requests about adding some information about restart time, since when a Redis instance is stopped and gets restarted it has to read the dataset from disk into memory. I think it is a good addition, because there are differences between RDB and AOF persistence, and between Redis 2.6 and Redis 2.4. Also it is interesting to see how Redis compares with PostgreSQL and MySQL in this regard.

First of all it's worth mentioning why Redis needs to load the whole dataset in memory before starting to serve requests to clients: the reason is not, strictly speaking, that it is an in-memory DB. It is conceivable that a database that is in memory, but uses the same representation of data in memory and on disk, could start serving data ASAP.

Actually the true reason is that we optimized the different representations for the different scopes they serve: on disk we have a compact append-only representation that is not suitable for random access. In memory we have the best possible representation for fast data fetching and modification. But this forces us to perform a conversion step on loading. Redis reads keys one after the other on disk, and encodes the same keys and associated values using the in-memory representation.

With an RDB file this process is very fast for a few reasons: the first is that RDB files are usually more compact, binary, and sometimes even encode values in the same format they are in memory (this happens for small aggregate data types that are encoded as ziplists or intsets).

CPU and disk speed will make a big difference, but as a general rule you can think that a Redis server will load an RDB file at the rate of 10 ~ 20 seconds per gigabyte of memory used, so loading a dataset composed of tens of gigabytes can take even a few minutes.

Loading an AOF file that was just rewritten by the server takes something like twice as long per gigabyte in Redis 2.6, but of course if a lot of writes reached the AOF file after the latest compaction it can take longer (however Redis in the default configuration triggers a rewrite automatically if the AOF size reaches 200% of the initial size).
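
As a back-of-the-envelope helper, the rule of thumb above can be turned into a small Python function. The numbers are the rough figures from the text (10 to 20 seconds per gigabyte for RDB, about twice that for a freshly rewritten AOF), not measurements, and the function name is made up.

```python
RDB_SECONDS_PER_GB = (10.0, 20.0)      # rough range from the text
SLOWDOWN = {"rdb": 1.0, "aof": 2.0}    # a just-rewritten AOF takes about twice as long

def estimated_load_seconds(dataset_gb, fmt="rdb"):
    """Return a (low, high) estimate of the loading time in seconds."""
    low, high = RDB_SECONDS_PER_GB
    factor = SLOWDOWN[fmt]
    return dataset_gb * low * factor, dataset_gb * high * factor
```

For a 30 GB dataset this gives between 300 and 600 seconds for an RDB file, that is, 5 to 10 minutes, and twice as much for an AOF.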

Restarting an instance is usually not needed: even in a setup with a single server it is a better idea to use replication in order to transfer control to the new Redis instance without service interruption. For instance in the case of an upgrade to a newer Redis version usually the system administrator will set up the Redis instance running the new version as a slave of the old instance, then will point all the clients to the new instance, will turn this instance into a master, and will finally shut down the old one.

What about traditional on disk databases? They don't need to load data in memory... or maybe yes? Well basically they do a better job than Redis in this regard, because when you start a MySQL server it is able to serve requests from the first second. However if the database and index files are no longer in the operating system cache, what is happening is a cold restart. In this case the database will work from the start, but will be very slow and may not be able to cope with the speed at which the application is requesting data. I saw this happening multiple times first-hand.

What is happening in a cold restart is that the database is actually reading data from disk to memory, very similarly to what Redis does, but incrementally.

Long story short: Redis requires some time to restart if the dataset is big. On disk databases are better in this regard, but you can't expect that they'll perform well in the case of a cold restart, and if they are under load it is easy to see a condition where the whole application is actually blocked for several minutes. On the other hand once Redis starts, it starts at full speed.

Edit: want to learn more?

This article at sqlite.org about atomic commits is very good.
What every programmer should know about disks
Disks lie

Redis reliable queues with Lua scripting

2012-03-17T20:15:02+00:00

Redis 2.6 support for Lua scripting opens a lot of possibilities, basically because you can do atomically a lot of things that before came with a big performance hit. Now the price is much cheaper, so why not abuse our power?

Even more important is that, before scripting, in order to turn non atomic primitives into atomic primitives you needed the help of the Redis WATCH command, that is a check and set style primitive. Since this is a form of optimistic locking, when there is high contention, like in the example of a queue with multiple workers (with many clients accessing a single key with WATCH), performance may be pretty bad.

In this blog post I want to show a pattern based on the scripting capability that can be used to implement reliable queues.

Circular queue

In our system there are only two players: Producers and Consumers, and we only push IDs. It's up to the consumer to agree with the producer about what these IDs really mean, similarly to Michel Martens' Ost library.

An item is in processing state if a client is already processing it but has not yet finished.

Everything is based on the idea that tasks are never removed from the list, unless they were actually processed. But instead of using a service list to hold the tasks that are in the processing state, we use a single list for everything.

Producer

From the point of view of the producer, if there is the new object ID 123 that needs to be processed by a worker (consumer), only one operation is performed:

LPUSH queue 123

So we add the item at the head of the list. Items at the head will be processed last by workers, since consumers take items from the tail, so this queue is First In First Out.

Consumer

The interesting part is what the consumer does to access an item in the queue. It runs a Lua script that does the following:

Get the element on the tail of the list, for instance 45.
Put the same element on the head of the list, but followed by a trailing asterisk to signal that the item is currently being processed, followed by the unix time (passed by the client to the scripting engine). So in the end we get "45" from the tail, and we put "45*<unixtime>" to the head.
Return the element to the client (45 in this case).

If the element currently on the tail was already followed by an asterisk and unix time, the script does not add an additional asterisk and unix time: it is simply moved to the head and returned to the client, including the asterisk and the timestamp.

So the client calling this script will either receive 45 (or any other ID actually), or an ID followed by a unix timestamp like 45*1332014784.
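
The steps of the consumer script can be modeled in a few lines. What follows is only an in-memory Python sketch of the logic, not the actual Lua script: the real thing would run inside Redis so that the whole rotation is atomic. The function name fetch_next and the use of a deque (left end = head of the list, right end = tail) are assumptions made for illustration.

```python
from collections import deque

def fetch_next(queue, now):
    """Model of the consumer script: rotate the tail element to the head.

    `queue` is a deque whose left end is the list head (where LPUSH adds)
    and whose right end is the tail. Returns None when the queue is empty.
    """
    if not queue:
        return None
    item = queue.pop()                  # get the element on the tail
    if "*" in item:
        # Already in processing state: move it to the head unchanged.
        queue.appendleft(item)
    else:
        # Fresh item: mark it with an asterisk and the client-supplied time.
        queue.appendleft(f"{item}*{now}")
    return item

queue = deque()
queue.appendleft("45")     # LPUSH queue 45
queue.appendleft("123")    # LPUSH queue 123

first = fetch_next(queue, 1332014784)    # fresh ID from the tail
second = fetch_next(queue, 1332014790)   # next fresh ID
third = fetch_next(queue, 1332014795)    # full rotation: a stamped item comes back
```

After a full rotation the consumer starts seeing the stamped entries again, which is what lets it detect jobs that were never completed.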

What the consumer does with the returned value

If the item is in processing state but is still young enough (no timeout) it is discarded and the script is called again to fetch the next ID.

Otherwise, if the item timed out, the consumer will check if the item was actually processed or not by the original client, in an application-specific way, and will remove it from the queue if needed. Otherwise the client will call another script that atomically removes the old item and adds a new one with the new timestamp. And of course it will start processing it.

When an item was processed successfully it gets removed from the queue using LREM.
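
The decision a consumer takes on the returned value can be sketched as a small function. This is just an illustration in Python under a few assumptions: the name classify, the three action labels, and the choice of treating an item as "young enough" when its age is strictly less than the timeout are all mine, not part of the pattern itself.

```python
def classify(value, now, timeout):
    """Decide what a consumer should do with a value returned by the script.

    Returns (item_id, action) where action is one of:
      "process" - fresh item, start working on it
      "skip"    - somebody else is processing it and it did not time out
      "check"   - timed out: verify whether it was processed, then remove or retry
    """
    item_id, sep, stamp = value.partition("*")
    if not sep:
        return item_id, "process"
    if now - int(stamp) < timeout:
        return item_id, "skip"
    return item_id, "check"

fresh = classify("45", now=1332014800, timeout=30)
young = classify("45*1332014784", now=1332014800, timeout=30)
stale = classify("45*1332014784", now=1332014900, timeout=30)
```

Note that the element stored in the list is the stamped string (like 45*1332014784), so that is the value a client would have to pass to LREM when the job is done.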

Advantages

The advantage of this system, that may actually be modeled in many different ways, is that you have a rotating list full of jobs to process or currently being processed. There is no way for a job to be lost. Also clients will receive jobs that are still being processed every time a full run of the list is performed, so these jobs will be activated again if needed, but will still remain in the list forever as long as no one is able to complete them.

Improvements

If tasks take a lot of time to complete, using LREM to delete the task may not be optimal. We may use an additional key with a Redis Set where we store all the completed tasks, that the Lua script will remove every time an item in the processing state is encountered and is also in the Set.

Another good use of an additional Set is to mark the items currently processed or waiting to be processed if we don't want to put the same ID multiple times (rarely useful).

Blocking VS polling

This system requires some form of polling from the point of view of the consumer. Without some care, a consumer will rotate the list as fast as possible without actually fetching interesting things. To avoid this problem it is possible to use a sentinel to signal the end of the list (like a special task ID -1) so that clients will pause a bit when this element is encountered. Another solution is to simply sleep a bit if after N calls to the script no processable element was found.

Another alternative is to use a second list just to notify that new tasks are available, using blocking pop and push.
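
The "sleep a bit after N unproductive calls" idea can be sketched as follows. This is a minimal Python model, with everything injectable so that it runs without a Redis server: fetch stands for the call to the script, handle for the consumer logic, and the names poll, max_idle_calls, pause and rounds are hypothetical.

```python
import time

def poll(fetch, handle, max_idle_calls=10, pause=0.1, sleep=time.sleep, rounds=100):
    """Call `fetch` repeatedly, pausing after `max_idle_calls` calls in a row
    that produced nothing to work on.

    `handle(value)` must return True when it did useful work with the value.
    Returns how many times the loop paused.
    """
    idle = 0
    pauses = 0
    for _ in range(rounds):
        value = fetch()
        if value is not None and handle(value):
            idle = 0            # useful work done, keep going at full speed
            continue
        idle += 1
        if idle >= max_idle_calls:
            sleep(pause)        # nothing processable for a while: back off
            pauses += 1
            idle = 0
    return pauses

# Example with an always-empty queue and a fake sleep that records the pauses.
slept = []
pauses = poll(lambda: None, lambda v: True, max_idle_calls=10,
              sleep=slept.append, rounds=25)
```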

Alternative implementations

An alternative implementation is to use a list and a sorted set: the list contains new elements to process, while the sorted set contains the elements that are in the processing state, scored by unix time. Basically there are endless alternatives, the main point is that now with scripting we can fetch an element while adding it somewhere else, with even additional information (the unix time) without issues, so many new patterns are possible in the messaging area of Redis usage.

Posted at 20:15:02

Redis 2.6 is near, and a few more updates

2012-02-24T16:00:03+00:00

Redis 2.6 was expected to go live in the first weeks of 2012, but today is 24th of February and there are still no 2.6-rc1 tags around, what happened to it you may ask!?

Well, for once, a delay is not a signal that something is wrong. What happened is simply that we put a lot more than expected inside this release, so without further delays here is a list of new features:

Server side Lua scripting, probably the most exciting and big news, with built-in support for fast JSON and MessagePack encoding and decoding.
Milliseconds resolution expires, also added new commands with milliseconds precision. This means that if you set an expire at 1 second, now the key will stop existing after exactly 1000 milliseconds, with an error of +/- 1 millisecond. At the same time you have new commands like PEXPIRE, PTTL, PSETEX, that let you specify the timeout of a key in milliseconds. Want to throttle an API so that no more than two requests per 50 milliseconds are done? Now you can do it easily.
Hardcoded limits about max number of clients removed. Now your Redis instance can handle all the clients your OS is able to handle, without recompilations or other hard coded limits.
AOF low level semantics is generally more sane, and especially when used in slaves. This is an uncommon use case, and the misbehavior was subtle, but now the implementation and behavior is definitely more sane.
Clients max output buffer soft and hard limits. You can specify different limits for different classes of clients (normal, pubsub, slave).
AOF is now able to rewrite aggregate data types using variadic commands, often producing an AOF that is faster to save, load, and is smaller in size. So what in 2.4 used to be N LPUSH calls to reconstruct a list of N items, now it is N/64, because variadic LPUSH with (up to) 64 arguments was used.
Every redis.conf directive is now accepted as a command line option for the redis-server binary, with the same name and number of arguments. You can write ./redis-server --slaveof 127.0.0.1 6379 --port 6380, and in general pass any possible option, exactly like it is specified in redis.conf.
Hash table seed randomization for protection against collision attacks.
Performance improved when writing large objects to Redis.
Significant parts of the core refactored or rewritten. New internal APIs and core changes allowed to develop Redis Cluster on top of the new code, however for 2.6 all the cluster code was removed, and will be released with Redis 3.0 when it is more complete and stable.
Redis ASCII art logo added at startup. This is where our major efforts went in the latest months.
redis-benchmark improvements: ability to run selected tests, CSV output, faster, better help, and support for pipelining giving awesome results. More about this later in this blog post.
redis-cli improvements: --eval for comfortable development of Lua scripts.
SHUTDOWN now supports two optional arguments: SAVE and NOSAVE. They respectively force to save an RDB when no RDB persistence is configured, or to avoid to save when RDB persistence is configured.
INFO output split into sections, the command is now able to just show specific sections.
New statistics about how many times a command was called, and how much execution time it used (INFO commandstats).
More predictable SORT behavior in edge cases.
INCRBYFLOAT and HINCRBYFLOAT commands, for atomic fast float counters.
Virtual Memory was removed from the code (was already deprecated in 2.4)
Much better bug report on crash, with stack trace, register dump, state of the client causing the crash, command vector and so forth. This was in part back ported to 2.4 releases.
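
As an example of what milliseconds resolution expires make possible, here is a sketch of the throttling case mentioned in the list above (no more than two requests per 50 milliseconds). It is a pure Python model of a counter key with a millisecond time to live: with real Redis one common way to get this behavior is an INCR on a key plus a PEXPIRE when the counter is created, but that mapping is my assumption and not something this release prescribes. The class name Throttle and its parameters are hypothetical.

```python
class Throttle:
    """Model of a counter key that expires `window_ms` after it is created."""

    def __init__(self, limit=2, window_ms=50):
        self.limit = limit
        self.window_ms = window_ms
        self.count = 0
        self.expires_at = None

    def allow(self, now_ms):
        # The key stops existing once its time to live has elapsed.
        if self.expires_at is not None and now_ms >= self.expires_at:
            self.count = 0
            self.expires_at = None
        # First request of a window: the key is created and gets its expire.
        if self.count == 0:
            self.expires_at = now_ms + self.window_ms
        self.count += 1
        return self.count <= self.limit

throttle = Throttle(limit=2, window_ms=50)
decisions = [throttle.allow(t) for t in (0, 10, 20, 50, 60, 70)]
```

In this model the third request of each 50 milliseconds window is refused, and the counter starts again from scratch once the window is over.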

There are two features still to merge, but already implemented into branches:

Small hashes now implemented using ziplists instead of zipmaps, for better performance when there are more than 253 fields but less than the number of fields needed to convert the zipmap into a full hash table.
More coherent behavior of list blocking commands in presence of non trivial conditions and blocked clients.

And new internals...

Redis 2.6 offers the above new features, but another interesting fact is that it is also a spinoff of the unstable branch, the one that is going to be Redis 3.0 sooner or later. Instead 2.4 was a spinoff of Redis 2.2 code base. This means that we now are working with a better code base that makes implementing certain features simpler.

It will also make it much easier for us in the future to backport stuff from the unstable branch to 2.6. This means that we can either backport stuff from time to time into 2.6 releases, or create a 2.8 branch to merge all the interesting features that are already stable, and do an intermediate release a few months from now.

Redis benchmarks with pipelining support, impressive numbers, and stupid benchmarks

After looking at yet another set of benchmarks that were actually measuring everything but actual DB performance, I decided to go ahead and implement pipelining in the redis-benchmark tool to show some good numbers.

redis-benchmark used to create 50 clients, and perform something like: send request, wait for reply, send request, wait for reply, with all those 50 clients. However Redis supports pipelining, that is, if you have N queries to do where you don't need the reply of the previous to perform the next request, you can send N queries at once to Redis, and then read all the replies. This dramatically improves performance because fewer syscalls, fewer context switches, and fewer TCP packets are required, and so forth.
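
To make the idea concrete, here is a small Python sketch of what a pipeline looks like on the wire: every command is encoded in the Redis multi bulk request format, and the encoded commands are simply concatenated so that a client can send them with a single write and read the replies afterwards. The helper names encode_command and encode_pipeline are hypothetical, and no socket is opened here; this only builds the bytes.

```python
def encode_command(*args):
    """Encode one command as a multi bulk request: *<argc>, then $<len> + data per argument."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

def encode_pipeline(commands):
    """Concatenate several encoded commands so they can be sent in one write."""
    return b"".join(encode_command(*cmd) for cmd in commands)

single = encode_command("GET", "key:1")
payload = encode_pipeline([
    ("SET", "key:1", "a"),
    ("GET", "key:1"),
    ("INCR", "counter"),
])
```

A client would then write payload to the socket once and read three replies in order, instead of paying one round trip per command.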

Most real world Redis applications use pipelining. Often you need to do things like paginate a list of objects, so you do LRANGE to get the IDs, and then a pipeline with all the GET or HGETALL and so forth. Or you want to write an object on the database and update its position in a sorted set.

But still redis-benchmark was not able to test pipelining, so when we said Redis can do 150k requests per second on entry level hardware we were actually saying ... if you never use pipelining at all. But how does it perform if you can use it?

Let's check with pipelining, using my glorious MBA 11" running OSX:

$ redis-benchmark -P 64 -q -n 1000000
PING_INLINE: 540540.56 requests per second
PING_BULK: 636942.62 requests per second
SET: 301204.81 requests per second
GET: 430848.75 requests per second
INCR: 341530.06 requests per second
LPUSH: 305623.47 requests per second
LPOP: 296120.81 requests per second
SADD: 313774.72 requests per second
SPOP: 418060.22 requests per second

Wow, 430k GET requests per second with a MacBook Air, and finally with this new benchmark not everything is the same: PING is faster than GET, which is faster than SET, and so forth. This also means more ability to optimize commands on our side.

If you test this on a Xeon, you get 650k GETs easily, or other impressive numbers even reducing the pipelining from 64 to 32 or 16.

Now to show how benchmarks can easily be turned into everything you want: we have these numbers of Redis performing 500k operations per second, per core, but now on the web site of HyperDex I read: With 32 servers and sufficient clients to create a heavy workload, HyperDex is able to sustain 3.2 million ops/s.

Hey dudes, I can do 1/6th of the ops/sec you do with 32 servers using just 1 core of my Xeon desktop. What does this mean? Nothing.

Long story short, don't show benchmarks unless you have a very good methodology explained in the web site, and your methodology makes sense, otherwise it is just marketing that does not provide any value to the user.

A better way to do benchmarks is to isolate a common real-world problem, and write a real world implementation of this problem using different databases, in the idiomatic way for every database, mixing both writes and reads in the same benchmark. Then test the different implementations with many simultaneous clients, with millions of objects.

Those tests, performed independently by smart programmers, are what is making Redis very popular among guys that have serious requests per second, and I hope that 2.6 with built-in server side scripting will allow them to get more out of Redis.

Posted at 16:00:03

How my todo list works

2012-02-07T13:49:00+00:00

There is a constant in my life for 335 days a year (let's assume that 30 days of vacation are a bit more relaxing): I've tons of things to do every day. Most are about my work, many are about handling home, health, family, and so forth.

I'm not the kind of guy that is good at remembering that I need to do this and that. Usually I wake up, have breakfast, sit in front of my computer and start to write code, read issues, reply to emails, and so forth. So to get pending things done I absolutely need some form of todo list, and in the course of the years I tried different things.

Paper and pen

One of the systems that I used most was just a piece of A4 paper and a pen. I did this for years, and it's not too bad, but this system eventually does not scale: writing by hand is time consuming so you end up writing a bit too little information to complete or remember the task well enough. Deletion often forces you to rewrite the items on a new piece of paper. Paper gets easily lost, and you don't have it with you if you move often (I always change work place, moving from home to a small office before lunch).

Computer notes

Eventually I ended up trying different solutions using the computer and the keyboard: from a todo service I coded myself, to "Remember the milk", and everything in the middle. All those systems worked for a few weeks, but I always ended up with some kind of mess, too many different "lists", and accessing a web site to look at or modify my TODO list was boring.

However I discovered that the biggest problem was not the web service, how it was implemented, or how fast it was, the biggest of the problems was... myself. More specifically the way I used my TODO list.

I finally found a system that works great for me, and has been working great for months. So at this point I want to share it with you. I'll not try to get into details about why it works and so forth, I'll just describe it. If you are looking for an alternative for your todo list keeping business, try it and check for yourself.

My system

I write my TODO list using Evernote, in a single note called TODO. Evernote is great for two reasons in this context: it's fast because it is a resident program, but gets synched, so you have your TODO list in all your computers, in your phone, and so forth.
The note is split into three sub parts: daily, weekly, monthly.
The last two items in the daily list are: "read the weekly list if it is monday", "read the monthly list if it is the first day of the month".
Every time you need to insert a new todo list item, just insert it at the end of the appropriate sub-list, daily, weekly, or monthly, depending on how urgent it is, or simply where you think it is more appropriate for the item to stay.
READ THE DAILY LIST EVERY DAY once you sit in front of the computer <- this is the core of the system. Don't do anything before. No emails, no news sites, nothing. Read the list.
When appropriate, move items between sublists. For instance if you are reading the monthly list and something is urgent now, move it to the daily part of the list.
When needed, remove items, because you already completed the task or because it is no longer relevant or a priority.

That's all. You don't need to do at least one item per day or the like, as long as you keep reading the list every day. It's up to you when to act and how much you act: this system is not designed to fix your ability to get things done, it is designed just to fix the schedule, and to keep you informed with little effort about what you should do today.

It's working well for me and I hope it works well for you as well. If you find ways to improve this system I would love to hear.

Posted at 13:49:00

Redis Moka Awards 2011

2011-12-28T14:55:23+00:00

The Redis community is something special for me... it is full of great guys trying to participate in the development providing ideas, help, fixes and support.

However there are a few users that clearly go the extra mile helping the project in a special way: by making the code more robust. Doing heroic debugging sessions, reviewing the code I commit pointing at bugs, suggesting the adoption of some new idea or library to make Redis better, that turns out to be the right idea.

Don't get me wrong, all the other efforts are very appreciated, like the work done in the Redis Group trying to help newcomers or to come up with a solid design to attack a new problem with Redis, but I feel a need to recognize in a special way the efforts having a direct effect in the code quality.

I planned to recognize these efforts with a special Redis t-shirt, but after many months I still don't have a good design, and after all... the t-shirt is a bit too obvious. So recently I had a new idea. After all coders need to stay awake to help with Redis and they usually love coffee, and Italy is good at coffee, so why not send Moka pots as an award? And here we are.

The winners of this year can select between the classic Bialetti Moka pot and the induction variant, depending on the cooktop you have. If you already have one you can convert the prize into the equivalent amount of good coffee for your Moka (this is also useful in case the prize is awarded multiple times to the same user).

This is the same Moka pot that millions of Italians use every day to make coffee: they work great, are reliable, and last literally for decades.

And... the winners of this year are... :) (in alphabetical order):

Thank you guys! Please send me the shipping address and the kind of cooktop you have (induction or normal). Also please send me info about the size you want. I recommend the three cups one, but if you plan to always be alone this will waste a lot of coffee and the one cup is better.

p.s. please send me your address before 15th of January if you can!

Posted at 14:55:23

Testing the new Redis AOF rewrite

2011-12-13T15:55:04+00:00

Redis 2.4 introduced variadic versions of many Redis commands, including SADD, ZADD, LPUSH, RPUSH. HMSET was already available in Redis 2.2. So for every Redis data type, we now have a way to add multiple items in a single command, for example:

LPUSH mylist item1 item2 item3
SADD myset A B C
ZADD myzset 1 first 2 second 3 third
HMSET myhash name foo surname bar

However this feature was not used when rewriting the AOF log (operation now performed automatically since Redis 2.4, but that the user can still trigger using the BGREWRITEAOF, even if the server is not configured to use AOF).

The AOF was still generated using a single command for every element inside an aggregate data type. For instance a three elements list required three different calls to LPUSH in the rewritten AOF file.

Finally Redis 2.6 (that will be forked from the current unstable branch, just removing the cluster code) is introducing the use of variadic commands for AOF log rewriting. The result is that both rewriting and loading an AOF file containing aggregate types and not just plain key->string pairs will be much faster.
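
To give an idea of why the rewritten log gets shorter, here is a Python sketch that splits the elements of an aggregate value into variadic commands. The limit of 64 items per command is the one mentioned for the 2.6 rewrite; the function name variadic_commands and its signature are hypothetical, and this is not the code Redis uses internally. A set is used in the example so that the order of the elements does not matter.

```python
def variadic_commands(command, key, items, max_items=64):
    """Split `items` into commands carrying at most `max_items` elements each."""
    return [
        [command, key, *items[i:i + max_items]]
        for i in range(0, len(items), max_items)
    ]

members = [f"member:{n}" for n in range(200)]

# Old style rewrite: one command per element.
old_style = variadic_commands("SADD", "myset", members, max_items=1)
# New style rewrite: up to 64 elements per command.
new_style = variadic_commands("SADD", "myset", members)
```

With 200 elements the old style needs 200 commands, while the new one needs 4 (three full chunks of 64 plus one of 8), so there are far fewer commands to write, parse, and execute when the AOF is loaded.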

How much faster?

We'll start checking the speed gain that can be obtained in a real world dataset with very few keys containing aggregate data types, that is, the database of lloogg.com. Since lloogg was designed in the early stage of Redis development, when Hashes were still not available, it stores a lot of user counters as separate keys, so there are a lot of keys just containing a string (huge waste of memory, but I've still to find the time to modify the code). However around 5% of the keys are sorted sets. This is an excerpt from the full output of Redis Sampler against the lloogg live DB.

TYPES
=====
 string: 95480 (95.48%)   zset: 4469 (4.47%)       list: 48 (0.05%)        
 set: 3 (0.00%)

As you can see this is far from the ideal dataset to make the new AOF changes look cool, still the result is significant:

Time needed to rewrite the AOF log, and size of the resulting file with the OLD rewrite: about 12 seconds, 569 MB
Time needed to rewrite the AOF log, and size of the resulting file with the NEW rewrite: about 9 seconds, 479 MB
Time to BGSAVE, for reference: about 9 seconds, file size: 344 MB.

Now let's check the loading time of all the three options:

Time to load the RDB: 7.156 seconds
Time to load the OLD AOF: 15.232 seconds
Time to load the NEW AOF: 12.589 seconds

I think this is very good news if you consider this database contained just a small number of keys. Now what happens for users that have a lot of lists, hashes, sets, sorted sets?

Bigger gains

To test the new code with a database that better represents a use case where most of the keys are aggregate values I created a dataset with 1 million hashes containing 16 fields each. Fields are reasonably sized, like Field1: Value1, Field2: Value2, and so forth.

I used this Lua script to create the dataset:

local i, j
for i=1,1000000 do
    for j=1,16 do
        redis.call('hmset','key'..i,'field:'..j,'value:'..j)
    end
end
return {ok="DONE"}

(Note, if you use the latest unstable branch you can run it using: redis-cli --eval /tmp/script.lua)

Now the same metrics as above but against this new dataset:

Time needed to rewrite the AOF log, and size of the resulting file with the OLD rewrite: about 17 seconds, 851 MB
Time needed to rewrite the AOF log, and size of the resulting file with the NEW rewrite: about 10 seconds, 440 MB
Time to BGSAVE, for reference: about 4 seconds, file size: 158 MB.

Now let's check the loading time of all the three options:

Time to load the RDB: 1.888 seconds
Time to load the OLD AOF: 31.946 seconds
Time to load the NEW AOF: 17.512 seconds

As you can see now both the AOF rewriting and loading times are reduced to almost half of the time required with Redis 2.4. However you can still see an amazing 1.888 seconds in the time needed to load the RDB. Why?

Because since Redis 2.4 BGSAVE directly outputs the encoded version of the value, if the value is encoded as a ziplist, an intset or a zipmap. This is a huge advantage, both while loading and saving the database, that could be easily implemented in the AOF rewrite. However I'm currently not doing it as probably in the next versions of Redis we'll have an option to rewrite the AOF log in RDB format itself... so with the unification of the two systems a lot of problems will be reduced.

Posted at 15:55:04

Redis for win32 and the Microsoft patch

2011-12-09T10:43:24+00:00

A few days ago Microsoft released a patch to compile Redis under win32. The team working on this project used the already existing win32/win64 port as a reference, and used the libuv library that powers the node.js project.

Yesterday the story hit Hacker News, as I discovered a few hours later (I was away from the computer since December 8th is a holiday here in Italy), and as usual when you mix Microsoft, Open Source, and a news site, the result is some friction inside comments.

I decided to write this blog post to clarify my opinion on the matter, and it is funny how this blog post will delay another that I've already written and I was ready to publish today, called "We are programmers, we need a revolution", that in some way is related to what I want for the Redis future... but you'll see that post in a few days if you are interested. Now back to the win32 patch.

How good the patch is

What the Microsoft team working on the patch did was to port Redis to libuv, that is, mainly, a library for evented programming based on libev, but cross platform. Actually libuv is ending up as a container of many useful programming tools to interact with the operating system that are usually different between POSIX and WIN32. It was developed for the node.js project in order to, eventually, contain every difference between POSIX and WIN32 inside a unique library.

Persistence was not properly addressed yet, but apparently the next version of the patch should handle it better. Currently, however, persistence is basically unusable since it blocks the main thread while Redis is saving a snapshot. There are also intermittent problems passing the tests, and it is not clear to the authors of the patch if this is due to the testing engine itself or to actual issues in the patch.

But in short: the patch is exactly as functional as the native win32/win64 port already provided by dmajkic was: a port good enough to develop under Windows without the need of running Redis under a virtualized Linux install (not a big effort btw, in my opinion), but not good enough to use Redis in production under win32 systems.

It is worth noting that the native port by dmajkic, which does not use libuv, has the remarkable advantage of adding just the minimal set of changes needed to port Redis to win32: it implements a new win32 backend in the event library we use, ae.c, that proved to be a very stable and performant component in our stack in the latest two years. So it was a lot more compact, and I see this as an advantage.

Patches or pull requests?

Microsoft was criticized for not sending a pull request, but a patch... I think here the point is that they provided some code: send a pull request, send a patch, or an email, it's the same, and IMHO there is very little point in these formal things. But I think that in this specific instance the patch in a gist was the right way to contribute, since the patch was huge and since in the past I stated many times that I don't want to add win32 support directly in the Redis main project, but I'll favor the creation of a satellite "Redis-win32" project that is separated from the main project.

Also note that in our contributing guidelines we state that it is better to talk with me or Pieter before going forward with the development of significant code. After all I say no many times, so why waste efforts? But Microsoft did not inform me, I guess simply because this was a project they wanted to do anyway, so sending a patch is appropriate even more. However I was informed by email about the fact a patch would be published in 24 hours, and I appreciated it.

In short: Microsoft behavior as an OSS contributor in this case is fine from my point of view.

Why I'll not accept this patch

I don't think Redis running under win32 is a very important feature. It is cool to have a win32 port that can be used for testing, as we had before, and as we have in a different implementation thanks to the Microsoft patch, so developers using Windows can easily test Redis and develop their projects. But what is the point in providing a production quality win32 port?

I think that Linux completely won as a platform to deploy software, and even if you want to run your code under win32 systems what's wrong about installing Linux boxes to run Redis? For instance Stack Overflow runs their systems in a mix of Windows and Linux boxes, and they have no trouble using Linux to run Redis.

Instead, handling a win32 port directly in the main project means delaying everything else for the little gain of having, eventually, a production ready win32 port of Redis. In Redis we use a lot of subtle things about the operating system, from copy on write to the time needed to fork a process, to the way operating systems overcommit memory. If we add a new platform, in the future exploiting the OS to do the best for our users will get harder and harder. It is simply not worth it.

However I like the idea of a win32 port as a separate project, with a different set of developers, and not officially supported by the main project. That is just added value, and can provide a more reasonable port for development, or even for production at some point, without impacting the main project: so fork the code, and have fun. I'll help if needed, ask me questions, let's collaborate on general ideas. I'll also put a page about the win32 port in the redis.io site so that users will be aware of the port.

Let's merge just for libuv?

In the latest days I also heard that it is a good idea to switch to libuv in general, win32 port or not. I beg to differ.

To start, the node.js project is using libuv since they are interested in multi platform code able to run under Win32 and POSIX. Otherwise libuv, from the point of view of what Redis uses of an evented library, does not offer new interesting things (we just use file events and timers with our ae.c library). If there is some interesting abstraction that we'll need to use in the future, like streams, we'll implement it in ae.c, but as far as I can tell we will NOT have that need.

Also, I've an argument that for me is truly important:

$ wc -l ae*.[ch]
     397 ae.c
     118 ae.h
      94 ae_epoll.c
      96 ae_kqueue.c
      72 ae_select.c
     777 total

What the above means is: ability to resolve any possible bug with our events or timers in no time, instead of trying to understand a much bigger multi platform code.

I avoid dependencies, but when dependencies are needed I don't have problems with them. Redis includes a full Lua interpreter and the jemalloc allocator. It would have been idiotic to provide my own implementation of a programming language, or to rewrite an allocator that works as well as jemalloc does. When dependencies provide a lot of added value it is worth adding them. Instead, when you need to switch to something bigger and more complex without any gain, why do it?

Ah, and about the gain being some kind of feature only exciting for us code nerds and having zero effects on how a system works, please read my next article in a few days, we are programmers and we need a revolution.

Posted at 10:43:24

Short term Redis plans

2011-11-07T21:39:23+00:00

Users often ask me what is the Redis development roadmap, so it is probably time to write a blog post about our short/mid term plans, with the most important points.

There are two major features that we are pushing forward: scripting and cluster, so let's start from these two.

Scripting

Scripting implements Lua scripting support for Redis. We already have detailed documentation for this feature that describes what is already implemented and what will likely go inside the first stable release of Redis featuring scripting. The only part you should consider outdated both in the doc and the implementation is how scripts running for too much time are handled (this topic was extensively covered in the Redis google group).

Redis scripting will appear in Redis 2.6, and I'm trying hard to ship Redis 2.6 RC1 by the end of this year (2011). There are many other features planned for 2.6 as well. I'm not sure I'll be able to address everything, but at some point I think I'll try to do a time-driven release for 2.6 and just put inside scripting and everything else that is already implemented/stable.

Redis Cluster

Redis cluster is definitely the next big thing, and you can read our Redis Cluster draft specification to get an idea about what it can do and what not. But in short Redis Cluster is a distributed implementation of a subset of Redis standalone. Not all commands will be supported, especially we don't support things like multi-key operations. In general we are just implementing the subset of Redis that we are sure can be made working in a solid way in a cluster setup, with predictable behaviors.

Redis cluster will favor consistency over the ability to resist netsplits and failures in general. Basically it will tolerate well a few instances going down, but will not survive big netsplits like other eventually consistent systems are able to do.

Redis cluster is as important as scripting, but will be delayed to Redis 3.0 since scripting is much simpler to implement and in our opinion of almost equal importance for most users (if not more), so we prioritized scripting.

The current status of Redis cluster is that you can play with it already but not everything in the specification is implemented. It will take a few more months in order to reach beta, and then we'll work on the details in order to ship something solid. We'll try hard to resist shipping a system that is not mature just to say we are already cluster-ready. Redis is fortunately very useful already so we want to make sure to ship the cluster version only when it will likely resolve problems instead of creating new ones.

The good news is that the Redis Cluster design is particularly simple in almost all the aspects, this helps our hope to ship a good system in a reasonable amount of time.

Replication improvements

As part of the work we are doing for Redis Cluster we'll need to improve replication. This part of Redis has received improvements in almost every release, but with Redis Cluster we need an even better one. For instance we plan to avoid a full resync every time the link goes down, if the downtime was reasonable and the differences can be accumulated. In short, when the slave disconnects the master does not kill the client representation of the slave, but continues sending data (that gets buffered). When the slave reconnects we recognize it from a new per-instance ID that always changes after a restart (or after a SLAVEOF NO ONE command), and perform the incremental resync.

These changes will either be shipped with Redis 3.0 or with a future version (the current replication is not optimal but probably already good enough for the first Redis Cluster release, so it is not clear if we'll be able to fix it before or after the first release).
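The incremental resync idea can be modeled in a few lines. This is only a sketch in Ruby (Redis itself is written in C): the class name, the method names, and the backlog limit are all hypothetical, and the real design may well differ.

```ruby
require 'securerandom'

# Hypothetical model of a master that keeps accumulating the replication
# stream while a slave is disconnected, instead of forgetting about it.
class MasterSketch
  attr_reader :run_id

  def initialize(backlog_limit = 1024)
    @run_id = SecureRandom.hex(20)  # per-instance ID, new at every restart
    @backlog_limit = backlog_limit  # max bytes we accept to replay (assumption)
    @stream = String.new            # every write propagated so far (never trimmed here)
  end

  # Append a write command to the replication stream, return the new offset.
  def propagate(command)
    @stream << command
    @stream.bytesize
  end

  # Called when a slave reconnects with the run ID and offset it last saw.
  # Returns [:incremental, missing_data] or [:full_resync, nil].
  def resync(slave_run_id, slave_offset)
    missing = @stream.bytesize - slave_offset
    if slave_run_id == @run_id && missing >= 0 && missing <= @backlog_limit
      [:incremental, @stream.byteslice(slave_offset, missing)]
    else
      [:full_resync, nil]
    end
  end
end

master = MasterSketch.new
offset = master.propagate("SET a 1\n")  # the slave is in sync up to here
master.propagate("SET b 2\n")           # written while the link was down
p master.resync(master.run_id, offset)  # only the second command is sent
p master.resync("another-id", offset)   # unknown instance: full resync
```

A real implementation would also have to bound the memory used by the accumulated data; here the limit is only checked at reconnection time.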

Persistence improvements

Currently we have two persistence modes: append only files and RDB persistence. Both have different tradeoffs. It is not clear what we'll do about it, but it is possible that we will either unify the two models and/or improve AOF a lot so that it no longer needs the online rewrite process in most use cases (but the log can be rewritten by an external process or simply a Redis thread).

Everything is very hypothetical in this area for now, but there are a lot of ideas that we accumulated in recent years that are surely worth experimenting with.

We also want to work both on the communication (most users don't understand that Redis with both AOF and RDB enabled is very durable already, and this is the setup we suggest) and on the implementation, to make sure that Redis AOF can be a very durable solution, as durable as the best SQL databases out there.

This is definitely post-cluster stuff.

More introspection

There is a plan to use Pub/Sub in order to communicate events happening inside Redis, like a key that expired, clients connecting / disconnecting, operations performed against keys. We'll probably allow the user to script this feature with Lua so that you can, for instance, push all the expired keys into a list as well, or do other things that can't be reliably done with clients and Pub/Sub, since the client is not guaranteed to get all the messages (it can get disconnected for some reason).

High resolution expires

I'm working on it already: we'll have high precision expires in Redis 2.6, so you can set an expire of just a few milliseconds for a given key. The current resolution is 1 second, which is ok for most applications but not for all.

Performance improvements when reading/writing big objects

If you check the 'slowset' branch there is work in this direction already. As part of this work I'm creating a speed regression test, since we really lack one. Note: by big objects I mean sets/gets in the range of 100 KB or 1 MB per element. Redis already performs very well with reads/writes of values of a few KB.

Many other smaller things

See the list of issues filtered by the "new feature" tag on GitHub to get an idea about the smaller things that are going to be implemented.

I hope this helped. Please ask me questions in the blog comments if you want more details. I'll likely reply tomorrow morning since it is already late here ;)

On cryptography and dogmas

2011-10-21T14:14:09+00:00

Yesterday I finally made the initial public release of Lamer News, which is both a real world Redis programming example in the form of a Hacker News style site, and a project to run a completely independent (with a consortium) programming news site.

The project was well received, and was on the top page of HN for some time. Thanks for providing your feedback.

After the release of the code I got a few requests about changing the hash function I was using in order to hash passwords in the database:

require 'digest/sha1'

# Turn the password into a hashed one, using
# SHA1(salt|password). PasswordSalt is a global constant
# defined elsewhere in the application.
def hash_password(password)
    Digest::SHA1.hexdigest(PasswordSalt+password)
end

The above code uses SHA1 with a salt. As others pointed out this is not the safest pick, since there are ways to compute SHA1 very fast. After some time a chorus of people started tweeting and commenting a single sentence: "Use bcrypt". I proposed using nested SHA1 in a loop, in order to avoid adding more dependencies to the code (if you check the README, one of the goals is to keep the code simple and dependent on just a few gems). And at this point it happened: the crypto dogma. No way to reason about crypto primitives and their possible applications and combinations, just "use bcrypt". In the eyes of this crew programmers are just stupid drones applying guidelines, who can't in any way reason about cryptography. But I'll talk more about that later...

For now let's take a step back... and show what the original problem is with all this, and how insecure the original code is.

The problem

The problem is pretty easy to understand, but it is worth explaining in detail. In order to avoid storing passwords in cleartext in the database, it is common practice to hash passwords. So:

HP (hashed password) = HASH (password)

When the software needs to perform the user authentication it receives the plaintext password, hashes it again, and verifies that it matches the one in the database. If so the user is authenticated.

However what happens if an attacker, let's call her Eve, steals the database and the hashed passwords are leaked? Eve has a number of hashed passwords, let's call them HP1, HP2, HP3, ... Her goal is to find an attack that turns HP back into P.

The hashing algorithm HASH is public, so the first thing Eve can do is to apply HASH to a dictionary composed of common words and check if HP matches the HASH(common_word). If there is a match the original password was found. Note that there are not so many words in the English dictionary, so this attack is very easy to perform, and super fast.
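A minimal sketch of this dictionary attack, assuming HASH is an unsalted SHA1. The word list is a tiny made-up one, and `dictionary_attack` is a name invented for this example.

```ruby
require 'digest/sha1'

# Try every word of a dictionary against a leaked, unsalted SHA1 hash.
# Returns the matching word, or nil if the password is not in the list.
def dictionary_attack(leaked_hash, words)
  words.find { |word| Digest::SHA1.hexdigest(word) == leaked_hash }
end

# Hypothetical tiny dictionary; a real attack would load a full word list.
words = %w[apple monkey dragon sunshine password]
leaked = Digest::SHA1.hexdigest("dragon") # what Eve found in the database

p dictionary_attack(leaked, words)                             # the word is recovered
p dictionary_attack(Digest::SHA1.hexdigest("x9$Lq2vT"), words) # not a dictionary word
```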

But maybe our user, Bob, picked a password that is not in the dictionary, but is not particularly long either.

Eve can generate all the possible passwords of up to 6 chars and hash them with HASH, trying to find a match. This attack is computationally harder. If the password is a completely binary string, let's say of six characters, there are 256^6 passwords, that is, 281474976710656.

If our attacker can hash one billion passwords per second (it is possible with modern GPUs without spending a fortune on it) cracking this password takes:

281474976710656 / 1000000000 = 281474 seconds

This is just... three days, so one day and a half in the average case. Not good! It's too easy to crack. There is another problem: a user will hardly use all the 256 byte values with equal probability. Let's consider the worst case of a user picking just the 26 lowercase letters, without numbers or symbols. This time let's consider an 8 character password.

There are 26^8 possible passwords, that is, 208827064576. This time our password can be cracked in 208 seconds (half that time on average).

This is clearly not good. How long should a password over a 26 letter alphabet be to be out of reach for an attacker able to compute HASH 1 billion times per second?

A 14 character password will resist about 1023 years on average before being cracked. For a 16 character password our attacker needs 1382824 years to try every combination.

Just 12 chars will resist for one year and a half on average, definitely too little for most applications.
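The numbers above can be reproduced with a few lines of Ruby. This is just the arithmetic of the text, assuming a fixed rate of one billion hashes per second and a 365 day year; the function name is made up for the example.

```ruby
# Seconds needed to try every password of the given length over an
# alphabet of the given size, at a fixed number of hashes per second.
def exhaustive_search_seconds(alphabet_size, length, hashes_per_second = 1_000_000_000)
  (alphabet_size**length) / hashes_per_second.to_f
end

seconds_per_year = 3600 * 24 * 365

[[256, 6], [26, 8], [26, 12], [26, 14], [26, 16]].each do |alphabet, length|
  worst = exhaustive_search_seconds(alphabet, length)
  puts format("%3d symbols, %2d chars: %.0f seconds worst case, %.2f years on average",
              alphabet, length, worst, worst / 2 / seconds_per_year)
end
```

The average is half of the exhaustive search time, since on average the attacker finds the password after trying half of the candidates.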

So is SHA1 secure for hashing passwords? Yes if the user picked a strong password of 14 chars or more. Otherwise not very secure. It all depends on the length of the password, and guess what, users have a bad habit of picking bad and short passwords.

It is worse than that

Unfortunately it is worse than that. For instance the attack against our 12 char password can be made instantaneous in an easy way: spending three years to compute, once, a table of all the 12 char passwords and their hash values. This is basically a big map between HASH(P) and P.

However such a table takes space: a lot of terabytes (86792 to be precise) to store the lookup table, even assuming we have a compression algorithm so cool that it uses just a byte per HP,P pair (likely an unreachable goal). Still, this is a valid attack when the size of the table is reasonable.

The point here is that many times in cryptography an attack can be made to work using space instead of time.

The good thing is that there is a way to prevent the attacker from precomputing a single table that works for all the sites using the same hash function, that is, using a salt. A salt is an (assumed public) string we concatenate to our password before hashing it, so if our salt is "lame", and the password is "foo" we will perform:

HP = HASH("foolame")

This way, for the table-based attack to work, the attacker needs to pre-compute a table with all the 12 char password combinations hashed with the same salt. This means the table is useless if Eve plans to attack another site with a different salt.
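To make the space-for-time tradeoff concrete, here is a toy version of the table attack. It assumes SHA1 as HASH and a candidate space of only three lowercase letters so that the table is built instantly; `build_table` is a name invented for the example.

```ruby
require 'digest/sha1'

# Precompute a HASH(P) => P table for every candidate password.
# The salt is appended to the password, as in HASH("foolame").
def build_table(candidates, salt = "")
  candidates.each_with_object({}) do |password, table|
    table[Digest::SHA1.hexdigest(password + salt)] = password
  end
end

# Toy candidate space: all the 3 letter lowercase passwords (26^3 entries).
candidates = ("aaa".."zzz").to_a
unsalted_table = build_table(candidates)

# An unsalted hash is reversed with a single lookup...
p unsalted_table[Digest::SHA1.hexdigest("foo")]
# ...but the same table finds nothing for a site that uses the salt "lame".
p unsalted_table[Digest::SHA1.hexdigest("foo" + "lame")]
# Eve needs a new table, computed for that specific salt.
p build_table(candidates, "lame")[Digest::SHA1.hexdigest("foolame")]
```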

Random salts

We can do even better than that, and not just store HP, but also a random salt. When we create the user account we also generate a user-specific random salt, and store it along with the hashed password.

With a per-user salt we are safer: the attacker now needs a table per user if she wants to precompute. Even more interesting, while a global salt may be leaked even if the user passwords are not, this is unlikely to happen if you have a salt for every user.
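A minimal sketch of per-user salts, still using plain SHA1 only to keep the example short (the speed problem is a separate issue). The record layout and the method names are hypothetical, not the Lamer News schema.

```ruby
require 'digest/sha1'
require 'securerandom'

# Create the record stored for a new user: a random salt plus the hash.
def create_credentials(password)
  salt = SecureRandom.hex(16) # 128 random bits, different for every user
  { salt: salt, hash: Digest::SHA1.hexdigest(salt + password) }
end

# Authentication hashes the submitted password again with the stored salt.
def valid_password?(credentials, password)
  Digest::SHA1.hexdigest(credentials[:salt] + password) == credentials[:hash]
end

alice = create_credentials("foo")
bob   = create_credentials("foo")
p alice[:hash] == bob[:hash]     # same password, different stored hashes
p valid_password?(alice, "foo")
p valid_password?(alice, "bar")
```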

Making HASH slow

However even if we stop all the table based attacks, there is still a fundamental problem: if the password is short and Eve is able to compute HASH 1 billion times per second we have problems.

There is one thing we can do: using a hash function HASH that is MONKEY ASSES SLOW.

There are algorithms that are very slow both in hardware and in software, for instance. Or we can take an existing algorithm and make it very slow by using it in a loop.

For example Blowfish is an encryption algorithm with a slow key scheduling algorithm (the algorithm is pretty fast once you performed the key scheduling, so Blowfish is a bad choice only if you want to encrypt many short messages with different keys, but can be fast if you want to encrypt a big message with a single key).

The fact that the Blowfish key scheduling algorithm is slow makes it a good candidate for HASH.

So Niels Provos and David Mazières designed an algorithm called Bcrypt that can be used in order to hash passwords. The algorithm was presented in 1999 and uses a modified Blowfish key scheduling algorithm. I'm not sure if past analysis against Blowfish can be applied to Bcrypt after the modifications, nor how much analysis was performed against Bcrypt itself, so I can't comment about the security of the algorithm in question.

However it is a popular pick, and Provos and Mazières are two known cryptographers, so the algorithm probably has no obvious flaws.

Once you use a slow HASH the attacker will start to have a lot of trouble. For instance Bcrypt is "tunable": you can modify an input parameter to make it slower or faster. If you make it slow enough that even with good hardware you can't compute more than 1000 hashes per second, it is still probably fast enough for your authentication servers to handle, but it is impractically slow for Eve to crack even an 8 character password, even using just 26 letters:

26^8/1000/3600/24/365 = 6.6218627782

3.3 years on average to crack an 8 character password. Probably still a bit too weak but better than a few seconds...

However note how we are still not secure against a dictionary attack. If the user picked a common word there is no hope. 30k hashes are still trivial to perform in a reasonable amount of time.

On dogmas

So far we showed a few interesting points I think. First: there is no hashing scheme that will save users picking very bad passwords. It is very important to force users to add non alphanumerical characters and a few capital letters in the password IF security is very important for your application.

It is important to understand how things work. And this brings us to the following point. After the "use bcrypt" chorus I replied that I could use another solution instead, based on just iterating SHA1. But apparently, for many, cryptography is not a topic a programmer should understand. It is just a dogma. When you have dogmas you are probably going to be a bad programmer: what if your system does not have bcrypt support for some reason and you still want to build something useful?

What I proposed was this trivial scheme:

SHA1(SHA1(SHA1(...SHA1(password|salt)...)))

How heretical! I was marked as a stupid guy not understanding security, told that it is not safe to chain hash functions like this, and so forth. But if you think about it:

SHA1 is a one way hash function. It is composed of a small computation step called round that is iterated again and again. There is no key scheduling as it is not a block cipher, it just compresses a stream of bits into a fixed length output.

It is very important to understand that many crypto algorithms are based on that idea of taking a simpler function and iterating it many times to strengthen the effect it has. This concept is so important that sometimes an attack on an algorithm disappears (becomes not practical or requires more time than brute force) if you add more rounds. Sometimes cryptographers use a variant of the algorithm with a reduced number of rounds just to analyze the algorithm in a more attackable form, to understand better how strong the variant with the full number of rounds is.

So why don't we just add a lot of rounds? Because it is slow. Even an amateur cryptographer could design an algorithm that is secure but slow. A good cryptographer will be able to find the tradeoff between security and speed.

But... now we know this concept of rounds, and we know that in SHA1 there is no key scheduling algorithm, that the output of the function depends only on its input, and that SHA1 is not designed to be inverted, as there is no decryption stage.

So it is quite natural that the scheme I proposed, computing SHA1(SHA1(SHA1(..))), will just do that: add rounds to SHA1. So given the fundamental properties of SHA1 it should be computationally infeasible to write a function SHA1000 that is equivalent to SHA1 nested 1000 times but that can be computed easily.

Note that the output of SHA1(SHA1(..)) is not the same as modifying the algorithm adding more rounds since there is a pre and post stage in the SHA1 algorithm that will make the output differ compared to a plain SHA1 with more rounds.

But guess what? This morning I discovered that the PBKDF1 algorithm described in RFC 2898 does exactly what I proposed.
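For reference, this is roughly what PBKDF1 looks like when instantiated with SHA1, written as a small Ruby sketch of the RFC 2898 definition. The function name, the example salt, and the iteration count of 1000 are illustrative choices, not values taken from the Lamer News code.

```ruby
require 'digest/sha1'

# PBKDF1 as described in RFC 2898 (section 5.1), with SHA1 as the hash:
#   T_1 = SHA1(password || salt), T_i = SHA1(T_(i-1)),
#   derived key = first dk_len bytes of T_c (c = iteration count).
# The RFC specifies an eight byte salt and, for SHA1, at most 20 bytes of key.
def pbkdf1_sha1(password, salt, iterations, dk_len = 20)
  raise ArgumentError, "derived key too long" if dk_len > 20
  raise ArgumentError, "iterations must be positive" if iterations < 1
  t = Digest::SHA1.digest(password.b + salt.b)
  (iterations - 1).times { t = Digest::SHA1.digest(t) }
  t[0, dk_len]
end

salt = "\x01\x02\x03\x04\x05\x06\x07\x08".b # example salt, normally random
key = pbkdf1_sha1("foo", salt, 1000)
puts key.unpack1("H*") # hex form, suitable for storing in the database
```

Raising the iteration count is what makes every guess more expensive for the attacker, at the cost of the same slowdown for each legitimate login.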

There are people that are very happy to show you the way, but if you look at them more closely you discover they are clueless. So please use proven standards, try to write secure code, but use your mind, learn about cryptography and how you can combine primitives. Dogmas are lame.

It is not a good idea for a programmer to try designing a block cipher and then use it for sensitive purposes: there are specialists doing that. But understanding what the building blocks are, what you can do with cryptography, and how to build protocols is a very important skill for our community.

Finally it is ok for me when people are rude to me when they are right. Arrogance can be handled if it is mixed with smartness. When instead it meets ignorance it is really just a sad affair.

Edit: two new interesting links from HN:

About the second link, in the message the explanation is not very clear but this is what happens:

Here the attack they want to mount is the following: find another string, ANY string, that will hash to the same output, considering only 32 bits of the output.

Since it is any string, it can also be a SHA1 itself. So what you do is start with an "X" that can be ANY value, even "foo". And you start doing:

    x = SHA1(x)
    x = SHA1(x)
    ... again and again ...

Ok? Well in the average case after 2^31 iterations you find a collision, right?

But the output 65536 iterations ago was it! The string that will produce that specific 32 bit output after SHA1() is nested 65536 times. So you want to go backward, but it is not possible: SHA1 can't be inverted. So what do you do? You start again from "X" and stop exactly 65536 iterations before the point where you found the wanted value.

Obviously doing 65536 more SHA1s of that string you get the output you found before. So you found your string. Why does the original poster say that the attack takes 2x the time but can even be optimized? Because you can store the value of SHA1 at 10000 iterations, at 20000, and so forth. Then instead of re-running the whole iteration you start from the nearest cached value.
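The attack can be tried on a toy scale. In the sketch below only 16 bits of the output are compared and SHA1 is nested 100 times instead of 65536, so that it finishes quickly; these parameters and the function names are assumptions made for the example. Instead of caching a value every 10000 steps it simply remembers the whole chain, which is the same optimization taken to the extreme.

```ruby
require 'digest/sha1'

# SHA1 nested `depth` times, working on raw binary digests.
def nested_sha1(input, depth)
  depth.times { input = Digest::SHA1.digest(input) }
  input
end

# Walk the chain x, SHA1(x), SHA1(SHA1(x)), ... remembering every value,
# until a value at least `depth` steps into the chain starts with the wanted
# prefix. The value `depth` steps earlier is then a valid input.
def find_input(target_prefix, depth, start = "foo", max_steps = 5_000_000)
  chain = [start]
  max_steps.times do
    chain << Digest::SHA1.digest(chain.last)
    n = chain.size - 1
    match = chain.last[0, target_prefix.bytesize] == target_prefix
    return chain[n - depth] if n >= depth && match
  end
  nil # give up: nothing found within max_steps
end

depth = 100      # toy nesting depth
prefix_bytes = 2 # compare only the first 16 bits of the output
target = nested_sha1("secret password", depth)[0, prefix_bytes]

found = find_input(target, depth)
p nested_sha1(found, depth)[0, prefix_bytes] == target if found
```

Note that the string found is almost certainly not the original password, just another input producing the same truncated output.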

What's wrong with the iPhone 4s, and why Jobs is not my hero

2011-10-07T13:29:00+00:00

The iPhone 4s is out, and apparently for the first time it disappointed many of the most addicted Apple fanboys, especially the ones that Apple raised over the last decade in a sort of semi-religious way, using bold statements like "this changes everything, again".

Apparently for many, the 4s did not change everything enough: they probably expected some kind of tangible change like a bigger display, a redesign, or who knows... maybe a portable holodeck.

If you ask me, the iPhone 4s is a huge step forward, because of Siri. I'm not talking about what Siri is currently able to do, or the current impact it can have on our interaction with a portable device. Even if it already seems like a great new way of interacting with computers, what makes me so excited is the fact that natural language processing finally hit a mass market product. What I hope is that the experience will already be good enough for average people that this will boost innovation in that area, both in industry and in academia.

I'm pretty sure Google will end up involved in that game. After all, until yesterday Android was the state of the art in voice interaction. But apparently Google is not able to connect the dots: they solve a given problem in a great way, that is, translating speech to text, but they can't translate that into a great user experience for their users.

Now that Apple, again, showed the world the obvious, other companies will compete in the same arena, improving the technology.

In my opinion Siri will be integrated in the iPad soon as well, since the iPad needs Siri even more than the iPhone itself. With tablets you have a bigger screen that makes you feel like you can use it to accomplish actual work, but the limitation of typing on a virtual keyboard, in an unnatural position, severely limits what you can do. If you can use your voice it is a different story.

So if it is so cool what is wrong with the iPhone 4s? First: that Apple sells you new software implicitly pretending it is selling you new hardware. Siri is the most interesting thing in the iPhone 4s and could run perfectly well on the iPhone 4 (it is mostly a server-side thing), but even if you purchased an iPhone 4 a few weeks ago, no way, you can't have it, even if you were willing to pay for a software update.

No wait. It is worse than that. Siri was even available in the App Store as a standard application, but has now been removed since it got integrated into the 4s.

You may object that Siri as an application sucked: without integration there is no fun. Unfortunately the lack of integration is a result of a closed environment, the iOS environment, where there is a single entity dictating what you can run and what not, and how an application can interact with the device (usually in a very limited form).

All this is happening at the same time as the world lost Steve Jobs. News sites are full of articles showing how great he was, and I think he actually contributed a lot to the technology world. But my heroes are different: they want a world where everybody has access to the best technology, to the best hospitals, and making money is a side effect of contributing in a non evil way to the development of our culture.

Being among the creators of the marketing and business philosophy that Apple pushes forward made Steve Jobs a great CEO, but the world needs a different kind of hero. Unfortunately when they happen to die you can expect some small news in major media, at most.

Why the MBA 11 is now my sole computer

2011-09-30T13:53:18+00:00

(Since I received a few questions about how I feel writing code with the MBA I'm posting this blog post.)

I work from a number of different places. Mainly from home and from an office where I've a room for me, here in Catania, together with other friends of mine also writing code for another company.

This way I'm sure I'll not spend the whole day at home, and when I've an interesting problem I've a few friends to share ideas with in front of a coffee. So usually I work at home in the morning, then go to the office for lunch, and stay at the office till the end of the working day.

It is also common for me to work from the swimming pool waiting for my son, or from my parents' home, and in many other places, especially during weekends and holidays. Clearly I need a computer that is good for mobility, and guess what: the MacBook Pro 13" that I own is not the best fit. It is simply too big both to carry and to keep on your lap. It is not good for "couch browsing" when I need some info or to check if there are updates on some Redis matter that is particularly urgent.

The iPad is also not an option: I spent a life in front of a keyboard, I'm good at it, I can type fast without effort. Any computing experience that makes me slow at typing is frustrating. Not to mention that the iPad is the worst computing device to write even a single line of code.

So... when Apple released the first version of the 11" Macbook Air (Late 2010) I purchased one within the first week of the announcement, and my computing experience changed.

The old MBA

The old MBA was almost exactly what I needed. A small full-featured computer with a good enough keyboard, good battery life, and readable screen.

I spend almost 90% of my time inside a Terminal or a web browser, and the 11" is not a problem for me in these contexts. I force myself to write code with an 80 column max line, so even using a bigger screen I tend to use small terminal windows. The 11" screen is enough to have a big font in the Terminal app and display 80 columns x 37 rows. Web browsing is also ok, just a matter of tuning the font size to read comfortably.

The old MBA was so good from many points of view that I rapidly started using it to code, and every day I was using more MBA and less MBP. However it had a big problem: it was slow, very very slow. Don't get me wrong, thanks to the SSD it is fast enough to do things like running the Twitter client, browsing the web without waiting too much, and even watching videos (as long as you don't have other background stuff). But once you start using it to write C code and run unit tests... well it is an entirely different story. Compiling Redis after switching branch or running the test was too slow. Not slow enough to stop using it given the advantages, but from time to time I found myself switching back to the MBP 13 in order to work more comfortably, especially during bug hunting or other tasks where there was a fast edit-compile-test loop.

I was asking myself the same question again and again: "Why does Apple make small computers that are slow, and big computers that are fast?". An obvious reason, I guess, was battery life. If the computer is small the battery is small, and more computing power drains more battery. However the tradeoff was still not very clear. I was in need of a small, but fast, computer, even with a not so great battery life.

The new MBA

At some point it happened: Apple released an 11" computer that was as portable as the older model but was almost four times faster. Obviously I ordered one the same day it was announced, making sure to get the one with the i7 processor for maximum performance.

My new MBA is actually two times faster than my old MBP 13 at running the Redis test. Long story short I no longer use my MBP 13, that is now only used connected to the TV in order to watch streaming stuff...

The new MBA is virtually identical to the old one, if not for the backlit keyboard that is also a pro for many people (but not for me as I think that if you can't see the keys you are harming your eyes, so I always use at least a dim light even when working at night. Btw I don't look at the keyboard while typing, as I guess most of you don't.).

My feeling is that the battery life of my new MBA is not as good as my old MBA, but actually I never verified this experimentally, I'll try it since I've also the old one that is now used by my wife.

In short if you want an MBA 11" and you are a programmer concerned with screen size or speed, go for it: it is the best computer I ever owned for sure. Virtually all my Redis development is done with the MBA, and since it is so fast nothing prevents you from connecting it to a bigger screen when you are in the location where you usually work, turning it into a real desktop.

For instance I've a 22" Samsung monitor and a USB keyboard that I use with it when I'm at the office. I just plug in the video cable and a USB hub with everything connected to switch from mobile to big-screen mode: a wireless mouse, a hard drive to do backups, a Dell keyboard, and so forth.

I hope this helped somebody with mixed feelings about purchasing it or not :)

Everything about Redis 2.4

2011-07-29T17:46:47+00:00

A few months ago I realized that the cluster support for Redis, currently in development in the unstable branch, was going to require some time to be shipped into a stable release, and required significant changes in the Redis core.

At the same time Pieter and I already had a number of good things not related to cluster in our development code: delaying everything for the cluster stable release was not acceptable. So I took a different path, forking 2.2 into 2.4, and merging my and Pieter's developments (at least the ones compatible with the 2.2 code base) into this new branch. In other words 2.4 was possible because git rules.

2.4 delayed the work on unstable, but this was a good compromise after all. And now the effort has finally reached a form that is close to stable, as we are at release candidate number five. You can find Redis 2.4-rc5 in the download section of the Redis site, and in a few weeks this will be rebranded Redis 2.4.0-stable if no critical bugs are discovered.

This article is going to show you in detail all the new things introduced in Redis 2.4. Before continuing, no... scripting is not included. It will be released with Redis 2.6 that will be based on the Redis unstable code base instead. Redis 2.6 is planned for this fall.

The following is a summary of all the changes contained in 2.4. We'll show every one in detail in the course of the article.

Small sorted sets now use significantly less memory.
RDB Persistence is much much faster for many common data sets.
Many write commands now accept multiple arguments, so you can add multiple items into a Set or List with just a single command. This can improve the performance in a pretty impressive way.
Our new allocator is jemalloc.
Less memory is used by the saving child, as we reduced the amount of copy on write.
INFO is more informative. However it is still the old 2.2-like INFO, not the new one in unstable, which is composed of sub sections.
The new OBJECT command can be used to introspect Redis values.
The new CLIENT command allows for connected clients introspection.
Slaves are now able to connect to the master instance in a non-blocking fashion.
Redis-cli was improved in a few ways.
Redis-benchmark was improved as well.
Make is now colorized ;)
VM has been deprecated.
In general Redis is now faster than ever.
We have a much improved Redis test framework.

Everything on this list was coded by me and Pieter Noordhuis, but feedback from users was really helpful. Special thanks go to Hampus Wessman, who spotted and fixed interesting bugs.

VMware kindly sponsored all our work as usual. Thanks!

Memory optimized Sorted Sets

One of the most interesting changes in Redis 2.2 was the support for memory optimized small values. Why represent a Redis List as a linked list if it only has 10 elements, for instance? If you have a billion lists this is going to take a lot of space, since there are a lot of pointers, many allocations each with its own overhead, and so forth.

So we introduced the ability to switch encoding on the fly. Lists, Sets, Hashes, all start encoded as a single blob that uses little memory, even if it requires O(N) algorithms to do things that are otherwise O(1). But once a given threshold is reached Redis converts these values into the old representation. So the amortized time to perform the operation on the element is still O(1), but we use a lot less memory. Many datasets are composed of millions of small lists, hashes, and so forth.
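The encoding switch can be modeled like this. It is only a sketch of the idea in Ruby: the class name and the threshold are hypothetical, and a Ruby array of pairs is of course not a compact blob like the real encodings, it just plays that role here.

```ruby
# Hypothetical small-value container: a flat array of [field, value] pairs
# (O(N) lookups, little overhead) that is converted to a real hash table
# once it grows past a threshold.
class CompactHash
  attr_reader :encoding

  def initialize(max_compact_entries = 64)
    @max = max_compact_entries # threshold value is an assumption
    @encoding = :compact
    @data = []                 # flat array of pairs while small
  end

  def set(field, value)
    if @encoding == :compact
      pair = @data.find { |f, _| f == field } # linear scan, O(N)
      if pair
        pair[1] = value
      else
        @data << [field, value]
        convert! if @data.size > @max
      end
    else
      @data[field] = value                    # hash table, O(1)
    end
  end

  def get(field)
    if @encoding == :compact
      pair = @data.find { |f, _| f == field }
      pair && pair[1]
    else
      @data[field]
    end
  end

  private

  # One-time conversion to the general purpose representation.
  def convert!
    @data = @data.to_h
    @encoding = :hashtable
  end
end

h = CompactHash.new(3)
%w[a b c].each { |f| h.set(f, f.upcase) }
p h.encoding # still the compact encoding
h.set("d", "D")
p h.encoding # converted after crossing the threshold
p h.get("a")
```

Since N is bounded by the threshold while the compact encoding is in use, the linear scans stay cheap, and callers never notice which representation is active.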

However in Redis 2.2 we applied this optimization to everything but Sorted Sets. Redis 2.4 finally brings this optimization to Sorted Sets as well, as we discovered that there are many users also using data sets with many many small sorted sets. And this brings us to the next point...

Faster RDB persistence

If our small values are encoded as blobs, this means we can do something very interesting from the point of view of persistence: these values are already serialized!

The kind of representation we use for small values does not have pointers or the like. The only change that we required was to put all the integers (lengths and relative offsets in the encoding format) in an endianness independent form. I used little endian encoding as this means no conversion most of the time.

This is a huge win from the point of view of RDB persistence. In Redis 2.2, saving a hash with ten fields represented as a zipmap (this is one of our special encoding formats) required iterating the hash and saving every field and value as different logical objects in the RDB format.
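As an illustration of what an endianness independent, pointer-free blob means, here is a made-up layout serialized with explicit little endian integers. It is not the real ziplist or zipmap format; the layout and function names exist only for this example.

```ruby
# Serialize a list of small strings into one blob: a 32 bit little endian
# entry count, then for each entry a 32 bit little endian length and the bytes.
# The "V" directive of pack/unpack always means little endian, whatever the CPU.
def encode_blob(entries)
  blob = [entries.size].pack("V")
  entries.each { |e| blob << [e.bytesize].pack("V") << e.b }
  blob
end

def decode_blob(blob)
  count = blob.unpack1("V")
  offset = 4
  Array.new(count) do
    len = blob.byteslice(offset, 4).unpack1("V")
    entry = blob.byteslice(offset + 4, len)
    offset += 4 + len
    entry
  end
end

blob = encode_blob(%w[foo barbaz])
p blob.bytes.first(4) # [2, 0, 0, 0] on every architecture
p decode_blob(blob)
```

A blob like this can be written to disk and read back as it is, which is the property the text relies on.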

Now instead we save the serialized value as it is in memory! Many datasets are now an order of magnitude faster to load and save. This also means that Redis 2.2 can't read datasets saved with 2.4.

Variadic write commands

Finally many write commands are able to take multiple values! This is the full list:

SADD set val1 val2 val3 ... -- now returns the number of elements added (not already present).
HDEL hash field1 field2 field3 ... -- now returns the number of elements removed.
SREM set val1 val2 val3 ... -- now returns the number of elements removed.
ZREM zset val1 val2 val3 ... -- now returns the number of elements removed.
ZADD zset score1 val1 score2 val2 ... -- now returns the number of elements added.
LPUSH/RPUSH list val1 val2 val3 ... -- return value is the new length of the list, as usual.
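
The return value semantics of the list above can be modeled in a few lines of Python. This is only a model of the documented behavior on a plain Python set, not Redis code; the function names mirror the commands for readability.

```python
def sadd(target, *values):
    """Add values to a set; return how many were not already present."""
    added = 0
    for v in values:
        if v not in target:
            target.add(v)
            added += 1
    return added


def srem(target, *values):
    """Remove values from a set; return how many were actually removed."""
    removed = 0
    for v in values:
        if v in target:
            target.remove(v)
            removed += 1
    return removed
```

Note that the reply is still a single integer, exactly as in the single-value version: that is what made these commands easy to extend.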

Since Redis's ability to process commands faster is usually limited not by the time needed to alter the data set but by the time spent in I/O, dispatching, and sending the reply back, some applications will now see an impressive speed improvement.

Just an example:

> redis-cli del mylist
(integer) 1
> ./redis-benchmark -n 100000 lpush mylist 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
====== lpush mylist 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ======
  100000 requests completed in 1.28 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

99.93% <= 1 milliseconds
99.95% <= 2 milliseconds
100.00% <= 2 milliseconds
78247.26 requests per second

> redis-cli llen mylist
(integer) 2101029

Yes, we added two million items into a list in 1.28 seconds, with a networking layer between us and the server. Just saying...
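
For the record, the arithmetic behind that claim, using only the numbers from the benchmark run above (the variable names are mine):

```python
requests = 100_000
values_per_request = 21      # the values 0 through 20 in the benchmark command
elapsed_seconds = 1.28

# Every request pushes all 21 values with a single round trip.
items_pushed = requests * values_per_request
items_per_second = items_pushed / elapsed_seconds

print(items_pushed)             # 2100000
print(round(items_per_second))  # 1640625
```

That is about 1.6 million list elements per second, in line with the list length of a bit over 2.1 million reported by LLEN.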

You may ask why we made only a specific set of commands variadic. We did it for all the commands where the return value would not need to change type, nor depend on the number of arguments. For all the rest there will be scripting... doing this and more for you :)

Jemalloc FTW

The jemalloc affair is one of our most fortunate uses of external code ever. If you follow Redis development you know I'm not exactly the kind of guy who gets excited about linking some big project into Redis without some huge gain. We don't use libevent, our data structures are implemented in small .c files, and so forth.

But an allocator is a serious thing. Since we introduced the specially encoded data types Redis started suffering from fragmentation. We tried different things to fix the problem, but basically the Linux default allocator in glibc sucks really, really hard.

Including jemalloc inside of Redis (no need to have it installed on your computer, just download the Redis tarball as usual and type make) was a huge win. Every single case of fragmentation in real world systems was fixed by this change, and the amount of memory used also dropped a bit.

So now we build on Linux using jemalloc by default. Thanks jemalloc! If you are on OS X or *BSD you can still force a jemalloc build with make USE_JEMALLOC=yes, but those other systems have a sane libc malloc, so usually this is not required. Also a few of those systems use jemalloc-derived libc malloc implementations.

Less copy-on-write

Redis RDB persistence and the AOF log rewriting system are based on the memory semantics of fork() in modern operating systems. While the child is writing the new AOF or RDB file, it is cool to have the operating system preserve a point-in-time copy of the dataset for us, but every time we change a page of memory in the parent process, that page gets duplicated. This is known as copy-on-write, and it is responsible for the additional memory used by Redis while the saving child is active.
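
The point-in-time behavior is easy to observe from user space. The following Python sketch (Unix only, since it relies on os.fork) is an illustration and has nothing to do with how Redis itself is written: the parent modifies its data after the fork, and the child still reports the data as it was at fork time. The function name and the two-pipe handshake are my own choices for the demo.

```python
import os


def snapshot_seen_by_child(data):
    """Fork, mutate `data` in the parent, and return what the child still sees."""
    result_r, result_w = os.pipe()   # child -> parent: the child's view of the data
    go_r, go_w = os.pipe()           # parent -> child: "I have modified the data"

    pid = os.fork()
    if pid == 0:
        # Child: wait until the parent has changed its copy, then report ours.
        os.close(result_r)
        os.close(go_w)
        os.read(go_r, 1)
        os.write(result_w, ",".join(data).encode())
        os._exit(0)

    # Parent: this write lands on memory that was shared with the child at fork
    # time, so the kernel gives the parent a private copy of the touched pages.
    os.close(result_w)
    os.close(go_r)
    data.append("written-after-fork")
    os.write(go_w, b"x")
    os.close(go_w)

    with os.fdopen(result_r, "rb") as pipe:
        child_view = pipe.read().decode()
    os.waitpid(pid, 0)
    return child_view.split(",")
```

Calling snapshot_seen_by_child(["a", "b"]) returns ["a", "b"] even though the parent's list has three elements by the time the child answers. The price of that consistency is the duplicated pages, which is the memory overhead discussed above.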

We made several changes in the past in order to reduce copy-on-write, but one last needed change was still not implemented, related to the internals of our hash table iterator. Finally 2.4 has this change. It is interesting to note that I made the mistake of backporting this change into Redis 2.2. This was responsible for many bugs in the recent history of Redis 2.2. In the future I'll continue to be as conservative as I was in the past and will make just the minimal changes in stable releases.

The additional copy-on-write in Redis 2.2 looked like a bug, but fixing that bug with a patch involving several changes to the core was surely not a good idea. Waiting for the next release is almost always the right thing to do in the case of non-critical bugs.

More fields in INFO

The new INFO in unstable is much better than the one in 2.2 and 2.4. It was not a good idea to backport it into 2.4, as it was too much different code, but the new 2.4 INFO has a few interesting new fields, especially these two:

used_memory_peak:185680824
used_memory_peak_human:177.08M

Your RSS and your fragmentation rate are usually related to the peak memory usage. Now Redis is able to hold this information, and this is very useful for memory related troubleshooting.

So for instance if you have an RSS of 5 GB but your DB is almost empty, are you sure it was always empty? Now you just have to look at this field.
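
If you collect INFO output from a script, checking this is a few lines of code. The Python sketch below parses the name:value lines that INFO returns and compares the peak against the current used_memory field. The helper names and the sample used_memory value in the usage note are assumptions made for the example.

```python
def parse_info(text):
    """Turn 'name:value' lines from INFO into a dict, skipping blanks and comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields[name] = value
    return fields


def peak_ratio(info):
    """Peak memory divided by current memory; a high value hints at a past spike."""
    return int(info["used_memory_peak"]) / int(info["used_memory"])
```

For example, with used_memory_peak:185680824 as shown above and a hypothetical used_memory of about 900 KB, the ratio is around 200: the instance is nearly empty now, but it clearly was not always empty.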

Two new introspection commands: OBJECT and CLIENT

The DEBUG command was already able to show some interesting information about Redis objects. However you can't count on DEBUG, as this command is not required to be stable over time and should never be used except to hack on the Redis code base.

The OBJECT command exposes some interesting information about Redis values in a form that is accessible and usable by developers.

You can find the full documentation of the OBJECT command here.

Another interesting new command is the CLIENT command. Using this command you are able to both list and kill clients. I'm sorry, but I still have to write the documentation for this command, so here I'll show an interactive usage example:

redis 127.0.0.1:6379> client list
addr=127.0.0.1:49083 fd=5 idle=0 flags=N db=0 sub=0 psub=0
addr=127.0.0.1:49085 fd=6 idle=9 flags=N db=0 sub=0 psub=0

We got the list of clients, and some info about what they are doing (or not doing, see the idle field). Now it's time to kill a client:

redis 127.0.0.1:6379> client kill 127.0.0.1:49085
OK
redis 127.0.0.1:6379> client kill 127.0.0.1:49085
(error) ERR No such client
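
Since the CLIENT LIST reply is one line per client made of space-separated name=value pairs, it is simple to process from a script. A small Python sketch, with helper names of my own invention, that picks the clients idle for at least a given number of seconds so they can be passed to CLIENT KILL:

```python
def parse_client_list(text):
    """Parse a CLIENT LIST reply into one dict of fields per client."""
    clients = []
    for line in text.splitlines():
        if not line.strip():
            continue
        clients.append(dict(pair.split("=", 1) for pair in line.split()))
    return clients


def idle_clients(clients, min_idle_seconds):
    """Return the addr of every client idle for at least min_idle_seconds."""
    return [c["addr"] for c in clients if int(c["idle"]) >= min_idle_seconds]
```

With the two-client reply shown above and a minimum idle time of 5 seconds, this selects only 127.0.0.1:49085, the client we just killed by hand.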

Non blocking slave connect

Redis master-slave replication was already a non-blocking process for almost everything except the connect(2) call performed by the slave to the master.

This is finally fixed. It is a small change, but with significantly better behavior compared to the past. We still need to fix a few things about replication, but we'll make other changes in order to make replication better for cluster. Redis Cluster uses replication in order to maintain copies of nodes, so you can expect that as cluster evolves, replication will evolve too.
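
For readers not familiar with the technique, this is the general shape of a non-blocking connect, sketched in Python for a Unix-like system. It is not the Redis implementation, which is written in C on top of its own event loop; the function names are hypothetical. The socket is put in non-blocking mode, connect returns immediately, and the outcome is read later from SO_ERROR once the socket becomes writable.

```python
import errno
import os
import select
import socket


def start_connect(host, port):
    """Begin a TCP connect without blocking; return the socket immediately."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    # EINPROGRESS means the handshake continues in the background.
    if err not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def finish_connect(sock, timeout):
    """Wait until the socket is writable, then check whether the connect succeeded."""
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return False
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
```

In an event-driven server you would not call select with a timeout like this; you would register the socket for writability in the event loop and keep serving clients in the meantime, which is the whole point of the change.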

Better redis-cli and redis-benchmark

Redis-cli is now able to do more interesting things. For instance you can now prefix a command with a number to run the command multiple times:

redis 127.0.0.1:6379> 4 ping
PONG
PONG
PONG
PONG

Another interesting change is the ability to reconnect to an instance if the link goes down and to retry the reconnection after every command typed.

Finally redis-cli can now be used to monitor INFO parameters together with grep. In the following example we display the memory usage every second.

./redis-cli -r 10000 -i 1 info | grep used_memory_human
used_memory_human:909.22K
used_memory_human:909.22K
used_memory_human:909.22K

Redis-benchmark was also improved, and now you can specify the exact command to benchmark, which is an awesome change. You can see an example run of the new redis-benchmark in the section about variadic commands.

Other improvements

An important change in Redis 2.4 is that it is the last version of Redis featuring VM. Redis will warn you that it is not a good idea to use VM, as we are no longer going to support it in future versions of Redis, as already discussed many times.

The new test suite is also much faster; we have a full article about this change. The new continuous integration is helpful too, running our code base under valgrind multiple times every hour.

Another interesting change is the colorized make process ;) You may think this is just a fancy thing, but actually it is much easier to spot compilation warnings this way.

I hope you'll enjoy Redis 2.4, and a big thank you to all the Redis community! Since it's Friday, have a good weekend :)


Posted at 17:46:47