<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Florian Hopf</title>
 <link href="http://blog.florian-hopf.de/atom.xml" rel="self"/>
 <link href="http://blog.florian-hopf.de/"/>
 <updated>2023-08-01T07:53:29+08:00</updated>
 <id>http://blog.florian-hopf.de/</id>
 <author>
   <name>Florian Hopf</name>
   <email>mail@florian-hopf.de</email>
 </author>

 
 <entry>
   <title>Making Your Daily Standups More Effective</title>
   <link href="http://blog.florian-hopf.de/2023/07/making-standups-more-effective.html"/>
   <updated>2023-07-31T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2023/07/making-standups-more-effective</id>
   <content type="html">&lt;p&gt;For a long time I couldn’t really put my finger on what bothered me with how standup meetings are often being run in Scrum like processes. Now that I realized how focusing on the work you should be doing can make them more effective I never want to go back to experiencing them differently.&lt;/p&gt; 

&lt;h2&gt;The Usual Way&lt;/h2&gt;

&lt;p&gt;A very common way to run standups is to go around the room (virtually or in person) and let people answer three questions, often informally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did I do yesterday?&lt;/li&gt;
&lt;li&gt;What am I planning to do today?&lt;/li&gt;
&lt;li&gt;Are there any blockers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;People talk about how they spend their time, including the tasks they are working on, sometimes using the board to move tasks around. While this sounds great, it has some serious issues and doesn’t lead to the team performing at its best.&lt;/p&gt;

&lt;h2&gt;The Change&lt;/h2&gt;

&lt;p&gt;Instead of asking those questions there’s a lot of benefit in putting the board with your tasks at the center of the discussion. You start with the tasks on the right side (often in test or in review) and discuss what is needed to move them to done. You then follow all columns until you arrive on the left, making sure that you cover all the work the team has currently committed to. This is sometimes also called &lt;a href=&quot;https://medium.com/serious-scrum/walking-the-board-on-daily-scrum-5b468c760329&quot;&gt;Walking the board&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Benefits&lt;/h2&gt;

&lt;p&gt;This allows for a few noticeable benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You don’t miss work items.&lt;/strong&gt; If you only talk about what people did, you are not really looking at the work you should be doing, the work you committed to for this iteration. This can easily lead to tasks being left in review or in progress, never touched or noticed, dragged from sprint to sprint. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You notice how much work is in progress.&lt;/strong&gt; To reduce context switches, it’s a good idea to minimize the work items the team or team members keep in progress. If you only look at what everyone is doing and not at the overall work, you can easily miss that there are too many open tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You provide an opportunity to do actual team work.&lt;/strong&gt; If you decide that reducing work in progress is something to strive for, it’s a good idea to have everyone help finish work items. This can mean that others jump in to help bring tasks to completion, e.g. by testing or pairing on issues. It becomes the responsibility of the team to tackle the tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The meeting can be more engaging.&lt;/strong&gt; Even if you are interested in what others are doing, your mind can start wandering when someone takes a few minutes to talk about their schedule. Talking about the tasks has higher relevance and also more variety.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;People don’t need to appear busy.&lt;/strong&gt; Everyone who has been in a role that includes some organizational work or just a broader context (like being a tech lead) will remember times when it was hard to summarize what they worked on. There are so many small things, often happening in an unplanned manner, and this can lead to desperately going through your memory or to-do list to find items you did, just so that you have something to say, including items that might not be relevant for the team.&lt;/p&gt;

&lt;h2&gt;And the Downsides?&lt;/h2&gt;

&lt;p&gt;Even though I hope that everything mentioned so far speaks for itself, there can still be some concerns coming up when considering implementing this approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Putting the work in the center of the discussion can feel like making the sync more mechanical&lt;/strong&gt;, less about the people. While this can be true, I think it is easily compensated for by the reduced stress of people no longer having to appear busy all of the time. Also, you will likely save some of your standup time with this approach - why not use it on something that is actually social? For example, in my current team we spend some time every Monday morning talking about what we did over the weekend, making for an enjoyable start to the week. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You might be missing important content.&lt;/strong&gt; You could say that the fact that the tech lead did two interviews and talked to another team could be relevant for the team. This can again be true: not all work that is relevant for the team will be tracked on the board. But it&amp;#39;s pretty easy to fix this using a single question at the end - “Anything else that is relevant for the team?”. This is the time when people can update everyone on things that are happening and need their attention. But it’s never the time for people to appear busy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Shoutout to my former colleague &lt;a href=&quot;https://www.linkedin.com/in/alasdair-george-0a13215/&quot;&gt;Alasdair George&lt;/a&gt; for helping me see this problem clearly and put a name on it.&lt;/em&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>1-1s For New Managers</title>
   <link href="http://blog.florian-hopf.de/2023/07/1-1s-for-new-managers.html"/>
   <updated>2023-07-03T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2023/07/1-1s-for-new-managers</id>
   <content type="html">&lt;p&gt;When starting out as a manager one of the obvious changes is that you need to pay even more attention to people. One of the core activities to get to know and connect with your reports will be 1-1 meetings. While you will most likely have experienced them already as an individual contributor being on the other end of the table is a bit different. In this post I’ll describe some of my experiences while starting out as a new manager.&lt;/p&gt;

&lt;p&gt;At their core, 1-1 meetings are a way for reports to bring any concerns or questions to the manager, while for the manager they can be a great tool to get to know your people and what drives them. They are an opportunity to learn how to align the project work with the interests of the people. And they are also a good way to build a trusting relationship.&lt;/p&gt;

&lt;h2&gt;Logistics&lt;/h2&gt;

&lt;p&gt;There are lots of thoughts about frequency, and many people will advocate for weekly meetings. That’s also how I started, doing 30 minutes weekly, but that got too taxing for me, even though I only have 5-7 reports at a time. The time for 1-1s adds to all the other meetings, leaving me exhausted, also due to my more introverted nature. For now I have switched to 30 minutes biweekly with everyone and it seems to work fine so far. Not every person will have the same needs though; in the past I had a report on my team who required a bit more attention, and at that time I did one hour a week with them.&lt;/p&gt;

&lt;p&gt;Many people will recommend not skipping 1-1s, and I try to adhere to this rule with the exception of being on leave. I’m not sure where I got it from (maybe from High Output Management?), but some people also recommend not having scheduled recurring meetings. Instead, book the next 1-1 at the end of the current one; that way you can take vacations or other events into account. &lt;/p&gt;

&lt;h2&gt;Contents&lt;/h2&gt;

&lt;p&gt;As mentioned, a 1-1 should primarily be a time for the report, bringing any topics they would like to discuss. While asking lots of questions from your side is a good idea, it should definitely not be a status meeting.&lt;/p&gt;

&lt;p&gt;You should still make sure that some topics are discussed regularly, like career growth or even checking in on recent work experiences. Also, it can happen that reports just don’t bring a lot of topics by themselves - you should still make good use of the time. &lt;/p&gt;

&lt;p&gt;One thing that I am doing is just asking a lot of questions, both on topics provided by engineers and on areas that I find interesting. It could be something they worked on, something they did, like running a meeting or giving a presentation, something where I either think I can give some feedback or where I suspect some hidden learning. I’ll drill down with questions on the area, often leading to some interesting insights. I use this technique most frequently to extract learnings about the work we do or the process we are following.&lt;/p&gt;

&lt;p&gt;Having an agenda can work but depends on the discipline of both the report and the manager. I am currently not doing shared agendas and I also don’t take notes in a shared document. Even though I know this is very useful, I haven’t found a good way to do it yet, mainly because I take extremely detailed notes myself during the discussion and those are not really suitable for sharing.&lt;/p&gt;

&lt;p&gt;It is often mentioned that 1-1s can be a way to give context, to show the purpose of the work. I rarely do this. I might not be doing it enough, but when I do I try to share with the whole team, e.g. the impact our work has on the company or things that are happening that affect us. Providing context can be useful for new joiners though, when I answer any questions they might have on how the work we are doing impacts the business. &lt;/p&gt;

&lt;p&gt;1-1s can also be used for mentoring but mostly for more junior engineers and only if they ask for it. This could be giving concrete guidance on issues they are facing or sharing your experience on different aspects.&lt;/p&gt;

&lt;p&gt;There are structured ways to run 1-1s as well but I am mostly doing them on the fly. I spend some time in advance though to prepare key questions or topics I want to cover for each person. If you are ever running out of things to cover there are lists with good questions for 1-1s as well (&lt;a href=&quot;https://jasonevanish.com/2014/05/29/101-questions-to-ask-in-1-on-1s/&quot;&gt;example&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Challenges&lt;/h2&gt;

&lt;p&gt;Awkward silence is an issue I am still struggling with. I am very quick to break silence, either by saying something myself or by moving to another topic. Giving people time to think can lead to more insights and more input from them.&lt;/p&gt;

&lt;p&gt;If people show strong emotions (this happens less frequently for me, but frustration is a more common one), acknowledge them. Don’t try to fix things immediately, especially in cases where you can’t solve them. One occurrence that comes to mind is when a report was eagerly waiting for a promotion and it didn’t happen. &lt;/p&gt;

&lt;p&gt;It’s also important to develop some self-awareness of your own emotional state. For example, if you are stressed or upset about something else, it’s good to talk about it. Otherwise your report might attribute your reaction to themselves. Developing this self-awareness will also help you in other areas of your work and life, but this is still a constant learning for me. Any book on emotional intelligence can help you learn more.&lt;/p&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;I hope my experiences can be useful for some people starting out. I wouldn’t be surprised if some of the things I am doing now don’t last forever, and I hope I can continue learning about the practices. Fortunately, there’s quite a bit of information being published on engineering management. Most of the books touch on 1-1s as well; some that helped me when I started out are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.goodreads.com/en/book/show/50363684&quot;&gt;Become an Effective Software Engineering Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://resilient-management.com/&quot;&gt;Resilient management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.juliezhuo.com/book/manager.html&quot;&gt;The Making of a Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.jrothman.com/practical-ways-to-lead-and-serve-manage-others-modern-management-made-easy-book-2/&quot;&gt;Practical Ways to Lead and Serve Others&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.goodreads.com/en/book/show/33369254&quot;&gt;The Manager’s Path&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, go check out what &lt;a href=&quot;https://uhl-steine-scherben.org/blog/posts/getting_the_most_from_1_on_1s/&quot;&gt;Christian Uhl has written on the topic&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Application Integration with Apache Camel</title>
   <link href="http://blog.florian-hopf.de/2019/07/apache-camel.html"/>
   <updated>2019-07-07T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2019/07/apache-camel</id>
   <content type="html">&lt;p&gt;Apache Camel is a very useful tool when it comes to integrating different systems and technologies. In this post I will introduce some of its concepts and show how you can test and run your application using Spring Boot.&lt;/p&gt;

&lt;h2&gt;Apache Camel&lt;/h2&gt;

&lt;p&gt;Camel is an implementation of many integration patterns, mostly inspired by the book &lt;a href=&quot;http://www.enterpriseintegrationpatterns.com/&quot;&gt;Enterprise Integration Patterns&lt;/a&gt;. Messages pass through channels from endpoint to endpoint, and along the way they can be translated, filtered and routed to other channels/endpoints. The following illustration is taken from the website for the book:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/camel-eip.png&quot; alt=&quot;Concepts of Enterprise Integration Patterns&quot;&gt;&lt;/p&gt;

&lt;p&gt;Besides the integration patterns, Apache Camel offers implementations of many protocols and technologies in its components. This means you can not only build your application based on the patterns above, but you also get the groundwork for technical integrations, one example being reading files from an FTP server or downloading data from an email server. &lt;/p&gt;

&lt;p&gt;To use Camel effectively you need to understand a few concepts.&lt;/p&gt;

&lt;h2&gt;Endpoints&lt;/h2&gt;

&lt;p&gt;Endpoints describe how to access external systems. Each endpoint is handled by a component, which registers its endpoint prefixes at runtime. Some components are available with the core Camel dependency, some need to be added explicitly. Examples of components are the MailComponent for receiving and sending emails, the RabbitMQComponent for use with RabbitMQ and the FileComponent, which provides functionality for reading and writing files.&lt;/p&gt;

&lt;p&gt;Endpoints are configured by URIs. Each component provides a prefix that determines whether it&amp;#39;s responsible for a URI. An example URI for the FileComponent:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;file:/opt/storage?move=.success&amp;amp;moveFailed=.error
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will poll the directory &lt;code&gt;/opt/storage&lt;/code&gt; for new files. If a file is processed successfully, it is moved to the folder &lt;code&gt;.success&lt;/code&gt;; if there is an error, it is moved to &lt;code&gt;.error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Another example is for RabbitMQ, which can be used for sending or receiving messages.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;rabbitmq://host:5672/?username=op&amp;amp;password=op
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The documentation of the components normally provides a table with a description of all the parameters that are available.&lt;/p&gt;

&lt;h2&gt;Routes&lt;/h2&gt;

&lt;p&gt;Endpoints are connected by routes. Most of the time you will use the nice Java DSL, which is available if you extend your class from &lt;code&gt;RouteBuilder&lt;/code&gt;. A simple example that polls files from a folder and writes them to RabbitMQ:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;from(&quot;file:/opt/attachments&quot;).to(&quot;rabbitmq:host:5672/&quot;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Besides directly connecting endpoints, you can also use &lt;code&gt;filter&lt;/code&gt; to skip some messages and &lt;code&gt;choice&lt;/code&gt; to send them to different endpoints, or you can add any other processing that you wish in between.&lt;/p&gt;
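&lt;p&gt;As a rough sketch (the file name check and the exchange names are made up for this example), a route that only forwards CSV files and routes them to different endpoints depending on their content could look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// inside a RouteBuilder configure() method - names are illustrative only
from(&quot;file:/opt/attachments&quot;)
    .filter(header(&quot;CamelFileName&quot;).endsWith(&quot;.csv&quot;))
    .choice()
        .when(body().contains(&quot;ERROR&quot;)).to(&quot;rabbitmq:host:5672/errors&quot;)
        .otherwise().to(&quot;rabbitmq:host:5672/reports&quot;)
    .end();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;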

&lt;h2&gt;Exchanges&lt;/h2&gt;

&lt;p&gt;Exchanges are containers for the current message flow, not to be confused with the term Exchange in the AMQP world. An exchange is a wrapper that contains the in/out messages and potential errors. A message consists of headers, which might keep component-specific information like the filename for the FileComponent, and a body.&lt;/p&gt;

&lt;h2&gt;Processor&lt;/h2&gt;

&lt;p&gt;A Processor is anything that does work in a route; &lt;code&gt;filter&lt;/code&gt; and &lt;code&gt;choice&lt;/code&gt; mentioned above are processors. But you can also perform other tasks by implementing the Processor interface, e.g. if you want to do some transformation. &lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;@Override
public void process(Exchange exchange) throws Exception {

} 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This processor can be registered in the route by adding a &lt;code&gt;.process(new MyProcessor())&lt;/code&gt;.  &lt;/p&gt;
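&lt;p&gt;A minimal sketch of what such a processor can look like (the class name and the uppercase transformation are just examples):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class MyProcessor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        // read the incoming body as a String and replace it with an uppercased version
        String body = exchange.getIn().getBody(String.class);
        exchange.getIn().setBody(body.toUpperCase());
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;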

&lt;h2&gt;Testing&lt;/h2&gt;

&lt;p&gt;Camel provides support for testing routes in isolation by replacing endpoints with mock endpoints, injected using &lt;code&gt;AdviceWithRouteBuilder&lt;/code&gt;. Messages can be sent to your endpoints using the &lt;code&gt;ProducerTemplate&lt;/code&gt;. You can then assert that a message arrives at the mock endpoint.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;MockEndpoint resultEndpoint = getMockEndpoint(&quot;mock:result&quot;);
resultEndpoint.setAssertPeriod(1000);
resultEndpoint.setExpectedMessageCount(1);

writeToInputFolder(getClass().getResourceAsStream(&quot;/transaction-report.csv&quot;));
resultEndpoint.assertIsSatisfied();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Testing can be a bit special when it comes to Apache Camel as some of the processing is being done asynchronously.&lt;/p&gt;

&lt;h2&gt;Runtime&lt;/h2&gt;

&lt;p&gt;When using Camel standalone you have to take care to start a &lt;code&gt;CamelContext&lt;/code&gt;, which holds all the configured components and routes. When using Spring Boot you can use the camel-spring-boot-starter, which manages the &lt;code&gt;CamelContext&lt;/code&gt; for you. You just create routes and annotate the methods with &lt;code&gt;@Bean&lt;/code&gt;. By setting the property &lt;code&gt;camel.springboot.main-run-controller=true&lt;/code&gt; the application will stay alive even if you haven&amp;#39;t included something like Spring MVC.&lt;/p&gt;
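&lt;p&gt;A minimal sketch of what this can look like with Spring Boot (class and method names are only examples):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;import org.apache.camel.builder.RouteBuilder;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class IntegrationApplication {

    public static void main(String[] args) {
        SpringApplication.run(IntegrationApplication.class, args);
    }

    // every RouteBuilder bean is picked up and added to the managed CamelContext
    @Bean
    public RouteBuilder fileToRabbitRoute() {
        return new RouteBuilder() {
            @Override
            public void configure() {
                from(&quot;file:/opt/attachments&quot;).to(&quot;rabbitmq:host:5672/&quot;);
            }
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;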

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I really like the model of Apache Camel. It provides asynchronous processing and you can decouple different parts of the processing, e.g. by first downloading files to a certain folder and then parsing them in another route. Both of those tasks can run in isolation. Besides all the different protocols that are implemented, Apache Camel also provides solutions for cross-cutting concerns like error handling.&lt;/p&gt;

&lt;p&gt;The declarative approach can make it a bit harder for beginners to get started, but once you find your way around there is a lot of potential.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Indonesian Language in Lucene, Solr and Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2018/03/indonesian-analyzer-lucene-solr-elasticsearch.html"/>
   <updated>2018-03-23T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2018/03/indonesian-analyzer-lucene-solr-elasticsearch</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Indonesian_language&quot;&gt;Indonesian, or Bahasa Indonesia,&lt;/a&gt; is a very approachable language for westerners. It uses latin characters, there's a clear structure, no tenses, no gender or plural forms and it contains many foreign words (as a German I especially enjoy the dutch influenced terms like &lt;em&gt;knalpot&lt;/em&gt; for &lt;em&gt;exhaust pipe&lt;/em&gt;). If you're growing up outside of Asia Indonesia might be a quite distant country for you which you don't hear a lot about. But because the country is so big there are actually quite a lot of people speaking the language, making it, together with its sibling Bahasa Melayu, &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers&quot;&gt;one of the most common languages on earth&lt;/a&gt;. And if that is not enough, once you visit Indonesia you will see that the people are very positive minded and happy. Maybe another reason to be interested in the language.&lt;/p&gt; 

&lt;p&gt;&lt;img src=&quot;/files/damn-i-love-indonesia.png&quot; alt=&quot;sticker on my laptop&quot;&gt;&lt;/p&gt;

&lt;p&gt;As I&amp;#39;ve been learning a bit of Indonesian and got to spend quite some time in Indonesia for work and leisure, I thought it might be a good idea to look into the Indonesian Analyzer for Lucene and see how it processes text. If you don&amp;#39;t know what an Analyzer does I can point you to one of my older posts on &lt;a href=&quot;http://blog.florian-hopf.de/2014/04/the-absolute-basics-of-indexing-data.html&quot;&gt;the absolute basics of indexing data&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The IndonesianAnalyzer in Lucene&lt;/h2&gt;

&lt;p&gt;If you want to use the IndonesianAnalyzer, it is available with lucene-analyzers-common, which you most likely have included already. You can just create an instance and use it in any way you like. This snippet collects the terms for a given String into a list.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;analyze&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IOException&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;terms&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArrayList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;();&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Analyzer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IndonesianAnalyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;TokenStream&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;tokenStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;tokenStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;incrementToken&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;terms&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getAttribute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CharTermAttribute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;terms&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
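&lt;p&gt;Calling this method with the first example sentence from further down gives you the terms as a list, something like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// prints [makan, mie, ayam] - see the examples below
List&amp;lt;String&amp;gt; terms = analyze(&quot;Saya mau makan mie ayam.&quot;);
System.out.println(terms);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;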
&lt;h2&gt;The IndonesianAnalyzer in elasticsearch&lt;/h2&gt;

&lt;p&gt;The IndonesianAnalyzer can be used with elasticsearch as well. In the mapping you can refer to it by the analyzer name &lt;code&gt;indonesian&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;   
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;mappings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;doc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;properties&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;content&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;analyzer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;indonesian&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;               
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The elasticsearch documentation also &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#indonesian-analyzer&quot;&gt;has a section on the analyzer&lt;/a&gt; explaining how to rebuild it using different filters.&lt;/p&gt;

&lt;h2&gt;The IndonesianAnalyzer in Solr&lt;/h2&gt;

&lt;p&gt;Most of the time you would create your own analyzer chain in Solr. This is from the reference guide.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;analyzer&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;tokenizer&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;solr.StandardTokenizerFactory&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;filter&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;solr.LowerCaseFilterFactory&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;filter&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;solr.IndonesianStemFilterFactory&quot;&lt;/span&gt; 
    &lt;span class=&quot;na&quot;&gt;stemDerivational=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/analyzer&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Features of the Analyzer&lt;/h2&gt;

&lt;p&gt;Let&amp;#39;s look at a very simple example sentence first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Saya mau makan mie ayam.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It means: I want to eat chicken noodles. Not only did you learn that I like Indonesian food, but you can also see that the Indonesian language uses Latin characters and separates words by whitespace. Let&amp;#39;s see what the IndonesianAnalyzer does with this text.&lt;/p&gt;

&lt;p&gt;If you look at the terms produced by the Lucene example above you will get the following list.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;[makan, mie, ayam]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So only three of the five words are left. &lt;em&gt;Saya&lt;/em&gt; (I) and &lt;em&gt;mau&lt;/em&gt; (want to) are dropped. This is caused by a default list of stopwords, words that are considered not to be important when searching. Those words are maintained in a text file that is &lt;a href=&quot;https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt&quot;&gt;shipped with the analyzer&lt;/a&gt;. If you want to use a different list for your content you can use one of the constructors that accepts a &lt;code&gt;CharArraySet&lt;/code&gt;; for elasticsearch and Solr you can use a custom StopFilter.&lt;/p&gt;

&lt;p&gt;Now, the rest of the words remained the same; there&amp;#39;s no stemming involved yet. Stemming is a common way to process natural language by reducing terms to their base form. Let&amp;#39;s look at another example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kami, bangsa Indonesia, dengan ini menjatakan kemerdekaan Indonesia.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the first sentence of the declaration of independence of Indonesia, which was proclaimed in 1945: We, the people of Indonesia, hereby declare the independence of Indonesia.&lt;/p&gt;

&lt;p&gt;If you process this text using the Analyzer you will get the following list of terms.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;[bangsa, indonesia, jata, merdeka, indonesia]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Again, words like &lt;em&gt;kami&lt;/em&gt;, &lt;em&gt;dengan&lt;/em&gt;, &lt;em&gt;ini&lt;/em&gt; have been removed as those are in the list of stopwords. But something else has happened: &lt;em&gt;menjatakan&lt;/em&gt; became &lt;em&gt;jata&lt;/em&gt; and &lt;em&gt;kemerdekaan&lt;/em&gt; became &lt;em&gt;merdeka&lt;/em&gt;. The Indonesian language doesn&amp;#39;t have verb inflection, but there are many prefixes and suffixes that can change the meaning of words. In this case &lt;em&gt;kemerdekaan&lt;/em&gt; (independence) is a variation of &lt;em&gt;merdeka&lt;/em&gt; (independent). Some more examples: &lt;em&gt;makan&lt;/em&gt; is &lt;em&gt;to eat&lt;/em&gt;, &lt;em&gt;makanan&lt;/em&gt; is &lt;em&gt;food&lt;/em&gt;. &lt;em&gt;minum&lt;/em&gt; is &lt;em&gt;to drink&lt;/em&gt;, &lt;em&gt;minuman&lt;/em&gt; is a &lt;em&gt;drink&lt;/em&gt;. &lt;em&gt;sama&lt;/em&gt; is &lt;em&gt;same&lt;/em&gt;, &lt;em&gt;bersama&lt;/em&gt; is &lt;em&gt;together&lt;/em&gt;. The IndonesianAnalyzer will stem those examples correctly (even though &lt;em&gt;sama&lt;/em&gt; and &lt;em&gt;bersama&lt;/em&gt; are stopwords).&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Like most analyzers the IndonesianAnalyzer combines just a few other components, namely a Tokenizer and several TokenFilters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;StandardTokenizer&lt;/li&gt;
&lt;li&gt;StandardFilter&lt;/li&gt;
&lt;li&gt;LowercaseFilter&lt;/li&gt;
&lt;li&gt;StopFilter&lt;/li&gt;
&lt;li&gt;SetKeywordMarkerFilter&lt;/li&gt;
&lt;li&gt;IndonesianStemFilter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The IndonesianStemFilter is the interesting component that is responsible for the stemming. It uses the IndonesianStemmer that is based on the paper &lt;a href=&quot;http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf&quot;&gt;A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with most other rule-based stemmers, some words might not be stemmed correctly. An example: &lt;em&gt;menunggu&lt;/em&gt; means &lt;em&gt;waiting&lt;/em&gt;; it is stemmed to &lt;em&gt;unggu&lt;/em&gt;, but the correct base form would be &lt;em&gt;tunggu&lt;/em&gt;. If you want to get rid of cases like this you can either add the words to the &lt;code&gt;stemExclusionSet&lt;/code&gt; that can be passed in to the analyzer to protect them from stemming, or you can build your own analyzer that uses the StemmerOverrideFilter - maybe that&amp;#39;s material for another blogpost.&lt;/p&gt;
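&lt;p&gt;A small sketch of what passing in a stem exclusion set can look like (this assumes the analyzer constructor that takes a stopword set and an exclusion set, and the getDefaultStopSet() accessor):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// protect menunggu from being stemmed to unggu while keeping the default stopwords
CharArraySet stemExclusions = new CharArraySet(Arrays.asList(&quot;menunggu&quot;), true);
Analyzer analyzer = new IndonesianAnalyzer(IndonesianAnalyzer.getDefaultStopSet(), stemExclusions);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;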

&lt;h2&gt;Scoring&lt;/h2&gt;

&lt;p&gt;Bahasa Indonesia poses an interesting challenge when it comes to scoring search results. Scoring algorithms like TF/IDF and BM25 rely on the frequency of terms. But in Indonesian a plural is often formed by just repeating a word: &lt;em&gt;mobil&lt;/em&gt; means &lt;em&gt;car&lt;/em&gt; - &lt;em&gt;mobil mobil&lt;/em&gt; means &lt;em&gt;cars&lt;/em&gt;. But whether a text talks about a single car or multiple cars shouldn&amp;#39;t make a difference when it comes to scoring. Depending on the text you are searching it might be necessary to ignore the frequencies - or to write a custom filter that skips words that are repeated immediately.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Stemming doesn&amp;#39;t have a place in every search application. But it&amp;#39;s one of the techniques that can help make natural language more accessible without being too complex. It can make your search seem like magic.&lt;/p&gt;

&lt;p&gt;Working with natural languages is one thing I enjoy a lot when working with search engines. And if, like in this case, I learn something about the language in the process, that is even better.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review &ndash; Mastering Docker</title>
   <link href="http://blog.florian-hopf.de/2018/03/book-review-mastering-docker.html"/>
   <updated>2018-03-09T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2018/03/book-review-mastering-docker</id>
   <content type="html">&lt;p&gt;Packtpub has not the best reputation when it comes to the quality of books and that is for a reason. But there are some really good books as well, I learned Solr using &lt;a href=&quot;http://blog.florian-hopf.de/2011/02/book-review-solr-14-enterprise-search.html&quot;&gt;Solr 1.4 Enterprise Search Server&lt;/a&gt;, got more familiar with Spring Boot through &lt;a href=&quot;https://www.packtpub.com/application-development/learning-spring-boot&quot;&gt;Learning Spring Boot&lt;/a&gt; and learned some things while reading &lt;a href=&quot;http://blog.florian-hopf.de/2013/06/book-review-hibernate-search-by-example.html&quot;&gt;Hibernate Search by Example&lt;/a&gt; and &lt;a href=&quot;http://blog.florian-hopf.de/2013/02/book-review-gradle-effective.html&quot;&gt;Grade Effective Implementation Guide&lt;/a&gt;. That's why I'm trying their books from time to time again, this time when wanting to read a book on Docker, so I got me &lt;a href=&quot;https://www.packtpub.com/virtualization-and-cloud/mastering-docker-second-edition&quot;&gt;Mastering Docker by Russ McKendrick&lt;/a&gt; during their recent 5$ sale.&lt;/p&gt;

&lt;p&gt;When I picked up the book I had already started working with Docker. I wanted to read a book to get a deeper understanding of the technology, to be able to solve future problems, but also because I think learning works best while you actually use something.&lt;/p&gt;

&lt;p&gt;My expectations and needs were mostly met; I learned quite a few new things while reading it. The author goes through all the commands and tools and explains how to use them step by step using examples. This is a good thing for some (like docker and docker-compose) but might be a bit too much for others (do we really need screenshots of the profile screen on Dockerhub?). What I was definitely missing is a deeper look at some of the concepts.&lt;/p&gt;

&lt;p&gt;The most interesting parts for me were all the supporting technologies in the ecosystem like Weave and Rancher, including nearly ecstatic reactions (&amp;quot;&lt;a href='https://github.com/hashicorp/consul-template'&gt;consul-template&lt;/a&gt; can do whaaaaaat?&amp;quot;). I can say the book did its job for me. &lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Messaging with RabbitMQ</title>
   <link href="http://blog.florian-hopf.de/2018/03/rabbitmq-messaging-patterns.html"/>
   <updated>2018-03-02T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2018/03/rabbitmq-messaging-patterns</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;https://www.rabbitmq.com/&quot;&gt;RabbitMQ&lt;/a&gt; is a robust message broker that can be used to implement different messaging patterns. Even though there is an &lt;a href=&quot;https://www.rabbitmq.com/getstarted.html&quot;&gt;exellent tutorial&lt;/a&gt; available (using different languages and frameworks) it can be a bit difficult to get into the concepts. In this post I want to show some different paradigms that can be implemented with RabbitMQ and why I struggled with some of the concepts.&lt;/p&gt;

&lt;h2&gt;Sending and receiving using queues&lt;/h2&gt;

&lt;p&gt;The easiest thing to do is to use a queue for sending the messages and have a consumer that reads from the same queue.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/rabbitmq-messaging-queue.png&quot; alt=&quot;Sending and receiving using a queue&quot; title=&quot;Sending and receiving using a queue&quot;&gt;&lt;/p&gt;

&lt;p&gt;Nothing stops you from having multiple consumers that each process messages from the queue. After a message is consumed, it is gone from the queue.&lt;/p&gt;

&lt;p&gt;This is especially well suited for tasks that need to be executed where it doesn&amp;#39;t matter which of the consumers processes the task.&lt;/p&gt;

&lt;h2&gt;Publish/Subscribe&lt;/h2&gt;

&lt;p&gt;Not all use cases are such that you only want to consume a message once. Often you want multiple consumers that should all process every message. One example is storing objects in different data stores (e.g. a search index and the database), another is domain events like an order that has been submitted and should be processed by both the order management system and the inventory system. This calls for a publish/subscribe mechanism, and of course RabbitMQ has you covered here.&lt;/p&gt;

&lt;p&gt;The biggest difference compared to using a queue alone is that in this case the producer doesn&amp;#39;t write to the queue directly anymore. An instance of what is called an Exchange accepts messages and forwards them to one or more queues.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/rabbitmq-messaging-publish-subscribe.png&quot; alt=&quot;Publish/Subscribe&quot; title=&quot;Publish/Subscribe&quot;&gt;&lt;/p&gt;

&lt;p&gt;To have a classic publish/subscribe model you would use a FanoutExchange, which forwards the messages to one or more queues. To connect an exchange and a queue you declare a binding, in this case stating that all messages for a certain exchange should be forwarded to a certain queue. &lt;/p&gt;

&lt;p&gt;Each consumer reads messages from a dedicated queue. That also means that you will need one binding for each consumer that is listening.&lt;/p&gt;

&lt;p&gt;With RabbitMQ it is possible to use queues that are automatically deleted when the consumer stops listening. This allows for very dynamic behaviour with consumers joining and leaving.&lt;/p&gt;
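&lt;p&gt;With the plain RabbitMQ Java client this could look roughly like the following sketch (the exchange name and the message are made up for the example):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class FanoutExample {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(&quot;localhost&quot;);
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // a fanout exchange ignores the routing key and forwards to all bound queues
            channel.exchangeDeclare(&quot;orders&quot;, &quot;fanout&quot;);

            // each consumer gets its own automatically deleted queue bound to the exchange
            String queue = channel.queueDeclare().getQueue();
            channel.queueBind(queue, &quot;orders&quot;, &quot;&quot;);
            channel.basicConsume(queue, true,
                (consumerTag, delivery) -&amp;gt; System.out.println(new String(delivery.getBody())),
                consumerTag -&amp;gt; { });

            // the producer publishes to the exchange, not to a queue
            channel.basicPublish(&quot;orders&quot;, &quot;&quot;, null, &quot;order submitted&quot;.getBytes());

            // give the consumer a moment before the connection is closed
            Thread.sleep(1000);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;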

&lt;p&gt;Looking at our first example you might be wondering why there are different ways to send messages, to queues and to exchanges. It turns out that sending to queues directly really is not possible. There is always a default exchange involved that just forwards the messages: it takes the messages for a certain routing key (which is the queue name) and puts them in the queue with the same name. &lt;/p&gt;

&lt;h2&gt;Publish/Subscribe with filtering&lt;/h2&gt;

&lt;p&gt;Besides sending messages to all queues that are registered for an exchange, it is also possible to filter them according to a routing key. All messages are sent to an exchange, and this exchange decides, by looking at the routing key, to which queues a message should be sent. &lt;/p&gt;

&lt;p&gt;If you want to do this for an exact match of the routing key, this is done by the DirectExchange.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/rabbitmq-messaging-routing.png&quot; alt=&quot;Routing&quot; title=&quot;Routing&quot;&gt;&lt;/p&gt;

&lt;p&gt;When binding a DirectExchange to a queue you need to supply a routing key that determines which messages will be considered for this queue. If you want to match multiple routing keys you can just add multiple bindings for the same queue. &lt;/p&gt;

&lt;p&gt;You can also supply wildcards to determine which routing keys should be used for a queue; this is done by using a TopicExchange, which expects hierarchical routing keys. &lt;/p&gt;
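&lt;p&gt;The binding sketch below (again with made-up exchange and queue names, and assuming an open channel and already declared queues) shows the difference: the direct exchange matches the routing key exactly, while the topic exchange allows wildcards on the segments of a hierarchical key.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;// direct exchange: a queue only receives messages whose routing key matches exactly
channel.exchangeDeclare(&quot;tasks&quot;, &quot;direct&quot;);
channel.queueBind(&quot;import-queue&quot;, &quot;tasks&quot;, &quot;import&quot;);
channel.queueBind(&quot;import-queue&quot;, &quot;tasks&quot;, &quot;reimport&quot;); // multiple bindings are possible

// topic exchange: hierarchical routing keys with wildcards (* matches one word, # zero or more)
channel.exchangeDeclare(&quot;events&quot;, &quot;topic&quot;);
channel.queueBind(&quot;order-events&quot;, &quot;events&quot;, &quot;order.*&quot;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;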

&lt;p&gt;Those two exchanges - the direct and the topic exchange - probably confused me the most at the beginning. When it comes to topics I think about classic publish/subscribe systems like when using a FanoutExchange, where the exchange name is the topic clients register for. But here topic refers to a kind of routing on an existing exchange. Same with the DirectExchange: I would have expected a direct exchange to be similar to the first example where you send messages to queues directly. But a direct exchange in this case refers to direct routing, and you always need to supply a routing key for it. &lt;/p&gt;

&lt;p&gt;If you want to know more about the different kinds of exchanges head over to the &lt;a href=&quot;https://www.rabbitmq.com/getstarted.html&quot;&gt;tutorials on the RabbitMQ website&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review &ndash; Working Effectively with Legacy Code</title>
   <link href="http://blog.florian-hopf.de/2018/02/book-review-working-effectively-legacy-code.html"/>
   <updated>2018-02-23T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2018/02/book-review-working-effectively-legacy-code</id>
   <content type="html">&lt;p&gt;I am feeling a bit embarrassed that I read &lt;a href=&quot;https://www.goodreads.com/book/show/44919.Working_Effectively_with_Legacy_Code&quot;&gt;Working Effectively with Legacy Code by Michael Feathers&lt;/a&gt; only recently. It has been recommended so many times to me and &lt;a href=&quot;http://www.dev-books.com/&quot;&gt;it's the top mentioned book on Stackoverflow&lt;/a&gt; but somehow I also expected that it doesn't contain any new revelations for me &amp;ndash; which is far from reality.&lt;/p&gt;

&lt;p&gt;You will have heard Michaels definition of legacy code before: Legacy code is code without tests. It&amp;#39;s no surprise that the book focuses a lot on strategies for getting legacy code under test. It consists of many rather small chapters that each focus on a certain aspect of working with legacy code with chapter titles like &lt;i&gt;This Class is Too Big and I Don&amp;#39;t Want It to Get Any Bigger&lt;/i&gt; or &lt;i&gt;I&amp;#39;m Changing the Same Code All Over the Place&lt;/i&gt;. I&amp;#39;ve been reading the book cover to cover but those chapters can also be read standalone, especially when coming back to the book later. The examples are mostly in Java and some in C++ (that poses some unique problems).&lt;/p&gt;

&lt;p&gt;The most valuable aspect of the book is that it provides a very structured way of thinking about legacy code. Not only by providing recipes but also by defining certain terms, the most important one being the &lt;em&gt;seam&lt;/em&gt;, a place in your code that you can use to replace some behaviour in tests. This can be to &lt;em&gt;separate&lt;/em&gt; some unwanted blocks of code (e.g. a remote http call, a database interaction) or to &lt;em&gt;sense&lt;/em&gt; an interaction &amp;ndash; to be able to see that a certain code path is being taken and maybe even look at the variables at this point. To get seams into place you use &lt;em&gt;dependency breaking&lt;/em&gt; techniques. The book contains a catalog of those in the last part, some examples being &lt;i&gt;Extract and Override Call&lt;/i&gt; or &lt;i&gt;Extract Interface&lt;/i&gt; .&lt;/p&gt;

&lt;p&gt;Only while listening to a &lt;a href=&quot;http://www.se-radio.net/2017/06/se-radio-episode-295-michael-feathers-on-legacy-code/&quot;&gt;recent interview with Michael on SE-Radio&lt;/a&gt; I noticed that the book doesn&amp;#39;t mention a technique you will hear often when people talk about legacy code: Implementing a system wide test before doing any changes, e.g. by using something like Selenium or &lt;a href=&quot;http://blog.thecodewhisperer.com/permalink/surviving-legacy-code-with-golden-master-and-sampling&quot;&gt;golden master testing&lt;/a&gt;. Instead Michael focuses on writing tests that are close to the change and will do some refactorings without having tests in place to get the system in a testable state.&lt;/p&gt;

&lt;p&gt;I&amp;#39;d be very interested how a book like this (that I still consider very relevant in its 2004 state) would look like if it was written today with all the more advanced mocking and system test tools available. I&amp;#39;m excited that there is &lt;a href=&quot;https://www.amazon.com/Brutal-Refactoring-Working-Effectively-Legacy/dp/032179320X/&quot;&gt;an announcement for a new book on the same topic by Michael&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;A lot of people tend to shy away from working with legacy code. But even in your greenfield project it won&amp;#39;t take long to see areas that need improvement. So the techniques in this book are definitively relevant for everybody. If you haven&amp;#39;t read the book yet &amp;ndash; go ahead and do it.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Service Testing with Docker Containers</title>
   <link href="http://blog.florian-hopf.de/2018/02/service-testing-docker-containers.html"/>
   <updated>2018-02-14T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2018/02/service-testing-docker-containers</id>
   <content type="html">&lt;p&gt;During the recent months I've been helping a company improving their automated testing practices. Besides doing coaching on TDD I also had the chance to work on a project consisting of multiple services where I was able to introduce some service tests using Docker. It's the first time I've used Docker on a project for real and I was quite happy how useful it can be for doing service tests in a distributed environment. In this post I will describe a few of the things I did and learnt along the way.&lt;/p&gt;

&lt;h2&gt;Motivation&lt;/h2&gt;

&lt;p&gt;It&amp;#39;s common wisdom that having tests at the lowest level can be very beneficial, as symbolized by the &lt;a href=&quot;https://martinfowler.com/bliki/TestPyramid.html&quot;&gt;test pyramid&lt;/a&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests can be executed a lot faster&lt;/li&gt;
&lt;li&gt;It&amp;#39;s easier to identify why a test failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But having unit tests often is not enough. In reality, failures often hide in the integration of components, be it on the technical level or in the interaction of the components. &lt;/p&gt;

&lt;p&gt;If all your remote calls are mocked, you might never notice that you have configured your http client the wrong way. If you never run your tests on a real database, you might never notice that your transaction is never being committed or that the SQL you use for migrating the table only works on your in-memory database.&lt;/p&gt;

&lt;p&gt;One way to solve this is building end-to-end tests that drive the application through the frontend and execute real user behaviour on the system. This might sound like a good idea at first, but it often leads to a very fragile test suite. Tests might pass on one day and fail on another.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some network segment might not be available&lt;/li&gt;
&lt;li&gt;Some other application team might have dropped their database on the staging environment &lt;/li&gt;
&lt;li&gt;A million other things can have happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But all you can see is that your tests failed and you have to investigate why.  &lt;/p&gt;

&lt;p&gt;You can have far more reliable tests if you only test some components in isolation. Start up a service with all its required dependencies, execute a request on it and see that the result is as expected, be it a response, an entry in a database, a message on a queue or anything else. &lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/service-tests.png&quot; alt=&quot;Arrange, Act and Assert&quot; title=&quot;Arrange, Act and Assert&quot;&gt;&lt;/p&gt;

&lt;p&gt;If there are any remote services that are being used by the system, you can integrate very simple mock services. After all, the intention of those tests is not to test everything that can be tested but only some representative areas. &lt;/p&gt;

&lt;h2&gt;Docker&lt;/h2&gt;

&lt;p&gt;Now, how does &lt;a href=&quot;https://www.docker.com/&quot;&gt;Docker&lt;/a&gt; help with this? It allows you to easily start your components as separate containers that can interact with each other. Each container runs one of the components: there is one container for each database, one container for the service, one container for each mock and so on. &lt;/p&gt;

&lt;p&gt;Containers are started from images that are specified using a Dockerfile. An image can extend an existing image (e.g. one that provides a Java runtime) and add application-specific settings (e.g. runtime flags). An example: the following Dockerfile is what is generated by JHipster for a service.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;FROM openjdk:8-jre-alpine

ENV SPRING_OUTPUT_ANSI_ENABLED=ALWAYS \
    JHIPSTER_SLEEP=0 \
    JAVA_OPTS=&quot;&quot;

# add directly the war
ADD *.war /app.war

EXPOSE 8081
CMD echo &quot;The application will start in ${JHIPSTER_SLEEP}s...&quot; &amp;amp;&amp;amp; \
    sleep ${JHIPSTER_SLEEP} &amp;amp;&amp;amp; \
    java ${JAVA_OPTS} -Djava.security.egd=file:/dev/./urandom -jar /app.war
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It extends a JRE image that provides the Java runtime, adds the war file with the application code and defines the command to start it. This Dockerfile can be used to create an image using &lt;code&gt;docker build&lt;/code&gt; and it can be run using &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you have multiple containers that should be run together, as is the case with the tests we are executing here, you can use docker-compose. You specify the services in a yml file, where you can pass in environment variables and many other settings. A simple example for a service that uses PostgreSQL: &lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;s&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;2'&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;my-service&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my-service&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;SPRING_DATASOURCE_URL=jdbc:postgresql://postgresql:5432/my-service&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;SPRING_DATASOURCE_USERNAME=user&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;SPRING_DATASOURCE_PASSWORD=&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;postgresql&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;postgres:9.6.2&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;POSTGRES_USER=user&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;POSTGRES_PASSWORD=&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;5432:5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This starts two containers: one for the service &lt;code&gt;my-service&lt;/code&gt; and one for the database. During execution there will be a dedicated network that allows the containers to be resolved by their service name; this is why we can use the URL &lt;code&gt;jdbc:postgresql://postgresql:5432/my-service&lt;/code&gt; in this example. &lt;/p&gt;

&lt;p&gt;Which environment variables are available of course depends on the image you are using. For PostgreSQL you can see that you can define a user and a password.&lt;/p&gt;

&lt;p&gt;One thing you want to make sure of when working with existing images for databases and other components: always use a fixed version, don&amp;#39;t use &lt;code&gt;latest&lt;/code&gt; or you might be in for some surprises when running on different machines.&lt;/p&gt;

&lt;h2&gt;Writing Tests&lt;/h2&gt;

&lt;p&gt;What kind of tests you write of course heavily depends on the kind of application you are looking at. For many applications it could mean sending an HTTP request to the service and checking afterwards if there is a new entry in the database. There can also be many other results to check: a file that is created, a message that is sent to a queue, another request that is sent to another service.&lt;/p&gt;

&lt;p&gt;As you don&amp;#39;t need to communicate with your service in process, there is no need to use the same technology for the tests, and even if you do I don&amp;#39;t think you should share any code. For example, for a Java application that uses Hibernate I wouldn&amp;#39;t reuse the entity classes in the tests but use plain JDBC or any other technology instead. Just implement the parts that you really need to validate the basic functionality.&lt;/p&gt;

&lt;p&gt;A Java library that can be quite useful for writing this kind of test is &lt;a href=&quot;https://github.com/awaitility/awaitility&quot;&gt;Awaitility&lt;/a&gt;. It implements mechanisms for testing asynchronous interactions, mostly by means of polling. The code for waiting for a condition can look something like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;    &lt;span class=&quot;n&quot;&gt;await&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;atMost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;until&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hasNewEntry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;with &lt;code&gt;hasNewEntry()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Callable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hasNewEntry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jdbcTemplate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;queryForObject&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;select count(*) from my_database_entry&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The easiest way to run the tests is to start a container for them as well, which makes the other services available by their service name as hostname in the same network. For Java based tests you can derive your image from the Maven image and add your project files to it. By adding the flag &lt;code&gt;--abort-on-container-exit&lt;/code&gt; to the docker-compose run you can make sure that all containers are shut down when one container ends, which most likely will be the test container.&lt;/p&gt;
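
&lt;p&gt;Running everything could then look something like this (assuming the services and the test container are defined in a docker-compose.yml in the current directory):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;# build the images, start all services and stop everything once one container exits
docker-compose up --build --abort-on-container-exit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;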

&lt;p&gt;You can add the integration tests to the Docker Compose file:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;    &lt;span class=&quot;s&quot;&gt;integrationtest&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;integrationtest&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;command&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;./wait-for-it.sh -t 150 my-service:8081 -- mvn test&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;$PWD/target/surefire-reports:/app/target/surefire-reports&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;~/.m2/repository:/root/.m2/repository&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I am using &lt;a href=&quot;https://github.com/vishnubob/wait-for-it&quot;&gt;wait-for-it&lt;/a&gt;, which only executes a command once a certain service is available. In this case we are waiting until something is listening at &lt;code&gt;my-service:8081&lt;/code&gt;, with a timeout of 150 seconds.&lt;/p&gt;

&lt;p&gt;The test output directory and the local Maven repository are added as volumes, which makes them available in the container filesystem. The repository prevents downloading the artifacts again and again, and the test output directory is where the reports are written.&lt;/p&gt;

&lt;p&gt;Once a setup like this is in place people might be keen to write as many tests as possible using this approach, testing different inputs and error conditions. Don&amp;#39;t. Just use it for what it is meant for: testing the integration of the different technologies. For testing the business logic go with smaller integration or unit tests.&lt;/p&gt;

&lt;p&gt;When there are other services involved you of course need to make sure that those are at least available. Having very simple dummies that return predefined results can get you a long way already. Again, you can use any technology for this. For HTTP based services tools like &lt;a href=&quot;https://expressjs.com/&quot;&gt;Express&lt;/a&gt; or &lt;a href=&quot;https://spring.io/guides/gs/serving-web-content/&quot;&gt;Spring MVC&lt;/a&gt; can be good choices.&lt;/p&gt;
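
&lt;p&gt;Just to illustrate, a dummy for a downstream HTTP service could be as small as this Spring Boot application - the class name, the endpoint and the payload are made up for this example:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;import java.util.Collections;
import java.util.Map;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// stand-in for a real downstream service, only returning canned data
@SpringBootApplication
@RestController
public class DummyPartnerService {

    @GetMapping(&quot;/partners/{id}&quot;)
    public Map&amp;lt;String, String&amp;gt; partner(@PathVariable String id) {
        // predefined result, just enough for the service under test to continue
        return Collections.singletonMap(&quot;id&quot;, id);
    }

    public static void main(String[] args) {
        SpringApplication.run(DummyPartnerService.class, args);
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;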

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;It took me a while to get used to the different concepts in Docker and how to combine everything, but it can be a really powerful tool to test services in isolation.&lt;/p&gt;

&lt;p&gt;This approach is a lot better than what I have seen in many companies: using real services in a dev or staging environment for testing, which means that when those go down or contain different data than expected your tests will fail. For a more modern approach of testing in production (which I don&amp;#39;t have a lot of experience with) have a look at &lt;a href=&quot;https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16&quot;&gt;this post by Cindy Sridharan&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review - NoSQL Distilled</title>
   <link href="http://blog.florian-hopf.de/2017/11/book-review-nosql-distilled.html"/>
   <updated>2017-11-29T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/11/book-review-nosql-distilled</id>
   <content type="html">&lt;p&gt;Like many others I have a tendency of &lt;a href=&quot;https://en.wikipedia.org/wiki/Tsundoku&quot;&gt;buying more books than I can read&lt;/a&gt;. &lt;a href=&quot;https://martinfowler.com/books/nosql.html&quot;&gt;NoSQL Distilled&lt;/a&gt; by Pramod J. Sadalage and Martin Fowler is one of those books I got lying around for quite some time. I don't regret at all having it picked up now.&lt;/p&gt;

&lt;p&gt;The short book consists of two parts, one that goes into the concepts behind the technologies and one that shows some details of concrete NoSQL databases. You will learn about the different data models, data distribution using replication and sharding, consistency and the CAP theorem. The second part introduces the specialities of each type of data store, how to handle schema migrations and how to choose a data store.&lt;/p&gt;

&lt;p&gt;NoSQL Distilled builds on the very useful notion of distinguishing aggregate oriented and non aggregate oriented databases. Document databases, columnar and key value stores are all aggregate oriented, in the sense that all of the data for one aggregate is stored together and can be retrieved using a single key. Relational and graph databases are aggregate ignorant because the aggregates in your application will be stored in different tables or nodes and only combined during reads or updates.&lt;/p&gt;
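
&lt;p&gt;To make the distinction concrete (this little example is mine, not taken from the book): in an aggregate oriented store an order together with its line items can be kept as a single document and fetched with one key, while a relational model would spread the same data across several tables.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;{
  &quot;orderId&quot;: &quot;1234&quot;,
  &quot;customer&quot;: &quot;Jane Doe&quot;,
  &quot;items&quot;: [
    { &quot;title&quot;: &quot;NoSQL Distilled&quot;, &quot;quantity&quot;: 1 }
  ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;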

&lt;p&gt;The book is short and easy to read. The authors mention that they intended it to be read during a plane flight. Though I have read some parts of it on a long distance flight, I am not too sure I could really manage to consume all of the information in one session :). With a moving target like NoSQL you might think that, more than five years after publication, there would be lots of outdated material in the book. But the authors managed to really extract the concepts behind the technology, so most of the book is still as relevant as when it was first published. I can recommend it, and definitely not only if this is your first contact with NoSQL.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Learning to Build Clojure Webapps</title>
   <link href="http://blog.florian-hopf.de/2017/10/learning-clojure-webapps.html"/>
   <updated>2017-10-20T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/10/learning-clojure-webapps</id>
   <content type="html">&lt;p&gt;A while ago I gave a talk at an internal event at Zenika Singapore. We were free to choose a topic so I chose something I thought I didn't know enough about - what it feels like to build a web app in Clojure. This post is a transcript of the talk. I'll go into some details on Clojure, which libraries you can use to build web apps and how all of that felt to me.&lt;/p&gt;

&lt;h2&gt;Clojure&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://clojure.org/&quot;&gt;Clojure&lt;/a&gt; is a LISP dialect that targets the JVM, Microsoft's CLR and JavaScript by means of ClojureScript. It is a functional and dynamic language. One of its specialities is its immutable data structures, which rely on structural sharing when appending or removing elements. This allows for good performance while still maintaining immutability.&lt;/p&gt;

&lt;p&gt;I got mainly interested in Clojure because&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it is very different from all the C based languages around &lt;/li&gt;
&lt;li&gt;while still being a general purpose language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to some of its features like Software Transactional Memory (STM) and atoms it can be especially well suited for building correct concurrent applications.&lt;/p&gt;

&lt;p&gt;One aspect that I deem important as well: the community seems to be quite friendly, and I liked both of the user groups I attended, &lt;a href=&quot;https://www.meetup.com/de-DE/Clojure-Berlin/&quot;&gt;Clojure Berlin&lt;/a&gt; and the &lt;a href=&quot;https://www.meetup.com/de-DE/Singapore-Clojure-Meetup/&quot;&gt;Clojure Meetup Singapore&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to get started with Clojure it&amp;#39;s best to start with the very common build tool &lt;a href=&quot;https://leiningen.org/&quot;&gt;Leiningen&lt;/a&gt;. Among other things it offers a simple way to build and run projects, standalone or in the REPL, a feature that is commonly used with LISP dialects.&lt;/p&gt;

&lt;p&gt;Using the REPL and Leiningen alone will not be enough for you; you at least need something to edit files. Naturally many people seem to use Emacs, but besides that there is also &lt;a href=&quot;https://cursive-ide.com/&quot;&gt;Cursive&lt;/a&gt; (built on IntelliJ), &lt;a href=&quot;http://lighttable.com/&quot;&gt;Lighttable&lt;/a&gt; (an experimental IDE) and &lt;a href=&quot;https://sekao.net/nightcode/&quot;&gt;Nightcode&lt;/a&gt; (a very simple editor with a built in REPL).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/nightcode.png&quot; alt=&quot;Nightcode&quot;&gt; &lt;/p&gt;

&lt;p&gt;To get started with a project you can just install Leiningen and fire up a REPL using &lt;code&gt;lein repl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let&amp;#39;s start with a simple operation, adding two numbers.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (+ 2 3)
5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This already shows two of the more unusual features of Clojure. First, all of the code is represented as a list. That is why even the operation is enclosed in parentheses. Second, Clojure uses prefix notation even for mathematical operations. But this also has the benefit that you can just increase the number of parameters for the add operation.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (+ 2 3 4 5)
14
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And if you think about it: that is not that unusual at all. If you see + as the name of a function (which it is), this is very similar to the way you would call a function in a C-like language.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;plus(2, 3, 4, 5) 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Of course you can also assign the result of a calculation to a variable.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (def result (+ 2 3 4 5))
#'user/result
user=&amp;gt; result
14
user=&amp;gt; (- result 1)
13
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Besides numeric values there are also other data types, e.g. strings and boolean values that you can use directly.&lt;/p&gt;

&lt;p&gt;It is very easy to create functions as well.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (defn append-mod [val] (str val &quot;-mod&quot;))
#'user/append-mod
user=&amp;gt; (append-mod &quot;some-value&quot;)
&quot;some-value-mod&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The more interesting data structures are the collections. There are vectors for sequential data.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (def characters [&quot;a&quot; &quot;b&quot; &quot;c&quot;])
#'user/characters
user=&amp;gt; (characters 0)
&quot;a&quot;
user=&amp;gt; (characters 1)
&quot;b&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there are maps, often used with so called keywords as keys.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (def my-map {:key &quot;value&quot; :foo &quot;bar&quot;})
#'user/my-map
user=&amp;gt; (my-map :key)
&quot;value&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Being a functional language, it is very common to do transformations on collections, e.g. using the map operation. This also shows functions as first class citizens, as seen by passing in the &lt;code&gt;upper-case&lt;/code&gt; function.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (map clojure.string/upper-case characters)
(&quot;A&quot; &quot;B&quot; &quot;C&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Important: this doesn&amp;#39;t modify the existing collection - it creates a new one. This is all possible due to the efficient implementation of the data structures.&lt;/p&gt;

&lt;h3&gt;Challenges&lt;/h3&gt;

&lt;p&gt;It is true that Clojure has a simple syntax that makes it easy to get started. But there are still many things to learn, and when beginning it doesn&amp;#39;t matter that much whether something is syntax, a macro or a library call. It can be especially confusing that sometimes there are related but different concepts. For example, these are all different ways to create a function, each for a different purpose - the short REPL session after the list shows them side by side.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;(defn name [] (body))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(def name (fn [] (body)))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#(body)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
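
&lt;p&gt;For illustration (this is my own example, not from the talk), the def/fn form and the shorthand reader form behave just like the append-mod function defined earlier:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (def append-mod-2 (fn [val] (str val &quot;-mod&quot;)))
#'user/append-mod-2
user=&amp;gt; (append-mod-2 &quot;some-value&quot;)
&quot;some-value-mod&quot;
user=&amp;gt; (#(str % &quot;-mod&quot;) &quot;some-value&quot;)
&quot;some-value-mod&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;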

&lt;p&gt;One thing that is still difficult for me is to decide how to structure programs in Clojure. It seems to be common to have a lot of functions in the same namespace. I imagine that it is difficult to decide which function is responsible for which data or task.&lt;/p&gt;

&lt;p&gt;Finally, at least to me, the docs can be confusing.&lt;/p&gt;

&lt;h2&gt;Building Web Apps&lt;/h2&gt;

&lt;p&gt;As a general purpose language Clojure can of course be, and is, used a lot for building web apps as well. Compared to the Java landscape there is a lot less choice; a very common combination is to use three libraries: &lt;a href=&quot;https://github.com/ring-clojure/ring&quot;&gt;Ring&lt;/a&gt; as the core library, &lt;a href=&quot;https://github.com/weavejester/compojure&quot;&gt;Compojure&lt;/a&gt; for routing and &lt;a href=&quot;https://github.com/weavejester/hiccup&quot;&gt;Hiccup&lt;/a&gt; for templating.&lt;/p&gt;

&lt;p&gt;Ring is implemented as a pipeline. First there is an adapter that is used to map to an internal request/response representation and that adapts to an existing web runtime. There is an adapter that uses Jetty and another one that only relies on the Servlet API, which makes it possible to deploy a Ring app to any servlet container.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/ring.png&quot; alt=&quot;Ring flow&quot;&gt; &lt;/p&gt;

&lt;p&gt;Next there are the middlewares. Those are like filters in the Java servlet world and can be used to enhance the application. By default some middlewares are configured, e.g. for handling parameters, sessions or cookies.&lt;/p&gt;

&lt;p&gt;In the end a request hits the handler, which transforms the request into a response.&lt;/p&gt;
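
&lt;p&gt;Just to illustrate the idea (the namespace name is made up), a handler is simply a function from a request map to a response map, and a middleware wraps such a handler to return a new one:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;(ns example.core
  (:require [ring.middleware.params :refer [wrap-params]]))

;; a handler takes the request map and returns a response map
(defn handler [request]
  {:status 200
   :headers {&quot;Content-Type&quot; &quot;text/html&quot;}
   :body &quot;&amp;lt;h1&amp;gt;Hello World&amp;lt;/h1&amp;gt;&quot;})

;; a middleware wraps the handler, here adding parsed query parameters to the request
(def app (wrap-params handler))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;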

&lt;p&gt;Compojure can help you implement the routing for those handlers. This is a simple example that creates one successful route and one function to handle paths that are not found.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;(defroutes app
  (GET &quot;/&quot; [] &quot;&amp;lt;h1&amp;gt;Hello World&amp;lt;/h1&amp;gt;&quot;)
  (route/not-found &quot;&amp;lt;h1&amp;gt;Page not found&amp;lt;/h1&amp;gt;&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you want to get started yourself the easiest way is to use the &lt;a href=&quot;https://github.com/weavejester/compojure-template&quot;&gt;leiningen-compojure template&lt;/a&gt; that creates the application skeleton for you.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;lein new compojure my_project_name
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will create the necessary folder structure, a leiningen file, a handler and a test for it.&lt;/p&gt;

&lt;p&gt;If you don&amp;#39;t want to write all the HTML in strings yourself like in the example above, it&amp;#39;s time for a templating library. One that is often used in combination with Ring and Compojure is Hiccup. It allows you to write Clojure code that is translated to HTML.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;user=&amp;gt; (html [:span {:class &quot;foo&quot;} &quot;bar&quot;])
&quot;&amp;lt;span class=\&quot;foo\&quot;&amp;gt;bar&amp;lt;/span&amp;gt;&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I developed a very simple application using those libraries: a tool for seeing and adding vocabulary. Once started you can see a list of vocabulary, add more words and look up translations. I wouldn&amp;#39;t say it&amp;#39;s production ready for now, one reason being that it stores all the data in memory only :) You can find the &lt;a href=&quot;https://github.com/fhopf/clojure-vocabulary&quot;&gt;source code on Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/clojabulary.png&quot; alt=&quot;The worlds most advanced vocabulary trainer&quot;&gt; &lt;/p&gt;

&lt;p&gt;It contains a GET and a POST route for reading the list and adding a word, as well as a Hiccup template.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;It was fun working with a new language, and after grasping the basics of Clojure it is easy to get started with a webapp, mainly because the Leiningen Compojure template makes it easy to set up a new project. There is a lot of ready made functionality available in the Ring middlewares. Hiccup still feels a bit weird to me; I am not too sure if many frontend developers are keen on working with it.&lt;/p&gt;

&lt;p&gt;I wish I had had more time to prepare the talk and work on the example. There are still a lot of things I have no idea about and I hope I can find some time to continue learning. Even though I don&amp;#39;t foresee using Clojure in a project at work anytime soon, it can be very beneficial to learn about new approaches - your perception of some language features can change.&lt;/p&gt;

&lt;p&gt;If you want to get started with Clojure as well, I can especially recommend &lt;a href=&quot;https://aphyr.com/posts/301-clojure-from-the-ground-up-welcome&quot;&gt;Aphyr&amp;#39;s series on the basics&lt;/a&gt;. I have seen a useful talk, given by different InnoQ people over the years, that covers the web libraries in particular. &lt;a href=&quot;https://www.youtube.com/watch?v=3Xm_nVqxowk&quot;&gt;A recording by Michael Vitz is available on YouTube&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Centralized Logging Night Class</title>
   <link href="http://blog.florian-hopf.de/2017/09/elastic-night-class.html"/>
   <updated>2017-09-22T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/09/elastic-night-class</id>
   <content type="html">&lt;p&gt;This week my colleague Joanna and I were running a &lt;a href=&quot;https://www.meetup.com/ElasticSG/events/243138279/&quot;&gt;night class on centralized logging&lt;/a&gt; at the &lt;a href=&quot;https://www.meetup.com/ElasticSG/&quot;&gt;elastic meetup Singapore&lt;/a&gt; for the first time. We had lots of help of the organizer Alberto who also managed to get a room at his employer Pivotal Labs. Pizza was sponsored by my employer &lt;a href=&quot;http://zenika.sg&quot;&gt;Zenika&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our intention was to have people get in touch with centralized logging and the Elastic Stack for the first time by conducting a 2 hour workshop. The reception was very good: besides the 60 people registered there were also 60 on the waiting list. In the end around 30 people turned up, which is the expected amount - in Singapore, like in many other cities, more people sign up for events than actually show up.&lt;/p&gt;

&lt;p&gt;We started off with a talk introducing the topic, all of the components and how they play together. We covered Filebeat, Logstash, Kibana and the different scaling mechanisms. Conveniently we could reuse slides that we normally use for trainings. Some Elastic folks were present as well and helped with questions on newer features.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/elastic-nightclass-workshop.png&quot; alt=&quot;Workshop&quot;&gt;&lt;/p&gt;

&lt;p&gt;After that we had a pizza break before diving into the exercises. We chose a very simple setup of parsing and indexing access logs generated by &lt;a href=&quot;https://github.com/kiritbasu/Fake-Apache-Log-Generator&quot;&gt;a script&lt;/a&gt;. We provided the participants with instructions as well as the slides in a GitHub repo. Joanna and I, together with some of the Elastic folks, tried to help people when they were struggling.&lt;/p&gt;

&lt;p&gt;Some of the questions I can remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deciding when to use Logstash or when to send data to elasticsearch directly.&lt;/li&gt;
&lt;li&gt;How to secure the system. How to do alerting. Those questions will surely be appreciated by elastic. &lt;/li&gt;
&lt;li&gt;How long does it take to get an initial system up and running.&lt;/li&gt;
&lt;li&gt;Which kind of logs can be parsed automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the problems people had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding the right artifact to download from the elastic website (Linux tar.gz downloaded for mac, finding the way around the website, ...)&lt;/li&gt;
&lt;li&gt;Startup problems due to user rights: starting the system as root (which elasticsearch will not allow), or starting with a user that has no rights to read the configuration&lt;/li&gt;
&lt;li&gt;Finding the right script to start elasticsearch (no .sh extension)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common issue I noticed during trainings, whitespace in yaml files, didn&amp;#39;t play a role at all. Maybe we just didn&amp;#39;t make enough configuration changes yet.&lt;/p&gt;

&lt;p&gt;There are a few things I would do differently for the next event.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have a shorter break. People enjoy socializing but we didn&amp;#39;t have enough time left for the exercises.&lt;/li&gt;
&lt;li&gt;Ease the setup: Maybe provide a container instead of letting the user set up all of the components.&lt;/li&gt;
&lt;li&gt;Introduce elasticsearch. That is something we just didn&amp;#39;t do.&lt;/li&gt;
&lt;li&gt;Show people how to use the console in Kibana to do simple queries to elasticsearch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/files/elastic-nightclass-pizza.png&quot; alt=&quot;Pizza&quot;&gt;&lt;/p&gt;

&lt;p&gt;Most of the people seemed to be happy with the event, and for us as the speakers it&amp;#39;s a great way to get to know many different people. There&amp;#39;s a lot more interaction than when just giving a talk up front. We are planning to do more events like this in the future.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Elasticsearch and the Languages of Singapore</title>
   <link href="http://blog.florian-hopf.de/2017/09/elasticsearch-languages-singapore.html"/>
   <updated>2017-09-04T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/09/elasticsearch-languages-singapore</id>
   <content type="html">&lt;p&gt;In June I gave a short talk at the first edition of &lt;a href=&quot;https://voxxeddays.com/singapore/&quot;&gt;Voxxed Days Singapore&lt;/a&gt; on using Elasticsearch to search the different languages of Singapore. This is a transcript of the talk, a video recording is available as well. We'll first look at some details of the data storage in elasticsearch before we see how it can be used to search the four official languages of Singapore.&lt;/p&gt;

&lt;p&gt;If you are already familiar with Elasticsearch you can also jump to the &lt;a href=&quot;#languages-singapore&quot;&gt;section on the different languages&lt;/a&gt; directly.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/QC2JwQimdoM&quot; frameborder=&quot;0&quot; allowfullscreen&gt;&lt;/iframe&gt;

&lt;h2&gt;Elasticsearch&lt;/h2&gt;

&lt;p&gt;Elasticsearch is a distributed search engine; communication, queries and configuration are mainly done using HTTP and JSON. It is written in Java and based on the popular library Lucene. Mostly by means of Lucene, Elasticsearch provides support for searching a multitude of natural languages, and that is what we are looking at in this article.&lt;/p&gt;

&lt;p&gt;Getting started with Elasticsearch is pretty easy. On their website you can download different archives that you can just unpack. They contain scripts that can then be executed; the only prerequisite is a recent version of the Java Virtual Machine.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.1.zip
&lt;span class=&quot;c&quot;&gt;# zip is for Windows and Linux&lt;/span&gt;
unzip elasticsearch-5.4.1.zip
elasticsearch-5.4.1/bin/elasticsearch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once started you can directly access the HTTP interface of Elasticsearch on the default port 9200.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.florian-hopf.de/artikel-vortraege/elasticsearch-languages-singapore-voxxed/images/elasticsearch-up.png&quot; alt=&quot;response of intital request to elasticsearch&quot;&gt; &lt;/p&gt;

&lt;p&gt;Without any configuration you can then start writing data to the search index. I am using curl in the examples but you can use any HTTP client.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;curl -XPOST &lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:9200/voxxed/doc&quot;&lt;/span&gt; -d &lt;span class=&quot;s1&quot;&gt;'
&amp;gt; {
&amp;gt; &quot;title&quot;: &quot;Hello world!&quot;,
&amp;gt; &quot;content&quot;: &quot;Hello Voxxed Days Singapore!&quot;
&amp;gt; }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are posting a simple JSON document that contains two fields: &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt;. The url contains two path fragments: the index name (&lt;code&gt;voxxed&lt;/code&gt;), a logical collection of documents, and the type (&lt;code&gt;doc&lt;/code&gt;), which determines how the documents are stored internally.&lt;/p&gt;

&lt;p&gt;Now that the data is stored we can immediately search it.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;curl -XPOST &lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:9200/voxxed/doc/_search&quot;&lt;/span&gt; -d &lt;span class=&quot;s1&quot;&gt;'
&amp;gt; {
&amp;gt; &quot;query&quot;: {
&amp;gt;   &quot;match&quot;: {
&amp;gt;     &quot;content&quot;: &quot;Singapore&quot;
&amp;gt;   }
&amp;gt; }
&amp;gt; }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are again posting a JSON document, this time appending &lt;code&gt;_search&lt;/code&gt; to the url. The body of the request contains a query in a JSON structure, the so called query DSL. Simply put, this searches for all documents in the index that contain the term &lt;code&gt;Singapore&lt;/code&gt; in the &lt;code&gt;content&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;This returns another json structure that, among other information, contains the document we indexed initially.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;took&quot;&lt;/span&gt; : 127,
  &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;, 
  &lt;span class=&quot;s2&quot;&gt;&quot;hits&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;total&quot;&lt;/span&gt; : 1,
    &lt;span class=&quot;s2&quot;&gt;&quot;max_score&quot;&lt;/span&gt; : 0.2876821,
    &lt;span class=&quot;s2&quot;&gt;&quot;hits&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;s2&quot;&gt;&quot;_index&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;voxxed&quot;&lt;/span&gt;,
        &lt;span class=&quot;s2&quot;&gt;&quot;_type&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;doc&quot;&lt;/span&gt;,
        &lt;span class=&quot;s2&quot;&gt;&quot;_id&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;AVwAP4Aw9lCQvRKyIhgJ&quot;&lt;/span&gt;,
        &lt;span class=&quot;s2&quot;&gt;&quot;_score&quot;&lt;/span&gt; : 0.2876821,
        &lt;span class=&quot;s2&quot;&gt;&quot;_source&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;title&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;Hello world!&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;content&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;Hello Voxxed Days Singapore!&quot;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So how does searching work internally? Elasticsearch provides an inverted index that is used for the lookup of the search terms. For our simple example the inverted index for the &lt;code&gt;content&lt;/code&gt; field looks similar to this.&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Term&lt;/th&gt;&lt;th&gt;Doc Id&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;days&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;hello&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;singapore&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;voxxed&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Each word of our phrase &lt;code&gt;Hello Voxxed Days Singapore!&lt;/code&gt; is stored with a pointer to the document it occurs in. Words are extracted, all punctuation is removed and the words are lowercased. When a search for a term is executed, the same processing is done and there is a direct lookup of the result in the index.&lt;/p&gt;

&lt;p&gt;The process of preparing the content for storage is called analyzing and will be different for different kinds of data and applications. It is encapsulated in an analyzer that processes the incoming text, tokenizes it using a Tokenizer and processes it using optional TokenFilters. By default the Tokenizer splits on word boundaries and there is a filter that lowercases the content.   &lt;/p&gt;

&lt;p&gt;The analyzing process is also where the language specific processing can happen. There are some prebuilt analyzers for different languages available that can do different things like character normalization or stemming, which is an algorithmic process that tries to reduce words to their base form. Besides using the analyzers that are shipped with elasticsearch you can also define custom analyzers that combine some of the predefined tokenizers and token filters.&lt;/p&gt;

&lt;p&gt;Analyzers need to be configured upfront in the mapping before documents are stored in the index. To configure an english analyzer for the &lt;code&gt;content&lt;/code&gt; field we can issue the following PUT request.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;curl -XPUT &lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:9200/voxxed_en&quot;&lt;/span&gt; -d&lt;span class=&quot;s1&quot;&gt;'
{
  &quot;mappings&quot;: {
    &quot;doc&quot;: {
      &quot;properties&quot;: {
        &quot;content&quot;: {
          &quot;type&quot;: &quot;text&quot;, 
          &quot;analyzer&quot;: &quot;english&quot;
        }
      }  
    }
  }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There&amp;#39;s a new index name but the name of the type is the same. This is a common strategy to handle multilingual content: have one index per language, each with the same structure.&lt;/p&gt;

&lt;p&gt;Using this mapping we can now search for the term &lt;code&gt;day&lt;/code&gt; instead of &lt;code&gt;days&lt;/code&gt; as well. The analysis does some normalization on the words so that each of them can be found. In general the way a word is stored in the index influences the search quality a lot and is the main lever for allowing users to search for words in different forms.&lt;/p&gt;
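
&lt;p&gt;If you want to see what actually ends up in the index you can use the &lt;code&gt;_analyze&lt;/code&gt; API, for example to check what the english analyzer does with our example sentence:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPOST &quot;http://localhost:9200/voxxed_en/_analyze&quot; -d'
{
  &quot;analyzer&quot;: &quot;english&quot;,
  &quot;text&quot;: &quot;Hello Voxxed Days Singapore!&quot;
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The response lists the tokens as they would be stored in the index, which makes it easy to see the effect of the stemming.&lt;/p&gt;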

&lt;h2 id=&quot;languages-singapore&quot;&gt;Languages of Singapore&lt;/h2&gt;

&lt;p&gt;There are four official languages in Singapore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;English&lt;/li&gt;
&lt;li&gt;Malay&lt;/li&gt;
&lt;li&gt;Mandarin&lt;/li&gt;
&lt;li&gt;Tamil &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have already seen how we can search for English content using the english analyzer. Let&amp;#39;s look at the Malay language next.&lt;/p&gt;

&lt;h2&gt;Malay&lt;/h2&gt;

&lt;p&gt;Malay is the national language of Singapore, which is why the national anthem is also in Malay. It starts with the following lines:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mari kita rakyat Singapura
Sama-sama menuju bahagia&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Malay uses the Latin alphabet and words are separated by whitespace. We can just index this text as it is.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;curl -XPOST &lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:9200/voxxed/doc&quot;&lt;/span&gt; -d&lt;span class=&quot;s1&quot;&gt;'
{
  &quot;title&quot;: &quot;Majulah Singapura&quot;,
  &quot;content&quot;: &quot;Mari kita rakyat Singapura Sama-sama menuju bahagia&quot;
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can immediately search this just like we searched the English language text. The standard analyzer prepares the words as they are for the index.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;curl -XPOST &lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:9200/voxxed/doc/_search&quot;&lt;/span&gt; -d&lt;span class=&quot;s1&quot;&gt;'
{
  &quot;query&quot;: {
    &quot;match&quot;: {
      &quot;content&quot;: &quot;bahagia&quot;
    }
  }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are searching for &lt;code&gt;bahagia&lt;/code&gt; which means happiness. The document is found as expected.&lt;/p&gt;

&lt;p&gt;For Malay there is no language specific analyzer available, but the standard analyzer works fine. Malay doesn&amp;#39;t have a lot of word inflection, but it has some prefix and suffix rules. What could be possible is to process the text with the Indonesian stemmer that is available. Both languages share many rules but there might also be exceptions.&lt;/p&gt;
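
&lt;p&gt;A sketch of what that could look like: a separate index with a custom analyzer that combines the standard tokenizer, lowercasing and the Indonesian stemmer (the index and analyzer names are made up, and as said above the results would need to be checked carefully for Malay):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPUT &quot;http://localhost:9200/voxxed_ms&quot; -d'
{
  &quot;settings&quot;: {
    &quot;analysis&quot;: {
      &quot;filter&quot;: {
        &quot;indonesian_stemmer&quot;: {
          &quot;type&quot;: &quot;stemmer&quot;,
          &quot;language&quot;: &quot;indonesian&quot;
        }
      },
      &quot;analyzer&quot;: {
        &quot;malay&quot;: {
          &quot;type&quot;: &quot;custom&quot;,
          &quot;tokenizer&quot;: &quot;standard&quot;,
          &quot;filter&quot;: [&quot;lowercase&quot;, &quot;indonesian_stemmer&quot;]
        }
      }
    }
  },
  &quot;mappings&quot;: {
    &quot;doc&quot;: {
      &quot;properties&quot;: {
        &quot;content&quot;: {
          &quot;type&quot;: &quot;text&quot;,
          &quot;analyzer&quot;: &quot;malay&quot;
        }
      }
    }
  }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;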

&lt;h2&gt;Tamil&lt;/h2&gt;

&lt;p&gt;Tamil is a bit different as it is written using a different script.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ஆறின கஞ்சி பழங் கஞ்சி&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a proverb saying &lt;em&gt;Cold food is soon old food&lt;/em&gt;. Even if you can&amp;#39;t read it you can see that the second and fourth words are the same, meaning food.&lt;/p&gt;

&lt;p&gt;We can again index this content and then search it.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;query&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;match&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;content&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;கஞ்சி&quot;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There is no special handling for Tamil. The standard tokenizer splits words correctly. It doesn&amp;#39;t matter that it&amp;#39;s a different script as elasticsearch compares the words on the byte level.&lt;/p&gt;

&lt;h2&gt;Mandarin&lt;/h2&gt;

&lt;p&gt;Mandarin is very different from the others as it uses a different script and no whitespace. My creativity was running low at this point, so this just means &lt;em&gt;Hello Singapore&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;你好新加坡&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you index this using the standard analyzer it will be split into single characters. When using this for search it can lead to a lot of false positives. That&amp;#39;s why there are alternatives available, most notably the CJKAnalyzer that works on Chinese, Japanese and Korean text and builds bigrams of the characters.&lt;/p&gt;

&lt;p&gt;For our example this leads to the following bigrams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;你好 &lt;/li&gt;
&lt;li&gt;好新 &lt;/li&gt;
&lt;li&gt;新加 &lt;/li&gt;
&lt;li&gt;加坡&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even this can lead to invalid or irrelevant words but it is better than searching for single characters. When searching for the word Singapore the document is found correctly.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;query&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;match&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;content&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;新加坡&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
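&lt;p&gt;The CJK analyzer itself is configured in the mapping just like the english analyzer before - a minimal sketch with a separate index for the Mandarin content (the index name is made up):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPUT &quot;http://localhost:9200/voxxed_zh&quot; -d'
{
  &quot;mappings&quot;: {
    &quot;doc&quot;: {
      &quot;properties&quot;: {
        &quot;content&quot;: {
          &quot;type&quot;: &quot;text&quot;,
          &quot;analyzer&quot;: &quot;cjk&quot;
        }
      }
    }
  }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;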
&lt;p&gt;An alternative can be to use the Smart Chinese Plugin that is, like the name suggests, smarter than just building bigrams. It uses a probabilistic approach to determine sentence and word boundaries.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Each of the languages of Singapore has its specialities. There is basic support for all of them in Elasticsearch but working with multiple languages can be challenging, in real life as well as in search engines.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Spring Security and Multiple Filter Chains</title>
   <link href="http://blog.florian-hopf.de/2017/08/spring-security.html"/>
   <updated>2017-08-21T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/08/spring-security</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;https://docs.spring.io/spring-security/site/docs/current/reference/htmlsingle/&quot;&gt;Spring Security&lt;/a&gt; is an immensely useful technology. It allows you to secure your application without being too intrusive and allows to plug with many different authentication mechanisms. On the other hand it is not that easy to get into and one of those tools that I have to relearn each time I am touching it. In this post I'll describe some of the basics of spring security and how you can use it to secure different parts of your application in different ways.&lt;/p&gt;

&lt;h2&gt;Spring Security Configuration&lt;/h2&gt;

&lt;p&gt;Let&amp;#39;s look at a piece of configuration for Spring Security; you can find the &lt;a href=&quot;https://github.com/fhopf/springsecuriy-example&quot;&gt;full source code on Github&lt;/a&gt;. I am using Spring Boot but most parts should be the same for all Spring applications.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;nd&quot;&gt;@Configuration&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SecurityConfig&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WebSecurityConfigurerAdapter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;protected&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HttpSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;httpBasic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authorizeRequests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/secret/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authenticated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authorizeRequests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;permitAll&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the simplest case you just configure the &lt;code&gt;HttpSecurity&lt;/code&gt; using the method chaining that is common in Spring Security. In this case we enable HTTP Basic Auth and require authentication for one endpoint (everything below &lt;code&gt;/secret/&lt;/code&gt;). All other requests (denoted by &lt;code&gt;/**&lt;/code&gt;) will be permitted. The patterns used here follow the Ant path syntax, but you can also use a different &lt;a href=&quot;http://docs.spring.io/spring-security/site/docs/current/apidocs/org/springframework/security/web/util/matcher/RequestMatcher.html&quot;&gt;&lt;code&gt;RequestMatcher&lt;/code&gt;&lt;/a&gt; to decide which parts of your application require which authentication.&lt;/p&gt;

&lt;p&gt;All the functionality of Spring Security is implemented in a filter chain. The call to &lt;code&gt;httpBasic()&lt;/code&gt; above actually just makes sure that the relevant filter is added to the filter chain. In this case the &lt;code&gt;BasicAuthenticationFilter&lt;/code&gt; will check if there is an &lt;code&gt;Authorization&lt;/code&gt; header and evaluate it. If one is found it will add an &lt;code&gt;Authentication&lt;/code&gt; object to the context and execute the rest of the filter chain. At the end of the chain is the &lt;code&gt;FilterSecurityInterceptor&lt;/code&gt; that checks if the requested resource requires authentication and whether the authentication that is set conforms to the required roles.&lt;/p&gt;
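
&lt;p&gt;You can see this in action with curl, which sets the &lt;code&gt;Authorization&lt;/code&gt; header for you when you pass credentials - user, password and path are only placeholders for whatever your setup uses:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;# sends an Authorization: Basic header along with the request to the protected endpoint
curl -u user:password http://localhost:8080/secret/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;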

&lt;p&gt;&lt;img src=&quot;/files/spring-security-filter.png&quot; alt=&quot;Filter Chain&quot;&gt;&lt;/p&gt;

&lt;p&gt;You can also exclude some parts of the application from authentication by configuring the &lt;code&gt;WebSecurity&lt;/code&gt;. The following method makes sure that any requests to &lt;code&gt;/resources/&lt;/code&gt; skip the configuration above.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WebSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;ignoring&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/resources/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under the hood this will add an additional filter chain that is triggered for the path configured and does nothing.&lt;/p&gt;

&lt;h2&gt;Multiple Filter Chains&lt;/h2&gt;

&lt;p&gt;Sometimes it can be necessary to use different authentication mechanisms for different parts of your application. To achieve that, Spring Security allows you to add several configuration objects. It is a common practice to use inner configuration classes for this that can also share some parts of the enclosing application. The following class adds two different Spring Security filter chains.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SecurityConfig&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@Configuration&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ApiConfiguration&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WebSecurityConfigurerAdapter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

        &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
        &lt;span class=&quot;kd&quot;&gt;protected&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HttpSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// doesn't really make sense to protect a REST API using form login but it is just for illustration&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;formLogin&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authorizeRequests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/secret/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authenticated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authorizeRequests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;permitAll&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
        &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WebSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;ignoring&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/resources/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@Order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;@Configuration&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ActuatorConfiguration&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WebSecurityConfigurerAdapter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

        &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
        &lt;span class=&quot;kd&quot;&gt;protected&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HttpSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatcher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/management/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;httpBasic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authorizeRequests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;antMatchers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/management/**&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;authenticated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
        &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WebSecurity&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;kd&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;configure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both of the classes inherit from the adapter configuration class and configure their &lt;code&gt;HttpSecurity&lt;/code&gt;. Each of those classes adds a filter chain and the first one that matches is executed. The &lt;code&gt;@Order&lt;/code&gt; annotation can be used to influence the order of the filter chains to make sure that the right one is executed first. &lt;/p&gt;

&lt;p&gt;It can also be necessary to restrict the filter chain to only a certain part of the application so that it is not triggered for other parts. The &lt;code&gt;ActuatorConfiguration&lt;/code&gt; is restricted to only match requests below &lt;code&gt;/management/&lt;/code&gt;. Be aware that there are two different places in the configuration that accept a &lt;code&gt;RequestMatcher&lt;/code&gt;. The one at the beginning restricts the URLs the filter chain is triggered for. The ones after &lt;code&gt;authorizeRequests()&lt;/code&gt; define which requests require what kind of authentication.&lt;/p&gt;

&lt;p&gt;Note that configuring the &lt;code&gt;WebSecurity&lt;/code&gt; is not tied to one of the &lt;code&gt;HttpSecurity&lt;/code&gt; configurations, as each of those adds its own filter chain; only the order might be different. If you add a pattern in both configurations they will even operate on the same instance of &lt;code&gt;WebSecurity&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One last thing: In case you are using a custom authentication filter (e.g. for token based authentication) you might have to take care that you don&amp;#39;t register your filter as a Servlet Filter as well. You can influence that by declaring a method that returns a &lt;code&gt;FilterRegistrationBean&lt;/code&gt; and accepts an instance of your &lt;code&gt;Filter&lt;/code&gt;: just create a new &lt;code&gt;FilterRegistrationBean&lt;/code&gt; for your filter and set &lt;code&gt;enabled&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;
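&lt;p&gt;A minimal sketch of such a registration could look like the following, where &lt;code&gt;TokenAuthenticationFilter&lt;/code&gt; is just a placeholder name for your own filter class:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// TokenAuthenticationFilter is a hypothetical custom filter class
@Bean
public FilterRegistrationBean tokenFilterRegistration(TokenAuthenticationFilter filter) {
    FilterRegistrationBean registration = new FilterRegistrationBean(filter);
    // prevent Spring Boot from additionally registering the filter with the servlet container
    registration.setEnabled(false);
    return registration;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;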
</content>
 </entry>
 
 <entry>
   <title>Traffic Light Visualizations for Kibana</title>
   <link href="http://blog.florian-hopf.de/2017/04/kibana-traffic-lights.html"/>
   <updated>2017-04-03T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/04/kibana-traffic-lights</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;https://www.elastic.co/blog/kibana-5-3-0-released&quot;&gt;Kibana 5.3&lt;/a&gt; shipped with an interesting feature that had been anticipated for quite a while. It provides the ability to display the latest value for a certain field using the Top Hit Aggregation. I'll show how to use the Top Hit Aggregation to create two different visualizations.&lt;/p&gt;

&lt;p&gt;But first, of course you need some timestamped data. The following documents are minimal examples.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;POST /logstash-2017.03.30/log
{
  &quot;metric&quot;: 0.6,
  &quot;ip&quot;: &quot;192.168.0.1&quot;,
  &quot;@timestamp&quot;: &quot;2017-03-30T09:10:22.611Z&quot;
}

POST /logstash-2017.03.30/log
{
  &quot;metric&quot;: 0.4,
  &quot;ip&quot;: &quot;192.168.0.2&quot;,
  &quot;@timestamp&quot;: &quot;2017-03-30T09:10:22.611Z&quot;
}

POST /logstash-2017.03.30/log
{
  &quot;metric&quot;: 0.7,
  &quot;ip&quot;: &quot;192.168.0.1&quot;,
  &quot;@timestamp&quot;: &quot;2017-03-30T09:10:22.611Z&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Besides the timestamp there are two fields for each document: metric (which could be anything, e.g. used disk space, load or anything else) and ip (which is an identifier for a machine). There are two values for the ip &lt;code&gt;192.168.0.1&lt;/code&gt; (0.6, 0.7) and one for &lt;code&gt;192.168.0.2&lt;/code&gt; (0.4).&lt;/p&gt;

&lt;h2&gt;Metric Visualization&lt;/h2&gt;

&lt;p&gt;The easiest way to use the new aggregation type is by just displaying the latest value in a widget. You can create a new visualization using a metric, choosing Top Hit as aggregation. You can choose how many latest values you want to take into account (1 for only the latest) and how to aggregate them. By default the values will be sorted on timestamp in descending order which you can also change to display the first value.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/kibana-top-hit-metric-2.png&quot; alt=&quot;Top Hit Metric Aggregation&quot;&gt;&lt;/p&gt;

&lt;p&gt;Of course you can also tie this visualization to a search, querying for the ip &lt;code&gt;192.168.0.2&lt;/code&gt; will then only display the latest value for this certain ip.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/kibana-top-hit-metric-3.png&quot; alt=&quot;Top Hit Metric Aggregation&quot;&gt;&lt;/p&gt;

&lt;h2&gt;Heatmap Visualization&lt;/h2&gt;

&lt;p&gt;A more visual approach to displaying the latest value can be done using a heatmap. You can build a traffic light style dashboard of any values in your system. &lt;/p&gt;

&lt;p&gt;The Top Hit can be registered as a metric aggregation in the first section of the heatmap configuration.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/kibana-heatmap-top-hit.png&quot; alt=&quot;Top Hit Metric Aggregation&quot;&gt;&lt;/p&gt;

&lt;p&gt;The buckets on the X-Axis can then be determined by a Terms aggregation on the ip field, displaying separate sections for each ip.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/kibana-heatmap-terms.png&quot; alt=&quot;Top Hit Metric Aggregation&quot;&gt;&lt;/p&gt;

&lt;p&gt;Finally, the color to display for the different values can be configured on the options tab. You can create custom ranges that can then be assigned to a certain color in the legend on the right.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/kibana-heatmap-options.png&quot; alt=&quot;Top Hit Metric Aggregation&quot;&gt;&lt;/p&gt;

&lt;p&gt;These visualizations can help a lot to immediately judge the health of your system.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Changes</title>
   <link href="http://blog.florian-hopf.de/2017/01/changes.html"/>
   <updated>2017-01-24T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2017/01/changes</id>
   <content type="html">&lt;p&gt;I came to Karlsruhe for my studies, started working at &lt;a href=&quot;http://synyx.de&quot;&gt;synyx&lt;/a&gt; and stayed here after switching to working as an independent developer and consultant. For years I helped running the &lt;a href=&quot;http://jug-ka.de&quot;&gt;local Java User Group&lt;/a&gt;, I am one of the founders of the &lt;a href=&quot;&quot;&gt;Search Meetup Karlsruhe&lt;/a&gt; and most of my clients are from the area as well. Though my wife and myself liked it here we decided it's time for a change: We'll be relocating to Singapore for at least a year.&lt;/p&gt;

&lt;p&gt;I am glad to announce that I&amp;#39;ll start working as a developer at &lt;a href=&quot;http://zenika.sg&quot;&gt;Zenika Singapore&lt;/a&gt;. I guess I&amp;#39;ll still be doing search projects and hope that I can continue blogging from time to time. My consulting business will shut down at the beginning of February but if you would like to work with me that is still possible - you just have to talk to Zenika instead of myself. Thanks to all former colleagues, user group members and friends. I&amp;#39;d be happy to hear from you when you are in the area. &lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Java Clients for Elasticsearch Transcript</title>
   <link href="http://blog.florian-hopf.de/2016/11/java-clients-elasticsearch.html"/>
   <updated>2016-11-09T00:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/11/java-clients-elasticsearch</id>
   <content type="html">&lt;p&gt;&lt;em&gt;This is a transcript of a &lt;a href=&quot;http://www.meetup.com/de-DE/singajug/events/234987866/&quot;&gt;talk I gave at the Singapore Java User Group on November 9 2016&lt;/a&gt;. It can also be seen as an updated version of &lt;a href=&quot;https://www.found.no/foundation/java-clients-for-elasticsearch/&quot;&gt;an article with the same name I published in 2014 on the Found blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this talk I will introduce three different clients for elasticsearch as well as Spring Data Elasticsearch. But to get started let&amp;#39;s look at some of the basics of elasticsearch.&lt;/p&gt;

&lt;h2&gt;elasticsearch&lt;/h2&gt;

&lt;p&gt;To introduce elasticsearch I am using a definition that is taken directly from the elastic website.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Elasticsearch is a distributed, JSON-based search and analytics engine, designed for horizontal scalability, maximum reliability, and easy management.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let&amp;#39;s see first what a &lt;em&gt;JSON-based search and analytics engine&lt;/em&gt; means.&lt;/p&gt;

&lt;p&gt;To understand what elasticsearch does it&amp;#39;s good to see an example of a search page. This is something everybody is familiar with, the code search on Github.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/github.png&quot; alt=&quot;Github Screenshot&quot;&gt;&lt;/p&gt;

&lt;p&gt;Keywords can be entered in a single search input; below is a list of results. One of the distinguishing features between a search engine and other databases is that there is a notion of relevance. We can see that for our search term &lt;code&gt;elasticsearch&lt;/code&gt; the project for the search engine is in first place. It&amp;#39;s very likely that people are looking for the project when searching for this term. The factors that are used to determine if a result is more relevant than another can vary from application to application - I don&amp;#39;t know what Github is doing but I can imagine that they are using factors like popularity besides classical text relevance features. There are a lot more features on the website that a classical search engine like elasticsearch supports: highlighting the occurrences in the results, paginating the list and sorting by different criteria. On the left you can see the so called facets that can be used to further refine the result list using criteria from the documents found. This is similar to features found on e-commerce sites like eBay and Amazon. For doing something like this there is the aggregation feature in elasticsearch, which is also the basis for its analytics capabilities. All of this and a lot more can be done using elasticsearch as well. In this case this is even more obvious - Github is actually using elasticsearch for searching through the large amount of data they are storing.&lt;/p&gt;

&lt;p&gt;If you want to build a search application like this you have to install the engine first. Fortunately elasticsearch is really easy to get started with. There is no special requirement besides a recent Java runtime. You can download the elasticsearch archive from the elastic website, unpack it and start elasticsearch using a script.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;# download archive
wget https://artifacts.elastic.co/downloads/
    elasticsearch/elasticsearch-5.0.0.zip

unzip elasticsearch-5.0.0.zip

# on windows: elasticsearch.bat
elasticsearch-5.0.0/bin/elasticsearch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For production use there are also packages for different Linux distributions. You can check that elasticsearch has started by doing an HTTP GET request on the standard port. In the examples I am using curl, the command line client for doing HTTP requests, which is available for a lot of environments.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XGET &quot;http://localhost:9200&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;elasticsearch will answer this request with a JSON document that contains some information on the installation.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;LI8ZN-t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;cluster_name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;elasticsearch&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;cluster_uuid&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;UvbMAoJ8TieUqugCGw7Xrw&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;version&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;number&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;5.0.0&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;build_hash&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;253032b&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;build_date&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-10-26T04:37:51.531Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;build_snapshot&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;lucene_version&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;6.2.0&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;tagline&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;You Know, for Search&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The most important fact for us is that we can see that the server is started. But there is also versioning information on elasticsearch and Lucene, the underlying library used for most of the search functionality.&lt;/p&gt;

&lt;p&gt;If we now want to store data in elasticsearch we send it as a JSON document as well, this time using a POST request. As I really like the food in Singapore I want to build an application that allows me to search my favorite food. Let&amp;#39;s index the first dish.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPOST &quot;http://localhost:9200/food/dish&quot; -d'
{
  &quot;food&quot;: &quot;Hainanese Chicken Rice&quot;,
  &quot;tags&quot;: [&quot;chicken&quot;, &quot;rice&quot;],
  &quot;favorite&quot;: {
    &quot;location&quot;: &quot;Tian Tian&quot;,
    &quot;price&quot;: 5.00
  }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are using the same port we used before, this time we just add two more fragments to the url: &lt;code&gt;food&lt;/code&gt; and &lt;code&gt;dish&lt;/code&gt;. The first is the name of the index, a logical collection of documents. The second is the type. It determines the structure of the document we are saving, the so called mapping.&lt;/p&gt;

&lt;p&gt;The dish itself is modeled as a document. elasticsearch supports different data types like string, which is used for the &lt;code&gt;food&lt;/code&gt; attribute, lists like in &lt;code&gt;tags&lt;/code&gt; and even embedded documents like the &lt;code&gt;favorite&lt;/code&gt; document. Besides that there are more primitive types like numerics, booleans and specialized types like geo coordinates.&lt;/p&gt;

&lt;p&gt;We can now index another document doing another POST request.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPOST &quot;http://localhost:9200/food/dish&quot; -d'
{
  &quot;food&quot;: &quot;Ayam Penyet&quot;,
  &quot;tags&quot;: [&quot;chicken&quot;, &quot;indonesian&quot;],
  &quot;spicy&quot;: true
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The structure of this document is a bit different. It doesn&amp;#39;t contain the &lt;code&gt;favorite&lt;/code&gt; subdocument but has another attribute &lt;code&gt;spicy&lt;/code&gt; instead. Documents of the same kind can be very different - but keep in mind that you need to interpret some parts in your application. Normally you will have similar documents.&lt;/p&gt;

&lt;p&gt;With those documents indexed it is automatically possible to search them. One option is to do a GET request on &lt;code&gt;/_search&lt;/code&gt; and add the query term as a parameter.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XGET &quot;http://localhost:9200/food/dish/_search?q=chicken&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Searching for chicken in both documents also returns both of them. This is an excerpt of the result.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;...
{&quot;total&quot;:2,&quot;max_score&quot;:0.3666863,&quot;hits&quot;:[{&quot;_index&quot;:&quot;food&quot;,&quot;_type&quot;:&quot;dish&quot;,&quot;_id&quot;:&quot;AVg9cMwARrBlrY9tYBqX&quot;,&quot;_score&quot;:0.3666863,&quot;_source&quot;:
{
  &quot;food&quot;: &quot;Hainanese Chicken Rice&quot;,
  &quot;tags&quot;: [&quot;chicken&quot;, &quot;rice&quot;],
  &quot;favorite&quot;: {
    &quot;location&quot;: &quot;Tian Tian&quot;,
    &quot;price&quot;: 5.00
  }
}},
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There is some global information like the amount of documents found. But the most important property is the &lt;code&gt;hits&lt;/code&gt; array that contains the original source of our indexed dishes.&lt;/p&gt;

&lt;p&gt;It&amp;#39;s very easy to get started like this but most of the time the queries will be more complex. That&amp;#39;s why elasticsearch provides the query DSL, a JSON structure that describes a query as well as any other search features that are requested.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPOST &quot;http://localhost:9200/food/dish/_search&quot; -d'
{
  &quot;query&quot;: {
    &quot;bool&quot;: {
      &quot;must&quot;: {
      &quot;match&quot;: {
        &quot;_all&quot;: &quot;rice&quot;
      }
      },
      &quot;filter&quot;: {
    &quot;term&quot;: {
      &quot;tags.keyword&quot;: &quot;chicken&quot;
    }
      }
    }
  }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are searching for all documents that contain the term &lt;code&gt;rice&lt;/code&gt; and also have &lt;code&gt;chicken&lt;/code&gt; in &lt;code&gt;tags&lt;/code&gt;. Accessing a field using the &lt;code&gt;.keyword&lt;/code&gt; suffix allows us to do an exact search and is a new feature in elasticsearch 5.0.&lt;/p&gt;

&lt;p&gt;Besides the search itself you can use the query DSL to request more information from elasticsearch, be it something like highlighting or autocompletion or the aggregations that can be used to build a faceting feature.&lt;/p&gt;

&lt;p&gt;Let&amp;#39;s move on to another part of the definition.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Elasticsearch is [...] distributed [...], designed for horizontal scalability, maximum reliability &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So far we have only accessed a single elasticsearch instance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/access-simple.png&quot; alt=&quot;Accessing a single node&quot;&gt;&lt;/p&gt;

&lt;p&gt;Our application would be talking directly to that node. Now, as elasticsearch is designed for horizontal scalability we can also add more nodes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/access-cluster.png&quot; alt=&quot;Accessing a cluster&quot;&gt;&lt;/p&gt;

&lt;p&gt;The nodes form a cluster. We can still talk to the first node and it will distribute all requests to the necessary nodes of the cluster. This is completely transparent to us. &lt;/p&gt;

&lt;p&gt;Building a cluster with elasticsearch is really easy at the beginning but of course it can be more challenging to maintain a production cluster.&lt;/p&gt;

&lt;p&gt;Now that we have a basic understanding about what elasticsearch does let&amp;#39;s see how we can access it from a Java application.&lt;/p&gt;

&lt;h2&gt;Transport Client&lt;/h2&gt;

&lt;p&gt;The transport client has been available from the beginning and is the client chosen most frequently. Starting with elasticsearch 5.0 it has its own artifact that can be integrated in your build, e.g. using Gradle.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;dependencies {
    compile group: 'org.elasticsearch.client',
        name: 'transport',
        version: '5.0.0'
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All functionality of Elasticsearch is available using the &lt;code&gt;Client&lt;/code&gt; interface; a concrete implementation is the &lt;code&gt;TransportClient&lt;/code&gt;, which can be instantiated using a &lt;code&gt;Settings&lt;/code&gt; object and can have one or more addresses of elasticsearch nodes.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;TransportAddress address =
    new InetSocketTransportAddress(
        InetAddress.getByName(&quot;localhost&quot;), 9300);

Client client = new PreBuiltTransportClient(Settings.EMPTY)
    .addTransportAddress(address);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;client&lt;/code&gt; then provides methods for different features of elasticsearch. First, let&amp;#39;s search again. Recall the structure of the query we issued above.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;curl -XPOST &quot;http://localhost:9200/food/dish/_search&quot; -d'
{
  &quot;query&quot;: {
    &quot;bool&quot;: {
      &quot;must&quot;: {
      &quot;match&quot;: {
        &quot;_all&quot;: &quot;rice&quot;
      }
      },
      &quot;filter&quot;: {
    &quot;term&quot;: {
      &quot;tags.keyword&quot;: &quot;chicken&quot;
    }
      }
    }
  }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A &lt;code&gt;bool&lt;/code&gt; query that has a &lt;code&gt;match&lt;/code&gt; query in its &lt;code&gt;must&lt;/code&gt; section and a &lt;code&gt;term&lt;/code&gt; query in its &lt;code&gt;filter&lt;/code&gt; section.&lt;/p&gt;

&lt;p&gt;Luckily once you have a query like this you can easily transform it to the Java equivalent.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;SearchResponse searchResponse = client
   .prepareSearch(&quot;food&quot;)
   .setQuery(
    boolQuery().
      must(matchQuery(&quot;_all&quot;, &quot;rice&quot;)).
      filter(termQuery(&quot;tags.keyword&quot;, &quot;chicken&quot;)))
   .execute().actionGet();

assertEquals(1, searchResponse.getHits().getTotalHits());

SearchHit hit = searchResponse.getHits().getAt(0);
String food = hit.getSource().get(&quot;food&quot;).toString();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are requesting a &lt;code&gt;SearchRequestBuilder&lt;/code&gt; by calling &lt;code&gt;prepareSearch&lt;/code&gt; on the &lt;code&gt;client&lt;/code&gt;. There we can set a query using the static helper methods. And again, it&amp;#39;s a &lt;code&gt;bool&lt;/code&gt; query that has a &lt;code&gt;match&lt;/code&gt; query in its &lt;code&gt;must&lt;/code&gt; section and a &lt;code&gt;term&lt;/code&gt; query in its &lt;code&gt;filter&lt;/code&gt; section.&lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;execute&lt;/code&gt; returns a Future object; &lt;code&gt;actionGet&lt;/code&gt; is the blocking part of the call. The &lt;code&gt;SearchResponse&lt;/code&gt; represents the same JSON structure we can see when doing a search using the HTTP interface. The source of the dish is then available as a map.&lt;/p&gt;
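&lt;p&gt;If you prefer not to block, the builder also accepts a listener instead of returning a Future. A rough sketch of the asynchronous variant could look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// non-blocking variant: the listener is called once the response arrives
client.prepareSearch(&quot;food&quot;)
    .setQuery(matchQuery(&quot;_all&quot;, &quot;rice&quot;))
    .execute(new ActionListener&amp;lt;SearchResponse&amp;gt;() {
        @Override
        public void onResponse(SearchResponse searchResponse) {
            // process the hits
        }

        @Override
        public void onFailure(Exception e) {
            // handle the error
        }
    });
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;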

&lt;p&gt;When indexing data there are different options available. One is to use the &lt;code&gt;jsonBuilder&lt;/code&gt; to create a JSON representation.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;XContentBuilder builder = jsonBuilder()
    .startObject()
        .field(&quot;food&quot;, &quot;Roti Prata&quot;)
        .array(&quot;tags&quot;, new String [] {&quot;curry&quot;})
        .startObject(&quot;favorite&quot;)
        .field(&quot;location&quot;, &quot;Tiong Bahru&quot;)
        .field(&quot;price&quot;, 2.00)
        .endObject()
    .endObject();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It provides different methods that can be used to create the structure of the JSON document. This can then be used as the source for an IndexRequest.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;IndexResponse resp = client.prepareIndex(&quot;food&quot;,&quot;dish&quot;)
        .setSource(builder)
        .execute()
        .actionGet();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Besides using the &lt;code&gt;jsonBuilder&lt;/code&gt; there are several other options available. &lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/indexing-source.png&quot; alt=&quot;Different methods for indexing&quot;&gt;&lt;/p&gt;

&lt;p&gt;Common options are to use a Map, the convenience methods that accept field names and values for simple structures, or to pass in a String, often in combination with a library like Jackson for serialization.&lt;/p&gt;
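&lt;p&gt;As an example, indexing with a Map could look roughly like this (the dish is made up for illustration):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;Map&amp;lt;String, Object&amp;gt; source = new HashMap&amp;lt;&amp;gt;();
source.put(&quot;food&quot;, &quot;Laksa&quot;);
source.put(&quot;tags&quot;, Arrays.asList(&quot;noodles&quot;, &quot;spicy&quot;));

// setSource also accepts a Map besides the XContentBuilder shown above
IndexResponse resp = client.prepareIndex(&quot;food&quot;, &quot;dish&quot;)
        .setSource(source)
        .execute()
        .actionGet();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;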

&lt;p&gt;We have seen above that the Transport Client accepts the address of one or more elasticsearch nodes. You might have noticed that the port is different to the one used for http, 9300 instead of 9200. This is because the client doesn&amp;#39;t communicate via http - it connects to an existing cluster using the transport protocol, a binary protocol that is also used for inter node communication in a cluster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/access-transport-client.png&quot; alt=&quot;Transport client uses binary protocol&quot;&gt;&lt;/p&gt;

&lt;p&gt;You might have noticed as well that so far we are only talking to one node of the cluster. Once this node goes down we might not be able to access our data anymore. If you need high availability you can enable the sniffing option that lets your client talk to multiple nodes in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/access-transport-client-sniffing.png&quot; alt=&quot;Sniffing enables high availability&quot;&gt;&lt;/p&gt;

&lt;p&gt;Now when one of the nodes goes down, we can still access the data using the other nodes. The feature can be enabled by setting &lt;code&gt;client.transport.sniff&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt; when creating the client.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;TransportAddress address =
    new InetSocketTransportAddress(
        InetAddress.getByName(&quot;localhost&quot;), 9300);

Settings settings = Settings.builder()
            .put(&quot;client.transport.sniff&quot;, true)
            .build();

Client client = new PreBuiltTransportClient(settings)
    .addTransportAddress(address);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This feature works by requesting the current state of the cluster from the known node using one of the management APIs of elasticsearch. When configured this is done during startup and at a regular interval, by default every 5 seconds.&lt;/p&gt;

&lt;p&gt;Sniffing is an important feature to make sure your application stays up even during node failures.&lt;/p&gt;

&lt;p&gt;When using the Transport Client you have some obvious benefits: As the client is shipped with the server (and even includes a dependency on the server) you can be sure that all of the current API is available for use in your client code. Communication is more efficient than JSON over HTTP and there is support for client side load balancing.&lt;/p&gt;

&lt;p&gt;On the other hand there are some drawbacks as well: As the transport protocol is an internal protocol you need to use a compatible elasticsearch version on the server and the client. Also, rather unexpectedly, this means that a similar JDK version needs to be used. Additionally you need to include all of the elasticsearch dependencies in your application. This can be a huge problem, especially with larger existing applications. For example it might happen that a CMS already ships some version of Lucene. Often it is not possible to resolve dependency conflicts like this.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a solution for this.&lt;/p&gt;

&lt;h2&gt;RestClient&lt;/h2&gt;

&lt;p&gt;elasticsearch 5.0 introduced a new client that uses the HTTP API of elasticsearch instead of the internal protocol. This requires far fewer dependencies. Also you don&amp;#39;t need to care about the version as much - the current client can be used with elasticsearch 2.x as well.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/access-http.png&quot; alt=&quot;RestClient uses HTTP for talking to a cluster&quot;&gt;&lt;/p&gt;

&lt;p&gt;But there is also a drawback - it doesn&amp;#39;t have a lot of features yet.&lt;/p&gt;

&lt;p&gt;The client is available as a Maven artifact as well.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;dependencies {
    compile group: 'org.elasticsearch.client',
        name: 'rest',
        version: '5.0.0'
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The client only depends on the apache httpclient and its dependencies. This is a Gradle listing of all the dependencies.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;+--- org.apache.httpcomponents:httpclient:4.5.2
+--- org.apache.httpcomponents:httpcore:4.4.5
+--- org.apache.httpcomponents:httpasyncclient:4.1.2
+--- org.apache.httpcomponents:httpcore-nio:4.4.5
+--- commons-codec:commons-codec:1.10
\--- commons-logging:commons-logging:1.1.3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It can be instantiated by passing in one or more &lt;code&gt;HttpHost&lt;/code&gt; instances.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;RestClient restClient = RestClient.builder(
    new HttpHost(&quot;localhost&quot;, 9200, &quot;http&quot;),
    new HttpHost(&quot;localhost&quot;, 9201, &quot;http&quot;))
    .build();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As there is not a lot of functionality as of now most of the JSON is just available as a String. This is an example of executing a &lt;code&gt;match_all&lt;/code&gt; query and transforming the response to a String using a helper method.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;  HttpEntity entity = new NStringEntity(
      &quot;{ \&quot;query\&quot;: { \&quot;match_all\&quot;: {}}}&quot;,
      ContentType.APPLICATION_JSON);
  // alternative: performRequestAsync
  Response response = restClient.performRequest(&quot;POST&quot;, &quot;/_search&quot;,
      emptyMap(), entity);
  String json = toString(response.getEntity());
  // ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Indexing data is low level as well: you just send the String containing the JSON document to the endpoint. The client supports sniffing using a separate library. Besides the fact that there are fewer dependencies and the elasticsearch version is not as important anymore there is another benefit for operations: The cluster can now be separated from the applications with HTTP being the only protocol to talk to the cluster.&lt;/p&gt;
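&lt;p&gt;A minimal sketch of such a low level indexing call, reusing the &lt;code&gt;food&lt;/code&gt; index from the earlier examples with a made up document, could look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;  HttpEntity dish = new NStringEntity(
      &quot;{ \&quot;food\&quot;: \&quot;Laksa\&quot;, \&quot;tags\&quot;: [\&quot;noodles\&quot;] }&quot;,
      ContentType.APPLICATION_JSON);

  // POST the document to the index and type used before
  Response indexResponse = restClient.performRequest(&quot;POST&quot;, &quot;/food/dish&quot;,
      emptyMap(), dish);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;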

&lt;p&gt;Most of the functionality depends on the Apache http client directly. There is support for setting timeouts, using basic auth, custom headers and error handling.&lt;/p&gt;
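&lt;p&gt;As an example, basic auth can be configured through a callback on the builder. This is only a sketch with placeholder credentials:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// placeholder credentials, only for illustration
CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY,
    new UsernamePasswordCredentials(&quot;user&quot;, &quot;password&quot;));

RestClient restClient = RestClient.builder(
    new HttpHost(&quot;localhost&quot;, 9200, &quot;http&quot;))
    .setHttpClientConfigCallback(httpClientBuilder -&amp;gt;
        httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider))
    .build();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;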

&lt;p&gt;For now there is no query support. If you are able to add the elasticsearch dependency to your application (which of course voids some of the benefits again) you can use the &lt;code&gt;SearchSourceBuilder&lt;/code&gt; and related functionality to create Strings for the query.&lt;/p&gt;

&lt;p&gt;Besides the new RestClient there is also another HTTP client available that has more features: The community built client Jest.&lt;/p&gt;

&lt;h2&gt;Jest&lt;/h2&gt;

&lt;p&gt;Jest has been available for a long time already and is a viable alternative to the standard clients. It is available via Maven central as well.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;dependencies {
    compile group: 'io.searchbox',
        name: 'jest',
        version: '2.0.0'
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;JestClient&lt;/code&gt; is the central interface that allows you to send requests to elasticsearch. It can be created using a factory.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;JestClientFactory factory = new JestClientFactory();
factory.setHttpClientConfig(new HttpClientConfig
            .Builder(&quot;http://localhost:9200&quot;)
            .multiThreaded(true)
            .build());

JestClient client = factory.getObject();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As with the RestClient Jest doesn&amp;#39;t have any support for generating queries. You can either create them using String templating or reuse the elasticsearch builders (with the drawback of having to manage all dependencies again).&lt;/p&gt;

&lt;p&gt;A builder can be used to create the search request.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;String query = jsonStringThatMagicallyAppears;

Search search = new Search.Builder(query)
    .addIndex(&quot;library&quot;)
    .build();

SearchResult result = client.execute(search);
assertEquals(Integer.valueOf(1), result.getTotal());
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result can be processed by traversing the Gson object structure which can become rather complex.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;JsonObject jsonObject = result.getJsonObject();
JsonObject hitsObj = jsonObject.getAsJsonObject(&quot;hits&quot;);
JsonArray hits = hitsObj.getAsJsonArray(&quot;hits&quot;);
JsonObject hit = hits.get(0).getAsJsonObject();

// ... more boring code
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;But that is not how you normally work with Jest. The good thing about Jest is that it directly supports indexing and searching Java beans. For example we can have a representation of our dish documents.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;public class Dish {

    private String food;
    private List&amp;lt;String&amp;gt; tags;
    private Favorite favorite;

    @JestId
    private String id;

     // ... getters and setters
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This class can then be automatically populated from the search result.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;Dish dish = result.getFirstHit(Dish.class).source;

assertEquals(&quot;Roti Prata&quot;, dish.getFood());
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Of course the bean support can be used to index data as well.&lt;/p&gt;
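&lt;p&gt;A rough sketch of indexing a bean with Jest, using a made up dish and the index and type from the earlier examples, could look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;Dish dish = new Dish();
dish.setFood(&quot;Laksa&quot;);
dish.setTags(Arrays.asList(&quot;noodles&quot;, &quot;spicy&quot;));

// Jest serializes the bean to JSON automatically
Index index = new Index.Builder(dish)
    .index(&quot;food&quot;)
    .type(&quot;dish&quot;)
    .build();

JestResult indexResult = client.execute(index);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;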

&lt;p&gt;Jest can be a good alternative when accessing elasticsearch via http. It has a lot of useful functionality like the bean support when indexing and searching and a sniffing feature called node discovery. Unfortunately you have to create the search queries yourself but this is the case for the RestClient as well.&lt;/p&gt;

&lt;p&gt;Now that we have looked at three clients it is time to see an abstraction on a higher level.&lt;/p&gt;

&lt;h2&gt;Spring Data Elasticsearch&lt;/h2&gt;

&lt;p&gt;The family of Spring Data projects provides access to different data stores using a common programming model. It doesn&amp;#39;t try to provide an abstraction over all stores; the specialities of each store are still available. The most impressive feature is the dynamic repositories that allow you to define queries using an interface. Popular modules are Spring Data JPA for accessing relational databases and Spring Data MongoDB.&lt;/p&gt;

&lt;p&gt;Like all Spring modules the artifacts are available in Maven central.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;dependencies {
    compile group: 'org.springframework.data',
    name: 'spring-data-elasticsearch',
    version: '2.0.4.RELEASE'
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The documents to be indexed are represented as Java beans using custom annotations.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;@Document(indexName = &quot;spring_dish&quot;)
public class Dish {

    @Id
    private String id;
    private String food;
    private List&amp;lt;String&amp;gt; tags;
    private Favorite favorite;

     // more code

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Different annotations can be used to define how the document will be stored in elasticsearch. In this case we just define the index name to use when persisting the document and the property that is used for storing the id generated by elasticsearch.&lt;/p&gt;

&lt;p&gt;For accessing the documents one can define an interface typed to the dish class. There are different interfaces available for extension; &lt;code&gt;ElasticsearchCrudRepository&lt;/code&gt; provides generic index and search operations.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;public interface DishRepository 
  extends ElasticsearchCrudRepository&amp;lt;Dish, String&amp;gt; {

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The module provides a namespace for XML configuration.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;&amp;lt;elasticsearch:transport-client id=&quot;client&quot; /&amp;gt;

&amp;lt;bean name=&quot;elasticsearchTemplate&quot; 
  class=&quot;o.s.d.elasticsearch.core.ElasticsearchTemplate&quot;&amp;gt;
    &amp;lt;constructor-arg name=&quot;client&quot; ref=&quot;client&quot;/&amp;gt;
&amp;lt;/bean&amp;gt;

&amp;lt;elasticsearch:repositories 
  base-package=&quot;de.fhopf.elasticsearch.springdata&quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;transport-client&lt;/code&gt; element instantiates a transport client; &lt;code&gt;ElasticsearchTemplate&lt;/code&gt; provides the common operations on elasticsearch. Finally, the &lt;code&gt;repositories&lt;/code&gt; element instructs Spring Data to scan for interfaces extending one of the Spring Data interfaces. It will automatically create instances for those.&lt;/p&gt;

&lt;p&gt;You can then have the repository wired in your application and use it for storing and finding instances of &lt;code&gt;Dish&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;Dish mie = new Dish();
mie.setId(&quot;hokkien-prawn-mie&quot;);
mie.setFood(&quot;Hokkien Prawn Mie&quot;);
mie.setTags(Arrays.asList(&quot;noodles&quot;, &quot;prawn&quot;));

repository.save(Arrays.asList(mie));

// one line omitted

Iterable&amp;lt;Dish&amp;gt; dishes = repository.findAll();

Dish dish = repository.findOne(&quot;hokkien-prawn-mie&quot;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Retrieving documents by id is not very interesting for a search engine. To really query documents you can add more methods to your interface that follow a certain naming convention.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;public interface DishRepository 
  extends ElasticsearchCrudRepository&amp;lt;Dish, String&amp;gt; {

    List&amp;lt;Dish&amp;gt; findByFood(String food);

    List&amp;lt;Dish&amp;gt; findByTagsAndFavoriteLocation(String tag, String location);

    List&amp;lt;Dish&amp;gt; findByFavoritePriceLessThan(Double price);

    @Query(&quot;{\&quot;query\&quot;: {\&quot;match_all\&quot;: {}}}&quot;)
    List&amp;lt;Dish&amp;gt; customFindAll();
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Most of the methods start with &lt;code&gt;findBy&lt;/code&gt; followed by one or more properties. For example &lt;code&gt;findByFood&lt;/code&gt; will query the field &lt;code&gt;food&lt;/code&gt; with the given parameter. Structured queries are possible as well, in this case by adding &lt;code&gt;lessThan&lt;/code&gt;. This will return all dishes that have a lower price than the given one. The last method uses a different approach. It doesn&amp;#39;t follow a naming convention but uses a &lt;code&gt;Query&lt;/code&gt; annotation instead. Of course this query can contain placeholders for parameters as well.&lt;/p&gt;
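&lt;p&gt;Hypothetical calls to those derived query methods could then look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;// match query on the food field
List&amp;lt;Dish&amp;gt; chickenDishes = repository.findByFood(&quot;Hainanese Chicken Rice&quot;);

// all dishes with a favorite price below the given value
List&amp;lt;Dish&amp;gt; cheapDishes = repository.findByFavoritePriceLessThan(3.0);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;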

&lt;p&gt;To wrap up, Spring Data Elasticsearch is an interesting abstraction on top of the standard client. It is somewhat tied to a certain elasticsearch version; the current release uses version 2.2. There are plans for making it compatible with 5.x but this may still take some time. There is a pull request that uses Jest for communication but it is unclear if and when this will be merged. Unfortunately there is not a lot of activity in the project.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have looked at three Java clients and the higher level abstraction Spring Data Elasticsearch. Each of those has its pros and cons and there is no single recommendation that fits all cases. The transport client has full API support but is tied to the elasticsearch dependency. The RestClient is the future and will one day supersede the transport client; feature wise it is currently very low level. Jest has a richer API but is developed externally and the company behind it doesn&amp;#39;t seem to exist anymore, though there is activity by the committers in the project. Spring Data Elasticsearch on the other hand is better suited for developers who are already using Spring Data and don&amp;#39;t want to get in contact with the elasticsearch API directly. It is currently tied to a version of the standard client and development activity is rather low.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: Relevant Search</title>
   <link href="http://blog.florian-hopf.de/2016/10/book-review-relevant-search.html"/>
   <updated>2016-10-20T07:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/10/book-review-relevant-search</id>
   <content type="html">&lt;p&gt;Relevancy, the notion that some results are better than others is one of the key factors that distinguishes search engines from most other databases. Additionaly it is a task that can sometimes seem like magic and is difficult to get right. Applications like Google have set the bar for how a search engine is expected to work. The relevant results should all be on the top positions. As the saying goes, if you want to make sure a secret stays a secret put it on page 3 of a Google search result page.&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.manning.com/books/relevant-search&quot; style=&quot;clear:left; float:left;margin-right:1em; margin-bottom:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;320&quot; width=&quot;255&quot; src=&quot;/files/relevant-search.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Doug Turnbull and John Berryman have written &lt;a href=&quot;https://www.manning.com/books/relevant-search&quot;&gt;a book about all aspects that are related to relevancy&lt;/a&gt;. You will learn how the inverted index works, about different kinds of queries and the way they influence the score of the result documents. You will see how you can use boost or special queries to influence the result ordering and about different ways to help the user find the things they are looking for. &lt;/p&gt;

&lt;p&gt;For most parts the book uses one coherent example, the search for movies. This is very well suited as it is a mixture of structured and unstructured data. All the examples in the book are using Elasticsearch but there is also an appendix that shows how to do similar things with Solr.&lt;/p&gt;

&lt;p&gt;When starting with the book I thought I had a basic introduction to search engines in my hand. But I was wrong - both authors obviously have lots of experience with search relevancy tuning (no wonder they are connected to the development of tools like &lt;a href=&quot;http://splainer.io/&quot;&gt;Splainer&lt;/a&gt; and &lt;a href=&quot;http://quepid.com/&quot;&gt;Quepid&lt;/a&gt;). The book is different from a lot of other books on search technologies in that it doesn&amp;#39;t describe all the features of a certain search engine but shows how to use them to build business applications. &lt;/p&gt;

&lt;p&gt;Even though I am working intensively with search engines myself I learned a lot while reading the book. Some of the tactics are things that are widely done when building applications based on search engines but the authors manage to name them explicitly and build a structured approach. Finally, besides being very informative the book is an easy read that contains lots of jokes. If you are doing something with search engines you are well advised to read it.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>A Simple Way to Index Java Beans in Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2016/07/index-java-beans-elasticsearch.html"/>
   <updated>2016-07-01T07:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/07/index-java-beans-elasticsearch</id>
   <content type="html">&lt;p&gt;When it comes to data stores Java programmers are used to working with Java beans that are magically persisted. Solutions like &lt;a href=&quot;http://hibernate.org&quot;&gt;Hibernate&lt;/a&gt; and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Java_Persistence_API&quot;&gt;JPA specification&lt;/a&gt; for relational data stores or &lt;a href=&quot;https://mongodb.github.io/morphia/&quot;&gt;Morphia&lt;/a&gt; and &lt;a href=&quot;http://projects.spring.io/spring-data-mongodb/&quot;&gt;Spring Data MongoDB&lt;/a&gt; are popular examples.&lt;/p&gt;

&lt;p&gt;Developers working with Elasticsearch sometimes have the same desire - pass a Java bean and have it indexed automatically. There is an implementation of &lt;a href=&quot;http://projects.spring.io/spring-data-elasticsearch/&quot;&gt;Spring Data for Elasticsearch&lt;/a&gt; available but it might be overhead for you or not be supported by your version of Elasticsearch. And there&amp;#39;s &lt;a href=&quot;https://github.com/searchbox-io/Jest/tree/master/jest&quot;&gt;Jest&lt;/a&gt;, which uses the HTTP API and supports storing Java beans directly.&lt;/p&gt;

&lt;p&gt;If you want to do the same using the standard Java client for Elasticsearch there is no direct support for that but it can be implemented by hand easily. &lt;/p&gt;

&lt;p&gt;Suppose you want to persist the following simple object structure that represents a book.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;Publisher&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;publisher&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Publisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;publisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setCountry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;UK&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;publisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Packt&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Book&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;book&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setTitle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Learning Spring Boot&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setAuthors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;asList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Greg L. Turnquist&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setPublisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;publisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Often it happens that we are thinking so hard about one way to solve a problem that we can&amp;#39;t see the easier way. We don&amp;#39;t need a special framework for Elasticsearch. Elasticsearch will happily store most JSON structures for you. And fortunately creating JSON documents from Java objects is a solved problem using libraries like &lt;a href=&quot;https://github.com/FasterXML/jackson&quot;&gt;Jackson&lt;/a&gt; or &lt;a href=&quot;https://github.com/google/gson&quot;&gt;GSON&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can simply add a dependency, in this case to &lt;a href=&quot;https://github.com/FasterXML/jackson-databind&quot;&gt;jackson-databind&lt;/a&gt;, to the project if it&amp;#39;s not already there and instantiate an ObjectMapper.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;ObjectMapper&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapper&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ObjectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If you&amp;#39;re using Spring Boot you will normally even be able to just @Autowire the ObjectMapper. The ObjectMapper can then be used to create a JSON representation of the object.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeValueAsString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This will result in a string similar to this one.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;title&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Learning Spring Boot&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;authors&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Greg L. Turnquist&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;publisher&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Packt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;country&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;UK&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can then index the result using the Elasticsearch client interface.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;IndexResponse&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;prepareIndex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indexName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;book&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;actionGet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When retrieving the document you can create Java objects again using the readValue method.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;GetResponse&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getResponse&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;prepareGet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indexName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;book&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;actionGet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getResponse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getSourceAsString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Book&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;persistedBook&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapper&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;readValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;assertEquals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Packt&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;persistedBook&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getPublisher&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Or even better: Maybe you don&amp;#39;t even need to create a Java object again? When you&amp;#39;re only displaying the result in a template maybe it&amp;#39;s enough to just pass in a Map of the resulting document?&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sourceAsMap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;getResponse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getSourceAsMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
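
&lt;p&gt;Reading individual fields from that Map is then just a lookup. A small sketch &amp;ndash; note that nested objects like the publisher come back as Maps themselves, so the casts are only safe if you know the structure of your documents.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;String title = (String) sourceAsMap.get(&quot;title&quot;);

// nested JSON objects are returned as Maps as well
@SuppressWarnings(&quot;unchecked&quot;)
Map&amp;lt;String, Object&amp;gt; publisher = (Map&amp;lt;String, Object&amp;gt;) sourceAsMap.get(&quot;publisher&quot;);
assertEquals(&quot;Packt&quot;, publisher.get(&quot;name&quot;));&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;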

&lt;p&gt;Sometimes we are looking for complicated solutions when we don&amp;#39;t even need them. As Elasticsearch uses JSON everywhere it is very easy to use common libraries for serialization, be it in Java or in other languages.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>On Writing a Book</title>
   <link href="http://blog.florian-hopf.de/2016/06/on-writing-a-book.html"/>
   <updated>2016-06-17T07:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/06/on-writing-a-book</id>
   <content type="html">&lt;p&gt;Last December I was holding the final printed version of my &lt;a href=&quot;http://elasticsearch-buch.de&quot;&gt;German book on Elasticsearch&lt;/a&gt; in my hands. It's a good feeling and though there were stressful times I don't regret writing it. In this very long post I'd like to talk about what it's like to write and publish a book the traditional way.&lt;/p&gt;

&lt;h2&gt;Signing the contract&lt;/h2&gt;

&lt;p&gt;I first got approached by &lt;a href=&quot;http://dpunkt.de&quot;&gt;dpunkt.verlag&lt;/a&gt; in June 2013. &lt;a href=&quot;http://n-k.de&quot;&gt;Niko Köbler&lt;/a&gt; talked to an editor of the publisher at a conference, they somehow got to Elasticsearch and he recommended me as a potential author. The first call I had with the editor was very informal &amp;ndash; we talked about search technology and the role of Elasticsearch. They were thinking about doing a book about it in the future but there were no concrete plans. We agreed to wait for half a year and then see again.&lt;/p&gt;

&lt;p&gt;I didn&amp;#39;t really think that something would come out of it at all but at the end of March 2014 the editor contacted me again, telling me that they wanted to do a book on Elasticsearch and asking if I would be interested.&lt;/p&gt;

&lt;p&gt;I always thought that writing a book is a good thing. Not that I really worked on going that route but I somehow admired people who had done it. So I didn&amp;#39;t have to think a lot &amp;ndash; I wanted to do it and agreed.&lt;/p&gt;

&lt;p&gt;Even before signing a contract I had to create an outline of the book, i.e. create a table of contents with the headlines for the different chapters and sections in the book. This is more difficult than it might sound, even when you think you know what you want to write about. You need to make sure that all the necessary topics are covered and that it&amp;#39;s a coherent story. &lt;/p&gt;

&lt;p&gt;This outline is then sent to several reviewers who provide feedback for the author (and very likely also assure the publisher that the author is the right person for the job). After incorporating some feedback in the outline I had to put milestone dates on the chapters. That also means defining the order the chapters will be written in, which normally is not the order they are read in (example: the first chapter in the book is the last chapter I wrote). I already expected the writing process to be a huge undertaking, taking longer than I would think. I planned on finishing writing the book in around eight months, having it released in about one year. I signed the contract in July 2014 at Java Forum Stuttgart.&lt;/p&gt;

&lt;h2&gt;Tools&lt;/h2&gt;

&lt;p&gt;I had the choice of technology for writing the book &amp;ndash; Word, OpenOffice documents or LaTeX. I chose LaTeX, which I already knew from writing my thesis and which I suspected would be less problematic when it comes to formatting issues. Later I sometimes regretted the choice as I had to struggle a lot setting the system up. Only later did we find out that I hadn&amp;#39;t received all the necessary files, so these problems could have been prevented.&lt;/p&gt;

&lt;p&gt;Nearly all of my writing was done directly in vim. I had a notebook but didn&amp;#39;t do a lot of writing on paper (but I am writing the first version of the post you are just reading on paper). I put all the assets under version control in Git, also pushing them to a remote server for backup and sometimes copying all of it on USB sticks. You really don&amp;#39;t want to start again from the beginning after that much work :).&lt;/p&gt;

&lt;p&gt;I struggled a bit with the choice of tool for images. I started with some placeholder images I created using &lt;a href=&quot;http://www.yworks.com/products/yed&quot;&gt;yEd&lt;/a&gt;, experimented with some other tools, and finally went back to yEd again. There might be more beautiful diagrams than the ones in the book but I am quite happy with the result.&lt;/p&gt;

&lt;h2&gt;The Writing Process&lt;/h2&gt;

&lt;p&gt;Writing happens on a chapter-by-chapter basis according to the milestone plan I created with the outline. If you&amp;#39;re lucky like me you can find people who are willing to review the chapters for you. Otherwise the publisher will have people for reviewing the book. I contacted some people for help and saved the publisher&amp;#39;s reviewers for the final version of the book. I am especially thankful for the help of &lt;a href=&quot;https://twitter.com/tokraft&quot;&gt;Tobias Kraft&lt;/a&gt; of &lt;a href=&quot;http://exensio.de&quot;&gt;Exensio&lt;/a&gt; who reviewed each and every chapter in detail. Thanks for your work Tobias!&lt;/p&gt;

&lt;p&gt;I started writing with the second chapter, which I planned to be one coherent example of how to use Elasticsearch, how analyzing and search work and so on. I was already well prepared for writing as during the time I started to work on the book I was publishing blog posts regularly, nearly on a weekly basis. I planned on establishing a writing habit: due to my work as a freelancer I was able to reduce client projects to four days a week and have one day I could dedicate to writing alone. But even though I planned to do so it was difficult to execute. Half a year is a very long time span and you tend to have more urgent things to do &amp;ndash; &amp;quot;I will catch up on writing later on&amp;quot; &amp;ndash; except you don&amp;#39;t.&lt;/p&gt;

&lt;p&gt;When I finished the second chapter it had taken a lot longer than expected and it had grown very long. Even I knew that this was not of the quality to be published in a book. And the feedback from my editor and the reviewers was the same. Though they could see where I wanted to go, my writing wasn&amp;#39;t good yet. The sentences were too long, you could notice the different times at which I had written the parts, and all in all the chapter was far too long for an initial example. Back to the writing desk.&lt;/p&gt;

&lt;p&gt;I was already late and had to redo everything. But after a while it got better: I extracted two more chapters on topics I didn&amp;#39;t originally plan to write about in detail. But I still had to take too much time away from writing for other things. Months passed and I noticed I wouldn&amp;#39;t be able to make the deadline. In January 2015 I re-planned the dates for book delivery, extending the deadlines by about two months.&lt;/p&gt;

&lt;p&gt;Writing went well afterwards: I managed to get into some kind of flow more regularly but I was still taking too long. In March I extended for another month, planning on finishing the book at the end of April. At the end of March I noticed that even this plan would not work out and extended for another two months, finishing writing at the end of June 2015. I managed to meet this goal.&lt;/p&gt;

&lt;p&gt;I logged all the time spent writing in an Excel sheet in 15-minute blocks. I became so obsessed with collecting the numbers that I didn&amp;#39;t work on the book at all if I had only five or ten minutes because then I couldn&amp;#39;t log the time correctly.&lt;/p&gt;

&lt;p&gt;Writing happened either in the morning before doing customer work or on dedicated days I spent at the library. It&amp;#39;s good for me to have a dedicated work environment. Most of the days I felt really exhausted after working around six hours on the book. Writing is more challenging than normal programming work.&lt;/p&gt;

&lt;h2&gt;The Editorial Process&lt;/h2&gt;

&lt;p&gt;After I was done with writing, a PDF was sent to some reviewers who had around three weeks to send their feedback. The result was really helpful with useful comments and suggestions. I incorporated some feedback for the final version, being extra careful not to add any errors because no technical reviewer would see those edits anymore.&lt;/p&gt;

&lt;p&gt;This was also the time I created the index for the book, using the headlines as guidance. Having a bad index can be really annoying for a technical book. This also took longer than you might think. I also had to redo the bibliography because I had used the wrong notation. This basically meant typing the whole list again. I thought about some automation but in the end it was done by hand in a few hours.&lt;/p&gt;

&lt;p&gt;After doing the edits I sent the manuscript to the publisher for final proofreading.&lt;/p&gt;

&lt;p&gt;It was a bit of a surprise for me that the corrections arrived via mail and on paper. But it&amp;#39;s the same with me &amp;ndash; reading and correcting on paper works better.&lt;/p&gt;

&lt;p&gt;After having added the corrections to the manuscript the book went off for typesetting which roughly took another two weeks. Afterwards I had to change some things regarding the code formatting and redo some of the images. The final version then went to print in mid November, the first big relief for me. &lt;/p&gt;

&lt;h2&gt;Holding the Book in the Hand&lt;/h2&gt;

&lt;p&gt;Before extending the deadlines I expected the book to be published in July 2015. When it was added to Amazon the listed release date was initially September 2015, then October, then November and finally December 2015. I had a longer holiday booked for Christmas and I was really eager to hold the book in my hands before leaving.&lt;/p&gt;

&lt;p&gt;Fortunately the first copy of the book was sent to me on December 11, 2015, and a few days later a box with all my author copies arrived.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;de&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Happy that my &lt;a href=&quot;https://twitter.com/hashtag/Elasticsearch?src=hash&quot;&gt;#Elasticsearch&lt;/a&gt; book is now available. A package full of past work :-) &lt;a href=&quot;https://t.co/a4wWJpvmc0&quot;&gt;https://t.co/a4wWJpvmc0&lt;/a&gt; &lt;a href=&quot;https://t.co/zhM1llaUbx&quot;&gt;pic.twitter.com/zhM1llaUbx&lt;/a&gt;&lt;/p&gt;&amp;mdash; Florian Hopf (@fhopf) &lt;a href=&quot;https://twitter.com/fhopf/status/676814847762423808&quot;&gt;15. Dezember 2015&lt;/a&gt;&lt;/blockquote&gt;

&lt;script async src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;I can tell you, it&amp;#39;s an excellent feeling holding the book in your hands after a year and a half of often stressful times.&lt;/p&gt;

&lt;h2&gt;The Website&lt;/h2&gt;

&lt;p&gt;Quite early I decided that it&amp;#39;s a good idea to have a dedicated website for the book. I got the domain &lt;a href=&quot;http://elasticsearch-buch.de&quot;&gt;elasticsearch-buch.de&lt;/a&gt; for marketing and for putting stuff on it that didn&amp;#39;t make it into the book. Fortunately I haven&amp;#39;t had to add an errata section so far :).&lt;/p&gt;

&lt;p&gt;I created the website at the end of November 2015 using the static site generator &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt; and a predefined template. I put up some articles and marketing texts and a list of all the resources in the book as clickable links. Even after the book was released I hadn&amp;#39;t added all the content I was referring to in the book. There was some quite stressful last-minute publishing before leaving for a longer holiday.&lt;/p&gt;

&lt;p&gt;After the release I also had the chance to publish a &lt;a href=&quot;https://www.elastic.co/de/blog/elasticsearch-ein-praktischer-einstieg&quot;&gt;guest blog post about the book&lt;/a&gt; on the German elastic blog.&lt;/p&gt;

&lt;h2&gt;Compensation&lt;/h2&gt;

&lt;p&gt;It&amp;#39;s no secret that this book won&amp;#39;t make me rich. I don&amp;#39;t know how many copies will be sold but in the end I expect the hourly rate for the writing to be in the low single-digit € range. You might not be surprised that my consulting rates are a bit higher than that, but of course the book is good marketing for services. I received fewer requests for work than expected afterwards but having written the book I am more confident requesting higher rates because I know the topic in depth.&lt;/p&gt;

&lt;p&gt;But it&amp;#39;s not only a money thing. I learned a lot while writing the book, going into far more detail than I would have done otherwise, making sure that I really understood all the functionality I was describing.&lt;/p&gt;

&lt;p&gt;Finally having written a book just feels good.&lt;/p&gt;

&lt;h2&gt;Statistics&lt;/h2&gt;

&lt;p&gt;Finally some numbers on the writing process.&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;Number of commits in repo&lt;/td&gt;&lt;td&gt;266&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;First commit with chapter content&lt;/td&gt;&lt;td&gt;16.06.2014&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Last commit before sending the book for publishing&lt;/td&gt;&lt;td&gt;13.11.2015&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Number of time entries logged&lt;/td&gt;&lt;td&gt;162&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Number of hours logged&lt;/td&gt;&lt;td&gt;342&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Number of characters in all tex files&lt;/td&gt;&lt;td&gt;495701&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Number of resources in bibliography&lt;/td&gt;&lt;td&gt;199&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;I was a bit surprised to see that I had &lt;em&gt;only&lt;/em&gt; logged 342 hours. So I could have done the book in around two months working 40 hours a week. Except that this wouldn&amp;#39;t have been possible &amp;ndash; there was a lot of conscious and unconscious thinking involved when I wasn&amp;#39;t writing. Besides that it is nearly impossible for me to write for more than six hours a day.&lt;/p&gt;

&lt;h2&gt;On Versions&lt;/h2&gt;

&lt;p&gt;Elasticsearch is a very active project with lots of features added in new releases. I started writing the book using Elasticsearch 1.3.1, later switching to 1.4.x and then, for the final version of the book, to 1.6. I was quite dedicated to ensuring there were no mistakes in the example requests, so after switching versions I had to make sure that all the requests were still working. This also meant making sure the output of queries and the log output were still the same.&lt;/p&gt;

&lt;p&gt;During the time of writing, the rivers that were used to pull data into Elasticsearch were deprecated. I had already described the Twitter river to gather data from Twitter for use in the aggregations chapter. As switching this to another mechanism would very likely have meant that the document structure changed and that I would have to redo all the aggregation examples, I decided to stick with the river and explain &lt;a href=&quot;http://elasticsearch-buch.de/2015/12/03/logstash-twitter.html&quot;&gt;how the same can be done using Logstash in a separate article on the website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another thing that happened in February 2015 was the rebranding of the company behind Elasticsearch to elastic. This meant that I had to redo all the links and all mentions of the company. This was more work than I would have thought in advance.&lt;/p&gt;

&lt;p&gt;Two rather bad things happened with regards to versioning. In the chapter on centralized logging I describe Graylog as an alternative to the ELK stack. I used version 0.9.2 but in February 2015 Graylog 1.0 was released. As I had the chapter finished already including lots of screenshots and didn&amp;#39;t have any time to spare I decided to stay with the version, stating it explicitly and just making sure that I didn&amp;#39;t describe features that were gone in Graylog 1.0.&lt;/p&gt;

&lt;p&gt;Even worse, after finishing the manuscript Elasticsearch 2.0 was released. It sounds like a big deal but lots of the changes were under the hood and it didn&amp;#39;t change that much for the user. I wasn&amp;#39;t able to redo everything for 2.0 but wanted to make sure that the book was also valid for the new version. I published &lt;a href=&quot;http://elasticsearch-buch.de/2015/12/16/elasticsearch-2.html&quot;&gt;a blog post&lt;/a&gt; on the website for the book describing the features and changes.&lt;/p&gt;

&lt;p&gt;Generally I think it is very important to clearly state what version of software components you are using for writing the book. Otherwise readers might be confused when other versions don&amp;#39;t work as described.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Writing a book can be extremely stressful but it is also a very rewarding experience. Having a publisher with an editor means that you do have competent help when it comes to language and style (some publishers might be different and not focus on quality that much). The time spent writing the book will never be paid for by the sales alone. But it can be useful for marketing and it&amp;#39;s a great way to learn about something you care about in depth.&lt;/p&gt;

&lt;p&gt;The process is a classic example of &lt;a href=&quot;https://en.wikipedia.org/wiki/Hofstadter&amp;#x27;s_law&quot;&gt;Hofstadter&amp;#39;s Law&lt;/a&gt;: It always takes longer than you expect, even when you take into account Hofstadter&amp;#39;s Law. I wasn&amp;#39;t able to spend as much time on writing as I was planning to, and additionally reviewing and incorporating feedback took longer than I would have expected. Nevertheless I don&amp;#39;t regret it &amp;ndash; it was a good experience.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Stringify Everything in Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2016/05/stringify-everything-elasticsearch.html"/>
   <updated>2016-05-27T06:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/05/stringify-everything-elasticsearch</id>
   <content type="html">&lt;p&gt;A while ago I was working on a prototype to search larger structured documents using Elasticsearch. We were only interested to make the text searchable, with an option to search all the text and some seperate fields. Elasticsearch is of course a perfect solution for this with the &lt;code&gt;_all&lt;/code&gt; field and the possiblility to search single or multiple fields.&lt;/p&gt;

&lt;p&gt;The documents we had to make searchable were rather complex, consisting of hundreds of fields with different data types, special identifiers and numeric and string values. We had everything exported in JSON documents but unfortunately the data types were mixed and changing from document to document. Indexing two example documents might look like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;POST /example/doc
{
    &quot;my-id&quot;: 1,
    &quot;my-tag&quot;: &quot;one&quot;,
    &quot;my-flag&quot;: true
}

POST /example/doc
{
    &quot;my-id&quot;: &quot;1b&quot;,
    &quot;my-tag&quot;: &quot;one b&quot;,
    &quot;my-flag&quot;: &quot;enabled&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What happens when these documents are indexed? Elasticsearch will try to guess the field type by its value. For the first document &lt;code&gt;my-id&lt;/code&gt; clearly is a numeric value and &lt;code&gt;my-flag&lt;/code&gt; a boolean. But for the second document the field types change to string values that of course can&amp;#39;t be indexed in a numeric or boolean field. So indexing will fail for the second document. What can be done? &lt;/p&gt;

&lt;p&gt;Of course the best way would have been to create the JSON documents correctly in the first place but this would have been rather complex because of the environment we were working in. As we were working on a prototype, the quicker solution was to create a mapping for Elasticsearch that treats the values as strings. But creating a dedicated mapping would have been too complex &amp;ndash; remember there were hundreds of fields that we would have to check and configure. As we were OK with just treating all the fields in the documents as string values, the solution was to add a &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html&quot;&gt;dynamic template&lt;/a&gt; to our index that then maps all fields to string.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;DELETE /example

PUT /example

PUT /example/doc/_mapping
{
   &quot;doc&quot;: {
      &quot;dynamic_templates&quot;: [
         {
            &quot;all_strings&quot;: {
               &quot;match&quot;: &quot;*&quot;,
               &quot;mapping&quot;: {
                  &quot;type&quot;: &quot;string&quot;,
                  &quot;analyzer&quot;: &quot;standard&quot;
               }
            }
         }
      ]
   }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now, when indexing the documents again, Elasticsearch will consider the dynamic template &lt;code&gt;all_strings&lt;/code&gt;. By using &lt;code&gt;*&lt;/code&gt; for &lt;code&gt;match&lt;/code&gt; the template is applied to all fields. Each field will then automatically be configured as a string value, and all documents can be indexed and searched afterwards.&lt;/p&gt;
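&lt;p&gt;If you want to verify what happened you can have a look at the generated mapping and run a quick search &amp;ndash; the exact mapping output depends on your Elasticsearch version, but all fields should now show up as strings:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-&quot; data-lang=&quot;&quot;&gt;GET /example/_mapping

POST /example/doc/_search
{
    &quot;query&quot;: {
        &quot;match&quot;: {
            &quot;my-flag&quot;: &quot;enabled&quot;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;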
</content>
 </entry>
 
 <entry>
   <title>Blog Relaunch</title>
   <link href="http://blog.florian-hopf.de/2016/05/blog-relaunch.html"/>
   <updated>2016-05-16T14:16:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/05/blog-relaunch</id>
   <content type="html">&lt;p&gt;I started a blog on blogspot in 2009, just to have something to put my thoughts on. In the beginning I did not blog regularly with 17 posts from 2009 to 2011 but wrote my most popular post so far when it comes to total visits on &lt;a href=&quot;http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html&quot;&gt;using Akka from Java&lt;/a&gt;. When starting as an independent developer in 2012 it was clear for me that I wanted to use blogging for my marketing because it's a great way for me: When writing posts I can learn a lot about different topics I am interested in.&lt;/p&gt;

&lt;p&gt;As I wanted to have the blog linked to my name I configured blogspot to use blog.florian-hopf.de instead of the old fhopf.blogspot.com. At the same time I introduced some pages about my work, articles I had written and talks I had given. I generated some static files using &lt;a href=&quot;http://nanoc.ws/&quot;&gt;Nanoc&lt;/a&gt; with a very simple bootstrap based theme. Those files were then being delivered using www.florian-hopf.de. To give the impression of a coherent website I added the same template to blogspot so it might have happened that you didn&amp;#39;t even notice that two systems were involved.&lt;/p&gt;

&lt;p&gt;After starting as an independent developer I also increased my blogging frequency, publishing weekly for several months. This led to 13 posts in 2012 (I started freelancing in May), 25 posts in 2013 and 29 posts in 2014. As I then started writing my &lt;a href=&quot;http://elasticsearch-buch.de&quot;&gt;book on Elasticsearch&lt;/a&gt; there were nearly no posts in 2015 and 2016 but I am planning to write more regularly soon.&lt;/p&gt;

&lt;p&gt;So far the combination of blogger and static website worked for me but it had some drawbacks. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Though I got Nanoc somehow working I wasn&amp;#39;t really convinced it&amp;#39;s the best tool for the job.&lt;/li&gt;
&lt;li&gt;The fonts of the template were far too small, especially for mobile devices.&lt;/li&gt;
&lt;li&gt;I always had to keep the templates in sync. When changing something on the website (e.g. a new navigation item) I had to add it to the blogger template as well.&lt;/li&gt;
&lt;li&gt;Finally the most important drawback: I am writing most blog posts in text files first and versioning them using Git. Going through the blogger web interface felt like a totally unnecessary step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately I finally found the time to do something about it and you are looking at the result right now. This site is completely generated using &lt;a href=&quot;https://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt;; the template is based on the simple responsive &lt;a href=&quot;https://github.com/poole/hyde&quot;&gt;Hyde&lt;/a&gt;. Jekyll provides an &lt;a href=&quot;https://import.jekyllrb.com/docs/blogger/&quot;&gt;easy way to migrate your blogger content&lt;/a&gt;. The comments are now on &lt;a href=&quot;https://disqus.com&quot;&gt;Disqus&lt;/a&gt; and I hope that all old comments are still accessible. RSS is still delivered using FeedBurner, which now points to my server. The blog is still running on the old subdomain but all of the content is now delivered from my server.&lt;/p&gt;

&lt;p&gt;With the new template I hope my posts can be read more easily, especially on mobile devices. If you notice something weird I would be really happy if you let me know. I hope you will enjoy some of the posts I will be writing in the future.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Learning Lucene</title>
   <link href="http://blog.florian-hopf.de/2016/04/learning-lucene.html"/>
   <updated>2016-04-05T14:16:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/04/learning-lucene</id>
   <content type="html">&lt;p&gt;I am currently working with a team starting a new project based on Lucene. While most of the time I would argue on using either Solr or Elasticsearch instead of plain Lucene it was a conscious decision. In this post I am compiling some sources for learning Lucene – I hope you will find them helpful or you can hint what sources I missed.&lt;/p&gt;&lt;h4&gt;Project documentation&lt;/h4&gt;&lt;p&gt;The first choice of course is the excellent &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/index.html&quot;&gt;project documentation&lt;/a&gt;. It contains the Javadoc for all the modules (&lt;a href=&quot;http://lucene.apache.org/core/5_5_0/core/index.html&quot;&gt;core&lt;/a&gt;, &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/analyzers-common/index.html&quot;&gt;analyzers-common&lt;/a&gt; and &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/queryparser/index.html&quot;&gt;queryparser&lt;/a&gt; being the most important ones) that also contains further documentation, for example an &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/demo/overview-summary.html#overview_description&quot;&gt;explanation of a simple demo app&lt;/a&gt; and helpful introductions to &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/analysis/package-summary.html#package_description&quot;&gt;analysis&lt;/a&gt; and &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/package-summary.html#package_description&quot;&gt;querying and scoring&lt;/a&gt;. You might also be interested in the &lt;a href=&quot;http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/codecs/lucene54/package-summary.html#package_description&quot;&gt;standard index file formats&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Besides the documentation that comes with the releases there is also lots of information in the &lt;a href=&quot;http://wiki.apache.org/lucene-java/FrontPage?action=show&amp;amp;redirect=FrontPageEN&quot;&gt;project wiki&lt;/a&gt; but you need to know what you are looking for. You can also join the &lt;a href=&quot;http://lucene.apache.org/core/discussion.html&quot;&gt;mailing lists&lt;/a&gt; to learn about what other users are doing.&lt;/p&gt;&lt;p&gt;When looking at analyzer components the &lt;a href=&quot;http://www.solr-start.com/&quot;&gt;Solr Start&lt;/a&gt; website can be useful. Though dedicated to Solr the &lt;a href=&quot;http://www.solr-start.com/info/analyzers/&quot;&gt;list of analyzer components&lt;/a&gt; can be useful to determine analyzers for Lucene as well. It also contains a &lt;a href=&quot;http://www.solr-start.com/javadoc/solr-lucene/index.html&quot;&gt;searchable version of the Javadocs&lt;/a&gt;.&lt;/p&gt;&lt;h4&gt;Books&lt;/h4&gt;&lt;p&gt;The classic book about the topic is &lt;a href=&quot;https://www.manning.com/books/lucene-in-action-second-edition&quot;&gt;Lucene in Action&lt;/a&gt;. On over 500 pages it explains all the underlying concepts in detail. Unfortunately some of the information is outdated and lots of the code examples won't work anymore. Also the newer concepts are not included. Still it's the recommended piece on learning Lucene.&lt;/p&gt;&lt;p&gt;Anonther book I've read is &lt;a href=&quot;https://www.packtpub.com/big-data-and-business-intelligence/lucene-4-cookbook&quot;&gt;Lucene 4 Cookbook&lt;/a&gt; published at Packt. It contains more current examples but is not suited well for learning the basics. 
Additionally it felt to me as if no editor worked on this book, there are lots of repetitions, typos and broken sentences. (I am making lots of grammar mistakes myself when blogging - but I am expecting more from a published book.)&lt;/p&gt;&lt;p&gt;You can also learn a lot about different aspects of Lucene by reading a book on one of the search servers based on it. I can recommend &lt;a href=&quot;https://www.manning.com/books/elasticsearch-in-action&quot;&gt;Elasticsearch in Action&lt;/a&gt;, &lt;a href=&quot;https://www.manning.com/books/solr-in-action&quot;&gt;Solr in Action&lt;/a&gt; and &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html&quot;&gt;Elasticsearch – The definitive Guide&lt;/a&gt;. (If you can read German I am of course inviting you to read &lt;a href=&quot;http://elasticsearch-buch.de/&quot;&gt;my book on Elasticsearch&lt;/a&gt;.)&lt;/p&gt;&lt;h4&gt;Blogs, Conferences and Videos&lt;/h4&gt;&lt;p&gt;There are countless blog posts on Lucene, a very good introduction is &lt;a href=&quot;http://blog.parsely.com/post/1691/lucene/&quot;&gt;Lucene: The Good Parts by Andrew Montalenti&lt;/a&gt;. Some blogs publish regular pieces on Lucene, recommended ones are by &lt;a href=&quot;http://blog.mikemccandless.com/&quot;&gt;Mike McCandless&lt;/a&gt; (who now mostly blogs on the &lt;a href=&quot;https://www.elastic.co/blog/author/michael-mccandless&quot;&gt;elastic Blog&lt;/a&gt;), &lt;a href=&quot;http://opensourceconnections.com/blog/&quot;&gt;OpenSource Connections&lt;/a&gt;, &lt;a href=&quot;http://www.flax.co.uk/blog/&quot;&gt;Flax&lt;/a&gt; and &lt;a href=&quot;http://blog.thetaphi.de/&quot;&gt;Uwe Schindler&lt;/a&gt;. There is a lot of content about Lucene on the &lt;a href=&quot;https://www.elastic.co/blog/&quot;&gt;elastic Blog&lt;/a&gt;, if you want to hear about current development I can recommend the &quot;This week in Elasticsearch and Apache Lucene&quot; series. There are also some interesting posts on the &lt;a href=&quot;http://lucidworks.com/blog/&quot;&gt;Lucidworks Blog&lt;/a&gt; and I am sure there are lots of other blogs I forgot to mention here.&lt;/p&gt;&lt;p&gt;Lucene is a regular topic on two larger conferences: &lt;a href=&quot;http://lucenerevolution.org/&quot;&gt;Lucene/Solr Revolution&lt;/a&gt; and &lt;a href=&quot;http://berlinbuzzwords.de/&quot;&gt;Berlin Buzzwords&lt;/a&gt;. You can find lots of video recordings of the past events on their website.&lt;/p&gt;&lt;h4&gt;Sources&lt;/h4&gt;&lt;p&gt;Finally, the project is open source so you can learn a lot about it by reading the source code of either the library or the tests.&lt;/p&gt;&lt;p&gt;Another option is to look at applications using it, either Solr and Elasticsearch. Of course you need to find your way around the sources of the project but sometimes this isn't too hard. One example for Elasticsearch: If you would like to learn about how the common multi_match-Query is implemented in Lucene you will easily find the class &lt;a href=&quot;https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/index/search/MultiMatchQuery.java&quot;&gt;MultiMatchQuery&lt;/a&gt; that creates the Lucene queries.&lt;/p&gt;&lt;h4&gt;What did I miss?&lt;/h4&gt;&lt;p&gt;I hope there is something useful for you in this post. I am sure I missed lots of great resources for learning Lucene. If you would like to add one let me know in the comments or on &lt;a href=&quot;https://twitter.com/fhopf&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Logging Requests to Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2016/03/logging-requests-to-elasticsearch.html"/>
   <updated>2016-03-23T14:09:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2016/03/logging-requests-to-elasticsearch</id>
   <content type="html">&lt;p&gt;This is something I wanted to write down for years but never got down to completing the post. It can help you a lot with certain Elasticsearch setups by answering two questions using the &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html&quot;&gt;slow log&lt;/a&gt;.&lt;/p&gt;   &lt;ul&gt;&lt;li&gt;Is my application talking to Elasticsearch?&lt;/li&gt;&lt;li&gt;What kind of queries are being built by my application?&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;A while ago I helped a colleague on one of my current projects to debug some problems with Elasticsearch integrated into proprietary software. He was not sure if there are any requests arriving at Elasticsearch and what those look like. We activated the slow log for Elasticsearch, which not only can be used to log the slow queries but also to enable debugging for any queries that reach Elasticsearch.&lt;/p&gt; &lt;p&gt;The slow log, as the name suggests, is there to log slow requests. As slow is a subjective term you can define thresholds that need to be passed. For example you can define that any queries slower than 50ms are logged in the debug level but any queries that take longer than 500ms in the warn level.&lt;/p&gt; &lt;p&gt;Slow queries can be configured for both phases of the query execution: query and fetch. In the query phase only the ids of the documents are retrieved in the form of a search result list. The fetch phase is where the result documents are retrieved.&lt;/p&gt; &lt;p&gt;Besides the slow query log there is also the slow index log which can be used in the same way but measures the time for indexing.&lt;/p&gt; &lt;p&gt;Both of these settings are index settings. That means they are configured for each index and can therefore be different across indices.&lt;/p&gt; &lt;h4&gt;Instance Settings&lt;/h4&gt; &lt;p&gt;There are multiple places where you can configure index settings. The first is &lt;code&gt;config/elasticsearch.yml&lt;/code&gt; that contains the configuration of the instance. For older versions of Elasticsearch it already contains the lines that are commented out, in newer versions you need to include them yourself. If you want to log all requests at debug level you can just add the following lines and set a threshold of 0s.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;index.search.slowlog.threshold.query.debug: 0s&lt;br /&gt;index.search.slowlog.threshold.fetch.debug: 0s&lt;br /&gt;index.indexing.slowlog.threshold.index.debug: 0s&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You need to reboot the instance so that the settings are activated. Any indexing and search requests will now be logged to separate log file in the log folder. With the default configuration the logs will be at &lt;code&gt;logs/elasticsearch_index_indexing_slowlog.log&lt;/code&gt; and &lt;code&gt;logs/elasticsearch_index_search_slowlog.log&lt;/code&gt;. 
The query log will now contain entries like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;[2016-03-23 06:43:47,231][DEBUG][index.search.slowlog.fetch] took[5.8ms], took_millis[5], types[talk], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[{&quot;query&quot;:{&quot;match&quot;:{&quot;tags&quot;:&quot;Java&quot;}}}], extra_source[]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If you are testing this with multiple shards on one instance you might get more log lines than expected: There will be one line for every shard in the query phase and one line for the fetch phase.&lt;/p&gt;   &lt;h4&gt;Runtime Settings&lt;/h4&gt; &lt;p&gt;Besides the setting in &lt;code&gt;elasticsearch.yml&lt;/code&gt; the slow request logs can also be activated using the HTTP API which doesn't require a reboot of the instance and is therefore really well suited for debugging production issues. The following request changes the setting for the query log for an index &lt;code&gt;conference&lt;/code&gt;.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT &quot;http://localhost:9200/conference/_settings&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;index.search.slowlog.threshold.query.debug&quot;: &quot;0s&quot;&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When you are done debugging your issue you can just set a higher threshold again.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>ActiveMQ as a Message Broker for Logstash</title>
   <link href="http://blog.florian-hopf.de/2015/07/activemq-as-message-broker-for-logstash.html"/>
   <updated>2015-07-23T13:53:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2015/07/activemq-as-message-broker-for-logstash</id>
   <content type="html">&lt;p&gt;When scaling Logstash it is common to add a message broker that is used to temporarily buffer incoming messages before they are being processed by one or more Logstash nodes. Data is pushed to the brokers either through a shipper like &lt;a href=&quot;https://github.com/josegonzalez/python-beaver&quot;&gt;Beaver&lt;/a&gt; that reads logfiles and sends each event to the broker. Alternatively the application can send the log events directly using something like a Log4j appender.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-NJR0Pv4qcZ0/VbB6EdXUrtI/AAAAAAAAAbg/SEVwR1ZgAnQ/s1600/overview.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-NJR0Pv4qcZ0/VbB6EdXUrtI/AAAAAAAAAbg/SEVwR1ZgAnQ/s640/overview.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;A common option is to use &lt;a href=&quot;https://ianunruh.com/2014/05/monitor-everything-part-2.html&quot;&gt;Redis as a broker&lt;/a&gt; that stores the data in memory but using other options like Apache Kafka is also possible. Sometimes organizations are not that keen to introduce lots of new technology and want to reuse existing stores. &lt;a href=&quot;http://activemq.apache.org/&quot;&gt;ActiveMQ&lt;/a&gt; is a widely used messaging and integration platform that supports different protocols and looks just perfect for the use as a message broker. Let's see the options to integrate it.&lt;/p&gt; &lt;h4&gt;Setting up ActiveMQ&lt;/h4&gt; &lt;p&gt;ActiveMQ can easily be set up using the scripts that ship with it. On Linux it's just a matter of executing &lt;code&gt;./activemq console&lt;/code&gt;. Using the admin console at http://127.0.0.1:8161/admin/ you can create new queues and even enqueue messages for testing.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-baLz1N4JO5k/VbB6L0181FI/AAAAAAAAAbo/b9hdyEcce4U/s1600/admin-dashboard.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-baLz1N4JO5k/VbB6L0181FI/AAAAAAAAAbo/b9hdyEcce4U/s640/admin-dashboard.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Consuming messages with AMQP&lt;/h4&gt; &lt;p&gt;An obvious way to try to connect ActiveMQ to Logstash is using &lt;a href=&quot;https://www.amqp.org/&quot;&gt;AMQP, the Advanced Message Queuing Protocol&lt;/a&gt;. 
It's a standard protocol that is supported by different messaging platforms.&lt;/p&gt; &lt;p&gt;There used to be a &lt;a href=&quot;http://logstash.net/docs/1.1.9/tutorials/just-enough-amqp-for-logstash&quot;&gt;Logstash input for AMQP&lt;/a&gt; but unfortunately it has been renamed to &lt;a href=&quot;https://www.elastic.co/guide/en/logstash/current/plugins-inputs-rabbitmq.html&quot;&gt;rabbitmq-input&lt;/a&gt; because RabbitMQ is the main system that is supported.&lt;/p&gt; &lt;p&gt;Let's see what happens if we try to use the input with ActiveMQ.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;input {&lt;br /&gt;    rabbitmq {&lt;br /&gt;        host =&gt; &quot;localhost&quot;&lt;br /&gt;        queue =&gt; &quot;TestQueue&quot;&lt;br /&gt;        port =&gt; 5672&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output {&lt;br /&gt;    stdout {&lt;br /&gt;        codec =&gt; &quot;rubydebug&quot;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;   &lt;p&gt;We tell Logstash to listen on localhost on the standard port on a queue named TestQueue. The result should just be dumped to the standard output. Unfortunately Logstash only issues errors because it can't connect.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;Logstash startup completed&lt;br /&gt;RabbitMQ connection error: . Will reconnect in 10 seconds... {:level=&gt;:error}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;In the ActiveMQ logs we can see that our parameters are correct but unfortunately both systems seem to speak different dialects of AMQP.&lt;/p&gt; &lt;pre&gt;&lt;code&gt; WARN | Connection attempt from non AMQP v1.0 client. AMQP,0,0,9,1&lt;br /&gt;org.apache.activemq.transport.amqp.AmqpProtocolException: Connection from client using unsupported AMQP attempted&lt;br /&gt;...&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;So bad luck with this option.&lt;/p&gt; &lt;h4&gt;Consuming messages with STOMP&lt;/h4&gt; &lt;p&gt;The aptly named &lt;a href=&quot;https://stomp.github.io/&quot;&gt;Simple Text Oriented Messaging Protocol&lt;/a&gt; is another option that is supported by ActiveMQ. Fortunately there is a &lt;a href=&quot;https://www.elastic.co/guide/en/logstash/current/plugins-inputs-stomp.html&quot;&gt;dedicated input for it&lt;/a&gt;. 
It is not included in Logstash by default but can be installed easily.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/plugin install logstash-input-stomp&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Afterwards we can just use it in our Logstash config.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;input {&lt;br /&gt;    stomp {&lt;br /&gt;        host =&gt; &quot;localhost&quot;&lt;br /&gt;        destination =&gt; &quot;TestQueue&quot;&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output {&lt;br /&gt;    stdout {&lt;br /&gt;        codec =&gt; &quot;rubydebug&quot;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This time we are better off: Logstash really can connect and dumps our message to the standard output.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/logstash --config stomp.conf &lt;br /&gt;Logstash startup completed&lt;br /&gt;{&lt;br /&gt;       &quot;message&quot; =&gt; &quot;Can I kick it...&quot;,&lt;br /&gt;      &quot;@version&quot; =&gt; &quot;1&quot;,&lt;br /&gt;    &quot;@timestamp&quot; =&gt; &quot;2015-07-22T05:42:35.016Z&quot;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Consuming messages with JMS&lt;/h4&gt; &lt;p&gt;Though the stomp-input works there is even another option that is not released yet but can already be tested: &lt;a href=&quot;https://github.com/logstash-plugins/logstash-input-jms&quot;&gt;jms-input&lt;/a&gt; supports the Java Messaging System, the standard way of doing messaging on the JVM.&lt;/p&gt; &lt;p&gt;Currently you need to build the plugin yourself (which didn't work on my machine but should be caused by my outdated local jruby installation).&lt;/p&gt; &lt;h4&gt;Getting data in ActiveMQ&lt;/h4&gt; &lt;p&gt;Now that we know of ways to consume data from ActiveMQ it is time to think about how to get data in. When using Java you can use something like a &lt;a href=&quot;http://activemq.apache.org/how-do-i-use-log4j-jms-appender-with-activemq.html&quot;&gt;Log4j-&lt;/a&gt; or &lt;a href=&quot;http://logback.qos.ch/apidocs/ch/qos/logback/classic/net/JMSQueueAppender.html&quot;&gt;Logback&lt;/a&gt;-Appender that push the log events directly to the queue using JMS.&lt;/p&gt; &lt;p&gt;When it comes to shipping data unfortunately none of the more popular solutions seems to be able to push data to ActiveMQ. If you know of any solution that can be used it would be great if you could leave a comment.&lt;/p&gt; &lt;p&gt;All in all I think it can be possible to use ActiveMQ as a broker for Logstash but it might require some more work when it comes to shipping data.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Fixing Elasticsearch Allocation Issues</title>
   <link href="http://blog.florian-hopf.de/2015/02/fixing-elasticsearch-allocation-issues.html"/>
   <updated>2015-02-06T13:59:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2015/02/fixing-elasticsearch-allocation-issues</id>
   <content type="html">&lt;p&gt;Last week I was working with some &lt;a href=&quot;http://logstash.net&quot;&gt;Logstash&lt;/a&gt; data on my laptop. There are around 350 indices that contain the logstash data and an index that holds the metadata for &lt;a href=&quot;http://www.elasticsearch.org/overview/kibana/&quot;&gt;Kibana 4&lt;/a&gt;. When trying to start the single node cluster I have to wait a while, until all indices are available. Some APIs can be used to see the progress of the startup process.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-health.html&quot;&gt;cluster health API&lt;/a&gt; gives general information about the state of the cluster and indicates if the cluster health is green, yellow or red. After a while the number of unassigned shards didn't change anymore but the cluster still stayed in a red state.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'&lt;br /&gt;{&lt;br /&gt;  &quot;cluster_name&quot; : &quot;elasticsearch&quot;,&lt;br /&gt;  &quot;status&quot; : &quot;red&quot;,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;number_of_nodes&quot; : 1,&lt;br /&gt;  &quot;number_of_data_nodes&quot; : 1,&lt;br /&gt;  &quot;active_primary_shards&quot; : 1850,&lt;br /&gt;  &quot;active_shards&quot; : 1850,&lt;br /&gt;  &quot;relocating_shards&quot; : 0,&lt;br /&gt;  &quot;initializing_shards&quot; : 0,&lt;br /&gt;  &quot;unassigned_shards&quot; : 1852&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;One shard couldn't be recovered: 1850 were ok but it should have been 1851. To see the problem we can use the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-indices.html&quot;&gt;cat indices command&lt;/a&gt; that will show us all indices and their health.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:9200/_cat/indices&lt;br /&gt;[...]&lt;br /&gt;yellow open logstash-2014.02.16 5 1 1184 0   1.5mb   1.5mb &lt;br /&gt;red    open .kibana             1 1                        &lt;br /&gt;yellow open logstash-2014.06.03 5 1 1857 0     2mb     2mb &lt;br /&gt;[...]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The &lt;code&gt;.kibana&lt;/code&gt; index didn't turn yellow. It only consists of one primary shard that couldn't be allocated.&lt;/p&gt; &lt;p&gt;Restarting the node and closing and opening the index didn't help. Looking at &lt;a href=&quot;http://blog.florian-hopf.de/2014/06/elasticsearch-kopf.html&quot;&gt;elasticsearch-kopf&lt;/a&gt; I could see that primary and replica shards both were unassingned (You need to tick the checkbox that says hide special to see the index).&lt;/p&gt; &lt;p&gt;Fortunately there is a way to bring the cluster in a yellow state again. We can manually allocate the primary shard on our node.&lt;/p&gt; &lt;p&gt;Elasticsearch provides the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html&quot;&gt;Cluster Reroute API&lt;/a&gt; that can be used to allocate a shard on a node. 
When trying to allocate the shard of the index &lt;code&gt;.kibana&lt;/code&gt; I first got an exception.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/_cluster/reroute&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;commands&quot; : [ {&lt;br /&gt;          &quot;allocate&quot; : {&lt;br /&gt;              &quot;index&quot; : &quot;.kibana&quot;, &quot;shard&quot; : 0, &quot;node&quot; : &quot;Jebediah Guthrie&quot;&lt;br /&gt;          }&lt;br /&gt;        }&lt;br /&gt;    ]&lt;br /&gt;}'&lt;br /&gt;&lt;br /&gt;[2015-01-30 13:35:47,848][DEBUG][action.admin.cluster.reroute] [Jebediah Guthrie] failed to perform [cluster_reroute (api)]&lt;br /&gt;org.elasticsearch.ElasticsearchIllegalArgumentException: [allocate] trying to allocate a primary shard [.kibana][0], which is disabled&lt;/code&gt;&lt;/pre&gt; Fortunately the message already tells us the problem: By default you are not allowed to allocate primary shards due to the danger of losing data. If you'd like to allocate a primary shard you need to tell it Elasticsearch explicitly by setting the property allow_primary.  &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/_cluster/reroute&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;commands&quot; : [ {&lt;br /&gt;          &quot;allocate&quot; : {&lt;br /&gt;              &quot;index&quot; : &quot;.kibana&quot;, &quot;shard&quot; : 0, &quot;node&quot; : &quot;Jebediah Guthrie&quot;, &quot;allow_primary&quot;: &quot;true&quot;&lt;br /&gt;          }&lt;br /&gt;        }&lt;br /&gt;    ]&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;For me this helped and my shard got reallocated and the cluster health turned yellow.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-0XqYtZ6DBBo/VNRYAEbbJ0I/AAAAAAAAAao/h7iVwpAQv9g/s1600/kibana-reallocated.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-0XqYtZ6DBBo/VNRYAEbbJ0I/AAAAAAAAAao/h7iVwpAQv9g/s320/kibana-reallocated.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;I am not sure what caused the problems but it is very likely related to the way I am working locally. I am regularly sending my laptop to sleep which is something you never do on a server. Nevertheless I have seen this problem a few times locally which justifies writing down the necessary steps to fix it.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Logging to Redis using Spring Boot and Logback</title>
   <link href="http://blog.florian-hopf.de/2015/01/logging-to-redis-using-spring-boot-and.html"/>
   <updated>2015-01-23T15:15:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2015/01/logging-to-redis-using-spring-boot-and</id>
   <content type="html">&lt;p&gt;When doing centralized logging, e.g. using &lt;a href=&quot;http://www.elasticsearch.org/overview/&quot;&gt;Elasticsearch, Logstash and Kibana&lt;/a&gt; or &lt;a href=&quot;https://www.graylog2.org/&quot;&gt;Graylog2&lt;/a&gt; you have several options available for your Java application. You can either write your standard application logs and parse those using &lt;a href=&quot;http://logstash.net&quot;&gt;Logstash&lt;/a&gt;, either consumed directly or shipped to another machine using something like &lt;a href=&quot;https://github.com/elasticsearch/logstash-forwarder&quot;&gt;logstash-forwarder&lt;/a&gt;. Alternatively you can write in a more appropriate format like JSON directly so the processing step doesn't need that much work for parsing your messages. As a third option is to write to a different data store directly which acts as a buffer for your log messages. In this post we are looking at how we can configure &lt;a href=&quot;http://logback.qos.ch/&quot;&gt;Logback&lt;/a&gt; in a &lt;a href=&quot;http://projects.spring.io/spring-boot/&quot;&gt;Spring Boot&lt;/a&gt; application to write the log messages to &lt;a href=&quot;http://redis. io/&quot;&gt;Redis&lt;/a&gt; directly.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-yism2ahB0ps/VMHxlUksf9I/AAAAAAAAAaU/e5Mg_eur-U0/s1600/overview-redis-logstash.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-yism2ahB0ps/VMHxlUksf9I/AAAAAAAAAaU/e5Mg_eur-U0/s400/overview-redis-logstash.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Redis&lt;/h4&gt; &lt;p&gt;We are using Redis as a log buffer for our messages. Not everyone is happy with Redis but it is a common choice. Redis stores its content in memory which makes it well suited for fast access but can also sync it to disc when necessary. A special feature of Redis is that the values can be &lt;a href=&quot;http://redis.io/topics/data-types&quot;&gt;different data types&lt;/a&gt; like strings, lists or sets. Our application uses a single key and value pair where the key is the name of the application and the value is a list that contains all our log messages. This way we can handle several logging applications in one Redis instance.&lt;/p&gt; &lt;p&gt;When testing your setup you might also want to look into the data that is stored in Redis. You can access it using the &lt;a href=&quot;http://redis.io/topics/quickstart&quot;&gt;redis-cli client&lt;/a&gt;. I collected some useful &lt;a href=&quot;http://redis.io/commands&quot;&gt;commands&lt;/a&gt; for validating your log messages are actually written to Redis.&lt;/p&gt; &lt;table class=&quot;table&quot;&gt;&lt;tr&gt;&lt;th&gt;Command&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;KEYS *&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Show all keys in this Redis instance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;LLEN key&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Show the number of messages in the list for &lt;code&gt;key&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;LRANGE key 0 100&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Show the first 100 messages in the list for &lt;code&gt;key&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;h4&gt;The Logback Config&lt;/h4&gt; &lt;p&gt;When working with Logback most of the time an XML file is used for all the configuration. 
Appenders are the things that send the log output somewhere. Loggers are used to set log levels and attach appenders to certain pieces of the application.&lt;/p&gt;   &lt;p&gt;For Spring Boot Logback is available for any application that uses the &lt;a href=&quot;http://docs.spring.io/spring-boot/docs/current/reference/html/howto-logging.html&quot;&gt;spring-boot-starter-logging&lt;/a&gt; which is also a dependency of the common spring-boot-starter-web. The configuration can be added to a file called &lt;code&gt;logback.xml&lt;/code&gt; that resides in &lt;code&gt;src/main/resources&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;Spring boot comes with a file and a console appender that are already configured correctly. We can include the base configuration in our file to keep all the predefined configurations.&lt;/p&gt; &lt;p&gt;For logging to Redis we need to add another appender. A good choice is the &lt;a href=&quot;https://github.com/kmtong/logback-redis-appender&quot;&gt;logback-redis-appender&lt;/a&gt; that is rather lightweight and uses the Java client Jedis. The log messages are written to Redis in JSON directly so it's a perfect match for something like logstash. We can make Spring Boot log to a local instance of Redis by using the following configuration.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;&amp;lt;configuration&amp;gt;&lt;br /&gt;    &amp;lt;include resource=&amp;quot;org/springframework/boot/logging/logback/base.xml&amp;quot;/&amp;gt;&lt;br /&gt;    &amp;lt;appender name=&amp;quot;LOGSTASH&amp;quot; class=&amp;quot;com.cwbase.logback.RedisAppender&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;host&amp;gt;localhost&amp;lt;/host&amp;gt;&lt;br /&gt;        &amp;lt;port&amp;gt;6379&amp;lt;/port&amp;gt;&lt;br /&gt;        &amp;lt;key&amp;gt;my-spring-boot-app&amp;lt;/key&amp;gt;&lt;br /&gt;    &amp;lt;/appender&amp;gt;&lt;br /&gt;    &amp;lt;root level=&amp;quot;INFO&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;appender-ref ref=&amp;quot;LOGSTASH&amp;quot; /&amp;gt;&lt;br /&gt;        &amp;lt;appender-ref ref=&amp;quot;CONSOLE&amp;quot; /&amp;gt;&lt;br /&gt;        &amp;lt;appender-ref ref=&amp;quot;FILE&amp;quot; /&amp;gt;&lt;br /&gt;    &amp;lt;/root&amp;gt;&lt;br /&gt;&amp;lt;/configuration&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We configure an appender named &lt;code&gt;LOGSTASH&lt;/code&gt; that is an instance of &lt;code&gt;RedisAppender&lt;/code&gt;. Host and port are set for a local Redis instance, key identifies the Redis key that is used for our logs. There are more options available like the interval to push log messages to Redis. Explore the &lt;a href=&quot;https://github.com/kmtong/logback-redis-appender&quot;&gt;readme of the project&lt;/a&gt; for more information.&lt;/p&gt; &lt;h4&gt;Spring Boot Dependencies&lt;/h4&gt; &lt;p&gt;To make the logging work we of course have to add a dependency to the &lt;code&gt;logback-redis-appender&lt;/code&gt; to our pom. Depending on your Spring Boot version you might see some errors in your log file that methods are missing.&lt;/p&gt; &lt;p&gt;This is because Spring Boot &lt;a href=&quot;https://github.com/spring-projects/spring-boot/blob/master/spring-boot-dependencies/pom.xml&quot;&gt;manages the dependencies it uses internally&lt;/a&gt; and the versions for jedis and commons-pool2 do not match the ones that we need. 
If this happens we can configure the versions to use in the properties section of our pom.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;properties&amp;gt;&lt;br /&gt;    &amp;lt;commons-pool2.version&amp;gt;2.0&amp;lt;/commons-pool2.version&amp;gt;&lt;br /&gt;    &amp;lt;jedis.version&amp;gt;2.5.2&amp;lt;/jedis.version&amp;gt;&lt;br /&gt;&amp;lt;/properties&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now the application will start and you can see that it sends the log messages to Redis as well.&lt;/p&gt; &lt;h4&gt;Enhancing the Configuration&lt;/h4&gt; &lt;p&gt;Having the host and port configured in the &lt;code&gt;logback.xml&lt;/code&gt; is not the best thing to do. When deploying to another environment with different settings you have to change the file or deploy a custom one.&lt;/p&gt;   &lt;p&gt;The Spring Boot integration of Logback allows you to set some of the configuration options, like the file to log to and the log levels, using the main configuration file &lt;code&gt;application.properties&lt;/code&gt;. Unfortunately this special treatment only covers a fixed set of values and, as far as I could see, you can't add custom ones.&lt;/p&gt; &lt;p&gt;But fortunately Logback supports the use of environment variables so we don't have to rely on configuration files. Having set the environment variables &lt;code&gt;REDIS_HOST&lt;/code&gt; and &lt;code&gt;REDIS_PORT&lt;/code&gt; you can use the following configuration for your appender.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;    &amp;lt;appender name=&amp;quot;LOGSTASH&amp;quot; class=&amp;quot;com.cwbase.logback.RedisAppender&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;host&amp;gt;${REDIS_HOST}&amp;lt;/host&amp;gt;&lt;br /&gt;        &amp;lt;port&amp;gt;${REDIS_PORT}&amp;lt;/port&amp;gt;&lt;br /&gt;        &amp;lt;key&amp;gt;my-spring-boot-app&amp;lt;/key&amp;gt;&lt;br /&gt;    &amp;lt;/appender&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can even go one step further. To only activate the appender when the variables are set you can add conditional processing to your configuration.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;    &amp;lt;if condition=&amp;#39;isDefined(&amp;quot;REDIS_HOST&amp;quot;) &amp;amp;&amp;amp; isDefined(&amp;quot;REDIS_PORT&amp;quot;)&amp;#39;&amp;gt;&lt;br /&gt;        &amp;lt;then&amp;gt;&lt;br /&gt;            &amp;lt;appender name=&amp;quot;LOGSTASH&amp;quot; class=&amp;quot;com.cwbase.logback.RedisAppender&amp;quot;&amp;gt;&lt;br /&gt;                &amp;lt;host&amp;gt;${REDIS_HOST}&amp;lt;/host&amp;gt;&lt;br /&gt;                &amp;lt;port&amp;gt;${REDIS_PORT}&amp;lt;/port&amp;gt;&lt;br /&gt;                &amp;lt;key&amp;gt;my-spring-boot-app&amp;lt;/key&amp;gt;&lt;br /&gt;            &amp;lt;/appender&amp;gt;&lt;br /&gt;        &amp;lt;/then&amp;gt;&lt;br /&gt;    &amp;lt;/if&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can use a Java expression for deciding if the block should be evaluated. When the appender is not available Logback will just log an error and use any other appenders that are configured. For this to work you need to &lt;a href=&quot;http://logback.qos.ch/setup.html#janino&quot;&gt;add the Janino library to your pom&lt;/a&gt;.&lt;/p&gt; 
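&lt;p&gt;For reference, the two dependencies could look something like this in your pom (the version numbers are only examples, check the projects for the current coordinates and releases):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;!-- versions below are examples only --&amp;gt;&lt;br /&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;    &amp;lt;groupId&amp;gt;com.cwbase&amp;lt;/groupId&amp;gt;&lt;br /&gt;    &amp;lt;artifactId&amp;gt;logback-redis-appender&amp;lt;/artifactId&amp;gt;&lt;br /&gt;    &amp;lt;version&amp;gt;1.1.3&amp;lt;/version&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;    &amp;lt;groupId&amp;gt;org.codehaus.janino&amp;lt;/groupId&amp;gt;&lt;br /&gt;    &amp;lt;artifactId&amp;gt;janino&amp;lt;/artifactId&amp;gt;&lt;br /&gt;    &amp;lt;version&amp;gt;2.7.6&amp;lt;/version&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; 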
&lt;p&gt;Now the appender is activated based on the environment variables. If you like you can skip the setup for local development and only set the variables on production systems.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Getting started with Spring Boot or logging to Redis alone is very easy but some of the details take some work to get right. But it's worth the effort: Once you get used to centralized logging you don't want to have your systems running without it anymore.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Analytics</title>
   <link href="http://blog.florian-hopf.de/2014/12/use-cases-for-elasticsearch-analytics.html"/>
   <updated>2014-12-12T17:35:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/12/use-cases-for-elasticsearch-analytics</id>
   <content type="html">&lt;p&gt;In the &lt;a href=&quot;http://blog.florian-hopf.de/2014/09/use-cases-for-elasticsearch-index-and.html&quot;&gt;last post&lt;/a&gt; in &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/slides-for-use-cases-for-elasticsearch.html&quot;&gt;this series&lt;/a&gt; we have seen how we can use Logstash, Elasticsearch and Kibana for doing logfile analytics. This week we will look at the general capabilities for doing analytics on any data using Elasticsearch and Kibana.&lt;/p&gt; &lt;h4&gt;Use Case&lt;/h4&gt; &lt;p&gt;We have already seen that Elasticsearch can be used to &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;store large amounts of data&lt;/a&gt;. Instead of putting data into a data warehouse Elasticsearch can be used to do analytics and reporting. Another use case is &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/simple-event-analytics-with.html&quot;&gt;social media data&lt;/a&gt;: Companies can look at what happens with their brand if they have the possibility to easily search it. Data can be ingested from multiple sources, e.g. Twitter and Facebook and combined in one system. &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/kibana-and-elasticsearch-see-what.html&quot;&gt;Visualizing data in tools like Kibana&lt;/a&gt; can help with exploring large data sets. Finally mechanisms like Elasticsearchs &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html&quot;&gt;Aggregations&lt;/a&gt; can help with finding new ways to look at the data.&lt;/p&gt; &lt;h4&gt;Aggregations&lt;/h4&gt; &lt;p&gt;Aggregations provide what the now deprecated facets have been providing but also a lot more. They can combine and count values from different documents and therefore show you what is contained in your data. For example if you have tweets indexed in Elasticsearch you can use the terms aggregation to find the most common hashtags. &lt;i&gt;For details on &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/simple-event-analytics-with.html&quot;&gt;indexing tweets in Elasticsearch see this post on the Twitter River&lt;/a&gt; and &lt;a href=&quot;http://blog.florian-hopf.de/2014/06/an-alternative-to-twitter-river-index.html&quot;&gt;this post on the Twitter input for Logstash&lt;/a&gt;.&lt;/i&gt;&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;hashtags&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;hashtag.text&quot; &lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Aggregations are requested using the aggs keyword, &lt;code&gt;hashtags&lt;/code&gt; is a name I have chosen to identify the result and the terms aggregation counts the different terms for the given field (Disclaimer: For a sharded setup the terms aggregation might not be totally exact). 
This request might result in something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&quot;aggregations&quot;: {&lt;br /&gt;      &quot;hashtags&quot;: {&lt;br /&gt;         &quot;buckets&quot;: [&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;dartlang&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 229&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;java&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 216&lt;br /&gt;            },&lt;br /&gt;[...]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The result is available under the name we have chosen. Aggregations put the counts into buckets that consist of a value and a count. This is very similar to how faceting works, only the names are different. For this example we can see that there are 229 documents for the hashtag &lt;code&gt;dartlang&lt;/code&gt; and 216 containing the hashtag &lt;code&gt;java&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;This could also be done with facets alone but there is more: Aggregations can even be combined. You can now nest another aggregation in the first one that for every bucket will give you more buckets for another criterion.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;hashtags&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;hashtag.text&quot; &lt;br /&gt;            },&lt;br /&gt;            &quot;aggs&quot; : {&lt;br /&gt;                &quot;hashtagusers&quot; : {&lt;br /&gt;                    &quot;terms&quot; : {&lt;br /&gt;                        &quot;field&quot; : &quot;user.screen_name&quot;&lt;br /&gt;                    }&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We still request the terms aggregation for the hashtag. But now we have another aggregation embedded, a terms aggregation that processes the user name. This will then result in something like this.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;               &quot;key&quot;: &quot;scala&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 130,&lt;br /&gt;               &quot;hashtagusers&quot;: {&lt;br /&gt;                  &quot;buckets&quot;: [&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;jaceklaskowski&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 74&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;ManningBooks&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 3&lt;br /&gt;                     },&lt;br /&gt;    [...]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can now see the users that have used a certain hashtag. In this case one user used one hashtag a lot. This is information that is not available that easily with queries and facets alone.&lt;/p&gt; &lt;p&gt;Besides the terms aggregation we have seen here there are also &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html&quot;&gt;lots of other interesting aggregations available&lt;/a&gt; and more are added with every release. You can choose between bucket aggregations (like the terms aggregation) and metrics aggregations that calculate values from the buckets, e.g. averages or other statistical values.&lt;/p&gt; 
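&lt;p&gt;As a quick sketch of a metrics aggregation, the &lt;code&gt;cardinality&lt;/code&gt; aggregation (available since Elasticsearch 1.1) can estimate how many distinct users have used each hashtag when it is nested below the terms aggregation. The fields are the same ones we used above, &lt;code&gt;unique_users&lt;/code&gt; is again just a name I chose to identify the result.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;hashtags&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;hashtag.text&quot; &lt;br /&gt;            },&lt;br /&gt;            &quot;aggs&quot; : {&lt;br /&gt;                &quot;unique_users&quot; : {&lt;br /&gt;                    &quot;cardinality&quot; : {&lt;br /&gt;                        &quot;field&quot; : &quot;user.screen_name&quot;&lt;br /&gt;                    }&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Keep in mind that the cardinality values are an approximation for large amounts of data, which is usually fine for this kind of exploration.&lt;/p&gt; 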
&lt;h4&gt;Visualizing the Data&lt;/h4&gt; &lt;p&gt;Besides the JSON output we have seen above, the data can also be used for visualizations. This is something that can then be prepared even for a non-technical audience. Kibana is one of the options that is often used for logfile data but it can be used for data of all kinds, e.g. the Twitter data we have already seen above.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-yls_NyBbR-8/VIq6nQPz62I/AAAAAAAAAZ4/G6abNb-J9oc/s1600/overall2.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-yls_NyBbR-8/VIq6nQPz62I/AAAAAAAAAZ4/G6abNb-J9oc/s640/overall2.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There are two bar charts that display the term frequencies for the mentions and the hashtags. We can easily see which values are dominant. Also, the date histogram to the right shows at what time most tweets are sent. All in all these visualizations can provide a lot of value when it comes to trends that are only seen when combining the data.&lt;/p&gt; &lt;p&gt;&lt;i&gt;The image shows Kibana 3, which still relies on the facet feature. &lt;a href=&quot;http://www.elasticsearch.org/blog/kibana-4-beta-1-released/&quot;&gt;Kibana 4&lt;/a&gt; will instead provide access to the aggregations.&lt;/i&gt;&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;This post ends the series on use cases for Elasticsearch. I hope you enjoyed reading it and maybe you learned something new along the way. I can't spend that much time blogging anymore but new posts will be coming. Keep an eye on this blog.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Index and Search Log Files</title>
   <link href="http://blog.florian-hopf.de/2014/09/use-cases-for-elasticsearch-index-and.html"/>
   <updated>2014-09-19T14:19:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/09/use-cases-for-elasticsearch-index-and</id>
   <content type="html">&lt;p&gt;In the &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/slides-for-use-cases-for-elasticsearch.html&quot;&gt;last posts&lt;/a&gt; we have seen some of the properties of using Elasticsearch as a &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;document store&lt;/a&gt;, for &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-full-text.html&quot;&gt;searching text content&lt;/a&gt; and &lt;a href=&quot;http://blog.florian-hopf.de/2014/08/use-cases-for-elasticsearch-geospatial.html&quot;&gt;geospatial search&lt;/a&gt;. In this post we will look at how it can be used to index and store log files, a very useful application that can help developers and operations in maintaining applications.&lt;/p&gt; &lt;h4&gt;Logging&lt;/h4&gt; &lt;p&gt;When maintaining larger applications that are either distributed across several nodes or consist of several smaller applications searching for events in log files can become tedious. You might already have been in the situation that you have to find an error and need to log in to several machines and look at several log files. Using Linux tools like grep can be fun sometimes but there are more convenient ways. Elasticsearch and the projects Logstash and Kibana, commonly known as the &lt;a href=&quot;http://www.elasticsearch.org/overview/&quot;&gt;ELK stack&lt;/a&gt;, can help you with this.&lt;/p&gt; &lt;p&gt;With the ELK stack you can centralize your logs by indexing them in Elasticsearch. This way you can use Kibana to look at all the data without having to log in on the machine. This can also make Operations happy as they don't have to grant access to every developer who needs to have access to the logs. As there is one central place for all the logs you can even see different applications in context. For example you can see the logs of your Apache webserver combined with the log files of your application server, e.g. Tomcat. As search is core to what Elasticsearch is doing you should be able to find what you are looking for even more quickly.&lt;/p&gt; &lt;p&gt;Finally Kibana can also help you with becoming more proactive. As all the information is available in real time you also have a visual representation of what is happening in your system in real time. This can help you in finding problems more quickly, e.g. you can see that some resource starts throwing Exceptions without having your customers report it to you.&lt;/p&gt; &lt;h4&gt;The ELK Stack&lt;/h4&gt; &lt;p&gt;For logfile analytics you can use all three applications of the ELK stack: Elasticsearch, Logstash and Kibana. Logstash is used to read and enrich the information from log files. Elasticsearch is used to store all the data and Kibana is the frontend that provides dashboards to look at the data.&lt;/p&gt; &lt;p&gt;The logs are fed into Elasticsearch using Logstash that combines the different sources. Kibana is used to look at the data in Elasticsearch. This setup has the advantage that different parts of the log file processing system can be scaled differently. If you need more storage for the data you can add more nodes to the Elasticsearch cluster. If you need more processing power for the log files you can add more nodes for Logstash.&lt;/p&gt; &lt;h4&gt;Logstash&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://logstash.net&quot;&gt;Logstash&lt;/a&gt; is a JRuby application that can read input from several sources, modify it and push it to a multitude of outputs. 
For running Logstash you need to pass it a configuration file that determines where the data is and what should be done with it. The configuration normally consists of an &lt;code&gt;input&lt;/code&gt; and an &lt;code&gt;output&lt;/code&gt; section and an optional &lt;code&gt;filter&lt;/code&gt; section. This example takes the Apache access logs, does some predefined processing and stores them in Elasticsearch:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;input {&lt;br /&gt;  file {&lt;br /&gt;    path =&gt; &quot;/var/log/apache2/access.log&quot;&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;filter {&lt;br /&gt;  grok {&lt;br /&gt;    match =&gt; { message =&gt; &quot;%{COMBINEDAPACHELOG}&quot; }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output {&lt;br /&gt;  elasticsearch_http {&lt;br /&gt;    host =&gt; &quot;localhost&quot;&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The file input reads the log files from the path that is supplied. In the filter section we have defined the &lt;a href=&quot;http://logstash.net/docs/1.4.2/filters/grok&quot;&gt;grok filter&lt;/a&gt; that parses unstructured data and structures it. It comes with lots of predefined patterns for different systems. In this case we are using the complete Apache log pattern but there are also more basic building block like parsing email and ip addresses and dates (which can be lots of fun with all the different formats).&lt;/p&gt; &lt;p&gt;In the output section we are telling Logstash to push the data to Elasticsearch using http. We are using a server on localhost, for most real world setups this would be a cluster on separate machines.&lt;/p&gt; &lt;h4&gt;Kibana&lt;/h4&gt; &lt;p&gt;Now that we have the data in Elasticsearch we want to look at it. Kibana is a JavaScript application that can be used to build dashboards. It accesses Elasticsearch from the browser so whoever uses Kibana needs to have access to Elasticsearch.&lt;/p&gt; &lt;p&gt;When using it with Logstash you can open a predefined dashboard that will pull some information from your index. You can then display charts, maps and tables for the data you have indexed. This screenshot displays a histogram and a table of log events but there are more widgets available like maps and pie and bar charts.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-JbJbjv9Rn2g/VBr1a7KKTuI/AAAAAAAAAYw/ODXYDAQYgaw/s1600/kibana-logging.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-JbJbjv9Rn2g/VBr1a7KKTuI/AAAAAAAAAYw/ODXYDAQYgaw/s400/kibana-logging.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;As you can see you can extract a lot of data visually that would otherwise be buried in several log files.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;The ELK stack can be a great tool to read, modify and store log events. Dashboards help with visualizing what is happening. There are lots of inputs in Logstash and the grok filter supplies lots of different formats. Using those tools you can consolidate and centralize all your log files.&lt;/p&gt; &lt;p&gt;Lots of people are using the stack for analyzing their log file data. 
One of the available articles is by &lt;a href=&quot;http://www.elasticsearch.org/blog/using-elasticsearch-and-logstash-to-serve-billions-of-searchable-events-for-customers/&quot;&gt;Mailgun, who are using it to store billions of events&lt;/a&gt;. And if that's not enough read &lt;a href=&quot;https://medium.com/@ghoranyi/needle-in-a-haystack-873c97a99983&quot;&gt;this post on how CERN uses the ELK stack to help run the Large Hadron Collider&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;In the next post we will look at the final use case for Elasticsearch: Analytics.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Geospatial Search</title>
   <link href="http://blog.florian-hopf.de/2014/08/use-cases-for-elasticsearch-geospatial.html"/>
   <updated>2014-08-29T13:11:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/08/use-cases-for-elasticsearch-geospatial</id>
   <content type="html">&lt;p&gt;In the &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/slides-for-use-cases-for-elasticsearch.html&quot;&gt;previous posts&lt;/a&gt; we have seen that Elasticsearch can be used to &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;store documents in JSON format and distribute the data across multiple nodes as shards and replicas&lt;/a&gt;. Lucene, the underlying library, provides the implementation of the inverted index that can be used to &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-full-text.html&quot;&gt;search the documents&lt;/a&gt;. Analyzing is a crucial step for building a good search application.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-ELKSqM6JCQA/VAAGg1KvloI/AAAAAAAAAYc/PwtwkJn8jSk/s640/file2331343883319.jpg&quot; /&gt;&lt;/div&gt; &lt;p&gt;In this post we will look at a different feature that can be used for applications you would not immediately associate Elasticsearch with. We will look at the geo features that can be used to build applications that can filter and sort documents based on the location of the user.&lt;/p&gt; &lt;h4&gt;Locations in Applications&lt;/h4&gt; &lt;p&gt;Location based features can be useful for a wide range of applications. For merchants the web site can present the closest point of service for the current user. Or there is a search facility for finding points of services according to a location, often integrated with something like Google Maps. For classifieds it can make sense to sort them by distance from the user searching, the same is true for any search for locations like restaurants and the like. Sometimes it also makes sense to only show results that are in a certain area around me, in this case we need to filter by distance. Probably the user is looking for a new appartment and is not interested in results that are too far away from his workplace. Finally locations can also be of interest when doing analytics. Social media data can tell you where something interesting is happening just by looking at the amount of status messages sent from a certain area.&lt;/p&gt; &lt;p&gt;Most of the time locations are stored as a pair of latitude and longitude, which denotes a point. The combination of 48.779506, 9.170045 for example points to Liederhalle Stuttgart which happens to be the location for Java Forum Stuttgart. &lt;a href=&quot;http://en.wikipedia.org/wiki/Geohash&quot;&gt;Geohashes&lt;/a&gt; are an alternative means to encode latitude and longitude. They can be stored in arbitrary precision so those can also refer to a larger area instead of a point.&lt;/p&gt; &lt;p&gt;When calculating a Geohash the map is divided into several buckets or cells. Each bucket is identified by a base 32 encoded value. The complete geohash then consists of a sequence of characters. Each following character marks the bucket in the previous bucket so you are zooming in to the location. The longer the geohash string the more precise the location is. For example &lt;code&gt;u0wt88j3jwcp&lt;/code&gt; is the geohash for Liederhalle Stuttgart. The prefix &lt;code&gt;u0wt&lt;/code&gt; on the other hand is the area of Stuttgart and some of the surrounding cities.&lt;/p&gt;   &lt;p&gt;The hierarchical nature of geohashes and the possiblity to express them as strings makes them a good choice for storing them in the inverted index. 
You can create geohashes using the &lt;a href=&quot;http://geohash.org/&quot;&gt;original geohash service&lt;/a&gt; or &lt;a href=&quot;http://geohash.gofreerange.com/&quot;&gt;more visually appealing using the nice GeohashExplorer&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Locations in Elasticsearch&lt;/h4&gt; &lt;p&gt;Elasticsearch accepts &lt;code&gt;lat&lt;/code&gt; and &lt;code&gt;lon&lt;/code&gt; for specifying latitude and longitude. These are two documents for a conference in Stuttgart and one in Nuremberg.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;    &quot;title&quot; : &quot;Anwendungsfälle für Elasticsearch&quot;,&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2014-07-17T15:35:00.000Z&quot;,&lt;br /&gt;    &quot;tags&quot; : [&quot;Java&quot;, &quot;Lucene&quot;],&lt;br /&gt;    &quot;conference&quot; : {&lt;br /&gt;        &quot;name&quot; : &quot;Java Forum Stuttgart&quot;,&lt;br /&gt;        &quot;city&quot; : &quot;Stuttgart&quot;,&lt;br /&gt;            &quot;coordinates&quot;: {&lt;br /&gt;                &quot;lon&quot;: &quot;9.170045&quot;,&lt;br /&gt;                &quot;lat&quot;: &quot;48.779506&quot;&lt;br /&gt;            }&lt;br /&gt;    } &lt;br /&gt;}&lt;br /&gt;{&lt;br /&gt;    &quot;title&quot; : &quot;Anwendungsfälle für Elasticsearch&quot;,&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2014-07-15T16:30:00.000Z&quot;,&lt;br /&gt;    &quot;tags&quot; : [&quot;Java&quot;, &quot;Lucene&quot;],&lt;br /&gt;    &quot;conference&quot; : {&lt;br /&gt;        &quot;name&quot; : &quot;Developer Week&quot;,&lt;br /&gt;        &quot;city&quot; : &quot;Nürnberg&quot;,&lt;br /&gt;            &quot;coordinates&quot;: {&lt;br /&gt;                &quot;lon&quot;: &quot;11.115358&quot;,&lt;br /&gt;                &quot;lat&quot;: &quot;49.417175&quot;&lt;br /&gt;            }&lt;br /&gt;    } &lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Alternatively you can use the &lt;a href=&quot;http://geojson.org/&quot;&gt;GeoJSON&lt;/a&gt; format, accepting an array of longitude and latitude. If you are like me be prepared to hunt down why queries aren't working just to notice that you messed up the order in the array.&lt;/p&gt; &lt;p&gt;The field needs to be mapped with a &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-geo-point-type.html&quot;&gt;geo_point&lt;/a&gt; field type.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;    &quot;properties&quot;: {&lt;br /&gt;          […],&lt;br /&gt;       &quot;conference&quot;: {&lt;br /&gt;            &quot;type&quot;: &quot;object&quot;,&lt;br /&gt;            &quot;properties&quot;: {&lt;br /&gt;                &quot;coordinates&quot;: {&lt;br /&gt;                    &quot;type&quot;: &quot;geo_point&quot;,&lt;br /&gt;                    &quot;geohash&quot;: &quot;true&quot;,&lt;br /&gt;                    &quot;geohash_prefix&quot;: &quot;true&quot;&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;       }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;By passing the optional attribute &lt;code&gt;geohash&lt;/code&gt; Elasticsearch will automatically store the geohash for you as well. Depending on your usecase you can also store all the parent cells of the geohash using the parameter &lt;code&gt;geohash_prefix&lt;/code&gt;. 
As the values are just strings this is a normal ngram index operation which stores the different substrings for a term, e.g. u, u0, u0w and u0wt for u0wt.&lt;/p&gt; &lt;p&gt;With our documents in place we can now use the geo information for sorting, filtering and aggregating results.&lt;/p&gt; &lt;h4&gt;Sorting by Distance&lt;/h4&gt; &lt;p&gt;First, let's sort all our documents &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_geo_distance_sorting&quot;&gt;by distance from a point&lt;/a&gt;. This would allow us to build an application that displays the closest location for the current user.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search &quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;sort&quot; : [&lt;br /&gt;        {&lt;br /&gt;            &quot;_geo_distance&quot; : {&lt;br /&gt;                &quot;conference.coordinates&quot; : {&lt;br /&gt;                    &quot;lon&quot;: 8.403697,&lt;br /&gt;                    &quot;lat&quot;: 49.006616&lt;br /&gt;                },&lt;br /&gt;                &quot;order&quot; : &quot;asc&quot;,&lt;br /&gt;                &quot;unit&quot; : &quot;km&quot;&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    ]&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We are requesting to sort by &lt;code&gt;_geo_distance&lt;/code&gt; and are passing in another location, this time Karlsruhe, where I live. Results should be sorted ascending so the closer results come first. As Stuttgart is not far from Karlsruhe it will be first in the list of results.&lt;/p&gt; &lt;p&gt;The score for the document will be empty. Instead there is a field sort that contains the distance of the locations from the one provided. This can be really handy when displaying the results to the user.&lt;/p&gt; &lt;h4&gt;Filtering by Distance&lt;/h4&gt; &lt;p&gt;For some usecase we would like to &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-filter.html&quot;&gt;filter our results by distance&lt;/a&gt;. Some online real estate agencies for example provide the option to only display results that are in a certain distance from a point. We can do the same by passing in a &lt;code&gt;geo_distance&lt;/code&gt; filter.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;   &quot;filter&quot;: {&lt;br /&gt;      &quot;geo_distance&quot;: {&lt;br /&gt;         &quot;conference.coordinates&quot;: {&lt;br /&gt;            &quot;lon&quot;: 8.403697,&lt;br /&gt;            &quot;lat&quot;: 49.006616&lt;br /&gt;         },&lt;br /&gt;         &quot;distance&quot;: &quot;200km&quot;,&lt;br /&gt;         &quot;distance_type&quot;: &quot;arc&quot;&lt;br /&gt;      }&lt;br /&gt;   }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We are again passing the location of Karlsruhe. We request that only documents in a distance of 200km should be returned and that the &lt;code&gt;arc&lt;/code&gt; &lt;code&gt;distance_type&lt;/code&gt; should be used for calculating the distance. This will take into account that we are living on a globe.&lt;/p&gt; &lt;p&gt;The resulting list will only contain one document, Stuttgart, as Nuremberg is just over 200km away. 
If we use the distance 210km both of the documents will be returned.&lt;/p&gt; &lt;h4&gt;Geo Aggregations&lt;/h4&gt; &lt;p&gt;Elasticsearch provides several useful geo &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html&quot;&gt;aggregations&lt;/a&gt; that allow you to retrieve more information on the locations of your documents, e.g. for faceting. As we have the geohash as well as the prefixes enabled we can also retrieve all of the cells our results are in using a simple &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html&quot;&gt;terms aggregation&lt;/a&gt;. This way you can let the user drill down on the results by filtering on the cell.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;aggregations&quot; : {&lt;br /&gt;        &quot;conference-hashes&quot; : {&lt;br /&gt;            &quot;terms&quot; : {&lt;br /&gt;                &quot;field&quot; : &quot;conference.coordinates.geohash&quot;&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Depending on the precision we have chosen while indexing this will return a long list of prefixes for hashes but the most important part is at the beginning.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;[...]&lt;br /&gt;   &quot;aggregations&quot;: {&lt;br /&gt;      &quot;conference-hashes&quot;: {&lt;br /&gt;         &quot;buckets&quot;: [&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;u&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 2&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;u0&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 2&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;u0w&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 1&lt;br /&gt;            },&lt;br /&gt;            [...]&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Stuttgart and Nuremberg both share the parent cells u and u0.&lt;/p&gt; &lt;p&gt;As an alternative to the terms aggregation you can also use specialized geo aggregations, e.g. the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html&quot;&gt;geo distance aggregation&lt;/a&gt; for forming buckets of distances.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Besides the features we have seen here Elasticsearch offers a wide range of geo features. You can &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-geo-shape-type.html&quot;&gt;index shapes&lt;/a&gt; and &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html&quot;&gt;query by arbitrary polygons&lt;/a&gt;, either by passing them in or by passing a reference to an indexed polygon. When geohash prefixes are turned on you can also filter by geohash cell.&lt;/p&gt; 
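&lt;p&gt;A quick sketch of such a filter with our example data: as we have stored the geohash prefixes, a &lt;code&gt;geohash_cell&lt;/code&gt; filter with a precision of 4 should select the cell &lt;code&gt;u0wt&lt;/code&gt; around the Liederhalle point and only return documents whose prefixes contain that cell. Something like this should work:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;   &quot;filter&quot;: {&lt;br /&gt;      &quot;geohash_cell&quot;: {&lt;br /&gt;         &quot;conference.coordinates&quot;: {&lt;br /&gt;            &quot;lon&quot;: 9.170045,&lt;br /&gt;            &quot;lat&quot;: 48.779506&lt;br /&gt;         },&lt;br /&gt;         &quot;precision&quot;: 4&lt;br /&gt;      }&lt;br /&gt;   }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can also pass a geohash string instead of the point and set &lt;code&gt;neighbors&lt;/code&gt; to true to include the surrounding cells.&lt;/p&gt; 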
&lt;p&gt;With the new HTML5 location features, location-aware search and content delivery will become more important. Elasticsearch is a good fit for building this kind of application.&lt;/p&gt; &lt;p&gt;Two users in the geo space are &lt;a href=&quot;http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-works-with-elastic-search/&quot;&gt;Foursquare, a very early user of Elasticsearch&lt;/a&gt;, and &lt;a href=&quot;http://www.elasticsearch.org/case-study/gild/&quot;&gt;Gild, a recruitment agency that does some magic with locations&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Resources for Freelancers</title>
   <link href="http://blog.florian-hopf.de/2014/08/resources-for-freelancers.html"/>
   <updated>2014-08-13T14:03:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/08/resources-for-freelancers</id>
   <content type="html">&lt;p&gt;More than half a year ago I wrote a post on &lt;a href=&quot;http://blog.florian-hopf.de/2014/01/20-months-of-freelancing.html&quot;&gt;my first years as a freelancer&lt;/a&gt;. While writing the post I noticed that there are quite some resources I would like to recommend which I deferred to another post that never was written. Last weekend at &lt;a href=&quot;http://www.socrates-conference.de/&quot;&gt;SoCraTes&lt;/a&gt; we had a very productive session on freelancing. We talked about different aspects from getting started, kinds of and reasons for freelancing to handling your sales pipeline.&lt;/p&gt;   &lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;de&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://twitter.com/hashtag/freelancing?src=hash&quot;&gt;#freelancing&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/session?src=hash&quot;&gt;#session&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/mind?src=hash&quot;&gt;#mind&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/map?src=hash&quot;&gt;#map&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/summary?src=hash&quot;&gt;#summary&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/socrates14?src=hash&quot;&gt;#socrates14&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/softwerkskammer?src=hash&quot;&gt;#softwerkskammer&lt;/a&gt; &lt;a href=&quot;http://t.co/JAkFMfaOHZ&quot;&gt;pic.twitter.com/JAkFMfaOHZ&lt;/a&gt;&lt;/p&gt;&amp;mdash; Benjamin (@dataduke) &lt;a href=&quot;https://twitter.com/dataduke/statuses/497688175595507712&quot;&gt;8. August 2014&lt;/a&gt;&lt;/blockquote&gt;&lt;script async src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt; &lt;p&gt;&lt;a href=&quot;http://davidtanzer.net/&quot;&gt;David&lt;/a&gt; and me recommended some resources on getting started so this is the perfect excuse to write the post I planned to write initially.&lt;/p&gt; &lt;p&gt;I will keep it minimal and only present a short description of each point.&lt;/p&gt; &lt;dl&gt;&lt;dt&gt;&lt;a href=&quot;https://www.softwerkskammer.org/groups/freelance-topics&quot;&gt;Softwerkskammer Freelancer Group&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;Our discussion at SoCraTes lead to founding a new group in the Softwerkskammer community. We plan to exchange knowledge and probably even work opportunities.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://www.freelancersshow.com/&quot;&gt;The Freelancers' Show&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A podcast on everything freelancing. Started as the Ruby Freelancers but the topics always were general. Fun to listen to, when it comes to software development you might also want to listen to the &lt;a href=&quot;http://rubyrogues.com/&quot;&gt;Ruby Rogues&lt;/a&gt;.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://www.bookyourselfsolid.com/&quot;&gt;Book Yourself Solid&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;I read this when getting started with freelancing. Helps you with deciding what you want to do and with marketing yourself.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://getclientsnow.com/&quot;&gt;Get Clients Now&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A workbook with daily tasks for improving your business. It's a 28 day program that contains some really good ideas and helps you working on getting more work.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://www.ducttapemarketing.com/&quot;&gt;Duct Tape Marketing&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A book on improving your marketing activities. 
I took less out of this book than the other two.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://theadmin.org/&quot;&gt;Email Course by Eric Davis&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;Eric Davis, one of the hosts of the Freelancers' Show also provides a free email course for freelancers.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://www.mediafon.net/ratgeber.php3&quot;&gt;Mediafon Ratgeber Selbstständige&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A German book on all practical issues you have to take care of.&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;There is also stuff that is only slightly related to freelancing but helped me on the way, either through learning or motivation.&lt;/p&gt; &lt;dl&gt;&lt;dt&gt;&lt;a href=&quot;http://pragprog.com/book/actb/technical-blogging&quot;&gt;Technical Blogging&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A book that can help you getting started with blogging. Can be motivating but also contains some good tips.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://www.amazon.com/Traffic-Simple-Visitors-without-Working-ebook/dp/B009WUKTXW/ref=la_B0098NFKNM_1_9?s=books&amp;ie=UTF8&amp;qid=1407898046&amp;sr=1-9&quot;&gt;My Blog Traffic Sucks.&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A short book on blogging. This book lead to my very frequent blog publishing habit.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://scottberkun.com/books/&quot;&gt;Confessions of a Public Speaker&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;A very good and entertaining read on presenting.&lt;/dd&gt;&lt;dt&gt;&lt;a href=&quot;http://100startup.com/&quot;&gt;The 100$ Startup&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;Not exactly about freelancing but about small startups of all kinds, a very entertaining read about people working for themselves.&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;I am sure I forgot some of the things that helped me but I hope one of the resources can help you and your freelancing business. If you are missing something feel free to leave a comment.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Scrapy and Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html"/>
   <updated>2014-07-30T15:09:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/07/scrapy-and-elasticsearch</id>
   <content type="html">&lt;p&gt;&lt;i&gt;On 29.07.2014 I gave a talk at &lt;a href=&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/&quot;&gt;Search Meetup Karlsruhe&lt;/a&gt; on using Scrapy with Elasticsearch, the &lt;a href=&quot;http://www.florian-hopf.de/artikel-vortraege/scrapy-es-smka/#/&quot;&gt;slides are here&lt;/a&gt;. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.&lt;/i&gt;&lt;/p&gt; &lt;h4&gt;Web Crawling&lt;/h4&gt; &lt;p&gt;You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.&lt;/p&gt; &lt;h4&gt;The Example&lt;/h4&gt; &lt;p&gt;In this post we are looking at a rather artificial example: Crawling the meetup.com page for recent meetups to make them available for search. Why artificial? Because &lt;a href=&quot;http://www.meetup.com/meetup_api/&quot;&gt;meetup.com has an API&lt;/a&gt; that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well.&lt;/p&gt; &lt;p&gt;This is a part of the &lt;a href=&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/&quot;&gt;Search Meetup Karlsruhe&lt;/a&gt; page that displays the recent meetups.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-eFtSzluW0ww/U9comqA1cQI/AAAAAAAAAX0/S8rlV9-3euE/s1600/recent-meetups.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-eFtSzluW0ww/U9comqA1cQI/AAAAAAAAAX0/S8rlV9-3euE/s400/recent-meetups.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;We can see that there is already some information we are interested in like the title and the link to the meetup page.&lt;/p&gt; &lt;h4&gt;Roll Your Own?&lt;/h4&gt; &lt;p&gt;When deciding on doing web scraping you might be tempted to build it yourself using a script or some code. How hard can it be to fetch a website, parse its source and extract all links to follow?&lt;/p&gt; &lt;p&gt;For demoing some of the features of &lt;a href=&quot;http://akka.io&quot;&gt;Akka&lt;/a&gt; I have built &lt;a href=&quot;http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html&quot;&gt;a simple web crawler&lt;/a&gt; that visits a website, follows all links and indexes the content in Lucene. While this is not a lot of code you will notice soon that it is not suited for real world uses: It is hammering the crawled page with as many requests as possible. There is no way to make it behave nicely by respecting the robots.txt. Additional processing of the content is too hard to add afterwards. All of this is enough to lean to a ready made solution.&lt;/p&gt; &lt;h4&gt;Scrapy&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://scrapy.org/&quot;&gt;Scrapy&lt;/a&gt; is a framework for building crawlers and process the extracted data. 
It is implemented in Python and does asynchronous, non-blocking networking. It is easily extendable, not only via the item pipeline the content flows through. Finally it already comes with lots of features that you might have to build yourself otherwise.&lt;/p&gt; &lt;p&gt;In Scrapy you implement a spider, that visits a certain page and extracts items from the content. The items then flow through the item pipeline and get dumped to a Feed Exporter that then writes the data to a file. At every stage of the process you can add custom logic.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-Wdf_M4MRX9s/U9cpQcvpUEI/AAAAAAAAAX8/uU4--EGIub4/s1600/architecture.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-Wdf_M4MRX9s/U9cpQcvpUEI/AAAAAAAAAX8/uU4--EGIub4/s400/architecture.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;This is a very simplified diagram that doesn't take the asynchronous nature of Scrapy into account. See the Scrapy documentation for &lt;a href=&quot;http://doc.scrapy.org/en/latest/topics/architecture.html&quot;&gt;a more detailed view&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;For installing Scrapy I am using &lt;a href=&quot;https://pypi.python.org/pypi/pip&quot;&gt;pip&lt;/a&gt; which should be available for all systems. You can then run &lt;code&gt;pip install scrapy&lt;/code&gt; to get it.&lt;/p&gt; &lt;p&gt;To get started using Scrapy you can use the scaffolding feature to create a new project. Just issue something like &lt;code&gt;scrapy startproject meetup&lt;/code&gt; and scrapy will generate quite some files for you.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;meetup/&lt;br /&gt;meetup/scrapy.cfg&lt;br /&gt;meetup/meetup&lt;br /&gt;meetup/meetup/settings.py&lt;br /&gt;meetup/meetup/__init__.py&lt;br /&gt;meetup/meetup/items.py&lt;br /&gt;meetup/meetup/pipelines.py&lt;br /&gt;meetup/meetup/spiders&lt;br /&gt;meetup/meetup/spiders/__init__.py&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;For now we can concentrate on the &lt;code&gt;items.py&lt;/code&gt;, that describes the strucure of the data to crawl, and the &lt;code&gt;spiders&lt;/code&gt; directory where we can put our spiders.&lt;/p&gt;  &lt;h4&gt;Our First Spider&lt;/h4&gt; &lt;p&gt;First we need to define what data structure we would like to retrieve. This is described as an Item that is then created using a Spider and flows through the item pipeline. For our case we can put this into &lt;code&gt;items.py&lt;/code&gt;&lt;/p&gt; &lt;pre&gt;&lt;code&gt;from scrapy.item import Item, Field&lt;br /&gt;&lt;br /&gt;class MeetupItem(Item):&lt;br /&gt;    title = Field()&lt;br /&gt;    link = Field()&lt;br /&gt;    description = Field()&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Our &lt;code&gt;MeetupItem&lt;/code&gt; defines three fields for the title, the link and a description we can search on. For real world usecases this would contain more information like the date and time or probably more information on the participants.&lt;/p&gt; &lt;p&gt;To fetch data and create Items we need to implement a Spider instance. 
We create a file &lt;code&gt;meetup_spider.py&lt;/code&gt; in the &lt;code&gt;spiders&lt;/code&gt; directory.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;from scrapy.spider import BaseSpider&lt;br /&gt;from scrapy.selector import Selector&lt;br /&gt;from meetup.items import MeetupItem&lt;br /&gt;&lt;br /&gt;class MeetupSpider(BaseSpider):&lt;br /&gt;    name = &quot;meetup&quot;&lt;br /&gt;    allowed_domains = [&quot;meetup.com&quot;]&lt;br /&gt;    start_urls = [&lt;br /&gt;        &quot;http://www.meetup.com/Search-Meetup-Karlsruhe/&quot;&lt;br /&gt;    ]&lt;br /&gt;&lt;br /&gt;    def parse(self, response):&lt;br /&gt;        responseSelector = Selector(response)&lt;br /&gt;        for sel in responseSelector.css('li.past.line.event-item'):&lt;br /&gt;            item = MeetupItem()&lt;br /&gt;            item['title'] = sel.css('a.event-title::text').extract()&lt;br /&gt;            item['link'] = sel.xpath('a/@href').extract()&lt;br /&gt;            yield item&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Our spider extends BaseSpider and defines a name, the allowed domains and a start url. Scrapy calls the start url and passes the response to the &lt;code&gt;parse&lt;/code&gt; method. We are then using a &lt;code&gt;Selector&lt;/code&gt; to parse the data using eiher css or xpath. Both is shown in the example above.&lt;/p&gt; &lt;p&gt;Every Item we create is returned from the method. If we would have to visit another page we could also return a &lt;code&gt;Request&lt;/code&gt; object and Scrapy would then visit that page as well.&lt;/p&gt; &lt;p&gt;We can run this spider from the project directory by issuing &lt;code&gt;scrapy crawl meetup -o talks.json&lt;/code&gt;. This will use our meetup spider and write the items as JSON to a file.&lt;/p&gt;   &lt;pre&gt;&lt;code&gt;2014-07-24 18:27:59+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: meetup)&lt;br /&gt;[...]&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] DEBUG: Crawled (200) &amp;lt;get http:=&amp;quot;&amp;quot; www.meetup.com=&amp;quot;&amp;quot; search-meetup-karlsruhe=&amp;quot;&amp;quot;&amp;gt; (referer: None)&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] DEBUG: Scraped from &amp;lt;200 http://www.meetup.com/Search-Meetup-Karlsruhe/&amp;gt;&lt;br /&gt;        {&amp;#39;link&amp;#39;: [u&amp;#39;http://www.meetup.com/Search-Meetup-Karlsruhe/events/178746832/&amp;#39;],&lt;br /&gt;         &amp;#39;title&amp;#39;: [u&amp;#39;Neues in Elasticsearch 1.1 und Logstash in der Praxis&amp;#39;]}&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] DEBUG: Scraped from &amp;lt;200 http://www.meetup.com/Search-Meetup-Karlsruhe/&amp;gt;&lt;br /&gt;        {&amp;#39;link&amp;#39;: [u&amp;#39;http://www.meetup.com/Search-Meetup-Karlsruhe/events/161417512/&amp;#39;],&lt;br /&gt;         &amp;#39;title&amp;#39;: [u&amp;#39;Erstes Treffen mit Kurzvortr\xe4gen&amp;#39;]}&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] INFO: Closing spider (finished)&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] INFO: Stored jsonlines feed (2 items) in: talks.json&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] INFO: Dumping Scrapy stats:&lt;br /&gt;        {&amp;#39;downloader/request_bytes&amp;#39;: 244,&lt;br /&gt;         &amp;#39;downloader/request_count&amp;#39;: 1,&lt;br /&gt;[...]&lt;br /&gt;         &amp;#39;start_time&amp;#39;: datetime.datetime(2014, 7, 24, 16, 27, 59, 540300)}&lt;br /&gt;2014-07-24 18:28:00+0200 [meetup] INFO: Spider closed (finished)&lt;br /&gt;                    &amp;lt;/get&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can see that Scrapy visited the page and 
extracted two items. Finally it prints some stats on the crawl. The file contains our items as well&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&quot;link&quot;: [&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/events/178746832/&quot;], &quot;title&quot;: [&quot;Neues in Elasticsearch 1.1 und Logstash in der Praxis&quot;]}&lt;br /&gt;{&quot;link&quot;: [&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/events/161417512/&quot;], &quot;title&quot;: [&quot;Erstes Treffen mit Kurzvortr\u00e4gen&quot;]}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This is fine but there is a problem. We don't have all the data that we would like to have, we are missing the description. This information is not fully available on the overview page so we need to crawl the detail pages of the meetup as well.&lt;/p&gt; &lt;h4&gt;The Crawl Spider&lt;/h4&gt; &lt;p&gt;We still need to use our overview page because this is where all the recent meetups are listed. But for retrieving the item data we need to go to the &lt;a href=&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/events/178746832/&quot;&gt;detail page&lt;/a&gt;.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-pRkymnKrMz0/U9cp4kMCpxI/AAAAAAAAAYM/8O9C--JS78A/s1600/meetup-details.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-pRkymnKrMz0/U9cp4kMCpxI/AAAAAAAAAYM/8O9C--JS78A/s400/meetup-details.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;As mentioned already we could solve our new requirement using our spider above by returning Request objects and a new callback function. But we can solve it another way, by using the CrawlSpider that can be configured with a Rule that advices where to extract links to visit.&lt;/p&gt; &lt;p&gt;In case you are confused, welcome to the world of Scrapy! When working with Scrapy you will regularly find cases where there are several ways to do a thing.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;from scrapy.contrib.spiders import CrawlSpider, Rule&lt;br /&gt;from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor&lt;br /&gt;from scrapy.selector import Selector&lt;br /&gt;from meetup.items import MeetupItem&lt;br /&gt;&lt;br /&gt;class MeetupDetailSpider(CrawlSpider):&lt;br /&gt;    name = &quot;meetupDetail&quot;&lt;br /&gt;    allowed_domains = [&quot;meetup.com&quot;]&lt;br /&gt;    start_urls = [&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/&quot;]&lt;br /&gt;    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id=&quot;recentMeetups&quot;]//a[@class=&quot;event-title&quot;]')), callback='parse_meetup')]&lt;br /&gt;&lt;br /&gt;    def parse_meetup(self, response):&lt;br /&gt;        sel = Selector(response)&lt;br /&gt;        item = MeetupItem()&lt;br /&gt;        item['title'] = sel.xpath('//h1[@itemprop=&quot;name&quot;]/text()').extract()&lt;br /&gt;        item['link'] = response.url&lt;br /&gt;        item['description'] = sel.xpath('//div[@id=&quot;past-event-description-wrap&quot;]//text()').extract()&lt;br /&gt;        yield item&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Besides the information we have set for our other spider we now also add a &lt;code&gt;Rule&lt;/code&gt; object. It extracts the links from the list and passes the responses to the supplied callback. You can also add rules that visit links by path, e.g. 
all with the fragment &lt;code&gt;/articles/&lt;/code&gt; in the url.&lt;/p&gt; &lt;p&gt;Our &lt;code&gt;parse_meetup&lt;/code&gt; method now doesn't work on the overview page but on the detail pages that are extracted by the rule. The detail page has all the information available we need and will now even pass the description to our item.&lt;/p&gt; &lt;p&gt;Now that we have all the information we can do something useful with it: Index it in Elasticsearch.&lt;/p&gt; &lt;h4&gt;Elasticsearch&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;https://github.com/noplay/scrapy-elasticsearch&quot;&gt;Elasticsearch support for Scrapy&lt;/a&gt; is available by installing a module: &lt;code&gt;pip install &quot;ScrapyElasticSearch&quot;&lt;/code&gt;. It takes the Items created by your spider and indexes those in Elasticsearch using the library &lt;a href=&quot;https://github.com/aparo/pyes&quot;&gt;pyes&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Looking at the Scrapy architecture above you might expect that the module is implemented as a FeedExporter that exports the items to Elasticsearch instead of the filesystem. For reasons unknown to me exporting to a database or search engine is done using an &lt;code&gt;ItemPipeline&lt;/code&gt; which is a component in the item pipeline. Confused?&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-XpjHwPV0L3w/U9cpv6Bt_rI/AAAAAAAAAYE/M20V-YJPZ0E/s1600/architecture-es.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-XpjHwPV0L3w/U9cpv6Bt_rI/AAAAAAAAAYE/M20V-YJPZ0E/s400/architecture-es.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;To configure Scrapy to put the items to Elasticsearch of course you need to have an instance running somewhere. The pipeline is configured in the file &lt;code&gt;settings.py&lt;/code&gt;.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;ITEM_PIPELINES = [&lt;br /&gt;  'scrapyelasticsearch.ElasticSearchPipeline',&lt;br /&gt;]&lt;br /&gt;&lt;br /&gt;ELASTICSEARCH_SERVER = 'localhost' &lt;br /&gt;ELASTICSEARCH_PORT = 9200 &lt;br /&gt;ELASTICSEARCH_INDEX = 'meetups'&lt;br /&gt;ELASTICSEARCH_TYPE = 'meetup'&lt;br /&gt;ELASTICSEARCH_UNIQ_KEY = 'link'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The configuration should be straightforward. We enable the module by adding it to the &lt;code&gt;ITEM_PIPELINES&lt;/code&gt; and configure additional information like the host, index and type name. Now when crawling for the next time Scrapy will automatically push your data to Elasticsearch.&lt;/p&gt; &lt;p&gt;I am not sure if this can be an issue when it comes to crawling but the module doesn't use bulk indexing but indexes each item by itself. If you have a very large amount of data this could be a problem but should be totally fine for most uses. Also, of course you need to make sure that your mapping is in place before indexing data if you need some predefined handling.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;I hope you could see how useful Scrapy can be and how easy it is to put the data in stores like Elasticsearch. Some approaches of Scrapy can be quite confusing at first but nevertheless it's an extremely useful tool. Give it a try the next time you are thinking about web scraping.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Flexible Query Cache</title>
   <link href="http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-flexible.html"/>
   <updated>2014-07-25T14:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/07/use-cases-for-elasticsearch-flexible</id>
   <content type="html">&lt;p&gt;In the previous two posts on use cases for Elasticsearch we have seen that Elasticsearch &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;can be used to store even large amounts of documents&lt;/a&gt; and that we can &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-full-text.html&quot;&gt;access those using the full text features of Lucene via the Query DSL&lt;/a&gt;. In this shorter post we will put both of use cases together to see how read heavy applications can benefit from Elasticsearch.&lt;/p&gt; &lt;h4&gt;Search Engines in Classic Applications&lt;/h4&gt; &lt;p&gt;Looking at classic applications search engines were a specialized thing that was only responsible for helping with one feature, the search page.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-Ntzwo4msGVg/U9H9E8IG4HI/AAAAAAAAAXU/Wsw2dvj-thQ/s1600/search-page-setup.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-Ntzwo4msGVg/U9H9E8IG4HI/AAAAAAAAAXU/Wsw2dvj-thQ/s400/search-page-setup.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;On the left we can see our application, most of its functionality is build by querying the database. The search engine only plays a minor part and is responsible for rendering the search page.&lt;/p&gt; &lt;p&gt;Databases are well suited for lots of types of applications but it turns out that often it is not that easy to scale them. Websites with high traffic peaks often have some problems scaling database access. Indexing and scaling machines up can help but often requires specialized knowledge and can become rather expensive.&lt;/p&gt; &lt;p&gt;As with other search features especially ecommerce providers started doing something different. They started to employ the search engine not only for full text search but also for other parts of the page that require no direct keyword input by the user. Again, let's have a look at a page at Amazon.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-7eG5ZHhOWhw/U9H9dqVkwFI/AAAAAAAAAXc/IPI-zwg2utY/s1600/amazon-cache.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-7eG5ZHhOWhw/U9H9dqVkwFI/AAAAAAAAAXc/IPI-zwg2utY/s400/amazon-cache.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;This is one of the category pages that can be accessed using the navigation. We can already see that the interface looks very similar to a search result page. There is a result list, we can sort and filter the results using the facets. Though of course I have no insight how Amazon is doing this exactly a common approach is to use the search engine for pages like this as well.&lt;/p&gt; &lt;h4&gt;Scaling Read Requests&lt;/h4&gt; &lt;p&gt;A common problem for ecommerce websites is that there are huge traffic spikes. Depending on your kind of business you might have a lot more traffic just before christmas. Or you might have to fight spikes when there are TV commercials for your service or any special discounts. 
Flash sale sites are at the extreme end of those kind of sites with very high spikes at a certain point in time when a sale starts.&lt;/p&gt; &lt;p&gt;It turns out that search engines are good at being queried a lot. The immutable data set, the segments, are very cache friendly. When it comes to filters those can be cached by the engine as well most of the times. On a warm index most of the data will be in RAM so it is lightning fast.&lt;/p&gt; &lt;p&gt;Back to our example of talks that can be accessed online. Imagine a navigation where the user can choose the city she wants to see events for. You can then issue a query like this to Elasticsearch:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search &quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;filter&quot;: {&lt;br /&gt;        &quot;term&quot;: {&lt;br /&gt;           &quot;conference.city&quot;: &quot;stuttgart&quot;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;There is no query part but only a filter that limits the results to the talks that are in Stuttgart. The whole filter will be cached so if a lot of users are accessing the data there can be a huge performance gain for you and especially your users.&lt;/p&gt; &lt;p&gt;Additionally as we have seen new nodes can be added to Elasticsearch without a lot of hassle. If we need more query capacity we can easily add more machines and more replicas, even temporarily. When we can identify some pages that can be moved to the search engine the database doesn't need to have that much traffic anymore.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-Y3MwbD3X4rw/U9H9kv5VWPI/AAAAAAAAAXk/6fiuT4TFAF4/s1600/search-engine-as-cache.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-Y3MwbD3X4rw/U9H9kv5VWPI/AAAAAAAAAXk/6fiuT4TFAF4/s400/search-engine-as-cache.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Especially for getting the huge spikes under control it is best to try to not access the database anymore for read heavy pages and deliver all of the content from the search engine. &lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Though in this post we have looked at ecommerce the same strategy can be applied to different domains. Content management systems can push the editorial content to search engines and let those be responsible for scaling. Classifieds, social media aggregation, .... All of those can benefit from the cache friendly nature of a search engine. Maybe you will even notice that parts of your data don't need to be in the database at all and you can migrate them to Elasticsearch as a primary data store. A first step to polyglot persistence.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Slides for Use Cases for Elasticsearch for Developer Week and Java Forum Stuttgart</title>
   <link href="http://blog.florian-hopf.de/2014/07/slides-for-use-cases-for-elasticsearch.html"/>
   <updated>2014-07-16T00:06:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/07/slides-for-use-cases-for-elasticsearch</id>
   <content type="html">&lt;p&gt;I am giving the German talk &lt;em&gt;Anwendungsfälle für Elasticsearch (Use Cases for Elasticsearch)&lt;/em&gt; twice in July 2014, first at &lt;a href=&quot;http://developer-week.de&quot;&gt;Developer Week Nürnberg&lt;/a&gt; at 15.07.2014 and then at &lt;a href=&quot;http://java-forum-stuttgart.de&quot;&gt;Java Forum Stuttgart&lt;/a&gt; at 17.07.2014. The slides for the Developer Week talk, which are a superset of the Java Forum talk, are now &lt;a href=&quot;http://www.slideshare.net/fhopf/anwendungsfaelle-fr-elasticsearch&quot;&gt;available on Slideshare.&lt;/a&gt;&lt;/p&gt; &lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/37004850&quot; width=&quot;476&quot; height=&quot;400&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt; &lt;p&gt;In additon to the talks I published a blog post on each of the use cases. &lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;Elasticsearch as a Document Store&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-full-text.html&quot;&gt;Elasticsearch for Full Text Search&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-flexible.html&quot;&gt;Elasticsearch as a Flexible Cache&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/08/use-cases-for-elasticsearch-geospatial.html&quot;&gt;Geo Search with Elasticsearch&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/09/use-cases-for-elasticsearch-index-and.html&quot;&gt;Insight in your Logfiles with Logstash, Elasticsearch and Kibana&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://blog.florian-hopf.de/2014/12/use-cases-for-elasticsearch-analytics.html&quot;&gt;Analytics with Elasticsearch&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;If you are interested in all posts on this blog you can &lt;a href=&quot;http://feeds.feedburner.com/florian-hopf/UjyC&quot;&gt;subscribe to my feed&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;If you have any feedback on the talk or the topic I would appreciate a comment or you can just &lt;a href=&quot;http://www.florian-hopf.de/kontakt/&quot;&gt;contact me directly&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Full Text Search</title>
   <link href="http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-full-text.html"/>
   <updated>2014-07-11T15:09:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/07/use-cases-for-elasticsearch-full-text</id>
   <content type="html">&lt;p&gt;In the last post of this series on use cases for Elasticsearch we looked at &lt;a href=&quot;http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html&quot;&gt;the features Elasticsearch provides for storing even large amounts of documents&lt;/a&gt;. In this post we will look at another one of its core features: Search. I am building on some of the information in the previous post so if you haven't read it you should do so now.&lt;/p&gt; &lt;p&gt;As we have seen we can use Elasticsearch to store JSON documents that can even be distributed across several machine. Indexes are used to group documents and each document is stored using a certain type. Shards are used to distribute parts of an index across several nodes and replicas are copies of shards that are used for distributing load as well as for fault tolerance.&lt;/p&gt; &lt;h4&gt;Full Text Search&lt;/h4&gt; &lt;p&gt;Everybody uses full text search. The amount of information has just become too much to access it using navigation and categories alone. &lt;a href=&quot;http://google.com&quot;&gt;Google&lt;/a&gt; is the most prominent example offering instant keyword search across a huge amount of information.&lt;/p&gt;   &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-t_qI3AuKs50/U79mc923AEI/AAAAAAAAAWo/VNGt2ogVhHM/s1600/google.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-t_qI3AuKs50/U79mc923AEI/AAAAAAAAAWo/VNGt2ogVhHM/s640/google.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Looking at what Google does we can already see some common features of full text search. Users only provide keywords and expect the search engine to provide good results. Relevancy of documents is expected to be good and users want the results they are looking for on the first page. How relevant a document is can be influenced by different factors like h the queried term exists in a document. Besides getting the best results the user wants to be supported during the search process. Features like suggestions and highlighting on the result excerpt can help with this.&lt;/p&gt; &lt;p&gt;Another area where search is important is E-Commerce with &lt;a href=&quot;http://amazon.com&quot;&gt;Amazon&lt;/a&gt; being one of the dominant players.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-AC2aU4dw7a8/U79mukxK8MI/AAAAAAAAAWw/vollx2QzZr4/s1600/amazon.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-AC2aU4dw7a8/U79mukxK8MI/AAAAAAAAAWw/vollx2QzZr4/s640/amazon.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The interface looks similar to the Google one. The user can enter keywords that are then searched for. But there are also slight differences. The suggestions Amazon provides are more advanced, also hinting at categories a term might be found in. Also the result display is different, consisting of a more structured view. The structure of the documents being searched is also used for determining the facets on the left that can be used to filter the current result based on certain criteria, e.g. all results that cost between 10 and 20 €. 
Finally, relevance might mean something completely different when it comes to something like an online store. Often the order of the result listing is influenced by the vendor or the user can sort the results by criteria like price or release date.&lt;/p&gt; &lt;p&gt;Though neither Google nor Amazon are using Elasticsearch you can use it to build similar solutions.&lt;/p&gt; &lt;h4&gt;Searching in Elasticsearch&lt;/h4&gt; &lt;p&gt;As with everything else, Elasticsearch can be searched using HTTP. In the most simple case you can &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-uri-request.html&quot;&gt;append the _search endpoint to the url and add a parameter&lt;/a&gt;: &lt;code&gt;curl -XGET &quot;http://localhost:9200/conferences/talk/_search?q=elasticsearch&amp;pretty=true&quot;&lt;/code&gt;. Elasticsearch will then respond with the results, ordered by relevancy.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;  &quot;took&quot; : 81,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 1,&lt;br /&gt;    &quot;max_score&quot; : 0.067124054,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;conferences&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;talk&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;iqxb7rDoTj64aiJg55KEvA&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.067124054,&lt;br /&gt;      &quot;_source&quot;:{&lt;br /&gt;    &quot;title&quot; : &quot;Anwendungsfälle für Elasticsearch&quot;,&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2014-07-17T15:35:00.000Z&quot;,&lt;br /&gt;    &quot;tags&quot; : [&quot;Java&quot;, &quot;Lucene&quot;],                                  &lt;br /&gt;    &quot;conference&quot; : {&lt;br /&gt;        &quot;name&quot; : &quot;Java Forum Stuttgart&quot;,&lt;br /&gt;        &quot;city&quot; : &quot;Stuttgart&quot;&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Though we have searched on a certain type now you can also search multiple types or multiple indices.&lt;/p&gt; &lt;p&gt;Adding a parameter is easy but search requests can become more complex. We might request highlighting or filter the documents according to a criteria. Instead of using parameters for everything Elasticsearch offers the so called &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html&quot;&gt;Query DSL&lt;/a&gt;, a search API that is passed in the body of the request and is expressed using JSON.&lt;/p&gt; &lt;p&gt;This query could be the result of a user trying to search for elasticsearch but mistyping parts of it. 
The results are filtered so that only talks for conferences in the city of Stuttgart are returned.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/conferences/_search &quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;query&quot;: {&lt;br /&gt;        &quot;match&quot;: {&lt;br /&gt;            &quot;title&quot; : {&lt;br /&gt;               &quot;query&quot;: &quot;elasticsaerch&quot;,&lt;br /&gt;               &quot;fuzziness&quot;: 2&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    },&lt;br /&gt;    &quot;filter&quot;: {&lt;br /&gt;        &quot;term&quot;: {&lt;br /&gt;           &quot;conference.city&quot;: &quot;stuttgart&quot;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This time we are querying all documents of all types in the index conferences. The query object requests one of the common queries, a &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html&quot;&gt;match query&lt;/a&gt; on the title field of the document. The query attribute contains the search term that would be passed in by the user. The fuzziness attribute requests that we should also find documents that contain terms that are similar to the term requested. This will take care of the misspelled term and also return results containing elasticsearch. The filter object requests that all results should be filtered according to the city of the conference. Filters should be used whenever possible as they can be cached and do not calculate the relevancy which should make them faster.&lt;/p&gt; &lt;h4&gt;Normalizing Text&lt;/h4&gt; &lt;p&gt;As search is used everywhere users also have some expectations of how it should work. Instead of issuing exact keyword matches they might use terms that are only similar to the ones that are in the document. For example a user might be querying for the term Anwendungsfall which is the singular of the contained term Anwendungsfälle, meaning use cases in German: &lt;code&gt;curl -XGET &quot;http://localhost:9200/conferences/talk/_search?q=title:anwendungsfall&amp;pretty=true&quot;&lt;/code&gt;&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;  &quot;took&quot; : 2,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 0,&lt;br /&gt;    &quot;max_score&quot; : null,&lt;br /&gt;    &quot;hits&quot; : [ ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;No results. We could try to solve this using the fuzzy search we have seen above but there is a better way. We can normalize the text during indexing so that both keywords point to the same term in the document.&lt;/p&gt; &lt;p&gt;Lucene, the library that search and storage in Elasticsearch are implemented with, provides the underlying data structure for search, the inverted index. Terms are mapped to the documents they are contained in. 
A process called analyzing is used to split the incoming text and add, remove or modify terms.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-HrdoXgCeWV0/U79uZK9FZEI/AAAAAAAAAXA/Ki9QbCm6EBY/s1600/analyzing.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-HrdoXgCeWV0/U79uZK9FZEI/AAAAAAAAAXA/Ki9QbCm6EBY/s640/analyzing.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;On the left we can see two documents that are indexed, on the right we can see the inverted index that maps terms to the documents they are contained in. During the analyzing process the content of the documents is split and transformed in an application specific way so it can be put in the index. Here the text is first split on whitespace or punctuation. Then all the characters are lowercased. In a final step the language dependent stemming is employed that tries to find the base form of terms. This is what transforms our Anwendungsfälle to Anwendungsfall.&lt;/p&gt; &lt;p&gt;What kind of logic is executed during analyzing depends on the data of your application. The analyzing process is one of the main factors for determining the quality of your search and you can spend quite some time with it. For more details you might want to look at my post on &lt;a href=&quot;http://blog.florian-hopf.de/2014/04/the-absolute-basics-of-indexing-data.html&quot;&gt;the absolute basics of indexing data.&lt;/a&gt;&lt;/p&gt; &lt;p&gt;In Elasticsearch, how fields are analyzed is determined by the mapping of the type. Last week we have seen that we can index documents of different structure in Elasticsearch but as we can see now Elasticsearch is not exactly schema free. The analyzing process for a certain field is determined once and cannot be changed easily. You can add additional fields but you normally don't change how existing fields are stored.&lt;/p&gt; &lt;p&gt;If you don't supply a mapping Elasticsearch will do some educated guessing for the documents you are indexing. It will look at any new field it sees during indexing and do what it thinks is best. In the case of our title it uses the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html&quot;&gt;StandardAnalyzer&lt;/a&gt; because it is a string. Elasticsearch does not know what language our string is in so it doesn't do any stemming which is a good default.&lt;/p&gt; &lt;p&gt;To tell Elasticsearch to use the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#german-analyzer&quot;&gt;GermanAnalyzer&lt;/a&gt; instead we need to add a custom mapping. 
We first delete the index and create it again:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XDELETE &quot;http://localhost:9200/conferences/&quot;&lt;br /&gt;&lt;br /&gt;curl -XPUT &quot;http://localhost:9200/conferences/&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can then use the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-put-mapping.html&quot;&gt;PUT mapping API&lt;/a&gt; to pass in the mapping for our type.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT &quot;http://localhost:9200/conferences/talk/_mapping&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;properties&quot;: {&lt;br /&gt;       &quot;tags&quot;: {&lt;br /&gt;          &quot;type&quot;: &quot;string&quot;,&lt;br /&gt;          &quot;index&quot;: &quot;not_analyzed&quot;&lt;br /&gt;       },&lt;br /&gt;       &quot;title&quot;: {&lt;br /&gt;          &quot;type&quot;: &quot;string&quot;,&lt;br /&gt;          &quot;analyzer&quot;: &quot;german&quot;&lt;br /&gt;       }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We have only supplied a custom mapping for two fields. The rest of the fields will again be guessed by Elasticsearch. When creating a production app you will most likely map all of your fields in advance but the ones that are not that relevant can also be mapped automatically. Now, if we index the document again and search for the singular, the document will be found.&lt;/p&gt; &lt;h4&gt;Advanced Search&lt;/h4&gt; &lt;p&gt;Besides the features we have seen here Elasticsearch provides a lot more. You can automatically gather facets for the results using aggregations which we will look at in a later post. The suggesters can be used to perform autosuggestion for the user, terms can be highlighted, results can be sorted according to fields, you get pagination with each request, .... As Elasticsearch builds on Lucene all the goodies for building an advanced search application are available.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Search is a core part of Elasticsearch that can be combined with its distributed storage capabilities. You can use the Query DSL to build expressive queries. Analyzing is a core part of search and can be influenced by adding a custom mapping for a type. Lucene and Elasticsearch provide lots of advanced features for adding search to your application.&lt;/p&gt; &lt;p&gt;Of course there are lots of users that are building on Elasticsearch because of its search features and its distributed nature. &lt;a href=&quot;http://exploringelasticsearch.com/github_interview.html&quot;&gt;GitHub uses it to let users search the repositories&lt;/a&gt;, &lt;a href=&quot;http://meta.stackexchange.com/questions/160100/a-new-search-engine-for-stack-exchange&quot;&gt;StackOverflow indexes all of its questions and answers in Elasticsearch&lt;/a&gt; and &lt;a href=&quot;http://developers.soundcloud.com/blog/architecture-behind-our-new-search-and-explore-experience&quot;&gt;SoundCloud offers search in the metadata of the songs.&lt;/a&gt;&lt;/p&gt; &lt;p&gt;In the next post we will look at another aspect of Elasticsearch: Using it to index geodata, which lets you filter and sort results by position and distance.&lt;/p&gt;  </content>
 </entry>
 
 <entry>
   <title>Use Cases for Elasticsearch: Document Store</title>
   <link href="http://blog.florian-hopf.de/2014/07/use-cases-for-elasticsearch-document.html"/>
   <updated>2014-07-04T19:28:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/07/use-cases-for-elasticsearch-document</id>
   <content type="html">&lt;p&gt;&lt;i&gt;I'll be giving an introductory talk about Elasticsearch twice in July, first at &lt;a href=&quot;www.developer-week.de&quot;&gt;Developer Week Nürnberg&lt;/a&gt;, then at &lt;a href=&quot;http://www.java-forum-stuttgart.de/de/Home.html&quot;&gt;Java Forum Stuttgart&lt;/a&gt;. I am showing some of the features of Elasticsearch by looking at certain use cases. To prepare for the talks I will try to describe each of the use cases in a blog post as well.&lt;/i&gt;&lt;/p&gt; &lt;p&gt;When it comes to Elasticsearch the first thing to look at often is the search part. But in this post I would like to start with its capabilities as a distributed document store.&lt;/p&gt; &lt;h4&gt;Getting Started&lt;/h4&gt; &lt;p&gt;Before we start we need to install Elasticsearch which fortunately is very easy. You can just download the archive, unpack it and use a script to start it. As it is a Java based application you of course need to have a Java runtime installed.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;# download archive&lt;br /&gt;wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.zip&lt;br /&gt; &lt;br /&gt;# zip is for windows and linux&lt;br /&gt;unzip elasticsearch-1.2.1.zip&lt;br /&gt; &lt;br /&gt;# on windows: elasticsearch.bat&lt;br /&gt;elasticsearch-1.2.1/bin/elasticsearch&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Elasticsearch can be talked to using HTTP and JSON.When looking around at examples you will often see curl being used because it is widely available. (See &lt;a href=&quot;http://blog.florian-hopf.de/2014/06/goodbye-sense-welcome-alternatives.html&quot;&gt;this post on querying Elasticsearch using plugins&lt;/a&gt; for alternatives). To see if it is up and running you can issue a GET request on port 9200: &lt;code&gt;curl -XGET http://localhost:9200&lt;/code&gt;. If everything is set up correctly Elasticsearch will respond with something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt; &quot;status&quot; : 200,&quot;name&quot; : &quot;Hawkeye&quot;, &lt;br /&gt; &quot;version&quot; : {&lt;br /&gt;  &quot;number&quot; : &quot;1.2.1&quot;,&lt;br /&gt;  &quot;build_hash&quot; : &quot;6c95b759f9e7ef0f8e17f77d850da43ce8a4b364&quot;,&lt;br /&gt;  &quot;build_timestamp&quot; : &quot;2014-06-03T15:02:52Z&quot;,&lt;br /&gt;     &quot;build_snapshot&quot; : false,&lt;br /&gt;     &quot;lucene_version&quot; : &quot;4.8&quot;&lt;br /&gt;  },&lt;br /&gt;  &quot;tagline&quot; : &quot;You Know, for Search&quot;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Storing Documents&lt;/h4&gt; &lt;p&gt;When I say document this means two things. First, Elasticsearch stores JSON documents and even uses JSON internally a lot. This is an example of a simple document that describes talks for conferences.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;    &quot;title&quot; : &quot;Anwendungsfälle für Elasticsearch&quot;,&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2014-07-17T15:35:00.000Z&quot;,&lt;br /&gt;    &quot;tags&quot; : [&quot;Java&quot;, &quot;Lucene&quot;],&lt;br /&gt;    &quot;conference&quot; : {&lt;br /&gt;        &quot;name&quot; : &quot;Java Forum Stuttgart&quot;,&lt;br /&gt;        &quot;city&quot; : &quot;Stuttgart&quot;&lt;br /&gt;    } &lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;There are fields and values, arrays and nested documents. 
Each of those features is supported by Elasticsearch.&lt;/p&gt; &lt;p&gt;Besides the JSON documents that are used for storing data in Elasticsearch, document refers to the underlying library Lucene, that is used to persist the data and handles data as documents consisting of fields. So this is a perfect match: Elasticsearch uses JSON, which is very popular and supported from lots of technologies. But the underlying data structures also use documents.&lt;/p&gt; &lt;p&gt;When indexing a document we can issue a post request to a certain URL. The body of the request contains the document to be stored, the file we are passing contains the content we have seen above.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST http://localhost:9200/conferences/talk/ --data-binary @talk-example-jfs.json&lt;/pre&gt;&lt;/code&gt; &lt;p&gt;When started Elasticsearch listens on port 9200 by default. For storing information we need to provide some additional information in the URL. The first segment after the port is the index name. An index name is a logical grouping of documents. If you want to compare it to the relational world this can be thought of as the database.&lt;/p&gt; &lt;p&gt;The next segment that needs to be provided is the type. A type can describe the structure of the doucments that are stored in it. You can again compare this to the relational world, this could be a table, but that is only slightly correct. Documents of any kind can be stored in Elasticsearch, that is why it is often called schema free. We will look at this behaviour in the next post where you will see that schema free isn't the most appropriate term for it. For now it is enough to know that you can store documents with completely different structure in Elasticsearch. This also means you can evolve your documents and add new fields as appropriate.&lt;/p&gt; &lt;p&gt;Note that neither index nor type need to exist when starting indexing documents. They will be created automatically, one of the many features that makes it so easy to start with Elasticsearch.&lt;/p&gt; &lt;p&gt;When you are storing a document in Elasticsearch it will automatically generate an id for you that is also returned in the result.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt; &quot;_index&quot;:&quot;conferences&quot;,&lt;br /&gt; &quot;_type&quot;:&quot;talk&quot;,&lt;br /&gt; &quot;_id&quot;:&quot;GqjY7l8sTxa3jLaFx67_aw&quot;,&lt;br /&gt; &quot;_version&quot;:1,&lt;br /&gt; &quot;created&quot;:true&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;In case you want to determine the id yourself you can also use a PUT on the same URL we have seen above plus the id. 
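A minimal sketch of such a request could look like this (the id &lt;code&gt;talk-1&lt;/code&gt; is just a made up example, any unique string works):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;# index the same document under an id we choose ourselves&lt;br /&gt;curl -XPUT http://localhost:9200/conferences/talk/talk-1 --data-binary @talk-example-jfs.json&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;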
I don't want to get into trouble by calling this RESTful but did you notice that Elasticsearch makes good use of the HTTP verbs?&lt;/p&gt; &lt;p&gt;Either way how you stored the document you can always retrieve it by specifying the index, type and id.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET http://localhost:9200/conferences/talk/GqjY7l8sTxa3jLaFx67_aw?pretty=true&lt;/pre&gt;&lt;/code&gt; &lt;p&gt;which will respond with something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;  &quot;_index&quot; : &quot;conferences&quot;,&lt;br /&gt; [...]&lt;br /&gt;  &quot;_source&quot;:{&lt;br /&gt;    &quot;title&quot; : &quot;Anwendungsfälle für Elasticsearch&quot;,&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2014-07-17T15:35:00.000Z&quot;,&lt;br /&gt;    &quot;tags&quot; : [&quot;Java&quot;, &quot;Lucene&quot;],&lt;br /&gt;    &quot;conference&quot; : {&lt;br /&gt;        &quot;name&quot; : &quot;Java Forum Stuttgart&quot;,&lt;br /&gt;        &quot;city&quot; : &quot;Stuttgart&quot;&lt;br /&gt;    } &lt;br /&gt;}&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can see that the source in the response contains exactly the document we have indexed before.&lt;/p&gt; &lt;h4&gt;Distributed Storage&lt;/h4&gt; &lt;p&gt;So far we have seen how Elasticsearch stores and retrieves documents and we have learned that you can evolve the schema of your documents. The huge benefit we haven't touched so far is that it is distributed. Each index can be split into several shards that can then be distributed across several machines.&lt;/p&gt; &lt;p&gt;To see the distributed nature in action fortunately we don't need several machines. First, let's see the state of our currently running instance in the plugin elasticsearch-kopf (See &lt;a href=&quot;http://blog.florian-hopf.de/2014/06/elasticsearch-kopf.html&quot;&gt;this post on details how to install and use it&lt;/a&gt;):&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-3wTGmr8OEpw/U7aLKubVJ_I/AAAAAAAAAV4/1b31wT16rnc/s1600/single-node-sharded.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-3wTGmr8OEpw/U7aLKubVJ_I/AAAAAAAAAV4/1b31wT16rnc/s640/single-node-sharded.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;On the left you can see that there is one machine running. The row on top shows that it contains our index conferences. Even though we didn't explicitly tell Elasticsearch it created 5 shards for our index that are currently all on the instance we started. As each of the shards is a Lucene index in itself even if you are running your index on one instance the documents you are storing are already distributed across several Lucene indexes.&lt;/p&gt;   &lt;p&gt;We can now use the same installation to start another node. 
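On a development machine a sketch of this can be as simple as running the start script a second time from the same directory; with the default configuration the second node should discover the first one automatically:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;# in a second terminal, same unpacked archive as above&lt;br /&gt;elasticsearch-1.2.1/bin/elasticsearch&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;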
After a short time we should see the instance in the dashboard as well.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-bXtZuHZKlwo/U7aLLxLAMQI/AAAAAAAAAWE/jnS9FOOKVJY/s1600/two-nodes.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-bXtZuHZKlwo/U7aLLxLAMQI/AAAAAAAAAWE/jnS9FOOKVJY/s640/two-nodes.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;As the new node joins the cluster (which by default happens automatically) Elasticsearch will automatically copy the shards to the new node. This is because by default it not only uses 5 shards but also 1 replica, which is a copy of a shard. Replicas are always placed on different nodes than their shards and are used for distributing the load and for fault tolerance. If one of the nodes crashes the data is still available on the other node.&lt;/p&gt; &lt;p&gt;Now, if we start another node something else will happen. Elasticsearch will rebalance the shards. It will copy and move shards to the new node so that the shards are distributed evenly across the machines.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-Y2n20QgMLyk/U7aLLylRedI/AAAAAAAAAWI/h43R9uNeI3M/s1600/three-nodes.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-Y2n20QgMLyk/U7aLLylRedI/AAAAAAAAAWI/h43R9uNeI3M/s640/three-nodes.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Once defined when creating an index the number of shards can't be changed. That's why you normally overallocate (create more shards than you need right now) or if your data allows it you can create time based indices. Just be aware that sharding comes with some cost and think carefully about what you need. Designing your distribution setup can still be difficult even with Elasticsearch does a lot for you out of the box.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;In this post we have seen how easy it is to store and retrieve documents using Elasticsearch. JSON and HTTP are technologies that are available in lots of programming environments. The schema of your documents can be evolved as your requirements change. Elasticsearch distributes the data by default and lets you scale across several machines so it is suited well even for very large data sets.&lt;/p&gt; &lt;p&gt;Though using Elasticsearch as a document store is a real use case it is hard to find users that are only using it that way. Nobody retrieves the documents only by id as we have seen in this post but uses the rich query facilities we will look at next week. Nevertheless you can read about how &lt;a href=&quot;http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html&quot;&gt;Hipchat uses Elasticsearch to store billions of messages&lt;/a&gt; and how &lt;a href=&quot;http://www.elasticsearch.org/case-study/engagor/&quot;&gt;Engagor uses Elasticsearch here&lt;/a&gt; and &lt;a href=&quot;http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/&quot;&gt;here&lt;/a&gt;. 
Both of them are using Elasticsearch as their primary storage.&lt;/p&gt; &lt;p&gt;Though it sounds more drastic than it probably is: If you are considering using Elasticsearch as your primary storage you should also read &lt;a href=&quot;http://aphyr.com/posts/317-call-me-maybe-elasticsearch&quot;&gt;this analysis of Elasticsearch's behaviour in case of network partitions&lt;/a&gt;. Next week we will be looking at using Elasticsearch for something obvious: Search.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Goodbye Sense - Welcome Alternatives?</title>
   <link href="http://blog.florian-hopf.de/2014/06/goodbye-sense-welcome-alternatives.html"/>
   <updated>2014-06-27T15:28:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/06/goodbye-sense-welcome-alternatives</id>
   <content type="html">&lt;p&gt;I only recently noticed that &lt;a href=&quot;https://www.found.no/foundation/Sense-Elasticsearch-interface/&quot;&gt;Sense, the Chrome Plugin for Elasticsearch&lt;/a&gt; has been &lt;a href=&quot;https://groups.google.com/forum/#!topic/elasticsearch/rPS1duxMzyM&quot;&gt;pulled from the app store by its creator&lt;/a&gt;. There are quite strong opinions in this thread and I would like to have Sense as a Chrome plugin as well. But I am also totally fine with Elasticsearch as a company trying to monetize some of its products so that is maybe something we just have to accept. What is interesting is that it isn't even possible to fork the project and keep developing it as there is no explicit license in the repo. I guess there is a lesson buried somewhere in here.&lt;/p&gt; &lt;p&gt;In this post I would like to look at some of the alternatives for interacting with Elasticsearch. Though the good thing about Sense is that it is independent from the Elasticsearch installation we are looking at plugins here. It might be possible to use some of them without installing them in Elasticsearch but I didn't really try. The plugins are generally doing more things but I am looking at the REST capabilities only.&lt;/p&gt; &lt;h4&gt;Marvel&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.elasticsearch.org/guide/en/marvel/current/&quot;&gt;Marvel&lt;/a&gt; is the commercial plugin by Elasticsearch (free for development purposes). Though it does lots of additional things, it contains the new version of Sense. Marvel will track lots of the state and interaction with Elasticsearch in a seperate index so be aware that it might store quite some data. Also of course you need to respect the license; when using it on a production system you need to pay.&lt;/p&gt; &lt;p&gt;The main Marvel dashboard, which is Kibana, is available at &lt;code&gt;http://localhost:9200/_plugin/marvel&lt;/code&gt;. Sense can be accessed directly using &lt;code&gt;http://localhost:9200/_plugin/marvel/sense/index.html&lt;/code&gt;.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-1pY6ybn5LI8/U60awQ6ngDI/AAAAAAAAAU8/pnHrBri00i4/s1600/marvel-sense.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-1pY6ybn5LI8/U60awQ6ngDI/AAAAAAAAAU8/pnHrBri00i4/s640/marvel-sense.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The Sense version of Marvel behaves exactly like the one you are used from the Chrome plugin. It has highlighting, autocompletion (even for new features), the history and the formatting.&lt;/p&gt; &lt;h4&gt;elasticsearch-head&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://mobz.github.io/elasticsearch-head/&quot;&gt;elasticsearch-head&lt;/a&gt; seems to be one of the oldest plugins available for Elasticsearch and it is recommended a lot. 
The main dashboard is available at &lt;code&gt;http://localhost:9200/_plugin/head/&lt;/code&gt; which contains the cluster overview.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-dsd8H_skgR8/U60bfpNYvQI/AAAAAAAAAVA/vuw-dBG9fJc/s1600/es-head-overview.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-dsd8H_skgR8/U60bfpNYvQI/AAAAAAAAAVA/vuw-dBG9fJc/s640/es-head-overview.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There is an interface for building queries at the Structured Query tab.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-oVAv24UJaU8/U60blATPUoI/AAAAAAAAAVI/D9SyUxRBO54/s1600/es-head-structured.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-oVAv24UJaU8/U60blATPUoI/AAAAAAAAAVI/D9SyUxRBO54/s640/es-head-structured.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;It lets you execute queries by selecting values from dropdown boxes and it can even detect fields that are available for the index and type. Results are displayed in a table. Unfortunately the values that can be selected are rather outdated. Instead of the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-match-query.html&quot;&gt;match query&lt;/a&gt; it still contains the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-text-query.html&quot;&gt;text query&lt;/a&gt; that has been deprecated since Elasticsearch 0.19.9 and is not available anymore with newer versions of Elasticsearch.&lt;/p&gt; &lt;p&gt;Another interface on the Any Request tab lets you execute custom requests.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-wuJQIFDdZ9Q/U60bppQt5XI/AAAAAAAAAVQ/87HncWoJTRo/s1600/es-head-any-request.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-wuJQIFDdZ9Q/U60bppQt5XI/AAAAAAAAAVQ/87HncWoJTRo/s640/es-head-any-request.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The text box that accepts the body has no highlighting and it is not possible to use tabs but errors will be displayed, the response is formatted, links are set and you do have the option to use a table or the JSON format for responses. 
The history lets you execute older queries.&lt;/p&gt; &lt;p&gt;There are other options like Result Transformer that sound interesting but I have never tried those.&lt;/p&gt; &lt;h4&gt;elasticsearch-kopf&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;https://github.com/lmenezes/elasticsearch-kopf&quot;&gt;elasticsearch-kopf&lt;/a&gt; is a clone of elasticsearch-head that also provides an interface to send arbitrary requests to Elasticsearch.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-vvMXWawNF2M/U60bwnO0IGI/AAAAAAAAAVY/aQwh4ABkGQE/s1600/rest.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-vvMXWawNF2M/U60bwnO0IGI/AAAAAAAAAVY/aQwh4ABkGQE/s640/rest.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;You can enter queries and let them be executed for you. There is a request history, you have highlighting and you can format the request document but unfortunately the interface is missing a autocompletion.&lt;/p&gt; &lt;p&gt;If you'd like to learn more about elasticsearch-kopf I have recently published &lt;a href=&quot;http://blog.florian-hopf.de/2014/06/elasticsearch-kopf.html&quot;&gt;a tour through its features&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Inquisitor&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;https://github.com/polyfractal/elasticsearch-inquisitor&quot;&gt;Inquisitor&lt;/a&gt; is a tool to help you understand Elasticsearch queries. Besides other options it allows you to execute search queries.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-LvObdXwYtDA/U60b1ftG9rI/AAAAAAAAAVg/h0YEpqGhQVg/s1600/inquisitor.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-LvObdXwYtDA/U60b1ftG9rI/AAAAAAAAAVg/h0YEpqGhQVg/s640/inquisitor.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Index and type can be chosen from the ones available in the cluster. There is no formatting in the query field, you can't even use tabs for indentation, but errors in your query are displayed in the panel on top of the results while typing. The response is displayed in a table, matching fields are automatically highlighted. Because of the limited possibilites when entering text the plugin seems to be more useful when it comes to the analyzing part or for pasting existing queries&lt;/p&gt; &lt;h4&gt;Elastic-Hammer&lt;/h4&gt; &lt;p&gt;Andrew Cholakian, the author of &lt;a href=&quot;http://exploringelasticsearch.com/&quot;&gt;Exploring Elasticsearch&lt;/a&gt;, has published another query tool, &lt;a href=&quot;https://github.com/andrewvc/elastic-hammer&quot;&gt;Elastic-Hammer&lt;/a&gt;. 
It can either be installed locally or used as an &lt;a href=&quot;http://elastichammer.exploringelasticsearch.com/&quot;&gt;online version&lt;/a&gt; directly.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-2d30K-dxrjE/U60b6KDOHvI/AAAAAAAAAVo/eVoJIU_Rh64/s1600/elastic-hammer.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-2d30K-dxrjE/U60b6KDOHvI/AAAAAAAAAVo/eVoJIU_Rh64/s640/elastic-hammer.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;It is a quite useful query tool that will display syntactic errors in your query and format images and links in a pretty response. It even offers autocompletion though not as elaborated as the one Sense and Marvel are providing: It will display any allowed term, no matter the context. So you can't really see which terms currently are allowed but only that the term is allowed at all. Nevertheless this can be useful. Searches can also be saved in local storage and executed again.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Currently none of the free and open source plugins seems to provide an interface that is as good as the one contained in Sense and Marvel. As Marvel is free for development you can still use but you need to install it in the instances again. Sense was more convenient and easier to start but I guess one can get along with Marvel the same way. &lt;/p&gt; &lt;p&gt;Finally I wouldn't be surprised if someone from the very active Elasticsearch community comes up with another tool that can take the place of Sense again.&lt;/p&gt;  </content>
 </entry>
 
 <entry>
   <title>An Alternative to the Twitter River - Index Tweets in Elasticsearch with Logstash</title>
   <link href="http://blog.florian-hopf.de/2014/06/an-alternative-to-twitter-river-index.html"/>
   <updated>2014-06-20T15:46:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/06/an-alternative-to-twitter-river-index</id>
   <content type="html">&lt;p&gt;For some time now I've been using the &lt;a href=&quot;https://github.com/elasticsearch/elasticsearch-river-twitter/&quot;&gt;Elasticsearch Twitter river&lt;/a&gt; for &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/simple-event-analytics-with.html&quot;&gt;streaming conference tweets to Elasticsearch&lt;/a&gt;. The river runs on an Elasticsearch node, tracks the Twitter streaming API for keywords and directly indexes the documents in Elasticsearch. As the rivers are about to be deprecated it is time to move on to the recommended replacement: &lt;a href=&quot;http://logstash.net/&quot;&gt;Logstash&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;With Logstash the retrieval of the Twitter data is executed in a different process, probably even on a different machine. This helps in scaling Logstash and Elasticsearch seperately.&lt;/p&gt; &lt;h4&gt;Installation&lt;/h4&gt; &lt;p&gt;The installation of Logstash is nearly as easy as the one for Elasticsearch though you can't start it without a configuration that tells it what you want it to do. You can &lt;a href=&quot;http://www.elasticsearch.org/overview/elkdownloads/&quot;&gt;download it&lt;/a&gt;, unpack the archive and there are scripts to start it. If you are fine with using the embedded Elasticsearch instance you don't even need to install this separately. But you need to have a configuration file in place that tells Logstash what to do exactly.&lt;/p&gt; &lt;h4&gt;Configuration&lt;/h4&gt; &lt;p&gt;The &lt;a href=&quot;http://logstash.net/docs/1.4.1/configuration&quot;&gt;configuration for Logstash&lt;/a&gt; normally consists of three sections: The input, optional filters and the output section. There is a &lt;a href=&quot;http://logstash.net/docs/1.4.1/&quot;&gt;multitude of existing components for each of those available&lt;/a&gt;. The structure of a config file looks like this (taken from the documentation):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;# This is a comment. You should use comments to describe&lt;br /&gt;# parts of your configuration.&lt;br /&gt;input {&lt;br /&gt;  ...&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;filter {&lt;br /&gt;  ...&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output {&lt;br /&gt;  ...&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We are using the &lt;a href=&quot;http://logstash.net/docs/1.4.1/inputs/twitter&quot;&gt;Twitter input&lt;/a&gt;, the &lt;a href=&quot;http://logstash.net/docs/1.4.1/outputs/elasticsearch_http&quot;&gt;elasticsearch_http output&lt;/a&gt; and no filters.&lt;/p&gt; &lt;h4&gt;Twitter&lt;/h4&gt; &lt;p&gt;As with any Twitter API interaction you need to have an account and configure the access tokens.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;input {&lt;br /&gt;    twitter {&lt;br /&gt;        # add your data&lt;br /&gt;        consumer_key =&gt; &quot;&quot;&lt;br /&gt;        consumer_secret =&gt; &quot;&quot;&lt;br /&gt;        oauth_token =&gt; &quot;&quot;&lt;br /&gt;        oauth_token_secret =&gt; &quot;&quot;&lt;br /&gt;        keywords =&gt; [&quot;elasticsearch&quot;]&lt;br /&gt;        full_tweet =&gt; true&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You need to pass in all the credentials as well as the &lt;code&gt;keywords&lt;/code&gt; to track. 
By enabling the &lt;code&gt;full_tweet&lt;/code&gt; option you can index a lot more data, by default there are only a few fields and interesting information like hashtags or mentions are missing.&lt;/p&gt; &lt;p&gt;The Twitter river seems to have different names than the ones that are sent with the raw tweets so it doesn't seem to be possible to easily index Twitter logstash data along with data created by the Twitter river. But it should be no big deal to change the Logstash field names as well with a filter.&lt;/p&gt; &lt;h4&gt;Elasticsearch&lt;/h4&gt; &lt;p&gt;There are three plugins that are providing an output to Elasticsearch: &lt;a href=&quot;http://logstash.net/docs/1.4.1/outputs/elasticsearch&quot;&gt;elasticsearch&lt;/a&gt;, &lt;a href=&quot;http://logstash.net/docs/1.4.1/outputs/elasticsearch_http&quot;&gt;elasticsearch_http&lt;/a&gt; and &lt;a href=&quot;http://logstash.net/docs/1.4.1/outputs/elasticsearch_river&quot;&gt;elasticsearch_river&lt;/a&gt;. elasticsearch provides the opportunity to bind to an Elasticsearch cluster as a node or via transport, elasticsearch_http uses the HTTP API and elasticsearch_river communicates via the RabbitMQ river. The http version lets you use different Elasticsearch versions for Logstash and Elasticsearch, this is the one I am using. Note that the elasticsearch plugin also provides an option for setting the protocol to http that also seems to work.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;output {&lt;br /&gt;    elasticsearch_http {&lt;br /&gt;        host =&gt; &quot;localhost&quot;&lt;br /&gt;        index =&gt; &quot;conf&quot;&lt;br /&gt;        index_type =&gt; &quot;tweet&quot;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;In contrast to the Twitter river the Logstash plugin does not create a special mapping for the tweets. I didn't go through all the fields but for example the coordinates don't seem to be mapped correctly to &lt;code&gt;geo_point&lt;/code&gt; and some fields are analyzed that probably shouldn't be (urls, usernames). If you are using those you might want to prepare your index by supplying it with a custom mapping.&lt;/p&gt; &lt;p&gt;By default tweets will be pushed to Elasticsearch every second which should be enough for any analysis. You can even think about reducing this with the property &lt;code&gt;idle_flush_time&lt;/code&gt;.&lt;/p&gt; &lt;h4&gt;Running&lt;/h4&gt; &lt;p&gt;Finally, when all of the configuration is in place you can execute Logstash using the following command (assuming the configuration is in a file twitter.conf):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/logstash agent -f twitter.conf&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Nothing left to do but wait for the first tweets to arrive in your local instance at http://localhost:9200/conf/tweet/_search?q=*:*&amp;pretty=true.&lt;/p&gt; &lt;p&gt;For the future it would be really useful to prepare a mapping for the fields and a filter that removes some of the unused data. For now you have to check what you would like to use of the data and prepare a mapping in advance.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>A Tour Through elasticsearch-kopf</title>
   <link href="http://blog.florian-hopf.de/2014/06/elasticsearch-kopf.html"/>
   <updated>2014-06-13T13:49:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/06/elasticsearch-kopf</id>
   <content type="html">&lt;p&gt;When I needed a plugin to display the cluster state of Elasticsearch or needed some insight into the indices I normally reached for the classic plugin &lt;a href=&quot;http://mobz.github.io/elasticsearch-head/&quot;&gt;elasticsearch-head&lt;/a&gt;. As it is recommended a lot and seems to be the unofficial successor I recently took a more detailed look at &lt;a href=&quot;https://github.com/lmenezes/elasticsearch-kopf&quot;&gt;elasticsearch-kopf&lt;/a&gt;. And I like it.&lt;/p&gt; &lt;p&gt;I am not sure about why elasticsearch-kopf came into existence but it seems to be a clone of elasticsearch-head (kopf means head in German so it is even the same name).&lt;/p&gt; &lt;h4&gt;Installation&lt;/h4&gt; &lt;p&gt;elasticsearch-kopf can be installed like most of the plugins, using the script in the Elasticsearch installation. This is the command that installs the version 1.1 which is suitable for the 1.1.x branch of Elasticsearch.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/plugin --install lmenezes/elasticsearch-kopf/1.1&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;elasticsearch-kopf is then available on the url http://localhost:9200/_plugin/kopf/.&lt;/p&gt; &lt;h4&gt;Cluster&lt;/h4&gt; &lt;p&gt;On the front page you will see a similar diagram of what elasticsearch-head is providing. The overview of your cluster with all the shards and the distribution across the nodes. The page is being refreshed so you will see joining or leaving nodes immediately. You can adjust the refresh rate in the settings dropdown just next to the kopf logo (by the way, the header reflects the state of the cluster so it might change its color from green to yellow to red).&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-3Mndb2do-mE/U5qLoTRpeaI/AAAAAAAAATU/k0iqK3196Dg/s1600/overview.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-3Mndb2do-mE/U5qLoTRpeaI/AAAAAAAAATU/k0iqK3196Dg/s640/overview.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Also, there are lots of different settings that can be reached via this page. On top of the node list there are 4 icons for creating a new index, deactivating shard allocation, for the cluster settings and the cluster diagnosis options.&lt;/p&gt; &lt;p&gt;Creating a new index brings up a form for entering the index data. You can also load the settings from an existing index or just paste the settings json in the field on the right side.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-VCHHDwnH258/U5qL5gVo1wI/AAAAAAAAATc/s7bBnh5wi_8/s1600/create-index.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-VCHHDwnH258/U5qL5gVo1wI/AAAAAAAAATc/s7bBnh5wi_8/s640/create-index.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The icon for disabling the shard allocation just toggles it, disabling the shard allocation can be useful during a cluster restart. Using the cluster settings you can reach a form where you can adjust lots of values regarding your cluster, the routing and recovery. The cluster health button finally lets you load different json documents containing more details on the cluster health, e.g. 
the nodes stats and the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html&quot;&gt;hot threads&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Using the little dropdown just next to the index name you can execute some operations on the index. You can view the settings, &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html&quot;&gt;open and close the index&lt;/a&gt;, optimize and refresh the index, &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-clearcache.html&quot;&gt;clear the caches&lt;/a&gt;, adjust the settings or delete the index.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-KFozXI9Gv3w/U5qNVyT6GuI/AAAAAAAAATw/DWdbCHfDkv8/s1600/index-menu.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-KFozXI9Gv3w/U5qNVyT6GuI/AAAAAAAAATw/DWdbCHfDkv8/s640/index-menu.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;When opening the form for the index settings you will be overwhelmed at first. I didn't know there are so many settings. What is really useful is that there is an info icon next to each field that will tell you what this field is about. A great opportunity to learn about some of the settings.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-En7RNPV2-OA/U5qNPlxukaI/AAAAAAAAATo/ATiqYIuPr04/s1600/index-settings.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-En7RNPV2-OA/U5qNPlxukaI/AAAAAAAAATo/ATiqYIuPr04/s640/index-settings.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;What I find really useful is that you can adjust the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-slowlog.html&quot;&gt;slow index log&lt;/a&gt; settings directly. The slow log can also be used to log any incoming queries so it is sometimes useful for diagnostic purposes.&lt;/p&gt; &lt;p&gt;Finally, back on the cluster page, you can get more detailed information on the nodes or shards when clicking on them. This will open a lightbox with more details.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-rsyqDTEzSuc/U5qNlj1kNTI/AAAAAAAAAT4/E5KWJjE5wbs/s1600/shard-info.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-rsyqDTEzSuc/U5qNlj1kNTI/AAAAAAAAAT4/E5KWJjE5wbs/s640/shard-info.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;REST&lt;/h4&gt; &lt;p&gt;The rest menu entry on top brings you to another view which is similar to the one &lt;a href=&quot;https://www.found.no/foundation/Sense-Elasticsearch-interface/&quot;&gt;Sense&lt;/a&gt; provided. You can enter queries and let them be executed for you. There is a request history, you have highlighting and you can format the request document but unfortunately the interface is missing the autocompletion. 
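&lt;/p&gt; &lt;p&gt;To give an idea of how the form is used, this is a minimal request you could paste there (an example I made up for illustration, any endpoint and query body will do):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;POST /_search&lt;br /&gt;{&lt;br /&gt;    &quot;query&quot;: {&lt;br /&gt;        &quot;match_all&quot;: {}&lt;br /&gt;    },&lt;br /&gt;    &quot;size&quot;: 5&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;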
Nevertheless I suppose this can be useful if you don't like to fiddle with curl.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-Z4r-ZQ-vmCI/U5qNqwIz8NI/AAAAAAAAAUA/lgxaLBeAInI/s1600/rest.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-Z4r-ZQ-vmCI/U5qNqwIz8NI/AAAAAAAAAUA/lgxaLBeAInI/s640/rest.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Aliases&lt;/h4&gt; &lt;p&gt;Using the aliases tab you can have a convenient form for managing your &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html&quot;&gt;index aliases&lt;/a&gt; and all the relevant additional information. You can add filter queries for your alias or influence the index or search routing. On the right side you can see the existing aliases and remove them if not needed.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-og-_QSoix2A/U5qOBCP6rUI/AAAAAAAAAUI/GyEiL83ZZLs/s1600/alias.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-og-_QSoix2A/U5qOBCP6rUI/AAAAAAAAAUI/GyEiL83ZZLs/s640/alias.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Analysis&lt;/h4&gt; &lt;p&gt;The analysis tab will bring you to a feature that is also very popular for the Solr administration view. You can test the analyzers for different values and different fields. This is a very valuable tool while building a more complex search application.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-NCNNuZvdhjo/U5qOFHsGgnI/AAAAAAAAAUQ/RSU2uUx8KH4/s1600/analysis.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-NCNNuZvdhjo/U5qOFHsGgnI/AAAAAAAAAUQ/RSU2uUx8KH4/s640/analysis.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Unfortunately the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html&quot;&gt;information you can get from Elasticsearch&lt;/a&gt; is not as detailed as the one you can get from Solr: It will only contain the end result so you can't really see which tokenizer or filter caused a certain change.&lt;/p&gt; &lt;h4&gt;Percolator&lt;/h4&gt; &lt;p&gt;On the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html&quot;&gt;percolator&lt;/a&gt; tab you can use a form to register new percolator queries and view existing ones. 
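&lt;/p&gt; &lt;p&gt;Behind the scenes a percolator query is just a document containing a query that is indexed into the special .percolator type of an index; registered via the API it would look roughly like this sketch (index name and query made up, Elasticsearch 1.x API):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT &quot;http://localhost:9200/conf/.percolator/elasticsearch-tweets&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;query&quot;: {&lt;br /&gt;        &quot;match&quot;: {&lt;br /&gt;            &quot;text&quot;: &quot;elasticsearch&quot;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;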
There doesn't seem to be a way to do the actual percolation but maybe this page can be useful for using the percolator extensively.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-Ql1pxyWWMio/U5qOKKDSdcI/AAAAAAAAAUY/XZ7dfc4hbsc/s1600/percolator.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-Ql1pxyWWMio/U5qOKKDSdcI/AAAAAAAAAUY/XZ7dfc4hbsc/s640/percolator.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Warmers&lt;/h4&gt; &lt;p&gt;The warmers tab can be used to register &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-warmers.html&quot;&gt;index warmer queries&lt;/a&gt;. &lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-QB56X-zEFTA/U5qOOFpjuHI/AAAAAAAAAUg/P6N3XbLomMg/s1600/warmers.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-QB56X-zEFTA/U5qOOFpjuHI/AAAAAAAAAUg/P6N3XbLomMg/s640/warmers.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Repository&lt;/h4&gt; &lt;p&gt;The final tab is for the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html&quot;&gt;snapshot and restore&lt;/a&gt; feature. You can create repositories and snapshots and restore them. Though I can imagine that most of the people are automating the snapshot creation this can be a very useful form. &lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-B6m6nyGyca4/U5qOS-91fDI/AAAAAAAAAUo/Nc8cwoEyM84/s1600/snapshots.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-B6m6nyGyca4/U5qOS-91fDI/AAAAAAAAAUo/Nc8cwoEyM84/s640/snapshots.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;I hope you could see in this post that elasticsearch-kopf can be really useful. It is very unlikely that you will ever need all of the forms but it is good to have them available. The cluster view and the rest interface can be very valuable for your daily work and I guess there will be new features coming in the future.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>See Your Solr Cache Sizes: Eclipse Memory Analyzer </title>
   <link href="http://blog.florian-hopf.de/2014/05/solr-cache-sizes-eclipse-memory-analyzer.html"/>
   <updated>2014-05-09T15:44:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/05/solr-cache-sizes-eclipse-memory-analyzer</id>
   <content type="html">&lt;p&gt;Solr uses different &lt;a href=&quot;http://wiki.apache.org/solr/SolrCaching&quot;&gt;caches&lt;/a&gt; to prevent too much IO access and calculations during requests. When indexing doesn't happen too frequently you can get huge performance gains by employing those caches. Depending on the structure of your index data and the size of the caches they can become rather large and use a substantial part of your heap memory. In this post I would like to show how you can use the &lt;a href=&quot;http://www.eclipse.org/mat/&quot;&gt;Eclipse Memory Analyzer&lt;/a&gt; to see how much space your caches are really using in memory.&lt;/p&gt; &lt;h4&gt;Configuring the Caches&lt;/h4&gt; &lt;p&gt;All the Solr caches can be configured in solrconfig.xml in the &lt;code&gt;query&lt;/code&gt; section. You will find definitions like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;filterCache class=&amp;quot;solr.FastLRUCache&amp;quot;&lt;br /&gt;  size=&amp;quot;8000&amp;quot;&lt;br /&gt;  initialSize=&amp;quot;512&amp;quot;&lt;br /&gt;  autowarmCount=&amp;quot;0&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This is an example of a filter cache configured to use the &lt;code&gt;FastLRUCache&lt;/code&gt;, a maximum size of 8000 items and no autowarming. Solr ships with two commonly used cache implementations, the &lt;code&gt;FastLRUCache&lt;/code&gt;, that uses a &lt;code&gt;ConcurrentHashMap&lt;/code&gt; and the &lt;code&gt;LRUCache&lt;/code&gt;, that synchronizes the calls. Some of the caches are still configured to use the &lt;code&gt;LRUCache&lt;/code&gt; but on some read heavy projects I had good results with changing those to &lt;code&gt;FastLRUCache&lt;/code&gt; as well.&lt;/p&gt; &lt;p&gt;Additionaly, starting from Solr 3.6 there is also the &lt;a href=&quot;http://lucene.apache.org/solr/4_8_0/solr-core/org/apache/solr/search/LFUCache.html&quot;&gt;LFUCache&lt;/a&gt;. I have never used it and it is still marked as experimental and subject to change.&lt;/p&gt; &lt;p&gt;Solr comes with the following caches:&lt;/p&gt; &lt;dl&gt;&lt;dt&gt;FilterCache&lt;/dt&gt;&lt;dd&gt;Caches a bitset of the filter queries. This can be a very effective cache if you are reusing filters.&lt;/dd&gt;&lt;dt&gt;QueryResultCache&lt;/dt&gt;&lt;dd&gt;Stores an ordered list of the document ids for a query.&lt;/dd&gt;&lt;dt&gt;DocumentCache&lt;/dt&gt;&lt;dd&gt;Caches the stored fields of the Lucene documents. If you have large or many fields this cache can become rather large.&lt;/dd&gt;&lt;dt&gt;FieldValueCache&lt;/dt&gt;&lt;dd&gt;A cache that is mainly used for faceting.&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;Additionaly you will see references to the FieldCache which is not a cache managed by Solr and can not be configured.&lt;/p&gt; &lt;p&gt;In the default configuration Solr only caches 512 items per cache which can often be too small. You can see the usage of your cache in the administration view of Solr in the section Plugin/Stats/Caches of your core. 
This will tell you the hit rate as well as the evictions for your caches.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-m15MTpIHyN4/U2yBMzFfwPI/AAAAAAAAASY/P-7A3iohncg/s1600/admin.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-m15MTpIHyN4/U2yBMzFfwPI/AAAAAAAAASY/P-7A3iohncg/s640/admin.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The stats are a good starting point for tuning your caches but you should be aware that by setting the size too large you can see some unwanted GC activity. That is why it might be useful to look at the real size of your caches in memory instead of the item count alone.&lt;/p&gt; &lt;h4&gt;Eclipse MAT&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.eclipse.org/mat/&quot;&gt;Eclipse MAT&lt;/a&gt; is a great tool for looking at your heap in memory and seeing which objects occupy the space. As the name implies it is based on Eclipse and can either be downloaded as a standalone tool or is available via update sites for integration in an existing instance.&lt;/p&gt; &lt;p&gt;Heap dumps can be acquired using the tool directly but you can also open existing dumps. On opening it will automatically calculate a chart of the largest objects that might already contain some of the cache objects, if you are keeping lots of items in the cache.&lt;/p&gt;   &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-qgok7O_GzM0/U2yBad5TkjI/AAAAAAAAASg/k-XFxe5ySn8/s1600/biggest-objects.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-qgok7O_GzM0/U2yBad5TkjI/AAAAAAAAASg/k-XFxe5ySn8/s640/biggest-objects.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Using the links below the pie chart you can also open further automatic reports, e.g. the Top Consumers, a more detailed page on large objects.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-2mjG3RfsGmY/U2yBfOFEL2I/AAAAAAAAASo/_w8LjAInJ2A/s1600/top-consumers.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-2mjG3RfsGmY/U2yBfOFEL2I/AAAAAAAAASo/_w8LjAInJ2A/s640/top-consumers.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Even if you do see some of the cache classes here, you can't really see which of the caches it is that consumes the memory. Using the Query Browser menu on top of the report you can also list instances of classes directly, no matter how large those are.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-mVEw4y5fB1w/U2yB1fkQ1hI/AAAAAAAAAS4/qSnl2H9NjbE/s1600/menu.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-mVEw4y5fB1w/U2yB1fkQ1hI/AAAAAAAAAS4/qSnl2H9NjbE/s640/menu.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;We choose List Objects with outgoing references and enter the class name for the FastLRUCache, &lt;code&gt;org.apache.solr.search.FastLRUCache&lt;/code&gt;. For the default configuration you will see two instances. 
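&lt;/p&gt; &lt;p&gt;If you prefer typing over clicking through the menu, as far as I know the same list can also be produced in MAT's OQL console with a query like this (just a sketch, only the class name matters here):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;SELECT * FROM org.apache.solr.search.FastLRUCache&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;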
When clicking on one of the instances you can see the name of the cache in the lower left window, in this case the filter cache.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-cPCSzrCqvTc/U2yBlOQf1vI/AAAAAAAAASw/zYMGgXur66k/s1600/FastLRUCaches.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-cPCSzrCqvTc/U2yBlOQf1vI/AAAAAAAAASw/zYMGgXur66k/s640/FastLRUCaches.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There are two numbers available for the heap size: The &lt;a href=&quot;http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html&quot;&gt;shallow size and the retained size&lt;/a&gt;. When looking at the caches we are interested in the retained size as this is the size that would be available when the instance is garbage collected, i.e. the size of the cache that is only used by the cache. In our case this is around 700kB but this can grow a lot.&lt;/p&gt; &lt;p&gt;You can also do the same inspection for the &lt;code&gt;org.apache.solr.search.LRUCache&lt;/code&gt; to see the real size of your caches.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;The caches can get a lot bigger than in our example here. Eclipse Memory Analyzer has helped me a lot already to see if there are any problems with a heap that is growing too large.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>What Is Special About This? Significant Terms in Elasticseach</title>
   <link href="http://blog.florian-hopf.de/2014/04/significant-terms-elasticsearch.html"/>
   <updated>2014-04-30T10:32:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/04/significant-terms-elasticsearch</id>
   <content type="html">&lt;p&gt;I have been using Elasticsearch a few times now for doing &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/kibana-and-elasticsearch-see-what.html&quot;&gt;analytics of twitter data for conferences&lt;/a&gt;. Popular hashtags and mentions that can be extraced using facets can show what is hot at a conference. But you can go even further and see what makes each hashtag special. In this post I would like to show you the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html&quot;&gt;significant terms aggregation&lt;/a&gt; that is available with Elasticsearch 1.1. I am using the &lt;a href=&quot;http://blog.florian-hopf.de/2013/11/devoxx-in-tweets.html&quot;&gt;tweets of last years Devoxx&lt;/a&gt; as those contain enough documents to play around.&lt;/p&gt; &lt;h4&gt;Aggregations&lt;/h4&gt; &lt;p&gt;Elasticsearch 1.0 introduced &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html&quot;&gt;aggregations&lt;/a&gt;, that can be used similar to facets but are far more powerful. To see why those are useful let's take a step back and look at facets, that are often used to extract statistical values and distributions. One useful example for facets is the total count of a hashtag:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;size&quot;: 0,  &lt;br /&gt;    &quot;facets&quot;: {&lt;br /&gt;       &quot;hashtags&quot;: {&lt;br /&gt;          &quot;terms&quot;: {&lt;br /&gt;             &quot;field&quot;: &quot;hashtag.text&quot;,&lt;br /&gt;             &quot;size&quot;: 10,&lt;br /&gt;             &quot;exclude&quot;: [&lt;br /&gt;                &quot;devoxx&quot;, &quot;dv13&quot;&lt;br /&gt;             ]&lt;br /&gt;          }&lt;br /&gt;       }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We request a facet called hashtags that uses the terms of hashtag.text and returns the 10 top values with the counts. We are excluding the hashtags devoxx and dv13 as those are very frequent. This is an excerpt of the result with the popular hashtags:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;   &quot;facets&quot;: {&lt;br /&gt;      &quot;hashtags&quot;: {&lt;br /&gt;         &quot;_type&quot;: &quot;terms&quot;,&lt;br /&gt;         &quot;missing&quot;: 0,&lt;br /&gt;         &quot;total&quot;: 19219,&lt;br /&gt;         &quot;other&quot;: 17908,&lt;br /&gt;         &quot;terms&quot;: [&lt;br /&gt;            {&lt;br /&gt;               &quot;term&quot;: &quot;dartlang&quot;,&lt;br /&gt;               &quot;count&quot;: 229&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;term&quot;: &quot;java&quot;,&lt;br /&gt;               &quot;count&quot;: 216&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;term&quot;: &quot;android&quot;,&lt;br /&gt;               &quot;count&quot;: 139&lt;br /&gt;            },&lt;br /&gt;    [...]&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Besides the statistical information we are retrieving here facets are often used for offering a refinement on search results. 
A common use is to display categories or features of products on eCommerce sites for example.&lt;/p&gt; &lt;p&gt;Starting with Elasticsearch 1.0 you can have the same behaviour by using one of the new aggregations, in this case a &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html&quot;&gt;terms aggregation&lt;/a&gt;:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;size&quot; : 0,&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;hashtags&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;hashtag.text&quot;, &lt;br /&gt;                &quot;exclude&quot; : &quot;devoxx|dv13&quot;&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Instead of requesting facets we are now requesting a terms aggregation for the field hashtag.text. The exclusion is now based on a regular expression instead of a list. The result looks similar to the facet return values:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;   &quot;aggregations&quot;: {&lt;br /&gt;      &quot;hashtags&quot;: {&lt;br /&gt;         &quot;buckets&quot;: [&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;dartlang&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 229&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;java&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 216&lt;br /&gt;            },&lt;br /&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;android&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 139&lt;br /&gt;            },&lt;br /&gt;    [...]&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Each value forms a so called bucket that contains a key and a doc_count.&lt;/p&gt; &lt;p&gt;But aggregations not only are a replacement for facets. Multiple aggregations can be combined to give more information on the distribution of different fields. For example we can see the users that used a certain hashtag by adding a second terms aggregation for the field user.screen_name:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;size&quot; : 0,&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;hashtags&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;hashtag.text&quot;, &lt;br /&gt;                &quot;exclude&quot; : &quot;devoxx|dv13&quot;&lt;br /&gt;            },&lt;br /&gt;            &quot;aggs&quot; : {&lt;br /&gt;                &quot;hashtagusers&quot; : {&lt;br /&gt;                    &quot;terms&quot; : {&lt;br /&gt;                        &quot;field&quot; : &quot;user.screen_name&quot;&lt;br /&gt;                    }&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Using this nested aggregation we now get a list of buckets for each hashtag. This list contains the users that used the hashtag. 
This is a short excerpt for the #scala hashtag:&lt;/p&gt; &lt;pre&gt;&lt;code&gt; &lt;br /&gt;               &quot;key&quot;: &quot;scala&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 130,&lt;br /&gt;               &quot;hashtagusers&quot;: {&lt;br /&gt;                  &quot;buckets&quot;: [&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;jaceklaskowski&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 74&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;ManningBooks&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 3&lt;br /&gt;                     },&lt;br /&gt;    [...]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can see that there is one user that is responsible for half of the hashtags. A very dedicated user.&lt;/p&gt; &lt;p&gt;Using aggregations we can get information that we were not able to get with facets alone. If you are interested in more details about aggregations in general or the metrics aggregations I haven't touched here, &lt;a href=&quot;http://chrissimpson.co.uk/elasticsearch-aggregations-overview.html&quot;&gt;Chris Simpson has written a nice post on the feature&lt;/a&gt;, there is &lt;a href=&quot;https://www.found.no/foundation/elasticsearch-aggregations/&quot;&gt;a nice visual one at the Found blog&lt;/a&gt;, &lt;a href=&quot;http://blog.qbox.io/elasticsearch-aggregations&quot;&gt;another one here&lt;/a&gt; and of course there is &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html&quot;&gt;the official documentation on the Elasticsearch website&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Significant Terms&lt;/h4&gt; &lt;p&gt;Elasticsearch 1.1 contains a new aggregation, the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html&quot;&gt;significant terms aggregation&lt;/a&gt;. It allows you to do something very useful: For each bucket that is created you can see the terms that make this bucket special.&lt;/p&gt; &lt;p&gt;Significant terms are calculated by comparing a foreground frequency (which is the frequency of the bucket you are interested in) with a background frequency (which for Elasticsearch 1.1 always is the frequency of the complete index). This means it will collect any results that have a high frequency for the current bucket but not for the complete index.&lt;/p&gt; &lt;p&gt;For our example we can now check for the hashtags that are often used with a certain mention. This is not the same as what can be done with the terms aggregation. The significant terms will only return those terms that are occurring often for a certain user but not as frequently for all users. 
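To illustrate with made-up numbers: if a hashtag appears in 10 of 100 tweets mentioning a user (10%) but only in 50 of 20,000 tweets overall (0.25%), the large gap between foreground and background frequency makes it a significant term for that user, even though its absolute count is small. 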
This is what &lt;a href=&quot;http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common&quot;&gt;Mark Harwood calls the &lt;q&gt;uncommonly common&lt;/q&gt;&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/tweet/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;size&quot; : 0,&lt;br /&gt;    &quot;aggs&quot; : {&lt;br /&gt;        &quot;mentions&quot; : {&lt;br /&gt;            &quot;terms&quot; : { &lt;br /&gt;                &quot;field&quot; : &quot;mention.screen_name&quot; &lt;br /&gt;            },&lt;br /&gt;            &quot;aggs&quot; : {&lt;br /&gt;                &quot;uncommonhashtags&quot; : {&lt;br /&gt;                    &quot;significant_terms&quot; : {&lt;br /&gt;                        &quot;field&quot; : &quot;hashtag.text&quot;&lt;br /&gt;                    }&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We request a normal terms aggregation for the mentioned users. Using a nested significant_terms aggregation we can see any hashtags that are often used with the mentioned user but not so often in the whole index. This is a snippet for the account of Brian Goetz:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;            {&lt;br /&gt;               &quot;key&quot;: &quot;BrianGoetz&quot;,&lt;br /&gt;               &quot;doc_count&quot;: 173,&lt;br /&gt;               &quot;uncommonhashtags&quot;: {&lt;br /&gt;                  &quot;doc_count&quot;: 173,&lt;br /&gt;                  &quot;buckets&quot;: [&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;lambda&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 13,&lt;br /&gt;                        &quot;score&quot;: 1.8852860861614915,&lt;br /&gt;                        &quot;bg_count&quot;: 33&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;jdk8&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 8,&lt;br /&gt;                        &quot;score&quot;: 0.7193691737111163,&lt;br /&gt;                        &quot;bg_count&quot;: 32&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;java&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 21,&lt;br /&gt;                        &quot;score&quot;: 0.6601749139630457,&lt;br /&gt;                        &quot;bg_count&quot;: 216&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;performance&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 4,&lt;br /&gt;                        &quot;score&quot;: 0.6574225667412876,&lt;br /&gt;                        &quot;bg_count&quot;: 9&lt;br /&gt;                     },&lt;br /&gt;                     {&lt;br /&gt;                        &quot;key&quot;: &quot;keynote&quot;,&lt;br /&gt;                        &quot;doc_count&quot;: 9,&lt;br /&gt;                        &quot;score&quot;: 0.5442707998673785,&lt;br /&gt;                        &quot;bg_count&quot;: 52&lt;br /&gt;                     },&lt;br /&gt;        [...]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can see that there are some tags that are targeted a lot at the keynote by Brian Goetz and are not that common for the whole index.&lt;/p&gt; &lt;p&gt;Some more ideas what we could look at with 
the significant terms aggregation:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;Find users that are using a hashtag a lot.&lt;/li&gt;&lt;li&gt;Find terms that are often used with a certain hashtag.&lt;/li&gt;&lt;li&gt;Find terms that are used by a certain user.&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;Besides these impressive analytics features, significant terms can also be used for search applications. A useful example is given in the Elasticsearch documentation itself: If a user searches for &quot;bird flu&quot;, automatically display a link to a search for H5N1, which should be very common in the result documents but not in the whole of the corpus.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;With significant terms Elasticsearch has again added a feature that might very well offer surprising new applications and use cases for search. Not only is it important for analytics but it can also be used to improve classic search applications. Mark Harwood has collected some &lt;a href=&quot;http://www.elasticsearch.org/blog/significant-terms-aggregation&quot;&gt;really interesting use cases on the Elasticsearch blog&lt;/a&gt;. If you'd like to read another post on the topic you can see &lt;a href=&quot;http://blog.qbox.io/elasticsearch-aggregations-1-1-0&quot;&gt;this post at the QBox blog that introduces significant terms as well as the percentile and cardinality aggregations&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>The Absolute Basics of Indexing Data</title>
   <link href="http://blog.florian-hopf.de/2014/04/the-absolute-basics-of-indexing-data.html"/>
   <updated>2014-04-11T14:34:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/04/the-absolute-basics-of-indexing-data</id>
   <content type="html">&lt;p&gt;Ever wondered how a search engine works? In this post I would like to show you a high level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won't go into any technical details, what I am describing here holds true for any &lt;a href=&quot;http://lucene.apache.org&quot;&gt;Lucene&lt;/a&gt; based search engine, be it Lucene itself, &lt;a href=&quot;http://lucene.apache.org/solr&quot;&gt;Solr&lt;/a&gt; or &lt;a href=&quot;http://elasticsearch.org&quot;&gt;Elasticsearch&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Input&lt;/h4&gt; &lt;p&gt;Normally a search engine is agnostic to the real data source of indexing data. Most often you push data into it via an API that already needs to be in the expected format, mostly Strings and data types like integers. It doesn't matter if this data originally resides in a document in the filesystem, on a website or in a database.&lt;/p&gt; &lt;p&gt;Search engines are working with documents that consist of fields and values. Though not always used directly you can think of documents as JSON documents. For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors. This would be two example documents:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;{&lt;br /&gt;    &quot;title&quot; : &quot;Search Patterns&quot;,&lt;br /&gt;    &quot;authors&quot; : [ &quot;Morville&quot;, &quot;Callender&quot; ],&lt;br /&gt;}&lt;br /&gt;{&lt;br /&gt;    &quot;title&quot; : &quot;Apache Solr Enterprise Search Server&quot;,&lt;br /&gt;    &quot;authors&quot; : [ &quot;Smiley&quot;, &quot;Pugh&quot; ]&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Even though the structure of both documents is the same in our case, the format of the document doesn't need to be fixed. Both documents could have totally different attributes, nevertheless both could be stored in the same index. In reality you will try to keep the documents similar, after all, you need a way to handle the documents in your application.&lt;/p&gt; &lt;p&gt;Lucene itself doesn't even have the concept of a key. But of course you need a key to identify your documents when updating them. Both Solr and Elasticsearch have ids that can either be chosen by the application or be autogenerated.&lt;/p&gt; &lt;h4&gt;Analyzing&lt;/h4&gt; &lt;p&gt;For every field that is indexed a special process called analyzing is employed. What it does can differ from field to field. For example, in a simple case it might just split the terms on whitespace and remove any punctuation so &lt;em&gt;Search Patterns&lt;/em&gt; would become two terms, &lt;em&gt;Search&lt;/em&gt; and &lt;em&gt;Patterns&lt;/em&gt;.&lt;/p&gt; &lt;h4&gt;Index Structure&lt;/h4&gt; &lt;p&gt;An inverted index, the structure search engines are using, is similar to a map that contains search terms as key and a reference to a document as value. This way the process of searching is just a lookup of the term in the index, a very fast process. 
Those might be the terms that are indexed for our example documents.&lt;/p&gt; &lt;table class=&quot;table&quot;&gt;&lt;tr&gt;&lt;th&gt;Field&lt;/th&gt;&lt;th&gt;Term&lt;/th&gt;&lt;th&gt;Document Id&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;title&lt;/td&gt;&lt;td&gt;Apache&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Enterprise&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Patterns&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Search&lt;/td&gt;&lt;td&gt;1,2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Server&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Solr&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;author&lt;/td&gt;&lt;td&gt;Callender&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Morville&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Pugh&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Smiley&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;p&gt;A real index contains more information like position information to enable phrase queries and frequencies for calculating the relevancy of a document for a certain search term.&lt;/p&gt; &lt;p&gt;As we can see the index holds a reference to the document. This document, that is also stored with the search index, doesn't necessarily have to be the same as our input document. You can determine for each field if you would like to keep the original content which is normally controlled via an attribute named stored. As a general rule, you should have all the fields stored that you would like to display with the search results. When indexing lots of complete books and you don't need to display it in a results page it might be better to not store it at all. You can still search it, as the terms are available in the index, but you can't access the original content.&lt;/p&gt; &lt;h4&gt;More on Analyzing&lt;/h4&gt; &lt;p&gt;Looking at the index structure above we can already imagine how the search process for a book might work. The user enters a term, e.g. &lt;em&gt;Solr&lt;/em&gt;, this term is then used to lookup the documents that contain the term. This works fine for cases when the user types the term correctly. A search for &lt;em&gt;solr&lt;/em&gt; won't match for our current example.&lt;/p&gt; &lt;p&gt;To mitigate those difficulties we can use the analyzing process already mentioned above. Besides the tokenization that splits the field value into tokens we can do further preprocessing like removing tokens, adding tokens or modifying tokens (TokenFilter).&lt;/p&gt; &lt;p&gt;For our book case it might at first be enough to do lowercasing on the incoming data. So a field value &lt;em&gt;Solr&lt;/em&gt; will then be stored as &lt;em&gt;solr&lt;/em&gt; in the index. To enable the user to also search for &lt;em&gt;Solr&lt;/em&gt; with an uppercase letter we need to do analyzing for the query as well. Often it is the same process that is used for indexing but there are also cases for different analyzers.&lt;/p&gt; &lt;p&gt;The analyzing process not only depends on the content of the documents (field types, language of text fields) but also on your application. Take one common scenario: Adding synonyms for terms to the index. 
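&lt;/p&gt; &lt;p&gt;Technically this is just another token filter in the analyzer chain; in Elasticsearch a minimal sketch of such a filter inside the analysis settings could look like this (names and synonym pairs made up):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&quot;filter&quot;: {&lt;br /&gt;    &quot;book_synonyms&quot;: {&lt;br /&gt;        &quot;type&quot;: &quot;synonym&quot;,&lt;br /&gt;        &quot;synonyms&quot;: [&lt;br /&gt;            &quot;search engine, retrieval engine&quot;,&lt;br /&gt;            &quot;author, writer&quot;&lt;br /&gt;        ]&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;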
You might think that you just take a huge list of synonyms like &lt;a href=&quot;http://wordnet.princeton.edu/&quot;&gt;WordNet&lt;/a&gt; and add those to your application. This might in fact decrease the search experience of your users as there are too many false positives. Also, for certain terms of the domain of your users WordNet might not contain the correct synonyms at all.&lt;/p&gt; &lt;h4&gt;Duplication&lt;/h4&gt; &lt;p&gt;When designing the index structure there are two competing forces: Often you either optimize for query speed or for index size. If you have a lot of data you probably need to take care that you only store data that you really need and even only put terms in the index that are necessary for lookups. Oftentimes for smaller datasets the index size doesn't matter that much and you can design your index for query performance.&lt;/p&gt; &lt;p&gt;Let's look at an example that can make sense for both cases. In our book information system we would like to display an alphabetic navigation for the last name of the author. If the user clicks on A, all the books of authors starting with the letter A should be displayed. When using the Lucene query syntax you can do something like this with its wildcard support: Just issue a query that contains the letter the user clicked and a trailing *, e.g. &lt;em&gt;a*&lt;/em&gt;.&lt;/p&gt; &lt;p&gt;Wildcard queries have become very fast with recent Lucene versions; nevertheless there still is a query time impact. You can also choose another way. When indexing the data you can add another field that just stores the first letter of the name. This is what the relevant configuration might look like in Elasticsearch but the concept is the same for Lucene and Solr:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&quot;author&quot;: {&lt;br /&gt;    &quot;type&quot;: &quot;multi_field&quot;,&lt;br /&gt;    &quot;fields&quot;: {&lt;br /&gt;        &quot;author&quot; : {&lt;br /&gt;            &quot;type&quot;: &quot;string&quot;&lt;br /&gt;        },&lt;br /&gt;        &quot;letter&quot; : {&lt;br /&gt;            &quot;type&quot;: &quot;string&quot;,&lt;br /&gt;            &quot;analyzer&quot;: &quot;single_char_analyzer&quot;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Under the hood, another term dictionary for the field author.letter will be created. For our example it will look like this:&lt;/p&gt; &lt;table class=&quot;table&quot;&gt;&lt;tr&gt;&lt;th&gt;Field&lt;/th&gt;&lt;th&gt;Term&lt;/th&gt;&lt;th&gt;Document Id&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;author.letter&lt;/td&gt;&lt;td&gt;C&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;M&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;P&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;S&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;p&gt;Now, instead of issuing a wildcard query on the author field we can directly query the author.letter field with the letter. You can even build the navigation from all the terms in the index using techniques like faceting to extract all the available terms for a field from the index.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;These are the basics of indexing data for a search engine. The inverted index structure makes searching really fast by moving some processing to the indexing phase. 
When we are not bound by any index size concerns we can design our index for query performance and add additional fields that duplicate some of the data. This design for queries is what makes search engines similar to how lots of the NoSQL solutions are used.&lt;/p&gt; &lt;p&gt;If you'd like to go deeper into the topics I recommend watching the talk &lt;a href=&quot;https://www.youtube.com/watch?v=T5RmMNDR5XI&quot;&gt;What is in a Lucene index&lt;/a&gt; by Adrien Grand. He shows some of the concepts I have mentioned here (and a lot more) but also how those are implemented in Lucene.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>BEDCon - Berlin Expert Days 2014</title>
   <link href="http://blog.florian-hopf.de/2014/04/bedcon-berlin-expert-days-2014.html"/>
   <updated>2014-04-07T12:48:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/04/bedcon-berlin-expert-days-2014</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://bed-con.org/&quot;&gt;BEDCon&lt;/a&gt; is over again. This marks the third year I have been there and it still has the cheapest prices for a general conference I have seen in Germany. If you are looking for a nice conference in Berlin you should definitively consider it.&lt;/p&gt; &lt;h4&gt;Interesting Talks&lt;/h4&gt; &lt;p&gt;The three most interesting talks I attended:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://de.slideshare.net/ewolff/java-application-servers-are-dead&quot;&gt;Java App Servers are Dead&lt;/a&gt; by Eberhard Wolff. Contains some good arguments why deploying applications to app servers might not be the best idea. I especially liked the idea that the target application server is a dependency of your project (because you need a certain version) and dependencies should be explicit in you source tree.&lt;/li&gt;&lt;li&gt;Resilience with &lt;a href=&quot;https://github.com/Netflix/Hystrix&quot;&gt;Hystrix&lt;/a&gt; by Uwe Friedrichsen. A more advanced talk that was perfect for me because I had seen an introduction to fault tolerance by Uwe Friedrichsen at &lt;a href=&quot;http://www.xpdays.de&quot;&gt;XPDays Germany&lt;/a&gt; last year. Hystrix is a library I will definitively check out.&lt;/li&gt;&lt;li&gt;Log Managment with &lt;a href=&quot;http://graylog2.org/&quot;&gt;Graylog2&lt;/a&gt; by Lennart Koopmann. Graylog2 is a full application for doing log management that includes Elasticsearch, MongoDB and a Play application. Lennart mentioned some interesting numbers about installations, a system with an impressive scale.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;Talks I would have liked to see:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://speakerdeck.com/stilkov/restful-http-on-the-jvm&quot;&gt;RESTful HTTP on the JVM&lt;/a&gt; by Martin Eigenbrodt and Stefan Tilkov&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://speakerdeck.com/timmo/micro-services-die-verheissungen-konnten-eintreten&quot;&gt;Micro Services&lt;/a&gt; by Timmo Freudl-Gierke&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;My Talks&lt;/h4&gt; &lt;p&gt;Surprisingly I had two talks accepted for BEDCon. I am especially glad that the more important talk on Search-Driven Applications with &lt;a href=&quot;http://twitter.com/tokraft&quot;&gt;Tobias&lt;/a&gt; of &lt;a href=&quot;http://exensio.de&quot;&gt;Exensio&lt;/a&gt; went really well and we were talking to a packed room. For me giving a talk in a team is far less stressful than giving a talk alone. I am looking forward to giving this talk again at &lt;a href=&quot;http://entwicklertag.de/karlsruhe/2014/vortrag/search-driven-applications&quot;&gt;Entwicklertag Karlsruhe in May&lt;/a&gt;. The &lt;a href=&quot;https://speakerdeck.com/exensio/search-driven-applications&quot;&gt;slides are available on Speaker Deck&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;My second short talk on Akka also went ok, the &lt;a href=&quot;http://www.florian-hopf.de/artikel-vortraege/akka-bedcon/&quot;&gt;slides are online&lt;/a&gt;. 
If you are interested in Akka you can also have a look at my blogposts on &lt;a href=&quot;http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html&quot;&gt;message passing concurrency with Akka&lt;/a&gt; and &lt;a href=&quot;http://blog.florian-hopf.de/2013/10/cope-with-failure-actor-supervision-in.html&quot;&gt;supervision in Akka&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Tweets&lt;/h4&gt; &lt;p&gt;I know, &lt;a href=&quot;http://blog.florian-hopf.de/2014/03/javaland-2014-conference-talks-tweets.html&quot;&gt;I am repeating myself&lt;/a&gt;, but again I stored all the tweets for the conference in Elasticsearch so I can look at them using Kibana. The usual rules apply: each retweet counts as a separate tweet. I am using a sharded version of the index so some of the counts might not be totally exact.&lt;/p&gt; &lt;h5&gt;Distribution&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-wyOb8uUqMx8/U0InzkW917I/AAAAAAAAAR0/yiKkKQnabVg/s1600/distribution-bedcon.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-wyOb8uUqMx8/U0InzkW917I/AAAAAAAAAR0/yiKkKQnabVg/s400/distribution-bedcon.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There seem to be more tweets on the first day. I also had the impression that there were fewer people there on the second day. The first day has a spike that is probably related to the blackout at around 12.&lt;/p&gt; &lt;h5&gt;Hashtags&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-8xSBL6eLn7Y/U0In5qXdNEI/AAAAAAAAAR8/Y-aZyo19ciI/s1600/hashtags-bedcon.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-8xSBL6eLn7Y/U0In5qXdNEI/AAAAAAAAAR8/Y-aZyo19ciI/s400/hashtags-bedcon.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;This is a surprise: elasticsearch beats any other hashtag. logstash, kibana, springmvc, roca ... very specialized hashtags as well. As you can see from the numbers there weren't that many tweets at all.&lt;/p&gt; &lt;h5&gt;Mentions&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-Shqd9Jf6eFQ/U0Iowql-iHI/AAAAAAAAASI/Zp-lG8mBnwY/s1600/mentions-bedcon.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-Shqd9Jf6eFQ/U0Iowql-iHI/AAAAAAAAASI/Zp-lG8mBnwY/s400/mentions-bedcon.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;An even bigger surprise for me: I seem to have got mentioned a lot. After looking into it, this is caused by retweets counting as mentions as well. I had some tweets that got retweeted a few times.&lt;/p&gt; &lt;h4&gt;Negativity&lt;/h4&gt; &lt;p&gt;There is one thing that had me quite upset. During a short power failure (which was not related to BEDCon at all) I was watching a short talk. The speaker had nothing better to do than to insult the technicians that were there to help him. I don't get this attitude and I hope to never see this speaker again at any conference I am attending.&lt;/p&gt; &lt;p&gt;BEDCon is a great conference; all the people involved are doing a great job. 
I hope in the following years I can find the time to go there again.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>JavaLand 2014: The Conference, the Talks, the Tweets</title>
   <link href="http://blog.florian-hopf.de/2014/03/javaland-2014-conference-talks-tweets.html"/>
   <updated>2014-03-28T16:12:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/03/javaland-2014-conference-talks-tweets</id>
   <content type="html">&lt;p&gt;This week on Tuesday and Wednesday parts of the German Java community met for &lt;a href=&quot;http://javaland.eu&quot;&gt;JavaLand&lt;/a&gt;, the conference of the German Java community. In this post I would like to talk about the &lt;a href=&quot;#conference&quot;&gt;conference in general&lt;/a&gt;, &lt;a href=&quot;#talks&quot;&gt;some of the talks I attended&lt;/a&gt; and show you some facts about &lt;a href=&quot;#tweets&quot;&gt;what happened on Twitter&lt;/a&gt;.&lt;/p&gt; &lt;h4 id=&quot;conference&quot;&gt;Location and Organization&lt;/h4&gt; &lt;p&gt;The conference has the most unusual setting one can imagine: A theme park, &lt;a href=&quot;http://phantasialand.de&quot;&gt;Phantasialand in Brühl&lt;/a&gt;. Though I was rather sceptical about this it really turned out to be a good choice. The rooms for the talks were mostly well suited and it was a nice place for meeting outside. On Tuesday evening and at some times during the day some of the attractions were open for use. Though I am not really into theme parks I have to admit that some of them really were fun.&lt;/p&gt; &lt;p&gt;Another unusual choice of the organizers: There were no lunch breaks. Every talk started at the full hour and lasted for 45 minutes, starting from 09:00 in the morning and continuing to the evening. You had 15 minutes to change rooms and get a coffee. The attendees could decide by themselves if they would like to eat a really quick lunch in 15 minutes or skip one of the slots completely. This circumvents some of the problems with large queues at other conferences or breaks that are too long for some participants.&lt;/p&gt; &lt;p&gt;The quality of the food was excellent, with buffets during lunch time and even Tuesday evening. There were several main dishes, different variations of salads and dessert. One of the best conference catering I have ever seen.&lt;/p&gt; &lt;p&gt;There were quite some community events, e.g. the Java User Group café or the innovation lab with the Oculus Rift and a Quadrocopter.&lt;/p&gt; &lt;h4 id=&quot;talks&quot;&gt;The Talks&lt;/h4&gt; &lt;p&gt;Most of &lt;a href=&quot;http://www.javaland.eu/konferenz/2014-JL-konferenzplan.php&quot;&gt;the talks&lt;/a&gt; were in German but there were also some in English. I doubt that you get a lot of value if you don't speak German and go to JavaLand just for the talks, though there were several high profile non German speakers. On several partner stages companies presented talks done by their employees. I had an especially good impression of the &lt;a href=&quot;https://www.codecentric.de/&quot;&gt;Codecentric&lt;/a&gt; stage and regret that I couldn't go to see Patrick Peschlow talking on Elasticsearch, a talk I heard good things about.&lt;/p&gt; &lt;p&gt;Some of the talks had a common theme:&lt;/p&gt; &lt;h5&gt;Reactive Programming&lt;/h5&gt; &lt;p&gt;There was a surprising amount of talks on reactive programming. First Roland Kuhn, the project lead of the Akka project, introduced the principles of building &lt;a href=&quot;http://reactivemanifesto.org&quot;&gt;reactive applications&lt;/a&gt;. He showed how choosing an event driven approach can lead to scalable, resilient and responsive applications. 
He didn't go into technical details but only introduced the concepts.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-GbRTVqHnPDc/UzUli20q_CI/AAAAAAAAAQo/ZGohlXcQxGI/s1600/2014-03-25+10.24.33.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-GbRTVqHnPDc/UzUli20q_CI/AAAAAAAAAQo/ZGohlXcQxGI/s400/2014-03-25+10.24.33.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;At the end of day one, Michael Pisula gave a technical introduction into the actor framework &lt;a href=&quot;http://akka.io&quot;&gt;Akka&lt;/a&gt;, also referring to a lot of the concepts Roland Kuhn mentioned in the morning. Though I already have a good grasp of Akka there were some details that were really useful.&lt;/p&gt; &lt;p&gt;At the beginning of the second day &lt;a href=&quot;http://www.qualitects-group.com/&quot;&gt;Niko Köbler and Heiko Spindler&lt;/a&gt; gave another talk on reactive programming. It was a mixture of concepts and technical details, showing &lt;a href=&quot;https://www.meteor.com/&quot;&gt;MeteorJS&lt;/a&gt;, a framework for combining JavaScript on the client side with the server side, and again Akka.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-S4RCFnYzBRI/UzUoSTC3XUI/AAAAAAAAAQ8/TGreL_pbGIs/s1600/2014-03-26+09.18.30.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-S4RCFnYzBRI/UzUoSTC3XUI/AAAAAAAAAQ8/TGreL_pbGIs/s400/2014-03-26+09.18.30.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h5&gt;Profiling&lt;/h5&gt; &lt;p&gt;There were two sessions related to profiling: First Fabian Lange of Codecentric showing different aspects of profiling in general and some details on microbenchmarking. I will especially take a closer look on &lt;a href=&quot;http://openjdk.java.net/projects/code-tools/jmh/&quot;&gt;jmh&lt;/a&gt;, a microbenchmark tool in OpenJDK.&lt;/p&gt; &lt;p&gt;In &quot;JVM and application bottleneck troubleshooting with simple tools&quot; Daniel Witkowski of Azul demoed an example process of analyzing a problematic application. He used system tools and &lt;a href=&quot;http://visualvm.java.net/&quot;&gt;VisualVM&lt;/a&gt; to find and identify problems. A rather basic talk but nevertheless it is good to keep some of the practices in mind.&lt;/p&gt; &lt;h5&gt;Building Web Applications&lt;/h5&gt; &lt;p&gt;Felix Müller and Nils Wloka presented two different web frameworks. &lt;a href=&quot;http://playframework.org&quot;&gt;Play&lt;/a&gt;, presented by Felix Müller, is a framework that is mostly developed by Typesafe, the company behind Scala and Akka. You can use it from Java as well as Scala though the templates are always written in Scala. I started building an application using Play a while ago and had a quite good impression. 
If I had a use case for a full stack framework I would definitively have another look.&lt;/p&gt; &lt;p&gt;Nils Wloka did a session that obviously wasn't for everyone: 45 minutes of live coding in Clojure building a complete web application for voting on conference talks using &lt;a href=&quot;https://github.com/ring-clojure/ring&quot;&gt;Ring&lt;/a&gt;, &lt;a href=&quot;https://github.com/weavejester/compojure&quot;&gt;Compojure&lt;/a&gt; and &lt;a href=&quot;https://github.com/weavejester/hiccup&quot;&gt;Hiccup&lt;/a&gt;. I am currently working on a simple application using the same stack so I could follow at least most of his explanations. I especially liked that he managed to finish the app, deploy it to OpenShift and used it to let the audience vote on his talk. Quite impressive.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-f_T8GSOq1bs/UzUn5Q_ngDI/AAAAAAAAAQ4/yxKPYZ94Smk/s1600/2014-03-26+16.04.07.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-f_T8GSOq1bs/UzUn5Q_ngDI/AAAAAAAAAQ4/yxKPYZ94Smk/s400/2014-03-26+16.04.07.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;I like it that both frameworks support rapid development. You don't need to restart the server as often as with common Java web development. A lot of the changes are available instantly.&lt;/p&gt; &lt;h5&gt;Miscellaneous&lt;/h5&gt; &lt;p&gt;There were two more talks I'd like to mention: On the second day Patrick Baumgartner gave an excellent introduction into &lt;a href=&quot;http://neo4j.org&quot;&gt;Neo4J&lt;/a&gt;, the graph database on the JVM. He showed several use cases (including a graph of Whiskey brands that spawned quite a discussion on the properties of Whiskey in the audience) and some of the technical details. Though I just recently attended a long talk on Neo4J at &lt;a href=&quot;http://jug-ka.de&quot;&gt;JUG Karlsruhe&lt;/a&gt; and already had seen a similar talk by Patrick at &lt;a href=&quot;http://bed-con.org&quot;&gt;BEDCon&lt;/a&gt; last year it was really entertaining and good for refreshing some of the knowledge.&lt;/p&gt; &lt;p&gt;Finally the highlight of the conference for me: On the first day Stefan Tilkov gave one of his excellent talks on architecture: He showed how splitting applications and building for replacement instead of building for reuse can lead to cleaner applications. He gave several examples of companies that had employed these principles. I have never attended a bad talk by Stefan so if he's giving a talk at a conference you are at you should definitively go see it.&lt;/p&gt; &lt;h4 id=&quot;tweets&quot;&gt;The Tweets&lt;/h4&gt; &lt;p&gt;As I have &lt;a href=&quot;http://blog.florian-hopf.de/2013/11/devoxx-in-tweets.html&quot;&gt;done before for Devoxx&lt;/a&gt; and other conferences I tracked all of the tweets with the hashtags #javaland and #javaland14, stored them in &lt;a href=&quot;http://elasticsearch.org&quot;&gt;Elasticsearch&lt;/a&gt; and therefore had the possibility to analyze them with &lt;a href=&quot;http://www.elasticsearch.org/overview/kibana/&quot;&gt;Kibana&lt;/a&gt;. I started tracking several months before the conference but I am only showing some of my findings for the conference days here, as those can give good insight into which topics were hot. 
Each retweet counts as a separate tweet so if there is one tweet that gets retweeted a lot this will have a strong impact on the numbers.&lt;/p&gt; &lt;h5&gt;Timing&lt;/h5&gt; &lt;p&gt;Looking at the distribution of the tweets for the two conference days we can see that there are a lot of tweets in the morning of the first day and surprisingly in the evening of the first day. I suspect those are mostly tweets by people announcing that they are now starting to &lt;a href=&quot;http://nighthacking.com/java-8-javaland-roller-coaster/&quot;&gt;ride the Black Mamba&lt;/a&gt;. Of course this might also be related to the start of the Java 8 launch event but I like the Black Mamba theory better.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-X4zBRg49d2A/UzUpz64Yc3I/AAAAAAAAARM/d8STkw8yNSU/s1600/distribution.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-X4zBRg49d2A/UzUpz64Yc3I/AAAAAAAAARM/d8STkw8yNSU/s400/distribution.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h5&gt;Mentions&lt;/h5&gt; &lt;p&gt;Mentions are a good indication of popular speakers. It's a huge surprise that the Twitter handle of the conference is only in third place. The two speakers Arun Gupta and David Blevins seem to be rather popular on Twitter (or they just had a really long discussion).&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-7g7hq05zEoo/UzUp_0mef5I/AAAAAAAAARU/GjKx1fZyiMU/s1600/mentions.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-7g7hq05zEoo/UzUp_0mef5I/AAAAAAAAARU/GjKx1fZyiMU/s400/mentions.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h5&gt;Hashtags and Common Terms&lt;/h5&gt; &lt;p&gt;To see the trends we can have a look at the hashtags people used. I excluded #javaland as it is contained in every tweet anyway.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-_EiJWLPtIEA/UzUqGU44q-I/AAAAAAAAARc/P2ZlYJ8wPhM/s1600/hashtags.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-_EiJWLPtIEA/UzUqGU44q-I/AAAAAAAAARc/P2ZlYJ8wPhM/s400/hashtags.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The Java 8 launch event is dominant but Java EE 7 and JavaFX both are strong as well. There was quite some interest in &lt;a href=&quot;http://nighthacking.com/&quot;&gt;Stephen Chin and his Nighthacking sessions&lt;/a&gt;. Neo4J and Asciidoctor are quite a surprise (but not #wegotbeer, a reasonable hashtag).&lt;/p&gt; &lt;p&gt;Hashtags are one thing but we can also look at all the terms in the text of the tweets. 
I excluded a long list of common words so this is my curated list of the important terms.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-qY6HE-QxpoA/UzUqLLBlHEI/AAAAAAAAARk/xhN5L7YDO1k/s1600/terms.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-qY6HE-QxpoA/UzUqLLBlHEI/AAAAAAAAARk/xhN5L7YDO1k/s400/terms.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&quot;Great&quot;, &quot;thanks&quot; and &quot;cool&quot; ... I am not that deep into statistics but it seems to me that people liked the conference.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;JavaLand was a fantastic conference. I had some doubts about the setting in the theme park but it was really great. I will definitively go there again next year, if you are in the area you should think about going there as well. Thanks to all organizers for doing a great job.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Book Review: Instant Apache Solr for Indexing Data How-to</title>
   <link href="http://blog.florian-hopf.de/2014/03/book-review-instant-apache-solr-for.html"/>
   <updated>2014-03-21T15:25:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/03/book-review-instant-apache-solr-for</id>
   <content type="html">&lt;p&gt;Indexing, the process of putting data in a search engine, often is the foundation of anything when building a search based application. With &lt;a href=&quot;http://www.packtpub.com/apache-solr-for-indexing-data/book&quot;&gt;Instant Apache Solr Indexing Data Howto&lt;/a&gt; Alexandre Rafalovitch has written a book dedicated to different aspects of indexing.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-gmfZOxEO6oU/Uyvo3PoWT8I/AAAAAAAAAQU/ID_2cJi-AAE/s1600/instant-apache-solr-indexing.jpg&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-gmfZOxEO6oU/Uyvo3PoWT8I/AAAAAAAAAQU/ID_2cJi-AAE/s200/instant-apache-solr-indexing.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The book is written in a cookbook style with tasks that are solved using different features of Apache Solr. Each task is classified with a difficulty level (simple, intermediate, advanced). The book shows you how to work with collections, index text and binary content (using http and Java) and how to use the Data Import Handler. You will learn about the difference between soft and hard commits, how the UpdateRequestProcessor works (showing the useful &lt;a href=&quot;http://wiki.apache.org/solr/ScriptUpdateProcessor&quot;&gt;ScriptUpdateProcessor&lt;/a&gt;) and the functionality of atomic updates. The final task describes an approach to index content in multiple languages.&lt;/p&gt; &lt;p&gt;Though it is quite short the book contains some really useful information. As it is dedicated to indexing alone you can't really use it for learning how to work with all aspects of Apache Solr but you can get some bang for the buck for you daily work. The only thing that I missed in the book is a more detailed description of more filters and tokenizers. Nevertheless you get quite some value from the book for a reasonable price.&lt;/p&gt; &lt;p&gt;If you are looking for more in-depth information I recommend &lt;a href=&quot;http://www.packtpub.com/apache-solr-3-enterprise-search-server/book&quot;&gt;Apache Solr 3 Enterprise Search Server&lt;/a&gt; (which covers an older version) or the recently finished &lt;a href=&quot;http://www.manning.com/grainger/&quot;&gt;Solr in Action&lt;/a&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Building a Navigation from a Search Index</title>
   <link href="http://blog.florian-hopf.de/2014/03/building-navigation-from-search-index.html"/>
   <updated>2014-03-14T15:09:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/03/building-navigation-from-search-index</id>
   <content type="html">&lt;p&gt;Though the classic use case for search engines is keyword search nowadays they are often used to serve content that doesn't resemble a search result list at all. As they are optimized for read access they make good candidates for serving any part of a website that needs a dynamic query mechanism. Faceting is most often used to display filters for refining the search results but it can also be used to build hierarchical navigations from results. I am using &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt; in this post but the concept can also be used with Elasticsearch or even plain Lucene.&lt;/p&gt; &lt;h4&gt;Browsing Search Results&lt;/h4&gt;   &lt;p&gt;What we are trying to achieve can often be seen on Ecommerce sites but is also useful for serving content. You will be familiar with the navigation that &lt;a href=&quot;http://amazon.com&quot;&gt;Amazon&lt;/a&gt; provides: You can browse the categories that are displayed in a hierarchical navigation.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-ru7Y0O9bGMY/UyKl5geMfXI/AAAAAAAAAQA/qAq-VqfPXZY/s1600/amazon.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-ru7Y0O9bGMY/UyKl5geMfXI/AAAAAAAAAQA/qAq-VqfPXZY/s400/amazon.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Of course I am not familiar with the implementation details of how they are storing their content but search engines can be used to build something like this. You can index different types (editiorial content and products) and tag those with categories. The navigation and the page content is then built from the search results that are returned.&lt;/p&gt; &lt;p&gt;For a simple example we are indexing just products, two books and one record. Two Books by Haruki Murakami are in the category Books/Fiction and Books/Non-Fiction. The record by 13 &amp;amp; God is in the category Music/Downbeat. The resulting navigation should then be something like this:&lt;/p&gt; &lt;ul&gt;    &lt;li&gt;Books         &lt;ul&gt;            &lt;li&gt;Non-Fiction                 &lt;ul&gt;                    &lt;li&gt;Haruki Murakami&lt;/li&gt;                &lt;/ul&gt;            &lt;/li&gt;            &lt;li&gt;Fiction                 &lt;ul&gt;                    &lt;li&gt;Haruki Murakami&lt;/li&gt;                &lt;/ul&gt;            &lt;/li&gt;         &lt;/ul&gt;    &lt;/li&gt;    &lt;li&gt;Music         &lt;ul&gt;            &lt;li&gt;Downbeat                 &lt;ul&gt;                    &lt;li&gt;13 &amp;amp; God&lt;/li&gt;                &lt;/ul&gt;            &lt;/li&gt;        &lt;/ul&gt;    &lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;PathHierarchyTokenizer&lt;/h4&gt; &lt;p&gt;Lucene provides the &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html&quot;&gt;PathHierarchyTokenizer&lt;/a&gt; that can be used to split path like hierarchies. It takes a path as input and creates segments from it. 
For example when indexing &lt;code&gt;Books/Non-Fiction/Haruki Murakami&lt;/code&gt; it will emit the following tokens:&lt;/p&gt; &lt;ul&gt;    &lt;li&gt;Books&lt;/li&gt;    &lt;li&gt;Books/Non-Fiction&lt;/li&gt;    &lt;li&gt;Books/Non-Fiction/Haruki Murakami&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;What is important: It doesn't just split the string on a delimiter but creates a real hierarchy with all the parent paths. This can be used to build our navigation.&lt;/p&gt; &lt;h4&gt;Solr&lt;/h4&gt; &lt;p&gt;Let's see an example with Solr. The configuration and a unit test is also available on &lt;a href=&quot;https://github.com/fhopf/solr-examples/tree/master/navigation&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;We are using a very simple schema with documents that only contain a title and a category&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fields&amp;gt;&lt;br /&gt;    &amp;lt;field name=&amp;quot;title&amp;quot; type=&amp;quot;string&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot; required=&amp;quot;true&amp;quot; multiValued=&amp;quot;false&amp;quot; /&amp;gt; &lt;br /&gt;    &amp;lt;field name=&amp;quot;category&amp;quot; type=&amp;quot;category&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;br /&gt;&amp;lt;/fields&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The title is just a string field but the category is a custom field that uses the &lt;a href=&quot;https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory&quot;&gt;PathHierarchyTokenizerFactory&lt;/a&gt;.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fieldType name=&amp;quot;category&amp;quot; class=&amp;quot;solr.TextField&amp;quot; positionIncrementGap=&amp;quot;100&amp;quot;&amp;gt;&lt;br /&gt;    &amp;lt;analyzer type=&amp;quot;index&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;tokenizer class=&amp;quot;solr.PathHierarchyTokenizerFactory&amp;quot; delimiter=&amp;quot;/&amp;quot;/&amp;gt;&lt;br /&gt;    &amp;lt;/analyzer&amp;gt;&lt;br /&gt;    &amp;lt;analyzer type=&amp;quot;query&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;tokenizer class=&amp;quot;solr.KeywordTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;    &amp;lt;/analyzer&amp;gt;&lt;br /&gt;&amp;lt;/fieldType&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When indexing data the path is split according to the rules of the PathHierarchyTokenizer. When querying we are taking the query term as it is so we can have exact matches.&lt;/p&gt; &lt;p&gt;Suppose we are now indexing three documents that are in the three categories we have seen above:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;URL=http://localhost:8082/solr/update/json&lt;br /&gt;curl $URL -H 'Content-type:application/json' -d '&lt;br /&gt;[&lt;br /&gt;  {&quot;title&quot;:&quot;What I Talk About When I Talk About Running&quot;, &quot;category&quot;:&quot;Books/Non-Fiction/Haruki Murakami&quot;}, &lt;br /&gt;  {&quot;title&quot;:&quot;South of the Border, West of the Sun&quot;, &quot;category&quot;:&quot;Books/Fiction/Haruki Murakami&quot;}, &lt;br /&gt;  {&quot;title&quot;:&quot;Own Your Ghost&quot;, &quot;category&quot;:&quot;Music/Downbeat/13&amp;amp;God&quot;}&lt;br /&gt;]'&lt;br /&gt;curl &quot;$URL?commit=true&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can easily request the navigation using facets. We query on all documents but return no documents at all. 
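Such a request can also be issued from Java. The following SolrJ snippet is only a minimal sketch (the class name, the Solr 4.x &lt;code&gt;HttpSolrServer&lt;/code&gt; client and the core URL &lt;code&gt;http://localhost:8082/solr/collection1&lt;/code&gt; are assumptions matching the curl examples below); it runs the same match-all query with a facet on the category field and prints every returned path with its count:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;import org.apache.solr.client.solrj.SolrQuery;&lt;br /&gt;import org.apache.solr.client.solrj.impl.HttpSolrServer;&lt;br /&gt;import org.apache.solr.client.solrj.response.FacetField;&lt;br /&gt;import org.apache.solr.client.solrj.response.QueryResponse;&lt;br /&gt;&lt;br /&gt;public class NavigationFacets {&lt;br /&gt;    public static void main(String[] args) throws Exception {&lt;br /&gt;        HttpSolrServer server = new HttpSolrServer(&quot;http://localhost:8082/solr/collection1&quot;);&lt;br /&gt;        // match all documents, skip the result list and only request the category facet&lt;br /&gt;        SolrQuery query = new SolrQuery(&quot;*:*&quot;);&lt;br /&gt;        query.setRows(0);&lt;br /&gt;        query.setFacet(true);&lt;br /&gt;        query.addFacetField(&quot;category&quot;);&lt;br /&gt;        QueryResponse response = server.query(query);&lt;br /&gt;        // each facet value is a full path like &quot;Books/Non-Fiction/Haruki Murakami&quot;&lt;br /&gt;        for (FacetField.Count path : response.getFacetField(&quot;category&quot;).getValues()) {&lt;br /&gt;            System.out.println(path.getName() + &quot; (&quot; + path.getCount() + &quot;)&quot;);&lt;br /&gt;        }&lt;br /&gt;        server.shutdown();&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;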
A facet is returned that contains our navigation structure:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8082/solr/collection1/select?q=*%3A*&amp;amp;rows=0&amp;amp;wt=json&amp;amp;indent=true&amp;amp;facet=true&amp;amp;facet.field=category&lt;br /&gt;{&lt;br /&gt;  &quot;responseHeader&quot;:{&lt;br /&gt;    &quot;status&quot;:0,&lt;br /&gt;    &quot;QTime&quot;:1},&lt;br /&gt;  &quot;response&quot;:{&quot;numFound&quot;:3,&quot;start&quot;:0,&quot;docs&quot;:[]&lt;br /&gt;  },&lt;br /&gt;  &quot;facet_counts&quot;:{&lt;br /&gt;    &quot;facet_queries&quot;:{},&lt;br /&gt;    &quot;facet_fields&quot;:{&lt;br /&gt;      &quot;category&quot;:[&lt;br /&gt;        &quot;Books&quot;,2,&lt;br /&gt;        &quot;Books/Fiction&quot;,1,&lt;br /&gt;        &quot;Books/Fiction/Haruki Murakami&quot;,1,&lt;br /&gt;        &quot;Books/Non-Fiction&quot;,1,&lt;br /&gt;        &quot;Books/Non-Fiction/Haruki Murakami&quot;,1,&lt;br /&gt;        &quot;Music&quot;,1,&lt;br /&gt;        &quot;Music/Downbeat&quot;,1,&lt;br /&gt;        &quot;Music/Downbeat/13&amp;amp;God&quot;,1]},&lt;br /&gt;    &quot;facet_dates&quot;:{},&lt;br /&gt;    &quot;facet_ranges&quot;:{}}}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When displaying the navigation you can now simply split the paths according to their hierarchies. The queries that are executed for displaying the content contain a filter query that filters the currently selected navigation item.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8082/solr/collection1/select?q=*%3A*&amp;amp;wt=json&amp;amp;fq=category:Books&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;responseHeader&quot;:{&lt;br /&gt;    &quot;status&quot;:0,&lt;br /&gt;    &quot;QTime&quot;:27},&lt;br /&gt;  &quot;response&quot;:{&quot;numFound&quot;:2,&quot;start&quot;:0,&quot;docs&quot;:[&lt;br /&gt;      {&lt;br /&gt;        &quot;title&quot;:&quot;What I Talk About When I Talk About Running&quot;},&lt;br /&gt;      {&lt;br /&gt;        &quot;title&quot;:&quot;South of the Border, West of the Sun&quot;}]&lt;br /&gt;  }}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Using &lt;a href=&quot;http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters&quot;&gt;tags and exclusions&lt;/a&gt; you can even build the navigation using the same request that queries for the filtered documents.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8082/solr/collection1/select?q=*%3A*&amp;amp;wt=json&amp;amp;fq=%7B%21tag%3Dcat%7Dcategory:Books&amp;amp;facet=true&amp;amp;facet.field=%7B%21ex%3Dcat%7Dcategory&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;responseHeader&quot;:{&lt;br /&gt;    &quot;status&quot;:0,&lt;br /&gt;    &quot;QTime&quot;:2},&lt;br /&gt;  &quot;response&quot;:{&quot;numFound&quot;:2,&quot;start&quot;:0,&quot;docs&quot;:[&lt;br /&gt;      {&lt;br /&gt;        &quot;title&quot;:&quot;What I Talk About When I Talk About Running&quot;},&lt;br /&gt;      {&lt;br /&gt;        &quot;title&quot;:&quot;South of the Border, West of the Sun&quot;}]&lt;br /&gt;  },&lt;br /&gt;  &quot;facet_counts&quot;:{&lt;br /&gt;    &quot;facet_queries&quot;:{},&lt;br /&gt;    &quot;facet_fields&quot;:{&lt;br /&gt;      &quot;category&quot;:[&lt;br /&gt;        &quot;Books&quot;,2,&lt;br /&gt;        &quot;Books/Fiction&quot;,1,&lt;br /&gt;        &quot;Books/Fiction/Haruki Murakami&quot;,1,&lt;br /&gt;        &quot;Books/Non-Fiction&quot;,1,&lt;br /&gt;        &quot;Books/Non-Fiction/Haruki Murakami&quot;,1,&lt;br /&gt;        &quot;Music&quot;,1,&lt;br /&gt;        
&quot;Music/Downbeat&quot;,1,&lt;br /&gt;        &quot;Music/Downbeat/13&amp;God&quot;,1]},&lt;br /&gt;    &quot;facet_dates&quot;:{},&lt;br /&gt;    &quot;facet_ranges&quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If the navigation you are retrieving from the search engine is your single source of truth you might also have to add a sort mechanism; by default all facets are sorted by their count. This works in our simple case but not for the real world. To have them sorted in a defined way you can add numeric identifiers. You would then index paths like &lt;code&gt;100|Books/110|Non-Fiction/111|Haruki Murakami&lt;/code&gt; and request alphanumeric sorting via &lt;a href=&quot;http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort&quot;&gt;facet.sort=index&lt;/a&gt;. When displaying the result you just remove the front of the string.&lt;/p&gt; &lt;p&gt;As you now are using the search engine to build the navigation you will immediately have the benefits of its filtering mechanisms. Only use categories that have online articles? Add a filter query &lt;code&gt;fq=online:true&lt;/code&gt;. Make sure that there are no categories with products that are out of stock? &lt;code&gt;fq=inStock:true&lt;/code&gt;.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Search engines offer great flexibility when delivering content. A lot of their functionality can be used to build applications that pull most of their content from the index.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Prefix and Suffix Matches in Solr</title>
   <link href="http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html"/>
   <updated>2014-03-07T14:25:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/03/prefix-and-suffix-matches-in-solr</id>
   <content type="html">&lt;p&gt;Search engines are all about looking up strings. The user enters a query term that is then retrieved from the inverted index. Sometimes a user is looking for a value that is only a substring of values in the index and the user might be interested in those matches as well. This is especially important for languages like German that contain compound words like Semmelknödel where Knödel means dumpling and Semmel specializes the kind.&lt;/p&gt; &lt;h4&gt;Wildcards&lt;/h4&gt; &lt;p&gt;For demoing the approaches I am using a very simple schema. Documents consist of a text field and an id. The configuration as well as a unit test is also vailable on &lt;a href=&quot;https://github.com/fhopf/solr-examples/tree/master/prefix-suffix-match&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fields&amp;gt;&lt;br /&gt;    &amp;lt;field name=&amp;quot;id&amp;quot; type=&amp;quot;string&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot; required=&amp;quot;true&amp;quot; multiValued=&amp;quot;false&amp;quot; /&amp;gt;&lt;br /&gt;    &amp;lt;field name=&amp;quot;text&amp;quot; type=&amp;quot;text_general&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;br /&gt;&amp;lt;/fields&amp;gt;&lt;br /&gt;&amp;lt;uniqueKey&amp;gt;id&amp;lt;/uniqueKey&amp;gt;&lt;br /&gt;&amp;lt;types&amp;gt;&lt;br /&gt;    &amp;lt;fieldType name=&amp;quot;string&amp;quot; class=&amp;quot;solr.StrField&amp;quot; sortMissingLast=&amp;quot;true&amp;quot; /&amp;gt;&lt;br /&gt;&lt;br /&gt;    &amp;lt;fieldType name=&amp;quot;text_general&amp;quot; class=&amp;quot;solr.TextField&amp;quot; positionIncrementGap=&amp;quot;100&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;analyzer&amp;gt;&lt;br /&gt;            &amp;lt;tokenizer class=&amp;quot;solr.StandardTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;            &amp;lt;filter class=&amp;quot;solr.LowerCaseFilterFactory&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;/analyzer&amp;gt;&lt;br /&gt;    &amp;lt;/fieldType&amp;gt;&lt;br /&gt;&amp;lt;/types&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;One approach that is quite popular when doing prefix or suffix matches is to use wildcards when querying. This can be done programmatically but you need to take care that any user input is then escaped correctly. Suppose you have the term &lt;em&gt;dumpling&lt;/em&gt; in the index and a user enters the term &lt;em&gt;dump&lt;/em&gt;. If you want to make sure that the query term matches the document in the index you can just add a wildcard to the user query in the code of your application so the resulting query then would be &lt;em&gt;dump*&lt;/em&gt;&lt;/p&gt; &lt;p&gt;Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application.&lt;/p&gt; &lt;p&gt;You can increase the user experience a bit by boosting exact matches for your term. 
You need to create a more complicated query but this way documents containing an exact match will score higher:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;dump^2 OR dump*&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When creating a query like this you should also take care that the user can't add terms that will make the query invalid. The SolrJ method &lt;a href=&quot;http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29&quot;&gt;&lt;code&gt;escapeQueryChars&lt;/code&gt; of the class ClientUtils&lt;/a&gt; can be used to escape the user input.&lt;/p&gt; &lt;p&gt;If you are now taking suffix matches into account the query can get quite complicated and creating a query like this on the client side is not for everyone. Depending on your application another approach can be the better solution: You can create another field containing NGrams during indexing.&lt;/p&gt; &lt;h4&gt;Prefix Matches with NGrams&lt;/h4&gt; &lt;p&gt;NGrams are substrings of your indexed terms that you can put in an additional field. Those substrings can then be used for lookups so there is no need for any wildcards. Using the (e)dismax handler you can automatically set a boost on your field that is used for exact matches so you get the same behaviour we have seen above.&lt;/p&gt; &lt;p&gt;For prefix matches we can use the &lt;a href=&quot;http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory&quot;&gt;EdgeNGramFilter&lt;/a&gt; that is configured for an additional field:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;...&lt;br /&gt;    &amp;lt;field name=&amp;quot;text_prefix&amp;quot; type=&amp;quot;text_prefix&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;br /&gt;...&lt;br /&gt;    &amp;lt;copyField source=&amp;quot;text&amp;quot; dest=&amp;quot;text_prefix&amp;quot;/&amp;gt;&lt;br /&gt;...    &lt;br /&gt;    &amp;lt;fieldType name=&amp;quot;text_prefix&amp;quot; class=&amp;quot;solr.TextField&amp;quot; positionIncrementGap=&amp;quot;100&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;analyzer type=&amp;quot;index&amp;quot;&amp;gt;&lt;br /&gt;            &amp;lt;tokenizer class=&amp;quot;solr.LowerCaseTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;            &amp;lt;filter class=&amp;quot;solr.EdgeNGramFilterFactory&amp;quot; minGramSize=&amp;quot;3&amp;quot; maxGramSize=&amp;quot;15&amp;quot; side=&amp;quot;front&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;/analyzer&amp;gt;&lt;br /&gt;        &amp;lt;analyzer type=&amp;quot;query&amp;quot;&amp;gt;&lt;br /&gt;            &amp;lt;tokenizer class=&amp;quot;solr.LowerCaseTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;/analyzer&amp;gt;&lt;br /&gt;    &amp;lt;/fieldType&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;During indexing time the text field value is copied to the &lt;code&gt;text_prefix&lt;/code&gt; field and analyzed using the EdgeNGramFilter. Grams are created for any length between 3 and 15, starting from the front of the string. When indexing the term &lt;em&gt;dumpling&lt;/em&gt; this would be&lt;/p&gt; &lt;ul&gt;&lt;li&gt;dum&lt;/li&gt;&lt;li&gt;dump&lt;/li&gt;&lt;li&gt;dumpl&lt;/li&gt;&lt;li&gt;dumpli&lt;/li&gt;&lt;li&gt;dumplin&lt;/li&gt;&lt;li&gt;dumpling&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;During query time the term is not split again so that the exact match for the substring can be used. 
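A plain term query on the &lt;code&gt;text_prefix&lt;/code&gt; field is therefore enough to find the longer indexed term. The following SolrJ snippet is just a minimal sketch of such a check (the class name, the Solr 4.x &lt;code&gt;HttpSolrServer&lt;/code&gt; client and the URL &lt;code&gt;http://localhost:8983/solr/collection1&lt;/code&gt; are assumptions; it also assumes a document containing the term &lt;em&gt;dumpling&lt;/em&gt; has been indexed with the schema above):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;import org.apache.solr.client.solrj.SolrQuery;&lt;br /&gt;import org.apache.solr.client.solrj.impl.HttpSolrServer;&lt;br /&gt;import org.apache.solr.client.solrj.response.QueryResponse;&lt;br /&gt;&lt;br /&gt;public class PrefixMatchCheck {&lt;br /&gt;    public static void main(String[] args) throws Exception {&lt;br /&gt;        HttpSolrServer server = new HttpSolrServer(&quot;http://localhost:8983/solr/collection1&quot;);&lt;br /&gt;        // the user input &quot;dump&quot; matches the gram &quot;dump&quot; indexed for &quot;dumpling&quot;, no wildcard needed&lt;br /&gt;        SolrQuery query = new SolrQuery(&quot;text_prefix:dump&quot;);&lt;br /&gt;        QueryResponse response = server.query(query);&lt;br /&gt;        System.out.println(&quot;hits: &quot; + response.getResults().getNumFound());&lt;br /&gt;        server.shutdown();&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;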
As usual, the analyze view of the Solr admin backend can be a great help for seeing the analyzing process in action.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-LoUrVqzCydU/UxljLtj_xII/AAAAAAAAAPs/D1pFzs12JYE/s1600/analyze.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-LoUrVqzCydU/UxljLtj_xII/AAAAAAAAAPs/D1pFzs12JYE/s640/analyze.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Using the dismax handler you can now pass in the user query as it is and just advice it to search on your fields by adding the parameter &lt;code&gt;qf=text^2,text_prefix&lt;/code&gt;.&lt;/p&gt; &lt;h4&gt;Suffix Matches&lt;/h4&gt; &lt;p&gt;With languages that have compound words it's a common requirement to also do suffix matches. If a user queries for the term &lt;em&gt;Knödel&lt;/em&gt; (dumpling) it is expected that documents that contain the term&lt;em&gt;Semmelknödel&lt;/em&gt; also match.&lt;/p&gt; &lt;p&gt;Using Solr versions up to 4.3 this is no problem. You can use the EdgeNGramFilterFactory to create grams starting from the back of the string.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;...&lt;br /&gt;    &amp;lt;field name=&amp;quot;text_suffix&amp;quot; type=&amp;quot;text_suffix&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;br /&gt;...    &lt;br /&gt;    &amp;lt;copyField source=&amp;quot;text&amp;quot; dest=&amp;quot;text_suffix&amp;quot;/&amp;gt;&lt;br /&gt;...&lt;br /&gt;    &amp;lt;fieldType name=&amp;quot;text_suffix&amp;quot; class=&amp;quot;solr.TextField&amp;quot; positionIncrementGap=&amp;quot;100&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;analyzer type=&amp;quot;index&amp;quot;&amp;gt;&lt;br /&gt;            &amp;lt;tokenizer class=&amp;quot;solr.StandardTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;            &amp;lt;filter class=&amp;quot;solr.LowerCaseFilterFactory&amp;quot;/&amp;gt;&lt;br /&gt;            &amp;lt;filter class=&amp;quot;solr.EdgeNGramFilterFactory&amp;quot; minGramSize=&amp;quot;3&amp;quot; maxGramSize=&amp;quot;15&amp;quot; side=&amp;quot;back&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;/analyzer&amp;gt;&lt;br /&gt;        &amp;lt;analyzer type=&amp;quot;query&amp;quot;&amp;gt;&lt;br /&gt;            &amp;lt;tokenizer class=&amp;quot;solr.KeywordTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;            &amp;lt;filter class=&amp;quot;solr.LowerCaseFilterFactory&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;/analyzer&amp;gt;&lt;br /&gt;    &amp;lt;/fieldType&amp;gt;&lt;br /&gt;...&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This creates suffixes of the indexed term that also contains the term &lt;em&gt;knödel&lt;/em&gt; so our query works.&lt;/p&gt; &lt;p&gt;But, using more recent versions of Solr you will encounter a problem during indexing time:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;java.lang.IllegalArgumentException: Side.BACK is not supported anymore as of Lucene 4.4, use ReverseStringFilter up-front and afterward&lt;br /&gt;    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.&lt;init&gt;(EdgeNGramTokenFilter.java:114)&lt;br /&gt;    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.&lt;init&gt;(EdgeNGramTokenFilter.java:149)&lt;br /&gt;    at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:52)&lt;br /&gt;    at 
org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:34)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can't use the EdgeNGramFilterFactory anymore for suffix ngrams. But fortunately the stack trace also advises us how to fix the problem. We have to combine it with ReverseStringFilter:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fieldType name=&amp;quot;text_suffix&amp;quot; class=&amp;quot;solr.TextField&amp;quot; positionIncrementGap=&amp;quot;100&amp;quot;&amp;gt;&lt;br /&gt;    &amp;lt;analyzer type=&amp;quot;index&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;tokenizer class=&amp;quot;solr.LowerCaseTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;filter class=&amp;quot;solr.ReverseStringFilterFactory&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;filter class=&amp;quot;solr.EdgeNGramFilterFactory&amp;quot; minGramSize=&amp;quot;3&amp;quot; maxGramSize=&amp;quot;15&amp;quot; side=&amp;quot;front&amp;quot;/&amp;gt;&lt;br /&gt;        &amp;lt;filter class=&amp;quot;solr.ReverseStringFilterFactory&amp;quot;/&amp;gt;&lt;br /&gt;    &amp;lt;/analyzer&amp;gt;&lt;br /&gt;    &amp;lt;analyzer type=&amp;quot;query&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;tokenizer class=&amp;quot;solr.LowerCaseTokenizerFactory&amp;quot;/&amp;gt;&lt;br /&gt;    &amp;lt;/analyzer&amp;gt;&lt;br /&gt;&amp;lt;/fieldType&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will now yield the same results as before.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Whether you should manipulate your query by adding wildcards or use the NGram approach heavily depends on your use case and is also a matter of taste. Personally I am using NGrams most of the time as disk space normally isn't a concern for the kind of projects I am working on. Wildcard search has become a lot faster in Lucene 4 so I doubt there is a real benefit there anymore. Nevertheless I tend to do as much processing as I can during indexing time.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Introduction to the ODF Toolkit</title>
   <link href="http://blog.florian-hopf.de/2014/02/introduction-to-odf-toolkit.html"/>
   <updated>2014-02-27T14:11:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/02/introduction-to-odf-toolkit</id>
   <content type="html">&lt;p&gt;Microsoft Office has been the dominating office suite and unfortunately it still is. For a long time not only the programs were closed but also the file format.&lt;/p&gt; &lt;h4&gt;Open Document&lt;/h4&gt; &lt;p&gt;Nevertheless there are open alternatives available, most notable Libre Office/Apache OpenOffice.org. In 2005 the OASIS foundation standardized &lt;a href=&quot;https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office&quot;&gt;Open Document&lt;/a&gt;, an open alternative to the proprietary world of Microsoft. Open Document is heavily influenced by the OpenOffice.org file format but is supported by multiple office suites and viewers.&lt;/p&gt; &lt;p&gt;Open Document files are zip files that contain some XML documents. You can go ahead and unzip any documents you might have:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;unzip -l aufwaende-12.ods &lt;br /&gt;Archive:  aufwaende-12.ods&lt;br /&gt;  Length      Date    Time    Name&lt;br /&gt;---------  ---------- -----   ----&lt;br /&gt;       46  2012-12-31 15:16   mimetype&lt;br /&gt;      815  2012-12-31 15:16   meta.xml&lt;br /&gt;     8680  2012-12-31 15:16   settings.xml&lt;br /&gt;   171642  2012-12-31 15:16   content.xml&lt;br /&gt;     3796  2012-12-31 15:16   Thumbnails/thumbnail.png&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/images/Bitmaps/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/popupmenu/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/toolpanel/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/statusbar/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/progressbar/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/toolbar/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/menubar/&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/accelerator/current.xml&lt;br /&gt;        0  2012-12-31 15:16   Configurations2/floater/&lt;br /&gt;    22349  2012-12-31 15:16   styles.xml&lt;br /&gt;      993  2012-12-31 15:16   META-INF/manifest.xml&lt;br /&gt;---------                     -------&lt;br /&gt;   208321                     16 files&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The &lt;code&gt;mimetype&lt;/code&gt; file determines what kind of document it is (in this case &lt;code&gt;application/vnd.oasis.opendocument.spreadsheet&lt;/code&gt;), &lt;code&gt;META-INF/manifest.xml&lt;/code&gt; lists the files in the archive. The most important file is &lt;code&gt;content.xml&lt;/code&gt; that contains the body of the document.&lt;/p&gt; &lt;h4&gt;Server Side Processing&lt;/h4&gt; &lt;p&gt;Though there are quite some viewers and editors for Open Document available when it comes to the server side the situation used to be different. For processing Microsoft Office files there is the Java library &lt;a href=&quot;http://poi.apache.org/&quot;&gt;Apache POI&lt;/a&gt;, which provides a lot of functionality to read and manipulate Microsoft Office files. But if you wanted to process Open Document files nearly your only option was to install OpenOffice.org on the server and talk to it by means of its &lt;a href=&quot;http://en.wikipedia.org/wiki/Universal_Network_Objects&quot;&gt;UNO API&lt;/a&gt;. 
Not exactly an easy thing to do.&lt;/p&gt; &lt;h4&gt;ODF Toolkit&lt;/h4&gt; &lt;p&gt;Fortunately there is light at the end of the tunnel: the &lt;a href=&quot;http://incubator.apache.org/odftoolkit/index.html&quot;&gt;ODF Toolkit&lt;/a&gt; project, currently incubating at Apache, provides lightweight access to files in the Open Document format from Java. As the name implies it's a toolkit, consisting of multiple projects.&lt;/p&gt;   &lt;p&gt;The heart of it is the schema generator that ingests the Open Document specification that is available as a RelaxNG schema. It provides a template based facility to generate files from the ODF specification. Currently it only generates Java classes but it can also be used to create different files (think of documentation or accessors for different programming languages).&lt;/p&gt; &lt;p&gt;The next layer of the toolkit is &lt;a href=&quot;http://incubator.apache.org/odftoolkit/odfdom/index.html&quot;&gt;ODFDOM&lt;/a&gt;. It provides templates that generate classes for DOM access of elements and attributes of ODF documents. Additionally it provides facilities like packaging and document encryption.&lt;/p&gt; &lt;p&gt;For example, you can list the file paths of an ODF document using the ODFPackage class:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;OdfPackage pkg = OdfPackage.loadPackage(&quot;aufwaende-12.ods&quot;);&lt;br /&gt;Set&lt;String&gt; filePaths = pkg.getFilePaths();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If you are familiar with the Open Document spec ODFDOM will be the only library you need. But if you are like most of us and don't know all the elements and attributes by heart there is another project for you: Simple API provides easy access to a lot of the features you might expect from a library like this: You can deal with higher level abstractions like paragraphs for text or rows and cells in the spreadsheet world or search for and replace text.&lt;/p&gt; &lt;p&gt;This code snippet creates a spreadsheet, adds some cells to it and saves it:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;SpreadsheetDocument doc = SpreadsheetDocument.newSpreadsheetDocument();&lt;br /&gt;Table sheet = doc.getSheetByIndex(0);&lt;br /&gt;sheet.getCellByPosition(0, 0).setStringValue(&quot;Betrag&quot;);&lt;br /&gt;sheet.getCellByPosition(1, 0).setDoubleValue(23.0);&lt;br /&gt;doc.save(File.createTempFile(&quot;odf&quot;, &quot;.ods&quot;));&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Code&lt;/h4&gt; &lt;p&gt;If you are interested in seeing more code using the ODF Toolkit you can have a look at the &lt;a href=&quot;http://incubator.apache.org/odftoolkit/simple/document/cookbook/index.html&quot;&gt;cookbook&lt;/a&gt; that contains a lot of useful code snippets for the Simple API. Additionally you should keep an eye on this blog for the second part of the series where we will look at an application that extracts data from spreadsheets.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Book Review: Search-Based Applications</title>
   <link href="http://blog.florian-hopf.de/2014/02/book-review-search-based-applications.html"/>
   <updated>2014-02-20T14:55:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/02/book-review-search-based-applications</id>
   <content type="html">&lt;p&gt;Search is moving away from the simple keyword search box with a result list for indexed web pages. Features like facetting and aggregations offer completely new possibilities for data discovery, making them relevant for business applications as well.&lt;/p&gt; &lt;h4&gt;The Book&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://searchbasedapplications.com/&quot;&gt;Search-Based Applications&lt;/a&gt; by Greogory Grefenstette and Laura Wilber describes the changes that have occured in the last years regarding search engines, traditionally used for indexing web pages and databases, that have been used for business applications.&lt;/p&gt; &lt;p&gt;The short book introduces the motivation for using search engines in business applications, mostly caused by exponential data growth and realtime needs. Several chapters describe what has changed in the database and search engine world, focusing on one aspect in each chapter. On the search side it shows that advanced features like faceted search or natural language processing techniques can be valuable for offering real time access on data that has traditionally been put to a data warehouse. On the database side it shows that with the advent of non-relational types, some databases are moving in the direction of the flexible schema, scalability or specialized access patterns of search engines.&lt;/p&gt; &lt;p&gt;Some common themes in the book for using search based applications are the aggregation of content from different data sources and the reduction of load on databases by offloading the traffic to the read optimized search engines. Mixing content from different data sources can be useful to provide flexible access on multiple legacy systems, increasing usability of the applications. The document model of search engines and the possibility to do incremental indexing lead to applications that provide near realtime access to data and can be adjusted to match changing needs quicker.&lt;/p&gt; &lt;p&gt;Though most of the book is product agnostic one chapter lists some platforms that are available for building search based applications, mainly focusing on big commercial players like Exalead (the company of the authors), Endeca and Autonomy. The book closes with three case studies that show different aspects of building search based applications.&lt;/p&gt; &lt;p&gt;Even if there are some statements contained that I don't fully agree with or that are even contradictory it is a very good book for understanding the reasoning behind building search based applications. I got some new ideas for applications of search engines and this alone makes it a worthwile read.&lt;/p&gt; &lt;h4&gt;Open Source Options for Search Based Applications&lt;/h4&gt; &lt;p&gt;Though the book lists quite some SBA platforms and related technology there is not a single mention of &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Apache Solr&lt;/a&gt;, which is quite surprising as it employs a lot of the features the authors define for SBAs. Solr has the Data Import Handler to connect external data sources, semantic technologies (though probably not as rich as some of the commercial options) and complementary open source projects like carrot² for search result clustering or ManifoldCF as a connector framework.&lt;/p&gt; &lt;p&gt;When the book talks about replacing parts of data warehouse applications with SBAs for real time analytics this of course reminds me of use cases for &lt;a href=&quot;http://elasticsearch.org&quot;&gt;Elasticsearch&lt;/a&gt;. 
&lt;a href=&quot;http://www.elasticsearch.org/overview/kibana/&quot;&gt;Kibana&lt;/a&gt; or custom dashboards can make a wealth of information that is contained in the index accessible in an easy way.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Search Meetups in Germany</title>
   <link href="http://blog.florian-hopf.de/2014/02/search-meetups-in-germany.html"/>
   <updated>2014-02-13T14:35:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/02/search-meetups-in-germany</id>
   <content type="html">&lt;p&gt;I enjoy going to user group events. Not only because of the talks that are an integral part of most meetups but also to meet and chat with likeminded people.&lt;/p&gt; &lt;p&gt;Fortunately there are some user groups in Germany that are focused on search technology, a topic I am especially interested in. This post lists those I know, if there is one I missed let me know in the comments. For reasons of suspense I am listing the groups from east to west.&lt;/p&gt; &lt;h4&gt;Elasticsearch User Group Berlin&lt;/h4&gt; &lt;p&gt;Berlin has the luxury of a &lt;a href=&quot;http://www.meetup.com/ElasticSearch-UG-Berlin/&quot;&gt;usergroup dedicated to Elasticsearch&lt;/a&gt; only. The group is organized by people of Trifork who are seasoned event organizers. The group seems to have a surprising success with regular meetings and up to 50 participants. This is probably caused by the high startup density in Berlin, the ease of use and scalability of Elasticsearch makes it very popular among them.&lt;/p&gt; &lt;h4&gt;Search Meetup Munich&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.meetup.com/Search-Meetup-Munich/&quot;&gt;Search Meetup Munich&lt;/a&gt; is a very active group organized by Alexander Reelsen of Elasticsearch. There are bimonthly meetings at alternating companies with 2 to 3 talks per event. Topics are open source search in general with a strong emphasis on Lucene, Solr and Elasticsearch. Most speakers will give the talk in English if there are people in the audience who don't speak German. The amount of participants ranges from 20 - 40 people. I am surprised about the vital community in Munich with a lot of startups doing interesting things with search. Though it is quite a way from Karlsruhe to Munich I try to attend the meetings as often as I can.&lt;/p&gt; &lt;h4&gt;Solr Lucene User Group Deutschland e.V.&lt;/h4&gt; &lt;p&gt;Though the name implies it's a national group &lt;a href=&quot;http://www.slug-de.org/&quot;&gt;Solr Lucene User Group Deutschland e.V.&lt;/a&gt; is located in Augsburg. It seems to be mainly organized by members of &lt;a href=&quot;http://www.shi-gmbh.com/&quot;&gt;SHI GmbH&lt;/a&gt;, a prominent Lucidworks and Elasticsearch partner. The &lt;a href=&quot;http://www.meetup.com/SLUG-Solr-Lucene-User-Group-Deutschland-e-V/&quot;&gt;meetup page&lt;/a&gt; is rather quite so far with one event last year with one participant.&lt;/p&gt; &lt;h4&gt;Search Meetup Frankfurt&lt;/h4&gt; &lt;p&gt;The first &lt;a href=&quot;http://searchmeetupfrankfurt.de/&quot;&gt;search meetup&lt;/a&gt; I attended with around 10 participants, a talk on the indexing pipeline of the Solr based product search solution &lt;a href=&quot;http://www.searchperience.de/home.html&quot;&gt;Searchperience&lt;/a&gt; and some discussions. There are quite some people with non-Java background doing PHP web development. Unfortunately the 2012 event I attended seems to be the last event that happened. 
I don't take that personally.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-9WzoTb6iNsY/UvxornZQlnI/AAAAAAAAAPc/vol1CjA87f0/s1600/smka_social_gplus_cropped_small.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-9WzoTb6iNsY/UvxornZQlnI/AAAAAAAAAPc/vol1CjA87f0/s1600/smka_social_gplus_cropped_small.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Search Meetup Karlsruhe&lt;/h4&gt; &lt;p&gt;Last but not least: As I probably can't travel to Munich all the time and I would like to have some exchange with locals I just started &lt;a href=&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/&quot;&gt;Search Meetup Karlsruhe&lt;/a&gt; together with &lt;a href=&quot;http://www.exensio.de/&quot;&gt;Exensio&lt;/a&gt;, long time Solr users and Elasticsearch partners. I don't expect it to be as huge as Munich or Berlin but I hope we can start some interesting discussions.&lt;/p&gt; &lt;p&gt;We just scheduled our &lt;a href=&quot;http://www.meetup.com/Search-Meetup-Karlsruhe/events/161417512/&quot;&gt;first meeting&lt;/a&gt; with two talks on Linked Data Search and the difference between building applications based on databases vs. search engines. If you are in the area and interested in search you should join us.&lt;/p&gt; &lt;h4 id=&quot;elasticsearchStuttgart&quot;&gt;elasticsearch.Stuttgart (Update 16.02.2014)&lt;/h4&gt; &lt;p&gt;Just a day after publishing this post another &lt;a href=&quot;http://www.meetup.com/Elasticsearch-Stuttgart/&quot;&gt;Elasticsearch Meetup was announced, this time in Stuttgart&lt;/a&gt;. The first event is scheduled for March 25 with an &lt;a href=&quot;http://www.meetup.com/Elasticsearch-Stuttgart/events/166289382/&quot;&gt;Elasticsearch 1.0 release party including a talk by Alexander Reelsen&lt;/a&gt;. If this didn't clash with &lt;a href=&quot;http://www.javaland.eu/&quot;&gt;JavaLand conference&lt;/a&gt; I would definitively go there but I hope there will be more events in the future I can attend.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Elasticsearch is Distributed by Default</title>
   <link href="http://blog.florian-hopf.de/2014/02/elasticsearch-is-distributed-by-default.html"/>
   <updated>2014-02-07T15:00:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/02/elasticsearch-is-distributed-by-default</id>
   <content type="html">&lt;p&gt;One of the big advantages Elasticsearch has over Solr is that it is really easy to get started with. You can download it, start it, index and search immediately. Schema discovery and the JSON based REST API all make it a very beginner friendly tool.&lt;/p&gt; &lt;p&gt;Also, another aspect, Elasticsearch is distributed by default. You can add nodes that will automatically be discovered and your index can be distributed across several nodes.&lt;/p&gt; &lt;p&gt;The distributed nature is great to get started with but you need to be aware that there are some consequences. Distribution comes with a cost. In this post I will show you how relevancy of search results might be affected by sharding in Elasticsearch.&lt;/p&gt; &lt;h4&gt;Relevancy&lt;/h4&gt; &lt;p&gt;As Elasticsearch is based on Lucene it also uses its relevancy algorithm by default, called TF/IDF. Term frequency (the amount of terms in a document) and the frequency of the term in an index (IDF) are important parts of the relevancy function. You can see &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html&quot;&gt;details of the default formula in the Lucene API docs&lt;/a&gt; but for this post it is sufficient to know that the more often a term occurs in a document the more relevant it is considered. Terms that are more frequent in the index are considered less relevant.&lt;/p&gt; &lt;h4&gt;A Problematic Example&lt;/h4&gt; &lt;p&gt;Let's see the problem in action. We are starting with a fresh Elasticsearch instance and index some test documents. The documents only consist of one field that has the same text in it:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST http://localhost:9200/testindex/doc/0 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;0&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/1 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;1&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/2 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;2&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/3 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;3&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/4 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;4&quot;,&quot;_version&quot;:1}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When we search for those documents by text they of course are returned correctly.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/testindex/doc/_search?q=title:hut&amp;pretty=true&quot;&lt;br /&gt;{&lt;br /&gt;  
&quot;took&quot; : 4,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;max_score&quot; : 0.10848885,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;4&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.10848885, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;0&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.10848885, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;1&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.10848885, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;2&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.10848885, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;3&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.10848885, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now, let's index five more documents that are similar to the first documents but contain our test term &lt;code&gt;Hut&lt;/code&gt; only once.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST http://localhost:9200/testindex/doc/5 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;5&quot;,&quot;_version&quot;:1} &lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/6 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;6&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/7 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;7&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/8 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }'&lt;br 
/&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;8&quot;,&quot;_version&quot;:1}&lt;br /&gt;curl -XPOST http://localhost:9200/testindex/doc/9 -d '{ &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;9&quot;,&quot;_version&quot;:1}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;As the default relevancy formula takes the term frequency in a document into account those documents should score less than our original documents. So if we query for &lt;code&gt;hut&lt;/code&gt; again the results still contain our original documents at the beginning:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/testindex/doc/_search?q=title:hut&amp;pretty=true&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 6,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 10,&lt;br /&gt;    &quot;max_score&quot; : 0.2101998,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;4&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.2101998, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      [...]&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;3&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.2101998, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;9&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.1486337, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      [...]&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;8&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.1486337, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We are still happy. The most relevant documents are at the top of our search results. 
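&lt;/p&gt; &lt;p&gt;Before we continue it helps to recall, in simplified form, how the default similarity computes the score contribution of a term. This is only a sketch of the formula from the Lucene API docs linked above, leaving out boosts, norms and the query normalization:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;tf(t in d)  = sqrt(frequency of t in d)&lt;br /&gt;idf(t)      = 1 + log(numDocs / (docFreq(t) + 1))&lt;br /&gt;score(t, d) ~ tf(t in d) * idf(t)^2&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The part that will matter in a moment is that &lt;code&gt;numDocs&lt;/code&gt; is the total number of documents in the index the calculation runs on.&lt;/p&gt; &lt;p&gt;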
Now let's index something that is completely different from our original documents:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST http://localhost:9200/testindex/doc/10 -d '{ &quot;title&quot; : &quot;mayhem and chaos&quot; }'&lt;br /&gt;{&quot;ok&quot;:true,&quot;_index&quot;:&quot;testindex&quot;,&quot;_type&quot;:&quot;doc&quot;,&quot;_id&quot;:&quot;10&quot;,&quot;_version&quot;:1}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now, if we search again for our test term something strange will happen:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/testindex/doc/_search?q=title:hut&amp;pretty=true&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 2,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 10,&lt;br /&gt;    &quot;max_score&quot; : 0.35355338,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;3&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.35355338, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;8&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.25, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat er nicht&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;testindex&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;doc&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;4&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.2101998, &quot;_source&quot; : { &quot;title&quot; : &quot;Mein Hut der hat vier Ecken, 4 Ecken hat mein Hut&quot; }&lt;br /&gt;    }, {&lt;br /&gt;      [...]&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Though the document we indexed last has nothing to do with our original documents it influenced our search and one of the documents that should score less is now the second result. This is something that you wouldn't expect. The behavior is caused by the default sharding of Elasticsearch that distributes a logical Elasticsearch index across several Lucene indices.&lt;/p&gt; &lt;h4&gt;Sharding&lt;/h4&gt; &lt;p&gt;When you are starting a single instance and index some documents Elasticsearch will by default create five shards under the hood, so there are five Lucene indices. Each of those shards contains some of the documents you are adding to the index. The assignment of documents to a shard happens in a way so that the documents will be distributed evenly.&lt;/p&gt; &lt;p&gt;You can get information about the shards and their document counts using the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-status.html&quot;&gt;indices status API&lt;/a&gt; or more visually appealing using one of the plugins, e.g. &lt;a href=&quot;http://mobz.github.io/elasticsearch-head/&quot;&gt;elasticsearch-head&lt;/a&gt;. 
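&lt;/p&gt; &lt;p&gt;For example, a quick way to look at the shards from the command line is the status endpoint (just a sketch for the &lt;code&gt;testindex&lt;/code&gt; index used above; the response contains a &lt;code&gt;docs&lt;/code&gt; section for every shard):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/testindex/_status?pretty=true&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;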
There are five shards for our index; once we click on a shard in elasticsearch-head we can see further details about it, including the doc count.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-25GBK86P4ks/UvR1xmM2UKI/AAAAAAAAAPE/L6XoL_STcRo/s1600/shards-and-docs.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-25GBK86P4ks/UvR1xmM2UKI/AAAAAAAAAPE/L6XoL_STcRo/s640/shards-and-docs.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt;   &lt;p&gt;If you check the shards right after you indexed the first five documents you will notice that those are distributed evenly across all shards. Each shard contains one of the documents. The second batch is again distributed evenly. The final document we index creates some imbalance. One shard will have an additional document.&lt;/p&gt; &lt;h4&gt;The Effects on Relevancy&lt;/h4&gt; &lt;p&gt;Each shard in Elasticsearch is a Lucene index in itself and as an index in Elasticsearch consists of multiple shards it needs to distribute the queries across multiple Lucene indices. Especially the inverse document frequency is difficult to calculate in this case.&lt;/p&gt; &lt;p&gt;Reconsider the Lucene relevancy formula: the term frequency as well as the inverse document frequency are important. When indexing the original 5 documents all documents had the same term frequency as well as the same idf for our term. The next documents still had no impact on the idf as each document in the index still contained the term.&lt;/p&gt; &lt;p&gt;Now, when indexing the last document something potentially unexpected is happening. The new document is added to one of the shards. On this shard we therefore changed the inverse document frequency, which is calculated from all the documents that contain the term but also takes the overall document count in the Lucene index into account. On the shard that contains the new document we increased the idf value as now there are more documents in the Lucene index. As idf has quite some weight on the overall relevancy score we &quot;boosted&quot; the documents of the Lucene index that now contains more documents.&lt;/p&gt; &lt;p&gt;If you'd like to see details on the relevancy calculation you can use the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/search-explain.html&quot;&gt;explain API&lt;/a&gt; or simply add a parameter &lt;code&gt;explain=true&lt;/code&gt;. This will not only tell you all the details of the results of the relevancy function for each document but also which shard a document resides on. It can give you really useful information when debugging relevancy problems.&lt;/p&gt; &lt;h4&gt;How to Fix It?&lt;/h4&gt; &lt;p&gt;When beginning with Elasticsearch you might fix this by setting your index to use one shard only. Though this will work it is not a good idea: Sharding is a very powerful feature of Elasticsearch and you shouldn't give up on it easily. If you notice that there are problems with your relevancy that are caused by these issues you should rather try to use the &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html&quot;&gt;&lt;code&gt;search_type&lt;/code&gt;&lt;/a&gt; &lt;code&gt;dfs_query_then_fetch&lt;/code&gt; instead of the default &lt;code&gt;query_then_fetch&lt;/code&gt;. 
The difference between those is that dfs queries all the document frequencies of the shards in advance. This way Elasticsearch can calculate the overall document frequency and all results will be in the correct order:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/testindex/doc/_search?q=title:hut&amp;pretty=true&amp;explain=true&amp;search_type=dfs_query_then_fetch&quot;&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Though the example we have seen here is artificially constructed this is something that can occur and that I have already seen in live applications. The behavior can be especially relevant when there are either very few documents or your documents are distributed to the shards in an unfortunate way. It is great that Elasticsearch makes distributed searches as easy and as performant as possible but you need to be aware that you might not get exact scores.&lt;/p&gt; &lt;p&gt;Zachary Thong has written a &lt;a href=&quot;http://www.elasticsearch.org/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch/&quot;&gt;blog post about this behavior as well at the Elasticsearch blog&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Proxying Solr</title>
   <link href="http://blog.florian-hopf.de/2014/01/proxying-solr.html"/>
   <updated>2014-01-30T15:53:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/01/proxying-solr</id>
   <content type="html">&lt;p&gt;The dominant deployment model for Solr is running it as a standalone webapp. You can use it in embedded mode in Java but then you are missing some of the goodies like the seperate JVM (your GC will thank you for it) and you are of course tied to Java then.&lt;/p&gt; &lt;p&gt;Most of the time Solr is considered similar to a database; only custom webapp code can talk to it and it is not exposed to the net. In your webapp you are then using any of the client libraries to access Solr and build your queries.&lt;/p&gt; &lt;a href=&quot;http://1.bp.blogspot.com/-6dLkAi5gGSA/Uuno12cD3-I/AAAAAAAAAOo/0PtBGlcLffY/s1600/webapp.png&quot; imageanchor=&quot;1&quot; &gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-6dLkAi5gGSA/Uuno12cD3-I/AAAAAAAAAOo/0PtBGlcLffY/s400/webapp.png&quot; /&gt;&lt;/a&gt; &lt;p&gt;With the rise of JavaScript on the client side sometimes people get the idea to put Solr to the web directly. There is no custom webapp layer in between, only the web talking to Solr.&lt;/p&gt; &lt;a href=&quot;http://4.bp.blogspot.com/-ttLLOEBjk1Y/UunpCs9iOiI/AAAAAAAAAOw/J9xd8KJGea8/s1600/proxy.png&quot; imageanchor=&quot;1&quot; &gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-ttLLOEBjk1Y/UunpCs9iOiI/AAAAAAAAAOw/J9xd8KJGea8/s400/proxy.png&quot; /&gt;&lt;/a&gt; &lt;p&gt;A proxy needs to sit in front of the Solr server that only allows certain requests. You won't allow any requests that are potentially modifying your index or do any other harm. This can be done but you need to be aware of some things:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;You need to take extra care to only expose request handlers and parameters that you need.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/SOLR-2854&quot;&gt;Bugs or features in Solr might expose more functionality than you expect.&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://groups.google.com/d/msg/ajax-solr/zhrG-CncrRE/HsyRwmR4mEsJ&quot;&gt;Denial of Service attacks are easier to do.&lt;/a&gt;&lt;/li&gt;&lt;li&gt;The client side logic can get more complicated though there are libs like &lt;a href=&quot;https://github.com/evolvingweb/ajax-solr&quot;&gt;AJAX Solr&lt;/a&gt; available.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;Most of the time putting Solr directly to the web is not an option, but you can, if you are willing to take some risk. I think that especially the possibility of DOS attacks shouldn't be taken lightly. The more flexibility you want to have on the query side the more care needs to be taken to secure the system. If you'd like to do it anyway see &lt;a href=&quot;http://charlesnagy.info/it/best-practices/incredibly-fast-solr-autosuggest&quot;&gt;this post on how to use nginx as a proxy to Solr&lt;/a&gt; and &lt;a href=&quot;https://github.com/evolvingweb/ajax-solr/wiki/Solr-proxies&quot;&gt;this list of specialized proxies for Solr&lt;/a&gt;. For general instructions on &lt;a href=&quot;http://wiki.apache.org/solr/SolrSecurity&quot;&gt;securing your Solr server see the project wiki&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Analyze your Maven Project Dependencies with dependency:analyze</title>
   <link href="http://blog.florian-hopf.de/2014/01/analyze-your-maven-project-dependencies.html"/>
   <updated>2014-01-23T15:27:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/01/analyze-your-maven-project-dependencies</id>
   <content type="html">&lt;p&gt;When working on a larger Maven project it might happen that you lose track of the dependecies in your project. Over time you are adding new dependencies, remove code or move code to modules so some of the dependencies become obsolete. Though I did lots of Maven projects I have to admit I didn't know until recently that the dependency plugin contains a useful goal for solving this problem: &lt;code&gt;dependency:analyze&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://maven.apache.org/plugins/maven-dependency-plugin/analyze-mojo.html&quot;&gt;dependency:analyze&lt;/a&gt; mojo can find dependencies that are declared for your project but are not necessary. Additionally it can find dependecies that are used but are undeclared, which happens when you are directly using transitive dependencies in your code.&lt;/p&gt; &lt;h4&gt;Analyzing Dependencies&lt;/h4&gt; &lt;p&gt;I am showing an example with the &lt;a href=&quot;http://incubator.apache.org/odftoolkit/&quot;&gt;Odftoolkit&lt;/a&gt; project. It contains quite some dependencies and is old enough that some of them are outdated. ODFDOM is the most important module of the project, providing low level access to the Open Document structure from Java code. Running &lt;code&gt;mvn dependency:tree&lt;/code&gt; we can see its dependencies at the time of writing:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;mvn dependency:tree&lt;br /&gt;[INFO] Scanning for projects...&lt;br /&gt;[INFO]                                                                         &lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] &lt;br /&gt;[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ odfdom-java ---&lt;br /&gt;[INFO] org.apache.odftoolkit:odfdom-java:jar:0.8.10-incubating-SNAPSHOT&lt;br /&gt;[INFO] +- org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile&lt;br /&gt;[INFO] |  \- com.sun:tools:jar:1.7.0:system&lt;br /&gt;[INFO] +- xerces:xercesImpl:jar:2.9.1:compile&lt;br /&gt;[INFO] |  \- xml-apis:xml-apis:jar:1.3.04:compile&lt;br /&gt;[INFO] +- junit:junit:jar:4.8.1:test&lt;br /&gt;[INFO] +- org.apache.jena:jena-arq:jar:2.9.4:compile&lt;br /&gt;[INFO] |  +- org.apache.jena:jena-core:jar:2.7.4:compile&lt;br /&gt;[INFO] |  +- commons-codec:commons-codec:jar:1.5:compile&lt;br /&gt;[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile&lt;br /&gt;[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.6.4:compile&lt;br /&gt;[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.1.3:compile&lt;br /&gt;[INFO] |  +- org.slf4j:slf4j-api:jar:1.6.4:compile&lt;br /&gt;[INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.6.4:compile&lt;br /&gt;[INFO] |  \- log4j:log4j:jar:1.2.16:compile&lt;br /&gt;[INFO] +- org.apache.jena:jena-core:jar:tests:2.7.4:test&lt;br /&gt;[INFO] +- net.rootdev:java-rdfa:jar:0.4.2:compile&lt;br /&gt;[INFO] |  \- org.apache.jena:jena-iri:jar:0.9.1:compile&lt;br /&gt;[INFO] \- commons-validator:commons-validator:jar:1.4.0:compile&lt;br /&gt;[INFO]    +- commons-beanutils:commons-beanutils:jar:1.8.3:compile&lt;br /&gt;[INFO]    +- commons-digester:commons-digester:jar:1.8:compile&lt;br /&gt;[INFO]    \- commons-logging:commons-logging:jar:1.1.1:compile&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] BUILD SUCCESS&lt;br /&gt;[INFO] 
------------------------------------------------------------------------&lt;br /&gt;[INFO] Total time: 1.877s&lt;br /&gt;[INFO] Finished at: Mon Jan 20 00:41:05 CET 2014&lt;br /&gt;[INFO] Final Memory: 13M/172M&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The project contains some direct dependencies with a lot of transitive dependencies. When running &lt;code&gt;mvn dependency:analyze&lt;/code&gt; on the project we will see that our dependencies don't seem to be correct:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;mvn dependency:analyze&lt;br /&gt;[INFO] Scanning for projects...&lt;br /&gt;[INFO]                                                                         &lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] &lt;br /&gt;[...] &lt;br /&gt;[INFO] &lt;&lt;&lt; maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java &lt;&lt;&lt;&lt;br /&gt;[INFO] &lt;br /&gt;[INFO] --- maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java ---&lt;br /&gt;[WARNING] Used undeclared dependencies found:&lt;br /&gt;[WARNING]    org.apache.jena:jena-core:jar:2.7.4:compile&lt;br /&gt;[WARNING]    xml-apis:xml-apis:jar:1.3.04:compile&lt;br /&gt;[WARNING] Unused declared dependencies found:&lt;br /&gt;[WARNING]    org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile&lt;br /&gt;[WARNING]    org.apache.jena:jena-arq:jar:2.9.4:compile&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] BUILD SUCCESS&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;br /&gt;[INFO] Total time: 4.769s&lt;br /&gt;[INFO] Finished at: Mon Jan 20 00:43:27 CET 2014&lt;br /&gt;[INFO] Final Memory: 28M/295M&lt;br /&gt;[INFO] ------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The second part of the warnings is easier to understand; we have declared some dependencies that we are never using, the taglets and jena-arq. When comparing this with the output we got above you will notice that the largest set of transitive dependencies was imported by the jena-arq dependency. And we don't even need it.&lt;/p&gt; &lt;p&gt;The first part seems to be more difficult: there are two used but undeclared dependencies found. What does it mean? Shouldn't compiling fail if there are any dependencies that are undeclared? No, it just means that we are directly using a transitive dependency from our code which we should better declare ourselves.&lt;/p&gt; &lt;h4&gt;Breaking the Build on Dependency Problems&lt;/h4&gt; &lt;p&gt;If you want to find problems with your dependencies as early as possible it's best to integrate the check in your build. The &lt;code&gt;dependency:analyze&lt;/code&gt; goal we have seen above is meant to be used in a standalone way, for automatic execution there is the &lt;a href=&quot;http://maven.apache.org/plugins/maven-dependency-plugin/analyze-only-mojo.html&quot;&gt;analyze-only mojo&lt;/a&gt;. 
It automatically binds to the &lt;code&gt;verify&lt;/code&gt; phase and can be declared like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;plugin&amp;gt;&lt;br /&gt;    &amp;lt;artifactId&amp;gt;maven-dependency-plugin&amp;lt;/artifactId&amp;gt;&lt;br /&gt;    &amp;lt;version&amp;gt;2.8&amp;lt;/version&amp;gt;&lt;br /&gt;    &amp;lt;executions&amp;gt;&lt;br /&gt;        &amp;lt;execution&amp;gt;&lt;br /&gt;            &amp;lt;id&amp;gt;analyze&amp;lt;/id&amp;gt;&lt;br /&gt;            &amp;lt;goals&amp;gt;&lt;br /&gt;                &amp;lt;goal&amp;gt;analyze-only&amp;lt;/goal&amp;gt;&lt;br /&gt;            &amp;lt;/goals&amp;gt;&lt;br /&gt;            &amp;lt;configuration&amp;gt;&lt;br /&gt;                &amp;lt;failOnWarning&amp;gt;true&amp;lt;/failOnWarning&amp;gt;&lt;br /&gt;                &amp;lt;outputXML&amp;gt;true&amp;lt;/outputXML&amp;gt;&lt;br /&gt;            &amp;lt;/configuration&amp;gt;&lt;br /&gt;        &amp;lt;/execution&amp;gt;&lt;br /&gt;    &amp;lt;/executions&amp;gt;&lt;br /&gt;&amp;lt;/plugin&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now the build will fail if there are any problems found. Conveniently, if an undeclared dependency has been found, it will also output the XML that you can then paste in your pom file.&lt;/p&gt; &lt;p&gt;A final word of caution: the default analyzer works on the bytecode level so in special cases it might not notice a dependency correctly, e.g. when you are using constants from a dependency that are inlined.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Geo-Spatial Features in Solr 4.2</title>
   <link href="http://blog.florian-hopf.de/2014/01/geo-spatial-features-in-solr-42.html"/>
   <updated>2014-01-17T22:42:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/01/geo-spatial-features-in-solr-42</id>
   <content type="html">&lt;p&gt;Last week I have shown how you can use the classic spatial support in Solr. It uses the &lt;code&gt;LatLonType&lt;/code&gt; to index locations that can then be used to query, filter or sort by distance. Starting with Solr 4.2 there is a &lt;a href=&quot;http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4&quot;&gt;new module&lt;/a&gt; available. It uses the &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/spatial/&quot;&gt;Lucene Spatial module&lt;/a&gt; which is more powerful but also needs to be used differently. You can still use the old approach but in this post I will show you how to use the new features to do the same operations we saw last week.&lt;/p&gt; &lt;h4&gt;Indexing Locations&lt;/h4&gt; &lt;p&gt;Again we are indexing talks that contain a title and a location. For the new spatial support you need to add a different field type:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fieldType name=&amp;quot;location_rpt&amp;quot; class=&amp;quot;solr.SpatialRecursivePrefixTreeFieldType&amp;quot;&lt;br /&gt;    distErrPct=&amp;quot;0.025&amp;quot;&lt;br /&gt;    maxDistErr=&amp;quot;0.000009&amp;quot;&lt;br /&gt;    units=&amp;quot;degrees&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Contrary to the &lt;code&gt;LatLonType&lt;/code&gt; the &lt;code&gt;SpatialRecursivePrefixTreeFieldType&lt;/code&gt; is no subfield type but stores the data structure itself. The attribute &lt;code&gt;maxDistErr&lt;/code&gt; determines the accuracy of the location, in this case it is 0.000009 degrees which is close to one meter and should be enough for most location searches.&lt;/p&gt; &lt;p&gt;To use the type in our documents of course we also need to add it as a field:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;field name=&amp;quot;location&amp;quot; type=&amp;quot;location_rpt&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now we are indexing some documents with three fields: the path (which is our id), the title of the talk and the location.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8082/solr/update/json?commit=true -H 'Content-type:application/json' -d '&lt;br /&gt;[&lt;br /&gt; {&quot;path&quot; : &quot;1&quot;, &quot;title&quot; : &quot;Search Evolution&quot;, &quot;location&quot; : &quot;49.487036,8.458001&quot;},&lt;br /&gt; {&quot;path&quot; : &quot;2&quot;, &quot;title&quot; : &quot;Suchen und Finden mit Lucene und Solr&quot;, &quot;location&quot; : &quot;49.013787,8.419936&quot;}&lt;br /&gt;]'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Again, the location of the first document is Mannheim, the second Karlsruhe. You can see that the locations are encoded in an ngram-like fashion when looking at the schema browser in the administration backend:&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-fNpPsr60DXM/Uti6IvDPJFI/AAAAAAAAAOU/iBOfVoxrK1I/s1600/schema-browser.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-fNpPsr60DXM/Uti6IvDPJFI/AAAAAAAAAOU/iBOfVoxrK1I/s640/schema-browser.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Sorting by Distance&lt;/h4&gt; &lt;p&gt;A common use case is to sort the results by distance from a certain location. 
You can't use the Solr 3 syntax anymore but need to use the &lt;code&gt;geofilt&lt;/code&gt; query parser that maps the distance to the score, which you then sort on.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;http://localhost:8082/solr/select?q={!geofilt%20score=distance%20sfield=location%20pt=49.487036,8.458001%20d=100}&amp;sort=score asc&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;As the name implies the &lt;code&gt;geofilt&lt;/code&gt; query parser is originally meant for filtering. You need to pass in the distance that is used for filtering, so by sorting you might also affect which results are returned. For our example passing in a distance of 10 kilometers will only yield one result. This is something to be aware of.&lt;/p&gt; &lt;h4&gt;Filtering by Distance&lt;/h4&gt; &lt;p&gt;We can use the same approach we saw above to filter our results to only match talks in a given area. We can either use the geofilt query parser (that filters by radius) or the bbox query parser (which filters on a box around the radius). As you can imagine, the query looks similar: &lt;/p&gt; &lt;pre&gt;&lt;code&gt;http://localhost:8082/solr/select?q=*:*&amp;fq={!geofilt%20score=distance%20sfield=location%20pt=49.013787,8.419936%20d=10}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will return all talks within a distance of 10 kilometers from Karlsruhe.&lt;/p&gt; &lt;h4&gt;Doing Fancy Stuff&lt;/h4&gt; &lt;p&gt;Besides the features we have looked at in this post you can also do more advanced stuff. In Solr 3 Spatial you can't have multivalued location fields, which is possible with Solr 4.2. Now you can also index lines or polygons that can then be queried and intersected. In &lt;a href=&quot;http://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/&quot;&gt;this presentation&lt;/a&gt; Chris Hostetter uses this feature to determine overlapping of time, an interesting use case that you might not think of at first.&lt;/p&gt; </content>
 </entry>
 
 <entry>
   <title>Geo-Spatial Features in Solr 3</title>
   <link href="http://blog.florian-hopf.de/2014/01/geo-spatial-features-in-solr-3.html"/>
   <updated>2014-01-10T00:44:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/01/geo-spatial-features-in-solr-3</id>
   <content type="html">&lt;p&gt;Solr is mainly known for its full text search capabilities. You index text and are able to search it in lowercase or stemmed form, depending on your analyzer chain. But besides text Solr can do more: You can use RangeQueries to query numeric fields (&quot;Find all products with a price lower than 2€&quot;), do date arithmetic (&quot;Find me all news entries from last week&quot;) or do geospatial queries, which we will look at in this post. What I am describing here is the &lt;a href=&quot;http://wiki.apache.org/solr/SpatialSearch&quot;&gt;old spatial search support&lt;/a&gt;. Next week I will show you how to do the same things using recent versions of Solr.&lt;/p&gt; &lt;h4&gt;Indexing Locations&lt;/h4&gt; &lt;p&gt;Suppose we are indexing talks in Solr that contain a title and a location. We need to add the field type for locations to our schema:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fieldType name=&amp;quot;location&amp;quot; class=&amp;quot;solr.LatLonType&amp;quot; subFieldSuffix=&amp;quot;_coordinate&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;LatLonType is a subfield type which means that it not only creates one field but also additional fields, one for longitude and one for latitude. The subFieldSuffix attribute determines the name of the field that will be &lt;code&gt;&amp;lt;fieldname&amp;gt;_&amp;lt;i&amp;gt;_&amp;lt;subFieldSuffix&amp;gt;&lt;/code&gt;. If the name of our field is location and we are indexing a latitude/longitude pair this would lead to three fields: &lt;code&gt;location, location_0_coordinate, location_1_coordinate&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;To use the type in our schema we need to add one field and one dynamic field definition for the sub fields:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;field name=&amp;quot;location&amp;quot; type=&amp;quot;location&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot;/&amp;gt;&lt;br /&gt;&amp;lt;dynamicField name=&amp;quot;*_coordinate&amp;quot; type=&amp;quot;tdouble&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The dynamic field is of type tdouble so we need to make sure that it is also available in our schema. The attributes indexed on location is special in this case: It determines if the subfields for the coordinates are created at all.&lt;/p&gt; &lt;p&gt;Let's index some documents. We are adding three fields, the path (which is our id), the title of the talk and the location.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8983/solr/update/json?commit=true -H 'Content-type:application/json' -d '&lt;br /&gt;[&lt;br /&gt; {&quot;path&quot; : &quot;1&quot;, &quot;title&quot; : &quot;Search Evolution&quot;, &quot;location&quot; : &quot;49.487036,8.458001&quot;},&lt;br /&gt; {&quot;path&quot; : &quot;2&quot;, &quot;title&quot; : &quot;Suchen und Finden mit Lucene und Solr&quot;, &quot;location&quot; : &quot;49.013787,8.419936&quot;}&lt;br /&gt;]'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The location of the first document is Mannheim, the second Karlsruhe. We can see that our documents are indexed and that the location is stored by querying all documents:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8983/solr/select?q=*%3A*&amp;wt=json&amp;indent=true&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Looking at the schema browser we can also see that the two subfields have been created. 
Each contains the terms for the &lt;a href=&quot;http://mentaldetritus.blogspot.de/2013/01/grokking-solr-trie-fields.html&quot;&gt;Trie&lt;/a&gt; field.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-WHA3lzxncVY/Us7O4JfBk4I/AAAAAAAAAOE/boocZR5c4As/s1600/schema-browser.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-WHA3lzxncVY/Us7O4JfBk4I/AAAAAAAAAOE/boocZR5c4As/s640/schema-browser.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Sorting by Distance&lt;/h4&gt; &lt;p&gt;One use case you might have when indexing locations is to sort the results by distance from a certain location. This can for example be useful for classifieds or rentals to show the nearest results first.&lt;/p&gt; &lt;p&gt;Sorting can be done via the &lt;code&gt;geodist()&lt;/code&gt; function. We need to pass in the location that is used as a basis via the &lt;code&gt;pt&lt;/code&gt; parameter and the location field to use in the function via the &lt;code&gt;sfield&lt;/code&gt; parameter. We can see this in action by sorting twice, once for a location in Durlach near Karlsruhe and once for Heidelberg, which is near Mannheim:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8983/solr/select?wt=json&amp;indent=true&amp;q=*:*&amp;sfield=location&amp;pt=49.003421,8.483133&amp;sort=geodist%28%29%20asc&quot;&lt;br /&gt;curl http://localhost:8983/solr/select?wt=json&amp;indent=true&amp;q=*:*&amp;sfield=location&amp;pt=49.399119,8.672479&amp;sort=geodist%28%29%20asc&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Both return the results in the correct order. You can also use the &lt;code&gt;geodist()&lt;/code&gt; function to boost results that are closer to your location. See the &lt;a href=&quot;http://wiki.apache.org/solr/SpatialSearch#How_to_boost_closest_results&quot;&gt;Solr wiki&lt;/a&gt; for details.&lt;/p&gt; &lt;h4&gt;Filtering by Distance&lt;/h4&gt; &lt;p&gt;Another common use case is to filter the search results to only show results from a certain area, e.g. in a distance of 10 kilometers. This can either be done automatically or via facets.&lt;/p&gt; &lt;p&gt;Filtering is done using the &lt;code&gt;geofilt&lt;/code&gt; query parser. It accepts the same parameters we have seen before but of course for filtering you add it as a filter query. The distance can be passed using the parameter &lt;code&gt;d&lt;/code&gt;; the unit defaults to kilometers. Suppose you are in Durlach and only want to see talks that are within a distance of 10 kilometers:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8983/solr/select?wt=json&amp;indent=true&amp;q=*:*&amp;fq={!geofilt}&amp;pt=49.003421,8.483133&amp;sfield=location&amp;d=10&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This only returns the result in Karlsruhe. Once we decide that we want to see results within a distance of 100 kilometers we again see both results:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &quot;http://localhost:8983/solr/select?wt=json&amp;indent=true&amp;q=*:*&amp;fq={!geofilt}&amp;pt=49.003421,8.483133&amp;sfield=location&amp;d=100&quot;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Pretty useful! If you are interested, there is more on the &lt;a href=&quot;http://wiki.apache.org/solr/SpatialSearch&quot;&gt;Solr wiki&lt;/a&gt;. Next week I will show you how to do the same using the new spatial support in Solr versions starting from 4.2.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>20 Months of Freelancing</title>
   <link href="http://blog.florian-hopf.de/2014/01/20-months-of-freelancing.html"/>
   <updated>2014-01-01T16:40:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2014/01/20-months-of-freelancing</id>
   <content type="html">&lt;p&gt;It's now 20 months that I am working as a freelancer on my own. With the end of year thing going on it's time to look back on what happened and I would like to take the chance to write about what I did, what works and what I would like to do in the future.&lt;/p&gt; &lt;h4&gt;Why Freelancing?&lt;/h4&gt; &lt;p&gt;During my time at university I started working for a small consulting company that specialized in open source software. I was the first employee and started working part time but even paused my studies for half a year to work full time with them. The company grew and in 2006 after finally getting my degree I joined them full time. I always enjoyed the work and dedicated a lot of my energy and time. In 2012, with around 30 employees I noticed that I needed something else. I had already switched to a 4 day work week in 2011 to have more time for myself, to learn and experiment. Though the company is still great to work with it just didn't fit me anymore.&lt;/p&gt; &lt;p&gt;After a long time at a company that has partly been home and family it is difficult to just switch to another company. Also I wanted to have more control over what kind of projects I am doing and I liked to have some time on my own to write blog posts and do talks at user groups and conferences. I always spent time at customer projects and often liked it so it was an obvious decision to go with freelancing.&lt;/p&gt; &lt;h4&gt;The Start&lt;/h4&gt; &lt;p&gt;Before I quit I decided that I wanted to do more with search technologies. I had worked a lot on content management systems and search always is a crucial part. Having done several larger projects with Lucene and Solr and even the first Solr integration in OpenCms I knew that I had the necessary experience and that I liked it.&lt;/p&gt; &lt;p&gt;I had minimal savings when I quit and no customers so far. Other freelancers are often surprised when I tell this and advice to only quit when you already know who you will be working for next. I guess this was some kind of hubris, I was really determined that I wanted to do freelancing and knew that there were companies who needed my help.&lt;/p&gt; &lt;p&gt;I started freelancing in May and had already organized to give a talk at our &lt;a href=&quot;http://jug-ka.de&quot;&gt;local Java User Group&lt;/a&gt; on Lucene and Solr in July. I wanted to have the full month of May for talk preparation and bootstrapping the business, all the things like getting a website, getting an accountant and so on. Unfortunately I didn't find a project until the beginning of July with a lot of my savings already spent on living cost and necessary items for the business. Be aware that it will take you up to two months from the beginning of a project until you see the first money.&lt;/p&gt; &lt;h4&gt;Marketing&lt;/h4&gt; &lt;p&gt;The good thing about freelancing: I can call all the activities I like to do marketing and tell myself that those are necessary. The bad thing about it: I don't spend enough time on paid projects.&lt;/p&gt; &lt;p&gt;I am spending lots of my time that is not paid for on learning: Blog posts, books and conferences. I got to a quite frequent rhythm with weekly posts, spoke at several user groups and conferences and joined an open source project, &lt;a href=&quot;http://incubator.apache.org/odftoolkit/&quot;&gt;OdfToolkit&lt;/a&gt;. 
A lot of freelancers don't do any of those and dedicate all their time working on customer projects but those activities are part of the reason I went with freelancing.&lt;/p&gt; &lt;h4&gt;The Projects&lt;/h4&gt; &lt;p&gt;When talking about freelancing you probably think about sitting in the coffee shop, doing several projects in parallel. For me this is different; lots of Java projects are rather long term and require you to work in a team, which is best done on premise. Though I like the idea of doing more diverse projects I am also happy to have some stability. Having long term clients prevents some of the context switching involved with multiple projects and you have to spend less time on sales.&lt;/p&gt; &lt;p&gt;My first project involved working on an online shop for a large retailer built on Hybris, a commercial Ecommerce engine. I did a lot of Solr stuff and though it was rather stressful, working on product search was really interesting. Also the people are nice.&lt;/p&gt; &lt;p&gt;Though I started with the intention of doing more search projects I am currently involved in a large CMS project for a retailer, (re)building parts of their online presence. Search only plays a minor part in it but I like working with the people, it's a great work atmosphere and some of the problems they face are really interesting. Before doing the project I had to think a lot about whether I wanted to sign this long term contract but I am glad I did. Fortunately I still have time to do some short term consulting on the side (mostly single days, mostly Solr).&lt;/p&gt; &lt;h4&gt;Where Do I Get The Projects From?&lt;/h4&gt; &lt;p&gt;When starting I thought it would be a lot easier to get projects but customers are not exactly magically lining up to get my services. I try to avoid working with freelance agents, though a lot of Java projects are only possible to get through them. Most of the project inquiries I get directly are from people who know me from organizing the local Java User Group. I didn't start helping the user group for the marketing but I have to admit, it really paid off.&lt;/p&gt; &lt;p&gt;Besides that I am still working for customers of my old employer. They contact me with interesting projects and though of course they are taking their share I still earn enough for myself.&lt;/p&gt; &lt;p&gt;Most of the inquiries I get from agencies, mostly through my XING profile, are for Hybris and CoreMedia, two of the commercial systems I have worked with. I enjoy working with CoreMedia and could imagine doing projects with Hybris again but I would be far more happy if agencies contacted me for Lucene, Solr or Elasticsearch.&lt;/p&gt; &lt;p&gt;There have been some inquiries from people who found me through my blog but never something that was really doable (mostly overseas). Speaking at user groups and conferences has led to some contacts but never to a real project so far. So you could say that the marketing activities I spend most of my time on didn't pay off. But getting direct projects is not the only benefit of both of these activities. Those are also important for me for learning and growing.&lt;/p&gt; &lt;h4&gt;The Future&lt;/h4&gt; &lt;p&gt;Freelancing has been exactly the right choice for me. I managed to find projects where I can do my 4 day work week, leaving enough time for blogging, preparing talks and learning. I managed to do weekly blog posts for quite some months during the year, but cut back a bit because it became overwhelming. Starting with the new year I hope I can get back to more frequent posts. 
Also, I'll be submitting talks to conferences again and hope that I can find more time to work on the OdfToolkit.&lt;/p&gt; &lt;p&gt;I'll be staying with my current client for as long as they need me but I am determined to only do search centric projects afterwards. Also I am planning to do a bit of work in other countries in Europe with a special twist. Watch this blog for the announcement.&lt;/p&gt; &lt;p&gt;When starting with freelancing you have a lot of questions and even simple things can take some time to find out on your own. I will compile a list of resources that helped me and publish those on my blog soon. If you are just starting with freelancing you are of course also welcome to contact me anytime. &lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Book Review: Taming Text</title>
   <link href="http://blog.florian-hopf.de/2013/12/book-review-taming-text.html"/>
   <updated>2013-12-18T14:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/12/book-review-taming-text</id>
   <content type="html">&lt;p&gt;This is text. As I presume you are human you will understand the words and their meaning. Some words have multiple meanings like the word like. Also as English isn't my native tongue there will be errors in my writing but you will understand it anyway. Our brain is doing a fantastic job at inferring meaning from the context. This is something that is far more difficult for machines.&lt;/p&gt; &lt;p&gt;Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have written a book about all the difficulties that you might encounter when processing text with machines and ways to solve them. &lt;a href=&quot;http://www.manning.com/ingersoll/&quot;&gt;Taming Text&lt;/a&gt; not only shows you the theory of extracting, searching and classifying information in text but also introduces different open source projects that you can integrate in your application.&lt;/p&gt; &lt;p&gt;Each chapter focuses on one problem space and most of them can even be read in isolation. You will learn about the difficulties in understanding text, mostly caused by ambiguous meanings and the context words appear in. Tokenization and entity recognition are introduced with some basics of linguistics. Searching in text is covered well with all details on analyzing, the inverted index and the vector space model, which is also important for clustering and classification. Fuzzy string matching, the process of looking up similar strings, is shown using the famous Levenshtein distance, NGrams and Tries. A larger part of the book finally focuses on text clustering, the unsupervised process of putting documents into clusters, and classification and categorization, a learning process that needs some precategorized data.&lt;/p&gt; &lt;p&gt;Throughout all the chapters the authors introduce sample applications in Java using one or more of the open source projects that are covered. You will see an application that searches text in Mary Shelleys Frankenstein using &lt;a href=&quot;http://lucene.apache.org&quot;&gt;Apache Lucene&lt;/a&gt; and does entity recognition to identify people and places using &lt;a href=&quot;http://opennlp.apache.org/&quot;&gt;Apache OpenNLP&lt;/a&gt;. &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Apache Solr&lt;/a&gt; is mostly used for searching and OpenNLP can do extensive analysis of text like tokenization, determining sentences or parts of speech tagging. Content is extracted from different file formats using &lt;a href=&quot;http://tika.apache.org&quot;&gt;Apache Tika&lt;/a&gt;. Text clustering is shown using &lt;a href=&quot;http://project.carrot2.org/&quot;&gt;Carrot²&lt;/a&gt; for search result clustering in Solr, &lt;a href=&quot;http://mahout.apache.org&quot;&gt;Apache Mahout&lt;/a&gt; is mainly used for document clustering and classification with some help of Lucene, Solr and OpenNLP. The final example of the book builds on the knowledge of all the preceding chapters showing you an example question answering system similar to &lt;a href=&quot;http://en.wikipedia.org/wiki/Watson_%28computer%29&quot;&gt;IBM Watson&lt;/a&gt; that accepts natural language questions and tries to give correct answers from a data set extracted from Wikipedia.&lt;/p&gt; &lt;p&gt;This book is exceptional in that it covers many different topics but the authors manage to combine them in a coherent example. It is one of the books in &lt;a href=&quot;http://www.drdobbs.com/joltawards/jolt-awards-the-best-books/240162065?pgno=4&quot;&gt;this years Jolt award&lt;/a&gt; for a good reason. 
If you are doing anything with text, be it searching or analytics, you are advised to get a copy for yourself. I know that I will come back to mine again in the future when I need to refresh some of the information.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Reindexing Content in Elasticsearch with stream2es</title>
   <link href="http://blog.florian-hopf.de/2013/11/reindexing-content-in-elasticsearch_27.html"/>
   <updated>2013-11-28T14:42:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/11/reindexing-content-in-elasticsearch_27</id>
   <content type="html">&lt;p&gt;Last week I wrote about &lt;a href=&quot;http://blog.florian-hopf.de/2013/11/reindexing-content-in-elasticsearch.html&quot;&gt;reindexing content in Elasticsearch&lt;/a&gt; using a script that extracts the source field from the indexed content. You can use it for cases when your mapping changes or you need to adjust the index settings. After publishing the post &lt;a href=&quot;https://twitter.com/drewr&quot;&gt;Drew Raines&lt;/a&gt; mentioned that there is an easier way using the &lt;a href=&quot;https://github.com/elasticsearch/stream2es&quot;&gt;stream2es&lt;/a&gt; utility only. Time to have a look at it!&lt;/p&gt; &lt;p&gt;stream2es can be used to stream content from several inputs to Elasticsearch. In my last post I used it to stream a file containing the sources of documents to an Elasticsearch index. Besides that it can index data from Wikipedia or Twitter or from Elasticsearch directly, which we will look at now.&lt;/p&gt; &lt;p&gt;Again, we are indexing some documents:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/twitter/tweet/&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;user&quot; : &quot;kimchy&quot;,&lt;br /&gt;    &quot;post_date&quot; : &quot;2009-11-15T14:12:12&quot;,&lt;br /&gt;    &quot;message&quot; : &quot;trying out Elastic Search&quot;&lt;br /&gt;}'&lt;br /&gt;curl -XPOST &quot;http://localhost:9200/twitter/tweet/&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;user&quot; : &quot;kimchy&quot;,&lt;br /&gt;    &quot;post_date&quot; : &quot;2009-11-15T14:14:14&quot;,&lt;br /&gt;    &quot;message&quot; : &quot;Elasticsearch works!&quot;&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now, if we need to adjust the mapping we can just create a new index with the new mapping:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/twitter2&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;mappings&quot; : {&lt;br /&gt;        &quot;tweet&quot; : {&lt;br /&gt;            &quot;properties&quot; : {&lt;br /&gt;                &quot;user&quot; : { &quot;type&quot; : &quot;string&quot;, &quot;index&quot; : &quot;not_analyzed&quot; }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can now use stream2es to transfer the documents from the old index to the new one:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;stream2es es --source http://localhost:9200/twitter/ --target http://localhost:9200/twitter2/&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will make our documents available in the new index:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET http://localhost:9200/twitter2/_count?pretty=true&lt;br /&gt;{                                                                  &lt;br /&gt;  &quot;count&quot; : 2,                                                      &lt;br /&gt;  &quot;_shards&quot; : {                                                    &lt;br /&gt;    &quot;total&quot; : 5,                                                    &lt;br /&gt;    &quot;successful&quot; : 5,                                               &lt;br /&gt;    &quot;failed&quot; : 0                                                   &lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can now delete the old index. 
To keep your data available on the same old index name you can also create an alias that will point to your new index:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XDELETE http://localhost:9200/twitter&lt;br /&gt;curl -XPOST 'http://localhost:9200/_aliases' -d '&lt;br /&gt;{&lt;br /&gt;    &quot;actions&quot; : [&lt;br /&gt;        { &quot;add&quot; : { &quot;index&quot; : &quot;twitter2&quot;, &quot;alias&quot; : &quot;twitter&quot; } }&lt;br /&gt;    ]&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Looking at the mapping you can see that the twitter index now points to our updated version:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET http://localhost:9200/twitter/tweet/_mapping?pretty=true&lt;br /&gt;{&lt;br /&gt;  &quot;tweet&quot; : {&lt;br /&gt;    &quot;properties&quot; : {&lt;br /&gt;      &quot;bytes&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;long&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;message&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;string&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;offset&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;long&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;post_date&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;date&quot;,&lt;br /&gt;        &quot;format&quot; : &quot;dateOptionalTime&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;user&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;string&quot;,&lt;br /&gt;        &quot;index&quot; : &quot;not_analyzed&quot;,&lt;br /&gt;        &quot;omit_norms&quot; : true,&lt;br /&gt;        &quot;index_options&quot; : &quot;docs&quot;&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Reindexing Content in Elasticsearch</title>
   <link href="http://blog.florian-hopf.de/2013/11/reindexing-content-in-elasticsearch.html"/>
   <updated>2013-11-21T20:32:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/11/reindexing-content-in-elasticsearch</id>
   <content type="html">&lt;p&gt;One of the crucial parts on any search application is the way you map your content to the analyzers. It will determine which query terms match the terms that are indexed with the documents. Sometimes during development you might notice that you didn't get this right from the beginning and need to reindex your data with a new mapping. While for some applications you can easily start the indexing process again this become more difficult for others. Luckily Elasticsearch by default stores the original content in the _source field. In this short article I will show you how to use a script developed by Simon Willnauer that lets you retrieve all the data and reindex it with a new mapping.&lt;/p&gt; &lt;p&gt;&lt;i&gt;You can do the same thing in an easier way using the utility stream2es only. Look at &lt;a href=&quot;http://blog.florian-hopf.de/2013/11/reindexing-content-in-elasticsearch_27.html&quot;&gt;this post&lt;/a&gt; if you are interested&lt;/i&gt;&lt;/p&gt; &lt;h4&gt;Reindexing&lt;/h4&gt; &lt;p&gt;Suppose you have indexed documents in Elasticsearch. Imagine that those are a lot that can not be reindexed again easily or reindexing would take some time.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/twitter/tweet/&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;user&quot; : &quot;kimchy&quot;,&lt;br /&gt;    &quot;post_date&quot; : &quot;2009-11-15T14:12:12&quot;,&lt;br /&gt;    &quot;message&quot; : &quot;trying out Elastic Search&quot;&lt;br /&gt;}'&lt;br /&gt;curl -XPOST &quot;http://localhost:9200/twitter/tweet/&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;user&quot; : &quot;kimchy&quot;,&lt;br /&gt;    &quot;post_date&quot; : &quot;2009-11-15T14:14:14&quot;,&lt;br /&gt;    &quot;message&quot; : &quot;Elasticsearch works!&quot;&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Initially this will create the mapping that is determined from the values.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/twitter/tweet/_mapping?pretty=true&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;tweet&quot; : {&lt;br /&gt;    &quot;properties&quot; : {&lt;br /&gt;      &quot;message&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;string&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;post_date&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;date&quot;,&lt;br /&gt;        &quot;format&quot; : &quot;dateOptionalTime&quot;&lt;br /&gt;      },&lt;br /&gt;      &quot;user&quot; : {&lt;br /&gt;        &quot;type&quot; : &quot;string&quot;&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now if you notice that you would like to change some of the existing fields to another type you need to reindex as Elasticsearch doesn't allow you to modify the mapping for existing fields. Additional fields are fine, but not existing fields. 
You can leverage the _source field that you can also see when querying a document.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/twitter/tweet/_search?q=user:kimchy&amp;pretty=true&amp;size=1&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 3,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 2,&lt;br /&gt;    &quot;max_score&quot; : 0.30685282,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;twitter&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;tweet&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;oaFqxMnqSrex6T7_Ut-erw&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.30685282, &quot;_source&quot; : {&lt;br /&gt;    &quot;user&quot; : &quot;kimchy&quot;,&lt;br /&gt;    &quot;post_date&quot; : &quot;2009-11-15T14:12:12&quot;,&lt;br /&gt;    &quot;message&quot; : &quot;trying out Elastic Search&quot;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;For his &lt;a href=&quot;http://vimeo.com/66303050&quot;&gt;&quot;no slides no bullshit introduction to Elasticsearch&quot;&lt;/a&gt; Simon Willnauer has implemented a &lt;a href=&quot;https://github.com/s1monw/hammertime/master/bin/fetchSource.sh&quot;&gt;script&lt;/a&gt; that retrieves the _source fields for all documents of an index. After installing the prerequisites you can use it by passing in your index name:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;fetchSource.sh twitter &gt; result.json&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;It prints all the documents to stdout, which can be redirected to a file. We can now delete our index and recreate it using a different mapping.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XDELETE http://localhost:9200/twitter&lt;br /&gt;curl -XPOST &quot;http://localhost:9200/twitter&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;mappings&quot; : {&lt;br /&gt;        &quot;tweet&quot; : {&lt;br /&gt;            &quot;properties&quot; : {&lt;br /&gt;                &quot;user&quot; : { &quot;type&quot; : &quot;string&quot;, &quot;index&quot; : &quot;not_analyzed&quot; }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The file we just created can now be sent to Elasticsearch again using the handy &lt;a href=&quot;https://github.com/elasticsearch/stream2es&quot;&gt;stream2es utility&lt;/a&gt;.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;stream2es stdin --target &quot;http://localhost:9200/twitter/tweet&quot; &lt; result.json&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;All your documents are now indexed using the new mapping.&lt;/p&gt; &lt;h4&gt;Implementation&lt;/h4&gt; &lt;p&gt;Let's look at the details of the script. 
At the time of writing this post the relevant part of the script looks like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;SCROLL_ID=`curl -s -XGET 'localhost:9200/'${INDEX_NAME}'/_search?search_type=scan&amp;scroll=11m&amp;size=250' -d '{&quot;query&quot; : {&quot;match_all&quot; : {} }}' | jq '._scroll_id' | sed s/\&quot;//g`&lt;br /&gt;RESULT=`curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}`&lt;br /&gt;&lt;br /&gt;while [[ `echo ${RESULT} | jq -c '.hits.hits | length'` -gt 0 ]] ; do&lt;br /&gt;  #echo &quot;Processed batch of &quot; `echo ${RESULT} | jq -c '.hits.hits | length'`&lt;br /&gt;  SCROLL_ID=`echo $RESULT | jq '._scroll_id' | sed s/\&quot;//g`&lt;br /&gt;  echo $RESULT | jq -c '.hits.hits[] | ._source + {_id}' &lt;br /&gt;  RESULT=$(eval &quot;curl -s -XGET 'localhost:9200/_search/scroll?scroll=10m' -d ${SCROLL_ID}&quot;)&lt;br /&gt;done&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;It uses &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html&quot;&gt;scrolling&lt;/a&gt; to efficiently traverse the documents. Processing of the JSON output is done using &lt;a href=&quot;http://stedolan.github.io/jq/&quot;&gt;jq&lt;/a&gt;, a lightweight and flexible command-line JSON processor, which I should have used as well when &lt;a href=&quot;http://blog.florian-hopf.de/2013/10/switch-off-legacy-code-violations-in.html&quot;&gt;querying the SonarQube REST API&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The first line in the script creates a &lt;a href=&quot;http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#scan&quot;&gt;scan&lt;/a&gt; search that uses scrolling. The scroll will be valid for 11 minutes, returns 250 documents on each request and queries all documents, as requested with the match_all query. This response doesn't contain any documents but the _scroll_id, which is then extracted with jq. The final sed command removes the quotes around it.&lt;/p&gt; &lt;p&gt;The scroll id is now used to send queries to Elasticsearch. On each iteration it is checked if there are any hits at all. If there are, the request will return a new scroll id for the next batch. The result is echoed to the console. .hits.hits[] will return the list of all hits. Using the pipe symbol in jq processes each hit with the filter on the right that prints the source as well as the id of the hit.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;The script is a very useful addition to your Elasticsearch toolbox. You can use it to reindex or just export your content. I am glad I looked at the details of the implementation as in the future jq can come in really handy as well.&lt;/p&gt;  </content>
 </entry>
 
 <entry>
   <title>Devoxx in Tweets</title>
   <link href="http://blog.florian-hopf.de/2013/11/devoxx-in-tweets.html"/>
   <updated>2013-11-16T18:51:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/11/devoxx-in-tweets</id>
   <content type="html">&lt;p&gt;For the first time in several years I unfortunately had to skip this years &lt;a href=&quot;http://devoxx.com&quot;&gt;Devoxx&lt;/a&gt;. There are so many tweets that remind me of the good talks going on there and I thought I would do something useful with them. So again I &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/simple-event-analytics-with.html&quot;&gt;indexed them in Elasticsearch using the Twitter river&lt;/a&gt; and therefore can &lt;a href=&quot;http://blog.florian-hopf.de/2013/09/kibana-and-elasticsearch-see-what.html&quot;&gt;look at them using Kibana&lt;/a&gt;. &lt;a href=&quot;https://twitter.com/dadoonet&quot;&gt;David Pilato&lt;/a&gt; also has set up a &lt;a href=&quot;http://es-twitter.dyndns.org/_plugin/kibana/#/dashboard/elasticsearch/Devoxx%202013&quot;&gt;public instance&lt;/a&gt; and I could imagine that there will be a more thorough analysis done by the Devoxx team but here are my thoughts on this years Devoxx without having been there.&lt;/p&gt; &lt;p&gt;I'll be looking at three things: The top 10 mentions, the top 10 hashtags and the tweet distribution over time. For the mentions I have excluded @Devoxx and @java, for the hashtags I have excluded #devoxx, #dvx13 and #dv13 as the mentions and tags are too dominant and don't tell a lot. I have collected all tweets mentioning the term devoxx so there will be a lot I missed. Each retweet counts as a seperate tweet.&lt;/p&gt; &lt;h4&gt;Overall Trends&lt;/h4&gt; &lt;p&gt;Looking at the timeline of the whole week you can see that the amount of tweets is high at the beginning and continually rises with thursday having even more tweets than wednesday which is quite a surprise to me. I would have thought that the first conference day is the one that has the most tweets.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-thKbE5oayE0/UooQhsf-prI/AAAAAAAAANs/fspQF-UgevM/s1600/overall2.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-thKbE5oayE0/UooQhsf-prI/AAAAAAAAANs/fspQF-UgevM/s640/overall2.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Stephan007, the founder of Devoxx, has the most mentions which is no surprise. Chet Haase and Romain Guy are following. I have never seen a talk done by them but I probably should. The Dart language is the dominant hashtag with a lot of buzz around their 1.0 release. Java, Android and Scala are still hot technologies. Android is a bit of a surprise here.  It's nice that the initiative &lt;a href=&quot;http://www.devoxx.com/display/4KIDS/Home&quot;&gt;Devoxx4Kids&lt;/a&gt; ranks quite high.&lt;/p&gt; &lt;h4&gt;Daily Analysis&lt;/h4&gt; &lt;h5&gt;Monday&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-BdeNPgxmmhU/UodKqlgZ2qI/AAAAAAAAALw/cKJENJT1X8Y/s1600/monday.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-BdeNPgxmmhU/UodKqlgZ2qI/AAAAAAAAALw/cKJENJT1X8Y/s640/monday.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;On monday the top mention is @AngularJS. Of course this is caused by the two &lt;a href=&quot;http://angularjs.org/&quot;&gt;AngularJS&lt;/a&gt; university sessions that lasted nearly the whole day. 
Angular is a hot topic but I am not yet planning to do any work with it. The session on JavaEE 7 also created a lot of interest as can be seen by the mentions of its hosts Arun Gupta and Antonio Goncalves. They especially encouraged people to participate on Twitter which seems to have been received very well. &lt;a href=&quot;http://www.scala-lang.org/&quot;&gt;Scala&lt;/a&gt; is another hot topic with the university session by Dick Wall and Joshua Suereth that I would really have liked to see.&lt;/p&gt; &lt;h5&gt;Tuesday&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-mvkSLiarL2s/UodK1LWvdAI/AAAAAAAAAL4/xz70pk4cFok/s1600/tuesday.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-mvkSLiarL2s/UodK1LWvdAI/AAAAAAAAAL4/xz70pk4cFok/s640/tuesday.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Tuesday is dominated by the two excellent speakers Matt Raible and Venkat Subramaniam. I especially regret that I couldn't see Venkat in action, who I consider to be one of the best speakers available. I am not sure what the tag hackergarden is referring to, as I didn't find an event on Monday evening or Tuesday. There is also quite some interest in &lt;a href=&quot;https://github.com/reactor/reactor&quot;&gt;Reactor&lt;/a&gt;, the reactive framework of the Spring ecosystem.&lt;/p&gt; &lt;h5&gt;Wednesday&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-Lsa0xP7ZCbA/UodLCXuZuRI/AAAAAAAAAMA/HU3Cd8wFYqU/s1600/wednesday.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-Lsa0xP7ZCbA/UodLCXuZuRI/AAAAAAAAAMA/HU3Cd8wFYqU/s640/wednesday.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Brian Goetz got a lot of mentions for the keynote. I think it's a surprise that there are so many mentions of David Blevins for his talk EJB 3.2 and beyond, which I wouldn't have expected to be that popular. The big event of the day was the launch of &lt;a href=&quot;http://ceylon-lang.org/blog/2013/11/12/ceylon-1/&quot;&gt;Ceylon 1.0&lt;/a&gt; as can be seen from the hashtag. I heard good things about Ceylon but I still consider it an underdog of the alternative JVM languages.&lt;/p&gt; &lt;h5&gt;Thursday&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-j_62CEeoWHE/UodLLW5CBhI/AAAAAAAAAMI/VRsM7ksxPOw/s1600/thursday.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-j_62CEeoWHE/UodLLW5CBhI/AAAAAAAAAMI/VRsM7ksxPOw/s640/thursday.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Romain Guy is leading the mentions with his very popular talk &quot;Filthy Rich Android Clients&quot;, followed by Jonas Boner of &lt;a href=&quot;http://akka.io&quot;&gt;Akka&lt;/a&gt; fame and Venkat Subramaniam. The launch of &lt;a href=&quot;https://www.dartlang.org/&quot;&gt;Dart 1.0&lt;/a&gt; dominates the keywords. 
The &lt;a href=&quot;http://javaposse.com/&quot;&gt;Javaposse&lt;/a&gt; still ranks in the top 10 with their popular traditional session.&lt;/p&gt; &lt;h5&gt;Friday&lt;/h5&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-HVGD7B8rkNQ/UodMkzKk9yI/AAAAAAAAAMk/MihkF7I-yzI/s1600/friday.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-HVGD7B8rkNQ/UodMkzKk9yI/AAAAAAAAAMk/MihkF7I-yzI/s640/friday.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Friday normally has fewer participants than the other days. Joshua Suereth received a lot of tweets for his Scala talk, ranking high both in mentions and hashtags. The session on &lt;a href=&quot;http://www.google.com/glass/start/&quot;&gt;Google Glass&lt;/a&gt; also was very popular. I am not sure which session caused the mention of &lt;a href=&quot;https://github.com/square/dagger&quot;&gt;dagger&lt;/a&gt;.&lt;/p&gt; &lt;h4 id=&quot;pl-pop&quot;&gt;Programming Language Popularity&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://nava.de/&quot;&gt;Niko Schmuck&lt;/a&gt; proposed to add the language popularity over the week. As this is quite interesting here is the totally unscientific popularity chart that of course should determine which language you are learning next. I am not querying the hashtags but any mention of the terms.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-hbql6cu1FRY/UoiagrSvDoI/AAAAAAAAANc/oZ4PEvAVwQ0/s1600/language-popularity3.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-hbql6cu1FRY/UoiagrSvDoI/AAAAAAAAANc/oZ4PEvAVwQ0/s640/language-popularity3.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Java dominates but JavaScript is very strong on Monday and Thursday. Ceylon has its share on Wednesday while Thursday is the Dart day. Scala is very popular on Monday and Friday.&lt;/p&gt; &lt;p&gt;A ranked version:&lt;/p&gt; &lt;table&gt;&lt;tr&gt;&lt;td&gt;Java&lt;/td&gt;&lt;td&gt;1234&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;JS&lt;/td&gt;&lt;td&gt;584&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dart&lt;/td&gt;&lt;td&gt;490&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Scala&lt;/td&gt;&lt;td&gt;252&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Ceylon&lt;/td&gt;&lt;td&gt;172&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Groovy&lt;/td&gt;&lt;td&gt;171&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Clojure&lt;/td&gt;&lt;td&gt;94&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Kotlin&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;p&gt;&lt;/p&gt; &lt;h4&gt;The Drink Tweets&lt;/h4&gt; &lt;p&gt;As there are quite some people tweeting we can see some trends with regards to the drink tweets. 
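Charts like the following ones are basically just date histograms over all tweets that match a term. With the facet API such a query could look roughly like this (just a sketch; index, type and field names depend on how the Twitter river stores the tweets in your setup):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/devoxx/status/_search?pretty=true&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;query&quot; : { &quot;match&quot; : { &quot;text&quot; : &quot;coffee&quot; } },&lt;br /&gt;    &quot;size&quot; : 0,&lt;br /&gt;    &quot;facets&quot; : {&lt;br /&gt;        &quot;tweets_per_hour&quot; : {&lt;br /&gt;            &quot;date_histogram&quot; : { &quot;field&quot; : &quot;created_at&quot;, &quot;interval&quot; : &quot;hour&quot; }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;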
First the coffee tweets:&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-Le1xbWDhUpM/UodMqs9KvRI/AAAAAAAAAMs/a9t0COL1zt4/s1600/coffee.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-Le1xbWDhUpM/UodMqs9KvRI/AAAAAAAAAMs/a9t0COL1zt4/s640/coffee.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Quite a spike on Monday with people either mentioning that they need coffee or complaining about the coffee. This repeats on Tuesday and Wednesday in the morning; people seem to have accepted the situation on Thursday.&lt;/p&gt; &lt;p&gt;Another common topic, especially since the conference is located in Belgium, is the beer tweets.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-jpNC2_UlnOQ/UodMwA4Xz6I/AAAAAAAAAM0/BV1QgCsE5vk/s1600/beer.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-jpNC2_UlnOQ/UodMwA4Xz6I/AAAAAAAAAM0/BV1QgCsE5vk/s640/beer.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Surprise, surprise, people tend to tweet about beer in the evening. I like the huge Javaposse spike on Thursday with a lot of mentions of the beer sponsor Atlassian.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;Though I haven't been there I could get a small glimpse of the trends at this year's Devoxx. As soon as the videos are available I will buy the account for this year's conference, not only because there are so many interesting talks to see but also because the Devoxx team is doing a fantastic job that I'd like to support in any way I can.&lt;/p&gt; &lt;h6&gt;Updates&lt;/h6&gt; &lt;ul&gt;&lt;li&gt;17.11. Added a section on programming language popularity&lt;/li&gt;&lt;li&gt;18.11. Updated the weekly diagram with a more accurate one&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Lucene Solr Revolution 2013 in Dublin</title>
   <link href="http://blog.florian-hopf.de/2013/11/lucene-solr-revolution-2013-in-dublin.html"/>
   <updated>2013-11-10T19:06:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/11/lucene-solr-revolution-2013-in-dublin</id>
   <content type="html">&lt;p&gt;I just returned from &lt;a href=&quot;http://lucenerevolution.org&quot;&gt;Lucene Solr Revolution Europe&lt;/a&gt;, the conference on everything Lucene and Solr which this year was held in Dublin. I always like to recap what I took from a conference so here are some impressions.&lt;/p&gt; &lt;h4&gt;The Venue&lt;/h4&gt; &lt;p&gt;In the spirit of last years conference, which was merged with ApacheCon and held in a soccer stadium in Sinsheim, this years venue was a Rugby Stadium. It's seems to be quite common that conferences are organized there and the location was well suited. For some of the room changes you had to walk quite a distance but that's nothing that couldn't be managed.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-RLA6qdnqU5U/Un9mOXRUS3I/AAAAAAAAALY/Z9_cK8xdlXQ/s1600/2013-11-06+09.04.27.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-RLA6qdnqU5U/Un9mOXRUS3I/AAAAAAAAALY/Z9_cK8xdlXQ/s400/2013-11-06+09.04.27.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;The Talks&lt;/h4&gt; &lt;p&gt;As there were four tracks in parallel choosing the talk to attend could prove to be difficult. There were so many interesting things to choose from. Fortunately all the talks have been recorded and will be made available for free on the conference website.&lt;/p&gt; &lt;p&gt;The following are a selection of talks that I think were most valuable to me.&lt;/p&gt; &lt;h5&gt;Keynote: Michael Busch on Lucene at Twitter&lt;/h5&gt; &lt;p&gt;Michael Busch is a regular speaker at Search conferences because Twitter is doing some interesting things. On the one hand they have to handle near realtime search, massive data sets and lots of requests. On the other hand they can always be sure that their documents are of a certain size. They maintain two different data stores as Lucene indices, the realtime index that contains the most recent data and the archive index that makes older tweets searchable. They introduced the archive index only a few months ago which in my opinion led to a far more reliable search experience. They have done some really interesting things like encoding the position info of a term with the doc id because they only need few bits to address positions in a 140 character document. Also they changed some aspects of the posting list encoding because they always display results sorted by date. They are trying to make their changes more general so those can be contributed back to Lucene.&lt;/p&gt; &lt;h5&gt;Solr Indexing and Analysis Tricks by Erik Hatcher&lt;/h5&gt; &lt;p&gt;I always enjoy listening to the talks of Erik Hatcher, probably also because his &lt;a href=&quot;http://parleys.com/play/514892260364bc17fc56bde5/chapter0/about&quot;&gt;university session at Devoxx 2009&lt;/a&gt; was the driving factor for me starting to use Solr. In this years talk he presented lots of useful aspects for indexing data in Solr. One of the most interesting facts I took from this talk is the use of the &lt;a href=&quot;http://wiki.apache.org/solr/ScriptUpdateProcessor&quot;&gt;ScriptUpdateProcessor&lt;/a&gt; that is included in Solr since version 4. You can define scripts that are executed during indexing and can manipulate the document. This is a valuable alternative to copyFields, especially if you would like to have the content stored as well. 
By default you can implement the logic in JavaScript but there are alternatives available.&lt;/p&gt; &lt;h5&gt;Hacking Lucene and Solr for Fun and Profit by Grant Ingersoll&lt;/h5&gt; &lt;p&gt;Grant Ingersoll presented some applications of Lucene and Solr not directly involving search like Classification, Recommendations and Analytics. Some examples had been taken from his &lt;a href=&quot;http://www.manning.com/ingersoll/&quot;&gt;excellent book Taming Text&lt;/a&gt; (watch this blog for a review of the book in the near future).&lt;/p&gt; &lt;h5&gt;Schemaless Solr and the Solr Schema REST API by Steve Rowe&lt;/h5&gt; &lt;p&gt;One of the factors of the success of Elasticsearch is its ease of use. You can download it and start indexing documents immediately without doing any configuration work. One of the features that enables you to do this is the autodiscovery of fields by value. Starting with Solr 4.4 you can now use Solr in a similar way. You can configure that you want Solr to manage your schema. This way unknown fields are then created automatically based on the first value that is extracted by configured parsers. As with Elasticsearch you shouldn't rely on this feature exclusively so there is also a way to add new fields of a certain type via the &lt;a href=&quot;http://wiki.apache.org/solr/SchemaRESTAPI&quot;&gt;Schema REST API&lt;/a&gt;. When Solr is in managed mode it will modify the schema.xml so you might lose changes you made manually. For the future the developers are even thinking about moving away from XML for the managed mode as there are better options for when readability doesn't matter.&lt;/p&gt; &lt;h5&gt;Stump the Chump with Chris Hostetter&lt;/h5&gt; &lt;p&gt;This seems to be a tradition at Lucene Solr Revolution. Chris Hostetter has to find solutions to problems that have been submitted before or are posted by the audience. It's a fun event but you can also learn a lot.&lt;/p&gt; &lt;h5&gt;Query Latency Optimization with Lucene by Stefan Pohl&lt;/h5&gt; &lt;p&gt;Stefan first introduced some basic latency factors and how to measure them. He recommended not to instrument the low level Lucene classes when profiling your application as those rely heavily on hotspot optimizations. Besides introducing the basic mechanisms of how conjunction (AND) and disjunction (OR) work he described some recent Lucene improvements that can speed up your application, among those &lt;a href=&quot;https://issues.apache.org/jira/browse/LUCENE-4571&quot;&gt;LUCENE-4571&lt;/a&gt;, the new minShouldMatch implementation and &lt;a href=&quot;https://issues.apache.org/jira/browse/LUCENE-4752&quot;&gt;LUCENE-4752&lt;/a&gt;, which allows custom ordering of documents in the index.&lt;/p&gt; &lt;h5&gt;Relevancy Hacks for eCommerce by Varun Thacker&lt;/h5&gt; &lt;p&gt;Varun introduced the basics of relevancy sorting in Lucene and Solr and how those might affect product searches. TF/IDF is sometimes not the best solution (&quot;IDF is a measurement of rarity not necessarily importance&quot;). He also showed the ways to influence the relevancy: Implementation of a custom Similarity class, boosting and function queries.&lt;/p&gt; &lt;h5&gt;What is in a Lucene Index by Adrien Grand&lt;/h5&gt; &lt;p&gt;Adrien started with the basics of a Lucene index and how it differs from a database index: the dictionary structure, segments and merging. He then moved on to topics like the structure of the posting list, term vectors, the FST terms index and the difference between stored fields and doc values. 
This is a talk full of interesting details on the internal workings of Lucene and the implications for the performance of your application.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;As said before I couldn't attend all the talks I would have liked. I especially heard good things about the following talks, which I will watch as soon as those are available:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;Integrate Solr with Real-Time Stream Processing Applications by Timothy Potter&lt;/li&gt;&lt;li&gt;The Typed Index by Christoph Goller&lt;/li&gt;&lt;li&gt;Implementing a Custom Search Syntax Using Solr, Lucene and Parboiled by John Berryman&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;I really enjoyed Lucene Solr Revolution. Not only were there a lot of interesting talks to listen to but it was also a good opportunity to meet new people. On both evenings there were get-togethers with free drinks and food, which must have cost LucidWorks a fortune. I couldn't attend the closing remarks but I heard they announced that they want to move to smaller, national events in Europe instead of the central conference. I hope those will still be events that attract so many committers and interesting people.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Switch Off Legacy Code Violations in SonarQube</title>
   <link href="http://blog.florian-hopf.de/2013/10/switch-off-legacy-code-violations-in.html"/>
   <updated>2013-10-31T16:04:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/10/switch-off-legacy-code-violations-in</id>
   <content type="html">&lt;p&gt;While I don't believe in putting numbers on source code quality, &lt;a href=&quot;http://www.sonarqube.org/&quot;&gt;SonarQube&lt;/a&gt; (formerly known as Sonar) can be a really useful tool during development. It enforces a consistent style across your team, has discovered several possible bugs for me and is a great tool to learn: You can browse the violations and see why a certain expression or code block can be a problem.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-UTphslYDi3M/UnILXlQlmYI/AAAAAAAAALA/4kDjnuusQBI/s1600/sonar-violations.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-UTphslYDi3M/UnILXlQlmYI/AAAAAAAAALA/4kDjnuusQBI/s400/sonar-violations.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;To make sure that your code base stays in a consistent state you can also go as far as mandating that there should be no violations in the code developers check in. One of the problems with this is that a lot of projects are not green field projects and you have a lot of existing code. If your violation number already is high it is difficult to judge if no new violations were introduced.&lt;/p&gt;   &lt;p&gt;In this post I will show you how you can start with zero violations for existing code without touching the sources, something I got inspired to do by &lt;a href=&quot;http://www.schauderhaft.de/&quot;&gt;Jens Schauder&lt;/a&gt; in his great talk Working with Legacy Teams. We will ignore all violations based on the line in the file so if anybody touches the file the violations will show again and the developer is responsible for fixing the legacy violations.&lt;/p&gt; &lt;h4&gt;The Switch Off Violations Plugin&lt;/h4&gt; &lt;p&gt;We are using the &lt;a href=&quot;http://docs.codehaus.org/display/SONAR/Switch+Off+Violations+Plugin&quot;&gt;Switch Off Violations Plugin&lt;/a&gt; for SonarQube. It can be configured with different exclusion patterns for the issues. You can define regular expressions for code blocks that should be ignored or deactivate violations at all or on a file or line basis.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-6uRVSzEOed0/UnIMX1j_lHI/AAAAAAAAALI/Gx3XSa4F7yI/s1600/switch-off-violations.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-6uRVSzEOed0/UnIMX1j_lHI/AAAAAAAAALI/Gx3XSa4F7yI/s400/switch-off-violations.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;For existing code you want to ignore all violations for certain files and lines. This can be done by inserting something like this in the text area Exclusion patterns:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;de.fhopf.akka.actor.IndexingActor;pmd:SignatureDeclareThrowsException;[23]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will exclude the violation for throwing raw Exceptions in line 23 of the IndexingActor class. When analyzing the code again this violation will be ignored.&lt;/p&gt; &lt;h4&gt;Retrieving violations via the API&lt;/h4&gt; &lt;p&gt;Besides the nice dashboard SonarQube also offers an API that can be used to retrieve all the violations for a project. 
If you are not keen to look up all existing violations in your code base and insert those by hand you can use it to generate the exclusion patterns automatically. All of the violations can be found at /api/violations, e.g. http://localhost:9000/api/violations.&lt;/p&gt; &lt;p&gt;I am sure there are other ways to do it but I used &lt;a href=&quot;https://github.com/micha/jsawk&quot;&gt;jsawk&lt;/a&gt; to parse the JSON response (On Ubuntu you &lt;a href=&quot;https://github.com/micha/jsawk/issues/8&quot;&gt;have to install Spidermonkey instead of the default js interpreter.&lt;/a&gt;. And &lt;a href=&quot;http://stackoverflow.com/questions/6656904/best-way-to-get-spidermonkey-js-on-ubuntu&quot;&gt;you have to compile it yourself.&lt;/a&gt; And I had to &lt;a href=&quot;https://github.com/micha/jsawk/issues/20&quot;&gt;use a specific version&lt;/a&gt;. Sigh.).&lt;/p&gt; &lt;p&gt;Once you have set up all the components you can now use jsawk to create the exclusion patterns for all existing violations:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET 'http://localhost:9000/api/violations?depth=-1' | ./jsawk -a 'return this.join(&quot;\n&quot;)' 'return this.resource.key.split(&quot;:&quot;)[1] + &quot;;*;[&quot; + this.line + &quot;]&quot;' | sort | uniq&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will present a list that can just be pasted in the text area of the Switch Off Violations plugin or checked in to the repository as a file. With the next analysis process you will then hopefully see zero violations. When somebody changes a file by inserting a line the violations will be shown again and should be fixed. Unfortunately some violations are not line based and will yield a line number 'undefined'. Currently I just removed those manually so you still might see some violations.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;I presented one way to reset your legacy code base to zero violations. With SonarQube 4.0 the functionality of the Switch Violations Off plugin will be available in the core so it will be easier to use. I am still looking for the best way to keep the exclusion patterns up to date. Once somebody had to fix the violations for an existing file the pattern should be removed.&lt;/p&gt; &lt;h4 id=&quot;update&quot;&gt;Update 09.01.2014&lt;/h4&gt; &lt;p&gt;Starting with SonarQube 4 this approach doesn't work anymore. Some features of the SwitchOffViolations plugin have been moved to the core but excluding violations by line is not possible anymore and also will not be implemented. &lt;a href=&quot;http://sonarqube.15.x6.nabble.com/Ignoring-Violations-by-line-file-tp5020749p5020775.html&quot;&gt;The developers recommend&lt;/a&gt; to only look at the trends of the project and not the overall violation count. This can be done nicely using the &lt;a href=&quot;http://www.sonarqube.org/differentials-four-ways-to-see-whats-changed/&quot;&gt;differentials&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Elasticsearch at Scale - Kiln and GitHub</title>
   <link href="http://blog.florian-hopf.de/2013/10/elasticsearch-at-scale-kiln-and-github.html"/>
   <updated>2013-10-25T14:33:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/10/elasticsearch-at-scale-kiln-and-github</id>
   <content type="html">&lt;p&gt;Most of us are not exposed to data at real scale. It is getting more common but still I appreciate that more progressive companies that have to fight with large volumes of data are open about it and talk about their problems and solutions. GitHub and Fog Creek are two of the larger users of Elasticsearch and both have published articles and interviews on their setup. It's interesting that both of these companies are using it for a very specialized use case, source code search. As I have recently read the article on Kiln as well as the interview with the folks at GitHub I'd like to summarize some of the points they made. Visit the original links for in depth information.&lt;/p&gt; &lt;h4&gt;Elasticsearch at Fog Creek for Kiln&lt;/h4&gt; &lt;p&gt;In this &lt;a href=&quot;http://www.infoq.com/articles/kiln-elasticsearch&quot;&gt;article on InfoQ&lt;/a&gt; Kevin Gessnar, a developer at Fog Creek describes the process of migrating the code search of Kiln to Elasticsearch.&lt;/p&gt; &lt;h5&gt;Initial Position&lt;/h5&gt; &lt;p&gt;Kiln allows you to search on commit messages, filenames and file contents. For commit messages and filenames they were initially using the full text search features of SQL Server. For the file content search they were using a tool called &lt;a href=&quot;http://opengrok.github.io/OpenGrok/&quot;&gt;OpenGrok&lt;/a&gt; that leverages &lt;a href=&quot;http://en.wikipedia.org/wiki/Ctags&quot;&gt;Ctags&lt;/a&gt; to analyze the code and stores it in a Lucene index. This provided them will all of the features they needed but unfortunately the solution couldn't scale with their requirements. Queries took several seconds up to the timeout value of 30 seconds.&lt;/p&gt; &lt;p&gt;It's interesting to see that they decided against Solr because of poor read performance on heavy writes. Would be interesting to see if this is still the case for current versions.&lt;/p&gt; &lt;h5&gt;Scale&lt;/h5&gt; &lt;p&gt;They are indexing several million documents every day, which comes to terabytes of data. They are still running their production system on two nodes only. These are numbers that really surprised me. I would have guessed that you need more nodes for this amount of data (well, probably those are really big machines). They only seem to be using Elasticsearch for indexing and search but retrieve the result display data from their primary storage layer.&lt;/p&gt; &lt;h4&gt;Elasticsearch at GitHub&lt;/h4&gt; &lt;p&gt;Andrew Cholakian, who is doing a great job with writing his book &lt;a href=&quot;http://exploringelasticsearch.com/book/&quot;&gt;Exploring Elasticsearch&lt;/a&gt; in the open, published &lt;a href=&quot;http://exploringelasticsearch.com/book/elasticsearch-at-scale-interviews/interview-with-the-github-elasticsearch-team.html&quot;&gt;an interview with Tim Pease and Grant Rodgers of GitHub on their Elasticsearch setup&lt;/a&gt;, going through a lot of details.&lt;/p&gt; &lt;h5&gt;Initial Position&lt;/h5&gt; &lt;p&gt;GitHub used to have their search based on Solr. As the volume of data and search increased they needed a solution that scales. Again, I would be interested if current versions of Solr Cloud could handle this volume.&lt;/p&gt; &lt;h5&gt;Scale&lt;/h5&gt; &lt;p&gt;They are really searching big data. 44 Amazon EC2 instances power search on 2 billion documents which make up 30 terabyte of data. 8 instances don't hold any data but are only there to distribute the queries. 
They are planning to move from the 44 Amazon instances to 8 larger physical machines. Besides their user facing data they are indexing internal data like audit logs and exceptions (it isn't clear to me from the interview if in this case Elasticsearch is their primary data store which would be remarkable). They are using different clusters for different data types so that the external search is not affected when there are a lot of exceptions.&lt;/p&gt; &lt;h5&gt;Challenges&lt;/h5&gt; &lt;p&gt;Shortly after launching their new search feature people started discovering that you could also search for files people had accidentally committed, like private ssh keys or passwords. This is an interesting phenomenon where just the possibility for better retrieval made a huge difference. All the information had been there before but it just couldn't be found easily. This led to an increase in search volume that was not anticipated. Due to some configuration issues (a suboptimal Java version, no setting for the minimum number of master nodes) their cluster became unstable and they had to disable search for the whole site.&lt;/p&gt; &lt;h5&gt;Further Takeaways&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Use routing to keep your data together on one shard&lt;/li&gt;&lt;li&gt;Thrift seems to be far more complicated from an ops point of view compared to HTTP&lt;/li&gt;&lt;li&gt;Use the slow query log&lt;/li&gt;&lt;li&gt;Time slicing your indices is a good idea if the data allows&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;A Common Theme&lt;/h4&gt; &lt;p&gt;Both of these articles have some observations in common:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;Elasticsearch is easy to get started with&lt;/li&gt;&lt;li&gt;Scaling is not an issue&lt;/li&gt;&lt;li&gt;the HTTP interface is good for debugging and operations&lt;/li&gt;&lt;li&gt;the Elasticsearch community and the company are really helpful when it comes to problems&lt;/li&gt;&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Cope with Failure - Actor Supervision in Akka</title>
   <link href="http://blog.florian-hopf.de/2013/10/cope-with-failure-actor-supervision-in.html"/>
   <updated>2013-10-11T14:04:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/10/cope-with-failure-actor-supervision-in</id>
   <content type="html">&lt;p&gt;A while ago I showed &lt;a href=&quot;http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html&quot;&gt;an example on how to use Akka to scale a simple application with multiple threads&lt;/a&gt;. Tasks can be split into several actors that communicate via immutable messages. State is encapsulated and each actor can be scaled independently. While implementing an actor you don't have to take care of low level building blocks like Threads and synchronization so it is far more easy to reason about the application.&lt;/p&gt; &lt;p&gt;Besides these obvious benefits, fault tolerance is another important aspect. In this post I'd like to show you how you can leverage some of Akkas characteristics to make our example more robust.&lt;/p&gt; &lt;h4&gt;The Application&lt;/h4&gt; &lt;p&gt;To recap, we are building a simple web site crawler in Java to index pages in Lucene. The full code of the examples is available on &lt;a href=&quot;https://github.com/fhopf/akka-crawler-example&quot;&gt;GitHub&lt;/a&gt;. We are using three actors: one which carries the information on the pages to be visited and visited already, one that downloads and parses the pages and one that indexes the pages in Lucene.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-I5TC1mL-bHI/UC5hMo0rtkI/AAAAAAAAAE8/6OGAL2hsBzg/s1600/actor-setup.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;214&quot; width=&quot;320&quot; src=&quot;http://2.bp.blogspot.com/-I5TC1mL-bHI/UC5hMo0rtkI/AAAAAAAAAE8/6OGAL2hsBzg/s320/actor-setup.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;By using several actors to download and parse pages we could see some good performance improvements.&lt;/p&gt; &lt;h4&gt;What could possibly go wrong?&lt;/h4&gt; &lt;p&gt;Things will fail. We are relying on external services (the page we are crawling) and therefore the network. Requests could time out or our parser could choke on the input. To make our example somewhat reproducible I just simulated an error. A new PageRetriever, the ChaosMonkeyPageRetriever sometimes just throws an Exception:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Override&lt;br /&gt;public PageContent fetchPageContent(String url) {&lt;br /&gt;    // this error rate is derived from scientific measurements&lt;br /&gt;    if (System.currentTimeMillis() % 20 == 0) {&lt;br /&gt;      throw new RetrievalException(&quot;Something went horribly wrong when fetching the page.&quot;);&lt;br /&gt;    }&lt;br /&gt;    return super.fetchPageContent(url);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can surely imagine what happens when we use this retriever in the sequential example that doesn't use Akka or threads. As we didn't take care of the failure our application just stops when the Exception occurs. One way we could mitigate this is by surrounding statements with try/catch-Blocks but this will soon intermingle a lot of recovery and fault processing code with our application logic. Once we have an application that is running in multiple threads fault processing gets a lot harder. There is no easy way to notify other Threads or save the state of the failing thread.&lt;/p&gt; &lt;h4&gt;Supervision&lt;/h4&gt; &lt;p&gt;Let's see Akkas behavior in case of an error. 
I added some logging that indicates the current state of the visited pages.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;1939 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:  55, allPages:  60&lt;br /&gt;1952 [default-akka.actor.default-dispatcher-4] INFO de.fhopf.akka.actor.Master - inProgress:  54, allPages:  60&lt;br /&gt;[ERROR] [10/10/2013 06:47:39.752] [default-akka.actor.default-dispatcher-5] [akka://default/user/$a/$a] Something went horribly wrong when fetching the page.&lt;br /&gt;de.fhopf.akka.RetrievalException: Something went horribly wrong when fetching the page.&lt;br /&gt;        at de.fhopf.akka.actor.parallel.ChaosMonkeyPageRetriever.fetchPageContent(ChaosMonkeyPageRetriever.java:21)&lt;br /&gt;        at de.fhopf.akka.actor.PageParsingActor.onReceive(PageParsingActor.java:26)&lt;br /&gt;        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)&lt;br /&gt;        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)&lt;br /&gt;        at akka.actor.ActorCell.invoke(ActorCell.scala:456)&lt;br /&gt;        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)&lt;br /&gt;        at akka.dispatch.Mailbox.run(Mailbox.scala:219)&lt;br /&gt;        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)&lt;br /&gt;        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)&lt;br /&gt;        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)&lt;br /&gt;        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)&lt;br /&gt;        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)&lt;br /&gt;&lt;br /&gt;1998 [default-akka.actor.default-dispatcher-8] INFO de.fhopf.akka.actor.Master - inProgress:  53, allPages:  60&lt;br /&gt;2001 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException&lt;br /&gt;2001 [default-akka.actor.default-dispatcher-2] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException&lt;br /&gt;2001 [default-akka.actor.default-dispatcher-10] INFO de.fhopf.akka.actor.PageParsingActor - Restarting PageParsingActor because of class de.fhopf.akka.RetrievalException&lt;br /&gt;[...]&lt;br /&gt;2469 [default-akka.actor.default-dispatcher-12] INFO de.fhopf.akka.actor.Master - inProgress:   8, allPages:  78&lt;br /&gt;2487 [default-akka.actor.default-dispatcher-7] INFO de.fhopf.akka.actor.Master - inProgress:   7, allPages:  78&lt;br /&gt;2497 [default-akka.actor.default-dispatcher-5] INFO de.fhopf.akka.actor.Master - inProgress:   6, allPages:  78&lt;br /&gt;2540 [default-akka.actor.default-dispatcher-13] INFO de.fhopf.akka.actor.Master - inProgress:   5, allPages:  78&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We can see each exception that is happening in the log file but our application keeps running. That is because of &lt;a href=&quot;http://doc.akka.io/docs/akka/snapshot/general/supervision.html&quot;&gt;Akka's supervision support&lt;/a&gt;. Actors form hierarchies where our PageParsingActor is a child of the Master actor because it is created from its context. The Master is responsible for determining the fault strategy for its children. By default it will restart the Actor in case of an exception, which makes sure that the next message is processed correctly. 
This means even in case of an error Akka tries to keep the system in a running state.&lt;/p&gt;     &lt;p&gt;The reaction to a failure is determined by the method supervisorStrategy() in the parent actor. Based on an Exception class you can choose several outcomes:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;resume: Keep the actor running as if nothing had happened&lt;/li&gt;&lt;li&gt;restart: Replace the failing actor with a new instance&lt;/li&gt;&lt;li&gt;stop: Stop the failing actor&lt;/li&gt;&lt;li&gt;escalate: Let your own parent decide on what to do&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;A supervisor that would restart the actor for our exception and escalate otherwise could be added like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;// allow 100 restarts in 1 minute ... this is a lot but the chaos monkey is rather busy&lt;br /&gt;private SupervisorStrategy supervisorStrategy = new OneForOneStrategy(100, Duration.create(&quot;1 minute&quot;), new Function&lt;Throwable, Directive&gt;() {&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public Directive apply(Throwable t) throws Exception {&lt;br /&gt;        if (t instanceof RetrievalException) {&lt;br /&gt;            return SupervisorStrategy.restart();&lt;br /&gt;        }&lt;br /&gt;        // it would be best to model the default behaviour in other cases&lt;br /&gt;        return SupervisorStrategy.escalate();&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;});&lt;br /&gt;&lt;br /&gt;@Override&lt;br /&gt;public SupervisorStrategy supervisorStrategy() {&lt;br /&gt;    return supervisorStrategy;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Let's come back to our example. Though Akka takes care of restarting our failing actors the end result doesn't look good. The application continues to run after several exceptions but then it just stops and hangs. This is caused by our business logic. The Master actor keeps all pages to visit in the VisitedPageStore and only commits the Lucene index when all pages are visited. As we had several failures we didn't receive the result for those pages and the Master still waits.&lt;/p&gt; &lt;p&gt;One way to fix this is to resend the message once the actor is restarted. Each Actor class can implement some methods that hook into the actor's lifecycle. In preRestart() we can just send the message again.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Override&lt;br /&gt;public void preRestart(Throwable reason, Option&amp;lt;Object&amp;gt; message) throws Exception {&lt;br /&gt;    logger.info(&quot;Restarting PageParsingActor and resending message '{}'&quot;, message);&lt;br /&gt;    if (message.nonEmpty()) {&lt;br /&gt;        getSelf().forward(message.get(), getContext());&lt;br /&gt;    }&lt;br /&gt;    super.preRestart(reason, message);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now if we run this example we can see our actors recover from the failure. Though some exceptions are happening all pages get visited eventually and everything will be indexed and committed in Lucene.&lt;/p&gt; &lt;p&gt;Though resending seems to be the solution to our failures you need to be careful not to break your system with it: For some applications the message might be the cause for the failure and by resending it you will keep your system busy with it in a livelock state. When using this approach you should at least add a count to the message that you can increment on restart. 
Once it is sent too often you can then escalate the failure to have it handled in a different way.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;We have only handled one certain type of failure but you can already see how powerful Akka can be when it comes to fault tolerance. Recovery code is completely separated from the business code. To learn more on different aspects of error handling read the Akka documentation on &lt;a href=&quot;http://doc.akka.io/docs/akka/snapshot/general/supervision.html&quot;&gt;supervision&lt;/a&gt; and &lt;a href=&quot;http://doc.akka.io/docs/akka/snapshot/java/fault-tolerance.html&quot;&gt;fault tolerance&lt;/a&gt; or &lt;a href=&quot;http://danielwestheide.com/blog/2013/03/20/the-neophytes-guide-to-scala-part-15-dealing-with-failure-in-actor-systems.html&quot;&gt;this excellent article by Daniel Westheide&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Brian Foote on Prototyping</title>
   <link href="http://blog.florian-hopf.de/2013/10/brian-foote-on-prototyping.html"/>
   <updated>2013-10-04T15:09:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/10/brian-foote-on-prototyping</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://www.laputan.org/mud/mud.html&quot;&gt;Big Ball of Mud&lt;/a&gt; is a collection of patterns by &lt;a href=&quot;http://www.laputan.org/&quot;&gt;Brian Foote&lt;/a&gt;, published in 1999. The title stems from one of the patterns, Big Ball of Mud, the &quot;most frequently deployed of software architectures&quot;. Though this might sound like a joke at first the article contains a lot of really useful information on the forces at work when dealing with large codebases and legacy code. I especially like his take on &lt;a href=&quot;http://www.laputan.org/mud/mud.html#ThrowAwayCode&quot;&gt;prototyping applications&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Did you ever find yourself in the following situation? A customer agrees to build a prototype of an application to learn and to see something in action. After the prototype is finished the customer tries to force you to reuse the prototype, &quot;because it already does what we need&quot;. Cold sweat, you probably were taking lots of shortcuts in the prototype and didn't build it with maintainability in mind. This is what Brian recommends to circumvent this situation:&lt;/p&gt; &lt;blockquote&gt;One way to minimize the risk of a prototype being put into production is to write the prototype in using a language or tool that you couldn't possibly use for a production version of your product.&lt;/blockquote&gt; &lt;p&gt;Three observations:&lt;/p&gt; &lt;ol&gt;&lt;li&gt;Nowadays the choice of languages doesn't matter that much for running the code in production with virtual machines that support a lot of languages.&lt;/li&gt;&lt;li&gt;This only holds true for prototypes that are used to explore the domain. When doing a technical proof of concept at least some parts of the prototype need to use the intended technology.&lt;/li&gt;&lt;li&gt;Prototypes are sometimes also used to make the team familiar with a new technology that is set for the project.&lt;/li&gt;&lt;/ol&gt; &lt;p&gt;Nevertheless this is a really useful advice to keep in mind.&lt;/p&gt;   </content>
 </entry>
 
 <entry>
   <title>Feature Toggles in JSP with Togglz</title>
   <link href="http://blog.florian-hopf.de/2013/09/feature-toggles-in-jsp-with-togglz.html"/>
   <updated>2013-09-27T16:30:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/09/feature-toggles-in-jsp-with-togglz</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://martinfowler.com/bliki/FeatureToggle.html&quot;&gt;Feature Toggles&lt;/a&gt; are a useful pattern when you are working on several features but want to keep your application in a deployable state. One of the implementations of the pattern available for Java is &lt;a href=&quot;http://togglz.org&quot;&gt;Togglz&lt;/a&gt;. It provides ways to check if a feature is enabled programmatically, from JSF or JSP pages or even when wiring Spring beans. I couldn't find a single example on how to use the JSP support so I created an &lt;a href=&quot;https://github.com/fhopf/togglz-jsp-example&quot;&gt;example project and pushed it to GitHub&lt;/a&gt;. In this post I will show you the basics of Togglz and how to use it in Java Server Pages.&lt;/p&gt; &lt;h4&gt;Togglz&lt;/h4&gt; &lt;p&gt;Features that you want to make configurable are described with a Java Enum. This is an example with two features that can be enabled or disabled:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public enum ToggledFeature implements Feature {&lt;br /&gt;&lt;br /&gt;    TEXT,&lt;br /&gt;    MORE_TEXT;&lt;br /&gt;&lt;br /&gt;    public boolean isActive() {&lt;br /&gt;        return FeatureContext.getFeatureManager().isActive(this);&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This Enum can then be used to check if a feature is enabled in any part of your code:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;if (ToggledFeature.TEXT.isActive()) {&lt;br /&gt;    // do something clever&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The config class is used to wire the feature enum with a configuration mechanism:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class ToggledFeatureConfiguration implements TogglzConfig {&lt;br /&gt;&lt;br /&gt;    public Class&amp;lt;? extends Feature&amp;gt; getFeatureClass() {&lt;br /&gt;        return ToggledFeature.class;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public StateRepository getStateRepository() {&lt;br /&gt;        return new FileBasedStateRepository(new File(&amp;quot;/tmp/features.properties&amp;quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public UserProvider getUserProvider() {&lt;br /&gt;        return new ServletUserProvider(&amp;quot;ADMIN_ROLE&amp;quot;);&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The StateRepository is used for enabling and disabling features. 
For this example we are using the file based one but there are others available.&lt;/p&gt; &lt;p&gt;To configure Togglz for your webapp you can either do it using CDI, Spring or via manual configuration in the web.xml:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;web-app xmlns=&amp;quot;http://java.sun.com/xml/ns/javaee&amp;quot;&lt;br /&gt;    xmlns:xsi=&amp;quot;http://www.w3.org/2001/XMLSchema-instance&amp;quot;&lt;br /&gt;    xsi:schemaLocation=&amp;quot;http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd&amp;quot;&lt;br /&gt;    version=&amp;quot;3.0&amp;quot;&amp;gt;&lt;br /&gt;&lt;br /&gt;    &amp;lt;context-param&amp;gt;&lt;br /&gt;        &amp;lt;param-name&amp;gt;org.togglz.core.manager.TogglzConfig&amp;lt;/param-name&amp;gt;&lt;br /&gt;        &amp;lt;param-value&amp;gt;de.fhopf.togglz.ToggledFeatureConfiguration&amp;lt;/param-value&amp;gt;&lt;br /&gt;    &amp;lt;/context-param&amp;gt;&lt;br /&gt;&lt;br /&gt;    &amp;lt;filter&amp;gt;&lt;br /&gt;        &amp;lt;filter-name&amp;gt;TogglzFilter&amp;lt;/filter-name&amp;gt;&lt;br /&gt;        &amp;lt;filter-class&amp;gt;org.togglz.servlet.TogglzFilter&amp;lt;/filter-class&amp;gt;&lt;br /&gt;    &amp;lt;/filter&amp;gt;&lt;br /&gt;    &amp;lt;filter-mapping&amp;gt;&lt;br /&gt;        &amp;lt;filter-name&amp;gt;TogglzFilter&amp;lt;/filter-name&amp;gt;&lt;br /&gt;        &amp;lt;url-pattern&amp;gt;/*&amp;lt;/url-pattern&amp;gt;&lt;br /&gt;    &amp;lt;/filter-mapping&amp;gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;/web-app&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;In my example I had to add the filter manually, though with Servlet 3.0 this shouldn't be necessary. I am not sure if this is caused by the way Gradle runs Jetty or if this is always the case when doing the configuration via a context-param.&lt;/p&gt; &lt;h4&gt;Togglz with Java Server Pages&lt;/h4&gt; &lt;p&gt;For the integration of Togglz in JSPs you need to add the dependency togglz-jsp to your project. It contains a tag that can be used to wrap content that can then be enabled or disabled. A simple example for our ToggledFeature:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;%@ taglib uri=&amp;quot;http://togglz.org/taglib&amp;quot; prefix=&amp;quot;togglz&amp;quot; %&amp;gt;&lt;br /&gt;&lt;br /&gt;This is some text that is always shown.&lt;br /&gt;&lt;br /&gt;&amp;lt;togglz:feature name=&amp;quot;TEXT&amp;quot;&amp;gt;&lt;br /&gt;This is the text of the TEXT feature.&lt;br /&gt;&amp;lt;/togglz:feature&amp;gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;togglz:feature name=&amp;quot;MORE_TEXT&amp;quot;&amp;gt;&lt;br /&gt;This is the text of the MORE_TEXT feature.&lt;br /&gt;&amp;lt;/togglz:feature&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Both features will be disabled by default so you will only see the first sentence. You can control which features are enabled (even at runtime) in /tmp/features.properties. This is what it looks like when the TEXT feature is enabled:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TEXT=true&lt;br /&gt;MORE_TEXT=false&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;A Word of Caution&lt;/h4&gt; &lt;p&gt;I am just starting to use feature toggles in an application, so I wouldn't call myself experienced. But I have the impression that you need to be really disciplined when using them. Old feature toggles that are not used should be removed as soon as possible. 
Unfortunately the huge benefit of compile-time safety that Java gives you when removing a feature from the enum is gone with JSPs; there the feature names are only Strings, so you will have to do some file searches when removing a feature.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Kibana and Elasticsearch: See What Tweets Can Say About a Conference</title>
   <link href="http://blog.florian-hopf.de/2013/09/kibana-and-elasticsearch-see-what.html"/>
   <updated>2013-09-20T14:38:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/09/kibana-and-elasticsearch-see-what</id>
   <content type="html">&lt;p&gt;In &lt;a href=&quot;/2013/09/simple-event-analytics-with.html&quot;&gt;my last post&lt;/a&gt; I showed how you can index tweets for an event in &lt;a href=&quot;http://elasticsearch.org&quot;&gt;Elasticsearch&lt;/a&gt; and how to do some simple queries on it using its HTTP API. This week I will show how you can use &lt;a href=&quot;http://three.kibana.org&quot;&gt;Kibana 3&lt;/a&gt; to visualize the data and make it explorable without having to learn the Elasticsearch API.&lt;/p&gt; &lt;h4&gt;Installing Kibana&lt;/h4&gt; &lt;p&gt;Kibana 3 is a pure HTML/JS frontend for Elasticsearch that you can use to build dashboards for your data. We are still working with the example data that is indexed using the Twitter River. It consists of tweets for &lt;a href=&quot;http://froscon.org&quot;&gt;FrOSCon&lt;/a&gt; but can be anything, especially data that contains some kind of timestamp as is the case for tweets. To install Kibana you can just fetch it from the GitHub repository (Note: now there are also &lt;a href=&quot;http://download.elasticsearch.org/kibana/kibana/kibana-latest.zip&quot;&gt;prepackaged archives&lt;/a&gt; available that you can download without cloning the repository):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;git clone https://github.com/elasticsearch/kibana.git&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You will now have a folder kibana that contains the HTML files as well as all the assets needed. The files need to be served by a webserver so you can just copy the folder to a directory that e.g. Apache is serving. If you don't have a webserver installed you can simply serve the current directory using Python:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;python -m SimpleHTTPServer 8080&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This will make Kibana available at &lt;a href=&quot;http://localhost:8080/kibana/src&quot;&gt;http://localhost:8080/kibana/src&lt;/a&gt;. With the default configuration Elasticsearch needs to be running on the same machine as well.&lt;/p&gt; &lt;h4&gt;Dashboards&lt;/h4&gt; &lt;p&gt;A dashboard in Kibana consists of rows that can contain different panels. Each panel can either display data, control which data is being displayed or both. Panels do not stand on their own; the results that are being displayed are the same for the whole dashboard. So if you choose something in one panel you will notice that the other panels on the page will also get updated with new values.&lt;/p&gt; &lt;p&gt;When accessing Kibana you are directed to a welcome page from where you can choose between several dashboard templates. As Kibana is often used for logfile analytics there is an existing dashboard that is preconfigured to work with Logstash data. 
Another generic dashboard can be used to query some data from the index, but we'll use the option &quot;Unconfigured Dashboard&quot;, which gives some hints on which panels you might want to have.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-OQSUVJHYPuo/Ujvnh5nIFBI/AAAAAAAAAJ0/JkqwXBNzQHw/s1600/screenshot-kibana-intro.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-OQSUVJHYPuo/Ujvnh5nIFBI/AAAAAAAAAJ0/JkqwXBNzQHw/s640/screenshot-kibana-intro.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;This will present you with a dashboard that contains some rows and panels already.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-qWRCO45THU8/UjvnnNnRLvI/AAAAAAAAAJ8/qtYpaq2jJ9c/s1600/screenshot-basic-dashboard.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-qWRCO45THU8/UjvnnNnRLvI/AAAAAAAAAJ8/qtYpaq2jJ9c/s640/screenshot-basic-dashboard.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Starting from the top it contains these rows:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;The &quot;Options&quot; row that contains one text panel&lt;/li&gt;&lt;li&gt;The &quot;Query&quot; row that contains a text query panel&lt;/li&gt;&lt;li&gt;A hidden &quot;Filter&quot; row that contains a text panel and the filter panel. The row can be toggled visible by clicking on the text Filter on the left.&lt;/li&gt;&lt;li&gt;The &quot;Graph&quot; row that contains two text panels&lt;/li&gt;&lt;li&gt;The large &quot;Table&quot; row that contains one text panel.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;Those panels are already laid out in a way that they can display the widgets that are described in the text. We will now add those to get some data from the event tweets.&lt;/p&gt; &lt;h4&gt;Building a Dashboard&lt;/h4&gt; &lt;p&gt;The text panels are only there to guide you when adding the widgets you need and can then be removed. To add or remove panels for a row you can click the little gear next to the title of the row. This will open an options menu. For the top row we are choosing a timepicker panel with a default mode of absolute. This gives you the opportunity to choose a start and end date for your data. The field that contains the timestamp is called &quot;created_at&quot;. After saving you can also remove the text panel on the second tab.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-59Xq1VvIZQA/Ujvn6POVx3I/AAAAAAAAAKA/_nWy6yZ5opc/s1600/screenshot-timepicker.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-59Xq1VvIZQA/Ujvn6POVx3I/AAAAAAAAAKA/_nWy6yZ5opc/s640/screenshot-timepicker.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;If you now open the &quot;Filter&quot; row you will see that a filter is displayed. It is best to keep this row open to see which filters are currently applied. You can remove the text panel in the row.&lt;/p&gt;   &lt;p&gt;In the graph section we will add two graph panels instead of the text panels: A pie chart that displays the terms of the tweet texts and a date histogram that shows how many tweets there are for a certain time. 
For the pie chart we use the field &quot;text&quot; and exclude some common terms again. Note that if you add terms to the excluded terms after the panel has already been created, you need to initiate another query, e.g. by clicking the button in the timepicker. For the date histogram we are again choosing the timestamp field &quot;created_at&quot;. &lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-w7NjvRFj92g/Ujvn92FjhSI/AAAAAAAAAKI/O3gHLkOcvDs/s1600/screenshot-pie.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-w7NjvRFj92g/Ujvn92FjhSI/AAAAAAAAAKI/O3gHLkOcvDs/s640/screenshot-pie.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Finally, in the last row we are adding a table to display the resulting tweet documents. Besides adding the columns &quot;text&quot;, &quot;user.screen_name&quot; and &quot;created_at&quot; we can leave the settings as they are proposed.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-v2aT8LrQYMg/UjvoCOJ8O6I/AAAAAAAAAKQ/2qJd_x7C78U/s1600/screenshot-table.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-v2aT8LrQYMg/UjvoCOJ8O6I/AAAAAAAAAKQ/2qJd_x7C78U/s640/screenshot-table.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;We now have a dashboard to play with the data and see the results immediately. Data can be explored using any of the displays: you can click in the pie chart to choose a certain term or choose a time range in the date histogram. This makes it really easy to work with the data.&lt;/p&gt; &lt;h4&gt;Answering questions&lt;/h4&gt; &lt;p&gt;Now we have a visual representation of all the terms and the time of day people are tweeting most. As you can see, people are tweeting slightly more during the beginning of the day.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-4GsQ23kD2qI/Ujvrm2yg7EI/AAAAAAAAAKs/anWPoqUyhrs/s1600/screenshot-tweets.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-4GsQ23kD2qI/Ujvrm2yg7EI/AAAAAAAAAKs/anWPoqUyhrs/s640/screenshot-tweets.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;You can now check for any relevant terms that you might be interested in. For example, let's see when people tweet about beer. As we do have tweets in multiple languages (German, English and people from Cologne) we need to add some variation. 
We can enter the query &lt;code&gt;text:bier* OR text:beer* OR text:kölsch&lt;/code&gt; in the query box.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-UbcqvFM9tgQ/UjvoQOz379I/AAAAAAAAAKY/1Sxt1hFZBMg/s1600/screenshot-bier.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-UbcqvFM9tgQ/UjvoQOz379I/AAAAAAAAAKY/1Sxt1hFZBMg/s640/screenshot-bier.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There are only a few tweets about it but it will be a total surprise to you that most of the tweets about beer tend to be sent later during the day (I won't go into detail why there are so many tweets mentioning the terms horse and piss when talking about Kölsch).&lt;/p&gt; &lt;p&gt;Some more surprising facts: There is not a single tweet mentioning Java but a lot of tweets that mention PHP, especially during the first day. This day seems to be far more successful for the PHP dev room.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-034G7aZqkFk/UjvoUSX14mI/AAAAAAAAAKg/TxMgPn1M_zI/s1600/screenshot-php.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-034G7aZqkFk/UjvoUSX14mI/AAAAAAAAAKg/TxMgPn1M_zI/s640/screenshot-php.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;I hope that I could give you some hints on how powerful Kibana can be when it comes to analyzing data, not only log data. If you'd like to read another detailed step-by-step guide on using Kibana to visualize Twitter data, have a look at &lt;a href=&quot;http://lbroudoux.wordpress.com/2013/04/30/real-time-analytics-with-elasticsearch-and-kibana3/&quot;&gt;this article by Laurent Broudoux.&lt;/a&gt;&lt;/p&gt; </content>
 </entry>
 
 <entry>
   <title>Simple Event Analytics with ElasticSearch and the Twitter River</title>
   <link href="http://blog.florian-hopf.de/2013/09/simple-event-analytics-with.html"/>
   <updated>2013-09-11T14:25:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/09/simple-event-analytics-with</id>
   <content type="html">&lt;p&gt;Tweets can say a lot about an event. The hashtags that are used and the times at which people tweet can be interesting to see. Some of the questions you might want answers to:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;Who tweeted the most?&lt;/li&gt;&lt;li&gt;What are the dominant keywords/hashtags?&lt;/li&gt;&lt;li&gt;When is the time people are tweeting the most?&lt;/li&gt;&lt;li&gt;And, most importantly: Is there a correlation between the time and the amount of tweets mentioning coffee or beer?&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;During this year's &lt;a href=&quot;http://froscon.org&quot;&gt;FrOSCon&lt;/a&gt; I indexed all relevant tweets in &lt;a href=&quot;http://elasticsearch.org&quot;&gt;ElasticSearch&lt;/a&gt; using the &lt;a href=&quot;https://github.com/elasticsearch/elasticsearch-river-twitter&quot;&gt;Twitter River&lt;/a&gt;. In this post I'll show you how you can index tweets in ElasticSearch to have a dataset you can do analytics with. We will see how we can get answers to the first two questions using the ElasticSearch Query DSL. Next week I will show how &lt;a href=&quot;http://three.kibana.org&quot;&gt;Kibana&lt;/a&gt; can help you to get a visual representation of the data.&lt;/p&gt; &lt;h4&gt;Indexing Tweets in ElasticSearch&lt;/h4&gt; &lt;p&gt;To run ElasticSearch you need to have a recent version of Java installed. Then you can just &lt;a href=&quot;http://www.elasticsearch.org/download/&quot;&gt;download the archive&lt;/a&gt; and unpack it. It contains a bin directory with the necessary scripts to start ElasticSearch:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/elasticsearch -f&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;-f makes sure that ElasticSearch starts in the foreground so you can also stop it using Ctrl-C. You can see if your installation is working by opening &lt;a href=&quot;http://localhost:9200&quot;&gt;http://localhost:9200&lt;/a&gt; in your browser.&lt;/p&gt; &lt;p&gt;After stopping it again we need to install the ElasticSearch Twitter River that uses the Twitter streaming API to get all the tweets we are interested in.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Twitter doesn't allow anonymous access to its API anymore so you need to register for the OAuth access at &lt;a href=&quot;https://dev.twitter.com/apps&quot;&gt;https://dev.twitter.com/apps&lt;/a&gt;. Choose a name for your application and generate the key and token. Those will be needed to configure the plugin via the REST API. 
In the configuration you need to pass your OAuth information as well as any keyword you would like to track and the index that should be used to store the data.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT localhost:9200/_river/frosconriver/_meta -d '&lt;br /&gt;{&lt;br /&gt;    &quot;type&quot; : &quot;twitter&quot;,&lt;br /&gt;    &quot;twitter&quot; : {&lt;br /&gt;        &quot;oauth&quot; : {&lt;br /&gt;            &quot;consumer_key&quot; : &quot;YOUR_KEY&quot;,&lt;br /&gt;            &quot;consumer_secret&quot; : &quot;YOUR_SECRET&quot;,&lt;br /&gt;            &quot;access_token&quot; : &quot;YOUR_TOKEN&quot;,&lt;br /&gt;            &quot;access_token_secret&quot; : &quot;YOUR_TOKEN_SECRET&quot;&lt;br /&gt;        },&lt;br /&gt;        &quot;filter&quot; : {&lt;br /&gt;            &quot;tracks&quot; : &quot;froscon&quot;&lt;br /&gt;        }&lt;br /&gt;    },&lt;br /&gt;    &quot;index&quot; : {&lt;br /&gt;        &quot;index&quot; : &quot;froscon&quot;,&lt;br /&gt;        &quot;type&quot; : &quot;tweet&quot;,&lt;br /&gt;        &quot;bulk_size&quot; : 1&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The index doesn't need to exist yet; it will be created automatically. I am using a bulk size of 1 as there aren't really many tweets. If you are indexing a lot of data you might consider setting this to a higher value.&lt;/p&gt; &lt;p&gt;After issuing the call you should see some information in the logs that the river is starting and receiving data. You can see how many tweets there are in your index by issuing a count query:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'localhost:9200/froscon/_count?pretty=true'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can see the basic structure of the created documents by looking at the mapping that is generated automatically.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;http://localhost:9200/froscon/_mapping?pretty=true&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The result is quite long so I am not replicating it here, but it contains all the relevant information you might be interested in, like the user who tweeted, the location of the user, the text, the mentions and any links in it.&lt;/p&gt; &lt;h4&gt;Doing Analytics Using the ElasticSearch REST API&lt;/h4&gt; &lt;p&gt;Once you have enough tweets indexed you can already do some analytics using the ElasticSearch REST API and the Query DSL. This requires you to have some understanding of the query syntax but you should be able to get started by skimming through &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/query-dsl/&quot;&gt;the documentation&lt;/a&gt;.&lt;/p&gt; &lt;h5&gt;Top Tweeters&lt;/h5&gt; &lt;p&gt;First, we'd like to see who tweeted the most. This can be done by doing a query for all documents and faceting on the user name. 
This will give us the names and counts in a section of the response.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -X POST &quot;http://localhost:9200/froscon/_search?pretty=true&quot; -d '&lt;br /&gt;  {&lt;br /&gt;    &quot;size&quot;: 0,&lt;br /&gt;    &quot;query&quot; : {&lt;br /&gt;      &quot;match_all&quot; : {}&lt;br /&gt;    },&lt;br /&gt;    &quot;facets&quot; : {&lt;br /&gt;      &quot;user&quot; : { &lt;br /&gt;        &quot;terms&quot; : {&lt;br /&gt;          &quot;field&quot; : &quot;user.screen_name&quot;&lt;br /&gt;        } &lt;br /&gt;      }                            &lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Those are the top tweeters for FrOSCon:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://twitter.com/ElectricMaxxx&quot;&gt;ElectricMaxxx&lt;/a&gt; (36)&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://twitter.com/froscon&quot;&gt;FrOSCon&lt;/a&gt; (20)&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://twitter.com/slashme&quot;&gt;slashme&lt;/a&gt; (18)&lt;/li&gt;&lt;/ul&gt; &lt;h5&gt;Dominant Keywords&lt;/h5&gt; &lt;p&gt;The dominant keywords can also be retrieved using a facet query, this time on the text of the tweet. As there are a lot of German tweets for FrOSCon and the text field is processed using the StandardAnalyzer that only removes English stopwords, it might be necessary to exclude some terms. Also you might want to remove some other common terms that indicate retweets or are part of URLs.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -X POST &quot;http://localhost:9200/froscon/_search?pretty=true&quot; -d '&lt;br /&gt;  {&lt;br /&gt;    &quot;size&quot;: 0,&lt;br /&gt;    &quot;query&quot; : {&lt;br /&gt;      &quot;match_all&quot; : {}&lt;br /&gt;    },&lt;br /&gt;    &quot;facets&quot; : {&lt;br /&gt;      &quot;keywords&quot; : { &lt;br /&gt;        &quot;terms&quot; : {&lt;br /&gt;          &quot;field&quot; : &quot;text&quot;, &lt;br /&gt;          &quot;exclude&quot; : [&quot;froscon&quot;, &quot;rt&quot;, &quot;t.co&quot;, &quot;http&quot;, &quot;der&quot;, &quot;auf&quot;, &quot;ich&quot;, &quot;my&quot;, &quot;die&quot;, &quot;und&quot;, &quot;wir&quot;, &quot;von&quot;] &lt;br /&gt;        }&lt;br /&gt;      }                            &lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Those are the dominant keywords for FrOSCon:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;talk (no surprise for a conference)&lt;/li&gt;&lt;li&gt;slashme&lt;/li&gt;&lt;li&gt;teamix (a company that does very good marketing; unfortunately in this case it's more because their fluffy Tux got stolen. The tweet about it is the most retweeted tweet in the data.)&lt;/li&gt;&lt;/ul&gt;  &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;Using the Twitter River it is really easy to get some data into ElasticSearch. The Query DSL makes it easy to extract some useful information. Next week we will have a look at Kibana, which doesn't necessarily require a deep understanding of the ElasticSearch queries and can visualize our data.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Developing with CoreMedia</title>
   <link href="http://blog.florian-hopf.de/2013/09/developing-with-coremedia.html"/>
   <updated>2013-09-04T12:01:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/09/developing-with-coremedia</id>
   <content type="html">&lt;p&gt;A while ago I had the chance to attend a training on web development with &lt;a href=&quot;http://coremedia.com&quot;&gt;CoreMedia&lt;/a&gt;. It's a quite enterprisey commercial Content Management System that powers large corporate websites like telekom.com as well as news sites like Bild.de (well, you can't hold CoreMedia responsible for the kind of &quot;content&quot; people put into their system). As I have been working with different Java based Content Management Systems over the years I was really looking forward to learning about a system I had heard really good things about. In this post I'll describe the basic structure of the system as well as what it feels like to develop with it.&lt;/p&gt; &lt;h4&gt;System Architecture&lt;/h4&gt; &lt;p&gt;As CoreMedia is built to scale to really large sites the architecture is also built around redundant and distributed components. The part of the system the editors are working on is separated from the parts that serve the content to the internet audience. A publication process copies the content from the editorial system to the live system.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-3kz-sNUhiAU/Uiau6PoAbNI/AAAAAAAAAJg/1m0ZbGiw3MI/s1600/coremedia.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-3kz-sNUhiAU/Uiau6PoAbNI/AAAAAAAAAJg/1m0ZbGiw3MI/s1600/coremedia.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;The heart of CoreMedia is the Content Server. It stores all the content in a database and makes it retrievable. You rarely access it directly but only via other applications that then talk to it in the background via CORBA. Editors used to work with CoreMedia using a Java client (formerly called the Editor, now known as the Site Manager); starting with CoreMedia 7 there is also the web based Studio that is used to create and edit content. A preview application can be used to see how the site looks before being published. Workflows, which are managed using the Workflow Server, can be used to control the processes around editing as well as publication.&lt;/p&gt; &lt;p&gt;The live system consists of several components that are mostly laid out in a redundant way. There is one Master Live Server as well as 0 to n Replication Live Servers that are used for distributing the load as well as for fault tolerance. These servers are accessed from the Content Application Engine (CAE) that contains all the delivery and additional logic for your website. One or more Solr instances are used to provide the search services for your application.&lt;/p&gt; &lt;h4&gt;Document Model&lt;/h4&gt; &lt;p&gt;The document model for your application describes the content types that are available in the system. CoreMedia provides a blueprint application that contains a generic document model that can be used as a basis for your application but you are also free to build something completely different. The document model is used throughout the whole system as it describes the way your content is stored. The model is object oriented in nature with documents that consist of attributes. There are 6 attribute types available, like String (fixed length Strings), XML (variable length Strings) and Blob (binary data), that form the basis of all your types. An XML configuration file is used to describe your specific document model. 
This is an example of an article that contains a title, the text and a list of related articles.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;DocType Name=&amp;quot;Article&amp;quot;&amp;gt;&lt;br /&gt;  &amp;lt;StringProperty Name=&amp;quot;title&amp;quot;/&amp;gt;&lt;br /&gt;  &amp;lt;XmlProperty Grammar=&amp;quot;coremedia-richtext-1.0&amp;quot; Name=&amp;quot;text&amp;quot;/&amp;gt;&lt;br /&gt;  &amp;lt;LinkListProperty LinkType=&amp;quot;Article&amp;quot; Name=&amp;quot;related&amp;quot;/&amp;gt;&lt;br /&gt;&amp;lt;/DocType&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Content Application Engine&lt;/h4&gt; &lt;p&gt;Most of the code you will be writing is the delivery code that is part of the Content Application Engine, either for preview or for the live site. This is a standard Java webapp that is assembled from different Maven based modules. CAE code is heavily based on Spring MVC with the CoreMedia specific View Dispatcher that takes care of the rendering of different documents. The document model is made available using so-called Contentbeans that can be generated from the document model. Contentbeans access the content on demand and can contain additional business logic. So these are not plain POJOs but rather active objects, similar to Active Record entities in the Rails world.&lt;/p&gt; &lt;p&gt;Our example above would translate to a Contentbean with getters for the title (a java.lang.String), the text (a com.coremedia.xml.Markup) and a getter for a java.util.List that is typed to de.fhopf.Article.&lt;/p&gt; &lt;p&gt;Rendering of the Contentbeans happens in JSPs that are named according to classes or interfaces with a specific logic to determine which JSP should be used. An object Article that resides in the package de.fhopf would then be found in the path de/fhopf/Article.jsp; if you want to add a special rendering mechanism for List this would be in java/util/List.jsp. Different rendering of objects can be done by using a view name. An Article that is rendered as a link would then be in de/fhopf/Article.link.jsp.&lt;/p&gt;   &lt;p&gt;This is done using one of the custom Spring components of CoreMedia, the View Dispatcher, a View Resolver that determines the correct view to be invoked for a certain model based on the content element in the model. The JSP that is used can then contain further includes on other elements of the content, be it documents in the sense of CoreMedia or one of the attributes that are available. Those includes are again routed through the View Dispatcher.&lt;/p&gt; &lt;p&gt;Let's see an example for rendering the list of related articles for an article. Say you call the CAE with a certain content id that is an Article. The standard mechanism routes this request to the Article.jsp described above. It might contain the following fragment to include the related articles:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;cm:include self=&amp;quot;${self.related}&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Note that we do not tell which JSP to include. CoreMedia automatically figures out that we are including a List, for example a java.util.ArrayList. As there is no JSP available at java/util/ArrayList.jsp, CoreMedia will automatically look for any interfaces that are implemented by that class; in this case it will find java/util/List.jsp. 
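As a side note, the Contentbean for the Article document type sketched above could look roughly like the following interface. This is only an illustrative sketch; in a real project these beans are generated from the document model, so the exact names are assumptions:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;package de.fhopf;&lt;br /&gt;&lt;br /&gt;import com.coremedia.xml.Markup;&lt;br /&gt;import java.util.List;&lt;br /&gt;&lt;br /&gt;public interface Article {&lt;br /&gt;&lt;br /&gt;    // maps to the StringProperty &quot;title&quot;&lt;br /&gt;    String getTitle();&lt;br /&gt;&lt;br /&gt;    // maps to the XmlProperty &quot;text&quot; (coremedia-richtext-1.0)&lt;br /&gt;    Markup getText();&lt;br /&gt;&lt;br /&gt;    // maps to the LinkListProperty &quot;related&quot;&lt;br /&gt;    List&amp;lt;Article&amp;gt; getRelated();&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Back to the include of the related articles and the java/util/List.jsp it resolves to.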
This could then contain the following fragment:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;ul&amp;gt;&lt;br /&gt;&amp;lt;c:forEach items=&amp;quot;${self}&amp;quot; var=&amp;quot;item&amp;quot;&amp;gt;&lt;br /&gt;  &amp;lt;li&amp;gt;&amp;lt;cm:include self=&amp;quot;${item}&amp;quot; view=&amp;quot;link&amp;quot; /&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;&amp;lt;/c:forEach&amp;gt;&lt;br /&gt;&amp;lt;/ul&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;As the List in our case contains Article implementations this will then hit the Article.link.jsp that would finally render the link. This is a very flexible approach with a high degree of reusability for the fragments. The List.jsp we are seeing above has no connection to the Article. You can use it for any objects that should be rendered in a List structure; the View Dispatcher of CoreMedia takes care of which JSP to include for a certain type.&lt;/p&gt; &lt;p&gt;To minimize the load on the Content Server you can also add caching via configuration settings. Data Views, which are a layer on top of the Contentbeans, are then held in memory and contain prefilled beans that don't need to access the Content Management Server anymore. This object cache approach is different from the HTML fragment caching a lot of other systems are doing.&lt;/p&gt; &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;Though this is only a very short introduction you should have seen that CoreMedia really is a nice system to work with. The distributed nature not only makes it scalable but also has implications when developing for it: when you are working on the CAE you are only changing code in this component. You can start the more heavyweight Content Server just once and afterwards work with the lightweight CAE that can be run using the Maven Jetty plugin. Restarts don't take a long time so you have short turnaround times. The JSPs are very cleanly structured and don't need to include scriptlets (I heard that this was different in earlier versions). As most of the application is built around Spring MVC you can use a lot of knowledge that is around already.&lt;/p&gt;  </content>
 </entry>
 
 <entry>
   <title>FrOSCon 8 2013 - Free and Open Source Software Conference</title>
   <link href="http://blog.florian-hopf.de/2013/08/froscon-8-2013-free-and-open-source.html"/>
   <updated>2013-08-28T14:35:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/08/froscon-8-2013-free-and-open-source</id>
   <content type="html">&lt;p&gt;Last weekend I attended &lt;a href=&quot;http://www.froscon.de/en/home/&quot;&gt;FrOSCon&lt;/a&gt;, the Free and Open Source Software Conference taking place in St. Augustin near Bonn. It's a community organized conference with an especially low entrance fee and a relaxed vibe. The talks are a good mixture of development and system administration topics.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-DMitreQyzGg/Uh1_tR3EU9I/AAAAAAAAAI4/J3hs8zoRv4s/s1600/2013-08-25+13.46.50.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-DMitreQyzGg/Uh1_tR3EU9I/AAAAAAAAAI4/J3hs8zoRv4s/s320/2013-08-25+13.46.50.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Some of the interesting talks I attended:&lt;/p&gt; &lt;h4&gt;Fixing Legacy Code by Kore Nordmann and Benjamin Eberlein&lt;/h4&gt; &lt;p&gt;Though this session was part of the PHP track it contained a lot of valuable information related to working with legacy code in any language. Besides strategies for getting an application under test the speakers showed some useful refactorings that can make sense to start with. &lt;a href=&quot;http://qafoo.com/talks/13_08_froscon_fixing_legacy_code.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt; &lt;h4&gt;Building Awesome Ruby Command Line Apps by Christian Vervoorts&lt;/h4&gt; &lt;p&gt;Christian first showed some of the properties that make up a good command line app. You should choose sane default values but make those configurable. Help functionality is crucial for a good user experience, via a -h parameter and a man page. In the second part Christian introduced some Ruby gems that can be used to build command line apps. &lt;a href=&quot;https://github.com/davetron5000/gli&quot;&gt;GLI&lt;/a&gt; seems to be the most interesting with a nice DSL and its scaffolding functionality.&lt;/p&gt; &lt;h4&gt;Talking People Into Creating Patches by Isabel Drost-Fromm&lt;/h4&gt; &lt;p&gt;Isabel, who is very active in the Apache community, introduced some of her findings when trying to make students, researchers and professionals participate in Open Source. The participants were a mixture of people running open source projects and developers that are interested in contributing to open source. I was especially interested in this talk because I wouldn't mind having more people help with the &lt;a href=&quot;http://incubator.apache.org/odftoolkit/&quot;&gt;Odftoolkit&lt;/a&gt; I am also working on. When working with professionals, who are the main target, it is important to respond quickly to mails or issues, as they might move on to other projects and might not be able to help later on. Also, it's nice to have some easy tasks in the bugtracker that can be processed by newbies.&lt;/p&gt; &lt;h4&gt;MySQL Performance Schema by Carsten Thalheimer&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://dev.mysql.com/doc/refman/5.6/en/performance-schema.html&quot;&gt;Performance Schema&lt;/a&gt; is a new feature in MySQL 5.5 and has been activated by default since 5.6. It monitors a lot of the internal functionality like file access and queries so you can later see which parts you can optimize. Some performance measurements done by the MySQL developers showed that keeping it activated has a performance impact of around 5%. 
Though this doesn't sound that good at first, I think you can gain a lot more performance through the insight you get into the inner workings. Working with Performance Schema is supposed to be rather complex (&quot;Take two weeks to work with it&quot;); &lt;a href=&quot;http://www.markleith.co.uk/ps_helper/&quot;&gt;ps_helper&lt;/a&gt; is a more beginner friendly tool that can get you started with some useful metrics.&lt;/p&gt; &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;FrOSCon is one of the most relaxing conferences I know. It is my go-to place for seeing stuff that is not directly related to Java development. The low fee makes it a no-brainer to attend. If you are interested in any of this year's talks they will also be made available online.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Getting Started with ElasticSearch: Part 2 - Querying</title>
   <link href="http://blog.florian-hopf.de/2013/08/getting-started-with-elasticsearch-part.html"/>
   <updated>2013-08-21T14:15:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/08/getting-started-with-elasticsearch-part</id>
   <content type="html">&lt;p&gt;This is the second part of the article on things I learned while building a simple Java based search application on top of &lt;a href=&quot;http://elasticsearch.org&quot; target=&quot;_blank&quot;&gt;ElasticSearch&lt;/a&gt;. In the &lt;a href=&quot;http://blog.florian-hopf.de/2013/05/getting-started-with-elasticsearch-part.html&quot; target=&quot;_blank&quot;&gt;first part of this article&lt;/a&gt; we looked at how to index data in ElasticSearch and what the mapping is. Though ElasticSearch is often called schema free specifying the mapping is still a crucial part of creating a search application. This time we will look at the query side and see how we can get our indexed talks out of it again.&lt;/p&gt; &lt;h4&gt;Simple Search&lt;/h4&gt; &lt;p&gt;Recall that our documents consist of a title, the date and the speaker of a talk. We have adjusted the mapping so that for the title we are using the German analyzer that stems our terms and we can search on variations of words. This curl request creates a similar index:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT &quot;http://localhost:9200/blog&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;mappings&quot; : {&lt;br /&gt;        &quot;talk&quot; : {&lt;br /&gt;            &quot;properties&quot; : {&lt;br /&gt;                &quot;title&quot; : { &quot;type&quot; : &quot;string&quot;, &quot;store&quot; : &quot;yes&quot;, &quot;analyzer&quot; : &quot;german&quot; }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Let's see how we can search on our content. We are indexing another document with a German title.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/blog/talk/&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2012-07-04T19:15:00&quot;,&lt;br /&gt;    &quot;title&quot; : &quot;Suchen und Finden mit Lucene und Solr&quot;&lt;br /&gt;}'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;All searching is done on the &lt;code&gt;_search&lt;/code&gt; endpoint that is available on the type or index level (you can also search on multiple types and indexes by separating them with a comma). As the title field uses the German analyzer we can search on variations of the words, e.g. 
suche which stems to the same root as suchen, such.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/blog/talk/_search?q=title:suche&amp;pretty=true&quot;                                                                       &lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 14, &lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;},                                                                                                                                                                                                                             &lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 1,&lt;br /&gt;    &quot;max_score&quot; : 0.15342641,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;blog&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;talk&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;A2Qv3fN3TkeYEhxA4zicgw&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.15342641, &quot;_source&quot; : {&lt;br /&gt;        &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;        &quot;date&quot; : &quot;2012-07-04T19:15:00&quot;,&lt;br /&gt;        &quot;title&quot; : &quot;Suchen und Finden mit Lucene und Solr&quot;&lt;br /&gt;      }&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;The _all field&lt;/h4&gt; &lt;p&gt;Now that this works, we might want to search on multiple fields. ElasticSearch provides the convenience functionality of copying all field content to the &lt;code&gt;_all&lt;/code&gt; field that is used when omitting the field name in the query. Let's try the query again:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/blog/talk/_search?q=suche&amp;pretty=true&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 3,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 0,&lt;br /&gt;    &quot;max_score&quot; : null,&lt;br /&gt;    &quot;hits&quot; : [ ]&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;No results. Why is that? Of course we have set the analyzer correctly for the title as we have seen above. But this doesn't mean that the content is analyzed in the same way for the &lt;code&gt;_all&lt;/code&gt; field. As we didn't specify an analyzer for this field it still uses the &lt;code&gt;StandardAnalyzer&lt;/code&gt; that splits on whitespace but doesn't do any stemming. 
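If you want to check what a certain analyzer does with your input you can use the Analyze API. This is a rough sketch using the Java client; the prepareAnalyze helper and the token accessors are assumptions and the exact names might differ between client versions:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;// run a term through the german analyzer of the blog index&lt;br /&gt;AnalyzeResponse analyzed = esClient.admin().indices()&lt;br /&gt;        .prepareAnalyze(&quot;blog&quot;, &quot;suchen&quot;)&lt;br /&gt;        .setAnalyzer(&quot;german&quot;)&lt;br /&gt;        .execute().actionGet();&lt;br /&gt;for (AnalyzeResponse.AnalyzeToken token : analyzed.getTokens()) {&lt;br /&gt;    // prints the stemmed term, in this case such&lt;br /&gt;    System.out.println(token.getTerm());&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;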
If you want to have a consistent behavior for the title and the &lt;code&gt;_all&lt;/code&gt; field you need to set the analyzer in the mapping:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPUT &quot;http://localhost:9200/blog/talk/_mapping&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;mappings&quot; : {&lt;br /&gt;        &quot;talk&quot; : {&lt;br /&gt;            &quot;_all&quot; : {&quot;analyzer&quot; : &quot;german&quot;},&lt;br /&gt;            &quot;properties&quot; : {&lt;br /&gt;                &quot;title&quot; : { &quot;type&quot; : &quot;string&quot;, &quot;store&quot; : &quot;yes&quot;, &quot;analyzer&quot; : &quot;german&quot; }&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Note that as with all mapping changes you can't change the type of the &lt;code&gt;_all&lt;/code&gt; field once it's created. You need to delete the index, put the new mapping and reindex your data. Afterwards our search will return the same results for the two queries.&lt;/p&gt;   &lt;h4&gt;_source&lt;/h4&gt; &lt;p&gt;You might have noticed from the example above that ElasticSearch returns the special &lt;code&gt;_source&lt;/code&gt; field for each result. This is very convenient as you don't need to specify which fields should be stored. But be aware that this might become a problem for large fields that you don't need for each search request (content section of articles, images that you might store in the index). You can either disable the use of the source field and indicate which fields should be stored in the mapping for your indexed type or you can specify in the query which fields you'd like to retrieve:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET &quot;http://localhost:9200/blog/talk/_search?q=suche&amp;pretty=true&amp;fields=speaker,title&quot;&lt;br /&gt;{&lt;br /&gt;  &quot;took&quot; : 6,&lt;br /&gt;  &quot;timed_out&quot; : false,&lt;br /&gt;  &quot;_shards&quot; : {&lt;br /&gt;    &quot;total&quot; : 5,&lt;br /&gt;    &quot;successful&quot; : 5,&lt;br /&gt;    &quot;failed&quot; : 0&lt;br /&gt;  },&lt;br /&gt;  &quot;hits&quot; : {&lt;br /&gt;    &quot;total&quot; : 2,&lt;br /&gt;    &quot;max_score&quot; : 0.15342641,&lt;br /&gt;    &quot;hits&quot; : [ {&lt;br /&gt;      &quot;_index&quot; : &quot;blog&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;talk&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;MA2oYAqnTdqJhbjnCNq2zA&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.15342641&lt;br /&gt;    }, {&lt;br /&gt;      &quot;_index&quot; : &quot;blog&quot;,&lt;br /&gt;      &quot;_type&quot; : &quot;talk&quot;,&lt;br /&gt;      &quot;_id&quot; : &quot;aGdDy24cSImz6DVNSQ5iwA&quot;,&lt;br /&gt;      &quot;_score&quot; : 0.076713204,&lt;br /&gt;      &quot;fields&quot; : {&lt;br /&gt;        &quot;speaker&quot; : &quot;Florian Hopf&quot;,&lt;br /&gt;        &quot;title&quot; : &quot;Suchen und Finden mit Lucene und Solr&quot;&lt;br /&gt;      }&lt;br /&gt;    } ]&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The same can be done if you are not using the simple query parameters but the more advanced query DSL:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/blog/talk/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;fields&quot; : [&quot;title&quot;, &quot;speaker&quot;],&lt;br /&gt;    &quot;query&quot; : {&lt;br /&gt;        &quot;term&quot; : { &quot;speaker&quot; : &quot;florian&quot; }&lt;br /&gt;    }&lt;br /&gt;}'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Querying from 
Java&lt;/h4&gt;   &lt;p&gt;Besides the JSON based Query DSL you can also query ElasticSearch using Java. The default ElasticSearch Java client provides builders for creating different parts of the query that can then be combined. For example if you'd like to query on two fields using the &lt;code&gt;multi_match&lt;/code&gt; query this is what it looks like using curl:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST &quot;http://localhost:9200/blog/_search&quot; -d'&lt;br /&gt;{&lt;br /&gt;    &quot;query&quot; : {&lt;br /&gt;        &quot;multi_match&quot; : {&lt;br /&gt;            &quot;query&quot; : &quot;Solr&quot;,&lt;br /&gt;            &quot;fields&quot; : [ &quot;title&quot;, &quot;speaker&quot; ]&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The Java version maps quite well to this. Once you have found the builders you need you can use the &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/query-dsl/&quot; target=&quot;_blank&quot;&gt;excellent documentation of the Query DSL&lt;/a&gt; for your Java client as well.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;QueryBuilder multiMatch = multiMatchQuery(&quot;Solr&quot;, &quot;title&quot;, &quot;speaker&quot;);&lt;br /&gt;SearchResponse response = esClient.prepareSearch(&quot;blog&quot;)&lt;br /&gt;        .setQuery(multiMatch)&lt;br /&gt;        .execute().actionGet();&lt;br /&gt;assertEquals(1, response.getHits().getTotalHits());&lt;br /&gt;SearchHit hit = response.getHits().getAt(0);&lt;br /&gt;assertEquals(&quot;Suchen und Finden mit Lucene und Solr&quot;, hit.getSource().get(&quot;title&quot;));&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The same &lt;code&gt;QueryBuilder&lt;/code&gt; we are constructing above can also be used on other parts of the query: For example it can be passed as a parameter to create a &lt;code&gt;QueryFilterBuilder&lt;/code&gt; or can be used to construct a &lt;code&gt;QueryFacetBuilder&lt;/code&gt;. This composition is a very powerful way to build flexible applications. It is easier to reason about the components of the query and you could even test parts of the query on their own.&lt;/p&gt; &lt;h4&gt;Faceting&lt;/h4&gt; &lt;p&gt;One of the most prominent features of ElasticSearch is its excellent faceting support, which is not only used for building search applications but also for doing analytics on large data sets. You can use different kinds of faceting, e.g. for certain terms, using the &lt;code&gt;TermsFacet&lt;/code&gt;, or for queries, using the query facet. 
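A query facet simply counts the documents that match an arbitrary query. This is a minimal sketch of counting the talks that match a query string; the queryFacet builder is assumed to come from the same FacetBuilders class as termsFacet below, so treat the exact names as an assumption:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;// count all indexed talks that match the query &quot;solr&quot;&lt;br /&gt;QueryFacetBuilder solrFacet = queryFacet(&quot;solrTalks&quot;).query(queryString(&quot;solr&quot;));&lt;br /&gt;SearchResponse response = esClient.prepareSearch(&quot;blog&quot;)&lt;br /&gt;        .addFacet(solrFacet)&lt;br /&gt;        .setQuery(matchAllQuery())&lt;br /&gt;        .execute().actionGet();&lt;br /&gt;QueryFacet solrTalks = response.getFacets().facet(QueryFacet.class, &quot;solrTalks&quot;);&lt;br /&gt;long count = solrTalks.getCount();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;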
The query facet would accept the same &lt;code&gt;QueryBuilder&lt;/code&gt; that we used above.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TermsFacetBuilder facet = termsFacet(&quot;speaker&quot;).field(&quot;speaker&quot;);&lt;br /&gt;QueryBuilder query = queryString(&quot;solr&quot;);&lt;br /&gt;SearchResponse response = esClient.prepareSearch(&quot;blog&quot;)&lt;br /&gt;        .addFacet(facet)&lt;br /&gt;        .setQuery(query)&lt;br /&gt;        .execute().actionGet();&lt;br /&gt;assertEquals(1, response.getHits().getTotalHits());&lt;br /&gt;SearchHit hit = response.getHits().getAt(0);&lt;br /&gt;assertEquals(&quot;Suchen und Finden mit Lucene und Solr&quot;, hit.getSource().get(&quot;title&quot;));&lt;br /&gt;TermsFacet resultFacet = response.getFacets().facet(TermsFacet.class, &quot;speaker&quot;);&lt;br /&gt;assertEquals(1, resultFacet.getEntries().size());&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;h4&gt;Conclusion&lt;/h4&gt; &lt;p&gt;ElasticSearch has a really nice Java API, be it for indexing or for querying. You can get started with indexing and searching in no time though you need to know some concepts or the results might not be what you expect.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>The Pragmatic Programmers Rubber Duck of the 19th Century</title>
   <link href="http://blog.florian-hopf.de/2013/08/the-pragmatic-programmers-rubber-duck.html"/>
   <updated>2013-08-14T14:02:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/08/the-pragmatic-programmers-rubber-duck</id>
   <content type="html">&lt;p&gt;In their influential book &lt;a href=&quot;http://pragprog.com/book/tpp/the-pragmatic-programmer&quot; target=&quot;_blank&quot;&gt;&quot;The Pragmatic Programmer&quot;&lt;/a&gt; Andy Hunt and Dave Thomas describe a technique for finding solutions to hard problems you are struggling with. They recommend just telling the problem to somebody, not to get an answer, but because while explaining the problem you think about it differently. And, if there is nobody around you, get yourself a rubber duck you can talk to, hence the name of the tip.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-N0Ox1aOUhOU/UgsLxL8OCAI/AAAAAAAAAIs/YxQZziZkGZs/s1600/rubber-duck.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-N0Ox1aOUhOU/UgsLxL8OCAI/AAAAAAAAAIs/YxQZziZkGZs/s320/rubber-duck.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;It's obvious that this is not a new discovery made by the authors. Everybody has experienced similar situations where they found a solution to a problem while explaining it to someone. But I was surprised to read this in the essay &quot;Über die allmähliche Verfertigung der Gedanken beim Reden&quot; by Heinrich von Kleist, dating from 1805/1806 (translated from German by me):&lt;/p&gt; &lt;blockquote&gt;If you want to know something and you can't find it in meditation I advise you [...] to tell it to the next acquaintance you meet. He doesn't need to be a keen thinker and I don't mean you should ask him: no! Rather you should tell him about it in the first place.&lt;/blockquote&gt; &lt;p&gt;1806. The same tip (without the duck). This is another case where we are relearning things that have been discovered before, something especially computer science is prone to.&lt;/p&gt; &lt;p&gt;So, what is the big achievement of the authors? It's not that they are only coming up with new ideas. We don't need that many new ideas. There is a lot of stuff around that is just waiting to be applied to our work as software developers. Those old ideas need to be put into context. There is even a benefit in stating obvious things that might trigger rethinking your habits.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>SoCraTes 2013 - Two Days of Open Space</title>
   <link href="http://blog.florian-hopf.de/2013/08/socrates-2013-two-days-of-open-space.html"/>
   <updated>2013-08-11T20:23:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/08/socrates-2013-two-days-of-open-space</id>
   <content type="html">&lt;p&gt;Last week I attended &lt;a href=&quot;http://www.socrates-conference.de/&quot; target=&quot;_blank&quot;&gt;SoCraTes&lt;/a&gt;, an international Open Space conference on Software Craftsmanship and Testing. As a few days have passed now I'd like to recap what sessions I attended and what was discussed. Though this is only a small part of the sessions, you should get a good grasp of what kind of topics are shared during the two days.&lt;/p&gt; &lt;h3&gt;General&lt;/h3&gt; &lt;p&gt;The conference is located at Seminarzentrum Rückersbach which is close to nowhere (Aschaffenburg). As there is nothing around where people would normally go during conferences (pubs), everybody is staying on the premises for the full 48 hours. You sleep there, you eat there, you spend your evening there. Generally you are around the same people from Thursday evening to Saturday evening, which leads to a lot of interesting talks and evening activities.&lt;/p&gt; &lt;h3&gt;Open Space format&lt;/h3&gt; &lt;p&gt;Some of the time is needed for the framework of the Open Space. On Thursday evening there is a world cafe, where you spend time at different tables with different people to discuss what you expect from the conference and what you would like to learn.&lt;/p&gt;   &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-5CLiqU6N1x0/Ugd_bV2dv-I/AAAAAAAAAH8/xDyihXP3BEk/s1600/marketplace.jpg&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-5CLiqU6N1x0/Ugd_bV2dv-I/AAAAAAAAAH8/xDyihXP3BEk/s320/marketplace.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Every day starts with the planning of the day, the marketplace. Every participant can propose talks, discussions and hands-on sessions and put those on the schedule. There are several rooms available and, if the weather permits, there is also plenty of space outside for discussions. The day then is dedicated to the sessions, some of which I will describe below. In the evening there is some kind of retrospective of the day. Also you can propose evening activities, which can range from discussions and coding to board games.&lt;/p&gt; &lt;p&gt;The sessions I will describe now are only a snapshot of what was available. As there are so many good things it's sometimes hard to decide which session to attend.&lt;/p&gt; &lt;h3&gt;Day 1&lt;/h3&gt; &lt;p&gt;&lt;a href=&quot;https://twitter.com/rhosts/status/363250042062131200/photo/1&quot; target=&quot;_blank&quot;&gt;Agenda for day 1&lt;/a&gt;, photo taken by Robert Hostlowsky, who also &lt;a href=&quot;http://blog.codecentric.de/en/2013/08/socrates2013/&quot; target=&quot;_blank&quot;&gt;blogged about SoCraTes over at the Codecentric blog.&lt;/a&gt;&lt;/p&gt; &lt;h4&gt;Continuous Delivery&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://shishkin.org/&quot; target=&quot;_blank&quot;&gt;Sergey&lt;/a&gt;, who did a lot of sessions for the event, discussed a problem he is facing when trying to implement Continuous Delivery. One of the basic concepts of CD is that you should be using the same artifact for all of your stages; that means an artifact should only be built once and deployed to the different systems for automated testing, user testing and finally deployment. 
When an artifact is promoted to one of the later stages you would like to update the version so it is obvious that it is a real release candidate. Depending on the technology you are using it might not be that easy to update the version without building the artifact again.&lt;/p&gt; &lt;h4&gt;Integrated Tests are a scam&lt;/h4&gt; &lt;p&gt;This session, inspired by &lt;a href=&quot;http://blog.thecodewhisperer.com/2010/10/16/integrated-tests-are-a-scam/&quot; target=&quot;_blank&quot;&gt;this article by J.B. Rainsberger&lt;/a&gt;, mostly revolved around the problems of testing database-heavy applications, where some business logic tends to be contained in the database, e.g. by means of OR mapping configurations. During the discussion it became obvious that the term integration test is too overloaded: a lot of people are thinking of integrating external systems like databases, whereas others think of it as integrating some of your own components. I learned about &lt;a href=&quot;http://alistair.cockburn.us/Hexagonal+architecture&quot; target=&quot;_blank&quot;&gt;Hexagonal Architecture&lt;/a&gt;, which I didn't know as a term before.&lt;/p&gt; &lt;h4&gt;Mapping Personal Practices&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.shino.de/&quot; target=&quot;_blank&quot;&gt;Markus Gärtner&lt;/a&gt; hosted a laid-back outdoor session on determining in which areas of your professional life you would like to improve. First we collected all the stuff we are doing daily on a mind map and discussed it. Next we determined which parts we would like to improve during the next months. I have done similar things for myself before but it was interesting to see what other people are working on and what they are interested in.&lt;/p&gt; &lt;h4&gt;Specification by Example Experience Report&lt;/h4&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-ipgBb6TIiJw/Ugd_l0Ci2GI/AAAAAAAAAIc/KqrY5tjlbZ4/s1600/specbyexample.jpg&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-ipgBb6TIiJw/Ugd_l0Ci2GI/AAAAAAAAAIc/KqrY5tjlbZ4/s320/specbyexample.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;a href=&quot;http://pboop.wordpress.com/&quot; target=&quot;_blank&quot;&gt;Nicole&lt;/a&gt; introduced some of the concepts of specification by example using the hot dog point of sale that was also used for the architecture kata during last year's SoCraTes. She described some of the experiences they had when introducing it in their company (which doesn't sell hot dogs as far as I know). Specification by Example and BDD had been huge topics during last year's conference and I would really like to try it once and see if it really improves communication. The &lt;a href=&quot;http://xpdays.de/twiki/pub/XPDays2012/SpecificationByExample/Spec_by_Example.pdf&quot; target=&quot;_blank&quot;&gt;(German) slides&lt;/a&gt; she used as an introduction are also available online.&lt;/p&gt; &lt;h4&gt;Designing Functional Programs&lt;/h4&gt; &lt;p&gt;A few people gathered to discuss the implications functional programming has when designing an application, e.g. when to use higher-order functions. Unfortunately nobody was around who had already implemented an application in a functional way. 
To get some experience we tried to do the &lt;a href=&quot;http://craftsmanship.sv.cmu.edu/katas/mars-rover-kata&quot; target=&quot;_blank&quot;&gt;Mars Rover kata&lt;/a&gt; in JavaScript. The kata probably was not the ideal choice as it is rather stateful and therefore a more advanced problem.&lt;/p&gt; &lt;h4&gt;Productivity Porn&lt;/h4&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-yd0ShljOokw/Ugd_h-9PqBI/AAAAAAAAAIE/wajkpKOaXbY/s1600/productivityport.jpg&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-yd0ShljOokw/Ugd_h-9PqBI/AAAAAAAAAIE/wajkpKOaXbY/s320/productivityport.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;A discussion on self-management and everything around it. People shared some of their practices and Jan showed his foldable Personal Kanban board. It's a fact that you can really spend a lot of time thinking about productivity which sometimes is not the most productive thing. But Personal Kanban seems to help a lot of people so I am planning to read &lt;a href=&quot;https://leanpub.com/PkFlowNutshell&quot; target=&quot;_blank&quot;&gt;the recommended book about it&lt;/a&gt; and try it for myself.&lt;/p&gt; &lt;h3&gt;Day 2&lt;/h3&gt; &lt;p&gt;&lt;a href=&quot;https://twitter.com/rhosts/status/363577303567040513/photo/1&quot; target=&quot;_blank&quot;&gt;Agenda for day 2&lt;/a&gt;, again taken by Robert.&lt;/p&gt; &lt;h4&gt;VIM Show and Tell&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://sebastianbenz.de/&quot; target=&quot;_blank&quot;&gt;Sebastian&lt;/a&gt; proposed a session on VIM where everybody should show their favourite plugin or feature. I am still a novice VIM user so I mainly learned some of the basics. Quite a few people there are using VIM for their development work, mostly with dynamic languages I guess. &lt;a href=&quot;http://www.infoq.com/presentations/Vim-From-Essentials-to-Mastery&quot; target=&quot;_blank&quot;&gt;This video on InfoQ has been recommended&lt;/a&gt; and after watching it, I am also recommending it here.&lt;/p&gt;   &lt;h4&gt;Monads&lt;/h4&gt; &lt;p&gt;Another talk by Nicole where she introduced Monads, starting with an example in Java and moving on to Haskell. 
I have a better idea now of what Monads could be but the concept still is too abstract for me.&lt;/p&gt; &lt;h4&gt;Async Patterns&lt;/h4&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-eoyCZZX7Xzg/Ugd_ir88dqI/AAAAAAAAAIM/DV-hhrgG-o8/s1600/async.jpg&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-eoyCZZX7Xzg/Ugd_ir88dqI/AAAAAAAAAIM/DV-hhrgG-o8/s320/async.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Sergey presented three alternative approaches for building concurrent systems:&lt;/p&gt;  &lt;ul&gt;&lt;li&gt;Actors, a core feature of Erlang and Akka&lt;/li&gt;  &lt;li&gt;Reactive Extensions&lt;/li&gt;&lt;li&gt;Communicating Sequential Processes as implemented in &lt;a href=&quot;http://clojure.com/blog/2013/06/28/clojure-core-async-channels.html&quot; target=&quot;_blank&quot;&gt;Clojure&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;SOLID Principles&lt;/h4&gt; &lt;p&gt;Another session by Sebastian where we discussed the &lt;a href=&quot;http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod&quot; target=&quot;_blank&quot;&gt;SOLID principles&lt;/a&gt;, considered to be the basics of good object-oriented design. It was interesting to see that though you think you know the concepts it is still difficult to define them. While looking at some examples it also became obvious that sometimes you might follow one principle while violating another. Unfortunately I couldn't stay for the second, practical part of the session.&lt;/p&gt; &lt;h4&gt;Quit your Job&lt;/h4&gt; &lt;p&gt;During the world cafe &lt;a href=&quot;http://danieltemme.blogspot.de/&quot; target=&quot;_blank&quot;&gt;Daniel Temme&lt;/a&gt; mentioned that he had given a &lt;a href=&quot;http://vimeo.com/44327015&quot; target=&quot;_blank&quot;&gt;talk last year on quitting your job&lt;/a&gt; and had just quit his job again. He told me parts of the story in the evening but I was glad that he decided to give the talk again. Though it is rather provocative the story behind it is important and spans a lot of areas of your life: Sometimes you are caught in your habits and don't really notice that the right thing to do would be something else. &lt;a href=&quot;http://danieltemme.blogspot.de/2013/08/journeyman-weeks.html&quot; target=&quot;_blank&quot;&gt;Daniel is currently on a journey where he visits companies and works for food and accommodation.&lt;/a&gt;&lt;/p&gt;   &lt;h3&gt;Last words&lt;/h3&gt; &lt;p&gt;SoCraTes was really awesome again. Thanks to all the organizers and participants that shared a lot. I'll definitely try to be there again next year.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>GETting Results from ElasticSearch</title>
   <link href="http://blog.florian-hopf.de/2013/07/getting-results-from-elasticsearch.html"/>
   <updated>2013-07-31T13:41:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/07/getting-results-from-elasticsearch</id>
   <content type="html">&lt;p&gt;ElasticSearch exposes most of its functionality via a RESTful API. When it comes to querying the data, you can either pass request parameters, e.g. the query string, or use the &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/query-dsl/&quot; target=&quot;_blank&quot;&gt;query DSL&lt;/a&gt; which structures the queries as JSON objects. For a talk I gave I used this example which executes a GET request with curl and passes the JSON query structure in the request body.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET 'http://localhost:9200/jug/talk/_search' -d '{&lt;br /&gt;    &quot;query&quot; : {&lt;br /&gt;        &quot;query_string&quot; : {&quot;query&quot; : &quot;suche&quot;} &lt;br /&gt;    },&lt;br /&gt;    &quot;facets&quot; : {&lt;br /&gt;        &quot;tags&quot; : {&lt;br /&gt;            &quot;terms&quot; : {&quot;field&quot; : &quot;speaker&quot;} &lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;An alert listener(*) later told me that you can't use a request body with a GET request. While preparing the talk I also thought about this and only added it after testing it successfully. At least it is unusual and some software like proxy caches might not handle your request as intended. A lot of ElasticSearch examples I have seen are using POST requests instead but I think semantically requesting search results should be a GET.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/api/search/request-body/&quot; target=&quot;_blank&quot;&gt;ElasticSearch docs&lt;/a&gt; explicitly allow the use of GET requests with a request body:&lt;/p&gt; &lt;blockquote&gt;&quot;Both HTTP GET and HTTP POST can be used to execute search with body. Since not all clients support GET with body, POST is allowed as well.&quot;&lt;/blockquote&gt; &lt;p&gt;I didn't find a hint in the &lt;a href=&quot;http://www.w3.org/Protocols/rfc2616/rfc2616.html&quot; target=&quot;_blank&quot;&gt;HTTP specification&lt;/a&gt; whether this should be allowed or not. &lt;a href=&quot;http://stackoverflow.com/a/978094/639892&quot; target=&quot;_blank&quot;&gt;This answer on Stackoverflow&lt;/a&gt; goes in the same direction as my initial concern, that you shouldn't do it because it might not be expected by users and also not supported by some stacks.&lt;/p&gt; &lt;p&gt;Finally, in &lt;a href=&quot;http://tech.groups.yahoo.com/group/rest-discuss/message/9962&quot; target=&quot;_blank&quot;&gt;this message&lt;/a&gt; Roy Fielding, one of the authors of the HTTP specification, discourages the use of a request body with GET.&lt;/p&gt; &lt;blockquote&gt;&quot;... any HTTP request message is allowed to contain a message body, and thus must parse messages with that in mind. Server semantics for GET, however, are restricted such that a body, if any, has no semantic meaning to the request.&quot;&lt;/blockquote&gt; &lt;p&gt;As the query DSL influences the response it is clear that the body does have a semantic meaning, which by the words of Roy Fielding shouldn't be done. So no consensus exists on this topic. In hindsight I am quite surprised that this works out that well with ElasticSearch and there aren't any problems I have heard about.&lt;/p&gt; &lt;p&gt;&lt;i&gt;(*) I wish he hadn't told me, while talking about another talk, that he always looks for mistakes in slides when he's bored ;).&lt;/i&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Book Review: Hibernate Search by Example</title>
   <link href="http://blog.florian-hopf.de/2013/06/book-review-hibernate-search-by-example.html"/>
   <updated>2013-06-05T17:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/06/book-review-hibernate-search-by-example</id>
   <content type="html">&lt;p&gt;&lt;i&gt;PacktPub kindly offered me a free review edition of &lt;a target=&quot;_blank&quot; href=&quot;http://www.packtpub.com/hibernate-search-by-example/book&quot;&gt;Hibernate Search by Example&lt;/a&gt;. Though I've used Lucene and Hibernate independently on a lot of projects I've never used &lt;a href=&quot;http://www.hibernate.org/subprojects/search.html&quot; target=&quot;_blank&quot;&gt;Hibernate Search&lt;/a&gt; which builds on both technologies. That makes me a good candidate for reviewing an introductory book on it.&lt;/i&gt;&lt;/p&gt; &lt;h4&gt;The Project&lt;/h4&gt; &lt;p&gt;Hibernate Search is a really interesting technology. It transparently indexes Hibernate entities in Lucene and provides a slick DSL for querying the index. You decide which entities and which fields to index by adding annotations on class and field level. Custom analyzer chains can be defined for the entities and referenced from fields. Each entity is written to its own index but it can also include data from related or embedded entities. By default, Lucene is only used for querying and ranking, the result list is still populated from the database. If this is not enough for your application you can also use projection to use stored fields in Lucene for result display.&lt;/p&gt; &lt;div style=&quot;clear: both; text-align: center;&quot; class=&quot;separator&quot;&gt;&lt;a style=&quot;clear:left; float:left;margin-right:1em; margin-bottom:1em&quot; href=&quot;http://www.packtpub.com/hibernate-search-by-example/book&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-adkfN2ImiU0/Ua8Ju5av7kI/AAAAAAAAAHg/JGuiwYrGPlo/s320/hibernate-search.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;The Book&lt;/h4&gt; &lt;p&gt;Steve Perkins, the author of Hibernate Search by Example, did a great job in designing an example that can evolve with the book. It starts very simple, also explaining the build setup using Maven and an embedded Jetty instance with a H2 database. Each following chapter builds on the results of the previous chapter and enhances the project with different features. Each chapter is dedicated to a certain topic that is immediately used in the application so you can see its benefit. This way you are learning different aspects, from mapping entities and performing queries to advanced mapping aspects, analyzing, filtering, using facets and even setting up master slave systems and sharding your data. But not only is the book structured in a good way, the author also has a very clear writing. Combined with the practical examples this makes it very easy to read. If you're planning to implement a solution on top of Hibernate Search you're well advised to read this book.&lt;/p&gt; &lt;h4&gt;My Impression of the Technology&lt;/h4&gt; &lt;p&gt;As I am building quite a lot of search applications I'd like to add some impressions of Hibernate Search. Though it's a very interesting technology I think you should be careful when deciding on its use. Not only will you tie yourself to Hibernate as your JPA provider but there are also some implications on the search side. Hibernate Search offers advanced features like faceting, it can be distributed and sharded. But there might be features that you want to build later on that would be far more easy with a search server like Solr or ElasticSearch. Hibernate Search uses some components of Solr but a real integration (using Solr as a backend) is rather difficult I guess. 
Solr needs its schema configured in a file so you would need to duplicate it. ElasticSearch could be a far better candidate as its schema mapping can be created with its REST API. I am really curious if somebody has been thinking about starting an implementation of Hibernate Search on top of ElasticSearch. With the Lucene implementation that is described in the book you can easily enhance your database-driven application with advanced search functionality. But be aware that future requirements might be more difficult to build compared to integrating a search server from the beginning.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Getting Started with ElasticSearch: Part 1 - Indexing</title>
   <link href="http://blog.florian-hopf.de/2013/05/getting-started-with-elasticsearch-part.html"/>
   <updated>2013-05-28T14:44:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/05/getting-started-with-elasticsearch-part</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://elasticsearch.org&quot; target=&quot;blank&quot;&gt;ElasticSearch&lt;/a&gt; is gaining a huge momentum with large installations like Github and Stackoverflow switching to it for its search capabilities. Its distributed nature makes it an excellent choice for large datasets with high availability requirements. In this 2 part article I'd like to share what I learned building a small Java application just for search.&lt;/p&gt; &lt;p&gt;The example I am showing here is part of an application I am using for talks to show the capabilities of Lucene, Solr and ElasticSearch. It's a simple webapp that can search on user group talks. You can find the sources on &lt;a href=&quot;https://github.com/fhopf/lucene-solr-talk&quot; target=&quot;_blank&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;   &lt;p&gt;Some experience with Solr can be helpful when starting with ElasticSearch but there are also times when it's best to not stick to your old knowledge.&lt;/p&gt; &lt;h4&gt;Installing ElasticSearch&lt;/h4&gt; &lt;p&gt;There is no real installation process involved when starting with ElasticSearch. It's only a Jar file that can be started immediately, either directly using the java command or via the shell scripts that are included with the binary distribution. You can pass the location of the configuration files and the index data using environment variables. This is a Gradle snippet I am using to start an ElasticSearch instance:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;task runES(type: JavaExec) {&lt;br /&gt;    main = 'org.elasticsearch.bootstrap.ElasticSearch'&lt;br /&gt;    classpath = sourceSets.main.runtimeClasspath&lt;br /&gt;    systemProperties = [&quot;es.path.home&quot;:'' + projectDir + '/elastichome',&lt;br /&gt;                        &quot;es.path.data&quot;:'' + projectDir + '/elastichome/data']&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You might expect that ElasticSearch uses a bundled Jetty instance as it has become rather common nowadays. But no, it implements all the transport layer with the asynchronous networking library &lt;a href=&quot;http://netty.io/&quot; target=&quot;_blank&quot;&gt;Netty&lt;/a&gt; so you never deploy it to a Servlet container.&lt;/p&gt; &lt;p&gt;After you started ElasticSearch it will be available at http://localhost:9200. Any further instances that you are starting will automatically connect to the existing cluster and even use another port automatically so there is no need for configuration and you won't see any &quot;Address already in use&quot; problems.&lt;/p&gt; &lt;p&gt;You can check that your installation works using some curl commands. &lt;/p&gt; &lt;p&gt;Index some data:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XPOST 'http://localhost:9200/jug/talk/' -d '{&lt;br /&gt;    &quot;speaker&quot; : &quot;Alexander Reelsen&quot;,&lt;br /&gt;    &quot;date&quot; : &quot;2013-06-12T19:15:00&quot;,&lt;br /&gt;    &quot;title&quot; : &quot;Elasticsearch&quot;&lt;br /&gt;}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;And search it:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl -XGET 'http://localhost:9200/jug/talk/_search?q=elasticsearch'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The url contains two fragments that determine the index name (jug) and the type (talk). You can have multiple indices per ElasticSearch instance and multiple types per index. Each type has its own mapping (schema) but you can also search across multiple types and multiple indices. 
Note that we didn't create the index and the type, ElasticSearch figures out index name and mapping automatically from the URL and the structure of the indexed data.&lt;/p&gt; &lt;h4&gt;Java Client&lt;/h4&gt; &lt;p&gt;There are several alternative clients available when working with ElasticSearch from Java, like &lt;a href=&quot;https://github.com/searchbox-io/Jest&quot; target=&quot;_blank&quot;&gt;Jest&lt;/a&gt; that provides a POJO marshalling mechanism on indexing and for the search results. In this example we are using the Client that is included in ElasticSearch. By default the client doesn't use the REST API but connects to the cluster as a normal node that just doesn't store any data. It knows about the state of the cluster and can route requests to the correct node but supposedly consumes more memory. For our application this doesn't make a huge difference but for production systems that's something to think about. &lt;/p&gt; &lt;p&gt;This is an example setup for a Client object that can then be used for indexing and searching:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;Client client = NodeBuilder.nodeBuilder().client(true).node().client();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can use the client to create an index:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;client.admin().indices().prepareCreate(INDEX).execute().actionGet();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Note that the actionGet() isn't named this way because it is an HTTP GET request, this is a call to the Future object that is returned by execute, so this is the blocking part of the call.&lt;/p&gt; &lt;h4&gt;Mapping&lt;/h4&gt; &lt;p&gt;As you have seen with the indexing operation above ElasticSearch doesn't require an explicit schema like Solr does. It automatically determines the likely types from the JSON you are sending to it. Of course, this might not always be correct, and you might want to define custom analyzers for your content so you can also adjust the mappings to your needs. I was so used to the way Solr does this that I was looking for a way to add the mapping configuration via a file in the server config. This is something you can do indeed using a file called default-mapping.json or via &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/api/admin-indices-templates/&quot; target=&quot;_blank&quot;&gt;index templates&lt;/a&gt;. On the other hand you can also use the REST-based &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/&quot; target=&quot;_blank&quot;&gt;put mapping API&lt;/a&gt; which has the benefit that you don't need to distribute the file to all nodes manually and also you don't need to restart the server. The mapping then is part of the cluster state and will get distributed to all nodes automatically.&lt;/p&gt; &lt;p&gt;ElasticSearch provides most of its API via Builder classes. Surprisingly I didn't find a Builder for the mapping. 
One way to construct it is to use the generic JSON builder:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;XContentBuilder builder = XContentFactory.jsonBuilder().&lt;br /&gt;  startObject().&lt;br /&gt;    startObject(TYPE).&lt;br /&gt;      startObject(&quot;properties&quot;).&lt;br /&gt;        startObject(&quot;path&quot;).&lt;br /&gt;          field(&quot;type&quot;, &quot;string&quot;).field(&quot;store&quot;, &quot;yes&quot;).field(&quot;index&quot;, &quot;not_analyzed&quot;).&lt;br /&gt;        endObject().&lt;br /&gt;        startObject(&quot;title&quot;).&lt;br /&gt;          field(&quot;type&quot;, &quot;string&quot;).field(&quot;store&quot;, &quot;yes&quot;).field(&quot;analyzer&quot;, &quot;german&quot;).&lt;br /&gt;        endObject().&lt;br /&gt;        // more mapping&lt;br /&gt;      endObject().&lt;br /&gt;    endObject().&lt;br /&gt;  endObject();&lt;br /&gt;client.admin().indices().preparePutMapping(INDEX).setType(TYPE).setSource(builder).execute().actionGet();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Another way I have seen is to put the mapping in a file and just read it to a String, e.g. by using the &lt;a href=&quot;http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/io/Resources.html&quot; target=&quot;_blank&quot;&gt;Guava Resources class&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;After you have adjusted the mapping you can have a look at the result at the _mapping endpoint of the index at http://localhost:9200/jug/_mapping?pretty=true.&lt;/p&gt; &lt;h4&gt;Indexing&lt;/h4&gt; &lt;p&gt;Now we are ready to index some data. In the example application I am using simple data classes that represent talks to be indexed. Again, you have different options for how to transform your objects to the JSON ElasticSearch understands. You can build it by hand, e.g. with the XContentBuilder we have already seen above, or more conveniently, by using something like the JSON processor &lt;a href=&quot;http://jackson.codehaus.org/&quot; target=&quot;_blank&quot;&gt;Jackson&lt;/a&gt; that can serialize and deserialize Java objects to and from JSON. This is what it looks like when using the XContentBuilder:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;XContentBuilder sourceBuilder = XContentFactory.jsonBuilder().startObject()&lt;br /&gt;  .field(&quot;path&quot;, talk.path)&lt;br /&gt;  .field(&quot;title&quot;, talk.title)&lt;br /&gt;  .field(&quot;date&quot;, talk.date)&lt;br /&gt;  .field(&quot;content&quot;, talk.content)&lt;br /&gt;  .array(&quot;category&quot;, talk.categories.toArray(new String[0]))&lt;br /&gt;  .array(&quot;speaker&quot;, talk.speakers.toArray(new String[0]));&lt;br /&gt;IndexRequest request = new IndexRequest(INDEX, TYPE).id(talk.path).source(sourceBuilder);&lt;br /&gt;client.index(request).actionGet();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can also use the &lt;a href=&quot;http://www.elasticsearch.org/guide/reference/java-api/bulk/&quot; target=&quot;_blank&quot;&gt;BulkRequest&lt;/a&gt; to prevent having to send a request for each document.&lt;/p&gt; &lt;p&gt;With ElasticSearch you don't need to commit after you have indexed. By default, it will refresh the index every second which is fast enough for most use cases. If you want to be able to search the data as soon as possible you can also call refresh() on the client. This can be really useful when writing tests and you don't want to wait for a second between indexing and searching.&lt;/p&gt;
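&lt;p&gt;To sketch those last two points, this is roughly what bulk indexing followed by an explicit refresh could look like with the Java client. It is not taken from the example project; it assumes the Talk class used above and a small helper buildSource(talk) that creates the XContentBuilder shown before:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;BulkRequestBuilder bulkRequest = client.prepareBulk();&lt;br /&gt;for (Talk talk : talks) {&lt;br /&gt;    // buildSource(talk) stands for the XContentBuilder construction shown above&lt;br /&gt;    bulkRequest.add(client.prepareIndex(INDEX, TYPE)&lt;br /&gt;            .setId(talk.path)&lt;br /&gt;            .setSource(buildSource(talk)));&lt;br /&gt;}&lt;br /&gt;BulkResponse bulkResponse = bulkRequest.execute().actionGet();&lt;br /&gt;if (bulkResponse.hasFailures()) {&lt;br /&gt;    // handle or log failed items here&lt;br /&gt;}&lt;br /&gt;// make the documents searchable immediately, e.g. in tests&lt;br /&gt;client.admin().indices().prepareRefresh(INDEX).execute().actionGet();&lt;/code&gt;&lt;/pre&gt;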
&lt;p&gt;This concludes the first part of this article on getting started with ElasticSearch using Java. &lt;a href=&quot;http://blog.florian-hopf.de/2013/08/getting-started-with-elasticsearch-part.html&quot;&gt;The second part contains more information on searching the data we indexed.&lt;/a&gt;&lt;/p&gt;  </content>
 </entry>
 
 <entry>
   <title>Softwerkskammer Rhein-Main Open Space</title>
   <link href="http://blog.florian-hopf.de/2013/02/softwerkskammer-rhein-main-open-space.html"/>
   <updated>2013-02-19T16:39:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/02/softwerkskammer-rhein-main-open-space</id>
   <content type="html">&lt;p&gt;On Saturday I attended an &lt;a href=&quot;http://rhein-main-openspace.softwerkskammer.de/&quot; target=&quot;_blank&quot;&gt;Open Space in Wiesbaden&lt;/a&gt;, organized by members of Softwerkskammer Rhein-Main, a very active chapter of the German software craftsmanship community. The event took place in the offices of &lt;a href=&quot;http://www.seibert-media.net/&quot; target=&quot;_blank&quot;&gt;Seibert Media&lt;/a&gt; above a shopping mall, including a nice view of the city.&lt;/p&gt; &lt;h4&gt;The Format&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Open-space_meeting&quot; target=&quot;_blank&quot;&gt;Open Space conferences&lt;/a&gt; are special as there is no predefined agenda. All the attendees can bring ideas and propose those in the opening session and choose a time slot and room. Sessions are not necessarily normal presentations but rather discussions so it's even OK to just propose a question that you have or a topic you'd like to learn more about from the attendees. Also, there are some guidelines and rules: sessions don't need to start and end on time, you can always leave a session in case you feel you can't contribute and you shouldn't be disappointed if nobody shows up for your proposed session.&lt;/p&gt; &lt;h4&gt;Personal Kanban&lt;/h4&gt; &lt;p&gt;Dennis Traub presented a session on &lt;a href=&quot;http://www.personalkanban.com/pk/&quot; target=&quot;_blank&quot;&gt;Personal Kanban&lt;/a&gt;. As I did Kanban-style development in one project already I was eager to learn how to apply the principles to personal organization. Basically it all works the same as normal Kanban. Tasks are visualized on a board where a swimlane defines the state of a task with work items flowing from left (todo) to right (done). You can define swimlanes to fit your habits, e.g. one for todos, one for in progress and one for blocked. The in progress lane needs to have a Work in Progress limit which is the number of tasks you start and process in parallel. An important aspect is that you don't have to put all your backlog items into the todo lane but you can also keep them in a separate place. This keeps you from getting overwhelmed when looking at the board.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-nnBQsluqZeM/USMe5pA2fhI/AAAAAAAAAHI/WaRA1hgL1uA/s1600/2013-02-16+12.15.04.jpg&quot; imageanchor=&quot;1&quot; &gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-nnBQsluqZeM/USMe5pA2fhI/AAAAAAAAAHI/WaRA1hgL1uA/s320/2013-02-16+12.15.04.jpg&quot; vertical-align=&quot;middle&quot;/&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;It sounds like Kanban is a good way for organizing your daily life. For me personally the biggest hindrance is that I am working from my living room and I'd rather not put a Kanban board in my living room. If I used a separate office I guess I'd try it immediately.&lt;/p&gt; &lt;h4&gt;Open Source&lt;/h4&gt; &lt;p&gt;An attendee wanted to hear about some experiences with Open Source communities. Two full-time committers, &lt;a href=&quot;http://www.olivergierke.de/&quot; target=&quot;_blank&quot;&gt;Ollie&lt;/a&gt; for Spring and Marcel for Eclipse, shared some of their experiences. I am still surprised that a lot of Open Source projects have quite some bugs in their trackers that could easily be fixed by newcomers. 
A lot of people like Open Source software but not that many seem to be interested in contributing to a project continuously. Most of the interactions with users in the issue trackers are one-time reports, so people report one bug and move on. Even for big projects like Spring and Eclipse it's hard to find committers. One way to motivate people is to organize hack days where users learn to work with the sources of the projects but this also needs quite some preparation.&lt;/p&gt; &lt;h4&gt;Freelancing&lt;/h4&gt; &lt;p&gt;The topic of freelancing was discussed throughout the day. &lt;a href=&quot;http://tckr.cc/&quot; target=&quot;_blank&quot;&gt;Markus Tacker&lt;/a&gt; presented his idea of the kybernetic agency, a plan to form a freelance network with people who can work on projects together. We discussed benefits and possible problems, mainly of a legal nature. A quite inspiring session that also made me think about the difference between freelancing in the Java enterprise world and in PHP development. Most of the freelancers I know would prefer not to work 5 days a week for one client exclusively but that is often a prerequisite for projects in the enterprise world.&lt;/p&gt; &lt;h4&gt;Learning&lt;/h4&gt; &lt;p&gt;Learning is a topic that is very important to me so I proposed a session on it. I had already switched from 5 to 4 days during the last months of my employment at synyx because I felt the need to invest more time in learning which is often not possible when working on client projects. Even now as a freelancer I keep one day for learning only. What works best for me is writing blogposts that contain some sample code. I can build something and when writing the post I make sure that I have a deep understanding of the topic I am writing about. Other people also said that the most important aspect is to have something to work on, reading or watching screencasts alone is not a sustainable activity. I also liked the technique of another freelancer: whenever he notices that he could do something different on the current project he stops tracking the time for the customer and tries to find ways to improve the project, probably learning a new approach. This is something you are doing implicitly as a freelancer anyway, as you often spend some of your spare time thinking about client work, but I like this explicit approach.&lt;/p&gt; &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;All in all this was a really fruitful, but also exhausting, day. Though I chose meta topics exclusively I gained a lot from visiting. Thanks a lot to the organizers (mainly &lt;a href=&quot;http://www.squeakyvessel.com/&quot; target=&quot;_blank&quot;&gt;Benjamin&lt;/a&gt;), moderators, sponsors and all the attendees that made this event possible. I am looking forward to meeting a lot of the people again at &lt;a href=&quot;http://www.socrates-conference.de/&quot; target=&quot;_blank&quot;&gt;SoCraTes&lt;/a&gt; this year.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Book Review:  Gradle Effective Implementation Guide</title>
   <link href="http://blog.florian-hopf.de/2013/02/book-review-gradle-effective.html"/>
   <updated>2013-02-01T19:10:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/02/book-review-gradle-effective</id>
   <content type="html">&lt;i&gt;PacktPub kindly offered me a free review edition of &lt;a target=&quot;_blank&quot; href=&quot;http://www.packtpub.com/gradle-effective-implementation-guide/book&quot;&gt;Gradle Effective Implementation Guide&lt;/a&gt; written by &lt;a target=&quot;_blank&quot; href=&quot;http://mrhaki.blogspot.de/&quot;&gt;mrhaki Hubert Klein Ikkink&lt;/a&gt;. As I planned to read it anyway I agreed to write a review of it.&lt;/i&gt; &lt;p&gt;Maven was huge for Java Development. It brought dependency management, sane conventions and platform independent builds to the mainstream. If there is a Maven pom file available for an open source project you can be quite sure to manage to build it on your local machine in no time.&lt;/p&gt; &lt;p&gt;But there are cases when it doesn't work that well. Its phase model is rather strict and the one-artifact-per-build restriction can get in your way for more unusual build setups. You can workaround some of these problems using profiles and assemblies but it feels that it is primarily useful for a certain set of projects.&lt;/p&gt; &lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;http://www.gradle.org/&quot;&gt;Gradle&lt;/a&gt; is different. It's more flexible but there's also a learning curve involved. Groovy as its build DSL is easy to read but probably not that easy to write at first because there are often multiple ways to do something. As a standard Java developer like me you might be unsure about the proper way of doing something.&lt;/p&gt; &lt;p&gt;There are a lot of helpful resources online, namely the &lt;a target=&quot;_blank&quot; href=&quot;http://forums.gradle.org/gradle&quot;&gt;forum&lt;/a&gt; and the &lt;a target=&quot;_blank&quot; href=&quot;http://www.gradle.org/docs/current/userguide/userguide.html&quot;&gt;excellent user guide&lt;/a&gt; but as I prefer to read longer sections offline I am really glad that there now is a book available that contains extensive information and can get you started with Gradle.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a target=&quot;_blank&quot; href=&quot;http://www.packtpub.com/gradle-effective-implementation-guide/book&quot; style=&quot;clear:left; float:left;margin-right:1em; margin-bottom:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;320&quot; width=&quot;259&quot; src=&quot;http://1.bp.blogspot.com/-e8x0L35XNKc/UQuc7l1B__I/AAAAAAAAAG4/wwbHwriWSeU/s320/gradle-effective-implementation-guide.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Content&lt;/h4&gt; &lt;p&gt;The book starts with a general introduction into Gradle. You'll get a high level overview of its features, learn how to install it and write your first build file. You'll also learn some important options of the gradle executable that I haven't been aware of.&lt;/p&gt; &lt;p&gt;Chapter 2 explains tasks and how to write build files. This is a very important chapter if you are not that deep into the Groovy language. You'll learn about the implicitly available Task and Project instances and the different ways of accessing methods and properties and of defining tasks and dependencies between them.&lt;/p&gt; &lt;p&gt;Working with files is an important part of any build system. Chapter 3 contains detailed information on accessing and modifying files, file collections and file trees. This is also where the benefit of using Groovy becomes really obvious. 
The ease of working with collections can lead to very concise build definitions while you still have all the power of Groovy and the JVM at hand. The different log levels are useful to know and can come in handy when you'd like to diagnose a build.&lt;/p&gt; &lt;p&gt;While understanding tasks is an important foundation for working with Gradle it's likely that you want to use it with programming languages. Nearly all of the remaining chapters cover working with different aspects of builds for JVM languages. Chapter 4 starts with a look at the Java plugin and its additional concepts. You'll see how you can compile and package Java applications and how to work with sourceSets.&lt;/p&gt; &lt;p&gt;Nearly no application is an island. The Java world provides masses of useful libraries that can help you build your application. Proper dependency management, as introduced in Chapter 5, is important for easy build setups and for making sure that you do not introduce incompatible combinations of libraries. Gradle supports Maven, Ivy and local file-based repositories. Configurations are used to group dependencies, e.g. to define dependencies that are only necessary for tests. If you need to influence the version you are retrieving for a certain dependency you can configure resolution strategies, version ranges and exclusions for transitive dependencies.&lt;/p&gt; &lt;p&gt;Automated testing is a crucial part of any modern software development process. Gradle can work with JUnit and TestNG out of the box. Test execution times can be improved a lot by the incremental build support and the parallelization of tests. I guess this can lead to dramatically shorter build times, something I plan to try on an example project with a lot of tests in the near future. This chapter also introduces the different ways to run an application, create distributions and publish artifacts.&lt;/p&gt; &lt;p&gt;The next chapter will show you how you can structure your application in separate projects. Gradle has clever ways to find out which projects need to be rebuilt before and after building a certain project.&lt;/p&gt; &lt;p&gt;Chapter 8 contains information on how to work with Scala and Groovy code. The necessary compiler versions can be defined in the build so there is no need to have additional installations. I've heard good things about the Scala integration so Gradle seems to be a viable alternative to sbt.&lt;/p&gt; &lt;p&gt;The check task can be used to gather metrics on your project using many of the available open source projects for code quality measurement. Chapter 9 shows you how to include tools like Checkstyle, PMD and FindBugs to analyze your project sources, either standalone or by sending data to Sonar.&lt;/p&gt; &lt;p&gt;If you need additional functionality that is not available you can start implementing your own tasks and plugins. Chapter 10 introduces the important classes for writing custom plugins and how to use them from Groovy and Java.&lt;/p&gt; &lt;p&gt;Gradle can be used on several Continuous Integration systems. As I've been working with Hudson/Jenkins exclusively during the last years it was interesting to also read about the commercial alternatives Team City and Bamboo in Chapter 11.&lt;/p&gt; &lt;p&gt;The final chapter contains a lot of in-depth information on the Eclipse and IDEA plugins. Honestly, this contains more information on the Eclipse file format than I wanted to know but I guess that can be really useful for users. 
Unfortunately the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/kelemen/netbeans-gradle-project&quot;&gt;excellent Netbeans plugin&lt;/a&gt; is not described in the book.&lt;/p&gt; &lt;h4&gt;Summary&lt;/h4&gt; &lt;p&gt;The book is an excellent introduction into working effectively with Gradle. It has helped me to get a far better understanding of the concepts. If you are thinking about or already started working with Gradle I highly recommend to &lt;a target=&quot;_blank&quot; href=&quot;http://www.packtpub.com/gradle-effective-implementation-guide/book&quot;&gt;get a copy&lt;/a&gt;. There are a lot of detailed example files that you can use immediately. Many of those are very close to real world use cases and can help you thinking about additional ways Gradle can be useful for organizing your builds.&lt;/p&gt; </content>
 </entry>
 
 <entry>
   <title>Make your Filters Match: Faceting in Solr</title>
   <link href="http://blog.florian-hopf.de/2013/01/make-your-filters-match-faceting-in-solr.html"/>
   <updated>2013-01-24T16:49:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/01/make-your-filters-match-faceting-in-solr</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://searchhub.org/2009/09/02/faceted-search-with-solr/&quot; target=&quot;_blank&quot;&gt;Facets&lt;/a&gt; are a great search feature that lets users easily navigate to the documents they are looking for. Solr makes it really easy to use them though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what is happening when you are passing in filter queries for faceting. Also, I'll show how you can leverage local params to choose a different query parser when selecting facet values.&lt;/p&gt; &lt;h4&gt;Introduction&lt;/h4&gt; &lt;p&gt;Facets are a way to display categories next to a user's search results, often with a count of how many results are in this category. The user can then select one of those facet values to retrieve only those results that are assigned to this category. This way he doesn't have to know what category he is looking for when entering the search term as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.&lt;/p&gt; &lt;p&gt;Solr brought faceting to the Lucene world and arguably the feature was an important driving factor for its success (&lt;a href=&quot;http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html&quot; target=&quot;_blank&quot;&gt;Lucene 3.4 introduced faceting as well&lt;/a&gt;). Facets can be built from terms in the index, custom queries and ranges though in this post we will only look at field facets.&lt;/p&gt;   &lt;p&gt;As a very simple example consider this schema definition:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;fields&amp;gt;&lt;br/&gt;    &amp;lt;field name=&amp;quot;id&amp;quot; type=&amp;quot;string&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot; required=&amp;quot;true&amp;quot; multiValued=&amp;quot;false&amp;quot; /&amp;gt; &lt;br/&gt;    &amp;lt;field name=&amp;quot;text&amp;quot; type=&amp;quot;text_general&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;    &amp;lt;field name=&amp;quot;author&amp;quot; type=&amp;quot;string&amp;quot; indexed=&amp;quot;true&amp;quot; stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;/fields&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;There are three fields: the id, a text field with the title that we'd probably like to search on and an author. The author is defined as a string field which means no analyzing at all. The faceting mechanism uses the term value and not a stored value so we want to make sure that the original value is preserved. 
I explicitly don't store the author information to make it clear that we are working with the indexed value.&lt;/p&gt; &lt;p&gt;Let's index some book data with curl (see &lt;a href=&quot;https://github.com/fhopf/solr-facet-example&quot; target=&quot;_blank&quot;&gt;this GitHub repo&lt;/a&gt; for the complete example including some unit tests that execute the same functionality using Java).&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8082/solr/update -H &amp;quot;Content-Type: text/xml&amp;quot; --data-binary \&lt;br/&gt;    '&amp;lt;add&amp;gt;&amp;lt;doc&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;id&amp;quot;&amp;gt;1&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;text&amp;quot;&amp;gt;On the Shortness of Life&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;author&amp;quot;&amp;gt;Seneca&amp;lt;/field&amp;gt;&lt;br/&gt;    &amp;lt;/doc&amp;gt; &lt;br/&gt;    &amp;lt;doc&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;id&amp;quot;&amp;gt;2&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;text&amp;quot;&amp;gt;What I Talk About When I Talk About Running&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;author&amp;quot;&amp;gt;Haruki Murakami&amp;lt;/field&amp;gt;&lt;br/&gt;    &amp;lt;/doc&amp;gt; &lt;br/&gt;    &amp;lt;doc&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;id&amp;quot;&amp;gt;3&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;text&amp;quot;&amp;gt;The Dude and the Zen Master&amp;lt;/field&amp;gt;&lt;br/&gt;            &amp;lt;field name=&amp;quot;author&amp;quot;&amp;gt;Jeff &amp;quot;The Dude&amp;quot; Bridges&amp;lt;/field&amp;gt;&lt;br/&gt;    &amp;lt;/doc&amp;gt;&lt;br/&gt;    &amp;lt;/add&amp;gt;'&lt;br/&gt;curl http://localhost:8082/solr/update -H &amp;quot;Content-Type: text/xml&amp;quot; --data-binary '&amp;lt;commit /&amp;gt;'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;And verify that the documents are available:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8082/solr/query?q=*:*&lt;br/&gt;{&lt;br/&gt;  &amp;quot;responseHeader&amp;quot;:{&lt;br/&gt;    &amp;quot;status&amp;quot;:0,&lt;br/&gt;    &amp;quot;QTime&amp;quot;:3,&lt;br/&gt;    &amp;quot;params&amp;quot;:{&lt;br/&gt;      &amp;quot;q&amp;quot;:&amp;quot;*:*&amp;quot;}},&lt;br/&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:3,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[&lt;br/&gt;      {&lt;br/&gt;        &amp;quot;id&amp;quot;:&amp;quot;1&amp;quot;,&lt;br/&gt;        &amp;quot;text&amp;quot;:&amp;quot;On the Shortness of Life&amp;quot;},&lt;br/&gt;      {&lt;br/&gt;        &amp;quot;id&amp;quot;:&amp;quot;2&amp;quot;,&lt;br/&gt;        &amp;quot;text&amp;quot;:&amp;quot;What I Talk About When I Talk About Running&amp;quot;},&lt;br/&gt;      {&lt;br/&gt;        &amp;quot;id&amp;quot;:&amp;quot;3&amp;quot;,&lt;br/&gt;        &amp;quot;text&amp;quot;:&amp;quot;The Dude and the Zen Master&amp;quot;}]&lt;br/&gt;  }}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;I'll omit parts of the response in the following examples. 
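&lt;/p&gt; &lt;p&gt;If you'd rather do the same from Java, this is a minimal SolrJ sketch for indexing one of the books (not part of the original examples; the GitHub repository linked above contains complete tests):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;HttpSolrServer server = new HttpSolrServer(&quot;http://localhost:8082/solr&quot;);&lt;br /&gt;SolrInputDocument document = new SolrInputDocument();&lt;br /&gt;document.addField(&quot;id&quot;, &quot;1&quot;);&lt;br /&gt;document.addField(&quot;text&quot;, &quot;On the Shortness of Life&quot;);&lt;br /&gt;document.addField(&quot;author&quot;, &quot;Seneca&quot;);&lt;br /&gt;server.add(document);&lt;br /&gt;// make the document visible for searching&lt;br /&gt;server.commit();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;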
We can also have a look at the shiny new administration view of Solr 4 to see all terms that are indexed for the field author.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-Gyieq4YvHEo/UQDpZCNB82I/AAAAAAAAAGk/AVL8zIszDzE/s1600/author-fields.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;238&quot; width=&quot;320&quot; src=&quot;http://4.bp.blogspot.com/-Gyieq4YvHEo/UQDpZCNB82I/AAAAAAAAAGk/AVL8zIszDzE/s320/author-fields.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Each of the author names is indexed as one term.&lt;/p&gt; &lt;h4&gt;Faceting&lt;/h4&gt; &lt;p&gt;Let's move on to the faceting part. To let the user drill down on search results there are two steps involved. First you tell Solr that you would like to retrieve facets with the results. Facets are contained in an extra section of the response and consist of the indexed term as well as a count. As with most Solr parameters you can either send the necessary options with the query or preconfigure them in solrconfig.xml. This query has faceting on the author field enabled:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl &amp;quot;http://localhost:8082/solr/query?q=*:*&amp;facet=on&amp;facet.field=author&amp;quot;&lt;br /&gt;{&lt;br /&gt;  &amp;quot;responseHeader&amp;quot;:{...},&lt;br /&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:3,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;1&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;On the Shortness of Life&amp;quot;},&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;2&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;What I Talk About When I Talk About Running&amp;quot;},&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;3&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;The Dude and the Zen Master&amp;quot;}]&lt;br /&gt;  },&lt;br /&gt;  &amp;quot;facet_counts&amp;quot;:{&lt;br /&gt;    &amp;quot;facet_queries&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_fields&amp;quot;:{&lt;br /&gt;      &amp;quot;author&amp;quot;:[&lt;br /&gt;        &amp;quot;Haruki Murakami&amp;quot;,1,&lt;br /&gt;        &amp;quot;Jeff \&amp;quot;The Dude\&amp;quot; Bridges&amp;quot;,1,&lt;br /&gt;        &amp;quot;Seneca&amp;quot;,1]},&lt;br /&gt;    &amp;quot;facet_dates&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_ranges&amp;quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;And this is what a configuration in solrconfig looks like:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;requestHandler name=&amp;quot;/select&amp;quot; class=&amp;quot;solr.SearchHandler&amp;quot;&amp;gt;&lt;br /&gt;  &amp;lt;lst name=&amp;quot;defaults&amp;quot;&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;q&amp;quot;&amp;gt;*:*&amp;lt;/str&amp;gt;  &lt;br /&gt;    &amp;lt;str name=&amp;quot;echoParams&amp;quot;&amp;gt;none&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;int name=&amp;quot;rows&amp;quot;&amp;gt;10&amp;lt;/int&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;df&amp;quot;&amp;gt;text&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;facet&amp;quot;&amp;gt;on&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;facet.field&amp;quot;&amp;gt;author&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str 
name=&amp;quot;facet.mincount&amp;quot;&amp;gt;1&amp;lt;/str&amp;gt;&lt;br /&gt;  &amp;lt;/lst&amp;gt;&lt;br /&gt;&amp;lt;/requestHandler&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This way we don't have to pass the parameters with the query anymore and can see which parts of the query change.&lt;/p&gt; &lt;h4&gt;Common Filtering&lt;/h4&gt; &lt;p&gt;When a user chooses a facet you issue the same query again, this time adding a filter query that restricts the search results to any that have the value for this certain fields set. In our case the user would only see books of one certain author. Let's start simple and pretend that a user can't handle the massive amount of 3 search results and is only interested in books on Seneca:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:Seneca'&lt;br /&gt;{&lt;br /&gt;  &amp;quot;responseHeader&amp;quot;:{...},&lt;br /&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:1,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;1&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;On the Shortness of Life&amp;quot;}]&lt;br /&gt;  },&lt;br /&gt;  &amp;quot;facet_counts&amp;quot;:{&lt;br /&gt;    &amp;quot;facet_queries&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_fields&amp;quot;:{&lt;br /&gt;      &amp;quot;author&amp;quot;:[&lt;br /&gt;        &amp;quot;Seneca&amp;quot;,1]},&lt;br /&gt;    &amp;quot;facet_dates&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_ranges&amp;quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Works fine. We added a filter query that restricts the results to only those that are written by Seneca. Note that there is only one facet left because the search results don't contain any books by other authors. Let's see what happens when we try to filter the results to see only books by Haruki Murakami. We need to URL encode the blank, the rest of the query stays the same:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:Haruki%20Murakami'&lt;br /&gt;{&lt;br /&gt;  &amp;quot;responseHeader&amp;quot;:{...},&lt;br /&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:0,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[]&lt;br /&gt;  },&lt;br /&gt;  &amp;quot;facet_counts&amp;quot;:{&lt;br /&gt;    &amp;quot;facet_queries&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_fields&amp;quot;:{&lt;br /&gt;      &amp;quot;author&amp;quot;:[]},&lt;br /&gt;    &amp;quot;facet_dates&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_ranges&amp;quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;No results. Why is that? The default query parser for filter queries is the Lucene query parser. It tokenizes the query on whitespace, so even if we store the field unanalyzed it's not the query we are probably expecting to use. The query that is the result of the parsing process is not a term query as in our first example. It's a boolean query that consists of two term queries &lt;code&gt;author:Haruki text:murakami&lt;/code&gt;. If you are familiar with the &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html&quot; target=&quot;_blank&quot;&gt;Lucene query syntax&lt;/a&gt; this won't be a surprise to you. If you prefix a term with a field name and a colon it will search on this field, otherwise it will search on the default field we declared in solrconfig.xml.&lt;/p&gt; &lt;p&gt;How can we fix it? 
Simple, just turn it into a phrase by surrounding the words with double quotes:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:&amp;quot;Haruki%20Murakami&amp;quot;'&lt;br /&gt;{&lt;br /&gt;  &amp;quot;responseHeader&amp;quot;:{...},&lt;br /&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:1,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;2&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;What I Talk About When I Talk About Running&amp;quot;}]&lt;br /&gt;  },&lt;br /&gt;  &amp;quot;facet_counts&amp;quot;:{&lt;br /&gt;    &amp;quot;facet_queries&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_fields&amp;quot;:{&lt;br /&gt;      &amp;quot;author&amp;quot;:[&lt;br /&gt;        &amp;quot;Haruki Murakami&amp;quot;,1]},&lt;br /&gt;    &amp;quot;facet_dates&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_ranges&amp;quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Or, if you prefer, you can also escape the blank using the backslash, which yields the same result:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:Haruki\%20Murakami'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Fun fact: I am not that good at picking examples. If we are filtering on our last author we will be surprised (at least I scratched my head for a while):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:Jeff%20&amp;quot;The%20Dude&amp;quot;%20Bridges'&lt;br /&gt;{&lt;br /&gt;  &amp;quot;responseHeader&amp;quot;:{...},&lt;br /&gt;  &amp;quot;response&amp;quot;:{&amp;quot;numFound&amp;quot;:1,&amp;quot;start&amp;quot;:0,&amp;quot;docs&amp;quot;:[&lt;br /&gt;      {&lt;br /&gt;        &amp;quot;id&amp;quot;:&amp;quot;3&amp;quot;,&lt;br /&gt;        &amp;quot;text&amp;quot;:&amp;quot;The Dude and the Zen Master&amp;quot;}]&lt;br /&gt;  },&lt;br /&gt;  &amp;quot;facet_counts&amp;quot;:{&lt;br /&gt;    &amp;quot;facet_queries&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_fields&amp;quot;:{&lt;br /&gt;      &amp;quot;author&amp;quot;:[&lt;br /&gt;        &amp;quot;Jeff \&amp;quot;The Dude\&amp;quot; Bridges&amp;quot;,1]},&lt;br /&gt;    &amp;quot;facet_dates&amp;quot;:{},&lt;br /&gt;    &amp;quot;facet_ranges&amp;quot;:{}}}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;This actually seemed to work though we neither turned it into a phrase nor did we escape the blanks. If we look at how the Lucene query parser handles this query we see immediately why this returns a result. As with the last example this is turned into a boolean query, only the first query is executed against the author field. The other two tokens are searching on the default field and in this case &quot;The Dude&quot; matches the text field: &lt;code&gt;author:Jeff text:&quot;the dude&quot; text:bridges&lt;/code&gt;. 
If you just want to match on the author field you can escape the blanks as we did in the example before:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq=author:Jeff\%20\&quot;The\%20Dude\&quot;\%20Bridges'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;I'll spare you the response.&lt;/p&gt; &lt;h4&gt;Using Local Params to set the Query Parser&lt;/h4&gt; &lt;p&gt;At ApacheCon Europe in November Erik Hatcher did &lt;a href=&quot;http://archive.apachecon.com/eu2012/presentations/06-Tuesday/PR-Lucene/aceu-2012-query-parsing-tips-and-tricks.pdf&quot; target=&quot;_blank&quot;&gt;a really interesting presentation on query parsers in Solr&lt;/a&gt; where he introduced another, probably cleaner way to do this: You can use the &lt;a href=&quot;http://wiki.apache.org/solr/LocalParams&quot; target=&quot;_blank&quot;&gt;local param syntax&lt;/a&gt; for choosing a different query parser. As we have learnt, the query parser defaults to the Lucene query parser. You can change the query parser for the query by setting the defType parameter, either via request parameters or in the solrconfig.xml but I am not aware of any way to set it for the filter queries. As we have unanalyzed terms the correct thing to do would be to use a TermQuery, which can be built using the &lt;a href=&quot;http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/search/TermQParserPlugin.html&quot; target=&quot;_blank&quot;&gt;TermQParserPlugin&lt;/a&gt;. To use this parser we can explicitly set it in the filter query:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq={!term%20f=author%20v='Jeff%20&quot;The%20Dude&quot;%20Bridges'}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Or, for better readability, without the URL encoding:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/select?fq={!term f=author v='Jeff &quot;The Dude&quot; Bridges'}'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The local params are enclosed by curly braces. The value term is a shorthand for type='term', f is the field the TermQuery should be built for and v the value. Though this might look quirky at first this is a really powerful feature, especially since you can reference other request parameters from the local params. 
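&lt;/p&gt; &lt;p&gt;The same local params work from Java as well. This is a rough SolrJ sketch (again not from the original examples) that requests the author facet and applies the term filter, assuming an HttpSolrServer instance pointing at the same core:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;HttpSolrServer server = new HttpSolrServer(&quot;http://localhost:8082/solr&quot;);&lt;br /&gt;SolrQuery query = new SolrQuery(&quot;*:*&quot;);&lt;br /&gt;query.setFacet(true);&lt;br /&gt;query.addFacetField(&quot;author&quot;);&lt;br /&gt;// let the TermQParserPlugin handle the unanalyzed author value&lt;br /&gt;query.addFilterQuery(&quot;{!term f=author}Jeff \&quot;The Dude\&quot; Bridges&quot;);&lt;br /&gt;QueryResponse response = server.query(query);&lt;br /&gt;for (FacetField.Count count : response.getFacetField(&quot;author&quot;).getValues()) {&lt;br /&gt;    System.out.println(count.getName() + &quot;: &quot; + count.getCount());&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;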
Consider this configuration of a request handler:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;requestHandler name=&amp;quot;/selectfiltered&amp;quot; class=&amp;quot;solr.SearchHandler&amp;quot;&amp;gt;&lt;br /&gt;  &amp;lt;lst name=&amp;quot;defaults&amp;quot;&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;q&amp;quot;&amp;gt;*:*&amp;lt;/str&amp;gt;  &lt;br /&gt;    &amp;lt;str name=&amp;quot;echoParams&amp;quot;&amp;gt;explicit&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;int name=&amp;quot;rows&amp;quot;&amp;gt;10&amp;lt;/int&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;wt&amp;quot;&amp;gt;json&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;indent&amp;quot;&amp;gt;true&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;df&amp;quot;&amp;gt;text&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;facet&amp;quot;&amp;gt;on&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;facet.field&amp;quot;&amp;gt;author&amp;lt;/str&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;facet.mincount&amp;quot;&amp;gt;1&amp;lt;/str&amp;gt;&lt;br /&gt;  &amp;lt;/lst&amp;gt;&lt;br /&gt;  &amp;lt;lst name=&amp;quot;appends&amp;quot;&amp;gt;&lt;br /&gt;    &amp;lt;str name=&amp;quot;fq&amp;quot;&amp;gt;{!term f=author v=$author}&amp;lt;/str&amp;gt;&lt;br /&gt;  &amp;lt;/lst&amp;gt;&lt;br /&gt;&amp;lt;/requestHandler&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The default configuration is the same as we were using above. Only the appends section is new, which adds additional parameters to the request. There are similar local params as we were using via curl, but the real filter query is replaced by the variable $author. This can now be passed in cleanly via an aptly named parameter:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl 'http://localhost:8082/solr/selectfiltered?author=Jeff%20&quot;The%20Dude&quot;%20Bridges'&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;There are a lot of powerful features in Solr that are not that commonly used. To see this example in Java have a look at &lt;a href=&quot;https://github.com/fhopf/solr-facet-example&quot; target=&quot;_blank&quot;&gt;the Github repository of this blogpost&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>JUnit Rule for ElasticSearch</title>
   <link href="http://blog.florian-hopf.de/2013/01/junit-rule-for-elasticsearch.html"/>
   <updated>2013-01-10T15:03:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/01/junit-rule-for-elasticsearch</id>
   <content type="html">&lt;p&gt;While I am using &lt;a href=&quot;http://lucene.apache.org/solr/&quot; target=&quot;_blank&quot;&gt;Solr&lt;/a&gt; a lot in my current engagement I recently started a pet project with &lt;a href=&quot;http://www.elasticsearch.org/&quot; target=&quot;_blank&quot;&gt;ElasticSearch&lt;/a&gt; to learn more about it. Some of its functionality is rather different from Solr so there is quite some experimentation involved. I like to start small and implement tests if I like to find out how things work (see &lt;a href=&quot;http://blog.florian-hopf.de/2012/06/running-and-testing-solr-with-gradle.html&quot; target=&quot;_blank&quot;&gt;this post on how to write tests for Solr&lt;/a&gt;).&lt;/p&gt; &lt;p&gt;ElasticSearch internally uses &lt;a href=&quot;http://testng.org/doc/index.html&quot; target=&quot;_blank&quot;&gt;TestNG&lt;/a&gt; and the test classes are not available in the distributed jar files. Fortunately it is really easy to start an ElasticSearch instance from within a test so it's no problem to do something similar in JUnit. Felix Müller posted some &lt;a href=&quot;http://cupofjava.de/blog/2012/11/27/embedded-elasticsearch-server-for-tests/&quot; target=&quot;_blank&quot;&gt;useful code snippets&lt;/a&gt; on how to do this, obviously targeted at a Maven build. The ElasticSearch instance is started in a setUp method and stopped in a tearDown method:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;private EmbeddedElasticsearchServer embeddedElasticsearchServer;&lt;br /&gt;&lt;br /&gt;@Before&lt;br /&gt;public void startEmbeddedElasticsearchServer() {&lt;br /&gt;    embeddedElasticsearchServer = new EmbeddedElasticsearchServer();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;@After&lt;br /&gt;public void shutdownEmbeddedElasticsearchServer() {&lt;br /&gt;    embeddedElasticsearchServer.shutdown();&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;As it is rather cumbersome to add these methods to all tests I transformed the code to a &lt;a href=&quot;http://kentbeck.github.com/junit/javadoc/4.10/org/junit/rules/TestRule.html&quot; target=&quot;_blank&quot;&gt;JUnit rule&lt;/a&gt;. Rules can execute code before and after a test is run and influence its execution. 
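&lt;/p&gt; &lt;p&gt;In its most basic form a rule is just an implementation of the TestRule interface that wraps the test in a Statement. A contrived logging rule, only to illustrate the mechanism (not needed for the ElasticSearch example):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class LoggingRule implements TestRule {&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public Statement apply(final Statement base, final Description description) {&lt;br /&gt;        return new Statement() {&lt;br /&gt;            @Override&lt;br /&gt;            public void evaluate() throws Throwable {&lt;br /&gt;                // executed before the test&lt;br /&gt;                System.out.println(&quot;starting &quot; + description.getDisplayName());&lt;br /&gt;                try {&lt;br /&gt;                    base.evaluate(); // the test itself&lt;br /&gt;                } finally {&lt;br /&gt;                    // executed after the test&lt;br /&gt;                    System.out.println(&quot;finished &quot; + description.getDisplayName());&lt;br /&gt;                }&lt;br /&gt;            }&lt;br /&gt;        };&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Writing this anonymous Statement boilerplate for every rule would be tedious. 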
There are some base classes available that make it really easy to get started with custom rules.&lt;/p&gt; &lt;p&gt;Our ElasticSearch example can be easily modeled using the base class &lt;a href=&quot;http://kentbeck.github.com/junit/javadoc/4.10/org/junit/rules/ExternalResource.html&quot; target=&quot;_blank&quot;&gt;ExternalResource&lt;/a&gt; (see the &lt;a href=&quot;https://github.com/fhopf/elasticsearch-junit-rule&quot; target=&quot;_blank&quot;&gt;full example code on GitHub&lt;/a&gt;):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class ElasticsearchTestNode extends ExternalResource {&lt;br /&gt;&lt;br /&gt;    private Node node;&lt;br /&gt;    private Path dataDirectory;&lt;br /&gt;    &lt;br /&gt;    @Override&lt;br /&gt;    protected void before() throws Throwable {&lt;br /&gt;        try {&lt;br /&gt;            dataDirectory = Files.createTempDirectory(&quot;es-test&quot;, new FileAttribute&lt;?&gt; []{});&lt;br /&gt;        } catch (IOException ex) {&lt;br /&gt;            throw new IllegalStateException(ex);&lt;br /&gt;        }&lt;br /&gt;&lt;br /&gt;        ImmutableSettings.Builder elasticsearchSettings = ImmutableSettings.settingsBuilder()&lt;br /&gt;                .put(&quot;http.enabled&quot;, &quot;false&quot;)&lt;br /&gt;                .put(&quot;path.data&quot;, dataDirectory.toString());&lt;br /&gt;&lt;br /&gt;        node = NodeBuilder.nodeBuilder()&lt;br /&gt;                .local(true)&lt;br /&gt;                .settings(elasticsearchSettings.build())&lt;br /&gt;                .node();&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    protected void after() {&lt;br /&gt;        node.close();&lt;br /&gt;        try {&lt;br /&gt;            FileUtils.deleteDirectory(dataDirectory.toFile());&lt;br /&gt;        } catch (IOException ex) {&lt;br /&gt;            throw new IllegalStateException(ex);&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    public Client getClient() {&lt;br /&gt;        return node.client();&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The before method is executed before the test is run so we can use it to start ElasticSearch. All data is written to a temporary folder. The after method is used to stop ElasticSearch and delete the folder.&lt;/p&gt; &lt;p&gt;In your test you can now just use the rule, either with the @Rule annotation to have it triggered on each test method, or using @ClassRule to execute it only once per class:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class CoreTest {&lt;br /&gt;&lt;br /&gt;    @Rule&lt;br /&gt;    public ElasticsearchTestNode testNode = new ElasticsearchTestNode();&lt;br /&gt;    &lt;br /&gt;    @Test&lt;br /&gt;    public void indexAndGet() throws IOException {&lt;br /&gt;        testNode.getClient().prepareIndex(&quot;myindex&quot;, &quot;document&quot;, &quot;1&quot;)&lt;br /&gt;                .setSource(jsonBuilder().startObject().field(&quot;test&quot;, &quot;123&quot;).endObject())&lt;br /&gt;                .execute()&lt;br /&gt;                .actionGet();&lt;br /&gt;        &lt;br /&gt;        GetResponse response = testNode.getClient().prepareGet(&quot;myindex&quot;, &quot;document&quot;, &quot;1&quot;).execute().actionGet();&lt;br /&gt;        assertThat(response.getSource().get(&quot;test&quot;)).isEqualTo(&quot;123&quot;);&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;As it is really easy to implement custom rules I think this is a feature I will be using more often in the future.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>12 Conferences of 2012</title>
   <link href="http://blog.florian-hopf.de/2013/01/12-conferences-of-2012.html"/>
   <updated>2013-01-03T17:52:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2013/01/12-conferences-of-2012</id>
   <content type="html">&lt;p&gt;I went to a lot of conferences in 2012, probably too many. As the year is over now I'd like to summarize some of my impressions; maybe there's a conference you didn't know about that you'd like to attend this year.&lt;/p&gt; &lt;h4&gt;FOSDEM&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;https://fosdem.org&quot; target=&quot;_blank&quot;&gt;FOSDEM&lt;/a&gt; is the Free and Open Source Software Developers' European Meeting, a yearly event that takes place in Brussels, Belgium. There are multiple tracks and developer rooms on a multitude of topics ranging from databases to programming languages and open source tools. The rooms are spread across some buildings at the University, so there might be some walking involved when switching tracks. What is rather special is that there's no registration involved: you just go there and that's it. The number of people can be overwhelming, especially in the main entrance area. Unfortunately I was rather disappointed with the talks I chose. The event is a very good fit if you are working on an Open Source project and, as the name of the conference suggests, want to meet other developers of the project.&lt;/p&gt; &lt;h4&gt;Berlin Expert Days&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.bed-con.de/&quot; target=&quot;_blank&quot;&gt;BED-Con&lt;/a&gt; is a rather young conference organized by the Java User Group Berlin-Brandenburg. I wasn't there in the first year, but in 2012 it still had a small and informal feeling. The conference takes place in three rooms of the Freie Universität Berlin; the content selection was an excellent mixture of technical and process/meta talks, most of them in German. If you can afford the trip to Berlin I'd definitely recommend going there.&lt;/p&gt; &lt;h4&gt;JAX&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://jax.de/&quot; target=&quot;_blank&quot;&gt;The largest and best known German Java conference.&lt;/a&gt; There are two editions, JAX in April in Mainz (the one I attended) and W-JAX in November in Munich. There's one huge hall and several smaller rooms and a wide variety of topics you can choose from. I never planned to go there as the admission fee is rather high (thanks to Software &amp;amp; Support for sponsoring my ticket) but I have to admit that it really can be worth the money. There were excellent talks by Charles Nutter, Tim Berglund and many more. The infrastructure (food, coffee, schedules) is very good; if you are on a business budget you can gain a lot by visiting.&lt;/p&gt;   &lt;h4&gt;Berlin Buzzwords&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://berlinbuzzwords.de/&quot; target=&quot;_blank&quot;&gt;A niche conference on Search, Scale and Big Data.&lt;/a&gt; A lot of people come from overseas just to visit. If you are interested in these technologies, definitely go there. For more information see &lt;a href=&quot;http://blog.florian-hopf.de/2012/06/berlin-buzzwords-2012.html&quot;&gt;this post&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Barcamp Karlsruhe&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.barcamp-karlsruhe.de/&quot; target=&quot;_blank&quot;&gt;My first real Barcamp.&lt;/a&gt; Really fun event, but of course there are always some sessions that are not as interesting as anticipated. Topics ranged from computers and work to softer content. I always thought this was a nerd-only event, but as there is so much to choose from, Barcamps might even be interesting for people who are not that much into computers. 
Very well organized, no admission fee, interesting sessions.&lt;/p&gt; &lt;h4&gt;Socrates&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.socrates-conference.de/&quot;&gt;The International Software Craftsmanship and Testing Conference.&lt;/a&gt; Awesome setting and the first open space conference I attended. It takes place in a seminar center in the middle of nowhere, which makes it a very intense experience. Besides the sessions there are a lot of informal discussions going on throughout the day with the very enthusiastic attendees. The 2012 event started Thursday evening with a world cafe, Friday and Saturday open space and an optional code retreat on Sunday. I'd say there were three kinds of sessions: informal discussion rounds, practical hands-on sessions and talks. It seems that most people liked the practical sessions best, so if I could choose again I'd go to more of those. Thanks to the sponsors all we had to pay was the accommodation and one meal, which additionally makes it an incredibly cheap event. Be quick with registration as space is limited.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-8aVDubc9AV0/UOVNc4A7F5I/AAAAAAAAAFk/LAbJmWk6t0o/s1600/2012-08-03%2B18.01.43.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;240&quot; width=&quot;320&quot; src=&quot;http://2.bp.blogspot.com/-8aVDubc9AV0/UOVNc4A7F5I/AAAAAAAAAFk/LAbJmWk6t0o/s320/2012-08-03%2B18.01.43.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;FrOSCon&lt;/h4&gt; &lt;p&gt;The &lt;a href=&quot;http://www.froscon.de/startseite/&quot; target=&quot;_blank&quot;&gt;Free and Open Source Software Conference&lt;/a&gt; is a great community weekend event with different tracks on admin and development topics. I like it a lot because of the variation of talks and the very informal setting. It's a mixture of holiday and learning and for me a chance to get information on topics that are not presented at the other developer conferences I attend. Talks are partly English, partly German. You can stay either in St. Augustin or in Bonn; it's only a short tram ride.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/--dGD_kHmce8/UOVN5f7oYcI/AAAAAAAAAFw/joZu52SyOFM/s1600/2012-08-25%2B16.36.21.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;240&quot; width=&quot;320&quot; src=&quot;http://1.bp.blogspot.com/--dGD_kHmce8/UOVN5f7oYcI/AAAAAAAAAFw/joZu52SyOFM/s320/2012-08-25%2B16.36.21.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;JAX On Tour&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://jax-on-tour.de/&quot; target=&quot;_blank&quot;&gt;JAX On Tour&lt;/a&gt; is another event I attended because of the generous sponsorship of Software &amp;amp; Support. It's not a conference but a training event with longer talks that are grouped together. It's a small event and there's always time for questions. I learned a lot, mainly about documenting architectures. This is a really good alternative to a normal conference to grasp a topic in depth.&lt;/p&gt; &lt;h4&gt;OpenCms-Days&lt;/h4&gt; &lt;p&gt;If you are into OpenCms you have probably already heard about &lt;a href=&quot;http://www.opencms-days.org&quot; target=&quot;_blank&quot;&gt;OpenCms-Days&lt;/a&gt;, and if not this is probably not for you. 
Two days of OpenCms only, used by Alkacon to present new features and by the community to present extensions and projects. I am always impressed that there are people who fly around the world just to attend, but of course this is the only conference of its kind worldwide. There is always something new to learn and it's fun to meet the community.&lt;/p&gt; &lt;h4&gt;DevFest Karlsruhe&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.devfest.info/event/ag9zfmRldmZlc3RnbG9iYWxyDQsSBUV2ZW50GNKsBgw&quot;&gt;A one day event on Google technologies.&lt;/a&gt; There have been several events in different cities worldwide, this one organized by the Google Developer Group Karlsruhe. The organizers were really unlucky as multiple speakers canceled on short notice, nevertheless there were some really good talks. Kudos to the organizers who managed to get this event started in a really short time frame.&lt;/p&gt; &lt;h4&gt;ApacheCon Europe&lt;/h4&gt; &lt;p&gt;I originally went to &lt;a href=&quot;http://www.apachecon.eu/&quot; target=&quot;_blank&quot;&gt;ApacheCon&lt;/a&gt; because it took place in Sinsheim, which is close to Karlsruhe. Fortunately in this year the conference also hosted the LuceneCon Europe, so there were lots of interesting talks for me. Additionally I had been voted as a committer to the &lt;a href=&quot;http://incubator.apache.org/odftoolkit/&quot; target=&quot;_blank&quot;&gt;ODFToolkit&lt;/a&gt; just before it and I was able to meet some other people that are involved in the project. The location was really special (soccer stadium) but I think a lot of non-locals had to suffer a bit because of the lack of hotels and taxis. This community event can be really interesting even if you are only a user of a project.&lt;/p&gt;   &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-216RO9qNaj8/UOVQvGWfO-I/AAAAAAAAAGE/d_VYS67dRS8/s1600/2012-11-06%2B10.10.07.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;240&quot; width=&quot;320&quot; src=&quot;http://3.bp.blogspot.com/-216RO9qNaj8/UOVQvGWfO-I/AAAAAAAAAGE/d_VYS67dRS8/s320/2012-11-06%2B10.10.07.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;Devoxx&lt;/h4&gt; &lt;p&gt;&lt;a href=&quot;http://www.devoxx.com&quot; target=&quot;_blank&quot;&gt;Devoxx&lt;/a&gt; is the largest Java conference in Europe, organized by the Java User Group Belgium. They attract a lot of high class speakers so this is the place to keep you informed on the Java universe. Located in a large multiplex cinema in a suburb of Antwerp, very comfy chairs and huge screens. The week starts with two university days that contain longer talks and are usually less crowded. This year I went there on Wednesday, due to a train strike in Belgium I arrived quite late. I have to admit that it probably wasn't worth the hassle for only 1.5 days. 
If you are going there I recommend staying the whole week.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-dVO5wozSqGA/UOVRHYZ-qHI/AAAAAAAAAGQ/M3VAJRLzG2s/s1600/2012-11-14%2B17.51.50.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;240&quot; width=&quot;320&quot; src=&quot;http://3.bp.blogspot.com/-dVO5wozSqGA/UOVRHYZ-qHI/AAAAAAAAAGQ/M3VAJRLzG2s/s320/2012-11-14%2B17.51.50.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;h4&gt;2013&lt;/h4&gt; &lt;p&gt;So this was a lot last year. I don't plan to visit that many conferences again. I am sure that I will be going to Berlin Buzzwords, Socrates and FrOSCon. There will probably be more, but it won't be 13 this year :).&lt;/p&gt;  &lt;p&gt;Finally, I couldn't have afforded to pay for all those conferences myself, so thanks to &lt;a href=&quot;http://synyx.de/&quot; target=&quot;_blank&quot;&gt;synyx&lt;/a&gt;, who paid for FOSDEM and BEDCon when I was still employed there and provided me with a free ticket to OpenCms-Days. Thanks to &lt;a href=&quot;http://sandsmedia.com/&quot; target=&quot;_blank&quot;&gt;Software &amp;amp; Support&lt;/a&gt; for letting me attend JAX and JAX On Tour for free; those guys are fantastic supporters of the Java User Group Karlsruhe. Also, thanks to the Devoxx team for letting me attend as an ambassador of our JUG.&lt;/p&gt;   </content>
 </entry>
 
 <entry>
   <title>Gradle is too Clever for my Plans</title>
   <link href="http://blog.florian-hopf.de/2012/12/gradle-is-too-clever-for-my-plans.html"/>
   <updated>2012-12-20T14:16:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/12/gradle-is-too-clever-for-my-plans</id>
   <content type="html">&lt;p&gt;While writing &lt;a href=&quot;http://blog.florian-hopf.de/2012/12/looking-at-plaintext-lucene-index.html&quot;&gt;this post about the Lucene Codec API&lt;/a&gt; I noticed something strange when running the tests with &lt;a href=&quot;http://www.gradle.org/&quot; target=&quot;_blank&quot;&gt;Gradle&lt;/a&gt;. When experimenting with a library feature most of the time I write unit tests that validate my expectations. This is a habit I learned from &lt;a href=&quot;http://www.manning.com/hatcher2/&quot; target=&quot;_blank&quot;&gt;Lucene in Action&lt;/a&gt; that can also be useful in real-world scenarios, e.g. to make sure that nothing breaks when you update a library.&lt;/p&gt; &lt;p&gt;OK, what happened? This time I not only wanted the test result but also ran the test for a side effect: I wanted a Lucene index to be written to the /tmp directory so I could manually have a look at it. This worked fine the first time, but not afterwards, e.g. after my machine was rebooted and the directory cleared.&lt;/p&gt; &lt;p&gt;It turns out that the Gradle developers know that a test shouldn't be used to execute stuff. So once the test is run successfully it is just not run again &lt;a href=&quot;http://gradle.1045684.n5.nabble.com/how-does-gradle-decide-when-to-run-tests-td3314172.html#none&quot; target=&quot;_blank&quot;&gt;until its input changes&lt;/a&gt;! Though this bit me this time, it is a really nice feature to speed up your builds. And if you really need to execute the tests, you can always run &lt;code&gt;gradle cleanTest test&lt;/code&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Looking at a Plaintext Lucene Index</title>
   <link href="http://blog.florian-hopf.de/2012/12/looking-at-plaintext-lucene-index.html"/>
   <updated>2012-12-07T19:45:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/12/looking-at-plaintext-lucene-index</id>
   <content type="html">&lt;p&gt;The &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#package_description&quot; target=&quot;_blank&quot;&gt;Lucene file format&lt;/a&gt; is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect if you don't use tools like the fantastic &lt;a href=&quot;http://code.google.com/p/luke/&quot; target=&quot;_blank&quot;&gt;Luke&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Starting with Lucene 4 the format for these files can be configured using the &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/package-summary.html#package_description&quot; target=&quot;_blank&quot;&gt;Codec API&lt;/a&gt;. Several implementations are provided with the release, among those the &lt;a href=&quot;http://lucene.apache.org/core/4_0_0/codecs/org/apache/lucene/codecs/simpletext/SimpleTextCodec.html&quot; target=&quot;_blank&quot;&gt;SimpleTextCodec&lt;/a&gt; that can be used to write the files in plaintext for learning and debugging purposes.&lt;/p&gt; &lt;p&gt;To configure the Codec you just set it on the IndexWriterConfig:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);&lt;br /&gt;IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);&lt;br /&gt;// recreate the index on each execution&lt;br /&gt;config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);&lt;br /&gt;config.setCodec(new SimpleTextCodec());&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The rest of the indexing process stays exactly the same as it used to be:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;Directory luceneDir = FSDirectory.open(plaintextDir);&lt;br /&gt;try (IndexWriter writer = new IndexWriter(luceneDir, config)) {&lt;br /&gt;    writer.addDocument(Arrays.asList(&lt;br /&gt;            new TextField(&quot;title&quot;, &quot;The title of my first document&quot;, Store.YES),&lt;br /&gt;            new TextField(&quot;content&quot;, &quot;The content of the first document&quot;, Store.NO)));&lt;br /&gt;&lt;br /&gt;    writer.addDocument(Arrays.asList(&lt;br /&gt;            new TextField(&quot;title&quot;, &quot;The title of the second document&quot;, Store.YES),&lt;br /&gt;            new TextField(&quot;content&quot;, &quot;And this is the content&quot;, Store.NO)));&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;After running this code the index directory contains several files. Those are not the same type of files that are created using the default codec.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;ls /tmp/lucene-plaintext/&lt;br /&gt;_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The segments_x file is the starting point (x depends on the number of times you have written to the index before and starts with 1). This is still a binary file, but it contains the information about which codec is used to write to the index: the name of each Codec that is used for writing a certain segment.&lt;/p&gt; &lt;p&gt;The rest of the index files are all plaintext. They do not contain the same information as their binary cousins. 
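&lt;/p&gt; &lt;p&gt;By the way, reading and searching works exactly as with the default codec, because the codec name recorded in the segments file is enough to pick the right implementation, as long as the jar containing the SimpleTextCodec is on the classpath. A minimal sketch, not part of the original example:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;DirectoryReader reader = DirectoryReader.open(FSDirectory.open(plaintextDir));&lt;br /&gt;IndexSearcher searcher = new IndexSearcher(reader);&lt;br /&gt;&lt;br /&gt;// the analyzed content is searchable just like with the binary format&lt;br /&gt;TopDocs docs = searcher.search(new TermQuery(new Term(&quot;title&quot;, &quot;second&quot;)), 10);&lt;br /&gt;// the stored title field can be read back as usual&lt;br /&gt;System.out.println(searcher.doc(docs.scoreDocs[0].doc).get(&quot;title&quot;));&lt;br /&gt;&lt;br /&gt;reader.close();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Now to the content of the individual files. 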
For example the .pst file represents the complete posting list, the structure you normally mean when talking about an inverted index:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;field content&lt;br /&gt;  term content&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 1&lt;br /&gt;    doc 1&lt;br /&gt;      freq 1&lt;br /&gt;      pos 4&lt;br /&gt;  term document&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 5&lt;br /&gt;  term first&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 4&lt;br /&gt;field title&lt;br /&gt;  term document&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 5&lt;br /&gt;    doc 1&lt;br /&gt;      freq 1&lt;br /&gt;      pos 5&lt;br /&gt;  term first&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 4&lt;br /&gt;  term my&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 3&lt;br /&gt;  term second&lt;br /&gt;    doc 1&lt;br /&gt;      freq 1&lt;br /&gt;      pos 4&lt;br /&gt;  term title&lt;br /&gt;    doc 0&lt;br /&gt;      freq 1&lt;br /&gt;      pos 1&lt;br /&gt;    doc 1&lt;br /&gt;      freq 1&lt;br /&gt;      pos 1&lt;br /&gt;END&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The content that is marked as stored resides in the .fld file:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;doc 0&lt;br /&gt;  numfields 1&lt;br /&gt;  field 0&lt;br /&gt;    name title&lt;br /&gt;    type string&lt;br /&gt;    value The title of my first document&lt;br /&gt;doc 1&lt;br /&gt;  numfields 1&lt;br /&gt;  field 0&lt;br /&gt;    name title&lt;br /&gt;    type string&lt;br /&gt;    value The title of the second document&lt;br /&gt;END&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If you'd like to have a look at the rest of the files checkout the code at &lt;a href=&quot;https://github.com/fhopf/lucene-codec-example&quot; target=&quot;_blank&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The SimpleTextCodec only is an interesting byproduct. The Codec API can be used for a lot useful things. For example the feature to read indices of older Lucene versions is implemented using seperate codecs. Also, you can mix several Codecs in an index so reindexing on version updates should not be necessary immediately. I am sure more useful codecs will pop up in the future.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Getting rid of synchronized: Using Akka from Java</title>
   <link href="http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html"/>
   <updated>2012-08-23T14:02:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/08/getting-rid-of-synchronized-using-akka</id>
   <content type="html">&lt;p&gt;&lt;i&gt;I've been giving an &lt;a href=&quot;http://slideshare.net/fhopf/akka-presentation-schulesynyx&quot; target=&quot;_blank&quot;&gt;internal talk&lt;/a&gt; on &lt;a href=&quot;http://akka.io&quot; target=&quot;_blank&quot;&gt;Akka&lt;/a&gt;, the Actor framework for the JVM, at my former company &lt;a href=&quot;http://synyx.de&quot; target=&quot;_blank&quot;&gt;synyx&lt;/a&gt;. For the talk I implemented a small example application, kind of a web crawler, using Akka. I published the &lt;a href=&quot;https://github.com/fhopf/akka-crawler-example&quot; target=&quot;_blank&quot;&gt;source code on Github&lt;/a&gt; and will explain some of the concepts in this post.&lt;/i&gt;&lt;/p&gt; &lt;h4&gt;Motivation&lt;/h4&gt; &lt;p&gt;To see why you might need something like Akka, imagine you want to implement a simple web crawler for offline search. You download pages from a certain location, parse and index the content and follow any links that you haven't parsed and indexed yet. I am using &lt;a href=&quot;http://htmlparser.sourceforge.net/&quot; target=&quot;_blank&quot;&gt;HtmlParser&lt;/a&gt; for downloading and parsing pages and &lt;a href=&quot;http://lucene.apache.org&quot; target=&quot;_blank&quot;&gt;Lucene&lt;/a&gt; for indexing them. The logic is contained in two service objects, PageRetriever and Indexer, that can be used from our main application.&lt;/p&gt; &lt;p&gt;A simple sequential execution might then look something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public void downloadAndIndex(String path, IndexWriter writer) {&lt;br /&gt;    VisitedPageStore pageStore = new VisitedPageStore();&lt;br /&gt;    pageStore.add(path);&lt;br /&gt;        &lt;br /&gt;    Indexer indexer = new IndexerImpl(writer);&lt;br /&gt;    PageRetriever retriever = new HtmlParserPageRetriever(path);&lt;br /&gt;        &lt;br /&gt;    String page;&lt;br /&gt;    while ((page = pageStore.getNext()) != null) {&lt;br /&gt;        PageContent pageContent = retriever.fetchPageContent(page);&lt;br /&gt;        pageStore.addAll(pageContent.getLinksToFollow());&lt;br /&gt;        indexer.index(pageContent);&lt;br /&gt;        pageStore.finished(page);&lt;br /&gt;    }&lt;br /&gt;        &lt;br /&gt;    indexer.commit();&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We start with one page, extract the content and the links, index the content and store all links that are to be visited in the VisitedPageStore. This class contains the logic to determine which links have been visited already. We loop as long as there are more links to follow; once we are done we commit the Lucene IndexWriter.&lt;/p&gt; &lt;p&gt;This implementation works fine; when running on my outdated laptop it finishes in around 3 seconds for an example page. (Note that the times I am giving are by no means meant as a benchmark but are just there to give you some idea of the numbers).&lt;/p&gt; &lt;p&gt;So we are done? No, of course we can do better by optimizing the resources we have available. Let's try to improve this solution by splitting it into several tasks that can be executed in parallel.&lt;/p&gt;   &lt;h4&gt;Shared State Concurrency&lt;/h4&gt; &lt;p&gt;The normal way in Java would be to implement several Threads that do parts of the work and access the state via guarded blocks, e.g. by synchronizing methods. 
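&lt;/p&gt; &lt;p&gt;Just to illustrate what that means, a guarded store could look roughly like this (a hypothetical sketch, not the actual VisitedPageStore from the example project):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class GuardedPageStore {&lt;br /&gt;&lt;br /&gt;    private final Set&amp;lt;String&amp;gt; known = new HashSet&amp;lt;String&amp;gt;();&lt;br /&gt;    private final Queue&amp;lt;String&amp;gt; toVisit = new LinkedList&amp;lt;String&amp;gt;();&lt;br /&gt;&lt;br /&gt;    // every access goes through the same lock&lt;br /&gt;    public synchronized void add(String page) {&lt;br /&gt;        if (known.add(page)) {&lt;br /&gt;            toVisit.add(page);&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public synchronized String getNext() {&lt;br /&gt;        return toVisit.poll();&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;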
So in our case there might be several Threads that access our global state that is stored in the VisitedPageStore.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-c2Pir19FrKY/UC5gu58ACPI/AAAAAAAAAEk/GjHFDb7q2Fc/s1600/synchronize.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;101&quot; width=&quot;320&quot; src=&quot;http://4.bp.blogspot.com/-c2Pir19FrKY/UC5gu58ACPI/AAAAAAAAAEk/GjHFDb7q2Fc/s320/synchronize.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;This model is what Venkat Subramaniam calls Synchronize and Suffer in his great book &lt;a href=&quot;http://pragprog.com/book/vspcon/programming-concurrency-on-the-jvm&quot; target=&quot;_blank&quot;&gt;Programming Concurrency on the JVM&lt;/a&gt;. Working with Threads and building correct solutions might not seem that hard at first but is inherintly difficult. I like those two tweets that illustrate the problem:&lt;/p&gt; &lt;blockquote class=&quot;twitter-tweet tw-align-center&quot; lang=&quot;de&quot;&gt;&lt;p&gt;Adding the &quot;synchronized&quot; keyword to Java was a mistake. Makes people believe they can write multi-threaded code.&lt;/p&gt;&amp;mdash; Erik Dörnenburg (@erikdoe) &lt;a href=&quot;https://twitter.com/erikdoe/status/128817862268821504&quot; data-datetime=&quot;2011-10-25T12:59:06+00:00&quot;&gt;Oktober 25, 2011&lt;/a&gt;&lt;/blockquote&gt; &lt;blockquote class=&quot;twitter-tweet tw-align-center&quot; lang=&quot;de&quot;&gt;&lt;p&gt;95% of syncronized code is broken. The other 5% is written by Brian Goetz. - Venkat Subramaniam at &lt;a href=&quot;https://twitter.com/search/?q=%23s2gx&quot;&gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;s2gx&lt;/b&gt;&lt;/a&gt;&lt;/p&gt;&amp;mdash; Ronny Løvtangen (@rlovtangen) &lt;a href=&quot;https://twitter.com/rlovtangen/status/129323480159232000&quot; data-datetime=&quot;2011-10-26T22:28:15+00:00&quot;&gt;Oktober 26, 2011&lt;/a&gt;&lt;/blockquote&gt;&lt;script src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt; &lt;p&gt;Brian Goetz of course being the author of the de-facto standard book on the new Java concurrency features, &lt;a href=&quot;http://jcip.net/&quot; target=&quot;_blank&quot;&gt;Java Concurrency in Practice&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Akka&lt;/h4&gt; &lt;p&gt;So what is Akka? It's an Actor framework for the JVM that is implemented in Scala but that is something that you rarely notice when working from Java. It offers a nice Java API that provides most of the functionality in a convenient way.&lt;/p&gt; &lt;p&gt;Actors are a concept that was introduced in the seventies but became widely known as one of the core features of Erlang, a language to build fault tolerant, self healing systems. Actors employ the concept of Message Passing Concurrency. That means that Actors only communicate by means of messages that are passed into an Actors mailbox. Actors can contain state that they shield from the rest of the system. The only way to change the state is by passing in messages. 
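&lt;/p&gt; &lt;p&gt;Messages should be simple immutable values. The IndexedMessage that is used further down is essentially just a wrapper around the path that has been indexed; a sketch of what such a message class can look like (the actual class is in the example project):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class IndexedMessage {&lt;br /&gt;&lt;br /&gt;    // immutable, so it can safely be shared between Actors&lt;br /&gt;    public final String path;&lt;br /&gt;&lt;br /&gt;    public IndexedMessage(String path) {&lt;br /&gt;        this.path = path;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;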
Each Actor is executed in a different Thread but they provide a higher level of abstraction than working with Threads directly.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-hQwp7o8cH_s/UC5hEJ9IkAI/AAAAAAAAAEw/CVqfjT2Z0nY/s1600/actors.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;151&quot; width=&quot;320&quot; src=&quot;http://1.bp.blogspot.com/-hQwp7o8cH_s/UC5hEJ9IkAI/AAAAAAAAAEw/CVqfjT2Z0nY/s320/actors.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;When implementing Actors you put the behaviour in a method receive() that can act on incoming messages. You can then reply asynchronously to the sender or send messages to any other Actor.&lt;/p&gt; &lt;p&gt;For our problem at hand an Actor setup might look something like this:&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/-I5TC1mL-bHI/UC5hMo0rtkI/AAAAAAAAAE8/6OGAL2hsBzg/s1600/actor-setup.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;214&quot; width=&quot;320&quot; src=&quot;http://2.bp.blogspot.com/-I5TC1mL-bHI/UC5hMo0rtkI/AAAAAAAAAE8/6OGAL2hsBzg/s320/actor-setup.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;There is one Master Actor that also contains the global state. It sends a message to fetch a certain page to a PageParsingActor that asynchonously responds to the Master with the PageContent. The Master can then send the PageContent to an IndexingActor which responds with another message. With this setup we have done a first step to scale our solution. There are now three Actors that can be run on different cores of your machine.&lt;/p&gt; &lt;p&gt;Actors are instantiated from other Actors. On the top there's the ActorSystem that is provided by the framework. The MasterActor is instaciated from the ActorSystem:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;ActorSystem actorSystem = ActorSystem.create();&lt;br /&gt;final CountDownLatch countDownLatch = new CountDownLatch(1);&lt;br /&gt;ActorRef master = actorSystem.actorOf(new Props(new UntypedActorFactory() {&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public Actor create() {&lt;br /&gt;        return new SimpleActorMaster(new HtmlParserPageRetriever(path), writer, countDownLatch);&lt;br /&gt;    }&lt;br /&gt;}));&lt;br /&gt;&lt;br /&gt;master.tell(path);&lt;br /&gt;try {&lt;br /&gt;    countDownLatch.await();&lt;br /&gt;    actorSystem.shutdown();&lt;br /&gt;} catch (InterruptedException ex) {&lt;br /&gt;    throw new IllegalStateException(ex);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Ignore the CountdownLatch as it is only included to make it possible to terminate the application. Note that we are not referencing an instance of our class but an ActorRef, a reference to an actor. You will see later why this is important.&lt;/p&gt; &lt;p&gt;The MasterActor contains references to the other Actors and creates them from its context. 
This makes the two Actors children of the Master:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public SimpleActorMaster(final PageRetriever pageRetriever, final IndexWriter indexWriter,&lt;br /&gt;    final CountDownLatch latch) {&lt;br /&gt;&lt;br /&gt;    super(latch);&lt;br /&gt;    this.indexer = getContext().actorOf(new Props(new UntypedActorFactory() {&lt;br /&gt;&lt;br /&gt;        @Override&lt;br /&gt;        public Actor create() {&lt;br /&gt;&lt;br /&gt;            return new IndexingActor(new IndexerImpl(indexWriter));&lt;br /&gt;        }&lt;br /&gt;    }));&lt;br /&gt;&lt;br /&gt;    this.parser = getContext().actorOf(new Props(new UntypedActorFactory() {&lt;br /&gt;&lt;br /&gt;        @Override&lt;br /&gt;        public Actor create() {&lt;br /&gt;&lt;br /&gt;           return new PageParsingActor(pageRetriever);&lt;br /&gt;        }&lt;br /&gt;    }));&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The PageParsingActor acts on messages to fetch pages and sends a message with the result to the sender:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public void onReceive(Object o) throws Exception {&lt;br /&gt;    if (o instanceof String) {&lt;br /&gt;        PageContent content = pageRetriever.fetchPageContent((String) o);&lt;br /&gt;        getSender().tell(content, getSelf());&lt;br /&gt;    } else {&lt;br /&gt;        // fail on any message we don't expect&lt;br /&gt;        unhandled(o);&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The IndexingActor contains some state with the Indexer. It acts on messages to index pages and to commit the indexing process.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public void onReceive(Object o) throws Exception {&lt;br /&gt;    if (o instanceof PageContent) {&lt;br /&gt;        PageContent content = (PageContent) o;&lt;br /&gt;        indexer.index(content);&lt;br /&gt;        getSender().tell(new IndexedMessage(content.getPath()), getSelf());&lt;br /&gt;    } else if (COMMIT_MESSAGE == o) {&lt;br /&gt;        indexer.commit();&lt;br /&gt;        getSender().tell(COMMITTED_MESSAGE, getSelf());&lt;br /&gt;    } else {&lt;br /&gt;        unhandled(o);&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The MasterActor finally orchestrates the other Actors in its receive() method. It starts with one page and sends it to the PageParsingActor. It keeps the valuable state of the application in the VisitedPageStore. 
When no more pages are to be fetched and indexed it sends a commit message and terminates the application.&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public void onReceive(Object message) throws Exception {&lt;br /&gt;&lt;br /&gt;    if (message instanceof String) {&lt;br /&gt;        // start&lt;br /&gt;        String start = (String) message;&lt;br /&gt;        visitedPageStore.add(start);&lt;br /&gt;        getParser().tell(visitedPageStore.getNext(), getSelf());&lt;br /&gt;    } else if (message instanceof PageContent) {&lt;br /&gt;        PageContent content = (PageContent) message;&lt;br /&gt;        getIndexer().tell(content, getSelf());&lt;br /&gt;        visitedPageStore.addAll(content.getLinksToFollow());&lt;br /&gt;&lt;br /&gt;        if (visitedPageStore.isFinished()) {&lt;br /&gt;            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());&lt;br /&gt;        } else {&lt;br /&gt;            for (String page : visitedPageStore.getNextBatch()) {&lt;br /&gt;                getParser().tell(page, getSelf());&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    } else if (message instanceof IndexedMessage) {&lt;br /&gt;        IndexedMessage indexedMessage = (IndexedMessage) message;&lt;br /&gt;        visitedPageStore.finished(indexedMessage.path);&lt;br /&gt;&lt;br /&gt;        if (visitedPageStore.isFinished()) {&lt;br /&gt;            getIndexer().tell(IndexingActor.COMMIT_MESSAGE, getSelf());&lt;br /&gt;        }&lt;br /&gt;    } else if (message == IndexingActor.COMMITTED_MESSAGE) {&lt;br /&gt;        logger.info(&quot;Shutting down, finished&quot;);&lt;br /&gt;        getContext().system().shutdown();&lt;br /&gt;        countDownLatch.countDown();&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;What happens if we run this example? Unfortunately it now takes around 3.5 seconds on my dual core machine. Though we are now able to run on both cores we have actually decreased the speed of the application. This is probably an important lesson. When building scalable applications it might happen that you are introducing some overhead that decreases the performance when running in the small. Scalability is not about increasing performance but about the ability to distribute the load. &lt;/p&gt; &lt;p&gt;So was it a failure to switch to Akka? Not at all. It turns out that most of the time the application is fetching and parsing pages. This includes waiting for the network. Indexing in Lucene is blazing fast and the Master mostly only dispatches messages. So what can we do about it? We have already split our application into smaller chunks. Fortunately the PageParsingActor doesn't contain any state at all. That means we can easily parallelize its tasks.&lt;/p&gt; &lt;p&gt;This is where talking to references becomes important. For an Actor it's transparent if there is one or a million Actors behind a reference. 
There is one mailbox for an Actor reference that can dispatch the messages to any amount of Actors.&lt;/p&gt; &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://3.bp.blogspot.com/-klg16pejZZs/UC5hjo8UpCI/AAAAAAAAAFI/DIeTzalkdEU/s1600/routing.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;174&quot; width=&quot;320&quot; src=&quot;http://3.bp.blogspot.com/-klg16pejZZs/UC5hjo8UpCI/AAAAAAAAAFI/DIeTzalkdEU/s320/routing.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;We only need to change the instanciation of the Actor, the rest of the application remains the same:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;parser = getContext().actorOf(new Props(new UntypedActorFactory() {&lt;br /&gt;&lt;br /&gt;        @Override&lt;br /&gt;        public Actor create() {&lt;br /&gt;&lt;br /&gt;            return new PageParsingActor(pageRetriever);&lt;br /&gt;        }&lt;br /&gt;}).withRouter(new RoundRobinRouter(10)));&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;By using a router the Akka framework automatically takes care that there are 10 Actors available. The messages are distributed to any available Actor. This takes the runtime down to 2 seconds.&lt;/p&gt; &lt;h4&gt;A word on Blocking&lt;/h4&gt; &lt;p&gt;Note that the way I am doing network requests here is not recommended in Akka. HTMLParser is doing blocking networking which should be carefully reconsidered when designing a reactive system. In fact, as this application is highly network bound, we might even gain more benefit by just using an asynchronous networking library. But hey, then I wouldn't be able to tell you how nice it is to use Akka. In a future post I will highlight some more Akka features that can help to make our application more robust and fault tolerant.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Slides and demo code for my talk at JUG KA available</title>
   <link href="http://blog.florian-hopf.de/2012/07/slides-and-demo-code-for-my-talk-at-jug.html"/>
   <updated>2012-07-06T00:50:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/07/slides-and-demo-code-for-my-talk-at-jug</id>
   <content type="html">&lt;p&gt;I just uploaded the &lt;a href=&quot;http://www.florian-hopf.de/files/lucene-solr-jugka-040712.pdf&quot;&gt;(german) slides&lt;/a&gt; as well as the &lt;a href=&quot;https://github.com/fhopf/lucene-solr-talk&quot;&gt;example code&lt;/a&gt; for yesterdays talk on &lt;a href=&quot;http://lucene.apache.org&quot;&gt;Lucene&lt;/a&gt; and &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt; at our local &lt;a href=&quot;http://jug-ka.de&quot;&gt;Java User Group&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The demo application contains several subprojects for indexing and searching with Lucene and Solr as well as a simple Dropwizard application that demonstrates some search features. See the README files in the source tree to find out how to run the application.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Dropwizard Encoding Woes</title>
   <link href="http://blog.florian-hopf.de/2012/06/dropwizard-encoding-woes.html"/>
   <updated>2012-06-29T15:57:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/06/dropwizard-encoding-woes</id>
   <content type="html">&lt;p&gt;I have been working on an example application for Lucene and Solr for my upcoming talk at the &lt;a href=&quot;http://jug-ka.de&quot;&gt;Java User Group Karlsruhe&lt;/a&gt;. As a web framework I wanted to try &lt;a href=&quot;http://dropwizard.codahale.com&quot;&gt;Dropwizard&lt;/a&gt;, a lightweight application framework that can expose resources via JAX-RS, provides out of the box monitoring support and can render resource representations using Freemarker. It's really easy to get started, there's a good &lt;a href=&quot;http://dropwizard.codahale.com/getting-started/&quot;&gt;tutorial&lt;/a&gt; and the &lt;a href=&quot;http://dropwizard.codahale.com/manual/&quot;&gt;manual&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;An example resource might look like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;import javax.ws.rs.GET;&lt;br /&gt;import javax.ws.rs.Path;&lt;br /&gt;import javax.ws.rs.Produces;&lt;br /&gt;import javax.ws.rs.core.MediaType;&lt;br /&gt;&lt;br /&gt;@Path(&quot;/example&quot;)&lt;br /&gt;@Produces(MediaType.TEXT_HTML)&lt;br /&gt;public class ExampleResource {&lt;br /&gt;&lt;br /&gt;    @GET&lt;br /&gt;    public ExampleView illustrate() {&lt;br /&gt;        return new ExampleView(&quot;Mot\u00f6rhead&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The Resource produces HTML using Freemarker, which is possible if you add the view bundle in the service. There is one method that is called when the resource is addressed using GET. Inside the method we create a view object accepting a message that in this case contains the umlaut 'ö'. The view class that is returned by the method looks like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;import com.yammer.dropwizard.views.View;&lt;br /&gt;&lt;br /&gt;public class ExampleView extends View {&lt;br /&gt;&lt;br /&gt;    private final String message;&lt;br /&gt;&lt;br /&gt;    public ExampleView(String message) {&lt;br /&gt;        super(&quot;example.fmt&quot;);&lt;br /&gt;        this.message = message;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public String getMessage() {&lt;br /&gt;        return message;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;It accepts a message as constructor parameter. The template name is passed to the parent class. This view class is now available in a freemarker template, an easy variant looks like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;html&amp;gt;&lt;br /&gt;    &amp;lt;body&amp;gt;&lt;br /&gt;        &amp;lt;h1&amp;gt;${message} rocks!&amp;lt;/h1&amp;gt;&lt;br /&gt;    &amp;lt;/body&amp;gt;&lt;br /&gt;&amp;lt;/html&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; If I run this on my machine and access it with Firefox it doesn't work as expected. 
The umlaut character is broken, something Lemmy surely doesn't approve:  &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-I88R0n2P9EM/T-1fKj1_a4I/AAAAAAAAADo/S8VWbvJiY4g/s1600/broken.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left:1em; margin-right:1em&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;99&quot; width=&quot;320&quot; src=&quot;http://1.bp.blogspot.com/-I88R0n2P9EM/T-1fKj1_a4I/AAAAAAAAADo/S8VWbvJiY4g/s320/broken.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt; &lt;p&gt;Accessing the resource using curl works flawlessly:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;curl http://localhost:8080/example&lt;br /&gt;&amp;lt;html&amp;gt;&lt;br /&gt;    &amp;lt;body&amp;gt;&lt;br /&gt;        &amp;lt;h1&amp;gt;Mot&amp;#246;rhead rocks!&amp;lt;/h1&amp;gt;&lt;br /&gt;    &amp;lt;/body&amp;gt;&lt;br /&gt;&amp;lt;/html&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Why is that? It's Servlet Programming 101: You need to set the character encoding of the response. My Firefox defaults to ISO-8859-1, curl seems to use UTF-8 by default. How can we fix it? Tell the client which encoding we are using, which can be done using the Produces annotation:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Produces(&quot;text/html; charset=utf-8&quot;)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;So what does it have to do with Dropwizard? Nothing really, it's a JAX-RS thing. All components in Dropwizard (Jetty and Freemarker notably) are using UTF-8 by default.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Running and Testing Solr with Gradle</title>
   <link href="http://blog.florian-hopf.de/2012/06/running-and-testing-solr-with-gradle.html"/>
   <updated>2012-06-20T20:28:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/06/running-and-testing-solr-with-gradle</id>
   <content type="html">&lt;p&gt;A while ago I blogged on &lt;a href=&quot;http://blog.synyx.de/2011/01/integration-tests-for-your-solr-config/&quot;&gt;testing Solr with Maven&lt;/a&gt; on the &lt;a href=&quot;http://blog.synyx.de&quot;&gt;synyx blog&lt;/a&gt;. In this post I will show you how to set up a similar project with &lt;a href=&quot;http://gradle.org/&quot;&gt;Gradle&lt;/a&gt; that can start the &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt; webapp and execute tests against your configuration.&lt;/p&gt;&lt;h4&gt;Running Solr&lt;/h4&gt;&lt;p&gt;Solr runs as a webapp in any JEE servlet container like Tomcat or Jetty. The index and search configuration resides in a directory commonly referred to as Solr home that can be outside of the webapp directory. This is also the place where the Lucene index files are created. The location for Solr home can be set using an environment variable.&lt;/p&gt;&lt;p&gt;The Solr war file is available in Maven Central. &lt;a href=&quot;http://tadtech.blogspot.de/2012/03/run-jetty-on-dependency-resolved-war-in.html&quot;&gt;This post&lt;/a&gt; describes how to run a war file that is deployed in a Maven repository using Gradle. Let's see what the Gradle build file looks like for running Solr:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;import org.gradle.api.plugins.jetty.JettyRunWar&lt;br /&gt;&lt;br /&gt;apply plugin: 'java'&lt;br /&gt;apply plugin: 'jetty'&lt;br /&gt;&lt;br /&gt;repositories {&lt;br /&gt;    mavenCentral()&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;// custom configuration for running the webapp&lt;br /&gt;configurations {&lt;br /&gt;    solrWebApp&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;dependencies {&lt;br /&gt;    solrWebApp &quot;org.apache.solr:solr:3.6.0@war&quot;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;// custom task that configures the jetty plugin&lt;br /&gt;task runSolr(type: JettyRunWar) {&lt;br /&gt;    webApp = configurations.solrWebApp.singleFile&lt;br /&gt;&lt;br /&gt;    // jetty configuration&lt;br /&gt;    httpPort = 8082&lt;br /&gt;    contextPath = 'solr'&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;// executed before jetty starts&lt;br /&gt;runSolr.doFirst {&lt;br /&gt;    System.setProperty(&quot;solr.solr.home&quot;, &quot;./solrhome&quot;)&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We are creating a custom configuration that contains the Solr war file. In the task runSolr we configure the Jetty plugin. To add the Solr home environment variable we can use the approach &lt;a href=&quot;http://www.sebastian.himberger.de/blog/2011/07/16/adding-custom-jvm-properties-to-gradles-jettyrun/&quot;&gt;described by Sebastian Himberger&lt;/a&gt;. We add a code block that is executed before Jetty starts and sets the environment variable using standard Java mechanisms. You can now start Solr using &lt;em&gt;gradle runSolr&lt;/em&gt;. You will see some errors regarding multiple versions of slf4j that are very likely caused by &lt;a href=&quot;http://issues.gradle.org/browse/GRADLE-897&quot;&gt;this bug&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Testing the Solr configuration&lt;/h4&gt;&lt;p&gt;Solr provides some classes that start an embedded instance using your configuration. You can use these classes in any setup as they do not depend on the gradle jetty plugin. Starting with Solr 3.2 the test framework is not included in solr-core anymore. 
This is what the relevant part of the dependency section looks like now:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;testCompile &quot;junit:junit:4.10&quot;&lt;br /&gt;testCompile &quot;org.apache.solr:solr-test-framework:3.6.0&quot;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now you can place a test in &lt;em&gt;src/test/java&lt;/em&gt; that either uses the convenience methods provided by SolrTestCaseJ4 or you can instantiate an EmbeddedSolrServer and execute any SolrJ actions. Both of these ways will use your custom config. This way you can easily validate that configuration changes don't break existing functionality. An example of using the convenience methods:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;import org.apache.solr.SolrTestCaseJ4;&lt;br /&gt;import org.apache.solr.client.solrj.SolrServerException;&lt;br /&gt;import org.junit.BeforeClass;&lt;br /&gt;import org.junit.Test;&lt;br /&gt;import java.io.IOException;&lt;br /&gt;&lt;br /&gt;public class BasicConfigurationTest extends SolrTestCaseJ4 {&lt;br /&gt;&lt;br /&gt;    @BeforeClass&lt;br /&gt;    public static void initCore() throws Exception {&lt;br /&gt;        SolrTestCaseJ4.initCore(&quot;solrhome/conf/solrconfig.xml&quot;, &quot;solrhome/conf/schema.xml&quot;, &quot;solrhome/&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void noResultInEmptyIndex() throws SolrServerException {&lt;br /&gt;        assertQ(&quot;test query on empty index&quot;,&lt;br /&gt;                req(&quot;text that is not found&quot;)&lt;br /&gt;                , &quot;//result[@numFound='0']&quot;&lt;br /&gt;        );&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void pathIsMandatory() throws SolrServerException, IOException {&lt;br /&gt;        assertFailedU(adoc(&quot;title&quot;, &quot;the title&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void simpleDocumentIsIndexedAndFound() throws SolrServerException, IOException {&lt;br /&gt;        assertU(adoc(&quot;path&quot;, &quot;/tmp/foo&quot;, &quot;content&quot;, &quot;Some important content.&quot;));&lt;br /&gt;        assertU(commit());&lt;br /&gt;&lt;br /&gt;        assertQ(&quot;added document found&quot;,&lt;br /&gt;                req(&quot;important&quot;)&lt;br /&gt;                , &quot;//result[@numFound='1']&quot;&lt;br /&gt;        );&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We extend the class &lt;a href=&quot;http://lucene.apache.org/solr//api/test-framework/org/apache/solr/SolrTestCaseJ4.html&quot;&gt;SolrTestCaseJ4&lt;/a&gt; that is responsible for creating the core and instanciating the runtime using the paths we provide with the method &lt;em&gt;initCore()&lt;/em&gt;. 
Using the available assert methods you can execute queries and validate the result using XPath expressions.&lt;/p&gt; &lt;p&gt;An example that instanciates a SolrServer might look like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;import org.apache.solr.SolrTestCaseJ4;&lt;br /&gt;import org.apache.solr.client.solrj.SolrQuery;&lt;br /&gt;import org.apache.solr.client.solrj.SolrServerException;&lt;br /&gt;import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;&lt;br /&gt;import org.apache.solr.client.solrj.response.FacetField;&lt;br /&gt;import org.apache.solr.client.solrj.response.QueryResponse;&lt;br /&gt;import org.apache.solr.common.SolrInputDocument;&lt;br /&gt;import org.apache.solr.common.params.SolrParams;&lt;br /&gt;import org.junit.After;&lt;br /&gt;import org.junit.Before;&lt;br /&gt;import org.junit.BeforeClass;&lt;br /&gt;import org.junit.Test;&lt;br /&gt;&lt;br /&gt;import java.io.IOException;&lt;br /&gt;&lt;br /&gt;public class ServerBasedTalkTest extends SolrTestCaseJ4 {&lt;br /&gt;&lt;br /&gt;    private EmbeddedSolrServer server;&lt;br /&gt;&lt;br /&gt;    @BeforeClass&lt;br /&gt;    public static void initCore() throws Exception {&lt;br /&gt;        SolrTestCaseJ4.initCore(&quot;solr/conf/solrconfig.xml&quot;, &quot;solr/conf/schema.xml&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Before&lt;br /&gt;    public void initServer() {&lt;br /&gt;        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void queryOnEmptyIndexNoResults() throws SolrServerException {&lt;br /&gt;        QueryResponse response = server.query(new SolrQuery(&quot;text that is not found&quot;));&lt;br /&gt;        assertTrue(response.getResults().isEmpty());&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void singleDocumentIsFound() throws IOException, SolrServerException {&lt;br /&gt;        SolrInputDocument document = new SolrInputDocument();&lt;br /&gt;        document.addField(&quot;path&quot;, &quot;/tmp/foo&quot;);&lt;br /&gt;        document.addField(&quot;content&quot;, &quot;Mein Hut der hat 4 Ecken&quot;);&lt;br /&gt;&lt;br /&gt;        server.add(document);&lt;br /&gt;        server.commit();&lt;br /&gt;&lt;br /&gt;        SolrParams params = new SolrQuery(&quot;ecke&quot;);&lt;br /&gt;        QueryResponse response = server.query(params);&lt;br /&gt;        assertEquals(1L, response.getResults().getNumFound());&lt;br /&gt;        assertEquals(&quot;/tmp/foo&quot;, response.getResults().get(0).get(&quot;path&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @After&lt;br /&gt;    public void clearIndex() {&lt;br /&gt;        super.clearIndex();&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The tests can now be executed using &lt;em&gt;gradle test&lt;/em&gt;.&lt;/p&gt;  &lt;p&gt;Testing your Solr configuration is important as changes in one place might easily lead to side effects with another search functionality. I recommend to add tests even for basic functionality and evolve the tests with your project.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Reading term values for fields from a Lucene Index</title>
   <link href="http://blog.florian-hopf.de/2012/06/reading-term-values-for-fields-from.html"/>
   <updated>2012-06-16T15:13:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/06/reading-term-values-for-fields-from</id>
   <content type="html">&lt;p&gt;Sometimes when using Lucene you might want to retrieve all term values for a given field. Think of categories that you want to display as search links or in a filtering dropdown box. Indexing might look something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));&lt;br /&gt;IndexWriter writer = new IndexWriter(directory, config);&lt;br /&gt;&lt;br /&gt;Document doc = new Document();&lt;br /&gt;&lt;br /&gt;doc.add(new Field(&quot;Category&quot;, &quot;Category1&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;doc.add(new Field(&quot;Category&quot;, &quot;Category2&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;doc.add(new Field(&quot;Author&quot;, &quot;Florian Hopf&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;writer.addDocument(doc);&lt;br /&gt;&lt;br /&gt;doc.add(new Field(&quot;Category&quot;, &quot;Category3&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;doc.add(new Field(&quot;Category&quot;, &quot;Category2&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;doc.add(new Field(&quot;Author&quot;, &quot;Theo Tester&quot;, Field.Store.NO, Field.Index.NOT_ANALYZED));&lt;br /&gt;writer.addDocument(doc);&lt;br /&gt;&lt;br /&gt;writer.close();&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;We are adding two documents, one that is assigned Category1 and Category2 and one that is assigned Category2 and Category3. Note that we are adding both fields unanalyzed so the Strings are added to the index as they are. Lucenes index looks something like this afterwards:&lt;/p&gt; &lt;table class=&quot;table&quot;&gt;&lt;tr&gt;&lt;th&gt;Field&lt;/th&gt;&lt;th&gt;Term&lt;/th&gt;&lt;th&gt;Documents&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Author&lt;/td&gt;&lt;td&gt;Florian Hopf&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Theo Tester&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Category&lt;/td&gt;&lt;td&gt;Category1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Category2&lt;/td&gt;&lt;td&gt;1, 2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Category3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;p&gt;The fields are sorted alphabetically by fieldname first and then by term value. You can access the values using the IndexReaders &lt;em&gt;terms()&lt;/em&gt; method that returns a TermEnum. You can instruct the IndexReader to start with a certain term so you can directly jump to the category without having to iterate all values. But before we do this let's look at how we are used to access Enumeration values in Java:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;Enumeration en = ...;&lt;br /&gt;while(en.hasMoreElements()) {&lt;br /&gt;    Object obj = en.nextElement();&lt;br /&gt;    ...&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;In a while-loop we are checking if there is another element and retrieve it inside the loop. As this pattern is very common when iterating the terms with Lucene you might end with something like this (Note that all the examples here are missing the stop condition. 
If there are more fields the terms of those fields will also be iterated):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TermEnum terms = reader.terms(new Term(&quot;Category&quot;));&lt;br /&gt;// this code is broken, don't use&lt;br /&gt;while(terms.next()) {&lt;br /&gt;    Term term = terms.term();&lt;br /&gt;    System.out.println(term.text());&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The &lt;em&gt;next()&lt;/em&gt; method advances to the next element and returns a boolean indicating whether there is one. The &lt;em&gt;term()&lt;/em&gt; method can then be used to retrieve the Term. But this doesn't work as expected. The code only finds Category2 and Category3 but skips Category1. Why is that? The Lucene TermEnum works differently from what we are used to with Java Enumerations. When the TermEnum is returned it already points to the first element so with &lt;em&gt;next()&lt;/em&gt; we skip this first element.&lt;/p&gt; &lt;p&gt;This snippet instead works correctly using a for loop:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TermEnum terms = reader.terms(new Term(&quot;Category&quot;));&lt;br /&gt;for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {&lt;br /&gt;    System.out.println(term.text());&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Or you can use a do-while loop with a check for the first element:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TermEnum terms = reader.terms(new Term(&quot;Category&quot;));&lt;br /&gt;if (terms.term() != null) {&lt;br /&gt;    do {&lt;br /&gt;        Term term = terms.term();&lt;br /&gt;        System.out.println(term.text());&lt;br /&gt;    } while(terms.next());&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;You can't really blame Lucene for this as the methods are aptly named. It's our habits that lead to minor errors like this. &lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Berlin Buzzwords 2012</title>
   <link href="http://blog.florian-hopf.de/2012/06/berlin-buzzwords-2012.html"/>
   <updated>2012-06-07T00:20:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/06/berlin-buzzwords-2012</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://berlinbuzzwords.de/&quot;&gt;Berlin Buzzwords&lt;/a&gt; is an annual conference on search, store and scale technology. I've heard good things about it before and finally got convinced to go there this year. The conference itself lasts for two days but there are additional events before and afterwards so if you like you can spend a whole week.&lt;/p&gt; &lt;h4&gt;The Barcamp&lt;/h4&gt; &lt;p&gt;As I had to travel on sunday anyway I took an earlier train to attend the &lt;a href=&quot;http://berlinbuzzwords.de/wiki/barcamp&quot;&gt;barcamp&lt;/a&gt; in the early evening. It started with a short introduction of the concepts and the scheduling. Participants could suggest topics that they either would be willing to introduce by themselfes or just anything they are interested in. There were three roomes prepared, a larger and two smaller ones.&lt;/p&gt; &lt;p&gt;Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; is a distributed database build on top of the Hadoop filesystem. It seems to be used more often than I would have expected. Interesting to hear about the problems and solutions of other people.&lt;/p&gt; &lt;p&gt;The next session on hybrid relational and NoSQL applications stayed rather high level. I liked the remark by one guy that &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt;, the underdog of NoSQL, often is the first application where people are ok with dismissing some guarantees regarding their data. Adding NoSQL should be exactly like this.&lt;/p&gt; &lt;p&gt;I only &lt;a href=&quot;http://blog.florian-hopf.de/2012/05/content-extraction-with-apache-tika.html&quot;&gt;started just recently&lt;/a&gt; to use &lt;a href=&quot;http://tika.apache.org/&quot;&gt;Tika&lt;/a&gt; directly so it was really interesting to see where the project is heading in the future. I was surprised to hear that there now also is a TikaServer that can do similar things like those I described for Solr. That's something I want to try in action.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;http://wiki.apache.org/jackrabbit/Jackrabbit%203&quot;&gt;Jackrabbit Oak&lt;/a&gt; is a next generation content repository that is mostly driven by the Day team of Adobe. Some of the ideas sound really interesting but I got the feeling that it still can take some time until this really can be used. Jukka Zitting also gave a lightning talk on this topic at the conference, the &lt;a href=&quot;http://people.apache.org/~jukka/2012/Jackrabbit%20Oak.pdf&quot;&gt;slides are available here&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The atmosphere in the sessions was really relaxed so even though I expected to only listen I took the chance to participate and ask some questions. This probably is the part that makes a barcamp as effective as it is. As you are constantly participating you keep really contentrated on the topic.&lt;/p&gt; &lt;h4&gt;Day 1&lt;/h4&gt; &lt;p&gt;The first day started with a great keynote by &lt;a href=&quot;http://hawthornlandings.org/&quot;&gt;Leslie Hawthorn&lt;/a&gt; on building and maintaining communities. She compared a lot of the aspects of community work with gardening and introduced &lt;a href=&quot;http://openmrs.org/&quot;&gt;OpenMRS&lt;/a&gt;, a successful project building a medical record platform. 
Though I am currently not actively involved in an open source project I could relate to a lot of the situations she described. All in all an inspiring start to the main conference.&lt;/p&gt; &lt;p&gt;Next I attended a talk on building hybrid applications with &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDb&lt;/a&gt;. Nothing new for me but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This also is a way to experiment with different languages and techniques without having to migrate large parts of an application.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;http://berlinbuzzwords.de/sessions/jcr-view-world-everything-content-everything-tree&quot;&gt;A JCR view of the world&lt;/a&gt; provided some examples on how to model different structures using a content tree. Though really introductory it was interesting to see what kind of applications can be built using a content repository. I also liked the attitude of the speaker: The presentation was delivered using &lt;a href=&quot;http://sling.apache.org/site/index.html&quot;&gt;Apache Sling&lt;/a&gt; which uses JCR under the hood.&lt;/p&gt; &lt;p&gt;Probably the highlight of the first day was the talk by Grant Ingersoll on &lt;a href=&quot;http://berlinbuzzwords.de/sessions/large-scale-search-discovery-and-analytics-hadoop-mahout-and-solr&quot;&gt;Large Scale Search, Discovery and Analytics&lt;/a&gt;. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of the users. That's probably one of the big takeaways of the whole conference for me: Always collect data on your users' searches to have it available when you want to tune the relevance, either manually or through some learning techniques. &lt;a href=&quot;http://www.slideshare.net/gsingers/large-scale-search-discovery-and-analytics-with-hadoop-mahout-and-solr-13203456&quot;&gt;The slides of the talk&lt;/a&gt; are worth looking at.&lt;/p&gt; &lt;p&gt;The rest of the day I attended several talks on the internals of &lt;a href=&quot;http://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt;. Hardcore stuff, I would be lying if I said I had understood everything but it was interesting nevertheless. I am glad that some really smart people are taking care that Lucene stays as fast and feature rich as it is.&lt;/p&gt; &lt;p&gt;The day ended with interesting discussions and some beer at the Buzz Party and BBQ.&lt;/p&gt; &lt;h4&gt;Day 2&lt;/h4&gt; &lt;p&gt;The first talk of the second day on &lt;a href=&quot;http://berlinbuzzwords.de/sessions/smart-autocompl&quot;&gt;Smart Autocompl...&lt;/a&gt; by Anne Veling was fantastic. Anne demonstrated a rather simple technique for doing semantic analysis of search queries for specialized autocompletion for the largest travel information system in the Netherlands. The query gets tokenized and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which might be good field matches.&lt;/p&gt;   &lt;p&gt;Another talk introduced a scalable tool for preprocessing of documents, &lt;a href=&quot;http://www.findwise.com/hydra&quot;&gt;Hydra&lt;/a&gt;. It stores the documents as well as mapping data in a MongoDb instance and you can parallelize the processing steps. 
The concept sounds really interesting; I hope I can find time to have a closer look.&lt;/p&gt; &lt;p&gt;In the afternoon I attended several talks on &lt;a href=&quot;http://www.elasticsearch.org/&quot;&gt;Elasticsearch&lt;/a&gt;, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.&lt;/p&gt; &lt;p&gt;One of the tracks was cancelled, so &lt;a href=&quot;http://tdunning.blogspot.de/&quot;&gt;Ted Dunning&lt;/a&gt; introduced new stuff in &lt;a href=&quot;http://mahout.apache.org/&quot;&gt;Mahout&lt;/a&gt; instead. He's a really funny speaker and though I am not deep into machine learning I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.&lt;/p&gt;   &lt;p&gt;In the last track of the day Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite a lot of useful advice.&lt;/p&gt; &lt;h4&gt;The location&lt;/h4&gt; &lt;p&gt;The event took place at &lt;a href=&quot;http://www.urania.de/&quot;&gt;Urania&lt;/a&gt;, a smaller conference center and theatre. Mostly it was well suited but some of the talks were so full that you either had to sit on the floor or weren't even able to enter the room. I understand that it is difficult to predict how many people attend a certain event but some talks probably should have been scheduled in different rooms.&lt;/p&gt; &lt;p&gt;The food was really good and though it first looked like the distribution was a bottleneck this worked pretty well.&lt;/p&gt; &lt;h4&gt;The format&lt;/h4&gt; &lt;p&gt;This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long with some exceptions that were 40 minutes long. I have mixed feelings about this: On the one hand it was great to have a lot of different topics. On the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to think about what they would talk about in such a short timeframe.&lt;/p&gt; &lt;p&gt;Berlin Buzzwords is a fantastic conference and I will definitely go there again.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Content Extraction with Apache Tika</title>
   <link href="http://blog.florian-hopf.de/2012/05/content-extraction-with-apache-tika.html"/>
   <updated>2012-05-12T01:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/05/content-extraction-with-apache-tika</id>
   <content type="html">&lt;p&gt;Sometimes you need access to the content of documents, be it that you want to analyze it, store the content in a database or index it for searching. Different formats like word documents, pdfs and html documents need different treatment. &lt;a href=&quot;http://tika.apache.org/&quot; target=&quot;_blank&quot;&gt;Apache Tika&lt;/a&gt; is a project that combines several open source projects for reading content from &lt;a href=&quot;http://tika.apache.org/1.1/formats.html&quot; target=&quot;_blank&quot;&gt;a multitude of file formats&lt;/a&gt; and makes the textual content as well as some metadata available using a uniform API. I will show two ways how to leverage the power of Tika for your projects.&lt;/p&gt;&lt;h4&gt;Accessing Tika programmatically&lt;/h4&gt;&lt;p&gt;First, Tika can of course be used as a library. Surprisingly the user docs on the website explain a lot of the functionality that you might be interested in when writing custom parsers for Tika but don't show directly how to use it.&lt;/p&gt;&lt;p&gt;I am using Maven again, so I add a dependency for the most recent version:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.tika&amp;lt;/groupId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;artifactId&amp;gt;tika-parsers&amp;lt;/artifactId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;version&amp;gt;1.1&amp;lt;/version&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;type&amp;gt;jar&amp;lt;/type&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;em&gt;tika-parsers&lt;/em&gt; also includes all the other projects that are used so be patient when Maven fetches all the transitive dependencies.&lt;/p&gt; &lt;p&gt;Let's see what some test code for extracting data from a pdf document called &lt;em&gt;slides.pdf&lt;/em&gt;, that is available in the classpath, looks like.&lt;/p&gt;&lt;pre&gt;&lt;code&gt;Parser parser = new PdfParser();&lt;br /&gt;BodyContentHandler handler = new BodyContentHandler();&lt;br /&gt;Metadata metadata = new Metadata();&lt;br /&gt;InputStream content = getClass().getResourceAsStream(&quot;/slides.pdf&quot;);&lt;br /&gt;parser.parse(content, handler, metadata, new ParseContext());&lt;br /&gt;assertEquals(&quot;Solr Vortrag&quot;, metadata.get(Metadata.TITLE));&lt;br /&gt;assertTrue(handler.toString().contains(&quot;Lucene&quot;));&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;First, we need to instanciate a &lt;a href=&quot;http://tika.apache.org/1.1/parser.html&quot; target=&quot;_blank&quot;&gt;Parser&lt;/a&gt; that is capable of reading the format, in this case PdfParser that uses &lt;a href=&quot;http://pdfbox.apache.org/&quot; target=&quot;_blank&quot;&gt;PDFBox&lt;/a&gt; for extracting the content. The parse method expects some parameters to configure the parsing process as well as an InputStream that contains the data of the document. Metadata will contain all the metadata for the document, e.g. the title or the author after the parsing is finished. &lt;/p&gt;&lt;p&gt;Tika uses XHTML as the internal representation for all parsed content. This XHTML document can be processed by a SAX ContentHandler. A custom implementation BodyContentHandler returns all the text in the body area, which is the main content. The last parameter ParseContext can be used to configure the underlying parser instance. 
&lt;/p&gt;&lt;p&gt;The Metadata class consists of a Map-like structure with some common keys like the title as well as optional format specific information. You can look at the contents with a simple loop:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;for (String name: metadata.names()) { &lt;br /&gt;    System.out.println(name + &quot;: &quot; + metadata.get(name));&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This will produce an output similar to this:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;xmpTPg:NPages: 17&lt;br /&gt;Creation-Date: 2010-11-20T09:47:28Z&lt;br /&gt;title: Solr Vortrag&lt;br /&gt;created: Sat Nov 20 10:47:28 CET 2010&lt;br /&gt;producer: OpenOffice.org 2.4&lt;br /&gt;Content-Type: application/pdf&lt;br /&gt;creator: Impress&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The textual content of the document can be retrieved by calling the toString() method on the BodyContentHandler.&lt;/p&gt;&lt;p&gt;This is all fine if you exactly know that you only want to retrieve data from pdf documents. But you probably don't want to introduce a huge switch-block for determining the parser to use depending on the file name or some other information. Fortunately Tika also provides an AutodetectParser that employs different strategies for determining the content type of the document. The code above all stays the same, you just use a different parser:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;Parser parser = new AutodetectParser();&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This way you don't have to know what kind of document you are currently processing, Tika will provide you with metadata as well as the content. You can pass in additional hints for the parser e.g. the filename or the content type by setting it in the Metadata object.&lt;/p&gt;&lt;h4&gt;Extracting content using Solr&lt;/h4&gt;&lt;p&gt;If you are using &lt;a href=&quot;http://lucene.apache.org/solr/&quot; target=&quot;_blank&quot;&gt;the search server Solr&lt;/a&gt; you can also leverage its REST API for extracting the content. The default configuration has a &lt;a href=&quot;http://wiki.apache.org/solr/ExtractingRequestHandler&quot; target=&quot;_blank&quot;&gt;request handler&lt;/a&gt; configured for &lt;em&gt;/update/extract&lt;/em&gt; that you can send a document to and it will return the content it extracted using Tika. You just need to add the necessary libraries for the extraction. 
I am still using Maven so I have to add an additional dependency:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.solr&amp;lt;/groupId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;artifactId&amp;gt;solr&amp;lt;/artifactId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;version&amp;gt;3.6.0&amp;lt;/version&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;type&amp;gt;war&amp;lt;/type&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.solr&amp;lt;/groupId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;artifactId&amp;gt;solr-cell&amp;lt;/artifactId&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;version&amp;gt;3.6.0&amp;lt;/version&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;type&amp;gt;jar&amp;lt;/type&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This will include all of the Tika dependencies as well as all necessary third party libraries.&lt;/p&gt;&lt;p&gt;Solr Cell, the request handler, normally is used to index binary files directly but you can also just use it for extraction. To transfer the content you can use any tool that can speak http, e.g. for curl this might look like this:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;curl -F &quot;file=@slides.pdf&quot; &quot;localhost:8983/solr/update/extract?extractOnly=true&amp;amp;extractFormat=text&quot;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;By setting the parameter &lt;em&gt;extractOnly&lt;/em&gt; to true we advice Solr that we don't want to index the content but want to have it extracted to the response. The result will be the standard Solr XML format that contains the body content as well as the metadata.&lt;/p&gt;&lt;p&gt;You can also use the Java client library &lt;a href=&quot;http://wiki.apache.org/solr/Solrj&quot; target=&quot;_blank&quot;&gt;SolrJ&lt;/a&gt; for doing the same:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;ContentStreamUpdateRequest request = new ContentStreamUpdateRequest(&amp;quot;/update/extract&amp;quot;);&lt;br /&gt;request.addFile(new File(&amp;quot;slides.pdf&amp;quot;));&lt;br /&gt;request.setParam(&amp;quot;extractOnly&amp;quot;, &amp;quot;true&amp;quot;);&lt;br /&gt;request.setParam(&amp;quot;extractFormat&amp;quot;, &amp;quot;text&amp;quot;);&lt;br /&gt;NamedList&amp;lt;Object&amp;gt; result = server.request(request);&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;em&gt;NamedList&lt;/em&gt; will contain entries for the body content as well as another &lt;em&gt;NamedList&lt;/em&gt; with the metadata.&lt;/p&gt;&lt;br /&gt;&lt;h4&gt;Update&lt;/h4&gt;&lt;br /&gt;&lt;p&gt;Robert has asked in the comments what the response looks like.&lt;br /&gt;Solr uses configurable response writers for marshalling the message. The default format is xml but can be influenced by passing the wt attribute to the request. 
A simplified standard response looks like this:&lt;/p&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;curl -F &amp;quot;file=@slides.pdf&amp;quot; &amp;quot;localhost:8983/solr/update/extract?extractOnly=true&amp;amp;extractFormat=text&amp;quot;&lt;br /&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;&amp;lt;response&amp;gt;&lt;br /&gt;&amp;lt;lst name=&amp;quot;responseHeader&amp;quot;&amp;gt;&amp;lt;int name=&amp;quot;status&amp;quot;&amp;gt;0&amp;lt;/int&amp;gt;&amp;lt;int name=&amp;quot;QTime&amp;quot;&amp;gt;1952&amp;lt;/int&amp;gt;&amp;lt;/lst&amp;gt;&amp;lt;str name=&amp;quot;slides.pdf&amp;quot;&amp;gt;&lt;br /&gt;&lt;br /&gt;Features                                                                                                                                                                            &lt;br /&gt;                                                                                                                                                                                    &lt;br /&gt;HTTP&amp;#173;Schnittstelle                                                                                                                                                                  &lt;br /&gt;XML&amp;#173;basierte&amp;#160;Konfiguration                                                                                                                                                          &lt;br /&gt;Facettierung                                                                                                                                                                        &lt;br /&gt;Sammlung&amp;#160;n&amp;#252;tzlicher&amp;#160;Lucene&amp;#173;Module/Dismax                                                                                                                                            &lt;br /&gt;                                                                                                                                                                                    &lt;br /&gt;Features                                                                                                                                                                            &lt;br /&gt;                                                                                                                                                                                    &lt;br /&gt;HTTP&amp;#173;Schnittstelle                                                                                                                                                                  &lt;br /&gt;XML&amp;#173;basierte&amp;#160;Konfiguration                                                                                                                                                          &lt;br /&gt;Facettierung                                                                                                                                                                        &lt;br /&gt;Sammlung&amp;#160;n&amp;#252;tzlicher&amp;#160;Lucene&amp;#173;Module/Dismax                                                                                                                                            &lt;br /&gt;Java&amp;#173;Client&amp;#160;SolrJ                                                                                                                                                                   &lt;br /&gt;                                    
                                                                                                                                                &lt;br /&gt;[... more content ...] &lt;br /&gt;                                                                                                                                                                                   &lt;br /&gt;&amp;lt;/str&amp;gt;&amp;lt;lst name=&amp;quot;slides.pdf_metadata&amp;quot;&amp;gt;&amp;lt;arr name=&amp;quot;xmpTPg:NPages&amp;quot;&amp;gt;&amp;lt;str&amp;gt;17&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;Creation-Date&amp;quot;&amp;gt;&amp;lt;str&amp;gt;2010-11-20T09:47:28Z&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;title&amp;quot;&amp;gt;&amp;lt;str&amp;gt;Solr Vortrag&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;stream_source_info&amp;quot;&amp;gt;&amp;lt;str&amp;gt;file&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;created&amp;quot;&amp;gt;&amp;lt;str&amp;gt;Sat Nov 20 10:47:28 CET 2010&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;stream_content_type&amp;quot;&amp;gt;&amp;lt;str&amp;gt;application/octet-stream&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;stream_size&amp;quot;&amp;gt;&amp;lt;str&amp;gt;425327&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;producer&amp;quot;&amp;gt;&amp;lt;str&amp;gt;OpenOffice.org 2.4&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;stream_name&amp;quot;&amp;gt;&amp;lt;str&amp;gt;slides.pdf&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;Content-Type&amp;quot;&amp;gt;&amp;lt;str&amp;gt;application/pdf&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;arr name=&amp;quot;creator&amp;quot;&amp;gt;&amp;lt;str&amp;gt;Impress&amp;lt;/str&amp;gt;&amp;lt;/arr&amp;gt;&amp;lt;/lst&amp;gt;                                                                            &lt;br /&gt;&amp;lt;/response&amp;gt;     &lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;The response contains some metadata (how long the processing took), the content of the file as well as the metadata that is extracted from the document.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;If you pass the atrribute &lt;em&gt;wt&lt;/em&gt; and set it to &lt;em&gt;json&lt;/em&gt;, the response is contained in a json structure:&lt;/p&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;curl -F &amp;quot;file=@slides.pdf&amp;quot; &amp;quot;localhost:8983/solr/update/extract?extractOnly=true&amp;amp;extractFormat=text&amp;amp;wt=json&amp;quot;             &lt;br /&gt;{&amp;quot;responseHeader&amp;quot;:{&amp;quot;status&amp;quot;:0,&amp;quot;QTime&amp;quot;:217},&amp;quot;slides.pdf&amp;quot;:&amp;quot;\n\n\n\n\n\n\n\n\n\n\n\nSolr Vortrag\n\n&amp;#160; &amp;#160;\n\nEinfach mehr finden mit\n\nFlorian&amp;#160;Hopf\n29.09.2010\n\n\n&amp;#160; &amp;#160;\n\nSolr?\n\n\n&amp;#160; &amp;#160;\n\nSolr?\n\nServer&amp;#173;ization&amp;#160;of&amp;#160;Lucene\n\n\n&amp;#160; &amp;#160;\n\nApache Lucene?\n\nSearch&amp;#160;engine&amp;#160;library\n\n\n&amp;#160; &amp;#160;\n\nApache Lucene?\n\nSearch&amp;#160;engine&amp;#160;library\nTextbasierter&amp;#160;Index\n\n\n&amp;#160; &amp;#160;\n\nApache Lucene?\n\nSearch&amp;#160;engine&amp;#160;library\nTextbasierter&amp;#160;Index\nText&amp;#160;Analyzer\n\n\n&amp;#160; &amp;#160;\n\nApache Lucene?\n\nSearch&amp;#160;engine&amp;#160;library\nTextbasierter&amp;#160;Index\nText&amp;#160;Analyzer\nQuery&amp;#160;Syntax&amp;#160;\n\n\n&amp;#160; 
&amp;#160;\n\nApache Lucene?\n\nSearch&amp;#160;engine&amp;#160;library\nTextbasierter&amp;#160;Index\nText&amp;#160;Analyzer\nQuery&amp;#160;Syntax&amp;#160;\nScoring\n\n\n&amp;#160; &amp;#160;\n\nFeatures\n\nHTTP&amp;#173;Schnittstelle\n\n\n&amp;#160; &amp;#160;\n\nArchitektur\n\nClient SolrWebapp Lucene\nhttp\n\nKommunikation&amp;#160;&amp;#252;ber&amp;#160;XML,&amp;#160;JSON,&amp;#160;JavaBin,&amp;#160;Ruby,&amp;#160;...\n\n\n&amp;#160; &amp;#160;\n\nFeatures\n\nHTTP&amp;#173;Schnittstelle\nXML&amp;#173;basierte&amp;#160;Konfiguration\n\n\n&amp;#160; &amp;#160;\n\nFeatures\n\nHTTP&amp;#173;Schnittstelle\nXML&amp;#173;basierte&amp;#160;Konfiguration\nFacettierung\n\n\n&amp;#160; &amp;#160;\n\nFeatures\n\nHTTP&amp;#173;Schnittstelle\nXML&amp;#173;basierte&amp;#160;Konfiguration\nFacettierung\nSammlung&amp;#160;n&amp;#252;tzlicher&amp;#160;Lucene&amp;#173;Module/Dismax\n\n\n&amp;#160; &amp;#160;\n\nFeatures\n\nHTTP&amp;#173;Schnittstelle\nXML&amp;#173;basierte&amp;#160;Konfiguration\nFacettierung\nSammlung&amp;#160;n&amp;#252;tzlicher&amp;#160;Lucene&amp;#173;Module/Dismax\nJava&amp;#173;Client&amp;#160;SolrJ\n\n\n&amp;#160; &amp;#160;\n\nDemo\n\n\n&amp;#160; &amp;#160;\n\nWas noch?\nAdmin&amp;#173;Interface\nCaching\nSkalierung\nSpellchecker\nMore&amp;#173;Like&amp;#173;This\nData&amp;#160;Import&amp;#160;Handler\nSolrCell\n\n\n&amp;#160; &amp;#160;\n\nRessourcen\nhttp://lucene.apache.org/solr/\n\n\n\n&amp;quot;,&amp;quot;slides.pdf_metadata&amp;quot;:[&amp;quot;xmpTPg:NPages&amp;quot;,[&amp;quot;17&amp;quot;],&amp;quot;Creation-Date&amp;quot;,[&amp;quot;2010-11-20T09:47:28Z&amp;quot;],&amp;quot;title&amp;quot;,[&amp;quot;Solr Vortrag&amp;quot;],&amp;quot;stream_source_info&amp;quot;,[&amp;quot;file&amp;quot;],&amp;quot;created&amp;quot;,[&amp;quot;Sat Nov 20 10:47:28 CET 2010&amp;quot;],&amp;quot;stream_content_type&amp;quot;,[&amp;quot;application/octet-stream&amp;quot;],&amp;quot;stream_size&amp;quot;,[&amp;quot;425327&amp;quot;],&amp;quot;producer&amp;quot;,[&amp;quot;OpenOffice.org 2.4&amp;quot;],&amp;quot;stream_name&amp;quot;,[&amp;quot;slides.pdf&amp;quot;],&amp;quot;Content-Type&amp;quot;,[&amp;quot;application/pdf&amp;quot;],&amp;quot;creator&amp;quot;,[&amp;quot;Impress&amp;quot;]]}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;There are quite some ResponseWriters available for different languages, e.g. for Ruby. You can have a look at them at the bottom of this page: &lt;a href=&quot;http://wiki.apache.org/solr/QueryResponseWriter&quot;&gt;http://wiki.apache.org/solr/QueryResponseWriter&lt;/a&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Importing Atom feeds in Solr using the Data Import Handler</title>
   <link href="http://blog.florian-hopf.de/2012/05/importing-atom-feeds-in-solr-using-data.html"/>
   <updated>2012-05-08T00:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/05/importing-atom-feeds-in-solr-using-data</id>
   <content type="html">&lt;p&gt;I am working on a search solution that makes some of the content I am producing available through one search interface. One of the content stores is &lt;a href=&quot;http://fhopf.blogspot.com&quot;&gt;the blog&lt;/a&gt; you are reading right now, which among other options makes the content available &lt;a href=&quot;http://fhopf.blogspot.com/feeds/posts/default?max-results=100&quot;&gt;here&lt;/a&gt; using &lt;a href=&quot;http://en.wikipedia.org/wiki/Atom_%28standard%29&quot;&gt;Atom&lt;/a&gt;.&lt;/p&gt;   &lt;p&gt;&lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt;, my search server of choice, provides the &lt;a href=&quot;http://wiki.apache.org/solr/DataImportHandler&quot;&gt;Data Import Handler&lt;/a&gt; that can be used to import data on a regular basis from sources like databases via JDBC or remote XML sources, like Atom.&lt;/p&gt; &lt;p&gt;Data Import Handler used to be a core part of Solr but &lt;a href=&quot;http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/&quot;&gt;starting from 3.1&lt;/a&gt; it is shipped as a separate jar and not included in the standard war anymore. I am using &lt;a href=&quot;http://maven.apache.org&quot;&gt;Maven&lt;/a&gt; with overlays for development so I have to add a dependency for it:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;dependencies&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;dependency&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.solr&amp;lt;/groupId&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;artifactId&amp;gt;solr&amp;lt;/artifactId&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;version&amp;gt;3.6.0&amp;lt;/version&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;type&amp;gt;war&amp;lt;/type&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/dependency&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;dependency&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.solr&amp;lt;/groupId&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;artifactId&amp;gt;solr-dataimporthandler&amp;lt;/artifactId&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;version&amp;gt;3.6.0&amp;lt;/version&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;type&amp;gt;jar&amp;lt;/type&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/dependency&amp;gt;&lt;br/&gt;&amp;lt;/dependencies&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;To enable the data import handler you have to add a request handler to your &lt;em&gt;solrconfig.xml&lt;/em&gt;. 
Request handlers are registered for a certain url and, as the name suggests, are responsible for handling incoming requests:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;requestHandler&amp;nbsp;name=&amp;quot;/dataimport&amp;quot;&amp;nbsp;class=&amp;quot;org.apache.solr.handler.dataimport.DataImportHandler&amp;quot;&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;lst&amp;nbsp;name=&amp;quot;defaults&amp;quot;&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;str&amp;nbsp;name=&amp;quot;config&amp;quot;&amp;gt;data-config.xml&amp;lt;/str&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/lst&amp;gt;&lt;br/&gt;&amp;lt;/requestHandler&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The file &lt;em&gt;data-config.xml&lt;/em&gt; that is referenced here contains the mapping logic as well as the endpoint to access:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;?xml&amp;nbsp;version=&amp;quot;1.0&amp;quot;&amp;nbsp;encoding=&amp;quot;UTF-8&amp;quot;&amp;nbsp;?&amp;gt;&lt;br/&gt;&amp;lt;dataConfig&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;dataSource&amp;nbsp;type=&amp;quot;URLDataSource&amp;quot;&amp;nbsp;encoding=&amp;quot;UTF-8&amp;quot;&amp;nbsp;connectionTimeout=&amp;quot;5000&amp;quot;&amp;nbsp;readTimeout=&amp;quot;10000&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;document&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;entity&amp;nbsp;name=&amp;quot;blog&amp;quot;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;pk=&amp;quot;url&amp;quot;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;url=&amp;quot;http://fhopf.blogspot.com/feeds/posts/default?max-results=100&amp;quot;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;processor=&amp;quot;XPathEntityProcessor&amp;quot;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;forEach=&amp;quot;/feed/entry&amp;quot;&amp;nbsp;transformer=&amp;quot;DateFormatTransformer,HTMLStripTransformer,TemplateTransformer&amp;quot;&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;title&amp;quot;&amp;nbsp;xpath=&amp;quot;/feed/entry/title&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;url&amp;quot;&amp;nbsp;xpath=&amp;quot;/feed/entry/link[@rel='alternate']/@href&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;!--&amp;nbsp;2012-03-07T21:35:51.229-08:00&amp;nbsp;--&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;last_modified&amp;quot;&amp;nbsp;xpath=&amp;quot;/feed/entry/updated&amp;quot;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;n
bsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dateTimeFormat=&amp;quot;yyyy-MM-dd'T'hh:mm:ss.SSS&amp;quot;&amp;nbsp;locale=&amp;quot;en&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;text&amp;quot;&amp;nbsp;xpath=&amp;quot;/feed/entry/content&amp;quot;&amp;nbsp;stripHTML=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;category&amp;quot;&amp;nbsp;xpath=&amp;quot;/feed/entry/category/@term&amp;quot;/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;field&amp;nbsp;column=&amp;quot;type&amp;quot;&amp;nbsp;template=&amp;quot;blog&amp;quot;/&amp;gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;/entity&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;/document&amp;gt;&lt;br/&gt;&amp;lt;/dataConfig&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;First we configure which &lt;em&gt;datasource&lt;/em&gt; to use. This is where you alternatively would use another implementation when fetching documents from a database.&lt;/p&gt; &lt;p&gt;Documents describe the fields that will be stored in the index. The attributes for the &lt;em&gt;entity&lt;/em&gt; element determine where and how to fetch the data, most importantly the &lt;em&gt;url&lt;/em&gt; and the &lt;em&gt;processor&lt;/em&gt;. &lt;em&gt;forEach&lt;/em&gt; contains an XPath to identify the elements we'd like to loop over. The &lt;em&gt;transformer&lt;/em&gt; attribute is used to specify some classes that are the available when mapping the remote XML to the Solr fields.&lt;/p&gt; &lt;p&gt;The &lt;em&gt;field&lt;/em&gt; elements contain the mapping between the Atom document and the Solr index fields. The &lt;em&gt;column&lt;/em&gt; attribute determines the name of the index field, &lt;em&gt;xpath&lt;/em&gt; determines the node to use in the remote XML document. You can use advanced XPath options like mapping to attributes of elements where only another attribute is set. E.g. &lt;em&gt;/feed/entry/link[@rel='alternate']/@href&lt;/em&gt; points to an element that determines an alternative representation of a blog post entry:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;feed&amp;nbsp;...&amp;gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&amp;nbsp;...&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;entry&amp;gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;...&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lt;link&amp;nbsp;rel='alternate'&amp;nbsp;type='text/html'&amp;nbsp;href='http://fhopf.blogspot.com/2012/03/testing-akka-actors-from-java.html'&amp;nbsp;title='Testing&amp;nbsp;Akka&amp;nbsp;actors&amp;nbsp;from&amp;nbsp;Java'/&amp;gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;...&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;lt;/entry&amp;gt;&lt;br/&gt;...&lt;br/&gt;&amp;lt;/feed&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;For the column &lt;em&gt;last_modified&lt;/em&gt; we are transforming the remote date format to the internal Solr representation using the DateProcessor. I am not sure yet if this is the correct solution as it seems to me I'm losing the timezone information. 
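A small sketch that illustrates the suspicion (this is just the pattern from the configuration above applied with a plain SimpleDateFormat, not part of the import setup): the trailing offset of the Atom timestamp is not covered by the pattern, so it is ignored and the time is interpreted in the JVM's default timezone.&lt;/p&gt;&lt;pre&gt;&lt;code&gt;// uses java.text.SimpleDateFormat, java.util.Date and java.util.Locale&lt;br /&gt;SimpleDateFormat format = new SimpleDateFormat(&quot;yyyy-MM-dd'T'hh:mm:ss.SSS&quot;, Locale.ENGLISH);&lt;br /&gt;// the -08:00 offset is not part of the pattern and is silently dropped while parsing&lt;br /&gt;Date date = format.parse(&quot;2012-03-07T21:35:51.229-08:00&quot;);&lt;br /&gt;System.out.println(date);&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;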
For the &lt;em&gt;text&lt;/em&gt; field we are first removing all html elements that are contained in the blog post using the HTMLStripTransformer. Finally, the &lt;em&gt;type&lt;/em&gt; contains a hardcoded value that is set using the TemplateTransformer.&lt;/p&gt; &lt;p&gt;To have everything in one place let's see how the schema for our index looks like:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;url&amp;quot;&amp;nbsp;type=&amp;quot;string&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;true&amp;quot;&amp;nbsp;required=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;title&amp;quot;&amp;nbsp;type=&amp;quot;text_general&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;category&amp;quot;&amp;nbsp;type=&amp;quot;text_general&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;true&amp;quot;&amp;nbsp;multiValued=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;last_modified&amp;quot;&amp;nbsp;type=&amp;quot;date&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;text&amp;quot;&amp;nbsp;type=&amp;quot;text_general&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;false&amp;quot;&amp;nbsp;multiValued=&amp;quot;true&amp;quot;/&amp;gt;&lt;br/&gt;&amp;lt;field&amp;nbsp;name=&amp;quot;type&amp;quot;&amp;nbsp;type=&amp;quot;string&amp;quot;&amp;nbsp;indexed=&amp;quot;true&amp;quot;&amp;nbsp;stored=&amp;quot;false&amp;quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Finally, how can you trigger the dataimport? There is an option described in the &lt;a href=&quot;http://wiki.apache.org/solr/DataImportHandler#Scheduling&quot;&gt;Solr wiki&lt;/a&gt;, but probably a simple solution might be enough for you. I am using a shell script that is triggered by a cron job. These are the contents:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;#!/bin/bash&lt;br /&gt;curl localhost:8983/solr/dataimport?command=full-import&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The data import handler is really easy to setup and you can use it to import quite a lot of data sources into your index. If you need more advanced crawling features you might want to have a look at &lt;a href=&quot;http://incubator.apache.org/connectors/&quot;&gt;Apache ManifoldCF&lt;/a&gt;, a connector framework for plugging content repositories into search engines like Apache Solr.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Testing Akka actors from Java</title>
   <link href="http://blog.florian-hopf.de/2012/03/testing-akka-actors-from-java.html"/>
   <updated>2012-03-07T17:02:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/03/testing-akka-actors-from-java</id>
   <content type="html">&lt;p&gt;&lt;em&gt;If you're looking for a general introduction into using Akka from Java have a look at &lt;a href=&quot;http://blog.florian-hopf.de/2012/08/getting-rid-of-synchronized-using-akka.html&quot;&gt;this post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt; &lt;p&gt;In a recent project I've been using &lt;a href=&quot;http://akka.io/&quot;&gt;Akka&lt;/a&gt; for a concurrent producer-consumer setup. It is an actor framework for the JVM that is implemented in Scala but provides a Java API so normally you don't notice that your dealing with a Scala library. &lt;/p&gt; &lt;p&gt;Most of my business code is encapsulated in services that don't depend on Akka and can therefore be tested in isolation. But for some cases I've been looking for a way to test the behaviour of the actors. As I struggled with this for a while and didn't find a real howto on testing Akka actors from Java I hope my notes might be useful for other people as well.&lt;/p&gt; &lt;p&gt;The main problem when testing actors is that they are managed objects and you can't just instanciate them. Akka comes with a module for tests that is documented well for &lt;a href=&quot;http://akka.io/docs/akka/2.0/scala/testing.html&quot;&gt;using it from Scala&lt;/a&gt;. But besides the note that it's possible you don't find a lot of information on using it from Java.&lt;/p&gt; &lt;p&gt;When using Maven you need to make sure that you have the akka-testkit dependency in place:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;    &amp;lt;groupId&amp;gt;com.typesafe.akka&amp;lt;/groupId&amp;gt;&lt;br /&gt;    &amp;lt;artifactId&amp;gt;akka-testkit&amp;lt;/artifactId&amp;gt;&lt;br /&gt;    &amp;lt;version&amp;gt;2.1-SNAPSHOT&amp;lt;/version&amp;gt;&lt;br /&gt;    &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;I will show you how to implement a test for the actors that are introduced in the &lt;a href=&quot;http://akka.io/docs/akka/2.0/intro/getting-started-first-java.html&quot;&gt;Akka java tutorial&lt;/a&gt;. It involves one actor that does a substep of calculating Pi for a certain start number and a given number of elements. &lt;/p&gt; &lt;p&gt;To test this actor we need a way to set it up. Akka-testkit provides a helper &lt;a href=&quot;http://akka.io/api/akka/2.0/akka/testkit/TestActorRef.html&quot;&gt;TestActorRef&lt;/a&gt; that can be used to set it up. Using scala this seems to be rather simple:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;val testActor = TestActorRef[Worker]&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If you try to do this from Java you will notice that you can't use a similar call. I have to admit that I am not quite sure yet what is going on. I would have expected that there is an apply() method on the TestActorRef companion object that uses some kind of implicits to instanciate the Worker object. But when inspecting the sources the thing that comes closest to it is this definition:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;def apply[T &lt;: Actor](factory: ⇒ T)(implicit system: ActorSystem)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;No sign of implicit for the factory. Something I still have to investigate further.&lt;/p&gt; &lt;p&gt;To use it from Java you can use the method apply that takes a reference to a &lt;a href=&quot;http://www.scala-lang.org/api/current/scala/Function0.html&quot;&gt;Function0&lt;/a&gt; and an actor system. 
The actor system can be setup easily using&lt;/p&gt; &lt;pre&gt;&lt;code&gt;actorSystem = ActorSystem.apply();&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The &lt;a href=&quot;http://stackoverflow.com/questions/1223834/how-does-scalas-apply-method-magic-work/1223913#1223913&quot;&gt;apply() method&lt;/a&gt; is very important in scala as it's kind of the default method for objects. For example myList(1) is internally using myList.apply(1).&lt;/p&gt; &lt;o&gt;If you're like me and expect that Function0 is a single method interface you will be surprised. It contains a lot of strange looking methods that you really don't want to have cluttering your test code:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;TestActorRef&lt;Pi.Worker&gt; workerRef = TestActorRef.apply(new Function0&lt;Pi.Worker&gt;() {&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public Worker apply() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public void apply$mcV$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public boolean apply$mcZ$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public byte apply$mcB$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public short apply$mcS$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public char apply$mcC$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public int apply$mcI$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public long apply$mcJ$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public float apply$mcF$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public double apply$mcD$sp() {&lt;br /&gt;        throw new UnsupportedOperationException(&quot;Not supported yet.&quot;);&lt;br /&gt;    }&lt;br /&gt;}, actorSystem);&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The only method we really are interested in is the normal apply method. Where do those other methods come from? There is no obvious hint in the scaladocs. &lt;/p&gt; &lt;p&gt;During searching for the solution I found a &lt;a href=&quot;http://scala-programming-language.1934581.n4.nabble.com/Problem-Java-and-Scala-interoperability-in-2-8-0-RC1-td2125847.html&quot;&gt;mailing list thread&lt;/a&gt; that explains some of the magic. The methods are performance optimizations for boxing and unboxing that are automatically generated by the scala compiler for the &lt;a href=&quot;http://www.scala-lang.org/archives/downloads/distrib/files/nightly/docs/library/scala/specialized.html&quot;&gt;@specialized annotation&lt;/a&gt;. Still, I am unsure about why this is happening exactly. 
According to &lt;a href=&quot;http://days2010.scala-lang.org/node/138/151&quot;&gt;this presentation&lt;/a&gt; I would have expected that I am using the specialized instance for Object, maybe that is something special regarding traits? &lt;/p&gt; &lt;p&gt;Fortunately we don't really need to implement the interface ourself: There's an adapter class, &lt;a href=&quot;http://www.scala-lang.org/api/current/scala/runtime/AbstractFunction0.html&quot;&gt;AbstractFunction0&lt;/a&gt;, that makes your code look much nicer:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Before&lt;br /&gt;public void initActor() {&lt;br /&gt;    actorSystem = ActorSystem.apply();&lt;br /&gt;    actorRef = TestActorRef.apply(new AbstractFunction0&lt;Pi.Worker&gt;() {&lt;br /&gt;&lt;br /&gt;        @Override&lt;br /&gt;        public Pi.Worker apply() {&lt;br /&gt;            return new Pi.Worker();&lt;br /&gt;        }&lt;br /&gt;           &lt;br /&gt;    }, actorSystem);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;    &lt;p&gt;This is like I would have expected it to behave in the first place.&lt;/p&gt; &lt;p&gt;Now, as we have setup our test we can use the TestActorRef to really test the actor. For example we can test that the actor doesn't do anything for a String message:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Test&lt;br /&gt;public void doNothingForString() {&lt;br /&gt;    TestProbe testProbe = TestProbe.apply(actorSystem);&lt;br /&gt;    actorRef.tell(&quot;Hello&quot;, testProbe.ref());&lt;br /&gt;&lt;br /&gt;    testProbe.expectNoMsg(Duration.apply(100, TimeUnit.MILLISECONDS));&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;&lt;a href=&quot;http://akka.io/api/akka/2.0/akka/testkit/TestProbe.html&quot;&gt;TestProbe&lt;/a&gt; is another helper that can be used to check the messages that are sent between cooperating actors. In this example we are checking that no message is passed to the sender for 100 miliseconds, which should be enough for execution.&lt;/p&gt; &lt;p&gt;Let's test some real functionality. Send a message to the actor and check that the result message is send:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;@Test&lt;br /&gt;public void calculatePiFor0() {&lt;br /&gt;    TestProbe testProbe = TestProbe.apply(actorSystem);&lt;br /&gt;    Pi.Work work = new Pi.Work(0, 0);        &lt;br /&gt;    actorRef.tell(work, testProbe.ref());&lt;br /&gt;&lt;br /&gt;    testProbe.expectMsgClass(Pi.Result.class);     &lt;br /&gt;    TestActor.Message message = testProbe.lastMessage();&lt;br /&gt;    Pi.Result resultMsg = (Pi.Result) message.msg();&lt;br /&gt;    assertEquals(0.0, resultMsg.getValue(), 0.0000000001);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;   &lt;p&gt;Now we use the TestProbe to block until a message arrives. When it's there we can have a look at using the lastMessage().&lt;/p&gt; &lt;p&gt;You can look at the rest of the test on &lt;a href=&quot;https://github.com/fhopf/akka/tree/master/akka-tutorials/akka-tutorial-first&quot;&gt;Github&lt;/a&gt;. Comments are more than welcome as I am pretty new to Scala as well as Akka.&lt;/p&gt; &lt;h4&gt;Update&lt;/h4&gt;&lt;p&gt;As &lt;a href=&quot;https://twitter.com/#!/jboner/status/177488321784713216&quot;&gt;Jonas Bonér points out&lt;/a&gt; I've been using the Scala API. Using the Props class the setup is easier:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;    @Before&lt;br /&gt;    public void initActor() {&lt;br /&gt;        actorSystem = ActorSystem.apply();&lt;br /&gt;        actorRef = TestActorRef.apply(new Props(Pi.Worker.class), actorSystem);&lt;br /&gt;    }&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Legacy Code Retreat</title>
   <link href="http://blog.florian-hopf.de/2012/02/legacy-code-retreat.html"/>
   <updated>2012-02-20T00:22:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/02/legacy-code-retreat</id>
   <content type="html">&lt;p&gt;Yesterday I attended the first German Legacy Code Retreat in Bretten. The event was organized by &lt;a href=&quot;http://groupspaces.com/softwerkskammer/&quot;&gt;Softwerkskammer&lt;/a&gt;, the German software craftsmanship community. &lt;/p&gt; &lt;p&gt;A &lt;a href=&quot;http://whatis.legacycoderetreat.com/&quot;&gt;legacy code retreat&lt;/a&gt; doesn't work like a common code retreat where you implement a certain functionality again and again. It instead starts with some really flawed code and the participants apply different refactoring steps to make it more testable and maintainable. There are six iterations of 45 minutes with different tasks or aims. For each iteration you work with a different partner and after a short retrospective with all participants you mostly start again from the original code. &lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;https://github.com/jbrains/trivia&quot;&gt;github repository for the legacy code&lt;/a&gt; contains the code in several languages among which are Java, C++, C# and Ruby.&lt;/p&gt; &lt;h4&gt;Iteration 1&lt;/h4&gt; &lt;p&gt;The first iteration was used to get to know the functionality of the code. There were no real rules so the participants were free to explore the code in any way they liked.&lt;/p&gt; &lt;p&gt;I paired with &lt;a href=&quot;http://www.lesscode.de/&quot;&gt;Heiko Seebach&lt;/a&gt; whom I already knew to be a Ruby guy. We were looking at the code with a standard text editor, already quite unfamiliar compared to standard Java IDE work. I already have enough Ruby knowledge to understand code when I see it, so this was no problem. For quite some time we tried to understand a certain aspect that was happening when running the code. It turned out that this was a bug in the Ruby version of the code. Next we tried to set up RSpec and get started with some tests.&lt;/p&gt; &lt;p&gt;During this iteration I didn't learn that much about the legacy code but more about some Ruby stuff.&lt;/p&gt; &lt;h4&gt;Iteration 2&lt;/h4&gt; &lt;p&gt;The target of the second iteration was to prepare a golden master test that could be used during all of the following iterations. The original legacy code is triggered by random input (in the Java version using java.util.Random) and writes all its state to System.out. We should capture the output for a certain input sequence and write it to a file. This can then automatically be compared to the output of a modified version. If both files are the same there are likely no regressions in the code.&lt;/p&gt; &lt;p&gt;I paired with another Java guy and we were working on my machine in Netbeans. I noticed how unfamiliar I am with standard Netbeans project setup as I am using Maven most of the time. We wrote the test and started some refactorings, all in all a quite productive iteration. Things I learned: &lt;a href=&quot;http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Random.html&quot;&gt;java.util.Random&lt;/a&gt; really only uses the seed for its number generation, so if you are using the same seed again and again you always get the same result. Also, when doing file stuff in plain Java I really miss &lt;a href=&quot;http://commons.apache.org/io/&quot;&gt;commons-io&lt;/a&gt;.&lt;/p&gt; &lt;h4&gt;Iteration 3&lt;/h4&gt; &lt;p&gt;In Iteration 3 we were supposed to use an antipattern for testing: &lt;a href=&quot;http://c2.com/cgi/wiki?SubclassToTestAntiPattern&quot;&gt;Subclass to Test&lt;/a&gt;.&lt;/p&gt;
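&lt;p&gt;For illustration, here is a minimal sketch of the antipattern (the class and method names are made up, they are not taken from the trivia code): the test creates a subclass that overrides a method the method under test calls internally, so no real state needs to be set up:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class Game {&lt;br /&gt;&lt;br /&gt;    private int player = 1;&lt;br /&gt;&lt;br /&gt;    // normally reads mutable game state&lt;br /&gt;    protected int currentPlayer() {&lt;br /&gt;        return player;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    // the method we want to test&lt;br /&gt;    public String announceWinner() {&lt;br /&gt;        return &quot;Player &quot; + currentPlayer() + &quot; wins&quot;;&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;// in the test: override the collaborating method instead of preparing state&lt;br /&gt;@Test&lt;br /&gt;public void announcesOverriddenPlayer() {&lt;br /&gt;    Game game = new Game() {&lt;br /&gt;        @Override&lt;br /&gt;        protected int currentPlayer() {&lt;br /&gt;            return 2;&lt;br /&gt;        }&lt;br /&gt;    };&lt;br /&gt;&lt;br /&gt;    assertEquals(&quot;Player 2 wins&quot;, game.announceWinner());&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;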
&lt;p&gt;The idea is that you take the original class and override some methods in it that are called from the method under test.&lt;/p&gt; &lt;p&gt;It turned out that the original code is not suited well for this approach. There are only few methods that really rely on other methods. Most of the methods are accessing the state via the fields directly. My partner and I therefore didn't really override methods but instead used an &lt;a href=&quot;http://docs.oracle.com/javase/tutorial/java/javaOO/initial.html&quot;&gt;initializer block&lt;/a&gt; for preparing the state for method calls. This is similar to an approach for Map initialization that I started to apply just recently:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;Map&lt;String, String&gt; data = new HashMap&lt;String, String&gt;() {&lt;br /&gt;    {&lt;br /&gt;        put(&quot;key&quot;, &quot;value&quot;);&lt;br /&gt;    }&lt;br /&gt;};&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The approach worked quite fine for the given code but it's probably true that the tests won't stay maintainable.&lt;/p&gt; &lt;h4&gt;Iteration 4&lt;/h4&gt; &lt;p&gt;Iteration 4 was based on the previous iteration. All the methods that had been subclassed for testing were supposed to be moved to delegates and passed into the original class using a dependency injection approach.&lt;/p&gt; &lt;p&gt;I paired with a C++ guy who does embedded work in his day job. It turned out that we had quite different opinions and experiences. He was really focused on performance and couldn't understand why you would want to move methods to another class just to delegate to them, as you are introducing overhead with the method call.&lt;/p&gt; &lt;p&gt;I haven't done any C++ programming since university. Eclipse seems to be well suited for C++ development but compared to Java it still seems to lack a lot of convenience functionality. &lt;/p&gt; &lt;h4&gt;Iteration 5&lt;/h4&gt; &lt;p&gt;In Iteration 5 I paired with Tilman, a Clean Code aficionado whom I already knew from our &lt;a href=&quot;http://jug-karlsruhe.de/&quot;&gt;local Java User Group&lt;/a&gt;. We were supposed to change as many methods as possible to real functions that don't work on fields but on parameter values only.&lt;/p&gt; &lt;p&gt;A lot of people were struggling with this approach at first. But it turns out that if you do this you have a really good starting position for doing further refactorings more easily. &lt;/p&gt; &lt;p&gt;My partner was doing most of the coding with some input from me. We were taking some directions I wouldn't have taken by myself but the resulting code was really well structured and could be reduced in size. Also we worked with an interesting Eclipse plugin I had seen before already: &lt;a href=&quot;http://infinitest.github.com/&quot;&gt;Infinitest&lt;/a&gt; always runs the tests in the background, no need to run the tests manually. I have to check if there's something like this available for Netbeans as well.&lt;/p&gt; &lt;h4&gt;Iteration 6&lt;/h4&gt; &lt;p&gt;To be honest, I don't know what the goal of the sixth iteration really was. I was pairing with a developer who was still fighting with the failing tests from the previous iteration. Most of the iteration we tried to get these running again. In the last few minutes we managed to extract some classes and clean up some code.&lt;/p&gt; &lt;h4&gt;Conclusion&lt;/h4&gt;&lt;p&gt;The first German Legacy Code Retreat really was a great experience. 
I learned a lot and, probably even more important, had a lot of fun.&lt;/p&gt;&lt;p&gt;The food and the location both were excellent. Thanks to the organizers &lt;a href=&quot;http://pboop.wordpress.com/&quot;&gt;Nicole and Andreas&lt;/a&gt; as well as the sponsors for making it possible. It's great to be able to attend a high quality event totally for free.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Running my Tests again</title>
   <link href="http://blog.florian-hopf.de/2012/01/running-my-tests-again.html"/>
   <updated>2012-01-11T05:29:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2012/01/running-my-tests-again</id>
   <content type="html">For some time I've been bugged by a &lt;a href=&quot;http://netbeans.org&quot;&gt;Netbeans&lt;/a&gt; problem that I couldn't find any solution to. When running a unit test from within Netbeans, from time to time it happened that the tests just failed. They seemed to be executed in an old state. Running them again didn't help either; it seemed that some parts of the project didn't get recompiled. When executing the tests from a command line Maven build there were never any problems and afterwards the tests could be run again from Netbeans. The problem only occurred very infrequently but nevertheless it was really annoying. I started not running the tests from Netbeans at all but only using Maven. That is also not a good solution as you either run all tests or have to edit the command line all the time for running only a single test.&lt;br /&gt;&lt;br /&gt;Recently I noticed what caused the problem: Netbeans has its &lt;a href=&quot;http://wiki.netbeans.org/FaqCompileOnSave&quot;&gt;compile on save&lt;/a&gt; feature on for tests. This means it is using its internal incremental compile feature which doesn't seem to work correctly, at least for some project setups. &lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://4.bp.blogspot.com/-JL0unt9i9Zo/Twyz51UurNI/AAAAAAAAADA/Q5VIVX8HLxc/s1600/cos.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 86px;&quot; src=&quot;http://4.bp.blogspot.com/-JL0unt9i9Zo/Twyz51UurNI/AAAAAAAAADA/Q5VIVX8HLxc/s320/cos.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5696125434864774354&quot; /&gt;&lt;/a&gt;&lt;br /&gt;You can disable it in the project properties on the Build/Compile node. I haven't seen any problems since disabling it. It saves me a lot of time to be able to run the tests from the IDE again.</content>
 </entry>
 
 <entry>
   <title>Talking about Code</title>
   <link href="http://blog.florian-hopf.de/2011/12/talking-about-code.html"/>
   <updated>2011-12-29T17:22:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/12/talking-about-code</id>
   <content type="html">Yesterday I attended the &lt;a href=&quot;http://groupspaces.com/softwerkskammer/&quot;&gt;Softwerkskammer&lt;/a&gt; Karlsruhe meetup for the first time. Softwerkskammer tries to connect the software craftsmanship community in Germany.&lt;br /&gt;&lt;br /&gt;The topic for the evening was simple: More Code. We looked at a lot of samples from a real project and discussed what's wrong with them and what could be done better. There were a lot of different opinions but that's a good thing as I got to question some habits I have when programming.&lt;br /&gt;&lt;br /&gt;This was the first time I've been to a meeting with such a lively discussion. The conferences and &lt;a href=&quot;http://jug-ka.de&quot;&gt;user groups&lt;/a&gt; I attend mostly have classic talks with one speaker and far less audience participation. Talking about code is a really good way to learn and this won't be the last time I attend a meetup. Thanks to the organizers.</content>
 </entry>
 
 <entry>
   <title>Spring in Action</title>
   <link href="http://blog.florian-hopf.de/2011/12/spring-in-action.html"/>
   <updated>2011-12-26T23:46:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/12/spring-in-action</id>
   <content type="html">&lt;p&gt;Sometimes it's comfortable to not be an absolute expert in a certain technology. This makes it really easy to learn new stuff, e.g. by mundane methods like reading a book. Even if you are a Spring expert it is still likely that you will take something from the latest edition of &lt;a href=&quot;http://www.manning.com/walls4/&quot;&gt;Spring in Action by Craig Walls&lt;/a&gt; as it covers a wide range of topics. I haven't read any of the predecessors but people told me that those are even better.&lt;/p&gt;   &lt;p&gt;Having finished the book recently I just wanted to take the time to write down two interesting small configuration features that I learned from it.&lt;/p&gt; &lt;h4&gt;p-Namespace&lt;/h4&gt; &lt;p&gt;A feature that I just didn't know about before but that seems to be quite useful is the &lt;a href=&quot;http://static.springsource.org/spring/docs/3.0.6.RELEASE/spring-framework-reference/html/beans.html#beans-p-namespace&quot;&gt;p-Namespace&lt;/a&gt;. It's a namespace that is not backed by a schema and allows you to configure beans in a really concise way. For example look at how a bean might be configured normally:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;    &amp;lt;bean id=&quot;foo&quot; class=&quot;foo.bar.Baz&quot;&amp;gt;&lt;br /&gt;        &amp;lt;property name=&quot;myLongProperty&quot; value=&quot;2&quot;/&amp;gt;&lt;br /&gt;        &amp;lt;property name=&quot;myStringProperty&quot; value=&quot;Hallo&quot;/&amp;gt;&lt;br /&gt;    &amp;lt;/bean&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;The properties we'd like to set are children of the bean node. Netbeans comes with nice autocompletion support for the property names as you can see from the screenshot.&lt;/p&gt; &lt;a href=&quot;http://2.bp.blogspot.com/-gFYvhLg2vwo/TvmGWCb6GRI/AAAAAAAAACw/AFwJLfc-5eU/s1600/blog-netbeans-spring-property.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 86px;&quot; src=&quot;http://2.bp.blogspot.com/-gFYvhLg2vwo/TvmGWCb6GRI/AAAAAAAAACw/AFwJLfc-5eU/s320/blog-netbeans-spring-property.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5690727317328501010&quot; /&gt;&lt;/a&gt; &lt;p&gt;The p-Namespace is a more concise version where the property names themselves become attributes of the bean node:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;    &amp;lt;bean id=&quot;foo&quot; class=&quot;foo.bar.Baz&quot;&lt;br /&gt;        p:myLongProperty=&quot;2&quot; p:myStringProperty=&quot;Hallo&quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt;   &lt;p&gt;Note that Netbeans is clever enough to offer code completion here as well. &lt;/p&gt; &lt;a href=&quot;http://4.bp.blogspot.com/-cwq2dtJ9eEw/TvmGIRSzexI/AAAAAAAAACk/RcPWwIeLjY0/s1600/blog-netbeans-spring-pnamespace.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 116px;&quot; src=&quot;http://4.bp.blogspot.com/-cwq2dtJ9eEw/TvmGIRSzexI/AAAAAAAAACk/RcPWwIeLjY0/s320/blog-netbeans-spring-pnamespace.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5690727080798681874&quot; /&gt;&lt;/a&gt; &lt;p&gt;I am not sure if I will use the short form of the p-Namespace a lot. A consistent use of the features in a project is quite important so I think if the short form is used it should be used everywhere in the project.&lt;/p&gt; &lt;h4&gt;Accessing Constants&lt;/h4&gt; &lt;p&gt;Sometimes you need to access some constants in your Spring configuration files.&lt;/p&gt;
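&lt;p&gt;To make the following snippets a bit more concrete, assume a simple bean with an int property called day that the constant should end up in (the class name is made up for this example):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class WeeklyReportJob {&lt;br /&gt;&lt;br /&gt;    private int day;&lt;br /&gt;&lt;br /&gt;    // Spring calls this setter with the resolved constant value,&lt;br /&gt;    // e.g. java.util.Calendar.WEDNESDAY&lt;br /&gt;    public void setDay(int day) {&lt;br /&gt;        this.day = day;&lt;br /&gt;    }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;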
&lt;p&gt;There are several ways to handle this, one of them using the util-Namespace:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;property name=&quot;day&quot;&amp;gt;&lt;br /&gt;    &amp;lt;util:constant static-field=&quot;java.util.Calendar.WEDNESDAY&quot;/&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;/code&gt;&lt;/pre&gt;    &lt;p&gt;Another way is to use the Spring Expression Language to access it:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;&amp;lt;property name=&quot;day&quot; value=&quot;#{T(java.util.Calendar).WEDNESDAY}&quot;/&amp;gt;&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;I think this can be used more widely as the value doesn't need to be registered as a subnode. For example I had some problems using util:constant as key or value in a util:map. That would have been easy just using the EL version.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Not another Diamond Operator Introduction</title>
   <link href="http://blog.florian-hopf.de/2011/12/not-another-diamond-operator.html"/>
   <updated>2011-12-08T05:46:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/12/not-another-diamond-operator</id>
   <content type="html">I just returned from the talk &quot;Lucky Seven&quot; of our local &lt;a href=&quot;http://jug-ka.de/&quot;&gt;Java User Group&lt;/a&gt;. It was far better than I expected. Not that I expected Wolfgang Weigend to be a bad speaker but though I organized the event I got the feeling that I had seen one too many Java 7 introductions already. But there was more ...&lt;br /&gt;&lt;br /&gt;One of the interesting aspects that I haven't been paying that much attention to is the merge of the &lt;a href=&quot;http://www.oracle.com/technetwork/middleware/jrockit/overview/index.html&quot;&gt;JRockit&lt;/a&gt; and &lt;a href=&quot;http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html&quot;&gt;Hotspot VM&lt;/a&gt;. Hotspot will be the base of the new development and JRockit features will be merged in. Some of these features will already be available in OpenJDK during the JDK 7 timespan.&lt;br /&gt;&lt;br /&gt;One of the changes has received some attention lately: &lt;a href=&quot;http://openjdk.java.net/jeps/122&quot;&gt;The PermGen space will be removed.&lt;/a&gt; Sounds like a major change but, once it works, it will definitely be a huge benefit.&lt;br /&gt;&lt;br /&gt;JRockit has been highly respected for its monitoring features. Among those is the interesting &lt;a href=&quot;http://docs.oracle.com/cd/E15289_01/doc.40/e15070/introduction.htm&quot;&gt;Java Flight Recorder&lt;/a&gt; that reminds me of the commercial project &lt;a href=&quot;http://www.chrononsystems.com/&quot;&gt;Chronon&lt;/a&gt;. It will be an always-on recording of data in the JVM that can be used for diagnostic purposes. Sounds really interesting!&lt;br /&gt;&lt;br /&gt;The overall goal of the convergence is to have a VM that can tune itself. Looking forward to it!&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://jug-karlsruhe.mixxt.de/networks/files/file.84658&quot;&gt;The (mixed German and English) slides of the talk are available for download.&lt;/a&gt;</content>
 </entry>
 
 <entry>
   <title>Getting started with Gradle</title>
   <link href="http://blog.florian-hopf.de/2011/10/getting-started-with-gradle.html"/>
   <updated>2011-10-26T16:58:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/10/getting-started-with-gradle</id>
   <content type="html">&lt;a href=&quot;http://maven.apache.org/&quot;&gt;Maven&lt;/a&gt; has been my build tool of choice for some years now. Coming from Ant, the declarative approach, the useful conventions and the dependency management offered a huge benefit. But as with most technologies, the more you use it, the more minor and major flaws appear. A big problem is that with Maven builds are sometimes not reproducible. The outcome of the build is influenced by the state of your local repository.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://gradle.org/&quot;&gt;Gradle&lt;/a&gt; is a &lt;a href=&quot;http://groovy.codehaus.org/&quot;&gt;Groovy&lt;/a&gt;-based build system that is often recommended as a more advanced system. The features that make it appealing to me are probably the easier syntax and the &lt;a href=&quot;http://forums.gradle.org/gradle/topics/welcome_to_our_new_dependency_cache&quot;&gt;advanced dependency cache&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;For a &lt;a href=&quot;https://github.com/fhopf/opencms-rfs-driver&quot;&gt;recent project that I just uploaded for someone else&lt;/a&gt; I needed to add a simple way to build the jar. Time to do it with Gradle and see what it feels like.&lt;br /&gt;&lt;h4&gt;The build script&lt;/h4&gt;&lt;br /&gt;The purpose of the build is simple: compile some classes with some dependencies and package those into a jar file. Like Maven and Ant, Gradle needs at least one file that describes the build. This is what build.gradle looks like:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;apply plugin: 'java'&lt;br /&gt;&lt;br /&gt;repositories {&lt;br /&gt;    mavenCentral()&lt;br /&gt;    mavenRepo url: &quot;http://bp-cms-commons.sourceforge.net/m2repo&quot;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;dependencies {&lt;br /&gt;    compile group: 'org.opencms', name: 'opencms-core', version: '7.5.4'&lt;br /&gt;    compile group: 'javax.servlet', name: 'servlet-api', version: '2.5'&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;sourceSets {&lt;br /&gt;    main {&lt;br /&gt;        java {&lt;br /&gt;            srcDir 'src'&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Let's step through the file block by block. The first line tells Gradle to use the java plugin. This plugin ships with tasks for compiling and packaging Java classes.&lt;br /&gt;&lt;br /&gt;In the next block we are declaring some dependency repositories. Luckily Gradle supports Maven repositories so existing repositories like Maven Central can be used. I guess without this feature Gradle would not gain a lot of adoption at all. There are two repos declared: Maven Central where most of the common dependencies are stored and a custom repo that provides the OpenCms dependencies.&lt;br /&gt;&lt;br /&gt;The next block is used to declare which dependencies are necessary for the build. Gradle also supports scopes (in Gradle: configurations) so for example you can declare that some jars are only needed during the test run. The dependency declaration is in this case similar to the Maven coordinates but Gradle also supports more advanced features like &lt;a href=&quot;http://gradle.org/current/docs/userguide/artifact_dependencies_tutorial.html&quot;&gt;version ranges&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The last block isn't really necessary. It's only there because my Java sources are located in src instead of the default src/main/java. 
Gradle uses a lot of the Maven conventions so it's really easy to migrate builds.&lt;br /&gt;&lt;h4&gt;Building&lt;/h4&gt;&lt;br /&gt;To build the project you need Gradle installed. You can download a single distribution that already packages Groovy and all the needed files. You only need to add the bin folder to your path.&lt;br /&gt;&lt;br /&gt;Packaging the jar is easy: You just run the jar task in the java plugin: gradle :jar. Gradle will start to download all direct and transitive dependencies. The fun part: It uses a nice command line library that can display text in bold, rewrite lines and the like. Fun to watch.&lt;br /&gt;&lt;br /&gt;I like the simplicity and readability of the build script. You don't need to declare anything if you don't really need it. No coordinates, no schema declaration, nothing. I hope I will find time to use it in a larger project so I can see what it really feels like in the daily project work.</content>
 </entry>
 
 <entry>
   <title>Book Review: Solr 1.4 Enterprise Search Server</title>
   <link href="http://blog.florian-hopf.de/2011/03/book-review-solr-14-enterprise-search.html"/>
   <updated>2011-03-01T03:08:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/03/book-review-solr-14-enterprise-search</id>
   <content type="html">I've been interested in &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt; since I read about it the first time, must have been some time in 2008, doing some research for a search-centric web page that was supposed to be run on OpenCms but unfortunately was never developed. At that time I wouldn't have used it as I hadn't heard about it before but I liked the idea a lot. After having attended the &lt;a href=&quot;http://parleys.com/#sl=1&amp;st=5&amp;id=1546&quot;&gt;Devoxx university session by Erik Hatcher&lt;/a&gt; on Solr in 2009 I was completely sure that the next search system I would implement would be based on Solr. The project's nearly finished now, time to recap what I took out of the book I got for learning Solr.&lt;br /&gt;&lt;br /&gt;First of all, when learning a new technology I prefer paper books over internet research. Though there are other books available, &lt;a href=&quot;https://www.packtpub.com/solr-1-4-enterprise-search-server/book&quot;&gt;Solr 1.4 Enterprise Search Server by David Smiley and Eric Pugh&lt;/a&gt; seems to be the one that is most often recommended.&lt;br /&gt;&lt;br /&gt;The book starts off with a high-level introduction into what Solr and Lucene are, some first examples and interestingly, how to build Solr from source. Though the book was released before Solr 1.4 the authors seemed to have the foresight that some features might still be lacking and had to be included manually. In fact, I've never seen an open source project where applying patches is such a common thing as it seems to be the case for Solr.&lt;br /&gt;&lt;br /&gt;Schema configuration and text analysis are the topics for the second chapter. It begins with an introduction into &lt;a href=&quot;http://musicbrainz.org/&quot;&gt;MusicBrainz&lt;/a&gt;, a freely available data set of music data that is used as an example throughout the book. This chapter is crucial to the understanding of Solr as it introduces a lot of Lucene concepts that probably not every reader is familiar with.&lt;br /&gt;&lt;br /&gt;After quite some theory chapter 3 starts with the practical parts, covering the indexing process. Curl, the command line HTTP client, is used to send data to Solr and retrieve it. Another option, the data import handler, which directly imports data from a database, is also introduced.&lt;br /&gt;&lt;br /&gt;Chapters 4 to 6 walk the reader through the search process and several useful components to enhance the user's search experience like faceting and the dismax request handler. This is the part where Solr really shines as you can see how easy it is to integrate new features in your application that probably would have taken a long time to develop using plain Lucene.&lt;br /&gt;&lt;br /&gt;Deploying Solr is covered in Chapter 7 with quite some useful information on configuring and monitoring a Solr instance. Chapter 8 looks at some client APIs from different programming languages, SolrJ being the most important to me. The book ends with an in-depth look at how Solr can be tuned and scaled.&lt;br /&gt;&lt;br /&gt;I can say that this is a really excellent book, as an introduction to Solr as well as a reference while developing your application. The most common use cases are covered, the examples make it really easy to adopt the concepts in your application. 
There is a lot of hands-on information that proves useful during development and deployment of your application.&lt;br /&gt;&lt;br /&gt;Some slight drawbacks I don't want to keep to myself: As the common message format for Solr is a custom XML dialect, there is a lot of XML in the book to digest. As it's so commonly used that's not necessarily a bad thing but you might get quite dizzy looking at a lot of angle brackets. From a reader's perspective some variety would have been nice e.g. by mixing XML with the Ruby format or JSON or introducing client APIs earlier. Also, while it's a good idea to use a data set that is freely available, MusicBrainz probably isn't the best data set for demoing some features. There are no large text sections or documents, which are often what a search application will be built on. And finally, not really an issue of the authors but rather of the publisher, PacktPub: When skimming through the book it's quite hard to see when a new section begins. The headlines do not contain a numbering scheme and are of a very similar size.&lt;br /&gt;&lt;br /&gt;Nevertheless, if you have to develop an application using Solr, you should by all means buy this book, you won't regret it.</content>
 </entry>
 
 <entry>
   <title>Running Ruby on Rails Tests in Netbeans</title>
   <link href="http://blog.florian-hopf.de/2011/01/running-ruby-on-rails-tests-in-netbeans.html"/>
   <updated>2011-01-23T09:55:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/01/running-ruby-on-rails-tests-in-netbeans</id>
   <content type="html">I don't get it. Netbeans is often recommended as an excellent IDE for Ruby on Rails development, not only when targeting the JVM. Nevertheless, even some basic features don't seem to be working with the default setup. You can't even run the tests, which is fundamental when developing in a dynamic language.&lt;br /&gt;&lt;br /&gt;What's happening? Suppose you have a simple app and you want to run some tests using the test database. Not sure if this is mandatory when using the built-in JRuby but it &lt;a href=&quot;http://netbeans.org/kb/docs/ruby/rapid-ruby-weblog.html&quot;&gt;seems to be normal&lt;/a&gt; to use the jdbcmysql adapter. When you try to run the tests you will see something like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;1) Error:&lt;br /&gt;test_index_is_ok(ContactsControllerTest):&lt;br /&gt;ActiveRecord::StatementInvalid: ActiveRecord::JDBCError: Table 'kontakt_test.contacts' doesn't exist: DELETE FROM `contacts`&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;followed by the stack trace that isn't really helpful as it's not the root cause. Rails somehow doesn't create the tables in the test database. You'll see a more helpful output when starting the rake task &quot;db:test:prepare&quot; directly in debug mode:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;** Invoke db:test:prepare (first_time)&lt;br /&gt;** Invoke db:abort_if_pending_migrations (first_time)&lt;br /&gt;** Invoke environment (first_time)&lt;br /&gt;** Execute environment&lt;br /&gt;** Execute db:abort_if_pending_migrations&lt;br /&gt;rake aborted!&lt;br /&gt;Task not supported by 'jdbcmysql'&lt;br /&gt;/path/to/netbeans-6.9.1/ruby/jruby-1.5.1/lib/ruby/gems/1.8/gems/rails-2.3.8/lib/tasks/databases.rake:380&lt;br /&gt;/path/to/netbeans/ruby/jruby-1.5.1/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:636:in `call'&lt;br /&gt;/path/to/netbeans/ruby/jruby-1.5.1/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:636:in `execute'&lt;br /&gt;/path/to/netbeans/ruby/jruby-1.5.1/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:631:in `each'&lt;br /&gt;[...]&lt;br /&gt;** Execute db:test:prepare&lt;br /&gt;** Invoke db:test:load (first_time)&lt;br /&gt;** Invoke db:test:purge (first_time)&lt;br /&gt;** Invoke environment &lt;br /&gt;** Execute db:test:purge&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The task fails in the database task in the Rails lib. You can open up the source code by opening the node &lt;code&gt;Libraries/Built-in-JRuby/rails-2.3.8/lib/tasks/databases.rake&lt;/code&gt; in Netbeans.&lt;br /&gt;&lt;br /&gt;At line 357 you can see the problem: Rails only expects some hardcoded adapters, jdbcmysql not being one of them. It skips the task for unknown adapters. Two options to fix it: Insert a regular expression that matches both:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;when /mysql/ # instead of when &quot;mysql&quot;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;or add the jdbcmysql adapter as a second option:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;when &quot;mysql&quot;,&quot;jdbcmysql&quot;&lt;br /&gt;&lt;/pre&gt; &lt;br /&gt;Now the tests are running and hopefully passing. The same kind of error might occur for other tasks as well as there are some more checks for the mysql adapter in this file. You should be able to fix them the same way.&lt;br /&gt;&lt;br /&gt;I wouldn't have expected to have to patch the Rails code to use it in Netbeans but this doesn't seem to be &lt;a href=&quot;http://markmail.org/thread/w6iwdrtwq32oktir#query:+page:1+mid:wjysi5gab76dtdas+state:results&quot;&gt;uncommon&lt;/a&gt;. 
&lt;a href=&quot;http://blog.nicksieger.com/articles/2009/10/12/fresh-0-9-2-activerecord-jdbc-adapter-release&quot;&gt;Using a recent ActiveRecord version&lt;/a&gt; is supposed to fix the problem as you can then use mysql as the adapter name, but I didn't find a way to run the jdbc generator from Netbeans. It isn't available in the list of generators and I didn't find a generator gem to download.&lt;br /&gt;&lt;br /&gt;What did I learn from this? I got a better understanding of how the build process works using rake. But more importantly: even technologies that have been hyped for a long time might not be as flawless as you would expect.</content>
 </entry>
 
 <entry>
   <title>Refactoring in Git</title>
   <link href="http://blog.florian-hopf.de/2011/01/refactoring-in-git.html"/>
   <updated>2011-01-09T20:49:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/01/refactoring-in-git</id>
   <content type="html">&lt;p&gt;To me, when using SVN, the most important reason for using an IDE plugin was the refactoring support: SVN doesn't notice when you rename a file, you have to explicitly call svn mv.&lt;/p&gt; &lt;p&gt;I thought this would be a major problem with Git, as a Java refactoring changes the content and the filename in one go. As the content changes the SHA1-checksum also changes and you'd run into problems. Fortunately, that's not the case.&lt;/p&gt; &lt;p&gt;With Git, you don't need a special operation: It detects renames with minor changes automatically.&lt;/p&gt; &lt;p&gt;Time for a test. Suppose you have a simple Java class like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class TestClass {&lt;br /&gt;&lt;br /&gt;    public static void main(String [] args) {&lt;br /&gt;        System.out.println(&quot;Hello Git&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Commit it to the Git repository: &lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-netbeans$ git add src/TestClass.java&lt;br /&gt;flo@hank:~/git-netbeans$ git commit -m &quot;added test class&quot;&lt;br /&gt;[master 9269c2f] added test class&lt;br /&gt; 1 files changed, 7 insertions(+), 0 deletions(-)&lt;br /&gt; create mode 100644 src/TestClass.java&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Rename the class (either by using an IDE or by executing a manual refactoring by changing the file name and the class name):&lt;/p&gt; &lt;pre&gt;&lt;code&gt;public class TestClassWithNewName {&lt;br /&gt;&lt;br /&gt;    public static void main(String [] args) {&lt;br /&gt;        System.out.println(&quot;Hello Git&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;&lt;code&gt;git status&lt;/code&gt; will tell you something like this:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-netbeans$ git status&lt;br /&gt;# On branch master&lt;br /&gt;# Changed but not updated:&lt;br /&gt;#   (use &quot;git add/rm &lt;file&gt;...&quot; to update what will be committed)&lt;br /&gt;#   (use &quot;git checkout -- &lt;file&gt;...&quot; to discard changes in working directory)&lt;br /&gt;#&lt;br /&gt;#       deleted:    src/TestClass.java&lt;br /&gt;#&lt;br /&gt;# Untracked files:&lt;br /&gt;#   (use &quot;git add &lt;file&gt;...&quot; to include in what will be committed)&lt;br /&gt;#&lt;br /&gt;#       src/TestClassWithNewName.java&lt;br /&gt;no changes added to commit (use &quot;git add&quot; and/or &quot;git commit -a&quot;)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Doesn't look that good yet. It detects an added and a removed file. Next, stage the changes and have another look at the status:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-netbeans$ git rm src/TestClass.java&lt;br /&gt;rm 'src/TestClass.java'&lt;br /&gt;flo@hank:~/git-netbeans$ git add src/TestClassWithNewName.java&lt;br /&gt;flo@hank:~/git-netbeans$ git status&lt;br /&gt;# On branch master&lt;br /&gt;# Changes to be committed:&lt;br /&gt;#   (use &quot;git reset HEAD &lt;file&gt;...&quot; to unstage)&lt;br /&gt;#&lt;br /&gt;#       renamed:    src/TestClass.java -&gt; src/TestClassWithNewName.java&lt;br /&gt;#&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Neat, Git detected a rename. 
Let's commit and see the log:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-netbeans$ git commit -m &quot;refactored class&quot;&lt;br /&gt;[master 4acd7f1] refactored class&lt;br /&gt; 1 files changed, 1 insertions(+), 1 deletions(-)&lt;br /&gt; rename src/{TestClass.java =&gt; TestClassWithNewName.java} (72%)&lt;br /&gt;flo@hank:~/git-netbeans$ git log src/TestClassWithNewName.java&lt;br /&gt;commit 4acd7f19ccd6cc02816ee7f1293ea5a69d7a4ca7&lt;br /&gt;Author: Florian Hopf &lt;fhopf@web.de&gt;&lt;br /&gt;Date:   Sun Jan 9 14:27:59 2011 +0100&lt;br /&gt;&lt;br /&gt;    refactored class&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;Hmmm, only the last commit? Looks like we have to tell Git that we want to follow renames:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-netbeans$ git log --follow src/TestClassWithNewName.java&lt;br /&gt;commit 4acd7f19ccd6cc02816ee7f1293ea5a69d7a4ca7&lt;br /&gt;Author: Florian Hopf &lt;fhopf@web.de&gt;&lt;br /&gt;Date:   Sun Jan 9 14:27:59 2011 +0100&lt;br /&gt;&lt;br /&gt;    refactored class&lt;br /&gt;&lt;br /&gt;commit 9269c2fd194b2bd2b93a18ab88f21fb2180c5870&lt;br /&gt;Author: Florian Hopf &lt;fhopf@web.de&gt;&lt;br /&gt;Date:   Sun Jan 9 13:48:35 2011 +0100&lt;br /&gt;&lt;br /&gt;    added test class&lt;/code&gt; &lt;/pre&gt; &lt;p&gt;What do I take from this experiment? I guess I won't use the &lt;a href=&quot;http://nbgit.org/&quot;&gt;Netbeans Git plugin&lt;/a&gt; for now. I still have to get acquainted with the command line and it's better to learn the basics first.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Git hook for Redmine messages</title>
   <link href="http://blog.florian-hopf.de/2011/01/git-hook-for-redmine-messages.html"/>
   <updated>2011-01-09T18:40:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/01/git-hook-for-redmine-messages</id>
   <content type="html">&lt;p&gt;At work we are using &lt;a href=&quot;http://www.redmine.org/&quot;&gt;Redmine&lt;/a&gt; with the &lt;a href=&quot;http://www.redmine.org/projects/redmine/wiki/redminesettings#Referencing-issues-in-commit-messages&quot;&gt;repository references&lt;/a&gt; enabled. When adding special terms like refs #1234 or fixes #1234 to the commit message the commit is automatically assigned to ticket 1234 and shown with the ticket. Only committing code that references a ticket is considered to be a best practice as all changes are documented with a ticket.&lt;/p&gt; &lt;p&gt;As I'm using the Git SVN bridge now I tend to commit more often than when using plain SVN. Often I just forget to add the refs marker which is quite annoying. &lt;a href=&quot;http://progit.org/book/ch7-4.html&quot;&gt;Pro Git&lt;/a&gt; introduces a hook that can be used to check your commit message for a special format.&lt;/p&gt; &lt;p&gt;This is the shamelessly copied hook, adjusted to the Redmine keywords:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby&lt;br /&gt;message_file = ARGV[0]&lt;br /&gt;message = File.read(message_file)&lt;br /&gt;&lt;br /&gt;$regex = /(refs #(\d+)|fixes #(\d+))/&lt;br /&gt;&lt;br /&gt;if !$regex.match(message)&lt;br /&gt;  puts &quot;Your message is not formatted correctly (missing refs #XXX or fixes #XXX)&quot;&lt;br /&gt;  exit 1&lt;br /&gt;end&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;How to use it? Copy the code to the file &lt;code&gt;.git/hooks/commit-msg&lt;/code&gt; in your project and make it executable (&lt;code&gt;chmod +x .git/hooks/commit-msg&lt;/code&gt;).&lt;/p&gt; &lt;p&gt;Try to commit without the markers:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-redmine$ git commit -am &quot;commit that doesn't reference a ticket&quot;&lt;br /&gt;Your message is not formatted correctly (missing refs #XXX or fixes #XXX)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;And with a marker:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-redmine$ git commit -am &quot;commit that references a ticket, refs #1234&quot;&lt;br /&gt;[master 189b6b1] commit that references a ticket, refs #1234&lt;br /&gt; 1 files changed, 2 insertions(+), 0 deletions(-)&lt;/code&gt;&lt;/pre&gt; &lt;p&gt;If you want to skip the hook for some reason you can do so using the &lt;code&gt;--no-verify&lt;/code&gt; option:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;flo@hank:~/git-redmine$ git commit --no-verify -am &quot;special commit that doesn't reference a ticket&quot;&lt;br /&gt;[master d1c0698] special commit that doesn't reference a ticket&lt;br /&gt; 1 files changed, 1 insertions(+), 0 deletions(-)&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>GoGear title management in Sqllite</title>
   <link href="http://blog.florian-hopf.de/2011/01/gogear-title-management-in-sqllite.html"/>
   <updated>2011-01-06T18:20:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2011/01/gogear-title-management-in-sqllite</id>
   <content type="html">&lt;i&gt;While writing this post I noticed that I am just wrong with my assumptions on my GoGear device. Read on to learn why. I'll publish it anyway as the information should still be valid for older versions of GoGear.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I've been looking for a way to enable the bookmarks feature for audio books for my &lt;a href=&quot;http://www.p4c.philips.com/cgi-bin/dcbint/cpindex.pl?ctn=SA1922/02&quot;&gt;Philips GoGear SA1922&lt;/a&gt;. Unfortunately according to &lt;a href=&quot;http://download.p4c.philips.com/sca/sca/090715/090715120513_28515.JPG&quot;&gt;this image&lt;/a&gt;, which is only displayed on the german site, it's not supported for this version though with the latest firmware there's a special menu option for audio books.&lt;br /&gt;&lt;br /&gt;While experimenting I learned a few things along the way that are quite interesting. GoGear uses &lt;a href=&quot;http://www.sqlite.org/&quot;&gt;Sqlite&lt;/a&gt; for managing all meta information of the audio files stored. I use &lt;a href=&quot;http://golb.sourceforge.net/&quot;&gt;golb&lt;/a&gt; to transfer music from my Linux machine which does all the magic of extracting ID3 tags and inserting all data in the database. Normally you would use &lt;code&gt;golb -f _system/media/audio/MyDb&lt;/code&gt; in the root folder of the mounted storage device to scan all files on the device and write it to the database _system/media/audio/MyDb.&lt;br /&gt;&lt;br /&gt;If you want to see or manipulate the data you can use the sqlite client: In the same folder call &lt;code&gt;sqlite _system/media/audio/MyDb&lt;/code&gt;. This will open a client console similar to mysql:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;flo@hank:/media/PHILIPS$ sqlite _system/media/audio/MyDb&lt;br /&gt;SQLite version 2.8.17&lt;br /&gt;Enter &quot;.help&quot; for instructions&lt;br /&gt;sqlite&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;To see the schema information you can issue the .schema command, which display information on all the tables and its indexes:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sqlite&gt; .schema&lt;br /&gt;CREATE TABLE albumTable(         iAlbumId               INTEGER PRIMARY KEY, cAlbumTitle        VARCHAR(100) );&lt;br /&gt;CREATE TABLE artistTable(        iArtistId              INTEGER PRIMARY KEY, cArtistName        VARCHAR(100) );&lt;br /&gt;CREATE TABLE dirTable(           iDirId                 INTEGER PRIMARY KEY, cDirName           VARCHAR(260),iParentDirId       INTEGER );&lt;br /&gt;CREATE TABLE genreTable(         iGenreId               INTEGER PRIMARY KEY, cGenreName         VARCHAR(50) );&lt;br /&gt;CREATE TABLE playlistTable( iPlaylistId INTEGER PRIMARY KEY,cPlaylistName       VARCHAR(100), cFileName         VARCHAR(260),iDirId                     INTEGER );&lt;br /&gt;CREATE TABLE playsongTable( iPlaysongId INTEGER PRIMARY KEY,iPlaylistId INTEGER, iOrderNr       INTEGER,iSongId         INTEGER );&lt;br /&gt;CREATE TABLE songTable (         iSongId                INTEGER PRIMARY KEY,cSongTitle  VARCHAR(100),iArtistId          INTEGER,iAlbumId                INTEGER,iTrackNr            INT8,iTrackLength       INT16,iNrPlayed         INT16,cFileName         VARCHAR(260),iDirId                     INTEGER,iYear      INT8,iGenreId            INTEGER,iBitRate       INTEGER,iSampleRate    INTEGER,iFileSize      INTEGER,iMediaType     INTEGER );&lt;br /&gt;CREATE INDEX album_cAlbumTitle ON albumTable (cAlbumTitle);&lt;br /&gt;CREATE INDEX artist_cArtistName ON artistTable (cArtistName);&lt;br /&gt;CREATE INDEX 
dir_cDirName ON dirTable (cDirName);&lt;br /&gt;CREATE INDEX dir_iParentDirId ON dirTable (iParentDirId);&lt;br /&gt;CREATE INDEX genre_cGenreName ON genreTable (cGenreName);&lt;br /&gt;CREATE INDEX playlist_cPlaylistName ON playlistTable (cPlaylistName);&lt;br /&gt;CREATE INDEX playsong_iOrderNr ON playsongTable (iOrderNr);&lt;br /&gt;CREATE INDEX playsong_iPlaylistId ON playsongTable (iPlaylistId);&lt;br /&gt;CREATE INDEX playsong_iSongId ON playsongTable (iSongId);&lt;br /&gt;CREATE INDEX song_cFileName ON songTable (cFileName);&lt;br /&gt;CREATE INDEX song_cSongTitle ON songTable (cSongTitle);&lt;br /&gt;CREATE INDEX song_iAlbumId ON songTable (iAlbumId);&lt;br /&gt;CREATE INDEX song_iArtistId ON songTable (iArtistId);&lt;br /&gt;CREATE INDEX song_iDirId ON songTable (iDirId);&lt;br /&gt;CREATE INDEX song_iGenre ON songTable (iGenreId);&lt;br /&gt;CREATE INDEX song_iTrackNr ON songTable (iTrackNr);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;To see some of the song information you can query the songTable:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sqlite&gt; select * from songTable limit 3;&lt;br /&gt;1|CRE041 Sprachen|1|1|0|4476|0|chaosradio_express_041.mp3|28|2007|1|128|44100|71709702|1&lt;br /&gt;2|Java Posse #331 - Roundup '10 - Modules|2|2|331|3783|0|JavaPosse331.mp3|28|2010|2|96|44100|45460832|1&lt;br /&gt;3|CRE080 Geschichte der Typographie|1|1|0|7947|0|chaosradio_express_080.mp3|28|2008|3|128|44100|127247455|1&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;You can use standard SQL to update the information:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sqlite&gt; update songTable set cSongTitle = &quot;Sprachen - Chaos Radio Express 41&quot; where iSongId = 1;&lt;br /&gt;&lt;/code&gt; &lt;br /&gt;&lt;i&gt;At this point I was about to hit the publish button for the post. Luckily I checked whether the update had happened at all. Turned on my device: Still the old title. Rebooted the device, deleted and recreated all indexes, inspected the golb source code, found nothing. After a while it struck me: I don't need golb for my GoGear version. This device seems to extract all information from the ID3 tags directly, very likely during startup. Classic fail!&lt;/i&gt;</content>
 </entry>
 
 <entry>
   <title>Podcasts for developers</title>
   <link href="http://blog.florian-hopf.de/2010/12/podcasts-for-developers.html"/>
   <updated>2010-12-31T17:28:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/12/podcasts-for-developers</id>
   <content type="html">As I'm walking to the office every day I've incorporated the habit of listening to podcasts on my way. I'd like to list some of them which might be of interest to other developers.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://javaposse.com/&quot;&gt;The Javaposse&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;An almost legendary Java podcast. Interesting development news is discussed, sometimes briefly and sometimes at length.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://hanselminutes.com/&quot;&gt;Hanselminutes&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;Scott Hanselman is a .Net developer who also likes to look beyond his own ecosystem. There are a lot of topics he discusses that are also relevant and interesting if you're not into Microsoft.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://twit.tv/floss&quot;&gt;FLOSS Weekly&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;A sometimes weekly show where different people from the open source universe are interviewed. I only listen to some episodes that attract my attention but it's always enjoyable.   &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://linuxoutlaws.com/&quot;&gt;Linux-Outlaws&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;The two hosts rant about everything Linux and open source. It's always a lot of fun but the episodes tend to get quite long for a weekly show.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://se-radio.net/&quot;&gt;SE-Radio&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;Mainly an interview podcast on everything software development. A lot of high-class guests have been featured so far.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://webdevradio.com/&quot;&gt;WebDevRadio&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;Covers web development topics using different server-side technologies, from ASP to Grails.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://improvingpodcasts.com/&quot;&gt;Improving Podcasts&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;A company podcast with some development episodes and a lot of agile topics. I get the feeling that it's slowly dying as only a few new episodes are produced.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://blog.stackoverflow.com/category/podcasts/&quot;&gt;Stackoverflow&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;The discontinued podcast of Jeff Atwood and Joel Spolsky. A lot of episodes are still worth listening to.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;German Podcasts&lt;/h2&gt;&lt;br /&gt;Of course there are also interesting German podcasts.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://www.heise.de/developer/podcast/&quot;&gt;Software-ArchitektTOUR&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;Some episodes are quite interesting and entertaining, but I have also been annoyed by some of the content. I think it depends a bit on the respective hosts whether an episode turns out to be interesting for me or not.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;&lt;a href=&quot;http://chaosradio.ccc.de/chaosradio_express.html&quot;&gt;Chaosradio Express&lt;/a&gt;&lt;/h3&gt;&lt;br /&gt;The interview podcast of the Chaos Computer Club. It's not all about IT: if you want to know, for example, how the universe is structured or get an insight into life in the former GDR, this is also the right place for you.</content>
 </entry>
 
 <entry>
   <title>Slides and sample code for Solr talk</title>
   <link href="http://blog.florian-hopf.de/2010/11/slides-and-sample-code-for-solr-talk.html"/>
   <updated>2010-11-21T03:03:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/11/slides-and-sample-code-for-solr-talk</id>
   <content type="html">I just uploaded the slides and sample code for my talk about &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Apache Solr&lt;/a&gt; at the &lt;a href=&quot;http://jug-karlsruhe.mixxt.de/networks/events/show_event.25567&quot;&gt;Java User Group Karlsruhe&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The sample consists of an example Solr configuration and some scripts that can be used to index files in Solr. A simple Spring MVC app can be used to search the content. Go to &lt;a href=&quot;https://github.com/fhopf/solr-sample-jugka&quot;&gt;Github&lt;/a&gt; and grab the application as well as the &lt;a href=&quot;https://github.com/fhopf/solr-sample-jugka/raw/master/solr.pdf&quot;&gt;slides&lt;/a&gt; if you're interested.</content>
 </entry>
 
 <entry>
   <title>Don't try to do it all</title>
   <link href="http://blog.florian-hopf.de/2010/11/dont-try-to-do-it-all.html"/>
   <updated>2010-11-07T14:22:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/11/dont-try-to-do-it-all</id>
   <content type="html">For several years I have been running a web site for &lt;a href=&quot;http://hans-hopf.de/&quot;&gt;my father&lt;/a&gt; which contains some information about his work. It started out as HTML that was served statically; at the beginning of 2007 I migrated the content to &lt;a href=&quot;http://opencms.org&quot;&gt;OpenCms&lt;/a&gt;. During this time I was very eager to learn everything that was related to building a web site from scratch, from the internal workings of OpenCms to Apache configuration and CSS and its peculiarities.&lt;br /&gt;&lt;br /&gt;Not that any of these areas are unimportant to know for somebody doing web development professionally. But for spare-time projects you have to be careful that you do not run out of time, which is exactly what happened to me. Especially for the styling of the page I did not invest as much time as I would have needed. I started from plain HTML and did all of the styling myself. I read a lot about the box model and how to apply CSS correctly. Unfortunately only the basic structure was finished before I ran out of time so I published the page with nearly no styling for the content. And that's how it stayed for 3 1/2 years ...&lt;br /&gt;&lt;br /&gt;Recently I invested some time in it again. I ported the layout to use the &lt;a href=&quot;http://www.yaml.de/&quot;&gt;YAML framework&lt;/a&gt; with its standard design and reorganized the structure of the content. Also, while at it, I updated from OpenCms 7.0.2 to OpenCms 7.5.2. In nearly no time the site was in a much better state than before without having to fight several browsers.&lt;br /&gt;&lt;br /&gt;The main things I should have done from the beginning:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Use a CSS framework or a ready-made template unless you're willing to invest a lot of time&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Take extra care when structuring the content in OpenCms. Do not nest too deeply if it's not necessary.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Use a source code versioning system even if you're not planning to have a huge project&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Things that worked well during the relaunch:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://groovy.codehaus.org/Processing+XML&quot;&gt;Groovy&lt;/a&gt; is a nice tool to migrate content from one schema type to the other&lt;/li&gt;&lt;br /&gt;&lt;li&gt;YAML and its &lt;a href=&quot;http://builder.yaml.de/&quot;&gt;YAML-Builder&lt;/a&gt; are really easy to use&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Apache is configured in no time using &lt;a href=&quot;http://www.sebastian.himberger.de/blog/2009/03/11/opencms-apache-integration-the-simplest-solution/&quot;&gt;Sebastian's excellent tutorial&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;There's still a lot of work to do: Some pages still use tables for layout, which doesn't work well anymore when using YAML as it highlights cells when hovering. Also, I'm planning to add a contact form using Ruby on Rails. 
Of course this could be easily done using &lt;a href=&quot;http://www.alkacon.com/en/modules/downloads/alkacon_oamp_webform_1.2.0.html&quot;&gt;Alkacon Webform&lt;/a&gt; but that's a good opportunity to learn a new technology.&lt;br /&gt;&lt;br /&gt;At least SEO-wise the site seems to be OK: We are &lt;a href=&quot;http://www.google.de/search?q=hans+hopf&quot;&gt;beating&lt;/a&gt; the famous &lt;a href=&quot;http://de.wikipedia.org/wiki/Hans_Hopf&quot;&gt;singer&lt;/a&gt; as well as the &lt;a href=&quot;http://www.hopfweisse.de/&quot;&gt;Bavarian wheat beer company&lt;/a&gt; with the same name :)</content>
 </entry>
 
 <entry>
   <title>Using OpenCmsTestCase</title>
   <link href="http://blog.florian-hopf.de/2010/04/using-opencmstestcase.html"/>
   <updated>2010-04-14T06:32:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/04/using-opencmstestcase</id>
   <content type="html">These are basically some notes for me because I just had to relearn all of it.&lt;br /&gt;&lt;br /&gt;To use OpenCmsTestCase in a project the following steps have to be applied:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Download the source distribution and unzip it&lt;/li&gt;&lt;br /&gt;&lt;li&gt;set the file encoding: &lt;i&gt;export ANT_OPTS=-Dfile.encoding=iso-8859-1&lt;/i&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;run &lt;i&gt;ant bindist&lt;/i&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;run &lt;i&gt;ant compile-tests&lt;/i&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Somehow some unittests are always not compiling for me: remove the java files and the entries from the TestSuites&lt;/li&gt;&lt;br /&gt;&lt;li&gt;create a jar file from the folder &lt;i&gt;org&lt;/i&gt; in &lt;i&gt;../BuildCms/build/test&lt;/i&gt;, e.g. &lt;i&gt;jar -cf opencms-test-7.5.2.jar org&lt;/i&gt;&lt;/li&gt; &lt;br /&gt;&lt;li&gt;add the jar to your project classpath/deploy to your maven repository&lt;/li&gt;&lt;br /&gt;&lt;li&gt;add hsqldb.jar to your project&lt;/li&gt;&lt;br /&gt;&lt;li&gt;copy the folders &lt;i&gt;data&lt;/i&gt; and &lt;i&gt;webapp&lt;/i&gt; to your project&lt;/li&gt;&lt;br /&gt;&lt;li&gt;copy &lt;i&gt;test/log4j.properties&lt;/i&gt; and &lt;i&gt;test/test.properties&lt;/i&gt; to your test classpath and adjust the directory paths in &lt;i&gt;test.properties&lt;/i&gt; (a good reason to use Maven so you can use the resource filtering mechanism)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;play with the files in &lt;i&gt;data/imports&lt;/i&gt; and adjust them to your needs&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;A simple test case example:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;import org.opencms.file.CmsObject;&lt;br /&gt;import org.opencms.file.CmsResource;&lt;br /&gt;import org.opencms.test.OpenCmsTestCase;&lt;br /&gt;import org.opencms.test.OpenCmsTestProperties;&lt;br /&gt;&lt;br /&gt;public class DummyOpenCmsTest extends OpenCmsTestCase {&lt;br /&gt;&lt;br /&gt;    static {&lt;br /&gt;        OpenCmsTestProperties.initialize(OpenCmsTestProperties.getResourcePathFromClassloader(&quot;test.properties&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public DummyOpenCmsTest(String name) {&lt;br /&gt;        super(name);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public void setUp() throws Exception {&lt;br /&gt;        super.setUp();&lt;br /&gt;        setupOpenCms(&quot;simpletest&quot;, &quot;/sites/default/&quot;);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public void testExistingResource() throws Exception {&lt;br /&gt;        CmsObject cms = getCmsObject();&lt;br /&gt;        CmsResource res = cms.readResource(&quot;/index.html&quot;);&lt;br /&gt;        assertEquals(&quot;/sites/default/index.html&quot;, res.getRootPath());&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public void tearDown() throws Exception {&lt;br /&gt;        super.tearDown();&lt;br /&gt;        removeOpenCms();&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Unicode is not UTF-8</title>
   <link href="http://blog.florian-hopf.de/2010/04/unicode-is-not-utf-8.html"/>
   <updated>2010-04-10T19:18:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/04/unicode-is-not-utf-8</id>
   <content type="html">Problems with encoding are common on a lot of projects I worked on. Sometimes I tend to get the feeling that I understand most of it but then there are always aspects I did not get right. This week I noticed that even my basic knowledge is not really firm.  &lt;br /&gt;&lt;br /&gt;Currently I am working on a system where we do a lot of imports from other systems that provide data as XML. The company that delivers the data sent us some sample data that we tried to import. The XML document was supposed to be in &lt;a href=&quot;http://en.wikipedia.org/wiki/UTF-8&quot;&gt;UTF-8&lt;/a&gt; but somehow our parser always choked on some byte sequences. When we added iso-8859-1 to the xml prolog the parsing was working fine but all non-ASCII characters where not displayed correctly.&lt;br /&gt;&lt;br /&gt;Using hexedit I looked at the document and located the values for non-ASCII characters like 'ä' which is displayed as 'C3 A4' in hex. But looking it up in the &lt;a href=&quot;http://unicode.org/charts/PDF/U0080.pdf&quot;&gt;Unicode code chart&lt;/a&gt; it should be the value '00 E4'. We complained that the data seemed to be send in a different encoding but not UTF-8.&lt;br /&gt;&lt;br /&gt;Of course the company could not find any problem because we were just wrong. Unicode is not UTF-8! UTF-8 is an encoding scheme which is used to encode unicode characters but the byte values do not match. &lt;br /&gt;&lt;br /&gt;Let's analyze the example character 'ä' in a UTF-8 document. It displays as 'C3 A4'. In binary format this is:&lt;br /&gt;&lt;pre&gt;1100 0011 1010 0100&lt;/pre&gt;&lt;br /&gt;UTF-8 uses a start byte and one or more continuation bytes. A start byte is identified by two leading '11' which makes the first byte our start byte. Continuation bytes are identified by a leading '10', so the second byte is a continuation byte. These are the control bits that are used by UTF-8. Let's see what our sequence looks like if we just remove these control bits, shift the bits together and pad the left side with 0:&lt;br /&gt;&lt;pre&gt;0000 0000 1110 0100&lt;/pre&gt; &lt;br /&gt;Of course this is the expected Unicode value '00 E4' for 'ä'.&lt;br /&gt;&lt;br /&gt;Very basic, but I still managed to get it wrong. &lt;br /&gt;&lt;br /&gt;Later a colleague noticed that in some part of our application a String was created from a byte array without specifying an explicit encoding. Ouch! Finally it was fixed quickly but we should have looked at our code first before blaming the data provider.</content>
 </entry>
 
 <entry>
   <title>Playing with Groovy</title>
   <link href="http://blog.florian-hopf.de/2010/04/playing-with-groovy.html"/>
   <updated>2010-04-09T15:12:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/04/playing-with-groovy</id>
   <content type="html">As I am the one who does most of our &lt;a href=&quot;http://www.opencms.org/&quot;&gt;OpenCms&lt;/a&gt; projects I am also the one who has to deploy new versions to our internal &lt;a href=&quot;http://nexus.sonatype.org/&quot;&gt;Nexus&lt;/a&gt; repository so we can easily use the libraries from &lt;a href=&quot;http://maven.apache.org/&quot;&gt;Maven&lt;/a&gt;. &lt;a href=&quot;http://www.alkacon.com/&quot;&gt;The guys developing OpenCms&lt;/a&gt; use Ant for their builds so there is no official Maven repository available.&lt;br /&gt;&lt;br /&gt;Most of the time I added only the dependencies that are really necessary on compile time because creating a Maven POM by hand is quite cumbersome, not to mention the deployment of all the dependencies (either uploading to Nexus or deploying using Maven). One of those time consuming tasks that needs some automation.&lt;br /&gt;&lt;br /&gt;I chose &lt;a href=&quot;http://groovy.codehaus.org/&quot;&gt;Groovy&lt;/a&gt; for implementing a little helper script because it is really good at dealing with XML and it's always good to learn some new techniques. I already got some experience in modifying existing scripts but did not use it for creating something from scratch.&lt;br /&gt;&lt;br /&gt;The script basically just reads a folder with jars and creates a pom for the whole project as well as a script for deploying all additional artifacts using Maven. Creating the files currently involves two steps:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;A properties file is created from the information that is guessed from the filename of the jars. As jars are often not named consistent this will not succeed for all jars. So you have to review the file and change some group names, artifactIds and versions (another step that could be automated, e.g. by querying a Nexus instance, but let's save some work for the future ;) ). Another properties file holds project and deployment information like project groupId and artifactId and the server to deploy to.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The properties files are read again by another script and the project pom as well as the deployment script is generated.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;Of course you also have to review the deployment script because you don't want to deploy any artifacts that are already available. A good way to find out which artifacts are missing is to call something like {{{mvn compile}}} on the generated pom.&lt;br /&gt;I uploaded the &lt;a href=&quot;http://synyx.org/system/galleries/download/external/DeploymentHelper.zip&quot;&gt;(uncommented) scripts and helper classes&lt;/a&gt;, maybe it's useful for somebody.&lt;br /&gt;&lt;br /&gt;I guess there are far better solutions to creating maven projects from existing libraries, I am looking forward to hearing about them.&lt;br /&gt;&lt;br /&gt;The XML manipulation features of Groovy are really nice. When writing XML you append your node structure to a builder object and it creates the markup for you. You always stay very close to the format you intend to output. E.g. 
this is the code to create the XML for a pom file:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;def writer = new StringWriter();&lt;br /&gt;def xmlBuilder = new MarkupBuilder(writer);&lt;br /&gt;&lt;br /&gt;xmlBuilder.project('xmlns' : 'http://maven.apache.org/POM/4.0.0',&lt;br /&gt;  'xmlns:xsi' : 'http://www.w3.org/2001/XMLSchema-instance',&lt;br /&gt;  'xsi:schemaLocation' : 'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd') {&lt;br /&gt;  modelVersion('4.0.0')&lt;br /&gt;  groupId(project.group)&lt;br /&gt;  artifactId(project.artifact)&lt;br /&gt;  packaging('jar')&lt;br /&gt;  version(project.version)&lt;br /&gt;  name(project.artifact)&lt;br /&gt;  dependencies() {&lt;br /&gt;      dependencies.each() { dep -&gt;&lt;br /&gt;          dependency {&lt;br /&gt;              groupId(dep.group)&lt;br /&gt;              artifactId(dep.artifact)&lt;br /&gt;              version(dep.version)&lt;br /&gt;          }&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You are looking at this code and immediately can imagine the structure of the resulting XML. I like it!&lt;br /&gt;&lt;br /&gt;Also, file manipulation is really nice. No need to do any resource cleanup. This code is responsible for writing the created pom file:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;new File(dir + &quot;pom.xml&quot;).withWriter() { out -&gt;&lt;br /&gt;   out.println(projectPom.toString());&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Some drawbacks I noticed when doing scripting like this:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;I tend to get sloppy while coding. Not adding semicolons to the end of lines, doing too many things in one class/script, ...&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Code completion in Netbeans is horrible if you are used to Java standards, but I guess that is just very hard to implement&lt;/li&gt;&lt;br /&gt;&lt;li&gt;You have to compile manually, at least in Netbeans. If you are changing a Groovy class that is used by a script you have to remember to do a build because when running the script Netbeans will not compile it for you.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A lot of coding errors are only discovered on runtime. E.g. using the wrong name for properties or calling a constructor that doesn't exist&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;I still can't imagine using a dynamic language on production code. Of course the deployment time tends to be shorter but still you have to execute the code to see if it is really correct. I doubt that writing tests could compensate for the lack of static type checking but probably this needs a shift of the mindset.</content>
 </entry>
 
 <entry>
   <title>Fun with TreeMap</title>
   <link href="http://blog.florian-hopf.de/2010/03/fun-with-treemap.html"/>
   <updated>2010-03-15T15:45:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2010/03/fun-with-treemap</id>
   <content type="html">&lt;a href=&quot;http://api.synyx.de/j2sdk6/api/java/util/TreeMap.html&quot;&gt;TreeMap&lt;/a&gt; is the only implementation of a &lt;a href=&quot;http://api.synyx.de/j2sdk6/api/java/util/SortedMap.html&quot;&gt;SortedMap&lt;/a&gt; that is included in the JDK. It stores its values in a &lt;a href=&quot;http://en.wikipedia.org/wiki/Red_black_tree&quot;&gt;Red-black tree&lt;/a&gt;, therefore its entries can be accessed, inserted and deleted really fast. &lt;br /&gt;&lt;br /&gt;Unfortunately, using the class is not really intuitive, so you better read the API docs carefully or prepare to spend hours of your time chasing mysterious bugs.&lt;br /&gt;&lt;br /&gt;Imagine that you want to use a TreeMap with String keys and String values that is ordered according to the length of the key in a way so that the longest key comes first. Should be quite simple, just implement a comparator that does the work for you:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;import java.util.Comparator;&lt;br /&gt;&lt;br /&gt;public class StringLengthComparator implements Comparator&amp;lt;String&amp;gt; {&lt;br /&gt;&lt;br /&gt;    public int compare(String o1, String o2) {&lt;br /&gt;        // if both lengths are the same return 0&lt;br /&gt;        return o2.length() - o1.length();&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt; &lt;br /&gt;If both Strings have the same length we return 0 and a negative or positive value otherwise. As I always get confused when to return a positive value and when a negative let's create a simple test:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    Map&amp;lt;String, String&amp;gt; map;&lt;br /&gt;    &lt;br /&gt;    @Before&lt;br /&gt;    public void setUp() {&lt;br /&gt;        map = new TreeMap&amp;lt;String, String&amp;gt;(new StringLengthComparator());&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    @Test&lt;br /&gt;    public void testValuesWithDifferentLength() {&lt;br /&gt;&lt;br /&gt;        map.put(&quot;zero&quot;, &quot;0&quot;);&lt;br /&gt;        map.put(&quot;one&quot;, &quot;1&quot;);&lt;br /&gt;        map.put(&quot;three&quot;, &quot;3&quot;);&lt;br /&gt;&lt;br /&gt;        assertThat(&quot;Length failed&quot;, map.size(), is(3));&lt;br /&gt;&lt;br /&gt;        Iterator&amp;lt;String&amp;gt; it = map.keySet().iterator();&lt;br /&gt;        assertThat(it.next(), is(&quot;three&quot;));&lt;br /&gt;        assertThat(it.next(), is(&quot;zero&quot;));&lt;br /&gt;        assertThat(it.next(), is(&quot;one&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;  &lt;br /&gt;If we run this test everything seems to be ok;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Testsuite: StringLengthComparatorTest&lt;br /&gt;Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0,018 sec&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;But what happens if we just insert the missing two? 
Let's see:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void testDifferentValuesWithSameLength() {&lt;br /&gt;&lt;br /&gt;        map.put(&quot;zero&quot;, &quot;0&quot;);&lt;br /&gt;        map.put(&quot;one&quot;, &quot;1&quot;);&lt;br /&gt;        map.put(&quot;two&quot;, &quot;2&quot;);&lt;br /&gt;        map.put(&quot;three&quot;, &quot;3&quot;);&lt;br /&gt;&lt;br /&gt;        assertThat(&quot;Length failed&quot;, map.size(), is(4));&lt;br /&gt;&lt;br /&gt;        Iterator&amp;lt;String&amp;gt; it = map.keySet().iterator();&lt;br /&gt;        // just check the first two values as we do not know the order of keys with the same length&lt;br /&gt;        assertThat(it.next(), is(&quot;three&quot;));&lt;br /&gt;        assertThat(it.next(), is(&quot;zero&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This test should also pass; we just can't tell the ordering of keys with the same length:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Testcase: testDifferentValuesWithSameLength(StringLengthComparatorTest):        FAILED&lt;br /&gt;Length failed&lt;br /&gt;Expected: is &lt;4&gt;&lt;br /&gt;     got: &lt;3&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Oops ... what happened here? One of the values just disappeared? We are inserting 4 entries but there are only 3 in the Map?&lt;br /&gt;&lt;br /&gt;The first time I stumbled across this it took me several hours to figure out what was going on. Finding out what the problem was should have taught me to always read the Javadoc carefully. The contract for TreeMap states the following:&lt;br /&gt;&lt;blockquote&gt;Note that the ordering maintained by a sorted map (whether or not an explicit comparator is provided) must be consistent with equals if this sorted map is to correctly implement the Map interface.&lt;/blockquote&gt;&lt;br /&gt;What exactly does this mean? &lt;a href=&quot;http://api.synyx.de/j2sdk6/api/java/util/Comparator.html&quot;&gt;Comparator&lt;/a&gt; tells us more about being consistent with equals:&lt;br /&gt;&lt;blockquote&gt;The ordering imposed by a comparator c on a set of elements S is said to be consistent with equals if and only if c.compare(e1, e2)==0 has the same boolean value as e1.equals(e2) for every e1 and e2 in S.&lt;/blockquote&gt;&lt;br /&gt;Normally, the Map and Set interfaces use equals to determine if a certain key (for a Map) or a certain value (for a Set) is already present. But for TreeMap and TreeSet this is not the case. These implementations use compareTo or the supplied Comparator to determine whether one object is equal to another. This means you are not allowed to return 0 from a compare method if the two objects are not equal!&lt;br /&gt;&lt;br /&gt;How to fix it? Just test if the two Strings are equal before comparing the length in the Comparator:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    public int compare(String o1, String o2) {&lt;br /&gt;&lt;br /&gt;        // return 0 only if both are equal&lt;br /&gt;        if (o1.equals(o2)) {&lt;br /&gt;            return 0;&lt;br /&gt;        } else if (o1.length() &gt; o2.length()) {&lt;br /&gt;            return -1;&lt;br /&gt;        } else {&lt;br /&gt;            return 1;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Alright, the test passes. 
Just to be sure we add another test:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void testYetMoreValues() {&lt;br /&gt;&lt;br /&gt;        map.put(&quot;zero&quot;, &quot;0&quot;);&lt;br /&gt;        map.put(&quot;one&quot;, &quot;1&quot;);&lt;br /&gt;        map.put(&quot;four&quot;, &quot;4&quot;);&lt;br /&gt;&lt;br /&gt;        assertThat(&quot;Length failed&quot;, map.size(), is(3));&lt;br /&gt;&lt;br /&gt;        assertNotNull(&quot;Zero's not there&quot;, map.get(&quot;zero&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Seems to be a stupid test but let's see what happens:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Testsuite: StringLengthComparatorTest&lt;br /&gt;Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0,027 sec&lt;br /&gt;&lt;br /&gt;Testcase: testYetMoreValues(StringLengthComparatorTest):        FAILED&lt;br /&gt;Zero's not there&lt;br /&gt;junit.framework.AssertionFailedError: Zero's not there&lt;br /&gt;        at StringLengthComparatorTest.testYetMoreValues(StringLengthComparatorTest.java:51)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;How can this happen? We tested for equality but the &quot;zero&quot; entry is still not there? Let's see what the tree looks like at every step.&lt;br /&gt;&lt;br /&gt;Still simple when the first value (&quot;zero&quot;) is inserted: the entry becomes the root node:&lt;br /&gt;&lt;a href=&quot;http://3.bp.blogspot.com/_XftA4GBHqKs/S51uhwesJPI/AAAAAAAAAAM/bd-8jJBcxeA/s1600-h/tree1.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 192px; height: 128px;&quot; src=&quot;http://3.bp.blogspot.com/_XftA4GBHqKs/S51uhwesJPI/AAAAAAAAAAM/bd-8jJBcxeA/s320/tree1.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5448632650416530674&quot; /&gt;&lt;/a&gt;&lt;br /&gt;When inserting &quot;one&quot; afterwards, the Comparator tells us that the result of the comparison is 1, which means that the node has to be inserted on the right side:&lt;br /&gt;&lt;a href=&quot;http://2.bp.blogspot.com/_XftA4GBHqKs/S51xiAUmLNI/AAAAAAAAAAU/6JZCMElkpWI/s1600-h/tree2.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align: center; cursor:pointer; cursor:hand;width: 265px; height: 153px;&quot; src=&quot;http://2.bp.blogspot.com/_XftA4GBHqKs/S51xiAUmLNI/AAAAAAAAAAU/6JZCMElkpWI/s320/tree2.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5448635953204047058&quot; /&gt;&lt;/a&gt;&lt;br /&gt;When inserting &quot;four&quot;, the tree looks a little bit different. As our Comparator in this case also returns 1, the node is inserted to the right of &quot;zero&quot; but needs to be to the left of &quot;one&quot;. After rebalancing, &quot;four&quot; is our new root node:&lt;br /&gt;&lt;a href=&quot;http://1.bp.blogspot.com/_XftA4GBHqKs/S51xtU7sTzI/AAAAAAAAAAc/9nbjnynNA30/s1600-h/tree3.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 282px; height: 143px;&quot; src=&quot;http://1.bp.blogspot.com/_XftA4GBHqKs/S51xtU7sTzI/AAAAAAAAAAc/9nbjnynNA30/s320/tree3.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5448636147715297074&quot; /&gt;&lt;/a&gt;&lt;br /&gt;Seems to look ok. 
When reading from left to right the nodes are sorted according to their length. But let's see what happens when we try to look up &quot;zero&quot; based on the tree above. We start with the root node &quot;four&quot;. As the length of &quot;zero&quot; and &quot;four&quot; is the same, our Comparator returns 1, so the lookup continues on the right side, where it only finds the &quot;one&quot; node. Comparing &quot;zero&quot; to &quot;one&quot; returns -1, so the lookup would continue to the left of &quot;one&quot;. Nothing there, so null is returned.&lt;br /&gt;&lt;br /&gt;The Javadoc for Comparator tells us more about what went wrong here:&lt;br /&gt;&lt;blockquote&gt;The implementor must ensure that sgn(compare(x, y)) == -sgn(compare(y, x)) for all x and y.&lt;/blockquote&gt;&lt;br /&gt;This means that the method needs to be implemented symmetrically. When calling compare(&quot;zero&quot;, &quot;one&quot;) we always need to return the negated value of calling compare(&quot;one&quot;, &quot;zero&quot;). With our implementation this is not the case as we always return 1 as a fallback.&lt;br /&gt;&lt;br /&gt;I hope this is the final implementation of the Comparator, delegating to String's compareTo() method, which is already implemented symmetrically:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    public int compare(String o1, String o2) {&lt;br /&gt;&lt;br /&gt;        // return 0 only if both are equal&lt;br /&gt;        if (o1.equals(o2)) {&lt;br /&gt;            return 0;&lt;br /&gt;        } else if (o1.length() == o2.length()) {&lt;br /&gt;            // delegate to String's compareTo to get symmetric behavior&lt;br /&gt;            return o1.compareTo(o2);&lt;br /&gt;        } else if (o1.length() &gt; o2.length()) {&lt;br /&gt;            return -1;&lt;br /&gt;        } else {&lt;br /&gt;            return 1;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I struggled with both of these problems on two different projects and it always took me some hours. Thanks to my colleague Marc who led me to the solution of the symmetry problem.
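&lt;br /&gt;&lt;br /&gt;Just to convince myself, here is a quick sketch of a test combining both earlier scenarios, assuming the map in setUp() is now created with this final version of the Comparator:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    @Test&lt;br /&gt;    public void testAllValuesWithFinalComparator() {&lt;br /&gt;&lt;br /&gt;        map.put(&quot;zero&quot;, &quot;0&quot;);&lt;br /&gt;        map.put(&quot;one&quot;, &quot;1&quot;);&lt;br /&gt;        map.put(&quot;two&quot;, &quot;2&quot;);&lt;br /&gt;        map.put(&quot;three&quot;, &quot;3&quot;);&lt;br /&gt;        map.put(&quot;four&quot;, &quot;4&quot;);&lt;br /&gt;&lt;br /&gt;        // no entry disappears anymore ...&lt;br /&gt;        assertThat(&quot;Length failed&quot;, map.size(), is(5));&lt;br /&gt;&lt;br /&gt;        // ... and the lookup finds &quot;zero&quot; again&lt;br /&gt;        assertNotNull(&quot;Zero's not there&quot;, map.get(&quot;zero&quot;));&lt;br /&gt;&lt;br /&gt;        // the longest key still comes first&lt;br /&gt;        assertThat(map.keySet().iterator().next(), is(&quot;three&quot;));&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;</content>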
 </entry>
 
 <entry>
   <title>Implementing an API abstraction using Google Collections</title>
   <link href="http://blog.florian-hopf.de/2009/11/implementing-api-abstraction-using.html"/>
   <updated>2009-11-19T20:17:00+08:00</updated>
   <id>http://blog.florian-hopf.de//blog/2009/11/implementing-api-abstraction-using</id>
   <content type="html">I am using &lt;a href=&quot;http://opencms.org/&quot;&gt;OpenCms&lt;/a&gt;, the open source content management system, quite often. It is a good choice for building structured medium to large sized websites. One drawback though sometimes is the layout of the APIs. E.g. for accessing the resources managed in the system programatically you have to use some final and nonfinal classes that depend on a running instance of OpenCms. The original developers very likely chose this approach for security reasons but this can become quite cumbersome, e.g. when dealing with tests, as you cannot mock some of these classes easily. There is an integration test facility that comes with OpenCms but as this starts an OpenCms instance every time a test is executed, running tests takes some time. I tend to write tests that work without a running system whenever possible.&lt;br /&gt;&lt;br /&gt;To improve testability of my components I implemented a thin layer above the normal OpenCms access means using some interfaces and simple POJOs so that my business logic can be tested without starting OpenCms. One interface and its default implementation act as a kind of DAO for accessing the virtual filesystem of OpenCms. Its method signatures do not contain any OpenCms dependencies that can't be mocked or reconstructed easily. E.g. to represent file resources the OpenCms class normally used is &lt;code&gt;CmsResource&lt;/code&gt;. This class is quite difficult to instanciate outside of a running OpenCms instance as it contains internal references to different database tables. To reduce the need for mocking these external classes I implemented a simple POJO, &lt;code&gt;Resource&lt;/code&gt;, that contains relevant information like the path to the resource and it's type.&lt;br /&gt;&lt;br /&gt;Some methods of my VFS DAO return a &lt;code&gt;Collection&lt;/code&gt; of &lt;code&gt;Resource&lt;/code&gt;s, e.g. when reading all resources in a subfolder. As the OpenCms API returns an untyped &lt;code&gt;List&lt;/code&gt; that contains &lt;code&gt;CmsResource&lt;/code&gt;s and in my interface method signature I use &lt;code&gt;List&amp;lt;? extends Resource&amp;gt;&lt;/code&gt; some transformation needs to take place. In the first project I used the abstraction I implemented it in a really simple way:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;List&amp;lt;Resource&amp;gt; resources = new ArrayList&amp;lt;Resource&amp;gt;();&lt;br /&gt;@SupressWarnings(&quot;Unchecked&quot;)&lt;br /&gt;List&amp;lt;CmsResources&amp;gt; cmsResources = cms.readResources(...);&lt;br /&gt;for (CmsResource cmsResource: cmsResources) {&lt;br /&gt; resources.add(transform(cmsResource);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The transform method just creates an instance of the &lt;code&gt;Resource&lt;/code&gt; and fills it with the needed values.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;private Resource transform(CmsResource cmsResource) {&lt;br /&gt; Resource resource = new Resource();&lt;br /&gt; resource.setDateLastModified(cmsResource.getDateLastModified());&lt;br /&gt; ...&lt;br /&gt; return resource;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This approach works and in my opinion is ok to use in many circumstances. I sacrificed some performance for a gain in testability and design. 
But for large collections or operations that are triggered frequently this of course can become a performance issue as for the sake of abstraction it is necessary to iterate the collection.&lt;br /&gt;&lt;br /&gt;The better solution is to use a lazy list that transforms the &lt;code&gt;CmsResource&lt;/code&gt;s on the fly to &lt;code&gt;Resource&lt;/code&gt;s. With a lazy list you don't have to iterate it when transforming. The transformation happens when you are accessing the list.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://code.google.com/p/google-collections/&quot;&gt;Google Collections&lt;/a&gt; provides a functional style approach for transforming lists lazily. You create a class that implements the interface &lt;code&gt;Function&lt;/code&gt; that can be typed for the source and target. In its &lt;code&gt;apply&lt;/code&gt; method the transformation step is implemented in basically the same way as in the method displayed above.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;public class ResourceTransformationFunction implements Function&lt;cmsresource, resource=&quot;&quot;&gt; {&lt;br /&gt;&lt;br /&gt;   public Resource apply(CmsResource cmsResource) {&lt;br /&gt;       Resource resource = new Resource();&lt;br /&gt;       resource.setDateCreated(cmsResource.getDateCreated());&lt;br /&gt;       resource.setDateLastModified(cmsResource.getDateLastModified());&lt;br /&gt;       ...&lt;br /&gt;       return resource;&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The original &lt;code&gt;List&lt;/code&gt; is transfomed using a static method call that accepts an instance of our &lt;code&gt;Function&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;@SuppressWarnings(&quot;unchecked&quot;)&lt;br /&gt;List&amp;lt;CmsResource&amp;gt; cmsResources = cms.readResources(...);&lt;br /&gt;List&amp;lt;Resource&amp;gt; resources =  Lists.transform(cmsResources, new ResourceTransformationFunction());&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The transformation happens when the &lt;code&gt;List&lt;/code&gt; is accessed so when you are iterating the collection only once, which should be the case in most applications, there is no overhead at all (besides the creation of the new objects).&lt;br /&gt;&lt;br /&gt;I really like the ease of use and reusabilty of the Google Collections solution. Also, the jar comes with absolutely no dependencies which makes it easily embedabble in any project.&lt;br /&gt;&lt;br /&gt;A similiar functional approach will be part of the new concurrency features in JDK 7. &lt;code&gt;ParallelArray&lt;/code&gt;, which makes use of the Fork/Join framework will provide the ability to use functions and predicates when constructing arrays. Brian Goetz' &lt;a href=&quot;http://parleys.com/display/PARLEYS/Home#slide=1;talk=24576007;title=From%20Concurrent%20to%20Parallel&quot;&gt;talk&lt;/a&gt; on Devoxx 2008 contained a detailed introduction to these features.&lt;br /&gt;&lt;br /&gt;On Devoxx 2009, Dick Wall of the &lt;a href=&quot;http://javaposse.com/&quot;&gt;Javaposse&lt;/a&gt; held a really good talk about appliying a more functional style of programming to the Java programming language. This talk will be available some time in the future at &lt;a href=&quot;http://parleys.com&quot;&gt;parleys.com&lt;/a&gt;</content>
 </entry>
 

</feed>
