Shalin Says...

Welcome 2021

Fri, 01 Jan 2021 13:51:57 +0530

I learned important life lessons in the last year. Many of these are inter-related:

Work intentionally - It is too easy to live life on auto-pilot. Be thoughtful about your relationships, work, and learning.
Train the monkey before building the pedestal - We focus our attention on the easy parts, the things we know how to do. Work on hard problems first. In other words, be ambitious and optimize for growth.
Invert, always invert - Write down your anti-goals and either avoid them or ruthlessly de-prioritize time spent on them.
Build a learning loop - Learn, reflect, apply, evaluate. There’s no point reading if you are not learning!
Walk, exercise, sleep - A tested cure for mental blocks.

We have a new start in this new year. I wish you all well. Happy 2021!

What’s cooking in Solr 6

Mon, 14 Dec 2015 21:19:21 +0530

Bangalore just got a new search meetup group started by folks from eBay. This one is a bit more general than our existing Bangalore Solr/Lucene Meetup group which, by the way, has close to 600 members and is going strong!

Today was the first meetup and I presented a session on the upcoming release of Solr 6 which has some juicy new features!

It is called “What’s cooking in Solr 6”. Hope you find it useful!

How to make Apache Solr listen on a specific IP address or host name

Tue, 25 Aug 2015 20:43:44 +0530

Someone asked me how to ensure that Solr is exposed exclusively on a server’s internal IP address so I thought this bit of information would be useful more generally.

On Linux, edit the solr.in.sh file, find the property called SOLR_HOST (it is commented out by default) and set its value to the IP address or host name on which you want Solr to listen for requests.

The procedure is similar on Windows, except that the file to be edited is solr.in.cmd.
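
As a minimal sketch, the edited line in solr.in.sh might look like this; the address 192.168.1.10 is only a placeholder for your server’s internal IP address or host name:

```shell
# solr.in.sh: uncomment SOLR_HOST and set it to the address Solr should use.
# 192.168.1.10 is a placeholder, not a value from the original post.
SOLR_HOST="192.168.1.10"
```

In solr.in.cmd on Windows, the equivalent line would take the form set SOLR_HOST=192.168.1.10.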

Solr auto-detects the IP address of the node on which it is running by default. In case you are curious about how it works behind the scenes, take a look at ZkController.java’s normalizeHostName(String host) method.

Edit (15 September 2015) - It was pointed out to me later that setting SOLR_HOST is not enough because the host/IP set by that property is only used by SolrCloud for making inter-shard requests. We also need to set a property used by Jetty in solr.in.sh or solr.in.cmd:
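
A sketch of that extra setting for solr.in.sh, assuming the Jetty property in question is the jetty.host system property passed through SOLR_OPTS (192.168.1.10 is a placeholder address):

```shell
# solr.in.sh: ask Jetty to bind to one address instead of all interfaces.
# Assumption: the property is jetty.host; 192.168.1.10 is a placeholder.
SOLR_OPTS="$SOLR_OPTS -Djetty.host=192.168.1.10"
```

After a restart, Solr should then accept connections only on that address.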

Testing SolrCloud with Jepsen

Sun, 21 Dec 2014 01:52:31 +0530

So yeah, I wrote a blog post after a long time and no, it’s not this one but a mammoth post over at the Lucidworks blog called “Call me maybe: SolrCloud, Jepsen and flaky networks”.

Some interesting excerpts:

TL;DR;
We tested SolrCloud against bridge, random transitive, and fixed transitive network partitions using Jepsen and found no data loss issues for both compare-and-set operations and inserts. One major and a few minor bugs were found. Most have been fixed in Solr 4.10.2 and others will be fixed soon. We’re working on writing better Jepsen tests for Solr and to integrate them into Solr’s build servers. This is a process and it’s not over yet.

What is Jepsen?

Jepsen is a tool which simulates network partitions and tests how distributed data stores behave under them. Put in a simple way, Jepsen will cut off one or more nodes from talking to other nodes in a cluster while continuing to insert, update or lookup data during the partition as well as after the partition heals to find if they lose data, read inconsistent data or become unavailable.

Why is this important? This is important because networks are unreliable. It’s not just the network; garbage collection pauses in the JVM or heavy activity by your neighbour in a server running on the cloud can also manifest in slowdowns which are virtually indistinguishable from a network partition. Even in the best managed data centers, things go wrong; disks fail, switches malfunction, power supplies get shorted out, RAM modules die and a distributed system that runs on a large scale should strive to work around such issues as much as possible.

We found some bugs as well:

SOLR-6530 – Commits under network partition can put any node into ‘down’ state

SOLR-6583 – Resuming connection with ZooKeeper causes log replay

SOLR-6610 – Slow cluster startup

Where’s the code?

All our changes to Jepsen are available on our Jepsen fork at Github in the solr-jepsen branch

Conclusion?

While Solr may require a little extra work in setting up Zookeeper in the beginning, as you can see by these tests, this allows us to create a significantly more stable environment when it matters most: production. The Solr community made a conscious decision to trade a tiny bit of ease of getting started in exchange for a whole lot of “get finished”. This should result in significantly less data loss and more reliable operations in general.

What’s next?

Integrate Jepsen with Solr’s build servers, get these tests running on each change
Test harder; write more tests against more scenarios/topologies. Break Solr, then fix it again.

It’s been a lot of fun! Until next time.

Shalin Says… turned 4 today! I received this notification...

Wed, 11 Dec 2013 12:04:07 +0530

Shalin Says… turned 4 today!

I received this notification from tumblr today and I figured that it was an appropriate occasion to check in to this oft neglected blog. There are many things to say but too little time to say them.

The biggest item that I am working on right now is SOLR-5308 which allows users to split all documents with a particular route key into their own collection. So if you have a multi-tenant collection then this feature will allow you to migrate a tenant into their own collection seamlessly and without downtime. I’d appreciate help in the form of testing and suggestions on this feature. Please vote or watch the Jira issue if you think this is interesting or useful to you.

First Bangalore Lucene/Solr Meetup Report

Mon, 17 Jun 2013 11:31:00 +0530

The first Bangalore Lucene/Solr meetup was organized on Saturday, 8th June 2013, courtesy of the initiatives of Anshum Gupta and Varun Thacker. Although I joined in as a co-organizer, I honestly did nothing except tweet about it and show up with some slides.

I must say that I was pleasantly surprised at the rate at which the group went from zero to a hundred members (it stands at 132 members as of writing this post). Our initial limit for the venue was fifty but it was increased to seventy-five once the size of the venue was confirmed. Microsoft Accelerator was gracious enough to provide a conference room and refreshments for the attendees. 50+ people showed up, which is pretty good considering that the meetup schedule clashed with some other popular meetups. The agenda was dominated by presentations but quite a bit of time was spent in Q&A.

Vinodh Kumar R (Head of BloomReach India) gave a talk on ranking models (adversarial vs implicit vs real time news) applicable to different kinds of search applications.

Varun Thacker from Unbxd talked about Faceted search and result reordering in Solr focusing on e-commerce applications. He introduced term range facets, multi-select faceting and then delved into reordering documents using function queries and query elevation along with examples and use-cases for each.

Dikchant Sahi presented on using Apache Solr's DataImportHandler to index databases and XML files in Solr. He also gave a live demo of full and delta imports of a sample music data set into Solr.

I gave a presentation on SolrCloud and Shard Splitting, which is something that Anshum Gupta and I have been working on for the past few months.

Here are the slides that I presented:

SolrCloud and Shard Splitting from Shalin Mangar

At the start of my presentation, I took an informal poll of the audience to gauge their interest:

Everyone in attendance was familiar with the projects
Most of the attendees were already using Solr for search
Everyone in attendance was using Solr 3.x and no one was on SolrCloud yet
Almost everyone was evaluating, prototyping or testing SolrCloud

I met Umesh Prasad from Flipkart at the meetup and we chatted quite a bit on Solr’s performance under heavy bulk re-indexing workloads and also about accommodating large elevation and synonym files in SolrCloud. I’m happy to know that Flipkart uses Apache Solr for their excellent search. I also met a couple of search enthusiasts who have used Solr in the past and want to contribute back to the community.

All in all, I think it was a good first step towards establishing a strong Lucene/Solr community in Bangalore. I hope that the next meetup gives more time for one-on-one interactions and focused conversations around search issues. It’d be nice to have more about Lucene in the next meetup. A lot of people inquired about training on Apache Solr so we may organize a workshop for the next meetup.

Drop me a line if you are interested in attending a Solr training in Bangalore. Also, if you’re in Bangalore and interested in Lucene/Solr or search in general, do join the Bangalore Lucene/Solr Meetup group.

Resuming blogging

Tue, 04 Jun 2013 01:12:00 +0530

I realize that it’s been a long time since I posted something here. About three years ago, I started working on projects at AOL, which did not have anything to do with Apache Solr. Increasingly I found myself having nothing to say publicly about my work. Though I did get back to working with Apache Solr for the AOL WebMail team, I didn’t resume blogging due to sheer laziness, I guess. (Yes, they use Solr! In fact, AOL WebMail just upgraded to Apache Solr 4.2.1 with impressive results!)

A lot has happened in the meantime. In December 2012, I joined the impressive team at LucidWorks – the Lucene/Solr company, to work (almost) full-time on open source search. In January, this year, I found my life partner and got married.

Now that I work on Lucene/Solr again, I think it is time to resume blogging regularly. I intend to write about new features, tutorials, tips & tricks and perhaps also explain the internals of Lucene/Solr features in greater depth. Here’s to a new start!

Apache Lucene 3.1.0 and Apache Solr 3.1.0

Fri, 01 Apr 2011 16:24:59 +0530

Apache Lucene 3.1.0 and Apache Solr 3.1.0:

This is the first release bringing Lucene and Solr release versions in sync. There are numerous bug fixes, optimizations and new features. Download from here

My Android phone - Samsung Galaxy S Review

Fri, 06 Aug 2010 02:15:00 +0530

I had been holding out on buying a more internet friendly phone for some time now, waiting for 3G service to start in India. After my iPad experience, it was clear to me that I couldn’t be happy with an iPhone but it was also obvious that none of the available Android phones were good enough.

Enter the new Samsung Galaxy S with Android 2.1, awesome 4" display, light weight and a good (enough) battery life. A little market research showed that Samsung’s Super AMOLED technology was best in class and a better phone, the Samsung Galaxy S2, will only be available next year. With 3G to be introduced (supposedly) around October, the stage was set. So, one fine July evening, I bought the Samsung Galaxy S.

First of all, a message to people who are surprised upon hearing the price - the Samsung Galaxy S is not a phone; it is a pocket sized computer that also happens to be a phone. And what have I been up to with this device? Here’s a list of ten things (in no particular order) that I have used it to:

Get email alerts via K9 Mail, read blogs, check twitter, buzz and facebook feeds
Take photographs and upload them to Facebook (darn you Airtel, I could never get MMS to work properly)
Watch movies on my desktop monitor and phone through VLC Remote and Gmote.
Stream music from GrooveShark - Yay! freedom from syncing my music
Play Asphalt5 while sitting at the back of an autorickshaw - something ironic about it.
Use AIM to answer questions from our QA team while having lunch. Note to self: use a spoon next time if you want to use the phone.
SMS while walking down the road without the fear of banging into an obstacle.
Read a book on the Kindle app.
Navigate to Kolar and back with Google Maps
Impress people with the live wallpapers, gesture search and some not so useful tidbits such as a lie detector and a sky map.

Google integration means that all my contacts are backed up on their servers. I am still surprised that something as simple as backing up contacts has been so hard for phones till now. Another cool feature that I loved was that I could link phone numbers with Facebook contacts. Now when a friend calls, I see his Facebook photo automatically. The device has a decent battery life; with on and off wifi use, it lasts for a little more than a day. The device has a 5 megapixel camera and can record 720p video. And, by the way, Flash sites work too.

Before I bought the phone, I wasn’t sure how easy it would be to use a touch keyboard. Well, it wasn’t too hard with the default keyboard but ever since I switched to the Swype keyboard, I’m insanely fast. I’d recommend Swype to everybody. Good thing that Samsung provides this keyboard application for free.

There are a few downsides though. The phone feels sluggish when more than a couple of applications are running. The “Advanced Task Killer” application is essential for a decent experience. There is no camera flash, so one must depend on an external light source being available to avoid dark photos. Sometimes (and this happens rarely), after moving out of a wifi zone, it won’t automatically switch to a GPRS/EDGE connection unless I restart the phone.

All in all, I’m happy with the device and eagerly waiting for an Android 2.2 (froyo) upgrade and Airtel’s 3G service.

Solr 1.4.1 Released

Tue, 29 Jun 2010 13:57:23 +0530

Solr 1.4.1 Released:

From the mailing list announcement:

Apache Solr 1.4.1 has been released and is now available for public
download!
http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr 1.4.1 is a bug fix release. See the change log for more details.

The iPad experience

Sat, 26 Jun 2010 00:18:00 +0530

Now that I have had the chance to play with the Apple iPad, I thought I’d put down some observations and opinions about the device. I know that I’m late to the party but hey, they don’t sell these devices here in India.

First of all, I love that I do not have to sit before a desk to use this. There is a real need in the world for a device like this and even though many have tried in the past, hats off to Apple for pulling this off. No company other than Apple, with its army of fanboys, could have made this kind of product successful. But now that Apple has, we can all expect that the new shiny competing devices will be able to do at least as much as what the iPad does. So, thank you Apple!

The model that I have tried is the 32GB iPad. The device is slim, the touch experience is intuitive and the display is sharp. It is great as a browsing device. I guess it would have been easier to use if it were a little lighter - which is not too much to expect from later versions. The dimensions, at 9.56 x 7.47 inches, feel a little too big at times. A 7-8 inch display may be handier.

Typing on the touch keyboard takes a little getting used to but practice makes perfect. Still, without my favorite set of browser plugins, I find it hard to share stuff on twitter/buzz. I guess the iPad is designed for people who are mostly consumers of data rather than creators. The device does not have a USB port which is a bummer. A Linux user like me cannot transfer songs or movies from my laptop to the device because one must use iTunes. I had to borrow a friend’s laptop to use Windows(!) so that I could transfer some media onto the device. But to do that, you need to convert your DivX media to a format the iPad can play. The App Store is not open to anyone outside the US so I couldn’t use it.

As a geek, one gets bored with the device very soon. There are just too many things that I want to do with it but am not able to because Apple won’t let me. I am looking forward to the Samsung Android based tablet which is supposed to be coming out in Q3. No one can beat Apple in the wow factor category but I am hoping that the Android devices may allow me to do much more.

[Update] - I used the excellent Handbrake application to convert between media formats and I’d highly recommend it.

[Update 2] - An interesting discussion on Google Buzz on this topic.

Apache Mahout 0.3 Released

Thu, 18 Mar 2010 14:58:29 +0530

Apache Mahout 0.3 has been released. Apache Mahout is a project which attempts to make machine learning both scalable and accessible. It is a sub-project of the excellent Apache Lucene project which provides open source search software.

From the project website:

The Apache Lucene project is pleased to announce the release of Apache Mahout 0.3. Highlights include:

New: math and collections modules based on the high performance Colt library

Faster Frequent Pattern Growth(FPGrowth) using FP-bonsai pruning

Parallel Dirichlet process clustering (model-based clustering algorithm)

Parallel co-occurrence based recommender

Parallel text document to vector conversion using LLR based ngram generation

Parallel Lanczos SVD (Singular Value Decomposition) solver

Shell scripts for easier running of algorithms, utilities and examples

.. and much much more: code cleanup, many bug fixes and performance improvements

Details on what’s included can be found in the release notes. Downloads are available from the Apache Mirrors

Congratulations and thanks to all Mahout developers!

Merging Lucene and Solr

Wed, 17 Mar 2010 01:18:00 +0530

A couple of weeks back, Apache Lucene Committer and PMC member, Michael McCandless started a discussion on factoring out a shared, standalone Analysis package for Lucene, Solr and Nutch. During the discussions, Yonik Seeley, Solr Creator, proposed merging the development of Lucene and Solr. After intense discussions and multiple rounds of voting, the following changes are being put into effect:

Merging the developer mailing lists into a single list.
Merging the set of committers.
When any change is committed (to a module that “belongs to” Solr or to Lucene), all tests must pass.
Release details will be decided by dev community, but, Lucene may release without Solr.
Modularize the sources: pull things out of Lucene’s core (break out query parser, move all core queries & analyzers under their contrib counterparts), pull things out of Solr’s core (analyzers, queries).

The following things do not change:

Besides modularizing (above), the source code would remain factored into separate dirs/modules the way it is now.
Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues).
User’s lists remain separate.
Web sites remain separate.
Release artifacts/jars remain separate.

So what does it mean for Lucene/Solr users? Nothing much, really. Except that you should see tighter co-ordination between Lucene and Solr development. New Lucene features should reach Solr faster and releases should be more frequent. Solr features may also be made available to Lucene users who do not want to set up Solr and use its RESTy APIs.

Solr has already been upgraded to use Lucene trunk (in branches/solr), and that branch should soon become the new Solr trunk. There is talk of re-organizing the source structure to better fit the new model. Things are moving fast!

Personally, I feel that this merge is a good thing for both Lucene and Solr:

Solr users get the latest Lucene improvements faster and releases get streamlined.
Lucene users get access to Solr features such as faceting.
The in-sync trunk allows new features to make their way into the right place (Lucene vs Solr) more easily and duplication is minimized.
Bugs are caught earlier by the huge combined test suite.
More committers means more ideas and hands available to the projects.
Other Lucene based projects can benefit too because many Solr features will be made available through Java APIs.

There are a couple of things to be worked out. For example, we need to decide where the integrated sources should live and whether or not to sync Solr’s version with Lucene’s. All this will take some time but I am confident that our combined community will manage the transition well.

Apache Lucene Java 3.0.1 and 2.9.2 Released

Sat, 27 Feb 2010 15:39:00 +0530

Congratulations to the Apache Lucene team on releasing Lucene Java 3.0.1 and 2.9.2. Both of these are bug fix releases and are backwards compatible with Lucene Java 3.0.0 and 2.9.1 respectively.

From the official announcement:

Hello Lucene users,

On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2:

Both releases fix bugs in the previous versions:

2.9.2 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4

3.0.1 has the same bug fix level but is for the Lucene Java 3.x series, based on Java 5

New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe API.

Important improvements in these releases include:

An increased maximum number of unique terms in each index segment.

Fixed experimental CustomScoreQuery to respect per-segment search. This introduced an API change!

Important fixes to IndexWriter: a commit() thread-safety issue, lost document deletes in near real-time indexing.

Bugfixes for Contrib’s Analyzers package.

Restoration of some public methods that were lost during deprecation removal.

The new Attribute-based TokenStream API now works correctly with different class loaders.

Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0.

See core changes at:

http://lucene.apache.org/java/3_0_1/changes/Changes.html

http://lucene.apache.org/java/2_9_2/changes/Changes.html

and contrib changes at:

http://lucene.apache.org/java/3_0_1/changes/Contrib-Changes.html

http://lucene.apache.org/java/2_9_2/changes/Contrib-Changes.html

Binary and source distributions are available at http://www.apache.org/dyn/closer.cgi/lucene/java/

Lucene artifacts are also available in the Maven2 repository at http://repo1.maven.org/maven2/org/apache/lucene/

Microsoft dropping FAST search for Linux, Unix

Mon, 08 Feb 2010 02:09:47 +0530

According to a blog post from Bjørn Olstad, Microsoft Distinguished Engineer and CTO of FAST, the 2010 products will be the last to have a search core that runs on Linux and UNIX.

Being involved in Apache Solr and the newly formed Lucene Connectors Framework (LCF) project, I’m very interested in the implications. Undoubtedly, at least some FAST customers will not be happy with this decision and will want to switch to something which can still run on Linux/UNIX.

I believe that this is a great opportunity for the Apache Solr/LCF combo. Perhaps the newly proposed Apache Spatial Information Systems (SIS) will help as well. Of course, this is also big news for Lucid Imagination, Sematext and other companies that offer consultancy, training and support for Lucene/Solr.

I’d like to ask people who have used FAST in the past, what would it take for Lucene/Solr/LCF to fill the gap?

Saw a Kurkure advertisement on a website titled “No...

Sun, 24 Jan 2010 16:48:20 +0530

Saw a Kurkure advertisement on a website titled “No plastic in Kurkure”. ROTFL!

Solr In Action Case Studies

Fri, 15 Jan 2010 14:55:38 +0530

Well, the cat is out of the bag. I’ve been working with Otis on Solr In Action. We’re looking for a couple of contributors to write case studies for the book describing how they have used Solr. Otis just posted this to his blog and to the Solr mailing list as well.

So, if you are using Apache Solr in some clever, interesting or unusual way, or deal with large indexes, a large number of cores or distributed search, and are willing to share this information with the world, please get in touch. We are looking for between five and ten pages (soft limits) per case study.

You can contact either me or Otis by leaving a comment on this post with your contact info or contact @shalinmangar or @otisg on Twitter or email me on shalin at apache and we’ll get back to you right away.

It’d be great if you can share this message around too.

The Total Growth of Open Source

Tue, 12 Jan 2010 13:01:00 +0530

The Total Growth of Open Source:

Amit Deshpande and Dirk Riehle from SAP Labs have conducted and published research on the growth of open source software.

The data has been culled from Ohloh.net and is based on the stats and activity of around 5000 open source projects written in 30 different languages under 103 open source licenses.

Some interesting quotes from the publication:

Successful open source projects like Linux, Apache, PostgreSQL and many others are growing super-linearly. Previous research showed that linear and quadratic growth is the dominant growth pattern of open source software projects

Our work shows that the additions to open source projects, the total project size (measured in source lines of code), the number of new open source projects, and the total number of open source projects are growing at an exponential rate. The total amount of source code and the total number of projects double about every 14 months.

Open Source has taken off handsomely and continues to thrive. It is not just about philosophy any more, it is good business sense.

In case you are interested in Solr stats, see the Solr project page at Ohloh.

Update - The report is quite old but I just discovered it now :)

SolrMarc vs DIHMarc

Mon, 11 Jan 2010 00:01:51 +0530

Erik has written about Solr’s usage in libraries on the Lucid Imagination Blog. Solr has found its way into many libraries and quite rightly so. However, one of the main things that Erik talks about in that blog post is the performance of DataImportHandler vs SolrMarc (the indexing library used by both VUFind and Blacklight).

Quoting from Erik’s email to the solrmarc-tech google group:

The difference in speeds:
SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s

W00t!

Well, I don’t know much about SolrMarc but I’ve seen DataImportHandler instances with comparable (or even better) throughput many times. There is something fishy inside SolrMarc for sure and I have a feeling that fixing it would be low-hanging fruit. However, Erik’s opinion is that DataImportHandler is a better way to index and that he will devote a portion of his time to helping the Solr-using library community (thanks Erik!).

DIH has really taken off since it debuted in Solr 1.3 and it would be safe to call it the de-facto standard for indexing data into Solr. It may not be the most elegant way to index data but it is quick and it works great. With the planned features for DataImportHandler in Solr v1.5, it will continue to improve and it makes sense to base VUFind and Blacklight’s indexing infrastructure on top of it. I’m very excited to see this happening.

Exploding mobile web usage in India

Tue, 22 Dec 2009 19:12:46 +0530

Exploding mobile web usage in India:

Opera published a study titled State of the Mobile Web, November 2009 which I found through TechCrunch. I can’t help but notice the tremendous growth in web usage through mobile phones in India. Page views have grown by 228.5% Y/Y and unique users have grown by 208.4% but if you look at metrics like page views per user or the amount of data transferred per user, you’ll see that they are quite small.

One of the reasons is that in India we don’t have 3G (actually we do but it is limited to only one provider - BSNL) and browsing the Internet is painfully slow on GPRS connections. With 3G finally coming next year, I’m quite sure that mobile web usage in India will just explode.

Right now the only kind of mobile applications that can work in India are SMS based. The increase in download speeds will make more people use the Internet and SMS will be less relevant, though that may take more time than a year. If 3G is indeed introduced by the middle of next year by prominent providers like Airtel and Vodafone, I won’t be surprised to see the Y/Y growth rate exceed 500%.

Companies building products/services around SMS should start thinking about a mobile web strategy now.