<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pathbreak Developer Notebook</title>
	<atom:link href="http://www.pathbreak.com/blog/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pathbreak.com/blog</link>
	<description>Pathbreak Technologies on Software Architectures, Engineering, Technologies &#38; Programming</description>
	<lastBuildDate>Wed, 28 Dec 2011 04:20:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Did Amazon CloudFront CDN make my site faster?</title>
		<link>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=did-amazon-cloudfront-cdn-make-my-site-faster</link>
		<comments>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#comments</comments>
		<pubDate>Wed, 28 Dec 2011 04:20:55 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[ab]]></category>
		<category><![CDATA[Apache bench]]></category>
		<category><![CDATA[CDN]]></category>
		<category><![CDATA[Cloudfront]]></category>
		<category><![CDATA[JMeter]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster</guid>
		<description><![CDATA[Contents Overview Setup Evaluation criteria Performance measurements Browser measurements Methodology Results Analysis of browser results Browser Conclusions Load measurements using apache bench (ab) Methodology Results Analysis of ab results Conclusion Load measurements using Apache JMeter Methodology Results Analysis of results Conclusion Measurements using www.webpagetest.org Methodology Results Conclusion Cost analysis Final conclusion Overview When I was [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-setup">Setup</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-evaluation-criteria">Evaluation criteria</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-performance-measurements">Performance measurements</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-browser-measurements">Browser measurements</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-browser-results">Analysis of browser results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-browser-conclusions">Browser Conclusions</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-load-measurements-using-apache-bench-ab">Load measurements using apache bench (ab)</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology1">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results1">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-ab-results">Analysis of ab results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion">Conclusion</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-load-measurements-using-apache-jmeter">Load measurements using Apache JMeter</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology2">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results2">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-results">Analysis of results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion1">Conclusion</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-measurements-using-www-webpagetest-org">Measurements using www.webpagetest.org</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology3">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results3">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion2">Conclusion</a></li>
</ol>
</li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-cost-analysis">Cost analysis</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-final-conclusion">Final conclusion</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>When I was deploying my website, I ran into a slow page load problem. One of the pages had nine non-interlaced screenshot PNG images, each about 700 x 500 pixels in size and between 40 KB and 350 KB in file size. </p>
<p>I wondered if deploying these images on Amazon CloudFront would improve response times. <a href="http://aws.amazon.com/cloudfront/" target="_blank">Amazon CloudFront</a> is the Content Delivery Network (CDN) offering from Amazon and is one of the cloud services that constitute Amazon Web Services (AWS).&#160; </p>
<p>CDNs are supposed to improve response times by replicating resources across multiple servers around the world, and serving a requested resource from the server closest to the requesting client. The implicit assumption is that the root cause of latency is geographical distance (the greater the distance, the more routers involved in between), so serving files from a server that is physically closer should reduce latency.</p>
<p>Since my site was already hosted on Amazon&#8217;s EC2, it made sense to try their CloudFront CDN rather than some other vendor&#8217;s CDN. Though this was not a performance-critical page, it did provide the opportunity to experiment with CloudFront in a realistic scenario, and the knowledge gained may prove useful in the future. So I started experimenting&#8230;</p>
<p>&#160;</p>
<h1 id="toc-setup">Setup</h1>
<p>I decided to use Amazon S3 as the origin server for Cloudfront (the origin server is the server from which Cloudfront picks up the resources to replicate). I opted for the &quot;reduced redundancy storage&quot; setting instead of &quot;standard redundancy&quot; for the S3 bucket to minimize costs (and also because these images are already available to me on my development machine and web server &#8211; standard redundancy makes more sense for user content or critical backups).</p>
<p>&#160;</p>
<h1 id="toc-evaluation-criteria">Evaluation criteria</h1>
<p>Better response times would be great. </p>
<p>Even if there was no improvement in response times, a CDN would still reduce the load on my rather <a href="http://aws.amazon.com/ec2/#instance" target="_blank">underpowered EC2 micro instance</a> web server, and free up connections for more dynamic content, like my SaaS products. So I was already somewhat biased towards using Cloudfront or some other file server before evaluating them.</p>
<p>But CloudFront, like other AWS services, is a metered service. So the evaluation also needed to keep costs in mind.</p>
<p>&#160;</p>
<h1 id="toc-performance-measurements">Performance measurements</h1>
<p>For response time measurements, I decided to use different tools to get a complete picture:</p>
<ul>
<li>The first set of measurements are taken using browsers. All 3 major browsers &#8211; Chrome, Firefox and IE &#8211; provide excellent profiling tools for developers. </li>
<li>However, browser measurements are not enough. The system should also be tested for scalability. What happens to response times when there are dozens of concurrent connections requesting the page? Can the page be rendered for all those users without much increase in response times? With a single web server on an underpowered machine, this is clearly not possible. But putting a CDN in the mix should shift at least some of the load from my puny single web server to Amazon&#8217;s scalable mammoth delivery network. I used the <a href="http://jmeter.apache.org/" target="_blank">Apache JMeter</a> and <a href="http://httpd.apache.org/docs/2.2/programs/ab.html" target="_blank">Apache Bench (ab)</a> tools to load the server. </li>
</ul>
<h2 id="toc-browser-measurements">Browser measurements</h2>
<h3 id="toc-methodology">Methodology</h3>
<p>Chrome&#8217;s developer tools network tab, Firefox Firebug network tab, and IE&#8217;s developer tools Network tab provide profiling information. </p>
<p>Chrome and Firefox (via the Firebug and Firebug NetExport plugins) can export profiling data to JSON-format files called .har (HTTP Archive) files. </p>
<p>IE exports to XML files which have a similar schema to the JSON .har files but expressed in XML. </p>
<p>&#160;</p>
<p>Each browser was tested 5 times with a complete cache cleanup in between. The cache cleanup ensured that all images were downloaded in each test. However, cache cleanup does not clear the browsers&#8217; DNS caches, which means DNS lookup timings are usually manifested only in the first test.</p>
<p>&#160;</p>
<p>A Python script was used to parse these files, calculate the averages and produce the HTML table of averages below.</p>
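<p>The averaging script worked roughly like this &#8211; a minimal sketch, assuming HAR 1.2 files as exported by Chrome or by Firebug NetExport (the .png filter and file naming are illustrative, not the exact script I ran):</p>

```python
import json
from collections import defaultdict
from statistics import mean

def average_timings(har_paths):
    """Average the T (total), R (receive) and W (wait) times per image
    across several HAR files, one file per test run."""
    samples = defaultdict(lambda: defaultdict(list))
    for path in har_paths:
        with open(path) as f:
            entries = json.load(f)["log"]["entries"]
        for entry in entries:
            url = entry["request"]["url"]
            if not url.endswith(".png"):   # keep only the screenshot images
                continue
            name = url.rsplit("/", 1)[-1]
            samples[name]["T"].append(entry["time"])
            samples[name]["R"].append(entry["timings"]["receive"])
            samples[name]["W"].append(entry["timings"]["wait"])
    # round to whole milliseconds, as reported in the table
    return {name: {metric: round(mean(values)) for metric, values in metrics.items()}
            for name, metrics in samples.items()}
```

IE&#8217;s XML export has the same fields, so the same aggregation applies after an XML-to-dict conversion step.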
<p>&#160;</p>
<h3 id="toc-results">Results</h3>
<blockquote><p><em><strong>Legend to the table:</strong></em></p>
<p>1st column =&gt; the image file name</p>
<p>&quot;OwnServer&quot; =&gt; tests in which images were downloaded from my Apache web server running on EC2 and EBS</p>
<p>&quot;Cloudfront&quot; =&gt; tests in which images were downloaded from Cloudfront distribution with S3 as origin server</p>
<p>T =&gt; Total time for request and response (including the blocked, DNS lookup, connect, send, wait and receive phases)</p>
<p>R =&gt; Total time for just receiving all the data</p>
<p>W =&gt; Time spent waiting before response started</p>
<p>All figures are in milliseconds</p>
</blockquote>
<table border="1">
<tbody>
<tr>
<td>&nbsp;</td>
<td>Chrome          <br />OwnServer</td>
<td>Chrome          <br />Cloudfront</td>
<td>Mozilla          <br />OwnServer</td>
<td>Mozilla          <br />Cloudfront</td>
<td>IE          <br />OwnServer</td>
<td>IE          <br />Cloudfront</td>
</tr>
<tr>
<td>corporatesearch.png          <br />158KB</td>
<td>T:14897          <br />R:14525           <br />W:369</td>
<td>T:7913          <br />R:7721           <br />W:141</td>
<td>T:9924          <br />R:9476           <br />W:447</td>
<td>T:13010          <br />R:12792           <br />W:177</td>
<td>T:9038          <br />R:8567           <br />W:386</td>
<td>T:11600          <br />R:11406           <br />W:153</td>
</tr>
<tr>
<td>jobsearch.png          <br />353KB</td>
<td>T:15516          <br />R:14795           <br />W:359</td>
<td>T:15982          <br />R:15666           <br />W:265</td>
<td>T:19716          <br />R:18328           <br />W:367</td>
<td>T:16061          <br />R:15856           <br />W:162</td>
<td>T:19394          <br />R:18629           <br />W:393</td>
<td>T:17225          <br />R:16997           <br />W:187</td>
</tr>
<tr>
<td>p-and-f-charting.png          <br />59KB           </td>
<td>T:5600          <br />R:4869           <br />W:365</td>
<td>T:6520          <br />R:6218           <br />W:250</td>
<td>T:7728          <br />R:6341           <br />W:363</td>
<td>T:9085          <br />R:8842           <br />W:199</td>
<td>T:5934          <br />R:4683           <br />W:399</td>
<td>T:7098          <br />R:6246           <br />W:811</td>
</tr>
<tr>
<td>s-and-r-charting.png          <br />40KB</td>
<td>T:4048          <br />R:3313           <br />W:366</td>
<td>T:5301          <br />R:5006           <br />W:243</td>
<td>T:6102          <br />R:4347           <br />W:363</td>
<td>T:3788          <br />R:3550           <br />W:177</td>
<td>T:5456          <br />R:3700           <br />W:973</td>
<td>T:3151          <br />R:2614           <br />W:496</td>
</tr>
<tr>
<td>dialogs.png          <br />114KB</td>
<td>T:15074          <br />R:4492           <br />W:315</td>
<td>T:12620          <br />R:5056           <br />W:316</td>
<td>T:14032          <br />R:4737           <br />W:314</td>
<td>T:11809          <br />R:4402           <br />W:180</td>
<td>T:11830          <br />R:5288           <br />W:299</td>
<td>T:12483          <br />R:4346           <br />W:318</td>
</tr>
<tr>
<td>candidateshortlist.png          <br />160KB</td>
<td>T:11319          <br />R:10951           <br />W:366</td>
<td>T:7246          <br />R:7082           <br />W:121</td>
<td>T:9932          <br />R:9491           <br />W:440</td>
<td>T:10277          <br />R:10108           <br />W:127</td>
<td>T:7694          <br />R:7310           <br />W:380</td>
<td>T:9235          <br />R:9044           <br />W:147</td>
</tr>
<tr>
<td>mainscreen.png          <br />166KB</td>
<td>T:16810          <br />R:10758           <br />W:309</td>
<td>T:13498          <br />R:8243           <br />W:138</td>
<td>T:18673          <br />R:11210           <br />W:323</td>
<td>T:14714          <br />R:7201           <br />W:1516</td>
<td>T:16115          <br />R:11113           <br />W:337</td>
<td>T:13815          <br />R:8579           <br />W:252</td>
</tr>
<tr>
<td>technical-analysis-          <br />signals.png           <br />112KB</td>
<td>T:11550          <br />R:7339           <br />W:307</td>
<td>T:13209          <br />R:8886           <br />W:155</td>
<td>T:12731          <br />R:7460           <br />W:351</td>
<td>T:9232          <br />R:6055           <br />W:227</td>
<td>T:11668          <br />R:6421           <br />W:336</td>
<td>T:8031          <br />R:5615           <br />W:221</td>
</tr>
<tr>
<td>homepage.png          <br />266KB</td>
<td>T:15834          <br />R:15400           <br />W:357</td>
<td>T:14030          <br />R:13797           <br />W:181</td>
<td>T:16209          <br />R:14827           <br />W:364</td>
<td>T:12099          <br />R:11927           <br />W:134</td>
<td>T:17235          <br />R:16470           <br />W:396</td>
<td>T:15956          <br />R:15609           <br />W:305</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-browser-results">Analysis of browser results</h3>
<p>The metrics to pay attention to here are R (the average receive times) and W (the wait times). </p>
<p>I didn&#8217;t pay much attention to T (the average total times) because I felt they are misleading. The problem is that browsers download embedded resources like &lt;img&gt;s using a small number of connections. When there are more resources than there are connections, the extra resources are blocked until some connections are freed. These blocked times manifest in the T values, but they are not deterministic and are also not similar across browsers since connection implementations differ. Hence, total times should be ignored in my opinion.</p>
<p>What can we observe from the R(eceive) and W(ait) times?</p>
<ul>
<li>Chrome: For 5 out of 9 images, R(eceive) times from Cloudfront are lower than from my own server. For the other 4 images, receive times from Cloudfront are slightly higher. So it&#8217;s almost a tie. However, W(ait) times are consistently lower for Cloudfront. So, Cloudfront leads. </li>
<li>Firefox: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently lower, except in one case, which seems to be an anomaly. Cloudfront leads again. </li>
<li>IE: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently lower, except in two cases, which seem to be anomalies. Cloudfront leads again. </li>
</ul>
<h3 id="toc-browser-conclusions">Browser Conclusions</h3>
<p><strong>Cloudfront does make the site faster</strong>&#8230;<em>but </em>not as consistently or drastically as expected, at least in my tests (I&#8217;m in India and my nearest edge locations seem to be Singapore or Hong Kong).</p>
<p>One possible factor is that resources need lots of hits before Cloudfront caches and serves them effectively. I&#8217;m not sure about this, but the Cloudfront documentation does seem to hint that more popular resources benefit more.</p>
<p>&#160;</p>
<h2 id="toc-load-measurements-using-apache-bench-ab">Load measurements using apache bench (ab)</h2>
<h3 id="toc-methodology1">Methodology</h3>
<p>ab is incapable of downloading a web page and all its embedded resources. So I ran ab requests on just one of the image files &#8211; the biggest one at 350 KB.</p>
<p>I set different values for the -n (total requests) and -c (concurrency) options. -k was enabled to keep connections alive, simulating browser behaviour.</p>
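<p>ab prints its figures as a plain-text summary, so tabulating many runs is easiest with a small parser. A sketch of the idea (the sample output below is condensed to the relevant lines of ab&#8217;s format, filled in with the single-user own-server figures; the 100% line is made up for illustration):</p>

```python
import re

# Condensed, illustrative sample of ab's plain-text summary.
AB_OUTPUT = """\
Time taken for tests:   292.69 seconds
Total transferred:      18051951 bytes
Time per request:       5853.80 [ms] (mean)
Percentage of the requests served within a certain time (ms)
  50%   5350
  90%   8400
 100%  14800 (longest request)
"""

def parse_ab(output):
    """Pull the figures reported in the Results table out of ab's summary."""
    total = float(re.search(r"Time taken for tests:\s+([\d.]+)", output).group(1))
    transferred = int(re.search(r"Total transferred:\s+(\d+)", output).group(1))
    percentiles = {int(pct): int(ms)
                   for pct, ms in re.findall(r"^\s*(\d+)%\s+(\d+)", output, re.M)}
    return {"total_s": total, "bytes": transferred, "percentiles_ms": percentiles}
```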
<p>&#160;</p>
<h3 id="toc-results1">Results</h3>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="bottom" width="179">&#160;</td>
<td valign="bottom" width="131">
<p><strong>Own server</strong></p>
</td>
<td valign="bottom" width="99">
<p><strong>Cloudfront</strong></p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p><b>50 total requests, 1 user</b></p>
</td>
<td valign="bottom" width="131">&#160;</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>292.69s</p>
</td>
<td valign="bottom" width="99">
<p>214s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>5.85 s</p>
</td>
<td valign="bottom" width="99">
<p>4.28s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>8.4s</p>
</td>
<td valign="bottom" width="99">
<p>4.9s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>5.35s</p>
</td>
<td valign="bottom" width="99">
<p>4.26s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>18051951</p>
</td>
<td valign="bottom" width="99">
<p>18070081</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 5 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>211.97 s</p>
</td>
<td valign="bottom" width="99">
<p>215.24s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.24 s</p>
</td>
<td valign="bottom" width="99">
<p>4.30s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>31.4s</p>
</td>
<td valign="bottom" width="99">
<p>35.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>19.3s</p>
</td>
<td valign="bottom" width="99">
<p>18.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>18737815</p>
</td>
<td valign="bottom" width="99">
<p>18826276</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 10 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>217.34 s</p>
</td>
<td valign="bottom" width="99">
<p>218.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.35 s</p>
</td>
<td valign="bottom" width="99">
<p>4.37s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>58.8s</p>
</td>
<td valign="bottom" width="99">
<p>67.4s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>40s</p>
</td>
<td valign="bottom" width="99">
<p>27.5s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>19460082</p>
</td>
<td valign="bottom" width="99">
<p>19705381</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 25 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>227.57s</p>
</td>
<td valign="bottom" width="99">
<p>239.53s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.55s</p>
</td>
<td valign="bottom" width="99">
<p>4.79s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>142.6s</p>
</td>
<td valign="bottom" width="99">
<p>130.6s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>61.9s</p>
</td>
<td valign="bottom" width="99">
<p>46.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>20218419</p>
</td>
<td valign="bottom" width="99">
<p>21540782</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>80 total requests, 40 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>337.93s</p>
</td>
<td valign="bottom" width="99">
<p>412.69s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.22s</p>
</td>
<td valign="bottom" width="99">
<p>5.16s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>235.3s</p>
</td>
<td valign="bottom" width="99">
<p>239.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>82.4s</p>
</td>
<td valign="bottom" width="99">
<p>84s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>28945837</p>
</td>
<td valign="bottom" width="99">
<p>34307057</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>100 total requests, 50 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>477.91s</p>
</td>
<td valign="bottom" width="99">
<p>477.13s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.78s</p>
</td>
<td valign="bottom" width="99">
<p>4.77s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>260.1s</p>
</td>
<td valign="bottom" width="99">
<p>201.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>137.1s</p>
</td>
<td valign="bottom" width="99">
<p>53.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>34238460</p>
</td>
<td valign="bottom" width="99">
<p>43105119</p>
</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-ab-results">Analysis of ab results</h3>
<p>The results are so all over the place that I found it difficult to draw any conclusions! </p>
<p>The 50th percentile results in some tests clearly favour Cloudfront, but not consistently. </p>
<p>&#160;</p>
<p>I also found it hard to understand some of the raw values (not shown here). For example, in the last test with 100 requests across 50 concurrent users, the total time was 477.1s but the longest request took 454s! How that can be is beyond me. I&#8217;m guessing that a request sent fairly early never got a response. It&#8217;s possible that the load was too much for my puny 512 kbps bandwidth.</p>
<p>Another thing to notice is that the data volume with Cloudfront is at least 25% higher at higher loads. I&#8217;m guessing that this is because of TCP retransmissions, though why it appears only when communicating with Cloudfront is not clear.</p>
<p>&#160;</p>
<h3 id="toc-conclusion">Conclusion</h3>
<p>I&#8217;m reluctant to draw any concrete conclusion from the ab results, except that 50% of requests seem to be faster most of the time when using Cloudfront.</p>
<p>&#160;</p>
<h2 id="toc-load-measurements-using-apache-jmeter">Load measurements using Apache JMeter</h2>
<h3 id="toc-methodology2">Methodology</h3>
<p>JMeter was used to test the following loads:</p>
<ul>
<li>50 total requests with 1 user. Retrieve embedded resources using a pool of 9 threads (9 because the page had 9 images) </li>
<li>50 total requests across 5 concurrent users. Retrieve embedded resources using pools of 5 threads each (only 5 because JMeter creates a separate pool for each virtual user, which means 5 users x 5 threads = 25 threads would be created. I was afraid that larger pool sizes might make bandwidth contention a factor in the timings) </li>
</ul>
<h3 id="toc-results2">Results</h3>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="bottom" width="175">&nbsp;</td>
<td valign="bottom" width="68">
<p align="center"><strong>Ownserver</strong></p>
</td>
<td valign="bottom" width="67">
<p align="center"><strong>Cloudfront</strong></p>
</td>
<td valign="bottom" width="209">
<p align="center"><strong>Notes</strong></p>
</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p><b>50 total requests,              <br />1 user,               <br />9 downloading threads</b></p>
</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>avg</p>
</td>
<td valign="bottom" width="68">
<p>22.8s</p>
</td>
<td valign="bottom" width="67">
<p>20.01s</p>
</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>90% of requests</p>
</td>
<td valign="bottom" width="68">
<p>24.4s</p>
</td>
<td valign="bottom" width="67">
<p>25.7s</p>
</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">&nbsp;</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p><b>50 total requests,              <br />5 concurrent users,               <br />5 downloading threads per user</b></p>
</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>avg</p>
</td>
<td valign="bottom" width="68">
<p>75s</p>
</td>
<td valign="bottom" width="67">
<p>48s</p>
</td>
<td valign="bottom" width="209">
<p>Actually 77s overall, but only 48s with 7 anomalous measurements removed.            <br />Ownserver never actually finished all 50 requests &#8211; probably socket timeouts.</p>
</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>90% of requests</p>
</td>
<td valign="bottom" width="68">
<p>110.5s</p>
</td>
<td valign="bottom" width="67">
<p>60s</p>
</td>
<td valign="bottom" width="209">
<p>Actually 232.6s,            <br />but 34 out of 43 (80%) were within 60s. </p>
</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-results">Analysis of results</h3>
<p>When simulating a single user, using Cloudfront didn&#8217;t show any major improvement in speed.</p>
<p>But when simulating 5 concurrent users with 5 resource downloading threads per user, I saw interesting results. 7 results timed out with extremely high times like 270 seconds. These I put down as anomalies, possibly because I was overloading my bandwidth.</p>
<p>Without those anomalies included, the average time per request was just 48 seconds when using Cloudfront, compared to 75 seconds when not. Also, 80% of the remaining timings completed within 60 seconds when using Cloudfront, compared to 110.5 seconds when not.</p>
<p>&#160;</p>
<h3 id="toc-conclusion1">Conclusion</h3>
<p>So load testing with JMeter suggests that Cloudfront performs better at higher loads.</p>
<p>&#160;</p>
<h2 id="toc-measurements-using-www-webpagetest-org">Measurements using <a href="http://www.webpagetest.org">www.webpagetest.org</a></h2>
<h3 id="toc-methodology3">Methodology</h3>
<p><a href="http://www.webpagetest.org">www.webpagetest.org</a> provides automated testing for websites, from client locations around the world. </p>
<p>Five tests were conducted from each location for each method of serving images.</p>
<p>&#160;</p>
<h3 id="toc-results3">Results</h3>
<p>The results were as follows:</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="33%">&#160;</td>
<td valign="top" width="33%">Served from own server</td>
<td valign="top" width="33%">Cloudfront</td>
</tr>
<tr>
<td valign="top" width="33%">New York</td>
<td valign="top" width="33%">8.772 s</td>
<td valign="top" width="33%">8.911 s</td>
</tr>
<tr>
<td valign="top" width="33%">London</td>
<td valign="top" width="33%">8.791 s</td>
<td valign="top" width="33%">8.703 s</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-conclusion2">Conclusion</h3>
<p>It doesn&#8217;t look like Cloudfront improved page speeds from these locations.</p>
<hr />
<p>&#160;</p>
<h1 id="toc-cost-analysis">Cost analysis</h1>
<p>If the choice is between storing content on an EC2 EBS drive and serving it from EC2 web server, vs. storing it in S3 and serving it via Cloudfront, the following cost components are relevant (as of Dec 2011):</p>
<p>Assume &#8216;B&#8217; GB is the size of the content being stored (for simplicity, I&#8217;ll assume just one file of &#8216;B&#8217; GB).</p>
<p>Assume 1 user requests this file every second of every day, which comes to 86,400 requests/day or 2,592,000 requests/month.</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="50%"><strong>via EBS and EC2</strong></td>
<td valign="top" width="50%"><strong>via S3 and Cloudfront</strong></td>
</tr>
<tr>
<td valign="top" width="50%">EBS storage per GB = $0.10B</td>
<td valign="top" width="50%">S3 reduced redundancy storage = $0.093B          <br />(ignoring S3 IO request costs by assuming this file will be stored just once, and then always served via Cloudfront)</td>
</tr>
<tr>
<td valign="top" width="50%">EBS cost per 1 million IO requests = $0.10 x 2.592 = $0.2592B</td>
<td valign="top" width="50%">Cloudfront data transfer = $0.19B          <br />But as we have seen with ab tests, at higher loads, more data is transferred due to TCP retransmissions.           <br />Assuming 20% extra data is transferred, this will come to $0.228B           </td>
</tr>
<tr>
<td valign="top" width="50%">Data transfer through elastic IP = $0.01B</td>
<td valign="top" width="50%">Cloudfront cost per 10000 HTTP requests = $0.009 x 2592000/10000 = $2.3328</td>
</tr>
<tr>
<td valign="top" width="50%">Total: $0.3692B          <br />If that file is 1GB in size, this comes to $0.37</td>
<td valign="top" width="50%">Total: $2.3328 + 0.321B          <br />If that file is 1GB in size, this comes to $2.65</td>
</tr>
</tbody>
</table>
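<p>A quick sanity check of those totals for B = 1 GB (a sketch; it just re-does the table&#8217;s arithmetic with the Dec 2011 prices):</p>

```shell
# Monthly cost per the table above, for a single file of B GB (B=1 here).
B=1
ebs=$(awk -v b="$B" 'BEGIN { printf "%.2f", 0.10*b + 0.2592*b + 0.01*b }')
cf=$(awk -v b="$B" 'BEGIN { printf "%.2f", 2.3328 + 0.093*b + 0.228*b }')
echo "EBS+EC2: \$$ebs/month, S3+Cloudfront: \$$cf/month"
```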
<p>So cost-wise too, Cloudfront comes out more expensive than serving off EBS. It&#8217;s primarily the per-request HTTP charges ($2.33/month in this scenario) that tilt the choice away from Cloudfront.</p>
<p>&#160;</p>
<hr />
<p>&#160;</p>
<h1 id="toc-final-conclusion">Final conclusion</h1>
<p>In my case, my website is not a high-traffic site. I also didn&#8217;t observe any <em>drastic </em>improvement in page speeds, except possibly at high loads (shown by the JMeter results). And cost-wise, it&#8217;s indeed cheaper to stick with EBS and EC2. </p>
<p>So, should I use Cloudfront or not? I think it&#8217;s not needed for my site at the moment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simulating browsers using JMeter</title>
		<link>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=simulating-browsers-using-jmeter</link>
		<comments>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter#comments</comments>
		<pubDate>Tue, 27 Dec 2011 11:57:23 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[High Scalability]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[Web development]]></category>
		<category><![CDATA[JMeter]]></category>
		<category><![CDATA[load testing]]></category>
		<category><![CDATA[stress testing]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/simulating-browsers-using-jmeter</guid>
		<description><![CDATA[JMeter is commonly used to stress test webpages by simulating multiple users concurrently visiting a webpage URL. However, for this simulation to be accurate, JMeter needs to be configured correctly so that it behaves like a browser. In this article, I explain what settings to configure, to make JMeter simulate browser requests fairly accurately. &#160; [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox"></div>
<p>JMeter is commonly used to stress test webpages by simulating multiple users concurrently visiting a webpage URL. However, for this simulation to be accurate, JMeter needs to be configured correctly so that it behaves like a browser. </p>
<p>In this article, I explain what settings to configure, to make JMeter simulate browser requests fairly accurately. </p>
<p>&#160;</p>
<p>Before configuring JMeter correctly, let&#8217;s understand how browsers work:</p>
<ul>
<li>When a user enters a webpage URL, the browser connects to the server and starts downloading and parsing the page. </li>
<li>While parsing, it encounters URLs of embedded resources like javascript, CSS and image files. </li>
<li>The browser then creates more threads, each of which opens a new connection and fetches one of these embedded URLs. Most browsers limit the number of connections per server (6 in the case of Firefox at the time of writing) and cap the total number of downloading threads (48 in the case of Firefox at the time of writing). </li>
<li>The page is considered loaded when all these embedded URLs have been fetched. </li>
</ul>
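<p>This fetch pattern &#8211; the page first, then its resources in parallel up to a per-server cap &#8211; can be mimicked on the command line. A purely illustrative sketch (the file names are made up, and echo stands in for the actual HTTP GET):</p>

```shell
# Download the page, then fetch its embedded resources with up to 6
# parallel workers, like a browser's per-server connection cap.
echo "GET /page.html"
printf '%s\n' page.js style.css logo.png | xargs -P 6 -n 1 -I{} echo "GET {}"
```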
<p>JMeter can simulate this behaviour if the following 2 settings are configured:</p>
<ul>
<li><strong>Retrieve All Embedded Resources from HTML Files</strong>
<p><a href="wp-content/uploads/2011/12/image.png"><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2011/12/image_thumb.png" width="284" height="55" /></a></p>
<p>This checkbox is found near the bottom of <strong>HTTP Request Defaults </strong>config elements and <strong>HTTP Request </strong>samplers.</p>
<p>Check the checkbox to make JMeter download embedded resources like javascript, CSS and images, just as a browser would.</p>
<p>Add a <strong>View Results in Tree </strong>listener element if you want to see which embedded resources are downloaded and their metrics. Note that the byte counts in <strong>View Results in Table </strong>don&#8217;t include the embedded resources. </p>
</li>
<li><strong>Use concurrent pool. Size=n</strong>
<p><a href="wp-content/uploads/2011/12/image1.png"><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2011/12/image_thumb1.png" width="468" height="59" /></a></p>
<p>The behaviour of this checkbox and pool size are as follows: </p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="120"><strong>Retrieve all embedded resources from HTML files</strong></td>
<td valign="top" width="123"><strong>Use concurrent pool</strong></td>
<td valign="top" width="391"><strong>Behaviour</strong></td>
</tr>
<tr>
<td valign="top" width="120">Checked</td>
<td valign="top" width="123">Unchecked</td>
<td valign="top" width="391">The main page and its embedded resources are downloaded in the same thread.              </p>
<p>For example, if the Thread Group simulates 3 users, JMeter creates 3 threads &#8211; one for each simulated user &#8211; named &quot;Thread Group 1-1&quot; to &quot;Thread Group 1-3&quot;.               </p>
<p>Each of these threads downloads all embedded resources <em>sequentially</em>.               </p>
<p>If page P has resources A, B and C, JMeter downloads them as follows:               <br />Thread Group 1-1 : P, A, B, C (one after another)               <br />Thread Group 1-2 : P, A, B, C (one after another)               <br />Thread Group 1-3 : P, A, B, C (one after another)</td>
</tr>
<tr>
<td valign="top" width="120">Checked</td>
<td valign="top" width="123">Checked.              <br />Pool size=x</td>
<td valign="top" width="391">As usual, JMeter creates threads named &quot;Thread Group 1-k&quot; to simulate users.              </p>
<p>In addition, for each of these user threads, JMeter creates a separate threadpool of size x, with thread names like <strong>pool-n-thread-m</strong>.               </p>
<p>The main page is downloaded by the user&#8217;s thread &quot;Thread Group 1-k&quot;, while the embedded resources are downloaded by its associated threadpool&#8217;s <strong>pool-n-thread-m</strong> threads.</td>
</tr>
</tbody>
</table>
<p>So to simulate browsers, check the &#8216;<strong>Use concurrent pool</strong>&#8217; checkbox and specify a reasonable pool size (4&#8211;8 is typical for browsers).</p>
<p>However, when setting the concurrent pool size, keep in mind the number of users being simulated, because a separate threadpool is created for each simulated user. With many users, the sheer number of threads and the bandwidth contention on the JMeter machine can themselves inflate the measured response times. To simulate many users, it&#8217;s better to distribute the test across multiple JMeter machines.</p>
</li>
</ul>
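<p>Before running a big test, it&#8217;s worth estimating the total thread count this configuration creates. A quick sketch (U and X are example values):</p>

```shell
# Total JMeter threads = U user threads + U pools of X downloader threads each.
U=50   # simulated users (Thread Group size) - example value
X=6    # concurrent pool size - example value
echo "$(( U * (1 + X) )) threads total"
```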
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr on Jetty on Ubuntu</title>
		<link>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=solr-on-jetty-on-ubuntu</link>
		<comments>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu#comments</comments>
		<pubDate>Fri, 14 Oct 2011 21:39:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[jetty]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu</guid>
		<description><![CDATA[Article RelevancySolr v3.3.0, Jetty v6.1, Ubuntu 10.04 Lucid Lynx 64-bit server This article explains steps involved in deploying Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu OS. This article is based on information from the Solr Jetty wiki page and on troubleshooting experiences of others. Prerequisites: Target system [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Solr v3.3.0, Jetty v6.1, Ubuntu 10.04 Lucid Lynx 64-bit server</div>
</div>
<p>This article explains steps involved in deploying Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu OS. This article is based on information from the <a href="http://wiki.apache.org/solr/SolrJetty">Solr Jetty wiki page</a> and on <a href="http://greenash.net.au/thoughts/2011/02/solr-jetty-and-daemons-debugging-jettysh/">troubleshooting experiences of others</a>. </p>
<p><strong>Prerequisites:</strong></p>
<ul>
<li>Target system should have at least Java 6 installed (in my case, OpenJDK 6&#8217;s JRE is installed) </li>
</ul>
<p><strong>Steps:</strong></p>
<p>1. In this description, /opt/solr will be the target directory where Solr will be deployed.</p>
<p>&#160;</p>
<p>2. The /example directory in the solr package forms the basis of the installation on the target system. It contains multiple configurations, each suitable for a different use case: </p>
<p>/example-DIH : a multicore configuration with each core demonstrating a different data importing configuration </p>
<p>/multicore : a simple multicore installation</p>
<p>/solr : a basic single core configuration. </p>
<p>Copy the configuration suitable for your application into /example/solr (replacing the one already there if necessary) and discard the rest. A configuration typically consists of /conf and /data (and sometimes also /bin and /lib) sub directories.</p>
<p>&#160;</p>
<p>Additionally, the /dist and /contrib package directories contain important jars required by some of these configurations:</p>
<p>/dist/apache-solr-dataimporthandler*.jar &#8211; if you require data importing capabilities. </p>
<p>/dist/apache-solr-cell-*.jar and /contrib/extraction/lib/*.jar &#8211; if you require content extraction from PDF, MS Office and other document files.</p>
<p>These jars should also be deployed on the target system.</p>
<p>&#160;</p>
<p>3. Copy these files to the target system and create the directory structure suggested below under /opt/solr:</p>
<p> 
<pre>
<pre class="brush: plain; title: ; notranslate">
|-- dist - All required jars, including additional jars from /contrib
|-- etc - this should probably go into the root /etc directory, as per conventions
|   |-- jetty.xml
|   `-- webdefault.xml
|-- lib
|-- solr
|   |-- bin
|   |-- conf
|   |   |-- admin-extra.html
|   |   |-- dataimport.properties
|   |   |-- elevate.xml
|   |   |-- protwords.txt
|   |   |-- schema.xml
|   |   |-- scripts.conf
|   |   |-- solrconfig.xml
|   |   |-- stopwords.txt
|   |   |-- synonyms.txt
|   |   `-- xml-data-config.xml
|   |-- data
|-- start.jar
|-- webapps
|   `-- solr.war
`-- work
</pre>
</pre>
<p></p>
<p>&#160;</p>
<p>4. The solr process should run with its own dedicated credentials, so that authorizations can be administered at a fine granularity. So create a system user and group named &#8216;solr&#8217;.</p>
<p></p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo adduser --system solr
$ sudo addgroup solr
$ sudo adduser solr solr
</pre>
</pre>
<p></p>
<p>5. Create a log directory /var/log/solr for solr and jetty logs.</p>
<p>6. Jetty outputs its errors to STDERR by default. Redirect it to a rolling log file by adding this section to /opt/solr/etc/jetty.xml.</p>
<p></p>
<pre>
<pre class="brush: xml; title: ; notranslate">
    &lt;!-- =========================================================== --&gt;
    &lt;!-- configure logging                                           --&gt;
    &lt;!-- =========================================================== --&gt;
    &lt;New id=&quot;ServerLog&quot; class=&quot;java.io.PrintStream&quot;&gt;
      &lt;Arg&gt;
        &lt;New class=&quot;org.mortbay.util.RolloverFileOutputStream&quot;&gt;
          &lt;Arg&gt;&lt;SystemProperty default=&quot;/var/log/solr&quot; name=&quot;jetty.logs&quot; /&gt;/yyyy_mm_dd.stderrout.log&lt;/Arg&gt;
          &lt;Arg type=&quot;boolean&quot;&gt;false&lt;/Arg&gt;
          &lt;Arg type=&quot;int&quot;&gt;90&lt;/Arg&gt;
          &lt;Arg&gt;&lt;Call class=&quot;java.util.TimeZone&quot; name=&quot;getTimeZone&quot;&gt;&lt;Arg&gt;GMT&lt;/Arg&gt;&lt;/Call&gt;&lt;/Arg&gt;
          &lt;Get id=&quot;ServerLogName&quot; name=&quot;datedFilename&quot;&gt;&lt;/Get&gt;
        &lt;/New&gt;
      &lt;/Arg&gt;
    &lt;/New&gt;
    &lt;Call class=&quot;org.mortbay.log.Log&quot; name=&quot;info&quot;&gt;&lt;Arg&gt;Redirecting stderr/stdout to &lt;Ref id=&quot;ServerLogName&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
    &lt;Call class=&quot;java.lang.System&quot; name=&quot;setErr&quot;&gt;&lt;Arg&gt;&lt;Ref id=&quot;ServerLog&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
    &lt;Call class=&quot;java.lang.System&quot; name=&quot;setOut&quot;&gt;&lt;Arg&gt;&lt;Ref id=&quot;ServerLog&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
</pre>
</pre>
<p>&#160;</p>
<p>7. Now we need to set file and directory permissions so that the solr process user can work correctly. </p>
<p>Use <strong>chown</strong> to make <strong>solr:solr</strong> as the owner and group.</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo chown -R solr:solr /opt/solr
$ sudo chown -R solr:solr /var/log/solr
</pre>
</pre>
<p>
  <br />Use <strong>chmod</strong> to give write permissions to solr:solr for the following directories: </p>
<p>/opt/solr/solr/data </p>
<p>/opt/solr/work</p>
<p>/var/log/solr</p>
<p>&#160;</p>
<p>8. The basic installation should work now. Try it out by launching Jetty as a regular process:</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
/opt/solr$ sudo java -Dsolr.solr.home=/opt/solr/solr -jar start.jar
</pre>
</pre>
<p>&#160;</p>
<p>This should start solr. </p>
<p>Verify that logs are getting generated under /var/log/solr.</p>
<p>Test it by sending a query to http://localhost:8983/solr/select?q=something using curl.</p>
<p>&#160;</p>
<p>9. Now we need to install solr as a system daemon so that it can start automatically. Download the <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.jetty/jetty/trunk/jetty-distribution/src/main/resources/bin/jetty.sh">jetty.sh startup script</a> (link courtesy <a href="http://wiki.apache.org/solr/SolrJetty">http://wiki.apache.org/solr/SolrJetty</a>) and save it as /etc/init.d/solr. Give it execute permission with <strong>sudo chmod +x /etc/init.d/solr</strong>.</p>
<p>The following environment variables need to be set. They can either be inserted in this /etc/init.d/solr script itself, or they can be stored in /etc/default/jetty, which is read by the script.</p>
<p>&#160;</p>
<pre>
<pre class="brush: plain; title: ; notranslate">
JAVA_HOME=/usr/lib/jvm/default-java

JAVA_OPTIONS=&quot;-Xmx64m -Dsolr.solr.home=/opt/solr/solr&quot;

JETTY_HOME=/opt/solr

JETTY_USER=solr

JETTY_GROUP=solr

JETTY_LOGS=/var/log/solr
</pre>
</pre>
<p>&#160;</p>
<p>Set the -Xmx parameter as per your requirements.</p>
<p>&#160;</p>
<p>10. Additionally, this startup script has a problem that prevents it from running in Ubuntu. If you try running this right now using</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo /etc/init.d/solr start
</pre>
</pre>
<p>&#160;</p>
<p>you&#8217;ll get a </p>
<blockquote>
<p>Starting Jetty: FAILED</p>
</blockquote>
<p>error.</p>
<p>&#160;</p>
<p>The problem &#8211; as explained well in this <a href="http://greenash.net.au/thoughts/2011/02/solr-jetty-and-daemons-debugging-jettysh/">troubleshooting article</a> &#8211; is in this line that attempts to start the daemon:</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
if start-stop-daemon -S -p&quot;$JETTY_PID&quot; $CH_USER -d&quot;$JETTY_HOME&quot; -b -m -a &quot;$JAVA&quot; -- &quot;${RUN_ARGS[@]}&quot; --daemon
</pre>
</pre>
<p>&#160;</p>
<p>In Ubuntu, <strong>--daemon</strong> is <strong>not</strong> a valid option for start-stop-daemon. Remove that option from the script:</p>
<p></p>
<pre>
<pre class="brush: bash; title: ; notranslate">
if start-stop-daemon -S -p&quot;$JETTY_PID&quot; $CH_USER -d&quot;$JETTY_HOME&quot; -b -m -a &quot;$JAVA&quot; -- &quot;${RUN_ARGS[@]}&quot;
</pre>
</pre>
<p>&#160;</p>
<p>If you try starting it now, it should work:</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo /etc/init.d/solr start
</pre>
</pre>
<p>&#160;</p>
<p>It should give a</p>
<blockquote>
<p>Starting Jetty: OK</p>
</blockquote>
<p>message, and ps -ef |grep java should show the &quot;java -jar start.jar&quot; process.</p>
<p>&#160;</p>
<p>11. Finally, it&#8217;s time to configure this as an init script. Read this article if you want a background on <a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained">Ubuntu runlevels and init scripts</a>. </p>
<p>Insert these lines at the top of /etc/init.d/solr to make it a LSB (Linux Standard Base) compliant init script. Without these lines, it&#8217;s not possible to configure the run level scripts.</p>
<blockquote>
<pre>### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO</pre>
</blockquote>
<p>&#160;</p>
<p>Now run the following command:</p>
<p></p>
<pre>
<pre class="brush: plain; title: ; notranslate">
$ sudo update-rc.d solr defaults
 Adding system startup for /etc/init.d/solr ...
   /etc/rc0.d/K20solr -&gt; ../init.d/solr
   /etc/rc1.d/K20solr -&gt; ../init.d/solr
   /etc/rc6.d/K20solr -&gt; ../init.d/solr
   /etc/rc2.d/S20solr -&gt; ../init.d/solr
   /etc/rc3.d/S20solr -&gt; ../init.d/solr
   /etc/rc4.d/S20solr -&gt; ../init.d/solr
   /etc/rc5.d/S20solr -&gt; ../init.d/solr
</pre>
</pre>
<p></p>
<p>As you can see, the run levels 2-5 (they are equivalent in Ubuntu) are now configured to start solr.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ubuntu startup &#8211; init scripts, runlevels, upstart jobs explained</title>
		<link>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained</link>
		<comments>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#comments</comments>
		<pubDate>Sun, 25 Sep 2011 09:03:24 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[chkconfig]]></category>
		<category><![CDATA[init.d]]></category>
		<category><![CDATA[rc.d]]></category>
		<category><![CDATA[run levels]]></category>
		<category><![CDATA[runlevel]]></category>
		<category><![CDATA[service command]]></category>
		<category><![CDATA[update-rc.d]]></category>
		<category><![CDATA[upstart]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained</guid>
		<description><![CDATA[Article RelevancyUbuntu 10.04 Lucid Lynx; believed to be relevant for Ubuntu 8.x to 11.x, the latest release at the time of writing this article Contents Run levels and init.d scripts &#8211; the traditional mechanism Get Current run level /etc/init.d directories /etc/rcn.d directories Enabling and disabling run level services Upstart Resources for further reading Ubuntu has [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Ubuntu 10.04 Lucid Lynx; believed to be relevant for Ubuntu 8.x to 11.x, the latest release at the time of writing this article</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-run-levels-and-init-d-scripts-the-traditional-mechanism">Run levels and init.d scripts &#8211; the traditional mechanism</a></p>
<ol>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-get-current-run-level">Get Current run level</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-etcinit-d-directories">/etc/init.d directories</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-etcrcn-d-directories">/etc/rc<em>n</em>.d directories</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-enabling-and-disabling-run-level-services">Enabling and disabling run level services</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-upstart">Upstart</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-resources-for-further-reading">Resources for further reading</a></li>
</ol>
</div>
</div>
<p>Ubuntu has two different mechanisms for starting system services:</p>
<ul>
<li>The traditional mechanism based on run levels, and scripts in /etc/init.d and /etc/rc<em>n</em>.d directories</li>
<li>A new mechanism known as <em>upstart. </em></li>
</ul>
<p>Some services are started using one mechanism and others using the other. If you want to control the services, it&#8217;s necessary to understand these mechanisms.</p>
<p><br/></p>
<h1 id="toc-run-levels-and-init-d-scripts-the-traditional-mechanism">Run levels and init.d scripts &#8211; the traditional mechanism</h1>
<p>All Linux distros have the concept of <em>run levels</em> as part of the Linux Standard Base specification. They can be considered &#8220;modes&#8221; in which Linux runs.</p>
<table width="652" border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top" width="71"><strong><span style="color: #0000ff;">Run level</span></strong></td>
<td valign="top" width="185"><strong><span style="color: #0000ff;">Name</span></strong></td>
<td valign="top" width="394"><strong><span style="color: #0000ff;">Description</span></strong></td>
</tr>
<tr>
<td valign="top" width="71">0</td>
<td valign="top" width="185">Halt</td>
<td valign="top" width="394">Shuts down the system</td>
</tr>
<tr>
<td valign="top" width="71">1</td>
<td valign="top" width="185">Single-user mode</td>
<td valign="top" width="394">Mode for administrative tasks.</td>
</tr>
<tr>
<td valign="top" width="71">2</td>
<td valign="top" width="185">Multi-User Mode</td>
<td valign="top" width="394">Does not configure network interfaces and does not export network services</td>
</tr>
<tr>
<td valign="top" width="71">3</td>
<td valign="top" width="185">Multi-User Mode with Networking</td>
<td valign="top" width="394">Starts the system normally</td>
</tr>
<tr>
<td valign="top" width="71">4</td>
<td valign="top" width="185">Not used / user definable</td>
<td valign="top" width="394">For special purposes</td>
</tr>
<tr>
<td valign="top" width="71">5</td>
<td valign="top" width="185">Multi-User Mode with GUI display manager</td>
<td valign="top" width="394">Run level 3 + display manager</td>
</tr>
<tr>
<td valign="top" width="71">6</td>
<td valign="top" width="185">Reboot</td>
<td valign="top" width="394">Reboots the system</td>
</tr>
<tr>
<td valign="top" width="71">s or S</td>
<td valign="top" width="185">Single-user mode</td>
<td valign="top" width="394">Does not configure network interfaces, or start daemons.</td>
</tr>
</tbody>
</table>
<p>In Ubuntu (and Debian), <span style="text-decoration: underline;">run levels 2 to 5 are equivalent</span> and configured with the same set of services.</p>
<h2 id="toc-get-current-run-level">Get Current run level</h2>
<p>Use the <strong>runlevel</strong> command to get the current run level. <strong>runlevel</strong> is available in Ubuntu as well as Red Hat-based distros like CentOS (not sure about other distros).</p>
<p>karthik@ubuntuLynx:~$ runlevel<br />
N 2</p>
<h2 id="toc-etcinit-d-directories">/etc/init.d directories</h2>
<p>The /etc/init.d directory contains scripts, which can start / stop / restart services. These are invoked with a start|stop argument at startup and shutdown.</p>
<h2 id="toc-etcrcn-d-directories">/etc/rc<em>n</em>.d directories</h2>
<p>The /etc/rc<em>n</em>.d directories specify which scripts in /etc/init.d are enabled for run level <em>n</em>.</p>
<p>For example, /etc/rc2.d specifies which scripts in /etc/init.d are enabled for run level 2. At startup and shutdown, only these enabled scripts are invoked.</p>
<p>Entries in /etc/rc<em>n</em>.d directories are symlinks to scripts in /etc/init.d, but with a special prefix of the format</p>
<p>[S|K]nn</p>
<p>S means the script is enabled for this run level.</p>
<p>K means the script is disabled for this run level.</p>
<p>nn is a two-digit sequence number that controls the order in which services are started, so that a service which depends on other services starts only after they have.</p>
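<p>The prefix can be decoded mechanically; a small illustrative shell sketch:</p>

```shell
# Decode an /etc/rcn.d entry name into action, sequence and service name.
entry=S91apache2
action=$(case "$entry" in (S*) echo start ;; (K*) echo stop ;; esac)
num=$(echo "$entry" | cut -c2-3)
svc=$(echo "$entry" | cut -c4-)
echo "$svc: $action at sequence $num"
```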
<p>Below is a listing of /etc/rc2.d. It shows that tomcat6, dovecot and postfix are not <em>automatically</em> started in run level 2. However, they can be started manually.</p>
<pre class="brush: plain; title: ; notranslate">
K08tomcat6
K76dovecot
K80postfix
S20gpm
S20winbind
S50rsync
S70dns-clean
S70pppd-dns
S91apache2
S99grub-common
S99ondemand
S99rc.local
</pre>
<h2 id="toc-enabling-and-disabling-run-level-services">Enabling and disabling run level services</h2>
<p>Use the <strong>chkconfig --list </strong>command to get an overview of all services and their status. If not installed, install it using <strong>sudo apt-get install chkconfig. </strong>It gives a status listing like this:</p>
<pre class="brush: plain; title: ; notranslate">
karthik@ubuntukarmic:~$ chkconfig --list
acpi-support              0:off  1:off  2:on   3:on   4:on   5:on   6:off
acpid                     0:off  1:off  2:off  3:off  4:off  5:off  6:off
alsa-utils                0:off  1:off  2:off  3:off  4:off  5:off  6:off
...
</pre>
<p>Use the <strong>update-rc.d</strong> command to enable or disable a service at a run level:</p>
<p>Syntax<em>: sudo     update-rc.d     name    enable|disable    runlevel</em></p>
<p>Example:<strong> </strong><em>sudo update-rc.d dovecot disable 2<br />
</em></p>
<p>or</p>
<p><em>sudo update-rc.d dovecot defaults</em></p>
<p>&nbsp;</p>
<p>When creating new init scripts, ensure that the script has the following section (this is an example &#8211; change values appropriately) at the top to make it LSB (Linux Standard Base) compliant. Without this section, <strong>update-rc.d</strong> won&#8217;t work and will give a &#8220;missing LSB information&#8221; warning.</p>
<blockquote>
<pre>### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO</pre>
</blockquote>
<p>&nbsp;</p>
<hr />
<h1 id="toc-upstart">Upstart</h1>
<p>Upstart jobs are configured in /etc/init directory, in .conf files.</p>
<p>Use the service command to start and stop upstart services:</p>
<p><strong>sudo service </strong>&lt;servicename&gt; <strong>start|stop</strong></p>
<p>To disable an upstart service from starting at boot, open the respective /etc/init/[service].conf file and comment out the lines that begin with <strong>start on</strong>.</p>
<p>example:</p>
<pre class="brush: plain; title: ; notranslate">
...
#start on (net-device-up
#          and local-filesystems
#         and runlevel [2345])

...
</pre>
<p>This prevents the service from starting at boot, but still allows manual starts using the <strong>service</strong> command.</p>
<p>To disable a service completely &#8211; from both automatic and manual starts &#8211; it&#8217;s better to uninstall the package, but it&#8217;s also possible to just rename the .conf file to .conf.disabled.</p>
<hr />
<h1 id="toc-resources-for-further-reading">Resources for further reading</h1>
<ul>
<li><a title="http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347" href="http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347">http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347</a> &#8211; this post from an Ubuntu developer explains in detail the history behind the init.d mechanism, its problems, and how the new Upstart mechanism solves them.</li>
<li><a href="http://www.yolinux.com/TUTORIALS/LinuxTutorialInitProcess.html">http://www.yolinux.com/TUTORIALS/LinuxTutorialInitProcess.html</a></li>
<li><a href="http://www.linux-tutorial.info/modules.php?name=MContent&amp;pageid=67">http://www.linux-tutorial.info/modules.php?name=MContent&amp;pageid=67</a></li>
<li><a href="http://oldfield.wattle.id.au/luv/boot.html">http://oldfield.wattle.id.au/luv/boot.html</a> &#8211; The Linux boot process.</li>
<li><a href="http://manpages.ubuntu.com/manpages/hardy/man8/update-rc.d.8.html">http://manpages.ubuntu.com/manpages/hardy/man8/update-rc.d.8.html</a></li>
<li><a href="http://upstart.ubuntu.com/cookbook/#what-is-upstart">http://upstart.ubuntu.com/cookbook/#what-is-upstart</a> &#8211; From the Upstart Cookbook</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Extraction in Solr</title>
		<link>http://www.pathbreak.com/blog/content-extraction-in-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=content-extraction-in-solr</link>
		<comments>http://www.pathbreak.com/blog/content-extraction-in-solr#comments</comments>
		<pubDate>Sun, 28 Nov 2010 04:28:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Apache Tika]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Solr Cell]]></category>
		<category><![CDATA[solr content extraction]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/content-extraction-in-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Howto Restrictions of default content extraction Overview The example solrconfig.xml is already configured for content extraction from any document format &#8211; like MS Word DOC, PDF, – which can be handled by Apache Tika. Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-howto">Howto</a></li>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-restrictions-of-default-content-extraction">Restrictions of default content extraction</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>The example solrconfig.xml is already configured for content extraction from any document format that Apache Tika can handle &#8211; such as MS Word DOC or PDF. </p>
<p>Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika and Apache POI libraries.</p>
<p>The <strong>ExtractingRequestHandler</strong> configuration in solrconfig.xml specifies the endpoint at which documents can be submitted for extraction. It&#8217;s usually <a href="http://localhost:8983/solr/update/extract">http://localhost:8983/solr/update/extract</a>.</p>
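<p>For reference, the relevant solrconfig.xml entry looks roughly like the following sketch; the exact class name and defaults may differ between Solr versions:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;!-- Maps the /update/extract endpoint to the Solr Cell extracting handler --&gt;
&lt;requestHandler name=&quot;/update/extract&quot; class=&quot;org.apache.solr.handler.extraction.ExtractingRequestHandler&quot;&gt;
  &lt;lst name=&quot;defaults&quot;&gt;
    &lt;!-- Field that receives the extracted text --&gt;
    &lt;str name=&quot;fmap.content&quot;&gt;text&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;
</pre>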
<p>&#160;</p>
<h1 id="toc-howto">Howto</h1>
<ul>
<li>To index a document, send the request as</li>
</ul>
<blockquote><p>curl &quot;http://localhost:8983/solr/update/extract?literal.id=book1&amp;commit=true&quot; -F myfile=@book.pdf</p>
</blockquote>
<p>The request is sent as a multi-part form upload.</p>
<ul>
<li>By default, document contents are added to the document field “text”. The field can be changed in /solr/conf/solrconfig.xml in the extracting handler’s &lt;requestHandler&gt; element; its child element “fmap.content” specifies which field the content should be indexed under.</li>
<blockquote><p>&lt;str name=”fmap.content”&gt;text&lt;/str&gt;</p>
</blockquote>
</ul>
<p>Since “text” is NOT a stored field, features like result highlighting won’t be available.</p>
<p>If result highlighting is required, modify /solr/conf/schema.xml to include a new <em>stored</em> field called “doc_content” which receives document contents from the extracting handler. “doc_content” itself can be copied into the “text” catch-all field so that all queries can be matched against document contents.</p>
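<p>A minimal sketch of such a schema.xml change follows; the field name “doc_content” is just the example used above, and attribute details may vary with the schema version:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;!-- Stored copy of extracted document contents, so highlighting can use it --&gt;
&lt;field name=&quot;doc_content&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;true&quot; multiValued=&quot;true&quot;/&gt;

&lt;!-- Make document contents searchable through the catch-all field too --&gt;
&lt;copyField source=&quot;doc_content&quot; dest=&quot;text&quot;/&gt;
</pre>
<p>The extracting handler&#8217;s fmap.content should then be changed to map to “doc_content” instead of “text”.</p>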
<p>&#160;</p>
<h1 id="toc-restrictions-of-default-content-extraction">Restrictions of default content extraction</h1>
<ul>
<li>Since the extracting handler can specify only a single content field, contents of multiple files will all go into the same content field. This is a problem if the file containing the search string has to be indicated to the user. </li>
<li>There is no out-of-the-box workaround for this in Solr. A specialized extracting handler has to be written to map each file (“content stream” in Solr terminology) in the multipart request to a separate content field.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/content-extraction-in-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr search data modelling</title>
		<link>http://www.pathbreak.com/blog/solr-search-data-modelling?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=solr-search-data-modelling</link>
		<comments>http://www.pathbreak.com/blog/solr-search-data-modelling#comments</comments>
		<pubDate>Sun, 28 Nov 2010 04:09:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr fields]]></category>
		<category><![CDATA[solr modelling]]></category>
		<category><![CDATA[solr schema.xml]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/solr-search-data-modelling</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x and 3.3.x Contents Overview &#60;types&#62; section Basic field types &#60;fields&#62; section Overview Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr. The schema.xml in /solr/conf is where the application search model should be defined. The &#60;types&#62; element defines the [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x and 3.3.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-types-section">&lt;types&gt; section</a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-basic-field-types">Basic field types </a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-fields-section">&lt;fields&gt; section</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr.</p>
<p>The <strong>schema.xml</strong> in /solr/conf is where the application search model should be defined. </p>
<p>The <strong>&lt;types&gt;</strong> element defines the set of field types available in the model. </p>
<p>The <strong>&lt;fields&gt;</strong> element defines the set of fields of each document in the model. Each field has a type which is defined in the <strong>&lt;types&gt;</strong> element. </p>
<p>&#160;</p>
<h1 id="toc-types-section">&lt;types&gt; section</h1>
<p> This section describes the types for all fields in the model. It contains <strong>&lt;fieldType&gt;</strong> elements. Each <strong>&lt;fieldType&gt;</strong> has these attributes:
<ul>
<li><strong>name</strong> is the name of the field type definition and is referred from the &lt;fields&gt; section </li>
<li><strong>class</strong> is the subclass of <strong>org.apache.solr.schema.FieldType</strong> that models this field type definition.&#160; Class names starting with &quot;solr&quot; refer to java classes in the org.apache.solr.analysis package. </li>
<li><strong>sortMissingLast</strong> and <strong>sortMissingFirst </strong><br />
<blockquote><p>The optional sortMissingLast and sortMissingFirst attributes are currently supported on types that are sorted internally as strings. This includes &quot;string&quot;,&quot;boolean&quot;,&quot;sint&quot;,&quot;slong&quot;,&quot;sfloat&quot;,&quot;sdouble&quot;,&quot;pdate&quot;.<br />
- If sortMissingLast=&quot;true&quot;, then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).<br />
- If sortMissingFirst=&quot;true&quot;, then a sort on this field will cause documents without the field to come before documents with the field, regardless of the requested sort order.<br />
- If sortMissingLast=&quot;false&quot; and sortMissingFirst=&quot;false&quot; (the default), then default lucene sorting will be used which places docs without the field first in an ascending sort and last in a descending sort.</p></blockquote>
</li>
<li><strong>omitNorms </strong>is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms. </li>
</ul>
<p>Each field type definition has an associated <strong>Analyzer</strong> to tokenize and filter characters or tokens. </p>
<p>The <strong>Trie </strong>field types are suitable for numeric fields that involve numeric range queries. The trie concept makes searching such fields faster. </p>
<p>&#160;</p>
<h1 id="toc-basic-field-types">Basic field types </h1>
<table style="table-layout: fixed" border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="20%">string</td>
<td valign="top" width="80%">Fields of this type are not analyzed (ie, not tokenized or filtered), but are indexed and stored verbatim.</td>
</tr>
<tr>
<td valign="top" width="20%">binary</td>
<td valign="top" width="80%">For binary data. Should be sent/retrieved as Base64 encoded strings.</td>
</tr>
<tr>
<td valign="top" width="20%">int/tint/pint<br />long/tlong/plong<br />float/tfloat/pfloat<br />double/tdouble/pdouble</td>
<td valign="top" width="80%">The regular types (int, float, etc.) and their t- versions differ in their precisionStep values. The precisionStep value is used to generate indexes at different precision levels, to support numeric range queries. Both sets are modelled by TrieField types, but the t- versions have a precisionStep of 8 while the regular types have 0. So numeric range queries will be faster with the t- versions, but indexes will be larger (and probably slower). The p- versions are for when numeric range queries are not needed at all; they are modelled by non-Trie types.</td>
</tr>
<tr>
<td valign="top" width="20%">date/tdate/pdate</td>
<td valign="top" width="80%">Similar to the above differences among numeric fields. Use tdate for date ranges and date faceting. Dates have to be in a special UTC timezone format, like this example: <strong>2011-02-06T05:34:00.299Z</strong>. Use <strong>org.apache.solr.common.util.DateUtil</strong>.<em>getThreadLocalDateFormat</em>().format(new Date()) to get a date in this format.</td>
</tr>
<tr>
<td valign="top" width="20%">sint/slong/         <br />sfloat/sdouble</td>
<td valign="top" width="80%">Sortable fields</td>
</tr>
</tbody>
</table>
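<p>The date format mentioned above can also be produced with plain JDK classes &#8211; a minimal sketch, useful on the client side when Solr&#8217;s DateUtil isn&#8217;t handy:</p>
<pre class="brush: java; title: ; notranslate">
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
    // Produces the UTC format Solr expects, e.g. 2011-02-06T05:34:00.299Z
    public static String toSolrDate(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(new Date()));
    }
}
</pre>
<p>Note that SimpleDateFormat isn&#8217;t thread-safe &#8211; which is why DateUtil&#8217;s variant is handed out per thread.</p>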
<p><strong>Text field types</strong> &#8211; Since Solr is a full-text search solution, the text field types and their configuration are the most critical part of the modelling. Modelling of text fields is explained in detail in the article <a href="solr-text-field-types-analyzers-tokenizers-filters-explained" target="_blank">Solr text field types, analyzers, tokenizers &amp; filters explained</a>. </p>
<p>&#160;</p>
<h1 id="toc-fields-section">&lt;fields&gt; section</h1>
<p>Fields of documents are described in this section using <strong>&lt;field&gt;</strong> elements. </p>
<p>Each <strong>&lt;field&gt;</strong> element can have these attributes: </p>
<table style="table-layout: fixed" border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="20%">name</td>
<td valign="top" width="80%">(mandatory) the name of the field, used in search queries and facet fields.</td>
</tr>
<tr>
<td valign="top" width="20%">type</td>
<td valign="top" width="80%">(mandatory) the name of a previously defined type from the &lt;types&gt; section</td>
</tr>
<tr>
<td valign="top" width="20%">indexed</td>
<td valign="top" width="80%">true if this field should be indexed (should be searchable or sortable)</td>
</tr>
<tr>
<td valign="top" width="20%">stored</td>
<td valign="top" width="80%">true if this field value should be retrievable verbatim in search results.</td>
</tr>
<tr>
<td valign="top" width="20%">compressed</td>
<td valign="top" width="80%">[false] if this field should be stored using gzip compression (this only applies if the field type is compressible; among the standard field types, only TextField and StrField are). This is very useful for large data fields, but will probably slow down retrieval of search results &#8211; so it should not be used for fields that are queried frequently.</td>
</tr>
<tr>
<td valign="top" width="20%">multiValued</td>
<td valign="top" width="80%">true if this field may contain multiple values per document</td>
</tr>
<tr>
<td valign="top" width="20%">omitNorms</td>
<td valign="top" width="80%">(expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory).&#160; Only full-text fields or fields that need an index-time boost need norms.</td>
</tr>
<tr>
<td valign="top" width="20%">termVectors</td>
<td valign="top" width="80%">[false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.</td>
</tr>
<tr>
<td valign="top" width="20%">termPositions</td>
<td valign="top" width="80%">Store position information with the term vector.&#160; This will increase storage costs.</td>
</tr>
<tr>
<td valign="top" width="20%">termOffsets</td>
<td valign="top" width="80%">Store offset information with the term vector. This will increase storage costs.</td>
</tr>
<tr>
<td valign="top" width="20%">default</td>
<td valign="top" width="80%">a value that should be used if no value is specified when adding a document.</td>
</tr>
</tbody>
</table>
<p>The example deployment itself defines many commonly used fields and types; study them and check if something needed is already available before modelling your own. </p>
<p><strong>&lt;dynamicField&gt;</strong> elements can be used to model field names which are not explicitly defined by name, but which match some defined pattern. </p>
<p><strong>&lt;copyField&gt;</strong> definitions specify that one field should be copied to another at the time a document is added to the index. It&#8217;s used either to index the same field differently, or to merge multiple fields into one for easier/faster searching. For example, all text fields in the document can be copied to a single catch-all field, for faster querying. </p>
<p><strong>&lt;uniqueKey&gt;</strong> element specifies the field to be used to determine and enforce document uniqueness. </p>
<p><strong>&lt;defaultSearchField&gt;</strong> element specifies the field to be queried when it’s not explicitly specified in the query string using a “field:value” syntax. The catch-all copyfield is usually specified as the default search field. </p>
<p><strong>&lt;solrQueryParser&gt;</strong> specifies query parser configuration. <strong>defaultOperator=”AND|OR” </strong>specifies whether query terms are combined using the AND operator or the OR operator. </p>
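<p>A small schema.xml sketch tying these elements together might look like this; the field names are illustrative only, and attribute support varies between Solr versions:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;fields&gt;
  &lt;!-- unique document id, stored so it can be returned in results --&gt;
  &lt;field name=&quot;id&quot; type=&quot;string&quot; indexed=&quot;true&quot; stored=&quot;true&quot; required=&quot;true&quot;/&gt;
  &lt;field name=&quot;name&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;true&quot;/&gt;
  &lt;!-- catch-all field, indexed but not stored --&gt;
  &lt;field name=&quot;text&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;false&quot; multiValued=&quot;true&quot;/&gt;
  &lt;!-- any field whose name ends in _i is treated as an int --&gt;
  &lt;dynamicField name=&quot;*_i&quot; type=&quot;int&quot; indexed=&quot;true&quot; stored=&quot;true&quot;/&gt;
&lt;/fields&gt;

&lt;copyField source=&quot;name&quot; dest=&quot;text&quot;/&gt;
&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;
&lt;defaultSearchField&gt;text&lt;/defaultSearchField&gt;
</pre>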
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/solr-search-data-modelling/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Faceting &#8211; or drilldown &#8211; search using Solr</title>
		<link>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=faceting-or-drilldown-search-using-solr</link>
		<comments>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#comments</comments>
		<pubDate>Fri, 26 Nov 2010 18:37:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[drilldown search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr faceting]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Steps Facet filter query syntax Handling large number of facet values using pagination Facet Query vs Filter Query of facet Understanding facet counts Overview Faceted searching &#8211; also called drilldown searching &#8211; refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-steps">Steps</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-facet-filter-query-syntax">Facet filter query syntax</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-handling-large-number-of-facet-values-using-pagination">Handling large number of facet values using pagination</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-facet-query-vs-filter-query-of-facet">Facet Query vs Filter Query of facet</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-undestanding-facet-counts">Understanding facet counts</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>Faceted searching &#8211; also called drilldown searching &#8211; refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like Amazon and eBay provide this in their search pages. </p>
<p>Solr has excellent support for faceting. The sections below describe how to use faceting in java applications, using the solrj client API.</p>
<p>&#160;</p>
<h1 id="toc-steps">Steps</h1>
<p><strong>Step 1 : Do the first level search and get first level facets</strong></p>
<pre class="brush: java; title: ; notranslate">
SolrQuery qry = new SolrQuery(strQuery);
String[] fetchFacetFields = new String[]{&quot;categories&quot;};
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry);

QueryResponse resp = qryReq.process(solrServer);

SolrDocumentList results = resp.getResults();
int count = results.size();
System.out.println(count + &quot; hits&quot;);
for (int i = 0; i &lt; count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println(&quot;#&quot; + (i+1) + &quot;:&quot; + hitDoc.getFieldValue(&quot;name&quot;));
    for (Iterator&lt;Entry&lt;String, Object&gt;&gt; flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry&lt;String, Object&gt; entry = flditer.next();
        System.out.println(entry.getKey() + &quot;: &quot; + entry.getValue());
    }
}

List&lt;FacetField&gt; facetFields = resp.getFacetFields();
for (int i = 0; i &lt; facetFields.size(); i++) {
    FacetField facetField = facetFields.get(i);
    List&lt;Count&gt; facetInfo = facetField.getValues();
    for (FacetField.Count facetInstance : facetInfo) {
        System.out.println(facetInstance.getName() + &quot; : &quot; + facetInstance.getCount() + &quot; [drilldown qry:&quot; + facetInstance.getAsFilterQuery() + &quot;]&quot;);
    }
}
</pre>
<p>&#160;</p>
<p>The response will contain details of number of hits for each instance of the facet. </p>
<p>For example, if the field <strong>categories</strong> has values <strong>movies</strong> and <strong>songs</strong> in the set of matching hits, then each of them is called a <strong>facet instance.</strong>&#160; </p>
<p>Each facet instance of a FacetField has a name (“songs”), and each has an associated facet instance count and a filter query. </p>
<p>Facet instance count of 10 for “categories:songs” means in the set of all search results, 10 results have the value of <strong>categories</strong> as <strong>songs</strong>. </p>
<p>Facet instance filter query is the subquery to go down to the next level of drilldown search, by filtering on the facet instance value. </p>
<p>At this point in a typical drilldown search user interface, the left sidebar with all the filters would display those facet instances that have a nonzero instance count, with checkboxes and their respective counts. The user can then select the most promising facet to drill down along by checking its checkbox.</p>
<p>&#160;</p>
<p><strong>Step 2: Add facet filter query for next level of refined results</strong></p>
<p>Add the filter query of facet instance to the main query, using <strong>addFilterQuery.</strong> </p>
<p>The filter query for a single facet instance is of the format “&lt;field&gt;:&lt;value&gt;”. Example: addFilterQuery(“categories:movies”); </p>
<pre class="brush: java; title: ; notranslate">
// filterQueries is a String[] of facet filter queries obtained using getAsFilterQuery() from the previous search
SolrQuery qry = new SolrQuery(strQuery);
if (filterQueries != null) {
    for (String fq : filterQueries) {
        qry.addFilterQuery(fq);
    }
}
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);
</pre>
<p>For subsequent levels of refinement, add facet instance filter queries to the current level’s main query, and add the list of facet fields required for the next level. </p>
<p>&#160;</p>
<h1 id="toc-facet-filter-query-syntax">Facet filter query syntax</h1>
<p>Facet filter queries have some rather intricate syntax for achieving various search behaviours, described below.</p>
<p>&#160;</p>
<p><strong>Selecting multiple facets</strong></p>
<p>In some drilldown search designs, a user is allowed to specify multiple facet instances for the <em>same </em>field. For example, a <strong>categories </strong>field may have multiple category facet instances. In such cases, the facet instances should be combined using an OR operator. </p>
<blockquote><p><span style="text-decoration: underline">Categories</span></p>
<p>[ ] Movies (300)</p>
<p>[ ] Songs (400)</p>
<p>[ ] Ads (150)</p>
</blockquote>
<p>&#160;</p>
<p>If the user selects “Movies” and “Songs”, the filter query should have the semantics of an OR operator &#8211; </p>
<p>“..where category=movies OR category=songs”. </p>
<p>This can be specified in solr filter queries by enclosing the facet instances inside parentheses: </p>
<blockquote><p>&lt;fqfield&gt;:<strong><span style="color: #ff0000">(</span></strong>value1 value2 value3…<strong><span style="color: #ff0000">)</span></strong> </p>
</blockquote>
<p>examples: </p>
<p><span style="color: #666666">In command line URL : </span></p>
<blockquote><p><span style="color: #666666"><em>fq=categories%3A%28songs+movies%29</em></span> </p>
</blockquote>
<p><span style="color: #666666">where %3A is the character ‘:’, %28 is ‘(’ and %29 is ‘)’</span> </p>
<p><span style="color: #666666">OR, equivalently</span> </p>
<p><span style="color: #666666">In java</span></p>
<blockquote><p><span style="color: #666666">qry.<em>addFilterQuery(“categories:(songs movies)”);</em></span> </p>
</blockquote>
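<p>The URL encoding above needn&#8217;t be done by hand; as a sketch, the JDK&#8217;s URLEncoder produces exactly these escapes:</p>
<pre class="brush: java; title: ; notranslate">
import java.net.URLEncoder;

public class FacetFilterQueryEncoder {
    // Percent-encodes a filter query for use as an fq parameter value in a Solr URL
    public static String encode(String filterQuery) throws Exception {
        return URLEncoder.encode(filterQuery, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // prints categories%3A%28songs+movies%29
        System.out.println(encode("categories:(songs movies)"));
    }
}
</pre>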
<p><span style="color: #666666"><strong>Whitespaces in facet instances</strong></span></p>
<p><span style="color: #666666">If facet instances have whitespace within them, then each facet instance should simply be enclosed in double quotes (%22).</span></p>
<p><span style="color: #666666"></span><span style="color: #666666">For example, for a facet field &quot;crn&quot; with facet instances “M.Tech. Computer Sc. &amp; Engg.” and “ELECTRICAL ENGINEERING” (note the whitespaces), the syntax: </span></p>
<p>In URLs: </p>
<blockquote><p><em>fq=crn%3A%28<span style="color: #ff0000">%22M.Tech.+Computer+Sc.+%26+Engg.%22</span>+<span style="color: #ff0000">%22ELECTRICAL+ENGINEERING%22</span>%29</em> </p>
</blockquote>
<p>OR </p>
<p>In Java:</p>
<blockquote><p>qry.<em>addFilterQuery(&quot;crn:(\&quot;M.Tech. Computer Sc. &amp; Engg.\&quot; \&quot;ELECTRICAL ENGINEERING\&quot;)&quot;);</em> </p>
</blockquote>
<p>&#160;</p>
<p>&#160;</p>
<h1 id="toc-handling-large-number-of-facet-values-using-pagination">Handling large number of facet values using pagination</h1>
<p>Solr provides pagination for facet values and automatically imposes a limit on the number of values returned for each facet field. This limit can be set using the <strong>facet.limit</strong> query parameter, or <em>setFacetLimit()</em> API, and the facet value offset can be set using <strong>facet.offset </strong>query parameter. </p>
<p>However, there is no direct API like <em>setFacetOffset() </em>in SolrJ; instead, use </p>
<blockquote><p><em>solrQry.add(FacetParams.FACET_OFFSET, “100”)</em></p></blockquote>
<p>&#160;</p>
<p>&#160;</p>
<h1 id="toc-facet-query-vs-filter-query-of-facet">Facet Query vs Filter Query of facet</h1>
<p>The Solr API also contains methods that refer to &quot;facet queries&quot;. It&#8217;s important not to confuse facet queries with filter queries of facets. At first glance, it looks like the facet query concept is what provides the drilldown capability, but that&#8217;s not so in the general case. </p>
<p><em><u>Facet query</u></em> is a kind of dynamic facet field, applicable only to certain use cases where it makes sense to categorize items into <strong>ranges</strong> &#8211; either numerical or date ranges. </p>
<p>For example, if items have to be categorized into price ranges like [$100-$200], [$200-$300] etc, then <em><u>facet queries</u></em> have to be used to “get the count of all items whose price&gt;$100 and price&lt;$200”. Just specifying the price field as a facet field would not be useful here, because it just returns the list of all unique prices available in the search results. What really provides the drilldown capabilities in this case is the facet query concept. </p>
<p>Facet queries are specified using the syntax <strong>field:[start TO end].</strong> In URL, it should go in encoded format : </p>
<blockquote><p><em>facet.query=age:[20+TO+22]</em></p></blockquote>
<p> In API, it’s specified as<br />
<blockquote>solrQuery.addFacetQuery(“age:[20 TO 22]”);</p></blockquote>
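<p>Generating such range facet queries for consecutive buckets is simple string building; the sketch below uses a hypothetical “price” field, and each generated string would be passed to addFacetQuery():</p>
<pre class="brush: java; title: ; notranslate">
public class RangeFacetQueries {
    // Builds facet.query strings like price:[100 TO 200] for consecutive buckets
    public static String[] buildRangeQueries(String field, int start, int end, int step) {
        int n = (end - start) / step;
        String[] queries = new String[n];
        for (int i = 0; i != n; i++) {
            int lo = start + i * step;
            queries[i] = field + ":[" + lo + " TO " + (lo + step) + "]";
        }
        return queries;
    }

    public static void main(String[] args) {
        // prints price:[100 TO 200], price:[200 TO 300], price:[300 TO 400]
        for (String q : buildRangeQueries("price", 100, 400, 100)) {
            System.out.println(q);
        }
    }
}
</pre>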
<p>&#160;</p>
<h1 id="toc-undestanding-facet-counts">Understanding facet counts</h1>
<p>The facet counts are always in the context of the set of search results of main query + filter queries. <a href="wp-content/uploads/2010/12/image1.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2010/12/image-thumb1.png" width="543" height="346" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Embedded Solr</title>
		<link>http://www.pathbreak.com/blog/embedded-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=embedded-solr</link>
		<comments>http://www.pathbreak.com/blog/embedded-solr#comments</comments>
		<pubDate>Thu, 25 Nov 2010 19:14:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[embedded solr]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/embedded-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x A java application running in a JVM can use the EmbeddedSolrServer to host Solr in the same JVM. Following snippet shows how to use it:]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
</div>
<p>A java application running in a JVM can use the <strong>EmbeddedSolrServer</strong> to host Solr in the same JVM. </p>
<p>Following snippet shows how to use it:</p>
<pre class="brush: java; title: ; notranslate">
public class EmbeddedServerExplorer {
    public static void main(String[] args) {
        try {
            // Set &quot;solr.solr.home&quot; to the directory under which /conf and /data are present.
            System.setProperty(&quot;solr.solr.home&quot;, &quot;solr&quot;);
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer coreContainer = initializer.initialize();
            EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, &quot;&quot;);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField(&quot;id&quot;, &quot;embeddedDoc1&quot;);
            doc.addField(&quot;name&quot;, &quot;test embedded server&quot;);
            server.add(doc);
            server.commit();
            coreContainer.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/embedded-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Solr from java applications with SolrJ</title>
		<link>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-solr-from-java-applications-with-solrj</link>
		<comments>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#comments</comments>
		<pubDate>Tue, 23 Nov 2010 14:39:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solrj]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Important classes Setup the client connection to server Add or update document(s) Commit changes Send a search query Handle search results Overview SolrJ provides java wrappers and adaptors to communicate with Solr and translate its results to java objects. Using SolrJ is much more convenient than using raw HTTP [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-important-classes">Important classes</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-setup-the-client-connection-to-server">Setup the client connection to server</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-add-or-update-documents">Add or update document(s)</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-commit-changes">Commit changes</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-send-a-search-query">Send a search query</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-handle-search-results">Handle search results</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>SolrJ provides java wrappers and adaptors to communicate with Solr and translate its results to java objects. Using SolrJ is much more convenient than using raw HTTP and JSON. Internally, SolrJ uses Apache HttpClient to send HTTP requests.</p>
<p>&#160;</p>
<h1 id="toc-important-classes">Important classes</h1>
<p>SolrJ API is fairly simple and intuitive. The diagram below shows important SolrJ classes.</p>
<p><a href="wp-content/uploads/2010/12/image2.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2010/12/image-thumb2.png" width="543" height="650" /></a> </p>
<p>&#160;</p>
<h1 id="toc-setup-the-client-connection-to-server">Setup the client connection to server</h1>
<pre class="brush: java; title: ; notranslate">
solrServer = new CommonsHttpSolrServer(&quot;http://localhost:8983/solr&quot;);
solrServer.setParser(new XMLResponseParser());
</pre>
<p>The response parser in the Java client API can be either XML or binary. In other language APIs, JSON is also possible. </p>
<p>&#160;</p>
<h1 id="toc-add-or-update-documents">Add or update document(s)</h1>
<pre class="brush: java; title: ; notranslate">
SolrInputDocument doc = new SolrInputDocument();
// Add fields. The field names should match fields defined in schema.xml
doc.addField(FLD_ID, docId++);
try {
    solrServer.add(doc);
    return true;
} catch (Exception e) {
    LOG.error(&quot;addItem error&quot;, e);
    return false;
}
</pre>
<h1 id="toc-commit-changes">Commit changes</h1>
<p>For best performance, commit changes only after all documents &#8211; or a batch of reasonable size &#8211; have been added or updated. </p>
<pre class="brush: java; title: ; notranslate">
solrServer.commit();
</pre>
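The batching advice above can be illustrated with a small, self-contained sketch. Plain Java stands in for the SolrJ calls here: the flush step marked in comments is where `solrServer.add(batch)` followed by `solrServer.commit()` would go, and `BATCH_SIZE` is an illustrative number, not a SolrJ constant.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Illustrative batch size; tune it to your document size and heap.
    static final int BATCH_SIZE = 100;

    // Counts how many flushes (add + commit round-trips) a batched
    // strategy needs. Compare with one commit per document.
    static int countFlushes(int totalDocs) {
        List<Integer> batch = new ArrayList<>();
        int flushes = 0;
        for (int i = 0; i < totalDocs; i++) {
            batch.add(i); // stands in for batch.add(solrInputDocument)
            if (batch.size() == BATCH_SIZE) {
                flushes++; // here: solrServer.add(batch); solrServer.commit();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            flushes++;     // flush the final partial batch
        }
        return flushes;
    }

    public static void main(String[] args) {
        // 250 documents need only 3 commits instead of 250
        System.out.println(countFlushes(250));
    }
}
```

Committing 250 documents in batches of 100 costs three round-trips rather than 250, which is where the performance benefit comes from.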
<h1 id="toc-send-a-search-query">Send a search query</h1>
<pre class="brush: java; title: ; notranslate">
SolrQuery qry = new SolrQuery(&quot;name:video&quot;);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
qry.setRows(100);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);
</pre>
<p>SolrQuery.setRows() specifies how many results to return in the response. The actual count of all hits may be much higher. If “field:” is omitted from query string, then the field specified by <strong>&lt;defaultSearchField&gt;</strong> in schema.xml is searched. </p>
<h1 id="toc-handle-search-results">Handle search results</h1>
<pre class="brush: java; title: ; notranslate">
SolrDocumentList results = resp.getResults();
System.out.println(results.getNumFound() + &quot; total hits&quot;);
int count = results.size();
System.out.println(count + &quot; received hits&quot;);
for (int i = 0; i &lt; count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println(&quot;#&quot; + (i+1) + &quot;:&quot; + hitDoc.getFieldValue(&quot;name&quot;));
    for (Iterator&lt;Entry&lt;String, Object&gt;&gt; flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry&lt;String, Object&gt; entry = flditer.next();
        System.out.println(entry.getKey() + &quot;: &quot; + entry.getValue());
    }
}
</pre>
<p>SolrDocumentList.getNumFound() is the total number of hits in the index, but each response returns only as many results as specified by SolrQuery.setRows(). Together, these two values can be used to implement pagination.</p>
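As a sketch of that pagination arithmetic in plain Java: the `numFound` and `rows` values below are hypothetical, standing in for `results.getNumFound()` and the `setRows()` argument, and the computed offset is what you would pass to `SolrQuery.setStart()`.

```java
public class SolrPagination {
    // Number of pages needed to show numFound hits at `rows` hits per page.
    static long pageCount(long numFound, int rows) {
        return (numFound + rows - 1) / rows; // ceiling division
    }

    // Start offset for a 0-based page index; pass to SolrQuery.setStart().
    static long startOffset(long page, int rows) {
        return page * rows;
    }

    public static void main(String[] args) {
        long numFound = 1042;  // hypothetical total from getNumFound()
        int rows = 100;        // matches setRows(100) in the example above
        System.out.println(pageCount(numFound, rows)); // 11 pages in all
        System.out.println(startOffset(3, rows));      // page 4 starts at offset 300
    }
}
```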
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Getting started with Solr</title>
		<link>http://www.pathbreak.com/blog/getting-started-with-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=getting-started-with-solr</link>
		<comments>http://www.pathbreak.com/blog/getting-started-with-solr#comments</comments>
		<pubDate>Tue, 23 Nov 2010 09:23:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr deployment]]></category>
		<category><![CDATA[solr multicore]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/directory-layout-of-solr-package</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Introduction Directory layout of Solr package Getting Started Guide Managing solr server with ant during development Customizing Solr installation Multicore configuration and deployment Using Solr from command line Boolean operators in search queries Introduction Apache Solr is a full fledged, search server based on the Lucene toolkit. Lucene provides the [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-introduction">Introduction</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-directory-layout-of-solr-package">Directory layout of Solr package</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-getting-started-guide">Getting Started Guide</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-managing-solr-server-with-ant-during-development">Managing solr server with ant during development</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-customizing-solr-installation">Customizing Solr installation</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-multicore-configuration-and-deployment">Multicore configuration and deployment</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-using-solr-from-command-line">Using Solr from command line</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-boolean-operators-in-search-queries">Boolean operators in search queries</a></li>
</ol>
</div>
</div>
<h1 id="toc-introduction">Introduction</h1>
<p><a href="http://lucene.apache.org/solr/">Apache Solr</a> is a full fledged, search server based on the Lucene toolkit. </p>
<p>Lucene provides the core search algorithms and index storage required by those algorithms. Most basic search requirements can be fulfilled by Lucene itself without requiring Solr. But using plain Lucene has some drawbacks in development and non functional aspects, forcing development teams to cover these in their designs. This is where Solr adds value. </p>
<p>Solr provides these benefits over using the raw Lucene toolkit:</p>
<ul>
<li>Solr allows search behaviour to be configured through configuration files, rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems. </li>
<li>Lucene is java centric (but also has ports to other languages). Solr however provides a HTTP interface that allows any platform to use it. Projects that involve multiple languages or platforms can use the same solr server.</li>
<li>Solr provides an out-of-the-box <strong>faceted search</strong> (also called drilldown search) facility, that allows users to incrementally refine results using filters and &quot;drilldown&quot; towards a narrow set of best matches. Many shopping web portals use this feature to allow their users to incrementally refine their results.</li>
<li>Solr’s query syntax is slightly easier than Lucene’s. Either a default field can be specified, or Solr’s own dismax syntax, which searches a fixed set of fields, can be used.</li>
<li>Solr’s java client API is much simpler and easier than Lucene’s. Solr abstracts away many of the underlying Lucene concepts.</li>
<li>Unlike Lucene, Solr provides a straightforward API for adding, updating, and deleting documents.</li>
<li>Solr supports a pluggable architecture. For example, post-processor plugins (example: search results highlighting) allow raw results to be modified.</li>
<li>Solr facilitates scalability using solutions like caching, memory tweaking, clustering, sharding and load balancing.</li>
<li>Solr provides plugins to fetch database data and index them. This workflow is probably the most common requirement for any search implementation, and solr provides it out-of-the-box.</li>
</ul>
<p>The following sections describe basics of deploying Solr and using it from command line.</p>
<p>&#160;</p>
<h1 id="toc-directory-layout-of-solr-package">Directory layout of Solr package</h1>
<p>Extracted Solr package has this layout:</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="25%">/client</td>
<td valign="top" width="75%">Contains client APIs in different languages to talk to a Solr server</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/clustering</td>
<td valign="top" width="75%">Plugin that provides clustering capabilities for Solr, using Carrot2 clustering framework</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/dataimporthandler</td>
<td valign="top" width="75%">Plugin that is useful for indexing data in databases</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/extraction</td>
<td valign="top" width="75%">Plugin that is useful for extracting text from PDFs, Word DOCs, etc.</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/velocity</td>
<td valign="top" width="75%">Handler to present and manipulate search results using velocity templates.</td>
</tr>
<tr>
<td valign="top" width="25%">/dist</td>
<td valign="top" width="75%">Contains Solr core jars and wars that can be deployed in servlet containers or elsewhere, and the solrj client API for java clients.</td>
</tr>
<tr>
<td valign="top" width="25%">/dist/solrj-lib</td>
<td valign="top" width="75%">Libraries required by solrj client API .</td>
</tr>
<tr>
<td valign="top" width="25%">/docs</td>
<td valign="top" width="75%">Offline documentation and javadocs</td>
</tr>
<tr>
<td valign="top" width="25%">/lib</td>
<td valign="top" width="75%">Contains Lucene and other jars required by Solr</td>
</tr>
<tr>
<td valign="top" width="25%">/src</td>
<td valign="top" width="75%">Source code</td>
</tr>
<tr>
<td valign="top" width="25%">/<strong>example</strong></td>
<td valign="top" width="75%"><strong>A skeleton standalone solr server deplyment. Default environment is Jetty. When deploying Solr, this is the directory that&#8217;s customized and deployed.</strong></td>
</tr>
<tr>
<td valign="top" width="25%">/example/etc</td>
<td valign="top" width="75%">Jetty or other environment specific configuration files go here</td>
</tr>
<tr>
<td valign="top" width="25%">/example/example-DIH</td>
<td valign="top" width="75%">An example DB and the Data Import Handler plugin configuration to index that DB</td>
</tr>
<tr>
<td valign="top" width="25%">/example/exampledocs</td>
<td valign="top" width="75%">Example XML request files to send to Solr server. Usage: java –jar post.jar &lt;xml filename&gt;</td>
</tr>
<tr>
<td valign="top" width="25%">/example/lib</td>
<td valign="top" width="75%">Jetty and servlet libraries. Not required if Solr is being deployed in a different environment</td>
</tr>
<tr>
<td valign="top" width="25%">/example/logs</td>
<td valign="top" width="75%">Solr request logs</td>
</tr>
<tr>
<td valign="top" width="25%">/example/multicore</td>
<td valign="top" width="75%">It’s possible to host multiple search cores in the same environment. Use case could be separate indexes for different categories of data.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr</td>
<td valign="top" width="75%">This is the main data area of Solr.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr/conf</td>
<td valign="top" width="75%">Contains configuration files used by Solr.          </p>
<p>solrconfig.xml – Configuration parameters, memory tuning, different types of request handlers.           </p>
<p>schema.xml – Specifies fields and analyzer configuration for indexing and querying. Other files contain data required by different components like the Stop word filter.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr/data</td>
<td valign="top" width="75%">This contains the actual results of indexing.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/webapps</td>
<td valign="top" width="75%">The solr webapp deployed in Jetty</td>
</tr>
<tr>
<td valign="top" width="25%">/example/work</td>
<td valign="top" width="75%">Scratch directory for the container environment</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h1 id="toc-getting-started-guide">Getting Started Guide</h1>
<p>1) Copy the skeleton server under /example to the deployment directory. </p>
<p>2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application. </p>
<p>3) Start the solr server. For the default Jetty environment, use this command line with current directory set to /example: </p>
<blockquote><p><span style="background-color: #ffffff">java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar</span></p></blockquote>
<p>STOP.PORT specifies the port on which the server should listen for a stop instruction, and STOP.KEY is a shared secret that must be passed when stopping. </p>
<p>4) If building from source, the WAR will be named something like apache-solr-4.0-snapshot.war. Copy this to /webapps and importantly, <strong>rename it to solr.war.</strong> Without that renaming, Jetty will give 404 errors for /solr URLs. </p>
<p>5) The solr server will now be available at <strong>http://localhost:8983/solr</strong>. 8983 is the default jetty connector port, as specified in /example/etc/jetty.xml </p>
<p>6) To stop the server, use the command line: </p>
<blockquote><p><span style="background-color: #ffffff">java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop</span></p></blockquote>
<p>&#160;</p>
<h1 id="toc-managing-solr-server-with-ant-during-development">Managing solr server with ant during development</h1>
<p>Starting and stopping solr can be conveniently done from an IDE like Eclipse using an Ant script:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;project basedir=&quot;.&quot; name=&quot;ManageSolr&quot;&gt;
&lt;property name=&quot;stopport&quot; value=&quot;8079&quot;&gt;&lt;/property&gt;
&lt;property name=&quot;stopsecret&quot; value=&quot;secret&quot;&gt;&lt;/property&gt;

&lt;target name=&quot;start-solr&quot;&gt;
	&lt;java dir=&quot;./dist/solr&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/start.jar&quot;&gt;
		&lt;jvmarg value=&quot;-DSTOP.PORT=${stopport}&quot; /&gt;
		&lt;jvmarg value=&quot;-DSTOP.KEY=${stopsecret}&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;

&lt;target name=&quot;stop-solr&quot;&gt;
	&lt;java dir=&quot;./dist/solr&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/start.jar&quot;&gt;
		&lt;jvmarg value=&quot;-DSTOP.PORT=${stopport}&quot; /&gt;
		&lt;jvmarg value=&quot;-DSTOP.KEY=${stopsecret}&quot; /&gt;
		&lt;arg value=&quot;--stop&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;

&lt;target name=&quot;restart-solr&quot; depends=&quot;stop-solr,start-solr&quot;&gt;
&lt;/target&gt;

&lt;target name=&quot;deleteAllDocs&quot;&gt;
	&lt;java dir=&quot;./dist/solr/exampledocs&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/exampledocs/post.jar&quot;&gt;
		&lt;arg value=&quot;${basedir}/deleteAllCommand.xml&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;
&lt;/project&gt;
</pre>
<p>&#160;</p>
<h1 id="toc-customizing-solr-installation">Customizing Solr installation</h1>
<p>The solr server distribution under /example is just that &#8211; an example. It should be customized to fit your search requirements. The conf/schema.xml should be changed to model searchable entities of the application, as described in this article.</p>
<p>&#160;</p>
<h1 id="toc-multicore-configuration-and-deployment">Multicore configuration and deployment</h1>
<p>Multicore configuration allows multiple schemas and indexes in a single solr server process. Multicores are useful when disparate entities with different fields need to be searched using a single server process.</p>
<ul>
<li>The package contains an example multicore configuration in <strong>/example/multicore</strong>.&#160; It contains 2 cores, each with its own schema.xml and solrconfig.xml. </li>
<li>Core names and instance directories can be changed in solr.xml. </li>
<li>The default multicore schema.xmls are rather simplistic and don’t contain the exhaustive list of field type definitions available in <strong>/example/solr/conf/schema.xml</strong>.&#160; So, copy all files under <strong>/example/solr/conf/*</strong> into <strong>/example/multicore/core0/conf/*</strong> and <strong>/example/multicore/core1/conf/*</strong> </li>
<li>Modify the core schema XMLs according to the data they are indexing </li>
<li>The copied solrconfig.xml has a &lt;dataDir&gt; element that points to <strong>/example/multicore/data</strong>. This is where index and other component data are stored. Since the same solrconfig is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely resulting in index corruption.&#160; So, comment out the &lt;dataDir&gt; elements; each core will then store data in its respective <strong>/example/multicore/&lt;coredir&gt;/data</strong>. </li>
<li>The jar lib directories in the default single-core solrconfig.xml don’t match the default directory structure of a multicore setup. Those relative paths use solr home (i.e., <strong>/example/solr</strong>) as the base directory.&#160; Change the relative paths of /contrib and /dist so that they’re relative to the core’s own directory (i.e., <strong>/example/solr/&lt;coredir&gt;</strong>). </li>
<li>Finally, make the multicore configuration the active configuration, either by starting with &#8220;java -Dsolr.solr.home=./multicore -jar start.jar&#8221;, or preferably by copying all files under <strong>/example/multicore/*</strong> into <strong>/example/solr</strong>, the default solr home. </li>
</ul>
<h1 id="toc-using-solr-from-command-line">Using Solr from command line</h1>
<p>The primary method of communicating with Solr is HTTP. An HTTP-capable command line client like <a href="http://curl.haxx.se/" target="_blank"><strong>curl</strong></a> is useful for this.</p>
<p><strong>Querying:</strong> Queries should be sent as </p>
<blockquote><p>http://localhost:8983/solr/select/?q=&lt;query&gt; </p>
</blockquote>
<p>or </p>
<blockquote><p>http://localhost:8983/solr/&lt;core name&gt;/select/?q=&lt;query&gt; </p>
</blockquote>
<p>for a multicore installation.</p>
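As a sketch, select URLs of this shape can be assembled and encoded from plain Java. The base URL is the default from earlier sections; the core name <strong>core0</strong> and the helper name are hypothetical, used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
    // Builds a /select URL; pass null for `core` on a single-core install.
    static String selectUrl(String baseUrl, String core, String query)
            throws UnsupportedEncodingException {
        String prefix = (core == null) ? baseUrl : baseUrl + "/" + core;
        return prefix + "/select/?q=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Single-core form, then the hypothetical core "core0"
        System.out.println(selectUrl("http://localhost:8983/solr", null, "name:video"));
        System.out.println(selectUrl("http://localhost:8983/solr", "core0", "name:video"));
    }
}
```

Note that the query must be URL-encoded: the colon in <strong>name:video</strong> becomes <strong>%3A</strong> on the wire, which curl does not do for you automatically.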
<p><strong>Inserting or Updating documents in a single core installation: </strong>Solr update handler listens by default on the URL: <strong>http://localhost:8983/solr/update/</strong> in a single core configuration. </p>
<p>To post an XML file with documents, use command line</p>
<blockquote><p>curl http://localhost:8983/solr/update/?commit=true -F &quot;myfile=@updates.xml&quot;</p>
</blockquote>
<p><strong>Inserting or Updating documents in a multi core installation: </strong>Each core’s update handler listens by default on the URL: <strong>http://localhost:8983/solr/&lt;core name&gt;/update/</strong></p>
<p>&#160;</p>
<p><strong>Updating with content extraction</strong>: The content extraction handler listens on the URL http://localhost:8983/solr/update/extract/ or http://localhost:8983/solr/&lt;core name&gt;/update/extract. Use the command line </p>
<blockquote><p>curl &quot;http://localhost:8983/solr/update/extract?literal.id=book1&amp;commit=true&quot; -F &quot;myfile=@book.pdf&quot; </p></blockquote>
<p>where literal.id adds a regular field called &quot;id&quot; to the new document created by extracting handler.</p>
<p>&#160;</p>
<p>The query parameters that Solr accepts are documented in <a href="http://wiki.apache.org/solr/CommonQueryParameters" target="_blank">Solr wiki</a>.</p>
<p>&#160;</p>
<h1 id="toc-boolean-operators-in-search-queries">Boolean operators in search queries</h1>
<p>All Lucene queries are valid in Solr too. However, solr does provide some additional conveniences.</p>
<p>A default boolean operator can be specified using a <strong>&lt;solrQueryParser defaultOperator=”AND|OR”/&gt;</strong> element in schema.xml.</p>
<p>Each query can also override boolean behaviour using the <strong>q.op=AND|OR</strong> query param. However, remember that the schema default or q.op affects not just the query terms, but also the facet filter queries.</p>
<p>For example, selecting 2 facet values for the same facet field will now imply that both should be satisfied. This is because internally, a filter query is just a part of the query from Lucene point of view.</p>
<p>To restrict boolean logic to just the query terms, use the following syntax:</p>
<ul>
<li><strong>All words should be found:</strong> Prefix a + in front of each word. <em>example</em>: +video +science (=&gt;only documents that contain both “video” AND “science” are returned) </li>
<li><strong>Any one word should be found:</strong> This is the default behaviour when queries contain words without any prefix. <em>example:</em> video science (=&gt;any document which contains either “video” or “science” is returned) </li>
<li><strong>Documents which don’t contain a word:</strong> Prefix a “-” in front of each word that should not be present for a successful hit. <em>example: </em>video -science (=&gt;any document which contains “video” but not “science” is returned).</li>
</ul>
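The three prefix forms above can be generated mechanically when building queries in code. A minimal sketch in plain Java; <strong>withPrefix</strong> is a hypothetical helper, not part of SolrJ:

```java
import java.util.ArrayList;
import java.util.List;

public class BooleanQueryBuilder {
    // Joins terms with the given prefix:
    // "+" requires each term, "-" prohibits it, "" leaves default behaviour.
    static String withPrefix(String prefix, String... terms) {
        List<String> parts = new ArrayList<>();
        for (String t : terms) {
            parts.add(prefix + t);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(withPrefix("+", "video", "science")); // both required
        System.out.println(withPrefix("", "video", "science"));  // either matches
        // Mixed form: require "video", exclude "science"
        System.out.println(withPrefix("+", "video") + " " + withPrefix("-", "science"));
    }
}
```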
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/getting-started-with-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
