<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pathbreak Developer Notebook</title>
	<atom:link href="http://www.pathbreak.com/blog/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pathbreak.com/blog</link>
	<description>Pathbreak Technologies on Software Architectures, Engineering, Technologies &#38; Programming</description>
	<lastBuildDate>Wed, 28 Dec 2011 04:20:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Did Amazon CloudFront CDN make my site faster?</title>
		<link>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=did-amazon-cloudfront-cdn-make-my-site-faster</link>
		<comments>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#comments</comments>
		<pubDate>Wed, 28 Dec 2011 04:20:55 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[ab]]></category>
		<category><![CDATA[Apache bench]]></category>
		<category><![CDATA[CDN]]></category>
		<category><![CDATA[Cloudfront]]></category>
		<category><![CDATA[JMeter]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster</guid>
		<description><![CDATA[Contents Overview Setup Evaluation criteria Performance measurements Browser measurements Methodology Results Analysis of browser results Browser Conclusions Load measurements using apache bench (ab) Methodology Results Analysis of ab results Conclusion Load measurements using Apache JMeter Methodology Results Analysis of results Conclusion Measurements using www.webpagetest.org Methodology Results Conclusion Cost analysis Final conclusion Overview When I was [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-setup">Setup</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-evaluation-criteria">Evaluation criteria</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-performance-measurements">Performance measurements</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-browser-measurements">Browser measurements</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-browser-results">Analysis of browser results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-browser-conclusions">Browser Conclusions</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-load-measurements-using-apache-bench-ab">Load measurements using apache bench (ab)</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology1">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results1">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-ab-results">Analysis of ab results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion">Conclusion</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-load-measurements-using-apache-jmeter">Load measurements using Apache JMeter</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology2">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results2">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-analysis-of-results">Analysis of results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion1">Conclusion</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-measurements-using-www-webpagetest-org">Measurements using www.webpagetest.org</a>
<ol>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-methodology3">Methodology</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-results3">Results</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-conclusion2">Conclusion</a></li>
</ol>
</li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-cost-analysis">Cost analysis</a></li>
<li><a href="http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster#toc-final-conclusion">Final conclusion</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>When I was deploying my website, I ran into a slow page load problem. One of the pages had nine non-interlaced screenshot PNG images, each about 700 x 500 pixels in size and between 40 KB and 350 KB in file size. </p>
<p>I wondered if deploying these images on Amazon CloudFront would improve response times. <a href="http://aws.amazon.com/cloudfront/" target="_blank">Amazon CloudFront</a> is the Content Delivery Network (CDN) offering from Amazon and is one of the cloud services that constitute Amazon Web Services (AWS).&#160; </p>
<p>CDNs are supposed to improve response times by replicating resources across multiple servers around the world, and serving a requested resource from the server closest to the requesting client. The implicit assumption is that the root cause of latency is geographical distance (the greater the distance, the more routers involved in between), so serving files from a server that is physically closer should reduce latency.</p>
<p>Since my site was already hosted on Amazon&#8217;s EC2, it made sense to try their CloudFront CDN rather than some other vendor&#8217;s CDN. Though this was not a performance-critical page, it did provide the opportunity to experiment with CloudFront in a realistic scenario, and the knowledge gained may prove useful in the future. So I started experimenting&#8230;</p>
<p>&#160;</p>
<h1 id="toc-setup">Setup</h1>
<p>I decided to use Amazon S3 as the origin server for Cloudfront (the origin server is the server from which Cloudfront picks up the resources to replicate). I opted for the &quot;reduced redundancy storage&quot; setting instead of &quot;standard redundancy&quot; for the S3 bucket to minimize costs (and also because these images are already available to me on my development machine and web server &#8211; standard redundancy makes more sense for user content or critical backups).</p>
<p>&#160;</p>
<h1 id="toc-evaluation-criteria">Evaluation criteria</h1>
<p>Better response times would be great. </p>
<p>Even if there was no improvement in response times, a CDN would still reduce the load on my rather <a href="http://aws.amazon.com/ec2/#instance" target="_blank">underpowered EC2 micro instance</a> web server, and free up connections for more dynamic content, like my SaaS products. So I was already somewhat biased towards using Cloudfront or some other file server before evaluating them.</p>
<p>But CloudFront, like other AWS services, is a metered service. So the evaluation also needed to keep costs in mind.</p>
<p>&#160;</p>
<h1 id="toc-performance-measurements">Performance measurements</h1>
<p>For response time measurements, I decided to use different tools to get a complete picture:</p>
<ul>
<li>The first set of measurements are taken using browsers. All 3 major browsers &#8211; Chrome, Firefox and IE &#8211; provide excellent profiling tools for developers. </li>
<li>However, browser measurements are not enough. The system should also be tested for scalability. What happens to response times when there are dozens of concurrent connections requesting the page? Can the page be rendered for all those users without much increase in response times? With a single web server on an underpowered machine, this is clearly not possible. But putting a CDN in the mix should shift at least some of the load from my puny single web server to Amazon&#8217;s scalable mammoth delivery network. I used the <a href="http://jmeter.apache.org/" target="_blank">Apache JMeter</a> and <a href="http://httpd.apache.org/docs/2.2/programs/ab.html" target="_blank">Apache Bench (ab)</a> tools to load the server. </li>
</ul>
<h2 id="toc-browser-measurements">Browser measurements</h2>
<h3 id="toc-methodology">Methodology</h3>
<p>Chrome&#8217;s developer tools network tab, Firefox Firebug network tab, and IE&#8217;s developer tools Network tab provide profiling information. </p>
<p>Chrome and Firefox (via the Firebug and Firebug NetExport plugins) can export profiling data to JSON-format files called .har (HTTP Archive) files. </p>
<p>IE exports to XML files which have a similar schema to the JSON .har files but expressed in XML. </p>
<p>&#160;</p>
<p>Each browser was tested 5 times with a complete cache cleanup in between. The cache cleanup ensured that all images were downloaded in each test. However, cache cleanup does not clear the browsers&#8217; DNS caches, which means DNS lookup timings are usually manifested only in the first test.</p>
<p>&#160;</p>
<p>A Python script was used to parse these files, calculate the averages and produce the HTML table of averages below.</p>
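<p>The averaging script worked roughly like this &#8211; a minimal sketch, assuming HAR 1.2 files as exported by Chrome or by Firebug NetExport (the .png filter and file naming are illustrative, not the exact script I ran):</p>

```python
import json
from collections import defaultdict
from statistics import mean

def average_timings(har_paths):
    """Average the T (total), R (receive) and W (wait) times per image
    across several HAR files, one file per test run."""
    samples = defaultdict(lambda: defaultdict(list))
    for path in har_paths:
        with open(path) as f:
            entries = json.load(f)["log"]["entries"]
        for entry in entries:
            url = entry["request"]["url"]
            if not url.endswith(".png"):   # keep only the screenshot images
                continue
            name = url.rsplit("/", 1)[-1]
            samples[name]["T"].append(entry["time"])
            samples[name]["R"].append(entry["timings"]["receive"])
            samples[name]["W"].append(entry["timings"]["wait"])
    # round to whole milliseconds, as reported in the table
    return {name: {metric: round(mean(values)) for metric, values in metrics.items()}
            for name, metrics in samples.items()}
```

IE&#8217;s XML export has the same fields, so the same aggregation applies after an XML-to-dict conversion step.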
<p>&#160;</p>
<h3 id="toc-results">Results</h3>
<blockquote><p><em><strong>Legend to the table:</strong></em></p>
<p>1st column =&gt; the image file name</p>
<p>&quot;OwnServer&quot; =&gt; tests in which images were downloaded from my Apache web server running on EC2 and EBS</p>
<p>&quot;Cloudfront&quot; =&gt; tests in which images were downloaded from Cloudfront distribution with S3 as origin server</p>
<p>T =&gt; Total time for request and response (including the blocked, DNS lookup, connect, send, wait and receive phases)</p>
<p>R =&gt; Total time for just receiving all the data</p>
<p>W =&gt; Time spent waiting before response started</p>
<p>All figures are in milliseconds</p>
</blockquote>
<table border="1">
<tbody>
<tr>
<td>&nbsp;</td>
<td>Chrome          <br />OwnServer</td>
<td>Chrome          <br />Cloudfront</td>
<td>Mozilla          <br />OwnServer</td>
<td>Mozilla          <br />Cloudfront</td>
<td>IE          <br />OwnServer</td>
<td>IE          <br />Cloudfront</td>
</tr>
<tr>
<td>corporatesearch.png          <br />158KB</td>
<td>T:14897          <br />R:14525           <br />W:369</td>
<td>T:7913          <br />R:7721           <br />W:141</td>
<td>T:9924          <br />R:9476           <br />W:447</td>
<td>T:13010          <br />R:12792           <br />W:177</td>
<td>T:9038          <br />R:8567           <br />W:386</td>
<td>T:11600          <br />R:11406           <br />W:153</td>
</tr>
<tr>
<td>jobsearch.png          <br />353KB</td>
<td>T:15516          <br />R:14795           <br />W:359</td>
<td>T:15982          <br />R:15666           <br />W:265</td>
<td>T:19716          <br />R:18328           <br />W:367</td>
<td>T:16061          <br />R:15856           <br />W:162</td>
<td>T:19394          <br />R:18629           <br />W:393</td>
<td>T:17225          <br />R:16997           <br />W:187</td>
</tr>
<tr>
<td>p-and-f-charting.png          <br />59KB           </td>
<td>T:5600          <br />R:4869           <br />W:365</td>
<td>T:6520          <br />R:6218           <br />W:250</td>
<td>T:7728          <br />R:6341           <br />W:363</td>
<td>T:9085          <br />R:8842           <br />W:199</td>
<td>T:5934          <br />R:4683           <br />W:399</td>
<td>T:7098          <br />R:6246           <br />W:811</td>
</tr>
<tr>
<td>s-and-r-charting.png          <br />40KB</td>
<td>T:4048          <br />R:3313           <br />W:366</td>
<td>T:5301          <br />R:5006           <br />W:243</td>
<td>T:6102          <br />R:4347           <br />W:363</td>
<td>T:3788          <br />R:3550           <br />W:177</td>
<td>T:5456          <br />R:3700           <br />W:973</td>
<td>T:3151          <br />R:2614           <br />W:496</td>
</tr>
<tr>
<td>dialogs.png          <br />114KB</td>
<td>T:15074          <br />R:4492           <br />W:315</td>
<td>T:12620          <br />R:5056           <br />W:316</td>
<td>T:14032          <br />R:4737           <br />W:314</td>
<td>T:11809          <br />R:4402           <br />W:180</td>
<td>T:11830          <br />R:5288           <br />W:299</td>
<td>T:12483          <br />R:4346           <br />W:318</td>
</tr>
<tr>
<td>candidateshortlist.png          <br />160KB</td>
<td>T:11319          <br />R:10951           <br />W:366</td>
<td>T:7246          <br />R:7082           <br />W:121</td>
<td>T:9932          <br />R:9491           <br />W:440</td>
<td>T:10277          <br />R:10108           <br />W:127</td>
<td>T:7694          <br />R:7310           <br />W:380</td>
<td>T:9235          <br />R:9044           <br />W:147</td>
</tr>
<tr>
<td>mainscreen.png          <br />166KB</td>
<td>T:16810          <br />R:10758           <br />W:309</td>
<td>T:13498          <br />R:8243           <br />W:138</td>
<td>T:18673          <br />R:11210           <br />W:323</td>
<td>T:14714          <br />R:7201           <br />W:1516</td>
<td>T:16115          <br />R:11113           <br />W:337</td>
<td>T:13815          <br />R:8579           <br />W:252</td>
</tr>
<tr>
<td>technical-analysis-          <br />signals.png           <br />112KB</td>
<td>T:11550          <br />R:7339           <br />W:307</td>
<td>T:13209          <br />R:8886           <br />W:155</td>
<td>T:12731          <br />R:7460           <br />W:351</td>
<td>T:9232          <br />R:6055           <br />W:227</td>
<td>T:11668          <br />R:6421           <br />W:336</td>
<td>T:8031          <br />R:5615           <br />W:221</td>
</tr>
<tr>
<td>homepage.png          <br />266KB</td>
<td>T:15834          <br />R:15400           <br />W:357</td>
<td>T:14030          <br />R:13797           <br />W:181</td>
<td>T:16209          <br />R:14827           <br />W:364</td>
<td>T:12099          <br />R:11927           <br />W:134</td>
<td>T:17235          <br />R:16470           <br />W:396</td>
<td>T:15956          <br />R:15609           <br />W:305</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-browser-results">Analysis of browser results</h3>
<p>The metrics to pay attention to here are R (the average receive times) and W (the wait times). </p>
<p>I didn&#8217;t pay much attention to T (the average total times) because I felt they are misleading. The problem is that browsers download embedded resources like &lt;img&gt;s using a small number of connections. When there are more resources than there are connections, the extra resources are blocked until some connections are freed. These blocked times manifest in the T values, but they are not deterministic and are also not similar across browsers since connection implementations differ. Hence, total times should be ignored in my opinion.</p>
<p>What can we observe from the R(eceive) and W(ait) times?</p>
<ul>
<li>Chrome: For 5 out of 9 images, R(eceive) times from Cloudfront are lower than from my own server. For the other 4 images, receive times from Cloudfront are slightly higher. So it&#8217;s almost a tie. However, W(ait) times are consistently lower for Cloudfront. So, Cloudfront leads. </li>
<li>Firefox: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently lower, except in one case, which seems to be an anomaly. Cloudfront leads again. </li>
<li>IE: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently lower, except in two cases, which seem to be anomalies. Cloudfront leads again. </li>
</ul>
<h3 id="toc-browser-conclusions">Browser Conclusions</h3>
<p><strong>Cloudfront does make the site faster</strong>&#8230;<em>but </em>not as consistently or drastically as expected, at least in my tests (I&#8217;m in India and my nearest edge locations seem to be Singapore or Hong Kong).</p>
<p>One possible factor is that resources need lots of hits before Cloudfront caches and serves them effectively. I&#8217;m not sure about this, but the Cloudfront documentation does seem to hint that more popular resources benefit more.</p>
<p>&#160;</p>
<h2 id="toc-load-measurements-using-apache-bench-ab">Load measurements using apache bench (ab)</h2>
<h3 id="toc-methodology1">Methodology</h3>
<p>ab is incapable of downloading a web page and all its embedded resources. So I ran ab requests on just one of the image files &#8211; the biggest one at 350 KB.</p>
<p>I set different values for the -n (total requests) and -c (concurrency) options. -k was enabled to keep connections alive, simulating browser behaviour.</p>
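<p>ab prints its figures as a plain-text summary, so tabulating many runs is easiest with a small parser. A sketch of the idea (the sample output below is condensed to the relevant lines of ab&#8217;s format, filled in with the single-user own-server figures; the 100% line is made up for illustration):</p>

```python
import re

# Condensed, illustrative sample of ab's plain-text summary.
AB_OUTPUT = """\
Time taken for tests:   292.69 seconds
Total transferred:      18051951 bytes
Time per request:       5853.80 [ms] (mean)
Percentage of the requests served within a certain time (ms)
  50%   5350
  90%   8400
 100%  14800 (longest request)
"""

def parse_ab(output):
    """Pull the figures reported in the Results table out of ab's summary."""
    total = float(re.search(r"Time taken for tests:\s+([\d.]+)", output).group(1))
    transferred = int(re.search(r"Total transferred:\s+(\d+)", output).group(1))
    percentiles = {int(pct): int(ms)
                   for pct, ms in re.findall(r"^\s*(\d+)%\s+(\d+)", output, re.M)}
    return {"total_s": total, "bytes": transferred, "percentiles_ms": percentiles}
```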
<p>&#160;</p>
<h3 id="toc-results1">Results</h3>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="bottom" width="179">&#160;</td>
<td valign="bottom" width="131">
<p><strong>Own server</strong></p>
</td>
<td valign="bottom" width="99">
<p><strong>Cloudfront</strong></p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p><b>50 total requests, 1 user</b></p>
</td>
<td valign="bottom" width="131">&#160;</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>292.69s</p>
</td>
<td valign="bottom" width="99">
<p>214s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>5.85 s</p>
</td>
<td valign="bottom" width="99">
<p>4.28s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>8.4s</p>
</td>
<td valign="bottom" width="99">
<p>4.9s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>5.35s</p>
</td>
<td valign="bottom" width="99">
<p>4.26s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>18051951</p>
</td>
<td valign="bottom" width="99">
<p>18070081</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 5 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>211.97 s</p>
</td>
<td valign="bottom" width="99">
<p>215.24s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.24 s</p>
</td>
<td valign="bottom" width="99">
<p>4.30s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>31.4s</p>
</td>
<td valign="bottom" width="99">
<p>35.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>19.3s</p>
</td>
<td valign="bottom" width="99">
<p>18.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>18737815</p>
</td>
<td valign="bottom" width="99">
<p>18826276</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 10 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>217.34 s</p>
</td>
<td valign="bottom" width="99">
<p>218.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.35 s</p>
</td>
<td valign="bottom" width="99">
<p>4.37s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>58.8s</p>
</td>
<td valign="bottom" width="99">
<p>67.4s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>40s</p>
</td>
<td valign="bottom" width="99">
<p>27.5s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>19460082</p>
</td>
<td valign="bottom" width="99">
<p>19705381</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>50 total requests, 25 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>227.57s</p>
</td>
<td valign="bottom" width="99">
<p>239.53s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.55s</p>
</td>
<td valign="bottom" width="99">
<p>4.79s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>142.6s</p>
</td>
<td valign="bottom" width="99">
<p>130.6s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>61.9s</p>
</td>
<td valign="bottom" width="99">
<p>46.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>20218419</p>
</td>
<td valign="bottom" width="99">
<p>21540782</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>80 total requests, 40 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>337.93s</p>
</td>
<td valign="bottom" width="99">
<p>412.69s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.22s</p>
</td>
<td valign="bottom" width="99">
<p>5.16s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>235.3s</p>
</td>
<td valign="bottom" width="99">
<p>239.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>82.4s</p>
</td>
<td valign="bottom" width="99">
<p>84s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>28945837</p>
</td>
<td valign="bottom" width="99">
<p>34307057</p>
</td>
</tr>
<tr>
<td valign="bottom" width="310" colspan="2">
<p><b>100 total requests, 50 concurrent users</b></p>
</td>
<td valign="bottom" width="99">&#160;</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Total time</p>
</td>
<td valign="bottom" width="131">
<p>477.91s</p>
</td>
<td valign="bottom" width="99">
<p>477.13s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Mean Time / request</p>
</td>
<td valign="bottom" width="131">
<p>4.78s</p>
</td>
<td valign="bottom" width="99">
<p>4.77s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 90% of requests</p>
</td>
<td valign="bottom" width="131">
<p>260.1s</p>
</td>
<td valign="bottom" width="99">
<p>201.7s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Max time taken by 50% of requests</p>
</td>
<td valign="bottom" width="131">
<p>137.1s</p>
</td>
<td valign="bottom" width="99">
<p>53.3s</p>
</td>
</tr>
<tr>
<td valign="bottom" width="179">
<p>Data transferred (bytes)</p>
</td>
<td valign="bottom" width="131">
<p>34238460</p>
</td>
<td valign="bottom" width="99">
<p>43105119</p>
</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-ab-results">Analysis of ab results</h3>
<p>The results are so all over the place that I found it difficult to draw any conclusions! </p>
<p>The 50th percentile results in some tests clearly favour Cloudfront, but not consistently. </p>
<p>&#160;</p>
<p>I also found it hard to understand some of the raw values (not shown here). For example, in the last test with 100 requests across 50 concurrent users, the total time was 477.1s but the longest request took 454s! How that can be is beyond me. I&#8217;m guessing that a request sent fairly early never got a response. It&#8217;s possible that the load was too much for my puny 512 kbps bandwidth.</p>
<p>Another thing to notice is that the data volume with Cloudfront is at least 25% higher at higher loads. I&#8217;m guessing that this is because of TCP retransmissions, though why it appears only when communicating with Cloudfront is not clear.</p>
<p>&#160;</p>
<h3 id="toc-conclusion">Conclusion</h3>
<p>I&#8217;m reluctant to draw any concrete conclusion from the ab results, except that 50% of requests seem to be faster most of the time when using Cloudfront.</p>
<p>&#160;</p>
<h2 id="toc-load-measurements-using-apache-jmeter">Load measurements using Apache JMeter</h2>
<h3 id="toc-methodology2">Methodology</h3>
<p>JMeter was used to test the following loads:</p>
<ul>
<li>50 total requests with 1 user. Retrieve embedded resources using a pool of 9 threads (9 because the page had 9 images) </li>
<li>50 total requests across 5 concurrent users. Retrieve embedded resources using pools of 5 threads each (only 5 because JMeter creates a separate pool for each virtual user, which means 5 users x 5 threads = 25 threads would be created. I was afraid that larger pool sizes might make bandwidth contention a factor in the timings) </li>
</ul>
<h3 id="toc-results2">Results</h3>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="bottom" width="175">&nbsp;</td>
<td valign="bottom" width="68">
<p align="center"><strong>Ownserver</strong></p>
</td>
<td valign="bottom" width="67">
<p align="center"><strong>Cloudfront</strong></p>
</td>
<td valign="bottom" width="209">
<p align="center"><strong>Notes</strong></p>
</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p><b>50 total requests,              <br />1 user,               <br />9 downloading threads</b></p>
</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>avg</p>
</td>
<td valign="bottom" width="68">
<p>22.8s</p>
</td>
<td valign="bottom" width="67">
<p>20.01s</p>
</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>90% of requests</p>
</td>
<td valign="bottom" width="68">
<p>24.4s</p>
</td>
<td valign="bottom" width="67">
<p>25.7s</p>
</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">&nbsp;</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p><b>50 total requests,              <br />5 concurrent users,               <br />5 downloading threads per user</b></p>
</td>
<td valign="bottom" width="68">&nbsp;</td>
<td valign="bottom" width="67">&nbsp;</td>
<td valign="bottom" width="209">&nbsp;</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>avg</p>
</td>
<td valign="bottom" width="68">
<p>75s</p>
</td>
<td valign="bottom" width="67">
<p>48s</p>
</td>
<td valign="bottom" width="209">
<p>Actually 77s overall, but only 48s with 7 anomalous measurements removed.            <br />Ownserver never actually finished all 50 requests &#8211; probably socket timeouts.</p>
</td>
</tr>
<tr>
<td valign="bottom" width="175">
<p>90% of requests</p>
</td>
<td valign="bottom" width="68">
<p>110.5s</p>
</td>
<td valign="bottom" width="67">
<p>60s</p>
</td>
<td valign="bottom" width="209">
<p>Actually 232.6s,            <br />but 34 out of 43 (80%) were within 60s. </p>
</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-analysis-of-results">Analysis of results</h3>
<p>When simulating a single user, using Cloudfront didn&#8217;t show any major improvement in speed.</p>
<p>But when simulating 5 concurrent users with 5 resource downloading threads per user, I saw interesting results. 7 results timed out with extremely high times like 270 seconds. These I put down as anomalies, possibly because I was overloading my bandwidth.</p>
<p>Without those anomalies included, the average time per request was just 48 seconds when using Cloudfront, compared to 75 seconds when not. Also, 80% of the remaining timings completed within 60 seconds when using Cloudfront, compared to 110.5 seconds when not.</p>
<p>&#160;</p>
<h3 id="toc-conclusion1">Conclusion</h3>
<p>So load testing with JMeter suggests that Cloudfront performs better at higher loads.</p>
<p>&#160;</p>
<h2 id="toc-measurements-using-www-webpagetest-org">Measurements using <a href="http://www.webpagetest.org">www.webpagetest.org</a></h2>
<h3 id="toc-methodology3">Methodology</h3>
<p><a href="http://www.webpagetest.org">www.webpagetest.org</a> provides automated testing for websites, from client locations around the world. </p>
<p>Five tests were conducted from each location for each method of serving images.</p>
<p>&#160;</p>
<h3 id="toc-results3">Results</h3>
<p>The results were as follows:</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="33%">&#160;</td>
<td valign="top" width="33%">Served from own server</td>
<td valign="top" width="33%">Cloudfront</td>
</tr>
<tr>
<td valign="top" width="33%">New York</td>
<td valign="top" width="33%">8.772 s</td>
<td valign="top" width="33%">8.911 s</td>
</tr>
<tr>
<td valign="top" width="33%">London</td>
<td valign="top" width="33%">8.791 s</td>
<td valign="top" width="33%">8.703 s</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h3 id="toc-conclusion2">Conclusion</h3>
<p>It doesn&#8217;t look like Cloudfront improved page speeds from these locations.</p>
<hr />
<p>&#160;</p>
<h1 id="toc-cost-analysis">Cost analysis</h1>
<p>If the choice is between storing content on an EC2 EBS drive and serving it from EC2 web server, vs. storing it in S3 and serving it via Cloudfront, the following cost components are relevant (as of Dec 2011):</p>
<p>Assume &#8216;B&#8217; GB is the size of the content being stored (for simplicity, I&#8217;ll assume just one file of &#8216;B&#8217; GB).</p>
<p>Assume 1 user requests this file every second of every day, which comes to 86,400 requests/day or 2,592,000 requests/month.</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="50%"><strong>via EBS and EC2</strong></td>
<td valign="top" width="50%"><strong>via S3 and Cloudfront</strong></td>
</tr>
<tr>
<td valign="top" width="50%">EBS storage per GB = $0.10B</td>
<td valign="top" width="50%">S3 reduced redundancy storage = $0.093B          <br />(ignoring S3 IO request costs by assuming this file will be stored just once, and then always served via Cloudfront)</td>
</tr>
<tr>
<td valign="top" width="50%">EBS cost per 1 million IO requests = $0.10 x 2.592 = $0.2592B</td>
<td valign="top" width="50%">Cloudfront data transfer = $0.19B          <br />But as we have seen with ab tests, at higher loads, more data is transferred due to TCP retransmissions.           <br />Assuming 20% extra data is transferred, this will come to $0.228B           </td>
</tr>
<tr>
<td valign="top" width="50%">Data transfer through elastic IP = $0.01B</td>
<td valign="top" width="50%">Cloudfront cost per 10000 HTTP requests = $0.009 x 2592000/10000 = $2.3328</td>
</tr>
<tr>
<td valign="top" width="50%">Total: $0.3692B          <br />If that file is 1GB in size, this comes to $0.37</td>
<td valign="top" width="50%">Total: $2.3328 + 0.321B          <br />If that file is 1GB in size, this comes to $2.65</td>
</tr>
</tbody>
</table>
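<p>A quick sanity check of those totals for B = 1 GB (a sketch; it just re-does the table&#8217;s arithmetic with the Dec 2011 prices):</p>

```shell
# Monthly cost per the table above, for a single file of B GB (B=1 here).
B=1
ebs=$(awk -v b="$B" 'BEGIN { printf "%.2f", 0.10*b + 0.2592*b + 0.01*b }')
cf=$(awk -v b="$B" 'BEGIN { printf "%.2f", 2.3328 + 0.093*b + 0.228*b }')
echo "EBS+EC2: \$$ebs/month, S3+Cloudfront: \$$cf/month"
```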
<p>So cost-wise too, Cloudfront comes out more expensive than serving off EBS. It&#8217;s primarily the per-request HTTP charges ($2.33/month in this scenario) that tilt the choice away from Cloudfront.</p>
<p>&#160;</p>
<hr />
<p>&#160;</p>
<h1 id="toc-final-conclusion">Final conclusion</h1>
<p>In my case, my website is not a high-traffic site. I also didn&#8217;t observe any <em>drastic </em>improvement in page speeds, except possibly at high loads (shown by the JMeter results). And cost-wise, it&#8217;s indeed cheaper to stick with EBS and EC2. </p>
<p>So, should I use Cloudfront or not? I think it&#8217;s not needed for my site at the moment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/did-amazon-cloudfront-cdn-make-my-site-faster/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simulating browsers using JMeter</title>
		<link>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=simulating-browsers-using-jmeter</link>
		<comments>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter#comments</comments>
		<pubDate>Tue, 27 Dec 2011 11:57:23 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[High Scalability]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[Web development]]></category>
		<category><![CDATA[JMeter]]></category>
		<category><![CDATA[load testing]]></category>
		<category><![CDATA[stress testing]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/simulating-browsers-using-jmeter</guid>
		<description><![CDATA[JMeter is commonly used to stress test webpages by simulating multiple users concurrently visiting a webpage URL. However, for this simulation to be accurate, JMeter needs to be configured correctly so that it behaves like a browser. In this article, I explain what settings to configure, to make JMeter simulate browser requests fairly accurately. &#160; [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox"></div>
<p>JMeter is commonly used to stress test webpages by simulating multiple users concurrently visiting a webpage URL. However, for this simulation to be accurate, JMeter needs to be configured correctly so that it behaves like a browser. </p>
<p>In this article, I explain what settings to configure, to make JMeter simulate browser requests fairly accurately. </p>
<p>&#160;</p>
<p>Before configuring JMeter correctly, let&#8217;s understand how browsers work:</p>
<ul>
<li>When a user enters a webpage URL, the browser connects to the server and starts downloading and parsing the page. </li>
<li>While parsing, it encounters URLs of embedded resources like javascript, CSS and image files. </li>
<li>The browser then creates more threads, each of which opens a new connection and fetches one of these embedded URLs. Most browsers limit the number of connections per server (6 in the case of Firefox at the time of writing) and cap the total number of downloading threads (48 in the case of Firefox at the time of writing). </li>
<li>The page is considered loaded when all these embedded URLs have been fetched. </li>
</ul>
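<p>This fetch pattern &#8211; the page first, then its resources in parallel up to a per-server cap &#8211; can be mimicked on the command line. A purely illustrative sketch (the file names are made up, and echo stands in for the actual HTTP GET):</p>

```shell
# Download the page, then fetch its embedded resources with up to 6
# parallel workers, like a browser's per-server connection cap.
echo "GET /page.html"
printf '%s\n' page.js style.css logo.png | xargs -P 6 -n 1 -I{} echo "GET {}"
```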
<p>JMeter can simulate this behaviour if the following 2 settings are configured:</p>
<ul>
<li><strong>Retrieve All Embedded Resources from HTML Files</strong>
<p><a href="wp-content/uploads/2011/12/image.png"><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2011/12/image_thumb.png" width="284" height="55" /></a></p>
<p>This checkbox is found near the bottom of <strong>HTTP Request Defaults </strong>config elements and <strong>HTTP Request </strong>samplers.</p>
<p>Check the checkbox to make JMeter download embedded resources like javascript, CSS and images, just as a browser would.</p>
<p>Add a <strong>View Results in Tree </strong>listener element if you want to see which embedded resources are downloaded and their metrics. Note that the byte counts in <strong>View Results in Table </strong>don&#8217;t include the embedded resources. </p>
</li>
<li><strong>Use concurrent pool. Size=n</strong>
<p><a href="wp-content/uploads/2011/12/image1.png"><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2011/12/image_thumb1.png" width="468" height="59" /></a></p>
<p>The behaviour of this checkbox and pool size are as follows: </p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="120"><strong>Retrieve all embedded resources from HTML files</strong></td>
<td valign="top" width="123"><strong>Use concurrent pool</strong></td>
<td valign="top" width="391"><strong>Behaviour</strong></td>
</tr>
<tr>
<td valign="top" width="120">Checked</td>
<td valign="top" width="123">Unchecked</td>
<td valign="top" width="391">The main page and its embedded resources are downloaded in the same thread.              </p>
<p>For example, if the Thread Group simulates 3 users, JMeter creates 3 threads &#8211; one for each simulated user &#8211; named &quot;Thread Group 1-1&quot; to &quot;Thread Group 1-3&quot;.               </p>
<p>Each of these threads downloads all embedded resources <em>sequentially</em>.               </p>
<p>If page P has resources A, B and C, JMeter downloads them as follows:               <br />Thread Group 1-1 : P, A, B, C (one after another)               <br />Thread Group 1-2 : P, A, B, C (one after another)               <br />Thread Group 1-3 : P, A, B, C (one after another)</td>
</tr>
<tr>
<td valign="top" width="120">Checked</td>
<td valign="top" width="123">Checked.              <br />Pool size=x</td>
<td valign="top" width="391">As usual, JMeter creates threads named &quot;Thread Group 1-k&quot; to simulate users.              </p>
<p>In addition, for each of these user threads, JMeter creates a separate threadpool of size x, with thread names like <strong>pool-n-thread-m</strong>.               </p>
<p>The main page is downloaded by the user&#8217;s thread &quot;Thread Group 1-k&quot;, while the embedded resources are downloaded by its associated threadpool&#8217;s <strong>pool-n-thread-m</strong> threads.</td>
</tr>
</tbody>
</table>
<p>So to simulate browsers, check the &#8216;<strong>Use concurrent pool</strong>&#8217; checkbox and specify a reasonable pool size (4&#8211;8 is typical for browsers).</p>
<p>However, when setting the concurrent pool size, keep in mind the number of users being simulated, because a separate threadpool is created for each simulated user. With many users, the sheer number of threads and the bandwidth contention on the JMeter machine can themselves inflate the measured response times. To simulate many users, it&#8217;s better to distribute the test across multiple JMeter machines.</p>
</li>
</ul>
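<p>Before running a big test, it&#8217;s worth estimating the total thread count this configuration creates. A quick sketch (U and X are example values):</p>

```shell
# Total JMeter threads = U user threads + U pools of X downloader threads each.
U=50   # simulated users (Thread Group size) - example value
X=6    # concurrent pool size - example value
echo "$(( U * (1 + X) )) threads total"
```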
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/simulating-browsers-using-jmeter/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr on Jetty on Ubuntu</title>
		<link>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=solr-on-jetty-on-ubuntu</link>
		<comments>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu#comments</comments>
		<pubDate>Fri, 14 Oct 2011 21:39:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[jetty]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu</guid>
		<description><![CDATA[Article RelevancySolr v3.3.0, Jetty v6.1, Ubuntu 10.04 Lucid Lynx 64-bit server This article explains steps involved in deploying Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu OS. This article is based on information from the Solr Jetty wiki page and on troubleshooting experiences of others. Prerequisites: Target system [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Solr v3.3.0, Jetty v6.1, Ubuntu 10.04 Lucid Lynx 64-bit server</div>
</div>
<p>This article explains steps involved in deploying Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu OS. This article is based on information from the <a href="http://wiki.apache.org/solr/SolrJetty">Solr Jetty wiki page</a> and on <a href="http://greenash.net.au/thoughts/2011/02/solr-jetty-and-daemons-debugging-jettysh/">troubleshooting experiences of others</a>. </p>
<p><strong>Prerequisites:</strong></p>
<ul>
<li>Target system should have at least Java 6 installed (in my case, OpenJDK 6&#8217;s JRE is installed) </li>
</ul>
<p><strong>Steps:</strong></p>
<p>1. In this description, /opt/solr will be the target directory where Solr will be deployed.</p>
<p>&#160;</p>
<p>2. The /example directory in the solr package forms the basis of the installation on the target system. It contains multiple configurations, each suitable for a different use case: </p>
<p>/example-DIH : a multicore configuration with each core demonstrating a different data importing configuration </p>
<p>/multicore : a simple multicore installation</p>
<p>/solr : a basic single core configuration. </p>
<p>Copy the configuration suitable for your application into /example/solr (replacing the one already there if necessary) and discard the rest. A configuration typically consists of /conf and /data (and sometimes also /bin and /lib) sub directories.</p>
<p>&#160;</p>
<p>Additionally, the /dist and /contrib package directories contain important jars required by some of these configurations:</p>
<p>/dist/apache-solr-dataimporthandler*.jar &#8211; if you require data importing capabilities. </p>
<p>/dist/apache-solr-cell-*.jar and /contrib/extraction/lib/*.jar &#8211; if you require content extraction from PDF, MS Office and other document files.</p>
<p>These jars should also be deployed on the target system.</p>
<p>&#160;</p>
<p>3. Copy these files to the target system and create the directory structure suggested below under /opt/solr:</p>
<p> 
<pre>
<pre class="brush: plain; title: ; notranslate">
|-- dist - All required jars, including additional jars from /contrib
|-- etc - this should probably go into the root /etc directory, as per conventions
|   |-- jetty.xml
|   `-- webdefault.xml
|-- lib
|-- solr
|   |-- bin
|   |-- conf
|   |   |-- admin-extra.html
|   |   |-- dataimport.properties
|   |   |-- elevate.xml
|   |   |-- protwords.txt
|   |   |-- schema.xml
|   |   |-- scripts.conf
|   |   |-- solrconfig.xml
|   |   |-- stopwords.txt
|   |   |-- synonyms.txt
|   |   `-- xml-data-config.xml
|   |-- data
|-- start.jar
|-- webapps
|   `-- solr.war
`-- work
</pre>
</pre>
<p></p>
<p>&#160;</p>
<p>4. The solr process should run with its own dedicated credentials, so that authorizations can be administered at a fine granularity. So create a system user and group named &#8216;solr&#8217;.</p>
<p></p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo adduser --system solr
$ sudo addgroup solr
$ sudo adduser solr solr
</pre>
</pre>
<p></p>
<p>5. Create a log directory /var/log/solr for solr and jetty logs.</p>
<p>6. Jetty outputs its errors to STDERR by default. Redirect it to a rolling log file by adding this section to /opt/solr/etc/jetty.xml.</p>
<p></p>
<pre>
<pre class="brush: xml; title: ; notranslate">
    &lt;!-- =========================================================== --&gt;
    &lt;!-- configure logging                                           --&gt;
    &lt;!-- =========================================================== --&gt;
    &lt;New id=&quot;ServerLog&quot; class=&quot;java.io.PrintStream&quot;&gt;
      &lt;Arg&gt;
        &lt;New class=&quot;org.mortbay.util.RolloverFileOutputStream&quot;&gt;
          &lt;Arg&gt;&lt;SystemProperty default=&quot;/var/log/solr&quot; name=&quot;jetty.logs&quot; /&gt;/yyyy_mm_dd.stderrout.log&lt;/Arg&gt;
          &lt;Arg type=&quot;boolean&quot;&gt;false&lt;/Arg&gt;
          &lt;Arg type=&quot;int&quot;&gt;90&lt;/Arg&gt;
          &lt;Arg&gt;&lt;Call class=&quot;java.util.TimeZone&quot; name=&quot;getTimeZone&quot;&gt;&lt;Arg&gt;GMT&lt;/Arg&gt;&lt;/Call&gt;&lt;/Arg&gt;
          &lt;Get id=&quot;ServerLogName&quot; name=&quot;datedFilename&quot;&gt;&lt;/Get&gt;
        &lt;/New&gt;
      &lt;/Arg&gt;
    &lt;/New&gt;
    &lt;Call class=&quot;org.mortbay.log.Log&quot; name=&quot;info&quot;&gt;&lt;Arg&gt;Redirecting stderr/stdout to &lt;Ref id=&quot;ServerLogName&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
    &lt;Call class=&quot;java.lang.System&quot; name=&quot;setErr&quot;&gt;&lt;Arg&gt;&lt;Ref id=&quot;ServerLog&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
    &lt;Call class=&quot;java.lang.System&quot; name=&quot;setOut&quot;&gt;&lt;Arg&gt;&lt;Ref id=&quot;ServerLog&quot; /&gt;&lt;/Arg&gt;&lt;/Call&gt;
</pre>
</pre>
<p>&#160;</p>
<p>7. Now we need to set file and directory permissions so that the solr process user can work correctly. </p>
<p>Use <strong>chown</strong> to make <strong>solr:solr</strong> as the owner and group.</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo chown -R solr:solr /opt/solr
$ sudo chown -R solr:solr /var/log/solr
</pre>
</pre>
<p>
  <br />Use <strong>chmod</strong> to give write permissions to solr:solr for the following directories: </p>
<p>/opt/solr/solr/data </p>
<p>/opt/solr/work</p>
<p>/var/log/solr</p>
<p>&#160;</p>
<p>8. The basic installation should work now. Try it out by launching Jetty as a regular process:</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
/opt/solr$ sudo java -Dsolr.solr.home=/opt/solr/solr -jar start.jar
</pre>
</pre>
<p>&#160;</p>
<p>This should start solr. </p>
<p>Verify that logs are getting generated under /var/log/solr.</p>
<p>Test it by sending a query to http://localhost:8983/solr/select?q=something using curl.</p>
<p>&#160;</p>
<p>9. Now we need to install solr as a system daemon so that it can start automatically. Download the <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.jetty/jetty/trunk/jetty-distribution/src/main/resources/bin/jetty.sh">jetty.sh startup script</a> (link courtesy <a href="http://wiki.apache.org/solr/SolrJetty">http://wiki.apache.org/solr/SolrJetty</a>) and save it as /etc/init.d/solr. Give it execute permission with <strong>sudo chmod +x /etc/init.d/solr</strong>.</p>
<p>The following environment variables need to be set. They can either be inserted in this /etc/init.d/solr script itself, or they can be stored in /etc/default/jetty, which is read by the script.</p>
<p>&#160;</p>
<pre>
<pre class="brush: plain; title: ; notranslate">
JAVA_HOME=/usr/lib/jvm/default-java

JAVA_OPTIONS=&quot;-Xmx64m -Dsolr.solr.home=/opt/solr/solr&quot;

JETTY_HOME=/opt/solr

JETTY_USER=solr

JETTY_GROUP=solr

JETTY_LOGS=/var/log/solr
</pre>
</pre>
<p>&#160;</p>
<p>Set the -Xmx parameter as per your requirements.</p>
<p>&#160;</p>
<p>10. Additionally, this startup script has a problem that prevents it from running in Ubuntu. If you try running this right now using</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo /etc/init.d/solr start
</pre>
</pre>
<p>&#160;</p>
<p>you&#8217;ll get a </p>
<blockquote>
<p>Starting Jetty: FAILED</p>
</blockquote>
<p>error.</p>
<p>&#160;</p>
<p>The problem &#8211; as explained well in this <a href="http://greenash.net.au/thoughts/2011/02/solr-jetty-and-daemons-debugging-jettysh/">troubleshooting article</a> &#8211; is in this line that attempts to start the daemon:</p>
<p>&#160;</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
if start-stop-daemon -S -p&quot;$JETTY_PID&quot; $CH_USER -d&quot;$JETTY_HOME&quot; -b -m -a &quot;$JAVA&quot; -- &quot;${RUN_ARGS[@]}&quot; --daemon
</pre>
</pre>
<p>&#160;</p>
<p>In Ubuntu, <strong>--daemon</strong> is <strong>not</strong> a valid option for start-stop-daemon. Remove that option from the script:</p>
<p></p>
<pre>
<pre class="brush: bash; title: ; notranslate">
if start-stop-daemon -S -p&quot;$JETTY_PID&quot; $CH_USER -d&quot;$JETTY_HOME&quot; -b -m -a &quot;$JAVA&quot; -- &quot;${RUN_ARGS[@]}&quot;
</pre>
</pre>
<p>&#160;</p>
<p>If you try starting it now, it should work:</p>
<pre>
<pre class="brush: bash; title: ; notranslate">
$ sudo /etc/init.d/solr start
</pre>
</pre>
<p>&#160;</p>
<p>It should give a</p>
<blockquote>
<p>Starting Jetty: OK</p>
</blockquote>
<p>message, and ps -ef |grep java should show the &quot;java -jar start.jar&quot; process.</p>
<p>&#160;</p>
<p>11. Finally, it&#8217;s time to configure this as an init script. Read this article if you want a background on <a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained">Ubuntu runlevels and init scripts</a>. </p>
<p>Insert these lines at the top of /etc/init.d/solr to make it a LSB (Linux Standard Base) compliant init script. Without these lines, it&#8217;s not possible to configure the run level scripts.</p>
<blockquote>
<pre>### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO</pre>
</blockquote>
<p>&#160;</p>
<p>Now run the following command:</p>
<p></p>
<pre>
<pre class="brush: plain; title: ; notranslate">
$ sudo update-rc.d solr defaults
 Adding system startup for /etc/init.d/solr ...
   /etc/rc0.d/K20solr -&gt; ../init.d/solr
   /etc/rc1.d/K20solr -&gt; ../init.d/solr
   /etc/rc6.d/K20solr -&gt; ../init.d/solr
   /etc/rc2.d/S20solr -&gt; ../init.d/solr
   /etc/rc3.d/S20solr -&gt; ../init.d/solr
   /etc/rc4.d/S20solr -&gt; ../init.d/solr
   /etc/rc5.d/S20solr -&gt; ../init.d/solr
</pre>
</pre>
<p></p>
<p>As you can see, the run levels 2-5 (they are equivalent in Ubuntu) are now configured to start solr.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/solr-on-jetty-on-ubuntu/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ubuntu startup &#8211; init scripts, runlevels, upstart jobs explained</title>
		<link>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained</link>
		<comments>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#comments</comments>
		<pubDate>Sun, 25 Sep 2011 09:03:24 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[chkconfig]]></category>
		<category><![CDATA[init.d]]></category>
		<category><![CDATA[rc.d]]></category>
		<category><![CDATA[run levels]]></category>
		<category><![CDATA[runlevel]]></category>
		<category><![CDATA[service command]]></category>
		<category><![CDATA[update-rc.d]]></category>
		<category><![CDATA[upstart]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained</guid>
		<description><![CDATA[Article RelevancyUbuntu 10.04 Lucid Lynx; believed to be relevant for Ubuntu 8.x to 11.x, the latest release at the time of writing this article Contents Run levels and init.d scripts &#8211; the traditional mechanism Get Current run level /etc/init.d directories /etc/rcn.d directories Enabling and disabling run level services Upstart Resources for further reading Ubuntu has [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Ubuntu 10.04 Lucid Lynx; believed to be relevant for Ubuntu 8.x to 11.x, the latest release at the time of writing this article</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-run-levels-and-init-d-scripts-the-traditional-mechanism">Run levels and init.d scripts &#8211; the traditional mechanism</a></p>
<ol>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-get-current-run-level">Get Current run level</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-etcinit-d-directories">/etc/init.d directories</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-etcrcn-d-directories">/etc/rc<em>n</em>.d directories</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-enabling-and-disabling-run-level-services">Enabling and disabling run level services</a></li>
</ol>
</li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-upstart">Upstart</a></li>
<li><a href="http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained#toc-resources-for-further-reading">Resources for further reading</a></li>
</ol>
</div>
</div>
<p>Ubuntu has two different mechanisms for starting system services:</p>
<ul>
<li>The traditional mechanism based on run levels, and scripts in /etc/init.d and /etc/rc<em>n</em>.d directories</li>
<li>A new mechanism known as <em>upstart. </em></li>
</ul>
<p>Some services are started using one mechanism and others using the other. If you want to control the services, it&#8217;s necessary to understand these mechanisms.</p>
<p><br/></p>
<h1 id="toc-run-levels-and-init-d-scripts-the-traditional-mechanism">Run levels and init.d scripts &#8211; the traditional mechanism</h1>
<p>All Linux distros have the concept of <em>run levels</em> as part of the Linux Standard Base specification. They can be considered &#8220;modes&#8221; in which Linux runs.</p>
<table width="652" border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top" width="71"><strong><span style="color: #0000ff;">Run level</span></strong></td>
<td valign="top" width="185"><strong><span style="color: #0000ff;">Name</span></strong></td>
<td valign="top" width="394"><strong><span style="color: #0000ff;">Description</span></strong></td>
</tr>
<tr>
<td valign="top" width="71">0</td>
<td valign="top" width="185">Halt</td>
<td valign="top" width="394">Shuts down the system</td>
</tr>
<tr>
<td valign="top" width="71">1</td>
<td valign="top" width="185">Single-user mode</td>
<td valign="top" width="394">Mode for administrative tasks.</td>
</tr>
<tr>
<td valign="top" width="71">2</td>
<td valign="top" width="185">Multi-User Mode</td>
<td valign="top" width="394">Does not configure network interfaces and does not export network services</td>
</tr>
<tr>
<td valign="top" width="71">3</td>
<td valign="top" width="185">Multi-User Mode with Networking</td>
<td valign="top" width="394">Starts the system normally</td>
</tr>
<tr>
<td valign="top" width="71">4</td>
<td valign="top" width="185">Not used / user definable</td>
<td valign="top" width="394">For special purposes</td>
</tr>
<tr>
<td valign="top" width="71">5</td>
<td valign="top" width="185">Multi-User Mode with GUI display manager</td>
<td valign="top" width="394">Run level 3 + display manager</td>
</tr>
<tr>
<td valign="top" width="71">6</td>
<td valign="top" width="185">Reboot</td>
<td valign="top" width="394">Reboots the system</td>
</tr>
<tr>
<td valign="top" width="71">s or S</td>
<td valign="top" width="185">Single-user mode</td>
<td valign="top" width="394">Does not configure network interfaces, or start daemons.</td>
</tr>
</tbody>
</table>
<p>In Ubuntu (and Debian), <span style="text-decoration: underline;">run levels 2 to 5 are equivalent</span> and configured with the same set of services.</p>
<h2 id="toc-get-current-run-level">Get Current run level</h2>
<p>Use the <strong>runlevel</strong> command to get the current run level. <strong>runlevel</strong> is available in Ubuntu as well as Red Hat-based distros like CentOS (not sure about other distros).</p>
<p>karthik@ubuntuLynx:~$ runlevel<br />
N 2</p>
<h2 id="toc-etcinit-d-directories">/etc/init.d directories</h2>
<p>The /etc/init.d directory contains scripts, which can start / stop / restart services. These are invoked with a start|stop argument at startup and shutdown.</p>
<h2 id="toc-etcrcn-d-directories">/etc/rc<em>n</em>.d directories</h2>
<p>The /etc/rc<em>n</em>.d directories specify which scripts in /etc/init.d are enabled for run level <em>n</em>.</p>
<p>For example, /etc/rc2.d specifies which scripts in /etc/init.d are enabled for run level 2. At startup and shutdown, only these enabled scripts are invoked.</p>
<p>Entries in /etc/rc<em>n</em>.d directories are symlinks to scripts in /etc/init.d, but with a special prefix of the format</p>
<p>[S|K]nn</p>
<p>S means the script is enabled for this run level.</p>
<p>K means the script is disabled for this run level.</p>
<p>nn is a two-digit sequence number that controls the order in which services are started, so that a service which depends on other services starts only after they have.</p>
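<p>The prefix can be decoded mechanically; a small illustrative shell sketch:</p>

```shell
# Decode an /etc/rcn.d entry name into action, sequence and service name.
entry=S91apache2
action=$(case "$entry" in (S*) echo start ;; (K*) echo stop ;; esac)
num=$(echo "$entry" | cut -c2-3)
svc=$(echo "$entry" | cut -c4-)
echo "$svc: $action at sequence $num"
```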
<p>Below is a listing of /etc/rc2.d. It shows that tomcat6, dovecot and postfix are not <em>automatically</em> started in run level 2. However, they can be started manually.</p>
<pre class="brush: plain; title: ; notranslate">
K08tomcat6
K76dovecot
K80postfix
S20gpm
S20winbind
S50rsync
S70dns-clean
S70pppd-dns
S91apache2
S99grub-common
S99ondemand
S99rc.local
</pre>
<h2 id="toc-enabling-and-disabling-run-level-services">Enabling and disabling run level services</h2>
<p>Use the <strong>chkconfig --list </strong>command to get an overview of all services and their status. If not installed, install it using <strong>sudo apt-get install chkconfig. </strong>It gives a status listing like this:</p>
<pre class="brush: plain; title: ; notranslate">
karthik@ubuntukarmic:~$ chkconfig --list
acpi-support              0:off  1:off  2:on   3:on   4:on   5:on   6:off
acpid                     0:off  1:off  2:off  3:off  4:off  5:off  6:off
alsa-utils                0:off  1:off  2:off  3:off  4:off  5:off  6:off
...
</pre>
<p>Use the <strong>update-rc.d</strong> command to enable or disable a service at a run level:</p>
<p>Syntax<em>: sudo     update-rc.d     name    enable|disable    runlevel</em></p>
<p>Example:<strong> </strong><em>sudo update-rc.d dovecot disable 2<br />
</em></p>
<p>or</p>
<p><em>sudo update-rc.d dovecot defaults</em></p>
<p>&nbsp;</p>
<p>When creating new init scripts, ensure that the script has the following section (this is an example &#8211; change values appropriately) at the top to make it LSB (Linux Standard Base) compliant. Without this section, <strong>update-rc.d</strong> won&#8217;t work and will give a &#8220;missing LSB information&#8221; warning.</p>
<blockquote>
<pre>### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO</pre>
</blockquote>
<p>&nbsp;</p>
<hr />
<h1 id="toc-upstart">Upstart</h1>
<p>Upstart jobs are configured in /etc/init directory, in .conf files.</p>
<p>Use the service command to start and stop upstart services:</p>
<p><strong>sudo service </strong>&lt;servicename&gt; <strong>start|stop</strong></p>
<p>To disable an upstart service from starting at boot, open the respective /etc/init/[service].conf file and comment out the lines that begin with <strong>start on</strong>.</p>
<p>example:</p>
<pre class="brush: plain; title: ; notranslate">
...
#start on (net-device-up
#          and local-filesystems
#         and runlevel [2345])

...
</pre>
<p>This prevents the service from starting at boot, but still allows manual starts using the <strong>service</strong> command.</p>
<p>To disable a service completely &#8211; from both automatic and manual starts &#8211; it&#8217;s better to uninstall the package, but it&#8217;s also possible to just rename the .conf file to .conf.disabled.</p>
<hr />
<h1 id="toc-resources-for-further-reading">Resources for further reading</h1>
<ul>
<li><a title="http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347" href="http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347">http://askubuntu.com/questions/19320/whats-the-recommend-way-to-enable-disable-services/20347#20347</a> &#8211; this post from an Ubuntu developer explains in detail the history behind the init.d mechanism, its problems, and how the new Upstart mechanism solves them.</li>
<li><a href="http://www.yolinux.com/TUTORIALS/LinuxTutorialInitProcess.html">http://www.yolinux.com/TUTORIALS/LinuxTutorialInitProcess.html</a></li>
<li><a href="http://www.linux-tutorial.info/modules.php?name=MContent&amp;pageid=67">http://www.linux-tutorial.info/modules.php?name=MContent&amp;pageid=67</a></li>
<li><a href="http://oldfield.wattle.id.au/luv/boot.html">http://oldfield.wattle.id.au/luv/boot.html</a> &#8211; The Linux boot process.</li>
<li><a href="http://manpages.ubuntu.com/manpages/hardy/man8/update-rc.d.8.html">http://manpages.ubuntu.com/manpages/hardy/man8/update-rc.d.8.html</a></li>
<li><a href="http://upstart.ubuntu.com/cookbook/#what-is-upstart">http://upstart.ubuntu.com/cookbook/#what-is-upstart</a> &#8211; From the Upstart Cookbook</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/ubuntu-startup-init-scripts-runlevels-upstart-jobs-explained/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Extraction in Solr</title>
		<link>http://www.pathbreak.com/blog/content-extraction-in-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=content-extraction-in-solr</link>
		<comments>http://www.pathbreak.com/blog/content-extraction-in-solr#comments</comments>
		<pubDate>Sun, 28 Nov 2010 04:28:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Apache Tika]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Solr Cell]]></category>
		<category><![CDATA[solr content extraction]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/content-extraction-in-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Howto Restrictions of default content extraction Overview The example solrconfig.xml is already configured for content extraction from any document format &#8211; like MS Word DOC, PDF, – which can be handled by Apache Tika. Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-howto">Howto</a></li>
<li><a href="http://www.pathbreak.com/blog/content-extraction-in-solr#toc-restrictions-of-default-content-extraction">Restrictions of default content extraction</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>The example solrconfig.xml is already configured for content extraction from any document format that Apache Tika can handle &#8211; such as MS Word DOC or PDF. </p>
<p>Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika and Apache POI libraries.</p>
<p>The <strong>ExtractingRequestHandler</strong> configuration in solrconfig.xml specifies the endpoint at which documents can be submitted for extraction. It&#8217;s usually <a href="http://localhost:8983/solr/update/extract">http://localhost:8983/solr/update/extract</a>.</p>
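<p>For reference, the relevant solrconfig.xml entry looks roughly like the following sketch; the exact class name and defaults may differ between Solr versions:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;!-- Maps the /update/extract endpoint to the Solr Cell extracting handler --&gt;
&lt;requestHandler name=&quot;/update/extract&quot; class=&quot;org.apache.solr.handler.extraction.ExtractingRequestHandler&quot;&gt;
  &lt;lst name=&quot;defaults&quot;&gt;
    &lt;!-- Field that receives the extracted text --&gt;
    &lt;str name=&quot;fmap.content&quot;&gt;text&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;
</pre>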
<p>&#160;</p>
<h1 id="toc-howto">Howto</h1>
<ul>
<li>To index a document, send the request as</li>
</ul>
<blockquote><p>curl &quot;http://localhost:8983/solr/update/extract?literal.id=book1&amp;commit=true&quot; -F myfile=@book.pdf</p>
</blockquote>
<p>The request is sent as a multi-part form upload.</p>
<ul>
<li>By default, document contents are added to the document field “text”. The field can be changed in /solr/conf/solrconfig.xml in the extracting handler’s &lt;requestHandler&gt; element; its child element “fmap.content” specifies which field the content should be indexed under.</li>
<blockquote><p>&lt;str name=”fmap.content”&gt;text&lt;/str&gt;</p>
</blockquote>
</ul>
<p>Since “text” is NOT a stored field, features like result highlighting won’t be available.</p>
<p>If result highlighting is required, modify /solr/conf/schema.xml to include a new <em>stored</em> field called “doc_content” which receives document contents from the extracting handler. “doc_content” itself can be copied into the “text” catch-all field so that all queries can be matched against document contents.</p>
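<p>A minimal sketch of such a schema.xml change follows; the field name “doc_content” is just the example used above, and attribute details may vary with the schema version:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;!-- Stored copy of extracted document contents, so highlighting can use it --&gt;
&lt;field name=&quot;doc_content&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;true&quot; multiValued=&quot;true&quot;/&gt;

&lt;!-- Make document contents searchable through the catch-all field too --&gt;
&lt;copyField source=&quot;doc_content&quot; dest=&quot;text&quot;/&gt;
</pre>
<p>The extracting handler&#8217;s fmap.content should then be changed to map to “doc_content” instead of “text”.</p>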
<p>&#160;</p>
<h1 id="toc-restrictions-of-default-content-extraction">Restrictions of default content extraction</h1>
<ul>
<li>Since the extracting handler can specify only a single content field, contents of multiple files will all go into the same content field. This is a problem if the file containing the search string has to be indicated to the user. </li>
<li>There is no out-of-the-box workaround for this in Solr. A specialized extracting handler has to be written to map each file (“content stream” in Solr terminology) in the multipart request to a separate content field.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/content-extraction-in-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr search data modelling</title>
		<link>http://www.pathbreak.com/blog/solr-search-data-modelling?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=solr-search-data-modelling</link>
		<comments>http://www.pathbreak.com/blog/solr-search-data-modelling#comments</comments>
		<pubDate>Sun, 28 Nov 2010 04:09:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr fields]]></category>
		<category><![CDATA[solr modelling]]></category>
		<category><![CDATA[solr schema.xml]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/solr-search-data-modelling</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x and 3.3.x Contents Overview &#60;types&#62; section Basic field types &#60;fields&#62; section Overview Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr. The schema.xml in /solr/conf is where the application search model should be defined. The &#60;types&#62; element defines the [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x and 3.3.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-types-section">&lt;types&gt; section</a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-basic-field-types">Basic field types </a></li>
<li><a href="http://www.pathbreak.com/blog/solr-search-data-modelling#toc-fields-section">&lt;fields&gt; section</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr.</p>
<p>The <strong>schema.xml</strong> in /solr/conf is where the application search model should be defined. </p>
<p>The <strong>&lt;types&gt;</strong> element defines the set of field types available in the model. </p>
<p>The <strong>&lt;fields&gt;</strong> element defines the set of fields of each document in the model. Each field has a type which is defined in the <strong>&lt;types&gt;</strong> element. </p>
<p>&#160;</p>
<h1 id="toc-types-section">&lt;types&gt; section</h1>
<p> This section describes the types for all fields in the model. It contains <strong>&lt;fieldType&gt;</strong> elements. Each <strong>&lt;fieldType&gt;</strong> has these attributes:
<ul>
<li><strong>name</strong> is the name of the field type definition and is referred from the &lt;fields&gt; section </li>
<li><strong>class</strong> is the subclass of <strong>org.apache.solr.schema.FieldType</strong> that models this field type definition.&#160; Class names starting with &quot;solr&quot; refer to java classes in the org.apache.solr.analysis package. </li>
<li><strong>sortMissingLast</strong> and <strong>sortMissingFirst </strong><br />
<blockquote><p>The optional sortMissingLast and sortMissingFirst attributes are currently supported on types that are sorted internally as strings. This includes &quot;string&quot;,&quot;boolean&quot;,&quot;sint&quot;,&quot;slong&quot;,&quot;sfloat&quot;,&quot;sdouble&quot;,&quot;pdate&quot;.<br />
- If sortMissingLast=&quot;true&quot;, then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).<br />
- If sortMissingFirst=&quot;true&quot;, then a sort on this field will cause documents without the field to come before documents with the field, regardless of the requested sort order.<br />
- If sortMissingLast=&quot;false&quot; and sortMissingFirst=&quot;false&quot; (the default), then default lucene sorting will be used which places docs without the field first in an ascending sort and last in a descending sort.</p></blockquote>
</li>
<li><strong>omitNorms </strong>is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms. </li>
</ul>
<p>Each field type definition has an associated <strong>Analyzer</strong> to tokenize and filter characters or tokens. </p>
<p>The <strong>Trie </strong>field types are suitable for numeric fields that involve numeric range queries. The trie concept makes searching such fields faster. </p>
<p>&#160;</p>
<h1 id="toc-basic-field-types">Basic field types </h1>
<table style="table-layout: fixed" border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="20%">string</td>
<td valign="top" width="80%">Fields of this type are not analyzed (ie, not tokenized or filtered), but are indexed and stored verbatim.</td>
</tr>
<tr>
<td valign="top" width="20%">binary</td>
<td valign="top" width="80%">For binary data. Should be sent/retrieved as Base64 encoded strings.</td>
</tr>
<tr>
<td valign="top" width="20%">int/tint/pint<br />long/tlong/plong<br />float/tfloat/pfloat<br />double/tdouble/pdouble</td>
<td valign="top" width="80%">The regular types (int, float, etc.) and their t- versions differ in their precisionStep values. The precisionStep value is used to generate indexes at different precision levels, to support numeric range queries. Both sets are modelled by TrieField types, but the t- versions have a precisionStep of 8 while the regular types have 0. So numeric range queries will be faster with the t- versions, but indexes will be larger (and probably slower). The p- versions are for when numeric range queries are not needed at all; they are modelled by non-Trie types.</td>
</tr>
<tr>
<td valign="top" width="20%">date/tdate/pdate</td>
<td valign="top" width="80%">Similar to the above differences among numeric fields. Use tdate for date ranges and date faceting. Dates have to be in a special UTC timezone format, like this example: <strong>2011-02-06T05:34:00.299Z</strong>. Use <strong>org.apache.solr.common.util.DateUtil</strong>.<em>getThreadLocalDateFormat</em>().format(new Date()) to get a date in this format.</td>
</tr>
<tr>
<td valign="top" width="20%">sint/slong/         <br />sfloat/sdouble</td>
<td valign="top" width="80%">Sortable fields</td>
</tr>
</tbody>
</table>
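<p>The date format mentioned above can also be produced with plain JDK classes &#8211; a minimal sketch, useful on the client side when Solr&#8217;s DateUtil isn&#8217;t handy:</p>
<pre class="brush: java; title: ; notranslate">
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
    // Produces the UTC format Solr expects, e.g. 2011-02-06T05:34:00.299Z
    public static String toSolrDate(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(new Date()));
    }
}
</pre>
<p>Note that SimpleDateFormat isn&#8217;t thread-safe &#8211; which is why DateUtil&#8217;s variant is handed out per thread.</p>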
<p><strong>Text field types</strong> &#8211; Since Solr is a full-text search solution, the text field types and their configuration are the most critical part of the modelling. Modelling of text fields is explained in detail in the article <a href="solr-text-field-types-analyzers-tokenizers-filters-explained" target="_blank">Solr text field types, analyzers, tokenizers &amp; filters explained</a>. </p>
<p>&#160;</p>
<h1 id="toc-fields-section">&lt;fields&gt; section</h1>
<p>Fields of documents are described in this section using <strong>&lt;field&gt;</strong> elements. </p>
<p>Each <strong>&lt;field&gt;</strong> element can have these attributes: </p>
<table style="table-layout: fixed" border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="20%">name</td>
<td valign="top" width="80%">(mandatory) the name of the field, used in search queries and facet fields.</td>
</tr>
<tr>
<td valign="top" width="20%">type</td>
<td valign="top" width="80%">(mandatory) the name of a previously defined type from the &lt;types&gt; section</td>
</tr>
<tr>
<td valign="top" width="20%">indexed</td>
<td valign="top" width="80%">true if this field should be indexed (should be searchable or sortable)</td>
</tr>
<tr>
<td valign="top" width="20%">stored</td>
<td valign="top" width="80%">true if this field value should be retrievable verbatim in search results.</td>
</tr>
<tr>
<td valign="top" width="20%">compressed</td>
<td valign="top" width="80%">[false] if this field should be stored using gzip compression (this only applies if the field type is compressible; among the standard field types, only TextField and StrField are). This is very useful for large data fields, but will probably slow down retrieval of search results &#8211; so it should not be used for fields that are queried frequently.</td>
</tr>
<tr>
<td valign="top" width="20%">multiValued</td>
<td valign="top" width="80%">true if this field may contain multiple values per document</td>
</tr>
<tr>
<td valign="top" width="20%">omitNorms</td>
<td valign="top" width="80%">(expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory).&#160; Only full-text fields or fields that need an index-time boost need norms.</td>
</tr>
<tr>
<td valign="top" width="20%">termVectors</td>
<td valign="top" width="80%">[false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.</td>
</tr>
<tr>
<td valign="top" width="20%">termPositions</td>
<td valign="top" width="80%">Store position information with the term vector.&#160; This will increase storage costs.</td>
</tr>
<tr>
<td valign="top" width="20%">termOffsets</td>
<td valign="top" width="80%">Store offset information with the term vector. This will increase storage costs.</td>
</tr>
<tr>
<td valign="top" width="20%">default</td>
<td valign="top" width="80%">a value that should be used if no value is specified when adding a document.</td>
</tr>
</tbody>
</table>
<p>The example deployment itself defines many commonly used fields and types; study them and check if something needed is already available before modelling your own. </p>
<p><strong>&lt;dynamicField&gt;</strong> elements can be used to model field names which are not explicitly defined by name, but which match some defined pattern. </p>
<p><strong>&lt;copyField&gt;</strong> definitions specify that one field should be copied to another at the time a document is added to the index. It&#8217;s used either to index the same field differently, or to merge multiple fields into one for easier/faster searching. For example, all text fields in the document can be copied to a single catch-all field, for faster querying. </p>
<p><strong>&lt;uniqueKey&gt;</strong> element specifies the field to be used to determine and enforce document uniqueness. </p>
<p><strong>&lt;defaultSearchField&gt;</strong> element specifies the field to be queried when it’s not explicitly specified in the query string using a “field:value” syntax. The catch-all copyfield is usually specified as the default search field. </p>
<p><strong>&lt;solrQueryParser&gt;</strong> specifies query parser configuration. <strong>defaultOperator=”AND|OR” </strong>specifies whether query terms are combined using the AND operator or the OR operator. </p>
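<p>A small schema.xml sketch tying these elements together might look like this; the field names are illustrative only, and attribute support varies between Solr versions:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;fields&gt;
  &lt;!-- unique document id, stored so it can be returned in results --&gt;
  &lt;field name=&quot;id&quot; type=&quot;string&quot; indexed=&quot;true&quot; stored=&quot;true&quot; required=&quot;true&quot;/&gt;
  &lt;field name=&quot;name&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;true&quot;/&gt;
  &lt;!-- catch-all field, indexed but not stored --&gt;
  &lt;field name=&quot;text&quot; type=&quot;text&quot; indexed=&quot;true&quot; stored=&quot;false&quot; multiValued=&quot;true&quot;/&gt;
  &lt;!-- any field whose name ends in _i is treated as an int --&gt;
  &lt;dynamicField name=&quot;*_i&quot; type=&quot;int&quot; indexed=&quot;true&quot; stored=&quot;true&quot;/&gt;
&lt;/fields&gt;

&lt;copyField source=&quot;name&quot; dest=&quot;text&quot;/&gt;
&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;
&lt;defaultSearchField&gt;text&lt;/defaultSearchField&gt;
</pre>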
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/solr-search-data-modelling/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Faceting &#8211; or drilldown &#8211; search using Solr</title>
		<link>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=faceting-or-drilldown-search-using-solr</link>
		<comments>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#comments</comments>
		<pubDate>Fri, 26 Nov 2010 18:37:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[drilldown search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr faceting]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Steps Facet filter query syntax Handling large number of facet values using pagination Facet Query vs Filter Query of facet Understanding facet counts Overview Faceted searching &#8211; also called drilldown searching &#8211; refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-steps">Steps</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-facet-filter-query-syntax">Facet filter query syntax</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-handling-large-number-of-facet-values-using-pagination">Handling large number of facet values using pagination</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-facet-query-vs-filter-query-of-facet">Facet Query vs Filter Query of facet</a></li>
<li><a href="http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr#toc-undestanding-facet-counts">Understanding facet counts</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>Faceted searching &#8211; also called drilldown searching &#8211; refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like Amazon and eBay provide this in their search pages. </p>
<p>Solr has excellent support for faceting. The sections below describe how to use faceting in java applications, using the solrj client API.</p>
<p>&#160;</p>
<h1 id="toc-steps">Steps</h1>
<p><strong>Step 1 : Do the first level search and get first level facets</strong></p>
<pre class="brush: java; title: ; notranslate">
SolrQuery qry = new SolrQuery(strQuery);
String[] fetchFacetFields = new String[]{&quot;categories&quot;};
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry);

QueryResponse resp = qryReq.process(solrServer);

SolrDocumentList results = resp.getResults();
int count = results.size();
System.out.println(count + &quot; hits&quot;);
for (int i = 0; i &lt; count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println(&quot;#&quot; + (i+1) + &quot;:&quot; + hitDoc.getFieldValue(&quot;name&quot;));
    for (Iterator&lt;Entry&lt;String, Object&gt;&gt; flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry&lt;String, Object&gt; entry = flditer.next();
        System.out.println(entry.getKey() + &quot;: &quot; + entry.getValue());
    }
}

List&lt;FacetField&gt; facetFields = resp.getFacetFields();
for (int i = 0; i &lt; facetFields.size(); i++) {
    FacetField facetField = facetFields.get(i);
    List&lt;Count&gt; facetInfo = facetField.getValues();
    for (FacetField.Count facetInstance : facetInfo) {
        System.out.println(facetInstance.getName() + &quot; : &quot; + facetInstance.getCount() + &quot; [drilldown qry:&quot; + facetInstance.getAsFilterQuery() + &quot;]&quot;);
    }
}
</pre>
<p>&#160;</p>
<p>The response will contain details of number of hits for each instance of the facet. </p>
<p>For example, if the field <strong>categories</strong> has values <strong>movies</strong> and <strong>songs</strong> in the set of matching hits, then each of them is called a <strong>facet instance.</strong>&#160; </p>
<p>Each facet instance of a FacetField has a name (“songs”), and each has an associated facet instance count and a filter query. </p>
<p>Facet instance count of 10 for “categories:songs” means in the set of all search results, 10 results have the value of <strong>categories</strong> as <strong>songs</strong>. </p>
<p>Facet instance filter query is the subquery to go down to the next level of drilldown search, by filtering on the facet instance value. </p>
<p>At this point in a typical drilldown search user interface, the left sidebar with all the filters would display those facet instances that have a nonzero instance count, with checkboxes and their respective counts. The user can then select the most promising facet to drill down along by checking its checkbox.</p>
<p>&#160;</p>
<p><strong>Step 2: Add facet filter query for next level of refined results</strong></p>
<p>Add the filter query of facet instance to the main query, using <strong>addFilterQuery.</strong> </p>
<p>The filter query for a single facet instance is of the format “&lt;field&gt;:&lt;value&gt;”. Example: addFilterQuery(“categories:movies”); </p>
<pre class="brush: java; title: ; notranslate">
// filterQueries is a String[] of facet filter queries obtained using getAsFilterQuery() from the previous search
SolrQuery qry = new SolrQuery(strQuery);
if (filterQueries != null) {
    for (String fq : filterQueries) {
        qry.addFilterQuery(fq);
    }
}
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);
</pre>
<p>For subsequent levels of refinement, add facet instance filter queries to the current level’s main query, and add the list of facet fields required for the next level. </p>
<p>&#160;</p>
<h1 id="toc-facet-filter-query-syntax">Facet filter query syntax</h1>
<p>Facet filter queries have some rather intricate syntax for achieving various search behaviours, described below.</p>
<p>&#160;</p>
<p><strong>Selecting multiple facets</strong></p>
<p>In some drilldown search designs, a user is allowed to specify multiple facet instances for the <em>same </em>field. For example, a <strong>categories </strong>field may have multiple category facet instances. In such cases, the facet instances should be combined using an OR operator. </p>
<blockquote><p><span style="text-decoration: underline">Categories</span></p>
<p>[ ] Movies (300)</p>
<p>[ ] Songs (400)</p>
<p>[ ] Ads (150)</p>
</blockquote>
<p>&#160;</p>
<p>If the user selects “Movies” and “Songs”, the filter query should have the semantics of an OR operator &#8211; </p>
<p>“..where category=movies OR category=songs”. </p>
<p>This can be specified in solr filter queries by enclosing the facet instances inside parentheses: </p>
<blockquote><p>&lt;fqfield&gt;:<strong><span style="color: #ff0000">(</span></strong>value1 value2 value3…<strong><span style="color: #ff0000">)</span></strong> </p>
</blockquote>
<p>examples: </p>
<p><span style="color: #666666">In command line URL : </span></p>
<blockquote><p><span style="color: #666666"><em>fq=categories%3A%28songs+movies%29</em></span> </p>
</blockquote>
<p><span style="color: #666666">where %3A is the character ‘:’, %28 is ‘(’ and %29 is ‘)’</span> </p>
<p><span style="color: #666666">OR, equivalently</span> </p>
<p><span style="color: #666666">In java</span></p>
<blockquote><p><span style="color: #666666">qry.<em>addFilterQuery(“categories:(songs movies)”);</em></span> </p>
</blockquote>
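<p>The URL encoding above needn&#8217;t be done by hand; as a sketch, the JDK&#8217;s URLEncoder produces exactly these escapes:</p>
<pre class="brush: java; title: ; notranslate">
import java.net.URLEncoder;

public class FacetFilterQueryEncoder {
    // Percent-encodes a filter query for use as an fq parameter value in a Solr URL
    public static String encode(String filterQuery) throws Exception {
        return URLEncoder.encode(filterQuery, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // prints categories%3A%28songs+movies%29
        System.out.println(encode("categories:(songs movies)"));
    }
}
</pre>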
<p><span style="color: #666666"><strong>Whitespaces in facet instances</strong></span></p>
<p><span style="color: #666666">If facet instances have whitespace within them, then each facet instance should simply be enclosed in double quotes (%22).</span></p>
<p><span style="color: #666666"></span><span style="color: #666666">For example, for a facet field &quot;crn&quot; with facet instances “M.Tech. Computer Sc. &amp; Engg.” and “ELECTRICAL ENGINEERING” (note the whitespaces), the syntax: </span></p>
<p>In URLs: </p>
<blockquote><p><em>fq=crn%3A%28<span style="color: #ff0000">%22M.Tech.+Computer+Sc.+%26+Engg.%22</span>+<span style="color: #ff0000">%22ELECTRICAL+ENGINEERING%22</span>%29</em> </p>
</blockquote>
<p>OR </p>
<p>In Java:</p>
<blockquote><p>qry.<em>addFilterQuery(&quot;crn:(\&quot;M.Tech. Computer Sc. &amp; Engg.\&quot; \&quot;ELECTRICAL ENGINEERING\&quot;)&quot;);</em> </p>
</blockquote>
<p>&#160;</p>
<p>&#160;</p>
<h1 id="toc-handling-large-number-of-facet-values-using-pagination">Handling large number of facet values using pagination</h1>
<p>Solr provides pagination for facet values and automatically imposes a limit on the number of values returned for each facet field. This limit can be set using the <strong>facet.limit</strong> query parameter, or <em>setFacetLimit()</em> API, and the facet value offset can be set using <strong>facet.offset </strong>query parameter. </p>
<p>However, there is no direct API like <em>setFacetOffset() </em>in SolrJ; instead, use </p>
<blockquote><p><em>solrQry.add(FacetParams.FACET_OFFSET, “100”)</em></p></blockquote>
<p>&#160;</p>
<p>&#160;</p>
<h1 id="toc-facet-query-vs-filter-query-of-facet">Facet Query vs Filter Query of facet</h1>
<p>The Solr API also contains methods that refer to &quot;facet queries&quot;. It&#8217;s important not to confuse facet queries with filter queries of facets. At first glance, it looks like the facet query concept is what provides the drilldown capability, but that&#8217;s not so in the general case. </p>
<p><em><u>Facet query</u></em> is a kind of dynamic facet field, applicable only to certain use cases where it makes sense to categorize items into <strong>ranges</strong> &#8211; either numerical or date ranges. </p>
<p>For example, if items have to be categorized into price ranges like [$100-$200], [$200-$300] etc, then <em><u>facet queries</u></em> have to be used to “get the count of all items whose price&gt;$100 and price&lt;$200”. Just specifying the price field as a facet field would not be useful here, because it just returns the list of all unique prices available in the search results. What really provides the drilldown capabilities in this case is the facet query concept. </p>
<p>Facet queries are specified using the syntax <strong>field:[start TO end].</strong> In URL, it should go in encoded format : </p>
<blockquote><p><em>facet.query=age:[20+TO+22]</em></p></blockquote>
<p> In API, it’s specified as<br />
<blockquote>solrQuery.addFacetQuery(“age:[20 TO 22]”);</p></blockquote>
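<p>Generating such range facet queries for consecutive buckets is simple string building; the sketch below uses a hypothetical “price” field, and each generated string would be passed to addFacetQuery():</p>
<pre class="brush: java; title: ; notranslate">
public class RangeFacetQueries {
    // Builds facet.query strings like price:[100 TO 200] for consecutive buckets
    public static String[] buildRangeQueries(String field, int start, int end, int step) {
        int n = (end - start) / step;
        String[] queries = new String[n];
        for (int i = 0; i != n; i++) {
            int lo = start + i * step;
            queries[i] = field + ":[" + lo + " TO " + (lo + step) + "]";
        }
        return queries;
    }

    public static void main(String[] args) {
        // prints price:[100 TO 200], price:[200 TO 300], price:[300 TO 400]
        for (String q : buildRangeQueries("price", 100, 400, 100)) {
            System.out.println(q);
        }
    }
}
</pre>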
<p>&#160;</p>
<h1 id="toc-undestanding-facet-counts">Understanding facet counts</h1>
<p>The facet counts are always in the context of the set of search results of main query + filter queries. <a href="wp-content/uploads/2010/12/image1.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2010/12/image-thumb1.png" width="543" height="346" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/faceting-or-drilldown-search-using-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Embedded Solr</title>
		<link>http://www.pathbreak.com/blog/embedded-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=embedded-solr</link>
		<comments>http://www.pathbreak.com/blog/embedded-solr#comments</comments>
		<pubDate>Thu, 25 Nov 2010 19:14:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[embedded solr]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/embedded-solr</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x A java application running in a JVM can use the EmbeddedSolrServer to host Solr in the same JVM. Following snippet shows how to use it:]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
</div>
<p>A java application running in a JVM can use the <strong>EmbeddedSolrServer</strong> to host Solr in the same JVM. </p>
<p>Following snippet shows how to use it:</p>
<pre class="brush: java; title: ; notranslate">
public class EmbeddedServerExplorer {
    public static void main(String[] args) {
        try {
            // Set &quot;solr.solr.home&quot; to the directory under which /conf and /data are present.
            System.setProperty(&quot;solr.solr.home&quot;, &quot;solr&quot;);
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer coreContainer = initializer.initialize();
            EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, &quot;&quot;);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField(&quot;id&quot;, &quot;embeddedDoc1&quot;);
            doc.addField(&quot;name&quot;, &quot;test embedded server&quot;);
            server.add(doc);
            server.commit();
            coreContainer.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/embedded-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Solr from java applications with SolrJ</title>
		<link>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-solr-from-java-applications-with-solrj</link>
		<comments>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#comments</comments>
		<pubDate>Tue, 23 Nov 2010 14:39:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solrj]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Overview Important classes Setup the client connection to server Add or update document(s) Commit changes Send a search query Handle search results Overview SolrJ provides java wrappers and adaptors to communicate with Solr and translate its results to java objects. Using SolrJ is much more convenient than using raw HTTP [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-overview">Overview</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-important-classes">Important classes</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-setup-the-client-connection-to-server">Setup the client connection to server</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-add-or-update-documents">Add or update document(s)</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-commit-changes">Commit changes</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-send-a-search-query">Send a search query</a></li>
<li><a href="http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj#toc-handle-search-results">Handle search results</a></li>
</ol>
</div>
</div>
<h1 id="toc-overview">Overview</h1>
<p>SolrJ provides java wrappers and adaptors to communicate with Solr and translate its results to java objects. Using SolrJ is much more convenient than using raw HTTP and JSON. Internally, SolrJ uses Apache HttpClient to send HTTP requests.</p>
<p>&#160;</p>
<h1 id="toc-important-classes">Important classes</h1>
<p>SolrJ API is fairly simple and intuitive. The diagram below shows important SolrJ classes.</p>
<p><a href="wp-content/uploads/2010/12/image2.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="wp-content/uploads/2010/12/image-thumb2.png" width="543" height="650" /></a> </p>
<p>&#160;</p>
<h1 id="toc-setup-the-client-connection-to-server">Setup the client connection to server</h1>
<pre class="brush: java; title: ; notranslate">
solrServer = new CommonsHttpSolrServer(&quot;http://localhost:8983/solr&quot;);
solrServer.setParser(new XMLResponseParser());
</pre>
<p>The response parser in the Java client API can be either XML or binary. In other language APIs, JSON is also possible. </p>
<p>&#160;</p>
<h1 id="toc-add-or-update-documents">Add or update document(s)</h1>
<pre class="brush: java; title: ; notranslate">
SolrInputDocument doc = new SolrInputDocument();
// Add fields. The field names should match fields defined in schema.xml
doc.addField(FLD_ID, docId++);
try {
    solrServer.add(doc);
    return true;
} catch (Exception e) {
    LOG.error(&quot;addItem error&quot;, e);
    return false;
}
</pre>
<h1 id="toc-commit-changes">Commit changes</h1>
<p>For best performance, commit changes only after all documents &#8211; or a batch of reasonable size &#8211; have been added or updated. </p>
<pre class="brush: java; title: ; notranslate">
solrServer.commit();
</pre>
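The batching advice above can be illustrated with a small, self-contained sketch. Plain Java stands in for the SolrJ calls here: the flush step marked in comments is where `solrServer.add(batch)` followed by `solrServer.commit()` would go, and `BATCH_SIZE` is an illustrative number, not a SolrJ constant.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Illustrative batch size; tune it to your document size and heap.
    static final int BATCH_SIZE = 100;

    // Counts how many flushes (add + commit round-trips) a batched
    // strategy needs. Compare with one commit per document.
    static int countFlushes(int totalDocs) {
        List<Integer> batch = new ArrayList<>();
        int flushes = 0;
        for (int i = 0; i < totalDocs; i++) {
            batch.add(i); // stands in for batch.add(solrInputDocument)
            if (batch.size() == BATCH_SIZE) {
                flushes++; // here: solrServer.add(batch); solrServer.commit();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            flushes++;     // flush the final partial batch
        }
        return flushes;
    }

    public static void main(String[] args) {
        // 250 documents need only 3 commits instead of 250
        System.out.println(countFlushes(250));
    }
}
```

Committing 250 documents in batches of 100 costs three round-trips rather than 250, which is where the performance benefit comes from.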
<h1 id="toc-send-a-search-query">Send a search query</h1>
<pre class="brush: java; title: ; notranslate">
SolrQuery qry = new SolrQuery(&quot;name:video&quot;);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
qry.setRows(100);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);
</pre>
<p>SolrQuery.setRows() specifies how many results to return in the response. The actual count of all hits may be much higher. If “field:” is omitted from query string, then the field specified by <strong>&lt;defaultSearchField&gt;</strong> in schema.xml is searched. </p>
<h1 id="toc-handle-search-results">Handle search results</h1>
<pre class="brush: java; title: ; notranslate">
SolrDocumentList results = resp.getResults();
System.out.println(results.getNumFound() + &quot; total hits&quot;);
int count = results.size();
System.out.println(count + &quot; received hits&quot;);
for (int i = 0; i &lt; count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println(&quot;#&quot; + (i+1) + &quot;:&quot; + hitDoc.getFieldValue(&quot;name&quot;));
    for (Iterator&lt;Entry&lt;String, Object&gt;&gt; flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry&lt;String, Object&gt; entry = flditer.next();
        System.out.println(entry.getKey() + &quot;: &quot; + entry.getValue());
    }
}
</pre>
<p>SolrDocumentList.getNumFound() is the total number of hits in the index, but each response returns only as many results as specified by SolrQuery.setRows(). Together, these two values can be used to implement pagination.</p>
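As a sketch of that pagination arithmetic in plain Java: the `numFound` and `rows` values below are hypothetical, standing in for `results.getNumFound()` and the `setRows()` argument, and the computed offset is what you would pass to `SolrQuery.setStart()`.

```java
public class SolrPagination {
    // Number of pages needed to show numFound hits at `rows` hits per page.
    static long pageCount(long numFound, int rows) {
        return (numFound + rows - 1) / rows; // ceiling division
    }

    // Start offset for a 0-based page index; pass to SolrQuery.setStart().
    static long startOffset(long page, int rows) {
        return page * rows;
    }

    public static void main(String[] args) {
        long numFound = 1042;  // hypothetical total from getNumFound()
        int rows = 100;        // matches setRows(100) in the example above
        System.out.println(pageCount(numFound, rows)); // 11 pages in all
        System.out.println(startOffset(3, rows));      // page 4 starts at offset 300
    }
}
```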
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/using-solr-from-java-applications-with-solrj/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Getting started with Solr</title>
		<link>http://www.pathbreak.com/blog/getting-started-with-solr?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=getting-started-with-solr</link>
		<comments>http://www.pathbreak.com/blog/getting-started-with-solr#comments</comments>
		<pubDate>Tue, 23 Nov 2010 09:23:00 +0000</pubDate>
		<dc:creator>Karthik Shiraly</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr deployment]]></category>
		<category><![CDATA[solr multicore]]></category>

		<guid isPermaLink="false">http://www.pathbreak.com/blog/directory-layout-of-solr-package</guid>
		<description><![CDATA[Article RelevancyApache Solr 1.4.x Contents Introduction Directory layout of Solr package Getting Started Guide Managing solr server with ant during development Customizing Solr installation Multicore configuration and deployment Using Solr from command line Boolean operators in search queries Introduction Apache Solr is a full fledged, search server based on the Lucene toolkit. Lucene provides the [...]]]></description>
			<content:encoded><![CDATA[<div class="sidebox">
<div class="relevancy"><b>Article Relevancy</b><br/>Apache Solr 1.4.x</div>
<div class="toc"><b>Contents</b>
<ol>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-introduction">Introduction</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-directory-layout-of-solr-package">Directory layout of Solr package</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-getting-started-guide">Getting Started Guide</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-managing-solr-server-with-ant-during-development">Managing solr server with ant during development</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-customizing-solr-installation">Customizing Solr installation</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-multicore-configuration-and-deployment">Multicore configuration and deployment</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-using-solr-from-command-line">Using Solr from command line</a></li>
<li><a href="http://www.pathbreak.com/blog/getting-started-with-solr#toc-boolean-operators-in-search-queries">Boolean operators in search queries</a></li>
</ol>
</div>
</div>
<h1 id="toc-introduction">Introduction</h1>
<p><a href="http://lucene.apache.org/solr/">Apache Solr</a> is a full fledged, search server based on the Lucene toolkit. </p>
<p>Lucene provides the core search algorithms and index storage required by those algorithms. Most basic search requirements can be fulfilled by Lucene itself without requiring Solr. But using plain Lucene has some drawbacks in development and non functional aspects, forcing development teams to cover these in their designs. This is where Solr adds value. </p>
<p>Solr provides these benefits over using the raw Lucene toolkit:</p>
<ul>
<li>Solr allows search behaviour to be configured through configuration files, rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems. </li>
<li>Lucene is java centric (but also has ports to other languages). Solr however provides a HTTP interface that allows any platform to use it. Projects that involve multiple languages or platforms can use the same solr server.</li>
<li>Solr provides an out-of-the-box <strong>faceted search</strong> (also called drilldown search) facility, that allows users to incrementally refine results using filters and &quot;drilldown&quot; towards a narrow set of best matches. Many shopping web portals use this feature to allow their users to incrementally refine their results.</li>
<li>Solr’s query syntax is slightly easier than Lucene’s. Either a default field can be specified, or Solr’s own dismax syntax, which searches a fixed set of fields, can be used.</li>
<li>Solr’s java client API is much simpler and easier than Lucene’s. Solr abstracts away many of the underlying Lucene concepts.</li>
<li>Unlike Lucene, Solr provides a straightforward API for adding, updating, and deleting documents.</li>
<li>Solr supports a pluggable architecture. For example, post-processor plugins (example: search results highlighting) allow raw results to be modified.</li>
<li>Solr facilitates scalability using solutions like caching, memory tweaking, clustering, sharding and load balancing.</li>
<li>Solr provides plugins to fetch database data and index them. This workflow is probably the most common requirement for any search implementation, and solr provides it out-of-the-box.</li>
</ul>
<p>The following sections describe basics of deploying Solr and using it from command line.</p>
<p>&#160;</p>
<h1 id="toc-directory-layout-of-solr-package">Directory layout of Solr package</h1>
<p>Extracted Solr package has this layout:</p>
<table border="1" cellspacing="0" cellpadding="2" width="100%">
<tbody>
<tr>
<td valign="top" width="25%">/client</td>
<td valign="top" width="75%">Contains client APIs in different languages to talk to a Solr server</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/clustering</td>
<td valign="top" width="75%">Plugin that provides clustering capabilities for Solr, using Carrot2 clustering framework</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/dataimporthandler</td>
<td valign="top" width="75%">Plugin that is useful for indexing data in databases</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/extraction</td>
<td valign="top" width="75%">Plugin that is useful for extracting text from PDFs, Word DOCs, etc.</td>
</tr>
<tr>
<td valign="top" width="25%">/contrib/velocity</td>
<td valign="top" width="75%">Handler to present and manipulate search results using velocity templates.</td>
</tr>
<tr>
<td valign="top" width="25%">/dist</td>
<td valign="top" width="75%">Contains Solr core jars and wars that can be deployed in servlet containers or elsewhere, and the solrj client API for java clients.</td>
</tr>
<tr>
<td valign="top" width="25%">/dist/solrj-lib</td>
<td valign="top" width="75%">Libraries required by solrj client API .</td>
</tr>
<tr>
<td valign="top" width="25%">/docs</td>
<td valign="top" width="75%">Offline documentation and javadocs</td>
</tr>
<tr>
<td valign="top" width="25%">/lib</td>
<td valign="top" width="75%">Contains Lucene and other jars required by Solr</td>
</tr>
<tr>
<td valign="top" width="25%">/src</td>
<td valign="top" width="75%">Source code</td>
</tr>
<tr>
<td valign="top" width="25%">/<strong>example</strong></td>
<td valign="top" width="75%"><strong>A skeleton standalone solr server deplyment. Default environment is Jetty. When deploying Solr, this is the directory that&#8217;s customized and deployed.</strong></td>
</tr>
<tr>
<td valign="top" width="25%">/example/etc</td>
<td valign="top" width="75%">Jetty or other environment specific configuration files go here</td>
</tr>
<tr>
<td valign="top" width="25%">/example/example-DIH</td>
<td valign="top" width="75%">An example DB and the Data Import Handler plugin configuration to index that DB</td>
</tr>
<tr>
<td valign="top" width="25%">/example/exampledocs</td>
<td valign="top" width="75%">Example XML request files to send to Solr server. Usage: java –jar post.jar &lt;xml filename&gt;</td>
</tr>
<tr>
<td valign="top" width="25%">/example/lib</td>
<td valign="top" width="75%">Jetty and servlet libraries. Not required if Solr is being deployed in a different environment</td>
</tr>
<tr>
<td valign="top" width="25%">/example/logs</td>
<td valign="top" width="75%">Solr request logs</td>
</tr>
<tr>
<td valign="top" width="25%">/example/multicore</td>
<td valign="top" width="75%">It’s possible to host multiple search cores in the same environment. Use case could be separate indexes for different categories of data.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr</td>
<td valign="top" width="75%">This is the main data area of Solr.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr/conf</td>
<td valign="top" width="75%">Contains configuration files used by Solr.          </p>
<p>solrconfig.xml – Configuration parameters, memory tuning, different types of request handlers.           </p>
<p>schema.xml – Specifies fields and analyzer configuration for indexing and querying. Other files contain data required by different components like the Stop word filter.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/solr/data</td>
<td valign="top" width="75%">This contains the actual results of indexing.</td>
</tr>
<tr>
<td valign="top" width="25%">/example/webapps</td>
<td valign="top" width="75%">The solr webapp deployed in Jetty</td>
</tr>
<tr>
<td valign="top" width="25%">/example/work</td>
<td valign="top" width="75%">Scratch directory for the container environment</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<h1 id="toc-getting-started-guide">Getting Started Guide</h1>
<p>1) Copy the skeleton server under /example to the deployment directory. </p>
<p>2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application. </p>
<p>3) Start the solr server. For the default Jetty environment, use this command line with current directory set to /example: </p>
<blockquote><p><span style="background-color: #ffffff">java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar</span></p></blockquote>
<p>STOP.PORT specifies the port on which the server should listen for a stop instruction, and STOP.KEY is a shared secret that must be passed when stopping. </p>
<p>4) If building from source, the WAR will be named something like apache-solr-4.0-snapshot.war. Copy this to /webapps and importantly, <strong>rename it to solr.war.</strong> Without that renaming, Jetty will give 404 errors for /solr URLs. </p>
<p>5) The solr server will now be available at <strong>http://localhost:8983/solr</strong>. 8983 is the default jetty connector port, as specified in /example/etc/jetty.xml </p>
<p>6) To stop the server, use the command line: </p>
<blockquote><p><span style="background-color: #ffffff">java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop</span></p></blockquote>
<p>&#160;</p>
<h1 id="toc-managing-solr-server-with-ant-during-development">Managing solr server with ant during development</h1>
<p>Starting and stopping solr can be conveniently done from an IDE like Eclipse using an Ant script:</p>
<pre class="brush: xml; title: ; notranslate">
&lt;project basedir=&quot;.&quot; name=&quot;ManageSolr&quot;&gt;
&lt;property name=&quot;stopport&quot; value=&quot;8079&quot;&gt;&lt;/property&gt;
&lt;property name=&quot;stopsecret&quot; value=&quot;secret&quot;&gt;&lt;/property&gt;

&lt;target name=&quot;start-solr&quot;&gt;
	&lt;java dir=&quot;./dist/solr&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/start.jar&quot;&gt;
		&lt;jvmarg value=&quot;-DSTOP.PORT=${stopport}&quot; /&gt;
		&lt;jvmarg value=&quot;-DSTOP.KEY=${stopsecret}&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;

&lt;target name=&quot;stop-solr&quot;&gt;
	&lt;java dir=&quot;./dist/solr&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/start.jar&quot;&gt;
		&lt;jvmarg value=&quot;-DSTOP.PORT=${stopport}&quot; /&gt;
		&lt;jvmarg value=&quot;-DSTOP.KEY=${stopsecret}&quot; /&gt;
		&lt;arg value=&quot;--stop&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;

&lt;target name=&quot;restart-solr&quot; depends=&quot;stop-solr,start-solr&quot;&gt;
&lt;/target&gt;

&lt;target name=&quot;deleteAllDocs&quot;&gt;
	&lt;java dir=&quot;./dist/solr/exampledocs&quot; fork=&quot;true&quot; jar=&quot;./dist/solr/exampledocs/post.jar&quot;&gt;
		&lt;arg value=&quot;${basedir}/deleteAllCommand.xml&quot; /&gt;
	&lt;/java&gt;
&lt;/target&gt;
&lt;/project&gt;
</pre>
<p>&#160;</p>
<h1 id="toc-customizing-solr-installation">Customizing Solr installation</h1>
<p>The solr server distribution under /example is just that &#8211; an example. It should be customized to fit your search requirements. The conf/schema.xml should be changed to model searchable entities of the application, as described in this article.</p>
<p>&#160;</p>
<h1 id="toc-multicore-configuration-and-deployment">Multicore configuration and deployment</h1>
<p>Multicore configuration allows multiple schemas and indexes in a single solr server process. Multicores are useful when disparate entities with different fields need to be searched using a single server process.</p>
<ul>
<li>The package contains an example multicore configuration in <strong>/example/multicore</strong>.&#160; It contains 2 cores, each with its own schema.xml and solrconfig.xml. </li>
<li>Core names and instance directories can be changed in solr.xml. </li>
<li>The default multicore schema.xmls are rather simplistic and don’t contain the exhaustive list of field type definitions available in <strong>/example/solr/conf/schema.xml</strong>.&#160; So, copy all files under <strong>/example/solr/conf/*</strong> into <strong>/example/multicore/core0/conf/*</strong> and <strong>/example/multicore/core1/conf/*</strong> </li>
<li>Modify the core schema XMLs according to the data they are indexing </li>
<li>The copied solrconfig.xml has a &lt;dataDir&gt; element that points to <strong>/example/multicore/data</strong>. This is where index and other component data are stored. Since the same solrconfig is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely resulting in index corruption.&#160; So, comment out the &lt;dataDir&gt; elements; each core will then store data in its respective <strong>/example/multicore/&lt;coredir&gt;/data</strong>. </li>
<li>The jar lib directories in the default single-core solrconfig.xml don’t match the default directory structure of a multicore setup. Those relative paths use solr home (i.e., <strong>/example/solr</strong>) as the base directory.&#160; Change the relative paths of /contrib and /dist so that they’re relative to the core’s own directory (i.e., <strong>/example/solr/&lt;coredir&gt;</strong>). </li>
<li>Finally, make the multicore configuration the active configuration, either by starting with &#8220;java -Dsolr.solr.home=./multicore -jar start.jar&#8221;, or preferably by copying all files under <strong>/example/multicore/*</strong> into <strong>/example/solr</strong>, the default solr home. </li>
</ul>
<h1 id="toc-using-solr-from-command-line">Using Solr from command line</h1>
<p>The primary method of communicating with Solr is HTTP. An HTTP-capable command line client like <a href="http://curl.haxx.se/" target="_blank"><strong>curl</strong></a> is useful for this.</p>
<p><strong>Querying:</strong> Queries should be sent as </p>
<blockquote><p>http://localhost:8983/solr/select/?q=&lt;query&gt; </p>
</blockquote>
<p>or </p>
<blockquote><p>http://localhost:8983/solr/&lt;core name&gt;/select/?q=&lt;query&gt; </p>
</blockquote>
<p>for a multicore installation.</p>
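As a sketch, select URLs of this shape can be assembled and encoded from plain Java. The base URL is the default from earlier sections; the core name <strong>core0</strong> and the helper name are hypothetical, used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
    // Builds a /select URL; pass null for `core` on a single-core install.
    static String selectUrl(String baseUrl, String core, String query)
            throws UnsupportedEncodingException {
        String prefix = (core == null) ? baseUrl : baseUrl + "/" + core;
        return prefix + "/select/?q=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Single-core form, then the hypothetical core "core0"
        System.out.println(selectUrl("http://localhost:8983/solr", null, "name:video"));
        System.out.println(selectUrl("http://localhost:8983/solr", "core0", "name:video"));
    }
}
```

Note that the query must be URL-encoded: the colon in <strong>name:video</strong> becomes <strong>%3A</strong> on the wire, which curl does not do for you automatically.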
<p><strong>Inserting or Updating documents in a single core installation: </strong>Solr update handler listens by default on the URL: <strong>http://localhost:8983/solr/update/</strong> in a single core configuration. </p>
<p>To post an XML file with documents, use command line</p>
<blockquote><p>curl http://localhost:8983/solr/update/?commit=true -F &quot;myfile=@updates.xml&quot;</p>
</blockquote>
<p><strong>Inserting or Updating documents in a multi core installation: </strong>Each core’s update handler listens by default on the URL: <strong>http://localhost:8983/solr/&lt;core name&gt;/update/</strong></p>
<p>&#160;</p>
<p><strong>Updating with content extraction</strong>: The content extraction handler listens on the URL http://localhost:8983/solr/update/extract/ or http://localhost:8983/solr/&lt;core name&gt;/update/extract. Use the command line </p>
<blockquote><p>curl &quot;http://localhost:8983/solr/update/extract?literal.id=book1&amp;commit=true&quot; -F &quot;myfile=@book.pdf&quot; </p></blockquote>
<p>where literal.id adds a regular field called &quot;id&quot; to the new document created by extracting handler.</p>
<p>&#160;</p>
<p>The query parameters that Solr accepts are documented in <a href="http://wiki.apache.org/solr/CommonQueryParameters" target="_blank">Solr wiki</a>.</p>
<p>&#160;</p>
<h1 id="toc-boolean-operators-in-search-queries">Boolean operators in search queries</h1>
<p>All Lucene queries are valid in Solr too. However, solr does provide some additional conveniences.</p>
<p>A default boolean operator can be specified using a <strong>&lt;solrQueryParser defaultOperator=”AND|OR”/&gt;</strong> element in schema.xml.</p>
<p>Each query can also override boolean behaviour using the <strong>q.op=AND|OR</strong> query param. However, remember that the schema default or q.op affects not just the query terms, but also the facet filter queries.</p>
<p>For example, selecting 2 facet values for the same facet field will now imply that both should be satisfied. This is because internally, a filter query is just a part of the query from Lucene point of view.</p>
<p>To restrict boolean logic to just the query terms, use the following syntax:</p>
<ul>
<li><strong>All words should be found:</strong> Prefix a + in front of each word. <em>example</em>: +video +science (=&gt;only documents that contain both “video” AND “science” are returned) </li>
<li><strong>Any one word should be found:</strong> This is the default behaviour when queries contain words without any prefix. <em>example:</em> video science (=&gt;any document which contains either “video” or “science” is returned) </li>
<li><strong>Documents which don’t contain a word:</strong> Prefix a “-” in front of each word that should not be present for a successful hit. <em>example: </em>video -science (=&gt;any document which contains “video” but not “science” is returned).</li>
</ul>
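The three prefix forms above can be generated mechanically when building queries in code. A minimal sketch in plain Java; <strong>withPrefix</strong> is a hypothetical helper, not part of SolrJ:

```java
import java.util.ArrayList;
import java.util.List;

public class BooleanQueryBuilder {
    // Joins terms with the given prefix:
    // "+" requires each term, "-" prohibits it, "" leaves default behaviour.
    static String withPrefix(String prefix, String... terms) {
        List<String> parts = new ArrayList<>();
        for (String t : terms) {
            parts.add(prefix + t);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(withPrefix("+", "video", "science")); // both required
        System.out.println(withPrefix("", "video", "science"));  // either matches
        // Mixed form: require "video", exclude "science"
        System.out.println(withPrefix("+", "video") + " " + withPrefix("-", "science"));
    }
}
```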
]]></content:encoded>
			<wfw:commentRss>http://www.pathbreak.com/blog/getting-started-with-solr/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
