<?xml version="1.0" encoding="UTF-8"?><feed
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:thr="http://purl.org/syndication/thread/1.0"
  xml:lang="en"
  xml:base="http://dev.tailsweep.com/wp-atom.php"
   >
	<title type="text">Tailsweep dev blog</title>
	<subtitle type="text"></subtitle>

	<updated>2009-10-27T14:29:11Z</updated>

	<link rel="alternate" type="text/html" href="http://dev.tailsweep.com" />
	<id>http://dev.tailsweep.com/feed/atom/</id>
	<link rel="self" type="application/atom+xml" href="http://dev.tailsweep.com/feed/atom/?language=sv+en" />

	<generator uri="http://wordpress.org/" version="3.0.1">WordPress</generator>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Ulimit can be a bastard]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/ulimit-can-be-a-bastard/" />
		<id>http://dev.tailsweep.com/?p=89</id>
		<updated>2009-10-27T14:29:11Z</updated>
		<published>2009-10-27T14:29:11Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" />		<summary type="html"><![CDATA[How do you change the per-process ulimit without rebooting? We have not found a direct way, only a workaround. As root: ulimit -n 65536, then su $user -p. Done! The -p preserves root&#8217;s environment.]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/ulimit-can-be-a-bastard/"><![CDATA[<p>How do you change the per-process ulimit without rebooting? We have not found a direct way, only a workaround.</p>
<p>As root:</p>
<p>ulimit -n 65536</p>
<p>su $user -p</p>
<p>Done! The -p preserves root&#8217;s environment.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/ulimit-can-be-a-bastard/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/ulimit-can-be-a-bastard/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Some new servers]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/some-new-servers/" />
		<id>http://dev.tailsweep.com/?p=87</id>
		<updated>2009-10-27T13:26:46Z</updated>
		<published>2009-10-27T13:26:46Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" />		<summary type="html"><![CDATA[Finally some new servers are racked up in cabinet2]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/some-new-servers/"><![CDATA[<p>Finally some new servers are racked up in cabinet2</p>
<p style="text-align: center;"><a href="http://dev.tailsweep.com/wp-content/uploads/2009/10/img_0233.jpg"><img class="alignnone size-medium wp-image-88 aligncenter" title="img_0233" src="http://dev.tailsweep.com/wp-content/uploads/2009/10/img_0233-300x225.jpg" alt="" width="300" height="225" /></a></p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/some-new-servers/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/some-new-servers/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[We are looking for developers]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/vi-soker-utvecklare/" />
		<id>http://dev.tailsweep.com/?p=85</id>
		<updated>2009-10-08T10:42:46Z</updated>
		<published>2009-10-08T09:52:37Z</published>
		<category scheme="http://dev.tailsweep.com" term="Job" /><category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="Jobb" />		<summary type="html"><![CDATA[Tailsweep is developing at an enormous pace and we need to strengthen our development team with more developers. Tailsweep is a data-driven company that handles large amounts of data in every aspect of its business. If you have experience writing programs that process large amounts of data (preferably with the technologies mentioned below), or simply have the following two qualities: Be smart [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/vi-soker-utvecklare/"><![CDATA[<p>Tailsweep is developing at an enormous pace and we need to strengthen our development team with more developers.</p>
<p>Tailsweep is a data-driven company that handles large amounts of data in every aspect of its business. If you have experience writing programs that process large amounts of data (preferably with the technologies mentioned below), or simply have the following two qualities:</p>
<ul>
<li>Be smart</li>
<li>Get things done</li>
</ul>
<p>then you are most likely the right person for the job and you will enjoy working with us. The &#8220;requirements&#8221; mentioned below are only meant to give a hint about which technologies we use. Above all we are looking for people who fit in at the company, who love developing software and are good at it. Everything else is really beside the point.</p>
<p>The three areas you will work in are:</p>
<ul>
<li>Tailsweep Search &amp; Report &#8211; Crawler &amp; search index, one of Sweden's very largest data indexes for blog content.</li>
<li>Tailsweep Analytics &#8211; Our statistics system, which resembles Google Analytics in many ways. Practically all of the largest Swedish blogs are connected to this system. Probably the most advanced of its kind in Sweden.</li>
<li>Tailsweep Ad System &#8211; Our ad system, which publishes campaigns on thousands of sites and blogs around the world every day. The technical challenges in this system are very interesting, to put it modestly.</li>
</ul>
<p>Experience with any of the technologies below earns you a gold star:</p>
<ul>
<li>Hadoop &#8211; Processes our log data and runs our crawler</li>
<li>HBase &#8211; Only used in development so far, but will become an important component for further scaling</li>
<li>Hive &#8211; Will become our BI solution</li>
<li>Lucene &#8211; Used extensively where scalability matters less and &#8220;closeness&#8221; to the data matters more</li>
<li>Lucene SOLR &#8211; Our search index uses SOLR and is a distributed index</li>
<li>Lucene Nutch &#8211; If you know Nutch, you know most of what there is to know about our crawler</li>
<li>Any other data mining platform</li>
<li>Any other BI solution</li>
<li>Any other search platform (Sphinx, for example)</li>
<li>Any other index engine</li>
</ul>
<p>We mainly develop in Java, so it is important that you master that language, but other specialized skills of course also weigh heavily, for example experience with a search engine, a statistics system or similar.</p>
<p>We write practically all our templates in Velocity, so it is of course a plus if you have seen that template language before.</p>
<p>We run production on, develop on and work on Ubuntu Linux. We use the same OS locally as on the production platform to make sure that no odd OS-related bugs that could not be tested locally slip into production.</p>
<p>Other technical skills that count in your favor</p>
<ul>
<li>MySQL &#8211; Our main database</li>
<li>J2EE servlet applications &#8211; Our web apps are written for J2EE and run in Tomcat</li>
<li>Spring &#8211; We use this IoC container everywhere</li>
<li>Spring MVC &#8211; For our web apps</li>
<li>Hibernate &#8211; Used everywhere performance is not critical</li>
<li>Perl &#8211; Listed as well, since we have lots of batch jobs written in Perl</li>
</ul>
<p>I will also list <em>some</em> other tools and technologies that we use heavily but that are only trivia in this context</p>
<ul>
<li>Subversion &#8211; All our source code lives in Subversion</li>
<li>Maven &#8211; All projects are built with Maven 2</li>
<li>Lighttpd &#8211; Serves our static content and our blogs</li>
<li>WordPress &#8211; Our blogs run on WordPress</li>
<li>BASH &#8211; Yes, we use bash scripts everywhere</li>
<li>NFS &#8211; Used mostly for convenience</li>
<li>GlusterFS &#8211; Experimental scalable file system</li>
<li>Eclipse &#8211; What we develop in.</li>
<li>HAProxy &#8211; Our load balancer: simple, fast and stable</li>
<li>SNMP &#8211; All machines are monitored with SNMP</li>
<li>Postfix &#8211; Mail</li>
<li>Nagios &#8211; Alerts for our most important services</li>
<li>Cacti &#8211; Trend graphs for performance-critical services</li>
<li>Mantis &#8211; Our issue tracker, simple and satisfactory</li>
</ul>
<p><strong>Example projects for getting started at Tailsweep<br />
</strong></p>
<ul>
<li>We are going to rebuild our statistics engine to use Hive instead of MonetDB, which we use today. Hive is excellent at processing enormous amounts of log files, and this is our most important service.</li>
</ul>
<ul>
<li>We have built our own sharded solution in MySQL that spans 50 databases in our search platform, but we are looking at moving this architecture to HBase, an open-source store modeled on Google's BigTable.</li>
<li>We are going to build a behavioural targeting engine that distributes campaigns to the sites where they perform best. For this we need to build an ad pool that the campaigns are &#8220;pulled&#8221; from.</li>
</ul>
<p>Sound interesting? Then you will like working at Tailsweep.</p>
<p>Send an email with your CV to job at tailsweep.com and I will contact you and set up a meeting.</p>
<p>Best regards</p>
<p>//Marcus Herou, CTO Tailsweep AB</p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/vi-soker-utvecklare/#comments" thr:count="1"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/vi-soker-utvecklare/feed/atom/" thr:count="1"/>
		<thr:total>1</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Patch Hadoop for faster startup]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/patch-hadoop-for-faster-startup/" />
		<id>http://dev.tailsweep.com/?p=82</id>
		<updated>2009-09-24T07:42:50Z</updated>
		<published>2009-09-24T07:42:50Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="hadoop" />		<summary type="html"><![CDATA[Do you add dependencies to your Hadoop jobs by configuring the &#8220;tmpjars&#8221; property? This means that your jar files need to be located on HDFS and are loaded by Hadoop at runtime. If so, your jobs will start significantly slower. You can reduce the startup time [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/patch-hadoop-for-faster-startup/"><![CDATA[<p>Do you add dependencies to your Hadoop jobs by configuring the &#8220;tmpjars&#8221; property?</p>
<p>This means that your jar files need to be located on HDFS and are loaded by Hadoop at runtime.</p>
<p>If so, your jobs will start significantly slower. You can reduce the startup time from 1 minute to less than 10 seconds by patching the mapred/org/apache/hadoop/mapred/TaskRunner.java class so that it finds the files in a local repository instead of on HDFS.</p>
<p>Find the place where the classpath is built in that source file (line 272 in hadoop-0.18.3) and insert the code snippet at the SNIPPET_HERE marker shown in this excerpt:</p>
<p>classPath.append(sep);</p>
<p>classPath.append(workDir);</p>
<p>--SNIPPET_HERE--</p>
<p>//  Build exec child jmv args.<br />
Vector&lt;String&gt; vargs = new Vector&lt;String&gt;(8);<br />
File jvm =                                  // use same jvm as parent<br />
new File(new File(System.getProperty("java.home"), "bin"), "java");</p>
<p>vargs.add(jvm.toString());</p>
<p>Here is the snippet:</p>
<pre>// Append extra local classpath entries listed in the job configuration.
String additionalClassPath = conf.get("mapred.additional.class.path");
if (additionalClassPath != null)
{
    // The property holds a comma-separated list of local classpath entries.
    String[] localfiles = additionalClassPath.split(",");
    for (int i = 0; i &lt; localfiles.length; i++)
    {
        String localfile = localfiles[i].trim();
        LOG.info("Adding " + localfile);
        classPath.append(sep);
        classPath.append(localfile);
    }
}</pre>
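<p>Then build the new Hadoop jar by running &#8220;ant jar&#8221; and make sure that the same jar is deployed on all nodes as well as on the jobtracker. Jobs can then list their local dependencies, comma-separated, in the &#8220;mapred.additional.class.path&#8221; property.</p>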
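<p>To try the classpath logic without a cluster, here is a minimal stand-alone Java sketch of the same loop. It is not Hadoop code: a plain Map stands in for the JobConf, the class name and the jar paths are made up for illustration, and unlike the patch it also skips empty entries.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-alone sketch: a plain Map stands in for Hadoop's JobConf so the
// classpath logic can be run without a cluster.
class AdditionalClassPathSketch {

    // Appends each comma-separated entry of "mapred.additional.class.path"
    // to the base classpath, mirroring the loop in the patch.
    static String buildClassPath(String base, Map<String, String> conf, String sep) {
        StringBuilder classPath = new StringBuilder(base);
        String additional = conf.get("mapred.additional.class.path");
        if (additional != null) {
            for (String entry : additional.split(",")) {
                String localFile = entry.trim();
                if (localFile.isEmpty()) {
                    continue; // assumption: ignore empty entries (the patch does not)
                }
                classPath.append(sep).append(localFile);
            }
        }
        return classPath.toString();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Hypothetical jar locations that would exist locally on every node.
        conf.put("mapred.additional.class.path", "/opt/jobs/lib/a.jar, /opt/jobs/lib/b.jar");
        System.out.println(buildClassPath("/work/dir", conf, ":"));
    }
}
```

<p>Since the paths are resolved locally on each task node, the jars have to be present at the same location on every machine before the job starts.</p>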
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/patch-hadoop-for-faster-startup/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/patch-hadoop-for-faster-startup/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Mammatus is now a replicated KeyValueStore]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/mammatus-is-now-a-replicated-keyvaluestore/" />
		<id>http://dev.tailsweep.com/?p=81</id>
		<updated>2009-09-09T09:27:24Z</updated>
		<published>2009-09-09T09:27:24Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="Cassandra" /><category scheme="http://dev.tailsweep.com" term="replication" /><category scheme="http://dev.tailsweep.com" term="Voldemort" />		<summary type="html"><![CDATA[We proudly announce that Mammatus now supports transactional replication of configurable KeyValueStore(s), similar to Cassandra (where is it these days?) or Voldemort. Our pagehit/adhit tracking services at script.tailsweep.com use this feature and handle about 1000 web requests per second, so you can say that it is quite stress tested. Look in the [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/mammatus-is-now-a-replicated-keyvaluestore/"><![CDATA[<p>We proudly announce that Mammatus now supports transactional replication of configurable KeyValueStore(s), similar to Cassandra (where is it these days?) or <a href="http://project-voldemort.com/">Voldemort</a>.</p>
<p>Our pagehit/adhit tracking services at script.tailsweep.com use this feature and handle about 1000 web requests per second, so you can say that it is quite stress tested <img src='http://dev.tailsweep.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<p>Look in the <a href="http://dev.tailsweep.com/projects/mammatus/xref-test/com/tailsweep/mammatus/test/MasterSlaveTest.html">MasterSlaveTest</a> class for examples.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/mammatus-is-now-a-replicated-keyvaluestore/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/mammatus-is-now-a-replicated-keyvaluestore/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Cheap backup]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/cheap-backup/" />
		<id>http://dev.tailsweep.com/?p=79</id>
		<updated>2009-05-05T20:29:12Z</updated>
		<published>2009-05-05T20:29:12Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" />		<summary type="html"><![CDATA[I really love having backups, but hate paying for them since deep down in my gut it feels like wasted money. So how do you get the most bang for the buck? Buy some simple 1TB USB2 drives and just plug them into one of your servers and mount them as regular [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/cheap-backup/"><![CDATA[<p>I really love having backups, but hate paying for them since deep down in my gut it feels like wasted money. So how do you get the most bang for the buck?</p>
<p>Buy some simple 1TB USB2 drives and just plug them into one of your servers and mount them as regular drives. Simple as that.</p>
<p>Want RAID? No problem, this is what we did.</p>
<p>Find the device names by running:</p>
<p>sudo fdisk -l</p>
<p>The two drives showed up as /dev/sdb1 and /dev/sdc1.</p>
<p>Here is the magic, which mirrors the two drives as a RAID 1 array and puts an ext3 filesystem on it:</p>
<p><span id="intelliTxt">mknod /dev/md0 b 9 0</span><br />
mdadm -C -v /dev/md0 -l 1 -n 2 /dev/sdb1 /dev/sdc1<br />
mkfs.ext3 -L/usb_drive1 /dev/md0<br />
tune2fs -c 0 /dev/md0<br />
tune2fs -i 0 /dev/md0<br id="pje7" /> tune2fs -o journal_data_writeback /dev/md0</p>
<p>Mount it.</p>
<p>mount /dev/md0 /srv/backup</p>
<p>That is really it <img src='http://dev.tailsweep.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>This is how it looks now in our cabinet, really ugly but what the heck, who cares haha.</p>
<p style="text-align: center;"><a href="http://dev.tailsweep.com/wp-content/uploads/2009/05/20090505092.jpg"><img class="alignnone size-medium wp-image-80 aligncenter" title="20090505092" src="http://dev.tailsweep.com/wp-content/uploads/2009/05/20090505092-225x300.jpg" alt="" width="225" height="300" /></a></p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/cheap-backup/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/cheap-backup/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Tailsweep goes Hive]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/tailsweep-goes-hive/" />
		<id>http://dev.tailsweep.com/?p=78</id>
		<updated>2009-04-27T06:18:31Z</updated>
		<published>2009-04-27T06:18:31Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="hive" /><category scheme="http://dev.tailsweep.com" term="monetdb" />		<summary type="html"><![CDATA[We have now started to experiment with Hive. It makes perfect sense, since what we have built internally is basically Hive but in the form of zillions of Hadoop jobs. How nice would it be to just clean your data, create a CSV version of the actual log and then inject it into Hive [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/tailsweep-goes-hive/"><![CDATA[<p>We have now started to experiment with <a href="http://wiki.apache.org/hadoop/Hive/">Hive</a>. It makes perfect sense, since what we have built internally is basically Hive but in the form of zillions of Hadoop jobs.</p>
<p>How nice would it be to just clean your data, create a CSV version of the actual log, inject it into Hive and then run various SQL commands that output the results in a format of your choice?</p>
<p>Sounds like a data warehouse? Well, it more or less is, but it has the computing power of all machines in the cluster, which makes it very useful. We are currently using MonetDB, and it is blazing fast, but it performs poorly on a machine with little memory (which is no surprise) and also claims all the memory it can find, so we limit it with some tricks to keep it from swapping out the machine completely.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/tailsweep-goes-hive/#comments" thr:count="1"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/tailsweep-goes-hive/feed/atom/" thr:count="1"/>
		<thr:total>1</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Solr external scoring]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/solr-external-scoring/" />
		<id>http://dev.tailsweep.com/?p=77</id>
		<updated>2009-04-25T07:23:44Z</updated>
		<published>2009-04-24T22:11:28Z</published>
		<category scheme="http://dev.tailsweep.com" term="Lucene" /><category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="externalfilefield" /><category scheme="http://dev.tailsweep.com" term="function query" />		<summary type="html"><![CDATA[We had issues with trying to figure out howto get SOLR to be able to handle external scores. Thanks to Grant Ingersoll and Yonik Seeley we now have figured this out. The solution: ExternalFileField + FunctionQuery This is how I tested this setup. # solr.xml &#60;?xml version="1.0" encoding="UTF-8" ?&#62; &#60;solr persistent="true" sharedLib="lib"&#62;  &#60;cores adminPath="/admin/cores"&#62;         [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/solr-external-scoring/"><![CDATA[<p>We had trouble figuring out how to get Solr to handle external scores. Thanks to Grant Ingersoll and Yonik Seeley we have now figured it out.</p>
<p>The solution: <a href="http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html">ExternalFileField</a> + <a href="http://wiki.apache.org/solr/FunctionQuery">FunctionQuery</a></p>
<p>This is how I tested this setup.</p>
<pre># solr.xml
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;solr persistent="true" sharedLib="lib"&gt;
 &lt;cores adminPath="/admin/cores"&gt;
        &lt;core name="test" instanceDir="test" /&gt;
 &lt;/cores&gt;
&lt;/solr&gt;

# Schema, a pkId (blog entry) belongs to a blogId (the blog)
&lt;schema name="test" version="1.1"&gt;
    &lt;types&gt;
   	&lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;
    	&lt;fieldType name="integer" class="solr.IntField" omitNorms="true"/&gt;
    	&lt;fieldType name="float" class="solr.FloatField" omitNorms="true"/&gt;
    	&lt;fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/&gt;
	&lt;fieldType name="blogRankFile" keyField="blogId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/&gt;
    &lt;/types&gt;
    &lt;fields&gt;
	&lt;field name="pkId" type="string" indexed="true" stored="true" required="true" /&gt;
	&lt;field name="blogId" type="integer" indexed="true" stored="true" required="true" /&gt;
	&lt;field name="entryRank" type="entryRankFile" /&gt;
	&lt;field name="blogRank" type="blogRankFile" /&gt;
    &lt;/fields&gt;
    &lt;uniqueKey&gt;pkId&lt;/uniqueKey&gt;
    &lt;defaultSearchField&gt;pkId&lt;/defaultSearchField&gt;
    &lt;solrQueryParser defaultOperator="OR"/&gt;
&lt;/schema&gt;

# dataDir/external_blogRank.txt
1=2.0
2=1.0
3=3.0
4=1.0

# Add doc file, save it as /tmp/add.xml
&lt;add&gt;
    &lt;doc&gt;&lt;field name="pkId"&gt;1&lt;/field&gt;&lt;field name="blogId"&gt;1&lt;/field&gt;&lt;/doc&gt;
    &lt;doc&gt;&lt;field name="pkId"&gt;2&lt;/field&gt;&lt;field name="blogId"&gt;1&lt;/field&gt;&lt;/doc&gt;
    &lt;doc&gt;&lt;field name="pkId"&gt;3&lt;/field&gt;&lt;field name="blogId"&gt;2&lt;/field&gt;&lt;/doc&gt;
    &lt;doc&gt;&lt;field name="pkId"&gt;4&lt;/field&gt;&lt;field name="blogId"&gt;3&lt;/field&gt;&lt;/doc&gt;
    &lt;doc&gt;&lt;field name="pkId"&gt;5&lt;/field&gt;&lt;field name="blogId"&gt;4&lt;/field&gt;&lt;/doc&gt;
&lt;/add&gt;

# Add some data
curl http://127.0.0.1:8110/solr/test/update --data-binary @/tmp/add.xml -H "Content-Type: text/xml"
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;239&lt;/int&gt;&lt;/lst&gt;
&lt;/response&gt;

# Commit
curl http://127.0.0.1:8110/solr/test/update -H "Content-Type: text/xml" --data-binary '&lt;commit /&gt;'
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;6&lt;/int&gt;&lt;/lst&gt;
&lt;/response&gt;</pre>
<pre># Issue the query; entries with the highest blogRank should come first
mahe@mahe-laptop:~$ GET "http://127.0.0.1:8110/solr/test/select?indent=on&amp;start=0&amp;rows=100&amp;q=*:* _val_:\"log(blogRank)\""
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;

&lt;lst name="responseHeader"&gt;
 &lt;int name="status"&gt;0&lt;/int&gt;
 &lt;int name="QTime"&gt;3&lt;/int&gt;
 &lt;lst name="params"&gt;
  &lt;str name="start"&gt;0&lt;/str&gt;
  &lt;str name="indent"&gt;on&lt;/str&gt;
  &lt;str name="q"&gt;*:* _val_:"log(blogRank)"&lt;/str&gt;
  &lt;str name="rows"&gt;100&lt;/str&gt;
 &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0"&gt;
 &lt;doc&gt;
  &lt;int name="blogId"&gt;3&lt;/int&gt;
  &lt;str name="pkId"&gt;4&lt;/str&gt;
 &lt;/doc&gt;
 &lt;doc&gt;
  &lt;int name="blogId"&gt;1&lt;/int&gt;
  &lt;str name="pkId"&gt;1&lt;/str&gt;
 &lt;/doc&gt;
 &lt;doc&gt;
  &lt;int name="blogId"&gt;1&lt;/int&gt;
  &lt;str name="pkId"&gt;2&lt;/str&gt;
 &lt;/doc&gt;
 &lt;doc&gt;
  &lt;int name="blogId"&gt;2&lt;/int&gt;
  &lt;str name="pkId"&gt;3&lt;/str&gt;
 &lt;/doc&gt;
 &lt;doc&gt;
  &lt;int name="blogId"&gt;4&lt;/int&gt;
  &lt;str name="pkId"&gt;5&lt;/str&gt;
 &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>Badabom badabing!</p>
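<p>To see why the documents come back in that order, here is a minimal Java sketch that redoes the ranking outside Solr. It is only an illustration under a few assumptions: the *:* clause adds the same constant to every document and is therefore left out, log is the base-10 logarithm, and documents with equal scores keep their index order. The class and method names are made up.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: reproduces the ordering outside Solr. Each doc is a
// {pkId, blogId} pair and the map plays the role of external_blogRank.txt.
class ExternalRankOrderSketch {

    // Sorts docs by log10(blogRank) descending; ties keep their index order.
    static List<String> orderByBlogRank(String[][] docs, Map<String, Double> blogRank) {
        List<String[]> sorted = new ArrayList<>(Arrays.asList(docs));
        // A blogId missing from the file falls back to 0, like defVal="0".
        Comparator<String[]> byScore = Comparator.comparingDouble(
                doc -> Math.log10(blogRank.getOrDefault(doc[1], 0.0)));
        sorted.sort(byScore.reversed()); // List.sort is stable
        List<String> pkIds = new ArrayList<>();
        for (String[] doc : sorted) {
            pkIds.add(doc[0]);
        }
        return pkIds;
    }

    public static void main(String[] args) {
        // Same values as dataDir/external_blogRank.txt (blogId=rank).
        Map<String, Double> blogRank = new HashMap<>();
        blogRank.put("1", 2.0);
        blogRank.put("2", 1.0);
        blogRank.put("3", 3.0);
        blogRank.put("4", 1.0);
        // Same documents as /tmp/add.xml, as {pkId, blogId}.
        String[][] docs = {{"1", "1"}, {"2", "1"}, {"3", "2"}, {"4", "3"}, {"5", "4"}};
        System.out.println(orderByBlogRank(docs, blogRank)); // prints [4, 1, 2, 3, 5]
    }
}
```

<p>The printed pkId order matches the response above: blogId 3 has the highest rank (3.0), followed by the two entries of blogId 1 (2.0), and the blogs with rank 1.0 score log(1) = 0 and come last.</p>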
<p>Update:</p>
<p>An even better query (thanks to Yonik), which also takes the actual internal scoring into account:</p>
<p>GET 'http://127.0.0.1:8110/solr/test/select?indent=on&amp;start=0&amp;rows=100&amp;q={!boost b=blogRank v=$qq}&amp;qq=title:solr&amp;debugQuery=on'</p>
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/solr-external-scoring/#comments" thr:count="5"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/solr-external-scoring/feed/atom/" thr:count="5"/>
		<thr:total>5</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Replication in Mammatus]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/replication-in-mammatus/" />
		<id>http://dev.tailsweep.com/?p=75</id>
		<updated>2008-12-14T18:02:14Z</updated>
		<published>2008-12-14T17:24:59Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="mammatus" /><category scheme="http://dev.tailsweep.com" term="master" /><category scheme="http://dev.tailsweep.com" term="replication" /><category scheme="http://dev.tailsweep.com" term="slave" />		<summary type="html"><![CDATA[I have created a way of replicating state which is similar to MySQL replication. We have several cases where we want to update a B-tree on a central server and then have it replicated across all slave nodes. Today we serialize a HashMap to disk and rsync it, and when the slaves notice that the underlying file [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/replication-in-mammatus/"><![CDATA[<p>I have created a way of replicating state which is similar to MySQL replication.</p>
<p>We have several cases where we want to update a B-tree on a central server and then have it replicated across all slave nodes.</p>
<p>Today we serialize a HashMap to disk and rsync it, and when a slave notices that the underlying file has changed it reinitializes itself from that file. This works, but it is not a smart way of doing it, since the entire state has to be reloaded even though just one entry has been added. To solve that you need to add transaction logging and replicate those transactions.</p>
<p>So how does it work?</p>
<ul>
<li>A TransactionLogger needs to be initialized on both master and slave.</li>
<li>You write to the master file.</li>
<li>The slave polls the master and sends its latest sequence number (trx id), called X.</li>
<li>The master sends the delta entries from X to Y, where Y is the latest entry recorded on the master when the slave initiated the request.</li>
</ul>
<p>I wrote the transaction loggers as separate modules so you need to wire them up to make the storage synchronized.</p>
<p>On the slave you need a StateChangeListener and on the master you need to wrap the storage engine in a TransactionLoggerCacheStrategy.</p>
<p>Here is a fully working <a href="http://dev.tailsweep.com/wp-content/uploads/2008/12/logmanager.xml">example</a> spring context file.</p>
<p>Example code:</p>
<pre>import java.util.Date;
import org.springframework.context.support.ClassPathXmlApplicationContext;
// Cache comes from Mammatus; its import is omitted here.

public class ReplicationExample
{
    public static void main(String[] args)
    {
        String[] cfg = {"logManager.xml"};
        ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(cfg);
        Cache cacheMaster = (Cache) ctx.getBean("masterCache");
        Cache cacheSlave = (Cache) ctx.getBean("slaveCache");

        // Write on the master, then poll the slave until the entry shows up.
        cacheMaster.put("testing", new Date());
        while (true)
        {
            Date date = (Date) cacheSlave.get("testing");
            if (date != null)
            {
                System.out.println("Huzza!");
                System.exit(0);
            }
            try
            {
                Thread.sleep(1000);
            }
            catch (InterruptedException e)
            {
                // Restore the interrupt flag and stop polling.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}</pre>
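<p>The polling protocol described above can be sketched in a few lines of plain Java. This is not the Mammatus implementation: the Master, Slave and LogEntry classes are made up for illustration, everything runs in one process, and the transaction log is just an in-memory list.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of sequence-number based delta replication (hypothetical classes).
class DeltaReplicationSketch {

    // One logged write: a sequence number (trx id) plus the key/value written.
    static final class LogEntry {
        final long seq;
        final String key;
        final String value;

        LogEntry(long seq, String key, String value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    static final class Master {
        final Map<String, String> store = new HashMap<>();
        private final List<LogEntry> log = new ArrayList<>();

        // Every write is applied and appended to the transaction log.
        void put(String key, String value) {
            store.put(key, value);
            log.add(new LogEntry(log.size() + 1, key, value));
        }

        // Returns the entries after X up to Y, the latest entry at call time.
        List<LogEntry> deltaSince(long x) {
            int y = log.size();
            int from = (int) Math.min(Math.max(x, 0), y);
            return new ArrayList<>(log.subList(from, y));
        }
    }

    static final class Slave {
        final Map<String, String> store = new HashMap<>();
        long lastSeq = 0; // X: the latest sequence number applied locally

        // Sends X to the master and applies only the missing entries.
        void poll(Master master) {
            for (LogEntry e : master.deltaSince(lastSeq)) {
                store.put(e.key, e.value);
                lastSeq = e.seq;
            }
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.put("a", "1");
        master.put("b", "2");
        slave.poll(master);   // transfers two entries
        master.put("a", "3");
        slave.poll(master);   // transfers only the third entry
        System.out.println(slave.store.get("a") + " at seq " + slave.lastSeq);
    }
}
```

<p>The point of the design is visible in the second poll: only the one new entry crosses the wire instead of the whole serialized map.</p>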
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/replication-in-mammatus/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/replication-in-mammatus/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>Marcus Herou</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Spring with Hadoop]]></title>
		<link rel="alternate" type="text/html" href="http://dev.tailsweep.com/spring-with-hadoop/" />
		<id>http://dev.tailsweep.com/?p=74</id>
		<updated>2008-12-13T09:51:39Z</updated>
		<published>2008-12-13T09:51:39Z</published>
		<category scheme="http://dev.tailsweep.com" term="Uncategorized" /><category scheme="http://dev.tailsweep.com" term="hadoop" /><category scheme="http://dev.tailsweep.com" term="spring" />		<summary type="html"><![CDATA[We have really been struggling to find a way to launch Hadoop jobs and to create and wire all components with Spring. Finally we have arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use. Example Client (from where [...]]]></summary>
		<content type="html" xml:base="http://dev.tailsweep.com/spring-with-hadoop/"><![CDATA[<p>We have really been struggling to find a way to launch Hadoop jobs and to create and wire all components with Spring.</p>
<p>Finally we have arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use.</p>
<p>Example</p>
<p>Client (from where you launch JobClient)</p>
<p>JobConf job = createJob();</p>
<p>job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");</p>
<p>&#8230;..</p>
<p>Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:</p>
<p>String[] configs = jobConf.get("configs").split(",");<br />
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);</p>
<p>&#8230;Extract the beans you want and manually wire up the job, e.g.</p>
<p>this.contentParsers = (ContentParsers)ctx.getBean("contentParsers");</p>
<p>For this to work, all configuration files must be in the jar file that you tell Hadoop to run with:</p>
<p>job.setJar(jarFile);</p>
<p>If you want to add dependency jar files, use:</p>
<p>job.set("tmpjars", "/lib/jar1,/lib/jar2");</p>
<p>where the tmpjars must reside in HDFS before the job is run. To copy them there, use:</p>
<p>${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /</p>
<p>This puts the directory /lib in the HDFS root, which of course is just an example.</p>
<p>We use the same Spring context files in the dev, stage and prod environments, together with environment-specific property files that we use to filter the context files before packaging them in the jar.</p>
<p>Example:</p>
<p>--- clip context file ---</p>
<p>&lt;property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" /&gt;</p>
<p>--- clip ---</p>
<p><strong>environment.local.properties</strong></p>
<p>numberOfUrlsPerCrawl=100</p>
<p><strong>environment.prod.properties</strong></p>
<p>numberOfUrlsPerCrawl=100000</p>
<p>The client side is of course Spring-wired as well.</p>
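<p>The filtering step itself is just placeholder substitution. The sketch below shows the idea in plain Java; it is not the tool we use in the build, and the class name and the decision to leave unknown placeholders untouched are assumptions made for the example.</p>

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: replaces ${name} placeholders in a context file with values
// from an environment-specific property set.
class ContextFilterSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String filter(String template, Properties env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = env.getProperty(m.group(1));
            // Assumption: placeholders without a value are left as they are.
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<property name=\"numberOfUrlsPerCrawl\" value=\"${numberOfUrlsPerCrawl}\" />";

        Properties local = new Properties();
        local.setProperty("numberOfUrlsPerCrawl", "100");
        Properties prod = new Properties();
        prod.setProperty("numberOfUrlsPerCrawl", "100000");

        System.out.println(filter(line, local)); // value="100"
        System.out.println(filter(line, prod));  // value="100000"
    }
}
```

<p>Because the substitution happens before the jar is built, the jobs running on the cluster only ever see the already filtered context files.</p>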
]]></content>
		<link rel="replies" type="text/html" href="http://dev.tailsweep.com/spring-with-hadoop/#comments" thr:count="0"/>
		<link rel="replies" type="application/atom+xml" href="http://dev.tailsweep.com/spring-with-hadoop/feed/atom/" thr:count="0"/>
		<thr:total>0</thr:total>
	</entry>
	</feed>