Tailsweep dev blog

Ulimit can be a bastard

2009-10-27T14:29:11Z

How do you change the per-process ulimit without rebooting? We have not found a real way, only a workaround.

root:

ulimit -n 65536

su $user -p

Done! The -p preserves root’s environment, and the shell started by su inherits the raised limit from the root shell.
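To double-check from inside a Java service that the raised limit actually reached the process, a minimal sketch like the following can help. It is not part of the workaround itself: the class name FdLimitCheck is made up for illustration, and it assumes a HotSpot/OpenJDK JVM on a Unix-like OS, since com.sun.management.UnixOperatingSystemMXBean is a JDK-specific interface rather than part of the standard API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of any Tailsweep code.
final class FdLimitCheck {

    // Returns the maximum number of open file descriptors for this JVM
    // process, or -1 if the platform does not expose that number.
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("max open files: " + maxFileDescriptors());
    }
}
```

Run it from the shell started by su; the number should normally match what ulimit -n reports there.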

Some new servers

2009-10-27T13:26:46Z

Finally, some new servers are racked in cabinet 2.

We are looking for developers

2009-10-08T10:42:46Z

Tailsweep is developing at an enormous pace and we need to strengthen our development team with more developers.

Tailsweep is a data-driven company that handles large amounts of data in every aspect of its business. If you have experience writing programs that process large amounts of data (preferably with the technologies listed below), or simply have the following two simple traits:

Be smart
Get things done

then you are most likely the right person for the job and you will enjoy working with us. The “requirements” mentioned below are only meant to give a hint of which technologies we use. Above all we are looking for people who fit the company, who love developing software and are good at it. Everything else is really beside the point.

The three areas you will work in are:

Tailsweep Search & Report – Crawler and search index, one of Sweden's very largest data indexes for blog content.
Tailsweep Analytics – Our statistics system, which closely resembles Google Analytics. Practically all of the biggest Swedish bloggers are connected to it. Probably the most advanced system of its kind in Sweden.
Tailsweep Ad System – Our ad system, which publishes campaigns on thousands of blog sites around the world every day. The technical challenges in this system are very interesting, to put it modestly.

Experience with any of the technologies below earns you a gold star:

Hadoop – Processes our log data and runs our crawler
HBase – Only used in development so far, but will become an important component for scaling further
Hive – Will become our BI solution
Lucene – Used heavily where scalability matters less and “closeness” to the data matters more
Lucene SOLR – Our search index uses SOLR and is a distributed index
Lucene Nutch – If you know Nutch, you know most of what there is to know about our crawler
Some other data mining platform
Some other BI solution
Some other search platform (Sphinx, for example)
Some other indexing engine

We develop mainly in Java, so it is important that you master that language, but other niche skills of course also weigh heavily, for example experience with a search engine, a statistics system or similar.

We write practically all our templates in Velocity, so it is of course nice if you have seen that template language before.

We deploy, develop and work on Ubuntu Linux. We use the same OS locally as on the production platform to make sure that no strange OS-related bugs that could not be tested locally find their way into prod.

Other technical skills that count in your favour

MySQL – Our main DB
J2EE Servlet applications – Our web apps are written for J2EE and run in Tomcat
Spring – We use this IoC container everywhere
Spring MVC – For our web apps
Hibernate – Used everywhere performance is not critical
Perl – Listed as well, since we have lots of batch jobs written in Perl

Finally, here are some other tools and technologies that we use heavily but that are only trivia in this context:

Subversion – All our source code lives in Subversion
Maven – All projects are built with Maven 2
Lighttpd – Serves our static content and our blogs
WordPress – Our blogs run on WordPress
BASH – Yes, we use bash scripts everywhere
NFS – Used mostly for convenience
GlusterFS – Experimental scalable file system
Eclipse – What we develop in
HAProxy – Our LB: simple, fast and stable
SNMP – All machines are monitored with SNMP
Postfix – Mail
Nagios – Alerts for our most important services
Cacti – Trend graphs for performance-critical services
Mantis – Our case tool, simple and satisfactory

Example projects to get you started at Tailsweep

We are going to rebuild our statistics engine to use Hive instead of MonetDB, which we use today. Hive is excellent for processing enormous amounts of log files, and this is our most important service.

We have built our own sharded MySQL solution that spans 50 databases in our search platform, but we are looking at moving this architecture to HBase, an open-source store modeled on Google's BigTable, the distributed storage system behind many Google services.
We are going to build a behavioural targeting engine that distributes campaigns to the sites where they perform best. For this we need to build an ad pool from which the campaigns are “pulled”.

Sounds interesting? Then you will enjoy working at Tailsweep.

Send an email with your CV to job at tailsweep.com and I will get in touch and set up a meeting.

Kind regards

//Marcus Herou, CTO Tailsweep AB

Patch Hadoop for faster startup

2009-09-24T07:42:50Z

Do you add dependency support for your Hadoop jobs by configuring the “tmpjars” property?

This means that your jar files need to be located on HDFS and are loaded by Hadoop at runtime.

If you do, your app will be significantly slower to start. You can reduce the startup time from 1 minute to less than 10 seconds by patching the mapred/org/apache/hadoop/mapred/TaskRunner.java class to pick up the files from a local repo instead of from HDFS.

Find the place where the classpath is being built in that source file (line 272 in hadoop-0.18.3) and insert the code snippet at the –SNIPPET_HERE– marker shown in this excerpt:

classPath.append(sep);

classPath.append(workDir);

–SNIPPET_HERE–

// Build exec child jmv args.
Vector vargs = new Vector(8);
File jvm = // use same jvm as parent
new File(new File(System.getProperty("java.home"), "bin"), "java");

vargs.add(jvm.toString());

Here is the snippet:


// Comma-separated list of paths that already exist locally on every node
String additionalClassPath = conf.get("mapred.additional.class.path");
if (additionalClassPath != null)
{
    String[] localfiles = additionalClassPath.split(",");
    for (int i = 0; i < localfiles.length; i++)
    {
        String localfile = localfiles[i].trim();
        LOG.info("Adding " + localfile);
        // Append the local path straight to the child JVM classpath
        classPath.append(sep);
        classPath.append(localfile);
    }
}

Then just build the new Hadoop jar by issuing “ant jar”. Make sure that you have the same jar on all nodes as well as on the jobtracker.
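The patched logic can also be tried outside Hadoop. The following is a minimal sketch, not the actual TaskRunner code: AdditionalClassPath and its append method are names made up for illustration, the paths are hypothetical, and unlike the patch it skips empty entries.

```java
import java.io.File;

// Hypothetical stand-alone version of the classpath patch.
final class AdditionalClassPath {

    // Appends each comma-separated local path to the classpath being built.
    // A null property value leaves the classpath untouched.
    static void append(StringBuilder classPath, String sep, String additional) {
        if (additional == null) {
            return;
        }
        for (String entry : additional.split(",")) {
            String localFile = entry.trim();
            if (localFile.isEmpty()) {
                continue; // deviation from the patch: ignore empty entries
            }
            classPath.append(sep).append(localFile);
        }
    }

    public static void main(String[] args) {
        StringBuilder cp = new StringBuilder("/tmp/work");
        append(cp, File.pathSeparator, "/opt/jobs/lib/a.jar, /opt/jobs/lib/b.jar");
        System.out.println(cp);
    }
}
```

On the job side the property would then be set with something like job.set("mapred.additional.class.path", "/opt/jobs/lib/a.jar,/opt/jobs/lib/b.jar"), assuming those jars exist at the same local path on every node.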

Mammatus is now a replicated KeyValueStore

2009-09-09T09:27:24Z

We proudly announce that Mammatus now supports transactional replication of configurable KeyValueStore(s), something similar to Cassandra (where is it these days?) or Voldemort.

Our pagehit/adhit tracking services at script.tailsweep.com use this feature, and with about 1000 web requests per second you can say that it is quite stress tested.

Look in the MasterSlaveTest class for examples.

Cheap backup

2009-05-05T20:29:12Z

I really love having backups, but hate paying for them since deep down in my gut it somehow feels like wasted money. So how do you get the most bang for the buck?

Buy some simple 1TB USB2 drives and just plug them into one of your servers and mount them as regular drives. Simple as that.

Want RAID? No problem, this is what we did.

Find the device names by issuing:

sudo fdisk -l

The two drives came out as /dev/sdb1 and /dev/sdc1

Here is the magic:

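# Create the block device node for the array (major 9 = md, minor 0)
mknod /dev/md0 b 9 0
# Create a RAID 1 (mirror) array from the two partitions
mdadm -C -v /dev/md0 -l 1 -n 2 /dev/sdb1 /dev/sdc1
# Create an ext3 file system labelled /usb_drive1
mkfs.ext3 -L/usb_drive1 /dev/md0
# Disable the fsck forced after a maximum number of mounts
tune2fs -c 0 /dev/md0
# Disable the fsck forced after a time interval
tune2fs -i 0 /dev/md0
# Use writeback journaling as the default mount option
tune2fs -o journal_data_writeback /dev/md0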

Mount it.

mount /dev/md0 /srv/backup

That is really it.

This is how it looks now in our cabinet, really ugly but what the heck, who cares haha.

Tailsweep goes Hive

2009-04-27T06:18:31Z

We have now started to experiment with Hive. It makes perfect sense, since what we have built internally is basically Hive but in the form of zillions of Hadoop jobs.

How nice would it be to just clean your data, create a CSV version of the actual log, inject it into Hive and then run various SQL commands that output the results in a format of your choice?

Sounds like a data warehouse? Well, it more or less is, but it has the computing power of all machines in the cluster, which makes it very useful. We currently use MonetDB and it is blazing fast, but it performs poorly on a machine with little memory (which is no surprise) and also claims all the memory it can find, so we limit it with some tricks to keep it from swapping out the machine completely.

Solr external scoring

2009-04-25T07:23:44Z

We had trouble figuring out how to get SOLR to handle external scores. Thanks to Grant Ingersoll and Yonik Seeley we have now figured this out.

The solution: ExternalFileField + FunctionQuery

This is how I tested this setup.

# solr.xml


 
        
 


# Schema, a pkId (blog entry) belongs to a blogId (the blog)

<uniqueKey>pkId</uniqueKey>
<defaultSearchField>pkId</defaultSearchField>


# dataDir/external_blogRank.txt
1=2.0
2=1.0
3=3.0
4=1.0

# Add doc file, save it as /tmp/add.xml

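<add>
    <doc><field name="pkId">1</field><field name="blogId">1</field></doc>
    <doc><field name="pkId">2</field><field name="blogId">1</field></doc>
    <doc><field name="pkId">3</field><field name="blogId">2</field></doc>
    <doc><field name="pkId">4</field><field name="blogId">3</field></doc>
    <doc><field name="pkId">5</field><field name="blogId">4</field></doc>
</add>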


# Add some data
curl http://127.0.0.1:8110/solr/test/update --data-binary @/tmp/add.xml -H "Content-Type: text/xml"


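<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">239</int></lst></response>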


# Commit
curl http://127.0.0.1:8110/solr/test/update -H "Content-Type: text/xml" --data-binary '<commit/>'


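<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">6</int></lst></response>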

# Issue a query; the entries with the highest blogRank should come first

mahe@mahe-laptop:~$ GET "http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:* _val_:\"log(blogRank)\""

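Response header: status=0, QTime=3
Params: start=0, indent=on, q=*:* _val_:"log(blogRank)", rows=100

Documents, in the order returned:
blogId=3, pkId=4
blogId=1, pkId=1
blogId=1, pkId=2
blogId=2, pkId=3
blogId=4, pkId=5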

Badabom badabing!
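Why the documents come back in that order can be sketched in plain Java. This is only an illustration of the ranking, not how Solr computes it internally: the class ExternalRankOrder is made up, it assumes the external file is keyed by blogId (which is what the result order suggests), and it ignores Lucene's query normalization, which does not change the ordering here. The five documents are pkId 1 to 5 belonging to blogId 1, 1, 2, 3 and 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of ordering by log(blogRank).
final class ExternalRankOrder {

    // Parses "key=value" lines, the layout of external_blogRank.txt.
    static Map<Integer, Double> parseRanks(List<String> lines) {
        Map<Integer, Double> ranks = new LinkedHashMap<>();
        for (String line : lines) {
            String[] kv = line.trim().split("=", 2);
            if (kv.length == 2) {
                ranks.put(Integer.valueOf(kv[0]), Double.valueOf(kv[1]));
            }
        }
        return ranks;
    }

    // Orders pkIds by log10 of their blog's rank, highest first.
    // docs maps pkId -> blogId in index order; the sort is stable,
    // so documents with equal rank keep that order.
    static List<Integer> order(Map<Integer, Integer> docs, Map<Integer, Double> ranks) {
        List<Integer> pkIds = new ArrayList<>(docs.keySet());
        pkIds.sort((a, b) -> Double.compare(
                Math.log10(ranks.get(docs.get(b))),
                Math.log10(ranks.get(docs.get(a)))));
        return pkIds;
    }

    public static void main(String[] args) {
        Map<Integer, Double> ranks =
                parseRanks(Arrays.asList("1=2.0", "2=1.0", "3=3.0", "4=1.0"));
        Map<Integer, Integer> docs = new LinkedHashMap<>();
        docs.put(1, 1);
        docs.put(2, 1);
        docs.put(3, 2);
        docs.put(4, 3);
        docs.put(5, 4);
        System.out.println(order(docs, ranks)); // prints [4, 1, 2, 3, 5]
    }
}
```

pkId 4 belongs to blog 3, which has the highest rank (3.0), so it comes first; the two entries of blog 1 (rank 2.0) follow, and the rank 1.0 blogs come last.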

Update:

An even better query (thanks to Yonik), which takes the actual internal scoring into account as well:

GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'

Replication in Mammatus

2008-12-14T18:02:14Z

I have created a way of replicating state that is similar to MySQL replication.

We have several cases where we want to update a B-tree on a central server and then have it replicated across all slave nodes.

Today we serialize a HashMap to disk and rsync it; when a slave notices that the underlying file has changed, it re-initializes itself from it. This works, but it is not a smart way of doing it, since the entire state has to be reloaded even though just one entry has been added. To solve that you need to add transaction logging and replicate those transactions.

So how does it work ?

* TransactionLogger needs to be initialized on both master and slave.

* You write to the master file.

* The slave polls the master and sends its latest sequence number (trx id), called X.

* The master sends the delta entries from X to Y where Y is the latest entry noted on the master when the client initiated the request.
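The polling steps above can be sketched in a few lines of plain Java. This is a simplified in-memory illustration under our own assumptions, not the Mammatus implementation: the class and method names (DeltaReplicationSketch, Master, Slave, deltaSince, poll) are made up, sequence numbers simply start at 1, and there is no network or disk involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of sequence-number based delta replication.
final class DeltaReplicationSketch {

    // One logged write; seq is the transaction id.
    static final class Entry {
        final long seq;
        final String key;
        final String value;

        Entry(long seq, String key, String value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    static final class Master {
        private final Map<String, String> store = new HashMap<>();
        private final List<Entry> log = new ArrayList<>();

        // Applies the write and appends it to the transaction log.
        synchronized void put(String key, String value) {
            store.put(key, value);
            log.add(new Entry(log.size() + 1, key, value));
        }

        // Returns the entries with seq in (x, y], where y is the latest
        // sequence number on the master at the time of the call.
        synchronized List<Entry> deltaSince(long x) {
            int from = (int) Math.min(Math.max(x, 0L), (long) log.size());
            return new ArrayList<>(log.subList(from, log.size()));
        }
    }

    static final class Slave {
        private final Map<String, String> store = new HashMap<>();
        private long lastSeq = 0;

        // Sends our latest sequence number to the master and applies the
        // returned delta. Returns the number of entries applied.
        int poll(Master master) {
            List<Entry> delta = master.deltaSince(lastSeq);
            for (Entry e : delta) {
                store.put(e.key, e.value);
                lastSeq = e.seq;
            }
            return delta.size();
        }

        String get(String key) {
            return store.get(key);
        }

        long lastSeq() {
            return lastSeq;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.put("testing", "first");
        master.put("testing", "second");
        System.out.println("applied " + slave.poll(master) + " entries, value = "
                + slave.get("testing"));
    }
}
```

Because the slave only asks for entries after its own sequence number, adding one entry on the master costs one entry on the wire instead of a reload of the whole state.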

I wrote the transaction loggers as separate modules so you need to wire them up to make the storage synchronized.

On the slave you need a StateChangeListener and on the master you need to wrap the storage engine in a TransactionLoggerCacheStrategy.

Here is a fully working example spring context file.

Example code:

public static void main(String[] args)
{
    // logManager.xml is the Spring context file that wires up the master and slave caches
    String[] cfg = {"logManager.xml"};
    ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(cfg);
    Cache cacheMaster = (Cache) ctx.getBean("masterCache");
    Cache cacheSlave = (Cache) ctx.getBean("slaveCache");

    // Write to the master, then poll the slave until the entry has been replicated
    cacheMaster.put("testing", new Date());
    while (true)
    {
        Date date = (Date) cacheSlave.get("testing");
        if (date != null)
        {
            System.out.println("Huzza!");
            System.exit(0);
        }
        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
    }
}

Spring with Hadoop

2008-12-13T09:51:39Z

We have really been struggling to find a way to launch Hadoop jobs while creating and wiring all components with Spring.

Finally we have arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use.

Example

Client (from where you launch JobClient)

JobConf job = createJob();

job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");

…..

Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:

String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);

…Extract the beans you want and manually wire up the job, e.g.

this.contentParsers = (ContentParsers) ctx.getBean("contentParsers");

For this to work you need to have all configurations in your jar-file which you tell hadoop to run with:

job.setJar(jarFile);

and if you want to add some dependency jar files use:

job.set("tmpjars", "/lib/jar1,/lib/jar2");

where the tmpjars must reside in HDFS before running the job.

use ${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /

This will put the dir /lib in the HDFS root, which of course is just an example.

We use the same Spring context files in the dev, stage and prod environments, and filter them with environment-specific property files before packaging them into the jar.

Example:

—clip context file—

—clip—

environment.local.properties

numberOfUrlsPerCrawl=100

environment.prod.properties

numberOfUrlsPerCrawl=100000
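However the filtering is done in practice, the idea can be sketched in plain Java. Everything specific here is an assumption made for illustration and not taken from our actual context files: the ${name} placeholder syntax, the ContextFilter class and the sample property element.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of filtering a context file with environment properties.
final class ContextFilter {

    // Matches ${name} placeholders (assumed syntax).
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its value from props.
    // Placeholders without a matching property are left untouched.
    static String filter(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties local = new Properties();
        local.setProperty("numberOfUrlsPerCrawl", "100");
        // Made-up bean property line standing in for a real context file.
        String ctx = "<property name=\"numberOfUrlsPerCrawl\" value=\"${numberOfUrlsPerCrawl}\"/>";
        System.out.println(filter(ctx, local));
    }
}
```

Filtering the same template with the prod properties would put 100000 in place of 100, so the context files themselves never differ between environments.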

The client side is of course Spring-wired as well.