<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DataMine Lab</title>
	<atom:link href="https://dataminelab.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://dataminelab.com</link>
	<description>data is the answer</description>
	<lastBuildDate>Fri, 01 Jun 2018 20:16:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.9</generator>
	<item>
		<title>YCSB run against HBase 0.92 on Amazon Elastic MapReduce</title>
		<link>https://dataminelab.com/blog/ycsb-run-against-hbase-0-92-on-amazon-elastic-mapreduce/</link>
		<comments>https://dataminelab.com/blog/ycsb-run-against-hbase-0-92-on-amazon-elastic-mapreduce/#respond</comments>
		<pubDate>Sun, 16 Sep 2012 20:29:52 +0000</pubDate>
		<dc:creator><![CDATA[Krystian Nowak]]></dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[EMR]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[YCSB]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=494</guid>
		<description><![CDATA[In this post we show, in a few simple steps, how to use the Yahoo! Cloud Serving Benchmark (https://github.com/dataminelab/YCSB) to run benchmarks against an HBase 0.92 cluster deployed automatically by Amazon Elastic MapReduce, and what measurements and comparisons you can obtain when choosing among the available instance types. We will create EMR HBase clusters using the tooling [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this post we show, in a few simple steps, how to use the Yahoo! Cloud Serving Benchmark (<a title="https://github.com/dataminelab/YCSB" href="https://github.com/dataminelab/YCSB">https://github.com/dataminelab/YCSB</a>) to run benchmarks against an <a title="HBase 0.92" href="http://hbase.apache.org/">HBase 0.92</a> cluster deployed automatically by <a title="Amazon Elastic MapReduce" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>, and what measurements and comparisons you can obtain when choosing among the available <a title="instance types" href="http://aws.amazon.com/ec2/instance-types/">instance types</a>.</p>
<p><span id="more-494"></span></p>
<p><strong>We will create EMR HBase clusters</strong> using the tooling provided by Amazon:<br />
<a title="http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip" href="http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip">http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip</a></p>
<p>Note: as you can see in <em>commands.rb</em>, the <em>default_hadoop_version</em> is set to 0.20(.x), but our tests found that Hadoop 1.0.3 gives a significant performance gain. Therefore, when creating the EMR cluster, we set this version explicitly.</p>
<p><strong>Let&#8217;s create one:</strong></p>
<pre>
elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 2 \
--instance-type m1.large \
--hadoop-version 1.0.3
Created job flow j-1PP3JU6UJ0HQ1
</pre>
<p></p>
<pre>
elastic-mapreduce --list --active
j-1PP3JU6UJ0HQ1     WAITING
ec2-23-22-19-48.compute-1.amazonaws.com          EMR HBase YCSB
 COMPLETED      Start HBase</pre>
<p></p>
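<p>To avoid copying the master hostname by hand, the listing can be parsed. A minimal Python sketch, assuming the output keeps the shape shown above (a <em>j-...</em> job flow id followed by a state word, and an <em>ec2-...amazonaws.com</em> public DNS name); the function name <em>parse_active_job_flow</em> is our own and not part of the Amazon tooling:</p>
<pre>
```python
import re

# Sample text copied from the listing above; real output may be laid out differently.
SAMPLE = """j-1PP3JU6UJ0HQ1     WAITING
ec2-23-22-19-48.compute-1.amazonaws.com          EMR HBase YCSB
 COMPLETED      Start HBase"""


def parse_active_job_flow(text):
    """Return (job_flow_id, state, master_dns) for the first job flow in text.

    Hypothetical helper: it only pattern-matches the fields visible in the
    sample and returns None for any field it cannot find.
    """
    job = re.search(r"\b(j-[A-Z0-9]+)\s+([A-Z_]+)", text)
    dns = re.search(r"\bec2-[\w.-]+\.amazonaws\.com\b", text)
    job_id = job.group(1) if job else None
    state = job.group(2) if job else None
    return job_id, state, dns.group(0) if dns else None


if __name__ == "__main__":
    print(parse_active_job_flow(SAMPLE))
```
</pre>
<p>The returned hostname is the one used in the <em>scp</em> and <em>ssh</em> commands that follow.</p>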
<p>Build the project (the HBase master server variables should now default to localhost (<em>127.0.0.1</em>)).</p>
<pre>
git clone git@github.com:dataminelab/YCSB.git
cd YCSB
export MAVEN_OPTS="-Xmx512m -Xms128m -Xss2m"</pre>
<p></p>
<blockquote><p>(see <a title="http://jira.codehaus.org/browse/MASSEMBLY-549" href="http://jira.codehaus.org/browse/MASSEMBLY-549">http://jira.codehaus.org/browse/MASSEMBLY-549</a> for the reason&#8230;)</p></blockquote>
<pre>mvn clean install -Dcheckstyle.skip=true
cd distribution/target
# copy the built tarball to the cluster master node
scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-23-22-19-48.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
# log in to the master; the remaining commands run there
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-23-22-19-48.compute-1.amazonaws.com
tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb
</pre>
<p></p>
<p>Create the working table in HBase (already pre-split):</p>
<pre>
hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
</pre>
<p></p>
<p>The split is hard to get perfect because <a title="https://issues.apache.org/jira/browse/HBASE-4163" href="https://issues.apache.org/jira/browse/HBASE-4163">https://issues.apache.org/jira/browse/HBASE-4163</a> is still not in place (please vote! :)).<br />
But it still seems to be better than no split at all!</p>
<p>You might spot:</p>
<pre>
12/08/25 13:39:16 ERROR metrics.MetricsSaver:
Failed SaveRecords hdfs:/mnt/var/lib/hadoop/metrics/raw/i-694c4712_04272_raw.bin
Shutdown in progress
</pre>
<p>as in <a title="https://forums.aws.amazon.com/thread.jspa?threadID=100643" href="https://forums.aws.amazon.com/thread.jspa?threadID=100643">https://forums.aws.amazon.com/thread.jspa?threadID=100643</a> but it doesn&#8217;t seem to hurt us&#8230;</p>
<pre>
hbase shell
scan '.META.', {COLUMNS =&gt; 'info:regioninfo'}
exit
</pre>
<p></p>
<p>Load the initial data into HBase:</p>
<pre>
./bin/ycsb load hbase -p columnfamily=family -P workloads/workloada | tee load.log
</pre>
<p></p>
<p>Check with your own eyes that the data has been loaded into HBase:</p>
<pre>
hbase shell

hbase(main):001:0&gt; count 'usertable'
Current count: 1000, row: user995698996184959679
1000 row(s) in 2.3210 seconds
</pre>
<p></p>
<p>Now run the tests, first only as a warm-up:</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log
</pre>
<p></p>
<p>And now the real tests with 10 threads:</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
</pre>
<p></p>
<pre>
cat real-tests-workload-a.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 47132.0
[OVERALL], Throughput(ops/sec), 2121.700755325469
[UPDATE], Operations, 50209
[UPDATE], AverageLatency(us), 186.93305980999423
</pre>
<p></p>
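<p>The summary lines are plain comma-separated triples (section, metric, value), so they are easy to post-process. A small Python sketch, with helper names of our own choosing, that parses the excerpt above and checks that the reported throughput equals the operation count divided by the run time (assuming <em>operationcount=100000</em>, as given on the command line):</p>
<pre>
```python
# Excerpt of real-tests-workload-a.log as shown above.
SUMMARY = """[OVERALL], RunTime(ms), 47132.0
[OVERALL], Throughput(ops/sec), 2121.700755325469
[UPDATE], Operations, 50209
[UPDATE], AverageLatency(us), 186.93305980999423"""


def parse_summary(text):
    """Map (section, metric) to a float for every '[SECTION], metric, value' line."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        # Skip anything that is not a three-field summary line.
        if len(parts) != 3 or not parts[0].startswith("["):
            continue
        try:
            metrics[(parts[0].strip("[]"), parts[1])] = float(parts[2])
        except ValueError:
            continue
    return metrics


def throughput_from_runtime(operation_count, runtime_ms):
    """Operations per second implied by an operation count and a run time in ms."""
    return operation_count / (runtime_ms / 1000.0)


if __name__ == "__main__":
    m = parse_summary(SUMMARY)
    implied = throughput_from_runtime(100000, m[("OVERALL", "RunTime(ms)")])
    print(round(implied, 2), round(m[("OVERALL", "Throughput(ops/sec)")], 2))
```
</pre>
<p>Both numbers agree (100000 operations in 47.132 seconds is about 2121.7 ops/sec), so the throughput line carries no information beyond the run time for a fixed operation count.</p>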
<p>Again with 10 threads, but for another workload type (workload f):</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s -threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 52748.0
[OVERALL], Throughput(ops/sec), 1895.8064760749223
[UPDATE], Operations, 50018
[UPDATE], AverageLatency(us), 11.925006997480907
</pre>
<p></p>
<p>Now we can check how these workload scenarios behave as the number of threads increases.<br />
Starting with 100 threads:</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 24234.0
[OVERALL], Throughput(ops/sec), 4126.433935792688
[UPDATE], Operations, 50063
[UPDATE], AverageLatency(us), 1076.5547010766434
</pre>
<p></p>
<p>500 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-a-500t.log
cat real-tests-workload-a-500t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 20706.0
[OVERALL], Throughput(ops/sec), 4829.518014102193
[UPDATE], Operations, 50099
[UPDATE], AverageLatency(us), 6167.192359128925
</pre>
<p></p>
<p>1000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 21484.0
[OVERALL], Throughput(ops/sec), 4654.626698938745
[UPDATE], Operations, 49988
[UPDATE], AverageLatency(us), 9423.208390013604
</pre>
<p></p>
<p>2000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 24358.0
[OVERALL], Throughput(ops/sec), 4105.427374989737
[UPDATE], Operations, 49957
[UPDATE], AverageLatency(us), 7786.985767760274
</pre>
<p></p>
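<p>The workload a runs above differ only in the thread count and the log file name, so the command lines can be generated rather than retyped. A Python sketch that only builds the argument lists, using the flags from the commands above (the repeated <em>-p columnfamily=family</em> is passed once); the function names and the log naming rule, which mimics the <em>-100t</em> and <em>-1kt</em> names used here, are our own assumptions:</p>
<pre>
```python
import shlex


def log_suffix(threads):
    """Mimic the log names used in this post: 100 gives '100t', 1000 gives '1kt'."""
    if threads % 1000 == 0:
        return "%dkt" % (threads // 1000)
    return "%dt" % threads


def ycsb_run_command(workload, threads, operation_count=100000):
    """Build the argv for one './bin/ycsb run hbase' invocation and its log name."""
    argv = [
        "./bin/ycsb", "run", "hbase",
        "-p", "columnfamily=family",
        "-P", "workloads/workload%s" % workload,
        "-p", "operationcount=%d" % operation_count,
        "-s",
        "-threads", str(threads),
    ]
    log_name = "real-tests-workload-%s-%s.log" % (workload, log_suffix(threads))
    return argv, log_name


if __name__ == "__main__":
    for threads in (100, 500, 1000, 2000):
        argv, log_name = ycsb_run_command("a", threads)
        # Print a shell line equivalent to the commands shown above.
        print("%s | tee %s" % (" ".join(shlex.quote(a) for a in argv), log_name))
```
</pre>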
<p>And the same for the other workload scenario now:<br />
100 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-f-100t.log
cat real-tests-workload-f-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 33924.0
[OVERALL], Throughput(ops/sec), 2947.7655936799906
[UPDATE], Operations, 50136
[UPDATE], AverageLatency(us), 17.44125977341631
</pre>
<p></p>
<p>1000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 29309.0
[OVERALL], Throughput(ops/sec), 3411.921252857484
[UPDATE], Operations, 50127
[UPDATE], AverageLatency(us), 16.611586570111914
</pre>
<p></p>
<p>2000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 29311.0
[OVERALL], Throughput(ops/sec), 3411.688444611238
[UPDATE], Operations, 49951
[UPDATE], AverageLatency(us), 59.80148545574663
</pre>
<p></p>
<p>3000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-f-3kt.log
cat real-tests-workload-f-3kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 32314.0
[OVERALL], Throughput(ops/sec), 3063.6875657609703
[UPDATE], Operations, 49492
[UPDATE], AverageLatency(us), 20.00127293299927
</pre>
<p></p>
<p>4000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 4000 | tee real-tests-workload-f-4kt.log
cat real-tests-workload-f-4kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 35051.0
[OVERALL], Throughput(ops/sec), 2852.985649482183
[UPDATE], Operations, 50095
[UPDATE], AverageLatency(us), 38.50611837508733
</pre>
<p></p>
<p><strong>Let&#8217;s now try more instances: 4 slaves instead of just one, same type as before.</strong></p>
<pre>
elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type m1.large \
--hadoop-version 1.0.3
Created job flow j-OE7G6YUHMD2I
</pre>
<p></p>
<pre>
elastic-mapreduce --list --active
j-OE7G6YUHMD2I      WAITING
ec2-50-17-100-242.compute-1.amazonaws.com         EMR HBase YCSB
COMPLETED      Start HBase
</pre>
<p></p>
<p>Now just copy the already built test suite:</p>
<pre>
scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-50-17-100-242.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-50-17-100-242.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb
</pre>
<p></p>
<p>Initialize table:</p>
<pre>
hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
</pre>
<p></p>
<p>Load initial data:</p>
<pre>
./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log
</pre>
<p></p>
<p>And run tests:<br />
warm-up</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log
</pre>
<p></p>
<p>10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 42609.0
[OVERALL], Throughput(ops/sec), 2346.9220117815485
[UPDATE], Operations, 50073
[UPDATE], AverageLatency(us), 117.53685618996265
</pre>
<p></p>
<p>100 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 23500.0
[OVERALL], Throughput(ops/sec), 4255.31914893617
[UPDATE], Operations, 49837
[UPDATE], AverageLatency(us), 1089.7759295302687
</pre>
<p></p>
<p>500 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-a-500t.log
cat real-tests-workload-a-500t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 19763.0
[OVERALL], Throughput(ops/sec), 5059.960532307848
[UPDATE], Operations, 50196
[UPDATE], AverageLatency(us), 4854.259104311101
</pre>
<p></p>
<p>1000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 20028.0
[OVERALL], Throughput(ops/sec), 4993.0097862991815
[UPDATE], Operations, 49904
[UPDATE], AverageLatency(us), 9582.977617024688
</pre>
<p></p>
<p>2000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 22608.0
[OVERALL], Throughput(ops/sec), 4423.2130219391365
[UPDATE], Operations, 49988
[UPDATE], AverageLatency(us), 6244.29357045691
</pre>
<p></p>
<p>5000 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 5000 | tee real-tests-workload-a-5kt.log
cat real-tests-workload-a-5kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 24861.0
[OVERALL], Throughput(ops/sec), 4022.3643457624394
[UPDATE], Operations, 50100
[UPDATE], AverageLatency(us), 8150.377125748503
</pre>
<p></p>
<p>10k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10000 | tee real-tests-workload-a-10kt.log
cat real-tests-workload-a-10kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 25336.0
[OVERALL], Throughput(ops/sec), 3946.9529523208084
[UPDATE], Operations, 50176
[UPDATE], AverageLatency(us), 8851.578204719388
</pre>
<p></p>
<p>workload f, 10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 53310.0
[OVERALL], Throughput(ops/sec), 1875.8206715438005
[UPDATE], Operations, 49867
[UPDATE], AverageLatency(us), 12.18058034371428
</pre>
<p></p>
<p>100 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-f-100t.log
cat real-tests-workload-f-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 30991.0
[OVERALL], Throughput(ops/sec), 3226.7432480397533
[UPDATE], Operations, 50145
[UPDATE], AverageLatency(us), 13.73040183467943
</pre>
<p></p>
<p>1k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 29185.0
[OVERALL], Throughput(ops/sec), 3426.4176803152304
[UPDATE], Operations, 50047
[UPDATE], AverageLatency(us), 29.82979998801127
</pre>
<p></p>
<p>2k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 31906.0
[OVERALL], Throughput(ops/sec), 3134.206732276061
[UPDATE], Operations, 50111
[UPDATE], AverageLatency(us), 24.55253337590549
</pre>
<p></p>
<p>3k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-f-3kt.log
cat real-tests-workload-f-3kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 34410.0
[OVERALL], Throughput(ops/sec), 2877.070619006103
[UPDATE], Operations, 49607
[UPDATE], AverageLatency(us), 23.37424153849255
</pre>
<p></p>
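<p>To compare the two m1.large clusters, the workload a throughput figures reported so far can be put side by side. A Python sketch using only numbers copied from the logs above (rounded to two decimals); the dictionary and function names are ours:</p>
<pre>
```python
# Workload a throughput (ops/sec) by thread count, copied from the logs above.
M1_LARGE_1_SLAVE = {10: 2121.70, 100: 4126.43, 500: 4829.52, 1000: 4654.63, 2000: 4105.43}
M1_LARGE_4_SLAVES = {10: 2346.92, 100: 4255.32, 500: 5059.96, 1000: 4993.01, 2000: 4423.21}


def speedup(baseline, candidate):
    """Ratio candidate/baseline for every thread count present in both runs."""
    common = sorted(t for t in baseline if t in candidate)
    return {t: candidate[t] / baseline[t] for t in common}


if __name__ == "__main__":
    for threads, ratio in speedup(M1_LARGE_1_SLAVE, M1_LARGE_4_SLAVES).items():
        print("%5d threads: x%.2f" % (threads, ratio))
```
</pre>
<p>On these figures the four-slave cluster is only about 3% to 11% faster than the single-slave one for workload a.</p>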
<p><strong>Now let&#8217;s see how even more powerful instances offered by AWS behave in this scenario!</strong><br />
m1.xlarge (2 x more memory, 2 x more CPU than m1.large)</p>
<pre>
elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type m1.xlarge \
--hadoop-version 1.0.3
Created job flow j-2ICBS9029MJAV
</pre>
<p></p>
<pre>
./elastic-mapreduce --list --active
j-2ICBS9029MJAV      WAITING
ec2-107-21-130-111.compute-1.amazonaws.com         EMR HBase YCSB
COMPLETED      Start HBase
</pre>
<p></p>
<pre>
scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-107-21-130-111.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-107-21-130-111.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb
</pre>
<p></p>
<pre>
hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
</pre>
<p></p>
<pre>
./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log
</pre>
<p></p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log
</pre>
<p></p>
<p>10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 39481.0
[OVERALL], Throughput(ops/sec), 2532.8639092221574
[UPDATE], Operations, 49981
[UPDATE], AverageLatency(us), 62.85440467377604
</pre>
<p></p>
<p>100 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 17877.0
[OVERALL], Throughput(ops/sec), 5593.779716954747
[UPDATE], Operations, 50100
[UPDATE], AverageLatency(us), 640.4568662674651
</pre>
<p></p>
<p>1k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s -threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 13986.0
[OVERALL], Throughput(ops/sec), 7150.00715000715
[UPDATE], Operations, 49750
[UPDATE], AverageLatency(us), 8759.566291457286
</pre>
<p></p>
<p>2k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 14783.0
[OVERALL], Throughput(ops/sec), 6764.526821348847
[UPDATE], Operations, 50118
[UPDATE], AverageLatency(us), 26718.534857735744
</pre>
<p></p>
<p>3k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-a-3kt.log
cat real-tests-workload-a-3kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 15477.0
[OVERALL], Throughput(ops/sec), 6396.588486140725
[UPDATE], Operations, 49465
[UPDATE], AverageLatency(us), 12066.01403012231
</pre>
<p></p>
<p>4k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 4000 | tee real-tests-workload-a-4kt.log
cat real-tests-workload-a-4kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 15261.0
[OVERALL], Throughput(ops/sec), 6552.650547146321
[UPDATE], Operations, 49883
[UPDATE], AverageLatency(us), 22551.664294449012
</pre>
<p></p>
<p>another workload, 10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 45751.0
[OVERALL], Throughput(ops/sec), 2185.744573889095
[UPDATE], Operations, 49950
[UPDATE], AverageLatency(us), 9.801721721721721
</pre>
<p></p>
<p>500 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-f-500t.log
cat real-tests-workload-f-500t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 21870.0
[OVERALL], Throughput(ops/sec), 4572.473708276178
[UPDATE], Operations, 49678
[UPDATE], AverageLatency(us), 11.18187125085551
</pre>
<p></p>
<p>1k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 19207.0
[OVERALL], Throughput(ops/sec), 5206.435153850159
[UPDATE], Operations, 49879
[UPDATE], AverageLatency(us), 11.812406022574631
</pre>
<p></p>
<p>2k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 20493.0
[OVERALL], Throughput(ops/sec), 4879.715024642561
[UPDATE], Operations, 50114
[UPDATE], AverageLatency(us), 12.770423434569182
</pre>
<p></p>
<p><strong>And now, more CPU power!</strong><br />
c1.xlarge (about the same memory, 5 x more CPU than m1.large)</p>
<pre>
elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type c1.xlarge \
--hadoop-version 1.0.3
Created job flow j-3KZHQRG2D74AY
</pre>
<p></p>
<pre>
./elastic-mapreduce --list --active
j-3KZHQRG2D74AY     WAITING
ec2-75-101-255-226.compute-1.amazonaws.com          EMR HBase YCSB
COMPLETED      Start HBase
</pre>
<p></p>
<pre>
scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-75-101-255-226.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-75-101-255-226.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb
</pre>
<p></p>
<pre>
hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
</pre>
<p></p>
<pre>
./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log
</pre>
<p></p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log
</pre>
<p></p>
<p>10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 32121.0
[OVERALL], Throughput(ops/sec), 3113.228106223343
[UPDATE], Operations, 49973
[UPDATE], AverageLatency(us), 71.10029415884577
</pre>
<p></p>
<p>100 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 15076.0
[OVERALL], Throughput(ops/sec), 6633.059166887769
[UPDATE], Operations, 50167
[UPDATE], AverageLatency(us), 644.8327187194769
</pre>
<p></p>
<p>1k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 12864.0
[OVERALL], Throughput(ops/sec), 7773.63184079602
[UPDATE], Operations, 50240
[UPDATE], AverageLatency(us), 9889.390306528663
</pre>
<p></p>
<p>2k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 14889.0
[OVERALL], Throughput(ops/sec), 6716.367788300087
[UPDATE], Operations, 50216
[UPDATE], AverageLatency(us), 41222.41986617811
</pre>
<p></p>
<p>3k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-a-3kt.log
cat real-tests-workload-a-3kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 14461.0
[OVERALL], Throughput(ops/sec), 6845.9995850909345
[UPDATE], Operations, 49451
[UPDATE], AverageLatency(us), 51852.53568178601
</pre>
<p></p>
<p>5k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 5000 | tee real-tests-workload-a-5kt.log
cat real-tests-workload-a-5kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 17072.0
[OVERALL], Throughput(ops/sec), 5857.544517338331
[UPDATE], Operations, 49835
[UPDATE], AverageLatency(us), 82378.54861041436
</pre>
<p></p>
<p>10k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10000 | tee real-tests-workload-a-10kt.log
cat real-tests-workload-a-10kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 20226.0
[OVERALL], Throughput(ops/sec), 4944.131316127757
[UPDATE], Operations, 50113
[UPDATE], AverageLatency(us), 49147.25219005049
</pre>
<p></p>
<p>another workload, 10 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 40801.0
[OVERALL], Throughput(ops/sec), 2450.920320580378
[UPDATE], Operations, 49966
[UPDATE], AverageLatency(us), 12.13715326421967
</pre>
<p></p>
<p>400 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 400 | tee real-tests-workload-f-400t.log
cat real-tests-workload-f-400t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 17856.0
[OVERALL], Throughput(ops/sec), 5600.358422939068
[UPDATE], Operations, 50071
[UPDATE], AverageLatency(us), 14.301591739729584
</pre>
<p></p>
<p>500 threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-f-500t.log
cat real-tests-workload-f-500t.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 17909.0
[OVERALL], Throughput(ops/sec), 5583.784689262382
[UPDATE], Operations, 50210
[UPDATE], AverageLatency(us), 16.105915156343357
</pre>
<p></p>
<p>1k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 16982.0
[OVERALL], Throughput(ops/sec), 5888.5879166175955
[UPDATE], Operations, 50088
[UPDATE], AverageLatency(us), 15.313268647180962
</pre>
<p></p>
<p>2k threads</p>
<pre>
./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log
</pre>
<p></p>
<pre>
[OVERALL], RunTime(ms), 17219.0
[OVERALL], Throughput(ops/sec), 5807.538184563564
[UPDATE], Operations, 49989
[UPDATE], AverageLatency(us), 17.61469523295125
</pre>
<p></p>
<p>Even these simple scenarios let us see how, for a given configuration, the number of threads influences the throughput for each workload type:</p>
<ul>
<li>
workload a:<br />
<a href="http://dataminelab.com/wp-content/uploads/2012/09/workloada.png"><img src="http://dataminelab.com/wp-content/uploads/2012/09/workloada.png" alt="" title="workloada" width="754" height="409" class="alignnone size-full wp-image-532" srcset="https://dataminelab.com/wp-content/uploads/2012/09/workloada.png 754w, https://dataminelab.com/wp-content/uploads/2012/09/workloada-300x162.png 300w" sizes="(max-width: 754px) 100vw, 754px" /></a>
</li>
<li>
workload f:<br />
<a href="http://dataminelab.com/wp-content/uploads/2012/09/workloadf.png"><img src="http://dataminelab.com/wp-content/uploads/2012/09/workloadf.png" alt="" title="workloadf" width="755" height="434" class="alignnone size-full wp-image-534" srcset="https://dataminelab.com/wp-content/uploads/2012/09/workloadf.png 755w, https://dataminelab.com/wp-content/uploads/2012/09/workloadf-300x172.png 300w" sizes="(max-width: 755px) 100vw, 755px" /></a>
</li>
</ul>
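<p>The same comparison can be made numerically. A Python sketch that finds, for each configuration, the thread count with the highest workload a throughput, using figures copied from the logs in this post (rounded to two decimals); the labels and names are our own:</p>
<pre>
```python
# Workload a throughput (ops/sec) by thread count, copied from the logs in this post.
WORKLOAD_A = {
    "m1.large, 2 instances": {10: 2121.70, 100: 4126.43, 500: 4829.52,
                              1000: 4654.63, 2000: 4105.43},
    "m1.large, 5 instances": {10: 2346.92, 100: 4255.32, 500: 5059.96, 1000: 4993.01,
                              2000: 4423.21, 5000: 4022.36, 10000: 3946.95},
    "m1.xlarge, 5 instances": {10: 2532.86, 100: 5593.78, 1000: 7150.01,
                               2000: 6764.53, 3000: 6396.59, 4000: 6552.65},
    "c1.xlarge, 5 instances": {10: 3113.23, 100: 6633.06, 1000: 7773.63, 2000: 6716.37,
                               3000: 6846.00, 5000: 5857.54, 10000: 4944.13},
}


def peak(results):
    """Return (threads, throughput) for the best-performing thread count."""
    threads = max(results, key=results.get)
    return threads, results[threads]


if __name__ == "__main__":
    for config, results in WORKLOAD_A.items():
        threads, ops = peak(results)
        print("%-24s peak %.0f ops/sec at %d threads" % (config, ops, threads))
```
</pre>
<p>Among the thread counts that were actually tried, workload a peaks at 500 threads on both m1.large clusters and at 1000 threads on the m1.xlarge and c1.xlarge clusters, and is lower at every higher thread count.</p>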
<p>You can now experiment with other instance types and cluster sizes. You can also run the YCSB benchmark code from multiple nodes at once and look for possible saturation, whether of the master&#8217;s CPU or of the network layer.</p>
<p>We also invite you to play with the code or even contribute features and improvements, so that others can benefit from them too &#8211; have fun!</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/ycsb-run-against-hbase-0-92-on-amazon-elastic-mapreduce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BigData events</title>
		<link>https://dataminelab.com/blog/bigdata-events/</link>
		<comments>https://dataminelab.com/blog/bigdata-events/#comments</comments>
		<pubDate>Fri, 04 May 2012 13:03:28 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=470</guid>
		<description><![CDATA[We are seeing an explosion of BigData events. While half a year ago London hosted maybe one interesting meetup a month, nowadays there is rarely a week without a few of them. Supply is keeping up with demand. There is an increasing number of monthly meetups: BigData London, HUG UK, Data Science London, London R, Cassandra London, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>We are seeing an explosion of BigData events. While <a href="http://dataminelab.com/blog/bigdata-london-events/">half a year ago</a> London hosted maybe one interesting meetup a month, nowadays there is rarely a week without a few of them. Supply is keeping up with demand.</p>
<p>There is an increasing number of monthly meetups: <a href="http://www.meetup.com/big-data-london/">BigData London</a>, <a href="http://www.meetup.com/hadoop-users-group-uk/">HUG UK</a>, <a href="http://www.meetup.com/Data-Science-London/">Data Science London</a>, <a href="http://www.meetup.com/LondonR/">London R</a>, <a href="http://www.meetup.com/Cassandra-London/">Cassandra London</a>, <a href="http://skillsmatter.com/user-group/nosql/neo4j-user-group/">Neo4j London</a>, <a href="http://skillsmatter.com/user-group/home/mongodb-user-group/">London MongoDB User Group</a>, <a href="http://www.meetup.com/Oracle-UK-BigData/">Oracle BigData</a>, <a href="http://www.meetup.com/Data-Visualization-London/">Data Visualisation London</a>, <a href="http://www.meetup.com/BigData-Debate-and-Networking/">Big Data Debate</a>, <a href="http://www.meetup.com/DeNormalised-London/">DeNormalised London</a>, <a href="http://www.meetup.com/LonData/">LonData</a>, <a href="http://www.meetup.com/clouds/">CloudComputing</a>.</p>
<p>Upcoming conferences that are worth mentioning:</p>
<ul>
<li><a href="http://skillsmatter.com/event/nosql/progressive-nosql-tutorials/ac-4146">Skillsmatter Progressive NoSQL Tutorial</a> (9-11th May 2012)</li>
<li><a href="http://berlinbuzzwords.de/">Berlin Buzzwords</a> &#8211; only a couple of hours from London via Eurostar (4-5th June 2012)</li>
<li><a href="http://www.whitehallmedia.co.uk/bda/">Big Data Analytics 2012</a> (20th June 2012)</li>
<li><a href="http://www.bigdatasummit.co.uk/">Big Data Summit 2012</a> (28th June 2012)</li>
<li><a href="http://www.terrapinn.com/2012/big-data-world-europe/index.stm">Big Data World Europe</a> (19-20th September 2012)</li>
<li><a href="http://denormalised.com/denomormalised_conference_lond/">DeNormalised NoSQL Conference London</a> (20-21st September 2012)</li>
<li><a href="http://www.bigdatacongress.com/index.php">Big Data Congress</a> (7th November 2012)</li>
</ul>
<p>We have just had a <a href="http://bigdataweek.com/category/big-data-week/big-data-uk/">London BigData week</a> that was full of meetings and hackathons dedicated to Hadoop, visualisation and NoSQL. In case you missed the last Big Data week, you are in for a treat &#8211; simply <a href="https://www.facebook.com/dataminelab">like us on Facebook</a> to <strong>have a chance of winning one ticket (worth £495)</strong> for <a href="http://skillsmatter.com/event/nosql/progressive-nosql-tutorials/ac-4146">3 days of SkillsMatter NoSQL tutorials</a>.</p>
<p>There are also a few online places where data scientists can improve or challenge their skills:</p>
<ul>
<li><a href="https://www.coursera.org/course/ml">Stanford Machine Learning</a> &#8211; online Machine Learning course</li>
<li><a href="http://www.extension.harvard.edu/open-learning-initiative/bits-computer-science-course">Harvard Open Learning Initiative</a> &#8211; online computer science courses from Harvard</li>
<li><a href="http://www.apple.com/education/itunes-u/">iTunes U</a> &#8211; growing library of online courses</li>
<li><a href="http://bigdatauniversity.com/courses/">BigData University</a> &#8211; a few basic tutorials on BigData</li>
<li><a href="http://www.kaggle.com/">Kaggle</a> &#8211; test your skills and win prizes</li>
</ul>
<p>If you know of anything interesting coming up in London, let us know in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/bigdata-events/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R Analytics in the Cloud</title>
		<link>https://dataminelab.com/blog/r-analytics-in-the-cloud/</link>
		<comments>https://dataminelab.com/blog/r-analytics-in-the-cloud/#respond</comments>
		<pubDate>Mon, 21 Nov 2011 14:02:47 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=430</guid>
		<description><![CDATA[Last week I was invited to Big Data London to talk about &#8220;R Analytics in the Cloud&#8221;. As a case study, I presented the ageing project I&#8217;ve been working on as part of my Masters studies at Birkbeck, University of London. Ageing is one of the fundamental mysteries in biology and many scientists are already [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Last week I was invited to <a href="http://www.meetup.com/big-data-london/">Big Data London</a> to talk about &#8220;R Analytics in the Cloud&#8221;. As a case study, I presented the ageing project I&#8217;ve been working on as part of my Masters studies at Birkbeck, University of London. Ageing is one of the fundamental mysteries in biology and many scientists are already studying this process. I am excited to be part of the research group led by <a href="http://www.ucl.ac.uk/slms/people/show.php?UPI=ESCHU11">Eugene Schuster</a> at <a href="http://www.ucl.ac.uk/iha/">UCL Institute of Healthy Ageing</a>. This project has also given me the chance to use some of my Hadoop experience in the academic field.</p>
<p><a href="http://en.wikipedia.org/wiki/Bioinformatics">Bioinformatics</a> is the science of applying information technology to biology in order to understand the latter. There are numerous ways in which computers can aid biologists. In this particular project, we have been using <a href="http://en.wikipedia.org/wiki/DNA_microarray">microarrays</a> to find the connection between different genes. The use of microarray technologies has enabled us to detect changes to gene expression across the genome in thousands of experiments with hundreds of species. However, interpreting the changes identified in these experiments has been hampered by a lack of knowledge of the gene function.  Even in highly studied genomes, approximately 50-60% of genes will be assigned functions, yet less than 30% will be annotated with a highly specific function. Little of the annotation will have been observed in experiments conducted with the species of interest, as most gene function annotation is based on annotations assigned to orthologous genes taken from experiments done with other species, such as yeast and mammalian cell culture.</p>
<p>We are interested in building a better understanding of gene function in the worm C. elegans by harnessing the large quantity of experimental microarray data in the public database.  Currently, we have a database of over fifty curated experiments. With this, we attempt to assign putative functions to genes based on the expression profile across experiments in the public repositories. My role in this project is to help expand the number of curated experiments in the database and study the functions of approximately 1000 genes known to be regulated in long-lived worms, to try to understand the functions of these genes, e.g. by showing experimental evidence of a role in nutrient sensing, innate immunity or stress response. </p>
<p>Here are the slides from the presentation. Refer to slides 10 and 11 to see how to migrate your R application to the cloud in just 3 lines of code:</p>
<div style="width:425px" id="__ss_10206144"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/dataminelab/r-analytics-in-the-cloud" title="R Analytics in the Cloud" target="_blank">R Analytics in the Cloud</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/10206144" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe> </div>
<p>Oh, and did I mention how cool our lab is? Have a look at the following ad, which was made at UCL  just a couple of metres from my desk.<br />
<iframe width="560" height="315" src="http://www.youtube.com/embed/nuRdF46TKDs" frameborder="0" allowfullscreen></iframe></p>
<p>Full disclosure: DataMine Lab is in no way affiliated with Birkbeck or UCL and the above project is part of my individual bioinformatics studies.</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/r-analytics-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BigData London events (and a ticket giveaway)</title>
		<link>https://dataminelab.com/blog/bigdata-london-events/</link>
		<comments>https://dataminelab.com/blog/bigdata-london-events/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 17:21:21 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=367</guid>
		<description><![CDATA[There have never been as many data events advertised in London as we’d like, but the situation is slowly improving. This summer was particularly good for conferences and there are some interesting things scheduled in the next few months. A couple of events worth mentioning are: NoSQL eXchange (2 November, 2011) Predictive Analytics (30 November [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>There have never been as many data events advertised in London as we’d like, but the situation is slowly improving. This summer was particularly good for conferences and there are some interesting things scheduled in the next few months.</p>
<p>A couple of events worth mentioning are:</p>
<ul>
<li><a href="http://skillsmatter.com/event/nosql/nosql-big-data-exchange-2011">NoSQL eXchange</a> (2 November, 2011)</li>
<li><a href="http://www.predictiveanalyticsworld.com/london/2011/">Predictive Analytics</a> (30 November &#8211; 1 December 2011)</li>
</ul>
<p>NoSQL Exchange should offer an interesting overview of various NoSQL technologies, including MongoDB, Cassandra, CouchDB and Riak. Tom Wilkie (from <a href="http://www.acunu.com/">Acunu</a>) will give a tour of the future of NoSQL. DataMine Lab is a sponsor of <a href="http://skillsmatter.com/event-details/home/nosql-big-data-exchange-2011/js-2536">NoSQL Exchange</a> and we are giving away one <strong>free ticket</strong> (worth £195). Simply <a href="http://www.facebook.com/pages/DataMine-Lab/183095475081000">like us on Facebook</a> before the 21st of October for a chance to win that ticket. Every new follower within that timeframe will be entered into a drawing. </p>
<p>There are also some recurring events on the topics of information, big data and visualisation:</p>
<ul>
<li><a href="http://www.creativemornings.com/">Creative Mornings</a> &#8211; the last talk by David McCandless from <a href="http://www.informationisbeautiful.net/">Information is Beautiful</a> was a treat, and the next one looks great for data geeks as well.</li>
<li><a href="http://www.meetup.com/LondonQS/">Quantified Self</a> &#8211; an interesting take on analysing the data from your life.</li>
<li><a href="http://www.meetup.com/big-data-london/">BigData London</a> &#8211; a good mix of like-minded people working with a range of big data technologies.</li>
<li><a href="http://www.meetup.com/hadoop-users-group-uk/">HUG UK</a> &#8211; another Meetup group, this time focussed on Hadoop.</li>
<li><a href="http://www.meetup.com/Cassandra-London/">Cassandra London</a> &#8211; for users of Apache Cassandra.</li>
<li><a href="http://skillsmatter.com/user-group/nosql/neo4j-user-group/">Neo4j London</a> &#8211; a graph database group.</li>
<li><a href="http://skillsmatter.com/user-group/home/mongodb-user-group/">London MongoDB User Group</a> &#8211; featuring talks on MongoDB and NoSQL.</li>
</ul>
<p>If you&#8217;re as busy as us, you probably find it difficult to keep up with international events such as OSCON and Strata conferences. Fortunately, thanks to O&#8217;Reilly you can enjoy the talks even if you missed the conference in the first place. There&#8217;s plenty of interesting material on the website and it&#8217;s definitely worth a look:</p>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920022169.do">OSCON Data Sessions 2011</a></li>
<li><a href="http://shop.oreilly.com/product/0636920019954.do">Strata Conference 2011</a></li>
</ul>
<p>This list isn&#8217;t intended to be exhaustive and undoubtedly there are data events happening that we&#8217;re unaware of. If you know of anything interesting coming up in London, let us know in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/bigdata-london-events/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Calculating unique visitors in Hadoop and Hive</title>
		<link>https://dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/</link>
		<comments>https://dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 17:01:51 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=118</guid>
		<description><![CDATA[Unique visitors One of the most important website metrics is the number of unique visitors. However, it is also one of the most difficult to calculate. In this post, I will review a sampling strategy which produces a very good estimate of unique users, yet is computationally cheap. Non-additive data It is relatively easy to [&#8230;]]]></description>
				<content:encoded><![CDATA[<h2>Unique visitors</h2>
<p>One of the most important website metrics is the number of unique visitors. However, it is also one of the most difficult to calculate. In this post, I will review a sampling strategy which produces a very good estimate of unique users, yet is computationally cheap.</p>
<h2>Non-additive data</h2>
<p>It is relatively easy to calculate small numbers of unique visitors: all you need to do is perform a single SQL query.</p>
<p>To calculate the number of unique records in Hive, run the following:</p>
<p>[gist id=1159244]</p>
<p>However, once the number of records in the table &#8220;page_views&#8221; becomes very large, this query may result in <a href="http://en.wikipedia.org/wiki/Out_of_memory">OOM errors</a>. If this happens, there are other ways to calculate the exact number of unique visitors. Alternatively, it is possible to generate useful figures by using a sample.</p>
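<p>The reason unique visitors cannot simply be pre-aggregated is that the metric is non-additive: the same person may appear in several partial counts. The short Python sketch below illustrates this with made-up page views (the dates and visitor names are purely hypothetical): adding up the daily counts overstates the true total.</p>

```python
# Hypothetical page views as (day, visitor_id) pairs; the data is made up for illustration.
page_views = [
    ("2011-09-01", "alice"),
    ("2011-09-01", "bob"),
    ("2011-09-01", "alice"),
    ("2011-09-02", "bob"),
    ("2011-09-02", "carol"),
]


def unique_visitors(views):
    """Count distinct visitor ids in a list of (day, visitor_id) pairs."""
    return len({visitor for _, visitor in views})


# Collect the set of distinct visitors seen on each day.
daily = {}
for day, visitor in page_views:
    daily.setdefault(day, set()).add(visitor)

daily_counts = {day: len(visitors) for day, visitors in daily.items()}
sum_of_daily = sum(daily_counts.values())
overall = unique_visitors(page_views)

print(daily_counts)   # per-day uniques
print(sum_of_daily)   # 4: bob is counted on both days
print(overall)        # 3: the true number of distinct visitors
```

<p>This is why either the raw visitor identifiers or a sample of them has to be kept if you want to aggregate over arbitrary periods later.</p>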
<h2>Sampling</h2>
<p>In practice, estimating unique visitors from a sample gives results very close to the exact count. In our tests on tens of millions of records, the results came <strong>within 0.1% of real values</strong>. One thing to remember is to sample visitors, not page views, so that every page view of a selected visitor stays in the sample. The sampling method presented here is simple <a href="http://en.wikipedia.org/wiki/Bernoulli_sampling">Bernoulli sampling</a>.</p>
<p>Having a sample can sometimes be even more useful than calculating the exact  number. You can build a data warehouse around the sample and slice and  dice on unique visitors &#8212; something which cannot be done on  pre-calculated non-additive data. I will show at the end of this post  how to create a cube that can be used to visualise unique visitors  data.</p>
<h2>Hashing</h2>
<p>In order to sample users, we need to select roughly one in every n users at random from the population of records. One way to do this is to calculate a visitor hash for every record using a uniformly distributed hash function (such as MD5). MD5 is deterministic, so the same visitor always maps to the same hash, but its hexadecimal output is spread evenly enough that we can keep only those users whose hash ends with an arbitrary string, such as &#8217;00&#8217;. Since every hexadecimal digit is equally likely, the probability that a user hash ends with &#8216;0&#8217; is 1/16, and so the probability that it ends with &#8217;00&#8217; is 1/256.</p>
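<p>The idea can be sketched outside Hive in a few lines of Python. This is only an illustration under stated assumptions: the visitor ids are synthetic, and the helper names <code>in_sample</code> and <code>estimate_unique_visitors</code> are ours. A visitor is kept when the MD5 hex digest of their id ends with &#8217;00&#8217;, and the number of sampled visitors is scaled back up by 256.</p>

```python
import hashlib

SUFFIX = "00"                      # keep visitors whose hash ends with this string
SCALE = 16 ** len(SUFFIX)          # two hex digits, so about 1 in 256 visitors is kept


def in_sample(visitor_id):
    """True if the MD5 hex digest of the visitor id ends with SUFFIX."""
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    return digest.endswith(SUFFIX)


def estimate_unique_visitors(page_views):
    """Estimate distinct visitors from an iterable of visitor ids (one per page view)."""
    sampled = {v for v in page_views if in_sample(v)}
    return len(sampled) * SCALE


# Synthetic traffic: 200,000 hypothetical visitors, three page views each.
views = [f"visitor-{i}" for i in range(200_000) for _ in range(3)]
estimate = estimate_unique_visitors(views)
print(estimate, "estimated vs", 200_000, "actual")
```

<p>Because the decision depends only on the visitor id, repeated page views by the same visitor are either all kept or all dropped, which is exactly what sampling visitors rather than page views requires.</p>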
<p>Note that Hive (at the time of writing, version 0.7) does not implement an MD5 function, so feel free to use the following code to add an MD5 hash function to Hive:</p>
<p>[gist id=1050002]</p>
<p>Alternatively you may patch your Hive distribution with the code from the following ticket <a href="https://issues.apache.org/jira/browse/HIVE-1262">HIVE-1262</a>.</p>
<h2>HiveQL</h2>
<p>The following query will generate a unique visitors sample:</p>
<p>[gist id=1049918]</p>
<h2>Pentaho</h2>
<p>There are many other issues with unique visitors, such as how to present non-additive results to the end user. BI tools (such as <a href="http://mondrian.pentaho.com/">Pentaho Mondrian</a>) allow you to do this with the distinct aggregate function:</p>
<p>[gist id=1159282]</p>
<p>After loading the sample to your aggregate, the OLAP tools will allow you to report on it in a similar way to how you would report on standard additive data. See below:</p>
<div id="attachment_346" style="max-width: 725px" class="wp-caption alignnone"><a href="http://dataminelab.com/wp-content/uploads/2011/08/unique_visitors.png"><img class="size-full wp-image-346" title="Unique Visitors Cube" src="http://dataminelab.com/wp-content/uploads/2011/08/unique_visitors.png" alt="Unique Visitors Cube" width="715" height="328" srcset="https://dataminelab.com/wp-content/uploads/2011/08/unique_visitors.png 715w, https://dataminelab.com/wp-content/uploads/2011/08/unique_visitors-300x137.png 300w" sizes="(max-width: 715px) 100vw, 715px" /></a><p class="wp-caption-text">Unique Visitors Cube</p></div>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>We&#8217;re hiring!</title>
		<link>https://dataminelab.com/blog/hiring-software-developer/</link>
		<comments>https://dataminelab.com/blog/hiring-software-developer/#respond</comments>
		<pubDate>Mon, 01 Aug 2011 11:40:27 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=328</guid>
		<description><![CDATA[DataMine Lab is looking to hire a software developer. We are seeking a talented developer with excellent Java skills who shares our passion for the craft of data processing. You should have a good knowledge of algorithms and TDD, and be experienced in building real-world high traffic applications. We constantly test new technology, using solutions [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>DataMine Lab is looking to hire a software developer. We are seeking a talented developer with excellent Java skills who shares our passion for the craft of data processing.</p>
<p>You should have a good knowledge of algorithms and TDD, and be experienced in building real-world high traffic applications.</p>
<p>We constantly test new technology, using solutions such as:</p>
<ul>
<li> Hadoop / Hive / HBase</li>
<li> Amazon Cloud</li>
<li> Mahout</li>
<li><a href="/technology/">and many others</a></li>
</ul>
<p>Experience with Hadoop is an asset, but more important is your ability to quickly learn new technology, commitment to quality, attention to detail, and enthusiasm.</p>
<p><strong>How to apply</strong><br />
If you think you are right for this role, please send your CV to <a href="mailto:info@dataminelab.com?Subject=%5BDeveloper%5D">info@dataminelab.com</a> with the subject line [Developer], explaining why you would like to work with us.</p>
<p>We look forward to hearing from you!</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/hiring-software-developer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Behavioural targeting helps online advertising &#8211; our study confirms</title>
		<link>https://dataminelab.com/blog/behavioural-targeting-online/</link>
		<comments>https://dataminelab.com/blog/behavioural-targeting-online/#respond</comments>
		<pubDate>Tue, 05 Jul 2011 13:52:14 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=190</guid>
		<description><![CDATA[How can behavioural targeting help online advertising? I often asked myself this question while studying towards an MSc in Cognitive and Decision Sciences. This course covered a broad range of topics, ranging from computer science, AI and neuroscience, to psychology and philosophy. I was always tempted to see if I could use some of this [&#8230;]]]></description>
				<content:encoded><![CDATA[<h2>How can behavioural targeting help online advertising?</h2>
<p>I often asked myself this question while studying towards an MSc in <a href="http://www.ucl.ac.uk/lifesciences-faculty/degree-programmes/cognitive-decision-sciences">Cognitive and Decision Sciences</a>. This course covered a broad range of topics, ranging from computer science, AI and neuroscience, to psychology and philosophy. I was always tempted to see if I could use some of this research in our work.</p>
<p>You may ask what cognitive science has in common with online advertising. The answer is quite a lot, as some of <a href="/customers">our customers</a> have already noticed. It can help to better understand the interests of online visitors and their decision processes, and can ultimately help website owners to serve more relevant web content and advertising. I decided to explore this question in more detail during my MSc thesis.</p>
<p>This project was motivated by a paper published by scientists from Microsoft Research (Yan et al., 2009) who found that behavioural targeting in search advertising could yield up to a 670% increase in the overall CTR (click-through ratio). We performed a systematic study of the clickstream logs of a commercial ad network and found that the overall CTR could be increased by as much as 909%.</p>
<p>This study considered the impact of behavioural targeting techniques on online display advertising. Specifically, we investigated whether simulating delivery of traffic to chosen clusters of users would increase the overall CTR of all ads. We examined the data using different evaluation metrics, such as user similarity, precision, recall and F-measure, then we used the t-test to confirm the significance of the results. The experimental design was implemented with the help of scalable <a href="/services">data mining</a> libraries, which allowed a successful analysis of the large body of data.</p>
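<p>For readers unfamiliar with the evaluation metrics mentioned above, here is a minimal Python sketch of precision, recall and F-measure using their standard textbook definitions. The post does not spell out how the thesis computes them, so treat the function and the tiny example sets (<code>targeted</code> users versus users who <code>clicked</code>) as illustrative assumptions rather than the study&#8217;s actual code.</p>

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F-measure for two sets of user ids."""
    true_positives = len(predicted.intersection(actual))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four users were targeted with an ad, three users clicked.
targeted = {"u1", "u2", "u3", "u4"}
clicked = {"u2", "u4", "u5"}
p, r, f = precision_recall_f1(targeted, clicked)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```

<p>In this toy case two of the four targeted users clicked (precision 0.5) and two of the three clickers were targeted (recall of about 0.667), giving an F-measure of about 0.571.</p>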
<p>You may download the paper here (the source code is included): <a href="/wp-content/uploads/2011/07/MSc-Thesis-How-much-BT-can-help-online-advertising.pdf">MSc Thesis &#8211; How much behavioural targeting can help online advertising</a></p>
<p>I used many data technologies in this project, starting with <a href="/services">Hadoop and Hive</a> and <a href="http://aws.amazon.com/ec2/">Amazon Cloud</a>. All programming was done with Java and Python. The biggest challenge was posed by the amount of data which needed to be clustered, but thanks to <a href="http://mahout.apache.org/">Apache Mahout</a> this project was finished in less than a month.</p>
<h2>Abstract</h2>
<p>Online advertising has exploded during the past few years; the current UK market (as of the middle of 2010) is valued at more than £3.5 billion. Such advertising grew dramatically &#8212; by about 2200% &#8212; during the 2000s. Behavioural targeting (BT) is widely regarded as one of the most effective techniques for optimizing online advertising. However, despite the impressive numbers involved in this industry, there are only a few academic studies performed on real-world click-stream data (e.g. Yan, Liu, Wang, Zhang, Jiang &amp; Chen 2009; Ratnaparkhi 2010; Chen, Pavlov, &amp; Canny 2009). This may be linked to the extreme demands on system resources caused by the huge amount of advertising data available.</p>
<p>Yan et al. (2009) confirmed that BT could significantly increase the effectiveness of one specific type of online advertising (so-called search advertising). In this work we investigate whether techniques linked to BT may be beneficial to online display advertising. Using data from a major commercial ad network, we show that a simple BT technique (such as user clustering) could improve click-through ratio by more than 900%.</p>
<p>Furthermore, from a software engineering perspective, we provide support for using distributed open source technologies to tackle the complex analysis of advertising data.</p>
<h2>References</h2>
<ul>
<li>Yan, J., Liu, N., Wang, G., Zhang, W., Jiang, Y., &amp; Chen, Z. (2009). How much can Behavioural Targeting Help Online Advertising? Proceedings of the 18th international conference on World Wide Web. Madrid, Spain.</li>
<li>Ratnaparkhi, A. (2010). Finding predictive search queries for behavioral targeting. In ADKDD’10, The 4th International Workshop on Data Mining and Audience Intelligence for Advertising.</li>
<li>Chen, Y., Pavlov, D., &amp; Canny, J.F. (2009). Large-scale behavioral targeting. In KDD &#8217;09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 209-218.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/behavioural-targeting-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New smart look</title>
		<link>https://dataminelab.com/blog/new-smart-look/</link>
		<comments>https://dataminelab.com/blog/new-smart-look/#respond</comments>
		<pubDate>Mon, 20 Jun 2011 15:09:12 +0000</pubDate>
		<dc:creator><![CDATA[Radek Maciaszek]]></dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://dataminelab.com/?p=64</guid>
		<description><![CDATA[Welcome to our new website. DataMine Lab is a new kind of software consulting company that works with large data projects. Traditionally, analysing big data required big investments. Times have changed, and thanks to cloud computing and mature open source technology we can now do this on a budget and in record [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Welcome to our new website.</p>
<p>DataMine Lab is a new kind of software consulting company that works with large data projects. Traditionally, analysing big data required big investments. Times have changed, and thanks to cloud computing and mature open source technology we can now do this on a budget and in record time. We love open source software and use it to our clients&#8217; advantage.</p>
<p>We would love to know your opinion about our website; please let us know what you think at <a href="mailto:info@dataminelab.com">info@dataminelab.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>https://dataminelab.com/blog/new-smart-look/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
