<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><!-- name="generator" content="SnipSnap/1.0b3-uttoxeter" --><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:blogChannel="http://backend.userland.com/blogChannelModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

  <channel>
    <title>thinkberg</title>
    
    <link>http://thinkberg.com/space/start</link>
    <description />
    <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>start</dc:title>
<dc:identifier>http://thinkberg.com/space/start</dc:identifier>
<dc:date>2007-09-06T02:23:40+01:00</dc:date>
<dc:language>en</dc:language>

    <!-- <blogChannel:changes>http://www.weblogs.com/rssUpdates/changes.xml</changes> -->
    <admin:generatorAgent rdf:resource="http://www.snipsnap.org/space/version-1.0b3-uttoxeter" />
    
       <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/thinkberg" /><feedburner:info uri="thinkberg" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
        <title>Japanese TWIMPACT beta</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/xYCwz8GrMTQ/1</link>
        <description>We are now running a beta site for TWIMPACT for Japan only. It only works with Japanese tweets, and it works quite well. What is interesting is the battle between the former politician 555hamako and masason, the president of Softbank, a large telecommunications company in Japan. First, masason started out in December 2009 with a quick rise to the top. Then 555hamako followed at the beginning of 2010 (the year of elections) with an even steeper rise to take the crown. It also looks like masason has not managed to attract an audience of the same size as before; his TWIMPACT stalls a little at the end. Maybe he was just watching the Winter Olympics. I guess we will start adapting the TWIMPACT rating to degrade over time to give a better view of the current impact a user has. At the moment users keep their TWIMPACT, even though it is hard to keep on rising. This effect is much more visible on the global site, where you can find a lot of not-so-spammy-spam twitterers that rise quickly and should fall again over time after they hit the ceiling.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2010-03-04/1#Japanese_TWIMPACT_beta</guid>
        <content:encoded><![CDATA[<a href="/space/start/2010-03-04/1/555hamako_vs_masason.png"><img src="http://thinkberg.com/space/start/2010-03-04/1/555hamako_vs_masason_small.png" alt="masason vs. 555hamako impact (click to enlarge)" class="float-right" border="0"/></a><p class="paragraph"/>We are now running a beta site for TWIMPACT for Japan only. It only works with Japanese tweets, and it works quite well. What is interesting is the battle between the former politician <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.jp/user/555hamako">555hamako</a></span> and <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.jp/user/masason">masason</a></span>, the president of <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://softbank.jp">Softbank</a></span>, a large telecommunications company in Japan.<p class="paragraph"/>First, <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.jp/user/masason">masason</a></span> started out in December 2009 with a quick rise to the top. Then <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.jp/user/555hamako">555hamako</a></span> followed at the beginning of 2010 (the year of elections) with an even steeper rise to take the crown. It also looks like masason has not managed to attract an audience of the same size as before; his TWIMPACT stalls a little at the end. 
Maybe he was just watching the Winter Olympics.<p class="paragraph"/>I guess we will start adapting the <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">TWIMPACT</a></span> rating to degrade over time to give a better view of the current impact a user has. At the moment users keep their TWIMPACT, even though it is hard to keep on rising. This effect is much more visible on the <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">global site</a></span>, where you can find a lot of not-so-spammy-spam twitterers that rise quickly and should fall again over time after they hit the ceiling.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=xYCwz8GrMTQ:q0XPEAalwnY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/xYCwz8GrMTQ" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>Japanese TWIMPACT beta</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2010-03-04/1#Japanese_TWIMPACT_beta</dc:identifier>
<dc:date>2010-03-04T11:32:12+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2010-03-04/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2010-03-04/1#Japanese_TWIMPACT_beta</feedburner:origLink></item>
    
       <item>
        <title>NoSQL: MongoDB performance testing (part 2: counting)...</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/DP8Y5rWEEvk/1</link>
        <description>After my insert tests last time I decided to look at some count queries, as we count a lot at twimpact.com. As a first result I can say that without an index, counting makes no sense with a database of this size. I used the database left over from my last insert test and added a few indexes, which took around 30-40 minutes per index. I did not measure the indexing time in more detail, as we tend to create the index while working on the database anyway. Now for today's results. The queries are quite simple, but practical in our case. I get a cursor for 1,000,000 documents as the result of a simple query and, for each document, count the number of documents that share its value for one property:
def cursor = db.find().limit(1000000)
// alternative: query one of the indexed properties
// def cursor = db.find(new BasicDBObject("property", new BasicDBObject("&amp;#92;$ne", null))).limit(1000000)
cursor.each &amp;#123; doc &amp;#45;&amp;#62;
  def value = doc.get("property")
  def count = db.getCount(new BasicDBObject("property", value))
&amp;#125;
The time was taken for each of the "db.getCount()" calls, and it turns out that around 40-50% of all queries have a negligible query time (&amp;#60; 1ms), which is the smallest time frame I can measure right now. This needs to be taken into account when evaluating the graphs, as they only show the queries with a duration of at least 1ms (log scale plot). In the plot you see query time versus the result of getCount(). As expected, higher counts may take longer. Some explanation is necessary for the plots. random means that I get some documents and count one of the properties (the same for all documents). I do not know the order in which the documents come, so they are unrelated to the property I am counting. correlated means that I query the documents using an index and then count the property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I previously queried all documents having a non-null property value. This holds true for the long index but not for the string index. The latter behaves about the same as my random counts. The results show that count queries are very fast, but only if indexed. What we also need for twimpact.com are some more advanced queries. I assume that the results for those also depend on how we design our documents to fit our needs. The design will take some time, and I will get back with results of design and advanced queries at a later date.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-10-01/1#NoSQL:_MongoDB_performance_testing_(part_2:_counting)...</guid>
        <content:encoded><![CDATA[After my insert tests last time I decided to look at some count queries, as we count a lot at <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">twimpact.com</a></span>. As a first result I can say that without an index, counting makes no sense with a database of this size.<p class="paragraph"/>I used the database left over from my last insert test and added a few indexes, which took around 30-40 minutes per index. I did not measure the indexing time in more detail, as we tend to create the index while working on the database anyway.<p class="paragraph"/>Now for today's results. The queries are quite simple, but practical in our case. I get a cursor for 1,000,000 documents as the result of a simple query and, for each document, count the number of documents that share its value for one property:<p class="paragraph"/><div class="code"><pre>def cursor = db.find().limit(1000000)
// alternative: query one of the indexed properties
// def cursor = db.find(<span class="java&#45;keyword">new</span> BasicDBObject(<span class="java&#45;quote">"property"</span>, <span class="java&#45;keyword">new</span> BasicDBObject(<span class="java&#45;quote">"&#92;$ne"</span>, <span class="java&#45;keyword">null</span>))).limit(1000000)<p class="paragraph"/>cursor.each &#123; doc &#45;&#62;
  def value = doc.get(<span class="java&#45;quote">"property"</span>)
  def count = db.getCount(<span class="java&#45;keyword">new</span> BasicDBObject(<span class="java&#45;quote">"property"</span>, value))
&#125;</pre></div><p class="paragraph"/><a href="/space/start/2009-10-01/1/mongo.test.count.png"><img src="http://thinkberg.com/space/start/2009-10-01/1/mongo.test.count.small.png" alt="MongoDB query test (click to enlarge)" class="float-right" border="0"/></a><p class="paragraph"/>The time was taken for each of the <i class="italic">"db.getCount()"</i> calls, and it turns out that around 40-50% of all queries have a negligible query time (&#60; 1ms), which is the smallest time frame I can measure right now. This needs to be taken into account when evaluating the graphs, as they only show the queries with a duration of at least 1ms (log scale plot).<p class="paragraph"/>In the plot you see query time versus the result of getCount(). As expected, higher counts may take longer.<p class="paragraph"/>Some explanation is necessary for the plots. <b class="bold">random</b> means that I get some documents and count one of the properties (the same for all documents). I do not know the order in which the documents come, so they are unrelated to the property I am counting. <b class="bold">correlated</b> means that I query the documents using an index and then count the property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I previously queried all documents having a non-null property value.<p class="paragraph"/>This holds true for the <i class="italic">long</i> index but not for the <i class="italic">string</i> index. The latter behaves about the same as my random counts.<p class="paragraph"/>The results show that count queries are very fast, but only if indexed.<p class="paragraph"/>What we also need for <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">twimpact.com</a></span> are some more advanced queries. 
I assume that the results for those also depend on how we design our documents to fit our needs. The design will take some time, and I will get back with results of design and advanced queries at a later date.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=DP8Y5rWEEvk:s2w--gb10nw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/DP8Y5rWEEvk" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>NoSQL: MongoDB performance testing (part 2: counting)...</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-10-01/1#NoSQL:_MongoDB_performance_testing_(part_2:_counting)...</dc:identifier>
<dc:date>2009-10-01T11:17:30+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-10-01/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-10-01/1#NoSQL:_MongoDB_performance_testing_(part_2:_counting)...</feedburner:origLink></item>
    
       <item>
        <title>NoSQL : MongoDB performance testing (part 1: insert)...</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/yO7FkARKbK4/1</link>
        <description>The twimpact.com project currently uses PostgreSQL. This works well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance, but I can foresee that the amount of incoming data will slow the application down again. That is one reason I am looking at non-SQL alternatives. The list includes redis, the Cassandra Project and MongoDB. I admit I only looked briefly at redis, because it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver was awkward, and in the end I had MongoDB up and running in no time. The setup is as follows:

4GB MacBook, 2.4GHz Intel Core 2 Duo, slow disk
MongoDB: mongodb-osx-x86_64-2009-09-19
(I had to work in parallel, so there might have been some swapping)
Currently the database on a remote server holds about 38,000,000 tweets. At the start of my testing it contained about 35,000,000. The procedure for the insert test was to copy over batches of 10,000 tweets, as the following pseudo code shows:
// initialize MongoDB (started with a completely new one for each test)
def db = new Mongo("twimpact")
DBCollection coll = db.getCollection("twimpact");
// coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
// coll.createIndex(new BasicDBObject("from_user", 1))  // short string index
def offset = 0
def limit = 10000
def rowCount = sql.count("tweets")
while(offset &amp;#60; rowCount) &amp;#123;
  // get batch of tweets from PostgreSQL server
  def data = sql.rows("SELECT &amp;#42; FROM tweets OFFSET $&amp;#123;offset&amp;#125; LIMIT $&amp;#123;limit&amp;#125;")
  // convert each row into a document and insert
  data.each &amp;#123; row &amp;#45;&amp;#62;
    BasicDBObject info = new BasicDBObject();
    row.each &amp;#123; key, value &amp;#45;&amp;#62;
      info.put(key, value);
    &amp;#125;
    coll.insert(info);
  &amp;#125;
  offset += data.size()
&amp;#125;
The time was measured for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the bulk insert test, the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:
&amp;#8230;
  DBObject&amp;#91;&amp;#93; bulk = new DBObject&amp;#91;5000&amp;#93;
  &amp;#8230; loop &amp;#8230;
  // two times as 10000 was too big for the driver
  coll.insert(bulk)
...
The documents we created were not that big, but their structure has real-world relevance for us. They might be changed to adapt to the schema-less world, though. Here is a good example:
&amp;#123;
  "id": 3551935825,
  "user_id": 1657468,
  "retweet_id": 15965974 ,
  "from_user": "thinkberg", 
  "from_user_id": 6190551, 
  "to_user": null , 
  "to_user_id": null, 
  "text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain&amp;#45;Computer Interface &amp;#104;ttp://bit.ly/3fg9OG", 
  "iso_language_code": "en", 
  "source": "&amp;#60;a href=&amp;#34;&amp;#104;ttp://adium.im&amp;#34; rel=&amp;#34;nofollow&amp;#34;&amp;#62;Adium&amp;#60;/a&amp;#62;", 
  "created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
  "updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
  "version": 0,
  "retweet_user_id": null
&amp;#125;
And now for the results. As expected, performance degrades once the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process. The first insert test did not create or update any index, so performance is sustained over the whole run. There are noticeable dips, which probably happened whenever I unlocked the laptop or switched from one application to another. Looking at the insert with a number (long) index, the performance degrades slightly and stabilizes shortly after about 20,000,000 inserts. I guess this might be the point where the RAM shortage comes into play, as you can see similar behavior in the string and bulk/string index tests. Bulk inserting gave a dramatic performance boost. Unfortunately I had to insert each batch in two bulks of 5,000 tweets each, as the driver reported that the object was "too big" when using an array of 10,000 tweets. While single inserts stabilize at around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s. Looking at where the insert performance started and where it ended might lead you to conclude that this is going to be slow, but given my experience with a much smaller PostgreSQL database (~4,000,000 tweets) on this laptop, I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at twimpact.com, where we accumulate an analyzer backlog. Given that this test was performed on my laptop and not on a production system, real-world performance can be expected to look much better :-) But inserting is not everything, even though it is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-09-25/1#NoSQL_:_MongoDB_performance_testing_(part_1:_insert)...</guid>
        <content:encoded><![CDATA[The <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">twimpact.com</a></span> project currently uses <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.postgresql.org">PostgreSQL</a></span>. This works well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance, but I can foresee that the amount of incoming data will slow the application down again.<p class="paragraph"/>That is one reason I am looking at non-SQL alternatives. The list includes <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://code.google.com/p/redis/">redis</a></span>, the <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://incubator.apache.org/cassandra/">Cassandra Project</a></span> and <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.mongodb.org">MongoDB</a></span>.<p class="paragraph"/>I admit I only looked briefly at redis, because it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver was awkward, and in the end I had MongoDB up and running in no time.<p class="paragraph"/>The setup is as follows:
<ul class="star">
<li>4GB MacBook, 2.4GHz Intel Core 2 Duo, slow disk</li>
<li>MongoDB: mongodb-osx-x86_64-2009-09-19</li>
<li>(I had to work in parallel, so there might have been some swapping)</li>
</ul>Currently the database on a remote server holds about 38,000,000 tweets. At the start of my testing it contained about 35,000,000. The procedure for the <b class="bold">insert</b> test was to copy over batches of 10,000 tweets, as the following pseudo code shows:<p class="paragraph"/><div class="code"><pre>// initialize MongoDB (started with a completely <span class="java&#45;keyword">new</span> one <span class="java&#45;keyword">for</span> each test)
def db = <span class="java&#45;keyword">new</span> Mongo(<span class="java&#45;quote">"twimpact"</span>)
DBCollection coll = db.getCollection(<span class="java&#45;quote">"twimpact"</span>);
// coll.createIndex(<span class="java&#45;keyword">new</span> BasicDBObject(<span class="java&#45;quote">"retweet_id"</span>, 1)) // <span class="java&#45;object">long</span> index
// coll.createIndex(<span class="java&#45;keyword">new</span> BasicDBObject(<span class="java&#45;quote">"from_user"</span>, 1))  // <span class="java&#45;object">short</span> string index<p class="paragraph"/>def offset = 0
def limit = 10000
def rowCount = sql.count(<span class="java&#45;quote">"tweets"</span>)<p class="paragraph"/><span class="java&#45;keyword">while</span>(offset &#60; rowCount) &#123;
  // get batch of tweets from PostgreSQL server
  def data = sql.rows(<span class="java&#45;quote">"SELECT &#42; FROM tweets OFFSET $&#123;offset&#125; LIMIT $&#123;limit&#125;"</span>)
  // convert each row into a document and insert
  data.each &#123; row &#45;&#62;
    BasicDBObject info = <span class="java&#45;keyword">new</span> BasicDBObject();
    row.each &#123; key, value &#45;&#62;
      info.put(key, value);
    &#125;
    coll.insert(info);
  &#125;
  offset += data.size()
&#125;</pre></div><p class="paragraph"/>The time was measured for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the bulk insert test, the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:<p class="paragraph"/><div class="code"><pre>&#8230;
  DBObject&#91;&#93; bulk = <span class="java&#45;keyword">new</span> DBObject&#91;5000&#93;
  &#8230; loop &#8230;
  // two times as 10000 was too big <span class="java&#45;keyword">for</span> the driver
  coll.insert(bulk)
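  // alternative (a sketch, not part of the original test): let Groovy split the batch into
  // driver-sized chunks; "docs" is a hypothetical List of converted documents and
  // collate() assumes Groovy 1.8.6 or later
  // docs.collate(5000).each &#123; chunk &#45;&#62; coll.insert(chunk) &#125;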
...</pre></div><p class="paragraph"/>The documents we created were not that big, but their structure has real-world relevance for us. They might be changed to adapt to the schema-less world, though. Here is a good example:<p class="paragraph"/><div class="code"><pre>&#123;
  <span class="java&#45;quote">"id"</span>: 3551935825,
  <span class="java&#45;quote">"user_id"</span>: 1657468,
  <span class="java&#45;quote">"retweet_id"</span>: 15965974 ,
  <span class="java&#45;quote">"from_user"</span>: <span class="java&#45;quote">"thinkberg"</span>, 
  <span class="java&#45;quote">"from_user_id"</span>: 6190551, 
  <span class="java&#45;quote">"to_user"</span>: <span class="java&#45;keyword">null</span> , 
  <span class="java&#45;quote">"to_user_id"</span>: <span class="java&#45;keyword">null</span>, 
  <span class="java&#45;quote">"text"</span>: <span class="java&#45;quote">"RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain&#45;Computer Interface <img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><span class="nobr"><a href="http://bit.ly/3fg9OG">&#104;ttp://bit.ly/3fg9OG</a></span>"</span>, 
  <span class="java&#45;quote">"iso_language_code"</span>: <span class="java&#45;quote">"en"</span>, 
  <span class="java&#45;quote">"source"</span>: <span class="java&#45;quote">"&#60;a href=&#34;<img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><span class="nobr"><a href="http://adium.im&#38;#34;">&#104;ttp://adium.im&#34;</a></span> rel=&#34;nofollow&#34;&#62;Adium&#60;/a&#62;"</span>, 
  <span class="java&#45;quote">"created_at"</span>: <span class="java&#45;quote">"Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)"</span>,
  <span class="java&#45;quote">"updated_at"</span>: <span class="java&#45;quote">"Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)"</span>,
  <span class="java&#45;quote">"version"</span>: 0,
  <span class="java&#45;quote">"retweet_user_id"</span>: <span class="java&#45;keyword">null</span>
&#125;</pre></div><p class="paragraph"/>And now for the results. As expected, performance degrades once the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.<p class="paragraph"/><a href="/space/start/2009-09-25/1/mongo.stat.png"><img src="http://thinkberg.com/space/start/2009-09-25/1/mongo.stat.small.png" alt="mongo.stat.small" class="float-right" border="0"/></a><p class="paragraph"/>The first insert test did not create or update any index, so performance is sustained over the whole run. There are noticeable dips, which probably happened whenever I unlocked the laptop or switched from one application to another.<p class="paragraph"/>Looking at the insert with a number (long) index, the performance degrades slightly and stabilizes shortly after about 20,000,000 inserts. I guess this might be the point where the RAM shortage comes into play, as you can see similar behavior in the string and bulk/string index tests.<p class="paragraph"/>Bulk inserting gave a dramatic performance boost. Unfortunately I had to insert each batch in two bulks of 5,000 tweets each, as the driver reported that the object was "too big" when using an array of 10,000 tweets. While single inserts stabilize at around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.<p class="paragraph"/>Looking at where the insert performance started and where it ended might lead you to conclude that this is going to be slow, but given my experience with a much smaller PostgreSQL database (~4,000,000 tweets) on this laptop, I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">twimpact.com</a></span>, where we accumulate an analyzer backlog. 
Given that this test was performed on my laptop and not on a production system, real-world performance can be expected to look much better :-)<p class="paragraph"/>But inserting is not everything, even though it is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=yO7FkARKbK4:N4TLtf2GlsQ:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/yO7FkARKbK4" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>NoSQL : MongoDB performance testing (part 1: insert)...</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-09-25/1#NoSQL_:_MongoDB_performance_testing_(part_1:_insert)...</dc:identifier>
<dc:date>2009-09-25T09:15:55+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-09-25/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-09-25/1#NoSQL_:_MongoDB_performance_testing_(part_1:_insert)...</feedburner:origLink></item>
    
       <item>
        <title>twimpact.com - trends by citation</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/wJl9Swq77ts/1</link>
        <description>It feels good to code a little again. Again, social software, but this time from the analysis point of view. Check out twimpact.com to see the trends of the last hour bubble up. All done in Grails, which I love.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-07-29/1#twimpact.com_-_trends_by_citation</guid>
        <content:encoded><![CDATA[It feels good to code a little again. Again, social software, but this time from the analysis point of view. Check out <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://twimpact.com">twimpact.com</a></span> to see the trends of the last hour bubble up.<p class="paragraph"/>All done in <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.grails.org">Grails</a></span>, which I love.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=wJl9Swq77ts:_pdEmgydak4:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/wJl9Swq77ts" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>twimpact.com - trends by citation</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-07-29/1#twimpact.com_-_trends_by_citation</dc:identifier>
<dc:date>2009-07-29T08:17:33+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-07-29/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-07-29/1#twimpact.com_-_trends_by_citation</feedburner:origLink></item>
    
       <item>
        <title>Re-use replaced backup harddisks</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/O539WCsF6Mo/1</link>
        <description>Now you have a RAID system. It runs perfectly, but it also fills up, as all storage does over time. You buy new 1.5TB hard disks, replacing the old 500GB ones. Now what do you do with the old ones? They are still perfectly healthy disks. Well, you buy an external SATA dock! Then you can do off-RAID backups to those disks. They will probably last longer than your DVD backups.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-05-01/1#Re-use_replaced_backup_harddisks</guid>
        <content:encoded><![CDATA[Now you have a RAID system. It runs perfectly, but it also fills up, as all storage does over time. You buy new 1.5TB hard disks, replacing the old 500GB ones. Now what do you do with the old ones? They are still perfectly healthy disks.<p class="paragraph"/>Well, you buy an <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.sharkoon.com/html/produkte/externe_gehaeuse/sata_quickport_pro/index_en.html">external SATA dock</a></span>!<p class="paragraph"/>Then you can do off-RAID backups to those disks. They will probably last longer than your DVD backups.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=O539WCsF6Mo:_-LU3jsbBcg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/O539WCsF6Mo" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>Re-use replaced backup harddisks</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-05-01/1#Re-use_replaced_backup_harddisks</dc:identifier>
<dc:date>2009-05-01T10:17:56+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-05-01/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-05-01/1#Re-use_replaced_backup_harddisks</feedburner:origLink></item>
    
       <item>
        <title>The next Backup iteration</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/gifyB9j5D1Y/1</link>
        <description>Finally I have a backup strategy for my server too. It is not perfect, but it works for me. It even covers some data from my home RAID system, and vice versa. The data is backed up to two different offsite locations (rsync.net and Amazon S3) and additionally to the RAID. Some data, like photos, is transferred from the RAID to the server and from there to Amazon S3. All laptops back up to the RAID; that is too much data to store at either offsite location, price-wise. All data transfers are encrypted. The data files are encrypted at both offsite locations, but not on the RAID, for easy access.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-04-25/1#The_next_Backup_iteration</guid>
        <content:encoded><![CDATA[Finally I have a backup strategy for my server too. It is not perfect, but it works for me. It even covers some data from my home RAID system, and vice versa. The data is backed up to two different offsite locations (<span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://resync.net">rsync.net</a></span> and <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://aws.amazon.com/s3">Amazon S3</a></span>) and additionally to the RAID. Some data, like photos, is transferred from the RAID to the server and from there to Amazon S3. All laptops back up to the RAID; that is too much data to store at either offsite location, price-wise.<p class="paragraph"/>All data transfers are encrypted. The data files are encrypted at both offsite locations, but not on the RAID, for easy access.<p class="paragraph"/><img src="http://thinkberg.com/space/start/2009-04-25/1/backup.png" alt="backup" class="middle" border="0"/><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=gifyB9j5D1Y:msNLnMvfIaQ:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/gifyB9j5D1Y" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>The next Backup iteration</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-04-25/1#The_next_Backup_iteration</dc:identifier>
<dc:date>2009-04-25T22:23:54+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-04-25/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-04-25/1#The_next_Backup_iteration</feedburner:origLink></item>
    
       <item>
        <title>Twitter - what?</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/r5liOvBi1XU/1</link>
        <description>In contrast to my last post, I am using twitter now, more to tell the world what we do than what I personally do in my leisure time. For me that is the only valid use: giving an idea of what's happening in research.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-03-27/1#Twitter_-_what?</guid>
        <content:encoded><![CDATA[In contrast to my last post, I am using twitter now, more to tell the world what we do than what I personally do in my leisure time. For me that is the only valid use: giving an idea of what's happening in research.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=r5liOvBi1XU:rzlylNV0uAw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/r5liOvBi1XU" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>Twitter - what?</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-03-27/1#Twitter_-_what?</dc:identifier>
<dc:date>2009-03-27T10:46:45+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-03-27/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-03-27/1#Twitter_-_what?</feedburner:origLink></item>
    
       <item>
        <title>New Job - Industry Liaison Manager</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/BGTdGSg0yPA/1</link>
        <description>I have changed jobs and moved away from the Fraunhofer Society to take a post as Industry Liaison Manager for a Machine Learning and Neurotechnology Research group at the Berlin Institute of Technology. My main focus will now be to manage our industry relations, organize talks and seminars, and work on technology transfer. The research project works on non-invasive neurotechnology to improve sensors and data analysis and to apply the results in neuro-usability and other applications related to man-machine interaction. This is going to be a challenging and most interesting job.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-03-23/1#New_Job_-_Industry_Liaison_Manager</guid>
        <content:encoded><![CDATA[I have changed jobs and moved away from the <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.fraunhofer.de/">Fraunhofer Society</a></span> to take a post as <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.user.tu-berlin.de/matthias.jugel">Industry Liaison Manager</a></span> for a <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.ml.tu-berlin.de">Machine Learning</a></span> and <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://bbci.de">Neurotechnology Research</a></span> group at the Berlin Institute of Technology.<p class="paragraph"/>My main focus will now be to manage our industry relations, organize talks and seminars, and work on technology transfer. The research project works on <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.bfnt-berlin.de">non-invasive neurotechnology</a></span> to improve sensors and data analysis and to apply the results in neuro-usability and other applications related to man-machine interaction.<p class="paragraph"/>This is going to be a challenging and most interesting job.<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/thinkberg?a=BGTdGSg0yPA:1dJ7_6AkVtY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/thinkberg?d=yIl2AUoC8zA" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/BGTdGSg0yPA" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>New Job - Industry Liaison Manager</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-03-23/1#New_Job_-_Industry_Liaison_Manager</dc:identifier>
<dc:date>2009-03-23T13:53:12+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-03-23/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-03-23/1#New_Job_-_Industry_Liaison_Manager</feedburner:origLink></item>
    
       <item>
        <title>Amazon S3 / WebDAV proxy updated</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/ZuNJqDzSRY4/1</link>
        <description>I took the liberty of checking out my old code and working on it to finally fix some of the problems. It now uses the last-modified time correctly, and cache handling as well as lazy download from S3 are implemented. To really work with the server it will need better cache handling. After many tests, the basic and copymove tests finally run through repeatedly without failure. Still a long way to go. Update (2009-01-28): In the meantime I implemented the property handling, which only fails for some strange UTF-8 property values. Now 99% of the litmus tests run through. Testing with the Mac OS X Finder looks promising.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-01-23/1#Amazon_S3_/_WebDAV_proxy_updated</guid>
        <content:encoded><![CDATA[I took the liberty of checking out <a href="http://thinkberg.com/space/code/Moxo+S3+DAV+Proxy">my old code</a> and working on it to finally fix some of the problems. It now uses the last-modified time correctly, and cache handling as well as lazy download from S3 are implemented. To really work with the server it will need better cache handling. After many tests, the basic and copymove tests finally run through repeatedly without failure.<p class="paragraph"/>Still a long way to go.<p class="paragraph"/><b class="bold">Update (2009-01-28):</b> In the meantime I implemented the property handling, which only fails for some strange UTF-8 property values. Now 99% of the litmus tests run through. Testing with the Mac OS X Finder looks promising.<div class="feedflare">
<a href="http://feeds.feedburner.com/~f/thinkberg?a=roVM28Gc"><img src="http://feeds.feedburner.com/~f/thinkberg?d=41" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/ZuNJqDzSRY4" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>Amazon S3 / WebDAV proxy updated</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-01-23/1#Amazon_S3_/_WebDAV_proxy_updated</dc:identifier>
<dc:date>2009-01-28T21:45:36+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-01-23/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-01-23/1#Amazon_S3_/_WebDAV_proxy_updated</feedburner:origLink></item>
    
       <item>
        <title>twitter: the public chat</title>
        <link>http://feedproxy.google.com/~r/thinkberg/~3/2HlBRZr_0Zg/1</link>
        <description>I have been following a few friends' twitter messages via Google Reader, and I get the impression that it works much like a group chat system. The conversations are similar to cross-linked comments in weblogs and are similarly public. Unlike these friends, I never really started to use twitter and even deleted my account there, as well as on a few other social networking systems. I give away so much already that I don't want to make the harvesting too easy. What I wonder, though, is why a service like twitter has taken the public chat room away from classic instant messaging systems. It works much like IRC (Internet Relay Chat), where you can just join an open chat. However, it seems crude that you have to read the others' chat to actually communicate. I guess the real advantage of twitter is its simple user interfaces on loads of different systems, which heavyweight instant messaging systems have failed to provide until now.</description>
        <guid isPermaLink="false">http://thinkberg.com/space/start/2009-01-18/1#twitter:_the_public_chat</guid>
        <content:encoded><![CDATA[I have been following a few friends' <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://www.twitter.com">twitter</a></span> messages via <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="https://www.google.com/reader">Google Reader</a></span>, and I get the impression that it works much like a group chat system. The conversations are similar to cross-linked comments in weblogs and are similarly public.<p class="paragraph"/>Unlike these friends, I never really started to use twitter and even deleted my account there, as well as on a few other social networking systems. I give away so much already that I don't want to make the harvesting too easy. What I wonder, though, is why a service like twitter has taken the public chat room away from classic instant messaging systems. It works much like <span class="nobr"><img src="http://thinkberg.com/theme/images/Icon-Extlink.png" alt="&gt;&gt;" border="0"/><a href="http://en.wikipedia.org/wiki/IRC">IRC (Internet Relay Chat)</a></span>, where you can just join an open chat. However, it seems crude that you have to read the others' chat to actually communicate.<p class="paragraph"/>I guess the real advantage of twitter is its simple user interfaces on loads of different systems, which heavyweight instant messaging systems have failed to provide until now.<div class="feedflare">
<a href="http://feeds.feedburner.com/~f/thinkberg?a=lGdmh5H4"><img src="http://feeds.feedburner.com/~f/thinkberg?d=41" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/thinkberg/~4/2HlBRZr_0Zg" height="1" width="1"/>]]></content:encoded>
        <dc:creator>arte</dc:creator>
<dc:type>Text</dc:type>
<dc:title>twitter: the public chat</dc:title>
<dc:identifier>http://thinkberg.com/space/start/2009-01-18/1#twitter:_the_public_chat</dc:identifier>
<dc:date>2009-01-18T17:57:17+01:00</dc:date>
<dc:language>en</dc:language>

        <comments>http://thinkberg.com/comments/start/2009-01-18/1#post</comments>
      <feedburner:origLink>http://thinkberg.com/space/start/2009-01-18/1#twitter:_the_public_chat</feedburner:origLink></item>
    
  </channel>
</rss>
