<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:thr="http://purl.org/syndication/thread/1.0" xml:lang="en" xml:base="http://friism.com/wp-atom.php">
	<title type="text">Randoom</title>
	<subtitle type="text">a Michael Friis production</subtitle>

	<updated>2013-05-11T17:40:24Z</updated>

	<link rel="alternate" type="text/html" href="http://friism.com" />
	<id>http://friism.com/feed/atom</id>
	

	<generator uri="http://wordpress.org/" version="3.3.1">WordPress</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/friism/sLZw" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="friism/slzw" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Compressed string storage with NHibernate]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/compressed-string-storage-with-nhibernate" />
		<id>http://friism.com/?p=821</id>
		<updated>2013-05-11T17:40:24Z</updated>
		<published>2013-05-11T17:40:24Z</published>
		<category scheme="http://friism.com" term="C#" /><category scheme="http://friism.com" term="NHibernate" />		<summary type="html"><![CDATA[This blog post demonstrates how to use an IUserType to make NHibernate compress strings before storing them. It also shows how to use an AttributeConvention to configure the relevant type mapping. By compressing strings before storing them you can save storage space and potentially speed up your app because fewer bits are moved on and off [...]]]></summary>
		<content type="html" xml:base="http://friism.com/compressed-string-storage-with-nhibernate"><![CDATA[<p>This blog post demonstrates how to use an <code>IUserType</code> to make <a href="http://nhforge.org/">NHibernate</a> compress strings before storing them. It also shows how to use an <code>AttributeConvention</code> to configure the relevant type mapping.</p>
<p>By compressing strings before storing them you can save storage space and potentially <a href="http://www.citusdata.com/blog/64-zfs-compression">speed up your app because fewer bits are moved on and off physical storage</a>. In this example, compression is done using the extremely fast <a href="https://code.google.com/p/lz4/">LZ4</a> algorithm so as to not slow data storage and retrieval.</p>
<p>The downside to compressing strings stored in the database is that running ad-hoc SQL queries (such as <code>mystring like '%foo%'</code>) is not possible.</p>
<h3>Background</h3>
<p>I was building an app that was downloading and storing lots of HTML and for convenience I was storing the HTML in a SQL Server database. SQL Server has no good way to compress stored data so the database files grew very quickly. This bugged me because most of the content would compress well. I was using Entity Framework and started casting around for ways to hook into how EF serializes data or for a way to create a custom string type that could handle the compression. Even with the EF6 pre-releases, I couldn&#8217;t find any such hooks.</p>
<h3>NHibernate IUserType</h3>
<p>So I migrated to <a href="http://nhforge.org/">NHibernate</a> which lets you define custom datatypes and control how they&#8217;re stored in the database by implementing the <code>IUserType</code> interface. The best tutorial I&#8217;ve found for implementing <code>IUserType</code> is <a href="http://blog.miraclespain.com/archive/2008/Mar-18.html">this one</a> by Jacob Andersen. You can check out my <a href="https://github.com/friism/NHibernateCompressedStringSample/blob/master/Core/Persistence/CompressedString.cs">full implementation of a compressed string IUserType on GitHub</a>. The two most interesting methods are <code>NullSafeGet()</code> and <code>NullSafeSet()</code>:</p>
<pre class="prettyprint">	public object NullSafeGet(IDataReader rs, string[] names, object owner)
	{
		var value = rs[names[0]] as byte[];
		if (value != null)
		{
			var deCompressor = LZ4DecompressorFactory.CreateNew();
			return Encoding.UTF8.GetString(deCompressor.Decompress(value));
		}

		return null;
	}

	public void NullSafeSet(IDbCommand cmd, object value, int index)
	{
		var parameter = (DbParameter)cmd.Parameters[index];

		if (value == null)
		{
			parameter.Value = DBNull.Value;
			return;
		}

		var compressor = LZ4CompressorFactory.CreateNew();
		parameter.Value = compressor.Compress(Encoding.UTF8.GetBytes(value as string));
	}</pre>
<p>The actual compression is done by <a href="https://github.com/stangelandcl/LZ4Sharp">LZ4Sharp</a> which is a .NET implementation of the <a href="https://code.google.com/p/lz4/">LZ4</a> compression algorithm. LZ4 is notable, not for compressing data a lot, but for compressing and uncompressing data extremely quickly. A single modern CPU core can LZ4-compress at up to 300 MB/s and uncompress much faster. This should minimize the overhead of compressing and uncompressing data as it enters and leaves the database.</p>
<p>For <code>SqlTypes</code> we use <code>BinarySqlType(int.MaxValue)</code>:</p>
<pre class="prettyprint">	public SqlType[] SqlTypes
	{
		get { return new[] { new BinarySqlType(int.MaxValue) }; }
	}</pre>
<p>This causes the type to be mapped to a <code>varbinary(max)</code> column in the database.</p>
<h3>Mapping</h3>
<p>To facilitate mapping, we&#8217;ll use an Attribute:</p>
<pre class="prettyprint">	[AttributeUsage(AttributeTargets.Property)]
	public class CompressedAttribute : Attribute
	{
	}</pre>
<p>And an <code>AttributeConvention</code> for FluentNHibernate to use:</p>
<pre class="prettyprint">	public class CompressedAttributeConvention : AttributePropertyConvention&lt;CompressedAttribute&gt;
	{
		protected override void Apply(CompressedAttribute attribute, IPropertyInstance instance)
		{
			// [Compressed] only makes sense on string properties
			if (instance.Property.PropertyType != typeof(string))
			{
				throw new ArgumentException("[Compressed] can only be applied to string properties");
			}

			instance.CustomType(typeof(CompressedString));
		}
	}</pre>
<p>Here&#8217;s how to use the convention with AutoMap:</p>
<pre class="prettyprint">	var autoMap = AutoMap.AssemblyOf&lt;Entity&gt;()
		.Where(x => typeof(Entity).IsAssignableFrom(x))
		.Conventions.Add(new CompressedAttributeConvention());</pre>
<p>The full <a href="https://github.com/friism/NHibernateCompressedStringSample/blob/master/Core/Persistence/SessionFactory.cs">SessionFactory is on GitHub</a>.</p>
<p>With this, we get nice, clean entity classes with strings that are automatically compressed when stored:</p>
<pre class="prettyprint">	public class Document : Entity
	{
		[Compressed]
		public virtual string Text { get; set; }
	}</pre>
<h3>Limitations</h3>
<p>As mentioned in the introduction, you can&#8217;t do ad-hoc SQL queries because compressed strings are stored in the database as binary blobs. Querying with NHibernate is also somewhat limited. Doing <code>document.Text == "foo"</code> actually works because NHibernate runs &#8220;foo&#8221; through the compression. Queries that involve <code>Contains()</code> will (silently) not work, unfortunately. This is because NHibernate translates this to a <code>like</code> query, which won&#8217;t work with the compressed binary blob. I haven&#8217;t looked into hooking into the query engine to fix this.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/compressed-string-storage-with-nhibernate#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/compressed-string-storage-with-nhibernate/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Danish state budget data]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/danish-state-budget-data" />
		<id>http://friism.com/?p=792</id>
		<updated>2013-01-20T21:36:01Z</updated>
		<published>2013-01-20T21:36:01Z</published>
		<category scheme="http://friism.com" term="Scraping" />		<summary type="html"><![CDATA[A couple of weeks ago, Peter Brodersen asked me whether I had made a tree-map visualization of the 2013 Danish state budget. Here it is. It&#8217;s on Many Eyes and requires Java (sorry). You can zoom in on individual spending areas by right-clicking on them: About the data I started scraping and analyzing budget data at [...]]]></summary>
		<content type="html" xml:base="http://friism.com/danish-state-budget-data"><![CDATA[<p>A couple of weeks ago, <a href="https://twitter.com/peterbrodersen/status/288036028915273728">Peter Brodersen asked me</a> whether I had made a tree-map visualization of the 2013 Danish state budget. <a href="http://www-958.ibm.com/software/analytics/manyeyes/visualizations/danish-state-budget-2013-as-treema">Here it is</a>. It&#8217;s on Many Eyes and requires Java (sorry). You can zoom in on individual spending areas by right-clicking on them:</p>
<p><a style="margin: 0pt; padding: 0pt;" href="http://www-958.ibm.com/me/visualizations/danish-state-budget-2013-as-treema/comments/387d8254633611e2b163000255111976"> <img style="border: 1px solid #6898C8; margin: 0; padding-top: 10px; padding-bottom: 15px;" title="Danish State Budget 2013 as Treemap" src="http://www-958.ibm.com/me/files/thumbnails/384eda08-6336-11e2-b163-000255111976.png?size=200x150" alt="Danish State Budget 2013 as Treemap" /> <img style="border: 0pt none; margin: 0pt; padding: 0pt; display: block; position: relative; top: -9px;" title="Many Eyes" src="http://www-958.ibm.com/me/images/blog_this_caption.jpg" alt="Many Eyes" /></a></p>
<p><strong>About the data</strong></p>
<p>I started scraping and analyzing budget data at <a href="http://ekstrabladet.dk/">Ekstra Bladet</a> in 2010. The goal was to find ways to help people understand how the Danish state uses its money and to let everyone rearrange and balance out the 15 billion DKK long-term deficit that was frequently cited in the run-up to the 2011 parliamentary election. We didn&#8217;t get around to this, unfortunately.</p>
<p>The Danish state burns through a lot of money, which is inherently interesting. The budget published online is also very detailed, which is great. Showing off the magnitude and detail in an interesting way turns out to be difficult though, and the best I&#8217;ve come up with is the Many Eyes tree-map.</p>
<p>To see if anyone can do a better job, I&#8217;m making all the underlying <a href="https://www.google.com/fusiontables/data?docid=1lrrPUtsMGvN5ouAHhytOEQ-MersM6PvBs2v5rcg">data available in a Google Fusion Table</a>. The data is hierarchical with six levels of detail (this is also why the zoomable tree-map works fairly well). Here&#8217;s an example hierarchy, starting from the ministry using money (Ministry of Labor), down to what the money was used for (salaries and benefits):</p>
<pre>Beskæftigelsesministeriet
    Arbejdsmiljø
        Arbejdsmarkedets parters arbejdsmiljøindsats
            Videncenter
                Indtægtsdækket virksomhed
                    Lønninger / personaleomkostninger.</pre>
<p>In the Fusion table data there&#8217;s a line with an amount for each level. That means that the same money shows up six times, once for each level in the hierarchy. To generate the tree-map, one would start with lines at line-level 5 (the most detailed) and use the ParentBudgetLine to find the parent lines in the hierarchy. The C# code that accomplishes this is found <a href="https://github.com/friism/dk-budget-parser/blob/master/EB.Budget.Parser/Export/Exporter.cs#L130">here</a>.</p>
<p>The Fusion table contains data for budgets from 2003 to 2013. The &#8220;Year&#8221; column is the budget year that this line belongs to. &#8220;Linecode&#8221; is the code used in the budget. &#8220;CurrentYearBudget&#8221; is the budgeted amount for the year that this particular budget was published (i.e. the projected spend in 2013 for the 2013 state budget). Year[1-3]Budget are the projected spends for the coming three years (i.e. 2014-2016 for the 2013 budget). PreviousYear[1-2]Budget are the spends actually incurred for the previous two years (i.e. 2011 and 2012 for the 2013 budget).</p>
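<p>As a rough sketch of that walk (this is not the exporter code linked above), the snippet below turns each level-5 line into the list of names from the ministry down to the leaf. It assumes that ParentBudgetLine holds the Linecode of the parent line, that the lines passed in all belong to a single budget year, and that there are <code>Name</code> and <code>Level</code> columns. Those two column names, and the class names, are made up for illustration:</p>
<pre class="prettyprint">	using System.Collections.Generic;
	using System.Linq;

	public class BudgetLine
	{
		public string Linecode { get; set; }
		public string ParentBudgetLine { get; set; } // assumed: parent's Linecode, null at the top
		public string Name { get; set; }             // hypothetical column name
		public int Level { get; set; }               // hypothetical column name, 0 to 5
		public decimal CurrentYearBudget { get; set; }
	}

	public static class BudgetPaths
	{
		// One path per level-5 line, ordered from the top level down to the leaf
		public static IEnumerable&lt;string[]&gt; LeafPaths(IEnumerable&lt;BudgetLine&gt; lines)
		{
			var all = lines.ToList();
			var byCode = all.ToDictionary(x =&gt; x.Linecode);
			foreach (var leaf in all.Where(x =&gt; x.Level == 5))
			{
				var path = new List&lt;string&gt;();
				var current = leaf;
				while (current != null)
				{
					path.Add(current.Name);
					BudgetLine parent = null;
					if (current.ParentBudgetLine != null)
					{
						byCode.TryGetValue(current.ParentBudgetLine, out parent);
					}
					current = parent;
				}
				path.Reverse();
				yield return path.ToArray();
			}
		}
	}</pre>
<p>Only the amounts on the level-5 lines should be summed when sizing the tree-map, since the higher-level lines repeat the same money.</p>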
<p>We have data for multiple years and comparing projected numbers in previous years with actual numbers in later years might yield interesting examples of departments going over budget and other irregularities.</p>
<p>Since we have data for multiple years, we can also visualize changes in spending for individual ministries over time. This turns out to be slightly less interesting than one might suspect because changing governments have a tendency to rename, create or close down ministries fairly often. Here&#8217;s a time-graph <a href="http://www-958.ibm.com/software/analytics/manyeyes/visualizations/danish-state-spending-by-ministry">example</a>:</p>
<p><a style="margin: 0pt; padding: 0pt;" href="http://www-958.ibm.com/me/visualizations/danish-state-spending-by-ministry/comments/88c48cc02e4511e1a017000255111976"> <img style="border: 1px solid #6898C8; margin: 0; padding-top: 10px; padding-bottom: 15px;" title="Danish State spending by Ministry" src="http://www-958.ibm.com/me/files/thumbnails/889286da-2e45-11e1-a017-000255111976.png?size=200x150" alt="Danish State spending by Ministry" /> <img style="border: 0pt none; margin: 0pt; padding: 0pt; display: block; position: relative; top: -9px;" title="Many Eyes" src="http://www-958.ibm.com/me/images/blog_this_caption.jpg" alt="Many Eyes" /></a></p>
<p>The source code that parses the budget and outputs it in various ways can be found on <a href="https://github.com/friism/dk-budget-parser">GitHub</a>. The code was written on Ekstra Bladet&#8217;s dime.</p>
<p><strong>Dedication:</strong> This blog post is dedicated to <a href="http://en.wikipedia.org/wiki/Aaron_Swartz">Aaron Swartz</a>. Aaron committed suicide sometime around January 11th, 2013. He had many cares and labors, and one of them was making data more generally available to the public.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/danish-state-budget-data#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/danish-state-budget-data/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>friism</name>
						<uri>http://www.itu.dk/~friism/blog/</uri>
					</author>
		<title type="html"><![CDATA[Tax records for Danish companies]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/tax-records-for-danish-companies" />
		<id>http://friism.com/?p=708</id>
		<updated>2013-01-13T20:40:25Z</updated>
		<published>2012-12-26T09:02:23Z</published>
		<category scheme="http://friism.com" term="Scraping" />		<summary type="html"><![CDATA[This week, the Danish tax-authorities published an interface that lets you browse information on how much tax companies registered in Denmark are paying. I&#8217;ve written a scraper that has fetched all the records. I&#8217;ve published all 243,711 records as a Google Fusion Table that will let you explore and download the data. If you use this data [...]]]></summary>
		<content type="html" xml:base="http://friism.com/tax-records-for-danish-companies"><![CDATA[<p>This week, the Danish tax-authorities <a href="http://skat.dk/SKAT.aspx?oId=2089696&amp;vId=0">published</a> an <a href="http://www.skat.dk/SKAT.aspx?oId=69073">interface</a> that lets you browse information on how much tax companies registered in Denmark are paying. I&#8217;ve written a scraper that has fetched all the records. I&#8217;ve published all 243,711 records as a <a href="https://www.google.com/fusiontables/DataSource?docid=1hhMfEJGc8wKIbz6XTKUM391q_X9YlB2Br9Yyb-o">Google Fusion Table that will let you explore and download the data</a>. If you use this data for analysis or reporting, please credit <code>Michael Friis, http://friism.com/</code>. The <a href="https://github.com/friism/Tax">scraper source code</a> is also available if you&#8217;re interested.</p>
<p><strong>UPDATE 1/9-12:</strong> Niels Teglsbo has exported the data from Google Fusion tables and created a convenient Excel Spreadsheet for <a href="http://ge.tt/51IgKGU/v/0">download</a>.</p>
<p><strong>The bigger picture</strong></p>
<p>Tax records for individuals (and companies presumably) used to be public in Denmark and still are in Norway and Sweden. If you&#8217;re in Denmark, you can probably head down to your local municipality, demand the old tax book and look up how much tax your grandpa paid in 1920. The municipality of Esbjerg <a href="http://eba.esbjergkommune.dk/Esbjergs%20historie/Borgere/Trykte%20Skatteb%C3%B8ger%20Esbjerg/Avanceret%20S%C3%B8g.aspx">publishes old records online</a> in searchable form. Here&#8217;s a <a href="http://eba.esbjergkommune.dk/Esbjergs%20historie/Borgere/Trykte%20Skatteb%C3%B8ger%20Esbjerg/Detaljeside.aspx?qid=71085">record of Carpenter N. Møller paying kr. 6.00 in taxes in 1892</a>.</p>
<p>The Danish business lobby complained loudly when the move to publish current tax records was announced. I agree that the release of this information by a center-left government is an example of political demagoguery and that&#8217;s yucky, but apart from that, I don&#8217;t think there are any good reasons why this information should not be public. It&#8217;s also worth noting that publicly listed companies are already required to publish financial statements and non-public ones are required to submit yearly financials to the government which then helpfully <a href="http://cvr.dk/Site/Forms/CMS/DisplayPage.aspx?pageid=39">resells them to anyone interested</a>.</p>
<p>It&#8217;s good that this information is now completely public: Limited liability companies and the privileges and protections offered by these are an awesome invention. In return for those privileges, it&#8217;s fair for society to demand information about how a company is being run to see how those privileges are being put to use.</p>
<p>The authorities <a href="http://borsen.dk/nyheder/virksomheder/artikel/1/237069/selskaber_skal_i_offentlig_skatte-gabestok.html">announced</a> their intention to publish tax records in the summer of 2012 and it has apparently taken them 6 months to build a very limited interface on top of their database. The interface lets you look up individual companies by id (&#8220;CVR nummer&#8221;) or name and inspect their records. You have to know the name or id of any company that you&#8217;re interested in because there&#8217;s no way to browse or explore the data. Answering a simple question such as &#8220;Which company paid the most taxes in 2011?&#8221; is impossible using the interface.</p>
<p>Having said that, I think it&#8217;s great whenever governments release data and I commend the Danish tax authorities for making this data available. And even with very limited interfaces like this, it&#8217;s generally possible to scrape all data and analyze it in greater detail and that is what I&#8217;ve done.</p>
<p><strong>So what&#8217;s in there</strong></p>
<p>The tax data-set contains information on 243,711 companies. Note that this data does not contain the names and ids of all companies operating in Denmark in 2011. Some types of corporations (I/S corporations and sole proprietorships for example) have their profits taxed as personal income for the individuals that own them. That means they won&#8217;t show up in the data.</p>
<p><strong>UPDATE 12/30-12: </strong><a href="https://twitter.com/MagnusBjerg/status/284240160131006464">Magnus Bjerg pointed out</a> that some companies are duplicated in the data. This seems to be the case at least for all (roughly 48) companies that pay tariffs for extraction of oil and gas. Here are some examples: <a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=87197719">Shell 1</a> and <a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=10373816">Shell 2</a> and <a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=22756214">Maersk 1</a> and <a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=71731219">Maersk 2</a>. The numbers for these companies look very similar but are not exactly the same. The duplicated companies with different identifiers are likely due to Skat mixing up CVR ids and SE ids. Additional details on SE ids can be found <a href="http://www.e-conomic.dk/regnskabsprogram/ordbog/se-nummer">here</a>. My guess is that Skat pulled standard taxes and fossil fuel taxes from two different registries and forgot to merge and check for duplicates.</p>
<div>Here are the Danish companies that reported the greatest profits in 2011. These companies also paid the most taxes:</div>
<div>
<ol>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=87197719">SHELL OLIE- OG GASUDVINDING DANMARK B.V. (HOLLAND), DANSK  FILIAL</a></li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=10373816">A/S Dansk Shell/Eksportvirksomhed</a></li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=22756214">A.P. MØLLER &#8211; MÆRSK A/S</a></li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=71731219">A.P.Møller &#8211; Mærsk A/S/ Oil &amp; Gas Activity</a></li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=24257630">Novo A/S</a></li>
</ol>
<div>Here are the companies that booked the greatest losses:</div>
<div>
<ol>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=58180912">FLSMIDTH &amp; CO. A/S</a> &#8211; lost kr. 1,537,929,000.00</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=15694688">Sund og Bælt Holding A/S</a> &#8211; lost kr. 1,443,935,000.00</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=36213728">DONG ENERGY A/S</a> &#8211; lost kr. 1,354,480,560.00</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=28313519">TAKEDA A/S</a> &#8211; lost kr. 786,286,000.00</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=22438018">PFA HOLDING A/S</a> &#8211; lost kr. 703,882,104.00</li>
</ol>
</div>
<div>Here are companies that are reporting a lot of profit but paying few or no taxes:</div>
<div>
<ol>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=36213728">DONG ENERGY A/S</a> - kr. 3,148,994,114.00 profit, kr. 0 tax</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=28313519">TAKEDA A/S</a> - kr. 745,424,000.00 profit, kr. 0 tax</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=54879415">Rockwool International A/S</a> - kr. 284,696,514.00 profit, kr. 0 tax</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=32892973">COWI HOLDING A/S</a> - kr. 177,272,657.00 profit, kr. 2,399,803.00 tax</li>
<li><a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=28316887">DANAHER TAX ADMINISTRATION ApS.</a> - kr. 155,222,377.00 profit, kr. 0 tax</li>
</ol>
<p><strong>Benford&#8217;s law</strong></p>
<div><a href="http://en.wikipedia.org/wiki/Benford's_law">Benford&#8217;s law</a> states that numbers in many real-world sources of data are much more likely to start with the digit 1 (30% of numbers) than with the digit 9 (less than 5% of numbers). Here&#8217;s the frequency distribution of first-digits of the numbers for profits, losses and taxes as reported by Danish companies plotted against the frequencies predicted by Benford:</div>
<div><a href="http://friism.com/wp-content/uploads/2012/12/benford.png"><img class="aligncenter size-full wp-image-745" title="Benford distribution of first digits" src="http://friism.com/wp-content/uploads/2012/12/benford.png" alt="" width="481" height="289" /></a></div>
<p>&nbsp;</p>
<p>The digit distributions closely match those predicted by Benford&#8217;s law. That&#8217;s great news: If Danish companies were systematically doctoring their tax returns and coming up with fake profit numbers, then those numbers would likely be more uniformly distributed and wouldn&#8217;t match Benford&#8217;s predictions. This is because crooked accountants trying to come up with random-looking numbers will tend to choose numbers starting with digits like 9 too often and numbers starting with the digit 1 too rarely.</p>
<p><strong>UPDATE 12/30-12:</strong> It&#8217;s important to stress that the fact that the tax numbers conform to Benford&#8217;s law does not imply that companies always pay the taxes they are due. It does suggest, however, that Danish companies, as a rule, do not put made-up numbers on their tax returns.</p>
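<p>For reference, Benford&#8217;s law predicts that the leading digit <code>d</code> shows up with frequency <code>log10(1 + 1/d)</code>, which gives about 30.1% for the digit 1 and about 4.6% for the digit 9. Here&#8217;s a minimal sketch of how observed first-digit frequencies can be compared with that prediction. It is not the code used to produce the plot above, and the class and method names are made up for illustration:</p>
<pre class="code">	using System;
	using System.Collections.Generic;
	using System.Linq;

	public static class Benford
	{
		// Share of numbers expected to start with the given digit (1-9)
		public static double Expected(int digit)
		{
			return Math.Log10(1 + 1.0 / digit);
		}

		// Share of the non-zero amounts that start with each digit
		public static Dictionary&lt;int, double&gt; Observed(IEnumerable&lt;decimal&gt; amounts)
		{
			var firstDigits = amounts
				.Where(x =&gt; x != 0)
				.Select(x =&gt; FirstDigit(x))
				.ToList();
			return Enumerable.Range(1, 9).ToDictionary(
				digit =&gt; digit,
				digit =&gt; firstDigits.Count == 0
					? 0.0
					: (double)firstDigits.Count(x =&gt; x == digit) / firstDigits.Count);
		}

		// Scales the absolute value into the range [1, 10) and truncates
		private static int FirstDigit(decimal amount)
		{
			var value = Math.Abs(amount);
			while (value &gt;= 10) value /= 10;
			while (value &lt; 1) value *= 10;
			return (int)value;
		}
	}</pre>
<p>Zero amounts are skipped because they have no leading digit, and losses are counted by their absolute value.</p>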
<p><strong>Technical details</strong></p>
<div>To scrape the tax website I found two ways to access tax information for a company:</div>
<ol>
<li>Access an individual company using the <code>x</code> query parameter for the CVR identifier: <a href="http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=29604274">http://skat.dk/SKAT.aspx?oId=skattelister&amp;x=29604274</a></li>
<li>Spoof the <code>POST</code> request generated by the <a href="http://msdn.microsoft.com/en-us/library/bb386454(v=vs.100).aspx">UpdatePanel</a> that gets updated when you hit the &#8220;søg&#8221; button</li>
</ol>
<p>The former is the simplest approach, but the latter is preferable for a scraper because much less HTML is transferred from the server when updating the panel compared to requesting the page anew for each company.</p>
<p>To get details on a company, one has to know its identifier. Unfortunately there&#8217;s no authoritative list of CVR identifiers, although the government has <a href="http://www.digst.dk/da/Digitaliseringsstrategi/Digitaliseringsstrategiens-initiativer/~/media/Digitaliseringsstrategi/Initiativbeskrivelserne/104.ashx">promised to publish such a list in 2013</a>. The <a href="http://thepiratebay.se/torrent/6619217/cvr">contents of the entire Danish CVR register were leaked</a> in 2011, so one could presumably harvest identifiers from that data. The most fool-proof method though, is to just brute-force through all possible identifiers. CVR identifiers consist of 7 digits with an 8th checksum-digit. The process of computing the <a href="http://www.erhvervsstyrelsen.dk/modulus_11">checksum is documented</a> publicly. Here&#8217;s my implementation of the checksum computation. Please let me know if you think it&#8217;s wrong:</p>
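<pre class="code">	// Modulus-11 weights for the seven serial digits, most significant digit first
	private static readonly int[] digitWeights = { 2, 7, 6, 5, 4, 3, 2 };

	// Appends the check digit to a seven-digit serial.
	// Returns -1 if the serial has no valid check digit.
	public static int ToCvr(int serial)
	{
		var digits = serial.ToString().Select(x =&gt; int.Parse(x.ToString()));
		var sum = digits.Select((digit, index) =&gt; digit * digitWeights[index]).Sum();
		var modulo = sum % 11;
		if (modulo == 1)
		{
			// the check digit would be 10, which is not a single digit
			return -1;
		}
		if (modulo == 0)
		{
			// makes the check digit 0 below
			modulo = 11;
		}
		var checkDigit = 11 - modulo;
		return serial * 10 + checkDigit;
	}</pre>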
<p>My guess is that the lowest serial (without the checksum) is 1,000,000 because that&#8217;s the lowest serial that will yield an 8-digit identifier. The largest serial is likely 9,999,999. I could be wrong though, so if you have any insights please let me know. Roughly one in eleven serials is discarded because the checksum is 10, which is invalid. That leaves about 8 million identifiers to be tried. It&#8217;s wasteful to have to submit 8 million requests to get records for a couple of hundred thousand companies, but one can hope that 8 million requests will get the government&#8217;s attention and that they&#8217;ll start publishing data more efficiently.</p>
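<p>Here&#8217;s a rough sketch of how that brute-force enumeration could look. It applies the same weights and modulus-11 rule as <code>ToCvr()</code> above, assumes every serial has exactly seven digits, and uses class and method names that are made up for illustration:</p>
<pre class="code">	using System.Collections.Generic;

	public static class CvrCandidates
	{
		// Same weights as in ToCvr(), most significant digit first
		private static readonly int[] Weights = { 2, 7, 6, 5, 4, 3, 2 };

		// Check digit for a seven-digit serial, or -1 if it would have to be 10
		public static int CheckDigit(int serial)
		{
			var sum = 0;
			var remaining = serial;
			// walk the digits from least to most significant
			for (var i = Weights.Length - 1; i &gt;= 0; i--)
			{
				sum += (remaining % 10) * Weights[i];
				remaining /= 10;
			}
			var modulo = sum % 11;
			if (modulo == 1)
			{
				return -1;
			}
			return modulo == 0 ? 0 : 11 - modulo;
		}

		// Lazily yields every valid 8-digit identifier in the serial range
		public static IEnumerable&lt;int&gt; Enumerate(int firstSerial, int lastSerial)
		{
			for (var serial = firstSerial; serial &lt;= lastSerial; serial++)
			{
				var checkDigit = CheckDigit(serial);
				if (checkDigit &gt;= 0)
				{
					yield return serial * 10 + checkDigit;
				}
			}
		}
	}</pre>
<p>A scraper would loop over <code>CvrCandidates.Enumerate(1000000, 9999999)</code> and request each identifier in turn. As a sanity check, the serial 2960427 yields 29604274, which is the identifier used in the example URL above.</p>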
</div>
</div>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/tax-records-for-danish-companies#comments" thr:count="2" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/tax-records-for-danish-companies/feed/atom" thr:count="2" />
		<thr:total>2</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Screen scraping with WatiN]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/screen-scraping-with-watin" />
		<id>http://friism.com/?p=699</id>
		<updated>2012-05-09T21:25:01Z</updated>
		<published>2012-05-08T19:43:03Z</published>
		<category scheme="http://friism.com" term="Scraping" />		<summary type="html"><![CDATA[This post describes how to use WatiN to screen scrape web sites that don&#8217;t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too. Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause [...]]]></summary>
		<content type="html" xml:base="http://friism.com/screen-scraping-with-watin"><![CDATA[<p>This post describes how to use <a href="http://watin.org/">WatiN</a> to screen scrape web sites that don&#8217;t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too.</p>
<p>Screen scraping websites can range in difficulty from <a href="http://friism.com/raw-updated-data-on-danish-business-leader-groups">very easy</a> to <a href="http://friism.com/downloading-the-eu">extremely hard</a>. When encountering hard-to-scrape sites, the typical cause of difficulty is fumbling incompetence on the part of the people that built the site to be scraped. Every once in a while however, you&#8217;ll encounter a site openly displaying data to the casual browser, but with measures in place to prevent automatic scraping of that data.</p>
<p><a href="http://www.dkpto.org/">The Danish Patent and Trademark Office</a> is one such site. The people there maintain a <a href="http://onlineweb.dkpto.dk/pvsonline/patent">searchable database</a> that lets you search and peruse Danish and international patents. Unfortunately, computers are not allowed. If one tries to issue an HTTP <code>POST</code> request to the resource that generally performs searches and shows patents, an error is returned. If one emulates visiting the site with a real browser by providing a browser-looking User Agent setting, collecting cookies etc. (for example by using a tool like <a href="https://github.com/axefrog/SimpleBrowser">SimpleBrowser</a>), the site sends a made-up 999 HTTP response code and the message &#8220;No Hacking&#8221;.</p>
<p>Faced with such an obstruction, there are two avenues of attack:</p>
<ol>
<li>Break out <a href="http://www.wireshark.org/">Wireshark</a> or <a href="http://www.fiddler2.com/fiddler2/">Fiddler</a> and spend a lot of time figuring out what it takes to fabricate requests that fool the site into thinking they originate from a normal browser and not from your bot</li>
<li>Instrument an actual browser so that the site will have no way (other than timing analysis and IP address request rate limiting) of knowing whether requests are from a bot or from a normal client</li>
</ol>
<p>The second option turns out to be really easy because people have spent lots of time building tools for automatically testing web applications using full browsers, tools like <a href="http://watin.org/">WatiN</a>. For example, successfully scraping the Danish Patent and Trademark Office&#8217;s site using WatiN is as simple as this:</p>
<pre class="prettyprint">private static void GetPatentsInYear(int year)
{
	using (var browser = new IE("http://onlineweb.dkpto.dk/pvsonline/Patent"))
	{
		// go to the search form
		browser.Button(Find.ByName("menu")).ClickNoWait();

		// fill out search form and submit
		browser.CheckBox(Find.ByName("brugsmodel")).Click();
		browser.SelectList(Find.ByName("datotype")).Select("Patent/reg. dato");
		browser.TextField(Find.ByName("dato")).Value = string.Format("{0}*", year);
		browser.Button(Find.By("type", "submit")).ClickNoWait();
		browser.WaitForComplete();

		// go to first patent found in search result and save it
		browser.Buttons.Filter(Find.ByValue("Vis")).First().Click();
		GetPatentFromPage(browser, year);

		// hit the 'next' button until it's no longer there
		while (GetNextPatentButton(browser).Exists)
		{
			GetNextPatentButton(browser).Click();
			GetPatentFromPage(browser, year);
		}
	}
}

private static Button GetNextPatentButton(IE browser)
{
	return browser.Button(button =>
		button.Value == "Næste" &amp;&amp; button.ClassName == "knapanden");
}</pre>
<p>Note that in this example, we&#8217;re using Internet Explorer because it&#8217;s the easiest to set up and use (WatiN also works with Firefox, but only older versions). There&#8217;s definitely room for improvement; in particular, it&#8217;d be interesting to explore parallelizing the scraper to download patents faster. The &#8211; still incomplete &#8211; project source code is available on <a href="https://github.com/friism/Patents">GitHub</a>. I&#8217;ll do a post shortly on what interesting data can be extracted from Danish patents.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/screen-scraping-with-watin#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/screen-scraping-with-watin/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Raw updated data on Danish business leader groups]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/raw-updated-data-on-danish-business-leader-groups" />
		<id>http://friism.com/?p=686</id>
		<updated>2012-03-18T02:34:03Z</updated>
		<published>2012-03-18T02:34:03Z</published>
		<category scheme="http://friism.com" term="Scraping" />		<summary type="html"><![CDATA[Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I&#8217;ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github. The code is fairly straightforward. The scraper [...]]]></summary>
		<content type="html" xml:base="http://friism.com/raw-updated-data-on-danish-business-leader-groups"><![CDATA[<p>Last summer, I <a href="http://friism.com/members-of-danish-vl-groups">published data on the members of Danish business leader groups</a>, obtained with code written while I was still at <a href="http://ekstrabladet.dk/">Ekstra Bladet</a>. I&#8217;ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can <a href="https://github.com/friism/VLGroups">fork the project on Github</a>.</p>
<p>The code is fairly straightforward. The <a href="https://github.com/friism/VLGroups/blob/master/Scraper/VLGroupScraper.cs">scraper</a> itself is less than 150 lines of code. The scraper is <a href="https://github.com/friism/VLGroups/blob/master/Scraper/Program.cs">configured</a> to run in a <a href="http://blog.appharbor.com/2012/03/08/background-workers-in-beta">background worker</a> on AppHarbor and will conduct a scrape once a month (I don&#8217;t know how often the VL-people update their website, but monthly updates seem sufficient to keep track of comings and goings). The resulting data can be fetched using a simple JSON API. You can find a list of scraped member-batches <a href="http://vlgroups.apphb.com/">here</a> (there&#8217;s just one at the time of writing). Hitting <a href="http://vlgroups.apphb.com/Member">http://vlgroups.apphb.com/Member</a> will always net you the latest batch.</p>
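<p>As a rough sketch, fetching the latest batch from JavaScript could look something like the following. The URL is the one mentioned above; the helper name <code>fetchLatestBatch</code> and the injectable <code>fetchImpl</code> parameter are made up for illustration, and since the shape of the JSON is not described here, the helper simply hands back whatever the endpoint returns.</p>
<pre class="prettyprint">
// Hypothetical helper, not part of the VLGroups project.
// fetchImpl falls back to the global fetch found in browsers and recent
// Node.js versions; passing one in makes the helper testable offline.
function fetchLatestBatch(fetchImpl) {
	var doFetch = fetchImpl || fetch;
	return doFetch('http://vlgroups.apphb.com/Member').then(function (response) {
		if (!response.ok) {
			throw new Error('Unexpected HTTP status ' + response.status);
		}
		// The response shape is an unknown here, so return the parsed JSON as-is.
		return response.json();
	});
}
</pre>
<p>Calling <code>fetchLatestBatch().then(function (batch) { console.log(batch); })</code> would then print the latest batch, assuming the service is reachable.</p>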
<p>I was motivated to revisit the code after this week&#8217;s <a href="http://www.b.dk/fra-editorial/eldrup-forsoegte-at-forfremme-barylen-0">dethroning of Anders Eldrup from his position as CEO of Dong Energy</a>. Anders Eldrup sits in VL-gruppe 1, the most prestigious one. Let&#8217;s see if he&#8217;s still there next time the scraper looks. 14 other Dong Energy executives are members of other groups, although interestingly, Jakob Baruël Poulsen (Eldrup&#8217;s handsomely rewarded sidekick) is nowhere to be found. I think data like this is an important piece of the puzzle in figuring out what relations exist between business leaders in Denmark, and the Anders Eldrup debacle demonstrates why keeping track is important.</p>
<p><a href="http://friism.com/wp-content/uploads/2012/03/vl.jpg"><img class="aligncenter size-medium wp-image-689" title="VL logo" src="http://friism.com/wp-content/uploads/2012/03/vl-300x66.jpg" alt="" width="300" height="66" /></a></p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/raw-updated-data-on-danish-business-leader-groups#comments" thr:count="1" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/raw-updated-data-on-danish-business-leader-groups/feed/atom" thr:count="1" />
		<thr:total>1</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Nordic Newshacker]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/nordic-newshacker" />
		<id>http://friism.com/?p=662</id>
		<updated>2012-02-26T02:07:01Z</updated>
		<published>2012-02-25T22:32:12Z</published>
		<category scheme="http://friism.com" term="Journalism" />		<summary type="html"><![CDATA[The excellent people at the Danish newspaper Information are hosting a competition to promote data journalism. It&#8217;s called &#8220;Nordisk Nyhedshacker 2012&#8221;. Data journalism was what I spent some of my time at Ekstra Bladet doing, and the organizers have been kind enough to put me on the jury. The winner will get a scholarship to [...]]]></summary>
		<content type="html" xml:base="http://friism.com/nordic-newshacker"><![CDATA[<p>The excellent people at the Danish newspaper <a href="http://information.dk/">Information</a> are hosting a competition to promote data journalism. It&#8217;s called &#8220;<a href="http://www.nyhedshacker.net/">Nordisk Nyhedshacker 2012</a>&#8221;. Data journalism was what I spent some of my time at <a href="http://ekstrabladet.dk/">Ekstra Bladet</a> doing, and the organizers have been kind enough to put me on the <a href="http://www.nyhedshacker.net/content/s%C3%A5dan-deltager-du">jury</a>. The winner will get a scholarship to go work at <a href="http://www.guardiannews.com/">The Guardian</a> for a month, sponsored by Google. Frankly, I&#8217;d prefer working at Information, but I guess The Guardian will do. If you&#8217;re a journalist who can hack or if you&#8217;re a hacker interested in using your craft to make people more informed about the world we live in, you should use this opportunity to come up with something interesting and be recognized for it.</p>
<p>Hopefully, you already have awesome ideas for what to build. Should you need some inspiration, here are a few interesting pieces of data you might want to consider (projects using this data will <em>not</em> be judged differently than others).</p>
<ul>
<li>Examine the <a href="http://wikileaks.org/cablegate.html">US Embassy Cables released by Wikileaks</a>. I&#8217;ve tried to <a href="http://friism.com/us-embassy-cables-related-to-denmark">filter out the ones related to Denmark</a>.</li>
<li>Examine the power relationships of members of Danish business leader groups. I have <a href="http://friism.com/members-of-danish-vl-groups">extracted the membership info from their web site</a>. It&#8217;d be extra interesting if you combine this information with data about who sits on the boards of big Danish companies, perhaps to make the beginnings of something like <a href="http://littlesis.org/">LittleSis</a> so that we can keep track of what favours those in power are doing each other.</li>
<li>Do something interesting with the <a href="https://thepiratebay.se/torrent/6619217/">CVR database of Danish companies</a> that was leaked on The Pirate Bay last year.</li>
<li>Ekstra Bladet has been kind enough to let me open source the code for the award-winning <a href="http://krimikort.ekstrabladet.dk/">Krimikort</a> (Crime Map) I built while working there. It&#8217;s not quite ready to be released yet, but we&#8217;re making the current data available now. There&#8217;s 62,753 nuggets of geo-located and categorised crime ready for you to look at. You can download a rar file (50 MB) <a href="http://crimemap.s3.amazonaws.com/EBCrime.rar">here</a>. To use the data, you have to get a free copy of SQL Server Express and mount the database (Google will tell you how).</li>
</ul>
<p>I&#8217;m afraid I won&#8217;t be able to participate in many of the activities preceding the actual competition but I can&#8217;t wait to see what people come up with!</p>
<p style="text-align: center;"><img class="size-full wp-image-665 aligncenter" title="newshacker-illu" src="http://friism.com/wp-content/uploads/2012/02/newshacker-illu.png" alt="" width="593" height="348" /></p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/nordic-newshacker#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/nordic-newshacker/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[US Embassy Cables Related to Denmark]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/us-embassy-cables-related-to-denmark" />
		<id>http://friism.com/?p=648</id>
		<updated>2011-09-03T02:39:58Z</updated>
		<published>2011-09-03T02:39:58Z</published>
		<category scheme="http://friism.com" term="Journalism" />		<summary type="html"><![CDATA[As you may know, Wikileaks has released the full, un-redacted database of US Embassy cables. A torrent file useful for downloading all the data is available from Wikileaks, at the bottom of this page. It&#8217;s a PostgreSQL data dump. Danish journalists seem to be completely occupied producing vacuous election coverage, so to help out, I&#8217;ve [...]]]></summary>
		<content type="html" xml:base="http://friism.com/us-embassy-cables-related-to-denmark"><![CDATA[<p>As you may know, Wikileaks has <a href="http://www.huffingtonpost.com/2011/09/02/wikileaks-diplomatic-cables_n_946574.html">released the full, un-redacted database of US Embassy cables</a>. A torrent file useful for downloading all the data is available from Wikileaks, at the bottom of <a href="http://wikileaks.org/cablegate.html">this page</a>. It&#8217;s a <a href="http://www.postgresql.org/">PostgreSQL</a> data dump. Danish journalists seem to be completely occupied producing vacuous election coverage, so to help out, I&#8217;ve extracted the Denmark-related cables and am making them available as Google Spreadsheets/Fusion Tables.</p>
<p>The first set (<a href="https://docs.google.com/spreadsheet/ccc?key=0AtG1Me8drfGxdFVld080dnBLc1Q5NDhXYW1HR1h3WFE&#038;hl=en_US">link</a>) are cables (146 in all) from the US Embassy in Copenhagen, with all the &#8220;UNCLASSIFIED&#8221; ones filtered out (since they are typically trivial, if entertaining in their triviality). Here&#8217;s the query:</p>
<pre class="prettyprint">
copy (
	select *
	from cable
	where origin = 'Embassy Copenhagen'
		and classification not like '%UNCLASSIFIED%'
	order by date desc)
to 'C:/data/cph_embassy_confidential.csv' with csv header
</pre>
<p>The second set, at 1438 rows, (<a href="http://www.google.com/fusiontables/DataSource?dsrcid=1396282">link</a>) mention either &#8220;Denmark&#8221; or &#8220;Danish&#8221;, are from embassies other than the one in Copenhagen and are not &#8220;UNCLASSIFIED&#8221;. Query:</p>
<pre class="prettyprint">

copy (
	select *
	from cable
	where origin != 'Embassy Copenhagen'
		and classification not like '%UNCLASSIFIED%'
 		and (
 			content like '%Danish%' or
 			content like '%Denmark%'
 		)
	order by date desc
)
to 'C:/data/not_cph_embassy_confidential.csv'
	with csv header
	force quote content
	escape '"'
</pre>
<p><iframe width='600' height='300' frameborder='0' src='https://docs.google.com/spreadsheet/pub?hl=en_US&#038;hl=en_US&#038;key=0AtG1Me8drfGxdFVld080dnBLc1Q5NDhXYW1HR1h3WFE&#038;output=html&#038;widget=true'></iframe></p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/us-embassy-cables-related-to-denmark#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/us-embassy-cables-related-to-denmark/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>admin</name>
						<uri>http://</uri>
					</author>
		<title type="html"><![CDATA[Members of Danish VL Groups]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/members-of-danish-vl-groups" />
		<id>http://friism.com/?p=624</id>
		<updated>2011-08-28T03:02:14Z</updated>
		<published>2011-08-28T02:47:30Z</published>
		<category scheme="http://friism.com" term="Scraping" />		<summary type="html"><![CDATA[Denmark has a semi-formalised system of VL-groups. &#8220;VL&#8221; is short for &#8220;Virksomhedsleder&#8221; which translates to &#8220;business leader&#8221;. The groups select their own members, and the whole thing is organised by the Danish Society for Business Leadership. The groups are not composed only of business people &#8212; top civil servants and politicians are also members. The [...]]]></summary>
		<content type="html" xml:base="http://friism.com/members-of-danish-vl-groups"><![CDATA[<p>Denmark has a semi-formalised system of VL-groups. &#8220;VL&#8221; is short for &#8220;Virksomhedsleder&#8221; which translates to &#8220;business leader&#8221;. The groups select their own members, and the whole thing is organised by the <a href="http://vl.dk/">Danish Society for Business Leadership</a>.  The groups are not composed only of business people &#8212; top civil servants and politicians are also members. The groups meet up regularly to smoke weed, sing Kumbayah and talk about whatever people from those walks of life talk about when they get together.</p>
<p>Before doing what I <a href="http://appharbor.com/">currently do</a>, I worked for <a href="http://ekstrabladet.dk/">Ekstra Bladet</a>, a Danish tabloid. Other than giving Danes their daily dose of nekkid girls with fake boobs and keeping punters abreast of phone numbers of the freshest trafficked African prostitutes, Ekstra Bladet spends a lot of time holding Denmark&#8217;s high-and-mighty to account. To that end, I worked on building a database of influential people and celebrities so that we could automatically track when their names crop up in court documents and other official filings (scared yet, are we?). The VL-group members obviously belong in this database. Fortuitously, group membership is <a href="http://vl.dk/GRUPPEOVERSIGT-162676">published online</a> and is easily scraped.</p>
<p>In case you are interested, I&#8217;ve created a <a href="https://docs.google.com/spreadsheet/ccc?key=0AtG1Me8drfGxdEVaWm9nYm5lUV84YlM2UGdMMWhxbmc&amp;hl=en_US">Google Docs Spreadsheet</a> with the composition of the groups as of August 2011.  I&#8217;ve included only groups in Denmark proper &#8212; there are also overseas groups for Danish expatriates and groups operating in the Danish North Atlantic colonies. The spreadsheet (3320 members in all) is embedded at the bottom of this post.</p>
<p>Now, with this list in hand, any well-trained Ekstra Bladet employee will be brainstorming what sort of other outrage can be manufactured from the group membership data. How about looking at the gender distribution of the members? (At this point I&#8217;d like to add a disclaimer: I personally don&#8217;t care whether the VL-groups are composed primarily of men, women or transgendered garden gnomes so I dedicate the following to <a href="http://www.hovedetpaabloggen.dk/">Trine Maria Kristensen</a>. Also, an Ekstra Bladet journalist <a href="http://ekstrabladet.dk/nyheder/samfund/article1492991.ece">wrote this story</a> up some months after I left, but I wanted to make the underlying data available).</p>
<p>To determine the gender of each group member, I used the Department of Family Affairs <a href="http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/">lists of boys&#8217; and girls&#8217; given names</a> (yes, the Socialist People&#8217;s Kingdom of Denmark gives parents lists of pre-approved names to choose from when naming their children). Some of the names are ambiguous (e.g. Kim and Bo are permitted for both boys and girls). For these names, the gender determination chooses what I deem to be the most common gender for that name in Denmark.</p>
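<p>A minimal sketch of that lookup, written in JavaScript for illustration: the tiny name lists, the tie-break table and the function name <code>guessGender</code> are all assumptions made for this example (the real lists are much longer, and the tie-breaks shown are not necessarily the ones used for the spreadsheet).</p>
<pre class="prettyprint">
// Illustrative name lists only; the approved-name lists are far longer.
var boyNames = new Set(['anders', 'jakob', 'kim', 'bo']);
var girlNames = new Set(['trine', 'maria', 'kim', 'bo']);

// Assumed tie-breaks for names that appear on both lists.
var ambiguousDefaults = { kim: 'male', bo: 'male' };

function guessGender(fullName) {
	// Use the first given name, lower-cased, as the lookup key.
	var firstName = fullName.trim().split(/\s+/)[0].toLowerCase();
	var isBoy = boyNames.has(firstName);
	var isGirl = girlNames.has(firstName);

	if (isBoy === isGirl) {
		// Either on both lists (ambiguous) or on neither (unknown).
		return isBoy ? (ambiguousDefaults[firstName] || 'unknown') : 'unknown';
	}
	return isBoy ? 'male' : 'female';
}

console.log(guessGender('Trine Maria Kristensen'));
console.log(guessGender('Kim Hansen'));
</pre>
<p>Names that match neither list come back as <code>unknown</code> in this sketch, which makes it easy to spot entries that need a manual look.</p>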
<p>Overall, there are 505 females out of 3320 group members (15.2%). 8 of the 95 groups have no women at all (groups <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=25">25</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=28">28</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=52">52</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=61">61</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=63">63</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=69">69</a>, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=104">104</a> and <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=115">115</a>). 12 groups include a single woman, while 6 have two. There is also a single all-female troupe, <a href="http://iframe.vl.dk/vlgruppe.php?area=K%C3%B8benhavn&amp;id=107">VL Group 107</a>.</p>
<p>Please take advantage of the <a href="https://docs.google.com/spreadsheet/ccc?key=0AtG1Me8drfGxdEVaWm9nYm5lUV84YlM2UGdMMWhxbmc&amp;hl=en_US">data</a> below to come up with other interesting analyses of the group compositions.</p>
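<p>Tallies like these are easy to reproduce once every member row carries a group number and a gender. The following JavaScript is only a sketch: the sample rows and the helper names are invented for the example, and only the 505 and 3320 figures come from the actual data.</p>
<pre class="prettyprint">
// Hypothetical rows: one per member, with a gender already assigned.
var members = [
	{ group: 1, gender: 'male' },
	{ group: 1, gender: 'female' },
	{ group: 25, gender: 'male' },
	{ group: 25, gender: 'male' },
	{ group: 107, gender: 'female' }
];

// Count women and total members for each group.
function womenPerGroup(rows) {
	var counts = {};
	rows.forEach(function (row) {
		if (!(row.group in counts)) {
			counts[row.group] = { women: 0, total: 0 };
		}
		counts[row.group].total += 1;
		if (row.gender === 'female') {
			counts[row.group].women += 1;
		}
	});
	return counts;
}

// List the groups that have exactly n women.
function groupsWithWomenCount(counts, n) {
	return Object.keys(counts).filter(function (group) {
		return counts[group].women === n;
	});
}

function percentage(part, whole) {
	return (100 * part / whole).toFixed(1) + '%';
}

var counts = womenPerGroup(members);
console.log(groupsWithWomenCount(counts, 0)); // groups with no women
console.log(percentage(505, 3320));           // prints "15.2%"
</pre>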
<p><iframe width='770' height='800' frameborder='0' src='https://docs.google.com/spreadsheet/pub?hl=en_US&#038;hl=en_US&#038;key=0AtG1Me8drfGxdEVaWm9nYm5lUV84YlM2UGdMMWhxbmc&#038;output=html&#038;widget=true'></iframe></p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/members-of-danish-vl-groups#comments" thr:count="2" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/members-of-danish-vl-groups/feed/atom" thr:count="2" />
		<thr:total>2</thr:total>
	</entry>
		<entry>
		<author>
			<name>friism</name>
						<uri>http://www.itu.dk/~friism/blog/</uri>
					</author>
		<title type="html"><![CDATA[Non-trivial Facebook FQL example]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/non-trivial-facebook-fql-example" />
		<id>http://friism.com/?p=607</id>
		<updated>2010-11-12T22:16:39Z</updated>
		<published>2010-11-11T21:00:39Z</published>
		<category scheme="http://friism.com" term="Facebook" /><category scheme="http://friism.com" term="fql" /><category scheme="http://friism.com" term="newmediadays" />		<summary type="html"><![CDATA[This post will demonstrate a few non-trivial FQL calls from Javascript, including batching interdependent queries in one request. The example queries all participants of a public Facebook event and gets their names and any public status updates they&#8217;ve posted recently. It then goes on to find all friend-relations between the event participants and graphs those [...]]]></summary>
		<content type="html" xml:base="http://friism.com/non-trivial-facebook-fql-example"><![CDATA[<p>This post will demonstrate a few non-trivial <a href="http://developers.facebook.com/docs/reference/fql/">FQL</a> calls from Javascript, including batching interdependent queries in one request. The example queries all participants of a <a href="http://www.facebook.com/event.php?eid=113151515408991">public Facebook event</a> and gets their names and any public status updates they&#8217;ve posted recently. It then goes on to find all friend-relations between the event participants and graphs those with an <a href="http://thejit.org/">InfoVis</a> <a href="http://thejit.org/static/v20/Jit/Examples/Hypertree/example1.html">Hypertree</a>. I haven&#8217;t spent time on browser-compatibility in result-rendering (sorry!), but the actual queries work fine across browsers. You can try out the <a href="http://friism.com/new-media-days-facebook-demo">example here</a>. The network-graph more-or-less only works in Google Chrome.</p>
<p>The demo was created for a session I did with <a href="http://flopper.dk/">Filip Wahlberg</a> at the <a href="http://newmediadays.dk/">New Media Days conference</a>. The session was called &#8220;<a href="http://newmediadays.dk/michael-friis">Hack it, Mash it</a>&#8221; and involved us showing off some of the stuff we do at <a href="http://ekstrabladet.dk">ekstrabladet.dk</a> and then demonstrating what sort of info can be pulled from Facebook. Amanda Cox was <a href="http://newmediadays.dk/amanda-cox">on the next morning</a> and pretty much obliterated us with all the great interactive visualizations the New York Times builds, but that was all right.</p>
<p>Anyway, on to the code. Here are the three queries:</p>
<pre class="prettyprint">
var eventquery = FB.Data.query(
	'select uid from event_member ' +
	'where rsvp_status in ("attending", "unsure") ' +
		'and eid = 113151515408991 '
);

var userquery = FB.Data.query(
	'select uid, name from user ' +
	'where uid in ' +
		'(select uid from {0})'
	, eventquery
);

var streamquery = FB.Data.query(
	'select source_id, message from stream ' +
	'where ' +
	'updated_time > "2010-11-04" and ' +
	'source_id in ' +
		'(select uid from {0}) ' +
	'limit 1000 '
	, eventquery
);

FB.Data.waitOn([eventquery, userquery, streamquery],
	function () {
		// do something interesting with the data
	}
);
</pre>
<p>Once the function passed to <code>waitOn</code> executes, all the queries have executed and results are available. The neat thing is that <code>FB.Data</code> bundles the queries so that, even though the last two queries depend on the result of the first one to execute, the browser only does one request.  Facebook limits the number of results returned from queries on the <a href="http://developers.facebook.com/docs/reference/fql/stream">stream</a> table (which stores status updates and similar). Passing a clause on &#8216;updated_time&#8217; seems to arbitrarily increase this number.</p>
<p>So now that we have the uids of all the attendees, how do we get the friend-relations between those Facebook users? Generally, Facebook won&#8217;t give you the friends of a random user without your app first getting permission from said user. Facebook will, however, tell you whether any two users are friends, and this is done by querying the <a href="http://developers.facebook.com/docs/reference/fql/friend">friend</a> table. So I wrote this little query which handily gets all the relations in a set of uids. Assume you&#8217;ve stored all the uids in an array:</p>
<pre class="prettyprint">
var uidstring = uids.join(",");
var friendquery = FB.Data.query(
	'select uid1, uid2 ' +
	'from friend ' +
	'where uid1 in ({0}) and uid2 in ({0})'
	, uidstring
);

FB.Data.waitOn([friendquery], function () {
	// do something with the relations, like draw a graph
});
</pre>
<p>Neat huh? The full script can be found here: <a href="http://friism.com/wp-content/uploads/2010/11/nmdscript.js">http://friism.com/wp-content/uploads/2010/11/nmdscript.js</a></p>
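<p>To draw a graph from the friend query, the rows first have to be turned into some kind of node-to-neighbours structure. Here is one possible sketch, assuming each result row is an object with <code>uid1</code> and <code>uid2</code> properties (the columns selected above); <code>buildAdjacency</code> and the sample rows are made up for illustration and are not taken from the full script.</p>
<pre class="prettyprint">
// rows: objects with uid1 and uid2, the columns selected by the friend query.
function buildAdjacency(rows) {
	var adjacency = {};
	function link(a, b) {
		if (!adjacency[a]) {
			adjacency[a] = new Set();
		}
		adjacency[a].add(b);
	}
	rows.forEach(function (row) {
		var a = String(row.uid1);
		var b = String(row.uid2);
		// Record the friendship in both directions; the Set absorbs duplicates
		// if the result happens to contain both (a, b) and (b, a).
		link(a, b);
		link(b, a);
	});
	return adjacency;
}

// Made-up sample rows.
var sample = [
	{ uid1: 1, uid2: 2 },
	{ uid1: 2, uid2: 1 },
	{ uid1: 2, uid2: 3 }
];
var graph = buildAdjacency(sample);
console.log(Array.from(graph['2']).sort()); // neighbours of uid 2
</pre>
<p>Keeping the uids as strings avoids precision problems with large numeric ids, and a structure like this can be walked to produce whatever node format the graphing library wants.</p>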
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/non-trivial-facebook-fql-example#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/non-trivial-facebook-fql-example/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
		<entry>
		<author>
			<name>friism</name>
						<uri>http://www.itu.dk/~friism/blog/</uri>
					</author>
		<title type="html"><![CDATA[Wikileaks Iraq wardiaries data quality]]></title>
		<link rel="alternate" type="text/html" href="http://friism.com/wikileaks-iraq-wardiaries-data-quality" />
		<id>http://friism.com/?p=585</id>
		<updated>2010-11-03T08:33:10Z</updated>
		<published>2010-10-23T14:47:51Z</published>
		<category scheme="http://friism.com" term="Geocoding" /><category scheme="http://friism.com" term="iraq" /><category scheme="http://friism.com" term="wikileaks" />		<summary type="html"><![CDATA[tl;dr: The Wikileaks Iraq data is heavily redacted (by Wikileaks presumably) compared to the Afghanistan data: Names &#8212; of persons, bases, units and more &#8212; have been purged from the &#8220;Title&#8221; and &#8220;Summary&#8221; column-texts and the precision of geographical coordinates has been truncated. This makes both researching and visualizing the Iraq data somewhat difficult. (this [...]]]></summary>
		<content type="html" xml:base="http://friism.com/wikileaks-iraq-wardiaries-data-quality"><![CDATA[<p>tl;dr: The Wikileaks Iraq data is heavily redacted (by Wikileaks presumably) compared to the Afghanistan data: Names &#8212; of persons, bases, units and more &#8212; have been purged from the &#8220;Title&#8221; and &#8220;Summary&#8221; column-texts and the precision of geographical coordinates has been truncated. This makes both researching and visualizing the Iraq data somewhat difficult.</p>
<p>(this is a cross-post from the <a href="http://bits.ekstrabladet.dk/wikileaks-iraq-wardiaries-data-quality">Ekstra Bladet Bits blog</a>)</p>
<p>Ekstra Bladet received the Iraq data from Wikileaks some time before the embargo lifted on Friday, October 22, at 23:00 (Danish time). We knew the dump was going to be in the exact same format as the Afghanistan one, so loading the data was a snap. When we started running some of the same research-scripts used on the Afghanistan data, however, it quickly became clear that something was amiss. For example, we could only find a single report mentioning Danish involvement (namely the &#8220;Danish Demining Group&#8221;) in the Iraq War. We had drawn up a list of persons, companies and places of interest, but searches for these also turned up nothing. A quick perusal of a few sample reports revealed that almost all identifying names have been purged from report texts.</p>
<p><strong>Update:</strong> It turns out that Ekstra Bladet got the redacted version of the data from Wikileaks. Apparently some 6 international news organisations (and the Danish newspaper <a href="http://www.information.dk/">Information</a>) got the full, unredacted data. They won&#8217;t be limited in the ways mentioned below.</p>
<p>This caused us to temporarily abandon the search for interesting individual events and instead try to visualize the events in aggregate using maps. I had readied a heatmap tile-renderer which &#8212; when fed the Afghanistan data &#8212; produces really nice zoomable heatmaps overlaid on Google Maps. When loaded with the Iraq data however, the heatmap tiles had strange artifacts. This turns out to be because the report geo-coordinate-precision has been truncated. We chose not to publish the heatmap, but the effect is also evident on this Google Fusion Tables-based <a href="http://ekstrabladet.dk/nyheder/krigogkatastrofer/article1436517.ece">map of IED-attacks</a> (article text in Danish). The geo-precision truncation makes it impossible to produce something like the <a href="http://www.guardian.co.uk/world/datablog/2010/jul/26/wikileaks-afghanistan-ied-attacks">Guardian IED heatmap</a>, demonstrating IED-attacks hugging roads and major cities.</p>
<div id="attachment_589" class="wp-caption alignnone" style="width: 600px"><a href="http://friism.com/wp-content/uploads/2010/10/2010-10-23-16h23_10.png"><img class="size-full wp-image-589" title="Artifacts due to geo-precision blurring" src="http://friism.com/wp-content/uploads/2010/10/2010-10-23-16h23_10.png" alt="" width="590" height="286" /></a><p class="wp-caption-text">Artifacts due to geo-precision blurring</p></div>
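<p>To see why truncated coordinates produce artifacts like these, consider what happens when decimals are dropped: points that were kilometres apart snap onto the same grid cell. The sketch below uses made-up coordinates and an arbitrary choice of one retained decimal; how many decimals were actually kept in the Iraq data is not something established here.</p>
<pre class="prettyprint">
// Truncate a coordinate to a fixed number of decimal places.
function truncate(value, decimals) {
	var factor = Math.pow(10, decimals);
	return Math.trunc(value * factor) / factor;
}

// Made-up points a few kilometres apart, not taken from the reports.
var points = [
	{ lat: 33.3128, lon: 44.3615 },
	{ lat: 33.3391, lon: 44.3942 },
	{ lat: 33.3877, lon: 44.3106 }
];

// Count how many distinct positions remain after truncation.
function distinctCells(rows, decimals) {
	var cells = new Set();
	rows.forEach(function (p) {
		cells.add(truncate(p.lat, decimals) + ',' + truncate(p.lon, decimals));
	});
	return cells.size;
}

console.log(distinctCells(points, 4)); // every point keeps its own position
console.log(distinctCells(points, 1)); // all points collapse onto one cell
</pre>
<p>A heatmap fed with the collapsed points shows isolated hot spots on a regular grid rather than a smooth density that follows roads and cities.</p>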
<p>We did manage to produce some body count-based articles before the embargo. Creating simple infographics showing report- and attack-frequency over time is also possible. Looking at the reports, it is also fairly easy to establish that Iraqi police mistreated prisoners. Danish soldiers are known to have handed over prisoners to Iraqi police (via British troops), making this significant in a Danish context. We have &#8212; however &#8212; not been able to use the reports to scrutinize the Danish involvement in the Iraq war in the same depth that we could with the Afghanistan data.</p>
<p>We initially thought the redactions were only for the pre-embargo data dump and that an unredacted dataset might become available post-embargo. That seems not to be the case though, since the reports <a href="http://warlogs.wikileaks.org/iraq/diarydig">Wikileaks published</a> online after the embargo are also redacted.</p>
<p>I&#8217;m not qualified to say whether the redactions in the Iraq reports are necessary to protect the individuals mentioned in them. It is worth noting that the Pentagon itself found that <a href="http://www.wired.com/dangerroom/2010/10/doc-of-the-day-wikileaks-didnt-blow-u-s-afghan-intel-sources/">no sources were revealed by the Afghanistan leak</a>. The Iraq leak is a great resource for documenting the brutality of the war there, but the redactions do make it difficult to make sense of individual events.</p>
]]></content>
		<link rel="replies" type="text/html" href="http://friism.com/wikileaks-iraq-wardiaries-data-quality#comments" thr:count="0" />
		<link rel="replies" type="application/atom+xml" href="http://friism.com/wikileaks-iraq-wardiaries-data-quality/feed/atom" thr:count="0" />
		<thr:total>0</thr:total>
	</entry>
	</feed>
