<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robert Grossman</title>
	<atom:link href="http://rgrossman.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://rgrossman.com</link>
	<description>Robert Grossman&#039;s home page</description>
	<lastBuildDate>Fri, 29 Nov 2013 21:22:46 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Bionimbus Protected Data Cloud (PDC) Update</title>
		<link>http://rgrossman.com/2013/09/25/bionimbus-pdc-update/</link>
		<comments>http://rgrossman.com/2013/09/25/bionimbus-pdc-update/#comments</comments>
		<pubDate>Wed, 25 Sep 2013 18:09:54 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=367</guid>
		<description><![CDATA[The Bionimbus Protected Data Cloud (PDC) is an open source petabyte-scale cloud that is designed to manage, analyze and share large genomic datasets for the research community in a secure and compliant fashion. The Bionimbus now contains all of the &#8230; <a href="http://rgrossman.com/2013/09/25/bionimbus-pdc-update/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://bionimbus.opensciencedatacloud.org/">Bionimbus Protected Data Cloud</a> (PDC) is an open source petabyte-scale cloud that is designed to manage, analyze and share large genomic datasets for the research community in a secure and compliant fashion.  The Bionimbus now contains all of the data available to date from <a href="http://cancergenome.nih.gov/">The Cancer Genome Atlas</a> (TCGA).   Today, this is over 600 TB of data and will grow over the next two years to over 2.5 PB.  This includes both the controlled access BAM files containing the genomic data, as well as the open access aggregated data derived from the BAM files.</p>
<p>I&#8217;ll be giving a talk today about the Bionimbus PDC at the O&#8217;Reilly <a href="http://strataconf.com/rx2013">Strata Health Rx</a> Conference in Boston.</p>
<p><a href="http://strataconf.com/rx2013?cmp=ba-strata-strx13-speaker-banner-125-125"><br />
	<img src="http://cdn.oreillystatic.com/en/assets/1/event/98/rx2013_speaking_125x125.png" width="125" height="125"  border="0"  alt="Strata Rx Conference 2013"  /><br />
</a></p>
<p>To analyze TCGA data using the Bionimbus PDC, you will need the required approvals from <a href="http://www.ncbi.nlm.nih.gov/gap">dbGaP</a>.  Any researcher authorized to analyze controlled access TCGA data is welcome to use modest amounts of compute and storage resources on the PDC.  If you need additional resources, you can apply for a PDC research allocation.  </p>
<p>Please contact us if you would like to contribute some data to the PDC, have a project that would like to join the PDC, or have a biomedical cloud that would like to interoperate with the PDC.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2013/09/25/bionimbus-pdc-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Tool for Keeping Big Data in Sync</title>
		<link>http://rgrossman.com/2013/03/20/a-tool-for-keeping-big-data-in-sync/</link>
		<comments>http://rgrossman.com/2013/03/20/a-tool-for-keeping-big-data-in-sync/#comments</comments>
		<pubDate>Wed, 20 Mar 2013 02:08:59 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data transport]]></category>
		<category><![CDATA[UDR]]></category>
		<category><![CDATA[UDT]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=350</guid>
		<description><![CDATA[The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed to keep two large datasets synchronized when they are separated by a long distance. Over the past couple of years, we developed &#8230; <a href="http://rgrossman.com/2013/03/20/a-tool-for-keeping-big-data-in-sync/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed to keep two large datasets synchronized when they are separated by a long distance.  Over the past couple of years, we developed a utility called UDR at the Laboratory for Advanced Computing at the University of Chicago which integrates rsync with the high performance network protocol <a href="http://udt.sf.net">UDT</a>.</p>
<p><a href="http://udt.sf.net">UDT</a> is a reliable UDP-based protocol that was designed to move large datasets over wide area, high performance networks.  UDT is open source and has been used as the basis for over six commercial products.</p>
<p>UDR is open source and available from <a href="https://github.com/LabAdvComp/UDR">github</a>. </p>
<p>Here are the results of some tests conducted by Erich Weiler from the University of California at Santa Cruz moving genomic data:</p>
<table padding=2 border=1>
<tr>
<th>Source</th>
<th>Destination</th>
<th>UDR</th>
<th>rsync</th>
</tr>
<tr>
<td>Santa Cruz</td>
<td>Milwaukee</td>
<td>500 Mb/s</td>
<td>160 Mb/s</td>
</tr>
<tr>
<td>Santa Cruz</td>
<td>Detroit</td>
<td>600 Mb/s</td>
<td>150 Mb/s</td>
</tr>
<tr>
<td>Santa Cruz</td>
<td>Bielefeld</td>
<td>600 Mb/s</td>
<td>6 Mb/s</td>
</tr>
<tr>
<td>Santa Cruz</td>
<td>Aarhus</td>
<td>350 Mb/s</td>
<td>6 Mb/s</td>
</tr>
<tr>
<td>Santa Cruz</td>
<td>Brisbane</td>
<td>550 Mb/s</td>
<td>3 Mb/s</td>
</tr>
</table>
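<p>To make the comparison concrete, the ratio of UDR to rsync throughput can be computed directly from the table above.  This is a minimal sketch in Python; the variable and function names are illustrative and are not part of UDR.</p>
<pre>
```python
# Throughput figures in Mb/s copied from the table above: (UDR, rsync).
# All transfers originate in Santa Cruz.
results = {
    "Milwaukee": (500, 160),
    "Detroit": (600, 150),
    "Bielefeld": (600, 6),
    "Aarhus": (350, 6),
    "Brisbane": (550, 3),
}


def speedup(udr_mbps, rsync_mbps):
    """Return how many times faster UDR was than rsync."""
    return udr_mbps / rsync_mbps


speedups = {dest: speedup(*rates) for dest, rates in results.items()}

for dest, ratio in speedups.items():
    print(f"Santa Cruz to {dest}: UDR is about {ratio:.1f}x faster")
```
</pre>
<p>In these tests the advantage is roughly 3x to 4x for the two U.S. destinations and ranges from about 58x to about 183x for the destinations in Europe and Australia.</p>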
<p>Allison Heath is the Project Lead for UDR.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2013/03/20/a-tool-for-keeping-big-data-in-sync/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Do You Want Hands On Experience Working with Big Data?</title>
		<link>http://rgrossman.com/2013/02/15/osdc-pire-2013-fellowships/</link>
		<comments>http://rgrossman.com/2013/02/15/osdc-pire-2013-fellowships/#comments</comments>
		<pubDate>Fri, 15 Feb 2013 01:46:32 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data science education]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=340</guid>
		<description><![CDATA[If you are a graduate student or post-doc interested in improving your big data skills, you might want to consider applying for an Open Science Data Cloud (OSDC) PIRE 2013 Fellowship. These fellowships are supported by the NSF PIRE Program &#8230; <a href="http://rgrossman.com/2013/02/15/osdc-pire-2013-fellowships/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>If you are a graduate student or post-doc interested in improving your big data skills, you might want to consider applying for an Open Science Data Cloud (<a href="http://www.opensciencedatacloud.org">OSDC</a>) PIRE 2013 Fellowship. These fellowships are supported by the NSF PIRE Program and provide support for up to eight weeks of work.</p>
<p>The OSDC allows researchers to compute over 1 PB of scientific data from a variety of scientific disciplines.</p>
<p>We provide a big data bootcamp for OSDC PIRE Fellows.   OSDC PIRE Fellows then spend time working with one of the OSDC foreign collaborators on a variety of projects, including:</p>
<ul>
<li>Expanding the OSDC to other countries. </li>
<li>Developing infrastructure so that the OSDC can interoperate with science clouds in other countries.</li>
<li>Working on the OSDC software infrastructure. </li>
<li>Developing domain specific OSDC applications in the biological sciences, earth sciences, social sciences, or digital humanities.  </li>
</ul>
<p>To apply for an OSDC PIRE Fellowship, please fill out the application <a href="http://news.opensciencedatacloud.org/pire-training/osdc-pire-application/">here</a>.  Only U.S. citizens or permanent residents are eligible for OSDC PIRE Fellowships.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2013/02/15/osdc-pire-2013-fellowships/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Unreasonable Effectiveness of Consensus Labeling</title>
		<link>http://rgrossman.com/2012/12/21/the-unreasonable-effectiveness-of-consensus-labeling/</link>
		<comments>http://rgrossman.com/2012/12/21/the-unreasonable-effectiveness-of-consensus-labeling/#comments</comments>
		<pubDate>Fri, 21 Dec 2012 13:32:50 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=325</guid>
		<description><![CDATA[The majority of large datasets are unlabeled, while the majority of machine learning algorithms that you are likely to use require labeled data. Of course this is a simplification, but it captures quite well my experience in practice. One approach &#8230; <a href="http://rgrossman.com/2012/12/21/the-unreasonable-effectiveness-of-consensus-labeling/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The majority of large datasets are unlabeled, while the majority of machine learning algorithms that you are likely to use require labeled data.  Of course this is a simplification, but it captures quite well my experience in practice.</p>
<p>One approach that we used in a recent research project is what you might call <em>consensus labeling</em>.  Here is a high level outline of the approach:</p>
<ol>
<li>Select three or more high quality classifiers that have been trained on (small amounts of) labeled data.  These classifiers will be used in the next step to assign labels to unlabeled data.  </li>
<li>Apply the ensemble of classifiers to a large dataset of unlabeled data to create a labeled dataset. Labels can be assigned either by using a majority vote or by only labeling those records in which the classifiers all agree (a consensus).</li>
<li>From this larger labeled dataset, train and validate a classifier or other machine learning algorithm. </li>
</ol>
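<p>The labeling step in the outline above can be sketched in a few lines of Python.  This is only an illustration under simple assumptions, not the code used in the project: the three toy classifiers are hypothetical stand-ins for trained models, and the function names are made up for this example.</p>
<pre>
```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of classifiers, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if 2 * count > len(votes) else None


def consensus_label(votes):
    """Return the label only if every classifier agrees, else None."""
    return votes[0] if len(set(votes)) == 1 else None


def label_dataset(records, classifiers, rule):
    """Apply every classifier to every record; keep records the rule can label."""
    labeled = []
    for record in records:
        votes = [clf(record) for clf in classifiers]
        label = rule(votes)
        if label is not None:
            labeled.append((record, label))
    return labeled


# Toy stand-ins for trained classifiers (hypothetical, for illustration only).
classifiers = [
    lambda word: "NOUN" if word.endswith("s") else "VERB",
    lambda word: "NOUN" if word[0].isupper() or word.endswith("s") else "VERB",
    lambda word: "NOUN" if len(word) > 4 else "VERB",
]

words = ["genes", "run", "Cell", "bind"]
print(label_dataset(words, classifiers, majority_label))
print(label_dataset(words, classifiers, consensus_label))
```
</pre>
<p>With the majority rule all four words receive a label; with the consensus rule the word on which the classifiers disagree ("Cell") is dropped.  The resulting list of labeled records is what would be used to train and validate the final model.</p>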
<p>The goal of the project was to explore a class of algorithms that each night could use a large computing infrastructure (in our case the Open Cloud Consortium&#8217;s petabyte-scale <a href="http://opencloudconsortium.org/2012/06/11/open-cloud-consortium-offers-clouds-for-science/">OCC-Y Cloud</a>) to analyze an ever changing collection of text documents and build a new model for entity extraction, part of speech tagging, etc. </p>
<p>The project was joint work with Andrey Rzhetsky and Shi Yu, and I have described just a small part of it.  You can find more details in the paper: Shi Yu, Robert Grossman and Andrey Rzhetsky, Global and Local Approach of Part-of-Speech Tagging for Large Corpora, Information Retrieval and Knowledge Discovery in Biomedical Text: Papers from the 2012 AAAI Fall Symposium, AAAI Press, Menlo Park, California, 2012. <a href="http://www.aaai.org/ocs/index.php/FSS/FSS12/paper/viewPaper/5638">pdf</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/12/21/the-unreasonable-effectiveness-of-consensus-labeling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Open Science Data Cloud &#8211; Two Year Update</title>
		<link>http://rgrossman.com/2012/11/20/osdc-2012-update/</link>
		<comments>http://rgrossman.com/2012/11/20/osdc-2012-update/#comments</comments>
		<pubDate>Tue, 20 Nov 2012 11:34:34 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[big data biology]]></category>
		<category><![CDATA[data science]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=306</guid>
		<description><![CDATA[I just got back from SC 12, which took place in Salt Lake City this year. We shared our research booth with the Open Science Data Cloud (OSDC) and with the (ICAIR) Research Center from Northwestern University. The OSDC just &#8230; <a href="http://rgrossman.com/2012/11/20/osdc-2012-update/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I just got back from SC 12, which took place in Salt Lake City this year.  We shared our research booth with the Open Science Data Cloud  (<a href="http://www.opensciencedatacloud.org">OSDC</a>) and with the International Center for Advanced Internet Research (<a href="http://www.icair.org/">iCAIR</a>) at Northwestern University.  </p>
<p>The OSDC just turned two years old.  It is a petabyte scale science cloud for researchers to manage, analyze and share their data and to get easy access to data from other scientists.  The OSDC is operated by the not-for-profit Open Cloud Consortium (<a href="http://www.opencloudconsortium.org">OCC</a>), which is taking a long term point of view in how to build and operate cloud-based infrastructure to serve the needs of researchers.</p>
<p><a href="http://rgrossman.com/files/2012/11/OSDC_CMYK_Vert-01.jpg"><img src="http://rgrossman.com/files/2012/11/OSDC_CMYK_Vert-01-300x116.jpg" alt="Open Science Data Cloud" title="OSDC" width="300" height="116" class="alignleft size-medium wp-image-312" /></a></p>
<p>There is now over 800 TB of <a href="http://www.opensciencedatacloud.org/publicdata/">data available</a> to the research community in the OSDC.  It should be 1 PB by the end of the year, and will grow significantly during 2013.  Any researcher may apply for an account to compute over this data (we have an allocation committee that selects projects).  Small usage of the OSDC is free, and we ask larger projects to pay for the costs of using the facility.  In particular, we will work with projects to help them write the OSDC into their grants to pay for their usage.  A model that has proved useful is for projects to request one or more racks from their funding agency each year, which the OSDC can operate.  </p>
<p>We are currently buying racks that contain about 575 cores, 2.3 TB of RAM and 1 PB of raw storage.  These cost about $250,000 and provide about 5,000,000 core hours of compute each year.  We have developed software and services so that we can (generally) run our racks remotely and lights out.</p>
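<p>The figures above can be checked with a little arithmetic.  The sketch below is only a back-of-the-envelope calculation: it assumes every core runs for all 8,760 hours of a year, uses decimal units, and the last figure divides the purchase price alone by a single year of core hours (an assumption made here for illustration; power, space and staff are not included).</p>
<pre>
```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

cores_per_rack = 575
ram_per_rack_tb = 2.3
rack_cost_usd = 250_000

# Core hours available if every core runs all year.
core_hours_per_year = cores_per_rack * HOURS_PER_YEAR

# RAM per core in GB, using decimal units (1 TB = 1,000 GB).
ram_per_core_gb = ram_per_rack_tb * 1000 / cores_per_rack

# Assumption for illustration: purchase price only, spread over one year.
purchase_cost_per_core_hour = rack_cost_usd / core_hours_per_year

print(f"Core hours per rack per year: {core_hours_per_year:,}")
print(f"RAM per core: {ram_per_core_gb:.1f} GB")
print(f"Purchase cost per core hour (one year): ${purchase_cost_per_core_hour:.3f}")
```
</pre>
<p>This gives 5,037,000 core hours per year, which matches the figure of about 5,000,000 quoted above, and works out to 4 GB of RAM per core.</p>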
<p>From 2008 to 2010, we operated a proof-of-concept infrastructure that consisted of four distributed data centers connected with a wide area 10G network, with an infrastructure that was a mixture of <a href="http://hadoop.apache.org">Hadoop</a>, <a href="http://sector.sf.net">Sector</a>, and <a href="http://www.eucalyptus.com">Eucalyptus</a>, and running cloud-based applications from several scientific  disciplines.   You can find a description of the OSDC in 2010 in this <a href="http://dl.acm.org/citation.cfm?id=1851533">paper</a>.</p>
<p>From 2010 to the present, we have operated a production cloud-based infrastructure serving several scientific disciplines, including the biological sciences with <a href="http://bionimbus.opensciencedatacloud.org">Bionimbus</a>; the earth sciences with <a href="http://matsu.opensciencedatacloud.org">Matsu</a>; and the <a href="http://arxiv.culturomics.org/">digital humanities</a>.  The infrastructure is now based primarily on <a href="http://hadoop.apache.org">Hadoop</a> and <a href="http://www.openstack.org">OpenStack</a>.  We use <a href="http://udt.sf.net">UDT</a> and <a href="https://github.com/LabAdvComp/UDR/tree/master/udt">UDR</a> to transport and synchronize large, terabyte-size datasets, and we are now beginning to use 100G network connections.  We presented an update about the OSDC at the Data Cloud 2012 Workshop at SC 12.  You can find the paper <a href="http://www.cse.buffalo.edu/faculty/tkosar/datacloud2012/papers/datacloud2012_paper_1.pdf">here</a>.</p>
<p>We are planning for a scale up of the OSDC beginning in approximately 2015-2016, when we hope to open a small boutique 3-5 MW data center, called the Burnham Center for Knowledge Discovery, to house the OSDC.  We are raising $1M to help plan for the Burnham Center, which we expect to cost between $15M and $25M (including the first several years of operating expenses), depending upon the scale. </p>
<p>The software for the OSDC is all open source.  We are always looking for volunteers, especially those that can help develop code, or help operate the OSDC, and for donors, especially those that can provide equipment, funds for operating expenses, and funds for the planning or operations of the Burnham Center.  Please write us at info at opencloudconsortium dot org if you are interested in getting involved.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/11/20/osdc-2012-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datascopes for the Long Tail of Science</title>
		<link>http://rgrossman.com/2012/10/12/datascopes-for-the-long-tail-of-science/</link>
		<comments>http://rgrossman.com/2012/10/12/datascopes-for-the-long-tail-of-science/#comments</comments>
		<pubDate>Fri, 12 Oct 2012 22:15:27 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=299</guid>
		<description><![CDATA[This week, I gave a talk at the GLIF Workshop in Chicago on what is being called the long tail of science. We have telescopes to study distant objects and microscopes to study small objects. A number of us who &#8230; <a href="http://rgrossman.com/2012/10/12/datascopes-for-the-long-tail-of-science/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This week, I gave a talk at the <a href="http://www.glif.is/meetings/2012/">GLIF Workshop</a> in Chicago on what is being called the long tail of science.</p>
<p>We have telescopes to study distant objects and microscopes to study small objects.  A number of us who are thinking about big data are asking: what is an appropriate instrument for studying big data?  In the talk, I discussed some of the lessons learned by <a href="http://www.opencloudconsortium.org">OCC</a> as we design and develop the <a href="http://www.opensciencedatacloud.org">Open Science Data Cloud</a>, which you can think of as one design of a datascope.</p>
<p>The talk was called &#8220;The Open Science Data Cloud: Empowering the Long Tail of Science&#8221; and you can find the slides on <a href="http://www.slideshare.net/rgrossman/the-open-science-data-cloud-empowering-the-long-tail-of-science">slideshare</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/10/12/datascopes-for-the-long-tail-of-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Managing and Analyzing 1,000,000 Genomes</title>
		<link>http://rgrossman.com/2012/09/18/million-genomes-challeng/</link>
		<comments>http://rgrossman.com/2012/09/18/million-genomes-challeng/#comments</comments>
		<pubDate>Tue, 18 Sep 2012 19:32:18 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[big data biology]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[genomics]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=292</guid>
		<description><![CDATA[I gave a talk last week at XLDB 2012 about Bionimbus, which is a cloud-based system for managing, analyzing, transporting, and sharing large genomics datasets in a secure and compliant fashion. Bionimbus was developed at the Institute for Genomics and &#8230; <a href="http://rgrossman.com/2012/09/18/million-genomes-challeng/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I gave a talk last week at <a href="http://www-conf.slac.stanford.edu/xldb2012/">XLDB 2012</a> about <a href="http://bionimbus.opensciencedatacloud.org">Bionimbus</a>, which is a cloud-based system for managing, analyzing, transporting, and sharing large genomics datasets in a secure and compliant fashion.  Bionimbus was developed at the Institute for Genomics and Systems Biology (<a href="http://www.igsb.org">IGSB</a>) and is used by IGSB and some of their collaborators to manage and analyze their next gen sequencing data.</p>
<p>We have been using Version 2.0 of Bionimbus for the past two years and are beginning the transition to Version 3.0.   In Version 3.0, we have factored out some of the services and made them more generally available to other <a href="http://www.opensciencedatacloud.org">Open Science Data Cloud (OSDC)</a> applications.  In particular, we have factored out the key service, which provides digital IDs for datasets, and the file and permissions services.</p>
<p>Recently, we have been thinking about what you might call the Million Genome Challenge.   Over the next several years, the <a href="http://www.cancer.gov/">National Cancer Institute</a>, and perhaps other organizations, will sequence a million genomes and use this data to increase our understanding of biological pathways and of genomic variation across individuals.  With this knowledge, we can begin to stratify diseases, leading to precision diagnosis and precision treatment that is personalized for individual patients.</p>
<p>The numbers associated with a million cancer genomes are worth thinking about.  The whole genome data for a tumor and a matching normal tissue sample require about 1 TB.  Thus, one million genomes require about 1,000,000 TB.  This is 1,000 PB or 1 EB.   Compressing the data might reduce the data by about a factor of 10.  Throwing away the alignment data and retaining only the variation data would reduce the data by a factor of about 100.  Assuming it costs about $1,000 to sequence each whole genome, the project as a whole requires about $1B for the sequencing.   It might require another $1B for the infrastructure and analysis.  Although obviously a large project, a project like this is likely to fundamentally alter the way we understand and treat diseases. </p>
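<p>The arithmetic behind these estimates is simple enough to write down.  The sketch below uses only the round numbers given above and decimal units (1 PB = 1,000 TB); the variable names are illustrative.</p>
<pre>
```python
genomes = 1_000_000
tb_per_genome = 1            # tumor plus matching normal sample, as estimated above
cost_per_genome_usd = 1_000  # assumed sequencing cost per whole genome

total_tb = genomes * tb_per_genome
total_pb = total_tb / 1_000  # 1 PB = 1,000 TB
total_eb = total_pb / 1_000  # 1 EB = 1,000 PB

compressed_pb = total_pb / 10       # compression: about a factor of 10
variation_only_pb = total_pb / 100  # variation data only: about a factor of 100

sequencing_cost_usd = genomes * cost_per_genome_usd

print(f"Raw data: {total_tb:,} TB = {total_pb:,.0f} PB = {total_eb:,.0f} EB")
print(f"Compressed: about {compressed_pb:,.0f} PB")
print(f"Variation data only: about {variation_only_pb:,.0f} PB")
print(f"Sequencing cost: about ${sequencing_cost_usd:,}")
```
</pre>
<p>Even after keeping only the variation data, the estimate is still about 10 PB.</p>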
<p>In my XLDB talk, I discuss Bionimbus, the Million Genome Challenge, and related topics.  You can find my talk at the conference <a href="http://www-conf.slac.stanford.edu/xldb2012/ProgramC.asp">web site</a>, or on my <a href="http://www.slideshare.net/rgrossman">slideshare</a> site.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/09/18/million-genomes-challeng/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why is it still so hard to analyze remote and distributed data?</title>
		<link>http://rgrossman.com/2012/07/09/remote-and-distributed-data/</link>
		<comments>http://rgrossman.com/2012/07/09/remote-and-distributed-data/#comments</comments>
		<pubDate>Mon, 09 Jul 2012 10:55:44 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data standards]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=251</guid>
		<description><![CDATA[A good question to think about is: If the web (of documents), which is built upon open standards around html (for describing documents) and http (for accessing documents), is so successful, why don&#8217;t we have a web of data, built &#8230; <a href="http://rgrossman.com/2012/07/09/remote-and-distributed-data/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A good question to think about is:</p>
<blockquote><p>If the web (of documents), which is built upon open standards around html (for describing documents) and http (for accessing documents), is so successful, why don&#8217;t we have a web of data, built upon open standards around xml (or something perhaps a bit more concise for describing data) and a protocol for accessing data (and metadata)?</p></blockquote>
<div id="attachment_285" class="wp-caption alignleft" style="width: 234px"><a href="http://rgrossman.com/files/2012/07/blue-brothers-mdw-s.jpg"><img src="http://rgrossman.com/files/2012/07/blue-brothers-mdw-s-224x300.jpg" alt="" title="Two Men Looking at Data" width="224" height="300" class="size-medium wp-image-285" /></a><p class="wp-caption-text">These two men are discussing why it is so difficult to work with remote and distributed data.  </p></div>
<p>About ten years ago, I published a paper called <a href="http://papers.rgrossman.com/grossman-dataspace-02.pdf">DataSpace</a> &#8211; A Web Infrastructure for the Exploratory Analysis and Mining of Data, that described an infrastructure called <em>DataSpace</em> for creating a web of data, that uses html and xml for describing data and metadata and a protocol we introduced called the dataspace transfer protocol or dstp for transferring data.  The key idea in DataSpace was to make it lightweight and minimal.   It was based upon distributed columns of data, each of which was attached to a key called a universal correlation key or UCK.   We developed reference implementations of dstp servers to serve columns of data, associated metadata, and associated UCKs.   Correlating distributed columns of data was simple and applications just used UCKs.  Discovery of data and metadata just used standard mechanisms.</p>
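<p>The idea of correlating distributed columns through a shared key can be illustrated with a small Python sketch.  This is not dstp and not the DataSpace reference implementation: the two columns, their station keys and their values are made up, and in practice each column could be served from a different site.</p>
<pre>
```python
# Each column maps a universal correlation key (UCK) to a value.
# The keys and values below are made-up illustration data.
temperature_by_station = {
    "station-001": 21.5,
    "station-002": 18.0,
    "station-003": 25.1,
}
rainfall_by_station = {
    "station-002": 3.2,
    "station-003": 0.0,
    "station-004": 7.9,
}


def correlate(*columns):
    """Join columns on the UCKs that every column shares."""
    shared_keys = set(columns[0]).intersection(*columns[1:])
    return {key: tuple(col[key] for col in columns) for key in sorted(shared_keys)}


joined = correlate(temperature_by_station, rainfall_by_station)
print(joined)
```
</pre>
<p>Only the keys present in both columns (here station-002 and station-003) appear in the result, which is all that an application needs in order to line up values from independently published columns.</p>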
<p>The W3C <a href="http://www.w3.org/2001/sw/">Semantic Web</a> effort, which was more ambitious, started at approximately the same time.  Despite millions of dollars of funding, it too hasn&#8217;t really caught on.  </p>
<p>It is an interesting exercise to try to think about why the semantic web, DataSpace, or any of the similar ideas haven&#8217;t caught on.</p>
<p>Today, we have <a href="http://linkeddata.org">linked data</a>, whose key concepts are relatively close to DataSpace.  Linked data is much simpler than the semantic web and is based upon four principles, which Tim Berners-Lee listed in his note <a href="http://www.w3.org/DesignIssues/LinkedData">Design Issues: Linked Data</a>:</p>
<ol>
<li>Use URIs to identify things.</li>
<li>Use HTTP URIs so that these things can be referred to and looked up (&#8220;dereferenced&#8221;) by people and user agents.</li>
<li>Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML. </li>
<li>Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.</li>
</ol>
<p>DataSpace is quite similar except it encourages the use of UCKs so that columns of data can be correlated.</p>
<p>More recently, <a href="http://www.infoblox.com/en/company/leadership.html">Stuart Bailey</a>, the Founder and CTO of Infoblox, has been working on <a href="http://www.if-map.org/">IF-MAP</a>, which is a standard for securely describing and accessing distributed collections of objects and their links, as well as metadata about objects and their links.  IF-MAP (an abbreviation for Interface for Metadata Access Points) is a Trusted Computing Group (<a href="http://www.trustedcomputinggroup.org/">TCG</a>) standard.</p>
<p>Stuart Bailey was part of the original DataSpace effort and IF-MAP is an interesting evolution of some of the key ideas in DataSpace.</p>
<p>It still seems like a great time to ask, why don&#8217;t we have a web of data supporting simple discovery, exploration, correlation and access?</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/07/09/remote-and-distributed-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Introduction to Big Data for the General Reader</title>
		<link>http://rgrossman.com/2012/06/06/an-introduction-to-big-data-for-general-reader/</link>
		<comments>http://rgrossman.com/2012/06/06/an-introduction-to-big-data-for-general-reader/#comments</comments>
		<pubDate>Wed, 06 Jun 2012 11:43:32 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=158</guid>
		<description><![CDATA[I recently finished a book about computing for general readers called &#8220;The Structure of Digital Computing: From Mainframes to Big Data,&#8221; which is available from Amazon. Chapter 5 is a non-technical introduction to big data, which you can download here. &#8230; <a href="http://rgrossman.com/2012/06/06/an-introduction-to-big-data-for-general-reader/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I recently finished a book about computing for general readers called &#8220;The Structure of Digital Computing: From Mainframes to Big Data,&#8221; which is available from <a href="http://www.amazon.com/The-Structure-Digital-Computing-Mainframes/dp/1936298007">Amazon</a>. </p>
<p>Chapter 5 is a non-technical introduction to  big data, which you can download <a href="http://rgrossman.com/files/2012/05/structure-chapter5.pdf" onClick="_gaq.push(['_trackEvent', 'Files', 'Downloads', 'Structure - Chapter 5']);">here</a>.</p>
<p>Although the term <em>big data</em> is relatively new, the discipline, which is sometimes called data intensive computing, is at least 20 years old.  One way to think of big data is similar to the way we think of high performance computing.  We tend to think of a high performance computer as a specialized computer that has 1,000x to 10,000x or more the computing power of a desktop computer (I&#8217;m thinking of processors here rather than cores).  For example, the <a href="http://www.nccs.gov/jaguar/">Jaguar</a> at Oak Ridge National Laboratory (ORNL) is one of the world&#8217;s largest supercomputers.  The Jaguar XT4 had 7,832 Quad-Core AMD Opterons (31,328 cores) and the Jaguar XT5 has 37,376 Six-Core AMD Opterons (224,256 cores).</p>
<div id="attachment_279" class="wp-caption alignleft" style="width: 310px"><a href="http://rgrossman.com/files/2012/07/four-racks-s_w_ellis-cc-2681151694_5e0bb01081.jpg"><img src="http://rgrossman.com/files/2012/07/four-racks-s_w_ellis-cc-2681151694_5e0bb01081-300x225.jpg" alt="" title="Four racks in a data center" width="300" height="225" class="size-medium wp-image-279" /></a><p class="wp-caption-text">A common architecture for big data is to use racks of computers to provide scale out processing of data.  Source: Sean Ellis, Creative Commons, Flickr.</p></div>
<p>We can think of big data in a similar way.  Think of a system for big data as a specialized system that can manage 1,000x to 10,000x or more data volume than a standard desktop computer.  These days we talk about <em>scaling out</em> computing infrastructure to fill a data center or a warehouse and we tend to measure big data computing in MW instead of petabytes.  A 15 MW data center might have a 100,000 computers and 100&#8242;s of PB of disks.  Most software for managing and analyzing data was not designed to scale out to computing infrastructure of this size.  The most popular choice for a big data software stack is <a href="http://hadoop.apache.org/">Hadoop</a>.  Another choice is <a href="http://sector.sf.net">Sector</a>.</p>
<p>A great book on this style of computing is <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006">The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines</a> by Barroso and Hölzle.</p>
<p>Specialized big data systems are also built to handle high velocity data streams.  Sometimes big data is said to be concerned with data whose volume, velocity or variety is too big to be handled by conventional systems.</p>
<p>For some time, we have produced more data than we can easily analyze. About 15 years ago, I organized three NSF-supported workshops to understand big data.  Although we called it data mining then, the opportunities and challenges 15 years ago and the opportunities and challenges today look pretty similar.  For those interested in a longer term perspective on big data, it might be interesting to skim the report, which you can find <a href="http://docs.rgrossman.com/tr/dmr-v8-4-5.htm">here</a>.  </p>
<p>Perhaps what is different today is that just as computing cycles, disk storage, and network bandwidth have become commoditized, over the next decade or so data itself will become commoditized.   Although the volume, velocity, and variety of data continue to grow, the challenge as always is to extract interesting, useful and actionable information from it.  I agree with Tom Kalil from the White House Office of Science and Technology Policy that this is a fundamental research challenge, or in his words: <a href="http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal">big data is a big deal</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/06/06/an-introduction-to-big-data-for-general-reader/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Five Eras of Computing</title>
		<link>http://rgrossman.com/2012/05/06/the-five-eras-of-computing/</link>
		<comments>http://rgrossman.com/2012/05/06/the-five-eras-of-computing/#comments</comments>
		<pubDate>Sun, 06 May 2012 02:05:32 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[future of computing]]></category>
		<category><![CDATA[five eras of computing]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[mainframes]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=191</guid>
		<description><![CDATA[Those who don't know history are destined to repeat it. Edmund Burke (1729 - 1797) My impression of computer science on some days is that the community does a lot of repeating (and often times calls it revolutionary). I just &#8230; <a href="http://rgrossman.com/2012/05/06/the-five-eras-of-computing/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<pre>Those who don't know history are destined to repeat it.
Edmund Burke (1729 - 1797)</pre>
<p>My impression of computer science on some days is that the community does a lot of repeating (and often calls it revolutionary).   I just finished a book about computing for general readers that takes the opposite point of view.  </p>
<p>The book is called &#8220;The Structure of Digital Computing: From Mainframes to Big Data,&#8221; and is available from <a href="http://www.amazon.com/The-Structure-Digital-Computing-Mainframes/dp/1936298007">Amazon</a>. </p>
<p><a href="http://rgrossman.com/files/2012/05/five-eras.jpg"><img src="http://rgrossman.com/files/2012/05/five-eras-300x219.jpg" alt="" title="Five Eras of Computing" width="300" height="219" class="alignleft size-medium wp-image-282" /></a></p>
<p>The book takes a 50-year perspective on the history of computing and divides this period into five overlapping eras (see the list below).  Most of what vendors try to pass off as revolutionary is simply market clutter.  Genuine innovations are rare and hard to predict, but are usually recognized and appreciated quite quickly.  </p>
<p>From the book&#8217;s perspective, we are transitioning from the third era of computing (the web) to the fourth era of computing, the era of computing devices. In the device era, many of us will replace our desktop and laptop computers with digital devices, such as smart phones and, in the future, wearable computers. </p>
<p>These devices (large and small) are all producing data and an important research challenge today is to develop better technologies for managing, analyzing and sharing this data.</p>
<p>These five eras are introduced in Chapter 1, which you can download from the book&#8217;s <a href="http://www.structureofdigitalcomputing.com" onClick="_gaq.push(['_trackEvent', 'Files', 'Downloads', 'Structure - Chapter 1']);">web site</a>.</p>
<p><b>1. Mainframe Era:</b><br />
When: 1965-1985<br />
Bottleneck: computer cycles<br />
Becoming commoditized: NA </p>
<p><b>2. PC Era:</b><br />
When: 1980-2000<br />
Bottleneck: application software<br />
Becoming commoditized: computer cycles </p>
<p><b>3. Web Era:</b><br />
When: 1995-2015<br />
Bottleneck: network bandwidth<br />
Becoming commoditized: application software </p>
<p><b>4. Device Era:</b><br />
When: 2005-2025<br />
Bottleneck: data<br />
Becoming commoditized: network bandwidth</p>
<p><b>5. Data Era:</b><br />
When: 2015-2035<br />
Bottleneck: actionable information<br />
Becoming commoditized: data</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/05/06/the-five-eras-of-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
