Robert Grossman

Bionimbus Protected Data Cloud (PDC) Update

Robert Grossman — Wed, 25 Sep 2013 18:09:54 +0000

The Bionimbus Protected Data Cloud (PDC) is an open source petabyte-scale cloud that is designed to manage, analyze and share large genomic datasets for the research community in a secure and compliant fashion. The Bionimbus PDC now contains all of the data available to date from The Cancer Genome Atlas (TCGA). Today, this is over 600 TB of data, and it will grow over the next two years to over 2.5 PB. This includes both the controlled access BAM files containing the genomic data and the open access aggregated data derived from the BAM files.

I’ll be giving a talk today about the Bionimbus PDC at the O’Reilly Strata Health Rx Conference in Boston.

To analyze TCGA data using the Bionimbus PDC, you will need the required approvals from dbGaP. Any researcher authorized to analyze controlled access TCGA data is welcome to use modest amounts of compute and storage resources on the PDC. If you need additional resources, you can apply for a PDC research allocation.

Please contact us if you would like to contribute some data to the PDC, have a project that would like to join the PDC, or have a biomedical cloud that would like to interoperate with the PDC.

A Tool for Keeping Big Data in Sync

Robert Grossman — Wed, 20 Mar 2013 02:08:59 +0000

The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed to keep two large datasets synchronized when they are separated by a long distance. Over the past couple of years, we developed a utility called UDR at the Laboratory for Advanced Computing at the University of Chicago which integrates rsync with the high performance network protocol UDT.

UDT is a reliable UDP-based protocol that was designed to move large datasets over wide area, high performance networks. UDT is open source and has been used as the basis for over six commercial products.

UDR is open source and available from github.

Here are some test results conducted by Erich Weiler from the University of California at Santa Cruz moving genomic data:

Source	Destination	UDR	rsync
Santa Cruz	Milwaukee	500 Mb/s	160 Mb/s
Santa Cruz	Detroit	600 Mb/s	150 Mb/s
Santa Cruz	Bielefeld	600 Mb/s	6 Mb/s
Santa Cruz	Aarhus	350 Mb/s	6 Mb/s
Santa Cruz	Brisbane	550 Mb/s	3 Mb/s
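
To make the gap in the table concrete, here is a small sketch that computes the UDR-to-rsync speedup for each destination and the time each tool would need to move a dataset at the measured rate. The throughput numbers come from the table above; the 1 TB dataset size and the function names are assumptions made only for illustration.

```python
# Throughput figures copied from the table above, in Mb/s: (UDR, rsync).
RESULTS = {
    "Milwaukee": (500, 160),
    "Detroit": (600, 150),
    "Bielefeld": (600, 6),
    "Aarhus": (350, 6),
    "Brisbane": (550, 3),
}


def speedup(udr_mbps, rsync_mbps):
    """Ratio of UDR throughput to rsync throughput."""
    return udr_mbps / rsync_mbps


def hours_to_move(terabytes, mbps):
    """Hours needed to move `terabytes` (decimal TB) at a sustained rate of `mbps` megabits/s."""
    bits = terabytes * 1e12 * 8
    return bits / (mbps * 1e6) / 3600


# Assumed dataset size of 1 TB, chosen only to make the comparison tangible.
for dest, (udr, rsync) in RESULTS.items():
    print(
        f"{dest}: {speedup(udr, rsync):.1f}x, "
        f"1 TB in {hours_to_move(1, udr):.1f} h with UDR "
        f"vs {hours_to_move(1, rsync):.1f} h with rsync"
    )
```

The calculation assumes the measured rate is sustained for the whole transfer, which real wide area links may not deliver.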

Allison Heath is the Project Lead for UDR.

Do You Want Hands On Experience Working with Big Data?

Robert Grossman — Fri, 15 Feb 2013 01:46:32 +0000

If you are a graduate student or post-doc interested in improving your big data skills, you might want to consider applying for an Open Science Data Cloud (OSDC) PIRE 2013 Fellowship. These fellowships are supported by the NSF PIRE Program and provide support for up to eight weeks of work.

The OSDC allows researchers to compute over 1 PB of scientific data from a variety of scientific disciplines.

We provide a big data bootcamp for OSDC PIRE Fellows. OSDC PIRE Fellows then spend time working with one of the OSDC foreign collaborators on a variety of projects, including:

Expanding the OSDC to other countries.
Developing infrastructure so that the OSDC can interoperate with science clouds in other countries.
Working on the OSDC software infrastructure.
Developing domain specific OSDC applications in the biological sciences, earth sciences, social sciences, or digital humanities.

To apply for an OSDC PIRE Fellowship, please fill out the application here. Only U.S. citizens or permanent residents are eligible for OSDC PIRE Fellowships.

The Unreasonable Effectiveness of Consensus Labeling

Robert Grossman — Fri, 21 Dec 2012 13:32:50 +0000

The majority of large datasets are unlabeled, while the majority of machine learning algorithms that you are likely to use require labeled data. Of course this is a simplification, but it captures quite well my experience in practice.

One approach that we used in a recent research project is what you might call consensus labeling. Here is a high-level outline of the approach:

Select three or more high quality classifiers that have been trained on small amounts of labeled data. These classifiers will be used in the next step to assign labels to unlabeled data.

Apply the ensemble of classifiers to a large dataset of unlabeled data to create a labeled dataset. Labels can be assigned either by using a majority vote or by only labeling those records in which the classifiers all agree (a consensus).
From this larger labeled dataset, train and validate a classifier or other machine learning algorithm.
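
The labeling step in the outline above can be sketched in a few lines. This is a minimal illustration, not the code used in the project: the three rule-based "classifiers" are hypothetical stand-ins for trained models, and the function name `consensus_label` and its `mode` parameter are assumptions.

```python
from collections import Counter


def consensus_label(classifiers, records, mode="consensus"):
    """Label records with an ensemble of already-trained classifiers.

    mode="consensus": keep a record only if every classifier agrees.
    mode="majority":  keep a record if one label wins a strict majority.
    Returns (record, label) pairs; records without enough agreement are dropped.
    """
    if mode == "consensus":
        needed = len(classifiers)
    elif mode == "majority":
        needed = len(classifiers) // 2 + 1
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    labeled = []
    for record in records:
        votes = Counter(clf(record) for clf in classifiers)
        label, count = votes.most_common(1)[0]
        if count >= needed:
            labeled.append((record, label))
    return labeled


# Hypothetical toy classifiers standing in for models trained on labeled data.
def by_suffix(word):
    return "VERB" if word.endswith("ing") else "NOUN"


def by_length(word):
    return "VERB" if len(word) > 6 else "NOUN"


def by_lookup(word):
    return "VERB" if word in {"running", "tagging", "walk"} else "NOUN"


ensemble = [by_suffix, by_length, by_lookup]
words = ["running", "cloud", "walk", "king"]

strict = consensus_label(ensemble, words, mode="consensus")
relaxed = consensus_label(ensemble, words, mode="majority")
print(strict)
print(relaxed)
```

Requiring full agreement labels fewer records but should produce cleaner labels; majority voting labels more records at the cost of more noise. Either output would then serve as the training set for the final step.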

The goal of the project was to explore a class of algorithms that each night could use a large computing infrastructure (in our case the Open Cloud Consortium’s petabyte-scale OCC-Y Cloud) to analyze an ever changing collection of text documents and build a new model for entity extraction, part of speech tagging, etc.

The project was a joint project with Andrey Rzhetsky and Shi Yu, and I have described just a small part of it. You can find more details in the paper: Shi Yu, Robert Grossman and Andrey Rzhetsky, Global and Local Approach of Part-of-Speech Tagging for Large Corpora, Information Retrieval and Knowledge Discovery in Biomedical Text: Papers from the 2012 AAAI Fall Symposium, AAAI Press, Menlo Park, California, 2012. pdf.

The Open Science Data Cloud – Two Year Update

Robert Grossman — Tue, 20 Nov 2012 11:34:34 +0000

I just got back from SC 12, which took place in Salt Lake City this year. We shared our research booth with the Open Science Data Cloud (OSDC) and with the International Center for Advanced Internet Research (iCAIR) from Northwestern University.

The OSDC just turned two years old. It is a petabyte scale science cloud for researchers to manage, analyze and share their data and to get easy access to data from other scientists. The OSDC is operated by the not-for-profit Open Cloud Consortium (OCC), which is taking a long term point of view in how to build and operate cloud-based infrastructure to serve the needs of researchers.

There is now over 800 TB of data available to the research community in the OSDC. It should be 1 PB by the end of the year, and will grow significantly during 2013. Any researcher may apply for an account to compute over this data (we have an allocation committee that selects projects). Small usage of the OSDC is free, and we ask larger projects to pay for the costs of using the facility. In particular, we will work with projects to help them write the OSDC into their grants to pay for their usage. A model that has proved useful is for projects to request one or more racks from their funding agency each year, which the OSDC can operate.

We are currently buying racks that contain about 575 cores, 2.3 TB of RAM and 1 PB of raw storage. These cost about $250,000 and provide about 5,000,000 core hours of compute each year. We have developed software and services so that we can (generally) run our racks remotely and lights out.
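
The core-hour figure follows directly from the rack size. The sketch below reproduces it and derives a rough hardware cost per core hour. The core count and rack price come from the paragraph above; the service lifetime is an assumption, and power, space and staff costs are not included.

```python
CORES_PER_RACK = 575          # from the text
RACK_COST_USD = 250_000       # from the text
HOURS_PER_YEAR = 24 * 365

# 575 cores running around the clock for one year.
core_hours_per_year = CORES_PER_RACK * HOURS_PER_YEAR


def hardware_cost_per_core_hour(years_in_service):
    """Hardware-only cost per core hour, assuming the rack runs for `years_in_service` years."""
    return RACK_COST_USD / (core_hours_per_year * years_in_service)


print(f"{core_hours_per_year:,} core hours per year")
# The lifetimes of 1 and 3 years are illustrative assumptions, not figures from the text.
for years in (1, 3):
    print(f"{years} year(s): ${hardware_cost_per_core_hour(years):.4f} per core hour")
```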

From 2008 to 2010, we operated a proof-of-concept infrastructure that consisted of four distributed data centers connected with a wide area 10G network, with an infrastructure that was a mixture of Hadoop, Sector, and Eucalyptus, and running cloud-based applications from several scientific disciplines. You can find a description of the OSDC in 2010 in this paper.

From 2010 to the present, we have operated a production cloud-based infrastructure serving several scientific disciplines, including the biological sciences with Bionimbus; the earth sciences with Matsu; and the digital humanities. The infrastructure is now based primarily on Hadoop and OpenStack. We use UDT and UDR to transport and synchronize large terabyte-size datasets, and we are now beginning to use 100G network connections. We presented an update about the OSDC at the Data Cloud 2012 Workshop at SC 12. You can find the paper here.

We are planning for a scale up of the OSDC beginning in approximately 2015-2016, when we hope to open up a small boutique 3-5 MW scale data center to house the OSDC called the Burnham Center for Knowledge Discovery. We are raising $1M to help plan for the Burnham Center, which we expect to cost between $15M and $25M (including the first several years of operating expenses), depending upon the scale.

The software for the OSDC is all open source. We are always looking for volunteers, especially those that can help develop code, or help operate the OSDC, and for donors, especially those that can provide equipment, funds for operating expenses, and funds for the planning or operations of the Burnham Center. Please write us at info at opencloudconsortium dot org if you are interested in getting involved.

Datascopes for the Long Tail of Science

Robert Grossman — Fri, 12 Oct 2012 22:15:27 +0000

This week, I gave a talk at the GLIF Workshop in Chicago on what is being called the long tail of science.

We have telescopes to study distant objects and microscopes to study small objects. A number of us who are thinking about big data are asking the question: what is an appropriate instrument to study big data? In the talk, I discussed some of the lessons learned by the OCC as we design and develop the Open Science Data Cloud, which you can think of as one design of a datascope.

The talk was called “The Open Science Data Cloud: Empowering the Long Tail of Science” and you can find the slides on slideshare.

Managing and Analyzing 1,000,000 Genomes

Robert Grossman — Tue, 18 Sep 2012 19:32:18 +0000

I gave a talk last week at XLDB 2012 about Bionimbus, which is a cloud-based system for managing, analyzing, transporting, and sharing large genomics datasets in a secure and compliant fashion. Bionimbus was developed at the Institute for Genomics and Systems Biology (IGSB) and is used by IGSB and some of their collaborators to manage and analyze their next gen sequencing data.

We have been using Version 2.0 of Bionimbus for the past two years and are beginning the transition to Version 3.0. In Version 3.0, we have factored out some of the services and made them more generally available to other Open Science Data Cloud (OSDC) applications. In particular, we have factored out the key service, which provides digital IDs for datasets, and the file and permissions services.

Recently, we have been thinking about what you might call the Million Genome Challenge. Over the next several years, the National Cancer Institute, and perhaps other organizations, will sequence a million genomes and use this data to increase our understanding of biological pathways and of genomic variation across individuals. With this knowledge, we can begin to stratify diseases, leading to precision diagnosis and precision treatment that is personalized for individual patients.

The numbers associated with a million cancer genomes are worth thinking about. The whole genome data for a tumor and a matching normal tissue sample require about 1 TB. Thus, one million genomes require about 1,000,000 TB. This is 1,000 PB or 1 EB. Compressing the data might reduce the data by about a factor of 10. Throwing away the alignment data and retaining only the variation data would reduce the data by a factor of about 100. Assuming it costs about $1,000 to sequence each whole genome, the project as a whole requires about $1B for the sequencing. It might require another $1B for the infrastructure and analysis. Although obviously a large project, a project like this is likely to fundamentally alter the way we understand and treat diseases.
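
The arithmetic above is easy to check. The sketch below uses only the figures from the preceding paragraph with decimal units (1 PB = 1,000 TB), and it assumes that both reduction factors are measured against the raw data; the variable names are illustrative.

```python
GENOMES = 1_000_000
TB_PER_GENOME = 1            # tumor plus matching normal sample
COST_PER_GENOME_USD = 1_000

raw_tb = GENOMES * TB_PER_GENOME
raw_pb = raw_tb / 1_000
raw_eb = raw_pb / 1_000

# Assumption: each reduction factor is applied to the raw data volume.
compressed_pb = raw_pb / 10        # compression, about a factor of 10
variants_only_pb = raw_pb / 100    # variation data only, about a factor of 100

sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD

print(f"raw: {raw_tb:,} TB = {raw_pb:,.0f} PB = {raw_eb:,.0f} EB")
print(f"compressed: {compressed_pb:,.0f} PB, variants only: {variants_only_pb:,.0f} PB")
print(f"sequencing: ${sequencing_cost_usd:,}")
```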

In my XLDB talk, I discuss Bionimbus, the Million Genome Challenge, and related topics. You can find my talk at the conference web site, or on my slideshare site.

Why is it still so hard to analyze remote and distributed data?

Robert Grossman — Mon, 09 Jul 2012 10:55:44 +0000

A good question to think about is:

If the web (of documents), which is built upon open standards around html (for describing documents) and http (for accessing documents), is so successful, why don’t we have a web of data, built upon open standards around xml (or something perhaps a bit more concise for describing data) and a protocol for accessing data (and metadata)?

These two men are discussing why it is so difficult to work with remote and distributed data.

About ten years ago, I published a paper called DataSpace – A Web Infrastructure for the Exploratory Analysis and Mining of Data, that described an infrastructure called DataSpace for creating a web of data, that uses html and xml for describing data and metadata and a protocol we introduced called the dataspace transfer protocol or dstp for transferring data. The key idea in DataSpace was to make it lightweight and minimal. It was based upon distributed columns of data, each of which was attached to a key called a universal correlation key or UCK. We developed reference implementations of dstp servers to serve columns of data, associated metadata, and associated UCKs. Correlating distributed columns of data was simple and applications just used UCKs. Discovery of data and metadata just used standard mechanisms.
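
The idea of correlating distributed columns through a shared key can be illustrated with a toy sketch. This is not the dstp protocol or the DataSpace reference implementation: the dictionaries below stand in for columns fetched from two different servers, and the function name, key values and data are all hypothetical.

```python
def correlate(column_a, column_b):
    """Pair up the values of two columns wherever they share a universal correlation key (UCK)."""
    shared = column_a.keys() & column_b.keys()
    return {uck: (column_a[uck], column_b[uck]) for uck in sorted(shared)}


# Hypothetical columns, each mapping a UCK to a value, as if served by separate sites.
temperature = {"station-001": 14.2, "station-002": 9.8, "station-003": 11.5}
rainfall = {"station-002": 3.1, "station-003": 0.0, "station-004": 7.4}

pairs = correlate(temperature, rainfall)
print(pairs)
```

Because each column carries its own keys, an application can join data from independent sources without either source knowing about the other.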

The W3C Semantic Web effort, which was more ambitious, started at approximately the same time. Despite millions of dollars of funding, it too hasn’t really caught on.

It is an interesting exercise to try to think about why the semantic web, DataSpace, or any of the similar ideas haven’t caught on.

Today, we have linked data, whose key concepts are relatively close to DataSpace. Linked data is much simpler than the semantic web and is based upon four principles, which Tim Berners-Lee listed in his note Design Issues: Linked Data:

Use URIs to identify things.
Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

DataSpace is quite similar except it encourages the use of UCKs so that columns of data can be correlated.

More recently, Stuart Bailey, the Founder and CTO of Infoblox, has been working on IF-MAP, which is a standard for securely describing and accessing distributed collections of objects and their links, as well as metadata about objects and their links. IF-MAP is an abbreviation for Interface for Metadata Access Points and is a Trusted Computing Group (TCG) standard.

Stuart Bailey was part of the original DataSpace effort and IF-MAP is an interesting evolution of some of the key ideas in DataSpace.

It still seems like a great time to ask, why don’t we have a web of data supporting simple discovery, exploration, correlation and access?

An Introduction to Big Data for the General Reader

Robert Grossman — Wed, 06 Jun 2012 11:43:32 +0000

I recently finished a book about computing for general readers called “The Structure of Digital Computing: From Mainframes to Big Data,” which is available from Amazon.

Chapter 5 is a non-technical introduction to big data, which you can download here.

Although the term big data is relatively new, the discipline, which is sometimes called data intensive computing, is at least 20 years old. One way to think of big data is similar to the way we think of high performance computing. We tend to think of a high performance computer as a specialized computer that has at least 1,000x to 10,000x or more computing power than a desktop computer (I’m thinking of processors here rather than cores). For example, Jaguar at Oak Ridge National Laboratory (ORNL) is one of the world’s largest supercomputers. The Jaguar XT4 had 7,832 Quad-Core AMD Opterons (31,328 cores) and the Jaguar XT5 has 37,376 Six-Core AMD Opterons (224,256 cores).

A common architecture for big data is to use racks of computers to provide scale out processing of data. Source: Sean Ellis, Creative Commons, Flickr.

We can think of big data in a similar way. Think of a system for big data as a specialized system that can manage 1,000x to 10,000x or more data volume than a standard desktop computer. These days we talk about scaling out computing infrastructure to fill a data center or a warehouse and we tend to measure big data computing in MW instead of petabytes. A 15 MW data center might have 100,000 computers and hundreds of PB of disks. Most software for managing and analyzing data was not designed to scale out to computing infrastructure of this size. The most popular choice for a big data software stack is Hadoop. Another choice is Sector.

A great book on this style of computing is by Barroso and Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

Specialized big data systems are also built to handle high velocity data streams. Sometimes big data is said to be concerned with data whose volume, velocity or variety is too big to be handled by conventional systems.

For some time, we have produced more data than we can easily analyze. About 15 years ago, I organized three NSF-supported workshops to understand big data. Although we called it data mining then, the opportunities and challenges 15 years ago and the opportunities and challenges today look pretty similar. For those interested in a longer term perspective on big data, it might be interesting to skim the report, which you can find here.

Perhaps what is different today is that just as computing cycles, disk storage, and network bandwidth have become commoditized, over the next decade or so data itself will become commoditized. Although the volume, velocity, and variety of data continue to grow, the challenge as always is to extract interesting, useful and actionable information from it. I agree with Tom Kalil from the White House Office of Science and Technology Policy that this is a fundamental research challenge or, in his words, big data is a big deal.

The Five Eras of Computing

Robert Grossman — Sun, 06 May 2012 02:05:32 +0000

Those who don't know history are destined to repeat it.
Edmund Burke (1729 - 1797)

My impression of computer science on some days is that the community does a lot of repeating (and oftentimes calls it revolutionary). I just finished a book about computing for general readers that takes the opposite point of view.

The book is called “The Structure of Digital Computing: From Mainframes to Big Data,” and is available from Amazon.

The book takes a 50-year perspective on the history of computing and divides this period into five overlapping eras (see the table below). Most of what vendors try to pass off as revolutionary is simply market clutter. Genuine innovations are rare and hard to predict, but are usually recognized and appreciated quite quickly.

From the book’s perspective, we are transitioning from the third era of computing (the web) to the fourth era of computing, the era of computing devices. In the device era, many of us will replace our desktop and laptop computers with digital devices, such as smart phones and, in the future, wearable computers.

These devices (large and small) are all producing data and an important research challenge today is to develop better technologies for managing, analyzing and sharing this data.

These five eras are introduced in Chapter 1, which you can download from the book’s web site.

Era	When	Bottleneck	Becoming commoditized
1. Mainframe Era	1965-1985	computer cycles	NA
2. PC Era	1980-2000	application software	computer cycles
3. Web Era	1995-2015	network bandwidth	application software
4. Device Era	2005-2025	data	network bandwidth
5. Data Era	2015-2035	actionable information	data