We live today in a world flooded with data. In just one short decade, we have gone from a data-poor world to a data-rich one. The buzzword Big Data captures this phenomenon, and it’s one of the few cases where the reality actually can match the hype. Big Data is transforming every industry and human activity, including commerce, entertainment, agriculture, government, and the sciences.
For the past decade at Stanford, Jure Leskovec, Jeff Ullman, and I have been teaching a popular course called “Mining of Massive Datasets,” where we teach the fundamental techniques and tools to deal with Big Data. This class has trained a whole generation of data scientists and engineers who work at many major Silicon Valley companies and startups.
The Stanford course is popular, and attracts hundreds of students. But the course textbook, also called Mining of Massive Datasets and published by Cambridge University Press, has been downloaded by hundreds of thousands of students and practitioners. This helped us realize that our Stanford students are just a small fraction of the vast number of people worldwide who might benefit from the course.
So we are now making this class available online, on Coursera, for the entire world. In this class we will introduce fundamental algorithms and techniques to deal with Big Data, such as MapReduce, Locality Sensitive Hashing, Page Rank, and algorithms for Large Graphs and Data Streams. We will also show how to apply our toolkit to important practical applications, such as Web Search, Recommender Systems and Online Advertising.
The class starts September 29 and runs for 9 weeks. One of the key decisions we made is to not “water down” the material in any way from the course we teach at Stanford; the MOOC contains exactly the same material as the Stanford class. You can sign up for the class on the Coursera page. Here’s a short introductory video we recorded for the class.
In addition to the materials provided with this MOOC, the second edition of our textbook is now available for free download here. If you liked the first edition of the book, you should definitely check out the second edition -- we’ve added lots of new material, including graph algorithms, social network analysis, large-scale machine learning, and dimensionality reduction.
Marc Andreessen has famously pointed out that software is eating the world. Data is the fuel that powers software’s conquests. Data is created whenever humans and software interact, or when software interacts with other software. This virtuous cycle -- the success of software creates more data, and more data makes software even more powerful -- is a dynamic that is transforming the world we live in. Join us on Coursera to learn how to harness the power of data so that you can be an active participant, rather than a mere spectator, in this transformation.
Two weeks ago, I announced the exciting news that Kosmix had agreed to be acquired by Walmart, the world’s largest retailer. At that point, the deal was signed but subject to customary closing conditions. Today, I’m delighted to announce that the closing conditions have been fulfilled and that the deal is officially closed.
Today is the last day in the life of Kosmix as an independent company. It is a day to look back with pride at our accomplishments. I’m proud of our pioneering development of breakthrough semantic analysis technology, and of applying it to real-time social media streams to create the Social Genome. I’m proud that we built RightHealth into one of the top three health and medical information sites by global reach. I’m extremely proud that we touched so many people each and every day: in March, our properties RightHealth, Tweetbeat, and Kosmix.com together served over 17.5 million unique visitors worldwide, who spent over 5.5 billion seconds on our services. But most of all, I’m proud of being part of the Kosmix team, and having the privilege of working with a team of extremely talented and passionate individuals who, over the course of the past several years, went from coworkers to family.
Today also is the first day in the life of @WalmartLabs. As I wrote in my prior post, our mission is to invent the next generation of ecommerce: integrated experiences that leverage the store, the web, and mobile, with social identity being the glue. We are at an inflection point in the development of retailing. Social media and the mobile phone will have as profound an effect on the trajectory of retail in the early years of the 21st century as did the development of highways in the early part of the 20th century. @WalmartLabs, which combines Walmart’s scale with Kosmix’s social genome platform, is in a unique position to invent and build this future.
I’m also delighted that the entire Kosmix crew, without exception, will be part of this exciting new journey. If you are interested in joining a talented and passionate team working at the crossroads of social media, mobile, and commerce, give us a holler. @WalmartLabs is hiring!
P.S. To stay tuned, follow @WalmartLabs on Twitter.
Eric Schmidt famously observed that every two days now, we create as much data as we did from the dawn of civilization until 2003. A lot of the new data is not locked away in enterprise databases, but is freely available to the world in the form of social media: status updates, tweets, blogs, and videos.
At Kosmix, we’ve been building a platform, called the Social Genome, to organize this data deluge by adding a layer of semantic understanding. Conversations in social media revolve around “social elements” such as people, places, topics, products, and events. For example, when I tweet “Loved Angelina Jolie in Salt,” the tweet connects me (a user) to Angelina Jolie (an actress) and Salt (a movie). By analyzing the huge volume of data produced every day on social media, the Social Genome builds rich profiles of users, topics, products, places, and events.
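To make the idea concrete, here is a toy sketch of connecting a tweet to social elements via a small entity dictionary. The catalog and function below are purely illustrative assumptions; the real Social Genome's methods and data are far richer.

```python
# Toy entity catalog mapping surface phrases to (canonical name, type).
# Purely illustrative -- not the Social Genome's actual data or API.
CATALOG = {
    "angelina jolie": ("Angelina Jolie", "actress"),
    "salt": ("Salt", "movie"),
}

def tag_entities(user, tweet):
    """Return (user, entity, type) edges for catalog phrases found in the tweet."""
    text = tweet.lower()
    return [(user, name, kind)
            for phrase, (name, kind) in CATALOG.items()
            if phrase in text]

edges = tag_entities("anand", "Loved Angelina Jolie in Salt")
```

Each returned edge links the user to one social element, the building block from which a profile can be aggregated over many tweets.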
The Social Genome platform powers the sites Kosmix operates today: TweetBeat, a real-time social media filter for live events; Kosmix.com, a site to discover content by topic; and RightHealth, one of the top three health and medical information sites by global reach. In March, these properties together served over 17.5 million unique visitors worldwide, who spent over 5.5 billion seconds on our services.
Quite a few of us at Kosmix have backgrounds in ecommerce, having worked at companies such as Amazon.com and eBay. As we worked on the Social Genome platform, it became apparent to us that this platform could transform ecommerce by providing an unprecedented level of understanding about customers and products, going well beyond purchase data. The Social Genome enables us to take search, personalization and recommendations to the next level.
That’s why we were so excited when Walmart invited us to share with them our vision for the future of retailing. Walmart is the world’s largest retailer, with 10.5 billion customer visits every year to their stores and 1.5 billion online – 1 in 10 customers around the world shop Walmart online, and that proportion is growing. More and more visitors to the retail stores are armed with powerful mobile phones, which they use both to discover products and to connect with their friends and with the world. It was very soon apparent that the Walmart leadership shared our vision and our enthusiasm. And so @WalmartLabs was born (official announcement here).
We are at an inflection point in the development of ecommerce. The first generation of ecommerce was about bringing the store to the web. The next generation will be about building integrated experiences that leverage the store, the web, and mobile, with social identity being the glue that binds the experience. Walmart’s enormous global reach and incredible scale of operations -- from the United States and Europe to growing markets like China and India -- is unprecedented. @WalmartLabs, which combines Walmart’s scale with Kosmix’s social genome platform, is in a unique position to invent and build this future.
It is every technologist’s dream that the products they build will impact billions and will continue on to the next generation. The social commerce opportunity is huge, and today is day zero. We have liftoff!
Slumdog Millionaire is one of my favorite movies of all time. And I have followed the career of A.R. Rahman, who composed the movie's music, ever since his debut in 1992. So I was quite thrilled when Slumdog was nominated for 10 Academy Awards -- and Rahman in two categories, Original Score and Original Song. Thrilled, and a little surprised: while I like Rahman's work in Slumdog, I don't think it's his best work. There is of course nothing wrong with that, as long as Rahman's work is better than that of his competitors this year.
But it got me to thinking: if Rahman had composed the same music for an obscure film this year, rather than for Slumdog Millionaire, would he have been nominated? And even if he had been nominated, what are his chances of winning? In other words, is there a Matthew Effect in Oscar nominations -- to them that have, more shall be given? And, once nominated, is there a halo surrounding movies with many nominations that improves the odds of winning across many award categories? I thought it might be fun to run the numbers based on past years' nominees and winners to see if I could find answers to these questions; it turned out to be somewhat instructive as well, since it required an extension of the standard Market Basket analysis from the world of data mining.
To get the data, I went straight to the source: the official Academy Awards database, which lists all the nominations and winners for the past 80 years. Unfortunately there is not a single page that lists all this information, but it was fairly straightforward to write Python scripts that queried the website a few times and collated the data in tabular form. The result: a table that lists every nomination and winner in every category between 1927 and 2007. There were 8616 nominations in the period, representing 4215 distinct movies; so each movie was nominated on average for about 2 award categories.
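The collation step can be sketched roughly as follows. The input structure and field names here are assumptions for illustration, not the actual format of the Academy Awards database.

```python
# Sketch of the collation step: flatten per-year, per-category award
# results into one table of (year, category, movie, won) records.
# The input format is an assumption, not the database's real schema.

def collate(results_by_year):
    """results_by_year: year -> {category: {"winner": movie, "nominees": [...]}}.
    Returns a flat list of nomination records."""
    rows = []
    for year, categories in sorted(results_by_year.items()):
        for category, entry in sorted(categories.items()):
            for movie in entry["nominees"]:
                rows.append({
                    "year": year,
                    "category": category,
                    "movie": movie,
                    "won": movie == entry["winner"],
                })
    return rows

sample = {
    2007: {
        "Best Picture": {
            "winner": "No Country for Old Men",
            "nominees": ["No Country for Old Men", "Juno", "Atonement"],
        },
    },
}
rows = collate(sample)
```

Once the data is in this flat form, all of the analysis below reduces to simple counting over the records.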
Let's start with the nominations, to see if there is any evidence of the Matthew Effect. Let N(k) be the number of movies with exactly k nominations. If we ignore two outliers (k=1 and k=7), it appears that N(k+1)/N(k) is close to 0.6 for k between 2 and 10; the decay is certainly much slower than exponential. This indicates that the number of nominations roughly follows a power law; and a power law is the classic embodiment of the Matthew Effect, arising in contexts such as income and wealth distribution. The table below summarizes the data.
Nominations | Movies
1 | 2796
2 | 513
3 | 260
4 | 195
5 | 128
6 | 81
7 | 87
8 | 50
9 | 31
10 | 29
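The successive-ratio observation is easy to check directly from the table:

```python
# N(k) from the table above: the number of movies with exactly k
# nominations, for k = 1..10.
N = {1: 2796, 2: 513, 3: 260, 4: 195, 5: 128,
     6: 81, 7: 87, 8: 50, 9: 31, 10: 29}

# Successive ratios N(k+1)/N(k). Away from the outliers at k=1 and
# around k=7, these cluster roughly around 0.6 -- the slow decay
# noted in the text.
ratios = {k: N[k + 1] / N[k] for k in range(1, 10)}
```

For instance, the ratio at k=1 is an outlier near 0.18, and the ratio from N(6) to N(7) actually exceeds 1; the rest hover in the neighborhood of 0.5 to 0.75.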
The next step is to ask whether there are Oscar categories for which the effect is much stronger than for others. To study this, we divide the nominated movies into two groups: movies with 4 or fewer nominations (the "poor" group) and movies with 5 or more nominations (the "rich" group). Overall, 5382 nominations, or 62.5%, went to movies in the poor group and 3234 nominations, or 37.5%, went to movies in the rich group. Now, let's look at the major Oscar categories. The major outliers are Best Picture and Best Director -- both nominations went overwhelmingly to movies in the rich group (70% and 73%, respectively, compared to the average of 37.5%). This is not surprising, because the best picture is typically one that is strong in many disciplines. There is some bias in the acting categories as well, but the big surprise is Film Editing: 68% of the nominations in this category went to "rich" movies. At the other extreme are Music and Special Effects: approximately 70% of the nominated movies are in the "poor" group. So it appears that in these categories at least, talent gets its due without help from Matthew.
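Given the flat nominations table, the per-category rich/poor split is a straightforward computation. The sketch below uses illustrative toy data, not the real Oscar dataset.

```python
# Sketch of the rich/poor split by category. `rows` is the flat table
# of nominations; the sample data below is a toy illustration.
from collections import Counter, defaultdict

def rich_fraction_by_category(rows, threshold=5):
    """Fraction of each category's nominations going to movies with
    at least `threshold` total nominations (the "rich" group)."""
    noms_per_movie = Counter(r["movie"] for r in rows)
    totals = defaultdict(int)
    rich = defaultdict(int)
    for r in rows:
        totals[r["category"]] += 1
        if noms_per_movie[r["movie"]] >= threshold:
            rich[r["category"]] += 1
    return {c: rich[c] / totals[c] for c in totals}

# Movie A racks up 5 nominations; movie B has just 1, in Music.
rows = ([{"movie": "A", "category": c}
         for c in ["Picture", "Director", "Editing", "Music", "Sound"]]
        + [{"movie": "B", "category": "Music"}])
fractions = rich_fraction_by_category(rows)
```

Here the Music category splits 50/50 between the rich and poor movies, while Picture goes entirely to the rich movie, mirroring the kind of per-category skew discussed above.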
Moving from nominations to actual winners, the obvious question is: does being nominated in many categories boost the chances of winning in a disproportionate manner? To study this, I used the Market Baskets approach from data mining. In a classic Market Baskets scenario, we ask which items are often purchased together: such as milk and eggs. In this case, we model each movie as a basket: the contents of a movie's basket are its nominations and wins. Do movies with many nominations in their baskets have a disproportionate number of wins?
We must first deal with a technicality. In a normal market basket scenario, the contents of each basket are independent of every other basket, but in this case there are dependencies. Consider the set of market baskets of the movies that have all been nominated in a single award category in a particular year; clearly, one of these has to be the winner in that category, and so the basket of that movie will also contain a win in that category.
It's easy to extend the Market Baskets model to capture this idea. I'll call the new model Constrained Market Baskets. Consider a subset S of market baskets; say, the set of market baskets corresponding to the "rich" movies with 5 or more nominations. Suppose movie M is in this set, and has been nominated in award category C. If there are (say) a total of 5 nominees in this category, then the prior probability of movie M's basket containing a win is 1/5 or 0.2. We can repeat this for all the categories M is nominated in, and add up the priors; this gives the "prior expected value" of the number of wins in M's basket. We add up the expected wins for all the movies in set S to get the total number of wins we expect the set S of movies to have; call this EW. Now, if OW is the actual number of "Observed Wins" across the movies in set S, we want to see if there is a discrepancy between EW and OW. In particular, we define the "win boost" of set S to be OW/EW. If the win boost is higher than 1, then the set S of market baskets has a disproportionate number of wins, and if it's much less than 1, then it has fewer wins than expected.
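The win-boost computation can be sketched as follows; the record layout and the toy data are illustrative assumptions.

```python
# Sketch of the Constrained Market Baskets win boost. For each
# nomination, the prior win probability is 1/n, where n is the number
# of nominees in that (year, category). EW sums these priors over the
# subset S, OW counts actual wins, and the boost is OW/EW.
from collections import Counter

def win_boost(all_rows, subset_movies):
    # n per (year, category) must be computed over ALL nominees,
    # not just the subset -- that is the "constrained" part.
    field_size = Counter((r["year"], r["category"]) for r in all_rows)
    sub = [r for r in all_rows if r["movie"] in subset_movies]
    ew = sum(1 / field_size[(r["year"], r["category"])] for r in sub)
    ow = sum(1 for r in sub if r["won"])
    return ow / ew

# Toy data: movie M sweeps both of its two-nominee categories.
all_rows = [
    {"year": 2007, "category": "Picture", "movie": "M", "won": True},
    {"year": 2007, "category": "Picture", "movie": "X", "won": False},
    {"year": 2007, "category": "Director", "movie": "M", "won": True},
    {"year": 2007, "category": "Director", "movie": "X", "won": False},
]
boost = win_boost(all_rows, {"M"})   # EW = 0.5 + 0.5 = 1, OW = 2
```

In this toy example M's win boost is 2.0, while X's is 0.0: M won twice as often as chance predicts, and X never won at all.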
When we do the analysis, the set of "poor" movies, with 4 or fewer nominations, had a total of 5382 nominations, with 1143 "expected wins" but only 840 "observed wins"; a win boost of 0.73. The "rich" movies, by contrast, with 3234 nominations, were expected to win 657 Oscars but actually won 958, a win boost of 1.46. In other words: the rich movies, which represent only 37.5% of all nominations, won more than half of all the actual Oscar awards! Matthew!
Once again, we can break up the results by category, and look at the win boosts for specific categories of awards. For most major award categories, the win boosts for the rich and poor categories are in line with the overall average boosts. As in the case of nominations, the effect is very significant in the Best Picture and Best Director categories: in these categories, the "poor" movies have a win boost of just 0.30! We noted that the Music category seemed resilient to Matthew in the case of nominations; but in the case of wins, this category has a win boost of 1.7 for the rich movies, in line with the overall average. The surprising and significant outlier in this case is the Best Supporting Actor category, with win boosts very close to 1.0 for both the rich and the poor movies. It appears that the Best Supporting Actor award shows no evidence of Matthew; the other acting categories, however, are in line with the overall averages.
I don't have a deep enough understanding of the movie industry and the Academy Awards process to speculate on the reasons for these effects. Perhaps great talent attracts other great talent, and the Awards reflect that reality. And perhaps the difference between the behavior of wins and of nominations has to do with the fact that the former uses simple plurality voting while the latter uses a preferential voting scheme. In any case, I'm happy on two counts. The statistics on the Music category say that the Matthew effect likely did not help Mr Rahman in securing his nominations; but now that he has been nominated, his chances of winning are greatly boosted because he is associated with Slumdog's 10 nominations. Jai Ho!
Update: A big night for Slumdog, winning 8 awards, including both the music and song awards for A. R. Rahman. While 8 awards is not the best Oscar performance ever, it is the most awards ever won by a movie with 10 nominations (the ones that won more awards had more nominations). Matthew must be pleased.
Today I'm delighted to share some fantastic news. My company Kosmix has raised $20 million in new financing to power our growth. Even more than the amount of financing, I'm especially proud that the lead investor in this round is Time Warner, the world's largest media company. Our existing investors Lightspeed, Accel, and DAG participated in the round as well. The Kosmix team also is greatly strengthened by the addition of Ed Zander as investor and strategic advisor. In an amazing career that spans Sun Microsystems and Motorola, Ed has repeatedly demonstrated leadership that grew good ideas into great products and businesses. His counsel will be invaluable as we take Kosmix to the next level as a business.
In these perilous economic times, the funding is a big vote of confidence in Kosmix's product and business. Kosmix web sites attract 11 million visits every month, and we have a proven revenue model with significant revenues and robust growth. RightHealth, the proof-of-concept we launched in 2007, grew with astonishing rapidity to become the #2 health web site in the US. These factors played a big role in helping us close this round of funding with a healthy uptick in valuation from our prior round. Together with the money already in the bank from our prior rounds, we now have more than enough runway to take the company to profitability and beyond.
A few months ago, we put out an alpha version of Kosmix.com. Many people used it and gave us valuable feedback; thank you! We listened, and made changes. Lots of changes. The result is the beta version of Kosmix.com, which we launched today. What's changed? More information sources (many thousands), huge improvements in our relevance algorithms, a much-improved user interface, and a completely new homepage. Give it a whirl and let us know what you think.
To those of you new to Kosmix, the easiest way to explain what Kosmix does is by analogy. Google and Yahoo are search engines; Kosmix is an explore engine. Search engines work really well if your goal is to find a specific piece of information -- a train schedule, a company website, and so on. In other words, they are great at finding needles in the haystack. When you're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:
In the examples above, I'm especially pleased about the way Kosmix picks great niche sources for topics. For example, I hadn't heard about chow.com or known that Martha Stewart has how-to videos on her website. Other "gems" of this kind include Jambase, TMZ, The Onion, DailyPlate, MamaHerb, and Wonkette. Part of the goal of Kosmix is to bring you such gems: information sources or sites you have either not heard of, or just not thought about in the current context.
In other words: Google = Search + Find. Kosmix = Explore + Browse. Browsing sometimes uncovers surprising connections that you might not even have thought about. The power of the model was brought home to me last week as I was traveling around in England. I'd heard a lot about Stonehenge and wanted to visit; so of course I went to the Kosmix topic page on Stonehenge. In addition to the usual comprehensive overview of Stonehenge, the topic page showed me places to stay in Bath, Somerset (which happens to be the best place to stay when you're visiting Stonehenge). It also showed me other ancient monuments in the same area I could visit while I was there. Score one for serendipity.
Some of us remember the early days of the World Wide Web: the thrill of just browsing around, following links, and discovering new sites that surprise, entertain, and sometimes even inform. We have lost some of that joy now with our workmanlike use of search engines for precision-guided information finding. We built the new Kosmix homepage to capture some of the pleasure of aimless browsing -- exploring for pure pleasure. The homepage shows you the hot news, topics, videos, slide shows, and gossip of the moment. If you find something interesting you can dive right in and start browsing around that topic. We compile this page in the same manner as our topic pages: by aggregating information for many other sources and then applying a healthy dose of algorithms. Dig in; who knows what surprises await?
How does Kosmix work its magic? As I wrote when we put out the alpha, the problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. The web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on-the-fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases. The secret sauce is our algorithmic categorization technology. Given a topic, categorization tells us where the topic fits in a really big taxonomy, what the related topics are, and so on. In turn, other algorithms use this information to figure out the right set of information sources for a topic from among the thousands we know about. And then other algorithms figure out how to lay the information on the page in a 2-dimensional grid.
While we are proud of what we have built, we know there is still a long way to go. And we cannot do it without your feedback. So join the USS Kosmix on our maiden voyage. Our mission: to explore strange new topics; to discover surprising new connections; to boldly go where no search engine has gone before!
Update: Vijay Chittoor has posted more details on the new product features on the Kosmix blog. Coverage on TechCrunch, GigaOM, VentureBeat. I'm particularly pleased that Om Malik thinks his page on Kosmix is better than the bio on his site!
The internet world has been agog over Google's entry into the browser wars with Chrome. When we look back to this event several years from now with the benefit of hindsight, we might see it either as a master stroke, or as Google's biggest strategic misstep.
The potential advantages to the internet community as a whole are considerable. The web has evolved beyond its roots as a collection of HTML documents and dumb frontends to database applications. We now expect everything from a web application that we do from a desktop application, and then some more: the added bonus of connectivity to vast computing resources in the cloud. In this context, browsers need to evolve from HTML renderers to runtime containers, much as web servers evolved from simple servers of static files and cgi scripts to modern application servers with an array of plugins that provide a variety of services. Chrome is the first browser to explicitly acknowledge this transition and make it the centerpiece of their efforts, and will force other browsers to follow suit. We will all benefit.
The potential advantages to Google also are considerable. If the stars and planets align, they can challenge Microsoft's dominance on the desktop by making the desktop irrelevant. Even otherwise, they can hope to use their dominance in search to promote Chrome, gaining significant browser marketshare and ensuring that Microsoft cannot challenge Google's search dominance by building features into Internet Explorer and Windows that integrate MSN's search and other services.
Therein, however, lies the first and perhaps the biggest risk to Google. Until now, Microsoft has been unable to really use IE and Windows to funnel traffic to MSN services and choke off Google. Given their antitrust woes, they have been treading carefully on this matter. Any overt attempt by them will evoke cries of foul from many market participants. Google has been in a great position to lead the outcry, because it has been purely a service accessible from the browser, without any toehold in browser market itself.
Chrome, however, eases some of the pressure on Microsoft. If Microsoft integrates MSN search or other services tightly into IE, it will be harder for Google to cry foul -- Microsoft could point to Chrome, and any steps taken by Google to integrate their services into Chrome, as counter-arguments. In addition, any outcry from Google can now be characterized as sour grapes from a loser -- Microsoft can say, we both have browsers out there, they have one too, ours is just better, and let consumers decide for themselves.
In some sense, regardless of the actual market penetration of Chrome, Google has lost the moral high ground in future arguments with Microsoft. I wonder whether Google might have achieved all their aims better not by releasing a Google-branded browser, but by working with Mozilla to improve Firefox from within.
Second, while Google has shown impressive technological wizardry in search and advertising, the desktop application game is very different from the internet service game. While users are very forgiving about beta tags that stay for years on services such as Gmail, user expectations on matters such as compatibility and security bugs are very high for desktop applications. It remains to be seen whether Google has the culture to succeed in this game, going beyond providing whiz-bang features that thrill developers -- such as a blazingly fast JavaScript engine -- to deliver a mainstream browser that competes on stability, security, and features.
The third problem is one of data contagion. Google has the largest "database of intentions" in the world today: our search histories, which form the basis of Google's ad targeting. The thing that keeps me from freaking out that Google knows so much about me is that I access Google using a third-party browser. If Google has access to my desktop, and can tie my search history to that, the company can learn much about me that I keep isolated from my search behavior. The cornerstone of privacy on the web today is that we can use products from different companies to create isolation: desktop from Microsoft, browser from Mozilla, search from Google. These companies have no incentive to share information. This is one instance where information silos serve us well as consumers. Any kind of vertical integration has the potential to erode privacy.
I'm not suggesting that Google would do anything evil with this data, or indeed that the thought has even crossed their minds; thus far Google has behaved with admirable restraint in their usage of the database of intentions, staying away for example from behavioral targeting. But we should all be cognizant of the fact that companies are in business purely to benefit their shareholders. At some point, someone at Google might realize that the contents of my desktop can be used to target advertising, and it might prove tempting in a period of slow revenue growth under a different management team.
Two striking historical parallels come to mind, one a masterstroke and the other a blunder, in both cases setting into motion events that could not be undone. In 49 BC, Julius Caesar crossed the Rubicon with his army, triggering a civil war where he triumphed over the forces of Pompey and became the master of Rome. And in 1812, Napoleon Bonaparte had Europe at his feet when he made the fateful decision to invade Russia, greatly weakening his power and leading ultimately to his defeat at Waterloo. It will be interesting to see whether Chrome ends up being Google's Rubicon or its Moscow. Alea iacta est.
Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing the power of commodity clusters. While it provides a straightforward computational model, the approach suffers from certain key limitations, as discussed in a prior post.
Three approaches have emerged to bridge the gap between relational databases and MapReduce. Let's examine each approach in turn and then discuss their pros and cons.
The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using such a scripting language, one can express operations that require multiple MapReduce steps, together with joins and other set-oriented data processing operations. This approach is exemplified by Pig Latin, being developed by a team at Yahoo. Pig Latin provides primitive operations that are commonly found in database systems, such as Group By, Join, Filter, Union, ForEach, and Distinct. Each Pig Latin operator can take a User Defined Function (UDF) as a parameter.
The programmer creates a script that chains these operators to achieve the desired effect. In effect, the programmer codes by hand the query execution plan that might have been generated by a SQL engine. The effect of a single MapReduce step can be simulated by a Filter step followed by a Group By step. In many common cases, we don't even need to use UDFs, if the filtering and grouping criteria are straightforward ones that are supported in Pig Latin. The Pig Latin engine translates each script into a sequence of jobs on a Hadoop cluster. The Pig Latin team reports that 25% of Hadoop jobs at Yahoo today originate as Pig Latin scripts. That's impressive adoption.
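The Filter-then-Group-By simulation of a single MapReduce step can be sketched with a toy word count. This mirrors the idea in plain Python; it is not Pig Latin syntax.

```python
# Toy illustration: the effect of a single MapReduce step simulated
# by a Filter (the map-side selection) followed by a Group By with an
# aggregate (the reduce-side logic).
from collections import Counter

lines = ["pig latin is a language", "pig runs on hadoop"]
words = [w for line in lines for w in line.split()]

# FILTER: keep only words longer than two characters
kept = [w for w in words if len(w) > 2]

# GROUP BY word, aggregating with COUNT
counts = Counter(kept)
```

The filter plays the role of the map function's per-record logic, and grouping by key with an aggregate is exactly what the shuffle-and-reduce phase computes.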
Another interesting solution in this category is Sawzall, a scripting language developed at Google. Sawzall allows MapReduce operations to be coded using a language that is reminiscent of awk. If your computation fits the Sawzall model, the code is much shorter and more elegant than C/C++/Java Map and Reduce functions. Sawzall, however, suffers from two drawbacks: it limits the programmer to a predefined set of aggregations in the Reduce phase (although it supplies a big library of these); and it offers no support for data analysis that goes beyond a single MapReduce step, something Pig Latin does support. Most importantly, Sawzall is not available outside of Google, while Pig Latin has been open-sourced by Yahoo.
The second approach is to integrate MapReduce with a SQL database. Two database companies have recently announced support for MapReduce: Greenplum and Aster Data. Interestingly, they have taken two very different approaches. I will call Greenplum's approach "loose coupling" and Aster Data's approach "tight coupling". Let's examine each in turn.
Greenplum's loose-coupling approach ties together Greenplum's database with Hadoop's implementation of MapReduce. A Hadoop MapReduce operation is visible as a database view within Greenplum's SQL interpreter. Conversely, Hadoop map and reduce functions can access data in the database by iterating over the results of database queries. Issuing a SQL query that uses a map-reduce view will launch the corresponding MapReduce operation, whose results can then be processed by the rest of the SQL query.
Aster Data's tight-coupling approach is more interesting: the database natively supports MapReduce (with no need for Hadoop). Map and reduce functions can be written in a variety of programming languages (C/C++, Java, Python). Aster has extended SQL itself with syntax for invoking these functions, creating a new SQL dialect called SQL/MR. One of the cool features is that map and reduce functions are automatically polymorphic, just like native SQL functions such as SUM and COUNT: the programmer can write them once, and the database engine can invoke them on rows with different numbers of columns and columns of different types. This is a huge convenience over the Hadoop approach.
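The polymorphism idea can be sketched in plain Python: a row function written once that the engine could apply to rows of any shape. This mimics the behavior described above; it is not SQL/MR's actual API.

```python
# Sketch of a polymorphic row function: written once, it works on
# rows with any number of columns and any column types. Illustrative
# only -- not SQL/MR's real invocation mechanism.
def row_max(row):
    """Largest numeric value in a row with arbitrary columns."""
    numbers = [v for v in row.values() if isinstance(v, (int, float))]
    return max(numbers) if numbers else None

# The same function works unchanged on rows with different schemas:
a = row_max({"price": 10, "qty": 3})
b = row_max({"x": 1.5, "y": 7, "label": "widget"})
```

In a Hadoop-style setting, by contrast, handling a new input schema typically means touching the map and reduce code, which is the convenience gap the text points out.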
What are the pros and cons of these three different approaches? The advantage of the Pig Latin approach is that it works directly at the file level, and therefore it can express MapReduce computations that don't fit the relational data model. An example of such an operation is building an inverted index on a collection of text documents. Databases in general are bad at handling large text and image data, which are treated as "blobs."
The biggest disadvantage of the Pig Latin approach is the need to learn an entirely new programming language. There is a large group of developers and DBAs familiar with SQL, and Pig Latin does not have this support base. The second disadvantage is that the developer has to code query execution plans by hand, while a SQL programmer can rely on two decades of work on SQL query optimizers, which can automatically decide the order of operations, the degree of parallelism, and when to use indexes.
The advantages and disadvantages of the SQL integration approach in general mirror those of the Pig Latin approach. The loose coupling approach of Greenplum allows the use of files as well as relations, and therefore in principle supports file-based computations. The burden is on the application programmer, however, to decide on the scheduling and optimization of the Hadoop portion of the computation, without much help from the database.
Aster's tight-coupling approach, on the other hand, allows a much greater degree of automatic query optimization. The database system is intimately involved in the way map and reduce operations are scheduled across the cluster, and can decide on the degree of parallelism, as well as use strategies such as pipelining across map-reduce and relational operators. In addition, since the database system is solely in charge of overall resource allocation and usage, it also ensures sandboxing of user-defined code, preventing it from consuming too many resources and slowing down other tasks. For computations that use only data in the relational database, Aster by far has the most elegant solution; the weakness, of course, is that data stored outside the database is off-limits.
Update: Tassos Argyros from Aster Data points out that Aster's implementation does in fact allow access to data stored outside the database. The developer needs to write a UDF that exposes the data to the database engine.
All three approaches thus have their strengths and weaknesses. It's exciting to see the emergence of fresh thinking on data analytics, going beyond the initial file-oriented MapReduce model. Over time, these approaches will evolve, borrowing from one another. In time, one or more will become the dominant paradigm for data analytics; I will be watching this space with great interest.
Disclosure: I'm an investor in Aster Data and sit on their Board of Directors.