Stanford InfoBlog

CIDR 2009 Trip Report (Posted by Steven Whang)

2009-01-28T17:37:00.000-08:00

This collaborative blog post was written by InfoLab members who attended the recent CIDR conference held Jan. 4-7, 2009 at the Asilomar Conference Grounds. The viewpoints in the blog do not necessarily represent the entire InfoLab.

Instead of trying to cover all the interesting work presented, we focus on major trends centered on the two keynotes and the Best Paper Award. These three talks covered important research directions for the database community: user interfaces, power-sensitive systems, and new hardware for databases. We then briefly mention works by InfoLab members/alums.

1. User Interfaces

The first keynote by Jeff Heer demonstrated how people can easily collaborate on data analysis. As an example, we were shown visualizations of United States census data over the last 150 years using sense.us, a prototype web application for social visual data analysis. Users can analyze the data (e.g., adding an annotation that the sharp decrease in the number of people with military jobs in the late 1920's was due to the Great Depression) and easily share their results with others (e.g., posting a URL of their view). Hence, the key contribution of sense.us is supporting asynchronous collaboration for visualization.

The demonstration clearly showed the importance of a good user interface for database systems. As DBMSs and query languages like SQL are used by a broader audience of programmers, there is a need to provide easier tools for end-user data management and manipulation. While sense.us is already an excellent tool for collaboration, it still remains to be seen how database systems can adopt the underlying science of human interaction. Several issues were raised by the audience including managing groups of collaborators, preserving privacy, and exporting visualizations to other systems.

Two other works in the conference also focused on user interfaces. A presentation by Yannis Ioannidis discussed the challenges of providing a natural language user interface for databases (e.g., a database should give back an answer like "The director's name is Woody Allen" instead of a table). A presentation by Zachary Ives demonstrated CopyCat, a tool that provides an interface for integrating data without having to design a schema; the CopyCat system "learns" the schema based on copies and pastes made by the user.

2. Power-sensitive Systems

The second keynote by James Hamilton proposed a cost and power efficient system for internet-scale services (CEMS project). The first observation made was that server efficiency is key to improving the overall data center power efficiency (nearly 60% of the power delivered to a high-scale data center is delivered to servers). Next, servers were built using low-cost, low-power client or embedded components. The resulting CEMS prototype outperforms a high-scale commercial internet service by 379% in terms of performance/joule. Hence, the prototype is a significant contribution to the recent trend of power-sensitive systems.

In addition to the keynote on the CEMS project, which improved hardware for better power efficiency, there was discussion on the software side of improving power efficiency. Mehul Shah's presentation argued that data management software should also be optimized for power efficiency and suggested promising areas in database systems that could improve. Stavros Harizopoulos's "Gong Show" presentation went a step further and suggested that performance should be thrown away in favor of power efficiency! A presentation by Willis Lang proposed two specific techniques that trade performance for power efficiency.

3. New Hardware for Databases

The Best Paper Award was given to "uFLIP: Understanding Flash IO Patterns" (Luc Bouganim et al.), which proposed a benchmark for flash devices. The motivation is that commercially available flash devices do not behave as the flash chips they contain due to the additional layer of block mapping, wear-leveling, and error correction. Consequently, a benchmark is necessary to understand the complex behavior of flash devices (e.g., block writes are not uniform in time) that can help algorithm and system design. The authors compared various flash devices using their uFLIP benchmark and produced helpful guidelines for algorithm development (e.g., random writes should be limited to a focused area of 4-16MB in order to perform nearly as well as sequential writes).
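To make the "focused area" guideline concrete, here is a minimal sketch of how one might generate the write offsets for such an experiment. It is not the uFLIP benchmark itself: the function names, the 32 KB IO size, and the 8 MB area (picked from inside the 4-16MB range quoted above) are all assumptions for illustration, and the sketch only produces offsets rather than issuing real device writes.

```python
import random

MB = 1024 * 1024
IO_SIZE = 32 * 1024  # assumed IO size, not a value from the paper


def sequential_offsets(n_writes, io_size, start=0):
    """Offsets for back-to-back writes starting at `start`."""
    return [start + i * io_size for i in range(n_writes)]


def random_offsets(n_writes, io_size, area_start, area_size, seed=0):
    """Aligned random offsets confined to [area_start, area_start + area_size)."""
    rng = random.Random(seed)
    slots = area_size // io_size  # number of aligned positions in the area
    return [area_start + rng.randrange(slots) * io_size for _ in range(n_writes)]


# Random writes restricted to a focused 8 MB area of the device.
focused = random_offsets(1000, IO_SIZE, area_start=64 * MB, area_size=8 * MB)
```

Timing the same number of writes for `sequential_offsets` and for `random_offsets` with growing `area_size` would be one way to look for the point where random writes stop behaving like sequential ones on a given device.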

A natural question to ask is whether we can use an interface for flash memory that allows databases to directly access flash cards (instead of relying on black-box flash devices). In the case of disks, database systems usually access the raw disk (instead of accessing the disk through a file system) and disable the operating system's data caching in order to handle caching themselves. Moreover, database systems sometimes even have full control over processor scheduling and the mapping of page tables where both tasks are usually left to the operating system. We suspect that directly accessing flash cards is currently a difficult task compared to directly accessing disks.

Another advance in new hardware is using multi-core processors (or chip-level multiprocessors, CMP). Ippokratis Pandis's Gong Show presentation provided various solutions for scaling databases on CMPs.

4. Works by InfoLab Members/Alums

InfoLab members Georgia Koutrika and Benjamin Bercovitz presented CourseRank, a popular course evaluation and planning social system that is used by over 9,000 students out of 14,000 at Stanford (most undergrads use CourseRank). Based on their experience with CourseRank, Georgia and Benjamin proposed various research challenges for social sites such as encouraging information discovery (using tag clouds) and enabling flexible recommendations in a declarative fashion.

InfoLab alum Chris Olston presented a general two-phase approach for interactive querying over web-scale data. The idea is to first supply a query template in advance to the system, which can then prepare auxiliary structures (e.g., materialized views and indexes) to facilitate real-time query responses later on. As a result, interactive querying is possible for a general class of queries and data at a very large scale.
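The two-phase idea can be sketched in a few lines. This is only a toy illustration, not the system that was presented: the `TemplateIndex` class, the equality-lookup template, and the sample rows are hypothetical, and a hash index stands in for the materialized views and indexes mentioned above.

```python
from collections import defaultdict


class TemplateIndex:
    """Toy two-phase querying for the template: rows WHERE <column> = ?"""

    def __init__(self, rows, column):
        # Phase 1 (offline): the template is known in advance, so build
        # an auxiliary structure (here a hash index) over the data.
        self.column = column
        self.index = defaultdict(list)
        for row in rows:
            self.index[row[column]].append(row)

    def answer(self, value):
        # Phase 2 (interactive): bind the parameter and answer from the
        # prepared structure instead of scanning the data.
        return list(self.index.get(value, []))


pages = [
    {"site": "stanford.edu", "url": "/a"},
    {"site": "cidrdb.org", "url": "/b"},
    {"site": "stanford.edu", "url": "/c"},
]
by_site = TemplateIndex(pages, "site")
hits = by_site.answer("stanford.edu")
```

The cost of the scan is paid once when the template is registered, which is what makes the later lookups fast enough to feel interactive.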

InfoLab alum Shivnath Babu presented an integrated diagnostic tool for database and SAN administrators. Using an abstraction that ties together the execution path of queries and the SAN, it is possible to diagnose query slowdowns caused by combinations of events across the database and the SAN.

InfoLab alum Yannis Papakonstantinou presented app2you, a web application creator that lets developers create web applications without doing database coding or designing. The app2you platform presents a tradeoff point between having a wide application scope (e.g., by building applications using Java, Ajax, and SQL) and providing ease of specification (e.g., by simply copying an application template).

The following InfoLab members/alums co-authored papers, but did not give talks:

Anish Das Sarma (member): "Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence"
Jun Yang (alum): "RIOT: I/O-Efficient Numerical Computing without SQL"
Ramana Yerneni (alum): "A Scalable Data Platform for a Large Number of Small Applications"
Janet Wiener (alum): "Visualizing the Robustness of Query Execution"
Marianne Winslett (alum): "Remembrance: The Unbearable Sentience of Being Digital"
Jeffrey Naughton (alum): "The Case for a Structured Approach to Managing Unstructured Data"

Check out other blog posts on CIDR 2009 by Joe Hellerstein, Pat Helland, and Leigh Dodds.

Vacation Post: Claremont (Berkeley) Database Research Self-Assessment Revisited (Posted by Paul Heymann)

2008-12-21T13:48:00.000-08:00

It's vacation time in the InfoLab, so I thought I would write a follow-up post on a previous topic. In June, Hector wrote a post about attending the Berkeley Database Research Self-Assessment. A few months later (in late August), the Self-Assessment came out as the Claremont Report on Database Research (PDF available here).

There was a little bit of discussion at the time. A dbworld message went out on August 19th, and there were a few follow-ups. Alon Halevy was the first to (briefly) blog about the report, Dave Kellogg gave a good summary of the main points as did Tasso Argyros at the Aster Data Systems Blog, and there are a number of other posts with in depth comments, like the ebiquity blog looking for more Semantic Web discussion and Stephen Arnold discussing massive heterogeneous data. Two posts connect the Claremont Report's focus on big data to Nature's recent issue on Big Data.

I would summarise the report, but I actually think it's pretty clear. (These are my thoughts, incidentally, and may not represent the views of all of the InfoLab on such a broad report.) We are at a historic point in data management, and there are huge opportunities. Many communities would like better tools for data management, and would be happy and willing to learn them if we provided them (as long as it isn't Datalog, see Hellerstein below... "Datalog? OMG!"). But, we're not, or at least, not really. Sam Madden's contribution actually struck a chord with me as someone who is much more often a consumer rather than a creator of database technology (working mostly on the web rather than core database research):

At the risk of feeding what ultimately may be a really well crafted troll, what Madden describes is what I face on a daily basis. My usual tools end up being things like awk, sort, join, ad-hoc Python and Perl scripts, SQLite with a ramdisk or otherwise in memory, and other one-off or only somewhat appropriate tools, even when my data is relational and I would be able to phrase my queries much more succinctly in a declarative language. Rather than being able to use a distributed DBMS to do parallel work, I end up using MapReduce (Hadoop), usually with some hacks to use a higher-level language (currently Dumbo, maybe I'll try Pig or ~~FQL~~ Hive again sometime soon).
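For readers who have not tried the "SQLite in memory" workaround, a minimal sketch looks like the following. The table name, columns, and sample rows are made up for illustration; only the standard-library `csv` and `sqlite3` modules are used.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited data of the kind that usually lives in text files.
RAW = (
    "who\turl\ttag\n"
    "alice\thttp://a.example\tpython\n"
    "bob\thttp://a.example\tpython\n"
    "alice\thttp://b.example\tsql\n"
)

conn = sqlite3.connect(":memory:")  # nothing touches disk
conn.execute("CREATE TABLE bookmarks (who TEXT, url TEXT, tag TEXT)")

reader = csv.reader(io.StringIO(RAW), delimiter="\t")
next(reader)  # skip the header row
conn.executemany("INSERT INTO bookmarks VALUES (?, ?, ?)", reader)

# The declarative part: one query instead of a sort | uniq -c pipeline.
rows = conn.execute(
    "SELECT tag, COUNT(*) AS n FROM bookmarks GROUP BY tag ORDER BY n DESC, tag"
).fetchall()
```

Even in this tiny case the boundary crossing is visible: the data is parsed into the database and the result is pulled back out as Python tuples before anything else can use it.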

Is anyone seriously working to address this problem? It seems much sexier to work on new semantics (e.g., semi-structured data, streams, uncertainty, anonymity) or new ways to optimise retrieval (e.g., column stores, self-tuning). But neither of these really addresses what seems to be the massive cost of boundary crossing. This isn't to deride any of the work on new semantics or new optimizations; on the contrary, that work is extremely important for the database community to remain relevant to a wide community of potential users. But, it seems to take forever to get data into a database (and exotic bulk loading tools make it complex as well), index it, and get it ready to be queried. Then it takes forever to get data back out if you are using the database for declarative data manipulation rather than having online queries be the end result. Maybe the answer is having data in XML, and then querying that data directly (but, to paraphrase JWZ, paraphrasing someone else, "Some people, when confronted with a problem, think 'I know, I'll use XML.' Now they have two problems."). Maybe the answer is that the relational database is an oddity, and that the much more common pattern is for simple, bad languages and bad data models to succeed, especially if they have simple models of computation and look like C (see for example Worse is Better, particularly "Models of Software Acceptance: How Winners Win").

Are there tools that will let me manipulate my data declaratively and efficiently, but then get out of my way when I want the data in R, or I want to write some ad-hoc analysis? Are there any production level tools that don't have a huge start-up cost when data goes in, and that might actually give me some indication of when the data will come out? Is everyone just using various forms of delimited text to organise massive amounts of structured, semi-structured, and unstructured data? (Except banks and retailers anyway.)

In any case, while I personally found Madden's research direction most accurate in describing what I need from databases in my work, there are a number of interesting research directions that people presented. Unfortunately, the presentations are in a variety of formats (they're all originally from the Claremont Report page), so I've munged them and then put them on Slideshare for your perusal. (Some are a little bit like reading tea leaves without seeing the actual talk, but most seem pretty clear in content.)

What do you, as a reader of the Stanford InfoBlog, think is the most important research direction below? Was something missed that is near and dear to your heart? What solutions do you use today to manipulate big and exotic data in your work?

Update 2008-12-22 23:12:00 EST: Switched link to FQL to be a link to Hive. Good catch, Jeff!

Research Directions

Rakesh Agrawal

Eric A. Brewer

AnHai Doan

Johannes Gehrke

Le Gruenwald

Laura M. Haas

Alon Y. Halevy

Joseph M. Hellerstein

Yannis E. Ioannidis

Hank F. Korth

Donald Kossmann

Samuel Madden

Beng Chin Ooi

Raghu Ramakrishnan

Sunita Sarawagi

Michael Stonebraker

Alexander S. Szalay

Gerhard Weikum

Clustering the Tagged Web (Posted by Daniel Ramage)

2008-11-25T23:07:00.000-08:00

I'm looking forward to presenting Clustering the Tagged Web (mirror) at WSDM 2009. The paper is joint work with Paul Heymann, Hector Garcia-Molina, and Christopher D. Manning. We examine how and when social web bookmarks like those found on del.icio.us can be used to improve unsupervised flat clustering algorithms on a web corpus. The one-line summary is that tags should be incorporated as a parallel information channel to page text and to anchor text to gain maximum benefit on broad clustering tasks.

A social web bookmark is like a regular web bookmark of a URL by some web user, except that the social web bookmark is posted to a centralized web site with a set of associated tags. A tag is (usually) a single-word annotation that describes the URL being bookmarked. Earlier this year, del.icio.us, a popular social bookmarking site, had a page up that defined a tag as:

"A tag is simply a word you use to describe a bookmark. Unlike folders, you make up tags when you need them and you can use as many as you like. The result is a better way to organize your bookmarks and a great way to discover interesting things on the Web." -- del.icio.us

In the screenshot above, the web page of Strunk and White's classic book on writing, The Elements of Style, has been tagged with "writing," "reference," "english," and "grammar" by del.icio.us users.

Why tags? Tags are a rich source of information about a large subset of the web's most relevant pages. We believe they are a particularly useful source of information for automatic web page clustering because, by and large, each is a human-provided annotation of page content. Additionally, new, high quality documents on the web are often tagged before they establish much of a footprint in the web's link graph [Can Social Bookmarking Improve Web Search?]. So making good use of tags promises impact when link structure and anchor text are still spotty.

Given a large set of items (web pages in our case), the goal of a flat clustering algorithm with hard assignments is to coherently group together those items into some smaller number of clusters.

Why clustering? Automatic clusterings of web pages into topical groups can impact the quality of information retrieval on the tagged web in several ways, including: promoting diversity of search results, enabling improved cluster-driven search interfaces like scatter/gather, and improving upon language-model based retrieval algorithms, among others. Plus it's a fairly well understood task, but one in which a new data source like tags really should (and indeed does) have a substantial impact. Our goal in the paper is to show how that impact can best be achieved.

Some of the paper's main findings are:

Staying within the standard vector space model (VSM), the K-means clustering algorithm can be modified to make effective use of tags as well as page text. We found that weighting all word dimensions in aggregate equally to (or with some constant multiple of) all the tag dimensions resulted in the best performance. In other words, if you normalize tag dimensions and word dimensions independently, then your clustering model improves. This insight may well apply to other tasks in the VSM.
Our experience incorporating tagging in the VSM prompted us to consider other clustering approaches with the potential to more directly model the difference between tags and page text. One such approach is a new hidden-variable clustering model, Multi-Multinomial Latent Dirichlet Allocation (MM-LDA), similar to latent Dirichlet allocation (LDA). Like LDA, MM-LDA assumes that every document is made up of a (weighted) mixture of topics, and that each word comes from the word distribution of one of those topics. MM-LDA also assumes that each tag is analogously chosen from a per-topic tag distribution. (The figure above shows MM-LDA's plate diagram for those who are familiar with Bayesian graphical models and can be safely ignored by those who aren't.) The upshot is that MM-LDA acts as a high performing clustering algorithm that makes even better use of tags and page text by treating each as a set of observations linked only by a shared topic distribution. It's fairly fast and generally outperforms K-means.
Anchor text - the text surrounding links to a target web page - is another kind of web-user provided page annotation, but it acts quite differently than tags, at least from the perspective of web page clustering. Treating anchor text, page text, and tags as independent information channels is again the best way to combine these three types of signal. We found that including tags improves performance even when anchor text was provided to the clustering algorithm. However, including anchor text as an additional source of annotations doesn't really improve on just clustering with words and tags.
If you're trying to infer a relatively small number of clusters in a very diverse set of documents, it makes sense to utilize page text and tags jointly as described above. But if you're looking to cluster a set of pages that are already a specific subset (like pages about Programming Languages), tags often directly encode a sort of cluster membership (like "java" and "python"). So when tags are at the same level of specificity as the clusters you're trying to infer and you have enough tags to avoid sparsity issues, you might do better clustering on tags alone, ignoring page text.
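The first finding, normalizing word and tag dimensions independently before clustering, can be sketched as follows. This is an illustrative assumption about how one might implement it, not the code used in the paper: L2 normalization, the `combine` helper, the `tag_weight` parameter, and the sample counts are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: v / norm for term, v in vec.items()}


def combine(words, tags, tag_weight=1.0):
    """Normalize each channel on its own, then place both in one vector.

    Keys are prefixed by channel so the word "python" and the tag
    "python" stay separate dimensions.
    """
    combined = {("word", t): w for t, w in l2_normalize(words).items()}
    for t, w in l2_normalize(tags).items():
        combined[("tag", t)] = tag_weight * w
    return combined


# A page with many word occurrences but only a handful of tags.
page = combine({"style": 30, "grammar": 40}, {"writing": 3, "reference": 4})
```

Without the per-channel normalization, the much larger word counts would dominate distance computations in K-means; with it, the tags as a group carry the same total weight as the words (or `tag_weight` times as much).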

We invite you to read the paper and I hope to see some of you in Barcelona!

An Often Ignored Collaboration Pitfall: Time Phase Agenda Mismatch (Posted by Andreas Paepcke)

2008-11-08T19:00:00.000-08:00

[An earlier version of the following thoughts was posted on an internal online forum of the Council on Library and Information Resources (CLIR). The material was further discussed and developed at CLIR's symposium on Promoting Digital Scholarship: Formulating Research Challenges in the Humanities, Social Sciences, and Computation.]

The Stanford InfoLab has enjoyed a multi-year string of active, cross disciplinary collaborations. We have worked closely with biodiversity researchers, physicians, and political scientists on projects of mutual interest. Several publications emerged from these collaborations, not just in the CS community, but also in the Biology literature [e.g. 1, 2, 3, 4, 5].

Stanford University, the National Science Foundation, and others attempt to encourage cross-disciplinary efforts through financial and other incentives. In our experience such collaborations are in fact highly beneficial. They are not, however, trivial to manage.

Time Phase Agenda Mismatch

Every cross disciplinary work we have been involved in has experienced some degree of mismatch in what would be an optimal activity for each party at any given time. For example, the best new computing tool that would provide the optimal, immediate progress to, say, a political scientist, might be of no interest to a computer scientist needing to publish; the underlying science for the tool was developed several years ago.

Vice versa, a cutting-edge CS prototype might either be too exotic for use by a political scientist trained in more standard tools, or it might prove too brittle and incomplete for everyday use. In entering collaborative work both partners therefore need to be clear about expectations.

Note that all parties in an endeavor might well agree that long-term collaboration is the right approach. The problem lies in the day-to-day decisions about resource and time allocation. A look at the traditional process of computer science research will clarify the issue from the CS point of view.

Computer Science Workflow

Here is the required workflow for many research university computer science faculty: Propose an important, difficult-to-solve problem, plus thoughts towards a solution to the National Science Foundation. Grant in hand, compete with other faculty of the same university for student interest. Ph.D. students are the most valuable in this competition, because they will stay longer than Masters or Undergraduate students and will dig deeper.

The faculty member's responsibility towards Ph.D. students is to move them towards graduation. This task requires the identification of constituent sub-problems, whose solutions will be published in highly regarded computer science conferences or journals. Often the work will include a prototype that is stable enough for performance measurements or usability testing. Very rarely will this prototype include all the details that would be required for practical use.

In fact, forcing Ph.D. students into such 'filler' work of adding the required bells and whistles to a prototype might be considered irresponsible, because these students are already trained for this type of work and need new challenges.

Employment of Masters students can be, and often is, the answer. Two issues arise around this solution. First, the best Masters students will be looking to tackle cutting edge CS work. Being offered filler work, they will choose other projects, leaving only less talented or insufficiently trained students who then need very significant supervision.

The second downside of hiring Masters students for filler work is that the investment---currently about $75,000 per academic year at many institutions---will not move a computer science professor closer to the next grant that will be required to feed the existing Ph.D. students who usually straddle the time boundaries of at least two grants.

Where's the Payoff?

Enter the biologist, physician, historian, political scientist, or law scholar in the cross-disciplinary enterprise. Let me denote this person the 'partner'. We assume here that the common vision of a collaborative project is compelling to both participants. Both are perfectly well disposed towards the other. Let us even assume that the respective fields' jargon as well as deeper conceptual notions are mutually understood. Assume further that the CS professor will hear and understand the needs of the partner.

A novel prototype is now constructed with important input from the partner. Everyone is rightfully excited. But now the problem sets in. The CS professor and the involved students will write a CS paper, and they are then ready to move on to the next sub-problem of the collaborative project. The partner, in contrast, is eager to start using the tool, ... which breaks under even mild use and does not include all the required features.

Do note that this state of a prototype is acceptable in the context of CS publications. A perfectly honest prototype is expected to be built up to the point where the *salient* features are solid and can be measured. It is understood in the CS research community that the remainder of a prototype may be a scaffold. That state of affairs is not a scam. Taking software from prototype to product quality is extremely expensive and, again, will not lead to progress in the students' or professor's research career.

Where are we now in this scenario? The CS professor is impatient to move on to the next sub-problem within the project. The partner is disappointed. He has invested significant time explaining his problems to the CS team, and testing intermediate results. Now, when his labor's results seem near, they are not.

The know-how for the often very large remainder, the filler work, was developed in CS years ago, when the partner did not need it. Now he does, but the CS resources are not allocated. The CS professor's and the partner's agendas are out of phase, even though their long term goals match.

The take-away point is that a collaboration agreement must address this situation before work begins. Expectations must be managed and mutually understood.

Some Solutions

Our own past successes broke out of this difficulty along different paths. Admittedly, we did not plan any of these solutions in advance. In one case the possibility of a startup company was enough to make the partner's work worth-while to him. Sometimes, if CS results from the prototype promise economic interest, an existing company might license the ideas and work the prototype into a product, from which the partner can then benefit. Delays in the partner's satisfaction are naturally built into this solution.

In another case the succession of published results led to follow-on funding that included resources for the partner. The CS-typical rapid forward movement without full development of the covered terrain thereby benefited everyone: The readers of resulting publications were learning; the CS professor and partner enjoyed the satisfaction of having produced knowledge that neither could have produced alone; the funding agencies produced innovation in accordance with their mission; the CS professor's future research will be colored by the new understanding of the partner community's needs, and the partner can enjoy financial resources in addition to having gained an improved understanding of what is easy in CS, and what is hard. Future collaborations will thereby be improved as well. The disadvantage of this solution is that the partner's research community cannot see the impact on their area of expertise until much later.

Yet another model we have followed is for research staff to skip vacations and to spend the summer implementing filler portions of a prototype. This activity means that correspondingly little grand thinking is achieved during that time. But the prototype moves to full usability by the partners. Unfortunately, this solution is difficult to scale.

No matter the field of a partner, the computer science side will often need to engage in at least some 'grunt work' at some point in the project. This work needs to be of immediate, convincing benefit to the partner. CS research culture will need to learn how to accommodate these activities even though they are currently often not respected.

The Role of Funding Agencies

Some calls for funding proposals require proof that the output tangibles of the research---prototypes, data sets, and such---will be maintained and expanded after expiration of the grant. While likely motivated by the right concerns, such a requirement is usually impractical. For what can proposing research organizations promise?

A startup company is one option for a continuation promise. Unfortunately, economic feasibility can usually not be predicted in the context of advanced research projects. The promise of a startup company is therefore unrealistic at the time proposals are written.

Another promise might be the hire of full-time staff that will care for the tangibles after the grant terminates. Two problems arise with this solution. First, such staff needs to be financed over long periods of time---a commitment most funding agencies are unable and unwilling to make.

Second, a CS research organization cannot, grant after grant, staff the maintenance of an ever-growing number of orphaned tangibles. Such an organization would quickly run dry of funds and of supervision resources for students, to whom educational institutions, at least, owe focus.

Unfunded mandates in calls for grant proposals are thus not a likely answer. One possibility might be for grants to include money specifically for hardening prototypes. For example, such funds might be spent to hire the student(s) who constructed the prototype for the summer following their graduation. The advantage of this solution is that the creators of the prototype are in the best position to improve code quickly. However, salaries would likely need to be higher than what is typical for students, because first, these potential hires will be graduates at that point, and second, the work of hardening is not desirable for many (at that point former) students.

Another component to addressing the problem of time phase agenda mismatch would be for funding agencies truly to acknowledge the efforts of non-CS partners in collaborative grants. Concretely, such acknowledgment would mean that subsequent proposals by, say, a historian could realistically cite the results of an earlier collaborative effort as past achievement in the field of history. Even if the collaborative effort did not immediately lead to changes in historical inquiry, the advancement of computing methods towards use by historians must 'count' as a true contribution.

Conclusion

In summary, cross disciplinary computing projects harbor immense potential for both parties. Both can be inspired just by grasping the other's mode of thought. The potential exists for moving both fields forward. Frequently, however, results of cross disciplinary work cannot advance both disciplines equally during any given phase of a collaboration. When one party is satisfied, additional work, time, and money are often required to provide satisfactory closure for the other as well. Satisfaction will usually not be symmetric at any given time during a collaboration. Both sides must anticipate overhead work that would not be considered worthy of attention in a single-disciplinary activity. Funding agencies can play a role by (i) encouraging the hardening of tools, and (ii) creating a culture where collaboration is rewarded with favorable consideration for future funding even if the significance of outcomes is asymmetric among the participating parties.

The answer to the above complications in cross disciplinary work is to adjust reward structures and foster the cultural adjustments that will be required across the disciplines. The potential benefits are well worth this effort.

Generic Entity Resolution with Negative Rules (Posted by Steven Whang)

2008-10-22T20:57:00.000-07:00

Entity Resolution (ER) is a process of identifying records that refer to the same real-world entity and merging them together. For example, two companies that merge may want to combine their customer records: for a given customer that dealt with the two companies they create a composite record that combines the known information.

However, the process for matching and merging records is most often application-specific, complex, and error-prone. The input records may contain ambiguous and not-fully specified data, and it may be impossible to capture all the application nuances and subtleties in whatever logic is used to decide when records match and how they should be merged. Thus, the set of resolved records (after ER) may contain "errors" that may be apparent to a domain specialist. For example, we may have a customer record with an address in a country we do not do business with, or two different company records where the expert happens to know that one company recently acquired the other, so they are now the same entity.

In our paper, we address the identification and handling of such inconsistencies using negative rules, which are predicates that take a number of records and return whether the records are consistent or not. For example, if ER mistakenly merges two person records with different genders, we can specify a binary negative rule that flags an inconsistency for any record that has both genders. The negative rules are then used with the original ER rules to find a consistent ER solution.
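
As a rough illustration of what such predicates could look like, here is a minimal Python sketch. It is not the paper's algorithm: the record layout (attribute name mapped to a set of values), the `merge` function, and the two-record `distinct_ids` rule are assumptions made for the example. Only the gender check comes from the text above, written here as a check on a single merged record.

```python
from itertools import combinations

def merge(r, s):
    """Merge two records by unioning the value sets of each attribute (assumed merge logic)."""
    return {k: r.get(k, set()) | s.get(k, set()) for k in r.keys() | s.keys()}

def single_gender(r):
    """Negative rule on one record: consistent only if it carries at most one gender."""
    return len(r.get("gender", set())) <= 1

def distinct_ids(r, s):
    """Hypothetical negative rule on two records: resolved records should not share an id."""
    return not (r.get("id", set()) & s.get("id", set()))

def find_inconsistencies(resolved, one_record_rules, two_record_rules):
    """Return (rule name, record indexes) for every rule that reports an inconsistency."""
    problems = []
    for i, r in enumerate(resolved):
        for rule in one_record_rules:
            if not rule(r):
                problems.append((rule.__name__, (i,)))
    for (i, r), (j, s) in combinations(enumerate(resolved), 2):
        for rule in two_record_rules:
            if not rule(r, s):
                problems.append((rule.__name__, (i, j)))
    return problems

a = {"name": {"Pat Lee"}, "gender": {"F"}, "id": {"17"}}
b = {"name": {"Pat Lee"}, "gender": {"M"}, "id": {"42"}}
c = {"name": {"P. Lee"}, "id": {"42"}}

# Suppose ER (mistakenly) merged a and b, and left c on its own.
resolved = [merge(a, b), c]
issues = find_inconsistencies(resolved, [single_gender], [distinct_ids])
```

Here `issues` flags the merged record for carrying both genders and flags the pair of resolved records for sharing an id. Deciding how to repair the flagged records is where the algorithms in the paper (and possibly a domain expert) come in.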

There are two reasons why we cannot simply modify the ER rules instead of using the negative rules. First, the negative rules are only used to check the consistency of the final ER result and not the consistency of the intermediate records that are created during the ER process. For example, a record that has both genders may turn out to be Female (based on more evidence) at the end of the ER process (and thus consistent). Second, the negative rules can be viewed as an effort of a second party to fix the errors made by the ER rules of the first party.

We propose a general algorithm that finds a consistent ER solution and an enhanced algorithm that is more efficient than the general algorithm, but assumes certain properties on the negative and ER rules. Our main findings are as follows:

Negative rules can significantly improve the accuracy of the ER process. Depending on the strategy, a domain expert may need to help resolve the inconsistencies.
Different combinations of negative rules may result in different accuracy and runtime results. The negative rules themselves also need to be correct and effectively pinpoint the errors of the ER rules.
Applying negative rules is an expensive process (at least quadratic) that should not be run on the entire dataset if the dataset is very large. A common technique in ER is to use blocking techniques (many variations exist) where the entire dataset is divided into smaller blocks, and the blocks are processed one at a time. The negative rules can be used within each block.
Even without the properties, the enhanced algorithm can efficiently produce results that are nearly identical to those of the general algorithm.
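
To make the blocking point concrete, the following sketch divides a toy dataset into blocks and only enumerates record pairs inside each block. The blocking key (zip code) and the record fields are assumptions chosen for illustration; real blocking schemes vary widely.

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key_fn):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key_fn(r)].append(r)
    return blocks

def pairs_within_blocks(blocks):
    """Yield only the record pairs that fall inside the same block."""
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"name": "Ann Lee", "zip": "94305"},
    {"name": "A. Lee", "zip": "94305"},
    {"name": "Bob Ray", "zip": "94305"},
    {"name": "Cy Wu", "zip": "94040"},
    {"name": "C. Wu", "zip": "94040"},
]

blocks = block_by(records, lambda r: r["zip"])
all_pairs = len(list(combinations(records, 2)))
blocked_pairs = len(list(pairs_within_blocks(blocks)))
```

With five records there are 10 pairs overall but only 4 pairs inside the two blocks, which is where pairwise checks such as negative rules would be applied. The trade-off is that an inconsistency spanning two blocks is never examined.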

Feedback! (Posted by Paul Heymann)

2008-09-24T10:20:00.000-07:00

Hello there. We've now been posting to the Stanford InfoBlog for a few months. We hope you've enjoyed the posts so far, but we'd like to know a little more about who you are and what you like in order to better serve you (and to get an idea of how our own research fits into the world of research outside of Stanford and outside of academia).

If you have a few minutes, we would really appreciate it if you could fill out a survey here. We'll try not to give out any identifying details (not that any of the questions are that personal), and we'll use the responses to improve the InfoBlog. We might have a quick post at some later point to show what the results looked like.

Also, if you'd like, feel free to post any suggestions for the blog as comments to this post.

(P.S. The survey link above has a limit as to the number of responses, so if the survey is closed, don't worry! We may post another link, or just analyze the data at the point where we reach the limit.)

VLDB 2008 Trip Report (Posted by Ioannis Antonellis)

2008-09-16T15:12:00.000-07:00

I recently came back from VLDB 2008 in Auckland, New Zealand where I gave a talk on my Simrank++ paper (see also my previous post, and Greg Linden's related post). The conference took place in the SkyCity Convention Center, located next to the Sky Tower, the southern hemisphere’s tallest tower. It was quite an experience, while heading to the conference and passing through the tower every morning, to watch people jumping from 320 meters high. Several conference participants were actually brave enough to claim that they were planning to take part in this or a different adventure activity in New Zealand (including our own Parag)...

In this post I plan to give an overview of the main keynote talk and the 10-year best paper award session talk as well as provide comments on a few research talks I attended. The slides from all presentations are already available from the VLDB website (here) and video recordings from all sessions will presumably be posted soon as well. Unfortunately, many of the papers are not yet available from VLDB, but I've tried to post links to the papers where they are available.

The main keynote

Mark Hill from University of Wisconsin-Madison gave the main keynote (slides): a teaching keynote on transactional memory. In his talk, entitled "Is Transactional Memory an Oxymoron" (notice the oxymoron: transactions are durable, memory is not), he gave a nice overview of transactional memory implementations via software and hardware and suggested how transactional memory can be used in database applications.

Mark Hill giving the main keynote

As he explained, DBMS transactions and transactional memory (TM) transactions differ in (a) their design goals, (b) their state, and as a result in (c) their implementation as well.

(a) DBMS transactions target mostly failures and then concurrency. The underlying assumption is that weird things can happen during a concurrent execution of transactions so there is need for all or nothing execution semantics. On the other hand TM transactions target only concurrency because their goal is to make parallel programming easier.
(b) The state for DBMS transactions consists of some durable storage (disk) and non-durable memory used as a cache for the disk. However, TM transactions are defined over the non-durable user-level memory. This explains why the title of the talk is not an oxymoron, as the non-durable memory is sensible for achieving the concurrency.
(c) Finally, both the differences in goals and the state have led to completely different implementations. According to TPC-C the best DBMSs achieve around a million transactions per minute per system whereas TM implementations execute a billion transactions per minute per core.

In summary, he argued that transactional memory will probably be more useful for new parallel applications than for DBMSs since the latter already use optimized latching strategies.

Mark Hill performing Maori dances
during the conference dinner

10-year best paper award

The paper "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" by Stephen Blott, Hans Schek and Roger Weber from VLDB 1998 was the winner of the 10-year best paper award. In that paper, which currently has more than 700 citations, Stephen and his coauthors explain in a systematic way why the "curse of dimensionality" appears in nearest neighbor search problems in high dimensions. They analyzed the existing data structures, which were partitioning-based, and showed formally that linear scanning of the database is more efficient for high dimensions. They also came up with a data structure that deals with the curse of dimensionality by approximating partitions of the space. Overall the talk was very pleasant and the material was clearly presented.
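
The general flavor of scanning compact approximations instead of navigating a partitioning tree can be sketched as follows. This is only an illustration under my own assumptions (coordinates in [0, 1], a uniform grid with `bits` bits per dimension, Euclidean distance), not a reproduction of the structure from the paper. Each vector is replaced by the grid cell it falls in; during a scan, the cell gives a lower bound on the true distance, and the exact vector is only consulted when that bound could still beat the best candidate found so far.

```python
import math

def quantize(vec, bits, lo=0.0, hi=1.0):
    """Approximate a vector by the index of the grid cell it falls in, per dimension."""
    cells = 1 << bits
    width = (hi - lo) / cells
    return tuple(min(int((x - lo) / width), cells - 1) for x in vec)

def lower_bound(query, approx, bits, lo=0.0, hi=1.0):
    """Smallest possible Euclidean distance from the query to any point in the cell."""
    width = (hi - lo) / (1 << bits)
    total = 0.0
    for q, c in zip(query, approx):
        cell_lo = lo + c * width
        cell_hi = cell_lo + width
        if q < cell_lo:
            total += (cell_lo - q) ** 2
        elif q > cell_hi:
            total += (q - cell_hi) ** 2
    return math.sqrt(total)

def nearest(query, vectors, bits=2):
    """Linear scan over approximations; exact distances only when the bound allows."""
    approximations = [quantize(v, bits) for v in vectors]
    best_idx, best_dist, exact_checks = None, math.inf, 0
    for i, approx in enumerate(approximations):
        if lower_bound(query, approx, bits) >= best_dist:
            continue  # this vector cannot beat the current best
        exact_checks += 1
        d = math.dist(query, vectors[i])
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, exact_checks

vectors = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.15, 0.2, 0.1), (0.8, 0.1, 0.9)]
query = (0.12, 0.18, 0.1)
idx, dist, exact_checks = nearest(query, vectors)
```

In this toy run the scan returns the same answer as a brute-force scan while computing exact distances for only two of the four vectors.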

However, it was unfortunate that Stephen seemed not to have followed the rich subsequent work on approximate answering of nearest neighbor queries. For example, a question by Surajit Chaudhuri about the relationship between hashing-based schemes like locality sensitive hashing (LSH) and the presented work went unanswered.

Experiments and Analysis track/Best paper awards

This year there was a new track with papers that try to reproduce and expand on previously published experimental results. One of those papers (Finding Frequent Items in Data Streams by Graham Cormode and Marios Hadjieleftheriou) was a co-winner of the Best Paper Award. The other best paper was on "Constrained Physical Design Tuning" by my summer colleagues from Microsoft Research, Nico Bruno and Surajit Chaudhuri.

Nico Bruno presents Constrained Physical Design Tuning

Proceedings of VLDB

This year the VLDB Endowment announced the Journal track of the Proceedings of VLDB as an attempt to reduce the high review load. The vision of the VLDB Endowment is a journal (JDMR) for short "conference style" papers with rapid turn-around: authors submit papers only to the journal, there is a fixed review period, and finally all database conferences will be able to select papers for presentation from the available pool of papers published in the journal. Many people expressed their skepticism on whether this will work, so it remains to be seen... Personally, I like the idea of waking up every day knowing that I can submit my next VLDB paper today and, even better, knowing three months from today that I will be visiting Lyon next summer. :)

Research talks

Parag gave a talk on "Scheduling Shared Scans of Large Data Files" and Anish presented the TRIO-related paper "Towards Special-Purpose Indexes and Statistics for Uncertain Data" in the MUD workshop.

In general the conference program was diverse: there were talks on traditional database subjects (theory, systems, XML databases, DB performance and evaluation, Distributed Systems Processing, Indexing, Data Integration, privacy), more recent 'trends' (Column store databases, Uncertain databases, Stream processing) as well as papers related to Web search, sponsored search, association rule mining, IR and text databases. Also, this year all papers had a 25-minute slot for the talk; this enabled more papers to be accepted.

Allison Holloway gave a nice talk on her paper "Read-Optimized Databases, in Depth" (with David DeWitt). Following the big debate of C-store vs 'anti C-store', they are trying to come up with a fairer comparison between the two systems. In the paper, they focus on studying scans for compressed row and column stores for different compression types, queries and table types. In the same session, Ioannis Koltsidas presented an interesting paper on combining flash memory with disks for storage.

Google had two papers I found interesting, one on extracting structured data from Web tables and another one on surfacing the deep web. Yahoo! had (among many papers) a search-related paper on "Relaxation in Text Search Using Taxonomies" where they presented a document retrieval model that augments text queries with multidimensional taxonomy restrictions. Another interesting paper on text search looked at how a presumably untrusted text search engine can provide guarantees that its results do not favor specific documents. Also, Microsoft researchers presented SCOPE, a SQL-like scripting language for parallel processing of large datasets, which is Microsoft's counterpart to Pig Latin from Yahoo! and Sawzall from Google.

Finally, in the same session as my talk, there was another interesting paper on Simrank by Dmitry Lizorkin who presented optimization techniques for computing Simrank scores efficiently in large graphs.

Alon Halevy gave a tutorial on (what else?) Dataspaces

That concludes my trip report from New Zealand. If you were at the conference and have any comments, please do leave a comment!

Certain Answers From Uncertain Data (Posted by Parag Agrawal)

2008-08-28T09:00:00.000-07:00

In his blog entry, Anish argues that maintaining uncertainty through the data management system (Approach-M) can yield benefits. Processing uncertain data correctly involves capturing dependencies (or correlations) along with probability values in the system. For example, consider a simplified weather forecasting system which predicts that it will rain in either Palo Alto or Sunnyvale with probabilities .3 and .7, because of uncertainty with regards to wind direction. It also predicts that it will rain in Fremont with .2 probability if it rains in Palo Alto, since Fremont is downwind from Palo Alto. (The uncertainty perhaps being with respect to wind speed.) Similarly, it will rain in Milpitas with .5 probability if it rains in Sunnyvale. Capturing correlations correctly would let us conclude that at least one of Sunnyvale or Fremont will be dry. While these correlations are crucial to drawing correct conclusions, end users may often prefer a final result that is simpler and more certain.
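
The forecast example can be worked through by enumerating its possible worlds. The sketch below is my own reading of the example, not code from any system mentioned here: it assumes that exactly one of Palo Alto and Sunnyvale gets rain, that Fremont can only get rain when Palo Alto does, and that Milpitas can only get rain when Sunnyvale does.

```python
from fractions import Fraction

# Each possible world is (set of rainy cities, probability of that world).
worlds = [
    ({"Palo Alto", "Fremont"}, Fraction(3, 10) * Fraction(2, 10)),
    ({"Palo Alto"}, Fraction(3, 10) * Fraction(8, 10)),
    ({"Sunnyvale", "Milpitas"}, Fraction(7, 10) * Fraction(5, 10)),
    ({"Sunnyvale"}, Fraction(7, 10) * Fraction(5, 10)),
]

def prob(event):
    """Probability that a predicate over the set of rainy cities holds."""
    return sum(p for rainy, p in worlds if event(rainy))

p_fremont = prob(lambda rainy: "Fremont" in rainy)
p_sunnyvale = prob(lambda rainy: "Sunnyvale" in rainy)

# With correlations tracked, rain in both Sunnyvale and Fremont is impossible.
p_both = prob(lambda rainy: {"Sunnyvale", "Fremont"} <= rainy)

# Treating the two marginals as independent would wrongly allow it.
p_both_if_independent = p_fremont * p_sunnyvale
```

Under these assumptions Fremont has a 3/50 chance of rain and Sunnyvale 7/10, yet the probability that both get rain is 0. Multiplying the marginals as if they were independent would give 21/500 instead, which is the kind of wrong conclusion that dropping the correlations leads to.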

Allowing the user to compute most likely answers is a common way to provide a "simple to use" result. The result may be restricted to these high-probability results using a confidence threshold, or a top-k by confidence query. This paper is a part of the large body of work that addresses this problem. For the example above, a user might only want to get a travel warning when the chance of rain in a city of interest exceeds a threshold (say .5). This can be posed as a confidence threshold query with a predicate restricting the search to only cities of interest for the user. Queries like this just "clean" up the result to remove some of the uncertainty, allowing the user to "zoom" into the interesting information in the result. I am interested in exploring other ways of cleaning uncertainty that may be useful to some applications.
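
A confidence threshold query and a top-k by confidence query are easy to sketch once each answer carries a confidence. The function names and the dictionary layout below are hypothetical. The confidences are the per-city rain probabilities from the forecast example, under the assumption that Fremont and Milpitas only get rain when the upwind city does.

```python
import heapq

def confidence_threshold(confidences, threshold, cities_of_interest=None):
    """Keep answers whose confidence exceeds the threshold, optionally restricted to some cities."""
    return {
        city: conf
        for city, conf in confidences.items()
        if conf > threshold
        and (cities_of_interest is None or city in cities_of_interest)
    }

def top_k(confidences, k):
    """Return the k answers with the highest confidence."""
    return heapq.nlargest(k, confidences.items(), key=lambda item: item[1])

rain = {"Palo Alto": 0.3, "Sunnyvale": 0.7, "Fremont": 0.06, "Milpitas": 0.35}

warnings = confidence_threshold(rain, 0.5)
warnings_for_me = confidence_threshold(rain, 0.5, {"Palo Alto", "Fremont"})
likeliest = top_k(rain, 2)
```

With a threshold of .5 only Sunnyvale triggers a warning, and a user who only cares about Palo Alto and Fremont gets none. Both queries filter the uncertain result without changing any of the underlying probabilities.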

While the techniques above return more certain answers, they don't resolve any uncertainty. However, can throwing more data at the problem improve results by actually reconciling uncertainty? Consider weather forecast information from multiple sources -- each could be uncertain, they could be mutually inconsistent or mutually reinforcing. Can careful resolution of these data sources yield better, more certain results? I am betting that the answer is "yes" -- This paper provides the foundation for such resolution in a principled manner.

SpotSigs: Are Stopwords Finally Good for Something? (Posted by Martin Theobald)

2008-08-22T18:30:00.000-07:00

In almost all classical Information Retrieval settings that have a text processing component, stopwords are first discarded before anything interesting happens with the document. “Interesting” here might mean indexing the content for search, extracting features for automatic classification, or some other form of content analysis of whatever flavor. Jonathan (my co-author on the SpotSigs paper) had the amazing idea that stopwords may however be very good indicators of the actual interesting parts of a web page. It is especially useful to know where the interesting parts of a web page are when they are interspersed with “added-value” content such as advertisements or navigational banners. This is most strikingly the case with online news articles, but applies more generally across the web.

In our SpotSigs project, we tried to detect near-duplicate Web pages in the news domain. This is a particularly challenging setting because the page layouts are often literally drenched with ads or navigational banners as added by the different sites. The actual core articles constitute only a minor fraction of the overall page, which makes online news a very hard setting for any unsupervised clustering or deduplication approach. Moreover, near duplicates of the same core article frequently pop up from different news sites as most of the content gets delivered by the same sources, such as Associated Press, and the very same core articles then often end up completely unchanged (or only after some minor editing) on many of the sites.

In response to this setting, the idea of extracting more “localized” signatures—namely those that are close to stopword occurrences (hence spots)—was born. These localized signatures exploit the observation that stopwords are frequently and uniformly distributed throughout any form of natural-language text—at least in Western languages—but they remain very infrequent in the typical headline-style banners or ads. SpotSigs connects such stopword anchors (called antecedents) with a nearby n-gram, which is simply a concatenation of further text tokens that are themselves not a stopword, in a very similar way to classic Shingling on the entire page content. In choosing only those Shingles that are connected to a stopword antecedent, however, SpotSigs tends to extract more robust signatures than plain Shingling. At the same time it allows for a more efficient and less error-prone signature extraction as compared to many of the far more sophisticated tools for HTML layout analysis. Another nice property is how SpotSigs handles synthetic documents that do not exhibit any natural language text, like 404 documents that crawlers typically encounter. SpotSigs automatically discards such synthetic documents.
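
A simplified version of this extraction step might look like the sketch below. The tokenizer, the tiny antecedent and stopword lists, the fixed chain length of two, and the use of plain sets are all assumptions for illustration. The paper's actual signature definition has more parameters than shown here.

```python
import re

def spot_signatures(text, antecedents, stopwords, chain_length=2):
    """For each antecedent occurrence, chain the next non-stopword tokens into a signature."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following in stopwords:
                continue  # skip stopwords when building the chain
            chain.append(following)
            if len(chain) == chain_length:
                break
        if len(chain) == chain_length:
            signatures.add(token + ":" + ":".join(chain))
    return signatures

antecedents = {"the", "is", "that"}                      # hypothetical anchor words
stopwords = antecedents | {"a", "of", "to", "until"}     # hypothetical stopword list

article = "The senator said that the budget vote is delayed until Friday"
banner = "Home | World | Sports | Subscribe Now"

article_sigs = spot_signatures(article, antecedents, stopwords)
banner_sigs = spot_signatures(banner, antecedents, stopwords)
```

The article sentence yields four signatures, while the headline-style banner contains no antecedents and yields none. That is the behavior described above: natural-language text produces signatures, navigation and ad text mostly does not.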

As a second focus of the paper, SpotSigs also addresses some algorithmic challenges by tackling the inherent quadratic complexity when having to consider all candidate pairs of documents for the deduplication step. Here, our entire clustering algorithm is developed on the simple idea that we may never need to compare documents (or their respective signature sets) if they already substantially vary in length – at least if some set-resemblance-based similarity measure such as Jaccard similarity is used. This basic observation lets us very efficiently match high-dimensional signature vectors using a combination of classic techniques such as collection partitioning and inverted index pruning. As a rather surprising result, we found that the SpotSigs matching algorithm may even outperform linear-time similarity hashing approaches like locality-sensitive hashing in runtime – at least for reasonably high similarity thresholds and if the distribution of document (or signature) lengths throughout the collection is not too skewed – which nicely advocates the old algorithmic paradigm that sorting may sometimes be favorable over hashing. Overall, for a collection of 1.6 million documents from the TREC WT10g benchmark, we achieve a remarkably fast runtime of only about 5 minutes for parsing and extracting signatures, and less than 15 seconds for the actual clustering step on top of our index structures, already on a single mid-range server machine.
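
The length observation follows from a simple bound: for sets A and B with |A| <= |B|, the Jaccard similarity |A ∩ B| / |A ∪ B| can never exceed |A| / |B|, so a pair whose length ratio is already below the threshold can be skipped without comparing it. The sketch below applies this bound by sorting on length. It is a deliberately simplified stand-in for the partitioning and inverted index pruning used in SpotSigs, and it uses unweighted sets.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def near_duplicates(signature_sets, tau):
    """Find pairs with Jaccard >= tau, skipping pairs ruled out by their lengths alone."""
    order = sorted(range(len(signature_sets)), key=lambda i: len(signature_sets[i]))
    matches, compared = [], 0
    for pos, i in enumerate(order):
        a = signature_sets[i]
        for j in order[pos + 1:]:
            b = signature_sets[j]
            if len(a) < tau * len(b):
                break  # b is too long, and every later set is at least as long
            compared += 1
            if jaccard(a, b) >= tau:
                matches.append(tuple(sorted((i, j))))
    return matches, compared

docs = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    set(range(1, 21)),
    {7, 8},
]
matches, compared = near_duplicates(docs, tau=0.7)
```

Of the six possible pairs, only one survives the length check and is actually compared, and it turns out to be a match. The benefit shrinks when most documents have similar lengths, which fits the caveat above about the length distribution.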

What I personally like about SpotSigs is that all its parts – from the signature extraction, over to the partitioning and index pruning – seamlessly fit together like some seemingly random pieces of a puzzle that finally make a nice picture. For example, the partitioning approach is not only good for breaking down the overall quadratic runtime into many much smaller pieces, but it also helps to smooth the skew toward shorter documents that we typically find in web collections and to provide more balanced partition sizes (the slides provide a little more detail on this). Similar themes recur throughout the work; for example, the threshold-based pruning approach is applied in no less than three variations throughout the entire algorithm – with all steps being based on the very same similarity bound we derive for the (weighted) Jaccard resemblance between two Spot Signature sets.

After the presentation at SIGIR 2008, we received some interesting comments about the inherently subjective nature of detecting near duplicates. This suggests that for future work, thinking about how to personalize the signature extraction step beyond just using stopword anchors would certainly be an intriguing direction. We'd also like to improve the clustering algorithm by further investigating disk-based index structures, possibly distributing the algorithm onto multiple machines, and extending our bounding approach to more similarity metrics such as the well-known Cosine measure, which is more commonly used in IR than for example Jaccard.

Have a look at the paper or slides for more details.

Who is the Data Leaker? (Posted by Panagiotis Papadimitriou)

2008-08-14T15:30:00.000-07:00

In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, social networking sites, such as Facebook, share users' data with the owners of social applications (user approval is often required). Similarly, enterprises that outsource their data processing have to give data to various other companies. Data may also be shared for other purposes, e.g., a hospital may give patient records to researchers who will devise new treatments.

We call the owner of the data the distributor and the supposedly trusted third parties the agents. In a perfect world there would be no need to hand over sensitive data to parties that may unknowingly or maliciously leak it. However, in many cases we must indeed work with agents that may not be 100% trusted. So, if data is leaked we may not be certain if it came from an agent or from some other source. Our goal is to detect when the distributor's sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks involve some modification of the original data and are very useful in many cases. However, there are cases where it is important not to alter the original distributor's data. For example, the data of a Facebook profile may not look different to different users who have access to it. If an outsourcer is doing our payroll, he must have the exact salary and customer identification numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

In this paper we propose unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with 5 cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees "enough evidence" that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

We develop a model for assessing the "guilt" of agents. Based on this model, we present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.
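
The role of fake objects can be illustrated with a small sketch. This is not the guilt model from the paper, which assigns probabilities; it only gathers the raw evidence: how many leaked objects each agent had been given, and which agents' fake objects show up in the leak. All names and the data layout are hypothetical.

```python
def leak_evidence(leaked, real_by_agent, fakes_by_agent):
    """Summarize, per agent, the overlap with the leaked set and any leaked fake objects."""
    evidence = {}
    for agent, real in real_by_agent.items():
        fakes = fakes_by_agent.get(agent, set())
        evidence[agent] = {
            "leaked_objects_held": len((real | fakes) & leaked),
            "leaked_fakes": fakes & leaked,
        }
    return evidence

real_by_agent = {
    "agent_A": {"r1", "r2", "r3"},
    "agent_B": {"r2", "r3", "r4"},
}
# Each agent receives its own fake object, which no other agent holds.
fakes_by_agent = {
    "agent_A": {"fake_A1"},
    "agent_B": {"fake_B1"},
}

leaked = {"r2", "r3", "fake_B1"}
evidence = leak_evidence(leaked, real_by_agent, fakes_by_agent)
```

Both agents held the two leaked real objects, so those alone cannot separate them. The leaked fake object was only ever given to agent_B, which is why fake objects can make the distributor more confident about who leaked.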

You may want to check our short presentation or the full paper for details.

Matching Hierarchies Using Shared Objects (Posted by Robert Ikeda)

2008-08-08T13:02:00.000-07:00

Objects are often organized in a hierarchy to help in managing or browsing them. For example, books in a library can be divided by subject (history, science, ...) and then by country of publication (United Kingdom, USA, ...). Web pages at a site can also be placed in a hierarchy. For instance, a French tourist site may have categories cities, history, hotels, tours; within the cities category we have pages divided by city, and then by events, maps, restaurants.

At Stanford, we have studied the problem of hierarchy matching, in particular, how to determine corresponding edges between two related hierarchies. The need to match hierarchies arises in many situations where the objects come from different systems. In our book hierarchy example above, we may want to combine the book catalogs of two different libraries; in our tourism example, we may want to build a meta-web-site that combines the resources of two or more French tourism sites.

Traditionally, hierarchy matching is done by comparing the text associated with each edge. In contrast, we explored using the placement of objects present in both hierarchies to infer how the hierarchies relate. We chose to explore this alternative approach since the text matching method is not foolproof. For instance, there could be different wordings in the facets; one book hierarchy may refer to "drugs" while the other uses the term "medications."

Suppose we are matching two book hierarchies and both hierarchies contain a particular book. We would infer from the presence of this shared book in the two hierarchies that the two corresponding paths in the hierarchies have a relationship; perhaps they share edges in common. Given many shared objects, we would have a lot of information about possible relationships to reconcile, and how to best reconcile the information is not obvious. We developed two algorithms (one rule-based, one statistics-based) that use shared objects to determine feasible facets for a second hierarchy, given a hierarchy with known facets (label-value pairs that define what objects are placed under an edge).
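
To give a feel for how shared objects carry information about edges, here is a toy co-occurrence count. It is neither of the two algorithms from the paper. The representation (each object mapped to the tuple of edges on its path), the edge names, and the "pick the most frequent co-occurring facet" rule are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def edge_cooccurrence(paths_known, paths_unknown):
    """For each edge of the unlabeled hierarchy, count the known facets of shared objects under it."""
    counts = defaultdict(Counter)
    for obj in paths_known.keys() & paths_unknown.keys():
        for edge in paths_unknown[obj]:
            for facet in paths_known[obj]:
                counts[edge][facet] += 1
    return counts

def best_guess(counts):
    """Guess a facet for each unlabeled edge: the one that co-occurs with it most often."""
    return {edge: c.most_common(1)[0][0] for edge, c in counts.items()}

# Hierarchy with known facets: object -> facets along its path.
library_a = {
    "book1": ("subject=history", "country=UK"),
    "book2": ("subject=history", "country=USA"),
    "book3": ("subject=science", "country=UK"),
    "book4": ("subject=science", "country=USA"),
}
# Second hierarchy: the same books, but its edges e1..e4 have unknown facets.
library_b = {
    "book1": ("e1", "e3"),
    "book2": ("e1", "e4"),
    "book3": ("e2", "e3"),
    "book4": ("e2", "e4"),
}

guess = best_guess(edge_cooccurrence(library_a, library_b))
```

On this clean toy data the guess recovers a facet for every edge of the second hierarchy. With noisy real data the counts conflict, which is the reconciliation problem the rule-based and statistics-based algorithms address.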

We ran experiments on both real and synthetic data and compared the performances of the two algorithms. What we found was that given clean synthetic data (data that conformed to the properties we encoded in our rule-based algorithm), our rule-based algorithm performed consistently better than the statistics-based algorithm. However, with real data, or even with some of our synthetic data, the statistics-based algorithm proved to be more robust. For details, please see here.

SIGIR 2008 Trip Report (Posted by Paul Heymann)

2008-07-30T15:25:00.000-07:00

I just got back from SIGIR'08 in Singapore. I was there to give a talk on my Social Tag Prediction paper, which hopefully I will get around to writing up a full post about sometime soon. In the meantime, I am posting some of my personal reactions to the conference. This is a long post, so feel free to skip down as none of the parts rely on the others.

Kai-Fu Lee Keynote

Keynotes seem to be pretty hard to get right. One hour is a long time, and they either end up being too vague or focus too much on a specific area of the speaker's interest. However, an even bigger issue for me with industry speakers (at least when the talk is not about a research paper) is that they often seem too worried about giving away secrets about their company. As a result, these sorts of talks can end up as reheated summaries of what all of the researchers know, rather than giving any new insight into the challenges industry is facing. Kai-Fu Lee's keynote mostly avoided these pitfalls, and overall I thought it gave a really good overview of the challenges faced by Google in China.

Kai-Fu Lee Keynote

His talk was more or less in the form of bullet points about Google's strategy in China, so I will take more or less that form here. (Greg Linden also has a blog post about the keynote here.)

The number of Internet users in China is growing at 35% each year, and accelerating. This means that whoever can capture new users (who presumably have no idea what they are doing) will ultimately be the market leader in China (assuming no outside intervention).
New users are kind of classless. For new users, result quality in search does not matter and neither does a clean user interface. New users either do not care or cannot yet appreciate these things. Instead, Google is focusing on Internet cafes, entertainment, and music, which users tend to like (also, blended search with video, news, URLs, and other verticals, called "universal search" at Google). These new users look at a clean empty page like google.com and wonder whether the company forgot to finish the rest.
Google could not get people to spell their name. They tried a number of things until finally just giving up and registering "G.cn".
Local engineers building new local products (rather than just cursory localization of worldwide products) seems to be key to Google's plans in China. Examples include an expanded Chinese Zeitgeist, a tool for finding holiday greetings to SMS, and various emergency relief efforts.
An interesting fact is that Chinese is substantially more dense than other languages. This substantially changes user interface design. For example, in English Google News, each entry needs a title, a date, and a snippet. However, in Chinese, the whole summary of the story can fit in the title area, obviating the need for snippets and leading to more stories per page.
Kai-Fu showed a graph with no axes, which he said related to mobile usage. He said that the iPhone web usage was 50 times higher than the nearest mobile competitor, even in China where iPhones are black market. There has been some reporting of the 50 times higher stat elsewhere. (On a personal note, when using my friends' iPhones, I find typing so inconvenient that I usually end up running to Google because I know that a search for "cnn" will give me the right result and then I do not have to type the full URL. Maybe my iPhone touch typing skills will improve with time.)
Users love clicking and hate typing due to annoying text input tools. This leads to a Google product to help with text input as well as a focus on (ugly) directories which these users love.
Piracy in China is high, and Kai-Fu (I could not tell how seriously) said that it had gone down from 99% to 96%. Whether or not this is true, it allowed Kai-Fu to make a little fun of his former employer, saying that a drop from 99% to 96% meant four times the revenues in China.
Google thinks that having freeware authors add ad software (but not malware) to their products may be a reasonable distribution method for software in China. Of course, this is not without hazards: I think Kai-Fu stated that the average Chinese computer gets reinstalled every four months.
China has huge broadband penetration, mostly through Internet cafes.
There are more Internet users in China than the US, and the growth in Chinese users is accelerating.
The average age of a Chinese Internet user is 25, and this number is dropping. Huge numbers of fifteen year olds are getting online.

Overall, Kai-Fu said that all of the recent work of Google China had led to a market share increase from (I think) about 15% to 25% in two years. One thing I was not sure of was whether Kai-Fu was overstating the contribution of this strategy to the 10% increase. He pointed out a few recent deals which I could not really evaluate, like a recent deal with China Telecom which made me wonder if Google had also gotten more politically savvy in China in the past few years. In any case, this was a great talk, and left me at least (though perhaps not seasoned China watchers) with a lot to think about.

Chinese Users Outnumber US Users

Technical Papers

There were a number of technical papers and sessions that caught my eye.

This being the Stanford InfoBlog, it seems like I should start with the Stanford papers at SIGIR this year. There were three, one by Martin, one by Mengqiu, and one by myself. Martin (of the InfoLab) presented SpotSigs (DOI), a duplicate detection technique tuned for news stories and other scenarios where we care about the main text of a page rather than navigational elements.

Martin Presents SpotSigs

Mengqiu (of the NLP group) presented models for improving passage based retrieval (DOI), work he did while at CMU with Luo Si (now at Purdue).

Mengqiu Presents Passage Based Retrieval

My social tag prediction paper (DOI) (with Dan Ramage of the NLP group) focused on how well we can predict tags in social bookmarking systems, and techniques for doing so (including SVM and association rules).

Paul Presents Social Tag Prediction

The tag prediction and tag recommendation problems seem to be getting really popular lately. In addition to the ECML/PKDD Discovery Challenge (which I wrote about here), a number of people are working on similar work. The two (DOI) other (DOI) papers in my session looked at tag recommendation (and related problems) as well.

Real-Time Tag Recommendation

Top-k Querying Social-Tagging Networks

After my talk, I met a number of people who have also been looking at association rules for tags or tag prediction in different contexts. I wish there was a better way to find out who is working on similar work before the work itself is completed!

Later that day, I chatted a little bit with Brian Davison. Brian has been looking at link structure of related pages on the web for many years, for example, in the context of authority propagation, trust propagation, and hypertext classification. One thing I had not realized was that he actually got interested in the link structure of related pages through his dissertation work on prefetching and caching on the web. Previously, I had known his work just in the context of link structure, for example, his work on topical locality in the web (which I think I in turn came upon via Chakrabarti's excellent Mining the Web).

Brian Davison Presents Separate and Inequal

Brian gave an excellent talk (DOI) in the morning (due to one of his students having visa issues, which seemed fairly common at this conference), though I unfortunately missed his other student's talk (DOI) later that day on their recent work in web page classification.

Classifiers Without Borders

Unfortunately, my talk session was at the same time as the question answering session, which is a topic I have grown increasingly interested in of late. The three papers there were: 1 (DOI) 2 (DOI) 3 (DOI).

Collaborative Exploratory Search

Two papers that had some buzz about them were Pickens et al.'s paper on Collaborative Exploratory Search (DOI) and Liu et al.'s paper on BrowseRank (DOI). The former paper looked at how to work in teams on web search tasks, a topic which is apparently growing increasingly popular in the HCI community these days. The latter paper looked at results ranking based on user browsing behavior from toolbars. I think the Pickens paper ended up winning the Best Paper award.

Lastly, two talks that I found personally interesting were Jonathan Elsas' talk on models for blog feed search (DOI) (slides) and another talk (maybe by Peter Bailey?) on relevance assessment (DOI).

I have been following Jonathan's work for a while now, as he always seems to be doing something interesting with structured, social data. (I finally got to meet him at the conference as well, incidentally.) His talk was mostly about ways to combine both high level and low level structure in blogs, but I was most excited by a somewhat unrelated fact, that they were using Wikipedia for pseudo relevance feedback (PRF). Previous work (DOI) at SIGIR 2007 had looked at this as a possibility, but it was interesting to see both more confirmation that Wikipedia is good for PRF and further mechanisms for using it in that way. Mysteriously, his talk was the third half hour talk in a set of sessions where all of the other sessions had two talks, so he seemed to have the spotlight. In any case, the room was packed.

Jonathan Elsas Presents Blog Feed Search

The talk on relevance assessment was interesting to me because it seemed to be pushing back on a trend which has been happening for a while now. Specifically, in the past few years, there has been a gradual trend towards using extremely cheap sources of labor for creating test collections for evaluating various tasks. For example, some recent work by a former member of the InfoLab looked at Mechanical Turk for evaluating entity resolution (DOI). The relevance assessment paper looked at three types of judges:

Gold Judges: Created a topic for a particular collection, and are experts in that topic. For instance, a history professor comes up with a topic "items worn by Abraham Lincoln" and judges results as relevant and non-relevant.
Silver Judges: Did not create the topic, but are experts in it. For instance, a history professor.
Bronze Judges: Did not create the topic and are not experts in it. Just a random user.

What the work found was that while all three types of judges were fine for making broad distinctions between systems, using poor judges could make the distinctions between top performing systems less clear, or even reverse the ordering of top systems in some cases.

That concludes my trip report from Singapore. Apologies for any inaccuracies in the above; I have not had a chance to read many of the papers in depth yet, so most of my observations come from my vague recollections of the talks. (Do leave a comment if you have any complaints!) Also, if you were at the conference (or were reading the conference papers) and feel like there was important work not discussed here, do leave a comment as well!

Addendum: Yannis pointed me to this other (very recent) paper with more Mechanical Turk evaluation results. Several other people are blogging about SIGIR'08, like Daniel Tunkelang, Pranam Kolari, and Paraic Sheridan. All photos are from the SIGIR'08 website.

Why Uncertainty in Data is Great (Posted by Anish Das Sarma)

2008-07-23T19:20:00.000-07:00

Background: Uncertain Data

With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, entity resolution), there is an increasing need to manage uncertain data in a principled fashion. Examples of uncertain data are:

Extraction: You extract structure from an HTML table/text on the Web, but due to the inherent uncertainty in the extraction process, you only have some confidence in the result. You extract the fact that John works for Google, but are not entirely sure, so you associate a probability of 0.8 with this fact.
Human Readings: Let’s suppose people are viewing birds in California (Christmas Bird Count: http://www.audubon.org/Bird/cbc/). George reports that he saw a bird fly past, but wasn’t sure whether it was a Crow or a Raven. He may also attach a confidence to each of them: He is 75% sure it was a Crow, and associates only a 25% chance with it being a Raven.
Sensors: Sensor S1 reported the temperature of a room to be 75±5.
Inherent Data Uncertainty: From weather.com, you extract the fact that there will be rain in Stanford on Monday, but you only have a 75% confidence in this.
Data Integration/Entity Resolution: You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. We are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.

There are many other examples of uncertainty arising in real-world scenarios.

How Do We Deal With Uncertainty

With large volumes of uncertain data being produced that needs to be subsequently queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high-level, there are two approaches to dealing with data uncertainty:

CLEAN (Approach-C): "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.
MANAGE (Approach-M): In contrast, we could keep the uncertainty in data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data, and process them correctly, i.e., handle the probabilities, range of values, dependencies, etc.

Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. Cleaning uncertain data is not a trivial process by any stretch of imagination. There has been work in the database community in cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this IQIS 2006 paper.) Once the data has been cleaned, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, with Approach-M less upfront effort is involved. However, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project (http://infolab.stanford.edu/trio/) at Stanford that is building a system to manage uncertain data along with data lineage (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our VLDB 2006 paper), and efficient query processing (see our ICDE 2008 paper or a short DBClip). I would like to note that several other database groups are also studying the problem of managing uncertain data.

Why Managing Is Better Than Cleaning

Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could resolve uncertainty more accurately later on because it has more information. Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the certain data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.

Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database by Mary, who saw Bird-1, and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is; using Approach-C, we would like to create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple (Mary, Bird-1, Finch) into the database and forget the information that it could have possibly been a Toucan as well.

In Approach-M, we would store Mary’s sighting as is, which could be represented as:

(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})

Why is this better? Suppose the next day another independent observer, Susan, reports a sighting of Bird-1, and she thinks it’s either a Nightingale (70%) or a Toucan (30%). The following day a third independent observer says Bird-1 is either a Hummingbird (65%) or a Toucan (35%). Clearly when we reconcile all these sightings, the likelihood of Bird-1 being a Toucan is quite high. The method of "reconciling" these readings can be quite complicated, and is an important topic of research, but any reasonable reconciliation should assign a high probability to Bird-1 being a Toucan, since the observers' other candidates all conflict with one another. However, using Approach-C, all three observers’ readings of Toucan would have been weeded out.
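To make the contrast concrete, here is a minimal Python sketch of the three sightings. The reconciliation rule it uses (multiply the independent observers' confidences per species, treat an unlisted species as probability 0, then normalize) is only an illustrative assumption, not the method used in Trio or in any particular paper, and the function names are made up for this example.

```python
# Each sighting maps candidate species to the observer's confidence.
sightings = [
    {"Finch": 0.8, "Toucan": 0.2},          # Mary
    {"Nightingale": 0.7, "Toucan": 0.3},    # Susan
    {"Hummingbird": 0.65, "Toucan": 0.35},  # third observer
]


def clean(sighting):
    """Approach-C: keep only the single most likely species."""
    return max(sighting, key=sighting.get)


def reconcile(all_sightings):
    """Approach-M with an assumed rule: product of confidences, normalized.

    A species that an observer did not list is treated as probability 0
    for that observer (an assumption made for this sketch).
    """
    species = set().union(*all_sightings)
    scores = {}
    for name in species:
        product = 1.0
        for sighting in all_sightings:
            product *= sighting.get(name, 0.0)
        scores[name] = product
    total = sum(scores.values())
    if total == 0:
        return {}  # the sightings are mutually inconsistent under this rule
    return {name: p / total for name, p in scores.items() if p > 0}


cleaned = [clean(s) for s in sightings]
reconciled = reconcile(sightings)
print(cleaned)     # the three cleaned values disagree with each other
print(reconciled)  # Toucan is the only species consistent with all three
```

Under this rule the cleaned database holds three contradictory facts about Bird-1, while the managed data still contains enough information to settle on Toucan.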

Uncertainty In Data Integration

For readers who are not convinced by the synthetic example above, here’s a very real application: data integration, which has been an important area of research for several decades.

Data integration systems offer a uniform interface to a set of data sources. As argued in our chapter to appear in a book on uncertain data, data integration systems need to model uncertainty at their core. In addition to the possibility of data being uncertain, semantic mappings and the mediated schema may be approximate. For example, in an application like Google Base that enables anyone to upload structured data, or when mapping millions of sources on the deep web, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.

At Google, we built a data integration system that incorporates this probabilistic framework, and completely automatically sets up probabilistic mediated schemas and probabilistic mappings, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, including 50-800 data sources. Details of this work can be found in our SIGMOD 2008 paper. We observed that the system is able to produce high-quality answers with no human intervention. The answers obtained using the probabilistic framework were significantly better than those of any deterministic technique we compared against.

Making Flexible Recommendations in CourseRank (Posted by Georgia Koutrika)

2008-07-17T14:00:00.000-07:00

Some background on CourseRank

CourseRank is a social tool for course planning that we are developing at the InfoLab. CourseRank helps Stanford students make informed decisions about classes. It displays official university information and statistics, such as bulletin course descriptions, grade distributions, and results of official course evaluations. Students can anonymously rank courses they have taken, add comments, and rank the accuracy of each others' comments. They can also shop for classes, get personalized recommendations, and organize their classes into a quarterly schedule or devise a four-year plan.

CourseRank also functions as a feedback tool for faculty and administrators, ensuring that information is as accurate as possible. Faculty can modify or add comments to their courses, and can see how their class compares to other classes.

A year after its launch, the system is already being used by more than 6,200 Stanford students, out of a total of about 14,000 students. The vast majority of CourseRank users are undergraduates, and there are only about 7,000 undergraduates at Stanford. Thus, CourseRank is already used by a very large fraction of Stanford undergraduates.

Recommendations the Classical Way

Recommendations comprise an integral part of the system functionality. In order to generate recommendations for students, we initially adopted existing recommendation approaches (see Adomavicius and Tuzhilin for a good survey). However, although the initial version of CourseRank has been very popular with students (see editorial in the Stanford student paper), we soon had first-hand experience with the limitations of classical recommendation systems.

Limitation: Inflexible, canned recommendations

Similarly to most commercial recommendation systems, our initial version offered no choices, just a fixed list of recommended courses for the student to consider. However, users may expect different recommendations under different circumstances, and we received many requests, from students and administrators, for more flexible recommendations. For example, students want to specify the type of course they are interested in (e.g., a biology class, something that satisfies Stanford's writing requirement). They also want to request recommendations based on a peer group they select (e.g., students in their major, or freshmen only) and based on different criteria (e.g., students in their major with similar grades or with similar ratings). For example, a student may want recommendations for CS courses from CS students with similar grades (i.e., with similar performance) and for dance classes from students with similar ratings (i.e., with similar tastes). In some cases, students want to see the recommendations the system would give their best friend, not themselves.

Limitation: Hard-wired recommendations

The recommendation algorithm is typically embedded in the system code, not expressed declaratively. From the designer viewpoint, this fact makes it hard to modify the algorithm, or to experiment with different approaches. However, we (the CourseRank implementers) want to experiment with different recommendation approaches: Would approach X be more effective than approach Y in our environment? What are the right weights for blending two recommendations (e.g., one based on what students like, and another based on what courses are more critical for completing a major)? What is the best way to predict, not good courses, but likely grades a student will obtain in future courses?

To accommodate our end-users’ requests and also to meet our own research and development goals, we were faced with the prospects of implementing many different recommendation approaches and their variants. Therefore, we decided we needed to have more flexible recommendations, i.e., recommendations that could be easily described, customized, and executed by a single engine.

The idea of Flexible Recommendations (FlexRecs)

So, we designed FlexRecs, a framework for flexible recommendations over relational data. The general idea of FlexRecs can be described as follows:

A given recommendation approach is expressed declaratively as a high-level workflow. Imagine that such a workflow comprises a series of interconnected operators that describe how a recommendation is computed. In a workflow, sets of objects flow from one operator to the next, and are filtered, ranked, etc. in the process. The workflow is a “high-level” description in the sense that it does not contain actual code, but rather, it describes what code (operators) to call upon. These descriptions can also handle parameters, so that end-users can at run time personalize their recommendations, e.g., by choosing the peer group of students to provide recommendations, or by weighting recommendations that are blended.
A designer can easily express multiple workflows for different types of recommendations, as well as new types that the designer may think of. The end user can select from them, depending on her information needs. This selection is done through a GUI, which also allows the user to enter parameters for workflows in order to get more accurate and personalized recommendations. For instance, the user may specify that her recommendations be based on what courses students in her major are taking. Such functionality is essentially similar to advanced searches: a designer builds a set of parameterized queries. End users can transparently execute these queries with parameter values and receive different results through an advanced search interface.

Our framework describes how a recommendation workflow can specifically be defined for relational data, using traditional relational operators such as select, project and join, plus new recommendation operators that generate or combine recommendations. The recommendation operators may call upon functions in a library that implement common tasks for generating recommendations, such as computing the Jaccard or Pearson similarity of two sets of objects. The system can execute a workflow by "compiling" it into a sequence of SQL calls, which are executed by a conventional DBMS. When possible, library functions are compiled into the SQL statements themselves; in other cases we can rely on external functions that are called by the SQL statements.
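As a rough illustration of the workflow idea, here is a small in-memory Python sketch. It is a stand-in only: it does not compile anything to SQL, and the operator names (`select`, `recommend`, `workflow`), the student records, and the course codes are all hypothetical and not taken from FlexRecs or CourseRank. It shows a parameterized workflow that picks a peer group, scores peers by the Jaccard similarity of their course sets, and ranks the courses the target student has not yet taken.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select(rows, predicate):
    """Relational-style selection over a list of rows."""
    return [row for row in rows if predicate(row)]


def recommend(target, peers, similarity):
    """Score each course the target has not taken by the summed
    similarity of the peers who took it, then rank by score."""
    scores = {}
    for peer in peers:
        sim = similarity(target["courses"], peer["courses"])
        for course in peer["courses"] - target["courses"]:
            scores[course] = scores.get(course, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def workflow(students, target_name, major=None):
    """A parameterized workflow: the end user chooses the peer group."""
    target = next(s for s in students if s["name"] == target_name)
    peers = select(
        students,
        lambda s: s["name"] != target_name
        and (major is None or s["major"] == major),
    )
    return recommend(target, peers, jaccard)


# Hypothetical data for illustration only.
students = [
    {"name": "ann", "major": "CS", "courses": {"CS106", "CS107", "MATH51"}},
    {"name": "bob", "major": "CS", "courses": {"CS106", "CS107", "CS161"}},
    {"name": "cat", "major": "BIO", "courses": {"CS106", "BIO41", "DANCE46"}},
    {"name": "dan", "major": "CS", "courses": {"MATH51", "CS161", "CS229"}},
]

recs_cs_peers = workflow(students, "ann", major="CS")
recs_all_peers = workflow(students, "ann")
print(recs_cs_peers)
print(recs_all_peers)
```

Changing the `major` parameter changes the peer group without touching the operators, which is the kind of run-time personalization the framework is after. Swapping `jaccard` for another similarity function would be the analogue of picking a different library function.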

We have built a flexible recommendation engine for compiling and executing FlexRecs workflows as part of CourseRank. The FlexRecs framework has indeed made it possible to define many recommendation strategies (at least all of the ones we have been able to think of to date). The new version of CourseRank, to be released September 2008, will let end-users tailor their recommendations. Although the end-users will only see choices from simple menus, and select values from sliders, they will actually be selecting what workflow to run, and with what parameters.

Database Research Principles Revealed (Posted by Jennifer Widom)

2008-07-11T15:23:00.000-07:00

Last summer I was named recipient of the 2007 ACM SIGMOD Edgar F. Codd Innovations Award, an honor that came with both good news and bad news. The good news: $1000 and something new to spice up my bio. The bad news: A last-minute trip to Beijing at an inopportune time (though enjoyable in the end) to deliver a plenary talk at the conference.

Back when Hector received the Innovations Award in 1999, there was only the good news part; an invited SIGMOD conference talk for the winner was introduced with Jeff's award in 2006. The problem with this type of talk is that you're not allowed to just trot out your latest research spiel. The talk is expected to be sweeping, insightful, and (most of all) entertaining, while still remaining technical enough to avoid any hushed remarks about being an over-the-hill armchair researcher who thinks only big thoughts and no little ones.

I spent a lot of time mulling over what I could say to these conference-goers in Beijing, at least those who didn't sneak off to visit the Forbidden City instead. (Not many of them did, probably thanks to the drizzle and thick smog.) I decided to solidify some research strategies and pet peeves that I believe have influenced my entire career, with very concrete examples for technical credibility, and photographs to keep it entertaining.

Slides from the talk are available in PowerPoint and pdf (an inline slideshare version is below). This blog post summarizes the key points.

Finding Research Ideas

There's no magic to finding research areas, at least for me. I started working in Active Databases at IBM because I was told to. I started working in Data Warehousing at Stanford because Hector came back from a company visit one day saying it was the latest hot thing and there might be some research in it. I worked in Semistructured Data as an offshoot of our Data Integration project -- the integration part made me uncomfortable so I decided to build a DBMS for our "lightweight self-describing object model" instead. Data Streams was an area I'd always felt just plain made sense, but it took years for me to convince any students to work on it. Lastly, my current work on Uncertainty and Lineage is an idea that just popped into my head one day during my morning jog. Really.

I never know where the next idea is coming from, or when it will arrive, which is actually kind of scary since I don't like to stay in areas too long. One small but interesting observation: Although I've worked in what seem to be diverse areas, the problem of Incremental View Maintenance has popped up in every single one of them.

Finding Research Topics

Once a research area has been selected, how does one find a topic within that area? Here I actually do have a strategy. If you take one of the many simple but fundamental assumptions underlying traditional database systems, and drop it, the entire kit-and-kaboodle of data management and query processing often needs to be revisited. (I like the analogy of pulling at a loose thread in a garment, ultimately unraveling the whole thing.) Once you need to revisit the data model, query language, storage and indexing structures, query processing and optimization, concurrency control and recovery, and application and user interfaces, you've got yourself a bunch of thesis topics and a fun prototype to develop.

I followed this recipe for Semistructured Data (dropped assumption: schema declared in advance), Data Streams (dropped assumption: data resides in persistent data sets), and now Uncertain Data (dropped assumption: tuples contain exact values). Of course you don't need to revisit every aspect of data management and query processing every time, but so far there have always been plenty of topics to go around.

The Research Itself

Here comes my biggest pet peeve. If one is to follow my recipe and reconsider data management and query processing for a new kind of DBMS, it's imperative to think about all three of the critical components -- data model, query language, and system -- and in that order! We in research have a rare luxury, compared to those in industry, that we can mull over a data model for a long time before we move on to think about how we'll query it, and we can nail down a solid syntax and semantics for a query language before we implement it. This sequence is not only a luxury, I consider it a requirement for good research: Lay down the foundations cleanly and carefully before system-building begins. This policy has been the biggest underlying principle of my research and, I believe, the primary reason for its success (on those occasions it's been successful).

Let's look briefly at the three critical components, then talk about how to disseminate research results.

Data Model

Nailing down a new data model that "works" is hardly a trivial task. The talk (here are the PowerPoint and pdf links again) provides some concrete examples of subtleties in data stream models, where the same query can (and across current systems, does) give very different results depending on some hidden and often overlooked aspects of a stream model. In the Trio project, we debated uncertainty data models for nearly a year before settling on the one we used, and it was well worth it in the end.

Query Language

Like data models, the subtleties involved in query language design are often underestimated. First, there seems to be some confusion between syntax and semantics: from a research perspective, only semantics is really interesting. For example, if we apply SQL syntax to a data stream model, or to a model for uncertain data, we certainly can't declare victory -- in these new models it's often unclear what the semantics of a syntactically-obvious SQL query really are. (Here too, concrete examples are given in the talk.) For both the STREAM and Trio projects, just the task of specifying an exact semantics for SQL queries over the new model was a significant challenge.

Unfortunately, the challenges and contributions of specifying a new query language (or new semantics for an existing one) don't tend to be recognized in traditional ways. Publishing a SIGMOD or VLDB paper about a query language is near impossible. After many failed attempts to publish a paper describing the Lore query language, we finally sent it to a new journal that was desperate for submissions. The Lorel paper now has over 500 citations on Citeseer (over 1200 on Google Scholar) and was among the top-100 cited papers in all of Computer Science for a spell. The fact is that language papers are very difficult to publish, but they can have huge impact in the long run. Unfortunately that's tough to explain to a graduate student.

In another of my favorite language-related stories, I was confused about the semantics of a "competing" trigger (active database) language to the one I was designing; this was way back around 1990. I asked the obvious person running the other project (who shall remain nameless, but is very tall) what the semantics would be of a specific set of triggers in his language. His response: "Hmm, that's a tricky one. I would have to run it to find out."

The talk includes examples not only of trickiness in applying SQL to new models, but also subtleties in designing query languages for semistructured data and for data streams. It also demonstrates a guiding principle for designing query language semantics in the "modified-relational" models I tend to work with: reuse relational semantics whenever possible (which is not the same thing as reusing SQL or even relational algebra syntax); it's a clean and well-defined place to start, and can cover a lot of ground if the semantics are compartmentalized well.

System

After all that thinking, debating, designing, specifying, and proving that goes into figuring out a new data model and query language, building a prototype system to realize them is a very satisfying finishing step, and critical for full impact of new ideas.

I'll admit the model-language-system sequence isn't quite as clean a division as I've made it out to be: When building a system and trying it out, one inevitably discovers flaws in the data model and query language, and there tends to be at least a moderate feedback loop. Even then, working out (modified) foundations before committing them to code is, in my mind, rule number one.

Disseminating Research Results

I have strong feelings on this topic. First, if you've done something important, don't wait to tell others about it. There's no place for secrecy (or laziness) in research, and there's every place for being the first one with a new idea or result. Write up your work, do it well and do it soon, post it on the web and inflict it on friends.

Second, don't get discouraged by SIGMOD and VLDB rejections. Those conferences aren't the only places for important work, by a long shot. Workshops often reach the most important people in a specific area. I've always been a fan of SIGMOD Record (and more recently the CIDR conference) for disseminating ideas or results that, for whatever reason, aren't destined for a major conference.

Finally, build prototypes and make them easy to use. That means a decent interface (both human and API), and even more importantly setting things up so folks can try out the prototype over the web before committing to a full download and install.

Should Ad Networks Bother Fighting Click Fraud? (Posted by Bobji Mungamuru)

2008-07-02T20:30:00.000-07:00

It's an understatement to say that online advertising has become big business. Much of the growth has been due to the emergence of advertising networks, such as Google, Yahoo! and Advertising.com.

Over the last few years, however, click fraud has emerged as a significant threat to the business model of the online advertising industry. (I recommend this book chapter if you want to learn more about click fraud.)

You can think of it as an analogue to a "currency crisis" – advertisers (investors) avoid buying ads (currency) because they are worried about the devaluation of ads due to rampant fraud, and it is this lack of advertiser (investor) confidence itself that causes the value of ads to fall.

The central issue is this: Suppose an advertising network (or "ad network", for short) decides that a given click-through is "fraudulent". The implication is that the ad network will not bill the advertiser for that invalid click. On the other hand, if the click is marked valid, the ad network could charge full price for it. Therefore, arguably, the ad network is “leaving money on the table” by marking clicks fraudulent. As such, why would ad networks even bother fighting fraud?

Google CEO Eric Schmidt famously claimed in July 2006 that there in fact was a "perfect economic solution" to click fraud, which is to just "let it happen". He conjectured that the auctions used to sell ads are "self-correcting", since market prices would naturally adjust for fraud. Is Dr. Schmidt right? Is it economically reasonable for ad networks to just let fraud happen?

In my paper, "Should Ad Networks Bother Fighting Click Fraud? (Yes, They Should.)", I analyze a simple economic model of the online advertising market, and conclude that Dr. Schmidt is quite wrong. In fact, from an ad network’s perspective, to "let it happen" is absolutely the worst economic solution! Ad networks that develop effective algorithms for detecting fraud, and aggressively apply these algorithms, gain a significant competitive advantage in the online marketplace.

In other words, Google's official response to the "let it happen" fiasco, while well-intentioned, missed the whole point: it's not out of the "goodness of their heart" that Google should fight fraud, rather it is out of sheer economic self-interest.

I presented a shorter version of this paper at the Financial Cryptography and Data Security conference this January. The title of the paper was "Competition and Fraud in Online Advertising Markets".

Stanford Entity Resolution Framework: An Overview (Posted by Steven Whang)

2008-06-25T01:55:00.000-07:00

Goal

The goal of the Stanford Entity Resolution Framework (SERF) project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication, or record linkage) is an important information integration problem: The same "real-world entities" (e.g., customers, or products) are referred to in different ways in multiple data records. For instance, two records on the same person may provide different name spellings, and addresses may differ. The goal of ER is to "resolve" entities, by identifying the records that represent the same entity and reconciling them to obtain one record per entity.

Early Work

Our initial work focuses on ER performance. We first identify three main challenges of ER: matching two records, merging two records that match, and chaining where we iteratively compare merged records with other records to discover additional record matches. We abstract the first two operations into binary match and merge functions, which we consider as black-box functions that are provided by the user. Given the functions, we derive ER solutions using the minimum number of potentially expensive record comparisons (thus our approach is focused on performance rather than accuracy). We also identify important properties for the match and merge functions that make ER natural and efficient. Our Swoosh algorithms are proved to be optimal in the number of record comparisons in worst-case scenarios.
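To make the match, merge, and chaining steps concrete, here is a minimal Python sketch. The record representation (a frozenset of "attribute=value" strings) and the toy `match` and `merge` functions are assumptions made purely for illustration; the loop follows the general compare-and-merge idea described above and should not be read as the Swoosh algorithms themselves.

```python
from typing import Callable, FrozenSet, List

# Hypothetical record type: a set of "attribute=value" strings.
Record = FrozenSet[str]


def match(a: Record, b: Record) -> bool:
    """Toy black-box match (assumption): records match if they share a value."""
    return bool(a & b)


def merge(a: Record, b: Record) -> Record:
    """Toy black-box merge (assumption): keep every value from both records."""
    return a | b


def resolve(records: List[Record],
            match_fn: Callable[[Record, Record], bool] = match,
            merge_fn: Callable[[Record, Record], Record] = merge) -> List[Record]:
    """Compare each pending record against the resolved set; on a match,
    merge the pair and push the merged record back so it can chain."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match_fn(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Chaining: the merged record is compared again with all others.
            resolved.remove(partner)
            pending.append(merge_fn(current, partner))
    return resolved


records = [
    frozenset({"name=J. Smith", "phone=555"}),
    frozenset({"name=John Smith", "phone=555"}),
    frozenset({"name=John Smith", "email=js@example.org"}),
    frozenset({"name=Ann Lee"}),
]
result = resolve(records)
for record in result:
    print(sorted(record))
```

In this toy input the first and third records share no value, so they only end up together through chaining: each matches the second record, and the merged record carries the evidence forward.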

In addition, we have addressed the following challenges:

Distribution: We distribute the operation of Swoosh on several machines. The important property to satisfy is that for any two records, there exists a machine that compares them. We investigate several schemes that guarantee this property while parallelizing ER.

Confidence: We consider numerical confidences associated with records and derive an ER result consisting of high-confidence records. The challenge is to perform ER efficiently when confidences are involved.

Negative Rules: We drop the assumption that the match and merge functions are always correct, and specify integrity checks in the form of negative rules to derive a consistent ER result using the match and merge functions.

Current Work

More recently, we have been focusing on ER scalability. Although Swoosh is an efficient algorithm, it is an exhaustive one where all the record pairs could be compared (making the algorithm quadratic) and thus cannot run on very large datasets (say on millions of records). Various blocking techniques can be used to scale ER where we divide records into (possibly overlapping) blocks, assuming that records in different blocks are not likely to match each other. We thus save a significant amount of processing by running ER on one block at a time. The problem with blocking is that it may miss matching records. Many works have focused on improving the accuracy and performance of blocking by figuring out the correct "blocking criteria" to use.

Complementing various ER and blocking techniques, we propose a novel generic framework for blocking (called iterative blocking) where the ER results of blocks can now be reflected to other blocks. Blocks are iteratively processed until no block contains any more matching records. Unlike blocking, we now have an additional step of re-distributing the merged records generated from one block to other blocks. This process can improve the accuracy of ER (by possibly identifying additional record matches) while improving performance (by saving the processing time for other blocks) compared to simple blocking.
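The redistribution step can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the `Rec` type (which tracks the ids of the base records it was built from), the toy `match` rule (records match only if they share a name or an email), and the two hand-built blocks. It is a sketch of the idea, not the framework's actual implementation.

```python
from collections import deque
from typing import Dict, FrozenSet, NamedTuple, Set


class Rec(NamedTuple):
    ids: FrozenSet[int]      # ids of the base records merged into this one
    values: FrozenSet[str]   # "attribute=value" strings


def match(a: Rec, b: Rec) -> bool:
    """Toy match (assumption): a shared name or email is required."""
    return any(v.startswith(("name=", "email=")) for v in a.values & b.values)


def merge(a: Rec, b: Rec) -> Rec:
    return Rec(a.ids | b.ids, a.values | b.values)


def resolve_block(block: Set[Rec]) -> Set[Rec]:
    """Run a simple compare-and-merge ER pass on a single block."""
    pending, resolved = list(block), []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return set(resolved)


def iterative_blocking(blocks: Dict[str, Set[Rec]]) -> Set[Rec]:
    """Process blocks until no block produces a new merge, pushing each
    merged record into every other block that holds one of its sources."""
    queue, queued = deque(blocks), set(blocks)
    while queue:
        name = queue.popleft()
        queued.discard(name)
        result = resolve_block(blocks[name])
        merged = result - blocks[name]      # records created in this pass
        blocks[name] = result
        for other, recs in list(blocks.items()):
            if other == name:
                continue
            # Replace any source record by the merged record that absorbed it.
            updated = {next((m for m in merged if r.ids < m.ids), r) for r in recs}
            if updated != recs:
                blocks[other] = updated
                if other not in queued:     # revisit blocks that changed
                    queue.append(other)
                    queued.add(other)
    return set().union(*blocks.values())


a = Rec(frozenset({1}), frozenset({"name=J. Smith", "city=Palo Alto"}))
b = Rec(frozenset({2}), frozenset({"name=J. Smith", "email=js@example.org"}))
c = Rec(frozenset({3}), frozenset({"name=John Smith", "email=js@example.org",
                                   "city=Palo Alto"}))
d = Rec(frozenset({4}), frozenset({"name=Ann Lee", "city=Palo Alto"}))

blocks = {"name=J. Smith": {a, b}, "city=Palo Alto": {a, c, d}}
entities = iterative_blocking(blocks)
print(sorted(sorted(e.ids) for e in entities))
```

Records `b` and `c` share an email but never share a block, and `a` alone does not match `c`. Plain blocking would therefore leave `c` unmerged; once the merged record for `a` and `b` is redistributed into the city block, it matches `c`.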

Thoughts on ECML PKDD Discovery Challenge (RSDC'08) (Posted by Paul Heymann)

2008-06-18T18:44:00.000-07:00

Over the past few years, I have looked into a variety of problems in collaborative tagging systems, like whether they can be better organized, whether they can help web search, and how to avoid spam. Most recently, I have been looking at tag prediction; I will be presenting a paper called "Social Tag Prediction" at SIGIR'08 next month (more details on that in a later blog post). As such, I am very curious to see the outcome of the ECML PKDD Discovery Challenge (RSDC'08). (These are my initial thoughts as a non-participant, so if you have any complaints or find any errors, do leave a comment!)

BibSonomy and the University of Kassel

The Challenge is being put on by four members of the Knowledge and Data Engineering team at the University of Kassel:

This team, along with about ten other project members, is behind BibSonomy. BibSonomy is one of the three (or so) major collaborative tagging systems for academic publications (the others that I know of being CiteULike and Connotea). BibSonomy supports bookmarking URLs as well as publications, but its main selling point over something like del.icio.us is that it helps academics organize and share publications. One particularly nice thing about BibSonomy is that they have always been willing to share their data with the academic community (see the FAQ for more details; CiteULike shares some data in an anonymized form).

Dataset

Most collaborative tagging systems focus on a single object type: e.g., URLs, books, products. BibSonomy is different in that users post two distinct object types: academic publications and URL bookmarks. Furthermore, the dataset (at least for the discovery challenge) has about equal numbers of (non-spam) unique publications and URLs (approximately 200,000 each). The dataset also denotes whether each user is a spammer or not, but here things are a bit less balanced. There are about ten times as many spam bookmarks as ham bookmarks, but almost no one seems to have bothered to spam academic publications. The full dataset statistics are:

(tag,user,object) triples {ham: 816197, spam: 13258759}
URL bookmark objects {ham: 181833, spam: 2059991}
academic publication objects {ham: 219417, spam: 716}
users {ham: 2467, spam: 29248}

See the dataset page for more details about the Challenge dataset.

Tasks: Tag Recommendation and Tag Spam

The Challenge consists of two tasks, tag recommendation and tag spam detection.

Tag Recommendation

Tag recommendation seems to be equivalent to what other researchers call tag suggestion, but slightly different from what I call "tag prediction." In a tag recommendation or tag suggestion task, the goal is to assist in the tagging process. As the user is typing in tags describing a particular object, the system also provides the user with a list of helpful suggestions.

In the Challenge, the "ideal" recommendation is assumed to be whatever tag the user ended up choosing, though one could imagine different definitions for what makes a recommendation "good."

For example, suppose most users in the system tag articles about music with "music", but a particular user tags such articles with "audio". A good recommendation in a real world system might be to recommend that such articles be tagged with "music" in the future so that the user has the same labels as other users in the system (enhancing discoverability of the resources in the system), but this would be a bad strategy in the Challenge.

By contrast, when I set up a "tag prediction" task, the gold standard was to detect all tags that could be applied to a particular object, rather than particular tags that particular users chose to apply to a particular object.

Tag Recommendation (Suggestion): Given (user,object) try to guess tag.
Tag Prediction: Given (object) try to guess all potential tags that could occur in (tag,user,object) triples in the future.

Thus, the goal of "tag recommendation" is to assist and speed up tagging by users, while the goal of "tag prediction" is to guess all of the tags that could be applied to a particular object. One helps the users add tagging metadata, while the other adds tagging metadata directly.
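The difference between the two gold standards is easy to see on a toy example. The sketch below is illustrative only: the triples are made up, and scoring by precision and recall is an assumption, since this post does not say which measure the Challenge uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (tag, user, object)

# Hypothetical data: two users tag the same object differently.
triples = [
    ("music", "alice", "song-review"),
    ("mp3", "alice", "song-review"),
    ("audio", "bob", "song-review"),
]


def recommendation_gold(data: Iterable[Triple]) -> Dict[Tuple[str, str], Set[str]]:
    """Gold standard per (user, object): the tags that user actually chose."""
    gold = defaultdict(set)
    for tag, user, obj in data:
        gold[(user, obj)].add(tag)
    return dict(gold)


def prediction_gold(data: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Gold standard per object: every tag any user applied to it."""
    gold = defaultdict(set)
    for tag, _user, obj in data:
        gold[obj].add(tag)
    return dict(gold)


def precision_recall(suggested: Set[str], gold: Set[str]) -> Tuple[float, float]:
    hits = len(suggested & gold)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


suggested = {"music"}
rec_score = precision_recall(suggested, recommendation_gold(triples)[("bob", "song-review")])
pred_score = precision_recall(suggested, prediction_gold(triples)["song-review"])
print("recommendation (bob):", rec_score)
print("prediction (object): ", pred_score)
```

Suggesting "music" earns nothing when judged against what Bob chose ("audio"), yet it is a correct tag for the object when judged against all users' tags.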

Neither is necessarily a better or worse task, but I am struck by how even for a relatively specific goal like "predict tags in a tagging system" relatively minor details can have a huge impact on the design and applicability of solutions to the problem.

Tag Spam Detection

The spam detection task seems similar to previous spam detection tasks. The goal is to guess which users are spammers, based on previous spammers labeled by BibSonomy. (This is different from our previous work, which looked primarily at methods for prevention, ranking, and ways to simulate spam in tagging systems.)

The traditional difficulty with such tasks is defining what exactly constitutes "spam content" and how much spam content constitutes a "spammer." However, it seems likely that even if some spammers go unlabeled, or if some legitimate users are mislabeled as spammers, the relative ordering of the spam detection algorithms will be approximately the same.

The bigger issue that I am worried about with this challenge is that most of the spam signals will actually be too easy to identify. Unlike the web or e-mail where spammers have been competing against spam detection algorithms for a decade or more, tag spam is relatively new.

As a result, it seems like certain really obvious signals might catch most spam, because tag spammers are not really trying that hard, yet. For example, a quick glance at the Challenge dataset shows that BibSonomy got about 56 legitimate URLs in China (.cn) and about 127,000 spam URLs in China (.cn). So you can eliminate about 6 percent of spam URLs by just ignoring all of China. Another top level domain constitutes about 20 percent of the spam URLs and just about 1 percent of the legitimate URLs.

In fact, about 40 percent of the legitimate URLs in the dataset come from ".net", ".hu", ".org", ".edu", or ".de" while only about 9 percent of the illegitimate URLs come from those domains. In other words, there appears to be a lot of signal in the top level domains alone.

Likewise, of the 2,467 legitimate users in the dataset, about half of them (1,211) post academic publications, while only 113 of the spammers bother to do so. As a result, it looks like the difficulty in the task will be finding the legitimate users who are just posting URLs (as opposed to academic publications).
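As a quick check of how far such crude signals go, the arithmetic can be written out using only the approximate counts quoted above. The "posts at least one publication" rule is a hypothetical baseline for illustration, not a proposed Challenge entry.

```python
# Approximate counts quoted in this post.
spam_urls, ham_urls = 2_059_991, 181_833
spam_cn, ham_cn = 127_000, 56
ham_users, ham_users_with_pubs = 2_467, 1_211
spam_users, spam_users_with_pubs = 29_248, 113

# Share of each class that sits under the .cn top level domain.
spam_cn_share = spam_cn / spam_urls
ham_cn_share = ham_cn / ham_urls

# Hypothetical rule: label a user legitimate if they post any publication.
rule_precision = ham_users_with_pubs / (ham_users_with_pubs + spam_users_with_pubs)
rule_recall = ham_users_with_pubs / ham_users
url_only_ham_users = ham_users - ham_users_with_pubs

print(f".cn share of spam URLs:       {spam_cn_share:.1%}")
print(f".cn share of legitimate URLs: {ham_cn_share:.2%}")
print(f"rule precision: {rule_precision:.1%}, recall: {rule_recall:.1%}")
print(f"legitimate users posting only URLs: {url_only_ham_users}")
```

On these counts the rule is right about nine times out of ten but finds only about half of the legitimate users, which leaves the 1,256 URL-only legitimate users as the hard cases.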

Factors That Might Impact The Challenge

There are three factors that I think will have a big impact on the Challenge:

Compared to some other systems (e.g., del.icio.us), BibSonomy is relatively small. For example, the data in the dataset is equivalent to about two to four days' worth of del.icio.us bookmark postings. I wonder if this will end up having a big impact on the types of algorithms that will do well, in that they will have less data to work with. One of my favorite simple algorithms, association rules, might not do as well when there is less data to be had.
For the tag recommendation task, it seems like there might be big differences between predicting tags for academic publications and predicting them for URLs. In the former case, the dataset seems to have more text data, but on the other hand, the tags might be much more specific and sparse (e.g., "neuralnetworks" versus "funny"). Likewise, URLs may have easier tags to predict, but less text on which to base the predictions. With more text, it might make more sense to use algorithms that look more like text categorization, whereas with less text, it might make sense to use algorithms that look more like collaborative filtering. (Our previous work looks at predicting tags based on text and tags based on other tags in the context of tag prediction; Zhichen Xu et al. looked at tag suggestion in a more collaborative-filtering style in "Towards the Semantic Web: Collaborative Tag Suggestions.")
It also seems as though the setup of the task will force the best teams to model the user, as opposed to just the objects. Because a system does not get credit for recommending "rubyonrails" when the user chose "RoR", it will likely be important to model not just what the object is about, but what the user is likely to say the object is about. (Incidentally, machine translation also faces this problem: if two systems translate a piece of text, and their translations are both different from a gold standard, which one is better?) Furthermore, if the users tagging objects are themselves seeing BibSonomy's tag recommender, one may need to (directly or indirectly) model the user, the tag recommender, and the user's reaction to the recommender!

Ultimately, all of these reflect necessary choices made to set up the Challenge, but it will be interesting to see what impact they have on the winning systems. Good luck to all the teams!

Update (2008-06-20): A small clarification: it looks like the tag recommendation task will be somewhere in between predicting tags based on (user, object) and predicting tags based on (object). The tag recommendation submissions will be in the form (object, tag1, tag2, tag3...) so systems will not really be predicting tags for a particular user. On the other hand, none of the objects seem to have really complete coverage of all of the tags that could apply to them, so the task is not exactly tag prediction either.

Update (2008-06-23): Actually, contrary to the previous update, it looks like the tag recommendation task will be based on (user, object) after all. The confusion is that content_id is actually a combination of both user and object as opposed to object. This mailing list posting gives a few more details.

Report: 2008 Berkeley Database Self-Assessment (Posted by Hector Garcia-Molina)

2008-06-08T21:58:00.000-07:00

Periodically, a group of self-anointed database "experts" meets to assess the state of the database field. The most recent of these self-assessment meetings was held at Berkeley, CA on May 29-30, 2008.

The Berkeley meeting was the fifth in the series.

The four earlier meetings were held in Laguna Beach 1988, Stanford 1990, Asilomar 1998, and Lowell (MA) 2003. After each of the meetings, a report has been written. The previous reports can be found at:

Laguna Beach, 1988: Missing online. (DBLP Metadata)
Lagunita 1990: Missing online. (DBLP Metadata)
Asilomar 1998. (SIGMOD Record: html, ps.gz, pdf.gz) (ACM Citation) (DBLP Metadata)
Lowell 2003. (MSR)

Typically, each report is written by one or two of the participants, trying to summarize the discussion at the meeting. The rest of the participants get to put their name on the report after providing comments. The report does not capture everyone's opinion, and more than once I have been surprised to find material in the report that I did not recall hearing at the meeting. But that is not a bad thing: The reports are usually more coherent than the meetings!

At this year's meeting, many of the same topics from previous years made an appearance: data integration, privacy and scalability are some old-time favorites. However, some "new" topics made an appearance (by new I mean new to these reports). For instance, this time we discussed social networking and virtual worlds as two applications that may require new data management technology.

The topic that generated the most passionate discussion was that of academic publications and conferences. Given that the meeting was a few days after VLDB rejection notices had been mailed out, it is not surprising that people were complaining about the poor review process these days. It seems reviewers are more interested in looking for ways to kill a paper ("my paper was not cited", "the margins are a nanometer too wide", "the idea is useful but there are not enough theorems") than in identifying promising ideas that will have impact. I suspect there will not be much discussion on this topic in the workshop report, as it was felt that these issues are best handled by the organizations that run conferences, such as SIGMOD, VLDB and ICDE. I would not hold my breath...

After the workshop, all participants attended the Jim Gray Tribute on May 31, also at Berkeley.

There were about 800 participants at the tribute. It was a very moving event. It was great to see so many friends from the database and systems communities, but it was unfortunate that it took Jim's disappearance to bring us all together.

We'll let all of our Stanford InfoBlog readers know when the report for the 2008 Berkeley self-assessment appears... Stay tuned.

Simrank++: Putting together 3 simple ideas (Posted by Ioannis Antonellis)

2008-06-06T11:45:00.000-07:00

Two ideas that really pushed web search further came from two completely different directions.

Two ideas

The first one, link analysis, came into play as a tool for improving the quality of search results. The idea as people express it today is simple, yet its power has proven invaluable. The more web pages that link to your web page, the more popular your web page becomes.

The second idea, sponsored search, allowed search engines to make revenue, grow and keep providing their service for free. The idea again is simple: The search engine sells each keyword to up to 10 advertisers and it displays their ads to the users that issue a query with that keyword. Since the keywords are not real goods that you can easily set a price for, the search engines rely on auctions to determine the price for each keyword. (Had the search engines been able to set fixed keyword prices, many Ph.D. students, including myself, would have no thesis topic to work on...)

Another idea: The wisdom of Crowds

But my favorite simple idea, one that has been gaining ground recently, is that of exploiting the wisdom of the crowds: the information that the billions of web users provide in all its forms. The most concrete and recent example of a successful application that is based on this idea is Wikipedia. Now that this large repository of knowledge has been created, many researchers have started to think about using it to improve existing data mining methods.

Another set of techniques based on the wisdom of the crowds, still immature and under development, uses click logs as a source of implicit information provided by web search users.

The paper

In our paper "Simrank++: Query Rewriting through link analysis of the Click Graph", which will be presented at the upcoming VLDB 2008 conference in Auckland, New Zealand, we try to put all three of these ideas into play together. Specifically, we develop a link analysis technique for click graphs that can be used to increase the revenue of a sponsored search engine.

So, what is a click graph, how exactly do we analyze its link structure and how does this improve the revenue of a search engine?

Click graph

The click graph is a bipartite graph with query nodes on one side, ad nodes on the other and an edge between a query and an ad when a user that issued the query clicked on the ad. In addition, each edge between a query and an ad has associated with it a weight that corresponds to the number of clicks the query brought to the ad.
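As a rough illustration, a click graph can be built from a log of (query, ad) click events with a couple of dictionaries. The log entries, ad names, and the `build_click_graph` helper below are hypothetical; real click logs carry much more information.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical click log: one (query, ad) pair per click.
click_log = [
    ("camera", "ad_canon"),
    ("camera", "ad_canon"),
    ("digital camera", "ad_canon"),
    ("digital camera", "ad_nikon"),
    ("mp3", "ad_ipod"),
    ("i-tunes", "ad_ipod"),
]


def build_click_graph(log: Iterable[Tuple[str, str]]):
    """Return both sides of the bipartite graph as adjacency maps whose
    values are edge weights (number of clicks the query brought to the ad)."""
    weights = Counter(log)
    ads_of: Dict[str, Dict[str, int]] = defaultdict(dict)
    queries_of: Dict[str, Dict[str, int]] = defaultdict(dict)
    for (query, ad), clicks in weights.items():
        ads_of[query][ad] = clicks
        queries_of[ad][query] = clicks
    return dict(ads_of), dict(queries_of)


ads_of, queries_of = build_click_graph(click_log)
print(ads_of["camera"])
print(sorted(queries_of["ad_ipod"]))
```

Keeping both directions of the adjacency map makes it cheap to walk from a query to its ads and back to other queries, which is what the link analysis below relies on.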

How to increase search engine revenue

The potential for revenue increase comes from queries that no advertiser has bid on. Since the search engine has no indication which ads are relevant to those queries, there are no ads it can display. Given such a query, if there were a way to guess which ads are relevant (but for some reason the associated advertisers decided not to bid on it), we could display the relevant ads and keep both the advertisers and the search engine happy.

The solution comes from the click graph: the wisdom of the crowds. Given a query, we use the click graph to generate a list of query rewrites, i.e., of other queries that are "similar" to it. For example, for the query "camera," the queries "digital camera" and "photography" may be useful because the user may also be interested in ads for those related queries. The query "battery" may also be useful because users that want a camera may also be in the market for a spare battery. Both the query and its rewrites can then be used to find ads.

The schemes we present analyze the connections in the click graph to identify rewrites that may be useful. Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit either "mp3" or "i-tunes" click on an ad for "iPod") but also queries that are more indirectly related.

Simrank++

The intuition of our solution, Simrank++, is the following:

two queries are similar if they are connected to similar ads and two ads are similar if they are connected to similar queries.

Sounds confusing, but it reveals the power of recursion. Initially each query/ad is only similar to itself, but by continuously applying these two rules, you end up with a similarity score for each pair of queries.
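The recursion can be sketched as a plain, unweighted SimRank-style iteration over the bipartite graph. This is a simplified illustration only: the decay constant, the iteration count, and the tiny graph are assumptions, and the refinements discussed in the paper (such as taking edge weights into account) are not reproduced.

```python
from typing import Dict, Set, Tuple

Sim = Dict[Tuple[str, str], float]


def simrank_bipartite(ads_of: Dict[str, Set[str]],
                      decay: float = 0.8,
                      iterations: int = 5) -> Sim:
    """Return query-query similarity scores after a fixed number of rounds."""
    queries_of: Dict[str, Set[str]] = {}
    for query, ads in ads_of.items():
        for ad in ads:
            queries_of.setdefault(ad, set()).add(query)
    queries, ads = sorted(ads_of), sorted(queries_of)

    # Initially every node is similar only to itself.
    q_sim = {(x, y): float(x == y) for x in queries for y in queries}
    a_sim = {(x, y): float(x == y) for x in ads for y in ads}

    def step(nodes, neighbors, other_sim) -> Sim:
        new = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[(x, y)] = 1.0
                elif not neighbors[x] or not neighbors[y]:
                    new[(x, y)] = 0.0
                else:
                    # Average similarity of the two nodes' neighbors, decayed.
                    total = sum(other_sim[(i, j)]
                                for i in neighbors[x] for j in neighbors[y])
                    new[(x, y)] = decay * total / (len(neighbors[x]) * len(neighbors[y]))
        return new

    for _ in range(iterations):
        # Both updates use the scores from the previous round.
        q_sim, a_sim = step(queries, ads_of, a_sim), step(ads, queries_of, q_sim)
    return q_sim


graph = {
    "mp3": {"ad_ipod"},
    "i-tunes": {"ad_ipod", "ad_music_store"},
    "songs": {"ad_music_store"},
    "camera": {"ad_canon"},
}
sim = simrank_bipartite(graph)
for pair in [("mp3", "i-tunes"), ("mp3", "songs"), ("mp3", "camera")]:
    print(pair, round(sim[pair], 3))
```

In this toy graph "mp3" and "songs" share no ad, yet they receive a positive score because each is connected to an ad that "i-tunes" also clicks on; "camera" stays at zero because nothing connects it to the other queries.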

Are these scores always correct? How can we even define what a correct score is? And how does everything change when you try to take into account the edge weights of the click graph in the computation of the similarity scores?

The answers to these more detailed questions are in the paper...

Why Write a Blog? (Posted by Paul Heymann)

2008-05-28T09:12:00.000-07:00

Most blogs seem to start out with a mission statement, a modus operandi, or an introduction. I would like to start out with something a little more analytical. Each post in this blog is likely to discuss some topic related to research at the InfoLab, and I would like to start with a simple question: why blog? The answer has a lot to do with what we do at the InfoLab, the future of research on data and the web, and about the nature of research itself.

Blogging is Huge

About a year ago now, I started working on a paper called "Can Social Bookmarking Improve Web Search?" Interestingly, that paper ended up being more about the nature of social bookmarking data (URLs which have been annotated with keyword "tags" by users) than it was really about web search itself. (I suppose that makes sense, given that we are the former "database group" and fascinated by data or information in a wide variety of contexts.) Specifically, what we really ask is:

Is the data produced by social bookmarking systems different enough from other data that search engines have access to that it really constitutes "new information?"
Are social bookmarking systems producing enough data to make a difference? On the scale of the web?

The answer to the second question is what sparked my most recent interest in blogs.

del.icio.us, the social bookmarking site I analyzed, gets over 100,000 posts on an average day. (See, for example, deli.ckoma, which has daily information about the number of posts to del.icio.us, going back several years.) But in the course of my analysis, I needed something to compare to that number.

Is 100,000 URLs with a few tags (keyword annotations) a large data source, or a small one, and compared to what? The most natural comparison I could find was the blogosphere.

Blog posts seem to:

Usually have at least one link to other, related, outside material (i.e., point to new and interesting URLs).
Usually have some discussion of that outside material (i.e., "annotate URLs").
Usually get written by end users who might not be building large scale websites (i.e., are "user-generated content").

In all three of these aspects, blog posts seem like a natural analogue to posts on social bookmarking systems.

When I started looking into numbers for the growth of blogs and the current quantity of blog posts, I was surprised. Blogging is about an order of magnitude bigger than social bookmarking (at least, for now), despite usually being more detailed and requiring more end user effort. Sifry, for example, puts the number of blog posts at around 1.4 million per day.

Blogging is one of the most massive and dynamic phenomena on the web today.

Blogging is Structured

Database researchers have been fascinated by the web for a long time. In 1998, a group of the top researchers in databases got together to try to outline the research challenges for the next decade. What they produced was the Asilomar Report on Database Research, which, among other things, concludes that the grand challenge for database research for the next ten years should be:

The Information Utility: Make it easy for everyone to store, organize, access, and analyze the majority of human information online.

However, database researchers tend to like schema and structure in data, something which has been pretty uncommon on the web until recently. There are some new developments which might give the web more structure, for example, Microformats or the Semantic Web. But it seems like we are going to be stuck with our current web for a while yet. And for now, the most structured data is coming from things like blogs with posts, RDF, RSS, Atom, Pingbacks, Trackbacks, and a variety of other structured output and interactions.

Blogging may be the web's best hope for structured, machine-readable data.

The Web is Becoming Key To Disseminating Research

Researchers have a responsibility to disseminate their most interesting results to their community, and often to the public at large. Over the past decade, that has become increasingly easy. Specifically, the web has made it possible (and even simple) for researchers to make available research results which would only have been available to a small subset of academics and industry researchers a decade ago.

This has led to conflicts like:

Should research be Open Access?
How can we keep double blind peer review while still making research results available on the web in a timely manner?
Should journals exist in an era when publishing can be so easily done on the web? (The arXiv is a powerful example of un-peer reviewed, quality work published on the web.)

However, regardless of the answers to these questions, the web has become an integral part of the research process.

The Beginning

Years ago, the InfoLab did something unusual at the time. We started putting our publications up on the web, with structured data describing them, at our DBPubs publication server.

Now we think it is the right time to join the growing movement of researchers who use blogs to publicize and join a conversation about their work. Some of those people in Computer Science include Scott Aaronson, Hal Daume III, Greg Linden, and John Riedl.

We hope that the eclectic mix of research at the InfoLab will lead to an interesting and useful InfoBlog, for you, our readers, and that you will join us in this conversation.