<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>DBMS 2 : DataBase Management System Services</title>
	
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Tue, 15 May 2012 17:24:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/dbms2/feed" /><feedburner:info uri="dbms2/feed" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Notes on the analysis of large graphs</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/FKZ7miRHNh0/</link>
		<comments>http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/#comments</comments>
		<pubDate>Mon, 14 May 2012 03:35:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Yarcdata and Cray]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6364</guid>
		<description><![CDATA[This post is part of a series on managing and analyzing graph data. Posts to date include: Graph data model basics Relationship analytics definition Relationship analytics applications Analysis of large graphs (this post) My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is part of a series on managing and analyzing graph data. Posts to date include:</em></p>
<ul>
<li><em><a href="../../../../../2012/05/04/notes-on-graph-data-management/">Graph data model basics</a></em></li>
<li><em><a href="../../../../../2012/05/07/terminology-relationship-analytics/">Relationship analytics definition</a></em></li>
<li><em><a href="../../../../../2012/05/07/relationship-analytics-application-notes/">Relationship analytics applications</a></em></li>
<li><em>Analysis of large graphs (this post)</em></li>
</ul>
<p>My series on graph data management and analytics got knocked off-stride by <a href="http://www.dbms2.com/2012/05/07/site-reliability-has-been-ghastly/">our website difficulties</a>. Still, I want to return to one interesting set of issues &#8212; analyzing large graphs, specifically ones that don&#8217;t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.</p>
<p>How big can a graph be? That of course depends on:</p>
<ul>
<li><strong>The number of nodes.</strong> If the nodes of a graph are people, there&#8217;s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you&#8217;re probably capped in the range of 10 billion.</li>
<li><strong>The number of edges. </strong>(Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that&#8217;s a lot of edges.</li>
<li><strong>The typical size of a <em>(node, edge, node)</em> triple.</strong> I don&#8217;t know why you&#8217;d have to go much over 100 bytes post-compression*, but maybe I&#8217;m overlooking something.</li>
</ul>
<p><em>*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include <a href="../../../../../2010/06/19/objectivity-infinite-graph/">weights, timestamps, and so on,</a> but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.</em></p>
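<p>The 34-bit figure in the footnote can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name <code>bits_needed</code> is made up for illustration.</p>

```python
def bits_needed(distinct_ids):
    """Smallest number of bits that gives each of distinct_ids items its own token."""
    # IDs run from 0 to distinct_ids - 1, so size the token for the largest ID.
    return max(1, (distinct_ids - 1).bit_length())


NODE_COUNT = 10 * 10**9  # the 10 billion node cap suggested above
TOKEN_BITS = bits_needed(NODE_COUNT)
print(TOKEN_BITS)  # 34
```

<p>Since 2 to the 33rd power is about 8.6 billion and 2 to the 34th is about 17.2 billion, 34 bits is the smallest token width that covers 10 billion nodes.</p>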
<p>The biggest graph-size estimates I&#8217;ve gotten are from my clients at Yarcdata, a division of Cray. (&#8220;Yarc&#8221; is &#8220;Cray&#8221; spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:</p>
<ul>
<li>An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.</li>
<li>A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.</li>
</ul>
<p>Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the &#8220;smaller&#8221; ones at 22 billion nodes. In these cases, the edges-per-node average seems lower than in people-analysis graphs, but we&#8217;re still talking about 100s of billions of edges.</p>
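<p>Combining those edge counts with the rough 100-bytes-per-triple guess gives a back-of-envelope size estimate. The sketch below assumes round numbers that are not in the estimates themselves: &#8220;billions&#8221; is taken as exactly 1 billion nodes, &#8220;1000s&#8221; as exactly 1,000 edges per node, and one stored triple per edge. The function name is illustrative only.</p>

```python
BYTES_PER_TRIPLE = 100  # the rough post-compression figure guessed at above


def graph_size_terabytes(nodes, edges_per_node, bytes_per_triple=BYTES_PER_TRIPLE):
    """Back-of-envelope stored size in terabytes, counting one triple per edge."""
    edges = nodes * edges_per_node
    return edges * bytes_per_triple / 10**12


# Assumed round numbers: 1 billion and 100 million nodes, 1,000 edges per node.
INTELLIGENCE_TB = graph_size_terabytes(nodes=10**9, edges_per_node=1000)
TELECOM_TB = graph_size_terabytes(nodes=10**8, edges_per_node=1000)
print(INTELLIGENCE_TB, TELECOM_TB)  # 100.0 10.0
```

<p>Under those assumptions the intelligence scenario comes to roughly 100 terabytes and the telecom case to roughly 10 terabytes. Higher node or edge counts scale the result linearly.</p>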
<p>Recalling that <a href="../../../../../2012/05/07/relationship-analytics-application-notes/">relationship analytics boils down to finding paths and subgraphs</a>, the naive relational approach to such tasks would be:<span id="more-6364"></span></p>
<ul>
<li>Store a table with one row per edge.</li>
<li>Do an (n-1)-way join, where n is the number of edges in the path or subgraph.</li>
</ul>
<p>In many cases the cardinality of intermediate result sets would be high, and you&#8217;d basically be doing a series of full table scans. Those could take a while.</p>
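<p>A minimal sketch of that naive approach, using Python&#8217;s built-in <code>sqlite3</code> module on a toy graph. The table name <code>edge</code>, the columns <code>src</code> and <code>dst</code>, and the function <code>paths_of_length</code> are assumptions made for illustration, not anybody&#8217;s actual schema. A path of n edges uses n copies of the edge table tied together by n-1 join conditions.</p>

```python
import sqlite3


def paths_of_length(edges, n):
    """Return every directed walk of n edges by self-joining a one-row-per-edge table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    # Alias the edge table n times (e1, e2, ...) and chain them with n-1 join conditions.
    aliases = ["e%d" % i for i in range(1, n + 1)]
    select_cols = ["e1.src"] + [alias + ".dst" for alias in aliases]
    from_clause = ", ".join("edge AS " + alias for alias in aliases)
    join_conds = [
        aliases[i] + ".dst = " + aliases[i + 1] + ".src" for i in range(n - 1)
    ]
    sql = "SELECT " + ", ".join(select_cols) + " FROM " + from_clause
    if join_conds:
        sql += " WHERE " + " AND ".join(join_conds)
    rows = conn.execute(sql).fetchall()
    conn.close()
    # Note: nothing here stops a walk from revisiting a node.
    return sorted(rows)


EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
print(paths_of_length(EDGES, 2))
```

<p>On four edges this is instant. The trouble described above shows up when the edge table has billions of rows and each extra hop multiplies the size of the intermediate result.</p>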
<p>There are various approaches to dealing with this challenge. For example:</p>
<ul>
<li>Graph analysis has been around long enough that much of it has surely been done relationally.</li>
<li>I wrote about some specific <a href="../../../../../2007/06/15/fast-rdf-in-specialty-relational-databases/">relational strategies for graph analysis</a> five years ago.</li>
<li>A lot of graph analysis these days is being done in Hadoop (or other MapReduce, notably Aster Data&#8217;s).</li>
<li><a href="../../../../../2010/06/19/objectivity-infinite-graph/">Objectivity Infinite Graph and Google Pregel</a> emphasize pre-fetching (or pre-shipping) edges that might soon be needed.</li>
<li>Yarcdata, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I&#8217;m not clear as to which data structure(s) <a href="http://jena.apache.org/about_jena/architecture.html">Jena</a> uses.</li>
</ul>
<p>When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management &#8212; <strong>how successfully can graphs be partitioned?</strong> Opinions vary widely, with the correct answers in each case surely depending on:</p>
<ul>
<li>The topology of the graph.</li>
<li>The size of the graph.</li>
<li>The length of the paths that need to be examined.</li>
</ul>
<p>But in the interest of getting this posted tonight, I&#8217;ll leave further discussion of graph partitioning to another time.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/FKZ7miRHNh0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/</feedburner:origLink></item>
		<item>
		<title>We’re back</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/Tj4EBOiwMkM/</link>
		<comments>http://www.dbms2.com/2012/05/13/were-back/#comments</comments>
		<pubDate>Mon, 14 May 2012 03:11:40 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6362</guid>
		<description><![CDATA[Our blogs have been moved to a new hosting company, and everything should be working. Ditto our business site. If you notice any counterexamples, please be so kind as to ping me.]]></description>
			<content:encoded><![CDATA[<p>Our blogs have been moved to a new hosting company, and everything should be working. Ditto <a href="http://www.monash.com">our business site</a>.</p>
<p>If you notice any counterexamples, please be so kind as to ping me.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/Tj4EBOiwMkM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/13/were-back/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/13/were-back/</feedburner:origLink></item>
		<item>
		<title>Comments are briefly being turned off</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/SS-nNz0KLZk/</link>
		<comments>http://www.dbms2.com/2012/05/09/comments-are-briefly-being-turned-off/#comments</comments>
		<pubDate>Wed, 09 May 2012 14:56:34 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6356</guid>
		<description><![CDATA[I need to move web hosts, and am initiating the process now. This involves a large file copy, a recopy of same, and a variety of manual steps. So until the process is complete, updating site databases is a bad idea. A comment is, of course, an update. So we&#8217;re closing off comments across DBMS [...]]]></description>
			<content:encoded><![CDATA[<p>I need to move web hosts, and am initiating the process now. This involves a large file copy, a recopy of same, and a variety of manual steps. So until the process is complete, updating site databases is a bad idea.</p>
<p>A comment is, of course, an update. So we&#8217;re closing off comments across <em><a href="http://www.dbms2.com">DBMS 2</a>, <a href="http://www.strategicmessaging.com/">Strategic Messaging</a>, <a href="http://www.texttechnologies.com">Text Technologies</a>, <a href="http://www.softwarememories.com">Software Memories</a>, </em>and the <em><a href="http://www.monashreport.com">Monash Report</a>.</em> I hope to turn them back on shortly.</p>
<p>The sites should remain readable all the way through &#8212; unless, of course, there are more <a href="http://www.dbms2.com/2012/05/07/site-reliability-has-been-ghastly/">hosting company outages</a>.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/SS-nNz0KLZk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/09/comments-are-briefly-being-turned-off/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/09/comments-are-briefly-being-turned-off/</feedburner:origLink></item>
		<item>
		<title>Site reliability has been ghastly</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/Tpo-EDvUoqI/</link>
		<comments>http://www.dbms2.com/2012/05/07/site-reliability-has-been-ghastly/#comments</comments>
		<pubDate>Mon, 07 May 2012 17:53:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6353</guid>
		<description><![CDATA[Unfortunately, we&#8217;ve had serious site outages over the past few days, as well as an increased frequency of shorter-term problems. My ordinarily excellent hosting company is going through a bad stretch, and I&#8217;ll have to move away from them. (As usual, I&#8217;ll rely on http://www.webhostingtalk.com for recommendations.) When I pull the trigger on the move, [...]]]></description>
			<content:encoded><![CDATA[<p>Unfortunately, we&#8217;ve had serious site outages over the past few days, as well as an increased frequency of shorter-term problems. My ordinarily excellent hosting company is going through a bad stretch, and I&#8217;ll have to move away from them. (As usual, I&#8217;ll rely on http://www.webhostingtalk.com for recommendations.)</p>
<p>When I pull the trigger on the move, there will be a short period when I turn off comments across all my blogs. I&#8217;ll post again here to announce when that is happening.</p>
<p>I apologize for the inconvenience.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/Tpo-EDvUoqI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/07/site-reliability-has-been-ghastly/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/07/site-reliability-has-been-ghastly/</feedburner:origLink></item>
		<item>
		<title>Relationship analytics application notes</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/4M-9_VtepjI/</link>
		<comments>http://www.dbms2.com/2012/05/07/relationship-analytics-application-notes/#comments</comments>
		<pubDate>Mon, 07 May 2012 14:06:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6328</guid>
		<description><![CDATA[This post is part of a series on managing and analyzing graph data. Posts to date include: Graph data model basics Relationship analytics definition Relationship analytics applications (this post) Analysis of large graphs In my recent post on graph data models, I cited various application categories for relationship analytics. For most applications, it&#8217;s hard to [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is part of a series on managing and analyzing graph data. Posts to date include:</em></p>
<ul>
<li><em><a href="../../../../../2012/05/04/notes-on-graph-data-management/">Graph data model basics</a></em></li>
<li><a href="http://www.dbms2.com/2012/05/07/terminology-relationship-analytics/"><em>Relationship analytics definition</em></a></li>
<li><em>Relationship analytics applications (this post)</em></li>
<li><a href="http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/"><em>Analysis of large graphs</em></a></li>
</ul>
<p>In my recent post on <a href="../../../../../2012/05/04/notes-on-graph-data-management/">graph data models</a>, I cited various application categories for <em>relationship analytics.</em> For most applications, it&#8217;s hard to get a lot of details. Reasons include:</p>
<ul>
<li>In adversarial domains such as national security, anti-fraud, or search engine ranking, it&#8217;s natural to keep algorithms secret.</li>
<li>The big exception, <strong>influencer analytics</strong> (aka social network analysis), is obscured by a major hype/reality gap (so, come to think of it, is a lot of other predictive modeling).</li>
</ul>
<p>Even so, it&#8217;s fairly safe to say:</p>
<ul>
<li>Much of relationship analytics is about subgraph pattern matching.</li>
<li>Much of relationship analytics is about identifying subgraph patterns that are predictive of certain characteristics or outcomes.</li>
<li>An important kind of relationship analytics challenge is to identify influential individuals.</li>
</ul>
<p><span id="more-6328"></span>Notes on that middle point include:</p>
<ul>
<li>Pattern identification could be done through trial-and-error visualization, through predictive modeling, or through any form of investigative analytics in between.</li>
<li>I presume what&#8217;s hardest about all this from a processing-performance standpoint would often be enumerating the subgraphs of a particular candidate pattern.</li>
</ul>
<p>So I&#8217;m tempted to say &#8220;it&#8217;s all about subgraphs.&#8221; But it might be more accurate yet to say <strong>&#8220;It&#8217;s about paths&#8221;. </strong>Arguably, that&#8217;s saying the same thing; paths are subgraphs, and subgraphs are made up of paths, so a way of finding one is also a way of finding the other. But referring to paths nods to such standard tasks as:</p>
<ul>
<li>Finding the shortest path between two nodes.</li>
<li>Calculating centrality metrics.</li>
</ul>
<p>Paths are also simpler than subgraphs, and hence also simpler to think about.</p>
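<p>Both standard tasks fit in a few lines on a small in-memory graph. The sketch below uses breadth-first search for shortest paths and degree centrality as the simplest of the centrality metrics; other metrics (betweenness, closeness and so on) need more work. The sample graph and all names in it are invented, and the graph is assumed to be undirected, unweighted, and to have at least two nodes.</p>

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    others = len(adjacency) - 1  # assumes at least two nodes
    return {node: len(neighbors) / others for node, neighbors in adjacency.items()}


# Invented sample graph: each key maps to the set of its neighbors.
GRAPH = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
print(shortest_path(GRAPH, "ann", "eve"))
print(degree_centrality(GRAPH)["dan"])
```

<p>Here the shortest route from <code>ann</code> to <code>eve</code> has three edges and passes through <code>dan</code>, which is also the node with the highest degree centrality.</p>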
<p>Let&#8217;s drill down a bit more on the cases of influencer analysis and centrality. Telecom service providers around the world compete with relatively few of their peers (because they&#8217;re so geographically bound), and hence are pretty good about sharing technical ideas with each other. One application that has spread like wildfire is influencer analysis for churn control. The idea is to identify influential subscribers who, if they left your service, would be particularly likely to take other people with them, so that you can make great efforts to retain them. The key data used is CDRs (call detail records).</p>
<p>As in many things, it&#8217;s tough to separate influencer analysis adoption fact from fiction.</p>
<ul>
<li>The telecom case is surely real; I&#8217;ve heard of many examples.</li>
<li>Social networking is a harder call. Top-down, the story sounds good; but bottom-up, I&#8217;m not so sure.*</li>
<li>I&#8217;m quite dubious about attempts to use influencer analysis based on, say, credit card records; the detailed information about person-to-person connections isn&#8217;t there.</li>
<li>National security clearly uses similar kinds of techniques, albeit for slightly different purposes.</li>
</ul>
<p>Specific conclusions I&#8217;ve heard include:</p>
<ul>
<li>Who calls you is a better predictor than who you call of whether you influence cellular subscribers to churn along with you.</li>
<li>Length of calls is an indicator of involvement or influence in terrorist networks (short ones suggest there&#8217;s serious business being done).</li>
</ul>
<p><em>*For example my <a href="http://klout.com/#/CurtMonash/topics">Klout profile</a> asserts I&#8217;m more influential about Airlines than about Databases or Software. A bit of manual intervention could surely change that &#8212; which just serves to underscore my doubts about the effectiveness of social network analytic automation.</em></p>
<p>One more thing &#8212; relationship analytics on social networks rarely works unless you take out a few spurious highly-connected nodes. The paradigmatic example is the local pizza parlor, which receives many phone calls, but is neither a terrorist mastermind nor a major influence upon telecom service churn. More on that point when I write about the partitioning of large graphs.</p>
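<p>One simple way to take out such nodes is a degree cutoff. The sketch below is an illustration only: the function name, the sample call graph, and the threshold are all invented, and a real system would presumably pick the cutoff (or a smarter test) from the data.</p>

```python
def drop_hubs(adjacency, max_degree):
    """Remove every node with more than max_degree neighbors, plus edges touching it."""
    hubs = {
        node for node, neighbors in adjacency.items() if len(neighbors) > max_degree
    }
    return {
        node: neighbors - hubs
        for node, neighbors in adjacency.items()
        if node not in hubs
    }


# Invented call graph: everybody has phoned the pizza parlor.
CALLS = {
    "pizza": {"ann", "bob", "cat", "dan"},
    "ann": {"pizza", "bob"},
    "bob": {"pizza", "ann"},
    "cat": {"pizza"},
    "dan": {"pizza"},
}
CLEANED = drop_hubs(CALLS, max_degree=3)
print(CLEANED)
```

<p>With the pizza parlor gone, only the <code>ann</code> and <code>bob</code> connection remains, and <code>cat</code> and <code>dan</code> no longer look linked to everybody else.</p>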
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/4M-9_VtepjI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/07/relationship-analytics-application-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/07/relationship-analytics-application-notes/</feedburner:origLink></item>
		<item>
		<title>Terminology: Relationship analytics</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/fadE9oSamPw/</link>
		<comments>http://www.dbms2.com/2012/05/07/terminology-relationship-analytics/#comments</comments>
		<pubDate>Mon, 07 May 2012 14:05:16 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cogito and 7 Degrees]]></category>
		<category><![CDATA[Objectivity and Infinite Graph]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Yarcdata and Cray]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6327</guid>
		<description><![CDATA[This post is part of a series on managing and analyzing graph data. Posts to date include: Graph data model basics Relationship analytics definition (this post) Relationship analytics applications Analysis of large graphs In late 2005, I encountered a company called Cogito that was using a graphical data manager to analyze relationships. They called this [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is part of a series on managing and analyzing graph data. Posts to date include:</em></p>
<ul>
<li><em><a href="../../../../../2012/05/04/notes-on-graph-data-management/">Graph data model basics</a></em></li>
<li><em>Relationship analytics definition (this post)<br />
</em></li>
<li><a href="http://www.dbms2.com/2012/05/07/relationship-analytics-application-notes/"><em>Relationship analytics applications</em></a></li>
<li><a href="http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/"><em>Analysis of large graphs</em></a></li>
</ul>
<p>In late 2005, I encountered a company called Cogito that was using a graphical data manager to analyze relationships. They called this &#8220;relational analytics&#8221;, which I thought was a terrible name for something that they were trying to claim should NOT be done in a relational DBMS. On the spot, I coined <strong>relationship analytics</strong> as an alternative. A business relationship ensued, which included a short <a href="http://www.monash.com/CogitoBulletin.pdf">white paper</a>. Cogito didn&#8217;t do so well, however, and for a while <a href="../../../../../2009/08/21/social-network-analysis-aka-relationship-analytics/">the term &#8220;relationship analytics&#8221; faltered</a> too. But recently it&#8217;s made a bit of a comeback, having been adopted by Objectivity, QlikTech, Yarcdata and others.</p>
<p>&#8220;Relationship analytics&#8221; is not a perfect name, both because it&#8217;s longish and because it might over-connote a social-network focus. But then, <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no other term would be perfect either</a>. So we might as well stick with it.</p>
<p>In that case, &#8220;relationship analytics&#8221; could use an actual definition, preferably one a little heftier than just:</p>
<blockquote><p>Analytics on graphs.</p></blockquote>
<p><span id="more-6327"></span>At the risk of sounding circular, I&#8217;ll try:</p>
<blockquote><p><strong>Relationship analytics</strong> is analytics that focuses upon <strong>relationships encoded in data.</strong></p></blockquote>
<p>Notes on that proposed definition include:</p>
<ul>
<li>The more directly the relationships are encoded &#8212; for example by a node-edge-node <a href="../../../../../2012/05/04/notes-on-graph-data-management/">graph data model</a> &#8212; the more applicable the term is likely to be.</li>
<li>It can still be relationship analytics if the nodes of the graph are ultimately more important than the edges. The edges just have to be central &#8212; no pun intended &#8212; to the analytics.</li>
<li>&#8220;Analytics&#8221; is a vague term, and &#8220;relationship analytics&#8221; inherits the vagueness. That said, I think of relationship analytics as being more about <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> than the operational kind.</li>
</ul>
<p><em>So what do you think &#8212; does this definition of &#8220;relationship analytics&#8221; work?<br />
</em></p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/fadE9oSamPw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/07/terminology-relationship-analytics/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/07/terminology-relationship-analytics/</feedburner:origLink></item>
		<item>
		<title>Notes on graph data management</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/XI1dX0ViSIc/</link>
		<comments>http://www.dbms2.com/2012/05/04/notes-on-graph-data-management/#comments</comments>
		<pubDate>Fri, 04 May 2012 08:07:19 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Workday]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6321</guid>
		<description><![CDATA[This post is part of a series on managing and analyzing graph data. Posts to date include: Graph data model basics (this post) Relationship analytics definition Relationship analytics applications Analysis of large graphs Interest in graph data models keeps increasing. But it&#8217;s tough to discuss them with any generality, because &#8220;graph data model&#8221; encompasses so [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is part of a series on managing and analyzing graph data. Posts to date include:</em></p>
<ul>
<li><em>Graph data model basics </em><em>(this post)</em></li>
<li><a href="http://www.dbms2.com/2012/05/07/terminology-relationship-analytics/"><em>Relationship analytics definition</em></a></li>
<li><em><a href="http://www.dbms2.com/2012/05/07/relationship-analytics-application-notes/">Relationship analytics applications</a></em></li>
<li><a href="http://www.dbms2.com/2012/05/13/notes-on-the-analysis-of-large-graphs/"><em>Analysis of large graphs</em></a></li>
</ul>
<p>Interest in graph data models keeps increasing. But it&#8217;s tough to discuss them with any generality, because &#8220;graph data model&#8221; encompasses so many different things. Indeed, just as all data structures can be mapped to relational ones, it is also the case that all data structures can be mapped to graphs.</p>
<p>Formally, a graph is a collection of <em>(node, edge, node)</em> triples. In the simplest case, the edge has no properties other than existence or maybe direction, and the triple can be reduced to a <em>(node, node) </em>pair, unordered or ordered as the case may be. It is common, however, for edges to encapsulate additional properties, the canonical examples of which are:</p>
<ul>
<li><strong>Weight.</strong> Usually, the intuition here is that the weight is a number indicating the strength of the connection. This is generally derived from more basic data.</li>
<li><strong>Kind. </strong>The edge can encapsulate one or more descriptors indicating the kind of relationship between the nodes.</li>
</ul>
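<p>As a concrete illustration, here is one way the triples and their reduction to pairs might look in Python. The <code>Edge</code> type, the sample data, and the idea of deriving weight from a call count are assumptions made for the sketch, not a description of any particular product.</p>

```python
from collections import namedtuple

# An edge carrying the two canonical properties: kind and weight.
Edge = namedtuple("Edge", ["kind", "weight"])

# A graph as a collection of (node, edge, node) triples.
# Weights here stand in for something derived from raw data, e.g. a call count.
TRIPLES = [
    ("ann", Edge(kind="calls", weight=12.0), "bob"),
    ("bob", Edge(kind="calls", weight=3.0), "ann"),
    ("ann", Edge(kind="lives_at", weight=1.0), "12 Elm St"),
]


def to_pairs(triples, directed=True):
    """Drop edge properties, reducing each triple to a (node, node) pair."""
    if directed:
        return {(src, dst) for src, _edge, dst in triples}
    # Unordered pairs: frozenset makes (a, b) and (b, a) the same pair.
    return {frozenset((src, dst)) for src, _edge, dst in triples}


print(len(to_pairs(TRIPLES, directed=True)))   # 3
print(len(to_pairs(TRIPLES, directed=False)))  # 2
```

<p>The two call edges between <code>ann</code> and <code>bob</code> stay distinct as ordered pairs but collapse into one when direction is ignored.</p>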
<p>Many of the graph examples I can think of fit into four groups:<span id="more-6321"></span></p>
<ul>
<li>Networks of people, aka <strong>social networks.</strong> Three (overlapping) areas of particular importance are:
<ul>
<li>People/communications.
<ul>
<li>One canonical example is influencer-finding in telecommunications customer bases. The nodes are subscribers; the edges are call details (raw or aggregated).</li>
<li>Other examples may be found as subgraphs of our next category, namely &#8230;</li>
</ul>
</li>
</ul>
<ul>
<li>&#8230; people/places/things.
<ul>
<li>This is the classic structure for anti-terrorism, law enforcement, or anti-fraud use cases.</li>
<li>Nodes are people, buildings/addresses, cars, businesses, etc., except that &#8230;</li>
<li>&#8230; nodes can actually be ordered pairs <em>(tangible thing, timestamp).</em> After all, it&#8217;s more interesting if two people were, not just in the same place, but in the same place at the same time.</li>
</ul>
</li>
<li>People/connections/recommendations.
<ul>
<li>Similarly, there are use cases in which various people have social network connections, and then also recommend products of some kind.</li>
<li>Edges can carry information about the evident strength of the social network connection &#8230;</li>
<li>&#8230; but also about apparent similarities in taste.</li>
</ul>
</li>
</ul>
</li>
<li>Graphs of <strong>IT objects.</strong> Various sets of conceptual IT objects can be viewed as graphs. For example:
<ul>
<li>I visited Workday recently. They refer to their Java object model as a &#8220;graph.&#8221;</li>
<li>Neo Technology (the neo4j guys) started out doing a content management system, and eventually decided that what they really wanted underneath it was a graph-oriented DBMS.</li>
<li>Now one of Neo&#8217;s major application areas is MDM (Master Data Management).</li>
<li>Most dramatically, there&#8217;s Tim Berners-Lee&#8217;s &#8220;Semantic Web&#8221;, which is built on RDF, which models things as &#8220;a directed, labeled graph&#8221;. SPARQL, OWL and so on are in the mix as well. To date, the Semantic Web has been a lot of hot air, only without the hot aspect; still, it&#8217;s obviously influenced many people&#8217;s thinking about graphs.</li>
<li><em>Edit: Please see Marie&#8217;s comment below for a rather major example I left out. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></li>
</ul>
</li>
<li><strong>Taxonomies, ontologies, </strong>and/or<strong> semantic networks.</strong>
<ul>
<li>To a large extent this overlaps with my previous category &#8230;</li>
<li>&#8230; but I&#8217;m particularly fond of the example of straightforward taxonomies of words, e.g. WordNet. The nodes are the words themselves, or more precisely word senses (i.e., specific meanings of a word); edges are typically chosen from a limited set of alternatives such as <em>is_a, is_part_of, </em>or <em>entails.</em></li>
</ul>
</li>
<li>Finally, there are representations of <strong>physical graphs</strong>. Examples might include telecom networks, utility grids, or locations and routes for physical deliveries.</li>
</ul>
<p>My main reason for reciting these diverse examples is to illustrate that, for any really interesting technical discussion, it is necessary to focus on a subset of the possible use cases.</p>
<p><em>This post is intended to start a short series. When the next one goes up &#8212; focusing on a particular set of use cases <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; this footer will be edited accordingly.</em></p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/XI1dX0ViSIc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/04/notes-on-graph-data-management/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/04/notes-on-graph-data-management/</feedburner:origLink></item>
		<item>
		<title>Big Data hype?</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/qTFpsEam314/</link>
		<comments>http://www.dbms2.com/2012/05/03/big-data-hype/#comments</comments>
		<pubDate>Thu, 03 May 2012 10:17:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6311</guid>
		<description><![CDATA[A reporter wrote in to ask whether investor interest in &#8220;Big Data&#8221; was justified or hype. (More precisely, that&#8217;s how I reinterpreted his questions. ) His examples were Splunk&#8217;s IPO, Teradata&#8217;s stock price increase, and Birst&#8217;s financing. In a nutshell: My comments, lightly edited, are in plain text below. Further thoughts are in italics. Of [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter wrote in to ask whether investor interest in &#8220;Big Data&#8221; was justified or hype. (More precisely, that&#8217;s how I reinterpreted his questions. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  ) His examples were Splunk&#8217;s IPO, Teradata&#8217;s stock price increase, and Birst&#8217;s financing. In a nutshell:</p>
<ul>
<li>My comments, lightly edited, are in plain text below.</li>
<li>Further thoughts are in italics.</li>
<li>Of course I also linked him to my post <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">&#8220;Big Data&#8221; has jumped the shark</a>.</li>
<li>Overall, my responses boil down to &#8220;Of course there&#8217;s some hype.&#8221;</li>
</ul>
<p>1. A great example of hype is that anybody is calling Birst a &#8220;Big Data&#8221; or &#8220;Big Data analytics&#8221; company. If anything, Birst is a &#8220;little data&#8221; analytics company that claims, as a differentiating feature, that it can handle ordinary-sized data sets as well.<span id="more-6311"></span></p>
<p><em>When I checked Birst&#8217;s website, &#8220;Big Data&#8221; was nowhere to be found. On the other hand, the term was all over its press pitch for the financing.</em></p>
<p>2. The great growth in database sizes is both caused and balanced out by Moore&#8217;s Law. The net effect is healthy but not enormous growth in the overall data management and analytics markets.</p>
<p><em>I&#8217;ve made versions of that point many times before.</em></p>
<p>3. Incumbent data and analytic technology vendors such as Oracle, IBM, and Microsoft are vulnerable, but are competing very hard. Favorable exits have ensued for companies such as <a href="../../../../../2010/09/20/ibm-netezza-acquisition/">Netezza</a>, <a href="../../../../../2008/07/24/microsoft-is-buying-datallegro/">DATAllegro</a>, <a href="../../../../../2011/02/14/some-quick-notes-on-hp-vertica/">Vertica</a>, and <a href="../../../../../2011/03/04/teradata-aster-data-ncluster/">Aster Data</a>.</p>
<p><em>The connection between those two points is that the big companies will hold a lot of share, but part of how they&#8217;ll hold it is through acquisitions. For example, IBM, Microsoft, HP, Teradata, and EMC all bought newish analytic RDBMS vendors, at an aggregate cost of several billion dollars. And SAP bought Sybase.</em></p>
<p>But while there have been billions of dollars in fairly recent analytics-related acquisitions, the pace of acquisition would have to accelerate much further yet to justify current valuations.</p>
<p><em>Upon reflection, I may have overestimated the acquisition/IPO total-value-created ratio somewhat. Even so, what&#8217;s the last enterprise technology vendor to create huge investor value by going public, continuing to prosper, and so on? Red Hat and Autonomy may be as good as it gets. VMware isn&#8217;t really an example, because of its ownership structure. </em></p>
<p>4. I&#8217;m worried that people may be overestimating the business benefit of accurate analytics, great though that value truly is. For example, it&#8217;s not plausible that all enterprises in the world can use better analytics to improve their respective market shares at the same time.</p>
<p><em>Yes, it&#8217;s great to be an arms dealer to all sides. But &#8220;Big Data&#8221; technology is just another chapter in the ever-growing importance of IT.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/03/big-data-hype/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/03/big-data-hype/</feedburner:origLink></item>
		<item>
		<title>Thinking about market segments</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/5g02ss73TT8/</link>
		<comments>http://www.dbms2.com/2012/05/01/thinking-about-market-segments/#comments</comments>
		<pubDate>Tue, 01 May 2012 11:00:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6297</guid>
		<description><![CDATA[It is a reasonable (over)simplification to say that my business boils down to: Advising vendors what/how to sell. Advising users what/how to buy. One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application [...]]]></description>
			<content:encoded><![CDATA[<p>It is a reasonable (over)simplification to say that <a href="http://www.monash.com/">my business</a> boils down to:</p>
<ul>
<li>Advising vendors what/how to sell.</li>
<li>Advising users what/how to buy.</li>
</ul>
<p>One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let&#8217;s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.</p>
<p>Last June I <a href="http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">wrote</a>:</p>
<blockquote><p>In almost any IT decision, there are a number of <strong>environmental constraints</strong> that need to be acknowledged. Organizations may have <strong>standard vendors</strong>, favored vendors, or simply vendors who give them <a href="../2011/06/24/observations-on-oracle-pricing/">particularly deep discounts</a>. <strong>Legacy systems</strong> are in place, application and system alike, and may or may not be open  to replacement. Enterprises may have on-premise or off-premise  preferences; SaaS (Software as a Service) vendors probably have <strong>multitenancy</strong> concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly <strong>integrated </strong>with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against <strong>open-source software.</strong> You may be pro- or anti-<strong>appliance.</strong> Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as <strong>budget</strong>, <strong>timeframe, security, </strong>or<strong> trained personnel.</strong></p></blockquote>
<p>I&#8217;d further say that it matters whether the buyer:</p>
<ul>
<li>Is a large central IT organization.</li>
<li>Is the well-staffed IT organization of a particular business department.</li>
<li>Is a small, frazzled IT organization.</li>
<li>Has strong engineering or technical skills, but less in the way of IT specialists.</li>
<li>Is trying to skate by without much technical knowledge of any kind.</li>
</ul>
<p>Now let&#8217;s map those considerations (and others) to some specific market segments.<span id="more-6297"></span></p>
<ul>
<li><strong>Traditional large enterprises&#8217; central IT organizations</strong> commonly:
<ul>
<li>Favor large, proven vendors and well-accepted IT methodologies.</li>
<li>Would like to consolidate their IT vendors as much as possible.</li>
<li>Have major challenges with legacy systems and data integration &#8230;</li>
<li>&#8230; which are often exacerbated by mergers.</li>
<li>Spend a lot of cycles on bureaucracy and company politics.</li>
<li>Notwithstanding the foregoing, have resources to invest in some &#8220;sizzle&#8221; initiatives.</li>
</ul>
</li>
<li><strong>The very largest enterprises</strong> are more likely than their slightly smaller counterparts to:
<ul>
<li>View IT as a potential area of competitive differentiation.</li>
<li>Believe much of what they do should be custom, due to their unique needs and resources.</li>
<li>Experiment with unproven technologies.</li>
</ul>
</li>
<li><strong>Smaller enterprises</strong> may:
<ul>
<li>Have small, generalist, overwhelmed staffs.</li>
<li>Hope for turnkey application solutions (SaaS or otherwise).</li>
<li>Get very committed to/reliant on a small number of vendors.</li>
</ul>
</li>
<li>In particular, <strong>IBM or Microsoft loyalists </strong>can be:
<ul>
<li>Extremely locked into their preferred vendor&#8217;s strategies.</li>
<li>Not very fruitful for rival vendors to attempt to sell to.</li>
</ul>
</li>
<li><strong>Humongous consumer internet companies</strong> tend to:
<ul>
<li>Have very high opinions of themselves and their technical abilities.</li>
<li>Be open source zealots, for reasons both of free-like-beer and free-like-speech.</li>
<li>In particular, not want to buy anybody else&#8217;s software.</li>
<li>Not be big fans of relational database designs.</li>
</ul>
</li>
<li><strong>Other large consumer internet companies</strong> tend to:
<ul>
<li>Be like the humongous ones they look up to, but maybe not to the same extremes.</li>
<li>In particular, be more willing to pay for software.</li>
<li>Be mired in company politics only/mainly to the extent they are both large and old(er).</li>
</ul>
</li>
<li><strong>Smaller consumer internet companies</strong> tend to:
<ul>
<li>Be like the large ones they look up to, but &#8230;</li>
<li>&#8230; be quite short on traditional IT skills, and work around that shortage by reinventing various wheels.</li>
</ul>
</li>
<li><strong>Business-oriented SaaS</strong> (Software as a Service) companies commonly:
<ul>
<li>Are drawn to the cool open source technologies consumer internet companies use &#8230;</li>
<li>&#8230; but may wind up using more traditional kinds of DBMS, for the same reasons those DBMS are used in other business applications.</li>
<li>Are more primitive in the analytic capabilities they offer their customers than I think they should be (analytics-only vendors sometimes excepted).</li>
<li>Are refreshingly free of traditional IT politics, because technology is too important to them to mess around with too badly. (Of course, any other kinds of company politics may still come into play.)</li>
</ul>
</li>
<li><strong>Internet operations of traditional enterprises:</strong>
<ul>
<li>Sometimes are just like stand-alone internet businesses.</li>
<li>Sometimes are just like &#8212; and part of &#8212; the rest of the enterprise&#8217;s IT operations.</li>
<li>More commonly are somewhere in between.</li>
</ul>
</li>
<li><strong>Marketing departments of traditional enterprises </strong>sometimes:
<ul>
<li>Want to do their own data acquisition, management, and/or analysis &#8230;</li>
<li>&#8230; without having great IT resources of their own.</li>
<li>Invest in <a href="../../../../../2012/01/23/departmental-analytics-general-observations/">departmental analytics</a> efforts or even &#8230;</li>
<li>&#8230; have line executives who are analytically proficient.</li>
<li>Make heavy use of SaaS, as an alternative to relying on central IT, or as a natural byproduct of acquiring third-party data.</li>
</ul>
</li>
<li><strong>Large investment firms </strong>commonly:
<ul>
<li>Have numerous departments, each with its own IT experts.</li>
<li>Care about sub-millisecond latency &#8230;</li>
<li>&#8230; and sub-week time-to-value.</li>
<li>Experience return-on-investment in a very different way than most businesses do.</li>
</ul>
</li>
<li><strong>Telecom service companies</strong> commonly differ from other similarly-sized enterprises in that:
<ul>
<li>They are more aggressive about using innovative technology to manage (and analyze) data.</li>
<li>They somewhat resemble investment firms in having multiple departments that each have broad engineering discretion.</li>
</ul>
</li>
<li><strong>National security</strong> customers often:
<ul>
<li>Want the best, cutting-edge, sometimes custom technology, yet &#8230;</li>
<li>&#8230; make themselves very cumbersome to sell to and support.</li>
<li>Are not forthcoming about how they use what they buy.</li>
</ul>
</li>
</ul>
<p>I could keep going for quite a while &#8212; but for now I won&#8217;t. Vertical markets I&#8217;m thus omitting include but are not limited to:</p>
<ul>
<li>Pharmaceutical researchers</li>
<li>Hospitals</li>
<li>Insurers</li>
<li><a href="../../../../../2009/10/03/issues-in-scientific-data-management/">Academic scientists</a></li>
</ul>
<p>Finally, for yet another omission &#8212; in my original outline I contemplated distinguishing among various geographical areas, with my first-pass segmentation being:</p>
<ul>
<li>North America</li>
<li>Europe</li>
<li>Japan</li>
<li>China</li>
<li>Smaller geographies</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/05/01/thinking-about-market-segments/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/05/01/thinking-about-market-segments/</feedburner:origLink></item>
		<item>
		<title>Notes on the Hadoop and HBase markets</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/9h-xEEfj1ds/</link>
		<comments>http://www.dbms2.com/2012/04/24/notes-on-the-hadoop-and-hbase-markets/#comments</comments>
		<pubDate>Tue, 24 Apr 2012 08:40:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[ClearStory Data]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[WibiData]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=6284</guid>
		<description><![CDATA[I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were: Cloudera now has 220 employees. Cloudera now has over 100 subscription customers. Over the past year, Cloudera has more than doubled in size by every reasonable metric. Over half of Cloudera&#8217;s customers use [...]]]></description>
			<content:encoded><![CDATA[<p>I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:</p>
<ul>
<li>Cloudera now has 220 employees.</li>
<li>Cloudera now has over 100 subscription customers.</li>
<li>Over the past year, Cloudera has more than doubled in size by every reasonable metric.</li>
<li>Over half of Cloudera&#8217;s customers use HBase, vs. <a href="http://www.dbms2.com/2011/07/18/hbase-is-not-broken/">a figure of 18+ last July</a>.</li>
<li>Omer Trajman &#8212; who by the way has made a long-overdue official move into technical marketing &#8212; can no longer keep count of <a href="http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/">how many petabyte-scale Hadoop clusters Cloudera supports</a>.</li>
<li>Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.</li>
<li>Cloudera has trained over 12,000 people.</li>
<li>Hortonworks is training people too.</li>
<li>Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.</li>
<li>A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.</li>
<li>Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.</li>
<li>There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it&#8217;s the majority of all Amazon Web Services processing.</li>
<li>I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn&#8217;t as true of older companies who built out a lot of technology before Hadoop was invented.)</li>
<li>There should be more HBase information at <a href="http://www.hbasecon.com/">HBaseCon</a> on May 22.</li>
<li>If MapR still has momentum, nobody I talked with has noticed.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/04/24/notes-on-the-hadoop-and-hbase-markets/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2012/04/24/notes-on-the-hadoop-and-hbase-markets/</feedburner:origLink></item>
	</channel>
</rss>

