<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel rdf:about="http://fgiasson.com/blog">
	<title>Frederick Giasson's Weblog</title>
	<link>http://fgiasson.com/blog</link>
	<description />
	<dc:date>2009-11-18T17:10:35Z</dc:date>
	<admin:generatorAgent rdf:resource="http://wordpress.org/?v=2.8.4" />
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
		<items>
		<rdf:Seq>
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/11/16/when-linked-data-rules-fail/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/10/20/common-and-irjson-php-parsers-released/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/09/18/a-new-home-for-umbel-web-services/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/08/21/new-release-of-umbel-v072/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/08/18/structwsf-early-querying-metrics/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/08/12/construct-a-skin-for-structwsf/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/08/10/re-introduction/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/07/02/release-of-structwsf-construct-and-the-community-web-site/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/06/16/structwsf-and-construct-websites-unveiled/" />
					<rdf:li rdf:resource="http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/" />
				</rdf:Seq>
	</items>
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/FredOnSomething" type="application/rss+xml" /><feedburner:emailServiceId>FredOnSomething</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:browserFriendly>This is an XML content feed. It is intended to be viewed in a newsreader or syndicated to another site.</feedburner:browserFriendly><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /></channel>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/11/16/when-linked-data-rules-fail/">
	<title>When Linked Data Rules Fail</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/SUH3v-KgZ0Y/</link>
	 <dc:date>2009-11-16T17:03:11Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
	<description>
High Visibility Problems with NYT, data.gov Show Need for Better
Practices

When I say, "shot", what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term "bank"? Do you now think of someone being shot in an armed robbery ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=When Linked Data Rules Fail&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-11-16&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/11/16/when-linked-data-rules-fail/&amp;rft.language=English"></span>
<p><a href="http://www.adhd-mindbydesign.com"><img style="border: 0px solid; width: 220px; height: 223px; float: left; margin-right: 10px;" title="Image Source: www.adhd-mindbydesign.com" src="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_disconnected.jpg" alt="Image Source: www.adhd-mindbydesign.com" hspace="5" vspace="5" align="left" /></a></p>
<h2>High Visibility Problems with NYT, data.gov Show Need for Better Practices</h2>
<p>When I say, &#8220;shot&#8221;, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term &#8220;bank&#8221;? Do you now think of someone being shot in an armed robbery of a local bank or similar?</p>
<p>And, now, what if I add a reference to say, <a style="font-style: italic;" href="http://en.wikipedia.org/wiki/The_Hustler_%28film%29">The Hustler</a>, or Minnesota Fats, or &#8220;Fast Eddie&#8221; Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?</p>
<p>As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.</p>
<p>Over the past few weeks we have seen announcements of two large and high-visibility <a href="http://en.wikipedia.org/wiki/Linked_data">linked data</a> projects: one, a first release of references for articles concerning about 5,000 people from the New York Times at <a class="http" href="http://data.nytimes.com/">data.nytimes.com</a>; and two, a massive exposure of 5 billion triples from <a href="http://tw.rpi.edu/">data.gov</a> datasets provided by the <a href="http://tw.rpi.edu/">Tetherless World Constellation</a> (TWC) at <a href="http://rpi.edu/">Rensselaer Polytechnic Institute</a> (RPI).</p>
<p>On various grounds from <a href="http://go-to-hellman.blogspot.com/2009/10/new-york-times-blunders-into-linked.html">licensing</a> to <a href="http://dowhatimean.net/2009/10/linked-data-at-the-new-york-times-exciting-but-buggy">data characterization</a> and to creating linked data for its <a href="http://www.betaversion.org/%7Estefano/linotype/news/351/">own sake</a>, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, <a href="http://www.mkbergman.com/843/must-read-data-smoke-and-mirrors/">commented</a> about a week ago that &#8220;we have now moved beyond &#8216;proof of concept&#8217; to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.&#8221;</p>
<p>Reactions to <a href="http://www.mkbergman.com/843/must-read-data-smoke-and-mirrors/">that posting</a> and continued discussion on various <a href="http://lists.w3.org/Archives/Public/public-esw-thes/2009Nov/0000.html">mailing lists</a> warrant a more precise dissection of what is wrong and still needs to be done with these datasets <a href="#ld1">[1]</a>.</p>
<h3>Berners-Lee&#8217;s Four Linked Data &#8220;Rules&#8221;</h3>
<p> It is useful, then, to return to first principles, namely the original four &#8220;rules&#8221; posed by Tim Berners-Lee in his design note on linked data <a href="#ld2">[2]</a>:</p>
<ol>
<li>Use URIs as names for things</li>
<li>Use HTTP URIs so that people can look up those names</li>
<li>When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)</li>
<li>Include links to other URIs so that they can discover more things.</li>
</ol>
<p>The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.</p>
<p>However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of &#8220;providing useful information&#8221; (Rule #3). And, the <span class="double_u">nature</span> of the links in Rule #4 is a real problem for the NYT dataset.</p>
<h3>What Constitutes &#8220;Useful Information&#8221;?</h3>
<p> The Wikipedia entry on <a href="http://en.wikipedia.org/wiki/Linked_data">linked data</a> expands on &#8220;useful information&#8221; by augmenting the original rule with the parenthetical clause, &#8221; (<span style="font-style: italic;">i.e.</span>, a structured description — metadata).&#8221; But even that expansion is insufficient.</p>
<p>Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.</p>
<p>We can break this problem description down into three parts:</p>
<ul>
<li>A <span style="font-weight: bold; font-style: italic;">vocabulary</span> that defines the nature of the instances and their descriptive attributes</li>
<li>A <span style="font-weight: bold; font-style: italic;">schema</span> of some nature that describes the structural relationships amongst instances and their characteristics, and, optimally,</li>
<li>A <span style="font-weight: bold; font-style: italic;">mapping</span> to existing external schema or constructs that help place the data into context.</li>
</ul>
<p>At minimum, <span class="double_u">ANY</span> dataset exposed as linked data needs to be described by a <span style="font-weight: bold; font-style: italic;">vocabulary</span>. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a <span style="font-weight: bold; font-style: italic;">schema</span> of relationships in which to embed each instance record. And, best practice is to also <span style="font-weight: bold; font-style: italic;">map</span> those structures to external schema.</p>
<p>Lacking this &#8220;useful information&#8221;, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.</p>
<h4>The data.gov (RPI) Case</h4>
<p>With the support of NSF and various grant funding, RPI has set up the <a href="http://data-gov.tw.rpi.edu/wiki/The_Data-gov_Wiki">Data-Gov Wiki</a> <a href="#ld3">[3]</a>, which is in the process of converting the datasets on <a href="http://www.data.gov">data.gov</a> to RDF, placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.</p>
<p>As of the date of this posting, the site had a <a href="http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog">catalog</a> of 116 datasets from the 800 or so available on data.gov, leading to these statistics:</p>
<ul>
<li>459,412,419 table entries</li>
<li>5,074,932,510 triples, and</li>
<li>7,564 properties (or attributes).</li>
</ul>
<p>We&#8217;ll take one of these datasets, <a href="http://www.data.gov/details/319">#319</a>, and look a bit closer at it:</p>
<table border="1" cellspacing="0" cellpadding="4">
<tbody>
<tr>
<th style="background-color: #cccccc;">Wiki</th>
<th style="background-color: #cccccc;"> Title</th>
<th style="background-color: #cccccc;"> Agency</th>
<th style="background-color: #cccccc;"> Name</th>
<th style="background-color: #cccccc;"> data.gov Link</th>
<th style="background-color: #cccccc;"> No Properties</th>
<th style="background-color: #cccccc;"> No Triples</th>
<th style="background-color: #cccccc;">RDF File</th>
</tr>
<tr>
<td><a title="Dataset 319" href="http://data-gov.tw.rpi.edu/wiki/Dataset_319">Dataset 319</a></td>
<td>Consumer Expenditure Survey</td>
<td><a title="Department of Labor" href="http://data-gov.tw.rpi.edu/wiki/Department_of_Labor">Department of Labor</a></td>
<td><a title="LABOR-STAT (page does not exist)" href="http://data-gov.tw.rpi.edu/w/index.php?title=LABOR-STAT&amp;action=edit&amp;redlink=1">LABOR-STAT</a></td>
<td><a title="http://www.data.gov/details/319" rel="nofollow" href="http://www.data.gov/details/319">http://www.data.gov/details/319</a></td>
<td style="text-align: right;">22</td>
<td style="text-align: right;">1,583,236</td>
<td><a title="http://data-gov.tw.rpi.edu/raw/319/index.rdf" rel="nofollow" href="http://data-gov.tw.rpi.edu/raw/319/index.rdf">http://data-gov.tw.rpi.edu/raw/319/index.rdf</a></td>
</tr>
</tbody>
</table>
<p>This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this <a href="http://data-gov.tw.rpi.edu/wiki/Dataset_319">page</a>:</p>
<div style="margin: 10px;">
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_wiki_dataset_319.png"><br />
<img class="center" style="border: 0px solid; width: 600px; height: 611px;" title="Click to expand" src="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_wiki_dataset_319.png" alt="Data-gov-Wiki Dataset #319" /></a></p>
<p><span style="font-style: italic; font-size: 90%;">(click to expand)</span></p></div>
<p>So, we see that this specific dataset contains 22 of the nearly 8,000 attributes across all datasets.</p>
<p>When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.</p>
<p>Inspecting this page further, we see that, other than the broad characterization of the dataset itself (the bulk of the page), the bottom of the page lists 22 undefined attributes with labels such as <span style="font-style: italic;">item code</span>, <span style="font-style: italic;">periodicity code</span>, <span style="font-style: italic;">seasonal</span>, and the like. These attributes are the real structural basis for the data in this dataset.</p>
<p>But, what does all of this mean???</p>
<p>To gain a clue, now let&#8217;s go to the source data.gov site for this <a href="http://www.data.gov/details/319">dataset (#319)</a>. Here is how that report looks:</p>
<div style="margin: 10px;">
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_data_gov_319.png"><br />
<img class="center" style="border: 0px solid; width: 600px; height: 1146px;" title="Click to expand" src="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_data_gov_319.png" alt="Data.gov Dataset #319" /></a></p>
<p><span style="font-style: italic; font-size: 90%;">(click to expand)</span></p></div>
<p> Contained within this report we see a listing for additional <a href="ftp://ftp.bls.gov/pub/time.series/cx/cx.txt">metadata</a>. This link tells us about the various data fields contained in this dataset; we see many of these attributes are &#8220;codes&#8221; to various data categories.</p>
<p>Probing further into the dataset&#8217;s <a href="http://www.bls.gov/cex/">technical documentation</a>, we see that there is indeed a rich structure underneath this report, again provided via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this <a href="http://www.bls.gov/cex/csxgloss.htm">glossary page</a>.) These are the keys to understanding the actual values within this dataset.</p>
<p>For example, one major dimension of the data is captured by the attribute <span style="font-style: italic;">item_code</span>. The survey breaks down consumption expenditures within the broad categories of Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For example, expenditures for Bakery Products within Food are given a <a href="ftp://ftp.bls.gov/pub/time.series/cx/cx.item">code</a> of FHC2.</p>
<p>But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.</p>
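<p>To make that gap concrete, here is a minimal sketch in Python. It is not the Data-Gov Wiki&#8217;s conversion code: the record layout, the <span style="font-style: italic;">ITEM_CODES</span> table and the <span style="font-style: italic;">describe</span> function are hypothetical names chosen for illustration. Only the FHC2 code for Bakery Products within Food is taken from the BLS code list cited above.</p>
<pre>
```python
# A record as a consumer sees it today: a property label and an opaque code.
# The field name and layout are hypothetical; only FHC2 comes from the BLS list.
record = {"item_code": "FHC2"}

# A tiny, hand-written slice of what a published vocabulary would supply.
ITEM_CODES = {
    "FHC2": {"label": "Bakery Products", "category": "Food"},
}


def describe(record, code_list):
    """Return a readable description, or None when the code is undefined."""
    entry = code_list.get(record["item_code"])
    if entry is None:
        return None
    return entry["category"] + " / " + entry["label"]


# Without the vocabulary the code stays locked; with it, the value is readable.
without_vocabulary = describe(record, {})
with_vocabulary = describe(record, ITEM_CODES)
print(without_vocabulary)
print(with_vocabulary)
```
</pre>
<p>The same triple is useless in the first call and meaningful in the second; the only difference is whether the code list travels with the data.</p>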
<p>So, for literally billions of triples, and nearly 8,000 attributes, we have <span style="font-weight: bold;">ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL</span>. There is much, much rich value here in data.gov, but all of it remains locked up and hidden.</p>
<p>The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.</p>
<p>To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a <span style="font-weight: bold; font-style: italic;">vocabulary</span> is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And until a <span style="font-weight: bold; font-style: italic;">schema</span> of some nature that connects attributes and relations across datasets is also published, the real value from connecting the dots will also remain hidden.<img style="width: 160px; height: 218px; float: right; margin-left: 10px;" title="The Hustler" src="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_the_hustler.jpg" alt="The Hustler" align="right" /></p>
<p>These datasets may meet the partial conditions of providing clickable URLs, but the crucial &#8220;useful information&#8221; as to what any of this data means is absent.</p>
<p>Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data. </p>
<p>Until ontologies get created from these technical documents, the value of these data instances remains locked up, and no value can be created from having these datasets expressed in RDF.</p>
<p>The devil lies in the details. The essential hard work has not yet begun.</p>
<h4>The NYT Case</h4>
<p>Though at a much smaller scale with many fewer attributes, the <a href="http://data.nytimes.com">NYT dataset</a> suffers from the same failing: it too lacks a <span style="font-weight: bold; font-style: italic;">vocabulary</span>.</p>
<p>So, let&#8217;s take the case of one of the lead actors in <a style="font-style: italic;" href="http://en.wikipedia.org/wiki/The_Hustler_%28film%29">The Hustler</a>, Paul Newman, who played the role of &#8220;Fast Eddie&#8221; Felson. Here is the <a href="http://data.nytimes.com/N31738445835662083893.html">NYT record</a> for the &#8220;person&#8221; <span style="font-style: italic;">Paul Newman</span> (which they also refer to as <a href="http://data.nytimes.com/newman_paul_per">http://data.nytimes.com/newman_paul_per</a>). Note the header title of <span style="font-weight: bold;">Newman, Paul</span>:</p>
<div style="margin: 10px;">
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_nyt_paul_newman.png"><br />
<img class="center" style="border: 0px solid; width: 600px; height: 593px;" title="Click to expand" src="http://fgiasson.com/blog/wp-content/uploads/2009/11/091115_nyt_paul_newman.png" alt="NYT 'Paul Newman Articles' Record" /></a></p>
<p><span style="font-style: italic; font-size: 90%;">(click to expand)</span></p></div>
<p>Click on any of the internal labels used by the NYT for its own attributes (such as <a href="http://data.nytimes.com/elements/first_use">nyt:first_use</a>), and you will be given this message:</p>
<div style="margin-left: 40px;">
<p><span style="font-style: italic;">&#8220;An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.&#8221;</span></p></div>
<p>We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for <a href="http://data.nytimes.com/elements/first_use">nyt:first_use</a> we have a value of &#8220;2001-03-18&#8221;.</p>
<p>Hello? What? What is a &#8220;first use&#8221; for a &#8220;Paul Newman&#8221; of &#8220;2001-03-18&#8221;???</p>
<p>The NYT put the cart before the horse: they should have released their ontology, even a minimal one, before or at least at the same time as their data instances. (See further <a href="http://www.mkbergman.com/825/fresh-perspectives-on-the-semantic-enterprise/">this discussion</a> about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)</p>
<h3>Links to Other Things</h3>
<p>Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.</p>
<p>We now are in the territory of the fourth &#8220;rule&#8221; of linked data: <span style="font-style: italic;">4. Include links to other URIs so that they can discover more things</span>.</p>
<p>This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting &#8220;thing&#8221; to which we are linking.</p>
<h4>What is a &#8220;Newman, Paul&#8221; Thing?</h4>
<p>Of course, without its own vocabulary, we are left to deduce what this thing &#8220;<span style="font-weight: bold;">Newman, Paul</span>&#8221; <span class="double_u">is</span> that is shown in the previous screen shot. Our first clue comes from the statement that it is of <span style="font-style: italic;">rdf:type</span> <a href="http://www.w3.org/TR/skos-reference/">SKOS</a> <span style="font-style: italic;">concept</span>. By looking to the SKOS vocabulary, we see that <a href="http://www.w3.org/TR/skos-reference/#concepts"><span style="font-style: italic;">concept</span></a> is a class and is defined as:</p>
<p style="margin-left: 40px; font-style: italic;">A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.</p>
<p>We also see that this instance is given a <a href="http://xmlns.com/foaf/0.1/primaryTopic">foaf:primaryTopic</a> of <span style="font-style: italic;">Paul Newman</span>.</p>
<p>So, we can deduce so far that this instance is about the concept or idea of <span style="font-style: italic;">Paul Newman</span>. Now, looking to the attributes of this instance (that is, the defining properties provided by the NYT), we see the properties of <a href="http://data.nytimes.com/elements/associated_article_count">nyt:associated_article_count</a>, <a href="http://data.nytimes.com/elements/first_use">nyt:first_use</a>, <a href="http://data.nytimes.com/elements/last_use">nyt:last_use</a> and <a href="http://data.nytimes.com/elements/topicPage">nyt:topicPage</a>. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:</p>
<p style="margin-left: 40px;"><span style="font-style: italic;">New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman</span></p>
<p>(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)</p>
<p>We also would re-title this instance more akin to &#8220;2001-2009 NYT Articles with a Primary Topic of Paul Newman&#8221; or some such and use URIs more akin to this usage. </p>
<h4>sameAs Woes</h4>
<p>Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject &#8220;thing&#8221; at hand. There is much confusion about actual &#8220;things&#8221; and the references to &#8220;things&#8221; and what is the nature of a &#8220;thing&#8221; within the literature and on mailing lists.</p>
<p>Our belief and usage in matters of the semantic Web is that all &#8220;things&#8221; we deal with are a reference to whatever the &#8220;true&#8221;, actual thing is. The question then becomes:  What is the nature (or scope) of this referent?</p>
<p>There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the &#8220;thing&#8221; being referred to. In our case above, we have the &#8220;<span style="font-weight: bold;">Newman, Paul</span>&#8221; instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have <a href="http://data.nytimes.com/elements/associated_article_count">nyt:associated_article_count</a>, <a href="http://data.nytimes.com/elements/first_use">nyt:first_use</a>, <a href="http://data.nytimes.com/elements/latest_use">nyt:last_use</a> and <a href="http://data.nytimes.com/elements/topicPage">nyt:topicPage</a>.</p>
<p>Clearly, the nature of this instance record deals with articles or groups of articles. The relation to <span style="font-style: italic;">Paul Newman</span> arises because he is the <span class="double_u">primary topic</span> of these articles, not because the instance describes a <span class="double_u">person</span>. If the nature of the instance were indeed the person <span style="font-style: italic;">Paul Newman</span>, then the attributes of the record would more properly be related to &#8220;person&#8221; properties such as age, sex, birthdate, death date, marital status, etc.</p>
<p>This confusion by NYT as to the nature of the &#8220;things&#8221; they are describing then leads to some very serious errors. By confusing the topic (<span style="font-style: italic;">Paul Newman</span>) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, <span style="font-weight: bold;">owl:sameAs</span>.</p>
<p>By asserting in the &#8220;<span style="font-weight: bold;">Newman, Paul</span>&#8221; record that the instance has a <span style="font-weight: bold;">sameAs</span> relationship with external records in <a href="http://rdf.freebase.com/ns/en.paul_newman">Freebase</a> and <a href="http://dbpedia.org/resource/Paul_Newman">DBpedia</a>, the NYT both <a href="http://en.wikipedia.org/wiki/Entailment">entails</a> that properties from any of the associated records are shared and <a href="http://en.wikipedia.org/wiki/Inference">infers</a> a chain of other types to describe the record. More precisely, the NYT is asserting that the &#8220;things&#8221; referred to by these instances are <strong class="moz-txt-star">identical</strong> resources.</p>
<p>Thus, by the <span style="font-weight: bold;">sameAs</span> statements in the <span style="font-weight: bold;">“Newman, Paul”</span> record, the NYT is also asserting that that record is an instance of all these classes:</p>
<table border="0">
<tbody>
<tr>
<td></td>
<td>
<ul>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/about/html/http://www.w3.org/2002/07/owl%23Thing">owl:Thing</a></li>
<li> <a href="http://xmlns.com/foaf/spec/#term_Agent">foaf:Agent</a></li>
<li> <a href="http://xmlns.com/foaf/spec/#term_Person">foaf:Person</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/ontology/Actor">dbpedia-owl:Actor</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/JewishActors">http://dbpedia.org/class/yago/JewishActors</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/PeopleFromCleveland,Ohio">http://dbpedia.org/class/yago/PeopleFromCleveland,Ohio</a></li>
<li><a class="uri" rel="rdf:type" href="http://dbpedia.org/ontology/Artist">dbpedia-owl:Artist</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/ontology/Person">dbpedia-owl:Person</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/Person100007846">http://dbpedia.org/class/yago/Person100007846</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/AmericanFilmDirectors">http://dbpedia.org/class/yago/AmericanFilmDirectors</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/YaleUniversityAlumni">http://dbpedia.org/class/yago/YaleUniversityAlumni</a></li>
<li><a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/OhioUniversityAlumni">http://dbpedia.org/class/yago/OhioUniversityAlumni</a></li>
<li> <a class="uri" rel="rdf:type" href="http://sw.opencyc.org/2008/06/10/concept/Mx4rvVjWoZwpEbGdrcN5Y29ycA">opencyc:en/MaleHuman</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/AmericanFilmActors">http://dbpedia.org/class/yago/AmericanFilmActors</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/Liberals">http://dbpedia.org/class/yago/Liberals</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/OhioActors">http://dbpedia.org/class/yago/OhioActors</a></li>
<li><a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/UnitedStatesNavySailors">http://dbpedia.org/class/yago/UnitedStatesNavySailors</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/PeopleFromWestport,Connecticut"> http://dbpedia.org/class/yago/PeopleFromWestport,Connecticut</a></li>
<li> <a class="uri" rel="rdf:type" href="http://sw.opencyc.org/2008/06/10/concept/Mx4rwQB4UJwpEbGdrcN5Y29ycA">opencyc:en/JewishPerson</a></li>
<li> <a class="uri" rel="rdf:type" href="http://sw.opencyc.org/2008/06/10/concept/Mx4rwMRyTJwpEbGdrcN5Y29ycA">opencyc:en/ActorInMovies</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/LivingPeople">http://dbpedia.org/class/yago/LivingPeople</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/Actor109765278">http://dbpedia.org/class/yago/Actor109765278</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/AmericanVegetarians">http://dbpedia.org/class/yago/AmericanVegetarians</a></li>
<li><a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/AmericanPhilanthropists">http://dbpedia.org/class/yago/AmericanPhilanthropists</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/KenyonCollegeAlumni">http://dbpedia.org/class/yago/KenyonCollegeAlumni</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/WesternFilmActors">http://dbpedia.org/class/yago/WesternFilmActors</a></li>
<li> <a class="uri" rel="rdf:type" href="http://dbpedia.org/class/yago/ActorsStudioAlumni">http://dbpedia.org/class/yago/ActorsStudioAlumni</a></li>
<li>and, a hundred other dbpedia_yago superClasses.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>Furthermore, because of its strong, reciprocal entailments, the <span style="font-weight: bold;">owl:sameAs</span> assertion would also now entail that the person <span style="font-style: italic;">Paul Newman</span> has the <a href="http://data.nytimes.com/elements/first_use">nyt:first_use</a> and <a href="http://data.nytimes.com/elements/latest_use">nyt:last_use</a> attributes, clearly illogical for a &#8220;person&#8221; thing.</p>
<p>This connection is clearly wrong in both directions. <span style="font-style: italic;">Articles</span> are not <span style="font-style: italic;">persons</span> and don&#8217;t have <span style="font-style: italic;">marital status</span>; and <span style="font-style: italic;">persons</span> do not have <span style="font-style: italic;">first_uses</span>. By misapplying this <span style="font-weight: bold;">sameAs</span> linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of &#8220;things&#8221; our data is about.</p>
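<p>The leak in both directions can be sketched in a few lines of Python. This is a deliberately naive illustration, not a real OWL reasoner: triples are plain tuples written with prefixed names, statements are copied only on the subject side, and transitivity is ignored. The two URIs, the <span style="font-weight: bold;">skos:Concept</span> type, the <span style="font-weight: bold;">nyt:first_use</span> value and the <span style="font-weight: bold;">foaf:Person</span> type come from the records discussed above; the function name <span style="font-style: italic;">apply_same_as</span> is our own.</p>
<pre>
```python
# Triples as (subject, predicate, object) tuples, written with prefixed names.
NYT = "http://data.nytimes.com/N31738445835662083893"
DBPEDIA = "http://dbpedia.org/resource/Paul_Newman"

triples = {
    (NYT, "rdf:type", "skos:Concept"),
    (NYT, "nyt:first_use", "2001-03-18"),
    (NYT, "owl:sameAs", DBPEDIA),
    (DBPEDIA, "rdf:type", "foaf:Person"),
}


def apply_same_as(triples):
    """Naively copy every statement across each owl:sameAs pair, both ways."""
    inferred = set(triples)
    pairs = [(s, o) for s, p, o in triples if p == "owl:sameAs"]
    for a, b in pairs:
        for s, p, o in triples:
            if p == "owl:sameAs":
                continue
            if s == a:
                inferred.add((b, p, o))
            if s == b:
                inferred.add((a, p, o))
    return inferred


inferred = apply_same_as(triples)
# Does the person now carry an article-collection property?
print((DBPEDIA, "nyt:first_use", "2001-03-18") in inferred)
# Is the article collection now typed as a person?
print((NYT, "rdf:type", "foaf:Person") in inferred)
```
</pre>
<p>Both checks succeed, which is exactly the problem: one assertion is enough to merge two descriptions that were never about the same kind of thing.</p>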
<h4>Some Options</h4>
<p>However, there are solutions. First, the <span style="font-weight: bold;">sameAs</span> assertions, at least involving these external resources, should be dropped.</p>
<p>Second, if linkages are still desired, a vocabulary such as <a href="http://umbel.org">UMBEL</a> <a href="#ld4">[4]</a> could be used to make an assertion between such a concept and these other related resources. So, even though these resources are not the same, they are <strong>closely</strong> related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.</p>
<p>Instead of using the <span style="font-weight: bold;">owl:sameAs</span> property, we would suggest using <span style="font-weight: bold;">umbel:linksEntity</span>, which links a <span style="font-weight: bold;">skos:Concept</span> to related named-entity resources. Additionally, Freebase, which also currently asserts a <span style="font-weight: bold;">sameAs</span> relationship to the NYT resource, could use the <span style="font-weight: bold;">umbel:isAbout</span> relationship to assert that its resource &#8220;is about&#8221; a certain concept, namely the one defined by the NYT.</p>
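<p>As a rough sketch of what such a substitution could look like, the Python snippet below rewrites <span style="font-weight: bold;">owl:sameAs</span> links that point at external resources into <span style="font-weight: bold;">umbel:linksEntity</span> links. Plain tuples stand in for RDF triples; the UMBEL namespace URI, the DBpedia resource URI, the label value and the list of &#8220;external&#8221; prefixes are assumptions made for illustration and should be checked against the respective documentation.</p>

```python
# Sketch under stated assumptions: plain tuples stand in for RDF triples.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
# Assumed namespace for the UMBEL vocabulary; verify before relying on it.
UMBEL_LINKS_ENTITY = "http://umbel.org/umbel#linksEntity"

# URI prefixes treated as external named-entity sources (illustrative choice).
EXTERNAL_PREFIXES = ("http://dbpedia.org/", "http://rdf.freebase.com/")


def relink(triples):
    """Replace owl:sameAs links to external resources with umbel:linksEntity."""
    result = []
    for s, p, o in triples:
        if p == OWL_SAME_AS and o.startswith(EXTERNAL_PREFIXES):
            result.append((s, UMBEL_LINKS_ENTITY, o))
        else:
            result.append((s, p, o))
    return result


# Hypothetical input: the object URI and the label are assumed for the example.
nyt_triples = [
    ("http://data.nytimes.com/N31738445835662083893", OWL_SAME_AS,
     "http://dbpedia.org/resource/Paul_Newman"),
    ("http://data.nytimes.com/N31738445835662083893", SKOS_PREF_LABEL,
     "Newman, Paul"),
]

for triple in relink(nyt_triples):
    print(triple)
```

<p>Only the predicate changes; the concept and the entity keep their own identifiers, so no statement about one is silently transferred to the other.</p>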
<p>Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests. </p>
<h4>Other Minor Issues</h4>
<p>We would also offer a couple of minor suggestions for the NYT dataset:</p>
<ul>
<li>Create a <span style="font-weight: bold;">foaf:Organization</span> description of the NYT organization, then use it with <span style="font-weight: bold;">dc:creator</span> and <span style="font-weight: bold;">dcterms:rightsHolder</span> rather than using a literal, and</li>
<li>The dual URIs such as &#8220;<a href="http://data.nytimes.com/N31738445835662083893">http://data.nytimes.com/N31738445835662083893</a>&#8221; and &#8220;<a href="http://data.nytimes.com/newman_paul_per">http://data.nytimes.com/newman_paul_per</a>&#8221; are not wrong in themselves, but their purpose is hard to understand. Why does a single organization need to create multiple URIs for the <strong class="moz-txt-star">identical resource,</strong> when it comes from the same system and has the same purpose?</li>
</ul>
<h4>Re-visiting the Linkage &#8220;Rule&#8221;</h4>
<p>There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the &#8220;things&#8221; being linked, or the properties that define these linkages, are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.</p>
<h3>In the End, the Challenge is Not Linked Data, but <span style="font-style: italic; text-decoration: underline;">Connected</span> Data</h3>
<p>Our critical comments are not meant to be disrespectful, and we are not being picky. The NYT and TWC are prominent institutions from which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a &#8220;<a href="http://en.wikipedia.org/wiki/Hype_cycle">trough of disillusionment</a>&#8221; as <a href="http://twitter.com/gregboutin/status/5558525462">some</a> have been pointing out.</p>
<p>This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.</p>
<p>In a broader sense, we also seem to lack a definition of best practices related to <span style="font-weight: bold;">vocabularies</span>, <span style="font-weight: bold;">schema</span> and <span style="font-weight: bold;">mappings</span>. The Berners-Lee rules are imprecise and insufficient as is. Prior guidance documents tend to be more about how to publish and make URIs linkable than about how to properly characterize, describe and connect the data.</p>
<p>Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of <span style="font-style: italic;">linking data</span>, but the meaning and basis for <span class="double_u">connecting</span> that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. They also need to pass the sniff test as we &#8220;follow our nose&#8221; by clicking the links exposed by the data.</p>
<p>It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.</p>
<div style="background-color: #ffffcc;border: 1px dotted yellow;margin: 15px 60px;padding: 8px;vertical-align: middle;margin: 0pt 0pt 0pt 10px;  width: 300px; text-align: center;">This posting has been jointly authored by <a href="http://mkbergman.com"> Mike Bergman</a> and <a href="http://fgiasson.com/blog">Fred Giasson</a> and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.</div>
<hr style="margin: 15px 0px;" size="1" />
<div style="margin: 10px 0pt; font-size: 90%;"><a id="ld1" name="ld1"></a> [1] The NYT dataset has been updated with improvements that fixed multiple issues from the first release. The problems listed herein, however, still pertain after these improvements.</div>
<div style="margin: 10px 0pt; font-size: 90%;"><a id="ld2" name="ld2"></a> [2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on 2009-06-18. See <a href="http://www.w3.org/DesignIssues/LinkedData.html">http://www.w3.org/DesignIssues/LinkedData.html</a>. Berners-Lee refers to the steps above as &#8220;rules,&#8221; but he elaborates they are expectations of behavior. Most later citations refer to these as &#8220;principles.&#8221;</div>
<div style="margin: 10px 0pt; font-size: 90%;"><a id="ld3" name="ld3"></a> [3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See <a href="http://www.cs.vu.nl/%7Epmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf">http://www.cs.vu.nl/~pmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf</a>.</div>
<div style="margin: 10px 0pt; font-size: 90%;"><a id="ld4" name="ld4"></a> [4] UMBEL <em>(Upper Mapping and Binding Exchange Layer)</em> is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. Its purpose has resulted in the creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See <a href="http://umbel.org/technical_documentation.html#vocabulary">http://umbel.org/technical_documentation.html#vocabulary</a>.</div>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/11/16/when-linked-data-rules-fail/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/10/20/common-and-irjson-php-parsers-released/">
	<title>commON and irJSON PHP parsers released</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/ceiB6LjOjaE/</link>
	 <dc:date>2009-10-20T21:15:45Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[irON]]></dc:subject>
	<description>Two days ago we released irON: Instance Record and Object Notation (irON) Specification. irON is a new notation that has been created to describe instance records. irON records can be serialized in 3 different formats: irXML (XML), irJSON (JSON) and commON (CSV: mainly for spreadsheet manipulations).

The release of irON has ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=commON and irJSON PHP parsers released&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=irON&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-10-20&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/10/20/common-and-irjson-php-parsers-released/&amp;rft.language=English"></span>
<p><img class="size-full wp-image-988 alignleft" title="iron_logo_235" src="http://fgiasson.com/blog/wp-content/uploads/2009/10/iron_logo_235.png" alt="iron_logo_235" width="99" height="53" />Two days ago <a href="http://structureddynamics.com">we</a> released irON: <a href="http://openstructs.org/iron/iron-specification">Instance Record and Object Notation (irON) Specification</a>. irON is a new notation that has been created to describe instance records. irON records can be serialized in 3 different formats: <a href="http://openstructs.org/iron/iron-specification#mozTocId408837">irXML</a> (XML), <a href="http://openstructs.org/iron/iron-specification#mozTocId462570">irJSON</a> (JSON) and <a href="http://openstructs.org/iron/iron-specification#mozTocId603499">commON</a> (CSV: mainly for spreadsheet manipulations).</p>
<p>The release of irON has already been covered at length on <a href="http://www.mkbergman.com/838/iron-semantic-web-for-mere-mortals/">Mike&#8217;s blog</a> and in <a href="http://structureddynamics.com/pr20091018.html">Structured Dynamics&#8217; press room</a>, so I won&#8217;t talk more about it here.</p>
<h3>irON Parsers</h3>
<p>What I am happy to release today are the first two parsers that can be used to parse and validate irON datasets of instance records: one for irJSON and one for commON. Each parser has been developed in PHP and is available under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache 2 licence</a>. Now, let&#8217;s take a look at each of them.</p>
<h3>irJSON Parser</h3>
<p style="text-align: left;">The irJSON parser package can be <a href="http://code.google.com/p/iron-notation/downloads/list">downloaded here</a>. Additionally, the source code can be <a href="http://code.google.com/p/iron-notation/source/browse/#svn/trunk/irJSON">browsed here</a>.</p>
<p>First of all, to understand the code, you have to understand the <a href="http://openstructs.org/iron/iron-specification#mozTocId462570">specification of the irJSON serialization</a>.</p>
<p>The irJSON parser package is everything you need to test and use the parser. The package is composed of the following files:</p>
<ul>
<li>test.php &#8211; If you want to quick-start with this package, just run this test.php script and you will get an idea of what it can do for you. This script runs the parser over an irJSON test file, and shows you some validation errors along with the internal parsed structure of the file. From there, you can simply use the irJSONParser class, with the structure that is returned, to do whatever you need: adding the information to your database, converting the data to another format, etc.</li>
<li>irJSONParser.php &#8211; This is the irJSON parser class. It parses the irJSON file and populates its internal structure, which is composed of instances of the classes below.</li>
<li>Dataset.php &#8211; This class defines a Dataset record with all its attributes. It is one of the objects returned by the parser that the developer has to manipulate.</li>
<li>InstanceRecord.php &#8211; This class defines an Instance Record with all its attributes. It is one of the objects returned by the parser that the developer has to manipulate.</li>
<li>StructureSchema.php &#8211; This class defines a Structure Schema record with all its attributes. It is one of the objects returned by the parser that the developer has to manipulate.</li>
<li>LinkageSchema.php &#8211; This class defines a Linkage Schema record with all its attributes. It is one of the objects returned by the parser that the developer has to manipulate.</li>
</ul>
<p>The irJSON parser also validates the incoming irJSON files according to these three levels of validation:</p>
<ol>
<li>JSON well-formedness validation &#8211; The first validation test occurs on the JSON serialization itself. A JSON file has to be well formed in order to be processed. A failure at this level will raise an error to the user.</li>
<li>irJSON well-formedness validation &#8211; Once the JSON is parsed and well formed, the parser makes sure that the file is irJSON well-formed. If it is not well formed according to the irJSON spec, an error will be raised to the user.</li>
<li>Structure Schema validation &#8211; The last validation that occurs is between instance records and their related Structure Schema (if available). If a validation error happens at this level, a notice will be raised to the user.</li>
</ol>
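<p>The three levels above can be pictured with a small Python sketch. It is not the PHP parser&#8217;s API: the function name, the assumed top-level <code>dataset</code> and <code>recordList</code> keys, the per-record <code>id</code> rule and the <code>required_attributes</code> list (a stand-in for a Structure Schema) are simplifications chosen for illustration. What it does mirror is the distinction between errors (levels 1 and 2) and notices (level 3).</p>

```python
import json


def validate_irjson(text, required_attributes=None):
    """Return (errors, notices) for an irJSON-like document.

    Assumptions for illustration only: the top-level object carries a
    "dataset" object and a "recordList" array, each record needs an "id",
    and required_attributes stands in for a Structure Schema.
    """
    errors, notices = [], []

    # Level 1: JSON well-formedness.
    try:
        document = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"JSON error: {exc.msg}"], notices

    # Level 2: irJSON well-formedness (simplified, assumed rules).
    if not isinstance(document, dict):
        return ["irJSON error: top level must be an object"], notices
    if not isinstance(document.get("dataset"), dict):
        errors.append("irJSON error: missing 'dataset' object")
    records = document.get("recordList")
    if not isinstance(records, list):
        errors.append("irJSON error: missing 'recordList' array")
        records = []
    for index, record in enumerate(records):
        if not isinstance(record, dict) or "id" not in record:
            errors.append(f"irJSON error: record {index} has no 'id'")
    if errors:
        return errors, notices

    # Level 3: schema validation produces notices, not errors.
    for record in records:
        for attribute in required_attributes or []:
            if attribute not in record:
                notices.append(f"notice: record {record['id']} lacks '{attribute}'")
    return errors, notices


sample = '{"dataset": {"id": "d1"}, "recordList": [{"id": "r1"}]}'
print(validate_irjson(sample, ["prefLabel"]))
```

<p>Stopping at the first failing level keeps later checks from reporting misleading problems on input that could not be parsed in the first place.</p>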
<p>You can experiment with some of these validation errors and notices by running the test.php script in the package.</p>
<p>With this package, developers can already start to parse irJSON files and to integrate them with some of their prototype projects.</p>
<h3>commON Parser</h3>
<p>The commON parser package can be <a href="http://code.google.com/p/iron-notation/downloads/list">downloaded here</a>. Additionally, the source code can be <a href="http://code.google.com/p/iron-notation/source/browse/#svn/trunk/commON">browsed here</a>.</p>
<p>To understand the code, you have to understand the <a href="http://openstructs.org/iron/iron-specification#mozTocId603499">specification of the commON serialization</a>.</p>
<p>The commON parser package is everything you need to test the parser. The package is composed of the following files:</p>
<ul>
<li>test.php &#8211; If you want to quick-start with this package, just run this test.php script and you will get an idea of what it can do for you. This script runs the parser over a file, and shows you some validation errors along with the internal parsed structure of the file. From there, you can simply use the CommonParser class, with the structure that is returned, to do whatever you need: adding the information to your database, converting the data to another format, etc.</li>
<li>CommonParser.php &#8211; This is the commON parser class. It parses the commON file and populates its internal structure, which is described in the code.</li>
</ul>
<p>The commON parser also validates the incoming commON files according to these two levels:</p>
<ol>
<li>CSV well-formedness validation &#8211; The first validation test occurs on the <a href="http://www.rfc-editor.org/rfc/rfc4180.txt">CSV</a> serialization itself. A CSV file has to be well formed in order to be processed. A failure at this level will raise an error to the user.</li>
<li>commON well-formedness validation &#8211; Once the CSV is parsed and well formed, the parser makes sure that the file is commON well-formed. If it is not well formed according to the commON specification, an error will be raised to the user.</li>
</ol>
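<p>The same layering can be sketched for commON in Python, using the standard <code>csv</code> module for the first level. The second-level rules here are assumptions made for illustration, not the rules implemented in CommonParser.php: the sketch assumes a first row holding a section marker that starts with <code>&amp;&amp;</code>, followed by a header row whose cells start with <code>&amp;</code> and include <code>&amp;id</code>.</p>

```python
import csv
import io


def validate_common(text):
    """Return a list of error strings for a commON-like CSV document."""
    # Level 1: CSV well-formedness. strict=True makes the csv module raise
    # csv.Error on malformed quoting instead of silently guessing.
    try:
        rows = [row for row in csv.reader(io.StringIO(text), strict=True) if row]
    except csv.Error as exc:
        return [f"CSV error: {exc}"]

    # Level 2: commON well-formedness (assumed, simplified rules).
    if not rows or not rows[0][0].startswith("&&"):
        return ["commON error: expected a '&&' section marker on the first row"]
    if len(rows) == 1:
        return ["commON error: missing column header row"]
    errors = []
    headers = [cell for cell in rows[1] if cell]
    if "&id" not in headers:
        errors.append("commON error: no '&id' column")
    for cell in headers:
        if not cell.startswith("&"):
            errors.append(f"commON error: header '{cell}' lacks the '&' prefix")
    return errors


# Hypothetical commON fragment used only to exercise the checks.
sample = "&&recordList\n&id,&type,&prefLabel\nr1,Person,Paul Newman\n"
print(validate_common(sample))
```

<p>A file that fails the CSV level never reaches the commON checks, which matches the ordering of the two levels described above.</p>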
<p>You can experiment with some of these validation errors and notices by running the test.php script in the package.</p>
<p>With this package, developers can already start parsing commON files and integrating them with some of their prototype projects.</p>
<p>The commON parser is less advanced than the irJSON one. For example, the implementation of the &#8220;dataset&#8221; and the &#8220;schema&#8221; processor keywords is not yet done. Other keywords haven&#8217;t (yet) been integrated either. Take a look at the source code to know what is currently missing.</p>
<p>In any case, a lot of things can already be done with this parser. In the coming weeks we will publish specific commON use cases that will show how we are using commON internally and how we expect our customers to use it to create and maintain different smaller datasets.</p>
<h3><strong>Conclusion</strong></h3>
<p>These are the first versions of the irJSON and commON parsers. We have to continue development so that they fully reflect the current and future irON specification. We also have yet to write the irXML parser.</p>
<p>I would encourage reporting any issues with these parsers, or any enhancement suggestions, <a href="http://code.google.com/p/iron-notation/issues/list">on this issue tracker</a>.</p>
<p>All discussions regarding these parsers and the irON specification document should happen on the <a href="http://groups.google.com/group/iron-notation?pli=1">irON group mailing list here</a>.</p>
<p>Finally, another step for us will be to embed these parsers in converter web services for <a href="http://openstructs.org/structwsf/">structWSF</a>.</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/10/20/common-and-irjson-php-parsers-released/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/09/18/a-new-home-for-umbel-web-services/">
	<title>A New Home for UMBEL Web Services</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/YqTK2Bnp4z0/</link>
	 <dc:date>2009-09-18T21:27:46Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Ping the Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[UMBEL]]></dc:subject>
	<description>Eight months ago we announced the dissolution of Zitgist LLC. This event led to the creation of a "sandbox" to keep alive all the online assets of the company. Since this sandbox server was not owned by Structured Dynamics, it was becoming hard for us to update UMBEL and its ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=A New Home for UMBEL Web Services&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Ping the Semantic Web&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=UMBEL&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-09-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/09/18/a-new-home-for-umbel-web-services/&amp;rft.language=English"></span>
<p><span style="font-weight: normal; font-size: 14px; "><img class="alignleft size-full wp-image-916" title="umbel_ws" src="http://fgiasson.com/blog/wp-content/uploads/2008/10/umbel_ws.png" alt="umbel_ws" width="170" height="74" />Eight months ago we announced the dissolution of Zitgist LLC. This event led to the creation of a &#8220;sandbox&#8221; to keep alive all the online assets of the company. Since this sandbox server was not owned by <a href="http://structureddynamics.com/">Structured Dynamics</a>, it was becoming hard for us to update UMBEL and its online services. That is why we took the time to move the services onto our new servers.</span></p>
<h3>A New Home</h3>
<p><img class="alignright size-full wp-image-920" title="sd_logo_260" src="http://fgiasson.com/blog/wp-content/uploads/2009/01/sd_logo_260.png" alt="sd_logo_260" width="260" height="60" />Structured Dynamics LLC now hosts a new version of the UMBEL Web services. From the main menu at the <a href="http://structureddynamics.com/">SD Web site</a> you can access these services under the &#8220;<a href="http://structureddynamics.com/umbel_ws/index.php">umbel ws</a>&#8221; menu option (you can also bookmark the Web services site at <a href="http://umbel.structureddynamics.com/">umbel.structureddynamics.com</a> or <a href="http://ws.umbel.org/">ws.umbel.org</a>).</p>
<p>Moving UMBEL&#8217;s Web services to a new home will make future upgrades of UMBEL easier, and will make the Web services endpoints easier to maintain as well. With this move, I am pleased to announce the release of five initial Web services and one visualization tool:</p>
<p><strong>Lookup Web Services:</strong></p>
<ul>
<li><a href="http://ws.umbel.org/finder_subject_concept.php">Finder: Subject      Concept</a></li>
<li><a href="http://ws.umbel.org/reporter_subject_concept.php">Reporter: Subject      Concept</a></li>
</ul>
<p><strong>Inference Engine Web Services:</strong></p>
<ul>
<li><a href="http://ws.umbel.org/inference_lister.php">Inference: Lister &#8212; list      sub-classes, super-classes and equivalent-classes</a></li>
<li><a href="http://ws.umbel.org/inference_validator.php">Inference: Validator &#8212;      verify sub-class, super-class and equivalent-class relationships</a></li>
</ul>
<p><strong>SPARQL endpoint Web Service:</strong></p>
<ul>
<li><a href="http://ws.umbel.org/sparql.php">SPARQL Endpoint</a></li>
</ul>
<p><strong>Visual Tool:</strong></p>
<ul>
<li><a href="http://ws.umbel.org/explorer.php">Subject Concept Explorer</a></li>
</ul>
<p><em>Note that the visual tool is using <a href="http://moritz.stefaner.eu/projects/relation-browser/">Moritz Stefaner&#8217;s Relation Browser</a>.</em></p>
<h3>Ping the Semantic Web</h3>
<p><img class="alignright size-full wp-image-832" title="ptswlogo160.gif" src="http://fgiasson.com/blog/wp-content/uploads/2007/08/ptswlogo160.gif" alt="ptswlogo160.gif" width="160" height="90" />Additionally, the <a href="http://pingthesemanticweb.com">Ping the Semantic Web</a> RDF pinging service is now the property of <a href="http://openlinksw.com">OpenLink Software Inc.</a> OpenLink is now hosting, maintaining and developing the service.</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/09/18/a-new-home-for-umbel-web-services/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/08/21/new-release-of-umbel-v072/">
	<title>New release of UMBEL: v072</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/_MFgxjqpii0/</link>
	 <dc:date>2009-08-21T18:49:49Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[UMBEL]]></dc:subject>
	<description>I am pleased to announce that we resumed our work with UMBEL. We just released the version v0.72, which is based on the OpenCyc version 2009-01-31. This new version is intermediary and has been created mostly to check the evolution of OpenCyc vis-à-vis UMBEL. Within the next month or so, ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New release of UMBEL: v072&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=UMBEL&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-08-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/08/21/new-release-of-umbel-v072/&amp;rft.language=English"></span>
<p style="text-align: left; "><img class="alignright size-full wp-image-825" title="umbel_medium.png" src="http://fgiasson.com/blog/wp-content/uploads/2007/07/umbel_medium.png" alt="umbel_medium.png" width="206" height="100" />I am pleased to announce that we resumed our work with <a href="http://umbel.org">UMBEL</a>. We just released the version <a href="http://umbel.org/documentation.html">v0.72</a>, which is based on the <a href="http://opencyc.org">OpenCyc</a> version 2009-01-31. This new version is intermediary and has been created mostly to check the evolution of OpenCyc vis-à-vis UMBEL. Within the next month or so, we will release a new version (v0.80), which will introduce a major new concept that should help systems and users manipulate the entire UMBEL Subject Concepts structure.</p>
<p>For those who want to know what changed between versions v071 and v072, <a href="http://umbel.org/ontology/umbel_v071_v072_difference.csv">here is a CSV file that lists all the changes between the versions</a>. There are four columns: (1) source node, (2) attribute, (3) target node and (4) version number. This file lists all triples that are present in one version but not in the other. So, you have all changes (nodes &amp; arcs) between the two versions. Almost all the changes come from internal changes to OpenCyc. We did fix a couple of things such as removing cycles in the graph, etc. But 99% of the changes come from changes within OpenCyc.</p>
<p>Finally, note that the Web services endpoints will be updated with this new version of the UMBEL subject concepts in the coming week, along with the dereferencing of their URIs. Stay tuned!</p>
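<p>A difference listing of this shape can be produced with two set subtractions. The Python sketch below is an illustration, not the script used to generate the published file: the function name and the sample triples are hypothetical, and it assumes the fourth column names the version in which the triple is present.</p>

```python
def diff_versions(old_triples, new_triples, old_label, new_label):
    """Return rows (source, attribute, target, version) for triples found in
    only one of the two versions, mirroring the four-column layout above."""
    old_set, new_set = set(old_triples), set(new_triples)
    # Triples that disappeared are tagged with the old version label...
    rows = [(s, a, t, old_label) for (s, a, t) in sorted(old_set - new_set)]
    # ...and triples that appeared are tagged with the new one.
    rows += [(s, a, t, new_label) for (s, a, t) in sorted(new_set - old_set)]
    return rows


# Hypothetical subject-concept triples, invented for the example.
v071 = [
    ("sc:Dog", "rdfs:subClassOf", "sc:Mammal"),
    ("sc:Dog", "rdfs:subClassOf", "sc:Pet"),
]
v072 = [
    ("sc:Dog", "rdfs:subClassOf", "sc:Mammal"),
    ("sc:Dog", "rdfs:subClassOf", "sc:DomesticatedAnimal"),
]

changes = diff_versions(v071, v072, "v071", "v072")
for row in changes:
    print(",".join(row))
```

<p>Triples shared by both versions never appear in the output, which is why the listing stays small when most of the structure is unchanged.</p>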
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/08/21/new-release-of-umbel-v072/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/08/18/structwsf-early-querying-metrics/">
	<title>structWSF Early Querying Metrics</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/SXi4kfcfAFw/</link>
	 <dc:date>2009-08-18T21:04:12Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>We have been running different structWSF instances for about two months now. Each instance is hosting different dataset(s) that are queried for different purposes. I think it is worth taking some time to start analyzing the querying stats of two of these instances of the early Alpha version of structWSF.

The ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=structWSF Early Querying Metrics&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Structured Dynamics&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-08-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/08/18/structwsf-early-querying-metrics/&amp;rft.language=English"></span>
<p>We have been running different structWSF instances for about two months now. Each instance is hosting different dataset(s) that are queried for different purposes. I think it is worth taking some time to start analyzing the querying stats of two of these instances of the early Alpha version of structWSF.</p>
<p>The goal is to create some checkpoints that we will be able to use in the future to check how the system has improved or deteriorated. It is also to check what kind of metrics we could derive from the current logging system, and to check whether we could find any bottlenecks or issues with any of the endpoints.</p>
<p>The data used to analyze instance A spans from 2009-06-08 at 7:16:38 to 2009-08-18 at 12:28:37.</p>
<p>The data used to analyze instance B spans from 2009-05-20 at 1:46:31 to 2009-08-18 at 12:40:28.</p>
<h3>structWSF Instance A</h3>
<p>Instance A has only one dataset, with about 1000 instance records in it. As we can see below, the average time of a query to that instance, across all web service endpoints, is about 218 milliseconds.</p>
<table border="0">
<tbody>
<tr style="border: 1px solid">
<td style="border: 1px solid"><strong><span class="rescolname">Number of queries</span></strong><br />
<span> </span></td>
<td style="border: 1px solid"><strong><span class="rescolname">Average time for each query in seconds</span></strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">27956</td>
<td style="border: 1px solid">0.218252857656909</td>
</tr>
</tbody>
</table>
<p>The table below gives the total number of queries sent to each web service endpoint, along with the average query time for each web service.</p>
<table class="listing" border="0">
<tbody>
<tr>
<td class="restitle" colspan="5"></td>
</tr>
<tr>
<td style="border: 1px solid"><strong><span class="rescolname">Web Service</span></strong></td>
<td style="border: 1px solid"><strong>Number of queries</strong></td>
<td style="border: 1px solid"><strong>Average time for each query in seconds</strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_create</td>
<td style="border: 1px solid">265</td>
<td style="border: 1px solid">0.126993534699919</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">48</td>
<td style="border: 1px solid">0.128808428843714</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_update</td>
<td style="border: 1px solid">17</td>
<td style="border: 1px solid">0.140141641392576</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">11780</td>
<td style="border: 1px solid">0.144073766884864</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_registrar_access</td>
<td style="border: 1px solid">883</td>
<td style="border: 1px solid">0.145781793788779</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">49</td>
<td style="border: 1px solid">0.149710825511323</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">1970</td>
<td style="border: 1px solid">0.159979685066925</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">1397</td>
<td style="border: 1px solid">0.180938945980523</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">8949</td>
<td style="border: 1px solid">0.199636802392004</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">638</td>
<td style="border: 1px solid">0.241032384406063</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">263</td>
<td style="border: 1px solid">0.420157149717388</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_delete</td>
<td style="border: 1px solid">3</td>
<td style="border: 1px solid">0.637878338496</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">792</td>
<td style="border: 1px solid">0.661979901670313</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">715</td>
<td style="border: 1px solid">1.123084135322358</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">187</td>
<td style="border: 1px solid">1.486844727060763</td>
</tr>
<tr>
<td class="resfooter" colspan="5"></td>
</tr>
</tbody>
</table>
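<p>As a quick sanity check, the overall average reported for Instance A should be the average of the per-endpoint times, weighted by the number of queries. The sketch below redoes that arithmetic with the figures copied from the table above; the variable names are mine and are only there for illustration.</p>

```python
# (number of queries, average seconds per query) for each endpoint of
# Instance A, copied from the table above.
per_service = {
    "dataset_create": (265, 0.126993534699919),
    "converter/tsv": (48, 0.128808428843714),
    "dataset_update": (17, 0.140141641392576),
    "dataset_read": (11780, 0.144073766884864),
    "auth_registrar_access": (883, 0.145781793788779),
    "converter/bibtex": (49, 0.149710825511323),
    "auth_lister": (1970, 0.159979685066925),
    "search": (1397, 0.180938945980523),
    "browse": (8949, 0.199636802392004),
    "crud_read": (638, 0.241032384406063),
    "dataset_delete": (263, 0.420157149717388),
    "crud_delete": (3, 0.637878338496),
    "converter/irv": (792, 0.661979901670313),
    "sparql": (715, 1.123084135322358),
    "crud_create": (187, 1.486844727060763),
}

# Total number of queries and total time spent, summed over all endpoints.
total_queries = sum(count for count, _ in per_service.values())
total_seconds = sum(count * avg for count, avg in per_service.values())

# Count-weighted mean; compare it with the overall figure reported earlier.
overall_average = total_seconds / total_queries

print(total_queries)    # 27956
print(overall_average)
```

<p>The weighted mean lands on the overall figure of about 0.218 seconds, which shows how much the fast but very frequent dataset_read and browse queries pull the average down.</p>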
<p>This table gives the number of queries per HTTP response status code returned by each endpoint. This kind of metric is useful for debugging potential issues.</p>
<table class="listing" border="0">
<tbody>
<tr>
<td class="restitle" colspan="5"></td>
</tr>
<tr>
<td style="border: 1px solid"><strong><span>Web Service</span></strong></td>
<td style="border: 1px solid"><strong>Number of queries</strong></td>
<td style="border: 1px solid"><strong><span>HTTP Response Status</span></strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">1968</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_registrar_access</td>
<td style="border: 1px solid">883</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">8949</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">45</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">406</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">740</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">51</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">1</td>
<td style="border: 1px solid">406</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">43</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">3</td>
<td style="border: 1px solid">406</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">66</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">116</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">5</td>
<td style="border: 1px solid">500</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_delete</td>
<td style="border: 1px solid">3</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">480</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">158</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_create</td>
<td style="border: 1px solid">265</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">261</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">500</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">11767</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">9</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">4</td>
<td style="border: 1px solid">500</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_update</td>
<td style="border: 1px solid">17</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">1393</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">4</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">693</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">19</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">3</td>
<td style="border: 1px solid">406</td>
</tr>
<tr>
<td class="resfooter" colspan="5"></td>
</tr>
</tbody>
</table>
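<p>One way to use these numbers for debugging is to compute, for each endpoint, the share of queries that did not return a 200 status. Here is a minimal sketch using a few rows copied from the table above; the function name error_share is hypothetical.</p>

```python
from collections import defaultdict

# (endpoint, number of queries, HTTP status) rows copied from the table above;
# only three endpoints are included to keep the sketch short.
rows = [
    ("crud_create", 66, 200),
    ("crud_create", 116, 400),
    ("crud_create", 5, 500),
    ("crud_read", 480, 200),
    ("crud_read", 158, 400),
    ("dataset_read", 11767, 200),
    ("dataset_read", 9, 400),
    ("dataset_read", 4, 500),
]


def error_share(rows):
    """Return, per endpoint, the fraction of queries whose status is not 200."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, count, status in rows:
        totals[endpoint] += count
        if status != 200:
            errors[endpoint] += count
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


for endpoint, share in sorted(error_share(rows).items()):
    print(f"{endpoint}: {share:.1%}")
```

<p>With these rows, crud_create stands out: about 65% of its queries (121 out of 187) returned something other than 200, which is the kind of signal worth investigating.</p>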
<h3>structWSF Instance B</h3>
<p>Instance B has 25 datasets with about 2 312 000 instance records. As the table below shows, the average time of a query to that instance, across all web service endpoints, is about 550 milliseconds.</p>
<p>Why does the average time per query more than double with the size of that instance? That is what we will check.</p>
<table class="listing" border="0">
<tbody>
<tr>
<td class="restitle" colspan="5"></td>
</tr>
<tr>
<td style="border: 1px solid"><strong>Number of queries</strong></td>
<td style="border: 1px solid"><strong>Average time for each query in seconds</strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">37575</td>
<td style="border: 1px solid">0.556303637714566</td>
</tr>
<tr>
<td class="resfooter" colspan="5"></td>
</tr>
</tbody>
</table>
<p>The table below gives the total number of queries sent to each web service endpoint, along with the average time per query for that endpoint. What we can notice is that the time it takes to create, delete and update records in the database management systems is related to the size of the dataset. So, what happened, and is there anything we can do?</p>
<p>Most of the queries used for this analysis were sent to structWSF v1.0a1 and v1.0a2. However, something that has a major impact on these results changed in v1.0a3, which was released last week. The big problem with these numbers is Solr&#8217;s commit time. In v1.0a1 and v1.0a2, a Solr commit was issued each time something was updated in the index, and with an index of that size a commit could sometimes take minutes. Since v1.0a3, we leave that choice to the system administrator, who can either issue a commit each time something changes in the index, or set up Solr&#8217;s AutoCommit setting properly. That means that we increased the performance of these CUD endpoints by about 95%.</p>
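<p>To get a feel for why committing on every update is costly, here is a toy model that only counts commits. It is not how Solr or structWSF is implemented, and commit_every is a hypothetical batching threshold of my own, not the name of a Solr setting.</p>

```python
def count_commits(num_updates, commit_every):
    """Count the commits triggered by `num_updates` index updates when one
    commit is issued after every `commit_every` updates, plus a final commit
    for any leftover updates."""
    full_batches, leftover = divmod(num_updates, commit_every)
    return full_batches + (1 if leftover else 0)


updates = 75  # e.g. the 75 crud_create queries logged on Instance B

print(count_commits(updates, 1))   # one commit per update: 75
print(count_commits(updates, 50))  # batched commits: 2
```

<p>If each commit takes seconds or more on a large index, going from one commit per update to a handful of batched commits removes most of the time spent by the create, update and delete endpoints.</p>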
<p>For the SPARQL endpoint, the reason is that it is almost exclusively used to export data from a structWSF instance. This means that each query returns a big dump of RDF triples, which explains the average time per query of 2.1 seconds.</p>
<table class="listing" border="0">
<tbody>
<tr>
<td class="restitle" colspan="5"></td>
</tr>
<tr>
<td style="border: 1px solid"><strong><span>Web Service</span></strong></td>
<td style="border: 1px solid"><strong>Number of queries</strong></td>
<td style="border: 1px solid"><strong>Average time for each query in seconds</strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_create</td>
<td style="border: 1px solid">173</td>
<td style="border: 1px solid">0.09835156953404</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">auth_registrar_access</td>
<td style="border: 1px solid">1135</td>
<td style="border: 1px solid">0.114255581658327</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_update</td>
<td style="border: 1px solid">121</td>
<td style="border: 1px solid">0.119028852005636</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">12683</td>
<td style="border: 1px solid">0.159165935205064</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">8546</td>
<td style="border: 1px solid">0.23457546435556</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">109</td>
<td style="border: 1px solid">0.405608450600873</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">2315</td>
<td style="border: 1px solid">0.471687612780759</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">2313</td>
<td style="border: 1px solid">0.533951056245796</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">9103</td>
<td style="border: 1px solid">0.758227908033767</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">8</td>
<td style="border: 1px solid">0.863690733909698</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">650</td>
<td style="border: 1px solid">2.115058046487879</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">166</td>
<td style="border: 1px solid">2.681712512510398</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_update</td>
<td style="border: 1px solid">13</td>
<td style="border: 1px solid">4.649851157114154</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">75</td>
<td style="border: 1px solid">11.306954870223277</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">140</td>
<td style="border: 1px solid">27.511527856750207</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_delete</td>
<td style="border: 1px solid">25</td>
<td style="border: 1px solid">34.33350466727492</td>
</tr>
<tr>
<td class="resfooter" colspan="5"></td>
</tr>
</tbody>
</table>
<p>This table gives the number of queries per HTTP response status code returned by each endpoint.</p>
<table class="listing" border="0">
<tbody>
<tr>
<td class="restitle" colspan="5"></td>
</tr>
<tr>
<td style="border: 1px solid"><strong><span>Web Service</span></strong></td>
<td style="border: 1px solid"><strong>Number of queries</strong></td>
<td style="border: 1px solid"><strong><span class="rescolname">HTTP Response Status</span></strong></td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">2275</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">11</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">406</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">auth_lister</td>
<td style="border: 1px solid">27</td>
<td style="border: 1px solid">500</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">auth_registrar_access</td>
<td style="border: 1px solid">1110</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">auth_registrar_access</td>
<td style="border: 1px solid">25</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">9084</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">18</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">browse</td>
<td style="border: 1px solid">1</td>
<td style="border: 1px solid">406</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">108</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/bibtex</td>
<td style="border: 1px solid">1</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">154</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">converter/irv</td>
<td style="border: 1px solid">12</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">converter/tsv</td>
<td style="border: 1px solid">8</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">41</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">33</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_create</td>
<td style="border: 1px solid">1</td>
<td style="border: 1px solid">500</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_delete</td>
<td style="border: 1px solid">24</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_delete</td>
<td style="border: 1px solid">1</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">8268</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">273</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_read</td>
<td style="border: 1px solid">5</td>
<td style="border: 1px solid">406</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">crud_update</td>
<td style="border: 1px solid">4</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">crud_update</td>
<td style="border: 1px solid">9</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_create</td>
<td style="border: 1px solid">171</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_create</td>
<td style="border: 1px solid">2</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">79</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_delete</td>
<td style="border: 1px solid">61</td>
<td style="border: 1px solid">500</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">12647</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">11</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_read</td>
<td style="border: 1px solid">25</td>
<td style="border: 1px solid">500</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">dataset_update</td>
<td style="border: 1px solid">113</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">dataset_update</td>
<td style="border: 1px solid">8</td>
<td style="border: 1px solid">500</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">2286</td>
<td style="border: 1px solid">200</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">24</td>
<td style="border: 1px solid">400</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">search</td>
<td style="border: 1px solid">3</td>
<td style="border: 1px solid">406</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">618</td>
<td style="border: 1px solid">200</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">22</td>
<td style="border: 1px solid">400</td>
</tr>
<tr style="border: 1px solid">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">6</td>
<td style="border: 1px solid">406</td>
</tr>
<tr class="resrowodd">
<td style="border: 1px solid">sparql</td>
<td style="border: 1px solid">4</td>
<td style="border: 1px solid">500</td>
</tr>
<tr>
<td class="resfooter" colspan="5"></td>
</tr>
</tbody>
</table>
<h3>Generating the Stats</h3>
<p>Here is the list of SQL queries used to create these stat tables. You can run them locally on your structWSF instance to generate the same kind of statistics.</p>
<p>Timespan of the queries</p>
<blockquote><p>select min(request_datetime) as startdate, max(request_datetime) as enddate from SD.WSF.ws_queries_log;</p></blockquote>
<p>Get the total number of queries and the average processing time per query sent to the system</p>
<blockquote><p>select count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log;</p></blockquote>
<p>Get the average query time for each web service of a structWSF instance.</p>
<blockquote><p>select requested_web_service, count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log GROUP BY requested_web_service ORDER BY average_query_time ASC;</p></blockquote>
<p>Status code counts per web service endpoint</p>
<blockquote><p>select requested_web_service, count(request_http_response_status) as nb_queries, request_http_response_status from SD.WSF.ws_queries_log GROUP BY requested_web_service, request_http_response_status ORDER BY requested_web_service, request_http_response_status;</p></blockquote>
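<p>If you want to try the per-endpoint query without touching a live instance, here is a minimal sketch that runs it against an in-memory SQLite table. This is only a stand-in: the real log lives in SD.WSF.ws_queries_log, the table here keeps just three of its columns, and the sample rows are made up.</p>

```python
import sqlite3

# In-memory stand-in for the query log; column names follow the queries above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ws_queries_log ("
    "requested_web_service TEXT, "
    "request_processing_time REAL, "
    "request_http_response_status INTEGER)"
)

sample_rows = [  # made-up rows, for illustration only
    ("search", 0.20, 200),
    ("search", 0.30, 200),
    ("sparql", 1.00, 200),
    ("sparql", 1.50, 400),
]
conn.executemany("INSERT INTO ws_queries_log VALUES (?, ?, ?)", sample_rows)

# Same shape as the per-endpoint query above: count and average per service,
# fastest endpoint first.
per_service = conn.execute(
    "SELECT requested_web_service, "
    "COUNT(request_processing_time) AS nb_queries, "
    "AVG(request_processing_time) AS average_query_time "
    "FROM ws_queries_log "
    "GROUP BY requested_web_service "
    "ORDER BY average_query_time ASC"
).fetchall()

for service, nb_queries, average_query_time in per_service:
    print(service, nb_queries, round(average_query_time, 3))
```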
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/08/18/structwsf-early-querying-metrics/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/08/12/construct-a-skin-for-structwsf/">
	<title>conStruct: a skin for structWSF</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/wjwG1O-TPPU/</link>
	 <dc:date>2009-08-12T21:10:51Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[conStruct]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>As I said in my previous blog post, a conStruct instance is nothing more than a skin for one or multiple structWSF instances. conStruct is a user of a structWSF network.

But... what does that mean?

That means that each conStruct tool communicates with one or multiple structWSF instances. Each feature of ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=conStruct: a skin for structWSF&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=conStruct&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-08-12&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/08/12/construct-a-skin-for-structwsf/&amp;rft.language=English"></span>
<p>As I said <a href="http://fgiasson.com/blog/index.php/2009/08/10/re-introduction/">in my previous blog post</a>, a <a href="http://constructscs.com">conStruct</a> instance is nothing more than a skin for one or multiple <a href="http://openstructs.org/structwsf/">structWSF</a> instances. conStruct is a <em>user</em> of a structWSF network.</p>
<p>But&#8230; what does that mean?</p>
<p>That means that each conStruct tool communicates with one or multiple structWSF instances. Each feature of conStruct comes from structWSF. The only thing conStruct does is present information to users and give them some tools to manipulate the data.</p>
<h3>A network of structWSF instances</h3>
<p><a href="http://openstructs.org/structwsf/individual-ws-documentation">A structWSF instance is a set of web service endpoints</a>. Each endpoint gets registered in a network. Each query sent to any of the web service endpoints of the network gets authenticated (and possibly rejected) by the network.</p>
<p>All structWSF instances share the same basic web service endpoints. However, some specialized structWSF instances can add new functionality to the framework by developing new endpoints that do special things. Others can un-register services that have nothing to do with the mission of the instance, etc.</p>
<p>Not all structWSF instances are the same, but all of them share the same interface.</p>
<p>Individual people or organizations can choose to create structWSF nodes. The purposes can be quite different. Some organizations could choose to create structWSF nodes for internal purposes only: to help their departments share different kinds of data, for example. Some people may want to set up a structWSF node where they can archive and share all data specific to their hobbies. Whatever the use case, they want a platform to ingest, manage, interact with and publish data, publicly or privately.</p>
<p style="text-align: center;"><a href="http://fgiasson.com/blog/wp-content/uploads/2009/08/structwsf_networks.png"><img class="alignnone size-medium wp-image-947 aligncenter" title="structwsf_networks" src="http://fgiasson.com/blog/wp-content/uploads/2009/08/structwsf_networks-300x158.png" alt="" width="300" height="158" /></a></p>
<p>In the schema above, we can see that different structWSF instances have been created and are maintained by different organizations, for different purposes. Some of the clients communicate with these structWSF instances as public users of the datasets published on the node(s), while other users access datasets that only they have access to.</p>
<p>As you can see, some users communicate with multiple structWSF instances. This means that these users care about data from different datasets, maintained by different organizations. Why and what for? We don&#8217;t know. It can be for any reason. It can be a web portal that aggregates all the information about a specific domain that is shared amongst multiple nodes, or it can be because the user gets information from his client&#8217;s networks to get things done.</p>
<p>What is important to keep in mind with the schema above is that any kind of person, organization or system can leverage the <em>structured</em> data they have access to. That data is hosted by different organizations that make available different datasets and different web service endpoints. Some organizations may even create a web service endpoint that works with their dataset and exposes some special algorithms they use to disambiguate or tag entities, etc.</p>
<h3>A network in action</h3>
<p>You are probably telling yourself: well, the grand vision is good&#8230; but where is the meat around the bone?</p>
<p>Let&#8217;s take a look at the <a href="http://constructscs.com/demos">conStructSCS sandbox demo</a>. You have <a href="http://constructscs.com/conStruct/dataset/">two datasets in there: (1) the Sweet Tools and (2) RePEc</a>. There is one thing that you probably did not notice: the two datasets live on two different structWSF instances (each structWSF instance is hosted on a different web server). This means that if you perform a <a href="http://constructscs.com/conStruct/search/?query=rdf&amp;type=all&amp;dataset=all">search</a> or a <a href="http://constructscs.com/conStruct/browse/">browse</a> query, the results you get in the conStruct user interface come from two totally different servers, with different data maintainers, hosted by different organizations, etc. Still, all results are displayed in the same user interface, which is the conStructSCS demo sandbox.</p>
<h3>Under the curtain</h3>
<p>Let&#8217;s take a look at what is happening. First, run this <a href="http://constructscs.com/conStruct/search/?query=rdf&amp;type=all&amp;dataset=all&amp;wsf_debug=2">search query for &#8220;rdf&#8221;</a>. Do you see what appears in the yellow box? This is a list of the queries exchanged between conStruct and two structWSF instances. Do you want more? Try this other <a href="http://constructscs.com/conStruct/search/?query=rdf&amp;type=all&amp;dataset=all&amp;wsf_debug=1">search query for &#8220;rdf&#8221;</a>. Now you also have access to the body of the messages.</p>
<p>For this demo sandbox, we enabled the &#8220;wsf_debug&#8221; parameter so that users of the sandbox can see how a conStruct node interacts with structWSF instances. If the value of this URL parameter is &#8220;1&#8221;, then the header and the body of each query are displayed to the users. If the value is &#8220;2&#8221;, only the header is displayed.</p>
<p>This means that you can append the &#8220;&amp;wsf_debug=1&#8221; parameter to any URL of the demo sandbox and you will be able to see the messages exchanged between the systems. Why? Because <strong>all</strong> conStruct tools communicate with one or multiple web service endpoint(s) and one or multiple structWSF instances.</p>
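<p>If you want to add the parameter from a script rather than by hand, here is a small sketch that uses only the Python standard library. The function name with_wsf_debug is hypothetical; the only thing taken from the sandbox is the wsf_debug parameter and its values 1 and 2.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_wsf_debug(url, level=1):
    """Return `url` with the wsf_debug query parameter set to `level`."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any earlier wsf_debug value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "wsf_debug"
    ]
    params.append(("wsf_debug", str(level)))
    return urlunsplit(parts._replace(query=urlencode(params)))


print(with_wsf_debug("http://constructscs.com/conStruct/browse/"))
```

<p>Parsing the query string first, instead of gluing text to the end of the URL, takes care of choosing between the question mark and the ampersand and avoids ending up with the parameter twice.</p>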
<p>Now, let&#8217;s take a look at the output of the search query above.</p>
<ul type="disc">
<li>Web service query: [[url: <strong>http://localhost/ws/search/</strong>] [method: post] [mime: text/xml] [parameters: ] [execution time: <strong>0.279745101929</strong>]] (status: 200) OK &#8211; .</li>
<li>Web service query: [[url: <strong>http://bknetwork.org/ws/search/</strong>] [method: post] [mime: text/xml] [parameters: query=rdf&amp;types=all&amp;datasets=http%3A%2F%2Fbknetwork.org%2Fwsf%2Fdatasets%2F283%2F%3Bhttp%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F160%2F&amp;items=10&amp;page=0&amp;inference=on&amp;include_aggregates=true&amp;registered_ip=self%3A%3A0] [execution time: <strong>0.289397001266</strong>]] (status: 200) OK &#8211; .</li>
<li>Web service query: [[url: <strong>http://localhost/ws/dataset/read/</strong>] [method: get] [mime: text/xml] [parameters: uri=all&amp;registered_ip=self%3A%3A0] [execution time: <strong>0.123399972916</strong>]] (status: 200) OK &#8211; .</li>
<li>Web service query: [[url: <strong>/ws/dataset/read/</strong>] [method: get] [mime: text/xml] [parameters: uri=all&amp;registered_ip=self%3A%3A0] [execution time: <strong>0.18315911293</strong>]] (status: 200) OK &#8211; .</li>
</ul>
<p>Each bullet is a query sent to a specific structWSF instance. For each query, you have this information:</p>
<ul type="disc">
<li>URL of the web service endpoint where the query has been sent</li>
<li>HTTP method used to send the query</li>
<li>MIME type (Accept HTTP header parameters) requested</li>
<li>Parameters of the query</li>
<li>Time it took to execute the query (including network latency &amp; query processing)</li>
<li>Status of the query from the web service endpoint</li>
</ul>
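<p>If you copy these debug entries out of the yellow box, they are regular enough to be split into the fields listed above. Here is a rough sketch; the pattern is my own reading of the layout shown in this post, not an official format, and the parameters in the sample entry are shortened.</p>

```python
import re

# Pattern for one debug entry, following the layout of the entries above.
ENTRY = re.compile(
    r"\[\[url: (.*?)\] \[method: (.*?)\] \[mime: (.*?)\] "
    r"\[parameters: (.*?)\] \[execution time: (.*?)\]\] \(status: (\d+)\)"
)


def parse_entry(line):
    """Split one debug entry into a dict, or return None if it does not match."""
    match = ENTRY.search(line)
    if match is None:
        return None
    url, method, mime, parameters, seconds, status = match.groups()
    return {
        "url": url,
        "method": method,
        "mime": mime,
        "parameters": parameters,
        "seconds": float(seconds),
        "status": int(status),
    }


line = (
    "Web service query: [[url: http://localhost/ws/dataset/read/] "
    "[method: get] [mime: text/xml] [parameters: uri=all] "
    "[execution time: 0.123399972916]] (status: 200) OK"
)
print(parse_entry(line))
```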
<p>Since this conStruct instance is linked to two different structWSF instances, the search tool sends a search query to two different search web service endpoints. Additionally, it queries these structWSF instances to get the description of the searched datasets (to display the proper name of the datasets in the user interface).</p>
<p>Each query is validated by the structWSF instances to make sure that they are legitimate queries. If they are, then results are returned. Once these queries are sent and answers received, the structSearch tool can then generate the page and display it to the user.</p>
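<p>The fan-out described above can be sketched in a few lines. This is not the actual code of the search tool: federated_search and send_query are hypothetical names, and the HTTP calls are replaced by canned, made-up results so that the sketch runs without a network.</p>

```python
def federated_search(query, endpoints, send_query):
    """Send `query` to every search endpoint and merge the results.

    `send_query(endpoint, query)` is a caller-supplied function that returns
    a list of result records for one endpoint.
    """
    merged = []
    for endpoint in endpoints:
        for record in send_query(endpoint, query):
            # Remember which instance each record came from.
            merged.append({"source": endpoint, "record": record})
    return merged


# Stand-in for real HTTP calls: canned results per endpoint (made up).
canned = {
    "http://localhost/ws/search/": ["record A"],
    "http://bknetwork.org/ws/search/": ["record B", "record C"],
}


def fake_send(endpoint, query):
    return canned[endpoint]


results = federated_search("rdf", list(canned), fake_send)
print(len(results))  # 3
```

<p>Keeping the source endpoint next to each record is what lets a single user interface show results from several instances while still knowing where each one lives.</p>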
<p>Do you want more? Here is a list of queries sent by different conStruct tools to different web services endpoints:</p>
<ul type="disc">
<li><a href="http://constructscs.com/conStruct/browse/?wsf_debug=2">Browse tool: listing datasets to browse</a></li>
<li><a href="http://constructscs.com/conStruct/browse/?browse=true&amp;attribute=all&amp;type=all&amp;dataset=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&amp;page=0&amp;wsf_debug=1">Browse tool: browsing a specific dataset</a></li>
<li><a href="http://constructscs.com/conStruct/dataset/?wsf_debug=2">Dataset tool</a></li>
<li><a href="http://constructscs.com/conStruct/view/?uri=http%3A%2F%2Fconstructscs.com%2FconStruct%2Fdatasets%2F122%2Fresource%2FCerebra_Server&amp;dataset=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&amp;wsf_debug=2">View page</a></li>
</ul>
<p><strong>(Note: this debug info tab has been added so that people can see what is happening under the hood. However, this information is only accessible to the registered conStruct instance and the administrator of that instance.)</strong></p>
<h3>Do it by yourself, from your desktop computer</h3>
<p>I said that people or organizations that create data on these structWSF instances are able to manage and manipulate that data from anywhere, not only from within conStruct. Let&#8217;s test this.</p>
<p>I changed the permissions on the Sweet Tools List dataset so that it is publicly available for reading. That way, anyone will be able to send <a href="http://curl.haxx.se/">cURL</a> queries against the dataset, to that structWSF instance.</p>
<p>Now, let&#8217;s try a couple of queries to different web service endpoints. Let&#8217;s start with a query for the keyword &#8220;rdf&#8221; on the Sweet Tools dataset:</p>
<p style="padding-left: 30px;"><em>curl -H "Accept: text/xml" "http://constructscs.com/ws/search/" -d "query=rdf&amp;types=all&amp;datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&amp;items=10&amp;inference=on"</em></p>
<p>What you will get for this query is a list of 10 instance records that match it. You don&#8217;t like the internal XML representation of the system? Then try the internal JSON representation by sending the same query with a JSON <em>Accept</em> header.</p>
<p>Maybe this is not good enough for you? Then let&#8217;s try RDF/XML:</p>
<p style="padding-left: 30px;"><em>curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/search/" -d "query=rdf&amp;types=all&amp;datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&amp;items=10&amp;inference=on"</em></p>
<p>I think you get the point, so I won&#8217;t continue.</p>
<p>Now, let&#8217;s send a query to get all the datasets accessible to you:</p>
<p style="padding-left: 30px;"><em>curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/auth/lister/" -d "mode=adataset"</em></p>
<p>If you can query all these things with cURL, this means that anything can query these services. Standalone software, as well as other online applications, can be developed to leverage these content nodes.</p>
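<p>To illustrate that any client can do the same thing, here is a minimal Python sketch that builds the search request shown above using only the standard library. The endpoint, dataset URI and parameters are the ones from the cURL examples; the function name <em>build_search_request</em> is hypothetical, and the request is only constructed, not sent, so nothing here depends on the service being online.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and dataset URI taken from the cURL examples in this post.
SEARCH_ENDPOINT = "http://constructscs.com/ws/search/"
DATASET = "http://constructscs.com/wsf/datasets/122/"


def build_search_request(keyword, mime="text/xml", items=10):
    """Build (but do not send) a POST request to the search endpoint.

    `mime` goes into the Accept header, exactly like `curl -H "Accept: ..."`.
    """
    params = {
        "query": keyword,
        "types": "all",
        "datasets": DATASET,
        "items": items,
        "inference": "on",
    }
    # urlencode percent-encodes the dataset URI the same way as the cURL body.
    body = urlencode(params).encode("ascii")
    return Request(SEARCH_ENDPOINT, data=body, headers={"Accept": mime})


if __name__ == "__main__":
    request = build_search_request("rdf", mime="application/rdf+xml")
    print(request.get_method(), request.full_url)
    print("Accept:", request.get_header("Accept"))
    print(request.data.decode("ascii"))
    # To actually send it: urllib.request.urlopen(request).read()
```

<p>Changing the <em>mime</em> argument is the only thing needed to ask for another representation of the same results.</p>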
<h3>Conclusion</h3>
<p>As you probably learned from this blog post, one of the powers of structWSF is that it creates networks of structured content nodes that can be accessed by anything, from anywhere, publicly or privately.</p>
<p>As you noticed, all this stuff is not only about integrating any kind of data, but also about publishing it in a flexible way.</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/08/12/construct-a-skin-for-structwsf/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/08/10/re-introduction/">
	<title>Re-Introduction</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/tC-EDhqAswg/</link>
	 <dc:date>2009-08-10T21:46:42Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[conStruct]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>I haven't been active on this blog for more than half a year now. I was telling myself that I was too busy coding to write anything meaningful to my readers. I did write a couple of things, but nothing of importance related to all the things I was working ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Re-Introduction&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=conStruct&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-08-10&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/08/10/re-introduction/&amp;rft.language=English"></span>
<p>I haven&#8217;t been active on this blog for more than half a year now. I was telling myself that I was too busy coding to write anything meaningful for my readers. I did publish announcements and such, but didn&#8217;t really take the time to write about the things I was working on. A lot has been done and published recently, but little has been said. So, let&#8217;s correct course: I want to share more about what I am currently working on, the concepts I am playing with, the systems I am releasing, etc. Let&#8217;s get back to writing about the things I really believe in, the things I put all my time, effort and energy into.</p>
<p>As you probably know, my company <a href="http://structureddynamics.com">Structured Dynamics</a> released two products: <a href="http://openstructs.org/structwsf/">structWSF</a> and <a href="http://constructscs.com">conStruct</a>. I spent the last six months developing these two products. However, what are they? Why did I spend all my time working on them? Why do they matter? Why do I think that they are valuable?</p>
<p>Let me outline what they are, what they do and what they are useful for. Then consider whether they could be of any value to you, your organization, your enterprise, etc.</p>
<h3>StructWSF</h3>
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/06/triple_120.png"><img class="alignleft size-full wp-image-941" title="triple_120" src="http://fgiasson.com/blog/wp-content/uploads/2009/06/triple_120.png" alt="" width="120" height="120" /></a><a href="http://openstructs.org/structwsf/">StructWSF</a> is a web services framework (WSF) that basically does four things: it ingests, manages, interacts with and publishes data. What kind of data? Any kind of data.</p>
<p><strong>Ingesting</strong>: the aim is to be able to ingest data from any data source (so data formatted using any language, or described using any vocabulary or schema). The framework has to be able to ingest any data that comes from any data source with a single conversion step.</p>
<p><strong>Managing</strong>: the aim is to be able to manage the data. Managing the data means being able to collectively (with permissions and authentication) manage the datasets available in a framework instance. It means being able to create, modify, delete or update data. It also means being able to browse and search the data, to make it publicly available or to restrict its access to a user or group of users, and to merge datasets together.</p>
<p><strong>Interacting</strong>: but there is another facet to data management. We don&#8217;t only want to be able to manage data in a locked system. What we want is to be able to manage our data from anywhere: from my browser, from my website, from other applications on my desktop, from my home, from my office. All functions of a structWSF instance are accessible as web service endpoints. This means that you can perform any action on your data from anywhere you want: from a conStruct node or from a local cURL query. I think this is how people and organizations want to manage the data they create and curate.</p>
<p><strong>Publishing</strong>: like ingesting, we want to be able to publish, to communicate the data we create to other people, other organizations or other entities. We want to do this in such a way that these external entities do not have to recreate or reinvent themselves. We want to be able to communicate data the way they understand it: using any format and any vocabulary/schema.</p>
<p>The mindset behind structWSF is the following: we can ingest any kind of data, we can manage that data in multiple ways, we can interact with that data from anywhere and we can publish this data back in any way. structWSF is frictionless in the sense of data communication between systems, users and entities.</p>
<h3>conStruct</h3>
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/06/construct_logo_120.png"><img class="alignright size-full wp-image-942" title="construct_logo_120" src="http://fgiasson.com/blog/wp-content/uploads/2009/06/construct_logo_120.png" alt="" width="120" height="120" /></a><a href="http://constructscs.com">conStruct</a> is just a skin over one, or multiple, structWSF instances. The conStruct software is an example of how a system can interact with a structWSF data provider. conStruct is a suite of generic tools that can be used to search, browse, visualize (template), import, export, create, delete and update data. All these tools interact with one or multiple structWSF functions by using their web service endpoints.</p>
<p>Since conStruct can interact with a single structWSF instance, it can also interact with multiple structWSF instances. That means that conStruct can be a user interface that communicates with multiple data providers (structWSF instances) and displays all the results, from all these providers, in a canonical user interface.</p>
<p>But as I said, conStruct is <em>one</em> skin over structWSF instances. We could think about the integration of structWSF into other CMS systems. We could even think about having different CMS systems integrating with the same structWSF instance(s) so that if one user updates, creates or deletes some data, the change appears in the other CMS systems as well.</p>
<h3>The Magic Twist</h3>
<p>However, all this is done with a twist: everything is structured. This means that everything that is in the system has a structure: it is described using some vocabularies (full-blown ontologies, or naive vocabularies). This enables all kinds of valuable functionality: inferencing capabilities in search and browse activities, filtering on types and attributes, and help with integrating different datasets from different systems and organizations.</p>
<p>This is the magic twist that makes this system different: everything in there is structured in such a way that everything can be ingested and published in any format; in such a way that basic inferencing or more complex reasoning is possible. It integrates data and lets users use it the way they want from where they are. The capabilities are there; use them if you need them.</p>
<h3>Next steps</h3>
<p>The next steps for me will be to describe the features of the system: how the data is managed, how permissions work, what granularity of permissions is available, etc. These will be more technical blog posts, but they will show you the full potential of the systems and concepts I have been talking about in this blog post.</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/08/10/re-introduction/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/07/02/release-of-structwsf-construct-and-the-community-web-site/">
	<title>Release of structWSF, conStruct and the Community Web Site</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/wYRHtzmPAo8/</link>
	 <dc:date>2009-07-02T19:59:47Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[conStruct]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>
The last few months have been challenging in terms of the amount of work to get done, in focusing on deliverables and in getting ready for the release of the conStruct and structWSF source code, documentation, tutorials, web sites and demos.
I am now really happy to be able to finally announce the ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Release of structWSF, conStruct and the Community Web Site&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=conStruct&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-07-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/07/02/release-of-structwsf-construct-and-the-community-web-site/&amp;rft.language=English"></span>
<p class="MsoNormal">The last few months have been challenging in terms of the amount of work to get done, in focusing on deliverables and in getting ready for the release of the <a href="http://constructscs.com">conStruct</a> and <a href="http://openstructs.org/structwsf">structWSF</a> source code, documentation, tutorials, web sites and demos.</p>
<p class="MsoNormal">I am now really happy to be able to finally announce the release of the source code of both projects along with a new <a href="http://community.openstructs.org/"><span>development community website</span></a><span> where users and developers can exchange ideas about these two new projects.</span></p>
<p class="MsoNormal">The biggest milestone of the last months is now behind us. However, this is just the beginning of everything!</p>
<p class="MsoNormal">I think that many things have been written about these two projects already. I don’t want to write any tutorial at this point. So the only thing I will do right now is to point you to the most relevant documentation, web sites, blog posts and demos for each project. The next step will be to write about specific use cases, features, etc.</p>
<h3>Community Web Site</h3>
<p class="MsoNormal">The <a href="http://community.openstructs.org">community Web site</a> is a place where developers and users of structWSF and conStruct can meet to talk about both projects, to report bugs and issues, to submit new enhancements, to find tips and tricks, etc.</p>
<p class="MsoNormal">I suggest you <a href="http://community.openstructs.org/user/register">create a new user profile on the community Web site</a> if you are interested in communicating with other members.</p>
<ul type="disc">
<li class="MsoNormal"><a href="http://community.openstructs.org/">Community Web site</a>
<ul type="circle">
<li class="MsoNormal"><a href="http://community.openstructs.org/forum">Discussion Forum</a></li>
<li class="MsoNormal"><a href="http://wiki.openstructs.org/wiki/Welcome">Wiki</a></li>
<li class="MsoNormal"><a href="http://community.openstructs.org/issues">Issues tracker</a></li>
<li class="MsoNormal"><a href="http://community.openstructs.org/source-code/code-repository">Core       source repositories</a></li>
<li class="MsoNormal"><a href="http://community.openstructs.org/source-code/documentation">Code       documentation</a></li>
</ul>
</li>
</ul>
<h3>structWSF</h3>
<p class="MsoNormal"><a href="http://openstructs.org/structwsf">structWSF</a> is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).</p>
<p class="MsoNormal">The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.</p>
<ul type="disc">
<li class="MsoNormal"><a name="OLE_LINK7"></a><a href="http://openstructs.org/structwsf"><span>Main Web site</span></a>
<ul type="circle">
<li class="MsoNormal"><a href="http://openstructs.org/downloads"><span>Download</span></a></li>
<li class="MsoNormal"><a href="http://openstructs.org/structwsf/architecture"><span>Architecture</span></a></li>
<li class="MsoNormal"><a href="http://openstructs.org/structwsf/individual-ws-documentation"><span>RESTful endpoints documentation</span></a></li>
<li class="MsoNormal"><a href="http://openstructs.org/doc/code/structwsf/index.html"><span>Source code documentation</span></a></li>
<li class="MsoNormal"><span><a name="OLE_LINK1"></a></span><a href="http://wiki.openstructs.org/wiki/Blog_Posts"><span><span>Interesting       blog posts</span></span></a></li>
<li class="MsoNormal"><a href="http://wiki.openstructs.org/wiki/StructWSF_Installation"><span>Installation manual (early draft)</span></a></li>
</ul>
</li>
</ul>
<h3>conStruct</h3>
<p class="MsoNormal"><a href="http://constructscs.com">conStruct</a> is a distro of the Drupal framework that aims to set a new standard in data integration and as a structured content system (SCS). With conStruct, you can let your data and its structure drive your applications. You can easily interoperate your diverse internal information with public content on the Web. And you can leverage a platform designed from the ground up for knowledge management and collaboration.</p>
<ul type="disc">
<li class="MsoNormal"><a name="OLE_LINK3"></a><a name="OLE_LINK4"></a><a href="http://constructscs.com/"><span><span>Main Web site</span></span></a>
<ul type="circle">
<li class="MsoNormal"><a href="http://constructscs.com/downloads"><span><span>Download</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/features/design-overview"><span><span>Design       overview</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/doc/code/construct/index.html"><span><span>Source       code documentation</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/features"><span><span>Current features</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/demos"><span><span>Online demos</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/documentation/instructions"><span><span>Tool instruction manuals</span></span></a></li>
<li class="MsoNormal"><a href="http://wiki.openstructs.org/wiki/Blog_Posts"><span><span>Interesting       blog posts</span></span></a></li>
<li class="MsoNormal"><a href="http://constructscs.com/doc/code/construct/index.html"><span><span>Installation       manual (early draft)</span></span></a></li>
</ul>
</li>
</ul>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/07/02/release-of-structwsf-construct-and-the-community-web-site/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/06/16/structwsf-and-construct-websites-unveiled/">
	<title>structWSF and conStruct websites unveiled</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/pqsL-8M49GQ/</link>
	 <dc:date>2009-06-16T20:30:39Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[conStruct]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>I am proud to announce the release of the websites of two of our products to come: structWSF and conStruct. Both products will be available in open source under the Apache 2 license. Mike just unveiled and demoed the two projects in his talk at SemTech 2009.

As we describe them on ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=structWSF and conStruct websites unveiled&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=conStruct&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-06-16&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/06/16/structwsf-and-construct-websites-unveiled/&amp;rft.language=English"></span>
<p>I am proud to announce the release of the websites of two of our products to come: <a href="http://openstructs.org">structWSF</a> and <a href="http://constructscs.com">conStruct</a>. Both products will be available in open source under the Apache 2 license. <a href="http://mkbergman.com">Mike</a> just unveiled and demoed the two projects in <a href="http://www.semantic-conference.com/session/1806/">his talk at SemTech 2009</a>.</p>
<p>As we describe them on <a href="http://structureddynamics.com/">Structured Dynamics</a>&#8217; website:</p>
<h2>structWSF</h2>
<p><a href="http://fgiasson.com/blog/wp-content/uploads/2009/06/triple_120.png"><img class="alignleft size-full wp-image-941" title="triple_120" src="http://fgiasson.com/blog/wp-content/uploads/2009/06/triple_120.png" alt="" width="120" height="120" /></a><a href="http://openstructs.org">structWSF </a> is a platform-independent Web services framework for accessing and exposing structured  RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).</p>
<p>The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import.</p>
<p>All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.</p>
<p>In initial release, structWSF has direct interfaces to the <a href="http://virtuoso.openlinksw.com/wiki/main/Main/">Virtuoso</a> RDF triple store (via ODBC, and later HTTP) and the <a href="http://lucene.apache.org/solr/">Solr</a> faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. Support for additional datastores and engines is planned. The design also allows other specialized systems to be included, such as analysis or advanced inference engines.</p>
<p>The framework is open source (Apache 2 license) and designed for extensibility. structWSF and its extensions and enhancements are distributed and documented on the OpenStructs Web site.</p>
<h2><a href="http://fgiasson.com/blog/wp-content/uploads/2009/06/construct_logo_120.png"><img class="alignleft size-full wp-image-942" title="construct_logo_120" src="http://fgiasson.com/blog/wp-content/uploads/2009/06/construct_logo_120.png" alt="" width="120" height="120" /></a>conStruct</h2>
<p><a href="http://constructscs.com">conStruct SCS</a> is a structured content system that extends the basic <a href="http://drupal.org/">Drupal</a> content management framework. conStruct  enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.</p>
<p>Users and groups can flexibly access and manage any or all datasets exposed by the system depending on roles and permissions. Report and presentation templates are easily defined, styled or modified based on the underlying datasets and structure. Collaboration networks can readily be established across multiple installations and non-Drupal endpoints. Powerful linked data integration can be included to embrace data anywhere on the Web.</p>
<p>Depending on roles and permissions, a given user may or may not see specific datasets or tools within the Drupal interface. Search and browse results are similarly sequestered depending on access rights.</p>
<p>conStruct provides Drupal-level CRUD (create &#8211; read &#8211; update &#8211; delete), data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF. It also provides a system for additional tools additions and expansions for this structured data. conStruct SCS is built on the platform-independent structWSF Web services framework.</p>
<p>Like Drupal and structWSF, conStruct is free and open source (GPL license). Versions of conStruct SCS that adapt it to other content management systems (CMS) are planned.</p>
<h2>Next</h2>
<p>The alpha version of the code with all the proper documentation will be released later this summer. Everybody will be able to contribute to the project by enhancing/developing the core code or by extending it with new modules and web services.  Stay tuned!</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/06/16/structwsf-and-construct-websites-unveiled/</feedburner:origLink></item>
<item rdf:about="http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/">
	<title>RDF Aggregates and Full Text Search on Steroids with Solr</title>
	<link>http://feedproxy.google.com/~r/FredOnSomething/~3/muBZgpwfkp8/</link>
	 <dc:date>2009-04-29T20:46:07Z</dc:date>
	<dc:creator>Fred</dc:creator>
			<dc:subject><![CDATA[Semantic Web]]></dc:subject>
		<dc:subject><![CDATA[Structured Dynamics]]></dc:subject>
		<dc:subject><![CDATA[conStruct]]></dc:subject>
		<dc:subject><![CDATA[structWSF]]></dc:subject>
	<description>Preamble
As I explained in my latest blog post, I am now starting to talk about a couple of things I have been working on in the last few months that will lead to a release, by Structured Dynamics, in the coming months. This blog post is the first step into ...</description>
	<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=RDF Aggregates and Full Text Search on Steroids with Solr&amp;rft.aulast=Giasson&amp;rft.aufirst=Frédérick&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Dynamics&amp;rft.subject=conStruct&amp;rft.subject=structWSF&amp;rft.source=Frederick Giasson&#8217;s Weblog&amp;rft.date=2009-04-29&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/&amp;rft.language=English"></span>
<h3><strong>Preamble</strong></h3>
<p>As I explained in my latest blog post, I am now starting to talk about a couple of things I have been working on in the last few months that will lead to a release, by <a href="http://structureddynamics.com">Structured Dynamics</a>, in the coming months. This blog post is the first step down that path. Enjoy!</p>
<h3><strong>Introduction</strong></h3>
<p>I have been working with RDF, SPARQL and triple stores for years now. I have created many prototypes and online services using these technologies. Being able to describe everything with RDF, and to index everything in a triple store that you can easily query the way you want using SPARQL, is priceless. Using RDF saves development and maintenance costs because of the flexibility of the store (triple store), the query language (SPARQL), and the associated schemas (ontologies).</p>
<p>However, even if this set of technologies can do everything, quickly and efficiently, it is not necessarily optimal for all tasks you have to do. As we will see in this blog post, we use RDF for describing, integrating and managing any kind of data (structured or unstructured) that exists out there. RDF + Ontologies are what we use as the canonical expression of any kind of data. It is the triple store that we use to aggregate, index and manage that data, from one or multiple data sources. It is the same triple store that we use to feed any other system that can be used in our architecture. The triple store is the data orchestrator in any such architecture.</p>
<p>In this blog post I will show you how this orchestrator can be used to create <a href="http://lucene.apache.org/solr/">Solr</a> indexes that are used in the architecture to perform three functions that Solr has been built to perform optimally: full-text search, aggregates and filtering. So, while a triple store can perform these functions, it is not optimal for what we have to do.</p>
<h3><strong>Overview</strong></h3>
<p>The idea is to use the RDF data model and a triple store to populate the Solr schema index. We leverage the powerful and flexible data representation framework (RDF), in conjunction with the piece of software that lets you do whatever you want with that data (Virtuoso), to feed a carefully tailored Solr schema index to optimally perform three things: full-text search, aggregates and filtering. Also, we want to leverage the ontologies used to describe this data to be able to infer things vis-à-vis these indexed resources in Solr. This enables us to use inference on full-text search, aggregates and filtering, in Solr! This is quite important since you will be able to perform full-text searches, filtered by types that are inferred!</p>
<p>Some people will tell me that they can do this with a traditional relational database management system: yes. However, RDF + SPARQL + Triple Store is so powerful for integrating any kind of data, from any data source, and so flexible, that it saves precious development and maintenance resources, and therefore money.</p>
<h3>Solr</h3>
<p>What we want to do is to create some kind of &#8220;RDF&#8221; Solr index. We want to be able to perform full-text searches on RDF literals; we want to be able to aggregate RDF resources by the properties that describe them, and their types; and finally we want to be able to do all the searches, aggregation and filtering using inference.</p>
<p>So the first step is to create the proper Solr schema that will let you do all these wonderful things.</p>
<p><a href="http://code.google.com/p/structwsf/source/browse/trunk/framework/solr_schema.xml">The current Solr index schema can be downloaded here.</a> <em>(View source if simply clicking with your browser.)</em></p>
<p>Now, let&#8217;s discuss this schema.</p>
<h3>Solr Index Schema</h3>
<p>A Solr schema is basically composed of two things: fields and field types. For this schema, we only need two types of fields: string and text. If you want more information about these two types, I would refer you to the <a href="http://lucene.apache.org/solr/">Solr documentation</a> for a complete explanation of how they work. For now, just consider them as strings and texts.</p>
<p>What interests us is the list of defined fields of this schema (again, see <a href="http://code.google.com/p/structwsf/source/browse/trunk/framework/solr_schema.xml">download</a>):</p>
<ul type="disc">
<li><em>uri</em> [1] &#8211; Unique resource identifier of the record</li>
<li><em>type</em> [1-N] &#8211; Type of the record</li>
<li><em>inferred_type</em> [0-N] &#8211; Inferred type of the record</li>
<li><em>property</em> [0-N] &#8211; Property identifier used to describe the resource and that has a literal as object</li>
<li><em>text</em> [0-N] (same number as <em>property</em>) &#8211; Text of the literal of the property</li>
<li><em>object_property</em> [0-N] &#8211; Property identifier used to describe the resource where the object is a reference to another resource and that this other resource can be described by a literal</li>
<li><em>object_label</em> [0-N] (same number as <em>object_property</em>) &#8211; Text used to refer to the resource referenced by the <em>object_property</em></li>
</ul>
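<p>To make these fields more concrete, here is a minimal Python sketch that serializes one record as a Solr XML &#8220;add&#8221; message, repeating a field element once per value for the multi-valued fields. The record itself (the example.org URIs and the labels) is made up for illustration, and the helper name <em>build_add_document</em> is hypothetical; only the field names come from the schema discussed here.</p>

```python
import xml.etree.ElementTree as ET


def build_add_document(record):
    """Serialize a dict of field name to list of values as a Solr add message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, values in record.items():
        # A multi-valued field is expressed by repeating the field element.
        for value in values:
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical record: every URI and label below is invented for this example.
article = {
    "uri": ["http://example.org/article/1"],
    "type": ["http://example.org/ontology#Article"],
    "inferred_type": ["http://example.org/ontology#Document"],
    "property": ["http://example.org/ontology#title"],
    "text": ["Article 1"],
    "object_property": ["http://example.org/ontology#author"],
    "object_label": ["Bob Carron"],
}

print(build_add_document(article))
```

<p>The printed message is what would be posted to the Solr update handler. The <em>property</em> and <em>text</em> values are kept in the same order so that each text can be matched with the property it came from.</p>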
<h3>Full Text Search</h3>
<p>An RDF document is a set of triples describing one or more resources. Saying that you are doing full-text searches on RDF documents is certainly not the same thing as saying that you are doing full-text searches on traditional text documents. When you describe a resource, you rarely have more than a couple of strings, with a couple of words each. It is generally the name of the entity, or a label that refers to it. You will have different numbers, and sometimes some description (a short biography, a definition or a summary, for example). However, unless you index an entire text document, the &#8220;textual abundance&#8221; is quite poor compared to an indexed corpus of documents.</p>
<p>In any case, this doesn&#8217;t mean that there are no advantages in doing full-text searches on RDF documents (so, on RDF resource descriptions). But, if we are going to do so, let&#8217;s do so completely, and in a way that meets users&#8217; expectations for full-text document search.  By applying this mindset, we can apply some cool new tricks!</p>
<p>Intuitively the first implementation of a full-text search index on RDF documents would simply make a key-value pair assignment between a resource URI and its related literals. So, when you perform a full-text search for &#8220;Bob&#8221;, you get a reference on all the resources that have &#8220;Bob&#8221; in one of the literals that describe these resources.</p>
<p>This is good, but it is not enough, because it breaks the most basic behavior that any user of a full-text search engine expects.</p>
<p>Let&#8217;s say that I know the author of many articles is named &#8220;Bob Carron&#8221;. I have no idea what the titles of the articles he wrote are, so I want to search for them. With the system described above, if I do a search for &#8220;Bob Carron&#8221;, I will most likely get back as a result the reference to &#8220;Bob Carron&#8221;, the author person. This is good, but this is not enough.</p>
<p>On the results page, I want the list of all articles that Bob wrote! Because of the nature of RDF, I don&#8217;t have this &#8220;full-text&#8221; information of &#8220;Bob&#8221; in the description of the articles he wrote. Most likely, in RDF, Bob will be related to the articles he wrote by reference (object reference with the URIs of these articles), <em>i.e.</em>, &lt;this-article&gt; &lt;author&gt; &lt;bob-uri&gt;. As you can notice, we won&#8217;t get back any articles in the resultset for the full-text query &#8220;Bob Carron&#8221; because this textual information doesn&#8217;t exist in the index at the level of the articles he wrote!</p>
<p>So, what can we do?</p>
<p>A simple trick will beautifully do the work. When we create the Solr index, what we want is to add the textual information of the resources being referenced by the indexed resources. For example, when we create the Solr document that describes one of the articles written by Bob, we want to add the literal that refers to the resource(s) referenced by this article. In this case, we want to add the name of the author(s) in the full-text record of that article. So, with this simple enhancement, if we do a search for &#8220;Bob Carron&#8221;, we will now get the list of all resources that refer to Bob too (articles he wrote, other people that know him, etc.)!</p>
<p style="text-align: center;"><a href="http://fgiasson.com/blog/wp-content/uploads/2009/04/text69217.png"><img class="size-medium wp-image-932 aligncenter" title="object property" src="http://fgiasson.com/blog/wp-content/uploads/2009/04/text69217-300x129.png" alt="" width="300" height="129" /></a></p>
<p>So, this is the goal of the &#8220;object_property&#8221; and &#8220;object_label&#8221; fields of the Solr index. In the diagram above, the &#8220;object_property&#8221; would be &#8220;author&#8221; and the &#8220;object_label&#8221; would be &#8220;Bob Carron&#8221;. This information would belong to the Solr document of <em>Article 1</em>.</p>
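<p>The trick can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are a hypothetical in-memory list rather than the result of a query to the triple store, the prefixed names (ex:author, ex:name, and so on) are invented, and I assume that a small fixed set of properties carries the label of a resource.</p>

```python
# Hypothetical triples: (subject, predicate, object, object_is_literal).
TRIPLES = [
    ("ex:article-1", "rdf:type", "ex:Article", False),
    ("ex:article-1", "ex:title", "Article 1", True),
    ("ex:article-1", "ex:author", "ex:bob", False),
    ("ex:bob", "rdf:type", "ex:Person", False),
    ("ex:bob", "ex:name", "Bob Carron", True),
]

# Assumption: these predicates carry the label of a resource.
LABEL_PROPERTIES = ("ex:name", "ex:title")


def label_of(uri, triples):
    """Return the first label literal found for a resource, or None."""
    for s, p, o, is_literal in triples:
        if s == uri and is_literal and p in LABEL_PROPERTIES:
            return o
    return None


def build_record(uri, triples):
    """Collect the index fields of one resource, including referenced labels."""
    record = {"uri": [uri], "type": [], "property": [], "text": [],
              "object_property": [], "object_label": []}
    for s, p, o, is_literal in triples:
        if s != uri:
            continue
        if p == "rdf:type":
            record["type"].append(o)
        elif is_literal:
            record["property"].append(p)
            record["text"].append(o)
        else:
            # Copy the label of the referenced resource into this record,
            # but only if that resource can be described by a literal.
            label = label_of(o, triples)
            if label is not None:
                record["object_property"].append(p)
                record["object_label"].append(label)
    return record


record = build_record("ex:article-1", TRIPLES)
print(record["object_property"], record["object_label"])
# prints ['ex:author'] ['Bob Carron']
```

<p>Because the name of the author is copied into the record of the article, a full-text search for &#8220;Bob Carron&#8221; can now match the article itself and not only the person.</p>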
<h3>Full Text Search Prototype</h3>
<p>Let&#8217;s take a look at the prototype running system (see screen capture below).</p>
<p style="text-align: center;"><a href="http://fgiasson.com/blog/wp-content/uploads/2009/04/search.gif"><img class="size-medium wp-image-933" title="search" src="http://fgiasson.com/blog/wp-content/uploads/2009/04/search-300x210.gif" alt="" width="300" height="210" /></a></p>
<p>The dataset loaded in this prototype is <a href="http://www.mkbergman.com/?page_id=325">Mike&#8217;s Sweet Tools</a>. As you can see in the prototype screen, many things can be done with the simple Solr schema we published above. Let&#8217;s start with a search for the word &#8220;test&#8221;. First, we get a resultset of 17 things that have the word &#8220;test&#8221; in any of their text-indexed fields.</p>
<p>What is interesting with that list is the additional information we now have for each of these results, which comes from the RDF description of these things and from the ontologies that have been used to describe them.</p>
<p>For example, if we take a look at Result #4, we see that the word &#8220;test&#8221; has been found in the <strong><em>description</em></strong> of the <strong><em>Ontology project </em></strong>for the &#8220;TONES Ontology Repository&#8221; record. Isn&#8217;t that precision far more useful than saying: the word &#8220;test&#8221; has been found in &#8220;this webpage&#8221;? I&#8217;ll let you think about it.</p>
<p>Also, if we take a look at Result #1, we know that the word &#8220;test&#8221; has been found in the <strong><em>homepage</em></strong> of the <strong><em>Data Converter Project</em></strong> for the &#8220;Talis Semantic Converter&#8221; record.</p>
<p>Additionally, by leveraging this Solr index, we can do efficient aggregates on the types of the things returned in the resultset for further filtering. So, in the section &#8220;Filter by kinds&#8221; we know what kinds of things are returned for the query &#8220;test&#8221; against this dataset.</p>
<p>Finally, we can use the drop-down box at the right to do a new search (see screenshot), based on the specific kind of things indexed in the system. So, I could want to make a new search, only for &#8220;Data specification projects&#8221; with the keyword &#8220;rdf&#8221;. I already know from the user interface that there are 59 such projects.</p>
<p>All this information comes from the Solr index at query time, and basically for free by virtue of how we set up the system. Everything is dynamically aggregated and displayed to the user.</p>
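<p>As a rough sketch of what such a query could look like, the Python snippet below builds the URL of a keyword search that also asks Solr for the counts per type, optionally restricted to one kind of thing. The parameters used (q, rows, facet, facet.field, facet.mincount and fq) are standard Solr request parameters; the address of the Solr instance, the function name and the example type URI are assumptions made for this example, and the keywords are matched against whatever default search field the schema declares.</p>

```python
from urllib.parse import urlencode

# Assumption: a Solr instance reachable at this address.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def search_url(keywords, kind=None, rows=10):
    """Build a full-text query that also returns aggregates on the types."""
    params = [
        ("q", keywords),
        ("rows", str(rows)),
        ("facet", "true"),                # ask Solr to compute aggregates
        ("facet.field", "type"),          # counts per asserted type
        ("facet.field", "inferred_type"), # counts per inferred type
        ("facet.mincount", "1"),          # skip types with no match
    ]
    if kind is not None:
        # Filter query: keep only the records of this (inferred) kind.
        params.append(("fq", 'inferred_type:"%s"' % kind))
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(search_url("test"))
print(search_url("rdf", kind="http://example.org/ontology#DataSpecificationProject"))
```

<p>The first URL corresponds to the plain search for &#8220;test&#8221;, and the second one to a new search for &#8220;rdf&#8221; limited to a single kind of project.</p>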
<p>However, there are a few things that you won&#8217;t notice here that are used: 1) SPARQL queries to the triple store to get some more information to display on that page; 2) the use of inference (more about it below), and; 3) the leveraging of the ontology descriptions.</p>
<p>In any case, on one of SD&#8217;s test datasets of about 3 million resources, such a page is generated within a few hundred milliseconds: resultset, aggregates, inference and description of things displayed on that page. These timings were obtained on a small Amazon EC2 server instance that costs 10 cents per hour. How&#8217;s that for performance?!</p>
<h3>Aggregates and Filtering on Properties and Types</h3>
<p>But, we don&#8217;t want to merely do full-text search on RDF data. We also want to do aggregates (how many records have this type, or this property, etc.) and filtering, at query time, in a couple of milliseconds. We already had a look at these two functions in the context of a full-text search. Now let&#8217;s see them in action in a prototype dataset browsing tool that uses the same Sweet Tools dataset.</p>
<p>In a few milliseconds, we get the list of the different kinds of things that are indexed in a given dataset. We know what the types are, and what the count is for each of them. So, the ontologies drive the taxonomic display of the list of things indexed in the dataset, and Solr drives the aggregation counts for each of these types of things.</p>
<p>Additionally, the ontologies and the <a href="http://virtuoso.openlinksw.com/wiki/main/Main/">Virtuoso</a> inference rules engine are used to make the count, by inference. If we take the example of the type &#8220;RDF project&#8221;, we know there are 49 such projects. However, not all these projects are explicitly typed with the &#8220;RDF project&#8221; type. In fact, 7 of them are typed as &#8220;RDF editor project&#8221; and 6 as &#8220;RDF generator project&#8221;.</p>
<p>This is where inference can play an important role: an article is a document. If I browse documents, I want to include articles as well. This &#8220;broad context retrieval&#8221; is driven by the description of the ontologies, and by inference; this is the same thing for these projects; and this is the same thing for everything else that is stored as structured RDF and characterized by an ontology.</p>
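<p>To illustrate what ends up in the <em>inferred_type</em> field, here is a small Python sketch that walks a class hierarchy upward from the asserted types of a record. It is a stand-in written for this post and not the Virtuoso inference engine; the hierarchy is hypothetical (in particular the &#8220;Project&#8221; super-class is an assumption), and only sub-class-of relations are considered.</p>

```python
# Hypothetical class hierarchy: each class maps to its direct super-classes.
SUBCLASS_OF = {
    "RDF editor project": ["RDF project"],
    "RDF generator project": ["RDF project"],
    "RDF project": ["Project"],
}


def inferred_types(asserted, subclass_of):
    """Return every super-class reachable from the asserted types."""
    found = set()
    stack = list(asserted)
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    # Keep only what was inferred, not what was already asserted.
    return sorted(found - set(asserted))


print(inferred_types(["RDF editor project"], SUBCLASS_OF))
# prints ['Project', 'RDF project']
```

<p>With these values indexed, a count or a filter on &#8220;RDF project&#8221; also picks up the records that were only typed as &#8220;RDF editor project&#8221;.</p>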
<p style="text-align: center;"><a href="http://fgiasson.com/blog/wp-content/uploads/2009/04/browse_tree.gif"><img class="size-medium wp-image-934" title="browse_tree" src="http://fgiasson.com/blog/wp-content/uploads/2009/04/browse_tree-131x300.gif" alt="" width="131" height="300" /></a></p>
<p>The screenshot above shows how these inferences and their nestings could present themselves in a user interface.</p>
<p>Once users click on one of these types, they start to browse all things of that type. In the screenshot below, Solr is used to add filters based on the attributes used to describe these things.</p>
<p style="text-align: center;"><a href="http://fgiasson.com/blog/wp-content/uploads/2009/04/browse_properties_filter.gif"><img class="size-medium wp-image-935" title="browse_properties_filter" src="http://fgiasson.com/blog/wp-content/uploads/2009/04/browse_properties_filter-300x185.gif" alt="" width="300" height="185" /></a></p>
<p>In some cases, I may want to see all the Projects that have a review. To do so, I would simply add this filter criterion on the browsing page and display the &#8220;Projects&#8221; that have a &#8220;review&#8221;. And thanks to Solr, I already know how many such Projects have reviews, even before taking a look at them.</p>
<p>Note, then, on this screenshot that the filters and counts come from Solr.  The list of the actual items returned in the resultset comes from a SPARQL query, and the name of the types and properties (and their descriptions) come from the description of the ontologies used.</p>
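<p>For the counts themselves, here is a hedged Python sketch of how the aggregates could be read from a Solr response. It assumes the response was requested as JSON, in which case the counts of a facet field come back by default as a flat list that alternates values and counts. The response body below is a hypothetical, trimmed example, and the property names in it are invented.</p>

```python
import json

# Hypothetical response body, trimmed to the part that matters here.
RESPONSE = json.loads("""
{
  "response": {"numFound": 3, "docs": []},
  "facet_counts": {
    "facet_fields": {
      "property": ["ex:review", 2, "ex:homepage", 3]
    }
  }
}
""")


def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


print(facet_counts(RESPONSE, "property"))
# prints {'ex:review': 2, 'ex:homepage': 3}
```

<p>The user interface can then show, next to each filter, how many records would remain if the filter were applied.</p>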
<p>This is what all this stuff is about: creating a symbiotic environment where all these wonderful systems live together to do the effective management of the structured data.</p>
<h3>Populating the Solr Index</h3>
<p>Now that we know how to use Solr to perform full-text searches, and the aggregating and filtering of structured data, one question still remains: how do we populate this index? As stated above, the goal is to manage all the structured data of the system using a triple store and ontologies, and then to use this triple store to populate the Solr index.</p>
<p>Structured Dynamics uses Virtuoso Open Source as the triple store to populate this index, for multiple reasons. One of the main ones is its performance and its capability to do efficient basic inference. The goal is to send the proper SPARQL queries to get the structured data that we will index in the Solr schema index that we talked about above. Once this is done, all the things that I talked about in this blog post become possible, and efficient.</p>
<h3>Syncing the Index</h3>
<p>However, in such a setup, we have to keep one thing in mind: each time the triple store is updated (a resource is created, deleted or updated), we have to sync the Solr index according to these modifications.</p>
<p>We have to detect any change in the triple store and reflect it in the Solr index. To do so, we re-create the entire Solr document of the resource that changed in the triple store, using the &lt;add /&gt; operation.</p>
<p>This design raises an issue with using Solr: we cannot simply modify one field of a record. We have to re-index the entire description of a document even if we only want to modify a single one of its fields. This limitation of Solr is being <a href="https://issues.apache.org/jira/browse/SOLR-139">addressed in this feature proposal</a>, but the feature is not yet ready for prime time.</p>
<p>Another thing to consider here is to properly sync the Solr index with any ontology changes (at the level of the class description) if you are using the inference feature. For example, assume you have an ontology that says that class A is a sub-class-of class B. Then, assume the ontology is refined to say that class A is now a sub-class-of class C, which itself is a sub-class-of class B. To keep the Solr index synced with the triple store, you will have to perform all modifications that affect all the records of these types. This means that the synchronization doesn&#8217;t only occur at the level of the description of a record; but also at the level of the changes in the ontologies used to describe those records.</p>
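<p>The class A, B and C example can be sketched in Python as follows. The idea, under the assumption that only sub-class-of relations matter, is to compare the super-classes of each class before and after the ontology change: every class whose set of super-classes differs is a class whose records have to be re-indexed. The function names are hypothetical.</p>

```python
def ancestors(cls, subclass_of):
    """Return all the super-classes of a class, direct or indirect."""
    found = set()
    stack = [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def classes_to_reindex(old, new):
    """List the classes whose inferred super-classes changed."""
    classes = set(old) | set(new)
    return sorted(c for c in classes
                  if ancestors(c, old) != ancestors(c, new))


# Before: A is a sub-class-of B.
old = {"A": ["B"]}
# After: A is a sub-class-of C, which is itself a sub-class-of B.
new = {"A": ["C"], "C": ["B"]}

print(classes_to_reindex(old, new))
# prints ['A', 'C']
```

<p>Here every record of type A has to be re-created in the Solr index so that its inferred types include C, even though the description of the record itself did not change in the triple store.</p>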
<h3>Conclusion</h3>
<p>One of the main things to keep in mind here is that now, when we develop Web applications, we are not necessarily talking about a single software application, but about a group of software applications that compose an architecture to deliver one or more services. In any such architecture, what is at the center is <em>Data</em>.</p>
<p>Describing, managing, leveraging and publishing this data is at the center of any Web service. This is why it is so important to have the right flexible data model (RDF), with the right flexible query language (SPARQL), and the right data management system (triple store) in place. From there, you can use the right tools to make it available on the Web to your users.</p>
<p>The right data management system is what should be used to feed any other specific systems that compose the architecture of a Web service. This is what we demonstrated with Solr; but it is certainly not limited to it.</p>
]]></content:encoded>
	<feedburner:origLink>http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/</feedburner:origLink></item>
</rdf:RDF>
