<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>Smart Data Collective</title><link>http://smartdatacollective.com/Home/</link><description>Smart Data Collective</description><language>en-us</language><image><url>http://smartdatacollective.com/logo/70.jpg</url><link>http://smartdatacollective.com/Home/</link><title>Home</title></image><copyright>SocialMediaToday</copyright><managingEditor>managing_editor</managingEditor><webMaster>webmaster</webMaster><pubDate>Tue, 09 Feb 2010 10:45:16 GMT</pubDate><lastBuildDate>Tue, 09 Feb 2010 10:45:16 GMT</lastBuildDate><generator>WordFrame RSS Generator v.1.0</generator><ttl>20</ttl><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/smartdatacollective_allposts" /><feedburner:info uri="smartdatacollective_allposts" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item><title>Social Media Marketers Should Get Ahead of the Curve</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/5u8ZnlbpJgc/24838</link><description>Recently I was reviewing an interesting article by Ross Mayfield who is an advisor towww.Slideshar.comand co-founder of Socialtext. He is also at @ross on Twitter. 
My compliments to him and his team...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/5u8ZnlbpJgc" height="1" width="1"/&gt;</description><content><![CDATA[<p>Recently I was reviewing an interesting article by <em><a title="Ross Mayfield" href="http://ross.typepad.com/" target="_blank"><strong>Ross Mayfield</strong></a> who is an advisor to</em> <a title="www.Slideshar.com" href="http://www.Slideshar.com" target="_blank"><strong>www.Slideshar.com</strong></a> <em>and co-founder of <a title="Socialtext" href="http://www.socialtext.com/" target="_blank"><strong>Socialtext</strong></a>. He is also at <a title="@ross" href="http://twitter.com/ross" target="_blank"><strong>@ross</strong></a> on Twitter. My compliments to him and his team.</em> He has this to say: </p>
<p>&nbsp;</p>
<p>As Chief Marketing Officers develop their social media marketing strategy for 2010, they are demanding business results. In 2009, 89% of CMOs tracked social media’s impact by using standard metrics such as site traffic, pageviews, and number of fans (as discussed in a recent survey).&nbsp; However, CMOs expect that in 2010 top metrics will track more closely to P&amp;L business goals––not just Web-related goals. The study forecasts the growth of adoption of the top three metrics in 2010, as follows:</p>
<ul>
    <li>A 333% increase in tracking revenue </li>
    <li>A 174% increase in tracking conversion </li>
    <li>A 150% increase in tracking average order value </li>
</ul>
<p>Such a shift in measurement expectation is significant. CMOs indicate a 300% year-over-year increase in 2010 in the number of companies that plan to measure social media’s impact on conversion and a 400% increase in the number of companies that will track social media’s direct impact on revenues.</p>
<p>&nbsp;</p>
<p>The 2009 financial crisis probably did social media some good. Not that you need a ton of budget, but competing for scarce budget alongside more traditional projects is a creative constraint. As social media marketing matures, it will become a facet of marketing overall and it will be harder to spot a campaign that isn’t social. Part of that maturing is measurement: you can’t manage what you can’t measure. Part of it is driving activity that moves traditional metrics, such as <a title="conversions with LeadShare" href="http://www.slideshare.net/business/leadshare" target="_blank"><strong>conversions with LeadShare</strong></a>.</p>
<p>&nbsp;</p>
<p>Social media marketers should get ahead of this curve. If you know you will eventually be accountable for traditional metrics, start iterating as soon as you can to find models that work. And volunteer to report these metrics before they are volunteered to you. This will require that you actively engage other parts of the marketing organization and give them stakeholdership in your outcomes. Take a look at your 2010 campaigns, reconsider your metrics, and incrementally realign your activities with the core of the marketing function.</p>
<p>&nbsp;</p>
<p><strong>My thoughts:</strong></p>
<ol>
    <li>Social media is giving marketing and service people new opportunities to understand their customers and to track events and customer needs. This adds to the established way of understanding customers and their behaviors: advanced data warehousing and Business Intelligence investments aimed at processes and techniques that (demand and) drive smart marketers’ actions. Teradata has been providing such solutions for over 20 years. But now faster communications, Web 2.0 applications, and customer-driven interactions have catapulted many countries and companies into a totally new era. </li>
    <li>Original web analytics are normally collected from logs or applications on websites and focus on sales activities and orders (and even revenues). But most companies have invested millions or tens of millions of dollars (or your local currency equivalent) and rarely track the customer’s entry point, movements on the website, pages read (and for how long), products reviewed and subsequent searches/views, which together define what customers seek and desire. If you had a salesperson at their shoulder watching them go through your website, wouldn’t he first ask questions and then find ways to fulfill the customer’s (purchasing or service) needs?</li>
    <li>Investing in web analytics has now moved beyond the most obvious metrics (usually aimed at the business itself, not at the customers). The new age of tracking and understanding customers makes it possible to compile data into a Teradata Data Warehouse and then use sophisticated analytical techniques and models to take immediate action and complete the sales or service cycle. In addition, Teradata has partnered with numerous companies, serving many industries, that extract and move the web activity data to the Teradata Data Warehouse; other companies then provide applications and analytics that give INSIGHT to managers and executives who need information to manage their resources. This new area is known as “Interactive Web Intelligence” or “Integrated Web Intelligence” (IWI). It means integrating web data with your detailed customer data from all of your other channels. This is now (sort of) mandatory if you plan to be successful in the electronic (PDA) world. </li>
    <li>Some excellent Teradata software alliance partners provide modern enabling tools for such gathering and analysis. They are WebTrends and SpeedTrap, along with others who provide additional infrastructure support and loading of data into the Teradata DWs.&nbsp; </li>
</ol>
<p><strong>SUMMARY AND RECOMMENDATION</strong><br>
Reporting at the end of the month, or even the end of the week, is no longer sensible or even useful. Latent “Post-Action” tracking and reporting, with delays in analyzing and then acting, provide little economic value.&nbsp; In today’s world, using the enabling technologies and smart people to go with them, you should be seeking an ACTIVE Data Warehouse with ACTIVE Enterprise Intelligence. Your competitors are in the integration mode and now gathering web data, and then moving quickly to learn and use such data to manage customer retention, customer sales, customer services and customer satisfaction. Are you?</p>
<p>&nbsp;</p>
<p>My best recommendation is to consider how much you have invested, or plan to invest, in your customer marketing and/or web site. Then evaluate what it would mean if you took just ten percent (10%) and reallocated it to Integrated Web Intelligence (IWI). No one, including your customers and competitors, will find your reallocation to be less than magnificent in terms of ROI. BI and DW along with IWI are part of the new world of Web 2.0 and, in turn, of understanding your customers and prospects. Address them with your best RELEVANT messaging and you will win in the world of intelligence and revenues. What do you need to know? Let me know…</p>
<p>&nbsp;</p>
<p><strong><a title="Ron Swift" href="http://www.teradata.com/t/WorkArea/linkit.aspx?LinkIdentifier=id&amp;ItemID=11123" target="_blank">Ron Swift</a></strong></p>
<p><a title="www.teradata.com/ronswift" href="http://www.teradata.com/t/blogs/ronswift/" target="_blank"><strong>www.teradata.com/ronswift</strong></a></p>
<br>]]></content><author>Ron Swift</author><category>Industry Perspective</category><category>ROI/Business Value</category><wfCategory>analytics,social media + metrics,conversion + revenues,leadshare</wfCategory><comments>http://smartdatacollective.com/Home/24838#0</comments><pubDate>Tue, 09 Feb 2010 00:44:44 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24838</guid><feedburner:origLink>http://smartdatacollective.com/Home/24838</feedburner:origLink></item><item><title>When does a hard science become a team sport?</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/2MLqRIugtMg/24837</link><description>The newest report from analyst firm Gartner just came out – and we’re all excited! It’s called the “Magic Quadrant for Data Warehouse Database Management Systems.” Gartner issues it about once a yea...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/2MLqRIugtMg" height="1" width="1"/&gt;</description><content><![CDATA[<p>The newest report from analyst firm Gartner just came out – and we’re all excited! It’s called the “Magic Quadrant for Data Warehouse Database Management Systems.” Gartner issues it about <img title="team_sport" style="width: 375px; height: 224px;" alt="team_sport" src="http://www.teradata.com/t/uploadedImages/Blogs/Darryl/ANA0275AL.1.jpg" align="right">once a year, and it positions database vendors based on vision and execution. I am pleased to share that Teradata is in the top spot of the Leaders’ Quadrant. To see our press release and get details, <a title="click here" href="http://www.teradata.com/t/WorkArea/linkit.aspx?LinkIdentifier=id&amp;ItemID=13255" target="_blank"><strong>click here</strong></a>. </p>
<p>&nbsp;</p>
<p>The report is a serious one because CIOs everywhere will read it, see the quadrants and how they are populated – then make more informed decisions. Those decisions shape our future business. I read the report’s fine print, too. </p>
<p>&nbsp;</p>
<p>As always, the report focuses on the relevant capabilities of vendors. However, the analysts remarked this year that Teradata’s customers play a role in our product development. I’m glad they did, because it’s an important factor. I commented on it in the press release because I believe that much of the credit for our success as a business comes from our enthusiastic and engaged customers. </p>
<p>&nbsp;</p>
<p>We all know that data warehousing is a sophisticated computer science. However, the excitement, enthusiasm and ongoing involvement of our customers (a big differentiator for Teradata) resemble a team sport. Our customers, our employees and our partners talk about data warehousing – the way a lot of us talk about sports. </p>
<p>&nbsp;</p>
<p>These high-energy conversations take place at our conferences, in our meeting rooms and in our online forums. And when we hear things like “Teradata is a central player in a high-stakes IT arena,” I think of the Super Bowl. I can’t help it. </p>
<p>&nbsp;</p>
<p>Of course, data warehousing is hard science, but for the Teradata community, using enterprise-class intelligence to do dramatic things for a business is serious fun. We talk about big plays, close calls, the best players and every dimension of the game. </p>
<p>&nbsp;</p>
<p>And when the Gartner data warehouse reports come out every year, I am glad to see they are keeping score ... and reporting the team standings. I’m proud that Teradata has been in the Leaders Quadrant since 1999, a position we all work together to keep.</p>
<p>&nbsp;</p>
<p><a title="Darryl" href="http://www.teradata.com/t/WorkArea/linkit.aspx?LinkIdentifier=id&amp;ItemID=6258" target="_blank"><strong>Darryl</strong></a></p>
<br>]]></content><author>Darryl McDonald</author><category>Data Warehousing</category><category>Industry Perspective</category><wfCategory>data warehousing,teradata,magic quadrant,cio's</wfCategory><comments>http://smartdatacollective.com/Home/24837#0</comments><pubDate>Mon, 08 Feb 2010 22:52:29 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24837</guid><feedburner:origLink>http://smartdatacollective.com/Home/24837</feedburner:origLink></item><item><title>Why Google needed a Superbowl ad</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/yGR0D-R15tk/24832</link><description>We were watching the Superbowl on TiVo, about a half hour behind, when I got a text message from my son at college: ...quot;Google's super bowl ad was horrible. I'm going to use Bing....quot; I didn't...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/yGR0D-R15tk" height="1" width="1"/&gt;</description><content><![CDATA[<strong>We were watching the Superbowl on TiVo</strong>, about a half hour behind, when I got a text message from my son at college: ..."Google's <a href="http://www.youtube.com/searchstories" target="_blank">super bowl ad </a>was horrible. I'm going to use Bing..."<br>
<br>
I didn't think it was bad, but I'm more prone to sentimentality than my son (to put it mildly). As I started this post, my point was going to be that the ad wasn't targeted at people like him, <a href="http://thesconz.wordpress.com/" target="_blank">a blogger</a> who actually considers which search engine to use. And it certainly wasn't for the masses who had clicked the ad on YouTube. I figured it was more for the millions of people who use search occasionally, but don't yet understand the breadth of its applications. After all, how does Google fare with people <a href="http://thenumerati.net/index.cfm?postID=518" target="_blank">who have migrated</a> in the last two or three years from dial-up, or are still there? For a company with 70% of the search market, that segment might represent growth. And you're more likely to reach them on the Super Bowl than with a viral campaign on the Net.<br>
<br>
But as I rewatch the ad, I suspect it might move too fast for that audience. I imagine relatives of mine watching it. Instead of seeing themselves typing rapid-fire queries and madly clicking, they're picturing a grandchild doing it. But of course, he might be using Bing.<br>
<br>
<br>
<br>
<a href="http://www.thenumerati.net/index.cfm?postID=521" title="http://www.thenumerati.net/index.cfm?postID=521">Link to original post</a><br>]]></content><author>Stephen Baker</author><category>Industry Perspective</category><category>Data Quality</category><wfCategory>search,google,bing,super bowl ad</wfCategory><comments>http://smartdatacollective.com/Home/24832#0</comments><pubDate>Mon, 08 Feb 2010 14:56:21 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24832</guid><feedburner:origLink>http://smartdatacollective.com/Home/24832</feedburner:origLink></item><item><title>MapReduce goes evolutionary</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/r-5bs5hnjoM/24831</link><description>Scientists from Texas A&amp;M University have developed a new algorithm MrsRF (MapReduce Speeds up Robinson-Foulds) for analyzing large collection of evolutionary trees using MapReduce framework. Matthews...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/r-5bs5hnjoM" height="1" width="1"/&gt;</description><content><![CDATA[<span class="fullpost">
<div style="text-align: justify;"></div>
</span>
<div style="text-align: justify;">Scientists from Texas A&amp;M University have developed a new algorithm, MrsRF (<a class="zem_slink" href="http://en.wikipedia.org/wiki/MapReduce" title="MapReduce" rel="wikipedia" target="_blank">MapReduce</a> Speeds up Robinson-Foulds), for analyzing large collections of evolutionary trees using the MapReduce framework. Matthews and Williams used their MapReduce algorithm to compute the all-to-all Robinson-Foulds (RF) <a class="zem_slink" href="http://en.wikipedia.org/wiki/Distance_matrix" title="Distance matrix" rel="wikipedia" target="_blank">distance matrix</a> on <a class="zem_slink" href="http://en.wikipedia.org/wiki/Multi-core" title="Multi-core" rel="wikipedia" target="_blank">multi-core</a> computing platforms. Computing the RF distance for every possible pair of trees is a computationally intensive task. The results show that MrsRF achieves a significant speedup over the fastest sequential algorithms.<blockquote>We studied the performance of our MrsRF algorithm on two large biological trees sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores.</blockquote><br>
<div align="center"><a href="http://2.bp.blogspot.com/_K2PSkfokqx8/S2_bcVanIOI/AAAAAAAABUg/UBK3nt1BBXE/s1600-h/MapReduce+Evolutionary+Biology+2.jpg" target="_blank"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 124px;" src="http://2.bp.blogspot.com/_K2PSkfokqx8/S2_bcVanIOI/AAAAAAAABUg/UBK3nt1BBXE/s400/MapReduce+Evolutionary+Biology+2.jpg" alt="" id="BLOGGER_PHOTO_ID_5435804555091058914" border="0"></a></div>
<em>
<div style="text-align: center;"><em>Phase 1 of the MrsRF algorithm. Two mappers and two reducers are used to process the input files.</em></div>
</em>
<div style="text-align: justify;"><br>
</div>
Apart from speeding up phylogenetic analysis, this study presents a new type of MapReducible problem where "the size of the input (t evolutionary trees) is much smaller than the size of the output (t &#215; t RF matrix)". In most MapReduce implementations the final output is smaller than the initial input. The authors also point out that getting the best performance out of a MapReduce implementation on a multi-core cluster depends on the cluster configuration. For instance, when they ran their problem set on 32 total cores, a cluster of 16 nodes with 2 cores each (16 &#215; 2) outperformed the 8 &#215; 4, 4 &#215; 8, and 32 &#215; 1 configurations.<br>
Overall, their research makes a strong case for using the MapReduce framework to design high-performance phylogenetic applications, particularly for large evolutionary computational problems such as summarizing big collections of evolutionary trees. An open-source implementation of the MrsRF algorithm is freely available from <a href="http://code.google.com/p/mrsrf/wiki/HowToUse" target="_blank">Google Code</a>.<br>
</div>
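<p>To make the "small input, t &#215; t output" shape concrete, here is a minimal single-process sketch in Python. It is not the authors' MrsRF code: the nested-tuple tree encoding and the names <code>bipartitions</code>, <code>map_phase</code> and <code>rf_matrix</code> are assumptions made for illustration. The map step emits one (split, tree id) pair per bipartition, the shuffle groups tree ids by split, and the reduce step counts the splits each pair of trees shares.</p>

```python
from collections import defaultdict
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels in a nested-tuple tree (hypothetical encoding)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits induced by the internal edges of a tree."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # every split is stored as the side without this taxon
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_taxa - below if anchor in below else below
        if 1 < len(side) < len(all_taxa) - 1:  # both sides need at least two taxa
            splits.add(side)
        return below

    walk(tree)
    return splits


def map_phase(trees):
    """Map: emit one (split, tree_id) pair for every split of every tree."""
    for tree_id, tree in enumerate(trees):
        for split in bipartitions(tree):
            yield split, tree_id


def rf_matrix(trees):
    """Build the all-to-all Robinson-Foulds distance matrix for a list of trees."""
    grouped = defaultdict(list)          # shuffle: split -> ids of trees containing it
    split_counts = [0] * len(trees)
    for split, tree_id in map_phase(trees):
        grouped[split].append(tree_id)
        split_counts[tree_id] += 1

    shared = defaultdict(int)            # reduce: shared splits per pair of trees
    for tree_ids in grouped.values():
        for i, j in combinations(sorted(tree_ids), 2):
            shared[i, j] += 1

    t = len(trees)
    matrix = [[0] * t for _ in range(t)]
    for i, j in combinations(range(t), 2):
        # RF distance = splits found in exactly one of the two trees
        distance = split_counts[i] + split_counts[j] - 2 * shared[i, j]
        matrix[i][j] = matrix[j][i] = distance
    return matrix
```

<p>Calling <code>rf_matrix</code> on three four-taxon trees, two of them identical, gives distance 0 for the identical pair and 2 for each differing pair. Even in this toy case the output has t &#215; t entries for t input trees, which is the property the authors highlight.</p>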
<br>
Reference:<br>
Matthews, S., &amp; Williams, T. (2010). MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees BMC Bioinformatics, 11 (Suppl 1) DOI: <a rev="review" href="http://dx.doi.org/10.1186/1471-2105-11-S1-S15" target="_blank">10.1186/1471-2105-11-S1-S15</a><br>
<div class="blogger-post-footer"><br>
<div>The original article is available at the <a href="http://www.abhishek-tiwari.com/" target="_blank">Fisheye Perspective</a> blog. Stay tuned for more posts and subscribe to the <a href="http://feeds2.feedburner.com/AbhishekTiwarisBlog" target="_blank">RSS feed</a>.&nbsp;</div>
</div>]]></content><author>Abhishek Tiwari</author><category>Industry Perspective</category><category>Agile Application Development</category><wfCategory>mrsrf,evolutionary trees,mapreducible</wfCategory><comments>http://smartdatacollective.com/Home/24831#0</comments><pubDate>Mon, 08 Feb 2010 10:59:24 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24831</guid><feedburner:origLink>http://smartdatacollective.com/Home/24831</feedburner:origLink></item><item><title>Defining Analytics: Data, Information and Knowledge</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/YEmQMXgH1vU/24822</link><description>Following-up to my blog 'Just Tell Me What I'm Doing', I'm starting a series of posts that define the key concepts and terms that make up my analytic world. Everything I do is coloured by my experie...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/YEmQMXgH1vU" height="1" width="1"/&gt;</description><content><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml">
<p>Following up on my blog post '<a href="http://analytics.typepad.com/oz-analytics/2010/01/in-my-last-post-i-took-us-all-in-the-local-analytic-community-to-task-for-a-shared-lack-of-ambition-having-burnt-some-bridge.html" target="_blank">Just Tell Me What I'm Doing</a>', I'm starting a series of posts that define the key concepts and terms that make up my analytic world. Everything I do is coloured by my experience actually doing analytics in commercial organisations. So while I believe these posts will present practical definitions that will be actionable in the business world, I know that there are other worlds in academia and science where they are less relevant. At the very least, people in these areas will gain a better understanding of how business regards analytics.</p>
<hr>
<p><strong><span style="font-size: 14px;"><span style="font-family: Tahoma; font-size: 16px;">Bennett's Analytica</span></span></strong><a style="float: right;" href="http://www.tbig.com.au/forums/cag-cortex-analytic-glossary"><img class="asset asset-image at-xid-6a01157080889a970b0128776fc2da970c " alt="Cortec_black_logo" title="Cortec_black_logo" src="http://analytics.typepad.com/.a/6a01157080889a970b0128776fc2da970c-800wi" style="margin: 0px 0px 5px 5px;" border="0"></a> &nbsp;<br>
<span><span style="font-family: Tahoma;">&nbsp;&nbsp;&nbsp;<em>A Practitioner's Guide To Analytics</em></span></span></p>
<p><span style="font-family: Tahoma,Verdana,sans-serif;"><em><br>
</em></span></p>
<hr>
<p><strong><span style="color: rgb(0, 0, 191);"><span style="font-size: 14px; font-family: Tahoma; color: rgb(0, 0, 191);">Data, Information and Knowledge</span></span></strong></p>
<hr>
<p>
</p>
<p>Information is a collection of related data – often transformed and aggregated – about a topic. In business, that topic is often insight about an operational area or a performance question. In analytics, ‘information’ is often used interchangeably with ‘data’, but data is best thought of as something that on its own carries no meaning. The main differences are in the degree of meaning and the level of abstraction being considered. To explain:</p>
<p><strong><span style="font-family: Tahoma;">Degree Of Meaning</span></strong></p>
<p>Data, information and knowledge all have some degree of meaning. Even data has meaning at some level. For example:</p>
<p>
</p>
<ul>
    <li>data: 99.9 is a number (you know it is probably not text). There is still a possibility that 99.9 is code for a text string or value.&nbsp;</li>
    <li>information: 99.9 is the percent of transactions successfully processed by an application.</li>
    <li>knowledge: 99.9 is 0.05 below the acceptable level for successfully processed transactions with our customers.</li>
</ul>
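<p>The three levels can be sketched in a few lines of Python. This is only an illustration: the <code>Reading</code> class and the <code>describe</code> function are hypothetical names, and the target of 99.95 is an assumption, read off the 0.05 gap in the list above and treated here as a success rate.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    value: float                      # data: a bare number
    meaning: Optional[str] = None     # information: what the number measures
    target: Optional[float] = None    # knowledge: the level the business accepts (assumed)


def describe(reading: Reading) -> str:
    """Label a reading by how much context is attached to it."""
    if reading.meaning is None:
        return f"data: {reading.value}"
    if reading.target is None:
        return f"information: {reading.value} is the {reading.meaning}"
    gap = round(reading.value - reading.target, 2)
    return f"knowledge: {reading.value} is {gap:+} against a target of {reading.target}"
```

<p>The same 99.9 moves from data to information to knowledge only as the description and the target are supplied; the number itself never changes.</p>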
<p>
</p>
<p><strong><span style="font-family: Tahoma;">Level Of Abstraction</span></strong></p>
<p>Data is the lowest level of abstraction, information is the next level, and knowledge is the highest of the three.</p>
<p>Be careful: <em>abstraction</em> is not the same as <em>summarisation</em>. Summaries may only be the sum of individual pieces of data. This doesn't change the data into information in and of itself. An example:</p>
<blockquote dir="ltr">
<p>A list of amounts 5, 8, 5, 2 can be summed to 20. Is 20 information?</p>
</blockquote>
<p><strong><span style="font-family: Tahoma;">Sources</span></strong></p>
<p>In the business intelligence world data is extracted from fixed sources (batch or in real time, it doesn't matter). Sources are usually either transactional applications or reference data. All sources have meaning. Transactional data has meaning because:</p>
<p>
</p>
<ul>
    <li>each transaction is stored in one or more records and this gives context to the individual data items of the record.</li>
    <li>the source application is known and that is information that gives additional meaning to the data.</li>
</ul>
<p>
</p>
<p>Reference data also has meaning as the table(s) within which it is stored has an internal meaning due to the relationship between the table rows. Typically this meaning is either hierarchical (for example an organisational structure or products grouped into categories) or group (for example a list of product codes or currencies).</p>
<p>In order for data to become information, it must be interpreted and take on a meaning.</p>
<p><strong><span style="font-family: Tahoma;">Analytica Illustration</span></strong></p>
<p>An example (care of Wikipedia):</p>
<blockquote dir="ltr">
<em>"The height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge"."</em>
</blockquote>
<p>
</p>
<p><strong><span style="font-family: Tahoma;">Related Terms and Concepts</span></strong></p>
<p>Refer also to Data</p>
<p>Refer also to Metadata</p>
<hr>
<em>Comments? Via form below or send feedback to </em><em><a href="mailto:analytica@tbig.com.au?Subject=Bennetts%20Analytica%20Blog%20Comment" target="_blank">
analytica@tbig.com.au</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;version 0.1 201002</em>
<p>
<hr>
<p>
</div>
<span><a href="http://feedproxy.google.com/%7Er/typepad/daAr/%7E3/vwHt65vc5_g/defining-analytics-data-information-and-knowledge.html" title="http://feedproxy.google.com/~r/typepad/daAr/~3/vwHt65vc5_g/defining-analytics-data-information-and-knowledge.html">Link to original post</a></span>]]></content><author>Steve Bennett</author><category>Industry Perspective</category><category>Predictive Analytics</category><wfCategory>analytics,metadata,abstraction,fixed sources</wfCategory><comments>http://smartdatacollective.com/Home/24822#0</comments><pubDate>Sun, 07 Feb 2010 19:43:53 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24822</guid><feedburner:origLink>http://smartdatacollective.com/Home/24822</feedburner:origLink></item><item><title>Socialytics</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/pcHsbL2ZdIc/24818</link><description>With the continued growth of both the pubic social web and private collaboration, communication and social business tools we are creating an explosion of social data. As businesses get more deeply ...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/pcHsbL2ZdIc" height="1" width="1"/&gt;</description><content><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml">
<p><img src="http://www.mfauscette.com/.a/6a00e554e7194588330128776e0aeb970c-pi" alt="201002051511.jpg" height="53" width="38"> With the continued growth of both the public social web and private collaboration, communication and social business tools, we are creating an explosion of social data. As businesses get more deeply involved in the social business movement and as software vendors create more and different social tools, there is a compelling case for tools to help businesses make sense of all this social data. I <a href="http://www.mfauscette.com/software_technology_partn/2009/04/social-analytics.html" target="_blank">wrote</a> about the idea of social analytics last year. This year in our IDC Top 10 Predictions we included a prediction about social analytics: "Business applications will undergo a fundamental transformation — fusing business applications with social/collaboration software and analytics into a new generation of "socialytic" apps, challenging current market leaders." Here's a simple model that shows the concept of socialytic platforms / apps and how they might be applied:</p>
<p align="center"><img src="http://www.mfauscette.com/.a/6a00e554e7194588330128776e0afa970c-pi" alt="201002061159.jpg" height="358" width="477"></p>
<p>Some characteristics of socialytic apps might include:</p>
<ul>
    <li>Aggregate social data from public and private social data sources (user / company selectable)</li>
    <li>Real-time search and monitoring</li>
    <li>Social metrics dashboard</li>
    <li>Natural language processing (NLP) with linguistic analysis capabilities</li>
    <li>Visualization and simulation</li>
    <li>Social trending</li>
    <li>Human search</li>
    <li>User configurability</li>
</ul>
<p>I'm sure there are lots of other characteristics and use cases for socialytic platforms and apps that will emerge as more businesses start deploying social tools over the next few years. Once the analysis capabilities are in place, businesses will undoubtedly start to look at automating decisions and building / deploying automated decision architecture-based solutions to help social analysts focus on higher value interactions. A few vendors are offering socialytic apps but the market is far from crowded at this point. An upswing in demand as companies get deeper into their social business projects should drive market growth and the development of new and varied offerings to address emerging needs.</p>
<br>
</div>
<br>]]></content><author>mfauscette</author><category>Industry Perspective</category><category>Microtargeting</category><wfCategory>social analytics,social analytics apps</wfCategory><comments>http://smartdatacollective.com/Home/24818#0</comments><pubDate>Sun, 07 Feb 2010 09:40:45 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24818</guid><feedburner:origLink>http://smartdatacollective.com/Home/24818</feedburner:origLink></item><item><title>Big Data</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/i57Pkc1sV8U/24811</link><description>Several month's ago a short video appeared on YouTube with an interview of LinkedIn's Chief Scientist DJ Patil. In it he discusses how 'Big Data' impacts the practise of analytics. I've only just go...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/i57Pkc1sV8U" height="1" width="1"/&gt;</description><content><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml">
<p>Several months ago a short video&nbsp;appeared on YouTube with an&nbsp;interview of LinkedIn's Chief Scientist DJ Patil. In it he discusses how '<a href="http://www.youtube.com/watch?v=acimvXoKwhc" target="_blank">Big Data</a>' impacts the practice of analytics. I've only just got around to posting about it, but I am doing so now because he has some insights that I agree with and would like to share, as they are still relevant.&nbsp;</p>
<p>Big data is today most often associated with the internet superstars like Google, eBay and Amazon. There are 3 other areas with&nbsp;lower profiles&nbsp;where big data is important: intelligence (spooks, the military, etc.), scientific and academic research, and the financial markets.</p>
<p>
<p style="margin: 11px 0px;">Big data's future is much bigger than this because more and more areas of human activity are going to be faced with vast data sets. When you hear people talking about the growth of knowledge and statements like '<a href="http://analytics.typepad.com/im_maze/2009/06/to-pluto-and-back.html" style="color: blue ! important; text-decoration: underline ! important; cursor: text ! important;" target="_blank">if this data were printed then the stack would grow faster than NASA’s fastest rocket</a>', you have to remember that there is a good chance that each page of new data is adding to someone's analytic data set.</p>
    <p>
    <p>
    <p>I'm not quoting him verbatim, but here's what I heard, along with my takeaways from his comments:</p>
    <p>
    <ul>
        <li>Open source 'big data ready' technologies like Hadoop (see my earlier <a href="http://analytics.typepad.com/oz-analytics/2009/09/top-10-cloud-computing-vendors.html" target="_blank">blog</a> or <a href="http://hadoop.apache.org/" target="_blank">here</a>) have come into their own now. If you are facing big data challenges, look to people with these skills over those with only SQL skills.</li>
        <li>We have reached a tipping point in the use of open source for commercial solutions to big data problems.</li>
        <li>If you want good analysts, then the best place to look is in occupations where people already have practical skills in manipulating big data sets: scientific fields like meteorology, oceanography and the like. I agree, but this is not the only place to look: in my experience I also need analysts who relate well to business decision makers, i.e. the people who make commercial decisions based on the analytics. This is perhaps less important in pure tech plays like LinkedIn.</li>
        <li>Open source will transform the practice of analytics in the next 3 to 5 years. I think it will take longer than this to really impact the more traditional industries. I'm not happy about this, but I am realistic about the difficulty of convincing business leaders that open source is a superior solution to proprietary ones. The money behind the big vendors will keep them going for a number of years yet.</li>
    </ul>
    <p>
    <p>One potential qualifier to&nbsp;DJ Patil's perspective is that although he has a very impressive big data background as a mathematician, US Department of Defence analyst ('Threat Anticipation'), and former eBay Director of Strategy and Analytics, his current employer is LinkedIn.</p>
    <p>The core of LinkedIn's big data is structured and fairly static: profiles of people. So I'm not sure how similar their big data challenges would be to, say, those faced when processing, understanding and predicting large streams of real-time data from financial markets or very large sensor arrays. On the other hand, the growth of LinkedIn communities and their related activities must generate large amounts of semi-structured data.</p>
    <p>I also have no idea what LinkedIn's own analytic goals are beyond what DJ mentions on his own profile where he says his analytics drives product features like:</p>
    <p>
    <ul>
        <li>"People You May Know"</li>
        <li>"Who Viewed My Profile"</li>
        <li>"Groups You Might Like"</li>
    </ul>
    <p>
    <p>Maybe somebody reading this blog knows more?</p>
    <p>The video is on <a href="http://www.youtube.com/watch?v=dRrkgvr9V_s" target="_blank">YouTube</a> and I embed it here for convenience:</p>
    <object height="360" width="580">
    <param name="movie" value="http://www.youtube.com/v/dRrkgvr9V_s&amp;hl=en_US&amp;fs=1&amp;rel=0&amp;border=1">
    <param name="allowFullScreen" value="true">
    <param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" src="http://www.youtube.com/v/dRrkgvr9V_s&amp;hl=en_US&amp;fs=1&amp;rel=0&amp;border=1" type="application/x-shockwave-flash" height="360" width="580"></object>
    <p>Or you can download <span class="asset asset-video at-xid-6a01157080889a970b0128776c4b3e970c"><a href="http://analytics.typepad.com/files/dj-patil-on-how-big-data-impacts-analytics-480.mp4" target="_blank">'DJ Patil on How Big Data Impacts Analytics'</a>&nbsp;directly from this blog</span>.</p>
    <p>
    </div>
    <br>
    <span><a href="http://feedproxy.google.com/%7Er/typepad/daAr/%7E3/0hhL_bMnSq8/linkedin-chief-scientist-on-big-data.html" title="http://feedproxy.google.com/~r/typepad/daAr/~3/0hhL_bMnSq8/linkedin-chief-scientist-on-big-data.html">Link to original post</a></span>]]></content><author>Steve Bennett</author><category>Business Intelligence</category><category>Industry Perspective</category><category>Data Quality</category><wfCategory>open source,hadoop,big data + analytics,dj patil</wfCategory><comments>http://smartdatacollective.com/Home/24811#0</comments><pubDate>Sun, 07 Feb 2010 01:55:14 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24811</guid><feedburner:origLink>http://smartdatacollective.com/Home/24811</feedburner:origLink></item><item><title>Selling to enterprises</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/eUMEIFZCj2w/24804</link><description>For some reason when you are selling information technology, big companies are referred to as “enterprises.” I’m guessing the word was invented by a software vendor who was trying to justify a milli...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/eUMEIFZCj2w" height="1" width="1"/&gt;</description><content><![CDATA[<p>For some reason when you are selling information technology, big companies are referred to as “enterprises.” I’m guessing the word was invented by a software vendor who was trying to justify a million-dollar price tag. As a rule of thumb, think of enterprise sales as products/services that cost $100K/year or more.</p>
<p>I am by no means an expert in enterprise sales. Personally, I vastly prefer marketing (one-to-many) to sales (one-to-one), hence I only start companies making consumer or small business products (advertising-based or with sub-$5000 price tags).&nbsp;But I have been involved in a few enterprise companies over the years. Here’s the main thing I’ve observed. Almost every enterprise startup I’ve seen has a product that would solve a problem their prospective customers have. But that isn’t the key question. The key question is whether it solves a problem that is one of the prospective customer’s top immediate priorities. Getting an enterprise to cough up $100K+ requires the “buy in” of many people, most of whom would prefer to maintain the status quo. Only if your product is a top priority can you get powerful “champions” to cut through the red tape.</p>
<p>My rule of thumb is that every enterprise (or large business unit within an enterprise) will, at best, buy 1-3 new enterprise products per year. &nbsp;You can have the greatest hardware/software in the world, but if you aren’t one of their top three priorities, you won’t be able to profitably sell to them.</p>
<p>One final note: enterprise-focused VCs sometimes refer to products priced between (roughly) $5K and $100K as falling in the “valley of death.” Above $100K, you might be able to make a profit given the cost of sales. Below $5K you might be able to market your product, hence have a very low cost of sales. In between, you need to do sales but it’s hard to do it profitably. Your best bet is a “channel” strategy; however, for innovative new products that is often a lot like trying to push a string.</p>
<br>
<a href="http://cdixon.org/2010/02/06/selling-to-enterprises/" title="http://cdixon.org/2010/02/06/selling-to-enterprises/">Link to original post</a><br>]]></content><author>Chris Dixon</author><category>Industry Perspective</category><wfCategory>information technology + enterprises,channel strategy for sales</wfCategory><comments>http://smartdatacollective.com/Home/24804#0</comments><pubDate>Sat, 06 Feb 2010 15:59:09 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24804</guid><feedburner:origLink>http://smartdatacollective.com/Home/24804</feedburner:origLink></item><item><title>After phylogenetics Microsoft patents personal data mining</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/bdQs2tBHzys/24801</link><description>I hope you remember that some time back Microsoft tried to patent clustering phylogenetics methods which was a socking newsfor the bioinformatics community as community used these methods for a long t...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/bdQs2tBHzys" height="1" width="1"/&gt;</description><content><![CDATA[<span class="fullpost">
<div style="text-align: justify;">I hope you remember that some time back <a href="http://www.freepatentsonline.com/y2009/0030925.html" target="_blank">Microsoft tried to patent clustering phylogenetics methods</a>, which <a href="http://scienceblogs.com/pharyngula/2009/08/microsoft_owns_bioinformatics.php" target="_blank">was shocking news</a> <a href="http://johnhawks.net/weblog/topics/biotech/patents/systematics-microsoft-patent-comparative-method-pennisi-2009.html" target="_blank">for the bioinformatics community</a>, as the community had used these methods for a long time without any restriction. <a href="http://techflash.com/seattle/2010/02/gates_ozzie_other_microsoft_execs_patent_personal_data_mining.html" target="_blank">Now Microsoft has</a> <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=/netahtml/PTO/srchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,657,493.PN.&amp;OS=PN/7,657,493&amp;RS=PN/7,657,493" target="_blank">patented a personal data mining system</a>. According to the abstract of the patent, which was granted just last week:<blockquote style="text-align: justify;">Personal data mining mechanisms and methods are employed to identify relevant information that otherwise would likely remain undiscovered. Users supply personal data that can be analyzed in conjunction with data associated with a plurality of other users to provide useful information that can improve business operations and/or quality of life. Personal data can be mined alone or in conjunction with third party data to identify correlations amongst the data and associated users. Applications or services can interact with such data and present it to users in a myriad of manners, for instance as notifications of opportunities.</blockquote><br>
The patent includes some heavyweight names from Microsoft, including Bill Gates. The system tries to answer personal queries such as "What is the best digital camera and where I can find the cheapest one?". What is really troubling are claims like the one below:<br>
<blockquote>Furthermore, as will be appreciated, various portions of the disclosed systems and methods may include or consist of artificial intelligence, machine learning, or knowledge or rule based components, sub-components, processes, means, methodologies, or mechanisms (e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines, classifiers . . . ).</blockquote>I am also not sure how a patent was granted purely on the basis of data streams, without looking at the background of the algorithms used.</div>
</span><br>
<div class="blogger-post-footer">
<div><span style="font-style: italic;">Original article is
available at <a href="http://www.abhishek-tiwari.com/" target="_blank">Fisheye Perspective</a> blog. </span></div>
</div>]]></content><author>Abhishek Tiwari</author><category>Data Mining</category><category>Industry Perspective</category><category>Microtargeting</category><wfCategory>bioinformatics,microsoft + patents,phylogenetics,personal data mining</wfCategory><comments>http://smartdatacollective.com/Home/24801#0</comments><pubDate>Sat, 06 Feb 2010 09:07:43 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24801</guid><feedburner:origLink>http://smartdatacollective.com/Home/24801</feedburner:origLink></item><item><title>WSDM 2010: Day 2</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/PwlPM5hloYw/24799</link><description>Unfortunately, I woke up this morning rather under the weather, so I’m having to resort to remotely reporting on the second day of WSDM 2010 conference, based on the published proceedings and the tw...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/PwlPM5hloYw" height="1" width="1"/&gt;</description><content><![CDATA[<p><em></em>Unfortunately, I woke up this morning rather under the weather, so I’m having to resort to remotely reporting on the second day of&nbsp;<a href="http://www.wsdm-conference.org/2010/" target="_blank">WSDM 2010</a> conference, based on the published proceedings and the&nbsp;<a href="http://twitter.com/#search?q=%23wsdm2010" target="_blank">tweet stream</a>.</p>
<p>The day started with a keynote from Harvard economist&nbsp;<a href="http://kuznets.fas.harvard.edu/%7Eathey/" target="_blank">Susan Athey</a>. Her research focuses on the design of auction-based markets, a topic core to the business of search, which largely relies on auction-based advertising models (cf.&nbsp;<a href="http://en.wikipedia.org/wiki/AdWords" target="_blank">Google AdWords</a>). Then came a session focused on learning and optimization. One paper proposed a method to learn ranking functions and query categorization simultaneously, reflecting that different categories of queries lead users to have different expectations about ranking. Another combined traditional list-based ranking with pair-wise comparisons between results to separate the results into tiers reflecting grades of relevance. An intriguing approach to query recommendation treated it as an optimization problem, perturbing users’ query-reformulation path to maximize the expected value of a utility function over the search session. Another paper looked not at ranking per se, but rather at improving the quality of training data for using machine learning for ranking. The final paper of the session, which earned a best-paper nomination, modeled document relevance based not on click-through behavior, but rather on post-click user behavior.</p>
<p>The next session was about users and measurement. It opened with another best-paper nominee: an analysis of over a hundred million users to understand how they re-find web content. Another offered a rigorous analysis of the often sloppily presented “<a href="http://en.wikipedia.org/wiki/Long_Tail" target="_blank">long-tail</a>” hypothesis: it&nbsp;found that light users disproportionately prefer content at the head of the distribution while heavy users disproportionately prefer the tail. Another log-analysis paper analyzed search logs using a partially observable Markov model (a variant of the <a href="http://en.wikipedia.org/wiki/Hidden_Markov_model" target="_blank">hidden Markov model</a> in which not all of the hidden state transitions emit observable events) and compared the latent variables with eye-tracking studies. An intriguing study demonstrated that user behavior models are more predictive of goal success than models based on document relevance. The final paper of the session proposed methods for quantifying the reusability of the test collections that lie at the heart of information retrieval evaluation.</p>
<p>The last session of the day focused on social aspects of search. Two of the papers were concerned with modeling authority and influence in social networks, a problem in which I take a deep&nbsp;<a href="http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/" target="_blank">personal interest</a>. Another inferred attributes of social network users based on those of other users in their communities (cf. MIT’s&nbsp;<a href="http://www.boston.com/bostonglobe/ideas/articles/2009/09/20/project_gaydar_an_mit_experiment_raises_new_questions_about_online_privacy/" target="_blank">Project Gaydar</a>).&nbsp;Another analyzed&nbsp;<a href="http://www.flickr.com/" target="_blank">Flickr</a> and&nbsp;<a href="http://www.last.fm/" target="_blank">Last.fm</a> user logs to show that users’ semantic similarity based on their tagging behavior is predictive of social links. The final paper tackled the sparsity of social media tags by inferring latent topics from shared tags and spatial information.</p>
<p>Not surprisingly, a disproportionate number of contributors to the conference work at major web search companies, who have both the motivation to improve results and the access to data that is needed for such research. One of the ongoing research challenges for the field is to find ways to make this data available to others while respecting the business concerns of search engine companies and the privacy concerns of their users.</p>
<br>
<a href="http://thenoisychannel.com/2010/02/06/wsdm-2010-day-2/" title="http://thenoisychannel.com/2010/02/06/wsdm-2010-day-2/">Link to original post</a><br>]]></content><author>Daniel Tunkelang</author><category>Industry Perspective</category><wfCategory>wsdm 2010 conference,susan athey,learning + optimization,markov model,mit + project gaydar</wfCategory><comments>http://smartdatacollective.com/Home/24799#0</comments><pubDate>Sat, 06 Feb 2010 04:38:30 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24799</guid><feedburner:origLink>http://smartdatacollective.com/Home/24799</feedburner:origLink></item><item><title>Huffington Post: Crawling with data addicts</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/JcnsVx688wE/24789</link><description>The Huffington Post has recently passed the Washington Post in traffic. It got 410 million page views last month (and 35 million on the iPhone alone). Data is a big part of the site's success. At the ...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/JcnsVx688wE" height="1" width="1"/&gt;</description><content><![CDATA[<a href="http://www.businessweek.com/the_thread/blogspotting/archives/2005/05/arianna_carves.html" target="_blank">The
Huffington Post</a> has recently passed the Washington Post in traffic.
It got 410 million page views last month (and 35 million on the iPhone
alone). Data is a big part of the site's success.<br>
<br>
At the media panel at the <a href="http://blogs.webtrends.com/neworleans10/" target="_blank">Webtrends</a> conference this week, Huffington
Post's chief technology officer, <a href="http://www.huffingtonpost.com/paul-berry" target="_blank">Paul Berry</a>, described some of the methods.
Every hour, editors see how the traffic that hour compares to the same
hour a week ago. The site is a laboratory for so-called A/B testing,
where stories are played against each other to see which draws more
traffic, and how long each story should stay on the front page.<br>
<br>
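As a rough illustration of the A/B idea (not HuffPost's internal tooling, which Berry did not describe in detail), two versions of a story can be compared on click rate. The sketch below swaps in a standard two-proportion z-test; the function name and the sample counts are hypothetical.<br>

```python
from math import erf, sqrt


def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Compare two click rates with a two-proportion z-test.

    Returns (rate_a, rate_b, p_value). A small p-value suggests the
    difference in click rate is unlikely to be chance alone.
    Assumes both variants were viewed and at least one click occurred.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the assumption that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 1 - erf(abs(z) / sqrt(2))
    return rate_a, rate_b, p_value


# Hypothetical counts: story A drew 120 clicks from 1000 views, story B 90.
rate_a, rate_b, p_value = ab_test(120, 1000, 90, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p-value: {p_value:.3f}")
```

With counts like these an editor would keep story A; with a large p-value the honest answer is that the test has not separated the two yet.<br>
<br>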
"We've built a lot of internal tools," Berry said. "A lot of us are
addicted, like crack addicts, to these stats."<br>
<br>
HuffPost also tracks readers' shifting moods by carrying out automated
"sentiment analysis" on the two million comments the site generates
every month. (In other words, machines look for key words and report on
whether the comment was favorable or not. If you look at individual
results, they're fairly primitive. A sentence like "I'm not saying I'm
not crazy about it" can throw a machine for a loop. But they get the
big picture.)<br>
<br>
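To make the key-word approach concrete, here is a minimal sketch of a word-counting scorer. It is an assumption-laden toy, not HuffPost's system: the word lists and function names are invented for illustration.<br>

```python
import re

# Hypothetical word lists, chosen only for illustration.
POSITIVE_WORDS = {"good", "great", "love", "crazy"}
NEGATIVE_WORDS = {"bad", "awful", "hate", "not"}


def keyword_sentiment(comment):
    """Return positive-word count minus negative-word count.

    A score above zero reads as favorable, below zero as unfavorable.
    """
    words = re.findall(r"[a-z']+", comment.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def share_favorable(comments):
    """Fraction of comments with a positive score: the 'big picture'."""
    scores = [keyword_sentiment(comment) for comment in comments]
    return sum(score > 0 for score in scores) / len(scores)


# The double negative counts as two negative words against one positive,
# so the scorer marks the comment unfavorable whatever the writer meant.
print(keyword_sentiment("I'm not saying I'm not crazy about it"))
print(share_favorable(["I love this, great story", "awful, I hate it", "great"]))
```

A counter like this cannot see that one "not" cancels the other, which is exactly the individual-result weakness described above; averaged over many comments, such errors matter less.<br>
<br>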
Berry said that many large advertisers are still eager for traditional over-the-fold real estate. But they get more clicks when their ads accompany stories about them. Clicks on Bing ads "go through the roof" when the story's about Microsoft, he said. (Traditionally, at least some magazines have worked to separate advertisers from stories about them, but those days are disappearing fast...)<br>
<br>
Berry said that in its editorial layout, Huffington follows the "Mullet
Strategy": business in front, party out back.<br>
<br>
<br>
<a href="http://www.thenumerati.net/index.cfm?postID=517" title="http://www.thenumerati.net/index.cfm?postID=517">Link to original post</a><br>]]></content><author>Stephen Baker</author><category>Data Mining</category><category>Industry Perspective</category><category>Operational BI</category><wfCategory>optimization,webtrends conference 2010,huffington post,paul berry</wfCategory><comments>http://smartdatacollective.com/Home/24789#0</comments><pubDate>Fri, 05 Feb 2010 14:24:41 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24789</guid><feedburner:origLink>http://smartdatacollective.com/Home/24789</feedburner:origLink></item><item><title>Microsoft takes on Google and IBM in science cloud</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/INMBJd3BW7I/24788</link><description>Microsoft, the Times reports, is offering scientists free access to its cloud computing. This is important because scientists are grappling with mountainous troves of data, and they need Google-like (...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/INMBJd3BW7I" height="1" width="1"/&gt;</description><content><![CDATA[Microsoft, the <a href="http://www.nytimes.com/2010/02/05/science/05cloud.html?hpw" target="_blank">Times
reports</a>, is offering scientists free access to its cloud computing.
This is important because scientists are grappling with mountainous
troves of data, and they need Google-like (or Bing-like) computing
clusters to crunch them. I read recently that the biological data
amassed last year surpassed all of the biological data in history,
presumably from the dissection of the first frog until weeks before the
Obama inauguration.<br>
<br>
The need for scientific clouds is clear, and as <a href="http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm" target="_blank">I
wrote</a> two years ago in BusinessWeek, IBM and Google are on a
similar track. My question is this: Is scientific data
going to get tangled in a software battle? IBM, Google and others are
offering open-source cloud software known as <a href="http://hadoop.apache.org/mapreduce/" target="_blank">Hadoop</a>, which is based on
Google's MapReduce. Microsoft is providing its own platform, Azure. The
grand promise of cloud computing will be for scientists to share data
sets, and even to delve into ones from seemingly unrelated fields. That
way they might find correlations between, say, meteorology and disease.<br>
<br>
But if scientists in the Microsoft cloud are doing their work in Azure,
will they be able to collaborate with others working in the cloud? The last thing science needs is a platform battle in the next generation of computing. (This <a href="http://social.msdn.microsoft.com/Forums/en/windowsazure/thread/230a3555-02cc-421f-8d9b-45e13f9d7319" target="_blank">thread of questions</a> on a Microsoft site shows developers grappling with the challenges of offering MapReduce within Azure.)<br>
<br>
I also notice that the Microsoft-NSF grant is for U.S.
scientists. I'm assuming researchers from elsewhere will have access
too. Research teams in science stopped paying attention to borders long ago.<br>
<br>
<br>
Wondering what scientific cloud computing looks like? <a href="http://rob.gillenfamily.net/post/Windows-Azure-Climate-Data-and-Microsoft-Surface.aspx" target="_blank">Rob Gillen</a>, a researcher at Oak Ridge National Laboratory, explores some meteorology data using Azure and other Microsoft technologies, including the <a href="http://rob.gillenfamily.net/post/Windows-Azure-Climate-Data-and-Microsoft-Surface.aspx" target="_blank">Surface</a> touch screen. (This version is more for the presentation of cloud data, I would assume, than the nuts-and-bolts of actual research.)
<br>
<br>
<br>
<br>
<br>
<a href="http://www.thenumerati.net/index.cfm?postID=516" title="http://www.thenumerati.net/index.cfm?postID=516">Link to original post</a><br>]]></content><author>Stephen Baker</author><category>Data Integration</category><category>Industry Perspective</category><wfCategory>cloud computing,scientific cloud computing</wfCategory><comments>http://smartdatacollective.com/Home/24788#0</comments><pubDate>Fri, 05 Feb 2010 13:13:53 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24788</guid><feedburner:origLink>http://smartdatacollective.com/Home/24788</feedburner:origLink></item><item><title>Computational Biology: Is reproducibility overrated?</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/nMexHglhuwk/24787</link><description>Recently I was reading a bunch of articles on computational reproducibility in scientific research including one from my fellow blogger Grant Jacobs of Code for life blog. The term  reproducibility in...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/nMexHglhuwk" height="1" width="1"/&gt;</description><content><![CDATA[<span class="fullpost">
<div style="text-align: justify;">Recently I was reading a bunch of <a href="http://arstechnica.com/science/news/2010/01/keeping-computers-from-ending-sciences-reproducibility.ars" target="_blank">articles</a> on computational reproducibility in <a class="zem_slink" href="http://en.wikipedia.org/wiki/Scientific_method" title="Scientific method" rel="wikipedia" target="_blank">scientific research</a>, including <a href="http://sciblogs.co.nz/code-for-life/2010/01/24/reproducible-research-and-computational-biology/" target="_blank">one</a> from my fellow blogger Grant Jacobs of the <a href="http://sciblogs.co.nz/code-for-life/" target="_blank">Code for life</a> blog. Reproducibility in computational science is not monolithic: what seems essential for one scientific domain may not matter to another. Finding out what it takes is a difficult but necessary effort. <a href="http://www.amath.washington.edu/%7Erjl/pubs/icm06/icm06leveque.pdf" target="_blank">According to Randy LeVeque</a>, a prominent mathematician:<br>
<blockquote>Scientific and mathematical journals are filled with pretty pictures these days of computational experiments that the reader has no hope of repeating. Even brilliant and well intentioned computational scientists often do a poor job of presenting their work in a reproducible manner. The methods are often very vaguely defined, and even if they are carefully defined, they would normally have to be implemented from scratch by the reader in order to test them.</blockquote><br>
The latest round of debate originated from <a href="http://dx.doi.org/10.1126/science.1179653" target="_blank"> an article in the journal Science</a> written by Jill P. Mesirov on accessible reproducible research in <a class="zem_slink" href="http://en.wikipedia.org/wiki/Computer_science" title="Computer science" rel="wikipedia" target="_blank">computer science</a>. The argument centered on Jill's own experience with the <a class="zem_slink" href="http://www.genepattern.org" title="GenePattern" rel="homepage" target="_blank">GenePattern</a>-Word RRS system, with the intention of generalizing the described approach. After reading the whole paper, I found nothing novel about the kind of reproducibility Jill proposes; in fact, there are plenty of ongoing efforts to support computational reproducibility by developing niche markup languages and associated tools. Many of these efforts, such as SBML and <a class="zem_slink" href="http://en.wikipedia.org/wiki/CellML" title="CellML" rel="wikipedia" target="_blank">CellML</a>, are at an advanced stage, and most journals now require their prospective authors to submit their data and models in domain-specific formats. I can't see why these efforts are not a step towards accessible computational reproducibility. They fully qualify as a means of achieving it. Suggestions that computational research needs a paradigm shift, or calls for a new model for the way we publish our results, overlook the fact that such a model is already in place, perhaps in primitive form, but we are getting there. We talk and talk but never appreciate what communities are doing. This new drive for reproducibility evangelism is faulty and incomplete unless it gives due consideration to ongoing community-wide efforts to tackle this issue.<br>
<br>
This whole debate forced me to ask a few questions:<br>
<br>
1. <span style="font-weight: bold;">Computational reproducibility is important, but to what extent?</span> Reproduction and repetition of a scientific protocol is never a straightforward affair. By no means should scientific reproducibility be considered a replacement for scientific integrity and learning. It also reminds me of an argument presented by James Bassingthwaighte, who thinks that computational reproducibility might be overrated. According to James:<blockquote><br>
&#8226;Better ideas will emerge when other investigators attack a model (hypothesis) and attempt to improve on it.<br>
&#8226;Redevelopment is the best way to penetrate a model or hypothesis –i.e. no waste of effort!<br>
This<br>
<blockquote>-is the real test of reproducibility of the original authors’ views<br>
-really validates it by doing it totally independently<br>
-makes it less likely to incorporate the original’s errors</blockquote><br>
Copy-cat errors are like mutations in a phylogenetic sequence, so it’s better to start at the beginning each time.</blockquote><br>
This is not a universal argument, but I am sure readers are well aware that debugging can be a rewarding experience when you are learning a programming language. Debugging computational protocols depends on fine-tuning parameters, which requires a close understanding of the algorithm in use. The same genome analysis pipeline may require different sets of BLAST parameters for dissimilar sequence families. This kind of learning is possible only when you challenge the user to come up with a better solution. <br>
<br>
(Check out the full presentation below; it is worthwhile.) <br>
<iframe src="http://docs.google.com/gview?url=http://www.vph-noe.eu/vph-repository/doc_download/113-jamesbassingthwaighte&amp;embedded=true" style="width: 600px; height: 480px;" frameborder="0"></iframe><br>
<br>
2. <span style="font-weight: bold;">Are the development process (for computer software, pipelines, models) and the end product, "reproducible research", two different things?</span> They look slightly unrelated, but they depend on each other. Employing best practices, such as the use of domain-specific standards (languages and guidelines) along with <a class="zem_slink" href="http://en.wikipedia.org/wiki/Computer_software" title="Computer software" rel="wikipedia" target="_blank">software</a> design principles in the development process, results in provenance-aware software, pipelines and models. Provenance refers to the process of tracing and recording the origins and evolution of the data. The analysis workflow or pipeline itself can be captured by process definition or workflow description languages, while the data remains seated in the wrapper of domain-specific markup languages, completing the full circle of provenance-aware tools. If you want total reproducibility, do this, and I promise nothing will get in the way. Unfortunately, it is hard to implement most or even part of the above suggestion, because when it comes to science, getting things done (GTD) is a bigger concern than the quality of the code we write. Workflow solutions like Taverna, Galaxy, KNIME, Kepler, Pipeline Pilot and InforSense KDE are designed to track the trace and history of scientific data and analysis workflows in the same way I suggested above. These tools have been available for a long time, and scientists have mostly failed to embrace these ideas and tools. Why? For the same reasons they don't embrace the very idea of web 2.0: because in science, the advertisement of scholarship (read: peer-reviewed publications) is more important than scholarship itself (which I call the learning experience).<br>
<br>
3. <span style="font-weight: bold;">What is the role of standards and guidelines? </span>In the last few years the scientific community has seen a flurry of activity over the development of standards and guidelines. Are they any good for reproducibility? Yes, indeed. Let me explain this with one example: can you understand the essence of an image published in a peer-reviewed scientific article without reading the text that follows it? The answer is pretty much no. By contrast, it might be easy to understand a circuit diagram published in an electrical engineering book, simply because most of the notation of circuit schematics is well standardized. People in the scientific community have realized this problem, and the result is a community-driven effort, <span style="font-style: italic;"><a class="zem_slink" href="http://en.wikipedia.org/wiki/Systems_Biology_Graphical_Notation" title="Systems Biology Graphical Notation" rel="wikipedia" target="_blank">Systems Biology Graphical Notation</a> (SBGN)</span>, to standardize graphical notation in biology. Although SBGN is in its early stages, it can play an important role in the way we communicate knowledge in biology. Similarly, in the <a class="zem_slink" href="http://en.wikipedia.org/wiki/Machine_learning" title="Machine learning" rel="wikipedia" target="_blank">machine learning</a> and data mining community, PMML (<a class="zem_slink" href="http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language" title="Predictive Model Markup Language" rel="wikipedia" target="_blank">Predictive Model Markup Language</a>) is doing a fantastic job of enabling users to reuse, post-process, and utilize data mining models irrespective of the tool that generated them. In my humble opinion, the notion of putting reproducibility into computational code is just magical thinking unless we have better underlying standards and guidelines to support it.<br>
<br>
4. <span style="font-weight: bold;">What can we learn from Unix and the open source model?</span> Using "make"-inspired systems for computational reproducibility is not new. "make" is a popular utility for automatically building executable programs and other non-source files from <a class="zem_slink" href="http://en.wikipedia.org/wiki/Source_code" title="Source code" rel="wikipedia" target="_blank">source code</a>. In the past, "make"-based tools <a href="http://www.reproducibility.org/wiki/Reproducible_computational_experiments_using_SCons#Tools_for_reproducible_research" target="_blank">such as Cons</a> have been used as a platform for reproducible research in scientific computing. Secondly, the scientific community should avoid any reliance on <a class="zem_slink" href="http://en.wikipedia.org/wiki/Proprietary_software" title="Proprietary software" rel="wikipedia" target="_blank">proprietary software</a>, whose closed-source nature leaves a big gap in the reproducibility workflow.<br>
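<br>
The core rule behind "make" is simple: rebuild a target only if it is missing or older than one of the files it depends on. The Python sketch below imitates that rule for a single result file. It is an illustration only, not an excerpt from "make" or Cons; the helper names (<em>is_stale</em>, <em>build</em>, <em>count_lines</em>), the file names and the timestamps are my own assumptions.<br>

```python
import os
import tempfile


def is_stale(target, dependencies):
    """Return True when target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(target, dependencies, recipe):
    """Run the recipe only when the target is out of date; report if it ran."""
    if is_stale(target, dependencies):
        recipe(target, dependencies)
        return True
    return False


def count_lines(target, dependencies):
    # Hypothetical analysis step: write the total line count of the inputs.
    total = 0
    for dep in dependencies:
        with open(dep) as handle:
            total += sum(1 for _ in handle)
    with open(target, "w") as out:
        out.write(f"{total}\n")


workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "measurements.csv")  # made-up input
result_file = os.path.join(workdir, "line_count.txt")  # made-up output
with open(data_file, "w") as handle:
    handle.write("a,1\nb,2\n")

# The result does not exist yet, so the recipe runs.
first_run = build(result_file, [data_file], count_lines)

# Pin explicit timestamps so the example does not depend on clock resolution:
# the result is now newer than the data, so nothing is rebuilt.
os.utime(data_file, (1000, 1000))
os.utime(result_file, (2000, 2000))
second_run = build(result_file, [data_file], count_lines)

# Pretend the data was edited after the result was produced: rebuild.
os.utime(data_file, (3000, 3000))
third_run = build(result_file, [data_file], count_lines)

print(first_run, second_run, third_run)
```

Anyone who receives the data and the recipe can regenerate the result with one call, and a changed input automatically triggers a rebuild.<br>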
<br>
<span style="font-weight: bold;">Closing thoughts</span><br>
The troubles of computational reproducibility are primarily those of ideology and practice. This may be a slightly controversial assertion, but I would argue that, technically, reproducibility in computational biology is not a big issue. It is the adoption of reproducible research practices that has been slow. Reproducibility-related discussions in computational biology should therefore focus more on how to inspire scientific communities to embrace better reproducible research practices.<br>
</div>
<br>
</span>
<div class="blogger-post-footer"><br>
<div><span style="font-style: italic;">The original article is
available at the <a href="http://www.abhishek-tiwari.com/" target="_blank">Fisheye Perspective</a> blog. Stay
tuned for more posts and subscribe to the <a href="http://feeds2.feedburner.com/AbhishekTiwarisBlog" target="_blank">RSS feed</a>.&nbsp;</span></div>
</div>]]></content><author>Abhishek Tiwari</author><category>Industry Perspective</category><wfCategory>reproducibility + scientific research,computer science,gene pattern,software design principles</wfCategory><comments>http://smartdatacollective.com/Home/24787#0</comments><pubDate>Fri, 05 Feb 2010 13:04:11 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24787</guid><feedburner:origLink>http://smartdatacollective.com/Home/24787</feedburner:origLink></item><item><title>Python Programs for Non-Python People</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/Nc6Fs6226rI/24776</link><description>Python programs written to run in BEGIN PROGRAM blocks are easy to write and add lots of functionality to IBM SPSS Statistics.   Many users have learned to create these.  More users, though, do not ...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/Nc6Fs6226rI" height="1" width="1"/&gt;</description><content><![CDATA[<p>Python programs written to run in BEGIN PROGRAM blocks are easy to write and add lots of functionality to IBM SPSS Statistics.&nbsp;  Many users have learned to create these.  More users, though, do not know the Python language.</p>
<p>The extension command mechanism provides a way for users of traditional SPSS syntax to run Python programs written by someone else without needing any knowledge of Python.&nbsp;  But the program must have been written as an extension command.&nbsp;  While creating extension commands isn’t hard, it does require some extra knowledge and work.&nbsp; (An article on how to do this is available on Developer Central.)</p>
<p>I have posted a new extension command, SPSSINC PROGRAM, that allows ordinary Python programs to be run with traditional syntax without the author having created an extension command: easy on the author and easy on the user.</p>
<p><span id="more-170"></span>Often someone writes and shares a Python program for use via BEGIN PROGRAM that requires some input parameters.&nbsp; The BEGIN PROGRAM syntax does not allow for parameters, so the user must modify the program itself to specify these.&nbsp; If you know Python, this is not a problem, but many users are uncomfortable doing that, since the Python language is quite different from the traditional SPSS language.</p>
<p>A Python programmer would typically define a function or class with the requisite parameters and just modify the function call.&nbsp; An SPSS user might not know how to do that.</p>
<p>SPSSINC PROGRAM solves this problem.&nbsp; (Of course, extension commands solve this problem more generally, but they take extra work to create.)&nbsp; Suppose I have a program saved as <em>mymodule.mypgm.py </em>that does something to a pair of SPSS variables, and I want the variable names to be parameters.&nbsp; Using SPSSINC PROGRAM, I would write</p>
<p>SPSSINC PROGRAM mymodule.mypgm firstvar secondvar.</p>
<p><em>firstvar </em>and <em>secondvar </em>would be the parameter values passed to the program.&nbsp; SPSSINC PROGRAM ensures that mymodule.mypgm is called,&nbsp; makes the parameters available to the program, and handles various error conditions.</p>
<p>Rather than being passed as function arguments, the parameters are set up as if they were a command line.&nbsp; The author of the program would access these in the traditional Python way via <em>sys.argv</em>, with the first parameter being the module and program name.&nbsp; It’s just as if the program were being run from a command shell, except that the parameter values have been passed through the SPSS Universal Parser.&nbsp; Comments in the implementation module detail the (small) differences this can produce in what the program sees.</p>
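<p>As a rough sketch of what the program side might look like, assuming only what is described above (the parameters arrive in <em>sys.argv</em>, with the module and program name first), a function could unpack its two variable names as shown below.&nbsp; The function body, the error message and the simulated argument list are hypothetical illustrations; the real work on the SPSS variables, and the way SPSSINC PROGRAM itself invokes the function, are not shown.</p>

```python
import sys


def mypgm(argv=None):
    """Read two variable names from a command-line style argument list."""
    # When called without arguments, fall back to sys.argv, which is where
    # the article says the parameters are made available.
    if argv is None:
        argv = sys.argv
    if len(argv) != 3:
        raise ValueError("expected exactly two variable names")
    # The first item is the module and program name, as with a shell command.
    program_name, firstvar, secondvar = argv
    # A real program would now do something with the two SPSS variables;
    # this sketch only returns what it received.
    return {"program": program_name, "variables": (firstvar, secondvar)}


# Simulate the argument list for: SPSSINC PROGRAM mymodule.mypgm firstvar secondvar.
result = mypgm(["mymodule.mypgm", "firstvar", "secondvar"])
print(result)
```

<p>Accepting an optional argument list makes the function easy to try outside SPSS, while still reading <em>sys.argv</em> by default.</p>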
<p>So using SPSSINC PROGRAM is very easy on the Python programmer while still letting the user of the program work in the style he or she is comfortable with.&nbsp; The package also includes a dialog box built with the Custom Dialog Builder where the user can enter the program name and any parameters.</p>
<br>]]></content><author>Jon Peck</author><category>Industry Perspective</category><wfCategory>open source,python programs,ibm spss statistics,firstvar + secondvar</wfCategory><comments>http://smartdatacollective.com/Home/24776#0</comments><pubDate>Thu, 04 Feb 2010 19:47:19 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24776</guid><feedburner:origLink>http://smartdatacollective.com/Home/24776</feedburner:origLink></item><item><title>Accenture study: Companies structured for gut-thinking, not analytics</title><link>http://feedproxy.google.com/~r/smartdatacollective_allposts/~3/EMZ7JZ0CQs0/24775</link><description>An Accenture study released today has some grim news for the analytically-minded. The survey of 600 senior managers shows that more than half of the blue-chip companies have structures that block anal...&lt;img src="http://feeds.feedburner.com/~r/smartdatacollective_allposts/~4/EMZ7JZ0CQs0" height="1" width="1"/&gt;</description><content><![CDATA[An Accenture <a href="http://newsroom.accenture.com/article_display.cfm?article_id=4935" target="_blank">study released today</a> has some grim news for the analytically-minded. The survey of 600 senior managers shows that more than half of the blue-chip companies have structures that block analytics from decision-making. The problem: There's not enough talent, and the talent they have is often cordoned off in a geek wing.<br>
<br>
Interestingly, while 71% of the top managers expressed strong commitment to statistical analysis in their companies, they often don't practice what they preach: <br>
<br>
<em><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; font-size: medium;"><span class="Apple-style-span" style="font-family: Arial,sans-serif; font-size: 12px; line-height: 16px;">"...The research revealed that senior managers currently fail to see fact- and data-driven analysis as critical when making key business decisions and instead rely heavily on ...gut feel..."; and ..."factors such as consultation with others, intuition and experience</span></span>..."</em>
<div>
<em><br>
</em>
</div>
<div>
Makes sense. If you've crawled your way to the top of a big company, you're probably pretty confident in the gut-based decisions that guided you. So analytics might be fine for everyone else... As today's <a href="http://www.nytimes.com/2010/02/04/opinion/04brass.html?scp=2...amp;sq=microsoft...amp;st=cse" target="_blank">Times op-ed</a> on dysfunction at Microsoft makes clear, the status quo has a rock-solid constituency in just about any company.
</div>
<br>
<br>
<br>
<a href="http://www.thenumerati.net/index.cfm?postID=515" title="http://www.thenumerati.net/index.cfm?postID=515">Link to original post</a><br>]]></content><author>Stephen Baker</author><category>Business Intelligence</category><category>Industry Perspective</category><wfCategory>statistical analysis,accenture</wfCategory><comments>http://smartdatacollective.com/Home/24775#0</comments><pubDate>Thu, 04 Feb 2010 19:28:32 GMT</pubDate><guid isPermaLink="false">http://smartdatacollective.com/Home/24775</guid><feedburner:origLink>http://smartdatacollective.com/Home/24775</feedburner:origLink></item></channel></rss>
