<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>The Keplar LLP blog</title>
	
	<link>/blog</link>
	<description>Blogging from the team at Keplar LLP</description>
	<lastBuildDate>Fri, 14 Sep 2012 07:09:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/KeplarLLPBlog" /><feedburner:info uri="keplarllpblog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Faking cohort analysis with Google Analytics</title>
		<link>/blog/2012/06/faking-cohort-analysis-with-google-analytics</link>
		<comments>/blog/2012/06/faking-cohort-analysis-with-google-analytics#comments</comments>
		<pubDate>Thu, 21 Jun 2012 16:30:00 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[Google Analytics]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2694</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li><a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">Performing cohort analysis on web analytics data using SnowPlow</a></li>
<li><a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></a></li>
<li><a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">On the wide variety of cohort analyses</a></li>
<li><a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">Approaches to measuring user engagement as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">Approaches to measuring customer value as part of cohort analysis</a></li>
<li>Faking cohort analysis with Google Analytics</li>
</ul>
</div>
<p>In the previous posts in this series on cohort analyses, we looked at what cohort analysis is, explored the wide variety of cohort analyses that are possible and walked through the steps necessary to perform them using <a href="http://snowplowanalytics.com" target="_blank">SnowPlow</a>. In this post, we look at how to perform cohort analyses in Google Analytics. As will quickly become apparent, Google Analytics is not well suited to performing cohort analyses. Hence, although this blog post will be useful to people who <i>have</i> to use Google Analytics to perform their cohort analyses, it should be more helpful to analysts trying to identify the advantages of using SnowPlow alongside (or instead of) Google Analytics, and how those advantages derive from the fundamentally different approach SnowPlow takes to web analytics.</p>
<p style="text-align:center;"><div class="wp-caption aligncenter" style="width: 460px"><img title="Cohort analysis with Google Analytics" src="http://blog.kepstatic.com/2012/06/snow-car-stuck.jpg" alt="" width="450"/><p class="wp-caption-text">The wrong tool for the job</p></div></p>
<p><span id="more-2694"></span></p>
<p>When performing a cohort analysis, we need our web analytics to enable us to:</p>
<ol>
<li>Report the metric we want to compare between the cohorts</li>
<li>Report the metric value for each of our cohorts</li>
<li>Compare the metric value for each of our cohorts alongside each other</li>
</ol>
<p><strong>1. Reporting the metric we want to compare between cohorts</strong></p>
<p>Google Analytics provides analysts with a decent set of metrics to report on, including all the usual suspects:</p>
<ul>
<li>Visits</li>
<li>Unique visitors</li>
<li>Page views</li>
<li>Pages per visit</li>
<li>Bounce rate</li>
<li>Conversion rates</li>
</ul>
<p>That&#8217;s not a bad set of metrics, but it&#8217;s not nearly as wide a set as that provided by SnowPlow. As we&#8217;ve shown in previous posts, SnowPlow offers analysts a lot of flexibility to define metrics to measure important things like <a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">user engagement</a> and <a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">customer lifetime value</a>. Both of these are very hard to approximate using the more limited set of metrics Google Analytics provides. For example, the best we can do on &#8220;engagement&#8221; is to look at the number of pages per visit, or the fraction of users that reach a particular point in a pre-defined funnel. We certainly cannot develop our own models of engagement that assign different values to different user actions, and sum them to get a score for each of our users, as we can with SnowPlow.</p>
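<p>To make the idea of a custom engagement model concrete, here is a minimal sketch in Python. It assumes we already have event-level data as (user, action) pairs. The action names and weights are invented for illustration and are not part of SnowPlow.</p>

```python
from collections import defaultdict

# Hypothetical action weights: illustrative values, not taken from SnowPlow.
ACTION_WEIGHTS = {"page_view": 1.0, "comment": 5.0, "share": 8.0}


def engagement_scores(events):
    """Sum the weight of every action per user.

    `events` is an iterable of (user_id, action) pairs. Actions without
    a configured weight contribute nothing to the score.
    """
    scores = defaultdict(float)
    for user_id, action in events:
        scores[user_id] += ACTION_WEIGHTS.get(action, 0.0)
    return dict(scores)


# Invented sample events.
events = [
    ("alice", "page_view"),
    ("alice", "comment"),
    ("bob", "page_view"),
    ("bob", "page_view"),
    ("alice", "share"),
]
print(engagement_scores(events))
```

<p>The point of the sketch is that the weights are ours to choose and revise, which is only possible when we can get at the individual events behind each user.</p>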
<p><strong>2. Defining our cohorts, and reporting the metric above for each cohort</strong></p>
<p>There is a wide range of different cohorts we might want to define in different circumstances:</p>
<ul>
<li>Which marketing channel a customer was acquired on (useful, for example, if we want to compare the return on marketing investment between different channels, with a view to optimising ad spend going forwards)</li>
<li>Which month a person signed up to a service (useful if we want to measure improvements in how effectively we are converting customers over time, as in the <a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">lean startup example</a>.)</li>
<li>Customer profile data e.g. gender, age (useful if we believe that customer behavior varies as a function of demography, and we want to examine how successfully we are serving different customer segments)</li>
</ul>
<p>Google Analytics provides analysts with the ability to report on a range of subsets of the total userbase including:</p>
<ul>
<li>Where the user is based (geography)</li>
<li>Whether this is a new visitor, or someone who has visited the site before</li>
<li>What type of browser the user is running</li>
<li>What traffic source drove the user to the site on this occasion</li>
</ul>
<p>Google recently upgraded Analytics to provide users with the ability to define &#8220;advanced segments&#8221; of visits that meet multiple criteria, giving analysts improved flexibility to define their own cohorts:</p>
<p style="text-align:center;"><img title="Defining segments with Google Analytics" src="http://blog.kepstatic.com/2012/06/advanced-segments-in-google-analytics.png" alt=""/></p>
<p>That functionality is great, but it has one critical limitation from an analysis perspective: <strong>we can only segment users based on variables related to this particular visit (i.e. this session). We cannot segment users based on their past behaviour &#8211; i.e. what happened to them on previous visits.</strong> To take two examples:</p>
<ol>
<li>We cannot segment users based on when they started using our service, because that requires looking up the date of their <i>first visit</i>, which relates to a previous user session.</li>
<li>We cannot segment users based on whether they have seen a particular ad on their customer journey, because Google will only enable us to query which ads they have seen on this particular visit. So if a customer was originally acquired through a particular PPC campaign, but visited the site most recently by coming to it directly, we cannot use that when defining our cohort.</li>
</ol>
<p>The above problems both stem from the same design feature in Google Analytics: namely that &#8220;visits&#8221; are the primary units of analysis, with Google only providing us with limited tools to analyse an individual&#8217;s behaviour across multiple customer visits. That is a big drawback when you&#8217;re doing cohort analysis, because however a cohort is defined, the customer is always the primary unit of analysis. You may define your customer segment based on the customer&#8217;s behaviour on a particular visit, but limiting yourself to the most recent visit makes the vast majority of cohort analyses very tricky with Google Analytics.</p>
<p>There is a way to work around the above limitation, however. If we have a particular variable that we want to use in a cohort analysis, we need to assign it to the user in our own web platform (e.g. CMS or ecommerce package), and then pass that information to Google Analytics every time the user visits our site, by setting a custom variable in the JavaScript. This, as it sounds, is not trivial. It is the basis for the solutions to performing cohort analysis using Google Analytics proposed by <a href="http://danhilltech.tumblr.com/post/12509218078/startups-hacking-a-cohort-analysis-with-google" target="_blank">Dan Hill</a> and <a href="http://techpad.co.uk/content.php?sid=192" target="_blank">Matt Clarke</a>.</p>
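<p>As a rough sketch of that workaround, the Python below shows how a web platform might render the tracking call for each page. It assumes the classic asynchronous ga.js snippet, where <code>_setCustomVar</code> takes a slot index, a name, a value and a scope. The helper name <code>custom_var_snippet</code> and the <code>signup_month</code> variable are hypothetical, and this is not the exact code used in the posts linked above.</p>

```python
import json


def custom_var_snippet(slot, name, value):
    """Render a classic ga.js call that sets a visitor-level custom variable.

    The trailing 1 is the scope argument of _setCustomVar (1 = visitor), so
    the value stays attached to that visitor on later sessions.
    """
    if slot not in range(1, 6):
        raise ValueError("this sketch assumes the five standard slots, 1 to 5")
    # json.dumps yields correctly quoted and escaped JavaScript string literals.
    return "_gaq.push(['_setCustomVar', %d, %s, %s, 1]);" % (
        slot, json.dumps(name), json.dumps(value))


# Hypothetical example: tag the visitor with the month they signed up.
print(custom_var_snippet(1, "signup_month", "2012-06"))
```

<p>The rendered line needs to sit in the page before the <code>_trackPageview</code> call so that the variable is sent along with the page view. The value itself (here, the signup month) has to come from our own user records, which is exactly the extra work the workaround demands.</p>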
<p><strong>3. Comparing the metric value for each of our cohorts alongside each other</strong></p>
<p>In an ideal world, we would want Google Analytics to present us with a plot of our metric against different cohorts:</p>
<p style="text-align:center;"><img title="Cohort analysis output schematic" src="http://blog.kepstatic.com/2012/06/ideal-cohort-analysis.png" alt=""/></p>
<p>Unfortunately, Google won&#8217;t give us our data in quite that format. Performing the cohort analysis, then, is a multi-step process:</p>
<ol>
<li>Go to the advanced segment interface, and either select a segment that corresponds to your first cohort (if the segment already exists), or create a new custom segment if not.
<img title="The advanced segments button" src="http://blog.kepstatic.com/2012/06/advanced-segments-button.PNG" alt="" width="350"/>
<img title="Cohort analysis output schematic" src="http://blog.kepstatic.com/2012/06/create-your-own-custom-segment.PNG" alt="" width="500"/>
</li>
<li>Once you have selected the appropriate segment, Google will only report results for this particular cohort. Now navigate to the metric you want to measure for this cohort, using the options in the left hand menu. Chances are that you will want to look at <i>Overview</i> for the standard metrics (visits, page views etc.), <i>Behaviour -> engagement</i> to measure visit duration or <i>Conversions</i> to measure propensity of each cohort to convert.</li>
<li>Now that you have your value for the particular cohort, record that. (Probably most easily done by downloading the relevant report in CSV format.)</li>
<li>Repeat the above steps for each different cohort.</li>
<li>Collate the results in Microsoft Excel (or equivalent).</li>
</ol>
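<p>The final collation step can be scripted rather than done by hand. The sketch below assumes each cohort&#8217;s export has been reduced to CSV text with <code>metric</code> and <code>value</code> columns. That layout, the cohort names and the numbers are invented for illustration, and real Google Analytics exports will need some clean-up first.</p>

```python
import csv
import io


def collate(exports):
    """Merge per-cohort exports into one table: a row per metric, a column per cohort.

    `exports` maps a cohort name to CSV text with `metric` and `value` columns.
    """
    cohorts = list(exports)
    table = {}
    for cohort, text in exports.items():
        for row in csv.DictReader(io.StringIO(text)):
            table.setdefault(row["metric"], {})[cohort] = row["value"]
    header = ["metric"] + cohorts
    # A cohort that lacks a metric gets an empty cell rather than an error.
    rows = [[metric] + [values.get(c, "") for c in cohorts]
            for metric, values in table.items()]
    return [header] + rows


# Invented sample data for two signup-month cohorts.
exports = {
    "signed_up_2012_04": "metric,value\nvisits,1200\nconversion_rate,0.031\n",
    "signed_up_2012_05": "metric,value\nvisits,1500\nconversion_rate,0.036\n",
}
for line in collate(exports):
    print(",".join(line))
```

<p>The result is the side-by-side view that Google Analytics does not give us directly: one row per metric, with a column for each cohort.</p>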
<p><strong>Google Analytics: not the best tool for cohort analysis</strong></p>
<p>As we&#8217;ve seen above, it is possible to perform cohort analyses with Google Analytics. However, it is not easy. In particular, two limitations stand out:</p>
<ol>
<li>It is not possible to define cohorts in Google Analytics based on arbitrary user data, only on data that is associated with the user&#8217;s most recent visit</li>
<li>Results from different cohorts have to be manually collated: Google will not present them alongside each other (although this step can be automated using Google&#8217;s API, once the analyst has defined the segments in the web UI)</li>
</ol>
<p>In contrast, SnowPlow gives analysts the ability to create segments based on any user data, and returns the complete result set (for all relevant cohorts) in a single table.</p>
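<p>To illustrate why customer-level data makes this straightforward, here is a minimal Python sketch that defines cohorts by the month of each user&#8217;s <i>first</i> visit and returns a metric for every cohort in one pass. The input layout and the sample visits are assumptions for illustration. This is not SnowPlow&#8217;s actual schema or query language.</p>

```python
from collections import defaultdict


def visits_per_user_by_cohort(visits):
    """Group users by the month of their first visit and average their visit counts.

    `visits` is an iterable of (user_id, iso_date) pairs, one per visit.
    """
    first_seen = {}
    visit_counts = defaultdict(int)
    for user_id, date in visits:
        visit_counts[user_id] += 1
        # ISO dates sort lexicographically, so min() finds the earliest visit.
        first_seen[user_id] = min(date, first_seen.get(user_id, date))
    cohort_totals = defaultdict(lambda: [0, 0])  # cohort: [users, visits]
    for user_id, date in first_seen.items():
        totals = cohort_totals[date[:7]]  # cohort key is "YYYY-MM"
        totals[0] += 1
        totals[1] += visit_counts[user_id]
    return {cohort: n_visits / n_users
            for cohort, (n_users, n_visits) in sorted(cohort_totals.items())}


# Invented sample visits.
visits = [
    ("u1", "2012-04-03"), ("u1", "2012-05-10"), ("u1", "2012-06-01"),
    ("u2", "2012-04-20"),
    ("u3", "2012-05-02"), ("u3", "2012-05-30"), ("u3", "2012-06-11"),
]
print(visits_per_user_by_cohort(visits))
```

<p>The cohort is a property of the user, looked up across all of their visits, which is the step a visit-centric tool cannot perform.</p>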
<p>Interested in finding out more about how SnowPlow can empower analysis at your company? Then find out more from the <a href="http://snowplowanalytics.com" title="SnowPlow Analytics official website">SnowPlow website</a>, <a href="http://snowplowanalytics.com/" target="_blank">SnowPlow wiki</a>, or <a href="/contact" target="_blank">get in touch</a>.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/06/faking-cohort-analysis-with-google-analytics/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is wrong with web analytics in 2012? (And how SnowPlow starts to fix it)</title>
		<link>/blog/2012/06/what-is-wrong-with-web-analytics-in-2012-and-how-snowplow-starts-to-fix-it</link>
		<comments>/blog/2012/06/what-is-wrong-with-web-analytics-in-2012-and-how-snowplow-starts-to-fix-it#comments</comments>
		<pubDate>Mon, 11 Jun 2012 20:01:55 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[analytics limitations]]></category>
		<category><![CDATA[analytics problems]]></category>
		<category><![CDATA[atomic data]]></category>
		<category><![CDATA[customer-level analytics]]></category>
		<category><![CDATA[vanity metrics]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2656</guid>
		<description><![CDATA[The introducing SnowPlow blog post series Introducing SnowPlow: the world&#8217;s most powerful web analytics platform SnowPlow update: first part of source code published Warehousing your online ad data with SnowPlow Reintroducing SnowPlow: what is wrong with web analytics in 2012 and how we fix it. (Presentation version) What is wrong with web analytics in 2012, [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The introducing SnowPlow blog post series</p>
<ul>
<li><a href="/blog/2012/02/introducing-snowplow-the-worlds-most-powerful-web-analytics-platform">Introducing SnowPlow: the world&#8217;s most powerful web analytics platform</a></li>
<li><a href="/blog/2012/03/snowplow-update-first-part-of-source-code-published">SnowPlow update: first part of source code published</a></li>
<li><a href="/blog/2012/05/warehousing-your-online-ad-data-with-snowplow">Warehousing your online ad data with SnowPlow</a></li>
<li><a href="/blog/2012/05/re-introducing-snowplow-a-new-approach-to-web-analytics">Reintroducing SnowPlow: what is wrong with web analytics in 2012 and how we fix it. (Presentation version)</a></li>
<li>What is wrong with web analytics in 2012, and how SnowPlow starts to fix it. (Blog post version)</li>
</ul>
</div>
<p>We developed <a href="http://snowplowanalytics.com">SnowPlow</a> out of <a href="http://snowplowanalytics.com/product/why-snowplow.html" title="What is wrong with web analytics in 2012" >frustration with the limitations of web analytics solutions today</a>. We believe that there are many things that are wrong with today&#8217;s web analytics packages. In the last few weeks, demoing SnowPlow to different companies and discussing where web analytics falls short for them, we&#8217;ve seen that we&#8217;re not alone in believing that the web analytics world is in need of a shake-up. However, that view is by no means universal, so we thought we&#8217;d take the debate online, and explain our position here.</p>
<p style="text-align:center;"><div class="wp-caption aligncenter" style="width: 410px"><img title="What is wrong with web analytics in 2012" src="http://blog.kepstatic.com/2012/06/overloaded-truck.jpg" alt="" width="400"/><p class="wp-caption-text">Web analytics solutions no longer support all the applications for web analytics data</p></div></p>
<p><span id="more-2656"></span></p>
<h3>What&#8217;s wrong with web analytics in 2012?</h3>
<p><strong>Too narrowly focused</strong></p>
<p>Web analytics tools are narrowly focused on:</p>
<ol>
<li>Marketing-related metrics (sources of traffic, visit numbers over time)</li>
<li>Ecommerce and online retail sites (clickthroughs, conversions, tracking a flow through a pre-defined funnel)</li>
<li>Conventional online publishing sites (page views by web page)</li>
</ol>
<p>Web analytics tools have not been developed for the wide range of digital businesses that are active online today, including an enormous number of B2B SaaS services, online games companies (including social and massively multiplayer games), financial services companies, media broadcasters and social networks, to name just a few categories. These companies are incredibly poorly served by today&#8217;s web analytics solutions.</p>
<p>It is not just whole industries which are poorly served by today&#8217;s web analytics options: the current focus of web analytics packages on marketing activities means that even in the target industries (i.e. online retailers and conventional online publishers), it is hard for anyone outside of Marketing to use the tools &#8211; for example:</p>
<ol>
<li><b>Product managers</b> find it hard to effectively use web analytics tools to understand how users engage with their website, how new versions of the website improve engagement and how that improved engagement drives improved commercial outcomes. Worse, it is all but impossible to use current tools to spot where the product is currently not working (e.g. where users are getting &#8220;stuck&#8221;, and so where product development efforts should focus)</li>
<li><b>Developers</b> find current web analytics packages far too crude to perform the type of event-level inspection and monitoring needed to understand how their applications are standing up to real-world use</li>
<li><b>Customer relationship managers</b> typically do not use customer web analytics data to improve their knowledge of individual customers and develop customer segmentations</li>
</ol>
<p><strong>Inflexible</strong></p>
<p>One of the things that surprises us most is that web analytics packages are, as a whole, highly inflexible. It is not easy to use them as a basis to build custom reports &#8211; reports that are not simply &#8220;recuts&#8221; on a subset of page views, visits and unique visitors. For example, try building a report to compare the lifetime value of users acquired from one marketing channel with another, or looking at how user engagement varies by when the user signed up to a service.</p>
<p>This is surprising because on the whole, companies are getting much better at building tools that developers and data scientists can take and extend to suit their own needs. Compare the growth in the number of libraries available for analysts using <a href="http://www.r-project.org/">R</a> or <a href="http://mahout.apache.org/">Mahout</a>, as examples of analytics products that have been developed by a wide and disparate community of analysts. What kind of comparable developments have we seen in web analytics? Google&#8217;s Analytics API has been available for ages, but all it has been used for is to <a href="https://github.com/keplar/google-analytics-export-to-csv" target="_blank">export cuts of data</a> and come up with sexy visualisations of dubious analytic value. That is not the fault of the community around the product: Google&#8217;s Analytics API has simply not been built to enable the kind of analytic innovation we&#8217;re looking for, because data is only exposed in an aggregated form, limiting the scope for data scientists and other programmers to reimagine how it might be used in very different types of analysis.</p>
<p><strong>Too high-level</strong></p>
<p>As mentioned above, web analytics packages typically only provide aggregated data. They do not make data available about individual users, making it hard to use web analytics data to diagnose what is going wrong or right with a particular customer journey, let alone to use that data to improve and personalise that journey in the future.</p>
<p><strong>Too low-level</strong></p>
<p>It should not be possible for web analytics tools to be both too high-level and too low-level, but today&#8217;s vendors have managed just that. Opening up a web analytics package, analysts are presented with a sea of metrics: hundreds of carefully collated numbers, very few of which actually matter for business success or decision making. By drowning us in &#8220;vanity metrics&#8221;, the current crop of web analytics products make it hard for us to see the wood for the trees.</p>
<p><strong>Siloed</strong></p>
<p>It is several decades since Ralph Kimball argued that companies should be drawing data from multiple sources together in a single repository so that analysts could use that data to derive insights. Those compelling arguments for warehousing data have been accepted by the whole analytics community. Sadly web analytics tools have not been built to facilitate this. Web analytics data still typically sits outside of company data warehouses and BI systems in its own impregnable silo. Because of this, web analytics data is rarely incorporated as part of a single customer view &#8211; which is shocking given how much modern companies engage with their customers via the web.</p>
<h3>What are the consequences for businesses?</h3>
<p>The consequences for business are simple to articulate: business questions that should be possible to answer using web analytics data are not. We divide those questions into two categories, customer analytics and product analytics:</p>
<p><strong>Customer analytics</strong></p>
<p>How effectively a company understands their customers, and uses that understanding to drive customer loyalty and value, is a key determinant of success in a wide range of industries. For just one example, look at how Tesco&#8217;s use of Clubcard data via Dunnhumby helped propel it from the second largest supermarket in the UK to a global behemoth not far behind Walmart. Web analytics should provide a wealth of detailed customer data based on the way those customers engage with our online products and services. Sadly, web analytics&#8217; relentless focus on visits and page views has totally obscured the customer. The following customer-level questions are all very hard to answer using web analytics data, yet these are basic questions that every consumer-facing business should be asking and answering:</p>
<ol>
<li>Who are our most valuable customers?</li>
<li>How can I spot those valuable customers in advance? (What are the key predictors of value?)</li>
<li>What are the &#8220;sliding doors&#8221; moments that move a customer from a less valuable to a more valuable segment, or vice-versa?</li>
<li>How should we break down our customer-base by behaviour? How do segments vary, by value?</li>
<li>How well do we serve each customer segment? Are some (potentially valuable) segments less well served than others? What parts of the product should we focus on developing to improve the service level for those segments?</li>
<li>What are the best opportunities for growing the value of my customer base?</li>
</ol>
<p><strong>Product analytics</strong></p>
<p>Any company with an online presence should also be interested in using web analytics data to help make robust product development decisions. With the current crop of tools, however, this is very difficult: current tools enable A/B and multivariate testing but little else. In particular, it is hard to use today&#8217;s web analytics tools to answer:</p>
<ol>
<li>How successful has each product iteration been at driving user engagement?</li>
<li>Are there parts of our customers&#8217; journeys that are better served by the product than others? What parts of the journeys need improving?</li>
<li>Where should we focus our product development efforts?</li>
</ol>
<h3>How does SnowPlow enable businesses to answer the above questions?</h3>
<p>For us, the key to empowering data scientists and analysts to answer the above questions is to start by giving them access to the underlying raw (aka &#8220;atomic&#8221;) data, in a straightforward format that makes it easy to query.</p>
<p>The strength of this approach is that it gives analysts maximum flexibility to take the data, query it directly, and use the most appropriate analytics tools to perform the analyses that suit the companies, data sets and business questions they work with. For example, measuring customer lifetime value for a bank is going to look very different from measuring it for a telco, which in turn will look different from measuring it for a retailer.</p>
<p>Rather than spend time developing our own reporting functionality to meet <i>all</i> the potential analyses that you might want to perform with web analytics data, we have developed a platform that gives analysts the atomic, customer-level data and lets them use whichever tools they believe will most effectively work on the data to meet their needs.</p>
<p>The focus of our approach, then, has been on making it easy for companies to collect <i>all</i> of the possible data they can from their web analytics system, and store it in the cloud in an infrastructure that scales effortlessly with enormous data volumes.</p>
<p>Analysts can then use the wide variety of general analytics tools that are available. Unlike web analytics packages, these have been developing very rapidly. They include:</p>
<ol>
<li>Statistical and modelling tools e.g. <a href="http://www.r-project.org/" target="_blank">R</a></li>
<li>Slice-and-dice OLAP technology e.g. <a href="http://www.microstrategy.co.uk/" target="_blank">Microstrategy</a>, <a href="http://www.tableausoftware.com/" target="_blank">Tableau</a></li>
<li>Behavioural database technologies e.g. <a href="http://blog.skylandlabs.com/introduction-to-behavioral-databases/" target="_blank">Skylab</a></li>
<li>Machine-learning and data mining tools e.g. <a href="http://weka.pentaho.com/">Weka</a>, <a href="http://mahout.apache.org/" target="_blank">Mahout</a></li>
</ol>
<h3>Making the underlying data available is not enough</h3>
<p>Making the underlying data available to analysts is a big step in the right direction, but it is not enough to fix web analytics. Fortunately, SnowPlow has a number of other key features:</p>
<p><strong>Powerful, scalable, flexible analytics with Apache Hive</strong></p>
<p><a href="http://hive.apache.org/" target="_blank">Hive</a> was developed at Facebook to enable analysts there to perform very involved analyses of how Facebook&#8217;s users engage with the product. Built on top of Hadoop, it is incredibly scalable. It is also very flexible, with a framework that lets analysts develop their own functions to use as part of their queries. Best of all, it is easy to use for any analyst with passing knowledge of SQL.</p>
<p><strong>The ability to upload data from other sources into SnowPlow</strong></p>
<p>It is straightforward to add additional data sources into SnowPlow and perform analytics against the combined data set. CRM data, data from social networks or even analyst-generated data (e.g. the values of different SnowPlow &#8220;events&#8221; based on models developed from SnowPlow data using other analytics systems like R) can be uploaded into SnowPlow and joined with SnowPlow&#8217;s own data. This makes SnowPlow an ideal place to assemble a &#8220;single customer view&#8221; which incorporates the customer&#8217;s all-important web behaviours, helping you to answer business questions that require analysis across a range of customer-data sources.</p>
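<p>The kind of join described above can be sketched in a few lines of Python. The field names and records are invented, and in practice the join would run inside SnowPlow&#8217;s own query environment rather than in application code. The sketch only shows the shape of the result.</p>

```python
def join_with_crm(web_metrics, crm_records):
    """Left-join per-user web metrics with CRM attributes on user ID.

    Users missing from the CRM keep their web metrics and gain no extra fields.
    """
    return {user_id: {**metrics, **crm_records.get(user_id, {})}
            for user_id, metrics in web_metrics.items()}


# Invented sample records keyed by a shared user ID.
web_metrics = {"u1": {"visits": 12}, "u2": {"visits": 3}}
crm_records = {"u1": {"segment": "premium"}}
single_view = join_with_crm(web_metrics, crm_records)
print(single_view)
```

<p>The one hard requirement is a user identifier that both data sources share. Without it there is nothing to join on.</p>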
<p><strong>More to come</strong></p>
<p>There&#8217;s a lot more work to do to take web analytics to where it should be in 2012. The areas we are actively working on to develop SnowPlow functionality include:</p>
<ol>
<li>Developing connectors to pipe SnowPlow data into analytics databases, to enable faster train-of-thought analysis and lower-cost repeated querying</li>
<li>Designing connectors to pipe SnowPlow data into behavioural databases, and developing a compelling analytic toolset around behaviour and eventstream analysis</li>
<li>Building additional clients to generate SnowPlow event data from mobile apps, Flash games, desktop apps and so on</li>
<li>Developing tools to enable less technically-savvy analysts to get more value out of the SnowPlow data</li>
<li>Designing a range of tools to enable companies to use web analytics data from SnowPlow in real-time operational systems e.g. product recommendation, content personalisation</li>
</ol>
<p><strong>Interested in learning more about SnowPlow?</strong></p>
<p>Visit the SnowPlow <a href="https://github.com/snowplow/snowplow">Github</a> repository, check out the code and <a href="https://github.com/snowplow/snowplow/wiki">wiki</a>. Or if you prefer you can <a href="mailto:snowplow@keplarllp.com">send us an email</a>.</p>
<p><strong>Interested in joining us on this journey?</strong></p>
<p>We see a revolution coming in web analytics. We are only just beginning to scratch the surface of what is possible with customer event-data. SnowPlow is an open source project, and we encourage anyone with an interest in using web analytics data in novel and interesting ways to work with us to develop the SnowPlow platform and the analysis methodology and toolset around it. Start off by checking out the <a href="http://github.com/snowplow/snowplow">SnowPlow repo on Github</a> and the new <a href="http://snowplowanalytics.com">SnowPlow Analytics website</a>.</p>
<p>Note &#8211; the arguments presented here are much the same as those we gave in our presentation <a href="/blog/2012/05/re-introducing-snowplow-a-new-approach-to-web-analytics">Re-introducing SnowPlow: a New approach to Web Analytics</a>. Here, we describe them in long-form.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/06/what-is-wrong-with-web-analytics-in-2012-and-how-snowplow-starts-to-fix-it/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Different approaches to measuring customer lifetime value with SnowPlow</title>
		<link>/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow</link>
		<comments>/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow#comments</comments>
		<pubDate>Sat, 09 Jun 2012 14:49:16 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[E-commerce]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[CLV]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[customer lifetime value]]></category>
		<category><![CDATA[customer value modelling]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2617</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li><a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">Performing cohort analysis on web analytics data using SnowPlow</a></li>
<li><a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></a></li>
<li><a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">On the wide variety of cohort analyses</a></li>
<li><a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">Approaches to measuring user engagement as part of cohort analysis</a></li>
<li>Approaches to measuring customer value as part of cohort analysis</li>
<li><a href="/blog/2012/06/faking-cohort-analysis-with-google-analytics">Faking cohort analysis with Google Analytics</a></li>
</ul>
</div>
<p>As part of our <i>cohort analysis series</i>, we have emphasized that a <a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">wide variety of different cohort analyses</a> are possible, depending on the business question to be answered. To recap quickly, we can vary a cohort analysis by the metric we use to compare cohorts, and by how we define those cohorts. We have written a post about comparing <a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">user engagement</a> between different cohorts, and how this is especially valuable to social networks, community sites and publishers. In this post, we look at comparing <strong>customer value</strong>, including <strong>customer lifetime value (CLV)</strong>, between cohorts. We explain why this is important to <i>all</i> companies whose business models depend, at the end of the day, on monetizing users &#8211; including retailers, media companies and financial services companies. Lastly, we look at how to measure these values in <a href="http://snowplowanalytics.com">SnowPlow</a>, so that an appropriate cohort analysis can then be performed, as described in our <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">previous blog post</a>.</p>
<p style="text-align:center;"><div class="wp-caption aligncenter" style="width: 382px"><img title="Weighing your customers' value" src="http://blog.kepstatic.com/2012/06/feet-on-weighing-scale.jpg" alt="" width="372" /><p class="wp-caption-text">Weighing your customers' value</p></div></p>
<p><span id="more-2617"></span><br />
<strong>Defining customer value and customer lifetime value</strong></p>
<p>The idea behind customer value and customer lifetime value in particular is very simple, in spite of all of the nuances associated with the mathematics of calculating them. The idea is that each customer provides our business with a certain amount of value (which we can measure as profit or revenue) over a period of time. That value might come in a number of ways: to take the example of a retailer, a customer might make a number of purchases from us over an extended time period, and we need to include the value of <i>all</i> those purchases when assessing that customer&#8217;s value. But the value that customer provides might not lie <i>only</i> in the purchases she makes. She may recommend products to her friends, thus providing additional value in terms of marketing / customer acquisition. She may write product reviews on the site: thereby making the site more engaging for other customers and prospective customers. These additional activities are valuable, but hard to quantify precisely.</p>
<p>Customer lifetime value or CLV is an extension of the concept of customer value. When we talk about a customer&#8217;s value, we talk in terms of their value over a period of time. If we want to be more rigorous, we might want to quantify the total value of the customer over their complete &#8220;lifetime&#8221; &#8211; i.e. the time period over which that person is a customer of ours. For some types of business (e.g. a cable subscription business), defining the lifetime is very easy: it&#8217;s just the length of time that person was a subscriber. For others (such as retailers), it is harder to quantify, because it is not clear when a person <i>stops</i> being a customer: there is always the possibility they will revisit the site and make another purchase.</p>
<p><strong>Why are customer value and customer lifetime value so important?</strong></p>
<p>Historically, different types of businesses have optimised themselves against different metrics, depending on their particular commercial dynamics. For example: hotels and airlines have optimised on occupancy; newspapers, car manufacturers and record labels on units sold; financial services firms on the number and profitability of products sold.</p>
<p>There are two major weaknesses with metrics such as these:</p>
<ol>
<li>They obscure the customer-company relationship at the heart of every company&#8217;s business, and instead focus attention on supply-side issues like product production, capacity and distribution, which are generally much easier to control than demand-side issues like the way a customer feels about your product, service and brand.</li>
<li>By not looking at the customer journey as a whole, these metrics lead companies into making decisions that often have unintended negative consequences at a different part of the customer journey.</li>
</ol>
<p>To start with the first weakness: record labels have historically made all of their money selling CDs, and consequently have spent a lot of analytical effort understanding which are the best selling artists by geography, and then making sure that they distribute the right volume of CDs to those geographies to maximize profits. (Not so many as to depress CD prices, not so few as to artificially limit sales.) Today, the music industry has changed so that music is distributed digitally (so no fretting about matching CD supply with demand), and labels monetize their artists through record sales, streaming deals, live concerts and merchandising. Record companies therefore need to focus on understanding not only the complete value of each artist <strong>but also</strong> the complete value per fan. It is only when a record label understands which are its most valuable fans (those that go to the concerts, buy the merchandise, influence the prospects of upcoming artists etc.) that it will know where to target its marketing to maximize its profits going forwards. That means understanding the customer value of each of its users, and how that varies by age, affluence, taste in music and taste in music services (Spotify, Songkick and the like).</p>
<p>To illustrate the second weakness, consider the widespread example of the retailer advertising his products using AdWords, and measuring the return on that marketing spend based on the fraction of those users who clicked on the ads who went on to make a purchase, and the value of the purchases they made. This is often given as a textbook example for data-driven marketing. Unfortunately, it is massively flawed:</p>
<ul>
<li>Many customers visit a site more than once before making a purchase. By looking at users&#8217; behaviour only on their <i>initial</i> visit after clicking on an ad, the retailer will likely overlook a large number of return-visit sales which stemmed from that initial AdWords click. This means that the retailer is liable to underestimate the return on their search engine marketing spend.</li>
<li>Many customers who purchase after some ad exposure will go on to make additional purchases in the future. Some of these sales would never have been made were it not for them having seen the ad originally, and then making a successful initial purchase. However, none of the value of the repeat purchases is taken into account by the person calculating the return on this marketing spend, and as a result, the retailer is again liable to underestimate his return on marketing spend.</li>
</ul>
<p>It is only when a company looks at the complete value of each customer acquired from a given marketing activity that they can calculate the true return on that marketing spend. If they do not do this, they risk cutting marketing spend from potentially their best long-term marketing channels, and spending too much on channels that only drive short term sales increases.</p>
<p>For another example of how a company can make a terrible strategic decision because it does not monitor the lifetime value of its customers, consider the (probably apocryphal) story of the bank that axed its children&#8217;s accounts after realizing that these were unprofitable. That bank&#8217;s profits will have risen in the short term, and then nosedived 15 years after the decision was made, as the cohort of users who would have formed the bedrock of its future profitable customer base took up children&#8217;s accounts with competing banks.</p>
<p><strong>Measuring customer lifetime value (general approach)</strong></p>
<p>Measuring customer lifetime value is difficult for a number of reasons:</p>
<ol>
<li>As described above, there are particular customer actions (such as recommending a brand to a friend) that certainly create value, but are hard to quantify</li>
<li>It is not always easy to define the &#8220;lifetime&#8221; of a customer, in other words to determine over what period we should measure that customer&#8217;s value</li>
<li>Many of our most interesting customers will be in an ongoing relationship with our business: without a crystal ball we cannot be very certain what value we will be able to realize from those customers in the future</li>
</ol>
<p>A full discussion of how to address the above challenges is beyond the scope of this blog post, although we do plan to explore some approaches in future posts. At this stage we make the following general observations, before discussing how to perform some initial customer lifetime value (CLV) calculations using <a href="http://snowplowanalytics.com">SnowPlow</a>:</p>
<ul>
<li>While it is difficult to quantify those more &#8220;intangible&#8221; user actions (such as reviewing a product or &#8220;liking&#8221; a fan page), it is possible to come up with models to do so. Developing these valuation models means analysing what effect these actions have over a long time period, and comparing those effects with &#8220;control groups&#8221; where no equivalent action was performed. This is quite an involved analysis, and it requires good granularity of different customer journeys to perform &#8211; fortunately, <a href="http://snowplowanalytics.com">SnowPlow</a> provides the data granularity required. Once the analysis has been performed, a provisional value for that user action can be assigned.</li>
<li>Whilst defining the lifetime of a customer can be very difficult, we can instead make do by looking at the customer value over a defined period of time (e.g. a quarter), and see how that varies over time periods: are we increasing the value of our individual customers over time, or decreasing it? And how does that vary by customer segment? If we adopt this approach, we need to perform a separate churn analysis, to capture the undesirable but important commercial impact of users who leave us.</li>
<li>An alternative if we want to stick with customer lifetime value is to build a model that predicts the lifetime of each current customer, based on their behaviour to date. Again, because SnowPlow provides a very granular data set for each customer, it can support this type of analysis well. The output is a function that predicts the lifetime value of a customer based on their to-date behaviour, where the salient aspects of that behaviour have been carefully selected, through a combination of customer knowledge and data analysis.</li>
<li>Whichever of the above routes is taken, remember that a given user&#8217;s CLV may evolve over time, based both on their own changing behaviour but also on our evolving model and understanding of valuable customer behaviour. This is fine and to be expected &#8211; the important thing is that at any given moment in time we can attribute a meaningful value for each customer.</li>
</ul>
<p><strong>Measuring customer lifetime value (using SnowPlow)</strong></p>
<p><a href="http://snowplowanalytics.com">SnowPlow</a> makes it easy to calculate customer value based on web analytics data, unlike traditional web analytics programs. There are a number of key areas where it excels:</p>
<p><strong>1. It makes it possible to measure the total revenue of each customer directly, over their entire history</strong></p>
<p>Remember the example given earlier of the retailer that measured the return on AdWords spend based on the fraction of users who clicked on the ad and went on to purchase on the same visit? Well, <a href="http://snowplowanalytics.com">SnowPlow</a> makes it trivial to look at the amount of revenue booked for each customer over all their visits, using the query below:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
SELECT
	`user_id` AS customer,
	YEAR(`dt`) AS time_period_year,
	MONTH(`dt`) AS time_period_month,
	SUM(`ev_value`) AS `customer_value_per_time_period`
FROM `events`
WHERE `ev_action` LIKE 'order-confirmation'
GROUP BY `user_id`, YEAR(`dt`), MONTH(`dt`);
</pre>
<p>The query above sums all the order values for each `user_id` by time period (in this case, month). It is trivial to modify the query to add in different types of events, by simply adding them to the `WHERE` clause. It is also trivial to take a different time period, e.g. looking by week, by aggregating the event dates using the Hive WEEKOFYEAR() function rather than the MONTH() function.</p>
<p>If we wanted to perform the above analysis <i>just</i> for a cohort of users who had come to the site from AdWords, we would perform a separate query to identify just those users who&#8217;d clicked on an AdWords ad (shown below), and then JOIN the results (a list of user_ids) with the output of the query above. This query would get you that list of user_ids:</p>
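<p>As a minimal sketch of the weekly variant, assuming the same `events` table and that `dt` holds a date string that Hive&#8217;s date functions can parse, the query might look like this:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch) */
SELECT
	`user_id` AS customer,
	YEAR(`dt`) AS time_period_year,
	WEEKOFYEAR(`dt`) AS time_period_week,
	SUM(`ev_value`) AS `customer_value_per_time_period`
FROM `events`
WHERE `ev_action` LIKE 'order-confirmation'
GROUP BY `user_id`, YEAR(`dt`), WEEKOFYEAR(`dt`) ;
</pre>
<p>Weeks that straddle a year boundary may need some care with this approach, since days in early January can be numbered as part of the previous year&#8217;s final week.</p>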
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
SELECT DISTINCT `user_id` /* one row per user, however many ads they clicked */
FROM `events`
WHERE `mkt_source` = 'google' AND `mkt_medium` = 'cpc' ;
</pre>
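<p>Putting the two together, a sketch of the joined query might look like the following. The subquery alias `adwords_users` is our own name, used for illustration only, and the DISTINCT is there so that a user with several AdWords clicks does not have their orders counted more than once:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch) */
SELECT
	e.`user_id` AS customer,
	YEAR(e.`dt`) AS time_period_year,
	MONTH(e.`dt`) AS time_period_month,
	SUM(e.`ev_value`) AS `customer_value_per_time_period`
FROM `events` e
JOIN (
	/* Users who have arrived via an AdWords click at least once */
	SELECT DISTINCT `user_id`
	FROM `events`
	WHERE `mkt_source` = 'google' AND `mkt_medium` = 'cpc'
) adwords_users ON (e.`user_id` = adwords_users.`user_id`)
WHERE e.`ev_action` LIKE 'order-confirmation'
GROUP BY e.`user_id`, YEAR(e.`dt`), MONTH(e.`dt`) ;
</pre>
<p>Note that this treats anyone who has <i>ever</i> clicked on an AdWords ad as part of the cohort. A stricter definition (for example, users whose first visit came from AdWords) would need a more involved subquery.</p>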
<p><strong>2. It is possible to include the value of &#8220;intangible&#8221; customer actions (e.g. &#8220;liking a product&#8221;) in the SnowPlow customer value calculation</strong></p>
<p>If your company has developed a model / view of the value of &#8220;intangible&#8221; or otherwise hard-to-value actions (e.g. &#8220;liking&#8221; a product), those values can be incorporated by SnowPlow when you perform the analysis. That&#8217;s because, unlike other web analytics packages including Google Analytics, SnowPlow lets you upload your own first- and third-party data into SnowPlow to use in the calculation.</p>
<p>In this case, you would create a table in SnowPlow with a list of the different actions, categorised by events, where each event has an associated value. The table would look like this:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `intangible_event_values` (
	ev_category STRING,
	ev_action STRING,
	value INT
) ;
</pre>
<p>You would then upload your event values to the `intangible_event_values` table and JOIN it with the `events` table when calculating the value per customer per time period:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
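SELECT
	e.`user_id` AS customer,
	YEAR(e.`dt`) AS time_period_year,
	MONTH(e.`dt`) AS time_period_month,
	/* Order revenue plus the assigned value of any matching intangible actions */
	SUM(CASE WHEN e.`ev_action` = 'order-confirmation' THEN e.`ev_value` ELSE 0 END)
		+ SUM(COALESCE(v.`value`, 0)) AS `customer_value_per_time_period`
FROM `events` e
/* LEFT OUTER JOIN so that orders with no entry in the values table are kept */
LEFT OUTER JOIN `intangible_event_values` v
	ON (e.`ev_category` = v.`ev_category` AND e.`ev_action` = v.`ev_action`)
WHERE (e.`ev_action` = 'order-confirmation' OR v.`value` IS NOT NULL)
GROUP BY e.`user_id`, YEAR(e.`dt`), MONTH(e.`dt`) ;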
</pre>
<p>This allows us to track customer value over time, and monitor how effectively we are growing customer value over time &#8211; for the user base as a whole, by customer segment, or for each customer (by looking at particular `user_id`s).</p>
<p><strong>3. It is possible to include model-based estimates of future customer value in customer lifetime value calculations</strong></p>
<p>Assuming that we have developed a model that predicts future customer value based on their behaviour to-date, we can incorporate that into SnowPlow to calculate the full-lifetime value for the current customer base.</p>
<p>Our function to predict customer lifetime value would take certain events in a customer&#8217;s past and use those to predict future value. Because <i>all</i> of the events in a customer&#8217;s online history are stored in SnowPlow, it is possible to run the function in SnowPlow and calculate a full-lifetime value for each of our `user_id`s. From this point we could sum all of those individual CLVs to calculate the value of our current user base, and compare that with the value for our user base last month or year, for example.</p>
<p>How we do this depends on the nature of our predictive function. If it can be written directly in Hive using HiveQL, we simply need to include it in our query. If it is not possible to compose in HiveQL, we can create a user-defined function in Java to do the calculation and return the result to Hive. Either way, our analysis proceeds as follows:</p>
<p>First we define our &#8220;live&#8221; userbase:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `this_months_customers` (
	user_id STRING
) ;

INSERT OVERWRITE TABLE `this_months_customers`
SELECT `user_id`
FROM `events`
WHERE dt&gt;'2012-05-31' /* In this case we define our live customers as those who have engaged with us in June */
GROUP BY `user_id` ;
</pre>
<p>Next we perform a comparable query to identify customers who were &#8220;live&#8221; last month:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `last_months_customers` (
	user_id STRING
) ;

INSERT OVERWRITE TABLE `last_months_customers`
SELECT `user_id`
FROM `events`
WHERE `dt`&gt;'2012-04-30' AND `dt`&lt;'2012-06-01'
GROUP BY `user_id` ;
</pre>
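<p>As an aside, these two tables are also a reasonable starting point for the churn analysis mentioned earlier. A rough sketch, assuming that &#8220;churned&#8221; simply means live last month but not seen since:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch) */
SELECT l.`user_id` AS churned_customer
FROM `last_months_customers` l
LEFT OUTER JOIN `this_months_customers` t ON (l.`user_id` = t.`user_id`)
WHERE t.`user_id` IS NULL ; /* no match means the user has not engaged this month */
</pre>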
<p>We then calculate the value of this month&#8217;s customers, using our customer lifetime value predictor function:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `lifetime_value_by_customer_this_month` (
	user_id STRING,
	predicted_lifetime_value FLOAT
) ;

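INSERT OVERWRITE TABLE `lifetime_value_by_customer_this_month`
SELECT
	e.`user_id` AS customer,
	customer_lifetime_value_predictor_function(input parameters) AS `customer_lifetime_value`
FROM `events` e
JOIN `this_months_customers` c ON (e.`user_id` = c.`user_id`) /* restrict to this month's live customers */
GROUP BY e.`user_id` ;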
</pre>
<p>Note that we look at the <i>complete history</i> of each of this month&#8217;s customers to predict their future value.</p>
<p>Next we perform the comparable calculation for last month&#8217;s customers:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `lifetime_value_by_customer_last_month` (
	user_id STRING,
	predicted_lifetime_value FLOAT
) ;

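INSERT OVERWRITE TABLE `lifetime_value_by_customer_last_month`
SELECT
	e.`user_id` AS customer,
	customer_lifetime_value_predictor_function(input parameters) AS `customer_lifetime_value`
FROM `events` e
JOIN `last_months_customers` c ON (e.`user_id` = c.`user_id`) /* restrict to last month's live customers */
WHERE e.`dt`&lt;'2012-06-01' /* only use the history available at the end of last month */
GROUP BY e.`user_id` ;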
</pre>
<p>We can now compare the two tables `lifetime_value_by_customer_last_month` and `lifetime_value_by_customer_this_month` to see how the average lifetime value has varied, how the distribution of lifetime value has varied and how the number of customers by lifetime value has varied.</p>
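<p>For instance, the number of customers and the average predicted lifetime value could be pulled from each table with queries along these lines. This is a sketch only: the output column names are our own, and the bucket width of 100 in the last query is an arbitrary choice for illustration:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch) */
SELECT
	COUNT(*) AS number_of_customers,
	AVG(`predicted_lifetime_value`) AS average_lifetime_value
FROM `lifetime_value_by_customer_last_month` ;

SELECT
	COUNT(*) AS number_of_customers,
	AVG(`predicted_lifetime_value`) AS average_lifetime_value
FROM `lifetime_value_by_customer_this_month` ;

/* Distribution: number of customers per lifetime value bucket, this month */
SELECT
	FLOOR(`predicted_lifetime_value` / 100) * 100 AS lifetime_value_bucket,
	COUNT(*) AS number_of_customers
FROM `lifetime_value_by_customer_this_month`
GROUP BY FLOOR(`predicted_lifetime_value` / 100) * 100 ;
</pre>
<p>Running the last query against the other table gives the comparable distribution for last month.</p>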
<p>We will explore different approaches to developing the customer_lifetime_value_predictor_function in future blog posts.</p>
<p><strong>Want to learn more about SnowPlow?</strong> Then visit the <a href="http://snowplowanalytics.com">SnowPlow website</a>, and in particular, the guide to <a href="http://snowplowanalytics.com/analytics/customer-analytics/customer-lifetime-value.html" title="Calculating customer lifetime value with SnowPlow">calculating customer lifetime value using SnowPlow</a>.</p>
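<p>If the predictor does end up as a Java user-defined function, it needs to be registered with Hive before the queries above will run. A sketch of what that could look like is below; the jar path and class name are placeholders of our own, not part of SnowPlow:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch) */
/* Hypothetical jar containing the compiled function */
ADD JAR /path/to/clv-predictor.jar;

/* Hypothetical class name: substitute the class that implements your model */
CREATE TEMPORARY FUNCTION customer_lifetime_value_predictor_function
	AS 'com.example.hive.CustomerLifetimeValuePredictor' ;
</pre>
<p>Because the function is applied once per `user_id` across many event rows in a GROUP BY query, it would most likely need to be written as an aggregate function (a UDAF) rather than a simple row-level UDF.</p>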
]]></content:encoded>
			<wfw:commentRss>/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>(Re-)introducing SnowPlow: a new approach to web analytics</title>
		<link>/blog/2012/05/re-introducing-snowplow-a-new-approach-to-web-analytics</link>
		<comments>/blog/2012/05/re-introducing-snowplow-a-new-approach-to-web-analytics#comments</comments>
		<pubDate>Wed, 30 May 2012 17:57:19 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2600</guid>
		<description><![CDATA[The introducing SnowPlow blog post series Introducing SnowPlow: the world&#8217;s most powerful web analytics platform SnowPlow update: first part of source code published Warehousing your online ad data with SnowPlow Reintroducing SnowPlow: what is wrong with web analytics in 2012 and how we fix it. (Presentation version) What is wrong with web analytics in 2012, [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The introducing SnowPlow blog post series</p>
<ul>
<li><a href="/blog/2012/02/introducing-snowplow-the-worlds-most-powerful-web-analytics-platform">Introducing SnowPlow: the world&#8217;s most powerful web analytics platform</a></li>
<li><a href="/blog/2012/03/snowplow-update-first-part-of-source-code-published">SnowPlow update: first part of source code published</a></li>
<li><a href="/blog/2012/05/warehousing-your-online-ad-data-with-snowplow">Warehousing your online ad data with SnowPlow</a></li>
<li>Reintroducing SnowPlow: what is wrong with web analytics in 2012 and how we fix it. (Presentation version)</li>
<li><a href="/blog/2012/06/what-is-wrong-with-web-analytics-in-2012-and-how-snowplow-starts-to-fix-it">What is wrong with web analytics in 2012, and how SnowPlow starts to fix it. (Blog post version)</a></li>
</ul>
</div>
<p>In the last few weeks we have talked to many different people and organisations about SnowPlow. One of the things that has only become obvious since having those conversations is that the approach we&#8217;ve taken in developing SnowPlow is surprising to many of the people we&#8217;ve spoken to. That&#8217;s at least partly because the journey we&#8217;ve been on, in thinking about how web analytics can be done better, is not one that everyone else has necessarily been on.</p>
<p>The slide deck below is our attempt to describe that journey. It sums up, as briefly as we can manage, what is wrong with web analytics today, and how we believe SnowPlow addresses those fundamental problems in a fresh and new way. It goes on to outline some of the areas we hope to develop SnowPlow in.</p>
<p>I plan to write a blog post discussing some of these issues in more detail in the next few days. But in the meantime, check out the deck: we&#8217;d love your feedback.</p>
<p style="text-align:center;">
<div style="width:637px" id="__ss_13136914"><strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/yalisassoon/introducing-snowplow" title="Introducing SnowPlow">Introducing SnowPlow</a></strong><object id="__sse13136914" width="637" height="532"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=introducingsnowplow-120530124302-phpapp02&#038;stripped_title=introducing-snowplow&#038;userName=yalisassoon" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><param name="wmode" value="transparent"/><embed name="__sse13136914" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=introducingsnowplow-120530124302-phpapp02&#038;stripped_title=introducing-snowplow&#038;userName=yalisassoon" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" wmode="transparent" width="637" height="532"></embed></object>
<div style="padding:5px 0 12px">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/yalisassoon">yalisassoon</a>.</div>
</div>
<p><strong>Update:</strong> since writing this post we&#8217;ve launched a dedicated <a href="http://snowplowanalytics.com">SnowPlow website</a>. Check it out for <a href="http://snowplowanalytics.com">more information on SnowPlow</a>.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/re-introducing-snowplow-a-new-approach-to-web-analytics/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Different approaches to measuring user engagement with SnowPlow</title>
		<link>/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow</link>
		<comments>/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow#comments</comments>
		<pubDate>Wed, 16 May 2012 14:15:37 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[audience analytics]]></category>
		<category><![CDATA[audience engagement]]></category>
		<category><![CDATA[audience segmentation]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[user engagement]]></category>

		<guid isPermaLink="false">/blog/?p=2521</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li><a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">Performing cohort analysis on web analytics data using SnowPlow</a></li>
<li><a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></a></li>
<li><a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">On the wide variety of cohort analyses</a></li>
<li>Approaches to measuring user engagement as part of cohort analysis</li>
<li><a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">Approaches to measuring customer value as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/faking-cohort-analysis-with-google-analytics">Faking cohort analysis with Google Analytics</a></li>
</ul>
</div>
<p>User engagement is one of the most interesting, most important, and yet most challenging areas of data analysis. In this post, we will look at different metrics which can be employed to understand user behaviour as part of our overall series on cohort analyses. In addition, we will show how each suggested user engagement metric can be calculated in a straightforward way using <a href="http://snowplowanalytics.com">SnowPlow</a>.</p>
<div class="wp-caption aligncenter" style="width: 350px"><img title="Not an engaged user" src="http://blog.kepstatic.com/2012/05/children-paying-attention-in-class.jpg" alt="" width="340" /><p class="wp-caption-text">Not an engaged user</p></div>
<p>Historically, it was online publishers who were most concerned with the level of their users&#8217; engagement. But the advent of social networks like Facebook and Twitter, community sites (like Mixcloud and Mumsnet) and socially-aware applications (like Spotify and Steam) means that there is now a much larger number of businesses whose success depends very directly on how good they are at getting users to engage frequently (many times per day or month) and deeply (long sessions, many page views). That may be because engagement directly drives revenue (e.g. for an ad-funded business), or because building up a critical mass of users is key to building a viable online marketplace or social network.</p>
<p><span id="more-2521"></span></p>
<p>There are countless potential metrics for capturing user engagement &#8211; to give just some examples:</p>
<ul>
<li>Number of visits / logins per month</li>
<li>Number of page views per month</li>
<li>Amount of time spent on the site / in the product</li>
<li>Number of articles opened / read / downloaded / shared / saved / favourited</li>
<li>Number of sessions per month where the user spends a minimum amount of time on the site / connects with another user / completes a level in a game</li>
</ul>
<p>Which ones are most relevant to you will depend on your business, and the ways in which your users engage with your product. Let&#8217;s take some (famous) examples:</p>
<h3>Example 1: User engagement at Facebook</h3>
<p>In the case of Facebook, one well established user behaviour is to log into Facebook (e.g. at work, fairly early on in the day) and leave the page open during the day, rechecking it at intervals when there is a lull to see if anything interesting has appeared in the user&#8217;s stream. (Similar behaviour is evident when people access Facebook via mobile.) If this represents a significant fraction of user behaviour, then it becomes meaningless to measure the depth of engagement by the length of time a session lasts (because a session may last 8 hours). Similarly, measuring the breadth of engagement by the number of sessions would be misleading, because a user who logs in and out 5 times in a day (because they are on a public computer) is no more engaged than one who logs in once on their own machine, and keeps the page open all day.</p>
<p>It is likely for this reason (although we haven&#8217;t confirmed with analysts at Facebook) that Facebook tracks the number of days per week (or days per month) that a user logs in as a measure of the breadth of their engagement. Most likely Facebook would use a measure like this to get a handle on a user&#8217;s overall engagement with Facebook, and at the same time also look at how <i>deeply</i> or <i>actively</i> a user engages with Facebook by looking at how many items in their stream they actively engage with (e.g. by <i>commenting</i> on them or <i>liking</i> them) and how actively they contribute content to the stream by posting status updates and photos. Our guess would be that, for Facebook, user engagement is tracked across these two dimensions: one for breadth (number of days per month) and one for depth (level of activity when engaged, from active to passive).</p>
<h3>Example 2: User engagement at eBay</h3>
<p>For eBay on the other hand, measuring engagement is more difficult. For starters, there are different types of users on eBay, and we might want to apply different methods for measuring engagement based on the user type. <i>Buyers</i> on eBay may make up a distinct group as opposed to <i>sellers</i>. In the case of buyers, we might superficially be interested in the number of products they search for on eBay &#8211; so perhaps number of searches per week is a relevant metric? Or we might be interested in the breadth of products they look for on eBay &#8211; so a user who searches for electronic goods, clothing and stationery is more &#8220;engaged&#8221; than one who only looks at second-hand books. We might also want to analyse the &#8220;depth&#8221; of engagement by looking at how actively a user bids on products on eBay: what is the number of products bid on per week? How actively does the user bid on each item?</p>
<p>Conversely, if we look at sellers on eBay, things become more straightforward. For sellers we are interested in the number of items they sell (per week or per month). We might also want to quantify what more advanced sellers are making available on eBay relative to what they are retailing on their own website or Amazon Marketplace, or look at how frequently they are upgrading their listings with additional photos and similar. In all of this, we may want to differentiate &#8220;professional&#8221; sellers (including those who sell products on multiple channels) from part-time sellers or hobbyists, who use eBay to sell their old stuff.</p>
<h3>Measuring user engagement using SnowPlow</h3>
<p><a href="http://snowplowanalytics.com">SnowPlow</a> offers a lot of flexibility to use multiple different measures of engagement. To start with, let&#8217;s take the simple case of measuring each user&#8217;s engagement by the number of &#8216;actions&#8217; they have performed (where an <i>action</i> might be anything from loading a web page, to playing a piece of media, to adding an item to a shopping basket). Because a line of data is recorded in SnowPlow&#8217;s event table <em>every</em> time a user performs an action (and each line has a unique `txn_id`), all we need to do is count the number of actions taken per user per month:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM `events`
GROUP BY `user_id`, YEAR(`dt`), MONTH(`dt`);
</pre>
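<p>Counting actions is not the only option. The breadth measure suggested in the Facebook example above (the number of days per month on which a user is active) could be sketched along the following lines. This assumes that `dt` holds the calendar date of each event, so that counting distinct `dt` values counts distinct days; the `days_active_by_user` table name is hypothetical:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
/* Sketch only: `days_active_by_user` is a hypothetical table, and `dt` is assumed to hold one value per calendar day */
INSERT OVERWRITE TABLE `days_active_by_user`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `dt`) AS `days_active`
FROM `events`
GROUP BY `user_id`, YEAR(`dt`), MONTH(`dt`);
</pre>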
<p>We can make our measure more sophisticated by weighting different actions based on the level of engagement they demonstrate. For example, the folks at Facebook might decide that posting a status update or uploading a photo constitute more active levels of engagement than simply liking a post in your stream, or clicking on a friend&#8217;s photo album. In this case, in SnowPlow, we would create a new table populated with the different user action types (e.g. from the `event_action` field) and assigning each a score:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `engagement_by_event_action` (
	`event_action` string,
	`engagement_level` int
);
</pre>
</p>
<p>We would then populate the table with a list of different types of action and a representative value for the engagement level, like so:</p>
<table>
<tr>
<td><strong>`event_action`</strong></td>
<td><strong>`engagement_level`</strong></td>
</tr>
<tr>
<td>&#8216;like_post&#8217;</td>
<td>2</td>
</tr>
<tr>
<td>&#8216;comment_on_post&#8217;</td>
<td>3</td>
</tr>
<tr>
<td>&#8216;post_status_update&#8217;</td>
<td>5</td>
</tr>
</table>
<p>Note: the above table would have to include a value for <i>every</i> type of action recorded in the SnowPlow events table. Our query for calculating the weighted engagement level by user per period would then join the `events` table with the `engagement_by_event_action` table:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	SUM(`engagement_level`) AS `engagement`
FROM
	`events`
JOIN
	`engagement_by_event_action`
ON
	`events`.`event_action` = `engagement_by_event_action`.`event_action`
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`);
</pre>
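<p>Because the join above is an inner join, any event whose `event_action` has no row in `engagement_by_event_action` silently drops out of the total. A quick check along the following lines (a sketch, using only the two tables already defined) lists the action types that still need a score:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
/* Lists every action type seen in `events` that has no matching score */
SELECT DISTINCT
	`events`.`event_action`
FROM
	`events`
LEFT OUTER JOIN
	`engagement_by_event_action`
ON
	`events`.`event_action` = `engagement_by_event_action`.`event_action`
WHERE
	`engagement_by_event_action`.`event_action` IS NULL;
</pre>
<p>If this returns no rows, every recorded action type has been given an engagement level.</p>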
<p>For the eBay example, our first suggested measure of breadth of engagement for buyers was to examine the number of product searches performed. Assuming that every product search by a user is logged in SnowPlow as an event with `event_action` = &#8216;search&#8217; and `event_property` = <i>&#8220;search string&#8221;</i>, we could as a first measure simply count the number of searches per time period per buyer:</p>
<pre class="brush: sql; title: ; notranslate">
INSERT OVERWRITE TABLE `engagement_by_buyer`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` like &quot;search&quot;
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;
</pre>
<p>If we want to be more sophisticated and look at the breadth of items searched for (e.g. the number of different product categories sought by the user), we would need:</p>
<ol>
<li>A way to categorise the search query into a likely product category</li>
<li>To count the number of categories searched</li>
</ol>
<p>Assuming our search facility had some kind of categorisation engine (i.e. could approximate the specific category of product the user is currently searching for), we could create a table in Hive that maps search terms to categories:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `category_by_search_term` (
	`search_term` string,
	`product_category` string
);
</pre>
</p>
<p>We could then examine the breadth of products searched for by buyers using the following query:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE buyer
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `product_category`) AS `engagement`
FROM
	`events`
JOIN
	`category_by_search_term`
ON
	`events`.`event_property` = `category_by_search_term`.`search_term`
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`);
</pre>
</p>
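<p>The depth of buyer engagement mentioned earlier (how many items a buyer bids on per week, and how actively) could be approached in the same way. The sketch below assumes that each bid is logged as an event with `event_action` = &#8220;bid&#8221; and `event_property` = <i>item identifier</i>; both that convention and the `bids_by_buyer` table are hypothetical:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
/* Sketch only: assumes bids are logged with `event_action` = "bid" and the item identifier in `event_property` */
INSERT OVERWRITE TABLE `bids_by_buyer`
SELECT
	`user_id`,
	YEAR(`dt`),
	WEEKOFYEAR(`dt`),
	COUNT(DISTINCT `event_property`) AS `items_bid_on`, /* number of different items bid on */
	COUNT(`txn_id`) AS `bids_placed` /* total number of bids */
FROM
	`events`
WHERE
	`event_action` LIKE &quot;bid&quot;
GROUP BY
	`user_id`, YEAR(`dt`), WEEKOFYEAR(`dt`) ;
</pre>
<p>Dividing `bids_placed` by `items_bid_on` gives a rough indication of how actively the buyer bids on each item. Note that a week which spans the new year will be split across two rows by this grouping.</p>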
<p>Meanwhile, to examine the engagement levels of sellers on eBay, we could look at the number of products listed by each seller per month. Assuming that each time a seller lists an item, SnowPlow records an event with `event_action` = &#8220;list&#8221; and `event_property` = <i>item sku</i>, it would be relatively simple to count the number of items listed per month:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` LIKE &quot;list&quot;
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;
</pre>
</p>
<p>If we wanted to compare the number of different SKUs listed as well, that is also straightforward:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `event_property`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` LIKE &quot;list&quot;
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;
</pre>
</p>
<p>Comparing this number against a hypothetical number for the total amount each seller <i>could</i> list would require some data gathering / intelligence outside of SnowPlow to look at e.g. the total number of items the seller makes available on competitor platforms such as Amazon. This data would be put into a Hive table as below:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `max_skus_by_user_id` (
	`user_id` string,
	`yr` int,
	`month` int,
	`max_skus` int
);
</pre>
</p>
<p>And finally, performing the comparison:</p>
<p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`events`.`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `event_property`) / `max_skus` AS `engagement` /* Note `engagement` needs to be type float or double */
FROM
	`events`
JOIN
	`max_skus_by_user_id`
ON
	`events`.`user_id` = `max_skus_by_user_id`.`user_id`
WHERE
	`events`.`event_action` LIKE &quot;list&quot;
	AND YEAR(`dt`) = `yr` AND MONTH(`dt`) = `month` /* compare each month of listings with that month's maximum */
GROUP BY
	`events`.`user_id`, YEAR(`dt`), MONTH(`dt`), `max_skus` ;
</pre>
</p>
<h3>Deciding which metric to use for user engagement</h3>
<p>As the above examples hopefully illustrate, there are many possible metrics which you can employ to describe and measure user engagement. Which metrics are most applicable will vary by business: for each business, there will be multiple possible measures that might potentially work. Whilst a full discussion of all the different possibilities is beyond the scope of this post, a quick tip is that if you do have a number of different possible metrics, and if they rate different users similarly, relative to one another, for equivalent levels of engagement, then chances are that all of them are reasonably robust, and you are reasonably safe picking any one of them (because all of them give similar results).</p>
<p><strong>Interested in using SnowPlow to measure engagement on your website or application?</strong></p>
<p><a href="http://snowplowanalytics.com" title="SnowPlow web analytics">Visit the SnowPlow website</a> to learn more, especially the page on <a href="http://snowplowanalytics.com/analytics/customer-analytics/user-engagement.html" title="Measuring user engagement with SnowPlow">using SnowPlow to measure user engagement</a>.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On the wide variety of different cohort analyses</title>
		<link>/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow</link>
		<comments>/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow#comments</comments>
		<pubDate>Wed, 16 May 2012 13:54:29 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[cluster analysis]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[customer segmentation]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2476</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li><a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">Performing cohort analysis on web analytics data using SnowPlow</a></li>
<li><a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></a></li>
<li>On the wide variety of cohort analyses</li>
<li><a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">Approaches to measuring user engagement as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">Approaches to measuring customer value as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/faking-cohort-analysis-with-google-analytics">Faking cohort analysis with Google Analytics</a></li>
</ul>
</div>
<p>In the last two blog posts in this series we looked at two different examples of cohort analysis and how to perform them using SnowPlow. The <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow" target="_blank">first example</a> was taken from a Twitter case study, while the <a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">second example</a> was taken from Eric Ries&#8217;s book <a href="http://theleanstartup.com/" target="_blank">The Lean Startup</a>.</p>
<div class="wp-caption aligncenter" style="width: 510px"><img title="A wide variety of cheeses" src="http://blog.kepstatic.com/2012/05/cheese-counter.jpg" alt="" width="500" /><p class="wp-caption-text">A wide variety of cheeses</p></div>
<p>In this post, we cast the net wider, to bring out the breadth of cohort analyses that you can potentially perform. As should become clear as you read this post, we don&#8217;t see &#8220;cohort analysis&#8221; as a single report that looks the same for every company: rather it is a whole category of analyses which can be brought to bear to answer a number of different business questions. As a result, the specific cohort analyses you conduct will depend on the exact nature of your business and the specific questions you need to answer. But for any one business, there will be a number of cohort analyses that will be relevant to answer a range of different business questions. So, if for now you&#8217;re only using one type of cohort analysis, we suggest you look creatively at how to employ this powerful technique to answer other business questions you face.</p>
<p>Below we outline the range of cohort analyses which are possible, outlining different metrics which can be employed to compare the behaviour of customers in different cohorts, before going on to outline some different ways of defining cohorts. In subsequent posts, we will dive into some of these variations in more detail, and explain how to perform those analyses using <a href="/blog/2012/02/introducing-snowplow-the-worlds-most-powerful-web-analytics-platform" target="_blank">SnowPlow</a>.</p>
<p><span id="more-2476"></span></p>
<p><strong>Different types of metric to compare between cohorts</strong></p>
<p>There are a number of different types of metric which we might want to compare between cohorts:</p>
<ol>
<li><strong>Measures of user engagement</strong>. In the <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow" target="_blank">Twitter example</a>, we looked at one possible measure, but as we said then, there are many other possibilities; we will outline some of them in the next <a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow" target="_blank">blog post</a>. Which measures are most appropriate will depend on the type of business and the way customers of that business engage with them.</li>
<li><strong>Stages in a customer journey or funnel</strong>. In the <a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive" target="_blank">Lean Startup</a> example we looked at one example; we will look at variations on this in a future post. SnowPlow offers analysts the exciting possibility of defining new funnels <em>retrospectively</em> &#8211; in other words, an analyst can create bespoke funnels after-the-fact and then track retrospectively how many users in each cohort have progressed through the various stages of this funnel.</li>
<li><strong>Customer Lifetime Value (CLV)</strong>. This is perhaps the most important and valuable metric to compare between cohorts. Unfortunately (although not coincidentally), it is one of the more nuanced and least understood. Again, this is something we will explore more fully in a forthcoming blog post.</li>
</ol>
<p><strong>Different ways to define cohorts</strong></p>
<p>There are a number of different approaches to defining our cohorts to compare. Which approach we use will depend critically on the business question we are looking to answer. To give some examples:</p>
<ol>
<li><strong>Cohorts defined by when the user joins/registers</strong>. Both of our in-depth examples in previous blog posts defined each cohort by when a user first registered for a service (either Twitter or IMVU, depending on the case study). In both cases, each cohort included a group of users who had all registered with the service in the same month. However, we might want to vary the time period over which we group users: for example, if we release new product versions on a weekly basis, it makes sense to compare new users this week with last week and see if there are significant differences. Likewise, we might want to key our cohorts off a different &#8220;trigger point&#8221; in the customer journey &#8211; for example, an online retailer might bucket users by when they made their first purchase, or when they signed up to an email newsletter. Again, SnowPlow makes dividing users into different cohorts using these different definitions very easy to do retrospectively &#8211; we&#8217;ll explore this in a separate blog post.</li>
<li><strong>Cohorts defined by a type of user behaviour</strong>. One of the things we work with a number of businesses to do is to segment their users (e.g. using a cluster analysis) into groups based on different types of user behaviour. For an online retailer, for example, we might differentiate regular shoppers from infrequent shoppers; people who shop from work from people who shop from home; and people who shop for themselves from people who buy items as presents for others. In any of these examples, SnowPlow is flexible enough to enable us to bucket users into different cohorts by whatever behaviour we believe is predictive of underlying differences in user attitude and behaviour.</li>
<li><strong>Cohorts defined by a user characteristic</strong>. As well as segmenting our customer base by behavioural characteristics, we can also segment them into different cohorts based on other user data points that we capture. For example, a social network might want to compare engagement rates between men and women, between different age groups, or between different countries. As long as that data is being captured, it can be used with SnowPlow to assign users to cohorts and compare behaviours between those cohorts.</li>
<li><strong>Cohorts defined by the stage in a user&#8217;s journey</strong>. This is particularly useful for services which users engage with over a long period, and behave in different ways at different stages. For example, a blogging platform (like WordPress.com or Tumblr) might observe that users go through multiple stages when they set up a blog: coming up with a topic, customising the appearance, starting to post, developing a readership, posting more regularly, promoting on Twitter/Facebook, running out of steam, posting less frequently, quitting the blog. Different users might spend different periods at each stage (and some might not progress beyond even the first stage). Comparing engagement levels or customer lifetime value between users at different stages is a potentially useful exercise for companies like WordPress and Tumblr.</li>
<li><strong>Cohorts defined by what channel a customer was acquired on</strong>. To calculate an accurate return on marketing spend, we need to analyse the average customer lifetime value of customers acquired on different marketing channels, by campaign. SnowPlow makes it easy to do this, including distinguishing users who arrive at a site for the first time from a particular channel from those who responded to a campaign on the same channel having already visited the site earlier. We will explain how to do this in a future blog post.</li>
</ol>
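<p>To make the idea of a different &#8220;trigger point&#8221; concrete, here is a minimal sketch that buckets users by the month of their first purchase rather than the month they registered. It borrows the `snowplow_events_table`, `dt` and `event_action` names used elsewhere in this series; the &#8220;purchase&#8221; action value is an assumption about how such events would be logged:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
/* Sketch only: assumes purchases are logged with `event_action` = "purchase" */
SELECT
	`user_id`,
	concat(year(min(`dt`)),&quot;-&quot;,month(min(`dt`))) AS cohort /* month of first purchase */
FROM `snowplow_events_table`
WHERE `event_action` LIKE &quot;purchase&quot;
GROUP BY `user_id`;
</pre>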
<p>Have we missed anything? We are always on the lookout for new and imaginative applications of cohort analyses. If there&#8217;s a variation we haven&#8217;t covered, and you&#8217;d like us to, then <a href="/contact">get in touch!</a> Similarly, if you&#8217;d like to discuss how to use SnowPlow and/or cohort analyses to answer your business questions, then <a href="/contact">get in touch</a> as well.</p>
<p><b>Update:</b> ready for more? The next blog post in this series is now available: <a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow" target="_blank">Different metrics to understanding user engagement when performing cohort analyses</a>. </p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Performing the cohort analysis described in Eric Ries’s Lean Startup using SnowPlow and Hive</title>
		<link>/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive</link>
		<comments>/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive#comments</comments>
		<pubDate>Tue, 15 May 2012 11:16:55 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[lean startup]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2433</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li><a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">Performing cohort analysis on web analytics data using SnowPlow</a></li>
<li>Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></li>
<li><a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">On the wide variety of cohort analyses</a></li>
<li><a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">Approaches to measuring user engagement as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">Approaches to measuring customer value as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/faking-cohort-analysis-with-google-analytics">Faking cohort analysis with Google Analytics</a></li>
</ul>
</div>
<p>This blog post is the third in our series on cohort analyses using SnowPlow. In the first post, we provided an <a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview" title="Overview of cohort analyses for digital businesses">overview of cohort analyses</a>: why they are so powerful and what are the different analytic steps necessary to perform a cohort analysis. In the second post, we looked at <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">why SnowPlow is such a good platform for performing cohort analyses</a> using web analytics data, and worked through the specific example of the Twitter cohort analysis that gets so much attention in startup circles.</p>
<p>In this post, we will follow up with a look at another famous example of a cohort analysis: this time from Eric Ries&#8217;s excellent book <a href="http://theleanstartup.com/">The Lean Startup</a>. We will show how a company running SnowPlow can easily perform the type of analysis Eric performed when he was CTO at IMVU, to assess the progress they were making towards achieving a product-market fit.</p>
<p>A version of the data Eric Ries presents in his book is shown below:</p>
<p style="text-align: center;"><a href="http://blog.kepstatic.com/2012/05/cohort-analysis-visualisation.jpg"><img src="http://blog.kepstatic.com/2012/05/cohort-analysis-visualisation.jpg" width="500" /></a></p>
<p><span id="more-2433"></span></p>
<p><strong>Interpreting Eric Ries&#8217;s cohort analysis</strong></p>
<p>Before I describe how to perform Eric Ries&#8217;s cohort analysis, let&#8217;s remind ourselves what this analysis demonstrates.</p>
<p>The analysis shows the percentage of new users added each month, split by what stage they reached in their customer journey. Remember that Eric&#8217;s startup had an instant messaging product where customers talked to one another using 3D avatars; ultimately, the success of the company depended on persuading users to pay for the product, but that meant each user working through a number of &#8220;stage gates&#8221;, including:</p>
<ol id="list-of-stage-gates">
<li>Downloading the product</li>
<li>Registering</li>
<li>Creating a 3D avatar (potentially more than one)</li>
<li>Connecting with another user (potentially multiple users)</li>
<li>Signing up to the service and paying the subscription fee</li>
</ol>
<p>Understanding how many potential customers progressed through each stage-gate was critical to IMVU. The graph above looks at the number of people who registered in each of the first 7 months after the first version of their product was launched, and looks at what percentage of users made it through to each stage gate. The aim was to get as many customers as possible to the final stage gate: &#8220;Paid&#8221;. As you&#8217;ll see from the graph, however, the percentage of users who made it to this stage each month was very small (less than 1.4% in each month). Worse, this percentage did not increase month-by-month, as the team worked to improve the product. It was this lack of improvement, in spite of enormous product development effort, that persuaded the IMVU team to pivot their product, with dramatic effect.</p>
<p>There are some interesting things to note when comparing this analysis to the one we performed in the <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">last post</a>. As in the previous analysis, each cohort is defined by the month that the user started using the service (although in this case, that &#8220;starting point&#8221; is determined by the point at which the user &#8220;registers&#8221;, which is after they have already downloaded the product). For each cohort, we only look at one month of data. This is very different from the previous example, where we looked at the average engagement level of each cohort over successive months. This may be because the IMVU guys found that if a user hadn&#8217;t signed up to their service after 1 month, there was a negligible chance that they would go on to do so. (Meaning that the timeframe for converting a prospect to a paid user was reasonably short, and hence there was no point including subsequent months in the analysis for each cohort. This makes the IMVU product very different to Twitter, where a user may only graduate to a highly engaged user after two or even three months.) When you consider performing a comparable analysis for your company, decide carefully what time frame to look at, so you don&#8217;t miss relevant data points for each cohort by picking an inappropriate time period.</p>
<p><strong>Performing the cohort analysis using SnowPlow</strong></p>
<p>We will perform the analysis by following the steps outlined in the <a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">first post</a> i.e.:</p>
<ol>
<li>Start by defining the business question</li>
<li>Work out what is the most appropriate metric to measure, given the business question</li>
<li>Define the cohorts</li>
<li>Perform the analysis</li>
</ol>
<p><strong>1. Define the business question</strong></p>
<p>This is simply: are we getting better over time at getting new people introduced to the product to become paid users?</p>
<p><strong>2. What is the most appropriate metric to measure, given the business question?</strong></p>
<p>In this case it is the percentage of users who register who go on to reach the different stage gates identified in the <a href="#list-of-stage-gates">list above</a>.</p>
<p>It&#8217;s important to note that SnowPlow&#8217;s data table in Hive contains five fields that describe a particular &#8220;event&#8221; in a user journey:</p>
<table style="border: 1px solid black;">
<tr>
<td><strong>Name</strong></td>
<td><strong>Req?</strong></td>
<td><strong>Description</strong></td>
</tr>
<tr>
<td>Category</td>
<td>Yes</td>
<td>The type of event being tracked e.g. &#8220;ecomm&#8221; for buying related events or &#8220;media&#8221; for media consumption events</td>
</tr>
<tr>
<td>Action</td>
<td>Yes</td>
<td>The actual user action performed</td>
</tr>
<tr>
<td>Object</td>
<td>No</td>
<td>The specific object being acted on e.g. the product bought or video played</td>
</tr>
<tr>
<td>Property</td>
<td>No</td>
<td>An optional string describing the object or the action performed on it e.g. &#8220;HD&#8221; if it&#8217;s a video being played</td>
</tr>
<tr>
<td>Value</td>
<td>No</td>
<td>A value associated with the action e.g. revenue associated with the action, or number of minutes into the video that it is started / stopped</td>
</tr>
</table>
<p>In our case, we&#8217;ll assume that the different actions that indicate what funnel stage a user is in can be read from the Action field (called `event_action` to be precise). In the IMVU case, that might mean that we need to check for the following values in the `event_action` column for a particular user:</p>
<ol>
<li><strong>`Subscribe`</strong> to indicate he/she has reached the final stage in the funnel</li>
<li><strong>`Made-friend`</strong> to indicate he/she has connected with another user</li>
<li><strong>`Created-avatar`</strong> to indicate he/she has created an avatar</li>
<li><strong>`Login`</strong> to indicate he/she has logged in</li>
<li><strong>`Register`</strong> to indicate he/she has registered</li>
</ol>
<p>To work out what stage each visitor has reached in his/her customer journey, we process the `snowplow_events_table` for each `user_id` and combine all the different `event_actions` which a user has taken into a single set (stored as a HiveQL array):</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `actions_by_user` (
	`user_id` string,
	`acts` array&lt;string&gt;
);

INSERT OVERWRITE TABLE `actions_by_user`
SELECT
	`user_id`,
	collect_set(`event_action`) AS `acts`
FROM `snowplow_events_table`
GROUP BY
	`user_id`;
</pre>
<p>For each user, we then look through the array of different actions that he / she has performed to work out at what stage each visitor is in the funnel:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `stage_in_funnel_by_user` (
	`user_id` string,
	`stage` string
);

INSERT OVERWRITE TABLE `stage_in_funnel_by_user`
SELECT
	`user_id`,
	CASE
		WHEN array_contains(`acts`,&quot;Subscribe&quot;) THEN &quot;Subscribe&quot;
		WHEN array_contains(`acts`,&quot;Made-friend&quot;) THEN &quot;Made-friend&quot;
		WHEN array_contains(`acts`,&quot;Created-avatar&quot;) THEN &quot;Created-avatar&quot;
		WHEN array_contains(`acts`,&quot;Login&quot;) THEN &quot;Login&quot;
		WHEN array_contains(`acts`,&quot;Register&quot;) THEN &quot;Registered&quot;
		ELSE &quot;Not-registered&quot; END AS stage
FROM `actions_by_user`;
</pre>
<p>Note that we start from the final stage in the funnel. For each visitor, we check if they have reached the final stage, and if not, work backwards stage-by-stage to work out if they&#8217;ve reached the previous stage.</p>
<p><strong>3. Define your cohorts</strong></p>
<p>This is straightforward: we want to divide up our users into cohorts based on the month in which they registered for the product. This is done using the following query:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `users_by_cohort` (
	user_id string,
	cohort string
);

INSERT OVERWRITE TABLE `users_by_cohort`
SELECT
	`user_id`,
	concat(year(min(`dt`)),&quot;-&quot;,month(min(`dt`))) as cohort
FROM `snowplow_events_table`
WHERE `event_action` LIKE &quot;Register&quot;
GROUP BY `user_id`;
</pre>
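<p>One practical wrinkle: the cohort labels produced above are not zero-padded (e.g. 2012-5 and 2012-10), so they will not sort chronologically when treated as strings. If `dt` is stored as a 'YYYY-MM-DD' string (an assumption about how the events table is set up), a possible alternative is to take the first seven characters of the earliest date:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
/* Sketch only: assumes `dt` is a 'YYYY-MM-DD' string, so the earliest date is also the smallest string */
INSERT OVERWRITE TABLE `users_by_cohort`
SELECT
	`user_id`,
	substr(min(`dt`), 1, 7) AS cohort /* e.g. 2012-05 */
FROM `snowplow_events_table`
WHERE `event_action` LIKE &quot;Register&quot;
GROUP BY `user_id`;
</pre>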
<p><strong>4. Perform the analysis</strong></p>
<p>We now need to join our `users_by_cohort` table, which defines which cohort each `user_id` belongs to, with our `stage_in_funnel_by_user` table, which tells us which stage in the customer journey each user has reached:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `cohort_analysis` (
	`cohort` string,
	`stage` string,
	`number_of_users` int
);

INSERT OVERWRITE TABLE `cohort_analysis`
SELECT
	`cohort`,
	`stage`,
	count(`stage_in_funnel_by_user`.`user_id`) as `number_of_users`
FROM
	`users_by_cohort`
JOIN
	`stage_in_funnel_by_user`
ON
	`users_by_cohort`.`user_id` = `stage_in_funnel_by_user`.`user_id`
GROUP BY
	`cohort`, `stage`;
</pre>
<p>And that&#8217;s all there is to this retrospective cohort analysis! Our resulting table looks as follows:</p>
<p style="text-align: center;"><a href="http://blog.kepstatic.com/2012/05/cohort-analysis-visualisation.jpg"><img src="http://blog.kepstatic.com/2012/05/results-table.jpg" /></a></p>
<p>It is now straightforward to convert the absolute number of users at each stage into a percentage of the cohort, and to plot a graph similar to the one displayed at the top of this blog post.</p>
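<p>As a rough sketch of that conversion (not part of the original analysis), the query below divides each row of `cohort_analysis` by the total size of its cohort. The aliases `a`, `t`, `cohort_size` and `pct_of_cohort` are names made up for this illustration. Note that, because each user is assigned only the furthest stage they reached, each percentage is the share of the cohort that stopped at that stage:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch; the alias names are assumptions) */
SELECT
	a.`cohort`,
	a.`stage`,
	/* share of the cohort whose furthest stage is this one */
	100.0 * a.`number_of_users` / t.`cohort_size` AS `pct_of_cohort`
FROM `cohort_analysis` a
JOIN (
	/* total number of users in each cohort */
	SELECT
		`cohort`,
		sum(`number_of_users`) AS `cohort_size`
	FROM `cohort_analysis`
	GROUP BY `cohort`
) t
ON a.`cohort` = t.`cohort`;
</pre>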
<p><strong>Interested in learning more?</strong> Then <a href="http://snowplowanalytics.com">visit the SnowPlow website</a>, including the section on <a href="http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html">performing cohort analyses with SnowPlow</a>.</p>
<p><b>Update:</b> you can now also discuss <a href="http://news.ycombinator.com/item?id=3976551" target="_blank">this blog post</a> on Hacker News.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Online merchant using PrestaShop? Announcing the SnowPlow Early Access Programme (EAP)</title>
		<link>/blog/2012/05/online-merchant-using-prestashop-announcing-the-snowplow-early-access-programme-eap</link>
		<comments>/blog/2012/05/online-merchant-using-prestashop-announcing-the-snowplow-early-access-programme-eap#comments</comments>
		<pubDate>Mon, 14 May 2012 14:56:38 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[E-commerce]]></category>
		<category><![CDATA[CLV]]></category>
		<category><![CDATA[customer lifetime value]]></category>
		<category><![CDATA[ecommerce analysis]]></category>
		<category><![CDATA[ecommerce analytics]]></category>
		<category><![CDATA[PrestaShop]]></category>
		<category><![CDATA[retail analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2386</guid>
		<description><![CDATA[Are you an online retailer using PrestaShop? Are you interested in getting early access to killer new analytics tools to help super-charge your business? Keplar&#8217;s SnowPlow team would like to hear from you. At Keplar we are now hard at work building sophisticated web analytics, such as cohort analyses, using our open-source SnowPlow technology stack [...]]]></description>
			<content:encoded><![CDATA[<p>Are you an online retailer using PrestaShop? Are you interested in getting early access to killer new analytics tools to help super-charge your business? Keplar&#8217;s SnowPlow team would like to hear from you.</p>
<div class="wp-caption aligncenter" style="width: 471px"><img title="SnowPlow Security holds the line" src="http://blog.kepstatic.com/2012/05/snowplow-bouncer.jpg" alt="" width="461" height="364" /><p class="wp-caption-text">SnowPlow Security holds the line</p></div>
<p>At Keplar we are now hard at work building sophisticated web analytics, such as <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow" target="_blank">cohort analyses</a>, using our <a href="/blog/2012/02/introducing-snowplow-the-worlds-most-powerful-web-analytics-platform" target="_blank">open-source SnowPlow technology stack</a> (now <a href="https://github.com/snowplow/snowplow" target="_blank">available</a> on GitHub). So far, all of these analyses are being built on top of the &#8220;eventstream&#8221; data collected from a client&#8217;s website using the SnowPlow JavaScript tag installed across all pages.</p>
<p>Alongside these &#8220;eventstream&#8221;-based analyses, we are designing another set of analyses &#8211; equally powerful &#8211; based on the transactional data which lives inside your ecommerce platform; we have chosen to focus on PrestaShop first, because we have already developed and open-sourced <a href="https://github.com/orderly/prestashop-scala-client" target="_blank">a tool</a> for fetching transactional data out of PrestaShop&#8230;</p>
<p><span id="more-2386"></span></p>
<p>The first of our new analyses will be around customer lifetime value (CLV) analysis. CLV analysis is critically important for retailers (as it is for just about every other type of business). Customer lifetime value analysis makes it possible for retailers to quantify the value of an average customer over his/her expected lifetime, and hence work out how much it is worth spending to acquire new customers. But that is just the beginning: once a company has a handle on its customers&#8217; lifetime value, it can perform a cluster analysis to spot different segments of users with different average lifetime values &#8211; thus identifying different groups of customers, with different characteristics, some of whom are worth spending much more on acquiring and cultivating than others.</p>
<p>SnowPlow lets retailers go even further with customer lifetime value analysis, however. By combining a customer lifetime value analysis with the SnowPlow eventstream analytics, it is possible to identify those events and customer behaviours that:</p>
<ol>
<li>Are predictive of a customer belonging to a higher value segment, or:</li>
<li>Are predictive of them migrating from a lower value segment to a higher value segment</li>
</ol>
<p>This is invaluable to retailers who want to maximize satisfaction and profit per customer, guiding product development, service development and marketing promotions so that the retailer delivers services that their customers <i>love</i>.</p>
<p>Although we will start with CLV, we have a roadmap of other analyses to build leveraging operational e-commerce data; stay tuned for more about these other analyses in the near future.</p>
<p>So, we are looking for online retailers (up to five to start with) who are currently using PrestaShop and would like early access in return for giving us feedback and helping us to improve the analyses. Joining our Early Access Programme for SnowPlow will mean competition-beating access to some killer analytics tools, plus a chance to shape their evolution to help super-charge <i>your</i> business. And, assuming that this evolves into a hosted customer lifetime analysis product, we will give our Early Access Programme participants free access to this product for six months, followed by a first-year discount should you wish to continue using the product.</p>
<p>This is what we are looking for in a SnowPlow EAP participant:</p>
<ul>
<li>You use a recent version of PrestaShop, with Web Services enabled (<strong>or</strong> ready to be enabled)</li>
<li>You have had substantial order volumes over a two-to-three year period (if not more!)</li>
<li>All of your orders are stored in PrestaShop (<strong>or</strong> easily accessible if partially archived)</li>
<li>A good proportion of your customers are repeat customers</li>
</ul>
<p>If you think that you fit the bill, we would love to hear from you and build the future of e-commerce analytics together! Please email us at <a href="mailto:eap@keplarllp.com">eap@keplarllp.com</a> with a brief introduction to you and your online shop.</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/online-merchant-using-prestashop-announcing-the-snowplow-early-access-programme-eap/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open-sourcing symfony2-paypal-ipn, a Symfony bundle for PayPal IPN</title>
		<link>/blog/2012/05/open-sourcing-symfony2-paypal-ipn-a-symfony-bundle-for-paypal-ipn</link>
		<comments>/blog/2012/05/open-sourcing-symfony2-paypal-ipn-a-symfony-bundle-for-paypal-ipn#comments</comments>
		<pubDate>Sun, 13 May 2012 10:23:06 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[E-commerce]]></category>
		<category><![CDATA[instant payment notification]]></category>
		<category><![CDATA[ipn]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[paypal]]></category>
		<category><![CDATA[paypal ipn]]></category>
		<category><![CDATA[symfony]]></category>
		<category><![CDATA[symfony 2]]></category>
		<category><![CDATA[symfony bundle]]></category>
		<category><![CDATA[symfony2]]></category>

		<guid isPermaLink="false">/blog/?p=2369</guid>
		<description><![CDATA[Today we are pleased to announce the open-sourcing on GitHub of our new PayPal e-commerce library for Symfony 2. This is a direct port of our CodeIgniter PayPal IPN library which we open-sourced on this blog some 14 months ago. At Keplar we remain committed to using open-source projects where possible to keep costs down [...]]]></description>
			<content:encoded><![CDATA[<p>Today we are pleased to announce the open-sourcing on GitHub of our new <a href="https://github.com/orderly/symfony2-paypal-ipn" target="_blank">PayPal e-commerce library for Symfony 2</a>. This is a direct port of our <a href="https://github.com/orderly/codeigniter-paypal-ipn" target="_blank">CodeIgniter PayPal IPN library</a> which we open-sourced <a href="/blog/2011/03/our-first-open-source-release-an-e-commerce-library-for-using-paypal-with-codeigniter" target="_blank">on this blog</a> some 14 months ago.</p>
<p>At Keplar we remain committed to using open-source projects where possible to keep costs down for our clients and to avoid &#8220;reinventing the wheel&#8221;. Where high-quality open-source projects exist which meet our clients&#8217; needs, we use them by default; there are too many of these to name them all, but recent projects would have been impossible without <a href="http://hive.apache.org/" target="_blank">Hive</a> (Hadoop ecosystem), <a href="https://github.com/spray/spray/wiki" target="_blank">Spray</a> (Scala/Akka), <a href="https://github.com/j2labs/dictshield" target="_blank">DictShield</a> (Python), <a href="http://hackage.haskell.org/package/wai" target="_blank">WAI</a> (Haskell) and of course <a href="http://redis.io/" target="_blank">Redis</a>.</p>
<p>Where open-source projects do not exist that meet our requirements, we are increasingly looking to develop those tools in-house and open source them where possible (i.e. where they are not part of a client deliverable). Our biggest initiative so far in this area is the <a href="https://github.com/snowplow" target="_blank">SnowPlow web analytics platform</a>, which since its soft-launch in February is already being used externally by one ad network to <a href="/blog/2012/05/warehousing-your-online-ad-data-with-snowplow">track ad impression data</a> and internally by our team to perform some sophisticated analytics, such as <a href="/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow">website cohort analyses</a>.</p>
<p>Other projects we have open-sourced since our CodeIgniter PayPal module include a <a href="https://github.com/keplar/scalapac" target="_blank">Scala client for the Amazon Product Advertising API</a>, a command-line tool for <a href="/blog/2012/01/introducing-google-analytics-export-to-csv-a-fast-simple-way-to-get-your-google-analytics-data-into-your-favourite-analytics-programme">exporting Google Analytics data to CSV flatfiles</a>, and a <a href="https://github.com/orderly/prestashop-scala-client" target="_blank">Scala client for the PrestaShop e-commerce API</a> &#8211; the latter another release under our &#8220;Orderly&#8221; initiative for better e-commerce workflow automation and data analysis.</p>
<p>Onto our new Symfony2 library for PayPal IPN&#8230;</p>
<p><span id="more-2369"></span></p>
<p>As with the original CodeIgniter version, this library is designed to make it easier for developers using Symfony 2 to receive, validate and store instant payment notifications (IPNs) sent by PayPal when an order has been paid for by a customer. To be clear: there is already a &#8220;maximalist&#8221; bundle for handling PayPal payments in Symfony2 &#8211; the excellent <a href="https://github.com/schmittjoh/JMSPaymentPaypalBundle" target="_blank">JMSPaymentPaypalBundle</a>. By contrast, symfony2-paypal-ipn is a &#8220;minimalist&#8221; bundle which focuses on the post-payment workflow, validating an IPN notification from PayPal and then logging the order and order line items into your database (using the Doctrine 2 ORM). Sending an order confirmation email is also super-simple with our library.</p>
<p>For instructions on installing the bundle, please see the <a href="https://github.com/orderly/symfony2-paypal-ipn/blob/master/README.md" target="_blank">README file</a> in the project&#8217;s repository. Once the bundle is installed, using it in a Symfony controller to validate incoming orders and send order confirmation emails is quite straightforward &#8211; just write a controller like this:</p>
<pre class="brush: php; title: ; notranslate">
use Symfony\Component\HttpFoundation\Response;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Sensio\Bundle\FrameworkExtraBundle\Configuration\Template;
use Orderly\PayPalIpnBundle\Ipn;

class TwigNotificationEmailController extends Controller
{

    public $paypal_ipn;
    /**
     * @Route(&quot;/ipn-twig-email-notification&quot;)
     * @Template()
     */
    public function indexAction()
    {
        //getting ipn service registered in container
        $this-&gt;paypal_ipn = $this-&gt;get('orderly_pay_pal_ipn');

        //validate ipn (generating response on PayPal IPN request)
        if ($this-&gt;paypal_ipn-&gt;validateIPN())
        {
            // Succeeded, now let's extract the order
            $this-&gt;paypal_ipn-&gt;extractOrder();

            // And we save the order now (persist and extract are separate because you might only want to persist the order in certain circumstances).
            $this-&gt;paypal_ipn-&gt;saveOrder();

            // Now let's check what the payment status is and act accordingly
            if ($this-&gt;paypal_ipn-&gt;getOrderStatus() == Ipn::PAID)
            {
                //preparing message
                $message = \Swift_Message::newInstance()
                    -&gt;setSubject('Order confirmation')
                    -&gt;setFrom('support@CHANGEME.com', 'TEST')
                    -&gt;setTo($this-&gt;paypal_ipn-&gt;getOrder()-&gt;getPayerEmail(), $this-&gt;paypal_ipn-&gt;getOrder()-&gt;getFirstName() .' '. $this-&gt;paypal_ipn-&gt;getOrder()-&gt;getLastName())
                    -&gt;setBody($this-&gt;renderView('OrderlyPayPalIpnBundle:Default:confirmation_email.html.twig',
                            // Prepare the variables to populate the email template:
                            array('order' =&gt; $this-&gt;paypal_ipn-&gt;getOrder(),
                                  'items' =&gt; $this-&gt;paypal_ipn-&gt;getOrderItems())
                            ), 'text/html')
                ;
                //send message
                $this-&gt;get('mailer')-&gt;send($message);
            }
        }
        else // Just redirect to the root URL
        {
            return $this-&gt;redirect('/');
        }

        $response = new Response();
        $response-&gt;setStatusCode(200);

        return $response;
    }
}
</pre>
<p>If you prefer not to send an order confirmation email, then check out the <a href="https://github.com/orderly/symfony2-paypal-ipn/blob/master/src/Orderly/PayPalIpnBundle/Controller/NoNotificationController.php" target="_blank">NoNotificationController</a>.</p>
<p>Let us know how you get on with this library in the blog comments &#8211; and if there&#8217;s a feature missing that you would like, feel free to raise a <a href="https://github.com/orderly/symfony2-paypal-ipn/issues" target="_blank">new issue</a> over on GitHub. We hope you find it useful, and stay tuned for more &#8220;Orderly&#8221; releases very soon!</p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/open-sourcing-symfony2-paypal-ipn-a-symfony-bundle-for-paypal-ipn/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Performing cohort analysis on web analytics data using SnowPlow</title>
		<link>/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow</link>
		<comments>/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow#comments</comments>
		<pubDate>Tue, 08 May 2012 09:22:21 +0000</pubDate>
		<dc:creator>Yali</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SnowPlow]]></category>
		<category><![CDATA[cohort analysis]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[hiveql]]></category>
		<category><![CDATA[snowplow analytics]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">/blog/?p=2238</guid>
		<description><![CDATA[The cohort analysis blog post series Cohort analyses for digital businesses: an overview Performing cohort analysis on web analytics data using SnowPlow Performing the cohort analysis described by Eric Ries in the Lean Startup On the wide variety of cohort analyses Approaches to measuring user engagement as part of cohort analysis Approaches to measuring customer [...]]]></description>
			<content:encoded><![CDATA[<div id="post-series-box">
<p>The cohort analysis blog post series</p>
<ul>
<li><a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview">Cohort analyses for digital businesses: an overview</a></li>
<li>Performing cohort analysis on web analytics data using SnowPlow</li>
<li><a href="/blog/2012/05/performing-the-cohort-analysis-described-in-eric-riess-lean-startup-using-snowplow-and-hive">Performing the cohort analysis described by Eric Ries in the <i>Lean Startup</i></a></li>
<li><a href="/blog/2012/05/on-the-wide-variety-of-different-cohort-analyses-possible-with-snowplow">On the wide variety of cohort analyses</a></li>
<li><a href="/blog/2012/05/different-approaches-to-measuring-user-engagement-with-snowplow">Approaches to measuring user engagement as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/different-approaches-to-measuring-customer-lifetime-value-with-snowplow">Approaches to measuring customer value as part of cohort analysis</a></li>
<li><a href="/blog/2012/06/faking-cohort-analysis-with-google-analytics">Faking cohort analysis with Google Analytics</a></li>
</ul>
</div>
<p>In the previous blog post in this series, <a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview" title="cohort analysis for digital businesses">cohort analysis for digital businesses: an overview</a>, we described what cohort analysis is and why it is so powerful. In this post, we will look at how to perform cohort analysis on web analytics data in <a href="http://snowplowanalytics.com">SnowPlow</a>. We will start with an overview of the <a href="#general-methodology">general methodology and approach</a> for cohort analysis using SnowPlow, and then launch into a specific example analysis: the <a href="#specific-example">Twitter engagement example</a> that made cohort analysis so famous in startup circles.</p>
<p><a name="general-methodology"></a><strong>Methodology for performing cohort analyses in SnowPlow</strong></p>
<p>SnowPlow has been designed to:</p>
<ol>
<li>Make it easy to perform specific cohort analyses</li>
<li>Give users maximum flexibility to perform a wide range of cohort analyses, by making it easy to define cohorts in multiple ways and leverage multiple different metrics to measure and compare between the different cohorts</li>
</ol>
<p>To understand what makes SnowPlow so suitable for cohort analyses, we need to consider the way data is structured in SnowPlow. This is represented in the diagram below:</p>
<p style="text-align:left;margin-top:10px;margin-bottom:10px;"><img src="http://blog.kepstatic.com/2012/05/events-table.png" title="The SnowPlow events table schema overview"/></p>
<p>SnowPlow records all data in a <a href="http://snowplowanalytics.com/analytics/snowplow-table-structure.html">single events table</a> in <a href="http://hive.apache.org/" target="_blank">Hive</a>. Whenever one of your customers does <i>anything</i> on your website, be it click on a link, fill in a web form, play a video, add a product to basket, perform a search or roll-over an ad (to give just some examples), a line of data is generated in the events table.</p>
<p><span id="more-2238"></span></p>
<p>The events table contains a number of different fields. Some of these describe the particular event: distinguishing the type of event (e.g. a pageview or an add-to-basket), noting the time and date of the event, and recording the value of the event if it is, for example, a transaction. Other fields relate to the customer performing the action: distinguishing customers by, for example, the type of device they are using, where they are located and what language their browser is set to. (For a complete description of the table and fields, see the <a href="http://snowplowanalytics.com/analytics/snowplow-table-structure.html">analyst documentation</a> on the <a href="http://snowplowanalytics.com">SnowPlow website</a>.)</p>
<p>Recall from our <a href="/blog/2012/04/cohort-analyses-for-digital-businesses-an-overview" title="cohort analysis for digital businesses" target="_blank">last blog post</a> that performing a cohort analysis is a four-step process:</p>
<ol>
<li>Define your business question</li>
<li>Work out what is the most appropriate metric (or set of metrics) to measure, given your business question</li>
<li>Define your cohorts, given the business question</li>
<li>Perform the analysis</li>
</ol>
<p>From a data crunching perspective, the key challenge when performing a cohort analysis is to access the data necessary to calculate the metric identified in step 2, sliced by the cohorts defined in step 3. This is not always straightforward with conventional web analytics tools &#8211; we will explore the limitations to doing this in Google Analytics in a later post.</p>
<p>We can now see what makes SnowPlow such a good platform for performing cohort analyses. Defining our cohorts and performing the data separation is simple: we can define our cohorts by any of the data fields we associate with each visitor or, equally, by user behaviour (or even by some combination of the two e.g. define a cohort of &#8220;male customers who shop regularly&#8221;). In either case, a straightforward HiveQL SELECT statement of the following form will return a table mapping customers (i.e. user_ids) to cohorts (however they are defined, in terms of either customer data, or event data, or a combination of the two):</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `user_cohort_map` AS
SELECT
	`user_id`,
	[[ Generate cohort_ids using visitor_attributes and visitor_events ]] AS `cohort_id`
FROM
	`events`
GROUP BY
	`user_id`
</pre>
<p>Because we have <strong>all</strong> our raw data about customers and events in the events table, we have maximum flexibility to define our cohorts however we want, based on any combination of them. (In simple cases the cohort function can be written directly in HiveQL; in more difficult cases it can be implemented as a user-defined function.)</p>
<p>As a next step, we need to calculate, for each visitor, the value of the metric on which we are comparing the cohorts. Again, this is straightforward to generate from the same events table:</p>
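<p>To make the pseudo-code more concrete, here is a minimal sketch of a behaviour-based cohort definition. It assumes that the `events` table has a `user_id` field and the `dt` (date) field used later in this post; the cohort labels and the threshold of four active days are arbitrary values chosen purely for illustration:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch; labels and threshold are assumptions) */
CREATE TABLE `user_cohort_map` AS
SELECT
	`user_id`,
	/* users active on four or more distinct days count as 'regular' */
	CASE
		WHEN count(DISTINCT `dt`) &gt;= 4 THEN 'regular'
		ELSE 'occasional'
	END AS `cohort_id`
FROM
	`events`
GROUP BY
	`user_id`;
</pre>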
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
CREATE TABLE `metric_by_user` AS
SELECT
	`user_id`,
	[[ Generate metric_per_customer using visitor_attributes and visitor_events ]] AS `metric_per_customer`
FROM
	`events`
GROUP BY
	`user_id`
</pre>
<p>Once again, we have maximum flexibility to calculate any metric we want, based on having all the available data in the events table.</p>
<p>In the final step, we aggregate the metrics generated per user to a cohort level, so that we can compare the results for each cohort against one another:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL with pseudo-code */
SELECT
	`cohort_id`,
	[[ Aggregate the metric over all the customers in each cohort ]] AS `metric_per_cohort`
FROM
	`user_cohort_map`
INNER JOIN
	`metric_by_user`
ON `user_cohort_map`.`user_id` = `metric_by_user`.`user_id`
GROUP BY
	`cohort_id`
</pre>
<p>Let&#8217;s apply this approach to a specific example:</p>
<p><a name="specific-example"></a><strong>Example 1: Looking for improvements in user engagement over time (the Twitter case study)</strong></p>
<p>As a first example we&#8217;ll take the famous Twitter cohort analysis exploring whether user engagement levels were rising over time. A version of the data is shown below. Remember: the folks at Twitter wanted to examine whether they were getting better at getting users to engage over time. So they looked at the users they&#8217;d acquired in January and compared how many of them remained engaged after using the service for 1 month, 2 months, 3 months, 4 months etc., with the users they added in February. To see the results more clearly, click on the graph below for an enlarged version.</p>
<p style="text-align:center;"><a href="http://blog.kepstatic.com/2012/05/twitter-cohort-example.jpg"><img src="http://blog.kepstatic.com/2012/05/twitter-cohort-example.jpg" title="Cohort analysis from famous Twitter case study" / ></a></p>
<p>As you should be able to see from the graph, the fraction of users who remained engaged after 1, 2, 3, 4 months etc. rises over time, so that the results for the February cohort look better than those for January. Similarly, the March cohort&#8217;s results are better than February&#8217;s, the April cohort&#8217;s better than March&#8217;s, and so on.</p>
<p>We&#8217;ll follow the four-step approach to cohort analysis outlined in the <a href="/cohort-analyses-for-digital-businesses-an-overview">previous blog post</a>, namely:</p>
<ol>
<li>Define your business question</li>
<li>Work out the most appropriate metric to measure</li>
<li>Define your cohort</li>
<li>Perform the analysis</li>
</ol>
<p><strong><i>1. The business question:</i> Are we getting better at getting users to engage over time?</strong> We are clear on the question to be answered.</p>
<p><strong><i>2. The metric we want to examine:</i> For each cohort, we want to examine &#8220;engagement&#8221;</strong>. For this specific example, we&#8217;ll take the <i>number of actions per user per month</i>. In the Twitter example, they looked at the percentage of users who remained active after <i>X</i> months. This is a good measure of how good Twitter are at facilitating a &#8220;minimum&#8221; level of engagement. Our measure, by contrast, will better reflect the difference between users who only engage occasionally and those who use the service heavily, but does a worse job of indicating what percentage of users do not engage at all. (Which is more appropriate depends on the type of service we are looking at, but in reality, most companies should be looking at both types of measure.)</p>
<p>Remember: for each cohort, we will want to plot the engagement level for the first month of use versus engagement level for the second month of use, versus engagement level for the third month of use etc.</p>
<p>Our HiveQL query to measure engagement then looks like this:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `engagement_by_user` (
user_id string,
mn string,
engagement_per_month int
);

INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
`user_id`,
substr(`dt`,1,7) AS mn,
count(tm)
FROM spconsolidated
GROUP BY `user_id`, substr(`dt`,1,7);
</pre>
<p>Unlike SQL, HiveQL does not have date processing functionality, which means that we have to use HiveQL&#8217;s string processing functions to group the results by year and month. (Note: I have since realised this is wrong &#8211; Hive DOES have date functionality, which can be used to rewrite the above query more elegantly&#8230; Nonetheless, the above query will still work <img src='/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> )</p>
<p>The resulting table looks like this:</p>
<p style="text-align:center;"><img src="http://blog.kepstatic.com/2012/05/engagement-by-visitor-id.jpg" title="engagement levels by customer" width="350" /></p>
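<p>As the note above says, Hive does have date functions. A sketch of the same aggregation using the built-in `year()` and `month()` functions might look like the following. It assumes that `dt` holds dates as 'yyyy-MM-dd' strings, and the column aliases `yr` and `mth` are made up for this illustration (the result has a slightly different shape from `engagement_by_user`, which stores year and month in a single string column):</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch; `yr` and `mth` are assumed names) */
SELECT
	`user_id`,
	year(`dt`) AS `yr`,
	month(`dt`) AS `mth`,
	/* number of events recorded for this user in this month */
	count(`tm`) AS `engagement_per_month`
FROM `spconsolidated`
GROUP BY `user_id`, year(`dt`), month(`dt`);
</pre>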
<p><strong><i>3. Cohort definition:</i></strong> We want to know if we are getting better over time at getting users to engage. So we will define each cohort by the month in which the user first starts using the service. The query for categorising each customer by cohort in HiveQL is then very simple:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE TABLE `visitors_by_cohort` (
cohort string,
user_id string
);

INSERT OVERWRITE TABLE `visitors_by_cohort`
SELECT
substr(MIN(`dt`),1,7) as `cohort`,
`user_id`
FROM spconsolidated
GROUP BY `user_id`;
</pre>
<p>The resulting table looks like this:</p>
<p style="text-align:center;"><img src="http://blog.kepstatic.com/2012/05/visitor-to-cohort-map.jpg" title="customer to cohort map" width="300"/ ></p>
<p><strong><i>4. Perform the analysis:</i></strong> We simply combine the two queries above to aggregate our results by cohort, comparing the average engagement level between cohorts. This is illustrated schematically below:</p>
<p style="text-align:center;"><img src="http://blog.kepstatic.com/2012/05/performing-the-cohort-analysis-schematic.jpg" title="Performing the cohort analysis schema" /></p>
<p>The required query is straightforward to write in HiveQL:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL */
CREATE EXTERNAL TABLE `cohort_analysis` (
`cohort` string,
`month` int,
`average_engagement` double
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3n://{{your-snowplow-bucket-name}}/analysistables/cohortanalysis' ;

INSERT OVERWRITE TABLE `cohort_analysis`
SELECT
`cohort`,
(substr(`mn`,1,4)*12 + substr(`mn`,6,2)) - (substr(`cohort`,1,4)*12 + substr(`cohort`,6,2)) + 1 AS `month`,
AVG(`engagement_per_month`)
FROM `visitors_by_cohort`
JOIN `engagement_by_user`
ON `engagement_by_user`.`user_id` = `visitors_by_cohort`.user_id
GROUP BY `cohort`, (substr(`mn`,1,4)*12 + substr(`mn`,6,2)) - (substr(`cohort`,1,4)*12 + substr(`cohort`,6,2)) + 1;
</pre>
<p>The small complexity introduced in Hive is calculating the difference in months between the cohort date (i.e. the month when the user started using the service) and the month the calculation is being performed for. We convert each date to a count of months (the year multiplied by 12, plus the month number) and subtract the cohort&#8217;s count from that of the month being measured. We then add on a single month, so that the initial value (i.e. for the first month of use) is 1 instead of 0.</p>
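<p>To sanity-check the month arithmetic, a small sketch with hard-coded dates can help. Here we assume a user in the March 2012 cohort whose engagement is being measured in May 2012, which should be their third month of use. The dates are invented for illustration, and the `FROM ... LIMIT 1` clause is only there because older versions of Hive require a `FROM` clause:</p>
<pre class="brush: sql; title: ; notranslate">
/* HiveQL (illustrative sketch with made-up dates) */
SELECT
	(year('2012-05-01') * 12 + month('2012-05-01'))
	- (year('2012-03-01') * 12 + month('2012-03-01'))
	+ 1 AS `month`
FROM `visitors_by_cohort`
LIMIT 1;
/* (2012*12 + 5) - (2012*12 + 3) + 1 = 3 */
</pre>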
<p>Note also that we make this final table an external table, and save the results back to S3. The results data set is small and easy to plot even in Excel.</p>
<p>And here is what our final cohort analysis looks like:</p>
<p style="text-align:center;"><img src="http://blog.kepstatic.com/2012/05/final-results-table.jpg" title="final cohort results table" width="360"/ ></p>
<p>Et voila! Cohort analysis using web analytics data in four easy steps <img src='/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><strong>Update!</strong></p>
<p>Since writing this post, the <a href="http://snowplowanalytics.com" title="SnowPlow web analytics">SnowPlow website</a> has gone live, including a detailed page describing <a href="http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html" title="Performing cohort analyses with SnowPlow">how to perform cohort analyses with SnowPlow</a>.</p>
<p><i>If you would like help performing cohort analyses like these, or would like to hear more about the benefits of SnowPlow for your web business, please <a href="http://snowplowanalytics.com">visit the SnowPlow website</a>, or <a href="mailto:snowplow@keplarllp.com">contact us</a>.</i></p>
]]></content:encoded>
			<wfw:commentRss>/blog/2012/05/performing-cohort-analysis-on-web-analytics-data-using-snowplow/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
