<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" xml:lang="en-US">
  <id>tag:awesomeful.net,2009:/posts</id>
  <link rel="alternate" type="text/html" href="http://awesomeful.net" />
  
  <title>Awesomeful.net</title>
  <subtitle>Full of awesomeness</subtitle>
  <updated>2010-02-15T10:48:42-05:00</updated>
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/awesomeful" /><feedburner:info uri="awesomeful" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><link rel="license" type="text/html" href="http://creativecommons.org/licenses/by-sa/3.0/" /><logo>http://creativecommons.org/images/public/somerights20.gif</logo><entry>
    <id>tag:awesomeful.net,2009:Post/78</id>
    <published>2010-02-15T10:48:42-05:00</published>
    <updated>2010-02-18T09:41:26-05:00</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/awesomeful/~3/KuBILOWFRqQ/78-machine-learning-who-s-the-boss" />
    <title>Machine Learning - Who's the Boss?</title>
    <content type="html">&lt;p&gt;In the Machine Learning field, there are two types of algorithms that can be applied to a set of data to solve different kinds of problems: &lt;em&gt;Supervised&lt;/em&gt; and &lt;em&gt;Unsupervised&lt;/em&gt; learning algorithms. Both of these have in common that they aim to extract information or gain knowledge from the raw data that would otherwise be very hard and unpractical to do. This is because we live in very dynamic environments with changing parameters and vast amounts of data being gathered. This data hides important patterns and correlations that are sometimes impossible to deduce manually, and where computing power and smart algorithms excel. They are also heavily dependent on the quantity and quality of the input data, and as such, evolve in their output and accuracy as more and better input data becomes available.&lt;/p&gt;

&lt;p&gt;In this article we will walk through what constitutes Supervised and Unsupervised Learning. An overview of the language and terms is presented, as well as the general workflow used for machine learning tasks.&lt;/p&gt;

&lt;h3&gt;Supervised Learning&lt;/h3&gt;

&lt;p&gt;In supervised machine learning we have a set of data points or &lt;em&gt;observations&lt;/em&gt; for which we know the desired output, class, &lt;em&gt;target variable&lt;/em&gt; or &lt;em&gt;outcome&lt;/em&gt;. The outcome may take one of many values called &lt;em&gt;classes&lt;/em&gt; or &lt;em&gt;labels&lt;/em&gt;. A classic example is that given a few thousand emails for which we know whether they are spam or ham (their labels), the idea is to create a model that is able to deduce whether new, unseen emails are spam or not. In other words, we are creating a mapping function where the inputs are the email's sender, subject, date, time, body, attachments and other attributes, and the output is a prediction as to whether the email is spam or ham. The &lt;em&gt;target variable&lt;/em&gt; is in fact providing some level of &lt;em&gt;supervision&lt;/em&gt; in that it is used by the learning algorithm to adjust parameters or make decisions that will allow it to predict labels for new data. Finally of note, when the algorithm is predicting labels of observations we call it a &lt;em&gt;classifier&lt;/em&gt;. Some classifiers are also capable of providing the probability of a data point belonging to a class, in which case the classifier is often referred to as a probabilistic model or a regression - not to be confused with a &lt;a href="http://en.wikipedia.org/wiki/Regression_analysis#Regression_models"&gt;statistical regression model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's take this as an example of supervised learning. Given the following dataset, we want to predict whether new emails are spam or not. In the dataset below, note that the last column, &lt;code&gt;Spam?&lt;/code&gt;, contains the labels for the examples.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt; &lt;td&gt;&lt;b&gt;Subject&lt;/b&gt;&lt;/td&gt; &lt;td&gt;&lt;b&gt;Date&lt;/b&gt;&lt;/td&gt; &lt;td&gt;&lt;b&gt;Time&lt;/b&gt;&lt;/td&gt; &lt;td&gt;&lt;b&gt;Body&lt;/b&gt;&lt;/td&gt; &lt;td&gt;&lt;b&gt;Spam?&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt; &lt;td&gt;I has the viagra for you&lt;/td&gt; &lt;td&gt;03/12/1992&lt;/td&gt; &lt;td&gt;12:23 pm&lt;/td&gt; &lt;td&gt;Hi! I noticed that you are a software engineer &lt;br/&gt;so here's the pleasure you were looking for...&lt;/td&gt; &lt;td&gt;Yes&lt;/td&gt; &lt;/tr&gt;
    &lt;tr&gt; &lt;td&gt;Important business&lt;/td&gt; &lt;td&gt;05/29/1995&lt;/td&gt; &lt;td&gt;01:24 pm&lt;/td&gt; &lt;td&gt;Give me your account number and you'll be rich. &lt;br/&gt; I'm totally serial&lt;/td&gt; &lt;td&gt;Yes&lt;/td&gt; &lt;/tr&gt;
    &lt;tr&gt; &lt;td&gt;Business Plan&lt;/td&gt; &lt;td&gt;05/23/1996&lt;/td&gt; &lt;td&gt;07:19 pm&lt;/td&gt; &lt;td&gt;As per our conversation, here's the business plan for our new venture &lt;br/&gt; Warm regards...&lt;/td&gt; &lt;td&gt;No&lt;/td&gt; &lt;/tr&gt;
    &lt;tr&gt; &lt;td&gt;Job Opportunity&lt;/td&gt; &lt;td&gt;02/29/1998&lt;/td&gt; &lt;td&gt;08:19 am&lt;/td&gt; &lt;td&gt;Hi &amp;lt;name&amp;gt;!&lt;br/&gt;I am trying to fill a position for a PHP ... &lt;/td&gt; &lt;td&gt;Yes&lt;/td&gt; &lt;/tr&gt;
    &lt;tr&gt; &lt;td colspan="5"&gt; [A few thousand rows omitted] &lt;/td&gt; &lt;/tr&gt;
    &lt;tr&gt; &lt;td&gt;Call mom&lt;/td&gt; &lt;td&gt;05/23/2000&lt;/td&gt; &lt;td&gt;02:14 pm&lt;/td&gt; &lt;td&gt;Call mom. She's been trying to reach you for a few days now&lt;/td&gt; &lt;td&gt;No&lt;/td&gt; &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;


&lt;p&gt;A common workflow, and one that I've used for supervised learning analysis, is shown in the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://img.skitch.com/20100213-djhg1re7gaj83ngygcqgj1jm2d.png"&gt;&lt;/p&gt;

&lt;p&gt;The process is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Scale and prepare training data&lt;/em&gt;: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Create a training set and a validation set&lt;/em&gt; by randomly splitting the universe of data. The training set is the data that the classifier uses to learn how to classify the data, whereas the validation set is fed to the already trained model in order to get an error rate (or other measures) that helps us assess the classifier's performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data. Note that there is also &lt;a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)"&gt;cross-validation&lt;/a&gt;, but that is beyond the scope of this article.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Train the model&lt;/em&gt;. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Validation and tuning&lt;/em&gt;: After we've created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why in step 2 we separated out a subset of the data that was not used for training. We are in effect testing our model's generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed in the training data, and we can achieve a very low error in doing that, but how do the very same rules or mappings perform on new data that may have different input to output mappings? If the classification error on the validation set is very large compared to the training set's, then we have to go back and adjust model parameters. In that case the model has essentially memorized the answers seen in the training data, losing its generalization capabilities. This is called &lt;a href="http://en.wikipedia.org/wiki/Overfitting"&gt;&lt;em&gt;overfitting&lt;/em&gt;&lt;/a&gt;, and there are various techniques for overcoming it.&lt;/li&gt;
&lt;li&gt;Validate the model's performance. There are numerous techniques for achieving this, such as &lt;a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic"&gt;ROC analysis&lt;/a&gt; and many others. The model's accuracy can be improved by changing its structure or the underlying training data. If the model's performance is not satisfactory, change model parameters, inputs and/or scaling, go to step 3 and try again.&lt;/li&gt;
&lt;li&gt;Use the model to classify new data. In production. Profit!&lt;/li&gt;
&lt;/ol&gt;
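
&lt;p&gt;The steps above can be sketched in a few lines of Ruby. This is only an illustration under loud assumptions: the handful of subjects and labels is made up, and the word-counting "model" is a toy stand-in for a real learning algorithm, not any particular library's classifier. The helper names (&lt;code&gt;split&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt;, &lt;code&gt;spam?&lt;/code&gt; and &lt;code&gt;error_rate&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```ruby
# Toy labelled data: [subject, spam?]. Entirely made up for illustration.
EMAILS = [
  ['cheap pills for you', true],
  ['you won a prize', true],
  ['cheap prize inside', true],
  ['win cheap pills', true],
  ['prize pills cheap', true],
  ['business plan draft', false],
  ['call mom', false],
  ['lunch on friday', false],
  ['meeting notes', false],
  ['plan for friday', false]
].freeze

# Step 2: randomly split the data into a training set and a validation set.
def split(data, train_fraction, rng)
  shuffled = data.shuffle(random: rng)
  cut = (shuffled.size * train_fraction).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

# Step 3: "train" by counting how often each word shows up in spam and in ham.
def train(training_set)
  counts = Hash.new { |hash, word| hash[word] = { true => 0, false => 0 } }
  training_set.each do |subject, spam|
    subject.split.each { |word| counts[word][spam] += 1 }
  end
  counts
end

# Predict spam when the known words lean towards spam overall.
def spam?(model, subject)
  score = subject.split.sum do |word|
    next 0 unless model.key?(word)
    model[word][true] - model[word][false]
  end
  score.positive?
end

# Step 4: the error rate is the fraction of misclassified observations.
def error_rate(model, data)
  wrong = data.count { |subject, spam| spam?(model, subject) != spam }
  wrong.to_f / data.size
end

training, validation = split(EMAILS, 0.8, Random.new(42))
model = train(training)
puts "training error:   #{error_rate(model, training)}"
puts "validation error: #{error_rate(model, validation)}"
```

&lt;p&gt;Comparing the two printed error rates is the check described in step 4: a validation error much larger than the training error hints at overfitting.&lt;/p&gt;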


&lt;h3&gt;Unsupervised Learning&lt;/h3&gt;

&lt;p&gt;The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different from those suited for supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations in one way or another. Viewed from 20 thousand feet, the main difference is that we are not providing a target variable like we did in supervised learning.&lt;/p&gt;

&lt;p&gt;This marks a fundamental difference in how both types of algorithms operate. On one hand, we have supervised algorithms which try to minimize the error in classifying observations, while unsupervised learning algorithms don't have such luxuries because there are no outcomes or labels. Unsupervised algorithms try to create clusters of data that are inherently similar. In some cases we don't necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and grouping them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data where data points belonging to a cluster are as similar to each other as possible, while making the difference between the clusters as high as possible.&lt;/p&gt;

&lt;p&gt;Another main difference is that in a clustering problem, the concept of a "Training Set" does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships in the data that bucket the observations into different clusters. We could of course apply the same clustering model to new data, but unless it is impractical to do so (perhaps for performance reasons), we will almost certainly want to rerun the algorithm on the new data, as it will typically find new relationships that surface given the new observations.&lt;/p&gt;

&lt;p&gt;As a simple example, you could imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age range, gender and other attributes in such a way that you can develop targeted marketing programs. Another example may be to cluster patients by their chronic diseases and comorbidities in such a way that targeted interventions can be developed to help manage their diseases and improve their lifestyles.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://img.skitch.com/20100215-qm59id21fs2kr2m1r2sc5umwgw.png"&gt;&lt;/p&gt;

&lt;p&gt;For unsupervised learning, the process is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Scale and prepare raw data&lt;/em&gt;: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Build model&lt;/em&gt;: We run the unsupervised algorithm on the scaled dataset to get groups of like observations.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Validate&lt;/em&gt;: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.&lt;/li&gt;
&lt;li&gt;Once we are satisfied with the clusters created there is no need to run the model with new data (although you can). Profit!&lt;/li&gt;
&lt;/ol&gt;
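
&lt;p&gt;As a rough sketch of steps 2 and 3, here is a one-dimensional k-means in plain Ruby. The ages, the starting centroids and the helper names (&lt;code&gt;assign&lt;/code&gt;, &lt;code&gt;kmeans&lt;/code&gt;, &lt;code&gt;within_ss&lt;/code&gt;) are all assumptions made for illustration; a real project would use several scaled features and a tested library.&lt;/p&gt;

```ruby
# Made-up customer ages; any numeric feature would do.
AGES = [18, 21, 22, 25, 41, 43, 45, 47, 68, 70, 72].freeze

# Assign every point to its nearest centroid.
def assign(points, centroids)
  points.group_by { |x| centroids.min_by { |c| (x - c).abs } }
end

# One-dimensional k-means: alternate assignment and centroid updates
# until the centroids stop moving.
def kmeans(points, centroids)
  loop do
    clusters = assign(points, centroids)
    updated = centroids.map do |c|
      members = clusters[c]
      # A centroid that attracted no points stays where it is.
      members ? members.sum.to_f / members.size : c
    end
    return clusters.values if updated == centroids
    centroids = updated
  end
end

# Validation statistic from step 3: the within-group sum of squares.
def within_ss(clusters)
  clusters.sum do |members|
    mean = members.sum.to_f / members.size
    members.sum { |x| (x - mean)**2 }
  end
end

clusters = kmeans(AGES, [20.0, 40.0, 60.0])
clusters.each { |members| puts members.inspect }
puts "within-group sum of squares: #{within_ss(clusters)}"
```

&lt;p&gt;Rerunning with a different number of starting centroids and comparing the within-group sum of squares is one simple way to judge how cleanly the data separates.&lt;/p&gt;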


&lt;h4&gt;Step zero&lt;/h4&gt;

&lt;p&gt;A common step that I have not outlined above, and that should be performed when working on any such problem, is to get a strong understanding of the characteristics of the data. This should be a combination of visual analysis (for which I prefer the excellent &lt;a href="http://had.co.nz/ggplot2/"&gt;ggplot2&lt;/a&gt; library) as well as some basic descriptive statistics and data profiling such as quartiles, means, standard deviation, frequencies and others. &lt;a href="http://www.r-project.org"&gt;R&lt;/a&gt;'s &lt;a href="http://cran.r-project.org/web/packages/Hmisc/index.html"&gt;Hmisc&lt;/a&gt; package has a great function for this purpose called &lt;a href="http://lib.stat.cmu.edu/S/Harrell/help/Hmisc/html/describe.html"&gt;&lt;code&gt;describe&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I am convinced that skipping this step is a non-starter for any data mining project. It will allow you to identify missing values, the general distributions of the data and early outliers, among many other characteristics that drive the selection of attributes for your models.&lt;/p&gt;
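
&lt;p&gt;To give a rough idea of what such a profile contains, here is a sketch in plain Ruby rather than R. The &lt;code&gt;profile&lt;/code&gt; helper and the sample ages are hypothetical, and the quartiles use a simple rounded-index convention, so they may differ slightly from what a statistics package reports.&lt;/p&gt;

```ruby
# Summarise one numeric column; nil marks a missing value.
# Assumes at least two non-missing values.
def profile(values)
  present = values.compact.sort
  n = present.size
  mean = present.sum.to_f / n
  # Sample variance (divides by n - 1).
  variance = present.sum { |v| (v - mean)**2 } / (n - 1)
  # Quartiles by rounding to the nearest sorted index:
  # one simple convention among several.
  quartile = ->(q) { present[((n - 1) * q).round] }
  {
    n: n,
    missing: values.size - n,
    mean: mean,
    sd: Math.sqrt(variance),
    min: present.first,
    q1: quartile.call(0.25),
    median: quartile.call(0.5),
    q3: quartile.call(0.75),
    max: present.last
  }
end

ages = [23, 35, nil, 41, 29, 52, nil, 38, 47]
profile(ages).each { |name, value| puts "#{name}: #{value}" }
```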

&lt;h3&gt;Wrapping up&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://televixen.files.wordpress.com/2009/02/wtb.jpg"&gt;&lt;img src="http://televixen.files.wordpress.com/2009/02/wtb.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is certainly quite a bit of info, especially if these terms are new to you. To summarize:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;/td&gt;
      &lt;td&gt;&lt;b&gt;Supervised Learning&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;&lt;b&gt;Unsupervised Learning&lt;/b&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Objective&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;Classify or predict a class.&lt;/td&gt;
      &lt;td&gt;Find patterns inherent to the data, creating clusters of like data points. &lt;a href="http://en.wikipedia.org/wiki/Dimension_reduction"&gt;Dimensionality Reduction&lt;/a&gt;.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Example Implementations&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;Neural Networks (&lt;a href="http://en.wikipedia.org/wiki/Multilayer_perceptron"&gt;Multilayer Perceptrons&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Radial_basis_function_network"&gt;RBF Networks&lt;/a&gt; and others), &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machine"&gt;Support Vector Machines&lt;/a&gt;, Decision Trees (&lt;a href="http://en.wikipedia.org/wiki/ID3_algorithm"&gt;ID3&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/C4.5_algorithm"&gt;C4.5&lt;/a&gt; and others), &lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;Naive Bayes Classifiers&lt;/a&gt;...&lt;/td&gt;
      &lt;td&gt;&lt;a href="http://en.wikipedia.org/wiki/K-means_clustering"&gt;K-Means&lt;/a&gt; (and variants), &lt;a href="http://en.wikipedia.org/wiki/Cluster_analysis#Hierarchical_clustering"&gt;Hierarchical Clustering&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Self-organizing_map"&gt;Kohonen Self Organizing Maps&lt;/a&gt;...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Who's the Boss?&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;The target variable or outcome.&lt;/td&gt;
      &lt;td&gt;The relationships inherent to the data.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;


&lt;p&gt;Hopefully this article shows the main differences between Unsupervised and Supervised Learning. In follow-up posts we will dig into some of the specific implementations of these algorithms with examples in &lt;a href="http://www.r-project.org"&gt;R&lt;/a&gt; and &lt;a href="http://ruby-lang.org"&gt;Ruby&lt;/a&gt;.&lt;/p&gt;</content>
    <author>
      <name>Awesomeful.net</name>
    </author>
  <feedburner:origLink>http://awesomeful.net/posts/78-machine-learning-who-s-the-boss</feedburner:origLink></entry>
  <entry>
    <id>tag:awesomeful.net,2009:Post/77</id>
    <published>2010-01-17T14:55:45-05:00</published>
    <updated>2010-02-06T12:29:01-05:00</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/awesomeful/~3/NE_0IGkTNSo/77-experiences-porting-a-helper-plugin-to-rails-3" />
    <title>Experiences porting a helper plugin to Rails 3</title>
    <content type="html">&lt;p&gt;Today I spent a few minutes porting &lt;a href="http://github.com/hgimenez/truncate_html"&gt;truncate_html&lt;/a&gt; to Rails 3. This gem/plugin provides you with the &lt;code&gt;truncate_html()&lt;/code&gt; helper method, which is very similar to rails' &lt;code&gt;truncate()&lt;/code&gt;, but it takes care of closing open html tags and other peculiarities of truncating HTML. It works by using regular expressions and does not have any dependencies. I use this gem on this blog, as well as on the &lt;a href="http://bostonrb.org"&gt;bostonrb.org&lt;/a&gt; site. Some other people have found it to be &lt;a href="http://twitter.com/dolzenko/status/6428360551"&gt;useful&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the promises of Rails 3 is that there is an &lt;a href="http://www.engineyard.com/blog/2010/rails-and-merb-merge-plugin-api-part-3-of-6/"&gt;API for plugin developers&lt;/a&gt; that will allow you to hook into the right parts of Rails to add the functionality that your plugin provides. This means that you should not be mixing in or monkeypatching Rails core willy-nilly. In fact, it is now expected for you as a plugin developer to figure out how to hook into the right parts of Rails using the new API, as opposed to doing something like the following:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;ActionView&lt;/span&gt;::&lt;span class="co"&gt;Base&lt;/span&gt;.class_eval &lt;span class="r"&gt;do&lt;/span&gt;
  include &lt;span class="co"&gt;TruncateHtmlHelper&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;At this stage, there isn't much documentation around what the API actually is. But this shouldn't stop you from investigating and finding out. In this case, cloning the rails repo and using ack pointed me towards &lt;a href="http://github.com/rails/rails/blob/master/actionpack/lib/action_controller/metal/helpers.rb#L6-39"&gt;actionpack/lib/action_controller/metal/helpers.rb&lt;/a&gt;, where I found all the info I needed to remove the now outdated meta-programmed mixin technique of the dark Rails 2 days. From the docs:&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
In addition to using the standard template helpers provided in the Rails framework,
creating custom helpers to extract complicated logic or reusable functionality is strongly
encouraged. By default, the controller will include a helper whose name matches that of
the controller, e.g., MyController will automatically include MyHelper.

Additional helpers can be specified using the helper class method in
ActionController::Base or any controller which inherits from it.
&lt;/pre&gt;&lt;/blockquote&gt;


&lt;p&gt;Perfect. All I need to do in this case is &lt;a href="http://github.com/hgimenez/truncate_html/commit/5a33e52db3297a1b35af224d468636e2e68ecdc4"&gt;call the &lt;code&gt;helper&lt;/code&gt; class method with my helper's module&lt;/a&gt;: &lt;code class="inline"&gt;&lt;span class="co"&gt;ActionController&lt;/span&gt;::&lt;span class="co"&gt;Base&lt;/span&gt;.helper(&lt;span class="co"&gt;TruncateHtmlHelper&lt;/span&gt;)&lt;/code&gt;. A quick run through the app demonstrates however that we now need to mark strings as html_safe. Fine, let's &lt;a href="http://github.com/hgimenez/truncate_html/commit/7539b71f3c572f81ed890d2a9e9156ff51408e2b"&gt;do that&lt;/a&gt;: &lt;code class="inline"&gt;(&lt;span class="co"&gt;TruncateHtml&lt;/span&gt;::&lt;span class="co"&gt;HtmlTruncator&lt;/span&gt;.new(html).truncate(options)).html_safe!&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Finally, let's run the test suite - and &lt;em&gt;facepalm&lt;/em&gt;. The way this plugin is set up is that RSpec must be installed in the containing app for it to run the spec suite. Here's where I ran into the first real issue with the upgrade: I have not been able to install RSpec on a Rails 3 app. I also can't find any obvious way to do it by browsing its source code. For now I seem to be stuck in limbo land until the &lt;a href="http://blog.davidchelimsky.net/2010/01/12/rspec-2-and-rails-3/"&gt;RSpec/Rails 3 affair&lt;/a&gt; is all sorted out.&lt;/p&gt;

&lt;h3&gt;Backward Compatibility&lt;/h3&gt;

&lt;p&gt;The bigger question is how to maintain backward compatibility. One way to accomplish this is to continue to maintain two git branches, one for Rails 2 and one for Rails 3 (master), and to cherry-pick any bug fixes or enhancements from the master branch into the Rails 2 branch. However, how could we manage gem bundling and distribution of two gems built for two versions of Rails? I'd like to know how you are planning on maintaining backward compatibility. In this particular case, I almost don't care for backward compatibility, and users will simply have to know that version 0.2.2 of the gem is the latest working Rails 2 version, and must install that specific version when running under Rails 2.&lt;/p&gt;</content>
    <author>
      <name>Awesomeful.net</name>
    </author>
  <feedburner:origLink>http://awesomeful.net/posts/77-experiences-porting-a-helper-plugin-to-rails-3</feedburner:origLink></entry>
  <entry>
    <id>tag:awesomeful.net,2009:Post/75</id>
    <published>2009-09-25T16:31:37-04:00</published>
    <updated>2009-09-25T20:20:54-04:00</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/awesomeful/~3/I6PE-M-1Aj4/75-spec-your-yields-in-rspec" />
    <title>Spec your yields in RSpec</title>
    <content type="html">&lt;p&gt;Message expectations in RSpec's Mocking/Stubing framework provide means for spec'ing the yielded objects of a method. For example, consider the following spec where we expect the &lt;code class="inline"&gt;here_i_am&lt;/code&gt; method to &lt;code class="inline"&gt;&lt;span class="r"&gt;yield&lt;/span&gt; &lt;span class="pc"&gt;self&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;describe &lt;span class="co"&gt;Triviality&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
  describe &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;#here_i_am&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;

    let(&lt;span class="sy"&gt;:triviality&lt;/span&gt;) { &lt;span class="co"&gt;Triviality&lt;/span&gt;.new }

    it &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;yields self&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
      triviality.should_receive(&lt;span class="sy"&gt;:here_i_am&lt;/span&gt;).and_yield(triviality)
      triviality.here_i_am { }
    &lt;span class="r"&gt;end&lt;/span&gt;

  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Nice and easy. First we set the expectation and then we exercise the method so that the expectation is met, passing it a "no op" block - &lt;code class="inline"&gt;{}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's the method to make it pass.&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;class&lt;/span&gt; &lt;span class="cl"&gt;Triviality&lt;/span&gt;

  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;here_i_am&lt;/span&gt;
    &lt;span class="r"&gt;yield&lt;/span&gt; &lt;span class="pc"&gt;self&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;

&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Furthermore, we can test many yielded values by chaining the &lt;code class="inline"&gt;and_yield&lt;/code&gt; method on the expectation. Let's add a spec for a method that yields many times and see how that would play out:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;describe &lt;span class="co"&gt;Triviality&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
  describe &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;#one_two_three&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;

    let(&lt;span class="sy"&gt;:triviality&lt;/span&gt;) { &lt;span class="co"&gt;Triviality&lt;/span&gt;.new }

    it &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;yields the numbers 1, 2 and 3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
      triviality.should_receive(&lt;span class="sy"&gt;:one_two_three&lt;/span&gt;).and_yield(&lt;span class="i"&gt;1&lt;/span&gt;).and_yield(&lt;span class="i"&gt;2&lt;/span&gt;).and_yield(&lt;span class="i"&gt;3&lt;/span&gt;)
      triviality.one_two_three { }
    &lt;span class="r"&gt;end&lt;/span&gt;

  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And the method to make that pass:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;class&lt;/span&gt; &lt;span class="cl"&gt;Triviality&lt;/span&gt;

  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;one_two_three&lt;/span&gt;
    &lt;span class="r"&gt;yield&lt;/span&gt; &lt;span class="i"&gt;1&lt;/span&gt;
    &lt;span class="r"&gt;yield&lt;/span&gt; &lt;span class="i"&gt;2&lt;/span&gt;
    &lt;span class="r"&gt;yield&lt;/span&gt; &lt;span class="i"&gt;3&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;

&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This is kind of ugly though. What if it yields many more times, or if you just want to test that it yields all items of an array? A good example of this is the Enumerable's &lt;code class="inline"&gt;each&lt;/code&gt; method. In such cases we can store the &lt;code class="inline"&gt;&lt;span class="co"&gt;MessageExpectation&lt;/span&gt;&lt;/code&gt; object and call &lt;code class="inline"&gt;and_yield&lt;/code&gt; on it many times, in a loop. Take a look at the following example where we yield each letter of the alphabet:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;describe &lt;span class="co"&gt;Triviality&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
  describe &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;#alphabet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;

    let(&lt;span class="sy"&gt;:triviality&lt;/span&gt;) { &lt;span class="co"&gt;Triviality&lt;/span&gt;.new }

    it &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;yields all letters of the alphabet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
      expectation = triviality.should_receive(&lt;span class="sy"&gt;:alphabet&lt;/span&gt;)
      (&lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;A&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;...&lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;).each { |letter| expectation.and_yield(letter) }
      triviality.alphabet { }
    &lt;span class="r"&gt;end&lt;/span&gt;

  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And finally, the method to make it pass:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;class&lt;/span&gt; &lt;span class="cl"&gt;Triviality&lt;/span&gt;

  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;alphabet&lt;/span&gt;
    (&lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;A&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;...&lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;).each &lt;span class="r"&gt;do&lt;/span&gt; { |letter| &lt;span class="r"&gt;yield&lt;/span&gt; letter }
  &lt;span class="r"&gt;end&lt;/span&gt;

&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;code class="inline"&gt;and_yield&lt;/code&gt; is not only useful for message expectations. You can also use it on your &lt;code class="inline"&gt;stubs&lt;/code&gt;, just like you'd use &lt;code class="inline"&gt;and_returns&lt;/code&gt;.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=I6PE-M-1Aj4:A0pfBaUbHHg:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=I6PE-M-1Aj4:A0pfBaUbHHg:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=I6PE-M-1Aj4:A0pfBaUbHHg:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=I6PE-M-1Aj4:A0pfBaUbHHg:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/awesomeful/~4/I6PE-M-1Aj4" height="1" width="1"/&gt;</content>
    <author>
      <name>Awesomeful.net</name>
    </author>
  <feedburner:origLink>http://awesomeful.net/posts/75-spec-your-yields-in-rspec</feedburner:origLink></entry>
  <entry>
    <id>tag:awesomeful.net,2009:Post/73</id>
    <published>2009-09-24T17:58:19-04:00</published>
    <updated>2010-02-10T12:54:57-05:00</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/awesomeful/~3/FxE-TlHcLzg/73-an-object-quacks-like-a-duck" />
    <title>An object quacks like a duck</title>
    <content type="html">&lt;p&gt;I've been toying around with the idea of spec'ing mixins: that a class includes a module. Suppose the following class:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;class&lt;/span&gt; &lt;span class="cl"&gt;FooList&lt;/span&gt;
  include &lt;span class="co"&gt;Enumerable&lt;/span&gt;
 
  attr_accessor &lt;span class="sy"&gt;:some_array&lt;/span&gt;
  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;initialize&lt;/span&gt;(opts = {})
    &lt;span class="iv"&gt;@some_array&lt;/span&gt; = opts[&lt;span class="sy"&gt;:the_array&lt;/span&gt;] || []
  &lt;span class="r"&gt;end&lt;/span&gt;
 
  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;each&lt;/span&gt;
    &lt;span class="iv"&gt;@some_array&lt;/span&gt;.each { |item| &lt;span class="r"&gt;yield&lt;/span&gt; item }
  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We can test the behavior of the &lt;code class="inline"&gt;each&lt;/code&gt; method using RSpec, but we can also make sure that &lt;code class="inline"&gt;&lt;span class="co"&gt;FooList&lt;/span&gt;&lt;/code&gt; actually acts like an &lt;code class="inline"&gt;&lt;span class="co"&gt;Enumerable&lt;/span&gt;&lt;/code&gt;. Here's a quick RSpec matcher for just that (&lt;code class="inline"&gt;require&lt;/code&gt; it in your spec_helper.rb):&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Spec&lt;/span&gt;::&lt;span class="co"&gt;Matchers&lt;/span&gt;.define &lt;span class="sy"&gt;:quack_like&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt; |mod|
  match &lt;span class="r"&gt;do&lt;/span&gt; |instance|
    mod.instance_methods.all? { |method| instance.respond_to?(method) }
  &lt;span class="r"&gt;end&lt;/span&gt;

  failure_message_for_should &lt;span class="r"&gt;do&lt;/span&gt; |instance|
    &lt;span class="s"&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;expected the class &lt;/span&gt;&lt;span class="il"&gt;&lt;span class="idl"&gt;#{&lt;/span&gt;instance.class.name&lt;span class="idl"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class="k"&gt; to include the module &lt;/span&gt;&lt;span class="il"&gt;&lt;span class="idl"&gt;#{&lt;/span&gt;mod&lt;span class="idl"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;

  failure_message_for_should_not &lt;span class="r"&gt;do&lt;/span&gt; |instance|
    &lt;span class="s"&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;expected the class &lt;/span&gt;&lt;span class="il"&gt;&lt;span class="idl"&gt;#{&lt;/span&gt;instance.class.name&lt;span class="idl"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class="k"&gt; not to include the module &lt;/span&gt;&lt;span class="il"&gt;&lt;span class="idl"&gt;#{&lt;/span&gt;mod&lt;span class="idl"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;

  description &lt;span class="r"&gt;do&lt;/span&gt;
    &lt;span class="s"&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;expected the class to behave like a module by responding to all of its instance methods&lt;/span&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This allows us to spec some quacking:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;describe &lt;span class="co"&gt;FooList&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
  &lt;span class="r"&gt;def&lt;/span&gt; &lt;span class="fu"&gt;foo_list&lt;/span&gt;
    &lt;span class="iv"&gt;@foo_list&lt;/span&gt; ||= &lt;span class="co"&gt;FooList&lt;/span&gt;.new
  &lt;span class="r"&gt;end&lt;/span&gt;

  it &lt;span class="s"&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;quacks like an Enumerable&lt;/span&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt; &lt;span class="r"&gt;do&lt;/span&gt;
    foo_list.should quack_like &lt;span class="co"&gt;Enumerable&lt;/span&gt;
  &lt;span class="r"&gt;end&lt;/span&gt;
&lt;span class="r"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I am still experimenting with this. In a way it is not really testing behavior, but it's not really testing the implementation either. In other words, if every method in &lt;code class="inline"&gt;&lt;span class="co"&gt;Enumerable&lt;/span&gt;&lt;/code&gt; is implemented in &lt;code class="inline"&gt;&lt;span class="co"&gt;FooList&lt;/span&gt;&lt;/code&gt; and we remove the &lt;code class="inline"&gt;include &lt;span class="co"&gt;Enumerable&lt;/span&gt;&lt;/code&gt; line, the spec still passes.&lt;/p&gt;
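
&lt;p&gt;To make that last point concrete, here is a minimal sketch in plain Ruby, outside of RSpec. The module &lt;code class="inline"&gt;Quacker&lt;/code&gt;, the classes &lt;code class="inline"&gt;Duck&lt;/code&gt; and &lt;code class="inline"&gt;Robot&lt;/code&gt;, and the helper &lt;code class="inline"&gt;quacks_like?&lt;/code&gt; are all hypothetical names invented for this illustration. The helper performs the same kind of check as the matcher, and it cannot tell a class that includes the module from one that implements every method by hand.&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;
```ruby
# Hypothetical module and classes, used only to keep the illustration short.
module Quacker
  def quack
    "Quack!"
  end

  def waddle
    "waddle"
  end
end

# Includes the module, so it gets both methods from the mixin.
class Duck
  include Quacker
end

# Does NOT include the module, but implements every method by hand.
class Robot
  def quack
    "beep"
  end

  def waddle
    "whirr"
  end
end

# Same kind of check the matcher performs: does the instance respond
# to every instance method the module defines?
def quacks_like?(instance, mod)
  mod.instance_methods.all? { |name| instance.respond_to?(name) }
end

puts quacks_like?(Duck.new, Quacker)   # true
puts quacks_like?(Robot.new, Quacker)  # true, even without the include
puts Robot.include?(Quacker)           # false: the mixin itself is absent
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If the goal is to assert that the module is actually mixed in, a check along the lines of &lt;code class="inline"&gt;Robot.include?(Quacker)&lt;/code&gt; would be the stricter alternative, at the cost of testing implementation rather than behavior.&lt;/p&gt;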

&lt;p&gt;I've discussed this over IRC with some other &lt;a href="http://technicalpickles.com/"&gt;smart&lt;/a&gt; &lt;a href="http://www.enlightsolutions.com/"&gt;folks&lt;/a&gt;, but I want more input. Do you think this is appropriate? Useless?&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=FxE-TlHcLzg:gaLFoIzu4Cg:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=FxE-TlHcLzg:gaLFoIzu4Cg:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=FxE-TlHcLzg:gaLFoIzu4Cg:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=FxE-TlHcLzg:gaLFoIzu4Cg:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/awesomeful/~4/FxE-TlHcLzg" height="1" width="1"/&gt;</content>
    <author>
      <name>Awesomeful.net</name>
    </author>
  <feedburner:origLink>http://awesomeful.net/posts/73-an-object-quacks-like-a-duck</feedburner:origLink></entry>
  <entry>
    <id>tag:awesomeful.net,2009:Post/72</id>
    <published>2009-09-23T13:30:53-04:00</published>
    <updated>2009-09-24T11:32:39-04:00</updated>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/awesomeful/~3/iOidAHKNZPs/72-postgresql-s-group-by" />
    <title>PostgreSQL's group by</title>
    <content type="html">&lt;p&gt;Last night I noticed a user on IRC complaining in two different channels (#heroku and #rubyonrails), claiming something along the lines of "PostgreSQL sucks: i have this code &lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;named_scope &lt;span class="sy"&gt;:with_questions&lt;/span&gt;,
  &lt;span class="sy"&gt;:joins&lt;/span&gt; =&amp;gt; &lt;span class="sy"&gt;:questions&lt;/span&gt;,
  &lt;span class="sy"&gt;:group&lt;/span&gt; =&amp;gt; &lt;span class="s"&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;categories.id, categories.name, categories.created_at, categories.updated_at&lt;/span&gt;&lt;span class="dl"&gt;&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt; because of the way postgresql handles group by. It should only be &lt;code&gt;"categories.id"&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;The user was surprised that grouping by only &lt;code&gt;categories.id&lt;/code&gt; works on MySQL. Presumably, the user was getting the PostgreSQL message: &lt;code&gt;ERROR: column "categories.name" must appear in the group by clause or be used in an aggregate function&lt;/code&gt;. It turns out that this is not a bug, and PostgreSQL does not suck as this user initially thought. Furthermore, I tried a similar query on MS SQL Server, and it rightfully complains: &lt;code&gt;Column 'categories.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let's look at solutions.&lt;/p&gt;

&lt;h4&gt;Alternative Queries&lt;/h4&gt;

&lt;p&gt;The first thing that's wrong with this query is that what the user really wanted was a distinct list of categories that have questions. With that requirement in mind, the query should look something like one of the following two options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Option 1: Drop the &lt;code class="inline"&gt;&lt;span class="r"&gt;join&lt;/span&gt;&lt;/code&gt; and &lt;code class="inline"&gt;&lt;span class="r"&gt;group&lt;/span&gt; &lt;span class="r"&gt;by&lt;/span&gt;&lt;/code&gt;, and just use a condition checking whether a question exists for the category:&lt;/li&gt;
&lt;/ul&gt;


&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Category&lt;/span&gt;.all(&lt;span class="sy"&gt;:conditions&lt;/span&gt; =&amp;gt;
        &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;exists (select 1 from questions where categories.id = questions.category_id)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;A variation of this can be achieved with the &lt;code class="inline"&gt;&lt;span class="r"&gt;in&lt;/span&gt;&lt;/code&gt; operator:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Category&lt;/span&gt;.all(&lt;span class="sy"&gt;:conditions&lt;/span&gt; =&amp;gt; 
        &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;categories.id in (select category_id from questions)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Option 2: Again, drop the &lt;code class="inline"&gt;group by&lt;/code&gt;, and use a &lt;code class="inline"&gt;distinct&lt;/code&gt; instead:&lt;/li&gt;
&lt;/ul&gt;


&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Category&lt;/span&gt;.all(&lt;span class="sy"&gt;:select&lt;/span&gt; =&amp;gt; &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;distinct categories.*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;,
             &lt;span class="sy"&gt;:joins&lt;/span&gt;  =&amp;gt; &lt;span class="sy"&gt;:questions&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
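
&lt;p&gt;To see why the plain join needs either the &lt;code class="inline"&gt;exists&lt;/code&gt; condition or the &lt;code class="inline"&gt;distinct&lt;/code&gt;, here is a rough in-memory sketch in plain Ruby. The sample rows are invented for illustration, and the arrays only imitate what the database does; this is not how ActiveRecord or PostgreSQL evaluates these queries.&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;
```ruby
# Hypothetical in-memory stand-ins for the categories and questions tables.
categories = [
  { id: 1, name: "Ruby" },
  { id: 2, name: "SQL" },
  { id: 3, name: "Unused" }
]
questions = [
  { id: 10, category_id: 1 },
  { id: 11, category_id: 1 },
  { id: 12, category_id: 2 }
]

# Inner join: one output row per matching (category, question) pair,
# so a category with two questions shows up twice.
joined = categories.flat_map do |category|
  questions
    .select { |question| question[:category_id] == category[:id] }
    .map { category }
end

# Option 1 (exists): keep a category if at least one question points at it.
with_exists = categories.select do |category|
  questions.any? { |question| question[:category_id] == category[:id] }
end

# Option 2 (distinct): collapse the duplicate rows produced by the join.
with_distinct = joined.uniq

p joined.map { |category| category[:id] }         # [1, 1, 2]
p with_exists.map { |category| category[:id] }    # [1, 2]
p with_distinct.map { |category| category[:id] }  # [1, 2]
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Both options end up with the same two categories. The difference is that the first never produces the duplicates, while the second produces them and then throws them away.&lt;/p&gt;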


&lt;h4&gt;&lt;em&gt;Why&lt;/em&gt; PostgreSQL doesn't like the original query&lt;/h4&gt;

&lt;p&gt;The &lt;code class="inline"&gt;group by&lt;/code&gt; clause is used to collect data from multiple records having common values in a select statement, and project the result based on some aggregate function. It really does not make any sense to add a &lt;code class="inline"&gt;group by&lt;/code&gt; to a query that does not have an aggregate such as &lt;code class="inline"&gt;sum()&lt;/code&gt;, &lt;code class="inline"&gt;avg()&lt;/code&gt;, &lt;code class="inline"&gt;min()&lt;/code&gt;, &lt;code class="inline"&gt;max()&lt;/code&gt;, &lt;code class="inline"&gt;count()&lt;/code&gt;. There is an exception, but we'll talk about that later.&lt;/p&gt;

&lt;p&gt;As an example, we could retrieve every item along with a count of categories per item:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;select&lt;/span&gt; &lt;span class=""&gt;items&lt;/span&gt;.&lt;span class=""&gt;id&lt;/span&gt;, &lt;span class=""&gt;items&lt;/span&gt;.&lt;span class=""&gt;name&lt;/span&gt;, &lt;span class="pd"&gt;count&lt;/span&gt;(&lt;span class=""&gt;categories&lt;/span&gt;.&lt;span class=""&gt;id&lt;/span&gt;)
  &lt;span class="r"&gt;from&lt;/span&gt; &lt;span class=""&gt;items&lt;/span&gt;
  &lt;span class="r"&gt;inner&lt;/span&gt; &lt;span class="r"&gt;join&lt;/span&gt; &lt;span class=""&gt;categories&lt;/span&gt;
    &lt;span class="r"&gt;on&lt;/span&gt; &lt;span class=""&gt;items&lt;/span&gt;.&lt;span class=""&gt;id&lt;/span&gt; = &lt;span class=""&gt;categories&lt;/span&gt;.&lt;span class=""&gt;item_id&lt;/span&gt;
  &lt;span class="r"&gt;group&lt;/span&gt; &lt;span class="r"&gt;by&lt;/span&gt; &lt;span class=""&gt;items&lt;/span&gt;.&lt;span class=""&gt;id&lt;/span&gt;, &lt;span class=""&gt;items&lt;/span&gt;.&lt;span class=""&gt;name&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Note that every non-aggregated column in the &lt;code class="inline"&gt;select&lt;/code&gt; list must appear in the &lt;code class="inline"&gt;group by&lt;/code&gt; list. This is necessary for PostgreSQL to know which rows to &lt;code class="inline"&gt;count&lt;/code&gt; (or &lt;code class="inline"&gt;sum&lt;/code&gt;, or calculate the &lt;code class="inline"&gt;max&lt;/code&gt; of). Let's walk through a simplified example of what happens if we don't include one of these columns in the &lt;code class="inline"&gt;group by&lt;/code&gt; list.&lt;/p&gt;

&lt;p&gt;Suppose we have the following table:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;code | city
------------------
&lt;span class="i"&gt;0&lt;/span&gt;    | &lt;span class="co"&gt;Cambridge&lt;/span&gt;
&lt;span class="i"&gt;0&lt;/span&gt;    | &lt;span class="co"&gt;Boston&lt;/span&gt;
&lt;span class="i"&gt;1&lt;/span&gt;    | &lt;span class="co"&gt;Foxboro&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;What happens if we run the following query:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="r"&gt;select&lt;/span&gt; &lt;span class=""&gt;code&lt;/span&gt;, &lt;span class=""&gt;city&lt;/span&gt;
  &lt;span class="r"&gt;from&lt;/span&gt; &lt;span class="r"&gt;table&lt;/span&gt;
  &lt;span class="r"&gt;group&lt;/span&gt; &lt;span class="r"&gt;by&lt;/span&gt; &lt;span class=""&gt;code&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;What would you expect PostgreSQL to return for the row with a code equal to 0? Cambridge or Boston? When PostgreSQL is presented with an ambiguous query such as the one above, it stops and reports an error. Some other databases may go on and make their own decision as to what to return. To me, this is a broken spec. Furthermore, the result set may be inconsistent and unpredictable across DBMSes, or even across queries on the same database.&lt;/p&gt;
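
&lt;p&gt;The ambiguity is easy to see if we imitate the grouping in plain Ruby. This is only a sketch of the idea, using Ruby's &lt;code class="inline"&gt;group_by&lt;/code&gt; on an array of hashes as a stand-in for the table; the variable names are made up for the illustration, and it says nothing about how PostgreSQL implements grouping internally.&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;
```ruby
# The table from the example above, as an array of hashes.
rows = [
  { code: 0, city: "Cambridge" },
  { code: 0, city: "Boston" },
  { code: 1, city: "Foxboro" }
]

# "group by code": each distinct code collects all of its rows.
groups = rows.group_by { |row| row[:code] }

# Every city that falls into each group. Code 0 has two candidates,
# so there is no single value to put in a "city" column.
cities_by_code = groups.transform_values do |members|
  members.map { |row| row[:city] }
end

# An aggregate such as count(*) is well defined, because it folds
# the whole group into one value.
count_by_code = groups.transform_values { |members| members.size }

p cities_by_code[0]  # ["Cambridge", "Boston"]
p cities_by_code[1]  # ["Foxboro"]
p count_by_code[0]   # 2
p count_by_code[1]   # 1
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Picking either element of the first array would be an arbitrary choice, which is exactly the decision PostgreSQL refuses to make on your behalf.&lt;/p&gt;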

&lt;h4&gt;Exception to the rule&lt;/h4&gt;

&lt;p&gt;In earlier versions of PostgreSQL (pre 8.2), the query plan for a &lt;code&gt;group by&lt;/code&gt; was much more efficient than the one for a &lt;code&gt;select distinct&lt;/code&gt;. In some older Rails apps, we wrote things like the following to optimize performance:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Question&lt;/span&gt;.find(&lt;span class="sy"&gt;:all&lt;/span&gt;,
              &lt;span class="sy"&gt;:group&lt;/span&gt;      =&amp;gt; &lt;span class="co"&gt;Question&lt;/span&gt;.column_names.join(&lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;),
              &lt;span class="sy"&gt;:conditions&lt;/span&gt; =&amp;gt; &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Instead of the more natural:&lt;/p&gt;

&lt;div&gt;&lt;pre&gt;&lt;code class="multiline"&gt;&lt;span class="co"&gt;Question&lt;/span&gt;.find(&lt;span class="sy"&gt;:all&lt;/span&gt;,
              &lt;span class="sy"&gt;:select&lt;/span&gt;     =&amp;gt; &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;distinct questions.*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;,
              &lt;span class="sy"&gt;:conditions&lt;/span&gt; =&amp;gt; &lt;span class="s"&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="k"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This was an optimization that was specific to our environment and helped us avoid the relatively poor query plan and expensive &lt;code class="inline"&gt;&lt;span class="co"&gt;Seq&lt;/span&gt; &lt;span class="co"&gt;Scan&lt;/span&gt;&lt;/code&gt; that was slowing our app down.&lt;/p&gt;

&lt;p&gt;&lt;object width="560" height="340"&gt;&lt;param name="movie" value="http://www.youtube.com/v/XODMtOqmb9U&amp;amp;hl=en&amp;amp;fs=1&amp;amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/XODMtOqmb9U&amp;amp;hl=en&amp;amp;fs=1&amp;amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;I hope that after reading this you realize that this error is helping you, as a user, write better SQL. Complaining that the example query doesn't run on PostgreSQL is like complaining that your new &lt;a href="http://www.fender.com/products//search.php?partno=0110100747"&gt;Fender Strat&lt;/a&gt; sucks because when you play &lt;em&gt;Here Comes the Sun&lt;/em&gt; the very same way you played it on your &lt;a href="http://www.thebeatlesrockband.com/"&gt;Beatles Rock Band&lt;/a&gt; guitar, it doesn't sound the same. &lt;code&gt;/endrant&lt;/code&gt;&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=iOidAHKNZPs:b3HRW4f5Bxc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=iOidAHKNZPs:b3HRW4f5Bxc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/awesomeful?a=iOidAHKNZPs:b3HRW4f5Bxc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/awesomeful?i=iOidAHKNZPs:b3HRW4f5Bxc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/awesomeful/~4/iOidAHKNZPs" height="1" width="1"/&gt;</content>
    <author>
      <name>Awesomeful.net</name>
    </author>
  <feedburner:origLink>http://awesomeful.net/posts/72-postgresql-s-group-by</feedburner:origLink></entry>
</feed>
