<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Erics Tech Blog</title>
	
	<link>http://eric.lubow.org</link>
	<description>Thoughts, musings, and other idealistic (sometimes useful) systems and development hoopla.</description>
	<lastBuildDate>Mon, 12 Mar 2012 01:17:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/lubow/PAyY" /><feedburner:info uri="lubow/payy" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>lubow/PAyY</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Choosing a Product By Roadmap</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/ns5lvABBuiM/</link>
		<comments>http://eric.lubow.org/2011/musings/choosing-a-product-by-roadmap/#comments</comments>
		<pubDate>Fri, 18 Nov 2011 14:56:45 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Musings]]></category>
		<category><![CDATA[product]]></category>
		<category><![CDATA[roadmap]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=1060</guid>
		<description><![CDATA[There are a lot of reasons to choose a specific technology. You can decide based on what skills you or the engineers around you have. You can decide on a new technology because it&#8217;s the right tool. But there are times when all other things are equal and the flip of a coin would suffice. [...]]]></description>
			<content:encoded><![CDATA[<p>There are a lot of reasons to choose a specific technology.  You can decide based on what skills you or the engineers around you have.  You can decide on a new technology because it&#8217;s the right tool.  But there are times when all other things are equal and the flip of a coin would suffice.  In my mind, that&#8217;s when you choose a technology based on its roadmap.<br />
<span id="more-1060"></span><br />
Recently, at <a href="http://www.simplereach.com">SimpleReach</a>, we faced a rather large decision: choosing a backend data store that could be used for data mining.  We took a look at the usual suspects in this arena, including <a href="http://cassandra.apache.org/">Cassandra</a>, <a href="http://www.mongodb.org/">Mongo</a>, and <a href="http://hbase.apache.org/">HBase</a> (just to name a few).  Without getting into the technical details (since that isn&#8217;t what this post is about), it came down to Cassandra and HBase, and we ended up going with Cassandra.</p>
<p>The more interesting thing to note is not that we ended up with Cassandra, but what we used to make that decision.  When it came down to Cassandra and HBase, they both had their pros and cons for our use case.  In fact, it&#8217;s likely that either one of them would have worked out just fine in the long run.  We actually made our decision based on the community and the roadmap (but mostly the roadmap).</p>
<p>A product roadmap simply says where the product intends to be in the next few weeks, months, or years.  It matters because if the product roadmap in 12 months aligns with where you see your technology in 12 months, then it is probably a good fit.  It also potentially allows your organization to help influence the development of new technologies.  I&#8217;m sure some of you are thinking that it&#8217;s no fun being the guinea pig or being on the bleeding edge of everything.  But when you are pushing the limits of today&#8217;s technology stacks and applications, you have to be on the cutting edge every now and then.</p>
<p>A product roadmap doesn&#8217;t just tell you about the application itself; it tells you about the community.  If you see yourself aligned with the roadmap in the next 12-24 months, then you are also aligning yourself with the community.  And the community is hopefully full of people who have a like-minded set of goals for the use of the product in question.  So don&#8217;t just think about what&#8217;s good for you now; think about what&#8217;s going to be good for you (or your organization) in the future too.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2010/startup/culture-of-product-vs-culture-of-code/' rel='bookmark' title='Culture of Product vs. Culture of Code'>Culture of Product vs. Culture of Code</a></li>
<li><a href='http://eric.lubow.org/2009/musings/what-does-web-2-0-mean-to-you/' rel='bookmark' title='What Does Web 2.0 Mean To You?'>What Does Web 2.0 Mean To You?</a></li>
<li><a href='http://eric.lubow.org/2009/misc/bing-hunch-decision-engine/' rel='bookmark' title='Bing! Hunch! Decision Engine!'>Bing! Hunch! Decision Engine!</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/musings/choosing-a-product-by-roadmap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/musings/choosing-a-product-by-roadmap/</feedburner:origLink></item>
		<item>
		<title>Google Securing The Web One Discrete Monopolizing Push At A Time</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/XhtDvhn1V7o/</link>
		<comments>http://eric.lubow.org/2011/security/google-securing-the-web-one-discrete-monopolizing-push-at-a-time/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 12:52:31 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Security]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[ssl]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=1050</guid>
		<description><![CDATA[Contrary to speculation by some, Google&#8217;s decision to encrypt search data is motivated by the goal of making the web as a whole more secure, and it&#8217;s not driven by economic interests. I think Google is silently forcing the internet to do what it should be doing on its own. Google can&#8217;t just tell everyone [...]]]></description>
			<content:encoded><![CDATA[<p>Contrary to speculation by some, Google&#8217;s decision to encrypt search data is motivated by the goal of making the web as a whole more secure, and it&#8217;s not driven by economic interests.  I think Google is silently forcing the internet to do what it should be doing on its own.<br />
<span id="more-1050"></span><br />
<img src="http://eric.lubow.org/wp-content/uploads/2011/10/Google-Advanced-Security.png" alt="" title="Google Advanced Security" width="140" height="140" class="alignleft size-full wp-image-1055" />Google can&#8217;t just tell everyone to make their sites operate over SSL.  That would show their monopoly and their power (even though everyone knows it&#8217;s there).  So after <a href="http://blogs.ajc.com/jamie-dupree-washington-insider/2011/09/21/google-testimony-to-congress/">Eric Schmidt spoke to Congress</a> about many things (including privacy), Google is finally releasing encrypted search for logged-in users.  For more information on everything this means with regard to marketing and SEO, I recommend reading <a href="http://searchengineland.com/google-puts-a-price-on-privacy-98029">this comprehensive article</a> by <a href="http://searchengineland.com/">Search Engine Land</a>.  But for security, this has a whole different meaning.</p>
<p>Looking at this from a slightly different perspective, Google is saying that if you just make your site available over SSL, then you can continue to have your referrers.  And that is ultimately what people (read: marketers and SEO folks) want anyway.  To oversimplify a bit, making one&#8217;s site available over SSL is as easy as going to <a href="http://www.godaddy.com/">GoDaddy</a> or the like and buying and installing an SSL certificate on your web server.</p>
<p>But what does having this certificate really do?  It allows a website to be loaded in a secure, encrypted environment.  It also allows the browser and the user to validate that the site is who it says it is according to a set of authorities like Verisign or Thawte.  These are the folks whose job it is to verify that the certificate is being issued to a valid company (note that I said valid, not necessarily reputable, as it&#8217;s not the job of certificate authorities to determine reputation).</p>
<p>And on a more technical level, as a user, SSL certificates keep traffic between you and the website you are interacting with more secure.  Looking at this via the <a href="http://en.wikipedia.org/wiki/OSI_model">OSI model for networking</a>: all HTTP traffic happens at the application layer (layer 7), so when SSL is not present, everything is sent in plain text and can be <a href="http://en.wikipedia.org/wiki/Packet_analyzer">sniffed</a>.  SSL, which is a network protocol, occurs at layer 6 (the presentation layer) and therefore can encrypt and decrypt all the communications that happen at layer 7 (if used).</p>
<p>So if we all bit the bullet and added SSL capabilities to our sites, the net result would be a more secure internet from a user perspective.  There are plenty of worse things that Google could be doing than forcibly making the internet more secure.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2007/linux-security/10-more-tips-towards-securing-your-linux-system/' rel='bookmark' title='10 More Tips Towards Securing Your Linux System'>10 More Tips Towards Securing Your Linux System</a></li>
<li><a href='http://eric.lubow.org/2009/musings/what-does-web-2-0-mean-to-you/' rel='bookmark' title='What Does Web 2.0 Mean To You?'>What Does Web 2.0 Mean To You?</a></li>
<li><a href='http://eric.lubow.org/2009/ruby/rails/custom-google-maps-marker-with-ym4r_gm/' rel='bookmark' title='Custom Google Maps Marker With YM4R_GM'>Custom Google Maps Marker With YM4R_GM</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/security/google-securing-the-web-one-discrete-monopolizing-push-at-a-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/security/google-securing-the-web-one-discrete-monopolizing-push-at-a-time/</feedburner:origLink></item>
		<item>
		<title>Exploring AppleScript with Alfred Shortcuts</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/jJ-F4QazOpQ/</link>
		<comments>http://eric.lubow.org/2011/mac/exploring-applescript-with-alfred-shortcuts/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 07:34:35 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Mac]]></category>
		<category><![CDATA[alfred]]></category>
		<category><![CDATA[applescript]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=1041</guid>
		<description><![CDATA[If you have read my blog before, you&#8217;ll know that I am a big fan of Alfred (here). I love the shortcuts and the ability to make things quicker. One of the things I find myself doing quite frequently is looking for domains and their traffic counts on Alexa, Compete, and Quantcast. So I took [...]]]></description>
			<content:encoded><![CDATA[<p>If you have read my blog before, you&#8217;ll know that I am a big fan of <a href="http://www.alfredapp.com/">Alfred</a> (<a href="http://eric.lubow.org/2011/mac/5-apps-to-increase-mac-productivity/">here</a>).  I love the shortcuts and the ability to make things quicker.  One of the things I find myself doing quite frequently is looking for domains and their traffic counts on <a href="http://www.alexa.com">Alexa</a>, <a href="http://www.compete.com/">Compete</a>, and <a href="http://www.quantcast.com">Quantcast</a>.<span id="more-1041"></span></p>
<p>So I took my SysAdmin-based love of making things quicker and learned enough AppleScript to let Alfred make my life easier.  I wrote a script that, when passed a domain argument in Alfred, opens a Quantcast tab, a Compete tab, and an Alexa tab for that domain in Google Chrome (Note: With the script below, it MUST be Google Chrome).  To install, go to the Alfred preferences and create a new AppleScript extension.</p>
<p>Fill out the text boxes as follows:<br />
<strong>Title:</strong> domstat<br />
<strong>Description:</strong> Get the domain statistics for<br />
<strong>Keyword:</strong> domstat</p>
<p>Now put this in the script box:</p>
<div class="codecolorer-container applescript default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="applescript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff0033; font-weight: bold;">on</span> alfred_script<span style="color: #000000;">&#40;</span>q<span style="color: #000000;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> competeURL <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #009900;">&quot;http://siteanalytics.compete.com/&quot;</span> <span style="color: #000000;">&amp;</span> <span style="color: #0066ff;">item</span> <span style="color: #000000;">1</span> <span style="color: #ff0033; font-weight: bold;">of</span> q<br />
&nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> quantURL <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #009900;">&quot;http://www.quantcast.com/&quot;</span> <span style="color: #000000;">&amp;</span> <span style="color: #0066ff;">item</span> <span style="color: #000000;">1</span> <span style="color: #ff0033; font-weight: bold;">of</span> q<br />
&nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> alexaURL <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #009900;">&quot;http://www.alexa.com/siteinfo/&quot;</span> <span style="color: #000000;">&amp;</span> <span style="color: #0066ff;">item</span> <span style="color: #000000;">1</span> <span style="color: #ff0033; font-weight: bold;">of</span> q<br />
<br />
&nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">tell</span> <span style="color: #0066ff;">application</span> <span style="color: #009900;">&quot;Google Chrome&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> activeIndex <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #ff0033; font-weight: bold;">get</span> active <span style="color: #0066ff;">tab</span> <span style="color: #ff0033;">index</span> <span style="color: #ff0033; font-weight: bold;">of</span> <span style="color: #0066ff;">window</span> <span style="color: #000000;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">tell</span> <span style="color: #0066ff;">window</span> <span style="color: #000000;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> competeTab <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #0066ff;">make</span> <span style="color: #0066ff;">new</span> <span style="color: #0066ff;">tab</span> <span style="color: #ff0033; font-weight: bold;">with</span> <span style="color: #0066ff;">properties</span> <span style="color: #000000;">&#123;</span>URL:competeURL<span style="color: #000000;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> quantTab <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #0066ff;">make</span> <span style="color: #0066ff;">new</span> <span style="color: #0066ff;">tab</span> <span style="color: #ff0033; font-weight: bold;">with</span> <span style="color: #0066ff;">properties</span> <span style="color: #000000;">&#123;</span>URL:quantURL<span style="color: #000000;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> alexaTax <span style="color: #ff0033; font-weight: bold;">to</span> <span style="color: #0066ff;">make</span> <span style="color: #0066ff;">new</span> <span style="color: #0066ff;">tab</span> <span style="color: #ff0033; font-weight: bold;">with</span> <span style="color: #0066ff;">properties</span> <span style="color: #000000;">&#123;</span>URL:alexaURL<span style="color: #000000;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">end</span> <span style="color: #ff0033; font-weight: bold;">tell</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">set</span> active <span style="color: #0066ff;">tab</span> <span style="color: #ff0033;">index</span> <span style="color: #ff0033; font-weight: bold;">of</span> <span style="color: #0066ff;">window</span> <span style="color: #000000;">1</span> <span style="color: #ff0033; font-weight: bold;">to</span> activeIndex<br />
&nbsp; &nbsp; <span style="color: #ff0033; font-weight: bold;">end</span> <span style="color: #ff0033; font-weight: bold;">tell</span><br />
<span style="color: #ff0033; font-weight: bold;">end</span> alfred_script</div></div>
<p>To execute, just fire up Alfred and type: &#8220;domstat eric.lubow.org&#8221; and it will fire up Google Chrome and open up the tabs.</p>
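<p>If you want the same three lookups outside of Alfred, the URL construction is easy to reproduce in a shell function.  This is only a rough sketch: <em>domstat_urls</em> is a made-up helper name, and the URL prefixes are simply the ones used in the AppleScript.</p>

```shell
#!/bin/sh
# Sketch: print the three statistics URLs for a domain, one per line.
# domstat_urls is a hypothetical helper name; the URL prefixes are
# the same ones the AppleScript uses.
domstat_urls() {
    domain="${1:-}"
    if [ -z "$domain" ]; then
        echo "usage: domstat_urls DOMAIN"
        return 1
    fi
    printf '%s\n' \
        "http://siteanalytics.compete.com/$domain" \
        "http://www.quantcast.com/$domain" \
        "http://www.alexa.com/siteinfo/$domain"
}
```

<p>Calling it with eric.lubow.org prints one URL per line, which you could then hand to whatever browser launcher you prefer.</p>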
<p>Update:<br />
You can even download the Alfred extension directly from <a href="http://eric.lubow.org/wp-content/uploads/2011/09/domstat.alfredextension">here</a>.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2008/misc/the-next-step-in-browser-evolution/' rel='bookmark' title='The Next Step In Browser Evolution'>The Next Step In Browser Evolution</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/mac/exploring-applescript-with-alfred-shortcuts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/mac/exploring-applescript-with-alfred-shortcuts/</feedburner:origLink></item>
		<item>
		<title>Fixing CentOS Root Certificate Authority Issues</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/RDWDH7j2OIU/</link>
		<comments>http://eric.lubow.org/2011/security/fixing-centos-root-certificate-authority-issues/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 13:59:37 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Security]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[system]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=955</guid>
		<description><![CDATA[While trying to clone a repository from Github the other day on one of my EC2 servers, I ran into an SSL verification issue. As it turns out, Github renewed their SSL certificate (as people who are responsible about their web presence do when their certificate is about to expire). As a result, I [...]]]></description>
			<content:encoded><![CDATA[<p>While trying to clone a repository from <a href="http://github.com/">Github</a> the other day on one of my EC2 servers, I ran into an SSL verification issue. As it turns out, Github renewed their SSL certificate (as people who are responsible about their web presence do when their certificate is about to expire).  As a result, I couldn&#8217;t <em>git clone</em> over https.  This presents a problem since all my deploys work using <em>git clone</em> over https.<br />
<span id="more-955"></span></p>
<p>The error looks something like this:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">*** error: SSL certificate problem, verify that the CA cert is OK. Details:<br />
*** error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed while accessing https://github.com/indexzero/daemon.node.git/info/refs<br />
*** fatal: HTTP request failed<br />
*** Clone of 'https://github.com/indexzero/daemon.node.git' into submodule path 'support/daemon' failed</div></div>
<p>The error occurs because <a href="http://www.centos.org">CentOS</a> (at least the <a href="http://www.rightscale.com/">RightScale</a> version 5.6.8.1) has an old certificate authority bundle: <strong>/etc/pki/tls/certs/ca-bundle.crt</strong>.</p>
<p>I backed up the existing certificate file just to be on the safe side.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666;"># </span><span style="color: #c20cb9; font-weight: bold;">cp</span> <span style="color: #000000; font-weight: bold;">/</span>etc<span style="color: #000000; font-weight: bold;">/</span>pki<span style="color: #000000; font-weight: bold;">/</span>tls<span style="color: #000000; font-weight: bold;">/</span>certs<span style="color: #000000; font-weight: bold;">/</span>ca-bundle.crt <span style="color: #000000; font-weight: bold;">/</span>root<span style="color: #000000; font-weight: bold;">/</span>backup<span style="color: #000000; font-weight: bold;">/</span></div></div>
<p>To fix the issue, just download a new certificate bundle.  I used the one from haxx.se.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666;"># </span>curl http:<span style="color: #000000; font-weight: bold;">//</span>curl.haxx.se<span style="color: #000000; font-weight: bold;">/</span>ca<span style="color: #000000; font-weight: bold;">/</span>cacert.pem <span style="color: #660033;">-o</span> <span style="color: #000000; font-weight: bold;">/</span>etc<span style="color: #000000; font-weight: bold;">/</span>pki<span style="color: #000000; font-weight: bold;">/</span>tls<span style="color: #000000; font-weight: bold;">/</span>certs<span style="color: #000000; font-weight: bold;">/</span>ca-bundle.crt</div></div>
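<p>If you would rather not overwrite the live bundle directly, the backup and replace steps can be wrapped in a small sanity check.  The sketch below is only an illustration: <em>install_ca_bundle</em> and its three arguments are hypothetical names, and the check just confirms that the downloaded file contains at least one PEM certificate header before anything is copied.</p>

```shell
#!/bin/sh
# Sketch: install a downloaded CA bundle only if it looks like a PEM bundle.
# install_ca_bundle and its arguments are hypothetical names for this example.
install_ca_bundle() {
    new_bundle="$1"    # freshly downloaded file, e.g. cacert.pem
    live_bundle="$2"   # e.g. /etc/pki/tls/certs/ca-bundle.crt
    backup_dir="$3"    # e.g. /root/backup

    # A failed download often leaves a missing or empty file.
    if [ ! -s "$new_bundle" ]; then
        echo "refusing: $new_bundle is missing or empty"
        return 1
    fi

    # Count PEM certificate headers; an HTML error page has none.
    cert_count=$(grep -c -e '-----BEGIN CERTIFICATE-----' "$new_bundle" || true)
    if [ "$cert_count" -eq 0 ]; then
        echo "refusing: no certificates found in $new_bundle"
        return 1
    fi

    # Keep a copy of the current bundle before replacing it.
    mkdir -p "$backup_dir"
    if [ -f "$live_bundle" ]; then
        cp -p "$live_bundle" "$backup_dir/" || return 1
    fi
    cp "$new_bundle" "$live_bundle" || return 1
    echo "installed $cert_count certificates into $live_bundle"
}
```

<p>The idea would be to download cacert.pem to a temporary location first and then run the function as root with the temporary file, the live bundle path, and a backup directory.</p>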


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2010/ruby/stopping-curb-from-segfaulting/' rel='bookmark' title='Stopping Curb From Segfaulting'>Stopping Curb From Segfaulting</a></li>
<li><a href='http://eric.lubow.org/2009/ruby/rails/fixing-zlib-errors-on-capistrano-deploy/' rel='bookmark' title='Fixing zlib Errors On Capistrano Deploy'>Fixing zlib Errors On Capistrano Deploy</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/security/fixing-centos-root-certificate-authority-issues/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/security/fixing-centos-root-certificate-authority-issues/</feedburner:origLink></item>
		<item>
		<title>ec2-consistent-snapshot With Mongo</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/yjrZ4aO-5NU/</link>
		<comments>http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo/#comments</comments>
		<pubDate>Thu, 21 Apr 2011 07:00:47 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[Perl]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=863</guid>
		<description><![CDATA[I set up MongoDB on my Amazon EC2 instance knowing full well that it would have to be backed up at some point. I also knew that by using XFS, I could take advantage of filesystem freezing in a similar fashion to LVM snapshots. I remembered reading about backups on XFS with MySQL being done [...]]]></description>
			<content:encoded><![CDATA[<p>I set up <a href="http://www.mongodb.org/">MongoDB</a> on my Amazon EC2 instance knowing full well that it would have to be backed up at some point.  I also knew that by using XFS, I could take advantage of filesystem freezing in a similar fashion to LVM snapshots.  I remembered reading about backups on XFS with MySQL being done with <a href="http://alestic.com/2009/09/ec2-consistent-snapshot">ec2-consistent-snapshot</a>.  As with any piece of open source software, it just took a little tweaking to make it do what I wanted it to do.<br />
<span id="more-863"></span><br />
Out of the box, ec2-consistent-snapshot works great for freezing an XFS filesystem with MySQL because it not only stops the server, but also handles potential replication issues.  Following the steps outlined <a href="http://www.mongodb.org/pages/viewpage.action?pageId=19562846">here</a> by 10gen, I made a few slight adjustments to the core ec2-consistent-snapshot script to allow for MongoDB support.  In fact, it supports locking and fsyncing immediately prior to freezing and backup.  I have been using this script in production for a while now and it seems to work without issue for me.</p>
<p>In the usual spirit of social coding, I have added the script to Github: <a href="https://github.com/elubow/ec2-consistent-snapshot">https://github.com/elubow/ec2-consistent-snapshot</a>.</p>
<p>Running it is just this:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">ec2-consistent-snapshot &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\<br />
<span style="color: #660033;">--mongo</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\<br />
<span style="color: #660033;">--xfs-filesystem</span> <span style="color: #000000; font-weight: bold;">/</span>data &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \<br />
<span style="color: #660033;">--region</span> us-east-<span style="color: #000000;">1</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \<br />
<span style="color: #660033;">--description</span> <span style="color: #ff0000;">&quot;RAID snapshot <span style="color: #007800;">$(date +'%Y-%m-%d %H:%M:%S')</span>&quot;</span> \<br />
vol-VOL1 vol-VOL2 vol-VOL3 vol-VOL4 vol-VOL5 vol-VOL6 vol-VOL7 vol-VOL8</div></div>
<p>The options used here (for reference) are telling ec2-consistent-snapshot to use <em>--mongo</em>, on the <em>--xfs-filesystem</em> /data, in the us-east-1 <em>--region</em> (note that it&#8217;s just the region and not the availability zone within that region), to be backed up with the listed <em>--description</em> of the specified volumes.  You can even throw a <em>--mongo-stop</em> in there to have Mongo stopped before the file system freeze and then restarted after the volumes have been backed up.  Don&#8217;t forget that you need to set your Amazon keys in your environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for your key and secret respectively).</p>
<p>I attempted to keep the usage style consistent with Eric Hammond&#8217;s original version and just add Mongo support to it.</p>
<p><strong>Note:</strong> I also mentioned this on the <a href="http://groups.google.com/group/mongodb-user/browse_thread/thread/633c3fbc648861a1?pli=1">mailing list</a>.  But given the number of messages that fly around on the list daily, some folks may have missed it.</p>
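<p>Since a missing key is an easy way to break a scheduled backup, it can help to check the environment before calling the snapshot script.  This is a minimal sketch, not part of ec2-consistent-snapshot: <em>require_env</em> is a hypothetical helper, and the variable names in the usage comment are assumed to be the standard AWS ones.</p>

```shell
#!/bin/sh
# Sketch: fail early if required variables are unset or empty.
# require_env is a hypothetical helper, not part of ec2-consistent-snapshot.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example use before a backup run (variable names are an assumption):
# require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1
```

<p>The function returns non-zero if any named variable is empty, so a cron wrapper can bail out before the filesystem is ever frozen.</p>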
<p><strong>References:</strong></p>
<ul>
<li><a href="http://alestic.com/2009/09/ec2-consistent-snapshot">ec2-consistent-snapshot</a> blog entry by Eric Hammond</li>
<li><a href="https://github.com/elubow/ec2-consistent-snapshot">ec2-consistent-snapshot</a> on Github with MongoDB support</li>
<li><a href="http://www.mongodb.org/pages/viewpage.action?pageId=19562846">Backing up MongoDB on EC2 (10gen)</a></li>
</ul>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2010/databases/mongodb/getting-a-random-record-from-a-mongodb-collection/' rel='bookmark' title='Getting a Random Record From a MongoDB Collection'>Getting a Random Record From a MongoDB Collection</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo/</feedburner:origLink></item>
		<item>
		<title>5 Apps to Increase Mac Productivity</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/XOiLCkzKt0o/</link>
		<comments>http://eric.lubow.org/2011/mac/5-apps-to-increase-mac-productivity/#comments</comments>
		<pubDate>Tue, 05 Apr 2011 08:00:35 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Mac]]></category>
		<category><![CDATA[productivity]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=861</guid>
		<description><![CDATA[I like to think I have been making the most of what&#8217;s available on my Mac. This means taking advantage of some obscure and some not so obscure apps. I want to go through some of those apps and a little about their usage to help others get some of the benefit I get. There [...]]]></description>
			<content:encoded><![CDATA[<p>I like to think I have been making the most of what&#8217;s available on my Mac.  This means taking advantage of some obscure and some not so obscure apps.  I want to go through some of those apps and a little about their usage to help others get some of the benefit I get.  There are certainly other products available and even ones I use.  The 5 apps I describe are the ones I use the most frequently (and recommend to just about everyone I come in contact with who uses a Mac).<br />
<span id="more-861"></span></p>
<ol>
<li><strong style="font-size:18px;">Boxcar</strong>
<p><img src="http://tctechcrunch.files.wordpress.com/2010/11/b5.png?w=120&#038;h=120" /><br />
Just in case you haven&#8217;t heard of <a href="http://boxcar.io">Boxcar</a>, it&#8217;s what notifications for the iPhone should have been.  You can get push notifications for a ton of different services ranging from Facebook, Twitter, and email to Github or even something more custom (for those of you techies who read this blog).  This awesome iPhone application has recently been released for Mac desktop.  This means that the notifications you used to keep tabs open for (Facebook, Twitter, RSS feeds, email, Github, or whatever other services you use) are now all centrally located.  Boxcar for Mac is still beta-ish so expect it to get a lot better.  But centralized notifications help keep you from checking all 80,000 (or so) locations for new items to distract you.</p>
</li>
<li><strong style="font-size:18px;">Alfred</strong>
<p><img src="http://www.alfredapp.com/images/alfred-logo.png" height="120" width="120" /><br />
<a href="http://alfredapp.com">Alfred App</a> is what Spotlight should have been plus some.  It is by far the application that I use the most on my Mac.  It means that I can <em>grep</em> through files, search my entire filesystem, and either <em>open</em> a file or <em>find</em> the containing folder and open it up.  And with the <a href="http://www.alfredapp.com/powerpack/">Powerpack</a> you have the clipboard manager (which happens to be my favorite feature).  It stores favorite snippets and can keep old clipboard contents for long periods of time, all of it searchable.  If you try Alfred and it doesn&#8217;t make your life easier, then you are using it wrong.  I could go on for hours with how Alfred can make your Mac life better, but it&#8217;d be faster and easier to just read the <a href="http://alfredtips.tumblr.com/">Tips Blog.</a></p>
</li>
<li><strong style="font-size:18px;">Caffeine</strong>
<p><img src="http://a2.mzstatic.com/us/r1000/020/Purple/5f/0a/df/mzi.nvflrkie.175x175-75.png" height="120" width="120" /><br />
<a href="http://itunes.apple.com/us/app/caffeine/id411246225?mt=12">Caffeine</a> is not really a productivity app, but something more to prevent annoyance and generally an all around handy app to have.  It does one thing and does it well.  Caffeine prevents your computer from going to sleep.  This is great if you have a short screen saver that you don&#8217;t feel like changing or if you are watching a movie on Netflix and don&#8217;t want your computer to go to sleep.  There is something to be said for simplicity and doing something well.</p>
</li>
<li><strong style="font-size:18px;">Notational Velocity</strong>
<p><img src="http://i.imgur.com/pv5S8.png" height="120" width="120" /><br />
Mac sticky post-it style notes are good, but <a href="http://notational.net/">Notational Velocity</a> has taken it to the next level.  It&#8217;s freeform, searchable, remote-syncable, taggable notes (too many buzzwords, right?).  But the fact is, you just start typing and it saves as you go.  When you are done, you can add tags.  And if you have an iPhone, then you can install <a href="http://simplenoteapp.com/">SimpleNote</a> and have your notes from the computer sync&#8217;d to your phone (and vice versa).  But my favorite thing is just the fact that you can start typing and it is immediately searchable.  I have it open on all my spaces and I am constantly making notes.  I take items from my Alfred clipboard and paste them into NV as notes for how I get stuff working.  This way I keep track of everything I tried and then just remove the things I don&#8217;t use (and then use those notes to write a blog post).</p>
</li>
<li><strong style="font-size:18px;">Homebrew</strong>
<p><strong style="background-color:#745626;outline-color:#D7AF72;background-clip:border-box;font-size:42px;font-family:ChunkFiveRegular,serif;color:#D7AF72;line-height:30px;">HOMEBREW</strong></p>
<p>I have tried the gamut of package management for the Mac.  I compiled things from source (and that just gets messy).  I have also tried Fink and MacPorts and they both felt a little hackish given the naturally usable feel of OS X in general.  So I installed <a href="http://mxcl.github.com/homebrew/">Homebrew</a> and everything just sort of fell into place.  It&#8217;s just as simple as &#8220;<em>brew install $package</em>&#8221; (after Homebrew is installed of course).  And since every package is installed in isolation (<em>/usr/local/Cellar</em>), removing and upgrading can also be done with ease.  If there were a solid GUI in front of it, I would recommend Apple adopt it as a 3rd party package management system.</p>
</li>
</ol>
<p>If there are other packages or apps for the Mac that have had a great impact on your productivity, let me know.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2009/mac/things-todo-app/' rel='bookmark' title='Things (Todo App)'>Things (Todo App)</a></li>
<li><a href='http://eric.lubow.org/2009/mail/transferring-email-from-gmailgoogle-apps-to-dovecot-with-larch/' rel='bookmark' title='Transferring Email From Gmail/Google Apps to Dovecot With Larch'>Transferring Email From Gmail/Google Apps to Dovecot With Larch</a></li>
<li><a href='http://eric.lubow.org/2010/system-administration/creating-dummy-packages-on-debian/' rel='bookmark' title='Creating Dummy Packages On Debian'>Creating Dummy Packages On Debian</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/mac/5-apps-to-increase-mac-productivity/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/mac/5-apps-to-increase-mac-productivity/</feedburner:origLink></item>
		<item>
		<title>Using Vi Mode Everywhere</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/ReGhNiX1WUo/</link>
		<comments>http://eric.lubow.org/2011/tips/using-vi-mode-everywhere/#comments</comments>
		<pubDate>Tue, 15 Mar 2011 07:00:07 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Tips]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[vim]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=855</guid>
		<description><![CDATA[Not literally everywhere, but more places than usual. I have been looking for this solution for a long time and finally found it. Anyone who has ever worked around me knows that I do basically everything in Vi. Not only do I use it to edit files, but I use it as an IDE for [...]]]></description>
			<content:encoded><![CDATA[<p>Not literally everywhere, but more places than usual.  I have been looking for this solution for a long time and finally found it.  Anyone who has ever worked around me knows that I do basically everything in <a href="http://www.vim.org/">Vi</a>.<br />
<span id="more-855"></span><br />
Not only do I use it to edit files, but I use it as an IDE for development (even on my Mac instead of Textmate).  So the natural extension is for me to use it in the command prompt as well.  So in my <strong>.bashrc</strong> file, I have the line:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">set</span> <span style="color: #660033;">-o</span> <span style="color: #c20cb9; font-weight: bold;">vi</span></div></div>
<p>This allows me to navigate the bash console with the usual vim suspects: <em>h, j, k, l</em>.  In addition to that, I also get some fun ones like word movement <em>w</em> and the <em>dw</em> that goes along with it.</p>
<p>But the big winner for me is that I am now able to use the vim environment and movement keys inside the irb (Ruby), Mongo, and MySQL consoles (still not Redis though).  To do that, just append these lines to the corresponding files:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;bind -v&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> ~<span style="color: #000000; font-weight: bold;">/</span>.editrc<br />
$ <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;set editing-mode vi&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> ~<span style="color: #000000; font-weight: bold;">/</span>.inputrc</div></div>
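<p>One catch with the <em>echo</em> lines above is that running them twice leaves duplicate entries.  Here is a minimal sketch of an idempotent version.  The helper names <em>append_once</em> and <em>enable_vi_mode</em> are made up for illustration and are not part of bash or readline.  As far as I know, <strong>.editrc</strong> is read by libedit-based consoles and <strong>.inputrc</strong> by GNU readline-based ones, so which file a given console honors depends on how it was built.</p>

```shell
# append_once LINE FILE: add LINE to FILE unless an identical line is already there.
# (Hypothetical helper, not a bash builtin.)
append_once() {
  touch "$2"
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# enable_vi_mode DIR: write the vi-mode settings into the dotfiles under DIR.
enable_vi_mode() {
  append_once 'set -o vi' "$1/.bashrc"
  append_once 'bind -v' "$1/.editrc"
  append_once 'set editing-mode vi' "$1/.inputrc"
}
```

<p>Calling <em>enable_vi_mode "$HOME"</em> applies it to your own account, and running it a second time changes nothing.</p>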


<p>No related posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/tips/using-vi-mode-everywhere/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/tips/using-vi-mode-everywhere/</feedburner:origLink></item>
		<item>
		<title>Common Pig One Liners</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/b9aW4cqHIYw/</link>
		<comments>http://eric.lubow.org/2011/hadoop/common-pig-one-liners/#comments</comments>
		<pubDate>Tue, 01 Mar 2011 07:30:38 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=847</guid>
		<description><![CDATA[As with any programming language, there is a bit of a learning curve with Pig. So here are a few common items that I found useful. If you know Pig, please feel free to add your own in the comments section. When it comes to Pig, there is a &#8220;filter early, filter often&#8221; approach that [...]]]></description>
			<content:encoded><![CDATA[<p>As with any programming language, there is a bit of a learning curve with Pig.  So here are a few common items that I found useful.  If you know Pig, please feel free to add your own in the comments section.<br />
<span id="more-847"></span><br />
When it comes to Pig, there is a &#8220;filter early, filter often&#8221; approach that is preached and practiced.  So some of these may be more than one line, but either way, they are short.  These have all been tested only on Pig 0.6 on Amazon&#8217;s Elastic Map Reduce version of Pig.  Since they are simple, they should be fairly portable.  As one would expect, these are contrived examples.</p>
<ul>
<li>Count all the items in a bucket.  The SQL equivalent being: <em>SELECT COUNT(*) FROM foo</em>.
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">-- Assuming that 'visits' contains all visits to your website (for example)</span><br />
<span style="color: #808080; font-style: italic;">-- Returns: (100L)</span><br />
total_visits <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FOREACH</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GROUP</span></a> visits <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">ALL</span></a><span style="color: #66cc66;">&#41;</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GENERATE</span></a> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">COUNT</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">$</span>1<span style="color: #66cc66;">&#41;</span>;</div></div>
</li>
<li>Grouping on multiple elements in a bag.  Assuming you have a bag with 4 tuples that looks like this: <em>(1,Football),(2,Soccer),(1,Soccer),(2,Soccer)</em>.  You may want to know how many of user type 1 are &#8220;Football&#8221; or &#8220;Soccer&#8221; and how many of user type 2 are &#8220;Football&#8221; or &#8220;Soccer&#8221;.  Note: If you want user_type and sport in a separate bag, just remove the <em>FLATTEN($0)</em>.
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">-- Group by user_type and then by sports interest</span><br />
<span style="color: #808080; font-style: italic;">-- Returns: (1,Football,1L),(1,Soccer,1L),(2,Soccer,2L)</span><br />
<span style="color: #808080; font-style: italic;">-- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{group::user_type: chararray,group::sport: chararray,total: long}</span><br />
sports_interests_by_user_type <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FOREACH</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GROUP</span></a> user_type <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">BY</span></a> <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">CHARARRAY</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">,</span> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">CHARARRAY</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>1<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GENERATE</span></a> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FLATTEN</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">,</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">COUNT</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">$</span>1<span style="color: #66cc66;">&#41;</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">AS</span></a> total;</div></div>
</li>
<li>Add a field to every element in a bag.  From my understanding, this next bit is a Pig 0.6ism.  Joining both relations on the constant 1 pairs every tuple with the single <em>total_visits</em> tuple.  The outcome is similar to an array push of a field onto the end of every tuple in a bag.
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">-- Add total visits to every sports_interest_by_user_type </span><br />
<span style="color: #808080; font-style: italic;">-- Returns: (2,Soccer,2L,100L)</span><br />
<span style="color: #808080; font-style: italic;">-- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{sports_interests_by_user_type::group: chararray,sports_interests_by_user_type::total: long,long}</span><br />
sports_interests_by_user_type_fraction <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">JOIN</span></a> sports_interests_by_user_type <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">BY</span></a> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">,</span> total_visits <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">BY</span></a> <span style="color: #cc66cc;">1</span>;</div></div>
</li>
<li>Let&#8217;s take the field that we added to the end of the tuple and get a percentage out of it.  This returns each count as a percentage of the total.
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">-- Divide the number of user_types per interest by the total</span><br />
<span style="color: #808080; font-style: italic;">-- Returns: (2,Soccer,2.0F)</span><br />
<span style="color: #808080; font-style: italic;">-- &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{sports_interests_by_user_type::group::user_type: chararray,sports_interests_by_user_type::group::sport: chararray,float}</span><br />
sports_interests_by_user_type_percent <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FOREACH</span></a> sports_interests_by_user_type_fraction <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GENERATE</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">CHARARRAY</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">,</span> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">CHARARRAY</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>1<span style="color: #66cc66;">,</span> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">FLOAT</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">FLOAT</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>2 <span style="color: #66cc66;">/</span> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">FLOAT</span></a><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">$</span>3<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">*</span> <span style="color: #cc66cc;">100</span><span style="color: #66cc66;">&#41;</span>;</div></div>
</li>
</ul>
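<p>When checking Pig output against a small local sample, it can help to reproduce the same aggregation with standard Unix tools.  The sketch below is shell, not Pig, and the function names <em>count_pairs</em> and <em>add_percent</em> are invented for illustration.  It assumes one <em>user_type,sport</em> pair per line with no spaces or extra commas in the values.</p>

```shell
# count_pairs: read "user_type,sport" lines on stdin and print
# "user_type,sport,count" for each distinct pair (the GROUP ... COUNT step).
count_pairs() {
  sort | uniq -c | awk '{ print $2 "," $1 }'
}

# add_percent TOTAL: read "user_type,sport,count" lines on stdin and append
# count / TOTAL * 100 to each line (the JOIN BY 1 and percentage steps).
add_percent() {
  awk -F, -v total="$1" '{ printf "%s,%s,%s,%.1f\n", $1, $2, $3, ($3 / total) * 100 }'
}
```

<p>Feeding the four sample tuples from the grouping example through <em>count_pairs</em> gives one line per distinct pair, and <em>add_percent 100</em> then appends each count as a percentage of 100 total visits.</p>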
<p>This post is another example of work that I could not have accomplished without the help of people on #hadoop-pig on irc.freenode.net.  Also worthy of note are the Pig Latin manuals <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html">here</a> and <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html">here</a>.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/' rel='bookmark' title='Pig Queries Parsing JSON on Amazons Elastic Map Reduce Using S3 Data'>Pig Queries Parsing JSON on Amazons Elastic Map Reduce Using S3 Data</a></li>
<li><a href='http://eric.lubow.org/2007/perl/creating-a-process-table-hash-in-perl/' rel='bookmark' title='Creating a Process Table hash in Perl'>Creating a Process Table hash in Perl</a></li>
<li><a href='http://eric.lubow.org/2010/perl/perl-modules/using-unique-keys-and-key-groups-with-background-jobs-in-gearmanclient/' rel='bookmark' title='Using Unique Keys and Key Groups with Background Jobs in Gearman::Client'>Using Unique Keys and Key Groups with Background Jobs in Gearman::Client</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/hadoop/common-pig-one-liners/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/hadoop/common-pig-one-liners/</feedburner:origLink></item>
		<item>
		<title>Pig Queries Parsing JSON on Amazons Elastic Map Reduce Using S3 Data</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/sPV-rjDJTZo/</link>
		<comments>http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/#comments</comments>
		<pubDate>Wed, 23 Feb 2011 07:15:42 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=839</guid>
		<description><![CDATA[I know the title of this post is a mouthful, but it&#8217;s the fun of pushing the envelope of existing technologies. What I am looking to do is take my log data stored on S3 (which is in compressed JSON format) and run queries against it. In order to not have to learn everything about setting [...]]]></description>
			<content:encoded><![CDATA[<p>I know the title of this post is a mouthful, but it&#8217;s the fun of pushing the envelope of existing technologies.  What I am looking to do is take my log data stored on S3 (which is in compressed JSON format) and run queries against it.  I want to avoid learning everything about setting up Hadoop, still leverage the power of Hadoop&#8217;s distributed data processing framework, and not have to learn how to write map reduce jobs and &#8230; (this could go on for a while so I&#8217;ll just stop here).  For all these reasons, I chose to use Amazon&#8217;s Elastic Map Reduce infrastructure and Pig.<br />
<span id="more-839"></span><br />
Describing all these technologies is beyond the scope of this article.  I will talk you through how I was able to do all this with a little help from the Pig community and a lot of late nights.  I will also provide an example Pig script detailing a little about how I deal with my logs (which are admittedly slightly abnormal).  I will also be making some assumptions here.  Each time I make a large assumption, I will let you know.</p>
<p>First off, I am going to assume that you have an Amazon Web Services (AWS) account and that you have also signed up for Elastic Map Reduce (EMR).  For all this, I followed the instructions in <a href="http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html">this video</a> by Ian @ AWS to get me going.  Now SSH into the machine so we can get started.</p>
<p>As of the time of this writing, EMR is using Hadoop 0.20 and Pig 0.6.  Everything I am going to talk about is for Pig 0.6.  With any luck, upgrades to Pig will have taken a lot of this into account.  Once you are on the EMR master host, type the following commands to get <a href="https://github.com/kevinweil/elephant-bird">elephant-bird</a> downloaded.  We are going to use it to build a jar that will parse our JSON (big thanks to Dmitriy for all the help here).  Note: The reason we are pulling with wget as opposed to git directly is that we want the <em>jsonloader</em> branch and this is just easier.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">$ <span style="color: #c20cb9; font-weight: bold;">mkdir</span> <span style="color: #c20cb9; font-weight: bold;">git</span> <span style="color: #000000; font-weight: bold;">&amp;&amp;</span> <span style="color: #c20cb9; font-weight: bold;">mkdir</span> pig-jars<br />
$ <span style="color: #7a0874; font-weight: bold;">cd</span> <span style="color: #c20cb9; font-weight: bold;">git</span> <span style="color: #000000; font-weight: bold;">&amp;&amp;</span> <span style="color: #c20cb9; font-weight: bold;">wget</span> <span style="color: #660033;">--no-check-certificate</span> https:<span style="color: #000000; font-weight: bold;">//</span>github.com<span style="color: #000000; font-weight: bold;">/</span>kevinweil<span style="color: #000000; font-weight: bold;">/</span>elephant-bird<span style="color: #000000; font-weight: bold;">/</span>tarball<span style="color: #000000; font-weight: bold;">/</span>eb1.2.1_with_jsonloader<br />
$ <span style="color: #c20cb9; font-weight: bold;">tar</span> xzf eb1.2.1_with_jsonloader <span style="color: #000000; font-weight: bold;">&amp;&amp;</span> <span style="color: #7a0874; font-weight: bold;">cd</span> elephant-bird<br />
$ <span style="color: #c20cb9; font-weight: bold;">cp</span> lib<span style="color: #000000; font-weight: bold;">/</span>google-collect-<span style="color: #000000;">1.0</span>.jar ~<span style="color: #000000; font-weight: bold;">/</span>pig-jars <span style="color: #000000; font-weight: bold;">&amp;&amp;</span> <span style="color: #c20cb9; font-weight: bold;">cp</span> lib<span style="color: #000000; font-weight: bold;">/</span>json-simple-<span style="color: #000000;">1.1</span>.jar ~<span style="color: #000000; font-weight: bold;">/</span>pig-jars<br />
$ ant nonothing<br />
$ <span style="color: #7a0874; font-weight: bold;">cd</span> build<span style="color: #000000; font-weight: bold;">/</span>classes<br />
$ jar <span style="color: #660033;">-cf</span> ..<span style="color: #000000; font-weight: bold;">/</span>elephant-bird-1.2.1-SNAPSHOT.jar com<br />
$ <span style="color: #c20cb9; font-weight: bold;">cp</span> ..<span style="color: #000000; font-weight: bold;">/*</span>.jar ~<span style="color: #000000; font-weight: bold;">/</span>pig-jars</div></div>
<p>At this point we should have 3 jars in the <strong>pig-jars</strong> directory.  I created an S3 bucket for myself and put the jars in there so I only have to do the compilation once.  From here on in, I will be referencing those jars using my s3 bucket.  If you like how my logs are organized, then I strongly recommend checking out <a href="https://github.com/cloudera/flume">Cloudera Flume</a> for log aggregation.  &lt;shameless plug&gt;I also wrote a blog post <a href="http://eric.lubow.org/2011/system-administration/distributed-flume-setup-with-an-s3-sink/">here</a> on getting it going.&lt;/shameless plug&gt;</p>
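<p>A quick way to confirm the jar count before uploading anything is sketched below.  <em>count_jars</em> is a hypothetical helper name, and the check assumes the jars sit directly in the directory rather than in subdirectories.</p>

```shell
# count_jars DIR: print how many .jar files sit directly inside DIR.
# (Hypothetical helper for illustration.)
count_jars() {
  find "$1" -maxdepth 1 -type f -name '*.jar' | wc -l | tr -d ' '
}
```

<p>Running <em>count_jars ~/pig-jars</em> after the build steps should print 3 if everything was copied over.</p>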
<p>Another item worthy of note is that I store all my logs in gzip format.  Although this isn&#8217;t the best format for Hadoop in the long run because it can&#8217;t be split into chunks, it&#8217;s what I used.  I had trouble getting everything going because Pig doesn&#8217;t decompress files when running in local mode.  Please learn from that mistake of mine.</p>
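<p>If you want to test a script in local mode first, one workaround is to decompress a small sample of the logs by hand.  This is only a sketch: <em>unpack_logs</em> and the directory layout are assumptions for illustration, not part of the setup described here.</p>

```shell
# unpack_logs SRC_DIR DEST_DIR: write a decompressed copy of every .gz file
# in SRC_DIR into DEST_DIR, leaving the originals untouched.
unpack_logs() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  for f in "$src"/*.gz; do
    [ -e "$f" ] || continue   # no .gz files: the glob stays literal, so skip it
    gzip -dc "$f" > "$dest/$(basename "$f" .gz)"
  done
}
```

<p>After unpacking, point the script&#8217;s LOAD path at the decompressed copies when running with <em>pig -x local</em>.</p>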
<p>Now let&#8217;s get into the code a little.  The first thing we do is register all the jars necessary to parse JSON.  This is done using the jars that we put directly on S3.  Then we load up the JSON into maps.  This is done using the <strong>JsonLoader()</strong>.  You have to use the full path to it (which is listed in the code sample).  Now the &#8220;interesting&#8221; thing about my log files is that they have 3 distinct types of log lines in them: types <strong>i</strong>, <strong>b</strong>, and <strong>c</strong>.  Each log line type has a different meaning so I sort them into 3 groups with the <strong>SPLIT</strong> command and some conditionals.</p>
<p>Now that I have my data broken out into the 3 buckets, I can start doing what I want with them.  Let&#8217;s say that in log type <strong>i</strong>, there is a <em>widget_value</em>.  And that <em>widget_value</em> is a string of any number.  To show the top 5 values that <em>widget_value</em> has, I just pull out all instances of <em>widget_value</em> in <strong>i</strong>.  Then I iterate over those values and group them together by type (thus getting aggregate values).  And finally I sort them in descending order and show only the top 5.</p>
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">-- REGISTER the parsing jars</span><br />
<a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">REGISTER</span></a> s3:<span style="color: #66cc66;">//$</span>bucket<span style="color: #66cc66;">/</span>jars<span style="color: #66cc66;">/</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">PIG</span></a><span style="color: #66cc66;">/</span>google<span style="color: #66cc66;">-</span>collect<span style="color: #66cc66;">-</span><span style="color: #cc66cc;">1.0</span><span style="color: #66cc66;">.</span>jar;<br />
<a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">REGISTER</span></a> s3:<span style="color: #66cc66;">//$</span>bucket<span style="color: #66cc66;">/</span>jars<span style="color: #66cc66;">/</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">PIG</span></a><span style="color: #66cc66;">/</span>json<span style="color: #66cc66;">-</span>simple<span style="color: #66cc66;">-</span><span style="color: #cc66cc;">1.1</span><span style="color: #66cc66;">.</span>jar;<br />
<a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">REGISTER</span></a> s3:<span style="color: #66cc66;">//$</span>bucket<span style="color: #66cc66;">/</span>jars<span style="color: #66cc66;">/</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">PIG</span></a><span style="color: #66cc66;">/</span>elephant<span style="color: #66cc66;">-</span>bird<span style="color: #66cc66;">-</span>1<span style="color: #66cc66;">.</span>2<span style="color: #66cc66;">.</span>1<span style="color: #66cc66;">-</span>SNAPSHOT<span style="color: #66cc66;">.</span>jar; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
<br />
<span style="color: #808080; font-style: italic;">-- Load up the JSON and split it into the three log types: b, c and i</span><br />
json <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">LOAD</span></a> <span style="color: #ff0000;">'s3://$bucket/logs/2011/02/22/1800/serverlog.*'</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">USING</span></a> com<span style="color: #66cc66;">.</span>twitter<span style="color: #66cc66;">.</span>elephantbird<span style="color: #66cc66;">.</span>pig<span style="color: #66cc66;">.</span>load<span style="color: #66cc66;">.</span>JsonLoader<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;<br />
<a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">SPLIT</span></a> json <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">INTO</span></a> i <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">IF</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">FLOAT</span></a><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">#</span><span style="color: #ff0000;">'amount'</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #66cc66;">,</span> c <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">IF</span></a> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">#</span><span style="color: #ff0000;">'id'</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">IS NOT</span></a> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">NULL</span></a><span style="color: #66cc66;">,</span> b <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">IF</span></a> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">#</span><span style="color: #ff0000;">'response'</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">IS NOT</span></a> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">NULL</span></a>;<br />
<br />
wv_i_only <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FOREACH</span></a> i <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GENERATE</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="">CHARARRAY</span></a><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">#</span><span style="color: #ff0000;">'widget_value'</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">AS</span></a> wv;<br />
wv_i_count <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">FOREACH</span></a> <span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GROUP</span></a> wv_i_only <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">BY</span></a> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">&#41;</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">GENERATE</span></a> <span style="color: #66cc66;">$</span>0<span style="color: #66cc66;">,</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">COUNT</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">$</span>1<span style="color: #66cc66;">&#41;</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">AS</span></a> i_cnt;<br />
wv_i_sorted_count <span style="color: #66cc66;">=</span> <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">LIMIT</span></a><span style="color: #66cc66;">&#40;</span><a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">ORDER</span></a> wv_i_count <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">BY</span></a> i_cnt <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">DESC</span></a><span style="color: #66cc66;">&#41;</span> <span style="color: #cc66cc;">5</span>;<br />
<a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html"><span style="color: #993333; font-weight: bold;">DUMP</span></a> wv_i_sorted_count;</div></div>
<p>Last thing I want to share is some tips on getting everything going with Pig:</p>
<ul>
<li>Start small and continue small until everything is working</li>
<li>Use subsets of your data for which you already have a good idea of what the results should be before you run your queries</li>
<li>Step through your queries to ensure each step is doing what you think it&#8217;s doing</li>
<li>Cast your data types to avoid weird behaviors.  Map doesn&#8217;t always leave your variables in the type you want/expect</li>
</ul>
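<p>As a rough illustration of the tips above, here is a minimal sketch of a local-mode check.  It assumes you have already decompressed a small sample of the logs by hand and that the 3 jars are still in the <strong>pig-jars</strong> directory.  The <em>sample/serverlog.json</em> path and the <em>json_head</em> and <em>wv_check</em> aliases are made up for this example, so treat it as a starting point rather than a definitive script.</p>
<div class="codecolorer-container pig default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="pig codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">-- Sketch only: register the locally built jars instead of the S3 copies<br />
REGISTER pig-jars/google-collect-1.0.jar;<br />
REGISTER pig-jars/json-simple-1.1.jar;<br />
REGISTER pig-jars/elephant-bird-1.2.1-SNAPSHOT.jar;<br />
<br />
-- Load a small, already decompressed sample (hypothetical path)<br />
json = LOAD 'sample/serverlog.json' USING com.twitter.elephantbird.pig.load.JsonLoader();<br />
<br />
-- Look at a handful of raw rows before doing anything else<br />
json_head = LIMIT json 10;<br />
DUMP json_head;<br />
<br />
-- Cast explicitly, then confirm the schema is what you expect<br />
wv_check = FOREACH json GENERATE (CHARARRAY) $0#'widget_value' AS wv;<br />
DESCRIBE wv_check;</div></div>
<p><strong>DESCRIBE</strong> prints the schema of an alias without running a job, and <strong>LIMIT</strong> keeps each <strong>DUMP</strong> short, so mistakes show up before you point the script at the full data set on S3.</p>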
<p>And I can&#8217;t forget to say thanks for all the help to the people who hang out in <strong>#hadoop-pig</strong> on irc.freenode.net.</p>


<p>Related posts:<ol><li><a href='http://eric.lubow.org/2010/ruby/jruby/json-benchmarks-in-jruby/' rel='bookmark' title='JSON Benchmarks in jRuby'>JSON Benchmarks in jRuby</a></li>
<li><a href='http://eric.lubow.org/2011/system-administration/distributed-flume-setup-with-an-s3-sink/' rel='bookmark' title='Distributed Flume Setup With an S3 Sink'>Distributed Flume Setup With an S3 Sink</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/</feedburner:origLink></item>
		<item>
		<title>Distributed Flume Setup With an S3 Sink</title>
		<link>http://feedproxy.google.com/~r/lubow/PAyY/~3/Gw4-Xr3ce18/</link>
		<comments>http://eric.lubow.org/2011/system-administration/distributed-flume-setup-with-an-s3-sink/#comments</comments>
		<pubDate>Fri, 04 Feb 2011 07:15:03 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[System Administration]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[logging]]></category>
		<category><![CDATA[s3]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=820</guid>
		<description><![CDATA[I have recently spent a few days getting up to speed with Flume, Cloudera&#8217;s distributed log offering. If you haven&#8217;t seen this and deal with lots of logs, you are definitely missing out on a fantastic project. I&#8217;m not going to spend time talking about it because you can read more about it in the [...]]]></description>
			<content:encoded><![CDATA[<p>I have recently spent a few days getting up to speed with <a href="https://github.com/cloudera/flume">Flume</a>, <a href="http://www.cloudera.com/">Cloudera</a>&#8217;s distributed log offering.  If you haven&#8217;t seen this and deal with lots of logs, you are definitely missing out on a fantastic project.  I&#8217;m not going to spend time talking about it because you can read more about it in the <a href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html">users guide</a> or in the <a href="http://www.quora.com/Flume">Quora Flume Topic</a> in ways that are better than I can describe it.  What I will tell you about is my experience setting up Flume in a distributed environment to sync logs to an Amazon S3 sink.</p>
<p>As CTO of <a href="http://www.simplereach.com">SimpleReach</a>, a company that does most of its work in the cloud, I&#8217;m constantly strategizing on how we can take advantage of the cloud for auto-scaling.  Depending on the time of day or how much content distribution we are dealing with, we will spawn new instances to accommodate the load.  We will still need the logs from those machines for later analysis (batch jobs like making use of Elastic Map Reduce).<br />
<span id="more-820"></span><br />
I am going to attempt to do this as step by step as possible but much of the terminology I use is described in the users guide and there is an expectation that you have at least skimmed it prior to starting this HOWTO.  I am using EMR (Elastic Map Reduce) on EC2 and not the provided Hadoop by Cloudera.  Additionally, the Cloudera version that I am working with is <strong>cdh3b3</strong>.</p>
<p><strong>Context</strong><br />
I have 3 kinds of servers all running CentOS in the <a href="http://www.amazon.com/">Amazon</a> cloud:</p>
<ol>
<li><strong>a1</strong>: This is the agent which is producing all the logs</li>
<li><strong>c1</strong>: This is the collector which is aggregating all the logs (from a1, a2, a3, etc)</li>
<li><strong>u1</strong>: This is the flume master node which is sending out all the commands</li>
</ol>
<p>There are actually <em>n</em> agents, but for this example, we&#8217;ll keep it simple.  Also, for a complete copy of the config files, please check out the full gist available <a href="https://gist.github.com/810104">here</a>.</p>
<p><strong>Initial Setup</strong><br />
On both a1 and c1, you&#8217;ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo &gt; /etc/yum.repos.d/cloudera-cdh3.repo<br />
# yum update yum<br />
# yum install flume flume-node</div></div>
<p>On u1, you&#8217;ll need to install the flume-master RPM:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"># curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo &gt; /etc/yum.repos.d/cloudera-cdh3.repo<br />
# yum update yum<br />
# yum install flume flume-master</div></div>
<p>On each host, you need to copy the conf template file to the site specific config file.  That is to say:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">cp flume-site.xml.template flume-site.xml</div></div>
<p>First let&#8217;s jump onto the agent and set that up.  Change your <em>/etc/flume/conf/flume-site.xml</em> to look like the following, replacing the $master_IP and $collector_IP variables with the appropriate addresses:</p>
<div class="codecolorer-container xml default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;height:450px;"><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;configuration<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.master.servers<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$master_IP<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This is the address for the config servers status server (http)<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.event.host<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$collector_IP<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This is the host name of the default &quot;remote&quot; collector.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.port<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>35853<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This default tcp port that the collector listens to in order to receive events it is collecting.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.agent.logdir<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>/mnt/flume-${user.name}/agent<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> This is the directory that write-ahead logging data<br />
&nbsp; &nbsp; &nbsp; or disk-failover data is collected from applications gets<br />
&nbsp; &nbsp; &nbsp; written to. The agent watches this directory.<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/configuration<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></div>
<p>Now on to the collector.  Same file, different config.  Replace the $master variable with your master&#8217;s IP address (you should be using Amazon&#8217;s internal IPs, otherwise you will be paying the regional charge).  The $account and $secret variables are your Amazon EC2/S3 access key and secret access key, respectively.  The $bucket is the S3 bucket that will contain the log files.  Also worth pointing out are <em>flume.collector.roll.millis</em> and <em>flume.collector.dfs.compress.gzip</em>.  The first sets how frequently (in milliseconds) the current log file is closed and the next file begins to be written to.  It would be nice if this could be done by file size and not only by time, but it works for now.  The second ensures that the log files are compressed prior to being dumped onto S3 (saves LOTS of space).</p>
<div class="codecolorer-container xml default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;height:450px;"><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;configuration<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.master.servers<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$master<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This is the address for the config servers status server (http)<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.event.host<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>localhost<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This is the host name of the default &quot;remote&quot; collector.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.port<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>35853<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This default tcp port that the collector listens to in order to receive events it is collecting.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.default.name<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>s3n://$account:$secret@$bucket<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.s3n.impl<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>org.apache.hadoop.fs.s3native.NativeS3FileSystem<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.s3.awsAccessKeyId<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$account<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.s3.awsSecretAccessKey<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$secret<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.s3n.awsAccessKeyId<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$account<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>fs.s3n.awsSecretAccessKey<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>$secret<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.agent.logdir<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>/mnt/flume-${user.name}/agent<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> This is the directory where write-ahead logging data<br />
&nbsp; &nbsp; &nbsp; or disk-failover data collected from applications gets<br />
&nbsp; &nbsp; &nbsp; written. The agent watches this directory.<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp;<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.dfs.dir<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>file:///mnt/flume-${user.name}/collected<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This is a dfs directory that is the final resting<br />
&nbsp; &nbsp; place for logs. &nbsp;This defaults to a local dir in<br />
&nbsp; &nbsp; /tmp but can be a Hadoop URI path such as hdfs://namenode/path/<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> &nbsp;<br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.dfs.compress.gzip<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>true<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Writes compressed output in gzip format to dfs. value is<br />
&nbsp; &nbsp; &nbsp;boolean type, i.e. true/false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>flume.collector.roll.millis<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>60000<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>The time (in milliseconds)<br />
&nbsp; &nbsp; between when hdfs files are closed and a new file is opened<br />
&nbsp; &nbsp; (rolled).<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/configuration<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></div>
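<p>If you want to double-check what ended up in the file, a few lines of Python can read the properties back. This is only a minimal sketch: the <em>read_properties</em> helper and the inline sample are my own illustration (the sample copies two properties from the config above), not part of Flume.</p>

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

# Two properties copied from the config above, used here as sample input.
SAMPLE = """<configuration>
  <property>
    <name>flume.collector.roll.millis</name>
    <value>60000</value>
  </property>
  <property>
    <name>flume.collector.dfs.compress.gzip</name>
    <value>true</value>
  </property>
</configuration>"""

props = read_properties(SAMPLE)
roll_ms = int(props["flume.collector.roll.millis"])
# One roll every 60000 ms works out to 60 files per hour.
print("files per hour:", 3600 * 1000 // roll_ms)
```

<p>To check a real file, read its contents into a string and pass that to <em>read_properties</em> instead of the sample.</p>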
<p>While we are still on the collector: to write to S3 properly, you&#8217;ll need to add or replace four files, all of which go into the <strong>/usr/lib/flume/lib/</strong> directory.</p>
<ol>
<li>commons-codec-1.4.jar</li>
<li>jets3t-0.6.1.jar</li>
<li>commons-httpclient-3.0.1.jar</li>
<li>emr-hadoop-core-0.20.jar</li>
</ol>
<p>The one thing that should be noted here is that the <strong>emr-hadoop-core-0.20.jar</strong> file replaces the <strong>hadoop-core.jar</strong> symlink.  The emr-hadoop-core-0.20.jar file is the hadoop-core.jar file from an EC2 Hadoop cluster instance.  <strong>Note:</strong> This will break the ability to seamlessly upgrade via the RPM (which is how you installed it if you&#8217;ve been following my HOWTO).  Keep these files around just in case.  I have added a tarball of the files <a href="http://eric.lubow.org/wp-content/uploads/2011/02/flume-jar.tar.gz">here</a>, but they are all still available with a quick Google search.</p>
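<p>A quick sanity check for this step is to confirm that all four files are actually in place. The following is a minimal sketch; <em>missing_jars</em> is a hypothetical helper of mine, and it only checks that the files exist, not that they are the right builds.</p>

```python
import os

# The four files listed above; the default directory is the one named in this post.
REQUIRED_JARS = [
    "commons-codec-1.4.jar",
    "jets3t-0.6.1.jar",
    "commons-httpclient-3.0.1.jar",
    "emr-hadoop-core-0.20.jar",
]

def missing_jars(lib_dir="/usr/lib/flume/lib"):
    """Return the required jar names that are not present in lib_dir."""
    return [jar for jar in REQUIRED_JARS
            if not os.path.isfile(os.path.join(lib_dir, jar))]

missing = missing_jars()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All four jars are in place.")
```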
<p>And now on to the master.  I did not actually have to change any configuration on the master to get things up and running. However, if flume is writing to a /tmp directory on an ephemeral file system, you should point it somewhere that persists.</p>
<p><strong>Web Based Setup</strong></p>
<p>I chose to do the individual machine setup via the master web interface.  You can get to it by pointing your web browser at http://u1:35871/ (replace u1 with the public DNS name or IP of your flume master).  Ensure that the port is accessible from the outside through your security settings.  At this point, it was easiest for me to ensure all hosts running flume could talk to all ports on all other hosts running flume.  You can certainly lock this down to the individual ports for security once everything is up and running.</p>
<p>At this point, you should go to a1 and c1 and run <strong>/etc/init.d/flume-node start</strong> on each.  If everything goes well, then the master (whose IP is specified in their configs) should be notified of their existence.  Now you can configure them from the web.  Click on the config link and then fill in the text lines as follows (use what is in bold):</p>
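<p>If the web interface does not load, it helps to rule out the security settings first. Here is a minimal sketch of a TCP reachability check; <em>port_is_open</em> is a hypothetical helper name, and a successful connection only tells you the port is reachable, not that flume is healthy.</p>

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False

# Example calls (replace u1 with your flume master, as above):
#   port_is_open("u1", 35871)   # master web interface
#   port_is_open("c1", 35853)   # collector port used later in this post
```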
<ul>
<li>Agent Node: <strong>$agent_ec2_internal_ip</strong></li>
<li>Source: <strong>tailDir(&quot;/mnt/logs/&quot;,&quot;.*.log&quot;)</strong></li>
<li>Sink: <strong>agentBESink(&quot;$collector_ec2_internal_ip&quot;,35853)</strong></li>
</ul>
<p>Note: I chose to use <em>tailDir</em> since I will control rotating the logs on my own.  I am also using <em>agentBESink</em> (the best-effort agent sink) because I am OK with losing log lines if the case arises.</p>
<p>Now click <strong>Submit Query</strong> and go back to the config page to set up the collector:</p>
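<p>One thing worth knowing about the regex in the Source line: an unescaped dot matches any character, so the pattern is looser than it looks. The sketch below shows the difference using Python&#8217;s <em>re</em> module. Flume uses Java regexes, which behave the same way for patterns this simple, and I am assuming here that the pattern has to match the whole file name.</p>

```python
import re

LOOSE = r".*.log"    # the pattern from the Source line above
STRICT = r".*\.log"  # dot escaped, so it only matches a literal '.'

def matches(pattern, filename):
    """True if the pattern matches the entire file name."""
    return re.fullmatch(pattern, filename) is not None

for name in ["access.log", "access_log", "catalog", "access.log.1"]:
    print(name, matches(LOOSE, name), matches(STRICT, name))
# access.log   True  True
# access_log   True  False
# catalog      True  False
# access.log.1 False False
```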
<ul>
<li>Agent Node: <strong>$collector_ec2_internal_ip</strong></li>
<li>Source: <strong>collectorSource(35853)</strong></li>
<li>Sink: <strong>collectorSink(&quot;s3n://$account:$secret@$bucket/logs/%Y/%m/%d/%H00&quot;,&quot;server&quot;)</strong></li>
</ul>
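<p>Since the Sink line embeds your keys in a URL, it is easy to get the string slightly wrong. Here is a minimal sketch that assembles it and previews the folder a given hour would land in. The helper names, the sample keys, and the bucket name are all made up for illustration. Python&#8217;s <em>strftime</em> happens to use the same %Y, %m, %d and %H codes, so I am using it only as a preview; it is not what flume runs.</p>

```python
from datetime import datetime

def encode_key(key):
    """Write '/' as %2F so the key can sit inside the s3n:// URL."""
    return key.replace("/", "%2F")

def collector_sink(access_key, secret_key, bucket, prefix="server"):
    """Build the Sink line for the collector (hypothetical helper)."""
    return 'collectorSink("s3n://{}:{}@{}/logs/%Y/%m/%d/%H00","{}")'.format(
        encode_key(access_key), encode_key(secret_key), bucket, prefix)

def preview_folder(when):
    """Show which folder an event at the given time would be written under."""
    return when.strftime("logs/%Y/%m/%d/%H00")

# Made-up credentials and bucket name, for illustration only.
print(collector_sink("AKIDEXAMPLE", "abc/def/ghi", "my-bucket"))
print(preview_folder(datetime(2011, 2, 3, 13, 5)))  # logs/2011/02/03/1300
```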
<p>The Sink line tells the collector that we are sinking to s3native (s3n) with the $account key and the $secret key into the $bucket, with an initial folder of &#8216;logs&#8217;.  It will then log to sub-folders named YYYY/MM/DD/HH00 (for example, 2011/02/03/1300/server-&lt;timestamp&gt;.log).  There will be 60 gzipped files in each folder since the roll time is set to 1 file per minute.  Now click <strong>Submit Query</strong> and go to the &#8216;master&#8217; page, where you should see 2 commands listed as &#8220;SUCCEEDED&#8221; in the command history.  If they have not succeeded, check a few things (there are probably more, but this is a handy start):</p>
<ol>
<li>Always use double quotes (&quot;) since single quotes (&#39;) aren&#8217;t interpreted correctly. UPDATE: Single quotes are interpreted correctly; they are just intentionally not accepted (thanks, jmhsieh).</li>
<li>In your regex, use something like &quot;.*\\.log&quot; since an unescaped &#39;.&#39; is a regex metacharacter that matches any character.</li>
<li>In your regex, ensure that your backslashes are properly escaped: &quot;foo\\bar&quot; is the correct way to match &quot;foo\bar&quot;.</li>
<li>Ensure any &#39;/&#39; characters in the Amazon access key and secret key are written as &#39;%2F&#39;.</li>
</ol>
<p>There are also tables of <strong>Node Status</strong> and <strong>Node Configuration</strong>.  These should match up with what you think you configured.</p>
<p>At this point everything should work.  Admittedly, I had a lot of trouble getting to this point, but with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going.  The logs sadly aren&#8217;t too helpful here in most cases (but look anyway because they might provide you with more info than they provided me).  If I missed anything in this post or there is something else I am unaware of, then let me know.</p>
<p><strong>References</strong></p>
<ul>
<li><a href="http://wiki.apache.org/hadoop/AmazonS3">http://wiki.apache.org/hadoop/AmazonS3</a></li>
<li><a href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html">Flume Users Guide</a></li>
<li>irc.freenode.net #flume</li>
<li><a href="https://issues.cloudera.org/browse/FLUME-66">https://issues.cloudera.org/browse/FLUME-66</a></li>
<li>Config files <a href="https://gist.github.com/810104">gist</a></li>
</ul>


]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2011/system-administration/distributed-flume-setup-with-an-s3-sink/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://eric.lubow.org/2011/system-administration/distributed-flume-setup-with-an-s3-sink/</feedburner:origLink></item>
	</channel>
</rss>

