<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <id>http://techgeneral.org/atom1.0</id>
  <title>TechGeneral</title>
  <updated>2008-09-22T11:58:24Z</updated>
  <author>
    <name>Neil Blakey-Milner</name>
  </author>
  
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/TechGeneral" /><feedburner:info uri="techgeneral" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>TechGeneral</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry>
    <title>In San Francisco in October</title>
    <id>http://techgeneral.org/in-san-francisco-in-october</id>
    <updated>2008-09-22T11:58:24Z</updated>
    <link rel="alternate" href="http://feedproxy.google.com/~r/TechGeneral/~3/aWfFFNk3Biw/in-san-francisco-in-october" />
    <published>2008-09-22T11:58:24Z</published>
    <author>
        <name>Neil Blakey-Milner</name>
    </author>
    <content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Visa-willing, I'll be in San Francisco for about three weeks from early October.  The &lt;a href="http://www.synthasite.com/"&gt;SynthaSite&lt;/a&gt;&#xD;
Cape Town office is heading over to the San Francisco office for a mix&#xD;
of team training, team building, end-of-year partying, and planning&#xD;
sessions.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://nxsy.org/in-sunny-san-francisco-at-google-io"&gt;My last trip to San Francisco&lt;/a&gt; in May/June included &lt;a href="http://nxsy.org/tags/io2008"&gt;Google I/O&lt;/a&gt; and a &lt;a href="http://nxsy.org/pylons-tg2-wsgi-sprint-and-sight-seeing-weekend"&gt;Pylons/TG2/WSGI sprint&lt;/a&gt;,&#xD;
and I really enjoyed being in the company of geeks.  This time around,&#xD;
it doesn't seem like there are any good conferences to squeeze in or stay around for&#xD;
and so far my only plans are to attend the &lt;a href="http://baypiggies.net/"&gt;Bay Area Python Interest Group&lt;/a&gt; with Jonathan.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Are there any interesting tech events happening in October in or around San Francisco I should try to attend?&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;</content>
    <summary type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p>
Visa-willing, I'll be in San Francisco for about three weeks from early October.  The <a href="http://www.synthasite.com/">SynthaSite</a>
Cape Town office is heading over to the San Francisco office for a mix
of team training, team building, end-of-year partying, and planning
sessions.
</p>
<p>
<a href="http://nxsy.org/in-sunny-san-francisco-at-google-io">My last trip to San Francisco</a> in May/June included <a href="http://nxsy.org/tags/io2008">Google I/O</a> and a <a href="http://nxsy.org/pylons-tg2-wsgi-sprint-and-sight-seeing-weekend">Pylons/TG2/WSGI sprint</a>,
and I really enjoyed being in the company of geeks.  This time around,
it doesn't seem like there are any good conferences to squeeze in or stay around for
and so far my only plans are to attend the <a href="http://baypiggies.net/">Bay Area Python Interest Group</a> with Jonathan.
</p>
<p>
Are there any interesting tech events happening in October in or around San Francisco I should try to attend?
</p>
</div>
    </summary>
  <feedburner:origLink>http://techgeneral.org/in-san-francisco-in-october</feedburner:origLink></entry><entry>
    <title>Further adventures in Sitemaps</title>
    <id>http://techgeneral.org/further-adventures-in-sitemaps</id>
    <updated>2008-09-15T08:47:02Z</updated>
    <link rel="alternate" href="http://feedproxy.google.com/~r/TechGeneral/~3/0D4YqBI_U4c/further-adventures-in-sitemaps" />
    <published>2008-09-15T08:47:02Z</published>
    <author>
        <name>Neil Blakey-Milner</name>
    </author>
    <content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p style="text-align: center"&gt;&#xD;
&lt;a href="http://flickr.com/photos/b-tal/56642186/"&gt;&lt;img src="http://techgeneral.org/files/2008/09/12/sitemap-510-tran.jpg" border="0" alt="Sitemap by Brian Talbot, CC BY NC" title="Sitemap by Brian Talbot, CC BY NC" width="510" height="195"&gt;&lt;/img&gt;&lt;/a&gt;&lt;br&gt;&#xD;
&lt;small&gt;Sitemap by Brian Talbot &lt;a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank"&gt;CC BY NC&lt;/a&gt;&lt;/small&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
While the two Sitemap formats are straightforward, deciding what data to put into the templates is not always obvious.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
There are three main types of metadata about sitemaps and URLs:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
	&lt;li&gt;Last modification time&lt;/li&gt;&#xD;
	&lt;li&gt;Change Frequency&lt;/li&gt;&#xD;
	&lt;li&gt;Priority&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;h2&gt;Last modified time&lt;/h2&gt;&#xD;
&lt;p style="text-align: center"&gt;&#xD;
&lt;a href="http://flickr.com/photos/lwr/60496147/"&gt;&lt;img src="http://techgeneral.org/files/2008/09/12/clocks-510-tran.jpg" border="0" alt="squared circles - Clocks by Leo Reynolds, CC BY NC SA" title="squared circles - Clocks by Leo Reynolds, CC BY NC SA" width="510" height="172"&gt;&lt;/img&gt;&lt;/a&gt;&lt;br&gt;&#xD;
&lt;small&gt;squared circles - Clocks by Leo Reynolds &lt;a title="http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en" target="_blank"&gt;CC BY NC SA&lt;/a&gt;&lt;/small&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;h3&gt;Last modified time of sitemaps&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Setting the &lt;strong&gt;last modified time&lt;/strong&gt; on a sitemap allows consumers of the sitemap index to not download the referenced sitemap again if they've already got an up-to-date sitemap.  Getting this wrong (say, by always giving the same last modified time) may mean consumers of your sitemap index will try the referenced sitemaps less often than they should.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;strong&gt;last modified time&lt;/strong&gt; for a sitemap for a web log will probably be the most recent last modified time of the posts.  Depending on whether the comments constitute valuable content, the last modified time of comments on the posts may be useful too.&#xD;
&lt;/p&gt;&#xD;
&lt;h3&gt;Last modified time of URLs&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
As with sitemaps in sitemap indices, the &lt;strong&gt;last modification time&lt;/strong&gt; for URLs listed in a sitemap is pretty easy: it is the last time that particular URL's content changed.  For a CMS page or web log post, it would usually be the time of the last edit.  For a post, the time of the last comment may also be relevant.&#xD;
&lt;/p&gt;&#xD;
&lt;h3&gt;Complications with last modified&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Things get a bit murky if you change your web site's style though — the HTML output has changed, but the most relevant content hasn't.  If your style change majorly affects the navigation potential or relevance of content, it may be worthwhile updating the last modification time.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Things are also complicated on pages that aggregate content from elsewhere, such as page two of the archives for March 2008 on a web log.  The "&lt;em&gt;correct&lt;/em&gt;" answer there is probably the most recent last updated time of any post originally posted in March 2008.  But if you change from full content to summary content per post, or remove any content per post, or add tags to your content, or otherwise change navigation or content relevance, then you might want to update the last modified time for all archives pages to when you made the style change.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Change frequency&#xD;
&lt;/h2&gt;&#xD;
&lt;p style="text-align: center"&gt;&#xD;
&lt;a href="http://flickr.com/photos/evdg/229437566/"&gt;&lt;img src="http://techgeneral.org/files/2008/09/12/subway_frequency-510-tran.jpg" border="0" alt="Toronto subway frequency by Elijah van der Giessen, CC BY NC" title="Toronto subway frequency by Elijah van der Giessen, CC BY NC" width="510" height="142"&gt;&lt;/img&gt;&lt;/a&gt;&lt;br&gt;&#xD;
&lt;small&gt;Toronto subway frequency by Elijah van der Giessen &lt;a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank"&gt;CC BY NC&lt;/a&gt;&lt;/small&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;strong&gt;Change frequency&lt;/strong&gt; is (currently) unique to URLs in a sitemap.  It's an opportunity to tell consumers of your sitemap how often you think the content at that URL changes.  Valid values are:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
	&lt;li&gt;always&lt;/li&gt;&#xD;
	&lt;li&gt;hourly&lt;/li&gt;&#xD;
	&lt;li&gt;daily&lt;/li&gt;&#xD;
	&lt;li&gt;weekly&lt;/li&gt;&#xD;
	&lt;li&gt;monthly&lt;/li&gt;&#xD;
	&lt;li&gt;yearly&lt;/li&gt;&#xD;
	&lt;li&gt;never&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
It isn't yet obvious how seriously search engines (for example) take these values.  I imagine that if you say that all your URLs change hourly, then you probably won't get any change in their behaviour.  However, sensible values can help reduce the amount of spider traffic that older pages get and, if consumers trust you, may get some of your pages checked for changes more often.&#xD;
&lt;/p&gt;&#xD;
&lt;h3&gt;Determining change frequency of URLs&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
The change frequency of a &lt;strong&gt;front page&lt;/strong&gt; will probably be &lt;em&gt;hourly&lt;/em&gt;.  Similarly, an &lt;strong&gt;archives page&lt;/strong&gt; for the current day, month, year, or all time would be &lt;em&gt;hourly&lt;/em&gt;.  The change frequency for an archives page for previous days, months, or years could potentially be considered "&lt;em&gt;never&lt;/em&gt;" or "&lt;em&gt;yearly&lt;/em&gt;", but you can always set it to "&lt;em&gt;monthly&lt;/em&gt;" if you're worried about such long periods of time.  (The sitemap consumer will watch the last modified time of the entry in your sitemap anyway, and will probably try to visit that content more often than that just in case.)&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The change frequency for a &lt;strong&gt;post&lt;/strong&gt; on a web log or a &lt;strong&gt;news article&lt;/strong&gt; depends on a few things.  For example, if you use "related posts" or "related stories", you may not want to use values such as "&lt;em&gt;never&lt;/em&gt;" or "&lt;em&gt;yearly&lt;/em&gt;" even for posts from years back.  If you allow comments, you may similarly want to avoid those values.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The most important indicator of likely change frequency in standard cases is probably how long it has been since a particular page has changed.  In GibeSitemap, I use a relatively naive algorithm:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
	&lt;li&gt;If the content has changed in the last three days, the change frequency is hourly.&lt;/li&gt;&#xD;
	&lt;li&gt;If changed in the last 15 days, daily.&lt;/li&gt;&#xD;
	&lt;li&gt;If changed in the last 45 days, weekly.&lt;/li&gt;&#xD;
	&lt;li&gt;older, monthly. &lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;h2&gt;Priority&lt;/h2&gt;&#xD;
&lt;p style="text-align: center"&gt;&#xD;
&lt;a href="http://flickr.com/photos/petereed/138369750/"&gt;&lt;img src="http://techgeneral.org/files/2008/09/12/changed-priorities-ahead-510-tran.jpg" border="0" alt="Changed priorities ahead by Peter Reed, CC BY NC SA" title="Changed priorities ahead by Peter Reed, CC BY NC" width="510" height="195"&gt;&lt;/img&gt;&lt;/a&gt;&lt;br&gt;&#xD;
&lt;small&gt;Changed priorities ahead by Peter Reed &lt;a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank"&gt;CC BY NC&lt;/a&gt;&lt;/small&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;strong&gt;priority&lt;/strong&gt; of a page signals how valuable and relevant the content on that URL is likely to be to the consumer, relative to other pages on your web site.  Priority can run from 0.0 (low) to 1.0 (high).  Your front page is likely to have a very high priority (say, 1.0).  A web log "About" page is probably one of the highest priority pages (say, 0.9). &#xD;
&lt;/p&gt;&#xD;
&lt;h3&gt;Determining priority of URLs&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
For a CMS with a &lt;strong&gt;hierarchical path structure&lt;/strong&gt;, you can use a simple algorithm to determine priority — the fewer folders between the site root and the page, the more important it likely is.  For the Gibe Pages plugin, pages at the top level are given 0.9, losing 0.1 for each folder until a lowest value of 0.6.  So:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
	&lt;li&gt;/about : 0.9&lt;/li&gt;&#xD;
	&lt;li&gt;/about/team : 0.8&lt;/li&gt;&#xD;
	&lt;li&gt;/about/team/neil : 0.7&lt;/li&gt;&#xD;
	&lt;li&gt;/about/team/neil/interests : 0.6&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
Web log or news &lt;strong&gt;archives&lt;/strong&gt; pages should have a low priority, since the content on them is more relevant in the individual posts.  A value of 0.1 is appropriate.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For web log &lt;strong&gt;posts&lt;/strong&gt; or news &lt;strong&gt;articles&lt;/strong&gt;, priority depends on a number of factors.  For example, you may want to give existing popular posts or articles a high priority, so that people are more likely to find them when searching.  You may also want to give posts with a particular tag, or articles in a particular section, a higher or lower priority.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For the &lt;strong&gt;basic case&lt;/strong&gt;, though, you can probably just use the publishing date or last modification time to help determine the priority.  More recent posts and news are probably more relevant (on your site) than older ones.  You might want to use a simple algorithm like the one I used on Gibe:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
	&lt;li&gt;If the publish date is within the last 15 days, priority of 0.9&lt;/li&gt;&#xD;
	&lt;li&gt;last month, 0.8&lt;/li&gt;&#xD;
	&lt;li&gt;last three months, 0.7&lt;/li&gt;&#xD;
	&lt;li&gt;last half-year, 0.6&lt;/li&gt;&#xD;
	&lt;li&gt;last year, 0.5&lt;/li&gt;&#xD;
	&lt;li&gt;last two years, 0.4&lt;/li&gt;&#xD;
	&lt;li&gt;older, 0.3&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;/div&gt;</content>
    <summary type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p style="text-align: center">
<a href="http://flickr.com/photos/b-tal/56642186/"><img src="http://techgeneral.org/files/2008/09/12/sitemap-510-tran.jpg" border="0" alt="Sitemap by Brian Talbot, CC BY NC" title="Sitemap by Brian Talbot, CC BY NC" width="510" height="195" /></a><br />
<small>Sitemap by Brian Talbot <a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank">CC BY NC</a></small>
</p>
<p>
While the two Sitemap formats are straightforward, deciding what data to put into the templates is not always obvious.
</p>
<p>
There are three main types of metadata about sitemaps and URLs:
</p>
<ul>
	<li>Last modification time</li>
	<li>Change Frequency</li>
	<li>Priority</li>
</ul>
<h2>Last modified time</h2>
<p style="text-align: center">
<a href="http://flickr.com/photos/lwr/60496147/"><img src="http://techgeneral.org/files/2008/09/12/clocks-510-tran.jpg" border="0" alt="squared circles - Clocks by Leo Reynolds, CC BY NC SA" title="squared circles - Clocks by Leo Reynolds, CC BY NC SA" width="510" height="172" /></a><br />
<small>squared circles - Clocks by Leo Reynolds <a title="http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en" target="_blank">CC BY NC SA</a></small>
</p>
<h3>Last modified time of sitemaps</h3>
<p>
Setting the <strong>last modified time</strong> on a sitemap allows consumers of the sitemap index to not download the referenced sitemap again if they've already got an up-to-date sitemap.  Getting this wrong (say, by always giving the same last modified time) may mean consumers of your sitemap index will try the referenced sitemaps less often than they should.
</p>
<p>
The <strong>last modified time</strong> for a sitemap for a web log will probably be the most recent last modified time of the posts.  Depending on whether the comments constitute valuable content, the last modified time of comments on the posts may be useful too.
</p>
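<p>
As a rough illustration of the rule above, the sketch below picks the most recent timestamp among the posts and, optionally, their comments.  The function name <code>sitemap_lastmod</code> and the sample timestamps are made up for this example and are not taken from any real code.
</p>
<pre class="prettyprint">
```python
from datetime import datetime, timezone


def sitemap_lastmod(post_times, comment_times=()):
    """Return the most recent timestamp among posts and (optionally) comments.

    Hypothetical helper: returns None when there is nothing to report.
    """
    candidates = list(post_times) + list(comment_times)
    if not candidates:
        return None
    return max(candidates)


# Sample data, invented for the example.
posts = [
    datetime(2008, 8, 26, 16, 50, 44, tzinfo=timezone.utc),
    datetime(2008, 9, 15, 8, 47, 2, tzinfo=timezone.utc),
]
comments = [datetime(2008, 9, 16, 10, 0, 0, tzinfo=timezone.utc)]

# Posts only, then posts plus comments.
print(sitemap_lastmod(posts).isoformat())
print(sitemap_lastmod(posts, comments).isoformat())
```
</pre>
<p>
Leaving out <code>comment_times</code> corresponds to deciding that comments are not valuable enough to count as a change.
</p>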
<h3>Last modified time of URLs</h3>
<p>
As with sitemaps in sitemap indices, the <strong>last modification time</strong> for URLs listed in a sitemap is pretty easy: it is the last time that particular URL's content changed.  For a CMS page or web log post, it would usually be the time of the last edit.  For a post, the time of the last comment may also be relevant.
</p>
<h3>Complications with last modified</h3>
<p>
Things get a bit murky if you change your web site's style though — the HTML output has changed, but the most relevant content hasn't.  If your style change majorly affects the navigation potential or relevance of content, it may be worthwhile updating the last modification time.
</p>
<p>
Things are also complicated on pages that aggregate content from elsewhere, such as page two of the archives for March 2008 on a web log.  The "<em>correct</em>" answer there is probably the most recent last updated time of any post originally posted in March 2008.  But if you change from full content to summary content per post, or remove any content per post, or add tags to your content, or otherwise change navigation or content relevance, then you might want to update the last modified time for all archives pages to when you made the style change.
</p>
<h2>Change frequency
</h2>
<p style="text-align: center">
<a href="http://flickr.com/photos/evdg/229437566/"><img src="http://techgeneral.org/files/2008/09/12/subway_frequency-510-tran.jpg" border="0" alt="Toronto subway frequency by Elijah van der Giessen, CC BY NC" title="Toronto subway frequency by Elijah van der Giessen, CC BY NC" width="510" height="142" /></a><br />
<small>Toronto subway frequency by Elijah van der Giessen <a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank">CC BY NC</a></small>
</p>
<p>
<strong>Change frequency</strong> is (currently) unique to URLs in a sitemap.  It's an opportunity to tell consumers of your sitemap how often you think the content at that URL changes.  Valid values are:
</p>
<ul>
	<li>always</li>
	<li>hourly</li>
	<li>daily</li>
	<li>weekly</li>
	<li>monthly</li>
	<li>yearly</li>
	<li>never</li>
</ul>
<p>
It isn't yet obvious how seriously search engines (for example) take these values.  I imagine that if you say that all your URLs change hourly, then you probably won't get any change in their behaviour.  However, sensible values can help reduce the amount of spider traffic that older pages get and, if consumers trust you, may get some of your pages checked for changes more often.
</p>
<h3>Determining change frequency of URLs</h3>
<p>
The change frequency of a <strong>front page</strong> will probably be <em>hourly</em>.  Similarly, an <strong>archives page</strong> for the current day, month, year, or all time would be <em>hourly</em>.  The change frequency for an archives page for previous days, months, or years could potentially be considered "<em>never</em>" or "<em>yearly</em>", but you can always set it to "<em>monthly</em>" if you're worried about such long periods of time.  (The sitemap consumer will watch the last modified time of the entry in your sitemap anyway, and will probably try to visit that content more often than that just in case.)
</p>
<p>
The change frequency for a <strong>post</strong> on a web log or a <strong>news article</strong> depends on a few things.  For example, if you use "related posts" or "related stories", you may not want to use values such as "<em>never</em>" or "<em>yearly</em>" even for posts from years back.  If you allow comments, you may similarly want to avoid those values.
</p>
<p>
The most important indicator of likely change frequency in standard cases is probably how long it has been since a particular page has changed.  In GibeSitemap, I use a relatively naive algorithm:
</p>
<ul>
	<li>If the content has changed in the last three days, the change frequency is hourly.</li>
	<li>If changed in the last 15 days, daily.</li>
	<li>If changed in the last 45 days, weekly.</li>
	<li>older, monthly. </li>
</ul>
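<p>
The thresholds above could be sketched roughly as follows.  This is not the actual GibeSitemap code: the names <code>THRESHOLDS</code> and <code>change_frequency</code> are invented here, and treating each boundary as inclusive is an assumption.
</p>
<pre class="prettyprint">
```python
from datetime import datetime, timedelta, timezone

# (maximum age, change frequency) pairs, checked from youngest to oldest.
THRESHOLDS = [
    (timedelta(days=3), "hourly"),
    (timedelta(days=15), "daily"),
    (timedelta(days=45), "weekly"),
]


def change_frequency(last_modified, now=None):
    """Map the age of the last change to a changefreq value."""
    now = now or datetime.now(timezone.utc)
    age = now - last_modified
    for limit, frequency in THRESHOLDS:
        # Assumption: a change exactly on the boundary still counts.
        if limit >= age:
            return frequency
    return "monthly"


now = datetime(2008, 9, 15, tzinfo=timezone.utc)
for days in (1, 10, 30, 100):
    print(days, change_frequency(now - timedelta(days=days), now))
```
</pre>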
<h2>Priority</h2>
<p style="text-align: center">
<a href="http://flickr.com/photos/petereed/138369750/"><img src="http://techgeneral.org/files/2008/09/12/changed-priorities-ahead-510-tran.jpg" border="0" alt="Changed priorities ahead by Peter Reed, CC BY NC" title="Changed priorities ahead by Peter Reed, CC BY NC" width="510" height="195" /></a><br />
<small>Changed priorities ahead by Peter Reed <a title="http://creativecommons.org/licenses/by-nc/2.0/deed.en" href="http://creativecommons.org/licenses/by-nc/2.0/deed.en" target="_blank">CC BY NC</a></small>
</p>
<p>
The <strong>priority</strong> of a page signals how valuable and relevant the content on that URL is likely to be to the consumer, relative to other pages on your web site.  Priority can run from 0.0 (low) to 1.0 (high).  Your front page is likely to have a very high priority (say, 1.0).  A web log "About" page is probably one of the highest priority pages (say, 0.9). 
</p>
<h3>Determining priority of URLs</h3>
<p>
For a CMS with a <strong>hierarchical path structure</strong>, you can use a simple algorithm to determine priority — the fewer folders between the site root and the page, the more important it likely is.  For the Gibe Pages plugin, pages at the top level are given 0.9, losing 0.1 for each folder until a lowest value of 0.6.  So:
</p>
<ul>
	<li>/about : 0.9</li>
	<li>/about/team : 0.8</li>
	<li>/about/team/neil : 0.7</li>
	<li>/about/team/neil/interests : 0.6</li>
</ul>
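<p>
A minimal sketch of that depth rule might look like the following.  It is not the Gibe Pages plugin's code; <code>path_priority</code> is a hypothetical name, and treating an empty path as a top-level page is an assumption.
</p>
<pre class="prettyprint">
```python
def path_priority(path):
    """Give top-level pages 0.9, minus 0.1 per extra folder, never below 0.6."""
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: an empty path is treated like a top-level page.
    depth = max(len(segments), 1)
    # Round to one decimal place to avoid floating-point noise such as 0.7000000000000001.
    return max(0.6, round(0.9 - 0.1 * (depth - 1), 1))


for path in ("/about", "/about/team", "/about/team/neil",
             "/about/team/neil/interests"):
    print(path, path_priority(path))
```
</pre>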
<p>
Web log or news <strong>archives</strong> pages should have a low priority, since the content on them is more relevant in the individual posts.  A value of 0.1 is appropriate.
</p>
<p>
For web log <strong>posts</strong> or news <strong>articles</strong>, priority depends on a number of factors.  For example, you may want to give existing popular posts or articles a high priority, so that people are more likely to find them when searching.  You may also want to give posts with a particular tag, or articles in a particular section, a higher or lower priority.
</p>
<p>
For the <strong>basic case</strong>, though, you can probably just use the publishing date or last modification time to help determine the priority.  More recent posts and news are probably more relevant (on your site) than older ones.  You might want to use a simple algorithm like the one I used on Gibe:
</p>
<ul>
	<li>If the publish date is within the last 15 days, priority of 0.9</li>
	<li>last month, 0.8</li>
	<li>last three months, 0.7</li>
	<li>last half-year, 0.6</li>
	<li>last year, 0.5</li>
	<li>last two years, 0.4</li>
	<li>older, 0.3</li>
</ul>
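<p>
The age bands above can be sketched as a lookup table.  This is an illustration, not the code used in Gibe: the names are invented, and the day counts chosen for "month", "three months", "half-year" and "year" are approximations picked for this example.
</p>
<pre class="prettyprint">
```python
from datetime import datetime, timedelta, timezone

# (maximum age, priority) pairs; the day counts are approximations.
AGE_PRIORITIES = [
    (timedelta(days=15), 0.9),
    (timedelta(days=30), 0.8),    # "last month"
    (timedelta(days=90), 0.7),    # "last three months"
    (timedelta(days=182), 0.6),   # "last half-year"
    (timedelta(days=365), 0.5),   # "last year"
    (timedelta(days=730), 0.4),   # "last two years"
]


def post_priority(published, now=None):
    """Map the age of a post to a sitemap priority."""
    now = now or datetime.now(timezone.utc)
    age = now - published
    for limit, priority in AGE_PRIORITIES:
        if limit >= age:
            return priority
    return 0.3


now = datetime(2008, 9, 15, tzinfo=timezone.utc)
for days in (5, 20, 60, 150, 300, 500, 1000):
    print(days, post_priority(now - timedelta(days=days), now))
```
</pre>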
</div>
    </summary>
  <feedburner:origLink>http://techgeneral.org/further-adventures-in-sitemaps</feedburner:origLink></entry><entry>
    <title>Early adventures with Sitemaps</title>
    <id>http://techgeneral.org/early-adventures-with-sitemaps</id>
    <updated>2008-08-26T16:50:44Z</updated>
    <link rel="alternate" href="http://feedproxy.google.com/~r/TechGeneral/~3/XYTKOtKOjXE/early-adventures-with-sitemaps" />
    <published>2008-08-26T16:50:44Z</published>
    <author>
        <name>Neil Blakey-Milner</name>
    </author>
    <content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Perhaps entirely randomly, I decided that TechGeneral would need &lt;a href="http://www.sitemaps.org/"&gt;Sitemaps&lt;/a&gt; before I put it live.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
A Sitemap &lt;em class="aside"&gt;(sometimes called a Google Sitemap, although you won't see Google calling it that, and it is a standard that Yahoo!, Ask, and Live all support)&lt;/em&gt; is an XML file (or bunch of XML files) that describes the various resources on your web site, which allows search engines and other programs to discover them more easily.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site to avoid infinite automatically generated links (not because of malicious intent necessarily, but because of weird programming).  With a Sitemap, each listed resource can potentially be treated as a first visit.  Also, if a site has navigation that search engines can't traverse to get to certain pages, Sitemaps can assist search engines to find those resources.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They also optionally assign a &lt;strong&gt;priority&lt;/strong&gt; to each resource as a way to influence the importance assigned to the resource relative to other resources on your web site.  Similarly, an optional &lt;strong&gt;update frequency&lt;/strong&gt; per resource can influence how often a search engine or other program should check back for new versions of that resource.  &lt;strong&gt;Last modified dates&lt;/strong&gt; also optionally help to determine whether to try to revisit a resource earlier or later than would normally happen.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;strong&gt;Example Sitemap File&lt;/strong&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="prettyprint"&gt;&amp;lt;urlset&#xD;
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"&#xD;
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&#xD;
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9&#xD;
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"&amp;gt;&#xD;
 &#xD;
    &amp;lt;url&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/diary&amp;lt;/loc&amp;gt;&#xD;
        &amp;lt;lastmod&amp;gt;2008-08-16T22:52:41+00:00&amp;lt;/lastmod&amp;gt;&#xD;
        &amp;lt;changefreq&amp;gt;daily&amp;lt;/changefreq&amp;gt;&#xD;
        &amp;lt;priority&amp;gt;0.9&amp;lt;/priority&amp;gt;&#xD;
    &amp;lt;/url&amp;gt;&#xD;
 &#xD;
    &amp;lt;url&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/speaking&amp;lt;/loc&amp;gt;&#xD;
        &amp;lt;lastmod&amp;gt;2008-08-16T22:52:13+00:00&amp;lt;/lastmod&amp;gt;&#xD;
        &amp;lt;changefreq&amp;gt;daily&amp;lt;/changefreq&amp;gt;&#xD;
        &amp;lt;priority&amp;gt;0.9&amp;lt;/priority&amp;gt;&#xD;
    &amp;lt;/url&amp;gt;&#xD;
 &#xD;
    &amp;lt;url&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/contact&amp;lt;/loc&amp;gt;&#xD;
        &amp;lt;lastmod&amp;gt;2008-08-10T16:59:32+00:00&amp;lt;/lastmod&amp;gt;&#xD;
        &amp;lt;changefreq&amp;gt;weekly&amp;lt;/changefreq&amp;gt;&#xD;
        &amp;lt;priority&amp;gt;0.9&amp;lt;/priority&amp;gt;&#xD;
    &amp;lt;/url&amp;gt;&#xD;
 &#xD;
    &amp;lt;url&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/about&amp;lt;/loc&amp;gt;&#xD;
        &amp;lt;lastmod&amp;gt;2008-08-10T12:42:06+00:00&amp;lt;/lastmod&amp;gt;&#xD;
        &amp;lt;changefreq&amp;gt;weekly&amp;lt;/changefreq&amp;gt;&#xD;
        &amp;lt;priority&amp;gt;0.9&amp;lt;/priority&amp;gt;&#xD;
    &amp;lt;/url&amp;gt;&#xD;
&amp;lt;/urlset&amp;gt;&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
There are two types of Sitemaps - individual &lt;strong&gt;Sitemap&lt;/strong&gt; files and &lt;strong&gt;Sitemap Index&lt;/strong&gt; files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can contain at most &lt;strong&gt;50 000&lt;/strong&gt; URLs (which, admittedly, the average blog isn't going to hit) and must be no larger than &lt;strong&gt;10MB&lt;/strong&gt; uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or you've hacked them to do so) but you don't want to merge them yourself.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;strong&gt;Example Sitemap Index&lt;/strong&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="prettyprint"&gt;&amp;lt;sitemapindex&#xD;
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"&#xD;
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&#xD;
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9&#xD;
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"&amp;gt;&#xD;
 &#xD;
    &amp;lt;sitemap&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/sitemap_posts.xml&amp;lt;/loc&amp;gt;&#xD;
    &amp;lt;/sitemap&amp;gt;&#xD;
 &#xD;
    &amp;lt;sitemap&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/sitemap_archives.xml&amp;lt;/loc&amp;gt;&#xD;
    &amp;lt;/sitemap&amp;gt;&#xD;
 &#xD;
    &amp;lt;sitemap&amp;gt;&#xD;
        &amp;lt;loc&amp;gt;http://techgeneral.org/sitemap_pages.xml&amp;lt;/loc&amp;gt;&#xD;
    &amp;lt;/sitemap&amp;gt;&#xD;
&amp;lt;/sitemapindex&amp;gt;&#xD;
&lt;/pre&gt;&#xD;
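&lt;p&gt;&#xD;
To illustrate the 50 000 URL limit, here is a rough sketch that splits a long list of URLs into Sitemap-sized chunks and builds an index element pointing at one file per chunk.  The helper names and the example.org URLs are made up for this example; it is not how this site generates its Sitemaps.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="prettyprint"&gt;
```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file URL limit


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most size entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locations):
    """Build a sitemapindex element with one sitemap/loc pair per location."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_locations:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = location
    return root


# Hypothetical data: 120000 URLs need three Sitemap files.
urls = ["http://example.org/post/%d" % n for n in range(120000)]
chunks = chunk_urls(urls)
locations = ["http://example.org/sitemap_%d.xml" % n for n in range(len(chunks))]
index = build_index(locations)
print(len(chunks), len(index))
```
&lt;/pre&gt;&#xD;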
&lt;p&gt;&#xD;
One useful side-effect of using a Sitemap with &lt;a href="http://www.google.com/webmasters/"&gt;Google's webmaster tools&lt;/a&gt; is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on &lt;a href="http://nxsy.org/"&gt;my personal web site&lt;/a&gt;) have only entries with the same priority.  I'm setting the priority of all the archives low (they have noindex, follow set anyway, so won't show up in search results) and the frontpage high, and giving the posts priorities based on age.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I get the feeling that the priorities only apply within the same file, and not within the same site.  This somewhat makes sense, since one can delegate a sitemap for a particular folder on your web site, and you wouldn't want an overeager person assigning "1.0" to all content within the folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect. &#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/TechGeneral?a=XYTKOtKOjXE:LTP2zCAH9PU:K8qFz0M-AJI"&gt;&lt;img src="http://feeds.feedburner.com/~ff/TechGeneral?i=XYTKOtKOjXE:LTP2zCAH9PU:K8qFz0M-AJI" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TechGeneral/~4/XYTKOtKOjXE" height="1" width="1"/&gt;</content>
    <summary type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p>
Perhaps entirely randomly, I decided that TechGeneral would need <a href="http://www.sitemaps.org/">Sitemaps</a> before I put it live.
</p>
<p>
A Sitemap <em class="aside">(sometimes called a Google Sitemap, although you won't see Google calling it that, and it is a standard that Yahoo!, Ask, and Live all support)</em> is an XML file (or bunch of XML files) that describes the various resources on your web site, which allows search engines and other programs to discover them more easily.
</p>
<p>
There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site to avoid infinite automatically generated links (not because of malicious intent necessarily, but because of weird programming).  With a Sitemap, each listed resource can potentially be treated as a first visit.  Also, if a site has navigation that search engines can't traverse to get to certain pages, Sitemaps can assist search engines to find those resources.
</p>
<p>
Sitemaps can also assign a <strong>priority</strong> to each resource as a way to influence the importance assigned to the resource relative to other resources on your web site.  Similarly, an optional <strong>update frequency</strong> per resource can influence how often a search engine or other program should check back for new versions of that resource.  Optional <strong>last modified dates</strong> help to determine whether to try to revisit a resource earlier or later than would normally happen.
</p>
<p>
<strong>Example Sitemap File</strong>
</p>
<pre class="prettyprint">&lt;urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"&gt;
 
    &lt;url&gt;
        &lt;loc&gt;http://techgeneral.org/diary&lt;/loc&gt;
        &lt;lastmod&gt;2008-08-16T22:52:41+00:00&lt;/lastmod&gt;
        &lt;changefreq&gt;daily&lt;/changefreq&gt;
        &lt;priority&gt;0.9&lt;/priority&gt;
    &lt;/url&gt;
 
    &lt;url&gt;
        &lt;loc&gt;http://techgeneral.org/speaking&lt;/loc&gt;
        &lt;lastmod&gt;2008-08-16T22:52:13+00:00&lt;/lastmod&gt;
        &lt;changefreq&gt;daily&lt;/changefreq&gt;
        &lt;priority&gt;0.9&lt;/priority&gt;
    &lt;/url&gt;
 
    &lt;url&gt;
        &lt;loc&gt;http://techgeneral.org/contact&lt;/loc&gt;
        &lt;lastmod&gt;2008-08-10T16:59:32+00:00&lt;/lastmod&gt;
        &lt;changefreq&gt;weekly&lt;/changefreq&gt;
        &lt;priority&gt;0.9&lt;/priority&gt;
    &lt;/url&gt;
 
    &lt;url&gt;
        &lt;loc&gt;http://techgeneral.org/about&lt;/loc&gt;
        &lt;lastmod&gt;2008-08-10T12:42:06+00:00&lt;/lastmod&gt;
        &lt;changefreq&gt;weekly&lt;/changefreq&gt;
        &lt;priority&gt;0.9&lt;/priority&gt;
    &lt;/url&gt;
&lt;/urlset&gt;
</pre>
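<p>
As a rough illustration of how a file like the one above might be produced, here is a minimal Python sketch using the standard library's ElementTree.  The <code>build_urlset</code> helper and the shape of the <code>pages</code> list are assumptions made up for this example, not the code that generates the Sitemaps on this site, and the optional schema location attributes are left out.
</p>
<pre class="prettyprint">
```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Return a Sitemap urlset document as a string.

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq" and "priority" keys (hypothetical input shape).
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the child elements in the same order as the example file.
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


pages = [
    {"loc": "http://techgeneral.org/about",
     "lastmod": "2008-08-10T12:42:06+00:00",
     "changefreq": "weekly",
     "priority": 0.9},
]
print(build_urlset(pages))
```
</pre>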
<p>
There are two types of Sitemaps - individual <strong>Sitemap</strong> files and <strong>Sitemap Index</strong> files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can contain at most <strong>50 000</strong> URLs (which, admittedly, the average blog isn't going to hit) and must be no larger than <strong>10MB</strong> uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or you've hacked them to do so) but you don't want to merge them yourself.
</p>
<p>
<strong>Example Sitemap Index</strong>
</p>
<pre class="prettyprint">&lt;sitemapindex
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"&gt;
 
    &lt;sitemap&gt;
        &lt;loc&gt;http://techgeneral.org/sitemap_posts.xml&lt;/loc&gt;
    &lt;/sitemap&gt;
 
    &lt;sitemap&gt;
        &lt;loc&gt;http://techgeneral.org/sitemap_archives.xml&lt;/loc&gt;
    &lt;/sitemap&gt;
 
    &lt;sitemap&gt;
        &lt;loc&gt;http://techgeneral.org/sitemap_pages.xml&lt;/loc&gt;
    &lt;/sitemap&gt;
&lt;/sitemapindex&gt;
</pre>
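<p>
To make the 50 000 URL limit concrete, here is a small Python sketch that splits a long list of URLs into Sitemap-sized chunks and builds an index pointing at one file per chunk.  The <code>chunk</code> and <code>build_index</code> helpers, the example.org URLs, and the numbered file names are all hypothetical; it does not check the 10MB size limit.
</p>
<pre class="prettyprint">
```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # the per-file URL limit mentioned in the text


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locations):
    """Return a Sitemap Index document listing the given Sitemap URLs."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for loc in sitemap_locations:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="unicode")


# Hypothetical site with 120 000 posts: 3 chunks (50000 + 50000 + 20000).
urls = ["http://example.org/post/%d" % n for n in range(120000)]
parts = chunk(urls)
locations = ["http://example.org/sitemap_posts_%d.xml" % n
             for n in range(len(parts))]
print(build_index(locations))
```
</pre>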
<p>
One useful side-effect of using a Sitemap with <a href="http://www.google.com/webmasters/">Google's webmaster tools</a> is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.
</p>
<p>
However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on <a href="http://nxsy.org/">my personal web site</a>) contain only entries with the same priority.  I'm setting the priority of all the archives low (they have noindex, follow set anyway, so won't show up in search results), the frontpage high, and the posts get priorities based on age.
</p>
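<p>
For the curious, an age-based priority could look something like the following Python sketch.  The linear decay, the 0.8 and 0.3 bounds, and the one-year horizon are invented for illustration and are not the actual values used on this site.
</p>
<pre class="prettyprint">
```python
from datetime import datetime


def post_priority(published, now, floor=0.3, ceiling=0.8, horizon_days=365):
    """Decay a post's priority linearly from `ceiling` to `floor` with age.

    All parameter values are hypothetical; posts older than
    `horizon_days` stay at `floor`.
    """
    age_days = (now - published).days
    # Clamp the age to the range 0..1 as a fraction of the horizon.
    fraction = min(max(age_days / horizon_days, 0.0), 1.0)
    return round(ceiling - (ceiling - floor) * fraction, 1)


now = datetime(2008, 8, 17)
print(post_priority(datetime(2008, 8, 17), now))  # brand new post
print(post_priority(datetime(2006, 1, 1), now))   # post older than a year
```
</pre>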
<p>
I get the feeling that the priorities only apply within the same file, and not within the same site.  This somewhat makes sense, since one can delegate a sitemap for a particular folder on your web site, and you wouldn't want an overeager person assigning "1.0" to all content within the folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.
</p>
<p>
Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect. 
</p>
</div>
    </summary>
  <feedburner:origLink>http://techgeneral.org/early-adventures-with-sitemaps</feedburner:origLink></entry><entry>
    <title>Wordpress.com scalability at WordCamp SA 2008</title>
    <id>http://techgeneral.org/wordpresscom-scalability-at-wordcamp-sa-2008</id>
    <updated>2008-08-24T17:13:34Z</updated>
    <link rel="alternate" href="http://feedproxy.google.com/~r/TechGeneral/~3/9EUT138kHTU/wordpresscom-scalability-at-wordcamp-sa-2008" />
    <published>2008-08-24T17:13:34Z</published>
    <author>
        <name>Neil Blakey-Milner</name>
    </author>
    <content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;a href="http://www.wordcamp.co.za/2008/"&gt;&lt;img src="http://techgeneral.org/files/2008/08/24/wordcampsa08.png" border="0" alt="" hspace="5" vspace="5" width="250" height="68" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;p&gt;&#xD;
At &lt;a href="http://www.wordcamp.co.za/2008/"&gt;WordCamp South Africa 2008&lt;/a&gt;, held in Cape Town yesterday, we were given a brief overview of how &lt;a href="http://wordpress.com/"&gt;Wordpress.com&lt;/a&gt; is set up to scale.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://ma.tt/"&gt;Matt Mullenweg&lt;/a&gt; set the scene with some idea of just how huge Wordpress.com is.  I may mess up a few numbers mentioned, but there've been something like 6.5 billion page views on Wordpress.com since the beginning of the year, there are 3.8 million Wordpress.com hosted blogs (only Blogger is bigger), and there are 1.4 billion words in posts created on Wordpress.com.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://blog.linuxinternet.org/"&gt;Warwick Poole&lt;/a&gt; then gave us some more in-depth numbers, although pointing out that Wordpress.com was bigger than AdultFriendFinder was a pretty good and well-understood indication from the audience's reaction.  In May 2008, Wordpress.com was served 693 million page views, but this rose to 812 million page views in July.  Over 1TB of media was uploaded in May, 1.3TB in July.  In May, 417TB of traffic left the Wordpress.com data centres.  These numbers are available in the &lt;a href="http://en.blog.wordpress.com/2008/08/05/july-wrap-up-2/"&gt;"July wrap-up" post&lt;/a&gt; on the Wordpress.com web log. &#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Apparently, across the approximately 710 servers, 10 000 web requests and 10 000 database requests are handled per second (I wasn't intelligent enough to write down whether this was the average).  110 requests per second go to Amazon's S3 storage service, while 3TB of media is cached on their own media caches.  They output 1.5TB/s (I wrote TB, so it probably is TB and not Tb.  I'm guessing this is peak). They experience approximately 5 server failures a week. &lt;br&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
How is it put together?  They use Round Robin DNS, which determines the data centre (from testing, it seems they round-robin six IPs - two IPs for each of three data centres).  There, the request hits a load balancer using some combination of &lt;a href="http://nginx.net/"&gt;nginx&lt;/a&gt;, &lt;a href="http://www.backhand.org/wackamole/"&gt;wackamole&lt;/a&gt;, and &lt;a href="http://www.spread.org/"&gt;spread&lt;/a&gt;.  They use &lt;a href="http://varnish.projects.linpro.no/"&gt;Varnish&lt;/a&gt; for serving at least media, and currently use &lt;a href="http://www.litespeedtech.com/"&gt;Litespeed&lt;/a&gt; web servers.  They also use MySQL and &lt;a href="http://www.danga.com/memcached/"&gt;memcached&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They use (and developed) the &lt;a href="http://wordpress.org/extend/plugins/batcache/"&gt;batcache Wordpress plugin&lt;/a&gt; to serve content from memcached - according to the documentation, batcache only potentially serves stale content to first-time visitors - visitors who have interacted with the web log receive up-to-date content.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
When new media is uploaded, its existence and initial location are stored in a table.  As necessary, the other data centres will create their own local copies of that media, and update that table.  The backup media stores in the data centres are write-only - apparently nothing is ever deleted from them.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
That's about all I wrote down, but the Wordpress.com blog and the blogs of various employees have quite a bit more information about how Wordpress.com is set up and the sort of load and traffic it handles (such as &lt;a href="http://barry.wordpress.com/2008/04/28/load-balancer-update/"&gt;this post on nginx replacing Pound&lt;/a&gt;, &lt;a href="http://barry.wordpress.com/2007/11/01/static-hostname-hashing-in-pound/"&gt;this one on Pound&lt;/a&gt;, and &lt;a href="http://blog.apokalyptik.com/2007/10/10/so-you-wanna-see-an-image/"&gt;another on varnish&lt;/a&gt;).  That information will probably inform some technology choices we might make at &lt;a href="http://www.synthasite.com/"&gt;SynthaSite&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/TechGeneral?a=9EUT138kHTU:XiEs9iNiTpU:K8qFz0M-AJI"&gt;&lt;img src="http://feeds.feedburner.com/~ff/TechGeneral?i=9EUT138kHTU:XiEs9iNiTpU:K8qFz0M-AJI" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TechGeneral/~4/9EUT138kHTU" height="1" width="1"/&gt;</content>
    <summary type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><a href="http://www.wordcamp.co.za/2008/"><img src="http://techgeneral.org/files/2008/08/24/wordcampsa08.png" border="0" alt="" hspace="5" vspace="5" width="250" height="68" align="right" /></a>
<p>
At <a href="http://www.wordcamp.co.za/2008/">WordCamp South Africa 2008</a>, held in Cape Town yesterday, we were given a brief overview of how <a href="http://wordpress.com/">Wordpress.com</a> is set up to scale.
</p>
<p>
<a href="http://ma.tt/">Matt Mullenweg</a> set the scene with some idea of just how huge Wordpress.com is.  I may mess up a few numbers mentioned, but there've been something like 6.5 billion page views on Wordpress.com since the beginning of the year, there are 3.8 million Wordpress.com hosted blogs (only Blogger is bigger), and there are 1.4 billion words in posts created on Wordpress.com.
</p>
<p>
<a href="http://blog.linuxinternet.org/">Warwick Poole</a> then gave us some more in-depth numbers, although, judging from the audience's reaction, pointing out that Wordpress.com was bigger than AdultFriendFinder was already a pretty good and well-understood indication.  In May 2008, Wordpress.com served 693 million page views, and this rose to 812 million page views in July.  Over 1TB of media was uploaded in May, 1.3TB in July.  In May, 417TB of traffic left the Wordpress.com data centres.  These numbers are available in the <a href="http://en.blog.wordpress.com/2008/08/05/july-wrap-up-2/">"July wrap-up" post</a> on the Wordpress.com web log.
</p>
<p>
Apparently, across the approximately 710 servers, 10 000 web requests and 10 000 database requests are handled per second (I wasn't intelligent enough to write down whether this was the average).  110 requests per second go to Amazon's S3 storage service, while 3TB of media is cached on their own media caches.  They output 1.5TB/s (I wrote TB, so it probably is TB and not Tb.  I'm guessing this is peak). They experience approximately 5 server failures a week. <br />
</p>
<p>
How is it put together?  They use Round Robin DNS, which determines the data centre (from testing, it seems they round-robin six IPs - two IPs for each of three data centres).  There, the request hits a load balancer using some combination of <a href="http://nginx.net/">nginx</a>, <a href="http://www.backhand.org/wackamole/">wackamole</a>, and <a href="http://www.spread.org/">spread</a>.  They use <a href="http://varnish.projects.linpro.no/">Varnish</a> for serving at least media, and currently use <a href="http://www.litespeedtech.com/">Litespeed</a> web servers.  They also use MySQL and <a href="http://www.danga.com/memcached/">memcached</a>.
</p>
<p>
They use (and developed) the <a href="http://wordpress.org/extend/plugins/batcache/">batcache Wordpress plugin</a> to serve content from memcached - according to the documentation, batcache only potentially serves stale content to first-time visitors - visitors who have interacted with the web log receive up-to-date content.
</p>
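<p>
The idea of serving possibly stale pages only to visitors who haven't interacted with the site can be sketched in a few lines of Python.  This is not batcache's actual code: the dictionary stands in for memcached, and the <code>serve</code> and <code>render_page</code> functions, the <code>has_interacted</code> flag, and the five minute lifetime are all assumptions for the example.
</p>
<pre class="prettyprint">
```python
import time

CACHE_SECONDS = 300  # hypothetical lifetime, not a real batcache setting
_cache = {}          # stands in for memcached: url -> (stored_at, html)
render_calls = []    # records each expensive render, for demonstration


def render_page(url):
    """Placeholder for the expensive, database-backed page generation."""
    render_calls.append(url)
    return "page for %s (render #%d)" % (url, len(render_calls))


def serve(url, has_interacted):
    """Return page content, using the cache only for anonymous visitors."""
    if has_interacted:
        # Commenters and logged-in users always get a freshly built page.
        return render_page(url)
    now = time.time()
    cached = _cache.get(url)
    if cached is None or now - cached[0] >= CACHE_SECONDS:
        # Nothing cached yet, or the cached copy has expired: rebuild it.
        cached = (now, render_page(url))
        _cache[url] = cached
    return cached[1]


first = serve("/about", has_interacted=False)
second = serve("/about", has_interacted=False)  # served from the cache
fresh = serve("/about", has_interacted=True)    # bypasses the cache
print(first == second, len(render_calls))
```
</pre>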
<p>
When new media is uploaded, its existence and initial location are stored in a table.  As necessary, the other data centres will create their own local copies of that media, and update that table.  The backup media stores in the data centres are write-only - apparently nothing is ever deleted from them.
</p>
<p>
That's about all I wrote down, but the Wordpress.com blog and the blogs of various employees have quite a bit more information about how Wordpress.com is set up and the sort of load and traffic it handles (such as <a href="http://barry.wordpress.com/2008/04/28/load-balancer-update/">this post on nginx replacing Pound</a>, <a href="http://barry.wordpress.com/2007/11/01/static-hostname-hashing-in-pound/">this one on Pound</a>, and <a href="http://blog.apokalyptik.com/2007/10/10/so-you-wanna-see-an-image/">another on varnish</a>).  That information will probably inform some technology choices we might make at <a href="http://www.synthasite.com/">SynthaSite</a>.
</p>
</div>
    </summary>
  <feedburner:origLink>http://techgeneral.org/wordpresscom-scalability-at-wordcamp-sa-2008</feedburner:origLink></entry><entry>
    <title>Subversion (SVN) shortcuts to revert previous commits</title>
    <id>http://techgeneral.org/subversion-svn-shortcuts-to-revert-previous-commits</id>
    <updated>2008-08-22T15:01:03Z</updated>
    <link rel="alternate" href="http://feedproxy.google.com/~r/TechGeneral/~3/KAzIoHXwkM4/subversion-svn-shortcuts-to-revert-previous-commits" />
    <published>2008-08-22T15:01:03Z</published>
    <author>
        <name>Neil Blakey-Milner</name>
    </author>
    <content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Good version control system usage prevents many disasters, but that doesn't necessarily mean you won't make your own mistakes.  Today, I mistakenly included a file in a commit that I didn't want to commit yet.  I learned two new tricks while spending a few minutes puzzling the best way to get back to where I was before with that file. &lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;First, make a mistake: &lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;$ &lt;code class="typed"&gt;svn commit -m "&lt;em&gt;...&lt;/em&gt;"&lt;/code&gt;&lt;br&gt;Sending        dev.cfg&lt;br&gt;Sending        gibe/plugin.py&lt;br&gt;Transmitting file data ..&lt;br&gt;Committed revision 114.&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&lt;code class="command"&gt;svn merge&lt;/code&gt; is the tool to use for this:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;merge: Apply the differences between two sources to a working copy path.&lt;br&gt;usage: 1. merge sourceURL1[@N] sourceURL2[@M] [WCPATH]&lt;br&gt;       2. merge sourceWCPATH1@N sourceWCPATH2@M [WCPATH]&lt;br&gt;       3. merge [-c M | -r N:M] SOURCE[@REV] [WCPATH]&lt;br&gt;&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&lt;strong&gt;Trick #1&lt;/strong&gt;: use &lt;code class="command"&gt;svn merge&lt;/code&gt;'s 3rd usage pattern with the &lt;code class="commandoption"&gt;-c&lt;/code&gt; option with the negative of the revision you've committed, and (here comes the trick) use &lt;code class="filename"&gt;.&lt;/code&gt; (the current directory) as the source of the merge: &lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;$ &lt;code class="typed"&gt;svn merge -c -&lt;em&gt;114&lt;/em&gt; .&lt;/code&gt;&lt;br&gt;U    gibe/plugin.py&lt;br&gt;U    dev.cfg&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;With that your &lt;strong&gt;working copy&lt;/strong&gt; is now where the &lt;strong&gt;repository&lt;/strong&gt; was before your commit.  Commit that to the repository, and the &lt;strong&gt;repository&lt;/strong&gt; is back where &lt;strong&gt;it&lt;/strong&gt; was before your commit.&lt;/p&gt;&lt;p&gt;Now your working copy is where it was before you made any changes - but you probably want those changes back.  Easy enough:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;$ &lt;code class="typed"&gt;svn merge -c &lt;em&gt;114&lt;/em&gt; .&lt;/code&gt;&lt;br&gt;U    gibe/plugin.py&lt;br&gt;U    dev.cfg&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Now your working copy is back where it was before you did the mistaken commit.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&lt;strong&gt;Trick #2&lt;/strong&gt;: Of course, if your mistake is like mine and you only messed up one file and everything else is as it should be, you can just do this on one file, by using &lt;code class="command"&gt;svn merge&lt;/code&gt;'s 2nd usage pattern:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;$ &lt;code class="typed"&gt;svn merge &lt;em&gt;dev.cfg&lt;/em&gt;@&lt;em&gt;114&lt;/em&gt; &lt;em&gt;dev.cfg&lt;/em&gt;@&lt;em&gt;113&lt;/em&gt;&lt;/code&gt;&lt;br&gt;U    dev.cfg&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Commit that, and your repository is back to normal.  Then run:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="console"&gt;$ &lt;code class="typed"&gt;svn merge &lt;em&gt;dev.cfg&lt;/em&gt;@&lt;em&gt;113&lt;/em&gt; &lt;em&gt;dev.cfg&lt;/em&gt;@&lt;em&gt;114&lt;/em&gt;&lt;/code&gt;&lt;br&gt;U  dev.cfg&lt;br&gt;&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Now the file is back where it was before your botch.&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/TechGeneral?a=KAzIoHXwkM4:GxlcnxC25nY:K8qFz0M-AJI"&gt;&lt;img src="http://feeds.feedburner.com/~ff/TechGeneral?i=KAzIoHXwkM4:GxlcnxC25nY:K8qFz0M-AJI" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/TechGeneral/~4/KAzIoHXwkM4" height="1" width="1"/&gt;</content>
    <summary type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p>Good version control system usage prevents many disasters, but that doesn't necessarily mean you won't make your own mistakes.  Today, I mistakenly included a file in a commit that I didn't want to commit yet.  I learned two new tricks while spending a few minutes puzzling out the best way to get back to where I was before with that file. </p>

<p>First, make a mistake: </p>

<pre class="console">$ <code class="typed">svn commit -m "<em>...</em>"</code><br />Sending        dev.cfg<br />Sending        gibe/plugin.py<br />Transmitting file data ..<br />Committed revision 114.<br /></pre>

<p><code class="command">svn merge</code> is the tool to use for this:</p>

<pre class="console">merge: Apply the differences between two sources to a working copy path.<br />usage: 1. merge sourceURL1[@N] sourceURL2[@M] [WCPATH]<br />       2. merge sourceWCPATH1@N sourceWCPATH2@M [WCPATH]<br />       3. merge [-c M | -r N:M] SOURCE[@REV] [WCPATH]<br /><br /></pre>

<p><strong>Trick #1</strong>: use <code class="command">svn merge</code>'s 3rd usage pattern with the <code class="commandoption">-c</code> option with the negative of the revision you've committed, and (here comes the trick) use <code class="filename">.</code> (the current directory) as the source of the merge: </p>

<pre class="console">$ <code class="typed">svn merge -c -<em>114</em> .</code><br />U    gibe/plugin.py<br />U    dev.cfg<br /></pre>

<p>With that your <strong>working copy</strong> is now where the <strong>repository</strong> was before your commit.  Commit that to the repository, and the <strong>repository</strong> is back where <strong>it</strong> was before your commit.</p><p>Now your working copy is where it was before you made any changes - but you probably want those changes back.  Easy enough:</p>

<pre class="console">$ <code class="typed">svn merge -c <em>114</em> .</code><br />U    gibe/plugin.py<br />U    dev.cfg<br /></pre>

<p>Now your working copy is back where it was before you did the mistaken commit.</p>
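<p>
If you find yourself doing this often, the undo-then-reapply sequence could be wrapped up in a small script.  The following Python sketch only builds and prints the two <code>svn merge -c</code> commands shown above unless told otherwise; the function names and the <code>dry_run</code> flag are made up for the example, and it deliberately leaves the commit in between to you.
</p>
<pre class="prettyprint">
```python
import subprocess


def reverse_merge_command(revision, path="."):
    """Build the argv that undoes `revision` in the working copy at `path`."""
    return ["svn", "merge", "-c", "-%d" % revision, path]


def reapply_merge_command(revision, path="."):
    """Build the argv that re-applies `revision` to the working copy."""
    return ["svn", "merge", "-c", str(revision), path]


def run(argv, dry_run=True):
    """Print the command, and only execute it when dry_run is False."""
    print("$ " + " ".join(argv))
    if not dry_run:
        subprocess.check_call(argv)


# Undo revision 114, (commit by hand), then bring the changes back.
run(reverse_merge_command(114))
run(reapply_merge_command(114))
```
</pre>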

<p><strong>Trick #2</strong>: Of course, if your mistake is like mine and you only messed up one file and everything else is as it should be, you can just do this on one file, by using <code class="command">svn merge</code>'s 2nd usage pattern:</p>

<pre class="console">$ <code class="typed">svn merge <em>dev.cfg</em>@<em>114</em> <em>dev.cfg</em>@<em>113</em></code><br />U    dev.cfg<br /></pre>

<p>Commit that, and your repository is back to normal.  Then run:</p>

<pre class="console">$ <code class="typed">svn merge <em>dev.cfg</em>@<em>113</em> <em>dev.cfg</em>@<em>114</em></code><br />U    dev.cfg<br /></pre>

<p>Now the file is back where it was before your botch.</p></div>
    </summary>
  <feedburner:origLink>http://techgeneral.org/subversion-svn-shortcuts-to-revert-previous-commits</feedburner:origLink></entry>
</feed>
