<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>Bioinformatics Zen</title><link>http://www.bioinformaticszen.com/</link><language>en</language><copyright>Creative Commons Attribution 3.0 Unported</copyright><managingEditor>mail@michaelbarton.me.uk (Michael Barton)</managingEditor><lastBuildDate>Wed, 21 Oct 2009 10:23:15 PDT</lastBuildDate><description></description><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/BioinformaticsZen" type="application/rss+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item><title>Dealing With Big Data In Bioinformatics</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/g1_uFYU0ZZc/dealing-with-big-data-in-bioinformatics</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Wed, 21 Oct 2009 16:00:00 PDT</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/dealing-with-big-data-in-bioinformatics</guid><description>&lt;p&gt;Bioinformatics usually involves shuffling data into the right format for plotting or statistical tests. I prefer to &lt;a href='http://www.bioinformaticszen.com/software/using_a_database/'&gt;use a database to store and format data&lt;/a&gt; as I think this makes projects easier to maintain compared with using just scripts. I find a dynamic language like Ruby and libraries for database manipulation like &lt;a href='http://ar.rubyonrails.org/'&gt;ActiveRecord&lt;/a&gt; make using a database relatively simple.&lt;/p&gt;

&lt;p&gt;Using a database, however, stops being simple when you have to deal with very large amounts of data. Here I&amp;#8217;m outlining my experience of analysing gigabytes of data with millions of data points, and how I tried to improve my software&amp;#8217;s performance when manipulating this data. I&amp;#8217;ve ordered the approaches with what I think is the most pragmatic first.&lt;/p&gt;

&lt;h2 id='the_simple_things'&gt;The simple things&lt;/h2&gt;

&lt;p&gt;Obvious but sometimes overlooked.&lt;/p&gt;

&lt;h3 id='1_use_a_bigger_computer'&gt;1. Use a bigger computer&lt;/h3&gt;

&lt;p&gt;Using a faster computer might seem like a lazy option compared with optimising your code, but if the analysis works on your computer it should work the same, only faster, on a more powerful one. Using a faster computer is probably one of the few things I tried which didn&amp;#8217;t involve modifying my code and therefore shouldn&amp;#8217;t introduce any bugs. I used unit tests to make sure the code still worked as expected though.&lt;/p&gt;

&lt;h3 id='2_add_database_indices'&gt;2. Add database indices&lt;/h3&gt;

&lt;p&gt;Since I&amp;#8217;m using a database, making sure it runs as fast as possible is another cheap way to improve performance. Properly indexed database columns reduce running times when searching or joining tables, as an index means rows are looked up much faster. Database indices are also relatively easy to implement: just specify which columns need to be indexed, either using SQL or, in my case, using &lt;a href='http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/SchemaStatements.html#M001911'&gt;ActiveRecord&lt;/a&gt;.&lt;/p&gt;
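
&lt;p&gt;To illustrate why an index helps, here is a minimal sketch in plain Ruby rather than database code. A linear scan stands in for searching an unindexed column, and a Hash built once stands in for the index (real databases typically use tree structures, so this is only an analogy). The row contents and the names ROWS, INDEX, find_by_scan and find_by_index are all made up for the example.&lt;/p&gt;

```ruby
# Hypothetical rows: 100,000 residue records keyed by an integer id.
ROWS = (1..100_000).map { |id| { id: id, residue: %w[A C D E][id % 4] } }

# Without an index: every lookup scans the rows until it finds a match.
def find_by_scan(rows, id)
  rows.find { |row| row[:id] == id }
end

# With an "index": build a Hash once, then each lookup is a single hash access.
INDEX = ROWS.each_with_object({}) { |row, index| index[row[:id]] = row }

def find_by_index(index, id)
  index[id]
end

puts find_by_scan(ROWS, 99_999)[:residue]
puts find_by_index(INDEX, 99_999)[:residue]
```

&lt;p&gt;Both calls return the same row; the difference is how much work each lookup does. In an ActiveRecord migration the corresponding step is a call to add_index with the table and column names.&lt;/p&gt;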

&lt;h3 id='3_use_a_faster_language_interpreter'&gt;3. Use a faster language interpreter&lt;/h3&gt;

&lt;p&gt;Most of the time the standard version of Ruby is sufficient for running my code. In the last two years though, different but faster versions have been created such as &lt;a href='http://www.rubyenterpriseedition.com/'&gt;REE&lt;/a&gt;, &lt;a href='http://jruby.org/'&gt;JRuby&lt;/a&gt; and &lt;a href='http://www.ruby-lang.org/en/news/2009/07/20/ruby-1-9-1-p243-released/'&gt;Ruby 1.9&lt;/a&gt;. Therefore when I was encountering long running times from processing millions of database rows I thought it was worth trying a faster Ruby version. I use Ruby 1.9 and it did improve performance. One caveat though was that I had to make my code compatible with the newer version, &lt;a href='http://blog.grayproductions.net/articles/getting_code_ready_for_ruby_19'&gt;specifically for the CSV library&lt;/a&gt;. These code changes were still relatively cheap to implement given the noticeable performance benefits.&lt;/p&gt;

&lt;h2 id='delete_stuff'&gt;Delete stuff&lt;/h2&gt;

&lt;p&gt;After the above three points I generally had to start digging around in my code - which is bad because changing working code usually creates broken code. A good way to optimise code, without introducing too many new problems, is just to delete it entirely.&lt;/p&gt;

&lt;h3 id='4_delete_unnecessary_data_and_analysis'&gt;4. Delete unnecessary data and analysis&lt;/h3&gt;

&lt;p&gt;I find that I often generate variables which I think might be useful at some future time. As you might expect just deleting the code that produces these variables removes the time required to compute them. More often than not I never ended up needing the variable anyway.&lt;/p&gt;

&lt;h3 id='5_remove_database_table_joins'&gt;5. Remove database table joins&lt;/h3&gt;

&lt;p&gt;I&amp;#8217;m using a database because usually I want to compare two or more sets of data and therefore I need to format them in a way that makes them comparable. Once formatted I join each variable in the database and then print the results as a CSV file.&lt;/p&gt;

&lt;p&gt;The problem with joining a large number of database records, even with database indices, is that it can take a very long time. The amount of time required also increases the more the &lt;a href='http://en.wikipedia.org/wiki/Database_normalization'&gt;data is normalised&lt;/a&gt;. To try and fix this I found that I could drop the smaller of two variables I was joining and instead do the join further into my workflow.&lt;/p&gt;

&lt;p&gt;For example I had two variables. The first contained millions of entries, each one corresponding to a protein residue. The second contained only around 100 entries, each one corresponding to one of twenty amino acids. Merging these two variables in my database required millions of joins and took a long time. Instead I joined my amino acid data to my protein data after I had calculated the mean of each protein residue. This reduced the number of joins from millions down to around 100. I did the join as I plotted it using the &lt;a href='http://wiki.r-project.org/rwiki/doku.php?id=tips:data-frames:merge'&gt;merge function in R&lt;/a&gt;.&lt;/p&gt;
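
&lt;p&gt;The idea of summarising first and joining afterwards can be sketched in plain Ruby. This is an illustration only, not the code from the analysis described above: the four residue rows, the sulfur_atoms property and the helper names are assumptions chosen to keep the example small.&lt;/p&gt;

```ruby
# Hypothetical per-residue measurements; the real table had millions of rows.
RESIDUES = [
  { amino_acid: 'C', value: 2.0 },
  { amino_acid: 'C', value: 4.0 },
  { amino_acid: 'G', value: 1.0 },
  { amino_acid: 'G', value: 2.0 }
].freeze

# Hypothetical small lookup table: one property per amino acid.
PROPERTIES = { 'C' => { sulfur_atoms: 1 }, 'G' => { sulfur_atoms: 0 } }.freeze

# Step 1: collapse the big table to one mean per amino acid.
def mean_by_amino_acid(residues)
  residues.group_by { |r| r[:amino_acid] }.transform_values do |rows|
    rows.sum { |r| r[:value] } / rows.size
  end
end

# Step 2: join the small summary to the small lookup table.
def join_summary(means, properties)
  means.map do |amino_acid, mean|
    { amino_acid: amino_acid, mean: mean }.merge(properties.fetch(amino_acid))
  end
end

SUMMARY = join_summary(mean_by_amino_acid(RESIDUES), PROPERTIES)
SUMMARY.each { |row| puts row.inspect }
```

&lt;p&gt;The join in step 2 touches one row per amino acid rather than one row per residue, which is where the saving comes from.&lt;/p&gt;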

&lt;h2 id='code_optimisation'&gt;Code optimisation&lt;/h2&gt;

&lt;p&gt;When I was encountering performance problems I left optimising code as a last resort. There were three reasons for this. The first is that &lt;a href='http://fetter.org/optimization.html'&gt;premature optimisation may be the root of all evil&lt;/a&gt;. The second reason is that the enemy of good-enough code is perfect code - when I start optimising code I tend to keep going more than is necessary. Code doesn&amp;#8217;t need to be as fast as possible though, just fast enough to get the results I need. The third reason is that optimising code means changing code, which introduces bugs, and so the more the code is optimised the greater the chance of bugs. Code optimisation was a necessity though because my analysis was still taking days to run. I should also point out that my code optimisation was combined with thorough unit testing and benchmarking - which I think is usually how it should be done.&lt;/p&gt;

&lt;h3 id='6a_batch_load_database_query_results'&gt;6a. Batch load database query results&lt;/h3&gt;

&lt;p&gt;One easy way I reduced running times was by batch loading the database table rows rather than trying to load a big table all at once. Pulling all the database records into memory means that most of the running time is spent loading the data into memory rather than actually dealing with it. Batch loading instead pulls subsets of records into memory at a time and each subset is then processed before the next set of rows is retrieved. This means less memory is used each time. An example of this in Ruby is &lt;a href='http://ryandaigle.com/articles/2009/2/23/what-s-new-in-edge-rails-batched-find'&gt;the ActiveRecord method find_in_batches&lt;/a&gt;.&lt;/p&gt;
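
&lt;p&gt;The batching pattern itself can be sketched without a database. In the example below a lazily generated sequence stands in for a large table and each_slice stands in for the batched query; the each_row and total_score helpers and the score field are hypothetical.&lt;/p&gt;

```ruby
# Hypothetical stand-in for a large table: rows are generated one at a time,
# so nothing is held in memory until a batch asks for it.
def each_row(total)
  return enum_for(:each_row, total) unless block_given?

  1.upto(total) { |id| yield({ id: id, score: id * 2 }) }
end

# Process the rows in fixed-size batches and keep only a running total.
def total_score(total_rows, batch_size)
  sum = 0
  largest_batch = 0
  each_row(total_rows).each_slice(batch_size) do |batch|
    largest_batch = [largest_batch, batch.size].max
    sum += batch.sum { |row| row[:score] }
  end
  [sum, largest_batch]
end

sum, largest_batch = total_score(10_000, 500)
puts "total score: #{sum}, largest batch held at once: #{largest_batch}"
```

&lt;p&gt;Only one batch of rows exists at any moment, however many rows there are in total. With ActiveRecord, find_in_batches plays the role that each_slice plays here.&lt;/p&gt;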

&lt;h3 id='6b_association_loading'&gt;6b. Association loading&lt;/h3&gt;

&lt;p&gt;&lt;a href='http://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations'&gt;Association loading&lt;/a&gt; means that when a row of Table A is retrieved from the database, all the rows associated with it in Table B are also retrieved. This will usually mean that the database is only queried twice, once to find the records from Table A and once to find the records from Table B. The alternative option is to use a loop to retrieve each required row from Table B, but this will mean as many database queries as there are rows - and more queries means more running time.&lt;/p&gt;
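
&lt;p&gt;The difference in query counts can be shown with a small simulation. The FakeDatabase class below is a made-up in-memory stand-in that simply counts how often it is asked for data; it is not ActiveRecord, and the protein and residue rows are invented for the example.&lt;/p&gt;

```ruby
# Hypothetical in-memory "database" that counts how many queries it receives.
class FakeDatabase
  attr_reader :query_count

  def initialize
    @proteins = [{ id: 1, name: 'P1' }, { id: 2, name: 'P2' }, { id: 3, name: 'P3' }]
    @residues = [
      { protein_id: 1, code: 'C' }, { protein_id: 1, code: 'G' },
      { protein_id: 2, code: 'A' }, { protein_id: 3, code: 'C' }
    ]
    @query_count = 0
  end

  def proteins
    @query_count += 1
    @proteins
  end

  # One query per call: fetch the residues for the given protein ids.
  def residues_for(protein_ids)
    @query_count += 1
    @residues.select { |residue| protein_ids.include?(residue[:protein_id]) }
  end
end

# Loop version: one query for the proteins, then one more per protein.
def load_in_loop(db)
  db.proteins.map do |protein|
    [protein[:name], db.residues_for([protein[:id]]).map { |r| r[:code] }]
  end.to_h
end

# Association-loading version: two queries in total, grouped in memory.
def load_with_associations(db)
  proteins = db.proteins
  ids = proteins.map { |protein| protein[:id] }
  by_protein = db.residues_for(ids).group_by { |r| r[:protein_id] }
  proteins.map do |protein|
    [protein[:name], by_protein.fetch(protein[:id], []).map { |r| r[:code] }]
  end.to_h
end

loop_db = FakeDatabase.new
eager_db = FakeDatabase.new
load_in_loop(loop_db)
load_with_associations(eager_db)
puts "loop: #{loop_db.query_count} queries, association loading: #{eager_db.query_count} queries"
```

&lt;p&gt;The loop version issues one query per protein on top of the first, so its query count grows with the table, while the second version stays at two.&lt;/p&gt;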

&lt;h3 id='6c_database_querying_in_loops'&gt;6c. Database querying in loops&lt;/h3&gt;

&lt;p&gt;I found that large loops which query the database often accounted for the majority of my software&amp;#8217;s running time. I improved this by moving the database calls up or out of the loops as much as possible and caching rows in memory beforehand. This meant the looping code was looking things up in memory rather than querying the database each time. A similar approach can also be used to avoid object creation inside loops, which also seems to improve performance. Combining this approach with the one below was what most improved the running time in my analysis.&lt;/p&gt;

&lt;h3 id='6d_use_raw_sql'&gt;6d. Use raw SQL&lt;/h3&gt;

&lt;p&gt;Object-relational mapping (ORM) libraries like ActiveRecord allow the database to be manipulated using object orientated programming, which generally makes using a database a lot easier. Using an ORM does however add a performance penalty because it&amp;#8217;s an extra layer on top of the database. When I was doing millions of database updates I found that skipping the ORM and directly using raw SQL contributed to a large saving of processing time. The advantages of this technique are neatly outlined &lt;a href='http://www.igvita.com/2007/10/29/boosting-activerecord-performance/'&gt;by Ilya Grigorik&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='thats_it'&gt;That&amp;#8217;s it.&lt;/h2&gt;

&lt;p&gt;Quite a long post I know. I know code performance is a weighty topic and probably what I&amp;#8217;m outlining here isn&amp;#8217;t the best way to go about dealing with large data. I&amp;#8217;m sure there are better ways and better technologies to manage large amounts of data too, e.g. map/reduce or schemaless databases. I&amp;#8217;m not a trained computer scientist or a software engineer, but a biologist, and what I&amp;#8217;ve outlined is what allowed me to produce the results I need. I&amp;#8217;d be happy to read any further suggestions in the comments though.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/dealing-with-big-data-in-bioinformatics</feedburner:origLink></item><item><title>Data Analysis Using R Functions As Objects</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/ZF4BOldmT68/data_analysis_using_r_functions_as_objects</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Thu, 01 Oct 2009 16:00:00 PDT</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects</guid><description>&lt;p&gt;The R language is useful because of the available statistical and plotting functions in the base and addon packages. Before using any function though it&amp;#8217;s usually necessary to get your input data into the format that the function expects. Performing complicated data manipulations with R&amp;#8217;s standard methods for accessing and subsetting data can however quickly lead to complex and confusing R scripts.&lt;/p&gt;

&lt;p&gt;Processing a complex dataset in some way, for example to find the mean for a range of subsets, generally means creating variables to keep track of a series of levels to subset the data by, then creating a new variable for each subset. Creating many new variables can lead to complex R scripts that are difficult to understand. I&amp;#8217;m going to try and outline here an interesting feature of R that allows functions to be treated as objects and can make for much cleaner and shorter R code. When I&amp;#8217;m discussing &amp;#8220;functions as objects&amp;#8221; here I mean, for example, that the function for calculating standard deviation can be treated in the same way as the array of data it works on.&lt;/p&gt;

&lt;p&gt;In the examples here I&amp;#8217;m using the crabs data set, which is part of the MASS package. This data contains morphological measurements on both sexes of two crab species.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;library&lt;span class='p'&gt;(&lt;/span&gt;MASS&lt;span class='p'&gt;)&lt;/span&gt;
data&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;)&lt;/span&gt;
head&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;)&lt;/span&gt;
   sp sex index   FL  RW   CL   CW  BD
&lt;span class='m'&gt;1&lt;/span&gt;   B   M     &lt;span class='m'&gt;1&lt;/span&gt;  &lt;span class='m'&gt;8.1&lt;/span&gt; &lt;span class='m'&gt;6.7&lt;/span&gt; &lt;span class='m'&gt;16.1&lt;/span&gt; &lt;span class='m'&gt;19.0&lt;/span&gt; &lt;span class='m'&gt;7.0&lt;/span&gt;
&lt;span class='m'&gt;2&lt;/span&gt;   B   M     &lt;span class='m'&gt;2&lt;/span&gt;  &lt;span class='m'&gt;8.8&lt;/span&gt; &lt;span class='m'&gt;7.7&lt;/span&gt; &lt;span class='m'&gt;18.1&lt;/span&gt; &lt;span class='m'&gt;20.8&lt;/span&gt; &lt;span class='m'&gt;7.4&lt;/span&gt;
&lt;span class='m'&gt;3&lt;/span&gt;   B   M     &lt;span class='m'&gt;3&lt;/span&gt;  &lt;span class='m'&gt;9.2&lt;/span&gt; &lt;span class='m'&gt;7.8&lt;/span&gt; &lt;span class='m'&gt;19.0&lt;/span&gt; &lt;span class='m'&gt;22.4&lt;/span&gt; &lt;span class='m'&gt;7.7&lt;/span&gt;
&lt;span class='m'&gt;4&lt;/span&gt;   B   M     &lt;span class='m'&gt;4&lt;/span&gt;  &lt;span class='m'&gt;9.6&lt;/span&gt; &lt;span class='m'&gt;7.9&lt;/span&gt; &lt;span class='m'&gt;20.1&lt;/span&gt; &lt;span class='m'&gt;23.1&lt;/span&gt; &lt;span class='m'&gt;8.2&lt;/span&gt;
&lt;span class='m'&gt;5&lt;/span&gt;   B   M     &lt;span class='m'&gt;5&lt;/span&gt;  &lt;span class='m'&gt;9.8&lt;/span&gt; &lt;span class='m'&gt;8.0&lt;/span&gt; &lt;span class='m'&gt;20.3&lt;/span&gt; &lt;span class='m'&gt;23.0&lt;/span&gt; &lt;span class='m'&gt;8.2&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;h3 id='performing_functions_on_subsets_of_data'&gt;Performing functions on subsets of data&lt;/h3&gt;

&lt;p&gt;I have a function called &lt;em&gt;do_something&lt;/em&gt; which I want to use to analyse two columns in the crab data. First I want to subset for each sex in each species and call my function on each data subset. In the example below I&amp;#8217;m getting the data for the orange female crabs then calling my function on two of the columns in the data.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  do_something &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; &lt;span class='kr'&gt;function&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;x&lt;span class='p'&gt;,&lt;/span&gt;y&lt;span class='p'&gt;){&lt;/span&gt;
    &lt;span class='c1'&gt;# Function code goes here ...&lt;/span&gt;
  &lt;span class='p'&gt;}&lt;/span&gt;

  &lt;span class='c1'&gt;# Subset my data&lt;/span&gt;
  orange_girls &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; subset&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;,&lt;/span&gt; sex &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;F&amp;#39;&lt;/span&gt; &lt;span class='o'&gt;&amp;amp;&lt;/span&gt; sp &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;O&amp;#39;&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;

  &lt;span class='c1'&gt;# Call my function&lt;/span&gt;
  do_something&lt;span class='p'&gt;(&lt;/span&gt;orange_girls&lt;span class='p'&gt;$&lt;/span&gt;CW&lt;span class='p'&gt;,&lt;/span&gt;orange_girls&lt;span class='p'&gt;$&lt;/span&gt;CL&lt;span class='p'&gt;)&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;I dislike this code because I have to first create a new variable as a subset of the data before I call my function on it. Secondly, when I call the &lt;em&gt;do_something&lt;/em&gt; function I have to specify the columns in the data I want using the &lt;em&gt;$&lt;/em&gt; symbol. Instead compare the same functionality but using the &lt;em&gt;with&lt;/em&gt; command.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  with&lt;span class='p'&gt;(&lt;/span&gt;subset&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;,&lt;/span&gt; sex &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;F&amp;#39;&lt;/span&gt; &lt;span class='o'&gt;&amp;amp;&lt;/span&gt; sp &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;O&amp;#39;&lt;/span&gt;&lt;span class='p'&gt;),&lt;/span&gt; do_something&lt;span class='p'&gt;(&lt;/span&gt;CW&lt;span class='p'&gt;,&lt;/span&gt;CL&lt;span class='p'&gt;))&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Here I&amp;#8217;m doing exactly the same thing, but the &lt;em&gt;do_something&lt;/em&gt; command is called in the context of the data subset which means I don&amp;#8217;t have to specify the columns using &lt;em&gt;$&lt;/em&gt;. I also don&amp;#8217;t have to create a new subset variable beforehand, I can just pass the subset of data as the first argument to the &lt;em&gt;with&lt;/em&gt; command. I&amp;#8217;m not restricted to just running one function either.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  &lt;span class='c1'&gt;# Same thing but using a multi-line code block via curly braces&lt;/span&gt;
  with&lt;span class='p'&gt;(&lt;/span&gt;subset&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;,&lt;/span&gt; sex &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;F&amp;#39;&lt;/span&gt; &lt;span class='o'&gt;&amp;amp;&lt;/span&gt; sp &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='s'&gt;&amp;#39;O&amp;#39;&lt;/span&gt;&lt;span class='p'&gt;),{&lt;/span&gt;
    do_something&lt;span class='p'&gt;(&lt;/span&gt;CW&lt;span class='p'&gt;,&lt;/span&gt;CL&lt;span class='p'&gt;)&lt;/span&gt;

    &lt;span class='c1'&gt;# Use more functions on the same subset&lt;/span&gt;
    lm&lt;span class='p'&gt;(&lt;/span&gt;CW ~ CL&lt;span class='p'&gt;)&lt;/span&gt;
  &lt;span class='p'&gt;})&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;By using curly braces &lt;em&gt;{ }&lt;/em&gt; I can call multiple functions as part of the same command. I can do this because the curly braced expression is an argument to the &lt;em&gt;with&lt;/em&gt; function.&lt;/p&gt;

&lt;h3 id='looping_through_subsets_of_data'&gt;Looping through subsets of data&lt;/h3&gt;

&lt;p&gt;So far I&amp;#8217;m using just a single subset of the crab data, but I will want to call a function on all combinations of sex and species. Usually this means using &lt;em&gt;for&lt;/em&gt; loops to iterate through each combination of sex and species to break the data into smaller chunks.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  &lt;span class='c1'&gt;# Arrays of values for each type of species and sex&lt;/span&gt;
  species &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; unique&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;$&lt;/span&gt;sp&lt;span class='p'&gt;)&lt;/span&gt;
  sexes   &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; unique&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;$&lt;/span&gt;sex&lt;span class='p'&gt;)&lt;/span&gt;

  &lt;span class='c1'&gt;# Loop through species ...&lt;/span&gt;
  &lt;span class='kr'&gt;for&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;i in &lt;span class='m'&gt;1&lt;/span&gt;:length&lt;span class='p'&gt;(&lt;/span&gt;species&lt;span class='p'&gt;)){&lt;/span&gt;

    &lt;span class='c1'&gt;# ... loop through sex ..&lt;/span&gt;
    &lt;span class='kr'&gt;for&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;j in &lt;span class='m'&gt;1&lt;/span&gt;:length&lt;span class='p'&gt;(&lt;/span&gt;sexes&lt;span class='p'&gt;)){&lt;/span&gt;

      &lt;span class='c1'&gt;# ... and finally call a function on each subset&lt;/span&gt;
      something_else&lt;span class='p'&gt;(&lt;/span&gt;subset&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;,&lt;/span&gt; sp &lt;span class='o'&gt;==&lt;/span&gt; species&lt;span class='p'&gt;[&lt;/span&gt;i&lt;span class='p'&gt;]&lt;/span&gt; &lt;span class='o'&gt;&amp;amp;&lt;/span&gt; sex &lt;span class='o'&gt;==&lt;/span&gt; sexes&lt;span class='p'&gt;[&lt;/span&gt;j&lt;span class='p'&gt;]))&lt;/span&gt;
    &lt;span class='p'&gt;}&lt;/span&gt;
  &lt;span class='p'&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This is messy because I&amp;#8217;ve had to create two arrays just to keep track of the unique values for each species and sex. Furthermore I&amp;#8217;ve got two variables &lt;em&gt;i&lt;/em&gt; and &lt;em&gt;j&lt;/em&gt; which, if I change my loops at some point or add more loops, make the code more fragile and difficult to understand. Looking at this code you can also see there are nine lines just to select the subset I&amp;#8217;m interested in but only one line that actually calls my function. Instead of this I can create my own function to automatically iterate through the dataset and call any function I want.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  each &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; &lt;span class='kr'&gt;function&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;column&lt;span class='p'&gt;,&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;data&lt;span class='p'&gt;,&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;lambda&lt;span class='p'&gt;){&lt;/span&gt;
    
    &lt;span class='c1'&gt;# Find the column index from its name&lt;/span&gt;
    column_index  &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; which&lt;span class='p'&gt;(&lt;/span&gt;names&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;data&lt;span class='p'&gt;)&lt;/span&gt; &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='m'&gt;.&lt;/span&gt;column&lt;span class='p'&gt;)&lt;/span&gt;

    &lt;span class='c1'&gt;# Find the unique values in the column&lt;/span&gt;
    column_levels &lt;span class='o'&gt;&amp;lt;-&lt;/span&gt; unique&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;data&lt;span class='p'&gt;[,&lt;/span&gt;column_index&lt;span class='p'&gt;])&lt;/span&gt;

    &lt;span class='c1'&gt;# Loop over these values&lt;/span&gt;
    &lt;span class='kr'&gt;for&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;i in &lt;span class='m'&gt;1&lt;/span&gt;:length&lt;span class='p'&gt;(&lt;/span&gt;column_levels&lt;span class='p'&gt;)){&lt;/span&gt;

      &lt;span class='c1'&gt;# Subset the data and call the passed function on it&lt;/span&gt;
      &lt;span class='m'&gt;.&lt;/span&gt;lambda&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;data&lt;span class='p'&gt;[&lt;/span&gt;&lt;span class='m'&gt;.&lt;/span&gt;data&lt;span class='p'&gt;[,&lt;/span&gt;column_index&lt;span class='p'&gt;]&lt;/span&gt; &lt;span class='o'&gt;==&lt;/span&gt; column_levels&lt;span class='p'&gt;[&lt;/span&gt;i&lt;span class='p'&gt;],])&lt;/span&gt;  
    &lt;span class='p'&gt;}&lt;/span&gt;
  &lt;span class='p'&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The first two arguments of this function are the name of the column I want to subset the data by, and the data frame I want to iterate over. What&amp;#8217;s interesting though is that the last argument &lt;em&gt;.lambda&lt;/em&gt; is an R function. Because R treats functions as objects, they can be passed as arguments to other functions.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  &lt;span class='c1'&gt;# Another function as the last argument to this function&lt;/span&gt;
  each&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;sp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; crabs&lt;span class='p'&gt;,&lt;/span&gt; something_else&lt;span class='p'&gt;)&lt;/span&gt;

  &lt;span class='c1'&gt;# Or create a new anonymous function ...&lt;/span&gt;
  each&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;sp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; crabs&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='kr'&gt;function&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;x&lt;span class='p'&gt;){&lt;/span&gt;

    &lt;span class='c1'&gt;# ... and run multiple lines of code here&lt;/span&gt;
    something_else&lt;span class='p'&gt;(&lt;/span&gt;x&lt;span class='p'&gt;)&lt;/span&gt;
    with&lt;span class='p'&gt;(&lt;/span&gt;x&lt;span class='p'&gt;,&lt;/span&gt;lm&lt;span class='p'&gt;(&lt;/span&gt;CW ~ CL&lt;span class='p'&gt;))&lt;/span&gt;

  &lt;span class='p'&gt;})&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;I&amp;#8217;m still looping through each subset of the data but I&amp;#8217;ve now moved all the messy code to a separate function which makes the code more readable. Moving repetitive code like this to a separate function is good programming practice as it keeps your code &lt;a href='http://en.wikipedia.org/wiki/Don&amp;apos;t_repeat_yourself'&gt;DRY&lt;/a&gt; and also avoids the temptation for &lt;a href='http://en.wikipedia.org/wiki/Copy_and_paste_programming'&gt;copy and paste anti-patterns&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My example function isn&amp;#8217;t very sophisticated though and can&amp;#8217;t create loops for subsets of more than one column. Nevertheless, as you might expect, there are existing functions for exactly this type of loop-each-subset approach to data analysis. An example is the &lt;em&gt;lapply&lt;/em&gt; function in R base, however I find the syntax of this function hard to remember. Instead I prefer to use &lt;a href='http://had.co.nz/'&gt;Hadley Wickham&amp;#8217;s&lt;/a&gt; excellent &lt;a href='http://had.co.nz/plyr/'&gt;plyr package&lt;/a&gt;. Plyr is very simple to use, for example:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;  library&lt;span class='p'&gt;(&lt;/span&gt;plyr&lt;span class='p'&gt;)&lt;/span&gt;

  &lt;span class='c1'&gt;# Three arguments&lt;/span&gt;
  &lt;span class='c1'&gt;# 1. The dataframe&lt;/span&gt;
  &lt;span class='c1'&gt;# 2. The name of columns to subset by&lt;/span&gt;
  &lt;span class='c1'&gt;# 3. The function to call on each subset&lt;/span&gt;
  d_ply&lt;span class='p'&gt;(&lt;/span&gt;crabs&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='m'&gt;.&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;sp&lt;span class='p'&gt;,&lt;/span&gt; sex&lt;span class='p'&gt;),&lt;/span&gt; something_else&lt;span class='p'&gt;)&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Again this is an example of how one function can accept another function as an argument. Compare how concise this code is with the code a few paragraphs up which uses for loops to do the same thing. The plyr package has more functionality than this though and is worth spending some time looking at. Also worth checking out are the &lt;a href='http://cran.r-project.org/web/packages/foreach/index.html'&gt;foreach&lt;/a&gt; and &lt;a href='http://cran.r-project.org/web/packages/iterators/index.html'&gt;iterators&lt;/a&gt; packages which provide similar functionality.&lt;/p&gt;
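
&lt;p&gt;Passing functions around as objects is not specific to R. As a rough comparison, here is the same split-and-apply idea sketched in Ruby, the language used elsewhere on this blog. It is an illustration rather than a translation of plyr: the each_subset helper, the mean_width lambda and the handful of crab rows are all made up for the example.&lt;/p&gt;

```ruby
# Hypothetical rows mirroring a few columns of the crabs data.
CRABS = [
  { sp: 'B', sex: 'M', cw: 19.0 },
  { sp: 'B', sex: 'F', cw: 18.0 },
  { sp: 'O', sex: 'M', cw: 21.0 },
  { sp: 'O', sex: 'F', cw: 20.0 },
  { sp: 'O', sex: 'F', cw: 22.0 }
].freeze

# Split rows by the given columns and call the passed function on each subset.
def each_subset(rows, columns, function)
  rows.group_by { |row| row.values_at(*columns) }
      .transform_values { |subset| function.call(subset) }
end

# A function stored in a variable, just like any other object.
mean_width = lambda do |subset|
  subset.sum { |row| row[:cw] } / subset.size
end

puts each_subset(CRABS, %i[sp sex], mean_width).inspect
```

&lt;p&gt;As with the R version, the grouping code is written once and any function can be handed to it.&lt;/p&gt;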

&lt;h3 id='too_long_didnt_read_heres_a_short_video'&gt;Too long, didn&amp;#8217;t read? Here&amp;#8217;s a short video&lt;/h3&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/0FwXSgBzo_Q&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/0FwXSgBzo_Q&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects</feedburner:origLink></item><item><title>Keyboard, Command Line, And Text Files</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/9z-UlX0Dzx0/keyboard%2C-command-line%2C-and-text-files</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Tue, 08 Sep 2009 16:00:00 PDT</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/tools/keyboard%2C-command-line%2C-and-text-files</guid><description>&lt;p&gt;Here I&amp;#8217;m outlining an approach to bioinformatics focusing on using three tools: the keyboard to enter commands, the wide range of functions available at the command line, and storing data in easily searched and manipulated text files. The combination of these three can make bioinformatics analysis faster than using graphical user interfaces (GUIs) because typing into the command line is faster than clicking buttons and selecting menus with a mouse. A series of text commands is also easier to reproduce and automate than remembering a sequence of GUI commands.&lt;/p&gt;

&lt;h3 id='the_keyboard'&gt;The Keyboard&lt;/h3&gt;

&lt;p&gt;The best reason for using the keyboard is that it&amp;#8217;s faster to type a command than to do the same thing with a mouse and GUI. If you&amp;#8217;re already using the keyboard to write code, then the more you can use it for other actions, such as saving the current document, the less your &amp;#8220;working flow&amp;#8221; is broken by reaching for the mouse. Finally, once you know the keyboard shortcuts to do your work, learning to touch type will also make you work faster.&lt;/p&gt;

&lt;h3 id='the_command_line'&gt;The Command Line&lt;/h3&gt;

&lt;p&gt;The command line gives predictable and reproducible results, unlike a graphical interface. A workflow in the command line is easier for someone else to test and execute than a list of screenshots of which buttons and menu items to click.&lt;/p&gt;

&lt;p&gt;The Linux command line has a wide variety of tools, ranging from interacting with the operating system to manipulating entries in a text file. Many bioinformatics applications can also be run at the command line, and the benefit of using the command line version is that sequences of commands can be chained together. The output of one command becomes the input of the next command to filter and parse the results. In this way sequences of simple commands can be combined to perform more complex bioinformatics analysis. As these tasks do not require a user with a mouse, they can also be performed simultaneously across multiple machines or scheduled to run overnight.&lt;/p&gt;
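
&lt;p&gt;The chaining idea, where each step consumes the output of the previous one, can be sketched in a few lines of Ruby. This is an analogy to a shell pipeline rather than a replacement for one: the CSV text, its column names and the helpers parse_csv, grep_rows and sort_by_column are assumptions made for the example, and the parser only handles simple unquoted fields.&lt;/p&gt;

```ruby
# Hypothetical CSV text; the column names and the three rows are for illustration.
AMINO_ACIDS_CSV = "name,carbon,sulfur\nglycine,2,0\ncysteine,3,1\nleucine,6,0\n".freeze

# Parse simple comma-separated text (no quoted fields) into an array of hashes.
def parse_csv(text)
  header, *lines = text.lines.map { |line| line.chomp }
  keys = header.split(',')
  lines.map { |line| keys.zip(line.split(',')).to_h }
end

# grep-like step: keep rows whose name matches a pattern.
def grep_rows(rows, pattern)
  rows.select { |row| row['name'].match?(pattern) }
end

# sort-like step: order rows by a numeric column.
def sort_by_column(rows, column)
  rows.sort_by { |row| Integer(row[column]) }
end

rows = parse_csv(AMINO_ACIDS_CSV)
puts grep_rows(rows, /cysteine/).inspect
puts sort_by_column(rows, 'carbon').map { |row| row['name'] }.join(' ')
```

&lt;p&gt;Each helper takes rows in and hands rows back, so the steps can be combined in any order, much as small command line tools are joined with pipes.&lt;/p&gt;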

&lt;h3 id='text_files'&gt;Text Files&lt;/h3&gt;

&lt;p&gt;Binary files in specific formats can usually only be searched and edited by the program that created them. Plain text files, however, are open to being searched or manipulated by any tool or program. Different text data formats such as CSV, JSON, YAML, and XML can therefore all be searched and edited using similar command line approaches. Combining the command line with text files makes it relatively simple to perform large searches or manipulations across many text-based files at once.&lt;/p&gt;
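
&lt;p&gt;As a rough illustration of why plain text is easy to work with, here is a minimal Ruby sketch that searches and sorts a few rows of CSV text. The table contents, the constant AMINO_ACID_CSV, and the method names are all hypothetical, and splitting on commas is a simplification that would not cope with quoted fields.&lt;/p&gt;

```ruby
# Hypothetical amino acid table held as plain CSV text: name, carbon atoms.
AMINO_ACID_CSV = [
  "name,carbons",
  "valine,5",
  "glycine,2",
  "tryptophan,11",
  "cysteine,3"
].join("\n")

# Split the text into a header row and data rows (naive comma splitting).
def parse_rows(csv_text)
  header, *rows = csv_text.lines.map { |line| line.chomp.split(",") }
  rows.map { |fields| header.zip(fields).to_h }
end

# Find rows whose name matches a pattern, much as grep would.
def search_by_name(rows, pattern)
  rows.select { |row| row["name"].match?(pattern) }
end

# Order rows by a numeric column, smallest first.
def sort_by_column(rows, column)
  rows.sort_by { |row| Integer(row[column]) }
end

rows = parse_rows(AMINO_ACID_CSV)
puts search_by_name(rows, /cysteine/).map { |row| row["name"] }.join(", ")
puts sort_by_column(rows, "carbons").map { |row| row["name"] }.join(", ")
```

&lt;p&gt;Because the data are plain text, nothing here depends on the program that produced the file.&lt;/p&gt;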

&lt;h3 id='managing_spelling_in_text_and_images'&gt;Managing spelling in text and images&lt;/h3&gt;

&lt;p&gt;In this video I&amp;#8217;m illustrating how the spelling of the word sulfur can be kept consistent across text and image files using the command line program &lt;a href='http://en.wikipedia.org/wiki/Sed'&gt;sed&lt;/a&gt;.&lt;/p&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/d0TkCdqekS0&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/d0TkCdqekS0&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;
&lt;h3 id='manipulating_data_files_on_the_command_line'&gt;Manipulating data files on the command line&lt;/h3&gt;

&lt;p&gt;In this video I&amp;#8217;m showing how a spreadsheet of amino acid data can be converted to &lt;a href='http://en.wikipedia.org/wiki/Comma-separated_values'&gt;comma separated value (CSV) format&lt;/a&gt;. The amino acid data is then searched using &lt;a href='http://en.wikipedia.org/wiki/Grep'&gt;grep&lt;/a&gt; to find the entry for cysteine. &lt;a href='http://en.wikipedia.org/wiki/AWK'&gt;AWK&lt;/a&gt; is used to sort by carbon content. Finally I use the &lt;a href='http://en.wikipedia.org/wiki/R_%28programming_language%29'&gt;R language&lt;/a&gt; to create histograms and x/y plots of the data.&lt;/p&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/tUuRBIZVOpY&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/tUuRBIZVOpY&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/BioinformaticsZen?a=9z-UlX0Dzx0:LMqoke1sss0:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/BioinformaticsZen?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/BioinformaticsZen?a=9z-UlX0Dzx0:LMqoke1sss0:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/BioinformaticsZen?i=9z-UlX0Dzx0:LMqoke1sss0:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/tools/keyboard%2C-command-line%2C-and-text-files</feedburner:origLink></item><item><title>Using A Database</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/1_tmljbcPz0/using_a_database</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Thu, 19 Feb 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/using_a_database</guid><description>&lt;p&gt;My best recommendation for any computational biologist is to learn to use a relational database with a corresponding object relational mapping system (ORM). This sounds complicated, but doesn&amp;#8217;t have to be. In bioinformatics, data are distributed as files. Supplementary data are available from journal websites, and a file is easy to attach to an email. The use of data files in programming, however, should be limited wherever possible. A bioinformatics project should instead be built using a database.&lt;/p&gt;

&lt;p&gt;Using a database allows all data to be accessed in the same way, whether in a script, at the command line, or through third-party database software. Databases are fast and optimised for searching and joining datasets. Joins between two sets of data that would be difficult when merging two files are made much easier using database relationships.&lt;/p&gt;

&lt;p&gt;A simple database workflow first loads all data into the database. Each file usually becomes a table in the database, where each file row is a table row. Analytical scripts make database calls to pull and join different data sets together. Adding indices to a database further increases the speed at which joins are made and data searched.&lt;/p&gt;

&lt;p&gt;In contrast, using files as the base of the project results in errors when file paths change. Scripts need rewriting if a file format is altered. If the data file has a missing bracket or comma, the script reading it will throw an exception and break. The worst thing about using flat files, though, is that they must be parsed and joined at the start of each script. This is repetitive and leads to code duplication across scripts.&lt;/p&gt;

&lt;h3 id='sql_is_hard'&gt;SQL is hard&lt;/h3&gt;

&lt;p&gt;What I haven&amp;#8217;t mentioned is that learning to use a database takes time. Understanding how to structure tables and the language to join them together requires effort. Furthermore, writing SQL join statements in scripts requires attaching strings together to create the SQL query, which is complex, hard to maintain, and produces ugly code.&lt;/p&gt;
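
&lt;p&gt;To make the point about attaching strings together concrete, here is a minimal Ruby sketch of building a join query by hand. The table and column names (genes, expression, gene_id, condition) and the method name are hypothetical, chosen only for illustration.&lt;/p&gt;

```ruby
# Build a join query by attaching strings together.
# Table and column names (genes, expression, gene_id, condition) are hypothetical.
def build_expression_query(gene_name, condition)
  "SELECT genes.name, expression.level " +
    "FROM genes " +
    "INNER JOIN expression ON expression.gene_id = genes.id " +
    "WHERE genes.name = '" + gene_name + "' " +
    "AND expression.condition = '" + condition + "'"
end

puts build_expression_query("CYS3", "heat_shock")
```

&lt;p&gt;Every space and quote mark has to be placed by hand, and a gene name that itself contains a quote mark would produce a broken query, which is part of why this style is hard to maintain.&lt;/p&gt;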

&lt;p&gt;Using object relational mapping (ORM) makes using a database easy and code simpler to write. The phrase &amp;#8220;object relational mapping&amp;#8221; is jargon for what allows database tables and rows to be treated as in-code variables. Instead of creating verbose SQL statements or reading to the required line in a file, the required data are called in the familiar programming syntax of the language you are used to. This combines the best of efficient data storage, with the language you are skilled in.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=imHbX9wn"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?d=41" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=sMJm4tzY"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?i=sMJm4tzY" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/using_a_database</feedburner:origLink></item><item><title>Scripting</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/viTVCIChycQ/scripting</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Thu, 19 Feb 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/scripting</guid><description>&lt;p&gt;Scripts differentiate computational research from software production. A script is a file of code with a specific purpose such as running a BLAST search on the &lt;em&gt;E. coli&lt;/em&gt; genome. Contrast this with much larger programs designed to manage a variety of inputs and commands. A bioinformatician uses scripts as research tools in the same way a laboratory biologist uses a pipette. In software development, scripting supplements the designing of a software product. The focus is the finished product, and scripts are there to make source code management or unit testing easier. Since scripts receive comparatively less attention as a part of software design, is there a best practice for using scripts?&lt;/p&gt;

&lt;h3 id='managing_dependency'&gt;Managing dependency&lt;/h3&gt;

&lt;p&gt;Scripts are often required to run in a specific order. One script produces a result which is the input to the next script. This means the second script is dependent on the first. Dependency in software equates to increased complexity and requires more work to maintain a project. For example, if there is an undetected bug in one script, mistakes are propagated as the next scripts are run. Or if one script in a series is missed, and the output files of a previous iteration still remain, then datasets are mixed between workflow repetitions, resulting in unexpected side effects.&lt;/p&gt;

&lt;p&gt;Removing the dependencies between workflow steps is difficult. Build tools such as &lt;a href='http://rake.rubyforge.org/'&gt;Rake&lt;/a&gt;, &lt;a href='http://ant.apache.org/'&gt;Ant&lt;/a&gt;, and &lt;a href='http://www.gnu.org/software/make/'&gt;make&lt;/a&gt; allow dependencies between scripted steps to be formalised: the required previous steps are automatically run first. This is useful for enforcing the requirement that all previous results are deleted beforehand, &lt;a href='http://github.com/jandot/biorake/tree/master'&gt;or that all rows in the database are refreshed&lt;/a&gt;, or even that the entire analysis is repeated from scratch. &lt;a href='http://www.capify.org/'&gt;Capistrano&lt;/a&gt; is a variant where build files can be used to automate repetitive tasks across multiple remote computers. All of this can be managed using single command line calls.&lt;/p&gt;
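
&lt;p&gt;The idea of formalised dependencies can be sketched in a few lines of plain Ruby. This is not how Rake, Ant, or make are implemented; it is only a toy illustration, and the step names and the run_step method are hypothetical. It also assumes the workflow contains no circular dependencies.&lt;/p&gt;

```ruby
# Hypothetical workflow: each step lists the steps that must run before it.
WORKFLOW = {
  clean_results: [],
  load_database: [:clean_results],
  run_blast:     [:load_database],
  plot_results:  [:run_blast, :load_database]
}

# Run a step after its prerequisites, running each step only once.
# Returns the list of steps in the order they were run.
def run_step(name, workflow, completed = [])
  return completed if completed.include?(name)
  workflow.fetch(name).each do |prerequisite|
    run_step(prerequisite, workflow, completed)
  end
  puts "Running #{name}"
  completed.push(name)
end

run_step(:plot_results, WORKFLOW)
```

&lt;p&gt;Asking for the last step is enough: the earlier steps are worked out from the declared prerequisites rather than remembered by the person running the analysis.&lt;/p&gt;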

&lt;h3 id='light_and_fluffy'&gt;Light and fluffy&lt;/h3&gt;

&lt;p&gt;Light and simple scripts are easier to maintain. To simplify, a script reads in a set of input data (such as a protein sequence), analyses the data (formatdb on a sequence database followed by BLAST), and then returns the results (prints them to the command line). Using this simplification, a script only needs to know what data is coming in, how to analyse the data, and how to return the results.&lt;/p&gt;

&lt;p&gt;Scripts can be made lighter by reducing the amount of analytical code. Instead of writing the code to call and parse BLAST, use existing code such as in &lt;a href='http://www.bioperl.org/wiki/Main_Page'&gt;BioPerl&lt;/a&gt;. If the code you need doesn&amp;#8217;t exist anywhere else, consider writing it as a separate library which can be shared across all your scripts. A script that reads in the data, calls an external library, then prints the results will be smaller and simpler to understand. Contrast this with a script that reads in data, formats the data, has several lines of code to interpret and massage the results, then writes output.&lt;/p&gt;
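
&lt;p&gt;A minimal sketch of this separation, assuming a hypothetical SequenceStats library with a gc_fraction method. In a real project the module would live in its own file and be loaded by each script; here both parts are shown together so the example runs on its own.&lt;/p&gt;

```ruby
# "Library" code: analysis kept apart so many scripts can share it.
# SequenceStats and gc_fraction are hypothetical names for this sketch.
module SequenceStats
  # Fraction of bases in the sequence that are G or C.
  def self.gc_fraction(dna_sequence)
    return 0.0 if dna_sequence.empty?
    dna_sequence.upcase.count("GC").fdiv(dna_sequence.length)
  end
end

# "Script" code: read the data, call the library, print the result.
dna_sequence = "ATGCGC" # stands in for data read from a file
puts SequenceStats.gc_fraction(dna_sequence).round(2)
```

&lt;p&gt;The script part is three lines long and contains no analytical logic of its own.&lt;/p&gt;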

&lt;p&gt;Keeping scripts light and simple, and formalising dependencies, makes script-based projects easier to manage, maintain, and repeat.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=kIUlDVlv"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?d=41" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=EIwPXL38"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?i=EIwPXL38" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/scripting</feedburner:origLink></item><item><title>Why Write Good Software</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/451onGYHJvQ/why_write_good_software</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Mon, 02 Feb 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/why_write_good_software</guid><description>&lt;p&gt;Bioinformatics is far from commercial software development. A bioinformatician&amp;#8217;s goal is developing novel scientific research or tools. A software developer is judged on delivering software that people will pay to use. A biologist, whether they use Perl, a pipette or both, is evaluated on their publication record.&lt;/p&gt;

&lt;p&gt;In bioinformatics, or any science using a computer, software development is a lesser priority than generating new data. Statistical tests for significance outweigh software testing for reliability. A series of Python scripts for interpreting ChIP-chip data are a bioinformatician&amp;#8217;s tools; what is important is the publishable prediction of binding sites.&lt;/p&gt;

&lt;p&gt;Compare this with commercial software development, for example the development of an online hotel booking system. The developer talks to the hotel to understand the job. A good developer keeps regular meetings with the hotel, to update the project based on the customer&amp;#8217;s requirements. The developer maintains the code using common development practices: &lt;a href='http://en.wikipedia.org/wiki/Unit_testing'&gt;unit testing&lt;/a&gt;, &lt;a href='http://en.wikipedia.org/wiki/Build_Automation'&gt;automated building&lt;/a&gt;, and &lt;a href='http://en.wikipedia.org/wiki/Revision_control'&gt;source control&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The situation in bioinformatics is different; a hypothesis is made, implemented, and tested. There are no best practices. Methods of research range from a directory of BLAST results with an Excel spreadsheet, to a full application stack with a database backend, revision control, and unit tests. The choice depends on the bioinformatician&amp;#8217;s knowledge and experience.&lt;/p&gt;

&lt;p&gt;Is good software important for bioinformatics? Either end of the above scenario is rare, and a middle approach is a set of flat files, Perl scripts to parse out required rows, with R scripts to plot the results. If the tools work does the method matter?&lt;/p&gt;

&lt;p&gt;Receiving peer review on a manuscript is comparable to getting feedback on delivering a product to a customer. Instead of new feature requests, changes are required as new analyses or the addition of a new data set. Feedback from reviewers is only received after months of work, when the software is developed and mature. The same principles that apply for commercial software can apply for scientific software. Investing 10% extra time in developing versatile and maintainable code saves time later when large changes are required. Using version control is a safety net for making changes to existing code. Unit testing ensures fewer bugs. Automated building makes execution of linear tasks easier. A database enables easier manipulation of large complex data.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=80fvqLMQ"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?d=41" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=6iJ8yf2a"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?i=6iJ8yf2a" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/why_write_good_software</feedburner:origLink></item><item><title>Reuse, Contribute, Create</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/FADBzlQC4RU/reuse,_contribute,_create</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Sun, 25 Jan 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/reuse,_contribute,_create</guid><description>&lt;h3 id='use_existing_code'&gt;Use existing code&lt;/h3&gt;

&lt;p&gt;A quick way to get something done is using code that someone else has already written. Many languages have a bioinformatics specific library, such as &lt;a href='http://www.bioperl.org/wiki/Main_Page'&gt;BioPerl&lt;/a&gt;, &lt;a href='http://biojava.org/wiki/Main_Page'&gt;BioJava&lt;/a&gt;, &lt;a href='http://www.biopython.org/wiki/Documentation'&gt;BioPython&lt;/a&gt;, or &lt;a href='http://bioruby.open-bio.org/'&gt;BioRuby&lt;/a&gt;. These libraries have functions for many common tasks, such as reading Fasta files, or parsing BLAST results.&lt;/p&gt;

&lt;p&gt;One reason to use an existing library over writing your own code is saving time. These libraries are also mature and tested, which means the chance of a bug is much lower. If you&amp;#8217;re unable to do something in particular and can&amp;#8217;t find an answer in the documentation, asking a question on the mailing list will usually result in a suggestion of where to look.&lt;/p&gt;

&lt;h3 id='contribute_to_existing_code'&gt;Contribute to existing code&lt;/h3&gt;

&lt;p&gt;The more specific your requirement, the less likely it is that an existing solution is available. In this case you&amp;#8217;ll need to create the necessary fix yourself. After coding something up, being a generous person you will want to contribute the code to a bioinformatics library. This might mean a little work, but by contributing you can save time for other people with the same problem.&lt;/p&gt;

&lt;p&gt;Contributing code first requires getting the library source code using whatever version control system (VCS) the code is managed with. This can be difficult if you&amp;#8217;ve never used a VCS before, but is a good chance to learn. Once you&amp;#8217;ve got the library you&amp;#8217;ll need to add your own code, as well as some documentation, and usually a few tests. After this you&amp;#8217;ll need to send your update back via the VCS, or submit a patch to the mailing list.&lt;/p&gt;

&lt;h3 id='create_new_code'&gt;Create new code&lt;/h3&gt;

&lt;p&gt;Creating a new library should be a last resort, but sometimes the function you want doesn&amp;#8217;t fit with any existing libraries. Why is creating a new library a last resort? Because it takes more work than adding to an existing library. Having said that, creating a new library does have benefits, as packaging your code makes it easier to maintain and use across your own projects. Taking the time to share your code with other people also makes you a good person.&lt;/p&gt;

&lt;p&gt;Smaller and simpler is better when creating a new library. The simpler the library, the easier it is for others to use, and for you to maintain. Use a version control system to keep track of changes to the code; &lt;a href='http://git-scm.com/'&gt;git&lt;/a&gt; is a good choice. Document the code, and create some web pages highlighting how the library is used. Develop unit tests so that you can make sure the functionality remains the same whenever the code changes. Make the code open source so that anyone else can contribute. Finally, host the library somewhere so that people can get access. There are usually specific resources for each language; for instance, for Ruby there is &lt;a href='http://rubyforge.org/'&gt;Rubyforge&lt;/a&gt; or &lt;a href='https://github.com/'&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This sounds like a lot of work, which is why simpler and lighter libraries are easier to maintain. The process of creating a new library is rewarding in itself, but also has practical benefits. Other people may like your library and decide to contribute or fix any bugs. Therefore if you use the library regularly yourself, the investment in creating and maintaining a library will feed back into your own work.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=jzYXsg46"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?d=41" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=m0k479UN"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?i=m0k479UN" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/reuse,_contribute,_create</feedburner:origLink></item><item><title>Writing Good Code</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/klmha67-R0I/writing_good_code</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Sat, 24 Jan 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/software/writing_good_code</guid><description>&lt;p&gt;Writing good code makes life easier. If there&amp;#8217;s a common theme in bioinformatics, it is this: you will write a script, move on to something else, then return to the script in a few months&amp;#8217; or years&amp;#8217; time and try to remember how it works. The more clearly the code is originally written, the easier it is to remember how it works. Here is a quote: &lt;a href='http://www.artima.com/intv/dry.html'&gt;&amp;#8220;All programming is maintenance programming, because you are rarely writing original code&amp;#8221;&lt;/a&gt;. This means that most of your time will be spent fixing and improving code, rather than writing fresh. Writing code is personal, and discussing what makes good code is controversial. But I&amp;#8217;m going to do it anyway and describe what I think are a few basic principles that can help to make code easier to maintain.&lt;/p&gt;

&lt;h3 id='be_too_descriptive'&gt;Be too descriptive&lt;/h3&gt;

&lt;p&gt;I think code should err on the side of being too descriptive, rather than being too concise. I mean that code should be loud and expressive about its purpose. An example is choosing variable names.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;    &lt;span class='c1'&gt;# Concise&lt;/span&gt;
    &lt;span class='n'&gt;seq&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='no'&gt;File&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;read&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='s1'&gt;&amp;#39;gene.fasta&amp;#39;&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;
    
    &lt;span class='c1'&gt;# Descriptive&lt;/span&gt;
    &lt;span class='n'&gt;fasta_gene_sequence&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='no'&gt;File&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;read&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='s1'&gt;&amp;#39;gene.fasta&amp;#39;&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The second example is longer, but leaves no doubt as to what the variable contains. The same can be applied to method names. The more specific a method name, the easier it is to remember what the method does and what it returns.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;    &lt;span class='c1'&gt;# Concise&lt;/span&gt;
    &lt;span class='k'&gt;def&lt;/span&gt; &lt;span class='nf'&gt;get_seq&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='n'&gt;file&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;
      &lt;span class='c1'&gt;# ...&lt;/span&gt;
    &lt;span class='k'&gt;end&lt;/span&gt;

    &lt;span class='c1'&gt;# Descriptive&lt;/span&gt;
    &lt;span class='k'&gt;def&lt;/span&gt; &lt;span class='nf'&gt;read_fasta_from&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='n'&gt;file&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;
      &lt;span class='c1'&gt;# ...&lt;/span&gt;
    &lt;span class='k'&gt;end&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Next are magic numbers: numbers that appear in code, but have no explanation of their meaning. These can be particularly annoying if you can&amp;#8217;t remember why you used the number and there is no other reference to it.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;    &lt;span class='c1'&gt;# Three, it&amp;#39;s the magic number&lt;/span&gt;
    &lt;span class='n'&gt;dna_sequence&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;scan&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='sr'&gt;/.{3}/&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;

    &lt;span class='c1'&gt;# Descriptive&lt;/span&gt;
    &lt;span class='n'&gt;nucleotides_per_codon&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='mi'&gt;3&lt;/span&gt;
    &lt;span class='n'&gt;dna_sequence&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;scan&lt;/span&gt;&lt;span class='p'&gt;(&lt;/span&gt;&lt;span class='sr'&gt;/.{&lt;/span&gt;&lt;span class='si'&gt;#{&lt;/span&gt;&lt;span class='n'&gt;nucleotides_per_codon&lt;/span&gt;&lt;span class='si'&gt;}&lt;/span&gt;&lt;span class='sr'&gt;}/&lt;/span&gt;&lt;span class='p'&gt;)&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Comments never hurt either, as long as they are correct. Incorrect comments are generally not considered useful. Comments are especially useful when the meaning of the code is not obvious, but too much commenting can sometimes make code less easy to read.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;    &lt;span class='c1'&gt;# Why the chop?&lt;/span&gt;
    &lt;span class='n'&gt;protein&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;dna_sequence&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;translate&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;chop&lt;/span&gt;

    &lt;span class='c1'&gt;# Some of wikipedia in here...&lt;/span&gt;
    &lt;span class='c1'&gt;# In the genetic code, a stop codon (or termination &lt;/span&gt;
    &lt;span class='c1'&gt;# codon) is a nucleotide triplet within messenger RNA&lt;/span&gt;
    &lt;span class='c1'&gt;# that signals a termination of translation. Proteins &lt;/span&gt;
    &lt;span class='c1'&gt;# are unique sequences of amino acids, and most &lt;/span&gt;
    &lt;span class='c1'&gt;# codons in messenger RNA correspond to the addition&lt;/span&gt;
    &lt;span class='c1'&gt;# of an amino acid to a growing protein chain, stop&lt;/span&gt;
    &lt;span class='c1'&gt;# codons signal the termination of this process,&lt;/span&gt;
    &lt;span class='c1'&gt;# releasing the amino acid chain.&lt;/span&gt;
    &lt;span class='c1'&gt;# Here I am removing the stop codon after translation&lt;/span&gt;
    &lt;span class='n'&gt;protein&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;dna_sequence&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;translate&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;chop&lt;/span&gt;

    &lt;span class='c1'&gt;# Remove the stop codon after translating&lt;/span&gt;
    &lt;span class='n'&gt;protein&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;dna_sequence&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;translate&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='n'&gt;chop&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Try to follow the indentation guidelines for the language you&amp;#8217;re writing in. Indentation makes code easier to read for you and anyone you share the code with.&lt;/p&gt;

&lt;h3 id='dry'&gt;DRY&lt;/h3&gt;

&lt;p&gt;DRY means don&amp;#8217;t repeat yourself. Code for a single function should exist in a single place. When code needs fixing or maintaining, it only needs to be changed once, in the one place that it resides. In the short term it&amp;#8217;s tempting to copy and paste to save time, but this will be time consuming in the long term when debugging.&lt;/p&gt;

&lt;p&gt;For example, common code such as system-specific BLAST settings, used across a variety of scripts, can be kept in a single file. These can then be called by any script when required. By moving all the common code to a single file, if the BLAST settings change, the change is made in just one place.&lt;/p&gt;
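
&lt;p&gt;As a small sketch of this, the shared settings might be kept in one Ruby module. The module name BlastSettings, the method for_script, and the setting names and values are all hypothetical; in practice the module would sit in its own file and be loaded by each script.&lt;/p&gt;

```ruby
# One shared definition of the BLAST settings.
# The setting names and values here are hypothetical.
module BlastSettings
  DEFAULTS = { program: "blastp", database: "proteins.fasta", evalue: 1e-5 }.freeze

  # Return the shared settings, with optional per-script overrides.
  def self.for_script(overrides = {})
    DEFAULTS.merge(overrides)
  end
end

# Two scripts read the same settings instead of each keeping a copy.
puts BlastSettings.for_script.inspect
puts BlastSettings.for_script(evalue: 1e-10).inspect
```

&lt;p&gt;If the database file is renamed, only the DEFAULTS line needs editing, and every script picks up the change.&lt;/p&gt;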

&lt;h3 id='books_and_frameworks'&gt;Books and frameworks&lt;/h3&gt;

&lt;p&gt;When I used Java, Joshua Bloch&amp;#8217;s &lt;a href='http://java.sun.com/docs/books/effective/'&gt;Effective Java&lt;/a&gt; book helped me learn a great deal about how to programme well. When learning Ruby I found the &lt;a href='http://rubyhacker.com/'&gt;Ruby Way&lt;/a&gt; book had many useful examples of how to write in Ruby. I would guess that for any popular programming language there is a respected book that illustrates the best practices in the language. These are not the most useful if you&amp;#8217;re just starting to learn the language, but as you get more confident they are great for helping to write better, more maintainable code.&lt;/p&gt;

&lt;p&gt;In addition to good books, examples of the best practices in a language can be found in popular open source libraries. &lt;a href='http://rubyonrails.org/'&gt;Rails&lt;/a&gt; is a Ruby framework for creating dynamic websites. Knowing Rails will come in handy if I ever need to create an interactive website, but practising with Rails also gives an opinionated view of the best way to organise a Ruby project, from people who are experienced in creating them.&lt;/p&gt;

&lt;h3 id='further_reading'&gt;Further reading&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://seanskti.wordpress.com/2006/10/08/six-easy-tips-for-more-maintainable-code/'&gt;How to write maintainable code&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.joelonsoftware.com/articles/fog0000000043.html'&gt;Twelve steps to better code&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://particletree.com/features/successful-strategies-for-commenting-code/'&gt;Strategies for commenting code&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://sourcemaking.com/antipatterns/software-development-antipatterns'&gt;Common types of bad design&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.freevbcode.com/ShowCode.Asp?ID=2547'&gt;How to write unmaintainable code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=AZFahYeq"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?d=41" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/BioinformaticsZen?a=0EPWN1Me"&gt;&lt;img src="http://feeds.feedburner.com/~f/BioinformaticsZen?i=0EPWN1Me" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/software/writing_good_code</feedburner:origLink></item><item><title>Git</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/w2qBntmcpNQ/git</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Sat, 10 Jan 2009 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/tools/git</guid><description>&lt;h3 id='about'&gt;About&lt;/h3&gt;

&lt;p&gt;Git is a version control system, or VCS for short. A VCS helps you manage your code by saving changes as versions in a repository. Each version of any file can be retrieved by rolling back the changes to the required version. At its most basic, a VCS allows you the freedom to experiment and actively break the code you&amp;#8217;re working on, because the last working version can be restored with a single command. Version control is used in software development, and in bioinformatics is useful for keeping track of the scripts and libraries you use in development. Version control using an external server is also a good way to back up code.&lt;/p&gt;

&lt;h3 id='features'&gt;Features&lt;/h3&gt;

&lt;p&gt;As a VCS git is clean and minimal, working out of a single .git directory in the root of your project. If you want to remove the project from version control, delete the .git directory and all git files are gone. Git-managed repositories are small, using compression to store the differences between versions. Git is fast at storing the latest version of your code; even on a large repository it is almost instantaneous. Git repositories are simple to create, and don&amp;#8217;t necessarily require an external server to begin tracking versions. If you do use an external git server, pushing and pulling to the server is also very fast. Another feature of git allows you to create branches within your code repository. Branching means copying the code as a duplicate branch of the main &amp;#8220;master&amp;#8221; branch. The duplicate branch can be modified, committed to, and then compared with the original branch. If you are happy with the changes in the new branch you can merge them back into the original master branch. Another option is to leave the alternate branch to work on later, since switching back to the master branch will restore the previous state before branching. In this way using branches is a simple and lightweight way to develop or experiment with new features.&lt;/p&gt;

&lt;h3 id='collaboration'&gt;Collaboration&lt;/h3&gt;

&lt;p&gt;Git is useful for collaborating on shared source code repositories. The collaborative development of the Linux kernel is the reason git was &lt;a href='http://en.wikipedia.org/wiki/Git_(software)#Early_history'&gt;created by Linus Torvalds&lt;/a&gt;. A key feature of git is that it is distributed: you are not bound to working from a single central server. I have my copy of the repository and you have yours. If I like the changes you are making, I can fetch your repository as a branch into my own and test your changes before merging them into my master branch. If I only want a subset of the changes you&amp;#8217;ve made, I can use the git cherry-pick command to apply only the commits I want. The website &lt;a href='http://www.github.com'&gt;github.com&lt;/a&gt; adds a collaborative layer to developing software with git. Github acts as a git server but also highlights the social links of branches between developers. Other developers&amp;#8217; git repositories can be viewed and downloaded, but also forked into your own github space. A fork is a copy of the original repository in which the relationship between the two repositories is maintained. Github tracks the commits, merges and branches between repositories, which can be viewed, compared, or visualised as a network.&lt;/p&gt;

&lt;h3 id='getting_started'&gt;Getting started&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://git-scm.com/'&gt;The git website&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.gitcasts.com/posts/railsconf-git-talk'&gt;Video introduction to git&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://github.com/blog/120-new-to-git'&gt;Links for git beginners on Github&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://mendicantbug.com/2008/11/30/10-reasons-to-use-git-for-research/'&gt;Using git for research&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www-cs-students.stanford.edu/~blynn/gitmagic/'&gt;Extensive git guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/tools/git</feedburner:origLink></item><item><title>Latex</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/Xza6BlhPOwg/latex</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Tue, 23 Dec 2008 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/tools/latex</guid><description>&lt;p&gt;LaTeX (pronounced lay-tek) is a document preparation system aimed at scientific and technical documents. LaTeX documents are written in plain text using markup to describe which parts should be sections, tables or pictures. The LaTeX system parses the markup and formats the text to produce DVI, PostScript or PDF output. As LaTeX is entirely text based, the content can be tracked using a version control system. Plain text files mean that you can work on a document with your favourite editor, and manipulate it at the command line using Unix tools. The syntax of LaTeX markup takes an hour or two of practice to learn, but the advantage of creating documents from a marked-up source is that the results are consistent and reproducible, which isn&amp;#8217;t always the case for graphical document editors.&lt;/p&gt;

&lt;h3 id='features'&gt;Features&lt;/h3&gt;

&lt;p&gt;The main reason for using LaTeX is that it allows you to work on the content of the document, not the formatting. With a graphical editor you format the text as you type, but with LaTeX you only have to add the markup to the document and LaTeX takes care of the rest. This can save a lot of time with large documents. The basic features of LaTeX include automatic generation of tables of contents and lists of figures, and automatic numbering of sections, tables and figures. BibTeX is the companion to LaTeX which handles the organisation and insertion of citations. Citations are added to documents using a simple &amp;#8220;cite&amp;#8221; command in the text, without the requirement for third party software. One of the benefits of creating documents using LaTeX is that the formatting it produces follows best practices in typography and document presentation, which means LaTeX documents tend to look better than average.&lt;/p&gt;

&lt;h3 id='templates_and_plugins'&gt;Templates and Plugins&lt;/h3&gt;

&lt;p&gt;LaTeX is free software and available for most operating systems. There is a large LaTeX community which develops themes and modules that can be added to LaTeX documents. Many journals also provide LaTeX templates in which papers can be submitted. There are templates available for writing a &lt;a href='http://bit.ly/lBZs'&gt;thesis or dissertation&lt;/a&gt;, and there is likely a specific template which follows your own institution&amp;#8217;s guidelines. There are many useful third party plugins for adding extras to a document. For example &lt;a href='http://www.ctan.org/tex-archive/macros/latex/contrib/booktabs/'&gt;beautiful formatting of tables&lt;/a&gt;, &lt;a href='http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/'&gt;grouping figures into subfigures&lt;/a&gt;, &lt;a href='http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/'&gt;replacing text inside figures&lt;/a&gt; and even &lt;a href='http://www.stat.uni-muenchen.de/~leisch/Sweave/'&gt;a framework for including R-code inside a LaTeX document&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id='creating_a_latex_document'&gt;Creating a LaTeX document&lt;/h3&gt;

&lt;p&gt;This video illustrates how to create a simple LaTeX document with text and headings. An overview of LaTeX document structure &lt;a href='http://www.andy-roberts.net/misc/latex/latextutorial2.html'&gt;is outlined by Andrew Roberts&lt;/a&gt;.&lt;/p&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/PF1hFaoWEY4&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/PF1hFaoWEY4&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;
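&lt;p&gt;As a rough sketch of a simple document with text and headings, assuming nothing beyond the standard article class, a source file might look like the following. The file name, title, author and section names are placeholders and are not taken from the video.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% example.tex: a minimal document with a title and headings
\documentclass{article}

% Placeholder metadata, printed by \maketitle
\title{A simple document}
\author{A. N. Author}

\begin{document}

\maketitle

% Built automatically from the headings below
\tableofcontents

\section{Introduction}
Some introductory text.

\subsection{Background}
Sections and subsections are numbered automatically.

\section{Methods}
More text goes here.

\end{document}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running pdflatex on this file twice should produce a PDF; the second run is needed so that the table of contents picks up the headings recorded during the first.&lt;/p&gt;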
&lt;h3 id='creating_figures_in_latex'&gt;Creating figures in LaTeX&lt;/h3&gt;

&lt;p&gt;This screencast illustrates how to add a figure to a LaTeX document. The figure size is changed and a list of figures is added to the document. Andrew Roberts&amp;#8217; LaTeX site shows &lt;a href='http://www.andy-roberts.net/misc/latex/latextutorial5.html'&gt;different examples of LaTeX figure settings&lt;/a&gt;.&lt;/p&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/GXmmS8N_s0o&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/GXmmS8N_s0o&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;
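&lt;p&gt;A minimal sketch of adding a figure, assuming the graphicx package and an image file called plot.pdf in the same directory as the source. The file name, caption and label are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;\documentclass{article}
\usepackage{graphicx} % provides \includegraphics

\begin{document}

% Built automatically from the figure captions
\listoffigures

\begin{figure}
  \centering
  % plot.pdf is a hypothetical file; width sets the figure size
  % as a fraction of the text width
  \includegraphics[width=0.5\textwidth]{plot}
  \caption{An example plot.}
  \label{fig:plot}
\end{figure}

\end{document}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Changing the width option is one way to resize the figure without editing the image itself.&lt;/p&gt;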
&lt;h3 id='creating_tables_in_latex'&gt;Creating tables in LaTeX&lt;/h3&gt;

&lt;p&gt;This screencast illustrates adding a simple table to a LaTeX document. The table is then formatted to look more &amp;#8220;professional&amp;#8221; using the &lt;a href='http://www.ctan.org/tex-archive/macros/latex/contrib/booktabs/'&gt;booktabs package&lt;/a&gt;. More information on tables in LaTeX can be found at &lt;a href='http://www.andy-roberts.net/misc/latex/latextutorial4.html'&gt;Andrew Roberts&amp;#8217; website&lt;/a&gt; and on the &lt;a href='http://en.wikibooks.org/wiki/LaTeX/Tables'&gt;Wikibooks website&lt;/a&gt;.&lt;/p&gt;
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/9Rh77LBJIDc&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/9Rh77LBJIDc&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;
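&lt;p&gt;A minimal sketch of a table formatted with the booktabs package. The column names and values are placeholders for illustration only, not real data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;\documentclass{article}
\usepackage{booktabs} % provides \toprule, \midrule and \bottomrule

\begin{document}

\begin{table}
  \centering
  \caption{Placeholder values for illustration only.}
  \label{tab:example}
  % One left-aligned column followed by two right-aligned columns
  \begin{tabular}{lrr}
    \toprule
    Sample &amp;amp; Reads &amp;amp; Length \\
    \midrule
    A &amp;amp; 100 &amp;amp; 36 \\
    B &amp;amp; 250 &amp;amp; 36 \\
    \bottomrule
  \end{tabular}
\end{table}

\end{document}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The booktabs rules replace the usual \hline commands and the table uses no vertical lines, which is most of what gives it a cleaner look.&lt;/p&gt;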
&lt;h3 id='adding_references_in_latex'&gt;Adding references in LaTeX&lt;/h3&gt;

&lt;p&gt;This screencast shows how to add dynamic references to the text in a document. This includes automatically adding references to tables and figures. The second half of the video illustrates how to add a bibliography to a LaTeX document, and how to cite articles in the text. More information about citations can be found on &lt;a href='http://www.andy-roberts.net/misc/latex/latextutorial3.html'&gt;Andrew Roberts&amp;#8217; website&lt;/a&gt;.&lt;/p&gt;
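&lt;p&gt;Cross-references and citations can be sketched as follows, assuming a BibTeX file called references.bib sits next to the source. The file names, labels and the bibliography entry are all placeholders, not a real reference.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% refs.tex: a cross-reference and a BibTeX citation
% Assumes references.bib contains a placeholder entry such as:
%   @article{placeholder2008,
%     author  = {A. N. Author},
%     title   = {A placeholder title},
%     journal = {A Placeholder Journal},
%     year    = {2008}
%   }
\documentclass{article}

\begin{document}

\section{Methods}
\label{sec:methods} % name used to refer back to this section

Some text describing the methods.

\section{Results}

% \ref is replaced by the section number, \cite by the citation
As described in Section~\ref{sec:methods}, the number is filled
in automatically. A citation looks like this~\cite{placeholder2008}.

\bibliographystyle{plain}
\bibliography{references} % reads references.bib

\end{document}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The usual build sequence is pdflatex, then bibtex, then pdflatex twice more, so that the citation and the section number are both resolved.&lt;/p&gt;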
&lt;object height='385' width='640'&gt;&lt;param name='movie' value='http://www.youtube.com/v/jvh_2EQ1iwM&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' /&gt;&lt;param name='allowFullScreen' value='true' /&gt;&lt;param name='allowscriptaccess' value='always' /&gt;&lt;embed src='http://www.youtube.com/v/jvh_2EQ1iwM&amp;amp;hl=en&amp;amp;fs=1&amp;amp;hd=1' type='application/x-shockwave-flash' allowfullscreen='true' allowscriptaccess='always' height='385' width='640' /&gt;&lt;/object&gt;&lt;p /&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/tools/latex</feedburner:origLink></item><item><title>Vim</title><link>http://feedproxy.google.com/~r/BioinformaticsZen/~3/JjvtyD7syuk/vim</link><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Barton</dc:creator><pubDate>Mon, 22 Dec 2008 16:00:00 PST</pubDate><guid isPermaLink="false">http://www.bioinformaticszen.com/tools/vim</guid><description>&lt;h3 id='about'&gt;About&lt;/h3&gt;

&lt;p&gt;Doing bioinformatics, I only use two tools: the keyboard and the mouse. Since I use these tools all day, every day, I want to use them efficiently. My opinion is that using the keyboard as much as possible and the mouse as little as possible is the best way to work at a computer. As a simple example, knowing the keyboard shortcut for a given command means that I can execute the command quickly without breaking the flow of typing. The text editor Vim takes this to another level and is entirely keyboard driven. There are no drop down menus and everything is performed using the keyboard; there&amp;#8217;s no reason for my hands to leave the keyboard. The large number of commands in Vim means a fair amount of practice is required before you can use it fluently. I think this practice is a great investment though, as being able to use Vim intuitively makes you work faster and more efficiently. This is because Vim has a huge range of functionality that can be reached with just a few easily remembered keystrokes.&lt;/p&gt;

&lt;h3 id='features'&gt;Features&lt;/h3&gt;

&lt;p&gt;As Vim is entirely text-based, without the pretty interface of modern editors, my original opinion was that it was an archaic hangover from the early days of Unix. Vim is around twenty years old, but is still a sophisticated text editor with a large range of functions. The large range of Vim&amp;#8217;s functionality means a steep learning curve, but the extensive help documentation is an eloquent and gentle introduction. The shell command &amp;#8220;vimtutor&amp;#8221; and the Vim command &amp;#8220;:help&amp;#8221; are the places to get started with Vim. Vim&amp;#8217;s greatest feature is how easy it is to move, edit and manipulate text. This can sound trivial, but reordering paragraphs, which takes minutes with a mouse, takes just seconds using the keyboard with Vim. This applies to any type of text file, such as Ruby or LaTeX source code, which is what I spend most of my time editing. Another Vim feature is its registers, which act as super-charged clipboards. Not only can text be stored for pasting, but sequences of commands as well. Stored commands can then be replayed by calling the register, which saves performing repetitive actions by hand. Typing &amp;#8220;:help usr_10&amp;#8221; is a good place to start for learning about using registers for commands, as well as other ways of making large changes quickly.&lt;/p&gt;

&lt;h3 id='customise'&gt;Customise&lt;/h3&gt;

&lt;p&gt;One argument against Vim is that it lacks the code-oriented features of integrated development environments (IDEs) such as &lt;a href='http://www.eclipse.org/'&gt;Eclipse&lt;/a&gt; and &lt;a href='http://www.netbeans.org/'&gt;Netbeans&lt;/a&gt;. But it can have them. Vim is easy to customise, which has led to a community of developers creating a large number of plugins. &lt;a href='http://www.vim.org/scripts/script.php?script_id=1318'&gt;Code snippets&lt;/a&gt;, &lt;a href='http://www.vim.org/scripts/script.php?script_id=1658'&gt;project drawers&lt;/a&gt;, and &lt;a href='http://www.vim.org/scripts/script.php?script_id=1984'&gt;fuzzy file finding&lt;/a&gt; are just some examples of plugins aimed at using Vim as an IDE. Whatever &lt;a href='http://vim-latex.sourceforge.net/'&gt;language&lt;/a&gt;, &lt;a href='http://www.infynity.spodzone.com/vim/HTML/'&gt;file type&lt;/a&gt;, or &lt;a href='http://www.vim.org/scripts/script.php?script_id=1567'&gt;framework&lt;/a&gt; you use, someone has probably written a Vim plugin for it. This gives Vim the functionality to rival an IDE.&lt;/p&gt;</description><feedburner:origLink>http://www.bioinformaticszen.com/tools/vim</feedburner:origLink></item></channel></rss>
