<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7099891672772539383</id><updated>2024-11-05T20:47:22.035-06:00</updated><category term="Bioinformatics"/><category term="Linux"/><category term="Tutorials"/><category term="R"/><category term="Stochastic messages"/><category term="Perl"/><category term="Bash"/><category term="Debian"/><category term="Random"/><category term="Machine Learning"/><category term="Markov Chains"/><category term="Sequence analysis"/><category term="Game-Theory"/><category term="HMM"/><category term="LaTeX"/><category term="Octave"/><title type='text'>Computational Biology Blog in fasta format</title><subtitle type='html'>&amp;gt;Computational Biology Blog in fasta format</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default?start-index=26&amp;max-results=25'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>49</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-4563229489956399144</id><published>2014-09-15T13:54:00.001-05:00</published><updated>2014-09-15T13:54:21.343-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Markov Chains"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>If the typing monkeys have met Mr Markov: probabilities of spelling &quot;omglolbbq&quot; after the digitial monkeys have read Dracula</title><content type='html'>On the weekend, randomly after watching Catching Fire, I remember the problem of the typing monkeys (Infinite monkey theorem) in which basically could be defined as (&lt;a href=&quot;https://en.wikipedia.org/wiki/Infinite_monkey_theorem&quot;&gt;Thanks to Wiki&lt;/a&gt;):&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# &amp;nbsp;INTRODUCTION&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare.&lt;br /&gt;
&lt;br /&gt;
There is a straightforward proof of this theorem. As an introduction, recall that if two events are statistically independent, then the probability of both happening equals the product of the probabilities of each one happening independently. For example, if the chance of rain in Moscow on a particular day in the future is 0.4 and the chance of an earthquake in San Francisco on that same day is 0.00003, then the chance of both happening on that day is 0.4 * 0.00003 = 0.000012, assuming that they are indeed independent.&lt;br /&gt;
&lt;br /&gt;
Suppose the typewriter has 50 keys, and the word to be typed is banana. If the keys are pressed randomly and independently, it means that each key has an equal chance of being pressed. Then, the chance that the first letter typed is &#39;b&#39; is 1/50, and the chance that the second letter typed is a is also 1/50, and so on. Therefore, the chance of the first six letters spelling banana is&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzoask01voUyYr-jGrqiXcJD4C1etrF6-ZdbGvz6VJEoMWAYXQesvykudTpcRbj89lMtk5kS25qCAmbKK23swP284Esqnm2WLHmyDutxac8oSmwZemsIfJBIudn9uLVfYjyA6fhHWr-gw/s1600/prob1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzoask01voUyYr-jGrqiXcJD4C1etrF6-ZdbGvz6VJEoMWAYXQesvykudTpcRbj89lMtk5kS25qCAmbKK23swP284Esqnm2WLHmyDutxac8oSmwZemsIfJBIudn9uLVfYjyA6fhHWr-gw/s1600/prob1.png&quot; height=&quot;67&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
less than one in 15 billion, but not zero, hence a possible outcome.&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# &amp;nbsp;METHODS&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In my implementation, I will only consider 26 characters of the alphabet (from a to z, excluding the whitespace). The real question I would like to ask is the following:&lt;br /&gt;
&lt;br /&gt;
Given a target word, say &quot;banana&quot;, how many monkeys would be needed to have at least one successful event (a monkey typed the target) after the monkeys have typed 6 characters.&lt;br /&gt;
&lt;br /&gt;
To solve this, first calculate the probability of typing the word banana:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZHPQOLF_Cr8A2um9PWf0IGogCYyPb_1h-5_PVoqK8KaXdyPNbnYKOBffUcDMSDN3TLqsUuWvgcInwuQino0VsmfD_wx5_Al5q_J5lRuiAUC_eyyF0xMsOkvcOaNzNaFCP7PCFpnULQuU/s1600/prob2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZHPQOLF_Cr8A2um9PWf0IGogCYyPb_1h-5_PVoqK8KaXdyPNbnYKOBffUcDMSDN3TLqsUuWvgcInwuQino0VsmfD_wx5_Al5q_J5lRuiAUC_eyyF0xMsOkvcOaNzNaFCP7PCFpnULQuU/s1600/prob2.png&quot; height=&quot;93&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, just compute the number of monkeys that might be needed:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnuHXlFbAV0u-J5q1-jqrxvG6OjFIXNyfusINtPO6Xhhyphenhyphen9NiozsVFrN5d_mDFin6Y9aHr3P-PiJULJNhUvA6IP4lHHzzWBA2_Jz4UGigw2vcFHd8BtnGNJm12GZi3id5HFyEH4uFqcPgM/s1600/prob3.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnuHXlFbAV0u-J5q1-jqrxvG6OjFIXNyfusINtPO6Xhhyphenhyphen9NiozsVFrN5d_mDFin6Y9aHr3P-PiJULJNhUvA6IP4lHHzzWBA2_Jz4UGigw2vcFHd8BtnGNJm12GZi3id5HFyEH4uFqcPgM/s1600/prob3.png&quot; height=&quot;165&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
The model that assigns the same probability for each character is labeled as &quot;uniform model&quot; in my simulation.&lt;br /&gt;
&lt;br /&gt;
My goal is to optimize &lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;n&lt;/b&gt;&lt;/span&gt; (minimize the number of monkeys needed because I am on a tight budget). So I decided to use a &lt;a href=&quot;https://www.biostat.wisc.edu/bmi576/lectures/markov-chains-2.pdf&quot;&gt;Markov Chain model of order 1 &lt;/a&gt;to do so. If you are unfamiliar with Markov Chains here is a very nice explanation of the models &lt;a href=&quot;http://setosa.io/blog/2014/07/26/markov-chains/index.html&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The training set of the emission probability matrix, consist on a parsed version of &lt;a href=&quot;http://www.gutenberg.org/ebooks/345&quot;&gt;Dracula&lt;/a&gt; (chapters 1 to 3, no punctuation signs, lowercase characters only)&lt;br /&gt;&lt;br /&gt;The emission probability matrix of the Markov Chain ensures that&amp;nbsp;&lt;span style=&quot;text-align: center;&quot;&gt;the transition from one character to another character is constrained by previous character and this relation is weighted based on the frequencies obtained in the training text.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;text-align: center;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;text-align: center;&quot;&gt;It is like having a keyboard with lights for each key, after &quot;a&quot; is pressed, the light intensity of each key would be proportional of what characters are more likely to appear after an &quot;a&quot;. For example &quot;b&quot; would have more light than &quot;a&quot;, because it is more common to find words having *a-b* than *a-a*.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# &amp;nbsp;RESULTS&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
1) Plot the distribution of characters in the uniform model&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQmev9r_t9Z0ruthl3DiFmrSleYxSfHqUojevTIxWmCNgpepXumEWc-tymKF9Oqucqn0gqbpdn0Q6aW_cUIDKC6QtZ0nrhncGHT7Q3LBa5EmNhWOz22AREoYc1SdRzEA2RFQQZ6Q2rqqQ/s1600/barplot_freq.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQmev9r_t9Z0ruthl3DiFmrSleYxSfHqUojevTIxWmCNgpepXumEWc-tymKF9Oqucqn0gqbpdn0Q6aW_cUIDKC6QtZ0nrhncGHT7Q3LBa5EmNhWOz22AREoYc1SdRzEA2RFQQZ6Q2rqqQ/s1600/barplot_freq.png&quot; height=&quot;459&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
Distribution of characters after 10,000 iterations using the uniform model&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
2) Plot the emission matrices&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic71dmcas04TySaqdtMkwxD-WW2dYy8s8R3pIUQdRlJrGiv0N45Xe_pAR61z9y-afFb3tFkdMRNpeCGLIVPkMOsTfUDgoEyxOu9lMIFgKyR48HAegoiRVFgtFMx-8QKBBECpFmz3nq124/s1600/em_m_network.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic71dmcas04TySaqdtMkwxD-WW2dYy8s8R3pIUQdRlJrGiv0N45Xe_pAR61z9y-afFb3tFkdMRNpeCGLIVPkMOsTfUDgoEyxOu9lMIFgKyR48HAegoiRVFgtFMx-8QKBBECpFmz3nq124/s1600/em_m_network.png&quot; height=&quot;337&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
A) As expected, the transition from one character to another character is constrained by previous character and this relation is weighted based on the frequencies obtained in the training text. B) in the uniform model each character has the same probability to be typed and does not depend on the previous character.&amp;nbsp;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
3) Compare the performance of the two models&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbUTtjINzB-MokM9lPgy-5ExqFQevGqDzzylDg_ur538LV1WeBdL2nDMF-cafd_iAsUzmelmbpT5E-9xUsaTYxqzWhM_jM_0OEaiIBVltbmJF8zfEPrbkGrSGblqafBJUBVTgm2UQQBU0/s1600/ratio_n_monkeys.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbUTtjINzB-MokM9lPgy-5ExqFQevGqDzzylDg_ur538LV1WeBdL2nDMF-cafd_iAsUzmelmbpT5E-9xUsaTYxqzWhM_jM_0OEaiIBVltbmJF8zfEPrbkGrSGblqafBJUBVTgm2UQQBU0/s1600/ratio_n_monkeys.png&quot; height=&quot;594&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
In this plot I am comparing the number of monkeys (log10(x)) required to type the target words (indicated in red text) using the Markov Chain model and the uniform model. In general the Markov Chain model requires less monkeys in words that are likely to appear in the training set, like &quot;by&quot;, &quot;the&quot;, &quot;what&quot; , &quot;where&quot; and &quot;Dracula&quot;. On the other hand, words that only have one character like &quot;a&quot;, given that there&#39;s no prior information the models perform equally. Now another interesting example is the word &quot;linux&quot;, &amp;nbsp;in which is not very likely to appear in the training set and therefore the models perform likely equally. The extreme case example is the word &quot;omglolbbq&quot;, in which the Markov Chain model performs worse than the uniform model due of the very low probability of this word to happen, so it is penalized and I will need more monkeys to get this target word&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# &amp;nbsp;SOURCE AND FILES&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# *******************&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://github.com/TATABOX42/typing_monkeys&quot;&gt;Source and files&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/4563229489956399144/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/09/if-typing-monkeys-have-met-mr-markov.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4563229489956399144'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4563229489956399144'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/09/if-typing-monkeys-have-met-mr-markov.html' title='If the typing monkeys have met Mr Markov: probabilities of spelling &quot;omglolbbq&quot; after the digitial monkeys have read Dracula'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzoask01voUyYr-jGrqiXcJD4C1etrF6-ZdbGvz6VJEoMWAYXQesvykudTpcRbj89lMtk5kS25qCAmbKK23swP284Esqnm2WLHmyDutxac8oSmwZemsIfJBIudn9uLVfYjyA6fhHWr-gw/s72-c/prob1.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-4781343306816601833</id><published>2014-08-10T13:15:00.001-05:00</published><updated>2014-08-10T13:15:24.616-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Game-Theory"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>To cooperate of defect (besides of coding): Prisoners dilemma, a game theory example in R</title><content type='html'>&lt;br /&gt;
Hello Computer Science and/or R enthusiasts.&lt;br /&gt;
&lt;br /&gt;
This week I had the opportunity to try something that was in my To-Do list a while ago. The idea came almost instantly after reading Dr.&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Richard_Dawkins&quot;&gt;Richard Dawkins&lt;/a&gt; book, &lt;a href=&quot;https://en.wikipedia.org/wiki/The_Selfish_Gene&quot;&gt;The Selfish Gene&lt;/a&gt; (which was a BD gift, thanks Andy).&lt;br /&gt;
&lt;br /&gt;
I feel the obligated necessity to program my own implementation of the prisoners dilemma and make my own version of the contest. To try different parameters of greediness and vindictiveness, just coding in the name of science.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
You can download the code &lt;a href=&quot;https://github.com/TATABOX42/R_files/blob/master/prisoners_dilemma.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;INTRODUCTION&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Here&#39;s a basic definition of prisoners dilemma that can be found in this &lt;a href=&quot;https://en.wikipedia.org/wiki/Prisoner%27s_dilemma&quot;&gt;Wikipedia article:&lt;/a&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The structure of the traditional Prisoners’ Dilemma can be generalized from its original prisoner setting. Suppose that the two players are represented by the colors, red and blue, and that each player chooses to either &quot;Cooperate&quot; or &quot;Defect&quot;.&lt;br /&gt;
&lt;br /&gt;
If both players cooperate, they both receive the reward, R, for cooperating. If Blue defects while Red cooperates, then Blue receives the temptation, T payoff while Red receives the &quot;sucker&#39;s&quot;, S, payoff.&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Similarly, if Blue cooperates while Red defects, then Blue receives the sucker&#39;s payoff S while Red receives the temptation payoff T. If both players defect, they both receive the punishment payoff P.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4B-m4Q35emYAY7s4qGSPSLlGV4sLzhRlqrtJriKZEqefKB-nY0R50oIRN56QTbglTUHp8Pg03jxhuwk5NHZ8xxlH82-ppOPXaMZMZzx3I_KZMxHtSbBCZ-RmF2OwDIWZQnqQkvvHejXk/s1600/payoff.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4B-m4Q35emYAY7s4qGSPSLlGV4sLzhRlqrtJriKZEqefKB-nY0R50oIRN56QTbglTUHp8Pg03jxhuwk5NHZ8xxlH82-ppOPXaMZMZzx3I_KZMxHtSbBCZ-RmF2OwDIWZQnqQkvvHejXk/s1600/payoff.png&quot; height=&quot;144&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And to be a prisoner&#39;s dilemma game in the strong sense, the following condition must hold for the payoffs:&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;T &amp;gt; R &amp;gt; P &amp;gt; S&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;MATERIALS&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
I decided to program three basic strategy functions:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;tit.for.tat.bot:&lt;/span&gt;&lt;/b&gt; this simple strategy repeats opponent&#39;s last choice&lt;/li&gt;
&lt;ol&gt;
&lt;li&gt;This bot/strategy is parameter-free&lt;/li&gt;
&lt;/ol&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;greedy.bot:&lt;/span&gt;&lt;/b&gt; this strategy is affected by the parameter of greedy.level, which increases the probability of the bot to &quot;defect&quot; when this parameter is close to 1.0. If this parameter is set to 0.5, then the bot will choose randomly.&amp;nbsp;&lt;/li&gt;
&lt;ol&gt;
&lt;li&gt;This bot does not care about the previous action of the other player, it only decides to &quot;defect&quot; or &quot;cooperate&quot; based on its own greediness.&lt;/li&gt;
&lt;/ol&gt;
&lt;li&gt;&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;vindictive.bot:&lt;/b&gt;&amp;nbsp;&lt;/span&gt; this strategy is affected by the parameter of vindictive.level, which is only used when the previous action of the other player is &quot;defect&quot; in the same sense as the greedy.level affects the greedy.bot, otherwise it will &quot;cooperate&quot;.&amp;nbsp;&lt;/li&gt;
&lt;ol&gt;
&lt;li&gt;This &amp;nbsp;means that if the other player plays &quot;cooperate&quot;, this bot will &quot;cooperate&quot; too, but at any moment when the other player plays &quot;defect&quot;, this bot will trigger its own vindictive.level to decide if it will play &quot;defect&quot; or to &quot;cooperate&quot;.&lt;/li&gt;
&lt;/ol&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;All bots have three parameters as arguments:&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;ol&gt;
&lt;li&gt;action: last action of the other player&lt;/li&gt;
&lt;li&gt;greedy.level*&lt;/li&gt;
&lt;li&gt;vindictive.level&lt;/li&gt;
&lt;/ol&gt;
&lt;li&gt;&lt;i&gt;*The greedy.level parameter in the&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt; tit.for.tat.bot&lt;/span&gt;&lt;/b&gt; and the &lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;vindictive.bot&lt;/span&gt;&lt;/b&gt; is only used when this bots are playing as player.1 and have to decide the first action/move against player.2. After that event, this parameter is no longer used for these bots.&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Each point in the plots means 1000 (iter=1000) plays of player.1 vs player.2&lt;/li&gt;
&lt;li&gt;Parameters for the payoff matrix: T=20, R=5, P=1 and S=0&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;METHODS&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Try different pairwise comparisons of strategies varying the greediness and the vindictiveness of the bots&lt;/li&gt;
&lt;li&gt;Make our clandestine fight club with the bots (1st RULE: You do not &lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;code&lt;/b&gt;&lt;/span&gt; about FIGHT CLUB)&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;RESULTS&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
1.1) GREEDY BOT VS TIT FOR TAT BOT&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk7GxqEHmzvbq-YT5Mub7ewUYdQ9MlrtmH9HZ2BISrIG88nuljGKFw4_ba09Qct2D_UtDXsZ2YcyPhRTG5p-wBhngfxrhwKvbPrQMKVRsz5W2LNpDLkmz2Pl0IMEVdqYn8p0U0iHLXgPQ/s1600/greedy_vs_tit_for_tat.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk7GxqEHmzvbq-YT5Mub7ewUYdQ9MlrtmH9HZ2BISrIG88nuljGKFw4_ba09Qct2D_UtDXsZ2YcyPhRTG5p-wBhngfxrhwKvbPrQMKVRsz5W2LNpDLkmz2Pl0IMEVdqYn8p0U0iHLXgPQ/s1600/greedy_vs_tit_for_tat.png&quot; height=&quot;341&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;Using different thresholds of greediness in a greedy.bot vs tit.for.tat.bot.&lt;/span&gt;&lt;/b&gt;&amp;nbsp;we can see that the tit.for.tat.bot handles very well when this bot increases its greediness. this bot cooperates when the adversary cooperates, but reacts when the other player decides to &quot;defects&quot;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
1.2) VINDICTIVE BOT VS TIT FOR TAT BOT&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhokQOt6ARyz8v_AqyOhDGE_AEg0zldRnjA64ZTGulQE5aVgz4hbx4Y5YXPH2khYYgTdbDAaOPhZfvThm6d14Vv5L4zUJ92utd1eIfP5upgfc3O_D-H5-8WF4AOr2zLPTOlmH07SGz5Ur4/s1600/vindictive_vs_tit_for_tat.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhokQOt6ARyz8v_AqyOhDGE_AEg0zldRnjA64ZTGulQE5aVgz4hbx4Y5YXPH2khYYgTdbDAaOPhZfvThm6d14Vv5L4zUJ92utd1eIfP5upgfc3O_D-H5-8WF4AOr2zLPTOlmH07SGz5Ur4/s1600/vindictive_vs_tit_for_tat.png&quot; height=&quot;400&quot; width=&quot;350&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;Using different thresholds of vindictive.level in a vindictive.bot vs tit.for.tat.bot.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;A) We can see that the vindictive.bot does not &quot;defect&quot; unless provoked. &amp;nbsp;Therefore as these bots &quot;cooperate&quot; along all the simulations, they get the score for 5000 which is the maximum score for mutual cooperation along 1000 iterations. &amp;nbsp;B) For this experiment, we are increasing the greedy.level and the vindictive.level by steps of 0.05. The greedy.level in the vindictive.bot only determines the probability to &quot;defect&quot; in the first move only and only if the vindictive.bot is playing as player.1. We can see that the tit.for.tat.bot handles very well when this bot increases its greediness and the vindictiveness. This bot cooperates when the adversary cooperates, but reacts when the other player decides to &quot;defects&quot;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
1.3) GREEDY BOT VS VINDICTIVE BOT&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1hUhXEwTjPkmKciY-5i3d7jtEsogsbxxlpGWsF0JBsiyICpAWeTHtbypvtwCDdVLOEEQ94R_jfibfT7Ahb8PNoWIkkf9XuA4NWwx-aAgqgfuFMenlCEEYQtu6xDTub8vCz4YdkG5-els/s1600/greedy_vs_vindictive.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1hUhXEwTjPkmKciY-5i3d7jtEsogsbxxlpGWsF0JBsiyICpAWeTHtbypvtwCDdVLOEEQ94R_jfibfT7Ahb8PNoWIkkf9XuA4NWwx-aAgqgfuFMenlCEEYQtu6xDTub8vCz4YdkG5-els/s1600/greedy_vs_vindictive.png&quot; height=&quot;400&quot; width=&quot;350&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;Using different thresholds of greedy.level in greedy.bot vs the effect of vindictive.level in a vindictive.bot.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;A) greedy.level and vindictive.level increased proportionally equal among the two bots. we can see that the greedy.bot when becomes more greedy it outperform the vindictive.bot because the vindictive.bot only triggers its self-defense strategy when provoked, and the probability to take revenge is based on the vindictive.level. B) greedy.level and vindictive.level increased proportionally inverse among the two bots. This means that the most greedy bot will face with the most forgiving bot.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;br /&gt;
2.1) FIGHT CLUB: ROUND 1 (0.7 to 1.0 in greedy and vindictive.level)&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBqWddPbSvrROnmTTpejW_gCfVx7J-bpxMMyfKEhONYOVpnb2vlynRLoovaZ06_RvxfH1jcMmpovayJ1xvbJ2RbcH-hTIjBbGFZd0JaLwhHFVhUssaubVf78sY4lSD7Pc3rxu-oJEO8hI/s1600/fight_club_round1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBqWddPbSvrROnmTTpejW_gCfVx7J-bpxMMyfKEhONYOVpnb2vlynRLoovaZ06_RvxfH1jcMmpovayJ1xvbJ2RbcH-hTIjBbGFZd0JaLwhHFVhUssaubVf78sY4lSD7Pc3rxu-oJEO8hI/s1600/fight_club_round1.png&quot; height=&quot;376&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;Tournament of all vs all using a range of parameters from 0.7 to 1.0 in greedy and vindictive.level&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;A) Scatter-plot of each strategy. in x-axis is the mean score of the strategy versus all other strategies when it is playing as player.1 and y.axis when is playing as player.2. We can see that the greedy.bot.1.0 (greedy.level=1.0) has the lowest score followed by the vindictive.bots and the tit.for.tat.bot. The bots that gets higher scores are the ones that have a greedy personality, but not &quot;so-greedy&quot; after all, for example the most successful bot only has 0.7 of probability of &quot;defect&quot;. This experiment is in someway biased because the greedy.bots are taking advantage by exploiting the &quot;nice bots&quot; and getting higher scores for that reason. &amp;nbsp;B) &amp;nbsp;This barplot shows how is the effect of the greediness in the greedy.bot, as it becomes more greedy, the scores tend to be lower. The opposite behavior is shown for the vindictive.bot&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
2.2) FIGHT CLUB: ROUND 2 (0.8 to 1.0 in greedy and vindictive.level)&lt;br /&gt;
&lt;br /&gt;
To avoid that bias, now we show the experiment with more fierce adversaries&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidiTuuT7MVC61hkEPqiOJ-E11PY4VUx8oiWHxpz3HNzKPnh7yi8JK435L_JTdJXHktDpdRom5nvqyEfYTsEwYcRqOvwqGWVX95xZlsHh2ZPc3Ls4nmLwJyEcEZaMWKp0tX07zNOhcsTgk/s1600/fight_club_round2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidiTuuT7MVC61hkEPqiOJ-E11PY4VUx8oiWHxpz3HNzKPnh7yi8JK435L_JTdJXHktDpdRom5nvqyEfYTsEwYcRqOvwqGWVX95xZlsHh2ZPc3Ls4nmLwJyEcEZaMWKp0tX07zNOhcsTgk/s1600/fight_club_round2.png&quot; height=&quot;400&quot; width=&quot;350&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;Tournament of all vs all using a range of parameters from 0.8 to 1.0 in greedy and vindictive.level&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;A) Scatter-plot of each strategy. in x-axis is the mean score of the strategy versus all other strategies when it is playing as player.1 and y.axis when is playing as player.2. We can see that the greedy.bot.1.0 (greedy.level=1.0) has the lowest score followed by the vindictive.bots and the not-so-greedy bots. Now the bot that wins the contest is the simplest strategy, tit.for.that. B) &amp;nbsp;This barplot shows how is the effect of the greediness in the greedy.bot, as it becomes more greedy, the scores tend to be lower. The opposite behavior is shown for the vindictive.bot. In general we can conclude that the the vindictive.bots performs better than the greedy.bots because they tend to cooperate unless provoked, so, they are nicer bots.&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;DISCUSSION AND CONCLUSIONS&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;b&gt;*****************&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
NICE &quot;BOTS&quot; FINISH FIRST, as Dr Dawkins concluded in his book.&lt;br /&gt;
&lt;br /&gt;
Dr. Dawkins &lt;a href=&quot;https://en.wikipedia.org/wiki/Nice_Guys_Finish_First&quot;&gt;illustrates&lt;/a&gt; the four conditions of tit for tat.&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Unless provoked, the agent will always cooperate.&lt;/li&gt;
&lt;li&gt;If provoked, the agent will retaliate.&lt;/li&gt;
&lt;li&gt;The agent is quick to forgive.&lt;/li&gt;
&lt;li&gt;The agent must have a good chance of competing against the opponent more than once.&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;b&gt;&lt;span style=&quot;color: orange;&quot;&gt;In conclusion, it is better to be a nice human than a greedy one.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Benjamin&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/4781343306816601833/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/08/to-cooperate-of-defect-besides-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4781343306816601833'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4781343306816601833'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/08/to-cooperate-of-defect-besides-of.html' title='To cooperate of defect (besides of coding): Prisoners dilemma, a game theory example in R'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4B-m4Q35emYAY7s4qGSPSLlGV4sLzhRlqrtJriKZEqefKB-nY0R50oIRN56QTbglTUHp8Pg03jxhuwk5NHZ8xxlH82-ppOPXaMZMZzx3I_KZMxHtSbBCZ-RmF2OwDIWZQnqQkvvHejXk/s72-c/payoff.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-8899814216260277759</id><published>2014-08-01T18:53:00.000-05:00</published><updated>2014-08-01T18:53:09.781-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Game-Theory"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Random"/><title type='text'>UPDATE: 100 Prisoners, 100 lines of code, a  game theory example in R</title><content type='html'>I &amp;nbsp;was looking for some Game-Theory post in &lt;a href=&quot;http://www.r-bloggers.com/&quot;&gt;R-bloggers&lt;/a&gt; and found this very cool &lt;a href=&quot;http://www.r-bloggers.com/100-prisoners-100-lines-of-code/&quot;&gt;post&lt;/a&gt;, &amp;nbsp;obviously I decided to read the post, to look at the code and implement a version of it.&lt;br /&gt;
&lt;br /&gt;
The idea of the post was very cool for me that I thought it would be a nice idea to get more &quot;juice&quot; from it.&lt;br /&gt;
&lt;br /&gt;
All the credit for this post, is for its original author.&lt;br /&gt;
&lt;br /&gt;
The code is available &lt;a href=&quot;https://github.com/TATABOX42/R_files/blob/master/1000_prisoners_1000_boxes.R&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
As&amp;nbsp;&lt;a href=&quot;http://www.r-bloggers.com/author/matt-asher/&quot;&gt;Matt Asher&lt;/a&gt;&amp;nbsp;describes in his original post:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# ***************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# INTRODUCTION&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# ***************&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In this game, the warden places 100 numbers in 100 boxes, at random with equal probability that any number will be in any box. Each convict is assigned a number. One by one they enter the room with the boxes, and try to find their corresponding number. They can open up to 50 different boxes. Once they either find their number or fail, they move on to a different room and all of the boxes are returned to exactly how they were before the prisoner entered the room.&lt;br /&gt;
&lt;br /&gt;
The prisoners can communicate with each other before the game begins, but as soon as it starts they have no way to signal to each other. The warden is requiring that all 100 prisoners find their numbers, otherwise she will force them to listen to hundreds of hours of non-stop, loud rock musician interviews. Can they avoid this fate?&lt;br /&gt;
&lt;br /&gt;
The first thing you might notice is that if every prisoner opens 50 boxes at random, they will have a 0.5 probability of finding their number.&lt;br /&gt;
&lt;br /&gt;
There&#39;s actually a strategy that can guarantee all the prisoners to get their numbers. &lt;br /&gt;&lt;br /&gt;For this post, we will call that strategy, &lt;span style=&quot;color: orange;&quot;&gt;model&lt;/span&gt;, and the strategy with no strategy at all, we will call it &lt;span style=&quot;color: orange;&quot;&gt;random.model&lt;/span&gt;.&lt;br /&gt;
&lt;br /&gt;
The &lt;span style=&quot;color: orange;&quot;&gt;model&lt;/span&gt; strategy consist in that at the beginning, each prisoner will open the box based on their number, for example, the prisoner 42 will open the box at the position 42, after open it, say it contains the number 12, and given that 42 is not equal to 12, he will have to open the box at position 12. This will happen until he had opened 50 boxes or found his number.&lt;br /&gt;
&lt;br /&gt;
The &lt;span style=&quot;color: orange;&quot;&gt;random.model&lt;/span&gt; strategy consist in no strategy at all, only opening boxes at random until he had opened 50 boxes or found his number.&lt;br /&gt;
&lt;br /&gt;
Every time a prisoner found his number we will call this event a &lt;span style=&quot;color: orange;&quot;&gt;success&lt;/span&gt;.&lt;br /&gt;
&lt;br /&gt;
Our interest is to find the strategy that maximizes the success rate. We will run each simulation 1000 times.&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# ***************&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# RESULTS&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;# ***************&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Let&#39;s take a look at the performance of each model individually with this histogram:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1WaPhrOT7C3esbCinaF8-v3pz_Ipu3T5Fty0myU-33OIT7ry7HR4-PYOGvuW02lRBWPMDlVBMpZAoq1VCVSmK1t1srGN0a73srIv_s8RVL_hGkVwG-U_KyoHJ_DUx44V0GQ2lxgb0qMY/s1600/histograms1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1WaPhrOT7C3esbCinaF8-v3pz_Ipu3T5Fty0myU-33OIT7ry7HR4-PYOGvuW02lRBWPMDlVBMpZAoq1VCVSmK1t1srGN0a73srIv_s8RVL_hGkVwG-U_KyoHJ_DUx44V0GQ2lxgb0qMY/s1600/histograms1.png&quot; height=&quot;400&quot; width=&quot;327&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
We can see that the random.model behaves as we were expecting assuming that by chance we have a 0.5 chance of getting our number by selecting 50 random boxes, On the other hand, the other strategy has a very different distribution, there&#39;s a gap between 51 and 99 with 0 of success rate!, any suggestion to explain how is this possible?&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
Another way to plot the distribution of the success rates obtained from each model is by looking at a density plot:&lt;/div&gt;
&lt;div&gt;
&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTfwrtpy9NVuEPmDnSpPONJWPr-2IY0KRSKQEIIABynfGEMYzdQgtvBAhuNkHvnYqB7LtZsgWrQDzJ08FC5xYuTX5atBKJTC0L1g1dFqPypsQY37cug96rIYRfuybwZSKAQwbRi7o8L0o/s1600/density_plot.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTfwrtpy9NVuEPmDnSpPONJWPr-2IY0KRSKQEIIABynfGEMYzdQgtvBAhuNkHvnYqB7LtZsgWrQDzJ08FC5xYuTX5atBKJTC0L1g1dFqPypsQY37cug96rIYRfuybwZSKAQwbRi7o8L0o/s1600/density_plot.png&quot; height=&quot;348&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
This histogram is telling the same story than the previous plots, but in the same plot:&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy6aG29b_EPhy7a8CO2PpO_EydM1c2xuu_Uuez_wjQ7WQwQGKlVzS6jDjhW3EK0KO1Ii6w15zV_qBzAPzoJNbMj1wDqH_9HFIZPw0sLzw0SVJRctck8KMG3cOPdV373lsx83l4bzn6Ung/s1600/histograms2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy6aG29b_EPhy7a8CO2PpO_EydM1c2xuu_Uuez_wjQ7WQwQGKlVzS6jDjhW3EK0KO1Ii6w15zV_qBzAPzoJNbMj1wDqH_9HFIZPw0sLzw0SVJRctck8KMG3cOPdV373lsx83l4bzn6Ung/s1600/histograms2.png&quot; height=&quot;348&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
And as usual, I always left the best plot for the grand finale.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEop2zCo8pc2NFGom8wZ-jNxeL8Hgo4m30rq9xUrLlVaVsypOGlQWK0SCTzU3j_Os2-Db3kcFWmYrKtLEHAqj6c5wNq0tDWHsnaM04_u0qECK1xavdcqQVquwTEURd5fWWZavWBETFWOM/s1600/boxplot.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEop2zCo8pc2NFGom8wZ-jNxeL8Hgo4m30rq9xUrLlVaVsypOGlQWK0SCTzU3j_Os2-Db3kcFWmYrKtLEHAqj6c5wNq0tDWHsnaM04_u0qECK1xavdcqQVquwTEURd5fWWZavWBETFWOM/s1600/boxplot.png&quot; height=&quot;348&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
What we can say about this plot?, before going there, I will continue using Matt&#39;s explanation:&lt;br /&gt;
&lt;br /&gt;
The theoretical success rate of the model strategy is about 31%&lt;br /&gt;
&lt;br /&gt;
One way to look at the distribution of numbers in boxes is to see it as a permutation of the numbers from 1 to 100. Each permutation can be partitioned into what are called cycles. A cycle works like this: pick any number in your permutation. Let&#39;s say it&#39;s 23. Then you look at the number the 23rd place (ie the number in the 23rd box, counting from the left). If that number is 16, you look at the number in the 16th place. If that number is 87, go open box number 87 and follow that number. Eventually, the box you open up will have the number that brings you back to where you started, completing the cycle. Different permutations have different cycles.&lt;br /&gt;
&lt;br /&gt;
The key for the prisoner is that by starting with the box that is the same place from the left as his number, and by following the numbers in the boxes, the prisoner guarantees that if he is in a cycle of length less than 50, he will eventually open the box with his number in it, which would complete the cycle he began. One way to envision cycles of different lengths is to think about the extreme cases. If a particular permutation shifted every single number over one to the left (and wrapped number 1 onto the end), you would have a single cycle of length 100. Box 1 would contain number 2, box 2 number 3 and so on. On the other hand, if a permutation flipped every pair of consecutive numbers, you would have 50 cycles, each of length 2: box 1 would have number 2, box 2 would have number 1. Of course if your permutation doesn&#39;t change anything you have 100 cycles of length 1.&lt;br /&gt;
&lt;br /&gt;
As you can see from the histogram of the model strategy, you can never have between 50 and 100 winning prisoners. Anytime you have a single cycle of length greater than 50, for example 55, then all 55 prisoners who start on that cycle will fail to find their number. If no cycles are longer than 50, everyone wins.&lt;br /&gt;
&lt;br /&gt;
Now, looking at the boxplot, OK we would say, using the model strategy we have like a 30% percent of chance that every one of the prisoners will get their number, but on the other hand, the random strategy guarantees that at least every prisoner will have 50% chance of getting his number, and therefore win.&lt;br /&gt;
&lt;br /&gt;
So, the model strategy is like an all of nothing strategy, you can win the lottery or die trying. This is observable looking at the large standard deviation in the boxplot, on the other hand, the random.model strategy is very consistent with low dispersion values. This means that if the prisoners decides to use the model strategy, they will have to face tree possible outcomes, or everybody wins, or nobody does, or only a few does. On the other hand, with the random.model strategy, they will expect that at least 50 prisoners will win, thinking as a prisoner as a logical individual, I think this is the best strategy he can use.&lt;br /&gt;
&lt;br /&gt;
Write your opinions.&lt;br /&gt;
&lt;br /&gt;
Thanks to&amp;nbsp;&lt;a href=&quot;http://www.r-bloggers.com/author/matt-asher/&quot;&gt;Matt Asher&lt;/a&gt;&amp;nbsp;for the idea&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/8899814216260277759/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/08/update-100-prisoners-100-lines-of-code.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8899814216260277759'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8899814216260277759'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/08/update-100-prisoners-100-lines-of-code.html' title='UPDATE: 100 Prisoners, 100 lines of code, a  game theory example in R'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1WaPhrOT7C3esbCinaF8-v3pz_Ipu3T5Fty0myU-33OIT7ry7HR4-PYOGvuW02lRBWPMDlVBMpZAoq1VCVSmK1t1srGN0a73srIv_s8RVL_hGkVwG-U_KyoHJ_DUx44V0GQ2lxgb0qMY/s72-c/histograms1.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-6346253440888523615</id><published>2014-07-07T12:28:00.000-05:00</published><updated>2014-07-07T14:10:31.207-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>What are the names of the school principals in Mexico?, If your name is Maria, probably this post will interest you. Trends and cool plots from the national education census of Mexico in 2013</title><content type='html'>&lt;b&gt;I will start this post with a disclaimer:&lt;/b&gt;&lt;br /&gt;
&lt;span style=&quot;color: #f1c232;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
The main intention of the post is to show how is the distribution of the school principal names in Mexico, for example, to show basic trends regarding about what is the most common nation-wide first name and so on, also to show trends delimited by state and regions. &lt;br /&gt;
&lt;br /&gt;
These trends in data would answer questions such:&lt;br /&gt;
&lt;br /&gt;
1. Are the most common first names distributed equally among the states?&lt;br /&gt;
2. Does the states sharing the same region also share the same &quot;naming&quot; behavior?&lt;br /&gt;
&lt;br /&gt;
Additionally, this post includes cool wordclouds.&lt;br /&gt;
&lt;br /&gt;
Finally and the last part of my disclaimer is that, I am really concerned about the privacy of the persons involved. I am not by any sense promoting the exploitation of this personal data, if you decide to download the dataset, I would really ask you to study it and to generate information that is beneficial, do not join the Dark side.&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;# GETTING THE DATASET AND CODE&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
The database is located &lt;a href=&quot;http://datamx.io/dataset/censo-educativo-sep&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
The R code can be downloaded &lt;a href=&quot;https://www.dropbox.com/s/6zqqjf2rvmx3by9/SEP_names_analysis.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Additional data can be downloaded &lt;a href=&quot;https://www.dropbox.com/s/8nepk4r0a18akuf/state_region.csv&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
All the results were computed exploring&amp;nbsp;&lt;span style=&quot;color: #93c47d;&quot;&gt;&lt;b&gt;202,118&amp;nbsp;schools&lt;/b&gt;&lt;/span&gt; across the 32 states of Mexico from the 2013 census&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;# EXPLORING THE DATA&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;# WITH WORDCLOUDS&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #3d85c6;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Here is the wordcloud of names (by name, I am referring to first name only), it can be concluded that MARIA is by far the most common first name of a school principal in all Mexican schools, followed by JOSE and then by JUAN&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoV7nmn24XCkB2dXAeKel0w2A3dxpLISJ0vwi_xAJc0WcCmFKtDyRQyguxUukMOHbq7HurdZCTICX09TctgWYMUQC-hHcEyEplnQbRHUGl2DqRXW-B89MYzyXckBQcZ42mr0q114Lys5M/s1600/wordcloud_responsible_name_only.pdf.png&quot; height=&quot;400&quot; title=&quot;&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;br /&gt;
The following wordcloud includes every word in the responsible_name column (this includes, first name, last names). Now the plot shows that besides the common first name of MARIA, also the last names of HERNANDEZ, MARTINEZ and GARCIA are very common.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://www.dropbox.com/s/yxc15b313xl25e4/wordcloud_responsible_name.pdf&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin9w90pvNr4K2CmWeACAXtuwcZayaUjWv0iNzAgdfeiqz0Wvo0Mz8Nnl118CJeI6FrV-CJ24Ebv1Ptm9N1en8lv4xLM3oHMXLVZC3dsVcmVDYOGGKJ8ijItFcjjPP7aFNt-d1CpwomBgU/s1600/wordcloud_responsible_name.pdf.png&quot; height=&quot;400&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br class=&quot;Apple-interchange-newline&quot; /&gt;
&lt;b&gt;&lt;span style=&quot;color: #6fa8dc;&quot;&gt;##################&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;b&gt;&lt;span style=&quot;color: #6fa8dc;&quot;&gt;# EXPLORING THE FREQUENCY&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;b&gt;&lt;span style=&quot;color: #6fa8dc;&quot;&gt;# OF FIRST NAMES (TOP 30 | NATION-WIDE)&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;b&gt;&lt;span style=&quot;color: #6fa8dc;&quot;&gt;##################&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Looking at this barplot, the name MARIA is by far the most common name of the Mexican school&#39;s principals, with a frequency ~25,000. The next most popular name is JOSE with a frequency of ~7,500&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_9Ov6SR6MHazjDSXi64lE2GnIMVysaUmqq9TJSI5HWzulOas3iXqHSUI9PyTVUaYe8jYMmCY8cxWKS46vYg1EwsBZgi0YGtX2zhhyHWX8rRlKnHDRlaLL33iHa1176g4-z0QfNEccwW0/s1600/barplot_national_name_frequency.pdf.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_9Ov6SR6MHazjDSXi64lE2GnIMVysaUmqq9TJSI5HWzulOas3iXqHSUI9PyTVUaYe8jYMmCY8cxWKS46vYg1EwsBZgi0YGtX2zhhyHWX8rRlKnHDRlaLL33iHa1176g4-z0QfNEccwW0/s1600/barplot_national_name_frequency.pdf.png&quot; height=&quot;355&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
Looking at the same data, just adjusted to represent the % of each name inside the pool of first names we have that MARIA occupy ~11% of the names pool.&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-xQJ5IPCmKtzIjlajUvWUEKzyej35-EatkSkV7GdGq0xSYkqYUEJRmN2My9QmIdnqW-KmyJU8OhyToluiuyu-u6svrpXOdoRyUMePfH6nFfH-OhScnjLbUYoENLMPtdwQ6KDLJ_c5kCc/s1600/barplot_national_name_percentage.pdf.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-xQJ5IPCmKtzIjlajUvWUEKzyej35-EatkSkV7GdGq0xSYkqYUEJRmN2My9QmIdnqW-KmyJU8OhyToluiuyu-u6svrpXOdoRyUMePfH6nFfH-OhScnjLbUYoENLMPtdwQ6KDLJ_c5kCc/s1600/barplot_national_name_percentage.pdf.png&quot; height=&quot;355&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;# HEATMAPS OF THE DATA&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&amp;nbsp;With this heatmap, my intention is to show the distribution of the top 20 most common first names across all the Mexican states&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg56ca3Xak76N8Bb4pO8KO5XkO6cIsfKdMUugdabt3-7o6oybD4fH95Fc7Y3gTSpp9ntwp1keYFTpni6g186AK6liH7Blpx6dRMDkfiQjRujw0SOCCB-VkeYALWMwbqzoyTgxBZs6hyphenhyphenXa0/s1600/heatmap_state_and_top_names.pdf.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg56ca3Xak76N8Bb4pO8KO5XkO6cIsfKdMUugdabt3-7o6oybD4fH95Fc7Y3gTSpp9ntwp1keYFTpni6g186AK6liH7Blpx6dRMDkfiQjRujw0SOCCB-VkeYALWMwbqzoyTgxBZs6hyphenhyphenXa0/s1600/heatmap_state_and_top_names.pdf.png&quot; height=&quot;355&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
It can be concluded that there is a small cluster of states which keep the most number of principals named MARIA(but no so fast!, some states, for example Mexico and Distrito Federal are very populated, so I will reduce this effect in the following plot). In summary the message of this plot is the distribution of frequency of the top 20 most frequent first-names across the country.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;# CLUSTERS OF THE DATA&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #6fa8dc;&quot;&gt;&lt;b&gt;##################&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
For me, a young data-science-padawan, this is my favorite analysis: &quot;hunting down the trends&quot;.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnthFgedu6Qtl0X0Zj0ktsga9OiXpEAAJLChzhIYh69K8vSY_mIgBsaNsozlguRXfO1jg7svcRrhc_X_EMXWMhlJsMKWs7bWMsXj-R1vRX2eIjkt-e1YNoCzNGKvzSLywJ0BJ4CyQMXlQ/s1600/heatmap_state_distance_1000_top_names.pdf.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnthFgedu6Qtl0X0Zj0ktsga9OiXpEAAJLChzhIYh69K8vSY_mIgBsaNsozlguRXfO1jg7svcRrhc_X_EMXWMhlJsMKWs7bWMsXj-R1vRX2eIjkt-e1YNoCzNGKvzSLywJ0BJ4CyQMXlQ/s1600/heatmap_state_distance_1000_top_names.pdf.png&quot; height=&quot;355&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
The setup of the experiment is very simple, map the top 1,000 most frequent nation-wide names across each state to create a 32 x 1000 matrix (32 states and 1,000 most nation-wide frequent names).&lt;br /&gt;
&lt;br /&gt;
With this matrix, normalize the values by diving each row by the sum of it &amp;nbsp;(this will minimize the effect of the populated states vs the non populated while maintaining the proportion of the name frequencies per state). Then I just computed a distance matrix and plotted it as a heatmap.&lt;br /&gt;
&lt;br /&gt;
What I can conclude with this plot is that, there are clusters of states that seems to maintain a geographical preference to be clustered within its region, this would be concluded that it is likely that states sharing the same regions would be more likely to share the &quot;naming&quot; trends due to some cultural factors (like the cluster that includes Chihuahua, Sonora and Sinaloa). But this effect is not present in all the clusters. &lt;br /&gt;
&lt;br /&gt;
All images can be downloaded in PDF format here, just don&#39;t do evil with them!&lt;br /&gt;
&lt;br /&gt;
Plot 1 &lt;a href=&quot;https://www.dropbox.com/s/14ygb664c7bnirx/wordcloud_responsible_name_only.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Plot 2 &lt;a href=&quot;https://www.dropbox.com/s/yxc15b313xl25e4/wordcloud_responsible_name.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Plot 3&amp;nbsp;&lt;a href=&quot;https://www.dropbox.com/s/psvxwngwii5lsz3/barplot_national_name_frequency.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Plot 4&amp;nbsp;&lt;a href=&quot;https://www.dropbox.com/s/8dzr9ji8n88fftb/barplot_national_name_percentage.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Plot 5 &lt;a href=&quot;https://www.dropbox.com/s/jy88erv2utnvim8/heatmap_state_and_top_names.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
Plot 6 &lt;a href=&quot;https://www.dropbox.com/s/i3me88logtx731x/heatmap_state_distance_1000_top_names.pdf&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/6346253440888523615/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/07/what-are-names-of-school-principals-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/6346253440888523615'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/6346253440888523615'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/07/what-are-names-of-school-principals-in.html' title='What are the names of the school principals in Mexico?, If your name is Maria, probably this post will interest you. Trends and cool plots from the national education census of Mexico in 2013'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoV7nmn24XCkB2dXAeKel0w2A3dxpLISJ0vwi_xAJc0WcCmFKtDyRQyguxUukMOHbq7HurdZCTICX09TctgWYMUQC-hHcEyEplnQbRHUGl2DqRXW-B89MYzyXckBQcZ42mr0q114Lys5M/s72-c/wordcloud_responsible_name_only.pdf.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-6936388348361381857</id><published>2014-06-18T12:26:00.000-05:00</published><updated>2014-06-18T12:26:07.833-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Debian"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>[SOLVED] Problems compiling Rcpp dependent R packages in Crunchbang Linux 11 (Debian 7.5 (wheezy) 64-bit)</title><content type='html'>I had a very weird issue when I tried to compile (install) certain R packages like &quot;wordcloud&quot;, &quot;RSNNS&quot; or &quot;GOSemSim&quot; (a Bioconductor package). The installation always ended with a compiling error at the end.&lt;br /&gt;
&lt;br /&gt;
At first, I tried to solve the problem by finding anyone that had faced the same issue, so I Duck-Duck-go-it, Google-it and at least for me, I did not find anyone.&lt;br /&gt;
&lt;br /&gt;
At the end, and which is the core of my post, I solved the problem by removing or commenting all my customizations in the /usr/lib/R/etc/Rprofile.site file (for example, loading libraries automatically).&lt;br /&gt;
&lt;br /&gt;
In summary, using the default Rprofile.site worked for me.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZC7binS0CVX7UWmR5-vxHY3x_rObN2O5R0f5QhmBxe75kFAPylRBL24Tz2eVgFD7FC70pcwVD9ArvPQXrRjfYrsC7FZiKYRiaj-Hg-FldTG3qDKK4lPfgirGc74TtnqkKzkpBz9JVMBE/s1600/R_console.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZC7binS0CVX7UWmR5-vxHY3x_rObN2O5R0f5QhmBxe75kFAPylRBL24Tz2eVgFD7FC70pcwVD9ArvPQXrRjfYrsC7FZiKYRiaj-Hg-FldTG3qDKK4lPfgirGc74TtnqkKzkpBz9JVMBE/s1600/R_console.png&quot; height=&quot;273&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
I wrote this post to make a record of the issue, so maybe it would help someone to solve the same in the future.&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/6936388348361381857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/06/solved-problems-compiling-rcpp.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/6936388348361381857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/6936388348361381857'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/06/solved-problems-compiling-rcpp.html' title='[SOLVED] Problems compiling Rcpp dependent R packages in Crunchbang Linux 11 (Debian 7.5 (wheezy) 64-bit)'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZC7binS0CVX7UWmR5-vxHY3x_rObN2O5R0f5QhmBxe75kFAPylRBL24Tz2eVgFD7FC70pcwVD9ArvPQXrRjfYrsC7FZiKYRiaj-Hg-FldTG3qDKK4lPfgirGc74TtnqkKzkpBz9JVMBE/s72-c/R_console.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-296690891258870085</id><published>2014-02-25T10:35:00.000-06:00</published><updated>2014-02-25T10:35:10.723-06:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Sequence analysis"/><title type='text'>Convert Ensembl, Unigene, Uniprot and RefSeq IDs to Symbol IDs in R using Bioconductor</title><content type='html'>Hello, I have programmed a function that converts different sources of IDs to Symbol IDs.&lt;br /&gt;
&lt;br /&gt;
The input ID types allowed are (at the moment): &amp;nbsp;Ensembl, Unigene, Uniprot and RefSeq.&lt;br /&gt;
&lt;br /&gt;
The code is available clicking &lt;a href=&quot;https://dl.dropboxusercontent.com/u/17816910/R/get.symbolIDs.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
NOTE: The function depends on the Bioconductor package &quot;org.Hs.eg.db&quot; available &lt;a href=&quot;http://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
For example, lets show 10 Ensembl IDs:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;gt; id[1:10]&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp;[1] &quot;ENSG00000121410&quot; &quot;ENSG00000175899&quot; &quot;ENSG00000256069&quot; &quot;ENSG00000171428&quot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp;[5] &quot;ENSG00000156006&quot; &quot;ENSG00000196136&quot; &quot;ENSG00000114771&quot; &quot;ENSG00000127837&quot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp;[9] &quot;ENSG00000129673&quot; &quot;ENSG00000090861&quot;&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
And their Symbol IDs:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;gt; res[1:10]&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp;[1] &quot;A1BG&quot; &amp;nbsp; &amp;nbsp; &quot;A2M&quot; &amp;nbsp; &amp;nbsp; &amp;nbsp;&quot;A2MP1&quot; &amp;nbsp; &amp;nbsp;&quot;NAT1&quot; &amp;nbsp; &amp;nbsp; &quot;NAT2&quot; &amp;nbsp; &amp;nbsp; &quot;SERPINA3&quot;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp;[7] &quot;AADAC&quot; &amp;nbsp; &amp;nbsp;&quot;AAMP&quot; &amp;nbsp; &amp;nbsp; &quot;AANAT&quot; &amp;nbsp; &amp;nbsp;&quot;AARS&quot; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
This is a running example of the function to convert Unigene IDs to Symbol IDs (For all the other IDs types, just replace &quot;unigene&quot; to &quot;ensembl&quot; or &quot;refseq&quot; or &quot;uniprot&quot;):&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;# USAGE EXAMPlE: UNIGENE&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;require(org.Hs.eg.db)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;unigene &amp;lt;- toTable(org.Hs.egUNIGENE)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;# extract 100 random unigene entries&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;id &amp;nbsp;&amp;lt;- unigene[sample(1:length(unigene[,2]),100),2]&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;id.type &amp;nbsp;&amp;lt;- &quot;unigene&quot;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: Courier New, Courier, monospace;&quot;&gt;res &amp;lt;- get.symbolIDs(id,id.type)&lt;/span&gt;&lt;span class=&quot;Apple-tab-span&quot; style=&quot;white-space: pre;&quot;&gt; &lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
Benjamin</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/296690891258870085/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/02/convert-ensembl-unigene-uniprot-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/296690891258870085'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/296690891258870085'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/02/convert-ensembl-unigene-uniprot-and.html' title='Convert Ensembl, Unigene, Uniprot and RefSeq IDs to Symbol IDs in R using Bioconductor'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-8384876607844245588</id><published>2014-02-10T16:51:00.000-06:00</published><updated>2014-02-10T16:51:16.640-06:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Debian"/><category scheme="http://www.blogger.com/atom/ns#" term="Linux"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Upgrade and update R 2.15 to R 3.0 in Debian Wheezy </title><content type='html'>Following the instructions from &lt;a href=&quot;http://cran.r-project.org/&quot;&gt;CRAN&lt;/a&gt;,&amp;nbsp;you need to add the R backports in your source list.&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;background-color: #0b5394;&quot;&gt;FIRST PART: ADD R BACKPORTS:&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
First, open a Terminal and open the&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp;sources.list&lt;/span&gt;&amp;nbsp;file:&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;&lt;span style=&quot;color: lime;&quot;&gt;$ gksudo gedit /etc/apt/sources.list&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;
Then, add these lines at the bottom of the file (Note, I use the Revolution Analytics Dallas, TX server, but this can be easily changed taking a look&amp;nbsp;&lt;a href=&quot;http://cran.r-project.org/mirrors.html&quot;&gt;here&lt;/a&gt;&amp;nbsp;for the mirrors):&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: lime; font-family: monospace;&quot;&gt;## R BACKPORTS FOR WHEEZY&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime; font-family: monospace;&quot;&gt;deb http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: monospace;&quot;&gt;&lt;span style=&quot;color: lime;&quot;&gt;#deb-src http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;background-color: #0b5394;&quot;&gt;SECOND PART: RENAME THE R PACKAGES FOLDER:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
There&#39;s a folder where R uses to store the packages we download, just rename it to the current version of R. For example, mine was &quot;&lt;span style=&quot;color: orange;&quot;&gt;2.15&lt;/span&gt;&quot; and then I just renamed it to &quot;&lt;span style=&quot;color: orange;&quot;&gt;3.0&lt;/span&gt;&quot;&amp;nbsp;and was inside this path:&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div&gt;
Before:&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;/home/benjamin/R/x86_64-pc-linux-gnu-library/2.15&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
After:&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;/home/benjamin/R/x86_64-pc-linux-gnu-library/3.0&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;div&gt;
Remember that&amp;nbsp;some packages also needs to install some files in folders that belongs to the root, so, I would recommend to open R in sudo mode (only if you&#39;re sure about what you&#39;re doing :P) just by executing R this way: &quot;&lt;span style=&quot;color: orange;&quot;&gt;sudo R&lt;/span&gt;&quot; and then, in the R console&amp;nbsp;type :&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;&lt;span style=&quot;color: lime;&quot;&gt;update.packages(checkBuilt=TRUE, ask=FALSE)&lt;/span&gt;&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: #0b5394;&quot;&gt;THIRD PART: SECURE APT:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
The Debian backports archives on CRAN are signed with the key ID 381BA480, to add them, in a Terminal prompt type:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;&lt;span style=&quot;color: lime;&quot;&gt;gpg --keyserver pgp.mit.edu --recv-key 381BA480&lt;br /&gt;gpg -a --export 381BA480 &amp;gt; jranke_cran.asc&lt;br /&gt;sudo &amp;nbsp;apt-key add jranke_cran.asc&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;span style=&quot;background-color: #0b5394;&quot;&gt;FOURTH PART: UPDATE AND UPGRADE R:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
Save the file and you can either enter to Synaptic, update the packages list and then just upgrade the packages or in a terminal type:&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;&lt;br /&gt;&lt;span style=&quot;color: lime;&quot;&gt;sudo apt-get update&lt;/span&gt;&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;&lt;span style=&quot;color: lime;&quot;&gt;sudo apt-get upgrade&lt;/span&gt;&lt;/code&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
And that&#39;s all.&lt;/div&gt;
&lt;div&gt;
Benjamin&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/8384876607844245588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2014/02/upgrade-and-update-r-215-to-r-30-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8384876607844245588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8384876607844245588'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2014/02/upgrade-and-update-r-215-to-r-30-in.html' title='Upgrade and update R 2.15 to R 3.0 in Debian Wheezy '/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-1130091852629051783</id><published>2013-10-19T11:17:00.000-05:00</published><updated>2013-10-20T11:05:13.615-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Optimizing a multivariable function parameters using a random method, genetic algorithm and simulated annealing in R</title><content type='html'>&lt;br /&gt;
Say that you are implementing a non-linear regression analysis, which is shortly described by &lt;a href=&quot;http://en.wikipedia.org/wiki/Nonlinear_regression&quot;&gt;wikipedia&lt;/a&gt; as:&lt;br /&gt;
&lt;br /&gt;
&quot;In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.&quot;&lt;br /&gt;
&lt;br /&gt;
For the training set, we have the following:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijSAFmttOJI7nAf5SQ48F9HK6p8b2pP2I3SVUlIU4CTgwayb1mq4M2fLwO0VtJ-XWAaIOdISEV_T1XsX0DrH5oc1gH1YhGgklYt9cPGUuZ8q1iAWGMPZTT-DU5kVvqyt8xaXf2xh-M3Qg/s1600/trainSet.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijSAFmttOJI7nAf5SQ48F9HK6p8b2pP2I3SVUlIU4CTgwayb1mq4M2fLwO0VtJ-XWAaIOdISEV_T1XsX0DrH5oc1gH1YhGgklYt9cPGUuZ8q1iAWGMPZTT-DU5kVvqyt8xaXf2xh-M3Qg/s1600/trainSet.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And the function to optimize the parameters is:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyTu_0c-1J22FrBsAtruygRJ4GLZ1uKRrtxmIl1GUl2tIIjpwIcmBj0hyphenhyphen2po2bEZWb7zfoK60xgIwYqKJUYchJJaO6Wxh243MUw1u9m0_6THzDrmuUqVC6PU_rnHsWciLsQH3zVJHbSAU/s1600/f1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;52&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyTu_0c-1J22FrBsAtruygRJ4GLZ1uKRrtxmIl1GUl2tIIjpwIcmBj0hyphenhyphen2po2bEZWb7zfoK60xgIwYqKJUYchJJaO6Wxh243MUw1u9m0_6THzDrmuUqVC6PU_rnHsWciLsQH3zVJHbSAU/s320/f1.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Which leads us to the following equality:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjKXuhCgDzWt7jxIg9WnHBRKgkxrEAOEm9Xwwnay-sV-7XAxPfXBiqkA3gFKEUhxTVDGx9log8kkIUUibborUzRHc_LYG3ENE7pm48vs0zGHNR3z1mb2it_eF80WQfyMe7ohY8Ng0X0c/s1600/f2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjKXuhCgDzWt7jxIg9WnHBRKgkxrEAOEm9Xwwnay-sV-7XAxPfXBiqkA3gFKEUhxTVDGx9log8kkIUUibborUzRHc_LYG3ENE7pm48vs0zGHNR3z1mb2it_eF80WQfyMe7ohY8Ng0X0c/s1600/f2.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In other words, we want to optimize the value of theta in order to minimize the sum of the error among y and predicted.y:&lt;br /&gt;
&lt;br /&gt;
Given theta (each parameter a0,..a3 has a range from 0 to 15):&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXSrVAsbSNtEfcHdnaC3G3ZexmUKdtt95EBPqPCiiGEdPJddEU_Jlzv4R5K0rrZVcCVEPPFQOwKE_eaG7KfyC7F3321zOz7kwCIxzjQmHpid4p5M6SSKx5fJJY_qcJEkLep9vZtXQjdF8/s1600/theta.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXSrVAsbSNtEfcHdnaC3G3ZexmUKdtt95EBPqPCiiGEdPJddEU_Jlzv4R5K0rrZVcCVEPPFQOwKE_eaG7KfyC7F3321zOz7kwCIxzjQmHpid4p5M6SSKx5fJJY_qcJEkLep9vZtXQjdF8/s1600/theta.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And the error function:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVOQq76enI6MzVAfmeRI12DWEyHon6IRcwooF6Q-kvr_QxTzVUII9AVl2fq0vagj9Wbdp4F-TO1FL8LZtw6NdSblNhHjjlep4rYwDrwfaaFK1Jw_7XJpz5dYweVVLNh-_BTbBIpD7Na9M/s1600/error.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVOQq76enI6MzVAfmeRI12DWEyHon6IRcwooF6Q-kvr_QxTzVUII9AVl2fq0vagj9Wbdp4F-TO1FL8LZtw6NdSblNhHjjlep4rYwDrwfaaFK1Jw_7XJpz5dYweVVLNh-_BTbBIpD7Na9M/s1600/error.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
finally, the goal function:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVdrNS-XxmGkAELqvp75dIebkNU45N14q_9vD5T0yAA10w_Ia8mYhJC0ws_Yw9Z8fULOHUedHzoelzWa7Vz_RwZ57xxs2P9RqW6uPzNVW36IaweQUNWHlML221jCIQzY4z-56DpmyDBcA/s1600/goal.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVdrNS-XxmGkAELqvp75dIebkNU45N14q_9vD5T0yAA10w_Ia8mYhJC0ws_Yw9Z8fULOHUedHzoelzWa7Vz_RwZ57xxs2P9RqW6uPzNVW36IaweQUNWHlML221jCIQzY4z-56DpmyDBcA/s1600/goal.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In other words, the goal function searches for the value of theta that minimizes the error.&lt;br /&gt;
&lt;br /&gt;
COMPUTATIONS BEGIN&lt;br /&gt;
&lt;br /&gt;
This is the scatter plot of the training set:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrXdD0Y7wSTHMxtJDFhAIgz8w8MiwxH1RuhHgQfuRkiyjtgc-HPPDU8RqwmcvgVn8xb-Yl8J09kvRM2dcKyk0hJm8vPfKbpi1lYkX-ZJFFQiJMo4W7qieiY5iT7gYAxqm9gjV0WE_NEyw/s1600/trainingSetPlot.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;378&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrXdD0Y7wSTHMxtJDFhAIgz8w8MiwxH1RuhHgQfuRkiyjtgc-HPPDU8RqwmcvgVn8xb-Yl8J09kvRM2dcKyk0hJm8vPfKbpi1lYkX-ZJFFQiJMo4W7qieiY5iT7gYAxqm9gjV0WE_NEyw/s400/trainingSetPlot.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Here is the implementation in R, you can download the file clicking &lt;a href=&quot;https://dl.dropboxusercontent.com/u/17816910/R/parameter_optimization.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Here is a result plot using the genetic algorithm:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqX8h6U7EBUlJrweIxnrau3dA29UpQr4hpMtRkICsfK19uN7uq6KOsIDbgmtv6iFh_uLeRg5PCnHx5tzli8MRtRhbhymuLwuDN73Z4L1FGxaMcthSarIK-nubxORS7j00lbRBzonxypy0/s1600/solution.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;377&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqX8h6U7EBUlJrweIxnrau3dA29UpQr4hpMtRkICsfK19uN7uq6KOsIDbgmtv6iFh_uLeRg5PCnHx5tzli8MRtRhbhymuLwuDN73Z4L1FGxaMcthSarIK-nubxORS7j00lbRBzonxypy0/s400/solution.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/1130091852629051783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2013/10/optimizing-multivariable-function.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1130091852629051783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1130091852629051783'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2013/10/optimizing-multivariable-function.html' title='Optimizing a multivariable function parameters using a random method, genetic algorithm and simulated annealing in R'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijSAFmttOJI7nAf5SQ48F9HK6p8b2pP2I3SVUlIU4CTgwayb1mq4M2fLwO0VtJ-XWAaIOdISEV_T1XsX0DrH5oc1gH1YhGgklYt9cPGUuZ8q1iAWGMPZTT-DU5kVvqyt8xaXf2xh-M3Qg/s72-c/trainSet.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-8899661702104241916</id><published>2013-08-17T01:38:00.004-05:00</published><updated>2013-08-17T01:38:51.131-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>Lieutenant Dan You Got a New Interface..  </title><content type='html'>After some days of thinking, I realize that this blog deserved a little bit more attention, so I decided to change the interface and I&#39;am happy about how it looks now.&lt;br /&gt;
&lt;br /&gt;
Hope to see you again in my personal blog.&lt;br /&gt;
&lt;br /&gt;
keep on programming!&lt;br /&gt;
&lt;br /&gt;
Benjamin</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/8899661702104241916/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2013/08/lieutenant-dan-you-got-new-interface.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8899661702104241916'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/8899661702104241916'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2013/08/lieutenant-dan-you-got-new-interface.html' title='Lieutenant Dan You Got a New Interface..  '/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-1456846190134299461</id><published>2013-08-16T22:19:00.002-05:00</published><updated>2013-08-16T22:26:02.770-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Accuracy versus F score: Machine Learning for the RNA Polymerases</title><content type='html'>Hello, today I&#39;m going to show you the difference of using two different common performance measures (useful not only for Machine Learning purposes, is useful in every scientific field). Until now, I have found more the accuracy values than F scores in the performance measuring of some methods which ranges from metaheuristics (Genetic Algorithms fitness functions) to promoter recognition programs, diagnose methods and so on.&lt;br /&gt;
&lt;br /&gt;
But, I would really recommend to avoid using the accuracy measure. The reason is shown below with a nice example in R programming language (all the functions used in the simulation are included, &amp;nbsp;you can download them clicking &lt;a href=&quot;https://dl.dropboxusercontent.com/u/17816910/R/accuracyVsFscore.R&quot;&gt;here&lt;/a&gt;).&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;Case study 1: &lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Imagine that you are in a Computer Vision project and your task is to &quot;teach&quot; a program to recognize among &amp;nbsp;electric guitars and acoustic guitars showing the program pictures of different guitars.&lt;br /&gt;
&lt;br /&gt;
Suppose that you&#39;ve already developed that program and now you want to measure the performance of this Boolean classifier (this is for example, you show the program a picture of a an electric guitar, and the program has to decide whether it will recognize and &quot;classify&quot; it as an electric or as an acoustic guitar).&lt;br /&gt;
&lt;br /&gt;
For the function of this post, lets write down some useful concepts&lt;br /&gt;
&lt;br /&gt;
Consider the following:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;TP&lt;/span&gt;: a true positive is when the program classifies an electric guitar as an electric guitar, we will use the letter &quot;E&quot; to denote the electric guitar &quot;class&quot;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;FP&lt;/span&gt;: a false positive is when the program classifies an acoustic guitar as an electric guitar, we will use the letter &quot;A&quot; to denote the acoustic guitar &quot;class&quot;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;FN&lt;/span&gt;: a false negative is when the program classifies an electric guitar as an acoustic guitar&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;TN&lt;/span&gt;: a true negative is when the program classifies an acoustic guitar as an acoustic guitar&lt;br /&gt;
&lt;br /&gt;
Now that we are ready, we shall begin with the calculations&lt;br /&gt;
&lt;br /&gt;
In R, I have simulated the results of the program. Say, for 1,000 electric guitar pictures and 1,000 acoustic guitar pictures&lt;br /&gt;
&lt;br /&gt;
The program prompt the following results:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; PREDICTED.E PREDICTED.A&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;TRUE.E &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 485 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 515&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;TRUE.A &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 9 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 991&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
If you notice, from the 1000 electric guitar pictures, only 485 were labeled as electric (TP=485), the rest were labeled as acoustic (FN=515). I feel bad for the hypothetical programmer of this hypothetical example.&lt;br /&gt;
&lt;br /&gt;
On the other hand, from the 1000 acoustic guitars, 991 were labeled as acoustic (TN=991) and only 9 of them were labeled as electric (FP=9). Well not bad!..... or it is?&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;The accuracy value of this program is =&amp;nbsp;0.738&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
And, for computing the F score is necessary to compute the precision and the recall first, where:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;precision =&amp;nbsp;0.9817814 and recall =&amp;nbsp;0.485&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&lt;br /&gt;Then, the F score is equal to&amp;nbsp;0.6492637&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Well, the F scores seems to be more &quot;strict&quot;, and in fact it is in comparison of the accuracy performance measure. But this example is not very &quot;cool&quot;. Lets pass to the case study 2&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;Case study 2:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Now we have 1,000 electric guitar pictures and 100,000 acoustic guitar pictures, the confusion matrix of the results are:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; PREDICTED.E PREDICTED.A&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;TRUE.E &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 493 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 507&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;TRUE.A &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1017 &amp;nbsp; &amp;nbsp; &amp;nbsp; 98983&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
If you notice, from the 1,000 electric guitar pictures, only 493 were labeled as electric (TP=493), the rest were labeled as acoustic (FN=507)&lt;br /&gt;
&lt;br /&gt;
On the other hand, from the 100,000 acoustic guitars,&amp;nbsp;98983&amp;nbsp;were labeled as acoustic (TN=98983) and only&amp;nbsp;1017&amp;nbsp;of them were labeled as electric (FP=1017)&lt;br /&gt;
&lt;br /&gt;
Now (cha cha chan!), the performance values are:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;Accuracy:&amp;nbsp;0.9849109&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;Precision:&amp;nbsp;0.3264901&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;Recall:&amp;nbsp;0.493&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;F score:&amp;nbsp;0.3928287&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Now you see it?, how come or how is possible that missing almost the 50% of the labels of the electric guitars, the performance of the program in accuracy is almost 0.99?, despite of having a precision and recall not greater than 0.50. Then &lt;span style=&quot;color: orange;&quot;&gt;we have a winner and is the F score measure&lt;/span&gt;.&lt;br /&gt;
&lt;br /&gt;
for references visit the following pages:&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Accuracy&quot;&gt;http://en.wikipedia.org/wiki/Accuracy&lt;/a&gt;&lt;br /&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/F1_score&quot;&gt;http://en.wikipedia.org/wiki/F1_score&lt;/a&gt;&lt;br /&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Precision_and_recall&quot;&gt;http://en.wikipedia.org/wiki/Precision_and_recall&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/1456846190134299461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2013/08/accuracy-versus-f-score-machine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1456846190134299461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1456846190134299461'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2013/08/accuracy-versus-f-score-machine.html' title='Accuracy versus F score: Machine Learning for the RNA Polymerases'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-1335561387199324373</id><published>2013-05-25T20:14:00.000-05:00</published><updated>2013-05-25T20:14:13.005-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="LaTeX"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Make your LaTex presentations using the Beamer class</title><content type='html'>Hello, I have been working with this tutorial. All you need to do is download the source files and take a look at the source code and the PDF file.&lt;br /&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHT7Qp_jG9jS-GTHwYMMhIVard5zoMLBwaUWkL7wxfpsbzwaA0_FspUhFg_tGbJsTJwl7AaDWY_IM1Ns51ZBqMIBfkaUrfscnCJ7moje9pdTqcp8y6AIe5Vl6YaV6NitB1L-aaw8xOui0/s1600/beamerExample.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;297&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHT7Qp_jG9jS-GTHwYMMhIVard5zoMLBwaUWkL7wxfpsbzwaA0_FspUhFg_tGbJsTJwl7AaDWY_IM1Ns51ZBqMIBfkaUrfscnCJ7moje9pdTqcp8y6AIe5Vl6YaV6NitB1L-aaw8xOui0/s400/beamerExample.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;Download link &lt;a href=&quot;https://www.dropbox.com/s/jlir7wsiczd88ep/latexBeamerExample.tar.gz&quot;&gt;here&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;
Hope you find LaTex as useful as I do&lt;br /&gt;
&lt;br /&gt;
Benjamin&lt;br /&gt;&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/1335561387199324373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2013/05/make-your-latex-presentations-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1335561387199324373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1335561387199324373'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2013/05/make-your-latex-presentations-using.html' title='Make your LaTex presentations using the Beamer class'/><author><name>Anonymous</name><uri>http://www.blogger.com/profile/12992490904490492131</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHT7Qp_jG9jS-GTHwYMMhIVard5zoMLBwaUWkL7wxfpsbzwaA0_FspUhFg_tGbJsTJwl7AaDWY_IM1Ns51ZBqMIBfkaUrfscnCJ7moje9pdTqcp8y6AIe5Vl6YaV6NitB1L-aaw8xOui0/s72-c/beamerExample.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-30404396157788036</id><published>2012-12-02T12:33:00.001-06:00</published><updated>2013-10-18T18:23:40.949-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="Sequence analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>Póster presentado en la XIV Escuela de Otoño en Biología Matemática, México</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Póster presentado en la XIV Escuela de Otoño en Biología Matemática - 8vo Encuentro de Biología Matemática celebrado en San Luis Potosí, S.L.P, México:&lt;br /&gt;
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;241&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-FpASPXZNnGPY_Uays21rP0fb2OwDXr22ln_G0UY-JN3aoY6NxGDa3OFjBb9FKBo5-3S-96L3jEje5wIbpp5dXtQKW7L7yAvvKEJ7fglpiRpYRkpv5uewKdfMzRD3TyP-uM7h_gbIGQ65/s320/poster.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;span style=&quot;font-size: small; text-align: left;&quot;&gt;&quot;Predicción de promotores RNA POL-II en Drosophila melanogaster utilizando propiedades de señal, contexto y estructura a partir de secuencias nucleotídicas&quot;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
Póster disponible &lt;a href=&quot;https://www.dropbox.com/s/ds7qemsh1mp9kzw/benjamin.tovar.poster.png&quot;&gt;aquí&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/30404396157788036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/12/poster-presentado-en-la-xiv-escuela-de.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/30404396157788036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/30404396157788036'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/12/poster-presentado-en-la-xiv-escuela-de.html' title='Póster presentado en la XIV Escuela de Otoño en Biología Matemática, México'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-FpASPXZNnGPY_Uays21rP0fb2OwDXr22ln_G0UY-JN3aoY6NxGDa3OFjBb9FKBo5-3S-96L3jEje5wIbpp5dXtQKW7L7yAvvKEJ7fglpiRpYRkpv5uewKdfMzRD3TyP-uM7h_gbIGQ65/s72-c/poster.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-7733419355868892543</id><published>2012-12-02T12:21:00.001-06:00</published><updated>2012-12-02T12:24:12.414-06:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Random"/><title type='text'>Brownian motion simulation in R</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
I have based this post on a very useful piece of code which basically is the core of my own implementation of a Brownian Motion simulation in R. The original reference code &amp;lt;-&amp;nbsp;&lt;a href=&quot;http://landshape.org/enm/r-code-for-brownian-motion/&quot;&gt;http://landshape.org/enm/r-code-for-brownian-motion/&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
According to Wikipedia the mathematical model for Brownian motion (also known as random walks) can also be used to describe many phenomena as well as the random movements of minute particles,&lt;br /&gt;
such as stock market fluctuations and the evolution of physical characteristics in the fossil record. The simple form of the mathematical model for Brownian motion has the form:&lt;br /&gt;
&lt;br /&gt;
S_t = eS_t-1&lt;br /&gt;
&lt;br /&gt;
where e is drawn from a probability distribution.&lt;br /&gt;
&lt;br /&gt;
The source code is &lt;a href=&quot;https://dl.dropbox.com/u/17816910/R/brownian.motion/brownian.motion.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
After loading the source code, there are two functions: &lt;br /&gt;
&lt;br /&gt;
The first one, &lt;span style=&quot;color: orange;&quot;&gt;brownian&lt;/span&gt; will plot in an R graphics window the resulting simulation in an animated way.&lt;br /&gt;
&lt;br /&gt;
The second function, &lt;span style=&quot;color: orange;&quot;&gt;export.brownian&lt;/span&gt; will export each step of the simulation in independent PNG files.&lt;br /&gt;
&lt;br /&gt;
Example of running:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;gt; source(&quot;brownian.motion.R&quot;)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;gt; brownian(500)&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0RmnJfqVCbA0O0dd8x1fN6wqBDdVLVCaL-mC2QuT1CvqrGp9X6iLVSh9_e7Yzc_tZoDZOgmt_p_cK2CbH3qi0qd-U6wLNbP5t09uSuPL1KUOK7WH4bTf_CFVjHwbUycR-wZQ9TgvcICX2/s1600/bm1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;320&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0RmnJfqVCbA0O0dd8x1fN6wqBDdVLVCaL-mC2QuT1CvqrGp9X6iLVSh9_e7Yzc_tZoDZOgmt_p_cK2CbH3qi0qd-U6wLNbP5t09uSuPL1KUOK7WH4bTf_CFVjHwbUycR-wZQ9TgvcICX2/s320/bm1.png&quot; width=&quot;308&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;br /&gt;
The second function will produce this &lt;a href=&quot;https://dl.dropbox.com/u/17816910/R/brownian.motion/brownian.motion.animation.tar.gz&quot;&gt;output&lt;/a&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;gt; export.brownian(500)&lt;/span&gt;&lt;br /&gt;
CODE:&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-color: #48214a; background-position: initial initial; background-repeat: initial initial; border: 2px dashed rgb(204, 204, 204); height: auto; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;word-wrap: normal;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;span style=&quot;line-height: 20px;&quot;&gt;# *******************************
# BROWNIAN MOTION SIMULATION
# December 2012 | Benjamin Tovar
# *******************************
#
#   REFERENCES
#   http://landshape.org/enm/r-code-for-brownian-motion/
#
#   According to Wikipedia the mathematical model for Brownian motion 
#   (also known as random walks) can also be used to describe many 
#   phenomena as well as the random movements of minute particles, 
#   such as stock market fluctuations and the evolution of physical 
#   characteristics in the fossil record. The simple form of the 
#   mathematical model for Brownian motion has the form:
#
#    S_t = eS_t-1
#
#    where e is drawn from a probability distribution.
#
#######################################################################

brownian &amp;lt;- function(n.times){
    x &amp;lt;- y &amp;lt;- x.new &amp;lt;- y.new &amp;lt;- x.new.p &amp;lt;- y.new.p &amp;lt;- vector()
    for(i in 1:n.times){
        # Initialize variables
        x &amp;lt;- rnorm(1)
        y &amp;lt;- rnorm(1)
        # concatenate variables 
        # to increase the vector size
        x.new &amp;lt;- c(x.new,x)
        y.new &amp;lt;- c(y.new,y)
        # sum the vector numbers
        x.new.p &amp;lt;- cumsum(x.new)
        y.new.p &amp;lt;- cumsum(y.new)  
        # plot the model
        plot(x.new.p,y.new.p,type=&quot;b&quot;,
             main=paste(&quot;Brownian motion simulation in R\nTime =&quot;,i,sep=&quot; &quot;),
             xlab=&quot;x coordinates&quot;,ylab=&quot;y coordinates&quot;,
             col=c(rep(&quot;gray&quot;,i-1),&quot;red&quot;),
             pch=c(rep(20,i-1),1))    
    }
}

# Test the function
# brownian(500)

# ****************************************
# EXPORT BROWNIAN MOTION SIMULATION IMAGES
# ****************************************

export.brownian &amp;lt;- function(n.times){
    x &amp;lt;- y &amp;lt;- x.new &amp;lt;- y.new &amp;lt;- x.new.p &amp;lt;- y.new.p &amp;lt;- vector()
    for(i in 1:n.times){
        # Initialize variables
        x &amp;lt;- rnorm(1)
        y &amp;lt;- rnorm(1)
        # concatenate variables to increase the
        # vector size
        x.new &amp;lt;- c(x.new,x)
        y.new &amp;lt;- c(y.new,y)
        # sum the vector numbers
        x.new.p &amp;lt;- cumsum(x.new)
        y.new.p &amp;lt;- cumsum(y.new)  
        # plot the model
        png(paste(&quot;image&quot;,i,&quot;png&quot;,sep=&quot;.&quot;),width=600,height=600)
            plot(x.new.p,y.new.p,type=&quot;b&quot;,
                 main=paste(&quot;Brownian motion simulation in R\nTime =&quot;,
                 i,sep=&quot; &quot;),
                 xlab=&quot;x coordinates&quot;,ylab=&quot;y coordinates&quot;,
                 col=c(rep(&quot;gray&quot;,i-1),&quot;red&quot;),
                 pch=c(rep(20,i-1),1))
                 cat(&quot;image&quot;,i,&quot;DONE&quot;,date(),&quot;\n&quot;)
        dev.off()
    }
}

# Test the function
# export.brownian(500)&lt;span style=&quot;font-size: medium;&quot;&gt;

&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;!-----&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/7733419355868892543/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/12/brownian-motion-simulation-in-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7733419355868892543'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7733419355868892543'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/12/brownian-motion-simulation-in-r.html' title='Brownian motion simulation in R'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0RmnJfqVCbA0O0dd8x1fN6wqBDdVLVCaL-mC2QuT1CvqrGp9X6iLVSh9_e7Yzc_tZoDZOgmt_p_cK2CbH3qi0qd-U6wLNbP5t09uSuPL1KUOK7WH4bTf_CFVjHwbUycR-wZQ9TgvcICX2/s72-c/bm1.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-1188592394332664477</id><published>2012-07-29T13:22:00.002-05:00</published><updated>2013-07-23T16:25:09.396-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Sequence analysis"/><title type='text'>Extracting upstream regions of a RefSeq human gene list in R using Bioconductor</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Suppose that you want to do local mapping of upstream regions of a given RefSeq IDs in a particular genome in R using Bioconductor. Download the script &lt;a href=&quot;https://dl.dropboxusercontent.com/u/17816910/R/extract.five.utr.sequence.R&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
In this case, you may take a look at the Bioconductor AnnotationData Packages here:&amp;nbsp;&lt;a href=&quot;http://www.bioconductor.org/packages/release/data/annotation/&quot;&gt;http://www.bioconductor.org/packages/release/data/annotation/&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
The goal of this post is that for example I have the following RefSeq IDs and want to extract 250 bases upstream of each gene in a single list with another useful information such the entrez.id, the symbol and the gene description.&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
&lt;i&gt;# RefSeqs IDs:&lt;/i&gt;&lt;br /&gt;
gene.list.refseq &amp;lt;- c(&quot;NM_003588&quot;,&quot;NM_001145436&quot;, &quot;NM_001135188&quot;,&quot;NM_020760&quot;,&quot;NM_173362&quot;, &quot;NM_198393&quot;,&quot;NM_022736&quot;,&quot;NM_025074&quot;,&quot;NM_033449&quot;,&quot;NM_015726&quot;, &quot;NM_022110&quot;,&quot;NM_016478&quot;,&quot;NM_020634&quot;,&quot;NM_002291&quot;,&quot;NM_000418&quot;, &quot;NM_001862&quot;,&quot;NM_017752&quot;,&quot;NM_006591&quot;,&quot;NM_000124&quot;,&quot;NM_144610&quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;# How many bases upstream of each gene: &amp;nbsp; &amp;nbsp;&lt;/i&gt;&lt;br /&gt;
bases.upstream &amp;lt;- 250&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
Before starting, please download the following packages in R:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;source(&quot;http://bioconductor.org/biocLite.R&quot;)&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;biocLite(&quot;BSgenome.Hsapiens.UCSC.hg19&quot;) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;biocLite(&quot;Biostrings&quot;)&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;biocLite(&quot;org.Hs.eg.db&quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;/code&gt;
Load the function using:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
source(&quot;extract.five.utr.sequence.R&quot;)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;&lt;/code&gt;
And finally make the computations:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
output.sequences &amp;lt;- extract.five.utr.sequence(gene.list.refseq,bases.upstream)&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
CODE:&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSeO1ffr-IA2p7rIyuMmAiMDJW0-yI0DECylRbDyJmgky06C_0iJGqD1V-R4ZLSZzRfG8RL7g5vMst6-KwLHcHLiVG6C0d2QqopIiwUnoGVcQSDGPhyphenhyphen-yK0gJfqv07GfxDnQc1PtTlAdPs/s320/codebg.gif); background: #48214A; border: 2px dashed #CCCCCC; color: white; font-family: arial; font-size: 15.6px; height: 600px; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code&gt;&lt;code style=&quot;color: white; word-wrap: normal;&quot;&gt;# ******************************************************************************
#       FUNCTION: extract.five.utr.sequence | 29/JULY/2012 | BENJAMIN TOVAR
# ******************************************************************************
extract.five.utr.sequence &amp;lt;- function(gene.list.refseq,
                                      number.bases.upstream=1000){
    
    
    # *****************************************************************
    # RUNNING NOTES: Please download this packages from Bioconductor
    # http://www.bioconductor.org/packages/release/data/annotation/
    # ***************************************************************** 
    #   source(&quot;http://bioconductor.org/biocLite.R&quot;)
    #   biocLite(&quot;BSgenome.Hsapiens.UCSC.hg19&quot;)                                
    #   biocLite(&quot;Biostrings&quot;)
    #   biocLite(&quot;org.Hs.eg.db&quot;)
                                      
    # *****************************************************************
    # RUNNING EXAMPLE
    # *****************************************************************                              
    #
    ## Extract 250 bases upstream of each gene in gene.list.refseq 
    #
    # gene.list.refseq &amp;lt;- c(&quot;NM_003588&quot;,&quot;NM_001145436&quot;, &quot;NM_001135188&quot;,&quot;NM_020760&quot;,&quot;NM_173362&quot;,   
    #                    &quot;NM_198393&quot;,&quot;NM_022736&quot;,&quot;NM_025074&quot;,&quot;NM_033449&quot;,&quot;NM_015726&quot;,   
    #                    &quot;NM_022110&quot;,&quot;NM_016478&quot;,&quot;NM_020634&quot;,&quot;NM_002291&quot;,&quot;NM_000418&quot;,   
    #                    &quot;NM_001862&quot;,&quot;NM_017752&quot;,&quot;NM_006591&quot;,&quot;NM_000124&quot;,&quot;NM_144610&quot;) 
    #
    # bases.upstream &amp;lt;- 250
    #  
    # result &amp;lt;- extract.five.utr.sequence(gene.list.refseq,bases.upstream)                             
                             
    # *****************************************************************
    # LOAD THE LIBRARIES
    # ***************************************************************** 
    cat(&quot;Loading libraries&quot;,date(),&quot;\n&quot;)   
    # human genome DNA sequences
        library(BSgenome.Hsapiens.UCSC.hg19) 
    # human genome wide annotations 
        library(org.Hs.eg.db) 
    # load IDs, symbol and gene descriptions
        refseq.id &amp;lt;- toTable(org.Hs.egREFSEQ)
        symbol &amp;lt;- toTable(org.Hs.egSYMBOL)
        gene.description &amp;lt;- toTable(org.Hs.egGENENAME)
    
    # *****************************************************************
    # LOAD THE UPSTREAM SEQUENCES
    # *****************************************************************

    if(number.bases.upstream &amp;lt;= 1000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream &amp;lt;- Hsapiens$upstream1000
    }
    
    if(number.bases.upstream &amp;gt; 1000 &amp;amp;&amp;amp; number.bases.upstream &amp;lt;= 2000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream &amp;lt;- Hsapiens$upstream2000
    }
    
    if(number.bases.upstream &amp;gt; 2000 &amp;amp;&amp;amp; number.bases.upstream &amp;lt;= 5000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream &amp;lt;- Hsapiens$upstream5000
    }        
    
    if(number.bases.upstream &amp;gt; 5000){
        cat(&quot;ERROR: The number of bases upstream of 5&#39;UTR is too large (max value = 5000)\n&quot;)
        return(0);
    } 
    
    if(number.bases.upstream &amp;lt; 1){
        cat(&quot;ERROR: The number of bases upstream of 5&#39;UTR cannot be less than 1 nucleotide (min value = 1)\n&quot;)
        return(0);
    }                   
        
    # extract the names of each gene
    names.genes.upstream &amp;lt;- names(upstream)

    # extract the refseq of each gene  
    refseq.upstream.id &amp;lt;- vector(&quot;character&quot;,length(names.genes.upstream))
    for(i in 1:length(names.genes.upstream)){
        refseq.upstream.id[i] &amp;lt;- gsub(&quot;, &quot;,&quot;_&quot;,toString(unlist(strsplit(names.genes.upstream[i],
                                      &quot;\\_&quot;))[c(1,2)],sep=&quot;&quot;))
    }    
    names(refseq.upstream.id) &amp;lt;- 1:length(refseq.upstream.id)

    # *****************************************************************
    # CREATE OUTPUT OBJECT
    # *****************************************************************    
    output.params &amp;lt;- c(&quot;entrez.id&quot;,&quot;symbol&quot;,&quot;gene.description&quot;,&quot;five.UTR.sequence&quot;)
                        
    output.list &amp;lt;- list()
    for(i in 1:length(gene.list.refseq)){
        output.list[[i]] &amp;lt;- list()
        for(j in 1:length(output.params)){
            output.list[[i]][[j]] &amp;lt;- list()
        }
        names(output.list[[i]]) &amp;lt;- output.params
    }
    names(output.list) &amp;lt;- gene.list.refseq

    # *****************************************************************
    # LET THE COMPUTER DECIDE
    # *****************************************************************     
    cat(&quot;Extracting data&quot;,date(),&quot;\n&quot;)
    
    for(i in 1:length(gene.list.refseq)){
        cat(rep(&quot;.&quot;,1))

        # extract the entrez.id of each gene in geneList 
        output.list[[i]]$entrez.id &amp;lt;- refseq.id[which(refseq.id$accession==gene.list.refseq[i]),]$gene_id

        # extract the symbol of each gene in geneList 
        output.list[[i]]$symbol &amp;lt;- symbol[which(symbol==output.list[[i]]$entrez.id),]$symbol
        
        # extract the gene description of each gene in geneList 
        output.list[[i]]$gene.description &amp;lt;- gene.description[which(gene.description==output.list[[i]]$entrez.id),]$gene_name  
        
        # Return the index of the target genes
        index.target.gene &amp;lt;- sequence &amp;lt;- NA;
        index.target.gene &amp;lt;- as.numeric(names(refseq.upstream.id[which(refseq.upstream.id==gene.list.refseq[i])]))
        
        # NOTE: some genes have repeated RefSeq.id which corresponds to the same gene in other Chromosomes.
        # by using index.target.gene[1] we are selecting only the first RefSeq entry.
        sequence &amp;lt;- upstream[index.target.gene[1]]
        sequence.length &amp;lt;- nchar(sequence)
        start.position &amp;lt;- ((sequence.length-number.bases.upstream)+1)
        sequence &amp;lt;- substring(toString(sequence),1:sequence.length,1:sequence.length)
        sequence &amp;lt;- sequence[start.position:sequence.length]
        output.list[[i]]$five.UTR.sequence &amp;lt;- sequence
    }
    cat(&quot;\n&quot;)
    cat(&quot;Computations DONE&quot;,date(),&quot;\n&quot;)
    return(output.list)
}
&lt;/code&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
Benjamin.&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/1188592394332664477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/07/extracting-upstream-regions-of-refseq.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1188592394332664477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1188592394332664477'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/07/extracting-upstream-regions-of-refseq.html' title='Extracting upstream regions of a RefSeq human gene list in R using Bioconductor'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-1526217454312582529</id><published>2012-06-23T11:17:00.001-05:00</published><updated>2012-06-23T11:17:21.936-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Handling large FASTA sequence datasets in R:  Shuffle and retrieve &quot;n&quot; number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
When you are working with large FASTA datasets is probable to find out that the sequences are in sort of a mixed quality (obviously, depending on your scientific question),&lt;br /&gt;
&lt;br /&gt;
I mean for example, imagine that you retrieve the whole collection of exons of a given organism and suppose that the FASTA file is 50mb and there are included ~50,000 DNA sequences but, if you look at them, you may find that there are sequences much larger than others and others will be probably 50 bases long.&lt;br /&gt;
&lt;br /&gt;
Well that&#39;s what I mean with mixed quality and therefore, a sort of filter might be very helpful because you may find more informative to put a threshold and say &quot;hey, &amp;nbsp;from the pool of ~50K sequences I only want 5K sequences randomly chosen (from the entire FASTA dataset) and every sequence must have a fixed length, say 1000 bases long&quot;.&lt;br /&gt;
&lt;br /&gt;
That&#39;s why I wrote some code in R language, it&#39;s a function titled &quot;&lt;span style=&quot;color: orange;&quot;&gt;shuffleAndExtract&lt;/span&gt;&quot; and you can download the example set and the source &lt;a href=&quot;https://dl.dropbox.com/u/17816910/R/handleFastaDatasets.tar.gz&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Here is the description of the function:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;        handleFastaDatasets.R&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;shuffleAndExtract:&amp;nbsp;This function in R is designed to open a fasta file dataset, shuffle the sequences and extract the desired sequences wanted by the user to generate a new dataset of fixed size (number of required sequences) and with the same length for each sequence.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
And, after you download the example set and the R source file, you can try to run a very simple example:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;NOTE&lt;/span&gt;: my implementation depends on the function &quot;&lt;span style=&quot;color: orange;&quot;&gt;seqinr&lt;/span&gt;&quot;, if is not installed, you may do this before all the magic begin with a simple install.packages(&quot;seqinr&quot;).&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # run example:&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; source(&quot;handleFastaDatasets.R&quot;)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: lime;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; shuffleAndExtract(&quot;example.fasta&quot;,1000,200)&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
The arguments of the function are:&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;span style=&quot;color: orange;&quot;&gt;inputFastaFile&lt;/span&gt;: name of the input fasta file&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;span style=&quot;color: orange;&quot;&gt;numberOfoutputSeqs&lt;/span&gt;: number of desired sequences in the output file&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;span style=&quot;color: orange;&quot;&gt;lengthOutputSeqs&lt;/span&gt;: fixed length of every sequence in the output file&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;span style=&quot;color: orange;&quot;&gt;initialPos&lt;/span&gt;: Position where the new window sizing will begin, default = 1&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;span style=&quot;color: orange;&quot;&gt;outputFileName&lt;/span&gt;: name of the output file, by default will be (e.g):&amp;nbsp; &quot;inputFastaFile.fasta.output.fasta&quot;&lt;br /&gt;
&lt;br /&gt;
CODE:&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSeO1ffr-IA2p7rIyuMmAiMDJW0-yI0DECylRbDyJmgky06C_0iJGqD1V-R4ZLSZzRfG8RL7g5vMst6-KwLHcHLiVG6C0d2QqopIiwUnoGVcQSDGPhyphenhyphen-yK0gJfqv07GfxDnQc1PtTlAdPs/s320/codebg.gif); background: #48214A; border: 2px dashed #CCCCCC; color: white; font-family: arial; font-size: 15.6px; height: 600px; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: white; word-wrap: normal;&quot;&gt;################################################################################

shuffleAndExtract &amp;lt;- function(inputFastaFile,
                              numberOfoutputSeqs,
                              lengthOutputSeqs,
                              initialPos=1,
                              outputFileName = paste(inputFastaFile,&quot;output&quot;,
                                                     &quot;fasta&quot;,sep=&quot;.&quot;)
                              ){

    #
    #    handleFastaDatasets.R   
    #
    #    This function in R is designed to open a fasta file dataset, shuffle the 
    #    sequences and extract the desired sequences wanted by the user to generate 
    #    a new dataset of fixed size (number of required sequences) and with the 
    #    same length for each sequence. 
    #
    #    Author: Benjamin Tovar
    #    Date: 22/JUNE/2012
    #
    ################################################################################

    #    run example:
    #    source(&quot;handleFastaDatasets.R&quot;)
    #    shuffleAndExtract(&quot;example.fasta&quot;,1000,200)

    # inputFastaFile: name of the input fasta file
    # numberOfoutputSeqs: number of desired sequences in the output file 
    # lengthOutputSeqs: fixed length of every sequence in the output file
    # initialPos: Position where the new window sizing will begin, default = 1
    # outputFileName: name of the output file, by default will be (e.g):
    #                 &quot;inputFastaFile.fasta.output.fasta&quot;

    cat(&quot;*** Starting computations |&quot;,date(),&quot;****\n&quot;)    
        
    # Load the seqinr package
    require(seqinr)

    # Load the large seq and not shuffled dataset
    inputSeqs &amp;lt;- read.fasta(inputFastaFile)

    cat(&quot;\tProcessing for&quot;,length(inputSeqs),&quot;sequences |&quot;,date(),&quot;\n&quot;)    

    # Extract the length of every sequence and store the results in a vector.
    inputSeqsSize &amp;lt;- rep(NA,length(inputSeqs))
    for(i in 1:length(inputSeqs)){ 
        inputSeqsSize[i] &amp;lt;- summary(inputSeqs[[i]])$length
    }

    # Extract the index of the sequences which are longer than threshold
    inputSeqsLongSeqIndex &amp;lt;- which(inputSeqsSize &amp;gt; (lengthOutputSeqs+initialPos))

    # randomly pick numberOfoutputSeqs indexes that will be used to create 
    # the output dataset 
    inputSeqsIndex &amp;lt;- sample(inputSeqsLongSeqIndex,numberOfoutputSeqs,rep=F)
    
    # Store the Fasta header of each selected sequence in a vector
    inputSeqsIndexNames &amp;lt;- rep(NA,numberOfoutputSeqs)
    
    # create output object 
    outputSeqs &amp;lt;- list()
    for(i in 1:numberOfoutputSeqs){
        # Extract the fasta headers 
        inputSeqsIndexNames[i] &amp;lt;- attr(inputSeqs[[inputSeqsIndex[i]]],&quot;name&quot;)
        # Extract the sequence
        outputSeqs[[i]] &amp;lt;- inputSeqs[[inputSeqsIndex[i]]][initialPos:((initialPos+lengthOutputSeqs)-1)]
    }
 
    # Export the sequences in a new fasta file 
    write.fasta(outputSeqs,inputSeqsIndexNames,outputFileName)
    
    cat(&quot;\n*** DONE |&quot;,date(),&quot;****\n&quot;)
}


# 22 Jun 2012 | Benjamin Tovar
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
Benjamin&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/1526217454312582529/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/06/handling-large-fasta-sequence-datasets.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1526217454312582529'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/1526217454312582529'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/06/handling-large-fasta-sequence-datasets.html' title='Handling large FASTA sequence datasets in R:  Shuffle and retrieve &quot;n&quot; number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-2723234217478647689</id><published>2012-05-06T00:04:00.000-05:00</published><updated>2012-05-06T00:06:22.484-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Random"/><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>First anniversary of my blog</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Exactly today &lt;span style=&quot;color: orange;&quot;&gt;(5th/May)&lt;/span&gt;, a year ago I decided to start with a project, mostly motivated in a way that I feel that probably some of my post will help someone in the future&lt;span style=&quot;color: orange;&quot;&gt; (bots included)&lt;/span&gt;.&lt;br /&gt;
&lt;br /&gt;
My motivation comes from, because my knowledge about Biology is ~80% and the other ~20% is about Computer Science, I thought that it will be a very cool idea to share some code, ideas, tutorials and even random messages for people that have not been surrounded with bunches of code in front of a &lt;span style=&quot;color: lime;&quot;&gt;UNIX&lt;/span&gt; Terminal.&lt;br /&gt;
&lt;br /&gt;
I promise to still publishing posts, and in the same way I still continue my own training, the&amp;nbsp;awesomeness&amp;nbsp;:P of the posts will increase too.&lt;br /&gt;
&lt;br /&gt;
Wishing you the best for every person which has visited my blog. &lt;span style=&quot;color: orange;&quot;&gt;Thank you so much and keep coding&lt;/span&gt;!&lt;br /&gt;
&lt;br /&gt;
Benjamin.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/2723234217478647689/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/05/first-anniversary-of-my-blog.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2723234217478647689'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2723234217478647689'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/05/first-anniversary-of-my-blog.html' title='First anniversary of my blog'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-7561402439345036279</id><published>2012-05-02T20:05:00.000-05:00</published><updated>2012-05-02T20:16:40.003-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="HMM"/><category scheme="http://www.blogger.com/atom/ns#" term="Markov Chains"/><category scheme="http://www.blogger.com/atom/ns#" term="Stochastic messages"/><title type='text'>My poster &quot;Mining Biosequences&quot; (Spanish) at the Scientific Poster Contest due to the BSc in Genomic Biotechnology anniversary at Faculty of Biological Sciences in UANL</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
My poster is an introduction to HMM using some awesome material from Eddy S (among other awesome authors)., hope you like it.&lt;br /&gt;
&lt;br /&gt;
Click the image to download the poster.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://dl.dropbox.com/u/17816910/files/posterBen.pdf&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;400&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDJxTwtPz2ifpUNA43n_5rzTzepDQ6po2GAHCRZ0BgyGORvKwG-I3r9Qhzzv3v2MRU7bpDotdk9daeXdmhv4GWGQd5bo38Nyco-8u4zsbPlKfjFOID7jPkMZEFkndMfP_4xbap-F9PDDIZ/s400/posterSmall.png&quot; width=&quot;300&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
Benjamin&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/7561402439345036279/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/05/my-poster-mining-biosequences-spanish.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7561402439345036279'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7561402439345036279'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/05/my-poster-mining-biosequences-spanish.html' title='My poster &quot;Mining Biosequences&quot; (Spanish) at the Scientific Poster Contest due to the BSc in Genomic Biotechnology anniversary at Faculty of Biological Sciences in UANL'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDJxTwtPz2ifpUNA43n_5rzTzepDQ6po2GAHCRZ0BgyGORvKwG-I3r9Qhzzv3v2MRU7bpDotdk9daeXdmhv4GWGQd5bo38Nyco-8u4zsbPlKfjFOID7jPkMZEFkndMfP_4xbap-F9PDDIZ/s72-c/posterSmall.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-5374545724376066055</id><published>2012-04-21T15:40:00.000-05:00</published><updated>2012-04-21T15:40:06.742-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Calculate the average distance between a given DNA motif within DNA sequences in R</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Suppose&amp;nbsp;that we want to calculate the expected distance of a DNA motif within a DNA target sequence, if we know the composition bias or the probability distribution (multinomial model) we can compute it just fine.&lt;br /&gt;
&lt;br /&gt;
Download the R code &amp;lt;- &lt;a href=&quot;http://dl.dropbox.com/u/17816910/R/motifOccurrence.R&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;FIRST PART&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
For example, supose that we want to compute the expected distance of the motif &lt;span style=&quot;color: orange;&quot;&gt;&quot;GATC&quot;&lt;/span&gt; in a sequence composed of 10,000 bases given that the whole sequence follows a probability distribution of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3.&lt;br /&gt;
&lt;br /&gt;
So open an R prompt and load the functions:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
source(&quot;motifOccurrence.R&quot;)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
Then, enter the initial values:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
lengthSeq &amp;lt;- 10000&lt;br /&gt;
motif &amp;lt;- c(&quot;G&quot;,&quot;A&quot;,&quot;T&quot;,&quot;C&quot;)&lt;br /&gt;
probDistr &amp;lt;- c(0.3,0.2,0.2,0.3)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
And finally, compute the average expected distance of every occurrence of the motif inside the target sequence. Using the following formula (this computation corresponds to the&lt;span style=&quot;color: orange;&quot;&gt; &quot;computeExpectedDistance&quot;&lt;/span&gt; function of the R script&lt;span style=&quot;color: orange;&quot;&gt; &quot;motifOccurrence.R&quot;&lt;/span&gt;):&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; expectedDistance = lengthDNAseq/(lengthDNAseq*p))&lt;/span&gt;&lt;br /&gt;
&amp;nbsp; &lt;br /&gt;
where&lt;span style=&quot;color: orange;&quot;&gt; &quot;p&quot;&lt;/span&gt; stands for the joint probability of the motif, in other words: p = p(GATC) = p(G)*p(A)*p(T)*p(C) = 0.2*0.3*0.3*0.2 = 0.0036&lt;br /&gt;
&lt;br /&gt;
To easily compute the expected distance in R, type:&lt;br /&gt;
&amp;nbsp; &lt;br /&gt;
&lt;code&gt;
expectedDist &amp;lt;- computeExpectedDistance(probDistr,lengthSeq,motif)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
As you see, it returns the value of &lt;span style=&quot;color: orange;&quot;&gt;&quot;277&quot;&lt;/span&gt;, this number means that, for the &lt;span style=&quot;color: orange;&quot;&gt;&quot;GATC&quot;&lt;/span&gt; motif inside a sequence of 10,000 bases with a composition bias of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3 we may expect a distance of 277 bases between each &lt;span style=&quot;color: orange;&quot;&gt;&quot;GATC&quot;&lt;/span&gt;.&lt;br /&gt;
&lt;br /&gt;
Or, a little more graphic:&lt;br /&gt;
&lt;br /&gt;
277 bases | GATC | 277 bases | GATC | .....&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;SECOND PART&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Another way to compute this (even though it involves more computations) we can simulate X number of DNA sequences with a fixed length with an equal probability distribution per sequence and extract the coordinates of the motif within each sequence to finally compute the average distance of the motif.&lt;br /&gt;
&lt;br /&gt;
There is a function titled &lt;span style=&quot;color: orange;&quot;&gt;&quot;iterateComputeDistance&quot;&lt;/span&gt; to do the calculations for you. Add the next parameter to the R environment:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
iterations &amp;lt;- 100&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
Compute the average distance of the &lt;span style=&quot;color: orange;&quot;&gt;&quot;GATC&quot;&lt;/span&gt; motif within 100 DNA sequences (the other parameters remain equal)&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
expectedDistWithSimSeqs &amp;lt;- iterateComputeDistance(probDistr,lengthSeq,motif,iterations)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
As we expected, the results of the two approaches are highly similar (ouuu yeah!)&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;THIRD PART&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
But, what happens when we already have a sequence and want to know the expected distance of that motif inside of it?.&lt;br /&gt;
&lt;br /&gt;
Just like &quot;Hey dude, I have an E.coli plasmid DNA sequence and want to know the average distance of the GATC motif&quot;.&lt;br /&gt;
&lt;br /&gt;
Lets test using the sequence &lt;span style=&quot;color: orange;&quot;&gt;&quot;Escherichia coli 2078 plasmid pQNR2078 complete sequence&quot;&lt;/span&gt; &amp;lt;- http://www.ncbi.nlm.nih.gov/nuccore/HE613857.1&lt;br /&gt;
&lt;br /&gt;
Ok, use the &lt;span style=&quot;color: orange;&quot;&gt;&quot;ape&quot;&lt;/span&gt; library to import the sequence to the R environment (if this library is not installed, type: &lt;span style=&quot;background-color: #e06666;&quot;&gt;install.packages(&quot;ape&quot;)&lt;/span&gt;)&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;NOTE&lt;/span&gt;: the GenBank sequences are in lowecase, so it will be needed to use a motif in lowercase to do the right computations.&lt;br /&gt;
&lt;br /&gt;
Import the library:&lt;br /&gt;
&lt;code&gt;
require(&quot;ape&quot;)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
Import the sequence:&lt;br /&gt;
&lt;code&gt;
plasmid &amp;lt;- read.GenBank(&quot;HE613857.1&quot;)&lt;br /&gt;
plasmidDNA &amp;lt;- as.character.DNAbin(plasmid)&lt;br /&gt;
plasmidDNA &amp;lt;- plasmidDNA[[1]]&lt;br /&gt;
motifEcoli &amp;lt;- c(&quot;g&quot;,&quot;a&quot;,&quot;t&quot;,&quot;c&quot;)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
Get the coordinates:&lt;br /&gt;
&lt;code&gt;plasmidDNAcoord &amp;lt;- coordMotif(plasmidDNA,motifEcoli)&lt;/code&gt;&lt;br /&gt;
Get the average distance between the motif occurrences.&lt;br /&gt;
&lt;code&gt;
plasmidDNAmotifDistance &amp;lt;- computeDistance(plasmidDNAcoord)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
&lt;/code&gt;&lt;span style=&quot;color: orange;&quot;&gt;
&amp;nbsp; &amp;nbsp;&amp;gt; plasmidDNAmotifDistance&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp;[1] 270&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Compute the number of occurrences of the motif among the plasmid (the result is 151 occurrences):&lt;br /&gt;
&lt;br /&gt;
The number of occurrences of the motif in R is:&lt;br /&gt;
&lt;code&gt;
(length(plasmidDNAcoord)-1)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;
So, the main distance between the motif &lt;span style=&quot;color: orange;&quot;&gt;&quot;gatc&quot;&lt;/span&gt;&amp;nbsp;finally is 270 bases and we are done :D&lt;br /&gt;
&lt;br /&gt;
CODE:
&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSeO1ffr-IA2p7rIyuMmAiMDJW0-yI0DECylRbDyJmgky06C_0iJGqD1V-R4ZLSZzRfG8RL7g5vMst6-KwLHcHLiVG6C0d2QqopIiwUnoGVcQSDGPhyphenhyphen-yK0gJfqv07GfxDnQc1PtTlAdPs/s320/codebg.gif); background: #48214A; border: 2px dashed #CCCCCC; color: white; font-family: arial; font-size: 15.6px; height: 600px; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: white; word-wrap: normal;&quot;&gt;#
#    Script: motifOccurrence.R
#    Author: Benjamin Tovar
#    Date: 21/April/2012
#
################################################################################

#                       ############
#                        FUNCTIONS:
#                       ############

##############################################################################
iterateComputeDistance &amp;lt;- function(multinomialDNAmodel,
                                   lengthDNAseq,
                                   motif,
                                   numberOfIterations){

    # This function returns the mean distance 
    # of a given motif given X number of DNA sequences given a multinomial model
    # (probability distribution of each base).
    
    # So, it will generate X number of DNA sequences using a given 
    # probability distribution and then it will compute the distance among 
    # that mofit within the total set of sequences to finally returns 
    # the average distance of the motif.
    
    result &amp;lt;- rep(NA,numberOfIterations)
    for(i in 1:numberOfIterations){
        currentGenome &amp;lt;- NA
        currentCoordinatesOfMotif &amp;lt;- NA        
        currentSequence &amp;lt;- sample(c(&quot;A&quot;,&quot;C&quot;,&quot;G&quot;,&quot;T&quot;),
                                    lengthDNAseq,rep=T,
                                    prob=multinomialDNAmodel)
        currentCoordinatesOfMotif &amp;lt;- coordMotif(currentSequence,motif)
        result[i] &amp;lt;- computeDistance(currentCoordinatesOfMotif)
        cat(&quot; *** Iteration number: &quot;,i,&quot; completed *** | average distance = &quot;
            ,result[i],&quot;\n&quot;)  
    }
    result &amp;lt;- trunc(mean(result))
    cat(&quot; \n*** Computation status: DONE ***\n\n&quot;)
    return(result)  
}
##############################################################################
coordMotif &amp;lt;- function(targetSequence,motif){

    # This function returns the coordinates of the motif of study in a target 
    # DNA sequence. In other words, if I found the motif, tell me exactly in
    # which position of the DNA sequence is.
    
    lengthMotif &amp;lt;- length(motif)
    lengthTargetSeq &amp;lt;- (length(targetSequence)-lengthMotif)
    motif &amp;lt;- toString(motif)
    motif &amp;lt;- gsub(&quot;, &quot;,&quot;&quot;,motif)
    res &amp;lt;- 1    
    for(i in 1:lengthTargetSeq){
        currentTargetSeq &amp;lt;- targetSequence[i:(i+(lengthMotif)-1)]
        currentTargetSeq &amp;lt;- toString(currentTargetSeq)
        currentTargetSeq &amp;lt;- gsub(&quot;, &quot;,&quot;&quot;,currentTargetSeq)
        if(currentTargetSeq == motif){
            res[(length(res)+1)] &amp;lt;- i
        }
    }
    return(res)
}
##############################################################################
computeDistance &amp;lt;- function(coordinatesOfMotif){
    
    # This function returns the mean distance 
    # of a given motif given its coordinates within a target DNA sequence.
    # In other words, If I already got a list with the coordinates where the 
    # motif is inside a DNA sequence, tell me the average distance between
    # this coordinates to get the expected distance of that motif.

    currentDistance &amp;lt;- rep(NA,(length(coordinatesOfMotif)-1))
    lengthCoord &amp;lt;- length(currentDistance)
    for(i in 1:lengthCoord){
        currentDistance[i] &amp;lt;- coordinatesOfMotif[i+1]-coordinatesOfMotif[i]
    }
    res &amp;lt;- trunc(mean(currentDistance))
    return(res)   
}
##############################################################################
computeExpectedDistance &amp;lt;- function(multinomialModel,
                                    lengthDNAseq,
                                    motif){
                                    
    # This function computes the expected distance of a given motif in a DNA
    # sequence given its multinomial model (probability distribution of 
    # each base)
    
    # Convert the motif into an index                                       
    motifIndex &amp;lt;- gsub(&quot;A&quot;,1,motif); motifIndex &amp;lt;- gsub(&quot;C&quot;,2,motifIndex)
    motifIndex &amp;lt;- gsub(&quot;G&quot;,3,motifIndex); motifIndex &amp;lt;- gsub(&quot;T&quot;,4,motifIndex)
    motifIndex &amp;lt;- as.numeric(motifIndex)
    # Compute p value of the motif given the multinomial model
    p &amp;lt;- rep(NA,length(motif))
    for(i in 1:length(motifIndex)){
        p[i] &amp;lt;- multinomialModel[motifIndex[i]]
    }    
    p &amp;lt;- prod(p)
    result &amp;lt;- trunc(lengthDNAseq/(lengthDNAseq*p))
    return(result)
}                                   
##############################################################################

# Benjamin
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
Benjamin&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/5374545724376066055/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/calculate-average-distance-between.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/5374545724376066055'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/5374545724376066055'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/calculate-average-distance-between.html' title='Calculate the average distance between a given DNA motif within DNA sequences in R'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-2726154655059215419</id><published>2012-04-13T22:57:00.001-05:00</published><updated>2012-04-14T10:32:25.232-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="Markov Chains"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Introduction to Markov Chains and modeling DNA sequences in R</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Markov chains are probabilistic models which can be used for the modeling of sequences given a probability distribution and then, they are also very useful for the characterization of certain parts of a DNA or protein string given for example, a bias towards the AT or GC content.&lt;br /&gt;
&lt;br /&gt;
For now, I am going to introduce how to build our own Markov Chain of zero order and first order in R programming language. The definition of a zero order Markov Chain relies in that, the current state (or nucleotide) does not depends on the previous state, so there&#39;s no &quot;memory&quot; and every state is untied.&lt;br /&gt;
&lt;br /&gt;
For the first order Markov Chain the case is different because the current state actually depends only on the previous state. Given that points clear, a second order Markov Model will be a model that reflects that the current state only depends on the previous two states before it (This model will be useful for the study of codons, given that they are substrings of 3 nucleotides long).&lt;br /&gt;
&lt;br /&gt;
Download the scripts (R script and graphviz scripts) &lt;a href=&quot;http://dl.dropbox.com/u/17816910/R/scriptsIntrMarkovChains.tar.gz&quot;&gt;here&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Example of a Markov Chain of zero order (the current nucleotide is&amp;nbsp;totally independent of the previous nucleotide).&lt;br /&gt;
&lt;br /&gt;
The multinomial model is:&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;p(A)+p(C)+p(G)+p(T) = 1.0&lt;br /&gt;0.4 +0.1 +0.1 +0.4  = 1.0&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Example of the structure of the zero order model:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYxtTJ7cJu-Zppss4C_EbOlHfM0cu73tZW4nBnejzdxTUbXdzQ-mDBT8_x7OXyUNKCJC9mZR2HHuF8MnVeNQz9G1_DFGkx4EEOeKxLDeueOIbTSKhxFYpFa1UUpbo_BXYCFL4PZBFmRMqR/s1600/zero.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;351&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYxtTJ7cJu-Zppss4C_EbOlHfM0cu73tZW4nBnejzdxTUbXdzQ-mDBT8_x7OXyUNKCJC9mZR2HHuF8MnVeNQz9G1_DFGkx4EEOeKxLDeueOIbTSKhxFYpFa1UUpbo_BXYCFL4PZBFmRMqR/s400/zero.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example of a Markov Chain of first order (the current nucleotide only&amp;nbsp;depends on the previous nucleotide).&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The multinomial model per base is:&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; So:&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; 0.6 + 0.1 + 0.1 + 0.2 &amp;nbsp;= 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; 0.1 + 0.5 + 0.3 + 0.1 &amp;nbsp;= 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; 0.05+ 0.2 + 0.7 + 0.05 = 1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&amp;nbsp; &amp;nbsp; 0.4 + 0.05+0.05 + 0.5 &amp;nbsp;= 1.0&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example of the structure of the first order model:&lt;br /&gt;
&lt;div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtk81CFCpTZCgHSSVjtVFJr3rjeQV811oQk26WIsl1bn062NpdeB8fkLV_wH1aeCIOO4gv5sfwAnGkIk-3ueqiycCyUKEGuEYXaBT0yQ1yfl66Rh7v6NC21KCITxs-iqLPOB_BW6XE19qm/s1600/first.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;347&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtk81CFCpTZCgHSSVjtVFJr3rjeQV811oQk26WIsl1bn062NpdeB8fkLV_wH1aeCIOO4gv5sfwAnGkIk-3ueqiycCyUKEGuEYXaBT0yQ1yfl66Rh7v6NC21KCITxs-iqLPOB_BW6XE19qm/s400/first.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
Code:&amp;nbsp;&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSeO1ffr-IA2p7rIyuMmAiMDJW0-yI0DECylRbDyJmgky06C_0iJGqD1V-R4ZLSZzRfG8RL7g5vMst6-KwLHcHLiVG6C0d2QqopIiwUnoGVcQSDGPhyphenhyphen-yK0gJfqv07GfxDnQc1PtTlAdPs/s320/codebg.gif); background: #48214A; border: 2px dashed #CCCCCC; color: white; font-family: arial; font-size: 15.6px; height: 600px; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: white; word-wrap: normal;&quot;&gt;#    Author: Benjamin Tovar
#    Date: 13/April/2012
#
#    Example of a Markov Chain of zero order (the current nucleotide is
#    totally independent of the previous nucleotide).

#    The multinomial model is:
#    p(A)+p(C)+p(G)+p(T) = 1.0
#    0.4 +0.1 +0.1 +0.4  = 1.0

# Define the DNA alphabet that will be used to put names to the objects
alp &amp;lt;- c(&quot;A&quot;,&quot;C&quot;,&quot;G&quot;,&quot;T&quot;)
# Create the vector that represents the probability distribution of the model
zeroOrder &amp;lt;- c(0.4,0.1,0.1,0.4)
# Put the name of reference of each base 
names(zeroOrder) &amp;lt;- alp 
# Create a sequence of 1000 bases using this model.
zeroOrderSeq &amp;lt;- sample(alp,1000,rep=T,prob=zeroOrder)

# ***** Study the composition bias of the sequence *****
# We wil use the &quot;seqinr&quot; package.
# For the installation of the package, type:
# install.packages(&quot;seqinr&quot;)
# Then, load the package:
require(&quot;seqinr&quot;)

# Count the frequency of each base 
# in the sequence using the &quot;count&quot; function
zeroOrderFreq &amp;lt;- count(zeroOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the &quot;count&quot; function
zeroOrderFreqDin &amp;lt;- count(zeroOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(zeroOrderFreq,col=1:4,main=&quot;Compositional bias of each nucleotide&quot;,
        xlab=&quot;Base&quot;,
        ylab=&quot;Base proportion&quot;)
barplot(zeroOrderFreqDin,col=rainbow(16),
        main=&quot;Compositional bias of each dinucleotide&quot;,
        xlab=&quot;Base&quot;,
        ylab=&quot;Base proportion&quot;)

# &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;

#    Example of a Markov Chain of first order (the current nucleotide only
#    depends on the previous nucleotide).

#    The multinomial model per base is:
#    p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
#    p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
#    p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
#    p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0        

#    So:
#    0.6 + 0.1 + 0.1 + 0.2  = 1.0
#    0.1 + 0.5 + 0.3 + 0.1  = 1.0
#    0.05+ 0.2 + 0.7 + 0.05 = 1.0
#    0.4 + 0.05+0.05 + 0.5  = 1.0
    
    
# Create the matrix that will store the probability distribution given 
# a certain nucleotide:
firstOrderMat &amp;lt;- matrix(NA,nr=4,nc=4)
# Put names to the 2 dimensions of the matrix 
colnames(firstOrderMat) &amp;lt;- rownames(firstOrderMat) &amp;lt;- alp 
# Add the probability distribution per base:
firstOrderMat[1,] &amp;lt;- c(0.6,0.1,0.1,0.2)  # Given an A in the 1st position
firstOrderMat[2,] &amp;lt;- c(0.1,0.5,0.3,0.1)  # Given a  C in the 1st position
firstOrderMat[3,] &amp;lt;- c(0.05,0.2,0.7,0.05)# Given a  G in the 1st position
firstOrderMat[4,] &amp;lt;- c(0.4,0.05,0.05,0.5)# Given a  T in the 1st position

# Now we got a matrix
#    &amp;gt; firstOrderMat
#         A    C    G    T
#    A 0.60 0.10 0.10 0.20
#    C 0.10 0.50 0.30 0.10
#    G 0.05 0.20 0.70 0.05
#    T 0.40 0.05 0.05 0.50

# In order to continue, we need an initial probability distribution to know
# which base is the most probable to start up the sequence.
inProb &amp;lt;- c(0.4,0.1,0.1,0.4); names(inProb) &amp;lt;- alp
# So, the sequence will have a 40% to start with an A or a T and 10% with C or G

# Create a function to generate the sequence.
# NOTE: To load the function to the current environment, just copy
# the entire function and paste it inside the R prompt.

generateFirstOrderSeq &amp;lt;- function(lengthSeq,
                                  alphabet,  
                                  initialProb,
                                  firstOrderMatrix){
#    lengthSeq = length of the sequence
#    alphabet = alphabet that compounds the sequence 
#    initialProb   = initial probability distribution
#    firstOrderMatrix = matrix that stores the probability distribution of a 
#                       first order Markov Chain

    # Construct the object that stores the sequence
    outputSeq &amp;lt;- rep(NA,lengthSeq)
    # Which is the first base:
    outputSeq[1]  &amp;lt;- sample(alphabet,1,prob=initialProb) 
    # Let the computer decide:
    for(i in 2:length(outputSeq)){
        prevNuc &amp;lt;- outputSeq[i-1]    
        currentProb &amp;lt;- firstOrderMatrix[prevNuc,]
        outputSeq[i] &amp;lt;- sample(alp,1,prob=currentProb)
    } 
    cat(&quot;** DONE: Sequence computation is complete **\n&quot;)
    return(outputSeq)
}

# Use the generateFirstOrderSeq function to generate a sequence of 1000 bases 
# long
firstOrderSeq &amp;lt;- generateFirstOrderSeq(1000,alp,inProb,firstOrderMat)

# ***** Study the composition bias of the sequence *****
# We wil use the &quot;seqinr&quot; package.
# For the installation of the package, type:
# install.packages(&quot;seqinr&quot;)
# Then, load the package:
require(&quot;seqinr&quot;)

# Count the frequency of each base 
# in the sequence using the &quot;count&quot; function
firstOrderFreq &amp;lt;- count(firstOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the &quot;count&quot; function
firstOrderFreqDin &amp;lt;- count(firstOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(firstOrderFreq,col=1:4,main=&quot;Compositional bias of each nucleotide&quot;,
        xlab=&quot;Base&quot;,
        ylab=&quot;Base proportion&quot;)
barplot(firstOrderFreqDin,col=rainbow(16),
        main=&quot;Compositional bias of each dinucleotide&quot;,
        xlab=&quot;Base&quot;,
        ylab=&quot;Base proportion&quot;)
        
## Lets plot the 4 plots in one window
    layout(matrix(1:4,nr=2,nc=2))
    # Results from the zero order 
    barplot(zeroOrderFreq,col=1:4,
            main=&quot;Compositional bias of each nucleotide\nZero Order Markov Chain&quot;,
            xlab=&quot;Base&quot;,
            ylab=&quot;Base proportion&quot;)
    barplot(zeroOrderFreqDin,col=rainbow(16),
            main=&quot;Compositional bias of each dinucleotide\nZero Order Markov Chain&quot;,
            xlab=&quot;Base&quot;,
            ylab=&quot;Base proportion&quot;)
    # Results from the first order 
    barplot(firstOrderFreq,col=1:4,
            main=&quot;Compositional bias of each nucleotide\nFirst Order Markov Chain&quot;,
            xlab=&quot;Base&quot;,
            ylab=&quot;Base proportion&quot;)
    barplot(firstOrderFreqDin,col=rainbow(16),
            main=&quot;Compositional bias of each dinucleotide\nFirst Order Markov Chain&quot;,
            xlab=&quot;Base&quot;,
            ylab=&quot;Base proportion&quot;)
        
        
# end.   

     
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
Plot of the zero order model:&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWptzPgE8eQoTFtUF7aN0f8MS6wjeDkhgHwXlGTuMB6rd4IbwFbrKZPSfQUP9rx-AfJJ7xHQl0gJNvU4dCSSHV1Hu1aD5sz2pDs2hbpByIYrnUYrzTJuSadYBfM58MSWbix1wwT_K7WJSj/s1600/zeroOrderPlots.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;322&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWptzPgE8eQoTFtUF7aN0f8MS6wjeDkhgHwXlGTuMB6rd4IbwFbrKZPSfQUP9rx-AfJJ7xHQl0gJNvU4dCSSHV1Hu1aD5sz2pDs2hbpByIYrnUYrzTJuSadYBfM58MSWbix1wwT_K7WJSj/s400/zeroOrderPlots.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
Plot of the first order model&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVldDwkdkuhyS3nY4qdJdUrKcORsa3HZqBK1VVsRJmcvrlj6jPoqsQc3fvcCElZTPCf0TU4HLZ7uoMaihGKr7EXJKmR5NyYbdXln0F8LucX2oDbKe7jHalU4CaKVgoVaoyOg4wuYxKtWaz/s1600/firstOrderPlots.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;347&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVldDwkdkuhyS3nY4qdJdUrKcORsa3HZqBK1VVsRJmcvrlj6jPoqsQc3fvcCElZTPCf0TU4HLZ7uoMaihGKr7EXJKmR5NyYbdXln0F8LucX2oDbKe7jHalU4CaKVgoVaoyOg4wuYxKtWaz/s400/firstOrderPlots.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
All plots&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizXtGJ1jSrfcMYTDsjxyqOAQ10FKT2__CZpTR70RsI4RY1DbYgy_f8JzHJoLSpIV59CN5U7MbfYvwhs_yMAbfcYDDguCJqDmFYBjgKS0lizmBh9z9CFqICPKGBDU-MGl0Fpn_mefqDMHB7/s1600/allPlots.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;189&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizXtGJ1jSrfcMYTDsjxyqOAQ10FKT2__CZpTR70RsI4RY1DbYgy_f8JzHJoLSpIV59CN5U7MbfYvwhs_yMAbfcYDDguCJqDmFYBjgKS0lizmBh9z9CFqICPKGBDU-MGl0Fpn_mefqDMHB7/s400/allPlots.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Benjamin&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/2726154655059215419/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/introduction-to-markov-chains-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2726154655059215419'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2726154655059215419'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/introduction-to-markov-chains-and.html' title='Introduction to Markov Chains and modeling DNA sequences in R'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYxtTJ7cJu-Zppss4C_EbOlHfM0cu73tZW4nBnejzdxTUbXdzQ-mDBT8_x7OXyUNKCJC9mZR2HHuF8MnVeNQz9G1_DFGkx4EEOeKxLDeueOIbTSKhxFYpFa1UUpbo_BXYCFL4PZBFmRMqR/s72-c/zero.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-2090626984764518013</id><published>2012-04-11T21:10:00.003-05:00</published><updated>2012-04-11T21:10:47.382-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Bioinformatics"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Generate artificial DNA or protein sequences in R in a single line of code.</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
To generate an artificial DNA sequence of &amp;nbsp;&quot;n&quot; bases long with a fixed composition bias in just one line of code, just open your R prompt and type:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
seqX &amp;lt;- sample(c(&quot;A&quot;,&quot;C&quot;,&quot;G&quot;,&quot;T&quot;),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
As you see, the &lt;span style=&quot;color: orange;&quot;&gt;alphabet&lt;/span&gt; is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the &lt;span style=&quot;color: orange;&quot;&gt;length&lt;/span&gt; of the desired output sequence. &lt;span style=&quot;color: orange;&quot;&gt;rep=TRUE&lt;/span&gt; means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the &lt;span style=&quot;color: orange;&quot;&gt;prob&lt;/span&gt; parameter because it allows you to select the multinomial model (the proportion or &quot;bias&quot; of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the &quot;A&quot; base we got a probability of 0.4, for the &quot;C&quot; base 0.2 and so on.&lt;br /&gt;
&lt;br /&gt;
Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:&lt;br /&gt;
&lt;br /&gt;
Lets extract the proportion of each base along the sequence using the table and the length function:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
freqX &amp;lt;- table(seqX)/length(seqX)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
Now, lets plot it doing:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
barplot(freqX,col=1:4,main=&quot;Compositional bias of seqX&quot;,xlab=&quot;Base&quot;,ylab=&quot;Base proportion&quot;)&lt;br /&gt;
&lt;/code&gt;
&lt;br /&gt;
And finally we got this awesome plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJTtqKNzU_n3USPoUpa2BBe5hPLlwCTnvedYGGuKr9JH79sAnOJfRy1kDDm2Zy674SvxcG7fLFw0jU0AqlMcknoBgcA05v4sszGJqjiDHdMKwoK13jgIbtZD0ojx5C8Gz3V5dUpFMaLEFA/s1600/Rplot.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;374&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJTtqKNzU_n3USPoUpa2BBe5hPLlwCTnvedYGGuKr9JH79sAnOJfRy1kDDm2Zy674SvxcG7fLFw0jU0AqlMcknoBgcA05v4sszGJqjiDHdMKwoK13jgIbtZD0ojx5C8Gz3V5dUpFMaLEFA/s400/Rplot.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
So, the barplot shows that effectively each base is well represented by the multinomial model and our artificial sequence is loyal to it.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
Benjamin&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/2090626984764518013/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/generate-artificial-dna-or-protein.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2090626984764518013'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/2090626984764518013'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/generate-artificial-dna-or-protein.html' title='Generate artificial DNA or protein sequences in R in a single line of code.'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJTtqKNzU_n3USPoUpa2BBe5hPLlwCTnvedYGGuKr9JH79sAnOJfRy1kDDm2Zy674SvxcG7fLFw0jU0AqlMcknoBgcA05v4sszGJqjiDHdMKwoK13jgIbtZD0ojx5C8Gz3V5dUpFMaLEFA/s72-c/Rplot.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-3137567180092281419</id><published>2012-04-11T20:33:00.001-05:00</published><updated>2012-04-11T20:37:54.746-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Epic R is Epic &lt;-  Beginners command reference card</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
This reference card has been written by Tom Short.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;
Click the image to see the R magic:&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://cran.r-project.org/doc/contrib/Short-refcard.pdf&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;280&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZvYV_GvZXvq3irErT1CEeWTgD-DopDeXEhYFSRvPnreylsa4fr1C-UOMATz8lRz8rtwDxqPjkJu1tfpPg33CEXARk7r-8ZLK7_K5g97BaCzhSwFL-1Mw_2rMX4mTF9DlXbornRoIgZRD/s400/Rprompt.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Thank you for your support Tom.&lt;br /&gt;
Benjamin&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/3137567180092281419/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/epic-r-is-epic-beginners-command.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/3137567180092281419'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/3137567180092281419'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/epic-r-is-epic-beginners-command.html' title='Epic R is Epic &lt;-  Beginners command reference card'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZvYV_GvZXvq3irErT1CEeWTgD-DopDeXEhYFSRvPnreylsa4fr1C-UOMATz8lRz8rtwDxqPjkJu1tfpPg33CEXARk7r-8ZLK7_K5g97BaCzhSwFL-1Mw_2rMX4mTF9DlXbornRoIgZRD/s72-c/Rprompt.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-4604551887710867602</id><published>2012-04-06T23:54:00.001-05:00</published><updated>2012-04-06T23:55:57.523-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Random"/><title type='text'>Using the Blogger app for Android..</title><content type='html'>&lt;div&gt;&lt;p&gt;Well, the application has a simple and elegant UI. &lt;/p&gt;
&lt;p&gt;Let&#39;s check it out and see how it works :&lt;u&gt;P&lt;/u&gt;&lt;/p&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/4604551887710867602/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/using-blogger-app-for-android.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4604551887710867602'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4604551887710867602'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/using-blogger-app-for-android.html' title='Using the Blogger app for Android..'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-4147501871237866540</id><published>2012-04-06T16:11:00.000-05:00</published><updated>2012-04-06T16:11:07.176-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Debian"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Install R 2.15 and further versions in Debian Squeeze</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;The last Friday, March 30th, the last stable version of &lt;a href=&quot;http://cran.r-project.org/&quot;&gt;R&lt;/a&gt;, the version 2.15.0 was released.&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div&gt;So, to install it in Debian Squeeze, or in another Distro powered by Debian (I actually use CrunchBang Linux), just follow the same instructions described here for the installation of R 2.14.2 by clicking &lt;a href=&quot;http://tata-box-blog.blogspot.mx/2012/03/install-r-2142-in-debian-squeeze.html&quot;&gt;here&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div&gt;Do not forget to upgrade the packages too.&lt;/div&gt;&lt;div&gt;Benjamin.&lt;/div&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/4147501871237866540/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/install-r-215-and-further-versions-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4147501871237866540'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/4147501871237866540'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/install-r-215-and-further-versions-in.html' title='Install R 2.15 and further versions in Debian Squeeze'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-7280687502969422325</id><published>2012-04-06T10:55:00.002-05:00</published><updated>2012-04-06T10:55:53.588-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Debian"/><category scheme="http://www.blogger.com/atom/ns#" term="Linux"/><category scheme="http://www.blogger.com/atom/ns#" term="Octave"/><category scheme="http://www.blogger.com/atom/ns#" term="Tutorials"/><title type='text'>Install octave 3.6.1 in Debian Squeeze</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;Following the nice instructions found it here &amp;lt;-&amp;nbsp;&lt;a href=&quot;http://verahill.blogspot.mx/2012/02/debian-testing-wheezy-64-compiling.html&quot;&gt;http://verahill.blogspot.mx/2012/02/debian-testing-wheezy-64-compiling.html&lt;/a&gt;&amp;nbsp;, I have successfully installed Octave 3.6.1 in Squeeze (I had to rename some packages names to fit with the Squeeze packages).&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;FIRST STEP: Install the required libraries and dependencies.&lt;/span&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
sudo apt-get install gfortran build-essential&lt;br /&gt;
sudo apt-get install libqhull-dev libpcre++-dev libblas-dev liblapack-dev libreadline-dev&lt;br /&gt;
sudo apt-get install libcurl4-openssl-dev&lt;br /&gt;
sudo apt-get install libfltk1.1-dev&lt;br /&gt;
sudo apt-get install libgraphicsmagick++-dev&lt;br /&gt;
sudo apt-get install libhdf5-serial-dev&lt;br /&gt;
sudo apt-get install libqrupdate-dev&lt;br /&gt;
sudo apt-get install libsuitesparse-dev&lt;br /&gt;
sudo apt-get install &amp;nbsp;glpk gperf flex bison libfontconfig1-dev&lt;br /&gt;
&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;SECOND STEP: Download Octave and extract it.&lt;/span&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
wget ftp://ftp.gnu.org/gnu/octave/octave-3.6.1.tar.gz&lt;br /&gt;
tar -xvf octave-3.6.1.tar.gz&lt;br /&gt;
&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;THIRD STEP: Compile it.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt; cd octave-3.6.1/&lt;br /&gt;
./configure&lt;br /&gt;
make -j3&lt;br /&gt;
&lt;/code&gt;&lt;br /&gt;
where 3 is the number of cores +1 (for me 2 cores)&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;FOURTH STEP: Validate the compiled version:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt; make check&lt;/code&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;&lt;br /&gt;
&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: orange;&quot;&gt;FIFTH STEP: Install Octave&lt;/span&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
sudo make install&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
That&#39;s all. Thanks to &lt;a href=&quot;http://verahill.blogspot.mx/&quot;&gt;Lindqvist&lt;/a&gt; - a blog about Linux and Science. Mostly.&lt;br /&gt;
Benjamin&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/7280687502969422325/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/install-octave-361-in-debian-squeeze.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7280687502969422325'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/7280687502969422325'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/04/install-octave-361-in-debian-squeeze.html' title='Install octave 3.6.1 in Debian Squeeze'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7099891672772539383.post-5462616670608327883</id><published>2012-03-29T23:37:00.000-06:00</published><updated>2012-04-19T15:04:32.106-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Debian"/><category scheme="http://www.blogger.com/atom/ns#" term="Linux"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>UPDATE: Install R 2.14.2 or R 2.15 in Debian Squeeze</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Since not so long, I upgraded &amp;nbsp;the version of R that is included in the Debian repositories (2.11.1 in Squeeze) and I was very happy with it since 2010. But there was some packages that I could not install and then, I decided to upgrade R despite of having to upgrade every package too (I have more than 1 GB in my home R packages folder).&lt;br /&gt;
&lt;br /&gt;
There&#39;s nothing new or tricky about how to upgrade, the awesome guys of &lt;a href=&quot;http://cran.r-project.org/index.html&quot;&gt;CRAN&lt;/a&gt; always brings an answer for our needs.&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;FIRST PART: ADD R BACKPORTS:&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
First, open a Terminal and open the&lt;span style=&quot;color: orange;&quot;&gt; sources.list&lt;/span&gt;&amp;nbsp;file:&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
$ gksudo gedit /etc/apt/sources.list&lt;br /&gt;
&lt;/code&gt;&lt;br /&gt;
Then, add these lines at the bottom of the file (Note, I use the UCLA server, but this can be easily changed taking a look &lt;a href=&quot;http://cran.r-project.org/mirrors.html&quot;&gt;here&lt;/a&gt; for the mirrors):&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;code&gt; ## R BACKPORTS&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;deb http://cran.stat.ucla.edu/bin/linux/debian squeeze-cran/&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;#deb-src http://cran.stat.ucla.edu/bin/linux/debian squeeze-cran/&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
The original set of instructions can be seen &lt;a href=&quot;http://cran.r-project.org/bin/linux/debian/&quot;&gt;here&lt;/a&gt;.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;SECOND PART: RENAME THE R PACKAGES FOLDER:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
There&#39;s a folder where R uses to store the packages we download, just rename it to the current version of R.&lt;/div&gt;
&lt;div&gt;
For example, mine was &quot;&lt;span style=&quot;color: orange;&quot;&gt;2.11&lt;/span&gt;&quot; and then I just renamed it to &quot;&lt;span style=&quot;color: orange;&quot;&gt;2.14&lt;/span&gt;&quot; (or &quot;&lt;span style=&quot;color: orange;&quot;&gt;2.15&lt;/span&gt;&quot;, without quotes) and was inside this path:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Before:&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;/home/benjamin/R/x86_64-pc-linux-gnu-library/2.11&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
After:&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;color: orange;&quot;&gt;/home/benjamin/R/x86_64-pc-linux-gnu-library/2.15&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Remember that&amp;nbsp;some packages also needs to install some files in folders that belongs to the root, so, I would recommend to open R in sudo mode (only if you&#39;re sure about what you&#39;re doing :P) just by executing R this way: &quot;&lt;span style=&quot;color: orange;&quot;&gt;sudo R&lt;/span&gt;&quot; and then, in the R console&amp;nbsp;type :&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;update.packages(checkBuilt=TRUE, ask=FALSE)&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;span style=&quot;background-color: orange;&quot;&gt;THIRD PART: SECURE APT:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
The Debian backports archives on CRAN are signed with the key ID 381BA480, to add them, in a Terminal prompt type:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;
gpg --keyserver pgp.mit.edu --recv-key 381BA480&lt;br /&gt;
gpg -a --export 381BA480 &amp;gt; jranke_cran.asc&lt;br /&gt;sudo &amp;nbsp;apt-key add jranke_cran.asc&lt;br /&gt;&lt;br /&gt;
&lt;/code&gt;&lt;span style=&quot;background-color: orange;&quot;&gt;
FOURTH PART: UPDATE AND UPGRADE R:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
Save the file and you can either enter to Synaptic, update the packages list and then just upgrade the packages or in a terminal type:&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;&lt;br /&gt;sudo apt-get update&lt;/code&gt;&lt;/div&gt;
&lt;div&gt;
&lt;code&gt;sudo apt-get upgrade&lt;/code&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
And that&#39;s all.&lt;/div&gt;
&lt;div&gt;
Benjamin&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tata-box-blog.blogspot.com/feeds/5462616670608327883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tata-box-blog.blogspot.com/2012/03/install-r-2142-in-debian-squeeze.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/5462616670608327883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7099891672772539383/posts/default/5462616670608327883'/><link rel='alternate' type='text/html' href='http://tata-box-blog.blogspot.com/2012/03/install-r-2142-in-debian-squeeze.html' title='UPDATE: Install R 2.14.2 or R 2.15 in Debian Squeeze'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>