<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" gd:etag="W/&quot;DUcEQHk_cSp7ImA9WhRVGEk.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121</id><updated>2012-01-17T16:36:41.749-08:00</updated><category term="dimensionality reduction" /><category term="self-oranizing map" /><category term="curled dimension" /><category term="data clustering" /><category term="Relational Perspective Map" /><category term="self-organizing graph" /><title>Visualizing High Dimensional Data</title><subtitle type="html" /><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;v=2" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>37</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/JamesXLiBlog" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="jamesxliblog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry gd:etag="W/&quot;CEICQHo-fip7ImA9WhRVEEs.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-74389411928920300</id><published>2011-11-30T13:29:00.000-08:00</published><updated>2012-01-08T14:42:41.456-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-01-08T14:42:41.456-08:00</app:edited><title>Quantifying dependency between attributes.</title><content type="html">&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;
&lt;div&gt;
Given two sets of attributes, X = (&lt;em&gt;x&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;em&gt;x&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;, ..., &lt;em&gt;x&lt;/em&gt;&lt;sub&gt;p&lt;/sub&gt;) and Y = (&lt;em&gt;y&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;em&gt;y&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;, ..., &lt;em&gt;y&lt;/em&gt;&lt;sub&gt;q&lt;/sub&gt;), how do we quantify the dependency between X and Y? &lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
If &amp;nbsp;both X and Y are one dimensional numerical attributes, we can use &lt;i&gt;&lt;a href="http://en.wikipedia.org/wiki/Correlation_and_dependence"&gt;linear correlation coefficient&lt;/a&gt;&lt;/i&gt; to quantify the dependency between X and Y. For multidimensional X and Y (that is our data are&amp;nbsp;numerical&amp;nbsp;matrices&amp;nbsp;of dimension &lt;em&gt;n&lt;/em&gt;×&lt;em&gt;p&lt;/em&gt; and &lt;em&gt;n&lt;/em&gt;×&lt;em&gt;q&lt;/em&gt;&amp;nbsp;respectively, assuming that we have sampled &lt;i&gt;n &lt;/i&gt;data points) &amp;nbsp;we can use canonical correlation analysis (CCA) to capture the dependency&amp;nbsp;between X and Y. Linear methods are in general effective and simple to use. &lt;a href="http://visumap.com/"&gt;VisuMap&lt;/a&gt; offers support for CCA and many other linear analysis services through a plugin module &lt;a href="http://www.visumap.net/rcHome.aspx?c=Plugin"&gt;MultivariateAnalysis&lt;/a&gt;. &amp;nbsp;However, linear methods work only for numerical attributes and can only&amp;nbsp;capture linear relationships. &lt;/div&gt;
&lt;br /&gt;
In this note I'll describe a more general method to quantify dependency between attributes. The method can be applied to non-numerical data and can capture non-linear relationships. The method is based on the concept &lt;a href="http://en.wikipedia.org/wiki/Conditional_entropy"&gt;conditional entropy&lt;/a&gt;&amp;nbsp;from information theory. Assuming that X and Y are two random variables, the &lt;i&gt;information gain ratio&lt;/i&gt;&amp;nbsp;from X to Y is defined as following:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-YMSC8Aoju8I/Tu_G1nVTreI/AAAAAAAAHx0/EC1StXOL2K0/s1600/Form1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="52" src="http://2.bp.blogspot.com/-YMSC8Aoju8I/Tu_G1nVTreI/AAAAAAAAHx0/EC1StXOL2K0/s200/Form1.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
where H(Y) is the &lt;a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)"&gt;entropy&lt;/a&gt; of Y, and H(Y|X) is the &lt;i&gt;conditional entropy&lt;/i&gt;&amp;nbsp;of Y under the conditional that X is known. The&amp;nbsp;numerator&amp;nbsp;H(Y) - H(Y|X) is also called the &lt;i&gt;mutual information&lt;/i&gt; between X and Y, and it is normally denoted as I(X; Y). So, the above formula can also be defined as R(X;Y):=I(X;Y)/H(Y). The following diagram shows the relationship between various types of entropy relating to two random variables:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://upload.wikimedia.org/wikipedia/commons/2/2e/Conditional_entropy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" oda="true" src="http://upload.wikimedia.org/wikipedia/commons/2/2e/Conditional_entropy.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In general, R(X; Y) tells us&amp;nbsp;how much information we gain about Y when&amp;nbsp;the value of X becomes known. R(X; Y) is a number between 0 and 1.0. If R(X;Y)=0, then X and Y are independent from each other, knowing X does not provide any information about Y. If R(X;Y)=1.0, then Y is completely dependent on X, i.e. Y is completely determined by the value of X.&lt;br /&gt;
&lt;br /&gt;
Now, let's go back to&amp;nbsp;our original problem. We&amp;nbsp;assume that X and Y&amp;nbsp;are&amp;nbsp;now two tables of dimension &lt;em&gt;n&lt;/em&gt;×&lt;em&gt;p&lt;/em&gt; and &lt;em&gt;n&lt;/em&gt;×&lt;em&gt;q&lt;/em&gt;&amp;nbsp;respectively, and X and&amp;nbsp;Y may contain numerical and non-numerical values.&amp;nbsp;In order to calculate the information gain ratio between X and Y we first convert them to random variables. To do so, we pick two clustering algorithms C and D that can be applied on X and Y. The clustering algorithm will assign each of the &lt;em&gt;n&lt;/em&gt; data point to a label of label sets &lt;span style="text-decoration: overline;"&gt;X&lt;/span&gt; and &lt;span style="text-decoration: overline;"&gt;Y&lt;/span&gt; respectively:&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;C: N →&amp;nbsp;&lt;span style="text-decoration: overline;"&gt;X&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;D: N →&amp;nbsp;&lt;span style="text-decoration: overline;"&gt;Y&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
where N:= {1,2,3, ..., n} is the set of data sample indexes. We then&amp;nbsp;define two random variables X' and Y' over&amp;nbsp;&lt;span style="text-decoration: overline;"&gt;X&lt;/span&gt;&amp;nbsp;and&amp;nbsp;&lt;span style="text-decoration: overline;"&gt;Y&lt;/span&gt;&amp;nbsp;as following:&lt;br /&gt;
&lt;span class="Apple-style-span" style="text-decoration: overline;"&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-qtuwZDwvcA4/Tu_Qb78SxmI/AAAAAAAAHx8/8yetp33ZJos/s1600/FormProbabilities.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-qtuwZDwvcA4/Tu_Qb78SxmI/AAAAAAAAHx8/8yetp33ZJos/s1600/FormProbabilities.png" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
With above definitions we can then apply the formula for R(X;Y) to the random variables X' and Y'. In particular, we have the following detailed formula to&amp;nbsp;calculate the &lt;em&gt;generalized information gain ratio&lt;/em&gt; (GIGR)&amp;nbsp;of X for Y:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-wrYYQwj7iQk/Tu_TmCBHKWI/AAAAAAAAHyE/fFJwNrnMKkA/s1600/FormularRatio.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="108" src="http://1.bp.blogspot.com/-wrYYQwj7iQk/Tu_TmCBHKWI/AAAAAAAAHyE/fFJwNrnMKkA/s320/FormularRatio.png" width="320" /&gt;&lt;/a&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We notice that&amp;nbsp;GIGR in above formula is indexed by the two clustering algorithms C and D. This means that the calculation is dependent on what clustering algorithms we choose for X and Y. In other words, the method described here is actually a meta-algorithm that requires, apart from two input data tables, also two clustering algorithms as input.&lt;br /&gt;
&lt;br /&gt;
The clustering algorithms C and D plays an essential role in GIGR. They determine the granularity of the entropy calculation. More&amp;nbsp;particularly,&lt;i&gt; R&lt;sub&gt;C,D&lt;/sub&gt;&lt;/i&gt; will be maximum if C and D are the finest clustering algorithms by creating unique label for each&amp;nbsp;distinguished data point. On the other side,&amp;nbsp;&lt;i&gt;R&lt;sub&gt;C,D&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;will be zero, if C and D, as the coarsest clustering algorithm, simply assign all data points to single label.&lt;br /&gt;
&lt;br /&gt;
We can also consider the clustering algorithm in GIGR&amp;nbsp;as a quantization method that converts data from multidimensional discrete or continues domain to one-dimensional discrete domain, so that we can effectively apply the information&amp;nbsp;theoretical&amp;nbsp;calculation on them.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;practise, C and D can be different clustering algorithms. The main requirement on C and D is that they can be applied on&amp;nbsp;data tables&amp;nbsp;X and Y separately and can&amp;nbsp;assign a label to&amp;nbsp;each data row. In fact, the data X and Y don't even have to&amp;nbsp;be&amp;nbsp;tables. They can be, for instance, a tree structures&amp;nbsp;or&amp;nbsp;DNA sequences. To apply our method, we just need to pick appropriate clustering algorithms for the data types.&lt;br /&gt;
&lt;br /&gt;
It should be pointed out here that GIGR described here does not bias towards any type of clustering algorithms. In fact, I believe that any clustering algorithm would be appropriate for certain type of problems. The attached script code implemented 6 different groups of clustering algorithms to calculate GIGR with VisuMap.&lt;br /&gt;
&lt;br /&gt;
Since the two clustering algorithms in GIGR are kind of input, we can use GIGR to investigate&amp;nbsp;similarity between different clustering algorithms. For instance,&amp;nbsp;if &lt;em&gt;R&lt;sub&gt;C,D&lt;/sub&gt;&lt;/em&gt;(X; X) is close 1.0 for&amp;nbsp;a data set X, then&amp;nbsp;the two clustering algorithms C and D produce similar results with respect to our data. Thus, in the practise, we might use the algorithm that is easier to implement, if the GIGR indicates that two algorithm produce similar results.&lt;br /&gt;
&lt;br /&gt;
Also along the same line,&amp;nbsp;GIGR can be&amp;nbsp;used as an indicator&amp;nbsp;for the&amp;nbsp;stability of&amp;nbsp;clustering algorithms. Many clustering algorithms, like the k-mean algorithm, are non-deterministic in the sense that&amp;nbsp;repeating the algorithm&amp;nbsp;on the same data&amp;nbsp;normally yields different result. We would normally favor an algorithm that maintain certain level of&amp;nbsp;stability so that the clustering&amp;nbsp;results remain more or less the same.&amp;nbsp;When we calculate&amp;nbsp;&lt;em&gt;R&lt;sub&gt;C,C&lt;/sub&gt;&lt;/em&gt;(X; X) by repeating algorithm C twice on X, we&amp;nbsp;would normally get a GIGR value that is smaller than 1.0 for non-deterministic clustering algorithms. In these cases, there closer &lt;em&gt;R&lt;sub&gt;C,C&lt;/sub&gt;&lt;/em&gt;(X; X) is to the value 1.0, the more stable is C with respect to the data set X.&lt;br /&gt;
&lt;br /&gt;
&lt;strong&gt;Appendix:&lt;/strong&gt;&lt;br /&gt;
The following JavaScript code helps users to calculate the GIGR&amp;nbsp;with VisuMap software application:&lt;script src="https://gist.github.com/1503211.js?file=InfoGainRatio.js"&gt;
&lt;/script&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-74389411928920300?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/74389411928920300/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=74389411928920300" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/74389411928920300?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/74389411928920300?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2011/11/quantifying-dependency-between.html" title="Quantifying dependency between attributes." /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-YMSC8Aoju8I/Tu_G1nVTreI/AAAAAAAAHx0/EC1StXOL2K0/s72-c/Form1.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D0YESXk7fSp7ImA9WhRRFkw.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-693889206722438702</id><published>2011-11-29T12:40:00.001-08:00</published><updated>2011-11-29T17:05:08.705-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-11-29T17:05:08.705-08:00</app:edited><title>K-Mean clustering under no-optimal conditions</title><content type="html">K-Mean algorithm (KMA) is a widely used clustering algorithm for multidimensional data. In this note, I'll show some intuitive behaviors of KMA with a very simple quantization task: Grouping a series of numeric values into multiple groups. An intuitive requirement for this task is: if some values are concentrated at a location, those values should be grouped together.&lt;br /&gt;&lt;br /&gt;I have generated 5 datasets, each with 5000 values between -5.0 and +20.0. Data points in the 5 datasets are successively concentrated at 1 to 5 locations according to Gaussian distribution. I used KMA to cluster these data into 3 clusters. I ran KMA on each dataset twice; the result is shown in the following picture:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-3dAOCf19UVg/TtVdgsFVN-I/AAAAAAAAHwc/KCDAHJfVhAE/s1600/Kmean%2B5Clusters.png"&gt;&lt;img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 280px; DISPLAY: block; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5680549321168467938" border="0" alt="" src="http://3.bp.blogspot.com/-3dAOCf19UVg/TtVdgsFVN-I/AAAAAAAAHwc/KCDAHJfVhAE/s400/Kmean%2B5Clusters.png" /&gt;&lt;/a&gt; In above picture, each spectrum shows the result of one run of KMA on a dataset. Each bar in the spectrum indicates a data point in a dataset. The curve on the spectrum shows the density of the data points. The bar colors indicate the clusters created by the KMA. We notice that all bars are colored by 3 different colors as the number of clusters, K, used KMA is 3.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Let N be the natural clusters present in the dataset (i.e. N=1,2,3,4,5 for these 5 datasets,) it would be helpful to distinguish the behaviors of KMA in to the following three cases:&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(1) N &amp;lt; K : In this case the KMA has to group fewer natural clusters into larger number of clusters. As be shown in spectrum 1.a, 1.b, 2.a and 2.b, KMA assigned more than one clusters (colors) to a natural cluster, but didn't assign one to multiple natural clusters. For instance, in spectrum 2.a, the first natural cluster has been split evenly into two clusters.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(2) N = K : In this case the KMA has to group exact N natural clusters in to N clusters. We see in spectrum 3.a and 3.b that KMA did a nice job in this case. KMA acutally found the natural clusters.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;(3) N &amp;gt; K : KMA has to group more natural clusters into fewer clusters. As be shown in spectrum 4.a, 4.b, 5.a and 5.b, KMA, in this case, grouped multiple natural clusters together. KMA has not, at least not significantly, split natural clusters. How KMA groups the natural clusters can change from run to run, depending on the initialization of the algorithm. For instance, spectrum 4.a and 4.b show two different grouping created by two different KMA runs.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;In above test, our datasets all have clearly separated natural clusters. I would say that KMA did a pretty good job: its results reflect pretty much our intuitive expectation. However, if the natural cluster are not so clearly separated, the result of KMA would not be such nice. The following picture shows how KMA groups less separated natural clusters:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-w-_NUvUD6aY/TtVcZxMOAuI/AAAAAAAAHwQ/NWENIhipJCo/s1600/Kmean%2B5ClustersB.png"&gt;&lt;img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 282px; DISPLAY: block; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5680548102768820962" border="0" alt="" src="http://1.bp.blogspot.com/-w-_NUvUD6aY/TtVcZxMOAuI/AAAAAAAAHwQ/NWENIhipJCo/s400/Kmean%2B5ClustersB.png" /&gt;&lt;/a&gt; In above picture, the 5 datasets are created similarly as before except that these natural clusters overlap each other somewhat. We see that, only for the case K=N, KMA found the exact natural clusters. For other causes, KMA does not respect the natural clusters. Actually, we can visually do a better job based on the density curves.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-693889206722438702?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/693889206722438702/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=693889206722438702" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/693889206722438702?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/693889206722438702?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2011/11/k-mean-clustering-under-no-optimal.html" title="K-Mean clustering under no-optimal conditions" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-3dAOCf19UVg/TtVdgsFVN-I/AAAAAAAAHwc/KCDAHJfVhAE/s72-c/Kmean%2B5Clusters.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEYCRH07cCp7ImA9WhZWE08.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-5279957583646803245</id><published>2011-05-13T13:25:00.000-07:00</published><updated>2011-05-13T14:29:25.308-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-05-13T14:29:25.308-07:00</app:edited><title>VisuMap v3.5 Relased.</title><content type="html">Today we have released VisuMap version 3.5.864. This version is a major upgrade from its previous version 3.2; it is marked by the following major changes:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Multi-core optimization&lt;/b&gt;: we have implemented extensive optimization for most algorithms to exploit multi-core computer system. By default, VisuMap 3.5 will automatically make use of all available CPU cores with multi-threading technology. The user can also explicitly limit the number of processing unit through the application configuration property WorkThreads.  On our 6-core Windows 7 test machine we have often achieved 3-5 times performance speedup.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Updated DirectX Library&lt;/b&gt;: VisuMap used DirectX+MDX (managed directX) to implement hardware accelerated visualization service. Since MDX has been discontinued as a component in new generations of .NET platform, we have to switch to another product. In VisuMap v3.5 we have replaced MDX by SlimDX that is an open source library with similar goal as MDX. With this change, we not only significantly improved the rendering performance, but also reduced the dependency on old obsoleted software components. VisuMap can now be installed on most Windows 7 machines without pre-installing any other software.  Some older systems may need to upgrade to the latest DirectX runtime library through the normal &lt;a href="http://www.microsoft.com/downloads/en/details.aspx?familyid=2da43d38-db71-4c1b-bc6a-9b6652cd92a3"&gt;DirectX web installer&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Many bug fixes and enhancements&lt;/b&gt;: Based on customer feedback VisuMap has undergone many small changes. Most of them have already made available through previous minor releases. Worth mentioning is here the extension of the XY plot that made this simple service a generic tool for large datasets; as well as addition of new glyphsets and utility scripts. We have recently also released some VisuMap plugins which extend VisuMap for special user groups. For instance, the plugin for Multivariate Analysis and Wave Transformations. Also, the Customer Importer plugin has been extended to import FCS (Flow Cytometry Standard) files.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;VisuMap v3.5 is backwards compatible with its previous versions. All existing datasets will silently upgraded to the new version when loaded into VisuMap. Existing users can upgrade to the new version, as long as the license type allows it, by simply clicking on the menu Help&amp;gt;Check for Updates in VisuMap's main window.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-5279957583646803245?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/5279957583646803245/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=5279957583646803245" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/5279957583646803245?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/5279957583646803245?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2011/05/visumap-v35-relased.html" title="VisuMap v3.5 Relased." /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D0cARno6eip7ImA9WhZREk0.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-5956503491159574392</id><published>2011-04-06T14:55:00.000-07:00</published><updated>2011-04-07T12:37:27.412-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-07T12:37:27.412-07:00</app:edited><title>Self-similarity of high dimensional random walk process</title><content type="html">&lt;div&gt;A high dimensional random walk process (RWP) is the trajectory of a vector variable whose components change independently and randomly step by step, in discrete time space, for a small constant percentage. High dimensional RWP can be used as the starting point to investigate changing complex systems. For instance, as discussed in &lt;a href="http://jamesxli.blogspot.com/2010/10/principal-components-analysis-pca-of.html"&gt;this blog&lt;/a&gt;, the stocks price history of 500 stocks can be considered as a 500 dimensional RWP.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://en.wikipedia.org/wiki/Self-similarity"&gt;Self-similarity&lt;/a&gt; means that an object is, from certain point of view, similar to parts of itself. A high dimensional RWP is self-similar in the sense that each sub-section of it follow the same statistical constraints. Thus, the randomness is a property shared by RWP and its sub-sections. We can generate series of 1000 data points of a 500-dimensional RWP with the following VisuMap's JavaScript code:&lt;/div&gt;&lt;pre&gt;&lt;br /&gt;var n = 1000;&lt;br /&gt;var dim = 500;&lt;br /&gt;var rwp = New.NumberTable(n, dim);&lt;br /&gt;var m = rwp.Matrix;&lt;br /&gt;for( var col=0; col &amp;lt; dim; col++) {&lt;br /&gt;    m[0][col] = 1.0;&lt;br /&gt;    for(var t=1; t &amp;lt; n; t++) {&lt;br /&gt;        m[t][col] = m[t-1][col] * (1 + 0.01*(Math.random() - 0.5) );&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;rwp.ShowValueDiagram();&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;How can we visualize the self-similar randomness in high dimensional RWP?  A simple way is to use a dimensionality reduction method to map the high dimensional trajectory to low dimensional space, then plot it out on paper or screen. The following picture for instance shows two CCA (curvilinear component analysis) maps of the 1000 data points mapped to the 3 dimensional space:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a href="http://3.bp.blogspot.com/-L9V_HvB3Bpk/TZzwrjlWFTI/AAAAAAAAHOA/CRqm06e4uHE/s1600/RwpCcaMaps.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 257px;" src="http://3.bp.blogspot.com/-L9V_HvB3Bpk/TZzwrjlWFTI/AAAAAAAAHOA/CRqm06e4uHE/s400/RwpCcaMaps.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5592609468364231986" /&gt;&lt;/a&gt;We can see the randomness of the trajectory in above picture, but the self-similarity is hardly apparent. This is because, as intrinsically to the human perception, we normally only good at recognizing similarity between geometric patterns, but not similarity between random patterns.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Fortunately, when we use principal component analysis (PCA) to project the trajectory to low dimensional space, we get much easier recognizable patterns. For instance, the following picture shows the projections of our sample RWP to some major principal components (ie. eigenvectors):&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-Tml6A5bu37Y/TZz4WvHfgYI/AAAAAAAAHOI/8mCBaZXlvsU/s1600/PcaProjections.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 341px; height: 400px;" src="http://4.bp.blogspot.com/-Tml6A5bu37Y/TZz4WvHfgYI/AAAAAAAAHOI/8mCBaZXlvsU/s400/PcaProjections.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5592617906776015234" /&gt;&lt;/a&gt;&lt;br /&gt;The PCA projection to the major principal components apparently filtered out the randomness of the data, like a low pass filter suppresses high frequency noises. Now, we can select three principal components, say the second, third and forth components as our projection axes;  then plot the PCA projection of the data (or sub-sections of it) to these three axes. As be illustrated in the following picture, we can see that they are all geometrically similar to each other albeit with different densities.&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/-lQn-TCXETxw/TZ31GVkI4CI/AAAAAAAAHOQ/8x51rtNP8bs/s1600/SelfSimilarityPca.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 298px;" src="http://1.bp.blogspot.com/-lQn-TCXETxw/TZ31GVkI4CI/AAAAAAAAHOQ/8x51rtNP8bs/s400/SelfSimilarityPca.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5592895801480765474" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Notice that in above picture,  the projection axes are the second, third and forth principal components of the selected data, not those of the complete dataset. The following video clip shows what we have discussed above in a more intuitively way:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/gR7QG-Y_yyM?hl=en&amp;amp;fs=1"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;embed src="http://www.youtube.com/v/gR7QG-Y_yyM?hl=en&amp;amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One practical use of studying RWP is to find non-randomness ( that is information ) in seemly random data. Using the PCA technique we can show random data as easy recognizable patterns which enable us to detect deviations from those patterns, so that we can quickly find potential information in apparently random data. &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-5956503491159574392?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/5956503491159574392/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=5956503491159574392" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/5956503491159574392?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/5956503491159574392?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2011/04/self-similarity-of-high-dimensional.html" title="Self-similarity of high dimensional random walk process" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-L9V_HvB3Bpk/TZzwrjlWFTI/AAAAAAAAHOA/CRqm06e4uHE/s72-c/RwpCcaMaps.jpg" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D0ENQng7fSp7ImA9WhZREUU.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-1543846779999383638</id><published>2011-04-01T08:21:00.000-07:00</published><updated>2011-04-07T07:14:53.605-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-04-07T07:14:53.605-07:00</app:edited><title>Pirated software web sites</title><content type="html">&lt;p&gt;I have notice that more than a dozens so-called freeware web sites now provide pirated version of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;VisuMap&lt;/span&gt; software. All those sites seem to be originated from a single source, as they are all organized in a similar way and all pirated the version 3.2.854 of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;VisuMap&lt;/span&gt;. I have spent 2 or 3 hours to investigate those sites, the followings are what I have found:&lt;/p&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;These sites have done a pretty good &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;SEO&lt;/span&gt; (search engine optimization) job, they pretty much over flooded search results when I searched for "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;VisuMap&lt;/span&gt; download" on Google. Google is clearly loosing the battle against those &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;mis&lt;/span&gt;-information. &lt;/li&gt;&lt;br /&gt;&lt;li&gt;Most of these sites are not really free. They usually re-direct users through a chain of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;redirections&lt;/span&gt; with dubious ads (they hold the user with popups for popups, you need to shutdown the browser to get away from them), then at the end they usually require users to sign up for paid service for "fast" download.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;All product information about &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;VisuMap&lt;/span&gt; on those sites are copy and pasted from &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;VisuMap's&lt;/span&gt; web site. Some sites used translation engine to translate English to other languages with horrible results.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;All information about downloading sites, speed, etc. on those site are faked and likely automatically generated to fool users and search engines.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;I have tested the pirated software on an isolated machine. I scanned the software with anti-virus software, no virus or &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;spyware&lt;/span&gt; has been detected.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The pirated software seemed to run smoothly on the test machine,  although it felt a little sluggish.  Since it has no proper license, it won't be able to get online update from &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;VisuMap's&lt;/span&gt; web site.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The pirated software has an obsoleted version, so that it won't be able to use some new &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;plugin&lt;/span&gt; extensions for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;VisuMap&lt;/span&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;p&gt;With so many limitations and apparent dubious practice, I don't think those pirate sites will have serious impact on the legitimated software vendors. But, they clearly post a challenge for Internet searching engines.&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-1543846779999383638?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/1543846779999383638/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=1543846779999383638" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1543846779999383638?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1543846779999383638?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2011/04/pirated-software-web-sites.html" title="Pirated software web sites" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D08ASHk7eip7ImA9Wx5bGU8.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-6981055894112627687</id><published>2010-11-04T15:52:00.000-07:00</published><updated>2010-11-04T19:57:29.702-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-11-04T19:57:29.702-07:00</app:edited><title>Visualizing dynamics in time series data.</title><content type="html">We have just released a short video clip that shows how VisuMap visualizes dynamics in time series data. This demo uses a feature of the spectrum view that&lt;em&gt; smoothly&lt;/em&gt; changes the display from one configuration to another one. This video is inspired by the some samples from &lt;a href="http://www.google.com/publicdata/home"&gt;google data explorer&lt;/a&gt;. It is amazing how human eyes can quickly capture such characteristics like speed and variations if the data is properly visualized.&lt;br /&gt;&lt;br /&gt;This video uses some features in VisuMap available in VisuMap version 3.2.855. The sample datasets used in the video are included in the standard installation of VisuMap.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.youtube.com/watch?v=aW8Br55bemE" href2="http://www.visumap.com/Resources/Demo/TimeSeries.htm"&gt;&lt;img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 400px; DISPLAY: block; HEIGHT: 282px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5535832276137995266" border="0" alt="" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TNM6IANY7AI/AAAAAAAAG9E/tKUOHz4Qy7s/s400/TimeSeriesDynamics.jpg" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-6981055894112627687?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/6981055894112627687/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=6981055894112627687" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/6981055894112627687?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/6981055894112627687?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2010/11/visualizing-dynamics-in-time-series.html" title="Visualizing dynamics in time series data." /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/TNM6IANY7AI/AAAAAAAAG9E/tKUOHz4Qy7s/s72-c/TimeSeriesDynamics.jpg" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DEIAQHcycCp7ImA9Wx5VEkw.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-1020402225031489682</id><published>2010-10-04T09:26:00.000-07:00</published><updated>2010-10-04T11:22:21.998-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-10-04T11:22:21.998-07:00</app:edited><title>Principal Components Analysis (PCA) of Random Walk Process</title><content type="html">&lt;div&gt;One sample dataset distributed with the VisuMap software package is the normalized weekly stock price history of 500 S&amp;amp;P  stocks for the year 2002. This dataset as shown in the following picture can be considered as 500 time series for about 50 time points (i.e. weeks).&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/TKoC4SBXnwI/AAAAAAAAG6E/uUdsW4r0dg8/s1600/Sp500LineChart.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 340px;" src="http://1.bp.blogspot.com/_Wca44kNaYGE/TKoC4SBXnwI/AAAAAAAAG6E/uUdsW4r0dg8/s400/Sp500LineChart.PNG" alt="" id="BLOGGER_PHOTO_ID_5524231058856845058" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;Picture 1&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;While above picture shows some common trend (like the large downturn in middle of the year), the price development is predominately random. How can we characterize this randomness? Well, one way to model the randomness is considering the stock price developments as independent &lt;span style="font-style: italic;"&gt;random walk &lt;/span&gt;processes: Assuming that you have invested one dollar in each of the 500 stocks, their values will change more or less randomly from week to week for a small percentages. As a reference model we can generate 500 such random walk process with the following JavaScript code (as supported by VisuMap):&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;cod&gt;&lt;pre&gt;var randomWalk = New.NumberTable(500,50);&lt;br /&gt;for(var row=0; row &amp;#60; randomWalk.Rows; row++) {&lt;br /&gt;  var v = 1.0;&lt;br /&gt;  for( var col=0; col &amp;#60; randomWalk.Columns; col++) {&lt;br /&gt;    randomWalk.Matrix[row][col] = v;&lt;br /&gt;    v *= 1 + 0.01*(Math.random() - 0.5);&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;randomWalk.ShowValueDiagram();&lt;br /&gt;&lt;/pre&gt;The last line in above code block creates the following diagram (the diagram has been colored with k-mean algorithm):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/TKoDaQEov_I/AAAAAAAAG6M/_eqN98ZMicc/s1600/RandomWalkChart.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 342px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TKoDaQEov_I/AAAAAAAAG6M/_eqN98ZMicc/s400/RandomWalkChart.PNG" alt="" id="BLOGGER_PHOTO_ID_5524231642449231858" border="0" /&gt;&lt;/a&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;Picture 2&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;Comparing Figure 1 and 2, we can say that Figure 2, more or less, resembles Figure 1 if we remove some common trends from Figure 1 (i.e. de-&lt;/span&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;b&gt;&lt;div style="display: inline ! important;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;trending the down-turn in the middle and the overall slightly down-wards trend). The resemblance is however a collective similarity. It &lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;b&gt;&lt;div style="display: inline ! important;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;does not make sense to find similarity between individual curves in the two figures. &lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;Another way to compare two sets of time series is using principal components analysis (PCA).&lt;/span&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt; Among many other services, PCA provides a systematic way to decompose high dimensional variances in a dataset into many 1-dimensional directions. These directions (called principal components) are also ordered in the way, so that the beginning components have larger variance and therefore more information. Thus, PCA might be interesting way to characterize the variance (i.e. some kind of randomness) in a dataset.&lt;br /&gt;&lt;br /&gt;It is very simple to visualize the PCA components with VisuMap. To do so, we just need to open the PCA view and open the Projection Analyzer window. Then select some PCs and choose the context menu "Show PCAs". The following two picture shows the first 24 PCs of the S&amp;amp;P500 and the RandomWalk datasets:&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/TKoQqoHh-iI/AAAAAAAAG6U/cGBStNaKCIo/s1600/Sp500Pca.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 393px; height: 400px;" src="http://4.bp.blogspot.com/_Wca44kNaYGE/TKoQqoHh-iI/AAAAAAAAG6U/cGBStNaKCIo/s400/Sp500Pca.PNG" alt="" id="BLOGGER_PHOTO_ID_5524246217432889890" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;Picture 3&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/TKoQ1mBbWMI/AAAAAAAAG6c/jpAcsryDkjM/s1600/RandomWalkPca.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 389px; height: 400px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TKoQ1mBbWMI/AAAAAAAAG6c/jpAcsryDkjM/s400/RandomWalkPca.PNG" alt="" id="BLOGGER_PHOTO_ID_5524246405848979650" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;Picture 4&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;We can clearly see some similarity between the first few PCs in above two pictures. This indicates that random walk process models pretty well the major variance directions of the S&amp;amp;P  dataset. The discrepancy between the PCs in above two picture may be used to characterize how the S&amp;amp;P price history differs from random walk process. In this way the random walk process is used as a null-space reference model.&lt;br /&gt;&lt;br /&gt;Looking at Picture 4 we can also notice an interesting thing: the PCs clearly resembles the curves of sine functions. This is actually not too surprising since the PCA algorithm is basically a sequence of high dimensional rotation operations, which, as we know, lead to a lot of sine/cosine functions. Nevertheless, it would be interesting to determine the exact mathematical formula for the random walk process. By doing so, we can have a quick statical approximation for PCAs of many random walk alike datasets.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&lt;/div&gt;&lt;/cod&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-1020402225031489682?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/1020402225031489682/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=1020402225031489682" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1020402225031489682?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1020402225031489682?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2010/10/principal-components-analysis-pca-of.html" title="Principal Components Analysis (PCA) of Random Walk Process" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_Wca44kNaYGE/TKoC4SBXnwI/AAAAAAAAG6E/uUdsW4r0dg8/s72-c/Sp500LineChart.PNG" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0UNQHs9eCp7ImA9Wx5TEk0.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-4285421426002024500</id><published>2010-07-18T11:08:00.000-07:00</published><updated>2010-07-26T20:28:11.560-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-07-26T20:28:11.560-07:00</app:edited><title>Visualizing linkage disequilibrium clusters of genotype SNPs</title><content type="html">&lt;span style="font-weight: bold;"&gt;SNP and Linkage Disequilibrium&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I have just released a sample dataset &lt;a href="http://www.visumap.com/index.aspx?p=Resources/VisuMapDatasets"&gt;Haplotype Analysis&lt;/a&gt; that shows how to use VisuMap to visualize clusters among SNPs (single nucleotide polymorphism). SNPs are about 0.1% of base-pairs locations in DNA sequences that vary from population to population; and from individual to individual. It has been said that the difference of many phonetic traits, like height, eye color, etc, of human beings can be attributed to variations of SNPs.&lt;br /&gt;&lt;br /&gt;Haplotype analysis aims to find &lt;i&gt;correlation &lt;/i&gt;between SNPs and phonetic traits. One of the often used method in haplotype analysis is the concept &lt;em&gt;linkage disequilibrium&lt;/em&gt; (LD) that measures the &lt;em&gt;correlation&lt;/em&gt; between SNP pairs with respect to a population. In abstract terms, LD induces an information structure over SNPs. Such a relational structure may offer a helpful framework to study the correlation between SNPs and phonetic variations.&lt;br /&gt;&lt;br /&gt;The numerical calculation of LD is actually pretty straightforward. But all descriptions about LD I can find are too vague for step-by-step calculation. For the sack of reference, I will briefly describe the way how LD is calculated here. Assuming that we have obtained the genotype data about two SNPs, SNP&lt;sub&gt;a&lt;/sub&gt; and SNP&lt;sub&gt;b&lt;/sub&gt;, for a group of individuals I&lt;sub&gt;1&lt;/sub&gt;, I&lt;sub&gt;2&lt;/sub&gt;, I&lt;sub&gt;3&lt;/sub&gt;, ..., I&lt;sub&gt;k&lt;/sub&gt; as in the following table:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 400px; text-align: left;" border="0" cellpadding="0" cellspacing="0"&gt;&lt;tbody&gt;&lt;/tbody&gt;&lt;tbody&gt;&lt;tr style="height: 30px;" bgcolor="#cccccc"&gt;&lt;td&gt;&lt;br /&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;1 &lt;/sub&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;2&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;3&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;k&lt;/sub&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30px;"&gt;&lt;td&gt;SNP&lt;sub&gt;a&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;T/T&lt;/td&gt;&lt;td&gt;G/T&lt;/td&gt;&lt;td&gt;T/T&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;G/G&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30px;"&gt;&lt;td&gt;SNP&lt;sub&gt;b&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;C/T&lt;/td&gt;&lt;td&gt;C/C&lt;/td&gt;&lt;td&gt;C/T&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;T/T&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;tbody&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Notice that each element in above table is a pair of nucleotides A, C, G or T. For certain reason (which I don't know), all SNPs are &lt;span style="font-style: italic;"&gt;bi-allelic&lt;/span&gt;, that means each row in above table will only have maximal two different nucleotides. So, in above sample, SNP&lt;sub&gt;a&lt;/sub&gt; has G and T, whereas SNP&lt;sub&gt;b&lt;/sub&gt; has C and T. In order to calculate LD between the two SNPs, we first select a nucleotide &lt;span style="font-style: italic;"&gt;arbitrarily&lt;/span&gt; from each row; and then count how many times that nucleotide appears in that row in each column. For instance, if we have selected T and C for SNP&lt;sub&gt;a&lt;/sub&gt; and SNP&lt;sub&gt;b&lt;/sub&gt; respectively, we will get a frequency table as following:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 400px; text-align: left;" border="0" cellpadding="0" cellspacing="0"&gt;&lt;tbody&gt;&lt;/tbody&gt;&lt;tbody&gt;&lt;tr style="height: 30px;" bgcolor="#cccccc"&gt;&lt;td&gt;&lt;br /&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;1 &lt;/sub&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;2 &lt;/sub&gt;&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;3 &lt;/sub&gt;&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;I&lt;sub&gt;k &lt;/sub&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30px;"&gt;&lt;td&gt;SNP&lt;sub&gt;a&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;2 &lt;/td&gt;&lt;td&gt;1 &lt;/td&gt;&lt;td&gt;2 &lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;0 &lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30px;"&gt;&lt;td&gt;SNP&lt;sub&gt;b&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;tbody&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;The LD between the SNP&lt;sub&gt;a&lt;/sub&gt; and SNP&lt;sub&gt;b&lt;/sub&gt; (more accurately the R-Squared LD) is then defined as the squared value of the&lt;a href="http://en.wikipedia.org/wiki/Correlation_and_dependence"&gt; linear correlation coefficient&lt;/a&gt; between the two frequency rows. Notice that since the SNPs are bi-allelic, the final LD value won't depend on which nucleotide we have selected from each row for the calculation. We can verify this easily based on the formula below. More formerly, let {a&lt;sub&gt;1&lt;/sub&gt;, a&lt;sub&gt;2&lt;/sub&gt;, a&lt;sub&gt;3&lt;/sub&gt;, ..., a&lt;sub&gt;k&lt;/sub&gt;} and {b&lt;sub&gt;1&lt;/sub&gt;, b&lt;sub&gt;2&lt;/sub&gt;, b&lt;sub&gt;3&lt;/sub&gt;, ..., b&lt;sub&gt;k&lt;/sub&gt;} be the two frequency row vectors in above table; let &lt;span style="text-decoration: overline;"&gt;a&lt;/span&gt; and &lt;span style="text-decoration: overline;"&gt;b&lt;/span&gt; be their mean values; then the R-Squared LD between SNP&lt;sub&gt;a&lt;/sub&gt; and SNP&lt;sub&gt;b&lt;/sub&gt; is defined as following:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXene_RoII/AAAAAAAAGsw/aOzjLxWB97U/s1600/RsquaredLD.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5496043690190282882" style="display: block; margin: 0px auto 10px; width: 291px; cursor: pointer; height: 75px; text-align: center;" alt="" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXene_RoII/AAAAAAAAGsw/aOzjLxWB97U/s400/RsquaredLD.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Sample Dataset&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The uploaded sample dataset is obtained from the &lt;a href="http://www.hapmap.org/"&gt;HapMap Project&lt;/a&gt; website that provides accesses to a large collection of genotype datasets. More interestingly, from that website we can select genotype data for special SNPs and populations. To generate the sample dataset I have select all SNPs from the region of the first 2 Million base pairs in the 9-th chromosome for the CEU population that consists of 169 individuals. There are altogether 1900 SNPs in the selected region. Our sample dataset is thus basically a table with 1900x169 nucleotide pairs.&lt;br /&gt;&lt;br /&gt;In order to import the downloaded data into VisuMap we have implemented a plugin module &lt;a href="http://www.visumap.com/rcHome.aspx?c=Video"&gt;HaploExplorer&lt;/a&gt; that enables users to import data downloaded from the HapMap website (files must have the extension .hmap.) The HaploExplorer plugin also implements a special metric named "LD R-Sqaured", so that users can simply select this metric to generate maps to visualize LD values. The following picture shows a sphere map of the 1900 SNPs:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/TEXfKJbnHPI/AAAAAAAAGs4/HwhrygFRu3o/s1600/HaploSphereMapSmall.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5496044285698972914" style="display: block; margin: 0px auto 10px; width: 396px; cursor: pointer; height: 340px; text-align: center;" alt="" src="http://3.bp.blogspot.com/_Wca44kNaYGE/TEXfKJbnHPI/AAAAAAAAGs4/HwhrygFRu3o/s400/HaploSphereMapSmall.PNG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In above map, each glyph represents a SNP from the dataset. The size of the glyph indicates the chromosome location of the SNP, ie. smaller glyphs represent SNPs located at the beginning of the chromosome. Most importantly, the distances between the glyphs indicate the LD between the SNPs; that means closely located glyphs correspond to SNPs with high LD. We can see clearly various type of clusters in above picture.&lt;br /&gt;&lt;br /&gt;We can also visualize the chromosome locations more directly with the help a spectrum view. The following picture shows, for instance, a 2D-RPM map of the the dataset together with a spectrum view of their chromosome locations:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXfY0aTWKI/AAAAAAAAGtA/yy2hQGbb3g4/s1600/LDClusterSmall.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5496044537754376354" style="display: block; margin: 0px auto 10px; width: 400px; cursor: pointer; height: 334px; text-align: center;" alt="" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXfY0aTWKI/AAAAAAAAGtA/yy2hQGbb3g4/s400/LDClusterSmall.PNG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;When interactively exploring about views, the user can select a group of SNPs in the lower window, the upper spectrum window will automatically show the chromosome locations of the selected SNPs.&lt;br /&gt;&lt;br /&gt;For comparison purpose the HapMap project recommends a program called &lt;a href="http://www.broadinstitute.org/haploview/haploview"&gt;Haploview&lt;/a&gt; that allows user to directly visualize LD values with a triangle matrix, where the LD values are represented by different levels of gray colors. The following picture shows such a map generated by Haploview:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXfnFf6QCI/AAAAAAAAGtI/3SKRTEnxAoo/s1600/HaploviewMapSmall.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5496044782859468834" style="display: block; margin: 0px auto 10px; width: 400px; cursor: pointer; height: 258px; text-align: center;" alt="" src="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXfnFf6QCI/AAAAAAAAGtI/3SKRTEnxAoo/s400/HaploviewMapSmall.PNG" border="0" /&gt;&lt;/a&gt;In above picture, all 1900 SNPs are sequentially lined up to the upper edge, so that the matrix is basically a 1900x1900 triangle matrix. In order support exploration of such large matrix Haploview implements an overview window (displayed on the lower left corner) that allows users to select a small section of the data for close investigation.&lt;br /&gt;&lt;br /&gt;Comparing above maps, we can see that VisuMap provides a more direct and intuitive way to visualize patterns among SNPs. More importantly, as an integrated software package, VisuMap offers simple framework to investigate different type of LD relationships between SNPs. We can, for instance, easily experiment with any other comparable distance metric available in VisuMap; and we can use any of clustering algorithms in VisuMap to cluster the SNPs. &lt;/p&gt;&lt;p&gt;At last, but not at least, after we have imported the data in to VisuMap, we can also visualize patterns among populations by transposing the dataset table. In order to study relationships between individuals or populations a special distance metric, called the &lt;a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0006711"&gt;IBS (identity-by-state)&lt;/a&gt; distance, has been suggested by some researchers. The IBS distance metric is also implemented in the &lt;a href="http://www.visumap.com/rcHome.aspx?c=Video"&gt;HaploExplorer&lt;/a&gt; plugin. After we have transposed a SNP dataset, we can select the IBS distance metric to produce population maps.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-4285421426002024500?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/4285421426002024500/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=4285421426002024500" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/4285421426002024500?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/4285421426002024500?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2010/07/visualizing-linkage-disequilibrium.html" title="Visualizing linkage disequilibrium clusters of genotype SNPs" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/TEXene_RoII/AAAAAAAAGsw/aOzjLxWB97U/s72-c/RsquaredLD.GIF" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D0cGQX49fCp7ImA9WxBbGU4.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-7523098176483076602</id><published>2010-03-18T09:55:00.000-07:00</published><updated>2010-03-18T10:57:00.064-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-03-18T10:57:00.064-07:00</app:edited><title>Spherical Multidimensional Scaling in RPM way</title><content type="html">&lt;div&gt;We have just released VisuMap version 3.0.844.  In addition to many enhancements (like the map gallery view), this release includes the implementation of a new kind of MDS algorithm, called the manifold RPM algorithm. Manifold RPM works similarly as the original RPM (&lt;a href="http://www.visumap.com/index.aspx?p=Resources/RpmOverview"&gt;relational perspective map&lt;/a&gt;), except that it maps data to different 2-dimensional surfaces.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The manifold RPM service in this release supports following surfaces as image space:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Flat real projective plane.&lt;/li&gt;&lt;li&gt;Flat klein-bottle&lt;/li&gt;&lt;li&gt;Flat sphere;&lt;/li&gt;&lt;li&gt;3D-sphere.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Whereas the first 3 image spaces so far didn't seem to produce significantly new results, the&lt;/div&gt;&lt;div&gt;3D-sphere surface (also termed as S2) has produced surprisingly good results. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/S6Jd96I33EI/AAAAAAAAFmk/-_i4gHGDXQs/s1600-h/SP500Sphere.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 301px;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/S6Jd96I33EI/AAAAAAAAFmk/-_i4gHGDXQs/s400/SP500Sphere.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5450021817231596610" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/S6JeORfQXPI/AAAAAAAAFms/nWjo_IMCNio/s1600-h/PhamSphere.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 310px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/S6JeORfQXPI/AAAAAAAAFms/nWjo_IMCNio/s400/PhamSphere.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5450022098377399538" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;The spherical RPM often produced better results than the original toroidal RPM in the sense&lt;/div&gt;&lt;div&gt;that spherical RPM maps often have less ad-hoc fragmentations. The main reason for this improvement is probably because that 3D-sphere is more symmetrical than torus. Flat torus is symmetrical in shifting (isometric), but not symmetrical in rotation (isotropic), so that some directions are treated differently as I have &lt;a href="http://jamesxli.blogspot.com/2008_04_01_archive.html"&gt;once blogged&lt;/a&gt;.  Obviously, 3D-sphere is both isometric and isotropic.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From the point of view of MDS (multidimensional scaling), the spherical RPM algorithm &lt;/div&gt;&lt;div&gt;basically replaces the distance metric by angle metric in image space. Although many trigonometrical calculation are invoked in the algorithm, the implementation of the algorithm has turned out to be significantly faster than the original RPM algorithm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The major disadvantage of the sphere map is that 3D sphere maps is a little difficult&lt;/div&gt;&lt;div&gt;to explore on 2D computer screens or printed papers. In this release we have implemented a 3D viewer, the &lt;i&gt;sphere view&lt;/i&gt;, to help people to explore 3D sphere maps. With the growing popularity of tools like Google Earth, I hope people will find a easy and useful tool in sphere viewer. The following link is a short video demonstrating the 3D sphere viewer implemented in VisuMap:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.visumap.com/Resources/Demo/SphereMap.htm"&gt;"Data Earth" - Exploring Data with Sphere Map&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-7523098176483076602?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/7523098176483076602/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=7523098176483076602" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/7523098176483076602?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/7523098176483076602?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2010/03/spherical-multidimensional-scaling-in.html" title="Spherical Multidimensional Scaling in RPM way" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_Wca44kNaYGE/S6Jd96I33EI/AAAAAAAAFmk/-_i4gHGDXQs/s72-c/SP500Sphere.jpg" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;A0INSHw9fyp7ImA9WxNVFE0.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-2321934728525477511</id><published>2009-10-24T11:15:00.000-07:00</published><updated>2009-10-24T11:39:59.267-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-10-24T11:39:59.267-07:00</app:edited><title>New productivity features in VisuMap 3.0</title><content type="html">I have just posted a short video that shows some new productivity features implemented in VisuMap version 3.0.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.visumap.net/Resources/Demo/ProductivityFeatures.htm"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 220px;" src="http://www.visumap.net/Resources/Demo/ProductivityFeatures.PNG" alt="" id="BLOGGER_PHOTO_ID_5396234976914657842" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-2321934728525477511?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/2321934728525477511/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=2321934728525477511" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2321934728525477511?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2321934728525477511?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/10/new-productivity-features-in-visumap-30.html" title="New productivity features in VisuMap 3.0" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;AkUFQ3c_fip7ImA9WxNQF0k.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-2860547419568897904</id><published>2009-09-23T16:11:00.001-07:00</published><updated>2009-09-23T16:23:32.946-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-09-23T16:23:32.946-07:00</app:edited><title>A layman's introduction to principal component analysis.</title><content type="html">I have just uploaded a very short introduction for PCA to youtube. The video is probably the shortest introduction to PCA (1.5min). This introduction emphasizes the geometrical aspects, instead of the usual statistical nature. I especially like the statement that the covariance matrix is just a method to measure the average extend of an object (a point cloud) along any axis.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;object height="344" width="425"&gt;&lt;param name="movie" value="http://www.youtube.com/v/BfTMmoDFXyE&amp;amp;hl=en&amp;amp;fs=1&amp;amp;"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;embed src="http://www.youtube.com/v/BfTMmoDFXyE&amp;amp;hl=en&amp;amp;fs=1&amp;amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="344" width="425"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-2860547419568897904?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/2860547419568897904/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=2860547419568897904" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2860547419568897904?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2860547419568897904?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/09/laymans-introduction-to-principal.html" title="A layman's introduction to principal component analysis." /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;AkIDSXsyeip7ImA9WxNRFUw.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-3988197063732338689</id><published>2009-08-31T10:23:00.000-07:00</published><updated>2009-09-09T10:49:38.592-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-09-09T10:49:38.592-07:00</app:edited><title>VisuMap 3.0 Released</title><content type="html">Today we have released VisuMap 3.0. This is a major upgrade from the previous version 2.7.&lt;br /&gt;&lt;br /&gt;The main change in this release is the implementation of the &lt;span style="font-style: italic;"&gt;Atlas&lt;/span&gt; service. An atlas in VisuMap 3.0 is frame-less window  that enables users to organize and compose data views of different types. The following picture shows the snapshot of a sample atlas:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/SpwMEys0O0I/AAAAAAAAFbU/wy6eZpTvvFQ/s1600-h/VisuMapAtlas.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 227px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/SpwMEys0O0I/AAAAAAAAFbU/wy6eZpTvvFQ/s400/VisuMapAtlas.png" alt="" id="BLOGGER_PHOTO_ID_5376185331642284866" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Atlas service in VisuMap is thus similar as dashboard  service in some other visualization centric software applications. It allows users to group visual information according to the subject instead of information type itself. And more importantly for VisuMap users, atlas service enables  users to combine multiple algorithms to generate and compose heterogeneously structured information visualization. For instance, this release includes a script (PartitionAtlas.js) that automatically partitions a dataset in several clusters, and  generates various maps for each cluster.&lt;br /&gt;&lt;br /&gt;Together with this release we have extensive updated the scripting interface. Unfortunately, some old scripts may need minor adjustments to convert to this new release. We  have also extended the data format extensively to accommodate the new service. However, the data format has kept compatible in the sense that old dataset files can be loaded into new version, but datasets of new format can not be read by previous versions.&lt;br /&gt;&lt;br /&gt;Click the following image to watch a short video that shows how to create an atlas interactively:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.visumap.net/Resources/Demo/AtlasDemo.htm"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 664px; height: 249px;" src="http://www.visumap.net/Resources/Demo/AtlasDemo.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.visumap.net/Resources/Demo/PartitionAtlas.htm"&gt;Here&lt;/a&gt; is another video that shows how a script uses kmean algorithm to partition a dataset and generate different data views for the partitions.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.visumap.net/Resources/Demo/AtlasClusters.htm"&gt;This video&lt;/a&gt; shows using pre-configured SOG cluster to create mutiple data views for a dataset.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-3988197063732338689?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/3988197063732338689/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=3988197063732338689" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3988197063732338689?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3988197063732338689?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/08/visumap-30-released.html" title="VisuMap 3.0 Released" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/SpwMEys0O0I/AAAAAAAAFbU/wy6eZpTvvFQ/s72-c/VisuMapAtlas.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;AkUMQX0zcCp7ImA9WxJVFEQ.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-3354639627353773759</id><published>2009-07-01T09:00:00.000-07:00</published><updated>2009-07-01T18:31:20.388-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-07-01T18:31:20.388-07:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="data clustering" /><category scheme="http://www.blogger.com/atom/ns#" term="self-oranizing map" /><category scheme="http://www.blogger.com/atom/ns#" term="self-organizing graph" /><title>Data Clustering with Self-Organizing Graph</title><content type="html">We have just released VisuMap version 2.7.832. In this release we have added a very interesting clustering service called the&lt;span style="font-style: italic;"&gt; self-organizing graph&lt;/span&gt; (SOG).  SOG is basically an extension to the self-ogranizing map (SOM) that simulates a homogeneous artificial neural network. Unlike SOM, SOG simulates a network of arbitrary structure. The network can be in fact any weighted undirected graph. The network is a kind of parameter for the algorithm: the user defines the network depending on his/her knowledge or interest about the dataset.&lt;br /&gt;&lt;br /&gt;The following picture illustrates an application of SOG. On the left side is a map of a dataset that shows roughly two data point clouds. The right side shows the network we used to classify the dataset. During the learning process each node of the network will become associated with some  data points. After the learning process has finished each data point will be classified as their corresponding nodes in the SOG network.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/Sku2NA14vQI/AAAAAAAAFKQ/aL0l7FLlsAA/s1600-h/SogOverview.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 146px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sku2NA14vQI/AAAAAAAAFKQ/aL0l7FLlsAA/s400/SogOverview.PNG" alt="" id="BLOGGER_PHOTO_ID_5353572916740537602" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We see that the SOG algorithm not only correctly classified the two data point clouds, but also captured those data points located a the peripheries of the clouds (data points shown as blue triangles).&lt;br /&gt;&lt;br /&gt;The following is a short video tutorial for the SOG clustering service in VisuMap:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.visumap.net/rc2.aspx?f=VideoClips/SogClustering.htm&amp;amp;c=Video"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 252px;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SkwM4iv-hRI/AAAAAAAAFKY/b63LP_HDZCQ/s400/SogVideo.PNG" alt="" id="BLOGGER_PHOTO_ID_5353668222576919826" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Comparing with traditional clustering algorithms, like k-Mean and hierarchical clustering algorithms, SOG provides more flexible clustering service. With SOG the user can not only specify the number of clusters, but also the size and other geometric or topological relationships between the clusters. In fact, SOG enables the user to specify a model to reflect existing knowledge and interests about the data.&lt;br /&gt;&lt;br /&gt;There are many variants of the self-organizing map that use irregular networks. Those algorithms often employ adaptive strategies to generate the irregular networks depending on the data. Those irregular network are results of those algorithms. Contrary to those algorithms SOG  uses the network as input for the algorithm, the user can express his/her interest and knowledge about the data through the network.&lt;br /&gt;&lt;br /&gt;SOG is our first attempt to provide a new kind of clustering service. While its framework resembles to provide general pattern matching capability. The learning algorithm it employed has only limited capability to capture structural information. In view of this there are still many challenges and potentials for SOG algorithm in the future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-3354639627353773759?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/3354639627353773759/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=3354639627353773759" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3354639627353773759?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3354639627353773759?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/07/data-clustering-with-self-organizing.html" title="Data Clustering with Self-Organizing Graph" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/Sku2NA14vQI/AAAAAAAAFKQ/aL0l7FLlsAA/s72-c/SogOverview.PNG" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CUEMSX8zcCp7ImA9WxVVGUU.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-3089278103497419156</id><published>2009-03-13T15:14:00.001-07:00</published><updated>2009-03-13T15:34:48.188-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-13T15:34:48.188-07:00</app:edited><title>Visualizing the real projecitve plane.</title><content type="html">I have just uploaded a video clip to YouTube that shows some ways to explorer the real projective plane with VisuMap. The link to this video is included here:&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/QQML-6hcqGc&amp;amp;hl=en&amp;amp;fs=1"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;embed src="http://www.youtube.com/v/QQML-6hcqGc&amp;amp;hl=en&amp;amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;This dataset contains 14,358 data points (4-dimensional vectors) sampled from the surface of RP2 embedded into  R&lt;sup&gt;4&lt;/sup&gt;. The analytical formula is given in the video clip. A smaller sample dataset in VisuMap format can be downloaded at &lt;a href="http://www.visumap.net/rc1.aspx?f=KleinBottle4D.xvmz&amp;amp;g=res"&gt;KleinBottle4D&lt;/a&gt; that also contains the dataset for the Klein-Bottle surface.&lt;br /&gt;&lt;br /&gt;It is interesting to notice that RPM algorithm flattens the 2D sphere by cutting it into two equal sized half; but it flattens the real projective plane in the way that the whole plane is kept in a single connected region.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-3089278103497419156?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/3089278103497419156/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=3089278103497419156" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3089278103497419156?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3089278103497419156?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/03/visualizing-real-projecitve-plane.html" title="Visualizing the real projecitve plane." /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;Ck4FRnwyfCp7ImA9WxNWGUs.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-3138520426595499866</id><published>2009-03-09T10:26:00.000-07:00</published><updated>2009-10-19T07:01:57.294-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-10-19T07:01:57.294-07:00</app:edited><title>Projective and Injective Exploration Methods</title><content type="html">In a recent &lt;a href="http://terrytao.wordpress.com/2009/01/26/245b-notes-6-duality-and-the-hahn-banach-theorem/"&gt;blog post&lt;/a&gt; Terence Tao has put mathematical approaches to study a space X (which can be a space, a manifold etc.) into two fundamental camps:&lt;br /&gt;&lt;br /&gt;1. By looking into a map &lt;span style="font-style: italic;"&gt;f&lt;/span&gt;: Y-&gt;X  that maps certain (better known) space Y into X.&lt;br /&gt;&lt;br /&gt;2. By looking into a map &lt;span style="font-style: italic;"&gt;f&lt;/span&gt;: X-&gt;Y that maps X into certain (better known) space Y.&lt;br /&gt;&lt;br /&gt;I think that these statements not only apply to mathematical studies, they also provide useful framework for the exploration of high dimensional data. In our case, the unknown space X becomes a high dimensional dataset that we wish to understand. The space Y is usually a lower dimensional space that we know better and is normally easy to visualize.&lt;br /&gt;&lt;br /&gt;For the sake of simplicity I would like to refer to methods from the above two camps as &lt;span style="font-style: italic;"&gt;injective &lt;/span&gt;and &lt;span style="font-style: italic;"&gt;projective &lt;/span&gt;methods respectively. That means, injective methods inject known space into the target space, whereas the projective methods project the target space into a better known space.&lt;br /&gt;&lt;br /&gt;Following this point of view, algorithms to visualize high dimensional data can be grouped as injective or projective. Most MDS (&lt;a href="http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm"&gt;multidiemsnional scaling&lt;/a&gt;) based visualization algorithms are projective. Algorithm like Sammon map and PCA (principal component analysis) map the high dimensional dataset into a low dimensional Euclidean space. The RPM algorithm (&lt;a href="http://www.visumap.net/index.aspx?p=Resources/RpmOverview"&gt;relational perspective map&lt;/a&gt;) maps high dimensional to flat torus.&lt;br /&gt;&lt;br /&gt;On the other side, most clustering algorithms can be considered as injective algorithm. For instance, the method of vector quantization maps a finite discrete set into the target space, whereas the Kohonen net (or Self-organizing map) maps a discrete set with a mesh topology into the target space. The hierarchical clustering algorithms (like the agglomerative clustering algorithm) basically map binary tree structures into the target space.&lt;br /&gt;&lt;br /&gt;One major benefit of this classification of algorithms is that it can guide us to apply existing algorithms to new data, in the sense that algorithms from the same camp might be able to share some methods/concepts/tricks. For instance, we might be apply an optimization method used by one clustering algorithm to another injective algorithm.&lt;br /&gt;&lt;br /&gt;Furthermore, as be pointed out in Tao's blog post, there is certain "duality" between injective and projective methods. That means, for a given method in one camp you can sometimes find its corresponding counter part in the other camp. This duality can then help us to transfer a method from camp to anther one. As an example for this "duality", the RPM algorithm is basically a dual algorithm of the self-organizing map: they are in different camp, but are dual to each there as they share the flat torus as the image space (i.e. the space Y mentioned at the beginning of this post).&lt;br /&gt;&lt;br /&gt;Similar to the duality concept I have&lt;a href="http://jamesxli.blogspot.com/2007/10/symmetry-between-dimensionality.html"&gt; blogged before&lt;/a&gt; about the "symmetry" between these two camps in more restricted situations, namely when the datasets are multidimensional vector spaces (e.g. data tables). In these cases, we can often transfer a method from one camp to the another one in a straightforward way (e.g. by transposing the data matrix).&lt;br /&gt;&lt;br /&gt;The injective and projective algorithm are realized in VisuMap by clustering and mapping services, which form the two main groups of services in the VisuMap application.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-3138520426595499866?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/3138520426595499866/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=3138520426595499866" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3138520426595499866?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3138520426595499866?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/03/projective-and-injective-exploration.html" title="Projective and Injective Exploration Methods" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEEFQ344fCp7ImA9WxVVEkU.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-147110495038344333</id><published>2009-03-05T09:09:00.000-08:00</published><updated>2009-03-05T11:50:12.034-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-05T11:50:12.034-08:00</app:edited><title>Dimension analysis on mulitdimensional scaling</title><content type="html">This post brings forward an inconsistency problem in the study of multidimensional scaling from the perspective of dimensional analysis. This post is rather of philosophical nature. Nevertheless, I find it is worthwhile being aware of this issue&lt;br /&gt;for further theoretical investigations.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Dimensional_analysis"&gt;Dimensional analysis&lt;/a&gt; is a simple and useful rule to check the consistence of a mathematical formula or equation. Roughly speaking, this rule demands that both sides of an equation (that has any physical or chemical bearings) must have the same dimension, or unit of measurement. So for instance the equation E = mC&lt;sup&gt;2&lt;/sup&gt; is consistent with the dimensional analysis, since both sides of the equation have &lt;span style="font-style: italic;"&gt;Energy&lt;/span&gt; as dimension. The equation E=mC is not consistent, since the right side would have &lt;span style="font-style: italic;"&gt;momentum&lt;/span&gt; as dimension.&lt;br /&gt;&lt;br /&gt;Similarly, if a formula is composed of multiple terms by addition, then all terms must have the same dimension. So for instance, the formula: S:=π·r&lt;sup&gt;2&lt;/sup&gt; + r&lt;sup&gt;2&lt;/sup&gt;/2 makes sense, but not S:=π·r&lt;sup&gt;2&lt;/sup&gt; + r/2.&lt;br /&gt;&lt;br /&gt;Now, let's turn to a problem of &lt;a href="http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm"&gt;multidimensional scaling&lt;/a&gt; (MDS) from the perspective of the dimensional analysis. Roughly speaking, MDS maps data points from an abstract space with a distance function to a low dimensional, normally, Euclidean space like R&lt;sup&gt;2&lt;/sup&gt;. The goal of a MDS algorithm is to construct the map in a way, so that the distance in the low dimensional space approximates the distance in the original abstract space.&lt;br /&gt;&lt;br /&gt;More formally, let D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; be the distance between two data points in the original space, d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; be the distance low dimensional space after MDS mapping; Then a MDS algorithm tries to minimize the following quantity (often called &lt;span style="font-style: italic;"&gt;stress&lt;/span&gt;):&lt;br /&gt;&lt;br /&gt;S := Σ ( D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; - d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;)&lt;sup&gt;2&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;Now, from the view point of dimensional analysis the stress function in its above form is &lt;span style="font-style: italic;"&gt;not &lt;/span&gt;consistent, since D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; and d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; have arguably different dimensions as they measure distance in different spaces.&lt;br /&gt;&lt;br /&gt;One work-around to solve this inconsistency issue is to add a special constant or apply a monotonous function on d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; like:&lt;br /&gt;&lt;br /&gt;S := Σ ( D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; - &lt;span style="font-style: italic;"&gt;C&lt;/span&gt;·d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;)&lt;sup&gt;2&lt;/sup&gt;   &amp;nbsp; &amp;nbsp;         or   &amp;nbsp; &amp;nbsp;        S := Σ ( D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; - &lt;span style="font-style: italic;"&gt;f&lt;/span&gt;(d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;))&lt;sup&gt;2&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;The constant &lt;span style="font-style: italic;"&gt;C&lt;/span&gt; or the function &lt;span style="font-style: italic;"&gt;f&lt;/span&gt;  should magically convert the dimension of d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; to that of D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;. This kind of ad-hoc techniques look pretty artificial. Most scientists probably won't like this method.&lt;br /&gt;&lt;br /&gt;It should be noticed that the RPM (relational perspecitve map) algorithm avoided this inconsistency by using the following stress function:&lt;br /&gt;&lt;br /&gt;S := Σ D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;/d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;&lt;br /&gt;&lt;br /&gt;The dimensional analysis rule does not prohibit composing terms of different dimensions by multiplication. In above formula, both sides of the equation have a completely new dimension, that is the dimension of  D&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt; divided by the dimension of d&lt;sub&gt;&lt;small&gt;ij&lt;/small&gt;&lt;/sub&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-147110495038344333?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/147110495038344333/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=147110495038344333" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/147110495038344333?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/147110495038344333?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/03/dimension-analysis-on-mulitdimensional.html" title="Dimension analysis on mulitdimensional scaling" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CUABR3c4cCp7ImA9WxVQEEw.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-4187922184435155092</id><published>2009-01-15T15:04:00.000-08:00</published><updated>2009-01-26T15:49:16.938-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-01-26T15:49:16.938-08:00</app:edited><title>Experiements with kernel PCA</title><content type="html">The newly released VisuMap software version 2.7 has added the kernel PCA (kPCA) service. From the user's perspective kPCA basically provides the same service as MDS methods, namely maping an abstract set of data points with a similarity distance to a multidimensional vector space, so that  the Euclidean distance in the dimensional space somehow reflect the original similarity distance. Thus, the input for kPCA is a dataset with a similarity distance (or matrix), the output is a table of numerical values with one row vector for each data point.&lt;br /&gt;&lt;br /&gt;As there are many effective analysis methods operating on dimensional vector space (like k-mean clustering, support vector machine), kPCA enables us, like MDS, to apply effective methods to a broader range of data.&lt;br /&gt;&lt;br /&gt;In order see how kPCA works, I created a dataset with about 3000 data points which form together a sphere in the 3D space.  With the Gaussian kernel (a way to measure similarity between data points) kPCA of VisuMap created a 50 dimensional datasets within about 100 seconds. That means, kPCA mapped a 3 D dataset into the 50 D space. The first 3 dimensions in the 50 D space basically mirror the original 3 dimensions.  In order to see how other dimensions look like, I fixed the x- and y-axises of a 3D view window to the first two dimensions, then  assign the z-axis to other dimensions one after the other while rotating the 3D view window.  Then following small video clip shows how the spherical dataset looks like with those extra dimensions:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;object width="320" height="266" class="BLOG_video_class" id="BLOG_video-c65880e670073a4c" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"&gt;&lt;param name="movie" value="http://www.youtube.com/get_player"&gt;
&lt;param name="bgcolor" value="#FFFFFF"&gt;
&lt;param name="allowfullscreen" value="true"&gt;
&lt;param name="flashvars" value="flvurl=http://v21.nonxt3.googlevideo.com/videoplayback?id%3Dc65880e670073a4c%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1329652881%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D7729D5E2DDA19D2B39852E585F1C893FD6AF93DC.7891BD8ED5B00F8256776904794A4B45E6505850%26key%3Dck1&amp;amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3Dc65880e670073a4c%26offsetms%3D5000%26itag%3Dw160%26sigh%3DLJzJgDIcwxzLqwJdQIPt5WREtZ0&amp;amp;autoplay=0&amp;amp;ps=blogger"&gt;
&lt;embed src="http://www.youtube.com/get_player" type="application/x-shockwave-flash"
width="320" height="266" bgcolor="#FFFFFF"
flashvars="flvurl=http://v21.nonxt3.googlevideo.com/videoplayback?id%3Dc65880e670073a4c%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1329652881%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D7729D5E2DDA19D2B39852E585F1C893FD6AF93DC.7891BD8ED5B00F8256776904794A4B45E6505850%26key%3Dck1&amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3Dc65880e670073a4c%26offsetms%3D5000%26itag%3Dw160%26sigh%3DLJzJgDIcwxzLqwJdQIPt5WREtZ0&amp;autoplay=0&amp;ps=blogger"
allowFullScreen="true" /&gt;&lt;/object&gt;
&lt;br /&gt;&lt;br /&gt;The first few seconds of the video shows how the first 3 dimension re-constructs the original sphere. Then, each time when the z-axis switched to another dimension, the sphere turned in to another geometrical shape. I am not sure how those geometrical shape get formed and how to take advantage of those new dimensions. But, those dimensions surely look interesting and a little mysterious.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-4187922184435155092?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="enclosure" type="video/mp4" href="http://www.blogger.com/video-play.mp4?contentId=c65880e670073a4c&amp;type=video%2Fmp4" length="0" /><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/4187922184435155092/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=4187922184435155092" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/4187922184435155092?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/4187922184435155092?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2009/01/experiements-with-kernel-pca.html" title="Experiements with kernel PCA" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;A0ADQ307cSp7ImA9WxRVF0s.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-111521978468497127</id><published>2008-10-30T10:19:00.000-07:00</published><updated>2008-11-15T09:22:52.309-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-11-15T09:22:52.309-08:00</app:edited><title>K-Center  and Affinity Propagation Clustering</title><content type="html">Recent release of VisuMap implements a new clustering algorithm called &lt;a href="http://www.psi.toronto.edu/affinitypropagation/"&gt;Affinity Propagation&lt;/a&gt;  (AP).  AP provides similar service as the metric sampling (MS) algorithm that has been available in VisuMap for a while. In this blog I will compare AP with MS from  practical point of view with concrete samples.&lt;br /&gt;&lt;br /&gt;First, a few words about the implementations of these algorithms. MS is basically a variation of the k-mean algorithm that uses the expectation maximization (EM) method to solve the k-medoid problem. Such algorithms are sometimes called k-center or k-medoid algorithm. Simple direct implementation of k-medoid algorithm often just find bad solutions (i.e. local minimums). People normally has to repeat the algorithm many many times to find an acceptable solution. The MS algorithm implemented in VisuMap uses a stochastic mechanism to avoid or reduce such local minimum problem.&lt;br /&gt;&lt;br /&gt;AP uses message passing mechanism to find representative exemplars within dataset with a similarity structure. AP is basically a deterministic algorithm, it does not require random initialization or randomness during the optimization process. However, AP algorithm often suffers from the oscillation problem in the way that the algorithm wanders around several local minimums for quite long time. To reduce the oscillation, we can use a large dumping factor as suggested in the original paper of AP. But a more efficient way seems to be using small randomness when assigning preferences to the data points. By default, the implementation of AP in VisuMap enables this randomness.&lt;br /&gt;&lt;br /&gt;Thus, both MS and AP in VisuMap are of stochasitc nature. I have applied the two algorithms each 500 times on the dataset &lt;a href="http://www.visumap.net/index.aspx?p=Resources/VisuMapDatasets"&gt;SP500  &lt;/a&gt;(each data point contains the weekly price history of a stock in a year) and got all together 1000 different clustering solutions (each solution is a selection of 25 exemplar points). Using the t-SNE algorithm I created the map of these 1000 solutions as shown in the following picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/SQo5_CuhJfI/AAAAAAAAEbM/qj0KGyKEYiw/s1600-h/SP500.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 256px;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/SQo5_CuhJfI/AAAAAAAAEbM/qj0KGyKEYiw/s400/SP500.png" alt="" id="BLOGGER_PHOTO_ID_5263082869762369010" border="0" /&gt;&lt;/a&gt;In above map each red/green dot represents a solution found by AP/MS algorithm respectively. As a special characteristic of t-SNE algorithm, identical solutions will be arranged as one more circles. That means, dots in a rectangle in above map represent acutally identical solutions.&lt;br /&gt;&lt;br /&gt;We notice first that the two algorithms found very different solutions as red and green dots are clearly separated from each other. There are only two small regions (marked by ovals) where the two algorithms found similar solutions. Secondly, we notice that MS found a large number of different solutions.  AP found just few different solutions. This means that the AP optimization process often converged to common solutions.&lt;br /&gt;&lt;br /&gt;The following table shows the statistics about the solutions found by AP and MS:&lt;center&gt;&lt;table style="border-width: 1px; width: 400px;" border="1" cellpadding="10" cellspacing="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="width: 140px;"&gt;&lt;/td&gt;&lt;td&gt;Metric&lt;br /&gt;Sampling&lt;/td&gt;&lt;td&gt;Affinity&lt;br /&gt;Propagation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Min Error&lt;/td&gt;&lt;td&gt;0.4972&lt;/td&gt;&lt;td&gt;0.4901&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Max Error&lt;/td&gt;&lt;td&gt;0.5058&lt;/td&gt;&lt;td&gt;0.5035&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Average Error&lt;/td&gt;&lt;td&gt;0.5004&lt;/td&gt;&lt;td&gt;0.4960&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Process Time&lt;br /&gt;in Seconds&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;In the above table the error of a solution is the average distance of the data points to their cluster center (i.e. exemplars). The average error is calculated over the 500 solutions found by each of the two algorithms. We can see that AP has found, on the average, better solutions than MS, but the improvement is only about 0.4%. So, it is fair to say that these two algorithms performed similarly with respect to this dataset.&lt;br /&gt;&lt;br /&gt;The situation changed a lot when I applied the two algorithms to a different dataset, namely the dataset &lt;a href="http://jamesxli.blogspot.com/2008/09/feature-selection-by-clustering.html"&gt;PharMine171 &lt;/a&gt; discussed in previous blog entry. This dataset contains about 1000 binary 171-dimensional vectors , whereas the SP500 dataset contains 500 40-dimensional vectors with real values. The following is the map of the clustering solutions and the corresponding statistics.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SQpI9JfnlWI/AAAAAAAAEbU/nLOK0ewA5bU/s1600-h/Pharm171.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 301px;" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SQpI9JfnlWI/AAAAAAAAEbU/nLOK0ewA5bU/s400/Pharm171.png" alt="" id="BLOGGER_PHOTO_ID_5263099329893602658" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;&lt;table style="border-width: 1px; width: 400px;" border="1" cellpadding="10" cellspacing="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="width: 140px;"&gt;&lt;/td&gt;&lt;td&gt;Metric&lt;br /&gt;Sampling&lt;/td&gt;&lt;td&gt;Affinity&lt;br /&gt;Propagation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Min Error&lt;/td&gt;&lt;td&gt;2.1775&lt;/td&gt;&lt;td&gt;2.1101&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Max Error&lt;/td&gt;&lt;td&gt;2.2321&lt;/td&gt;&lt;td&gt;2.1686&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average Error&lt;/td&gt;&lt;td&gt;2.1972&lt;/td&gt;&lt;td&gt;2.1603&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Process Time&lt;br /&gt;in Seconds&lt;br /&gt;&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt;&lt;br /&gt;From above map we can see that AP and MS still produced different set of solutions. AP also produced more different solutions than with the previous dataset. This means that, with respect to this dataset, AP is more sensitive to small variations/noise in the input data.  This phenomenon might be explained as following: Because this dataset contains binary vectors, many similarities between data points might have exact equal values. Those equivalence of similarity will be broken different by small variations in the input data. Since AP indeed uses operation like max() to distinguish even infinitesimal differences, small variations may lead to significantly different solutions.&lt;br /&gt;&lt;br /&gt;We also notice that AP is substantially slower than MS. For larger datasets, the performance difference will be even larger. For instance, for a dataset with about 4000 data points, MS is more than 100 times faster than AP. It should be pointed that both implementations of AP and MS uses the full similarity matrix. It remains to see how much speedup can be achieved by using sparse matrix.&lt;br /&gt;&lt;br /&gt;More interestingly we see that the solutions found by AP is about 3.7% better than those found by MS; and even the worst solution found by AP is better than the best solution found by MS.&lt;br /&gt;&lt;br /&gt;Now, one would ask the question: Does AP always find better solutions than MS? Unfortunately, this is not the case. The following picture shows the 3D presentation of the &lt;a href="http://www.visumap.net/index.aspx?p=Resources/VisuMapDatasets"&gt;spheroid &lt;/a&gt;dataset that contains 1825 data points forming together a spheroid:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SQthz4_YK3I/AAAAAAAAEbc/g7_Z6cfPt_M/s1600-h/Sphorid.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 122px;" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SQthz4_YK3I/AAAAAAAAEbc/g7_Z6cfPt_M/s400/Sphorid.png" alt="" id="BLOGGER_PHOTO_ID_5263408133611989874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As be shown in the following test statistics, solutions found by MS is about 3.7% better than those found by AP:&lt;center&gt;&lt;table style="border-width: 1px; width: 400px;" border="1" cellpadding="10" cellspacing="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="width: 140px;"&gt;&lt;/td&gt;&lt;td&gt;Metric&lt;br /&gt;Sampling&lt;/td&gt;&lt;td&gt;Affinity&lt;br /&gt;Propagation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Min Error&lt;/td&gt;&lt;td&gt;5.9029&lt;/td&gt;&lt;td&gt;5.9029&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Max Error&lt;/td&gt;&lt;td&gt;6.4034&lt;/td&gt;&lt;td&gt;6.6483&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Average Error&lt;/td&gt;&lt;td&gt;6.0948&lt;/td&gt;&lt;td&gt;6.3199&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Process Time&lt;br /&gt;in Seconds&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;284&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/center&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Conclusions: &lt;/span&gt;&lt;br /&gt;I believe AP is a useful and promising addition to the arsenal of clustering algorithms. Not only that AP can find better solutions, it also finds different solutions as existing algorithms. As we know that there are many different clustering problems. AP doesn't seem to be designed to directly attack the k-medoid problem, but nevertheless it is competitive to those algorithms directly targeting the k-medoid problem. This indicates that AP may have potential to attack a much larger class of clustering problems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-111521978468497127?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/111521978468497127/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=111521978468497127" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/111521978468497127?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/111521978468497127?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/10/comparing-two-clustering-algorithms.html" title="K-Center  and Affinity Propagation Clustering" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/SQo5_CuhJfI/AAAAAAAAEbM/qj0KGyKEYiw/s72-c/SP500.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DEAEQn8-fyp7ImA9WxRXFk8.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-985997992205248012</id><published>2008-10-21T12:39:00.000-07:00</published><updated>2008-10-21T15:05:03.157-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-10-21T15:05:03.157-07:00</app:edited><title>Clustering clusters</title><content type="html">Many clustering algorithms, like k-mean algorithm, are of stochastic nature in the sense that they produce different partitions (of a data set) depending on the random initialization.  In this blog I am going to use the mapping service provided by VisuMap to investigate the partitions generated by a particular clustering algorithm. In other word, I want to visualize the clusters among partitions generated by a clustering algorithm.&lt;br /&gt;&lt;br /&gt;The clustering algorithm I am going to look at is the &lt;span style="font-style: italic;"&gt;metrical sampling&lt;/span&gt; (MS) clustering algorithm which is basically a k-mean algorithm with some randomization extension. For a given set of data points with a distance matrix,  MS algorithm tries to find a set of sample points, so that minimizes the &lt;span style="font-style: italic;"&gt;average error&lt;/span&gt;, that is average distance of the data points to their nearest sample point (also called &lt;span style="font-style: italic;"&gt;exemplar&lt;/span&gt;).  Thus, MS algorithm is one of many available algorithms to solve the so called &lt;span style="font-style: italic;"&gt;k-medoid clustering&lt;/span&gt; problem (which is an NP complete problem according to certain literature).&lt;br /&gt;&lt;br /&gt;For a dataset of N data points, MS algorithm produces a N dimensional binary vectors. Each component is a binary value that decides whether a data point is selected as a sample point. So, when we run the MS algorithm many times with different random initializations, we will get a list of N dimensional binary vectors.  I am interested to see clusters or patterns among those binary vectors. For this study, I picked up the dataset &lt;a href="http://jamesxli.blogspot.com/2008/09/feature-selection-by-clustering.html"&gt;PharMine171&lt;/a&gt; that contains 1025 fingerprints of 171 pharmacologically interesting compounds. I have run MS algorithm on the dataset for 1000 times. The following is a map of the 1000 partitions ( produced with the t-SNE mapping algorithm):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SP45btNRMUI/AAAAAAAADYI/9kxjSv159nU/s1600-h/PharmMine171.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SP45btNRMUI/AAAAAAAADYI/9kxjSv159nU/s400/PharmMine171.png" alt="" id="BLOGGER_PHOTO_ID_5259704562969817410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In above map, each spot represents a partition generated by one run of MS algorithm. Closely located partitions means they share more sample points. The coloring is done manually to highlight the clusters. We can easily see that there are 8 or 9 clusters among the 1000 partitions.&lt;br /&gt;&lt;br /&gt;Now, one question that raises naturally is: Do the partitions in one cluster have similar qualities (i.e. average error)?  In order to answer this questions I opened a bar view window that displays the average error of each partition in an ordered bar view as follows:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SP4-CQfpJFI/AAAAAAAADYQ/bYnRdrmAS-M/s1600-h/ErrorView.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SP4-CQfpJFI/AAAAAAAADYQ/bYnRdrmAS-M/s400/ErrorView.png" alt="" id="BLOGGER_PHOTO_ID_5259709623323665490" border="0" /&gt;&lt;/a&gt;In above picture, each bar (in yellow or red color) represents a partition. The height of the bar indicates their average error. The red bars represent those partitions selected manually on the partition maps. So, when we select a cluster in the partition map, we can easily see the corresponding distribution of average error of that clusters. The following two pictures show two such distributions:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SP5HCg1C5vI/AAAAAAAADYY/5GWdHmlKyI4/s1600-h/ClusterErrors.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SP5HCg1C5vI/AAAAAAAADYY/5GWdHmlKyI4/s400/ClusterErrors.png" alt="" id="BLOGGER_PHOTO_ID_5259719523313051378" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SP5HcL3foYI/AAAAAAAADYg/jqDDKYa9Lu4/s1600-h/ClusterErrors2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SP5HcL3foYI/AAAAAAAADYg/jqDDKYa9Lu4/s400/ClusterErrors2.png" alt="" id="BLOGGER_PHOTO_ID_5259719964362776962" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We notice that the average error in a cluster is spread all over the whole error range (from very bad to very good). This means, similar partitions can have very different qualities. Thus, the answer to above questions is negative.&lt;br /&gt;&lt;br /&gt;More in general, I suppose this study indicates that any algorithm trying to find good solutions for the k-medoid problem need a mechanism to allow global variations in the searching space. The MS algorithm implemented in VisuMap does this through controlled randomness. Another interesting algorithm, named &lt;a href="http://www.psi.toronto.edu/affinitypropagation/"&gt;affinity propagation&lt;/a&gt; seems to do that with help of 2N&lt;sup&gt;2&lt;/sup&gt; addition system states (called messages). Affinity propagation is also supported in VisuMap. I will do some comparison between MS and AP in another blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-985997992205248012?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/985997992205248012/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=985997992205248012" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/985997992205248012?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/985997992205248012?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/10/clustering-clusters.html" title="Clustering clusters" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_Wca44kNaYGE/SP45btNRMUI/AAAAAAAADYI/9kxjSv159nU/s72-c/PharmMine171.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DUIDSX49fCp7ImA9WxRSE0k.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-8577764662568854908</id><published>2008-09-13T08:09:00.000-07:00</published><updated>2008-09-13T16:12:58.064-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-13T16:12:58.064-07:00</app:edited><title>Feature selection by clustering</title><content type="html">I have&lt;a href="http://jamesxli.blogspot.com/2007/10/symmetry-between-dimensionality.html"&gt; blogged before&lt;/a&gt; that there is a "symmetry" between &lt;span style="font-style: italic;"&gt;data clustering&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;dimensionality reduction&lt;/span&gt;: whereas the former reduces the number of rows; the latter reduces the number of columns in a data table. Thus, we can theoretically convert any clustering algorithm to a dimensionality reduction method by transposing the data table, and vise verse.&lt;br /&gt;&lt;br /&gt;In this blog I am going to demonstrate how to use the &lt;span style="font-style: italic;"&gt;metric sampling&lt;/span&gt; algorithm to reduce dimensionality with a concrete &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;dataset&lt;/span&gt;.  My sample &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;dataset&lt;/span&gt; &lt;a href="http://www.visumap.net/Common/FetchDoc.aspx?f=PharMine171.xvmz&amp;amp;g=res"&gt;(&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;PharMine&lt;/span&gt;171)&lt;/a&gt; contains data for 171 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;pharmacologically&lt;/span&gt; interesting compounds. Each data point contains 1025 binary vectors which indicate whether the compound contains particular fragments (ie. molecular sub-structurs). Thus, the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;dataset&lt;/span&gt; is a 171 x 1025 table with binary values and its dimension is 1025.&lt;br /&gt;&lt;br /&gt;In order the reduce the dimensionality we first transpose the table to get a 171 dimensional &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;dataset&lt;/span&gt; with 1025 data points (each data point now represents a fragment). In order to get feeling about the transposed &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;dataset&lt;/span&gt; I have created 2 dimensional map for it (with &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;VisuMap&lt;/span&gt;)  as be shown in the following picture:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBRywIMHzI/AAAAAAAADVM/SNaSOnGbzlA/s1600-h/FragmentsMap_tSNE.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBRywIMHzI/AAAAAAAADVM/SNaSOnGbzlA/s400/FragmentsMap_tSNE.PNG" alt="" id="BLOGGER_PHOTO_ID_5242279898613817138" border="0" /&gt;&lt;/a&gt;Map of 1025 binary features&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;Above map shows clearly some clusters among the 1025 data points. In order to find representative columns in our original &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;dataset&lt;/span&gt;, we need to find representative rows in the new data table. In order to do so, we apply the metric sampling algorithm (available in &lt;a href="http://www.visumap.net/index.aspx?p=Products"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;VisuMap&lt;/span&gt;&lt;/a&gt;) that selects representative subset of data points. I have configured the algorithm to select 25 data points. The following picture shows how the 25 data  points are distributed among the 1025 data points:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SMvhMszfs0I/AAAAAAAADVs/Y3xaZbJ7kE8/s1600-h/Features25.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SMvhMszfs0I/AAAAAAAADVs/Y3xaZbJ7kE8/s400/Features25.png" alt="" id="BLOGGER_PHOTO_ID_5245533799304901442" border="0" /&gt;&lt;/a&gt;We can then export the 25 data points as a 25 x 171 table, and transpose the table to get an 171 x 25 data table. This table is then our new &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;dataset&lt;/span&gt; with reduced dimension.&lt;br /&gt;&lt;br /&gt;Now, how can we tell that the reduced &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;dataset&lt;/span&gt; approaches the original &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;dataset&lt;/span&gt;? We can do this pretty easily with &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;VisuMap&lt;/span&gt; by creating maps for both &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;datasets&lt;/span&gt;. If the two &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;datasets&lt;/span&gt; contain about the same information, their maps should be visually similar to each other.  The following picture shows the map of the original &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;dataset&lt;/span&gt;  (on the left side) and of the reduced &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;dataset&lt;/span&gt; (on the right side). They are both created with the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;SMACOF&lt;/span&gt; mapping method. We see that the map of the reduced &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;dataset&lt;/span&gt; reveals clearly the three clusters as the original &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;dataset&lt;/span&gt;, but it shows less details within the clusters.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SMwy9oemeSI/AAAAAAAADV0/88i8eXrAkZQ/s1600-h/MapTwoMaps.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SMwy9oemeSI/AAAAAAAADV0/88i8eXrAkZQ/s400/MapTwoMaps.png" alt="" id="BLOGGER_PHOTO_ID_5245623700398962978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Notice that the &lt;span style="font-style: italic;"&gt;metric sampling&lt;/span&gt; algorithm implemented in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;VisuMap&lt;/span&gt; is a variation of the widely known k-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;medoid&lt;/span&gt; algorithm. The recent release of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;VisuMap&lt;/span&gt; (version 2.6.807) utilized  a randomization strategy in the optimization process that made the algorithm much more robust and less prone to local minimum problem; which has been a big issue for the k-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;medoid&lt;/span&gt; algorithm.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-8577764662568854908?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/8577764662568854908/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=8577764662568854908" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/8577764662568854908?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/8577764662568854908?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/09/feature-selection-by-clustering.html" title="Feature selection by clustering" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBRywIMHzI/AAAAAAAADVM/SNaSOnGbzlA/s72-c/FragmentsMap_tSNE.PNG" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CkQBSX8yeSp7ImA9WxRSEU0.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-1307854311663525506</id><published>2008-09-04T11:05:00.000-07:00</published><updated>2008-09-10T19:32:38.191-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-09-10T19:32:38.191-07:00</app:edited><title>On similarity metrics for chemical compounds</title><content type="html">Recently, Yap Chun Wei has posted a &lt;a href="http://voyagememoirs.com/pharmine/2008/08/29/visumap-part-2/"&gt;dataset &lt;/a&gt;on the &lt;a href="http://voyagememoirs.com/pharmine/"&gt;pharmine&lt;/a&gt; blog. The dataset consists of fingerprints of 171 pharmacologically interesting compounds. Just to recapture,  the fingerprint of a compound is here a vector of 1025 binary flags, each flag in the vector indicates whether a particular molecular fragment is present in the compound. There are many ways to calculate fingerprints. Depending on the nature of the problem, you can use different algorithms or different collection of fragments. The mentioned dataset, for instance, used &lt;a href="http://openbabel.org/wiki/Main_Page"&gt;OpenBabel &lt;/a&gt;to calculate those fingerprints.&lt;br /&gt;&lt;br /&gt;The dataset constains 3 different groups of compounds:  penicillins, cephalosporins, fluoroquinolones. Using VisuMap Yap created different maps of these 171 compounds which showed, more or less, the cluster structure of the dataset. I personally find that the PCA map provides the best visualization. The following picture is a PCA map I created with VisuMap for this dataset:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SMBEGqaAh5I/AAAAAAAADVE/rY-IvzzBMzo/s1600-h/PcaWithFrequency.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SMBEGqaAh5I/AAAAAAAADVE/rY-IvzzBMzo/s400/PcaWithFrequency.PNG" alt="" id="BLOGGER_PHOTO_ID_5242264847512536978" border="0" /&gt;&lt;/a&gt;Compound Map of 171 compounds&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;In above map, the 3 compound groups are displayed as glyphs in 3 different colors. The coloring are done manually with VisuMap based on known information about these groups. Although, for this dataset, you can get almost exactly these 3 clusters using the k-mean algorithm provided by VisuMap. The bar diagram in the picture shows the presence frequency of the 1025 fragments among the 171 selected compounds. That means, a higher bar indicates that a particular fragment is present in more compounds.&lt;br /&gt;&lt;br /&gt;The above map visualizes the similarity information between the 171 compounds. That means, closely located compounds will have similar fragment collections and therefore similar pharmacological  properties. The similarity information are basically encoded in the fingerprints. Thus, the method to calculate of those fingerprints is naturally critical for this kind of data analysis.&lt;br /&gt;&lt;br /&gt;In order to better understand those fingerprints we can created a map of those 1025 features with VisuMap.  In order to do so, we simply transpose the binary data table (via the menu Edit&gt;Filter Data&gt;Transpose Table in VisuMap), so that each binary feature becomes a new data point; and each compound become a feature in the transposed dataset. We can then pick a mapping algorithm and metrics to create a feature map. The following picture shows such a map created with the t-SNE algorithm and the tanimoto dissimilarity metric:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBRywIMHzI/AAAAAAAADVM/SNaSOnGbzlA/s1600-h/FragmentsMap_tSNE.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBRywIMHzI/AAAAAAAADVM/SNaSOnGbzlA/s400/FragmentsMap_tSNE.PNG" alt="" id="BLOGGER_PHOTO_ID_5242279898613817138" border="0" /&gt;&lt;/a&gt;Feature map of 1025 binary features&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;Above picture shows 4 or 5 clear clusters on the left side represented by colored glyphs. The rest are more or less randomly distributed. It turned out that those yellow-square features are those fragments which are NOT present in any of those 171 compounds (all bits are zero). Therefore, they carry no direct information about our compounds. Interestingly, these zero vectors form together a homogeneous cluster in the map.&lt;br /&gt;&lt;br /&gt;Other clusters in above map represent groups of fragments which have high frequency and are informative to distinguish the three compound groups in the original dataset. We can verify this with the help of the bar diagram in VisuMap as follows:&lt;br /&gt;&lt;br /&gt;We first open the feature map in a separate window (via the menu Tools&gt;Map Snapsot). Then open the compound map, and then select all compounds and open the bar view through the context menu "Bar View". The bar view by default displays the frequency in the order as given in the transposed data table. We then sort the bars through the context menu "Sort Values" so that bars are displayed in the order from low frequency to high frequency as depicted in the following two picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SMBaE1aKb1I/AAAAAAAADVc/34qF6pRx6YE/s1600-h/SortingBarView.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SMBaE1aKb1I/AAAAAAAADVc/34qF6pRx6YE/s400/SortingBarView.PNG" alt="" id="BLOGGER_PHOTO_ID_5242289005362048850" border="0" /&gt;&lt;/a&gt;The sorted bar view shows 3  plateaus which correspond to clusters in the feature map and in the compound maps. In order to see the correspondence we select a plateau in the bar view with the mouse, the snapshot window of the feature map will automatically high light those selected features. As we can see in the following picture, the selected plateau of features clearly correspond to a particular cluster in the feature map (the marked cluster at lower left corner).&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBUw8tALpI/AAAAAAAADVU/d72IQC_EU70/s1600-h/FragmentsMapWithFreq2.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_Wca44kNaYGE/SMBUw8tALpI/AAAAAAAADVU/d72IQC_EU70/s400/FragmentsMapWithFreq2.PNG" alt="" id="BLOGGER_PHOTO_ID_5242283166164594322" border="0" /&gt;&lt;/a&gt;Correspondence between high frequency features and feature clusters&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;We notice that some of the clusters in the feature map show some fine sequential structure that may lead to more hints about the internal structure of the fragment collections.&lt;br /&gt;&lt;br /&gt;With the knowledge about informative feature clusters we can, for instance, reduce the number of features significantly without significant loss of information about the clusters in the original dataset. The VisuMap dataset folder &lt;a href="http://www.visumap.net/Common/FetchDoc.aspx?f=PharMine171.xvmz&amp;amp;g=res"&gt;PharMine171.xvmz (zipped XML file)&lt;/a&gt;  includes a reduced dataset with 298 features that characterizes similar similarity information as the original dataset with 1025 features.&lt;br /&gt;&lt;br /&gt;The above VisuMap dataset folder also contains feature maps created with other mapping algorithms. It is interesting to notice that for the feature map dataset, the t-SNE algorithm provides the best visualization, whereas the result of the PCA algorithm is rather disappointing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-1307854311663525506?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/1307854311663525506/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=1307854311663525506" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1307854311663525506?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1307854311663525506?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/09/on-similarity-metrics-for-chemical.html" title="On similarity metrics for chemical compounds" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_Wca44kNaYGE/SMBEGqaAh5I/AAAAAAAADVE/rY-IvzzBMzo/s72-c/PcaWithFrequency.PNG" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;Dk4AR3g-fSp7ImA9WxdbFkU.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-2778740621244928743</id><published>2008-08-13T08:56:00.000-07:00</published><updated>2008-08-13T20:35:46.655-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-08-13T20:35:46.655-07:00</app:edited><title>New mapping algorithm t-SNE</title><content type="html">Recently I spent some time to experiment with the  t-SNE mapping algorithm proposed by L. Maaten and G. Hinton in &lt;a href="http://www.cs.unimaas.nl/l.vandermaaten/Laurens_van_der_Maaten/t-SNE_files/utml.pdf"&gt;Visualizing Data Using t-SNE.&lt;/a&gt; The algorithm is a variation of the &lt;a href="http://www.cs.toronto.edu/%7Eroweis/papers/sne_final.pdf"&gt;SNE&lt;/a&gt; algorithm developed few years ago.&lt;br /&gt;&lt;br /&gt;The t-SNE &lt;a href="http://www.cs.unimaas.nl/l.vandermaaten/Laurens_van_der_Maaten/t-SNE.html"&gt;home site&lt;/a&gt; provides source code, technical papers and many samples, so that it is pretty easy to test the algorithm. The matlab source code is provided as a component in a matlab toolbox that also implements many other popular methods for dimensionality reduction. The web site also provides a binary executable that is optimized for the Intel/Windows platform. It took me a while to figure out its binary data format, but the time spent is worthwhile, as it runs much faster than the matlab version.&lt;br /&gt;&lt;br /&gt;As demonstrated in its technical paper, t-SNE has the nice capability to show the cluster structure of high dimensional datasets in low dimensional maps. More interestingly, I noticed that t-SNE is able to segment dataset with complex structure in a natural way to preserve relevant information from certain perspective. For instance, the left map in the following picture shows a dataset that forms two crossing squares in 3D space, the map on the right side is a 2D map created with t-SNE. We notice that t-SNE segmented the connected shape into 5 clusters. The segmentation occurred roughly along the center line where the two squares crossing each other.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SKNPBu_rpiI/AAAAAAAADUc/J1H1ZpCu3rE/s1600-h/TwoSquaresTSNE.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SKNPBu_rpiI/AAAAAAAADUc/J1H1ZpCu3rE/s400/TwoSquaresTSNE.png" alt="" id="BLOGGER_PHOTO_ID_5234114083148244514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In general, I think there are two major challenges for mapping algorithms intended to visualize high dimensional complex data. The first one, I call it the &lt;span style="font-style: italic;"&gt;singularity problem&lt;/span&gt;, is present in the above sample. The dataset is mostly a 2D surface except the center region where the two square cross each other. In order to map this dataset into 2D map in non-overlapping manner, an algorithm has to find a non-trivial way to segment the dataset along the singularity region. t-SNE did a good job here compared to other algorithms like CCA (curvilinear component analysis) which segments the dataset rather randomly.&lt;br /&gt;&lt;br /&gt;The second challenge, I call it the &lt;span style="font-style: italic;"&gt;closedness problem&lt;/span&gt;, occurs when the dataset forms certain kind of closed manifold in high dimensional space, like sphere, torus (e.g. surfaces of 3D objects). In this case, a mapping algorithm needs to have a mechanism to cut-off the closed structure underlying the dataset. The t-SNE algorithm behaviors also pretty well in addressing  this kind of problems. The following video clip (produced with &lt;a href="http://www.visumap.net/index.aspx?p=Products"&gt;VisuMap 2.6.800&lt;/a&gt;) shows how t-SNE gradually deforms two overlapping spheres to a 2D map.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;object width="320" height="266" class="BLOG_video_class" id="BLOG_video-34311eaa011321ab" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"&gt;&lt;param name="movie" value="http://www.youtube.com/get_player"&gt;
&lt;param name="bgcolor" value="#FFFFFF"&gt;
&lt;param name="allowfullscreen" value="true"&gt;
&lt;param name="flashvars" value="flvurl=http://v9.nonxt2.googlevideo.com/videoplayback?id%3D34311eaa011321ab%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1329652881%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D4C2878321C53FCF97966BA0A9A24E13764100B1.29622DA53C24DA9EB8BBAED10BE12C6431493114%26key%3Dck1&amp;amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3D34311eaa011321ab%26offsetms%3D5000%26itag%3Dw160%26sigh%3Dd2E0fsbKH_OikkTX9eZ6K5sfCus&amp;amp;autoplay=0&amp;amp;ps=blogger"&gt;
&lt;embed src="http://www.youtube.com/get_player" type="application/x-shockwave-flash"
width="320" height="266" bgcolor="#FFFFFF"
flashvars="flvurl=http://v9.nonxt2.googlevideo.com/videoplayback?id%3D34311eaa011321ab%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1329652881%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D4C2878321C53FCF97966BA0A9A24E13764100B1.29622DA53C24DA9EB8BBAED10BE12C6431493114%26key%3Dck1&amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3D34311eaa011321ab%26offsetms%3D5000%26itag%3Dw160%26sigh%3Dd2E0fsbKH_OikkTX9eZ6K5sfCus&amp;autoplay=0&amp;ps=blogger"
allowFullScreen="true" /&gt;&lt;/object&gt;
&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;We notice that t-SNE first captures the global structure of dataset (like the Sammon algorithm  or PCA), then deforms the map locally to achieve non-overlapping mappings. This kind of dynamical behavior is quite similar as the CCA algorithm.&lt;br /&gt;&lt;br /&gt;The implementation of t-SNE is pretty straightforward, and it can be easily tuned for a wide range of data. Even with its current simple implementation, some of its features already surpassed many other popular mapping algorithms. So, I believe it will become a valuable tool for analyzing high dimensional data.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-2778740621244928743?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="enclosure" type="video/mp4" href="http://www.blogger.com/video-play.mp4?contentId=34311eaa011321ab&amp;type=video%2Fmp4" length="0" /><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/2778740621244928743/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=2778740621244928743" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2778740621244928743?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/2778740621244928743?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/08/new-mapping-algorithm-t-sne.html" title="New mapping algorithm t-SNE" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_Wca44kNaYGE/SKNPBu_rpiI/AAAAAAAADUc/J1H1ZpCu3rE/s72-c/TwoSquaresTSNE.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0YHQ38zfSp7ImA9WxRVFUs.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-1650690012226936125</id><published>2008-04-12T20:24:00.000-07:00</published><updated>2008-11-12T23:25:32.185-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-11-12T23:25:32.185-08:00</app:edited><title>Villarceau Circles and Diagonal Rotation of RPM Map</title><content type="html">It has been known for long time that there are 4 perfect circles passing every points on the a torus surface. We can get two of these circles buy cutting the torus vertically and horizontally through that points. These two circles are the normal circles most people can easily visualize. The other two, not so apparent, circles are the so-called Villarceau circles. They are made by cutting the torus diagonally as depicted in the following animation (IE7 users need to enable the flag Tools&gt;Internet Options&gt;Advanced&gt;"Play animations in web pages" to see the animation):&lt;br /&gt;&lt;div style="TEXT-ALIGN: center"&gt;&lt;img alt="Villarceau Circles Animation" src="http://upload.wikimedia.org/wikipedia/commons/f/f1/Villarceau_circles.gif" /&gt;&lt;br /&gt;&lt;/div&gt;&lt;center&gt;Figure 1: Creating the Villarceau Circles on a torus.&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;Above animation is very helpful to visualize the Villarceau Circles. Another simple way to show the Villarceau circles is using the &lt;span style="FONT-STYLE: italic"&gt;fundamental polygon &lt;/span&gt;representation where a torus is represented as a rectangle whose opposite edges are glued together. In the polygon represetnation, the horizontal and vertical edges are the two normal circles (horizontal and vertical circles). The two diagonals are the Villarceau circles as depicted in the following diagram:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJdcZhbmsI/AAAAAAAADGI/zutNJHUFtqE/s1600-h/FourCirclesOnTorus.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188812463153060546" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJdcZhbmsI/AAAAAAAADGI/zutNJHUFtqE/s400/FourCirclesOnTorus.PNG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 2: Four circles on a torus passing through a point.&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;We know that RPM algorithm simulates a multi-particle dynamic system confined on the flat torus surface. For some unknown reasons, RPM maps often form symmetry patterns alone the diagonal lines, e.g. along the Villarcea circles. For instance, when we map the sphere to the tours, RPM algorithm cuts the sphere into two equal sized half and map them as two discs to the torus surface. The two discs are positioned alone a a diagonal line as depicted in the following picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SAGNLJhbmgI/AAAAAAAADEs/IiQYwBQgq4Q/s1600-h/SphereOnTorus.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188583468381739522" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SAGNLJhbmgI/AAAAAAAADEs/IiQYwBQgq4Q/s400/SphereOnTorus.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 3: RPM Map of a sphere&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;We see in above picture that only one disc is completely displayed as a whole, the other one has been cut by the edges into 3 or 4 pices (which will form a complete disc when we glue the opposite edge together). The whole picture is more or less symmetric with respect to the left-up-to-right-bottom diagonal line. We can get a better visualization of above image when we tile multiple copies of above map together as be show in the following picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJapJhbmrI/AAAAAAAADGA/y69zzO17Mos/s1600-h/TilingSpheremap2.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188809383661509298" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJapJhbmrI/AAAAAAAADGA/y69zzO17Mos/s400/TilingSpheremap2.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 4: Tiling duplicates of Figure 3.&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;Now, it would be helpful for the purpose of visualization if we can show the two discs as whole discs in a single map without using the tiling techniques. We can achieve this by transforming the coordinator from the two vertical and horizontal edges to the two diagonals (i.e. the two Villarceau circles). The transformation is schematically depicted in the following diagram:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJe9ZhbmtI/AAAAAAAADGQ/5nknfbzIdCQ/s1600-h/DiagonalRotation.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188814129600371410" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJe9ZhbmtI/AAAAAAAADGQ/5nknfbzIdCQ/s400/DiagonalRotation.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 5: Rotation to the diagonal circles.&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;We call the above transformation the &lt;span style="FONT-STYLE: italic"&gt;diagonal rotation&lt;/span&gt; as it basically rotate the coordinator to the diagonals. Formally, the diagonal rotation is done by the following forms:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJGOJhbmlI/AAAAAAAADFU/8bvzdHiQMvg/s1600-h/DiagonalRotation2.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188786929572485714" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJGOJhbmlI/AAAAAAAADFU/8bvzdHiQMvg/s400/DiagonalRotation2.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In above forms &lt;span style="FONT-STYLE: italic"&gt;w &lt;/span&gt;and &lt;span style="FONT-STYLE: italic"&gt;h &lt;/span&gt;are the width and height of the fundamental rectangle respectively. &lt;span style="FONT-STYLE: italic"&gt;d &lt;/span&gt;is the length of its diagonal. The above forms transforms points from the region D to D'. For points from other regions, we have to apply some simple shifting after the transformation as indicated in Figure 5.&lt;br /&gt;&lt;br /&gt;As indicated in Figure 5, the two diagonals will become the edges of the transformed map. Thus, in order to avoid the resulting map be cut off by the edges we should shift a RPM map away from the diagonals through torus cyclical shifting. For our sphere example we can shift the map, for instance, to the following position:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJTyJhbmmI/AAAAAAAADFc/MxhXy4Ym94A/s1600-h/SphereOnTorusShifted.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188801841698937442" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://4.bp.blogspot.com/_Wca44kNaYGE/SAJTyJhbmmI/AAAAAAAADFc/MxhXy4Ym94A/s400/SphereOnTorusShifted.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 6: Sphere map shifted off from the diagonal lines.&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;When we now apply the diagonal rotation on the map in Figure 6 we will get the following rotated map:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJVxZhbmoI/AAAAAAAADFs/lWpJOmVoOig/s1600-h/SphereRotated.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188804027837291138" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJVxZhbmoI/AAAAAAAADFs/lWpJOmVoOig/s400/SphereRotated.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 7: The sphere map after diagonal rotation.&lt;/center&gt;&lt;br /&gt;Comparing with the original RPM map in Figure 3, the rotated map in figure 7 is obviously easier to investigate for human eyes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It should be pointed out that a RPM map after diagonal rotation is no long a torus map in the sense that you cannot tile the map to fill 2D space the same way as be shown in Figure 4. Since this will break the edge-connectivity as indicated in Figure 5. But you can fill the 2D space with the rotated RPM map like building a bridge wall: you tile the map site by site to build a continuous lay, then build next lay the same while shift the lay to the left (or right) for half block. The following picture depicts this characteristics:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Wca44kNaYGE/SAJZN5hbmqI/AAAAAAAADF4/IsQeKtD-sIM/s1600-h/TilingRotatedMap.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5188807815998446242" style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: pointer; TEXT-ALIGN: center" alt="" src="http://3.bp.blogspot.com/_Wca44kNaYGE/SAJZN5hbmqI/AAAAAAAADF4/IsQeKtD-sIM/s400/TilingRotatedMap.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;center&gt;Figure 8: Tiling with rotated map.&lt;/center&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-1650690012226936125?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/1650690012226936125/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=1650690012226936125" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1650690012226936125?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1650690012226936125?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/04/villarceau-circles-and-diagonal.html" title="Villarceau Circles and Diagonal Rotation of RPM Map" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_Wca44kNaYGE/SAJdcZhbmsI/AAAAAAAADGI/zutNJHUFtqE/s72-c/FourCirclesOnTorus.PNG" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;D08NQX49cSp7ImA9WxZUFE0.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-3864716895589242483</id><published>2008-04-04T17:00:00.001-07:00</published><updated>2008-04-05T07:31:30.069-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-04-05T07:31:30.069-07:00</app:edited><title>The Equivalence of Cooling and Expanding Dynamics</title><content type="html">The RPM (relational perspective map) algorithm has a seemly add-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;hoc&lt;/span&gt; measure that gradually decreases the learning rate to force the convergence of the algorithm. I have tried to avoid such measure with various methods (e.g. conjugated algorithm exploring second order gradient), but was not able to find anything working.&lt;br /&gt;&lt;br /&gt;I have notice a significant problem here: In the original Newton-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Raphson&lt;/span&gt; algorithm, the algorithm should converge to a local minimum if the learning rate is sufficiently small. But, this not so for the optimization problem of RPM. The algorithm never converges, no matter how small the learning rate is (in a way that is doable with double precision arithmetic). If I keep the learning rate constant at certain small value, the simulated dynamic system goes in to an equilibrium state with constant variance.&lt;br /&gt;&lt;br /&gt;Thus, it looks like NR method, as it is, does not work for massively complex dynamic system, it is not stable enough. Letting the learning rate gradually vanish seems to be the simplest method to enforce the convergence.&lt;br /&gt;&lt;br /&gt;Mathematically, a very fine granulated gradient descent algorithm should be able to find a local minimum. But, I think this is practically not possible. To understand this, we notice that the learning rate practically controls the average variance of the positions of the simulated particles. That means, the learning rate is a kind of "temperature" of the system. In physics, we know that if we keep the temperature constant, the system will never converge to a frozen state. We can only reach that state when we gradually reduce the temperature to zero.&lt;br /&gt;&lt;br /&gt;It is interesting to notice that the cooling dynamic system simulated by RPM is theoretically equivalent to a n-body dynamic system that is ever expanding. To see this, we notice that the learning rate is directly linked to the average variance of the particle's positions. Thus, instead of gradually lowering learning rate, we can just reduce the scale of unit length of the space compared to total size of the space (e.g. the size of the torus). The latter is again equivalent to a system in which we keep the scale of unit length constant, but gradually expand the size of the whole space.&lt;br /&gt;&lt;br /&gt;We see here certain similarity of RPM model with the inflationary big-bang model from cosmology. Whereas RPM keeps the total size of the system stationary and gradually reduces the energy, the big-bang model keeps the total energy constant but let the universe expands. It is probably not a too crazy idea to consider a dual big-bang model that keeps the apparent universe stationary, but magically reduces the total energy/mass/temperature.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-3864716895589242483?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/3864716895589242483/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=3864716895589242483" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3864716895589242483?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/3864716895589242483?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/04/equivalence-of-cooling-and-expanding.html" title="The Equivalence of Cooling and Expanding Dynamics" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CUAAQH48fip7ImA9WxRaEkQ.&quot;"><id>tag:blogger.com,1999:blog-710383708872670121.post-1447266590970600349</id><published>2008-03-10T08:28:00.000-07:00</published><updated>2008-12-14T14:29:01.076-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-12-14T14:29:01.076-08:00</app:edited><title>The "Big Three"  Data Categories</title><content type="html">Data analysis is an essential part of science and engineering. As a guide to deal with various data analysis tasks, I find it helpful to group data into three basic categories:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Multidimensional tabular data (e.g . tables of numbers and text).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Time series data (e.g. image, audio/video  data).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Graphics oriented data (e.g. tree and network structured data).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;All data analysis methods can be grouped in to three categories depending on which data category they target. So, for instance, wavelet transformation is a method for time series; whereas MDS are methods for multidimensional data. Context free grammar and text parsers are mainly tools for graphics oriented data.&lt;br /&gt;&lt;br /&gt;Furthermore, all software applications can also be grouped in to three corresponding categories. A software package that supports multiple data categories (like java system, Matlab) is either a generic tool that requires further software for end-users; or it is hopelessly difficult to design, implement, understand and to use.&lt;br /&gt;&lt;br /&gt;When I encounter a new data analysis problem I first identify its category, then I start to search for methods and software packages targeting that category. I would not try, for instance, to use MDS or SQL to analyze images or video/audio data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/710383708872670121-1447266590970600349?l=jamesxli.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jamesxli.blogspot.com/feeds/1447266590970600349/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://www.blogger.com/comment.g?blogID=710383708872670121&amp;postID=1447266590970600349" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1447266590970600349?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/710383708872670121/posts/default/1447266590970600349?v=2" /><link rel="alternate" type="text/html" href="http://jamesxli.blogspot.com/2008/03/big-three-data-categories.html" title="The &quot;Big Three&quot;  Data Categories" /><author><name>James X. Li</name><uri>http://www.blogger.com/profile/17491271097854752129</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="31" src="http://2.bp.blogspot.com/_Wca44kNaYGE/Sal7IrL8FJI/AAAAAAAAEyU/yhHymRkjWY4/S220/JamesPicLarge2.JPG" /></author><thr:total>0</thr:total></entry></feed>

