Journey into Machine Learning (Incoherent Ramblings)<br /><br /><h2>Transfer learning and hotdog classification!</h2><i>2017-06-22</i><br /><br />Inspired by a <a href="https://www.youtube.com/watch?v=ACmydtFDTGs">recent episode</a> of Silicon Valley in which Jian Yang builds an app to classify pictures as either <i>Hotdog</i> or <i>Not Hotdog</i>, I decided to have a crack at retraining the final layer of the Inception v3 network to do the same.<br /><br />Thankfully, the lovely folks at Google make it <a href="https://www.tensorflow.org/tutorials/image_retraining">ridiculously easy</a>. Instead of maxing out my poor little MacBook Pro for weeks (or months :/) of training, we can reuse the already trained network and simply adjust the final layer for our use - more on this below.<br /><br />Note that this post assumes familiarity with CNN architectures and their use in computer vision - see the excellent Stanford course for an introduction: <a href="http://cs231n.stanford.edu/">CS231n: Convolutional Neural Networks for Visual Recognition.</a><br /><br /><h2>Background</h2><div>The Inception v3 image recognition neural network came out of Google at the end of 2015. Its architecture builds upon the original inception architecture of GoogLeNet [1], which <i>was also designed to perform well even under strict constraints on memory and computational budget</i> [2].<br /><br /><h3>Inception v1</h3>The basis of this (at the time) new architecture was the so-called <i>inception</i> module. It builds on the traditional convolutional layer, which applies filters of a single fixed height and width. 
Instead it applies an assortment of different-sized filters to the input, allowing the model to perform multi-level feature extraction on each input:</div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-M8pHCGWcfXE/WUeSVRrijjI/AAAAAAAAAK4/COB4Y6C27n0Kb2qB_TuAmhduVr5Fn8VwgCLcBGAs/s1600/inception.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="254" data-original-width="549" height="185" src="https://4.bp.blogspot.com/-M8pHCGWcfXE/WUeSVRrijjI/AAAAAAAAAK4/COB4Y6C27n0Kb2qB_TuAmhduVr5Fn8VwgCLcBGAs/s400/inception.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><i><span style="font-family: inherit;"><br /></span></i><i><span style="font-family: inherit;">Figure 1: <span class="qlink_container">Taken from Going deeper with convolutions [1] - the naive inception module.</span></span></i><br /><i><span style="font-family: inherit;"><span class="qlink_container"><br /></span></span></i>As you can see above, there are $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions as well as a $3 \times 3$ max pooling applied to the input. 
In theory, this is great - we are able to exploit spatial correlations at multiple scales in the input; however, using a large number of $5 \times 5$ filters can be computationally expensive.<br /><br />This train of thought led to the next iteration of the inception module:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-QVkPhIXPWFQ/WUeUrqbw-hI/AAAAAAAAALE/Oh0jLzFtguIlTmibyPDWfBIljWyXHcHwgCLcBGAs/s1600/inception1a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="279" data-original-width="550" height="202" src="https://2.bp.blogspot.com/-QVkPhIXPWFQ/WUeUrqbw-hI/AAAAAAAAALE/Oh0jLzFtguIlTmibyPDWfBIljWyXHcHwgCLcBGAs/s400/inception1a.png" width="400" /></a></div><i><span style="font-family: inherit;">Figure 2: <span class="qlink_container">Taken from Going deeper with convolutions [1] - inception module with dimension reductions.</span></span></i><br /><i><span style="font-family: inherit;"><span class="qlink_container"><br /></span></span></i>This architecture is essentially the same as the initial proposal except that each of the larger filters is preceded by a $1 \times 1$ convolution. These $1 \times 1$ convolutions act as a dimension reduction tool in the <i>channel</i> dimension, leaving the spatial dimensions untouched. The idea is that the important cross-channel information will be captured without explicitly keeping every channel. Once the channel dimension reduction has been performed, these results feed into the larger filters, which capture both spatial and channel correlations. 
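To see concretely why a $1 \times 1$ convolution reduces only the channel dimension, note that per spatial location it is nothing more than a matrix multiplication over channels. A minimal NumPy sketch (the channel counts here are made up for illustration, not taken from the papers):

```python
import numpy as np

def conv1x1(x, w):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    x: input feature map of shape (H, W, C_in)
    w: filter weights of shape (C_in, C_out)
    Returns a feature map of shape (H, W, C_out) - spatial dims untouched.
    """
    return np.einsum('hwc,cd->hwd', x, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # e.g. 192 input channels
w = rng.standard_normal((192, 64))       # reduce to 64 channels
y = conv1x1(x, w)
print(y.shape)  # (28, 28, 64)
```

Each output pixel `y[i, j]` is just `x[i, j] @ w`, so the spatial grid is passed through unchanged while the channel count drops from 192 to 64.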
The $1 \times 1$ filters also have ReLU activation functions which introduce additional non-linearities into the system.<br /><br />As the authors put it in [1]:<br /><i><br /></i><i>One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously. </i><br /><i><br /></i><br /><h3>Inception v2/3</h3></div><div>A key realisation from the earlier iterations of Inception was that their improved performance compared to their peers was driven by dimensionality reduction using the $1 \times 1$ filters. This has a natural interpretation in computer vision; as the authors put it [2]:<br /><i><br /></i><i>In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.</i><br /><br />With reducing computational complexity (and hence the number of parameters in the network) in mind, the authors sought a way to keep the power of the larger filters but reduce how expensive the operations involving them were. 
Thus they investigated replacing the larger filters with a multi-layer network that has the same input/output size but fewer parameters.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-RwBeqgdGho0/WUecImTbUTI/AAAAAAAAALU/lOZYmBuS6gIb8VBm7RkJvUp6EClnfG5PgCLcBGAs/s1600/5x5replacement.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="202" data-original-width="263" height="245" src="https://2.bp.blogspot.com/-RwBeqgdGho0/WUecImTbUTI/AAAAAAAAALU/lOZYmBuS6gIb8VBm7RkJvUp6EClnfG5PgCLcBGAs/s320/5x5replacement.PNG" width="320" /></a></div><i>Figure 3: Replacing a $5 \times 5$ filter with a mini-network of two stacked $3 \times 3$ filters.</i><br /><i><br /></i>The inception module now looks like this:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-72xGFZWrzcw/WUjhLTOtzRI/AAAAAAAAALk/1jo3q2Yn8vou4GHwe3WbK7k5X5ZlWG_8wCLcBGAs/s1600/inceptionaa.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="319" data-original-width="325" height="314" src="https://1.bp.blogspot.com/-72xGFZWrzcw/WUjhLTOtzRI/AAAAAAAAALk/1jo3q2Yn8vou4GHwe3WbK7k5X5ZlWG_8wCLcBGAs/s320/inceptionaa.PNG" width="320" /></a></div><i>Figure 4: Inception module where each $5 \times 5$ filter has been replaced by two $3 \times 3$ filters.</i><br /><i><br /></i>This replacement results in a 28% reduction in computational cost [2]. 
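The claimed saving follows from simple parameter counting. Assuming, for illustration, equal input and output channel counts $C$ at each stage (a simplification of the real Inception towers), one $5 \times 5$ filter bank costs $25C^2$ weights while two stacked $3 \times 3$ banks cost $18C^2$:

```python
# Back-of-the-envelope parameter count, assuming C input and C output
# channels at each stage (a simplification - real towers vary channel counts).
C = 64
params_5x5 = 5 * 5 * C * C            # one 5x5 convolution
params_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 convolutions

saving = 1 - params_two_3x3 / params_5x5
print(f"{saving:.0%}")  # 28%
```

The ratio $18/25$ is independent of $C$, which is why the relative saving holds regardless of the actual channel counts.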
Inception v2 and v3 share the same inception module architecture; the difference arises in how they are trained - namely batch normalization of the fully connected layer of the auxiliary classifier.<br /><br /></div><h3>Transfer Learning</h3><div>Now that we have a (very) high level understanding of the architecture of Inception v3, let's talk about transfer learning.</div><div><br /></div><div>Given how computationally expensive it is to train a CNN from scratch - a task which can take months - we can use the results of a pre-trained CNN to do the feature extraction for us. The idea is that the generic features of an image (edges, shapes, etc.) which are captured in the early layers of the CNN are common to most images. That is, we can utilise a CNN that has been trained for a specific task on millions of images to do the feature extraction for our image recognition task.</div><div><br /></div><div>We then strip the CNN of the final fully connected layer (which feeds the softmax classifier), replace it with a fully connected layer built for our own task, and train that. There are several approaches to transfer learning, dependent on the task: we can use the pre-trained CNN as a fixed feature extractor as detailed above, or we can use it as a starting point and fine-tune the weights in the convolutional layers. The amount of training data you have for your task and the details of the task will generally dictate the approach. 
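As a toy illustration of the feature-extractor approach - everything below is synthetic, with random vectors standing in for the bottleneck features a real pre-trained CNN would produce - we hold the "extractor" fixed and train only a new softmax layer on top:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for "bottleneck" features: in the real workflow these
# would come from the frozen pre-trained CNN's penultimate layer.
n, d, classes = 200, 64, 2             # images, feature dim, two classes
features = rng.standard_normal((n, d))
true_w = rng.standard_normal((d, classes))
labels = (features @ true_w).argmax(axis=1)   # synthetic ground truth

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Train ONLY the new final layer; the "feature extractor" stays frozen.
w = np.zeros((d, classes))
onehot = np.eye(classes)[labels]
for _ in range(500):
    probs = softmax(features @ w)
    grad = features.T @ (probs - onehot) / n  # cross-entropy gradient
    w -= 0.5 * grad                           # full-batch gradient descent

accuracy = (softmax(features @ w).argmax(axis=1) == labels).mean()
```

Because only the $d \times \text{classes}$ weight matrix is trained, this is orders of magnitude cheaper than updating the millions of convolutional weights beneath it.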
See [3] for a detailed study on transfer learning.</div><div><br /></div><div>The pre-trained CNN we'll be utilising is an Inception v3 network as described above, trained on the ImageNet 2012 dataset, which contained 1.2 million images across 1,000 categories.<br /><br /><h3>Implementation</h3></div><div>Thankfully Google had the foresight to make it dead easy to perform the aforementioned transfer learning using TensorFlow.<br />For my training images, I utilised the work done in [4], which provided around 350 hotdog images and around 50 not hotdog (notdog) images. We then utilise the script <i><a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py">retrain.py</a></i>, which we'll pass parameters to from the terminal.<br /><br />The script will determine the number of classes based on the number of folders in the parent folder. You don't need to worry about specific names for the images; you just need to make sure each folder contains the correctly labelled images.<br /><br />The rest of the parameters are fairly self-explanatory and are explained in the <a href="https://www.tensorflow.org/tutorials/image_retraining">tutorial</a>. One interesting pair of parameters to note is <i>random_crop</i> and <i>random_scale</i>; these distort the training images in ways that preserve their meaning but add new training examples. 
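Putting that together, an invocation might look something like the following (the flag names are those defined in the linked retrain.py script; the directory paths are placeholders - the subfolder names under the image directory become the class labels):

```shell
# Placeholder paths; ~/training_images should contain one subfolder per class,
# e.g. hotdogs/ and random/.
python retrain.py \
  --image_dir ~/training_images \
  --how_many_training_steps 500 \
  --output_graph /tmp/output_graph.pb \
  --output_labels /tmp/output_labels.txt \
  --random_crop 10 \
  --random_scale 10
```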
For example, a slightly cropped or rescaled picture of a hotdog is still a hotdog; instead of sourcing additional hotdog pictures we can distort existing training examples to teach the classifier something new.<br /><br /></div><h3>Results</h3><div>After running retrain.py with 500 training steps, I received the following training summary</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Gi0J0PbgzBU/WUkDFL7rZoI/AAAAAAAAAL0/wU7Byptlr1gNuaoyRGC5mvr4Ik46o0FZQCLcBGAs/s1600/Screen%2BShot%2B2017-06-20%2Bat%2B9.10.34%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1168" height="256" src="https://1.bp.blogspot.com/-Gi0J0PbgzBU/WUkDFL7rZoI/AAAAAAAAAL0/wU7Byptlr1gNuaoyRGC5mvr4Ik46o0FZQCLcBGAs/s400/Screen%2BShot%2B2017-06-20%2Bat%2B9.10.34%2BPM.png" width="400" /></a></div><br /><div style="text-align: left;"><span style="font-family: inherit;">99% validation accuracy - not bad at all!</span></div><div style="text-align: left;"><span style="font-family: inherit;">Let's have a look at a specific example and try the hotdog below</span></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Hk11UJH0Ryw/WUkDxpH971I/AAAAAAAAAL8/kj5T8_yZHEsD6ZMJgOoVsVHIiWq0QyScgCLcBGAs/s1600/hotdog.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-Hk11UJH0Ryw/WUkDxpH971I/AAAAAAAAAL8/kj5T8_yZHEsD6ZMJgOoVsVHIiWq0QyScgCLcBGAs/s320/hotdog.jpg" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><div style="text-align: justify;"><b><span style="text-align: start;">hotdogs (score = 0.99956)</span><span style="text-align: start;"> </span></b></div><div style="text-align: justify;"><span style="text-align: start;"><b>random (score = 0.00044)</b></span><br />Great news, this image is in fact a hotdog and our CNN recognises it as 
such with a high degree of confidence. What about a notdog?</div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-75QYtUA5230/WUkErYMmOoI/AAAAAAAAAME/3-XvPUefBs4UlSvcmc63qBImvmGnjAHjwCLcBGAs/s1600/35477171_13cb52115c_n.jpg" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://2.bp.blogspot.com/-75QYtUA5230/WUkErYMmOoI/AAAAAAAAAME/3-XvPUefBs4UlSvcmc63qBImvmGnjAHjwCLcBGAs/s1600/35477171_13cb52115c_n.jpg" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div style="text-align: justify;"><span style="text-align: start;"><b>random (score = 0.92311)</b></span></div><div style="text-align: justify;"><span style="text-align: start;"><b>hotdogs (score = 0.07689)</b></span></div><div style="text-align: justify;"><span style="text-align: start;">This is a sunflower, not a hotdog! Yet the classifier is only fairly confident, rather than completely confident, that it is a notdog. 
This can be attributed to the limited number of training examples that were used in training the final layer before classification.</span><br />What about if we try to trick it - with a HOTDOG CAR!?</div><span style="text-align: start;"></span><br /><div style="text-align: justify;"><span style="text-align: start;"><br /></span></div><span style="text-align: start;"></span><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-aamBvAtyOek/WUkGMwZZjpI/AAAAAAAAAMQ/AfQPChm_Q7MY-a1r2ByvOAtaD_0ek0ivACLcBGAs/s1600/hotdogcar.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-aamBvAtyOek/WUkGMwZZjpI/AAAAAAAAAMQ/AfQPChm_Q7MY-a1r2ByvOAtaD_0ek0ivACLcBGAs/s320/hotdogcar.jpg" /></a></div><div><b><br />random (score = 0.70404)<br />hotdogs (score = 0.29596)</b><br />Not too bad at all, obviously the shape is similar to that of a hot dog but the chassis, wheels and other car-like features offer enough difference for our CNN to make the distinction.<br /><br /><div class="p2"><h3>What's next?</h3><div>I haven't decided if I'll build some CNNs or have a crack at an LSTM implementation next - both will be using Tensorflow but I've got some ideas I want to play around with first which will shape which direction I head in.</div><div><br /></div><h3>References</h3><div><span style="font-family: "times" , "times new roman" , serif;"><span style="background-color: white; font-family: "times" , "times new roman" , serif;">[1] Szegedy, C. et al. 2014. 
<em style="box-sizing: inherit; line-height: inherit;">Going deeper with convolutions,</em> <u style="box-sizing: inherit;"><a href="https://arxiv.org/pdf/1409.4842.pdf" style="box-sizing: inherit; cursor: pointer; line-height: inherit; text-decoration-line: none;">https://arxiv.org/pdf/1409.4842.pdf</a></u></span></span><br /><span style="font-family: "times" , "times new roman" , serif;"><span style="background-color: white; font-family: "times" , "times new roman" , serif;">[2] Szegedy, C. et al. 2015. <em style="box-sizing: inherit; line-height: inherit;">Rethinking the Inception Architecture for Computer Vision,</em> <u style="box-sizing: inherit;"><a href="https://arxiv.org/pdf/1512.00567.pdf" style="box-sizing: inherit; cursor: pointer; line-height: inherit; text-decoration-line: none;">https://arxiv.org/pdf/1512.00567.pdf</a></u>.</span></span><br /><span style="font-family: "times" , "times new roman" , serif;"><span style="background-color: white; font-family: "times" , "times new roman" , serif;">[3] Donahue, J. et al. 2013. 
<em style="box-sizing: inherit; line-height: inherit;">DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,</em> <u style="box-sizing: inherit;"><a href="https://arxiv.org/abs/1310.1531" style="box-sizing: inherit; cursor: pointer; line-height: inherit; text-decoration-line: none;">https://arxiv.org/abs/1310.1531</a></u>.</span></span><br /><span style="font-family: "times" , "times new roman" , serif;">[4] <a href="https://github.com/Fazelesswhite/Hotdog-Classification">https://github.com/Fazelesswhite/Hotdog-Classification</a></span></div></div></div><br /><h2>word2vec and word embeddings</h2><i>2017-03-31</i><br /><br /><h3>Where to now?</h3><div>Now that we've got the basics of neural networks down pat, this opens up a world of related techniques that can build on this knowledge. 
I thought to myself that I'd get stuck into some Natural Language Processing (NLP) with a view to eventually implementing some sort of Recurrent Neural Network using TensorFlow.<br /><br />The wonderful folks at Stanford have all the lecture notes and assignments up online at <a href="http://web.stanford.edu/class/cs224n/syllabus.html">http://web.stanford.edu/class/cs224n/syllabus.html</a>, which is a fantastic resource. </div><div><br /></div><div>Anyway, before we can even begin to think of the applications, we must understand how to represent our text for input into any machine learning algorithm.</div><div><br /></div><h3>One-hot encoding</h3><div>Naively, we may think to take our entire vocabulary (i.e. all words in the training set - size $V$) and make a huge vector, with each entry in the vector corresponding to a word in our vocabulary. Thus each word is represented by a $V \times 1$ vector with exactly one entry equal to $1$ and all other entries $0$. For example, a word $x_i$ would be represented as</div><div>$$ x_i = \left[ \begin{array}{c} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{array} \right]$$This is called one-hot encoding. The representation seems simple enough, but it has one problem - there is no natural way to embed the contextual similarity of words. Imagine we had three words: $x_1$ = dog, $x_2$ = cat and $x_3$ = pencil. Intuitively, if you were asked which two words were "similar", you would say that, in a given context, $x_1$ and $x_2$ are similar, whilst $x_3$ shares less similarity with $x_1$ or $x_2$. 
Now suppose our three words have the following one-hot encoded representation $$ x_1 = \left[\begin{array}{c} \vdots \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots\end{array} \right], x_2 = \left[ \begin{array}{c} \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \\\vdots \end{array} \right], x_3 = \left[ \begin{array}{c} \vdots \\ 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\\vdots \end{array} \right]$$ We could hope that the dot product between vectors would give us a notion of similarity, but each of the $x_i$ is orthogonal to the others; that is, $x_i \cdot x_j = 0 \hspace{.1in} \forall i \neq j$. This representation also isn't all that great given that our vocabulary could be of the order of hundreds of millions of words, which would result in humongous one-hot encoded vector representations in which almost all (except for one) entries are 0. This is called a <i>sparse </i>representation for obvious reasons. We'll now have a look at some more computationally efficient (and more accurate!) <i>dense</i> word representations.</div><div><br /></div><h3>Efficient Estimation of Word Representations in Vector Space</h3><div>In 2013, Mikolov et al. presented <a href="https://arxiv.org/pdf/1301.3781.pdf">Efficient Estimation of Word Representations in Vector Space</a>, which proposed two architectures: </div><div><ul><li>Continuous Bag of Words (CBOW) and</li><li>Skip-gram model</li></ul>with the aim of minimizing the computational complexity traditionally associated with NLP models. These representations somewhat surprisingly capture a lot of syntactic and semantic information in the structure of the vector space. That is, similar words tend to be near one another (in terms of Euclidean distance), and vector operations can be used to analogously relate pairs of words. There is the famous $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ equation, which is quite remarkable. 
In the one-hot view of the world, all of our vectors were orthogonal, so dot products and vector addition/subtraction carried no intuitive meaning and captured no relationships between words. See <a href="https://lamyiowce.github.io/word2viz/">here</a> for a cool visualisation.</div><div><br /></div><h3>Continuous Bag of Words</h3><div>Simply put, the CBOW model aims to predict a word given an input context. For example, we would like a CBOW model to predict the word <i>fox</i> from an input context "<i>The quick brown...</i>". To perform such a task, it was proposed to use a simple feedforward neural network (wait - we know all about those now!) <i>without</i> an activation function in each neuron in the hidden layer. The idea is to train the NN on this classification task and then simply pick out the weight matrix in the hidden layer - this will be our <i>dense representation</i>.</div><div><br /></div><h4>Architecture of CBOW</h4><div>The CBOW model looks at <i>m</i> words either side (or $C$ total as in the diagram below) of the target/center word $w_c$ in order to predict it. Below is an image of the architecture of a CBOW model - notice that each input word is a one-hot encoded vector of dimension $V \times 1$ where $V$ is the size of our vocabulary. $W$ is an $N \times V$ matrix, where $N$ is the dimension of the representation of our input vectors $x_1, \dots, x_C$, where $x_{i}$ is the one hot encoded vector of the $i^{th}$ context word. We define $W x_{i} \equiv v_{i}$ which is a single column of $W$ due to the one hot encoding. 
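That last claim is worth seeing concretely: multiplying the $N \times V$ matrix $W$ by a one-hot vector really is just a column lookup, and the one-hot vectors themselves are mutually orthogonal. A couple of lines of NumPy (toy sizes only - a real $V$ would be enormous):

```python
import numpy as np

V, N = 5, 3                                      # toy vocabulary and embedding sizes
W = np.arange(V * N, dtype=float).reshape(N, V)  # N x V input weight matrix

def one_hot(i, size):
    """Return the one-hot vector for vocabulary index i."""
    v = np.zeros(size)
    v[i] = 1.0
    return v

x2, x4 = one_hot(2, V), one_hot(4, V)

print(x2 @ x4)                           # 0.0 - one-hot vectors are orthogonal
print(np.array_equal(W @ x2, W[:, 2]))   # True - W x_i selects column i of W
```

This is why real implementations never materialise the matrix product: they just index into $W$, an "embedding lookup".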
Below is a visual representation of a CBOW model.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-7MsYeu1IjN4/WNogOiAchzI/AAAAAAAAAJk/R0U30myuArMYFzRGs6DNZE6aNnKAokkbgCLcB/s1600/cbow2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://2.bp.blogspot.com/-7MsYeu1IjN4/WNogOiAchzI/AAAAAAAAAJk/R0U30myuArMYFzRGs6DNZE6aNnKAokkbgCLcB/s400/cbow2.png" width="329" /></a></div></div><br /><div><b>Forward propagation:</b></div><div><ul><li>We have $C$ one hot encoded input vectors which feed into the hidden layer via the weight matrix $W$.</li><li>Since we have $C$ inputs acting on $W$, the resulting hidden vector $\hat{v}$ is computed from the average as follows: $$ \hat{v} = \frac{1}{C} W \left( x_{1} + x_{2} + \dots + x_{C} \right) = \frac{1}{C} \left ( v_{1} + v_{2} + \dots + v_{C} \right)$$</li><li>We actually don't have an activation function here (it is the identity mapping). 
So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$.</li><ul><li>Define the $j^{th}$ row of $W'$ as $u^T_{j}$.</li></ul><li>Thus the $j^{th}$ entry of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_{j} \hat{v}$.</li><li>Finally, we apply our old friend the softmax function to obtain our output: $$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}}$$</li><li>To measure the loss of our network, we'll use the cross entropy loss function, which we saw in the derivation of a <a href="http://edgeofabandon.blogspot.com.au/2017/01/feedforward-artificial-neural-network.html">feedforward neural network.</a> $$l = - \sum_{j=1}^{V} y_j \log(\hat{y}_j)$$</li><ul><li>This is just the negative of the log likelihood $\log p \left( w_c | w_{I_1}, \dots , w_{I_C} \right)$ where $w_{I_j}$ represents the $j^{th}$ word of input context $I$.</li><li>Thus we have $$l = - u^{T}_{c} \hat{v} + \log \left(\sum_{k=1}^V e^{ u^{T}_k \hat{v}} \right)$$</li><li>where <i>c</i> is the index of the correct word - the one hot vector for the correct word has a $1$ at the $c^{th}$ position.</li></ul><li>We now just need to optimise the loss function to recover our <i>input</i> representation $\hat{v}$ and the <i>output</i> representations $u^{T}_j$ for $j = 1, \dots, V$.</li></ul>Let's now take a look at how we optimise our loss function $l$ via gradient descent using backpropagation. We won't go through the algebra, but it is exactly analogous to what we've seen <a href="http://edgeofabandon.blogspot.com.au/2017/01/feedforward-artificial-neural-network.html">before</a>. Now $$\frac{\partial l}{\partial u_j} = \left\{ \begin{array}{ll} (\hat{y}_j-1)\hat{v} & j = c \\ \hat{y}_j \hat{v} & \text{otherwise} \end{array} \right. $$ and $$\frac{\partial l}{\partial \hat{v}} = -u_c + \sum_{k=1}^{V} \hat{y}_k u_k$$<br />These define the update rules for stochastic gradient descent. 
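As a sanity check on these formulas, here is a minimal NumPy version of the CBOW forward pass and the gradient with respect to $\hat{v}$ (toy sizes and random weights - purely illustrative, with no training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 4, 2                   # toy vocab size, embedding dim, context size
W = rng.standard_normal((N, V))      # input weight matrix (columns are v_i)
W_out = rng.standard_normal((V, N))  # output weight matrix (rows are u_j^T)

context = [1, 7]                     # indices of the C context words
c = 3                                # index of the correct center word

# Hidden layer: average of the context words' input representations.
v_hat = W[:, context].mean(axis=1)

# Output layer: scores z_j = u_j^T v_hat, then softmax.
z = W_out @ v_hat
y_hat = np.exp(z - z.max())          # shift by the max for numerical stability
y_hat /= y_hat.sum()

# Cross-entropy loss for the true center word c.
loss = -np.log(y_hat[c])

# Gradient w.r.t. v_hat from the derivation: -u_c + sum_k y_hat_k u_k.
grad_v = -W_out[c] + y_hat @ W_out
```

In a real training loop these gradients would drive the SGD updates of $W$ and $W'$; the sizes here are tiny just to make the shapes explicit.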
Note the sum over $V$, the size of our vocabulary, which can contain potentially millions of tokens. This isn't computationally efficient and will result in terrible performance if we have to do this when training the model. Suspend your disbelief for now - I'll cover some methods that people have come up with to tackle this problem in my next blogpost. We'll carry on for now and introduce the skip-gram model.<br /><br /><h3>Skip-Gram</h3><div>The skip-gram model does somewhat the opposite of the CBOW. A skip-gram model takes a single word (the <i>center</i> word) as input and predicts the surrounding words (the context). It uses the same approach in that we'll train an ANN to perform the prediction task and we'll just pick up the weight matrices it learns as our dense representation of our words.<br /><br /></div><div><h4>Architecture of the Skip-Gram model</h4></div><div>The Skip-Gram model takes a single input (center) word $x_k$ and predicts the context from it. The input word is a single one-hot encoded vector. The output is a collection of $C$ vectors $\{\hat{y}_1, \dots, \hat{y}_C\}$, where each $\hat{y}_j$ is a vector of probabilities, one for each word in the vocabulary. 
Below is an image of the architecture of a Skip-Gram model - notice that we have the same matrices $W$ and $W'$ as in CBOW; these operate on our single input the same way they did on $\hat{v}$ in CBOW.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-SOuDkTAyd4A/WNojUY9RTSI/AAAAAAAAAJ0/o9XbnfVIdb4zt6F6WwAxT_UuY33b6r00wCLcB/s1600/skipgram.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-SOuDkTAyd4A/WNojUY9RTSI/AAAAAAAAAJ0/o9XbnfVIdb4zt6F6WwAxT_UuY33b6r00wCLcB/s400/skipgram.PNG" width="356" /></a></div><div><br /></div><div><div><b>Forward propagation:</b></div><div><ul><li>We have a single one hot encoded input vector $x_k$ which feeds into the hidden layer via the weight matrix $W$. Define $W x_k \equiv v_c$ - this is also called the center word <i>input representation</i>.</li><li>Again, we don't have an activation function here (it is the identity mapping). So from the hidden layer to the output layer, we propagate forward via weight matrix $W'$ which is $V \times N$.</li><li>Thus the $j^{th}$ element of the unnormalised output (a $V \times 1$ vector) is simply $z_j = u^{T}_j v_c$. 
$C$ copies of the output vector are produced - one for each of the words in the context we are trying to predict.</li><li>Again, we apply the softmax function to each of the $C$ output vectors to obtain our output: $$\hat{y}_{i,j} = \frac{e^{z_j}}{\sum_{k=1}^{V} e^{z_k}} = \frac{e^{u^{T}_j v_c}}{\sum_{k=1}^{V} e^{u^{T}_k v_c}}$$ where $i \in \{1, \dots , C\}$ indexes each of the output vectors.</li><li>Note that since we used $C$ copies of the same output matrix for the prediction of each surrounding word, this amounts to a conditional independence assumption. That is, the probability of each surrounding word occurring given the center word is independent of its position relative to it.</li><li>To measure the loss of our network, we'll again use the cross entropy loss function. However, now that there are $C$ copies of the output, we need to sum over them: $$l = - \sum_{j=1}^{V} \sum_{i=1}^{C} y_{i,j} \log(\hat{y}_{i,j}) $$ where $y_{i,j}$ is non-zero only at the index of the actual $i^{th}$ output word. For example, if the actual output words for a center word <i>like</i> were <i>I </i>and <i>ice-cream</i>, then there are two one hot encoded vectors $y_{1}, y_{2}$. Suppose <i>I </i>has position $10,000$ in our vocabulary; then $y_{1,10000}$ would be the only non-zero entry of $y_{1}$. Similarly $y_{2,5000}$ would be the only non-zero entry of $y_{2}$ if <i>ice-cream</i> had position $5000$ in the vocabulary.</li><li>This is just the negative of the log likelihood $\log p \left( w_{c-m},\dots, w_{c-1}, w_{c+1}, \dots , w_{c+m} | w_c \right)$ where $w_{j}$ $(j \neq c)$ represents the $j^{th}$ word of the surrounding window and $w_c$ is the center word. We have introduced the index $m$ to index the words surrounding the center word. 
That is, the window of size $C = 2m$ is symmetric about the center word, such that there are $m$ words to its left and right.</li><ul><li>Thus we have $$l = - \sum_{j=0, j\neq m}^{2m} u^{T}_{c-m+j} v_c + 2m \log{\sum_{j=1}^{V} e^{u^{T}_j v_c }} $$</li></ul><li>It's important to see that this loss function also suffers from the problem that we need to sum over our entire vocabulary $V$ at each parameter update step - which is computationally far too expensive and sometimes not even possible.</li></ul><h4>Results</h4><div>There are some interesting remarks on the <a href="https://code.google.com/archive/p/word2vec/">word2vec Google Code page</a> regarding model selection (skip-gram vs CBOW) and hyperparameter selection. We won't get into the details until we've covered the aforementioned sampling approaches for evaluating the models, but one point to note is </div><div><ul style="background-color: white; color: #444444; font-family: arial, sans-serif; font-size: 13px; margin: 10px 10px 10px 30px; padding: 0px;"><li><i>architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)</i></li></ul><div>We can understand the skip-gram performing better for infrequent words, as the embedded words are not averaged as they are in CBOW (i.e. when we define $\hat{v}$). The averaging process will dampen information from infrequently occurring words, in contrast with the skip-gram model where no such averaging takes place.</div></div><div><br /></div><div>In a second paper, <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Distributed Representations of Words and Phrases and their Compositionality</a>, Mikolov et al. presented the following results for a $1000$-dimensional skip-gram model. 
Taking the first two principal components, the following image was created</div><div class="separator" style="clear: both; text-align: center;"></div><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-UnYQWbGjY9M/WN7ha5MqDbI/AAAAAAAAAKU/vTvJlXNAF7omdV_H7G7khTJswH8AQ0PzwCLcB/s1600/Capture.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="301" src="https://3.bp.blogspot.com/-UnYQWbGjY9M/WN7ha5MqDbI/AAAAAAAAAKU/vTvJlXNAF7omdV_H7G7khTJswH8AQ0PzwCLcB/s400/Capture.PNG" width="400" /></a></div><div><br /></div><div><br /></div><div>The caption says it all really - note the geometric similarity of differences between capital cities and their countries! This was captured by the skip-gram model <i>without</i> providing any supervised information regarding these relationships. You can imagine simple vector addition holding approximately true in this 2D vector space: $$\text{Portugal} - \text{Lisbon} + \text{Beijing} \approx \text{China}$$</div><div><h4>Conclusion</h4></div></div></div><div>In the next blog I'll look at some clever sampling-based approaches that people have come up with to tackle the problem of having to normalise over the entire vocabulary $V$ in order to efficiently train the model. Then we can mess around with some text classification tasks by using our word embeddings in our favourite ML techniques! 
The hope is that once we've played around with some toy problems to get a strong handle on the implementation of word2vec, I'll introduce Recurrent Neural Networks and then we can get into the nitty gritty of some super interesting deep learning architectures :)</div><div><br /></div><br /></div><div><br /></div><div><br /></div>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2017/03/word2vec-and-word-embeddings.htmltag:blogger.com,1999:blog-83055021305463025.post-22997514417450924892017-02-24T02:38:00.000-08:002017-02-24T02:38:01.670-08:00Optimisation routinesNow that we are able to train the Neural Network, let's take a look at how we can optimise the training, as in reality we will be dealing with deep networks (many hidden layers) which will potentially take days to train. I'm going to take a step back and have a look at how we might generally attack a <i>convex</i> optimisation problem. Convex optimisation problems are a very well researched and understood area, so it makes sense to have a look at these first.<br /><br /><h3>The problem</h3><div>We are going to fit a simple logistic regression to the following data </div><div><ul><li>http://cs229.stanford.edu/ps/ps1/logistic_x.txt</li><li>http://cs229.stanford.edu/ps/ps1/logistic_y.txt</li></ul><div>There are two predictors (columns) in the first file and 1 outcome ($ y = \pm 1$) in the second. Note that I've borrowed these from the cs229 class that I'm currently working through. Our task is to <i>minimise </i> the average empirical loss: $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log (1 + e^{-y^{(i)} \theta^{T} x^{(i)}}) $$ where $y^{(i)}$ is the actual outcome for observation $x^{(i)}$. 
Note that $x^{(i)}$ is a vector containing the predictors for that observation.</div></div><div><br /></div><h3>Is this a convex optimisation problem?</h3><div>It turns out that if we can show that the Hessian $H$ for this loss function satisfies $$z^{T} H z \ge 0$$ $ \forall z \in \mathbb{R}^{3}$ then $L(\theta) $ is a convex function (more generally if $H \in \mathbb{R}^{n \times n}$ then this result must hold true $\forall z \in \mathbb{R}^{n}$). Using the definition of $H$, we have $$ H_{pq} = \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_q}$$ The details of the calculation are quite straightforward, so I'll omit them, but the result yields $$H_{pq} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + e^{-\theta^T x^{(i)}}} \times \left( 1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}} \right) \times x^{(i)}_p x^{(i)}_q $$ where the sum is over all of our training examples. We can write the last two terms as a matrix product (and subsequently drop the indices) as $x^{(i)} x^{(i)^{T}}$. Since the first two factors each lie in $(0, 1)$, their product is strictly positive, so dropping it cannot change the sign when assessing if $z^T H z \ge 0 \hspace{.1in} \forall z$. Thus $$z^T H z \propto \sum_{i=1}^{m} z^T x^{(i)} x^{(i)^{T}} z = \sum_{i=1}^{m} (z^T x^{(i)})^2 \ge 0 \hspace{0.1in} \forall z$$ Hence $H$ is positive semidefinite, which implies that $L(\theta)$ is convex. This means it has a global optimum, which makes our lives a lot easier. Since calculating the Hessian is rather easy in this setting, we can use <a href="https://en.wikipedia.org/wiki/Newton's_method_in_optimization">Newton's method</a>.</div><h3>Newton's method</h3><div>Newton's method utilises the Hessian and hence the curvature of the loss surface to find an optimal path to the minimum. With this additional information, we can expect the algorithm to converge faster than gradient descent, which only uses first derivative information. 
The update rule is as follows $$ \theta \rightarrow \theta - H^{-1} \nabla_{\theta} L(\theta) $$</div><div>Let's see the performance of Newton's method vs gradient descent:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-rf2R2QAfH58/WJxFIhbd-QI/AAAAAAAAAGs/52mzCu44KNEAAaCxyFNU74rlTpDk0l3cQCLcB/s1600/newtonvgraddescent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="295" src="https://2.bp.blogspot.com/-rf2R2QAfH58/WJxFIhbd-QI/AAAAAAAAAGs/52mzCu44KNEAAaCxyFNU74rlTpDk0l3cQCLcB/s400/newtonvgraddescent.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;">I've performed 20 iterations for both Newton's method and gradient descent - clearly Newton's method converges <i>a lot</i> faster than gradient descent. Looking at the update step, it is obvious that this method won't scale well with more parameters since each step requires the calculation of the gradient vector <i>and </i>the Hessian, which must then be inverted. Furthermore, it's obvious that this method will fall apart if $H$ is singular. So in our toy example above - a convex loss function with only a few parameters to estimate - Newton's method was king, but obviously in our neural network we couldn't apply such an algorithm. The beauty of backpropagation was that after a forward and backward pass through the network, we had all of the first order derivatives we required in the update step. This fortune does not however extend to the second derivatives we would require to calculate $H$. 
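To make the iteration concrete, here is a minimal numpy sketch of Newton's method on the logistic loss $L(\theta)$ above. Note this uses synthetic two-predictor data as a stand-in for the cs229 files, so the numbers won't match the plot:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # the average empirical loss L(theta), with labels y = +/-1
    return np.mean(np.log(1.0 + np.exp(-y * X.dot(theta))))

def gradient(theta, X, y):
    # grad L = -(1/m) sum_i y_i * sigmoid(-y_i theta^T x_i) * x_i
    return -X.T.dot(y * sigmoid(-y * X.dot(theta))) / len(y)

def hessian(theta, X, y):
    # H = (1/m) sum_i sigma_i * (1 - sigma_i) * x_i x_i^T, as derived above
    p = sigmoid(X.dot(theta))
    return (X * (p * (1.0 - p))[:, None]).T.dot(X) / len(y)

def newtons_method(X, y, n_iter=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # theta -> theta - H^{-1} grad L; solve the linear system
        # rather than explicitly inverting H
        theta -= np.linalg.solve(hessian(theta, X, y), gradient(theta, X, y))
    return theta

# synthetic, non-separable stand-in for the cs229 data:
# two overlapping Gaussian blobs, with an intercept column prepended
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 1.0, rng.randn(50, 2) - 1.0])
X = np.hstack([np.ones((100, 1)), X])
y = np.hstack([np.ones(50), -np.ones(50)])

theta = newtons_method(X, y)
```

After 20 iterations the gradient is essentially zero - far fewer iterations than gradient descent would need on the same data.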
It seems we're stuck with regular old batch gradient descent to train our neural network...or are we?</div><div class="separator" style="clear: both; text-align: left;"><br /></div><h3 style="clear: both; text-align: left;">Stochastic Gradient Descent (SGD)</h3></div><div>Recall our update step for batch gradient descent: $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ where $$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log (1 + e^{-y^{(i)} \theta^{T} x^{(i)}}) $$$L(\theta)$ is a function of all<i> m</i> training examples. That is, for a single update to parameter $\theta_j$ we need to use <i>all</i> of our training data. What if we could calculate our loss on a subset of the training data? The hope is that this subset is "representative enough" of the entire dataset such that the resulting update to $\theta_j$ is generally in the same direction as the update calculated on the entire dataset. This is the essence of Stochastic Gradient Descent.<br /><br />Consider that when the training dataset has several hundred million rows, we may not even be able to fit all the data in memory to perform batch gradient descent! </div><div><br /></div><div>I'll define the following terms commonly found when talking about SGD:<br /><br /><ul><li>Epoch - once every datapoint in the training set has been used, <i>one epoch</i> has occurred.</li><li>Mini batch - the subset of training data that is used in the parameter update</li><li>Batch size - the size of the mini batch</li></ul><div>So per epoch, we will be able to update the parameters $\frac{N}{\text{batch size}}$ times (where $N$ is the number of training examples). 
Compare this to batch gradient descent where, by definition, the parameters are updated once per epoch.</div><div>The pseudocode is as follows:</div><div><code class="prettify">for epoch in num_epochs:</code><br /><code class="prettify">&nbsp;&nbsp;&nbsp;&nbsp;shuffle(data)</code><br /><code class="prettify">&nbsp;&nbsp;&nbsp;&nbsp;for mini_batch in data:</code><br /><code class="prettify">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;evaluate derivative of loss on mini_batch</code><br /><code class="prettify">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;update parameters</code><br /><code class="prettify"><br /></code>Since we are taking a subset of our training data to update our parameters, the results may be volatile as each subset may contain slightly different information about our loss surface. This is slightly troublesome as we approach the optimum, as in SGD the path of optimisation will tend to oscillate around the minimum we seek. In order to remedy this, we use a <i>learning rate schedule</i>, which is simply a scaling (generally based on heuristics) of our learning rate at each iteration. The hope is that by the time the algorithm is near the minimum, the learning rate has been scaled down such that the successive parameter updates are relatively stable from this point forward. This process is also called <i>annealing</i> the learning rate. The trick is timing the schedule such that the learning rate is small enough when the algorithm is near the minimum - if you are too aggressive with the schedule, the learning rate will become too small too soon and you won't get near the minimum, as the updates to the parameters will tend to zero.<br /><br />Let's have a look at the performance of SGD vs batch gradient descent on our <a href="http://edgeofabandon.blogspot.com.au/2017/01/feedforward-artificial-neural-network_12.html">neural network</a>. I've implemented SGD on the aforementioned neural network and run it on the make moons dataset with 100,000 data points and 10 neurons in the hidden layer. 
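Before looking at the results, the pseudocode above translates almost line for line into numpy. This is a minimal sketch of the mini-batch loop (run on the simple logistic loss from the previous post rather than the network itself, with toy data and hyperparameters chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    # the average empirical loss from the previous post, labels y = +/-1
    return np.mean(np.log(1.0 + np.exp(-y * X.dot(theta))))

def logistic_grad(theta, X, y):
    return -X.T.dot(y * sigmoid(-y * X.dot(theta))) / len(y)

def sgd(X, y, eta=0.2, batch_size=25, num_epochs=40, seed=0):
    rng = np.random.RandomState(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for epoch in range(num_epochs):
        idx = rng.permutation(n)               # shuffle(data)
        for start in range(0, n, batch_size):  # for mini_batch in data:
            batch = idx[start:start + batch_size]
            # evaluate derivative of loss on mini_batch, update parameters
            theta = theta - eta * logistic_grad(theta, X[batch], y[batch])
    return theta

# toy data: two overlapping Gaussian blobs with labels y = +/-1
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2) + 1.0, rng.randn(100, 2) - 1.0])
y = np.hstack([np.ones(100), -np.ones(100)])

theta = sgd(X, y)
```

Each pass over the shuffled data performs $\frac{N}{\text{batch size}}$ parameter updates, exactly as described above.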
See below for a plot of the loss vs. epoch for both SGD and batch gradient descent.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-muBLAm1eBo0/WK697kyOwRI/AAAAAAAAAIA/Yf5RKXFup7w204l0_Shhv2mWAFfjRAlcQCLcB/s1600/batch_v_sgd_LOSS.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://4.bp.blogspot.com/-muBLAm1eBo0/WK697kyOwRI/AAAAAAAAAIA/Yf5RKXFup7w204l0_Shhv2mWAFfjRAlcQCLcB/s400/batch_v_sgd_LOSS.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Batch gradient descent:</div><div class="separator" style="clear: both; text-align: left;"></div><ul><li>Epochs: 200 (i.e. 200 updates to the parameters, each based on the full training set)</li><li>Execution time: 17s</li></ul><div>SGD:</div><div><ul><li>Epochs: 1</li><ul><li>Mini batch size: 500 (i.e. 200 updates to the parameters, each based on a mini batch of size 500)</li></ul><li>Execution time: 6s</li></ul><div>We can see that SGD actually outperforms batch gradient descent here and takes about a third of the time to run! I haven't actually applied any learning rate schedule here - you can see an egregious spike in the loss from SGD at around the $160^{th}$ epoch. 
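The annealing fix for that spike is a tiny piece of bookkeeping inside the training loop. A sketch of a simple one-step schedule (the starting rate of $0.01$ here is just an illustrative value, not the one used in the experiment):

```python
def step_schedule(eta0, drop_at, factor, num_epochs):
    # return the learning rate used at each epoch under a
    # one-step annealing schedule
    etas = []
    eta = eta0
    for epoch in range(num_epochs):
        if epoch == drop_at:
            eta = eta / factor  # anneal: eta -> eta/factor from here on
        etas.append(eta)
    return etas

# drop eta by a factor of 10 at the 150th of 200 epochs
etas = step_schedule(eta0=0.01, drop_at=150, factor=10.0, num_epochs=200)
```

In a real training loop the returned rate for the current epoch simply replaces the constant $\eta$ in the parameter update.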
See what happens below when I simply set $\eta \rightarrow \frac{\eta}{10}$ at the $150^{th}$ epoch:</div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-3Olaf8xTHiQ/WK6-HkKGR0I/AAAAAAAAAIE/5IS1CoKH1EUtTHutMdq7P3H0OWrrguAmgCLcB/s1600/batch_V_sgd_NN_annealed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://4.bp.blogspot.com/-3Olaf8xTHiQ/WK6-HkKGR0I/AAAAAAAAAIE/5IS1CoKH1EUtTHutMdq7P3H0OWrrguAmgCLcB/s400/batch_V_sgd_NN_annealed.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div><br /></div><div>Notice that the aforementioned spike is now gone and the SGD results look very promising given the accuracy and performance (compared to batch gradient descent). In practice, batch gradient descent is rarely used: large datasets quickly run into memory constraints, and the randomness of SGD can actually help the algorithm escape local minima that batch gradient descent would naturally get stuck in.<br /><h3>Momentum</h3></div><div>We can introduce the concept of <i>momentum</i> to our parameter updates. It is applied as follows:<br />Recall the usual update to our parameters $$\theta_j \rightarrow \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ which we can write in two steps: define<br />$$\Delta \theta_j = \frac{\partial}{\partial \theta_j}L(\theta)$$ such that the update becomes $$\theta_j \rightarrow \theta_j - \eta \Delta \theta_j$$ We'll now modify our definition of $\Delta \theta_j$ to $$\Delta \theta_j = \gamma \Delta \theta_j - \eta \frac{\partial}{\partial \theta_j}L(\theta)$$ and our update is $$\theta_j \rightarrow \theta_j + \Delta \theta_j$$ What we've done is influence our current update of $\theta_j$ by the amount it was updated by in the previous step. 
Yes, we've introduced yet <i>another</i> hyperparameter, but the idea here is to give the update some "momentum" such that it is influenced by previous updates and the optimisation continues towards the minimum. Consider the case when the loss surface is a long narrow ravine with steep walls - SGD will typically oscillate across the ravine as the update will point down the steep walls; the momentum term will help move the algorithm along the ravine towards the minimum we seek. See the effect of momentum on training our neural network below:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-djt7hbOlswo/WK6-WcRw_XI/AAAAAAAAAII/yXaV95LZB6gBi2r2qhUmKaKTwwXw-AT6gCLcB/s1600/batch_V_sgd_NN_momentum.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://1.bp.blogspot.com/-djt7hbOlswo/WK6-WcRw_XI/AAAAAAAAAII/yXaV95LZB6gBi2r2qhUmKaKTwwXw-AT6gCLcB/s400/batch_V_sgd_NN_momentum.png" width="400" /></a></div><br /></div><div>We see that momentum allows the algorithm to descend very quickly and hits the minimum loss around 40 epochs in - compared to ~140 in our previous iteration.<br /><br />Hopefully you've now got a good grasp of batch gradient descent vs SGD and some of the subtleties and hyperparameters that need to be considered when training a model with SGD. For a more in-depth exploration of various optimisation techniques, including some more advanced methods, see <a href="http://sebastianruder.com/optimizing-gradient-descent/index.html#momentum">here</a>. 
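The ravine picture can be reproduced on a toy problem. Here is a minimal sketch (not the network code) comparing plain gradient descent with the momentum update above on the elongated quadratic $L(\theta) = \frac{1}{2}(100\theta_1^2 + \theta_2^2)$ - a long narrow valley that is steep in one direction and shallow in the other:

```python
import numpy as np

# gradient of the "ravine" loss L(theta) = 0.5*(100*theta_1^2 + theta_2^2)
def grad(theta):
    return np.array([100.0, 1.0]) * theta

def gd(theta0, eta, n_iter):
    # plain gradient descent: theta -> theta - eta * grad
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta - eta * grad(theta)
    return theta

def gd_momentum(theta0, eta, gamma, n_iter):
    # momentum: delta -> gamma*delta - eta*grad; theta -> theta + delta
    theta = theta0.copy()
    delta = np.zeros_like(theta)
    for _ in range(n_iter):
        delta = gamma * delta - eta * grad(theta)
        theta = theta + delta
    return theta

theta0 = np.array([1.0, 1.0])
plain = gd(theta0, eta=0.015, n_iter=200)
mom = gd_momentum(theta0, eta=0.015, gamma=0.9, n_iter=200)
```

With the same learning rate, plain gradient descent crawls along the shallow direction of the ravine, while the momentum updates accumulate and carry the iterate far closer to the minimum in the same number of steps.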
In production, we'll rely on the built-in SGD routines which have been highly tuned for performance.</div><div><br /></div><div><br /></div><br /></div></div>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2017/02/optimisation-routines.htmltag:blogger.com,1999:blog-83055021305463025.post-14820871105359487472017-01-27T13:39:00.002-08:002017-01-27T13:39:59.534-08:00Feedforward Artificial Neural Network pt5: Additional analysisNow that we've finally implemented our ANN, let's play around with some of the parameters to get an understanding of how they affect our network and its results.<br />The tricky part about training ANNs is that the loss function isn't necessarily convex, which means we can't rely on our usual convex optimisation routines: just because we find a local minimum, it doesn't mean it's the global minimum. Thus non-convex problems may converge on different local minima depending on the parameters of the optimisation routine. We'll explore some of these parameters below.<br /><h3>Learning Rate</h3>Recall how the learning rate $\eta$ enters our optimisation procedure via the weight updates in gradient descent: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ It essentially controls the step size at each update. 
Recall we had some funny bumps in our loss function at certain iterations - let's take a closer look. I've plotted the results of two different training runs of the ANN below:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-NeAbfINaAos/WHySSnMnahI/AAAAAAAAAFc/XugC2zxabo0M4GKWAfLR8EMb7QPrb68BgCLcB/s1600/loss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="281" src="https://1.bp.blogspot.com/-NeAbfINaAos/WHySSnMnahI/AAAAAAAAAFc/XugC2zxabo0M4GKWAfLR8EMb7QPrb68BgCLcB/s400/loss.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;">The two lines correspond to the total loss as a function of the number of iterations in our training of the ANN. The blue line has $\eta = 0.001$ and the green line has $\eta = 0.01$. You can see that the green line has those funny bumps we witnessed before - this is the run with the larger learning rate. The spikes occur when the step size is too large and we overshoot the minimum. Notice that the blue line doesn't have these overshoots, however it takes more iterations to approach the minimum. If we take a step size which is <i>too large, </i>then we consistently overshoot the minimum - never converging on it:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-epNjGVu3jsg/WHyX3fQbpEI/AAAAAAAAAF0/fuApvBzlhvk-6b8Ij_W58ShxQl253lkOgCLcB/s1600/loss3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="282" src="https://3.bp.blogspot.com/-epNjGVu3jsg/WHyX3fQbpEI/AAAAAAAAAF0/fuApvBzlhvk-6b8Ij_W58ShxQl253lkOgCLcB/s400/loss3.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div>The key is finding a learning rate which will find the minimum within a reasonable timeframe. 
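The overshooting behaviour is easy to reproduce on a one-dimensional toy loss. A quick sketch on $L(\theta) = \theta^2$ (gradient $2\theta$), where each gradient descent step simply multiplies $\theta$ by the factor $(1 - 2\eta)$:

```python
import numpy as np

def gd_path(eta, theta0=1.0, n_iter=20):
    # gradient descent on L(theta) = theta^2; each step multiplies
    # theta by the constant factor (1 - 2*eta)
    path = [theta0]
    for _ in range(n_iter):
        path.append(path[-1] - eta * 2.0 * path[-1])
    return np.array(path)

slow = gd_path(eta=0.05)       # factor 0.9: monotone but slow
bumpy = gd_path(eta=0.75)      # factor -0.5: overshoots the minimum each step
diverging = gd_path(eta=1.05)  # factor -1.1: each overshoot is worse than the last
```

With $\eta = 0.75$ the iterate flips sign at every step (the "funny bumps") yet still shrinks towards $0$; once $\eta$ exceeds $1$ the factor is larger than $1$ in magnitude and the iterates diverge.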
Although our selection ($\eta = 0.001$ vs $\eta = 0.01$) didn't make a huge difference in this case, consider an ANN with multiple hidden layers and thousands of neurons in each layer. This network may take hours (or days) to train depending on how we choose our learning rate.<br />Depending on the problem at hand, you may value accuracy more than efficiency or vice versa; this will dictate how you choose your learning rate, which you will usually select using cross validation.<br /><br /><h3>Regularisation / Weight Decay</h3>Say we have our initial loss function (the cross entropy loss) $L_0$ and we add a <i>regularisation</i> term such that we now have $$L = L_0 + \frac{\lambda}{2n} \sum_{w} w^2$$ where the sum is over all weights. Now if $\lambda$ is large then the second term will dominate $L$ and the task of optimising the entire expression will be reduced to minimising $\sum_w w^2$. If $\lambda$ is small then the first term dominates and fewer restrictions are placed on $w$. This regularisation term controls the weights by preventing them from becoming overly large and helps prevent us from overfitting the model. 
If we want to use gradient descent to minimise this regularised loss function we have $$ \frac{\partial L}{\partial w} = \frac{\partial L_0}{\partial w} + \frac{\lambda}{n} w$$ so our update at each iteration $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ becomes $$ w \rightarrow w - \eta \frac{\partial L_0}{\partial w} - \frac{\eta \lambda}{n} w$$ $$\implies w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial L_0}{\partial w}$$ That is, at each iteration the weight $w$ is rescaled by a factor of $\left( 1 - \frac{\eta \lambda}{n} \right)$; this is referred to as <i>weight decay</i> and, as mentioned before, limits the magnitude of $w$.<br /><br /><h3>Weight initialisation</h3>In this section we'll take a look at why we chose to initialise our values as we did (from a normal distribution with specific parameters). Recall the definition of the weight update from our gradient descent algorithm: $$ w \rightarrow w - \eta \frac{\partial L}{\partial w}$$ If the second term in this expression is small or zero, then there is effectively no (or very little) weight update to $w$. This causes our training to slow down dramatically, such that after each iteration our weight $w$ is only changing ever so slightly; obviously we would like to avoid this situation at the start of the procedure. Recall the backpropagation rules for $W^{(1)}$:<br /><br /><ul><li>$\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$</li><li>$\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$</li></ul><div>we see that the $(1-\tanh^{2}(Z^{(1)}))$ term enters the equation (more generally, this will be the derivative of the activation function). So we have our update to the weight $w$ as $$w \rightarrow w - \eta \, {x}^T \left[ (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T \right]$$ That is, the amount we update $w$ by is proportional to the derivative of our activation function. 
Thus we want to avoid initialising our weights in a region where this derivative is close to zero. Below is a plot of the $\tanh$ activation function</div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-rbZ0hyefcYQ/WIcjZtNPHCI/AAAAAAAAAGE/LIrMaSPcjccvd0qZZ3dABwB0lK_w5b0pwCLcB/s1600/tanh.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="280" src="https://3.bp.blogspot.com/-rbZ0hyefcYQ/WIcjZtNPHCI/AAAAAAAAAGE/LIrMaSPcjccvd0qZZ3dABwB0lK_w5b0pwCLcB/s400/tanh.png" width="400" /></a></div><div>We can see that this activation function has its derivative approach zero at both extremes: as $x \rightarrow \infty$ and as $x \rightarrow -\infty$. Let's think about a more general ANN for a moment - suppose we have an ANN with 1000 inputs and a single training example where each input is equal to $1$. We have as usual $$Z^{(1)} = x W^{(1)} +b^{(1)}$$ If we have initialised each entry $W^{(1)}_{ij}$ and $b^{(1)}_j$ as selected from a standard normal distribution (iid), then each entry $Z^{(1)}_{i}$ will be the sum of 1001 iid standard normal variables. Since the sum of $N$ iid standard normal variables has mean $0$ and standard deviation $\sqrt{N}$ - here $\sqrt{1001}$, a very wide distribution with a relatively high probability of giving a large negative or positive result - the derivative of the activation function will very likely be close to zero! This isn't what we want.</div><div>What about if we initialise with a random normal with mean $0$ and standard deviation $\frac{1}{\sqrt{1000}}$? 
Now we know that the variance of a sum of iid random normal variables is the sum of the variances, so each entry of $Z^{(1)}$ now has mean $0$ and standard deviation $$\sigma = \sqrt{1000 \times \tfrac{1}{1000} + 1} = \sqrt{2}$$ (the $1000$ rescaled weights contribute variance $\frac{1}{1000}$ each, while the bias, still standard normal, contributes $1$), which is a lot narrower than our distribution before - there is a much smaller chance of initialising at values where the derivative of the activation function is close to $0$. Below is a comparison of the resulting initialisation distributions from the toy example - the green line is the resulting distribution for the refined initialisation, while the red line results from initialisation by standard normal variables.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-RsSs2dcWBxw/WIcuBL3FROI/AAAAAAAAAGU/8opxL3MzL9w_wV322hqBinDffzaAXsqrwCLcB/s1600/initdists.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="https://1.bp.blogspot.com/-RsSs2dcWBxw/WIcuBL3FROI/AAAAAAAAAGU/8opxL3MzL9w_wV322hqBinDffzaAXsqrwCLcB/s400/initdists.png" width="400" /></a></div></div><div><br /></div><div>More generally, for a given network we will initialise from a Gaussian distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{N_{in}}}$ where $N_{in}$ is the number of inputs into the layer.<br />Next time we'll have a look at optimising our network using stochastic gradient descent and maybe play around with some different datasets.</div>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2017/01/feedforward-artificial-neural-network_27.htmltag:blogger.com,1999:blog-83055021305463025.post-91196045231602435912017-01-12T01:04:00.000-08:002017-02-13T02:45:56.813-08:00Feedforward Artificial Neural Network pt4: Implementation!So if you've survived this far, it's all worth it! 
We've done <i>all</i> the groundwork required to code up our ANN to classify our toy example (<i>make moons</i> dataset kindly provided by sklearn).<br /><br /><h3>Recap</h3><a href="http://edgeofabandon.blogspot.com.au/2016/11/back-again-artificial-neural-networks.html">Recall</a> that a motivation for exploring ANNs was that our standard logistic regression classification technique could only ever give us a <i>linear</i> separating boundary:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-ZZcOgt_Cif0/WCl_YvvJ1xI/AAAAAAAAACU/3Rm0LGs8_7Y0ZxiVjgdPbmIhOzysOETVQCPcB/s1600/LogReg_boundary.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="309" src="https://2.bp.blogspot.com/-ZZcOgt_Cif0/WCl_YvvJ1xI/AAAAAAAAACU/3Rm0LGs8_7Y0ZxiVjgdPbmIhOzysOETVQCPcB/s320/LogReg_boundary.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><h3>ANN structure and Matrices</h3><div>In the <a href="http://edgeofabandon.blogspot.com.au/2017/01/feedforward-artificial-neural-network.html">last blog post</a> we derived general expressions for a single hidden layer feedforward ANN. 
The results were:</div><div><div>For a network with the following features:</div><div><ul><li>2 inputs, $x_1$, $x_2$</li><li>1 hidden layer with <i>n</i> neurons - we will treat <i>n</i> as a parameter and see how it affects the output classification</li><li>2 outputs, $\hat{y}_1$, $\hat{y}_2$</li><li>Activation function: $\sigma(x) = \tanh(x)$</li><li>Softmax output</li><li>Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$</li></ul></div></div><div><div><h4>Forward Propagation Summary</h4></div><ul><li>$Z^{(1)} = xW^{(1)} + b^{(1)}$</li><li>$a^{(1)} = \tanh(Z^{(1)})$</li><li>$Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$</li><li>$\hat{y} \equiv a^{(2)} = softmax(Z^{(2)})$</li></ul><h4>Backpropagation Summary</h4><ul><li>$\delta^{(2)} = \hat{y} - y$</li><li>$\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$</li><li>$\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$</li><li>$\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$</li><li>$\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$</li><li>$\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$</li></ul>With the above, we have entirely defined how our ANN classifies training examples, how the weights and biases of the network affect the loss function and hence have our gradient descent update rule completely specified (up to the learning rate parameter).</div><div><br /></div><div><h3>Implementation in Python</h3></div><div>Now instead of defining our network and neurons as a class, I've decided to closely follow <a href="http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/">this page</a> (which actually inspired all of these ANN-related posts) as it is more transparent and easier to grasp what is going on in the code. 
So without further delay, I present the code for an ANN with a single hidden layer of size 3:<br /><br /></div><div><h3>Code</h3></div><code class="prettify">import numpy as np</code><br /><code class="prettify">import sklearn.datasets</code><br /><code class="prettify">import pandas as pd</code><br /><code class="prettify">import matplotlib.pyplot as plt</code><br /><code class="prettify">np.random.seed(0) </code><br /><code class="prettify">X, y = sklearn.datasets.make_moons(200, noise=0.20)</code><br /><code class="prettify">df = pd.DataFrame()</code><br /><code class="prettify">df['x1'] = X[:,0]</code><br /><code class="prettify">df['x2'] = X[:,1]</code><br /><code class="prettify">df['z'] = y</code><br /><br /><br /><span style="font-family: monospace;">##define ANN which takes inputs X and y with parameters n_iter as the number of iterations and learning_rate the rate used in gradient descent</span><br /><span style="font-family: monospace;">def calculate(X, y, num_hidden_neurons, n_iter=2000, learning_rate = 0.01, regularisation_rate = 0):</span><br /><span style="font-family: monospace;"> loss = []</span><br /><span style="font-family: monospace;"> #initialise weights and biases</span><br /><span style="font-family: monospace;"> X_size = len(X)</span><br /><span style="font-family: monospace;"> np.random.seed(0)</span><br /><span style="font-family: monospace;"> W1 = np.random.randn(2, num_hidden_neurons) / np.sqrt(2)</span><br /><span style="font-family: monospace;"> b1 = np.zeros((1, num_hidden_neurons))</span><br /><span style="font-family: monospace;"> W2 = np.random.randn(num_hidden_neurons, 2) / np.sqrt(num_hidden_neurons)</span><br /><span style="font-family: monospace;"> b2 = np.zeros((1, 2))</span><br /><span style="font-family: monospace;"> </span><br /><span style="font-family: monospace;"> model = {}</span><br /><span style="font-family: monospace;"> </span><br /><span style="font-family: monospace;"> for i in range(n_iter):</span><br /><span style="font-family: monospace;"> #feedforward</span><br /><span 
style="font-family: monospace;"> Z1 = X.dot(W1)+b1</span><br /><span style="font-family: monospace;"> a1 = np.tanh(Z1)</span><br /><span style="font-family: monospace;"> Z2 = a1.dot(W2)+b2</span><br /><span style="font-family: monospace;"> yhat = softmax(Z2)</span><br /><span style="font-family: monospace;"> #backpropagation</span><br /><span style="font-family: monospace;"> d2 = yhat</span><br /><span style="font-family: monospace;"> d2[range(X_size),y] -= 1</span><br /><span style="font-family: monospace;"> d1 = (1-a1**2)*d2.dot(W2.T)</span><br /><span style="font-family: monospace;"> dW2 = a1.T.dot(d2)</span><br /><span style="font-family: monospace;"> db2 = np.sum(d2, axis=0, keepdims = True)</span><br /><span style="font-family: monospace;"> dW1 = X.T.dot(d1)</span><br /><span style="font-family: monospace;"> db1 = np.sum(d1, axis=0)</span><br /><span style="font-family: monospace;"> #regularisation</span><br /><span style="font-family: monospace;"> dW1 += regularisation_rate * W1 </span><br /><span style="font-family: monospace;"> dW2 += regularisation_rate * W2 </span><br /><span style="font-family: monospace;"> #gradient descent </span><br /><span style="font-family: monospace;"> W1 = W1 - learning_rate * dW1</span><br /><span style="font-family: monospace;"> b1 = b1 - learning_rate * db1</span><br /><span style="font-family: monospace;"> W2 = W2 - learning_rate * dW2</span><br /><span style="font-family: monospace;"> b2 = b2 - learning_rate * db2</span><br /><span style="font-family: monospace;"> </span><br /><span style="font-family: monospace;"> model = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}</span><br /><span style="font-family: monospace;"> loss.append(predict(model,X,y,True))</span><br /><span style="font-family: monospace;"> </span><br /><span style="font-family: monospace;"> return loss, model</span><br /><h3>Comments</h3><div>The code is fairly self explanatory, but I'll walk through the steps</div><div><ol><li>Importing all the libraries we 
need</li><li>Initialisation of the weights and biases (more on this later)</li><li>Feedforward step - implemented via matrices exactly how we calculated them in the previous blog post</li><li>Backpropagation step, again - implemented just how we derived them before. Note that as the $b^{(i)}$ matrices consist of a single unique row each, we let Python <a href="https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html">broadcast</a> them into the appropriate dimensions when it does the matrix addition in the calculation of $Z^{(1)}$ and $Z^{(2)}$.</li><li>Gradient descent to alter the weights after each iteration in order to minimize the loss function.</li><li>Regularisation rate - I haven't mentioned this before; I've put it in there but set it to $0$ so as to have no effect on the results. I will discuss this term later.</li><li>Output a dictionary <i>model</i> which contains our weight and bias matrices $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, $b^{(2)}$.</li><li>Note that I call a function (that's not defined here) called <i>predict</i> which will make predictions based on the input $X$ or return the cross entropy loss of a set of weights and biases when applied to $X$. The calculation of the cross entropy loss is the reason that $y$ is an input into the overall <i>calculate</i> function; if we were just calculating the weights and biases we would only require $X$ as an input. 
See code for <i>predict</i> at the bottom of this post.</li></ol><h3>Results</h3><div><span style="font-family: monospace;">#function call - 3 neurons in the hidden layer</span><br /><span style="font-family: monospace;">loss, model = calculate(X, y, 3)</span><br />The decision boundary is plotted below:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-hsA1sa6UF54/WHdB3HFRHsI/AAAAAAAAAE0/G9wviRdP_18tVAyHIFMkaJEipAyNt4XIgCLcB/s1600/ANN_3Neurons_boundary.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://4.bp.blogspot.com/-hsA1sa6UF54/WHdB3HFRHsI/AAAAAAAAAE0/G9wviRdP_18tVAyHIFMkaJEipAyNt4XIgCLcB/s400/ANN_3Neurons_boundary.png" width="400" /></a></div>Notice the non-linear decision boundary, and how well it captures the separation of data points! This gives us great hope: if the data generating process is distinctly non-linear, then even though the decision boundaries of a standard GLM (e.g. logistic regression) are crude, with an ANN we can potentially get a good fit!<br />The corresponding cross entropy loss as a function of iterations is also plotted below:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-3IyJhNfsyr4/WHdCicvJelI/AAAAAAAAAE4/KOmY8rN3Y68zLimS6mh5uVW39PuiuOTVgCLcB/s1600/ANN_3Neurons_Loss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="218" src="https://3.bp.blogspot.com/-3IyJhNfsyr4/WHdCicvJelI/AAAAAAAAAE4/KOmY8rN3Y68zLimS6mh5uVW39PuiuOTVgCLcB/s320/ANN_3Neurons_Loss.png" width="320" /></a></div>Ok, the overall trend of the loss looks alright, but what the hell are the spikes??? 
Stay tuned...<br /><br />What if we have 10 neurons in the hidden layer?<br /><span style="font-family: monospace;">#function call - 10 neurons in the hidden layer</span><br /><span style="font-family: monospace;">loss, model = calculate(X, y, 10)</span><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-45TYKfcl-Fc/WHdC_ZM3iVI/AAAAAAAAAE8/cE8_dj8kz1w9HbfKk1cpsdZqPILprwwjwCLcB/s1600/ANN_10Neurons_Loss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://2.bp.blogspot.com/-45TYKfcl-Fc/WHdC_ZM3iVI/AAAAAAAAAE8/cE8_dj8kz1w9HbfKk1cpsdZqPILprwwjwCLcB/s400/ANN_10Neurons_Loss.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;">and the corresponding loss: </div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-gAviUSnL6kc/WHdDOLnEcVI/AAAAAAAAAFE/ue3uUppXgyoR76DwxSSUrjvLxk4VHhSxACLcB/s1600/ANN_10Neurons_Loss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="221" src="https://4.bp.blogspot.com/-gAviUSnL6kc/WHdDOLnEcVI/AAAAAAAAAFE/ue3uUppXgyoR76DwxSSUrjvLxk4VHhSxACLcB/s320/ANN_10Neurons_Loss.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: left;">Now the spikes are gone and the loss is lower than in the 3 neuron case. Whoa, this is a <i>really</i> good fit - or is it? How good is too good? Obviously here, with 10 neurons the ANN is not only fitting the overall general shape of the data, but it's capturing the noise from the input too. Just looking at the data (ignoring the decision boundary) we can imagine a kind of smooth sinusoidal shape separating the classes (as in the 3 neuron case above). Notice the jagged edge near the lone red dot amongst the blue dots at coordinates (-0.5, 0.7); you can see that the single red training example has 'pulled' the decision boundary towards it. 
This is an ominous sign, as the decision boundary is far too sensitive to a single data point; remember, the idea is to capture the essence of the data generating process <i>excluding</i> noise.</div><div class="separator" style="clear: both; text-align: left;">If we were to assess the performance of our ANN purely on the cross entropy loss, it would seem that the network with 10 neurons in the hidden layer outperforms the one with 3. However, we <i>always</i> need to test the <span style="background-color: white;">ANN</span> on data it hasn't seen before - cross validation is usually employed to do so.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Well, there it is. If you've followed the posts from pt. 1 of this series knowing the bare minimum about ANNs, armed only with a pen, paper and a desire to learn, then you should be able to implement your own feedforward ANN from scratch! It was a laborious task (all the matrix algebra, etc.) but I hope it provided insight into the construction and architecture of ANNs.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><h3 style="clear: both; text-align: left;">Useful functions</h3><div><div><span style="font-family: monospace;"># Predict y_hat from input X or calculate loss</span></div><div><span style="font-family: monospace;">def predict(model, X, y, calc_loss = False):</span></div><div><span style="font-family: monospace;"> W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']</span></div><div><span style="font-family: monospace;"> # Forward propagation</span></div><div><span style="font-family: monospace;"> Z1 = X.dot(W1) + b1</span></div><div><span style="font-family: monospace;"> a1 = np.tanh(Z1)</span></div><div><span style="font-family: monospace;"> Z2 = a1.dot(W2) + b2</span></div><div><span style="font-family: monospace;"> probs = softmax(Z2)</span></div><div><span style="font-family: monospace;"> if calc_loss:</span></div><div><span style="font-family: monospace;"> loss = -np.log(probs[range(len(X)), y])</span></div><div><span style="font-family: monospace;"> tot_loss = np.sum(loss)</span></div><div><span style="font-family: monospace;"> return 1./len(X) * tot_loss</span></div><div><span style="font-family: monospace;"> return np.argmax(probs, axis=1) </span></div><div></div><div><span style="font-family: monospace;"># Compute softmax values for each set of scores in x</span></div><div><span style="font-family: monospace;">def softmax(x):</span></div><div><span style="font-family: monospace;"> exps = np.exp(x - np.max(x, axis=1, keepdims = True)) # subtract the row max for numerical stability</span></div><div><span style="font-family: monospace;"> return exps / np.sum(exps, axis=1, keepdims = True)</span></div></div><div>Given a model, the <i>predict</i> function returns a classification (0 or 1) for our input data X or the cross entropy loss. This is used to keep track of the loss at each iteration of the weight updates. Note that this is just for illustrative purposes; since it is an expensive calculation, in reality it isn't wise to call it <i>every</i> iteration - maybe every 100, or 1000 or so.</div><div><br /></div><div><i>softmax</i> maps a vector of scores to a probability distribution - used to calculate our $\hat{y}$.</div><div><br /></div><h3 style="clear: both; text-align: left;">Next steps</h3><div>Currently I'm working my way through the <a href="http://cs229.stanford.edu/">CS229</a> Machine Learning course by Andrew Ng which takes up the majority of my spare time. 
It's a fantastic course which has accompanying lectures available on YouTube and is a great introduction to ML techniques.</div><div><br /></div><div>However, in the future I aim to continue with this implementation but tweak it to explore the following:</div><div><ul><li>Why the weights were initialised in such a peculiar way (why not just set them all to 0?)</li><li>Extending to more than 1 hidden layer</li><li>How to prevent overfitting</li><ul><li>Regularisation (I mentioned it above, but did not explain what/how/why)</li><li>Drop out techniques</li></ul><li>Stochastic gradient descent</li><ul><li>A variant of our <i>batch gradient descent</i> which is industry practice when training an ANN (due to massive datasets)</li></ul><li>Different applications - when I get around to it, I'll implement an ANN for the classic MNIST handwritten digit recognition problem.</li></ul></div><h3><span style="font-family: monospace;"><br /></span></h3></div><ol></ol></div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/xiL8DrEZQJQ" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2017/01/feedforward-artificial-neural-network_12.htmltag:blogger.com,1999:blog-83055021305463025.post-69277069144691769652017-01-04T02:02:00.000-08:002017-01-11T23:59:15.467-08:00Feedforward Artificial Neural Network pt3: Matrices!So far we've looked at the building blocks of an ANN - the different layers, weights, activation functions and error functions. We now have a decent understanding of what the objects are and how they relate, but so far we've only looked at relatively small parts of the network in isolation for individual training samples. We'll look at implementing the aforementioned ideas using matrices, which will be very helpful when we build our full network and have hundreds (or thousands) of training samples. 
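To make the benefit concrete, here is a small illustrative comparison (variable names are my own) of computing a hidden layer's inputs with explicit loops over samples, neurons and inputs, versus the single matrix product we'll use from now on - both give exactly the same answer:

```python
import numpy as np

# Illustrative only: a hidden layer applied to many samples, first with
# explicit loops, then as one matrix product.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)        # 200 training samples, 2 inputs each
W1 = rng.randn(2, 3)         # weights for 3 hidden neurons
b1 = np.zeros((1, 3))        # biases for the hidden layer

# Old-fashioned way: iterate over every sample, neuron and input
Z_loop = np.zeros((200, 3))
for n in range(200):          # each training sample
    for j in range(3):        # each hidden neuron
        for i in range(2):    # each input
            Z_loop[n, j] += X[n, i] * W1[i, j]
        Z_loop[n, j] += b1[0, j]

# Matrix way: one line, delegated to optimised linear algebra routines
Z_mat = X.dot(W1) + b1

print(np.allclose(Z_loop, Z_mat))
```

The vectorised form is not just shorter; it hands the inner loops to highly optimised compiled code, which is the whole point of the matrix formulation below.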
This is a very natural thing to do - if we can set up the propagation of our network and training samples through matrix multiplication, then the computer can do the work in optimising the calculations. The other option is the old-fashioned way of using for loops to iterate through each neuron and each training example in our network; this way is markedly slower and less efficient.<br /><h3>Sample ANN - using matrices</h3><div>We'll use the below network to demonstrate the power of the matrix implementation. Note that we will do this step by step, so it can be rather laborious/repetitive, but it's important to understand how each element of the matrix relates to the inputs and is propagated through the network. </div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-bZW27KOt7hg/WE-0gkcFp3I/AAAAAAAAAEg/mX3PAfLDNi4zip6ycr-TGAPhlb6BU2y5wCLcB/s1600/ANN.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="190" src="https://1.bp.blogspot.com/-bZW27KOt7hg/WE-0gkcFp3I/AAAAAAAAAEg/mX3PAfLDNi4zip6ycr-TGAPhlb6BU2y5wCLcB/s400/ANN.png" width="400" /></a></div><div>The network has the following features:</div><div><ul><li>2 inputs, $x_1$, $x_2$</li><li>1 Hidden Layer with 3 neurons</li><li>2 outputs, $\hat{y}_1$, $\hat{y}_2$</li><li>Activation function: $\sigma(x) = \tanh(x)$</li><li>Softmax output</li><li>Loss function: Cross entropy loss $$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$</li></ul><h3>Inputs</h3><div>We define our input matrix based on two training samples: $$x = \left[ \begin{array}{cc} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right]$$ </div></div><div>where the matrix is defined such that $x_{ij}$ is the $j^{th}$ input of the $i^{th}$ training sample.</div><div><br /></div><h3>Weights</h3><div>Now we define a matrix $W^{(1)}$ of weights for the neurons in the Hidden Layer:</div><div>$$W^{(1)} = \left[ \begin{array}{ccc} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \end{array} \right]$$ where $w_{ij}^{(1)}$ is the weight for the $i^{th}$ input in the $j^{th}$ neuron in the hidden layer. Thus each column of $W^{(1)}$ holds the weights that one neuron in the hidden layer applies to the inputs.</div><h3>Bias</h3><div>We introduce the concept of <i>bias</i> in an activation function. The bias is a translation of the entire activation function and is implemented as follows: $$\sigma(w \cdot x +b)$$ where $w$ is the weight, $x$ is the input into the particular neuron and $b$ is the bias. The bias adds flexibility to the activation function; we can achieve any output from any input. </div><div>For example, consider an activation function <i>without </i>bias $$\sigma(x) = \tanh(w \cdot x)$$ and say we want to achieve an output of $0.5$ when $x=0$. There is no such $w$ that will allow us to achieve this result, as $\tanh(0) = 0$. However, if we introduce bias, $$\sigma(x) = \tanh(w \cdot x + b)$$ we can set $b \approx 0.549$ to achieve the desired result. Now we introduce the bias matrix - the matrix of biases for the hidden layer $$b^{(1)} = \left[ \begin{array}{ccc} b_{11}^{(1)} & b_{12}^{(1)} & b_{13}^{(1)} \\ b_{21}^{(1)} & b_{22}^{(1)} & b_{23}^{(1)} \end{array} \right]$$ with $b^{(1)}_{ij} = b^{(1)}_{kj}$ for all $k$, i.e. $b^{(1)}$ has a single unique row; all other rows in the matrix are a copy of this one. This is just to ensure that the matrix multiplication works as intended.<br /><h3>Forward Propagation</h3></div>We'll now work our way through the network, starting at the input, traversing through the hidden layer and arriving at the output.<br /><br />Using the notation described in <a href="http://edgeofabandon.blogspot.com.au/2016/12/feedforward-artificial-neural-network.html">this</a> previous blog post, we get our matrix $Z^{(1)}$ which contains the results of applying the weights and biases to our inputs. 
Thus $$Z^{(1)} = x W^{(1)}+b^{(1)}$$ is a $2 \times 3$ matrix with the following structure<br />$$Z^{(1)} = \left[ \begin{array}{ccc} x_{11}w_{11}^{(1)}+x_{12}w_{21}^{(1)}+b_{11}^{(1)} & \cdots & \cdots \\ x_{21}w_{11}^{(1)}+x_{22}w_{21}^{(1)}+b_{21}^{(1)} & \cdots & \cdots \end{array} \right]$$ where I've only included the first column of the matrix. In matrix index notation we have $$Z_{ij}^{(1)} = \sum_{k} x_{ik} W^{(1)}_{kj} + b_{ij}^{(1)}$$We can interpret $Z_{ij}^{(1)}$ as the input that the $j^{th}$ hidden neuron receives from the $i^{th}$ training sample.<br /><br />Once at the hidden layer, the activation function is applied (elementwise) to $Z^{(1)}$ as $$a^{(1)} = \left[ \begin{array}{ccc} a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} \\ a_{21}^{(1)} & a_{22}^{(1)} & a_{23}^{(1)} \end{array} \right] \equiv \tanh(Z^{(1)}) $$<br /><br />We now propagate these values through the hidden layer, applying the weights and biases, resulting in a $2 \times 2$ matrix $Z^{(2)}$ defined as $$Z^{(2)} = a^{(1)}W^{(2)}+b^{(2)}$$ where $$W^{(2)} = \left[ \begin{array}{cc} w_{11}^{(2)} & w_{12}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} \\ w_{31}^{(2)} & w_{32}^{(2)} \end{array} \right]$$ and $$b^{(2)} = \left[ \begin{array}{cc} b_{11}^{(2)} & b_{12}^{(2)} \\ b_{21}^{(2)} & b_{22}^{(2)} \end{array} \right]$$ Similarly to $b^{(1)}$, this matrix has a unique single row, with the rest of the rows of the matrix being exact copies. 
Explicitly, $Z^{(2)}$ has the following entries:<br />$$\left[ \begin{array}{cc} a_{11}^{(1)}w_{11}^{(2)}+a_{12}^{(1)}w_{21}^{(2)}+a_{13}^{(1)}w_{31}^{(2)}+b_{11}^{(2)} & a_{11}^{(1)}w_{12}^{(2)}+a_{12}^{(1)}w_{22}^{(2)}+a_{13}^{(1)}w_{32}^{(2)}+b_{12}^{(2)} \\ a_{21}^{(1)}w_{11}^{(2)}+a_{22}^{(1)}w_{21}^{(2)}+a_{23}^{(1)}w_{31}^{(2)}+b_{21}^{(2)} & a_{21}^{(1)}w_{12}^{(2)}+a_{22}^{(1)}w_{22}^{(2)}+a_{23}^{(1)}w_{32}^{(2)}+b_{22}^{(2)} \end{array} \right]$$ More compactly, we have $$Z_{ij}^{(2)} = \sum_{k} a^{(1)}_{ik} W^{(2)}_{kj} + b^{(2)}_{ij} $$ where $Z_{ij}^{(2)}$ can be considered the input to the $j^{th}$ output neuron from the $i^{th}$ training sample.<br /><br />Finally, applying the softmax function in the output neurons we end up with $$a^{(2)} = \text{softmax}(Z^{(2)}) \equiv \left[ \begin{array}{cc} \hat{y}_{11} & \hat{y}_{12} \\ \hat{y}_{21} & \hat{y}_{22} \end{array} \right]$$ where $a_{ij}^{(2)}$ is the output of the $j^{th}$ output neuron of the $i^{th}$ training sample. Hopefully the explicit matrix multiplication helps illuminate how the input values are propagated through the network and how their values are affected by the various weights in the input/hidden/output layers. Next we'll take an explicit look at backpropagation and establish similar matrix results which will allow us to train the network in a relatively efficient manner.<br /><h3>Backpropagation</h3><div>In this section we'll derive the backpropagation rules for the network. 
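Before diving into the derivations, it's reassuring to see that the forward pass just described transcribes almost line for line into NumPy. A sketch for the example 2-input, 3-hidden-neuron, 2-output network (the random weights are purely for illustration):

```python
import numpy as np

# Sketch of the forward pass for the example network: 2 inputs,
# 3 hidden neurons, 2 outputs; random values for illustration only.
np.random.seed(0)
x = np.random.randn(2, 2)                  # 2 training samples, 2 inputs each
W1, b1 = np.random.randn(2, 3), np.zeros((1, 3))
W2, b2 = np.random.randn(3, 2), np.zeros((1, 2))

Z1 = x.dot(W1) + b1                        # 2x3: input to each hidden neuron
a1 = np.tanh(Z1)                           # elementwise activation
Z2 = a1.dot(W2) + b2                       # 2x2: input to each output neuron
y_hat = np.exp(Z2) / np.sum(np.exp(Z2), axis=1, keepdims=True)  # softmax

print(y_hat.sum(axis=1))                   # each row sums to 1
```

Note the `b1`/`b2` arrays here have a single row and are broadcast across the samples, exactly the "single unique row" behaviour of the bias matrices above.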
Make sure you understand matrix index notation as we'll use it here to succinctly write the results.</div><div>Recall our definition $$\delta_j \equiv \frac{\partial L}{\partial z_j} = \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$ We'll slightly alter our notation to make it easier to track all of the indices floating around: $\delta^{(j)} \equiv \delta_j$. Starting at our output neurons and using the result from the <a href="http://edgeofabandon.blogspot.com.au/2016/12/derivative-of-cross-entropy-loss-with.html">previous blog post</a> we have $$\delta^{(2)}_{ij} \equiv \frac{\partial L}{\partial Z^{(2)}_{ij}} = \hat{y}_{ij} - y_{ij}$$ or in full matrix form<br />$$\left[ \begin{array}{cc} \hat{y}_{11}-y_{11} & \hat{y}_{12}-y_{12} \\ \hat{y}_{21}-y_{21} & \hat{y}_{22}-y_{22} \end{array} \right]$$ Now $$\delta^{(1)}_{ij} \equiv \frac{\partial L}{\partial Z^{(1)}_{ij}}$$ using the chain rule we can decompose this into $$ \frac{\partial L}{\partial Z^{(1)}_{ij}} = \sum_{m,n,p,q} \frac{\partial L}{\partial{Z^{(2)}_{mn}}} \times \frac{\partial{Z^{(2)}_{mn}}}{\partial{a^{(1)}_{pq}}}\times \frac{\partial{a^{(1)}_{pq}}}{\partial {Z^{(1)}_{ij}}}$$ The first term is simply $\delta^{(2)}_{mn}$. For the second term, recall above $$Z^{(2)}_{mn} = \sum_{k} a^{(1)}_{mk} W^{(2)}_{kn} + b^{(2)}_{mn} $$ $$\implies \frac{\partial Z^{(2)}_{mn}}{\partial a^{(1)}_{pq}} = W^{(2)}_{qn}$$ where the sum over $k$ has been collapsed since the derivative is only non-zero when $m=p$ and $k=q$. Now, the third term $$\frac{\partial a^{(1)}_{pq}}{\partial Z^{(1)}_{ij}} = \frac{\partial}{\partial Z^{(1)}_{ij}} \left( \tanh(Z^{(1)}_{pq}) \right) = 1 - \tanh^{2}(Z^{(1)}_{ij})$$ where the only non-zero elements of the above expression are when $p=i$ and $q=j$. 
Putting this all together we have $$\frac{\partial L}{\partial Z^{(1)}_{ij}} = \left( 1 - \tanh^{2}(Z^{(1)}_{ij}) \right) \sum_{n} \delta^{(2)}_{in} W^{(2)}_{jn} $$ Note the sum is only over $n$ now, instead of $m, n, p, q$ - make sure you understand why this is. We can now write this in full matrix form (as we will use in the implementation) as<br />$$\delta^{(1)} = \left( 1 - \tanh^{2}(Z^{(1)}) \right) \odot \delta^{(2)} {W^{(2)}}^T$$ where $\odot$ is the elementwise multiplication of the matrices (Hadamard product) and $T$ indicates the transpose of the matrix.<br />Now we can use the above results for $\delta^{(1)}$ and $\delta^{(2)}$ to calculate how the loss function changes with respect to the weights - recall this is the value we use to alter our weights at each iteration in our gradient descent routine. First, $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial W^{(2)}_{ij}}$$ The first term is simply $\delta^{(2)}_{pq}$ and the second term is $$\frac{\partial}{\partial W^{(2)}_{ij}} \left[ \sum_{k} a^{(1)}_{pk} W^{(2)}_{kq} + b^{(2)}_{pq} \right]$$ $\implies k=i$ and $q=j$, hence the sum collapses and the term evaluates to $a^{(1)}_{pi}$. Hence $$\frac{\partial L}{\partial W^{(2)}_{ij}} = \sum_{p} \delta^{(2)}_{pj} a^{(1)}_{pi}$$ or in matrix form, this is represented as $$ \frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$$ Similarly $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_{p,q} \frac{\partial L}{\partial Z^{(2)}_{pq}} \times \frac{\partial Z^{(2)}_{pq}}{\partial b^{(2)}_{ij}} $$ the second term forces $q=j$, resulting in $$\frac{\partial L}{ \partial b^{(2)}_{ij}} = \sum_p \delta^{(2)}_{pj}$$ i.e. a sum over the rows of $\delta^{(2)}$. Note the right-hand side of the expression is independent of $i$; this implies that every row in $\frac{\partial L}{\partial b^{(2)}}$ is the same. 
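Hand-derived gradients like these are easy to get subtly wrong, so it's worth spot-checking them against finite differences. A minimal sketch (variable names and the tiny random network are my own; the analytic gradient uses $\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$ as derived above):

```python
import numpy as np

# Finite-difference check of dL/dW2 = a1^T . delta2 (illustrative sketch).
np.random.seed(1)
N = 4
x = np.random.randn(N, 2)
y = np.eye(2)[np.random.randint(0, 2, N)]      # one-hot labels
W1, b1 = np.random.randn(2, 3), np.random.randn(1, 3)
W2, b2 = np.random.randn(3, 2), np.random.randn(1, 2)

def loss(W2_):
    a1 = np.tanh(x.dot(W1) + b1)
    Z2 = a1.dot(W2_) + b2
    y_hat = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)
    return -np.sum(y * np.log(y_hat)) / N      # averaged cross entropy loss

# Analytic gradient from the derivation above
a1 = np.tanh(x.dot(W1) + b1)
Z2 = a1.dot(W2) + b2
y_hat = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)
delta2 = (y_hat - y) / N     # y_hat - y from the derivation; 1/N matches the averaged loss
dW2 = a1.T.dot(delta2)

# Numerical gradient by central differences
eps, dW2_num = 1e-6, np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        E = np.zeros_like(W2)
        E[i, j] = eps
        dW2_num[i, j] = (loss(W2 + E) - loss(W2 - E)) / (2 * eps)

print(np.max(np.abs(dW2 - dW2_num)))   # should be very close to zero
```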
Similar analysis yields $$\frac{\partial L}{\partial W^{(1)}_{ij}} = \sum_{p} \delta^{(1)}_{pj} x_{pi} $$ and $$\frac{\partial L}{\partial b^{(1)}_{ij}} = \sum_p \delta^{(1)}_{pj}$$<br /><h3><b>Recap</b></h3>Relax. Take a breath. The derivations are over (for now). I think going through all the derivations in detail at least once is paramount to understanding the inner workings of the neural network. Once you get your head around the notation, there actually isn't anything that fancy going on - just repeated applications of the chain rule. The key for me was attacking this problem via matrix index notation, which illustrated exactly how values are propagated (backwards and forwards) in the network via matrix multiplication.<br /><h4>Forward Propagation Summary</h4></div><ul><li>$Z^{(1)} = xW^{(1)} + b^{(1)}$</li><li>$a^{(1)} = \tanh(Z^{(1)})$</li><li>$Z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$</li><li>$\hat{y} \equiv a^{(2)} = \text{softmax}(Z^{(2)})$</li></ul><div><h4>Backpropagation Summary</h4></div><br /><ul><li>$\delta^{(2)} = \hat{y} - y$</li><li>$\delta^{(1)} = (1-\tanh^{2}(Z^{(1)})) \odot \delta^{(2)}{W^{(2)}}^T$</li><li>$\frac{\partial L}{\partial W^{(2)}} = {a^{(1)}}^T \delta^{(2)}$</li><li>$\frac{\partial L}{\partial b^{(2)}} = \sum_{rows} \delta^{(2)}$</li><li>$\frac{\partial L}{\partial W^{(1)}} = {x}^T \delta^{(1)}$</li><li>$\frac{\partial L}{\partial b^{(1)}} = \sum_{rows} \delta^{(1)}$</li></ul>Note that the $\frac{\partial L}{\partial W^{(i)}}$ and $\frac{\partial L}{\partial b^{(i)}}$ are <b>not</b> actually derivatives with respect to matrices; these are just labels for the matrices containing the derivatives with respect to the various weights and biases. 
The $\sum_{rows} \delta^{(i)}$ indicates that we sum over the rows of the matrix $\delta^{(i)}$ such that the matrices $$\frac{\partial L}{\partial b^{(i)}}$$ consist of a single unique row repeated $N$ times, where $N$ is the number of training examples.<br /><div><br /><div><br /></div><div>With these matrices we are now in a position to <i>finally</i> implement our ANN to classify our dataset! The next blog will detail the implementation in Python and the associated results.</div></div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/DA-ZwyjMA-4" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2017/01/feedforward-artificial-neural-network.htmltag:blogger.com,1999:blog-83055021305463025.post-38840720154035013832016-12-07T02:32:00.001-08:002016-12-07T02:37:47.719-08:00Derivative of cross entropy loss with softmax<h2>Derivative of the softmax function</h2><div>I've dedicated a separate post for the derivation of the derivative of the cross entropy loss function with softmax as the activation function in the output layer, which I'll reference in the future. It's nothing groundbreaking, but sometimes it's nice to work through some of the results which are often quoted without derivation. </div><h3>The softmax function</h3><div>The softmax function is often used as the activation function for neurons in the output layer of an ANN. We borrow the notation from a previous <a href="http://edgeofabandon.blogspot.com.au/2016/12/feedforward-artificial-neural-network.html">blog post</a>, where $\hat{y}$ is the output of a particular neuron in the output layer and $$z_o = \sum_{j \in input(o)} z_j w_{j \rightarrow o}$$ i.e. 
we sum over all paths $j$ which are an input into our output neuron $o$.</div><div><br /></div><div>It is defined as follows:</div><div>$$ \hat{y}_o = \frac{e^{z_o}}{\sum_k e^{z_k}}$$ where the sum is over the $k$ output neurons in the network. The softmax essentially takes an input $z_o \in \mathbb{R}$ and maps it to the interval $(0,1)$ such that the sum of all outputs is 1 - it gives us a probability distribution over our $k$ outputs (i.e. $k$ different classifications). </div><div></div><h3>Cross entropy loss function</h3><div>Recalling the definition of the cross entropy loss function from a previous <a href="http://edgeofabandon.blogspot.com.au/2016/12/feedforward-artificial-neural-network.html">blog post</a> we have</div><div>$$L = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \ln{\hat{y}_{n,i}}$$</div><div>where $N$ is the number of training examples and $C$ is the number of classifications (i.e. 2 for binary classification, etc...). Let us assume $N = 1$ without loss of generality, thus </div><div>$$L = - \sum_{o} y_{o} \ln{\hat{y}_{o}}$$ where the sum is over the output neurons in our network (given that we have 1 output neuron for each class, the sum over $i \in C$ and $o$ are equivalent). Note that $y_o$ is the actual classification of our data, i.e. $y_o \in \{0,1\}$, and that if $y_o=1$ then all other $y_j = 0$ where $j \neq o$ - see a description of <a href="https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science">one hot encoding</a>. 
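Both facts - the softmax producing a probability distribution, and a one hot encoded label summing to one - can be seen in a couple of lines (the scores below are made-up values, purely for illustration):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])             # raw scores for k = 3 classes
y_hat = np.exp(z) / np.sum(np.exp(z))     # softmax: each entry in (0, 1)
print(y_hat.sum())                        # the outputs sum to 1

y = np.array([0, 1, 0])                   # one-hot encoded label
print(y.sum())                            # exactly one class is "on"
```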
Now our object of interest is $$ \frac{ \partial L}{\partial z_i} = \sum_o \frac{\partial L}{\partial \hat{y}_o} \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Let's focus on the second term $\frac{\partial \hat{y}_o}{\partial z_i}$ for now and split it into two cases.<br /><br /><ul><li>Case 1: $ i = o$</li></ul>$$\frac{\partial \hat{y}_o}{\partial z_i} = \frac{\partial \hat{y}_o}{\partial z_o} = \frac{\partial}{\partial z_o} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \frac{e^{z_o}}{\sum_k e^{z_k}} - e^{z_o} (\sum_k e^{z_k})^{-2}e^{z_o}$$</div><div>$$= \frac{e^{z_o}}{\sum_k e^{z_k}} \left ( 1 - \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = \hat{y}_o (1 - \hat{y}_o)$$</div><div><ul><li>Case 2: $ i \neq o $</li></ul>$$ \frac{\partial \hat{y}_o}{\partial z_i} = \frac{\partial}{\partial z_i} \left( \frac{e^{z_o}}{\sum_k e^{z_k}} \right) = -e^{z_o}(\sum_k e^{z_k})^{-2} e^{z_i}$$</div><div>$$= -\hat{y}_o \hat{y}_i$$ Now $$\frac{\partial L}{\partial \hat{y}_o} = - \frac{y_o}{\hat{y}_o}$$ Therefore $$ \frac{\partial L}{\partial z_i} = -\sum_o \frac{1}{\hat{y}_o} y_o \times \frac{\partial \hat{y}_o}{\partial z_i}$$ Substituting in our above computations and splitting out the $i=o$ case from the sum we get</div><div><br /></div><div>$$\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_i} y_i \times \hat{y}_i(1-\hat{y}_i) - \sum_{o \neq i} \frac{1}{\hat{y}_o} y_o \times (- \hat{y}_o \hat{y}_i)$$</div><div>$$ = -y_i(1-\hat{y}_i) + \sum_{o \neq i} y_o \hat{y}_i$$ $$= y_i \hat{y}_i -y_i + \sum_{o \neq i} y_o \hat{y}_i$$ Moving the first term inside the sum and summing over all $o$ we get </div><div>$$\frac{\partial L}{\partial z_i} = -y_i + \sum_{o} y_o \hat{y}_i$$ $$ = \hat{y}_i \sum_o y_o - y_i$$ But $\sum_o y_o$ is the sum over all the actual labels of our data which is one hot encoded (as mentioned above). Thus $y_o = 1$ for exactly one class and $0$ for all else - hence $\sum_o y_o = 1$. 
We <i>finally</i> arrive at a result we'll use in the future in constructing and implementing our ANN to classify the <i>make moons</i> dataset:</div><div>$$ \frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$$</div><div></div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/MDMzaMkY5oQ" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2016/12/derivative-of-cross-entropy-loss-with.htmltag:blogger.com,1999:blog-83055021305463025.post-53133941703032139972016-12-05T01:50:00.001-08:002016-12-14T02:07:53.275-08:00Feedforward Artificial Neural Network pt2<h2>Feedforward Artificial Neural Network pt2</h2>In the previous post, we briefly looked at the general structure of a Feedforward ANN and why such a construction should be useful. Now that we have a general handle on the idea behind a Feedforward ANN, we need to <i>train</i> the network - find a method to calculate the appropriate weights for our hidden layer neurons. In the context of our classification case, we want to train the network by providing it with a dataset with labels so that it can predict the labels of new data once trained - a classic supervised learning scenario.<br /><br /><div><h4>Backpropagation</h4></div><div>Backpropagation is the algorithm we'll use to train (i.e. minimize the error function of) our ANN. It is essentially repeated applications of the chain rule - chaining together all possible paths in the network to establish how the error changes as a function of the inputs and outputs of each neuron. This may not seem like a very profound approach, nor may it be clear that this use of the chain rule should deserve its own name; however, the power of this method comes from the implementation. The setup of backpropagation allows the derivatives we use in the chain rule to be calculated using intermediate results in our network. 
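A tiny one-neuron-per-layer toy example (entirely my own, purely to illustrate the idea of reusing intermediates) shows the pattern: the forward pass stores its intermediate values, and the backward pass computes each $\delta$ once and reuses it for every derivative that needs it:

```python
import numpy as np

# Toy chain: x -> (w1, tanh) -> (w2) -> y_hat, squared loss against target y.
x, y, w1, w2 = 0.5, 1.0, 0.3, -0.2

# Forward pass: store the intermediates
z1 = w1 * x
a1 = np.tanh(z1)
y_hat = w2 * a1
L = 0.5 * (y_hat - y) ** 2

# Backward pass: delta2 and a1 are each computed once and reused
delta2 = y_hat - y                    # dL/dy_hat
dw2 = delta2 * a1                     # reuses the stored a1
delta1 = delta2 * w2 * (1 - a1 ** 2)  # reuses delta2 and a1
dw1 = delta1 * x                      # reuses delta1

print(dw1, dw2)
```

Nothing is recomputed: `a1` from the forward pass and each `delta` feed several derivatives, which is exactly the saving that scales up in a large network.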
This becomes especially important when the size of our network grows - it is efficient in that it saves the recalculation of derivatives.<br /><h4></h4><h4>Some notation</h4><h4><div></div><div><span style="font-weight: normal;">Suppose we have two neurons labelled by $i$ and $j$ respectively, as depicted below:</span></div><div class="separator" style="clear: both; text-align: center;"></div></h4><h4><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iDK6Juycs0M/WD_phs3vzjI/AAAAAAAAADw/A164M-v0KU0SnDv_5glLntjDRjfoDIxvgCLcB/s1600/simpleNeuron.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="41" src="https://1.bp.blogspot.com/-iDK6Juycs0M/WD_phs3vzjI/AAAAAAAAADw/A164M-v0KU0SnDv_5glLntjDRjfoDIxvgCLcB/s200/simpleNeuron.png" width="200" /></a></div><div class="separator" style="clear: both;"><span style="font-weight: normal;">where $\sigma_j$ is a label for the neuron with the following meaning:</span></div><div class="separator" style="clear: both;"></div><ul><li><span style="font-weight: normal;">We define the activation function for neuron $j$ as $a_j(w_{i \rightarrow j} \cdot a_i)$, where </span><span style="font-weight: normal;">$w_{i \rightarrow j}$ is the weight applied to the input $a_i$ of neuron $j$. </span></li></ul></h4><h4><div><div></div><b>Gradient Descent</b></div><div class="separator" style="clear: both;"><span style="font-weight: normal;"></span></div><div><div style="font-weight: normal;">In order to minimize the loss function, one might think to simply take the derivatives and throw them into a prebuilt optimisation routine from your favourite library. This approach is fine when the network is small, but when it scales to millions of neurons it becomes computationally intractable. We will in fact use a standard optimisation routine, but it is the clever implementation which makes it so attractive. 
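Gradient descent itself takes only a few lines; a one-dimensional toy example (my own, minimising $L(w) = w^2$ with the standard update $w \leftarrow w - \eta \, L'(w)$) previews the learning-rate trade-off discussed below:

```python
# Toy 1-D gradient descent on L(w) = w^2, whose minimum is at w = 0.
# dL/dw = 2w, so each update is w <- w - eta * 2 * w.
def descend(eta, w=1.0, n_iter=20):
    for _ in range(n_iter):
        w = w - eta * 2 * w
    return w

print(descend(0.01))   # small eta: converging, but slowly - still well away from 0
print(descend(0.4))    # moderate eta: essentially at the minimum after 20 steps
print(descend(1.1))    # eta too large: every step overshoots and |w| grows
```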
The optimisation method we'll be using is called the <a href="https://en.wikipedia.org/wiki/Gradient_descent">batch gradient descent</a> method; this is in contrast to the <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a> method. It is a fairly standard iterative method to minimize a loss function $L\left(w_{i \rightarrow j} \right)$ by updating the target variables by</div><div style="font-weight: normal;">$$ w_{i\rightarrow j} \rightarrow w_{i\rightarrow j} - \eta \frac{\partial L}{\partial w_{i\rightarrow j}} $$</div><div style="font-weight: normal;">at each iteration. $\eta$ is the so-called <i>learning rate</i> which determines how big/small the steps are at each iteration. Intuitively, small steps will take longer to converge, while larger steps cover more ground but may overshoot the true minimum - the ideal value depends on the loss function and lies somewhere in between.</div><div style="font-weight: normal;"><br /></div></div></h4><h4>The Loss Function</h4><h4><div><span style="font-weight: normal;">Once we have the architecture of our ANN set up, we need to measure its performance. In order to do so, we borrow from information theory a loss function which is commonly found in ANN classification tasks: the <a href="https://en.wikipedia.org/wiki/Cross_entropy">cross entropy loss</a></span><br /><span style="font-weight: normal;">$$ L = -\frac{1}{N} \sum_{n \in N} \sum_{ i \in C} y_{n,i} \ln \hat{y}_{n,i} $$</span><br /><span style="font-weight: normal;">where we sum over $N$, the number of training examples, and $C$, the number of classes in our classification problem. The hat $\hat{y}$ indicates an output classification from our network, where $y$ is the actual classification of the training sample. This is the loss function we will use once we begin to classify the examples from the previous blog post. Note that if we were building the ANN with a continuous output (i.e. 
regression) we could use the usual Residual Sum of Squares</span></div><div><span style="font-weight: normal;">$$L_{reg} = \sum_{i=1}^{N} \frac{1}{2} \left( \hat{y_{i}} - y_{i} \right)^{2} $$</span></div><div><span style="font-weight: normal;">where the factor of a half is chosen for convenience (it prevents stray factors of $2$ floating around when we differentiate).</span><br /><span style="font-weight: normal;">Thus the task of training our neural network becomes an optimisation problem: minimize the classification error function $L$ with respect to the weights $w_{i \rightarrow j}$ of each neuron in the hidden layer - see the notation section above. Note this is ignoring any potential problems of overfitting.</span></div><div><br /></div></h4></div><h4>Illuminating example - a sample ANN</h4><h4><span style="font-weight: normal;">I'm still working my way around TikZ which will allow me to make nice diagrams in LaTeX, but in the interest of time I've borrowed the below image from <a href="http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4">here</a>. The network pictured has a single input $x_i$, 2 hidden layers and a single output. We will not assume the functional form of the loss function $L$ or the activation functions $\sigma$. 
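As a concrete toy illustration of the gradient descent update rule above, here is a minimal batch gradient descent sketch (the function name, the single-weight model $\hat{y} = w x$ and the parameter values are my own illustrative choices, not from the post), minimising the RSS loss:

```python
import numpy as np

def batch_gradient_descent(x, y, eta=0.1, n_iter=500):
    """Fit y ~ w*x by minimising L = sum_i 0.5*(w*x_i - y_i)^2."""
    w = 0.0
    for _ in range(n_iter):
        grad = np.mean((w * x - y) * x)  # dL/dw, averaged over the whole batch
        w -= eta * grad                  # the update rule from the text
    return w
```

Stochastic gradient descent would instead estimate the gradient from a single example (or a small mini-batch) per update, trading noisier steps for much cheaper iterations.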
</span><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-pDM_9ZMy9nA/WEUncE33nMI/AAAAAAAAAEQ/q1Y4XmIZAFoDY0p8kTI7GXJE9Ta4C_VZACLcB/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://2.bp.blogspot.com/-pDM_9ZMy9nA/WEUncE33nMI/AAAAAAAAAEQ/q1Y4XmIZAFoDY0p8kTI7GXJE9Ta4C_VZACLcB/s320/download.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">We want to understand how the output $\hat{y}$ changes with respect to the input weight $w_{in \rightarrow i}$; we can use the chain rule to get this relationship. Noting that there are two paths from the input layer to the output, we need to consider each path separately. 
</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">We define the following in the network, where $i$, $j$, $k$, $o$ index the neurons </span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">$$z_i = w_{in \rightarrow i} x_i, \quad a_i = \sigma(z_i)$$</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">$$z_j = w_{i \rightarrow j} a_i, \quad a_j = \sigma(z_j)$$</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">$$z_k = w_{i \rightarrow k} a_i, \quad a_k = \sigma(z_k)$$</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">$$z_o = w_{k \rightarrow o} a_k + w_{j \rightarrow o} a_j, \quad \hat{y} \equiv a_o = output(z_o)$$</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">where $output(x)$ is the output function, which for now we will keep generic and $\sigma(x)$ is the activation function, which is the same for each neuron in the network.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">We decompose $\frac{\partial L}{\partial w_{in \rightarrow i}}$: starting at the right side of the graph and moving through $o \rightarrow k \rightarrow i$ as below:</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;">$$\frac{\partial L}{\partial z_o} \times \frac{\partial z_o}{\partial z_k} 
\times \frac{\partial z_k}{\partial z_i} \times \frac{\partial z_i}{\partial w_{in \rightarrow i}}$$</span></div></h4><h4><div class="separator" style="clear: both;"><span style="font-weight: normal;">Let's look at an individual component $$\frac{\partial z_o}{ \partial z_k} = \frac{\partial}{\partial z_k} \left( w_{k \rightarrow o} a_k + w_{j \rightarrow o} a_j \right)$$ noting the second term does not depend on $z_k$</span></div><div class="separator" style="clear: both;"><span style="font-weight: normal;">$$\frac{\partial}{\partial z_k} w_{k \rightarrow o} \sigma(z_k) = w_{k \rightarrow o} \sigma ' (z_k)$$With a similar analysis we see that the product of derivatives can be expressed as</span></div><div class="separator" style="clear: both;"><span style="font-weight: normal;">$$\frac{\partial L}{\partial z_o} \times w_{k \rightarrow o} \sigma ' (z_k) \times w_{i \rightarrow k} \sigma ' (z_i) \times x_i$$</span></div><div><span style="font-weight: normal;">and now moving through $o \rightarrow j \rightarrow i$</span></div></h4><h4><div class="separator" style="clear: both;"><span style="font-weight: normal;">$$\frac{\partial L}{\partial z_o} \times \frac{\partial z_o}{\partial z_j} \times \frac{\partial z_j}{\partial z_i} \times \frac{\partial z_i}{\partial w_{in \rightarrow i}}$$</span></div><div class="separator" style="clear: both;"><span style="font-weight: normal;">$$=\frac{\partial L}{\partial z_o} \times w_{j \rightarrow o} \sigma ' (z_j) \times w_{i \rightarrow j} \sigma ' (z_i) \times x_i$$</span></div><div class="separator" style="clear: both;"><span style="font-weight: normal;">Putting this all together we have $$\frac{\partial L}{\partial w_{in \rightarrow i}} = \frac{\partial L}{\partial z_o} \left[ w_{k \rightarrow o} \sigma ' (z_k) \times w_{i \rightarrow k} \sigma '(z_i) + w_{j \rightarrow o} \sigma ' (z_j) \times w_{i \rightarrow j} \sigma '(z_i) \right] x_i$$</span></div><div class="separator" style="clear: both;"><span style="font-weight: 
where we have yet to specify">
normal;">where we have yet to specify the exact form of the loss function $L$ and activation function $\sigma(x)$ in this example. Once these are specified, we can just plug them in to the above formula. Hopefully this simplified example illuminates how the derivative of the loss function with respect to each weight is calculated. In terms of backpropagation, we can think of starting at the output and working backwards through each neuron in the path leading to it, understanding how the error changes at each point. In the next section I will generalise these results so they are easily extendable to a larger network.</span></div><div class="separator" style="clear: both;"><span style="font-weight: normal;"><br /></span></div></h4><h3 style="clear: both;"><span style="font-weight: normal;">Error propagation</span></h3><span style="font-weight: normal;">We will now generalise the above computation, in order to get the derivative of the loss function with respect to any $z_j$ in any of the layers of the network. We define the error signal $$\delta_j = \frac{\partial L}{\partial z_j}$$ Now, using the chain rule as in the example above we have $$\delta_j = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z_j}$$ Once we specify our loss function $L$ we'll have a nice expression for the first term in the expression above; for now let's keep it generic. Now using the definitions of the $z_i$ and $a_i$ above: $$\frac{\partial \hat{y}}{\partial z_j} = \frac{\partial \hat{y}}{\partial a_j} \times \frac{\partial a_j}{\partial z_j} = \sigma ' (z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} \frac{\partial z_k}{\partial a_j}$$ where the sum is over the neurons $k$ that receive the output of neuron $j$. 
We have $z_k = a_j w_{j \rightarrow k}$ so our expression becomes $$\frac{ \partial \hat{y}}{\partial z_j} = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ $$\implies \delta_j = \sigma '(z_j) \sum_{k \in path(j)} \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_k} w_{j \rightarrow k}$$ Note the first two terms in the summation are actually $\delta_k$ in disguise! Thus we have a recursive definition for our error propagation: $$\delta_j = \sigma ' (z_j) \sum_{k \in path(j)} \delta_k w_{j \rightarrow k}$$</span><br /><span style="font-weight: normal;">Thus if we have $\delta_\text{output}$, we can work our way backwards through the neural network, calculating the error for each neuron as a function of the error of the neurons that its output is connected to. This is the key concept of backpropagation - although it is just an application of the chain rule, because we can <i>propagate</i> the error at each point in the network through subsequent neurons, the calculations can be done on the fly by utilising calculations that have already been performed.</span><br /><span style="font-weight: normal;"><br /></span>Next time we'll describe the architecture of the actual ANN we'll use to classify the <i>make moons</i> example detailed in pt1. 
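As a sanity check, the forward pass and the $\delta$ recursion for the small diamond-shaped network in the worked example can be sketched in a few lines of Python. The weight values, the sigmoid activation and the identity output with a squared loss are my own illustrative assumptions, not from the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Weights of the diamond network in -> i -> {j, k} -> o (arbitrary values)
w_in_i, w_ij, w_ik, w_jo, w_ko = 0.5, -1.2, 0.8, 1.5, -0.3
x, y = 1.0, 0.0  # a single training example

# Forward pass, following the z/a definitions in the text
z_i = w_in_i * x;  a_i = sigmoid(z_i)
z_j = w_ij * a_i;  a_j = sigmoid(z_j)
z_k = w_ik * a_i;  a_k = sigmoid(z_k)
z_o = w_jo * a_j + w_ko * a_k
y_hat = z_o  # identity output, with loss L = 0.5*(y_hat - y)^2

# Backward pass: delta at the output, then the recursion
# delta_j = sigma'(z_j) * sum_k delta_k * w_{j->k}
delta_o = y_hat - y                              # dL/dz_o for the squared loss
delta_j = sigmoid_prime(z_j) * delta_o * w_jo
delta_k = sigmoid_prime(z_k) * delta_o * w_ko
delta_i = sigmoid_prime(z_i) * (delta_j * w_ij + delta_k * w_ik)

grad_w_in_i = delta_i * x  # dL/dw_{in->i}, matching the worked example
```

The gradient produced this way can be checked against a finite-difference approximation of the loss, which is a useful habit when implementing backpropagation by hand.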
To be honest, these concepts didn't fully sink in until the first time I implemented an ANN and worked through the details of the calculations. I suggest you work through this and the next blog post by hand, to ensure you have a handle on how the feedforward and error propagation steps are affected by different topologies of ANNs.<br /><h4><div class="separator" style="clear: both; text-align: left;"><br /></div></h4><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="font-weight: normal;"><br /></span></div><h4><div><span style="font-weight: normal;"><br /></span></div></h4><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/hRCHQZAS0yM" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2016/12/feedforward-artificial-neural-network.htmltag:blogger.com,1999:blog-83055021305463025.post-84993821410149079372016-11-15T02:49:00.001-08:002016-11-16T00:54:39.607-08:00Feedforward Artificial Neural Network pt1<h2>Architecture of an ANN</h2><div>The history of ANNs can be traced back to <span style="background-color: white;">Warren McCulloch and Walter Pitts, who proposed a mechanism for how neurons might work by modelling a simple neural network with an electrical circuit. This line of thinking was reinforced by Donald Hebb, who discovered that neural pathways between neurons were strengthened through repetitive use. 
Understanding the motivation behind ANNs will help us understand why they are structured the way they are, particularly the <a href="https://en.wikipedia.org/wiki/Action_potential">action potential</a> of a neuron.</span></div><div><br /></div><div>An ANN consists of three major components:</div><div><ul><li>Input Layer</li><ul><li>This layer takes the inputs as is - it does <i>not</i> modify the data at all; it pushes it into the hidden layers.</li></ul><li>Hidden Layer(s)</li><ul><li>In line with the aforementioned action potential of a neuron in the brain, the hidden layer neurons take an input, apply an <i>activation function</i> and push the result to the next layer in the network.</li><li>The activation function $\phi : \mathbb{R} \rightarrow \mathbb{R}$ takes $\sum_{i=1}^{n} x_i w_i + b$ as an argument, where $n$ is the number of inputs flowing into the particular hidden layer neuron, $w_{i}$ is the weight associated with input $x_{i}$ and $b$ is called the <i>bias. </i>Our task is to find the $w_{i}$ and $b$ for each neuron in the hidden layer which minimize our error function (more on this later).</li><li>The choice of activation function is entirely up to you, though there are some standard choices for certain tasks. 
See below for a (by no means exhaustive) list of activation functions<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-W_VxmsjN924/WCraxHwquzI/AAAAAAAAACw/vjGSeydKAmUev97AzYXSMhtHv77nlQ_lwCLcB/s1600/main-qimg-01c26eabd976b027e49015428b7fcf01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://3.bp.blogspot.com/-W_VxmsjN924/WCraxHwquzI/AAAAAAAAACw/vjGSeydKAmUev97AzYXSMhtHv77nlQ_lwCLcB/s400/main-qimg-01c26eabd976b027e49015428b7fcf01.png" width="400" /></a></div> </li><li>The activation functions are the key to capturing non-linearities in our decision boundary.</li></ul><li>Output Layer</li><ul><li>The output of the neural network. In our case (classification) it will be a label classifying our input data to one of two classes.</li></ul></ul><div>These three components together with the topology of the network will define the entire structure of our ANN. Below is a stock image of a typical ANN</div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-x8gHSVbUJ1k/WCrN7r6h6tI/AAAAAAAAACg/SSiuimMht3Q8tCOdGGI86jTrWM3HYZWoQCLcB/s1600/structure.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="https://3.bp.blogspot.com/-x8gHSVbUJ1k/WCrN7r6h6tI/AAAAAAAAACg/SSiuimMht3Q8tCOdGGI86jTrWM3HYZWoQCLcB/s320/structure.png" width="320" /></a></div><div>On the left there are 3 inputs which flow into the hidden layers (more on these below) which then in turn flow into the outputs. There are two hidden layers in this setup, each with 4 neurons - these are hyperparameters we can tune to increase performance and reduce overfitting.</div><div><br /></div><div>Note that the flow of the network is strictly left to right: none of the neurons in the hidden layer are connected to one another and none of the neurons in the network loop back into themselves. 
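The left-to-right flow just described can be sketched directly in Python. This is a minimal sketch with NumPy; the tanh activation, the random weights and the 3-4-4-2 layer sizes (matching the picture) are my own illustrative choices:

```python
import numpy as np

def feedforward(x, layers, phi=np.tanh):
    """Push an input left-to-right through the network: each layer
    computes phi(a @ W + b), i.e. phi(sum_i x_i w_i + b) per neuron."""
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = phi(a @ W + b)
    return a

rng = np.random.default_rng(0)
# 3 inputs -> two hidden layers of 4 neurons each -> 2 outputs
shapes = [(3, 4), (4, 4), (4, 2)]
layers = [(rng.normal(size=s), rng.normal(size=s[1])) for s in shapes]
output = feedforward([1.0, -0.5, 2.0], layers)
```

The weights here are random; finding the $w_i$ and $b$ that minimize the error function is the training problem covered in the follow-up posts.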
This single direction of information flow and the fact that there are no loops or cycles in the network define a <i style="font-weight: bold;">Feedforward Artificial Neural Network.</i> Also note that the number of inputs and outputs is fixed once we define the network under this architecture. Other types of neural networks exist that are able to handle inputs of varying length, but our focus for now will be Feedforward ANNs.</div><div><br /></div><h3>But why <i>this</i> structure?</h3><div>The above structure of a neural network may seem like an arbitrary construction - how can we guarantee that it can be trained in such a way as to produce our desired output?</div><div><br /></div><div>Well, luckily for us there exists a theorem called the <i>Universal Approximation Theorem</i> which states that under reasonable circumstances, a feedforward network containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^{n}$.</div><div><br /></div><h4>Statement</h4><div>Let $\phi \left( \cdot \right)$ be a nonconstant, bounded and monotonically-increasing continuous function. Let $I_{m}$ denote the $m$-dimensional unit hypercube $\left[ 0, 1 \right]^{m}$. The space of continuous functions on $I_{m}$ is $C \left( I_{m} \right)$. Then $\forall f \in C \left( I_{m} \right)$ and $ \epsilon > 0$ $\exists N \in \mathbb{Z}$, $v_{i}, b_{i} \in \mathbb{R}$ and $w_i \in \mathbb{R}^{m}$ where $ i = 1, \dots, N$, such that if we define</div><div><br /></div><div>$$F(x) \equiv \sum_{i=1}^{N} v_{i} \phi \left( w_{i}^{T}x + b_{i} \right)$$</div><div>then $$\left| F(x) - f(x) \right| < \epsilon \quad \forall x \in I_{m}$$</div><div><br /></div><div>It is quite fascinating that we can essentially approximate <i>any</i> continuous function to an arbitrary degree using a linear combination of our activation functions. 
Please note that this tells us nothing of actually <i>how</i> to approximate the function; what the parameters $N, v_{i}, b_{i}$ or $w_{i}$ are - only that they exist.</div><div><br /></div><h4>Summary</h4><div>So there it is, a whirlwind introduction to Feedforward Artificial Neural Networks. By no means is this supposed to be a comprehensive introduction, but it should set the scene for the next few blog posts. In my next blog I'll define the error functions and consequently the <i>Backpropagation algorithm</i> which we will use to train our Feedforward ANN to find the weights and biases of our hidden layer neurons.</div><br /><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/_LZldjq8dsU" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2016/11/feedforward-artificial-neural-network.htmltag:blogger.com,1999:blog-83055021305463025.post-29358168228744713132016-11-14T01:30:00.001-08:002016-11-14T01:30:58.472-08:00Back.. again! Artificial neural networks have tickled my fancy<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>I'm really impressed with my foresight when naming this blog all those years ago, it truly does live on the edge of abandon as I haven't touched it in 3 years! I've decided to fire it back up again and will (at least try) to actively post as I delve into the world of statistical / machine learning! I just want a place to document my thoughts/learnings/projects as I delve into this fascinating area. I've decided to use Python as my language of choice as I've always wanted to learn it but never got around to it or had the chance. Now I have no excuses!<br /><br />So, without further ado...<br /><br /><h3><b>Artificial Neural Networks</b></h3>I'm going to assume <i>some </i>familiarity with ANNs - what they are, their general structure/philosophy. 
See <a href="https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html">here</a> for a comprehensive introduction.<br />As a first pass, I'll introduce a motivating example, go through the math and then implement (in Python) an ANN from scratch.<br /><br /><h3><b>Definition of the problem</b></h3>We'll look at a classification problem in 2 dimensions (to make visualisation easier, read: possible) where our old friend Logistic Regression struggles. It will highlight a key fact that needs to be understood in any sort of machine learning / statistical problem; understanding when and what tools to use to attack a problem.<br /><br /><h4>The data:</h4><div>We'll use a toy dataset from sklearn called <i>make_moons</i>, with the following call:<br /><br /></div><div style="text-align: center;"><div style="text-align: left;"><code class="prettify">import pandas as pd</code><br /><code class="prettify">import sklearn.datasets</code><br /><code class="prettify">import seaborn as sb</code><br /><code class="prettify">import sklearn.linear_model</code><br /><code class="prettify"><br /></code><code class="prettify">X,y = sklearn.datasets.make_moons(200, noise=0.2) </code><br /><code class="prettify">df = pd.DataFrame() </code><br /><code class="prettify">df['x1'] = X[:,0] </code><br /><code class="prettify">df['x2'] = X[:,1] </code><br /><code class="prettify">df['z'] = y</code><br /><code class="prettify">sb.lmplot('x1','x2',hue='z', data=df, fit_reg = False) </code></div></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-O6upYjgURe0/WCl_GrW8CoI/AAAAAAAAACM/6LLKw-0VaA49IQVYfEy2gx8XMdfvZi_2gCLcB/s1600/make_moons.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="309" src="https://3.bp.blogspot.com/-O6upYjgURe0/WCl_GrW8CoI/AAAAAAAAACM/6LLKw-0VaA49IQVYfEy2gx8XMdfvZi_2gCLcB/s320/make_moons.png" width="320" /></a></div><div class="separator" style="clear: 
both; text-align: center;"><br /></div><br /><h4><b>The task:</b></h4><div>We want to calculate the decision boundary between the two (blue and green) classes. At a glance it's clearly non-linear, so let's see how a Logistic Regression classification goes!<br /><h4><b>Results:</b></h4><div>We can see that the Logistic Regression can only fit a linear decision boundary (see below) - even though the actual decision boundary is decidedly non-linear.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-ZZcOgt_Cif0/WCl_YvvJ1xI/AAAAAAAAACQ/IlzX-4pZnoIgB3-ByVdS3ROliscJtAYTwCLcB/s1600/LogReg_boundary.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="309" src="https://4.bp.blogspot.com/-ZZcOgt_Cif0/WCl_YvvJ1xI/AAAAAAAAACQ/IlzX-4pZnoIgB3-ByVdS3ROliscJtAYTwCLcB/s320/LogReg_boundary.png" width="320" /></a></div><div><b><br /></b></div><br /></div>We can see why the Logistic Regression decision boundary is linear from the following:<br />$$ P_{boundary} \equiv \frac{1}{1+e^{- \theta \cdot x}} = 0.5 $$<br />$$ \implies 1 = e^{- \theta \cdot x}$$<br />$$ \implies \theta \cdot x = 0 $$ which defines a plane (line in 2 dimensions).<br /><br />In part 2 I will define an ANN that we will work through to establish exactly how it works and how it is trained. 
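As an aside, the linearity of the boundary is easy to verify from scratch. The sketch below (my own toy data, learning rate and iteration count, not the seaborn/sklearn code above) fits the same sigmoid model by plain gradient descent:

```python
import numpy as np

# Two easily separable clusters (toy data, labels 0 and 1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

Xb = np.hstack([X, np.ones((200, 1))])  # append a bias column to X
theta = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))   # P(class 1 | x)
    theta -= 0.1 * Xb.T @ (p - y) / len(y)  # gradient step on the log-loss

# Whatever theta we end up with, the boundary P = 0.5 is theta . x = 0: a line.
```

No matter how long we train, the model can only ever move and rotate that line, which is exactly why it fails on the moons data.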
We will then construct and implement it in Python to get an idea of the decision boundaries it can produce.<img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/aLHzFV1d-8A" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2016/11/back-again-artificial-neural-networks.htmltag:blogger.com,1999:blog-83055021305463025.post-45399973386989199422013-08-25T01:05:00.000-07:002013-09-04T01:44:56.757-07:00(Weak) Law of Large Numbers and the Central Limit TheoremFollowing the definitions of <i>convergence in probability </i>and <i>convergence in distribution</i> in the previous post - we now state two fundamental theorems which incorporate these concepts.<br /><br /><b>(Weak) Law of Large Numbers:</b><br />Let $\left\{X_i \right\}$ be a sequence of iid (independent and identically distributed) random variables with $\mathbb{E}\left[X_i\right] = \mu \lt \infty$ and $\text{Var}\left(X_i \right) = \sigma^2 \lt \infty$. 
Now define $$S_n:=\sum_{i=1}^{n} X_i$$ then $$\frac{S_n}{n}\rightarrow \mu$$ in probability as $n\rightarrow \infty$.<br /><br /><i><b>Proof:</b></i><br />Since the random variables are all iid, $$\mathbb{E}\left[\frac{S_n}{n} \right] = \frac{1}{n} \mathbb{E}\left[S_n\right] = \frac{n \mu}{n} = \mu$$ Similarly $$\text{Var}\left( \frac{S_n}{n} \right) = \frac{1}{n^2} \left(\mathbb{E}[S_n^2]- \mathbb{E}[S_n]^2 \right)=\frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}$$ We now call upon <a href="http://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Probabilistic_statement">Chebyshev's Inequality</a>, which states that if $X$ is a random variable with finite expectation $\mu$ and non-zero variance $\sigma^2$ then for any $k > 0$ $$P\left( \left|X-\mu \right| \ge k \sigma \right) \le \frac{1}{k^2}$$It gives a bound on how far the distribution strays from its mean in terms of the variance - and it is completely distribution agnostic.<br /><br />We can now apply Chebyshev's Inequality to $X=\frac{S_n}{n}$, which has mean $\mu$ and variance $\frac{\sigma^2}{n}$; setting $k \frac{\sigma}{\sqrt{n}} = \epsilon$ gives<br />$$P\left( \left|\frac{S_n}{n}-\mu \right| \ge \epsilon \right) \le \frac{\sigma^2}{n \epsilon^2}$$Hence<br />$$ P\left( \left|\frac{S_n}{n}-\mu \right| \ge \epsilon \right) \rightarrow 0$$ as $n \rightarrow \infty$, which is precisely the definition of convergence in probability.<br /><br /><b>Central Limit Theorem (CLT):</b><br />We shall state a slightly restricted version of the CLT which is fine for illustrative purposes. Taking our $S_n$ from above, the CLT states:<br />Let $\left\{X_1, X_2, . . . \right\}$ be a sequence of i.i.d random variables with finite expectation $\mathbb{E}[X_i] = \mu < \infty$ and finite non-zero variance $\text{Var}\left(X_i\right) = \sigma^2 < \infty$. Then<br /><br />$$ P \left( \frac{S_n-n \mu}{\sigma \sqrt{n}} \leq x \right) \rightarrow \Phi(x)$$<br />as $n \rightarrow \infty$, where the convergence is <i>in distribution</i> and $\Phi(x)$ is the cumulative distribution function of a Standard Normal variable. 
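The Law of Large Numbers is easy to see numerically. A quick Python sketch (uniform variables with $\mu = 0.5$, and the sample sizes, are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_mean(n):
    """One realisation of S_n / n for iid Uniform(0, 1) variables."""
    return rng.uniform(0, 1, n).mean()

# The spread of S_n/n around mu = 0.5 shrinks as n grows, at the
# sigma^2 / (n epsilon^2) rate promised by Chebyshev's bound.
means_small = np.array([sample_mean(10) for _ in range(1000)])
means_large = np.array([sample_mean(10000) for _ in range(1000)])
```

Plotting a histogram of either array also previews the CLT: the sample means are not just concentrating, they look increasingly bell-shaped.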
We shall not prove the CLT here; an understanding of what it says and how it can be applied is all that is required.<br /><br />This explains why Normal distributions are so common in modelling (and even nature): under these mild conditions, regardless of the underlying distribution, such a normalised sum of random variables tends to the normal distribution for large enough $n$.<img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/DhUVWi-5F7I" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2013/08/weak-law-of-large-numbers-and-central.htmltag:blogger.com,1999:blog-83055021305463025.post-37855532647735446002013-08-22T02:23:00.000-07:002013-08-22T02:23:31.454-07:00Convergence of Random Variables<br /><br />Before moving on to scaling a random walk into a Wiener process, we must understand the notion of <i>convergence </i>of random variables.<br /><br />We won't cover the notions of <i>uniform </i>and <i>pointwise</i> convergence of a sequence of functions here - these are covered in any undergraduate analysis class. We will delve right into <i>Convergence of random variables.</i><br /><br /><b>Convergence in probability:</b><br />Let $\left\{X_n\right\}$ be a sequence of random variables and $X$ be a random variable. 
$\left\{X_n\right\}$ is said to<i> converge in probability </i>to $X$ if $\forall \epsilon \gt 0$<br />$$P\left(\left| X_n-X \right| \ge \epsilon \right) \rightarrow 0$$ as $n \rightarrow \infty$.<br /><br />Another notion of convergence of random variables is <i>Convergence in distribution.</i><br /><br /><b>Convergence in distribution:</b><br />Given random variables $\{X_1, X_2, ...\}$ and the cumulative distribution functions $F_i(x)$ of $X_i$ then we say that $X_n$ <i>converges in distribution </i>to $X$ if<br />$$\lim_{n \rightarrow \infty} F_n(x) = F(x)$$ for every $x \in \mathbb{R}$ at which $F(x)$ is continuous.<br /><br />Note that only the<i> distributions</i> of the variables matter - so the random variables themselves could in fact be defined on different probability spaces.<br /><br />These may seem like fairly elementary definitions, but they hold great importance in applications of probability, namely through the <i>(Weak) Law of Large numbers </i>and the <i>Central Limit Theorem - </i>which will be covered in the next post.<img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/pfLc07P47i0" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2013/08/convergence-of-random-variables.htmltag:blogger.com,1999:blog-83055021305463025.post-14492530062636434992013-08-19T04:11:00.002-07:002013-08-22T02:25:33.807-07:00I've returned with a random walkSo after two years absent from blogging, I am now back with a rejuvenated interest in mathematics, more specifically financial mathematics. So from now on, this blog will serve as my workbook as I endeavour to learn more about portfolio valuation, solving stochastic differential equations and other such fun. Such posts will not be rigorous in the mathematical sense, but will serve as a heuristic survey of the different tools available and how they are used in a financial context. 
The posts may also seem scattered content wise as I really haven't planned a syllabus - I'm just learning things as I come across them.<br /><br />With that in mind - I will start with the very basics; introducing my first blog in over two years, I give a quick overview of Random Walks in $\mathbb{Z}$<br /><br />Let $X_0 := 0$ be our starting position and for each $i \in \mathbb{N}$ let $$X_i:=X_{i-1}+dX_i$$ where $dX_i = \pm 1$ with equal probability - i.e. it is a Bernoulli-type random variable with $p=q=\frac{1}{2}$. Therefore our position at step $N$ is just the sum of the <i>independent </i>increments;<br />$$X_N= \sum_{i=1}^{N} dX_i$$.<br />We can calculate the expectation: $$\mathbb{E}[X_N]= \mathbb{E}\left[ \sum_{i=1}^{N} dX_i\right]= \sum_{i=1}^{N} \mathbb{E} [ dX_i]=0$$ by linearity of expectation. Now<br />$$Var(X_N)=\mathbb{E}[X_N^2]- \mathbb{E}[X_N]^2$$ and since $dX_i$ and $dX_j$ are independent for $i \neq j$ (so the cross terms vanish in expectation) we can calculate $$\mathbb{E}[X_N^2] = \sum_{i=1}^N \mathbb{E}[dX_i^2] = N$$That is, the variance of $X_N$ is the squared jump size times the total steps $N$.<br />Below I have simulated the aforementioned random walk for $N=100$ and $N=100000$ respectively. 
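The simulations below are easy to reproduce; a sketch in Python (NumPy and the seed are my own choices, so the exact paths will differ from the plots):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_walk(n):
    """X_0 = 0, then X_i = X_{i-1} + dX_i with dX_i = +/-1, p = q = 1/2."""
    steps = rng.choice([-1, 1], size=n)
    return np.concatenate([[0], np.cumsum(steps)])

walk = random_walk(100)

# Checking Var(X_N) = N empirically over many independent walks:
endpoints = np.array([random_walk(100)[-1] for _ in range(2000)])
```

The empirical variance of the endpoints comes out close to $N = 100$, matching the calculation above.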
<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-dd7crFsU0hE/UhH5xMP2lwI/AAAAAAAAABE/1XPVvT3OhpM/s1600/n100.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="http://3.bp.blogspot.com/-dd7crFsU0hE/UhH5xMP2lwI/AAAAAAAAABE/1XPVvT3OhpM/s320/n100.jpg" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-337Paeu1bSo/UhH5xP-naII/AAAAAAAAABI/vzhdkUi4KIE/s1600/n100000.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="http://3.bp.blogspot.com/-337Paeu1bSo/UhH5xP-naII/AAAAAAAAABI/vzhdkUi4KIE/s320/n100000.jpg" width="320" /> </a></div><div class="separator" style="clear: both; text-align: left;">Next we'll look at a specific scaling of these random walks to obtain a <i>Wiener Process </i>and eventually talk about Stochastic Integrals and Ito's lemma.</div><br /><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/QluMgdjl9ow" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com0http://edgeofabandon.blogspot.com/2013/08/ive-returned-with-random-walk.htmltag:blogger.com,1999:blog-83055021305463025.post-12705165890040604512011-03-24T03:21:00.000-07:002011-03-24T03:21:54.803-07:00I've been missing.Ok so since my university work has ramped up - I've had little time for blogging/let alone myself; thus the absence of <i>interesting </i>updates.<br /><br />Hopefully, when my assignments are done I shall devote my time to making some more thought provoking posts.<br /><br />But for now I'll leave you to ponder <i>Russell's Paradox</i><br /><i><br /></i><br /><br /><span class="Apple-style-span" style="font-style: italic;"><b>Pre-reading:</b> </span>The following assumes that you are all familiar with a set; if not, then put simply:<br /><div style="text-align: center;">A set is a 
collection of objects.</div><div style="text-align: left;">Generally there will be a rule discerning whether an object is included in a set or not. For example:</div><div style="text-align: left;">A set of all shoe brands does not have Colgate as a member (since Colgate is a toothpaste brand).</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The members of a set can also be sets themselves, for example:</div><div style="text-align: left;">The set of all forks is a subset of the set of all cutlery.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Now we make a proposal:</div><i>Let F be the set of all forks. F itself is not a fork, therefore F does not contain itself as a member.</i><br /><i>We call F a <b>normal </b>set.</i><br /><i><br /></i><br /><i>Now suppose we have a set NF - this is the set of all things that are <b>not</b> forks. NF itself is not a fork, therefore it contains itself as a member. We can call NF an <b>abnormal</b> set.</i><br /><i><br /></i><br /><i>Suppose we have a new set <b>G</b> - the set of <b>all normal sets</b>. Is G normal or abnormal? If <b>G</b> is <b>normal</b>, it does not contain itself - but G contains every normal set, so it must contain itself, making it abnormal. If instead <b>G</b> is <b>abnormal</b>, it contains itself - but every member of G is normal, so G must be normal. Either way we have a contradiction.</i><br /><i><br /></i><br /><i>This is known as <b>Russell's Paradox.</b></i><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/DD9NZ9TB4Yg" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com6http://edgeofabandon.blogspot.com/2011/03/ive-been-missing.htmltag:blogger.com,1999:blog-83055021305463025.post-9798155292867245042011-03-09T18:24:00.000-08:002013-08-15T06:02:07.028-07:00Integral of a Gaussian (continued)Ok, so last post we derived (not so rigorously) the integral of a standard Gaussian;<br /><div style="text-align: center;">$$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$$</div><div style="text-align: center;"><br 
/></div><div style="text-align: center;">and using a substitution of </div><div style="text-align: center;"> <br />$$x=\sqrt{a}y$$ and $$dx=\sqrt{a}dy$$</div><div style="text-align: center;"> we obtain;$$\int_{-\infty}^{\infty} e^{-ax^2} dx = \sqrt{\frac{\pi}{a}}$$.</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Now let's extend that to the more general form of;</div><div style="text-align: center;">$$I=\int_{-\infty}^{\infty} x^2 e^{-ax^2} dx $$ where $a$ is a positive constant.</div><div style="text-align: center;"><br /></div><div style="text-align: center;">We can recognise (or at least be told to recognise) that the integrand $$x^2 e^{-ax^2}$$ is equal to</div><div style="text-align: center;"> $$-\frac{\partial}{\partial a} e^{-ax^2}$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Putting this together, we get;</div><div style="text-align: center;">$$I=\int_{-\infty}^{\infty}-\frac{\partial}{\partial a} e^{-ax^2} dx$$</div><div style="text-align: center;">We can swap the partial derivative and integration sign (provided the variable we are integrating over is independent of $a$ and the integrand is continuous)</div><div style="text-align: center;"> </div><div style="text-align: center;">We get; $$I= -\frac{\partial}{\partial a} \int_{-\infty}^{\infty} e^{-ax^2} dx = -\frac{\partial}{\partial a}\sqrt{\frac{\pi}{a}}$$</div><div style="text-align: center;">$$=\frac{1}{2a} \sqrt{\frac{\pi}{a}}$$ </div><div style="text-align: center;">Therefore; $$\int_{-\infty}^{\infty} x^2 e^{-ax^2} dx =\frac{1}{2a} \sqrt{\frac{\pi}{a}}$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">This technique can be extended to higher powers of $x$ - provided they are even. 
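As a quick sanity check of the closed form above, we can compare it against numerical integration - a minimal sketch assuming SciPy is available; the value $a=2$ and the function name are arbitrary choices of mine:

```python
# Numerical check that the integral of x^2 * exp(-a x^2) over the real line
# equals (1/(2a)) * sqrt(pi/a).  The constant a = 2.0 is an arbitrary choice.
import math
from scipy.integrate import quad

def moment_integral(a):
    """Numerically integrate x^2 * exp(-a x^2) from -inf to inf."""
    value, _error = quad(lambda x: x**2 * math.exp(-a * x**2), -math.inf, math.inf)
    return value

a = 2.0
closed_form = (1.0 / (2.0 * a)) * math.sqrt(math.pi / a)
print(moment_integral(a), closed_form)  # the two agree to many decimal places
```

The same check works for the even higher moments obtained by differentiating with respect to $a$ again.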
</div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/pafVyq8ViB0" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com13http://edgeofabandon.blogspot.com/2011/03/integral-of-gaussian-continued.htmltag:blogger.com,1999:blog-83055021305463025.post-85034663765463472482011-03-08T22:12:00.000-08:002011-03-08T22:40:56.831-08:00Evaluating the integral of a GaussianOkay, so I thought I'd take a step back from the quite technical blog about finding the <a href="http://edgeofabandon.blogspot.com/2011/03/volume-of-n-sphere.html">Volume of an n-sphere</a> and look at some of the (often overlooked) techniques that were used.<br /><br />First is evaluating the following integral of a Gaussian, which occurs very commonly in branches of physics and engineering:<br /><br /><div style="text-align: center;">$$I=\int_{-\infty}^{\infty} e^{-x^2} dx$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Let's now have a look at what $$I^2$$ looks like;</div><div style="text-align: center;">$$I^2=\left(\int_{-\infty}^{\infty} e^{-x^2} dx\right)^2=\int_{-\infty}^{\infty} e^{-x^2}dx \cdot \int_{-\infty}^{\infty} e^{-x^2}dx$$</div><div style="text-align: center;">Now, using a dummy variable <i>y </i>in the second integral, we get;</div><div style="text-align: center;">$$I^2=\int_{-\infty}^{\infty} e^{-x^2}dx \cdot \int_{-\infty}^{\infty} e^{-y^2}dy$$</div><div style="text-align: center;">Note: We can use dummy variables since they are exactly that: a variable which is only integrated over and does not appear in the evaluated integral. 
</div><div style="text-align: center;">Since the integrals are independent, we can combine them into a double integral and bring the exponents together:</div><div style="text-align: center;">$$I^2=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)}dxdy$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Now we use a change of variable from Cartesian coordinates to polar with;</div><div style="text-align: center;">$$x=r\cos(\theta)$$ and<br />$$y=r\sin(\theta)$$</div><div style="text-align: center;">The determinant of the Jacobian matrix is $r$, so the area element becomes $$r \, dr \, d\theta$$ - noting also the change of terminals of integration.</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Hence $$I^2=\int_{0}^{2\pi}\int_{0}^{\infty} e^{-r^2} r dr d\theta$$</div><div style="text-align: center;">Since nothing in the integrand depends on $\theta$, the $\theta$ integral simply contributes a factor of $2\pi$.</div><div style="text-align: center;">Now, letting $$u=r^2$$ we get $$\frac{du}{dr}=2r$$ so $$dr=\frac{du}{2r}$$</div><div style="text-align: center;">We get $$I^2= 2\pi \int_{u=0}^{u=\infty} e^{-u} r \frac{du}{2r}$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;"><br /></div><div style="text-align: center;">Finally; $$I^2= \frac{2\pi}{2} \left[-e^{-u}\right]_{u=0}^{u=\infty}=\pi \left[-\left(e^{-\infty}-e^0\right)\right]=\pi$$</div><div style="text-align: center;"><br />Hence $$I^2=\pi \rightarrow I=\sqrt{\pi}$$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">Note: This isn't really a <i>strict </i>or <i>rigorous</i> derivation - as I just dealt with the improper integrals without taking limits and checking their convergence. 
This was intended just to get the general idea across.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">To be continued...</div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/pbxRkmxKhgg" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com5http://edgeofabandon.blogspot.com/2011/03/evaluating-integral-of-gaussian.htmltag:blogger.com,1999:blog-83055021305463025.post-54050523206334563622011-03-07T22:52:00.000-08:002011-03-08T22:41:10.430-08:00The Volume of an n-sphereHere I present a (fairly common) derivation for the volume of an n-sphere.<br />Where an n-sphere is defined by<br /><div style="text-align: center;">$S^n=\left\{x \in \mathbb{R}^{n+1}: \|x\|=r\right\}$</div><br />We first start by calling on what we already know;<br /><br />$V_{1}=2r$ for $S^0$, $V_{2}=\pi r^2$ for $S^{1}$ and $V_{3}=\frac{4}{3} \pi r^3$ for $S^2$. The latter two are most easily obtained via the usual methods of integrating in polar and spherical coordinates, respectively.<br /><br />By inspection - and intuition - we should think that the volume in <i>m dimensions</i> is proportional to $r^m$, i.e. some constant times $r$ to the power of the dimension we are in.<br />Therefore we should expect that the formula for the volume of a sphere in <i>m-dimensions</i> looks a little like<br /><div style="text-align: center;">$V_{m}=\nu_{m} r^m$</div><div style="text-align: left;">where $\nu_{m}$ is the constant of proportionality that we are after.</div><div style="text-align: left;">Hence the volume element (in <i>m-dimensions </i>and spherical coordinates<i>)</i> is given by; $dV_{m}= m \nu_{m} r^{m-1} dr$</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Now this seems like a leap of faith, or unjustified - but let's take a look at some Gaussian Integrals;</div><div style="text-align: left;"><br /></div><div style="text-align: center;">$$I_m=\int \! 
e^{-\left(x_{1}^2+...+x_m^{2}\right)} dx_1 ... dx_m$$ where the integral is over all of $\mathbb{R}^m$.</div><div style="text-align: left;">From our knowledge of a standard Gaussian Integral;</div><div style="text-align: center;">$I=\int_{- \infty}^{\infty} \! e^{-x^2} dx=\sqrt{\pi}$</div><div style="text-align: center;">and recognising the fact that the $x_i$ are independent, we can factorise the integral;</div><div style="text-align: center;">$$I_m=\int \! e^{-x_1^2} dx_1...\int \! e^{-x_m^2} dx_m = I^m $$</div><div style="text-align: center;">Therefore $$I_m=\sqrt{\pi}^m = \pi^\frac{m}{2}$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">Now, transforming into <i>m-dimensional </i>spherical coordinates so that;</div><div style="text-align: center;">$$x_1^2+...+x_m^2=r^2$$</div><div style="text-align: center;">Which gives $$I_m=\int \! e^{-r^2} dV_{m} $$</div><div style="text-align: center;">and using our definition for the spherical volume element we got before;</div><div style="text-align: center;">$$\int_{0}^{\infty} \! e^{-r^2} m\nu_m r^{m-1} dr=\pi^\frac{m}{2}$$</div><div style="text-align: center;">Taking out the variables which aren't dependent on <i>r</i> from the integrand we get;</div><div style="text-align: center;">$$I_m=m\nu_m \int_{0}^{\infty} \! 
e^{-r^2} r^{m-1} dr =\pi^\frac{m}{2}$$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">You math geeks out there should recognise the form of the integral as somewhat close to the Gamma function;</div><div style="text-align: center;">$$\Gamma(z+1)= \int_{0}^{\infty} u^{z}e^{-u} du $$</div><div style="text-align: center;">Now, substituting $$u=r^2$$ then $$\frac{du}{dr}=2r$$ or in terms of u; $$\frac{du}{dr}=2\sqrt{u}$$<br />$$\pi^{\frac{m}{2}}=\nu_m m \int_{0}^{\infty} e^{-u}\left(\sqrt{u}\right)^{m-1}\frac{1}{2\sqrt{u}} du$$</div><div style="text-align: center;"><br />$$\pi^{\frac{m}{2}}=\frac{\nu_m m}{2} \int_{0}^{\infty}e^{-u} \left(\sqrt{u}\right)^{m-2} du$$<br />$$=\frac{\nu_m m}{2} \int_{0}^{\infty}e^{-u} u^{\frac{m}{2}-1} du$$<br />Which we can almost identify with the above form of the Gamma function;<br /><br />A simple substitution provides us with;<br />$$\frac{m}{2}-1=z \rightarrow \frac{m}{2}=z+1$$<br />Hence $$\Gamma(z+1)= \int_{0}^{\infty} u^{z}e^{-u} du $$ becomes;<br />$$\Gamma(\frac{m}{2})=\int_{0}^{\infty} u^{\frac{m}{2}-1} e^{-u} du$$<br /><br /><br />Therefore; $$\pi^{\frac{m}{2}}=\frac{ \nu_m m \Gamma(\frac{m}{2}) }{2}$$<br /><br /></div><div style="text-align: center;">So putting it all together, we recover our unknown coefficient $\nu_m$;</div><div style="text-align: center;">$$\nu_m= \frac{\pi^\frac{m}{2}}{\frac{m}{2}\Gamma(\frac{m}{2})}$$</div><br />Or you can obtain an equivalent formula by looking at $$\Gamma(\frac{m}{2}+1)= \int_{0}^{\infty} u^{\frac{m}{2}} e^{-u} du $$ and integrating by parts to recognise that;<br /><div style="text-align: center;">$$\Gamma(\frac{m}{2}+1)=\frac{m}{2}\Gamma(\frac{m}{2})$$</div><div style="text-align: center;">Hence putting it all together, we recover the <i>Volume of an n-dimensional sphere of radius r as;</i></div><div style="text-align: center;"><br /></div><div style="text-align: center;">$$V_m=\frac{\pi^{\frac{m}{2}} r^m}{\frac{m}{2} \Gamma(\frac{m}{2})}$$</div><div
style="text-align: center;"></div><div style="text-align: center;"><br /></div><div style="text-align: center;">= $$\frac{\pi^\frac{m}{2} r^m}{\Gamma(\frac{m}{2}+1)}$$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">Having done all that, it could be interesting to see how the volume of the unit n-sphere changes as we increase the number of dimensions;</div><div style="text-align: center;">In dimension 2; $$V=\pi$$</div><div style="text-align: center;">In dimension 3; $$V=\frac{4}{3}\pi$$</div><div style="text-align: center;">.</div><div style="text-align: center;">.</div><div style="text-align: center;">.</div><div style="text-align: center;">and so on.</div><div style="text-align: left;">Below is a plot of volume vs dimension for the unit sphere - constructed using the volume formula we obtained above;</div><div class="separator" style="clear: both; text-align: center;"></div><div style="text-align: left;"> </div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh4.googleusercontent.com/-C1Pqtyh25M4/TXXP0YRBSpI/AAAAAAAAAAQ/7-KfGQVlRMY/s1600/nsphere.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://lh4.googleusercontent.com/-C1Pqtyh25M4/TXXP0YRBSpI/AAAAAAAAAAQ/7-KfGQVlRMY/s400/nsphere.jpeg" width="400" /></a></div><div style="text-align: left;"><br />I find it quite amazing that there is a maximum! That is, there is a point (around $n=5$) beyond which adding dimensions to the sphere <i>decreases</i> its volume!</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Final Word:</b></div><div style="text-align: left;">It may not seem like there are any real-world applications for the use of n-dimensional spheres, but there actually are! 
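The maximum visible in the plot can also be checked directly from the formula - a minimal sketch using only Python's standard-library math.gamma; the function name unit_ball_volume is my own:

```python
# V_n = pi^(n/2) / Gamma(n/2 + 1): volume of the unit sphere in n dimensions,
# as derived above, tabulated to locate the dimension of maximum volume.
import math

def unit_ball_volume(n):
    """Volume enclosed by the unit sphere in n dimensions."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 21)}
peak = max(volumes, key=volumes.get)
print(peak, volumes[peak])  # prints 5 and roughly 5.2638: the maximum is at n = 5
```

Past $n=5$ the values fall monotonically toward zero, matching the plot.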
In statistical mechanics, a system of N independent particles may be confined to the surface or within the volume of an N-dimensional sphere in phase space (due to energy considerations) - where here we don't think of N as spatial dimensions, but <i>degrees of freedom. </i>So when N becomes large (as it does in the thermodynamic limit) we are required to use the above formula, and it is just as well that the volume tends to zero as N becomes large; otherwise the thermodynamic properties of our ensemble (which in general depend on the statistical behaviour of the particles) would become infinite - an indication that something has gone wrong in the math!</div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/-bnQNiJSVRc" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com15http://edgeofabandon.blogspot.com/2011/03/volume-of-n-sphere.htmltag:blogger.com,1999:blog-83055021305463025.post-61079577623392887382011-03-04T04:30:00.000-08:002011-03-08T22:27:50.532-08:00Introduction to the blog worldThis is my first ever blog! 
Hello!<br /><br />I am currently enrolled in a Masters of Mathematics and Statistics - majoring in Applied Maths - and I intend to use this as a place where I can use $\LaTeX$<br />and write up notes and workings-out from any problems that I'm working on.<br /><br />This blog will also serve as a place to document some incoherent ramblings that fill my head from time to time.<br /><br />But currently it will mostly be a playground for me to get familiar with $$\LaTeX$$.<br /><br /><div style="text-align: center;"><b>For Example;</b></div><div style="text-align: center;">$$\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}dt$$</div><img src="http://feeds.feedburner.com/~r/blogspot/GjNcQ/~4/5m-afl7yT5M" height="1" width="1" alt=""/>Incoherent Ramblingshttp://www.blogger.com/profile/16873840468463648925noreply@blogger.com3http://edgeofabandon.blogspot.com/2011/03/introduction-to-blog-world.html