<h1 id="jupyter-environment-management-for-dummies">Jupyter Environment Management for Dummies</h1>
<p><em>Of Penguins and Coffee, Andrew Bolster, 2024-01-17, <a href="https://andrewbolster.info/2024/01/jupyter-environment-management-for-dummies">andrewbolster.info</a></em></p>
<p>This is another one of those “I kept googling the same thing over and over again” things that needed a post, except this time I made an <a href="https://github.com/andrewbolster/andrewbolster.github.io/issues/8">issue to make a post</a> and then started to repeatedly refer to that.</p>
<h2 id="tldr">TL;DR</h2>
<p>When you want to spin up an experimental environment and get it tied in to your Jupyter environment of choice (I actually quite like <a href="https://github.com/jupyterlab/jupyterlab-desktop">JupyterLab Desktop these days…</a>), you need two steps.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create the conda environment with $NAME and ipykernel as the main dependency</span>
conda create <span class="nt">--name</span> <span class="nv">$NAME</span> ipykernel <span class="o">&&</span> conda activate <span class="nv">$NAME</span>
<span class="c"># Once you're in the new environment, register it in the _global_ kernel list (there may be a better `kernelspec` way to do this but I haven't found it yet.)</span>
ipython kernel <span class="nb">install</span> <span class="nt">--name</span><span class="o">=</span><span class="nv">$CONDA_DEFAULT_ENV</span> <span class="nt">--user</span>
<span class="c"># ... do real work here ...</span>
<span class="c"># If you messed up and need to nuke it from orbit without wiping out the rest of your env;</span>
jupyter kernelspec uninstall <span class="nv">$NAME</span>
</code></pre></div></div>
<p>BTW it can sometimes take a few minutes (and a few interactions) to coax JupyterLab into identifying the new kernel.</p>
<h2 id="aside-on-kernel-names">Aside on Kernel Names</h2>
<p>There’s also a trick for getting the current kernel name in a Notebook that doesn’t abuse any magic functions;</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">kernel_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"/bin/python"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
</code></pre></div></div>
<p>But what <em>does</em> abuse the magics is then using that in-kernel Python variable to do the <em>out</em>-of-kernel invocation of the ‘correct’ install incantation;</p>
<p><strong>Reminder</strong>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">!command</code> runs the command in the shell of the JupyterLab runner (so you <em>can</em> install Jupyter extensions etc. from a running notebook.)
<ul>
<li>e.g. <code class="language-plaintext highlighter-rouge">!which python</code>, which gives you the system python of the runtime the <em>jupyterlab runner</em> is executing in, not the notebook kernel</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">%command</code> runs one of a small set of <a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html">magic commands</a> for interacting with the <em>notebook</em> kernel
<ul>
<li>e.g. <code class="language-plaintext highlighter-rouge">%conda install -y -q <requirement></code> or <code class="language-plaintext highlighter-rouge">%pip install -q <requirement></code> (<code class="language-plaintext highlighter-rouge">pip</code> has no <code class="language-plaintext highlighter-rouge">-y</code> flag) <strong>DON’T COPY PASTE THESE YET</strong></li>
</ul>
</li>
</ul>
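<p>One way to sidestep that runner/kernel mismatch (a sketch, not gospel) is to build the install command from the kernel’s own interpreter path, since <code class="language-plaintext highlighter-rouge">sys.executable</code> always points at the Python the notebook is actually running on:</p>

```python
import sys

# sys.executable is the *kernel's* Python, so a shell command built from it
# installs into the kernel's environment, not the JupyterLab runner's.
pip_cmd = f"{sys.executable} -m pip install -q <requirement>"
print(pip_cmd)
# In a notebook cell you'd run it with the ! shell escape:
#   !{sys.executable} -m pip install -q <requirement>
```

<p>(<code class="language-plaintext highlighter-rouge"><requirement></code> is a placeholder; swap in a real package name before running it.)</p>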
<p><em>However</em> some environment setups (particularly <code class="language-plaintext highlighter-rouge">conda</code> related ones) cause a bit of mayhem; for instance, when you run the <code class="language-plaintext highlighter-rouge">%conda env list</code> magic, you may get a response back with a <code class="language-plaintext highlighter-rouge">*</code> beside the ‘base’ environment, which is wrong given we’re in the notebook kernel we just created above.</p>
<p>To be <em>sure</em> that you’re actually installing things in the right environment, this works;</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">kernel_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"/bin/python"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">conda</span> <span class="n">install</span> <span class="o">-</span><span class="n">y</span> <span class="o">-</span><span class="n">n</span> <span class="err">$</span><span class="n">kernel_name</span> <span class="n">magicalpackagenamethatsdefinitelynotahallucination</span>
</code></pre></div></div>
<h1 id="pulling-election-count-data-out-of-google-sheets-for-fun-and-democracy">Pulling Election Count data out of Google Sheets for fun and democracy</h1>
<p><em>2023-05-18, <a href="https://andrewbolster.info/2023/05/pulling-election-count-data-out-of-google-sheets-for-fun-and-democracy">andrewbolster.info</a></em></p>
<h1 id="messing-around-with-elections-ni-data">Messing around with Elections NI data</h1>
<p>Sources:</p>
<ul>
<li><a href="https://docs.google.com/spreadsheets/d/11o0rbI-NVcPJhkBZxInN4qn55rm0TCJT9ERitaiN1fk/edit?usp=sharing">Live Data (for 2023)</a></li>
<li><a href="https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=264660014">2022 Assembly Elections</a></li>
</ul>
<h2 id="creating-your-own-google-sheet-and-referencing-the-crowdsourced-data">Creating your own Google Sheet and referencing the crowdsourced data</h2>
<p>The above-linked spreadsheets are naturally not editable by everyone; this is great for reliable data but isn’t so great when you want to make pretty graphs.</p>
<p>Google Sheets supports the live referencing of external sheets in your own sheets, so you can ‘import’ the data from the read-only sheets as they evolve over the count, and then reference those data in your own visualisations.</p>
<p>This is done using the <a href="https://support.google.com/docs/answer/3093340?hl=en&ref_topic=9199554&sjid=4812370886125323989-EU">IMPORTRANGE</a> function in Google Sheets, so like this;</p>
<p><code class="language-plaintext highlighter-rouge">=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=372848906","Belfast South/West!A2:Q24")</code></p>
<p><img src="/img/eni_sheet_1.png" alt="Image of Imported Spreadsheet showing separated results for the Belfast South Assembly Election in 2022" /></p>
<p>Using this and the <a href="https://support.google.com/docs/answer/3093343?hl=en">Google QUERY language</a>, you can easily create some pretty dynamic graphs in a couple of lines/cells across your own sheets, all while being ‘fed’ by the main collaborative work.</p>
<p><img src="/img/eni_sheet_2.png" alt="Image of PieChart of First Preference Votes in Belfast South in 2022" /></p>
<p>For instance, this is generated from the following formula;</p>
<p><code class="language-plaintext highlighter-rouge">=query('Basic IMPORTRANGE'!A2:Q16, "select B, sum(C) group by B order by sum(C) desc")</code>;</p>
<p>In this case the first argument to the <code class="language-plaintext highlighter-rouge">query</code> is a reference to my own sheet that just has the ‘IMPORTRANGE’ result in it; the interesting bit is the second argument. It selects the values in column B (the party names in the count sheet), sums the matching rows in column C (the first-preference votes) within each group sharing a value in column B, and finally sorts the groups in descending order of that first-preference total.</p>
<p>So now we’ve easily set up an aggregation with two ‘cells’ of formula and a fairly basic chart.</p>
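<p>The same aggregation can also be expressed in <code class="language-plaintext highlighter-rouge">pandas</code>; this is a minimal sketch on stand-in data (the column names <code class="language-plaintext highlighter-rouge">party</code> and <code class="language-plaintext highlighter-rouge">first_pref</code> are mine, since the underlying sheet has no usable headers):</p>

```python
import pandas as pd

# Stand-in for the imported range; first-preference figures taken from
# the 2022 Belfast South count referenced above.
df = pd.DataFrame({
    "party": ["Sinn Féin", "Alliance Party", "Alliance Party", "Green Party"],
    "first_pref": [9511, 6503, 5201, 4058],
})

# Equivalent of: select B, sum(C) group by B order by sum(C) desc
totals = (
    df.groupby("party")["first_pref"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)
```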
<h2 id="doing-the-real-work-with-python">Doing the real work with Python</h2>
<p>Python is really powerful for this kind of stuff, particularly the <code class="language-plaintext highlighter-rouge">pandas</code> data management library; we can get the above-referenced sheet into a ‘raw’ Python format with just a ‘few lines of code’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">sheet_id</span> <span class="o">=</span> <span class="s">"1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA"</span> <span class="c1"># This is the bit taken from the URL above, like IMPORTRANGE above
</span><span class="n">tab_id</span> <span class="o">=</span> <span class="mi">372848906</span>
<span class="c1"># https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=372848906
</span>
<span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://docs.google.com/spreadsheets/d/</span><span class="si">{</span><span class="n">sheet_id</span><span class="si">}</span><span class="s">/export?format=csv&gid=</span><span class="si">{</span><span class="n">tab_id</span><span class="si">}</span><span class="s">"</span>
<span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Belfast South</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Hargey</td>
<td>2.00</td>
<td>Exclude <500</td>
<td>3.00</td>
<td>Exclude McCann Sibanda</td>
<td>4.00</td>
<td>Exclude Girvin</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.00</td>
<td>12.0</td>
<td>13.0</td>
<td>14.0</td>
<td>15.0</td>
</tr>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>16</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>17</th>
<td>Eligible Voters</td>
<td>73497</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>18</th>
<td>Turnout</td>
<td>47306</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>19</th>
<td>% Turnout</td>
<td>64.36%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>20</th>
<td>Valid Ballots</td>
<td>46938</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>21</th>
<td>invalid ballots</td>
<td>368</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>22</th>
<td>quota</td>
<td>7824</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>23</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>24</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>25</th>
<td>Belfast West</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Baker</td>
<td>2.00</td>
<td>Exclude Hill Mallon</td>
<td>3.00</td>
<td>Exclude Crossan</td>
<td>4.00</td>
<td>Exclude Burns</td>
<td>...</td>
<td>Exclude Doran</td>
<td>9.00</td>
<td>Exclude Murphy Higgins</td>
<td>10.00</td>
<td>Exclude Doherty</td>
<td>11.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>26</th>
<td>Danny Baker</td>
<td>Sinn Féin</td>
<td>9011</td>
<td>-1733</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>27</th>
<td>Órlaithí Flynn</td>
<td>Sinn Féin</td>
<td>6743</td>
<td>344.47</td>
<td>7087.47</td>
<td>5.38</td>
<td>7092.85</td>
<td>16.09</td>
<td>7108.94</td>
<td>27.37</td>
<td>...</td>
<td>68.37</td>
<td>7228.63</td>
<td>179</td>
<td>7407.63</td>
<td>NaN</td>
<td>7407.63</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>28</th>
<td>Aisling Reilly</td>
<td>Sinn Féin</td>
<td>5681</td>
<td>1028.47</td>
<td>6709.47</td>
<td>10.19</td>
<td>6719.66</td>
<td>7.76</td>
<td>6727.42</td>
<td>17.09</td>
<td>...</td>
<td>50.52</td>
<td>6811.98</td>
<td>318.98</td>
<td>7130.96</td>
<td>533</td>
<td>7663.96</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>29</th>
<td>Pat Sheehan</td>
<td>Sinn Féin</td>
<td>6370</td>
<td>52.44</td>
<td>6422.44</td>
<td>4</td>
<td>6426.44</td>
<td>5</td>
<td>6431.44</td>
<td>8.38</td>
<td>...</td>
<td>26.71</td>
<td>6477.10</td>
<td>258.18</td>
<td>6735.28</td>
<td>451.13</td>
<td>7186.41</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>30</th>
<td>Gerry Carroll</td>
<td>People Before Profit</td>
<td>3279</td>
<td>115.33</td>
<td>3394.33</td>
<td>19.76</td>
<td>3414.09</td>
<td>78.38</td>
<td>3492.47</td>
<td>70.08</td>
<td>...</td>
<td>238.23</td>
<td>3936.16</td>
<td>542.99</td>
<td>4479.15</td>
<td>1543.46</td>
<td>6022.61</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>31</th>
<td>Frank McCoubrey</td>
<td>Democratic Unionist Party</td>
<td>4166</td>
<td>0.57</td>
<td>4166.57</td>
<td>7.19</td>
<td>4173.76</td>
<td>0</td>
<td>4173.76</td>
<td>2</td>
<td>...</td>
<td>76.19</td>
<td>5275.14</td>
<td>154</td>
<td>5429.14</td>
<td>60.57</td>
<td>5489.71</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>32</th>
<td>Paul Doherty</td>
<td>Social Democratic and Labour Party</td>
<td>2528</td>
<td>88.35</td>
<td>2616.35</td>
<td>2.19</td>
<td>2618.54</td>
<td>29.14</td>
<td>2647.68</td>
<td>36.28</td>
<td>...</td>
<td>478.51</td>
<td>3232.66</td>
<td>404.28</td>
<td>3636.94</td>
<td>-3636.94</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>33</th>
<td>Gerard Herdman</td>
<td>Aontú</td>
<td>1753</td>
<td>8.17</td>
<td>1761.17</td>
<td>32</td>
<td>1793.17</td>
<td>9.19</td>
<td>1802.36</td>
<td>17.19</td>
<td>...</td>
<td>35.19</td>
<td>1871.74</td>
<td>-1871.74</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>34</th>
<td>Dan Murphy</td>
<td>Irish Republican Socialist Party</td>
<td>1103</td>
<td>12.16</td>
<td>1115.16</td>
<td>8</td>
<td>1123.16</td>
<td>7</td>
<td>1130.16</td>
<td>14</td>
<td>...</td>
<td>7</td>
<td>1159.16</td>
<td>-1159.16</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>35</th>
<td>Donnamarie Higgins</td>
<td>Alliance Party</td>
<td>907</td>
<td>17.48</td>
<td>924.48</td>
<td>2</td>
<td>926.48</td>
<td>6</td>
<td>932.48</td>
<td>7.19</td>
<td>...</td>
<td>-1134.81</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>36</th>
<td>Jordan Doran</td>
<td>Traditional Unionist Voice</td>
<td>802</td>
<td>0.38</td>
<td>802.38</td>
<td>3</td>
<td>805.38</td>
<td>1</td>
<td>806.38</td>
<td>3</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>37</th>
<td>Linsey Gibson</td>
<td>Ulster Unionist Party</td>
<td>474</td>
<td>0.76</td>
<td>474.76</td>
<td>2</td>
<td>476.76</td>
<td>1.19</td>
<td>477.95</td>
<td>2</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>38</th>
<td>Stevie Maginn</td>
<td>Green Party</td>
<td>307</td>
<td>3.04</td>
<td>310.04</td>
<td>3</td>
<td>313.04</td>
<td>10</td>
<td>323.04</td>
<td>10.19</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>39</th>
<td>Gerard Burns</td>
<td>Independent</td>
<td>192</td>
<td>16.91</td>
<td>208.91</td>
<td>28.19</td>
<td>237.10</td>
<td>7</td>
<td>244.10</td>
<td>-244.1</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>40</th>
<td>Patrick Crossan</td>
<td>The Workers Party</td>
<td>193</td>
<td>4.75</td>
<td>197.75</td>
<td>6.76</td>
<td>204.51</td>
<td>-204.51</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>41</th>
<td>Tony Mallon</td>
<td>Independent</td>
<td>129</td>
<td>2.28</td>
<td>131.28</td>
<td>-131.28</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>42</th>
<td>Declan Hill</td>
<td>Independent</td>
<td>26</td>
<td>0.38</td>
<td>26.38</td>
<td>-26.38</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>43</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>37.06</td>
<td>37.06</td>
<td>24</td>
<td>61.06</td>
<td>26.76</td>
<td>87.82</td>
<td>29.33</td>
<td>...</td>
<td>154.09</td>
<td>393.43</td>
<td>1173.47</td>
<td>1566.90</td>
<td>1048.78</td>
<td>2615.68</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>44</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>45</th>
<td>Eligible Voters</td>
<td>68727</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>46</th>
<td>Turnout</td>
<td>44440</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>47</th>
<td>% Turnout</td>
<td>64.66%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>48</th>
<td>Valid Ballots</td>
<td>43664</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>49</th>
<td>invalid ballots</td>
<td>776</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>50</th>
<td>quota</td>
<td>7278</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>51 rows × 27 columns</p>
</div>
<p>This is a little bit more complicated than other Google-fu would have you believe, but it looks like Google updated their APIs over the years to remove the ‘happy path’ for this call.</p>
<p>Additionally, note that as in the <code class="language-plaintext highlighter-rouge">IMPORTRANGE</code> example, there are no usable ‘headers’ in the underlying data so we may have to create these ourselves for more complex analysis.</p>
<p>Also, we have to manually ‘tidy up’ the ‘range’ ourselves, as the Belfast South range only goes to row 24, and then Belfast West appears.</p>
<p>While we could get fancy, for simplicity this is a manual example. We’ll also exclude the ‘metadata’ such as the Turnout statistics and Quota from the bottom of the section, in this case trimming the data to run between rows 3 and 17, leaving just candidates and transfer statistics.</p>
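<p>That manual trim might look something like this sketch; the frame below is a stand-in with one row per sheet row (in the real notebook it comes from <code class="language-plaintext highlighter-rouge">pd.read_csv(url)</code>), and the exact offsets depend on which rows <code class="language-plaintext highlighter-rouge">read_csv</code> treated as headers, so eyeball <code class="language-plaintext highlighter-rouge">df.head()</code> first:</p>

```python
import pandas as pd

# Stand-in frame: one row per sheet row; the real frame comes from
# pd.read_csv(url) and covers both constituencies.
df = pd.DataFrame({"col": [f"sheet row {i}" for i in range(1, 25)]})

# Keep rows 3..17 (1-indexed, inclusive): iloc is 0-indexed and
# end-exclusive, hence 2:17, giving 15 rows.
candidates = df.iloc[2:17].reset_index(drop=True)
print(len(candidates))
```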
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Belfast South</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Hargey</td>
<td>2.00</td>
<td>Exclude <500</td>
<td>3.00</td>
<td>Exclude McCann Sibanda</td>
<td>4.00</td>
<td>Exclude Girvin</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.00</td>
<td>12.0</td>
<td>13.0</td>
<td>14.0</td>
<td>15.0</td>
</tr>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>16</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>17</th>
<td>Eligible Voters</td>
<td>73497</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>18</th>
<td>Turnout</td>
<td>47306</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>19</th>
<td>% Turnout</td>
<td>64.36%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>20</th>
<td>Valid Ballots</td>
<td>46938</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>21</th>
<td>invalid ballots</td>
<td>368</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>22</th>
<td>quota</td>
<td>7824</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>23</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>24</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>25</th>
<td>Belfast West</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Baker</td>
<td>2.00</td>
<td>Exclude Hill Mallon</td>
<td>3.00</td>
<td>Exclude Crossan</td>
<td>4.00</td>
<td>Exclude Burns</td>
<td>...</td>
<td>Exclude Doran</td>
<td>9.00</td>
<td>Exclude Murphy Higgins</td>
<td>10.00</td>
<td>Exclude Doherty</td>
<td>11.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>26</th>
<td>Danny Baker</td>
<td>Sinn Féin</td>
<td>9011</td>
<td>-1733</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>27</th>
<td>Órlaithí Flynn</td>
<td>Sinn Féin</td>
<td>6743</td>
<td>344.47</td>
<td>7087.47</td>
<td>5.38</td>
<td>7092.85</td>
<td>16.09</td>
<td>7108.94</td>
<td>27.37</td>
<td>...</td>
<td>68.37</td>
<td>7228.63</td>
<td>179</td>
<td>7407.63</td>
<td>NaN</td>
<td>7407.63</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>28</th>
<td>Aisling Reilly</td>
<td>Sinn Féin</td>
<td>5681</td>
<td>1028.47</td>
<td>6709.47</td>
<td>10.19</td>
<td>6719.66</td>
<td>7.76</td>
<td>6727.42</td>
<td>17.09</td>
<td>...</td>
<td>50.52</td>
<td>6811.98</td>
<td>318.98</td>
<td>7130.96</td>
<td>533</td>
<td>7663.96</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>29</th>
<td>Pat Sheehan</td>
<td>Sinn Féin</td>
<td>6370</td>
<td>52.44</td>
<td>6422.44</td>
<td>4</td>
<td>6426.44</td>
<td>5</td>
<td>6431.44</td>
<td>8.38</td>
<td>...</td>
<td>26.71</td>
<td>6477.10</td>
<td>258.18</td>
<td>6735.28</td>
<td>451.13</td>
<td>7186.41</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>30</th>
<td>Gerry Carroll</td>
<td>People Before Profit</td>
<td>3279</td>
<td>115.33</td>
<td>3394.33</td>
<td>19.76</td>
<td>3414.09</td>
<td>78.38</td>
<td>3492.47</td>
<td>70.08</td>
<td>...</td>
<td>238.23</td>
<td>3936.16</td>
<td>542.99</td>
<td>4479.15</td>
<td>1543.46</td>
<td>6022.61</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>31</th>
<td>Frank McCoubrey</td>
<td>Democratic Unionist Party</td>
<td>4166</td>
<td>0.57</td>
<td>4166.57</td>
<td>7.19</td>
<td>4173.76</td>
<td>0</td>
<td>4173.76</td>
<td>2</td>
<td>...</td>
<td>76.19</td>
<td>5275.14</td>
<td>154</td>
<td>5429.14</td>
<td>60.57</td>
<td>5489.71</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>32</th>
<td>Paul Doherty</td>
<td>Social Democratic and Labour Party</td>
<td>2528</td>
<td>88.35</td>
<td>2616.35</td>
<td>2.19</td>
<td>2618.54</td>
<td>29.14</td>
<td>2647.68</td>
<td>36.28</td>
<td>...</td>
<td>478.51</td>
<td>3232.66</td>
<td>404.28</td>
<td>3636.94</td>
<td>-3636.94</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>33</th>
<td>Gerard Herdman</td>
<td>Aontú</td>
<td>1753</td>
<td>8.17</td>
<td>1761.17</td>
<td>32</td>
<td>1793.17</td>
<td>9.19</td>
<td>1802.36</td>
<td>17.19</td>
<td>...</td>
<td>35.19</td>
<td>1871.74</td>
<td>-1871.74</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>34</th>
<td>Dan Murphy</td>
<td>Irish Republican Socialist Party</td>
<td>1103</td>
<td>12.16</td>
<td>1115.16</td>
<td>8</td>
<td>1123.16</td>
<td>7</td>
<td>1130.16</td>
<td>14</td>
<td>...</td>
<td>7</td>
<td>1159.16</td>
<td>-1159.16</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>35</th>
<td>Donnamarie Higgins</td>
<td>Alliance Party</td>
<td>907</td>
<td>17.48</td>
<td>924.48</td>
<td>2</td>
<td>926.48</td>
<td>6</td>
<td>932.48</td>
<td>7.19</td>
<td>...</td>
<td>-1134.81</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>36</th>
<td>Jordan Doran</td>
<td>Traditional Unionist Voice</td>
<td>802</td>
<td>0.38</td>
<td>802.38</td>
<td>3</td>
<td>805.38</td>
<td>1</td>
<td>806.38</td>
<td>3</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>37</th>
<td>Linsey Gibson</td>
<td>Ulster Unionist Party</td>
<td>474</td>
<td>0.76</td>
<td>474.76</td>
<td>2</td>
<td>476.76</td>
<td>1.19</td>
<td>477.95</td>
<td>2</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>38</th>
<td>Stevie Maginn</td>
<td>Green Party</td>
<td>307</td>
<td>3.04</td>
<td>310.04</td>
<td>3</td>
<td>313.04</td>
<td>10</td>
<td>323.04</td>
<td>10.19</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>39</th>
<td>Gerard Burns</td>
<td>Independent</td>
<td>192</td>
<td>16.91</td>
<td>208.91</td>
<td>28.19</td>
<td>237.10</td>
<td>7</td>
<td>244.10</td>
<td>-244.1</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>40</th>
<td>Patrick Crossan</td>
<td>The Workers Party</td>
<td>193</td>
<td>4.75</td>
<td>197.75</td>
<td>6.76</td>
<td>204.51</td>
<td>-204.51</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>41</th>
<td>Tony Mallon</td>
<td>Independent</td>
<td>129</td>
<td>2.28</td>
<td>131.28</td>
<td>-131.28</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>42</th>
<td>Declan Hill</td>
<td>Independent</td>
<td>26</td>
<td>0.38</td>
<td>26.38</td>
<td>-26.38</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>43</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>37.06</td>
<td>37.06</td>
<td>24</td>
<td>61.06</td>
<td>26.76</td>
<td>87.82</td>
<td>29.33</td>
<td>...</td>
<td>154.09</td>
<td>393.43</td>
<td>1173.47</td>
<td>1566.90</td>
<td>1048.78</td>
<td>2615.68</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>44</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>45</th>
<td>Eligible Voters</td>
<td>68727</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>46</th>
<td>Turnout</td>
<td>44440</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>47</th>
<td>% Turnout</td>
<td>64.66%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>48</th>
<td>Valid Ballots</td>
<td>43664</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>49</th>
<td>invalid ballots</td>
<td>776</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>50</th>
<td>quota</td>
<td>7278</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>51 rows × 27 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>15 rows × 27 columns</p>
</div>
<p>There are a few ways to tidy up this stage/transfer setup; for simplicity we’ll take the ‘index’ off the left of the table (the candidate and party names) and try to construct a new column index based on those.</p>
<p>Sounds fancy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">how</span><span class="o">=</span><span class="s">'all'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="n">_table</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="n">tolist</span><span class="p">()[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="n">_table</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">set_names</span><span class="p">([</span><span class="s">'Candidate'</span><span class="p">,</span> <span class="s">'Party'</span><span class="p">])</span>
<span class="n">_table</span>
</code></pre></div></div>
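<p>Since the same cleanup has to happen for every constituency block in the spreadsheet, the chained steps above can be wrapped into a small helper. This is a minimal sketch: the toy two-candidate frame stands in for one real constituency block, and the <code>tidy_count_table</code> name and the illustrative column values are my own, not from the original notebook.</p>

```python
import numpy as np
import pandas as pd


def tidy_count_table(block: pd.DataFrame) -> pd.DataFrame:
    """Tidy one constituency block: drop empty columns, index by
    candidate/party, and coerce the stage counts to floats."""
    block = block.dropna(how="all", axis=1)          # drop all-NaN columns
    block = block.set_index(block.columns.tolist()[0:2])  # candidate + party
    block = block.fillna(0).astype(float)            # NaN transfers -> 0.0
    block.index = block.index.set_names(["Candidate", "Party"])
    return block


# Toy stand-in for one constituency block of the results spreadsheet
raw = pd.DataFrame({
    "Unnamed: 0": ["Deirdre Hargey", "Edwin Poots"],
    "Unnamed: 1": ["Sinn Féin", "Democratic Unionist Party"],
    "Unnamed: 2": [9511, 7211],
    "Unnamed: 3": [-1687, 6.12],
    "Unnamed: 4": [np.nan, np.nan],  # an empty column, should be dropped
})
tidy = tidy_count_table(raw)
```

<p>The same function can then be applied per constituency once the sheet is split into blocks, rather than repeating the chain by hand.</p>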
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
<th>Unnamed: 12</th>
<th>Unnamed: 13</th>
<th>Unnamed: 14</th>
<th>Unnamed: 15</th>
<th>Unnamed: 16</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>-1687.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>-650.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>6459.32</td>
<td>1630.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.50</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31.00</td>
<td>6994.46</td>
<td>214.68</td>
<td>7209.14</td>
<td>1114.00</td>
<td>8323.14</td>
<td>0.00</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17.00</td>
<td>5552.78</td>
<td>244.90</td>
<td>5797.68</td>
<td>858.42</td>
<td>6656.10</td>
<td>81.00</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>167.40</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>4934.12</td>
<td>164.02</td>
<td>5098.14</td>
<td>600.96</td>
<td>5699.10</td>
<td>127.00</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>3753.38</td>
<td>16.90</td>
<td>3770.28</td>
<td>-3770.28</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>181.80</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4.00</td>
<td>2367.26</td>
<td>-2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>0.36</td>
<td>1935.36</td>
<td>9.00</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>40.50</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>18.00</td>
<td>371.00</td>
<td>-371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92.00</td>
<td>577.76</td>
<td>96.76</td>
<td>674.52</td>
<td>1196.90</td>
<td>1871.42</td>
<td>442.92</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>Now to do the same thing with the stage counts and transfers;</p>
<p>There are many ways to do this; we could leave it as is and just name the columns by Stage and Transfer, but a ‘better’ way is to create a MultiIndex on the columns. Which probably means nothing if you haven’t come across one before.</p>
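<p>If you’ve not met a column MultiIndex before, here’s a minimal, self-contained sketch with toy numbers (nothing to do with the election data) showing what the structure buys you:</p>

```python
import pandas as pd

# Toy frame: two stages, each with a Count and a Transfers column
toy = pd.DataFrame(
    [[10, 2, 12, -1], [5, 1, 6, 0]],
    columns=pd.MultiIndex.from_product(
        [[1, 2], ["Count", "Transfers"]], names=["Stage", "Step"]
    ),
)

# Cross-sections slice one level while keeping the other
counts = toy.xs("Count", level="Step", axis=1)
print(counts.columns.tolist())  # [1, 2] -- only the 'Stage' level remains
```

<p>That same <code>xs</code> trick is what turns the per-stage queries further down into one-liners.</p>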
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">columns</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
'Unnamed: 16'],
dtype='object')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stage</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Count'</span><span class="p">,</span><span class="s">'Transfers'</span><span class="p">]</span>
<span class="n">stages</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">_table</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">//</span><span class="mi">2</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span>
<span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">stages</span><span class="p">,</span><span class="n">stage</span><span class="p">])[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MultiIndex([(1, 'Count'),
(1, 'Transfers'),
(2, 'Count'),
(2, 'Transfers'),
(3, 'Count'),
(3, 'Transfers'),
(4, 'Count'),
(4, 'Transfers'),
(5, 'Count'),
(5, 'Transfers'),
(6, 'Count'),
(6, 'Transfers'),
(7, 'Count'),
(7, 'Transfers'),
(8, 'Count')],
)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">columns</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">stages</span><span class="p">,</span><span class="n">stage</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'Stage'</span><span class="p">,</span><span class="s">'Step'</span><span class="p">])[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>Stage</th>
<th colspan="2" halign="left">1</th>
<th colspan="2" halign="left">2</th>
<th colspan="2" halign="left">3</th>
<th colspan="2" halign="left">4</th>
<th colspan="2" halign="left">5</th>
<th colspan="2" halign="left">6</th>
<th colspan="2" halign="left">7</th>
<th>8</th>
</tr>
<tr>
<th></th>
<th>Step</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>-1687.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>-650.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>6459.32</td>
<td>1630.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.50</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31.00</td>
<td>6994.46</td>
<td>214.68</td>
<td>7209.14</td>
<td>1114.00</td>
<td>8323.14</td>
<td>0.00</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17.00</td>
<td>5552.78</td>
<td>244.90</td>
<td>5797.68</td>
<td>858.42</td>
<td>6656.10</td>
<td>81.00</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>167.40</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>4934.12</td>
<td>164.02</td>
<td>5098.14</td>
<td>600.96</td>
<td>5699.10</td>
<td>127.00</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>3753.38</td>
<td>16.90</td>
<td>3770.28</td>
<td>-3770.28</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>181.80</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4.00</td>
<td>2367.26</td>
<td>-2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>0.36</td>
<td>1935.36</td>
<td>9.00</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>40.50</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>18.00</td>
<td>371.00</td>
<td>-371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92.00</td>
<td>577.76</td>
<td>96.76</td>
<td>674.52</td>
<td>1196.90</td>
<td>1871.42</td>
<td>442.92</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>Now we can do some interesting queries really easily;</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Count"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Stage</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>7217.12</td>
<td>7223.20</td>
<td>7340.92</td>
<td>8474.92</td>
<td>8474.92</td>
<td>8474.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>6058.92</td>
<td>6127.10</td>
<td>6449.14</td>
<td>6459.32</td>
<td>8089.32</td>
<td>8089.32</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>6768.14</td>
<td>6827.64</td>
<td>6963.46</td>
<td>6994.46</td>
<td>7209.14</td>
<td>8323.14</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>5346.98</td>
<td>5404.50</td>
<td>5535.78</td>
<td>5552.78</td>
<td>5797.68</td>
<td>6656.10</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>4225.40</td>
<td>4395.22</td>
<td>4885.76</td>
<td>4934.12</td>
<td>5098.14</td>
<td>5699.10</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>3067.12</td>
<td>3084.84</td>
<td>3110.02</td>
<td>3753.38</td>
<td>3770.28</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>2211.80</td>
<td>2230.32</td>
<td>2363.26</td>
<td>2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>1935.36</td>
<td>1944.36</td>
<td>1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>876.92</td>
<td>892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>669.50</td>
<td>842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>142.58</td>
<td>485.76</td>
<td>577.76</td>
<td>674.52</td>
<td>1871.42</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>This makes plotting quite simple, which in turn makes the queries you can express much more complex…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Count"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Party'</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">T</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><Axes: xlabel='Stage'>
</code></pre></div></div>
<p><img src="/img/eni_output_16_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Transfers"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>\
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Party'</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">cumsum</span><span class="p">().</span><span class="n">T</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">title</span><span class="o">=</span><span class="s">'Net Cumulative Transfers'</span>
<span class="p">)</span>\
<span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'right'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.6</span><span class="p">,</span><span class="mf">0.5</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.legend.Legend at 0x17b555a50>
</code></pre></div></div>
<p><img src="/img/eni_output_17_1.png" alt="png" /></p>
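<p>The ‘Net Cumulative Transfers’ idea above boils down to a groupby over the Party index level plus a cumulative sum over the stages; a runnable sketch with made-up numbers (not the real count data):</p>

```python
import pandas as pd

# Two candidates from the same party, two stages of transfers
idx = pd.MultiIndex.from_tuples(
    [("A", "Party X"), ("B", "Party X")], names=["Candidate", "Party"]
)
cols = pd.MultiIndex.from_product([[1, 2], ["Transfers"]], names=["Stage", "Step"])
transfers = pd.DataFrame([[10.0, 5.0], [2.0, -12.0]], index=idx, columns=cols)

# Sum within each party, then accumulate stage-by-stage
net = (
    transfers.xs("Transfers", level="Step", axis=1)
    .groupby("Party")
    .sum()
    .cumsum(axis=1)
)
print(net.loc["Party X"].tolist())  # [12.0, 5.0]
```

<p>Note the explicit <code>axis=1</code> here so the accumulation runs across the stages rather than down the parties.</p>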
<h2 id="conclusion">Conclusion</h2>
<p>Considering it’s election day today in Northern Ireland, I wanted to bash this out to help other election observers have a play with the fantastic work the likes of <a href="https://twitter.com/colm_burns">@colm_burns</a> and the rest of the <a href="https://twitter.com/electionsni">@electionsni</a> team are doing.</p>
<p>More than happy to help anyone else answer interesting electoral questions, and remember: vote early, vote often, and #votetillyouboke</p>
StayCation20232023-04-01T19:00:00+00:00https://andrewbolster.info/2023/04/staycation2023<p><strong>TL;DR I’m taking 2 weeks off my Synopsys work to work on all those side projects I promised I would, either to myself or to others.</strong></p>
<p>We’ve all got a box somewhere of AliExpress / PiMoroni / PiHut boards, and a number of repos and half-started projects that you promised you’d get to play with in evenings or weekends, but you keep finding yourself doing stupid things like “spending time with family/friends”, “working out”, “spending time with / trying to find your significant other”, or “eating” or “sleeping” or other ridiculous indulgences.</p>
<p>For the past three years, I’ve basically taken December off from work because for a while there, there was no major motivation to take significant blocks of time off work… (Dear American colleagues; we <em>have</em> to take holidays, it’s very good when there’s no pandemic, check it out…)</p>
<p>In an effort to try something different this year, I’m taking a two week “StayCation”, mostly going to be spent in <a href="https://www.farsetlabs.org.uk">Farset Labs</a>.</p>
<p>Projects I’ll be working on (and trying to write up…)</p>
<ul>
<li>Make a proper AirFlow / MongoDB / Kafka setup for personal projects</li>
<li>Make a Home Assistant integrated EPaper Weather / News Display for the bedroom</li>
<li>Fix Farset’s HotDesking page…</li>
<li>Check that <a href="https://github.com/andrewbolster/arduino_workshop">this Arduino Workshop</a> from years ago still works…
<ul>
<li>If it does, add in my recently acquired SDS011 Particulate Sensors into the mix…</li>
</ul>
</li>
<li>Test out HDMI Capture for the Farset Labs Event Space for easy streaming / remote engagement.</li>
<li>Try out fixing Farset’s front porch display with HDMI CEC so it can automatically wake up the big vertical display</li>
<li>Build out a persistent EPaper display for the desk with Todoist</li>
<li>Calibrate the recently acquired CR30 (the treadmill one) at Farset</li>
<li>Plan out how to get ClubMate into the country this year for Farset, BSides Belfast, and my selfish, thirsty self.</li>
<li>Flesh out and productionise the <a href="https://github.com/FarsetLabs/vote-herder">ElectionNI API and Persistence layers</a></li>
<li>Fix a particularly nasty and annoying CI/CD bug in <a href="https://github.com/andrewbolster/bolster">my wee secondary brain package</a>…</li>
<li>Setup, Document and Install Nexudus driven Amazon Fire tablets (or other…) for room occupancy / booking displays in Farset</li>
<li>BSides Belfast Sponsor Logo Collation and Scheduled Shoutouts</li>
<li>Write an Instagram Scraper for a “protest wall” for a certain beardy fella (not myself for a change)</li>
<li>Treasurer update for Farset Board</li>
<li>GIF - To - LED Array webservice for the <a href="https://shop.pimoroni.com/products/cosmic-unicorn">Cosmic Unicorn</a></li>
<li>Write a rotating/interactive badge for the <a href="https://shop.pimoroni.com/products/badger-2040-w?variant=40514062188627">Badger 2040W</a></li>
</ul>
<p>That’s all… 🤣</p>
<p>I have no expectation of completing all these but it’ll be fun trying!</p>
<p>Wish me luck…</p>
Wrangling NI House Price Index Data2022-03-27T19:49:00+00:00https://andrewbolster.info/2022/03/NI-House-Price-Index<h1 id="data-wrangling-ni-house-price-index-data">Data Wrangling NI House Price Index Data</h1>
<p>This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through <a href="https://www.nisra.gov.uk/statistics/housing-community-and-regeneration/northern-ireland-house-price-index">NI House Price Index</a> datasets I was playing around with.</p>
<p>It’s mostly code, so if you were here for some ‘insight’, feck aff.</p>
<p>There is <strong>no</strong> analysis here, this is <strong>just</strong> data wrangling.</p>
<p>TL;DR As always, Government Open Data has over the years gone from ‘non-existent’ to ‘garbled’ to ‘inconsistent’, and I feel it is now in the stage of ‘consistently inconsistent’, which is progress in my eyes.</p>
<h1 id="preamble-code-move-on">Preamble Code, move on.</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="c1"># Pull the latest pages of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links
</span>
<span class="n">base_url</span><span class="o">=</span> <span class="s">'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'</span>
<span class="n">base_content</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">base_url</span><span class="p">).</span><span class="n">content</span>
<span class="n">base_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">base_content</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">base_soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">):</span>
<span class="k">if</span> <span class="n">a</span><span class="p">.</span><span class="n">attrs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">endswith</span><span class="p">(</span><span class="s">'xlsx'</span><span class="p">):</span>
<span class="n">source_name</span><span class="p">,</span> <span class="n">source_url</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">a</span><span class="p">.</span><span class="n">attrs</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">source_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">source_url</span><span class="p">,</span> <span class="n">sheet_name</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="c1"># Load all worksheets in
</span>
</code></pre></div></div>
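<p>Note that the loop above overwrites <code>source_df</code> on every matching link, so only the last workbook linked on the page survives. If you’d rather gather the candidate links first and then choose which workbook to download, a small sketch (my own hypothetical <code>xlsx_links</code> helper, same BeautifulSoup assumptions):</p>

```python
from bs4 import BeautifulSoup

def xlsx_links(html: str):
    """Return (label, href) pairs for every xlsx link in a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a.attrs.get("href", ""))
        for a in soup.find_all("a")
        if a.attrs.get("href", "").endswith("xlsx")
    ]

page = '<a href="/hpi-q4-2021.xlsx">HPI Q4 2021</a> <a href="/notes.pdf">Notes</a>'
print(xlsx_links(page))  # [('HPI Q4 2021', '/hpi-q4-2021.xlsx')]
```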
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table of Contents</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Worksheet Name</td>
<td>Frequency</td>
<td>House Price Index - Quarter 4 2021</td>
</tr>
<tr>
<th>1</th>
<td>Table 1</td>
<td>Quarterly</td>
<td>Table 1: NI HPI Trends Q1 2005 - Q4 2021</td>
</tr>
<tr>
<th>2</th>
<td>Figure 1</td>
<td>Quarterly</td>
<td>Figure 1: Graph of NI HPI Q1 2005 - Q4 2021</td>
</tr>
<tr>
<th>3</th>
<td>Figure 1a</td>
<td>Quarterly</td>
<td>Figure 1a: Graph of Percentage Quarterly Chang...</td>
</tr>
<tr>
<th>4</th>
<td>Figure 1b</td>
<td>Quarterly</td>
<td>Figure 1b: Graph of Percentage Annual Change Q...</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>58</th>
<td>Table 10h</td>
<td>Quarterly</td>
<td>Table 10h: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>59</th>
<td>Table 10i</td>
<td>Quarterly</td>
<td>Table 10i: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>60</th>
<td>Table 10j</td>
<td>Quarterly</td>
<td>Table 10j: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>61</th>
<td>Table 10k</td>
<td>Quarterly</td>
<td>Table 10k: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>62</th>
<td>Figure 11</td>
<td>Quarterly</td>
<td>Figure 11: Number of Verified Residential Prop...</td>
</tr>
</tbody>
</table>
<p>63 rows × 3 columns</p>
</div>
<h3 id="fix-the-contents-sheet-to-correctly-reflect-the-worksheet-names">Fix the Contents sheet to correctly reflect the Worksheet names</h3>
<p>And fix the table headers and sheet-titles while we’re at it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_header</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="mi">1</span><span class="p">:]</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">columns</span> <span class="o">=</span> <span class="n">new_header</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">new_header</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span><span class="s">'Title'</span><span class="p">]</span>
</code></pre></div></div>
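<p>This ‘promote the first row to a header’ dance turns up in nearly every sheet of these workbooks, so it’s worth wrapping in a wee helper (my own convenience function, not anything from the published workbook):</p>

```python
import pandas as pd

def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row of `df` as its column header and drop that row."""
    out = df.iloc[1:].copy()
    out.columns = list(df.iloc[0])
    return out

raw = pd.DataFrame([["Worksheet Name", "Frequency"], ["Table 1", "Quarterly"]])
fixed = promote_header(raw)
print(fixed.columns.tolist())  # ['Worksheet Name', 'Frequency']
```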
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">t</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Title'</span><span class="p">].</span><span class="n">values</span> <span class="k">if</span> <span class="n">t</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
'Table 5: HPI & Standardised Price for each Local Government District in NI',
'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Replace 'Figure' with 'Fig' in 'Worksheet Name'
</span><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Figure'</span><span class="p">,</span><span class="s">'Fig'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="tidy-up-data">Tidy up Data</h2>
<h3 id="general-methodology">General Methodology</h3>
<p>Ignore figure data (pretty much completely…)</p>
<p>Tables share more or less the same structure: a header on row 3 (pandas row 1), and a year-and-quarter ‘index’ on time series tables (otherwise a categorical index; see Table 2, Table 3).</p>
<p>Some TS tables <em>also</em> have totals subsections, so these should be a) validated and b) ignored.</p>
<p>Any columns with no header in row 3 (pandas row 1) should be ignored (usually text notes).</p>
<p><em>Operate Sequentially</em> (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting)</p>
<p>Use keys from ‘Contents’ to describe data, but <strong>these may be suffixed by a date which could change between data sets!</strong></p>
<p>There are also some columns that look like checksums; if there is an ‘NI’ column, or a data column where all valid values are 100, delete it.</p>
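<p>A hedged sketch of that checksum rule in its generalised form (a hypothetical <code>drop_checksum_columns</code> helper, not from the original notebook): drop any numeric column whose non-null values are all exactly 100.</p>

```python
import pandas as pd

def drop_checksum_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop any numeric column whose non-null values are all exactly 100."""
    checksum_cols = [
        c for c in df.columns
        if pd.api.types.is_numeric_dtype(df[c])  # only consider numeric columns
        and df[c].notna().any()                  # skip all-NaN columns (vacuously 'all 100')
        and df[c].dropna().eq(100).all()         # every valid value is the checksum 100
    ]
    return df.drop(columns=checksum_cols)
```

<p>This would replace the hard-coded <code>'NI'</code> check, at the cost of occasionally eating a legitimate column that happens to be constant-100.</p>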
<h3 id="table-1-ni-hpi-trends-q1-2005---q4-2021">Table 1: NI HPI Trends Q1 2005 - Q4 2021</h3>
<p><strong>TODO: Regexy way to get rid of the ‘\QX-YYYY -\QX YYYY’ tail</strong></p>
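<p>One hedged sketch of how that TODO might be resolved (a hypothetical <code>strip_quarter_range</code> helper, assuming the tails always match a ‘Qx YYYY - Qx YYYY’ pattern; not part of the notebook):</p>

```python
import re

def strip_quarter_range(title: str) -> str:
    """Remove a trailing 'Qx YYYY - Qx YYYY' span from a table title."""
    # Anchor at end-of-string so mid-title quarter mentions survive
    return re.sub(r'\s*\(?Q[1-4]\s+\d{4}\s*-\s*Q[1-4]\s+\d{4}\)?\s*$', '', title)
```

<p>Titles that end with a single quarter (e.g. ‘… Q4 2021’) are deliberately left alone here; that case would need its own decision.</p>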
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 1: NI House Price Index, Standardised Price and Quarterly and Annual Change</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Year</td>
<td>Quarter</td>
<td>NI House Price Index</td>
<td>NI House Standardised Price</td>
<td>Quarterly Change</td>
<td>Annual Change</td>
<td>NI</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005</td>
<td>Q1</td>
<td>100.883607</td>
<td>111920.268199</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>104.564663</td>
<td>116004.031639</td>
<td>0.036488</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>111.219</td>
<td>123386.352673</td>
<td>0.063638</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>89</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>90</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>91</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>92</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>93</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>94 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">:</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">-></span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Re-header from row 1 (which was row 3 in excel)
</span> <span class="n">new_header</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">offset</span><span class="o">+</span><span class="mi">1</span><span class="p">:]</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">new_header</span>
<span class="c1"># remove 'NaN' trailing columns
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">notna</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)]]</span>
<span class="c1"># 'NI' is a usually hidden column that appears to be a checksum;
</span> <span class="c1">#if it's all there and all 100, remove it, otherwise, complain.
</span> <span class="c1"># (Note, need to change this 'if' logic to just 'if there's a
</span> <span class="c1"># column with all 100's, but cross that bridge later)
</span> <span class="k">if</span> <span class="s">'NI'</span> <span class="ow">in</span> <span class="n">df</span><span class="p">:</span>
<span class="k">assert</span> <span class="n">df</span><span class="p">[</span><span class="s">'NI'</span><span class="p">].</span><span class="nb">all</span><span class="p">()</span> <span class="ow">and</span> <span class="n">df</span><span class="p">[</span><span class="s">'NI'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="mi">100</span><span class="p">,</span> <span class="s">"Not all values in df['NI'] == 100"</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'NI'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Strip rows below the first all-nan row, if there is one
</span> <span class="c1"># (Otherwise this truncates the tables as there is no
</span> <span class="c1"># idxmax in the table of all 'false's)
</span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)):</span>
<span class="n">idx_first_bad_row</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">idxmax</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="n">idx_first_bad_row</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># By Inspection, other tables use 'Sale Year' and 'Sale Quarter'
</span> <span class="k">if</span> <span class="nb">set</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">keys</span><span class="p">()).</span><span class="n">issuperset</span><span class="p">({</span><span class="s">'Sale Year'</span><span class="p">,</span><span class="s">'Sale Quarter'</span><span class="p">}):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'Sale Year'</span><span class="p">:</span><span class="s">'Year'</span><span class="p">,</span>
<span class="s">'Sale Quarter'</span><span class="p">:</span> <span class="s">'Quarter'</span>
<span class="p">})</span>
<span class="c1"># For 'Year','Quarter' indexed pages, there is an implied Year
</span> <span class="c1"># in Q2/4, so fill it downwards
</span> <span class="k">if</span> <span class="nb">set</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">keys</span><span class="p">()).</span><span class="n">issuperset</span><span class="p">({</span><span class="s">'Year'</span><span class="p">,</span><span class="s">'Quarter'</span><span class="p">}):</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Year'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Year'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">'ffill'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1"># In Pandas we can represent Y/Q combinations as proper datetimes
</span> <span class="c1">#https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">column</span><span class="o">=</span><span class="s">'Period'</span><span class="p">,</span>
<span class="n">value</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">PeriodIndex</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">r</span><span class="p">.</span><span class="n">Year</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">r</span><span class="p">.</span><span class="n">Quarter</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">freq</span><span class="o">=</span><span class="s">'Q'</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># reset index, try to fix dtypes, etc, (this should be the last
</span> <span class="c1"># operation before returning!
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">infer_objects</span><span class="p">()</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI House Price Index</th>
<th>NI House Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>100.883607</td>
<td>111920.268199</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>104.564663</td>
<td>116004.031639</td>
<td>0.036488</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>111.219000</td>
<td>123386.352673</td>
<td>0.063638</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>115.083964</td>
<td>127674.143865</td>
<td>0.034751</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>118.354129</td>
<td>131302.064422</td>
<td>0.028415</td>
<td>0.173175</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.931827</td>
<td>147474.561707</td>
<td>0.026103</td>
<td>0.052326</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.382831</td>
<td>149084.306040</td>
<td>0.010915</td>
<td>0.059421</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.105050</td>
<td>154323.134643</td>
<td>0.035140</td>
<td>0.095724</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>143.346066</td>
<td>159028.118093</td>
<td>0.030488</td>
<td>0.106491</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>143.456594</td>
<td>159150.737832</td>
<td>0.000771</td>
<td>0.079174</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'Table 1'</span><span class="p">:</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">])</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>32
</code></pre></div></div>
<p>One down, 31 to go…</p>
<h3 id="table-2-ni-hpi--standardised-price-statistics-by-property-type-q4-2021">Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Property Type</th>
<th>Index\n(Quarter 4 2021)</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price\n(Quarter 4 2021)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Detached</td>
<td>143.488806</td>
<td>0.008491</td>
<td>0.093110</td>
<td>241131.373512</td>
</tr>
<tr>
<th>1</th>
<td>Semi-Detached</td>
<td>140.680694</td>
<td>0.004211</td>
<td>0.076953</td>
<td>153718.543755</td>
</tr>
<tr>
<th>2</th>
<td>Terrace</td>
<td>149.564169</td>
<td>-0.009577</td>
<td>0.078758</td>
<td>112831.710806</td>
</tr>
<tr>
<th>3</th>
<td>Apartment</td>
<td>133.376791</td>
<td>-0.014732</td>
<td>0.032761</td>
<td>116554.228620</td>
</tr>
<tr>
<th>4</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<p>Those ‘\n(Quarter 4 2021)’ suffixes are unnecessary, so <em>for this table</em>, let’s clear them.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Property Type</th>
<th>Index</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Detached</td>
<td>143.488806</td>
<td>0.008491</td>
<td>0.093110</td>
<td>241131.373512</td>
</tr>
<tr>
<th>1</th>
<td>Semi-Detached</td>
<td>140.680694</td>
<td>0.004211</td>
<td>0.076953</td>
<td>153718.543755</td>
</tr>
<tr>
<th>2</th>
<td>Terrace</td>
<td>149.564169</td>
<td>-0.009577</td>
<td>0.078758</td>
<td>112831.710806</td>
</tr>
<tr>
<th>3</th>
<td>Apartment</td>
<td>133.376791</td>
<td>-0.014732</td>
<td>0.032761</td>
<td>116554.228620</td>
</tr>
<tr>
<th>4</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span>
</code></pre></div></div>
<h3 id="table-2a-ni-detached-property-price-index-q1-2005---q4-2021">Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 2a'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI Detached Property Price Index</th>
<th>NI Detached Property Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>95.465560</td>
<td>160428.832662</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>100.974498</td>
<td>169686.542965</td>
<td>0.057706</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.526236</td>
<td>180696.666810</td>
<td>0.064885</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>110.279730</td>
<td>185323.883533</td>
<td>0.025608</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>112.270506</td>
<td>188669.361197</td>
<td>0.018052</td>
<td>0.176032</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.266614</td>
<td>220592.113069</td>
<td>0.026393</td>
<td>0.055357</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.814014</td>
<td>224872.989982</td>
<td>0.019406</td>
<td>0.071429</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.682380</td>
<td>234734.715703</td>
<td>0.043855</td>
<td>0.129844</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>142.280745</td>
<td>239101.239764</td>
<td>0.018602</td>
<td>0.112515</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>143.488806</td>
<td>241131.373512</td>
<td>0.008491</td>
<td>0.093110</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<h3 id="table-2x-ni-xxx-property-price-index-q1-2005---q4-2021">Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021</h3>
<p>This table structure is consistent across the rest of the Table 2x cohort, mapping to the Property Types listed in Table 2.</p>
<p>For the time being, we can ignore these, but this will probably become a pain later on…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">][</span><span class="s">'Property Type'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 Detached
1 Semi-Detached
2 Terrace
3 Apartment
4 All
Name: Property Type, dtype: object
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="n">table2s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 2[a-z]'</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 2'</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is matching itself!'</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 20'</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is greedy!'</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 2z'</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is matching incorrectly!'</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table2s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 2[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(6, 26)
</code></pre></div></div>
<p>6 down, 26 to go.</p>
<h3 id="table-3-ni-hpi--standardised-price-statistics-by-newexisting-resold-dwelling-type-q4-2021">Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021</h3>
<p>These appear to share a similar structure with the Table 2x cohort… hopefully.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 3'</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span> <span class="c1"># Stolen from Table 2 Treatment
</span><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Property Type</th>
<th>Index</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>New</td>
<td>141.769973</td>
<td>0.024877</td>
<td>0.072609</td>
<td>185966.524090</td>
</tr>
<tr>
<th>1</th>
<td>Existing Resold</td>
<td>143.518977</td>
<td>-0.004918</td>
<td>0.080771</td>
<td>152275.828046</td>
</tr>
<tr>
<th>2</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 3'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 3a'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI New Dwellings Price Index</th>
<th>NI New Dwellings Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>95.804706</td>
<td>125671.662611</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>101.229223</td>
<td>132787.263460</td>
<td>0.056621</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>106.243580</td>
<td>139364.837967</td>
<td>0.049535</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>110.118105</td>
<td>144447.239874</td>
<td>0.036468</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.624410</td>
<td>149046.629634</td>
<td>0.031841</td>
<td>0.186000</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.173052</td>
<td>173377.779440</td>
<td>0.004103</td>
<td>0.036125</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.772562</td>
<td>175475.933612</td>
<td>0.012102</td>
<td>0.027916</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>136.969311</td>
<td>179669.264190</td>
<td>0.023897</td>
<td>0.046474</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>138.328776</td>
<td>181452.540106</td>
<td>0.009925</td>
<td>0.050867</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.769973</td>
<td>185966.524090</td>
<td>0.024877</td>
<td>0.072609</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table3s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 3[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table3s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10, 22)
</code></pre></div></div>
<h3 id="table-4-number-of-verified-residential-property-sales-q1-2005---q4-2021">Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021</h3>
<p>Table 4 is not looking great</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005\n</td>
<td>Quarter 1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Quarter 1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>NaN</td>
<td>Quarter 2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>NaN</td>
<td>Quarter 3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>NaN</td>
<td>Quarter 4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>2021 Total</td>
<td>9174</td>
<td>9987</td>
<td>8295</td>
<td>2313</td>
<td>29769</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>89 rows × 8 columns</p>
</div>
<p>Of note: the header row is now at index 3 instead of index 1, thanks to a load of fluff at the start that probably won’t be consistent between reports, so it will almost certainly mess up my day in a few months.</p>
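<p>Since that offset is liable to drift between reports, one defensive option (a sketch only — <code>find_header_offset</code> and the toy frame are hypothetical, not part of the pipeline above) is to locate the header row by its known first cell rather than hard-coding the index:</p>

```python
import pandas as pd

def find_header_offset(df: pd.DataFrame, marker: str = "Sale Year") -> int:
    """Return the positional index of the row whose first cell equals `marker`,
    so the header offset survives changes in the length of the preamble."""
    first_col = df.iloc[:, 0].astype(str).str.strip()
    matches = first_col[first_col == marker]
    if matches.empty:
        raise ValueError(f"No header row starting with {marker!r}")
    return df.index.get_loc(matches.index[0])

# Toy frame mimicking Table 4's layout: three rows of fluff, then the real header
raw = pd.DataFrame({
    "A": ["Verified Sales = ...", "Please note ...", "Back to contents",
          "Sale Year", "2005\n"],
    "B": [None, None, None, "Sale Quarter", "Quarter 1"],
})
print(find_header_offset(raw))  # → 3
```

<p>Something like <code>basic_cleanup(df, offset=find_header_offset(df))</code> would then keep working even if a note gets added or removed at the top of the sheet.</p>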
<p>Also, <strong>Quarter labels</strong> have now been shifted into ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷♂️. More egregiously, it looks like <strong>‘\n’ has leaked into some Sale Year values</strong>. Funtimes.</p>
<p>Finally, and possibly most annoyingly, the introduction of <strong>partial total rows</strong> is going to throw things off; since this isn't a validation study, stuff 'em.</p>
<p>In an effort not to over-complicate <code class="language-plaintext highlighter-rouge">basic_cleanup</code>, we can try and clean these table specific issues first;</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Quarter ([1-4])'</span><span class="p">,</span><span class="sa">r</span><span class="s">'Q\1'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005\n</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>NaN</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>NaN</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>NaN</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>2021 Total</td>
<td>9174</td>
<td>9987</td>
<td>8295</td>
<td>2313</td>
<td>29769</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>89 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Lose the year new-lines (needs astype because non str lines are
# correctly inferred to be ints, so .str methods nan-out
</span><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>
</code></pre></div></div>
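<p>To see why the <code>astype(str)</code> is needed: on a mixed object column, pandas’ <code>.str</code> accessor returns NaN for every non-string element, silently eating the year values that were parsed as ints. A minimal illustration (toy series, not the real data):</p>

```python
import pandas as pd

# A mixed column, as pandas tends to read the Sale Year cells: some str, some int
s = pd.Series(["2005\n", 2006])

print(s.str.replace("\n", "", regex=False).tolist())              # int row becomes NaN
print(s.astype(str).str.replace("\n", "", regex=False).tolist())  # both rows survive
```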
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>82</th>
<td>nan</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
<td>NaN</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>nan</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>nan</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>nan</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
</tbody>
</table>
<p>72 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>3</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Apartment</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>2208</td>
<td>2474</td>
<td>2808</td>
<td>483</td>
<td>7973</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>2297</td>
<td>2655</td>
<td>2952</td>
<td>539</td>
<td>8443</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>2498</td>
<td>3003</td>
<td>3492</td>
<td>631</td>
<td>9624</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>2185</td>
<td>2650</td>
<td>3158</td>
<td>594</td>
<td>8587</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<p>That's awkward enough to get its own function…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_4</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 4: Number of Verified Residential Property Sales
* Regex 'Quarter X' to 'QX' in future 'Sales Quarter' column
* Drop Year Total rows
* Clear any Newlines from the future 'Sales Year' column
* call `basic_cleanup` with offset=3
"""</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Quarter ([1-4])'</span><span class="p">,</span><span class="sa">r</span><span class="s">'Q\1'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
<span class="c1"># Lose the year new-lines (needs astype because non str lines are
</span> <span class="c1"># correctly inferred to be ints, so .str methods nan-out
</span> <span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>
<span class="k">return</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">cleanup_table_4</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">].</span><span class="n">copy</span><span class="p">())</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>3</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Apartment</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>2208</td>
<td>2474</td>
<td>2808</td>
<td>483</td>
<td>7973</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>2297</td>
<td>2655</td>
<td>2952</td>
<td>539</td>
<td>8443</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>2498</td>
<td>3003</td>
<td>3492</td>
<td>631</td>
<td>9624</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>2185</td>
<td>2650</td>
<td>3158</td>
<td>594</td>
<td>8587</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_4</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(11, 21)
</code></pre></div></div>
<h3 id="table-5-hpi--standardised-price-for-each-local-government-district-in-ni">Table 5: HPI & Standardised Price for each Local Government District in NI</h3>
<p>This <em>nearly works</em>, but structurally it requires a multi-index column to make sense…</p>
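<p>For reference, the payoff of a two-level column index is that you can slice by either level and get a sensible frame back; a toy example (the LGD names and figures here are made up for illustration):</p>

```python
import pandas as pd

# Two inner metrics per outer LGD, mirroring the shape Table 5 wants
cols = pd.MultiIndex.from_product(
    [["Belfast", "Derry City and Strabane"], ["Index", "Price"]],
    names=["LGD", "Metric"],
)
df = pd.DataFrame(
    [[99.8, 92000, 98.1, 87000],
     [101.2, 94500, 99.0, 88500]],
    columns=cols,
)

print(df["Belfast"])                           # both metrics for one LGD
print(df.xs("Price", axis=1, level="Metric"))  # one metric across all LGDs
```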
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey HPI</th>
<th>Antrim and Newtownabbey Standardised Price</th>
<th>Ards and North Down HPI</th>
<th>Ards and North Down Standardised Price</th>
<th>Armagh City, Banbridge and Craigavon HPI</th>
<th>Armagh City, Banbridge and Craigavon Standardised Price</th>
<th>Belfast HPI</th>
<th>...</th>
<th>Fermanagh and Omagh HPI</th>
<th>Fermanagh and Omagh Standardised Price</th>
<th>Lisburn and Castlereagh HPI</th>
<th>Lisburn and Castlereagh Standardised Price</th>
<th>Mid and East Antrim HPI</th>
<th>Mid and East Antrim Standardised Price</th>
<th>Mid Ulster Standardised HPI</th>
<th>Mid Ulster Standardised Price</th>
<th>Newry, Mourne and Down HPI</th>
<th>Newry, Mourne and Down Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Two inner-columns per LGD
</span><span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">unique</span><span class="p">()</span>
<span class="n">lgds</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
</code></pre></div></div>
<p>For some reason, Mid Ulster has a ‘Standardised HPI’ column which throws off the above trick, so we gotta make it ugly…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised HPI'</span><span class="p">,</span><span class="s">' HPI'</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">unique</span><span class="p">()</span>
<span class="n">lgds</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[:</span><span class="mi">3</span><span class="p">],</span> <span class="o">*</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">lgds</span><span class="p">,[</span><span class="s">'Index'</span><span class="p">,</span><span class="s">'Price'</span><span class="p">]],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'LGD'</span><span class="p">,</span><span class="s">'Metric'</span><span class="p">])]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>(Antrim and Newtownabbey, Index)</th>
<th>(Antrim and Newtownabbey, Price)</th>
<th>(Ards and North Down, Index)</th>
<th>(Ards and North Down, Price)</th>
<th>(Armagh City, Banbridge and Craigavon, Index)</th>
<th>(Armagh City, Banbridge and Craigavon, Price)</th>
<th>(Belfast, Index)</th>
<th>...</th>
<th>(Fermanagh and Omagh, Index)</th>
<th>(Fermanagh and Omagh, Price)</th>
<th>(Lisburn and Castlereagh, Index)</th>
<th>(Lisburn and Castlereagh, Price)</th>
<th>(Mid and East Antrim, Index)</th>
<th>(Mid and East Antrim, Price)</th>
<th>(Mid Ulster, Index)</th>
<th>(Mid Ulster, Price)</th>
<th>(Newry, Mourne and Down, Index)</th>
<th>(Newry, Mourne and Down, Price)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<p>We <em>could</em> turn this into a proper MultiIndex, but that would mean pushing the Period/Year/Quarter columns into index keys, which would be inconsistent with the rest of the ‘cleaned’ dataset; that can be a downstream problem. At least we’ve got the relevant metrics consistent!</p>
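<p>As a minimal, self-contained sketch (with made-up numbers, not the NISRA figures), this is the behaviour being relied on: a flat column index that mixes plain labels with <code class="language-plaintext highlighter-rouge">(LGD, Metric)</code> tuples still supports tuple-based selection.</p>

```python
import pandas as pd

# Hypothetical miniature of the table above: plain labels for the date
# column, (LGD, Metric) tuples for the per-district metrics.
lgds = ["Belfast", "Mid Ulster"]
cols = ["Period", *pd.MultiIndex.from_product([lgds, ["Index", "Price"]])]
df = pd.DataFrame(
    [["2005Q1", 99.8, 118790.0, 102.7, 114882.0]],
    columns=cols,
)

# A tuple key still selects a single column, even though df.columns is a
# flat Index of mixed labels rather than a true MultiIndex.
belfast_price = df[("Belfast", "Price")]
print(belfast_price.iloc[0])
```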
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_5</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 5: Standardised House Price & Index for each Local Government District Northern Ireland
    * Build (LGD, Metric) tuple columns from the per-district Index/Price headers
"""</span>
<span class="c1"># Basic Cleanup first
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="c1"># Build multi-index of LGD / Metric [Index,Price]
</span> <span class="c1"># Two inner-columns per LGD
</span> <span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised HPI'</span><span class="p">,</span><span class="s">' HPI'</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="n">unique</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[:</span><span class="mi">3</span><span class="p">],</span> <span class="o">*</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">lgds</span><span class="p">,[</span><span class="s">'Index'</span><span class="p">,</span><span class="s">'Price'</span><span class="p">]],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'LGD'</span><span class="p">,</span><span class="s">'Metric'</span><span class="p">])]</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_5</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>(Antrim and Newtownabbey, Index)</th>
<th>(Antrim and Newtownabbey, Price)</th>
<th>(Ards and North Down, Index)</th>
<th>(Ards and North Down, Price)</th>
<th>(Armagh City, Banbridge and Craigavon, Index)</th>
<th>(Armagh City, Banbridge and Craigavon, Price)</th>
<th>(Belfast, Index)</th>
<th>...</th>
<th>(Fermanagh and Omagh, Index)</th>
<th>(Fermanagh and Omagh, Price)</th>
<th>(Lisburn and Castlereagh, Index)</th>
<th>(Lisburn and Castlereagh, Price)</th>
<th>(Mid and East Antrim, Index)</th>
<th>(Mid and East Antrim, Price)</th>
<th>(Mid Ulster, Index)</th>
<th>(Mid Ulster, Price)</th>
<th>(Newry, Mourne and Down, Index)</th>
<th>(Newry, Mourne and Down, Price)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">]</span><span class="o">=</span><span class="n">cleanup_table_5</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(12, 20)
</code></pre></div></div>
<h3 id="table-5a-number-of-verified-residential-property-sales-by-local-government-district">Table 5a: Number of Verified Residential Property Sales by Local Government District</h3>
<p>This one has a new problem: the Sale Year/Quarter values are squished together in a single column. That would do a few terrible things to our <code class="language-plaintext highlighter-rouge">basic_cleanup</code>, so it needs to be handled ahead of cleanup.
It also has annual total rows.</p>
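<p>The splitting step can be sketched on a toy series first (hypothetical values): <code class="language-plaintext highlighter-rouge">str.extract</code> returns one column per capture group, and rows that don’t match the pattern, like the annual totals, come back as NaN, which makes them easy to spot.</p>

```python
import pandas as pd

# Toy stand-in for the squished 'Sale Year/Quarter' column, including an
# annual-total row that should not parse.
s = pd.Series(["Q1 2005", "Q4 2021", "2021 Total"])

# Two capture groups become two columns; non-matching rows are NaN.
dates = s.str.extract(r"(Q[1-4]) ([0-9]{4})").rename(columns={0: "Quarter", 1: "Year"})
print(dates)
```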
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 5a: Number of Verified Residential Property Sales by Local Government District</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Sale Year/Quarter</td>
<td>Antrim and Newtownabbey</td>
<td>Ards and North Down</td>
<td>Armagh City, Banbridge and Craigavon</td>
<td>Belfast</td>
<td>Causeway Coast and Glens</td>
<td>Derry City and Strabane</td>
<td>Fermanagh and Omagh</td>
<td>Lisburn and Castlereagh</td>
<td>Mid and East Antrim</td>
<td>Mid Ulster</td>
<td>Newry, Mourne and Down</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>2647</td>
<td>3685</td>
<td>3333</td>
<td>5934</td>
<td>2327</td>
<td>1803</td>
<td>1181</td>
<td>3053</td>
<td>2346</td>
<td>1460</td>
<td>2000</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>89</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>90 rows × 12 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dates</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">'(Q[1-4]) ([0-9]{4})'</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'Quarter'</span><span class="p">,</span><span class="mi">1</span><span class="p">:</span><span class="s">'Year'</span><span class="p">})</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'Quarter'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">]:</span><span class="c1"># each insert at position 1 pushes the previous column right, so Year ends up before Quarter
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">dates</span><span class="p">[</span><span class="n">c</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">c</span> <span class="c1"># Need to have the right colname for when `basic_cleanup` is called.
</span>
</code></pre></div></div>
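<p>The ordering trick in that loop is easy to get backwards, so here’s a minimal sketch (toy frame, not the real data) of why iterating <code class="language-plaintext highlighter-rouge">['Quarter','Year']</code> leaves Year to the left:</p>

```python
import pandas as pd

# Toy frame: inserting 'Quarter' then 'Year' at position 1 leaves 'Year'
# leftmost, because each insert pushes the previous one to the right.
df = pd.DataFrame({"label": ["Q1 2005"], "value": [236]})
for c in ["Quarter", "Year"]:
    df.insert(1, c, ["x"])
print(list(df.columns))
```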
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 5a: Number of Verified Residential Property Sales by Local Government District</th>
<th>Year</th>
<th>Quarter</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Sale Year/Quarter</td>
<td>Year</td>
<td>Quarter</td>
<td>Antrim and Newtownabbey</td>
<td>Ards and North Down</td>
<td>Armagh City, Banbridge and Craigavon</td>
<td>Belfast</td>
<td>Causeway Coast and Glens</td>
<td>Derry City and Strabane</td>
<td>Fermanagh and Omagh</td>
<td>Lisburn and Castlereagh</td>
<td>Mid and East Antrim</td>
<td>Mid Ulster</td>
<td>Newry, Mourne and Down</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>NaN</td>
<td>NaN</td>
<td>2647</td>
<td>3685</td>
<td>3333</td>
<td>5934</td>
<td>2327</td>
<td>1803</td>
<td>1181</td>
<td>3053</td>
<td>2346</td>
<td>1460</td>
<td>2000</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>89</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>90 rows × 14 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
</code></pre></div></div>
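<p>The <code class="language-plaintext highlighter-rouge">fillna(False)</code> there is doing real work; a small sketch (toy series) shows that <code class="language-plaintext highlighter-rouge">str.contains</code> propagates NaN for blank rows, so the mask has to be filled before it can be negated and used for indexing:</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the first column: a data row, a 'Total' row, and a
# blank row like those at the bottom of the sheet.
s = pd.Series(["Q4 2021", "2021 Total", np.nan])

# str.contains returns NaN for NaN input; fillna(False) makes the mask
# purely boolean so ~mask can safely be used to index. The blank row
# passes through here and is left for later cleanup.
mask = s.str.contains("Total").fillna(False)
kept = s[~mask]
print(kept.tolist())
```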
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">offset</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Sale Year/Quarter</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey</th>
<th>Ards and North Down</th>
<th>Armagh City, Banbridge and Craigavon</th>
<th>Belfast</th>
<th>Causeway Coast and Glens</th>
<th>Derry City and Strabane</th>
<th>Fermanagh and Omagh</th>
<th>Lisburn and Castlereagh</th>
<th>Mid and East Antrim</th>
<th>Mid Ulster</th>
<th>Newry, Mourne and Down</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>Q1 2005</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>Q2 2005</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>Q3 2005</td>
<td>2005</td>
<td>Q3</td>
<td>757</td>
<td>960</td>
<td>968</td>
<td>1722</td>
<td>714</td>
<td>632</td>
<td>365</td>
<td>654</td>
<td>686</td>
<td>403</td>
<td>582</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>Q4 2005</td>
<td>2005</td>
<td>Q4</td>
<td>893</td>
<td>995</td>
<td>1199</td>
<td>1943</td>
<td>834</td>
<td>746</td>
<td>385</td>
<td>670</td>
<td>759</td>
<td>489</td>
<td>711</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>Q1 2006</td>
<td>2006</td>
<td>Q1</td>
<td>761</td>
<td>933</td>
<td>1038</td>
<td>1686</td>
<td>763</td>
<td>708</td>
<td>348</td>
<td>600</td>
<td>668</td>
<td>515</td>
<td>567</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>Q4 2020</td>
<td>2020</td>
<td>Q4</td>
<td>756</td>
<td>1052</td>
<td>974</td>
<td>1565</td>
<td>728</td>
<td>496</td>
<td>336</td>
<td>830</td>
<td>685</td>
<td>419</td>
<td>636</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>Q1 2021</td>
<td>2021</td>
<td>Q1</td>
<td>652</td>
<td>976</td>
<td>849</td>
<td>1497</td>
<td>610</td>
<td>466</td>
<td>290</td>
<td>762</td>
<td>572</td>
<td>349</td>
<td>486</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>Q2 2021</td>
<td>2021</td>
<td>Q2</td>
<td>724</td>
<td>1018</td>
<td>823</td>
<td>1581</td>
<td>675</td>
<td>447</td>
<td>316</td>
<td>850</td>
<td>629</td>
<td>375</td>
<td>503</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>Q3 2021</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>Q4 2021</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
</tbody>
</table>
<p>68 rows × 15 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_5a</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 5a: Number of Verified Residential Property Sales by Local Government District
* Parse the 'Sale Year/Quarter' to two separate cols
* Insert future-headers for Quarter and Year cols
    * Remove rows with 'Total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * Perform `basic_cleanup` with offset=2
"""</span>
<span class="c1"># Safety first
</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
</span> <span class="n">dates</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">'(Q[1-4]) ([0-9]{4})'</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'Quarter'</span><span class="p">,</span><span class="mi">1</span><span class="p">:</span><span class="s">'Year'</span><span class="p">})</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'Quarter'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">]:</span><span class="c1"># each insert at position 1 pushes the previous column right, so Year ends up before Quarter
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">dates</span><span class="p">[</span><span class="n">c</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">c</span> <span class="c1"># Need to have the right colname for when `basic_cleanup` is called.
</span>
<span class="c1"># Remove 'total' rows from the future 'Sale Year/Quarter' column
</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
<span class="c1"># Remove the 'Sale Year/Quarter' column all together
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:]</span>
<span class="c1"># Standard cleanup
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey</th>
<th>Ards and North Down</th>
<th>Armagh City, Banbridge and Craigavon</th>
<th>Belfast</th>
<th>Causeway Coast and Glens</th>
<th>Derry City and Strabane</th>
<th>Fermanagh and Omagh</th>
<th>Lisburn and Castlereagh</th>
<th>Mid and East Antrim</th>
<th>Mid Ulster</th>
<th>Newry, Mourne and Down</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>757</td>
<td>960</td>
<td>968</td>
<td>1722</td>
<td>714</td>
<td>632</td>
<td>365</td>
<td>654</td>
<td>686</td>
<td>403</td>
<td>582</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>893</td>
<td>995</td>
<td>1199</td>
<td>1943</td>
<td>834</td>
<td>746</td>
<td>385</td>
<td>670</td>
<td>759</td>
<td>489</td>
<td>711</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>761</td>
<td>933</td>
<td>1038</td>
<td>1686</td>
<td>763</td>
<td>708</td>
<td>348</td>
<td>600</td>
<td>668</td>
<td>515</td>
<td>567</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>756</td>
<td>1052</td>
<td>974</td>
<td>1565</td>
<td>728</td>
<td>496</td>
<td>336</td>
<td>830</td>
<td>685</td>
<td>419</td>
<td>636</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>652</td>
<td>976</td>
<td>849</td>
<td>1497</td>
<td>610</td>
<td>466</td>
<td>290</td>
<td>762</td>
<td>572</td>
<td>349</td>
<td>486</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>724</td>
<td>1018</td>
<td>823</td>
<td>1581</td>
<td>675</td>
<td>447</td>
<td>316</td>
<td>850</td>
<td>629</td>
<td>375</td>
<td>503</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
</tbody>
</table>
<p>68 rows × 14 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">]</span><span class="o">=</span><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(13, 19)
</code></pre></div></div>
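<p>For reference, the <code class="language-plaintext highlighter-rouge">str.extract</code> step at the heart of <code class="language-plaintext highlighter-rouge">cleanup_table_5a</code> can be sketched in isolation (a minimal illustration with made-up values; the real tables carry these strings in their first column):</p>

```python
import pandas as pd

# Illustrative stand-in for the munged first column of Table 5a
s = pd.Series(["Q1 2005", "Q2 2005", "2005 Total"])

# Same regex as the cleanup above: capture Quarter and Year separately
dates = s.str.extract(r"(Q[1-4]) ([0-9]{4})").rename(columns={0: "Quarter", 1: "Year"})

# Rows that don't match the pattern (e.g. 'Total' rows) come out as NaN,
# which is why the cleanup drops them in a separate step.
```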
<h3 id="table-6-standardised-house-price--index-for-all-urban-and-rural-areas-in-ni">Table 6: Standardised House Price & Index for all Urban and Rural areas in NI</h3>
<p>Wee buns, thankfully. It’s still mixing ‘HPI’ and ‘Index’ naming, but that’s a downstream problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Urban Areas HPI</th>
<th>Urban Areas Standardised Price</th>
<th>Rural Areas HPI</th>
<th>Rural Areas Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>101.309947</td>
<td>107723.320891</td>
<td>100.109860</td>
<td>124292.601178</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>104.402908</td>
<td>111012.079786</td>
<td>105.467951</td>
<td>138865.721275</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>111.163485</td>
<td>118200.631818</td>
<td>111.847591</td>
<td>138865.721275</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>114.871996</td>
<td>122143.908606</td>
<td>116.175119</td>
<td>144238.615701</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>118.187559</td>
<td>125669.361667</td>
<td>119.329374</td>
<td>148154.818847</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.610763</td>
<td>141005.619094</td>
<td>133.854953</td>
<td>166189.226014</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.077654</td>
<td>142565.370205</td>
<td>135.267264</td>
<td>167942.698911</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>138.575881</td>
<td>147348.355880</td>
<td>140.501443</td>
<td>174441.256673</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>142.840470</td>
<td>151882.912133</td>
<td>144.695321</td>
<td>179648.216283</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>142.375033</td>
<td>151388.010443</td>
<td>146.115278</td>
<td>181411.180623</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">]</span><span class="o">=</span><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(14, 18)
</code></pre></div></div>
<h3 id="table-7-standardised-house-price--index-for-rural-areas-of-northern-ireland-by-drive-times">Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</h3>
<p>Nearly wee buns; this one doesn’t have Year or Quarter headers, and the columns carry an extra <code class="language-plaintext highlighter-rouge">\n (Ref: Q1 2015)</code> suffix, which will complicate downstream analysis if that changes over time…</p>
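<p>If that reference-quarter suffix ever does drift, a defensive normalisation of the column names would insulate downstream code. A hedged sketch (the column names here are copied from Table 7; the stripping step is my suggestion, not part of the cleanup below):</p>

```python
import pandas as pd

# Two of Table 7's actual column names, one carrying the '(Ref: Q1 2015)' suffix
cols = pd.Index([
    "Drive time within 1hr of Belfast Index",
    "Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)",
])

# Strip any trailing '(Ref: ...)' annotation, whatever quarter it references
clean_cols = cols.str.replace(r"\s*\(Ref:.*\)", "", regex=True).str.strip()
```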
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>NaN</td>
<td>NaN</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time within 1hr of Belfast Index</td>
<td>Drive time within 1hr of Belfast Price\n(Ref:...</td>
<td>Drive time outside 1hr of Belfast Index</td>
<td>Drive time outside 1hr of Belfast Price\n(Ref...</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2015</td>
<td>Q1</td>
<td>100</td>
<td>124898.676844</td>
<td>100</td>
<td>122528.427865</td>
<td>100</td>
<td>128955.274996</td>
<td>100</td>
<td>111866.40498</td>
<td>100.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
<td>100.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.28391</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
<td>100.0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Year'</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Quarter'</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Year</td>
<td>Quarter</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time within 1hr of Belfast Index</td>
<td>Drive time within 1hr of Belfast Price\n(Ref:...</td>
<td>Drive time outside 1hr of Belfast Index</td>
<td>Drive time outside 1hr of Belfast Price\n(Ref...</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2015</td>
<td>Q1</td>
<td>100</td>
<td>124898.676844</td>
<td>100</td>
<td>122528.427865</td>
<td>100</td>
<td>128955.274996</td>
<td>100</td>
<td>111866.40498</td>
<td>100.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
<td>100.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.28391</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
<td>100.0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Drive time within 20mins of town of 10,000 or more Index</th>
<th>Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 20mins of town of 10,000 or more Index</th>
<th>Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time within 1hr of Belfast Index</th>
<th>Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 1hr of Belfast Index</th>
<th>Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>100.000000</td>
<td>124898.676844</td>
<td>100.000000</td>
<td>122528.427865</td>
<td>100.000000</td>
<td>128955.274996</td>
<td>100.000000</td>
<td>111866.404980</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.283910</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>107.430656</td>
<td>134179.467306</td>
<td>106.924715</td>
<td>131013.172436</td>
<td>106.240145</td>
<td>137002.270924</td>
<td>110.075053</td>
<td>123137.004353</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>108.909364</td>
<td>136026.354775</td>
<td>108.368772</td>
<td>132782.552750</td>
<td>107.604887</td>
<td>138762.178070</td>
<td>111.828874</td>
<td>125098.941485</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_7</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
* Insert Year/Quarter future-headers
* Clean normally
# TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
"""</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Year'</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Quarter'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Drive time within 20mins of town of 10,000 or more Index</th>
<th>Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 20mins of town of 10,000 or more Index</th>
<th>Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time within 1hr of Belfast Index</th>
<th>Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 1hr of Belfast Index</th>
<th>Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>100.000000</td>
<td>124898.676844</td>
<td>100.000000</td>
<td>122528.427865</td>
<td>100.000000</td>
<td>128955.274996</td>
<td>100.000000</td>
<td>111866.404980</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.283910</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>107.430656</td>
<td>134179.467306</td>
<td>106.924715</td>
<td>131013.172436</td>
<td>106.240145</td>
<td>137002.270924</td>
<td>110.075053</td>
<td>123137.004353</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>108.909364</td>
<td>136026.354775</td>
<td>108.368772</td>
<td>132782.552750</td>
<td>107.604887</td>
<td>138762.178070</td>
<td>111.828874</td>
<td>125098.941485</td>
</tr>
<tr>
<th>5</th>
<td>2016Q2</td>
<td>2016</td>
<td>Q2</td>
<td>111.263396</td>
<td>138966.509219</td>
<td>109.739250</td>
<td>134461.778232</td>
<td>110.208116</td>
<td>142119.179594</td>
<td>111.991819</td>
<td>125281.221326</td>
</tr>
<tr>
<th>6</th>
<td>2016Q3</td>
<td>2016</td>
<td>Q3</td>
<td>113.419541</td>
<td>141659.506269</td>
<td>112.426034</td>
<td>137753.851946</td>
<td>112.202571</td>
<td>144691.133778</td>
<td>115.398242</td>
<td>129091.864904</td>
</tr>
<tr>
<th>7</th>
<td>2016Q4</td>
<td>2016</td>
<td>Q4</td>
<td>113.928074</td>
<td>142294.657346</td>
<td>113.219995</td>
<td>138726.680412</td>
<td>112.508660</td>
<td>145085.851672</td>
<td>116.952798</td>
<td>130830.890712</td>
</tr>
<tr>
<th>8</th>
<td>2017Q1</td>
<td>2017</td>
<td>Q1</td>
<td>114.262386</td>
<td>142712.207695</td>
<td>113.549623</td>
<td>139130.567598</td>
<td>112.823330</td>
<td>145491.635911</td>
<td>117.341538</td>
<td>131265.759778</td>
</tr>
<tr>
<th>9</th>
<td>2017Q2</td>
<td>2017</td>
<td>Q2</td>
<td>115.566592</td>
<td>144341.144812</td>
<td>115.829688</td>
<td>141924.295411</td>
<td>114.288862</td>
<td>147381.515712</td>
<td>119.397027</td>
<td>133565.161466</td>
</tr>
<tr>
<th>10</th>
<td>2017Q3</td>
<td>2017</td>
<td>Q3</td>
<td>116.716428</td>
<td>145777.273752</td>
<td>117.061832</td>
<td>143434.022704</td>
<td>115.192251</td>
<td>148546.484147</td>
<td>121.273582</td>
<td>135664.395891</td>
</tr>
<tr>
<th>11</th>
<td>2017Q4</td>
<td>2017</td>
<td>Q4</td>
<td>117.925340</td>
<td>147287.189812</td>
<td>118.541541</td>
<td>145247.086931</td>
<td>116.101165</td>
<td>149718.576638</td>
<td>123.577530</td>
<td>138241.740123</td>
</tr>
<tr>
<th>12</th>
<td>2018Q1</td>
<td>2018</td>
<td>Q1</td>
<td>118.482802</td>
<td>147983.452250</td>
<td>120.184585</td>
<td>147260.282195</td>
<td>117.972698</td>
<td>152132.017714</td>
<td>122.075418</td>
<td>136561.381608</td>
</tr>
<tr>
<th>13</th>
<td>2018Q2</td>
<td>2018</td>
<td>Q2</td>
<td>119.443631</td>
<td>149183.514842</td>
<td>120.710551</td>
<td>147904.740501</td>
<td>117.686726</td>
<td>151763.241043</td>
<td>125.603705</td>
<td>140508.349303</td>
</tr>
<tr>
<th>14</th>
<td>2018Q3</td>
<td>2018</td>
<td>Q3</td>
<td>121.408923</td>
<td>151638.138779</td>
<td>122.222994</td>
<td>149757.912837</td>
<td>119.614730</td>
<td>154249.503782</td>
<td>127.210974</td>
<td>142306.342946</td>
</tr>
<tr>
<th>15</th>
<td>2018Q4</td>
<td>2018</td>
<td>Q4</td>
<td>123.531419</td>
<td>154289.108214</td>
<td>125.254013</td>
<td>153471.772944</td>
<td>121.462721</td>
<td>156632.586327</td>
<td>130.953652</td>
<td>146493.143086</td>
</tr>
<tr>
<th>16</th>
<td>2019Q1</td>
<td>2019</td>
<td>Q1</td>
<td>122.499375</td>
<td>153000.098716</td>
<td>123.207618</td>
<td>150964.357154</td>
<td>120.087144</td>
<td>154858.706540</td>
<td>129.712835</td>
<td>145105.085124</td>
</tr>
<tr>
<th>17</th>
<td>2019Q2</td>
<td>2019</td>
<td>Q2</td>
<td>124.397722</td>
<td>155371.109292</td>
<td>125.151589</td>
<td>153346.274193</td>
<td>122.486275</td>
<td>157952.513361</td>
<td>130.480634</td>
<td>145963.994647</td>
</tr>
<tr>
<th>18</th>
<td>2019Q3</td>
<td>2019</td>
<td>Q3</td>
<td>126.533407</td>
<td>158038.551430</td>
<td>128.647747</td>
<td>157630.061642</td>
<td>124.978137</td>
<td>161165.900455</td>
<td>133.258945</td>
<td>149071.990904</td>
</tr>
<tr>
<th>19</th>
<td>2019Q4</td>
<td>2019</td>
<td>Q4</td>
<td>127.126748</td>
<td>158779.626458</td>
<td>127.784267</td>
<td>156572.053236</td>
<td>124.547746</td>
<td>160610.887802</td>
<td>134.709059</td>
<td>150694.181735</td>
</tr>
<tr>
<th>20</th>
<td>2020Q1</td>
<td>2020</td>
<td>Q1</td>
<td>127.090324</td>
<td>158734.133127</td>
<td>128.619521</td>
<td>157595.476721</td>
<td>124.744271</td>
<td>160864.317472</td>
<td>135.068006</td>
<td>151095.722593</td>
</tr>
<tr>
<th>21</th>
<td>2020Q2</td>
<td>2020</td>
<td>Q2</td>
<td>127.200617</td>
<td>158871.887068</td>
<td>127.231209</td>
<td>155894.400318</td>
<td>125.021931</td>
<td>161222.374474</td>
<td>132.935126</td>
<td>148709.746207</td>
</tr>
<tr>
<th>22</th>
<td>2020Q3</td>
<td>2020</td>
<td>Q3</td>
<td>129.627870</td>
<td>161903.493901</td>
<td>131.083667</td>
<td>160614.756005</td>
<td>127.873046</td>
<td>164899.037745</td>
<td>135.952621</td>
<td>152085.309777</td>
</tr>
<tr>
<th>23</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.853240</td>
<td>165931.938294</td>
<td>135.873737</td>
<td>166483.954356</td>
<td>130.725554</td>
<td>168577.497418</td>
<td>142.032974</td>
<td>158887.182045</td>
</tr>
<tr>
<th>24</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.414458</td>
<td>167881.879606</td>
<td>136.978231</td>
<td>167837.272586</td>
<td>132.479831</td>
<td>170839.730568</td>
<td>142.532053</td>
<td>159445.483290</td>
</tr>
<tr>
<th>25</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.417605</td>
<td>174130.744141</td>
<td>142.727536</td>
<td>174881.805508</td>
<td>137.737680</td>
<td>177620.004609</td>
<td>147.652942</td>
<td>165174.037638</td>
</tr>
<tr>
<th>26</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>143.303934</td>
<td>178984.717757</td>
<td>147.615067</td>
<td>180870.420630</td>
<td>141.482937</td>
<td>182449.710341</td>
<td>153.161532</td>
<td>171336.300173</td>
</tr>
<tr>
<th>27</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>144.731984</td>
<td>180768.332630</td>
<td>149.013971</td>
<td>182584.475980</td>
<td>142.365385</td>
<td>183587.673223</td>
<td>156.204293</td>
<td>174740.127539</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(15, 17)
</code></pre></div></div>
<h3 id="table-8-number-of-verified-residential-property-sales-of-properties-in-urban-and-rural-areas-and-properties-in-rural-areas-by-drive-times-witihn-towns-of-10000-or-more-and-within-1-hour-of-belfast">Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times within towns of 10,000 or more and within 1 hour of Belfast</h3>
<p>We’re now getting into the swing of this!</p>
<p>This one has two problems we’ve already seen: munged Quarters/Years (this time with no header on that column…) and annual Total rows.</p>
<blockquote>
<p>Vee must deeel with it</p>
</blockquote>
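<p>The two fixes that <code class="language-plaintext highlighter-rouge">cleanup_table_5a</code> applies can be sketched on a toy frame (column names here are illustrative, not the real Table 8 headers):</p>

```python
import pandas as pd

# Toy stand-in: a munged period column with an annual 'Total' row mixed in
df = pd.DataFrame({
    "period": ["Q1 2015", "Q2 2015", "2015 Total", "Q1 2016"],
    "sales": [10, 20, 30, 40],
})

# Fix 1: drop the annual 'Total' rows
df = df[~df["period"].str.contains("Total").fillna(False)]

# Fix 2: split the munged strings into separate Quarter/Year columns
dates = df["period"].str.extract(r"(Q[1-4]) ([0-9]{4})")
df["Quarter"], df["Year"] = dates[0], dates[1]
```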
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">]).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Urban</th>
<th>Rural</th>
<th>Drive time within 20mins of town of 10,000 or more</th>
<th>Drive time outside 20mins of town of 10,000 or more</th>
<th>Drive time within 1hr of Belfast</th>
<th>Drive time outside 1hr of Belfast</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>3294</td>
<td>1322</td>
<td>898</td>
<td>424</td>
<td>976</td>
<td>346</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>3789</td>
<td>1500</td>
<td>1034</td>
<td>466</td>
<td>1142</td>
<td>358</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>4199</td>
<td>1640</td>
<td>1145</td>
<td>495</td>
<td>1250</td>
<td>390</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>4396</td>
<td>1780</td>
<td>1223</td>
<td>557</td>
<td>1342</td>
<td>438</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>4424</td>
<td>1731</td>
<td>1171</td>
<td>560</td>
<td>1263</td>
<td>468</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_8</span> <span class="o">=</span> <span class="n">cleanup_table_5a</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_8</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(16, 16)
</code></pre></div></div>
<h3 id="table-9-ni-average-sales-prices-q1-2005---q4-2021">Table 9: NI Average Sales Prices Q1 2005 - Q4 2021</h3>
<p>Wee buns</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Simple Mean</th>
<th>Simple Median</th>
<th>Standardised Price (HPI)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>115912.942222</td>
<td>100000</td>
<td>111920.268199</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>120481.290591</td>
<td>105000</td>
<td>116004.031639</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>128866.225917</td>
<td>115000</td>
<td>123386.352673</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>129649.092074</td>
<td>117000</td>
<td>127674.143865</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>132972.115070</td>
<td>120000</td>
<td>131302.064422</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>171803.199843</td>
<td>150000</td>
<td>147474.561707</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>176218.214924</td>
<td>150000</td>
<td>149084.306040</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>184144.458946</td>
<td>154950</td>
<td>154323.134643</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>173490.230508</td>
<td>155000</td>
<td>159028.118093</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>159965.154863</td>
<td>141000</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
<p>68 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(17, 15)
</code></pre></div></div>
<h3 id="table-9x-ni-average-sale-prices-xxxxx-property-q1-2005---q4-2021">Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021</h3>
<p>These are very similar to Tables 2x; i.e. they’re broken down by property type.</p>
<p>Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.</p>
<p>If that reminds you of anything, it’s because Table 7 was the same…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Simple Mean</th>
<th>Simple Median</th>
<th>Standardised Price (HPI)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>166314.816092</td>
<td>149972.5</td>
<td>160428.832662</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>173370.669076</td>
<td>155000.0</td>
<td>169686.542965</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>185397.896739</td>
<td>165000.0</td>
<td>180696.666810</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>186545.119355</td>
<td>165000.0</td>
<td>185323.883533</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>191328.398119</td>
<td>173000.0</td>
<td>188669.361197</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>243712.512641</td>
<td>220000.0</td>
<td>220592.113069</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>254182.439174</td>
<td>225000.0</td>
<td>224872.989982</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>268755.621299</td>
<td>235000.0</td>
<td>234734.715703</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>245860.399289</td>
<td>225000.0</td>
<td>239101.239764</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>244468.040738</td>
<td>219000.0</td>
<td>241131.373512</td>
</tr>
</tbody>
</table>
<p>68 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_9x</span> <span class="o">=</span> <span class="n">cleanup_table_7</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table9s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 9[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table9s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_9x</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(21, 11)
</code></pre></div></div>
<h3 id="table-10x-number-of-verified-residential-property-sales-by-type-in-xxxxx">Table 10x: Number of Verified Residential Property Sales by Type in XXXXX</h3>
<p>Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness…</p>
<ul>
<li>Annual-Total Rows</li>
<li>Munged Year/Quarter Column</li>
<li>That column having a silly (but contextual) name</li>
<li>a different offset</li>
</ul>
<p>Fortunately, we already have something like that from dealing with Table 5a!</p>
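<p>For context (the original definition of <code>cleanup_table_5a</code> falls outside this excerpt), the core of what that cleanup does to these tables can be sketched roughly as follows; the function and toy data here are illustrative, not the real implementation, and the real version also handles the banner rows and header offset:</p>

```python
# Rough sketch of the cleanup_table_5a-style steps described above
# (illustrative only -- the real function handles more spreadsheet quirks).
import pandas as pd

def sketch_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the annual "YYYY Total" summary rows
    df = df[~df.iloc[:, 0].str.contains("Total", na=False)].copy()
    # Split the munged "Q1 2005"-style column into Quarter and Year...
    df[["Quarter", "Year"]] = df.iloc[:, 0].str.split(" ", expand=True)
    # ...and rebuild a sortable "2005Q1"-style Period column
    df["Period"] = df["Year"] + df["Quarter"]
    return df.drop(columns=df.columns[0]).reset_index(drop=True)

toy = pd.DataFrame({
    "ANTRIM AND NEWTOWNABBEY": ["Q1 2005", "Q2 2005", "2005 Total"],
    "Total": [236, 735, 971],
})
print(sketch_cleanup(toy))  # two quarterly rows, no annual total
```

<p>The important point is that the annual-total rows are dropped rather than kept, since they can always be recomputed from the quarterly figures.</p>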
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 10a'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>ANTRIM AND NEWTOWNABBEY</td>
<td>Apartments</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Total</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>10</td>
<td>61</td>
<td>78</td>
<td>87</td>
<td>236</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>46</td>
<td>213</td>
<td>216</td>
<td>260</td>
<td>735</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>83</th>
<td>Q1 2021</td>
<td>52</td>
<td>222</td>
<td>212</td>
<td>166</td>
<td>652</td>
</tr>
<tr>
<th>84</th>
<td>Q2 2021</td>
<td>56</td>
<td>217</td>
<td>275</td>
<td>176</td>
<td>724</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>47</td>
<td>222</td>
<td>268</td>
<td>202</td>
<td>739</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>50</td>
<td>117</td>
<td>176</td>
<td>189</td>
<td>532</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>205</td>
<td>778</td>
<td>931</td>
<td>733</td>
<td>2647</td>
</tr>
</tbody>
</table>
<p>88 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 10a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Apartments</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>10</td>
<td>61</td>
<td>78</td>
<td>87</td>
<td>236</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>46</td>
<td>213</td>
<td>216</td>
<td>260</td>
<td>735</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>46</td>
<td>214</td>
<td>238</td>
<td>259</td>
<td>757</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>65</td>
<td>227</td>
<td>270</td>
<td>331</td>
<td>893</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>48</td>
<td>186</td>
<td>231</td>
<td>296</td>
<td>761</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>53</td>
<td>248</td>
<td>268</td>
<td>187</td>
<td>756</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>52</td>
<td>222</td>
<td>212</td>
<td>166</td>
<td>652</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>56</td>
<td>217</td>
<td>275</td>
<td>176</td>
<td>724</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>47</td>
<td>222</td>
<td>268</td>
<td>202</td>
<td>739</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>50</td>
<td>117</td>
<td>176</td>
<td>189</td>
<td>532</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_10x</span> <span class="o">=</span> <span class="n">cleanup_table_5a</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table10s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 10[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table10s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_10x</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(32, 0)
</code></pre></div></div>
<h2 id="and-were-done">And We’re Done!</h2>
<p>So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?</p>
<p>I hope this was helpful to someone else.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">ExcelWriter</span><span class="p">(</span><span class="s">'NI Housing Price Index.xlsx'</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="c1"># Thankfully these are semantically sortable otherwise this would be a _massive_ pain
</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">df</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">dest_df</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
<span class="n">df</span><span class="p">.</span><span class="n">to_excel</span><span class="p">(</span><span class="n">writer</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
</code></pre></div></div>
<ul>
<li><a href="/assets/2022-03-27-NI-House-Price-Index.ipynb">Notebook Here</a></li>
<li><a href="/assets/2022-03-27-NI-House-Price-Index.xlsx">Resulting Excel File Here</a></li>
</ul>
Lies, Damned Lies, and Data Science2022-01-04T18:49:00+00:00https://andrewbolster.info/2022/01/lies-damned-lies-and-data-science<blockquote>
<p>This talk was originally prepared for my 2021 Guest Lecture at UU Magee for the MSc Data Science course. And if it looks familiar, yes, the first bit is almost entirely lifted from <a href="2020/10/a-stranger-in-a-strange-land-data-science-onboarding-in-practice.html">A Stranger in a Strange Land</a> from last year.</p>
</blockquote>
<h2 id="intro">Intro</h2>
<p>Data Science is the current hotness.</p>
<p>While those of us in these virtual rooms may make fun of the likes of Dominic Cummings for extolling a ‘Data Driven Approach’ to policy, the reality is that Data Science as a buzzword bingo term has survived and indeed thrived in a climate where ‘Artificial Intelligence’ is increasingly derided as being something that’s written more in PowerPoint than Python, ‘Machine Learning’ still gives people images of liquid metal exoskeletons crushing powdery puny human skulls, and those in management with long memories remember what kind of <a href="https://economistsview.typepad.com/economistsview/2008/12/quantitative-me.html">mess</a> “Quantitative Analysis” got us into in 2008…</p>
<p><a href="https://twitter.com/matvelloso/status/1065778379612282885?lang=en"><img src="/img/2020/hqzDHa2mKqFaFwXYVScaExr0nA9wTxLEVaSr7fLl98SnsxoYa3OgtQPhx90rzyGIo90D2zboVcnwHCXk2hFsEQnJlv20EdhFgqd9uYW57DcngrRiO2NB7bFjj0lmZ3wiyo8VyP2Ocg" alt="img" /></a></p>
<p>Way back in 2012, the Harvard Business Review described Data Science as <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">“The Sexiest Job of the 21st Century”</a>, and since then it has been appearing in job specs and LinkedIn posts and research funding applications and business startup prospecta more than ever.</p>
<p>You’re not really doing tech unless you’ve got a few pet Data Scientists under your wing.</p>
<p><a href="https://nerdarchy.com/dd-ideas-artificer/"><img src="/img/2020/HdDYoDMx9A0qIoRJJF57EW5WHFNUfkZ8C5bcwb3e-ck4nntw_Ch_KAH8CzqLPU6FsUZLepfB-VzDcRsBM2DTeStdf5XS1rJVV1Nw1q9TDMiMIK30jRmUt-FuLvhbKj_q-dFu2CmJkhg" alt="img" /></a></p>
<p>Like some kind of mythical creature, these Data Scientists sit somewhere between Wizards, Artificers, and Necromancers, breathing business intelligence into glass and copper to give the appearance of wisdom from a veritable onslaught of data, wielding swords of statistical <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-tests</a>, shields made of the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Areas Under Curves</a>, and casting magicks of <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a>.</p>
<p><img src="/img/2020/iHRlCltQBUSLHxknWUF4-ctyA7E65b-tHfWU3Rc4Bt9KFtQCP3dJYGke5p9kGQh0sw3wzHvPwGXKwgwRdEFiZhcWnbGlzCntf54JWITuNCHfrG0wnHfiK6ktz1y_wxdTeQsdKQh-" alt="img" /></p>
<p>Like if Tony Stark and Stephen Strange fell into a blender and the Iron Mage appeared, extracting wisdom from the seen and unseen worlds around them and projecting it into the future…</p>
<p>But more often than not, it’s much more mundane…</p>
<p><a href="https://xkcd.com/1838/"><img src="/img/2020/machine_learning.png" alt="XKCD: Machine Learning" /></a></p>
<p>And it’s often in this mundanity of applying “standard” tools, techniques and analysis of data and stirring the pot until something interesting pops out that we are most likely to make mistakes, and that’s going to be the subject of this talk;</p>
<p><img src="/img/2022/shells.png" alt="Lies We Tell Our Shells" /></p>
<p><img src="/img/2022/10things.png" alt="Ten Things I Hate About Data Science" /></p>
<p><img src="/img/2022/fault.png" alt="The Fault In Our Stats" /></p>
<p><img src="/img/2022/correlations.png" alt="The Usual Correlations" /></p>
<p>This isn’t going to be a technical Data Science talk; we’re not opening up <a href="/2018/10/daily-dated-untitled-jupyter-notebooks.html">Jupyter</a> or firing up <a href="https://en.wikipedia.org/wiki/Apache_Spark">Spark</a> or <a href="https://www.tensorflow.org/tutorials">Tensorflow</a> or whatever. We’re not even going to talk about <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptrons</a> or <a href="https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute">Hidden neurons</a> or <a href="https://hackernoon.com/exploring-decentralization-with-homomorphic-encryption-a8295755ba5f">homomorphic cryptography</a>. This is about people, processes, and how to establish a healthy data science culture.</p>
<p>This is about numbers, aggregations, visualisation, and how you, as a Data Scientist, have a responsibility to look for logical pitfalls, and over time, curate that experience to constructively critique not just your own analytical work, but also that of the people, teams, and sometimes, executives around you.</p>
<p>Anyway, who am I to talk about this stuff?</p>
<h2 id="who-am-i--aka-you-can-skip-this-bit">Who am I ? (AKA you can skip this bit)</h2>
<p>My professional background started off by getting robotic dogs to piss on headmasters in front of 200 primary school kids and taking things apart and always having a few screws left over (or loose) at the end.</p>
<p>I eventually turned that “skillset” into something of a trade, by studying <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">electronics and software engineering at Queens</a>.</p>
<p>As part of this I got to test the launch of 4G networks in China from the grey comfort of an office in <a href="https://www.ericsson.com/en/about-us/company-facts/ericsson-worldwide/ireland">Athlone</a>, I moonlit as a technology consultant for a <a href="http://navigatorblue.com/">marketing and advertising firm</a> in Belfast, used massive clusters of GPUs to optimise cable internet delivery, and spent a summer developing BIOSs for <a href="https://www.pcengines.ch/">embedded computers</a> in Switzerland.</p>
<p>After that, and just in time for the financial crisis to make everyone question their career choices, I continued down the academic culvert to do a PhD, stealing shamelessly from the <a href="/2017/09/human-factors-in-autonomous-systems.html">sociologists</a> to make their “science” vaguely useful by teaching autonomous military submarines how to trust each-other.</p>
<p>More recently, I worked with a bunch of psychologists and marketers to teach machines how to understand human emotions using biometrics and wearable tech as the only Data Scientist.</p>
<p>This being a small start-up, that meant I did anything that involved Data, so from storage and network administration to statistical analysis, real-time cloud architecture to academic writing, and everything in between. This also somehow involved throwing people down mountains and developing lie detecting underwear. Ahh the joys of Start Ups.</p>
<p>After that I got to be a grown-up Data Scientist working at a cybersecurity firm specialising in real-time network intrusion systems, playing with terabytes of historical and real-time data trying to read the minds of hackers and script kiddies across the world who are throwing everything they can at some of the internet’s biggest institutions. This was my first taste of being a Data Scientist who wasn’t working completely alone…</p>
<h1 id="what-about-now-aka-start-reading-here">What about now? (AKA ‘Start reading here’)</h1>
<p>After two years of that, I got pinched to build a new team within an established Cyber Security group called WhiteHat Security, which had recently been acquired by NTT Security;</p>
<p>We have over 15 years of human-expert-trained data on whether and how customer websites can be vulnerable to attack. We have teams of hackers working 24/7 to try and break people’s websites before ‘the bad guys’ do to prove that they’re vulnerable, and one way or another, we have those footprints of investigation, and the company wanted to start doing something with that data, so they needed a Data Science group, and they needed a team lead.</p>
<p>In the time that I’ve been here, we’ve gone from almost zero ‘practical’ Data Science, to ML representing 87% of all of the assessment actions going across our platform; our group has also been contributing to next generation security data architectures with the Data Science Group as a critical future customer, rather than an opportunistic afterthought, and along the way we’ve come up with a couple of patents or patent-worthy things that I can’t really talk about yet!</p>
<p>I’ve been there two and a bit years and while this isn’t officially a careers talk, all I’ll say is I’m still really enjoying the work, and we have roles open across our Belfast operations, and a placement scheme in the works; if anyone’s interested, some creative googling will get you there. Or just email me later!</p>
<blockquote>
<p>In the time between when this talk was originally delivered and publication, a <a href="http://app.jobvite.com/m?3Uk0emwj">Data Science role</a> was opened up for UK/Remote work</p>
</blockquote>
<h1 id="lies-damned-lies-and-data-science">Lies, Damned Lies and Data Science</h1>
<p>But anyway; what do I mean when I’m talking about all these mistakes and failures that churn around our feet every time we’re wandering through data?</p>
<p>Fundamentally, there are a few significant themes of ‘mistake’, where well-intentioned, qualified, experienced and competent subject matter experts can wield all the right tools in all the right ways and still come to an incorrect, or at least incomplete, conclusion given a certain set of data.</p>
<p>These fall into a couple of general areas:</p>
<ul>
<li>Causation Inversion</li>
<li>Ignoring Contextual Features</li>
<li>Over-reliance on abstract measures of quality</li>
<li>Premature Aggregation</li>
</ul>
<h2 id="causation-inversion">Causation Inversion</h2>
<p>This one has a few different names that I’m sure many of you have heard of, and I hope at least one of these has appeared in your course so far!</p>
<ul>
<li>Correlation does not imply causation</li>
<li>Spurious Relationships</li>
<li>Cum/Post hoc ergo propter hoc (with/after this, therefore because of this)
<ul>
<li>Logicians and philosophers argue there’s a difference; I see them as similar fallacies under slightly different contexts (consequential vs abstract correlation)</li>
</ul>
</li>
<li>“That thing where Nicholas Cage is Drowning People”</li>
</ul>
<p>Put simply, Causation inversion is where you look at two or more variables observed over some dimension (usually time), and through the observation of some behaviour linkage, you can reasonably theorise that one variable is influencing the other.</p>
<p>While the ‘Nicholas Cage’ example gets a lot of attention, and without tripping over my own later topic of ‘Reliance on Abstract Measures of Quality’, it ‘only’ has an r value of 0.66.</p>
<p><a href="https://www.tylervigen.com/spurious-correlations"><img src="/img/2022/cage.png" alt="Number of people who drowned by falling into a pool correlated with films Nicolas Cage appeared in" /></a></p>
<p>Also, visually, we can pretty clearly see that there are some counter-correlations, like around 2002 where Cage upped his output to 3 (namely, <a href="https://en.wikipedia.org/wiki/Windtalkers">Windtalkers</a>, <a href="https://en.wikipedia.org/wiki/Sonny_(2002_film)">Sonny</a>, and <a href="https://en.wikipedia.org/wiki/Adaptation_(film)">Adaptation</a>). That year, drownings in fact reduced, contra-indicating the hypothesis that they are directly correlated.</p>
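<p>It’s worth remembering that the r values quoted here are just Pearson correlations, and two completely independent processes that merely both trend in the same direction will produce a ‘strong’ one. A toy sketch (all the numbers here are made up for illustration):</p>

```python
# Two independent noisy series that both happen to trend upward over time;
# neither drives the other, yet their Pearson r is "convincingly" high.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1999, 2010)

drownings = 100 + 2.0 * (years - 1999) + rng.normal(0, 1.0, len(years))
cage_films = 1 + 0.3 * (years - 1999) + rng.normal(0, 0.2, len(years))

# The high r is entirely down to the shared time trend
r = np.corrcoef(drownings, cage_films)[0, 1]
print(f"r = {r:.2f}")
```

<p>Chart both against time and they will appear to ‘track’ each other closely, which is exactly the trap.</p>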
<p>A much more quote-unquote “convincing” correlation is this one. It’s got an r of 0.95 which is pretty good I guess, and it certainly looks like they’re going the same direction.</p>
<blockquote>
<p>In the real presentation, which was a bit of Audience Participation where different parts of the below graph were exposed with guesses from the audience of what could be under the cards; that really doesn’t work in text…</p>
</blockquote>
<p><a href="https://www.tylervigen.com/spurious-correlations"><img src="/img/2022/cheese.png" alt="Per capita cheese consumption correlated with number of people who died by becoming tangled in their bedsheets" /></a></p>
<p>Remember, the point here isn’t “lol, graphs go burr” or even “r-values suck”, but we’ll come back to that.</p>
<p>The critical issue here isn’t anything to do with the numbers; it’s about you, as a quote-unquote “subject matter expert”, looking at the numbers, performing some reasonable analysis, and declaring “Cheese is killing people”.</p>
<p>We can make fun of this to a certain degree with the cherry picked examples I’ve put us through here, but causation inversion is lurking at the bottom of every time a manager, executive, client, or colleague asks you a question; always be aware that just because two factors are correlated, there’s no requirement in the universe that says that that means one leads to the other.</p>
<p>In fact, more often than not, these kinds of spurious or coincidental correlations indicate some other factor lurking under the surface that you’re not taking account of in your modelling.</p>
<p>Which leads us nicely into….</p>
<h2 id="ignoring-contextual-features">Ignoring Contextual Features</h2>
<p>This is another one with many related names: Simpson’s Paradox, Lord’s Paradox, the Suppressor Effect.</p>
<h3 id="simpsons-paradox">Simpson’s Paradox</h3>
<p><a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s Paradox</a> is a fun, weird, and occasionally disturbing consequence of the old “Lies, damned lies, and statistics” adage.</p>
<p>Fundamentally, Simpson’s Paradox is the effect where <em>identical</em> data can be used to support <em>directly contradictory</em> conclusions, if the context or causality of the underlying data is not taken into account.</p>
<p>This is commonly summarized in the two charts below; from the first graph, it’s obvious that there is a tightly correlated positive linear relationship between the X and Y values, you’d have to be blind to say anything else. However, due to some underlying structure or sub-division of values, the relationship can be totally flipped.</p>
<p><img src="/img/2022/simpson.png" alt="Simpsons Paradox" /></p>
<p>Using exactly the same values, depending on how or if we slice, group and aggregate our data, we can come to two totally opposite conclusions, with total statistical justification.</p>
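<p>A minimal numeric construction of the paradox, using nothing but pandas and some deliberately contrived values:</p>

```python
# Within each group, y falls as x rises (r = -1.0 in both),
# yet pooling the two groups flips the sign of the relationship.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 6, 7, 8],
    "y": [6, 5, 4, 11, 10, 9],
})

per_group_r = {g: sub["x"].corr(sub["y"]) for g, sub in df.groupby("group")}
pooled_r = df["x"].corr(df["y"])

print(per_group_r)         # {'A': -1.0, 'B': -1.0}
print(round(pooled_r, 2))  # ~0.81: positive, because group B sits higher and further right
```

<p>Whether you group before correlating, or pool everything, is a modelling decision about the underlying structure of the data, not a mechanical one.</p>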
<h2 id="real-world-research-case---uc-berkeley">Real World Research Case - UC Berkeley</h2>
<p>But what does this really mean in real life? One of the most famous examples of this paradox is the 1973 investigation into <a href="https://www.refsmmat.com/posts/2016-05-08-simpsons-paradox-berkeley.html">allegations of bias in the admissions criteria of UC Berkeley</a>.</p>
<p>From the data presented below, it is once again Obvious, and Self Evident that a higher percentage of men were admitted than women. It can be graphed easily, and the aggregations seem ‘fair’, ‘obvious’ and ‘natural’; all we do is take a count of all the men who applied, all the women who applied, and in each group, calculate the percentage that were successful.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Applicants</th>
<th style="text-align: center">Applied</th>
<th style="text-align: center">Admitted</th>
<th style="text-align: center">% Success</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Men</td>
<td style="text-align: center">19186</td>
<td style="text-align: center">8442</td>
<td style="text-align: center">44%</td>
</tr>
<tr>
<td style="text-align: center">Women</td>
<td style="text-align: center">12346</td>
<td style="text-align: center">4321</td>
<td style="text-align: center">35%</td>
</tr>
<tr>
<td style="text-align: center">Total</td>
<td style="text-align: center">31129</td>
<td style="text-align: center">12763</td>
<td style="text-align: center">41%</td>
</tr>
</tbody>
</table>
<p><img src="/img/2022/ucb1-bar.png" alt="img" /></p>
<p>Job done, fire the admissions board, issue a public apology and go home.</p>
<p>But not so fast! When researchers dug into the totals and looked at each department (presumably to find someone to blame at a lower faculty tier…), they found something surprising; in the most popular and highest intake departments, if anything there was a bias towards female applicants.</p>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center">Total</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Men</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Women</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Department</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
</tr>
<tr>
<td style="text-align: center">A</td>
<td style="text-align: center">933</td>
<td style="text-align: center">597</td>
<td style="text-align: center">64%</td>
<td style="text-align: center">825</td>
<td style="text-align: center">512</td>
<td style="text-align: center">62%</td>
<td style="text-align: center">108</td>
<td style="text-align: center">89</td>
<td style="text-align: center">82%</td>
</tr>
<tr>
<td style="text-align: center">B</td>
<td style="text-align: center">585</td>
<td style="text-align: center">369</td>
<td style="text-align: center">63%</td>
<td style="text-align: center">560</td>
<td style="text-align: center">353</td>
<td style="text-align: center">63%</td>
<td style="text-align: center">25</td>
<td style="text-align: center">17</td>
<td style="text-align: center">68%</td>
</tr>
<tr>
<td style="text-align: center">C</td>
<td style="text-align: center">918</td>
<td style="text-align: center">321</td>
<td style="text-align: center">35%</td>
<td style="text-align: center">325</td>
<td style="text-align: center">120</td>
<td style="text-align: center">37%</td>
<td style="text-align: center">593</td>
<td style="text-align: center">202</td>
<td style="text-align: center">34%</td>
</tr>
<tr>
<td style="text-align: center">D</td>
<td style="text-align: center">792</td>
<td style="text-align: center">269</td>
<td style="text-align: center">34%</td>
<td style="text-align: center">417</td>
<td style="text-align: center">138</td>
<td style="text-align: center">33%</td>
<td style="text-align: center">375</td>
<td style="text-align: center">131</td>
<td style="text-align: center">35%</td>
</tr>
<tr>
<td style="text-align: center">E</td>
<td style="text-align: center">584</td>
<td style="text-align: center">146</td>
<td style="text-align: center">25%</td>
<td style="text-align: center">191</td>
<td style="text-align: center">53</td>
<td style="text-align: center">28%</td>
<td style="text-align: center">393</td>
<td style="text-align: center">94</td>
<td style="text-align: center">24%</td>
</tr>
<tr>
<td style="text-align: center">F</td>
<td style="text-align: center">714</td>
<td style="text-align: center">43</td>
<td style="text-align: center">6%</td>
<td style="text-align: center">373</td>
<td style="text-align: center">22</td>
<td style="text-align: center">6%</td>
<td style="text-align: center">341</td>
<td style="text-align: center">24</td>
<td style="text-align: center">7%</td>
</tr>
</tbody>
</table>
<p><a href="https://www.kdnuggets.com/2020/09/simpsons-paradox.html"><img src="/img/2022/ucb2-bar.jpg" alt="Source: KDNuggets" /></a></p>
<p>The researchers <a href="https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf">instead concluded</a> that what was happening was that women were applying to more competitive departments, and that men were going for ‘less risky’ fields. Note in particular that the most popular departments for women (Dept C/E) had both a significant difference between numbers of male and female applicants, and indeed, had some of the lowest overall admission rates.</p>
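<p>We can reproduce the flip directly from the per-department table above; the figures below are transcribed from that table, and the aggregate rate is just the applicant-weighted combination of the per-department rates:</p>

```python
# Reproducing the Berkeley flip from the per-department table above.
# Figures are transcribed from the table; any transcription slips are mine.
men = {           # dept: (applied, admitted)
    "A": (825, 512), "B": (560, 353), "C": (325, 120),
    "D": (417, 138), "E": (191, 53), "F": (373, 22),
}
women = {
    "A": (108, 89), "B": (25, 17), "C": (593, 202),
    "D": (375, 131), "E": (393, 94), "F": (341, 24),
}

def rate(applied, admitted):
    return admitted / applied

# Per department, women do better in most departments...
women_ahead = [d for d in men if rate(*women[d]) > rate(*men[d])]

# ...but aggregated, men come out well ahead, because women applied
# disproportionately to the most competitive departments (C and E).
men_overall = sum(a for _, a in men.values()) / sum(n for n, _ in men.values())
women_overall = sum(a for _, a in women.values()) / sum(n for n, _ in women.values())

print(women_ahead)                         # depts where women's rate is higher
print(f"{men_overall:.0%} vs {women_overall:.0%}")
```

<p>Women win in four of the six departments, yet lose badly in the aggregate; the ‘lurking factor’ is which department each group applied to.</p>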
<p>A word of caution here before we move on however; I’m going to quote Curtis Wilson, one of the statisticians in the Data Science Group at NTT, in response to me talking about Simpson’s Paradox in work;</p>
<blockquote>
<p>It’s always worth mentioning with the UCB data that this doesn’t show there isn’t a bias at play. The follow-up questions should be “why do the departments which appeal to women more than men have lower admission rates? Is this related to historical under-funding?”. General lesson: we’ve identified we have subgroups that behave differently, so now we need to ask why they behave differently.</p>
</blockquote>
<p>So, as we said, the moral of the story is, be careful what causal or narrative explanations you use your data science skills for, and always try to dig down to make sure you can understand and contextualise the origins and intentionality of the data you’re using.</p>
<h3 id="practical-business-case---its-in-your-jeans">Practical Business Case - It’s In Your Jeans</h3>
<p>What if, instead of retroactive research, we’re operating as a start-up and using data to plan our go-to-market approach? Let’s say we want to sell jeans.</p>
<p>Since we’re fairly conventional folks, we don’t want to get into the bespoke sizing game; instead, we’re going to try and produce the simplest jeans that fit the most people.</p>
<p>And we’re all very clever and data driven, so we’re going to experiment by taking a sample of a population and measuring their heights. <a href="https://en.wikipedia.org/wiki/Vitruvian_Man">Thanks to good old Leo da Vinci</a>, we can assume that the optimal jean length is roughly half the height of the individual. And we can arbitrarily define a ‘comfort tolerance’ of plus/minus 3cm.</p>
<p><img src="/img/2022/leo-jean.png" alt="Yes, this is totally made up, but it doesn't matter" /></p>
<p>To make sure we’re comfortable with the numbers, we’ll start with a small scale and ramp up as we need to.</p>
<table>
<thead>
<tr>
<th> </th>
<th>height (cm)</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>155.06</td>
</tr>
<tr>
<td>B</td>
<td>157.83</td>
</tr>
<tr>
<td>C</td>
<td>173.85</td>
</tr>
<tr>
<td>D</td>
<td>158.78</td>
</tr>
<tr>
<td>E</td>
<td>170.03</td>
</tr>
<tr>
<td>F</td>
<td>160.80</td>
</tr>
<tr>
<td>G</td>
<td>175.18</td>
</tr>
<tr>
<td>H</td>
<td>158.58</td>
</tr>
<tr>
<td>I</td>
<td>168.70</td>
</tr>
<tr>
<td>J</td>
<td>154.79</td>
</tr>
</tbody>
</table>
<p>So, with our sample of 10 people, we can do a few easy things first;</p>
<p>We can identify that our tallest individual is 175cm, roughly 5’9”, and shortest is 154cm, so around 5’ nothing.</p>
<p>We can take a straight average by adding up all the values and dividing that by the number of individual values, so in this case we end up with 163cm, or around 5’4”</p>
<p>Now, there’s another measure we could use that we’ll include for simplicity; the Median. This is a dark and magical term which basically means “if you sorted all the values, which one would be in the middle”. Another way to think about it is that if you pick the correct ‘median’ value, 50% of the values will be higher, and 50% will be lower. And that way we end up with 160cm.</p>
<p><img src="/img/2022/jeans-10.png" alt="Beware of Geeks carrying Boxplots" /></p>
<p>This is important to flag; not all ‘averages’ are created equally, but that’s a story for another day.</p>
<p>So, big question at the end is “How big is our potential market share if we go with the average?”.</p>
<p>10%. Only one member of our sample could actually wear our ‘average’ jeans.</p>
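<p>As a sketch, here’s that whole exercise in a few lines; note the ‘plus/minus 3cm’ tolerance is a bit ambiguous, so I’m reading it here as a ±3cm window on height itself, which is the reading that reproduces the 10% figure:</p>

```python
from statistics import mean, median

# Our sample of ten heights (cm), from the table above.
heights = [155.06, 157.83, 173.85, 158.78, 170.03,
           160.80, 175.18, 158.58, 168.70, 154.79]

avg = mean(heights)
mid = median(heights)

# Assumed reading of the comfort tolerance: a customer is covered if their
# height is within 3cm of the height our 'average' jeans were cut for.
covered = [h for h in heights if abs(h - avg) <= 3]
share = len(covered) / len(heights)
print(f"average {avg:.1f}cm, median {mid:.1f}cm, coverage {share:.0%}")
```

<p>Only one of the ten samples lands inside the comfort window around the mean.</p>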
<blockquote>
<p>Spoilers, but exactly the same problem was found when the <a href="https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html">USAF attempted to find the ‘average’</a> pilot to design the Goldilocks of cockpits. Instead, they decided to make everything customisable.</p>
</blockquote>
<p>But maybe we just don’t have enough data; so let’s try again. This time we spend a load of money to measure 3,500 people, and we go through the same exercise, coming up with hopefully more generalised numbers: this time we get an Average of 166cm and a Median of 165cm, which is pretty convenient and could be used to imply that our data set was nicely balanced and we didn’t have any significant ‘lumps’ in our data. And this time, we get a new ‘coverage’ of 20%, great! But can we do more?</p>
<p><img src="/img/2022/jeans-3500.png" alt="We are legion" /></p>
<p>Instead of looking at the data as 1-dimensional measurements, we can use a histogram to count how many people were measured at particular heights. We can also superimpose our ‘comfort range’ to get a visual clue of what we’re actually covering here.</p>
<p><img src="/img/2022/jeans-hist.png" alt="Histograms are Bestograms" /></p>
<p>But now I think we can see the problem; we’ve got the average and median heights, but they’re not the most popular heights in the sample. And looking at the ‘camel humps’ distribution (otherwise known as a <a href="https://en.wikipedia.org/wiki/Multimodal_distribution">bimodal distribution</a>), we might be able to infer that there’s an underlying structure that we’re missing.</p>
<p>And we’d be right! The hump on the left constitutes predominantly women, and the hump on the right constitutes predominantly men.</p>
<p><img src="/img/2022/jeans-hist-gender.png" alt="If the jeans fit" /></p>
<p>If we instead shift our window to target the average woman, we up our coverage to 26%, and if we slightly compromise on our initial ‘one size fits all’ vision and make a men’s version as well, we can up our coverage to 49% of the market; much more healthy for the investor meeting.</p>
<p><img src="/img/2022/jeans-hist-gender-dual.png" alt="Two Jeans Are Better Than One" /></p>
<h3 id="the-alabama-paradox---even-when-youre-right-youre-wrong">The Alabama Paradox - Even When You’re Right; You’re Wrong</h3>
<p>There’s a related phenomenon that anyone who watches the <a href="https://www.youtube.com/user/standupmaths">Stand-Up-Maths channel with Matt Parker on YouTube</a> may recognise; it isn’t so much a data science thing as a strange quirk of mathematics that appeared in Political Science, and while this is a bit of a segue, I won’t go too far into the weeds;</p>
<p>Basically, in the United States, the House of Representatives is supposed to be… representative; that is, the number of seats allocated (or, apportioned) to each state should be proportional to the population of that state. Seems pretty simple; take the population of a state, divide by the population of the country, multiply by the number of seats in the chamber, and get the job done; right?</p>
<p>Oh, ok, except for the decimals… Ok, so just round things then and we’re done, right? And all these steps that we’ve taken are objectively, demonstrably, fair? Right?</p>
<p>Well, not quite; <a href="https://www.youtube.com/watch?v=GVhFBujPlVo">Matt tells it better than I do</a>, but the top line is that there are circumstances where changing the number of seats had unexpected results, specifically in 1880 when it was discovered that increasing the number of seats in the House from 299 to 300 would in fact reduce Alabama’s apportionment from 8 to 7, significantly reducing the ‘representation’ of that state.</p>
<p>From Wikipedia;</p>
<blockquote>
<p>An actual impact was observed in 1900, when Virginia lost a seat to Maine, even though Virginia’s population was growing more rapidly […]</p>
</blockquote>
<p>Also from Wikipedia, here’s a slightly simpler worked example to think it through; 3 ‘states’, 14 ‘people’, 10 seats, and we can all do the rounding, so this all looks legit.</p>
<p>Until some pesky legislator says we need more seats, citing something like ‘fairness’ or ‘I like prime numbers’.</p>
<p>And State C suddenly goes from 20% of the representative body to a 9% representation.</p>
<table>
<thead>
<tr>
<th> </th>
<th> </th>
<th>With 10 seats</th>
<th> </th>
<th>With 11 seats</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td>State</td>
<td>Population</td>
<td>Fair share</td>
<td>Seats</td>
<td>Fair share</td>
<td>Seats</td>
</tr>
<tr>
<td>A</td>
<td>6</td>
<td>4.286</td>
<td>4</td>
<td>4.714</td>
<td>5</td>
</tr>
<tr>
<td>B</td>
<td>6</td>
<td>4.286</td>
<td>4</td>
<td>4.714</td>
<td>5</td>
</tr>
<tr>
<td>C</td>
<td>2</td>
<td>1.429</td>
<td>2</td>
<td>1.571</td>
<td>1</td>
</tr>
</tbody>
</table>
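<p>For the curious, the scheme in play here is the largest-remainder (‘Hamilton’) method, which was the method in use when the Alabama paradox was discovered; a short sketch of it reproduces the table above:</p>

```python
from math import floor

def hamilton(populations, seats):
    """Largest-remainder (Hamilton) apportionment."""
    total = sum(populations)
    quotas = [p * seats / total for p in populations]
    alloc = [floor(q) for q in quotas]
    # Hand the leftover seats to the largest fractional remainders.
    leftovers = seats - sum(alloc)
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftovers]:
        alloc[i] += 1
    return alloc

pops = [6, 6, 2]              # states A, B, C from the table
print(hamilton(pops, 10))     # C gets 2 of 10 seats...
print(hamilton(pops, 11))     # ...but only 1 of 11
```

<p>Adding a seat shrinks C’s remainder relative to A’s and B’s, so the ‘fair’ rounding flips against it.</p>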
<p>What does this have to do with Data Science? Bear with me, because I’ve seen this happen in the wild, and it’s a strange one; multi-label classification tasks.</p>
<p>I was working on a system to detect and classify emotional states in humans from biological markers like heart rate, breathing rate, galvanic skin response, vocal timbre, and acceleration over time. Sounds like fun, and it was; we were doing continuous mapping, i.e. we had a vectorised emotional space such that we could project any ‘emotion’ into a series of coordinates in a projected space, and then ‘map’ those values back out to something also continuous, like colour space, or even an ‘emotional noise generator’ that a colleague had trained.</p>
<p>All was well until someone said “Yeah, this is cool and all but I want it in words”; so we started off with the classical <a href="https://www.paulekman.com/universal-emotions/">“Ekman Seven”</a> of Anger, Contempt, Disgust, Enjoyment, Fear, Sadness and Surprise, and got to training.</p>
<p>There was a wealth of training and academic data around these so this was quite positive and smooth. Until someone wanted us to add an eighth; ‘Contentment’.</p>
<p>Long story short, by adding an additional label option to our classifier, we in fact reduced the overall trained accuracy of our classifier, and when we eventually dug around, we found that it was exactly this kind of ‘rounding’ issue that was confounding our training. Since then, I keep an eye out for any time that labels are being changed….</p>
<h3 id="over-reliance-on-abstract-measures-of-quality">Over Reliance on Abstract Measures of Quality</h3>
<p>Speaking of measuring performance, one of the hairiest parts of Data Science is quantifying the ‘quality’ or ‘accuracy’ of data. One such metric of quality we mentioned earlier was the r-value. It’s technically the ‘r-squared’ value, but that’s a pick we don’t need to nit today.</p>
<p><img src="/img/2022/r-score.png" alt="How far away is one series from another" /></p>
<p>R-values range from 0 to 1 and are usually interpreted as summarizing the percent of variation in a given metric that is ‘explained’ by or ‘correlated’ to another metric or value. So, before, when Nicolas Cage was drowning people, one interpretation is to say that Nicolas Cage movies explain 66% of the variation in pool drownings.</p>
<p>That 66% sounds like a strong-ish correlation, but as we saw, it’s not great; equally, we saw that the really super high 95% correlation in cheese consumption didn’t actually mean anything.</p>
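<p>As a quick sketch of why an r-value alone proves nothing: any two series that merely both trend in the same direction over time will score a high r, related or not. The yearly counts below are completely made up:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two entirely unrelated, made-up yearly counts; both happen to drift upward.
films_per_year = [5, 7, 8, 11, 13]
drownings_per_year = [100, 110, 118, 125, 140]

r = pearson_r(films_per_year, drownings_per_year)
print(round(r, 3), round(r ** 2, 3))   # a very 'strong' correlation, meaning nothing
```

<p>Shared drift over time is enough to manufacture a ‘95%-grade’ correlation from noise.</p>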
<p>Many will be familiar with the often quoted <a href="https://xkcd.com/927/">XKCD about standards</a>, but the same thing can be said about quality metrics; there are a wide range of them that mean different things, and hopefully, some of these will already be familiar to you.</p>
<p>An easy one is ‘accuracy’, or ‘how many times did you get it right?’.</p>
<p>This is a clean, simple, management friendly metric, and nothing could possibly go wrong with something so simple?</p>
<p>At NTT, one of the projects we’ve delivered this year was a machine learning derived model for verifying if a website might be vulnerable to particular kinds of attack. As part of the training for that, we took over a decade of human-labelled and scored instances of true and false vulnerability observations across a huge swath of the internet. I believe it was on the order of a hundred million individual ‘samples’. So, we fired up the GPUs and let it run wild, optimising for the ‘accuracy’ metric.</p>
<p><img src="/img/2022/nasty-graph.png" alt="Graphs can Lie" /></p>
<p>On our first few passes something strange happened; we kept getting really high, more than 95% accuracy scores. In any normal environment, that would be a great success and then we would go home and take a month off and wait for our bonuses to roll in.</p>
<p>But, thankfully, we’re a sceptical bunch and we dug a bit deeper; basically, we were getting every single ‘false’ assessment correct, i.e. ‘this website does not have this vulnerability’, but we were incorrectly marking the ‘true’ cases, the ones we actually really cared about as ‘false’.</p>
<p>However, because in the real world, the real occurrence of vulnerabilities is thankfully rare (and thanks to products like ours, generally short lived), we had what is called a biased sample set. The ‘False’ set dwarfed the ‘True’ set. And because we were at that time looking to optimise accuracy, we succeeded in failing miserably.</p>
<p><img src="/img/2022/nasty-vulnerabilities.png" alt="Metrics can Lie" /></p>
<p>Thankfully there are other metrics; lots of them. And I’m not going to suggest you need to use them all, but in our case we evaluated the measures, conferred with our domain experts and our product team to work out what behaviour and tolerances were actually desirable, and settled on the <a href="https://en.wikipedia.org/wiki/Phi_coefficient">Matthews Correlation Coefficient</a> as the optimal target for that particular training task.</p>
<p><a href="https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion"><img src="/img/2022/table-of-confusion.png" alt="This is seriously called the Table of Confusion on Wikipedia and I can't argue with that" /></a></p>
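<p>A toy sketch (with made-up counts, in the same spirit as our vulnerability data) shows how the two metrics diverge on an imbalanced set; the formulas follow the standard confusion-matrix definitions:</p>

```python
from math import sqrt

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0 by convention when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy numbers: 1,000 sites, only 10 actually vulnerable.
# "Classifier" A just says 'not vulnerable' to everything:
acc_a, mcc_a = accuracy(0, 0, 990, 10), mcc(0, 0, 990, 10)
# Classifier B catches 8 of the 10, at the cost of 40 false alarms:
acc_b, mcc_b = accuracy(8, 40, 950, 2), mcc(8, 40, 950, 2)

print(acc_a, mcc_a)   # sky-high accuracy, zero MCC
print(acc_b, mcc_b)   # slightly lower accuracy, far more useful model
```

<p>Optimising for accuracy picks the do-nothing classifier; MCC correctly ranks the model that finds the vulnerabilities we actually care about.</p>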
<p>If we’d blindly deployed our Accuracy models, I’d almost certainly not have a job anymore!</p>
<h2 id="premature-aggregation">Premature Aggregation</h2>
<p>Finally, we come to my personal bug-bear of this line of work; premature aggregation.</p>
<p>8 out of 10 executives suffer from premature aggregation at some point in their careers. It’s nothing to be ashamed of, and you can seek guidance for how to resolve it.</p>
<p>With complex systems, the urge to simply roll everything up and take an average is strong, but as we saw in the jeans example, sometimes the average isn’t the best approach, and sometimes you simply should not hide a certain level of detail.</p>
<p>However, where you set that detail is a slippery slope, and I don’t have any hard and fast rules for you, so I’ll lay out a few examples;</p>
<p>Let’s talk about that jeans example again; last time we saw it, we had broken the decision space up into three factors: height, count, and gender.</p>
<p><img src="/img/2022/jeans-hist-gender-dual.png" alt="Two Jeans Are Still Better Than One" /></p>
<p>We already recognised that we’d ‘prematurely aggregated’ by not taking gender into account and just looking at the average and hoping for the best. But fundamentally, each one of these blocks in the histogram is made up of individuals, and individuals have all kinds of characteristics that we could dive into. So how do we know when to stop? Fundamentally, you’re just going to have to learn to make that judgement from experience and context. For instance, we could dive a little deeper and look into the influence of race on height distribution.</p>
<p><img src="/img/2022/jeans-hist-race.png" alt="You've got to stop somewhere" /></p>
<p>However, as you can see, it ends up being a bit of a mess. Part of this comes down to the choices of visualisation, but fundamentally; people are messy, and generally, the world is messy.</p>
<p>In these kind of situations, I try to go back to the motivation for any study or analysis I’m conducting; am I trying to convince someone of something? Am I trying to improve the performance of some process? Or am I trying to sell jeans to the most people.</p>
<p>From the data above, while there is significant variation in overall racial morphology, the ‘signal’ is nowhere near as strong as the Gender factor, and since race or ethnicity means nothing to whether an individual can wear jeans or not, it’s an unnecessary detail to the business case and we can say we’ve reached our ‘optimum level of aggregation’.</p>
<p>In my own work we had something similar. We have a scoring system, much like a credit score, that is intended to be an honest, representative, comparable measure of the ‘security hygiene’ of a website. It consists of 8 factors which aren’t massively important to the point, however they are (currently) presented monolithically, and of course, both the customers and our support teams were chasing this metric and were generally quite unhappy any time it went down, or even, didn’t go up.</p>
<p><img src="/img/2022/wsi-basic.png" alt="Metrics can definitely lie..." /></p>
<p>For anyone keeping track of “Internet Law Bingo”, what comes next is <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart’s Law</a>;</p>
<blockquote>
<p>When a measure becomes a target, it ceases to be a good measure.</p>
</blockquote>
<p>Without any significant thought, these ‘credit scores’ were simply averaged across a wide range of client assets, big, small, financial, gaming, healthcare, customer facing, internal, what have you, and for a significant period of time, executives chased this number and were…. disappointed… when they couldn’t “game the system”.</p>
<p><img src="/img/2022/wsi-dist.png" alt="Metrics can definitely lie..." /></p>
<p>In an attempt to combat this, our group have been working with the front end teams to unpack that premature aggregation, and to better share and explain that no, a single number doesn’t express the ‘hygiene’ of your entire company.</p>
<p><img src="/img/2022/wsi-detail.png" alt="Power to the Factors" /></p>
<p>For example, this is the distribution of different sites under a particular client’s control.</p>
<p>Now, for context, the theoretical maximum for this particular index is 800, and this particular client had the highest proportion of ‘near perfect scores’ of any of our clients.</p>
<p>But their security and executive teams were primarily focused on one thing: their average score of <strong>592</strong>, which put them largely in the middle of the road for high-end clients based on our earlier distribution; far from stellar but pretty good, and which just <em>didn’t reflect their actual security posture</em>.</p>
<p>What’s more galling from a Data Science perspective is that the 25-odd sites dragging their average down by over 100 points were all copies of the same application, deployed for different regions, and so shared the same vulnerabilities; fixing 3 vulnerabilities on that ‘one’ application across those 25 sites could have upped their scores, instead of fiddling around with their high-value, and already highly scored, sites.</p>
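<p>With illustrative numbers (not the client’s real data; just a fleet of well-secured sites plus 25 regional copies of one neglected application), the mean-versus-median gap is easy to see:</p>

```python
from statistics import mean, median

# Illustrative scores out of a theoretical maximum of 800:
# 100 well-secured sites, and 25 regional copies of one weak application.
scores = [700] * 100 + [160] * 25

print(mean(scores))    # the single number the executives were chasing
print(median(scores))  # a number that better reflects the estate
```

<p>A handful of duplicated low scorers drags the headline average down by over 100 points while the median, and the actual security posture, barely moves.</p>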
<h1 id="conclusion">Conclusion</h1>
<p>And with that, that’s it, we’re finally at the end. We’ve reviewed how just because metrics are correlated doesn’t mean they cause each other, that the devil is very often in the details in terms of measuring patterns and in dividing classes; that you need to be as careful in your choice of measures as in your data, and finally that it’s important to pick the appropriate level of abstraction, lest you lose the impact of what your analysis is trying to say.</p>
<p>There’s a quote floating around (generally <a href="https://quoteinvestigator.com/2021/01/18/confess/">attributed</a> to British Economist <a href="https://en.wikipedia.org/wiki/Ronald_Coase">Ronald Coase</a>);</p>
<blockquote>
<p>If you torture data long enough, it will confess to anything you’d like</p>
</blockquote>
<p>Being data driven is one thing, but when working with data, we need to also understand the underlying structure of the systems and phenomena that we’re measuring, planning, and deciding on. All the storage space and GPU time in the world won’t save you from screwing up bigly if you don’t know your problem domain.</p>
<p>Thanks for your time, and if you have any questions, I’m on twitter as @bolster and I’ve email addresses littered over the internet so google me!</p>
<h2 id="postscript">Postscript</h2>
<p>As part of the generation of film posters, I had a suggestion in from <a href="https://twitter.com/Jon_C_Rees/status/1465777730625146887">Jon Reese and Amy Pearson</a> and had to make it, so here’s a freebie.</p>
<p><a href="https://twitter.com/Bolster/status/1465776842426748939"><img src="/img/2022/fear.png" alt="Fear and Loathing in Stats Vegas" /></a></p>
Generative Adversarial Procrastination2021-11-26T15:07:00+00:00https://andrewbolster.info/2021/11/generative-adversarial-procrastination<p><em>TL:DR “Don’t worry about being a procrastinator, just make sure that your procrastinations are worthwhile.”</em></p>
<p>There’s an implicit irony in this post that I’ve been thinking / talking about writing it for at least 6 months, and it finally came down to a <a href="https://twitter.com/Bolster/status/1464233419714568198?s=20">tweet</a> to force me to do it.</p>
<blockquote>
<p>Fun fact, in the time it took for me to write this procrastination post, the twitter poll changed, so I guess I gotta delete it all and play Satisfactory now?</p>
</blockquote>
<p>I’m a procrastinator. Ridiculously so. To a degree that my procrastination at the moment is procrastination from procrastinating. To misquote a misquote from the New York Times; “It’s Procrastination all the way down”.</p>
<p>However, I’m still surprisingly productive in my own way; I’ve contributed to the operation of numerous charities, side projects, talks, meetups, guest lecture series; spent a load of time on deep background to local journalists who end up out of their depth in the stranger parts of the tech world; did a masters project on a 36 hour sleep cycle, and a 3 year PhD in 5 years, still play the Start-Up game every so often (or, at least, the fun bit of setting the world to rights in the bar with a few friends / colleagues convinced that we’ve got an idea that will change the world), and still faff around with a load of hardware and home automation stuff (that never seems to work for long)…</p>
<p>So I’m a workaholic right? Christ no. This morning I got out of bed at around noon after getting caught up on <a href="https://play.acast.com/s/blindboy">BlindBoy</a> and <a href="https://open.spotify.com/show/0rOatMqaG3wB5BF4AdsrSX">Robert Evans podcasts</a></p>
<p>Must be one of those ‘mono-taskers’? Nope; I currently have 2 different IDEs open, 12 text documents in various stages of progress and <a href="/2021/07/counting-tabs-and-background-tasks-taunting-goodharts-demon.html">currently 55 tabs open</a>, along with 3 different ‘in flight’ books, one of which has been ‘in-flight’ for over a year…</p>
<p>You must take great notes and plan everything and have a schedule for all the things? Nope, used to do that, and it’s more trouble than it’s worth. These days probably around 75+% of my ‘working time’ is ‘unstructured’ (including the ‘semi-structured’ bits I talk about below)</p>
<p>I think I finally worked out what my working style is, and considering my professional practice as a Data Scientist working in Machine Learning, it’s blindingly obvious in retrospect; I’m an Adversarial Procrastinator.</p>
<p>For that to make sense, I need to first talk about Ian Goodfellow, who IMO basically changed the game in terms of practical ML in 2014 with his publication of <a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf"><em>Generative Adversarial Nets</em></a>. Put simply; why train one big neural network with a whole pile of data, when you can train two neural networks both trying to ‘beat’ each other? This is the basis of almost <em>all</em> of the “Oh wow, that’s fake?” posts that you’ve seen shared around for the guts of a decade;</p>
<p><a href="https://thispersondoesnotexist.com/">ThisPersonDoesNotExist.com</a> (or <a href="https://thiscatdoesnotexist.com/">cats</a>)? GAN</p>
<p>Deepfakes? GANs on Video</p>
<p>That classic video game that had its textures from 20 years ago upscaled to 4k? GANs</p>
<p>Google Pixel 6’s “Magic Eraser”? GANs on image segmentation</p>
<p>Those Style Transfers of <a href="https://www.tensorflow.org/tutorials/generative/style_transfer">“What if Kandinsky painted a Lab?”</a>? Originally a CNN, but GANs have been used to identify ‘genuine’ pieces.</p>
<p>Put simply, for a given target task, GANs consist of two parts;</p>
<ul>
<li>Discriminator - This tries to accomplish the task by expressing a lower dimensionality output, usually a label or number, based on a higher dimensional input, usually an image</li>
<li>Generator - This tries to ‘trick’ the discriminator, by expressing a candidate input for the Discriminator, based on a candidate label.</li>
</ul>
<p>Basically, if the Generator tricks the Discriminator, the Discriminator then gets retrained with the new ‘fake’ information as a negative label, and the Generator gets retrained with the digital equivalent of a pat on the head for a good job done. Together, and in this loop, you end up with one network that’s really good at mapping from ‘images’ to ‘labels’, and one that’s really good at mapping from ‘labels’ to ‘images’.</p>
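<p>As a deliberately silly sketch of that loop (this is a caricature, not a real GAN: the ‘real data’ is a single number, and each player is one parameter), the alternating structure looks like this:</p>

```python
# A tiny caricature of the adversarial loop: the generator is one number
# trying to pass as 'real', and the discriminator is a threshold that
# always plays its best response between the real sample and the fake.
REAL = 5.0

def adversarial_loop(rounds=500, lr=0.1):
    fake = 0.0
    threshold = 0.0
    for _ in range(rounds):
        # Discriminator retrains: split the difference between real and fake.
        threshold = (REAL + fake) / 2
        # Generator retrains: nudge the fake toward the 'real' side of the
        # boundary (the digital pat on the head for fooling the critic).
        fake += lr * (threshold - fake)
    return fake, threshold

fake, threshold = adversarial_loop()
print(fake, threshold)   # both converge on the real value
```

<p>In a real GAN both players are deep networks and the updates are gradient steps against opposing loss functions, but the turn-taking structure is exactly this.</p>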
<p>This is the ‘Adversarial’ part, and it’s become pretty clear to me, that I do the same thing with my procrastination; I continuously push new ‘tasks’ onto the stack, so that when I’m getting pissed off / frustrated / bored / stuck with one thing that I have on my plate, I can ‘constructively’ procrastinate away from Task A by working on Task B.</p>
<p>This makes my use of task managers / issue trackers / calendars really … Strange.</p>
<p>I will block off sections of time in my calendar with built in procrastination targets, because I’ve accepted that if I’m not going to feel like doing Task A, I may as well do Task B instead.</p>
<p>For instance; 1400-1500 on Thursdays, I work on Project Management (yey jira….😭), or Documentation, that way, if I can’t be arsed looking at Jira again, I dive in to one of the many outstanding documentation tasks (mostly reviews, don’t worry) that are floating around.</p>
<p>(I balance this by having an ‘Personal Admin/faffing around’ block on Monday mornings that leans more heavily on the Faffing side, and a ‘Wrap Up’ on Friday nights that is really just there to make sure my American colleagues don’t accidentally invite me to a meeting that I have to take from a pub smoking area, don’t tell my CPO 😛 (<em>I’m kidding, they know already</em>))</p>
<p>Or, like I’m doing right now, I’m 4,000 words into a probably 6,000 word lecture on ‘Lies, Damned Lies and Data Science’ that I’m delivering to UU’s MSc Data Science classes next week, but instead I’m finally writing out something that I’ve been meaning to do for ages…</p>
<p>This all may indeed be stating the obvious, and it’s almost certainly not a ‘new task management paradigm’, but it took me a couple of years to work out and accept what I was doing, and to ‘tame the discriminator’ into accepting that I can ‘generate’ procrastination tasks that are, in the long term, at least as worthwhile as ‘the target task’; and that when I <em>do</em> get around to the ‘target task’, I’m coming to it fresh, with energy, and not just doing it “because you told yourself that you would”.</p>
<p>Anyway, guess that’s about it;</p>
<p>Don’t worry about being a procrastinator, just make sure that your procrastinations are worthwhile.</p>
UUIDs and You2021-11-10T18:00:00+00:00https://andrewbolster.info/2021/11/uuids-and-you<blockquote>
<p>The guts of this document was originally created as part of my work at NTT Application Security, stripped of its specificity and retained for my own reference.</p>
</blockquote>
<h2 id="background">Background</h2>
<p>Entities need to be identifiable, but the existence of entities should not be predictable, and it should not be easy for an external user/attacker to infer anything about the number or presence of entities.</p>
<p>Conventional auto-increment integer IDs were historically de rigueur for (now largely spurious) database performance optimisation reasons; however, they are susceptible to both presence estimation and scale estimation.</p>
<p>If a potential customer creates an entity with the integer id <code class="language-plaintext highlighter-rouge">180</code>, we can infer that globally, there are <code class="language-plaintext highlighter-rouge">179</code> existing entities of that type, and that the next one will probably be <code class="language-plaintext highlighter-rouge">181</code>.</p>
<p>Also, if implemented as a distributed/scalable system, contention for global ID generation becomes a limiting factor (and a challenge to any stateless expectations on such a system).</p>
<h3 id="what-about-using-nested-structures-like-customer10entity1">What about using nested structures like <code class="language-plaintext highlighter-rouge">/customer/10/entity/1</code></h3>
<p>This is an option; however, it couples the customer state to the individual state of a given entity, which introduces unnecessary dependency. Also, establishing the ‘schema’ for addressing arbitrary objects in this way becomes challenging. For instance, if a <code class="language-plaintext highlighter-rouge">customer</code> object has several <code class="language-plaintext highlighter-rouge">reading_list</code> mappings, each containing multiple <code class="language-plaintext highlighter-rouge">book</code> objects with potential (realised or not) many-to-many mappings, there are many ways to reference a particular book.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/book/123</code></li>
<li><code class="language-plaintext highlighter-rouge">/customer/10/book/123</code></li>
<li><code class="language-plaintext highlighter-rouge">/customer/10/reading_list/5/book/123</code></li>
</ul>
<h3 id="uuids">UUIDS!</h3>
<p>Universally Unique Identifiers (<a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUIDs</a>) are 128-bit values intended for the generation of globally unique identifiers without dependency on a central authority or coordination between the services generating them.</p>
<p>There are a range of UUID generation types, summarised below (v1-5 based on <a href="https://www.rfc-editor.org/rfc/inline-errata/rfc4122.html">RFC4122</a>).</p>
<table>
<thead>
<tr>
<th>Version</th>
<th>Description</th>
<th>Usecase</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Date Time (100ns) + MAC address</td>
<td>Basically guaranteed uniqueness between nodes</td>
</tr>
<tr>
<td>2</td>
<td>Date Time (7min) + MAC + Namespace/Domain ID</td>
<td>Security (but pretty useless)</td>
</tr>
<tr>
<td>3/5</td>
<td>Namespace + Name + Hash (MD5/SHA1)</td>
<td>Actively designed to enable “same uuid generation”</td>
</tr>
<tr>
<td>4</td>
<td>Random</td>
<td>Identity Generation</td>
</tr>
<tr>
<td>*6,7,8</td>
<td>v1 with Time Ordering</td>
<td>Guarantees Uniqueness, Ordering and Proximity</td>
</tr>
</tbody>
</table>
<p>*<a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html">IETF Proposal</a></p>
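<p>The split between the deterministic versions (v3/v5) and the random one (v4) is easy to demonstrate with Python’s standard <code class="language-plaintext highlighter-rouge">uuid</code> module:</p>

```python
import uuid

# v3/v5 are deliberately deterministic: the same namespace + name
# always yields the same UUID ("same uuid generation" by design)...
a = uuid.uuid5(uuid.NAMESPACE_DNS, "andrewbolster.info")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "andrewbolster.info")
assert a == b

# ...while v4 is fresh randomness on every call, so two draws
# will (for all practical purposes) never match.
assert uuid.uuid4() != uuid.uuid4()
```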
<h3 id="sidebar-on-collisions">Sidebar on Collisions</h3>
<p>They’re just stupidly unlikely, stop worrying about them:</p>
<blockquote>
<p>Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.</p>
</blockquote>
<p>Within most reasonable data models, it would be sufficient (or overkill to the point of paranoia) to check for the existence of an entity as part of its initial creation, and simply ‘reroll’ in the rare case a collision emerges, although doing so would itself forfeit a significant amount of the advantage of UUIDs in distributed/delay-tolerant systems.</p>
<p>However, for the timeseries variants (1/2/6/7/8), once you leave their ‘tick’, no future collision is possible, so as long as your generating nodes have unique MAC addresses, all will be well.</p>
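<p>If you insist on the belt-and-braces check anyway, the ‘reroll’ amounts to a couple of lines; here’s a sketch against a hypothetical in-memory <code class="language-plaintext highlighter-rouge">store</code> mapping:</p>

```python
import uuid

def create_entity(store, payload, max_attempts=3):
    # Check-then-reroll on the (stupidly unlikely) chance of a v4 collision.
    # `store` stands in for whatever persistence layer you actually have.
    for _ in range(max_attempts):
        candidate = uuid.uuid4()
        if candidate not in store:
            store[candidate] = payload
            return candidate
    raise RuntimeError("multiple v4 collisions; check your entropy source")
```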
<h3 id="sidebar-on-storage">Sidebar on Storage</h3>
<p>Naively, UUIDs look huge as 36-character strings, versus the teeny integer or even uint32/int32 fields we could use in other ID schemes, but compared to the size of the objects these keys will be referencing, this is a minuscule consideration. Even then, we can change the representation of these 128 bits in any number of ways, mapping them directly to bin16 blobs, or encoding them as base64 values.</p>
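<p>Those representation sizes are easy to sanity-check with the standard library:</p>

```python
import base64
import uuid

u = uuid.uuid4()

assert len(str(u)) == 36     # canonical hyphenated string representation
assert len(u.bytes) == 16    # raw 128 bits, suitable for a bin16 blob column
assert len(base64.urlsafe_b64encode(u.bytes)) == 24  # base64, padding included
```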
<p>What’s more relevant is the impact of UUIDs on back-end storage. Much hay has been made of these critiques over the years; here are a couple of highlights.</p>
<ul>
<li><a href="https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/">UUIDs are Popular, but Bad for Performance — Let’s Discuss</a>
<ul>
<li>Highlights different compromises to storage, in terms of impact on InnoDB insertion rates (Char is worst, bin16 is best), and on the ordering of ID’s for query lookup performance (injecting ‘order’ into random id generation makes proximal lookups more efficient (Note, this is basically what UUIDv6 does))</li>
</ul>
</li>
<li><a href="https://www.cybertec-postgresql.com/en/int4-vs-int8-vs-uuid-vs-numeric-performance-on-bigger-joins/">int4 vs int8 vs uuid vs numeric performance on bigger joins</a>
<ul>
<li>UUID introduces a 13% drop in join rates and a significant increase in index size scaling (note: the <code class="language-plaintext highlighter-rouge">numeric</code> type was even worse in the join case, with a 34% drop in join rates)</li>
</ul>
</li>
<li><a href="https://richardstartin.github.io/posts/uuids-and-compressibility">UUIDs and Compressibility</a>
<ul>
<li>Because UUIDs are generally ‘random’, they’re practically impossible to meaningfully compress</li>
<li>Also, string representation makes a big difference in key size scaling (binary -> 16 bytes, UTF-8 string -> 36 bytes)</li>
</ul>
</li>
</ul>
<h3 id="sidebar-on-applicability-of-this-to-the-microservicespersistence-architectures">Sidebar on applicability of this to the microservices/persistence architectures</h3>
<p>Within a service/state domain, no one cares what the underlying storage is doing; <strong>there is no reason that a persistence service couldn’t use auto-increment PKs for its internal state storage</strong> etc.</p>
<p>However, when entities are referenced, or expressed on the boundaries of that service, they should be referred to and stored using some kind of globally unique identifier.</p>
<h3 id="sidebar-on-intepretability">Sidebar on interpretability</h3>
<p>One valid criticism leveled at UUIDs is that it’s difficult / taxing for a human to tell whether two UUIDs are the same; similarly, they can be difficult to ‘share around’, i.e. shout across the offices we no longer use (greetings from COVID ‘21).</p>
<p>Even if the entity references are encoded as full-throated UUIDs, there’s no reason we can’t have other tools to support recognition and similarity.</p>
<p>One approach is <a href="https://francoisbest.com/posts/2021/hashvatars">‘hashvatars’</a>, i.e. functionally generated visual representations of large, un-friendly, numbers.</p>
<p><img src="/img/2021/hashvatar.png" alt="" /></p>
<p>These can be mapped such that ‘close’ hashes look similar (the author uses the word ‘soul’ to describe this), while nearby differences are highlighted, so that it is clear when two findings are very different.</p>
<p>But visual verification isn’t a solution for interpretability. Git has exactly the same problem, with even longer, 40-character SHA-1 hashes representing commits. However, for the vast majority of projects, <a href="https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection">7 characters of that hash is sufficient to represent the commit</a> <em>within the project</em> without colliding with project-external commits.</p>
<p>This kind of contextual reference (i.e. permitting the use of shortcodes or similar) could identify, for instance, a ‘book’ associated with a ‘customer’; even if several ‘global’ entities match that ‘shortcode’, the querier should only have access to one of them, so the storage backend can still resolve the reference unambiguously.</p>
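<p>As a sketch of that git-style contextual lookup (the <code class="language-plaintext highlighter-rouge">resolve</code> helper and its scoping are illustrative, not any particular backend’s API):</p>

```python
import uuid

def resolve(prefix, scoped_ids):
    # Within the querier's scope, a short hex prefix is usually unambiguous;
    # if it isn't, fail loudly rather than guess.
    matches = [u for u in scoped_ids if u.hex.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} entities")
    return matches[0]
```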
<h2 id="bolsters-general-rule-of-thumb">Bolster’s General Rule of Thumb</h2>
<p>For stateful entities such as Customer or Client records, or the long-term objects they create, these should be referred to by truly random identifiers, as one ‘client’ does not need to be proximal or relatable to another client; they just need to be held distinctly. IMO these should be UUIDv4 values.</p>
<p>For stateless events or ephemeral data or observations, these should be referred to by values that have lexical sorting and proximity, so as to support efficient window-range querying, caching and rollup. IMO these should be UUIDv6 (alternatives do exist; see the <a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html#name-background">UUIDv6 Background section</a>).</p>
<p>(uuid libraries supporting v6 exist in a range of languages, such as <a href="https://uuid.ramsey.dev/en/latest/nonstandard/version6.html">PHP</a>, <a href="https://github.com/uuid6/prototypes#prototypes">Python, Zig, Dart, Javascript and Go</a>; however, since UUIDv6 is simply an endianness-flip of UUIDv1, there should be no trouble in applying this to other languages/frameworks)</p>
<h2 id="uuidv4-for-entity-ids">UUIDv4 for Entity IDs</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">uuid</span>
<span class="k">def</span> <span class="nf">id_gen</span><span class="p">():</span>
<span class="k">return</span> <span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">()</span>
<span class="n">id_gen</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">UUID</span><span class="p">(</span><span class="s">'d9c11c39-4bdf-4903-83e5-163b77f6df23'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="uuidv4-for-event-ids">UUIDv6 for Event IDs</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">uuid</span>
<span class="k">def</span> <span class="nf">uuidv1tov6</span><span class="p">(</span><span class="n">u</span><span class="p">):</span>
<span class="c1">## UUIDV1 has its time-based components in 'little endian', i.e.
</span> <span class="c1"># Least significant words first
</span> <span class="c1"># see https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.2
</span> <span class="c1"># Equivalent to writing timestamps as SS.MM:HH DD-MM-YYYY
</span> <span class="c1"># Makes them time based but not lexically sortable or proximal
</span> <span class="c1"># This simple rearrangement fixes that
</span> <span class="n">uh</span> <span class="o">=</span> <span class="n">u</span><span class="p">.</span><span class="nb">hex</span>
<span class="n">tlo1</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="c1">#LSB
</span> <span class="n">tlo2</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">5</span><span class="p">:</span><span class="mi">8</span><span class="p">]</span>
<span class="n">tmid</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">8</span><span class="p">:</span><span class="mi">12</span><span class="p">]</span>
<span class="c1"># uh[12] contains the version number, i.e. 1, this is manually replaced below
</span> <span class="n">thig</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">13</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span> <span class="c1">#MSB
</span> <span class="n">rest</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">16</span><span class="p">:]</span>
<span class="n">uh6</span> <span class="o">=</span> <span class="n">thig</span> <span class="o">+</span> <span class="n">tmid</span> <span class="o">+</span> <span class="n">tlo1</span> <span class="o">+</span> <span class="s">'6'</span> <span class="o">+</span> <span class="n">tlo2</span> <span class="o">+</span> <span class="n">rest</span>
<span class="k">return</span> <span class="n">uuid</span><span class="p">.</span><span class="n">UUID</span><span class="p">(</span><span class="nb">hex</span><span class="o">=</span><span class="n">uh6</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">id_gen</span><span class="p">():</span>
<span class="k">return</span> <span class="n">uuidv1tov6</span><span class="p">(</span><span class="n">uuid</span><span class="p">.</span><span class="n">uuid1</span><span class="p">())</span>
<span class="n">id_gen</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">UUID</span><span class="p">(</span><span class="s">'1ebd8241-29fe-63bc-87f1-000d3a45a647'</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>Just because you’re paranoid doesn’t mean they’re not after you</p>
</blockquote>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="n">generated_ids</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10_000</span><span class="p">)):</span>
<span class="n">generated_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">id_gen</span><span class="p">())</span>
<span class="n">sleep</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)</span> <span class="c1"># Simulate actual time / work done
</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="c1"># Validate monotonicity, order, and uniqueness
</span> <span class="k">assert</span> <span class="nb">str</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o"><</span> <span class="nb">str</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
Counting Tabs and Background Tasks: Taunting Goodharts Demon2021-07-22T10:58:00+00:00https://andrewbolster.info/2021/07/counting-tabs-and-background-tasks-taunting-goodharts-demon<p>This was going to be a really quick post yesterday, but I’ve spent the guts of a day (between actual work) just getting the <em>simplest</em> bit of this working.</p>
<h2 id="the-intent">The intent</h2>
<p>I was silly enough to say this to someone recently in work:</p>
<blockquote>
<p>It’s a good day when I end up with fewer firefox tabs open than when I started</p>
</blockquote>
<p>And a plan was hatched;</p>
<ol>
<li>Find API to query Firefox for current number of open Tabs across all window instances</li>
<li>Send that value to my internal MQTT node (that has telegraf, influx, home assistant and grafana integrations too)</li>
<li>Repeat</li>
</ol>
<p>Spoiler alert, number 3 was the bastard</p>
<h2 id="getting-the-tabs">Getting the Tabs</h2>
<p>First step I thought was going to be more complicated. Fortunately <a href="https://github.com/balta2ar/brotab"><code class="language-plaintext highlighter-rouge">brotab</code></a> beat me to it (although I’m not a fan of the name…)</p>
<p>Once <code class="language-plaintext highlighter-rouge">brotab</code> is installed the first step of this is easy, if under-documented</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">brotab.main</span> <span class="kn">import</span> <span class="n">create_clients</span>
<span class="kn">from</span> <span class="nn">brotab.api</span> <span class="kn">import</span> <span class="n">MultipleMediatorsAPI</span>
<span class="n">n_tabs</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span>
<span class="n">MultipleMediatorsAPI</span><span class="p">(</span>
<span class="n">create_clients</span><span class="p">()</span>
<span class="p">).</span><span class="n">list_tabs</span><span class="p">([])</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'You have </span><span class="si">{</span><span class="n">n_tabs</span><span class="si">}</span><span class="s"> open, you shameless procrastinator'</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>You have 135 open, you shameless procrastinator</p>
</blockquote>
<h2 id="sending-the-message">Sending the Message</h2>
<p>Thankfully, I’ve done this a few times in a few projects, so this is simple enough.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">paho.mqtt.client</span> <span class="k">as</span> <span class="n">mqtt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="n">hostname</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostname</span><span class="p">()</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">mqtt</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">client_id</span><span class="o">=</span><span class="n">hostname</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'maguire'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">1883</span><span class="p">)</span> <span class="c1"># this is the internal hostname of my MQTT service. And No, not Harry
</span>
<span class="n">c</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hostname</span><span class="si">}</span><span class="s">/open_tabs"</span><span class="p">,</span>
<span class="n">n_tabs</span><span class="p">,</span>
<span class="n">retain</span><span class="o">=</span><span class="bp">True</span> <span class="c1"># this asks the MQTT service to persist the value, so clients _after_ publish can see the last value.
</span><span class="p">)</span>
</code></pre></div></div>
<p>Then you can use something like <a href="http://mqtt-explorer.com/">MQTT Explorer</a> to validate that it all works as expected.</p>
<p>At this point you can go and play with Grafana or Influx boards or however you want to think about using the value longer term, however, we’ve still got step three to get to….</p>
<h2 id="task-scheduler-more-like-task-mangler-amirite">Task Scheduler? More like Task Mangler, amirite?</h2>
<p>In ‘nix land, this is easy.</p>
<p><code class="language-plaintext highlighter-rouge">crontab -e</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*/15 * * * * * /home/bolster/anaconda3/bin/python /home/bolster/bin/count_tabs.py
</code></pre></div></div>
<p>Job done, go home, happy days.</p>
<p>Unfortunately, that’s not the case with Windows “Task Scheduler” (accessible via the Start menu).</p>
<p>I won’t go into the full rant, but here are select issues I came across.</p>
<ul>
<li>Regularly forgetting what user it was supposed to run under, reverting to ‘Medium Execution’ role or something similar, but it only notices that issue when the next cycle goes around and then fires a “Task Scheduler did not launch task “\TabPing” because user “(NONE)” was not logged on when the launching conditions were met. User Action: Ensure user is logged on or change the task definition to allow launching when user is logged off.”</li>
<li>Everyone and their dog has a different, contradictory, solution to ‘my task doesn’t start when scheduled’ on <a href="https://superuser.com/search?q=%22task+scheduler%22+not+executing">SuperUser</a></li>
<li>The Task Scheduler execution environment <em>isn’t</em> a shell, so commands like “START” will raise the helpful <code class="language-plaintext highlighter-rouge">Additional Data: Error Value: 2147942402.</code> error, which actually means ‘File not Found’, but everyone loves guru contemplations.</li>
<li>Windows <em>really</em> doesn’t like running things in the background; if you just go the ‘traditional route’ of <code class="language-plaintext highlighter-rouge">python.exe &lt;path to script file&gt;</code>, it’ll helpfully pop up and persist that command window during execution, which would be annoying as hell. Helpfully, <code class="language-plaintext highlighter-rouge">conda</code> ships with <code class="language-plaintext highlighter-rouge">pythonw.exe</code>, specifically designed to launch headless programs…. however….</li>
<li>You can’t call that directly without calling up the appropriate Conda environment :facepalm:, so we need to wrap the whole pointless thing in a <code class="language-plaintext highlighter-rouge">.bat</code> file to run it.</li>
<li>Any time you change <em>anything</em> about the task, it ‘resets’ the schedule, so if you have it triggered daily @ 9am, and you update the task definition after that, nothing will kick off until tomorrow 9am (fix is to just reset the start time to T+5mins or something)</li>
</ul>
<p>So, yeah, here goes.</p>
<p>count_tabs.py (lives in C:\Users\me, so YMMV)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">brotab.main</span> <span class="kn">import</span> <span class="n">create_clients</span>
<span class="kn">from</span> <span class="nn">brotab.api</span> <span class="kn">import</span> <span class="n">MultipleMediatorsAPI</span>
<span class="kn">import</span> <span class="nn">paho.mqtt.client</span> <span class="k">as</span> <span class="n">mqtt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="n">hostname</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostname</span><span class="p">()</span>
<span class="n">mqtt_host</span> <span class="o">=</span> <span class="s">'maguire'</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">mqtt</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">client_id</span><span class="o">=</span><span class="n">hostname</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">mqtt_host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">1883</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Connected to </span><span class="si">{</span><span class="n">mqtt_host</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">n_tabs</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">MultipleMediatorsAPI</span><span class="p">(</span><span class="n">create_clients</span><span class="p">()).</span><span class="n">list_tabs</span><span class="p">([]))</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Got </span><span class="si">{</span><span class="n">n_tabs</span><span class="si">}</span><span class="s"> tabs'</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hostname</span><span class="si">}</span><span class="s">/open_tabs"</span><span class="p">,</span>
<span class="n">n_tabs</span><span class="p">,</span>
<span class="n">retain</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Published'</span><span class="p">)</span>
</code></pre></div></div>
<p>count_tabs.bat (also C:\Users\me)</p>
<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">SET</span> <span class="kd">logfile</span><span class="o">=</span><span class="s2">"C:\Users\me\batch.log"</span>
@echo <span class="na">off</span>
@echo <span class="kd">Starting</span> <span class="kd">Script</span> <span class="nb">at</span> <span class="nv">%date%</span> <span class="nv">%time%</span> <span class="o">>></span> <span class="nv">%logfile%</span>
<span class="k">call</span> <span class="kd">C</span>:\Users\me\anaconda3\condabin\conda.bat <span class="kd">activate</span>
<span class="kd">C</span>:\Users\me\anaconda3\pythonw.exe <span class="s2">"C:\Users\me\count_tabs.py"</span> <span class="s2">"1>C:\stdout.txt"</span> <span class="s2">"2>C:\stderr.txt"</span>
@echo <span class="kd">finished</span> <span class="nb">at</span> <span class="nv">%date%</span> <span class="nv">%time%</span> <span class="o">>></span> <span class="nv">%logfile%</span>
</code></pre></div></div>
<p>And finally, the actual task as an XML that you <em>might</em> be able to import, no idea, good luck.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version="1.0" encoding="UTF-16"?></span>
<span class="nt"><Task</span> <span class="na">version=</span><span class="s">"1.4"</span> <span class="na">xmlns=</span><span class="s">"http://schemas.microsoft.com/windows/2004/02/mit/task"</span><span class="nt">></span>
<span class="nt"><RegistrationInfo></span>
<span class="nt"><Date></span>2021-07-21T14:11:30.6608587<span class="nt"></Date></span>
<span class="nt"><Author></span>GRANGER\me<span class="nt"></Author></span>
<span class="nt"><URI></span>\TabPing<span class="nt"></URI></span>
<span class="nt"></RegistrationInfo></span>
<span class="nt"><Triggers></span>
<span class="nt"><CalendarTrigger></span>
<span class="nt"><Repetition></span>
<span class="nt"><Interval></span>PT5M<span class="nt"></Interval></span>
<span class="nt"><StopAtDurationEnd></span>false<span class="nt"></StopAtDurationEnd></span>
<span class="nt"></Repetition></span>
<span class="nt"><StartBoundary></span>2021-07-22T09:00:00<span class="nt"></StartBoundary></span>
<span class="nt"><Enabled></span>true<span class="nt"></Enabled></span>
<span class="nt"><ScheduleByDay></span>
<span class="nt"><DaysInterval></span>1<span class="nt"></DaysInterval></span>
<span class="nt"></ScheduleByDay></span>
<span class="nt"></CalendarTrigger></span>
<span class="nt"></Triggers></span>
<span class="nt"><Principals></span>
<span class="nt"><Principal</span> <span class="na">id=</span><span class="s">"Author"</span><span class="nt">></span>
<span class="nt"><UserId></span>REDACTED EVEN THOUGH I HAVE NO IDEA IF IT MATTERS<span class="nt"></UserId></span>
<span class="nt"><LogonType></span>S4U<span class="nt"></LogonType></span>
<span class="nt"><RunLevel></span>LeastPrivilege<span class="nt"></RunLevel></span>
<span class="nt"></Principal></span>
<span class="nt"></Principals></span>
<span class="nt"><Settings></span>
<span class="nt"><MultipleInstancesPolicy></span>StopExisting<span class="nt"></MultipleInstancesPolicy></span>
<span class="nt"><DisallowStartIfOnBatteries></span>false<span class="nt"></DisallowStartIfOnBatteries></span>
<span class="nt"><StopIfGoingOnBatteries></span>true<span class="nt"></StopIfGoingOnBatteries></span>
<span class="nt"><AllowHardTerminate></span>false<span class="nt"></AllowHardTerminate></span>
<span class="nt"><StartWhenAvailable></span>true<span class="nt"></StartWhenAvailable></span>
<span class="nt"><RunOnlyIfNetworkAvailable></span>false<span class="nt"></RunOnlyIfNetworkAvailable></span>
<span class="nt"><IdleSettings></span>
<span class="nt"><StopOnIdleEnd></span>true<span class="nt"></StopOnIdleEnd></span>
<span class="nt"><RestartOnIdle></span>false<span class="nt"></RestartOnIdle></span>
<span class="nt"></IdleSettings></span>
<span class="nt"><AllowStartOnDemand></span>true<span class="nt"></AllowStartOnDemand></span>
<span class="nt"><Enabled></span>true<span class="nt"></Enabled></span>
<span class="nt"><Hidden></span>false<span class="nt"></Hidden></span>
<span class="nt"><RunOnlyIfIdle></span>false<span class="nt"></RunOnlyIfIdle></span>
<span class="nt"><DisallowStartOnRemoteAppSession></span>false<span class="nt"></DisallowStartOnRemoteAppSession></span>
<span class="nt"><UseUnifiedSchedulingEngine></span>true<span class="nt"></UseUnifiedSchedulingEngine></span>
<span class="nt"><WakeToRun></span>false<span class="nt"></WakeToRun></span>
<span class="nt"><ExecutionTimeLimit></span>PT0S<span class="nt"></ExecutionTimeLimit></span>
<span class="nt"><Priority></span>7<span class="nt"></Priority></span>
<span class="nt"></Settings></span>
<span class="nt"><Actions</span> <span class="na">Context=</span><span class="s">"Author"</span><span class="nt">></span>
<span class="nt"><Exec></span>
<span class="nt"><Command></span>"C:\Users\me\count_tabs.bat"<span class="nt"></Command></span>
<span class="nt"></Exec></span>
<span class="nt"></Actions></span>
<span class="nt"></Task></span>
</code></pre></div></div>
<h1 id="task-gotchas">Task Gotchas</h1>
<ul>
<li>To run a command properly in the background, ‘Run whether user is logged on or not’ needs to be selected, but also check “Do not store password”, because this has the helpful effect that <strong>if you get queried for a password on saving the task; the task has forgotten who you are</strong>, so pop back into the ‘Change User or Group’ dialog, select your user, and then go ahead.</li>
<li>The ‘Conditions’ Tab should not have <em>anything</em> checked; I got trapped in there for a good while.</li>
<li>You will really want to enable <strong>‘Enable All Tasks History’</strong> in the ‘Actions’ sidebar of the Task Scheduler</li>
</ul>
Response to TOG's Third Eviction2021-06-10T12:18:00+00:00https://andrewbolster.info/2021/06/response-to-tog-s-third-eviction<blockquote>
<p><em>This is a condensed version of a Twitter thread in response to <a href="https://twitter.com/silverSpoon">@silverSpoon</a> asking how Farset Labs would comment on recent news of Dublin’s hackerspace (<a href="https://www.tog.ie/">TOG</a>) being <a href="https://www.independent.ie/regionals/dublin/southside/life/there-should-be-space-for-community-groups-that-are-more-than-just-gaa-clubs-40502887.html">unceremoniously turfed out of their third home</a> thanks to developers just wanting to flip sites</em></p>
</blockquote>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Agreed, and it shouldn't be down to companies to fill up the gap. <a href="https://t.co/Np46Znn3xH">https://t.co/Np46Znn3xH</a></p>— Andrea (@silverSpoon) <a href="https://twitter.com/silverSpoon/status/1402231908457648134?ref_src=twsrc%5Etfw">June 8, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I wonder if <a href="https://twitter.com/FarsetLabs?ref_src=twsrc%5Etfw">@FarsetLabs</a> has comments on how that works on NI?</p>— Andrea (@silverSpoon) <a href="https://twitter.com/silverSpoon/status/1402232432473088000?ref_src=twsrc%5Etfw">June 8, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>It’s similar up here, and (in line with <a href="https://www.twitter.com/jeffrey_roe">@jeffrey_roe</a>’s comments in the piece) hackerspaces fall through a crack in local development support: between ‘entrepreneurship/startup support’ (aka subsidised professional services and office space); the classical arts sector’s individual and studio support (which has its own significant issues, often more bureaucracy than it’s worth, and stifles community revenue generation through arcane entertainments licensing that I’m sure <a href="https://twitter.com/adamturks">@adamtucks</a> will chip in on if you get him started); and lastly, the classical community centre/sports groups (such as the GAA clubs discussed in the original article).</p>
<p>I’ve also had pushback in the past along the lines of “Oh, you should go for this cross-community fund”, when even asking the monitoring question of “so which side are you from?” makes my skin crawl and is against the #hacker ethos.</p>
<p>In the case of <a href="https://www.farsetlabs.org.uk">Farset Labs</a>, our outgoings are >90% related to the building and its upkeep, and we spend the vast majority of our administrative time paying the landlord and the heating/electric bills when we should be teaching and learning; decades of young technologists’ energies have gone into pure governance, upkeep, and maintenance. And when we ask for help, there are always two ‘strata’ of help:</p>
<p>First, the sporadic, seemingly random £1-5k pots for individual projects, which aren’t really something you can put towards rent (or even most #hackerspace equipment); and second, the 6/7-figure annual funding programmes expecting staffs of dozens and 30-year leases.</p>
<p>So we’ve survived on our community just continuing to pay, even though, before the pandemic, we’d been shut down for about a year due to structural issues in our building requiring a <a href="https://blog.farsetlabs.org.uk/2019/06/farset-labs-v2-expanding-renovating-improving/">complete refit</a>.</p>
<p>As for ‘get the tech sector to pay for it’, there are very few models where this doesn’t create perverse incentives in terms of recruitment bias or restrictions to ‘only relevant’ practice; <a href="https://www.farsetlabs.org.uk">Farset Labs</a> is a community, not a startup, not an incubator, and not a co-working space.</p>
<p>For years, <a href="https://www.twitter.com/dsrkane">@dsrkane</a> had a great slide in his presentations asking people who they thought paid for @FarsetLabs, and people always just assumed that it was something paid for by <a href="https://twitter.com/belfastcc">@belfastcc</a> or <a href="https://twitter.com/InvestNI">@InvestNI</a>, because that’s what everyone else seems to get. But the fact is that a volunteer-run, self-funded STEAM charity that doesn’t claim members’ IP, doesn’t curate products, and doesn’t want to scam its community with exploitative fees isn’t something that interests bureaucrats. At least not the ones with the purse strings.</p>
<p>Disclaimer: Speaking as an individual with 10 years of pissing upstream to keep the place going, and not as a representative of the board.</p>
Apache Arrow in 5-10 Mins2021-06-01T10:54:00+00:00https://andrewbolster.info/2021/06/apache-arrow-in-5-10-mins<p>The below was presented as part of the Belfast Linux Users Group’s May Technical Meetup.</p>
<p>I’m also maintaining this ‘short’ as part of my gradually accumulating <a href="https://www.github.com/andrewbolster/python_data_workshop/">‘python_data_workshop’</a>, so that may be the best place to see updates if you’re reading this any later than 2021.</p>
<h1 id="conditions">Conditions</h1>
<ul>
<li>Theme: ‘Universal Screwdrivers’</li>
<li>Event: <a href="https://www.meetup.com/belfast-lug/events/278413010/">BLUG</a></li>
<li>Time: ‘5-10mins’ (take the 10)</li>
</ul>
<h1 id="slides">Slides</h1>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQmuoRkGBg8lNbiN8R4Is_ypl0IemJND5Ll5h69VgQp3NAqc2hcozGDa_bYs_GJFvP4jPXaBbtNtTx1/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h1 id="transcript">Transcript</h1>
<p>Evening folks, and I’m happy to be appearing in what we hope is the last virtual BLUG! We’ve all spent a long time toiling in our own little worlds in relative isolation over the course of this sequence of lockdowns, and it’s appropriate that now, as we begin coming back together, we’re talking about projects that operate across, and complement, entire ecosystems.</p>
<p><a href="https://fabienmaussion.info/scientific_programming/week_05/02-Scientific-Python.html"><img src="https://fabienmaussion.info/acinn_python_workshop/figures/scipy_ecosystem.png" alt="python_ecosystem" /></a></p>
<p>Now, as many of you might know, I’ve basically been a Python-only developer since my time at University; the strength of that ecosystem, with genre defining packages like <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">requests</code>, <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">beautifulsoup</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">jupyter</code>, and <code class="language-plaintext highlighter-rouge">flask</code> mean that in my career as a Data Scientist, I rarely feel a strong need to leave my comfort zone for my own work and exploration.</p>
<p>However, in both my professional and hobbyist software engineering and data science projects, I occasionally have to deal with … “Other People”, and these other people have their own ecosystems that they think suit their workflows better; and, while they have every right to be mistaken, we generally have to shuffle small, medium, or depressingly large datasets around and between these vast programming language star systems like some kind of “No Man’s Compiler”.</p>
<p>So when Phil mentioned that the theme was about cross-platform libraries and workflows, there was only one candidate that I could suggest, and that is the <a href="https://arrow.apache.org/">Apache Arrow</a> project.</p>
<p>On paper, and on spec, Arrow is fairly conventional as an in-memory data representation; it’s got columnar storage for analytic lookups, leveraging CPU cache locality to maximise local throughput, with a fairly conventional bitmask/offset/value format for fast partial eliminations, and then these row-groups can be efficiently swapped in and out at will, or <code class="language-plaintext highlighter-rouge">mmap</code>ed directly from disk to minimise paging, blah blah blah. We’ve had hyper optimised data structures for as long as we’ve had data.</p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/simd.png" alt="smid" /></a></p>
<p><a href="https://www.kdnuggets.com/2017/02/apache-arrow-parquet-columnar-data.html"><img src="https://www.kdnuggets.com/wp-content/uploads/dremio-columnar-2.jpg" alt="columnar" /></a></p>
<p>One of the exciting things about the Arrow project (well, one of the two exciting things) is that it’s a direct descendant of my go-to tool for any data problem, <code class="language-plaintext highlighter-rouge">pandas</code>; <code class="language-plaintext highlighter-rouge">pandas</code> author and maintainer Wes McKinney is the lead of the Arrow project. In fact, the Arrow project was effectively announced by Wes publishing his own <a href="https://www.slideshare.net/wesm/practical-medium-data-analytics-with-python">“10 Things I Hate About Pandas”</a> talk in 2013, which is well worth a <a href="https://wesmckinney.com/blog/apache-arrow-pandas-internals/">watch</a>. What this means is that the <code class="language-plaintext highlighter-rouge">arrow</code> memory model is a “clean” version of the <code class="language-plaintext highlighter-rouge">pandas</code> internal memory representation, optimised around all the pain points of <code class="language-plaintext highlighter-rouge">pandas</code>, like memory management, crummy categoricals, painful appends, etc. So if you’re using <code class="language-plaintext highlighter-rouge">pandas</code> (and especially if you use <code class="language-plaintext highlighter-rouge">pd.to_parquet</code>), you’re already an “Arrow-ite”.</p>
<p><a href="https://wesmckinney.com/"><img src="https://wesmckinney.com/images/wes-2017-01-12-small.png" alt="wes" /></a></p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/copy.png" alt="copy" /></a></p>
<p>But, to step away from either the Python-focused world and the internal memory representation world, the <em>really</em> interesting thing about <code class="language-plaintext highlighter-rouge">Arrow</code> as a project, is that it neatly sidesteps the classic serialisation loop, and all the overhead that comes from it. I’m going to take the following quote from Dremio’s summary of the Arrow project:</p>
<p><a href="https://www.reddit.com/r/Maps/comments/ek06ja/currencies_of_europe_countries_before_euro/"><img src="https://i.redd.it/1kwbacxyzs841.jpg" alt="" /></a></p>
<blockquote>
<p>To use an analogy, consider traveling to Europe for vacation before the European Union (EU) was established. To visit five countries in seven days, you could count on the fact that you were going to spend a few hours at each border for passport control, and you were going to lose some value of your money in the currency exchange. This is how working with data in-memory works without Arrow: enormous inefficiencies exist to serialize and deserialize data structures, and a copy is made in the process, wasting precious memory and CPU resources. In contrast, Arrow is like visiting Europe after the EU and the adoption of the common currency dubbed the euro: you don’t wait at the border, and one currency is used everywhere.</p>
</blockquote>
<p><a href="https://en.wikipedia.org/wiki/List_of_currencies_in_Europe"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Eurozone_map.svg/600px-Eurozone_map.svg.png" alt="" /></a></p>
<p>The Arrow format (and if you’re already using <code class="language-plaintext highlighter-rouge">pandas</code>, you’re 90% of the way there) provides compatible execution and common-value mapping across analytics platforms like Spark, Drill, or Impala; streaming systems like anything Kafka-shaped; and storage backends like HBase, Cassandra, and my friend, Parquet.</p>
<p>That also means that a common data format that lives in CPU cache, GPU shared memory, on disk, in the cloud or on a stream can be intuited, manipulated, and assessed in any way that suits you, leading to an implementation-agnostic data ecosystem, where your work is not defined by the method of analysis, but by the data that you express.</p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/shared.png" alt="shared" /></a></p>
<blockquote>
<p>“But, my data is too complicated for that”</p>
</blockquote>
<p>Possibly! And that’s fine; Arrow is not well equipped for some common use cases, particularly anything transactional or non-idempotent. There’s still active work being done to establish a common graph implementation, for instance, and just last week a paper was released demonstrating an almost-hardware POSIX interface that performs AWS Athena-like operations on local filesystems, leveraging CephFS, Arrow and Parquet.</p>
<p><a href="https://twitter.com/wesmckinn/status/1397912053898256384"><img src="/img/2021/cephfs.png" alt="cephfs" /></a></p>
<p>At the core (and this probably should have come at the top, but still), the fundamental data type architecture on which the whole thing rests is beautifully simple:</p>
<ul>
<li>Scalars:
<ul>
<li>Boolean</li>
<li>[u]int[8,16,32,64], Decimal, Float, Double</li>
<li>Date, Time, Timestamp</li>
<li>UTF8 String, Binary blob</li>
</ul>
</li>
<li>Complex:
<ul>
<li>Struct</li>
<li>List</li>
<li>Map</li>
</ul>
</li>
<li>Compound:
<ul>
<li>Union</li>
</ul>
</li>
</ul>
<p><em>N.b., all values are nullable via a validity bitmask</em></p>
<p>But anyway, to wrap things up, as a ‘reverse summary slide’, here’s the things that I think make Arrow great:</p>
<ul>
<li>Solid stable community and governance model, with leading contributors from more than a dozen major OSS and industrial projects</li>
<li>In-memory model directly maps to transit model (no serialisation overhead)</li>
<li>Common representation model across all major languages (originally Java/C++, but everyone else caught on eventually)</li>
<li>IPC and RPC as first class considerations (i.e. <code class="language-plaintext highlighter-rouge">python</code> + <code class="language-plaintext highlighter-rouge">R</code> playing nicely together locally)</li>
<li>Union schema type permits complex / underspecified data representation (a la <code class="language-plaintext highlighter-rouge">JSONB</code> in <code class="language-plaintext highlighter-rouge">postgresql</code>) (but I still want my graphs dammit…)</li>
<li>Removes pretty much every excuse for using CSV/JSON for Bulk Data Transit/Retention (Hello Open Data NI)</li>
</ul>
<p>And that’s it from me, if I haven’t totally borked the timing, I should have a minute or two for questions, which I probably won’t have the answers for.</p>
<hr />
<h1 id="faqs-because-bolster-doesnt-know-everything"><a href="https://arrow.apache.org/faq/">FAQ</a>’s because Bolster doesn’t know everything:</h1>
<h2 id="-arrow-vs-parquet">∆ Arrow vs Parquet</h2>
<p>Arrow -> in memory, with IPC/RPC/streaming options, uncompressed.<br />
Parquet -> on disk, maximising compression at the expense of read speed.</p>
<h2 id="-arrow-file-vs-parquet-files">∆ Arrow File vs Parquet Files</h2>
<p>Arrow ‘files’ are not really files, but more like mmap-able IPC pipes</p>
<h2 id="wtfeather">WTFeather?</h2>
<p>‘Feather v1’ came before the spec of Arrow ‘files’, Feather v2 === Arrow ‘files’</p>
<h2 id="wtflight">WTFlight?</h2>
<p>“Flight” is a transit protocol for high performance transport of Arrow record batches (initially but not exclusively over gRPC) that yields <a href="https://www.dremio.com/is-time-to-replace-odbc-jdbc">20-50x better performance over ODBC</a> approaches.</p>
<h2 id="-arrow-vs-protobuf">∆ Arrow vs Protobuf</h2>
<p>Protobuf isn’t great for in-stream processing; it prioritises structure serialisability for carrying data over efficient in-memory representation, and its internal representations are not language agnostic</p>
<h2 id="-arrow-vs-flatbuffer">∆ Arrow vs Flatbuffer</h2>
<p>Flatbuffers aren’t great for large structured data that is largely homogeneous (i.e. columns with different but consistent datatypes across rows)</p>
Dr StrangeBot: Or How I Learned to Stop Worrying and Trust Machine Learning2021-03-24T06:06:00+00:00https://andrewbolster.info/2021/03/dr-strangebot-or-how-i-learned-to-stop-worrying-and-trust-machine-learning<blockquote>
<p>This post was originally published as part of my role at WhiteHat Security.
Links have been added for context/comedy/my own entertainment, but no content has been modified.</p>
</blockquote>
<p>Beneath the cynicism, hyperbole, market-making and <a href="https://www.urbandictionary.com/define.php?term=Fud">FUD</a>, the strategic importance of AI in Cybersecurity is only constrained by us ‘meatbags’.</p>
<p>Being a data science practitioner in the cybersecurity space has been a double-edged sword for several years. On the one hand, with the proliferation of automated security testing, network IDS advances, the sheer growth in traffic and the threat surface of our increasingly complex, interconnected application development practices, these roiling oceans of flotsam and datum are everything our data-hungry little hearts desire. Related innovations in data engineering in the past decade mean that questions that had previously only lived in the craven dreams of executive officers and deranged analysts are now the kind of tasks that we hand off to interns to make sure they have correctly set up their workstations.</p>
<p>But this glut of “Big Data” and computational wizardry leads inevitably to the other side of that coin; the zombie-esque re-emergence of casualties from the last “<a href="https://link.springer.com/article/10.1007/s13347-020-00396-6">AI Winter</a>,” proselytising that “now is the time.” Revolutions in highly specific fields like natural language processing and computer vision previously only imagined in big budget Sci-Fi tentpole movie franchises were now accessible with URLs like <a href="https://thiscatdoesnotexist.com/">ThisCatDoesNotExist</a> and QuickChat.ai with links to the code on Github for all to emulate.</p>
<p>“This isn’t your parents AI,” was the rallying call of the entire B2B software engineering industry, “this time it’s different,” and AI would make it all better, and “no-code” AI / ML deep learning adversarial recurrent network solutions on the blockchain were the proverbial White Whales that just needed to be chased through these oceans of data. And finally, after years of promising research, Captain Ahab would have his prize of Human-Like Intelligence, able to take ‘meatbag’ expertise, judgement and wisdom, and scale indefinitely, or as much as your cloud compute budget could tolerate.</p>
<p>“Powered by AI” has become an albatross across many parts of the software engineering industry, nowhere more so than in cybersecurity. Considering the fundamental premise of our industry is ‘computer systems can be bent to induce unintended behavior’, the magic wand of ‘AI’ often ends up being relegated to a square on our socially distanced buzzword bingo cards.</p>
<p>The real opportunity for the techniques pioneered in the ‘Big Data’ and ‘Artificial Intelligence’ research spaces are already well voiced; “joining the best of human and machine intelligence,” but the question of how this is accomplished remains unclear at best and at worst is misleading.</p>
<p>At WhiteHat Security, we have pioneered an <a href="https://algorithmia.com/blog/active-learning-machine-learning">Active Learning</a> approach to our development of machine learning models that opportunistically takes tasks off our security experts’ work queues when a model has confidence in its assessment of a piece of evidence. These items are then either directly and invisibly actioned on behalf of our security team, or, on a probabilistic basis, still sent to our security teams to assess, along with the model’s assessment of that piece of evidence, so that we can cross-verify the ongoing performance of the models under test. This ensures both that our security teams have the most ‘un-boring’ experience possible and that our models receive continual feedback, so that performance or accuracy deviations can be quickly identified, and any models with reduced accuracy can be retrained and the old ones decommissioned rapidly without any loss of security oversight.</p>
<p>Behind this standardized deployment and interaction architecture is a core approach: a “decision support system” built on mutual trust in the Data Science group’s ability to analyse and model data using whatever techniques best fit each scenario’s context. This means that our partners in the rest of the product organization can understand and rely on the “decision support systems” that we as a Data Science group release. Fundamental to this “decision support system” approach is that whatever techniques, tools, strategies, technologies or <a href="https://powerlisting.fandom.com/wiki/Technological_Magic">technomancy</a> are used to pre-process, clean, analyze and train models, their integration is as simple as possible: a decision support system is fed some evidence, and it responds with a set of recommendations and related confidences.</p>
<p>Expressing and exposing these specific confidences as part of the system fosters a form of ‘trust’ between the decision support system and the security practitioners who then make decisions based on that data. And finally, when the decision support systems themselves have conflicting or low confidence in their assessments, not only are these borderline or edge cases raised with the security teams, but they’re also collated by our Data Science team, where they’re analyzed separately; if any patterns can be observed in the ‘confusing’ evidence, these are raised with our R&D and security teams, and new models are trained against these novel findings.</p>
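<p>The routing described here can be sketched in a few lines; to be clear, the names, thresholds and structure below are illustrative assumptions, not WhiteHat’s actual implementation:</p>

```python
import random

# Illustrative values only: tuned per model in any real deployment
CONFIDENCE_THRESHOLD = 0.95   # below this, a human always assesses
AUDIT_RATE = 0.05             # fraction of confident verdicts still cross-checked

def route(evidence, model):
    """Return ('auto', verdict) or ('human_review', verdict) for one item."""
    verdict, confidence = model.predict(evidence)
    if confidence >= CONFIDENCE_THRESHOLD and random.random() >= AUDIT_RATE:
        return ("auto", verdict)  # actioned invisibly on the team's behalf
    # Low confidence, or sampled for audit: queue for a human, with the
    # model's assessment attached so ongoing performance can be verified
    return ("human_review", verdict)
```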
<p>The intent is not to somehow replace or supplant the contextually informed human expert but rather to provide cognitive shortcuts and contextual evidence to empower them to make heuristic decisions on the edges.</p>
<p>AI, ML, Bots, Black Boxes, Decision Support Systems; whatever the phrasing, the place of these technologies in the modern cybersecurity landscape is simple; answer the easy questions for me and get out of the way, or give me enough contextual information and trusted advice to take on the hard questions myself.</p>
<h2 id="author-bio">Author Bio</h2>
<p>Andrew Bolster Ph.D leads the Data Science group in WhiteHat Security. His professional and academic experience spans from teaching autonomous submarines to collaborate on port protection, <a href="https://andrewbolster.info/2017/09/legal-considerations-for-trusted-autonomy.html">establishing guidelines</a> for military application of AI, using biosensors to monitor and communicate human emotions, establishing IEEE standards for applying <a href="https://andrewbolster.info/2020/04/is-your-ai-ethical.html">ethics in AI</a>, and curating data playgrounds for cybersecurity researchers and professionals to experiment with multi-terabyte streaming datasets for product innovation. In his “spare time”, he is a founding trustee of the <a href="https://www.farsetlabs.org.uk/">Farset Labs hackerspace</a>, and on the board of <a href="https://www.vaultartiststudios.com/">Vault Artist Studios</a>, both in Belfast, Northern Ireland</p>
<blockquote>
<p>Final fun note; this was my first time using Github’s VSCode Codespaces as an online Markdown editing environment and it wasn’t a disaster!</p>
</blockquote>
A Stranger in a Strange Land: Data Science Onboarding In Practice2020-10-20T10:03:00+00:00https://andrewbolster.info/2020/10/a-stranger-in-a-strange-land-data-science-onboarding-in-practice<blockquote>
<p>This talk was originally prepared for the 2020 <a href="https://www.nidevconf.com/sessions/andrew-bolster/">Northern Ireland Developers Conference</a>, held in lockdown and pre-recorded in the McKee Room in <a href="https://www.farsetlabs.org.uk/">Farset Labs</a></p>
</blockquote>
<p><a href="https://www.youtube.com/watch?v=8jbA1MAJe3Q"><img src="https://img.youtube.com/vi/8jbA1MAJe3Q/0.jpg" alt="'A Stranger in a Strange Land: Data Science Onboarding In Practice'" /></a></p>
<h2 id="intro">Intro</h2>
<p>Data Science is the current hotness.</p>
<p>While those of us in these virtual rooms may make fun of the likes of Dominic Cummings for extolling a ‘Data Driven Approach’ to policy, the reality is that Data Science as a buzzword bingo term has survived and indeed thrived in a climate where ‘Artificial Intelligence’ is increasingly derided as being something that’s written more in PowerPoint than Python, ‘Machine Learning’ still gives people images of liquid metal exoskeletons crushing powdery puny human skulls, and those in management with long memories remember what kind of <a href="https://economistsview.typepad.com/economistsview/2008/12/quantitative-me.html">mess</a> “Quantitative Analysis” got us into not too long ago…</p>
<p><a href="https://twitter.com/matvelloso/status/1065778379612282885?lang=en"><img src="/img/2020/hqzDHa2mKqFaFwXYVScaExr0nA9wTxLEVaSr7fLl98SnsxoYa3OgtQPhx90rzyGIo90D2zboVcnwHCXk2hFsEQnJlv20EdhFgqd9uYW57DcngrRiO2NB7bFjj0lmZ3wiyo8VyP2Ocg" alt="img" /></a></p>
<p>Way back in 2012, the Harvard Business Review described Data Science as <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">“The Sexiest Job of the 21st Century”</a>, and since then it has been appearing in job specs and LinkedIn posts and research funding applications and business startup prospectuses more than ever.</p>
<p>You’re not really doing tech unless you’ve got a few pet Data Scientists under your wing.</p>
<p><a href="https://nerdarchy.com/dd-ideas-artificer/"><img src="/img/2020/HdDYoDMx9A0qIoRJJF57EW5WHFNUfkZ8C5bcwb3e-ck4nntw_Ch_KAH8CzqLPU6FsUZLepfB-VzDcRsBM2DTeStdf5XS1rJVV1Nw1q9TDMiMIK30jRmUt-FuLvhbKj_q-dFu2CmJkhg" alt="img" /></a></p>
<p>Like some kind of mythical creature, these Data Scientists sit somewhere between Wizards, Artificers, and Necromancers, breathing business intelligence into glass and copper to give the appearance of wisdom from a veritable onslaught of data, wielding swords of statistical <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-tests</a>, shields made of the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Areas Under Curves</a>, and casting magicks of <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a>.</p>
<p><img src="/img/2020/iHRlCltQBUSLHxknWUF4-ctyA7E65b-tHfWU3Rc4Bt9KFtQCP3dJYGke5p9kGQh0sw3wzHvPwGXKwgwRdEFiZhcWnbGlzCntf54JWITuNCHfrG0wnHfiK6ktz1y_wxdTeQsdKQh-" alt="img" /></p>
<p>Like if Tony Stark and Stephen Strange fell into a blender and the Iron Mage appeared, extracting wisdom from the seen and unseen worlds around them and projecting wisdom into the future….</p>
<p>But more often, it’s much more mundane…</p>
<p><a href="https://xkcd.com/1838/"><img src="/img/2020/machine_learning.png" alt="XKCD: Machine Learning" /></a></p>
<p>However, for an organisation attempting to leverage these mythical Data Scientists, how do you introduce, accommodate, and indeed, welcome these new skills into your production data workflows?</p>
<p><img src="/img/2020/wZI5CWfBuebdzKQJOZjfUWsS8o3N0NMen1MfnVrPU5PKCSKtyVXxLLkK1nDPN3uz5y4GwbJpUpfanhv7KMWD78olIqo_SrVg4l7LMPSnA-mjkogQDDqr7J3YB5tN7vT7yC_MVsLA" alt="img" /></p>
<h2 id="whats-this-about">What’s this about?</h2>
<p>In this talk we’ll walk through some of the philosophies I’ve arrived at as someone who started off as a lone Data Scientist and is now transitioning to team leadership, and what tools I recommend to new hires (and intrigued colleagues) for understanding complex production architectures. So, generally, what I wish I knew when I started with modern-ish Data Science workflows.</p>
<p><a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape"><img src="/img/2020/SZEIYRDajtVEkbGxR6_naemsIUJVvNwsnZVycJ-L3a0AHPqEBGgXqNbATppwTOadoNqW2ikgzD3eTM5YD_j1AmjbMshYevwS7Z_eqQkmOi-m1cKGoyUVVQKMvvwuphVpW6UyqI4p" alt="img" /></a></p>
<p>Also, a couple of dodgy stories from over the years of ‘Data Science Gone Wrong’, that will probably get some questions asked, and hopefully not of me.</p>
<p>This isn’t going to be a technical Data Science talk; we’re not opening up <a href="/2018/10/daily-dated-untitled-jupyter-notebooks.html">Jupyter</a> or firing up <a href="https://en.wikipedia.org/wiki/Apache_Spark">Spark</a> or <a href="https://www.tensorflow.org/tutorials">Tensorflow</a> or whatever. We’re not even going to talk about <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptrons</a> or <a href="https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute">Hidden neurons</a> or <a href="https://hackernoon.com/exploring-decentralization-with-homomorphic-encryption-a8295755ba5f">homomorphic cryptography</a>. This is about people, processes, and how to establish a healthy data science culture.</p>
<p>Anyway, who am I to talk about this stuff?</p>
<h2 id="who-am-i--aka-you-can-skip-this-bit">Who am I ? (AKA you can skip this bit)</h2>
<p>My professional background started off by getting robotic dogs to piss on headmasters in front of 200 primary school kids and taking things apart and always having a few screws left over (or loose) at the end.</p>
<p>I eventually turned that “skillset” into something of a trade, by studying <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">electronics and software engineering at Queens</a>.</p>
<p>As part of this I got to test the launch of 4G networks in China from the grey comfort of an office in <a href="https://www.ericsson.com/en/about-us/company-facts/ericsson-worldwide/ireland">Athlone</a>, I moonlit as a technology consultant for a <a href="http://navigatorblue.com/">marketing and advertising firm</a> in Belfast, used massive clusters of GPUs to optimise cable internet delivery, and spent a summer developing BIOSs for <a href="https://www.pcengines.ch/">embedded computers</a> in Switzerland.</p>
<p>After that, and just in time for the financial crisis to make everyone question their career choices, I continued down the academic culvert to do a PhD, stealing shamelessly from the <a href="/2017/09/human-factors-in-autonomous-systems.html">sociologists</a> to make their “science” vaguely useful by teaching autonomous military submarines how to trust each other.</p>
<p>More recently, I worked with a bunch of psychologists and marketers to teach machines how to understand human emotions using biometrics and wearable tech as the only Data Scientist.</p>
<p>This being a small start-up, that meant I did anything that involved Data, so from storage and network administration to statistical analysis, real-time cloud architecture to academic writing, and everything in between. This also somehow involved throwing people down mountains and developing lie detecting underwear. Ahh the joys of Start Ups.</p>
<p>After that I got to be a grown-up Data Scientist working at a cybersecurity firm specialising in real-time network intrusion systems, playing with terabytes of historical and real-time data, trying to read the minds of hackers and script kiddies across the world who are throwing everything they can at some of the internet’s biggest institutions. This was my first taste of being a Data Scientist who wasn’t working completely alone…</p>
<h1 id="what-about-now-aka-start-reading-here">What about now? (AKA ‘Start reading here’)</h1>
<p>After two years there, I got pinched to build a new team within an established cybersecurity group called WhiteHat Security, which had recently been acquired by NTT Security.</p>
<p>We have 15 years of human-expert-trained data on whether and how customer websites can be vulnerable to attack. We have teams of people working 24/7 to try and break people’s websites before ‘the bad guys’ do, to prove that they’re vulnerable. One way or another, we have those footprints of investigation, and the company wanted to start doing something with that data, so it needed a Data Science team.</p>
<p>I’ve been there a year, and since this isn’t officially a sponsored talk I won’t rant; all I’ll say is that I’m still really enjoying the work. Anyway, with all that in mind, I want to look at this ‘How do you spin up Data Science?’ question from three perspectives.</p>
<ol>
<li>Things that made previous “Data Science” roles suck</li>
<li>Methods and approaches that I as an Individual contributor came to use to make my own life easier</li>
<li>Now that I’m leading a team, how I’m trying to put those approaches into practice, and hopefully soliciting advice from you lot too…?</li>
</ol>
<h2 id="what-is-a-data-scientist-really">What is a Data Scientist <u>really</u>?</h2>
<p>For a change, and with a certain sense of irony, Google itself has settled on a pretty decent job description for the field;</p>
<blockquote>
<p>“a person employed to analyze and interpret complex digital data, […], especially in order to assist a business in its decision-making.”</p>
</blockquote>
<p>To me, this definition encapsulates three of what I think are the four key elements of the modern Data Science role, and they’re all the sexy ones.</p>
<p>It says that Data Science sits in between Data Complexity, Business Context and Assistive Communication;</p>
<p><img src="/img/2020/zwN6hbU2O7Q9JapOib78IsuVUWoKfDAABSa7MEUjvtbhnYdILa7wA1scxcCToe3okzl74FVK6gsWrglr2_7Usp5-641GuKnonpEtGDl7ZZGRE1FyQl_pvp397DxUevaET--XlzUvN8A" alt="img" /></p>
<p>The obvious one is the <strong>Complex Data</strong>; you need to be familiar with how to access structured and unstructured data stores, you need to know how to navigate and validate your assumptions about that data, and be aware of techniques and methodologies to abstract or visualise that data.</p>
<p>A fairly common second highlight is the <strong>Communications</strong> aspect; at the end of the day, it’s your job to inform your internal and external customers with an appropriate amount of actionable information so that they can make an informed decision.</p>
<p>But, more subtly than that, you need to be aware of the <strong>Business Context</strong>: what the business as a whole is trying to accomplish, not just the direct requirements that may be foisted on you. Some people call this ‘<a href="https://en.wikipedia.org/wiki/Systems_theory#Systems_thinking">systems thinking</a>’, I call it ‘caring about other people’s work as well as your own’, but each to their own.</p>
<p>As we’ll see later, this is often more important on the ‘interpreting’ side than on the ‘communication’ side…</p>
<h2 id="four-is-a-magic-number">Four is a Magic Number</h2>
<p>So, we have Google’s defined trifecta of Complexity, Communications and Context, but I’d add a fourth that I think is quite overlooked in many ways. In the interests of not breaking anyone’s brains, we’re going to forego the Venn diagrams in favour of bullets…</p>
<ul>
<li>Complexity</li>
<li>Communications</li>
<li>Context</li>
</ul>
<p>Anyway, what’s this fourth theme?</p>
<ul>
<li><u>Continuity.</u></li>
</ul>
<p>Yes, it is a little bit of an alliterative backronym, but when I say Continuity, it has many meanings;</p>
<ul>
<li>Continuity of <strong>operations</strong> through automation and continuous testing.</li>
<li>Continuity of <strong>visibility</strong> enforced by the construction of reproducible reports and continuous dashboarding pipelines.</li>
<li>Continuity of <strong>meaning</strong> by the explicit and near obsessive transparency of recording and sharing assumptions, decisions, experiments, and most importantly, failures.</li>
<li>Continuity of <strong>capability</strong> by having your Data Science operations actually survive your Data Science <a href="https://en.wikipedia.org/wiki/Bus_factor">team being hit by a bus</a>.</li>
</ul>
<p>So, in my contrived setup, we’ve now got <strong>Complexity, Context, Communication and Continuity</strong>.</p>
<p>Great, after 10 minutes, we’ve got a definition. Ish.</p>
<p>Great; move on, Bolster. What does all this mean for someone either getting into Data Science as a career or building out a new capability?</p>
<h2 id="story-time">Story Time</h2>
<p>Before we get into the solutions, I’d like to share a couple of “WTF’s”, and then spend a little bit of time explaining where those WTFs actually came from.</p>
<p>I’ll avoid naming names to protect the guilty, but here are a few beauts in no particular order. I’ll let you be horrified by them en masse, then we can spend a bit of time going through them to understand how they came about.</p>
<h3 id="exhibit-a-the-thing">Exhibit A. The “Thing”</h3>
<p>Once upon a time, a bright-eyed data scientist was exploring a database. This was a mixed Perl/PHP environment that had a lot of the business logic embedded in the production databases.</p>
<p>That isn’t, in itself, a bad thing. What was a bad thing was the ‘thing’ table they discovered: a six-way mapping table between different types of entities from completely different parts of the business logic, including user roles, scheduling specifications, and assessment targets.</p>
<p><img src="/img/2020/F6qXvsJASZfSwM2bu-tKC-CBtaQYnmnh2H8gaLFw1IKaYOGN70LStKdA7eWoduJ87Uxp3-n8gqxI-2Imfy8STRLsb-FEW-sUSfX3H1WIblYP7ZtDVzsB9E9pLsNi0-yaRozEWWWCjAQ" alt="img" /></p>
<p>These were not ‘many to many’ relationships being maintained. No, my friends. This was to map a new global <code class="language-plaintext highlighter-rouge">thing_id</code>. A quick GitHub Enterprise search for <code class="language-plaintext highlighter-rouge">thing_id</code> revealed the horror that had been unearthed. Almost every interaction in the company first queried this table to work out what on earth a given query was talking about, leading to a structure that, after some coaxing, led <a href="https://www.jetbrains.com/datagrip/">DataGrip</a> to spew out this entity relationship diagram.</p>
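<p>To make that concrete, here’s a minimal, entirely hypothetical Python sketch of the anti-pattern (the names, IDs and entity types are invented for illustration; the real schema was far messier): a single global ID space mapped onto unrelated entity types, so every lookup pays an extra level of indirection before it even knows what it’s talking about.</p>

```python
# Hypothetical sketch of the "thing" anti-pattern: one global thing_id
# that maps onto completely unrelated entity types.
THING = {
    1: ("user_role", 17),
    2: ("schedule_spec", 4),
    3: ("assessment_target", 99),
}

# Stand-ins for the real per-entity tables.
TABLES = {
    "user_role": {17: "auditor"},
    "schedule_spec": {4: "weekly"},
    "assessment_target": {99: "example.com"},
}

def resolve(thing_id):
    # Step 1: ask the thing table what kind of entity this actually is.
    entity_type, entity_id = THING[thing_id]
    # Step 2: only now can the real table be queried.
    return TABLES[entity_type][entity_id]

print(resolve(2))  # prints: weekly
```

Every query in the codebase had to perform that first step before it could do anything useful.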
<p>It’s easy to discount this as lazy engineering or an incorrect abstraction, but there are three factors that, while they don’t justify leaving it that way, explain how you could end up there.</p>
<p><u>Factor one</u>: long ago, there was no thing table; the company data architecture was built cleanly and there was no need for such hellishness.</p>
<p><u>Factor two</u>: long ago, certain databases’ foreign-key performance wasn’t particularly great, so multi-entity ‘one to many’ relationships weren’t all that fast.</p>
<p><u>Factor three</u>: long ago, it was recognised that the company could expand some of its capabilities by acquiring a few other companies and integrating their data pipelines into its own.</p>
<p>Now I think we can see the trouble. Long story short, an engineering department was under pressure to deliver on grand promises, and hacked together a solution that reused the previously clean data architecture in… several different ways at once.</p>
<h3 id="exhibit-b-a-role-by-any-other-name">Exhibit B. A Role by any other name</h3>
<p>I was looking to deploy a data pipeline to automate something that had just been a cron job for ages. While I was still upskilling in AWS, identity management appeared to be a massive pain in the behind, so I decided to reuse an existing execution role, <code class="language-plaintext highlighter-rouge">api_injest_ro</code>, and considering this was largely an ingest project, that made sense. I reviewed the decision with my direct superior, who saw no problem with it.</p>
<p>On deployment, my pipeline died instantly as the <strong>entire</strong> company’s global client base started routing traffic through my totally incorrect pipeline instead of the primary ingest node.</p>
<p>This isn’t really a story of a hack; it’s a story of subjectivity: one person’s ingest is another person’s… well, you know.</p>
<p>In this case, the role was not originally intended for API clients trying to read data from our own systems, rather it was intended for accepting data from external API clients sending data into our systems. <em>This intention was not documented anywhere</em>.</p>
<p>The hack was a frankly clever piece of early cloud load balancing, where traffic was routed to the places that responded <u>fastest</u> with the <u>least</u> amount of non-200 responses.</p>
<p>Guess what was the only thing our intrepid data scientist’s pipeline template did?</p>
<p><img src="/img/2020/lZPF89kmNyNrLJYL4KT7GHZ-4ORZQHRJ3Pljpp1KTKz3XFK5w0M1sH7_yCkEcDqMDZGEivWVldi8FiXkPEedkia8L5jV3GQz_bWOl_AlHAYfEs-tquiw1o7iFne1Pf_iznGCjH9w5YY" alt="img" /></p>
<p>(For the record, this is always a terrible idea; your code should fail disgracefully first…)</p>
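<p>The failure mode sketches out like this (hypothetical names and numbers; the real balancer was considerably cleverer): if the router prefers whichever endpoint answers fastest with the fewest non-200 responses, a pipeline that instantly returns a 200 to everything wins all of the traffic, every time.</p>

```python
# Illustrative only: prefer the backend with the lowest non-200 rate,
# breaking ties by latency -- roughly the heuristic described above.
def pick_backend(stats):
    return min(stats, key=lambda name: (stats[name]["non_200_rate"],
                                        stats[name]["latency_ms"]))

backends = {
    "primary_ingest": {"latency_ms": 120.0, "non_200_rate": 0.02},
    # The accidental pipeline: answers everything, instantly, with a 200.
    "data_science_pipeline": {"latency_ms": 15.0, "non_200_rate": 0.0},
}

print(pick_backend(backends))  # prints: data_science_pipeline
```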
<h3 id="exhibit-c-whats-the-difference-anyway">Exhibit C: What’s the difference anyway?</h3>
<p>In another life, our friendly neighbourhood data scientist was doing a cohort analysis.</p>
<p>Participants in two different groups were put under different forms of stimulation, and the question was: what was the quantitative difference in response between the two groups?</p>
<p><img src="/img/2020/bEgdNFa21IlqDjkQWjYdyOylQAI2X8snLKRwCLmyosmkJ1TtoFDjzEpLov95GO3Y2gObzUmk9XGkUkSLX8JGiZxQFNu-fI7SR4mC4Y1e3Ia-Dq0hEZmNp7HYnxh_1wy-_GhLJ0LRisw" alt="img" /></p>
<p>Our scientist took the measurements, assessed the difference between the groups, and delivered the following two messages to the marketing team.</p>
<blockquote>
<p>Group B responded 25% more than group A</p>
</blockquote>
<blockquote>
<p>Group B’s response increased by 5% on average compared to Group A</p>
</blockquote>
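<p>Both statements were factually true of the same data; they just measure different things. With hypothetical response rates consistent with both (say Group A responded at 20% and Group B at 25%), the relative and the absolute differences come apart like this:</p>

```python
# Hypothetical response rates consistent with both statements above.
group_a = 0.20  # 20% of Group A responded
group_b = 0.25  # 25% of Group B responded

# "Group B responded 25% more than Group A": a *relative* difference.
relative_pct = (group_b - group_a) / group_a * 100

# "Group B's response increased by 5%": really 5 *percentage points*,
# an *absolute* difference.
absolute_points = (group_b - group_a) * 100

print(round(relative_pct))     # prints: 25
print(round(absolute_points))  # prints: 5
```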
<p>Time passes. The report is published, and then the calls start.</p>
<p>“Your numbers don’t make any sense? How can you have such a substantial effect? It’s physically impossible for a person to respond that much? You must just be making it up!”</p>
<p>Our data scientist goes to a website to read the completed report for the first time.</p>
<blockquote>
<p>“This product increases &lt;characteristic&gt; by 25%”</p>
</blockquote>
<p>We can laugh about this now, but this is a story of crunch timelines that prioritised speed over clarity, with no review or feedback opportunity for subject matter experts. Our data scientist gave two factual comments on the data from deep in their own trenches, and threw them over the no-man’s-land into the editing trenches. These were then rushed out the door with little to no final review, and by the time the honest misinterpretation was revealed, it was clear that both sides had screwed up.</p>
<h2 id="how-do-you-solve-a-problem-like-data-science">How do you solve a problem like Data Science</h2>
<p>So, those are just a small sample of the challenges that face any data science team, but they’re doozies when taken as abstract examples, and I believe that these at least could have been ‘dealt with’ by some abstract advice.</p>
<ol>
<li>People don’t agree on what words mean, let alone what numbers mean, so don’t assume anything and add your assumptions to any numbers / statements you’re delivering.</li>
<li>A Data Scientist’s job is not done once the number is out the door; you have a responsibility to make sure that whomever you delivered it to is on the same page as you as to its meaning.</li>
<li>Engineering, Strategy, and Innovation operate <em>in tension</em> with each other. Sometimes they speed each other up, occasionally they have to slow each other down. If a decision is made to do the wrong thing quickly instead of the right thing slowly, that needs to be a decision visible across that trifecta. And recorded…</li>
</ol>
<h3 id="thats-nice-and-all-but-how-do-you-actually-do-that">That’s nice and all but how do you actually do that?</h3>
<p>It’s been easy for years to stand up in conferences like this as an individual contributor, start-up data scientist, or solo researcher and wax lyrical about how all the things that other people do are crap and how it’d all be better if they just listened to me. It’s also fun.</p>
<p>However, how do you actually curate the kind of culture I’m talking about? Within a team, within an engineering division (your Data Science team <em>is</em> in your engineering division, right?), within a company, and within a wider data ecosystem?</p>
<p>Well, I’ve been doing this for a year and I don’t think I’ve succeeded yet, but here are some of the things we’re doing in my team to try and foster this, with appropriate redactions made…</p>
<p><img src="/img/2020/soOsf52AM5ntpzCcNWPAHgxlgA2c3YmscQUCZudmPcjoh1GQC7jrjVmXK-SOrG3JDs3qQhmVkDwN9-u4QJXIYUjbZid0DYjtvdIko4YMB5Hrqgf4FtYTYTI5IVSDTZYdmSxsQprPdKs" alt="img" /></p>
<p><strong>A Seat at the Table</strong>: As we’ve seen, the most challenging part of a Data Scientist’s job is often interpreting and ingesting data from systems that were never designed to be accessed in the weird and wonderful ways they want. Data Science has to have a seat at the Engineering Architecture table, both to manage expectations and to highlight premature abstractions or constraints that might later cause a massive headache for analysis but are simple to think about early on.</p>
<div style="position:relative; padding-bottom:calc(76.67% + 44px)"><iframe src="https://gfycat.com/ifr/UltimateJoyfulAcornweevil" frameborder="0" scrolling="no" width="100%" height="100%" style="position:absolute;top:0;left:0;" allowfullscreen=""></iframe></div>
<p><strong>Transparency</strong>: Teams are encouraged to share both their successes and their failures in the open with the rest of the company, to discuss their work in progress openly in our team Slack, and to cycle in subject matter experts from across the company to contribute to the discussion. That way we can test our assumptions early and often, so whether you’re a greenhorn statistician or a distinguished engineer, you can ask ‘stupid’ questions without any fear of backlash.</p>
<p><img src="/img/2020/6ZvdKr0jvMUTJzbndQXRqhYDIj4sdCj-JWl22N8lr5Oqp04vw4xIBSCGtdrfgwQ_f8rnoRl8LAGuFTwiDaekYObIuk6v4lHrVO-fY2BBgZRClBZ4d6Qscu3q7pYAVSxxtrdtAiPYU7g" alt="img" /></p>
<p><strong>Bar-stool diversity</strong>: This is one of the only failures I’ve had so far in this role. I started off being ‘given’ a pair of extremely experienced engineers who knew the platform inside and out, but not so much the analytical rigour or the statistical operations. My first attempted hire was a talented neurobiologist. Unfortunately this was rejected above my head because they “didn’t have enough programming experience”. My internal response was “Yeah, doi, that’s kinda why I wanted them”. I ended up hiring a statistician who’d done some R and some Python, and proceeded to beat the R out of them. Anyway, back to the point: Data Science is a field that thrives on questioning and different perspectives, and if all you have is one leg, you’re gonna fall over.</p>
<p><img src="/img/2020/t6k_UM87pcFngC-TQkZEVvIM_H2-WPPFF9zN436GvyY5RhNZ1jEl49ZH3-qes4bFBUTbhn7EcfM8MQZPk-Tqt17x041mZbUEoGStnlI7Ad2oXQksK0_hBJ4jah77eXYm2wS7oqqDygc" alt="img" /></p>
<p><strong>Defended empowerment</strong>: part of my responsibility as a team lead is to give my team cover, both from management noise but also from vexatious questions; our team is doing great things because of the deep and wide knowledge embodied in it, and I don’t want to waste that strength answering questions from colleagues who haven’t read our reports or done any of their own research.</p>
<p>So I field those calls, and if I can’t point to a part of a report, document, or code that explains the question, I add it to my own to-do list to explain it and update the documentation, pending review from the original contributor.</p>
<p>So that’s it; those are my principles for establishing and running a high-performing Data Science team: get good, diverse people; encourage their curiosity by giving them the freedom to talk to anyone; encourage them to share their successes and their failures; and cover their ass from all the stuff that gets in the way, while making sure their voice is heard at the highest levels as an equal partner.</p>
<h3 id="and-finally">And Finally</h3>
<p><img src="/img/2020/image-20201019130301677.png" alt="image-20201019130301677" /></p>
<p>Just for those who wanted to get the Tools discussion, here’s my recommended stack;</p>
<ul>
<li><a href="https://jupyter-docker-stacks.readthedocs.io/en/latest/">Jupyter Stacks</a> running in <a href="https://docs.docker.com/get-docker/">Docker</a> with <a href="https://stackoverflow.com/questions/23439126/how-to-mount-a-host-directory-in-a-docker-container">folder mapping to userland</a> for exploratory stuff, although my team are currently looking at moving all our exploratory analysis to <a href="https://azure.microsoft.com/en-gb/">Azure</a>, and <a href="https://azure.microsoft.com/en-gb/services/databricks/">Azure’s Databricks</a> looks like a drop-in replacement, with the added benefit of the team being able to work in their own environments if they have a preferred stack themselves.</li>
<li><a href="https://www.metabase.com/">Metabase</a> for exploratory data collation as a team (also does particularly well at introspecting on what should be foreign keys but aren’t)</li>
<li><a href="https://www.jetbrains.com/datagrip/">Jetbrains DataGrip</a> for, well, basically anything that it supports</li>
<li>If you don’t have a <a href="https://www.alation.com/blog/what-is-a-data-catalog/">data catalogue</a> and a data glossary, you don’t have data. <a href="https://docs.microsoft.com/en-us/azure/data-catalog/overview">Azure Data Catalog</a> is very good for both of these, including metadata tagging and the ability to make people outside your team ‘admins’ on particular terms. <a href="https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html">AWS Glue</a> does similar but is more internally focused</li>
<li><a href="https://pandas.pydata.org/">Pandas</a> goes without saying, but I would flag that <a href="https://plotly.com/python/plotly-express/">Plotly Express</a> and its Jupyter integrations are looking awesome. If you’ve ever played with in-notebook interactive graphing and found it frustrating, try it again.</li>
<li>And finally, a personal favourite; <a href="https://github.com/Valassis-Digital-Media/nbconflux"><code class="language-plaintext highlighter-rouge">nbconflux</code></a>, an extension to push <a href="https://jupyter-notebook.readthedocs.io/en/stable/">Jupyter notebooks</a> up to <a href="https://www.atlassian.com/software/confluence">Atlassian Confluence</a>, so that people outside your analysis environments can work out what the hell you’re talking about and what assumptions you made.</li>
</ul>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Data Science sits somewhere between Engineering, R&D, and Management.</p>
<p>Most people think it’s either magic or it’s going to steal their job or both.</p>
<p>For all the talk of Data Science being about technology, so far I’ve had to learn more about the human side than the technical side.</p>
<p>But, as ever, your mileage may be non-deterministic.</p>
<p>Thank you for your time.</p>
Tell me about your Programmer - Robopsychologist and other careers that don't exist (yet)2020-10-19T10:34:00+00:00https://andrewbolster.info/2020/10/tell-me-about-your-programmer<blockquote>
<p>This talk was originally prepared for <a href="https://hopin.to/events/ni-raspberry-jam-nidc">NI Raspberry Jam’s Kids Track</a>, associated with the full <a href="https://www.nidevconf.com/">Northern Ireland Developers Conference</a>, held in lockdown and pre-recorded in the McKee Room in <a href="https://www.farsetlabs.org.uk/">Farset Labs</a></p>
</blockquote>
<p><a href="https://www.youtube.com/watch?v=Z1c4KAESO0s"><img src="/img/2020/0.jpg" alt="Actual Recording of This Transcript" /></a></p>
<p><img src="/img/2020/image-20201019102913292.png" alt="image-20201019102913292" /></p>
<p>In <a href="https://en.wikipedia.org/wiki/Isaac_Asimov">Isaac Asimov’s stories</a>, the technical, social and personal impacts of advanced robotics and artificial intelligence are explored. One creation in his books was the career of “<a href="https://en.wikipedia.org/wiki/Robopsychology">Robopsychologist</a>”, a combination of mathematician, programmer, and psychologist, who diagnosed and treated misbehaving AI. In this talk we’ll discuss how on earth you can prepare for a career in Robopsychology and other careers that don’t exist (yet).</p>
<p>Isaac Asimov is primarily known for being one of the most prolific and impactful science fiction writers ever, and as you would expect, while wandering around these fictional worlds he came up with a few science-y-sounding mumbo-jumbo terms such as ‘positronic’ and ‘psychohistory’; he is literally the father of the word ‘robotics’.</p>
<p><img src="/img/2020/image-20201019104244483.png" alt="image-20201019104244483" /></p>
<p>He first used the term in his 1941 story ‘<a href="https://en.wikipedia.org/wiki/Liar!_(short_story)">Liar!</a>’, about a robot called ‘Herbie’ that develops telepathic abilities and can read people’s thoughts. However, because the robot’s core operating principles, or ‘programming’, still included the First Law of Robotics (that is, not to hurt people), Herbie starts lying to people to make them feel happy.</p>
<p>Herbie thus leads a robopsychologist called <a href="https://en.wikipedia.org/wiki/Susan_Calvin">Susan Calvin</a> to believe that a co-worker fancies her, and when she finds out that this isn’t true, and that Herbie told her so only because it knew the idea would make her feel better, she is very hurt.</p>
<p>This ‘First Law’ is part of the <a href="https://en.wikipedia.org/wiki/Three_Laws_of_Robotics">‘Three Laws of Robotics’</a>, which were officially codified the following year as:</p>
<ol>
<li>A robot may not injure a human being or, through inaction, allow a human being to come to harm.</li>
<li>A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.</li>
<li>A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.</li>
</ol>
<p>These simple ‘laws’ that were programmed into all robots, collided with the very human tendencies of wanting to be loved, and created this imaginary field of research called ‘Robopsychology’.</p>
<blockquote>
<p>Robopsychology - “the study of the personalities and behaviour of intelligent machines”</p>
</blockquote>
<p>This is a field of research that, today, doesn’t exist, but I first read about it in 2014, on a packed commuter train in California, having just left Google’s Mountain View complex where I’d met a university friend who was a programmer there.</p>
<p><img src="/img/2020/RkUMAXrjqJ6W2XHLgUSnno2OaOz2Hsm6KZJ3rJ7-MMA8cCXF2_6UZ8-OSk9wT1Er-5BjLps3ZSLw-LUQ97PR8F7x0S_9ggIa6UEj--jzWx9oQ5pXx7xgL1GZC5F_uLNAPUIwCfWsuFY" alt="img" /></p>
<p>I had spent the previous 5 years in a Masters programme at <a href="https://www.qub.ac.uk/">Queen’s University Belfast</a> studying Electronics and Software Engineering, a course that’s now called <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">Software and Electronic Systems Engineering</a> (that I would highly recommend, ask me afterwards…)</p>
<p><img src="/img/2020/image-20201019104613554.png" alt="image-20201019104613554" /></p>
<p>It focused on the overlap between two fields that, in my eyes, were obviously one bigger field; How to put lightning in glass to add numbers together, and how to turn adding numbers together into thinking. The physical world of computation and the virtual world of programming. And it fascinated me.</p>
<p>I wasn’t a particularly stellar student, I asked lots of awkward questions and never did particularly well in exams, and I had just resigned myself to going down the ‘standard engineering route’ of ‘Get Degree, Join Big Company, Write Banking Software for 30 years, Retire’, but my discovery of this weird term ‘Robopsychology’ kicked me in the behind.</p>
<p>After that, I gave up the prospective Banking Analyst job and took up postgraduate research exploring how smart submarines collaborate and interact with each other for environmental and military applications, including how to hide nuclear submarines using sound and how to use atomic clocks to build an underwater GPS system, all while assuming that someone can take control of one or more of your submarines and make them ‘lie’.</p>
<p><img src="/img/2020/image-20201019104714938.png" alt="image-20201019104714938" style="zoom:80%;" /></p>
<p>The research eventually became too classified for me to continue working on it but I know that I contributed to international agreements on how autonomous systems are allowed to integrate into military chains of command.</p>
<p><img src="/img/2020/LndCkwguzD58UAxk89KawzdwQQ_2HcDmR7YfnQGQZeVCjrPjtSjTdRxtI4E_EIt16-0hnx28qDMUU0n4HjaoGMsukafJDrFIiwryq20jOlu3efuxldkmxm-fBszYqV_copfboZkKoa4" alt="img" /></p>
<p>This interplay between how fixed, rule based, programmed systems like computers and robots, and the fuzzy, fluffy, mushy stuff that comes from people and communities, has driven my career since then.</p>
<p><a href="https://www.broadcastnow.co.uk/tech/red-bull-creates-vr-showing-real-time-emotions-feedback/5122720.article"><img src="/img/2020/bH66fOgnl46z2uM78mR9JWR63O3Kp1vCvRqvITcA2zy2me2rSN1rpNJetbhk9fI-M4Ke27TMOwvllYhJCRs8opLHPa4rWDfjVoJ6_XDp3G0Jocr1Lmqw3NBmmQ4G3lY3e8VH2XYKL3k" alt="img" /></a></p>
<p>I spent two years developing smart watch applications that could tell how stressed you were, culminating in developing particularly shameful lie-detecting underwear, as well as a survey system for a deodorant manufacturer that only asked you questions about the product when it knew you were sweating.</p>
<p>During this time we developed systems to translate emotions from heart rates to words to colours to sounds and back again, a universal translator for emotions.</p>
<p>After that, I leapt back into cybersecurity, developing smart algorithms that watched the world’s internet traffic, sniffing out hackers, trying to predict their next moves and detect the faintest whiffs of exploitation, fully aware that the hackers were doing exactly the same thing on the other side; automating intelligence, teaching that lightning to think, teaching that lightning to think for them.</p>
<p><img src="/img/2020/sBxuOZAaD9QSRrXPLNHp7OwlB_lGzR0rDfsPfy6dvUenK4vNgk8_MVBaK4FEuFBgpVf3-Dk71Zz8eI9hzoFJ-j4C5dbkPHne6Ny7mDAdEahDordmIQy95IWxYFfRPj-b0-dbRZYqrBU" alt="img" /></p>
<p>In my current role, I lead a team of Data Scientists, a term that didn’t really exist when I was on that train in California only six years ago. We develop and monitor intelligent systems that watch company websites for security vulnerabilities. My day job is to work out better ways to pretend to be a hacker, and to automate the boring bits of the work of the professional <a href="https://www.whitehatsec.com/">WhiteHat</a> hackers I work with.</p>
<p>When I was your age, those jobs didn’t exist. The internet as we know it today didn’t exist. We didn’t even know what we didn’t know. So when my careers teacher told me in 2004 that I should look at being an insurance adjuster “because you’re good with numbers”, she didn’t know that that job would basically be automated out of existence.</p>
<p>So I ended up being a Data Scientist. Not because it’s what my careers teacher or parents or friends told me, not because it was on some ‘skills and employability map’, and not because of the output of some assessment tests. The job role simply didn’t exist.</p>
<p>And I guarantee that most of you watching now will end up working in and creating jobs that simply don’t exist today.</p>
<p>That could be <a href="https://www.bitfarming.co.za/">bit-farming</a> or <a href="https://blackwellglobal.com/top-10-crypto-influencers/">crypto-influencer</a> or <a href="https://en.wikipedia.org/wiki/Quantum_annealing">quantum annealer</a> or, indeed, <a href="https://www.therobotreport.com/synthetic-psychology-understanding-human-behavior-through-robotics/">robopsychiatrist</a>. So as you go through your studies, don’t allow yourself to fixate or judge yourself against what jobs are out there now.</p>
<p>Your parents and your teachers and your friends genuinely want the best for you, so they will suggest and encourage you to follow certain paths, generally because it’s advice they wish they could give themselves 20 years ago based on their own experiences. But the thing is, the past 20 years was theirs, the next 20 years is yours.</p>
<p>There are no robopsychiatrist jobs out there. Yet.</p>
<p><img src="/img/2020/image-20201019104933566.png" alt="image-20201019104933566" /></p>
<p>Build your own paths and experiences, read widely, care deeply, and don’t be afraid of being ‘directionless’ or meandering. Because if you make your own luck, you might just end up in the right place at the right time and with the right skills to realise you’re being lied to by a telepathic robot.</p>
<p>Thank you for your time, and I’m happy to take any questions. (Online or offline!)</p>
<p>Also, <a href="https://www.whitehatsec.com/company/careers/">I’ll still be hiring in a few years</a>, so if you wanna join me, gimme a shout.</p>
<blockquote>
<p>The best way to contact Andrew if you want a short answer is <a href="https://twitter.com/bolster">Twitter</a>, and if you want a longer answer that may take several days, tweet him for his email address (or find it yourself :wink:)</p>
</blockquote>
Back to reality2020-09-08T15:22:00+00:00https://andrewbolster.info/2020/09/back-to-reality<p>Realised it’d been a while, thought it was time for an update. I’m gonna try and make an effort to get closer to an ideal of something like #weeknotes, as I’ve always been jealous of <a href="http://www.mcqn.net/mcfilter/">Adrian’s</a> ability to put his thoughts down and get them out the door in a sensible way…</p>
<p>Anyway, use the Table of Contents above to skip to bits you may care about.</p>
<h2 id="farset-updates">Farset Updates</h2>
<p>So, this was the actual reason for this update but it started a weird thought process that led to this blog.</p>
<p>Post-renovation work at Farset has been the definition of a group effort; everyone’s been doing their bit to make small improvements as they go along (while being COVID-safe…), but we’ve been terrible at collecting and sharing those updates, so people could easily be forgiven for thinking that bugger all has happened since COVID.</p>
<p>Art has been essential in encouraging people to throw post-its around the space with ideas for people to pick up. I didn’t take photos of all of them but here’s the ‘master panel’ in the entrance way.</p>
<p><img src="/img/2020/2020-09-08 12.38.27.jpg" alt="2020-09-08 12.38.27" /></p>
<p>I’m only gonna include stuff that I did today after not having been in the space for a month, but I encourage everyone contributing to take the credit and share what you do to fix and grow our wee space in these times.</p>
<ul>
<li>
<p>We’ve got working calipers (and a stash of relevant batteries)</p>
<p><img src="/img/2020/2020-09-08 12.53.51-1599575484372.jpg" alt="2020-09-08 12.53.51" /></p>
</li>
<li>
<p>Coat trees in both desk-spaces (which are ‘gentrified’ according to one member :joy:)</p>
<p><img src="/img/2020/2020-09-08 12.00.49.jpg" alt="2020-09-08 12.00.49" /></p>
</li>
<li>
<p>Managed to unpack, blade, stash and label the Circular Saw that we’d ordered before I left (Do not attempt to use unless you really know what you’re doing, do not work alone, etc, etc.)</p>
<p><img src="/img/2020/2020-09-08 12.55.48.jpg" alt="2020-09-08 12.55.48" /></p>
</li>
<li>
<p>As part of our newly-ratified <a href="https://www.farsetlabs.org.uk/coronavirus">Guest policy</a>, got a nicely hazard taped visitors book with appropriate warnings<img src="/img/2020/2020-09-08 11.13.51.jpg" alt="2020-09-08 11.13.51" /><img src="/img/2020/2020-09-08 11.13.46.jpg" alt="2020-09-08 11.13.46" /></p>
</li>
<li>
<p>A Dremel-esque workstation got set up on the workbench (mostly just to match with its friendly drill press)</p>
<p><img src="/img/2020/2020-09-08 12.56.10.jpg" alt="2020-09-08 12.56.10" /></p>
</li>
<li>
<p>Fixed the heights of the rolling drawers we’d ordered ages ago to see if they fit under the workbenches (they do) so will take a proposal to the board about getting a load more to help with organising the workshop</p>
<p><img src="/img/2020/2020-09-08 12.46.17.jpg" alt="2020-09-08 12.46.17" /></p>
</li>
<li>
<p>Put the clock up in the downstairs coworking space (all the other spaces already had one from a previous pass)</p>
<p><img src="/img/2020/2020-09-08 13.07.48.jpg" alt="2020-09-08 13.07.48" /></p>
</li>
</ul>
<p>Not shown; all the other awesomeness others have done. This is just for my own sake and because doing updates in our Slack doesn’t exactly provide much visibility to the outside world.</p>
<p>Previous interesting highlights not from today’s work;</p>
<ul>
<li>
<p><a href="https://discourse.farsetlabs.org.uk/t/event-space-streaming-av-setup/26">Updated AV setup in the Event Space</a></p>
</li>
<li>
<p>We’re getting a laser cutter but we’re not talking too much about details till it’s all confirmed</p>
</li>
<li>
<p>Not our work, but there’s an EV charging point outside our space, so that might have to get listed on the website at some point…</p>
</li>
<li>
<p>Loads of social distancing signage (and Art’s fantastic signage templates; but no spoilers)</p>
</li>
<li>
<p>A couple of cast-able screens that we’re not entirely sure where to mount yet (ideas welcome)</p>
<p><img src="/img/2020/2020-07-23 17.58.27.jpg" alt="2020-07-23 17.58.27" /></p>
</li>
</ul>
<h2 id="lockdown-travel-coronavirus-and-moving">Lockdown, Travel, Coronavirus, and Moving</h2>
<p>Everyone and their aunt has done a COVID lockdown stress post so here’s some highlights for myself.</p>
<ul>
<li>Got stuck with my partner during lockdown as it made more sense to stay at my place as I have better access to Deliveroo; haven’t killed each other yet, although it’s been close a few times…</li>
<li>Work has migrated pretty well, some teething problems but nothing that can’t be managed. Still enjoying the work and the opportunities, and pushing for them to take a more active role in the now estranged Belfast/ NI tech scene, encouraging people to submit personal talks to <a href="https://www.nidevconf.com/">NIDC</a> (Come see mine!) etc and basically ‘get out there’ more, but with limited local management beyond team leads like me acting alone, getting sign off can be slower than I’d like.</li>
<li>Took said partner to France to visit my parents in a village in the deep south (<a href="https://www.google.com/maps?client=firefox-b-d&q=herault&um=1&ie=UTF-8&sa=X&ved=2ahUKEwja1brW09nrAhV5SBUIHeN1BSEQ_AUoAXoECB4QAw">Hérault-ish</a>) after persistent chastising; isolated in the village for 3 days before the locals would even consider talking to us as we were ‘unclean’, but once we got past that barrier, spent the next two weeks enjoying ourselves, me, mostly enjoying wine and cheese, and my partner, mostly enjoying the pool of a friend of the family and posting cheesy artsy videos on instagram….</li>
</ul>
<p><img src="/img/2020/image-20200908141016712.png" alt="image-20200908141016712" style="zoom:50%;" /></p>
<ul>
<li>
<p>As you may know/remember, France was taken off the ‘<a href="https://www.bbc.co.uk/news/uk-53774955">green list</a>’ while we were over, but we were aware of this possibility and took the risk anyway; we got caught out by a feckless reactionary government response that was locking out countries while opening up schools and pubs. Good job :man_shrugging:</p>
</li>
<li>
<p>So, we returned when we’d planned to, and locked ourselves down as per the guidelines. Fortunately, I’d gotten an Index just before we left, so I had fun playing with <a href="https://store.steampowered.com/app/546560/HalfLife_Alyx/">HL:Alyx</a>, and yes, it’s as good as everyone says. Partner prefers <a href="https://store.steampowered.com/app/418650/Space_Pirate_Trainer/">Space Pirate Trainer</a> though</p>
<p><img src="/img/2020/VID_20200727_233315-ANIMATION.gif" alt="VID_20200727_233315-ANIMATION" /></p>
</li>
<li>
<p>Some may know that spending 8 months in lockdown in my Harry Potter style ‘cupboard under the stairs’, standing desk or no, has taken a bit of a toll, and I’d been keeping an eye out for interesting properties. I’ve been doing this for years with weekly digests from <a href="https://www.propertypal.com/">PropertyPal</a>, but had only been even vaguely excited about one in months, until one strange one popped up, right in the middle of our isolation… We dispatched a <a href="https://twitter.com/ben_bland">friendly telepresence robot</a> to take a video walk-through for us, and in a day of bouncing back and forth, it was ours. We move in next month. Best bit? It has both a study (aka ‘bedroom two’) and, most shockingly for that area of Belfast, a garage, which I’m hoping to turn into a bit of a carpentry setup… to then build a bar, since IMO in this whole post-COVID world, everyone going to the bar together and mixing with loads of different people may be a thing of the past for a while… So why not have a few folks round for a few tins in style!</p>
<p><img src="/img/2020/ezgif.com-gif-maker.webp" alt="ezgif.com-gif-maker" /></p>
</li>
</ul>
<h2 id="upcoming-projects">Upcoming Projects</h2>
<p>So far I’ve committed to;</p>
<ul>
<li>Doing a half day workshop on <code class="language-plaintext highlighter-rouge">pandas</code> and <code class="language-plaintext highlighter-rouge">arrow</code> for the NIDC pre-workshops on the 8th of October
<ul>
<li>And then reusing said workshop for UUJ’s Data Analytics MSc Guest Lecture series….</li>
</ul>
</li>
<li>Doing a talk on Data Science Onboarding Best Practices for NIDC proper</li>
<li>Finishing editing a very drunken <a href="https://datadelinquents.dev/">Data Delinquents</a> ‘podcast’</li>
<li>Getting back to the funders of <a href="https://blog.farsetlabs.org.uk/2020/03/data-art-ni-open-datani-outreach-funding/">DataArtNI</a> about our plans to restart it (a programme that was painfully confirmed just in time to be locked down <em>hard</em>)</li>
</ul>
<p>Wee buns…. :grimacing:</p>
<h2 id="conclusion">Conclusion</h2>
<p>I hope to get better at this ‘updating’ stuff as a form of journaling, because, if lockdown has taught me anything, time has no meaning, memory is fallible, and if you don’t record things, they’re lost to time and <em>really</em> difficult to reverse engineer when you realise you should have written it down six months later….</p>
Merging Git Repos for Archival Purposes2020-06-23T13:32:00+00:00https://andrewbolster.info/2020/06/merging-git-repos-for-archival-purposes<h1 id="tldr">TL;DR</h1>
<p>I had reason to want to combine git repos into one big repo consisting of repos in their own folders, while ideally maintaining the histories of all those repos for archaeological purposes.</p>
<p>There are many reasons why someone would <em>want</em> to do this, and my specific use case isn’t relevant. Good luck.</p>
<script src="https://gist.github.com/andrewbolster/2ed60be3592c41c9123b5c0b764dea4c.js"></script>
<h2 id="why-so-complicated">Why so complicated?</h2>
<ul>
<li>‘Hidden’ files (dotfiles) suck</li>
<li>Shell Wildcards suck</li>
<li>Wildcards with selective exclusions (i.e. <code class="language-plaintext highlighter-rouge">.git</code>) suck</li>
<li>File names with spaces suck</li>
<li>Trailing Slashes suck</li>
<li>Rewriting History sucks</li>
</ul>
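<p>To see what the script below actually does without the argument plumbing, here’s a minimal, self-contained sketch of the history-rewriting half (not from the original gist; the repo and <code class="language-plaintext highlighter-rouge">origins/…</code> path names are invented for illustration): every commit is rewritten so its tree lives under a subdirectory, with history preserved. Run it somewhere disposable.</p>

```shell
# Sketch only: rewrite every commit so the whole tree lives under a
# subdirectory, preserving history. All names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git prints a scary warning (and sleeps) without this:
export FILTER_BRANCH_SQUELCH_WARNING=1

cd "$(mktemp -d)"
git init -q demo
cd demo
echo hello > a.txt
git add a.txt
git commit -qm "first"
echo world > b.txt
git add b.txt
git commit -qm "second"

# Rewrite every commit so its files sit under origins/example.com/alice/demo/
git filter-branch --tree-filter '
  dest=origins/example.com/alice/demo
  mkdir -p "$dest"
  for f in *; do
    case "$f" in origins) ;; *) mv "$f" "$dest"/ ;; esac
  done
' -- --all

git ls-tree -r HEAD --name-only   # everything is now under origins/...
```

<p>Note that newer versions of git steer you towards <code class="language-plaintext highlighter-rouge">git filter-repo</code> instead of <code class="language-plaintext highlighter-rouge">filter-branch</code>, but for an archival one-off the above still works.</p>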
<h3 id="raw-version-because-i-dont-trust-gist-and-embeds-and-such">Raw version because I don’t trust GIST and embeds and such….</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
usage<span class="o">()</span> <span class="o">{</span>
<span class="nb">cat</span> <span class="o"><<</span> <span class="no">EOF</span><span class="sh">
This script imports a git repo (accessible from https://</span><span class="se">\$</span><span class="sh">origin/</span><span class="se">\$</span><span class="sh">user/</span><span class="se">\$</span><span class="sh">repo) and all its history as subdirectory of a destination (available locally at </span><span class="se">\$</span><span class="sh">dest)
It is designed for non-production, archival processes and may destroy everything you've ever loved because you looked at it funny. You have been warned.
The structure of the destination will end up something like this:
~/src
- </span><span class="se">\$</span><span class="sh">dest
- origins
- </span><span class="se">\$</span><span class="sh">origin
- </span><span class="se">\$</span><span class="sh">user
- </span><span class="se">\$</span><span class="sh">repo
Required Arguments:
-u|--user: The user that owns the repo to be imported
-r|--repo: The name of the repository to be imported
-d|--dest: The local name of the destination repository (assumed to be under ~/src)
-o|--origin: The git server that is the origin of the repo to be imported
</span><span class="no">EOF
</span><span class="o">}</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$# </span><span class="nt">-lt</span> 8 <span class="o">]</span><span class="p">;</span> <span class="k">then
</span>usage
<span class="nb">exit </span>1
<span class="k">fi
while</span> <span class="o">[[</span> <span class="s2">"$#"</span> <span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">do
case</span> <span class="nv">$1</span> <span class="k">in</span>
<span class="nt">-u</span><span class="p">|</span><span class="nt">--user</span><span class="p">)</span> <span class="nv">user</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-r</span><span class="p">|</span><span class="nt">--repo</span><span class="p">)</span> <span class="nv">repo</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-d</span><span class="p">|</span><span class="nt">--dest</span><span class="p">)</span> <span class="nv">dest</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-o</span><span class="p">|</span><span class="nt">--origin</span><span class="p">)</span> <span class="nv">origin</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="k">*</span><span class="p">)</span> <span class="nb">echo</span> <span class="s2">"Unknown parameter passed: </span><span class="nv">$1</span><span class="s2">"</span><span class="p">;</span> usage<span class="p">;</span> <span class="nb">exit </span>1 <span class="p">;;</span>
<span class="k">esac</span>
<span class="nb">shift
</span><span class="k">done
</span><span class="nv">tmp</span><span class="o">=</span><span class="s2">"/tmp/_</span><span class="k">${</span><span class="nv">dest</span><span class="k">}</span><span class="s2">_tmp"</span>
<span class="nb">echo</span> <span class="s2">"Importing </span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">/</span><span class="nv">$repo</span><span class="s2"> into </span><span class="nv">$dest</span><span class="s2">"</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/src/<span class="nv">$repo</span>
<span class="nb">cd</span> ~/src
git clone https://<span class="nv">$origin</span>/<span class="nv">$user</span>/<span class="nv">$repo</span>
<span class="nb">cd</span> <span class="nv">$repo</span>
git filter-branch <span class="se">\</span>
<span class="nt">--tree-filter</span> <span class="s2">"mkdir -p </span><span class="nv">$tmp</span><span class="s2">/origin; git ls-files | cpio -pdumB </span><span class="nv">$tmp</span><span class="s2">/origin; git ls-files | xargs -d '</span><span class="se">\n</span><span class="s2">' rm -r; find . -type d -empty -delete; mkdir -p origins/</span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">; mv </span><span class="nv">$tmp</span><span class="s2">/origin origins/</span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">/</span><span class="nv">$repo</span><span class="s2">/"</span><span class="se">\</span>
<span class="nt">--tag-name-filter</span> <span class="nb">cat</span> <span class="nt">--prune-empty</span> <span class="nt">--</span> <span class="nt">--all</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then</span>
<span class="c">## WAIT PATIENTLY</span>
<span class="nb">cd</span> ../<span class="nv">$dest</span>
git remote add <span class="nv">$repo</span> ../<span class="nv">$repo</span>
git fetch <span class="nv">$repo</span> <span class="nt">--tags</span>
git merge <span class="nt">--allow-unrelated-histories</span> <span class="nv">$repo</span>/master <span class="c">#You're on your own if you want a different / multiple branch(es)...</span>
git remote remove <span class="nv">$repo</span>
<span class="k">else
</span><span class="nb">echo </span>failed <span class="k">for</span> <span class="nv">$user</span>/<span class="nv">$repo</span>
<span class="k">fi</span>
</code></pre></div></div>
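<p>The merge half of the script boils down to <code class="language-plaintext highlighter-rouge">git merge --allow-unrelated-histories</code>, which can be demonstrated standalone (repository names here are invented for the demo; <code class="language-plaintext highlighter-rouge">FETCH_HEAD</code> is used so it doesn’t matter whether the imported repo’s default branch is <code class="language-plaintext highlighter-rouge">master</code> or <code class="language-plaintext highlighter-rouge">main</code>):</p>

```shell
# Sketch only: stitch two unrelated repositories into one history.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
for r in repo_a repo_b; do
  git init -q "$r"
  ( cd "$r"
    echo "$r" > "$r.txt"
    git add .
    git commit -qm "init $r" )
done

cd repo_a
# Fetch the other repo's HEAD directly; no remote bookkeeping needed.
git fetch -q ../repo_b HEAD
# The crucial flag: git normally refuses to merge histories that share
# no common ancestor.
git merge --allow-unrelated-histories -m "import repo_b" FETCH_HEAD
ls   # both repo_a.txt and repo_b.txt are now present
```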
Pragmatic Data Science; When Unstoppable Math meets Immovable Ethics2020-05-22T13:35:00+00:00https://andrewbolster.info/2020/05/pragmatic-data-science<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTmW2Ls_Qvj0KGH-7wJvWxPyHGJnwf-Den52J5dm-ejF2WQeXnUEqDPv98uT--mt_WwsgQR6vEh30w6/embed?start=false&loop=true&delayms=0" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<p><em>This is a Rough Transcript from <a href="https://www.meetup.com/devbash/events/270315785/">The Virtual Bash on Ethics</a></em></p>
<h1 id="introduction">Introduction</h1>
<p>Around two thousand, two hundred and twenty years ago, Archimedes said “Give me a place to stand and I will move the world”.</p>
<p>Archimedes has come to be (rightly) associated with many fundamental innovations in mathematics and engineering, and while I’m sure we’re all sick and tired of screws and levers and jumping out of bathtubs, I’d like to start off with a discussion on one of his more mundane creations, the compound pulley.</p>
<p>The Greek biographer Plutarch describes an interaction between him and King Hiero of Syracuse, where Archimedes boasts that he can pretty much move or lift anything. The King was dubious of this, but, being a king, had a few massive warships that required hundreds of men to manoeuvre in and out of dock, and put Archie’s wild claims to the test.</p>
<p>As the story goes, Archie shows up with what we now call a compound pulley system, and pulls the king’s warship along the dock, with his relatively frail body pulling arm over armful of rope. Each chained pulley multiplies the expressed force of the previous, pretty much for as many layers as you like, trading distance for force. This seemingly simple technical innovation would find applications across human experience; Archie wasn’t the first to use a pulley, was almost certainly not the first to come up with the compound pulley, and was almost certainly not the first to use a pulley to make a military manoeuvre easier… But Archimedes went beyond the simple application of his stated ‘creations’ like the lever, the water screw and the compound pulley, to mathematically characterise their operation and performance, making them shareable, teachable, and in this day and age we might even say ‘scalable’.</p>
<p>My name’s Andrew Bolster, I’m the team lead of the Data Science team at WhiteHat Security, a director at your friendly neighbourhood hackerspace, Farset Labs, and in past lives I’ve worked on 4G telecoms which definitely don’t cause coronavirus, been an autonomous systems and AI researcher for the defence establishment designing communications networks for smart submarines, and have tried to teach machines how to read and respond to human emotions, and in the next hopefully 20 minutes, I am going to talk about the Missing E from STEM, Ethics, and how within the burgeoning field of Data Science, we have a serious ethics problem, and potentially, how we can work towards redressing it.</p>
<p>We’re going to approach this under three main themes;</p>
<p>· Ethical Optimisation, via our old friend the trolley problem;</p>
<p>· Unscrupulous Actors, Perverse Incentives and Bernhardsson’s Corollary, via our ‘it’s complicated’ friend Facebook; and finally</p>
<p>· Asilomar, or how an attempt to mess with carcinogenic monkey virus led to what I believe might be a depressingly bureaucratic but optimistically pragmatic approach to, maybe not putting the genie back in the bottle, but putting a leash on it.</p>
<p>So, lets see if I can keep to time. I think there’s a bracket in the slack room, put me down for a tenner on 26 minutes total.</p>
<h2 id="the-trolley-problem">The Trolley Problem</h2>
<p>We all know the story by now; There is a somehow unstoppable indestructible trolley barrelling down the tracks heading for some number of immobilised people, say five in this creative example. You, an innocent bystander, see this calamity and can predict what is going to happen if you do nothing. You also see that on another siding, some other number of people, in this case 1, are similarly restrained, and you can predict what happens if you pull the lever. You, dear bystander, can act to choose to reroute the trolley, killing one to save five.</p>
<p>And as we’re all good Vulcans, we can confidently act to maximise the number of lives; after all, the needs of the many outweigh the needs of the few, or in this case, that poor unfortunate one.</p>
<p>So, what we’ve got here is a good old fashioned optimisation problem. Ok. It’s clear that more lives are saved by acting in this case.</p>
<p>But I’m sure we’ve sat through talks that then go on at length about the combinations of the saps on each track, whether there’s a guy you can push off the tracks to stop the train, or if there’s a baby on one side and an old person on the other, to the point where you eventually work out that the exchange rate between ‘Baby’ and ‘OAP’ is something around 3.235 or whatever.</p>
<p>That’s not what I want to talk about.</p>
<p>Any machine learning system, as well as most data science research is a form of optimisation system; you feed it data, and you give it some way of knowing if it’s “good” or not, and decisions or recommendations pop out of a trained model. These models are generally educated on ‘goodness’ by what is called a fitness function.</p>
<p>Usually, and thankfully, it’s rarely that simple, but let’s say you do a massive global study and you somehow create this huge matrix of different people’s decisions so that you can confidently aggregate what the ‘normal’ human would do in a given situation. That sounds like a pretty awesome solution, right? Literally crowdsourcing ethics? Wee buns.</p>
<p>Well, as usual, MIT already did it. In 2017 they had 70,000 participants across 42 countries. And while the whole paper is a fascinating read, I’ll give you an easy highlight and move on.</p>
<p>To cherry pick; in 82% of responses, Germans, like most Europeans, did a Spock and sacrificed the one to save the many. However in China, that number is 58%. And China is not some surprising outlier; there is a smorgasbord of variations in judgement calls that are correlated within regions but vary across regions.</p>
<p>There is no global definition of what “a socially moral act” is, and indeed if you just took the straight average, more people would disagree with it than would agree with it. And this is in quite possibly the ‘simplest’ ethical question we think we can ask, saving lives or not.</p>
<p>So, we can’t optimise for ethics, because no one can or will be able to agree on what ethics is.</p>
<h2 id="unscrupulous-actors-and-perverse-incentives">Unscrupulous Actors and Perverse Incentives</h2>
<p>In mid-2008, Facebook proudly announced their Connect API. This enabled truly portable social login, and also enabled app developers to access the user’s social web to recommend, engage, and customise. Fun fact, their proudly launched blog post has since disappeared from the internet, go figure. Apps like Eventbee turned your friends into mini Ticketmasters by enabling them to get a cut of virally recommended ticket sales; Netflix let you share movie ratings within your friend group; Hell, the Watchmen Blu-ray had the ability to invite your friends for simultaneous watch parties, which would have been handy about now…</p>
<p>This was the dawn of a new age for the web, everything was connected to everyone, and the experience was customised on a per user basis, rather than requiring some bulk data access API or something similarly privacy-crazy.</p>
<p>And then, the quizzes started.</p>
<p>Netizens of a certain age remember the day before Facebook quizzes exploded, and the day after. It was hell. And it made no sense; but you just ignored your dumb-ass mates’ posts and moved on with your life.</p>
<p>By 2014, Facebook had killed this ‘next generation social network’, and years later, we found out why. But the genie was out of the bottle and the world was already changed. But, how?</p>
<p>How did a proliferation of pointless quizzes and single-purpose app integrations lead to the global manipulation of public opinion on an industrially mechanised scale?</p>
<p>Well, friends, it started off with people like us. Technologists, Data Scientists, Researchers, Academics, Numerical Fiddlers.</p>
<p>Between 2007 and 2016, The University of Cambridge Psychometrics Centre, which is part of the Judge Business School, ran a quiet little side project called ‘MyPersonality’ that, like many other psychometrics research groups across the world, leveraged the kind of viral social media reach that the Connect API enabled, to run playful personality tests to tell you what kind of dog you are or whatever. But these results were genuinely used for bona-fide, ethically moderated, research into the social and psychological questions of the day. The API itself provided a perverse incentive for data-driven researchers; the more people you could get to run a quiz, the more people whose demographic and interest metadata you could lift at the same time, even if the friends of the obsessive-compulsive quiz taker never touched, or consented to, your friendly little app. The movies they liked, the pages they liked, the artists they followed and the mutual friends they shared with the quiz taker were all on the table.</p>
<p>Today we call this kind of thing ‘Triangulation’, but at this time, it was just an interesting research context to characterise different quiz responses into different social buckets.</p>
<p>Like Age, Race, Occupation, Economic status, Education… Beliefs… And what kind of posts you would ‘like’</p>
<p>And the quizzes themselves, through psychometric aggregation, could score people on estimates like “Respect for authority”, “Voting Intention”, Media Savviness, Neuroticism and more.</p>
<p>I have to confess, I myself played with the same datasets that eventually got called ‘The Cambridge Analytica’ files. I was at the University of Liverpool at the time researching trust networks for applications in autonomous systems, so borrowing structures and activities from human social networks made sense. And I requested access, got vetted for my purposes by diligent academics, and I played with the things I asked for. All good, sounds nice.</p>
<p>Then one academic accidentally published their access credentials onto GitHub…</p>
<p>And there they were for over four years.</p>
<p>The story of what the evidently unscrupulous actors, SCL and Cambridge Analytica did with that data is better documented in other places, but I want to talk about Facebook’s handling of the situation.</p>
<p>Frankly, Facebook didn’t give a poop that there was an aggregated data set of millions of its users, which was in massive violation of the Connect API’s Ts&Cs. And it’s not like they didn’t know about it; a Facebook employee applied to access the data, but was rejected by vetters.</p>
<p>It is here I want to break off for a second. I think most of us are aware of ‘Hanlon’s Razor’: never attribute to malice that which can be adequately explained by stupidity. It’s a great way to stress less about our reptilian overlords and pizza shop basements, but I came across what I have termed “Bernhardsson’s Corollary”: “Never attribute to stupidity that which can be explained by opportunity cost”.</p>
<p>Facebook didn’t <strong>plan</strong> for their data to be misused by unscrupulous actors, they weren’t too stupid to think about it, and while Zuckerberg might look inhuman when he’s supping on Dihydrogen Monoxide, given the choice between “deep state conspiracy to elect a glorified used car salesman” and “prioritised shipping features over data governance”, my bet is always with over-worked data scientists, developers and product managers on a Sisyphean treadmill. It is easy to say today that ‘yeah, that’s a nice API and all, but what if someone doesn’t adhere to our lazy, unenforceable rules for legitimate research purposes, the same research that we do internally for funzies, and what if a private firm works out how to get access to the aggregate data and uses it to isolate and manipulate filter bubbles for the highest bidder?’.</p>
<p>Wargaming Data Governance “What If?” scenarios isn’t prioritised at an executive level; it wasn’t something that could be quantified on a burn-down chart, or allocated story points, and it was never going to end up in the investor briefings or press releases. Well, until it made it into the congressional record.</p>
<p>But lets not just poop over Zuck for this one; our memetastic interweb has an applicable ding on those supposedly responsible for oversight too.</p>
<p>Policy makers are supposed to be experts in policy; politicians are supposed to be experts in politics; neither can be expected to be experts in technology.</p>
<p>And modern corporate governance, which has eschewed internal research teams to optimise for the bottom line, can’t be expected to be informed enough to self-govern in a way that covers the explosive possibilities of the connected world.</p>
<p>But then, when public bodies <strong>do</strong> try and create these kind of structures, they inevitably run in to massive obstacles, particularly in implementation, with technologists tearing their remaining hair out wondering ‘what the hell does “appropriate measures to prevent deanonymisation” mean?’, looking at you GDPR…</p>
<p>They can’t prepare for unforeseen consequences, and they can’t see the perverse incentives until long after they’ve pushed their beautiful code.</p>
<h2 id="asilomar-and-public-scientific-policy">Asilomar and Public Scientific Policy</h2>
<p>So, crowdsourced ethics makes no one happy, corporate entities don’t account for moral failure on their balance sheets, and socio-economics prevents governments from effectively legislating this kind of thing, What about practitioners?</p>
<p>In 1974 a biochemist at Stanford called Paul Berg designed an experiment to stuff bits of a carcinogenic monkey virus into some E. coli bacteria, but when he shared his proposed experimental protocol, he got a very, very mixed response from the scientific community at the time. Long story short, they believed that it could create a quickly spreading, cancer-causing viral pathogen. Fun times.</p>
<p>Instead of pulling a Frankenstein, the 1975 Asilomar Conference on Recombinant DNA was held with over 100 internationally respected molecular biologists in attendance, with Berg as its chair. There, they established a set of guidelines to be followed by all scientists doing this kind of recombinant DNA research. They considered several classes of experiment and assigned relative levels of risk, from minimal to high. Each level of risk required a corresponding set of containment procedures, designed to minimise the chance of carriers escaping into the wider world and causing, in our established parlance, ‘unforeseen consequences’. This included forbidding certain classes of experiment outright, such as using materials from highly infectious or toxic genomes, or running experiments to produce large quantities of reagents.</p>
<p>These guidelines were predominantly ‘socially enforced’ rather than being legislated; with a significant strand of the conference discussing the importance of operating ‘in the open’ and bringing their science into the public eye. Some commentators suggest that this is a reflection of the distrust of government in the wake of the Watergate scandal, such that it was easier for the public to trust scientists following scientific consensus than it was to trust scientists following government derived and potentially variably enforced legislation.</p>
<p>This generation of an open consensus within a community of practitioners and experts led to an increasingly positive view of the field within the general public.</p>
<p>Somewhat ironically from our vantage point almost 50 years later, this enabled specialists who had previously confined themselves to academia, to develop ties in the private sector, as shareholders, executives, and consultants; engaging with private industry in a constructive way, because their Biomolecular ‘Hippocratic Oath’ provided a stability and confidence to explore new innovations and opportunities, safely.</p>
<p>Berg went on to earn the Nobel Prize in Chemistry in 1980 and, by the 20th anniversary of the Asilomar conference in 1995, genetics and its terminology had become part of the day-to-day vocabulary and not something restricted to the hallowed halls of university or private research establishments.</p>
<p>I’d argue that if it’s good enough for cancer-causing simian stomach bugs, it’s good enough for Data Science. Efforts along these lines are already active, and I would encourage you as practitioners to take part. There are many Data Ethics projects across the world, as well as Gillian’s Human Impact Statements; some of them open source, such as the Open Ethics Canvas from the Open Data Institute, or public sector, such as GOV.UK’s Data Ethics Framework, which, while being good, has no interesting graphics associated with it other than Matt Hancock’s face, which I won’t subject you to. As well as this there is a range of ad-hoc standards, drifting towards professional ones, being developed for data governance and ethical data science.</p>
<p>Of particular interest to me at the moment is the IEEE’s proposed 7000 series of standards on ethically aligned design, which ranges from P7003, concerned with Algorithmic Bias Considerations, to P7002, Data Privacy Processes, and one of the weirder ones (and a bit of a mouthful), P7014, the Standard for Ethical considerations in Emulated Empathy in Autonomous and Intelligent Systems, which is to do with the strange world of thinking about systems that quantify, respond to, or simulate emotion. And of course I sit on the working group for the weird one.</p>
<p>These groups bring wide and diverse voices together to interrogate and twist the use and potential misuse of data science and autonomous systems, developing shared standards and guidelines to which we as a community of practice should adhere.</p>
<p>And it’s far from perfect; the bureaucracy involved in engaging a community of thousands of practitioners is not without its challenges. With the ever-taunting pressure of ‘just get it done’ hanging over us all, we have to ask the question: how do we do this?</p>
<p>So we’ve covered the challenge of mathematically operating in an ethical, human world; the inevitable challenge of unforeseen consequences; and, as usual in software engineering, we think we might be able to steal another field’s good ideas to save our own skins. What does this all mean as a Data Scientist today?</p>
<h2 id="what-now">What now?</h2>
<p>Frankly, it’s a fraught time to be a Data Scientist. I don’t have any easy answers.</p>
<p>With an analytical background, and the usual laundry list of ‘skills’ this industry desires, you’re used to developing tools against constraints, but when there’s no accepted, or potentially even possible, standard against which to measure your work ethically, how can you write code or interpretations of data that you can stand behind?</p>
<p>If the Ethics is unaccountable and un-costable, how do we progress with any kind of confidence as an industry?</p>
<p>Or do we just resign ourselves to say that Data Science as a field is the art of fulfilling the ‘lies, damned lies and statistics’ aphorism, or getting trapped in Gillian’s ‘Get It Done’ environment?</p>
<p>For years, I considered myself an Engineer above being a Scientist, approaching problems in the practical mindset of “How do we do the thing that satisfies the requirements with the most practical, elegant solution?”</p>
<p>These days, I’ve been forced to accept that we don’t know what the “thing” is, and that the requirements will be written by the headlines 5 years later. The best we can hope for is a generation of data scientists, developers, engineers, and product managers who take a pragmatic approach; who assess second-order risks like perverse incentives, and accept that there are unscrupulous actors more ingenious than us; who take a step back and remember that ethical optimisation is at best a heuristic rather than an algorithm; and who, above all, try to leave the world a little better, and better documented, than we found it.</p>
<p>It’s on us, as an industry and as a field, to hold each other accountable, and to question and support each other’s decisions, pragmatically, fairly, and openly.</p>
<p>Thank you.</p>
Is Your AI Ethical?2020-04-26T20:16:00+00:00https://andrewbolster.info/2020/04/is-your-ai-ethical<p><em>Originally posted in <a href="https://www.rtinsights.com/is-your-ai-ethical/">RTInsights</a></em></p>
<p><strong>Businesses should do their part to ensure products are designed judiciously to reflect core company values and provide audit trails of how AI is learned.</strong></p>
<p>As we examine an increasing reliance on artificial intelligence (AI) and machine learning, it’s being revealed that AI can have a built-in bias, whether intentional or not.</p>
<p>In late 2019, Apple and Goldman Sachs faced allegations that the Apple Card used an algorithm that discriminated against women in credit-scoring evaluation – after Apple’s own co-founder Steve Wozniak and entrepreneur David Heinemeier Hansson received credit limits 10-20 times higher than their wives’.</p>
<p>A <a href="https://www.pnas.org/content/early/2020/03/17/1915768117">recent study</a> also found that AI-based automated speech recognition (ASR) systems from Amazon, Apple, Google, IBM, and Microsoft exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for African American speakers, compared with 0.19 for Caucasian speakers. The study highlighted the need to invest resources into ensuring that ASR systems and speech researchers are broadly inclusive.</p>
<p>Companies are only beginning to understand issues with AI-based products and take action towards more ethical AI. A <a href="https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/state-of-ai-and-intelligent-automation-in-business-survey.html">State of AI in the Enterprise</a> survey from Deloitte found that 32% of executives ranked ethical issues as a top-three risk of AI, but most don’t yet have specific plans in place to address the risk.</p>
<p><strong>The Road to Ethical AI</strong></p>
<p>In the corporate world, AI is being used for everything from the development of new products and platforms to driving marketing initiatives and major business decisions.</p>
<p>In a world where data is the new oil, it is increasingly tempting to throw all the data you can lay your hands on at a problem to solve it quickly, cheaply, and accurately for your planned use case; however, it’s vital that the data you collect, clean, and use to train such models is vetted for its inclusiveness, its correctness, and its ethics.</p>
<p>Part of any major machine learning project that interacts with humanity should be an assessment of its fairness. Indeed, many standards groups are attempting to build consensus around such ethics, such as the IEEE’s P7000 series on Ethics of Autonomous and Intelligent Systems.</p>
<p>While progress is being made within standards bodies, organizations can also take steps to do their part to ensure products are designed judiciously to reflect core company values, provide audit trails of how AI is learned, and finally be remediated if or when it discriminates or causes harm. If we cannot do so at the outset of AI design and ensure inclusive data for machine learning, we risk losing the benefits of AI altogether. Organizations using and creating AI-based products are recognizing that responsible innovation requires stronger internal governance.</p>
<p><strong>Getting Started</strong></p>
<p>One solution is building internal ethics committees. For organizations, they can help with the design of AI-based products and services, while ensuring privacy, security, fairness, and ultimately building trust among consumers and partners.</p>
<p>But building ethics committees is easier said than done. It was reported that Google formed an AI ethics committee that was to meet quarterly in 2019, but it only lasted a week.</p>
<p>Luckily, a new report from the Ethics Institute at Northeastern University and Accenture entitled <a href="https://www.accenture.com/us-en/insights/software-platforms/building-data-ai-ethics-committees">Building Data & AI Ethics Committees</a> offers expert guidance.</p>
<p>The report states that when getting started with an ethics committee, it’s critical to put together the right team of people to represent organizational stakeholders. From there, they must think through and agree on key functions, values, principles, and processes. Key committee considerations include:</p>
<ul>
<li>What are the basic values the committee is meant to protect?</li>
<li>What are the guiding principles in support of the values?</li>
<li>What are the types of expertise needed?</li>
<li>What are the standards by which committees make judgments?</li>
<li>How can the committee avoid bias and conflicts of interest?</li>
<li>When should the committee be consulted?</li>
<li>What authority does the committee have?</li>
</ul>
<p>While forming and managing a committee may be difficult in the beginning, having one in place could help prevent AI issues down the road caused by biased product development, and better inform engagements and business decisions that will ultimately build trust and confidence with customers.</p>
The Importance of Active Learning in Data Science and Engineering2020-03-04T11:10:00+00:00https://andrewbolster.info/2020/03/the-importance-of-active-learning-in-data-science-and-engineering<p><em>Originally posted in <a href="https://www.cybersecurity-insiders.com/the-importance-of-active-learning-in-the-data-science-and-engineering/">Cybersecurity Insiders</a></em></p>
<p>Back when I was pursuing my undergraduate degree in electronics and software engineering, I couldn’t imagine a path that would lead to me working with NATO on port protection and maritime defense, teaching smart submarines how to trust each other. But while I was working toward a Ph.D., that’s what happened. Instead of following the path into academia, a friend enticed me to work with him on biometrics. From there, I found an opportunity to apply my skills and knowledge to the cybersecurity industry – but that’s not something I could have predicted either.</p>
<p>The new year has me reflecting on the roundabout approach that led to my current role as a machine learning (ML) team lead with WhiteHat Security. I think it’s important to share some of the challenges, experiences, and opportunities I’ve been fortunate enough to have gone through with others who may be just at the beginning of a journey or professional career in data science and engineering. Like many, my experiences have shaped my views on certain issues and advancements in technology. For example, during the past eight years as a chairman of a charitable hackerspace in Northern Ireland, I have become a huge proponent of open data and transparent government. I also believe that, with enough work, many of the world’s current challenges – climate change, road traffic accidents, infectious diseases, obesity, and even corrupt governments – can be faced with a combination of accurate, honest data and the advances in machine learning and data science techniques.</p>
<p>If you’re planning to pursue a career in STEM, one important thing to consider is that feeding your interest in these fields is one of the smartest investments you can make. Look for every opportunity to engage and immerse yourself in conversations, and self-directed projects or research. If you’re still a full-time student, don’t panic about exams or even grades. With an understanding that your education could be out-of-date by the time you finish your degree pathway, know that your long-term success will more accurately be defined by how much you can learn and remain an active learner in your own experiences.</p>
<p>Another critical factor in building success in science and engineering is to find mentors early on. Having sat on either side of the interview table, I can confidently say that the most active learners with a keen interest in their fields have often worked very closely with an experienced mentor along the way, and subsequently gift their time and experience to the next generation of learners that comes after them. The role a mentor plays in keeping you engaged and curious cannot be overstated. Not to mention, a mentor can show you all the best ways to ‘win friends and influence people’ throughout an organization.</p>
<p>As an example of active learning in my own career, my team was recently working on verifying vulnerabilities. Before, using scanning technologies to assess websites was difficult because the scanner can be quite ‘dumb.’ Our team is building a secondary system to apply ML and develop an augmented risk assessment system on top of the battle-tested scanner, to add experience over time. Whenever the ML has seen vulnerabilities on a website, they are clustered and grouped together and are verified by human experts. These experts then perform a deep dive into websites to investigate further, but it takes a lot of time and resources to do so. To serve our customers, we must be able to scale this process. All vulnerabilities are human verified, leveraging the collective experience they possess. When building with ML, the models improve as they receive more engagement from the subject matter experts – and this is where active learning and natural curiosity play a very strong role.</p>
<p>One thing that ML is terrible at is pattern recognition on time series data. Despite all of the research into stock market trends, for example, ML is very bad at identifying patterns in noisy, time-varying, multi-value inputs. This is because ML doesn’t understand the reason for changes over time and their pattern correlations. What’s missing is the human ability to infer, correlate, and leverage past experience. Humans can say, “Of course that vulnerability won’t work, because this happened.” ML – whether fortunately or unfortunately – lacks common sense, and the ability to intuitively make these kinds of inferences. But this is an exciting time to pursue a career in this field because this is being researched at this very moment!</p>
<p>As you pursue STEM studies, make time to explore and follow new and emerging technologies. Bear in mind that by graduation, there could be entirely new fields and industries to get excited about. That’s a major upside to science and engineering – it’s constantly evolving!</p>
<p>Become plugged into the industry from a global perspective. Read about what’s happening in the world and draw your inspiration from the latest research or breakthroughs as you explore your own potential. But be sure to strike a balance between work and life, too. There are plenty of social opportunities to have fun and make friends at events like meetups, hackathons, or robotics competitions, and quite often these connections will form the basis of strong and long-lasting personal and professional relationships. These relationships will support you as you build a reputation within the field, and even bring you professional opportunities that many following classical pathways would not normally have access to.</p>
<p>Finally, remember that studying science and engineering is serious, and it’s rare that things will run smoothly every time. There are no experts, just people who have made (and learned from) more mistakes than you have. Expect the unexpected but keep your wits and a sense of humor about it all. Use the time to learn your own preferences – do you enjoy the precision and methodology of lab work, or are you more of a free spirit, who prefers to be in the field? By exploring these options early, you can figure out what appeals to you most and set your course with purpose.</p>
Mosquitto (MQTT) Emon Pi (Open Energy Monitor) Forwarding Bridge2020-02-18T16:40:00+00:00https://andrewbolster.info/2020/02/mosquitto-mqtt-emon-pi-open-energy-monitor-forwarding-bridge<p>Super quick one this time; I’ve been experimenting with <a href="http://mqtt.org/">MQTT</a> to act as a central messaging broker for “Farset In-Space Related Stuff” as part of the near continuous <a href="https://blog.farsetlabs.org.uk/2019/09/farset-labs-2-0-nearly-ready-to-go/">renovations and expansions</a>.</p>
<p>We previously had a well configured <a href="https://wiki.openenergymonitor.org/index.php/EmonPi">EMonPi</a> set up with nice dashboards and things, but that died a death at some point during the move, who knows.</p>
<p>Anyway, EmonPi has a built in <a href="https://mosquitto.org/">mosquitto</a> broker, which it uses to keep ‘state’ across several parts of the emonpi ecosystem.</p>
<p>On the other side, I had a relatively easy job getting <a href="https://hub.docker.com/_/eclipse-mosquitto">Mosquitto set up via Docker on our Synology NAS</a> (also <a href="https://nodered.org/">NodeRed</a>, but that’s for another day), however I had no idea how to connect the two.</p>
<p>I was expecting the kind of intricate surgery needed to ‘replace’ a core part of emonpi’s architecture with an ‘off-device’ broker, and all the failure that that would entail, but helpfully, the MQTT protocol has a concept of “bridging”, where you can effectively make one broker ‘publish’ all its topics on another broker. <a href="http://www.steves-internet-guide.com/mosquitto-bridge-configuration/">Super Easy, Barely an Inconvenience</a>.</p>
<p>TL;DR
Create a file called <code class="language-plaintext highlighter-rouge">bridge_over_the_river_que.conf</code>* in the mosquitto/config/conf.d/ folder on the <em>source</em> device (in this case the EmonPi), with the following contents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#connection farset
connection bridge-fsl
address 192.168.1.222:1883
topic # out 0
</code></pre></div></div>
<p>If you want to make the ‘bridge’ bidirectional, you can add <code class="language-plaintext highlighter-rouge">topic # in 0</code> to also grab everything from the ‘destination’ broker, but in this case, we don’t want that.</p>
<p>If we wanted to be pedantic and to only forward <code class="language-plaintext highlighter-rouge">emon</code> sensor <a href="https://guide.openenergymonitor.org/technical/mqtt/">topic values</a>, or values from a particular device, you can replace the <code class="language-plaintext highlighter-rouge">#</code> with the appropriate <a href="https://subscription.packtpub.com/book/application_development/9781787287815/1/ch01lvl1sec18/understanding-wildcards">topic wildcard</a> (you knew that you could do wildcards in MQTT right? Cus I didn’t!)</p>
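<p>For example, if you only wanted to forward the <code class="language-plaintext highlighter-rouge">emon</code> topic tree, the bridge’s <code class="language-plaintext highlighter-rouge">topic</code> lines might look something like this (the node name in the second line is a made-up example; match it to your own devices):</p>

```conf
# Forward the whole OpenEnergyMonitor topic tree to the remote broker
topic emon/# out 0

# Or just one node's feeds (hypothetical node name)
topic emon/emonpi/# out 0
```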
<p>* Naturally, adjust the values to something that makes sense in your setup….</p>
Headless Pi Configuration with Multi-Wifi Remote Access2020-02-17T15:28:00+00:00https://andrewbolster.info/2020/02/headless-pi-configuration-with-multi-wifi-remote-access<p>Raspberry Pis are great, but often have a lot of baggage associated with them, and I keep forgetting all the clever things you can do to get them up and running without having a sea of cables attached…</p>
<h1 id="results">Results</h1>
<p>Raspberry Pi that can be <code class="language-plaintext highlighter-rouge">ssh</code>d into from anywhere in the world* without poking any firewall rules or anything other than power connected to it, that works in a range of WiFi access points.</p>
<p>*(Requires access to an external domain or static IP server)</p>
<h1 id="requirements">Requirements</h1>
<ul>
<li>Raspberry Pi with Wifi (Either onboard or dongle)</li>
<li>Externally accessible <code class="language-plaintext highlighter-rouge">ssh</code>able server</li>
<li>(Micro) SD Card reader and SD Card (>=8GB)</li>
<li>Another machine to do all the magic with</li>
<li>You to be sitting in a WiFi area that you know the key for…</li>
</ul>
<h1 id="setup">Setup</h1>
<h2 id="os-image">OS Image</h2>
<ol>
<li>Install <a href="https://www.balena.io/etcher/">Etcher</a> (Or similar, see <a href="https://www.raspberrypi.org/documentation/installation/installing-images/">here</a> for other options)</li>
<li>Download Raspbian Lite from <a href="https://www.raspberrypi.org/downloads/raspbian/">here</a></li>
<li>Use Etcher (or whatever) to burn the downloaded image to the SD card</li>
<li>Once verified, eject, unplug, and replug the SD card into the machine.</li>
</ol>
<h2 id="boot-fiddling"><code class="language-plaintext highlighter-rouge">Boot</code> Fiddling</h2>
<h3 id="enable-ssh">Enable SSH</h3>
<p>When you plug the card back in, you should see a <code class="language-plaintext highlighter-rouge">boot</code> volume appear in your Finder/Explorer/File Manager, in it, create an empty file called <code class="language-plaintext highlighter-rouge">ssh</code>. e.g.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">touch</span> /Volumes/boot/ssh
</code></pre></div></div>
<h2 id="pre-configure-wifi">Pre-Configure Wifi</h2>
<p>Then in your favourite text editor, create another file in the same directory called <code class="language-plaintext highlighter-rouge">wpa_supplicant.conf</code> that looks something like this, populated with the <code class="language-plaintext highlighter-rouge">ssid</code> and <code class="language-plaintext highlighter-rouge">psk</code> keys for your relevant networks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
# Set this to your ISO 3166 country code:
# https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
    ssid="TotallyMyWorkSSID"
    psk="supersecretandtotallyaccurate"
}
network={
    ssid="MyHomeWifi"
    psk="homesweethome"
}
network={
    ssid="farset-members"
    psk="probablytheworstkeptsecretinbelfast"
}
</code></pre></div></div>
<blockquote>
<p>If you have a ‘weird’ password with quotes or unescaped symbols, you can use <code class="language-plaintext highlighter-rouge">wpa_passphrase MY_SSID</code> to generate the relevant pre-encrypted <code class="language-plaintext highlighter-rouge">psk</code>’s that are acceptable to <code class="language-plaintext highlighter-rouge">wpa_supplicant.conf</code>, and you should be able to run this either on the pi directly later, on any other pi, or on any other linux-based wifi-enabled machine… You can probably do it in a VM too, but I haven’t tested that…
<a href="https://www.raspberrypi.org/forums/viewtopic.php?p=1472865&sid=0e9fd7545d8ce3e46ffaa9f212649697#p1472865">Source</a></p>
</blockquote>
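<p>If you find yourself doing this for lots of Pis, the file can be scripted. The sketch below is a hypothetical helper (not part of any official tooling) that emits the header plus one plain-text network block per SSID/PSK argument pair; the credentials and output filename are just the examples from above:</p>

```shell
# Hypothetical helper: emit a wpa_supplicant.conf from alternating
# SSID/PSK argument pairs.
make_wpa() {
    # Fixed header, matching the example file above
    printf 'country=GB\n'
    printf 'ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev\n'
    printf 'update_config=1\n'
    # One network block per SSID/PSK pair of arguments
    while [ "$#" -ge 2 ]; do
        printf 'network={\n    ssid="%s"\n    psk="%s"\n}\n' "$1" "$2"
        shift 2
    done
}

# Generate a config for two networks (example credentials from above)
make_wpa "MyHomeWifi" "homesweethome" \
         "farset-members" "probablytheworstkeptsecretinbelfast" > wpa_supplicant.conf
```

If you want hashed keys rather than plain text, the <code class="language-plaintext highlighter-rouge">printf</code> for the network block could be swapped for a call to <code class="language-plaintext highlighter-rouge">wpa_passphrase</code> instead.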
<h2 id="squeaky-bum-time">Squeaky Bum Time</h2>
<p>At this point, eject and unplug the SD card from the machine, and plug it into the (unpowered) Pi, and then the moment of truth; power it on…</p>
<blockquote>
<p>At this point I recommend waiting at least 5 minutes, get a coffee or a <a href="https://www.farsetlabs.org.uk/about/club_mate.html">ClubMate</a></p>
</blockquote>
<blockquote>
<p>While you’re waiting, take some time to review the <a href="https://elinux.org/R-Pi_Troubleshooting#Power_.2F_Start-up">Power/Startup</a> section of the Pi Troubleshooting Guide, just in case…</p>
</blockquote>
<h2 id="first-login">First Login</h2>
<h3 id="easy-mode-mdns">Easy Mode; mDNS</h3>
<p>If you’re lucky and have a nice router on your wifi network, you <em>might</em> be able to log in to your device with the default password <code class="language-plaintext highlighter-rouge">raspberry</code>, using one of the commands below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@raspberrypi
</code></pre></div></div>
<p>or</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@raspberrypi.local
</code></pre></div></div>
<h3 id="hard-mode-port-scan">Hard Mode; Port Scan</h3>
<p>Using a network scanner like <code class="language-plaintext highlighter-rouge">nmap</code>, scan for IPs that have the SSH port (22) open, and try a few. YMMV.</p>
<p>On basic networks this should be as simple as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo nmap -sS -p 22 192.168.0.0/24
</code></pre></div></div>
<p>But you may also want to check out the <a href="https://www.fing.com/products/fing-app">Fing</a> android network scanner which is great at identifying Raspberry Pi’s based on their <a href="https://www.macvendorlookup.com/">Vendor MAC Address</a></p>
<p>Once you’ve identified the IP address of the Pi, you can connect to it like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@<IP-ADDRESS>
</code></pre></div></div>
<h2 id="good-housekeeping">Good Housekeeping</h2>
<p>Now that we’re in, there are a few steps that we should do just for good housekeeping;</p>
<ol>
<li>Update the Pi with <code class="language-plaintext highlighter-rouge">sudo apt-get update; sudo apt-get upgrade -y</code></li>
<li>Using <a href="https://www.raspberrypi.org/documentation/configuration/raspi-config.md"><code class="language-plaintext highlighter-rouge">sudo raspi-config</code></a>, perform the following actions
<ol>
<li>Change the <code class="language-plaintext highlighter-rouge">pi</code> user password to something custom</li>
<li>Network Options > Set the Hostname to something memorable (We’ll be referring to this in future as PI_HOST)</li>
<li>Advanced Options > Update</li>
</ol>
</li>
<li>Reboot with <code class="language-plaintext highlighter-rouge">sudo reboot</code></li>
<li>Log back in with either <code class="language-plaintext highlighter-rouge">ssh pi@PI_HOST</code>, or your previously found IP address (or return to the “Port Scan” section if that doesn’t work)</li>
<li>Run <code class="language-plaintext highlighter-rouge">ssh-keygen</code> (without specifying a passphrase)</li>
<li>Print out and note the contents of the public key generated using <code class="language-plaintext highlighter-rouge">cat ~/.ssh/id_rsa.pub</code></li>
</ol>
<h2 id="prepping-autossh">Prepping <code class="language-plaintext highlighter-rouge">autossh</code></h2>
<p>We’ll be using the <code class="language-plaintext highlighter-rouge">autossh</code> program to, well, automate ssh. This will be able to provide us with a “Reverse SSH Tunnel” to get into the Pi via an internet-accessible server, that we’ll be calling <code class="language-plaintext highlighter-rouge">JUMPBOX</code></p>
<h3 id="jumpbox-prep"><code class="language-plaintext highlighter-rouge">JUMPBOX</code> Prep</h3>
<p>Log into the <code class="language-plaintext highlighter-rouge">JUMPBOX</code>, and create a new <code class="language-plaintext highlighter-rouge">nologin</code> user</p>
<blockquote>
<p><strong>All the commands in this section should be executed on <code class="language-plaintext highlighter-rouge">JUMPBOX</code> as the <code class="language-plaintext highlighter-rouge">root</code> user (or add appropriate <code class="language-plaintext highlighter-rouge">sudo</code>s)</strong></p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>useradd -m -s /sbin/nologin --disabled-password autotunnel
su - autotunnel -s /bin/bash
ssh-keygen
</code></pre></div></div>
<p>You’ll be asked for a <code class="language-plaintext highlighter-rouge">passphrase</code> here; don’t enter one, as this ensures we’ll be able to securely set up the tunnel without manually entering passwords.</p>
<p>Then we’ll construct an <code class="language-plaintext highlighter-rouge">authorized_keys</code> file (note the Americanisation!) in the <code class="language-plaintext highlighter-rouge">.ssh</code> directory so our pi will be able to log into this account automatically.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir ~/.ssh
vim ~/.ssh/authorized_keys
</code></pre></div></div>
<p>In this file, add the contents of the <code class="language-plaintext highlighter-rouge">cat ~/.ssh/id_rsa.pub</code> call that we performed on the Pi, and save it (with <code class="language-plaintext highlighter-rouge"><ESC>:wq</code>, in case you forgot…)</p>
<p>We need to make sure these files have the correct permissions, so…</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
</code></pre></div></div>
<h3 id="pi_host-prep"><code class="language-plaintext highlighter-rouge">PI_HOST</code> Prep</h3>
<p>Autossh isn’t installed by default:</p>
<p><code class="language-plaintext highlighter-rouge">sudo apt-get install autossh -y</code></p>
<p>Similar to the setup of the <code class="language-plaintext highlighter-rouge">autotunnel</code> user on the <code class="language-plaintext highlighter-rouge">JUMPBOX</code>, we do the same on the <code class="language-plaintext highlighter-rouge">PI_HOST</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo useradd -m -s /sbin/nologin --disabled-password autotunnel
sudo su - autotunnel -s /bin/bash
ssh-keygen
ssh-copy-id JUMPBOX
ssh JUMPBOX
</code></pre></div></div>
<p>At this point you’ll be asked to confirm things like accepting the SSH key of the JUMPBOX server, but you won’t actually get a login shell and will be kicked off; this is because we set up the <code class="language-plaintext highlighter-rouge">autotunnel@jumpbox</code> user with <code class="language-plaintext highlighter-rouge">/sbin/nologin</code>, so it’s fine.</p>
<p>Now, use <kbd>Ctrl</kbd> + <kbd>D</kbd> to “escape” from the <code class="language-plaintext highlighter-rouge">autotunnel</code> user back to the <code class="language-plaintext highlighter-rouge">pi</code> user.</p>
<p>Finally, <code class="language-plaintext highlighter-rouge">sudo -s</code> into the root user and execute the following to create an auto-starting service file:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="o">></span> /etc/systemd/system/autossh-JUMPBOX.service <span class="o"><<</span> <span class="no">EOF</span><span class="sh">
[Unit]
Description=Keep a tunnel to 'JUMPBOX' open
After=network-online.target
Wants=network-online.target
[Service]
Type=forking
User=autotunnel
ExecStart=/usr/bin/autossh -f -M 0 -N -i /home/autotunnel/.ssh/id_rsa -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=false autotunnel@JUMPBOX -R 2222:127.0.0.1:22
ExecStop=/usr/bin/pkill -9 -u autotunnel
Restart=always
[Install]
WantedBy=multi-user.target
</span><span class="no">EOF
</span>systemctl daemon-reload
systemctl <span class="nb">enable </span>autossh-JUMPBOX.service
reboot
</code></pre></div></div>
<p>And then wait a while and hope for the best!</p>
<h1 id="persistent-connections">Persistent Connections</h1>
<p>All-in-one command to test your JUMPBOX/PI_HOST connection;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh -J user@JUMPBOX pi@localhost -p 2222
</code></pre></div></div>
<p>Once you’re happy that this is fairly stable, you can set up a ‘shortcut’ to this host in your local <code class="language-plaintext highlighter-rouge">~/.ssh/config</code> file with something like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host PI_HOST
HostName localhost
Port 2222
ProxyJump user@JUMPBOX
User pi
</code></pre></div></div>
<p>Then you can execute the following to set up a passwordless passthrough connection:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh-copy-id PI_HOST
ssh PI_HOST
</code></pre></div></div>
<h1 id="sources">Sources</h1>
<ul>
<li><a href="https://desertbot.io/blog/headless-raspberry-pi-4-ssh-wifi-setup">Headless Raspberry Pi 4 SSH WiFi Setup</a></li>
<li><a href="https://unix.stackexchange.com/questions/56765/creating-a-user-without-a-password">Creating a user without a password</a></li>
<li><a href="https://hobo.house/2016/06/20/fun-and-profit-with-reverse-ssh-tunnels-and-autossh/">Fun and Profit with Reverse SSH Tunnels and AutoSSH</a></li>
<li><a href="https://gist.github.com/thomasfr/9707568">Autossh.service</a></li>
</ul>