<h1 id="jupyter-environment-management-for-dummies">Jupyter Environment Management for Dummies</h1>
<p><em>Of Penguins and Coffee, Andrew Bolster, 2024-01-17, <a href="https://andrewbolster.info/2024/01/jupyter-environment-management-for-dummies">andrewbolster.info</a></em></p>
<p>This is another one of those “I kept googling the same thing over and over again” things that needed a post, except this time I made an <a href="https://github.com/andrewbolster/andrewbolster.github.io/issues/8">issue to make a post</a> and then started to repeatedly refer to that.</p>
<h2 id="tldr">TL;DR</h2>
<p>When you want to spin up an experimental environment and get it tied in to your Jupyter environment of choice (I actually quite like <a href="https://github.com/jupyterlab/jupyterlab-desktop">JupyterLab Desktop these days…</a>), you need two steps.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create the conda environment with $NAME and ipykernel as the main dependency</span>
conda create <span class="nt">--name</span> <span class="nv">$NAME</span> ipykernel <span class="o">&&</span> conda activate <span class="nv">$NAME</span>
<span class="c"># Once you're in the new environment, register it in the _global_ kernel list (there may be a better `kernelspec` way to do this but I haven't found it yet.)</span>
ipython kernel <span class="nb">install</span> <span class="nt">--name</span><span class="o">=</span><span class="nv">$CONDA_DEFAULT_ENV</span> <span class="nt">--user</span>
<span class="c"># ... do real work here ...</span>
<span class="c"># If you messed up and need to nuke it from orbit without wiping out the rest of your env;</span>
jupyter kernelspec uninstall <span class="nv">$NAME</span>
</code></pre></div></div>
<p>BTW it can sometimes take a few minutes (and a few interactions) to coax JupyterLab into identifying the new kernel.</p>
<h2 id="aside-on-kernel-names">Aside on Kernel Names</h2>
<p>There’s also a trick for getting the current kernel name in a Notebook that doesn’t abuse any magic functions;</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">kernel_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"/bin/python"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
</code></pre></div></div>
<p>But what <em>does</em> abuse the magics is then using that in-kernel Python variable to do the <em>out</em>-of-kernel invocation of the ‘correct’ install incantation;</p>
<p><strong>Reminder</strong>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">!command</code> runs the command in the shell of the JupyterLab runner (so you <em>can</em> install Jupyter extensions etc. from a running notebook.)
<ul>
<li>e.g. <code class="language-plaintext highlighter-rouge">!which python</code>, which gives you the system python of the runtime the <em>jupyterlab runner</em> is executing in, not the notebook kernel</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">%command</code> runs one of a small set of <a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html">magic commands</a> for interacting with the <em>notebook</em> kernel
<ul>
<li>e.g. <code class="language-plaintext highlighter-rouge">%conda install -y -q <requirement></code> or <code class="language-plaintext highlighter-rouge">%pip install -q <requirement></code> (<code class="language-plaintext highlighter-rouge">pip</code> has no <code class="language-plaintext highlighter-rouge">-y</code> flag) <strong>DON’T COPY PASTE THESE YET</strong></li>
</ul>
</li>
</ul>
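<p>One way to sidestep that runner/kernel mismatch (a sketch, not gospel) is to build the install command from the kernel’s own interpreter path, since <code class="language-plaintext highlighter-rouge">sys.executable</code> always points at the Python the notebook is actually running on:</p>

```python
import sys

# sys.executable is the *kernel's* Python, so a shell command built from it
# installs into the kernel's environment, not the JupyterLab runner's.
pip_cmd = f"{sys.executable} -m pip install -q <requirement>"
print(pip_cmd)
# In a notebook cell you'd run it with the ! shell escape:
#   !{sys.executable} -m pip install -q <requirement>
```

<p>(<code class="language-plaintext highlighter-rouge"><requirement></code> is a placeholder; swap in a real package name before running it.)</p>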
<p><em>However</em> some environment setups (particularly <code class="language-plaintext highlighter-rouge">conda</code> related ones) cause a bit of mayhem; for instance, when you run the <code class="language-plaintext highlighter-rouge">%conda env list</code> magic, you may get a response back with a <code class="language-plaintext highlighter-rouge">*</code> beside the ‘base’ environment, which is wrong given we’re in the notebook kernel we just created above.</p>
<p>To be <em>sure</em> that you’re actually installing things in the right environment, this works;</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">kernel_name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"/bin/python"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">conda</span> <span class="n">install</span> <span class="o">-</span><span class="n">y</span> <span class="o">-</span><span class="n">n</span> <span class="err">$</span><span class="n">kernel_name</span> <span class="n">magicalpackagenamethatsdefinitelynotahallucination</span>
</code></pre></div></div>
<h1 id="pulling-election-count-data-out-of-google-sheets-for-fun-and-democracy">Pulling Election Count data out of Google Sheets for fun and democracy</h1>
<p><em>2023-05-18, <a href="https://andrewbolster.info/2023/05/pulling-election-count-data-out-of-google-sheets-for-fun-and-democracy">andrewbolster.info</a></em></p>
<h1 id="messing-around-with-elections-ni-data">Messing around with Elections NI data</h1>
<p>Sources:</p>
<ul>
<li><a href="https://docs.google.com/spreadsheets/d/11o0rbI-NVcPJhkBZxInN4qn55rm0TCJT9ERitaiN1fk/edit?usp=sharing">Live Data (for 2023)</a></li>
<li><a href="https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=264660014">2022 Assembly Elections</a></li>
</ul>
<h2 id="creating-your-own-google-sheet-and-referencing-the-crowdsourced-data">Creating your own Google Sheet and referencing the crowdsourced data</h2>
<p>The above-linked spreadsheets are naturally not editable by everyone; this is great for reliable data but isn’t so great when you want to make pretty graphs.</p>
<p>Google Sheets supports the live referencing of external sheets in your own sheets, so you can ‘import’ the data from the read-only sheets as they evolve over the count, and then reference those data in your own visualisations.</p>
<p>This is done using the <a href="https://support.google.com/docs/answer/3093340?hl=en&ref_topic=9199554&sjid=4812370886125323989-EU">IMPORTRANGE</a> function in Google Sheets, so like this;</p>
<p><code class="language-plaintext highlighter-rouge">=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=372848906","Belfast South/West!A2:Q24")</code></p>
<p><img src="/img/eni_sheet_1.png" alt="Image of Imported Spreadsheet showing separated results for the Belfast South Assembly Election in 2022" /></p>
<p>Using this and the <a href="https://support.google.com/docs/answer/3093343?hl=en">Google QUERY language</a>, you can easily create some pretty dynamic graphs in a couple of lines/cells across your own sheets, all while being ‘fed’ by the main collaborative work.</p>
<p><img src="/img/eni_sheet_2.png" alt="Image of PieChart of First Preference Votes in Belfast South in 2022" /></p>
<p>For instance, this is generated from the following formula;</p>
<p><code class="language-plaintext highlighter-rouge">=query('Basic IMPORTRANGE'!A2:Q16, "select B, sum(C) group by B order by sum(C) desc")</code>;</p>
<p>In this case the first argument to the <code class="language-plaintext highlighter-rouge">query</code> is a reference to my own sheet that just has the ‘IMPORTRANGE’ result in it; the interesting bit is the second argument. It selects the values in column B (the party names in the count sheet), sums the matching rows in column C (the first-preference votes) within each group sharing a value in column B, and finally sorts the groups in descending order of that first-preference total.</p>
<p>So now we’ve easily set up an aggregation with two ‘cells’ of formula and a fairly basic chart.</p>
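<p>The same aggregation can also be expressed in <code class="language-plaintext highlighter-rouge">pandas</code>; this is a minimal sketch on stand-in data (the column names <code class="language-plaintext highlighter-rouge">party</code> and <code class="language-plaintext highlighter-rouge">first_pref</code> are mine, since the underlying sheet has no usable headers):</p>

```python
import pandas as pd

# Stand-in for the imported range; first-preference figures taken from
# the 2022 Belfast South count referenced above.
df = pd.DataFrame({
    "party": ["Sinn Féin", "Alliance Party", "Alliance Party", "Green Party"],
    "first_pref": [9511, 6503, 5201, 4058],
})

# Equivalent of: select B, sum(C) group by B order by sum(C) desc
totals = (
    df.groupby("party")["first_pref"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)
```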
<h2 id="doing-the-real-work-with-python">Doing the real work with Python</h2>
<p>Python is really powerful for this kind of stuff, particularly the <code class="language-plaintext highlighter-rouge">pandas</code> data management library; we can get the above-referenced sheet into a ‘raw’ Python format with just a ‘few lines of code’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">sheet_id</span> <span class="o">=</span> <span class="s">"1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA"</span> <span class="c1"># This is the bit taken from the URL above, like IMPORTRANGE above
</span><span class="n">tab_id</span> <span class="o">=</span> <span class="mi">372848906</span>
<span class="c1"># https://docs.google.com/spreadsheets/d/1AazeIZwfflJJoTiYNil3RprIOXcNy8yyzfZ4ImVlETA/edit#gid=372848906
</span>
<span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://docs.google.com/spreadsheets/d/</span><span class="si">{</span><span class="n">sheet_id</span><span class="si">}</span><span class="s">/export?format=csv&gid=</span><span class="si">{</span><span class="n">tab_id</span><span class="si">}</span><span class="s">"</span>
<span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Belfast South</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Hargey</td>
<td>2.00</td>
<td>Exclude <500</td>
<td>3.00</td>
<td>Exclude McCann Sibanda</td>
<td>4.00</td>
<td>Exclude Girvin</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.00</td>
<td>12.0</td>
<td>13.0</td>
<td>14.0</td>
<td>15.0</td>
</tr>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>16</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>17</th>
<td>Eligible Voters</td>
<td>73497</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>18</th>
<td>Turnout</td>
<td>47306</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>19</th>
<td>% Turnout</td>
<td>64.36%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>20</th>
<td>Valid Ballots</td>
<td>46938</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>21</th>
<td>invalid ballots</td>
<td>368</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>22</th>
<td>quota</td>
<td>7824</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>23</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>24</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>25</th>
<td>Belfast West</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Baker</td>
<td>2.00</td>
<td>Exclude Hill Mallon</td>
<td>3.00</td>
<td>Exclude Crossan</td>
<td>4.00</td>
<td>Exclude Burns</td>
<td>...</td>
<td>Exclude Doran</td>
<td>9.00</td>
<td>Exclude Murphy Higgins</td>
<td>10.00</td>
<td>Exclude Doherty</td>
<td>11.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>26</th>
<td>Danny Baker</td>
<td>Sinn Féin</td>
<td>9011</td>
<td>-1733</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>27</th>
<td>Órlaithí Flynn</td>
<td>Sinn Féin</td>
<td>6743</td>
<td>344.47</td>
<td>7087.47</td>
<td>5.38</td>
<td>7092.85</td>
<td>16.09</td>
<td>7108.94</td>
<td>27.37</td>
<td>...</td>
<td>68.37</td>
<td>7228.63</td>
<td>179</td>
<td>7407.63</td>
<td>NaN</td>
<td>7407.63</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>28</th>
<td>Aisling Reilly</td>
<td>Sinn Féin</td>
<td>5681</td>
<td>1028.47</td>
<td>6709.47</td>
<td>10.19</td>
<td>6719.66</td>
<td>7.76</td>
<td>6727.42</td>
<td>17.09</td>
<td>...</td>
<td>50.52</td>
<td>6811.98</td>
<td>318.98</td>
<td>7130.96</td>
<td>533</td>
<td>7663.96</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>29</th>
<td>Pat Sheehan</td>
<td>Sinn Féin</td>
<td>6370</td>
<td>52.44</td>
<td>6422.44</td>
<td>4</td>
<td>6426.44</td>
<td>5</td>
<td>6431.44</td>
<td>8.38</td>
<td>...</td>
<td>26.71</td>
<td>6477.10</td>
<td>258.18</td>
<td>6735.28</td>
<td>451.13</td>
<td>7186.41</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>30</th>
<td>Gerry Carroll</td>
<td>People Before Profit</td>
<td>3279</td>
<td>115.33</td>
<td>3394.33</td>
<td>19.76</td>
<td>3414.09</td>
<td>78.38</td>
<td>3492.47</td>
<td>70.08</td>
<td>...</td>
<td>238.23</td>
<td>3936.16</td>
<td>542.99</td>
<td>4479.15</td>
<td>1543.46</td>
<td>6022.61</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>31</th>
<td>Frank McCoubrey</td>
<td>Democratic Unionist Party</td>
<td>4166</td>
<td>0.57</td>
<td>4166.57</td>
<td>7.19</td>
<td>4173.76</td>
<td>0</td>
<td>4173.76</td>
<td>2</td>
<td>...</td>
<td>76.19</td>
<td>5275.14</td>
<td>154</td>
<td>5429.14</td>
<td>60.57</td>
<td>5489.71</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>32</th>
<td>Paul Doherty</td>
<td>Social Democratic and Labour Party</td>
<td>2528</td>
<td>88.35</td>
<td>2616.35</td>
<td>2.19</td>
<td>2618.54</td>
<td>29.14</td>
<td>2647.68</td>
<td>36.28</td>
<td>...</td>
<td>478.51</td>
<td>3232.66</td>
<td>404.28</td>
<td>3636.94</td>
<td>-3636.94</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>33</th>
<td>Gerard Herdman</td>
<td>Aontú</td>
<td>1753</td>
<td>8.17</td>
<td>1761.17</td>
<td>32</td>
<td>1793.17</td>
<td>9.19</td>
<td>1802.36</td>
<td>17.19</td>
<td>...</td>
<td>35.19</td>
<td>1871.74</td>
<td>-1871.74</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>34</th>
<td>Dan Murphy</td>
<td>Irish Republican Socialist Party</td>
<td>1103</td>
<td>12.16</td>
<td>1115.16</td>
<td>8</td>
<td>1123.16</td>
<td>7</td>
<td>1130.16</td>
<td>14</td>
<td>...</td>
<td>7</td>
<td>1159.16</td>
<td>-1159.16</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>35</th>
<td>Donnamarie Higgins</td>
<td>Alliance Party</td>
<td>907</td>
<td>17.48</td>
<td>924.48</td>
<td>2</td>
<td>926.48</td>
<td>6</td>
<td>932.48</td>
<td>7.19</td>
<td>...</td>
<td>-1134.81</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>36</th>
<td>Jordan Doran</td>
<td>Traditional Unionist Voice</td>
<td>802</td>
<td>0.38</td>
<td>802.38</td>
<td>3</td>
<td>805.38</td>
<td>1</td>
<td>806.38</td>
<td>3</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>37</th>
<td>Linsey Gibson</td>
<td>Ulster Unionist Party</td>
<td>474</td>
<td>0.76</td>
<td>474.76</td>
<td>2</td>
<td>476.76</td>
<td>1.19</td>
<td>477.95</td>
<td>2</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>38</th>
<td>Stevie Maginn</td>
<td>Green Party</td>
<td>307</td>
<td>3.04</td>
<td>310.04</td>
<td>3</td>
<td>313.04</td>
<td>10</td>
<td>323.04</td>
<td>10.19</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>39</th>
<td>Gerard Burns</td>
<td>Independent</td>
<td>192</td>
<td>16.91</td>
<td>208.91</td>
<td>28.19</td>
<td>237.10</td>
<td>7</td>
<td>244.10</td>
<td>-244.1</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>40</th>
<td>Patrick Crossan</td>
<td>The Workers Party</td>
<td>193</td>
<td>4.75</td>
<td>197.75</td>
<td>6.76</td>
<td>204.51</td>
<td>-204.51</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>41</th>
<td>Tony Mallon</td>
<td>Independent</td>
<td>129</td>
<td>2.28</td>
<td>131.28</td>
<td>-131.28</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>42</th>
<td>Declan Hill</td>
<td>Independent</td>
<td>26</td>
<td>0.38</td>
<td>26.38</td>
<td>-26.38</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>43</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>37.06</td>
<td>37.06</td>
<td>24</td>
<td>61.06</td>
<td>26.76</td>
<td>87.82</td>
<td>29.33</td>
<td>...</td>
<td>154.09</td>
<td>393.43</td>
<td>1173.47</td>
<td>1566.90</td>
<td>1048.78</td>
<td>2615.68</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>44</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>45</th>
<td>Eligible Voters</td>
<td>68727</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>46</th>
<td>Turnout</td>
<td>44440</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>47</th>
<td>% Turnout</td>
<td>64.66%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>48</th>
<td>Valid Ballots</td>
<td>43664</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>49</th>
<td>invalid ballots</td>
<td>776</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>50</th>
<td>quota</td>
<td>7278</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>51 rows × 27 columns</p>
</div>
<p>This is a little bit more complicated than other Google-fu would have you believe, but it looks like Google updated their APIs over the years to remove the ‘happy path’ for this call.</p>
<p>Additionally, note that as in the <code class="language-plaintext highlighter-rouge">IMPORTRANGE</code> example, there are no usable ‘headers’ in the underlying data so we may have to create these ourselves for more complex analysis.</p>
<p>Also, we have to manually ‘tidy up’ the ‘range’ ourselves, as the Belfast South range only goes to row 24, and then Belfast West appears.</p>
<p>While we could get fancy, for simplicity this is a manual example. We’ll also exclude the ‘metadata’ such as the Turnout statistics and Quota from the bottom of the section, in this case trimming the data to run between rows 3 and 17, leaving just candidates and transfer statistics.</p>
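<p>That manual trim might look something like this sketch; the frame below is a stand-in with one row per sheet row (in the real notebook it comes from <code class="language-plaintext highlighter-rouge">pd.read_csv(url)</code>), and the exact offsets depend on which rows <code class="language-plaintext highlighter-rouge">read_csv</code> treated as headers, so eyeball <code class="language-plaintext highlighter-rouge">df.head()</code> first:</p>

```python
import pandas as pd

# Stand-in frame: one row per sheet row; the real frame comes from
# pd.read_csv(url) and covers both constituencies.
df = pd.DataFrame({"col": [f"sheet row {i}" for i in range(1, 25)]})

# Keep rows 3..17 (1-indexed, inclusive): iloc is 0-indexed and
# end-exclusive, hence 2:17, giving 15 rows.
candidates = df.iloc[2:17].reset_index(drop=True)
print(len(candidates))
```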
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Belfast South</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Hargey</td>
<td>2.00</td>
<td>Exclude <500</td>
<td>3.00</td>
<td>Exclude McCann Sibanda</td>
<td>4.00</td>
<td>Exclude Girvin</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.00</td>
<td>12.0</td>
<td>13.0</td>
<td>14.0</td>
<td>15.0</td>
</tr>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>16</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>17</th>
<td>Eligible Voters</td>
<td>73497</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>18</th>
<td>Turnout</td>
<td>47306</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>19</th>
<td>% Turnout</td>
<td>64.36%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>20</th>
<td>Valid Ballots</td>
<td>46938</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>21</th>
<td>invalid ballots</td>
<td>368</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>22</th>
<td>quota</td>
<td>7824</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>23</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>24</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>25</th>
<td>Belfast West</td>
<td>NaN</td>
<td>Stage 1</td>
<td>Surplus Baker</td>
<td>2.00</td>
<td>Exclude Hill Mallon</td>
<td>3.00</td>
<td>Exclude Crossan</td>
<td>4.00</td>
<td>Exclude Burns</td>
<td>...</td>
<td>Exclude Doran</td>
<td>9.00</td>
<td>Exclude Murphy Higgins</td>
<td>10.00</td>
<td>Exclude Doherty</td>
<td>11.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>26</th>
<td>Danny Baker</td>
<td>Sinn Féin</td>
<td>9011</td>
<td>-1733</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>7278.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>27</th>
<td>Órlaithí Flynn</td>
<td>Sinn Féin</td>
<td>6743</td>
<td>344.47</td>
<td>7087.47</td>
<td>5.38</td>
<td>7092.85</td>
<td>16.09</td>
<td>7108.94</td>
<td>27.37</td>
<td>...</td>
<td>68.37</td>
<td>7228.63</td>
<td>179</td>
<td>7407.63</td>
<td>NaN</td>
<td>7407.63</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>28</th>
<td>Aisling Reilly</td>
<td>Sinn Féin</td>
<td>5681</td>
<td>1028.47</td>
<td>6709.47</td>
<td>10.19</td>
<td>6719.66</td>
<td>7.76</td>
<td>6727.42</td>
<td>17.09</td>
<td>...</td>
<td>50.52</td>
<td>6811.98</td>
<td>318.98</td>
<td>7130.96</td>
<td>533</td>
<td>7663.96</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>29</th>
<td>Pat Sheehan</td>
<td>Sinn Féin</td>
<td>6370</td>
<td>52.44</td>
<td>6422.44</td>
<td>4</td>
<td>6426.44</td>
<td>5</td>
<td>6431.44</td>
<td>8.38</td>
<td>...</td>
<td>26.71</td>
<td>6477.10</td>
<td>258.18</td>
<td>6735.28</td>
<td>451.13</td>
<td>7186.41</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>30</th>
<td>Gerry Carroll</td>
<td>People Before Profit</td>
<td>3279</td>
<td>115.33</td>
<td>3394.33</td>
<td>19.76</td>
<td>3414.09</td>
<td>78.38</td>
<td>3492.47</td>
<td>70.08</td>
<td>...</td>
<td>238.23</td>
<td>3936.16</td>
<td>542.99</td>
<td>4479.15</td>
<td>1543.46</td>
<td>6022.61</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>31</th>
<td>Frank McCoubrey</td>
<td>Democratic Unionist Party</td>
<td>4166</td>
<td>0.57</td>
<td>4166.57</td>
<td>7.19</td>
<td>4173.76</td>
<td>0</td>
<td>4173.76</td>
<td>2</td>
<td>...</td>
<td>76.19</td>
<td>5275.14</td>
<td>154</td>
<td>5429.14</td>
<td>60.57</td>
<td>5489.71</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>32</th>
<td>Paul Doherty</td>
<td>Social Democratic and Labour Party</td>
<td>2528</td>
<td>88.35</td>
<td>2616.35</td>
<td>2.19</td>
<td>2618.54</td>
<td>29.14</td>
<td>2647.68</td>
<td>36.28</td>
<td>...</td>
<td>478.51</td>
<td>3232.66</td>
<td>404.28</td>
<td>3636.94</td>
<td>-3636.94</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>33</th>
<td>Gerard Herdman</td>
<td>Aontú</td>
<td>1753</td>
<td>8.17</td>
<td>1761.17</td>
<td>32</td>
<td>1793.17</td>
<td>9.19</td>
<td>1802.36</td>
<td>17.19</td>
<td>...</td>
<td>35.19</td>
<td>1871.74</td>
<td>-1871.74</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>34</th>
<td>Dan Murphy</td>
<td>Irish Republican Socialist Party</td>
<td>1103</td>
<td>12.16</td>
<td>1115.16</td>
<td>8</td>
<td>1123.16</td>
<td>7</td>
<td>1130.16</td>
<td>14</td>
<td>...</td>
<td>7</td>
<td>1159.16</td>
<td>-1159.16</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>35</th>
<td>Donnamarie Higgins</td>
<td>Alliance Party</td>
<td>907</td>
<td>17.48</td>
<td>924.48</td>
<td>2</td>
<td>926.48</td>
<td>6</td>
<td>932.48</td>
<td>7.19</td>
<td>...</td>
<td>-1134.81</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>36</th>
<td>Jordan Doran</td>
<td>Traditional Unionist Voice</td>
<td>802</td>
<td>0.38</td>
<td>802.38</td>
<td>3</td>
<td>805.38</td>
<td>1</td>
<td>806.38</td>
<td>3</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>37</th>
<td>Linsey Gibson</td>
<td>Ulster Unionist Party</td>
<td>474</td>
<td>0.76</td>
<td>474.76</td>
<td>2</td>
<td>476.76</td>
<td>1.19</td>
<td>477.95</td>
<td>2</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>38</th>
<td>Stevie Maginn</td>
<td>Green Party</td>
<td>307</td>
<td>3.04</td>
<td>310.04</td>
<td>3</td>
<td>313.04</td>
<td>10</td>
<td>323.04</td>
<td>10.19</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>39</th>
<td>Gerard Burns</td>
<td>Independent</td>
<td>192</td>
<td>16.91</td>
<td>208.91</td>
<td>28.19</td>
<td>237.10</td>
<td>7</td>
<td>244.10</td>
<td>-244.1</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>40</th>
<td>Patrick Crossan</td>
<td>The Workers Party</td>
<td>193</td>
<td>4.75</td>
<td>197.75</td>
<td>6.76</td>
<td>204.51</td>
<td>-204.51</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>41</th>
<td>Tony Mallon</td>
<td>Independent</td>
<td>129</td>
<td>2.28</td>
<td>131.28</td>
<td>-131.28</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>42</th>
<td>Declan Hill</td>
<td>Independent</td>
<td>26</td>
<td>0.38</td>
<td>26.38</td>
<td>-26.38</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>43</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>37.06</td>
<td>37.06</td>
<td>24</td>
<td>61.06</td>
<td>26.76</td>
<td>87.82</td>
<td>29.33</td>
<td>...</td>
<td>154.09</td>
<td>393.43</td>
<td>1173.47</td>
<td>1566.90</td>
<td>1048.78</td>
<td>2615.68</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>44</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>45</th>
<td>Eligible Voters</td>
<td>68727</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>46</th>
<td>Turnout</td>
<td>44440</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>47</th>
<td>% Turnout</td>
<td>64.66%</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>48</th>
<td>Valid Ballots</td>
<td>43664</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>49</th>
<td>invalid ballots</td>
<td>776</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>50</th>
<td>quota</td>
<td>7278</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>51 rows × 27 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Unnamed: 0</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>...</th>
<th>Unnamed: 17</th>
<th>Unnamed: 18</th>
<th>Unnamed: 19</th>
<th>Unnamed: 20</th>
<th>Unnamed: 21</th>
<th>Unnamed: 22</th>
<th>Unnamed: 23</th>
<th>Unnamed: 24</th>
<th>Unnamed: 25</th>
<th>Unnamed: 26</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Deirdre Hargey</td>
<td>Sinn Féin</td>
<td>9511</td>
<td>-1687</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>7824.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Edwin Poots</td>
<td>Democratic Unionist Party</td>
<td>7211</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Matthew O'Toole</td>
<td>Social Democratic and Labour Party</td>
<td>5394</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>Paula Bradshaw</td>
<td>Alliance Party</td>
<td>6503</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.5</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>Kate Nicholl</td>
<td>Alliance Party</td>
<td>5201</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>Clare Bailey</td>
<td>Green Party</td>
<td>4058</td>
<td>167.4</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>Stephen McCarthy</td>
<td>Ulster Unionist Party</td>
<td>3061</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>Elsie Trainor</td>
<td>Social Democratic and Labour Party</td>
<td>2030</td>
<td>181.8</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>Andrew Girvin</td>
<td>Traditional Unionist Voice</td>
<td>1935</td>
<td>0.36</td>
<td>1935.36</td>
<td>9</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.9</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>Luke McCann</td>
<td>Aontú</td>
<td>806</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>11</th>
<td>Sipho Sibanda</td>
<td>People Before Profit</td>
<td>629</td>
<td>40.5</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12</th>
<td>Neil Moore</td>
<td>Socialist Party</td>
<td>353</td>
<td>18</td>
<td>371.00</td>
<td>-371</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>13</th>
<td>Paddy Lynn</td>
<td>The Workers Party</td>
<td>139</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>14</th>
<td>Elly Odhiambo</td>
<td>Independent</td>
<td>107</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>15</th>
<td>Exhausted</td>
<td>NaN</td>
<td>NaN</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>15 rows × 27 columns</p>
</div>
<p>There are a few ways to tidy up this stage/transfer setup; for simplicity we’ll take the ‘index’ off the left of the table (the candidate and party names) and try to construct a new column index based on those.</p>
<p>Sounds fancy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">how</span><span class="o">=</span><span class="s">'all'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="n">_table</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="n">tolist</span><span class="p">()[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">_table</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="n">_table</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">_table</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">set_names</span><span class="p">([</span><span class="s">'Candidate'</span><span class="p">,</span> <span class="s">'Party'</span><span class="p">])</span>
<span class="n">_table</span>
</code></pre></div></div>
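<p>Since the same cleanup has to happen for every constituency block in the spreadsheet, the chained steps above can be wrapped into a small helper. This is a minimal sketch: the toy two-candidate frame stands in for one real constituency block, and the <code>tidy_count_table</code> name and the illustrative column values are my own, not from the original notebook.</p>

```python
import numpy as np
import pandas as pd


def tidy_count_table(block: pd.DataFrame) -> pd.DataFrame:
    """Tidy one constituency block: drop empty columns, index by
    candidate/party, and coerce the stage counts to floats."""
    block = block.dropna(how="all", axis=1)          # drop all-NaN columns
    block = block.set_index(block.columns.tolist()[0:2])  # candidate + party
    block = block.fillna(0).astype(float)            # NaN transfers -> 0.0
    block.index = block.index.set_names(["Candidate", "Party"])
    return block


# Toy stand-in for one constituency block of the results spreadsheet
raw = pd.DataFrame({
    "Unnamed: 0": ["Deirdre Hargey", "Edwin Poots"],
    "Unnamed: 1": ["Sinn Féin", "Democratic Unionist Party"],
    "Unnamed: 2": [9511, 7211],
    "Unnamed: 3": [-1687, 6.12],
    "Unnamed: 4": [np.nan, np.nan],  # an empty column, should be dropped
})
tidy = tidy_count_table(raw)
```

<p>The same function can then be applied per constituency once the sheet is split into blocks, rather than repeating the chain by hand.</p>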
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
<th>Unnamed: 12</th>
<th>Unnamed: 13</th>
<th>Unnamed: 14</th>
<th>Unnamed: 15</th>
<th>Unnamed: 16</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>-1687.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>-650.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>6459.32</td>
<td>1630.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.50</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31.00</td>
<td>6994.46</td>
<td>214.68</td>
<td>7209.14</td>
<td>1114.00</td>
<td>8323.14</td>
<td>0.00</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17.00</td>
<td>5552.78</td>
<td>244.90</td>
<td>5797.68</td>
<td>858.42</td>
<td>6656.10</td>
<td>81.00</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>167.40</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>4934.12</td>
<td>164.02</td>
<td>5098.14</td>
<td>600.96</td>
<td>5699.10</td>
<td>127.00</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>3753.38</td>
<td>16.90</td>
<td>3770.28</td>
<td>-3770.28</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>181.80</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4.00</td>
<td>2367.26</td>
<td>-2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>0.36</td>
<td>1935.36</td>
<td>9.00</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>40.50</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>18.00</td>
<td>371.00</td>
<td>-371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92.00</td>
<td>577.76</td>
<td>96.76</td>
<td>674.52</td>
<td>1196.90</td>
<td>1871.42</td>
<td>442.92</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>Now to do the same thing with the stage counts and transfers;</p>
<p>There are many ways to do this; we could leave it as is and just name the columns by Stage and Transfer, but a ‘better’ way is to create a MultiIndex on the columns. Which probably means nothing if you haven’t come across one before.</p>
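<p>If you’ve not met a column MultiIndex before, here’s a minimal, self-contained sketch with toy numbers (nothing to do with the election data) showing what the structure buys you:</p>

```python
import pandas as pd

# Toy frame: two stages, each with a Count and a Transfers column
toy = pd.DataFrame(
    [[10, 2, 12, -1], [5, 1, 6, 0]],
    columns=pd.MultiIndex.from_product(
        [[1, 2], ["Count", "Transfers"]], names=["Stage", "Step"]
    ),
)

# Cross-sections slice one level while keeping the other
counts = toy.xs("Count", level="Step", axis=1)
print(counts.columns.tolist())  # [1, 2] -- only the 'Stage' level remains
```

<p>That same <code>xs</code> trick is what turns the per-stage queries further down into one-liners.</p>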
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">columns</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
'Unnamed: 16'],
dtype='object')
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stage</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Count'</span><span class="p">,</span><span class="s">'Transfers'</span><span class="p">]</span>
<span class="n">stages</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">_table</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">//</span><span class="mi">2</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span>
<span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">stages</span><span class="p">,</span><span class="n">stage</span><span class="p">])[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MultiIndex([(1, 'Count'),
(1, 'Transfers'),
(2, 'Count'),
(2, 'Transfers'),
(3, 'Count'),
(3, 'Transfers'),
(4, 'Count'),
(4, 'Transfers'),
(5, 'Count'),
(5, 'Transfers'),
(6, 'Count'),
(6, 'Transfers'),
(7, 'Count'),
(7, 'Transfers'),
(8, 'Count')],
)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">columns</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">stages</span><span class="p">,</span><span class="n">stage</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'Stage'</span><span class="p">,</span><span class="s">'Step'</span><span class="p">])[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>Stage</th>
<th colspan="2" halign="left">1</th>
<th colspan="2" halign="left">2</th>
<th colspan="2" halign="left">3</th>
<th colspan="2" halign="left">4</th>
<th colspan="2" halign="left">5</th>
<th colspan="2" halign="left">6</th>
<th colspan="2" halign="left">7</th>
<th>8</th>
</tr>
<tr>
<th></th>
<th>Step</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
<th>Transfers</th>
<th>Count</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>-1687.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
<td>0.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>6.12</td>
<td>7217.12</td>
<td>6.08</td>
<td>7223.20</td>
<td>117.72</td>
<td>7340.92</td>
<td>1134.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>0.00</td>
<td>8474.92</td>
<td>-650.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>664.92</td>
<td>6058.92</td>
<td>68.18</td>
<td>6127.10</td>
<td>322.04</td>
<td>6449.14</td>
<td>10.18</td>
<td>6459.32</td>
<td>1630.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
<td>0.00</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>265.14</td>
<td>6768.14</td>
<td>59.50</td>
<td>6827.64</td>
<td>135.82</td>
<td>6963.46</td>
<td>31.00</td>
<td>6994.46</td>
<td>214.68</td>
<td>7209.14</td>
<td>1114.00</td>
<td>8323.14</td>
<td>0.00</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>145.98</td>
<td>5346.98</td>
<td>57.52</td>
<td>5404.50</td>
<td>131.28</td>
<td>5535.78</td>
<td>17.00</td>
<td>5552.78</td>
<td>244.90</td>
<td>5797.68</td>
<td>858.42</td>
<td>6656.10</td>
<td>81.00</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>167.40</td>
<td>4225.40</td>
<td>169.82</td>
<td>4395.22</td>
<td>490.54</td>
<td>4885.76</td>
<td>48.36</td>
<td>4934.12</td>
<td>164.02</td>
<td>5098.14</td>
<td>600.96</td>
<td>5699.10</td>
<td>127.00</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>6.12</td>
<td>3067.12</td>
<td>17.72</td>
<td>3084.84</td>
<td>25.18</td>
<td>3110.02</td>
<td>643.36</td>
<td>3753.38</td>
<td>16.90</td>
<td>3770.28</td>
<td>-3770.28</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>181.80</td>
<td>2211.80</td>
<td>18.52</td>
<td>2230.32</td>
<td>132.94</td>
<td>2363.26</td>
<td>4.00</td>
<td>2367.26</td>
<td>-2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>0.36</td>
<td>1935.36</td>
<td>9.00</td>
<td>1944.36</td>
<td>35.54</td>
<td>1979.90</td>
<td>-1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>70.92</td>
<td>876.92</td>
<td>15.16</td>
<td>892.08</td>
<td>-892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>40.50</td>
<td>669.50</td>
<td>172.66</td>
<td>842.16</td>
<td>-842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>18.00</td>
<td>371.00</td>
<td>-371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>24.48</td>
<td>163.48</td>
<td>-163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>6.84</td>
<td>113.84</td>
<td>-113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>88.42</td>
<td>54.16</td>
<td>142.58</td>
<td>343.18</td>
<td>485.76</td>
<td>92.00</td>
<td>577.76</td>
<td>96.76</td>
<td>674.52</td>
<td>1196.90</td>
<td>1871.42</td>
<td>442.92</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>Now we can do some interesting queries really easily;</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Count"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Stage</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
<tr>
<th>Candidate</th>
<th>Party</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Deirdre Hargey</th>
<th>Sinn Féin</th>
<td>9511.0</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
<td>7824.00</td>
</tr>
<tr>
<th>Edwin Poots</th>
<th>Democratic Unionist Party</th>
<td>7211.0</td>
<td>7217.12</td>
<td>7223.20</td>
<td>7340.92</td>
<td>8474.92</td>
<td>8474.92</td>
<td>8474.92</td>
<td>7824.00</td>
</tr>
<tr>
<th>Matthew O'Toole</th>
<th>Social Democratic and Labour Party</th>
<td>5394.0</td>
<td>6058.92</td>
<td>6127.10</td>
<td>6449.14</td>
<td>6459.32</td>
<td>8089.32</td>
<td>8089.32</td>
<td>8089.32</td>
</tr>
<tr>
<th>Paula Bradshaw</th>
<th>Alliance Party</th>
<td>6503.0</td>
<td>6768.14</td>
<td>6827.64</td>
<td>6963.46</td>
<td>6994.46</td>
<td>7209.14</td>
<td>8323.14</td>
<td>8323.14</td>
</tr>
<tr>
<th>Kate Nicholl</th>
<th>Alliance Party</th>
<td>5201.0</td>
<td>5346.98</td>
<td>5404.50</td>
<td>5535.78</td>
<td>5552.78</td>
<td>5797.68</td>
<td>6656.10</td>
<td>6737.10</td>
</tr>
<tr>
<th>Clare Bailey</th>
<th>Green Party</th>
<td>4058.0</td>
<td>4225.40</td>
<td>4395.22</td>
<td>4885.76</td>
<td>4934.12</td>
<td>5098.14</td>
<td>5699.10</td>
<td>5826.10</td>
</tr>
<tr>
<th>Stephen McCarthy</th>
<th>Ulster Unionist Party</th>
<td>3061.0</td>
<td>3067.12</td>
<td>3084.84</td>
<td>3110.02</td>
<td>3753.38</td>
<td>3770.28</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elsie Trainor</th>
<th>Social Democratic and Labour Party</th>
<td>2030.0</td>
<td>2211.80</td>
<td>2230.32</td>
<td>2363.26</td>
<td>2367.26</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Andrew Girvin</th>
<th>Traditional Unionist Voice</th>
<td>1935.0</td>
<td>1935.36</td>
<td>1944.36</td>
<td>1979.90</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Luke McCann</th>
<th>Aontú</th>
<td>806.0</td>
<td>876.92</td>
<td>892.08</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Sipho Sibanda</th>
<th>People Before Profit</th>
<td>629.0</td>
<td>669.50</td>
<td>842.16</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Neil Moore</th>
<th>Socialist Party</th>
<td>353.0</td>
<td>371.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Paddy Lynn</th>
<th>The Workers Party</th>
<td>139.0</td>
<td>163.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Elly Odhiambo</th>
<th>Independent</th>
<td>107.0</td>
<td>113.84</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<th>Exhausted</th>
<th>NaN</th>
<td>0.0</td>
<td>88.42</td>
<td>142.58</td>
<td>485.76</td>
<td>577.76</td>
<td>674.52</td>
<td>1871.42</td>
<td>2314.34</td>
</tr>
</tbody>
</table>
</div>
<p>This makes plotting quite simple, which in turn makes the queries you can express much more complex…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Count"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Party'</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">T</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><Axes: xlabel='Stage'>
</code></pre></div></div>
<p><img src="/img/eni_output_16_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_table</span><span class="p">.</span><span class="n">xs</span><span class="p">(</span><span class="s">"Transfers"</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="s">'Step'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>\
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Party'</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">cumsum</span><span class="p">().</span><span class="n">T</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">title</span><span class="o">=</span><span class="s">'Net Cumulative Transfers'</span>
<span class="p">)</span>\
<span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'right'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.6</span><span class="p">,</span><span class="mf">0.5</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.legend.Legend at 0x17b555a50>
</code></pre></div></div>
<p><img src="/img/eni_output_17_1.png" alt="png" /></p>
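<p>The ‘Net Cumulative Transfers’ idea above boils down to a groupby over the Party index level plus a cumulative sum over the stages; a runnable sketch with made-up numbers (not the real count data):</p>

```python
import pandas as pd

# Two candidates from the same party, two stages of transfers
idx = pd.MultiIndex.from_tuples(
    [("A", "Party X"), ("B", "Party X")], names=["Candidate", "Party"]
)
cols = pd.MultiIndex.from_product([[1, 2], ["Transfers"]], names=["Stage", "Step"])
transfers = pd.DataFrame([[10.0, 5.0], [2.0, -12.0]], index=idx, columns=cols)

# Sum within each party, then accumulate stage-by-stage
net = (
    transfers.xs("Transfers", level="Step", axis=1)
    .groupby("Party")
    .sum()
    .cumsum(axis=1)
)
print(net.loc["Party X"].tolist())  # [12.0, 5.0]
```

<p>Note the explicit <code>axis=1</code> here so the accumulation runs across the stages rather than down the parties.</p>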
<h2 id="conclusion">Conclusion</h2>
<p>Considering it’s election day today in Northern Ireland, I wanted to bash this out to help other election observers have a play with the fantastic work the likes of <a href="https://twitter.com/colm_burns">@colm_burns</a> and the rest of the <a href="https://twitter.com/electionsni">@electionsni</a> team are doing.</p>
<p>More than happy to help anyone else answer interesting electoral questions, and remember: vote early, vote often, and #votetillyouboke</p>
StayCation20232023-04-01T19:00:00+00:00https://andrewbolster.info/2023/04/staycation2023<p><strong>TL;DR I’m taking 2 weeks off my Synopsys work to work on all those side projects I promised I would, either to myself or to others.</strong></p>
<p>We’ve all got a box somewhere of AliExpress / PiMoroni / PiHut boards, and a number of repos and half-started projects that you promised you’d get to play with in evenings or weekends, but you keep finding yourself doing stupid things like “spending time with family/friends”, “working out”, “spending time with / trying to find your significant other”, or “eating” or “sleeping” or other ridiculous indulgences.</p>
<p>For the past three years, I’ve basically taken December off from work because for a while there, there was no major motivation to take significant blocks of time off work… (Dear American colleagues; we <em>have</em> to take holidays, it’s very good when there’s no pandemic, check it out…)</p>
<p>In an effort to try something different this year, I’m taking a two week “StayCation”, mostly going to be spent in <a href="https://www.farsetlabs.org.uk">Farset Labs</a>.</p>
<p>Projects I’ll be working on (and trying to write up…)</p>
<ul>
<li>Make a proper AirFlow / MongoDB / Kafka setup for personal projects</li>
<li>Make a Home Assistant integrated EPaper Weather / News Display for the bedroom</li>
<li>Fix Farset’s HotDesking page…</li>
<li>Check that <a href="https://github.com/andrewbolster/arduino_workshop">this Arduino Workshop</a> from years ago still works…
<ul>
<li>If it does, add in my recently acquired SDS011 Particulate Sensors into the mix…</li>
</ul>
</li>
<li>Test out HDMI Capture for the Farset Labs Event Space for easy streaming / remote engagement.</li>
<li>Try out fixing Farset’s front porch display with HDMI CEC so it can automatically wake up the big vertical display</li>
<li>Build out a persistent EPaper display for the desk with Todoist</li>
<li>Calibrate the recently acquired CR30 (the treadmill one) at Farset</li>
<li>Plan out how to get ClubMate into the country this year for Farset, BSides Belfast, and my selfish, thirsty self.</li>
<li>Flesh out and productionise the <a href="https://github.com/FarsetLabs/vote-herder">ElectionNI API and Persistence layers</a></li>
<li>Fix a particularly nasty and annoying CI/CD bug in <a href="https://github.com/andrewbolster/bolster">my wee secondary brain package</a>…</li>
<li>Setup, Document and Install Nexudus driven Amazon Fire tablets (or other…) for room occupancy / booking displays in Farset</li>
<li>BSides Belfast Sponsor Logo Collation and Scheduled Shoutouts</li>
<li>Write an Instagram Scraper for a “protest wall” for a certain beardy fella (not myself for a change)</li>
<li>Treasurer update for Farset Board</li>
<li>GIF - To - LED Array webservice for the <a href="https://shop.pimoroni.com/products/cosmic-unicorn">Cosmic Unicorn</a></li>
<li>Write a rotating/interactive badge for the <a href="https://shop.pimoroni.com/products/badger-2040-w?variant=40514062188627">Badger 2040W</a></li>
</ul>
<p>That’s all… 🤣</p>
<p>I have no expectation of completing all these but it’ll be fun trying!</p>
<p>Wish me luck…</p>
Wrangling NI House Price Index Data2022-03-27T19:49:00+00:00https://andrewbolster.info/2022/03/NI-House-Price-Index<h1 id="data-wrangling-ni-house-price-index-data">Data Wrangling NI House Price Index Data</h1>
<p>This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through <a href="https://www.nisra.gov.uk/statistics/housing-community-and-regeneration/northern-ireland-house-price-index">NI House Price Index</a> datasets I was playing around with.</p>
<p>It’s mostly code, so if you were here for some ‘insight’, feck aff.</p>
<p>There is <strong>no</strong> analysis here, this is <strong>just</strong> data wrangling.</p>
<p>TL;DR As always, Government Open Data has over the years gone from ‘non-existent’ to ‘garbled’ to ‘inconsistent’, and I feel it is now in the stage of ‘consistently inconsistent’, which is progress in my eyes.</p>
<h1 id="preamble-code-move-on">Preamble Code, move on.</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="c1"># Pull the latest pages of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links
</span>
<span class="n">base_url</span><span class="o">=</span> <span class="s">'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'</span>
<span class="n">base_content</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">base_url</span><span class="p">).</span><span class="n">content</span>
<span class="n">base_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">base_content</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">base_soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">):</span>
<span class="k">if</span> <span class="n">a</span><span class="p">.</span><span class="n">attrs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">endswith</span><span class="p">(</span><span class="s">'xlsx'</span><span class="p">):</span>
<span class="n">source_name</span><span class="p">,</span> <span class="n">source_url</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">a</span><span class="p">.</span><span class="n">attrs</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">source_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">source_url</span><span class="p">,</span> <span class="n">sheet_name</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="c1"># Load all worksheets in
</span>
</code></pre></div></div>
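<p>Note that the loop above overwrites <code>source_df</code> on every matching link, so only the last workbook linked on the page survives. If you’d rather gather the candidate links first and then choose which workbook to download, a small sketch (my own hypothetical <code>xlsx_links</code> helper, same BeautifulSoup assumptions):</p>

```python
from bs4 import BeautifulSoup

def xlsx_links(html: str):
    """Return (label, href) pairs for every xlsx link in a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a.attrs.get("href", ""))
        for a in soup.find_all("a")
        if a.attrs.get("href", "").endswith("xlsx")
    ]

page = '<a href="/hpi-q4-2021.xlsx">HPI Q4 2021</a> <a href="/notes.pdf">Notes</a>'
print(xlsx_links(page))  # [('HPI Q4 2021', '/hpi-q4-2021.xlsx')]
```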
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table of Contents</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Worksheet Name</td>
<td>Frequency</td>
<td>House Price Index - Quarter 4 2021</td>
</tr>
<tr>
<th>1</th>
<td>Table 1</td>
<td>Quarterly</td>
<td>Table 1: NI HPI Trends Q1 2005 - Q4 2021</td>
</tr>
<tr>
<th>2</th>
<td>Figure 1</td>
<td>Quarterly</td>
<td>Figure 1: Graph of NI HPI Q1 2005 - Q4 2021</td>
</tr>
<tr>
<th>3</th>
<td>Figure 1a</td>
<td>Quarterly</td>
<td>Figure 1a: Graph of Percentage Quarterly Chang...</td>
</tr>
<tr>
<th>4</th>
<td>Figure 1b</td>
<td>Quarterly</td>
<td>Figure 1b: Graph of Percentage Annual Change Q...</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>58</th>
<td>Table 10h</td>
<td>Quarterly</td>
<td>Table 10h: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>59</th>
<td>Table 10i</td>
<td>Quarterly</td>
<td>Table 10i: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>60</th>
<td>Table 10j</td>
<td>Quarterly</td>
<td>Table 10j: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>61</th>
<td>Table 10k</td>
<td>Quarterly</td>
<td>Table 10k: Number of Verified Residential Prop...</td>
</tr>
<tr>
<th>62</th>
<td>Figure 11</td>
<td>Quarterly</td>
<td>Figure 11: Number of Verified Residential Prop...</td>
</tr>
</tbody>
</table>
<p>63 rows × 3 columns</p>
</div>
<h3 id="fix-the-contents-sheet-to-correctly-reflect-the-worksheet-names">Fix the Contents sheet to correctly reflect the Worksheet names</h3>
<p>And fix the table headers and sheet-titles while we’re at it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_header</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="mi">1</span><span class="p">:]</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">columns</span> <span class="o">=</span> <span class="n">new_header</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">].</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">new_header</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span><span class="s">'Title'</span><span class="p">]</span>
</code></pre></div></div>
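<p>This ‘promote the first row to a header’ dance turns up in nearly every sheet of these workbooks, so it’s worth wrapping in a wee helper (my own convenience function, not anything from the published workbook):</p>

```python
import pandas as pd

def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row of `df` as its column header and drop that row."""
    out = df.iloc[1:].copy()
    out.columns = list(df.iloc[0])
    return out

raw = pd.DataFrame([["Worksheet Name", "Frequency"], ["Table 1", "Quarterly"]])
fixed = promote_header(raw)
print(fixed.columns.tolist())  # ['Worksheet Name', 'Frequency']
```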
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">t</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Title'</span><span class="p">].</span><span class="n">values</span> <span class="k">if</span> <span class="n">t</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
'Table 5: HPI & Standardised Price for each Local Government District in NI',
'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Replace 'Figure' with 'Fig' in 'Worksheet Name'
</span><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Figure'</span><span class="p">,</span><span class="s">'Fig'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="tidy-up-data">Tidy up Data</h2>
<h3 id="general-methodology">General Methodology</h3>
<p>Ignore figure data (pretty much completely…)</p>
<p>Tables share more or less the same structure: a header on row 3 (pandas row 1), and a year-and-quarter ‘index’ on time series tables (otherwise a categorical index; see Table 2, Table 3).</p>
<p>Some TS tables <em>also</em> have totals subsections, so these should be a) validated and b) ignored.</p>
<p>Any columns with no header in row 3 (pandas row 1) should be ignored (usually text notes).</p>
<p><em>Operate Sequentially</em> (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting)</p>
<p>Use keys from ‘Contents’ to describe data, but <strong>these may be suffixed by a date which could change between data sets!</strong></p>
<p>There are also some columns that look like checksums; if there is an ‘NI’ column, or a data column where all valid values are 100, delete it.</p>
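<p>A hedged sketch of that checksum rule in its generalised form (a hypothetical <code>drop_checksum_columns</code> helper, not from the original notebook): drop any numeric column whose non-null values are all exactly 100.</p>

```python
import pandas as pd

def drop_checksum_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop any numeric column whose non-null values are all exactly 100."""
    checksum_cols = [
        c for c in df.columns
        if pd.api.types.is_numeric_dtype(df[c])  # only consider numeric columns
        and df[c].notna().any()                  # skip all-NaN columns (vacuously 'all 100')
        and df[c].dropna().eq(100).all()         # every valid value is the checksum 100
    ]
    return df.drop(columns=checksum_cols)
```

<p>This would replace the hard-coded <code>'NI'</code> check, at the cost of occasionally eating a legitimate column that happens to be constant-100.</p>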
<h3 id="table-1-ni-hpi-trends-q1-2005---q4-2021">Table 1: NI HPI Trends Q1 2005 - Q4 2021</h3>
<p><strong>TODO: Regexy way to get rid of the ‘\QX-YYYY -\QX YYYY’ tail</strong></p>
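<p>One hedged sketch of how that TODO might be resolved (a hypothetical <code>strip_quarter_range</code> helper, assuming the tails always match a ‘Qx YYYY - Qx YYYY’ pattern; not part of the notebook):</p>

```python
import re

def strip_quarter_range(title: str) -> str:
    """Remove a trailing 'Qx YYYY - Qx YYYY' span from a table title."""
    # Anchor at end-of-string so mid-title quarter mentions survive
    return re.sub(r'\s*\(?Q[1-4]\s+\d{4}\s*-\s*Q[1-4]\s+\d{4}\)?\s*$', '', title)
```

<p>Titles that end with a single quarter (e.g. ‘… Q4 2021’) are deliberately left alone here; that case would need its own decision.</p>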
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 1: NI House Price Index, Standardised Price and Quarterly and Annual Change</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Year</td>
<td>Quarter</td>
<td>NI House Price Index</td>
<td>NI House Standardised Price</td>
<td>Quarterly Change</td>
<td>Annual Change</td>
<td>NI</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005</td>
<td>Q1</td>
<td>100.883607</td>
<td>111920.268199</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>104.564663</td>
<td>116004.031639</td>
<td>0.036488</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>111.219</td>
<td>123386.352673</td>
<td>0.063638</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>89</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>90</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>91</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>92</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
<tr>
<th>93</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>100</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>94 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">:</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">-></span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Re-header from row 1 (which was row 3 in excel)
</span> <span class="n">new_header</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">offset</span><span class="o">+</span><span class="mi">1</span><span class="p">:]</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">new_header</span>
<span class="c1"># remove 'NaN' trailing columns
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">notna</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)]]</span>
<span class="c1"># 'NI' is a usually hidden column that appears to be a checksum;
</span> <span class="c1">#if it's all there and all 100, remove it, otherwise, complain.
</span> <span class="c1"># (Note, need to change this 'if' logic to just 'if there's a
</span> <span class="c1"># column with all 100's, but cross that bridge later)
</span> <span class="k">if</span> <span class="s">'NI'</span> <span class="ow">in</span> <span class="n">df</span><span class="p">:</span>
<span class="k">assert</span> <span class="n">df</span><span class="p">[</span><span class="s">'NI'</span><span class="p">].</span><span class="nb">all</span><span class="p">()</span> <span class="ow">and</span> <span class="n">df</span><span class="p">[</span><span class="s">'NI'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="mi">100</span><span class="p">,</span> <span class="s">"Not all values in df['NI'] == 100"</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'NI'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Strip rows below the first all-nan row, if there is one
</span> <span class="c1"># (Otherwise this truncates the tables as there is no
</span> <span class="c1"># idxmax in the table of all 'false's)
</span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)):</span>
<span class="n">idx_first_bad_row</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">idxmax</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="n">idx_first_bad_row</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># By Inspection, other tables use 'Sale Year' and 'Sale Quarter'
</span> <span class="k">if</span> <span class="nb">set</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">keys</span><span class="p">()).</span><span class="n">issuperset</span><span class="p">({</span><span class="s">'Sale Year'</span><span class="p">,</span><span class="s">'Sale Quarter'</span><span class="p">}):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'Sale Year'</span><span class="p">:</span><span class="s">'Year'</span><span class="p">,</span>
<span class="s">'Sale Quarter'</span><span class="p">:</span> <span class="s">'Quarter'</span>
<span class="p">})</span>
<span class="c1"># For 'Year','Quarter' indexed pages, there is an implied Year
</span> <span class="c1"># in Q2/4, so fill it downwards
</span> <span class="k">if</span> <span class="nb">set</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">keys</span><span class="p">()).</span><span class="n">issuperset</span><span class="p">({</span><span class="s">'Year'</span><span class="p">,</span><span class="s">'Quarter'</span><span class="p">}):</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Year'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Year'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">'ffill'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1"># In Pandas we can represent Y/Q combinations as proper datetimes
</span> <span class="c1">#https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">column</span><span class="o">=</span><span class="s">'Period'</span><span class="p">,</span>
<span class="n">value</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">PeriodIndex</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">r</span><span class="p">.</span><span class="n">Year</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">r</span><span class="p">.</span><span class="n">Quarter</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">freq</span><span class="o">=</span><span class="s">'Q'</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># reset index, try to fix dtypes, etc, (this should be the last
</span> <span class="c1"># operation before returning!
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">infer_objects</span><span class="p">()</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI House Price Index</th>
<th>NI House Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>100.883607</td>
<td>111920.268199</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>104.564663</td>
<td>116004.031639</td>
<td>0.036488</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>111.219000</td>
<td>123386.352673</td>
<td>0.063638</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>115.083964</td>
<td>127674.143865</td>
<td>0.034751</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>118.354129</td>
<td>131302.064422</td>
<td>0.028415</td>
<td>0.173175</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.931827</td>
<td>147474.561707</td>
<td>0.026103</td>
<td>0.052326</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.382831</td>
<td>149084.306040</td>
<td>0.010915</td>
<td>0.059421</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.105050</td>
<td>154323.134643</td>
<td>0.035140</td>
<td>0.095724</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>143.346066</td>
<td>159028.118093</td>
<td>0.030488</td>
<td>0.106491</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>143.456594</td>
<td>159150.737832</td>
<td>0.000771</td>
<td>0.079174</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'Table 1'</span><span class="p">:</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 1'</span><span class="p">])</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>32
</code></pre></div></div>
<p>One down, 31 to go…</p>
<h3 id="table-2-ni-hpi--standardised-price-statistics-by-property-type-q4-2021">Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Property Type</th>
<th>Index\n(Quarter 4 2021)</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price\n(Quarter 4 2021)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Detached</td>
<td>143.488806</td>
<td>0.008491</td>
<td>0.093110</td>
<td>241131.373512</td>
</tr>
<tr>
<th>1</th>
<td>Semi-Detached</td>
<td>140.680694</td>
<td>0.004211</td>
<td>0.076953</td>
<td>153718.543755</td>
</tr>
<tr>
<th>2</th>
<td>Terrace</td>
<td>149.564169</td>
<td>-0.009577</td>
<td>0.078758</td>
<td>112831.710806</td>
</tr>
<tr>
<th>3</th>
<td>Apartment</td>
<td>133.376791</td>
<td>-0.014732</td>
<td>0.032761</td>
<td>116554.228620</td>
</tr>
<tr>
<th>4</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<p>Those ‘\n(Quarter 4 2021)’ suffixes are unnecessary, so <em>for this table</em>, let’s clear them.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Property Type</th>
<th>Index</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Detached</td>
<td>143.488806</td>
<td>0.008491</td>
<td>0.093110</td>
<td>241131.373512</td>
</tr>
<tr>
<th>1</th>
<td>Semi-Detached</td>
<td>140.680694</td>
<td>0.004211</td>
<td>0.076953</td>
<td>153718.543755</td>
</tr>
<tr>
<th>2</th>
<td>Terrace</td>
<td>149.564169</td>
<td>-0.009577</td>
<td>0.078758</td>
<td>112831.710806</td>
</tr>
<tr>
<th>3</th>
<td>Apartment</td>
<td>133.376791</td>
<td>-0.014732</td>
<td>0.032761</td>
<td>116554.228620</td>
</tr>
<tr>
<th>4</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span>
</code></pre></div></div>
<h3 id="table-2a-ni-detached-property-price-index-q1-2005---q4-2021">Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 2a'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI Detached Property Price Index</th>
<th>NI Detached Property Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>95.465560</td>
<td>160428.832662</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>100.974498</td>
<td>169686.542965</td>
<td>0.057706</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.526236</td>
<td>180696.666810</td>
<td>0.064885</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>110.279730</td>
<td>185323.883533</td>
<td>0.025608</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>112.270506</td>
<td>188669.361197</td>
<td>0.018052</td>
<td>0.176032</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.266614</td>
<td>220592.113069</td>
<td>0.026393</td>
<td>0.055357</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.814014</td>
<td>224872.989982</td>
<td>0.019406</td>
<td>0.071429</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.682380</td>
<td>234734.715703</td>
<td>0.043855</td>
<td>0.129844</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>142.280745</td>
<td>239101.239764</td>
<td>0.018602</td>
<td>0.112515</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>143.488806</td>
<td>241131.373512</td>
<td>0.008491</td>
<td>0.093110</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<h3 id="table-2x-ni-xxx-property-price-index-q1-2005---q4-2021">Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021</h3>
<p>This table structure is consistent across the rest of the Table 2x cohort, mapping to the Property Types listed in Table 2.</p>
<p>For the time being, we can ignore these, but this will probably become a pain later on…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 2'</span><span class="p">][</span><span class="s">'Property Type'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 Detached
1 Semi-Detached
2 Terrace
3 Apartment
4 All
Name: Property Type, dtype: object
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="n">table2s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 2[a-z]'</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 2'</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is matching itself!'</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 20'</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is greedy!'</span>
<span class="k">assert</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="s">'Table 2z'</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'Table 2 is matching incorrectly!'</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table2s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 2[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table2s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(6, 26)
</code></pre></div></div>
<p>6 down, 26 to go.</p>
<h3 id="table-3-ni-hpi--standardised-price-statistics-by-newexisting-resold-dwelling-type-q4-2021">Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021</h3>
<p>These appear to share a similar structure with the Table 2x cohort… hopefully.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 3'</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span> <span class="c1"># Stolen from Table 2 Treatment
</span><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Property Type</th>
<th>Index</th>
<th>Percentage Change on Previous Quarter</th>
<th>Percentage Change over 12 months</th>
<th>Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>New</td>
<td>141.769973</td>
<td>0.024877</td>
<td>0.072609</td>
<td>185966.524090</td>
</tr>
<tr>
<th>1</th>
<td>Existing Resold</td>
<td>143.518977</td>
<td>-0.004918</td>
<td>0.080771</td>
<td>152275.828046</td>
</tr>
<tr>
<th>2</th>
<td>All</td>
<td>143.456594</td>
<td>0.000771</td>
<td>0.079174</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 3'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 3a'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>NI New Dwellings Price Index</th>
<th>NI New Dwellings Standardised Price</th>
<th>Quarterly Change</th>
<th>Annual Change</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>95.804706</td>
<td>125671.662611</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>101.229223</td>
<td>132787.263460</td>
<td>0.056621</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>106.243580</td>
<td>139364.837967</td>
<td>0.049535</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>110.118105</td>
<td>144447.239874</td>
<td>0.036468</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.624410</td>
<td>149046.629634</td>
<td>0.031841</td>
<td>0.186000</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.173052</td>
<td>173377.779440</td>
<td>0.004103</td>
<td>0.036125</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.772562</td>
<td>175475.933612</td>
<td>0.012102</td>
<td>0.027916</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>136.969311</td>
<td>179669.264190</td>
<td>0.023897</td>
<td>0.046474</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>138.328776</td>
<td>181452.540106</td>
<td>0.009925</td>
<td>0.050867</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.769973</td>
<td>185966.524090</td>
<td>0.024877</td>
<td>0.072609</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table3s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 3[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table3s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10, 22)
</code></pre></div></div>
<h3 id="table-4-number-of-verified-residential-property-sales-q1-2005---q4-2021">Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021</h3>
<p>Table 4 is not looking great</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005\n</td>
<td>Quarter 1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Quarter 1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>NaN</td>
<td>Quarter 2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>NaN</td>
<td>Quarter 3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>NaN</td>
<td>Quarter 4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>2021 Total</td>
<td>9174</td>
<td>9987</td>
<td>8295</td>
<td>2313</td>
<td>29769</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>89 rows × 8 columns</p>
</div>
<p>Of note: the header row is now at index 3 instead of index 1, thanks to a load of fluff at the start that probably won’t be consistent between reports, so it will almost certainly mess up my day in a few months.</p>
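<p>Since that offset is liable to drift between reports, one defensive option (a sketch only — <code>find_header_offset</code> and the toy frame are hypothetical, not part of the pipeline above) is to locate the header row by its known first cell rather than hard-coding the index:</p>

```python
import pandas as pd

def find_header_offset(df: pd.DataFrame, marker: str = "Sale Year") -> int:
    """Return the positional index of the row whose first cell equals `marker`,
    so the header offset survives changes in the length of the preamble."""
    first_col = df.iloc[:, 0].astype(str).str.strip()
    matches = first_col[first_col == marker]
    if matches.empty:
        raise ValueError(f"No header row starting with {marker!r}")
    return df.index.get_loc(matches.index[0])

# Toy frame mimicking Table 4's layout: three rows of fluff, then the real header
raw = pd.DataFrame({
    "A": ["Verified Sales = ...", "Please note ...", "Back to contents",
          "Sale Year", "2005\n"],
    "B": [None, None, None, "Sale Quarter", "Quarter 1"],
})
print(find_header_offset(raw))  # → 3
```

<p>Something like <code>basic_cleanup(df, offset=find_header_offset(df))</code> would then keep working even if a note gets added or removed at the top of the sheet.</p>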
<p>Also, <strong>Quarter labels</strong> have now been shifted into ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷♂️. More egregiously, it looks like <strong>‘\n’ has leaked into some Sale Year values</strong>. Funtimes.</p>
<p>Finally, and possibly most annoyingly, the introduction of <strong>partial total rows</strong> is going to throw things off; since this isn't a validation study, stuff 'em.</p>
<p>In an effort not to over-complicate <code class="language-plaintext highlighter-rouge">basic_cleanup</code>, we can try and clean these table specific issues first;</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Quarter ([1-4])'</span><span class="p">,</span><span class="sa">r</span><span class="s">'Q\1'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005\n</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>NaN</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>NaN</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>NaN</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>2021 Total</td>
<td>9174</td>
<td>9987</td>
<td>8295</td>
<td>2313</td>
<td>29769</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>89 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Lose the year new-lines (needs astype because non str lines are
# correctly inferred to be ints, so .str methods nan-out
</span><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>
</code></pre></div></div>
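<p>To see why the <code>astype(str)</code> is needed: on a mixed object column, pandas’ <code>.str</code> accessor returns NaN for every non-string element, silently eating the year values that were parsed as ints. A minimal illustration (toy series, not the real data):</p>

```python
import pandas as pd

# A mixed column, as pandas tends to read the Sale Year cells: some str, some int
s = pd.Series(["2005\n", 2006])

print(s.str.replace("\n", "", regex=False).tolist())              # int row becomes NaN
print(s.astype(str).str.replace("\n", "", regex=False).tolist())  # both rows survive
```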
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 4: Number of Verified Residential Property Sales</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Verified Sales = Sales matched to a property i...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>Sale Year</td>
<td>Sale Quarter</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Apartment</td>
<td>Total</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>82</th>
<td>nan</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
<td>NaN</td>
</tr>
<tr>
<th>84</th>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
<td>NaN</td>
</tr>
<tr>
<th>85</th>
<td>nan</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
<td>NaN</td>
</tr>
<tr>
<th>86</th>
<td>nan</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
<td>Please note this figure is provisional and wil...</td>
</tr>
<tr>
<th>87</th>
<td>nan</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
<td>and new dwellings sold in this quarter being a...</td>
</tr>
</tbody>
</table>
<p>72 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>3</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Apartment</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>2208</td>
<td>2474</td>
<td>2808</td>
<td>483</td>
<td>7973</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>2297</td>
<td>2655</td>
<td>2952</td>
<td>539</td>
<td>8443</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>2498</td>
<td>3003</td>
<td>3492</td>
<td>631</td>
<td>9624</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>2185</td>
<td>2650</td>
<td>3158</td>
<td>594</td>
<td>8587</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<p>That's awkward enough to get its own function…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_4</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 4: Number of Verified Residential Property Sales
* Regex 'Quarter X' to 'QX' in future 'Sales Quarter' column
* Drop Year Total rows
* Clear any Newlines from the future 'Sales Year' column
* call `basic_cleanup` with offset=3
"""</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'Quarter ([1-4])'</span><span class="p">,</span><span class="sa">r</span><span class="s">'Q\1'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
<span class="c1"># Lose the year new-lines (needs astype because non str lines are
</span> <span class="c1"># correctly inferred to be ints, so .str methods nan-out
</span> <span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">option_context</span><span class="p">(</span><span class="s">'mode.chained_assignment'</span><span class="p">,</span><span class="bp">None</span><span class="p">):</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>
<span class="k">return</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">cleanup_table_4</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">].</span><span class="n">copy</span><span class="p">())</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>3</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Apartment</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>809</td>
<td>894</td>
<td>1035</td>
<td>198</td>
<td>2936</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>2208</td>
<td>2474</td>
<td>2808</td>
<td>483</td>
<td>7973</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>2297</td>
<td>2655</td>
<td>2952</td>
<td>539</td>
<td>8443</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>2498</td>
<td>3003</td>
<td>3492</td>
<td>631</td>
<td>9624</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>2185</td>
<td>2650</td>
<td>3158</td>
<td>594</td>
<td>8587</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>2808</td>
<td>2944</td>
<td>2170</td>
<td>555</td>
<td>8477</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>2509</td>
<td>2477</td>
<td>1962</td>
<td>561</td>
<td>7509</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>2668</td>
<td>2613</td>
<td>2056</td>
<td>604</td>
<td>7941</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>2519</td>
<td>2797</td>
<td>2220</td>
<td>633</td>
<td>8169</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>1478</td>
<td>2100</td>
<td>2057</td>
<td>515</td>
<td>6150</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_4</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 4'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(11, 21)
</code></pre></div></div>
<h3 id="table-5-hpi--standardised-price-for-each-local-government-district-in-ni">Table 5: HPI & Standardised Price for each Local Government District in NI</h3>
<p>This <em>nearly works</em>, but structurally it requires a multi-index column to make sense…</p>
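<p>For reference, the payoff of a two-level column index is that you can slice by either level and get a sensible frame back; a toy example (the LGD names and figures here are made up for illustration):</p>

```python
import pandas as pd

# Two inner metrics per outer LGD, mirroring the shape Table 5 wants
cols = pd.MultiIndex.from_product(
    [["Belfast", "Derry City and Strabane"], ["Index", "Price"]],
    names=["LGD", "Metric"],
)
df = pd.DataFrame(
    [[99.8, 92000, 98.1, 87000],
     [101.2, 94500, 99.0, 88500]],
    columns=cols,
)

print(df["Belfast"])                           # both metrics for one LGD
print(df.xs("Price", axis=1, level="Metric"))  # one metric across all LGDs
```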
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey HPI</th>
<th>Antrim and Newtownabbey Standardised Price</th>
<th>Ards and North Down HPI</th>
<th>Ards and North Down Standardised Price</th>
<th>Armagh City, Banbridge and Craigavon HPI</th>
<th>Armagh City, Banbridge and Craigavon Standardised Price</th>
<th>Belfast HPI</th>
<th>...</th>
<th>Fermanagh and Omagh HPI</th>
<th>Fermanagh and Omagh Standardised Price</th>
<th>Lisburn and Castlereagh HPI</th>
<th>Lisburn and Castlereagh Standardised Price</th>
<th>Mid and East Antrim HPI</th>
<th>Mid and East Antrim Standardised Price</th>
<th>Mid Ulster Standardised HPI</th>
<th>Mid Ulster Standardised Price</th>
<th>Newry, Mourne and Down HPI</th>
<th>Newry, Mourne and Down Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Two inner-columns per LGD
</span><span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">unique</span><span class="p">()</span>
<span class="n">lgds</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
</code></pre></div></div>
<p>For some reason, Mid Ulster has a ‘Standardised HPI’ column which throws off the above trick, so we gotta make it ugly…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised HPI'</span><span class="p">,</span><span class="s">' HPI'</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">).</span><span class="n">unique</span><span class="p">()</span>
<span class="n">lgds</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[:</span><span class="mi">3</span><span class="p">],</span> <span class="o">*</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">lgds</span><span class="p">,[</span><span class="s">'Index'</span><span class="p">,</span><span class="s">'Price'</span><span class="p">]],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'LGD'</span><span class="p">,</span><span class="s">'Metric'</span><span class="p">])]</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>(Antrim and Newtownabbey, Index)</th>
<th>(Antrim and Newtownabbey, Price)</th>
<th>(Ards and North Down, Index)</th>
<th>(Ards and North Down, Price)</th>
<th>(Armagh City, Banbridge and Craigavon, Index)</th>
<th>(Armagh City, Banbridge and Craigavon, Price)</th>
<th>(Belfast, Index)</th>
<th>...</th>
<th>(Fermanagh and Omagh, Index)</th>
<th>(Fermanagh and Omagh, Price)</th>
<th>(Lisburn and Castlereagh, Index)</th>
<th>(Lisburn and Castlereagh, Price)</th>
<th>(Mid and East Antrim, Index)</th>
<th>(Mid and East Antrim, Price)</th>
<th>(Mid Ulster, Index)</th>
<th>(Mid Ulster, Price)</th>
<th>(Newry, Mourne and Down, Index)</th>
<th>(Newry, Mourne and Down, Price)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<p>We <em>could</em> turn this into a proper MultiIndex, but that would mean pushing the Period/Year/Quarter columns into index keys, which would be inconsistent with the rest of the ‘cleaned’ dataset; that can be a downstream problem. At least we’ve got the relevant metrics consistent!</p>
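<p>As a minimal, self-contained sketch (with made-up numbers, not the NISRA figures), this is the behaviour being relied on: a flat column index that mixes plain labels with <code class="language-plaintext highlighter-rouge">(LGD, Metric)</code> tuples still supports tuple-based selection.</p>

```python
import pandas as pd

# Hypothetical miniature of the table above: plain labels for the date
# column, (LGD, Metric) tuples for the per-district metrics.
lgds = ["Belfast", "Mid Ulster"]
cols = ["Period", *pd.MultiIndex.from_product([lgds, ["Index", "Price"]])]
df = pd.DataFrame(
    [["2005Q1", 99.8, 118790.0, 102.7, 114882.0]],
    columns=cols,
)

# A tuple key still selects a single column, even though df.columns is a
# flat Index of mixed labels rather than a true MultiIndex.
belfast_price = df[("Belfast", "Price")]
print(belfast_price.iloc[0])
```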
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_5</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 5: Standardised House Price & Index for each Local Government District Northern Ireland
    * Build (LGD, Metric) tuple columns from the per-district Index/Price headers
"""</span>
<span class="c1"># Basic Cleanup first
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="c1"># Build multi-index of LGD / Metric [Index,Price]
</span> <span class="c1"># Two inner-columns per LGD
</span> <span class="n">lgds</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">3</span><span class="p">:].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised HPI'</span><span class="p">,</span><span class="s">' HPI'</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' HPI'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' Standardised Price'</span><span class="p">,</span><span class="s">''</span><span class="p">)</span>\
<span class="p">.</span><span class="n">unique</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[:</span><span class="mi">3</span><span class="p">],</span> <span class="o">*</span><span class="n">pd</span><span class="p">.</span><span class="n">MultiIndex</span><span class="p">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">lgds</span><span class="p">,[</span><span class="s">'Index'</span><span class="p">,</span><span class="s">'Price'</span><span class="p">]],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'LGD'</span><span class="p">,</span><span class="s">'Metric'</span><span class="p">])]</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_5</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>(Antrim and Newtownabbey, Index)</th>
<th>(Antrim and Newtownabbey, Price)</th>
<th>(Ards and North Down, Index)</th>
<th>(Ards and North Down, Price)</th>
<th>(Armagh City, Banbridge and Craigavon, Index)</th>
<th>(Armagh City, Banbridge and Craigavon, Price)</th>
<th>(Belfast, Index)</th>
<th>...</th>
<th>(Fermanagh and Omagh, Index)</th>
<th>(Fermanagh and Omagh, Price)</th>
<th>(Lisburn and Castlereagh, Index)</th>
<th>(Lisburn and Castlereagh, Price)</th>
<th>(Mid and East Antrim, Index)</th>
<th>(Mid and East Antrim, Price)</th>
<th>(Mid Ulster, Index)</th>
<th>(Mid Ulster, Price)</th>
<th>(Newry, Mourne and Down, Index)</th>
<th>(Newry, Mourne and Down, Price)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>99.903277</td>
<td>114851.528270</td>
<td>97.150602</td>
<td>130398.569667</td>
<td>102.245597</td>
<td>100785.145986</td>
<td>99.839849</td>
<td>...</td>
<td>109.429237</td>
<td>104874.980231</td>
<td>95.958322</td>
<td>128828.327513</td>
<td>102.246427</td>
<td>105865.408901</td>
<td>102.714778</td>
<td>114882.211239</td>
<td>100.810773</td>
<td>113420.880186</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>99.723509</td>
<td>114644.862732</td>
<td>100.794472</td>
<td>137133.037807</td>
<td>106.325843</td>
<td>104807.109982</td>
<td>100.589870</td>
<td>...</td>
<td>117.239850</td>
<td>112360.529330</td>
<td>100.164437</td>
<td>134475.225477</td>
<td>104.443325</td>
<td>108140.065924</td>
<td>110.386311</td>
<td>123462.501283</td>
<td>111.965743</td>
<td>125971.191415</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>107.940849</td>
<td>124091.740608</td>
<td>102.167971</td>
<td>137133.037807</td>
<td>110.006212</td>
<td>108434.910333</td>
<td>109.614861</td>
<td>...</td>
<td>125.900145</td>
<td>120660.397585</td>
<td>106.757895</td>
<td>143327.237126</td>
<td>112.748278</td>
<td>116738.970434</td>
<td>117.595723</td>
<td>131525.929577</td>
<td>117.235685</td>
<td>131900.333698</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>111.934696</td>
<td>128683.175719</td>
<td>106.396379</td>
<td>142808.538807</td>
<td>116.073031</td>
<td>114415.072260</td>
<td>110.728237</td>
<td>...</td>
<td>130.781315</td>
<td>125338.422216</td>
<td>111.307116</td>
<td>149434.769200</td>
<td>114.584090</td>
<td>118639.759900</td>
<td>121.851999</td>
<td>136286.397473</td>
<td>123.628047</td>
<td>139092.296651</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>113.494351</td>
<td>130476.197845</td>
<td>109.206160</td>
<td>146579.915492</td>
<td>121.831058</td>
<td>120090.852733</td>
<td>112.326582</td>
<td>...</td>
<td>135.555749</td>
<td>129914.152078</td>
<td>110.539212</td>
<td>148403.823796</td>
<td>115.878975</td>
<td>119980.477260</td>
<td>129.544702</td>
<td>144890.365875</td>
<td>127.449613</td>
<td>143391.890242</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>131.045293</td>
<td>150653.238745</td>
<td>123.824862</td>
<td>166201.593253</td>
<td>130.857866</td>
<td>128988.723586</td>
<td>133.525177</td>
<td>...</td>
<td>142.381814</td>
<td>136456.127817</td>
<td>129.797418</td>
<td>174258.823716</td>
<td>130.813697</td>
<td>135443.809729</td>
<td>124.617456</td>
<td>139379.446212</td>
<td>138.815696</td>
<td>156179.721555</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>133.481101</td>
<td>153453.510344</td>
<td>128.398202</td>
<td>172340.072904</td>
<td>130.382658</td>
<td>128520.303209</td>
<td>135.257679</td>
<td>...</td>
<td>141.464114</td>
<td>135576.621629</td>
<td>131.166305</td>
<td>176096.615474</td>
<td>131.268719</td>
<td>135914.936888</td>
<td>127.657615</td>
<td>142779.737045</td>
<td>138.481347</td>
<td>155803.549899</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>137.827568</td>
<td>158450.326506</td>
<td>130.543464</td>
<td>175219.510303</td>
<td>137.439265</td>
<td>135476.115278</td>
<td>138.558752</td>
<td>...</td>
<td>146.059468</td>
<td>139980.724158</td>
<td>134.815374</td>
<td>180995.654429</td>
<td>138.663140</td>
<td>143571.081234</td>
<td>129.231669</td>
<td>144540.250870</td>
<td>145.000829</td>
<td>163138.533592</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>141.167257</td>
<td>162289.724156</td>
<td>134.116570</td>
<td>180015.444071</td>
<td>143.522521</td>
<td>141472.480114</td>
<td>142.250634</td>
<td>...</td>
<td>149.743153</td>
<td>143511.101233</td>
<td>136.178634</td>
<td>182825.891020</td>
<td>140.051373</td>
<td>145008.450168</td>
<td>135.759680</td>
<td>151841.560426</td>
<td>152.177814</td>
<td>171213.265699</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>141.900751</td>
<td>163132.969278</td>
<td>134.883745</td>
<td>181045.170599</td>
<td>140.957739</td>
<td>138944.332704</td>
<td>140.072076</td>
<td>...</td>
<td>156.033991</td>
<td>149540.124905</td>
<td>137.683726</td>
<td>184846.544332</td>
<td>141.311289</td>
<td>146312.960523</td>
<td>136.847809</td>
<td>153058.587031</td>
<td>152.044892</td>
<td>171063.717288</td>
</tr>
</tbody>
</table>
<p>68 rows × 25 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">]</span><span class="o">=</span><span class="n">cleanup_table_5</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(12, 20)
</code></pre></div></div>
<h3 id="table-5a-number-of-verified-residential-property-sales-by-local-government-district">Table 5a: Number of Verified Residential Property Sales by Local Government District</h3>
<p>This one has a new problem: the Sale Year/Quarter values are squished together in a single column. That would do a few terrible things to our <code class="language-plaintext highlighter-rouge">basic_cleanup</code>, so it needs to be handled ahead of cleanup.
It also has annual total rows.</p>
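<p>The splitting step can be sketched on a toy series first (hypothetical values): <code class="language-plaintext highlighter-rouge">str.extract</code> returns one column per capture group, and rows that don’t match the pattern, like the annual totals, come back as NaN, which makes them easy to spot.</p>

```python
import pandas as pd

# Toy stand-in for the squished 'Sale Year/Quarter' column, including an
# annual-total row that should not parse.
s = pd.Series(["Q1 2005", "Q4 2021", "2021 Total"])

# Two capture groups become two columns; non-matching rows are NaN.
dates = s.str.extract(r"(Q[1-4]) ([0-9]{4})").rename(columns={0: "Quarter", 1: "Year"})
print(dates)
```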
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 5a: Number of Verified Residential Property Sales by Local Government District</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Sale Year/Quarter</td>
<td>Antrim and Newtownabbey</td>
<td>Ards and North Down</td>
<td>Armagh City, Banbridge and Craigavon</td>
<td>Belfast</td>
<td>Causeway Coast and Glens</td>
<td>Derry City and Strabane</td>
<td>Fermanagh and Omagh</td>
<td>Lisburn and Castlereagh</td>
<td>Mid and East Antrim</td>
<td>Mid Ulster</td>
<td>Newry, Mourne and Down</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>2647</td>
<td>3685</td>
<td>3333</td>
<td>5934</td>
<td>2327</td>
<td>1803</td>
<td>1181</td>
<td>3053</td>
<td>2346</td>
<td>1460</td>
<td>2000</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>89</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>90 rows × 12 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dates</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">'(Q[1-4]) ([0-9]{4})'</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'Quarter'</span><span class="p">,</span><span class="mi">1</span><span class="p">:</span><span class="s">'Year'</span><span class="p">})</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'Quarter'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">]:</span><span class="c1"># each insert at position 1 pushes the previous column right, so Year ends up before Quarter
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">dates</span><span class="p">[</span><span class="n">c</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">c</span> <span class="c1"># Need to have the right colname for when `basic_cleanup` is called.
</span>
</code></pre></div></div>
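<p>The ordering trick in that loop is easy to get backwards, so here’s a minimal sketch (toy frame, not the real data) of why iterating <code class="language-plaintext highlighter-rouge">['Quarter','Year']</code> leaves Year to the left:</p>

```python
import pandas as pd

# Toy frame: inserting 'Quarter' then 'Year' at position 1 leaves 'Year'
# leftmost, because each insert pushes the previous one to the right.
df = pd.DataFrame({"label": ["Q1 2005"], "value": [236]})
for c in ["Quarter", "Year"]:
    df.insert(1, c, ["x"])
print(list(df.columns))
```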
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 5a: Number of Verified Residential Property Sales by Local Government District</th>
<th>Year</th>
<th>Quarter</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
<th>Unnamed: 11</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>Sale Year/Quarter</td>
<td>Year</td>
<td>Quarter</td>
<td>Antrim and Newtownabbey</td>
<td>Ards and North Down</td>
<td>Armagh City, Banbridge and Craigavon</td>
<td>Belfast</td>
<td>Causeway Coast and Glens</td>
<td>Derry City and Strabane</td>
<td>Fermanagh and Omagh</td>
<td>Lisburn and Castlereagh</td>
<td>Mid and East Antrim</td>
<td>Mid Ulster</td>
<td>Newry, Mourne and Down</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>NaN</td>
<td>NaN</td>
<td>2647</td>
<td>3685</td>
<td>3333</td>
<td>5934</td>
<td>2327</td>
<td>1803</td>
<td>1181</td>
<td>3053</td>
<td>2346</td>
<td>1460</td>
<td>2000</td>
</tr>
<tr>
<th>88</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>89</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>90 rows × 14 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
</code></pre></div></div>
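<p>The <code class="language-plaintext highlighter-rouge">fillna(False)</code> there is doing real work; a small sketch (toy series) shows that <code class="language-plaintext highlighter-rouge">str.contains</code> propagates NaN for blank rows, so the mask has to be filled before it can be negated and used for indexing:</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the first column: a data row, a 'Total' row, and a
# blank row like those at the bottom of the sheet.
s = pd.Series(["Q4 2021", "2021 Total", np.nan])

# str.contains returns NaN for NaN input; fillna(False) makes the mask
# purely boolean so ~mask can safely be used to index. The blank row
# passes through here and is left for later cleanup.
mask = s.str.contains("Total").fillna(False)
kept = s[~mask]
print(kept.tolist())
```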
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">offset</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Sale Year/Quarter</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey</th>
<th>Ards and North Down</th>
<th>Armagh City, Banbridge and Craigavon</th>
<th>Belfast</th>
<th>Causeway Coast and Glens</th>
<th>Derry City and Strabane</th>
<th>Fermanagh and Omagh</th>
<th>Lisburn and Castlereagh</th>
<th>Mid and East Antrim</th>
<th>Mid Ulster</th>
<th>Newry, Mourne and Down</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>Q1 2005</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>Q2 2005</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>Q3 2005</td>
<td>2005</td>
<td>Q3</td>
<td>757</td>
<td>960</td>
<td>968</td>
<td>1722</td>
<td>714</td>
<td>632</td>
<td>365</td>
<td>654</td>
<td>686</td>
<td>403</td>
<td>582</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>Q4 2005</td>
<td>2005</td>
<td>Q4</td>
<td>893</td>
<td>995</td>
<td>1199</td>
<td>1943</td>
<td>834</td>
<td>746</td>
<td>385</td>
<td>670</td>
<td>759</td>
<td>489</td>
<td>711</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>Q1 2006</td>
<td>2006</td>
<td>Q1</td>
<td>761</td>
<td>933</td>
<td>1038</td>
<td>1686</td>
<td>763</td>
<td>708</td>
<td>348</td>
<td>600</td>
<td>668</td>
<td>515</td>
<td>567</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>Q4 2020</td>
<td>2020</td>
<td>Q4</td>
<td>756</td>
<td>1052</td>
<td>974</td>
<td>1565</td>
<td>728</td>
<td>496</td>
<td>336</td>
<td>830</td>
<td>685</td>
<td>419</td>
<td>636</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>Q1 2021</td>
<td>2021</td>
<td>Q1</td>
<td>652</td>
<td>976</td>
<td>849</td>
<td>1497</td>
<td>610</td>
<td>466</td>
<td>290</td>
<td>762</td>
<td>572</td>
<td>349</td>
<td>486</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>Q2 2021</td>
<td>2021</td>
<td>Q2</td>
<td>724</td>
<td>1018</td>
<td>823</td>
<td>1581</td>
<td>675</td>
<td>447</td>
<td>316</td>
<td>850</td>
<td>629</td>
<td>375</td>
<td>503</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>Q3 2021</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>Q4 2021</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
</tbody>
</table>
<p>68 rows × 15 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_5a</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 5a: Number of Verified Residential Property Sales by Local Government District
* Parse the 'Sale Year/Quarter' to two separate cols
* Insert future-headers for Quarter and Year cols
    * Remove rows with 'Total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * Perform `basic_cleanup` with offset=2
"""</span>
<span class="c1"># Safety first
</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
</span> <span class="n">dates</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">'(Q[1-4]) ([0-9]{4})'</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'Quarter'</span><span class="p">,</span><span class="mi">1</span><span class="p">:</span><span class="s">'Year'</span><span class="p">})</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'Quarter'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">]:</span><span class="c1"># each insert at position 1 pushes the previous column right, so Year ends up before Quarter
</span> <span class="n">df</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">dates</span><span class="p">[</span><span class="n">c</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="n">c</span> <span class="c1"># Need to have the right colname for when `basic_cleanup` is called.
</span>
<span class="c1"># Remove 'total' rows from the future 'Sale Year/Quarter' column
</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'Total'</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="bp">False</span><span class="p">)]</span>
<span class="c1"># Remove the 'Sale Year/Quarter' column all together
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:]</span>
<span class="c1"># Standard cleanup
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Antrim and Newtownabbey</th>
<th>Ards and North Down</th>
<th>Armagh City, Banbridge and Craigavon</th>
<th>Belfast</th>
<th>Causeway Coast and Glens</th>
<th>Derry City and Strabane</th>
<th>Fermanagh and Omagh</th>
<th>Lisburn and Castlereagh</th>
<th>Mid and East Antrim</th>
<th>Mid Ulster</th>
<th>Newry, Mourne and Down</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>236</td>
<td>320</td>
<td>333</td>
<td>623</td>
<td>236</td>
<td>226</td>
<td>138</td>
<td>219</td>
<td>188</td>
<td>176</td>
<td>241</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>735</td>
<td>857</td>
<td>961</td>
<td>1549</td>
<td>712</td>
<td>637</td>
<td>316</td>
<td>655</td>
<td>618</td>
<td>428</td>
<td>505</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>757</td>
<td>960</td>
<td>968</td>
<td>1722</td>
<td>714</td>
<td>632</td>
<td>365</td>
<td>654</td>
<td>686</td>
<td>403</td>
<td>582</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>893</td>
<td>995</td>
<td>1199</td>
<td>1943</td>
<td>834</td>
<td>746</td>
<td>385</td>
<td>670</td>
<td>759</td>
<td>489</td>
<td>711</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>761</td>
<td>933</td>
<td>1038</td>
<td>1686</td>
<td>763</td>
<td>708</td>
<td>348</td>
<td>600</td>
<td>668</td>
<td>515</td>
<td>567</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>756</td>
<td>1052</td>
<td>974</td>
<td>1565</td>
<td>728</td>
<td>496</td>
<td>336</td>
<td>830</td>
<td>685</td>
<td>419</td>
<td>636</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>652</td>
<td>976</td>
<td>849</td>
<td>1497</td>
<td>610</td>
<td>466</td>
<td>290</td>
<td>762</td>
<td>572</td>
<td>349</td>
<td>486</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>724</td>
<td>1018</td>
<td>823</td>
<td>1581</td>
<td>675</td>
<td>447</td>
<td>316</td>
<td>850</td>
<td>629</td>
<td>375</td>
<td>503</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>739</td>
<td>989</td>
<td>931</td>
<td>1584</td>
<td>625</td>
<td>485</td>
<td>325</td>
<td>869</td>
<td>671</td>
<td>377</td>
<td>574</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>532</td>
<td>702</td>
<td>730</td>
<td>1272</td>
<td>417</td>
<td>405</td>
<td>250</td>
<td>572</td>
<td>474</td>
<td>359</td>
<td>437</td>
</tr>
</tbody>
</table>
<p>68 rows × 14 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">]</span><span class="o">=</span><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 5a'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(13, 19)
</code></pre></div></div>
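<p>For reference, the <code class="language-plaintext highlighter-rouge">str.extract</code> step at the heart of <code class="language-plaintext highlighter-rouge">cleanup_table_5a</code> can be sketched in isolation (a minimal illustration with made-up values; the real tables carry these strings in their first column):</p>

```python
import pandas as pd

# Illustrative stand-in for the munged first column of Table 5a
s = pd.Series(["Q1 2005", "Q2 2005", "2005 Total"])

# Same regex as the cleanup above: capture Quarter and Year separately
dates = s.str.extract(r"(Q[1-4]) ([0-9]{4})").rename(columns={0: "Quarter", 1: "Year"})

# Rows that don't match the pattern (e.g. 'Total' rows) come out as NaN,
# which is why the cleanup drops them in a separate step.
```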
<h3 id="table-6-standardised-house-price--index-for-all-urban-and-rural-areas-in-ni">Table 6: Standardised House Price & Index for all Urban and Rural areas in NI</h3>
<p>Wee buns, thankfully. It’s still mixing ‘HPI’ and ‘Index’ naming, but that’s a downstream problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">])</span>
<span class="n">df</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Urban Areas HPI</th>
<th>Urban Areas Standardised Price</th>
<th>Rural Areas HPI</th>
<th>Rural Areas Standardised Price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>101.309947</td>
<td>107723.320891</td>
<td>100.109860</td>
<td>124292.601178</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>104.402908</td>
<td>111012.079786</td>
<td>105.467951</td>
<td>138865.721275</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>111.163485</td>
<td>118200.631818</td>
<td>111.847591</td>
<td>138865.721275</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>114.871996</td>
<td>122143.908606</td>
<td>116.175119</td>
<td>144238.615701</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>118.187559</td>
<td>125669.361667</td>
<td>119.329374</td>
<td>148154.818847</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.610763</td>
<td>141005.619094</td>
<td>133.854953</td>
<td>166189.226014</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.077654</td>
<td>142565.370205</td>
<td>135.267264</td>
<td>167942.698911</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>138.575881</td>
<td>147348.355880</td>
<td>140.501443</td>
<td>174441.256673</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>142.840470</td>
<td>151882.912133</td>
<td>144.695321</td>
<td>179648.216283</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>142.375033</td>
<td>151388.010443</td>
<td>146.115278</td>
<td>181411.180623</td>
</tr>
</tbody>
</table>
<p>68 rows × 7 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">]</span><span class="o">=</span><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 6'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(14, 18)
</code></pre></div></div>
<h3 id="table-7-standardised-house-price--index-for-rural-areas-of-northern-ireland-by-drive-times">Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</h3>
<p>Nearly wee buns; this one doesn’t have Year or Quarter headers, and the columns carry an extra <code class="language-plaintext highlighter-rouge">\n (Ref: Q1 2015)</code> suffix, which will complicate downstream analysis if that changes over time…</p>
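<p>If that reference-quarter suffix ever does drift, a defensive normalisation of the column names would insulate downstream code. A hedged sketch (the column names here are copied from Table 7; the stripping step is my suggestion, not part of the cleanup below):</p>

```python
import pandas as pd

# Two of Table 7's actual column names, one carrying the '(Ref: Q1 2015)' suffix
cols = pd.Index([
    "Drive time within 1hr of Belfast Index",
    "Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)",
])

# Strip any trailing '(Ref: ...)' annotation, whatever quarter it references
clean_cols = cols.str.replace(r"\s*\(Ref:.*\)", "", regex=True).str.strip()
```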
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>NaN</td>
<td>NaN</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time within 1hr of Belfast Index</td>
<td>Drive time within 1hr of Belfast Price\n(Ref:...</td>
<td>Drive time outside 1hr of Belfast Index</td>
<td>Drive time outside 1hr of Belfast Price\n(Ref...</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2015</td>
<td>Q1</td>
<td>100</td>
<td>124898.676844</td>
<td>100</td>
<td>122528.427865</td>
<td>100</td>
<td>128955.274996</td>
<td>100</td>
<td>111866.40498</td>
<td>100.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
<td>100.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.28391</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
<td>100.0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Year'</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Quarter'</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
<th>Unnamed: 6</th>
<th>Unnamed: 7</th>
<th>Unnamed: 8</th>
<th>Unnamed: 9</th>
<th>Unnamed: 10</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Year</td>
<td>Quarter</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time within 20mins of town of 10,000 or ...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time outside 20mins of town of 10,000 or...</td>
<td>Drive time within 1hr of Belfast Index</td>
<td>Drive time within 1hr of Belfast Price\n(Ref:...</td>
<td>Drive time outside 1hr of Belfast Index</td>
<td>Drive time outside 1hr of Belfast Price\n(Ref...</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2015</td>
<td>Q1</td>
<td>100</td>
<td>124898.676844</td>
<td>100</td>
<td>122528.427865</td>
<td>100</td>
<td>128955.274996</td>
<td>100</td>
<td>111866.40498</td>
<td>100.0</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
<td>100.0</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.28391</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
<td>100.0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Drive time within 20mins of town of 10,000 or more Index</th>
<th>Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 20mins of town of 10,000 or more Index</th>
<th>Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time within 1hr of Belfast Index</th>
<th>Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 1hr of Belfast Index</th>
<th>Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>100.000000</td>
<td>124898.676844</td>
<td>100.000000</td>
<td>122528.427865</td>
<td>100.000000</td>
<td>128955.274996</td>
<td>100.000000</td>
<td>111866.404980</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.283910</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>107.430656</td>
<td>134179.467306</td>
<td>106.924715</td>
<td>131013.172436</td>
<td>106.240145</td>
<td>137002.270924</td>
<td>110.075053</td>
<td>123137.004353</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>108.909364</td>
<td>136026.354775</td>
<td>108.368772</td>
<td>132782.552750</td>
<td>107.604887</td>
<td>138762.178070</td>
<td>111.828874</td>
<td>125098.941485</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cleanup_table_7</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
* Insert Year/Quarter future-headers
* Clean normally
# TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
"""</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Year'</span>
<span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Quarter'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Drive time within 20mins of town of 10,000 or more Index</th>
<th>Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 20mins of town of 10,000 or more Index</th>
<th>Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)</th>
<th>Drive time within 1hr of Belfast Index</th>
<th>Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)</th>
<th>Drive time outside 1hr of Belfast Index</th>
<th>Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>100.000000</td>
<td>124898.676844</td>
<td>100.000000</td>
<td>122528.427865</td>
<td>100.000000</td>
<td>128955.274996</td>
<td>100.000000</td>
<td>111866.404980</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>103.166882</td>
<td>128854.070701</td>
<td>103.003978</td>
<td>126209.155363</td>
<td>103.025069</td>
<td>132856.260679</td>
<td>103.349406</td>
<td>115613.265107</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>105.851629</td>
<td>132207.283910</td>
<td>105.619893</td>
<td>129414.394046</td>
<td>105.031061</td>
<td>135443.093443</td>
<td>107.811831</td>
<td>120605.219276</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>107.430656</td>
<td>134179.467306</td>
<td>106.924715</td>
<td>131013.172436</td>
<td>106.240145</td>
<td>137002.270924</td>
<td>110.075053</td>
<td>123137.004353</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>108.909364</td>
<td>136026.354775</td>
<td>108.368772</td>
<td>132782.552750</td>
<td>107.604887</td>
<td>138762.178070</td>
<td>111.828874</td>
<td>125098.941485</td>
</tr>
<tr>
<th>5</th>
<td>2016Q2</td>
<td>2016</td>
<td>Q2</td>
<td>111.263396</td>
<td>138966.509219</td>
<td>109.739250</td>
<td>134461.778232</td>
<td>110.208116</td>
<td>142119.179594</td>
<td>111.991819</td>
<td>125281.221326</td>
</tr>
<tr>
<th>6</th>
<td>2016Q3</td>
<td>2016</td>
<td>Q3</td>
<td>113.419541</td>
<td>141659.506269</td>
<td>112.426034</td>
<td>137753.851946</td>
<td>112.202571</td>
<td>144691.133778</td>
<td>115.398242</td>
<td>129091.864904</td>
</tr>
<tr>
<th>7</th>
<td>2016Q4</td>
<td>2016</td>
<td>Q4</td>
<td>113.928074</td>
<td>142294.657346</td>
<td>113.219995</td>
<td>138726.680412</td>
<td>112.508660</td>
<td>145085.851672</td>
<td>116.952798</td>
<td>130830.890712</td>
</tr>
<tr>
<th>8</th>
<td>2017Q1</td>
<td>2017</td>
<td>Q1</td>
<td>114.262386</td>
<td>142712.207695</td>
<td>113.549623</td>
<td>139130.567598</td>
<td>112.823330</td>
<td>145491.635911</td>
<td>117.341538</td>
<td>131265.759778</td>
</tr>
<tr>
<th>9</th>
<td>2017Q2</td>
<td>2017</td>
<td>Q2</td>
<td>115.566592</td>
<td>144341.144812</td>
<td>115.829688</td>
<td>141924.295411</td>
<td>114.288862</td>
<td>147381.515712</td>
<td>119.397027</td>
<td>133565.161466</td>
</tr>
<tr>
<th>10</th>
<td>2017Q3</td>
<td>2017</td>
<td>Q3</td>
<td>116.716428</td>
<td>145777.273752</td>
<td>117.061832</td>
<td>143434.022704</td>
<td>115.192251</td>
<td>148546.484147</td>
<td>121.273582</td>
<td>135664.395891</td>
</tr>
<tr>
<th>11</th>
<td>2017Q4</td>
<td>2017</td>
<td>Q4</td>
<td>117.925340</td>
<td>147287.189812</td>
<td>118.541541</td>
<td>145247.086931</td>
<td>116.101165</td>
<td>149718.576638</td>
<td>123.577530</td>
<td>138241.740123</td>
</tr>
<tr>
<th>12</th>
<td>2018Q1</td>
<td>2018</td>
<td>Q1</td>
<td>118.482802</td>
<td>147983.452250</td>
<td>120.184585</td>
<td>147260.282195</td>
<td>117.972698</td>
<td>152132.017714</td>
<td>122.075418</td>
<td>136561.381608</td>
</tr>
<tr>
<th>13</th>
<td>2018Q2</td>
<td>2018</td>
<td>Q2</td>
<td>119.443631</td>
<td>149183.514842</td>
<td>120.710551</td>
<td>147904.740501</td>
<td>117.686726</td>
<td>151763.241043</td>
<td>125.603705</td>
<td>140508.349303</td>
</tr>
<tr>
<th>14</th>
<td>2018Q3</td>
<td>2018</td>
<td>Q3</td>
<td>121.408923</td>
<td>151638.138779</td>
<td>122.222994</td>
<td>149757.912837</td>
<td>119.614730</td>
<td>154249.503782</td>
<td>127.210974</td>
<td>142306.342946</td>
</tr>
<tr>
<th>15</th>
<td>2018Q4</td>
<td>2018</td>
<td>Q4</td>
<td>123.531419</td>
<td>154289.108214</td>
<td>125.254013</td>
<td>153471.772944</td>
<td>121.462721</td>
<td>156632.586327</td>
<td>130.953652</td>
<td>146493.143086</td>
</tr>
<tr>
<th>16</th>
<td>2019Q1</td>
<td>2019</td>
<td>Q1</td>
<td>122.499375</td>
<td>153000.098716</td>
<td>123.207618</td>
<td>150964.357154</td>
<td>120.087144</td>
<td>154858.706540</td>
<td>129.712835</td>
<td>145105.085124</td>
</tr>
<tr>
<th>17</th>
<td>2019Q2</td>
<td>2019</td>
<td>Q2</td>
<td>124.397722</td>
<td>155371.109292</td>
<td>125.151589</td>
<td>153346.274193</td>
<td>122.486275</td>
<td>157952.513361</td>
<td>130.480634</td>
<td>145963.994647</td>
</tr>
<tr>
<th>18</th>
<td>2019Q3</td>
<td>2019</td>
<td>Q3</td>
<td>126.533407</td>
<td>158038.551430</td>
<td>128.647747</td>
<td>157630.061642</td>
<td>124.978137</td>
<td>161165.900455</td>
<td>133.258945</td>
<td>149071.990904</td>
</tr>
<tr>
<th>19</th>
<td>2019Q4</td>
<td>2019</td>
<td>Q4</td>
<td>127.126748</td>
<td>158779.626458</td>
<td>127.784267</td>
<td>156572.053236</td>
<td>124.547746</td>
<td>160610.887802</td>
<td>134.709059</td>
<td>150694.181735</td>
</tr>
<tr>
<th>20</th>
<td>2020Q1</td>
<td>2020</td>
<td>Q1</td>
<td>127.090324</td>
<td>158734.133127</td>
<td>128.619521</td>
<td>157595.476721</td>
<td>124.744271</td>
<td>160864.317472</td>
<td>135.068006</td>
<td>151095.722593</td>
</tr>
<tr>
<th>21</th>
<td>2020Q2</td>
<td>2020</td>
<td>Q2</td>
<td>127.200617</td>
<td>158871.887068</td>
<td>127.231209</td>
<td>155894.400318</td>
<td>125.021931</td>
<td>161222.374474</td>
<td>132.935126</td>
<td>148709.746207</td>
</tr>
<tr>
<th>22</th>
<td>2020Q3</td>
<td>2020</td>
<td>Q3</td>
<td>129.627870</td>
<td>161903.493901</td>
<td>131.083667</td>
<td>160614.756005</td>
<td>127.873046</td>
<td>164899.037745</td>
<td>135.952621</td>
<td>152085.309777</td>
</tr>
<tr>
<th>23</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>132.853240</td>
<td>165931.938294</td>
<td>135.873737</td>
<td>166483.954356</td>
<td>130.725554</td>
<td>168577.497418</td>
<td>142.032974</td>
<td>158887.182045</td>
</tr>
<tr>
<th>24</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>134.414458</td>
<td>167881.879606</td>
<td>136.978231</td>
<td>167837.272586</td>
<td>132.479831</td>
<td>170839.730568</td>
<td>142.532053</td>
<td>159445.483290</td>
</tr>
<tr>
<th>25</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>139.417605</td>
<td>174130.744141</td>
<td>142.727536</td>
<td>174881.805508</td>
<td>137.737680</td>
<td>177620.004609</td>
<td>147.652942</td>
<td>165174.037638</td>
</tr>
<tr>
<th>26</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>143.303934</td>
<td>178984.717757</td>
<td>147.615067</td>
<td>180870.420630</td>
<td>141.482937</td>
<td>182449.710341</td>
<td>153.161532</td>
<td>171336.300173</td>
</tr>
<tr>
<th>27</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>144.731984</td>
<td>180768.332630</td>
<td>149.013971</td>
<td>182584.475980</td>
<td>142.365385</td>
<td>183587.673223</td>
<td>156.204293</td>
<td>174740.127539</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 7'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(15, 17)
</code></pre></div></div>
<h3 id="table-8-number-of-verified-residential-property-sales-of-properties-in-urban-and-rural-areas-and-properties-in-rural-areas-by-drive-times-witihn-towns-of-10000-or-more-and-within-1-hour-of-belfast">Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times within towns of 10,000 or more and within 1 hour of Belfast</h3>
<p>We’re now getting into the swing of this!</p>
<p>This one has two problems we’ve already seen: munged Quarters/Years (this time with no header on that column…) and annual Total rows.</p>
<blockquote>
<p>Vee must deeel with it</p>
</blockquote>
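<p>The two fixes that <code class="language-plaintext highlighter-rouge">cleanup_table_5a</code> applies can be sketched on a toy frame (column names here are illustrative, not the real Table 8 headers):</p>

```python
import pandas as pd

# Toy stand-in: a munged period column with an annual 'Total' row mixed in
df = pd.DataFrame({
    "period": ["Q1 2015", "Q2 2015", "2015 Total", "Q1 2016"],
    "sales": [10, 20, 30, 40],
})

# Fix 1: drop the annual 'Total' rows
df = df[~df["period"].str.contains("Total").fillna(False)]

# Fix 2: split the munged strings into separate Quarter/Year columns
dates = df["period"].str.extract(r"(Q[1-4]) ([0-9]{4})")
df["Quarter"], df["Year"] = dates[0], dates[1]
```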
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">]).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Urban</th>
<th>Rural</th>
<th>Drive time within 20mins of town of 10,000 or more</th>
<th>Drive time outside 20mins of town of 10,000 or more</th>
<th>Drive time within 1hr of Belfast</th>
<th>Drive time outside 1hr of Belfast</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2015Q1</td>
<td>2015</td>
<td>Q1</td>
<td>3294</td>
<td>1322</td>
<td>898</td>
<td>424</td>
<td>976</td>
<td>346</td>
</tr>
<tr>
<th>1</th>
<td>2015Q2</td>
<td>2015</td>
<td>Q2</td>
<td>3789</td>
<td>1500</td>
<td>1034</td>
<td>466</td>
<td>1142</td>
<td>358</td>
</tr>
<tr>
<th>2</th>
<td>2015Q3</td>
<td>2015</td>
<td>Q3</td>
<td>4199</td>
<td>1640</td>
<td>1145</td>
<td>495</td>
<td>1250</td>
<td>390</td>
</tr>
<tr>
<th>3</th>
<td>2015Q4</td>
<td>2015</td>
<td>Q4</td>
<td>4396</td>
<td>1780</td>
<td>1223</td>
<td>557</td>
<td>1342</td>
<td>438</td>
</tr>
<tr>
<th>4</th>
<td>2016Q1</td>
<td>2016</td>
<td>Q1</td>
<td>4424</td>
<td>1731</td>
<td>1171</td>
<td>560</td>
<td>1263</td>
<td>468</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_8</span> <span class="o">=</span> <span class="n">cleanup_table_5a</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_8</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 8'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(16, 16)
</code></pre></div></div>
<h3 id="table-9-ni-average-sales-prices-q1-2005---q4-2021">Table 9: NI Average Sales Prices Q1 2005 - Q4 2021</h3>
<p>Wee buns</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Simple Mean</th>
<th>Simple Median</th>
<th>Standardised Price (HPI)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>115912.942222</td>
<td>100000</td>
<td>111920.268199</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>120481.290591</td>
<td>105000</td>
<td>116004.031639</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>128866.225917</td>
<td>115000</td>
<td>123386.352673</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>129649.092074</td>
<td>117000</td>
<td>127674.143865</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>132972.115070</td>
<td>120000</td>
<td>131302.064422</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>171803.199843</td>
<td>150000</td>
<td>147474.561707</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>176218.214924</td>
<td>150000</td>
<td>149084.306040</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>184144.458946</td>
<td>154950</td>
<td>154323.134643</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>173490.230508</td>
<td>155000</td>
<td>159028.118093</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>159965.154863</td>
<td>141000</td>
<td>159150.737832</td>
</tr>
</tbody>
</table>
<p>68 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">]</span> <span class="o">=</span> <span class="n">basic_cleanup</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9'</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(17, 15)
</code></pre></div></div>
<h3 id="table-9x-ni-average-sale-prices-xxxxx-property-q1-2005---q4-2021">Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021</h3>
<p>These are very similar to Tables 2x; i.e. they’re broken down by property type.</p>
<p>Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.</p>
<p>If that reminds you of anything, it’s because Table 7 was the same…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_7</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 9a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>1</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Simple Mean</th>
<th>Simple Median</th>
<th>Standardised Price (HPI)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>166314.816092</td>
<td>149972.5</td>
<td>160428.832662</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>173370.669076</td>
<td>155000.0</td>
<td>169686.542965</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>185397.896739</td>
<td>165000.0</td>
<td>180696.666810</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>186545.119355</td>
<td>165000.0</td>
<td>185323.883533</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>191328.398119</td>
<td>173000.0</td>
<td>188669.361197</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>243712.512641</td>
<td>220000.0</td>
<td>220592.113069</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>254182.439174</td>
<td>225000.0</td>
<td>224872.989982</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>268755.621299</td>
<td>235000.0</td>
<td>234734.715703</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>245860.399289</td>
<td>225000.0</td>
<td>239101.239764</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>244468.040738</td>
<td>219000.0</td>
<td>241131.373512</td>
</tr>
</tbody>
</table>
<p>68 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_9x</span> <span class="o">=</span> <span class="n">cleanup_table_7</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table9s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 9[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table9s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_9x</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(21, 11)
</code></pre></div></div>
<h3 id="table-10x-number-of-verified-residential-property-sales-by-type-in-xxxxx">Table 10x: Number of Verified Residential Property Sales by Type in XXXXX</h3>
<p>Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness…</p>
<ul>
<li>Annual-Total Rows</li>
<li>Munged Year/Quarter Column</li>
<li>That column having a silly (but contextual) name</li>
<li>a different offset</li>
</ul>
<p>Fortunately, we already have something like that from dealing with Table 5a!</p>
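<p>For context (the original definition of <code>cleanup_table_5a</code> falls outside this excerpt), the core of what that cleanup does to these tables can be sketched roughly as follows; the function and toy data here are illustrative, not the real implementation, and the real version also handles the banner rows and header offset:</p>

```python
# Rough sketch of the cleanup_table_5a-style steps described above
# (illustrative only -- the real function handles more spreadsheet quirks).
import pandas as pd

def sketch_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the annual "YYYY Total" summary rows
    df = df[~df.iloc[:, 0].str.contains("Total", na=False)].copy()
    # Split the munged "Q1 2005"-style column into Quarter and Year...
    df[["Quarter", "Year"]] = df.iloc[:, 0].str.split(" ", expand=True)
    # ...and rebuild a sortable "2005Q1"-style Period column
    df["Period"] = df["Year"] + df["Quarter"]
    return df.drop(columns=df.columns[0]).reset_index(drop=True)

toy = pd.DataFrame({
    "ANTRIM AND NEWTOWNABBEY": ["Q1 2005", "Q2 2005", "2005 Total"],
    "Total": [236, 735, 971],
})
print(sketch_cleanup(toy))  # two quarterly rows, no annual total
```

<p>The important point is that the annual-total rows are dropped rather than kept, since they can always be recomputed from the quarterly figures.</p>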
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 10a'</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council</th>
<th>Unnamed: 1</th>
<th>Unnamed: 2</th>
<th>Unnamed: 3</th>
<th>Unnamed: 4</th>
<th>Unnamed: 5</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Please note figures for the 2 most recent quar...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>Back to contents</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>ANTRIM AND NEWTOWNABBEY</td>
<td>Apartments</td>
<td>Detached</td>
<td>Semi-Detached</td>
<td>Terrace</td>
<td>Total</td>
</tr>
<tr>
<th>3</th>
<td>Q1 2005</td>
<td>10</td>
<td>61</td>
<td>78</td>
<td>87</td>
<td>236</td>
</tr>
<tr>
<th>4</th>
<td>Q2 2005</td>
<td>46</td>
<td>213</td>
<td>216</td>
<td>260</td>
<td>735</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>83</th>
<td>Q1 2021</td>
<td>52</td>
<td>222</td>
<td>212</td>
<td>166</td>
<td>652</td>
</tr>
<tr>
<th>84</th>
<td>Q2 2021</td>
<td>56</td>
<td>217</td>
<td>275</td>
<td>176</td>
<td>724</td>
</tr>
<tr>
<th>85</th>
<td>Q3 2021</td>
<td>47</td>
<td>222</td>
<td>268</td>
<td>202</td>
<td>739</td>
</tr>
<tr>
<th>86</th>
<td>Q4 2021</td>
<td>50</td>
<td>117</td>
<td>176</td>
<td>189</td>
<td>532</td>
</tr>
<tr>
<th>87</th>
<td>2021 Total</td>
<td>205</td>
<td>778</td>
<td>931</td>
<td>733</td>
<td>2647</td>
</tr>
</tbody>
</table>
<p>88 rows × 6 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_5a</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Table 10a'</span><span class="p">])</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>2</th>
<th>Period</th>
<th>Year</th>
<th>Quarter</th>
<th>Apartments</th>
<th>Detached</th>
<th>Semi-Detached</th>
<th>Terrace</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2005Q1</td>
<td>2005</td>
<td>Q1</td>
<td>10</td>
<td>61</td>
<td>78</td>
<td>87</td>
<td>236</td>
</tr>
<tr>
<th>1</th>
<td>2005Q2</td>
<td>2005</td>
<td>Q2</td>
<td>46</td>
<td>213</td>
<td>216</td>
<td>260</td>
<td>735</td>
</tr>
<tr>
<th>2</th>
<td>2005Q3</td>
<td>2005</td>
<td>Q3</td>
<td>46</td>
<td>214</td>
<td>238</td>
<td>259</td>
<td>757</td>
</tr>
<tr>
<th>3</th>
<td>2005Q4</td>
<td>2005</td>
<td>Q4</td>
<td>65</td>
<td>227</td>
<td>270</td>
<td>331</td>
<td>893</td>
</tr>
<tr>
<th>4</th>
<td>2006Q1</td>
<td>2006</td>
<td>Q1</td>
<td>48</td>
<td>186</td>
<td>231</td>
<td>296</td>
<td>761</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>63</th>
<td>2020Q4</td>
<td>2020</td>
<td>Q4</td>
<td>53</td>
<td>248</td>
<td>268</td>
<td>187</td>
<td>756</td>
</tr>
<tr>
<th>64</th>
<td>2021Q1</td>
<td>2021</td>
<td>Q1</td>
<td>52</td>
<td>222</td>
<td>212</td>
<td>166</td>
<td>652</td>
</tr>
<tr>
<th>65</th>
<td>2021Q2</td>
<td>2021</td>
<td>Q2</td>
<td>56</td>
<td>217</td>
<td>275</td>
<td>176</td>
<td>724</td>
</tr>
<tr>
<th>66</th>
<td>2021Q3</td>
<td>2021</td>
<td>Q3</td>
<td>47</td>
<td>222</td>
<td>268</td>
<td>202</td>
<td>739</td>
</tr>
<tr>
<th>67</th>
<td>2021Q4</td>
<td>2021</td>
<td>Q4</td>
<td>50</td>
<td>117</td>
<td>176</td>
<td>189</td>
<td>532</td>
</tr>
</tbody>
</table>
<p>68 rows × 8 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleanup_table_10x</span> <span class="o">=</span> <span class="n">cleanup_table_5a</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table10s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'Table 10[a-z]'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">:</span>
<span class="k">if</span> <span class="n">table10s</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">dest_df</span><span class="p">[</span><span class="n">table</span><span class="p">]</span> <span class="o">=</span> <span class="n">cleanup_table_10x</span><span class="p">(</span><span class="n">source_df</span><span class="p">[</span><span class="n">table</span><span class="p">])</span>
<span class="nb">len</span><span class="p">(</span><span class="n">dest_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">([</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">source_df</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="n">k</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dest_df</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(32, 0)
</code></pre></div></div>
<h2 id="and-were-done">And We’re Done!</h2>
<p>So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?</p>
<p>I hope this was helpful to someone else.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dest_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="n">source_df</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">][</span><span class="s">'Worksheet Name'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'Table'</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pd</span><span class="p">.</span><span class="n">ExcelWriter</span><span class="p">(</span><span class="s">'NI Housing Price Index.xlsx'</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="c1"># Thankfully these are semantically sortable otherwise this would be a _massive_ pain
</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">df</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">dest_df</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
<span class="n">df</span><span class="p">.</span><span class="n">to_excel</span><span class="p">(</span><span class="n">writer</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
</code></pre></div></div>
<ul>
<li><a href="/assets/2022-03-27-NI-House-Price-Index.ipynb">Notebook Here</a></li>
<li><a href="/assets/2022-03-27-NI-House-Price-Index.xlsx">Resulting Excel File Here</a></li>
</ul>
Lies, Damned Lies, and Data Science2022-01-04T18:49:00+00:00https://andrewbolster.info/2022/01/lies-damned-lies-and-data-science<blockquote>
<p>This talk was originally prepared for my 2021 Guest Lecture at UU Magee for the MSc Data Science course. And if it looks familiar, yes, the first bit is almost entirely lifted from <a href="2020/10/a-stranger-in-a-strange-land-data-science-onboarding-in-practice.html">A Stranger in a Strange Land</a> from last year.</p>
</blockquote>
<h2 id="intro">Intro</h2>
<p>Data Science is the current hotness.</p>
<p>While those of us in these virtual rooms may make fun of the likes of Dominic Cummings for extolling a ‘Data Driven Approach’ to policy, the reality is that Data Science as a buzzword bingo term has survived and indeed thrived in a climate where ‘Artificial Intelligence’ is increasingly derided as being something that’s written more in PowerPoint than Python, ‘Machine Learning’ still gives people images of liquid metal exoskeletons crushing powdery puny human skulls, and those in management with long memories remember what kind of <a href="https://economistsview.typepad.com/economistsview/2008/12/quantitative-me.html">mess</a> “Quantitative Analysis” got us into in 2008…</p>
<p><a href="https://twitter.com/matvelloso/status/1065778379612282885?lang=en"><img src="/img/2020/hqzDHa2mKqFaFwXYVScaExr0nA9wTxLEVaSr7fLl98SnsxoYa3OgtQPhx90rzyGIo90D2zboVcnwHCXk2hFsEQnJlv20EdhFgqd9uYW57DcngrRiO2NB7bFjj0lmZ3wiyo8VyP2Ocg" alt="img" /></a></p>
<p>Way back in 2012, the Harvard Business Review described Data Science as <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">“The Sexiest Job of the 21st Century”</a>, and since then it has been appearing in job specs and LinkedIn posts and research funding applications and business startup prospecta more than ever.</p>
<p>You’re not really doing tech unless you’ve got a few pet Data Scientists under your wing.</p>
<p><a href="https://nerdarchy.com/dd-ideas-artificer/"><img src="/img/2020/HdDYoDMx9A0qIoRJJF57EW5WHFNUfkZ8C5bcwb3e-ck4nntw_Ch_KAH8CzqLPU6FsUZLepfB-VzDcRsBM2DTeStdf5XS1rJVV1Nw1q9TDMiMIK30jRmUt-FuLvhbKj_q-dFu2CmJkhg" alt="img" /></a></p>
<p>Like some kind of mythical creature, these Data Scientists sit somewhere between Wizards, Artificers, and Necromancers, breathing business intelligence into glass and copper to give the appearance of wisdom from a veritable onslaught of data, wielding swords of statistical <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-tests</a>, shields made of the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Areas Under Curves</a>, and casting magicks of <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a>.</p>
<p><img src="/img/2020/iHRlCltQBUSLHxknWUF4-ctyA7E65b-tHfWU3Rc4Bt9KFtQCP3dJYGke5p9kGQh0sw3wzHvPwGXKwgwRdEFiZhcWnbGlzCntf54JWITuNCHfrG0wnHfiK6ktz1y_wxdTeQsdKQh-" alt="img" /></p>
<p>Like if Tony Stark and Stephen Strange fell into a blender and the Iron Mage appeared, extracting wisdom from the seen and unseen worlds around them and projecting it into the future…</p>
<p>But more often than not, it’s much more mundane…</p>
<p><a href="https://xkcd.com/1838/"><img src="/img/2020/machine_learning.png" alt="XKCD: Machine Learning" /></a></p>
<p>And it’s often in this mundanity of applying “standard” tools, techniques and analysis of data and stirring the pot until something interesting pops out that we are most likely to make mistakes, and that’s going to be the subject of this talk;</p>
<p><img src="/img/2022/shells.png" alt="Lies We Tell Our Shells" /></p>
<p><img src="/img/2022/10things.png" alt="Ten Things I Hate About Data Science" /></p>
<p><img src="/img/2022/fault.png" alt="The Fault In Our Stats" /></p>
<p><img src="/img/2022/correlations.png" alt="The Usual Correlations" /></p>
<p>This isn’t going to be a technical Data Science talk; we’re not opening up <a href="/2018/10/daily-dated-untitled-jupyter-notebooks.html">Jupyter</a> or firing up <a href="https://en.wikipedia.org/wiki/Apache_Spark">Spark</a> or <a href="https://www.tensorflow.org/tutorials">Tensorflow</a> or whatever. We’re not even going to talk about <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptrons</a> or <a href="https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute">Hidden neurons</a> or <a href="https://hackernoon.com/exploring-decentralization-with-homomorphic-encryption-a8295755ba5f">homomorphic cryptography</a>. This is about people, processes, and how to establish a healthy data science culture.</p>
<p>This is about numbers, aggregations, visualisation, and how you, as a Data Scientist, have a responsibility to look for logical pitfalls, and over time, curate that experience to constructively critique not just your own analytical work, but also that of the people, teams, and sometimes, executives around you.</p>
<p>Anyway, who am I to talk about this stuff?</p>
<h2 id="who-am-i--aka-you-can-skip-this-bit">Who am I ? (AKA you can skip this bit)</h2>
<p>My professional background started off by getting robotic dogs to piss on headmasters in front of 200 primary school kids and taking things apart and always having a few screws left over (or loose) at the end.</p>
<p>I eventually turned that “skillset” into something of a trade, by studying <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">electronics and software engineering at Queens</a>.</p>
<p>As part of this I got to test the launch of 4G networks in China from the grey comfort of an office in <a href="https://www.ericsson.com/en/about-us/company-facts/ericsson-worldwide/ireland">Athlone</a>, I moonlit as a technology consultant for a <a href="http://navigatorblue.com/">marketing and advertising firm</a> in Belfast, used massive clusters of GPUs to optimise cable internet delivery, and spent a summer developing BIOSs for <a href="https://www.pcengines.ch/">embedded computers</a> in Switzerland.</p>
<p>After that, and just in time for the financial crisis to make everyone question their career choices, I continued down the academic culvert to do a PhD, stealing shamelessly from the <a href="/2017/09/human-factors-in-autonomous-systems.html">sociologists</a> to make their “science” vaguely useful by teaching autonomous military submarines how to trust each-other.</p>
<p>More recently, I worked with a bunch of psychologists and marketers to teach machines how to understand human emotions using biometrics and wearable tech as the only Data Scientist.</p>
<p>This being a small start-up, that meant I did anything that involved Data, so from storage and network administration to statistical analysis, real-time cloud architecture to academic writing, and everything in between. This also somehow involved throwing people down mountains and developing lie detecting underwear. Ahh the joys of Start Ups.</p>
<p>After that I got to be a grown-up Data Scientist working at a cybersecurity firm specialising in real-time network intrusion systems, playing with terabytes of historical and real-time data trying to read the minds of hackers and script kiddies across the world who are throwing everything they can at some of the internet’s biggest institutions. This was my first taste of being a Data Scientist who wasn’t working completely alone…</p>
<h1 id="what-about-now-aka-start-reading-here">What about now? (AKA ‘Start reading here’)</h1>
<p>After two years of that, I got pinched to build a new team within an established Cyber Security group called WhiteHat Security, which had recently been acquired by NTT Security;</p>
<p>We have over 15 years of human-expert-trained data on whether and how customer websites can be vulnerable to attack. We have teams of hackers working 24/7 to try and break people’s websites before ‘the bad guys’ do to prove that they’re vulnerable, and one way or another, we have those footprints of investigation, and the company wanted to start doing something with that data, so they needed a Data Science group, and they needed a team lead.</p>
<p>In the time that I’ve been here, we’ve gone from almost zero ‘practical’ Data Science, to ML representing 87% of all of the assessment actions going across our platform; our group has also been contributing to next generation security data architectures with the Data Science Group as a critical future customer, rather than an opportunistic afterthought, and along the way we’ve come up with a couple of patents or patent-worthy things that I can’t really talk about yet!</p>
<p>I’ve been there two and a bit years and while this isn’t officially a careers talk, all I’ll say is I’m still really enjoying the work, and we have roles open across our Belfast operations, and a placement scheme in the works; if anyone’s interested, some creative googling will get you there. Or just email me later!</p>
<blockquote>
<p>In the time between when this talk was originally delivered and publication, a <a href="http://app.jobvite.com/m?3Uk0emwj">Data Science role</a> was opened up for UK/Remote work</p>
</blockquote>
<h1 id="lies-damned-lies-and-data-science">Lies, Damned Lies and Data Science</h1>
<p>But anyway; what do I mean when I’m talking about all these mistakes and failures that churn around our feet every time we’re wandering through data?</p>
<p>Fundamentally, there are a few significant themes of ‘mistake’, where well-intentioned, qualified, experienced and competent subject matter experts can wield all the right tools in all the right ways and still come to an incorrect, or at least incomplete, conclusion given a certain set of data.</p>
<p>These fall into a couple of general areas:</p>
<ul>
<li>Causation Inversion</li>
<li>Ignoring Contextual Features</li>
<li>Over-reliance on abstract measures of quality</li>
<li>Premature Aggregation</li>
</ul>
<h2 id="causation-inversion">Causation Inversion</h2>
<p>This one has a few different names that I’m sure many of you have heard of, and I hope at least one of these has appeared in your course so far!</p>
<ul>
<li>Correlation does not imply causation</li>
<li>Spurious Relationships</li>
<li>Cum/Post hoc ergo propter hoc (with/after this, therefore because of this)
<ul>
<li>Logicians and philosophers argue there’s a difference; I see them as similar fallacies under slightly different contexts (consequential vs abstract correlation)</li>
</ul>
</li>
<li>“That thing where Nicholas Cage is Drowning People”</li>
</ul>
<p>Put simply, Causation inversion is where you look at two or more variables observed over some dimension (usually time), and through the observation of some behaviour linkage, you can reasonably theorise that one variable is influencing the other.</p>
<p>While the ‘Nicholas Cage’ example gets a lot of attention, and without tripping over my own later topic of ‘Reliance on Abstract Measures of Quality’, it ‘only’ has an r value of 0.66.</p>
<p><a href="https://www.tylervigen.com/spurious-correlations"><img src="/img/2022/cage.png" alt="Number of people who drowned by falling into a pool correlated with films Nicolas Cage appeared in" /></a></p>
<p>Also, visually, we can pretty clearly see that there are some counter-correlations, like around 2002 where Cage upped his output to 3 (namely, <a href="https://en.wikipedia.org/wiki/Windtalkers">Windtalkers</a>, <a href="https://en.wikipedia.org/wiki/Sonny_(2002_film)">Sonny</a>, and <a href="https://en.wikipedia.org/wiki/Adaptation_(film)">Adaptation</a>). That year, drownings in fact reduced, contra-indicating the hypothesis that they are directly correlated.</p>
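<p>It’s worth remembering that the r values quoted here are just Pearson correlations, and two completely independent processes that merely both trend in the same direction will produce a ‘strong’ one. A toy sketch (all the numbers here are made up for illustration):</p>

```python
# Two independent noisy series that both happen to trend upward over time;
# neither drives the other, yet their Pearson r is "convincingly" high.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1999, 2010)

drownings = 100 + 2.0 * (years - 1999) + rng.normal(0, 1.0, len(years))
cage_films = 1 + 0.3 * (years - 1999) + rng.normal(0, 0.2, len(years))

# The high r is entirely down to the shared time trend
r = np.corrcoef(drownings, cage_films)[0, 1]
print(f"r = {r:.2f}")
```

<p>Chart both against time and they will appear to ‘track’ each other closely, which is exactly the trap.</p>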
<p>A much more quote-unquote “convincing” correlation is this one. It’s got an r of 0.95 which is pretty good I guess, and it certainly looks like they’re going the same direction.</p>
<blockquote>
<p>In the real presentation, which was a bit of Audience Participation where different parts of the below graph were exposed with guesses from the audience of what could be under the cards; that really doesn’t work in text…</p>
</blockquote>
<p><a href="https://www.tylervigen.com/spurious-correlations"><img src="/img/2022/cheese.png" alt="Per capita cheese consumption correlated with number of people who died by becoming tangled in their bedsheets" /></a></p>
<p>Remember, the point here isn’t “lol, graphs go burr” or even “r-values suck”, but we’ll come back to that.</p>
<p>The critical issue here isn’t anything to do with the numbers; it’s about you, as a quote-unquote “subject matter expert”, looking at the numbers, performing some reasonable analysis, and declaring “Cheese is killing people”.</p>
<p>We can make fun of this to a certain degree with the cherry picked examples I’ve put us through here, but causation inversion is lurking at the bottom of every time a manager, executive, client, or colleague asks you a question; always be aware that just because two factors are correlated, there’s no requirement in the universe that says that that means one leads to the other.</p>
<p>In fact, more often than not, these kinds of spurious or coincidental correlations indicate some other factor lurking under the surface that you’re not taking account of in your modelling.</p>
<p>Which leads us nicely into….</p>
<h2 id="ignoring-contextual-features">Ignoring Contextual Features</h2>
<p>This is another one with many related names: Simpson’s Paradox, Lord’s Paradox, the Suppressor Effect.</p>
<h3 id="simpsons-paradox">Simpson’s Paradox</h3>
<p><a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s Paradox</a> is a fun, weird, and occasionally disturbing consequence of the old “Lies, damned lies, and statistics” adage.</p>
<p>Fundamentally, Simpson’s Paradox is the effect where <em>identical</em> data can be used to support <em>directly contradictory</em> conclusions, if the context or causality of the underlying data is not taken into account.</p>
<p>This is commonly summarized in the two charts below; from the first graph, it’s obvious that there is a tightly correlated positive linear relationship between the X and Y values, you’d have to be blind to say anything else. However, due to some underlying structure or sub-division of values, the relationship can be totally flipped.</p>
<p><img src="/img/2022/simpson.png" alt="Simpsons Paradox" /></p>
<p>Using exactly the same values, depending on how or if we slice, group and aggregate our data, we can come to two totally opposite conclusions, with total statistical justification.</p>
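<p>A minimal numeric construction of the paradox, using nothing but pandas and some deliberately contrived values:</p>

```python
# Within each group, y falls as x rises (r = -1.0 in both),
# yet pooling the two groups flips the sign of the relationship.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 6, 7, 8],
    "y": [6, 5, 4, 11, 10, 9],
})

per_group_r = {g: sub["x"].corr(sub["y"]) for g, sub in df.groupby("group")}
pooled_r = df["x"].corr(df["y"])

print(per_group_r)         # {'A': -1.0, 'B': -1.0}
print(round(pooled_r, 2))  # ~0.81: positive, because group B sits higher and further right
```

<p>Whether you group before correlating, or pool everything, is a modelling decision about the underlying structure of the data, not a mechanical one.</p>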
<h2 id="real-world-research-case---uc-berkeley">Real World Research Case - UC Berkeley</h2>
<p>But what does this really mean in real life? One of the most famous examples of this paradox is the 1973 investigation into <a href="https://www.refsmmat.com/posts/2016-05-08-simpsons-paradox-berkeley.html">allegations of bias in the admissions criteria of UC Berkeley</a>.</p>
<p>From the data presented below, it is once again Obvious, and Self Evident that a higher percentage of men were admitted than women. It can be graphed easily, and the aggregations seem ‘fair’, ‘obvious’ and ‘natural’; all we do is take a count of all the men who applied, all the women who applied, and in each group, calculate the percentage that were successful.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Applicants</th>
<th style="text-align: center">Applied</th>
<th style="text-align: center">Admitted</th>
<th style="text-align: center">% Success</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Men</td>
<td style="text-align: center">19186</td>
<td style="text-align: center">8442</td>
<td style="text-align: center">44%</td>
</tr>
<tr>
<td style="text-align: center">Women</td>
<td style="text-align: center">12346</td>
<td style="text-align: center">4321</td>
<td style="text-align: center">35%</td>
</tr>
<tr>
<td style="text-align: center">Total</td>
<td style="text-align: center">31129</td>
<td style="text-align: center">12763</td>
<td style="text-align: center">41%</td>
</tr>
</tbody>
</table>
<p><img src="/img/2022/ucb1-bar.png" alt="img" /></p>
<p>Job done, fire the admissions board, issue a public apology and go home.</p>
<p>But not so fast! When researchers dug into the totals and looked at each department (presumably to find someone to blame at a lower faculty tier…), they found something surprising; in the most popular and highest intake departments, if anything there was a bias towards female applicants.</p>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center">Total</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Men</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Women</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Department</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
<td style="text-align: center">Applied</td>
<td style="text-align: center">Admitted</td>
<td style="text-align: center">% Success</td>
</tr>
<tr>
<td style="text-align: center">A</td>
<td style="text-align: center">933</td>
<td style="text-align: center">597</td>
<td style="text-align: center">64%</td>
<td style="text-align: center">825</td>
<td style="text-align: center">512</td>
<td style="text-align: center">62%</td>
<td style="text-align: center">108</td>
<td style="text-align: center">89</td>
<td style="text-align: center">82%</td>
</tr>
<tr>
<td style="text-align: center">B</td>
<td style="text-align: center">585</td>
<td style="text-align: center">369</td>
<td style="text-align: center">63%</td>
<td style="text-align: center">560</td>
<td style="text-align: center">353</td>
<td style="text-align: center">63%</td>
<td style="text-align: center">25</td>
<td style="text-align: center">17</td>
<td style="text-align: center">68%</td>
</tr>
<tr>
<td style="text-align: center">C</td>
<td style="text-align: center">918</td>
<td style="text-align: center">321</td>
<td style="text-align: center">35%</td>
<td style="text-align: center">325</td>
<td style="text-align: center">120</td>
<td style="text-align: center">37%</td>
<td style="text-align: center">593</td>
<td style="text-align: center">202</td>
<td style="text-align: center">34%</td>
</tr>
<tr>
<td style="text-align: center">D</td>
<td style="text-align: center">792</td>
<td style="text-align: center">269</td>
<td style="text-align: center">34%</td>
<td style="text-align: center">417</td>
<td style="text-align: center">138</td>
<td style="text-align: center">33%</td>
<td style="text-align: center">375</td>
<td style="text-align: center">131</td>
<td style="text-align: center">35%</td>
</tr>
<tr>
<td style="text-align: center">E</td>
<td style="text-align: center">584</td>
<td style="text-align: center">146</td>
<td style="text-align: center">25%</td>
<td style="text-align: center">191</td>
<td style="text-align: center">53</td>
<td style="text-align: center">28%</td>
<td style="text-align: center">393</td>
<td style="text-align: center">94</td>
<td style="text-align: center">24%</td>
</tr>
<tr>
<td style="text-align: center">F</td>
<td style="text-align: center">714</td>
<td style="text-align: center">43</td>
<td style="text-align: center">6%</td>
<td style="text-align: center">373</td>
<td style="text-align: center">22</td>
<td style="text-align: center">6%</td>
<td style="text-align: center">341</td>
<td style="text-align: center">24</td>
<td style="text-align: center">7%</td>
</tr>
</tbody>
</table>
<p><a href="https://www.kdnuggets.com/2020/09/simpsons-paradox.html"><img src="/img/2022/ucb2-bar.jpg" alt="Source: KDNuggets" /></a></p>
<p>The researchers <a href="https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf">instead concluded</a> that what was happening was that women were applying to more competitive departments, and that men were going for ‘less risky’ fields. Note in particular that the most popular departments for women (Dept C/E) had both a significant difference between numbers of male and female applicants, and indeed, had some of the lowest overall admission rates.</p>
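<p>We can reproduce the flip directly from the per-department table above; the figures below are transcribed from that table, and the aggregate rate is just the applicant-weighted combination of the per-department rates:</p>

```python
# Reproducing the Berkeley flip from the per-department table above.
# Figures are transcribed from the table; any transcription slips are mine.
men = {           # dept: (applied, admitted)
    "A": (825, 512), "B": (560, 353), "C": (325, 120),
    "D": (417, 138), "E": (191, 53), "F": (373, 22),
}
women = {
    "A": (108, 89), "B": (25, 17), "C": (593, 202),
    "D": (375, 131), "E": (393, 94), "F": (341, 24),
}

def rate(applied, admitted):
    return admitted / applied

# Per department, women do better in most departments...
women_ahead = [d for d in men if rate(*women[d]) > rate(*men[d])]

# ...but aggregated, men come out well ahead, because women applied
# disproportionately to the most competitive departments (C and E).
men_overall = sum(a for _, a in men.values()) / sum(n for n, _ in men.values())
women_overall = sum(a for _, a in women.values()) / sum(n for n, _ in women.values())

print(women_ahead)                         # depts where women's rate is higher
print(f"{men_overall:.0%} vs {women_overall:.0%}")
```

<p>Women win in four of the six departments, yet lose badly in the aggregate; the ‘lurking factor’ is which department each group applied to.</p>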
<p>A word of caution here before we move on however; I’m going to quote Curtis Wilson, one of the statisticians in the Data Science Group at NTT, in response to me talking about Simpson’s Paradox in work;</p>
<blockquote>
<p>It’s always worth mentioning with the UCB data that this doesn’t show there isn’t a bias at play. The follow-up questions should be “why do the departments which appeal to women more than men have lower admission rates? Is this related to historical under-funding?”. General lesson: we’ve identified we have subgroups that behave differently, so now we need to ask why they behave differently.</p>
</blockquote>
<p>So, as we said, the moral of the story is, be careful what causal or narrative explanations you use your data science skills for, and always try to dig down to make sure you can understand and contextualise the origins and intentionality of the data you’re using.</p>
<h3 id="practical-business-case---its-in-your-jeans">Practical Business Case - It’s In Your Jeans</h3>
<p>What if, instead of retroactive research, we’re operating as a start-up and using data to plan our go-to-market approach? Let’s say we want to sell jeans.</p>
<p>Since we’re fairly conventional folks, we don’t want to get into the bespoke sizing game; instead, we’re going to try and produce the simplest jeans that fit the most people.</p>
<p>And we’re all very clever and data driven, so we’re going to experiment by taking a sample of a population and measuring their heights. <a href="https://en.wikipedia.org/wiki/Vitruvian_Man">Thanks to good old Leo da Vinci</a>, we can assume that the optimal jean length is roughly half the height of the individual. And we can arbitrarily define a ‘comfort tolerance’ of plus/minus 3cm.</p>
<p><img src="/img/2022/leo-jean.png" alt="Yes, this is totally made up, but it doesn't matter" /></p>
<p>To make sure we’re comfortable with the numbers, we’ll start with a small scale and ramp up as we need to.</p>
<table>
<thead>
<tr>
<th> </th>
<th>height (cm)</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>155.06</td>
</tr>
<tr>
<td>B</td>
<td>157.83</td>
</tr>
<tr>
<td>C</td>
<td>173.85</td>
</tr>
<tr>
<td>D</td>
<td>158.78</td>
</tr>
<tr>
<td>E</td>
<td>170.03</td>
</tr>
<tr>
<td>F</td>
<td>160.80</td>
</tr>
<tr>
<td>G</td>
<td>175.18</td>
</tr>
<tr>
<td>H</td>
<td>158.58</td>
</tr>
<tr>
<td>I</td>
<td>168.70</td>
</tr>
<tr>
<td>J</td>
<td>154.79</td>
</tr>
</tbody>
</table>
<p>So, with our sample of 10 people, we can do a few easy things first;</p>
<p>We can identify that our tallest individual is 175cm, roughly 5’9”, and shortest is 154cm, so around 5’ nothing.</p>
<p>We can take a straight average by adding up all the values and dividing that by the number of individual values, so in this case we end up with 163cm, or around 5’4”</p>
<p>Now, there’s another measure we could use that we’ll include for simplicity; the Median. This is a dark and magical term which basically means “if you sorted all the values, which one would be in the middle”. Another way to think about it is that if you pick the correct ‘median’ value, 50% of the values will be higher, and 50% will be lower. And that way we end up with 160cm.</p>
<p><img src="/img/2022/jeans-10.png" alt="Beware of Geeks carrying Boxplots" /></p>
<p>This is important to flag; not all ‘averages’ are created equally, but that’s a story for another day.</p>
<p>So, big question at the end is “How big is our potential market share if we go with the average?”.</p>
<p>10%. Only one member of our sample could actually wear our ‘average’ jeans.</p>
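<p>As a sketch, here’s that whole exercise in a few lines; note the ‘plus/minus 3cm’ tolerance is a bit ambiguous, so I’m reading it here as a ±3cm window on height itself, which is the reading that reproduces the 10% figure:</p>

```python
from statistics import mean, median

# Our sample of ten heights (cm), from the table above.
heights = [155.06, 157.83, 173.85, 158.78, 170.03,
           160.80, 175.18, 158.58, 168.70, 154.79]

avg = mean(heights)
mid = median(heights)

# Assumed reading of the comfort tolerance: a customer is covered if their
# height is within 3cm of the height our 'average' jeans were cut for.
covered = [h for h in heights if abs(h - avg) <= 3]
share = len(covered) / len(heights)
print(f"average {avg:.1f}cm, median {mid:.1f}cm, coverage {share:.0%}")
```

<p>Only one of the ten samples lands inside the comfort window around the mean.</p>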
<blockquote>
<p>Spoilers, but exactly the same problem was found when the <a href="https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html">USAF attempted to find the ‘average’</a> pilot to design the Goldilocks of cockpits. Instead, they decided to make everything customisable.</p>
</blockquote>
<p>But maybe we just don’t have enough data; so let’s try again. This time we spend a load of money to measure 3,500 people, and we go through the same exercise, coming up with hopefully more generalised numbers: this time we get an Average of 166cm and a Median of 165cm, which is pretty convenient and could be used to imply that our data set was nicely balanced and we didn’t have any significant ‘lumps’ in our data. And this time, we get a new ‘coverage’ of 20%, great! But can we do more?</p>
<p><img src="/img/2022/jeans-3500.png" alt="We are legion" /></p>
<p>Instead of looking at the data as 1-dimensional measurements, we can use a histogram to count how many people were measured at particular heights. We can also superimpose our ‘comfort range’ to get a visual clue of what we’re actually covering here.</p>
<p><img src="/img/2022/jeans-hist.png" alt="Histograms are Bestograms" /></p>
<p>But now I think we can see the problem; we’ve got the average and median heights, but they’re not the most popular heights in the sample. And looking at the ‘camel humps’ distribution (otherwise known as a <a href="https://en.wikipedia.org/wiki/Multimodal_distribution">bimodal distribution</a>), we might be able to infer that there’s an underlying structure that we’re missing.</p>
<p>And we’d be right! The hump on the left constitutes predominantly women, and the hump on the right constitutes predominantly men.</p>
<p><img src="/img/2022/jeans-hist-gender.png" alt="If the jeans fit" /></p>
<p>If we instead shift our window to target the average woman, we up our coverage to 26%, and if we slightly compromise on our initial ‘one size fits all’ vision and make a men’s version as well, we can up our coverage to 49% of the market; much more healthy for the investor meeting.</p>
<p><img src="/img/2022/jeans-hist-gender-dual.png" alt="Two Jeans Are Better Than One" /></p>
<h3 id="the-alabama-paradox---even-when-youre-right-youre-wrong">The Alabama Paradox - Even When You’re Right; You’re Wrong</h3>
<p>There’s a related phenomenon that anyone who watches the <a href="https://www.youtube.com/user/standupmaths">Stand-Up-Maths channel with Matt Parker on YouTube</a> may recognise; it isn’t so much a data science thing as a strange quirk of mathematics that appeared in Political Science, and while this is a bit of a segue, I won’t go too far into the weeds;</p>
<p>Basically, in the United States, the House of Representatives is supposed to be… representative; that is, the number of seats allocated (or, apportioned) to each state should be proportional to the population of that state. Seems pretty simple; take the population of a state, divide by the population of the country, multiply by the number of seats in the chamber, and get the job done; right?</p>
<p>Oh, ok, except for the decimals… Ok, so just round things then and we’re done, right? And all these steps that we’ve taken are objectively, demonstrably, fair? Right?</p>
<p>Well, not quite; <a href="https://www.youtube.com/watch?v=GVhFBujPlVo">Matt tells it better than I do</a>, but the top line is that there are circumstances where changing the number of seats had unexpected results, specifically in 1880 when it was discovered that increasing the number of seats in the House from 299 to 300 would in fact reduce Alabama’s apportionment from 8 to 7, significantly reducing the ‘representation’ of that state.</p>
<p>From Wikipedia;</p>
<blockquote>
<p>An actual impact was observed in 1900, when Virginia lost a seat to Maine, even though Virginia’s population was growing more rapidly […]</p>
</blockquote>
<p>Also from Wikipedia, here’s a slightly simpler worked example to think it through; 3 ‘states’, 14 ‘people’, 10 seats, and we can all do the rounding, so this all looks legit.</p>
<p>Until some pesky legislator says we need more seats, citing something like ‘fairness’ or ‘I like prime numbers’.</p>
<p>And State C suddenly goes from 20% of the representative body to a 9% representation.</p>
<table>
<thead>
<tr>
<th> </th>
<th> </th>
<th>With 10 seats</th>
<th> </th>
<th>With 11 seats</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td>State</td>
<td>Population</td>
<td>Fair share</td>
<td>Seats</td>
<td>Fair share</td>
<td>Seats</td>
</tr>
<tr>
<td>A</td>
<td>6</td>
<td>4.286</td>
<td>4</td>
<td>4.714</td>
<td>5</td>
</tr>
<tr>
<td>B</td>
<td>6</td>
<td>4.286</td>
<td>4</td>
<td>4.714</td>
<td>5</td>
</tr>
<tr>
<td>C</td>
<td>2</td>
<td>1.429</td>
<td>2</td>
<td>1.571</td>
<td>1</td>
</tr>
</tbody>
</table>
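<p>For the curious, the scheme in play here is the largest-remainder (‘Hamilton’) method, which was the method in use when the Alabama paradox was discovered; a short sketch of it reproduces the table above:</p>

```python
from math import floor

def hamilton(populations, seats):
    """Largest-remainder (Hamilton) apportionment."""
    total = sum(populations)
    quotas = [p * seats / total for p in populations]
    alloc = [floor(q) for q in quotas]
    # Hand the leftover seats to the largest fractional remainders.
    leftovers = seats - sum(alloc)
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftovers]:
        alloc[i] += 1
    return alloc

pops = [6, 6, 2]              # states A, B, C from the table
print(hamilton(pops, 10))     # C gets 2 of 10 seats...
print(hamilton(pops, 11))     # ...but only 1 of 11
```

<p>Adding a seat shrinks C’s remainder relative to A’s and B’s, so the ‘fair’ rounding flips against it.</p>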
<p>What does this have to do with Data Science? Bear with me, because I’ve seen this happen in the wild, and it’s a strange one; multi-label classification tasks.</p>
<p>I was working on a system to detect and classify emotional states in humans from biological markers like heart rate, breathing rate, galvanic skin response, vocal timbre, and acceleration over time. Sounds like fun, and it was; we were doing continuous mapping, i.e. we had a vectorised emotional space such that we could project any ‘emotion’ into a series of coordinates in a projected space, and then ‘map’ those values back out to something also continuous, like colour space, or even an ‘emotional noise generator’ that a colleague had trained.</p>
<p>All was well until someone said “Yeah, this is cool and all but I want it in words”; so we started off with the classical <a href="https://www.paulekman.com/universal-emotions/">“Ekman Seven”</a> of Anger, Contempt, Disgust, Enjoyment, Fear, Sadness and Surprise, and got to training.</p>
<p>There was a wealth of training and academic data around these so this was quite positive and smooth. Until someone wanted us to add an eighth; ‘Contentment’.</p>
<p>Long story short, by adding an additional label option to our classifier, we in fact reduced the overall trained accuracy of our classifier, and when we eventually dug around, we found that it was exactly this kind of ‘rounding’ issue that was confounding our training. Since then, I keep an eye out for any time that labels are being changed….</p>
<h3 id="over-reliance-on-abstract-measures-of-quality">Over Reliance on Abstract Measures of Quality</h3>
<p>Speaking of measuring performance, one of the hairiest parts of Data Science is quantifying the ‘quality’ or ‘accuracy’ of data. One such metric of quality we mentioned earlier was the r-value. It’s technically the ‘r-squared’ value, but that’s a pick we don’t need to nit today.</p>
<p><img src="/img/2022/r-score.png" alt="How far away is one series from another" /></p>
<p>R-values range from 0 to 1 and are usually interpreted as summarizing the percent of variation in a given metric that is ‘explained’ by or ‘correlated’ to another metric or value. So, before, when Nicolas Cage was drowning people, one interpretation is to say that Nicolas Cage movies explain 66% of the variation in pool drownings.</p>
<p>That 66% sounds like a strong-ish correlation, but as we saw, it’s not great; equally, we saw that the really super high 95% correlation in cheese consumption didn’t actually mean anything.</p>
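<p>As a quick sketch of why an r-value alone proves nothing: any two series that merely both trend in the same direction over time will score a high r, related or not. The yearly counts below are completely made up:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two entirely unrelated, made-up yearly counts; both happen to drift upward.
films_per_year = [5, 7, 8, 11, 13]
drownings_per_year = [100, 110, 118, 125, 140]

r = pearson_r(films_per_year, drownings_per_year)
print(round(r, 3), round(r ** 2, 3))   # a very 'strong' correlation, meaning nothing
```

<p>Shared drift over time is enough to manufacture a ‘95%-grade’ correlation from noise.</p>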
<p>Many will be familiar with the often quoted <a href="https://xkcd.com/927/">XKCD about standards</a>, but the same thing can be said about quality metrics; there are a wide range of them that mean different things, and hopefully, some of these will already be familiar to you.</p>
<p>An easy one is ‘accuracy’, or ‘how many times did you get it right?’.</p>
<p>This is a clean, simple, management friendly metric, and nothing could possibly go wrong with something so simple?</p>
<p>At NTT, one of the projects we’ve delivered this year was a machine learning derived model for verifying if a website might be vulnerable to particular kinds of attack. As part of the training for that, we took over a decade of human-labelled and scored instances of true and false vulnerability observations across a huge swath of the internet. I believe it was on the order of a hundred million individual ‘samples’. So, we fired up the GPUs and let it run wild, optimising for the ‘accuracy’ metric.</p>
<p><img src="/img/2022/nasty-graph.png" alt="Graphs can Lie" /></p>
<p>On our first few passes something strange happened; we kept getting really high, more than 95% accuracy scores. In any normal environment, that would be a great success and then we would go home and take a month off and wait for our bonuses to roll in.</p>
<p>But, thankfully, we’re a sceptical bunch and we dug a bit deeper; basically, we were getting every single ‘false’ assessment correct, i.e. ‘this website does not have this vulnerability’, but we were incorrectly marking the ‘true’ cases, the ones we actually really cared about as ‘false’.</p>
<p>However, because in the real world, the real occurrence of vulnerabilities is thankfully rare (and thanks to products like ours, generally short lived), we had what is called a biased sample set. The ‘False’ set dwarfed the ‘True’ set. And because we were at that time looking to optimise accuracy, we succeeded in failing miserably.</p>
<p><img src="/img/2022/nasty-vulnerabilities.png" alt="Metrics can Lie" /></p>
<p>Thankfully there are other metrics; lots of them. And I’m not going to suggest you need to use them all, but in our case we evaluated the measures, conferred with our domain experts and our product team to work out what behaviour and tolerances were actually desirable, and settled on the <a href="https://en.wikipedia.org/wiki/Phi_coefficient">Matthews Correlation Coefficient</a> as the optimal target for that particular training task.</p>
<p><a href="https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion"><img src="/img/2022/table-of-confusion.png" alt="This is seriously called the Table of Confusion on Wikipedia and I can't argue with that" /></a></p>
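<p>A toy sketch (with made-up counts, in the same spirit as our vulnerability data) shows how the two metrics diverge on an imbalanced set; the formulas follow the standard confusion-matrix definitions:</p>

```python
from math import sqrt

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0 by convention when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy numbers: 1,000 sites, only 10 actually vulnerable.
# "Classifier" A just says 'not vulnerable' to everything:
acc_a, mcc_a = accuracy(0, 0, 990, 10), mcc(0, 0, 990, 10)
# Classifier B catches 8 of the 10, at the cost of 40 false alarms:
acc_b, mcc_b = accuracy(8, 40, 950, 2), mcc(8, 40, 950, 2)

print(acc_a, mcc_a)   # sky-high accuracy, zero MCC
print(acc_b, mcc_b)   # slightly lower accuracy, far more useful model
```

<p>Optimising for accuracy picks the do-nothing classifier; MCC correctly ranks the model that finds the vulnerabilities we actually care about.</p>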
<p>If we’d blindly deployed our Accuracy models, I’d almost certainly not have a job anymore!</p>
<h2 id="premature-aggregation">Premature Aggregation</h2>
<p>Finally, we come to my personal bug-bear of this line of work; premature aggregation.</p>
<p>8 out of 10 executives suffer from premature aggregation at some point in their careers. It’s nothing to be ashamed of, and you can seek guidance for how to resolve it.</p>
<p>With complex systems, the urge to simply roll everything up and take an average is strong, but as we saw in the jeans example, sometimes the average isn’t the best approach, and sometimes you simply should not hide a certain level of detail.</p>
<p>However, where you set that detail is a slippery slope, and I don’t have any hard and fast rules for you, so I’ll lay out a few examples;</p>
<p>Let’s talk about that jeans example again; last time we saw it, we had broken the decision space up into three factors: height, count, and gender.</p>
<p><img src="/img/2022/jeans-hist-gender-dual.png" alt="Two Jeans Are Still Better Than One" /></p>
<p>We already recognised that we’d ‘prematurely aggregated’ by not taking gender into account and just looking at the average and hoping for the best. But fundamentally, each one of these blocks in the histogram is made up of individuals, and individuals have all kinds of characteristics that we could dive into. So how do we know when to stop? Fundamentally, you’re just going to have to learn to make that judgement from experience and context. For instance, we could dive a little deeper and look into the influence of race on height distribution.</p>
<p><img src="/img/2022/jeans-hist-race.png" alt="You've got to stop somewhere" /></p>
<p>However, as you can see, it ends up being a bit of a mess. Part of this comes down to the choices of visualisation, but fundamentally; people are messy, and generally, the world is messy.</p>
<p>In these kind of situations, I try to go back to the motivation for any study or analysis I’m conducting; am I trying to convince someone of something? Am I trying to improve the performance of some process? Or am I trying to sell jeans to the most people.</p>
<p>From the data above, while there is significant variation in overall racial morphology, the ‘signal’ is nowhere near as strong as the Gender factor, and since race or ethnicity means nothing to whether an individual can wear jeans or not, it’s an unnecessary detail to the business case and we can say we’ve reached our ‘optimum level of aggregation’.</p>
<p>In my own work we had something similar. We have a scoring system, much like a credit score, that is intended to be an honest, representative, comparable measure of the ‘security hygiene’ of a website. It consists of 8 factors which aren’t massively important to the point, however they are (currently) presented monolithically, and of course, both the customers and our support teams were chasing this metric and were generally quite unhappy any time it went down, or even, didn’t go up.</p>
<p><img src="/img/2022/wsi-basic.png" alt="Metrics can definitely lie..." /></p>
<p>For anyone keeping track of “Internet Law Bingo”, what comes next is <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart’s Law</a>;</p>
<blockquote>
<p>When a measure becomes a target, it ceases to be a good measure.</p>
</blockquote>
<p>Without any significant thought, these ‘credit scores’ were simply averaged across a wide range of client assets, big, small, financial, gaming, healthcare, customer facing, internal, what have you, and for a significant period of time, executives chased this number and were…. disappointed… when they couldn’t “game the system”.</p>
<p><img src="/img/2022/wsi-dist.png" alt="Metrics can definitely lie..." /></p>
<p>In an attempt to combat this, our group have been working with the front end teams to unpack that premature aggregation, and to better share and explain that no, a single number doesn’t express the ‘hygiene’ of your entire company.</p>
<p><img src="/img/2022/wsi-detail.png" alt="Power to the Factors" /></p>
<p>For example, this is the distribution of different sites under a particular client’s control.</p>
<p>Now, for context, the theoretical maximum for this particular index is 800, and this particular client had the highest proportion of ‘near perfect scores’ of any of our clients.</p>
<p>But their security and executive teams were primarily focused on one thing: their average score of <strong>592</strong>, which put them largely in the middle of the road for high-end clients based on our earlier distribution; far from stellar but pretty good, and which just <em>didn’t reflect their actual security posture</em>.</p>
<p>What’s more galling from a Data Science perspective is that the 25-odd sites dragging their average down by over 100 points were all copies of the same application, deployed for different regions, and so shared the same vulnerabilities; fixing 3 vulnerabilities on that ‘one’ application across those 25 sites could have upped their scores, instead of fiddling around with their high-value, and already highly scored, sites.</p>
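<p>With illustrative numbers (not the client’s real data; just a fleet of well-secured sites plus 25 regional copies of one neglected application), the mean-versus-median gap is easy to see:</p>

```python
from statistics import mean, median

# Illustrative scores out of a theoretical maximum of 800:
# 100 well-secured sites, and 25 regional copies of one weak application.
scores = [700] * 100 + [160] * 25

print(mean(scores))    # the single number the executives were chasing
print(median(scores))  # a number that better reflects the estate
```

<p>A handful of duplicated low scorers drags the headline average down by over 100 points while the median, and the actual security posture, barely moves.</p>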
<h1 id="conclusion">Conclusion</h1>
<p>And with that, that’s it, we’re finally at the end. We’ve reviewed how just because metrics are correlated doesn’t mean they cause each other, that the devil is very often in the details in terms of measuring patterns and in dividing classes; that you need to be as careful in your choice of measures as in your data, and finally that it’s important to pick the appropriate level of abstraction, lest you lose the impact of what your analysis is trying to say.</p>
<p>There’s a quote floating around (generally <a href="https://quoteinvestigator.com/2021/01/18/confess/">attributed</a> to British Economist <a href="https://en.wikipedia.org/wiki/Ronald_Coase">Ronald Coase</a>);</p>
<blockquote>
<p>If you torture data long enough, it will confess to anything you’d like</p>
</blockquote>
<p>Being data driven is one thing, but when working with data, we need to also understand the underlying structure of the systems and phenomena that we’re measuring, planning, and deciding on. All the storage space and GPU time in the world won’t save you from screwing up bigly if you don’t know your problem domain.</p>
<p>Thanks for your time, and if you have any questions, I’m on twitter as @bolster and I’ve email addresses littered over the internet so google me!</p>
<h2 id="postscript">Postscript</h2>
<p>As part of the generation of film posters, I had a suggestion in from <a href="https://twitter.com/Jon_C_Rees/status/1465777730625146887">Jon Reese and Amy Pearson</a> and had to make it, so here’s a freebie.</p>
<p><a href="https://twitter.com/Bolster/status/1465776842426748939"><img src="/img/2022/fear.png" alt="Fear and Loathing in Stats Vegas" /></a></p>
Generative Adversarial Procrastination2021-11-26T15:07:00+00:00https://andrewbolster.info/2021/11/generative-adversarial-procrastination<p><em>TL:DR “Don’t worry about being a procrastinator, just make sure that your procrastinations are worthwhile.”</em></p>
<p>There’s an implicit irony in this post that I’ve been thinking / talking about writing it for at least 6 months, and it finally came down to a <a href="https://twitter.com/Bolster/status/1464233419714568198?s=20">tweet</a> to force me to do it.</p>
<blockquote>
<p>Fun fact, in the time it took for me to write this procrastination post, the twitter poll changed, so I guess I gotta delete it all and play Satisfactory now?</p>
</blockquote>
<p>I’m a procrastinator. Ridiculously so. To a degree that my procrastination at the moment is procrastination from procrastinating. To misquote a misquote from the New York Times; “It’s Procrastination all the way down”.</p>
<p>However, I’m still surprisingly productive in my own way; I’ve contributed to the operation of numerous charities, side projects, talks, meetups, guest lecture series; spent a load of time on deep background to local journalists who end up out of their depth in the stranger parts of the tech world; did a masters project on a 36 hour sleep cycle, and a 3 year PhD in 5 years, still play the Start-Up game every so often (or, at least, the fun bit of setting the world to rights in the bar with a few friends / colleagues convinced that we’ve got an idea that will change the world), and still faff around with a load of hardware and home automation stuff (that never seems to work for long)…</p>
<p>So I’m a workaholic right? Christ no. This morning I got out of bed at around noon after getting caught up on <a href="https://play.acast.com/s/blindboy">BlindBoy</a> and <a href="https://open.spotify.com/show/0rOatMqaG3wB5BF4AdsrSX">Robert Evans podcasts</a></p>
<p>Must be one of those ‘mono-taskers’? Nope; I currently have 2 different IDEs open, 12 text documents in various stages of progress and <a href="/2021/07/counting-tabs-and-background-tasks-taunting-goodharts-demon.html">currently 55 tabs open</a>, along with 3 different ‘in flight’ books, one of which has been ‘in-flight’ for over a year…</p>
<p>You must take great notes and plan everything and have a schedule for all the things? Nope, used to do that, and it’s more trouble than it’s worth. These days probably around 75+% of my ‘working time’ is ‘unstructured’ (including the ‘semi-structured’ bits I talk about below)</p>
<p>I think I finally worked out what my working style is, and considering my professional practice as a Data Scientist working in Machine Learning, it’s blindingly obvious in retrospect; I’m an Adversarial Procrastinator.</p>
<p>For that to make sense, I need to first talk about Ian Goodfellow, who IMO basically changed the game in terms of practical ML in 2014 with his publication of <a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf"><em>Generative Adversarial Nets</em></a>. Put simply; why train one big neural network with a whole pile of data, when you can train two neural networks both trying to ‘beat’ each other? This is the basis of almost <em>all</em> of the “Oh wow, that’s fake?” posts that you’ve seen shared around for the guts of a decade;</p>
<p><a href="https://thispersondoesnotexist.com/">ThisPersonDoesNotExist.com</a> (or <a href="https://thiscatdoesnotexist.com/">cats</a>)? GAN</p>
<p>Deepfakes? GANs on Video</p>
<p>That classic video game that had its textures from 20 years ago upscaled to 4k? GANs</p>
<p>Google Pixel 6’s “Magic Eraser”? GANs on image segmentation</p>
<p>Those Style Transfers of <a href="https://www.tensorflow.org/tutorials/generative/style_transfer">“What if Kandinsky painted a Lab?”</a>? Originally a CNN, but GANs have been used to identify ‘genuine’ pieces.</p>
<p>Put simply, for a given target task, GANs consist of two parts;</p>
<ul>
<li>Discriminator - This tries to accomplish the task by expressing a lower dimensionality output, usually a label or number, based on a higher dimensional input, usually an image</li>
<li>Generator - This tries to ‘trick’ the discriminator, by expressing a candidate input for the Discriminator, based on a candidate label.</li>
</ul>
<p>Basically, if the Generator tricks the Discriminator, the Discriminator then gets retrained with the new ‘fake’ information as a negative label, and the Generator gets retrained with the digital equivalent of a pat on the head for a good job done. Together, and in this loop, you end up with one network that’s really good at mapping from ‘images’ to ‘labels’, and one that’s really good at mapping from ‘labels’ to ‘images’.</p>
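<p>As a deliberately silly sketch of that loop (this is a caricature, not a real GAN: the ‘real data’ is a single number, and each player is one parameter), the alternating structure looks like this:</p>

```python
# A tiny caricature of the adversarial loop: the generator is one number
# trying to pass as 'real', and the discriminator is a threshold that
# always plays its best response between the real sample and the fake.
REAL = 5.0

def adversarial_loop(rounds=500, lr=0.1):
    fake = 0.0
    threshold = 0.0
    for _ in range(rounds):
        # Discriminator retrains: split the difference between real and fake.
        threshold = (REAL + fake) / 2
        # Generator retrains: nudge the fake toward the 'real' side of the
        # boundary (the digital pat on the head for fooling the critic).
        fake += lr * (threshold - fake)
    return fake, threshold

fake, threshold = adversarial_loop()
print(fake, threshold)   # both converge on the real value
```

<p>In a real GAN both players are deep networks and the updates are gradient steps against opposing loss functions, but the turn-taking structure is exactly this.</p>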
<p>This is the ‘Adversarial’ part, and it’s become pretty clear to me, that I do the same thing with my procrastination; I continuously push new ‘tasks’ onto the stack, so that when I’m getting pissed off / frustrated / bored / stuck with one thing that I have on my plate, I can ‘constructively’ procrastinate away from Task A by working on Task B.</p>
<p>This makes my use of task managers / issue trackers / calendars really … Strange.</p>
<p>I will block off sections of time in my calendar with built in procrastination targets, because I’ve accepted that if I’m not going to feel like doing Task A, I may as well do Task B instead.</p>
<p>For instance; 1400-1500 on Thursdays, I work on Project Management (yey jira….😭), or Documentation, that way, if I can’t be arsed looking at Jira again, I dive in to one of the many outstanding documentation tasks (mostly reviews, don’t worry) that are floating around.</p>
<p>(I balance this by having an ‘Personal Admin/faffing around’ block on Monday mornings that leans more heavily on the Faffing side, and a ‘Wrap Up’ on Friday nights that is really just there to make sure my American colleagues don’t accidentally invite me to a meeting that I have to take from a pub smoking area, don’t tell my CPO 😛 (<em>I’m kidding, they know already</em>))</p>
<p>Or, like I’m doing right now, I’m 4,000 words into a probably 6,000 word lecture on ‘Lies, Damned Lies and Data Science’ that I’m delivering to UU’s MSc Data Science classes next week, but instead I’m finally writing out something that I’ve been meaning to do for ages…</p>
<p>This all may indeed be stating the obvious, and it’s almost certainly not a ‘new task management paradigm’, but it took me a couple of years to work out and accept what I was doing, and to ‘tame the discriminator’ into accepting that I can ‘generate’ procrastination tasks that are, in the long term, at least as worthwhile as ‘the target task’; and that when I <em>do</em> get around to the ‘target task’, I’m coming to it fresh, with energy, and not just doing it “because you told yourself that you would”.</p>
<p>Anyway, guess that’s about it;</p>
<p>Don’t worry about being a procrastinator, just make sure that your procrastinations are worthwhile.</p>
UUIDs and You2021-11-10T18:00:00+00:00https://andrewbolster.info/2021/11/uuids-and-you<blockquote>
<p>The guts of this document was originally created as part of my work at NTT Application Security, stripped of its specificity and retained for my own reference.</p>
</blockquote>
<h2 id="background">Background</h2>
<p>Entities need to be identifiable, but the existence of entities should not be predictable, and it should not be easy for an external user/attacker to infer anything about the number or presence of entities.</p>
<p>Conventional auto-increment integer IDs were historically de rigueur for (now largely spurious) database performance optimisation reasons; however, they are susceptible to both presence estimation and scale estimation.</p>
<p>If a potential customer creates an entity with the integer id <code class="language-plaintext highlighter-rouge">180</code>, we can infer that globally, there are <code class="language-plaintext highlighter-rouge">179</code> existing entities of that type, and that the next one will probably be <code class="language-plaintext highlighter-rouge">181</code>.</p>
<p>Also, if implemented as a distributed/scalable system, contention for global ID generation becomes a limiting factor (and a challenge to any stateless expectations on such a system).</p>
<h3 id="what-about-using-nested-structures-like-customer10entity1">What about using nested structures like <code class="language-plaintext highlighter-rouge">/customer/10/entity/1</code></h3>
<p>This is an option; however, it couples the customer state to the individual state of a given entity, which introduces unnecessary dependency. Also, establishing the ‘schema’ for addressing arbitrary objects in this way becomes challenging. For instance, if a <code class="language-plaintext highlighter-rouge">customer</code> object has several <code class="language-plaintext highlighter-rouge">reading_list</code> mappings, each containing multiple <code class="language-plaintext highlighter-rouge">book</code> objects with potential (realised or not) many-to-many mappings, there are many ways to reference a particular book.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/book/123</code></li>
<li><code class="language-plaintext highlighter-rouge">/customer/10/book/123</code></li>
<li><code class="language-plaintext highlighter-rouge">/customer/10/reading_list/5/book/123</code></li>
</ul>
<h3 id="uuids">UUIDS!</h3>
<p>Universally Unique Identifiers (<a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUIDs</a>) are 128-bit values intended for the generation of globally unique identifiers without dependency on a central authority or coordination between the services generating them.</p>
<p>There are a range of UUID generation types, summarised below (v1-5 based on <a href="https://www.rfc-editor.org/rfc/inline-errata/rfc4122.html">RFC4122</a>).</p>
<table>
<thead>
<tr>
<th>Version</th>
<th>Description</th>
<th>Usecase</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Date Time (100ns) + MAC address</td>
<td>Basically guaranteed uniqueness between nodes</td>
</tr>
<tr>
<td>2</td>
<td>Date Time (7min) + MAC + Namespace/Domain ID</td>
<td>Security (but pretty useless)</td>
</tr>
<tr>
<td>3/5</td>
<td>Namespace + Name + Hash (MD5/SHA1)</td>
<td>Actively designed to enable “same uuid generation”</td>
</tr>
<tr>
<td>4</td>
<td>Random</td>
<td>Identity Generation</td>
</tr>
<tr>
<td>*6,7,8</td>
<td>v1 with Time Ordering</td>
<td>Guarantees Uniqueness, Ordering and Proximity</td>
</tr>
</tbody>
</table>
<p>*<a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html">IETF Proposal</a></p>
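<p>The split between the deterministic versions (v3/v5) and the random one (v4) is easy to demonstrate with Python’s standard <code class="language-plaintext highlighter-rouge">uuid</code> module:</p>

```python
import uuid

# v3/v5 are deliberately deterministic: the same namespace + name
# always yields the same UUID ("same uuid generation" by design)...
a = uuid.uuid5(uuid.NAMESPACE_DNS, "andrewbolster.info")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "andrewbolster.info")
assert a == b

# ...while v4 is fresh randomness on every call, so two draws
# will (for all practical purposes) never match.
assert uuid.uuid4() != uuid.uuid4()
```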
<h3 id="sidebar-on-collisions">Sidebar on Collisions</h3>
<p>They’re just stupidly unlikely, stop worrying about them:</p>
<blockquote>
<p>Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.</p>
</blockquote>
<p>Within most reasonable data models, it would be sufficient (or overkill to the point of paranoia) to check for the existence of an entity as part of its initial creation, and simply ‘reroll’ in the rare case a collision emerges, although doing so would itself forfeit a significant amount of the advantage of UUIDs in distributed/delay-tolerant systems.</p>
<p>However, for the timeseries variants (1/2/6/7/8), once you leave their ‘tick’, no future collision is possible, so as long as your generating nodes have unique MAC addresses, all will be well.</p>
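<p>If you insist on the belt-and-braces check anyway, the ‘reroll’ amounts to a couple of lines; here’s a sketch against a hypothetical in-memory <code class="language-plaintext highlighter-rouge">store</code> mapping:</p>

```python
import uuid

def create_entity(store, payload, max_attempts=3):
    # Check-then-reroll on the (stupidly unlikely) chance of a v4 collision.
    # `store` stands in for whatever persistence layer you actually have.
    for _ in range(max_attempts):
        candidate = uuid.uuid4()
        if candidate not in store:
            store[candidate] = payload
            return candidate
    raise RuntimeError("multiple v4 collisions; check your entropy source")
```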
<h3 id="sidebar-on-storage">Sidebar on Storage</h3>
<p>Naively, UUIDs look huge as 36-character strings, versus the teeny integer or even uint32/int32 fields we could use in other ID schemes, but compared to the size of the objects these keys will be referencing, this is a minuscule consideration. Even then, we can change the representation of these 128 bits in any number of ways, mapping them directly to bin16 blobs, or encoding them as base64 values.</p>
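<p>Those representation sizes are easy to sanity-check with the standard library:</p>

```python
import base64
import uuid

u = uuid.uuid4()

assert len(str(u)) == 36     # canonical hyphenated string representation
assert len(u.bytes) == 16    # raw 128 bits, suitable for a bin16 blob column
assert len(base64.urlsafe_b64encode(u.bytes)) == 24  # base64, padding included
```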
<p>What’s more relevant is the impact of UUIDs on back-end storage. Much hay has been made of these critiques over the years; here are a couple of highlights.</p>
<ul>
<li><a href="https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/">UUIDs are Popular, but Bad for Performance — Let’s Discuss</a>
<ul>
<li>Highlights different compromises to storage, in terms of impact on InnoDB insertion rates (Char is worst, bin16 is best), and on the ordering of ID’s for query lookup performance (injecting ‘order’ into random id generation makes proximal lookups more efficient (Note, this is basically what UUIDv6 does))</li>
</ul>
</li>
<li><a href="https://www.cybertec-postgresql.com/en/int4-vs-int8-vs-uuid-vs-numeric-performance-on-bigger-joins/">int4 vs int8 vs uuid vs numeric performance on bigger joins</a>
<ul>
<li>UUID introduces a 13% drop in join rates and a significant increase in index size scaling (note: the <code class="language-plaintext highlighter-rouge">numeric</code> type was even worse in the join case, with a 34% drop in join rates)</li>
</ul>
</li>
<li><a href="https://richardstartin.github.io/posts/uuids-and-compressibility">UUIDs and Compressibility</a>
<ul>
<li>Because UUIDs are generally ‘random’, they’re practically impossible to meaningfully compress</li>
<li>Also, string representation makes a big difference in key size scaling (binary -> 16 bytes, UTF-8 string -> 36 bytes)</li>
</ul>
</li>
</ul>
<h3 id="sidebar-on-applicability-of-this-to-the-microservicespersistence-architectures">Sidebar on applicability of this to the microservices/persistence architectures</h3>
<p>Within a service/state domain, no one cares what the underlying storage is doing; <strong>there is no reason that a persistence service couldn’t use auto-increment PKs for its internal state storage</strong> etc.</p>
<p>However, when entities are referenced, or expressed on the boundaries of that service, they should be referred to and stored using some kind of globally unique identifier.</p>
<h3 id="sidebar-on-intepretability">Sidebar on interpretability</h3>
<p>One valid criticism leveled at UUIDs is that it’s difficult / taxing for a human to tell whether two UUIDs are the same; similarly, they can be difficult to ‘share around’, i.e. shout across the offices we no longer use (greetings from COVID ‘21).</p>
<p>Even if the entity references are encoded as full-throated UUIDs, there’s no reason we can’t have other tools to support recognition and similarity.</p>
<p>One approach is <a href="https://francoisbest.com/posts/2021/hashvatars">‘hashvatars’</a>, i.e. functionally generated visual representations of large, un-friendly, numbers.</p>
<p><img src="/img/2021/hashvatar.png" alt="" /></p>
<p>These can be mapped such that ‘close’ hashes look similar (the author uses the word ‘soul’ to describe this), while nearby differences are highlighted, so that it is clear when two findings are very different.</p>
<p>But visual verification isn’t a solution for interpretability. Git has exactly the same problem, with even longer, 40-character SHA-1 hashes representing commits. However, for the vast majority of projects, <a href="https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection">7 characters of that hash is sufficient to represent the commit</a> <em>within the project</em> without colliding with project-external commits.</p>
<p>This kind of contextual reference (i.e. permitting the use of shortcodes or similar) could identify, for instance, a ‘book’ associated with a ‘customer’; even if several ‘global’ entities match that ‘shortcode’, the querier should only have access to one of them, so the storage backend can still resolve the reference unambiguously.</p>
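<p>As a sketch of that git-style contextual lookup (the <code class="language-plaintext highlighter-rouge">resolve</code> helper and its scoping are illustrative, not any particular backend’s API):</p>

```python
import uuid

def resolve(prefix, scoped_ids):
    # Within the querier's scope, a short hex prefix is usually unambiguous;
    # if it isn't, fail loudly rather than guess.
    matches = [u for u in scoped_ids if u.hex.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} entities")
    return matches[0]
```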
<h2 id="bolsters-general-rule-of-thumb">Bolster’s General Rule of Thumb</h2>
<p>For stateful entities such as Customer or Client records, or the long-term objects they create, these should be referred to by truly random identifiers, as one ‘client’ does not need to be proximal or relatable to another client; they just need to be held distinctly. IMO these should be UUIDv4 values.</p>
<p>For stateless events or ephemeral data or observations, these should be referred to by values that have lexical sorting and proximity, so as to support efficient window-range querying, caching and rollup. IMO these should be UUIDv6 (alternatives do exist; see the <a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html#name-background">UUIDv6 Background section</a>).</p>
<p>(uuid libraries supporting v6 exist in a range of languages, such as <a href="https://uuid.ramsey.dev/en/latest/nonstandard/version6.html">PHP</a>, <a href="https://github.com/uuid6/prototypes#prototypes">Python, Zig, Dart, Javascript and Go</a>; however, since UUIDv6 is simply an endianness-flip of UUIDv1, there should be no trouble in applying this to other languages/frameworks)</p>
<h2 id="uuidv4-for-entity-ids">UUIDv4 for Entity IDs</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">uuid</span>
<span class="k">def</span> <span class="nf">id_gen</span><span class="p">():</span>
<span class="k">return</span> <span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">()</span>
<span class="n">id_gen</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">UUID</span><span class="p">(</span><span class="s">'d9c11c39-4bdf-4903-83e5-163b77f6df23'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="uuidv4-for-event-ids">UUIDv6 for Event IDs</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">uuid</span>
<span class="k">def</span> <span class="nf">uuidv1tov6</span><span class="p">(</span><span class="n">u</span><span class="p">):</span>
<span class="c1">## UUIDV1 has its time-based components in 'little endian', i.e.
</span> <span class="c1"># Least significant words first
</span> <span class="c1"># see https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.2
</span> <span class="c1"># Equivalent to writing timestamps as SS.MM:HH DD-MM-YYYY
</span> <span class="c1"># Makes them time based but not lexically sortable or proximal
</span> <span class="c1"># This simple rearrangement fixes that
</span> <span class="n">uh</span> <span class="o">=</span> <span class="n">u</span><span class="p">.</span><span class="nb">hex</span>
<span class="n">tlo1</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="c1">#LSB
</span> <span class="n">tlo2</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">5</span><span class="p">:</span><span class="mi">8</span><span class="p">]</span>
<span class="n">tmid</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">8</span><span class="p">:</span><span class="mi">12</span><span class="p">]</span>
<span class="c1"># uh[12] contains the version number, i.e. 1, this is manually replaced below
</span> <span class="n">thig</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">13</span><span class="p">:</span><span class="mi">16</span><span class="p">]</span> <span class="c1">#MSB
</span> <span class="n">rest</span> <span class="o">=</span> <span class="n">uh</span><span class="p">[</span><span class="mi">16</span><span class="p">:]</span>
<span class="n">uh6</span> <span class="o">=</span> <span class="n">thig</span> <span class="o">+</span> <span class="n">tmid</span> <span class="o">+</span> <span class="n">tlo1</span> <span class="o">+</span> <span class="s">'6'</span> <span class="o">+</span> <span class="n">tlo2</span> <span class="o">+</span> <span class="n">rest</span>
<span class="k">return</span> <span class="n">uuid</span><span class="p">.</span><span class="n">UUID</span><span class="p">(</span><span class="nb">hex</span><span class="o">=</span><span class="n">uh6</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">id_gen</span><span class="p">():</span>
<span class="k">return</span> <span class="n">uuidv1tov6</span><span class="p">(</span><span class="n">uuid</span><span class="p">.</span><span class="n">uuid1</span><span class="p">())</span>
<span class="n">id_gen</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">UUID</span><span class="p">(</span><span class="s">'1ebd8241-29fe-63bc-87f1-000d3a45a647'</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>Just because you’re paranoid doesn’t mean they’re not after you</p>
</blockquote>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="n">generated_ids</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10_000</span><span class="p">)):</span>
<span class="n">generated_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">id_gen</span><span class="p">())</span>
<span class="n">sleep</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)</span> <span class="c1"># Simulate actual time / work done
</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="c1"># Validate monotonicity, order, and uniqueness
</span> <span class="k">assert</span> <span class="nb">str</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o"><</span> <span class="nb">str</span><span class="p">(</span><span class="n">generated_ids</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
Counting Tabs and Background Tasks: Taunting Goodharts Demon2021-07-22T10:58:00+00:00https://andrewbolster.info/2021/07/counting-tabs-and-background-tasks-taunting-goodharts-demon<p>This was going to be a really quick post yesterday, but I’ve spent the guts of a day (between actual work) just getting the <em>simplest</em> bit of this working.</p>
<h2 id="the-intent">The intent</h2>
<p>I was silly enough to say this to someone recently in work:</p>
<blockquote>
<p>It’s a good day when I end up with fewer firefox tabs open than when I started</p>
</blockquote>
<p>And a plan was hatched;</p>
<ol>
<li>Find API to query Firefox for current number of open Tabs across all window instances</li>
<li>Send that value to my internal MQTT node (that has telegraf, influx, home assistant and grafana integrations too)</li>
<li>Repeat</li>
</ol>
<p>Spoiler alert, number 3 was the bastard</p>
<h2 id="getting-the-tabs">Getting the Tabs</h2>
<p>First step I thought was going to be more complicated. Fortunately <a href="https://github.com/balta2ar/brotab"><code class="language-plaintext highlighter-rouge">brotab</code></a> beat me to it (although I’m not a fan of the name…)</p>
<p>Once <code class="language-plaintext highlighter-rouge">brotab</code> is installed the first step of this is easy, if under-documented</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">brotab.main</span> <span class="kn">import</span> <span class="n">create_clients</span>
<span class="kn">from</span> <span class="nn">brotab.api</span> <span class="kn">import</span> <span class="n">MultipleMediatorsAPI</span>
<span class="n">n_tabs</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span>
<span class="n">MultipleMediatorsAPI</span><span class="p">(</span>
<span class="n">create_clients</span><span class="p">()</span>
<span class="p">).</span><span class="n">list_tabs</span><span class="p">([])</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'You have </span><span class="si">{</span><span class="n">n_tabs</span><span class="si">}</span><span class="s"> open, you shameless procrastinator'</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>You have 135 open, you shameless procrastinator</p>
</blockquote>
<h2 id="sending-the-message">Sending the Message</h2>
<p>Thankfully, I’ve done this a few times in a few projects, so this is simple enough.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">paho.mqtt.client</span> <span class="k">as</span> <span class="n">mqtt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="n">hostname</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostname</span><span class="p">()</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">mqtt</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">client_id</span><span class="o">=</span><span class="n">hostname</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'maguire'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">1883</span><span class="p">)</span> <span class="c1"># this is the internal hostname of my MQTT service. And No, not Harry
</span>
<span class="n">c</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hostname</span><span class="si">}</span><span class="s">/open_tabs"</span><span class="p">,</span>
<span class="n">n_tabs</span><span class="p">,</span>
<span class="n">retain</span><span class="o">=</span><span class="bp">True</span> <span class="c1"># this asks the MQTT service to persist the value, so clients _after_ publish can see the last value.
</span><span class="p">)</span>
</code></pre></div></div>
<p>Then you can use something like <a href="http://mqtt-explorer.com/">MQTT Explorer</a> to validate that it all works as expected.</p>
<p>At this point you can go and play with Grafana or Influx boards or however you want to think about using the value longer term, however, we’ve still got step three to get to….</p>
<h2 id="task-scheduler-more-like-task-mangler-amirite">Task Scheduler? More like Task Mangler, amirite?</h2>
<p>In ‘nix land, this is easy.</p>
<p><code class="language-plaintext highlighter-rouge">crontab -e</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*/15 * * * * * /home/bolster/anaconda3/bin/python /home/bolster/bin/count_tabs.py
</code></pre></div></div>
<p>Job done, go home, happy days.</p>
<p>Unfortunately, that’s not the case with Windows “Task Scheduler” (accessible via the Start menu).</p>
<p>I won’t go into the full rant, but here are select issues I came across.</p>
<ul>
<li>Regularly forgetting what user it was supposed to run under, reverting to ‘Medium Execution’ role or something similar, but it only notices that issue when the next cycle goes around and then fires a “Task Scheduler did not launch task “\TabPing” because user “(NONE)” was not logged on when the launching conditions were met. User Action: Ensure user is logged on or change the task definition to allow launching when user is logged off.”</li>
<li>Everyone and their dog has a different, contradictory, solution to ‘my task doesn’t start when scheduled’ on <a href="https://superuser.com/search?q=%22task+scheduler%22+not+executing">SuperUser</a></li>
<li>The Task Scheduler execution environment <em>isn’t</em> a shell, so commands like “START” will raise the helpful <code class="language-plaintext highlighter-rouge">Additional Data: Error Value: 2147942402.</code> error, which actually means ‘File not Found’, but everyone loves guru contemplations.</li>
<li>Windows <em>really</em> doesn’t like running things in the background; if you just go the ‘traditional route’ of <code class="language-plaintext highlighter-rouge">python.exe &lt;path to script file&gt;</code>, it’ll helpfully pop up and persist that command window during execution, which would be annoying as hell. Helpfully, <code class="language-plaintext highlighter-rouge">conda</code> ships with <code class="language-plaintext highlighter-rouge">pythonw.exe</code>, specifically designed to launch headless programs…. however….</li>
<li>You can’t call that directly without calling up the appropriate Conda environment :facepalm:, so we need to wrap the whole pointless thing in a <code class="language-plaintext highlighter-rouge">.bat</code> file to run it.</li>
<li>Any time you change <em>anything</em> about the task, it ‘resets’ the schedule, so if you have it triggered daily @ 9am, and you update the task definition after that, nothing will kick off until tomorrow 9am (fix is to just reset the start time to T+5mins or something)</li>
</ul>
<p>So, yeah, here goes.</p>
<p>count_tabs.py (lives in C:\Users\me, so YMMV)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">brotab.main</span> <span class="kn">import</span> <span class="n">create_clients</span>
<span class="kn">from</span> <span class="nn">brotab.api</span> <span class="kn">import</span> <span class="n">MultipleMediatorsAPI</span>
<span class="kn">import</span> <span class="nn">paho.mqtt.client</span> <span class="k">as</span> <span class="n">mqtt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="n">hostname</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostname</span><span class="p">()</span>
<span class="n">mqtt_host</span> <span class="o">=</span> <span class="s">'maguire'</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">mqtt</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">client_id</span><span class="o">=</span><span class="n">hostname</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">mqtt_host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">1883</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Connected to </span><span class="si">{</span><span class="n">mqtt_host</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">n_tabs</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">MultipleMediatorsAPI</span><span class="p">(</span><span class="n">create_clients</span><span class="p">()).</span><span class="n">list_tabs</span><span class="p">([]))</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Got </span><span class="si">{</span><span class="n">n_tabs</span><span class="si">}</span><span class="s"> tabs'</span><span class="p">)</span>
<span class="n">c</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hostname</span><span class="si">}</span><span class="s">/open_tabs"</span><span class="p">,</span>
<span class="n">n_tabs</span><span class="p">,</span>
<span class="n">retain</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Published'</span><span class="p">)</span>
</code></pre></div></div>
<p>count_tabs.bat (also C:\Users\me)</p>
<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">SET</span> <span class="kd">logfile</span><span class="o">=</span><span class="s2">"C:\Users\me\batch.log"</span>
@echo <span class="na">off</span>
@echo <span class="kd">Starting</span> <span class="kd">Script</span> <span class="nb">at</span> <span class="nv">%date%</span> <span class="nv">%time%</span> <span class="o">>></span> <span class="nv">%logfile%</span>
<span class="k">call</span> <span class="kd">C</span>:\Users\me\anaconda3\condabin\conda.bat <span class="kd">activate</span>
<span class="kd">C</span>:\Users\me\anaconda3\pythonw.exe <span class="s2">"C:\Users\me\count_tabs.py"</span> <span class="s2">"1>C:\stdout.txt"</span> <span class="s2">"2>C:\stderr.txt"</span>
@echo <span class="kd">finished</span> <span class="nb">at</span> <span class="nv">%date%</span> <span class="nv">%time%</span> <span class="o">>></span> <span class="nv">%logfile%</span>
</code></pre></div></div>
<p>And finally, the actual task as an XML that you <em>might</em> be able to import, no idea, good luck.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version="1.0" encoding="UTF-16"?></span>
<span class="nt"><Task</span> <span class="na">version=</span><span class="s">"1.4"</span> <span class="na">xmlns=</span><span class="s">"http://schemas.microsoft.com/windows/2004/02/mit/task"</span><span class="nt">></span>
<span class="nt"><RegistrationInfo></span>
<span class="nt"><Date></span>2021-07-21T14:11:30.6608587<span class="nt"></Date></span>
<span class="nt"><Author></span>GRANGER\me<span class="nt"></Author></span>
<span class="nt"><URI></span>\TabPing<span class="nt"></URI></span>
<span class="nt"></RegistrationInfo></span>
<span class="nt"><Triggers></span>
<span class="nt"><CalendarTrigger></span>
<span class="nt"><Repetition></span>
<span class="nt"><Interval></span>PT5M<span class="nt"></Interval></span>
<span class="nt"><StopAtDurationEnd></span>false<span class="nt"></StopAtDurationEnd></span>
<span class="nt"></Repetition></span>
<span class="nt"><StartBoundary></span>2021-07-22T09:00:00<span class="nt"></StartBoundary></span>
<span class="nt"><Enabled></span>true<span class="nt"></Enabled></span>
<span class="nt"><ScheduleByDay></span>
<span class="nt"><DaysInterval></span>1<span class="nt"></DaysInterval></span>
<span class="nt"></ScheduleByDay></span>
<span class="nt"></CalendarTrigger></span>
<span class="nt"></Triggers></span>
<span class="nt"><Principals></span>
<span class="nt"><Principal</span> <span class="na">id=</span><span class="s">"Author"</span><span class="nt">></span>
<span class="nt"><UserId></span>REDACTED EVEN THOUGH I HAVE NO IDEA IF IT MATTERS<span class="nt"></UserId></span>
<span class="nt"><LogonType></span>S4U<span class="nt"></LogonType></span>
<span class="nt"><RunLevel></span>LeastPrivilege<span class="nt"></RunLevel></span>
<span class="nt"></Principal></span>
<span class="nt"></Principals></span>
<span class="nt"><Settings></span>
<span class="nt"><MultipleInstancesPolicy></span>StopExisting<span class="nt"></MultipleInstancesPolicy></span>
<span class="nt"><DisallowStartIfOnBatteries></span>false<span class="nt"></DisallowStartIfOnBatteries></span>
<span class="nt"><StopIfGoingOnBatteries></span>true<span class="nt"></StopIfGoingOnBatteries></span>
<span class="nt"><AllowHardTerminate></span>false<span class="nt"></AllowHardTerminate></span>
<span class="nt"><StartWhenAvailable></span>true<span class="nt"></StartWhenAvailable></span>
<span class="nt"><RunOnlyIfNetworkAvailable></span>false<span class="nt"></RunOnlyIfNetworkAvailable></span>
<span class="nt"><IdleSettings></span>
<span class="nt"><StopOnIdleEnd></span>true<span class="nt"></StopOnIdleEnd></span>
<span class="nt"><RestartOnIdle></span>false<span class="nt"></RestartOnIdle></span>
<span class="nt"></IdleSettings></span>
<span class="nt"><AllowStartOnDemand></span>true<span class="nt"></AllowStartOnDemand></span>
<span class="nt"><Enabled></span>true<span class="nt"></Enabled></span>
<span class="nt"><Hidden></span>false<span class="nt"></Hidden></span>
<span class="nt"><RunOnlyIfIdle></span>false<span class="nt"></RunOnlyIfIdle></span>
<span class="nt"><DisallowStartOnRemoteAppSession></span>false<span class="nt"></DisallowStartOnRemoteAppSession></span>
<span class="nt"><UseUnifiedSchedulingEngine></span>true<span class="nt"></UseUnifiedSchedulingEngine></span>
<span class="nt"><WakeToRun></span>false<span class="nt"></WakeToRun></span>
<span class="nt"><ExecutionTimeLimit></span>PT0S<span class="nt"></ExecutionTimeLimit></span>
<span class="nt"><Priority></span>7<span class="nt"></Priority></span>
<span class="nt"></Settings></span>
<span class="nt"><Actions</span> <span class="na">Context=</span><span class="s">"Author"</span><span class="nt">></span>
<span class="nt"><Exec></span>
<span class="nt"><Command></span>"C:\Users\me\count_tabs.bat"<span class="nt"></Command></span>
<span class="nt"></Exec></span>
<span class="nt"></Actions></span>
<span class="nt"></Task></span>
</code></pre></div></div>
<h1 id="task-gotchas">Task Gotchas</h1>
<ul>
<li>To run a command properly in the background, ‘Run whether user is logged on or not’ needs to be selected, but also check “Do not store password”, because this has the helpful effect that <strong>if you get queried for a password on saving the task; the task has forgotten who you are</strong>, so pop back into the ‘Change User or Group’ dialog, select your user, and then go ahead.</li>
<li>The ‘Conditions’ Tab should not have <em>anything</em> checked; I got trapped in there for a good while.</li>
<li>You will really want to enable <strong>‘Enable All Tasks History’</strong> in the ‘Actions’ sidebar of the Task Scheduler</li>
</ul>
Response to TOG's Third Eviction2021-06-10T12:18:00+00:00https://andrewbolster.info/2021/06/response-to-tog-s-third-eviction<blockquote>
<p><em>This is a condensed version of a Twitter thread in response to <a href="https://twitter.com/silverSpoon">@silverSpoon</a> asking how Farset Labs would comment on recent news of Dublin’s hackerspace (<a href="https://www.tog.ie/">TOG</a>) being <a href="https://www.independent.ie/regionals/dublin/southside/life/there-should-be-space-for-community-groups-that-are-more-than-just-gaa-clubs-40502887.html">unceremoniously turfed out of their third home</a> thanks to developers just wanting to flip sites</em></p>
</blockquote>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Agreed, and it shouldn't be down to companies to fill up the gap. <a href="https://t.co/Np46Znn3xH">https://t.co/Np46Znn3xH</a></p>— Andrea (@silverSpoon) <a href="https://twitter.com/silverSpoon/status/1402231908457648134?ref_src=twsrc%5Etfw">June 8, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I wonder if <a href="https://twitter.com/FarsetLabs?ref_src=twsrc%5Etfw">@FarsetLabs</a> has comments on how that works on NI?</p>— Andrea (@silverSpoon) <a href="https://twitter.com/silverSpoon/status/1402232432473088000?ref_src=twsrc%5Etfw">June 8, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>It’s similar up here, and (in line with <a href="https://www.twitter.com/jeffrey_roe">@jeffrey_roe</a>’s comments in the piece) hackerspaces fall through a crack in local development support: between ‘entrepreneurship/startup support’ (aka subsidised professional services and office space); the classical arts sector’s individual and studio support (which has its own significant issues, often more bureaucracy than it’s worth, and stifles community revenue generation through arcane entertainments licensing that I’m sure <a href="https://twitter.com/adamturks">@adamtucks</a> will chip in on if you get him started); and lastly, the classical community centre/sports groups (such as the GAA clubs discussed in the original article).</p>
<p>I’ve also had pushback in the past along the lines of “Oh, you should go for this cross-community fund”, when even asking the monitoring question of “so which side are you from?” makes my skin crawl and is against the #hacker ethos.</p>
<p>In the case of <a href="https://www.farsetlabs.org.uk">Farset Labs</a>, our outgoings are >90% related to the building and its upkeep, and we spend the vast majority of our administrative time paying the landlord and the heating/electric bills when we should be teaching and learning; decades of young technologists’ energies have gone into pure governance, upkeep, and maintenance. And when we ask for help, there are always two ‘strata’ of help:</p>
<p>First, the sporadic, seemingly random £1-5k pots for individual projects, which aren’t really something you can put towards rent (or even most #hackerspace equipment); and second, the 6/7-figure annual funding programmes expecting staffs of dozens and 30-year leases.</p>
<p>So we’ve survived on our community just continuing to pay, even though, before the pandemic, we’d been shut down for about a year due to structural issues in our building requiring a <a href="https://blog.farsetlabs.org.uk/2019/06/farset-labs-v2-expanding-renovating-improving/">complete refit</a>.</p>
<p>As for ‘get the tech sector to pay for it’, there are very few models where this doesn’t create perverse incentives in terms of recruitment bias or restrictions to ‘only relevant’ practice; <a href="https://www.farsetlabs.org.uk">Farset Labs</a> is a community, not a startup, not an incubator, and not a co-working space.</p>
<p>For years, <a href="https://www.twitter.com/dsrkane">@dsrkane</a> had a great slide in his presentations asking people who they thought paid for @FarsetLabs, and people always just assumed that it was something paid for by <a href="https://twitter.com/belfastcc">@belfastcc</a> or <a href="https://twitter.com/InvestNI">@InvestNI</a>, because that’s what everyone else seems to get. But the fact is that a volunteer-run, self-funded STEAM charity that doesn’t claim members’ IP, doesn’t curate products, and doesn’t want to scam its community with exploitative fees isn’t something that interests bureaucrats. At least not the ones with the purse strings.</p>
<p>Disclaimer: Speaking as an individual with 10 years of pissing upstream to keep the place going, and not as a representative of the board.</p>
Apache Arrow in 5-10 Mins2021-06-01T10:54:00+00:00https://andrewbolster.info/2021/06/apache-arrow-in-5-10-mins<p>The below was presented as part of the Belfast Linux Users Group’s May Technical Meetup.</p>
<p>I’m also maintaining this ‘short’ as part of my gradually accumulating <a href="https://www.github.com/andrewbolster/python_data_workshop/">‘python_data_workshop’</a>, so that may be the best place to see updates if you’re reading this any later than 2021.</p>
<h1 id="conditions">Conditions</h1>
<ul>
<li>Theme: ‘Universal Screwdrivers’</li>
<li>Event: <a href="https://www.meetup.com/belfast-lug/events/278413010/">BLUG</a></li>
<li>Time: ‘5-10mins’ (take the 10)</li>
</ul>
<h1 id="slides">Slides</h1>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQmuoRkGBg8lNbiN8R4Is_ypl0IemJND5Ll5h69VgQp3NAqc2hcozGDa_bYs_GJFvP4jPXaBbtNtTx1/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h1 id="transcript">Transcript</h1>
<p>Evening folks, and I’m happy to be appearing in what we hope is the last virtual BLUG! We’ve all spent a long time toiling in our own little worlds in relative isolation over the course of this sequence of lockdowns, and it’s appropriate that now, as we begin coming back together, we’re talking about projects that operate across, and complement, entire ecosystems.</p>
<p><a href="https://fabienmaussion.info/scientific_programming/week_05/02-Scientific-Python.html"><img src="https://fabienmaussion.info/acinn_python_workshop/figures/scipy_ecosystem.png" alt="python_ecosystem" /></a></p>
<p>Now, as many of you might know, I’ve basically been a Python-only developer since my time at University; the strength of that ecosystem, with genre defining packages like <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">requests</code>, <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">beautifulsoup</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">jupyter</code>, and <code class="language-plaintext highlighter-rouge">flask</code> mean that in my career as a Data Scientist, I rarely feel a strong need to leave my comfort zone for my own work and exploration.</p>
<p>However, in both my professional and hobbyist software engineering and data science projects, I occasionally have to deal with … “Other People”, and these other people have their own ecosystems that they think suit their workflows better; and, while they have every right to be mistaken, we generally have to shuffle small, medium, or depressingly large datasets around and between these vast programming language star systems like some kind of “No Man’s Compiler”.</p>
<p>So when Phil mentioned that the theme was about cross-platform libraries and workflows, there was only one candidate that I could suggest, and that is the <a href="https://arrow.apache.org/">Apache Arrow</a> project.</p>
<p>On paper, and on spec, Arrow is fairly conventional as an in-memory data representation; it’s got columnar storage for analytic lookups, leveraging CPU cache locality to maximise local throughput, with a fairly conventional bitmask/offset/value format for fast partial eliminations, and then these row-groups can be efficiently swapped in and out at will, or <code class="language-plaintext highlighter-rouge">mmap</code>ed directly from disk to minimise paging, blah blah blah. We’ve had hyper optimised data structures for as long as we’ve had data.</p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/simd.png" alt="smid" /></a></p>
<p><a href="https://www.kdnuggets.com/2017/02/apache-arrow-parquet-columnar-data.html"><img src="https://www.kdnuggets.com/wp-content/uploads/dremio-columnar-2.jpg" alt="columnar" /></a></p>
<p>One of the exciting things about the Arrow project (well, one of the two exciting things) is that it’s a direct descendant of my go-to tool for any data problem, <code class="language-plaintext highlighter-rouge">pandas</code>; <code class="language-plaintext highlighter-rouge">pandas</code> author and maintainer Wes McKinney is the lead of the Arrow project. In fact, the Arrow project was effectively announced by Wes publishing his own <a href="https://www.slideshare.net/wesm/practical-medium-data-analytics-with-python">“10 Things I Hate About Pandas”</a> talk in 2013, which is well worth a <a href="https://wesmckinney.com/blog/apache-arrow-pandas-internals/">watch</a>. What this means is that the <code class="language-plaintext highlighter-rouge">arrow</code> memory model is a “clean” version of the <code class="language-plaintext highlighter-rouge">pandas</code> internal memory representation, optimised around all the pain points of <code class="language-plaintext highlighter-rouge">pandas</code>, like memory management, crummy categoricals, painful appends, etc. So if you’re using <code class="language-plaintext highlighter-rouge">pandas</code> (and especially if you use <code class="language-plaintext highlighter-rouge">pd.to_parquet</code>), you’re already an “Arrow-ite”.</p>
<p><a href="https://wesmckinney.com/"><img src="https://wesmckinney.com/images/wes-2017-01-12-small.png" alt="wes" /></a></p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/copy.png" alt="copy" /></a></p>
<p>But, to step away from either the Python-focused world and the internal memory representation world, the <em>really</em> interesting thing about <code class="language-plaintext highlighter-rouge">Arrow</code> as a project, is that it neatly sidesteps the classic serialisation loop, and all the overhead that comes from it. I’m going to take the following quote from Dremio’s summary of the Arrow project:</p>
<p><a href="https://www.reddit.com/r/Maps/comments/ek06ja/currencies_of_europe_countries_before_euro/"><img src="https://i.redd.it/1kwbacxyzs841.jpg" alt="" /></a></p>
<blockquote>
<p>To use an analogy, consider traveling to Europe for vacation before the European Union (EU) was established. To visit five countries in seven days, you could count on the fact that you were going to spend a few hours at each border for passport control, and you were going to lose some value of your money in the currency exchange. This is how working with data in-memory works without Arrow: enormous inefficiencies exist to serialize and deserialize data structures, and a copy is made in the process, wasting precious memory and CPU resources. In contrast, Arrow is like visiting Europe after the EU and the adoption of the common currency dubbed the euro: you don’t wait at the border, and one currency is used everywhere.</p>
</blockquote>
<p><a href="https://en.wikipedia.org/wiki/List_of_currencies_in_Europe"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Eurozone_map.svg/600px-Eurozone_map.svg.png" alt="" /></a></p>
<p>The Arrow format (and if you’re already using <code class="language-plaintext highlighter-rouge">pandas</code>, you’re 90% of the way there) provides compatible execution and common-value mapping across analytics platforms like Spark, Drill, or Impala; streaming systems like anything Kafka-shaped; and storage backends like HBase, Cassandra, and my friend, Parquet.</p>
<p>That also means that a common data format that lives in CPU cache, GPU shared memory, on disk, in the cloud or on a stream can be intuited, manipulated, and assessed in any way that suits you, leading to an implementation-agnostic data ecosystem, where your work is not defined by the method of analysis, but by the data that you express.</p>
<p><a href="https://arrow.apache.org/overview/"><img src="https://arrow.apache.org/img/shared.png" alt="shared" /></a></p>
<blockquote>
<p>“But, my data is too complicated for that”</p>
</blockquote>
<p>Possibly! And that’s fine; Arrow is not well equipped for some common use cases, particularly anything transactional or non-idempotent. There’s still active work being done to establish a common graph implementation, for instance, and just last week a paper was released demonstrating an almost-hardware POSIX interface that performs AWS Athena-like operations on local filesystems, leveraging CephFS, Arrow and Parquet.</p>
<p><a href="https://twitter.com/wesmckinn/status/1397912053898256384"><img src="/img/2021/cephfs.png" alt="cephfs" /></a></p>
<p>At the core (and this probably should have come at the top, but still), the fundamental data type architecture on which the whole thing rests is beautifully simple:</p>
<ul>
<li>Scalars:
<ul>
<li>Boolean</li>
<li>[u]int[8,16,32,64], Decimal, Float, Double</li>
<li>Date, Time, Timestamp</li>
<li>UTF8 String, Binary blob</li>
</ul>
</li>
<li>Complex:
<ul>
<li>Struct</li>
<li>List</li>
<li>Map</li>
</ul>
</li>
<li>Compound:
<ul>
<li>Union</li>
</ul>
</li>
</ul>
<p><em>N.b., all values are nullable via a validity bitmask</em></p>
<p>But anyway, to wrap things up, as a ‘reverse summary slide’, here’s the things that I think make Arrow great:</p>
<ul>
<li>Solid stable community and governance model, with leading contributors from more than a dozen major OSS and industrial projects</li>
<li>In-memory model directly maps to transit model (no serialisation overhead)</li>
<li>Common representation model across all major languages (originally Java/C++, but everyone else caught on eventually)</li>
<li>IPC and RPC as first class considerations (i.e. <code class="language-plaintext highlighter-rouge">python</code> + <code class="language-plaintext highlighter-rouge">R</code> playing nicely together locally)</li>
<li>Union schema type permits complex / underspecified data representation (a la <code class="language-plaintext highlighter-rouge">JSONB</code> in <code class="language-plaintext highlighter-rouge">postgresql</code>) (but I still want my graphs dammit…)</li>
<li>Removes pretty much every excuse for using CSV/JSON for Bulk Data Transit/Retention (Hello Open Data NI)</li>
</ul>
<p>And that’s it from me, if I haven’t totally borked the timing, I should have a minute or two for questions, which I probably won’t have the answers for.</p>
<hr />
<h1 id="faqs-because-bolster-doesnt-know-everything"><a href="https://arrow.apache.org/faq/">FAQ</a>’s because Bolster doesn’t know everything:</h1>
<h2 id="-arrow-vs-parquet">∆ Arrow vs Parquet</h2>
<p>Arrow -> in memory, with IPC/RPC/streaming options, uncompressed.<br />
Parquet -> on disk, maximising compression at the expense of read speed.</p>
<h2 id="-arrow-file-vs-parquet-files">∆ Arrow File vs Parquet Files</h2>
<p>Arrow ‘files’ are not really files, but more like mmap-able IPC pipes</p>
<h2 id="wtfeather">WTFeather?</h2>
<p>‘Feather v1’ came before the spec of Arrow ‘files’, Feather v2 === Arrow ‘files’</p>
<h2 id="wtflight">WTFlight?</h2>
<p>“Flight” is a transit protocol for high performance transport of Arrow record batches (initially but not exclusively over gRPC) that yields <a href="https://www.dremio.com/is-time-to-replace-odbc-jdbc">20-50x better performance over ODBC</a> approaches.</p>
<h2 id="-arrow-vs-protobuf">∆ Arrow vs Protobuf</h2>
<p>Protobuf isn’t great for in-stream processing; it prioritises structure serialisability for carrying data over efficient in-memory representation, and its internal representations are not language agnostic</p>
<h2 id="-arrow-vs-flatbuffer">∆ Arrow vs Flatbuffer</h2>
<p>Flatbuffers aren’t great for large structured data that is largely homogeneous (i.e. columns with different but consistent datatypes across rows)</p>
Dr StrangeBot: Or How I Learned to Stop Worrying and Trust Machine Learning2021-03-24T06:06:00+00:00https://andrewbolster.info/2021/03/dr-strangebot-or-how-i-learned-to-stop-worrying-and-trust-machine-learning<blockquote>
<p>This post was originally published as part of my role at WhiteHat Security.
Links have been added for context/comedy/my own entertainment, but no content has been modified.</p>
</blockquote>
<p>Beneath the cynicism, hyperbole, market-making and <a href="https://www.urbandictionary.com/define.php?term=Fud">FUD</a>, the strategic importance of AI in Cybersecurity is only constrained by us ‘meatbags’.</p>
<p>Being a data science practitioner in the cybersecurity space has been a double-edged sword for several years. On the one hand, with the proliferation of automated security testing, network IDS advances, the sheer growth in traffic and the threat surface of our increasingly complex, interconnected application development practices, these roiling oceans of flotsam and datum are everything our data-hungry little hearts desire. Related innovations in data engineering in the past decade mean that questions that had previously only lived in the craven dreams of executive officers and deranged analysts are now the kind of tasks that we hand off to interns to make sure they have correctly set up their workstations.</p>
<p>But this glut of “Big Data” and computational wizardry leads inevitably to the other side of that coin; the zombie-esque re-emergence of casualties from the last “<a href="https://link.springer.com/article/10.1007/s13347-020-00396-6">AI Winter</a>,” proselytising that “now is the time.” Revolutions in highly specific fields like natural language processing and computer vision previously only imagined in big budget Sci-Fi tentpole movie franchises were now accessible with URLs like <a href="https://thiscatdoesnotexist.com/">ThisCatDoesNotExist</a> and QuickChat.ai with links to the code on Github for all to emulate.</p>
<p>“This isn’t your parents AI,” was the rallying call of the entire B2B software engineering industry, “this time it’s different,” and AI would make it all better, and “no-code” AI / ML deep learning adversarial recurrent network solutions on the blockchain were the proverbial White Whales that just needed to be chased through these oceans of data. And finally, after years of promising research, Captain Ahab would have his prize of Human-Like Intelligence, able to take ‘meatbag’ expertise, judgement and wisdom, and scale indefinitely, or as much as your cloud compute budget could tolerate.</p>
<p>“Powered by AI” has become an albatross across many parts of the software engineering industry, nowhere more so than in cybersecurity. Considering the fundamental premise of our industry is ‘computer systems can be bent to induce unintended behavior’, the magic wand of ‘AI’ often ends up being relegated to a square on our socially distanced buzzword bingo cards.</p>
<p>The real opportunity for the techniques pioneered in the ‘Big Data’ and ‘Artificial Intelligence’ research spaces are already well voiced; “joining the best of human and machine intelligence,” but the question of how this is accomplished remains unclear at best and at worst is misleading.</p>
<p>At WhiteHat Security, we have pioneered an <a href="https://algorithmia.com/blog/active-learning-machine-learning">Active Learning</a> approach to our development of machine learning models that opportunistically takes tasks off our security experts’ work queues when a model has confidence in its assessment of a piece of evidence. These items are then either directly and invisibly actioned on behalf of our security team, or, on a probabilistic basis, still sent to our security teams to assess, along with the model’s assessment of that piece of evidence, so that we can cross-verify the ongoing performance of the models under test. This ensures both that our security teams have the most ‘un-boring’ experience possible and that our models receive continual feedback, so that performance or accuracy deviations can be quickly identified, and any models with reduced accuracy can be retrained and the old ones decommissioned rapidly without any loss of security oversight.</p>
<p>Behind this standardized deployment and interaction architecture is a core approach: a “decision support system” built on mutual trust in the Data Science group’s ability to analyse and model data using whatever techniques best fit each scenario’s context. This means that our partners in the rest of the product organization can understand and rely on the “decision support systems” that we as a Data Science group release. Fundamental to this “decision support system” approach is that whatever techniques, tools, strategies, technologies or <a href="https://powerlisting.fandom.com/wiki/Technological_Magic">technomancy</a> are used to pre-process, clean, analyze and train models, their integration is as simple as possible: a decision support system is fed some evidence, and it responds with a set of recommendations and related confidences.</p>
<p>Expressing and exposing these specific confidences as part of the system fosters a form of ‘trust’ between the decision support system and the security practitioners who then make decisions based on that data. And finally, when the decision support systems themselves have conflicting or low confidence in their assessments, not only are these borderline or edge cases raised with the security teams, but they’re also collated by our Data Science team, where they’re analyzed separately; if any patterns can be observed in the ‘confusing’ evidence, these are raised with our R&D and security teams, and new models are trained against these novel findings.</p>
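<p>The routing described here can be sketched in a few lines; to be clear, the names, thresholds and structure below are illustrative assumptions, not WhiteHat’s actual implementation:</p>

```python
import random

# Illustrative values only: tuned per model in any real deployment
CONFIDENCE_THRESHOLD = 0.95   # below this, a human always assesses
AUDIT_RATE = 0.05             # fraction of confident verdicts still cross-checked

def route(evidence, model):
    """Return ('auto', verdict) or ('human_review', verdict) for one item."""
    verdict, confidence = model.predict(evidence)
    if confidence >= CONFIDENCE_THRESHOLD and random.random() >= AUDIT_RATE:
        return ("auto", verdict)  # actioned invisibly on the team's behalf
    # Low confidence, or sampled for audit: queue for a human, with the
    # model's assessment attached so ongoing performance can be verified
    return ("human_review", verdict)
```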
<p>The intent is not to somehow replace or supplant the contextually informed human expert but rather to provide cognitive shortcuts and contextual evidence to empower them to make heuristic decisions on the edges.</p>
<p>AI, ML, Bots, Black Boxes, Decision Support Systems; whatever the phrasing, the place of these technologies in the modern cybersecurity landscape is simple; answer the easy questions for me and get out of the way, or give me enough contextual information and trusted advice to take on the hard questions myself.</p>
<h2 id="author-bio">Author Bio</h2>
<p>Andrew Bolster Ph.D leads the Data Science group in WhiteHat Security. His professional and academic experience spans from teaching autonomous submarines to collaborate on port protection, <a href="https://andrewbolster.info/2017/09/legal-considerations-for-trusted-autonomy.html">establishing guidelines</a> for military application of AI, using biosensors to monitor and communicate human emotions, establishing IEEE standards for applying <a href="https://andrewbolster.info/2020/04/is-your-ai-ethical.html">ethics in AI</a>, and curating data playgrounds for cybersecurity researchers and professionals to experiment with multi-terabyte streaming datasets for product innovation. In his “spare time”, he is a founding trustee of the <a href="https://www.farsetlabs.org.uk/">Farset Labs hackerspace</a>, and on the board of <a href="https://www.vaultartiststudios.com/">Vault Artist Studios</a>, both in Belfast, Northern Ireland</p>
<blockquote>
<p>Final fun note; this was my first time using Github’s VSCode Codespaces as an online Markdown editing environment and it wasn’t a disaster!</p>
</blockquote>
A Stranger in a Strange Land: Data Science Onboarding In Practice2020-10-20T10:03:00+00:00https://andrewbolster.info/2020/10/a-stranger-in-a-strange-land-data-science-onboarding-in-practice<blockquote>
<p>This talk was originally prepared for the 2020 <a href="https://www.nidevconf.com/sessions/andrew-bolster/">Northern Ireland Developers Conference</a>, held in lockdown and pre-recorded in the McKee Room in <a href="https://www.farsetlabs.org.uk/">Farset Labs</a></p>
</blockquote>
<p><a href="https://www.youtube.com/watch?v=8jbA1MAJe3Q"><img src="https://img.youtube.com/vi/8jbA1MAJe3Q/0.jpg" alt="'A Stranger in a Strange Land: Data Science Onboarding In Practice'" /></a></p>
<h2 id="intro">Intro</h2>
<p>Data Science is the current hotness.</p>
<p>While those of us in these virtual rooms may make fun of the likes of Dominic Cummings for extolling a ‘Data Driven Approach’ to policy, the reality is that Data Science as a buzzword bingo term has survived and indeed thrived in a climate where ‘Artificial Intelligence’ is increasingly derided as being something that’s written more in PowerPoint than Python, ‘Machine Learning’ still gives people images of liquid metal exoskeletons crushing powdery puny human skulls, and those in management with long memories remember what kind of <a href="https://economistsview.typepad.com/economistsview/2008/12/quantitative-me.html">mess</a> “Quantitative Analysis” got us into not too long ago…</p>
<p><a href="https://twitter.com/matvelloso/status/1065778379612282885?lang=en"><img src="/img/2020/hqzDHa2mKqFaFwXYVScaExr0nA9wTxLEVaSr7fLl98SnsxoYa3OgtQPhx90rzyGIo90D2zboVcnwHCXk2hFsEQnJlv20EdhFgqd9uYW57DcngrRiO2NB7bFjj0lmZ3wiyo8VyP2Ocg" alt="img" /></a></p>
<p>Way back in 2012, the Harvard Business Review described Data Science as <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">“The Sexiest Job of the 21st Century”</a>, and since then it has been appearing in job specs and LinkedIn posts and research funding applications and business startup prospectuses more than ever.</p>
<p>You’re not really doing tech unless you’ve got a few pet Data Scientists under your wing.</p>
<p><a href="https://nerdarchy.com/dd-ideas-artificer/"><img src="/img/2020/HdDYoDMx9A0qIoRJJF57EW5WHFNUfkZ8C5bcwb3e-ck4nntw_Ch_KAH8CzqLPU6FsUZLepfB-VzDcRsBM2DTeStdf5XS1rJVV1Nw1q9TDMiMIK30jRmUt-FuLvhbKj_q-dFu2CmJkhg" alt="img" /></a></p>
<p>Like some kind of mythical creature, these Data Scientists sit somewhere between Wizards, Artificers, and Necromancers, breathing business intelligence into glass and copper to give the appearance of wisdom from a veritable onslaught of data, wielding swords of statistical <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-tests</a>, shields made of the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Areas Under Curves</a>, and casting magicks of <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a>.</p>
<p><img src="/img/2020/iHRlCltQBUSLHxknWUF4-ctyA7E65b-tHfWU3Rc4Bt9KFtQCP3dJYGke5p9kGQh0sw3wzHvPwGXKwgwRdEFiZhcWnbGlzCntf54JWITuNCHfrG0wnHfiK6ktz1y_wxdTeQsdKQh-" alt="img" /></p>
<p>Like if Tony Stark and Stephen Strange fell into a blender and the Iron Mage appeared, extracting wisdom from the seen and unseen worlds around them and projecting wisdom into the future….</p>
<p>But more often, it’s much more mundane…</p>
<p><a href="https://xkcd.com/1838/"><img src="/img/2020/machine_learning.png" alt="XKCD: Machine Learning" /></a></p>
<p>However, for an organisation attempting to leverage these mythical Data Scientists, how do you introduce, accommodate, and indeed, welcome these new skills into your production data workflows?</p>
<p><img src="/img/2020/wZI5CWfBuebdzKQJOZjfUWsS8o3N0NMen1MfnVrPU5PKCSKtyVXxLLkK1nDPN3uz5y4GwbJpUpfanhv7KMWD78olIqo_SrVg4l7LMPSnA-mjkogQDDqr7J3YB5tN7vT7yC_MVsLA" alt="img" /></p>
<h2 id="whats-this-about">What’s this about?</h2>
<p>In this talk we’ll walk through some of the philosophies I’ve arrived at as someone who started off as a lone Data Scientist and is now transitioning to team leadership, and what tools I recommend to new hires (and intrigued colleagues) for understanding complex production architectures. So, generally, what I wish I knew when I started with modern-ish Data Science workflows.</p>
<p><a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape"><img src="/img/2020/SZEIYRDajtVEkbGxR6_naemsIUJVvNwsnZVycJ-L3a0AHPqEBGgXqNbATppwTOadoNqW2ikgzD3eTM5YD_j1AmjbMshYevwS7Z_eqQkmOi-m1cKGoyUVVQKMvvwuphVpW6UyqI4p" alt="img" /></a></p>
<p>Also, a couple of dodgy stories from over the years of ‘Data Science Gone Wrong’, that will probably get some questions asked, and hopefully not of me.</p>
<p>This isn’t going to be a technical Data Science talk; we’re not opening up <a href="/2018/10/daily-dated-untitled-jupyter-notebooks.html">Jupyter</a> or firing up <a href="https://en.wikipedia.org/wiki/Apache_Spark">Spark</a> or <a href="https://www.tensorflow.org/tutorials">Tensorflow</a> or whatever. We’re not even going to talk about <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptrons</a> or <a href="https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute">Hidden neurons</a> or <a href="https://hackernoon.com/exploring-decentralization-with-homomorphic-encryption-a8295755ba5f">homomorphic cryptography</a>. This is about people, processes, and how to establish a healthy data science culture.</p>
<p>Anyway, who am I to talk about this stuff?</p>
<h2 id="who-am-i--aka-you-can-skip-this-bit">Who am I ? (AKA you can skip this bit)</h2>
<p>My professional background started off by getting robotic dogs to piss on headmasters in front of 200 primary school kids and taking things apart and always having a few screws left over (or loose) at the end.</p>
<p>I eventually turned that “skillset” into something of a trade, by studying <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">electronics and software engineering at Queens</a>.</p>
<p>As part of this I got to test the launch of 4G networks in China from the grey comfort of an office in <a href="https://www.ericsson.com/en/about-us/company-facts/ericsson-worldwide/ireland">Athlone</a>, I moonlit as a technology consultant for a <a href="http://navigatorblue.com/">marketing and advertising firm</a> in Belfast, used massive clusters of GPUs to optimise cable internet delivery, and spent a summer developing BIOSs for <a href="https://www.pcengines.ch/">embedded computers</a> in Switzerland.</p>
<p>After that, and just in time for the financial crisis to make everyone question their career choices, I continued down the academic culvert to do a PhD, stealing shamelessly from the <a href="/2017/09/human-factors-in-autonomous-systems.html">sociologists</a> to make their “science” vaguely useful by teaching autonomous military submarines how to trust each other.</p>
<p>More recently, I worked with a bunch of psychologists and marketers to teach machines how to understand human emotions using biometrics and wearable tech as the only Data Scientist.</p>
<p>This being a small start-up, that meant I did anything that involved Data, so from storage and network administration to statistical analysis, real-time cloud architecture to academic writing, and everything in between. This also somehow involved throwing people down mountains and developing lie detecting underwear. Ahh the joys of Start Ups.</p>
<p>After that I got to be a grown-up Data Scientist working at a cybersecurity firm specialising in real-time network intrusion systems, playing with terabytes of historical and real-time data, trying to read the minds of hackers and script kiddies across the world who are throwing everything they can at some of the internet’s biggest institutions. This was my first taste of being a Data Scientist who wasn’t working completely alone…</p>
<h1 id="what-about-now-aka-start-reading-here">What about now? (AKA ‘Start reading here’)</h1>
<p>After two years there, I got pinched to build a new team within an established cybersecurity group called WhiteHat Security, which had recently been acquired by NTT Security.</p>
<p>We have 15 years of human-expert-trained data on whether and how customer websites can be vulnerable to attack. We have teams of people working 24/7 to try and break people’s websites before ‘the bad guys’ do, to prove that they’re vulnerable. One way or another, we have those footprints of investigation, and the company wanted to start doing something with that data, so it needed a Data Science team.</p>
<p>I’ve been there a year, and since this isn’t officially a sponsored talk I won’t rant; all I’ll say is that I’m still really enjoying the work. Anyway, with all that in mind, I want to look at this ‘How do you spin up Data Science?’ question from three perspectives.</p>
<ol>
<li>Things that made previous “Data Science” roles suck</li>
<li>Methods and approaches that I as an Individual contributor came to use to make my own life easier</li>
<li>Now that I’m leading a team, how I’m trying to put those approaches into practice, and hopefully soliciting advice from you lot too…?</li>
</ol>
<h2 id="what-is-a-data-scientist-really">What is a Data Scientist <u>really</u>?</h2>
<p>For a change, and with a certain sense of irony, Google itself has settled on a pretty decent job description for the field;</p>
<blockquote>
<p>“a person employed to analyze and interpret complex digital data, […], especially in order to assist a business in its decision-making.”</p>
</blockquote>
<p>To me, this definition encapsulates three of what I think are the four key elements of the modern Data Science role, and they’re all the sexy ones.</p>
<p>It says that Data Science sits in between Data Complexity, Business Context and Assistive Communication;</p>
<p><img src="/img/2020/zwN6hbU2O7Q9JapOib78IsuVUWoKfDAABSa7MEUjvtbhnYdILa7wA1scxcCToe3okzl74FVK6gsWrglr2_7Usp5-641GuKnonpEtGDl7ZZGRE1FyQl_pvp397DxUevaET--XlzUvN8A" alt="img" /></p>
<p>The obvious one is the <strong>Complex Data</strong>; you need to be familiar with how to access structured and unstructured data stores, you need to know how to navigate and validate your assumptions about that data, and be aware of techniques and methodologies to abstract or visualise that data.</p>
<p>A fairly common second highlight is the <strong>Communications</strong> aspect; at the end of the day, it’s your job to inform your internal and external customers with an appropriate amount of actionable information so that they can make an informed decision.</p>
<p>But, more subtly than that, you need to be aware of the <strong>Business Context</strong>: what the business as a whole is trying to accomplish, not just the direct requirements that may be foisted on you. Some people call this ‘<a href="https://en.wikipedia.org/wiki/Systems_theory#Systems_thinking">systems thinking</a>’, I call it ‘caring about other people’s work as well as your own’, but each to their own.</p>
<p>As we’ll see later, this is often more important on the ‘interpreting’ side than on the ‘communication’ side…</p>
<h2 id="four-is-a-magic-number">Four is a Magic Number</h2>
<p>So, we have Google’s defined trifecta of Complexity, Communications and Context, but I’d add a fourth that I think is quite overlooked in many ways. In the interests of not breaking anyone’s brains, we’re going to forego the Venn diagrams in favour of bullets…</p>
<ul>
<li>Complexity</li>
<li>Communications</li>
<li>Context</li>
</ul>
<p>Anyway, what’s this fourth theme?</p>
<ul>
<li><u>Continuity.</u></li>
</ul>
<p>Yes, it is a little bit of an alliterative backronym, but when I say Continuity, it has many meanings;</p>
<ul>
<li>Continuity of <strong>operations</strong> through automation and continuous testing.</li>
<li>Continuity of <strong>visibility</strong> enforced by the construction of reproducible reports and continuous dashboarding pipelines.</li>
<li>Continuity of <strong>meaning</strong> by the explicit and near obsessive transparency of recording and sharing assumptions, decisions, experiments, and most importantly, failures.</li>
<li>Continuity of <strong>capability</strong> by having your Data Science operations actually survive your Data Science <a href="https://en.wikipedia.org/wiki/Bus_factor">team being hit by a bus</a>.</li>
</ul>
<p>So, in my contrived setup, we’ve now got <strong>Complexity, Context, Communication and Continuity</strong>.</p>
<p>Great, after 10 minutes, we’ve got a definition. Ish.</p>
<p>Great; move on, Bolster. What does all this mean for someone either getting into Data Science as a career or building out a new capability?</p>
<h2 id="story-time">Story Time</h2>
<p>Before we get into the solutions, I’d like to share a couple of “WTF’s”, and then spend a little bit of time explaining where those WTFs actually came from.</p>
<p>I’ll avoid naming names to protect the guilty, but here are a few beauts in no particular order. I’ll let you be horrified by them en masse, then we can spend a bit of time going through them to understand how they came about.</p>
<h3 id="exhibit-a-the-thing">Exhibit A. The “Thing”</h3>
<p>Once upon a time, a bright-eyed data scientist was exploring a database. This was a mixed Perl/PHP environment that had a lot of the business logic embedded in the production databases.</p>
<p>That isn’t, in itself, a bad thing. What was a bad thing was the ‘thing’ table they discovered: a six-way mapping table between different types of entities from completely different parts of the business logic, including user roles, scheduling specifications, and assessment targets.</p>
<p><img src="/img/2020/F6qXvsJASZfSwM2bu-tKC-CBtaQYnmnh2H8gaLFw1IKaYOGN70LStKdA7eWoduJ87Uxp3-n8gqxI-2Imfy8STRLsb-FEW-sUSfX3H1WIblYP7ZtDVzsB9E9pLsNi0-yaRozEWWWCjAQ" alt="img" /></p>
<p>These were not ‘many to many’ relationships being maintained. No, my friends. This was to map a new global <code class="language-plaintext highlighter-rouge">thing_id</code>. A quick GitHub Enterprise search for <code class="language-plaintext highlighter-rouge">thing_id</code> revealed the horror that had been unearthed. Almost every interaction in the company first queried this table to work out what on earth a given query was talking about, leading to a structure that, after some coaxing, led <a href="https://www.jetbrains.com/datagrip/">DataGrip</a> to spew out this entity relationship diagram.</p>
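<p>To make that concrete, here’s a minimal, entirely hypothetical Python sketch of the anti-pattern (the names, IDs and entity types are invented for illustration; the real schema was far messier): a single global ID space mapped onto unrelated entity types, so every lookup pays an extra level of indirection before it even knows what it’s talking about.</p>

```python
# Hypothetical sketch of the "thing" anti-pattern: one global thing_id
# that maps onto completely unrelated entity types.
THING = {
    1: ("user_role", 17),
    2: ("schedule_spec", 4),
    3: ("assessment_target", 99),
}

# Stand-ins for the real per-entity tables.
TABLES = {
    "user_role": {17: "auditor"},
    "schedule_spec": {4: "weekly"},
    "assessment_target": {99: "example.com"},
}

def resolve(thing_id):
    # Step 1: ask the thing table what kind of entity this actually is.
    entity_type, entity_id = THING[thing_id]
    # Step 2: only now can the real table be queried.
    return TABLES[entity_type][entity_id]

print(resolve(2))  # prints: weekly
```

Every query in the codebase had to perform that first step before it could do anything useful.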
<p>It’s easy to discount this as lazy engineering or an incorrect abstraction, but there are three factors that, while they don’t justify leaving it that way, explain how you could end up there.</p>
<p><u>Factor one</u>: long ago, there was no thing table; the company data architecture was built cleanly and there was no need for such hellishness.</p>
<p><u>Factor two</u>: long ago, certain databases’ foreign-key performance wasn’t particularly great, so multi-entity ‘one to many’ relationships weren’t all that fast.</p>
<p><u>Factor three</u>: long ago, it was recognised that the company could expand some of its capabilities by acquiring a few other companies and integrating their data pipelines into its own.</p>
<p>Now I think we can see the trouble. Long story short, an engineering department was under pressure to deliver on grand promises, and hacked together a solution that reused the previously clean data architecture in… several different ways at once.</p>
<h3 id="exhibit-b-a-role-by-any-other-name">Exhibit B. A Role by any other name</h3>
<p>I was looking to deploy a data pipeline to automate something that had just been a cron job for ages. While I was still upskilling in AWS, identity management appeared to be a massive pain in the behind, so I decided to reuse an existing execution role, <code class="language-plaintext highlighter-rouge">api_injest_ro</code>, and considering this was largely an ingest project, that made sense. I reviewed the decision with my direct superior, who saw no problem with it.</p>
<p>On deployment, my pipeline died instantly as the <strong>entire</strong> company’s global client base started routing traffic through my totally incorrect pipeline instead of the primary ingest node.</p>
<p>This isn’t really a story of a hack; it’s a story of subjectivity: one person’s ingest is another person’s… well, you know.</p>
<p>In this case, the role was not originally intended for API clients trying to read data from our own systems, rather it was intended for accepting data from external API clients sending data into our systems. <em>This intention was not documented anywhere</em>.</p>
<p>The hack was a frankly clever piece of early cloud load balancing, where traffic was routed to the places that responded <u>fastest</u> with the <u>least</u> amount of non-200 responses.</p>
<p>Guess what was the only thing our intrepid data scientist’s pipeline template did?</p>
<p><img src="/img/2020/lZPF89kmNyNrLJYL4KT7GHZ-4ORZQHRJ3Pljpp1KTKz3XFK5w0M1sH7_yCkEcDqMDZGEivWVldi8FiXkPEedkia8L5jV3GQz_bWOl_AlHAYfEs-tquiw1o7iFne1Pf_iznGCjH9w5YY" alt="img" /></p>
<p>(For the record, this is always a terrible idea; your code should fail disgracefully first…)</p>
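<p>The failure mode sketches out like this (hypothetical names and numbers; the real balancer was considerably cleverer): if the router prefers whichever endpoint answers fastest with the fewest non-200 responses, a pipeline that instantly returns a 200 to everything wins all of the traffic, every time.</p>

```python
# Illustrative only: prefer the backend with the lowest non-200 rate,
# breaking ties by latency -- roughly the heuristic described above.
def pick_backend(stats):
    return min(stats, key=lambda name: (stats[name]["non_200_rate"],
                                        stats[name]["latency_ms"]))

backends = {
    "primary_ingest": {"latency_ms": 120.0, "non_200_rate": 0.02},
    # The accidental pipeline: answers everything, instantly, with a 200.
    "data_science_pipeline": {"latency_ms": 15.0, "non_200_rate": 0.0},
}

print(pick_backend(backends))  # prints: data_science_pipeline
```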
<h3 id="exhibit-c-whats-the-difference-anyway">Exhibit C: What’s the difference anyway?</h3>
<p>In another life, our friendly neighbourhood data scientist was doing a cohort analysis.</p>
<p>Participants in two different groups were put under different forms of stimulation, and the question was: what was the quantitative difference in response between the two groups?</p>
<p><img src="/img/2020/bEgdNFa21IlqDjkQWjYdyOylQAI2X8snLKRwCLmyosmkJ1TtoFDjzEpLov95GO3Y2gObzUmk9XGkUkSLX8JGiZxQFNu-fI7SR4mC4Y1e3Ia-Dq0hEZmNp7HYnxh_1wy-_GhLJ0LRisw" alt="img" /></p>
<p>Our scientist took the measurements, assessed the difference between the groups, and delivered the following two messages to the marketing team.</p>
<blockquote>
<p>Group B responded 25% more than group A</p>
</blockquote>
<blockquote>
<p>Group B’s response increased by 5% on average compared to Group A</p>
</blockquote>
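<p>Both statements were factually true of the same data; they just measure different things. With hypothetical response rates consistent with both (say Group A responded at 20% and Group B at 25%), the relative and the absolute differences come apart like this:</p>

```python
# Hypothetical response rates consistent with both statements above.
group_a = 0.20  # 20% of Group A responded
group_b = 0.25  # 25% of Group B responded

# "Group B responded 25% more than Group A": a *relative* difference.
relative_pct = (group_b - group_a) / group_a * 100

# "Group B's response increased by 5%": really 5 *percentage points*,
# an *absolute* difference.
absolute_points = (group_b - group_a) * 100

print(round(relative_pct))     # prints: 25
print(round(absolute_points))  # prints: 5
```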
<p>Time passes. The report is published, and then the calls start.</p>
<p>“Your numbers don’t make any sense? How can you have such a substantial effect? It’s physically impossible for a person to respond that much? You must just be making it up!”</p>
<p>Our data scientist goes to a website to read the completed report for the first time.</p>
<blockquote>
<p>“This product increases &lt;characteristic&gt; by 25%”</p>
</blockquote>
<p>We can laugh about this now, but this is a story of crunch timelines that prioritised speed over clarity, with no review or feedback opportunity for subject matter experts. Our data scientist gave two factual comments on the data from deep in their own trenches, and threw them over the no-man’s-land into the editing trenches. These were then rushed out the door with little to no final review, and by the time the honest misinterpretation was revealed, it was clear that both sides had screwed up.</p>
<h2 id="how-do-you-solve-a-problem-like-data-science">How do you solve a problem like Data Science</h2>
<p>So, those are just a small sample of the challenges that face any data science team, but they’re doozies when taken as abstract examples, and I believe that these at least could have been ‘dealt with’ by some abstract advice.</p>
<ol>
<li>People don’t agree on what words mean, let alone what numbers mean, so don’t assume anything and add your assumptions to any numbers / statements you’re delivering.</li>
<li>A Data Scientist’s job is not done once the number is out the door; you have a responsibility to make sure that whomever you delivered it to is on the same page as you as to its meaning.</li>
<li>Engineering, Strategy, and Innovation operate <em>in tension</em> with each other. Sometimes they speed each other up, occasionally they have to slow each other down. If a decision is made to do the wrong thing quickly instead of the right thing slowly, that needs to be a decision visible across that trifecta. And recorded…</li>
</ol>
<h3 id="thats-nice-and-all-but-how-do-you-actually-do-that">That’s nice and all but how do you actually do that?</h3>
<p>It’s been easy for years to stand up in conferences like this as an individual contributor, start-up data scientist, or solo researcher and wax lyrical about how all the things that other people do are crap and how it’d all be better if they just listened to me. It’s also fun.</p>
<p>However, how do you actually curate the kind of culture I’m talking about? Within a team, within an engineering division (your Data Science team <em>is</em> in your engineering division, right?), within a company, and within a wider data ecosystem?</p>
<p>Well, I’ve been doing this for a year and I don’t think I’ve succeeded yet, but here are some of the things we’re doing in my team to try and foster this, with appropriate redactions made…</p>
<p><img src="/img/2020/soOsf52AM5ntpzCcNWPAHgxlgA2c3YmscQUCZudmPcjoh1GQC7jrjVmXK-SOrG3JDs3qQhmVkDwN9-u4QJXIYUjbZid0DYjtvdIko4YMB5Hrqgf4FtYTYTI5IVSDTZYdmSxsQprPdKs" alt="img" /></p>
<p><strong>A Seat at the Table</strong>: As we’ve seen, the most challenging part of a Data Scientist’s job is often interpreting and ingesting data from systems that were never designed to be accessed in the weird and wonderful ways they want. Data Science has to have a seat at the Engineering Architecture table, both to manage expectations and to highlight premature abstractions or constraints that might later cause a massive headache for analysis but are simple to think about early on.</p>
<div style="position:relative; padding-bottom:calc(76.67% + 44px)"><iframe src="https://gfycat.com/ifr/UltimateJoyfulAcornweevil" frameborder="0" scrolling="no" width="100%" height="100%" style="position:absolute;top:0;left:0;" allowfullscreen=""></iframe></div>
<p><strong>Transparency</strong>: Teams are encouraged to share both their successes and their failures in the open with the rest of the company, to discuss their work in progress openly in our team Slack, and to cycle in subject matter experts from across the company to contribute to the discussion. That way we can test our assumptions early and often, so whether you’re a greenhorn statistician or a distinguished engineer, you can ask ‘stupid’ questions without any fear of backlash.</p>
<p><img src="/img/2020/6ZvdKr0jvMUTJzbndQXRqhYDIj4sdCj-JWl22N8lr5Oqp04vw4xIBSCGtdrfgwQ_f8rnoRl8LAGuFTwiDaekYObIuk6v4lHrVO-fY2BBgZRClBZ4d6Qscu3q7pYAVSxxtrdtAiPYU7g" alt="img" /></p>
<p><strong>Bar-stool diversity</strong>: This is one of the only failures I’ve had so far in this role. I started off being ‘given’ a pair of extremely experienced engineers who knew the platform inside and out, but not so much the analytical rigour or the statistical operations. My first attempted hire was a talented neurobiologist. Unfortunately this was rejected above my head because they “didn’t have enough programming experience”. My internal response was “Yeah, doi, that’s kinda why I wanted them”. I ended up hiring a statistician who’d done some R and some Python, and proceeded to beat the R out of them. Anyway, back to the point: Data Science is a field that thrives on questioning and different perspectives, and if all you have is one leg, you’re gonna fall over.</p>
<p><img src="/img/2020/t6k_UM87pcFngC-TQkZEVvIM_H2-WPPFF9zN436GvyY5RhNZ1jEl49ZH3-qes4bFBUTbhn7EcfM8MQZPk-Tqt17x041mZbUEoGStnlI7Ad2oXQksK0_hBJ4jah77eXYm2wS7oqqDygc" alt="img" /></p>
<p><strong>Defended empowerment</strong>: part of my responsibility as a team lead is to give my team cover, both from management noise but also from vexatious questions; our team is doing great things because of the deep and wide knowledge embodied in it, and I don’t want to waste that strength answering questions from colleagues who haven’t read our reports or done any of their own research.</p>
<p>So I field those calls, and if I can’t point to a part of a report, document, or code that explains the question, I add it to my own to-do list to explain it and update the documentation, pending review from the original contributor.</p>
<p>So that’s it; those are my principles for establishing and running a high-performing Data Science team: get good, diverse people; encourage their curiosity by giving them the freedom to talk to anyone; encourage them to share their successes and their failures; and cover their ass from all the stuff that gets in the way, while making sure their voice is heard at the highest levels as an equal partner.</p>
<h3 id="and-finally">And Finally</h3>
<p><img src="/img/2020/image-20201019130301677.png" alt="image-20201019130301677" /></p>
<p>Just for those who wanted to get the Tools discussion, here’s my recommended stack;</p>
<ul>
<li><a href="https://jupyter-docker-stacks.readthedocs.io/en/latest/">Jupyter Stacks</a> running in <a href="https://docs.docker.com/get-docker/">Docker</a> with <a href="https://stackoverflow.com/questions/23439126/how-to-mount-a-host-directory-in-a-docker-container">folder mapping to userland</a> for exploratory stuff, although my team are currently looking at moving all our exploratory analysis to <a href="https://azure.microsoft.com/en-gb/">Azure</a>, and <a href="https://azure.microsoft.com/en-gb/services/databricks/">Azure’s Databricks</a> looks like a drop-in replacement, with the added benefit of the team being able to work in their own environments if they have a preferred stack themselves.</li>
<li><a href="https://www.metabase.com/">Metabase</a> for exploratory data collation as a team (also does particularly well at introspecting on what should be foreign keys but aren’t)</li>
<li><a href="https://www.jetbrains.com/datagrip/">Jetbrains DataGrip</a> for, well, basically anything that it supports</li>
<li>If you don’t have a <a href="https://www.alation.com/blog/what-is-a-data-catalog/">data catalogue</a> and a data glossary, you don’t have data. <a href="https://docs.microsoft.com/en-us/azure/data-catalog/overview">Azure Data Catalog</a> is very good for both of these, including metadata tagging and the ability to make people outside your team ‘admins’ on particular terms. <a href="https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html">AWS Glue</a> does similar but is more internally focused</li>
<li><a href="https://pandas.pydata.org/">Pandas</a> goes without saying, but I would flag that <a href="https://plotly.com/python/plotly-express/">Plotly Express</a> and its Jupyter integrations are looking awesome. If you’ve ever played with in-notebook interactive graphing and found it frustrating, try it again.</li>
<li>And finally, a personal favourite; <a href="https://github.com/Valassis-Digital-Media/nbconflux"><code class="language-plaintext highlighter-rouge">nbconflux</code></a>, an extension to push <a href="https://jupyter-notebook.readthedocs.io/en/stable/">Jupyter notebooks</a> up to <a href="https://www.atlassian.com/software/confluence">Atlassian Confluence</a>, so that people outside your analysis environments can work out what the hell you’re talking about and what assumptions you made.</li>
</ul>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Data Science sits somewhere between Engineering, R&D, and Management.</p>
<p>Most people think it’s either magic or it’s going to steal their job or both.</p>
<p>For all the talk of Data Science being about technology, so far I’ve had to learn more about the human side than the technical side.</p>
<p>But, as ever, your mileage may be non-deterministic.</p>
<p>Thank you for your time.</p>
Tell me about your Programmer - Robopsychologist and other careers that don't exist (yet)2020-10-19T10:34:00+00:00https://andrewbolster.info/2020/10/tell-me-about-your-programmer<blockquote>
<p>This talk was originally prepared for <a href="https://hopin.to/events/ni-raspberry-jam-nidc">NI Raspberry Jam’s Kids Track</a>, associated with the full <a href="https://www.nidevconf.com/">Northern Ireland Developers Conference</a>, held in lockdown and pre-recorded in the McKee Room in <a href="https://www.farsetlabs.org.uk/">Farset Labs</a></p>
</blockquote>
<p><a href="https://www.youtube.com/watch?v=Z1c4KAESO0s"><img src="/img/2020/0.jpg" alt="Actual Recording of This Transcript" /></a></p>
<p><img src="/img/2020/image-20201019102913292.png" alt="image-20201019102913292" /></p>
<p>In <a href="https://en.wikipedia.org/wiki/Isaac_Asimov">Isaac Asimov’s stories</a>, the technical, social and personal impacts of advanced robotics and artificial intelligence are explored. One creation in his books was the career of “<a href="https://en.wikipedia.org/wiki/Robopsychology">Robopsychologist</a>”, a combination of mathematician, programmer, and psychologist, who diagnosed and treated misbehaving AI. In this talk we’ll discuss how on earth you can prepare for a career in Robopsychology and other careers that don’t exist (yet).</p>
<p>Isaac Asimov is primarily known for being one of the most prolific and impactful science fiction writers ever, and as you would expect, while wandering around these fictional worlds he came up with a few science-y-sounding mumbo-jumbo terms such as ‘positronic’ and ‘psychohistory’; he is literally the father of the word ‘robotics’.</p>
<p><img src="/img/2020/image-20201019104244483.png" alt="image-20201019104244483" /></p>
<p>He first used the term in his 1941 story ‘<a href="https://en.wikipedia.org/wiki/Liar!_(short_story)">Liar!</a>’, about a robot called ‘Herbie’ that develops telepathic abilities and can read people’s thoughts. However, because the robot’s core operating principles, or ‘programming’, still included the First Law of Robotics (that is, not to hurt people), Herbie starts lying to people to make them feel happy.</p>
<p>Herbie thus leads a robopsychologist called <a href="https://en.wikipedia.org/wiki/Susan_Calvin">Susan Calvin</a> to believe that a co-worker fancies her, and when she finds out that this isn’t true, and that Herbie told her so only because it knew the idea would make her feel better, she is very hurt.</p>
<p>This ‘First Law’ is part of the <a href="https://en.wikipedia.org/wiki/Three_Laws_of_Robotics">‘Three Laws of Robotics’</a>, which were officially codified the following year as:</p>
<ol>
<li>A robot may not injure a human being or, through inaction, allow a human being to come to harm.</li>
<li>A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.</li>
<li>A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.</li>
</ol>
<p>These simple ‘laws’ that were programmed into all robots, collided with the very human tendencies of wanting to be loved, and created this imaginary field of research called ‘Robopsychology’.</p>
<blockquote>
<p>Robopsychology - “the study of the personalities and behaviour of intelligent machines”</p>
</blockquote>
<p>This is a field of research that, today, doesn’t exist, but I first read about it in 2014, on a packed commuter train in California, having just left Google’s Mountain View complex where I’d met a university friend who was a programmer there.</p>
<p><img src="/img/2020/RkUMAXrjqJ6W2XHLgUSnno2OaOz2Hsm6KZJ3rJ7-MMA8cCXF2_6UZ8-OSk9wT1Er-5BjLps3ZSLw-LUQ97PR8F7x0S_9ggIa6UEj--jzWx9oQ5pXx7xgL1GZC5F_uLNAPUIwCfWsuFY" alt="img" /></p>
<p>I had spent the previous 5 years in a Masters programme at <a href="https://www.qub.ac.uk/">Queen’s University Belfast</a> studying Electronics and Software Engineering, a course that’s now called <a href="https://www.qub.ac.uk/courses/undergraduate/software-electronic-systems-engineering-year-in-industry-meng-gh68/">Software and Electronic Systems Engineering</a> (that I would highly recommend, ask me afterwards…)</p>
<p><img src="/img/2020/image-20201019104613554.png" alt="image-20201019104613554" /></p>
<p>It focused on the overlap between two fields that, in my eyes, were obviously one bigger field; How to put lightning in glass to add numbers together, and how to turn adding numbers together into thinking. The physical world of computation and the virtual world of programming. And it fascinated me.</p>
<p>I wasn’t a particularly stellar student, I asked lots of awkward questions and never did particularly well in exams, and I had just resigned myself to going down the ‘standard engineering route’ of ‘Get Degree, Join Big Company, Write Banking Software for 30 years, Retire’, but my discovery of this weird term ‘Robopsychology’ kicked me in the behind.</p>
<p>After that, I gave up the prospective Banking Analyst job and took up postgraduate research exploring how smart submarines collaborate and interact with each other for environmental and military applications, including how to hide nuclear submarines using sound and how to use atomic clocks to build an underwater GPS system, all while assuming that someone can take control of one or more of your submarines and make them ‘lie’.</p>
<p><img src="/img/2020/image-20201019104714938.png" alt="image-20201019104714938" style="zoom:80%;" /></p>
<p>The research eventually became too classified for me to continue working on it but I know that I contributed to international agreements on how autonomous systems are allowed to integrate into military chains of command.</p>
<p><img src="/img/2020/LndCkwguzD58UAxk89KawzdwQQ_2HcDmR7YfnQGQZeVCjrPjtSjTdRxtI4E_EIt16-0hnx28qDMUU0n4HjaoGMsukafJDrFIiwryq20jOlu3efuxldkmxm-fBszYqV_copfboZkKoa4" alt="img" /></p>
<p>This interplay between how fixed, rule based, programmed systems like computers and robots, and the fuzzy, fluffy, mushy stuff that comes from people and communities, has driven my career since then.</p>
<p><a href="https://www.broadcastnow.co.uk/tech/red-bull-creates-vr-showing-real-time-emotions-feedback/5122720.article"><img src="/img/2020/bH66fOgnl46z2uM78mR9JWR63O3Kp1vCvRqvITcA2zy2me2rSN1rpNJetbhk9fI-M4Ke27TMOwvllYhJCRs8opLHPa4rWDfjVoJ6_XDp3G0Jocr1Lmqw3NBmmQ4G3lY3e8VH2XYKL3k" alt="img" /></a></p>
<p>I spent two years developing smart watch applications that could tell how stressed you were, culminating in developing particularly shameful lie-detecting underwear, as well as a survey system for a deodorant manufacturer that only asked you questions about the product when it knew you were sweating.</p>
<p>During this time we developed systems to translate emotions from heart rates to words to colours to sounds and back again, a universal translator for emotions.</p>
<p>After that, I leapt back into cybersecurity, developing smart algorithms that watched the world’s internet traffic, sniffing out hackers, trying to predict their next moves and detect the faintest whiffs of exploitation, fully aware that the hackers were doing exactly the same thing on the other side; automating intelligence, teaching that lightning to think, teaching that lightning to think for them.</p>
<p><img src="/img/2020/sBxuOZAaD9QSRrXPLNHp7OwlB_lGzR0rDfsPfy6dvUenK4vNgk8_MVBaK4FEuFBgpVf3-Dk71Zz8eI9hzoFJ-j4C5dbkPHne6Ny7mDAdEahDordmIQy95IWxYFfRPj-b0-dbRZYqrBU" alt="img" /></p>
<p>In my current role, I lead a team of Data Scientists, a term that didn’t really exist when I was on that train in California only six years ago. We develop and monitor intelligent systems that watch company websites for security vulnerabilities. My day job is to work out better ways to pretend to be a hacker, and to automate the boring bits of the work of the professional <a href="https://www.whitehatsec.com/">WhiteHat</a> hackers I work with.</p>
<p>When I was your age, those jobs didn’t exist. The internet as we know it today didn’t exist. We didn’t even know what we didn’t know. So when my careers teacher told me in 2004 that I should look at being an insurance adjuster “because you’re good with numbers”, she didn’t know that that job would basically be automated out of existence.</p>
<p>So I ended up being a Data Scientist. Not because it’s what my careers teacher or parents or friends told me, not because it was on some ‘skills and employability map’, and not because of the output of some assessment tests. The job role simply didn’t exist.</p>
<p>And I guarantee that most of you watching now will end up working in and creating jobs that simply don’t exist today.</p>
<p>That could be <a href="https://www.bitfarming.co.za/">bit-farming</a> or <a href="https://blackwellglobal.com/top-10-crypto-influencers/">crypto-influencer</a> or <a href="https://en.wikipedia.org/wiki/Quantum_annealing">quantum annealer</a> or, indeed, <a href="https://www.therobotreport.com/synthetic-psychology-understanding-human-behavior-through-robotics/">robopsychiatrist</a>. So as you go through your studies, don’t allow yourself to fixate or judge yourself against what jobs are out there now.</p>
<p>Your parents and your teachers and your friends genuinely want the best for you, so they will suggest and encourage you to follow certain paths, generally because it’s advice they wish they could give themselves 20 years ago based on their own experiences. But the thing is, the past 20 years was theirs, the next 20 years is yours.</p>
<p>There are no robopsychiatrist jobs out there. Yet.</p>
<p><img src="/img/2020/image-20201019104933566.png" alt="image-20201019104933566" /></p>
<p>Build your own paths and experiences, read widely, care deeply, and don’t be afraid of being ‘directionless’ or meandering. Because if you make your own luck, you might just end up in the right place at the right time and with the right skills to realise you’re being lied to by a telepathic robot.</p>
<p>Thank you for your time, and I’m happy to take any questions. (Online or offline!)</p>
<p>Also, <a href="https://www.whitehatsec.com/company/careers/">I’ll still be hiring in a few years</a>, so if you wanna join me, gimme a shout.</p>
<blockquote>
<p>The best way to contact Andrew if you want a short answer is <a href="https://twitter.com/bolster">Twitter</a>, and if you want a longer answer that may take several days, tweet him for his email address (or find it yourself :wink:)</p>
</blockquote>
Back to reality2020-09-08T15:22:00+00:00https://andrewbolster.info/2020/09/back-to-reality<p>Realised it’d been a while, thought it was time for an update. I’m gonna try and make an effort to get closer to an ideal of something like #weeknotes, as I’ve always been jealous of <a href="http://www.mcqn.net/mcfilter/">Adrian’s</a> ability to put his thoughts down and get them out the door in a sensible way…</p>
<p>Anyway, use the Table of Contents above to skip to bits you may care about.</p>
<h2 id="farset-updates">Farset Updates</h2>
<p>So, this was the actual reason for this update but it started a weird thought process that led to this blog.</p>
<p>Post-renovation work at Farset has been the definition of a group effort; everyone’s been doing their bit to make small improvements as they go along (while being COVID-safe…), but we’ve been terrible at collecting and sharing those updates, so people could easily be forgiven for thinking that bugger all has happened since COVID.</p>
<p>Art has been essential in encouraging people to throw post-its around the space with ideas for people to pick up. I didn’t take photos of all of them but here’s the ‘master panel’ in the entrance way.</p>
<p><img src="/img/2020/2020-09-08 12.38.27.jpg" alt="2020-09-08 12.38.27" /></p>
<p>I’m only gonna include stuff that I did today after not having been in the space for a month, but I encourage everyone contributing to take the credit and share what you do to fix and grow our wee space in these times.</p>
<ul>
<li>
<p>We’ve got working calipers (and a stash of relevant batteries)</p>
<p><img src="/img/2020/2020-09-08 12.53.51-1599575484372.jpg" alt="2020-09-08 12.53.51" /></p>
</li>
<li>
<p>Coat trees in both desk-spaces (which are ‘gentrified’ according to one member :joy:)</p>
<p><img src="/img/2020/2020-09-08 12.00.49.jpg" alt="2020-09-08 12.00.49" /></p>
</li>
<li>
<p>Managed to unpack, blade, stash and label the Circular Saw that we’d ordered before I left (Do not attempt to use unless you really know what you’re doing, do not work alone, etc, etc.)</p>
<p><img src="/img/2020/2020-09-08 12.55.48.jpg" alt="2020-09-08 12.55.48" /></p>
</li>
<li>
<p>As part of our newly-ratified <a href="https://www.farsetlabs.org.uk/coronavirus">Guest policy</a>, got a nicely hazard taped visitors book with appropriate warnings<img src="/img/2020/2020-09-08 11.13.51.jpg" alt="2020-09-08 11.13.51" /><img src="/img/2020/2020-09-08 11.13.46.jpg" alt="2020-09-08 11.13.46" /></p>
</li>
<li>
<p>A Dremel-esque workstation got set up on the workbench (mostly just to match with its friendly drill press)</p>
<p><img src="/img/2020/2020-09-08 12.56.10.jpg" alt="2020-09-08 12.56.10" /></p>
</li>
<li>
<p>Fixed the heights of the rolling drawers we’d ordered ages ago to see if they fit under the workbenches (they do) so will take a proposal to the board about getting a load more to help with organising the workshop</p>
<p><img src="/img/2020/2020-09-08 12.46.17.jpg" alt="2020-09-08 12.46.17" /></p>
</li>
<li>
<p>Put the clock up in the downstairs coworking space (all the other spaces already had one from a previous pass)</p>
<p><img src="/img/2020/2020-09-08 13.07.48.jpg" alt="2020-09-08 13.07.48" /></p>
</li>
</ul>
<p>Not shown; all the other awesomeness others have done. This is just for my own sake and because doing updates in our Slack doesn’t exactly provide much visibility to the outside world.</p>
<p>Previous interesting highlights not from today’s work;</p>
<ul>
<li>
<p><a href="https://discourse.farsetlabs.org.uk/t/event-space-streaming-av-setup/26">Updated AV setup in the Event Space</a></p>
</li>
<li>
<p>We’re getting a laser cutter but we’re not talking too much about details till it’s all confirmed</p>
</li>
<li>
<p>Not our work, but there’s an EV charging point outside our space, so that might have to get listed on the website at some point…</p>
</li>
<li>
<p>Loads of social distancing signage (and Art’s fantastic signage templates; but no spoilers)</p>
</li>
<li>
<p>A couple of cast-able screens that we’re not entirely sure where to mount yet (ideas welcome)</p>
<p><img src="/img/2020/2020-07-23 17.58.27.jpg" alt="2020-07-23 17.58.27" /></p>
</li>
</ul>
<h2 id="lockdown-travel-coronavirus-and-moving">Lockdown, Travel, Coronavirus, and Moving</h2>
<p>Everyone and their aunt has done a COVID lockdown stress post so here’s some highlights for myself.</p>
<ul>
<li>Got stuck with my partner during lockdown as it made more sense to stay at my place as I have better access to Deliveroo; haven’t killed each other yet, although it’s been close a few times…</li>
<li>Work has migrated pretty well, some teething problems but nothing that can’t be managed. Still enjoying the work and the opportunities, and pushing for them to take a more active role in the now estranged Belfast/ NI tech scene, encouraging people to submit personal talks to <a href="https://www.nidevconf.com/">NIDC</a> (Come see mine!) etc and basically ‘get out there’ more, but with limited local management beyond team leads like me acting alone, getting sign off can be slower than I’d like.</li>
<li>Took said partner to France to visit my parents in a village in the deep south (<a href="https://www.google.com/maps?client=firefox-b-d&q=herault&um=1&ie=UTF-8&sa=X&ved=2ahUKEwja1brW09nrAhV5SBUIHeN1BSEQ_AUoAXoECB4QAw">Hérault-ish</a>) after persistent chastising; isolated in the village for 3 days before the locals would even consider talking to us as we were ‘unclean’, but once we got past that barrier, spent the next two weeks enjoying ourselves, me, mostly enjoying wine and cheese, and my partner, mostly enjoying the pool of a friend of the family and posting cheesy artsy videos on instagram….</li>
</ul>
<p><img src="/img/2020/image-20200908141016712.png" alt="image-20200908141016712" style="zoom:50%;" /></p>
<ul>
<li>
<p>As you may know/remember, France was taken off the ‘<a href="https://www.bbc.co.uk/news/uk-53774955">green list</a>’ while we were over, but we were aware of this possibility and took the risk anyway; we got caught out by a feckless reactionary government response that was locking out countries while opening up schools and pubs. Good job :man_shrugging:</p>
</li>
<li>
<p>So, we returned when we’d planned to, and locked ourselves down as per the guidelines. Fortunately, I’d gotten an Index just before we left, so I had fun playing with <a href="https://store.steampowered.com/app/546560/HalfLife_Alyx/">HL:Alyx</a>, and yes, it’s as good as everyone says. Partner prefers <a href="https://store.steampowered.com/app/418650/Space_Pirate_Trainer/">Space Pirate Trainer</a> though</p>
<p><img src="/img/2020/VID_20200727_233315-ANIMATION.gif" alt="VID_20200727_233315-ANIMATION" /></p>
</li>
<li>
<p>Some may know that spending 8 months in lockdown in my Harry Potter style ‘cupboard under the stairs’, standing desk or no, has taken a bit of a toll, and I’d been keeping an eye out for interesting properties. I’ve been doing this for years with weekly digests from <a href="https://www.propertypal.com/">PropertyPal</a>, but had only been even vaguely excited about one in months, until one strange one popped up, right in the middle of our isolation… We dispatched a <a href="https://twitter.com/ben_bland">friendly telepresence robot</a> to take a video walk-through for us, and in a day of bouncing back and forth, it was ours. We move in next month. Best bit? It has both a study (aka ‘bedroom two’) and, most shockingly for that area of Belfast, a garage, which I’m hoping to turn into a bit of a carpentry setup… to then build a bar, since IMO in this whole post-COVID world, everyone going to the bar together and mixing with loads of different people may be a thing of the past for a while… So why not have a few folks round for a few tins in style!</p>
<p><img src="/img/2020/ezgif.com-gif-maker.webp" alt="ezgif.com-gif-maker" /></p>
</li>
</ul>
<h2 id="upcoming-projects">Upcoming Projects</h2>
<p>So far I’ve committed to;</p>
<ul>
<li>Doing a half day workshop on <code class="language-plaintext highlighter-rouge">pandas</code> and <code class="language-plaintext highlighter-rouge">arrow</code> for the NIDC pre-workshops on the 8th of October
<ul>
<li>And then reusing said workshop for UUJ’s Data Analytics MSc Guest Lecture series….</li>
</ul>
</li>
<li>Doing a talk on Data Science Onboarding Best Practices for NIDC proper</li>
<li>Finishing editing a very drunken <a href="https://datadelinquents.dev/">Data Delinquents</a> ‘podcast’</li>
<li>Getting back to the funders of <a href="https://blog.farsetlabs.org.uk/2020/03/data-art-ni-open-datani-outreach-funding/">DataArtNI</a> about our plans to restart it (a programme that was painfully confirmed just in time to be locked down <em>hard</em>)</li>
</ul>
<p>Wee buns…. :grimacing:</p>
<h2 id="conclusion">Conclusion</h2>
<p>I hope to get better at this ‘updating’ stuff as a form of journaling, because, if lockdown has taught me anything, time has no meaning, memory is fallible, and if you don’t record things, they’re lost to time and <em>really</em> difficult to reverse engineer when you realise you should have written it down six months later….</p>
Merging Git Repos for Archival Purposes2020-06-23T13:32:00+00:00https://andrewbolster.info/2020/06/merging-git-repos-for-archival-purposes<h1 id="tldr">TL;DR</h1>
<p>I had reason to want to combine git repos into one big repo consisting of repos in their own folders, while ideally maintaining the histories of all those repos for archaeological purposes.</p>
<p>There are many reasons why someone would <em>want</em> to do this, and my specific use case isn’t relevant. Good luck.</p>
<script src="https://gist.github.com/andrewbolster/2ed60be3592c41c9123b5c0b764dea4c.js"></script>
<h2 id="why-so-complicated">Why so complicated?</h2>
<ul>
<li>‘Hidden’ files (dotfiles) suck</li>
<li>Shell Wildcards suck</li>
<li>Wildcards with selective exclusions (i.e. <code class="language-plaintext highlighter-rouge">.git</code>) suck</li>
<li>File names with spaces suck</li>
<li>Trailing Slashes suck</li>
<li>Rewriting History sucks</li>
</ul>
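<p>To see what the script below actually does without the argument plumbing, here’s a minimal, self-contained sketch of the history-rewriting half (not from the original gist; the repo and <code class="language-plaintext highlighter-rouge">origins/…</code> path names are invented for illustration): every commit is rewritten so its tree lives under a subdirectory, with history preserved. Run it somewhere disposable.</p>

```shell
# Sketch only: rewrite every commit so the whole tree lives under a
# subdirectory, preserving history. All names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git prints a scary warning (and sleeps) without this:
export FILTER_BRANCH_SQUELCH_WARNING=1

cd "$(mktemp -d)"
git init -q demo
cd demo
echo hello > a.txt
git add a.txt
git commit -qm "first"
echo world > b.txt
git add b.txt
git commit -qm "second"

# Rewrite every commit so its files sit under origins/example.com/alice/demo/
git filter-branch --tree-filter '
  dest=origins/example.com/alice/demo
  mkdir -p "$dest"
  for f in *; do
    case "$f" in origins) ;; *) mv "$f" "$dest"/ ;; esac
  done
' -- --all

git ls-tree -r HEAD --name-only   # everything is now under origins/...
```

<p>Note that newer versions of git steer you towards <code class="language-plaintext highlighter-rouge">git filter-repo</code> instead of <code class="language-plaintext highlighter-rouge">filter-branch</code>, but for an archival one-off the above still works.</p>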
<h3 id="raw-version-because-i-dont-trust-gist-and-embeds-and-such">Raw version because I don’t trust GIST and embeds and such….</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
usage<span class="o">()</span> <span class="o">{</span>
<span class="nb">cat</span> <span class="o"><<</span> <span class="no">EOF</span><span class="sh">
This script imports a git repo (accessible from https://</span><span class="se">\$</span><span class="sh">origin/</span><span class="se">\$</span><span class="sh">user/</span><span class="se">\$</span><span class="sh">repo) and all its history as subdirectory of a destination (available locally at </span><span class="se">\$</span><span class="sh">dest)
It is designed for non-production, archival processes and may destroy everything you've ever loved because you looked at it funny. You have been warned.
The structure of the destination will end up something like this:
~/src
- </span><span class="se">\$</span><span class="sh">dest
- origins
- </span><span class="se">\$</span><span class="sh">origin
- </span><span class="se">\$</span><span class="sh">user
- </span><span class="se">\$</span><span class="sh">repo
Required Arguments:
-u|--user: The user that owns the repo to be imported
-r|--repo: The name of the repository to be imported
-d|--dest: The local name of the destination repository (assumed to be under ~/src)
-o|--origin: The git server that is the origin of the repo to be imported
</span><span class="no">EOF
</span><span class="o">}</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$# </span><span class="nt">-lt</span> 8 <span class="o">]</span><span class="p">;</span> <span class="k">then
</span>usage
<span class="nb">exit </span>1
<span class="k">fi
while</span> <span class="o">[[</span> <span class="s2">"$#"</span> <span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">do
case</span> <span class="nv">$1</span> <span class="k">in</span>
<span class="nt">-u</span><span class="p">|</span><span class="nt">--user</span><span class="p">)</span> <span class="nv">user</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-r</span><span class="p">|</span><span class="nt">--repo</span><span class="p">)</span> <span class="nv">repo</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-d</span><span class="p">|</span><span class="nt">--dest</span><span class="p">)</span> <span class="nv">dest</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="nt">-o</span><span class="p">|</span><span class="nt">--origin</span><span class="p">)</span> <span class="nv">origin</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span><span class="p">;</span> <span class="nb">shift</span> <span class="p">;;</span>
<span class="k">*</span><span class="p">)</span> <span class="nb">echo</span> <span class="s2">"Unknown parameter passed: </span><span class="nv">$1</span><span class="s2">"</span><span class="p">;</span> usage<span class="p">;</span> <span class="nb">exit </span>1 <span class="p">;;</span>
<span class="k">esac</span>
<span class="nb">shift
</span><span class="k">done
</span><span class="nv">tmp</span><span class="o">=</span><span class="s2">"/tmp/_</span><span class="k">${</span><span class="nv">dest</span><span class="k">}</span><span class="s2">_tmp"</span>
<span class="nb">echo</span> <span class="s2">"Importing </span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">/</span><span class="nv">$repo</span><span class="s2"> into </span><span class="nv">$dest</span><span class="s2">"</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/src/<span class="nv">$repo</span>
<span class="nb">cd</span> ~/src
git clone https://<span class="nv">$origin</span>/<span class="nv">$user</span>/<span class="nv">$repo</span>
<span class="nb">cd</span> <span class="nv">$repo</span>
git filter-branch <span class="se">\</span>
<span class="nt">--tree-filter</span> <span class="s2">"mkdir -p </span><span class="nv">$tmp</span><span class="s2">/origin; git ls-files | cpio -pdumB </span><span class="nv">$tmp</span><span class="s2">/origin; git ls-files | xargs -d '</span><span class="se">\n</span><span class="s2">' rm -r; find . -type d -empty -delete; mkdir -p origins/</span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">; mv </span><span class="nv">$tmp</span><span class="s2">/origin origins/</span><span class="nv">$origin</span><span class="s2">/</span><span class="nv">$user</span><span class="s2">/</span><span class="nv">$repo</span><span class="s2">/"</span><span class="se">\</span>
<span class="nt">--tag-name-filter</span> <span class="nb">cat</span> <span class="nt">--prune-empty</span> <span class="nt">--</span> <span class="nt">--all</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then</span>
<span class="c">## WAIT PATIENTLY</span>
<span class="nb">cd</span> ../<span class="nv">$dest</span>
git remote add <span class="nv">$repo</span> ../<span class="nv">$repo</span>
git fetch <span class="nv">$repo</span> <span class="nt">--tags</span>
git merge <span class="nt">--allow-unrelated-histories</span> <span class="nv">$repo</span>/master <span class="c">#You're on your own if you want a different / multiple branch(es)...</span>
git remote remove <span class="nv">$repo</span>
<span class="k">else
</span><span class="nb">echo </span>failed <span class="k">for</span> <span class="nv">$user</span>/<span class="nv">$repo</span>
<span class="k">fi</span>
</code></pre></div></div>
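<p>The merge half of the script boils down to <code class="language-plaintext highlighter-rouge">git merge --allow-unrelated-histories</code>, which can be demonstrated standalone (repository names here are invented for the demo; <code class="language-plaintext highlighter-rouge">FETCH_HEAD</code> is used so it doesn’t matter whether the imported repo’s default branch is <code class="language-plaintext highlighter-rouge">master</code> or <code class="language-plaintext highlighter-rouge">main</code>):</p>

```shell
# Sketch only: stitch two unrelated repositories into one history.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
for r in repo_a repo_b; do
  git init -q "$r"
  ( cd "$r"
    echo "$r" > "$r.txt"
    git add .
    git commit -qm "init $r" )
done

cd repo_a
# Fetch the other repo's HEAD directly; no remote bookkeeping needed.
git fetch -q ../repo_b HEAD
# The crucial flag: git normally refuses to merge histories that share
# no common ancestor.
git merge --allow-unrelated-histories -m "import repo_b" FETCH_HEAD
ls   # both repo_a.txt and repo_b.txt are now present
```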
Pragmatic Data Science; When Unstoppable Math meets Immovable Ethics2020-05-22T13:35:00+00:00https://andrewbolster.info/2020/05/pragmatic-data-science<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTmW2Ls_Qvj0KGH-7wJvWxPyHGJnwf-Den52J5dm-ejF2WQeXnUEqDPv98uT--mt_WwsgQR6vEh30w6/embed?start=false&loop=true&delayms=0" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<p><em>This is a Rough Transcript from <a href="https://www.meetup.com/devbash/events/270315785/">The Virtual Bash on Ethics</a></em></p>
<h1 id="introduction">Introduction</h1>
<p>Around two thousand, two hundred and twenty years ago, Archimedes said “Give me a place to stand and I will move the world”.</p>
<p>Archimedes has come to be (rightly) associated with many fundamental innovations in mathematics and engineering, and while I’m sure we’re all sick and tired of screws and levers and jumping out of bathtubs, I’d like to start off with a discussion on one of his more mundane creations, the compound pulley.</p>
<p>The Greek biographer Plutarch describes an interaction between him and King Hiero of Syracuse, where Archimedes boasts that he can pretty much move or lift anything. The King was dubious of this, but, being a king, had a few massive warships that required hundreds of men to manoeuvre in and out of dock, and put Archie’s wild claims to the test.</p>
<p>As the story goes, Archie shows up with what we now call a compound pulley system, and pulls the king’s warship along the dock, with his relatively frail body pulling arm over armful of rope. Each chained pulley multiplies the expressed force of the previous, pretty much for as many layers as you like, trading distance for force. This seemingly simple technical innovation would find applications across human experience; Archie wasn’t the first to use a pulley, was almost certainly not the first to come up with the compound pulley, and was almost certainly not the first to use a pulley to make a military manoeuvre easier… But Archimedes went beyond the simple application of his stated ‘creations’ like the lever, the water screw and the compound pulley, to mathematically characterise their operation and performance, making them shareable, teachable, and in this day and age we might even say ‘scalable’.</p>
<p>My name’s Andrew Bolster, I’m the team lead of the Data Science team at WhiteHat Security, a director at your friendly neighbourhood hackerspace, Farset Labs, and in past lives I’ve worked on 4G telecoms which definitely don’t cause coronavirus, been an autonomous systems and AI researcher for the defence establishment designing communications networks for smart submarines, and have tried to teach machines how to read and respond to human emotions, and in the next hopefully 20 minutes, I am going to talk about the Missing E from STEM, Ethics, and how within the burgeoning field of Data Science, we have a serious ethics problem, and potentially, how we can work towards redressing it.</p>
<p>We’re going to approach this under three main themes;</p>
<p>· Ethical Optimisation, via our old friend the trolley problem;</p>
<p>· Unscrupulous Actors, Perverse Incentives and Bernhardsson’s Corollary, via our ‘it’s complicated’ friend Facebook; and finally</p>
<p>· Asilomar, or how an attempt to mess with carcinogenic monkey virus led to what I believe might be a depressingly bureaucratic but optimistically pragmatic approach to, maybe not putting the genie back in the bottle, but putting a leash on it.</p>
<p>So, lets see if I can keep to time. I think there’s a bracket in the slack room, put me down for a tenner on 26 minutes total.</p>
<h2 id="the-trolley-problem">The Trolley Problem</h2>
<p>We all know the story by now; There is a somehow unstoppable indestructible trolley barrelling down the tracks heading for some number of immobilised people, say five in this creative example. You, an innocent bystander, see this calamity and can predict what is going to happen if you do nothing. You also see that on another siding, some other number of people, in this case 1, are similarly restrained, and you can predict what happens if you pull the lever. You, dear bystander, can act to choose to reroute the trolley, killing one to save five.</p>
<p>And as we’re all good Vulcans, we can confidently act to maximise the number of lives; after all, the needs of the many outweigh the needs of the few, or in this case, that poor unfortunate one.</p>
<p>So, what we’ve got here is a good old fashioned optimisation problem. Ok. It’s clear that more lives are saved by acting in this case.</p>
<p>But I’m sure we’ve sat through talks that then go on at length about the combinations of the saps on each track, whether there’s a guy you can push off the tracks to stop the train, or if there’s a baby on one side and an old person on the other, to the point where you eventually work out that the exchange rate between ‘Baby’ and ‘OAP’ is something around 3.235 or whatever.</p>
<p>That’s not what I want to talk about.</p>
<p>Any machine learning system, as well as most data science research is a form of optimisation system; you feed it data, and you give it some way of knowing if it’s “good” or not, and decisions or recommendations pop out of a trained model. These models are generally educated on ‘goodness’ by what is called a fitness function.</p>
<p>Usually, and thankfully, it’s rarely that simple, but let’s say you do a massive global study and you somehow create this huge matrix of different people’s decisions so that you can confidently aggregate what the ‘normal’ human would do in a given situation. That sounds like a pretty awesome solution, right? Literally crowdsourcing ethics? Wee buns.</p>
<p>Well, as usual, MIT already did it. In 2017 they had 70,000 participants across 42 countries. And while the whole paper is a fascinating read, I’ll give you an easy highlight and move on.</p>
<p>To cherry pick; in 82% of responses, Germans, like most Europeans, did a Spock and sacrificed the one to save the many. However in China, that number is 58%. And China is not some surprising outlier; there is a smorgasbord of variations in judgement calls that are correlated within regions but vary across regions.</p>
<p>There is no global definition of what “a socially moral act” is, and indeed if you just took the straight average, more people would disagree with it than would agree with it. And this is in quite possibly the ‘simplest’ ethical question we think we can ask, saving lives or not.</p>
<p>So, we can’t optimise for ethics, because no one can or will be able to agree on what ethics is.</p>
<h2 id="unscrupulous-actors-and-perverse-incentives">Unscrupulous Actors and Perverse Incentives</h2>
<p>In mid-2008, Facebook proudly announced their Connect API. This enabled truly portable social login, and also enabled app developers to access the user’s social web to recommend, engage, and customise. Fun fact, their proudly launched blog post has since disappeared from the internet, go figure. Apps like Eventbee turned your friends into mini Ticketmasters by enabling them to get a cut of virally recommended ticket sales; Netflix let you share movie ratings within your friend group; Hell, the Watchmen Blu-ray had the ability to invite your friends for simultaneous watch parties, which would have been handy about now…</p>
<p>This was the dawn of a new age for the web, everything was connected to everyone, and the experience was customised on a per user basis, rather than requiring some bulk data access API or something similarly privacy-crazy.</p>
<p>And then, the quizzes started.</p>
<p>Netizens of a certain age remember the day before Facebook quizzes exploded, and the day after. It was hell. And it made no sense; but you just ignored your dumb-ass mates’ posts and moved on with your life.</p>
<p>By 2014, Facebook had killed this ‘next generation social network’, and years later, we found out why. But the genie was out of the bottle and the world was already changed. But, how?</p>
<p>How did a proliferation of pointless quizzes and single-purpose app integrations lead to the global manipulation of public opinion on an industrially mechanised scale?</p>
<p>Well, friends, it started off with people like us. Technologists, Data Scientists, Researchers, Academics, Numerical Fiddlers.</p>
<p>Between 2007 and 2016, The University of Cambridge Psychometrics Centre, which is part of the Judge Business School, ran a quiet little side project called ‘MyPersonality’ that, like many other psychometrics research groups across the world, leveraged the kind of viral social media reach that the Connect API enabled, to run playful personality tests to tell you what kind of dog you are or whatever. But these results were genuinely used for bona-fide, ethically moderated, research into the social and psychological questions of the day. The API itself provided a perverse incentive for data-driven researchers; the more people you could get to run a quiz, the more people whose demographic and interest metadata you could lift at the same time, even if the friends of the obsessive-compulsive quiz taker never touched, or consented to, your friendly little app. The movies they liked, the pages they liked, the artists they followed and the mutual friends they shared with the quiz taker were all on the table.</p>
<p>Today we call this kind of thing ‘Triangulation’, but at this time, it was just an interesting research context to characterise different quiz responses into different social buckets.</p>
<p>Like Age, Race, Occupation, Economic status, Education… Beliefs… And what kind of posts you would ‘like’</p>
<p>And the quizzes themselves, through psychometric aggregation, could score people on estimates like “Respect for authority”, “Voting Intention”, Media Savviness, Neuroticism and more.</p>
<p>I have to confess, I myself played with the same datasets that eventually got called ‘The Cambridge Analytica’ files. I was at the University of Liverpool at the time researching trust networks for applications in autonomous systems, so borrowing structures and activities from human social networks made sense. And I requested access, got vetted for my purposes by diligent academics, and I played with the things I asked for. All good, sounds nice.</p>
<p>Then one academic accidentally published their access credentials onto GitHub…</p>
<p>And there they were for over four years.</p>
<p>The story of what the evidently unscrupulous actors, SCL and Cambridge Analytica did with that data is better documented in other places, but I want to talk about Facebook’s handling of the situation.</p>
<p>Frankly, Facebook didn’t give a poop that there was an aggregated data set of millions of its users, which was in massive violation of the Connect API’s Ts&Cs. And it’s not like they didn’t know about it; a Facebook employee applied to access the data, but was rejected by vetters.</p>
<p>It is here I want to break off for a second. I think most of us are aware of ‘Hanlon’s Razor’: never attribute to malice that which can be adequately explained by stupidity. It’s a great way to stress less about our reptilian overlords and pizza shop basements, but I came across what I have termed “Bernhardsson’s Corollary”: “Never attribute to stupidity that which can be explained by opportunity cost”.</p>
<p>Facebook didn’t <strong>plan</strong> for their data to be misused by unscrupulous actors, they weren’t too stupid to think about it, and while Zuckerberg might look inhuman when he’s supping on Dihydrogen Monoxide, given the choice between “deep state conspiracy to elect a glorified used car salesman” and “prioritised shipping features over data governance”, my bet is always with over-worked data scientists, developers and product managers on a Sisyphean treadmill. It is easy to say today that ‘yeah, that’s a nice API and all, but what if someone doesn’t adhere to our lazy, unenforceable rules for legitimate research purposes, the same research that we do internally for funzies, and what if a private firm works out how to get access to the aggregate data and uses it to isolate and manipulate filter bubbles for the highest bidder?’.</p>
<p>Wargaming Data Governance “What If?” scenarios isn’t prioritised at an executive level; it wasn’t something that could be quantified on a burn-down chart, or allocated story points, and it was never going to end up in the investor briefings or press releases. Well, until it made it into the congressional record.</p>
<p>But lets not just poop over Zuck for this one; our memetastic interweb has an applicable ding on those supposedly responsible for oversight too.</p>
<p>Policy makers are supposed to be experts in policy; politicians are supposed to be experts in politics; neither can be expected to be experts in technology.</p>
<p>And modern corporate governance, which has eschewed internal research teams to optimise for the bottom line, can’t be expected to be informed enough to self-govern in a way that covers the explosive possibilities of the connected world.</p>
<p>But then, when public bodies <strong>do</strong> try and create these kind of structures, they inevitably run in to massive obstacles, particularly in implementation, with technologists tearing their remaining hair out wondering ‘what the hell does “appropriate measures to prevent deanonymisation” mean?’, looking at you GDPR…</p>
<p>They can’t prepare for unforeseen consequences, and they can’t see the perverse incentives until long after they’ve pushed their beautiful code.</p>
<h2 id="asilomar-and-public-scientific-policy">Asilomar and Public Scientific Policy</h2>
<p>So, crowdsourced ethics makes no one happy, corporate entities don’t account for moral failure on their balance sheets, and socio-economics prevents governments from effectively legislating this kind of thing, What about practitioners?</p>
<p>In 1974 a biochemist at Stanford called Paul Berg designed an experiment to stuff bits of a carcinogenic monkey virus into some E. coli bacteria, but when he shared his proposed experimental protocol, he got a very, very mixed response from the scientific community at the time. Long story short, they believed that it could create a quickly spreading, cancer-causing viral pathogen. Fun times.</p>
<p>Instead of pulling a Frankenstein, the 1975 Asilomar Conference on Recombinant DNA was held with over 100 internationally respected molecular biologists in attendance, with Berg as its chair. There, they established a set of guidelines to be followed by all scientists doing this kind of recombinant DNA research. They considered several classes of experiment and assigned relative levels of risk, from minimal to high. Each level of risk required a corresponding set of containment procedures, designed to minimise the chance of carriers escaping into the wider world and causing, in our established parlance, ‘unforeseen consequences’. This included forbidding certain classes of experiment outright, such as using materials from highly infectious or toxic genomes, or running experiments to produce large quantities of reagents.</p>
<p>These guidelines were predominantly ‘socially enforced’ rather than being legislated; with a significant strand of the conference discussing the importance of operating ‘in the open’ and bringing their science into the public eye. Some commentators suggest that this is a reflection of the distrust of government in the wake of the Watergate scandal, such that it was easier for the public to trust scientists following scientific consensus than it was to trust scientists following government derived and potentially variably enforced legislation.</p>
<p>This generation of an open consensus within a community of practitioners and experts led to an increasingly positive view of the field within the general public.</p>
<p>Somewhat ironically from our vantage point almost 50 years later, this enabled specialists who had previously confined themselves to academia, to develop ties in the private sector, as shareholders, executives, and consultants; engaging with private industry in a constructive way, because their Biomolecular ‘Hippocratic Oath’ provided a stability and confidence to explore new innovations and opportunities, safely.</p>
<p>Berg went on to earn the Nobel Prize in Chemistry in 1980 and, by the 20th anniversary of the Asilomar conference in 1995, genetics and its terminology had become part of the day-to-day vocabulary and not something restricted to the hallowed halls of university or private research establishments.</p>
<p>I’d argue that if it’s good enough for cancer-causing simian stomach bugs, it’s good enough for Data Science. Efforts along these lines are already active, and I would encourage you as practitioners to take part. There are many Data Ethics projects across the world, as well as Gillian’s Human Impact Statements; some of them open source, such as the Open Ethics Canvas from the Open Data Institute, or public sector, such as GOV.UK’s Data Ethics Framework, which, while being good, has no interesting graphics associated with it other than Matt Hancock’s face, which I won’t subject you to. As well as this there is a range of ad-hoc standards, drifting towards professional ones, being developed for data governance and ethical data science.</p>
<p>Of particular interest to me at the moment is the IEEE’s proposed 7000 series of standards on ethically aligned design, which ranges from P7003, concerned with Algorithmic Bias Considerations, to P7002, Data Privacy Processes, and one of the weirder ones (and a bit of a mouthful), P7014, the Standard for Ethical considerations in Emulated Empathy in Autonomous and Intelligent Systems, which is to do with the strange world of thinking about systems that quantify, respond to, or simulate emotion. And of course I sit on the working group for the weird one.</p>
<p>These groups bring wide and diverse voices together to interrogate and twist the use and potential misuse of data science and autonomous systems, developing shared standards and guidelines to which we as a community of practice should adhere.</p>
<p>And it’s far from perfect; the bureaucracy involved in engaging a community of thousands of practitioners is not without its challenges. With the ever-taunting pressure of ‘just get it done’ hanging over us all, we have to ask the question: how do we do this?</p>
<p>So we’ve covered the challenge of mathematically operating in an ethical, human world; the inevitable challenge of unforeseen consequences; and, as usual in software engineering, we think we might be able to steal another field’s good ideas to save our own skins. What does this all mean as a Data Scientist today?</p>
<h2 id="what-now">What now?</h2>
<p>Frankly, it’s a fraught time to be a Data Scientist. I don’t have any easy answers.</p>
<p>With an analytical background, and the usual laundry list of ‘skills’ this industry desires, you’re used to developing tools against constraints, but when there’s no accepted, or potentially even possible, standard against which to measure your work ethically, how can you write code or interpretations of data that you can stand behind?</p>
<p>If the Ethics is unaccountable and un-costable, how do we progress with any kind of confidence as an industry?</p>
<p>Or do we just resign ourselves to say that Data Science as a field is the art of fulfilling the ‘lies, damned lies and statistics’ aphorism, or getting trapped in Gillian’s ‘Get It Done’ environment?</p>
<p>For years, I considered myself an Engineer above being a Scientist, approaching problems in the practical mindset of “How do we do the thing that satisfies the requirements with the most practical, elegant solution?”</p>
<p>These days, I’ve been forced to accept that we don’t know what the “thing” is, and that the requirements will be written by the headlines 5 years later. The best we can hope for is a generation of data scientists, developers, engineers, and product managers who take a pragmatic approach; who assess second-order risks like perverse incentives, and accept that there are unscrupulous actors more ingenious than us; who take a step back and remember that ethical optimisation is at best a heuristic rather than an algorithm; and who, above all, try to leave the world a little better, and better documented, than we found it.</p>
<p>It’s on us, as an industry and as a field, to hold each other accountable, and to question and support each other’s decisions, pragmatically, fairly, and openly.</p>
<p>Thank you.</p>
Is Your AI Ethical?2020-04-26T20:16:00+00:00https://andrewbolster.info/2020/04/is-your-ai-ethical<p><em>Originally posted in <a href="https://www.rtinsights.com/is-your-ai-ethical/">RTInsights</a></em></p>
<p><strong>Businesses should do their part to ensure products are designed judiciously to reflect core company values and provide audit trails of how AI is learned.</strong></p>
<p>As we examine an increasing reliance on artificial intelligence (AI) and machine learning, it’s being revealed that AI can have a built-in bias, whether intentional or not.</p>
<p>In late 2019, Apple and Goldman Sachs faced allegations that the Apple Card used an algorithm that discriminated against women in credit-scoring evaluation – after Apple’s own co-founder Steve Wozniak and entrepreneur David Heinemeier Hansson received credit limits 10-20 times higher than their wives’.</p>
<p>A <a href="https://www.pnas.org/content/early/2020/03/17/1915768117">recent study</a> also found that AI-based automated speech recognition (ASR) systems from Amazon, Apple, Google, IBM, and Microsoft exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for African American speakers, compared with 0.19 for Caucasian speakers. The study highlighted the need to invest resources into ensuring that ASR systems and speech researchers are broadly inclusive.</p>
<p>Companies are only beginning to understand issues with AI-based products and take action towards more ethical AI. A <a href="https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/state-of-ai-and-intelligent-automation-in-business-survey.html">State of AI in the Enterprise</a> survey from Deloitte found that 32% of executives ranked ethical issues as a top-three risk of AI, but most don’t yet have specific plans in place to address the risk.</p>
<p><strong>The Road to Ethical AI</strong></p>
<p>In the corporate world, AI is being used for everything from the development of new products and platforms to driving marketing initiatives and major business decisions.</p>
<p>In a world where data is the new oil, it is increasingly tempting to throw all the data you can lay your hands on at a problem to solve it quickly, cheaply, and accurately for your planned use case; however, it’s vital that the data you collect, clean, and use to train such models is vetted for its inclusiveness, its correctness, and its ethics.</p>
<p>Part of any major machine learning project that interacts with humanity should be an assessment of its fairness. Indeed, many standards groups are attempting to build consensus around such ethics, such as the IEEE’s P7000 series on Ethics of Autonomous and Intelligent Systems.</p>
<p>While progress is being made within standards bodies, organizations can also take steps to do their part to ensure products are designed judiciously to reflect core company values, provide audit trails of how AI is learned, and finally be remediated if or when it discriminates or causes harm. If we cannot do so at the outset of AI design and ensure inclusive data for machine learning, we risk losing the benefits of AI altogether. Organizations using and creating AI-based products are recognizing that responsible innovation requires stronger internal governance.</p>
<p><strong>Getting Started</strong></p>
<p>One solution is building internal ethics committees. For organizations, they can help with the design of AI-based products and services, while ensuring privacy, security, fairness, and ultimately building trust among consumers and partners.</p>
<p>But building ethics committees is easier said than done. It was reported that Google formed an AI ethics committee that was to meet quarterly in 2019, but it only lasted a week.</p>
<p>Luckily, a new report from the Ethics Institute at Northeastern University and Accenture entitled <a href="https://www.accenture.com/us-en/insights/software-platforms/building-data-ai-ethics-committees">Building Data & AI Ethics Committees</a> offers expert guidance.</p>
<p>The report states that when getting started with an ethics committee, it’s critical to put together the right team of people to represent organizational stakeholders. From there, they must think through and agree on key functions, values, principles, and processes. Key committee considerations include:</p>
<ul>
<li>What are the basic values the committee is meant to protect?</li>
<li>What are the guiding principles in support of the values?</li>
<li>What are the types of expertise needed?</li>
<li>What are the standards by which committees make judgments?</li>
<li>How can the committee avoid bias and conflicts of interest?</li>
<li>When should the committee be consulted?</li>
<li>What authority does the committee have?</li>
</ul>
<p>While forming and managing a committee may be difficult in the beginning, having one in place could help prevent AI issues down the road caused by biased product development, and better inform engagements and business decisions that will ultimately build trust and confidence with customers.</p>
The Importance of Active Learning in Data Science and Engineering2020-03-04T11:10:00+00:00https://andrewbolster.info/2020/03/the-importance-of-active-learning-in-data-science-and-engineering<p><em>Originally posted in <a href="https://www.cybersecurity-insiders.com/the-importance-of-active-learning-in-the-data-science-and-engineering/">Cybersecurity Insiders</a></em></p>
<p>Back when I was pursuing my undergraduate degree in electronics and software engineering, I couldn’t imagine a path that would lead to me working with NATO on port protection and maritime defense, teaching smart submarines how to trust each other. But while I was working toward a Ph.D., that’s what happened. Instead of following the path into academia, a friend enticed me to work with him on biometrics. From there, I found an opportunity to apply my skills and knowledge to the cybersecurity industry – but that’s not something I could have predicted either.</p>
<p>The new year has me reflecting on the roundabout approach that led to my current role as a machine learning (ML) team lead with WhiteHat Security. I think it’s important to share some of the challenges, experiences, and opportunities I’ve been fortunate enough to have gone through with others who may be just at the beginning of a journey or professional career in data science and engineering. Like many, my experiences have shaped my views on certain issues and advancements in technology. For example, during the past eight years as a chairman of a charitable hackerspace in Northern Ireland, I have become a huge proponent of open data and transparent government. I also believe that, with enough work, many of the world’s current challenges – climate change, road traffic accidents, infectious diseases, obesity, and even corrupt governments – can be faced with a combination of accurate, honest data and the advances in machine learning and data science techniques.</p>
<p>If you’re planning to pursue a career in STEM, one important thing to consider is that feeding your interest in these fields is one of the smartest investments you can make. Look for every opportunity to engage and immerse yourself in conversations, and self-directed projects or research. If you’re still a full-time student, don’t panic about exams or even grades. With an understanding that your education could be out-of-date by the time you finish your degree pathway, know that your long-term success will more accurately be defined by how much you can learn and remain an active learner in your own experiences.</p>
<p>Another critical factor in building success in science and engineering is to find mentors early on. Having sat on either side of the interview table, I can confidently say that the most active learners with a keen interest in their fields have often worked very closely with an experienced mentor along the way, and subsequently gift their time and experience to the next generation of learners that comes after them. The role a mentor plays in keeping you engaged and curious cannot be overstated. Not to mention, a mentor can show you all the best ways to ‘win friends and influence people’ throughout an organization.</p>
<p>As an example of active learning in my own career, my team was recently working on verifying vulnerabilities. Before, using scanning technologies to assess websites was difficult because the scanner can be quite ‘dumb.’ Our team is building a secondary system to apply ML and develop an augmented risk assessment system on top of the battle-tested scanner, to add experience over time. Whenever the ML has seen vulnerabilities on a website, they are clustered and grouped together and are verified by human experts. These experts then perform a deep dive into websites to investigate further, but it takes a lot of time and resources to do so. To serve our customers, we must be able to scale this process. All vulnerabilities are human verified, leveraging the collective experience they possess. When building with ML, the models improve as they receive more engagement from the subject matter experts – and this is where active learning and natural curiosity play a very strong role.</p>
<p>One thing that ML is terrible at is pattern recognition on time series data. Despite all of the research into stock market trends, for example, ML is very bad at identifying patterns in noisy, time-varying, multi-value inputs. This is because ML doesn’t understand the reason for changes over time and their pattern correlations. What’s missing is the human ability to infer, correlate, and leverage past experience. Humans can say, “Of course that vulnerability won’t work, because this happened.” ML – whether fortunately or unfortunately – lacks common sense, and the ability to intuitively make these kinds of inferences. But this is an exciting time to pursue a career in this field because this is being researched at this very moment!</p>
<p>As you pursue STEM studies, make time to explore and follow new and emerging technologies. Bear in mind that by graduation, there could be entirely new fields and industries to get excited about. That’s a major upside to science and engineering – it’s constantly evolving!</p>
<p>Become plugged into the industry from a global perspective. Read about what’s happening in the world and draw your inspiration from the latest research or breakthroughs as you explore your own potential. But be sure to strike a balance between work and life, too. There are plenty of social opportunities to have fun and make friends at events like meetups, hackathons, or robotics competitions, and quite often these connections will form the basis of strong and long-lasting personal and professional relationships. These relationships will support you as you build a reputation within the field, and even bring you professional opportunities that many following classical pathways would not normally have access to.</p>
<p>Finally, remember that studying science and engineering is serious, and it’s rare that things will run smoothly every time. There are no experts, just people who have made (and learned from) more mistakes than you have. Expect the unexpected but keep your wits and a sense of humor about it all. Use the time to learn your own preferences – do you enjoy the precision and methodology of lab work, or are you more of a free spirit, who prefers to be in the field? By exploring these options early, you can figure out what appeals to you most and set your course with purpose.</p>
Mosquitto (MQTT) Emon Pi (Open Energy Monitor) Forwarding Bridge2020-02-18T16:40:00+00:00https://andrewbolster.info/2020/02/mosquitto-mqtt-emon-pi-open-energy-monitor-forwarding-bridge<p>Super quick one this time; I’ve been experimenting with <a href="http://mqtt.org/">MQTT</a> to act as a central messaging broker for “Farset In-Space Related Stuff” as part of the near continuous <a href="https://blog.farsetlabs.org.uk/2019/09/farset-labs-2-0-nearly-ready-to-go/">renovations and expansions</a>.</p>
<p>We previously had a well configured <a href="https://wiki.openenergymonitor.org/index.php/EmonPi">EMonPi</a> set up with nice dashboards and things, but that died a death at some point during the move, who knows.</p>
<p>Anyway, EmonPi has a built in <a href="https://mosquitto.org/">mosquitto</a> broker, which it uses to keep ‘state’ across several parts of the emonpi ecosystem.</p>
<p>On the other side, I had a relatively easy job getting <a href="https://hub.docker.com/_/eclipse-mosquitto">Mosquitto set up via Docker on our Synology NAS</a> (also <a href="https://nodered.org/">NodeRed</a>, but that’s for another day), however I had no idea how to connect the two.</p>
<p>I was expecting the kind of intricate surgery needed to ‘replace’ a core part of emonpi’s architecture with an ‘off-device’ broker, and all the failure that that would entail, but helpfully, the MQTT protocol has a concept of “bridging”, where you can effectively make one broker ‘publish’ all its topics on another broker. <a href="http://www.steves-internet-guide.com/mosquitto-bridge-configuration/">Super Easy, Barely an Inconvenience</a>.</p>
<p>TL;DR
Create a file called <code class="language-plaintext highlighter-rouge">bridge_over_the_river_que.conf</code>* in the mosquitto/config/conf.d/ folder on the <em>source</em> device (in this case the EmonPi), with the following contents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#connection farset
connection bridge-fsl
address 192.168.1.222:1883
topic # out 0
</code></pre></div></div>
<p>If you want to make the ‘bridge’ bidirectional, you can add <code class="language-plaintext highlighter-rouge">topic # in 0</code> to also grab everything from the ‘destination’ broker, but in this case, we don’t want that.</p>
<p>If we wanted to be pedantic and to only forward <code class="language-plaintext highlighter-rouge">emon</code> sensor <a href="https://guide.openenergymonitor.org/technical/mqtt/">topic values</a>, or values from a particular device, you can replace the <code class="language-plaintext highlighter-rouge">#</code> with the appropriate <a href="https://subscription.packtpub.com/book/application_development/9781787287815/1/ch01lvl1sec18/understanding-wildcards">topic wildcard</a> (you knew that you could do wildcards in MQTT right? Cus I didn’t!)</p>
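<p>For example, if you only wanted to forward the <code class="language-plaintext highlighter-rouge">emon</code> topic tree, the bridge’s <code class="language-plaintext highlighter-rouge">topic</code> lines might look something like this (the node name in the second line is a made-up example; match it to your own devices):</p>

```conf
# Forward the whole OpenEnergyMonitor topic tree to the remote broker
topic emon/# out 0

# Or just one node's feeds (hypothetical node name)
topic emon/emonpi/# out 0
```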
<p>* Naturally, adjust the values to something that makes sense in your setup….</p>
Headless Pi Configuration with Multi-Wifi Remote Access2020-02-17T15:28:00+00:00https://andrewbolster.info/2020/02/headless-pi-configuration-with-multi-wifi-remote-access<p>Raspberry Pis are great, but often have a lot of baggage associated with them, and I keep forgetting all the clever things you can do to get them up and running without having a sea of cables attached…</p>
<h1 id="results">Results</h1>
<p>Raspberry Pi that can be <code class="language-plaintext highlighter-rouge">ssh</code>d into from anywhere in the world* without poking any firewall rules or anything other than power connected to it, that works in a range of WiFi access points.</p>
<p>*(Requires access to an external domain or static IP server)</p>
<h1 id="requirements">Requirements</h1>
<ul>
<li>Raspberry Pi with Wifi (Either onboard or dongle)</li>
<li>Externally accessible <code class="language-plaintext highlighter-rouge">ssh</code>able server</li>
<li>(Micro) SD Card reader and SD Card (>=8GB)</li>
<li>Another machine to do all the magic with</li>
<li>You to be sitting in a WiFi area that you know the key for…</li>
</ul>
<h1 id="setup">Setup</h1>
<h2 id="os-image">OS Image</h2>
<ol>
<li>Install <a href="https://www.balena.io/etcher/">Etcher</a> (Or similar, see <a href="https://www.raspberrypi.org/documentation/installation/installing-images/">here</a> for other options)</li>
<li>Download Raspbian Lite from <a href="https://www.raspberrypi.org/downloads/raspbian/">here</a></li>
<li>Use Etcher (or whatever) to burn the downloaded image to the SD card</li>
<li>Once verified, eject, unplug, and replug the SD card into the machine.</li>
</ol>
<h2 id="boot-fiddling"><code class="language-plaintext highlighter-rouge">Boot</code> Fiddling</h2>
<h3 id="enable-ssh">Enable SSH</h3>
<p>When you plug the card back in, you should see a <code class="language-plaintext highlighter-rouge">boot</code> volume appear in your Finder/Explorer/File Manager, in it, create an empty file called <code class="language-plaintext highlighter-rouge">ssh</code>. e.g.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">touch</span> /Volumes/boot/ssh
</code></pre></div></div>
<h2 id="pre-configure-wifi">Pre-Configure Wifi</h2>
<p>Then in your favourite text editor, create another file in the same directory called <code class="language-plaintext highlighter-rouge">wpa_supplicant.conf</code> that looks something like this, populated with the <code class="language-plaintext highlighter-rouge">ssid</code> and <code class="language-plaintext highlighter-rouge">psk</code> keys for your relevant networks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
# Set this to your ISO 3166 country code:
# https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
    ssid="TotallyMyWorkSSID"
    psk="supersecretandtotallyaccurate"
}
network={
    ssid="MyHomeWifi"
    psk="homesweethome"
}
network={
    ssid="farset-members"
    psk="probablytheworstkeptsecretinbelfast"
}
</code></pre></div></div>
<blockquote>
<p>If you have a ‘weird’ password with quotes or unescaped symbols, you can use <code class="language-plaintext highlighter-rouge">wpa_passphrase MY_SSID</code> to generate the relevant pre-encrypted <code class="language-plaintext highlighter-rouge">psk</code>’s that are acceptable to <code class="language-plaintext highlighter-rouge">wpa_supplicant.conf</code>, and you should be able to run this either on the pi directly later, on any other pi, or on any other linux-based wifi-enabled machine… You can probably do it in a VM too, but I haven’t tested that…
<a href="https://www.raspberrypi.org/forums/viewtopic.php?p=1472865&sid=0e9fd7545d8ce3e46ffaa9f212649697#p1472865">Source</a></p>
</blockquote>
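<p>If you find yourself doing this for lots of Pis, the file can be scripted. The sketch below is a hypothetical helper (not part of any official tooling) that emits the header plus one plain-text network block per SSID/PSK argument pair; the credentials and output filename are just the examples from above:</p>

```shell
# Hypothetical helper: emit a wpa_supplicant.conf from alternating
# SSID/PSK argument pairs.
make_wpa() {
    # Fixed header, matching the example file above
    printf 'country=GB\n'
    printf 'ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev\n'
    printf 'update_config=1\n'
    # One network block per SSID/PSK pair of arguments
    while [ "$#" -ge 2 ]; do
        printf 'network={\n    ssid="%s"\n    psk="%s"\n}\n' "$1" "$2"
        shift 2
    done
}

# Generate a config for two networks (example credentials from above)
make_wpa "MyHomeWifi" "homesweethome" \
         "farset-members" "probablytheworstkeptsecretinbelfast" > wpa_supplicant.conf
```

If you want hashed keys rather than plain text, the <code class="language-plaintext highlighter-rouge">printf</code> for the network block could be swapped for a call to <code class="language-plaintext highlighter-rouge">wpa_passphrase</code> instead.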
<h2 id="squeaky-bum-time">Squeaky Bum Time</h2>
<p>At this point, eject and unplug the SD card from the machine, and plug it into the (unpowered) Pi, and then the moment of truth; power it on…</p>
<blockquote>
<p>At this point I recommend waiting at least 5 minutes, get a coffee or a <a href="https://www.farsetlabs.org.uk/about/club_mate.html">ClubMate</a></p>
</blockquote>
<blockquote>
<p>While you’re waiting, take some time to review the <a href="https://elinux.org/R-Pi_Troubleshooting#Power_.2F_Start-up">Power/Startup</a> section of the Pi Troubleshooting Guide, just in case…</p>
</blockquote>
<h2 id="first-login">First Login</h2>
<h3 id="easy-mode-mdns">Easy Mode; mDNS</h3>
<p>If you’re lucky and have a nice router on your wifi network, you <em>might</em> be able to log in to your device with the default password <code class="language-plaintext highlighter-rouge">raspberry</code>, using one of the commands below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@raspberrypi
</code></pre></div></div>
<p>or</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@raspberrypi.local
</code></pre></div></div>
<h3 id="hard-mode-port-scan">Hard Mode; Port Scan</h3>
<p>Using a network scanner like <code class="language-plaintext highlighter-rouge">nmap</code>, scan for IPs that have the SSH port (22) open, and try a few. YMMV.</p>
<p>On basic networks this should be as simple as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo nmap -sS -p 22 192.168.0.0/24
</code></pre></div></div>
<p>But you may also want to check out the <a href="https://www.fing.com/products/fing-app">Fing</a> android network scanner which is great at identifying Raspberry Pi’s based on their <a href="https://www.macvendorlookup.com/">Vendor MAC Address</a></p>
<p>Once you’ve identified the IP address of the Pi, you can connect to it like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh pi@<IP-ADDRESS>
</code></pre></div></div>
<h2 id="good-housekeeping">Good Housekeeping</h2>
<p>Now that we’re in, there are a few steps that we should do just for good housekeeping;</p>
<ol>
<li>Update the Pi with <code class="language-plaintext highlighter-rouge">sudo apt-get update; sudo apt-get upgrade -y</code></li>
<li>Using <a href="https://www.raspberrypi.org/documentation/configuration/raspi-config.md"><code class="language-plaintext highlighter-rouge">sudo raspi-config</code></a>, perform the following actions
<ol>
<li>Change the <code class="language-plaintext highlighter-rouge">pi</code> user password to something custom</li>
<li>Network Options > Set the Hostname to something memorable (We’ll be referring to this in future as PI_HOST)</li>
<li>Advanced Options > Update</li>
</ol>
</li>
<li>Reboot with <code class="language-plaintext highlighter-rouge">sudo reboot</code></li>
<li>Log back in with either <code class="language-plaintext highlighter-rouge">ssh pi@PI_HOST</code>, or your previously found IP address (or return to the “Port Scan” section if that doesn’t work)</li>
<li>Run <code class="language-plaintext highlighter-rouge">ssh-keygen</code> (without specifying a passphrase)</li>
<li>Print out and note the contents of the public key generated using <code class="language-plaintext highlighter-rouge">cat ~/.ssh/id_rsa.pub</code></li>
</ol>
<h2 id="prepping-autossh">Prepping <code class="language-plaintext highlighter-rouge">autossh</code></h2>
<p>We’ll be using the <code class="language-plaintext highlighter-rouge">autossh</code> program to, well, automate ssh. This will be able to provide us with a “Reverse SSH Tunnel” to get into the Pi via an internet-accessible server, that we’ll be calling <code class="language-plaintext highlighter-rouge">JUMPBOX</code></p>
<h3 id="jumpbox-prep"><code class="language-plaintext highlighter-rouge">JUMPBOX</code> Prep</h3>
<p>Log into the <code class="language-plaintext highlighter-rouge">JUMPBOX</code>, and create a new <code class="language-plaintext highlighter-rouge">nologin</code> user</p>
<blockquote>
<p><strong>All the commands in this section should be executed on <code class="language-plaintext highlighter-rouge">JUMPBOX</code> as the <code class="language-plaintext highlighter-rouge">root</code> user (or add appropriate <code class="language-plaintext highlighter-rouge">sudo</code>s)</strong></p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>useradd -m -s /sbin/nologin --disabled-password autotunnel
su - autotunnel -s /bin/bash
ssh-keygen
</code></pre></div></div>
<p>You’ll be asked for a <code class="language-plaintext highlighter-rouge">passphrase</code> here; don’t enter one, as this ensures we’ll be able to securely set up the tunnel without manually entering passwords.</p>
<p>Then we’ll construct an <code class="language-plaintext highlighter-rouge">authorized_keys</code> file (note the Americanisation!) in the <code class="language-plaintext highlighter-rouge">.ssh</code> directory so our pi will be able to log into this account automatically.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir ~/.ssh
vim ~/.ssh/authorized_keys
</code></pre></div></div>
<p>In this file, add the contents of the <code class="language-plaintext highlighter-rouge">cat ~/.ssh/id_rsa.pub</code> call that we performed on the Pi, and save it (with <code class="language-plaintext highlighter-rouge"><ESC>:wq</code>, in case you forgot…)</p>
<p>We need to make sure these files have the correct permissions, so…</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
</code></pre></div></div>
<h3 id="pi_host-prep"><code class="language-plaintext highlighter-rouge">PI_HOST</code> Prep</h3>
<p>Autossh isn’t installed by default:</p>
<p><code class="language-plaintext highlighter-rouge">sudo apt-get install autossh -y</code></p>
<p>Similar to the setup of the <code class="language-plaintext highlighter-rouge">autotunnel</code> user on the <code class="language-plaintext highlighter-rouge">JUMPBOX</code>, we do the same on the <code class="language-plaintext highlighter-rouge">PI_HOST</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo useradd -m -s /sbin/nologin --disabled-password autotunnel
sudo su - autotunnel -s /bin/bash
ssh-keygen
ssh-copy-id JUMPBOX
ssh JUMPBOX
</code></pre></div></div>
<p>At this point you’ll be asked to confirm things like accepting the SSH key of the JUMPBOX server, but you won’t actually get a login shell and will be kicked off; this is because we set up the <code class="language-plaintext highlighter-rouge">autotunnel@jumpbox</code> user with <code class="language-plaintext highlighter-rouge">/sbin/nologin</code>, so it’s fine.</p>
<p>Now, use <kbd>Ctrl</kbd> + <kbd>D</kbd> to “escape” from the <code class="language-plaintext highlighter-rouge">autotunnel</code> user back to the <code class="language-plaintext highlighter-rouge">pi</code> user.</p>
<p>Finally, <code class="language-plaintext highlighter-rouge">sudo -s</code> into the root user and execute the following to create an auto-starting service file:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="o">></span> /etc/systemd/system/autossh-JUMPBOX.service <span class="o"><<</span> <span class="no">EOF</span><span class="sh">
[Unit]
Description=Keep a tunnel to 'JUMPBOX' open
After=network-online.target
Wants=network-online.target
[Service]
Type=forking
User=autotunnel
ExecStart=/usr/bin/autossh -f -M 0 -N -i /home/autotunnel/.ssh/id_rsa -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=false autotunnel@JUMPBOX -R 2222:127.0.0.1:22
ExecStop=/usr/bin/pkill -9 -u autotunnel
Restart=always
[Install]
WantedBy=multi-user.target
</span><span class="no">EOF
</span>systemctl daemon-reload
systemctl <span class="nb">enable </span>autossh-JUMPBOX.service
reboot
</code></pre></div></div>
<p>And then wait a while and hope for the best!</p>
<h1 id="persistent-connections">Persistent Connections</h1>
<p>All-in-one command to test your JUMPBOX/PI_HOST connection;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh -J user@JUMPBOX pi@localhost -p 2222
</code></pre></div></div>
<p>Once you’re happy that this is fairly stable, you can set up a ‘shortcut’ to this host in your local <code class="language-plaintext highlighter-rouge">~/.ssh/config</code> file with something like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host PI_HOST
HostName localhost
Port 2222
ProxyJump user@JUMPBOX
User pi
</code></pre></div></div>
<p>Then you can execute the following to set up a passwordless passthrough connection:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh-copy-id PI_HOST
ssh PI_HOST
</code></pre></div></div>
<h1 id="sources">Sources</h1>
<ul>
<li><a href="https://desertbot.io/blog/headless-raspberry-pi-4-ssh-wifi-setup">Headless Raspberry Pi 4 SSH WiFi Setup</a></li>
<li><a href="https://unix.stackexchange.com/questions/56765/creating-a-user-without-a-password">Creating a user without a password</a></li>
<li><a href="https://hobo.house/2016/06/20/fun-and-profit-with-reverse-ssh-tunnels-and-autossh/">Fun and Profit with Reverse SSH Tunnels and AutoSSH</a></li>
<li><a href="https://gist.github.com/thomasfr/9707568">Autossh.service</a></li>
</ul>