Walking Randomly

ChatGPT loves to delve

Mike Croucher — Wed, 21 Feb 2024 14:08:05 +0000

Part of my role at MathWorks involves supporting academics in the use of their tools. Every now and then I get involved with co-supervising projects which usually involves some element of assessment. Ever since the advent of ChatGPT, I have noticed that students like to ‘delve’ into topics a lot more than they used to. The topics that are being delved into often involve a ‘complex interplay’ of something or maybe there’s reference to an ‘intricate dance’ while the reader is invited to go on a ‘journey’.

Obviously (I hope!) the appearance of such statements cannot be used as a ‘ChatGPT detector’ since it seems that no one can make a viable one without too many false positives or by disadvantaging various groups.

I do find it amusing though! Let’s go on a journey and delve into watching paint dry.

Let’s delve into ‘nothing’ and embark on a journey

Even when stuck in a Traffic Jam, we get to go on a journey. Of course, there is more delving too!

All of the above are interactions with my paid-for ChatGPT 4 account.

Have you noticed this? Do your students delve into things more than they used to? Let me know on Twitter or LinkedIn.

Appearing in my first podcast with Jousef Murad: Helping researchers and engineers become better coders

Mike Croucher — Tue, 02 Jan 2024 11:53:44 +0000

I recently had the opportunity to appear on Jousef Murad’s podcast Deep Dive. My appearance was sponsored by my employer, MathWorks, but I was given free reign to talk about pretty much whatever I wanted. So I chose to talk about Research Software Engineering and the techniques from software engineering that I think every scientist and engineer should employ (e.g. Version Control!). MATLAB and MathWorks is mentioned a bit but the aim of the podcast was that it would be of interest to anyone who has to do technical computing in any programming language.

Check it out and let me know what you think.

WalkingRandomly at MathWorks

Mike Croucher — Thu, 19 May 2022 17:53:10 +0000

WalkingRandomly is coming up to its 15th Anniversary! When I started blogging in 2007, I never would have believed that this little corner of the web would still be going in 2022!

As you might have noticed, posting frequency has waned over recent years which is partly to do with Real Life taking over somewhat but also because my professional life has become rather more hectic over the years. When I started posting here, I was a junior research computing specialist and when I left academia in 2019 I was a head of Research Computing for a major UK University! It was a wild ride.

In 2019 I joined the commercial world with The Numerical Algorithms Group where I spent a lovely couple of years immersing myself in commercial cloud, HPC and numerical computing.

A new beginning

At the beginning of 2021, in the deepest depths of the pandemic, I was offered the opportunity to join MathWorks, The Makers of MATLAB, to work with academia as a ‘Customer Success Engineer’. Broadly speaking, I help researchers and educators solve their problems with MATLAB.

Things have been going pretty well and earlier this month they gave me my own blog on MathWorks Central. Little old me right next to the original developer of MATLAB, Cleve Moler, in the gallery of MathWorks bloggers.

The new blog is called simply The MATLAB blog but as you might expect from someone who has spent 15 years writing about many aspects of scientific computing, there’s going to be a lot more than just MATLAB on there. Expect to see posts on subjects such as HPC, Research Software Engineering, C++, Python, GPUs and maybe even a little Fortran!

Here’s a few sample posts from the last few weeks

www.walkingrandomly.com won’t be going away and I’ll still post here from time to time but, for the forseeable future at least, most of my output will be on The MATLAB blog.

It only took me 15 years but I finally figured out how to get paid for blogging :)

See you around kid.

MATLAB mini hack – No MATLAB license required

Mike Croucher — Mon, 04 Oct 2021 11:03:53 +0000

The MATLAB community site, MATLAB Central, is celebrating its 20th anniversary with a coding competition where the only aim is to make an interesting image with under 280 characters of code. The 280 character limit ensures that the resulting code is tweetable. There are a couple of things I really like about the format of this competition, other than the obvious fact that the only aim is to make something pretty!

No MATLAB license is required! Sign up for a free MathWorks account and away you go. You can edit and run code in the browser.
~~Stealing~~ reusing other people’s code is actively encouraged! The competition calls it remixing but GitHub users will recognize it as forking.

Here’s an example. I wrote a piece of code to render the Sierpiński triangle – Wikipedia using a simple random algorithm called the Chaos game.

This is about as simple as the Chaos game gets and there are many things that could be done to produce a different image. As you can see from the Remix tree on the right hand side, I’ve already done one of them by changing a from 3 to 4 and adding some extra code to ensure that the same vertex doesn’t get chosen twice. This is result:

Someone else can now come along, hit the remix button and edit the code to produce something different again. Some things you might want to try for the chaos game include

Try changing the value of a which is used in the variables t and c to produce the vertices of a polygon on an enclosing circle.
Instead of just using the vertices of a polygon, try using the midpoints or another scheme for producing the attractor points completely.
Try changing the scaling factor — currently p=(p+s)/2;
Try putting limitations on which vertex is chosen. This remix ensures that the current vertex is different from the last one chosen.
Each vertex is currently equally likely to be chosen using the code idx=randi(a); Think of ways to change the probabilities.
Think of ways to colorize the plots.

Maybe the chaos game isn’t your thing. You are free to create your own design from scratch or perhaps you’d prefer to remix some of the other designs in the gallery.

The competition is just a few hours old and there are already some very nice ideas coming out.

Running MATLAB’s GPU Bench on an NVIDIA GTX 3070

Mike Croucher — Tue, 13 Jul 2021 20:45:10 +0000

Just over 4 years ago I was very happy with the 1.2 Teraflops of single precision performance I measured on my then-new Dell XPS 95600 laptop using MATLAB’s GPU Bench and noted that its performance was on-par with the first supercomputer I supported professionally. Double precision performance stank of course but I had gotten used to that with laptop GPUs!

Fast forward to 2021 and my personal computational landscape has changed substantially. The pandemic has confined me to my home office and high performance laptops don’t have quite the same allure that they did when I was doing most of my interesting computational work in airports and coffee shops. I swapped a 15 inch screen for a 49 inch Ultrawide monitor and my laptop is gathering dust while I enjoy my new Dell desktop with an 8 core Intel i7-11700 and an NVIDIA GTX 3070.

This is the first desktop machine I’ve personally owned for almost two decades and the amount of performance I got for less money than the high spec laptops I used to favour is astonishing! Here are the GPU Bench results using MATLAB 2021a

The standout figure is 10,399 Gigaflops for single precision Matrix-Matrix multiplication. At only slightly shy of 10.4 Teraflops that’s almost an order of magnitude faster than the results that impressed me on my laptop 4 years ago!

Double precision performance is nothing to write home about but it never is on consumer NVIDIA GPU cards. I stopped being upset about that years ago!

The other result I find interesting with respect to my personal history of devices is the 622 Gigaflops result for single precision from the Intel i7 CPU. This is getting on for twice as fast as the GPU on my 2015 Apple MacbookPro which managed 349 Gigaflops….a machine I still use today although primarily for Netflix.

Carnival of mathematics #194

Mike Croucher — Thu, 08 Jul 2021 11:43:52 +0000

Welcome to the 194th Carnival of Mathematics! I am extremely late in posting this…so late in fact that the 195th edition as already been published!

194 has a few interesting properties. According to its wikipedia entry, 194 is an odious number that is also the smallest Markov number that is neither a Fibonacci number nor a Pell number. It’s the smallest number that can be written as the sum of three squares in five ways and is the number of irreducible representations of the Monster group.

Micro Visual Proofs – Animated proofs without words

Proofs without words have been around for a long time. Diagrams that demonstrate that a mathematical statement is ‘obviously’ true. The example that always springs to my mind is the one below that demonstrates that the sum of consecutive odd numbers is a square number.

Such things have appeared in print for years. The books by Nelsen are a wonderful source of many of them and the Mathematical Association of America has a few very nice interactive examples.

Tom Edgar has been working on videos that animate proofs without words for a new YouTube channel called Micro Visual Proofs and submitted the one below for this month’s Carnival of math.

2.648102… Has anyone seen this constant?

Peter Cameron has been working on graphs on groups and has come across the constant 2.648102…. In this post he discusses how it turned up and asks if anyone else has come across it. There’s an interesting discussion in the comments section

Electrostatics and the Gauss–Lucas Theorem

Insight into mathematical theorems can come from many different places with physics being an extremely common one! Physical problems can be very useful in making even the most abstract concepts more concrete. John Carlos-Baez demonstrates this perfectly in his post Electrostatics and the Gauss-Lucas Theorem

Mathematics from TikTok

When the carnival of mathematics first began, it was all about mathematical blogs – a medium that was relatively new at the time. There was something wonderful about seeing mathematicians, scientists, teachers and hobbyists communicating more directly with the world. Internet time moves quickly and there has since been an explosion of interesting mathematical discussion on many other social media platforms — one of these being TikTok.

In this twitter post by Howie Hua, a teacher of math to future elementary school teachers, Howie advertises a TikTok video where he asks ‘What Happens if we add fractions across?’

New TikTok video: What happens if we add fractions across? pic.twitter.com/RDfU7AHYKL

— Howie Hua (@howie_hua) May 21, 2021

The Mathematigals Show

Mathematigals is a new YouTube channel that asks explores things like ‘What do broccoli and lightning have in common?’ and The Potato Paradox. In this month’s submission they discuss the source of many a mathematical argument, The Monty Hall Problem. Head over to YouTube to subscribe to their channel and show your support.

21-card trick explained

Mathematics is at the foundation of many card tricks and here is Ganit Charcha demonstrating one of them

Ganit has several videos on various aspects of mathematics and you can subscribe to his channel to show your support.

Shock Me Amadeus

Larissa Fedunik-Hofman of Sydney’s Mathematical Research Institute writes:

“In this profile video of applied mathematician Bronwyn Hajek from the University of South Australia, Hajek describes how she is motivated by the
quest to solve tricky, obscure, unsolved partial differential equations. Hajek will visit the Sydney Mathematical Research Institute in coming
months to collaborate on a project to apply Lie symmetry methods to model biological shocks.

I hope you like this video, which covers her upcoming project and some of the mathematical breakthroughs that Hajek has been involved with in recent years.”

Chalkdust – A magazine for the mathematically curious

Robin Whitty submitted this post that celebrates the launch of issue 13 of the Chalkdust magazine which focuses on John Conway, creator of the game of life, among many other things.

No Fortran? No data science in R and Python!

Mike Croucher — Fri, 13 Nov 2020 18:53:55 +0000

Earlier this week Apple announced their new, Arm-based ‘Apple Silicon’ machines to the world in a slick marketing event that had many of us reaching for our credit cards. Simultaneously, The Numerical Algorithms Group announced that they had ported their Fortran Compiler to the new platform. At the time of writing this is the only Fortran compiler publicly available for Apple Silicon although that will likely change soon as open source Fortran compilers get updated.

Fortran on your Mac! @NAGTalk announce the first Fortran Compiler for Apple Silicon Macs ahead of @Apple event #OneMoreThing – learn more https://t.co/pcCuTs2vET #AppleEvent #Apple #AppleSiliconMacs #Fortran #Compiler #SC20 @Arm pic.twitter.com/d03WsGVrUN

— NAG (@NAGTalk) November 10, 2020

Fortran? OK Boomer!

At over 60 years old, Fortran is one of the oldest programming languages that continues to be actively developed and used (The latest language specification is Fortran 2018). Routinely mocked by software engineers as old-skool (including me who, over a decade ago, suggested that it shouldn’t be taught to undergraduates), Fortran is the language that everyone overlooks as they spend their days sipping flat-whites in Starbucks while hacking away in MATLAB, R or Python on shiny Macbook Airs.

It can come as quite a shock, therefore, to discover that much of our favourite data science tools simply cannot work natively on any system that doesn’t have a Fortran compiler!

It’s Fortran all the way down

Much of the numerical functionality we routinely use today was developed decades ago and released in Fortran. More modern systems, such as R, make direct use of a lot of this code because it is highly performant and, perhaps more importantly, has been battle tested in production for decades. Numerical computing is hard (even when all of your instincts suggest otherwise) and when someone demonstrably does it right, it makes good sense to reuse rather than reinvent.

As a result, with no Fortran, there’s no R.

The Python crowd don’t get away with it either. Here are the GitHub stats for Scipy as of today

Of course, no Scipy means you also can’t have anything that depends on Scipy including things like Keras or Scikit-learn. Also, if you want good performance for linear algebra operations (Which underpins a huge number of data science and ML algorithms) then you’ll need a good BLAS implementation. Many projects use OpenBLAS (It’s a popular option in Numpy for example) which…you guessed it…is almost 49% Fortran.

Emulation for now

NAG’s Fortran compiler is excellent (it routinely tops independent charts for its checking and Fortran standards compliance for example) but it is commercial. The community needs and will demand open source (or at least free) Fortran compilers if data scientists are ever going to realise the full potential of Apple’s new hardware and I have no doubt that these are on the way. Other major silicon providers (e.g. Intel, AMD, NEC and NVIDIA/PGI) have their own Fortran compiler that co-exist with the open ones. Perhaps Apple should join the club (I suggest they talk to NAG!).

Until this is resolved, most of us will be relying on the Rosetta2 emulation…which, thankfully, is apparently pretty good!

Side by side comparison of Microsoft Azure’s High Performance Computing (HPC) Instances

Mike Croucher — Tue, 25 Aug 2020 15:36:10 +0000

At the time of writing, Microsoft Azure has 3 High Performance Computing instances available and I often find myself looking up their specifications and benchmark results when discussing their various methods with colleagues and clients. All the information is out there but seems to be spread across several documents. To save me the trouble in future, here is everything I usually need in one place

HC nodes: Original source of data
* CPU: Intel Skylake
* Cores: 44 (non-hyperthreaded)
* 3.5 TeraFLOPS double precision peak
* 352 Gigabyte RAM
* 700 Gigabyte local SSD storage
* 190 GB/s memory bandwidth
* 100 GB/s infiniband

HB Nodes: Original source of data
* CPU: AMD EPYC 7000 series (Naples)
* Cores: 60 (non-hyperthreaded)
* ???? Peak double precision peak
* 240 Gigabyte RAM
* 700 Gigabyte local SSD storage
* 263 GB/s memory bandwidth
* 100 GB/s infiniband

HBv2 nodes: Original source of data
* CPU: AMD EPYC 7002 series (Rome)
* Cores: 120 (non-hyperthreaded)
* 4 TeraFlops Peak double precision
* 480 Gigabyte RAM
* 1.6 Terabyte local SSD storage
* 350 GB/s memory bandwidth
* 200 GB/s infiniband

Azure HPC Benchmarks

Programmatically listing all available regions in Microsoft Azure

Mike Croucher — Thu, 30 Jul 2020 19:35:12 +0000

When using the Azure Command Line Interface (CLI), it is often necessary to explicitly state which region you want to perform an operation in. For example, when creating a resource group

az group create --name  --location

Where you replace with the region where you want to create the resource group. You may have a region in mind (South Central US perhaps) or you may be wondering which regions are available. Either way, it is helpful to see the list of all possible values for in commands like the one above.

The way to do this is with the following command

az account list-locations

By default, the output is in JSON format:

[
  {
    "displayName": "East Asia",
    "id": "/subscriptions/1ce68363-bb47-4365-b243-7451a25611c4/locations/eastasia",
    "latitude": "22.267",
    "longitude": "114.188",
    "name": "eastasia",
    "subscriptionId": null
  },
  {
    "displayName": "Southeast Asia",
    "id": "/subscriptions/1ce68363-bb47-4365-b243-7451a25611c4/locations/southeastasia",
    "latitude": "1.283",
    "longitude": "103.833",
    "name": "southeastasia",
    "subscriptionId": null
  },
  {
    "displayName": "Central US",
    "id": "/subscriptions/1ce68363-bb47-4365-b243-7451a25611c4/locations/centralus",
    "latitude": "41.5908",
    "longitude": "-93.6208",
    "name": "centralus",
    "subscriptionId": null

Let’s get the output in a more human readable table format.

az account list-locations -o table

DisplayName           Latitude    Longitude    Name
--------------------  ----------  -----------  ------------------
East Asia             22.267      114.188      eastasia
Southeast Asia        1.283       103.833      southeastasia
Central US            41.5908     -93.6208     centralus
East US               37.3719     -79.8164     eastus
East US 2             36.6681     -78.3889     eastus2
West US               37.783      -122.417     westus
North Central US      41.8819     -87.6278     northcentralus
South Central US      29.4167     -98.5        southcentralus
North Europe          53.3478     -6.2597      northeurope
West Europe           52.3667     4.9          westeurope
Japan West            34.6939     135.5022     japanwest
Japan East            35.68       139.77       japaneast
Brazil South          -23.55      -46.633      brazilsouth
Australia East        -33.86      151.2094     australiaeast
Australia Southeast   -37.8136    144.9631     australiasoutheast
South India           12.9822     80.1636      southindia
Central India         18.5822     73.9197      centralindia
West India            19.088      72.868       westindia
Canada Central        43.653      -79.383      canadacentral
Canada East           46.817      -71.217      canadaeast
UK South              50.941      -0.799       uksouth
UK West               53.427      -3.084       ukwest
West Central US       40.890      -110.234     westcentralus
West US 2             47.233      -119.852     westus2
Korea Central         37.5665     126.9780     koreacentral
Korea South           35.1796     129.0756     koreasouth
France Central        46.3772     2.3730       francecentral
France South          43.8345     2.1972       francesouth
Australia Central     -35.3075    149.1244     australiacentral
Australia Central 2   -35.3075    149.1244     australiacentral2
UAE Central           24.466667   54.366669    uaecentral
UAE North             25.266666   55.316666    uaenorth
South Africa North    -25.731340  28.218370    southafricanorth
South Africa West     -34.075691  18.843266    southafricawest
Switzerland North     47.451542   8.564572     switzerlandnorth
Switzerland West      46.204391   6.143158     switzerlandwest
Germany North         53.073635   8.806422     germanynorth
Germany West Central  50.110924   8.682127     germanywestcentral
Norway West           58.969975   5.733107     norwaywest
Norway East           59.913868   10.752245    norwayeast

The entries in the Name column are valid options for in the command we started out with.

Academic Benefits of Using git and GitHub

Mike Croucher — Fri, 17 Jan 2020 06:16:37 +0000

Feel free to discuss and contribute to this article over at the corresponding GitHub repo.

Many people suggest that you should use version control as part of your scientifc workflow. This is usually quickly followed up by recommendations to learn git and to put your project on GitHub. Learning and doing all of this for the first time takes a lot of effort. Alongside all of the recommendations to learn these technologies are horror stories telling how difficult it can be and memes saying that no one really knows what they are doing!

There are a lot of reasons to not embrace the git but there are even more to go ahead and do it. This is an attempt to convince you that it’s all going to be worth it alongside a bunch of resources that make it easy to get started and academic papers discussing the issues that version control can help resolve.

This document will not address how to do version control but will instead try to answer the questions what you can do with it and why you should bother. It was inspired by a conversation on twitter.

Improvements to individual workflow

Ways that git and GitHub can help your personal computational workflow – even if your project is just one or two files and you are the only person working on it.

Fixing filename hell

Is this a familiar sight in your working directory?

mycode.py
mycode_jane.py
mycode_ver1b.py
mycode_ver1c.py
mycode_ver1b_january.py
mycode_ver1b_january_BROKEN.py
mycode_ver1b_january_FIXED.py
mycode_ver1b_january_FIXED_for_supervisor.py

For many people, this is just the beginning. For a project that has existed long enough there might be dozens or even hundreds of these simple scripts that somehow define all of part of your computational workflow. Version control isn’t being used because ‘The code is just a simple script developed by one person’ and yet this situation is already becoming the breeding ground for future problems.

Which one of these files is the most up to date?
Which one produced the results in your latest paper or report?
Which one contains the new work that will lead to your next paper?
Which ones contain deep flaws that should never be used as part of the research?
Which ones contain possibly useful ideas that have since been removed from the most recent version?

Applying version control to this situation would lead you to a folder containing just one file

mycode.py

All of the other versions will still be available via the commit history. Nothing is ever lost and you’ll be able to effectively go back in time to any version of mycode.py you like.

A single point of truth

I’ve even seen folders like the one above passed down generations of PhD students like some sort of family heirloom. I’ve seen labs where multple such folders exist across a dozen machines, each one with a mixture of duplicated and unique files. That is, not only is there a confusing mess of files in a folder but there is a confusing mess of these folders!

This can even be true when only one person is working on a project. Perhaps you have one version of your folder on your University HPC cluster, one on your home laptop and one on your work machine. Perhaps you email zipped versions to yourself from time to time. There are many everyday events that can lead to this state of affairs.

By using a GitHub repository you have a single point of truth for your project. The latest version is there. All old versions are there. All discussion about it is there.

Everything…one place.

The power of this simple idea cannot be overstated. Whenever you (or anyone else) wants to use or continue working on your project, it is always obvious where to go. Never again will you waste several days work only to realise that you weren’t working on the latest version.

Keeping track of everything that changed

The latest version of your analysis or simulation is different from the previous one. Thanks to this, it may now give different results today compared to yesterday. Version control allows you to keep track of everything that changed between two versions. Every line of code you added, deleted or changed is highlighted. Combined with your commit messages where you explain why you made each set of changes, this forms a useful record of the evolution of your project.

It is possible to compare the differences between any two commits, not just two consecutive ones which allows you to track the evolution of your project over time.

Always having a working version of your project

Ever noticed how your collaborator turns up unnanounced just as you are in the middle of hacking on your code. They want you to show them your simulation running but right now its broken! You frantically try some of the other files in your folder but none of them seem to be the version that was working last week when you sent the report that moved your collaborator to come to see you.

If you were using version control you could easily stash your current work, revert to the last good commit and show off your work.

Tracking down what went wrong

You are always changing that script and you test it as much as you can but the fact is that the version from last year is giving correct results in some edge case while your current version is not. There are 100 versions between the two and there’s a lot of code in each version! When did this edge case start to go wrong?

With git you can use git bisect to help you track down which commit started causing the problem which is the first step towards fixing it.

Providing a back up of your project

Try this thought experiment: Your laptop/PC has gone! Fire, theft, dead hard disk or crazed panda attack.

It, and all of it’s contents have vanished forever. How do you feel? What’s running through your mind? If you feel the icy cold fingers of dread crawling up your spine as you realise Everything related to my PhD/project/life’s work is lost then you have made bad life choices. In particular, you made a terrible choice when you neglected to take back ups.

Of course there are many ways to back up a project but if you are using the standard version control workflow, your code is automatically backed up as a matter of course. You don’t have to remember to back things up, back-ups happen as a natural result of your everyday way of doing things.

Making your project easier to find and install

There are dozens of ways to distribute your software to someone else. You could (HORRORS!) email the latest version to a colleuage or you could have a .zip file on your web site and so on.

Each of these methods has a small cognitive load for both recipient and sender. You need to make sure that you remember to update that .zip file on your website and your user needs to find it. I don’t want to talk about the email case, it makes me too sad. If you and your collaborator are emailing code to each other, please stop. Think of the children!

One great thing about using GitHub is that it is a standardised way of obtaining software. When someone asks for your code, you send them the URL of the repo. Assuming that the world is a better place and everyone knows how to use git, you don’t need to do anything else since the repo URL is all they need to get your code. a git clone later and they are in business.

Additionally, you don’t need to worry abut remembering to turn your working directory into a .zip file and uploading it to your website. The code is naturally available for download as part of the standard workflow. No extra thought needed!

In addition to this, some popular computational environments now allow you to install packages directly from GitHub. If, for example, you are following standard good practice for building an R package then a user can install it directly from your GitHub repo from within R using the devtools::install_github() function.

Automatically run all of your tests

You’ve sipped of the KoolAid and you’ve been writing unit tests like a pro. GitHub allows you to link your repo with something called Continuous Integration (CI) that helps maximise the utility of those tests.

Once its all set up the CI service runs every time you, or anyone else, makes a commit to your project. Every time the CI service runs, a virtual machine is created from scratch, your project is installed into it and all of your tests are run with any failures reported.

This gives you increased confidence that everything is OK with your latest version and you can choose to only accept commits that do not break your testing framework.

Collaboration and Community

How git and GitHub can make it easier to collaborate with others on computational projects.

Control exactly who can see your work

‘I don’t want to use GitHub because I want to keep my project private’ is a common reason given to me for not using the service. The ability to create private repositories has been free for some time now (Price plans are available here https://github.com/pricing) and you can have up to 3 collaborators on any of your private repos before you need to start paying. This is probably enough for most small academic projects.

This means that you can control exactly who sees your code. In the early stages it can be just you. At some point you let a couple of trusted collaborators in and when the time is right you can make the repo public so everyone can enjoy and use your work alongside the paper(s) it supports.

Faciliate discussion about your work

Every GitHub repo comes with an Issues section which is effectively a discussion forum for the project. You can use it to keep track of your project To-Do list, bugs, documentation discussions and so on. The issues log can also be integrated with your commit history. This allows you to do things like git commit -m "Improve the foo algorithm according to the discussion in #34" where #34 refers to the Issue discussion where your collaborator pointed out

Allow others to contribute to your work

You have absolute control over external contributions! No one can make any modifications to your project without your explicit say-so.

I start with the above statement because I’ve found that when explaining how easy it is to collaborate on GitHub, the first question is almost always ‘How do I keep control of all of this?’

What happens is that anyone can ‘fork’ your project into their account. That is, they have an independent copy of your work that is clearly linked back to your original. They can happily work away on their copy as much as they like – with no involvement from you. If and when they want to suggest that some of their modifications should go into your original version, they make a ‘Pull Request’.

I emphasised the word ‘Request’ because that’s exactly what it is. You can completely ignore it if you want and your project will remain unchanged. Alternatively you might choose to discuss it with the contributor and make modifications of your own before accepting it. At the other end of the spectrum you might simply say ‘looks cool’ and accept it immediately.

Congratulations, you’ve just found a contributing collaborator.

Reproducible research

How git and GitHub can contribute to improved reproducible research.

Simply making your software available

A paper published without the supporting software and data is (much!) harder to reproduce than one that has both.

Making your software citable

Most modern research cannot be done without some software element. Even if all you did was run a simple statistical test on 20 small samples, your paper has a data and software dependency. Organisations such as the Software Sustainability Institute and the UK Research Software Engineering Association (among many others) have been arguing for many years that such software and data dependencies should be part of the scholarly record alongside the papers that discuss them. That is, they should be archived and referenced with a permanent Digital Object Identifier (DOI).

Once your code is in GitHub, it is straightforward to archive the version that goes with your latest paper and get it its own DOI using services such as Zenodo. Your University may also have its own archival system. For example, The University of Sheffield in the UK has built a system called ORDA which is based on an institutional Figshare instance which allows Sheffield academics to deposit code and data for long term archival.

Which version gave these results?

Anyone who has worked with software long enough knows that simply stating the name of the software you used is often insufficient to ensure that someone else could reproduce your results. To help improve the odds, you should state exactly which version of the software you used and one way to do this is to refer to the git commit hash. Alternatively, you could go one step better and make a GitHub release of the version of your project used for your latest paper, get it a DOI and cite it.

This doesn’t guarentee reproducibility but its a step in the right direction. For extra points, you may consider making the computational environment reproducible too (e.g. all of the dependencies used by your script – Python modules, R packages and so on) using technologies such as Docker, Conda and MRAN but further discussion of these is out of scope for this article.

Building a computational environment based on your repository

Once your project is on GitHub, it is possible to integrate it with many other online services. One such service is mybinder which allows the generation of an executable environment based on the contents of your repository. This makes your code immediately reproducible by anyone, anywhere.

Similar projects are popping up elsewhere such as The Littlest JupyterHub deploy to Azure button which allows you to add a button to your GitHub repo that, when pressed by a user, builds a server in their Azure cloud account complete with your code and a computational environment specified by you along with a JupterHub instance that allows them to run Jupyter notebooks. This allows you to write interactive papers based on your software and data that can be used by anyone.

Complying with funding and journal guidelines

When I started teaching and advocating the use of technologies such as git I used to make a prediction These practices are so obviously good for computational research that they will one day be mandated by journal editors and funding providers. As such, you may as well get ahead of the curve and start using them now before the day comes when your funding is cut off because you don’t. The resulting debate was usually good fun.

My prediction is yet to come true across the board but it is increasingly becoming the case where eyebrows are raised when papers that rely on software are published don’t come with the supporting software and data. Research Software Engineers (RSEs) are increasingly being added to funding review panels and they may be Reviewer 2 for your latest paper submission.

Other uses of git and GitHub for busy academics

It’s not just about code…..

Build your own websites using GitHub pasges. Every repo can have its own website served directly from GitHub
Put your presentations on GitHub. I use reveal.js combined with GitHub pages to build and serve my presentations. That way, whenever I turn up at an event to speak I can use whatever computer is plugged into the projector. No more ‘I don’t have the right adaptor’ hell for me.
Write your next grant proposal. Use Markdown, LaTex or some other git-friendly text format and use git and GitHub to collaboratively write your next grant proposal

The movie below is a visualisation showing how a large H2020 grant proposal called OpenDreamKit was built on GitHub. Can you guess when the deadline was based on the activity?

Further Resources

Further discussions from scientific computing practitioners that discuss using version control as part of a healthy approach to scientific computing

Good Enough Practices in Scientific Computing –
Is Your Research Software Correct? – A presentation from Mike Croucher discussing what can go wrong in computational research and what practices can be adopted to do help us do better
The Turing Way A handbook of good practice in data science brought to you from the Alan Turning Institute
A guide to reproducible code in ecology and evolution – A handbook from the British Ecological Society that discusses version control as part of general good practice

Learning version control

Convinced? Want to start learning? Let’s begin!

Git lesson from Software Carpentry – A free, community written tutorial on the basics of git version control

Graphical User Interfaces to git

If you prefer not to use the command line, try these

Hypot – A story of a ‘simple’ function

Mike Croucher — Mon, 06 Jan 2020 04:46:15 +0000

My stepchildren are pretty good at mathematics for their age and have recently learned about Pythagora’s theorem

$c=\sqrt{a^2+b^2}$

The fact that they have learned about this so early in their mathematical lives is testament to its importance. Pythagoras is everywhere in computational science and it may well be the case that you’ll need to compute the hypotenuse to a triangle some day.

Fortunately for you, this important computation is implemented in every computational environment I can think of!
It’s almost always called hypot so it will be easy to find.
Here it is in action using Python’s numpy module

import numpy as np
a = 3
b = 4
np.hypot(3,4)

5

When I’m out and about giving talks and tutorials about Research Software Engineering, High Performance Computing and so on, I often get the chance to mention the hypot function and it turns out that fewer people know about this routine than you might expect.

Trivial Calculation? Do it Yourself!

Such a trivial calculation, so easy to code up yourself! Here’s a one-line implementation

def mike_hypot(a,b):
    return(np.sqrt(a*a+b*b))

In use it looks fine

mike_hypot(3,4)

5.0

Overflow and Underflow

I could probably work for quite some time before I found that my implementation was flawed in several places. Here’s one

mike_hypot(1e154,1e154)

inf

You would, of course, expect the result to be large but not infinity. Numpy doesn’t have this problem

np.hypot(1e154,1e154)

1.414213562373095e+154

My function also doesn’t do well when things are small.

a = mike_hypot(1e-200,1e-200)

0.0

but again, the more carefully implemented hypot function in numpy does fine.

np.hypot(1e-200,1e-200)

1.414213562373095e-200

Standards Compliance

Next up — standards compliance. It turns out that there is a an official standard for how hypot implementations should behave in certain edge cases. The IEEE-754 standard for floating point arithmetic has something to say about how any implementation of hypot handles NaNs (Not a Number) and inf (Infinity).

It states that any implementation of hypot should behave as follows (Here’s a human readable summary https://www.agner.org/optimize/nan_propagation.pdf)

hypot(nan,inf) = hypot(inf,nan) = inf

numpy behaves well!

np.hypot(np.nan,np.inf)

inf

np.hypot(np.inf,np.nan)

inf

My implementation does not

mike_hypot(np.inf,np.nan)

nan

So in summary, my implementation is

Wrong for very large numbers
Wrong for very small numbers
Not standards compliant

That’s a lot of mistakes for one line of code! Of course, we can do better with a small number of extra lines of code as John D Cook demonstrates in the blog post What’s so hard about finding a hypotenuse?

Hypot implementations in production

Production versions of the hypot function, however, are much more complex than you might imagine. The source code for the implementation used in openlibm (used by Julia for example) was 132 lines long last time I checked. Here’s a screenshot of part of the implementation I saw for prosterity. At the time of writing the code is at https://github.com/JuliaMath/openlibm/blob/master/src/e_hypot.c

That’s what bullet-proof, bug checked, has been compiled on every platform you can imagine and survived code looks like.

There’s more!

Active Research

When I learned how complex production versions of hypot could be, I shouted out about it on twitter and learned that the story of hypot was far from over!

The implementation of the hypot function is still a matter of active research! See the paper here https://arxiv.org/abs/1904.09481

Is Your Research Software Correct?

Given that such a ‘simple’ computation is so complicated to implement well, consider your own code and ask Is Your Research Software Correct?.

When numpy.empty is not faster than numpy.zeros

Mike Croucher — Sun, 22 Sep 2019 14:43:20 +0000

In many introductions to numpy, one gets taught about np.ones, np.zeros and np.empty. The accepted wisdom is that np.empty will be faster than np.ones because it doesn’t have to waste time doing all that initialisation. A quick test in a Jupyter notebook shows that this seems to be true!

import numpy as np

N = 200
zero_time = %timeit -o some_zeros = np.zeros((N,N))
empty_time = %timeit -o empty_matrix = np.empty((N,N))
print('np.empty is {0} times faster than np.zeros'.format(zero_time.average/empty_time.average))


8.34 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
436 ns ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
np.empty is 19.140587654682545 times faster than np.zeros

20 times faster may well be useful in production when using really big matrices. Might even be worth the risk of dealing with uninitialised variables even though they are scary!

However…..on my machine (Windows 10, Microsoft Surface Book 2 with 16Gb RAM), we see the following behaviour with a larger matrix size (1000 x 1000).

import numpy as np

N = 1000
zero_time = %timeit -o some_zeros = np.zeros((N,N))
empty_time = %timeit -o empty_matrix = np.empty((N,N))
print('np.empty is {0} times faster than np.zeros'.format(zero_time.average/empty_time.average))


113 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
112 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.empty is 1.0094651980894993 times faster than np.zeros

The speed-up has vanished! A subsequent discussion on twitter suggests that this is probably because the operating system is zeroing all of the memory for security reasons when using numpy.empty on large arrays.

Second Order Cone Programming (SOCP) using the NAG Library for Python

Mike Croucher — Mon, 02 Sep 2019 22:32:44 +0000

What is Second Order Cone Programming (SOCP)?

Second Order Cone Programming (SOCP) problems are a type of optimisation problem that have applications in many areas of science, finance and engineering. A summary of the type of problems that can make use of SOCP, including things as diverse as designing antenna arrays, finite impulse response (FIR) filters and structural equilibrium problems can be found in the paper ‘Applications of Second Order Cone Programming’ by Lobo et al. There are also a couple of examples of using SOCP for portfolio optimisation in the GitHub repository of the Numerical Algorithms Group (NAG).

A large scale SOCP solver was one of the highlights of the Mark 27 release of the NAG library (See here for a poster about its performance). Those who have used the NAG library for years will expect this solver to have interfaces in Fortran and C and, of course, they are there. In addition to this is the fact that Mark 27 of the NAG Library for Python was released at the same time as the Fortran and C interfaces which reflects the importance of Python in today’s numerical computing landscape.

Here’s a quick demo of how the new SOCP solver works in Python. The code is based on a notebook in NAG’s PythonExamples GitHub repository.

NAG’s handle_solve_socp_ipm function (also known as e04pt) is a solver from the NAG optimization modelling suite for large-scale second-order cone programming (SOCP) problems based on an interior point method (IPM).

$$
\begin{array}{ll}
{\underset{x \in \mathbb{R}^{n}}{minimize}\ } & {c^{T}x} \\
\text{subject to} & {l_{A} \leq Ax \leq u_{A}\text{,}} \\
& {l_{x} \leq x \leq u_{x}\text{,}} \\
& {x \in \mathcal{K}\text{,}} \\
\end{array}
$$

where $\mathcal{K} = \mathcal{K}^{n_{1}} \times \cdots \times \mathcal{K}^{n_{r}} \times \mathbb{R}^{n_{l}}$ is a Cartesian product of quadratic (second-order type) cones and $n_{l}$-dimensional real space, and $n = \sum_{i = 1}^{r}n_{i} + n_{l}$ is the number of decision variables. Here $c$, $x$, $l_x$ and $u_x$ are $n$-dimensional vectors.

$A$ is an $m$ by $n$ sparse matrix, and $l_A$ and $u_A$ and are $m$-dimensional vectors. Note that $x \in \mathcal{K}$ partitions subsets of variables into quadratic cones and each $\mathcal{K}^{n_{i}}$ can be either a quadratic cone or a rotated quadratic cone. These are defined as follows:

Quadratic cone:

$$
\mathcal{K}_{q}^{n_{i}} := \left\{ {z = \left\{ {z_{1},z_{2},\ldots,z_{n_{i}}} \right\} \in {\mathbb{R}}^{n_{i}} \quad\quad : \quad\quad z_{1}^{2} \geq \sum\limits_{j = 2}^{n_{i}}z_{j}^{2},\quad\quad\quad z_{1} \geq 0} \right\}\text{.}
$$

Rotated quadratic cone:

$$
\mathcal{K}_{r}^{n_{i}} := \left\{ {z = \left\{ {z_{1},z_{2},\ldots,z_{n_{i}}} \right\} \in {\mathbb{R}}^{n_{i}}\quad\quad:\quad \quad\quad 2z_{1}z_{2} \geq \sum\limits_{j = 3}^{n_{i}}z_{j}^{2}, \quad\quad z_{1} \geq 0, \quad\quad z_{2} \geq 0} \right\}\text{.}
$$

For a full explanation of this routine, refer to e04ptc in the NAG Library Manual

Using the NAG SOCP Solver from Python

This example, derived from the documentation for the handle_set_group function solves the following SOCP problem

minimize $${10.0x_{1} + 20.0x_{2} + x_{3}}$$

from naginterfaces.base import utils
from naginterfaces.library import opt

# The problem size:
n = 3

# Create the problem handle:
handle = opt.handle_init(nvar=n)

# Set objective function
opt.handle_set_linobj(handle, cvec=[10.0, 20.0, 1.0])

subject to the bounds
$$
\begin{array}{rllll}
{- 2.0} & \leq & x_{1} & \leq & 2.0 \\
{- 2.0} & \leq & x_{2} & \leq & 2.0 \\
\end{array}
$$

# Set box constraints
opt.handle_set_simplebounds(
    handle,
    bl=[-2.0, -2.0, -1.e20],
    bu=[2.0, 2.0, 1.e20]
)

the general linear constraints

\begin{array}{lcrcrcrclcl}
& & {- 0.1x_{1}} & – & {0.1x_{2}} & + & x_{3} & \leq & 1.5 & & \\
1.0 & \leq & {- 0.06x_{1}} & + & x_{2} & + & x_{3} & & & & \\
\end{array}

# Set linear constraints
opt.handle_set_linconstr(
    handle,
    bl=[-1.e20, 1.0],
    bu=[1.5, 1.e20],
    irowb=[1, 1, 1, 2, 2, 2],
    icolb=[1, 2, 3, 1, 2, 3],
    b=[-0.1, -0.1, 1.0, -0.06, 1.0, 1.0]
    );

and the cone constraint

$$\left( {x_{3},x_{1},x_{2}} \right) \in \mathcal{K}_{q}^{3}\text{.}$$

# Set cone constraint
opt.handle_set_group(
    handle,
    gtype='Q',
    group=[ 3,1, 2],
    idgroup=0
);

We set some algorithmic options. For more details on the options available, refer to the routine documentation

# Set some algorithmic options.
for option in [
        'Print Options = NO',
        'Print Level = 1'
]:
    opt.handle_opt_set(handle, option)

# Use an explicit I/O manager for abbreviated iteration output:
iom = utils.FileObjManager(locus_in_output=False)

Finally, we call the solver

# Call SOCP interior point solver
result = opt.handle_solve_socp_ipm(handle, io_manager=iom)

 ------------------------------------------------
  E04PT, Interior point method for SOCP problems
 ------------------------------------------------

 Status: converged, an optimal solution found
 Final primal objective value -1.951817E+01
 Final dual objective value   -1.951817E+01

The optimal solution is

result.x

array([-1.26819151, -0.4084294 ,  1.3323379 ])

and the objective function value is

result.rinfo[0]

-19.51816515094211

Finally, we clean up after ourselves by destroying the handle

# Destroy the handle:
opt.handle_free(handle)

As you can see, the way to use the NAG Library for Python interface follows the mathematics quite closely.
NAG also recently added support for the popular cvxpy modelling language that I’ll discuss another time.

Fixing missing ‘Anaconda Prompt’ start menu shortcut on windows 10

Mike Croucher — Mon, 01 Jul 2019 14:08:02 +0000

I am a huge user of Anaconda Python and the way I usually get access to the Anaconda Prompt is to start typing ‘Anaconda’ in the Windows search box and click on the link as soon as it pops up. Easy and convenient. Earlier today, however, the Windows 10 menu shortcuts for the Anaconda command line vanished from my machine!

I’m not sure exactly what triggered this but I was heavily messing around with various environments, including the base one, and also installed Visual Studio 2019 Community Edition with all the Python extensions before I noticed that the menu shortcuts had gone missing. No idea what the root cause was.

Fortunately, getting my Anaconda Prompt back was very straightforward:

launch Anaconda Navigator
Click on Environments
Selected base (root)
Choose Not installed from the drop down list
Type for console_ in the search box
Check the console_shortcut package
Click Apply and follow the instructions to install the package

High Performance Computing (HPC) – On premise or cloud? The cost question.

Mike Croucher — Fri, 24 May 2019 02:28:30 +0000

HPC! To cloud or not to cloud….

Over the course of my career I have been involved with the provision of High Performance Computing (HPC) at almost every level. As a researcher and research software engineer I have been, and continue to be, a user of many large scale systems. As a member of research computing support, I was involved in service development and delivery, user-support, system administration and HPC training. Finally, as a member of senior leadership teams, I have been involved in the financial planning and strategic development of institution-wide HPC services.

In recent years, the question that pops up at every level of involvement with HPC is ‘Should we do this with our own equipment or should we do this in the cloud?’

This is an extremely complex, multi-dimensional question where the parameters are constantly shifting. The short, answer is always ‘well…it depends’ and it always does ‘depend’…on many things! What do you want to do? What is the state of the software that you have available? How much money do you have? How much time do you have? What support do you have? What equipment do you currently have and who else are you working with and so on.

Today, I want to focus on just one of these factors. Cost. I’m not saying that cost is necessarily the most important consideration in making HPC-related decisions but given that most of us have fixed budgets, it is impossible to ignore.

The value of independence

I have been involved in many discussions of HPC costs over the years and have seen several attempts at making the cloud vs on-premise cost-comparison. Such attempts are often biased in one direction or the other. It’s difficult not to be when you have a vested interest in the outcome.

When I chose to leave academia and join industry, I decided to work with the Numerical Algorithms Group (NAG), a research computing company that has been in business for almost 50 years. One reason for choosing them was their values which strongly resonate with my own. Another was their independence. NAG are a not-for-profit (yet commercial!) company who literally cannot be bought. They solve the independence problem by collaborating with pretty much everybody in the industry.

So, NAG are more interested in helping clients solve their technical computing problems than they are in driving them to any given solution. Cloud or on-premise? They don’t mind…they just want to help you make the decision.

Explore HPC cost models yourself

NAG’s VP for HPC services and consulting, Andrew Jones (@hpcnotes), has been teaching seminars on Total Cost Of Ownership models for several years at events like the SC Supercomputing conference series. To support these tutorials, Andrew created a simple, generic TCO model spreadsheet to illustrate the concepts and to show how even a simple TCO model can guide decisions.

Many people asked if NAG could send them the TCO spreadsheet from the tutorial but they decided to go one better and converted it into a web-form where you can start exploring some of the concepts yourself.

Here’s the HPC Total Cost of Ownership calculator for you to play with: https://www.nag.com/content/hpc-tco-calculator

I also made a video about it.

If you want more, get in touch with either me (@walkingrandomly) or Andrew (@hpcnotes) on twitter or email hpc@nag.com. We’ll also be teaching this type of thing at ISC 2019 so please do join us there too.