Lost Boy

A non-digital service example of working in the open

Leigh Dodds — 2025-09-12T10:24:04Z

Matt has recently been blogging and speaking about “working in the open” in public service roles. Giles has written a lot about working in the open too, most recently collecting examples of teams who are doing open for different purposes, e.g. for remembering and thinking out loud.

I’ve worked in the open on software and data projects and there are similar and different issues to tackle. The driver for “openness” might also have different goals. E.g. creating something together.

So I always enjoy reading about working in the open in different contexts. There’s usually something to learn. Even if sometimes that might just be a glimpse at the organisational issues that make it necessary, different or hard.

In the spirit of sharing another example of working in the open, from a very different context, I wanted to point to David Bull. He’s a woodblock print maker based in Japan. You can read a bit about him on his Wikipedia page.

I’m not 100% sure who first pointed out David to me. I think it might have been Jack. But I’ve been watching his YouTube videos and Twitch streams for around five years now. I think the craft, art and process around Japanese woodblock printmaking is fascinating.

For this blog post though what I wanted to highlight is the variety of ways in which David is working in the open through his videos, live streams and collaborations:

He’s been open about the challenges of growing his business, including recruiting and developing his team of carvers and printers, right down to annual updates on their finances
He’s openly sharing the process of creating prints for his customers, so they can see behind the scenes at how they’re being made. (I now own a print that I’ve watched being carved and printed)
He’s documenting the craft of carving and printmaking, to allow his community to learn from him and other experts in the field. Including a recent project with the British Museum. There’s a lot of tacit knowledge being shared in his videos
He’s sharing his knowledge of the history of printmaking, with regular show-and-tells of old prints
Just recently, he’s started investing in rebuilding a supply chain of good quality paper (“washi“). This paper is increasingly difficult to source but necessary for his work. So he’s been doing an experiment that involves reopening previously closed paper making workshops and supporting other craftspeople. All of this is being recorded and shared openly

I think it’s a great example of working in the open. Just not one that involves creating public services, data or software.

I’m also slightly in awe of the guy. Not just for his skills as a carver and print maker, but as a video maker and software developer. He’s casually mentioned that he’s developed all the websites and software used by his shop.

Automating the right things

There’s one other nugget I wanted to share here. It’s not about working in the open, but about automation.

There’s a lot of laborious process involved in making washi paper. One of the most time consuming involves picking out the impurities (“chiri“) from the fibres used to make the paper. This involves working through all of the fibres, by hand, in vats of cold water for hours at a time.

David has been talking about trying to automate this in some of his streams. It’s very time consuming.

There’s a section in a recent video where he mentioned this idea to the paper maker he’s working with. It’s a short section of just a couple of minutes, which I’d encourage you to watch.

David explains how the paper maker was not interested in automating at all. For her that step is an important part of her craft and the value she was bringing as an artisan. It’s part of her pride in her work.

David then goes on to say he feels the same way about his carving. Why automate something that brings him joy?

We’re not all artisans, but I think we all bring some craft to our lives.

I thought the exchange was interesting, not just as an example of artisans talking about their work. I think it also highlights how people, even in adjacent roles in the same process, can have very different perspectives on where the “inefficiencies” lie.

Probably a lesson somewhere there for digital working too.

Calculating carbon emissions for energy data in the UK

Leigh Dodds — 2025-03-26T19:15:41Z

I’ve recently been taking a closer look at the Streamlined Energy & Carbon Reporting (SECR) guidelines. While we currently produce figures on CO₂ emissions in Energy Sparks we’re not producing SECR style reports for schools or trusts. This is something we’re planning to add to the product soon, so I’ve been looking at what’s involved and how it’s different to what we’re currently doing.

I’m going to drop my notes in here in case useful for anyone else. I’ll briefly outline how to calculate emissions for SECR and then look at more accurate ways to do it.

I’m not going to summarise SECR itself or look at anything except reporting of emissions for electricity and mains gas. If you’re looking for more details then look at the requirements document. There’s also some specific guidance for multi-academy trusts, as well as plenty of third-party tools and explanations.

A brief look at Scopes 1, 2 and 3

As I’ll be talking about Scope 1, 2 and 3 emissions, it’s useful to define what those are:

Scope 1 – your direct emissions, e.g. from burning gas to heat a building
Scope 2 – your indirect emissions that are caused directly by your activities, e.g. by you consuming electricity from the grid
Scope 3 – all other indirect emissions that are not directly caused by you, but which relate to your organisation’s “value chain”, e.g. emissions from the transmission and distribution of electricity or, if you used LPG, from the transport of that to your buildings

Basic SECR Calculations

There are three main greenhouse gases: Carbon Dioxide (CO₂), Methane (CH₄) and Nitrous Oxide (N₂O), with CO₂ being the main one.

It’s possible to report these separately but the convention is to report all GHG emissions as a “CO₂ equivalent”, i.e. the amount of CO₂ that would cause the same warming.

For SECR, your report needs to include both your annual consumption (in kWh) and your emissions.

Emissions can be calculated from your consumption using a conversion factor that converts a unit of energy into a CO₂ equivalent (kg CO₂e). The conversion factors for the UK are published annually — around June I think — with the entire history available online.

So you need to find and use the right conversion factors, for the right year, to calculate the emissions for different types of energy consumption. For 2024 the electricity and gas figures look like this:

Activity	Type	Unit	Total kg CO2e per unit	kg CO2e of CO2 per unit	kg CO2e of CH4 per unit	kg CO2e of N2O per unit
Electricity	Consumption	kWh	0.20705	0.20493	0.00090	0.00122
Electricity	Transmission & Distribution	kWh	0.0183	0.01811	0.00008	0.00011
Natural Gas	Consumption	m3	2.04542	2.0414	0.00307	0.00095
Natural Gas	Consumption	kWh	0.20264	0.20223	0.00031	0.0001

2024 Greenhouse Gas Conversion Factors. Note that Natural Gas Consumption figures in kWh are Net CV. Source: https://www.gov.uk/government/publications/greenhouse-gas-reporting-conversion-factors-2024

The Transmission & Distribution factor for electricity is used when reporting Scope 3 emissions attributable to the supply of electricity to your location. This isn’t a mandatory requirement but I’ve read that it’s considered best practice to include these.
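As a rough illustration of the basic calculation, here is a minimal Python sketch that multiplies annual consumption by the 2024 factors from the table above. The function name and the consumption figures are hypothetical, and it assumes the gas kWh value is on the same Net CV basis as the factor.

```python
# 2024 UK conversion factors in kg CO2e per kWh, copied from the table above.
ELECTRICITY_CONSUMPTION = 0.20705
ELECTRICITY_T_AND_D = 0.0183
NATURAL_GAS_KWH_NET_CV = 0.20264


def location_based_emissions(electricity_kwh: float, gas_kwh: float) -> dict:
    """Return kg CO2e by scope for one year's consumption.

    Assumes gas_kwh is expressed on a Net CV basis, to match the factor.
    """
    return {
        "scope_1_gas": gas_kwh * NATURAL_GAS_KWH_NET_CV,
        "scope_2_electricity": electricity_kwh * ELECTRICITY_CONSUMPTION,
        "scope_3_electricity_t_and_d": electricity_kwh * ELECTRICITY_T_AND_D,
    }


# Hypothetical annual consumption figures, for illustration only.
report = location_based_emissions(electricity_kwh=100_000, gas_kwh=250_000)
for scope, kg in report.items():
    print(f"{scope}: {kg:,.0f} kg CO2e")
```

The factors need swapping for the right reporting year, so in practice you would probably look them up by year rather than hard-code them.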

There are two conversion factors for Natural Gas. This is because gas meters measure the volume of gas rather than energy in kWh. But bills typically also provide a kWh value for consistency.

Actually, some older gas meters are still on Imperial units so they report measurements in cubic-feet or hundreds of cubic-feet. Something to watch for when dealing with energy data.

There is a formula for converting volume measurements of gas into a kWh value, which relies on knowing the calorific value of the gas being used. In the UK the calorific values are regulated to be between around 38 MJ/m³ and 41 MJ/m³. So you can use an average of those, but see below for more details.
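A minimal sketch of that conversion, as it is usually printed on UK gas bills. The volume correction factor (1.02264) and the imperial conversion (2.83 m³ per hundred cubic feet) are not from this post, so treat them as assumptions. The default calorific value is just the midpoint of the range above.

```python
VOLUME_CORRECTION = 1.02264        # assumed standard temperature/pressure correction
MJ_PER_KWH = 3.6                   # 1 kWh = 3.6 MJ
M3_PER_HUNDRED_CUBIC_FEET = 2.83   # assumed conversion for older imperial meters


def hundreds_of_cubic_feet_to_m3(units: float) -> float:
    """Convert a reading in hundreds of cubic feet to cubic metres."""
    return units * M3_PER_HUNDRED_CUBIC_FEET


def gas_m3_to_kwh(volume_m3: float, calorific_value_mj_per_m3: float = 39.5) -> float:
    """Convert a metered gas volume in m3 to kWh for a given calorific value."""
    return volume_m3 * VOLUME_CORRECTION * calorific_value_mj_per_m3 / MJ_PER_KWH


# Hypothetical readings: 100 m3 on a metric meter, 10 units on an imperial one.
print(round(gas_m3_to_kwh(100), 1))
print(round(gas_m3_to_kwh(hundreds_of_cubic_feet_to_m3(10)), 1))
```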

Renewables

Energy generated and consumed from solar panels can be included in SECR. Although I find the guidance to be a little unclear.

The 2024 conversion factors include a note to say that:

Where electricity is onsite generated from a renewable source such as PV then the kWh should be reported under scope 1 with 0 emissions for both the electricity fuel source and 0 emissions for the T&D value.
Greenhouse gas reporting: conversion factors 2024 – condensed set

So my understanding from that is that annual solar generation figures can be directly included in the Scope 1 and Scope 3 emissions.

Annex G of the main SECR guidance also states that solar exports can be used to produce a net kg CO₂e emissions figure:

The emission reduction should be calculated using the grid average factor. Total emissions reductions from generated and exported renewable electricity (and green tariffs if appropriate) must not be greater than gross scope 2 emissions.
SECR, Annex G

So the export emissions for producing that offset are calculated using the electricity consumption intensity factor for that year (e.g. 0.20705 kg CO₂e from the above table).
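My reading of Annex G can be sketched like this. It is an interpretation rather than an official calculation: the function name is hypothetical, and the cap simply stops the export reduction from exceeding gross Scope 2 emissions.

```python
GRID_AVERAGE_FACTOR_2024 = 0.20705  # kg CO2e per kWh, from the table above


def net_scope_2_emissions(import_kwh: float, export_kwh: float,
                          factor: float = GRID_AVERAGE_FACTOR_2024) -> float:
    """Gross Scope 2 emissions minus the export reduction, in kg CO2e."""
    gross = import_kwh * factor
    # The reduction must not be greater than gross Scope 2 emissions.
    reduction = min(export_kwh * factor, gross)
    return gross - reduction


# Hypothetical figures: 50,000 kWh imported, 8,000 kWh of solar exported.
print(round(net_scope_2_emissions(50_000, 8_000), 1))
```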

What about if you’re on a 100% renewable tariff or have purchased offsets of some kind? Well there are no green electrons so you can’t actually consume only 100% renewable energy if you’re consuming any electricity from the grid. At least today.

SECR accommodates green tariffs and offsets by suggesting that you should report both a “location” based emissions (which is what I’ve outlined above) AND a “market” based set of emissions.

The market emissions would be calculated using a set of intensity factors provided by your supplier. These figures are also supposed to be backed up, e.g. by a REGO.

Limitations of SECR

SECR is a simplified approach so has some limitations.

The main one is that the intensity factors produced by the government are always out of date. As the 2024 methodology explains, the data used to calculate the electricity consumption intensity factor is based on grid emissions data from 2 years ago.

So the current 2024 electricity consumption factors are based on the state of the National Grid in 2022. The grid has decarbonised significantly over the last few years and continues to do so. So the emissions reported using these factors will be higher than actual.

Source: https://www.neso.energy/about-neso/our-progress-towards-net-zero/carbon-intensity-dashboard

Because SECR is based on calculating emissions from your total annual consumption, it won’t reflect that the grid carbon intensity varies over the course of the day. Or regionally. The amount of renewable energy in the mix has a significant impact. So the time when you consume energy, and where you’re consuming it, also makes a difference.

For natural gas emissions there’s a similar issue. The Calorific Value of the gas pumped to your home varies from day to day. So the actual emissions will be different from the averages used in calculating the natural gas intensity factors.

More accurate approaches

A more accurate approach to calculating CO₂ emissions is to use half-hourly energy data. You can calculate the emissions attributable to each half-hourly period using:

Data from the Carbon Intensity API which provides an estimate of the carbon intensity of the grid for each half-hour. It provides both historical estimates and forecasts, as well as regional figures. We harvest these in Energy Sparks on a daily basis.
Daily natural gas calorific values from the National Gas data portal, to convert the volume of gas consumed to kWh. This gives a more accurate estimate of the energy consumed, from which you can then calculate the CO₂ emissions
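For electricity, the half-hourly approach boils down to something like the sketch below. It doesn’t call the Carbon Intensity API. It assumes you already have one intensity value per half-hour, in gCO₂/kWh, aligned with your consumption readings. The function name and sample data are hypothetical.

```python
def half_hourly_emissions_kg(consumption_kwh, intensity_g_per_kwh):
    """Return kg CO2 per half-hour, given aligned consumption and intensity series."""
    if len(consumption_kwh) != len(intensity_g_per_kwh):
        raise ValueError("consumption and intensity series must be the same length")
    # kWh * g/kWh gives grams, so divide by 1000 to get kg.
    return [kwh * grams / 1000 for kwh, grams in zip(consumption_kwh, intensity_g_per_kwh)]


# Hypothetical data: the same consumption at low and high grid intensity.
consumption = [1.2, 1.2, 1.2, 1.2]
intensity = [80, 95, 210, 240]
per_period = half_hourly_emissions_kg(consumption, intensity)
print([round(kg, 3) for kg in per_period], round(sum(per_period), 3))
```

The same consumption produces very different emissions depending on when it happens, which is exactly what an annual factor hides.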

However there are some caveats here that apply to the data from the Carbon Intensity API. As they explain in their methodology there are some differences from the intensity factors used in SECR:

The estimated emissions are for CO₂ only. Other greenhouse gases are not estimated. SECR requires reporting CO₂e figures which include all three gases. I suspect the intensity could be adjusted based on the contribution of other gases over time, using the standard GHG factors.
The estimated emissions cover direct emissions from the electricity generation AND transmission and distribution emissions. SECR requires reporting these as separate figures, with separate emissions factors for each. The electricity generation factor also includes the cost of transporting fuel to power stations

This means the resulting figures aren’t comparable but are likely to be a better overall estimate of CO₂ emissions.

SECR is useful for providing a baseline for comparing businesses but is less helpful for tracking your current emissions or setting targets.

If you think I’ve missed anything or made a mistake, then let me know!

“AI-Ready Data” is the wrong framing

Leigh Dodds — 2025-03-08T14:32:21Z

A paper was published this week by Stefaan Verhulst, Andrew Zahuranec and Hannah Chafetz called “Moving Toward the FAIR-R principles: Advancing AI-Ready Data“.

The paper sets out to do two things:

Make the case that we are in a “Fourth Wave” of open data in which it is critical that data is made useful for AI, and Generative AI (GenAI) in particular, so that data can be democratised and be used to create impact
That the FAIR data framework needs to be extended to make it “FAIR-R”: Ready for AI

I think both of these things are wrong.

Is there a Fourth Wave of open data, and if so is it about making data “AI-Ready”?

The paper opens by laying out a description of how the open data movement has evolved. The authors suggest that this can be mapped out as the following stages:

Freedom of Information requests, and government transparency
Open by default
Purpose-driven reuse of data
Preparing data for generative AI

I think it’s impossible to adequately describe the evolution of the open data movement without acknowledging that different parts of that movement have adopted open data, as a tool and a means to an end, for very different reasons. Anything else is unnecessarily reductive.

The open government data movement might be traced back to FOI and transparency. But that’s not the origin of open data in science, which can be traced back for decades with a focus on cooperation and reusability.

These movements do not start from the same place. They also have not gone through the same stages or issues. They’re not facing all of the same issues today.

There was definitely a refocusing of the effort around opening government and commercial data from an “open by default” (“if we release it, the magic will happen“) stance to one that was more focused on creating impact through more purposeful publishing and collaboration.

“Open by default” made sense in the early days as a means to unlock data. But purposeful publishing, with closer coordination between publishers and reusers of data, has been proven to have better results.

But again, this is not the same for all parts of the open data movement.

In my opinion the first three “waves” that the authors describe might be seen as a characterisation of the evolution of the open government data movement, but not more broadly.

I think the suggestion that the next, or current, wave is about preparing data for Generative AI is just wrong.

To me the current situation is more about reconciling the goals of a movement, which has always been based on unrestricted access and use of data, with a landscape in which data is being used at a scale which is not sustainable, and in ways that may cause harm.

Responding to that challenge involves improving the stewardship and governance of data, by building on that closer collaboration between the publishers and consumers of data demonstrated in the “Third Wave”. It is not about leaning in to the large scale industrial use of data in AI.

I’m not denying that machine learning and AI can be used in profound and innovative ways. Just that focusing on the needs of a particular type of use increases the risk of eroding trust and creating harms.

Put more simply: framing the “Fourth Wave” of open data as being about servicing the needs of AI makes the same mistake as the “open by default” framing: “if we do this thing, the magic will happen“.

We’re just substituting AI for hackdays.

Do we need to extend FAIR to FAIR-R?

The rest of the paper proposes that FAIR should be extended with an extra letter in the acronym:

Readiness for AI: Datasets must be structured to meet the specific (quality) requirements of AI applications, such as labeled data for supervised learning or comprehensive coverage for unsupervised learning.

The authors outline some of the potential impacts of AI, characteristics of the datasets that are useful for training, and the emerging standards for describing and publishing training datasets.

The aspects of FAIR (“Findable“, “Accessible“, “Interoperable” and “Reusable“) are all underpinned by a set of principles that describe what they mean. For example, for a dataset to be “Findable” it must have good metadata, and that metadata should be published somewhere where it can be indexed.

What it means to be “Ready for AI” is not really defined in the paper.

But, with an exception that I’ll explore below, the concept of “Ready for AI” is already covered within the existing elements of FAIR. For a dataset to be useful in training AI means ticking all of the Findable, Accessible, Interoperable and Reusable boxes.

What the authors should instead be proposing is a FAIR implementation profile that describes what FAIR means when applied to training datasets.

I’ve previously described the importance of implementation profiles in bridging the gap between broad principles and actionable data management guidance. Packaging up all of the existing work around improving data infrastructure for AI into a set of actionable best practices would be a useful step.

Not least because it is clearer about what is being requested by the authors: which is that data publishers should invest in supporting specific standards and publish data in ways, and within platforms, that are useful for a specific technology and user community.

Open data has always placed more of a burden on publishers than consumers. It is publishers that invest in collecting, structuring, documenting and publishing data in ways that reduce the work of consumers.

Asking data providers to do more work, to be part of a broader data ecosystem, potentially comes at the cost of supporting existing users. That feels problematic to me. In many areas essential data infrastructure needs additional investment in order to remain useful in its current form. We’re asking people to do more.

It’s also problematic because the organisations building new AI based systems are amongst the most well funded organisations in the world. Organisations that might reasonably be expected to be able to shoulder the costs of translating and restructuring data into the forms that are useful for them. Or even invest in its co-creation and maintenance.

In the paper, the authors do note the need to address ethical concerns around the use and reuse of data within AI applications. But they don’t acknowledge other efforts like the CARE principles which already address that. I previously summarised a few other limitations of FAIR and efforts to address those. Addressing ethical and legal concerns should not be focused solely on AI.

In my opinion, FAIR data has always been about advocating for better data management: the processes by which we collect, manage, publish and share data. It hasn’t really been about data governance: the decision making and oversight that guides and informs those data management processes.

To use a metaphor: FAIR is about making data easy to consume. It encourages you to describe what is on the menu; to make sure it’s clearly described and labelled; and to make sure that what is presented is well-cooked and served. It’s never been about telling you the source of those ingredients or what has been going on in the kitchen.

To achieve that we don’t really need more principles.

Falsehoods this programmer believed about energy meters

Leigh Dodds — 2025-03-01T12:39:38Z

This is the second part to a post I published earlier this week in which I summarised some things I learned about working with half-hourly energy data. I’ll be updating that shortly with a few extra details and clarifications.

This post will be a summary of some things I’ve learned about energy meters and metering. It’s not a comprehensive primer on meters. I am not an electrician. And I’ve already written at length about smart meters and non-domestic energy data infrastructure so will avoid covering the same ground here.

If you’re already familiar with energy data and metering there’s unlikely to be a lot of deep insights here. But there are a number of things that weren’t obvious to me at the start of my journey into energy data. And some which have sprung up since to surprise me.

A site will have a single electricity, or gas or other meter

A UK home usually has a gas meter and an electricity meter. But generally working with energy data means dealing with multiple meters at a single location.

It’s not even always true that UK homes have a single electricity or gas meter. And non-domestic sites can have many different electricity and gas meters serving different parts of the property. We’ve got schools on Energy Sparks with more than 20 meters.

The SMETS2 smart meter standard supports something like five different electricity meters within the same local network.

Related to this, because an MPAN is an identifier for an electricity supply point, that identifier can at times map to multiple electricity meters, each with their own serial number. This one tripped me up recently.

The set of meters on a site won’t change

Devices fail and are replaced. Devices need to be upgraded and are changed. Replacements will have different serial numbers but the same MPAN. Except where they don’t.

Buildings grow and shrink on a temporary or permanent basis. They can be refitted and rewired. As a result, the set of meters at that location may also change.

Meters can be direct replacements for one another. Or the new meter might be monitoring part of the consumption originally reported from an existing meter(s). Or it might be measuring something new.

To build a picture of the overall energy usage at a site you need to aggregate data from multiple meters. But you will likely need some rules to define how to build that aggregate time series from the underlying data.

For example, if you’re advising someone on energy efficiency, when looking at overall usage you will want to make sure you have the latest data from all currently installed meters. You don’t necessarily want to make recommendations if you don’t have recent data for one or more meters (unless perhaps if they are monitoring a minor part of the on-site consumption). So you need some rules that handle lags in data as well as meters being removed.
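One possible shape for such a rule, as a sketch only: flag any meter that is still installed but hasn’t reported within an allowed lag, and ignore meters known to have been removed. The function name, the seven day threshold and the meter identifiers are all hypothetical.

```python
from datetime import date, timedelta


def stale_meters(latest_reading, today, max_lag_days=7, removed=frozenset()):
    """Return ids of installed meters whose latest reading is older than the allowed lag."""
    cutoff = today - timedelta(days=max_lag_days)
    return sorted(
        meter_id
        for meter_id, last_seen in latest_reading.items()
        if meter_id not in removed and last_seen < cutoff
    )


# Hypothetical site: one current meter, one lagging meter, one removed meter.
latest = {
    "meter-a": date(2025, 2, 27),
    "meter-b": date(2025, 1, 10),
    "meter-c": date(2024, 6, 1),
}
print(stale_meters(latest, today=date(2025, 3, 1), removed={"meter-c"}))
```

If the list is not empty you might hold back recommendations that rely on the site-wide total, unless the lagging meters only cover a minor part of the consumption.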

The consumption from multiple meters is additive

Let’s say you have data for three electricity meters at a location. You can calculate the total consumption by just adding up the data from those meters, right? Not necessarily.

Submetering is where additional meters are installed at a location to measure a particular type of consumption (e.g. your lighting or heating) or even a piece of equipment (a heat pump, a boiler).

Submeters measure the same usage as the mains meters on the same circuit. You can’t add them all up otherwise you’ll double count.

Some solar monitoring systems will also be metering the mains electricity supply, so you may have two ways to access that data.

So you need to know what those different meters are measuring.
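A small sketch of what that knowledge buys you. It assumes each reading records which mains meter, if any, it sits behind. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeterReading:
    meter_id: str
    kwh: float
    parent_id: Optional[str] = None  # set when this is a submeter behind another meter


def site_total_kwh(readings):
    """Sum only top-level meters, so submetered usage isn't counted twice."""
    return sum(r.kwh for r in readings if r.parent_id is None)


# Hypothetical site with two mains meters and a heat pump submeter.
readings = [
    MeterReading("mains-1", 120.0),
    MeterReading("mains-2", 80.0),
    MeterReading("heat-pump", 30.0, parent_id="mains-1"),
]
print(site_total_kwh(readings))
```

Naively adding all three readings would give 230 kWh instead of 200 kWh.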

Different meters on the same site will have similar usage profiles

Or, to put it differently: you can’t use usage from one meter to estimate that of another.

This is obvious when you stop to think about it.

Meters may be measuring different buildings or areas of a property that are used in very different ways. Submeters might be monitoring different types of equipment with different usage profiles.

The profile of gas usage for a kitchen is different to a boiler. In a school, the gas usage in the science block might just be measuring the gas taps in the classroom. These are all very different.

A couple of less obvious situations that have arisen in our work at Energy Sparks.

Meters are only measuring consumption by the people or organisation paying the bill

This is obviously the case for rental properties, flat shares, etc. But here’s a couple of non-obvious, non-domestic examples.

One school on Energy Sparks has a leisure centre that operates on the same site. It’s metered as part of the school’s overall consumption but is run by a different organisation. A submeter is used to do some internal recharging between the school and the leisure centre. This does mean that the school’s overall energy consumption isn’t representative of its own energy efficiency.

In another situation, a new school is being built next to an existing one. Once built it will have its own meters and be part of a separate trust. The building site is temporarily using the existing school’s electricity supply until the new supply is installed. Again there is a submeter helping to handle the billing.

This makes analysing energy usage particularly challenging as usage no longer correlates to the organisation you’re supporting with your energy efficiency advice.

Meters have a consistent way to be remotely read

There are a wide range of different technologies used to allow readings to be fetched, or sent, from a meter. This includes GSM, 3G, the UK’s Smart Meter WANs, via a modem or by connecting to a local Wi-Fi network.

Older energy meters need to be upgraded or replaced before the UK’s 2G/3G networks are switched off.

Solar systems are fully metered

For solar systems you’re generally interested in how much the panels are generating, how much of that you’re consuming and how much you might be exporting.

Sometimes the generation, self-consumption and export are individually metered within the solar monitoring system. But that’s not the common case. Older systems might only have a generation meter.

The UK’s Feed-In-Tariff scheme paid people with solar panels for exporting energy to the grid. The solar generation was metered, but the export was just estimated at half the generation. So you got paid the same amount no matter how much you were exporting. So there was no need for export meters.

This is a nice example of how a policy design directly shapes (data) infrastructure and its impact on the long-term management of that infrastructure. It’s left a potential gap in understanding how much of the electricity being generated by these panels is actually being consumed on site, impacting the ability to give energy efficiency advice and support, e.g. local load-balancing.

The newer Smart Export Guarantee and export tariffs use export metering to monitor how much electricity you’re actually exporting. So a combination of generation and export metering is now more common.

But, as noted below, if the panels are installed under a Power Purchase Agreement and it’s unlikely that there will be significant export, an export meter might not be installed by the organisation (e.g. a community energy company) that owns the panels.

The lack of a self-consumption meter isn’t an issue, because self-consumption is just the amount of electricity generated minus the amount exported. Right?

Solar panels are owned by the people or organisation whose property they’re installed on

This is no longer even true for domestic properties let alone non-domestic.

Power Purchase Agreements (PPA) allow (community) energy companies to install solar at a property then have an agreement with the owner to supply them with electricity at a discount rate. The energy company gets guaranteed revenue for the consumption, and perhaps some export revenue as well.

Tesco are installing solar on their stores this way. And the project in Cornwall I linked above illustrates that this arrangement is coming to the domestic market too.

This means that the energy from the solar panels is not free. Tariffs are set in the PPA. It also means that you may need to negotiate data access with a third-party, although transparency is usually beneficial on both sides.

Meters at the same site fail to be read in consistent ways

You have multiple types of meter at a site. They use different technologies to store and transmit readings. And those meters are managed by different organisations. That means you will encounter uneven data coverage and quality issues that mean the usage reported by different meters does not align.

In my last post I mentioned that solar generation and export meters can end up producing inconsistent readings when they independently fail or fill in missing readings.

This means you sometimes cannot reliably calculate the self-consumption if it’s not metered and will need to find other ways to estimate that.
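A defensive version of “generation minus export” might look like the sketch below. It works per half-hour and refuses to produce a value when either reading is missing, or when the meters disagree (export larger than generation). Using None to mark missing readings is an assumption about how the data is stored, and the function name is hypothetical.

```python
def self_consumption_kwh(generation, export):
    """Per-period self-consumption, or None where it can't be calculated reliably."""
    results = []
    for gen, exp in zip(generation, export):
        if gen is None or exp is None or exp > gen:
            # Missing or inconsistent readings: leave a gap to estimate another way.
            results.append(None)
        else:
            results.append(gen - exp)
    return results


# Hypothetical half-hourly readings in kWh.
generation = [2.0, None, 1.0, 3.0]
export = [0.5, 0.2, 1.5, 3.0]
print(self_consumption_kwh(generation, export))
```

The gaps are left as None rather than filled in, so that a separate estimation step can decide what to do with them.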

You might encounter similar issues with submeters.

Connectivity issues will be sorted promptly

lol, no.

Meter upgrades and replacements are carried out promptly

See above.

Non half-hourly meters don’t produce half-hourly readings

The energy industry refers to “non half-hourly” and “half-hourly” meters. Half-hourly meters are installed in locations with higher consumption and must produce half-hourly readings.

Historically non half-hourly meters were read less often, e.g. monthly, and so did not produce half-hourly readings.

But increasingly “non half-hourly” meters include AMR and SMETS2 meters that actually do produce half-hourly data.

The terms “half-hourly” and “non half-hourly” now relate more to regulations around what meters are installed, how quickly connectivity issues need to be sorted, and how usage is settled rather than relating to the actual capabilities of the meters.

Smart Meters store actual customer tariffs

This seems to be generally true for domestic properties. Within around 24-48 hours of a smart meter being installed, the default tariffs on the meter will be replaced by your actual tariffs.

It’s your energy company’s responsibility to push those to the meter as they are used to provide you with cost advice for your In-Home Device. But I’ve heard of problems meaning that tariffs are not up to date.

For non-domestic usage, there isn’t always an In-Home Device provided. They’re not that helpful in a large building. And while the tariffs might be pushed to the meter, that doesn’t always seem to be the case.

Smart Meter tariffs can be used to calculate bills

I believe this is actually true for domestic meters but it’s not true more broadly.

Smart meters support a range of tariffs including flat rates, differential (time of use) tariffs as well as more complex tariffs such as block based pricing (e.g. the price per unit is fixed until a threshold). They also support a simple standing charge.

The structure of what tariffs can be supported is part of the SMETS2 standard.
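To make the tariff types described above a bit more concrete, here is a sketch of how the costs might be calculated. This is not how SMETS2 represents tariffs. The function names and all of the rates are made up for illustration.

```python
def flat_rate_cost(kwh, rate):
    """A single price per kWh."""
    return kwh * rate


def time_of_use_cost(half_hourly_kwh, half_hourly_rates):
    """A differential tariff: each half-hour has its own price per kWh."""
    return sum(kwh * rate for kwh, rate in zip(half_hourly_kwh, half_hourly_rates))


def block_cost(kwh, threshold_kwh, rate_below, rate_above):
    """Block pricing: one price per kWh up to a threshold, another beyond it."""
    below = min(kwh, threshold_kwh)
    above = max(kwh - threshold_kwh, 0)
    return below * rate_below + above * rate_above


STANDING_CHARGE_PER_DAY = 0.50  # hypothetical, in pounds

# Hypothetical day: 150 kWh against a 100 kWh block threshold.
daily_cost = block_cost(150, 100, rate_below=0.30, rate_above=0.20) + STANDING_CHARGE_PER_DAY
print(round(daily_cost, 2))
```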

Non-domestic tariffs are way more complicated. Specifically, the daily standing charge that we have on our domestic bills is unbundled into a range of charges including:

Agreed Supply and Excess Capacity charges
DUoS and TNUoS charges, which are basically additional differential tariffs that are applied by the distribution and transmission network operators
Standing charges
Fixed additional fees
Settlement agency fees
Metering agent charges
Site fees
Data access fees
etc, etc

Those can’t be stored on a smart meter. So while you might be able to access the basic consumptions costs you need additional information to calculate a bill.

Energy efficiency advice that focuses on the basic unit rate costs of consumption will underestimate those costs if it doesn’t take into account the DUoS and TNUoS charges.

None of this is openly or easily available.
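As a rough illustration of that underestimate, here's a minimal sketch that costs the same half-hourly consumption with and without a time-banded DUoS charge. All of the rates, the single "red" band and the function names are made up for the example. Real charges vary by network operator and contract, and a real bill includes the other fees listed above.

```python
from datetime import time

# All rates below are made up for illustration (in GBP per kWh).
UNIT_RATE = 0.25
DUOS_DEFAULT_RATE = 0.01
# (band start, band end, rate): a hypothetical weekday "red" band.
DUOS_BANDS = [(time(16, 0), time(19, 0), 0.10)]


def duos_rate(period_start: time) -> float:
    """Return the DUoS rate for a half-hour starting at period_start."""
    for band_start, band_end, rate in DUOS_BANDS:
        if band_start <= period_start < band_end:
            return rate
    return DUOS_DEFAULT_RATE


def consumption_costs(readings):
    """Return (unit rate cost, unit rate plus DUoS cost) for (start, kWh) pairs."""
    unit_cost = sum(kwh * UNIT_RATE for _, kwh in readings)
    duos_cost = sum(kwh * duos_rate(start) for start, kwh in readings)
    return unit_cost, unit_cost + duos_cost


readings = [(time(10, 0), 2.0), (time(17, 0), 2.0)]
unit_only, with_duos = consumption_costs(readings)
print(f"unit rate only: GBP {unit_only:.2f}, with DUoS: GBP {with_duos:.2f}")
```

The same 2 kWh costs noticeably more in the late afternoon band than mid-morning, which is exactly the detail that unit-rate-only advice misses.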

The complexity of non-domestic billing, the potential for errors, opportunity for re-negotiation of fees, and ability to unbundle services like meter management, is why there is a whole class of energy brokers and reconciliation services serving that market.

There are standard APIs and formats for energy data

There are no standard formats for tabular energy data. Energy suppliers generally do not offer APIs for accessing consumption data.

Recent policy changes around access to non-domestic energy data have improved access but not standardisation; in fact formats have proliferated.

There is no standard API for accessing Smart Meter data. There is one for interfacing directly with the DCC but it’s a low level messaging system and access to it is, understandably, tightly controlled.

Services like n3rgy and Hildebrand that have created APIs to wrap that lower level Smart Meter infrastructure are all bespoke.

There are no standardised protocols for agreeing consent to access data.

Solar monitoring systems do not have standard APIs even though they frequently offer the same basic functionality. At Energy Sparks we’ve integrated with three systems initially, based on those we’ve most commonly encountered in the education sector.

If you want to get a sense of the number of solar monitoring systems, then take a look at Solar Fox, who provide display panels to show how your panels are performing. Currently they have integrations with 74 different solar monitoring systems.

While some suppliers provide access to submeter data, these meters might be remotely read in different ways through different services. So there isn’t a standard way to access that data either. This includes data that might be coming from heat pumps, district heating, etc.

If future energy systems are about local generation, storage and load balancing, we need more secure, standardised APIs onto these systems.

Falsehoods this programmer believed about half-hourly energy data

Leigh Dodds — 2025-02-24T10:44:21Z

It is common for energy generation and consumption values to be presented as half-hourly readings: giving 48 readings over the course of a single 24 hour period. This is the type of data we’re working with on a daily basis in Energy Sparks.

I thought I’d share a few things that I learned about working with this type of data, in the classic “Falsehoods Programmers Believe About X” format. But in this case the programmer is me. You might have different or better insights. In which case, leave a comment!

This post focuses just on half-hourly data. I’m going to do a second one about metering.

Half-hourly data is always available

While modern meters are AMR (Automated Meter Reading) or SMETS 1 or 2, there are still meters out there that are not capable of producing half-hourly readings.

Some meters installed to monitor, e.g. solar arrays, only report usage over a longer period, e.g. life-time generation.

Half-hourly data is the most granular data available

Half-hourly data is the standard around which modern electricity and gas metering is based. But that doesn’t mean that more granular data isn’t available in some cases.

Meters installed as part of a solar array, e.g. Solar Edge, might report data at 15 minute intervals.

While SMETS 2 meters make data available to suppliers (or other authorised users) on a half-hourly basis, within the home you can access real-time data from the electricity meter (only).

Those readings are only available to certified devices. Some companies and suppliers are manufacturing in-home devices that can bridge from the Zigbee network to Wifi, allowing readings at a more fine-grained level.

Other “clamp-on” devices which take readings from an electricity cable can do similar real-time reporting.

There are at most 48 readings

Daylight savings means that one day in the year, when the clocks go back in autumn, will have 50 readings. The day the clocks go forward in spring will only have 46.

Except some supplier data feeds only ever have 48. It’s unclear what happens during daylight savings.
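If you want to check how many readings to expect for a given local day, here's a minimal sketch using Python's standard `zoneinfo` module. The function name is my own, and it assumes the feed reports in local UK time and that time zone data is installed on the system.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def half_hours_in_day(day: date, tz_name: str = "Europe/London") -> int:
    """Count the half-hour periods between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    # Convert both midnights to UTC so the subtraction measures real elapsed
    # time rather than wall-clock time.
    start = datetime.combine(day, time.min, tzinfo=tz).astimezone(timezone.utc)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz).astimezone(timezone.utc)
    return int((end - start) / timedelta(minutes=30))


for day in (date(2024, 3, 31), date(2024, 6, 1), date(2024, 10, 27)):
    print(day, half_hours_in_day(day))
```

Comparing this count with the number of readings a feed actually supplies on the clock change days is a quick way to find out how that feed behaves.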

There will never be less than 48 readings

Meters can fail to record (or submit) a reading so you might have missing readings within a day.

Some solar monitoring systems, e.g. Solis Cloud, only seem to report data when there are non-zero readings. So their API only returns data between e.g. 6am and 8pm when there was any generation on the panels, not a fixed set of 48 readings. Other APIs differ.
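One way to cope with sparse feeds is to lay whatever you receive onto a fixed half-hourly grid, so that gaps become explicit. This is a minimal sketch with a hypothetical helper name. It leaves gaps as `None`, because deciding whether a gap means "zero generation" or "missing reading" depends on the source.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def to_fixed_slots(
    readings: Dict[datetime, float], day_start: datetime, slots: int = 48
) -> List[Optional[float]]:
    """Lay sparse readings out on a fixed half-hourly grid, using None for gaps."""
    return [readings.get(day_start + timedelta(minutes=30 * i)) for i in range(slots)]


day = datetime(2024, 6, 1)
sparse = {datetime(2024, 6, 1, 6, 0): 0.4, datetime(2024, 6, 1, 6, 30): 0.9}
grid = to_fixed_slots(sparse, day)
print(len(grid), grid[12], grid[13], grid[0])
```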

The readings are always reporting energy consumption

Different types of meters measure different things. So a half-hourly time series might cover a range of data types and units.

A solar generation meter measures the power generated by the panels, a self-consumption meter shows the local consumption of that power. An export meter measures how much energy is exported to the grid.

An electricity meter might also report the Reactive Energy for the circuit.

If you’re taking data from a supplier then the half-hourly data might also be estimated, rather than actual reads from the meter.

Labelling of half-hourly data is standardised

Some CSV formats organise readings into rows, one for each day and with each half-hour being in a separate column. The column headings might be labelled with the start or the end of the half-hour being reported. E.g. the electricity consumption for the period of time between 1am and 1.30am might be labelled as “1:00” or “1:30”.

For CSV files organised like this, there’s no real ambiguity. The 48 (or 50!) readings are presented in column order.

But some formats report data with one row per half-hourly reading. So you need to take care to ensure you’re parsing the reading time or time-stamps correctly.

2024-04-24T00:00 might be the consumption between 23:30 on the 23rd April and 00:00 on the 24th. Or it might be the data for 00:00 to 00:30 on the 24th.
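A simple way to avoid mixing the two conventions is to normalise every timestamp to the start of its half-hour as you load it. Here's a minimal sketch. The `labelled_by` setting is something I've invented for the example, and you'd have to work out its value for each feed yourself because the files rarely say.

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)


def period_start(label: str, labelled_by: str) -> datetime:
    """Convert a reading's timestamp label into the start of its half-hour."""
    stamp = datetime.fromisoformat(label)
    if labelled_by == "start":
        return stamp
    if labelled_by == "end":
        return stamp - HALF_HOUR
    raise ValueError(f"unknown labelling convention: {labelled_by!r}")


print(period_start("2024-04-24T00:00", "end"))
print(period_start("2024-04-24T00:00", "start"))
```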

Consumption readings are always in kWh

Electricity consumption data is reported in kWh. But gas meters report usage based on the volume of gas supplied.

Modern meters report usage in cubic meters. There are still gas meters in use that report data in imperial units. They might be reporting in cubic feet (cf) or hundreds of cubic feet (hcf).

You’ll need to convert this to kWh using a standard formula that takes into account variations in temperature and pressure as well as the calorific value of the gas.

In the UK calorific values will typically be between 37.5 and 43.0 MJ/m³ with 40 being an acceptable default. The actual calorific value for your gas supply will be on the monthly bill. As far as I’m aware there’s no programmatic way to access this information.

Whether your gas meter reports data in cf or hcf also only seems to be present on energy bills.
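Here's a minimal sketch of that conversion. It assumes the volume correction factor of 1.02264 that's commonly printed on UK gas bills to account for temperature and pressure, the default calorific value of 40 mentioned above, and 3.6 MJ per kWh. The function and unit names are my own.

```python
M3_PER_CUBIC_FOOT = 0.0283168
VOLUME_CORRECTION = 1.02264  # assumed: the factor commonly shown on UK bills
DEFAULT_CALORIFIC_VALUE = 40.0  # MJ per cubic metre
MJ_PER_KWH = 3.6

# Multipliers to convert each meter unit into cubic metres.
UNIT_TO_M3 = {"m3": 1.0, "cf": M3_PER_CUBIC_FOOT, "hcf": 100 * M3_PER_CUBIC_FOOT}


def gas_volume_to_kwh(
    volume: float, unit: str = "m3", calorific_value: float = DEFAULT_CALORIFIC_VALUE
) -> float:
    """Convert a metered gas volume into kWh."""
    cubic_metres = volume * UNIT_TO_M3[unit]
    return cubic_metres * VOLUME_CORRECTION * calorific_value / MJ_PER_KWH


print(round(gas_volume_to_kwh(1.0), 2))         # one cubic metre
print(round(gas_volume_to_kwh(1.0, "hcf"), 2))  # one hundred cubic feet
```

Getting the unit wrong changes the result by a factor of roughly 2.8 (or 100 between cf and hcf), so it's worth sanity checking converted values against a bill.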

The readings are always measured values

SMETS electricity and gas meters occasionally produce high-values which are not actual recorded consumption. They are error codes used to report some kind of meter fault. You need to trap and handle these when processing the data.

As the linked documents note: “Due to the nature of the national programme and the position taken by government and the regulator this is no obligation or requirement for manufacturers to publish these error codes so they can be captured and processed explicitly.”
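Since the codes aren't published, one pragmatic option is to treat any implausibly large reading as a fault rather than consumption. This is only a sketch. The threshold is an assumption you'd need to choose per meter, and the function name and the example value are made up.

```python
from typing import List, Optional, Tuple


def trap_fault_values(
    readings: List[Optional[float]], max_plausible_kwh: float
) -> Tuple[List[Optional[float]], List[int]]:
    """Replace implausibly high readings with None and report which slots were affected."""
    cleaned: List[Optional[float]] = []
    fault_slots: List[int] = []
    for slot, value in enumerate(readings):
        if value is not None and value > max_plausible_kwh:
            cleaned.append(None)
            fault_slots.append(slot)
        else:
            cleaned.append(value)
    return cleaned, fault_slots


# 99999.0 is a made-up stand-in for a fault value, not a real error code.
cleaned, faults = trap_fault_values([0.5, 99999.0, 0.7], max_plausible_kwh=500.0)
print(cleaned, faults)
```

Keeping the list of affected slots means you can log or report the faults instead of silently dropping them.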

The readings are always for the usage in a single half-hourly period

I’ve only observed this for some solar meters, but it might occur elsewhere with other types of metering. It relates to how meters remotely report data in scenarios where the connectivity is poor.

Sometimes there will be a visible spike for a single half-hour after a period of missing readings.

Some meters, if they hit a communication error, will at the next opportunity report all consumption (or generation) that hasn’t already been reported. In effect the meter “catches up” by just attributing all unreported data to the next half-hour in which it’s able to connect.

Depending on the capability of the meters, this “catch-up” reporting might be for a couple of hours or it might extend into the next day.

This can create problems where your generation, export and self-consumption are no longer aligned. One option is to just average out the data across the missing period, allowing for periods when the panels are unlikely to be generating.

As far as I’m aware there’s no standard way to describe this feature or other capabilities of solar meters.
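The averaging option can be sketched like this. It assumes you already know the meter behaves this way, so that any value following a run of missing readings is a catch-up total. It spreads that total evenly and does not allow for periods when the panels are unlikely to be generating. The function name is hypothetical.

```python
from typing import List, Optional


def spread_catch_up(readings: List[Optional[float]]) -> List[Optional[float]]:
    """Spread the first value after each gap evenly across the gap and that value's own slot."""
    result = list(readings)
    gap_start = None
    for i, value in enumerate(readings):
        if value is None:
            if gap_start is None:
                gap_start = i
        elif gap_start is not None:
            span = i - gap_start + 1  # the missing slots plus the catch-up slot
            share = value / span
            for j in range(gap_start, i + 1):
                result[j] = share
            gap_start = None
    # A gap at the end of the data has no catch-up value yet, so stays as None.
    return result


print(spread_catch_up([1.0, None, None, 6.0, 1.0]))
```

The total for the day is unchanged, only its distribution across the half-hours.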

Estimated data is always clearly labelled

Meters don’t report estimated readings, so if you’re pulling data directly from, e.g. a SMETS2 meter then you’ll only get actuals.

But if you’re taking data from suppliers or from other parts of the energy data infrastructure then the readings might include estimates. E.g. if a meter cannot be read remotely.

You need to take care to understand whether the data feed or API you’re using includes only actuals or also estimated readings. Not every feed makes this clear.

You may also need to reload or refresh data when the actuals become available. Some data feeds will push through actuals when available. Others use a fixed window, e.g. last 7 days, so you need another mechanism to handle getting historical data.
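When reloading, it helps to have a rule for which reading wins. Here's a minimal sketch, assuming your feed gives you an estimated flag for each reading (not every feed does). The `Reading` type and `merge_readings` function are names I've made up for the example.

```python
from datetime import datetime
from typing import Dict, NamedTuple


class Reading(NamedTuple):
    kwh: float
    estimated: bool


def merge_readings(
    stored: Dict[datetime, Reading], incoming: Dict[datetime, Reading]
) -> Dict[datetime, Reading]:
    """Merge a fresh load into stored data without letting estimates overwrite actuals."""
    merged = dict(stored)
    for slot, new in incoming.items():
        old = merged.get(slot)
        # Accept the new reading if there was nothing stored, if the stored
        # reading was only an estimate, or if the new reading is an actual.
        if old is None or old.estimated or not new.estimated:
            merged[slot] = new
    return merged


slot = datetime(2024, 6, 1, 0, 0)
stored = {slot: Reading(1.2, estimated=True)}
merged = merge_readings(stored, {slot: Reading(0.9, estimated=False)})
print(merged[slot])
```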

Estimated data will eventually be replaced with actual values

Estimated data is used if there’s a problem reading a meter. However, even if that problem is resolved you won’t necessarily be able to get the “missing” half-hourly data.

Some meters do have a memory so will store a history of the readings which can be later downloaded. But there’s an upper limit on this. So you may never get actual half-hourly reads to replace the estimates.

From a billing perspective the customer’s bill will be adjusted based on the current meter readings. But you’re not guaranteed to get the missing data.

A quick review of the 8BitDo mechanical keyboard

Leigh Dodds — 2025-02-21T08:47:48Z

I don’t normally review things here but a few people expressed an interest in how I got on with this, especially as I was planning to use it with Ubuntu. So here we are.

Be warned, I know nothing about keyboards except how to go tippy-tap on them. And sometimes not even that. So don’t expect any deep insights.

Last year I decided to change up my desk setup. I was already using two screens, but to make things more ergonomic I moved my laptop onto a separate swivel stand, got a USB hub to reduce the cabling and increase the number of ports. I also decided to get a new keyboard.

Out of what is undoubtedly a mix of nostalgia and a burgeoning mid-life crisis, I decided what I wanted was a chunky retro-styled keyboard. Something that evoked “hacking the mainframe” using an old IBM terminal.

I then tripped over the 8BitDo mechanical keyboard and was sold. Compact. Wireless. Gave off the right vibes. And, inexplicably, has some extra big red buttons.

Let’s start with…

I’M SORRY?

…I said let’s start with discussing the…

WHAT?

…discussing the downsides.

OK!

It’s loud.

I love that, to be honest. Gives me more of those mainframe hacking vibes. But if I’m honest, that feature is less popular with everyone else in the house.

The other downside is that occasionally I discover exxxxxxtra letttters. Not consistently enough for it to be an actual fault. But maybe once or twice a day?

I suspect it might be a bluetooth connectivity thing, but I’ve not used it enough when wired or using the wireless dongle to be sure. It’s slightly more common when I’ve paused for a while to read something. So might be related to the keyboard going to sleep.

The final issue is actually my own problem. This is an 87 key US layout. So fewer keys and no replaceable keycaps.

As a UK tippy-tapper I’m stuck with some mislabelled keys. I don’t really care about that as muscle memory guides my typing. But fewer keys means that a couple of frequently used programming characters, like | and \, are only available via an Alt key combo. I did consider remapping keys to try and work around that, but decided it was easier to just retrain that muscle memory a bit.

So not sure if I can unreservedly recommend this for programmers unless you’re comfortable with that. Or in the US.

What about the upsides?

WHAT?

I said let’s talk about the upsides.

Firstly, it works seamlessly with Ubuntu. Just worked. The included wireless dongle, and an easy switch between modes means I can easily flip between my laptop and PC desktop.

The macro programming software is for PC only. But you can do all of the key assignment directly on the keyboard itself. So I’ve just assigned keyboard shortcuts to the individual buttons and then mapped those shortcuts in Ubuntu. For example, at the moment, I’ve just got keys assigned to one tap mute my microphone and audio. Handy.

I’ll admit that, retro-stylings aside, my favourite feature is the volume knob. I now get to pump up the volume while waving my arm in the air like some DJ dropping a hot set. No really, it’s great and I’m sure the midlife crisis will pass eventually.

Despite the niggles, I have zero regrets with this as a purchase. If I were to duct tape this to my XPS 13 it’s probably as close as I’m going to get to that Ono-Sendai deck I’ve always wanted.

YMMV.

Reflecting on 2024

Leigh Dodds — 2025-02-09T14:48:27Z

Another of my annual end of the year reflections. Like last year it’s taken me until February to finish.

This post is likely to be of little interest to anyone else, but I enjoy doing them and they’re the closest thing I have to a diary.

I’ve previously done these for 2020, 2021, 2022 and 2023.

Reading

I log all my comic and book reading in Story Graph. They produce an annual wrap-up. Here’s mine for 2024.

There was a definite shift in my reading patterns this year. I basically all but stopped reading comics and graphic novels. I read 47 books: 13 graphic novels, 27 fiction and 7 non-fiction.

By way of contrast, last year I read something like 70-80 graphic novels.

Fiction

My favourites this year were:

Red Sky in Morning by Paul Lynch (although this is a close tie with his other novel Beyond the Sea)
Wandering Souls by R. F. Kuang
Isaac and the egg by Bobby Palmer
Montpelier Parade by Karl Geary
Treacle Walker by Alan Garner

Didn’t read a lot of sci-fi in 2024. Felt like I was in a bit of a rut so sought out some different authors. I struggled to finish Orbital by Samantha Harvey; it just didn’t click for me at all.

Non-Fiction

How Infrastructure Works by Deb Chachra is just brilliant.
The Care Manifesto gets a big thumbs-up from me. Not a lot of it about though, it seems
Enjoyed Abroad in Japan
Modeling Mindsets by Christoph Molnar was an interesting approach to looking at how data scientists approach their analysis.

I read How to Blow Up a Pipeline by Andreas Malm but to be honest I found it to be naive and simplistic.

Comics

I read the first three books of 100 Bullets which I really enjoyed. Finished Lemire’s Hawkeye series which I’d been saving. Still great

I read all 1249 pages of the Infinity Gauntlet Omnibus but found it disappointing. The films were better.

I started reading some of Grant Morrison’s JLA comics and they were just awful. So bad that they basically killed all interest I had in reading any more graphic novels last year. I’ve got a few books of his which I’ve not read yet (Animal Man being the main one), but I think Morrison is massively overrated.

Listening

Still keeping to my habit of creating a playlist of “tracks that I loved on first listen, which were released this year”. Here’s my 2024 Tracked playlist.

It contains 157 tracks which is a total of 10 hours and 39 seconds of listening. Tracks are logged in order of when I heard them.

That’s 4 hours and 57 tracks more than last year. I’m still listening to a lot of 6Music during weekdays, but also spending time in the evening and weekends following recommendations on YouTube and elsewhere.

My Most Listened 2024 includes Thee Marloes, The Last Dinner Party, Lucy Rose and The Bug Club.

Still listening to Unclassified but I’ve started regularly listening to BBC Radio 3 Night Tracks. I stick it on now late in the evening while I’m reading rather than doomscrolling.

My daughter has started joining me for this. A nice peaceful time together.

Maybe 2025 will be the year I get into podcasts?

Gigs

I wanted to go to more gigs in 2024. Here’s what I managed this year

Hania Rani. Jumped at the chance to finally see her perform live, and she didn’t disappoint. A much more high-energy gig than I was expecting, but it was amazing
Happy Mondays supported by the Inspiral Carpets and the Stereo MCs. Pure old geezer nostalgia. Lots of fun
~~King Gizzard and the Lizard Wizard~~. Couldn’t go because I got my second dose of COVID.
Jalen Ngonda. Grabbed tickets as soon as I found out he was playing the Trinity in Bristol. Nice venue and the gig was brilliant. I’m seeing him twice more in 2025.
Jon Hopkins. I saw Hopkins play in Bath right before the start of lockdown, my very first solo gig and my last time in a crowded room for a long time. While the Bath gig was a mix of his electronica and quieter piano pieces this set at the Beacon was a full-on assault on the senses. Brilliant.
Ezra Collective. I’ve never seen anyone work an audience like these guys. Uplifting.
Seasick Steve. Fantastic. Learnt what a diddley-bow is.

Gaming

I’m ldodds on both Steam and PSN if you want to add me there.

I spent a long time playing Shadow of the Erdtree which I thought was an absolute triumph. These games manage to light up parts of my brain in a way that many other games don’t. It’s the mixture of exploration, skill development, themes and tantalising secrets. I will never tire of them.

Spent some more time with Horizon Forbidden West, which I’ve still not finished. Also sunk quite a few hours into both PowerWash Simulator and Hardspace Shipbreaker which are engaging in their own ways. They provide satisfying distractions and sometimes that’s what I need in a game, not challenge.

I played Thank Goodness You’re Here, which was funny and reminded me of some Spectrum games from the ’80s which it also directly references in places.

What Wakes the Deep was scary, cinematic, very well designed, but a bit frustrating at times. Having watched The Rig this year, which explores some similar themes, it was slightly surprising to walk out onto the platform and feel like I already had some sense of the layout.

I played a lot of Framed, the New York Times word puzzles and am a lifetime subscriber to Puzzmo. Though not so much by the end of the year.

I spent a lot of time role-playing, which includes planning and running games, and reading through rule books.

I’ve been running a campaign of Agon which we finished up a few weeks ago. That was a lot of fun, although I struggled a bit with the system and pacing at times. I’m also now running Blades in the Dark for an in-person group. I even bought some new dice! While I love playing online, it’s just a delight to be able to get round a table with people and make some new friends.

We finished Season 1 of that campaign just before Christmas. During the break one of the other players ran a Mothership one-shot which was brilliant. He incorporated sound effects and even an app at one point. A great way to round off the year.

In a bid to play even more TTRPGs, I also signed up to play an online session of Night of the Hogmen with a group of strangers. That turned out to be a lot of fun. Hogmen is a great one-shot anyway, but the GM was great and the other players leant into the wild caper.

Watching

Film

I watched 126 films in 2024, totalling 247 hours. 22 of these were rewatches.

Number of films I watched each month in 2024

I was given an “Odeon Limitless” pass for my birthday this year. It’s the best gift I’ve ever had! I’ve been in the cinema pretty regularly this year: 32 times in total. On a few occasions I’ve just gone and watched a couple of films back to back. Decadent.

My favourite films released in 2024 were:

The Substance
Conclave
The Fall Guy
Dune: Part Two
Kinds of Kindness

But my favourites of the year were:

Conclave. I cannot state how good this film is, on every level
The Substance.
Perfect Days. Lovely
Past Lives. This one stayed with me for a while
Anatomy of a Fall. Gripping

TV

Masterchef, Bake-off and now The Traitors are core viewing in our house. Rare occasions when the family clusters around the TV.

Finished working through my Star Trek: Voyager rewatch and started in on Deep Space Nine. Started rewatching Black Mirror and they haven’t all aged well.

Favourites of the new stuff:

Scavengers Reign. Amazing world-building. Obviously they’ve now gone and cancelled it, but the series is self-contained enough that it’s worth a watch
The Rig. Enjoyable sci-fi
Disclaimer. Gripping
Severance. Stylish and mysterious. Please let them end this well
Skeleton Crew. Of all the Disney shows this year, I think this was the best. I enjoyed the Acolyte and Agatha All Along but neither landed as well as this.

Writing

I wrote 14 blog posts in 2024, totalling 10,993 words.

That’s more than I remembered writing to be honest.

I didn’t feel like I had much to say this year. Or that I had things to say that other people weren’t already saying better. And who needs to hear from another middle-aged white guy?

That’s an odd place to end up given that I still believe I’m mostly writing for my own benefit. Given that I’m doing less research and analysis I have less need to organise my thoughts in the same way.

That said, I think old-school blogging is more important than ever. So I may end up writing more this year but probably about different things? I don’t know.

I do have a larger writing project planned but haven’t begun the drafting in earnest yet.

Coding

I’m spending all day writing and reviewing code, alongside planning and product work for Energy Sparks, so less interested in doing side projects and I have other uses for my spare time. Those films aren’t going to watch themselves.

But I did make something in Downpour and I did finish that Doom WAD bot for Mastodon. Although in December botsin.space shut down so I’ve moved it to mastodon.me.uk.

I still want to do more creative coding projects.

Cooking

Like last year, I didn’t really log what I was cooking in 2024. I’ve bookmarked some recipes here. Only one new cookbook this year: The Latin American Cookbook by Virgilio Martínez.

Things that went down well with the family this year:

Roasted Nectarine Salad with Feta and Mint
Creamy Tuscan Chicken
the panjeon recipe from Little Korea
the tarts I made (pictured above) from some Spicy Pear and Mango chutney my dad made, along with a dollop of goat’s cheese

Still mostly vegetarian although some meat has crept back into the diet. Meera Sodha recipes are pretty reliable for midweek family dinners.

Isaac Korean Street Toast has now cemented itself as a pre-Xmas family dinner tradition.

Gardening

A mixture of poor rain and no desire to do a big groundwork project — to rebuild the raised beds and refresh the soil — left me spending weekends doing other things.

Grew some tomatoes in pots which did well. Otherwise did nothing else in 2024. Garden is in sore need of some attention.

Working

Energy Sparks is going well. My decision to step away from consultancy roles to focus on building something was absolutely the right one and I continue to enjoy my day to day work. It’s varied, and I’m learning enough, that I’m not getting bored.

Like every other charity and non-profit we’re still struggling to grow whilst remaining sustainable. The funding situation is always a worry but we’ve managed to shift our mix of revenue streams so we’re not as entirely reliant on grants as we were.

By the end of 2024 we’d grown to 10 people and back up to 1000 schools again. The expanded team includes someone responsible for marketing and comms and I’m hoping we can add a business development / sales role next year.

It is frustrating to see the government pour a lot of money into expensive retrofit projects for schools, when there are so many gains to be made through simple low cost energy efficiency interventions. There are a lot of schools that need refits and investment, without a doubt. But energy efficiency advice and the data infrastructure needed to support that also needs investment. Will write more about that another time.

In terms of product development and architecture we’ve made some satisfying improvements. We’ve improved how we deploy and manage our infrastructure, pruned the codebase, sped up some core features, and have a good sense of what other features might be useful. But as always, more user research needed.

Everything Else

Last year I decided I wanted to get out more and meet some new people locally.

I’m not sure that going to the cinema counts, as I’m usually sat by myself in a darkened room. But I’m still out. And I have ended up chatting to other regular cinema goers whilst we’re waiting for the film to start.

Early in the year I went and spent a few sessions volunteering at Bath Community Farm which was brilliant. I cut down some trees, helped to build a woodland wildlife garden, and planted up some new trees elsewhere on the site. It was a lot of hard work but in a beautiful setting. I spent time working alongside and chatting to a bunch of the loveliest people. Tiring but came away with a sense of peace and connection which I haven’t felt in a while.

Unfortunately there weren’t many of these opportunities as those events were part of a funded project. But still keeping an eye out for other chances.

I’m also out now pretty regularly playing TTRPGs with a new local group. So have met a bunch of other nice people and we’ve had some great sessions. For several of them it’s their first time playing any TTRPG, so I’m happy to be able to give them that opportunity.

Getting out to gigs has also been great.

Spent some money improving my desk layout and setup so less prone to back and neck aches.

Did a lot of DIY over the summer including redecorating the kitchen. We’ve been wanting to get our hall replastered and decorated for ages, so bit the bullet and got that done too. Kitchen roof has started leaking so hoping to get that sorted soon.

I checked out of social media as the US election hit its peak. Deleted apps and logged out of everything. Finally deleted my twitter account. Spent time reading and filling out my RSS reader with new blogs instead.

I really haven’t missed it. While I’ve been checking in again on Bluesky and Mastodon a bit recently as an experiment, I’m not sure it’s working for me anymore.

There are better ways to stay informed. There are better ways to feel part of a community. I do not need to witness every batshit or distressing thing that happens or read all of the analysis. None of that really impacts my sense of what is right.

Noticed an odd patch inside my mouth in November. Looks benign but ended up having to go for a biopsy. I was a brave boy but they didn’t give me a sticker. Still waiting on the results which I should get in February.

What does community-driven data governance look like?

Leigh Dodds — 2024-11-01T13:37:17Z

Some idle thoughts for a Friday afternoon.

I was just taking a look at Source.Plus a dataset of public domain images for training Foundation models. It’s a project of Spawning.ai which is working to build “data governance for generative AI”. I have some thoughts on the tools they’re building, but that’s not what I’m writing about here.

It was this statement which caught my attention: “Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.”

I think it’s early days for the project, so there’s not much detail about what that novel governance might look like. Although I assume it will be based on the Spawning.ai tools.

What exists at present is a brief summary of how the Source.Plus dataset is managed and specifically how it deals with representation, safety, copyright, etc. And links to relevant policies, e.g. the Takedown Policy which outlines a process that kicks in if content is flagged.

I think what I was expecting to see, is what we called “Visible Processes” in Collaborative Data Patterns: not just a policy document but a set of online tools that would provide:

a summary of activity, e.g. how many cases are open, how many have been resolved (and how), and how long it takes requests to be completed
insights into specific cases, e.g. at what stage is my request to take down some content?
some ability for the community to engage with that process once started, e.g. to upvote a request to add a problematic piece of content or additional evidence that might be useful in the process
…and maybe some detail on who is driving that process, who are the people behind the email addresses and contact forms, and how might others get involved?

Obviously there’s privacy and safety issues that need to be considered in all of the above. You need to protect both staff, rights holders and contributors for multiple reasons.

I think this type of framework is what I would expect to see as a minimum around a “community-driven dataset governance mechanism”.

But to me community-driven means more than community-initiated. The community should be involved at every step of the process. And that means more than just dealing with takedowns and copyright. It means shaping the content and organisation of the dataset itself.

We captured some more patterns around that but they’re clearly not exhaustive.

The value of capturing these types of patterns is that it becomes easier for different projects to adopt similar approaches, allows the creation of shared infrastructure and tools, and builds community expectations around what good looks like.

Comments on “A data for AI taxonomy”

Leigh Dodds — 2024-10-19T12:12:54Z

Jack Hardinges and Elena Simperl recently published a taxonomy to describe the data relevant to AI models and systems. Their goal is to help to better distinguish between the different types of data relevant to developing, using and monitoring AI models and systems, and thereby add some nuance to debates around what type of data infrastructure and governance is required.

Datasets are shaped by various factors including their contents, their intended use, the communities involved in collecting and using them, and expectations around their longevity.

For example a dataset of images will be organised very differently to one consisting of tabular data. And, if that same set of tabular data contains geospatial coordinates it might be published using a different set of standards by an organisation working in the geospatial community, than one working in local government (e.g. Geopackage vs a CSV file). One published as the result of a research project might be uploaded to an archive whereas another might be published via an API. Etc, etc.

A few years ago I wrote a paper about different dataset archetypes which was intended to help inform this kind of discussion by highlighting the different characteristics of some commonly produced datasets.

So I read the taxonomy with some interest. Here are a few notes on some of the core definitions.

Existing data

Given that anything digital can now be computed by AI, or any other system, I’m uneasy about using “existing data” as the framing for all text, audio, images, movies, code, etc. Because most people won’t think of that digital stuff as “data”.

The taxonomy sets out to define different types of dataset, so it’s understandable that it attempts to define the “everything else”. But if the intention is to help shed light on the different types of inputs and outputs of AI systems, for a broader audience, then a better label might be more appropriate.

The current definition of “Existing data” encompasses big collections of digital objects harvested from the web, unstructured corpuses of text and imagery, as well as structured datasets, etc.

Some of this “data” might only exist in aggregate as the result of creating a training set. Or may have been intentionally published and structured with a specific set of use cases in mind that did not originally include AI.

I find it hard to think of something like the web as a single dataset. It doesn’t fit my personal definition.

Training data

I also don’t think the taxonomy adequately distinguishes between “Existing data” and “Training data”.

One of the defining characteristics of foundation models is that they are trained on very large, unstructured datasets, usually harvested from the web, as opposed to more purposefully curated data intended for use in other types of machine-learning systems.

All of the current concerns around foundation models derive from this feature.

Attempts to retrofit governance and solve licensing issues for the outputs of web crawling look different to more traditional approaches to building datasets, where gaining consent and permission is more feasible because there is a closer connection between those people represented in, or contributing to, a dataset and those producing it.

The taxonomy references Common Crawl as “Training data” but its use, at least originally, is much broader. The openly licensed Stack Overflow dataset has been used in a variety of ways, but it’s only recently that its use to train foundation models has caused concerns. Whereas ImageNet is more than 10 years old and was intentionally published for machine-learning research.

Some datasets are intentionally produced as “Training data”. Others become training data, because they are used that way.

Reference and Local data

One way that AI systems try to incorporate additional data, e.g. facts that were not available when the model was trained, is through techniques like “Retrieval Augmented Generation” (RAG). It’s a popular technique due to its ability to incorporate new or private data into a system that is otherwise powered by a foundation model trained on a broad set of sources.

The same dataset might feature as both “Reference data” and “Local data” in the same deployed system. E.g. using Wikidata as a source of labelling for a training or fine-tuning dataset AND as a knowledge base that is queried during deployment.

“Local data” to me implies data local to a deployment or use of a system, e.g. that of a specific organisation or user. But it might also include any other “existing data”.

A shorter summary of my feedback might be that this is less a taxonomy of data, i.e. something which might be used to classify or describe existing data, and more a description of the roles in which a dataset might participate in an AI workflow or system.
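To make the “roles rather than types” idea concrete, here is a minimal sketch in Python. It assumes a dataset is described by the set of roles it plays in one particular system, rather than by a single fixed category. The role labels come from the taxonomy terms discussed above; the class names, the helper function and the example system are hypothetical, for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Labels borrowed from the taxonomy terms discussed in this post.
    EXISTING = "Existing data"
    TRAINING = "Training data"
    REFERENCE = "Reference data"
    LOCAL = "Local data"


@dataclass
class Dataset:
    name: str
    # The roles this dataset plays in one specific system (may be several).
    roles: set[Role] = field(default_factory=set)


def datasets_in_role(datasets: list[Dataset], role: Role) -> list[str]:
    """Return the names of the datasets that play the given role."""
    return [d.name for d in datasets if role in d.roles]


# Hypothetical description of a single deployed system.
system = [
    Dataset("Wikidata", {Role.REFERENCE, Role.LOCAL}),
    Dataset("Common Crawl", {Role.EXISTING, Role.TRAINING}),
]

print(datasets_in_role(system, Role.REFERENCE))
```

Modelling the role as an attribute of the dataset's use, not of the dataset itself, means the same dataset can appear under different roles in a different system without being reclassified.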

Finally, going back to my introduction, the reason I’m interested in understanding the roles of different datasets in AI systems and workflows is because that might further shape how that data is being accessed, used and shared. Or provide useful insights into the types of governance models they need.

Training and fine-tuning datasets need to be accessible as a whole. So that implies that they will be accessed and published as a complete dataset. So we may need systems of change discovery and retraction to deal with issues found in that data.

If data is published in a decentralised way then an aggregate dataset may need to be created before it can be used in training. That creates another layer of governance.

Whereas data included via RAG is likely to be more API based. So the responses from that API can be altered in real-time if issues are discovered. Those APIs are also governed by additional terms that may shape use of the data. Existing APIs might be pressed into service to help to deploy AI systems without their providers being aware of those new use cases. Etc, etc.

How to accidentally DDOS yourself

Leigh Dodds — 2024-10-03T12:31:22Z

We had some performance issues last week. Entirely of our own making but not in the usual way.

We nearly DDOSed ourselves by sending out emails.

We do a lot of analysis in Energy Sparks and, to be honest, some of it needs optimising. Tickets are in the backlog and we are exploring solutions.

Anyway, one morning last week we had a sudden spike in memory and CPU usage. This was during the period when we send out our weekly emails to users. The emails signpost them to the latest analysis, highlighting any key issues in their energy consumption. For example that their baseload has crept up, so maybe devices are being left on. Or that their heating is running during a holiday period.

Alarms started going off, so I started investigating. Remedial action was taken and we stabilised things while digging for the root cause.

I found an issue in the email sending code pretty quickly and worked around it.

But as I normally do when this happens, I did a full run through of our monitoring and logs to identify other contributing factors. Sometimes it’s not that the software has changed. It might be the context in which it is running. So I try to avoid looking at just the obvious fixes.

While reviewing the server logs I noticed we were getting a lot of HEAD requests during the period of slow performance. Like a LOT.

Then I noticed that all these were for URLs originating in the emails that we were sending out that morning.

While we get good engagement with these emails, it wasn’t normal user traffic. It was something else. Checking the User Agents in the requests, I realised what was happening.
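A minimal sketch of that kind of log check, in Python, assuming the web server writes access logs in the common “combined” format. The sample lines, paths and user agent are made up for illustration.

```python
import re
from collections import Counter

# Pulls the method, path and user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_head_requests(lines):
    """Count HEAD requests per (user agent, path) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("method") == "HEAD":
            counts[(match.group("agent"), match.group("path"))] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [01/Oct/2024:07:00:01 +0000] "HEAD /schools/1/analysis HTTP/1.1" 200 0 "-" "ExampleScanner/1.0"',
    '203.0.113.5 - - [01/Oct/2024:07:00:02 +0000] "HEAD /schools/1/analysis HTTP/1.1" 200 0 "-" "ExampleScanner/1.0"',
    '198.51.100.7 - - [01/Oct/2024:07:00:03 +0000] "GET /schools/1/analysis HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for (agent, path), n in count_head_requests(sample).most_common():
    print(n, agent, path)
```

Grouping by user agent and path together shows both who is sending the requests and which pages are taking the hits.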

At least some of the schools to which we are sending email are using security software that scans incoming emails. I assume this is common in a lot of organisations these days.

That software was reading the emails then doing HEAD requests on all the links.

I assume this is to check SSL certificates and look for dodgy redirects that might be associated with phishing.

The more emails we sent, the more HEAD requests we got. And the majority of those requests were hitting the pages that I said needed optimising.

As we were struggling to send out emails, the more we sent, the more we were being hit with waves of HEAD requests to pages that were causing additional performance issues. We were basically DDOSing ourselves with the help of some email security software.

Cue those alarms.

Now the extra fun thing is that we hadn’t implemented a HEAD handler for these URLs. But Rails silently converts a HEAD request to a GET, before throwing away the response body and just serving the headers. So the application was struggling to produce analysis that was immediately thrown away.
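The effect of that fallback can be shown with a toy dispatcher. This is not Rails code: it is a small Python sketch, with hypothetical handler names, that assumes a framework which runs the GET handler for any HEAD request that has no handler of its own.

```python
calls = {"expensive": 0}


def render_analysis():
    """Stand-in for a slow analysis page; counts how often it runs."""
    calls["expensive"] += 1
    return "<html>lots of charts</html>"


def get_analysis():
    body = render_analysis()
    return 200, {"Content-Type": "text/html"}, body


def head_analysis():
    # Cheap answer: same status and content type, but nothing is rendered.
    return 200, {"Content-Type": "text/html"}, ""


def dispatch(method, handlers):
    """Route a request, falling back from HEAD to GET if no HEAD handler exists."""
    if method == "HEAD" and "HEAD" not in handlers:
        # The full page is built, then the body is discarded.
        status, headers, _body = handlers["GET"]()
        return status, headers, ""
    return handlers[method]()


# Without a HEAD handler the slow render runs; with one it does not.
dispatch("HEAD", {"GET": get_analysis})
dispatch("HEAD", {"GET": get_analysis, "HEAD": head_analysis})
print(calls["expensive"])
```

A real HEAD handler should ideally return the same headers a GET would, so a cheap version like this is a trade-off between accuracy and load.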

I have no idea how long this has been happening. Perhaps for a while, and we’ve just grown to the point where it’s a problem. Or maybe more of the recently joined schools are using different software. I don’t know and haven’t dug further.

But this type of email scanning behaviour was new to me. I don’t know whether these email scanners routinely follow all links, or just a sample. But there didn’t seem to be any throttling. Or much use of User-Agent headers to help identify the source. This seems a bit unfriendly.

Clearly we had issues to fix in the application. And those performance optimisations are getting a bump up the backlog. But this was an amusing little incident and a nice example of unexpected interactions between systems.

Lessons have been learned.