<?xml version="1.0" encoding="UTF-8" standalone="no"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">

<channel>
	<title>The NeoSmart Files</title>
	<atom:link href="https://neosmart.net/blog/feed/" rel="self" type="application/rss+xml"/>
	<link>https://neosmart.net/blog</link>
	<description>Recovery software and more</description>
	<lastBuildDate>Sat, 07 Feb 2026 22:13:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://neosmart.net/blog/wp-content/uploads/cropped-NeoSmart-Padded-32x32.png</url>
	<title>The NeoSmart Files</title>
	<link>https://neosmart.net/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">1082592</site>	<xhtml:meta content="noindex" name="robots" xmlns:xhtml="http://www.w3.org/1999/xhtml"/><item>
		<title>EFTA00400459 has been cracked, DBC12.pdf liberated</title>
		<link>https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/</link>
					<comments>https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Sat, 07 Feb 2026 20:17:22 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[convolutional neural networks]]></category>
		<category><![CDATA[epstein]]></category>
		<category><![CDATA[forensics]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[pdf]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5409</guid>

					<description><![CDATA[<p>Three days ago, I wrote a post about stumbling across uncensored/unredacted plain-text content in the DOJ Epstein archives that could theoretically be reconstructed and reversed to reveal the uncensored originals of some of the files. Last night, I succeeded in &#8230; <a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/">EFTA00400459 has been cracked, DBC12.pdf liberated</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Three days ago, I wrote a post about stumbling across uncensored/unredacted plain-text content in the DOJ Epstein archives <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/" rel="follow">that could theoretically be reconstructed and reversed</a> to reveal the uncensored originals of some of the files. Last night, <a href="https://x.com/mqudsi/status/2019973242065416237?s=20" rel="follow">I succeeded in doing just that</a>, and was able to extract the DBC12 PDF file from the Epstein archive <a href="https://archive.org/details/efta-00400459" rel="follow">EFTA00400459</a> document. Don&#8217;t get too excited: as expected, the exfiltrated document (<a href="https://archive.org/details/dbc-12-one-page-invite-with-reply" rel="follow">available for download here</a>) turned out to be nothing too juicy and is &#8220;just&#8221; another exemplar of the cronyism and the incestuous circles of financial funding that Epstein and his colleagues engaged in.</p>
<p>There is a lot I could say in this post about the different approaches I took and the interesting rabbit holes I went down in trying to extract valid base64 data from the images included in the PDF; however, I am somewhat exhausted from all this and don&#8217;t have the energy to go into it all in as much detail as it deserves, so I&#8217;ll just post some bullet points with the main takeaways:</p>
<p><span id="more-5409"></span></p>
<p><strong>Properly extracting the PDF images</strong></p>
<p>I had made a mistake in how I converted the original PDF into images for processing. u/BCMM on r/netsec <a href="https://old.reddit.com/r/netsec/comments/1qw4sfa/recreating_uncensored_epstein_pdfs_from_raw/o3rmxa6/" rel="follow">reminded me</a> of a different tool from the <code>poppler-utils</code> package that could extract the images as originally embedded in the PDF, whereas my approach with <code>pdftoppm</code> was unnecessarily re-rasterizing the PDF and possibly reducing the quality (as well as increasing the size).</p>
<p>It turns out I had rasterized the PDF at a high enough resolution/density that it made little to no visual difference, but it did get rid of the Bates numbering/identification at the bottom of each page and was definitely the correct approach I should have used from the beginning. Funnily enough, when I started typing out <code>pdfima—</code> in my shell, a history completion from the last time I used it (admittedly long ago) popped up, so I can&#8217;t even claim to have been unaware of its existence.</p>
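<p>For reference, the difference between the two approaches boils down to something like this. This is a sketch: the poppler-utils tool in question is <code>pdfimages</code>, and the filenames/prefixes are placeholders:</p>

```python
# Two ways to get images out of a PDF (both from poppler-utils):
# pdftoppm re-rasterizes every page at a chosen density, while
# pdfimages -png extracts the image streams as originally embedded.
# Commands are built but not executed here; filenames are placeholders.
rasterize = ["pdftoppm", "-png", "-r", "300", "EFTA00400459.pdf", "page"]
extract = ["pdfimages", "-png", "EFTA00400459.pdf", "img"]

# e.g.: subprocess.run(extract, check=True)
```

The extraction route avoids a second generation of rasterization loss, at the cost of losing anything drawn on the page outside the embedded images (such as the Bates stamps).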
<p>I re-uploaded the images (of the redacted emails) from EFTA00400459 to a new archive and <a href="https://archive.org/details/efta-00400459_pages" rel="follow">you can download it here</a>.</p>
<p><strong>OCR is a no-go for base64 text</strong></p>
<p>I learned much more than I ever wanted to<sup id="rf1-5409"><a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/#fn1-5409" title="Okay, that&rsquo;s not true. I love learning about random stuff and I don&rsquo;t know why people say this about anything." rel="footnote">1</a></sup> about OCR. Some takeaways that I plan on turning into a proper post sometime soon (famous last words, I know):</p>
<ul>
<li>I had assumed OCR would be less prone to &#8220;hallucinations&#8221; than an LLM because the traditional OCR tools are all heuristic/algorithm-driven. This is true, and they do not hallucinate in the same sense that an LLM does, but they are not meant to be used for faithful reconstructions where byte-for-byte accuracy is desired. It turns out they all (more or less) work by trying to make sensible words out of the characters they recognize at a lower level (this is why OCR needs to understand your language, not just its script/alphabet, for best results). This obviously doesn&#8217;t work for base64, or anything else that requires context-free reconstruction of the original.</li>
<li>Tesseract supposedly has a way to turn off at least some of these heuristics, but that did not improve its performance in any measurable way, though there&#8217;s always the chance that I was holding it wrong:
<p><div id="attachment_5411" style="width: 804px" class="wp-caption aligncenter"><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/tessconfig.webp" rel="follow"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-5411" class="size-full wp-image-5411 colorbox-5409" src="https://neosmart.net/blog/wp-content/uploads/2026/02/tessconfig.webp" alt="" width="794" height="434" /></a><p id="caption-attachment-5411" class="wp-caption-text">The tesseract(1) tunables I tried using to get it to work better with images containing base64 data.</p></div></li>
<li>The vast, vast majority of OCR use cases and training corpora involve proportional fonts (the opposite of monospace). In my experience, all the engines ironically performed better at recognizing the much less regularly spaced (and, therefore, harder to algorithmically segment) proportional fonts than they did monospace fonts.</li>
<li>Adobe Acrobat OCR is terrible. Just terrible. Don&#8217;t pay for it.</li>
</ul>
<p><strong>Trying a data science approach</strong></p>
<p>My first idea for tackling this without OCR was to identify the bounding boxes for each character on the page (in order, of course) and cluster them using kmeans with <em>k</em> set to <code>64</code> (or <code>65</code> if you want to consider the <code>=</code> padding). Theoretically, there&#8217;s no reason this shouldn&#8217;t work. In practice, it just didn&#8217;t.</p>
<p>I&#8217;m sure the approach itself is sound; the problem was getting regular enough inputs. However hard I tried, I was unable to slice the characters regularly and cleanly enough to get <code>scikit-learn</code> to give me even remotely sane results with the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html" rel="follow">KMeans</a> module. I tried feeding it the images thresholded to black-and-white, greyscaled, and even as three-dimensional color inputs (essentially the lossless data). Every time, it would create multiple buckets for the same letter while lumping genuinely different characters together into others.</p>
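<p>For the curious, the clustering idea reduces to something like the following. This is a toy, dependency-free k-means standing in for <code>sklearn.cluster.KMeans</code>, run on fake two-dimensional &#8220;glyph&#8221; vectors rather than real character crops:</p>

```python
# Toy k-means standing in for sklearn.cluster.KMeans: in the real pipeline
# each point would be a flattened character crop and k would be 64 (or 65
# with '=' padding); here, two fake 2-D "glyph" clusters and k=2.
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialization (no RNG needed for a toy).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center...
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # ...then move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
labels = kmeans(points, k=2)   # two clean clusters -> two buckets
```

<p>With clean, well-separated inputs this works every time; as described above, it was the irregular slicing of the real crops, not the algorithm, that sank the approach.</p>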
<p>Instead of letting kmeans decide the buckets for me, I decided to try seeding the buckets myself: rendering each letter of the base64 alphabet in Courier New at the right scale/point size to match the scans, then using different methods to sort the sliced letters into the right bins. But despite Courier New being a monospaced font &#8211; perhaps as an artifact of the resolution the images were captured at &#8211; I couldn&#8217;t stop some glyphs from including a hint of the letter that came before/after, and tweaking the slicing one way (adding a border on one side) would fix some inputs but break others (e.g. if you clip even one pixel from the right-hand side of the box surrounding the letter <code>O</code> at the font size and DPI the images were captured/rendered at, it turns into a <code>C</code>).</p>
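<p>The bucket-seeding idea amounts to nearest-template classification. A dependency-free sketch (the real version rendered Courier New glyphs at the scans&#8217; point size; these 3&#215;3 bitmaps are stand-ins):</p>

```python
# Nearest-template classification: one reference bitmap per character
# ("seeded bucket"); each sliced cell goes to the closest reference by
# Hamming distance. The 3x3 bitmaps below are toys, not real glyphs.
TEMPLATES = {
    "l": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "1": ((1, 1, 0), (0, 1, 0), (1, 1, 1)),
    "O": ((1, 1, 1), (1, 0, 1), (1, 1, 1)),
}

def classify(cell):
    # Count mismatched pixels against each template; the smallest wins.
    def dist(t):
        return sum(a != b for row_c, row_t in zip(cell, t) for a, b in zip(row_c, row_t))
    return min(TEMPLATES, key=lambda ch: dist(TEMPLATES[ch]))

# A noisy "l" (one flipped pixel) still lands in the right bucket...
noisy_l = ((0, 1, 0), (1, 1, 0), (0, 1, 0))
# ...but when the templates differ by only a couple of pixels (1 vs l at
# the real resolution), one bad slice is enough to tip the result.
```
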
<!-- Support Callout Start -->
<div style="
    margin: 35px 0;
    padding: 25px;
    background-color: #f8fbfe;
    border-left: 5px solid #1b8be0;
    border-top: 1px solid #e1eaf1;
    border-right: 1px solid #e1eaf1;
    border-bottom: 1px solid #e1eaf1;
    border-radius: 0 4px 4px 0;
    font-family: Georgia, 'Times New Roman', serif;
    line-height: 1.6;
    box-shadow: 0 2px 5px rgba(0,0,0,0.03);
">
    <p style="margin: 0; color: #333; font-size: 1.15em;">
        <span style="color: #1b8be0; font-weight: bold; font-family: Arial, sans-serif; text-transform: uppercase; font-size: 0.85em; letter-spacing: 1px; display: block; margin-bottom: 8px;">
            A note from the author
        </span>
        I do this for the love of the game and in the hope of exposing some bad people. But I do have a wife and small children, and this takes me away from the work that pays the bills. If you would like to support me in this mission, you can make a contribution
        <a href="https://mqudsi.com/donate/" style="color: #1b8be0; text-decoration: underline; font-weight: bold; transition: color 0.2s;">here</a>.
    </p>
</div>
<!-- Support Callout End -->
<p>I turned to an LLM for help, but LLMs are really bad at deriving things from first principles, even when you give them all the information and constraints up front. Gemini 3 kept insisting I try <a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients" rel="follow">HOG</a> to handle the inputs, and would repeatedly push a <a href="https://en.wikipedia.org/wiki/Connected-component_labeling" rel="follow">connected components</a> approach to weed out noise (some on r/netsec insisted these images were 100% noise-free and without any encoding artifacts &#8211; this is patently not true), despite the fact that connected-components filtering absolutely destroys the recognition, even if you apply the same algorithm to your seed buckets.</p>
<p>This approach with seeding the buckets yielded ~fair results, but it was completely stymied by the <code>l</code> vs <code>1</code> conundrum <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/" rel="follow">discussed in the previous post</a>. Subpixel hinting differences between the ClearType algorithm used on the PCs the DoJ was using to originally render these emails and the font rendering engine used by OpenCV might look like nothing, but when the difference between an <code>l</code> and a <code>1</code> is 3 pixels at most, it adds up. Also, it didn&#8217;t help that my bounding box algorithm to draw a grid around the text and isolate each letter wasn&#8217;t perfect – I think there was a fractional-pixel mismatch everywhere in the 2x (losslessly) upscaled sources I was using, and that threw everything off (while using the 1x sources made it hard to slice individual characters, though I know the algorithm could have been tweaked to fix this).</p>
<p><strong>Image kernels can&#8217;t solve bad typography</strong></p>
<p>A number of people in the comments and on social media kept suggesting applying various kernels to the input in order to get better OCR results. But OCR wasn&#8217;t just getting <code>1</code> vs <code>l</code> wrong; it was making up or omitting letters altogether, giving me 75-81 characters per line instead of the very-much-fixed 78 monospaced characters I was feeding it. Using kernels to try to improve the situation with the classifier I wrote also turned out to be futile: if you darken pixels above a certain threshold, you make the difference between the <code>1</code> and the <code>l</code> more noticeable, but at that threshold you also darken the subpixel hinting around the lowercase <code>w</code>, turning it into a capital <code>N</code>.<sup id="rf2-5409"><a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/#fn2-5409" title="If I remember correctly! I wasn&rsquo;t taking notes (I wish I had, and more screenshots, too) and it could have been two different characters." rel="footnote">2</a></sup> The kernels that make one pair of letters distinguishable would ultimately hurt the recognition of another pair, unless you knew in advance which cells to apply them to and which not to&#8230; which would imply you had already figured out which letter was which!</p>
<p>Except that&#8217;s not true: of course you could make a multi-level classifier to first just identify the ambiguous pairs like <code>l</code> and <code>1</code> and then feed those into a secondary classifier that applied the kernels, then tried bucketing these separately. But you know what? I didn&#8217;t think of that at the time. Mea culpa!</p>
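<p>That hypothetical multi-level classifier is easy to sketch. Every function name here is illustrative; nothing like this exists in the published code:</p>

```python
# Hypothetical two-stage classifier, as described above but never built:
# a coarse first pass, and only cells landing in an ambiguous pair get a
# kernel applied plus a dedicated second-stage disambiguator. This way the
# kernel cannot degrade the recognition of unrelated character pairs.
AMBIGUOUS = {"l", "1"}

def classify_two_stage(cell, coarse, disambiguate, kernel):
    label = coarse(cell)
    if label in AMBIGUOUS:
        label = disambiguate(kernel(cell))
    return label

# Toy stand-ins: "cells" are pixel lists, the kernel darkens them.
coarse = lambda cell: "l" if sum(cell) < 10 else "A"
kernel = lambda cell: [min(v * 2, 9) for v in cell]      # crude darkening
disambiguate = lambda cell: "1" if cell[0] > 5 else "l"  # check the serif pixel

label = classify_two_stage([4, 1, 1], coarse, disambiguate, kernel)
```
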
<p><strong>CNNs really are powerful dark magic</strong></p>
<p>With the traditional image processing/data science classification approach not working too well, I decided to try another approach and see if I could train a <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" rel="follow">CNN</a> to handle both the discrepancies with the imperfect slicing around each character/cell, noise from adjacent cells abutting into the cell being analyzed, <em>and</em> the difference between <code>1</code> vs <code>l</code>. And you know what? I think I&#8217;m going to have to start using CNNs more!</p>
<p>They really are a powerful answer to a lot of this stuff. After just typing out two lines of base64 data from the first page of the PDF<sup id="rf3-5409"><a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/#fn3-5409" title="Well, second page, actually. The first page is mostly just the correspondence with one line of base64, but the second page onwards are 100% base64 content." rel="footnote">3</a></sup> and training the CNN on those as ground truth, it was able to correctly identify <a href="https://x.com/mqudsi/status/2019813069099258193?s=20" rel="follow">the vast majority of characters</a> in the input corpus, or at least the ones I could visibly tell it had gotten right, meaning I couldn&#8217;t be sure if it had handled <code>l</code> vs <code>1</code> correctly or not.</p>
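<p>For the curious, a CNN of roughly this shape can be sketched in a few lines of PyTorch. This is <em>not</em> the published architecture &#8211; the layer sizes, the 24&#215;24 cell dimensions, and the 65-way output (base64 alphabet plus <code>=</code>) are all illustrative:</p>

```python
# Illustrative glyph-classifier CNN in PyTorch. Input: one grayscale
# character cell (1x24x24); output: logits over 65 classes (base64 + '=').
import torch
import torch.nn as nn

class GlyphNet(nn.Module):
    def __init__(self, n_classes=65):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),  # first-layer kernel size: one of the knobs mentioned above
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 24x24 -> 12x12
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),                                   # no second MaxPool, per the tweak described above
        )
        self.classifier = nn.Linear(32 * 12 * 12, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = GlyphNet()(torch.zeros(1, 1, 24, 24))  # shape: (1, 65)
```
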
<p>It turns out the alignment grid was off by a vertical pixel or two by the time it reached the end of the page, so tweaking the training algorithm to train against the top x lines and the bottom x lines of the page separately got rid of the remaining recognition errors&#8230; or at least, it did for the first page.</p>
<div id="attachment_5412" style="width: 630px" class="wp-caption aligncenter"><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison.png" rel="follow"><img decoding="async" aria-describedby="caption-attachment-5412" class="wp-image-5412 size-large colorbox-5409" src="https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison-1024x359.png" alt="" width="620" height="217" srcset="https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison-1024x359.png 1024w, https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison-1536x538.png 1536w, https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison-600x210.png 600w, https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison-500x175.png 500w, https://neosmart.net/blog/wp-content/uploads/2026/02/grid-comparison.png 1758w" sizes="(max-width: 620px) 100vw, 620px" /></a><p id="caption-attachment-5412" class="wp-caption-text">It&#8217;s really hard to see, but while the grid is almost identically placed around the characters at the top-left vs bottom-right, the subtle difference is there.</p></div>
<p>But alignment issues with how the bounding box was drawn on subsequent pages pushed the error rate well above the necessary zero. It wasn&#8217;t until I realized that, since these pages weren&#8217;t scanned but rather rendered into images digitally, I could just have the training/inference script memorize the grid from the first page and reuse it on subsequent pages that I was able to get rid of all the recognition errors on all pages. (As an interesting sidebar: despite augmenting the training data with artificially generated inputs introducing randomized subtle (<code>-2px</code> to <code>+2px</code>) vertical/horizontal shifts, that wasn&#8217;t enough to address the grid drift.)</p>
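<p>The grid-memorization fix amounts to computing the cell boxes once and reusing them verbatim on every page. A sketch with illustrative origin/pitch numbers (the 78-column width is the real fixed line length; the row count varies by page):</p>

```python
# "Memorize the grid": since the pages were digitally rendered, not
# scanned, a grid of cell boxes measured on page 1 can be reused verbatim
# on every later page instead of being re-detected (and drifting).
# Origin and pitch values below are illustrative.
def make_grid(x0, y0, cell_w, cell_h, cols=78, rows=40):
    """Return (left, top, right, bottom) cell boxes in row-major order."""
    return [
        (x0 + c * cell_w, y0 + r * cell_h, x0 + (c + 1) * cell_w, y0 + (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]

# Measure once on page 1, then slice pages 2..N with the same boxes.
grid = make_grid(x0=42, y0=100, cell_w=12, cell_h=22)
```
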
<p>Well, all except the same pesky <code>1</code> vs <code>l</code> which still plagued the outputs and led to decode errors.</p>
<p>I admit I wasted a lot of time here barking up the wrong tree: I spent hours tweaking the CNN&#8217;s layout, increasing the kernel size of the first layer, getting rid of the second MaxPool, further augmenting the input samples, and trying denoising and alignment techniques, all in an attempt to coax the CNN into correctly recognizing the two-pixel difference between the ones and the ells &#8211; all to no avail. I kept adding training data, meticulously typing out line after line of base64 (becoming thoroughly sick and tired of zooming and panning) to see if that was what it would take to nudge the error rate down further.</p>
<div id="attachment_5413" style="width: 630px" class="wp-caption aligncenter"><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking.png" rel="follow"><img decoding="async" aria-describedby="caption-attachment-5413" class="size-large wp-image-5413 colorbox-5409" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking-1024x626.png" alt="" width="620" height="379" srcset="https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking-1024x626.png 1024w, https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking-1536x939.png 1536w, https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking-600x367.png 600w, https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking-491x300.png 491w, https://neosmart.net/blog/wp-content/uploads/2026/02/Training-Typo-Checking.png 1800w" sizes="(max-width: 620px) 100vw, 620px" /></a><p id="caption-attachment-5413" class="wp-caption-text">A debug view I added to the training script after I mistyped a character one time too many. It displays the greatest deviation between the average shape of all characters in each training-provided &#8220;ground truth&#8221; bucket and the max outlier in the same.</p></div>
<p>It was only after I took a break for my sanity and came back to doggedly tackle it again that I noticed one of the errors <code>qpdf</code> reported for the recovered PDF never changed, no matter how much I tweaked the CNN or how much additional training data I supplied &#8211; and that&#8217;s when I realized the problem: despite zooming in and taking a good 5 or 10 seconds on each <code>1</code> vs <code>l</code> I was entering into the training data, <em>I had still gotten some wrong!</em> A second email forward/reply chain with the same base64 content was included in the DoJ archives (EFTA02154109), and while it wasn&#8217;t at a much better resolution or quality than the first, it was still <em>different</em>, and its <code>1</code>s and <code>l</code>s were distinguished by different quirks. After I spotted one mistake I had made, I quickly found a few more (and this was even with zooming in on the training corpus debug view pictured above and verifying that the ones and ells had the expected differences!), and lo and behold, the recovered PDF validated and opened!</p>
<p><a href="https://github.com/mqudsi/monospace-ocr" rel="nofollow">I posted the code that solved it to GitHub</a>, but apologies in advance: I didn&#8217;t take the time to make the harness scripts properly portable. Everything works and should run on your machine as well as it runs on mine, but the scripts use a very idiosyncratic mix of tooling: they&#8217;re written in <a href="https://github.com/fish-shell/fish-shell" rel="nofollow">fish</a> instead of <code>sh</code> or even <code>bash</code>, use <code>parallel</code> (which I actually hate), and have some WSL-isms baked in for (my) practicality. The README was rushed, and some of the (relative) paths to input files are baked in instead of being configurable. Maybe I&#8217;ll get to it. We&#8217;ll see.</p>
<p>As to what&#8217;s next, well, there are more base64-encoded attachments where this one came from. But unfortunately I cannot reuse this approach as the interesting-looking ones I&#8217;ve found are all in proportional fonts this time around!</p>
<p><strong>Follow me at <a href="https://twitter.com/mqudsi" rel="follow">@mqudsi</a> or <a href="https://twitter.com/neosmart" rel="follow">@neosmart</a> for more updates or to have a look at the progress posts I&#8217;ve shared over the past 48 hours while trying to get to the bottom of this</strong>. </p>
<p>Thanks for tuning in!</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5409"><p>Okay, that&#8217;s not true. I love learning about random stuff and I don&#8217;t know why people say this about anything.&nbsp;<a href="#rf1-5409" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5409"><p>If I remember correctly! I wasn&#8217;t taking notes (I wish I had, and more screenshots, too) and it could have been two different characters.&nbsp;<a href="#rf2-5409" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5409"><p>Well, second page, actually. The first page is mostly just the correspondence with one line of base64, but the second page onwards are 100% base64 content.&nbsp;<a href="#rf3-5409" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/">EFTA00400459 has been cracked, DBC12.pdf liberated</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/feed/</wfw:commentRss>
			<slash:comments>19</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5409</post-id>	</item>
		<item>
		<title>Recreating uncensored Epstein PDFs from raw encoded attachments</title>
		<link>https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/</link>
					<comments>https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Wed, 04 Feb 2026 19:18:39 +0000</pubDate>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[base64]]></category>
		<category><![CDATA[courier new]]></category>
		<category><![CDATA[epstein]]></category>
		<category><![CDATA[fonts]]></category>
		<category><![CDATA[forensics]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[textract]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5373</guid>

					<description><![CDATA[<p>Heads-up: An update to this article has been posted. There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators to censoring &#8230; <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/">Recreating uncensored Epstein PDFs from raw encoded attachments</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><em>Heads-up: An update to this article <a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/" rel="follow">has been posted</a>.</em></p>
<p>There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators, to <a href="https://x.com/Surajit_/status/2018007528110776656?s=20" rel="follow">censoring pictures of random women</a> in a way that makes individuals look guiltier <a href="https://x.com/FATCAed/status/2018258403336815092?s=20" rel="follow">than they really are</a>, to <a href="https://x.com/vxunderground/status/2018914471834456465?s=20" rel="follow">forgetting to redact credentials</a> that made it possible for all of Reddit to log into Epstein&#8217;s account and trample over all the evidence, to the complete ineptitude that resulted in most of the latest batch being corrupted thanks to <a href="https://x.com/mqudsi/status/2017790922830893422?s=20" rel="follow">incorrectly converted Quoted-Printable encoding artifacts</a>. It&#8217;s safe to say that Pam Bondi&#8217;s DoJ did not put its best and brightest on this (admittedly gargantuan) undertaking. But the most damning evidence has all been thoroughly redacted… hasn&#8217;t it? Well, maybe not.</p>
<p>I was thinking of writing an article on the mangled quoted-printable encoding the day this latest dump came out, in response to all the misinformed musings and conjectures littering social media (and my dilly-dallying cost me, as someone <a href="https://lars.ingebrigtsen.no/2026/02/02/whats-up-with-all-those-equals-signs-anyway/" rel="follow">beat me to the punch</a>). While searching through the latest archives for some SMTP headers to use in the article, I came across a curious artifact: not only were the emails badly transcoded into plain text, but some binary attachments were actually included in the dumps in their over-the-wire <code>Content-Transfer-Encoding: base64</code> format &#8211; and the unlucky intern assigned to the documents in question didn&#8217;t realize the significance of what they were looking at and didn&#8217;t see the point in censoring seemingly meaningless page after page of base64 content!</p>
<p><span id="more-5373"></span></p>
<p>Just take a look at <a href="https://archive.org/details/efta-00400459" rel="follow">EFTA00400459</a>, an email from correspondence between (presumably) one of Epstein&#8217;s assistants and Epstein lackey/co-conspirator Boris Nikolic and his friend, Sam Jaradeh, inviting them to a ████████ benefit:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/EFTA00400459-Sample.webp" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5374 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/EFTA00400459-Sample.webp" alt="" width="870" height="822" /></a></p>
<p>Those characters go on for 76 pages, and represent the file <code>DBC12 One Page Invite with Reply.pdf</code> encoded as base64 so that it can be included in the email without breaking the SMTP protocol. And converting it back to the original PDF is, theoretically, as easy as copy-and-pasting those 76 pages into a text editor, stripping the leading <code>&gt;</code> bytes, and piping all that into <code>base64 -d &gt; output.pdf</code>&#8230; or it would be, if we had the original (badly converted) email and not a partially redacted scan of a printout of said email with some shoddy OCR applied.</p>
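<p>That theoretical recovery pipeline can be sketched in Python. The sample lines here are just a base64-encoded <code>%PDF-</code> header, not actual archive content:</p>

```python
# Sketch of the recovery pipeline described above: drop the Bates-stamp
# lines, strip the '>' quoting prefixes, join, and decode.
import base64

def recover(lines):
    payload = "".join(
        line.lstrip("> ").strip()
        for line in lines
        if "EFTA" not in line        # the Bates stamp printed on each page
    )
    # validate=True makes any illegal (e.g. hallucinated) character raise.
    return base64.b64decode(payload, validate=True)

sample = ["> JVBE", "> Ri0x", "EFTA00400459", "> LjQK"]
recovered = recover(sample)   # -> b"%PDF-1.4\n"
```
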
<p>If you tried to actually copy that text as digitized by the DoJ from the PDF into a text editor, here&#8217;s what you&#8217;d see:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/DoJ-OCR-Sample.webp" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5375 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/DoJ-OCR-Sample.webp" alt="" width="808" height="468" /></a></p>
<p>You can ignore the <code>EFTA00400459</code> on the second line; that (or some variant thereof) will be interspersed into the base64 text since it&#8217;s stamped at the bottom of every page to identify the piece of evidence it came from. But what else do you notice? Here&#8217;s a hint: this is what proper base64 looks like:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Proper-base64.webp" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5376 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Proper-base64.webp" alt="" width="692" height="382" /></a></p>
<p>Notice how in this sample everything lines up perfectly (when using a monospaced font) at the right margin? And how that&#8217;s not the case when we copied-and-pasted from the OCR&#8217;d PDF? That&#8217;s because it wasn&#8217;t a great OCR job: extra characters have been hallucinated into the output, some of them not even legal base64 characters such as the <code>,</code> and <code>[</code>, while other characters have been omitted altogether, giving us content we can&#8217;t use:<sup id="rf1-5373"><a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/#fn1-5373" title="In case you&rsquo;re wondering, the shell session excerpts in this article are all in fish, which I think is a good fit for string wrangling because of its string builtin with extensive operations with a very human-readable syntax (at least compared to perl or awk, the usual go-tos for string manipulation), and it lets you compose multiple operations as separate commands while not devolving to performing pathologically because no external commands are fork/exec&lsquo;d because of its builtin nature. And I&rsquo;m not just saying that because of the blood, sweat, and tears I&rsquo;ve contributed to the project." rel="footnote">1</a></sup></p>
<pre>&gt; pbpaste \
     | string match -rv 'EFTA' \
     | string trim -c " &gt;" \
     | string join "" \
     | base64 -d &gt;/dev/null
<span style="color: #ff0000;">base64: invalid input</span>
</pre>
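<p>The cleanup-and-validate pass above can also be scripted for auditing purposes. Here&#8217;s a quick illustrative sketch in Python (a hypothetical helper, not part of my actual pipeline): strip the <code>EFTA</code> stamps, then flag any line containing illegal base64 characters or whose length deviates from the modal line length (since properly wrapped base64 lines up at the right margin):</p>

```python
import re
from collections import Counter

# A legal base64 line: alphabet chars plus up to two trailing '=' pads.
B64_LINE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")

def audit_b64(lines):
    """Flag OCR'd base64 lines with illegal chars or suspect lengths."""
    # Drop the evidence stamps and prompt characters, then blank lines.
    cleaned = [l.strip("> ").strip() for l in lines if "EFTA" not in l]
    cleaned = [l for l in cleaned if l]
    # Properly wrapped base64 has one dominant line length.
    modal = Counter(len(l) for l in cleaned).most_common(1)[0][0]
    problems = []
    for i, line in enumerate(cleaned):
        if not B64_LINE.match(line):
            problems.append((i, "illegal characters"))
        elif len(line) != modal and i != len(cleaned) - 1:
            # The final line of a base64 blob may legitimately be short.
            problems.append((i, "bad length %d (expected %d)" % (len(line), modal)))
    return problems
```

Running this over the pasted OCR text immediately pinpoints which lines need manual attention, rather than having <code>base64 -d</code> reject the whole blob with a single unhelpful error.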
<p>I tried the easiest alternative I had at hand: I loaded the PDF into Adobe Acrobat Pro and re-ran its OCR process on the document, but got even worse results, with spaces injected into the middle of the base64 content (easily fixable) in addition to other characters being completely misread and butchered – it really didn&#8217;t like the cramped monospace text at all. So I decided to try <code>tesseract</code> next, which, while very far from state-of-the-art, can still be useful because it lets you do things like limit its output to a certain subset of characters, constraining the field of valid results and hopefully coercing it into producing better output.</p>
<p>Only one problem: <code>tesseract</code> can&#8217;t read PDF input (or not by default, anyway). No problem, I&#8217;ll just use <code>imagemagick</code>/<code>ghostscript</code> to convert the PDF into individual PNG images (to avoid further generational loss) and feed those to <code>tesseract</code>, right? But that didn&#8217;t quite work out: they seem (?) to load and convert all 76 pages/png files at once, and then naturally crash on too-large inputs (but only after taking forever and generating the 76 invalid output files that you&#8217;re forced to subsequently clean up, of course):</p>
<pre>&gt; convert -density 300 EFTA00400459.pdf \
        -background white -alpha remove \
        -alpha off out.png
<span style="color: #ff0000;">convert-im6.q16: cache resources exhausted `/tmp/magick-QqXVSOZutVsiRcs7pLwwG2FYQnTsoAmX47' @ error/cache.c/OpenPixelCache/4119.</span>
<span style="color: #ff0000;">convert-im6.q16: cache resources exhausted `out.png' @ error/cache.c/OpenPixelCache/4119.</span>
<span style="color: #ff0000;">convert-im6.q16: No IDATs written into file `out-0.png' @ error/png.c/MagickPNGErrorHandler/1643.</span>
</pre>
<p>So we turn to <code>pdftoppm</code> from the <code>poppler-utils</code> package instead, which does indeed handle each page of the source PDF separately and turned out to be up to the task, though incredibly slow:</p>
<pre><code class="language-bash">&gt; pdftoppm -png -r 300 EFTA00400459.pdf out.png
</code></pre>
<p>After waiting the requisite amount of time (and then some), I had files <code>out-01.png</code> through <code>out-76.png</code>, and was ready to try them with <code>tesseract</code>:</p>
<pre><code class="language-bash">for n in (printf "%02d\n" (seq 1 76))
    tesseract out-$n.png output-$n \
        --psm 6 \
        -c tessedit_char_whitelist='&gt;'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/= \
        -c load_system_dawg=0 \
        -c load_freq_dawg=0
end
</code></pre>
<p>The above <code>fish-shell</code> command <a href="https://manpages.ubuntu.com/manpages/xenial/man1/tesseract.1.html" rel="follow">instructs </a><code><a href="https://manpages.ubuntu.com/manpages/xenial/man1/tesseract.1.html">tesseract</a>(1)</code> to assume the input is a single block of text (the <code>--psm 6</code> argument) and limit itself to decoding only legal base64 characters (and the leading <code>&gt;</code> so we can properly strip it out thereafter). My original attempt included a literal space in the valid char whitelist, but that gave me worse results: the very badly kerned base64 has significant apparent spacing between some adjacent characters (more on this later), which caused tesseract both to incorrectly inject spaces (bad, but fixable) and, possibly, to misread the character following the space (worse).</p>
<p>Unfortunately, while tesseract gave me <em>slightly</em> better output than either the original DoJ OCR or the (terrible) Adobe Acrobat Pro results, it still suffered from poor recognition and gave me very inconsistent line lengths&#8230; and it also exhibited a failure mode I didn&#8217;t expect from a heuristic-based, algorithm-driven tool, one more reminiscent of how first-generation LLMs would behave: in a few places, it would read only the first dozen or so characters on a line, leave the rest of the line blank, then pick up (correctly enough) at the start of the next line. Before I saw how generally useless the OCR results were and gave up on tesseract, I figured I&#8217;d just manually type out the rest of each aborted line (easy enough to find, thanks to the monospaced output), and <em>that</em> was when I ran into the <em>real</em> issue that took this from an interesting challenge to nearly mission impossible.</p>
<!-- Support Callout Start -->
<div style="
    margin: 35px 0;
    padding: 25px;
    background-color: #f8fbfe;
    border-left: 5px solid #1b8be0;
    border-top: 1px solid #e1eaf1;
    border-right: 1px solid #e1eaf1;
    border-bottom: 1px solid #e1eaf1;
    border-radius: 0 4px 4px 0;
    font-family: Georgia, 'Times New Roman', serif;
    line-height: 1.6;
    box-shadow: 0 2px 5px rgba(0,0,0,0.03);
">
    <p style="margin: 0; color: #333; font-size: 1.15em;">
        <span style="color: #1b8be0; font-weight: bold; font-family: Arial, sans-serif; text-transform: uppercase; font-size: 0.85em; letter-spacing: 1px; display: block; margin-bottom: 8px;">
            A note from the author
        </span>
        I do this for the love of the game and in the hope of exposing some bad people. But I do have a wife and small children, and this takes me away from the work that pays the bills. If you would like to support me on this mission, you can make a contribution
        <a href="https://mqudsi.com/donate/" style="color: #1b8be0; text-decoration: underline; font-weight: bold; transition: color 0.2s;">here</a>.
    </p>
</div>
<!-- Support Callout End -->
<p>I mentioned earlier the bad kerning, which tricked the OCR tools into injecting spaces where there were supposed to be none, but that was far from being the worst issue plaguing the PDF content. The real problem is that the text is rendered in possibly the worst typeface for the job at hand: Courier New.</p>
<p>If you&#8217;re a font enthusiast, I certainly don&#8217;t need to say any more – you&#8217;re probably already shaking with a mix of PTSD and rage. But for the benefit of everyone else, let&#8217;s just say that Courier New is&#8230; not a great font. It was a digitization of the venerable (though certainly primitive) <a href="https://en.wikipedia.org/wiki/Courier_(typeface)" rel="follow">Courier</a> typeface, commissioned by IBM in the 1950s. Courier was used (with some tweaks) for IBM typewriters, including <a href="https://en.wikipedia.org/wiki/IBM_Selectric" rel="follow">the IBM Selectric</a>, and in the 1990s it was &#8220;digitized directly from the golf ball of the IBM Selectric&#8221; by Monotype, and shipped with Windows 3.1, where it remained the default monospace font on Windows <a href="https://neosmart.net/blog/a-comprehensive-look-at-the-new-microsoft-fonts/" rel="follow">until Consolas shipped with Windows Vista</a>. Among the many issues with Courier New is that it was digitized from the Selectric golf ball &#8220;without accounting for the visual weight normally added by the typewriter&#8217;s ink ribbon&#8221;, which gives it its characteristic &#8220;thin&#8221; look. Microsoft ClearType, which was only enabled by default with Windows Vista, addressed this major shortcoming to some extent, but Courier New has always struggled with general readability&#8230; and more importantly, with its poor distinction between characters.</p>
<div id="attachment_5377" style="width: 310px" class="wp-caption alignnone"><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Courier-Weight-Comparison.svg" rel="follow"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5377" class="wp-image-5377 size-full colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Courier-Weight-Comparison.svg" alt="" width="300" height="121" /></a><p id="caption-attachment-5377" class="wp-caption-text">You can clearly see how downright anemic Courier New is when compared to the original Courier.</p></div>
<p>While not as bad as some typewriter-era typefaces that actually reused the same symbol for <code>1</code> (one) and <code>l</code> (ell), Courier New came pretty close. Here is a comparison of the two fonts rendering these two characters, considerably enlarged:</p>
<div id="attachment_5380" style="width: 310px" class="wp-caption alignnone"><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Courier-New-1-and-l-Horizontal-Comparison.svg" rel="follow"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5380" class="wp-image-5380 size-full colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Courier-New-1-and-l-Horizontal-Comparison.svg" alt="" width="300" height="75" /></a><p id="caption-attachment-5380" class="wp-caption-text">Comparing Courier and Courier New when it comes to differentiating between 1 (one) and l (ell).</p></div>
<p>The combination of the two faults (the anemic weight and the even poorer distinction between 1 and l than in the original Courier) makes Courier New a terrible choice as a programming font. But as a font used for base64 output you want to OCR? You really couldn&#8217;t pick a worse option! To add fuel to the fire, above you&#8217;re looking at SVG outlines of the fonts, meticulously converted with all the fine details preserved. But in the Epstein PDFs released by the DoJ, we only have low-quality JPEG scans at a fairly small point size. Here&#8217;s an actual (losslessly encoded) screenshot of the DoJ text at 100% – I challenge you to tell me which is a <code>1</code> and which is an <code>l</code> in the excerpt below:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l.png" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5382 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l.png" alt="" width="349" height="213" /></a></p>
<p>It&#8217;s not that there isn&#8217;t <em>any</em> difference between the two, because there is. And sometimes you get a clear gut feeling as to which is which – I was midway through manually typing out one line of base64 text when I got stuck on identifying a one vs an ell&#8230; only to realize that I had confidently transcribed one of them earlier in that same line without even pausing to think about which it was. Here&#8217;s a zoomed-in view of the scanned PDF: you can clearly see all the JPEG DCT artifacts, the color fringing, and the smearing of character shapes, all of which make it hard to properly identify the characters. But at the same time, at least in this particular sample, you can see which of the highlighted characters have a straight serif leading out of the top-left (the middle, presumably an ell) and which have the slightest of strokes/feet extending from them (the first and last, presumably ones). Whether that reflects the original glyphs or is just an artifact of how the image was compressed, though, is tough to say:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l-zoomed-in.png" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5383 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l-zoomed-in.png" alt="" width="891" height="730" srcset="https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l-zoomed-in.png 891w, https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l-zoomed-in-600x492.png 600w, https://neosmart.net/blog/wp-content/uploads/2026/02/Scanned-1-vs-l-zoomed-in-366x300.png 366w" sizes="auto, (max-width: 891px) 100vw, 891px" /></a></p>
<p>But that&#8217;s getting ahead of myself: at this point, none of the OCR tools had actually given me usable results, even ignoring the very important question of <code>l</code> vs <code>1</code>. After having been let down by one open source offering (tesseract) and two commercial ones (Adobe Acrobat Pro and, presumably, whatever the DoJ used), I made the very questionable choice of writing a script to use yet another commercial offering, this time <a href="https://aws.amazon.com/textract/pricing/" rel="follow">Amazon/AWS Textract</a>, to process the PDF. Unfortunately, using it directly via the first-party tooling was somewhat of a no-go, as it only supports smaller/shorter inputs for synchronous use; longer PDFs like this one must first be uploaded to S3, after which you kick off the async workflow to start the recognition job and poll for completion.</p>
<p>Amazon Textract did possibly the best of all the tools I tried, but its output still had obvious line-length discrepancies – albeit only one to two characters or so off on average. I decided to try again, this time blowing up the input 2x (using nearest-neighbor sampling to preserve sharp edges) as a workaround for Textract not exposing a tunable to configure the DPI the document is processed at, though I worried that all inputs might simply be prescaled to a fixed size prior to processing anyway:<sup id="rf2-5373"><a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/#fn2-5373" title="I didn&rsquo;t want to convert the PNGs back to a single PDF as I didn&rsquo;t want any further loss in quality." rel="footnote">2</a></sup></p>
<pre><code class="language-bash">&gt; for n in (printf "%02d\n" (seq 01 76))
      convert EFTA00400459-$n.png -scale 200% \
              EFTA00400459-$n"_2x".png; or break
  end
&gt; parallel -j 16 ./textract.sh {} ::: EFTA00400459-*_2x.png
</code></pre>
<p>These results were notably better, and I&#8217;ve included them in an archive, but some of the pages scanned better than others. Textract doesn&#8217;t seem to be 100% deterministic from my brief experience with it, and its features page makes vague mentions of &#8220;ML&#8221; without it being obvious when and where that kicks in or what exactly it refers to; that could explain why a couple of the pages (like <code>EFTA00400459-62_2x.txt</code>) are considerably worse than others, even though the source images don&#8217;t show a good reason for the divergence.</p>
<p>With the Textract 2x output cleaned up and piped into <code>base64 -i</code> (which ignores garbage data, generating invalid results that can still be usable for forensic analysis), I can get far enough to see that the PDF within the PDF (i.e. the actual PDF attachment originally sent) was at least partially (de)flate-encoded. Unfortunately, PDFs are binary files with different forms of compression applied; you can&#8217;t just use something like <code>strings</code> to extract any usable content. <a href="https://man.archlinux.org/man/extra/qpdf/qpdf.1.en" rel="follow"><code>qpdf(1)</code> can be (ab)used</a> to decompress a PDF (while leaving it a PDF) via <code>qpdf --qdf --object-streams=disable input.pdf decompressed.pdf</code>, but, predictably, this doesn&#8217;t work when your input is garbled and corrupted:</p>
<pre>&gt; qpdf --qdf --object-streams=disable recovered.pdf decompressed.pdf
<span style="color: #ff0000;">WARNING: recovered.pdf: file is damaged
WARNING: recovered.pdf: can't find startxref
WARNING: recovered.pdf: Attempting to reconstruct cross-reference table
WARNING: recovered.pdf (object 34 0, offset 52): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 70): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 85): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 90): unexpected &gt;
WARNING: recovered.pdf (object 34 0, offset 92): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 116): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): too many errors; giving up on reading object
WARNING: recovered.pdf (object 34 0, offset 125): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9562): expected endstream
WARNING: recovered.pdf (object 41 0, offset 8010): attempting to recover stream length
WARNING: recovered.pdf (object 41 0, offset 8010): unable to recover stream data; treating stream as empty
WARNING: recovered.pdf (object 41 0, offset 9616): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9616): EOF after endobj
qpdf: recovered.pdf: unable to find trailer dictionary while recovering damaged file</span>
</pre>
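<p>For anyone attempting partial recovery anyway: even when <code>qpdf</code> gives up, you can scan the reconstructed bytes for <code>stream</code>&#8230;<code>endstream</code> spans and throw <code>zlib</code> at each one, salvaging whatever inflates cleanly and skipping the corrupted streams. A rough Python sketch of the approach (an illustration, not the tooling I used):</p>

```python
import re
import zlib

def salvage_flate_streams(data: bytes):
    """Best-effort extraction of FlateDecode stream contents from a
    damaged PDF: inflate whatever decompresses, skip the rest."""
    recovered = []
    # Match "stream" followed by EOL, but not the tail of "endstream".
    for m in re.finditer(rb"(?<!end)stream\r?\n", data):
        start = m.end()
        end = data.find(b"endstream", start)
        if end == -1:
            end = len(data)
        d = zlib.decompressobj()
        try:
            # decompressobj tolerates trailing garbage after the
            # deflate stream ends (it lands in d.unused_data).
            recovered.append(d.decompress(data[start:end]))
        except zlib.error:
            pass  # corrupted stream; nothing usable here
    return recovered
```

Every stream that survives is a candidate page content stream or embedded object, even if the surrounding cross-reference structure is beyond repair.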
<p>Between the inconsistent OCR results and the problem with the <code>l</code> vs <code>1</code>, it&#8217;s not a very encouraging situation. To me, this is a problem begging for a (traditional, non-LLM) ML solution, specifically leveraging the fact that we know the font in question and, roughly, the compression applied. Alas, I don&#8217;t have more time to lend to this challenge at the moment, as there are a number of things I set aside just in order to publish this article.</p>
<p>So here&#8217;s the challenge for anyone I can successfully nerdsnipe:</p>
<ul>
<li>Can you manage to recreate the original PDF from the <code>Content-Transfer-Encoding: base64</code> output included in the dump? It can&#8217;t be that hard, can it?</li>
<li>Can you find other attachments included in the latest Epstein dumps that might also be possible to reconstruct? Unfortunately, the contractor that developed <a href="https://www.justice.gov/epstein" rel="follow">the full-text search</a> for the Department of Justice did a pretty crappy job, and full-text search is practically broken even accounting for the bad OCR and mangled quoted-printable decoding (malicious compliance??); nevertheless, searching for <code>Content-Transfer-Encoding</code> and <code>base64</code> returns a number of results – it&#8217;s just that, unfortunately, most are uselessly truncated or curiously contain only the extracted SMTP headers from Apple Mail.</li>
</ul>
<p>I have uploaded <a href="https://www.justice.gov/epstein/files/DataSet%209/EFTA00400459.pdf" rel="follow">the original EFTA00400459.pdf</a> from Epstein Dataset 9 as downloaded from the DoJ website <a href="https://archive.org/details/efta-00400459" rel="follow">to the Internet Archive</a>, as well as <a href="https://archive.org/download/efta-00400459-lossless-webp" rel="follow">the individual pages losslessly encoded to WebP images</a> to save you the time and trouble of converting them yourself. If it&#8217;s of any use to anyone, I&#8217;ve also uploaded the very-much-invalid Amazon Textract OCR text (from the losslessly 2x&#8217;d images), <a href="https://neosmart.net/blog/wp-content/uploads/2026/02/EFTA00400459_2x-textract.zip" rel="follow">which you can download here</a>.</p>
<p>Oh, and one final hint: when trying to figure out <code>1</code> vs <code>l</code>, I was able to do so with 100% accuracy only via trial-and-error, decoding one line of base64 text at a time, but this only works for the plain-text portions of the PDF (headers, etc). For example, I started with my best guess for one line that I&#8217;d had to type out myself when trying with tesseract, and was then able to (in this case) deduce which particular <code>1</code>s or <code>l</code>s were flipped:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/02/Distinguishing-flipped-base64.webp" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-5387 colorbox-5373" src="https://neosmart.net/blog/wp-content/uploads/2026/02/Distinguishing-flipped-base64.webp" alt="" width="728" height="198" /></a></p>
<pre>&gt; pbpaste
SW5mbzw8L01sbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc29lcmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
&gt; pbpaste | base64 -d
Info<span style="color: #ff0000;">&lt;</span>&lt;/<span style="color: #ff0000;">M</span>llustrator 81 0 R&gt;<span style="color: #ff0000;">&gt;</span>/Reso<span style="color: #ff0000;">e</span>rces<span style="color: #ff0000;">&lt;</span>&lt;/ColorSpace<span style="color: #ff0000;">&lt;</span>&lt;/CS0
&gt; 
&gt; # which I was able to correct:
&gt;
&gt; pbpaste
SW5mbzw8L0lsbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc291cmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
&gt; pbpaste | base64 -d
Info<span style="color: #ff0000;">&lt;</span>&lt;/Illustrator 81 0 R&gt;<span style="color: #ff0000;">&gt;</span>/Resources<span style="color: #ff0000;">&lt;</span>&lt;/ColorSpace<span style="color: #ff0000;">&lt;</span>&lt;/CS0
</pre>
<p>&#8230;but good luck getting that to work once you get to the flate-compressed sections of the PDF.</p>
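<p>For those plain-text regions, the trial-and-error can even be automated: enumerate every possible <code>1</code>/<code>l</code> assignment for the ambiguous positions in a line, decode each candidate, and keep whichever decodes to sensible PDF syntax. A quick Python sketch of the idea, using the mis-OCR&#8217;d line above as input (the expected-token filter is just an illustration; in practice you&#8217;d score against a dictionary of PDF name tokens):</p>

```python
import base64
from itertools import product

def one_ell_candidates(line):
    """Yield (candidate, decoded_bytes) for every 1/l flip combination."""
    positions = [i for i, c in enumerate(line) if c in "1l"]
    for combo in product("1l", repeat=len(positions)):
        chars = list(line)
        for pos, c in zip(positions, combo):
            chars[pos] = c
        candidate = "".join(chars)
        yield candidate, base64.b64decode(candidate)

# The mis-OCR'd line from above: 5 ambiguous positions, so 32 candidates.
line = "SW5mbzw8L01sbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc29lcmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG"
matches = [cand for cand, decoded in one_ell_candidates(line)
           if b"/Illustrator" in decoded and b"/Resources" in decoded]
```

In this case exactly one of the 32 candidates decodes to the expected <code>/Illustrator &#8230; /Resources</code> tokens: the corrected line shown above. The search space doubles with every ambiguous character, though, so this only stays tractable line-by-line.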
<p>I&#8217;ll be posting updates on Twitter <a href="https://x.com/mqudsi" rel="follow">@mqudsi</a>, and you can reach out to me on Signal at <code>mqudsi.42</code> if you have anything sensitive you would like to share. You can join in the discussion <a href="https://news.ycombinator.com/item?id=46890335" rel="follow">on Hacker News</a> or on <a href="https://old.reddit.com/r/netsec/comments/1qw4sfa/recreating_uncensored_epstein_pdfs_from_raw/?" rel="follow">r/netsec</a>. Leave a comment below if you have any ideas/questions, or if you think I missed something!</p>
<!-- Support Callout Start -->
<div style="
    margin: 35px 0;
    padding: 25px;
    background-color: #f8fbfe;
    border-left: 5px solid #1b8be0;
    border-top: 1px solid #e1eaf1;
    border-right: 1px solid #e1eaf1;
    border-bottom: 1px solid #e1eaf1;
    border-radius: 0 4px 4px 0;
    font-family: Georgia, 'Times New Roman', serif;
    line-height: 1.6;
    box-shadow: 0 2px 5px rgba(0,0,0,0.03);
">
    <p style="margin: 0; color: #333; font-size: 1.15em;">
        <span style="color: #1b8be0; font-weight: bold; font-family: Arial, sans-serif; text-transform: uppercase; font-size: 0.85em; letter-spacing: 1px; display: block; margin-bottom: 8px;">
            A note from the author
        </span>
        I do this for the love of the game and in the hope of exposing some bad people. But I do have a wife and small children, and this takes me away from the work that pays the bills. If you would like to support me on this mission, you can make a contribution
        <a href="https://mqudsi.com/donate/" style="color: #1b8be0; text-decoration: underline; font-weight: bold; transition: color 0.2s;">here</a>.
    </p>
</div>
<!-- Support Callout End -->
<p><strong>UPDATE</strong></p>
<p>This was solved last night (February 6, 2026). You can read about it <a href="https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/" rel="follow">in the follow-up to this article</a>.</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5373"><p>In case you&#8217;re wondering, the shell session excerpts in this article <a href="https://github.com/fish-shell/fish-shell/" rel="nofollow">are all in <code>fish</code></a>, which I think is a good fit for string wrangling because of its <a href="https://fishshell.com/docs/current/cmds/string.html" rel="follow"><code>string</code> builtin</a>, which offers extensive operations with a very human-readable syntax (at least compared to <code>perl</code> or <code>awk</code>, the usual go-tos for string manipulation) and lets you compose multiple operations as separate commands without pathological performance, since as builtins they never need to <code>fork</code>/<code>exec</code> external commands. And I&#8217;m not just saying that because of the blood, sweat, and tears I&#8217;ve contributed to the project.&nbsp;<a href="#rf1-5373" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5373"><p>I didn&#8217;t want to convert the PNGs back to a single PDF as I didn&#8217;t want any further loss in quality.&nbsp;<a href="#rf2-5373" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/">Recreating uncensored Epstein PDFs from raw encoded attachments</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/feed/</wfw:commentRss>
			<slash:comments>38</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5373</post-id>	</item>
		<item>
		<title>Sharding UUIDv7 (and UUID v3, v4, and v5) values with one function</title>
		<link>https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/</link>
					<comments>https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Wed, 28 Jan 2026 19:26:43 +0000</pubDate>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[rust]]></category>
		<category><![CDATA[sharding]]></category>
		<category><![CDATA[uuid]]></category>
		<category><![CDATA[uuidv7]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5351</guid>

					<description><![CDATA[<p>UUIDv7 (wiki link) is seeing strong and eager adoption as a solution to problems that have long plagued the tech industry, providing a solution for generating collision-free IDs on the backend that could still be sorted chronologically to play nicer &#8230; <a href="https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/">Sharding UUIDv7 (and UUID v3, v4, and v5) values with one function</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
<content:encoded><![CDATA[<p>UUIDv7 (<a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_7_(timestamp_and_random)" rel="follow">wiki link</a>) is seeing strong and eager adoption as a solution to problems that have long plagued the tech industry, providing a way to generate collision-free IDs on the backend that can still be sorted chronologically to play nicely with database indexes and other needs.<sup id="rf1-5351"><a href="https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/#fn1-5351" title="Of course, non-standardized solutions abound and UUIDv7 itself takes a lot of inspiration from predecessors like Ulid and others." rel="footnote">1</a></sup> As a quick briefer, a UUIDv7 is essentially composed of two parts: a timestamp half and a randomized-bytes half, with values sorting by the timestamp:</p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/01/UUIDv7-Diagram-outlined.svg" rel="follow"><img decoding="async" class="aligncenter size-full wp-image-5352 colorbox-5351" src="https://neosmart.net/blog/wp-content/uploads/2026/01/UUIDv7-Diagram-outlined.svg" alt="" /></a><span id="more-5351"></span></p>
<p>Let&#8217;s say you want to shard based off a UUIDv7 key: maybe to assign data across a horizontally-distributed database, or to avoid having a couple of million files in the same directory. You can&#8217;t use the timestamp portion to derive a shard key, as not even the microseconds are necessarily uniformly distributed, so you need to shard based off the random portion of the UUID.</p>
<p>Lucky for us, the last eight bytes of the UUID contain random data not just in the case of a UUIDv7 – those bytes <em>also</em> contain random data in the case of UUID v3 (MD5-hashed), UUID v4 (randomized), and UUID v5 (like v3 but using SHA-1), letting us use one approach to hash all four of these UUID types.</p>
<p>With some judicious use of the stabilized subset of const generics, we can generate a shard key <code>N</code> chars long without any allocation, and provide a wrapper <code>ShardKey</code> type to simplify both creating the shard key and passing it around, letting it decompose to a <code>&amp;str</code> in a way that wouldn&#8217;t have been possible had we returned a <code>[char; N]</code> instead.<sup id="rf2-5351"><a href="https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/#fn2-5351" title="Rust is famously one of the very few languages that uses different-sized types to represent a standalone char versus the smallest unit used to compose a String: in rust, char is 4 bytes long and UTF-32-encoded, while a String or &amp;str uses an underlying 1-byte u8 type and is UTF-8-encoded. This means you can&rsquo;t go from a String to a [char] without allocating, nor go from a [char; N] to an &amp;str without looping and allocating, either." rel="footnote">2</a></sup></p>
<pre><code class="language-rust">/// Derives a deterministic N-char sharding key from a UUID.
///
/// It extracts entropy starting from the last byte (the most random 
/// part of a UUIDv7) moving backwards, converting nibbles to 
/// hexadecimal characters. It uses only the last 8 bytes (the `rand_b`
/// section) to ensure the timestamp (first 6 bytes) is never touched, 
/// preventing "hot spot" sharding issues based on time.
///
/// # Panics
/// * Panics if the provided UUID is not shardable based off the last 
///   8 bytes (i.e. if it is not a v3, v4, v5, or v7 UUID).
/// * Panics if the requested shard key length `N` exceeds the 
///   available entropy.
pub fn shard&lt;const N: usize&gt;(uuid: &amp;uuid::Uuid) -&gt; ShardKey&lt;N&gt; {
    // Ensure UUID random bytes assumption
    if !matches!(uuid.get_version_num(), 3 | 4 | 5 | 7) {
        panic!("Provided UUID cannot be sharded with this interface!");
    }
    // Validate shard key length
    if N &gt; 15 {
        panic!("Requested shard key length exceeds available entropy!");
    }

    const HEX: &amp;[u8; 16] = b"0123456789abcdef";
    let bytes = uuid.as_bytes();
    let mut result = [b'\0'; N];

    // Generate N-char shard key
    for i in 0..N {
        // We extract data from the tail (byte 15) backwards to byte 8.
        // Bytes 8-15 in UUIDv7 contain 62 bits of randomness + 2 bits 
        // to indicate variant. We avoid bytes 0-7 (Timestamp + Version) 
        // to ensure uniform distribution.

        // Calculate byte index, consuming two nibbles per byte.
        let inverse_offset = i / 2;
        let byte_index = 15 - inverse_offset;

        let byte = bytes[byte_index];

        // Even indices take the low nibble; odd indices take the high nibble.
        let nibble = if i % 2 == 0 {
            byte &amp; 0x0F
        } else {
            (byte &gt;&gt; 4) &amp; 0x0F
        };

        result[i] = HEX[nibble as usize];
    }

    ShardKey { key: result }
}

#[repr(transparent)]
#[derive(Copy, Clone, PartialOrd, Ord, PartialEq, Eq, Hash)]
pub struct ShardKey&lt;const N: usize&gt; {
    key: [u8; N],
}

impl&lt;const N: usize&gt; ShardKey&lt;N&gt; {
    pub fn new(uuid: &amp;uuid::Uuid) -&gt; Self {
        shard(uuid)
    }

    pub fn as_str(&amp;self) -&gt; &amp;str {
        // Safe because only we have access to this field and we always
        // initialize it fully with ASCII-only characters.
        unsafe { std::str::from_utf8_unchecked(&amp;self.key) }
    }
}
</code></pre>
<p>The <code>shard()</code> function can be used standalone, or the <code>ShardKey</code> wrapper type can be created from a <code>&amp;Uuid</code> reference and passed around until the point where you need to retrieve the key with a quick <code>.as_str()</code> call. As promised, the length of the shard key is variable and parameterized: by providing different values for <code>N</code>, you can generate shard keys from one to fifteen (or sixteen, if you prefer) characters long, past which point the available uniformly-distributed entropy has been exhausted and the function will panic to ensure the contract is upheld.</p>
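<p>To sanity-check the nibble-extraction logic outside of rust, here&#8217;s a hypothetical Python port of the same algorithm; it mirrors the tail-first, low-nibble-then-high-nibble walk over bytes 15 down to 8:</p>

```python
import uuid

HEX = "0123456789abcdef"

def shard_key(u: uuid.UUID, n: int) -> str:
    """Derive a deterministic n-char shard key from the random tail
    (bytes 8-15) of a v3/v4/v5/v7 UUID, mirroring the rust version."""
    if u.version not in (3, 4, 5, 7):
        raise ValueError("UUID version has no random tail to shard on")
    if n > 15:
        raise ValueError("requested key length exceeds available entropy")
    b = u.bytes
    out = []
    for i in range(n):
        byte = b[15 - i // 2]  # walk backwards from the last byte
        # Even indices take the low nibble; odd indices the high nibble.
        nibble = byte & 0x0F if i % 2 == 0 else (byte >> 4) & 0x0F
        out.append(HEX[nibble])
    return "".join(out)
```

Because the key is derived purely from the UUID&#8217;s own bytes, the same UUID always maps to the same shard, in any language that implements the same walk.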
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5351"><p>Of course, non-standardized solutions abound and UUIDv7 itself takes a lot of inspiration from predecessors like <a href="https://github.com/ulid/spec" rel="nofollow">Ulid</a> and others.&nbsp;<a href="#rf1-5351" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5351"><p>Rust is famously one of the very few languages that uses different-sized types to represent a standalone <code>char</code> versus the smallest unit used to compose a <code>String</code>: a <code>char</code> is 4 bytes long and UTF-32-encoded, while a <code>String</code> or <code>&amp;str</code> is backed by the 1-byte <code>u8</code> type and is UTF-8-encoded. This means you can&#8217;t go from a <code>String</code> to a <code>[char]</code> without allocating, nor from a <code>[char; N]</code> to a <code>&amp;str</code> without looping and allocating, either.&nbsp;<a href="#rf2-5351" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/">Sharding UUIDv7 (and UUID v3, v4, and v5) values with one function</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/sharding-uuidv7-and-uuid-v3-v4-and-v5-with-one-function/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5351</post-id>	</item>
		<item>
		<title>How to (safely) swap the contents of two files opened in (neo)vim buffers</title>
		<link>https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/</link>
					<comments>https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Sat, 24 Jan 2026 17:54:18 +0000</pubDate>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[neovim]]></category>
		<category><![CDATA[tips and tricks]]></category>
		<category><![CDATA[vim]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5319</guid>

					<description><![CDATA[<p>Raise your hand if you&#8217;ve been here before: you have file1 open in a vim or neovim buffer and you want to &#8220;fork&#8221; its contents over to file2, but you need to reference file1 while you do so. So you &#8230; <a href="https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/">How to (safely) swap the contents of two files opened in (neo)vim buffers</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/01/neovim.svg" rel="follow"><img loading="lazy" decoding="async" class="wp-image-5322 alignright colorbox-5319" src="https://neosmart.net/blog/wp-content/uploads/2026/01/neovim.svg" alt="" width="114" height="137" /></a>Raise your hand if you&#8217;ve been here before: you have <code>file1</code> open in a vim or neovim buffer and you want to &#8220;fork&#8221; its contents over to <code>file2</code>, but you need to reference <code>file1</code> while you do so. So you do the obvious: you open a split buffer with <code>:sp</code> or <code>:vsp</code>, run a quick <code>:saveas file2</code> then hack away at the file to make the changes you want followed by <code>:w</code> (or whatever shortcut you have mapped to the same) and call it a day… only to realize that you were in the wrong split and that you&#8217;ve accidentally switched <code>file1</code> and <code>file2</code> around?</p>
<p><span id="more-5319"></span></p>
<p>You <i>could</i> just do a quick <code>:saveas! file1</code> and <code>:saveas! file2</code> in the respective buffers to set things right, <em>but of course</em> you have <code>autoread</code> enabled, so correcting the file in one buffer will cause the other copy in the second buffer to be immediately lost!<sup id="rf1-5319"><a href="https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/#fn1-5319" title="Yes, of course you can undo the autoread and then :saveas again, but do you really want to be taking that risk with your hours of work?" rel="footnote">1</a></sup></p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/01/vim-help-0f.webp" rel="follow"><img loading="lazy" decoding="async" class="aligncenter wp-image-5327 size-full colorbox-5319" src="https://neosmart.net/blog/wp-content/uploads/2026/01/vim-help-0f.webp" alt=" :0file :0f[ile][!] Remove the name of the current buffer. The optional ! avoids truncating the message, as with :file. :buffers :files :ls List all the currently known file names. See windows.txt :files :buffers :ls." width="846" height="234" /></a></p>
<p>The solution is a quick <code>:0f</code> away: it&#8217;s shorthand for <code>:0file</code>, which removes the name of the current buffer, leaving its contents unchanged but severing the link with the underlying file. After you run <code>:0f</code> in one buffer, you can swap back to the other and save it to the correct path without risking the now-unnamed buffer&#8217;s contents being clobbered, since that buffer has been dissociated from the file on disk. Then you can go back to the unnamed buffer and save it to the correct path too, and all will be well.</p>
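<p>Putting it all together, the rescue sequence looks something like this (a sketch only; which buffer holds which contents depends on exactly where you went wrong):</p>
<pre><code class="language-vim">" In the buffer whose file name is wrong (but whose contents you want to keep):
:0f             " sever the buffer's tie to its file; the contents stay put
" Switch to the other buffer and save its contents to their rightful path:
:saveas! file1
" Finally, back in the now-unnamed buffer:
:saveas! file2
</code></pre>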
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5319"><p>Yes, of course you can undo the <code>autoread</code> and then <code>:saveas</code> again, but do you really want to be taking that risk with your hours of work?&nbsp;<a href="#rf1-5319" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/">How to (safely) swap the contents of two files opened in (neo)vim buffers</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/how-to-swap-the-contents-of-two-files-opened-in-neovim-buffers/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5319</post-id>	</item>
		<item>
		<title>Portable (Cartesian) brace expansion in your shell</title>
		<link>https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/</link>
					<comments>https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Fri, 23 Jan 2026 16:37:34 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[expansion]]></category>
		<category><![CDATA[fish]]></category>
		<category><![CDATA[shell]]></category>
		<category><![CDATA[tips and tricks]]></category>
		<category><![CDATA[unix]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5303</guid>

					<description><![CDATA[<p>Cartesian expansion, also known as brace expansion, is an incredibly powerful feature of most unixy shells, but despite being fundamentally simple and incredibly empowering, it&#8217;s been traditionally relegated to the dark and shadowy corners of command line hacking, employed only &#8230; <a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/">Portable (Cartesian) brace expansion in your shell</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Cartesian expansion, also known as brace expansion, is an incredibly powerful feature of most unixy shells, but despite being fundamentally simple and incredibly empowering, it&#8217;s been traditionally relegated to the dark and shadowy corners of command line hacking, employed only by greybeards looking to avoid repeating themselves at any cost. And, boy, does it really cut down on repetition.</p>
<p>Take for example this snippet from a <code>Dockerfile</code> that sets up permissions on certain directories:</p>
<pre><code class="language-bash">mkdir -p /var/log/php /var/log/unitd /var/log/mysql
chown -R user:user /var/log/php /var/log/unitd /var/log/mysql
chmod ug+rw /var/log/php /var/log/unitd /var/log/mysql
</code></pre>
<p><span id="more-5303"></span></p>
<p>You can immediately see how the <code>/var/log/</code> prefix is a source of repetition: it wastes valuable horizontal screen real estate, leaves extra room for typos or mistakes, and dilutes the signal-to-noise ratio of the lines in question.</p>
<p>In most shells,<sup id="rf1-5303"><a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/#fn1-5303" title="Including, but not limited to, fish (the one true shell), bash, zsh, ksh (as of ksh88), zsh, and csh/tcsh (where this feature originated). But notably not supported by dash and not part of the posix sh spec." rel="footnote">1</a></sup> you can use braces (<code>{}</code>)<sup id="rf2-5303"><a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/#fn2-5303" title="These are called braces, or sometimes curly braces. They are not brackets ([]) and there is no such thing as a &ldquo;curly bracket&rdquo;." rel="footnote">2</a></sup> to accomplish the same with much less repetition:</p>
<pre><code class="language-bash">mkdir -p /var/log/{php,unitd,mysql}
chown -R user:user /var/log/{php,unitd,mysql}
chmod ug+rw /var/log/{php,unitd,mysql}
</code></pre>
<p>…which expands to the same, effectively operating as a declarative &#8220;for loop&#8221; running once for each of the items within the braces. For example, <code>echo {hello,goodbye}_world</code> prints <code>hello_world goodbye_world</code> (an <em>m</em>&#215;<em>n</em> Cartesian product, with each braced or unbraced component forming one set), while <code>echo {hello,goodbye}_world{.com,.exe}</code> prints <code>hello_world.com hello_world.exe goodbye_world.com goodbye_world.exe</code> &#8211; an <em>m</em>&#215;<em>n</em>&#215;<em>o</em> product with <em>m</em> equal to two, <em>n</em> equal to one, and <em>o</em> equal to two, for four results in total.</p>
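<p>For the avoidance of doubt, the two-set example is equivalent to this nested loop (POSIX <code>sh</code>, purely illustrative &#8211; your shell&#8217;s expansion may emit the combinations in a different order):</p>

```shell
# Each level of nesting corresponds to one braced set; the body runs
# once per element of the m-by-n Cartesian product.
for name in hello goodbye; do
    for ext in .com .exe; do
        printf '%s_world%s\n' "$name" "$ext"
    done
done
```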
<p>Depending on the shell, this functionality can be taken a step further and used to do questionable things, like combining with glob expansion to get a list of all files with a certain extension that appear in one of <em>n</em> different directories with an expression like <code>foo/{bar,baz}/*.{c,h}</code> or anything else your heart desires.</p>
<p>Alas, the one issue with Cartesian/brace expansion is that it isn&#8217;t formalized in the <a href="https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/V3_chap02.html" rel="follow">posix <code>sh</code> spec</a>, meaning you can&#8217;t use it in truly portable shell scripts or with a <code>#!/bin/sh</code> shebang. But fret not, because you can still accomplish the same thing with another feature: subshell output interpolation.</p>
<p>To jump right to it, here is portable syntax that will accomplish the same as our original demonstration snippet, except this is truly portable and you can use it in your Alpine Linux <code>Dockerfile</code> running <code>sh</code>:</p>
<pre><code class="language-bash">mkdir -p $(printf "/var/log/%s\n" php unitd mysql)
chown -R user:user $(printf "/var/log/%s " php unitd mysql)
chmod ug+rw $(printf "/var/log/%s\n" php unitd mysql)
</code></pre>
<p>As you can see, the trick is to keep the <code>/var/log/</code> prefix <em>inside</em> the command substitution so that <code>printf</code> applies it to every item, taking advantage of the fact that <code>printf</code> is spec&#8217;d to reuse its format string when it is provided with more arguments than placeholders. Emit the results space-separated (as in the second line) if your <code>$IFS</code> configuration splits on spaces, or newline-separated (as in the first and last lines) if you want to support spaces in paths and have <code>$IFS</code> set to split only on new lines (or you&#8217;re smart and sane and use <a href="https://github.com/fish-shell/fish-shell/" rel="nofollow"><code>fish</code></a>, which splits command substitutions only on new lines anyway) &#8211; in which case the space-separated variant won&#8217;t work for you.</p>
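<p>The <code>printf</code> behavior the trick relies on is easy to verify in isolation; the directory names below are just the ones from the example:</p>

```shell
# One format string, three arguments: printf loops over the arguments,
# reapplying the format until they are exhausted.
printf '/var/log/%s\n' php unitd mysql
# prints:
# /var/log/php
# /var/log/unitd
# /var/log/mysql
```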
<p>So keep it DRY<sup id="rf3-5303"><a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/#fn3-5303" title="Don&rsquo;t Repeat Yourself" rel="footnote">3</a></sup> and embrace the power of brace expansion (or Cartesian expansion in general, if you don&#8217;t have access to brace expansion in your shell of choice) the next time you&#8217;re banging out a shell script!</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5303"><p>Including, but not limited to, <a href="https://github.com/fish-shell/fish-shell/" rel="nofollow">fish</a> (the one true shell), bash, zsh, ksh (as of ksh88), zsh, and csh/tcsh (where this feature originated). But notably not supported by dash and not part of the posix <code>sh</code> spec.&nbsp;<a href="#rf1-5303" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5303"><p>These are called braces, or sometimes curly braces. They are not brackets (<code>[]</code>) and there is no such thing as a &#8220;curly bracket&#8221;.&nbsp;<a href="#rf2-5303" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5303"><p>Don&#8217;t Repeat Yourself&nbsp;<a href="#rf3-5303" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/">Portable (Cartesian) brace expansion in your shell</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/portable-cartesian-brace-expansion-in-your-shell/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5303</post-id>	</item>
		<item>
		<title>Namecheap takes down domain hosting video archives of Israeli war crimes</title>
		<link>https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/</link>
					<comments>https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Thu, 01 Jan 2026 19:05:40 +0000</pubDate>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[genocide.live]]></category>
		<category><![CDATA[namecheap]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5232</guid>

					<description><![CDATA[<p>Namecheap.com, the popular domain name and webhosting platform, has disabled the Genocide.live domain name, which was home to a publicly accessible archive of over 16,000 videos documenting alleged Israeli war crimes, the vast majority of which were recorded since the &#8230; <a href="https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/">Namecheap takes down domain hosting video archives of Israeli war crimes</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Namecheap.com, the popular domain name and webhosting platform, has disabled the Genocide.live domain name, which was home to a publicly accessible archive of over 16,000 videos documenting alleged Israeli war crimes, the vast majority of which were recorded since the onset of the war on Gaza in late 2023. The archive, formerly known as TikTokGenocide, was previously submitted as &#8220;evidence on the State of Israel’s acts of genocide against the Palestinians in Gaza&#8221; by the South African UN delegation to the United Nations Security Council <a href="https://docs.un.org/en/S/2025/130" rel="follow">in February of 2025</a> and is also included in ongoing court proceedings of <a href="https://en.wikipedia.org/wiki/South_Africa%27s_genocide_case_against_Israel" rel="follow">the International Court of Justice case</a> <em>South Africa (et. al.) v. Israel</em>.</p>
<p><span id="more-5232"></span></p>
<p><a href="https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview.png" rel="follow"><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-5256 colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview-1024x538.png" alt="" width="620" height="326" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview-1024x538.png 1024w, https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview-600x315.png 600w, https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview-500x263.png 500w, https://neosmart.net/blog/wp-content/uploads/2026/01/tiktok-genocide-social-preview.png 1200w" sizes="auto, (max-width: 620px) 100vw, 620px" /></a></p>
<p>In a New Year&#8217;s tweet earlier today, the maintainer of the site, going by the alias <a href="https://x.com/receipts_lol" rel="follow">Zionism Observer</a> on Twitter, detailed the suspension of the Genocide.live domain name (with the registrar placing the domain under a <code>clientHold</code>), under the seemingly ridiculous claim that it hosted material that &#8220;promotes, encourages, engages or displays cruelty to humans or animals:&#8221;</p>
<p><a href="https://x.com/receipts_lol/status/2006732606164152651" rel="follow"><img loading="lazy" decoding="async" class="aligncenter wp-image-5235 colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-namecheap-suspension.png" alt="" width="416" height="433" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-namecheap-suspension.png 596w, https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-namecheap-suspension-576x600.png 576w, https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-namecheap-suspension-288x300.png 288w" sizes="auto, (max-width: 416px) 100vw, 416px" /></a></p>
<p>It’s true that the site&#8217;s content did depict gross violations of human rights (<a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-animal-abuse.png" rel="follow">along with 70+ videos of animal rights violations</a>), but it could not honestly (or even mistakenly) be described as having been posted for the purpose of &#8220;promoting, encouraging,&#8221; or glorifying such violations – an accusation much akin to doggedly accusing Wikipedia of being a pornography site because it contains medical and anatomical depictions of the human body. The maintainer has also clarified that all sensitive videos and imagery were placed behind a click-to-view overlay with a warning about the violent or triggering nature of the content.</p>
<p>While the site has reportedly been mirrored and backed up to multiple locations, there remain obvious problems with this seizure. Various links to individual videos from the archive – all now broken – have been entered into evidence with international and supranational courts in multiple ongoing proceedings opened against the State of Israel (and, separately, individual Israeli soldiers) over the past couple of years, so the takedown of the domain could have severe legal ramifications if not rectified posthaste – and could potentially open Namecheap to accusations of obstruction of justice, which their General Counsel might wish to consider.<sup id="rf1-5232"><a href="https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/#fn1-5232" title="Interestingly, Namecheap does not publicly name their Chief Legal Officer/General Counsel, leading even the Office of the Attorney General of the State of New York to leave them unnamed in previous legal petitions." rel="footnote">1</a></sup></p>
<div id="attachment_5237" style="width: 1312px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5237" class="size-full wp-image-5237 colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-footnote.jpeg" alt="" width="1302" height="884" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-footnote.jpeg 1302w, https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-footnote-600x407.jpeg 600w, https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-footnote-1024x695.jpeg 1024w, https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-footnote-442x300.jpeg 442w" sizes="auto, (max-width: 1302px) 100vw, 1302px" /><p id="caption-attachment-5237" class="wp-caption-text">A citation of the Genocide.live archives in legal proceedings accusing Israel of war crimes and genocide in Gaza, <a href="https://x.com/receipts_lol/status/1982872713283895503" rel="follow">provided by the site&#8217;s maintainer</a>.</p></div>
<div id="attachment_5233" style="width: 662px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5233" class="wp-image-5233 size-full colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-in-UNSC-submission.png" alt="" width="652" height="102" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-in-UNSC-submission.png 652w, https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-in-UNSC-submission-600x94.png 600w, https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-in-UNSC-submission-500x78.png 500w" sizes="auto, (max-width: 652px) 100vw, 652px" /><p id="caption-attachment-5233" class="wp-caption-text">An instance of Genocide.live, née TikTokGenocide.com, in UN Security Council proceedings <a href="https://docs.un.org/en/S/2025/130" rel="follow">from February 2025</a>.</p></div>
<p>The archive had already come into the legal spotlight earlier last year, when it was forced to change from its original name of TikTokGenocide to Genocide.live after cease-and-desist requests from the new owners of the popular social media platform targeted its use of the domain name – despite its arguably fair-use claim to the name.<sup id="rf2-5232"><a href="https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/#fn2-5232" title="The previous name was a not unsubtle jab at the public broadcasting of war crimes by IDF soldiers as they boasted about the destruction they wrought in Gaza." rel="footnote">2</a></sup></p>
<p><a href="https://x.com/receipts_lol/status/2006737285203870092?s=20" rel="follow"><img loading="lazy" decoding="async" class="aligncenter wp-image-5234 colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-domain-name-change.png" alt="" width="432" height="461" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-domain-name-change.png 595w, https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-domain-name-change-561x600.png 561w, https://neosmart.net/blog/wp-content/uploads/2026/01/ZO-domain-name-change-281x300.png 281w" sizes="auto, (max-width: 432px) 100vw, 432px" /></a></p>
<p>Namecheap is no stranger to accusations of frivolous or unjustified domain name suspensions – a search on Reddit reveals <a href="https://www.google.com/search?q=namecheap+domain+name+suspended+site:www.reddit.com" rel="follow">hundreds of such complaints in recent years</a> – but neither are any of the other large domain name registrars exempt from reports of such behavior. The site&#8217;s maintainer suggests that the decision to suspend the domain over the New Years holiday <a href="https://x.com/receipts_lol/status/2006746524844437525?s=20" rel="follow">might not have been coincidental</a>, as it robbed them of <a href="https://x.com/receipts_lol/status/2006735224575856986?s=20" rel="follow">a 48-hour notice of impending revocation</a> (with, apparently, no option to appeal or contest the charges), during which time they might have been able to transfer the domain to a different registrar.</p>
<div id="attachment_5238" style="width: 630px" class="wp-caption aligncenter"><a href="https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter.jpeg" rel="follow"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5238" class="size-large wp-image-5238 colorbox-5232" src="https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter-1024x255.jpeg" alt="" width="620" height="154" srcset="https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter-1024x255.jpeg 1024w, https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter-600x149.jpeg 600w, https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter-500x125.jpeg 500w, https://neosmart.net/blog/wp-content/uploads/2026/01/Genocide.live-trending-on-Twitter.jpeg 1124w" sizes="auto, (max-width: 620px) 100vw, 620px" /></a><p id="caption-attachment-5238" class="wp-caption-text">The seizure of the domain has gone somewhat viral on Twitter, where it has become a trending topic with users speculating on the motivation and suggesting that the domain registrar might have been the target of an email campaign by those unhappy with the archive&#8217;s existence or growing prominence.</p></div>
<p>In addition to hosting over 16,000 videos documenting evidence of war crimes by Israeli soldiers and examples of genocidal intent from Israeli military and civil leaders, the Genocide.live archive also included <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-interactive-maps.png" rel="follow">an interactive map of Gaza</a> detailing IDF violations against the populace in each area, a <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-geolocated-events.png" rel="follow">geolocated index of the videos</a> for which location data was positively determined, a <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-categories.png" rel="follow">categorized listing of videos</a> detailing the nature of violations, an extensive index of <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-targets.png" rel="follow">the different types of victims of Israeli aggression</a>, a cross-indexed <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-weapons.png" rel="follow">reference of various weapons of war used</a>, and, perhaps most sensitively of all, a cross-indexed <a href="https://neosmart.net/blog/wp-content/uploads/2026/01/genocide.live-idf.png" rel="follow">list of individual IDF brigades and battalions</a> tied to each of the hosted pieces of evidence, where that information was available.</p>
<p>The site&#8217;s maintainer noted with concern an anomaly in the traffic <a href="https://x.com/receipts_lol/status/2006806978732208286?s=20" rel="follow">the archive was receiving</a> from Israel, and had earlier reported <a href="https://x.com/receipts_lol/status/2001389379034472619" rel="follow">recent attempts to probe their infrastructure</a> from the same.</p>
<p>Genocide.live is a part of the <a href="https://databasesforpalestine.org/" rel="follow">Databases for Palestine</a> project, a collective founded in December of 2023 that uses tech to shed light on the terrible situation in Gaza and on the acts of the Israeli government and army that have led to Israel being credibly accused of committing genocide in Palestine by prominent human rights organizations including Amnesty International, Médecins Sans Frontières, and Human Rights Watch, among others.</p>
<p>Requests for comment from Namecheap have not yet been met with a response.</p>
<p class="info"><strong>Update January 5, 2026</strong><br />
Following approximately 96 hours of downtime, the <a href="http://Genocide.live" rel="follow">Genocide.live</a> video archive is now accessible once more. Namecheap agreed to un-suspend the domain in response to significant social media backlash, but refused to continue acting as the archive&#8217;s registrar. After some extended back-and-forth, the registrar-enforced <code>clientTransferProhibited</code> hold was ultimately removed, allowing the site&#8217;s maintainers to move the domain to <a href="https://trustname.com" rel="follow">Trustname</a>, their new registrar.</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5232"><p>Interestingly, Namecheap does not publicly name their Chief Legal Officer/General Counsel, leading even the Office of the Attorney General of the State of New York to leave them unnamed <a href="https://ag.ny.gov/sites/default/files/3.20.20_letter_concerning_namecheap_and_coronavirus_1.pdf" rel="follow">in previous legal petitions</a>.&nbsp;<a href="#rf1-5232" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5232"><p>The previous name was a not unsubtle jab at the public broadcasting of war crimes by IDF soldiers as they boasted about the destruction they wrought in Gaza.&nbsp;<a href="#rf2-5232" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/">Namecheap takes down domain hosting video archives of Israeli war crimes</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/namecheap-com-revokes-domain-hosting-video-archives-of-israeli-war-crimes/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5232</post-id>	</item>
		<item>
		<title>FreeBSD 15.0 post-upgrade reboot loop</title>
		<link>https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/</link>
					<comments>https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Tue, 09 Dec 2025 19:34:30 +0000</pubDate>
				<category><![CDATA[Software]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5223</guid>

					<description><![CDATA[<p>This post is less of a deep dive into a bug I ran into upgrading an x86_64 machine from FreeBSD 14.3 to FreeBSD 15 and more of a PSA: I have a possible workaround for anyone that runs into the &#8230; <a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/">FreeBSD 15.0 post-upgrade reboot loop</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><img loading="lazy" decoding="async" class="alignright wp-image-5230 colorbox-5223" src="https://neosmart.net/blog/wp-content/uploads/2025/12/freebsd-icon.png" alt="" width="122" height="122" srcset="https://neosmart.net/blog/wp-content/uploads/2025/12/freebsd-icon.png 256w, https://neosmart.net/blog/wp-content/uploads/2025/12/freebsd-icon-150x150.png 150w" sizes="auto, (max-width: 122px) 100vw, 122px" /></p>
<p><em>This post is less of a deep dive into a bug I ran into upgrading an x86_64 machine from FreeBSD 14.3 to FreeBSD 15 and more of a PSA: I have a possible workaround for anyone who runs into the same, but I don&#8217;t have a full root-cause analysis or proper diagnosis of the underlying issue.</em></p>
<p><a href="https://www.freebsd.org/releases/15.0R/announce/" rel="follow">FreeBSD 15.0 was released a week ago</a>, and I decided to try to upgrade one of my ZFS appliance servers (running nothing more than ZFS and some scripts) to it as a possible low-stakes trial run. The machine in question is a fairly old (but very reliable) Dell PowerEdge R720, running FreeBSD 14.3p2 and booting in UEFI mode from a ZFS <code>zroot</code> pool at the time.</p>
<p>As always, I started my FreeBSD upgrade with the usual <code>sudo zfs snap -r zroot@freebsd-14.3p2</code> prior to anything else (yes, I know about ZFS boot environments, but I also know that ZFS snapshots are fast and free). The first part of the upgrade went swimmingly: after installing the newest version of <code>freebsd-rustdate</code> from the <code>pkg</code> repos, I executed <code>freebsd-rustdate upgrade -r 15.0-RELEASE</code> followed by <code>freebsd-rustdate install</code>; the initial upgrade of the kernel components to <code>15.0-RELEASE</code> completed without issue, and I was prompted to restart the system… and that&#8217;s when the troubles began.</p>
<p><span id="more-5223"></span></p>
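<p>For reference, the kernel half of the upgrade condenses to the following (assuming, as described above, that <code>freebsd-rustdate</code> comes from the <code>pkg</code> repos; the snapshot name is arbitrary):</p>
<pre><code class="language-shell"># safety net first: recursively snapshot the root pool
sudo zfs snap -r zroot@freebsd-14.3p2

# grab the upgrade tooling and stage the 15.0-RELEASE kernel upgrade
sudo pkg install freebsd-rustdate
sudo freebsd-rustdate upgrade -r 15.0-RELEASE
sudo freebsd-rustdate install
sudo reboot
</code></pre>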
<p>This was a headless machine racked elsewhere, and to my dismay I found that I wasn&#8217;t able to SSH into it after waiting the requisite amount of time. After discovering I wasn&#8217;t even able to ping the machine, I resigned myself to attaching a display and keyboard to the machine and went to see what was going on. Despite it being at least ten minutes (and probably more) since the initial part of the upgrade had succeeded, connecting the display caught the final moments of the machine as it was syncing disks in prep for a reboot. Had some service really taken that long to shut down after I had executed <code>sudo reboot</code>?</p>
<p>Despite seeing what finally looked like progress, I figured I might as well watch the system come back online since I was already there with the display and keyboard hooked up. After the agonizingly long POST process followed by Dell&#8217;s inventory manager startup and the HBA additional boot-time payload execution (and Intel&#8217;s iSCSI firmware, and, and, and), the machine began to boot FreeBSD. I looked away for a second&#8230; only to see the last vestiges of a shutdown notice as the system began to reboot just after coming up! Ten minutes later, this time with my iPhone recording video in slow motion,<sup id="rf1-5223"><a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/#fn1-5223" title="A really dumb &ndash; yet undeniably useful &ndash; trick I learned to capture everything that takes place prior to a kernel panic or other sudden reboot, if the machine isn&rsquo;t connected over serial. Don&rsquo;t knock it until you&rsquo;ve tried it, the iPhone camera is fast enough to capture everything that gets dumped to screen and you can scrub through it as slowly as you like." rel="footnote">1</a></sup> I was unfortunately able to verify that it wasn&#8217;t a fluke: the machine was indeed stuck in a reboot loop that kicked off just after all the services were brought up successfully and the kernel&#8217;s writes to the vtty quiesced. It didn&#8217;t look like anything had gone wrong: there was no kernel panic, services like <code>sshd</code> were being shut down in an orderly fashion, and the disks were synced before the machine power cycled.</p>
<p>I naturally tried to boot up in single-user mode, both by pressing <code>2</code> at the UEFI bootloader menu and by manually executing <code>boot -s</code> at the bootloader prompt: both attempts failed in the exact same manner. Stymied and <em>really</em> frustrated with how long each attempt took thanks to all the boot-time firmware applications and processes that had to run before the system began to actually boot, I shut down the machine, yanked out the OS drive, and connected it to a spare desktop to be able to iterate more quickly.</p>
<p>And imagine my surprise when the FreeBSD 15 install booted up just fine with the disk in the test rig, with no hint of a reboot loop even in normal/multi-user mode!</p>
<p>I took the opportunity of having the machine fully boot up in normal mode to investigate the situation from the comfort of my regular environment, with my preferred keyboard layout, tools, and editors at my service (much better than a rescue shell!), and was surprised to see absolutely nothing out of the ordinary. The logs didn&#8217;t reveal any errors, no watchdog timers kicking in, no kernel panics (as I&#8217;d already surmised), no filesystem errors, no hardware failures (that were logged, at any rate). At this point, my best guess was that there was possibly something amiss with the ACPI support for this machine in the FreeBSD 15 release.<sup id="rf2-5223"><a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/#fn2-5223" title="Although an errant interpretation of the ACPI data would normally result in a full shutdown, not a system reboot." rel="footnote">2</a></sup></p>
<p>Before trying to disable ACPI in the bootloader configuration, I figured I would try one last thing: finish the upgrade by updating the FreeBSD userland from 14.3 to 15.0, so I ran the requisite commands to bootstrap <code>pkg</code>, upgrade all installed packages, and then finish the <code>freebsd-rustdate</code>/<code>freebsd-update</code> install process. With the update fully complete, I ejected the root disk from the test machine and connected it back to the R720 server, though not with much hope of success.</p>
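<p>Roughly, that second half of the upgrade looked like this (the exact invocations may differ slightly depending on your <code>pkg</code> setup):</p>
<pre><code class="language-shell"># with the disk booted in another machine, finish upgrading the userland
sudo pkg bootstrap -f            # re-bootstrap the pkg tooling for FreeBSD 15
sudo pkg upgrade                 # upgrade all installed packages
sudo freebsd-rustdate install    # second install pass: the 15.0 userland
</code></pre>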
<p>But my skepticism was unwarranted: the fully upgraded userland didn&#8217;t exhibit the same symptoms I&#8217;d seen earlier and booted right into multi-user mode without a hint of a reboot loop! I had been prepared to really dive into the issue, disabling ACPI to see if that fixed the problem and bisecting my way through the system services to see if any were responsible (even though the loop happened in single-user mode as well).</p>
<p>So if anyone finds themselves stuck with a reboot loop after upgrading to FreeBSD 15, try sticking the disk in another machine and completely upgrading the userland to FreeBSD 15 to see if that takes care of your problem. It did for me – though it certainly left me feeling particularly dissatisfied that I wasn&#8217;t able to figure out what actually caused the issue, and of course, always wondering if it would start happening again.<sup id="rf3-5223"><a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/#fn3-5223" title="Which is the only healthy reaction to heisenbugs if you&rsquo;re not able to determine what&rsquo;s causing them!" rel="footnote">3</a></sup></p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5223"><p>A really dumb – yet undeniably useful – trick I learned to capture everything that takes place prior to a kernel panic or other sudden reboot, if the machine isn&#8217;t connected over serial. Don&#8217;t knock it until you&#8217;ve tried it, the iPhone camera is fast enough to capture everything that gets dumped to screen and you can scrub through it as slowly as you like.&nbsp;<a href="#rf1-5223" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5223"><p>Although an errant interpretation of the ACPI data would normally result in a full shutdown, not a system reboot.&nbsp;<a href="#rf2-5223" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5223"><p>Which is the only healthy reaction to heisenbugs if you&#8217;re not able to determine what&#8217;s causing them!&nbsp;<a href="#rf3-5223" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/">FreeBSD 15.0 post-upgrade reboot loop</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/freebsd-15-0-post-upgrade-reboot-loop/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5223</post-id>	</item>
		<item>
		<title>The idiomatic ZFS on Linux quickstart cheat-sheet</title>
		<link>https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/</link>
					<comments>https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Wed, 10 Sep 2025 20:10:18 +0000</pubDate>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[freebsd]]></category>
		<category><![CDATA[systemd]]></category>
		<category><![CDATA[ubuntu]]></category>
		<category><![CDATA[zfs]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5207</guid>

					<description><![CDATA[<p>I&#8217;m a FreeBSD guy that has had a long, serious, and very much monogamous relationship with ZFS. I experimented with Solaris 9 to learn about ZFS, adopted OpenSolaris (2008?) back in the &#8220;aughts&#8221; for my first ZFS server, transitioned my &#8230; <a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/">The idiomatic ZFS on Linux quickstart cheat-sheet</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>I&#8217;m a FreeBSD guy who has had a long, serious, and very much monogamous relationship with ZFS. I experimented with Solaris 9 to learn about ZFS, adopted OpenSolaris (2008?) back in the &#8220;aughts&#8221; for my first ZFS server, transitioned my installations over to OpenIndiana after Oracle bought Sun Microsystems out, and then at some point switched to FreeBSD, which I found to be a better-designed OS once I had moved everything headless and was ready to completely bid the need for a desktop environment goodbye. But every once in a while I have to stand up a ZFS installation on Ubuntu, and then I spend a little too much time trying to remember how to do ZFS things that FreeBSD makes easy out-of-the-box. After doing that one time too many, I decided to put down my Linux-specific notes in a post for others (and myself) to reference in the future.</p>
<h2>A fully functional ZFS setup following ZFS best practices and Linux/Ubuntu idiomatic approaches</h2>
<p>This guide will focus mainly on the Linux sysadmin side of things; note that basic knowledge and understanding of ZFS concepts and principles is assumed, but I&#8217;ll do my best to provide a succinct summary of what we&#8217;re doing and why we&#8217;re doing it at each point.</p>
<p><span id="more-5207"></span></p>
<h3>Step 1: Installing ZFS</h3>
<p>Unlike on FreeBSD, on Linux you need to manually install the ZFS kernel modules and userland tooling to bring in ZFS filesystem support and install the venerable <code>zfs</code> and <code>zpool</code> utils used to manage a ZFS installation. Canonical&#8217;s Ubuntu was, to my knowledge, the first to offer a pre-packaged ZFS option for Linux users (after gambling Oracle wouldn&#8217;t sue them for violating the CDDL license if they included ZFS support in their repos), and I believe it&#8217;s still the most popular Linux distribution for ZFS users, so the specific command line incantations below are for Ubuntu:</p>
<pre><code class="language-shell">sudo apt install zfs-dkms
</code></pre>
<p>This will download, build, and install the ZFS kernel modules to match the version of the Linux kernel you&#8217;re currently running. Unlike most kernel modules,<sup id="rf1-5207"><a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#fn1-5207" title="Largely due to licensing restriction workarounds" rel="footnote">1</a></sup> ZFS support isn&#8217;t built or distributed as part of the base kernel that Canonical maintains for its distributions; instead, you have to manually build and load the kernel module that provides ZFS support (but this is automated by the <code>.deb</code> installed with <code>apt</code>) &#8211; and this needs to be done each time you upgrade the kernel.<sup id="rf2-5207"><a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#fn2-5207" title="This too is *normally* taken care of by the package manager, provided all the packages have been correctly built and uploaded to the repository by the time you try to install a newer kernel version. There&rsquo;s very little you have to do manually." rel="footnote">2</a></sup> All this really means is that installing <code>zfs-dkms</code> will take longer than installing most packages &#8211; expect the installation process to look like it&#8217;s stuck and be extra patient. Installing <code>zfs-dkms</code> will also pull in an automatic dependency on the userspace tools, <code>zfsutils-linux</code>, as well as other ZFS-related libraries and dependencies.</p>
<h3>Step 2: Setting up your zpool</h3>
<p>This part of the process is largely going to be the same regardless of which operating system you are using and is standard ZFS fare. You&#8217;ll need to identify the drives you wish to use in your zpool (the ZFS abstraction over the physical disks, arranged in the hierarchy/topology you desire) and use <code>sudo zpool create</code> to create your first zpool (traditionally named <code>tank</code>). The only thing of note here is that you should use a stable path to identify your disks, so instead of doing something like <code>sudo zpool create tank mirror /dev/sda /dev/sdb</code> to create a two-disk mirror zpool comprised of the two disks <code>/dev/sda</code> and <code>/dev/sdb</code>, you should instead use a different path to the same device, such as via <code>/dev/disk/by-id/</code> or <code>/dev/disk/by-uuid/</code> (going with <code>by-id/</code> might make it easier to figure out which disk to use, as the contents of <code>by-uuid/</code> are all GUIDs).<sup id="rf3-5207"><a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#fn3-5207" title="You could use /dev/disk/by-path/ but that means if you physically swap disks around in their cages the references would become switched around, so it&rsquo;s best not to." rel="footnote">3</a></sup> On Linux, <code>lsblk</code> is your friend here to list disks attached to the system.</p>
<pre><code class="language-shell"># to create a mirror of the two volumes:
sudo zpool create -o ashift=12 tank mirror /dev/disk/by-id/scsi-0VOLUME_NAME_01 /dev/disk/by-id/scsi-0VOLUME_NAME_02
</code></pre>
<p>And you can verify that the operation has succeeded by using <code>zpool list</code> to see the list of zpools live on the system.</p>
<p>Most ZFS properties and features are configurable, can be set at any time,<sup id="rf4-5207"><a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#fn4-5207" title="The most notable exception to this is the ashift property set above with -o ashift=12, which is a decent value for any ssd or 4k/512e hdd." rel="footnote">4</a></sup> and are inherited from parent datasets. Let&#8217;s set some default properties that are good starting values (we can always change them or override them for specific child datasets at any time):</p>
<pre><code class="language-bash">sudo zfs set compression=lz4 tank # or =zstd if you're on the very newest versions
sudo zfs set recordsize=1m tank # i/o optimization for storing content that rarely changes
</code></pre>
<h3>Step 3: Creating your ZFS datasets</h3>
<p>The &#8220;zpool&#8221; is, as mentioned, the abstraction over the physical disks in your PC. Its closest analogue is a &#8220;smart disk&#8221; comprised of multiple physical disks arranged in some specific topology with certain striping/redundancy/parity managed by the lower-level <code>zpool</code> command. Just as a zpool is a virtual disk, a dataset is a &#8220;smart partition&#8221; used to break up your &#8220;disk&#8221; into multiple logical storage units. Unlike real partitions, zfs datasets aren&#8217;t fixed in size; rather, they straddle the line between a partition and a folder. They can be nested (like a folder), but you can&#8217;t rename a file across datasets (like a separate filesystem/partition). You can snapshot them individually (or altogether, atomically) for backup and cloning purposes (see below), and it&#8217;s the finest-grained level of control you have for turning on/off or re-configuring ZFS features and properties like the record size (changes affect only newly written data), automatic block-level compression, etc.</p>
<p>ZFS automatically creates a dataset for the root zpool (in this case we now have <code>/tank/</code> mounted and ready), but it&#8217;s <em>generally</em> not good practice to write directly to this dataset. Instead, you should create one or more child datasets where most content will go. We&#8217;ll just create one dataset for now:</p>
<pre><code class="language-shell">sudo zfs create tank/data
</code></pre>
<p>and we can see all our datasets with <code>zfs list</code>, which shows them in their hierarchy/tree as configured.</p>
<h3>Step 4: Configuring automatic snapshots</h3>
<p>One of the coolest and most important ZFS features is undoubtedly its instant, zero-cost snapshotting (enabled by its copy-on-write design). This lets you freeze an image of any dataset (or all of them) as it exists at any point in time, then restore back to it (or selectively copy files/data back, as needed) at any point in the future, regardless of any changes you&#8217;ve made. (You only pay the storage cost of data added or deleted thereafter.) ZFS snapshots can be made manually with <code>sudo zfs snap -r tank@snapshot_name_or_date</code> (which snapshots <code>tank</code> and all its child datasets, instantly) or <code>sudo zfs snap tank/data@snapshot_name</code> (which snapshots only the one <code>tank/data</code> dataset). But since they&#8217;re virtually free, why not go a step further and automatically take snapshots of the data on a schedule? That way you&#8217;re protected in case of inadvertent data loss, not just when you take a snapshot before manually performing a known potentially destructive action.</p>
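<p>To make that concrete, here&#8217;s a quick sketch (the snapshot name and <code>tank/data</code> dataset are just the examples from above; the file path is hypothetical):</p>
<pre><code class="language-shell"># atomically snapshot tank and every child dataset
sudo zfs snap -r tank@2025-09-10

# roll tank/data back to that snapshot, discarding everything written since
# (works directly only for the most recent snapshot; -r destroys newer ones)
sudo zfs rollback tank/data@2025-09-10

# or selectively copy files out of the hidden, read-only snapshot directory
cp /tank/data/.zfs/snapshot/2025-09-10/important.txt ~/important.txt
</code></pre>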
<p>On Linux, the best way to automate these snapshots is with <code>zfs-auto-snapshot</code>, which we&#8217;ll install with <code>sudo apt install zfs-auto-snapshot</code>. It&#8217;ll automatically create new snapshots every month/week/day/hour of designated zfs datasets, and delete the oldest ones too so you&#8217;re not paying the storage price forever.</p>
<p>After installing <code>zfs-auto-snapshot</code>, it&#8217;s time to choose which datasets we want to protect and how often we want to take the snapshots. Instead of using a configuration file, <code>zfs-auto-snapshot</code> uses zfs properties to determine which zfs datasets to include in each snapshot interval, and since zfs properties are inherited by default, if we set up snapshots for the root dataset, it&#8217;ll automatically do the same for all child datasets.</p>
<p>Let&#8217;s enable a daily snapshot of the root volume (and all child datasets):</p>
<pre><code class="language-shell">sudo zfs set com.sun:auto-snapshot:daily=true tank
</code></pre>
<p>You can repeat this command, replacing <code>daily</code> with <code>monthly</code>, <code>weekly</code>, or <code>hourly</code>, to (additionally) opt into that frequency of snapshots. To <em>exclude</em> a dataset (and its children) from a particular schedule, you can e.g. use <code>sudo zfs set com.sun:auto-snapshot:daily=false tank/no_backups</code> to turn off daily snapshots for the <code>tank/no_backups</code> dataset (assuming it exists).</p>
<p>You can check if this is working (after waiting the prescribed amount of time) by checking to see what snapshots you have listed:</p>
<pre><code class="language-shell">zfs list -t snap
</code></pre>
<h3>Step 5: Automatic monthly ZFS scrubs on Linux with systemd</h3>
<p>One thing that makes ZFS stand out compared to other filesystems like <code>ext4</code> or even <code>xfs</code> is that it calculates the hash of each block of data you store on it. In the event of <a href="https://en.wikipedia.org/wiki/Data_degradation" rel="follow">bitrot</a> (the silent corruption of data already stored to disk), zfs can a) flag that a file has been silently corrupted, and b) automatically restore a good copy from another disk or parity (assuming your zpool topology provides redundancy).<sup id="rf5-5207"><a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/#fn5-5207" title="Or assuming you are using zfs set copies=2 tank (or greater)." rel="footnote">5</a></sup> It does this automatically every time you read a file, but what if you have terabytes of data just sitting there, silently rotting away? How do you catch that corruption in time to fix it from a second copy on the zpool? The <code>zpool scrub tank</code> operation runs a low-priority scan in the background to detect and (hopefully) repair just that, but it needs to be scheduled (or run manually).</p>
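<p>A scrub can also be kicked off and monitored by hand at any time:</p>
<pre><code class="language-shell">sudo zpool scrub tank   # returns immediately; the scrub runs in the background
zpool status tank       # shows scrub progress, or the results of the last scrub
</code></pre>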
<p>On FreeBSD, this would be accomplished with the help of a simple monthly periodic script, but on Linux (Ubuntu in particular) it&#8217;s not as simple. The idiomatic way of setting up monthly work on Ubuntu is via systemd units (aka services) and timers. Unfortunately, this requires setting up two separate files, but the good news is that you can just copy and paste what I&#8217;ve provided below, only modifying the zpool name from <code>tank</code> to whatever you are using, as needed.</p>
<p>The first file we need to create is the actual systemd service, which is what is tasked with running the <code>zfs scrub</code> operation. Copy the following to <code>/etc/systemd/system/zfs-scrub.service</code>:</p>
<pre><code class="language-ini">[Unit]
Description=ZFS scrub of all pools

[Service]
Type=oneshot
ExecStart=/usr/sbin/zpool scrub tank
</code></pre>
<p>And copy this timer file (which specifies when <code>zfs-scrub.service</code> is automatically run) to <code>/etc/systemd/system/zfs-scrub.timer</code>:</p>
<pre><code class="language-ini">[Unit]
Description=Run ZFS scrub service monthly

[Timer]
OnCalendar=monthly
# Run if missed while machine was off
Persistent=true
# Add some randomization to start time to prevent thundering herd
RandomizedDelaySec=30m
AccuracySec=1h

[Install]
WantedBy=timers.target
</code></pre>
<p>Then execute the following to get systemd to see and activate the monthly scrub service:</p>
<pre><code class="language-shell">sudo systemctl daemon-reload
sudo systemctl enable --now zfs-scrub.timer
</code></pre>
<p>and verify that the timer has been started with the following:</p>
<pre><code class="language-shell">systemctl list-timers zfs-scrub.timer
</code></pre>
<p>which should show you output along the lines of the following:</p>
<pre><code class="language-shell">$ systemctl list-timers zfs-scrub.timer
NEXT                        LEFT           LAST PASSED UNIT            ACTIVATES
Wed 2025-10-01 00:23:31 UTC 2 weeks 6 days -    -      zfs-scrub.timer zfs-scrub.service

1 timers listed.
</code></pre>
<p>You can see if this is working by checking when the last scrub took place with <code>zpool status</code>:</p>
<pre><code class="language-shell">$ zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Sep 10 18:43:22 2025
</code></pre>
<p>And with that, you&#8217;re all set!</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5207"><p>Largely due to licensing restriction workarounds&nbsp;<a href="#rf1-5207" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5207"><p>This too is *normally* taken care of by the package manager, provided all the packages have been correctly built and uploaded to the repository by the time you try to install a newer kernel version. There&#8217;s very little you have to do manually.&nbsp;<a href="#rf2-5207" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5207"><p>You <em>could</em> use <code>/dev/disk/by-path/</code> but that means if you physically swap disks around in their cages the references would become switched around, so it&#8217;s best not to.&nbsp;<a href="#rf3-5207" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li><li id="fn4-5207"><p>The most notable exception to this is the <code>ashift</code> property set above with <code>-o ashift=12</code>, which is a decent value for any ssd or 4k/512e hdd.&nbsp;<a href="#rf4-5207" class="backlink" title="Jump back to footnote 4 in the text.">&#8617;</a></p></li><li id="fn5-5207"><p>Or assuming you are using <code>zfs set copies=2 tank</code> (or greater).&nbsp;<a href="#rf5-5207" class="backlink" title="Jump back to footnote 5 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/">The idiomatic ZFS on Linux quickstart cheat-sheet</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/zfs-on-linux-quickstart-cheat-sheet/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5207</post-id>	</item>
		<item>
		<title>Benchmarking rust compilation speedups and slowdowns from sccache and -Zthreads</title>
		<link>https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/</link>
					<comments>https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Mon, 01 Jul 2024 18:50:08 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[rust]]></category>
		<category><![CDATA[rustc]]></category>
		<category><![CDATA[sccache]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5166</guid>

					<description><![CDATA[<p>Just a PSA from one rust developer to another: if you use sccache, take a moment to benchmark a clean build1 of your favorite or current project and verify whether or not having RUSTC_WRAPPER=sccache is doing you any favors. I&#8217;ve &#8230; <a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/">Benchmarking rust compilation speedups and slowdowns from <code>sccache</code> and <code>-Zthreads</code></a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Just a PSA from one rust developer to another: if you use <code>sccache</code>, take a moment to benchmark a clean build<sup id="rf1-5166"><a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#fn1-5166" title="sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds." rel="footnote">1</a></sup> of your favorite or current project and verify whether or not having <code>RUSTC_WRAPPER=sccache</code> is doing you any favors.</p>
<p>I&#8217;ve been an sccache user almost from the very start, when the Mozilla team first introduced it in the rust discussions on GitHub maybe seven years back or so, probably because of my die-hard belief in the one-and-only <code>ccache</code>, earned over long years of it saving me considerable time and effort on C++ development projects (life pro-tip: <code>git bisect</code> on large C or C++ projects is a night-and-day difference with versus without <code>ccache</code>). At the same time, I was always painfully aware of just how little <code>sccache</code> actually cached compared to its C++ progenitor, and I was left feeling perpetually discontented ever since learning to query its hit-rate stats with <code>sccache -s</code> (something I never needed to do for <code>ccache</code>).</p>
<p>But my blind belief in the value of build accelerators led me to complacency, and I confess that with <code>sccache</code> <em>mostly</em> chugging away without issue in the background, I kind of forgot that I had <code>RUSTC_WRAPPER</code> set at all. But I recently remembered it and in a bout of procrastination, decided to benchmark how much time <code>sccache</code> was <em>actually</em> saving me&#8230; and the results were decidedly not great.</p>
<p><span id="more-5166"></span></p>
<p>The bulk of my rust use at $work is done on a workstation under WSLv1, while a lot of my open source rust work is done on a &#8220;mobile workstation&#8221; running a Debian derivative. I began my investigation on the WSL desktop, originally spurred by a <del>need</del> desire to see what kind of speedups different values for <a href="https://blog.rust-lang.org/2023/11/09/parallel-rustc.html" rel="follow">the ~recently added <code>-Z threads</code> flag</a> to parallelize the <code>rustc</code> frontend would net me, then remembered that I had <code>sccache</code> enabled and should disable it for deterministic results&#8230; leading me to include the impact of <code>sccache</code> in my benchmarks.</p>
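<p>For the curious, each measurement boils down to something like the following sequence (flag values are illustrative, and this is a simplified sketch of the procedure rather than the exact harness I used):</p>
<pre><code class="language-shell"># baseline: clean debug build without sccache
unset RUSTC_WRAPPER
cargo clean && time cargo build

# with sccache: one cold-cache clean build, then one warm-cache clean build
export RUSTC_WRAPPER=sccache
cargo clean && time cargo build   # populates the cache ("Time" column)
cargo clean && time cargo build   # served from the cache ("2nd run" column)
sccache -s                        # inspect hit/miss statistics

# then repeat all of the above for each frontend parallelism setting, e.g.:
export RUSTFLAGS="-Zthreads=16"   # requires a nightly toolchain
</code></pre>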
<p>On a (well-cooled, not thermally throttled) 16-core (32-thread) AMD Ryzen ThreadRipper 1950X machine with 64 GB of DDR4 RAM and NVMe boot and (code) storage disks, using rustc 1.81.0-nightly to compile a relatively modest-sized rust project from scratch in debug mode (followed by exactly one clean rebuild in the case of sccache to see how much of a speedup it gives), I obtained the following results:</p>
<table border="1">
<thead class="bold">
<tr>
<th>-Zthreads</th>
<th>sccache</th>
<th>Time</th>
<th>Time (2nd run)</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>not set</em></td>
<td>no</td>
<td>33.08s</td>
<td></td>
</tr>
<tr>
<td><em>not set</em></td>
<td>yes</td>
<td>1m 18s</td>
<td>56.20s</td>
</tr>
<tr>
<td>8</td>
<td>no</td>
<td>33.27s</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>yes</td>
<td>1m 32s</td>
<td>1m 00s</td>
</tr>
<tr>
<td>16</td>
<td>no</td>
<td>34.43s</td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>yes</td>
<td>40.78s</td>
<td>56.06s</td>
</tr>
<tr>
<td>32</td>
<td>no</td>
<td>37.25s</td>
<td></td>
</tr>
<tr>
<td>32</td>
<td>yes</td>
<td>1m 14s</td>
<td>52.99s</td>
</tr>
</tbody>
</table>
<p>Shockingly enough, there was not a single configuration where the usage of <code>sccache</code> ended up speeding up compilation over the baseline without it — not even on a subsequent clean build of the same code with the same <code>RUSTFLAGS</code> value.</p>
<p>After inspecting the results of <code>sccache -s</code> and with some experimentation, it turned out that this pathological performance appeared to be caused – in part – by a full sccache cache folder.<sup id="rf2-5166"><a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#fn2-5166" title="A similar (but not identical) issue was reported in the project&rsquo;s GitHub repo in December of 2019." rel="footnote">2</a></sup> Running <code>rm -rf ~/.cache/sccache/*</code> and then re-running the benchmark (a few project commits later, so the numbers aren&#8217;t directly comparable to the above) revealed a significant improvement to the baseline build times with sccache… but probably not enough to justify its use:</p>
<div class="table-scroll">
<table border="1">
<thead class="bold">
<tr>
<th>-Zthreads</th>
<th>sccache</th>
<th>Time</th>
<th>Time<br />
(2nd run)</th>
<th>-Zthreads only<br />
(vs. baseline)</th>
<th>sccache initial<br />
(vs. no sccache)</th>
<th>sccache 2nd run<br />
(vs. no sccache)</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>not set</em></td>
<td>no</td>
<td>31.41s</td>
<td></td>
<td>1.000</td>
<td></td>
<td></td>
</tr>
<tr>
<td><em>not set</em></td>
<td>yes</td>
<td>42.57s</td>
<td>29.47s</td>
<td></td>
<td>1.355</td>
<td>0.938</td>
</tr>
<tr>
<td>0</td>
<td>no</td>
<td>35.94s</td>
<td></td>
<td>1.144</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0</td>
<td>yes</td>
<td>44.26s</td>
<td>28.95s</td>
<td></td>
<td>1.231</td>
<td>0.806</td>
</tr>
<tr>
<td>2</td>
<td>no</td>
<td>30.53s</td>
<td></td>
<td>0.972</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>yes</td>
<td>58.94s</td>
<td>38.36s</td>
<td></td>
<td>1.931</td>
<td>1.256</td>
</tr>
<tr>
<td>4</td>
<td>no</td>
<td>29.69s</td>
<td></td>
<td>0.945</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>yes</td>
<td>1m 10s</td>
<td>43.57s</td>
<td></td>
<td>2.358</td>
<td>1.467</td>
</tr>
<tr>
<td>8</td>
<td>no</td>
<td>30.43s</td>
<td></td>
<td>0.969</td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>yes</td>
<td>1m 17s</td>
<td>47.90s</td>
<td></td>
<td>2.530</td>
<td>1.574</td>
</tr>
<tr>
<td>16</td>
<td>no</td>
<td>32.39s</td>
<td></td>
<td>1.031</td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>yes</td>
<td>1m 22s</td>
<td>52.67s</td>
<td></td>
<td>2.532</td>
<td>1.626</td>
</tr>
<tr>
<td>32</td>
<td>no</td>
<td>35.17s</td>
<td></td>
<td>1.120</td>
<td></td>
<td></td>
</tr>
<tr>
<td>32</td>
<td>yes</td>
<td>1m 26s</td>
<td>53.27s</td>
<td></td>
<td>2.445</td>
<td>1.515</td>
</tr>
</tbody>
</table>
</div>
<p>Looking at the table above, using sccache slowed down the clean build by anywhere from 23% (with <code>-Zthreads=0</code>) to 153% (with <code>-Zthreads</code> set to 8 or 16), while providing a speedup to subsequent builds with identical flags and unchanged code/toolchain only in the case of no <code>-Zthreads</code> (a 6% speedup) and with <code>-Zthreads</code> set to <code>0</code>, its default value (a 19% speedup); <strong>in every other configuration it managed to slow down even subsequent, clean builds with a fully primed cache as compared to the equivalent no-sccache build</strong> by anywhere from 25% (<code>-Zthreads=2</code>) to 63% (<code>-Zthreads=16</code>).</p>
<p>Analyzing the benefits of <code>-Zthreads</code> is much harder. Looking at the cases with <code>-Zthreads</code> but no <code>sccache</code>, it appears that with a heavily saturated pipeline (building all project dependencies from scratch in non-incremental mode affords a lot of opportunities for parallelized codegen), the use of <code>-Zthreads</code> can provide <em>at best</em> a very modest 5% speedup in build times (in the case of <code>-Zthreads=4</code>) while actually <em>slowing down</em> compile times by 14% (the soon-to-be-default <code>-Zthreads=0</code>) and 12% (with <code>-Zthreads=32</code>).<sup id="rf3-5166"><a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#fn3-5166" title="I assumed -Zthreads=0 would mean &ldquo;default to the available concurrency&rdquo; (i.e. 32 threads, in my case) but that doesn&rsquo;t appear to be the case just by looking at the numbers." rel="footnote">3</a></sup></p>
<p>It&#8217;s interesting to note that the rust developers were rather more ebullient when <a href="https://blog.rust-lang.org/2023/11/09/parallel-rustc.html" rel="follow">introducing this feature</a> to the world, claiming wins of up to 50% with <code>-Zthreads=8</code> and suggesting that lower levels of parallelism would see lower speedups (the opposite of what I saw, where using 8 threads provided about half the benefit of using 4, and going any higher caused slow-downs rather than speed-ups). Note that I was compiling in the default dev/debug profile above, so maybe I should try and see what happens in release mode, though I would think the architectural limitations would persist.</p>
<p>Back to sccache, though.</p>
<p>One of the open source projects I contribute to most is <a href="https://github.com/fish-shell/fish-shell/" rel="nofollow">fish-shell</a>, which recently underwent a complete port from C++ to Rust (piece-by-piece while still passing all tests every step of the way). Some day I want to write at length about that experience, but the reason I&#8217;m bringing this up is because my fish experience has taught me that some things that are normally very fast under Linux can run much slower than expected under WSL, primarily due to I/O constraints caused by the filesystem virtualization layer. I haven&#8217;t dug into it yet, but going with the working theory that reading/writing some 550-800 MiB<sup id="rf4-5166"><a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#fn4-5166" title="The size of the target/ folder varies depending on the RUSTFLAGS used." rel="footnote">4</a></sup> was slowing down my builds (even with the code and the cache located on two separate NVMe drives), I moved on to my other machine (where I&#8217;m running Linux natively).</p>
<p>Running the same benchmark with the same versions of <code>rustc</code> and <code>sccache</code> on the 4-core/8-thread Intel Xeon E3-1545M v5 (Skylake) with 32GiB of DDR3 RAM gave the following results, which were <em>much</em> more in line with my expectations for sccache (though even more disappointing when it came to the frontend parallelization flag):</p>
<table border="1">
<thead class="bold">
<tr>
<th>-Zthreads</th>
<th>sccache</th>
<th>Time</th>
<th>Time (2nd Run)</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>not set</em></td>
<td>no</td>
<td>1m 10s</td>
<td></td>
</tr>
<tr>
<td><em>not set</em></td>
<td>yes</td>
<td>1m 13s</td>
<td>14.12s</td>
</tr>
<tr>
<td>0</td>
<td>no</td>
<td>1m 12s</td>
<td></td>
</tr>
<tr>
<td>0</td>
<td>yes</td>
<td>1m 18s</td>
<td>14.26s</td>
</tr>
<tr>
<td>2</td>
<td>no</td>
<td>1m 10s</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>yes</td>
<td>1m 17s</td>
<td>14.20s</td>
</tr>
<tr>
<td>4</td>
<td>no</td>
<td>1m 11s</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>yes</td>
<td>1m 15s</td>
<td>14.44s</td>
</tr>
<tr>
<td>6</td>
<td>no</td>
<td>1m 12s</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>yes</td>
<td>1m 16s</td>
<td>14.90s</td>
</tr>
<tr>
<td>8</td>
<td>no</td>
<td>1m 14s</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>yes</td>
<td>1m 20s</td>
<td>15.04s</td>
</tr>
</tbody>
</table>
<p>Here at last are the <code>sccache</code> results I was expecting! A maximum slowdown of about 8% for uncached builds and speedups of about 80% across the board for a (clean) rebuild of the same project immediately after caching the artifacts.<sup id="rf5-5166"><a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/#fn5-5166" title="This was, of course, the same version of sccache that was tested under WSL above." rel="footnote">5</a></sup></p>
<p>As for <code>-Zthreads</code>, the results here are consistent with what we saw above, at least once you take into account the fact that there are significantly fewer cores/threads to distribute the parallelized frontend work across. But we&#8217;re left with the same conclusion: when the CPU is already handling a high degree of concurrency, with the jobserver already saturated by work from the compilation pipeline across multiple compilation units from independent crates, adding further threads to the mix ends up hurting overall performance (to the tune of 14% in the worst case, when <code>-Zthreads</code> is set to the total number of cores). The good news is that adding just a slight degree of parallelization with <code>-Zthreads=2</code> doesn&#8217;t hurt build times even in this worst-case scenario and likely helps when the available threads aren&#8217;t already saturated with more work than they can handle, so that at least seems to be a safe value for the option for now.</p>
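<p><em>(Aside: if you do want to opt in to <code>-Zthreads=2</code> project-wide rather than exporting <code>RUSTFLAGS</code> by hand each time, one way is a <code>.cargo/config.toml</code> entry; this is a sketch and assumes a nightly toolchain, since the flag is unstable.)</em></p>

```toml
# .cargo/config.toml (nightly toolchain required; -Zthreads is unstable)
[build]
rustflags = ["-Zthreads=2"]
```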
<p>I would have <em>expected</em> that <code>-Zthreads</code> wouldn&#8217;t &#8220;unilaterally&#8221; dictate the number of chunks the frontend work is broken up into. While I&#8217;m sure it integrates nicely with the jobserver to prevent an insane number of threads from being spawned and overwhelming the machine, it seems that dividing the frontend work into <em>n</em> chunks when there aren&#8217;t <em>n</em> threads immediately available ends up hurting overall build performance. In that sense, it would be better if <code>-Zthreads</code> were a hint of sorts, treated as a &#8220;max chunks&#8221; limit, with introspection of the available threads happening <em>before</em> the decision is made whether (and to what extent) to chunk the available work; that way the behavior of <code>-Zthreads</code>, even with a hard-coded number, would hopefully only ever improve build times and never hurt them.</p>
<div class="sendy_widget" style='margin-bottom: 0.5em;'>
<p><em>If you would like to receive a notification the next time we release a rust library, publish a crate, or post some rust-related developer articles, you can subscribe below. Note that you'll only get notifications relevant to rust programming and development by NeoSmart Technologies. If you want to receive email updates for all NeoSmart Technologies posts and releases, please sign up in the sidebar to the right instead.</em></p>
<iframe tabIndex=-1 onfocus="sendy_no_focus" src="https://neosmart.net/sendy/subscription?f=BUopX8f2VyLSOb892VIx6W4IUNylMrro5AN6cExmwnoKFQPz9892VSk4Que8yv892RnQgL&title=Join+the+rust+mailing+list" style="height: 300px; width: 100%;"></iframe>
</div>
<script type="text/javascript">function sendy_no_focus(e) { e.preventDefault(); }</script>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5166"><p>sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds.&nbsp;<a href="#rf1-5166" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5166"><p>A <a href="https://github.com/mozilla/sccache/issues/629" rel="nofollow">similar (but not identical) issue</a> was reported in the project&#8217;s GitHub repo in December of 2019.&nbsp;<a href="#rf2-5166" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5166"><p>I <em>assumed</em> <code>-Zthreads=0</code> would mean &#8220;default to the available concurrency&#8221; (i.e. 32 threads, in my case) but that doesn&#8217;t appear to be the case just by looking at the numbers.&nbsp;<a href="#rf3-5166" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li><li id="fn4-5166"><p>The size of the <code>target/</code> folder varies depending on the <code>RUSTFLAGS</code> used.&nbsp;<a href="#rf4-5166" class="backlink" title="Jump back to footnote 4 in the text.">&#8617;</a></p></li><li id="fn5-5166"><p>This was, of course, the same version of <code>sccache</code> that was tested under WSL above.&nbsp;<a href="#rf5-5166" class="backlink" title="Jump back to footnote 5 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/">Benchmarking rust compilation speedups and slowdowns from <code>sccache</code> and <code>-Zthreads</code></a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5166</post-id>	</item>
		<item>
		<title>Using build.rs to integrate rust applications with system libraries like a pro</title>
		<link>https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/</link>
					<comments>https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Mon, 13 May 2024 16:24:30 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[fish]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[rsconf]]></category>
		<category><![CDATA[rust]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5136</guid>

					<description><![CDATA[<p>I&#8217;m happy to announce the release of version 0.2 of the rsconf crate, with new support for informing Cargo about the presence of custom cfg keys and values (to work around a major change that has resulted in hundreds of &#8230; <a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/">Using <code>build.rs</code> to integrate rust applications with system libraries like a pro</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><img loading="lazy" decoding="async" class="alignright wp-image-4454 colorbox-5136" src="https://neosmart.net/blog/wp-content/uploads/rust-logo-600x600.png" alt="" width="109" height="109" srcset="https://neosmart.net/blog/wp-content/uploads/rust-logo-600x600.png 600w, https://neosmart.net/blog/wp-content/uploads/rust-logo-150x150.png 150w, https://neosmart.net/blog/wp-content/uploads/rust-logo-300x300.png 300w, https://neosmart.net/blog/wp-content/uploads/rust-logo.png 1024w" sizes="auto, (max-width: 109px) 100vw, 109px" />I&#8217;m happy to announce the release of version 0.2 of <a href="https://github.com/mqudsi/rsconf" rel="nofollow">the rsconf crate</a>, with new support for informing Cargo about the presence of custom <code>cfg</code> keys and values (to work around a major change that has resulted in hundreds of warnings for many popular crates under 1.80 nightly).</p>
<p>rsconf itself is a newer crate that was born out of the need (in the <a href="http://github.com/fish-shell/fish-shell/" rel="nofollow">fish-shell</a> transition from C++ to rust) for a replacement for some work that&#8217;s traditionally been relegated to the build system (e.g. CMake or autoconf) in order to &#8220;feature detect&#8221; various native system capabilities in the kernel, selected runtime (e.g. <code>libc</code>), or installed libraries. It (optionally) integrates with the popular <a href="https://crates.io/crates/cc" rel="nofollow">cc crate</a> so you can test and configure the build toolchain for various underlying features or behavior, and then unlock conditional compilation of native rust code that interops with the system or external libraries accordingly.</p>
<p><span id="more-5136"></span></p>
<p>While Cargo is an impressive build tool and normally more than sufficient for the needs of the majority of rust crates shipping standalone, contained libraries or packages, for those of us transitioning &#8220;more brittle&#8221; system software or libraries that rely on functionality of the native kernel, libc, or external libraries – and often have to support older versions thereof, lacking in features or enhancements – it doesn&#8217;t offer feature parity with some of the build tools we&#8217;ve been traditionally using in the C and C++ world.</p>
<p>Fortunately, the language and tooling devs behind rust recognized early on the need for a more flexible and involved approach to building certain crates and came up with the <code>build.rs</code> approach that lets developers run what is effectively a rust script prior to the traditional build steps invoked by Cargo, influencing how Cargo works, what flags are passed to <code>rustc</code> (the actual rust compiler), and what libraries Cargo asks your linker to include when generating the final binary. Importantly, at this stage devs can inspect the host/target systems and tell the compiler which Cargo <em>features</em> and rust <em>cfgs</em> should be enabled, letting you conditionally compile (or not) various bits and pieces of regular rust code to take advantage of functionality discovered to be present at build time or work around capabilities found to be missing or lacking.</p>
<p>In addition to supporting external libraries (like <code>gettext</code> and, once upon a time, <code>curses</code>), fish also runs on quite a number of different unixy systems, starting with Linux, macOS, and the BSDs, and with a long tail of support for various other fun posix-compatible operating systems and projects.<sup id="rf1-5136"><a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/#fn1-5136" title="Unfortunately, as a part of the transition from C++ to rust, we have currently lost support for the vast majority of the legacy OSes and other non-tier-1 unix platforms we used to support due to issues with availability of the rust toolchain and incompatible dependencies like the libc crate, but we are keen to get them back." rel="footnote">1</a></sup> As you can imagine, as a shell, fish does a lot of stuff outside the purview of the rust standard library and a lot of the codebase deals with low-level integration with the operating system – and the details of this integration change quite a bit just from one kernel version to the next, let alone across different OS kernels and distributions altogether.</p>
<p>A lot of the crates in the rust ecosystem seem to rely on OS detection to determine which features should or shouldn&#8217;t be available, but the C/C++ world moved from that to feature detection a long time ago. Fish uses <code>build.rs</code> to determine what low-level operating system features are available (regardless of what OS we are targeting) and enables or disables the compilation of rust code sitting behind <code>#[cfg(key = "value")]</code> depending on the results.</p>
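<p>As a concrete illustration of what that gating looks like on the consuming side (the cfg name here is hypothetical), suppose <code>build.rs</code> emits <code>cargo:rustc-cfg=have_eventfd</code> when the system supports it; ordinary rust code can then branch on the result at compile time:</p>

```rust
// Hypothetical cfg-gated code: the `have_eventfd` cfg would be enabled (or
// not) by build.rs via `println!("cargo:rustc-cfg=have_eventfd")`.
#[cfg(have_eventfd)]
fn notifier_backend() -> &'static str {
    "eventfd"
}

// Compiled only when the cfg was *not* emitted by build.rs:
#[cfg(not(have_eventfd))]
fn notifier_backend() -> &'static str {
    "self-pipe"
}

fn main() {
    println!("using the {} notifier backend", notifier_backend());
}
```

Code behind the unselected branch is never compiled at all, so it can freely reference symbols that only exist on systems where the probe succeeded.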
<p>Cargo exposes a very bare-bones mechanism for <code>build.rs</code> to influence which <code>cfg</code> or <code>feature</code> will be enabled/disabled at build-time by means of intercepting specially prefixed <code>stdout</code> messages. Generally, this would look like</p>
<ol>
<li>In <code>build.rs</code>, check for the presence/absence of some feature <em>somehow</em>,</li>
<li>In response, execute <code>println!("cargo:rustc-cfg=foo");</code></li>
</ol>
<p>That second step prints to <code>stdout</code> a line of text that Cargo intercepts while running the compiled <code>build.rs</code> binary; based off of that, Cargo tells <code>rustc</code> to enable the cfg <code>foo</code>. Enabling the cfg &#8220;unlocks&#8221; code behind a <code>#[cfg(foo)]</code>, allowing <code>rustc</code> to see and compile it (normally it would be as if it weren&#8217;t there at all).</p>
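<p>Put together, a minimal <code>build.rs</code> following those two steps might look like the sketch below; the probe itself is stubbed out (that&#8217;s where a real compile/link test would go), and <code>foo</code> is an illustrative cfg name:</p>

```rust
// build.rs sketch: step 1 probes for a feature, step 2 tells Cargo about it.

/// Render the directive line that Cargo scans build-script stdout for.
fn cfg_directive(name: &str) -> String {
    format!("cargo:rustc-cfg={name}")
}

fn main() {
    // 1. Check for the presence/absence of some feature *somehow*
    //    (stubbed out here; this is where a real compile/link test goes).
    let feature_present = true;

    // 2. If found, emit the directive on stdout for Cargo to intercept.
    if feature_present {
        println!("{}", cfg_directive("foo"));
    }
}
```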
<p>This is all well and good, but it has a few obvious drawbacks. The first is that <em>somehow</em> that glares at us from point number one above. How exactly <em>does</em> one check if a system feature is present or not? Why doesn&#8217;t Cargo help us in this regard? In the world of legacy build systems, this is part and parcel of what a build system does and, in fact, a <em>raison d&#8217;être</em> for their existence in a world that predates package managers, semantic versioning, and all the other nice stuff we can now take for granted in a rust-native ecosystem.</p>
<p>It&#8217;s from this need that rsconf was born. The crate offers some of the functionality typically made available by &#8220;legacy&#8221; build systems, wrapped in an easy-to-use and rust-friendly api. Some examples of the available functionality:</p>
<ul>
<li><code>system.has_library(libname)</code></li>
<li><code>system.has_header(name)</code></li>
<li><code>system.has_symbol(symbol)</code></li>
<li><code>system.has_symbol_in(symbol, &amp;[libname])</code></li>
<li><code>system.has_type(type_name)</code></li>
<li><code>system.get_macro_value(name)</code></li>
<li><code>system.get_{u32,u64,i32,i64}_value(ident)</code></li>
<li><code>system.ifdef(define)</code></li>
<li><code>system.if(expr)</code></li>
<li><code>add_library_search_path(path)</code></li>
<li><code>link_library(libname, LinkType::Static/Dynamic)</code></li>
</ul>
<p>The names of these methods should make what they do quite clear, and there are various convenience functions to simplify some common patterns around these same principles. The methods themselves are largely implemented as build-time attempts to compile or link minimal C source code to determine the truthiness of the expressions, while the latter two direct Cargo as to how it should attempt to find and use external libraries pre-installed on the build host/target.<sup id="rf2-5136"><a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/#fn2-5136" title="One thing I really appreciate about Cargo is that it goes to great lengths to support cross-compilation out-of-the-box, clarifying operations and configurations taken from/applying to the machine you are building on (the host) vs the machine you are building for (the target). rsconf is written in a way that similarly respects this divide." rel="footnote">2</a></sup></p>
<p>In addition, there are methods that make it easier to perform &#8220;regular&#8221; Cargo stuff in a <code>build.rs</code> script, offering a &#8220;strongly typed&#8221; api instead of the usual <code>println!()</code> stuff that is prone to typos, mangled types, and more:</p>
<ul>
<li><code>enable_cfg(cfg_name)</code></li>
<li><code>enable_feature(feature_name)</code></li>
<li><code>set_cfg_value(cfg_name, cfg_value)</code></li>
<li><code>rebuild_if_env_changed(env_var_name)</code></li>
<li><code>rebuild_if_path_changed(path)</code></li>
</ul>
<p>New to the 0.2.0 rsconf release (as hinted at above) are variations on <code>enable_cfg()</code> and <code>set_cfg_value()</code>, necessitated by changes to the rust compiler <a href="https://blog.rust-lang.org/2024/05/06/check-cfg.html" rel="follow">that will land in 1.80</a>.<sup id="rf3-5136"><a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/#fn3-5136" title="You can also refer to this rustc documentation page for more on how this new feature works." rel="footnote">3</a></sup> The compiler will begin checking expressions such as <code>#[cfg(foo)]</code>, <code>#[cfg(foo = "bar")]</code>, and <code>#[cfg(feature = "baz")]</code>, to make sure that all of <code>foo</code>, <code>bar</code>, and <code>baz</code> are valid constraints (not typos or hallucinations). As you can imagine, if the compiler comes across the attribute <code>#[cfg(hello)]</code> while the <code>hello</code> cfg is enabled, it knows that it&#8217;s a valid cfg name. But what about when it&#8217;s <em>not</em> enabled? So now we have to let rustc know up front not only which cfg or feature names/values are valid and active, but also which are valid even when they&#8217;re <em>in</em>active. To that end, rsconf 0.2.0 introduces the following:</p>
<ul>
<li><code>declare_cfg(cfg_name: &amp;str, enabled: bool)</code></li>
<li><code>declare_cfg_values(cfg_name: &amp;str, valid_values: &amp;[&amp;str])</code></li>
<li><code>declare_feature(feature: &amp;str, enabled: bool)</code></li>
</ul>
<p>The first can be used directly in lieu of the old <code>enable_cfg()</code> to both declare a cfg and specify that it is to be enabled (or disabled), but, for now, the second needs to be used in conjunction with the existing <code>set_cfg_value(cfg_name, value)</code> to let the compiler know in advance all the valid values, regardless of whether they&#8217;re defined for the current compilation or not.<sup id="rf4-5136"><a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/#fn4-5136" title="A separate &ldquo;builder-style&rdquo; api will be added at some point, but there are decisions to make about its shape. For example, it could look like add_cfg(&quot;cfg_name&quot;).with_values(&amp;[&quot;value1&quot;, &quot;value2&quot;]).set_value(&quot;value1&quot;) or along the lines of add_cfg(&quot;cfg_name&quot;).set_values(&quot;active_value&quot;, &amp;[&quot;other value 1&quot;, &quot;other value 2&quot;])" rel="footnote">4</a></sup></p>
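<p>Under the hood, declaring a cfg and its valid values boils down to emitting the new <code>cargo:rustc-check-cfg</code> directives on <code>stdout</code>. A sketch of what a build script ultimately emits (with illustrative cfg names and values; the directive syntax follows the check-cfg documentation linked above):</p>

```rust
// Sketch of the `cargo:rustc-check-cfg` directives that declaring a cfg and
// its valid values ultimately boils down to (names/values are illustrative).

fn declare_cfg(name: &str) -> String {
    format!("cargo:rustc-check-cfg=cfg({name})")
}

fn declare_cfg_values(name: &str, values: &[&str]) -> String {
    let quoted: Vec<String> = values.iter().map(|v| format!("\"{v}\"")).collect();
    format!("cargo:rustc-check-cfg=cfg({name}, values({}))", quoted.join(", "))
}

fn main() {
    // Declare a bare cfg as known, whether or not it ends up enabled:
    println!("{}", declare_cfg("have_eventfd"));
    // Declare a cfg key along with all of its valid values:
    println!("{}", declare_cfg_values("tls_backend", &["openssl", "rustls"]));
}
```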
<p>If you&#8217;re curious, you can take a quick look at fish-shell&#8217;s current <code>build.rs</code> to get an idea of what <a href="https://github.com/fish-shell/fish-shell/blob/a19ff4989a1322cde2feed7567d1bc06d26f75ee/build.rs" rel="nofollow">real-world <code>rsconf</code> usage</a> in the wild looks like. It&#8217;s a short build script, and quite easy to understand.</p>
<p>The rsconf crate itself is still in its early days and will, DV, continue to evolve and see new features. If there are build-related tests or tasks that you feel would fall under its purview, do open an issue <a href="https://github.com/mqudsi/rsconf" rel="nofollow">in the repository</a> and let us know. Feedback about the proposed builder api for declaring cfg values is also welcome! Otherwise, please give it a try and see if it can help you make your <code>build.rs</code> saner and easier to understand. It&#8217;s intentionally written to be fast and lightweight, with only a dependency on the <code>cc-rs</code> crate (which you&#8217;ll almost certainly already be taking a dependency on if you&#8217;re compiling against system libraries or headers), so you only stand to benefit from making the switch!</p>
<p><strong><em>Looking for something else to read or learn? Take a look at <a href="https://neosmart.net/blog/?s=rust" rel="follow">my other rust posts</a>, especially this one about <a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/" rel="follow">designing truly safe semaphores</a> in rust or learn about <a href="https://neosmart.net/blog/using-simd-acceleration-in-rust-to-create-the-worlds-fastest-tac/" rel="follow">using simd to speed up rust applications</a> significantly. Sign up below to get emails about open source rust stuff or add my blog to your RSS reader!</em></strong></p>
<div class="sendy_widget" style='margin-bottom: 0.5em;'>
<p><em>If you would like to receive a notification the next time we release a rust library, publish a crate, or post some rust-related developer articles, you can subscribe below. Note that you'll only get notifications relevant to rust programming and development by NeoSmart Technologies. If you want to receive email updates for all NeoSmart Technologies posts and releases, please sign up in the sidebar to the right instead.</em></p>
<iframe tabIndex=-1 onfocus="sendy_no_focus" src="https://neosmart.net/sendy/subscription?f=BUopX8f2VyLSOb892VIx6W4IUNylMrro5AN6cExmwnoKFQPz9892VSk4Que8yv892RnQgL&title=Join+the+rust+mailing+list" style="height: 300px; width: 100%;"></iframe>
</div>
<script type="text/javascript">function sendy_no_focus(e) { e.preventDefault(); }</script>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5136"><p>Unfortunately, as a part of the transition from C++ to rust, we have currently lost support for the vast majority of the legacy OSes and other non-tier-1 unix platforms we used to support due to issues with availability of the rust toolchain and incompatible dependencies like the <code>libc</code> crate, but we are keen to get them back.&nbsp;<a href="#rf1-5136" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-5136"><p>One thing I really appreciate about Cargo is that it goes to great lengths to support cross-compilation out-of-the-box, clarifying operations and configurations taken from/applying to the machine you are building <em>on</em> (the host) vs the machine you are building <em>for</em> (the target). rsconf is written in a way that similarly respects this divide.&nbsp;<a href="#rf2-5136" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-5136"><p>You can also refer to <a href="https://doc.rust-lang.org/nightly/rustc/check-cfg.html" rel="follow">this <code>rustc</code> documentation page</a> for more on how this new feature works.&nbsp;<a href="#rf3-5136" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li><li id="fn4-5136"><p>A separate &#8220;builder-style&#8221; api will be added at some point, but there are decisions to make about its shape. 
For example, it could look like <code>add_cfg("cfg_name").with_values(&amp;["value1", "value2"]).set_value("value1")</code> or along the lines of <code>add_cfg("cfg_name").set_values("active_value", &amp;["other value 1", "other value 2"])</code>&nbsp;<a href="#rf4-5136" class="backlink" title="Jump back to footnote 4 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/">Using <code>build.rs</code> to integrate rust applications with system libraries like a pro</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/using-build-rs-to-integrate-rust-applications-with-system-libraries-like-a-pro/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5136</post-id>	</item>
		<item>
		<title>Embed only the video from another post on X or Twitter</title>
		<link>https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/</link>
					<comments>https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Sun, 22 Oct 2023 17:33:56 +0000</pubDate>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[ios]]></category>
		<category><![CDATA[twitter]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5104</guid>

					<description><![CDATA[<p>Twitter has a new-ish feature that lets you embed only the video from another post or tweet in a post/tweet of your own (without quote-replying the source tweet itself). Only the video is then embedded in your post, and a &#8230; <a href="https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/">Embed only the video from another post on X or Twitter</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Twitter has a new-ish feature that lets you embed only the video from another post or tweet in a post/tweet of your own (without quote-replying the source tweet itself). Only the video is then embedded in your post, and a small attribution appears at the bottom identifying where the video came from:</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-5106 colorbox-5104" src="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-embedded-video-600x557.png" alt="" width="600" height="557" srcset="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-embedded-video-600x557.png 600w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-embedded-video-1024x951.png 1024w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-embedded-video-323x300.png 323w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-embedded-video.png 1251w" sizes="auto, (max-width: 600px) 100vw, 600px" /></p>
<p>In the screenshot above, Sarah is sharing a video that was originally shared by Luc, but she&#8217;s not embedding/quoting Luc&#8217;s tweet itself – only the video. This post will cover how to do that yourself, both on the desktop/web and in the iOS Twitter app on iPhone.</p>
<p><span id="more-5104"></span></p>
<p>Many of Twitter&#8217;s features are really just special-cased handling of URLs, and video embedding is no different. If you want to quote-reply, you actually just post something followed by the URL of the original tweet you want to quote. For example,</p>
<pre>Look at the size of this crowd!
https://twitter.com/LucAuffret/status/1716085946016252251</pre>
<p>ends up with the following quote reply:</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-5107 colorbox-5104" src="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-quote-reply-600x508.png" alt="" width="600" height="508" srcset="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-quote-reply-600x508.png 600w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-quote-reply-1024x867.png 1024w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-quote-reply-354x300.png 354w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-quote-reply.png 1251w" sizes="auto, (max-width: 600px) 100vw, 600px" /></p>
<p>And similarly, embedding just the video from a tweet is as simple as appending <code>/video/1</code> to the URL of the source tweet. In this case:</p>
<pre>Look at the size of that crowd! #LibérezPalestine
https://twitter.com/LucAuffret/status/1716085946016252251<strong>/video/1</strong></pre>
<p>becomes</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-5108 colorbox-5104" src="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-video-embedded-600x400.png" alt="" width="600" height="400" srcset="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-video-embedded-600x400.png 600w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-video-embedded-1024x682.png 1024w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-video-embedded-451x300.png 451w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-video-embedded.png 1251w" sizes="auto, (max-width: 600px) 100vw, 600px" /></p>
<p>On iOS, in the Twitter/X app, this is all done for you automatically. Just long-press on a video and choose the &#8220;Post Video&#8221; option in the menu that pops up to have twitter start a new post pre-filled with the full tweet URL and the <code>/video/1</code> already appended for you:</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-5109 colorbox-5104" src="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-iOS-embed-video-600x514.jpeg" alt="" width="600" height="514" srcset="https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-iOS-embed-video-600x514.jpeg 600w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-iOS-embed-video-1024x877.jpeg 1024w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-iOS-embed-video-350x300.jpeg 350w, https://neosmart.net/blog/wp-content/uploads/2023/10/twitter-iOS-embed-video.jpeg 1170w" sizes="auto, (max-width: 600px) 100vw, 600px" /></p>
<p>If the source tweet/post contains more than one video, you can change the <code>1</code> in <code>/video/1</code> to a different number in order to embed a video other than the first one in the post.</p>
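<p>Since the embed is driven entirely by the URL, you can also build these links programmatically. Here&#8217;s a trivial sketch in Python (the helper name is my own invention, not anything Twitter provides):</p>
<pre><code class="language-python">def video_embed_url(tweet_url, n=1):
    """Return the URL that embeds only the n-th video from a tweet/post."""
    # Appending /video/n to the tweet URL is all it takes:
    return tweet_url.rstrip("/") + "/video/" + str(n)</code></pre>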
<p><strong>Liked this post? Follow me on twitter <a href="https://twitter.com/mqudsi" rel="follow">@mqudsi</a> or subscribe to new posts via email from the sidebar to the right!</strong></p>The post <a href="https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/">Embed only the video from another post on X or Twitter</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/embed-only-the-video-from-another-post-on-x-or-twitter/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5104</post-id>	</item>
		<item>
		<title>Increment only numbers matching regex in Vim</title>
		<link>https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/</link>
					<comments>https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Fri, 13 Oct 2023 17:47:14 +0000</pubDate>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[productivity]]></category>
		<category><![CDATA[vim]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5071</guid>

					<description><![CDATA[<p>Long-time vim or neovim users are probably already aware that visually selecting a block of text then pressing CTRL + A in vim will result in any numbers in the selected block of text being incremented by 1. This works &#8230; <a href="https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/">Increment only numbers matching regex in Vim</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p>Long-time vim or neovim users are probably already aware that visually selecting a block of text then pressing <kbd>CTRL</kbd> + <kbd>A</kbd> in vim will result in any numbers in the selected block of text being incremented by 1. This works even if the block contains non-numeric text: each group of digits gets treated as a number and is incremented.<sup id="rf1-5071"><a href="https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/#fn1-5071" title="To be pedantic, only the first group-of-digits/number on each line gets incremented; like many vim commands this only works on the first match per line of text unless some sort of /g global modifier is used." rel="footnote">1</a></sup></p>
<p>For example, here&#8217;s a video that shows what happens when you select some text in vim and then use <kbd>CTRL</kbd> + <kbd>A</kbd> to increment the values:</p>
<p><!--[video width="1317" height="608" webm="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-increment-numbers.webm" mp4="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regular-increment.mp4"][/video]--></p>
<p><video controls playsinline><source src="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-increment-numbers.webm" type="video/webm" /><source src="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regular-increment.mp4" type="video/mp4" /></video></p>
<p><span id="more-5071"></span></p>
<p>(It&#8217;s also a fact that a lot of vim users learned about this functionality the terribly hard way: accidentally pressing <kbd>CTRL</kbd> + <kbd>A</kbd> then later realizing that all the numbers in their document were off-by-one for some unknown reason.)</p>
<p>But in all honesty, this isn&#8217;t a very useful mapping because it&#8217;s rare (at least in the programming world) to have numeric and text content completely separate: you usually have numbers in certain key places while also having other numbers intermixed with the remainder of the text in the document. And we&#8217;ll often need to increment only certain numeric values and leave the rest alone.</p>
<p>Here&#8217;s how you can increment only the numbers matching a regular expression (including multiple numbers on the same line) while leaving the rest intact:</p>
<ol>
<li>Write a regex that matches only the numbers you want to change. By composing it interactively, we can get vim to highlight the matches as we edit the regular expression, allowing us to visually confirm that the regex matches the numbers we want to increment. To do this, just use our trusty old friend <code>:s/foo</code> (you can match numeric content by using <code>\d\+</code> to select all consecutive digits)</li>
<li>For the second half of the <code>:s/foo/bar/</code> expression (<code>bar</code>, the replacement value), we&#8217;ll use the magic of a vim expression to increment (or otherwise manipulate) the matched value. Remember that regex capture groups are made with <code>\(match here\)</code> in vim&#8217;s default (magic) mode, match group 0 is the entirety of the match, and our manually captured groups (via the parentheses) are counted from left-to-right from number 1 onward. The magic bit is <code>\=submatch(n)+1</code>, which replaces the entire match with the value of the <em>n</em>th match group incremented by one.</li>
</ol>
<p>Here&#8217;s an example where we want to insert some text in the middle of a numbered/indexed structured body of text then update all the indexes afterwards by bumping them up by one:</p>
<p>We have some subtitles in the SRT format and we want to insert a new caption in the middle, then update all the caption numbers <em>but not the timestamps</em> to reflect the insertion in the middle of the list. We have this text to start with:</p>
<pre>1
00:00:03,400 --&gt; 00:00:06,177
In this episode, we'll be talking about
the importance of strong typing in programming.

2
00:00:10,000 --&gt; 00:00:11,200
Strongly-typed languages have many benefits over
their loosely-typed counterparts.

3
00:00:11,500 --&gt; 00:00:13,655
Using strongly-typed languages can actually make
you more productive.</pre>
<p>And we want to insert the following subtitles between <code>1</code> and <code>2</code>, without having to bump every index that comes after by hand one-by-one, which is time-consuming, error-prone, and a chore:</p>
<pre>00:00:06,600 --&gt; 00:00:09,220
Hang on to your hats because this is going to be fun!</pre>
<p>We&#8217;ll do this by pasting the text where we want it to go, selecting the <em>remainder</em> of the text (where we need to increment the subtitle index number), and then using the vim expression <code>:s/^\d\+$/\=submatch(0)+1/g</code> to match a line that contains only numeric content (so it&#8217;ll match the subtitle index number but not the timestamps, which we absolutely <em>don&apos;t</em> want to inadvertently increment in the process):</p>
<p><!--[video width="1233" height="1067" webm="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-smart-increment-numbers-demo.webm" mp4="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regex-increment.mp4"][/video]--></p>
<p><video controls playsinline><source src="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-smart-increment-numbers-demo.webm" type="video/webm" /><source src="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regex-increment.mp4" type="video/mp4" /></video></p>
<p>As you can see, it&#8217;s simply a matter of selecting the text you want to operate on (in our case, everything past the captions we just entered) and then coming up with a regex that matches only the numbers we want to change but not the numbers we don&#8217;t. If we had used <kbd>CTRL</kbd> + <kbd>A</kbd> here instead, we would have ended up with the first number of each timestamp in our selection incremented, in addition to the indexes.</p>
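<p>The same trick translates directly to any regex engine that supports callback replacements, if you ever need to script it outside of vim. Here&#8217;s a rough Python equivalent of the substitution above (illustrative only; the function name is mine):</p>
<pre><code class="language-python">import re

def bump_caption_indexes(srt):
    """Like :s/^\d\+$/\=submatch(0)+1/g in vim: increment only lines
    that consist entirely of digits (the caption indexes), leaving
    timestamps and caption text untouched."""
    return re.sub(
        r"^(\d+)$",
        lambda m: str(int(m.group(1)) + 1),
        srt,
        flags=re.MULTILINE,
    )</code></pre>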
<p>I think the syntax for this one is easy enough to remember that you probably don&#8217;t need a plugin or a custom key mapping to do this for you. The trickiest part is the regex itself, and in most cases judicious application of <code>^</code> (start of line), <code>$</code> (end of line), and whitespace will suffice to get you a regex that matches only the values you need. Unlike some other vim expressions that have really inscrutable names or incantations, using <code>\=submatch(0)</code> (or <code>\=submatch(4)</code> or whatever) a few times is probably all it will take for you to memorize the syntax, and soon enough it&#8217;ll be second nature.</p>
<p>If you enjoyed this tip, consider subscribing to blog posts via RSS or via email from the sidebar to the right and follow me <a href="https://twitter.com/mqudsi" rel="follow">on twitter<strong> @mqudsi</strong></a> for more fun hacking or programming stuff! (If you&#8217;re an emacs user, it&#8217;s highly unlikely I&#8217;ll have any text editing hacks for you at any time, unfortunately!)</p>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5071"><p>To be pedantic, only the first group-of-digits/number on each line gets incremented; like many vim commands this only works on the first match per line of text unless some sort of <code>/g</code> global modifier is used.&nbsp;<a href="#rf1-5071" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/">Increment only numbers matching regex in Vim</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/increment-only-numbers-matching-regex-in-vim/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		<enclosure length="115664" type="video/webm" url="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-increment-numbers.webm"/>
<enclosure length="500294" type="video/webm" url="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-smart-increment-numbers-demo.webm"/>
<enclosure length="609160" type="video/mp4" url="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regular-increment.mp4"/>
<enclosure length="846424" type="video/mp4" url="https://neosmart.net/blog/wp-content/uploads/2023/10/vim-regex-increment.mp4"/>

		<post-id xmlns="com-wordpress:feed-additions:1">5071</post-id>	</item>
		<item>
		<title>tcpproxy 0.4 released</title>
		<link>https://neosmart.net/blog/tcpproxy-0-4-released/</link>
					<comments>https://neosmart.net/blog/tcpproxy-0-4-released/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Sun, 08 Oct 2023 18:53:36 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[rust]]></category>
		<category><![CDATA[tcpproxy]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5060</guid>

					<description><![CDATA[<p>This blog post was a bit delayed in the pipeline, but a new release of tcpproxy, our educational async (tokio) rust command line proxy project, is now available for download (precompiled binaries or install via cargo). I was actually surprised &#8230; <a href="https://neosmart.net/blog/tcpproxy-0-4-released/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/tcpproxy-0-4-released/">tcpproxy 0.4 released</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<div id="attachment_5065" style="width: 610px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-5065" class="wp-image-5065 size-medium colorbox-5060" src="https://neosmart.net/blog/wp-content/uploads/2023/10/Programming-with-Rust-600x363.jpg" alt="" width="600" height="363" srcset="https://neosmart.net/blog/wp-content/uploads/2023/10/Programming-with-Rust-600x363.jpg 600w, https://neosmart.net/blog/wp-content/uploads/2023/10/Programming-with-Rust-496x300.jpg 496w, https://neosmart.net/blog/wp-content/uploads/2023/10/Programming-with-Rust.jpg 800w" sizes="auto, (max-width: 600px) 100vw, 600px" /><p id="caption-attachment-5065" class="wp-caption-text">Image courtesy of Hack A Day</p></div>
<p>This blog post was a bit delayed in the pipeline, but a new release of <a href="https://github.com/mqudsi/tcpproxy" rel="nofollow"><code>tcpproxy</code></a>, our educational async (tokio) rust command line proxy project, is now available for download (<a href="https://github.com/mqudsi/tcpproxy/releases/tag/0.4.0" rel="nofollow">precompiled binaries</a> or <a href="https://crates.io/crates/tcpproxy" rel="nofollow">install via <code>cargo</code></a>).</p>
<p>I was actually surprised to find that we haven&#8217;t written about tcpproxy before (you can see our other rust-related posts <a href="https://neosmart.net/blog/tag/rust/" rel="follow">here</a>), but it&#8217;s a command line tcp proxy &#8220;server&#8221; written with two purposes in mind: a) serving as a real-world example of an async (tokio-based) rust networking project, and b) serving as a minimal but-still-useful tcp proxy you can run and use directly from the command line, without needing complex installation or configuration procedures. (You can think of it as being like <a href="https://en.wikipedia.org/wiki/Minix" rel="follow">Minix</a>, but for rust and async networking.)</p>
<p>The tcpproxy project has been around for quite some time, originally published in 2017 before rust&#8217;s <code>async</code> support was even stabilized. At the time, it manually chained futures to achieve scalability without relying on the thread-per-connection model – but today its codebase is a lot easier to follow and understand thanks to rust&#8217;s first-class async/await support.</p>
<p><span id="more-5060"></span></p>
<p>That doesn&#8217;t mean that there aren&#8217;t &#8220;gotchas&#8221; that rust devs need to be aware of when developing long-lived async-powered applications, and tcpproxy&#8217;s purpose here is to serve as a real-world illustration of the correct way to handle some of the thornier issues, such as tying the lifetime of various connections (or halves of connections) to one another and aborting all remaining tasks when the first terminates (without blocking or polling).</p>
<p><a href="https://github.com/mqudsi/tcpproxy/releases/tag/0.4.0" rel="nofollow">The 0.4.0 release</a> doesn&#8217;t contain any major changes, but it tweaks a number of things both to improve the usability of the application and to model the correct way of handling a few things (such as not using an <code>Arc&lt;T&gt;</code> to share state that remains alive (and static) for the duration of the program&#8217;s execution<sup id="rf1-5060"><a href="https://neosmart.net/blog/tcpproxy-0-4-released/#fn1-5060" title="In cases like this, the recommendation is to actually just leak the memory instead to reduce cache coherency traffic in the MESI or MOESI protocols that is caused when each new task increments or decrements the shared reference count bits in the Arc&lt;T&gt;. If you know the value is going to live until the end of the application&rsquo;s lifetime anyway, there&rsquo;s no need to incur that cost and any future (read-only) access to the shared variable from any thread on any core will be ~free." rel="footnote">1</a></sup>).</p>
<p>One of the user-visible changes in this release is that <code>ECONNRESET</code> and <code>ECONNABORTED</code> are no longer treated as exceptional, meaning that tcpproxy proceeds as if the connection in question were closed normally and uneventfully. While a compliant TCP client shouldn&#8217;t just abort a tcp connection (and a server shouldn&#8217;t reset one), these things happen quite often in the real world, and since all tcpproxy connections are stateless, there&#8217;s really no reason to handle these any differently than a normal, compliant tcp connection tear-down. Since we <em>don&#8217;t</em> report a connection error in these cases, tcpproxy prints the normal messages about the number of bytes proxied in each direction (when executed in debug <code>-d</code> mode), <a href="https://github.com/mqudsi/tcpproxy/issues/3" rel="nofollow">hopefully leading to less confusion</a>.</p>
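<p>tcpproxy itself is written in async rust, but the policy is language-agnostic. As a rough illustration only (this is not tcpproxy&#8217;s actual code, and the function name is made up), here&#8217;s a synchronous Python sketch of treating resets and aborts as a normal end-of-stream:</p>
<pre><code class="language-python">def read_until_closed(sock):
    """Drain one direction of a proxied connection, treating
    ECONNRESET/ECONNABORTED like a normal close instead of an error.
    Returns the number of bytes read."""
    total = 0
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # clean shutdown by the peer
                break
            total += len(chunk)
    except (ConnectionResetError, ConnectionAbortedError):
        pass  # treat like EOF: the peer is gone either way
    return total</code></pre>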
<p>For those of you hearing about the tcpproxy project for the first time, I invite you to look over <a href="https://github.com/mqudsi/tcpproxy/blob/99f64a3b3d7509ca77fbfa5e9cade48d72202166/src/main.rs#L75" rel="nofollow">the core event loop</a> which remains fairly small even when correctly handling all the cases we need to account for and synchronizing lifetimes the way we like. If you spot something that&#8217;s wrong, not quite right, or could be done in a more idiomatic way, please do leave a comment, send an email, or open an issue &#8211; tcpproxy is an open source project and it takes a village to raise and nurture even the smallest of projects to a healthy state!</p>
<p><strong>You can follow me on twitter <a href="https://twitter.com/mqudsi" rel="follow">@mqudsi</a> or sign up below for our rust-only mailing list to receive a heads-up when new rust educational content or rust open source crates are released. If you&#8217;re in a position to do so, I am also experimenting with accepting sponsors <a href="https://www.patreon.com/mqudsi" rel="follow">on my Patreon page</a> and would greatly appreciate your patronage and support!</strong></p>
<div class="sendy_widget" style='margin-bottom: 0.5em;'>
<p><em>If you would like to receive a notification the next time we release a rust library, publish a crate, or post some rust-related developer articles, you can subscribe below. Note that you'll only get notifications relevant to rust programming and development by NeoSmart Technologies. If you want to receive email updates for all NeoSmart Technologies posts and releases, please sign up in the sidebar to the right instead.</em></p>
<iframe tabIndex=-1 onfocus="sendy_no_focus(event)" src="https://neosmart.net/sendy/subscription?f=BUopX8f2VyLSOb892VIx6W4IUNylMrro5AN6cExmwnoKFQPz9892VSk4Que8yv892RnQgL&title=Join+the+rust+mailing+list" style="height: 300px; width: 100%;"></iframe>
</div>
<script type="text/javascript">function sendy_no_focus(e) { e.preventDefault(); }</script>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5060"><p>In cases like this, the recommendation is to actually just leak the memory instead to reduce cache coherency traffic in the MESI or MOESI protocols that is caused when each new task increments or decrements the shared reference count bits in the <code>Arc&lt;T&gt;</code>. If you know the value is going to live until the end of the application&#8217;s lifetime anyway, there&#8217;s no need to incur that cost and any future (read-only) access to the shared variable from any thread on any core will be ~free.&nbsp;<a href="#rf1-5060" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/tcpproxy-0-4-released/">tcpproxy 0.4 released</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/tcpproxy-0-4-released/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5060</post-id>	</item>
		<item>
		<title>CallerArgumentExpression and extension methods don’t mix</title>
		<link>https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/</link>
					<comments>https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/#respond</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Mon, 11 Sep 2023 17:17:55 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[.net]]></category>
		<category><![CDATA[c#]]></category>
		<category><![CDATA[callerargumentexpression]]></category>
		<category><![CDATA[dot net]]></category>
		<category><![CDATA[dotnet]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[stringification]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=5037</guid>

					<description><![CDATA[<p>This post is for the C# developers out there and takes a look at the interesting conjunction of [CallerArgumentExpression] and static extension methods – a mix that at first seems too convenient to pass up. A quick recap: [CallerArgumentExpression] landed &#8230; <a href="https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/">CallerArgumentExpression and extension methods don’t mix</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><img loading="lazy" decoding="async" class="alignright size-thumbnail wp-image-5053 colorbox-5037" src="https://neosmart.net/blog/wp-content/uploads/2023/09/c-sharp-150x150.jpg" alt="" width="150" height="150" />This post is for the C# developers out there and takes a look at the interesting conjunction of <code>[CallerArgumentExpression]</code> and static extension methods – a mix that at first seems too convenient to pass up.</p>
<p>A quick recap: <a href="https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-10.0/caller-argument-expression" rel="follow"><code>[CallerArgumentExpression]</code> landed</a> as part of the C# 10.0 language update and helps to reduce the (often brittle!) boilerplate involved in, among other uses, creating useful error messages capturing the names of variables or the text of expressions. You tag an optional <code>string</code> method parameter with <code>[CallerArgumentExpression("argName")]</code> where <code>argName</code> is the name of the method argument you want stringified, and the compiler does the rest.</p>
<p><span id="more-5037"></span></p>
<p>Here&#8217;s a quick demo of how <code>[CallerArgumentExpression]</code> works:</p>
<pre><code class="language-csharp">using System;
using System.Runtime.CompilerServices;

public class Program
{
    static string Stringify(object obj,
        [CallerArgumentExpression("obj")] string expr = "")
    {
        return expr;
    }

    public static class Foo
    {
        public static string Bar = "bar";
    }

    public static void Main()
    {
        var expr = Stringify(Foo.Bar);
        Console.WriteLine(expr); // prints "Foo.Bar"
        expr = Stringify(Foo.Bar + Foo.Bar);
        Console.WriteLine(expr); // prints "Foo.Bar + Foo.Bar"
    }
}
</code></pre>
<p>And you can try it online yourself <a href="https://dotnetfiddle.net/BQPJCs" rel="nofollow">in this .NET Fiddle</a>.</p>
<p>It&#8217;s really cool and it opens the door to a lot of possibilities (though I&#8217;m still stuck trying to figure some of them out, such as reliably setting/clearing model binding errors that involve array expressions).</p>
<p>As mentioned, this shipped with C# 10. And of course, C# 8 shipped &#8220;the big one&#8221;: <a href="https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-8.0/nullable-reference-types" rel="follow">nullable reference types</a>. Since then, the following pattern has become familiar in many a codebase while devs figure out where variables actually can or can&#8217;t be null:</p>
<pre><code class="language-csharp">using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

static class Extensions
{
    public static T ThrowIfNull&lt;T&gt;([NotNull] this T? value, string expr)
    {
        if (value is null) {
            throw new ArgumentNullException(expr);
        }
        return value;
    }
}
</code></pre>
<p>This does exactly what you think it does: it verifies that a value isn&#8217;t null or throws an exception if it is. And it lets the compiler know that downstream of this call, the passed-in value is non-<code>null</code>. To make it useful, it&#8217;s common enough to extend it with more caller attribute magic:</p>
<pre><code class="language-csharp">using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

static class Extensions
{
    public static T ThrowIfNull&lt;T&gt;(
        [NotNull] this T? value,
        string expr,
        [CallerMemberName] string callerName = "",
        [CallerFilePath] string filePath = "",
        [CallerLineNumber] int lineNumber = 0)
    {
        if (value is null) {
            throw new InvalidOperationException(
                $"{expr} unexpectedly null in {callerName} "
                + $"at {filePath}:{lineNumber}");
        }
        return value;
    }
}
</code></pre>
<p>Now we get useful exceptions that we&#8217;ll hopefully log and revisit to help us find any places in our codebase where we are assuming a value can&#8217;t be <code>null</code> but it turns out that, in fact, it can be.</p>
<p>But what if we try to add our new best buddy <code>[CallerArgumentExpression]</code> here, to get rid of the need to manually specify the text of the argument via the <code>expr</code> parameter in our <code>ThrowIfNull()</code>?</p>
<pre><code class="language-csharp">using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

static class Extensions
{
    public static T ThrowIfNull&lt;T&gt;(
        [NotNull] this T? value,
        [CallerArgumentExpression("value")] string expr = "",
        [CallerMemberName] string callerName = "",
        [CallerFilePath] string filePath = "",
        [CallerLineNumber] int lineNumber = 0)
    {
        if (value is null) {
            throw new InvalidOperationException(
                $"{expr} unexpectedly null in {callerName} "
                + $"at {filePath}:{lineNumber}");
        }
        return value;
    }
}
</code></pre>
<p>At first blush, this works great. Use it with a single variable directly, as in <code>foo.ThrowIfNull()</code>, and everything will work swimmingly and it&#8217;ll do exactly what it says on the tin. But try using it in a more complicated setting, say <code>foo?.bar?.ThrowIfNull()</code>, and you&#8217;ll see what I mean: here, <code>expr</code> will only capture the last token in the chain and you&#8217;ll see that <code>expr</code> is only <code>bar</code> and not <code>foo.bar</code>!</p>
<p>It&#8217;s actually not particularly surprising behavior. Even without knowing what Roslyn desugars the above code to, you could logically think of it as being an expression (conditionally) invoked on/with the final variable <code>bar</code> itself – after all, <code>T</code> here would have been <code>bar.GetType()</code>, so it&#8217;s not a huge stretch of the imagination to guess that <code>expr</code> might only span <code>bar</code> as well.<sup id="rf1-5037"><a href="https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/#fn1-5037" title="Except expr is actually not bar but rather .bar!" rel="footnote">1</a></sup></p>
<p>Indeed, when you look at what the code compiles to, you&#8217;ll see why. For the following code fragment:</p>
<pre><code class="language-csharp">public class Foo {
    public string? Bar;
}

public class C {
    public void M(Foo? foo) {
        foo?.Bar.ThrowIfNull();
    }
}
</code></pre>
<p><a href="https://sharplab.io/#v2:EYLgZgpghgLgrgJwgZwLQGED2AbbEDGMAlpgHYAyRMECU2yAPgAIBMAjALABQTADAARM2AOgAiRKAHNSmZMXzJhWACYQAgqToBPZEWQBubn0EiASnFLEAthCWYrAByJ4EAZRoA3IvhSGuRgGZBFn4AMUxMfgBvbn44wSChXgB+fgAhKAQ/AF9uQOD+dGjY+KZEgBZ+AFkACnDMVLAIgEpirniO/iaG4QyEYQAVAAsETAB3AEkwADk4XBrmvw7crhX8oQA2AoBRAA9qUl0yZDaOgG0mAHYQfmnMGFncAF0SuLKTLYH+YdHJmbnsAAeAYAPhqZzuDwBT34MCGem+qQ8dDgEAANPwzug6C41AhJHAbJY9g4kMgjqQagAiZHYVFU5owpKpCC7Un8AC8/FIANaMXanTiRDA/BqtNR/ARPNwfNegvicN+3IgY34eIJRKhuD2PgcxDINVZpMWcuWptKl344ogS3iK2yQA==" rel="nofollow">We get</a></p>
<pre><code class="language-csharp">public class Foo
{
    [System.Runtime.CompilerServices.Nullable(2)]
    public string Bar;
}

public class C
{
    [System.Runtime.CompilerServices.NullableContext(2)]
    public void M(Foo foo)
    {
        if (foo != null)
        {
            Extensions.ThrowIfNull(foo.Bar, ".Bar");
        }
    }
}
</code></pre>
<p>Which, while still helpful, is not exactly what we want. Although as C# developers we are somewhat allergic to calling static helper utilities directly instead of cleverly turning them into their more ergonomic extension method counterparts, in this case we don&#8217;t have any other choice.</p>
<p>When we change <code>ThrowIfNull()</code> from an extension method to a regular static method though, we get the result we <em>really</em> wanted:</p>
<pre><code class="language-csharp">public static class Utils
{
    public static T ThrowIfNull&lt;T&gt;(
        [NotNull] T? value,
        [CallerArgumentExpression("value")] string? expr = null) 
    {
        if (value is null) {
            throw new ArgumentNullException(expr);
        }
        return value;
    }
}

public class Foo
{
    public string? Bar;
}

public class C
{
    public void M(Foo? foo)
    {
        Utils.ThrowIfNull(foo?.Bar);
    }
}
</code></pre>
<p><a href="https://sharplab.io/#v2:EYLgZgpghgLgrgJwgZwLQGED2AbbEDGMAlpgHYAyRMECU2yAPgAIBMAjALABQTADAARM2AOgAiRKAHNSmZMXzJhWACYQAgqToBPZEWQBubn0EiASnFLEAthCWYrAByJ4EAZRoA3IvhSGuRgGYTADZBFn4AVWJ6fgBvbn5E/gBtJgB2EH4AOUwYLLhcAF0EpKYgoVCAFX5KgAsETAB3AEkwfNwAHkqAPgAKZJy8guxCmoB+fg86OAgAGhT0Ohc1BEk4G0sAUQAPByRkXTJegCIp7BnjgEpRoV4JiF2EfgBeflJhy7iSpKSiMH5emcZvw9G8Pl8uD8oUkYPUmm8II1+Cs1hshrgdj4HMQjg89pc/NDEgBfb5Q9KTaYQQlJUlcOmBML8ABimEwEJ+ZRMd34ACEoAg/AyeOVwugOaVygAWfgAWV6rMwEzAbM+8UhRKizkUdQaLTaw16KqVwn5CAJZLpxKAA=" rel="nofollow">Desugaring to</a>:</p>
<pre><code class="language-csharp">public class C
{
    [System.Runtime.CompilerServices.NullableContext(2)]
    public void M(Foo foo)
    {
        Utils.ThrowIfNull((foo != null) ? foo.Bar : null, "foo?.Bar");
    }
}
</code></pre>
<p>Liked this post? Follow me on twitter <a href="https://twitter.com/mqudsi" rel="follow">@mqudsi</a> and like this tweet for more .NET awesomeness!</p>
<blockquote class="twitter-tweet" data-width="550" data-dnt="true">
<p lang="en" dir="ltr">Do you know how to effectively use [CallerArgumentExpression] to supercharge C# <a href="https://twitter.com/dotnet?ref_src=twsrc%5Etfw" rel="follow">@dotnet</a> codebase? Here&#39;s the number one mistake to look out for. <a href="https://t.co/sReLDSS3iw" rel="follow">https://t.co/sReLDSS3iw</a></p>
<p>&mdash; Mahmoud Al-Qudsi (@mqudsi) <a href="https://twitter.com/mqudsi/status/1701285293372965228?ref_src=twsrc%5Etfw" rel="follow">September 11, 2023</a></p></blockquote>
<hr class="footnotes"><ol class="footnotes"><li id="fn1-5037"><p>Except <code>expr</code> is actually not <code>bar</code> but rather <code>.bar</code>!&nbsp;<a href="#rf1-5037" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/">CallerArgumentExpression and extension methods don’t mix</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/callerargumentexpression-and-extension-methods-dont-mix/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5037</post-id>	</item>
		<item>
		<title>Implementing truly safe semaphores in rust</title>
		<link>https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/</link>
					<comments>https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#comments</comments>
		
		<dc:creator><![CDATA[Mahmoud Al-Qudsi]]></dc:creator>
		<pubDate>Mon, 03 Oct 2022 20:11:34 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[release]]></category>
		<category><![CDATA[rust]]></category>
		<category><![CDATA[semaphore]]></category>
		<guid isPermaLink="false">https://neosmart.net/blog/?p=4958</guid>

					<description><![CDATA[<p>Discuss this article on r/rust or on Hacker News. Low-level or systems programming languages generally strive to provide libraries and interfaces that enable developers, boost productivity, enhance safety, provide resistance to misuse, and more &#8212; all while trying to reduce &#8230; <a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/">Continue reading <span class="meta-nav">&#8594;</span></a></p>
The post <a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/">Implementing truly safe semaphores in rust</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></description>
										<content:encoded><![CDATA[<p><em class="onload">Discuss this article <a href="https://old.reddit.com/r/rust/comments/xvm9ul/implementing_truly_safe_semaphores_in_rust_and/" rel="follow">on r/rust</a> or <a href="https://news.ycombinator.com/item?id=33084442" rel="follow">on Hacker News</a>.</em></p>
<p>Low-level or systems programming languages generally strive to provide libraries and interfaces that enable developers, boost productivity, enhance safety, provide resistance to misuse, and more &#8212; all while trying to reduce the runtime cost of such initiatives. Strong type systems turn runtime safety/sanity checks into compile-time errors, optimizing compilers try to reduce an enforced sequence of api calls into a single instruction, and library developers think up clever hacks to completely erase any trace of an abstraction from the resulting binaries. And as anyone familiar with them can tell you, the rust programming language and its developers/community have truly embraced this ethos of zero-cost abstractions, perhaps more so than any other.</p>
<p>I&#8217;m not going to go into detail about what the rust language and standard library do to enable zero-cost abstractions or spend a lot of time going over the many examples of zero-cost interfaces available to rust programmers, though I&#8217;ll quickly mention a few of my favorites: iterators and <a href="https://doc.rust-lang.org/nightly/std/iter/trait.Iterator.html" rel="follow">all the methods the <code>Iterator</code> trait exposes</a> have to be at the top of every list, given the amount of black magic voodoo the compiler has to do to turn these into their loop-based equivalents; zero-sized types make developing embedded firmware in rust a dream, and it&#8217;s really crazy to see how <a href="https://docs.rs/embedded-hal/latest/embedded_hal/" rel="nofollow">all the various peripheral abstractions</a> can be completely erased, giving you small firmware blobs despite all the safety abstractions; and no list is complete without the newest member of the team, <code>async</code>/<code>await</code>, and how rust manages to turn an entire web server api into a single state machine and event loop. (And to think this can be used even on embedded, without a relatively heavy async framework like tokio and with zero allocations to boot!)</p>
<p><span id="more-4958"></span></p>
<p>But the tricky thing with abstractions is that the <em>relative</em> price you pay scales rather unfairly with the size of the interface you are abstracting over. While a byte here and a byte there may mean nothing when we&#8217;re talking framework-scale interfaces, when you are modeling smaller and finer-grained abstractions, every byte and every instruction begin to count.</p>
<p>A couple of weeks ago, we released an update to <a href="https://github.com/neosmart/rsevents" rel="nofollow"><code>rsevents</code></a>, our crate that contains a rusty cross-platform equivalent to WIN32 events for signaling between threads and writing your own synchronization primitives, and <a href="https://github.com/neosmart/rsevents-extra" rel="nofollow"><code>rsevents-extra</code></a> a companion crate that provides a few handy synchronization types built on top of the manual- and auto-reset events from the <code>rsevents</code> crate. Aside from the usual awesome helpings of performance improvements, ergonomics enhancements, and more, <strong>this latest version of <code>rsevents-extra</code> includes <a href="https://docs.rs/rsevents-extra/latest/rsevents_extra/struct.Semaphore.html" rel="nofollow">a <code>Semaphore</code> synchronization primitive</a></strong> &#8211; something that the rust standard library surprisingly lacks&#8230; but not without good reason.</p>
<h2>What makes a semaphore a semaphore?</h2>
<p>Semaphores are well-documented and fairly well-understood underpinnings of any concurrency library or framework and essential Computer Science knowledge. So why doesn&#8217;t the rust standard library have a semaphore type?</p>
<p>Unlike the synchronization types that the rust standard library currently provides (such as <code><a href="https://doc.rust-lang.org/nightly/std/sync/struct.Mutex.html">Mutex&lt;T&gt;</a></code> and <a href="https://doc.rust-lang.org/nightly/std/sync/struct.RwLock.html" rel="follow"><code>RwLock&lt;T&gt;</code></a>), a semaphore is somewhat harder to model as it doesn&#8217;t restrict concurrent access to a single object or variable so much as it limits concurrency within a <em>region</em> of code.</p>
<p>Of course it can be argued that in traditional programming a semaphore is just a more general case of a mutex, and that just as mutexes traditionally protected a region of code from concurrent access<sup id="rf1-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn1-4958" title="In fact, in some languages/apis a &ldquo;critical section&rdquo; is another name for a mutex." rel="footnote">1</a></sup> but were converted into synchronization primitives <em>owning</em> the data they protect and marshalling access to it, there&#8217;s no reason a rust semaphore couldn&#8217;t do the same. But therein lies the problem: while a mutex and a read-write lock can both be understood in terms of <em>readers</em> and <em>writers</em>,<sup id="rf2-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn2-4958" title="For a mutex, they are one and the same as it mandates exclusive access to a region." rel="footnote">2</a></sup> a semaphore makes no such guarantees. And rust is quite fundamentally built on the concept of read ^ write: it <em>needs</em> to know if a thread/scope is reading or writing from a variable or memory location in order to uphold its most basic memory safety guarantee: there can either be multiple &#8220;live&#8221; read-only references to an object or a single write-enabled (<code>&amp;mut</code>) reference to the same &#8212; <strong>but a semaphore doesn&#8217;t make that distinction</strong>!</p>
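<p>A minimal standalone illustration of this read ^ write rule (a sketch for this post, not part of any library): any number of shared <code>&amp;T</code> borrows may coexist, but a <code>&amp;mut T</code> borrow demands exclusivity.</p>
<pre><code class="language-rust">fn demo() -&gt; usize {
    let mut data = vec![1, 2, 3];

    // Any number of shared (&amp;T) borrows may coexist...
    let a = &amp;data;
    let b = &amp;data;
    let _ = (a.len(), b.len());

    // ...but taking a &amp;mut borrow while `a` or `b` is still used
    // afterwards is a compile-time error:
    // let m = &amp;mut data; let _ = a.len();

    // Once the shared borrows are no longer used, mutation is fine:
    let m = &amp;mut data;
    m.push(4);
    data.len()
}

fn main() {
    assert_eq!(demo(), 4);
}
</code></pre>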
<p>While a strictly binary semaphore (max concurrency == 1) can guarantee that there will never be multiple writers accessing a memory region, there&#8217;s not much theoretical benefit to such a binary semaphore over a mutex &#8211; in fact, they&#8217;re interchangeable. What makes a semaphore truly special is that it can be created (or even dynamically modified) with a concurrency limit <em>n</em> and then uphold its core precondition, guaranteeing that there will never be more than <em>n</em> threads/stacks<sup id="rf3-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn3-4958" title="Semaphores are generally non-reentrant so recursively obtaining a semaphore will count against its limit!" rel="footnote">3</a></sup> accessing a semaphore-protected region at any given time.</p>
<p>The problem is that with <em>n</em> &gt; 1, there&#8217;s no concept of a &#8220;privileged&#8221; owning thread and all threads that have &#8220;obtained&#8221; the semaphore do so equally. Therefore, a <em>rust</em> semaphore can only ever provide read-only (<code>&amp;T</code>) access to an underlying resource, limiting the usefulness of such a semaphore almost to the point of having no utility. As such, the only safe &#8220;owning&#8221; semaphore with read-write access that can exist in the rust world would be  <code>Semaphore&lt;()&gt;</code>,<sup id="rf4-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn4-4958" title="The empty parentheses/tuple () is rust-speak for void, for those of you not familiar with the language." rel="footnote">4</a></sup> or one that actually owns <em>no </em>data and can only be used for its side effect of limiting concurrency while the semaphore is &#8220;owned,&#8221; so to speak.<sup id="rf5-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn5-4958" title="In rust, there&rsquo;s actually (out of necessity) an entire class of&nbsp; types or values that can be changed even through read-only references, a concept known as interior mutability and exposed via types like AtomicXXX and the various std::cell Cell types &mdash; but those are generally to be used on a fine-grained level and in general you wouldn&rsquo;t be using them to make entire objects writeable via read-only references." rel="footnote">5</a></sup> (Actual mutation of accessed resources within the concurrency-limited region, if needed, would continue to be marshalled via <code>Mutex&lt;T&gt;</code> or <code>RwLock&lt;T&gt;</code> on a fine-grained level.)</p>
<p>Ok, so this explains why the rust standard library doesn&#8217;t contain a <code>Semaphore&lt;T&gt;</code> type to mirror <code>Mutex&lt;T&gt;</code> and its friends, but then what&#8217;s so hard about shipping a non-owning <code>std::sync::Semaphore</code> instead?</p>
<h2>Designing a safe <code>Semaphore</code> for rust</h2>
<p>To answer this, we need to look at what a semaphore API generally looks like in other languages. While the names and calling semantics differ, a semaphore is generally found as a type that provides the following, ordered from the most fundamental properties to the merely <em>de facto</em>:</p>
<ul>
<li>It is a type that can be used to limit concurrency to a resource or region of code, up to a dev-defined limit <em>n</em>.</li>
<li>It is a type that has a concept of &#8220;currently available concurrency,&#8221; which represents and tracks the remaining number of threads/stacks that can &#8220;obtain&#8221; the semaphore, thereby reducing its available concurrency and generally giving the calling thread access to the concurrency-limited region,</li>
<li>A semaphore can be created/declared with an &#8220;initially available concurrency&#8221; and a &#8220;maximum possible concurrency,&#8221; which may differ (indeed, &#8220;initially available concurrency&#8221; is often zero),</li>
<li>Semaphores don&#8217;t generally have a concept of ownership, meaning a thread (or <em>any</em> thread) can increment (up to the pre-defined limit) or decrement (down to zero) the available concurrency for a semaphore without having &#8220;obtained&#8221; or &#8220;created&#8221; it. (This is necessary otherwise it&#8217;d be impossible to initialize a semaphore with a lower initial concurrency limit than its maximum, because no thread could then increase it.)</li>
</ul>
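<p>To make these properties concrete, here&#8217;s a deliberately naive sketch of the same semantics using nothing but <code>std::sync</code> primitives (a counter behind a <code>Mutex</code> plus a <code>Condvar</code>). This is not the design we&#8217;ll develop below and all names here are illustrative, but it exhibits the separate initial/maximum counts and the &#8220;any thread may release&#8221; behavior described above:</p>
<pre><code class="language-rust">use std::sync::{Condvar, Mutex};

/// Naive counting semaphore: a Mutex'd count plus a Condvar.
/// Illustrative only; the article builds a lock-free version below.
pub struct NaiveSemaphore {
    count: Mutex&lt;u32&gt;,
    max_count: u32,
    cond: Condvar,
}

impl NaiveSemaphore {
    pub fn new(initial_count: u32, max_count: u32) -&gt; Self {
        assert!(initial_count &lt;= max_count);
        NaiveSemaphore {
            count: Mutex::new(initial_count),
            max_count,
            cond: Condvar::new(),
        }
    }

    /// Current available concurrency (for inspection).
    pub fn available(&amp;self) -&gt; u32 {
        *self.count.lock().unwrap()
    }

    /// Block until a unit of concurrency is available, then take it.
    pub fn wait(&amp;self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.cond.wait(count).unwrap();
        }
        *count -= 1;
    }

    /// Return a unit of concurrency. Note that *any* thread may call
    /// this, which is exactly what lets a semaphore start out with
    /// `initial_count` below `max_count` and be "opened up" later.
    pub fn release(&amp;self) {
        let mut count = self.count.lock().unwrap();
        assert!(*count &lt; self.max_count, "released past max concurrency");
        *count += 1;
        self.cond.notify_one();
    }
}

fn main() {
    // Starts fully closed: no caller may enter until someone releases.
    let sem = NaiveSemaphore::new(0, 2);
    sem.release();
    sem.wait(); // succeeds without blocking now
    assert_eq!(sem.available(), 0);
}
</code></pre>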
<p>It&#8217;s the last of these points that makes a semaphore so tricky to model in any language that prides itself on safety. A semaphore <em>could</em> act strictly as a variable-occupancy mutex (i.e. initial concurrency equals the max concurrency and each time it is obtained, it must be subsequently released by the same thread that obtained it), but that&#8217;s not generally a requirement that semaphores impose, and such a requirement would considerably limit the utility that a semaphore could offer.</p>
<p>Let&#8217;s look at some ways we might design such a semaphore in rust, some of which we actually tried while prototyping <code>rsevents_extra::Semaphore</code>.</p>
<p>Before anything else, let&#8217;s get the hard part out of the way by introducing you to <a href="https://docs.rs/rsevents/latest/rsevents/struct.AutoResetEvent.html" rel="nofollow"><code>rsevents::AutoResetEvent</code></a>, a one-byte<sup id="rf6-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn6-4958" title="Actually, the AutoResetEvent implementation only takes all of two bits, but let&rsquo;s just call it a byte to make everything nice and easy." rel="footnote">6</a></sup> synchronization primitive that takes care of putting threads to sleep when the event isn&#8217;t signalled/available and allowing one-and-only-one waiting thread to either consume the event (if it&#8217;s not already asleep) or to wake up (if it&#8217;s asleep waiting for the event) when the event is signalled (after which the event is atomically reset to a &#8220;not signalled&#8221; state). It doesn&#8217;t even have any spurious waits, making it really nice and easy to work with in a safe fashion. All of our <code>Semaphore</code> implementations will use this auto-reset event to take care of the synchronization and we&#8217;ll omit the details of when and where to call <a href="https://docs.rs/rsevents/latest/rsevents/struct.AutoResetEvent.html#method.set" rel="nofollow"><code>AutoResetEvent::set()</code></a> and <a href="https://docs.rs/rsevents/latest/rsevents/struct.AutoResetEvent.html#method.reset" rel="nofollow"><code>AutoResetEvent::reset()</code></a> for now.</p>
<p>So here&#8217;s what our initial semaphore skeleton looks like. We know we need an internal <code>count</code> of some integral type to keep track of the current concurrency (since we already established that it&#8217;s going to be variable and not just zero or one), and we know that at minimum a semaphore&#8217;s interface needs to provide a way to &#8220;obtain&#8221; the semaphore (decrementing the available concurrency for future callers) and a way to &#8220;release&#8221; the semaphore (at least to be used by a thread that has already obtained a &#8220;concurrency token&#8221; to re-increment the count after it is done and wants to give up its access to the concurrency-restricted region for some other caller to take).</p>
<pre><code class="language-rust">struct Semaphore {
    event: AutoResetEvent,
    count: AtomicU32,
    // TODO: Fill in the rest
}

impl Semaphore {
    /// Create a new `Semaphore`
    pub fn new(/* TODO */) -&gt; Semaphore { /* TODO */ }

    /// Obtain the `Semaphore`, gaining access to the protected concurrency
    /// region and reducing the semaphore's internal count. If the 
    /// `Semaphore` is not currently available, blocks sleeping until it
    /// becomes available and is successfully obtained.
    pub fn wait(&amp;self) -&gt; ??? { ... }

    /// Increment the semaphore's internal count, increasing the available
    /// concurrency limit.
    pub fn release(&amp;self, ...) { ... }
}
</code></pre>
<p>Our goal now is to fill in the blanks, attempting to make a semaphore that&#8217;s maximally useful and as safe as possible, while limiting its size and runtime checks (the costs of said safety).</p>
<h3>A constant maximum concurrency?</h3>
<p>We&#8217;ve already established that a semaphore needs to offer a tunable &#8220;maximum concurrency&#8221; parameter that decides the maximum number of threads that can have access to the concurrency-limited region at any given time. This number is typically supplied at the time the semaphore is instantiated, and it&#8217;s quite normal for it to be fixed thereafter: while the <em>current</em> available concurrency may be artificially constrained beyond the number of threads that have actually borrowed/obtained the semaphore, it&#8217;s OK for the absolute maximum to be unchangeable after a semaphore is created.</p>
<p>We really have only two choices here: we either add a <code>max_count</code> integral struct member to our <code>Semaphore</code> or we take advantage of <a href="https://practice.rs/generics-traits/const-generics.html" rel="follow">rust&#8217;s const generics</a> (another brilliant zero-cost abstraction!) to eliminate this field from our struct altogether&#8230; but at a considerable cost.</p>
<p>Let&#8217;s consider some constraints that might determine the maximum concurrency a semaphore can have:</p>
<ul>
<li>We&#8217;ve experimentally determined that our backup application works best with up to eight simultaneous threads in the file read part of the pipeline and up to two simultaneous threads in the file write part of the pipeline, to balance throughput against maxing out the disk&#8217;s available IOPS. Here we can use two separate semaphores, each with a different hard-coded maximum concurrency.</li>
<li>Almost all modern web browsers limit the maximum number of live TCP connections per network address to a hard-coded limit, typically 16. Internally, the browser has a semaphore for each domain name with an active TCP connection (perhaps in a <code>HashMap&lt;DomainName, Semaphore&gt;</code>), and blocks before opening a new connection if the maximum concurrency is exceeded.</li>
</ul>
<p>In these cases, we could get away with changing our <code>Semaphore</code> declaration to <code>Semaphore&lt;const MAX: usize&gt;</code> and using const generics when initializing a semaphore to specify the hard-coded maximum concurrency limit, e.g. <code>let sem = Semaphore::&lt;16&gt;::new(...)</code>. But const generics don&#8217;t just lock us into a hard-coded value specified at the time of object initialization; they lock us into a hard-coded value specified <em>at the time of writing the source code for the object&#8217;s initialization.</em> That means we <em>can&#8217;t</em> use a const generic parameter in lieu of a <code>max_concurrency</code> field in cases like the following, which we unfortunately can&#8217;t just ignore:</p>
<ul>
<li>We want to limit the number of threads accessing a code section or running some operation to the number (or a ratio of the number) of CPU cores the code is running on (not <em>compiled</em> on).</li>
<li>We want to let the user select the max concurrency at runtime, perhaps by parsing a <code>-j THREADS</code> argument or letting the user choose from a drop-down menu or numselect widget in a GUI application.</li>
<li>Going back to our backup application example, we want to use different read/write concurrency limits in the case of an SSD vs in the case of an old-school HDD.</li>
</ul>
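<p>For illustration, here&#8217;s what the two layouts look like side by side (type names hypothetical): the const-generic variant bakes <code>MAX</code> into the type and drops the field entirely, but the parameter has to be written as a literal in the source.</p>
<pre><code class="language-rust">use std::sync::atomic::AtomicU32;

// Hypothetical const-generic variant: MAX lives in the type, not in
// the struct, so no per-instance storage is needed for it.
struct ConstSemaphore&lt;const MAX: u32&gt; {
    count: AtomicU32,
    // (the AutoResetEvent field is omitted here for brevity)
}

// The runtime-configurable version must carry the limit with it:
struct RuntimeSemaphore {
    count: AtomicU32,
    max_count: AtomicU32,
}

fn main() {
    // MAX has to be spelled out in source code; `ConstSemaphore::&lt;n&gt;`
    // with `n` parsed from a `-j THREADS` flag simply won't compile.
    let _fixed = ConstSemaphore::&lt;16&gt; { count: AtomicU32::new(16) };

    // The const-generic type really is half the size:
    assert_eq!(std::mem::size_of::&lt;ConstSemaphore&lt;16&gt;&gt;(), 4);
    assert_eq!(std::mem::size_of::&lt;RuntimeSemaphore&gt;(), 8);
}
</code></pre>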
<p>This means that we&#8217;re going to have to approximately double the size of our <code>Semaphore</code> object, mirroring whatever type we&#8217;re using for <code>Semaphore::count</code> to add a <code>Semaphore::max_count</code> member (they have to be the same size because <code>count</code> can vary from zero to <code>max_count</code>), giving us the following, fairly complete struct declaration.</p>
<h3>A functionally complete semaphore</h3>
<p>As we&#8217;ve determined, we unfortunately can&#8217;t use rust&#8217;s const generics to roughly halve the space a <code>Semaphore</code> object takes in memory, giving us the following declaration:</p>
<pre><code class="language-rust">struct Semaphore {
    event: AutoResetEvent,
    count: AtomicU32,
    max_count: AtomicU32,
}
</code></pre>
<p>While we were forced to add a <code>max_count</code> field to store the maximum allowed concurrency for the semaphore, this wasn&#8217;t at all a violation of the &#8220;zero-cost abstraction&#8221; principle: if we want to allow the user to specify something at runtime and match against it later, this is the cost we invariably pay (whether in rust or in assembly).</p>
<p>As bare as our <code>Semaphore</code> structure is, this is actually enough to implement a completely functional and &#8211; for the most part &#8211; safe semaphore. Thanks to how <code>AutoResetEvent</code> is implemented internally and its own safety guarantees, we can get away with just some extremely careful use of atomics, without a lock or mutex of any sort:</p>
<pre><code class="language-rust">impl Semaphore {
    /// Create a new `Semaphore`
    pub fn new(initial_count: u32, max_count: u32) -&gt; Semaphore {
        Semaphore { 
            event: AutoResetEvent::new(EventState::Unset),
            count: AtomicU32::new(initial_count),
            max_count: AtomicU32::new(max_count),
        }
    }

    /// Obtain the `Semaphore`, gaining access to the protected concurrency
    /// region and reducing the semaphore's internal count. If the 
    /// `Semaphore` is not currently available, blocks sleeping until it
    /// becomes available and is successfully obtained.
    pub fn wait(&amp;self) {
        let mut count = self.count.load(Ordering::Relaxed);

        loop {
            count = if count == 0 {
                // Semaphore isn't available, sleep until it is.
                self.event.wait();
                // Refresh the count after waking up
                self.count.load(Ordering::Relaxed)
            } else {
                // We can't just fetch_sub(1) because it might underflow in a race
                match self.count.compare_exchange(count, count - 1, 
                    Ordering::Acquire, Ordering::Relaxed)
                {
                    Ok(_) =&gt; {
                        // We officially obtained the semaphore.
                        // If the (now perhaps stale) new `count` value is non-zero, 
                        // it's our job to wake someone up (or let the next caller in).
                        if count - 1 &gt; 0 {
                            self.event.set();
                        }
                        break;
                    },
                    Err(count) =&gt; count, // Refresh the cached value and try again
                }
            }
        }

        // This must hold true at all times!
        assert!(count &lt;= self.max_count.load(Ordering::Relaxed));
    }

    /// Increment the semaphore's internal count, increasing the available
    /// concurrency limit. 
    pub fn release(&amp;self) { 
        // Try to increment the current count, but don't exceed max_count
        let old_count = self.count.fetch_add(1, Ordering::Release);
        if old_count + 1 &gt; self.max_count.load(Ordering::Relaxed) {
            <span style="color: #ff0000;"><strong>panic!("Attempt to release past max concurrency!");</strong></span>
        }
        // Check if we need to wake a sleeping waiter
        if old_count == 0 {
            self.event.set();
        }
    }
}
</code></pre>
<p>If you&#8217;re familiar with atomic <abbr title="compare-and-swap">CAS</abbr>, the annotated source code above should be fairly self-explanatory. In case you&#8217;re not, briefly, <code>AtomicXXX::compare_exchange()</code> is how most lock-free data structures work: you first read the old value (via <code>self.count.load()</code>) and decide based on it what the new value should be, then use <code>compare_exchange()</code> to change <code>$old</code> to <code>$new</code>, <em>if and only if</em> the value is still <code>$old</code> and hasn&#8217;t been changed by another thread in the meantime (if it <em>has</em> changed, we re-read to get the new value and try all over again until we succeed).</p>
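<p>In isolation, that read/decide/<code>compare_exchange</code> retry loop looks like this (a standalone sketch of the same decrement-that-never-underflows pattern, with hypothetical names):</p>
<pre><code class="language-rust">use std::sync::atomic::{AtomicU32, Ordering};

/// Decrement `n` unless it is already zero, returning success.
fn try_decrement(n: &amp;AtomicU32) -&gt; bool {
    let mut current = n.load(Ordering::Relaxed);
    loop {
        if current == 0 {
            return false; // nothing to take; a fetch_sub would underflow
        }
        match n.compare_exchange(current, current - 1,
            Ordering::Acquire, Ordering::Relaxed)
        {
            Ok(_) =&gt; return true,
            // Another thread won the race; Err hands us the freshly
            // observed value, so retry with it instead of re-loading.
            Err(observed) =&gt; current = observed,
        }
    }
}

fn main() {
    let n = AtomicU32::new(2);
    assert!(try_decrement(&amp;n));
    assert!(try_decrement(&amp;n));
    assert!(!try_decrement(&amp;n)); // already zero: declined, no underflow
    assert_eq!(n.load(Ordering::Relaxed), 0);
}
</code></pre>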
<p>With the <code>Semaphore</code> skeleton filled out, we now have a functionally complete semaphore with features comparable to those available in other languages. At the moment, <code>Semaphore::wait()</code> returns nothing and any thread that&#8217;s obtained the semaphore must ensure that <code>Semaphore::release()</code> is called before it returns, but that&#8217;s just an ergonomic issue that we can easily work around by returning a concurrency token that calls <code>Semaphore::release()</code> when it&#8217;s dropped instead.<sup id="rf7-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn7-4958" title="We would still have to keep Semaphore::release() around and make sure it can be publicly called so that a semaphore initialized with { count: m, max_count: n, ... } with&nbsp;m &ge; 0 and&nbsp;n &gt; m can be used." rel="footnote">7</a></sup></p>
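<p>Such a token might look like the following sketch (names hypothetical, shown against a pared-down count-only semaphore so it stands alone): a guard borrowing the semaphore whose <code>Drop</code> implementation calls <code>release()</code>, in the spirit of std&#8217;s <code>MutexGuard</code>.</p>
<pre><code class="language-rust">use std::sync::atomic::{AtomicU32, Ordering};

// Pared-down stand-in for the Semaphore above: just the count.
struct Semaphore {
    count: AtomicU32,
}

/// Hypothetical concurrency token: calls release() when dropped, so
/// a thread that obtained the semaphore can't forget to give it back.
struct SemaphoreGuard&lt;'a&gt;(&amp;'a Semaphore);

impl Semaphore {
    fn new(count: u32) -&gt; Self {
        Semaphore { count: AtomicU32::new(count) }
    }

    fn wait(&amp;self) -&gt; SemaphoreGuard&lt;'_&gt; {
        // (The real wait() blocks on the event first; elided here.)
        self.count.fetch_sub(1, Ordering::Acquire);
        SemaphoreGuard(self)
    }

    fn release(&amp;self) {
        self.count.fetch_add(1, Ordering::Release);
    }
}

impl Drop for SemaphoreGuard&lt;'_&gt; {
    fn drop(&amp;mut self) {
        self.0.release();
    }
}

fn main() {
    let sem = Semaphore::new(2);
    {
        let _token = sem.wait();
        assert_eq!(sem.count.load(Ordering::Relaxed), 1);
    } // _token dropped here: release() runs automatically
    assert_eq!(sem.count.load(Ordering::Relaxed), 2);
}
</code></pre>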
<p>But can we make it safer?</p>
<h3>The problem with <code>Semaphore::release()</code></h3>
<p>For the remainder of this article, we&#8217;ll be focusing on the ugly part of a semaphore, the part that makes writing a truly safe semaphore so challenging: <code>Semaphore::release()</code>.</p>
<p>In rust, the standard library concurrency primitives all return &#8220;scope guards&#8221; that automatically release (or poison) their associated synchronization object at the end of the scope or in case of panic. This actually isn&#8217;t just a question of ergonomics (as we mentioned before), it&#8217;s a core part of their safety. On Windows, you can create a mutex/critical section with <code>InitializeCriticalSection()</code> and obtain it with <code>EnterCriticalSection()</code> but any thread can call <code>LeaveCriticalSection()</code>, <a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-leavecriticalsection#:~:text=If%20a%20thread%20calls%20LeaveCriticalSection%20when%20it%20does%20not%20have%20ownership%20of%20the%20specified%20critical%20section%20object%2C%20an%20error%20occurs%20that%20may%20cause%20another%20thread%20using%20EnterCriticalSection%20to%20wait%20indefinitely." rel="follow">even if it invokes undefined behavior that may cause a deadlock</a>. On Linux and its pals, <code>pthread_mutex_init()</code> can be used to create a mutex and <code>pthread_mutex_lock()</code> can be used to enter the mutex. While <code>pthread_mutex_unlock()</code> <a href="https://linux.die.net/man/3/pthread_mutex_unlock#:~:text=The%20pthread_mutex_unlock()%20function%20may,not%20own%20the%20mutex." rel="follow">is documented as <em>may</em> return an <code>EPERM</code> error</a> if the current thread doesn&#8217;t own the mutex, the notes clarify that the current owning thread id isn&#8217;t stored and deadlocks are allowed in order to avoid overhead &#8211; meaning it&#8217;s implementation-defined whether or not <code>pthread_mutex_unlock()</code> actually protects against unlocking from a different thread.
In practice, it doesn&#8217;t.<sup id="rf8-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn8-4958" title="You can see a sample application testing for pthread_mutex_unlock() safety here and try it out for yourself online here." rel="footnote">8</a></sup></p>
<p>Rust, as a partially object-oriented language, sidesteps all this via our good friend <abbr title="Resource Acquisition Is Initialization">RAII</abbr> by simply making the equivalent of <code>Mutex&lt;T&gt;::unlock(&amp;self)</code> unavailable except via the scope guard (<a href="https://doc.rust-lang.org/nightly/std/sync/struct.MutexGuard.html" rel="follow"><code>MutexGuard</code></a>) returned by <code>Mutex&lt;T&gt;::lock(&amp;self)</code> (and the same for <code>RwLock&lt;T&gt;</code>). The type system prevents you from calling the equivalent of <code>pthread_mutex_unlock()</code> unless you already own a <code>MutexGuard</code>, which can only be acquired as a result of a successful call to <code>Mutex::lock()</code> &#8211; without needing any runtime code to check whether or not the calling thread is the owning thread, because the type system provides that safety guarantee at zero cost.</p>
<p>Unfortunately, as I mentioned earlier in passing, even if we did make our <code>Semaphore::wait()</code> function return some sort of concurrency token/scope guard, it would at best be an <em>ergonomic</em> improvement to make sure one doesn&#8217;t forget to call <code>Semaphore::release()</code> before exiting the scope. It wouldn&#8217;t allow us to eliminate a publicly callable <code>Semaphore::release()</code> function that any thread could call at any time, at least not without making it impossible to create a semaphore with an initial count that doesn&#8217;t equal the maximum count, and not without making the &#8220;current maximum concurrency&#8221; limit non-adjustable at runtime.</p>
<h3>Do we even need <code>Semaphore::release()</code> anyway?</h3>
<p>At this point, you might be tempted to ask whether we really, truly need these options and what purpose they serve &#8211; a good and very valid question, given the cost we have to pay to support them, and especially because in all the cases we mentioned above (those with a fixed <code>max_count</code> and those without), you&#8217;d always create a semaphore with <code>initial_count == max_count</code>. Here are some hopefully convincing reasons:</p>
<ul>
<li>Semaphores aren&#8217;t just used to limit concurrency up to a maximum pre-defined limit, they&#8217;re also used to temporarily artificially constrain available concurrency to a value in the range of <code>[0, max_count]</code> &#8211; in fact, this is where you might find their greatest utility.</li>
<li>CS principle: Preventing a &#8220;<a href="https://en.wikipedia.org/wiki/Thundering_herd_problem" rel="follow">thundering herd</a>&#8221; when a number of threads are blocked waiting on a semaphore to allow them access to a resource or code region. You can have up to <em>n</em> threads performing some task at the same time, but you don&#8217;t want them to all start synchronously because it&#8217;ll cause unwanted contention on some shared state, e.g.
<ul>
<li>Perhaps an atomic that needs to be incremented, and you don&#8217;t want to waste cycles attempting to satisfy <code>compare_exchange()</code> calls under contention or you otherwise have a traditional fine-grained lock or even lock-free data structure where uncontended access is ~free but contention wastes CPU cycles busy-waiting in a spinlock or even puts threads to sleep;</li>
<li>Starting operations at ~the same time can run into real-world physical limitations, e.g. it takes <em>n</em> IO threads to achieve maximum saturation of available <em>bulk</em> bandwidth, but the initial seeks can&#8217;t satisfy as many IOPS without unduly increasing latency and degrading response times.</li>
<li>The semaphore controls the number of threads downloading in parallel but you don&#8217;t want to DOS the DNS or HTTP servers by flooding them with simultaneous requests, though you do ultimately want <em>n</em> threads handling the response, downloading the files, etc. in parallel.</li>
</ul>
</li>
<li>CS principle: Making available concurrency accrue over time, up to a maximum saturation limit (represented by <code>max_count</code>), for example:
<ul>
<li>A family of AWS EC2 virtual machine instances have what is referred to as &#8220;<a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html" rel="follow">burstable performance</a>&#8221; where for every <em>m</em> minutes of low CPU usage, you get <em>t</em> time slices of &#8220;more than what you paid for&#8221; instantaneous CPU performance, up to <em>n</em> maximum time slices accrued. As a simplified example, you &#8220;rent&#8221; a virtual machine with a lower nominal CPU speed of 2.0 GHz and for every 5 minutes with a p95 per-5-second-time-slice CPU utilization below 50%, you get a &#8220;free&#8221; 5-second-time-slice of 3.0 GHz compute. If you launch a burstable instance and immediately try running Prime95, you&#8217;ll be capped to 2.0 GHz, but if you are running nginx and serving web requests under relatively no load, then after 10 minutes you&#8217;ll have accrued 10 seconds of &#8220;free&#8221; 3.0 GHz performance. When you suddenly get a burst of traffic because a link to your site was just shared on an iMessage or WhatsApp group and 200 phones are hitting your server to generate a web preview, you can &#8220;cash in&#8221; those accrued 3.0 GHz time slices to satisfy those requests quickly, after which you&#8217;re constrained to the baseline 2.0 GHz performance until the requests settle down and you begin to accrue 3.0 GHz time slices once again.</li>
<li>While you can implement a rate limiter by giving each connected user/IP address a maximum number of requests they can make per 5 minutes and resetting that limit every 5 minutes, that still means your server can be DDOS&#8217;d by <em>n</em> clients saturating their limit of 100-requests-per-5-minutes in the first few seconds after establishing a connection. You can instead give each client a semaphore<sup id="rf9-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn9-4958" title="After all, if we use an AtomicU8 to represent the max/initial count, they can be as small as three bytes each!" rel="footnote">9</a></sup> with an initial available count of, say, 2 and then every three seconds increment the available concurrency by one; meaning clients would have to be connected for a full 5 minutes before they can hit you with 100 requests in one go (making it more expensive to mount an attack and giving your firewall&#8217;s heuristics a chance to disconnect them).</li>
</ul>
</li>
</ul>
<p>Hopefully the above reasons demonstrate the utility in being able to create a <code>Semaphore</code> with an initial count lower than the maximum count, and therefore, the need to have a free-standing <code>Semaphore::release()</code> function that, at the very least, the creator of the semaphore can call even without having previously made a corresponding call to <code>Semaphore::wait()</code>.</p>
<h3>Can we make <code>Semaphore::release()</code> safer?</h3>
<p>There&#8217;s an immediately obvious way to make <code>Semaphore::release()</code> safer to call: replace it with a <code>Semaphore::try_release()</code> method that first checks that the semaphore&#8217;s core integrity predicate (<code>self.count &lt;= self.max_count</code>) would still be upheld, instead of blindly incrementing <code>self.count</code> and then panicking if it exceeds <code>self.max_count</code>. It returns <code>true</code> or <code>false</code> to denote whether or not the operation completed successfully.</p>
<p>It&#8217;s actually not that hard of a change to make, provided you&#8217;re familiar with <a href="https://en.wikipedia.org/wiki/Compare-and-swap" rel="follow">compare-and-exchange atomics</a> (which we used above in safely implementing <code>Semaphore::wait()</code> to prevent two threads from racing to increment <code>self.count</code> after checking that it&#8217;s less than <code>self.max_count</code>), which we&#8217;ll use to write our <code>try_release()</code> method now:</p>
<pre><code class="language-rust">impl Semaphore {
    /// Attempts to increment the `Semaphore`'s internal count,
    /// returning `false` if it would exceed the maximum allowed.
    pub fn try_release(&amp;self) -&gt; bool {
        let mut prev_count = self.count.load(Ordering::Relaxed);
        loop {
            if prev_count == self.max_count {
                // Incrementing self.count would violate our core precondition
                return false;
            }

            match self.count.compare_exchange_weak(
                prev_count, // only exchange `count` if it's still `prev_count`
                prev_count + 1, // what to exchange `count` with
                Ordering::Release, Ordering::Relaxed)
            {
                // The CAS succeeded
                Ok(_) =&gt; {
                    if prev_count == 0 {
                        // Wake a waiting thread
                        self.event.set();
                    }
                    return true;
                }
                // If it failed, refresh `prev_count` and retry, failing if 
                // another thread has caused `prev_count` to equal `max_count`.
                Err(new_count) =&gt; prev_count = new_count,
            }
        }
    }
}
</code></pre>
<p>This <em>is</em> much better. We only increment <code>count</code> if we can guarantee that doing so won&#8217;t cause it to exceed <code>max_count</code>, and we relay the result to the caller. We can now guarantee that a call to <code>Semaphore::try_release()</code> that wasn&#8217;t paired with a previous <code>Semaphore::wait()</code> won&#8217;t inadvertently violate the absolute limit on the allowed concurrency.</p>
<p>But can you spot the problem? Look at the code sample carefully, and consider it not in the context of a single call to <code>Semaphore::try_release()</code> but as a whole.</p>
<p>If you think you&#8217;ve figured it out or you just want to skip to the answers, read on to find out where the problem still lies. (Yes, this is just a filler paragraph to avoid your eyes immediately seeing the answer while you think this over. Hopefully you haven&#8217;t already scrolled down too much so that the answer is already in view. Sorry, I can&#8217;t really do better than this, my typesetting system doesn&#8217;t let me insert empty vertical whitespace very elegantly.)</p>
<p>The issue is that we&#8217;ve prevented <em>this</em> call to <code>Semaphore::try_release()</code> from incrementing <code>count</code> past <code>max_count</code>, but we might already have extant threads chugging away in parallel that have already obtained the semaphore, oblivious to the fact that <code>count</code> has changed from under them. The problem is easy to spot if we keep a copy of our old <code>Semaphore::release()</code> around and mix-and-match between it and <code>try_release()</code>, with calls we <em>expect</em> to always succeed using the former and calls that may overflow the current count using the latter:</p>
<pre><code class="language-rust">fn main() {
    // Create a semaphore with a count of 1 and max count of 2
    let semaphore = Semaphore::new(1, 2);

    // Use scoped threads so the worker closures can borrow `semaphore`
    std::thread::scope(|scope| {
        // Create a number of threads that will use the semaphore
        // to make sure concurrency limits are observed.
        for _ in 0..NUM_CPUS {
            scope.spawn(|| {
                // Make sure to obtain a concurrency token before beginning work
                semaphore.wait();

                // TODO: Replace `sleep()` with actual work!
                std::thread::sleep(Duration::from_secs(5));

                // Release the concurrency token when we're done so another
                // thread may enter this code block and do its thing.
                semaphore.release();
            });
        }

        while !work_finished {
            // &lt;Read user input from keyboard here&gt;

            // In response to a user command, increase concurrency
            if !semaphore.try_release() {
                eprintln!("Cannot raise limit, max already reached!");
            }
        }
    });
}
</code></pre>
<p>We&#8217;re using our semaphore to limit the number of simultaneous worker threads, originally operating at 50% of our maximum concurrency limit (<code>count == 1</code>). To start, one of <em>NUM_CPUS</em><sup id="rf10-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn10-4958" title="For the sake of this example, let&rsquo;s assume NUM_CPUS is a sufficiently large number like 4 or 8, so that enough worker threads will try to enter the semaphore-protected region." rel="footnote">10</a></sup> worker threads obtains the semaphore (<code>count == 0</code>) to access the concurrency limited region and the rest are locked out.</p>
<p>In the main thread&#8217;s event loop, we wait for work to finish or a user command to be entered at the keyboard (omitted from the example).<sup id="rf11-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn11-4958" title="rsevent-extra&lsquo;s CountdownEvent might just be the perfect tool for this job, btw!" rel="footnote">11</a></sup> The user keys in an input that&#8217;s interpreted as &#8220;try to speed things up by increasing the concurrency limit,&#8221; and in response the main thread calls <code>Semaphore::try_release()</code> and succeeds in incrementing its internal <code>self.count</code> (now <code>count == 1</code> again), thereby allowing a second thread to enter the semaphore-protected region (giving <code>count == 0</code>) and contribute to progress.</p>
<p>But what happens now that two threads are already in the semaphore-protected scope but the user triggers a second call to <code>Semaphore::try_release()</code>? As far as the semaphore knows, <code>count</code> (equal to 0) is less than <code>max_count</code> and can be safely incremented, yielding <code>count == 1</code> and thereby unblocking yet another worker thread sleeping in <code>Semaphore::wait()</code> (reducing <code>count</code> to zero).</p>
<p>But what does that mean? Even though our core precondition <em>hasn&#8217;t</em> been violated at this instant (<code>Semaphore::count &lt;= Semaphore::max_count</code>), we now have <em>three</em> worker threads in the concurrency-limited region, exceeding the hard-coded <code>max_count</code> provided during the semaphore&#8217;s initialization! In our example above, each worker thread, having first obtained a concurrency token via <code>Semaphore::wait()</code>, assumes that it can safely call the infallible <code>Semaphore::release()</code> method when it&#8217;s done; but when the last worker thread finishes its work, it&#8217;ll end up incrementing <code>count</code> past <code>max_count</code> and panicking in the process.</p>
<p>Of course, the panic isn&#8217;t the problem &#8211; it&#8217;s just reporting the violation when it sees it. We could replace all <code>Semaphore::release()</code> calls with <code>Semaphore::try_release()</code> and we&#8217;d still have a problem: there are more worker threads in the semaphore-protected region than allowed, and one of the &#8220;shouldn&#8217;t fail&#8221; calls to <code>Semaphore::try_release()</code> will eventually return <code>false</code>, whether that triggers a panic or not.</p>
<p>The crux of the matter is that we borrow from <code>Semaphore::count</code> but don&#8217;t have a way to track the total &#8220;live&#8221; concurrency count available (the remaining concurrency tokens in <code>self.count</code> plus any temporarily-unavailable-but-soon-to-be-returned concurrency tokens borrowed by the various worker threads). And, importantly, <strong>we don&#8217;t need this for any core semaphore functionality but rather only to protect against a user/dev calling <code>Semaphore::release()</code> more times than they&#8217;re allowed to</strong>. In a perfect world, it would be the developer&#8217;s responsibility to make sure that there are never more outstanding concurrency tokens than allowed, and we could omit these safety checks altogether. She could statically ensure it&#8217;s the case by only ever pairing together <code>wait()</code> and <code>release()</code> calls (perhaps after issuing a fixed, verifiable number of <code>release()</code> calls upon init), or by tracking <em>when and where needed</em> the number of free-standing calls to <code>release()</code> in an external variable to make sure the limit is never surpassed.</p>
<h3>At last, a truly safe semaphore?</h3>
<p>While we could offload the cost of verifying that there will never be more than <code>max_count</code> calls to <code>Semaphore::release()</code> to the developer, we&#8217;re rust library authors and we hold ourselves to a higher standard. We could most certainly do just that, but we would have to mark <code>Semaphore::release()</code> as an <code>unsafe</code> function to warn users that there are preconditions to calling it and dangers to be had if calling it willy-nilly without taking them into account. But what if we want a synchronization type worthy of inclusion in the standard library, one that&#8217;s both safe and easy to use? What then?</p>
<p>There are actually several approaches we can take to solving this gnarly problem. The easiest and safest solution would be to simply change our <code>Semaphore</code> definition, adding another field of the same integral count type, perhaps called <code>total_count</code> or <code>live_count</code> that would essentially track <code>initial_count</code> plus the number of successful free-standing calls to <code>Semaphore::release()</code>, <em>but not decrement it when a call to <code>Semaphore::wait()</code> is made</em>.<sup id="rf12-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn12-4958" title="Another way would be to pack count and max_count into a single double-width atomic (assuming such an atomic of this size exists) and to decrement&nbsp;both count and max_count when a call to Semaphore::wait() is made. This way, any calls to Semaphore::release() would compare the potential increase in count against a similarly decremented max_count and can catch any violations of our core precept. The issues described in the remainder of this article persist regardless of which method was chosen." rel="footnote">12</a></sup> This way each time <code>try_release()</code> is called, we can check if <code>total_count</code> (and not <code>count</code>) would exceed <code>max_count</code> &#8212; and continue to use just <code>self.count</code> for the core semaphore functionality in <code>Semaphore::wait()</code>:</p>
<pre><code class="language-rust">struct Semaphore {
    event: AutoResetEvent,
    count: AtomicU32,
    total_count: AtomicU32,
    max_count: AtomicU32,
}

impl Semaphore {
    /// Attempts to increment the `Semaphore`'s internal count,
    /// returning `false` if it would exceed the maximum allowed.
    pub fn try_release(&amp;self) -&gt; bool {
        // First try incrementing `total_count`:
        let mut prev_count = self.total_count.load(Ordering::Relaxed);
        loop {
            if prev_count == self.max_count.load(Ordering::Relaxed) {
                // Incrementing self.total_count would violate our core precondition
                return false;
            }

            match self.total_count.compare_exchange_weak(
                prev_count, // only exchange `total_count` if it's still `prev_count`
                prev_count + 1, // what to exchange `total_count` with
                Ordering::Relaxed, Ordering::Relaxed)
            {
                // If the CAS succeeded, continue to the next phase
                Ok(_) =&gt; break,
                // If it failed, refresh `prev_count` and retry, failing if 
                // another thread has caused `prev_count` to equal `max_count`.
                Err(new_count) =&gt; prev_count = new_count,
            }
        }

        // Now increment the actual available count:
        let prev_count = self.count.fetch_add(1, Ordering::Release);
        if prev_count == 0 {
            // Wake up a sleeping waiter, if any
            self.event.set();
        }

        return true;
    }
}
</code></pre>
<h3>But now, something else breaks!</h3>
<p>Let&#8217;s go back to our last example with many worker threads attempting to access a concurrency-restricted region and a main event loop that reads user commands that can affect the available concurrency limit. To make this more relatable, let&#8217;s use a real-world example: we&#8217;re developing a command-line BitTorrent client and we want the user to control the number of simultaneous uploads or downloads, up to some limit (that might very well be <code>u32::MAX</code>). In the last example we were focused on user commands that <em>increased</em> the available concurrency limit, in our real-world example perhaps reflecting a user trying to speed up ongoing downloads or seeds by allowing more concurrent files or connections.</p>
<p>But this isn&#8217;t a one-way street! Just as a user may wish to unthrottle BitTorrent downloads, they might very well wish to throttle them to make sure they&#8217;re just seeding in the background without saturating their Top Tier Cable Provider&#8217;s pathetic upload speed and killing their internet connection in the process. How do we do <em>that</em> safely?</p>
<p>One way would be to introduce a method to directly decrement our semaphore&#8217;s <code>self.total_count</code> and <code>self.count</code> fields (down to zero), but what do we do if <code>total_count</code> is non-zero but <code>count</code> is zero (i.e. all available concurrency is currently in use)? Setting aside the fact that we&#8217;re using unsigned atomic integers to store the counts, we could decrement <code>count</code> (but not <code>total_count</code>) past zero, for example to <code>-1</code>, and let the &#8220;live&#8221; concurrency eventually settle at the now-reduced <code>total_count</code> after a sufficient number of threads finish their work and leave the concurrency-limited region.</p>
<p>But we don&#8217;t actually need to do any of that: by its nature a semaphore already provides a way to artificially reduce the available concurrency by means of a free-standing call to <code>Semaphore::wait()</code>, i.e. calling <code>wait()</code> without calling <code>release()</code> afterwards. It ensures that the <code>count</code> isn&#8217;t reduced until it&#8217;s non-zero and that <code>count</code> never exceeds <code>max_count</code> or <code>total_count</code> at any time, not even temporarily.</p>
<p>Unfortunately, herein lies a subtle problem. With our revised semaphore implementation, we increase both <code>count</code> and <code>total_count</code> when the free-standing <code>release()</code> is called and assume that each call to <code>Semaphore::wait()</code> will have a matching call to some <code>Semaphore::special_release()</code> that increases <code>count</code> without touching <code>total_count</code>. This way, <code>total_count</code> tracks the &#8220;total available&#8221; concurrency, assuming that it&#8217;s equal to &#8220;remaining concurrency limit&#8221; plus &#8220;outstanding calls to <code>wait()</code> that haven&#8217;t yet called <code>xxx_release()</code>.&#8221;</p>
<p>While free-standing calls to <code>Semaphore::release()</code> were our problem before, here we&#8217;ve shifted that to an issue with free-standing calls to <code>Semaphore::wait()</code> &#8211; an admittedly less hairy situation but, as we have seen, still not one that we can afford to ignore!</p>
<p>More importantly, even if we weren&#8217;t using free-standing calls to <code>Semaphore::wait()</code> to artificially reduce the available concurrency, we actually <em>can&#8217;t</em> guarantee that <code>release()</code> is always called after <code>wait()</code>: it&#8217;s a form of the halting problem, and <em>even if</em> we ignore panics and have <code>wait()</code> return a scope guard that automatically calls <code>release()</code> when it&#8217;s dropped, it&#8217;s <em>still</em> completely safe for a user to call <code>std::mem::forget(scope_guard)</code> thereby preventing <code>release()</code> from being called!<sup id="rf13-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn13-4958" title="In rust parlance, memory leaks do not fall under the safety guarantees of the language and it&rsquo;s perfectly &ldquo;safe&rdquo; if not exactly cromulent to write code that doesn&rsquo;t drop() RAII resources." rel="footnote">13</a></sup></p>
<p>Fundamentally, we can&#8217;t really solve this problem. We either err on the side of potentially allowing too many free-standing calls to <code>release()</code> to be made, with safety checks delaying the overflow of <code>max_count</code> until the last borrowed concurrency token is returned after a call to <code>wait()</code>; or we err on the side of prudence and incorrectly prevent a free-standing call to <code>release()</code> from going through, because we don&#8217;t know (and <em>can&#8217;t</em> know) whether a thread that previously called <code>wait()</code> and took one of our precious concurrency tokens has decided it&#8217;s never going to return it.</p>
<p>But don&#8217;t despair! Do you remember the old schoolyard riddle? You&#8217;re silently passed a pen and notebook, on which you see the following:</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-4972 colorbox-4958" src="https://neosmart.net/blog/wp-content/uploads/2022/09/Make-this-line-shorter.png" alt="" width="525" height="509" srcset="https://neosmart.net/blog/wp-content/uploads/2022/09/Make-this-line-shorter.png 525w, https://neosmart.net/blog/wp-content/uploads/2022/09/Make-this-line-shorter-309x300.png 309w" sizes="auto, (max-width: 525px) 100vw, 525px" /></p>
<p>Like with the riddle,<sup id="rf14-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn14-4958" title="Give up? Just draw another, longer line next to it!" rel="footnote">14</a></sup> we can implement scope guards to make it more likely that every call to <code>wait()</code> is matched by a call to <code>release()</code>, but we can&#8217;t actually stop the user from calling <code>std::mem::forget(sem.wait())</code> &#8211; and we don&#8217;t have to. Without trying to think of ways to cause a compiler error when a scope guard is leaked and not dropped, we can still make it, if not hard, then at least <em>harder</em> for the user to leak a scope guard and throw our internal count off. How? Not by hiding the ability to forget a scope guard but by highlighting it front and center!</p>
<p>Let&#8217;s fast forward to our semaphore from above, but modified to return a scope guard instead to encourage returning concurrency tokens back to the semaphore when a thread has finished with a concurrency-limited operation:</p>
<pre><code class="language-rust">/// This concurrency token is returned from a call to `Semaphore::wait()`.
/// It's automatically returned to the semaphore upon drop, incrementing
/// the semaphore's internal available concurrency counter once more.
struct ConcurrencyToken&lt;'a&gt; {
    sem: &amp;'a Semaphore
}

impl Semaphore {
    pub fn wait&lt;'a&gt;(&amp;'a self) -&gt; ConcurrencyToken&lt;'a&gt; {
        include!("earlier definition of Semaphore::wait() here");

        // Now instead of returning () we return a ConcurrencyToken

        return ConcurrencyToken {
            sem: self,
        }
    }

    /// Directly increments the internal concurrency count without touching 
    /// `total_count` and without checking if it would exceed `max_count`.
    unsafe fn release_internal(&amp;self) {
        let prev_count = self.count.fetch_add(1, Ordering::Release);

        // We only need to wake a sleeping waiter if the previous count
        // was zero. In all other cases, no one will be asleep.
        if prev_count == 0 {
            self.event.set();
        }
    }
}

impl Drop for ConcurrencyToken&lt;'_&gt; {
    fn drop(&amp;mut self) {
        unsafe { self.sem.release_internal(); }
    }
}
</code></pre>
<p>This was our initial attempt at &#8220;strongly encouraging&#8221; calls to <code>Semaphore::wait()</code> to always be paired with calls to <code>Semaphore::release_internal()</code> (called by the <code>ConcurrencyToken</code> on drop), which increments <code>count</code> without touching <code>total_count</code> so our logic in <code>Semaphore::try_release()</code> can continue to work.</p>
<p>As we said though, if one were to call <code>std::mem::forget(sem.wait())</code>, the <code>ConcurrencyToken</code> would be forgotten without <code>release_internal()</code> ever being called, and the count we track in <code>total_count</code> would be off by one, preventing a free-standing call to <code>Semaphore::release()</code> that should have been allowed.</p>
<p>So what if we just add a new method to our concurrency token? With a <code>ConcurrencyToken::forget()</code> method, reaching for <code>std::mem::forget()</code> is actually <em>harder</em> than just calling <code>Semaphore::wait().forget()</code> directly! (See, I really was going somewhere with that riddle!)</p>
<pre><code class="language-rust">impl ConcurrencyToken&lt;'_&gt; {
    /// It is a violation of this crate's contract to call `std::mem::forget()`
    /// on the result of `Semaphore::wait()`. To forget a `ConcurrencyToken`, 
    /// use this method instead.
    pub fn forget(self) {
        // We're keeping `count` permanently reduced, but we need to decrement
        // `total_count` to reflect this as well before forgetting ourselves.
        self.sem.total_count.fetch_sub(1, Ordering::Relaxed);
        std::mem::forget(self);
    }
}
</code></pre>
<p>And just like that, we now have something we can reasonably call a &#8220;safe&#8221; semaphore, worthy of rust!</p>
<h2>The price we pay for safety</h2>
<p>While I can&#8217;t say with complete confidence that this is the optimal implementation of a safe semaphore (exposing the same functionality), our journey above is still representative of the constant tug-of-war that takes place when trying to build an API as you juggle performance, the desire for zero-cost abstractions, and the imperative of surfacing a (within the boundaries of reasonable use) safe and misuse-resistant interface.</p>
<p>We started with something completely simple: two words for the count and a single byte auto-reset event we could use to impose strict ordering and optimized waiting/sleeping in cases of contention. Correctness (which, if you squint at it in a particular way, is just another kind of safety) mandated the use of atomics from the very start, preventing us from playing fast and loose with our integer math and imposing heavy penalties when it came to ensuring cache coherence and cross-core integrity. Then, just when we thought we had everything figured out, we needed to completely change our approach and even add a new struct member to boot (raising the size of the <code>Semaphore</code> struct by 33-45% depending on the integer width, which sounds really scary until you realize it&#8217;s still just a handful of bytes).</p>
<p>There are of course other possible solutions to the same problem, all of which potentially have their own drawbacks.<sup id="rf15-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn15-4958" title="I still need to sit down and experiment with packing count and max_count into one atomic double-width word and see how it works to decrement both count and max_count after each call to wait() instead of tracking with an additional total_count field, but even there we&rsquo;d have a price to pay. We can no longer use AtomicXXX::fetch_add() and we&rsquo;d have to use compare_exchange_weak() in a loop, after fetching the initial value, separating it into its component fields, incrementing/decrementing, then combining the fields into a single word again &ndash; although a quick godbolt attempt shows the compiler actually does a rather nice job." rel="footnote">15</a></sup> And even if there are cost-free solutions here, the general picture isn&#8217;t unique to semaphores or even concurrency primitives in general: it&#8217;s a story that&#8217;s played on repeat and comes up every time an interface needs to be added that has some caveats the caller needs to keep in mind. Writing correct code is hard, writing <em>safe</em> and correct code is harder. But, in my opinion, <em>this</em> is what actually makes rust special.</p>
<p>Rust&#8217;s concepts of ownership and exclusive RO/RW semantics play a huge role in making it such a popular language with low-level developers, but I would argue that it&#8217;s this attention that&#8217;s paid when writing libraries that deal with intricate or abstract concepts that can&#8217;t be reduced to simple <code>&amp;foo</code> vs <code>&amp;mut foo</code> semantics that makes rust truly unique. As an old hat at C and C++ development, I&#8217;ve already worn my &#8220;low-level library developer&#8221; mantle thin, and it&#8217;s absolutely awesome to be able to write an abstraction that other developers can use without having to dive into syscalls and kernel headers as the only source of literature. But with rust, I&#8217;m experiencing a completely different kind of challenge in writing libraries and APIs: here the bar is even higher than just writing clean abstractions. The goal is to write these low-level abstractions in a way that not only lets clever developers who have previously dealt with these system internals appreciate and use our libraries to their advantage, but that lets even others new to the scene just read your documentation (or maybe not even that) and then let the shape of your API guide them to figuring out the <em>right</em> way of using it.</p>
<p>In any language, a savvy enough developer can usually figure their way around even a completely new library dealing with concepts and mechanisms completely alien to them. But in the process of figuring it out, they&#8217;re bound to make mistakes, bump into leaky abstractions (&#8220;but <em>why</em> shouldn&#8217;t I call <code>pthread_mutex_unlock</code> from another thread if I need to access the mutex and it&#8217;s currently locked? What is it there for, then?&#8221; &#8211; whether they&#8217;re asking it on SO or mulling it over quietly in their head as they figure out the black-box internals by poking and prodding away at the API surface), pull out what&#8217;s left of their hair, and bang their head against the wall some before arriving at a generally correct and workable solution.</p>
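<p>For contrast, this is exactly the class of confusion rust&#8217;s standard library heads off at compile time: &#8220;unlock&#8221; isn&#8217;t a free function you can call from the wrong thread, it&#8217;s the <code>Drop</code> of a guard, and <code>MutexGuard</code> is <code>!Send</code>, so shipping it to another thread to unlock there is a type error rather than silent undefined behavior. A minimal illustration using only <code>std</code>:</p>

```rust
use std::sync::Mutex;

// The lock is released when `guard` goes out of scope, on the same thread
// that acquired it. There is no separate unlock() to misuse.
fn increment_locked(value: &Mutex<u32>) {
    let mut guard = value.lock().unwrap();
    *guard += 1;
    // Uncommenting the line below does not compile (error[E0277]:
    // `MutexGuard<'_, u32>` cannot be sent between threads safely),
    // which is rust's answer to the pthread_mutex_unlock question above:
    // std::thread::spawn(move || drop(guard));
} // <- guard dropped here; mutex unlocked
```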
<p>But it doesn&#8217;t have to be that way, and the burden is on us as developers of these libraries and crates to give our fellow devs a better option. Runtime errors (like the ones the pthread API doesn&#8217;t even bother returning!) are good, and other languages<sup id="rf16-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn16-4958" title="C# is a great example here, with extensive input and sanity checking for most APIs but almost all of it in the form of runtime exceptions &ndash; despite it being a strongly typed language with powerful and extensible compiler integration." rel="footnote">16</a></sup> have demonstrated how they can be used with non-zero but still fairly minimal overhead. But with the benefits of a strongly typed language, powerful type abstractions, and perhaps a smattering of generics and ZSTs, we can and should do better.</p>
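<p>To make the ZST point concrete, here is one way such a contract can be pushed from the runtime into the type system. This is a hypothetical toy (the <code>Gate</code> and <code>Permit</code> names are made up for illustration, and it is not the rsevents-extra API): acquiring mints a zero-sized <code>Permit</code> token, and releasing <em>consumes</em> it, so &#8220;release without a matching acquire&#8221; simply doesn&#8217;t type-check and costs nothing at runtime.</p>

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Zero-sized proof-of-entry token. The private field means callers
// cannot conjure one up themselves; only try_enter() can mint it.
pub struct Permit(());

pub struct Gate {
    available: AtomicUsize,
}

impl Gate {
    pub fn new(count: usize) -> Self {
        Gate { available: AtomicUsize::new(count) }
    }

    // Non-blocking toy version: hands out a Permit only while capacity remains.
    pub fn try_enter(&self) -> Option<Permit> {
        let mut current = self.available.load(Ordering::Acquire);
        while current > 0 {
            match self.available.compare_exchange_weak(
                current,
                current - 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit(())),
                Err(observed) => current = observed,
            }
        }
        None
    }

    // Consuming the Permit makes an unmatched release a compile-time error.
    pub fn leave(&self, permit: Permit) {
        let _ = permit;
        self.available.fetch_add(1, Ordering::Release);
    }
}
```

<p>The remaining hole (a <code>Permit</code> from one <code>Gate</code> releasing another) can be closed with lifetimes or a reference back to the owner, at the cost of a fatter token; every design here is a trade-off.</p>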
<h2>The truly safe semaphore, for your benefit and review</h2>
<p>The semaphore we iteratively designed in this article is available for you to use, study, or review as part of the 0.2 release of the <a href="https://github.com/neosmart/rsevents-extra" rel="nofollow">rsevents-extra crate</a>. <a href="https://docs.rs/rsevents-extra/0.2.0/rsevents_extra/struct.Semaphore.html#method.try_release" rel="nofollow">This is the current API exposed by our <code>Semaphore</code></a> type (<a href="https://github.com/neosmart/rsevents-extra/blob/master/src/semaphore.rs" rel="nofollow">source code here</a>), incorporating some of the ideas discussed above.<sup id="rf17-4958"><a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/#fn17-4958" title="It currently still uses atomic unsigned integers in the implementation and so does not implement the wait-free, eventually consistent API to artificially reduce the currently available concurrency without making a wait() call,&nbsp;waiting for it to return, and then forgetting the result. At the time of its initial development, I had started off with signed integers then realized I didn&rsquo;t need them for the core functionality and switched to unsigned atomic values instead. I may need to revisit that decision in another release if it can give us either a wait-free reduce() corollary to release() instead of the Semaphore::wait().forget() or the current modify() method which allows wait-free direct manipulation of the internal count, but only with an &amp;mut Semaphore reference (to guarantee that it isn&rsquo;t in use, eschewing eventual consistency for correctness), but feedback on whether a wait-free reduce()&nbsp;at the cost of eventual consistency is a win or a draw/loss would be appreciated from anyone nerdy enough to read these footnotes!" rel="footnote">17</a></sup></p>
<p>The <code>Semaphore</code> type in <code>rsevents-extra</code> <a href="https://github.com/neosmart/rsevents-extra/blob/master/src/semaphore.rs" rel="nofollow">actually includes even more safety</a> than we&#8217;ve demonstrated above, but it&#8217;s of the bog-standard variety (checking for integer overflows, making sure we&#8217;re not decrementing past zero, etc) and not something unique to the challenges presented by semaphores in particular. The <a href="https://docs.rs/rsevents-extra/latest/rsevents_extra/struct.Semaphore.html#example" rel="nofollow"><code>Semaphore</code> docs example</a> shows a more fleshed-out version of the &#8220;listen for user events to artificially throttle/unthrottle the semaphore&#8221; example, if you want to check it out.</p>
<p>If you have an itch to try your own hand at writing concurrency primitives, I cannot encourage you enough: it&#8217;s all kinds of challenging and rewarding, and really opens your mind to what goes on behind-the-scenes with synchronization types. The <code>rsevents</code> crate was written to make doing just that a breeze, and I recommend at least starting off with either <a href="https://docs.rs/rsevents/latest/rsevents/struct.ManualResetEvent.html" rel="nofollow">manual-</a> or <a href="https://docs.rs/rsevents/latest/rsevents/struct.AutoResetEvent.html" rel="nofollow">auto-reset events</a> to take care of the intricacies of sleeping and the correctness of waking one and only one past/future awaiter at a time. Rust generally uses channels and mutexes to take care of synchronization, but there&#8217;s always a time and place for lower level thread signalling constructs!</p>
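<p>If you want a feel for what that exercise involves before reaching for <code>rsevents</code>, here is a deliberately minimal counting semaphore built on nothing but <code>std</code>&#8217;s <code>Mutex</code> and <code>Condvar</code>. It punts the wake-one-waiter subtleties to <code>Condvar</code> (which an auto-reset event would otherwise handle for you) and skips max-count enforcement entirely, so it is a starting point, not a finished primitive:</p>

```rust
use std::sync::{Condvar, Mutex};

// Minimal counting semaphore: no max_count checks, no overflow checks,
// no lock-free fast path. Just the core wait/release contract.
pub struct MiniSemaphore {
    count: Mutex<usize>,
    cond: Condvar,
}

impl MiniSemaphore {
    pub fn new(count: usize) -> Self {
        MiniSemaphore {
            count: Mutex::new(count),
            cond: Condvar::new(),
        }
    }

    pub fn wait(&self) {
        let mut count = self.count.lock().unwrap();
        // Re-check the predicate in a loop to guard against spurious wakeups.
        while *count == 0 {
            count = self.cond.wait(count).unwrap();
        }
        *count -= 1;
    }

    pub fn release(&self) {
        *self.count.lock().unwrap() += 1;
        // Wake one sleeper; waking more would only send them back to sleep.
        self.cond.notify_one();
    }
}
```

<p>Compare this to the article&#8217;s atomics-based design and you can see what the extra complexity buys: the version above takes a full mutex round-trip on every operation, even when concurrency is freely available.</p>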
<h2>Show some love and be the first to get new rust articles!</h2>
<p>Last but not least: a request for you, dear reader. I put a lot of effort into these rust writeups (and into the open source libraries and crates I author) for nothing but love. I&#8217;ve heard good things about Patreon, and have literally just now <a href="https://www.patreon.com/mqudsi" rel="follow">put up a page</a> to see if anyone would be interested in sponsoring my open source work. If you can&#8217;t spare some change to support me and my work on Patreon, please consider following me and my work <a href="https://twitter.com/mqudsi/" rel="follow">on twitter</a>, and starring <a href="https://github.com/neosmart/rsevents" rel="nofollow">rsevents</a> and <a href="https://github.com/neosmart/rsevents-extra" rel="nofollow">rsevents-extra</a> on GitHub.</p>
<p><strong>I&#8217;m currently looking for work opportunities as a senior engineer (rust is preferable, but I&#8217;m a polyglot). If you or your team is hiring, let me know!</strong></p>
<p>If you liked this writeup, please share it with others on your favorite nerdy social media platform. I also set up a small mailing list that only sends out an email when I post about rust here on this blog, you can sign up below to join (no spam, double opt-in required, one-click unsubscribe, I never share your email, etc. etc.):</p>
<blockquote class="twitter-tweet">
<p dir="ltr" lang="en">Just posted a ~longer writeup on what it takes to implement a truly safe Semaphore type in <a href="https://twitter.com/hashtag/rust?src=hash&amp;ref_src=twsrc%5Etfw" rel="follow">#rust</a>. Feedback welcome. <a href="https://t.co/QZdeCACpnH" rel="follow">https://t.co/QZdeCACpnH</a></p>
<p>— Mahmoud Al-Qudsi (@mqudsi) <a href="https://twitter.com/mqudsi/status/1577365674460151809?ref_src=twsrc%5Etfw" rel="follow">October 4, 2022</a></p></blockquote>
<p><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<div class="sendy_widget" style='margin-bottom: 0.5em;'>
<p><em>If you would like to receive a notification the next time we release a rust library, publish a crate, or post some rust-related developer articles, you can subscribe below. Note that you'll only get notifications relevant to rust programming and development by NeoSmart Technologies. If you want to receive email updates for all NeoSmart Technologies posts and releases, please sign up in the sidebar to the right instead.</em></p>
<iframe tabindex="-1" onfocus="sendy_no_focus(event)" src="https://neosmart.net/sendy/subscription?f=BUopX8f2VyLSOb892VIx6W4IUNylMrro5AN6cExmwnoKFQPz9892VSk4Que8yv892RnQgL&title=Join+the+rust+mailing+list" style="height: 300px; width: 100%;"></iframe>
</div>
<script type="text/javascript">function sendy_no_focus(e) { e.preventDefault(); }</script>
<p><strong>Update (10/5/2022): </strong>The examples in this post have been updated to use <code>Ordering::Acquire</code> or <code>Ordering::Release</code> when reading/writing <code>self.count</code> in the <code>wait()</code> and <code>release()</code> family of functions to synchronize memory/cache coherence between threads and ensure correct instruction ordering. In the original version of this article, the examples all used <code>Ordering::Relaxed</code> and relied on <code>self.event</code> to take care of ordering, but as <code>self.event</code> is skipped as an optimization in cases where the semaphore has available concurrency, this was insufficient.</p>
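<p>The reason for the update, in miniature: a <code>Release</code> increment paired with an <code>Acquire</code> load establishes a happens-before edge, so a waiter that observes the new count is also guaranteed to see every write the releasing thread made beforehand. With <code>Relaxed</code> on both sides there is no such guarantee unless something else (like the event) synchronizes for you. A small self-contained demo, with hypothetical names standing in for the semaphore&#8217;s internals:</p>

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn demo() -> usize {
    let count = Arc::new(AtomicUsize::new(0)); // stands in for self.count
    let data = Arc::new(AtomicUsize::new(0)); // stands in for protected state
    let (c2, d2) = (Arc::clone(&count), Arc::clone(&data));

    let producer = thread::spawn(move || {
        d2.store(42, Ordering::Relaxed); // the "protected" write...
        c2.fetch_add(1, Ordering::Release); // ...published by release()
    });

    // Like a wait() that skips the event because concurrency is available:
    // spin until the Acquire load observes the Release increment. That load
    // synchronizes-with the fetch_add, making the store of 42 visible here.
    while count.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
    producer.join().unwrap();
    data.load(Ordering::Relaxed)
}
```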
<hr class="footnotes"><ol class="footnotes"><li id="fn1-4958"><p>In fact, in some languages/apis a &#8220;<a href="https://learn.microsoft.com/en-us/windows/win32/sync/critical-section-objects" rel="follow">critical section</a>&#8221; is another name for a mutex.&nbsp;<a href="#rf1-4958" class="backlink" title="Jump back to footnote 1 in the text.">&#8617;</a></p></li><li id="fn2-4958"><p>For a mutex, they are one and the same as it mandates exclusive access to a region.&nbsp;<a href="#rf2-4958" class="backlink" title="Jump back to footnote 2 in the text.">&#8617;</a></p></li><li id="fn3-4958"><p>Semaphores are generally non-reentrant so recursively obtaining a semaphore will count against its limit!&nbsp;<a href="#rf3-4958" class="backlink" title="Jump back to footnote 3 in the text.">&#8617;</a></p></li><li id="fn4-4958"><p>The empty parentheses/tuple <code>()</code> is rust-speak for <code>void</code>, for those of you not familiar with the language.&nbsp;<a href="#rf4-4958" class="backlink" title="Jump back to footnote 4 in the text.">&#8617;</a></p></li><li id="fn5-4958"><p>In rust, there&#8217;s actually (out of necessity) an entire class of  types or values that can be changed even through read-only references, a concept known as interior mutability and exposed via types like <code>AtomicXXX</code> and the various <code>std::cell Cell</code> types &#8212; but those are generally to be used on a fine-grained level and <em>in general</em> you wouldn&#8217;t be using them to make entire objects writeable via read-only references.&nbsp;<a href="#rf5-4958" class="backlink" title="Jump back to footnote 5 in the text.">&#8617;</a></p></li><li id="fn6-4958"><p>Actually, the <code>AutoResetEvent</code> implementation only takes <a href="https://github.com/neosmart/rsevents/blob/master/src/lib.rs" rel="nofollow">all of two bits</a>, but let&#8217;s just call it a byte to make everything nice and easy.&nbsp;<a href="#rf6-4958" class="backlink" title="Jump back to 
footnote 6 in the text.">&#8617;</a></p></li><li id="fn7-4958"><p>We would still have to keep <code>Semaphore::release()</code> around and make sure it can be publicly called so that a semaphore initialized with <code>{ count: m, max_count: n, ... }</code> with <em>m ≥ 0</em> and <em>n &gt; m</em> can be used.&nbsp;<a href="#rf7-4958" class="backlink" title="Jump back to footnote 7 in the text.">&#8617;</a></p></li><li id="fn8-4958"><p>You can see a sample application testing for <code>pthread_mutex_unlock()</code> safety <a href="https://gist.github.com/mqudsi/5aac71d8177866ba2762bda01a3ff21a" rel="follow">here</a> and try it out for yourself online <a href="https://onlinegdb.com/9hmNhBuEc" rel="follow">here</a>.&nbsp;<a href="#rf8-4958" class="backlink" title="Jump back to footnote 8 in the text.">&#8617;</a></p></li><li id="fn9-4958"><p>After all, if we use an <code>AtomicU8</code> to represent the max/initial count, they can be as small as three bytes each!&nbsp;<a href="#rf9-4958" class="backlink" title="Jump back to footnote 9 in the text.">&#8617;</a></p></li><li id="fn10-4958"><p>For the sake of this example, let&#8217;s assume <code>NUM_CPUS</code> is a sufficiently large number like 4 or 8, so that enough worker threads will try to enter the semaphore-protected region.&nbsp;<a href="#rf10-4958" class="backlink" title="Jump back to footnote 10 in the text.">&#8617;</a></p></li><li id="fn11-4958"><p><code>rsevent-extra</code>&#8216;s <code><a href="https://docs.rs/rsevents-extra/latest/rsevents_extra/struct.CountdownEvent.html">CountdownEvent</a></code> might just be the perfect tool for this job, btw!&nbsp;<a href="#rf11-4958" class="backlink" title="Jump back to footnote 11 in the text.">&#8617;</a></p></li><li id="fn12-4958"><p>Another way would be to pack <code>count</code> and <code>max_count</code> into a single double-width atomic (assuming such an atomic of this size exists) and to decrement <em>both</em> <code>count</code> and 
<code>max_count</code> when a call to <code>Semaphore::wait()</code> is made. This way, any calls to <code>Semaphore::release()</code> would compare the potential increase in <code>count</code> against a similarly decremented <code>max_count</code> and can catch any violations of our core precept. The issues described in the remainder of this article persist regardless of which method was chosen.&nbsp;<a href="#rf12-4958" class="backlink" title="Jump back to footnote 12 in the text.">&#8617;</a></p></li><li id="fn13-4958"><p>In rust parlance, memory leaks do not fall under the safety guarantees of the language and it&#8217;s perfectly &#8220;safe&#8221; if not exactly cromulent to write code that doesn&#8217;t <code>drop()</code> RAII resources.&nbsp;<a href="#rf13-4958" class="backlink" title="Jump back to footnote 13 in the text.">&#8617;</a></p></li><li id="fn14-4958"><p>Give up? Just draw another, longer line next to it!&nbsp;<a href="#rf14-4958" class="backlink" title="Jump back to footnote 14 in the text.">&#8617;</a></p></li><li id="fn15-4958"><p>I still need to sit down and experiment with packing <code>count</code> and <code>max_count</code> into one atomic double-width word and see how it works to decrement both <code>count</code> and <code>max_count</code> after each call to <code>wait()</code> instead of tracking with an additional <code>total_count</code> field, but even there we&#8217;d have a price to pay. 
We can no longer use <code>AtomicXXX::fetch_add()</code> and we&#8217;d have to use <code>compare_exchange_weak()</code> in a loop, after fetching the initial value, separating it into its component fields, incrementing/decrementing, then combining the fields into a single word again &#8211; although a quick godbolt attempt shows the compiler <a href="https://godbolt.org/z/4zba6js8j" rel="follow">actually does a rather nice job</a>.&nbsp;<a href="#rf15-4958" class="backlink" title="Jump back to footnote 15 in the text.">&#8617;</a></p></li><li id="fn16-4958"><p>C# is a great example here, with extensive input and sanity checking for most APIs but almost all of it in the form of runtime exceptions &#8211; despite it being a strongly typed language with powerful and extensible compiler integration.&nbsp;<a href="#rf16-4958" class="backlink" title="Jump back to footnote 16 in the text.">&#8617;</a></p></li><li id="fn17-4958"><p>It currently still uses atomic unsigned integers in the implementation and so does not implement the wait-free, eventually consistent API to artificially reduce the currently available concurrency without making a <code>wait()</code> call, <em>waiting for it to return</em>, and then forgetting the result. At the time of its initial development, I had started off with signed integers then realized I didn&#8217;t need them for the core functionality and switched to unsigned atomic values instead. 
I may need to revisit that decision in another release if it can give us either a wait-free <code>reduce()</code> corollary to <code>release()</code> instead of the <code>Semaphore::wait().forget()</code> or <a href="https://docs.rs/rsevents-extra/0.2.0/rsevents_extra/struct.Semaphore.html#method.modify" rel="nofollow">the current <code>modify()</code> method</a> which allows wait-free direct manipulation of the internal count, but only with an <code>&amp;mut Semaphore</code> reference (to guarantee that it isn&#8217;t in use, eschewing eventual consistency for correctness), but feedback on whether a wait-free <code>reduce()</code> <em>at the cost of eventual consistency</em> is a win or a draw/loss would be appreciated from anyone nerdy enough to read these footnotes!&nbsp;<a href="#rf17-4958" class="backlink" title="Jump back to footnote 17 in the text.">&#8617;</a></p></li></ol>The post <a href="https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/">Implementing truly safe semaphores in rust</a> first appeared on <a href="https://neosmart.net/blog">The NeoSmart Files</a>.]]></content:encoded>
					
					<wfw:commentRss>https://neosmart.net/blog/implementing-truly-safe-semaphores-in-rust/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4958</post-id>	</item>
	</channel>
</rss>