rtyler

Butterflia

2026-05-02T00:00:00+00:00

There’s a dead deer in the bin. The burial, while unceremonious, was not without a deep sadness for an animal that didn’t know I exist.

My house is in the shockingly short boundary where city turns into wild. Troops of wild turkeys march through, the jack rabbits scurry away when you walk up the hill, and deer leap across the road heading from the creek on one side to the hills on the other.

I purchased this house from the woman who built it with her husband. Frank passed away about 20 years ago, but he still visits. His wife Marilyn moved to Sacramento and is an avid gardener. She cultivated a rich variety of flowers, bushes, and trees under the towering oaks of the home she built.

I let most of the non-native plants wither, encouraging those which could survive the wet winters and dry summers. Continuing Marilyn’s garden would have taken too much time and water, neither of which is in abundance these days. Nonetheless I still feel a sense of responsibility to the land, the animals, and the oaks.

Returning from a week of travel, I was shocked to spot a fat doe lounging on the hillside. Deer are common but I have never seen a deer just lying on the ground. A couple days later, while I was walking through the backyard, the doe, who was given the name “Butterflia,” was just as surprised to see me as I was to see her. Later that evening I noticed her again further down the hill, a tiny fawn wobbling between her legs.

Last year a doe was struck and killed by a car on the road that runs alongside my house. The following day I discovered a fawn with a broken hind leg in the backyard. The nice woman from Fawn Rescue of Sonoma County and I cautiously crept through the woods with outstretched beach towels but could not catch the injured fawn and it escaped.

I called them back to see if they could help me relocate this doe and fawn. The same nice woman assured me that they would probably move on in a week or so, and to call back if their situation deteriorated.

Over the weekend I spotted a second fawn. Butterflia had twins! The backyard was off limits to ensure that neither Butterflia nor her fawns would be scared into the road. I am happy to share my space with such relatively benign neighbors.

By the end of the following week there was only one fawn that I regularly saw with Butterflia. During the day when I knew she was out I swept through the backyard to see if I could find a body, but never learned what happened with that fawn.

For whatever reason Butterflia had decided to call the backyard home.

A week later, on a Sunday, while carefully walking through the backyard to get something from the shed, I discovered the other fawn’s newly deceased body behind a rock near the house. The job of coroner would have to wait until Monday after work; Sunday’s plans did not include a dead fawn.

Butterflia discovered the body Monday morning around 4am. I was awakened by the mournful bellowing of a mother discovering the body of her child. It was absolutely gut-wrenching. I busied myself on the other side of the house, periodically checking back in the bedroom, only to hear Butterflia’s continued sorrow. She painfully wailed for almost three hours and spent the rest of the day near the rock. I would see her revisiting the body, poking her head under the leaves of the bushes as if in disbelief.

She was mourning and I found myself mourning as well.

I let the body rest for a couple days. I wanted to make sure Butterflia had left before I did anything, out of some sense of respect. Using an old spade I scooped up the fawn and gently placed it in the bin, covered it with dirt, and then some leaves and straw.

The week came and went without any additional sightings of Butterflia. Her sadness affected me more deeply than I anticipated. I don’t know if she mourned the first fawn, or even knew it was gone. I imagined her grief was compounded by losing both of her babies, the early morning anguish heavy with the knowledge that they both died.

She must have moved on, hopefully to greener pastures.

I swept through the yard this weekend to see if there were other remnants of deer habitation to be cleaned up. After completing my yard work, I startled Butterflia as she gingerly walked through the yard.

I said hello with a warm smile, pleased to see her, because I’m human.

She turned and looked at me dumbly, because she’s a deer.

2026 April: Recently Studied Stuff

2026-04-30T00:00:00+00:00

Similar to last month I have given more intention to some of the interesting things that I have stumbled across in my feed reader or the fediverse. Rather than just a quip, boost, or reply, I have wanted to consolidate these thoughts with more permanence here on my blog.

Chris’ talk below at North Bay Python was, as his always are, well-delivered and worth consideration.

The conclusion that he draws towards the end is similar to something I was noodling last year:

At some point somebody, somewhere, is going to have to actually understand how things work.

Chris makes the point, as he typically does, much more thoughtfully and with a stronger philosophical base.

Had some discussions with the delta-kernel-rs developers after they mistakenly added a ton of new files to tests/ blowing up test cycle times. Another community member shared this great overview about not using Cargo integration tests.

Catching up on Daniel’s thoughts on Data Quality and reconsidering the domain. The generation of slop has resulted in renewed discussions of “but how do we ensure correctness?” which is a great question to be trying to answer, but I am still rather disappointed with the state of the art for data quality tooling.

I recommend this blog post which has some good citations for negative AI behaviors affecting free and open source communities.

This is going to be a difficult problem to solve, more difficult than the email spam problem we have been unable to solve after 30 years of working on it.

This is also a very important problem, we are currently in an age where we have access to information that most people couldn’t even dream of 30 years ago. We also have disinformation that combines some of the worst aspects of authoritarian regimes throughout history combined with the worst aspects of cult brainwashing. If we lose access to the information but the disinformation remains (or get worse) then the result will be terrible.

I really enjoy Planet Debian as an aggregator of an international set of voices from the Debian community. I get exposed to so many different viewpoints from around the free software ecosystem, which I really value. This past week I read this blog post by a Debian maintainer which flummoxed me so much that I wrote out my thoughts on the topic here.

Streaming tar over SSH is one of the more novel Unix tricks I don’t get to use much anymore. Drew DeVault shared some helpful tips for using it without needing to use incantations of rsync(1).

Unity Catalog with S3 Access Points

2026-04-25T00:00:00+00:00

Governance is the synergy of our era. If I could go one week without a discussion around governance that really just boils down to classic role-based access control practices..

The bad news I have for you is that today, in the year 2026, Unity Catalog does not work with S3 Access Points.

However it does show a different pathology than it once did, which leads me to believe that it could, if not for one silly little piece of technical debt.

The system I am building utilizes Amazon S3 Access Points for governance but must integrate into the Databricks platform. A platform which has its own ideas about governance: Unity Catalog. It should come as no surprise that a system which was named unity would go to great strides to make itself the center of the universe.

How troublesome!

Years ago a colleague and I tried to integrate Databricks Unity Catalog and S3 Access Points only for the approach to crash and burn. Integrating two different opaque tools like IAM permissions and Unity Catalog led to all sorts of attempted incantations, none of which actually succeeded.

The Databricks product team told us that the system did not support S3 Access Points “by design.” I found the reasoning very patronizing because it was presented as “we don’t support S3 Access Points by design to prevent users from circumventing Unity access controls.”

What I understand now is how that “by design” was more of an excuse “we just don’t want to test it” rather than something more substantive.

S3 Access Points can be referenced in a number of ways, including S3 Access Point Aliases, which exist so that even the most legacy system can integrate with them.

An access point alias name meets all the requirements of a valid Amazon S3 bucket name and consists of the following parts:

The first time we bounced off this problem, S3 Access Point Aliases had only recently been released.
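To make the "an alias passes as a bucket name" point concrete, here is a minimal sketch in Rust. It checks only a simplified subset of the S3 bucket naming rules (length, character set, first and last character), and both function names and the alias and ARN values are made up for illustration, not taken from a real account or from any SDK.

```rust
/// Returns true when `name` satisfies a simplified subset of the S3 bucket
/// naming rules: 3 to 63 characters, only lowercase letters, digits, dots
/// and hyphens, and a letter or digit at both ends.
/// (Assumption: the full rules have more cases; this is not exhaustive.)
fn looks_like_bucket_name(name: &str) -> bool {
    let allowed =
        |c: char| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-' || c == '.';
    let edge_ok =
        |c: Option<char>| c.map_or(false, |c| c.is_ascii_lowercase() || c.is_ascii_digit());

    (3..=63).contains(&name.len())
        && name.chars().all(allowed)
        && edge_ok(name.chars().next())
        && edge_ok(name.chars().last())
}

/// Builds an `s3://` URI the way a bucket-only client would, refusing
/// anything that does not look like a bucket name.
/// (Hypothetical helper, not part of any AWS SDK.)
fn s3_uri(bucket_or_alias: &str, key: &str) -> Option<String> {
    if looks_like_bucket_name(bucket_or_alias) {
        Some(format!("s3://{}/{}", bucket_or_alias, key))
    } else {
        None
    }
}
```

With these helpers, a made-up alias such as `example-ap-0123456789abcdef-s3alias` is accepted and turned into an ordinary `s3://` URI, while an access point ARN is rejected because of its colons and slash. That is the gap an alias is meant to close for clients that only understand bucket names.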

Despite all of Unity Catalog’s protestations, the errors we ended up seeing don’t convey a structural limitation when using S3 Access Point Aliases. Instead they point to simply outdated SDK support in the underlying Databricks Runtime.

My hunch centers on the AWS SDK v1, which was announced as deprecated over two years ago and has been completely deprecated as of the end of 2025. Lots of Databricks and other Spark libraries still interact with S3 via the v1 SDK. That SDK was originally released in 2010 (lol) and so it’s likely that the authentication issue we were seeing was mixed up in the support for S3 Access Point Aliases with this old SDK.

Since we bounced off this problem a number of years ago one thing that has changed for the better in Unity Catalog is that it is now possible to grant Unity a completely read-only configuration in IAM-based S3 bucket policies. While we cannot use S3 Access Points as part of our governance strategy, we can at least still grant a fairly limited permission to Unity for read-only operations.

Now I can have my esoteric Delta Lake datastores present in Unity without any risk of misconfiguration or error in Unity leading to data corruption!

Governance to a lot of enterprise vendors is about centralization of control, but for me it’s about defense in depth. I never want a business-critical system to be a single misconfiguration away from granting read or write access to the wrong principal.

Private Open Source

2026-04-01T00:00:00+00:00

Open source communities depend on a fundamental assumption that is no longer true: the presumption of good faith actors. The hosts serving free and open source code are scraped relentlessly, denying service to developers. Once that code has been assimilated into various models it is washed of all attribution and license information, denying rights of the developers. Some subset of users then feel empowered, emboldened, I’m not sure what exactly by these models and lob massive thousand line changes back at the developers. Nearly every technology has the possibility to be used for positive and negative effects, but free and open source communities are being harmed from multiple directions right now.

I am a big believer in the four opens:

The Four Opens are a set of principles guidelines that were created by the OpenStack community as a way to guarantee that the users get all the benefits associated with open source software, including the ability to engage with the community and influence future evolution of the software.

Open Source

Open Design

Open Development

Open Community

There is an implied “to the public” in each of the four opens, at least how I have understood it over the past many (many) years. I have repeatedly advocated for open (to the public) discourse and transparency when working with companies like CloudBees and Databricks as they have engaged with open source projects.

The mounting negative pressures and in some cases outright hostility towards free and open source projects have me reconsidering the implied “to the public” and how these communities may need to evolve in the future.

While I have never been a fan of invite-only Discord or Slack servers, both of which are used by the Apache DataFusion project for some odd reason, there are good reasons to put the project’s shared spaces in slightly more private and slightly less AI-accessible systems. A little bit of privacy can lead to more candid conversations and potentially a stronger feeling of community and safety.

My first line of thinking led me to the idea of “vouching” which I recall mitchellh posting about in the fediverse, but I couldn’t find a good linkable reference.

Vouching is what we did as kids when a new friend was suggested to join the mischief, somebody would vouch for the new kid and say “hey, they’re my neighbor, they’re cool” and then we would go start new trouble together. In the context of an open source community vouching can:

Help build a web of trust without every person necessarily knowing each new person
But also vouching means there is a higher tendency for a community to be homogeneous, since it will be less welcoming to random new-comers.

I think vouching could also exacerbate the likelihood of a Jia Tan, where the web of trust within the community is compromised by a malicious actor. Getting one member to vouch for you may lower the guard of all of the other members of the community, making this style of attack easier to pull off.

Since I started writing this post a whole week has passed by, without any new ideas or patterns popping into mind. I’m curious how others are thinking about it, so please let me know on Mastodon or via email rtyler@~

The problem is obeying in advance

2026-03-25T00:00:00+00:00

Linux power-users tend to have strong opinions about two things: distribution and systemd. The bazaar of distributions means competing implementations or different perspectives end up expressed through the curation of the packaged software. systemd ended up so contentious because it’s a decent piece of technology which suffers from persistent scope creep that became a foundational component in a lot of distributions. The drama du jour is that systemd is somehow implicated in “age verification laws.”

systemd as an init system is pretty good. Once upon a time I worked on porting launchd to FreeBSD and so I have some familiarity with the silliness of most init systems.

systemd as a katamari at the root level of most Linux systems is not “pretty good.” There have been numerous tendrils of what is understood to be “systemd” which are of lesser quality and have resulted in security issues.

Anyways, I hope you get the point. systemd as an init system: good. systemd as an operating system: bad.

The drama du jour is about the latter.

One should not obey in advance. Especially in the domain of free and open source software, which is literally a political project.

Through Planet Debian I stumbled into this blog post by a Debian maintainer, which is patently absurd.

Recently, the free software Nazi bar crowd styling themselves as “concerned citizens” has tried to start a moral panic by saying that systemd is implementing age verification checks or that somehow it will require providing personally identifiable information.

The author is correct insofar as systemd did not add age verification. However, most of the folks upset with the change are upset that their Linux systems are obeying in advance.

systemd did make changes in order to obey, to take part in anti-free restrictions under the guise of “age verification.” From the pull request:

Stores the user’s birth date for age verification, as required by recent laws in California (AB-1043), Colorado (SB26-051), Brazil (Lei 15.211/2025), etc.

The whole motivation of the change was to obey in advance to these unjust laws.

The author then goes on to make some equally absurd claims about how this functionality is important for parents to implement controls on computers, for the children! Clearly this person must not know any actual children, or even parents for that matter. Children are excellent at finding ways to circumvent restrictions. The idea that a user-modifiable piece of data on a local machine should be trusted for “parental controls” is so detached from reality that I originally thought they were making a sarcastic joke.

I think this tongue-in-cheek systemd-censord post does better than anybody to exclaim how absolutely ludicrous this obeying in advance is:

Systemd units will be created for every desired censorship function, and will be started based on the user’s location. For example, a unit for Kazakhstan will implement the government-required backdoor, a unit for China will implement keyword scans and web access blocks (more on this later), a unit for Florida will ban all packages with “trans” in the name (201 packages in current stable distribution), a unit for Oklahoma will ensure all educational software is compliant with the Christian Holy Bible, a unit for the entire United States will prevent installation of any program capable of decoding DVD or BluRay media, and a unit for California will provide the user’s age to all applications and all web sites from which applications may be downloaded. As can be seen, multiple units may be started for a given location.

Do not obey in advance.

35E

2026-03-22T00:00:00+00:00

35E, 35E.
Stuck here in the middle
of the middle,
35E.

At my height any seat
can feel like misery.

I wouldn’t be here today,
if not for last night’s delay.

35E, 35E
trapped in this humid
sneeze of
humanity.

Everything is expensive, and still it sucks.
The cheapest coffee was four lousy bucks.

The grumpiness was extreme at the TSA
acting in their theatre for deferred pay

35E, 35E
I’m not sure in which
timezone
I should be.

United customer support has been totally outsourced,
hour seventeen on the phone; just the worst.

For the shareholders the texts all lie,
about being powered by GenAI.

35E, 35E
all of the staff
on this flight and the last
have been really kind and patient which is a testament to their professionalism and hospitality despite the overtly customer-hostile environment of modern American commercial aviation.

2026 March: Recently Studied Stuff

2026-03-21T00:00:00+00:00

Over the past week I have made a more conscious effort to keep track of some really interesting articles that came through my feed reader. I am a big fan of the open web and the power of RSS for disseminating interesting information from actual people. Below are some really interesting posts I have read recently!

Compressed Apache Arrow tables over HTTP

When discussing transport protocols for sending data between services at work recently, a colleague asked “why can’t we just yeet Arrow over HTTP?” It turns out, you absolutely can and Arrow IPC streams even have a registered MIME type:

Content-Type: application/vnd.apache.arrow.stream

Understanding Parquet format for beginners

A great introduction to the Apache Parquet format and why it makes so many things better with large data storage systems like Delta Lake. I have written on this topic before and encourage you to take another read through this blog post by some maintainers of the parquet crate.

Every layer of review makes you 10x slower

Every layer of approval makes a process 10x slower [..]

Just to be clear, we’re counting “wall clock time” here rather than effort. Almost all the extra time is spent sitting and waiting.

Code a simple bug fix: 30 minutes

Get it code reviewed by the peer next to you: 300 minutes → 5 hours → half a day

Get a design doc approved by your architects team first: 50 hours → about a week

Get it on some other team’s calendar to do all that (for example, if a customer requests a feature): 500 hours → 12 weeks → one fiscal quarter
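The arithmetic in that list is just repeated multiplication by ten, which a few lines of Rust can reproduce. This is a small sketch with made-up function names; it assumes the article's premise that every approval layer multiplies wall-clock time by exactly ten, which is a rule of thumb rather than a measurement.

```rust
/// Wall-clock minutes for a task after `layers` rounds of approval,
/// assuming (per the quoted rule of thumb) that each layer multiplies
/// the elapsed time by ten.
fn wall_clock_minutes(base_minutes: u64, layers: u32) -> u64 {
    base_minutes * 10u64.pow(layers)
}

/// Converts minutes to whole hours, dropping any remainder.
fn whole_hours(minutes: u64) -> u64 {
    minutes / 60
}
```

Starting from the 30 minute bug fix, `wall_clock_minutes(30, 1)` is the 300 minutes of the peer review step, and `whole_hours(wall_clock_minutes(30, 3))` is the 500 hours of the cross-team step.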

This inspired these thoughts which I shared with the delta-rs community:

“what if we didn’t require code review for merging into main”

I’m exploring the thought more about what we might need to make that happen. “Why would you do such a thing, code review is so valuable!” I do find code reviews valuable but we do seem to lose a lot of flow time due to timezones, differing work schedules, and a number of other things. For something without a lot of changes, especially bug fixes that come with tests I would be much more comfortable with maintainers merging once CI goes green.

Some pieces of the puzzle that I think would be needed:

Soft caps on pull requests. I saw this mentioned somewhere else, but implementing a soft cap of <500 lines per pull request can help people avoid massive unreviewable changes in favor of smaller ones that are simpler to integrate.
Incorporating some of the benchmarking work into CI that has already been explored. If performance of key operations is not affected and the build is green, go for it.
Stronger semantic version checks: if our APIs have not changed and all tests pass, I’m generally comfortable with landing stuff by maintainers.
Implementing Apache Software Foundation style release candidates and voting: this is where we would put a mandatory bottleneck. Rather than some jokey Slack emojis like I tend to do, we would implement a true release candidate process that requires review and a vote before we push something to users.

All of this is to say that reviews can still be requested, but I would love to see us land more improvements faster and I think we have a bunch of different schedules that can make pushing each change through a review queue a lot slower than necessary.

Conditional Impls in Rust

It’s possible in Rust to conditionally implement methods and traits based on the traits implemented by a type’s own type parameters. While this is used extensively in Rust’s standard library, it’s not necessarily obvious that this is possible.

I have been vaguely aware of this functionality but haven’t really taken the time to consider it, so I really appreciated this post walking through the conditional impl functionality in Rust.
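For a concrete picture of what a conditional impl looks like, here is a minimal sketch. The `Labeled` and `Opaque` types are hypothetical and exist only for illustration; they are not from the linked post or the standard library.

```rust
use std::fmt::Display;

/// A hypothetical wrapper type used only for illustration.
struct Labeled<T> {
    label: String,
    value: T,
}

// Unconditional impl: `new` is available for every `T`.
impl<T> Labeled<T> {
    fn new(label: &str, value: T) -> Self {
        Labeled { label: label.to_string(), value }
    }
}

// Conditional method: `render` only exists when `T: Display`.
impl<T: Display> Labeled<T> {
    fn render(&self) -> String {
        format!("{}={}", self.label, self.value)
    }
}

// Conditional trait impl: `Labeled<T>` is `Clone` only when `T` is `Clone`.
impl<T: Clone> Clone for Labeled<T> {
    fn clone(&self) -> Self {
        Labeled { label: self.label.clone(), value: self.value.clone() }
    }
}

/// A type that implements neither `Display` nor `Clone`.
struct Opaque;
```

`Labeled::new("rows", 42)` gets both `render()` and `clone()` because `i32` implements `Display` and `Clone`. `Labeled::new("blob", Opaque)` still compiles, but calling `render()` or `clone()` on it is rejected by the compiler, since the bounds on those impl blocks are not satisfied.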

Only so many sunrises

2026-03-15T00:00:00+00:00

With a lot of discussion around intelligence lately, I find myself thinking a lot more about wisdom. Age doesn’t necessarily beget wisdom, but I do think that experience can. I am always impressed by those who are able to reflect and grow wise from the varied joys and traumas that shape each one of us.

This video struck a chord for me. Contrasting the Bay Area hustle culture to the things that make life worth living:

San Francisco has always been a destination for those seeking their fortunes. The frenetic enthusiasm radiates through seemingly everything there.

I also really enjoyed the energy of San Francisco when I first moved there. I had nothing else but work.

The trade-off for my relentless focus on my career was a tremendous level up in a short amount of time. I wouldn’t be where I am today without a few years of judicious networking and workaholism.

San Francisco was “Lord of the Flies” when I described it to friends from elsewhere. Awash in adult boys, untethered from the real world. I would hang out with men 10 years older than me who were doing the same dumb shit I was, except I was in my early twenties, a commonly accepted time to be foolish.

I did not want to end up like them and increasingly put both physical and mental distance between them and myself.

There is more to life than panning for gold.

Today I was talking with an elder almost twice as old as me, who casually offered:

I still get up at 5am; at this age there are only so many sunrises left to see.

I’m going to try to not stay up too late dwelling on the comment, lest I miss tomorrow’s sunrise.

Based Lake, a petabyte-scale low-latency data lake

2026-03-10T00:00:00+00:00

I had a chat today about building large-scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-life work that has made it into production.

For years the conventional wisdom around Delta Lake has been to not connect user-facing/online systems to Delta tables. Basically, don’t point your Django app at your Delta tables. This continues to be a decent guideline but definitely not a rule and I have the performance data to back that up.

My talk abstract:

Scribd hosts hundreds of millions of documents and has hundreds of billions of objects across our buckets. Combining large language models with massive amounts of text has required investment in our new Content Library architecture. We selected Delta Lake as the underlying storage technology but have pushed it to an extreme. Using the same Delta Lake architecture we offer both direct data access for data scientists in Databricks Notebooks and online data retrieval in milliseconds for user-facing web services.

In this talk we will review principles of performance for each layer of the stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.

The work done by myself and my colleague Eugene in this area has been heavily related to my previous research around Low latency Parquet reads which informed work named Content Crush, which I have explored more on the Scribd tech blog and on the Screaming in the Cloud podcast.

I really hope that I am able to share results at Data and AI Summit from this incredibly challenging work that I am undertaking. But even if I don’t, blog posts like my musings on Multimodal with Delta Lake, scaling streaming Delta Lake applications, and a myriad of other articles I have published can be pieced together to form the larger mosaic of insane large-scale data work I have been hammering on!

Using tmux with bhyve

2026-03-10T00:00:00+00:00

Many aspects of FreeBSD follow the user-friendly unix philosophy, it’s just choosy about who its friends are. I have always found bhyve virtualization to be really interesting but really unfriendly. The vm-bhyve management system was what finally cracked bhyve open and made it usable for me. The vm command has paper cuts but generally speaking it does what I want on my primary FreeBSD machine.

For the longest time I used the built-in VNC support to connect to machines because the vm console command would use /usr/bin/cu which would inevitably trap my console and no amount of ~>~>~><~D~>DD<~> would help me exit.

Somewhere along the line tmux support was added to vm-bhyve and now vm console simply opens up a new tmux window!

I host everything under /vm on the machine, so in /vm/.config/system.conf:

console="tmux"


This seems like a simple thing to be excited about, and it is, but it makes
VMs wildly more accessible and useful for me.