Evan Meagher

Design documentation at small companies

2015-09-24T00:00:00+00:00

One component of the engineering culture at Twitter (where I used to work) that I’m trying to instill at my new job is the importance of writing design documents prior to implementing complicated systems. In this essay, I will argue in favor of premeditated software design at small companies and propose what I call “precautionary migration planning” as a design doc section that caters specifically to the tradeoffs required by startups.

Traveling by map

A design document is an outline of a proposed design for a software system in writing and figures. The level of detail and formality can vary, but the purpose is to force an engineer to think about and document what a system should do and how it should be built before effort is spent on implementation.

Many large companies enforce design docs for all new projects, going so far as to prescribe document templates and design review meetings. While such a formal approach makes sense when projects require coordinated effort across multiple teams and scores of people, it would be an inappropriate amount of overhead for an engineering team at a startup.

But the baby shouldn’t be thrown out with the bathwater. Writing down and examining your thoughts prior to acting on them is a good way to avoid mistakes and prevent unwarranted technical debt. As such, even at a startup, going through a semi-formal design exercise injects a healthy amount of peer-review into the process and can increase the reliability of the systems you end up with. Not to mention the added benefit of having a good understanding of a project’s scope and thorough high-level documentation prior to writing a single line of code. Ideally, when you bring new folks onto the team, you can simply link them to a set of design docs and save yourself an hour of whiteboarding.

A straightforward analogy helps illustrate when a design doc is appropriate for a new undertaking. A design doc is like a set of directions and a map. The complexity of a journey determines whether or not directions are required. For instance, you can walk up the road to the grocery store without thinking, so you obviously don’t need a map. Similarly, if all you need to do is add a simple feature or fix a simple bug, then a rigorous design process is probably unnecessary.

However, for trips venturing into unfamiliar territory or requiring multiple vehicles, coordinating travel with a set of directions is a must. Likewise, if a system at the core of the company’s business has many moving parts and will affect the lives of numerous people over its lifetime, then a design doc will probably prove to be worthwhile.

External dependencies

After writing the first couple of design documents at Whisker Labs, I’ve noticed a key difference between the ones I’m writing now and those I wrote at Twitter. Critically, the former tend to rely on the availability of services maintained by unfamiliar people at other companies rather than acquaintances down the hall. For instance, because we use Amazon Web Services instead of technologies stewarded in-house, our services’ uptime relies on the diligence of anonymous Amazon personnel. As the swashbuckling systems cliché goes, you own your availability, but you aren’t in control of all of the factors from which it derives.

Strategies exist for managing the impact of intermittent outages of third-party services. RPC interactions can be augmented with features like retries and failure accrual, or can simply return partial results as a means to limit the damage caused by temporary downtime. But at a higher level, years of experience with as-a-service offerings have shown that there is typically a threshold scale beyond which any given hosted service ceases to be economical. What we’ve observed is that almost all companies who bootstrap their software atop whatever-as-a-service solutions eventually move away from them on account of cost, reliability, and/or functionality. In the long term, everybody ends up running their own Graphite and Kafka clusters and the luckiest of us get our own datacenters.
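As a rough illustration of the retry, failure accrual, and partial-result tactics mentioned above, here is a minimal sketch. The names `FailureAccrual` and `callWithRetries` are hypothetical, and the remote call is modeled as a plain function that throws on failure; a real RPC client would be asynchronous and would likely add backoff between attempts.

```javascript
// Failure accrual: after `threshold` consecutive failures, treat the backend
// as dead for `cooldownMs` and fail fast instead of sending it more traffic.
class FailureAccrual {
  constructor(threshold, cooldownMs, now = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, which keeps the policy testable
    this.consecutiveFailures = 0;
    this.deadUntil = 0;
  }

  isAvailable() {
    return this.now() >= this.deadUntil;
  }

  recordSuccess() {
    this.consecutiveFailures = 0;
  }

  recordFailure() {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.deadUntil = this.now() + this.cooldownMs;
      this.consecutiveFailures = 0;
    }
  }
}

// Try `call` up to `maxAttempts` times. If the backend is marked dead or every
// attempt fails, return `fallback` (a partial result) rather than throwing.
function callWithRetries(call, accrual, maxAttempts, fallback) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (!accrual.isAvailable()) break; // fail fast while the backend cools off
    try {
      const result = call();
      accrual.recordSuccess();
      return result;
    } catch (err) {
      accrual.recordFailure();
    }
  }
  return fallback;
}
```

The caller decides what a sensible partial result looks like, for example an empty list of recommendations rather than an error page.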

Not to mention the trend of services simply disappearing out from under you, on account of the originating company being acquired or otherwise going out of business.

But for a scrappy, bandwidth-constrained startup team, paying someone to do the heavy lifting of distributed systems operation is a no-brainer. So what does a responsible software engineer do in such cases when business and productivity concerns demand the usage of hosted services regardless of their long-term feasibility?

Precautionary migration planning

The easy (and industry-standard) answer is to throw up your hands and say “we’ll cross that bridge when we get there.” The pricing and long-term viability of external services are entirely out of your control, so why worry about hypothetical futures that you can’t influence? People still live in Seattle and Portland even though the mega-quake is coming, right?

This is a fine answer if you’ve made the conscious decision that your #1 priority as an engineering organization is speed of execution. Depending on your product or service’s reliability requirements, the pace of your market, and your bottom line, it very well may be preferable to put your time to more immediately productive use than planning for eventualities.

On the other hand, deciding which failure modes are worth planning for is part of what makes engineering interesting. The best you can do to minimize the risk imposed by external dependencies is to come up with a feasible (but brief) plan for migrating away from them. Consider it a precautionary principle for SaaS.

This is why I’m starting to bake such a section into the design docs that I’m writing. These sections follow the same principles of situational awareness and premeditated action that motivate having runbooks for services, but the plans they describe are more akin to a heart transplant than a simple runbook item. The sections will:

List the system’s external dependencies whose long-term feasibility is deemed at risk (e.g. “<PaaS> will be too expensive by the time we hit <milestone>”)
List potential replacements for the risky dependency and give a high-level plan for migrating

The result of this exercise is a better understanding of a system’s risk profile and the paths by which the system is likely to evolve over time.

Countering the logical conclusion

In response to my initial thoughts on this strategy on Twitter, an esteemed former colleague pointed out its logical conclusion, in which the list of “hosted services” is exhaustive. In literal terms, a program’s “external dependencies” include the operating system and proprietary hardware on which it runs, all the way down to the utility company that supplies the energy powering the computer. In this light, precautionary migration planning is absurd, given that the engineering effort involved in reinventing every wheel between your program and electrons in circuits is well beyond most companies’ capabilities.

However, I don’t think that this argument refutes the usefulness of such planning. When done pragmatically, focusing on a reasonable subset of a system’s dependencies, a team gains the ability to act quickly when migrations are deemed necessary.

One way to differentiate external dependencies is by whether or not they are truly fundamental to a service’s operation. If the power goes out, a program (or at least a stricken instance thereof) is unrecoverable regardless of any migration plan. Thus such planning is only relevant for partial failure modes, such as the loss of a hosted database or the end-of-LTS for a specific operating system version.

Conclusion

Even small ships carry maps. I’ve made a case for the use of design documents at startups, but a key takeaway is that their use varies from organization to organization. For some businesses, time spent planning for hypothetical futures is not time well spent. For others, it’s a valuable hedge against undesirable outcomes.

Experience has shown that once an engineering organization reaches a certain size, a reasonably rigorous design process is well worth having in place. A startup team’s habits tend to ossify into company culture, which is motivation to start thinking about a team’s design process early. Even if you decide against design documentation in the early stage of your company, going through the mental exercise of considering its implications will increase your team’s operational awareness.

Thanks to Marcel Molina and Gary Tsang for reading and providing feedback on drafts of this essay.

Introducing Armsible

2015-07-13T00:00:00+00:00

Update: Since the publication of this article, Armsible projects have been folded into Whisker Labs’ GitHub organization.

Much ink has been spilled over the “Internet of Things”. A consequence of this trend is the rise of the single-board computer as a mainstream form factor for application development. With the popularity of open source [1] platforms like Raspberry Pi, Arduino, and BeagleBoard, it’s never been easier to build applications that encompass both hardware and software.

However, there is less publicly-available material on how to incorporate single-board computers into larger-scale deployments. A typical use case involves someone using an ARM computer to monitor or actuate devices in their home. The deployment workflow is more often than not akin to a Linux server administered manually through SSH sessions over the lifetime of the device. In contrast to the level of automation fetishized in the software operations community, the state of the art in the open source IoT space is remarkably unsophisticated.

In spirit, Armsible represents a call-to-action for the use of industry-standard provisioning tools and techniques in embedded applications [2]. Specifically, it is a collection of Ansible roles and related tools that facilitate the automated deployment of single-board computers.

How do I use Armsible?

As of its unveiling, Armsible boils down to a few Ansible roles and a dynamic inventory script for targeting hosts on a local network. The initial use case is to provision a set of single-board computers on a LAN.
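For readers unfamiliar with dynamic inventories, the sketch below shows the general shape of such a script. It is not Armsible's actual script: the host list, the `board` variable, and the group names are hypothetical stand-ins for whatever a real LAN scan would discover. The only part taken from Ansible itself is the contract that an inventory script prints JSON when invoked with `--list` or `--host <name>`.

```javascript
// Hypothetical discovery result; a real script would scan the LAN instead.
const discovered = [
  { address: '192.168.1.20', board: 'raspberrypi' },
  { address: '192.168.1.21', board: 'raspberrypi' },
  { address: '192.168.1.30', board: 'beaglebone' },
];

// Build the JSON structure Ansible expects from `--list`: one entry per group
// plus a `_meta.hostvars` map so Ansible need not call `--host` per machine.
function buildInventory(hosts) {
  const inventory = { _meta: { hostvars: {} } };
  for (const { address, board } of hosts) {
    if (!inventory[board]) inventory[board] = { hosts: [] };
    inventory[board].hosts.push(address);
    inventory._meta.hostvars[address] = { board };
  }
  return inventory;
}

// Ansible invokes inventory scripts with `--list` or `--host <name>`.
function main(argv) {
  const inventory = buildInventory(discovered);
  if (argv[0] === '--host') {
    return JSON.stringify(inventory._meta.hostvars[argv[1]] || {});
  }
  return JSON.stringify(inventory);
}

console.log(main(process.argv.slice(2)));
```

Saved as an executable file, a script along these lines can be handed to Ansible through its `-i` option, which lets playbooks target groups such as `raspberrypi` without a hand-maintained hosts file.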

Armsible’s focused, albeit limited scope is a consequence of its intended use in concert with other roles from the Ansible community. A typical playbook for an embedded project will not be composed entirely of Armsible roles. Configuration management for standard components like DNS is a solved problem. Armsible fills the gaps between the needs of embedded applications and the existing suite of roles from the wider community.

To that end, we’d like Armsible to be the home for the following:

Roles for provisioning specific hardware platforms (e.g. Raspberry Pi, BeagleCore, Intel Edison)
Roles for installing and configuring software components that are needed by embedded developers but not currently covered by the open source Ansible community (e.g. the kernel watchdog, U-Boot, GPIO configuration)
Tooling that enforces best practices for embedded development

Why Ansible?

Ansible struck us as the right tool for the job because it is built around vanilla SSH connections. For embedded devices that run no-frills distributions of Linux, Ansible is much more applicable out of the box than other tools that rely on less-ubiquitous transport protocols and more-complicated topologies.

How is Armsible organized?

Armsible is structurally inspired by DebOps, a collection of Ansible playbooks for Debian-based server deployments. It comprises a number of Ansible roles stored as distinct repositories within Whisker Labs’ GitHub organization (formerly an Armsible GitHub organization). These roles are published to Ansible Galaxy and thus installable on the command line with ansible-galaxy. A bin project is provided to house complementary tools (i.e. dynamic inventory scripts) to be used in conjunction with Armsible roles.

What plans exist for Armsible’s future?

The project spawned from the hardware provisioning needs of products developed at Whisker Labs. As such, the project’s initial offerings are a sample of what we’ve developed so far and are thus limited to the technologies we use.

Part of the intention behind open-sourcing this work is to foster a community around IoT hardware provisioning. We encourage anyone working in this space to take a look at Armsible and help make it more useful. The best ways to get involved are by filing GitHub issues on individual projects or joining the conversation in #armsible on irc.freenode.net.

[1] The technologies in question are "open source" to varying degrees, but vendors' overall inclination towards open source is helping push the hardware world in the right direction. For instance, the Arduino and BeagleBoard/BeagleBone device families benefit greatly from the tooling, documentation, and manufacturing ecosystem afforded by open hardware design.

[2] "Embedded" should really be in air quotes here, given that we're talking about machines that run Linux. At the risk of graybeards not taking me seriously, I'm going to roll with it.

Coordinating technological change in large software organizations

2014-06-19T00:00:00+00:00

The topic of software scalability seems to bring out the armchair general in everybody. Much of the culture of the software industry is fueled by anecdotal war stories, blog posts, and “this one paper you should read”. We are all knee-deep in an unending stream of literature prescribing ways to achieve maximum computer performance, but the organizational consequences of hyper-growth get far fewer headlines. I would argue that these consequences have more of an impact on the daily lives of more developers than the scalability of code. The structure of a company can determine what you work on and who you do it with. Without widespread appreciation for the cost of coordinating technology changes across such a dispersed group of people, it’s hard to imagine any single employee not being impacted by wasted time and miscommunication.

A common tactic for scaling a software engineering organization is to compartmentalize teams around various components that collectively make up the company’s product. The development team may be split into Frontend Engineering and Backend Engineering. Each of these may be subdivided into focus areas, terminating in teams that cover specific sets of technologies. In this manner, a company’s team structure is modelled as a tree (conveniently similar to how its personnel fit into a tree-based org chart).

For instance, “Backend Engineering” may encompass any piece of technology deeper in the stack than user-facing clients, from analytics pipelines and application servers down to databases and operating systems. This model is especially well-suited for the development of service-oriented architectures, in which the components of a product’s backend are encapsulated in network services each maintained by small teams.

The burden of coordination

A consequence of this organizational complexity is an increase in the amount of coordination required to make progress. Given the subdivision into specialized teams, any work to improve the overall product will necessarily involve multiple teams. For example, the task of adding a recommendations widget may spawn work for the web, iOS, and Android client teams, the creation of a new batch job to be built and maintained by the analytics team, and a new API endpoint to be added by an application services team. The burden imposed by this need for top-down, product-oriented coordination is part of what motivates the widespread criticism of “big companies”. Implicit in the idea of being an early employee at a growing company is the ability to be directly involved in the product. As a workforce grows, the perceived ability of any individual to effect change diminishes. Compared to the freedom and breadth enjoyed by employees of short-staffed small businesses, making an impact within a larger organization may seem like more trouble than it’s worth. This sentiment often manifests in technology-driven companies leaving a trail of “startup people” in their wake who step away from the company once it’s survived the trial by fire of early-stage growth.

However, well-run large organizations benefit from the higher throughput afforded by a larger workforce to apply to problems. A great example of this on a grand scale is Apple, whose ability to “walk and chew gum at the same time” results in concurrent efforts to drastically reshape both their mobile and desktop offerings.

This covers the macro-level work that trickles down from high-level product decisions, but not the variety that stems from changes deep in the stack. Infrastructure work results in a separate class of communication overhead.

Bottom-up coordination

The often underestimated counterpart of this top-down coordination is the cost of the bottom-up coordination imposed on developers working on infrastructure. By “infrastructure” I mean any technology that is depended upon by other developers. In this context, infrastructural work would include library development, database administration, and service ownership. For these kinds of teams, making profound changes implies effort to coordinate with numerous teams. For example, before migrating to a new database or replacing a deprecated library, the initiating team will have to communicate with many others. These scenarios inevitably cause friction with other teams, whether by imposing unplanned work on them or simply adding the operational risk of deploying new code.

Part of what distinguishes great infrastructure teams is a sense of empathy for those that depend on them. When attempting to move an organization forward with a new technology, such a team will reduce the barrier to entry by addressing any likely concerns and minimizing the amount of work that developers have to do to make the switch. By going the extra mile to ease the lives of others, the team initiating the change improves the likelihood of success and greases the wheels of forward progress.

Preventing surprises

When rolling out a new technology, the goal is to lessen the likelihood of something unexpected happening. This involves predicting and documenting the things that are unavoidably apt to change as a consequence of the new technology. To those without context, any change will be unexpected, so the main thing to strive for is increasing the organization’s collective awareness of the change without being annoying.

Thorough documentation can go a long way, whether it be on a wiki, an email, or whatever communication mechanism the company relies on. A good way to frame migration documentation is in terms of the deficiencies of the old way and how the new hotness will improve the situation. “We’re hitting the safe upper limit of how far we can scale Database Product X within budget and our testing shows that Database Product Y will suit our projected needs for the next year and save us n dollars per month.”

Part of this documentation’s purpose is to walk developers through the process of migrating their projects to the new hotness. This will vary depending on the type of migration. For instance, a library change would call for introductory background information, before/after code samples, and links to any relevant API reference documentation. It’s important to mention any operational effects the changes may have. For instance, if the new APIs entail different resource utilization rates (e.g. object allocation, TCP connection churn) or behavioral changes, then the documentation should include specific metrics to keep an eye on when deploying the new code.

Conclusion

Coordinating changes within large software organizations is a necessary evil. There are serious downsides to doing too little or too much, so keeping a manageable number of people informed is a balancing game. Given the definitionally wide reach of “infrastructure”, bottom-up coordination is a key part of introducing new technologies within an organization.

Thanks to Ruben Oanta and Johan Oskarsson for reading and providing feedback on drafts of this post.

Survey on Technical Debt Management

2013-06-04T00:00:00+00:00

First coined by Ward Cunningham in 1992, the concept of “technical debt” is widely known within the software engineering community. It evokes other colloquialisms such as “code rot”, “cruft”, and “kludge”. The word “hack” is often used synonymously, but its usage is now overloaded and popularized to the point of meaninglessness. From his keynote presentation at the 2013 International Workshop on Managing Technical Debt, Steve McConnell (of Code Complete fame) provides a good working definition of technical debt:

A design or construction approach that's expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now.
Steve McConnell, Managing Technical Debt

A sizable portion of the work done by my team at Twitter qualifies as paying down technical debt. This is by no means meant as a negative. The performance gains from transitioning a Rails-based infrastructure into an ecosystem of JVM services have been gratifyingly enormous and the work itself is intellectually enriching. However, paying down technical debt is generally considered less desirable than feature development.

This sentiment is totally understandable. Greenfield work is sexy and fits the trope of the lone hacker cranking out code, fueled by caffeine and the Social Network soundtrack. The harsh reality is that when you’re working on systems of any meaningful scale, building in isolation is rare. There will always be dependencies, requirements, or even simply code you wrote two weeks ago that gets in your way.

Technical debt is a natural part of the software development process, and is thus unavoidable. There exist software anti-patterns that produce predictable debt, as codified in Michael Duell’s Resign Patterns. Through awareness and internalization of sanitary development techniques, one can prevent certain classes of technical debt from occurring in the first place. But for the inevitable cases when it falls through the cracks, a manageable strategy is to be mindful of the debt as it accumulates and to periodically make a concerted effort to pay it down.

Mindfulness toward technical debt

Just as with financial debt, there are multiple classes of technical debt with varying levels of insidiousness. There is “high interest” debt that will waste countless future hours of work. An example of this would be an ill-considered choice of framework, resulting in great expense to port to a different system later on. In contrast, an item of low-interest debt could be putting off writing a class’s test suite until after a milestone. If paid down soon after being taken on, this type of debt can be acceptable. However, as low-interest debt piles up, both in quantity and lifetime, it becomes increasingly dangerous and more onerous to deal with. If a development team is diligent about avoiding high-interest debt and reducing low-interest debt, they will be much more effective at reaching goals and staying productive in the long term.

Another axis on which to characterize debt is whether or not it’s taken on intentionally. Teams accrue intentional debt by making conscious decisions about the feasibility of their being able to handle the debt load later on. “We need to ship this feature ASAP, so let’s skip these tests until our next sprint.”

Unintentional debt is taken on carelessly, either by individuals’ actions or institutional change. On the level of an individual, a junior developer or contractor could introduce changes that render a system less maintainable. Depending on the complexity of the problem, code review is an effective preventative measure for these situations. Harder to deal with are large-scale events that inadvertently introduce vast tracts of debt. For example, the integration of an acquired company’s codebase or a coordinated refactor could leave a system in a less tenable state than it was before. There is no one-size-fits-all solution for such cases and they exemplify the importance of remaining mindful of debt accumulation.

In addition, it is important to track debt. With a log of specific debt items, a team can assess their debt load at any point and act accordingly. Without one, they are blindly flying into a minefield, condemned to endlessly fit square pegs into round holes. There is no way to reasonably fix the unmeasured quantity.
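A debt log need not be elaborate. The sketch below assumes a simple list of entries tagged with the two axes discussed above (interest class and intentionality) plus a rough effort estimate. The field names, items, and hour counts are all hypothetical, and the ordering rule is just one possible policy.

```javascript
// Hypothetical debt log; the items and hour estimates are illustrative only.
const debtLog = [
  { item: 'Framework chosen without evaluating alternatives', interest: 'high', intentional: false, hoursToFix: 120 },
  { item: 'Test suite for billing class deferred past milestone', interest: 'low', intentional: true, hoursToFix: 6 },
  { item: 'Retry logic duplicated across three services', interest: 'high', intentional: false, hoursToFix: 16 },
  { item: 'Stale TODOs in deploy script', interest: 'low', intentional: true, hoursToFix: 2 },
];

// Total estimated hours of outstanding debt, grouped by interest class.
function debtLoad(log) {
  return log.reduce((totals, { interest, hoursToFix }) => {
    totals[interest] = (totals[interest] || 0) + hoursToFix;
    return totals;
  }, {});
}

// One possible payment order: high-interest items first, cheapest first
// within each class. Returns a sorted copy and leaves the log untouched.
function paymentOrder(log) {
  const rank = { high: 0, low: 1 };
  return [...log].sort(
    (a, b) => rank[a.interest] - rank[b.interest] || a.hoursToFix - b.hoursToFix
  );
}

console.log(debtLoad(debtLog)); // { high: 136, low: 8 }
console.log(paymentOrder(debtLog).map((d) => d.item));
```

Even a log this crude answers the question "how much do we owe right now?", which is the measurement the rest of the strategy depends on.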

Planned payment of technical debt

Once a team locks down the rate at which they accumulate debt and makes a concerted effort to avoid the high-interest kind, paying down what remains is much more straightforward. From there, it’s simply a matter of prioritizing items in the debt log and chipping away at them.

The application of positive habit formation tactics can be very effective here. Just as someone wanting to get in better shape can explicitly plan gym visits into their schedule, software development teams can plan debt-reduction periods into their release cycles. This can take many forms, depending on the temperament of the team:

Baking debt-repayment into the sprint cycle. (e.g. devoting a portion of each sprint or one entire sprint per month/quarter to tackling items on the debt log)
Having a debt-reduction rotation wherein individuals focus on debt during their duty cycle.
Spinning out debt-reduction into its own project with a separate pool of resources. I’m admittedly skeptical of this approach. It seems to be analogous to a garbage collection problem, in which a mutator (the development team) is continuously introducing work items to be fixed by a collector (the debt-reduction squad). This is theoretically feasible if debt introduction is kept at a reasonable rate, but the division seems unmanageable to me.

Conclusion

McConnell’s viewpoint is abstract and arguably too high level to be of much use for certain development teams. The strategy presented here meshes well with what I’ve experienced at Twitter, but I admittedly may be writing from a BigCo stance. It’s been pointed out that McConnell’s principles don’t necessarily suit the realities of smaller companies. It would be interesting to examine this statement in another post, focusing on debt accumulation and fallout as companies grow.

Technical debt is often preventable, but it remains an inevitable part of the software development process. As much as it hurts one’s pride to hear it, everyone writes unthoughtful code some of the time. In order to keep systems maintainable, teams must adopt a strategic approach to controlling the rate at which debt accumulates, tracking the specific items that are deemed short-term-acceptable, and paying them down. Through this, a team can avoid much of the productivity and morale degradation associated with technical debt buildup.

If you find this topic interesting, I would encourage you to read through McConnell’s slides. My notes on the slides are available in this gist.

Thanks to Trevor Bramble, Mike Bernstein, and Richard Bailey for reading and providing feedback on drafts of this post.

TTLs for Dropbox

2011-10-31T00:00:00+00:00

A bunch of friends and I have a Dropbox shared folder in which we swap files of various (legal) sorts. Most of the folks in the group aren’t Dropbox zealots like myself who find ways to get 9+ GB for free. Thus the size of the directory in question becomes an issue as large forgotten files start to eat up others’ precious 2GB of space.

As a solution to this problem, I wrote a Node.js program that in essence lets you assign TTLs to items within a Dropbox directory. It runs as a daemon and deletes any files older than a specified lifetime.

For example, to run a daemon that checks the directory Dropbox/expirable-items once a day for items that are older than a week, modify the variable declarations thusly:

var dirToWatch = "expirable-items",
    ttl = 604800000,  // 7 days, in milliseconds
    interval = 86400; // 24 hours, in seconds

The program depends on the log.js and dropbox Node modules:

$ npm install log dropbox

Startup and delete events are logged to stdout, so redirect as you see fit:

$ node app.js > dropbox-ttl.log
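For the curious, the core expiry check can be sketched in a few lines. This version is an assumption-laden simplification: it sweeps a local directory with Node's `fs` module rather than talking to Dropbox through the `dropbox` module, uses modification time as a stand-in for age, and the function names `expiredItems` and `sweep` are made up for illustration.

```javascript
const fs = require('fs');
const path = require('path');

// Return the names of entries whose last modification is older than ttlMs.
function expiredItems(entries, ttlMs, nowMs) {
  return entries
    .filter(({ mtimeMs }) => nowMs - mtimeMs > ttlMs)
    .map(({ name }) => name);
}

// Delete expired items from a local directory (an assumption: the real
// program operates on a Dropbox folder via the `dropbox` module instead).
function sweep(dir, ttlMs, nowMs = Date.now()) {
  const entries = fs.readdirSync(dir).map((name) => ({
    name,
    mtimeMs: fs.statSync(path.join(dir, name)).mtimeMs,
  }));
  const expired = expiredItems(entries, ttlMs, nowMs);
  for (const name of expired) {
    // fs.rmSync requires Node 14.14 or newer.
    fs.rmSync(path.join(dir, name), { recursive: true, force: true });
    console.log(`deleted ${name}`);
  }
  return expired;
}

// To run as a daemon, repeat the sweep on a timer (setInterval takes ms):
// setInterval(() => sweep('expirable-items', 604800000), 86400 * 1000);
```

Keeping the age comparison in a pure function separate from the deletion makes the policy easy to test without touching any files.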

Teach Scala to undergrads

2011-09-26T00:00:00+00:00

A symptom of Scala’s growing popularity is the incessant discussion of its place in the bevy of industrial programming languages. This debate is often confusing, as both advocates and detractors of the language at times use the same argument in their favor: that Scala’s complexity renders it unfit for use by the average developer. This talking point may generate votes on Hacker News, but it isn’t remarkably productive at improving the state of software development.

People have been demonizing the rise of JavaSchools for years and I believe Scala to be an effective countermeasure. It represents the perfect supplement to a programming languages course, with the ability to show students how powerful functional programming is when applied to “real world problems”. As a single example, seeing how one can use higher order functions to avoid manual iteration through collections is enough to at least show students how much easier life can be with Scala.
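To make the higher-order function point concrete, here is a small sketch. It is written in JavaScript rather than Scala, but Scala's collections offer `filter` and `map` under the same names, so the comparison carries over. The sample data and function names are invented for illustration.

```javascript
const enrollments = [
  { student: 'Ada', grade: 91 },
  { student: 'Grace', grade: 78 },
  { student: 'Alan', grade: 85 },
];

// Manual iteration: index bookkeeping and mutable state obscure the intent.
function honorRollLoop(records) {
  const names = [];
  for (let i = 0; i < records.length; i++) {
    if (records[i].grade >= 85) {
      names.push(records[i].student.toUpperCase());
    }
  }
  return names;
}

// Higher-order functions: each step states what is computed, not how.
const honorRoll = (records) =>
  records
    .filter((r) => r.grade >= 85)
    .map((r) => r.student.toUpperCase());

console.log(honorRoll(enrollments)); // [ 'ADA', 'ALAN' ]
```

Both versions return the same list; the second simply leaves the traversal to the collection.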

I posit that the outlook of many students coming out of PL courses is akin to this continuum:

On one end you have “academic” languages like Haskell, ML, and Scheme which are interesting, but esoteric and impractical in that they’re rarely used in production environments due to their difficulty. On the other are the common currency of most software developers: Java and C (and Ruby and Python within more hip circles). The languages on the right are influenced by the research that culminates in the languages on the left in the same way that mainstream musical artists say that they listen to Thelonious Monk and Stravinsky to get ideas.

Scala fits somewhere in the middle. It’s a reasonably approachable language with a rapidly growing community and ample room for neckbearding. As proven by Foursquare, Tumblr, Twitter, Yammer, etc., Scala is a remarkable language for building the kinds of systems that CS students swoon over. After teaching ML, Haskell, or Scheme (WLOG), one could use Scala to show that many of the most expressive features of functional programming can be harnessed for use in a JVM language. Helping students connect the dots between imperative and functional programming would teach a valuable lesson that many of them never fully absorb.

More emphasis should be placed on experimenting with ways of raising the bar of the “average developer”. While I agree with the sentiments behind the notion that Scala is “too hard for a large portion of the Java community”, this comes off as more of a statement about Java developers than about Scala. If Scala is going to be pigeonholed into strictly being for a higher class of programmer, then why not enlighten students in their formative years?

Note: This argument could just as easily be made in favor of Clojure. The point is to experiment with improving the state of average instead of saying things are too hard.

Two months in

2011-09-05T00:00:00+00:00

Like countless others on the internet, I’ve been “meaning to write more” for a long time. Under the assumption that WordPress puts too much process into the task of blogging, I’ve designed a new personal website using a simpler tool. Hopefully the ability to write essays using the same workflow that I use to write code will grease the wheels of expression.

My old WordPress site is now accessible at old.evanmeagher.net. The new site is hosted on GitHub and its source is available here.

Last Friday was the two month mark of my employment at Twitter, Inc. I don’t think that I could be happier with my current situation. Twitter is proving to be exactly the workplace that I was hoping for: a friendly and open atmosphere with brilliant coworkers more than willing to help me learn everything that I can as quickly as possible. Coming out of college, it’s exactly the kind of environment that I want to be in to further my technical education.

As for the contents of this blog, I intend to write about what I learn. At the moment, this would include things about Scala, functional programming, and distributed systems, but my interests are bound to ebb and flow as I work on different projects and interact with different people.

To keep up to date with me, you can subscribe to this blog or follow me on Twitter for more granular updates.

Graduation

2011-06-27T00:00:00+00:00

It’s been a little over two weeks since I graduated from college. Tomorrow I’ll pack my life into a truck and begin the 800-mile journey from Seattle to San Francisco.

This move has been the light at the end of my tunnel for the past six months. In December, I turned down a job at Google Seattle in favor of one at Twitter, to the bewilderment of many of my friends and family. With a new city and an exciting job looming on the horizon, I’ve spent the first half of 2011 finishing my last two quarters of school and mentally preparing myself for a head-first dive into Silicon Valley.

As I begin a new chapter of my life, it seems like as good a time as any to take a crack at my lofty, neglected goal of writing more. Thus, I’ve created this blog on which to write about things that interest me. Stay tuned to see if I follow through.