Buckingham Inquirer

Ecosystem Niches

2012-09-16T23:38:00.000-07:00

Reading this recent argument on Reddit made me reflect on the different expectations of sysadmins in web and non-web (yes, every business has a website, but if your website is down and nobody thinks it's an emergency, you're not a web company) businesses of different sizes. I think a lot of the sysadmin blogging I read, while very good, comes too narrowly from the perspective of medium and large non-web businesses, since smaller companies don't tend to have a full time sysadmin on staff and web businesses rely more on their programmers for sysadminry due to scale (and since they already have a bunch of developers in house). Basically, I worry that while sysadmins are right to emphasize professionalism and working for the needs of the business (over idiosyncrasy), they're too quick to assume that every business's needs are like theirs.

Small businesses are different in that they actually need to take life and death (of the business, I mean!) risks. There are so many things that can go wrong at this scale that beyond good backups, focusing on them just removes energy from the main problems. Yes, you might go under if you get some bad IT breaks, but you might go under from sheer bad luck in any number of areas. The vast majority of small businesses fail, and the ones that succeed are not the ones that spend more time worrying about outlier problems, even when those problems are quite real and ignoring them in a larger business would be deeply unprofessional. Web businesses are different in that change velocity is at an absolute premium. So while the usual best practices for any size of business definitely apply, the order of affairs should be "automate, then make sure the automation incorporates best practices" rather than the reverse. If you don't automate first, you're so underwater keeping up with changes that nothing else will ever happen.

So here are what I would lay out as reasonable expectations for each category of business (as defined by its IT needs, rather than headcount or revenue):

Small business (<1 admin): not generally willing to pay for redundancy or sysadmin time outside of new setups and crises. If you name things clearly, follow best practices for environment setup, and have tested backups, you're doing it right. Don't expect to get beyond putting out fires.
Small web business (<1 admin): here the devs probably do all of the sysadmin work for the (probably cloudy) deployed infrastructure. The good part is that things will probably be reasonably automated. The bad part is that they probably won't do much research and will spend a lot of time reinventing the wheel (which produces brittle infrastructure as well as sucking up time). As long as you have working backups, don't worry about the hackery--just focus on growing until you can afford a full time devops person to help with build/deploy, database, and sysadmin knowledge (and do hire one as soon as you can afford it--definitely before your tenth developer).
Medium business (1 admin team): This is the time to read all of those classic sysadmin books. You can afford some redundancy and professionalism, and downtime gets expensive with all of those people on the payroll who can't do their jobs. Planned downtime is probably not a big deal. If your hair is always on fire, you're doing it wrong. If you have new user and new server setup pretty automated, you're ahead of the curve, but automated client rollout is no longer optional.
Medium web enterprise (no whole racks of identical servers yet): focus on getting complete monitoring and complete redundancy, first within and then across datacenters; completely automated server rollouts; infrastructure as code; and painless deploys. Downtime is failure. Painful builds/deploys are failure. In addition to following all of the classic sysadmin best practices (except with more automation) you need to help your developers understand and apply best practices, too, because the software is the business and you need to support it.
Large nonweb (meaning heterogeneous, not that big companies don't care about their websites) enterprise: this is basically medium business, except with multiple business units, division among network, systems, and storage teams, and big budgets. You probably have lots of EMC and Cisco stuff. You probably also have lots of bureaucracy. The priorities here have to be avoiding vendor lockin and retaining the benefits of bureaucracy (change control) without its evils (silos). Time to read those devops culture books. You might be able to make your business more like a large web business with a private cloud.
Large web enterprise (whole racks of identical servers--not that anyone at this level is relying on me for advice): focus on getting infrastructure costs down and self-healing software, since you don't have time to manage even the emergencies on a per-machine basis. At this level, infrastructure teams start to need C and Java programmers and hardware engineers. Success means low costs per end user and very few rolling outages (usually caused by errant self-healing code in your software).

High Availability Means 14 Copies of Your Data

2012-08-04T09:33:00.000-07:00

There's been a lot of grumbling lately about wasted disk in Cassandra replication, as if this is a fault of the software rather than an inherent constraint, so I'd like to review what is possible with perfect software.
Assumptions:

1. The software is perfectly topology-aware, perfectly load-balanced, has no bugs, and no unnecessary overhead.
2. All servers and disks are identical.
3. One server with one spinning disk (or RAID0 set) perfectly handles peak traffic at the desired QoS.
4. There are no issues with DDoS or unanticipated traffic peaks. We assume the client-side is perfectly understood and uniform, and merely seek to understand the implications of server-side faults.
5. High availability is actually important, not just a slogan, so we're ok with Availability and Partition-tolerance in the CAP theorem iron triangle.
6. This is a web-service or somesuch, not a bottomless-budget government program, so we won't build totally parallel software systems to guard against that level of human error.

So let's follow the logic and see where it takes us:

1. One disk in one server in one datacenter is clearly inadequate. Any fault causes unavailability.
2. What about two disks in two servers in two datacenters? Now we're protected against a known fault at any level of the system, but have no way to recover from a partition, as rebuilding a copy of the data would take our existing server over its maximum possible load.
3. What of three disks in three servers in three datacenters? This approach seems solid, but quorum will require cross-datacenter-reads, and synchronous cross-datacenter-writes would be required to avoid loss of data with a simple disk-failure. Furthermore, recovering from disk failures (frequent) would require cross-data-center reads, which take a long time, meaning the cluster would frequently be in a highly-degraded state, and the odds of an unrecoverable read error during recovery are high.
4. Four disks in four servers in two datacenters? Losing a disk still requires cross-datacenter-reads for recovery. Losing a datacenter will require a long time to recover even if you can instantly spin up a third (e.g. EC2).
5. Asymmetric datacenters don't get you anywhere, since you don't know which one you'll be running from when failure happens.
6. Six disks in six servers in two or three datacenters. With two datacenters, you now don't need to read across datacenters just because you lose a disk on the active side. With three, you would, except you can just fail over the traffic while you rebuild, and if you have a failure in datacenter B while A is rebuilding, you can just fail over to C. The choice of two or three datacenters would probably depend on your fixed costs and the replication characteristics of your software. Either of these solutions could work, until you bring in human error, the most frequent cause of downtime. To provide reasonable amelioration against human error, you need a non-live backup system.
7. Nope, your odds of unrecoverable error on read from backups are too high.
8. Ok, two backup servers, or at least two disks. That is enough if you're convinced by the vendor numbers, or if you don't mind having to choose between data loss and availability when in extremis. This seems to be where Amazon and Google live, which probably makes sense given the low profitability of their transactions, and in Google's case strong segmentation such that problems on all six copies of a given set of data only cause availability problems for a vanishingly small subset of their users. If your business makes twice as much per transaction as Google, however (most incorporated businesses not named Facebook), or you require only a handful of modern servers to handle your peak traffic (the vast majority of businesses period--remember we're talking about the database layer here) then those painful decisions will be more painful for you, and you'll want them to be correspondingly rarer. Small clusters are also more subject to bad luck (getting shipped a bad batch of disks) which doesn't show up in the vendor or Google's overall failure rates.
14. The above, with RAID. This is the real world concession to high disk failure rates, correlated failures in small batches, lower tolerance for partial outages, and slow replacements for failures.
24. Three symmetrical datacenters with three RAID10 servers + backup. This can make new datacenter buildouts easier (just power down a set of servers and move them, while keeping online redundancy), and might allow you to serve active-active from datacenters closer to customers most of the time. It also makes taking a whole datacenter down for network or power upgrades much less painful. However, in the real world many businesses find it easier to buy new servers when they open a new datacenter, and many popular databases (MySQL, PostgreSQL) don't well-support active-active or three-master replication triangles. If, however, your read load is a lot higher than your write load, and QoS requirements are tighter for it, you may want to use this model rather than buying more servers in each of two datacenters, since you can go active-active for reads, and buy operational flexibility and availability in addition to capacity without spending much more money (if your per-datacenter costs are low). This can also lower your costs by making it easier to negotiate with hosts, since you have a more credible threat to leave any particular provider and already have relationships with three of them. That all said, human errors are the greatest cause of downtime, and complexity is the mother of human error, so I tend to think RAID10 is more valuable than expected since the complexity is all hidden, and a third datacenter is less valuable than expected since the complexity needs to be handled by human architects at the application and network levels as well as in systems.
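The worry about unrecoverable read errors during rebuilds and restores can be put in rough numbers. The sketch below is only an illustration: the error rate of one unrecoverable error per 1e14 bits read is an assumed, datasheet-style figure rather than anything measured here, and the function name is made up for this example.

```python
import math

def p_unrecoverable_read(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading bytes_read bytes.

    ure_per_bit is an assumed per-bit error rate; substitute your drive's figure.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 so the tiny p does not round away
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Chance of hitting at least one bad sector while reading a whole copy of the data
for terabytes in (1, 2, 4):
    chance = p_unrecoverable_read(terabytes * 1e12)
    print(f"{terabytes} TB read end to end: {chance:.1%}")
```

Under that assumed rate, a full read of a multi-terabyte copy fails often enough that a single backup disk is a real gamble, which is the reasoning behind steps 7 and 8.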

So whether you have 8, 14, or 24 copies of your data depends on particular real-world concerns beyond any simplistic model, but the idea that a replication of 3(!) might be too much just means that you're not really in a high-availability world, in the sense that you're not very worried about the impact of hardware-level issues. That's ok--modern hardware is pretty reliable, and RAID10+backup (still at least 3 disks!) might be adequate for many revenue-producing uses where recovery has been well-planned, scheduled downtime for upgrades is possible, and blaming upstream providers is feasible in case of a major outage.
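For readers who want to check the arithmetic, here is one way to reach those totals. The breakdowns for 14 and 24 are my reading of the list above, not something it spells out: 14 assumes each of the six live servers mirrors its disk while the two backup disks stay unmirrored, and 24 assumes nine mirrored servers plus two backup disks per datacenter.

```python
def disk_count(live_servers: int, disks_per_server: int, backup_disks: int) -> int:
    """Total physical copies of the data: live disks plus non-live backup disks."""
    return live_servers * disks_per_server + backup_disks

layouts = {
    # six single-disk servers plus two backup disks
    "six servers + two backups": disk_count(6, 1, 2),
    # assumption: the same six servers, each with a mirrored pair
    "the above, with RAID": disk_count(6, 2, 2),
    # assumption: 3 datacenters x 3 mirrored servers, plus 2 backup disks per datacenter
    "three symmetrical datacenters": disk_count(9, 2, 6),
}

for name, copies in layouts.items():
    print(f"{name}: {copies} copies")
```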

UPDATE: Percona has some thoughts on failures and backups worthy of a read. Yes, their advice and expertise are targeted to MySQL, but remember that they support more web services than you can shake a stick at, so it's worth paying attention to what they have to say. First, "what kind of outages can happen?"

Someone runs UPDATE or DELETE and forgets the where clause or filters weren’t quite right
The application had a bug causing data to be removed or overwritten
A table (or entire schema) was dropped accidentally
Your InnoDB table was corrupt and mysql shuts down
Your server or RAID controller crashes and all data is lost on that server
A disk failed, and RAID array does not recover
You run into a InnoDB corruption bug that propagates via replication (not common, but does happen)
You lose your entire SAN and all your DB servers were located there. Let’s hope your backups are somewhere else!
You lose a PSU or network switch in your datacenter and some or all of your servers go down in that location
Your entire datacenter loses power and the generators do not start, which happens more often than you might think

I'd like to draw attention to three of these in particular: infrastructure corruption that propagates through replication, losing your entire SAN, and a "fully redundant" datacenter that goes down and does not come back. Stop thinking these things can't happen to you! They absolutely do happen; you cannot trust your software or your hardware. So what is to be done about it? Percona's "philosophy on backups:"

It is a good idea to schedule both logical and binary backups. They each have their use cases and add redundancy to your backups. If there is an issue with your backup, it’s likely not to affect the other tool.
Store your backups on more than one server.
In addition to local copies, store backups offsite. Look at the cost of S3 or S3+Glacier, it’s worth the peace of mind!
Test your backups, and if you have a test environment, load them there periodically. You can also spin up an EC2 instance to load your backups onto. In addition, you can binlog rollforward 24 hours of binlogs as a good test.
Store your binlogs off your primary server so you can perform point in time recovery.
Store your binlogs offsite for disaster recovery scenarios.
Run pt-table-checksum periodically (i.e. once a month) and make sure your servers data stays consistent. Checksumming is important, as backups are typically pulled off a slave and it’s vital that it has the same data.

Note how half of these boil down to "test, test, test" but the other half are simply "store more copies of your data." And this is all just backups, with no provision for high availability, and no guarantees of a quick restore: "Typically we upload mydumper backups to s3 vs xtrabackup given the time needed to upload/download. Though it depends on the available bandwidth and should be factored into your restore time." "Often the limiter of how fast this can be restored to another server, is how fast you can transfer data over your network. If you have 1GB network and you have 1TB of data, it could take awhile." How long do you think it will take from S3?
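To put a number on "it could take awhile": a rough sketch of the transfer time, assuming the quoted "1GB network" means a 1 gigabit per second link. The efficiency factor and the 100 Mbit/s offsite rate are made-up illustrations, not measurements of S3 or of any real link.

```python
def transfer_hours(data_bytes: float, link_bits_per_sec: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link, ignoring everything but bandwidth.

    efficiency is the fraction of the nominal link rate actually achieved (assumed).
    """
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600

one_tb = 1e12

# Best case: 1 TB at the full line rate of a 1 Gbit/s LAN
print(f"1 Gbit/s, perfect:    {transfer_hours(one_tb, 1e9):.1f} h")
# Same link at half of line rate (hypothetical protocol and disk overhead)
print(f"1 Gbit/s, 50% usable: {transfer_hours(one_tb, 1e9, 0.5):.1f} h")
# A hypothetical 100 Mbit/s path to offsite storage
print(f"100 Mbit/s offsite:   {transfer_hours(one_tb, 100e6):.1f} h")
```

Even the best case is over two hours before the database has started loading anything, and every step down in usable bandwidth multiplies that.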

Lessons Learned

2012-04-20T11:22:00.003-07:00

For the last six months, I have been employed half-time as an IT project manager, systems analyst, data scientist, and SQL developer. Small company, many hats--but that's not exactly new for me considering that my first full-time job was at a YCombinator startup of four where I did not only all of the sysadmin work, but also the importer, the exporter, the mobile device integration, the manual frontend testing, and answered the customer service email. What was new for me was having an IT job without root, and what a learning experience that was. All of you who worked at places where I was the sysadmin will probably be saying "it's about time!" but better late than never. I now have a new appreciation for the following:

Hardware matters. If developers ever once think "this would be faster on my home computer" then any hardware expenditure you've supposedly saved is totally illusory given the cost of your developers' time. Not only is one gigabyte of RAM wholly inadequate for analysis of even medium-sized (~1MM records) datasets, but given that most modern tools are built with the expectation of local development, hardware frequently needs to be adequate to run servers, test databases, and client virtual machines.
OS matters. Windows XP is not a modern OS, and frequently gets into states where it can't update due to obscure conflicts, making it a wildly insecure OS as well. Installing libraries required by applications can be a nightmare. This doesn't mean you should let developers install whatever they want, as that's not only a security risk, but also makes each developer reinvent the wheel on getting local development up and running, but you need to use something current and well supported.
Permissions matter. The traditional administrator/user distinction is wholly inadequate for developers. On Windows, they'll need to be local administrators, though you can still impose most domain policies as long as they aren't intrusive (IE only or somesuch nonsense). On Linux, use sudoers aggressively, or use Puppet to control what permanent changes root users can actually make, or both.
Tools matter. Yes, in the end Excel gives you a complete SQL shell and Turing-complete programming environment. However, compared to Ruby (or Perl or Python) for data munging and R for data analysis, it's a trainwreck where everything takes massively longer to write, test, and run. Also, using 2007 with its row limits rather than 2010 can mean a ten-fold increase in effort as whole sheets get devoted to intermediate summary functions.

And of course in addition to the new lessons, some old ones were heavily reinforced:

You don't always get what you pay for with employees, but you rarely if ever get more. If you can't find good people, or are afraid to fire bad people because you can't replace them, then either your HR director needs to be the first one to go or you're not offering adequate compensation. Don't forget that working for a prestigious company or in a great place or one with low cost of living or on a great project or for a famous person or whatever are their own forms of compensation--you both over and under-estimate them at your peril.
Confusion and indirectness are multiplicative. A person or department with two bosses will get half as much done. Software that has to be paid for by a third party or installed by a "value-added" reseller will only deliver half of the value. If the professional services division of your vendor and its software developers are in different countries, they might as well be separate businesses.
There's no substitute for agile delivery of software if it involves any custom code or professional installation whatsoever. Waterfall Does Not Work. Nobody knows in advance what all of the right questions are, so your software will inevitably actually be delivered in stages. Why not just plan it that way? If you think the "minimum viable product" is really, really large, then think harder about whether you can deliver some piece of that product first to some part of the team, even if it can't yet face an end-customer.
If you haven't automated enforcement of your policies, you have no policy. This (combined with 2 & 3 to be sure) is why all "enterprise" software takes forever to pay off: the business logic delivered in the code isn't actually your business logic because before you had code nobody actually knew what your business logic was. No systems analyst, team thereof, or procedure can fix this. Agile software directly paid for by the people who use it, who themselves have a clear chain of command, is the only solution.
Test environments matter. Agile development helps with this, because if development is agile then HEAD and PROD should always be close enough together to easily port production data back and forth, and nobody can pretend that test and production will happen on the same system since they're contemporaneous.
There's no substitute for courageous management. The sooner the pain happens, the less painful it is.

Cluster Filesystems: Some people still don't get HA

2012-01-20T13:34:00.000-08:00

HA is about, more than anything, transparency and defined behavior. I don't lose any sleep over switches, loadbalancers, NetApps, or DRBD systems that take 30s to come up after failover. An issue where I would need to go physically swap a piece of hardware, or manually intervene to fail over between datacenters, might cause an hour of downtime, but at least it's well-bounded. The things that really kill you are the ones where you shoot yourself or lose data. Restoring backups on fresh hardware might take long enough that most people wouldn't describe it as HA, but it takes a lot less time than manual data recovery and change reconciliation.

Wide area block devices aren't transparent and they don't have defined behavior. That makes them a much worse choice than distributed databases, and a bit of setup complexity on the sysadmin and application sides doesn't change that. I like Jonathan Ellis, but I think he just doesn't get it here. He's focused on the technical possibilities at a particular layer of the stack (you can build a distributed filesystem on Cassandra, after all), rather than the performance of the stack as a whole (the only reason you'd do that is if you had an app that only understood filesystems, and then your app can't make the decisions it needs to make given a distributed backend). Developers who think that they can get HA resilience for free on top of someone else's abstractions are dangerously kidding themselves and need to stop. Build your app within the constraints of someone's PaaS, build your own CAP tradeoff logic, or tell the business they can't afford HA. Welcome to reality, folks.

Glad Tidings

2012-01-20T12:29:00.000-08:00

Just a quick post which I'd hoped to have out for Christmas highlighting some of the best recent software releases from a sysadmin perspective.

First, unit testing tools for Puppet. Turning your infrastructure into code is an amazing development for about 50,000,000 reasons, but you'll only fully capitalize if you're willing to learn from the best practices of software development, like version control, peer review, modularization, and testing. Up to now, testing tools for Puppet had been rather clunky, but this looks like a significant improvement.

Second, PAM authentication for MySQL. I'm definitely a fan of Postgres, Cassandra, or HBase over MySQL, depending on the application requirements, but if you're stuck with a legacy MySQL deployment this is a security godsend. I'd keep my application passwords in the legacy system for performance and robustness, but elevated privileges should be strictly the domain of human beings, and this allows you to easily manage those accounts with whatever LDAP or Puppet tooling you've already built for system accounts. Welcome to access rights on day one and clean shutoff for departures. You might think that my MySQL pick would be Percona's synchronous replication, but I have serious performance and robustness misgivings about that arrangement: it's more complicated than simple failover, but doesn't give you the performance advantages of sharding.

Third, Netflix released their Zookeeper library. Multi-tier applications need a way to keep track of what cluster members are live in each tier, and traditional solutions like Puppet and DNS are too slow and asynchronous for large deployments. Zookeeper is perfectly built for this problem, but adoption was limited since the interface was a pain unless you had a homogeneous Java stack. Now solved :).

Fourth, a set of tools for making Java deployments easier. I guess Java and the JVM platform are great for developers, with lots of tooling support for rapid development but a reasonable level of access to networking, data structures, and algorithms for performance. It also allows hacked-together Mac and Windows development environments. For sysadmins, though, Java has basically been a nightmare, with its supposed portability meaning that it doesn't fit in cleanly with native Linux tools for deployment, configuration, and monitoring. The last may still be a major issue, but the first seems to have made a major step forward with the projects mentioned above.

Fifth, Amazon DynamoDB. No, it doesn't have the features large users of Cassandra and HBase, or even clustered/sharded SQL, have come to expect. It's probably not much cheaper, either. But for many good and bad reasons, lots of OLTP webapps are committed to running in the cloud, and for those (read: nearly all) where SimpleDB wasn't enough, this is a massive improvement over the disaster of running your own SQL or NoSQL databases on block storage.

Happy 2012, sysadmins!

Puppet & Cross-Cutting Concerns

2011-11-13T11:16:00.000-08:00

One of the hardest things about managing a Puppet installation for a complex infrastructure is handling cross-cutting concerns (servers that need to get a piece of configuration data based on the cartesian join of location, role, project/cluster, etc.). Currently supported ways of doing this all have serious drawbacks. You can do all the assignment manually in LDAP or some other directory, but that defeats the point of automation. You can use extlookup, but it only supports a single hierarchy of overrides. You can just have separate puppet instances for each cluster and reuse your modules, but what if you have servers that are in multiple clusters? (Such things are often dismissed by small shops with a single project or large shops with many servers in each role, but many real medium-size businesses with multiple products have such problems, and more or less inevitably so.) I think Puppet needs to seriously look at decorators or some other method of handling this organically, which would require some aspect-oriented-programming support in the language.
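To make the problem concrete, here is a toy sketch in Python (not Puppet) of the kind of lookup I mean: configuration chosen by matching several independent dimensions at once, with a node allowed to belong to more than one cluster. Every name, key, and value below is hypothetical, and "most specific rule wins" is just one possible resolution policy.

```python
# Each rule pairs the facts a node must match with the settings it contributes.
# All names and values are made up for illustration.
RULES = [
    ({}, {"ntp_server": "ntp.example.com", "log_level": "info"}),
    ({"location": "dc2"}, {"ntp_server": "ntp.dc2.example.com"}),
    ({"role": "db"}, {"log_level": "warn"}),
    ({"location": "dc2", "role": "db", "cluster": "billing"}, {"log_level": "debug"}),
]

def matches(conditions: dict, node: dict) -> bool:
    """True if the node satisfies every condition; a node may be in several clusters."""
    for key, wanted in conditions.items():
        if key == "cluster":
            if wanted not in node.get("clusters", []):
                return False
        elif node.get(key) != wanted:
            return False
    return True

def resolve(node: dict) -> dict:
    """Merge matching rules, applying broader rules first so narrower ones override."""
    settings = {}
    for conditions, values in sorted(RULES, key=lambda rule: len(rule[0])):
        if matches(conditions, node):
            settings.update(values)
    return settings

db_node = {"location": "dc2", "role": "db", "clusters": ["billing", "reporting"]}
web_node = {"location": "dc1", "role": "web", "clusters": ["storefront"]}
print(resolve(db_node))
print(resolve(web_node))
```

A single override hierarchy cannot express the last rule without duplicating data for every location and role combination. Note that this sketch breaks ties between equally specific rules by list order, which is exactly the sort of decision a real language feature would need to make explicit.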

Black Boxes and Overfitting: The Twin Cases of EBS and CDOs

2011-11-13T07:59:00.000-08:00

Edit: lest you think EBS had solved its issues, it's still the AWS component with the most frequent availability-zone-wide outages.

I think there are some instructive comparisons to be made between the motivations, technologies, and failure modes of Amazon Web Services Elastic Block Store (EBS) and investment banks' collateralized debt obligations (CDOs). With luck, elucidating those similarities will help lessons learned in each area mitigate risk in the other and maybe even help technical workers better understand financial risk and financial workers better understand technical risk (in the spirit of technical debt).

1. EBS and CDOs were both born from the insight that sharing can reduce risk. Before EC2/EBS, companies could either bet on high usage, with the attendant risk of having lots of capital tied up in non-productive assets if traffic was lower than expected or efficiency was higher, or bet on low usage and fail to serve customers if traffic was higher than expected or efficiency lower. EC2/EBS uses a shared infrastructure, so capacity projection can happen at the (hopefully more stable) level of the internet as a whole, with resources dynamically allocated to whoever needs them at a given time.

Before CDOs, financial institutions tended to have large exposures to the risk of those regions or industries where their market presence was highest, and since particular regions and industries tend to have more volatile economic profiles than the world as a whole, they either needed to carry excess non-productive (reserve) capital to offset that risk, or potentially go bankrupt in a sectoral downturn (e.g. Dustbowl banks during the Great Depression). CDOs allow each financial institution to package risk in a standard way, so that they can buy and sell whatever portions are necessary to achieve an optimum risk profile within their budgets, rather than being hostage to the vagaries of the markets in which they operate.

2. EBS and CDOs are both marketed by trusted vendors of other products who ask for more faith. Whatever you personally think of them, both Amazon and the major investment banks (Goldman, Merrill, Morgan, Bear, Lehman) were broadly trusted for all kinds of transactions before they started to push EBS and CDOs. Not only unrelated kinds of business (retail brokerage, Christmas presents) but also more direct precursors in the sharing of risk: asset backed securities and S3.

3. EBS and CDOs are both driven by concerns about time to market, transaction costs, and labor costs. In addition to chasing lower risk, businesses also want to cut expenses, and having standardized structures in which somebody else does the legwork that you can buy and sell on the open market at any time seemed like a great way of doing that. Tech companies could stop hiring sysadmins and purchasing agents. Banks could stop trying to grow and acquire their way into more diversified markets. Risk and computer time could be bought and sold whenever the business required, rather than waiting for some difficult logistics or paperwork chain to swing into action.

4. EBS and CDOs are both an attempt to make features from an old paradigm available in a new one. I can't really think of anyone who would deny my first three claims, but this one is more subtle and possibly contentious. Cloud services promised a new world of abstraction, ephemerality, and explicit guarantees. The upside made companies desperate to move their web operations onto the platform, but their entrenched data models didn't fit into Dynamo's simple key/value paradigm and their legacy databases didn't replicate well enough for ephemeral storage to be sufficient. Amazon wanted to please their customers (and increase profits) so they put a lot of duct tape and baling wire around iSCSI, DRBD, and LVM and called it EBS. All of the cloudy resource sharing, but now with permanent* block storage to accommodate legacy databases, carved out of some set of disks transparently mirrored and replicated behind the scenes. Users then began to rely on that leaky abstraction.

Investment banks did something similar with CDOs. In the new world of high frequency trading and complex computer models, once the mortgages were standardized into pools buyers could theoretically have bought any cross-section of a deal that they wanted, tailored to their needs, and subsequently saleable on the open market at whatever appreciation or depreciation then applied, just like equities. Less sophisticated investors, however, like retirement funds, weren't equipped (or potentially allowed by statute) to run complex computer models on each segment of a transaction. Instead, they demanded large blocks rated by the agencies (e.g. Standard & Poor's, Fitch, etc.) at discrete ratings (e.g. AAA or investment-grade). So, like Amazon, the investment banks listened to their customers and their bottom lines and delivered. Standardized mortgage pools were tranched into more complex structures that allowed large investment blocks to be declared AAA.

5. EBS and CDOs both provide incredible value in good times--which are most of the time. Baron Schwartz explains why virtualization has a very steep normal/worst-case performance curve, and why it's difficult to even find out what the worst case performance is. Given that, the median performance will be well above the average performance, which makes cloud service seem like a very good value. That's even more true of EBS, where the additive variation occurs at each level of the stack that is virtualized (cpu, network, storage), and thus has even fatter tails.
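A toy illustration of that median-versus-average gap, with entirely made-up numbers: suppose 95% of requests get full speed and 5% hit contention. The second calculation assumes three virtualized layers that each misbehave 5% of the time independently of one another, which is a simplification for the sake of the arithmetic.

```python
import statistics

# Hypothetical throughput samples in MB/s: 95 fast requests, 5 that hit contention
samples = [100.0] * 95 + [5.0] * 5

median = statistics.median(samples)   # what a typical request sees
mean = statistics.mean(samples)       # what you get on average
worst = min(samples)                  # what you need to plan for

# Assumption: cpu, network, and storage each hit contention 5% of the time, independently
p_any_layer_slow = 1 - 0.95 ** 3

print(f"median {median:.1f}, mean {mean:.2f}, worst {worst:.1f} MB/s")
print(f"chance at least one of three layers is slow: {p_any_layer_slow:.1%}")
```

The typical experience looks perfect, the average is only a little lower, and neither number tells you about the worst case. Stacking virtualized layers makes a bad draw noticeably more likely without changing what the median user sees.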

CDOs worked similarly: having achieved a AAA rating with better yields than treasury bonds or similarly-rated corporate debt, during non-severe-recession years there was no way to detect hidden risk, and only the higher income stream was evident. Public pension funds and private hedge funds both looked flush with cash, but were really just lucky.

6. EBS and CDOs both created environments where users were hyper-aware of luck at a micro-scale, while completely ignoring it at a macro-scale, so hedging strategies were actually counter-productive. EBS users are so aware of randomness in the loading of particular parts of the infrastructure that they often create new volumes, test them, and discard them if performance is poor. They devise best practices around hardware failures. As that first article explains, however, performance failures turned out to be correlated, and as both AWS itself and its users tried to find enough working sections of the system to remirror their data, more and more sections became overloaded and failed. In fact, the more users tried to spread data across volumes, the more pain they felt. User behavior in attempting to account for local risk actually changed the way the service was being used enough to generate increased global risk.

CDOs work similarly. If particular mortgages and bonds weren't risky, there would be no point in securitizing them. It turns out that securitization is highly dependent on the correlation factor of the underlying assets, however, which of course turned out to be higher than expected. So the creation of CDOs lowered the expectation of risk, but didn't actually lower risk, meaning that lots of perceived value was destroyed when they crumbled. Furthermore, as banks tried to further hedge their risks with credit default swaps (CDS) on those CDOs, sellers of credit default swaps, like AIG, would go bankrupt in a bust even if they had no exposure at all to the original loans. Worse, the very creation of CDOs allowed credit to flow more effectively, creating a bubble in housing prices which changed the assumptions on which the CDOs were based.
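
A rough way to see why the correlation factor matters so much is a one-factor Gaussian model, sketched below. This is a textbook simplification chosen for illustration, not the model any particular bank or rating agency used, and the default probability, pool size, and 20% attachment point are made-up parameters. Each loan defaults when a blend of a shared "market" shock and its own private shock falls below a threshold, and the senior tranche is hit only if more than 20% of the pool defaults.

```python
import random
from statistics import NormalDist


def senior_breach_probability(rho, p_default=0.05, n_loans=100,
                              attachment=0.20, n_trials=2000, seed=1):
    """Estimate how often pool defaults exceed the senior attachment point.

    rho is the assumed correlation between loans; rho=0 means independent.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(p_default)  # default if shock < threshold
    w_common = rho ** 0.5
    w_own = (1.0 - rho) ** 0.5
    breaches = 0
    for _ in range(n_trials):
        market = rng.gauss(0.0, 1.0)  # shock shared by every loan in the pool
        defaults = sum(
            1 for _ in range(n_loans)
            if w_common * market + w_own * rng.gauss(0.0, 1.0) < threshold
        )
        if defaults / n_loans > attachment:
            breaches += 1
    return breaches / n_trials


independent = senior_breach_probability(rho=0.0)
correlated = senior_breach_probability(rho=0.5)

print(f"breach probability, independent loans: {independent:.4f}")
print(f"breach probability, correlated loans:  {correlated:.4f}")
```

Every loan has the same 5% default probability in both runs. Only the correlation changes, yet it is the difference between a senior tranche that essentially never takes losses and one that does with uncomfortable regularity.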

7. EBS and CDOs both create abstractions which are not amenable to performance observation or prediction in crises. Much of the pain suffered by customers around EBS is because the abstraction's developer contract is so broad and opaque. Amazon can restore an S3 bucket sans one file, and you have most of your life back. Amazon can't restore an EBS volume without a certain stripe because your filesystem won't mount. Conversely, the end user doesn't know when snapshotting and moving to a new volume will help (because the contention is purely local and random) or hurt (because the whole system is experiencing pain). Since neither party knows what data is actually where, neither party knows what's really going on.

CDOs suffer from a similar lack of transparency. The companies that deal with mortgage customers don't actually own the loans, and in fact many entities may own shares of a single loan under different contract terms, which prevents effective loan modifications. CDO default correlations are the outputs of models with enormous numbers of variables, each with unknown distributions, making it impossible for different parties to agree on valuations during periods of market volatility--which is why CDOs were so illiquid during the crash, and basically solvent banks had to accept government assistance to pay their daily bills.

Lessons learned? In both cases, some major firms, like Netflix and Goldman Sachs, understood the underlying architectures well enough to avoid using those parts which were not sufficiently transparent for their needs, unlike Reddit and AIG, which suffered mightily from failing to do so. The obvious lesson is that if you're making money but you don't really understand the underlying models that power your business, you're probably unwittingly taking on a lot of risk, perhaps to someone else's gain. TANSTAAFL, as I believe Smarter Travel Media may have found out in the online advertising department.

The more effective your business is at optimizing to a set of constraints, the more important it is that you understand the forward validity of those constraints.

On the provider end, realize that most of your users don't know what abstractions are actually helpful. You have to trust your gut to come up with good ones--and you'll remember that AWS didn't ship with EBS, and no other cloud provider has anything like it. I wonder why? Similarly, Goldman hedged their CDO risk instead of banking on it, and most investment bankers are incentivized to ignore their instincts on risk. If you're pressured by customers/profit into offering an abstraction that your gut knows is wrong, prepare technically and legally for assured future pain.

Why You Shouldn't Use MongoDB for OLTP

2011-11-11T20:57:00.000-08:00

UPDATE4: Still broken on the Jepsen network partition test; many users report *weekly* cluster crashes.

UPDATE3: Don't use Mongo for OLAP either. I used to think Mongo was fine for OLAP, because if you lost a disk or something you could just reload all of your data. But if your OLAP setup is big enough to actually need a replicated datastore, it's big enough that you can't afford to reload everything every time you lose a disk and/or crash a process.

UPDATE2: If you thought Mongo 2.2, getLastError, or writeConcern would solve these problems, they won't. Yes, that's a flame by a competitor, but the arguments from the code aren't rebutted anywhere. Even with all of the options turned on, you can still lose data in a single disk crash.

UPDATE: Obviously this post is a little dated, but this post and comments at Hacker News substantiate all the same concerns about Mongo (and MySQL-like architectures in general) in production settings. If you need availability, you need a true P2P architecture, and if you need performance you need to focus on durable write speed and scalability. The other way around turns into ops-hell just when your business takes off.

Why is MongoDB a popular option?
1. The developer libraries are really easy to use and standalone systems are easy to install. Many database projects are developed by extremely talented people who think that because writing a connection library is easy compared to the database internals, end users would really just want to write their own custom stuff anyway. Mongo and MySQL succeed in no small part because they fully recognize that developers are expected to have a working prototype yesterday.

2. It seems to offer the scalability and power of HBase/Cassandra with the ease of use of CouchDB/MySQL. Of course you can't get something for nothing, but "something for me now" at the cost of "maybe something for somebody else later" is often an attractive tradeoff. Additionally, 10gen try really hard to gloss over/ignore/propagandize over those tradeoffs.

3. Mongo is the best way of solving the unsolvable problem of running a high volume Web 2.0 OLTP database on EC2, due to the horrific storage latency of EBS.

So let me take those in order.

1. That's just true, and is a lesson that lots of infrastructure developers should bear in mind. It's what made Steve Jobs rich. Usability matters, whether we like it or not. One thing organizations can do to keep this from skewing their judgment too much is to take on the DevOps model, where the production sysadmins are right in there with the prototyping developers, and can hopefully help them get environments up and running quickly while also accelerating the production sysadmin's understanding of the platform. Win/win. Of course in purely exploratory use cases this may be the only thing that matters, but I think part of the lesson of the agile movement is that there basically aren't any of those. If they wanted the prototype yesterday, they'll want the production implementation yesterday too, so you'll reuse the prototype.

2. This is the big one. Alex Popescu passes on an anonymous rant that summarizes the issues. The rant itself may be a hoax, but the design complaints echoed by Alex's commenters are no less trenchant for that:
a. Like MySQL, Mongo's performance benchmarks are all done with the safety features turned off (immediate writes, write-ahead log). You either get (drastically) less performance than you expected when you flip the production switch, or you cross your fingers and pray.
b. Like MySQL, Mongo is optimized for reads (everything in RAM) and not writes (global write lock!)--except writes are the hard problem, writes are the problem that doesn't occur in dev and is hard to simulate in test, and reads can usually be papered over with caches.
c. Like MySQL, Mongo relies on a master/slave arrangement for availability. Anyone who has dealt with MySQL replication at scale, like, say, Mark Callaghan, knows this is a nightmare. Percona estimates the approach at only three nines of uptime. Making replication truly crash-safe is a really hard problem. Making sure that you have a known window of data loss during slave failover is also a hard problem. Both need to be baked in from the beginning--databases need to be fundamentally clustered (this is of course why things like HBase/Cassandra/Oracle RAC are harder to get going in the first place).
d. Like MySQL, resharding is a bolt-on feature of Mongo, instead of being baked in from the beginning. Given that this is basically the hardest thing to get right in a clustered database, that's a recipe for disaster.
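
Point (a) is easy to check for yourself on any storage you're considering. The sketch below is not Mongo or MySQL code; it is a generic, assumed stand-in that appends the same records to a plain file twice, once letting the OS buffer the writes and once forcing each record to disk with os.fsync before moving on. The record size and count are arbitrary, and the timings depend entirely on your disk, so no numbers are promised here.

```python
import os
import tempfile
import time

RECORD = b"x" * 128 + b"\n"  # arbitrary fixed-size record


def write_records(path, n_records, durable):
    """Append n_records to path and return the elapsed seconds.

    With durable=True each record is flushed and fsynced before the next
    one is written, which is roughly what "safe" write settings cost.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(RECORD)
            if durable:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to the device
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    fast_path = os.path.join(tmp, "fast.log")
    safe_path = os.path.join(tmp, "safe.log")
    fast_seconds = write_records(fast_path, 200, durable=False)
    safe_seconds = write_records(safe_path, 200, durable=True)
    fast_size = os.path.getsize(fast_path)
    safe_size = os.path.getsize(safe_path)

print(f"buffered writes: {fast_seconds:.4f}s for {fast_size} bytes")
print(f"fsynced writes:  {safe_seconds:.4f}s for {safe_size} bytes")
```

Both runs store identical data; only the durability guarantee differs. If a vendor's benchmark looks like the first number and your production config looks like the second, that gap is the one to plan around, and on high-latency network storage it tends to be at its widest.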

In general, 10gen, like MySQL, looks good on Wiki pages and RFP responses because all the boxes are checked, but fails in real life where you want all of the boxes checked at the same time. Not only is that borderline dishonest, it's a terrible way to write software since it's almost impossible to fix later on. Beyond that, the CAP theorem is a harsh mistress, and as the people who actually write this stuff keep telling you, the only way to make an accurate tradeoff is to know your data really well (what's the locality for reads and writes, does the distribution work well for bloom filtering, etc.). When Mongo offers a schema-free solution that offers great query performance without knowing anything about your data, you can be sure that CAP is going to bite you later in ways you didn't expect.

3. This is also probably true. EBS is a lie, which is to say inevitably brittle, and like with Mongo and MySQL, it's a design level problem that implementation fixes won't help. So your options right now are:
a. Magically force your data into a pure key/value system and use SimpleDB.
b. Use Mongo (or RDS if your volume is low) and live with lost data and downtime. With latency like this you can't afford to doublewrite or synchronously replicate anyway. If your business is FourSquare, that might be ok.
c. Buy servers.
d. Fork over the substantial operating cost for DynamoDB.
e. Run Cassandra on ephemeral node storage like Reddit. This works ok if you have a lot of nodes (more than 6, at minimum) across availability zones and highly automated bootstrapping (i.e. if you're big enough that you could probably afford your own servers anyway). But this is a reasonable upgrade path from DynamoDB.