<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6067234520252164707</id><updated>2024-09-20T18:27:13.110-07:00</updated><category term="professional"/><category term="storagemonkeys"/><category term="dns"/><title type='text'>Max Kalashnikov</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>11</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-1431717065714913988</id><published>2011-04-11T07:41:00.000-07:00</published><updated>2011-04-11T07:41:33.541-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><title type='text'>The myth of the &quot;commodity&quot; server (for memory)</title><content type='html'>Over the past several years, I keep stumbling upon deployment systems and such concepts as &quot;sharding&quot; which use as their &lt;i&gt;raison d&#39;être&lt;/i&gt; the ability to scale across an arbitrary number of cheap, &quot;commodity&quot; (usually 1U) servers.&lt;br /&gt;
&lt;br /&gt;
The implication is that &quot;larger&quot; servers either have a higher price per performance or are somehow more difficult to administer[1]. I reject both suppositions.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
The day of &quot;big iron&quot; is well past us. This isn&#39;t to say one can&#39;t still buy large machines, or even run Linux on an IBM z-series, but for most practical intents, there are only two classes of Linux[2] server hardware.&lt;br /&gt;
&lt;br /&gt;
The larger class is based on the quad-processor Xeon 7xxx series motherboards. These machines are, I admit, less bang for the buck, if one&#39;s &quot;bang&quot; is fungible processor power and/or memory.&lt;br /&gt;
&lt;br /&gt;
Everything else, however, has either linear or even sub-linear pricing.&lt;br /&gt;
&lt;br /&gt;
Let&#39;s look at the current pricing from Dell, which I find to be the cheapest of the brand-name vendors:&lt;br /&gt;
&lt;br /&gt;
CPU(s) (cores@clock)  server model  memory slots  price&lt;br /&gt;
&lt;br /&gt;
X3430 (4@2.4) R310 4mem $1257 &lt;br /&gt;
E5620 (4@2.4) R410 8mem $1319&lt;br /&gt;
E5620 (4@2.4) R510 8mem $1418&lt;br /&gt;
E5620 (4@2.4) R610 12mem $1762&lt;br /&gt;
E5620 (4@2.4) T610 12mem $1537 &lt;br /&gt;
E5620 (4@2.4) R710 18mem $1712&lt;br /&gt;
E5620 (4@2.4) T710 18mem $1498&lt;br /&gt;
&lt;br /&gt;
1*E6510 (4@1.73) R810 16mem $3821&lt;br /&gt;
2*E7520 (4@1.86) R810 32mem $5531 &lt;br /&gt;
2*E7520 (4@1.86) R910 32mem $5790&lt;br /&gt;
4*E7520 (4@1.86) R910 64mem $8855&lt;br /&gt;
&lt;br /&gt;
These are all configured with rack rails with cable arms and as little memory as possible, assuming one would buy commodity memory. What&#39;s notable is that the &quot;small&quot; machines with 4 and 8 memory slots are under 10% cheaper than the next ones up and that the 18-slot models are cheaper than the 12-slotters.&lt;br /&gt;
&lt;br /&gt;
If one is memory-bound[3], the best deal for the money is the 5U-tall T710. If you&#39;re fortunate enough to be in a facility with plenty of power but not plenty of space, then the 2U-tall R710 makes sense for the extra 15%. Either way, assembling that many memory slots out of the smaller 1Us is going to be more expensive, more space and power consuming, and will yield less usable memory, since each box has some common OS overhead.&lt;br /&gt;
&lt;br /&gt;
What I also find notable is that the higher-end servers, though over twice as expensive for the cheapest model, are still cheaper and smaller for the memory slots than enough 1Us. Even over the 2Us, the price premium is under 50% for the base system, and likely a good deal less once the memory itself is included.&lt;br /&gt;
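&lt;br /&gt;
For the comparison-minded, the table above reduces to a quick dollars-per-memory-slot calculation. A minimal sketch (prices and slot counts are just the Dell quotes listed above, so treat the output as of-the-moment only):&lt;br /&gt;
&lt;pre&gt;
# Dollars per memory slot, using the Dell quotes in the table above.
configs = [
    (&quot;R310, 1x X3430&quot;, 1257, 4),
    (&quot;R410, 1x E5620&quot;, 1319, 8),
    (&quot;R510, 1x E5620&quot;, 1418, 8),
    (&quot;R610, 1x E5620&quot;, 1762, 12),
    (&quot;T610, 1x E5620&quot;, 1537, 12),
    (&quot;R710, 1x E5620&quot;, 1712, 18),
    (&quot;T710, 1x E5620&quot;, 1498, 18),
    (&quot;R810, 1x E6510&quot;, 3821, 16),
    (&quot;R810, 2x E7520&quot;, 5531, 32),
    (&quot;R910, 2x E7520&quot;, 5790, 32),
    (&quot;R910, 4x E7520&quot;, 8855, 64),
]

for name, price, slots in configs:
    print(f&quot;{name:16s} ${price:5d}  {slots:2d} slots  ${price / slots:5.0f}/slot&quot;)
&lt;/pre&gt;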
&lt;br /&gt;
Since memory density increases with Moore&#39;s law, if you have 3% monthly growth or less and you comfortably[4] fit into one of the $1500 servers, there&#39;s no need to worry about &quot;sharding&quot; due to memory. Similarly, if you&#39;re at 10% monthly growth (more than tripling every year), you have about 2 years to grow into the then-current larger machines, assuming that the number of memory slots per same-cost server[5] doesn&#39;t increase.&lt;br /&gt;
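&lt;br /&gt;
To make that growth arithmetic concrete, here is a back-of-the-envelope sketch; the 10%-per-month figure and the slot-count headroom are just the assumptions from the paragraph above, not measurements:&lt;br /&gt;
&lt;pre&gt;
# How long does a given memory headroom last at a given monthly growth rate?
import math

def months_until_outgrown(headroom_factor, monthly_growth):
    # usage * (1 + g)^m = usage * headroom  =&gt;  m = ln(headroom) / ln(1 + g)
    return math.log(headroom_factor) / math.log(1.0 + monthly_growth)

print(round(1.10 ** 12, 2))                         # 10% per month compounds to ~3.14x per year
print(round(months_until_outgrown(4.5, 0.10), 1))   # 18 memory slots vs 4: ~15.8 months of headroom
print(round(months_until_outgrown(18.0, 0.10), 1))  # if density also doubles twice: ~30 months
&lt;/pre&gt;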
&lt;br /&gt;
For a startup, 2 years is a lot of engineering time that could be spent on actually driving the growth rather than focusing on how to handle it if it happens to appear.&lt;br /&gt;
&lt;br /&gt;
For now, pricing of CPU &quot;horsepower&quot; across the different servers is left as an exercise to the reader who enjoys comparing benchmarks.&lt;br /&gt;
&lt;br /&gt;
[1] The virtualization proponents seem to go both ways on this, the other way being the subdivision of larger servers into several smaller, virtual machines.&lt;br /&gt;
&lt;br /&gt;
[2] Linux on x86 is the only one that counts these days, right?&lt;br /&gt;
&lt;br /&gt;
[3] Often the case with modern languages such as Java and Python. The practice of using memcached or other in-memory databases similarly leads to memory scarcity.&lt;br /&gt;
&lt;br /&gt;
[4] That is, without paying a huge premium for the highest density memory, which premium often only exists for a short period of time.&lt;br /&gt;
&lt;br /&gt;
[5] Or, rather, per processor, unless we go back to a serial connection technology like FB-DIMM.&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/1431717065714913988/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/04/myth-of-commodity-server-for-memory.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/1431717065714913988'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/1431717065714913988'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/04/myth-of-commodity-server-for-memory.html' title='The myth of the &quot;commodity&quot; server (for memory)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-413988807744647176</id><published>2011-04-04T10:38:00.000-07:00</published><updated>2011-04-04T10:38:19.063-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><title type='text'>When is it time for a senior sysadmin?</title><content type='html'>In the quest for the &quot;perfect&quot; startup to join, I have my own personal guidelines as to company size and growth. However, I also tend to ask questions to determine if it&#39;s too early or too late for me (as a system administrator) to be of adequate help.&lt;br /&gt;
&lt;br /&gt;
I&#39;m not just a porridge-swilling Goldilocks when it comes to this kind of timing. If it&#39;s too early[1], I&#39;m going to get bored while the company wastes its money, which isn&#39;t good for anyone. Too late, and I end up being incapable of overcoming legacy hurdles, which is a source of frustration and the appearance of ineffectiveness, again not good for anyone.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;Growth&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
The first thing I look at is growth, since that&#39;s the single most reliable sign that it may be &quot;too early.&quot; Merely modest growth can still be a great challenge for someone who is by all accounts a senior sysadmin, but not for me, which is why I probe it early.&lt;br /&gt;
&lt;br /&gt;
For a startup with the expectation of &quot;hockey stick&quot; style growth, I would say the right time is anywhere on the elbow part (of greatest slope change). The nearly-horizontal part means it&#39;s too early, since that can last an indeterminate amount of time and such minimal[2] growth can be handled by developers sharing the load of administration.&lt;br /&gt;
&lt;br /&gt;
I look for 10% monthly growth or doubling yearly as a minimum. Any metric that can be credibly linked to infrastructure works, including bandwidth, users, revenue, servers, even employees. I have yet to run into the need for having a maximum. Does anyone have a suggestion of a growth rate that&#39;s clearly up the handle of the hockey stick? Factor of 10 yearly? &lt;br /&gt;
&lt;span style=&quot;font-size: small;&quot;&gt; &lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;Employees&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Another metric I use is number of employees, or, more specifically, number of technical employees. The &quot;too late&quot; case has an easy rule of thumb: everyone technical needs to sit in the same room and still be able to communicate effectively with each other. My experience is that this is a common early startup model. Once people have walls (even cube walls) and doors separating them, there&#39;s just enough of an &quot;us vs. them&quot; mentality that a sysadmin can no longer absorb enough of everything that&#39;s going on to effectively influence how things are done in the future[3].&lt;br /&gt;
&lt;br /&gt;
I&#39;m not sure there&#39;s a danger of there being a &quot;too early&quot; case, but I&#39;d be hard pressed to recommend a sysadmin being one&#39;s first or second hire.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;Number of Servers&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
A common, though, to my mind, less significant, metric is the number of servers. The reason I consider it of secondary importance is that it doesn&#39;t translate well to overall environment complexity. Put another way, the existing number of servers doesn&#39;t translate well to the eventual number of servers once a sysadmin is on board.&lt;br /&gt;
&lt;br /&gt;
Still, if you can run everything on one or two servers, it&#39;s probably too early. If you have a couple hundred and you don&#39;t already have someone dedicated to thinking about them, it&#39;s too late.&lt;br /&gt;
&lt;span style=&quot;font-size: small;&quot;&gt; &lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;Server/Services Spending&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-size: small;&quot;&gt;More significant than the number of servers is how much is being spent on the hardware (if applicable), hosting, and services, such as ISPs and CDNs. My sweet spot is that this needs to be about twice the salary of a sysadmin, since I can often cut those expenditures in half[4]&lt;/span&gt;. Less than a sysadmin&#39;s salary and it&#39;s too early. More than 5 times a sysadmin&#39;s salary and it&#39;s too late, though, like with growth, I have yet to see this be an issue in the real world.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] Granted, it may only be too early for &lt;i&gt;me&lt;/i&gt; but not for a junior sysadmin. That&#39;s a philosophical question for another post. I&#39;ve found, however, that most startups don&#39;t want to spend the time and money to eventually hire two people rather than waiting and getting just one. &lt;br /&gt;
&lt;br /&gt;
[2] It&#39;s important to remember to normalize against technological progress. Even I/O performance progresses linearly, though it doesn&#39;t follow the geometric progression of Moore&#39;s Law.&lt;br /&gt;
&lt;br /&gt;
[3] Including influencing development process and tools, if not providing them outright. I&#39;ve heard this method called &quot;DevOps,&quot; but I just consider it to be good startup system administration.&lt;br /&gt;
&lt;br /&gt;
[4] Easily justifying my own salary, if needed, but, more importantly, revealing the negotiation over $10k one way or the other seem the silly waste of time that it is.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/413988807744647176/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/04/when-is-it-time-for-senior-sysadmin.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/413988807744647176'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/413988807744647176'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/04/when-is-it-time-for-senior-sysadmin.html' title='When is it time for a senior sysadmin?'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-379254259901445375</id><published>2011-03-25T11:48:00.000-07:00</published><updated>2011-03-25T11:48:04.128-07:00</updated><title type='text'>OpenStreetMap is a ghetto of stagnation.</title><content type='html'>Having interacted with a few other mappers, particularly in disputes, I had the odd impression that either they were a bit, shall we say, mentally challenged, or struggled with language. Now I know why.&lt;br /&gt;
&lt;br /&gt;
Fully a year later, one of the people in charge communicates with me and, in summary, says that the community is favored over map quality every time. Wow.&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
So what is this community? Where do its members &quot;hang out?&quot; There&#39;s a plethora of choice, and, apparently, they&#39;re all equally inadequate, except for the mailing lists, which, despite being shockingly anachronistic[1], are held up as the pillar of excellence as a discussion venue. Never mind that the same OSMF board member who did so is also 
complaining about &quot;toxic&quot; participants on the mailing lists, who are 
there only to argue and aren&#39;t otherwise active mappers. &lt;br /&gt;
&lt;br /&gt;
The fora I&#39;ve found so far are:&lt;br /&gt;
&lt;br /&gt;
Wiki&lt;br /&gt;
Wiki Talk&lt;br /&gt;
Forums&lt;br /&gt;
Meetups&lt;br /&gt;
Help pages&lt;br /&gt;
Individual OSM email&lt;br /&gt;
Out-of-band regular e-mail&lt;br /&gt;
and, of course, the mailing lists and their archives. &lt;br /&gt;
&lt;br /&gt;
That&#39;s quite a dauntingly fragmented set of channels, even for an earnest participant. At best, they strike me as a significant distraction from the task at hand.&lt;br /&gt;
&lt;br /&gt;
All this means smart, dedicated, motivated mappers are going to get systematically chased away, while those who oppose change but are good at playing politics will stay. Sound familiar? I fear that this is always the logical conclusion to any such Wiki-like &quot;crowdsourcing&quot; effort.&lt;br /&gt;
&lt;br /&gt;
I had such high hopes for the octo-chicken. Still, it may work, as it seems to have mostly worked for Wikipedia. Here&#39;s hoping for a worthy fork in the meantime.&lt;br /&gt;
&lt;br /&gt;
[1] Even when I started with the &#39;net a quarter century ago, they already seemed quaintly backwards, compared to Usenet. I&#39;m pretty confident OSM isn&#39;t nearly that old.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/379254259901445375/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/03/openstreetmap-is-ghetto-of-stagnation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/379254259901445375'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/379254259901445375'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/03/openstreetmap-is-ghetto-of-stagnation.html' title='OpenStreetMap is a ghetto of stagnation.'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-6296316261448379799</id><published>2011-03-01T09:38:00.000-08:00</published><updated>2011-03-01T09:38:37.282-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="dns"/><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><title type='text'>Secondary DNS</title><content type='html'>Here&#39;s my advice for &quot;secondary&quot;&amp;nbsp; DNS service. I recommend running the master unlisted (&quot;stealth master&quot;) and using it only to serve zone transfer to the slaves. It can also be a good idea to have a backup &quot;stealth&quot; slave that could become the master.&lt;br /&gt;
&lt;br /&gt;
I call them &quot;slaves&quot; even though, in registration terms, I think they&#39;re still called &quot;primary&quot; and &quot;secondary.&quot; I have yet to find a practical distinction, and, with a stealth master, there could be confusion.&lt;br /&gt;
 &lt;br /&gt;
Make sure to have at least one listed slave whose hostname is under a different TLD (.com, .org, .net, or a ccTLD).&lt;br /&gt;
&lt;br /&gt;
A list of my preferred providers, reasonably priced:&lt;br /&gt;
&lt;br /&gt;
DNS Made Easy (per 5-10 million query pricing)&lt;br /&gt;
BackupDNS (flat per zone per month)&lt;br /&gt;
EasyDNS (per million query pricing)&lt;br /&gt;
DNS Unlimited (cheap per million query pricing)&lt;br /&gt;
Durable DNS (per million query pricing)&lt;br /&gt;
No-IP &quot;squared&quot; (flat per domain per year)&lt;br /&gt;
&lt;br /&gt;
Not all of them support configuring more than one master, but they all have web access to effect the changes.&lt;br /&gt;
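&lt;br /&gt;
Whichever providers you pick, one check worth automating is that every listed slave answers and has pulled the latest zone from the stealth master. A minimal sketch, assuming the standard &quot;dig&quot; utility is installed; the zone, master address, and slave names below are placeholders:&lt;br /&gt;
&lt;pre&gt;
# Compare the SOA serial on the hidden (stealth) master against each listed slave.
import subprocess

ZONE = &quot;example.com&quot;                                    # placeholder zone
HIDDEN_MASTER = &quot;192.0.2.1&quot;                             # the unlisted master
LISTED_SLAVES = [&quot;ns1.example.net&quot;, &quot;ns2.example.org&quot;]  # the published NS hosts

def soa_serial(server):
    # &quot;dig +short&quot; prints the SOA record; the serial is the third field
    out = subprocess.check_output(
        [&quot;dig&quot;, &quot;+short&quot;, &quot;@&quot; + server, ZONE, &quot;SOA&quot;], text=True)
    return int(out.split()[2])

master_serial = soa_serial(HIDDEN_MASTER)
for slave in LISTED_SLAVES:
    serial = soa_serial(slave)
    status = &quot;ok&quot; if serial == master_serial else &quot;STALE&quot;
    print(f&quot;{slave}: serial {serial} ({status}; master has {master_serial})&quot;)
&lt;/pre&gt;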
&lt;br /&gt;
More detailed advice may be forthcoming.&lt;br /&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/6296316261448379799/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/03/secondary-dns.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6296316261448379799'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6296316261448379799'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/03/secondary-dns.html' title='Secondary DNS'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-7737048172574528855</id><published>2011-02-03T11:29:00.001-08:00</published><updated>2011-02-07T22:41:54.484-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><title type='text'>Virtualization for databases (bad idea)</title><content type='html'>Originally in response to this (excerpt of a) discussion on LinkedIn:&lt;br /&gt;
&lt;blockquote&gt;
&lt;span class=&quot;comment-body&quot; data-li-comment-text=&quot;&quot;&gt;
I think this is a LINUX issue! Because in linux the I/O is buffered or 
delegated to a proccess. When you install Postgres or any DB, Postgres 
tell to the OS that it can&#39;t wait to do the I/O, it must be done 
inmediattly. But what happens in a virtualized environment?
&lt;/span&gt;&lt;/blockquote&gt;
There&#39;s no such thing as telling the OS to do an I/O immediately, as opposed to waiting. It&#39;s the other way around: non-buffered I/O requires waiting for it to actually complete. This is important for such features as data integrity (knowing it was written to the platter, or, perhaps, in the case of SSDs, that the silicon was erased and written to).&lt;br /&gt;
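&lt;br /&gt;
To show what &quot;waiting for it to actually complete&quot; looks like from the application side, here is a minimal sketch of the kind of durable append a database does for its journal; the path is only an example:&lt;br /&gt;
&lt;pre&gt;
# A durable append: the write is not &quot;done&quot; until fsync() returns,
# i.e. until the OS reports the data as being on stable storage.
import os

def durable_append(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # block until the device acknowledges the write
    finally:
        os.close(fd)

durable_append(&quot;/tmp/example-journal.log&quot;, b&quot;transaction committed\n&quot;)
&lt;/pre&gt;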
&lt;br /&gt;
The real problem is that virtualization is fundamentally flawed. What is an operating system for, in the first place? It&#39;s the interface between the hardware and the applications. Virtualization breaks this, without, IMO, adequate benefit.&lt;br /&gt;
&lt;br /&gt;
Put another way, virtualization abstracts away hardware, to a lowest common denominator. It is therefore an unsurprising result that the subsequent performance is consistent with the lowest common denominator as well. &quot;Commodity hardware&quot; is a myth[1].&lt;br /&gt;
&lt;br /&gt;
One of my greatest tools as a sysadmin is my knowledge of hardware, how it fits together, and how it interacts with the OS. Take that away from me by insisting on virtualization or ordering off a hosting provider&#39;s menu of servers, and I, too, suffer from the lowest common denominator syndrome.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] Really, it&#39;s that non-commodity &quot;big iron&quot; is extinct in my world, especially with the demise of Sun.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/7737048172574528855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/02/originally-in-response-to-this-excerpt.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/7737048172574528855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/7737048172574528855'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/02/originally-in-response-to-this-excerpt.html' title='Virtualization for databases (bad idea)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-3531101425067701402</id><published>2011-01-22T10:53:00.002-08:00</published><updated>2011-01-22T11:01:22.284-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><title type='text'>You only just swallowed us, I know, but please cough us back up.</title><content type='html'>I was asked recently what my ideal scenario to retain me long-term, and it occurred to me, after answering otherwise, that there does exist such a situation. Our new overlords would have to spin us off and let us operate independently, as a wholly-owned subsidiary.&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
My own role has not been rendered completely irrelevant, as I had feared, just stagnant. The closest thing I currently have to a boss managed to finagle our keeping our own deployment process and administrative control of our servers. For now, this means the hosting provider. Later, it means mostly virtual boxes in eBay&#39;s datacenter(s). It probably won&#39;t be significantly worse than what we have now, since our provider&#39;s internal network has had numerous failures.&lt;br /&gt;
&lt;br /&gt;
However, since my next step, right before the acquisition, was going to be to move to our own datacenter, there will be no moving forward. I&#39;ll be stuck with the already outgrown scaling (for lack of a better term) model and no control of the network, hardware, or provisioning. The most powerful tools with which I am adept won&#39;t be available to me.&lt;br /&gt;
&lt;br /&gt;
There will also be no opportunity for mentorship or participation in hiring other sysadmins, something I have found adds significantly to my overall job satisfaction. No, joining eBay Ops (cue &quot;Central Services&quot; jingle from &lt;i&gt;Brazil&lt;/i&gt;) is not an option, since I enjoy being productive.&lt;br /&gt;
&lt;br /&gt;
If we were spun off, the lip service given to continuing what we were doing, just with eBay&#39;s resources behind us, could actually be made to be true. We would be free of the usual bureaucratic encumbrances, all-downside purchasing process (no buyers, just forms)[1], crippling &quot;collaboration&quot; tools like Exchange and Skype, and the temptation to shoehorn what&#39;s still a nimble startup operation into a nearly immobile behemoth&#39;s infrastructure.&lt;br /&gt;
&lt;br /&gt;
We could still sub-lease their campus and maybe even be eBay-galaxy-of-companies employees so as to&amp;nbsp; share benefits (though even those are lackluster and an administrative time sink). However, we would control our own destiny in terms of hiring, purchasing, and operating our service. Integration with eBay&#39;s services would be via API, as it would otherwise, since the code bases have, to put it mildly, irreconcilable differences.&lt;br /&gt;
&lt;br /&gt;
I very seriously doubt, however, that this could ever happen, since there&#39;s too much potential for loss of face somewhere up the chain of command. In the meantime, I&#39;ll continue to help in what ways I can and be on the lookout for another suitable startup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] Unless it&#39;s over a million dollars. The purchasing department has a great scam going. They&#39;ve managed to appear to have very low costs, because they outsourced everything one might think they do. The accounting work is off-shore, and the request, quote, purchase, and receiving tasks are all pushed onto all employees in the guise of self-service. Of course, it&#39;s still Purchasing that dreams up the Byzantine policies everyone else is expected to implement.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/3531101425067701402/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/acquisition-do-they-ever-go-well-for.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/3531101425067701402'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/3531101425067701402'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/acquisition-do-they-ever-go-well-for.html' title='You only just swallowed us, I know, but please cough us back up.'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-9030994407537690852</id><published>2011-01-21T13:26:00.000-08:00</published><updated>2011-01-22T11:01:22.285-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><category scheme="http://www.blogger.com/atom/ns#" term="storagemonkeys"/><title type='text'>Compression at &quot;Internet&quot; scale (originally posted to StorageMonkeys November 22, 2009)</title><content type='html'>One of the things I&#39;ve learned, having been in more traditional 
&quot;Enterprise&quot; environments and &quot;Internet&quot; companies, is that the latter have storage scale issues an order of magnitude or two larger than the former.&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;

 Fortunately, there&#39;s also a difference in the nature of the data, such 
that the most voluminous (and, arguably, most valuable) data, web access
 logs, are highly compressible (5-20x) with the right algorithm.  
Compression is important at this scale for reducing I/O and increasing 
speed of access, not for reducing the number of bits &quot;spinning&quot; on platters.&lt;br /&gt;
&lt;br /&gt;
A
 solution must work in real-time. There is some flexibility in that 
average load is rarely anywhere near peak load. However, my experience 
is that paying for unused capacity is better than depending on 
catching up on backlogs during off-peak times. In the former case, the 
consequences of a poor estimate are finite and predictable but not so in
 the latter. &lt;br /&gt;
&lt;br /&gt;
Assuming one wants to use the data, a solution must
 decompress at least as fast as it compresses. I haven&#39;t run into this 
as a problem, since the readily available algorithms easily meet such a 
requirement. A possible issue could be with parallel processing of the 
compression but centralized processing of the decompression, such as to 
load into a decision-support database.&lt;br /&gt;
&lt;br /&gt;
Performance has to be no 
more than O(n) for memory (distributed compressors) or O(n) for CPU 
(central compressor). Fortunately, the former appears easily satisfied 
by available algorithms, so long as &quot;n&quot; is log event volume, not average
size of each log event.&lt;br /&gt;
&lt;br /&gt;
HTTP logs are extremely self-similar, so just throwing Lempel-Ziv at them is sub-optimal. Experimenting, I&#39;ve found that descendants like LZMA do quite well (around 5x), but that seems to be the top end, at a not particularly impressive speed. This may be great for general-purpose compression but not for this special purpose.&lt;br /&gt;
&lt;br /&gt;Though
 they&#39;ll often have plenty of natural-language embedded within, large 
text compressors (such as those tuned for the Hutter Prize cf. &lt;a href=&quot;http://mattmahoney.net/dc/text.html&quot; title=&quot;Matt Mahoney&#39;s page&quot;&gt;http://mattmahoney.net/dc/text.html&lt;/a&gt;)
 aren&#39;t ideal, either. I speculate that this is due to a much higher 
incidence of abbreviations and numerals, but I&#39;m hardly qualified.&lt;br /&gt;
&lt;br /&gt;
Another
possibility would be to configure/customize one&#39;s web server to log in a
 pre-compressed format. I generally reject this out of hand, because it 
removes much of the self-documenting nature of verbose logs. Moreover, 
it can&#39;t predict the future to determine the frequency of a current log 
event. To do so would mean maintaining a buffer, which may as well be on
 the disk of another server, the current situation. Perhaps more to the 
point, my operational philosophy discourages burdening something 
critical like a web server with something ancillary like log 
compression.&lt;br /&gt;
&lt;br /&gt;
The best option I&#39;ve found so far is the PPMd 
algorithm, primarily as implemented in software by the 7zip package. 
Specifically, with order 7 and 1GB of memory, a modern CPU will compress
my web logs 10:1 at 10MB/s. Its main disadvantages are being memory heavy, with an identical footprint for compression and decompression, and the lack of a parallel implementation.&lt;br /&gt;
&lt;br /&gt;
I don&#39;t yet have any good data,
partly because the fast pace of startups means the character of the 
logs I work with changes and partly because I rarely have the luxury of 
trying more than one method on the same data. However, once I do, I&#39;ll 
post some hard number comparisons between LZMA and PPMd with various 
tuning options.&lt;br /&gt;
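&lt;br /&gt;
In the meantime, the harness I have in mind looks roughly like this; it assumes a 7z binary on the PATH, the log path is a placeholder, and the PPMd switch syntax should be double-checked against your 7-Zip version:&lt;br /&gt;
&lt;pre&gt;
# Rough compression-ratio comparison on one sample log file:
# LZMA via the standard library, PPMd via the external 7z binary.
import lzma, os, subprocess, time

LOG = &quot;access.log&quot;                 # placeholder path to a sample web access log
raw = os.path.getsize(LOG)

t0 = time.time()
with open(LOG, &quot;rb&quot;) as f:
    lzma_size = len(lzma.compress(f.read(), preset=6))
print(f&quot;LZMA: {raw / lzma_size:5.1f}x in {time.time() - t0:5.1f}s&quot;)

t0 = time.time()
# assumed switch syntax for order-7 PPMd with 1GB of memory; verify for your version
subprocess.run([&quot;7z&quot;, &quot;a&quot;, &quot;-m0=PPMd:o=7:mem=1g&quot;, &quot;out.7z&quot;, LOG],
               check=True, stdout=subprocess.DEVNULL)
ppmd_size = os.path.getsize(&quot;out.7z&quot;)
print(f&quot;PPMd: {raw / ppmd_size:5.1f}x in {time.time() - t0:5.1f}s&quot;)
&lt;/pre&gt;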
&lt;br /&gt;
Next year, look for my musings on compression of database redo/write-ahead logs.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/9030994407537690852/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/compression-at-internet-scale.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/9030994407537690852'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/9030994407537690852'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/compression-at-internet-scale.html' title='Compression at &quot;Internet&quot; scale (originally posted to StorageMonkeys November 22, 2009)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-6041470497379992324</id><published>2011-01-21T13:23:00.000-08:00</published><updated>2011-01-22T11:01:22.285-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><category scheme="http://www.blogger.com/atom/ns#" term="storagemonkeys"/><title type='text'>Storage on the cheap - lessons learned (originally posted to StorageMonkeys July 11, 2009)</title><content type='html'>Having purchased, assembled, configured, and turned up quite a number
 of storage arrays, where a major concern was total cost, I&#39;ve come up 
with something of a checklist of best practices.&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
Use cheap, 
commodity, desktop SATA drives. They&#39;re as good as, if not better than, 
&quot;enterprise&quot; models. They&#39;re certainly cheaper per performance.&lt;br /&gt;
&lt;br /&gt;
If advanced administration, failover, or clustering features, such as from Veritas, are needed, use SAS HBAs.&lt;br /&gt;
&lt;br /&gt;
Otherwise, use SAS RAID cards. They tend to support more attached devices and may even be cheaper.&lt;br /&gt;
&lt;br /&gt;
Make
 sure to buy disks from multiple batches for use within a RAID. That is,
 have a mix of drive models and sub-models, manufacturers, and even end 
vendors.&lt;br /&gt;
&lt;br /&gt;
Bad batch syndrome is potentially the most catastrophic. Corollary: Don&#39;t buy models so new that there&#39;s only one batch.&lt;br /&gt;&lt;br /&gt;Buy only drives which support NCQ. The price premium, if any, is negligible.&lt;br /&gt;
Even if there&#39;s no performance gain for a particular use case, there&#39;s no downside to having it turned on everywhere.&lt;br /&gt;
To that end, turn NCQ on for all new adapters connecting new disks.&lt;br /&gt;&lt;br /&gt;If coming into a legacy environment, turn off NCQ unless absolutely certain that all existing disks support it.&lt;br /&gt;Problems/corruption can be insidiously subtle.&lt;br /&gt;&lt;br /&gt;Before use, write (zeros are fastest) to the entire device. This will trigger any bad blocks to be reallocated.&lt;br /&gt;After that, run a SMART scan on the whole device and check for clean results. This will catch any (very rare) infant mortality. (See the burn-in sketch after this checklist.)&lt;br /&gt;It also indelibly &quot;stamps&quot; the drive as having been tested.&lt;br /&gt;&lt;br /&gt;Install smartmontools on all servers. It&#39;s small and otherwise takes no resources.&lt;br /&gt;Running the smartd daemon is another matter. That&#39;s a monitoring concern.&lt;br /&gt;&lt;br /&gt;Turn on all the idle/background SMART tests supported by each device.&lt;br /&gt;&lt;br /&gt;Discard (permanently stop using) a disk at the first sign of trouble.&lt;br /&gt;A SMART error or even warning is trouble.&lt;br /&gt;A write error is trouble.&lt;br /&gt;A read error (assuming the disk has been zeroed) is trouble.&lt;br /&gt;A timeout, unless positively isolated to the disk itself, is not trouble.&lt;br /&gt;&lt;br /&gt;For external connectors, use only the screw-on type. For SAS, that&#39;s SFF-8470.&lt;br /&gt;This does mean spending more money.&lt;br /&gt;Often, one must use internal connections (e.g. SFF-8087) with an adapter.&lt;br /&gt;The
 latching connectors are all too easily disconnected (sometimes only 
partially, which can be worse than fully) and/or too fragile.&lt;br /&gt;&lt;br /&gt;Locate equipment such that storage cables can be short but have enough slack.&lt;br /&gt;Always provide good strain relief on all external cables. This means cable ties at strategic points.&lt;br /&gt;Test for adequate slack and clearance by sliding all connected and neighboring equipment.&lt;br /&gt;&lt;br /&gt;Add between 3% and 6% (of active disks) hot spares. That should last 2-3 years without human intervention.&lt;br /&gt;By then, replace all the disks, not just the failed ones, as your failure rate will, otherwise, accelerate heavily.&lt;br /&gt;Time your transition to take advantage of technology and/or price improvements but assume closer to 2 years than 3.&lt;br /&gt;&lt;br /&gt;RAID1(+0) is far more flexible and simpler than RAID5. It performs much better in degraded and recovery modes.&lt;br /&gt;A good implementation can nearly double read performance, especially on contentious operations.&lt;br /&gt;It costs only 60% more than a 4 column (+1 parity) RAID5 or an 8 column RAID6.&lt;br /&gt;&lt;br /&gt;Don&#39;t oversubscribe the system bus.&lt;br /&gt;PCI-X 64bit@133MHz is only 1067MB/s half-duplex. (i.e. could be adequate for highly asymmetric read/write)&lt;br /&gt;PCIe x4 is 1000MB/s full-duplex.&lt;br /&gt;SAS 4-lane is 1200MB/s full-duplex.&lt;br /&gt;&lt;br /&gt;Once
 everything is assembled, measure these maximum throughputs. Do so at 
each layer, including the HBA/RAID card and each spindle.&lt;br /&gt;&lt;br /&gt;At each layer with a dirty region log (DRL) and/or journaling option, opt to use it.&lt;br /&gt;If
 practical, &quot;waste&quot; a whole spindle on it. Otherwise, locate it 
somewhere highly contentious or low-demand, such as the boot disk.&lt;br /&gt;&lt;br /&gt;Similarly, try simulating a failure at each layer and measure the recovery time. That will be the minimum under no load.&lt;br /&gt;&lt;br /&gt;If the block size an application or database uses can be tuned, raise it to the highest possible.&lt;br /&gt;Conversely, use the smallest supported stripe unit width size.&lt;br /&gt;Set number of columns such that full stripe width is an even multiple (or, better yet, factor) of block size.&lt;br /&gt;For RAID5, this usually means 4 (plus parity), 8, or (rarely) 16.&lt;br /&gt;4 columns plus parity is particularly well suited to PCIe-to-SAS hardware RAID5, since there&#39;s a 4:5 PCIe:SAS bandwidth ratio.&lt;br /&gt;&lt;br /&gt;For redundant components (e.g. cables, expanders, power supplies), test hot-swappability.&lt;br /&gt;Do so at different &quot;duty&quot; (simulated outage) cycles and flap rates.&lt;br /&gt;Test flip-flopping between the two components.&lt;br /&gt;
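&lt;br /&gt;
The zero-then-scan burn-in above is easy to script. A minimal sketch, assuming coreutils, util-linux, and smartmontools are installed; the device name is a placeholder, and this will destroy everything on it:&lt;br /&gt;
&lt;pre&gt;
# Burn-in for a new disk: zero the whole device, then run a long SMART
# self-test and check the results. DESTRUCTIVE - double-check the device name.
import subprocess

DEV = &quot;/dev/sdX&quot;   # placeholder; set to the disk under test

def run(*cmd):
    print(&quot;+&quot;, &quot; &quot;.join(cmd))
    subprocess.run(cmd, check=True)

size_bytes = int(subprocess.check_output([&quot;blockdev&quot;, &quot;--getsize64&quot;, DEV]))
run(&quot;dd&quot;, &quot;if=/dev/zero&quot;, &quot;of=&quot; + DEV, &quot;bs=1M&quot;,
    &quot;count=&quot; + str(size_bytes // (1024 * 1024)), &quot;oflag=direct&quot;, &quot;status=progress&quot;)

run(&quot;smartctl&quot;, &quot;-t&quot;, &quot;long&quot;, DEV)   # starts the long self-test in the background
# ... wait for the estimated duration smartctl prints, then check the results ...
run(&quot;smartctl&quot;, &quot;-H&quot;, DEV)           # overall health; non-zero exit on failure
run(&quot;smartctl&quot;, &quot;-A&quot;, DEV)           # attributes: look at reallocated/pending sector counts
&lt;/pre&gt;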
&lt;br /&gt;
If
 you can ever check all these off, I&#39;ll be impressed. Still, I hope it 
helps other cheapskates out there avoid a few pitfalls. &lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/6041470497379992324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/storage-on-cheap-lessons-learned.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6041470497379992324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6041470497379992324'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/storage-on-cheap-lessons-learned.html' title='Storage on the cheap - lessons learned (originally posted to StorageMonkeys July 11, 2009)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-6755509773208621327</id><published>2011-01-21T13:21:00.001-08:00</published><updated>2011-01-22T11:01:22.286-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><category scheme="http://www.blogger.com/atom/ns#" term="storagemonkeys"/><title type='text'>&quot;Dark&quot; storage: wastefulness or just good engineering? (originally posted to StorageMonkeys June 24, 2009)</title><content type='html'>Having recently read more and more discussion about so-called dark 
storage, I&#39;ve been reminded of something I routinely try to impress upon
 managers, especially clients: unless your use case is archiving, total 
bytes is a poor metric for storage.&lt;br /&gt;
&lt;br /&gt;
In fact, the term &quot;storage&quot; 
itself may be partly to blame for the continued misconception. One need 
only glance at the prices of commodity disks to recognize that there 
isn&#39;t anything near a linear relationship between cost and bytes stored.&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
A
 quarter century ago was the golden age of the mini-computer, and the 
reign of the micro- was dawning. The Fujitsu Eagle was, at least in the 
semiconductor industry here in Silicon Valley, very popular, so it will 
be my yardstick. At a third of a gigabyte in usable space and just under
 1.9MB/s, one could read or write the whole thing in just under 3 
minutes. Today, a 1.5TB Barracuda is 4500 times the size but only 66 
times the throughput, so it takes over 3 hours to go through the whole 
thing. A 6th-generation 450GB Cheetah is better, at under an hour.&lt;br /&gt;
&lt;br /&gt;
I
 like the Eagle&#39;s 3 minutes as a rule of thumb. That&#39;s 21GB on larger, 
modern, 7200 RPM disks, and I suggest that everything beyond that may as
 well be considered superfluous or archive storage. Accepting this 
measure end-to-end means that one would only want 72GB accessible to a 
host off each 4Gb/s FC or 216GB per 4x SAS. Ouch.&lt;br /&gt;
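&lt;br /&gt;
The arithmetic behind those figures is simple enough to keep in a scratch script; the throughput numbers below are the same rough ones used above, not benchmarks:&lt;br /&gt;
&lt;pre&gt;
# Full-pass time for a whole device, and how much capacity fits in a fixed
# time budget (the 3-minute &quot;Eagle&quot; rule), using decimal GB (1 GB = 1000 MB).
MINUTES = 3

def full_pass_minutes(capacity_gb, mb_per_s):
    return capacity_gb * 1000 / mb_per_s / 60

def budget_capacity_gb(mb_per_s, minutes=MINUTES):
    return mb_per_s * 60 * minutes / 1000

print(round(full_pass_minutes(0.33, 1.9)))          # Fujitsu Eagle: ~3 minutes
print(round(full_pass_minutes(1500, 125) / 60, 1))  # 1.5TB Barracuda at ~125MB/s: ~3.3 hours
print(round(budget_capacity_gb(117)))               # modern 7200RPM (~117MB/s): ~21GB
print(round(budget_capacity_gb(400)))               # 4Gb/s FC (~400MB/s): ~72GB per host link
print(round(budget_capacity_gb(1200)))              # 4x SAS (~1200MB/s): ~216GB
&lt;/pre&gt;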
&lt;br /&gt;
A whitepaper 
from Xiotech criticizes storage vendors&#39; performance numbers as being 
misleading, since they are based on short-stroking benchmarks, rather 
than representing the performance of the whole disk.&lt;br /&gt;
&lt;br /&gt;
I suggest 
that short-stroking disks as a matter of course and leaving the rest 
purposefully &quot;dark&quot; is smart engineering. Suddenly, those 160GB drives 
look much more appealing than the 1.5TB ones, at least for 
performance-sensitive uses, such as databases.&lt;br /&gt;
&lt;br /&gt;
Certainly, there 
are use cases where data beyond the 3 minute limit is still useful: 
anything that rarely, if ever, gets read. That tends to include backups,
 archives, audit trails, and even database intent logs. One may be able 
to have all these coexist on the same spindles as the &quot;high performance&quot;
 uses, but it would require careful forethought and testing.&lt;br /&gt;
&lt;br /&gt;
My 
21GB example with a 160GB disk means 87% &quot;dark,&quot; to simulate an Eagle. 
It&#39;s a high percentage but nothing to be alarmed about, as long as it&#39;s 
done with full awareness.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/6755509773208621327/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/dark-storage-wastefulness-or-just-good.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6755509773208621327'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/6755509773208621327'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/dark-storage-wastefulness-or-just-good.html' title='&quot;Dark&quot; storage: wastefulness or just good engineering? (originally posted to StorageMonkeys June 24, 2009)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-1714410547744725400</id><published>2011-01-21T13:18:00.002-08:00</published><updated>2011-01-22T11:01:22.286-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><category scheme="http://www.blogger.com/atom/ns#" term="storagemonkeys"/><title type='text'>Why are DRAM SSDs so pricey? (originally posted to StorageMonkeys June 10, 2009 )</title><content type='html'>As a UNIX veteran who has a vague recollection of /dev/drum, I keep 
thinking that it would be really nice to have a device to swap to that&#39;s
 somewhere between disk and memory in terms of speed and cost (total 
installed cost, not just each module).&lt;br /&gt;
&lt;br /&gt;
Mostly, I feel 
constrained by the 32-48GB limits on moderately priced ($1-3k) servers. 
To go higher, for even modest processor speeds, is a $5-$10k premium.  
Moreover, DRAM doesn&#39;t really wear out, and it would be nice to put 
older, lower density modules to use. &lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
The trouble is, what I&#39;ve 
found so far is either very low capacity, priced much higher than the 
memory modules themselves, or both.  I&#39;m not particularly interested in 
adding 4GB of fast swap to a 48GB machine, though ACARD has something 
for $250 with a 48GB limit, with high density modules, defeating my 
second purpose. Similarly, I&#39;m not interested in paying $10k for 16GB of
 RAM SSD ($625/GB?!) when I could just dump that money into the base 
server and get much faster access. &lt;br /&gt;
&lt;br /&gt;
I&#39;m not a hardware guy (in 
the EE sense), so I&#39;m genuinely curious about this. Is it really that 
difficult/expensive to stick a memory controller (northbridge?) onto a 
SATA interface? Am I being too cynical in assuming that it&#39;s mere market
 &quot;segmentation&quot; without a low-end consumer segment? &lt;br /&gt;
&lt;br /&gt;
What I 
described already exists with the name &quot;motherboard,&quot; but the software 
package &quot;scst&quot; seems woefully incomplete. For example, the MPT-Fusion 
driver is still described as &quot;alpha&quot; or early development, so I&#39;m not 
holding my breath on reliability, let alone performance. I&#39;m sure 
participation by the vendors would help. LSI, are you listening?</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/1714410547744725400/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/why-are-dram-ssds-so-pricey-originally.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/1714410547744725400'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/1714410547744725400'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/why-are-dram-ssds-so-pricey-originally.html' title='Why are DRAM SSDs so pricey? (originally posted to StorageMonkeys June 10, 2009 )'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6067234520252164707.post-4987575597655973906</id><published>2011-01-21T13:17:00.001-08:00</published><updated>2011-01-22T11:01:22.287-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="professional"/><category scheme="http://www.blogger.com/atom/ns#" term="storagemonkeys"/><title type='text'>All about the Benjamins (originally posted to StorageMonkeys June 9, 2009)</title><content type='html'>The choice of the unit of measure of storage is interesting to me 
because it&#39;s otherwise tough to measure price for performance.&lt;br /&gt;
&lt;br /&gt;
I
 remain agape at the price tag on high-end, supposedly high-performance,
 storage systems. Connected by FibreChannel or gigabit Ethernet, that&#39;s a
 limit of 400 and 110 MB/s, respectively. (Yes, I know of 8Gb/s FC and 
10GE, but these are prohibitively expensive, if supported. Even link-aggregated GigE practically tops out at 880MB/s.) I&#39;m thinking that 
writes across 40 7200RPM disks could saturate an FC link, and it would 
take fewer than 20 15k disks. Neither of these strikes me as impractical
 or unusual sizes of storage arrays, even doubling those numbers for 
RAID 1. More importantly, such arrays don&#39;t strike me as high 
performance.&lt;br /&gt;
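&lt;br /&gt;
To put rough numbers behind &quot;saturate,&quot; here is the back-of-the-envelope version; the per-spindle throughputs are round-number assumptions consistent with the figures above, not benchmarks:&lt;br /&gt;
&lt;pre&gt;
# How many spindles does it take to saturate a given link?
import math

links = {&quot;4Gb/s FC&quot;: 400, &quot;GigE&quot;: 110, &quot;8x GigE bonded&quot;: 880, &quot;4x SAS&quot;: 1200}
spindles = {&quot;7200RPM random&quot;: 10, &quot;15k random&quot;: 25, &quot;7200RPM sequential&quot;: 100}

for link, link_mbps in links.items():
    for disk, disk_mbps in spindles.items():
        count = math.ceil(link_mbps / disk_mbps)
        print(f&quot;{link:15s} saturated by ~{count:3d} x {disk}&quot;)
&lt;/pre&gt;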
&lt;br /&gt;
Particularly shocking is that a brand name &quot;SAN&quot; 
solution of such a size would cost in the neighborhood of a quarter 
million dollars and be at its performance limit. Granted, it might be 
half that price without fancy management and replication software, 
whereas the less fancy alternative, at one tenth to one fifth the cost, 
would still be expandable from a performance standpoint. How much does 
the Veritas database suite cost these days?&lt;br /&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
The cheaper 
alternative, which I have implemented and benchmarked, is using 
Serial-Attached SCSI (SAS) instead of FibreChannel and commodity SATA 
disks instead of 10k or 15k spindles. Although it&#39;s not necessarily &quot;SAN&quot;
 in the marketing sense, SAS readily supports multiple hosts per bus. 
It&#39;s also typically implemented as 4x 300MB/s channels on one connector 
for interfacing to expanders (a rough equivalent to FC switches). An x4 
PCIe slot is actually the limiting throughput factor for one of these, 
as each x1 lane is only 250MB/s. Even with RAID1, rolling my own array 
would cost $25k (including labor), maybe double that for Dell brand 
MD1000s. One could then spend twice again the same amount to get triple 
the throughput on the same server(s), before running up against the 
limit. Additional fanciness can be gained from 3rd-party storage 
software vendors, especially in this economy, for under 6 figures.&lt;br /&gt;
&lt;br /&gt;
That&#39;s
 for truly random I/O. For sequential I/O, such as for logs, the 
situation is even more egregious: only 4 7.2k spindles would saturate a 
(dedicated) FC link. If it&#39;s paired for redundancy, one would need a 
second pair for the non-sequential, perhaps introducing some management 
complexity, unless FC link aggregation becomes common enough to be 
standardized.&lt;br /&gt;
&lt;br /&gt;
Another issue I&#39;ve had come up in conversation is reliability and/or maintenance. This &lt;a href=&quot;http://usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html&quot;&gt;Usenix paper&lt;/a&gt;
 belies the notion that SATA disks are any less reliable than others. 
With a 3-6% annualized replacement rate, that&#39;s 2-5 disks per year, or 
about 15% or 12 disks over 2.5 years, on an 80-disk array. I&#39;ve actually
 already included this (4 spares per 20 non-spares) in the $25k above.&lt;br /&gt;
&lt;br /&gt;
Somewhere
 between 2 and 3 years, you&#39;re going to have to bite the bullet, spend 
another $25k for twice as much space, and migrate the old data, assuming
 you&#39;re not already upgrading for other reasons. Woe is you. You&#39;ll just
 have to resort to drowning your sorrows in the hundreds of grand you 
saved, never mind the headache of shipping disks back and forth.&lt;br /&gt;
&lt;br /&gt;
The Storage Emperor&#39;s new clothes are looking mighty skimpy, indeed.</content><link rel='replies' type='application/atom+xml' href='http://blog.maxkalashnikov.com/feeds/4987575597655973906/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/all-about-benjamins-originally-posted.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/4987575597655973906'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6067234520252164707/posts/default/4987575597655973906'/><link rel='alternate' type='text/html' href='http://blog.maxkalashnikov.com/2011/01/all-about-benjamins-originally-posted.html' title='All about the Benjamins (originally posted to StorageMonkeys June 9, 2009)'/><author><name>Max</name><uri>http://www.blogger.com/profile/04705387565124551855</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpwu8ar9I34d4Doi9OHJhKuL2bDunqnI27v1lBBUEFdLbVy3NflbFTU9O2mtXYXgZFDqmS9_u9MFnHTH7WBr4L2ivvW5dxzEekrWvbCgW0ecu2wQSpX-iA966_Fi8PVg/s220/caee81e6dd93fec608e6cc26.png'/></author><thr:total>0</thr:total></entry></feed>