Donnie Berkholz's Story of Data

The end of this story

dberkholz — Fri, 03 Apr 2015 20:48:15 +0000

I titled this blog “Story of Data” with the goal of telling stories based on the enormous quantity of information on software that’s out there but is nearly incomprehensible without context. And I believe I’ve succeeded. I’ve told a ton of data-driven stories spanning software development in all its forms, be it DevOps, Big Data, mobile app dev, or Java app servers. Now it’s time for me to end this story and move on to the next one.

Since starting at RedMonk in 2011, I’ve deeply enjoyed interacting with all of you. It was truly a dream come true — I never imagined I’d be able to combine my weird backgrounds in journalism, science, and software into a single job. Working with the RedMonk crew, both the more public and private folks, has been a pleasure over the years. Huge props to our analysts Steve, James, and the GreenMonk Tom; as well as our account manager Juliane and admin Marcia for all their help and support. And I want to give a big thanks to Steve for the kind send-off.

Finally, thank you all so much for the opportunity. It’s been a blast, especially hanging out with the RedMonk community in person at the Monktoberfest, Monki Gras, and ThingMonk. If you want to reach me in the future, dberkholz@dberkholz.com works, as does my Twitter handle @dberkholz. As usual, I probably won’t answer my phone so don’t bother calling.

I hope to see you in the future over a beer, coffee, or other tasty beverage. Cheers!

Image credits to Mr T.

React and Polymer arising among JavaScript MV* frameworks

dberkholz — Fri, 03 Apr 2015 05:21:01 +0000

A while back, I started looking into JavaScript MV* frameworks. My colleague James often says we could have full-time work only making recommendations for the correct JS framework for a given week.

While I don’t have the time remaining at RedMonk to do a truly in-depth analysis, I did want to post a quick hit with some stats and conclusions.

This analysis began by pulling a list of frameworks from the excellent TodoMVC, because as we all know, if it doesn’t work well for a todo list, it clearly can’t handle a larger app. But I had to start somewhere, so start I did.

I first took a look at Stack Overflow and plotted tags referencing any of these frameworks over time (click to embiggen/focus):

I was frankly shocked by the overwhelming dominance of Angular.js. Although it’s well-known that Angular is quite popular, this is nearly absurd. It’s so popular that it’s impossible to even see trends in the other frameworks.

So I next removed Angular from the comparison and plotted everything again (click to embiggen/focus):

This produced some more broadly useful results. Frameworks tended to segment into a couple of main tiers, with few exceptions.

Tier 0

Angular.js

Tier 1

Ember.js (note more recent and continued growth; this is top-ranked outside of Angular)
Knockout (trending downward)
Backbone.js (trending downward)
Kendo UI
Ext JS
Dojo (barely)

Tier 2

YUI
PureMVC
Sammy.js
Enyo
Agility.js
CanJS
Stapes.js
SAPUI5 (OpenUI5): there are strong hints that this may be rising out of the pack as well; see graph below
vue.js
SproutCore
Durandal

Rogues

React
Polymer

Being RedMonk, we’re typically on the lookout for new, emerging technologies, so the rogue behavior of React and Polymer is of particular interest. They’re separately plotted here, in addition to SAPUI5/OpenUI5 (click to embiggen/focus):

Looking at up-to-the-minute information suggests that Polymer may have stalled out for the time being, while React continues to grow. This is in keeping with anecdata from the (far too many) conferences I attend, where I hear increasingly often about React but very little about Polymer. Another surprise is SAPUI5/OpenUI5, which is one worth tracking in the future.

Regardless — in terms of new and emerging JS frameworks, those are ones to watch out for.

Update (2015/04/03): Added notes on SAPUI5/OpenUI5.

Disclosures: SAP is a client. Google (Angular, Polymer) and Facebook (React) are not.

Are we nearing peak language fragmentation?

dberkholz — Fri, 03 Apr 2015 04:08:07 +0000

Thanks to the fine folks at Black Duck, I obtained a bolus of data from Open Hub (then Ohloh) on all of the open-source repositories they track over time. I’ve written previously on this data, but this time I’m taking a different tack and looking more deeply into fragmentation [writeups by my colleague and me]. Specifically, what’s possible with this data is to dig into how usage of programming languages has diversified over time. To wit, here’s a graph showing how language use has changed using snapshots every 2½ years since 1995, plus a final more recent snapshot in red (click to embiggen/focus):

On the vertical axis is the share for a given language, and on the horizontal axis is that language’s rank, or popularity. From looking at this data, you can see that the #1 most popular language had roughly 30% market share in 1995 but that gradually declined over the past 20 years to roughly 10% today.

Perhaps the most interesting aspect of this data is that the decline appears to be slowing. Every 2½ years, the share decreases a little bit less, particularly in contrast to the enormous shifts in the ’90s and early ’00s. This suggests that we may be nearing peak fragmentation for programming languages, with the potential of a backswing.
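For readers who want to try the same rank-versus-share view on their own data, here is a minimal Python sketch. The snapshot numbers are invented for illustration and are not the Open Hub data, and `rank_share` is a hypothetical helper name rather than anything from the original analysis.

```python
def rank_share(counts):
    """Return usage shares sorted from most to least popular language.

    `counts` maps a language name to some usage measure for one snapshot
    (for example, commits or lines changed). Index 0 of the result is the
    share held by the #1 language, index 1 the #2 language, and so on.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.values(), reverse=True)
    return [value / total for value in ranked]


# Invented numbers for illustration only; not taken from Open Hub.
snapshots = {
    "early": {"C": 30, "Perl": 25, "Shell": 25, "C++": 20},
    "late": {
        "C": 10, "Java": 10, "Python": 10, "JavaScript": 10, "PHP": 10,
        "Ruby": 10, "C++": 10, "Shell": 10, "Perl": 10, "C#": 10,
    },
}

# Share held by the top-ranked language in each snapshot.
top_share = {name: rank_share(counts)[0] for name, counts in snapshots.items()}
for name, share in top_share.items():
    print(f"{name}: top-ranked share = {share:.0%}")
```

Plotting each snapshot's list against its index gives one curve per snapshot. Comparing the first element across snapshots shows how quickly the top language is losing share.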

One last note is that the mid-tier languages (around ranks 5–15) appear to be defragmenting over time. In other words, while the top 5 are spreading out, the next 10 appear to be congealing, with the lower-ranked languages in that range losing popularity.

Disclosures: Black Duck has been a client.

RedMonk’s analytical foundations, part 4: 2011–present

dberkholz — Fri, 03 Apr 2015 02:50:47 +0000

a.k.a. the advent of Donnie Berkholz

Finally, here’s part 4 on the RedMonk approach to industry analysis [part 1, part 2, part 3]. Again, I hope it’s useful to you, friendly reader, as well as the upcoming additions to the RedMonk team.

2011

“How important is software? Generational differences between software producers” on the transformation in business models depending on when companies were founded, be it hardware, software, services, or data (follow-up on the Age of Data).
“The Rails/Node lesson: frameworks lead adoption” on how growth of a programming language is a function of framework popularity.
“You are who you build for” on companies selling to those compatible with their own corporate culture and the need to sell to individual employees (follow-up one year later on enterprise vs consumer software).
“Napster: lessons for the enemies of shadow IT” on embracing and centralizing shadow IT rather than fighting it.

2012

“Microsoft Surface and the future of software” on the resurgence of integration between software and hardware.
“The importance of software at Oracle” and “The end of software: Microsoft posts a loss for the first time ever” on the shrinking business of software licensing.
“IaaS pricing patterns and trends” on using data to determine where vendors are attempting to differentiate.
“On APIs and copyright” on vendors’ ability to use assumptive API capture to strangle competition.
“Data science, Gangnam style” on bringing modern collaboration a la open source and GitHub to a new arena.
“Free hardware and the rise of Android” on the surprising power of giveaways in generating ecosystems.
“What can data scientists learn from DevOps?” on the need for codifying repetitive work (in actual software code) and applying agile outside of software development alone.
“AWS, Y Combinator and the startup boom” on the importance of incubators and the cloud to lowering the cost of startups and horizontal scaling of funding.
“On recent IBM, SAP and Adobe conferences. Developers developers developers … marketers?” and “Developers OR marketers? Nah, developers ARE marketers” on the role of developers in championing technologies internally and externally.
“On package management: negating the downsides of bundling” on the benefits and challenges of packaging.
“Windows 8: everyone is a consumer and a creator, but developers will universally drive adoption” on the importance of a heterogeneous approach to technology.

2013

“Interest withering in Java application servers” on the trend toward composability over monoliths.
“Quantifying the shift toward permissive licensing” on the move toward commercializable open source.
“GitHub will hit 5 million users within a year” on the use of true data science in our analysis.
“DevOps and cloud: a view from outside the Bay Area bubble” on the divide between early adopters and the rest of the world.
“Conway’s law but for software: Salesforce and SAP” on the importance of corporate context in opportunities to innovate.
“VMworld: the pundits versus the practitioners” on the difference between users’ pragmatism vs pundits’ visionary looks into the hazy future.

2014

“A swing of the pendulum: are fragmentation’s days numbered?” and “Microservices and the migrating Unix philosophy” on the shifts in fragmentation over time.
“IT must become a service provider, or die” on the changing needs of IT shops to provide good CX for their internal customers.

You may notice a heavy bias toward my posts in the past couple of years. That’s largely because, while my colleagues have published very important work, enough of it follows from their earlier philosophical foundations that it doesn’t require mention in a post on approaches rather than greatest hits. However, I was still developing my approach to analysis, so I included my key approaches here.

RedMonk’s analytical foundations, part 3: 2008–2010

dberkholz — Fri, 03 Apr 2015 01:16:01 +0000

As I prepare to wrap up my time at RedMonk, I wanted to complete the series of foundational posts I started in 2012 on our approach to understanding the tech industry [part 1, part 2]. Hopefully this aids any readers in addition to the next RedMonk analyst (we’re hiring!) in understanding the historical context.

2008

“Open source licensing: obsolete or of importance?” on OSS licensing strategy.
“The Friday grab bag: X300, Github, GAE, and more” on Github. Sure, we mentioned distributed version control earlier, but Github’s effect upon the barrier to entry was what really tipped the scales. (Follow-up on the future of open source looking more like Github than like nonprofit foundations.)

2009

“Development frameworks and the enterprise” on frameworks, productivity, and acceptance.

2010

“Flightcaster and the future of asymmetric intelligence as a product” foreshadowing concepts like data moats years later.
“Beyond Cassandra: Facebook, Twitter, and the future of development” on companies going open by default with permissive licensing in an increasingly polyglot world (follow-up a year and a half later on the extracted software model).
“Why you should pay attention to Node.js” on opinionated software and enabling full-stack, single-language web development.
“The future of open data looks like … Github?” on data collaboration.
“Open core is the new dual licensing” on the lost goodwill in open-core business models.
“Even with Big Data, it’s hard to ask the right question” on getting insight from information.
“Platform as a Service: Current and future returns” on the gradual disappearance of blockers to enterprise adoption of new technologies, like security, stability, and compliance.

The breakout of Ansible, and the state of config-management communities

dberkholz — Thu, 02 Apr 2015 14:48:24 +0000

TL;DR:

Chef is dev-biased, Puppet is ops-biased
Ansible is growing like crazy
CFEngine activity is minimal
But … Docker Docker Docker

In February, I gave a talk at cfgmgmtcamp on trends in configuration-management communities. I wanted to post the data and provide a bit more context than I did on Slideshare.

My goal was to examine a variety of community metrics across configuration-management frameworks to provide an update on the work that Steve did back in 2013.

For starters, here’s a look at the development communities for the core software. While this ignores third-party modules, it does say a lot about the amount of change to the core codebases:

It’s worth noting that in Salt, everything is done via pull requests, even from existing developers, so that number is a bit inflated. However, there’s a pretty clear correlation between age of the framework and activity in the core. CFEngine released 1.0 in 1993 and it’s fairly slow today; Puppet and Chef date to the mid-’00s and they’re in the middle; while Salt and Ansible are just a few years old and remain quite active in the core.

But it’s hard to get a feel for trends without plotting this over time, so I did:

Please note that the scales are different for Salt due in part to the inflated PR numbers. Again the numbers are not terribly surprising, with a shrinking CFEngine community, Puppet and Chef holding relatively static, and Salt and Ansible both growing. However, Ansible has grown to around 200 forks a month while Salt grew to around 100 a month. This indicates a significant difference in activity across the two that’s also largely supported by stars and PRs.

However, core development is not necessarily reflective of the entire community, so the next data source I examined was mailing-list activity on the development list:

In keeping with the other data, over the course of 2014 CFEngine lagged behind while Ansible charged ahead, with the others largely holding steady in the middle. There is a potential downward trend with Puppet to keep an eye on, although it’s unclear whether that will remain the case given the amount of noise in this data.

The next data source I looked at was the IRC community. This is the first source that’s suggestive of anecdotal sayings that Puppet is for ops and Chef is for developers, as IRC tends to be a more old-school chat tool. It’s otherwise broadly in line with the others:

In contrast, for a developer-leaning audience I took a look at Hacker News. This has potential artifacts for Salt (due to salted password hashing) but that doesn’t appear to be a major issue. While the reason for the downward trend in many frameworks over the past couple of years is unclear, what’s absolutely clear is the growth in Ansible activity and the relative dearth of CFEngine conversation. In addition, Chef has a slight advantage over Puppet in this developer-heavy audience.

Finally, I did a comparison across Stack Overflow (a developer discussion forum) and Server Fault (an ops discussion forum), both of which are hosted on Stack Exchange. Intriguingly, the long-term trend showed that development-related discussion tends toward Chef while ops-related discussion tends toward Puppet, again supporting that differentiation.

However, it’s worth setting some broader context. Let’s compare all of this to Docker:

All this debate about configuration management may be dwarfed in the bigger picture by a move toward containers rather than configuration management. While the future of broader adoption is unclear, the dominant interest in containers among many leading-edge communities is inarguable.

Disclosures: Chef and AnsibleWorks are clients. Puppet has been. Docker, CFEngine, and SaltStack are not.

The emergence of Spark

dberkholz — Fri, 13 Mar 2015 21:11:09 +0000

In the continuing Big Data evolution of reinventing everything that happened in HPC a couple of decades ago (with slight modifications), one newer ecosystem that comes up more and more is the Berkeley Data Analytics Stack. Some of the better-known components of this stack are Spark, Mesos, GraphX, and MLlib.

Spark in particular has gained interest due in part to very fast computation in-memory or on-disk, generally pulling from Hadoop or Cassandra (courtesy of a connector). And its programming model uses Python, Scala, or Java, which — especially in the case of Python — is very friendly to data scientists. Coincidentally, Spark 1.3 was released today, and it supports the DataFrame abstraction used both in the popular Python pandas library as well as in R (for which it has an upcoming API).

This investigation began while I was sitting at O’Reilly’s Strata conference in a packed Spark talk and began wondering about overall traction and interest in Spark. On a qualitative level, nearly every talk about Spark at the conference was reportedly packed. This came despite the lack of commercial interest highlighted below, which I wrote more about earlier.

As you can see, the level of commercial interest was quite low. In concert with the much busier talk schedule and talk attendance, this became quite suggestive of a broader effect. It maps well to the adoption curve followed by many new open-source technologies, where early adopters and contributors dominate the ecosystem initially with talks about the state of the technology and about DIY implementations. This is later followed by vendors coming up to speed in terms of commercial offerings and integrations, which are quite low at present.

To investigate whether this was a wider pattern, I took the approach of pulling in a number of data sources across the development community to compare relative interest in Spark and some other technologies in the Hadoop ecosystem for extracting and operating on data.

The first and most surprising data was from Stack Overflow:

In the past year and a half or less, interest in Spark has skyrocketed from minimal to far above every other technology on the chart. This roughly coincides with, and slightly lags, two major events:

The project’s move to the Apache foundation; and
The founding of Databricks, the vendor behind a significant chunk of Spark development.

Although it’s difficult to deconvolute the effects of these two things, it seems likely that they combined to catalyze the growth of the Spark community.

As another data source, let’s examine Hacker News. In general this tends to be a more bleeding-edge crowd, but this data may slightly temper your enthusiasm:

Unlike Stack Overflow, there’s no enormous spike in the last year. Also, given the limitations of HN search (words vs tags), some noise like discussion about Spark Devices slips into these queries. While less dramatic than the SO data, there is an equally clear emergence over time from the middle of the pack to the dominant technology shown.

It could be that the bleeding-edge crowd here picked up Spark over a longer period of time since mid-2010, while Stack Overflow’s somewhat more conservative audience compressed that same adoption into the past year and a half.

In an attempt to resolve the difference between those two sources, I looked at a third data source, Google Trends. This is generally indicative of a broad population that, out of all these, best reflects mass adoption. Queries were coupled with “big data” to limit results to a more accurate subset.

It’s intriguing to see Spark’s emergence echoed again here, with a dramatic-appearing spike just in the past few months. We’ll have to follow it over a longer period of time to determine whether that looks like the Stack Overflow data, but it very clearly stands out beyond the peaks of any of these other technologies.

The next question is how Spark is being used. While difficult to infer, the kind folks at Databricks shared some data with us about the users of the Databricks Cloud:

No surprise to see the dominance of SQL. 100% of their customer base uses SQL, often coupled with another language like Python or Scala. Much as my colleague Steve wrote back in 2011, one of the first things added to most NoSQL databases was something that looked a whole lot like SQL. The large usage of Python also supports Spark’s accessibility to data scientists.

Unfortunately the ‘spark’ tag on Stack Overflow is a mess containing both Apache Spark and Flex Spark (part of the old Adobe Flex), so I was unable to take a deeper look at that as another comparison point.

Regardless, it’s clear that Spark is a technology you can’t afford to ignore if you’re looking into modern processing of big datasets.

Disclosure: Databricks, Datastax, and Mesosphere are not clients. A number of Hadoop vendors are clients.

Strata 2015: Reaching for the business user

dberkholz — Tue, 17 Feb 2015 18:54:20 +0000

This week I’m headed to O’Reilly’s Strata conference in San Jose, which is all about Big Data and more broadly data in general. To get a feel for what’s going to happen there and what the big news is, I repeated my analysis from two years ago and dug through all my pre-announcements to look at the overall themes.

As you might expect, this tends to focus on launches and funded startups vs all companies present or the talks. But it does give a reasonable level of clue as to what the take-homes will be for this year’s attendees, and where they might want to dig into the new hotness in more depth.

Without further ado, here are the themes underlying what 48 companies are announcing this year. Note that the numbers add up to more than 48 because I tagged some announcements that fit into multiple areas.

The top 5 areas of interest are:

Hadoop itself
Analytics and BI
NoSQL databases outside the Hadoop ecosystem
Data integration
Big Data packaging

I highlighted six areas worth noting in red for a few reasons:

The contrast with two years ago;
They’re a major problem for data users; or
They’re new, emerging technologies, like Spark.

Two years ago, I received 41 notices rather than 48 so there’s been a slight increase in launches at the show. The primary focuses back then were analytics, databases, and packaging. What’s changed?

The rise of BI (business intelligence)

This year I split analytics into two sections (analytics and the new one, BI), aimed at advanced technical users and business users, respectively. Products and companies that appealed to both were tagged with both rather than artificially segmenting them into one or the other. Together, Analytics/BI was easily the dominant sector with 31% of the overall volume targeting it.
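Because a single announcement can carry several tags, the tag totals exceed the number of announcements, and any percentage depends on which denominator you pick. The Python sketch below shows the bookkeeping with invented companies and tags. It assumes "overall volume" means the total number of tags rather than the number of announcements, which is my reading and not something stated above.

```python
from collections import Counter

# Hypothetical announcements; company names and tags are invented.
announcements = [
    {"company": "A", "tags": ["analytics", "bi"]},
    {"company": "B", "tags": ["hadoop"]},
    {"company": "C", "tags": ["bi"]},
    {"company": "D", "tags": ["nosql", "data integration"]},
]

# Count every tag once per announcement that carries it.
tag_counts = Counter(tag for item in announcements for tag in item["tags"])
total_tags = sum(tag_counts.values())

# Share of overall tag volume (assumed denominator: all tags, not companies).
analytics_bi_share = (tag_counts["analytics"] + tag_counts["bi"]) / total_tags

print(f"{len(announcements)} announcements, {total_tags} tags")
print(f"Analytics/BI share of tag volume: {analytics_bi_share:.0%}")
```

Dividing by the number of announcements instead would answer a different question: what fraction of companies touched the theme at all.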

This says a lot about the maturity of the Big Data ecosystem. As it matures, you expect increasingly higher-level applications rather than delivery of raw, low-level building blocks. Analytics tools are about as low-level as shipped apps get, with BI being one level higher because it tends to require more intelligence in the app than in the end user. Farther down the road, look for applications that merely incorporate Big Data rather than being all about analyzing a dataset. Most of them today are heavily customized, but this will change.

To draw an analogy to houses, Hadoop is a bag of ready-mix concrete and some trees. Analytics is cinder blocks, boards, and hand tools; BI is power tools. Horizontal business apps and libraries that are composed into business apps are the contractors building your house. Vertical-specific apps are what the general contractor builds for you, and at scale are built on a common template. As you move up the stack, you lose a little flexibility but you’re able to build upon more and more existing work and expertise.

Packaging is no longer the key blocker

In 2013, the major unappreciated theme was packaging Big Data so it was consumable by end users. That no longer seems to be the case, with packaging dropping down from 2nd to 6th place in the list. This implies that Hadoop has become much easier to get up and running than it was in 2013, when that difficulty was a key blocker to adoption.

Data cleaning remains underappreciated

Only two companies are pushing products that are primarily about data cleaning, which is generally understood to consume 80%–90% of a data scientist’s time. This to me suggests that either it’s a solved problem (unlikely, given the time expenditure), a problem that’s incredibly difficult to solve, or a problem for which the solution is inexplicably difficult to sell.

What happened to NewSQL?

Companies with new, much faster approaches to traditional RDBMS were all the rage a couple of years ago, but this time around they’ve nearly vanished from the public eye. I’ll be looking to see what their presence is like at the conference, but it seems they don’t have much new to announce at this point.

Emerging tech still emerging (Spark, streaming, in-memory)

Much to my surprise, Spark only showed up 3 times. I would’ve expected at least double that presence in the announcements I received. Along with streaming as a whole and in-memory databases, this group formed what I’d call the “emerging tech” category, although I say that with a grain of salt: the technologies themselves have been around for years if not decades, and even a newer streaming option like Storm is now 3.5 years old.

I expect every piece of this area to take off over the next couple of years commercially, as interest within the RedMonk community in these technologies has grown dramatically over the past couple of years. Particularly with the advent of the Internet of Things, streaming technology becomes vital to coping with the data in a timely manner.

Interestingly, in-memory tech has held nearly static, with the exception of Spark. Perhaps that’ll be where the revolution comes from.

(Tangentially, we’re running an IoT developer conf in a few weeks called ThingMonk, in Denver — our first time in the US. Check it out if you want to dig into this!)

Conclusions

To sum up, I expect the growing appeal to the business user via BI and analytics to be among the key takeaways of this Strata conference. Over the next year, especially in the more technologically progressive CA edition, I’ll be looking for increasing uptake of the Berkeley data analytics stack (Spark & friends), streaming tech, and in-memory data processing.

Update [2015/02/17]: Added house analogy.

Cloud outages, transparency, and trust

dberkholz — Mon, 12 Jan 2015 21:26:56 +0000

The ongoing blips and bloops of public-cloud outages, whether planned or unplanned, continue to draw headlines and outrage. And rightly so, since downtime for those who use a single availability zone or even a single region can cost millions in lost business and reputation for companies whose own websites and online stores disappear.

The latest is a much-maligned 40-hour outage on Verizon’s new cloud:

https://twitter.com/kennwhite/status/554406992450953216

As this tweet shows, the most important part of every outage, planned or unplanned, isn’t the outage itself. It’s everything surrounding it.

It’s the comms, stupid

Much like Bill Clinton’s 1992 rallying cry “It’s the economy, stupid,” cloud providers need to focus on what customers really care about.

Take a look at the CloudHarmony cloud-uptime listings. While AWS is among the top performers, Azure is far from it. Google has a few hours of downtime, and up-and-comer DigitalOcean is more comparable to Azure than AWS.

This suggests to me that outage frequency, within a certain range, isn’t a blocker on adoption of an otherwise compelling cloud provider. The question isn’t which provider is best — but what is the upper limit of what customers find acceptable.
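To put outage numbers like these on a common scale, it helps to convert hours of downtime into an availability percentage and back. The following is a small Python sketch of that arithmetic. The function names are hypothetical, and the one-year window is an assumption rather than how CloudHarmony reports its figures.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


def downtime_budget(target, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed per period at a given availability target."""
    return (1 - target) * period_hours


# A single 40-hour outage, measured against a full year.
print(f"40 hours down in a year: {availability(40):.2%} available")

# Hours of downtime per year that a "three nines" target would allow.
print(f"99.9% target allows {downtime_budget(0.999):.2f} hours per year")
```

Measured this way, one 40-hour outage uses several times the yearly allowance of a 99.9% target, which gives a sense of the range of downtime customers appear willing to tolerate.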

One factor that does very clearly make a difference, however, is communications about the outage. The best-of-breed providers have status sites and Twitter accounts where they post periodic updates, whether an outage was planned or unplanned. Heroku and GitHub are good examples of this. While both sites have their share of downtime, they use strong transparency to maintain the trust of their users.

On the other side of the spectrum is Microsoft, which used to post nice postmortems but has since largely given it up. If you match up their public postmortems with articles pointing out Azure outages, you’ll note a significant disparity, particularly in the last year or two.

I got this bland, unattributed statement courtesy of Microsoft analyst relations:

Reliability is critical to our customers and therefore, extremely important to us. While we aim to deliver high uptime of all services, unfortunately sometimes machines break, software has bugs and people make mistakes, and these are realities that occur across all cloud vendors. When these unusual instances occur, our main focus is fixing the problem, getting the service working and then investigating the failure. Once we identify the cause of the failure we share those learnings with our customers so they can see what went wrong. We also take steps to mitigate that being a problem in the future, so that customers feel confident in us and the service.

We all understand that sometimes things break, because clouds are incredibly complex systems. We’re only really looking for two things out of it: (1) don’t have the same problem twice, and (2) keep us informed. Unfortunately, they aren’t living up to the second half of that. And they’re far from the only ones — see the Verizon example at the beginning of this piece.

As I argued a year ago:

https://twitter.com/dberkholz/statuses/421855352540250112

For those wondering what a great postmortem looks like, Mark Imbriaco (in the past at Heroku, GitHub, and DigitalOcean) gives a masterclass here:

Monitorama 2013 – Mark Imbriaco from Monitorama on Vimeo.

And there’s a plethora of examples posted at sites including the following:

If you don’t have trust; if you think old-school opacity is still the right approach; you don’t have loyal customers and they’ll leave you at their first opportunity. Now you’ve seen the examples and the counterexamples — go forth and communicate!

Disclosure: Amazon Web Services, Microsoft, and Salesforce.com (Heroku) are clients. GitHub has been. Google, Verizon, CloudHarmony, and DigitalOcean are not.

Time for sysadmins to learn data science

dberkholz — Tue, 06 Jan 2015 20:58:10 +0000

At PuppetConf 2012, I had an epiphany when watching a talk by Google’s Jamie Wilkinson where he was live-hacking monitoring data in R. I can’t recommend his talk highly enough — as an analytics guy, this blew my mind:

Since then, one thing has become clear to me: As we scale applications and start thinking of servers as cattle rather than pets, coping with the vast amounts of data they generate will require increasingly advanced approaches. That means over time, monitoring will require the integration of statistics and machine learning in a way that’s incredibly rare today, on both the tools and people sides of the equation.

It’s clear that the analysis paralysis induced by the wall of dashboards doesn’t work. We’ve moved to an approach defined largely by alerting on-demand with tools like Nagios, Sensu, and PagerDuty. Most of the data is never viewed unless there’s a problem, in which case you investigate much more deeply than you ever see in any overview or dashboard.

However, most alerting remains broken. It’s based on dumb thresholds rather than anything even the slightest bit smarter. You’re lucky if you can get something as advanced as alerting based on percentiles, let alone standard deviations or their robust alternatives (black magic!). With log analysis, it’s considered great if you can even manage basic pattern-matching to group together repetitive entries. Granted, this is a big step forward from manual analysis, but we’re still a long way from the moon.
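As a rough illustration of the gap between a dumb threshold and something slightly smarter, here is a minimal Python sketch comparing a static limit, a classic standard-deviation score, and a robust score based on the median and the median absolute deviation (MAD). The latency samples, the 200 ms limit, and the 3.5 cutoff are all invented for the example, and `robust_zscores` is a hypothetical helper, not part of any monitoring tool named here.

```python
import statistics

STATIC_THRESHOLD_MS = 200   # assumed static alert limit
ROBUST_CUTOFF = 3.5         # assumed cutoff for the robust score


def robust_zscores(values):
    """Score each value by its distance from the median in MAD units.

    The 1.4826 factor scales the MAD so it is comparable to a standard
    deviation when the data is roughly normal.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [(v - median) / (1.4826 * mad) for v in values]


# Invented latency samples in milliseconds, with one obvious spike.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 100, 180, 101]

# 1. Static threshold: the spike never crosses 200 ms, so nothing fires.
static_alerts = [v for v in latency_ms if v > STATIC_THRESHOLD_MS]

# 2. Classic z-score: the spike inflates the mean and standard deviation,
#    so its own score stays below the usual cutoff of 3.
mean = statistics.mean(latency_ms)
stdev = statistics.stdev(latency_ms)
classic_z_of_spike = (max(latency_ms) - mean) / stdev

# 3. Robust score: the median and MAD barely move, so the spike stands out.
anomalies = [
    i for i, z in enumerate(robust_zscores(latency_ms))
    if abs(z) > ROBUST_CUTOFF
]

print("static alerts:", static_alerts)
print(f"classic z-score of spike: {classic_z_of_spike:.2f}")
print("robust anomalies at indexes:", anomalies)
```

Only the robust score flags the spike in this toy data, because the outlier distorts the mean and standard deviation that the classic score relies on.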

This needs to change. As scale and complexity increase with companies moving to the cloud, to microservice architectures, and to transient containers, monitoring needs to go back to school for its Ph.D. to cope with this new generation of IT.

Exceptions are few and far between, often as add-ons that many users haven’t realized exist — for example Prelert (first for Splunk, now available as a standalone API engine too), or Bischeck for Nagios. Etsy open-sourced the Kale stack, which does some of this, but it wasn’t widely adopted. More recently Numenta announced Grok, its own foray into anomaly detection, which looks quite impressive. And today, Twitter announced another R-based tool in its anomaly-detection suite. Many of you may be surprised to hear that, completely on the other end of the tech spectrum, IBM’s monitoring tools can do some of this too.

On the system-state side, we’re seeing more entrants helping deal with related problems like configuration drift including Metafor, ScriptRock, and Opsmatic. They take a variety of approaches at present. But it’s clear that in the long term, a great deal of intelligence will be required behind the scenes because it’s incredibly difficult to effectively visualize web-scale systems.

The tooling of the future applies techniques like adaptive thresholds that vary by day, time, and more; predictive analytics; and anomaly detection to do things like:

Avoid false-positive alerts that wake you up at 3am for no reason;
Prevent eye strain from staring at hundreds of graphs looking for a blip;
Pinpoint problems before they would hit a static threshold, like an instance gradually running out of RAM; and
Group together alerts from a variety of applications and systems into a single logical error.
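The third item, catching a slow leak before it trips a static threshold, can be sketched with nothing fancier than a straight-line fit. This Python example uses invented RAM samples and a hypothetical `hours_until_limit` helper. Real tooling would need to handle noise, seasonality, and restarts, none of which this attempts.

```python
def hours_until_limit(samples, limit):
    """Estimate hours until a steadily growing metric reaches `limit`.

    `samples` is a list of (hour, value) pairs. Fits an ordinary
    least-squares line and extrapolates from the last sample. Returns
    None when the trend is flat or falling, or cannot be fitted.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return (limit - intercept) / slope - last_t


# Invented samples: (hour, percent of RAM in use), creeping up 2 points an hour.
ram_used_pct = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0), (4, 58.0)]
STATIC_LIMIT_PCT = 90.0  # assumed static alert threshold

eta = hours_until_limit(ram_used_pct, STATIC_LIMIT_PCT)
print(f"Projected to hit {STATIC_LIMIT_PCT:.0f}% in about {eta:.0f} hours")
```

A static check at 90% stays silent for this whole window, while the trend fit gives hours of warning during working time instead of a page in the middle of the night.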

DevOps or not, I’m running into more people and bleeding-edge vendors who are bringing a “data science” approach to IT. This is epitomized by attendees to Jason Dixon’s Monitorama conference. Before long, it will be unavoidable in modern infrastructure.

Want to get started? You could do a lot worse than Coursera’s data-science specialization.

Disclosure: Prelert, Splunk, IBM, and ScriptRock are clients. Puppet Labs has been. Etsy, Metafor, Nagios Inc, Numenta, Opsmatic, Twitter, and PagerDuty are not.