dougmcclure.net

Exploring Operational Impacts of Running a Default Nagios – PagerDuty Integration

doug — Thu, 31 Jan 2019 10:00:12 +0000

In the prior blog post, I walked through how following the PagerDuty – Nagios XI integration guides leads us to the creation of a “Monitoring Service”. At the end of that post, I mentioned I’d talk about some of the reasons why running in this default configuration isn’t best practice and how this impacts an ops team’s response when using PagerDuty. I’ll talk about this today as well as layout the next few blog posts about moving to better practices when integrating Nagios with PagerDuty.

These few items called out here are by no means an exhaustive or complete list but do represent many of the significant areas I see in both small and large PagerDuty customer environments and spend the most time optimizing for them.

Your PagerDuty Foundation Isn’t Ready for Event Intelligence, Visibility, Analytics and Modern Incident Response!

“Weak Foundation?”

When the sum of all the PagerDuty parts converge in a best practice configuration, PagerDuty’s platform capabilities ensure people (responder, team lead, manager, exec, etc) receive notifications with the right context at the right time so the appropriate response can be taken.

If the context conveyed via a PagerDuty service and incoming events is super generalized or named after a monitoring tool like “Nagios Service”, the ability to respond with the right urgency, understanding (context) and then take the appropriate action can be significantly impacted.

For example, if an on-call responder is paged at 3 AM for a problem with the “Nagios Service”, what’s the appropriate response? Does the “SERVICE_DESC” in your Nagios Alerts prompt the desired response?
If the MTTA/R is increasing for the “Nagios Service”, what is the root cause? Is it due to a single server or systemic problem across all the things, certain teams?
There can only be one Escalation Policy (EP) for that “Monitoring Service”. This means all events from Nagios into your “Nagios Service” go to the same responder(s) or schedule(s)! If you own it all, great, but chances are you’ve got many responsible groups to deal with.
There can only be one automated Response Play for that “Monitoring Service”. Mature operations teams seek to automate operational response where seconds count using automated responses with very specific Response Plays for applications, functional technology types, specific teams or responders. This isn’t possible due to the limitations of a single automated Response Play for your “Monitoring Service”. Don’t hit the big red panic button for 60% disk full events! (* Multiple Response Plays can be configured and launched manually via the Incident UI or Mobile App.)
Responder Notification (Urgency) is broadly applied (High Urgency by default – aka “Wake You Up at 3 AM Setting”) to everything that may be coming in rather than specifically applied based on the required response. Maybe you want to use PagerDuty’s Dynamic Notifications on that “Monitoring Service” but do you ‘trust’ that incoming events have a severity that accurately maps to the needed urgency of an on-call responder’s response? I’ll bet that you’re probably sending in everything as ‘CRITICAL’ anyway. If 1 of 20 servers in your web tier has a ‘CRITICAL’ disk failure – does that warrant a high urgency page at 3 AM?

Sometimes Things [ Are | Are NOT ] Better Together!

mmmm bacon…

One of our better practices is to use Alert Grouping as a means of controlling the noise from poorly configured thresholds or alerting logic, or just plain old “SHTF” situations and alert storms that happen in any ops environment. Without the use of something like PagerDuty’s Time-Based Alert Grouping (TBAG), every single incoming event results in a unique incident, which sends notifications to the on-call responder(s), rinse and repeat for every…single…event.

The situation that many customers fear when talking about “smart stuff” that is supposed to do whiz-bang grouping, correlating or other AI-ML-EIEIO magic is that the wrong alerts are grouped and things get missed.

PagerDuty TBAG is a hard, time-based approach to group things so if a Network Link 5% Packet Loss event (BFD!) happens in the same time window as a MySQL Process Failure event (Oh, shit!), those things likely don’t relate yet they are grouped together. The first event’s description becomes the incident’s description and someone is paged for the Network Link 5% Packet Loss item and the on-call responder dismissed that incident b/c their quick scan of the incident in the mobile app doesn’t prompt closer investigation or an urgent response. All the while, the critical business impacting MySQL Process Failure alert is unnoticed as it’s grouped in with the Network Link 5% Packet Loss incident. See why the concern? Not a fun discussion with the boss…

PagerDuty’s Intelligent Alert Grouping (IAG) ‘learns’ based upon historical TBAG grouping and responders manually merging alerts into incidents. If IAG makes sense in your future (and it will unless responders are dedicated to doing this manual correlation and merging, it can be challenging to do this within the PD Alert UI), you won’t want to influence what IAG may do with bogus groups that could happen with broad based “Monitoring Services”.

Net net here is you probably don’t want to use alert grouping on big, broad based “Monitoring Services” for fear that things are grouped incorrectly and something uber important is missed.

The Journey along the “Signal to Insight to Action” Path Leads to a Peaceful On-Call Experience!

You can get there…

All PagerDuty customers are entitled to use Global Event Routing and certain Global Event Rules to process and route incoming events to the appropriate service. If you’re following the default Nagios – PagerDuty integration guide and directly integrating with the service, you’re bypassing this very powerful feature.

Building upon this basic Global Event Routing capability is the broader Event Intelligence offering and its own associated Global Event Rules providing a growing toolbox of capabilities to deal with the operational realities of your environment.

When deployed properly you’ll efficiently move from signal to insight to action by ensuring the right events land on the right services at the right time so the right responder/team have the right context to take the right action. Whew, that’s a mouthful – but that’s the real goal here right? If you could avoid waking up Fred, Sally and Shika at 3 AM with non-actionable, low urgency events, WHY WOULDN’T YOU WANT TO DO THAT?

Any of this sound familiar?

Alert fatigue from too much noise in your monitoring tools? No problem, there’s a rule for dealing with that!
False positive alerts waking people up at 3 AM due to reoccurring maintenance windows? No problem, there’s a rule for dealing with that!
Crappy alert metadata leading to missed issues or long MTTA/R because on-call responders don’t grok what the alert is trying to tell them or don’t know what to do next? No problem, there’s a rule for dealing with that!

Don’t Let “Business As Usual” or “We’ve Always Done it This Way” Hold You Back!

Insanity?

Imagine a situation where Nagios is deployed and monitoring ALL of your infrastructure – dozens, hundreds maybe thousands of nodes, services, interfaces, URLs, etc. This would take considerable time and effort to move away from! (Worse, you probably have at least a dozen tools all set up similarly with PagerDuty!)

Imagine the sheer amount of manual work to move from your “Business as Usual” configuration of Nagios and PagerDuty “Monitoring Services” to something better – maybe you’re nervously thinking how you might unpack your “Nagios Service” – it may go something like this:

Discover and map out exactly what’s being monitored by Nagios – “I think Nagios Ned can help me with that…”
Discover server, application, ‘thing’ owners – “ugh, I have to talk with that group/person…”
Discover context of what that ‘thing’ does, what it supports, what is impacted when problems found – “uh oh, I’m feeling really uncomfortable…”
Discover what the appropriate operational response needs to be for all event classes/types and who’s responsible – “more meetings…fml…”
Translate all of the above to appropriate PagerDuty configurations following best practices – “a whole lot of point+click coming my way…”
…

There is a much better way and I have some ‘magic pixie dust’ that can help you optimize this!

Where do we go from here?

The next few posts I’ve got in mind build out something like this:

Growing up from the Nagios – PagerDuty defaults – Crawling away from the default “Monitoring Service”
Introducing the Global Event Routing API – Walking in with your eyes wide open
Extending Nagios with Custom Attributes – Running with PagerDuty like a champ
Applying Event Intelligence to improve your Nagios + PagerDuty experience for on-call responders
Magic Pixie Dust – How PagerDuty can help you ADAPT to a better way of doing things in ops and on-call when using Nagios

Nagios XI & PagerDuty: The path to a “Monitoring Service”

doug — Tue, 08 Jan 2019 18:05:46 +0000

Nagios (Core/XI) is one of the top 5 most widely integrated tools across PagerDuty’s 10K+ customers providing fundamental host and service monitoring and alerting capabilities. During my time here at PagerDuty, I’ve had the opportunity to work with very very large and well established enterprises and the latest up and coming start-ups / DevOps teams all still relying on good old Nagios monitoring for their infrastructure. I remember the SysAdmin teams I worked with back in the 2000’s using early versions of Nagios, it’s certainly been around for a long time and works well for the basics.

From what I’ve seen, many of PagerDuty’s customers take a “set it and forget it” approach in their integrations. They’ve followed the super simple integration guide (e.g. Nagios XI) and created a “monitoring service” in PagerDuty. In this configuration, Nagios host or service templates are updated to send all Nagios alerts to one single PagerDuty service integration key (the “PagerDuty Contact” pager number) and ALL alerts are sent to PagerDuty notifying on-call responders of the latest Nagios alert. The key here with this default configuration is that everything monitored in Nagios whether it is a Windows or Linux server, network device, database or web server, all of the alert notifications are sent to the same “PagerDuty Contact” (the PagerDuty service) and notify whoever is on that single service’s escalation policy. Most integration guides include an FAQ section at the bottom of the guide with pointers on extending the default integration to address this, but few seem to go down this path. This default configuration pattern isn’t the best practice for an ideal operational response and use of PagerDuty.

When I say “set it and forget it”, what I’m really saying is that the integration well-established up so quickly and ‘just works’, that teams just move on to the next thing vying for their attention in their environment and accept what they have as ‘good enough’. Some teams have more maturity within their team/processes or have dedicated FTEs solely responsible for the care and feeding of their Nagios configurations as part of their configuration management, CI/CD or similar release process. Over time, the daily demands of the business and ops prohibit going back and optimizing the Nagios to PagerDuty integration to properly address many of the shortcomings in the default integration guide available today.

What I’d like to drill in on here are a few of these default configurations and their resulting alert, incident and reporting artifacts within PagerDuty and the operational implications over the next few blog posts how to move beyond the defaults to a best practice configuration in both Nagios and PagerDuty.

The Guts of the Nagios XI – PagerDuty Default Integration

The core of the PagerDuty – Nagios XI integration comes down to two parts, the Nagios XI Contact configuration and the Nagios XI Command (and associated pd-nagios python script). Essentially, the contact defines the alert notification conditions and whom (or what) to send the notification details to and the command receives the alert notification metadata (via Nagios macros) and passes this data into the pd-nagios script resulting in a post to the PagerDuty Event API v1.

In the service notification example below, when a service alert is triggered on a host, the configured contact alert conditions are evaluated the service notification command is executed. The default ‘notify-service-by-pagerduty’ command is called passing in a number of parameters to the pd-nagios python script.

Command Line: /usr/share/pdagent-integrations/bin/pd-nagios -n service -k $CONTACTPAGER$ -t “$NOTIFICATIONTYPE$” -f SERVICEDESC=”$SERVICEDESC$” -f SERVICESTATE=”$SERVICESTATE$” -f HOSTNAME=”$HOSTNAME$” -f SERVICEOUTPUT=”$SERVICEOUTPUT$”

This simple Nagios command and associated executable script and parameters is where everything comes together. Let’s review this in more detail.

-n [service|host]: notification_type: This parameter is used to signify if this is a service or host notification from Nagios. It’s used in the pd–nagios script to set up which macro values are mapped into PagerDuty event payload fields. This value is displayed in the alert key field.

-k $CONTACTPAGER$: This is the pager number/address for the Nagios contact. This value is taken from the pager directive in the contact definition. This macro value maps into the PagerDuty Event API v1 service_key field and represents the “Integration Key” for the Nagios XI integration configured on the PagerDuty service. This could also be an integration key for a Custom Event Transformer (CET) or the Global Event Routing API key. Set in step #12 of the PagerDuty – Nagios XI integration guide.

-t $NOTIFICATIONTYPE$: A string identifying the type of notification that is being sent (“PROBLEM”, “RECOVERY”, “ACKNOWLEDGEMENT”, “FLAPPINGSTART”, “FLAPPINGSTOP”, “FLAPPINGDISABLED”, “DOWNTIMESTART”, “DOWNTIMEEND”, or “DOWNTIMECANCELLED”). This macro value maps into the PagerDuty Event API v1 event_type field. Nagios XI “PROBLEM” maps to event_type ‘trigger’, “ACKNOWLEDGEMENT” maps to event_type ‘acknowledge’ and “RECOVERY” maps to event_type ‘resolve’.

-f $SERVICEDESC$: The long name/description of the service (i.e. “Main Website”). This value is taken from the service_description directive of the service definition. This macro value is used in the PagerDuty incident description, alert key, service and custom details fields.

-f $SERVICESTATE$: A string indicating the current state of the service (“OK”, “WARNING”, “UNKNOWN”, or “CRITICAL”). This macro value maps into the PagerDuty Event API v1 severity field and is used in the PagerDuty incident description, severity, state and custom details fields.

-f $HOSTNAME$: Short name for the host (i.e. “biglinuxbox“). This value is taken from the host_name directive in the host definition. This macro value is used in the PagerDuty incident description, alert key, source, host and custom details fields.

-f $SERVICEOUTPUT$: The first line of text output from the last service check (i.e. “Ping OK”). This macro value is used in the PagerDuty incident service output and custom details fields.

The results of the Nagios XI – PagerDuty Default Integration

Let’s explore how this default Nagios XI – PagerDuty integration looks as seen in various places the alert and incident could be displayed. We’ll use this Nagios XI alert as the example.

This Nagios XI disk alert will trigger the ‘notify-service-by-pagerduty’ command and pass the macro values to the PagerDuty Event API v1 resulting in a new PagerDuty alert.

This is the resulting PagerDuty alert display within the Alert tab. Note that I customized the column display to show the additional columns that could display alert information.

If “View Message” is clicked on in the lower left corner, a portion of the PagerDuty Event API v1 payload can be seen.

By clicking on the alert summary I can open the alert’s detailed display.

By clicking on the link next to “Related to Incident:”, the PagerDuty incident details is displayed.

If the PagerDuty user has configured their notification preferences to receive email, this is what they would be sent.

If the PagerDuty user has configured their notification preferences to receive push notifications on the PagerDuty mobile app, this is what they’d see for the incident and alert detail.

In the next blog post, I’ll call out some of the reasons why running in this default configuration isn’t best practice and how this impacts an ops team’s response when using PagerDuty. I’ll also lay out the next steps that can be taken to move towards better practices when integrating Nagios with PagerDuty.

In the Beginning – The “Monitoring Service”

doug — Fri, 28 Dec 2018 14:42:16 +0000

Follow one of PagerDuty’s integration guides for common monitoring tools and you’ll quite easily end up with your very own “Monitoring Service” and open the floodgates for incoming signals, alerts or events from that tool now triggering PagerDuty alerts and incidents. In the end, hopefully, you’ll be paging the right on-call person at the right time! PagerDuty makes it super easy to get started with this design pattern no matter the tool in your environment thanks to over 300+ integrations in our portfolio today.

Your “Monitoring Services” may resemble something like “Nagios XI -Datacenter 1” or “Splunk – Atlanta” or just plain old “New Relic”. I see them in all shapes and sizes across customers in all parts of the world and every industry and size. Internally here at PagerDuty, in addition to calling these “Monitoring Services” we also often refer to them as “Catch All Services”, “Event Sink Services”, or “Datacenter Services” because they do one thing well – catch all incoming signals, alerts or events in one single PagerDuty service and notify someone based upon the single escalation policy associated with that service. Works, but maybe not so well in the long run.

The speed and ease at which you can integrate tools into PagerDuty is awesome. In a very short time, you’re up and running getting value from PagerDuty. Any responder or any schedule on the escalation policy associated with these kinds of services will get paged. Application Developer team paged for network events, you bet! Security team paged for server foo.bar.com disk space events, you got it! On-call responders paged at 3 am for a problem with “New Relic”, a piece of cake. Trying to engage the right team for the right alert/incident is very challenging when you only have one escalation policy to use for anything/everything that might be monitored by your integrated tool.

If you’d like to apply PagerDuty’s best practices for reducing the sheer number of incidents and notifications in this configuration you can simply turn on Time Based Alert Grouping at let’s say a two-minute grouping window. Group away! Sometime later, the Application Support team reaches out to you confused because there are some weird Cisco Chassis Card Inserted alerts grouped together with their important application incidents. The Storage Ops team pings you in Slack confused by the custom “Front Door Visitor” alert grouped into their DX8000 SAN incident. Time is time and Time Based Alert Grouping is just doing its job perfectly across the mega “Monitoring Service”.

Ease and speed aside, as you can see above with not so subtle examples that there are a number of drawbacks from following the “Monitoring Service” design pattern and this configuration certainly isn’t a ‘best practice’. Over the next few blog posts, I’d like to take you along on a better practice journey by exploring PagerDuty’s service design best practices and our Event Intelligence product through practical applications when using Nagios XI.

Cough, cough….

doug — Thu, 27 Dec 2018 19:18:02 +0000

Uggh, so much dust around here. I’m thankful for the power of my Dyson vacuum as I’m going to need it to get this blog back into some presentable shape for 2019. Let me start with this as a means of pushing some of the old stuff off the main page.

A Day in the Life in NextGenCo’s Digital Transformation Journey

doug — Thu, 30 Mar 2017 19:36:47 +0000

I invite you to spend a few minutes reading this background story about the fictitious NextGenCo and ponder for a few minutes if it resonates with you and a day in your life? If so, pay particular attention to this last paragraph where I’m setting the stage for future blog posts and an invitation for you to get involved!

NextGenCo is a decades-old enterprise rapidly accelerating their digital transformation and adoption of hybrid cloud architectures using IBM Bluemix to expand their Application Platform offerings. With many datacenters supporting their global operations, they are quickly modernizing by deploying more and more of their Application Platform infrastructure within IBM’s Bluemix Cloud. Every department is challenged by the CIO to improve availability, reliability, quality and spending within their areas of the business.

Not only are the changes being made in the infrastructure, technologies and architecture used to support their hybrid cloud pattern, but changes within the NextGenCo culture are happening as well to become more agile and collaborative across business and technology silos of their past. The adoption of DevOps practices such as continuous integration and delivery and use of tools like Slack for open, transparent collaboration, ChatOps and getting work done much more efficiently are helping their teams realize the benefits of NextGenCo’s more modern competitors.

Andy is a member of the Application Platform delivery team supporting key business applications within the NextGenCo environment. Andy’s primary role is as the lead IT Admin supporting WebSphere Application Servers (WAS) and various other middleware software used in the NextGenCo Application Platform.

As the lead WAS Admin, Andy spends considerable time worrying about the health, welfare and security of their WAS environment within the Application Platform and the business applications that depend on it. Andy’s team is constantly challenged to meet the demanding needs of the business with an ever-changing application environment. He is regularly challenged to cut costs, reduce spending and do more with less.

His team is overworked resulting in lower priority (yet important) work such as upgrades, maintenance and housekeeping often put off for extended periods of time.

Andy works with Jim who’s one of the top SMEs in the engineering department. Jim is one of the super-heroes often involved in complex application problems involving WAS and someone Andy can count on for help during outages as well as more challenging WAS maintenance planning.

He also relies heavily on Anne who runs Capacity Planning for the Application Platform. Much of the project work Andy’s team is responsible for comes from Anne’s upgrade change requests to support growth in the Application Platform.

Andy also has a close working relationship with RJ and Annette in the IT Ops organization. Andy, RJ and Annette collaborate and communicate within the NextGenCo Slack team where everyone is kept up to date with progress updates when problems occur or when Andy is executing change requests within the production WAS environment supporting NextGenCo’s critical business applications running on the Application Platform.

Andy’s team uses dozens of corporate tools, 3rd party software products, spreadsheets and data to ensure his team meets their commitments to the IT organization, CIO and the LoB. His team is measured against traditional IT metrics such as availability and performance service levels, IT budget Capex, Opex and expense targets as well as service quality (# outages, # incidents, MTTR).

As a leader, Andy desires to improve his team’s chemistry and work-life balance by looking for ways technology and process improvements could help them be more successful. Andy is always looking to get more done in his day. He regularly makes use of his time during his morning train ride into the office to plan his team’s daily priorities. From his iPad, Andy is able to quickly gain insights into his environment by asking Eleanor, his Watson Powered Cognitive ChatBot, “What’s Up?” within his Slack team.

What would a day in your life look like if you had a IBM Watson powered cognitive assistant or ChatBot available to help you with your day-to-day work planning and execution? What if those mundane tasks, workflows and problem escalations could be radically simplified and augmented by your own bot, right where you are already having conversations and doing work? Watch for future blog posts as I take you along a cognitive journey demonstrating how Eleanor can help transform a day in your life supporting IBM middleware.

If you’re interested in exploring how IBM’s Watson and cognitive capabilities and experiences such as a cognitive assistant or ChatBot could come to life in your environment and for your IT roles like Andy, Jim, Anne, RJ or Annette please consider joining our next trial of #codename:Eleanor, our platform for multiple cognitive use cases within Cloud, IT, Network and DevOps teams. #codename:Eleanor will be first exposed within the new IBM Cloud Product Insights offering supporting IBM middleware such as WebSphere, Liberty, MQ, IIB and ODM (many more to come). Check out the detailed offering page within IBM Bluemix here for details on supported versions and how to get started with the core offering. If you’re an Andy, Jim or Anne supporting these applications, I want to help you experience Eleanor!

Sign up here for the April start of the IBM Cloud Product Insights – Cognitive Trial or feel free to reach out to me directly for more information at dmcclure@us.ibm.com.

WYNTK about IBM Operations Analytics – Log Analysis v1.3.x

doug — Mon, 15 Jun 2015 14:07:39 +0000

Here we are again with another quarterly release of IBM’s Log Analysis solution. About three months ago or so we released version 1.3.0 and this week we’re releaseing version 1.3.1 full of all kinds of log ninja goodness. Let me get some basics out in this first post.

ITOA Community and GitHub

Our new (and hopefully final!) community was launched during the IBM Interconnect conference and is available here. It’s here you’ll find blogs, wikis, forums and our resources catalog full of all kinds of things to help you get started with our ITOA portfolio. What’s even cooler is our GitHub community linked in here where our SMEs, partners and clients can share their own best practices, configurations and code. Check that out here.

Keep a close eye here as the catalog of available content packs rapidly grows. We expect upwards of 100 new content packs to be available over the coming year.

Documentation

It’s always fun to try and find the docs isn’t it? Here’s some quick links to the online pubs.

v1.3.1: Pubs
v1.3.1: Release Notes

v1.3.0: Pubs
v1.3.0 Release Notes

I’m not a mainframe guy myself, but if that’s your cup of tea, here’s the docs for our mainframe log analysis release. This release is based off of the v1.3.0 mentioned above.

v2.1.0
v2.1.0 Release Notes

Here are some of the significant features in the v1.3.x release train so far. I’ll dive deeper into some of these with some how-to’s in follow on posts.

Real Time Alerting, Alert Management GUI
“Anomaly” Detection
Logstash 1.4.2 Bundling with custom Logstash Output Plugin for LA
New OOTB Insight Packs – IBM MQ Series and IBM MQ Broker/Integration Bus
Ticket Analytics – IBM Control Desk, Service Now and BMC Remedy
Hadoop/HDFS Integrations – IBM Big Insights 3.x and Cloudera CDH 5.x
Role Based Access Control (RBAC) Phase 1
Auditing
Additional statistical functions for search facets (for use in visualizations and dashboards
Dashboard Auto-Refresh
Globalization for ten languages
Currency for Apache Solr, RHEL and browsers

If any of this makes you feel warm and fuzzy, why not grab the free trial version (software or VM) and play around in your environment. The LA 1.3.0 version is available here. I hope they get the LA 1.3.1 version up there soon, so keep an eye out for it at the same link.

Don’t hesitate to reach out to me directly if you have any questions or need some help!

Design and Deployment of a BlueMix Dev/Ops Log Solution: Progressing towards Milestone 1

doug — Wed, 13 Aug 2014 18:10:48 +0000

To catch up on this series, start back here.

We started with a simple deployment planning activity – get some servers built so we can get the necessary software installed and then and start hacking some configurations together to allow sample data to be collected and indexed and some simple searches and visualizations from that data to be available. Enter fun activity #1 – SoftLayer and associated VPNs. I would do anything for some way to stay logged into my SoftLayer VPN for longer than a day! Nothing pains me more than having 20 more more putty sessions up and then getting kicked off the VPN when the 24 hour timer ends. Please tell me how I can change this behavior or have some magical way to reconnect the VPN and all those putty sessions!

My partner Hao on the Dev/Ops team sent me some sample logs from a handful of Cloud Foundry (CF) components including the Cloud Controller and DEA components. Turns out each CF component type and end point system (CCI – SoftLayer Cloud Computing Instance) could have upwards of 10-20 different log types. I started out as I do with any project like this and began to immerse myself in the log samples. Before I was asked to start, Hao started kicking the tires on our Log Analysis solution on his own and hit some of the fundamental challenges in this space in terms of how to get logs in. He specifically called out logs that were wrapped in JSON, logs with timestamps in Epoch format, logs without timestamps, etc. as a few of his key challenges.

I think most of us in this space always start out in a common way – find the timestamps, find the message boundaries, find the unique message patterns that might exist within each log type and then think about what meaningful data should be extracted from each log type to enable problem isolation and resolution activities. I talk to clients about taking an approach for finding a sweet spot in their log analysis architecture between simple consolidation and archival of everything, to enabling everything for search to gaining high value insights from log data. Just because you have all kinds of logs doesn’t mean you must send them all into your log solution or invest a lot of time trying to parse and annotate every possible log message into unique and detailed fields. Finding the right balance of effort put into integration, parsing and annotation of logs, frequency of use and the value to the primary persona (Dev/Ops, IT Ops, App/Dev, etc.) are all dimensions and trade offs to consider before starting to boil the ocean. Simple indexing and search can go a long way before deep parsing and annotation is really required across all log types.

What this means to us for our first few milestones is that there are a lot of log types and we don’t have any firm requirements for what’s exactly needed yet. We don’t know which logs will prove to be most valuable to the global BlueMix Fabric Dev/Ops team. Obviously, some of these log types will be more useful in problem isolation and resolution activities. We want to bring everything in initially in a simple manner and then get that iterative feedback on which logs provide most useful insights and work on parsing and annotating those for high value search, analysis and apps. Out of two dozen unique log sources in the CF environment, we started out by taking an approach that kept things as simple as possible. We wanted to get as much data in as quick as possible so we could get feedback from the global Dev/Ops team on which log types were the most useful to them.

From the beginning of my discussions with the Dev/Ops team, we needed to keep in line with their deployment model, base images and automation approaches as much as possible. We didn’t want to deploy the Tivoli Log File Agent (actually not supported on Ubuntu anyway) or install anything else to move the log data off the end point systems. We decided to make use of the standard installation of rsyslog 4.2 (ancient!) on the Ubuntu 10.4 virtual cloud computing instances (CCIs) used within the BlueMix/CF/Softlayer environment.

We’re using the standard rsyslog imfile module to map in the 10-20 different log files per CCI into the standard rsyslog message format for shipping to a centralized rsyslog server. Each file we ship using imfile gets a functional tag assigned (eg DEA, CCNG, etc) which is useful downstream for filtering and parsing. On the centralized rsyslog server we’re using a custom template to create a simple CSV output format which we send to logstash for parsing and annotation. The point here is that we took a simple approach and normalized all of the CF logs to a standard rsyslog format which gives us a number of standardized, well formatted slots single including one slot containing the entire original message content to use downstream. I’ll spend an number of posts on rsyslog and logstash later!

Here’s an example of the typical imfile configuration:

### warden.log $InputFileName /var/vcap/sys/log/warden/warden.log $InputFileTag DEA_WARDEN_warden $InputFileStateFile stat-DEA_WARDEN_warden $InputFileSeverity debug $InputFileFacility local3 $InputRunFileMonitor

This is the template we use to take the incoming rsyslog stream from all of the CCIs and turn it into a simple CSV formatted message structure. In reality I could even simplify this a bit more by removing a couple of the fields I’m ultimately not indexing in the Log Analysis solution.

template(name="scalaLogFormatDSV" type="list") { property(name="timestamp" dateFormat="rfc3339" position.from="1" position.to="19") constant(value="Z,") property(name="hostname") constant(value=",") property(name="fromhost") constant(value=",") property(name="syslogtag") constant(value=",") property(name="programname") constant(value=",") property(name="procid") constant(value=",") property(name="syslogfacility-text") constant(value=",") property(name="syslogseverity-text") constant(value=",") property(name="app-name") constant(value=",") property(name="msg" ) constant(value="\n") }

This gives a simple output message format like this for each of the 24 or more log types in this environment.

2014-08-07T14:42:17Z,localhost,10.x.x.x,DEA_WARDEN_warden,DEA_WARDEN_warden,-,local3,debug,DEA_WARDEN_warden, {"timestamp":1407422537.8179543,"message":"info (took 0.065380)","log_level":"debug","source":"Warden::Container::Linux","data":{"handle":"123foobar","request":{"handle":"123foobar"},"response":{"state":"active","events":[],"host_ip":"10.x.x.x","container_ip":"10.x.x.x","container_path":"/var/vcap/data/warden/depot/123foobar","memory_stat":"#","cpu_stat":"#","disk_stat":"#","bandwidth_stat":"#","job_ids":[123foobar]}},"thread_id":123foobar,"fiber_id":123foobar,"process_id":123foobar,"file":"/var/vcap/data/packages/warden/43.2/warden/lib/warden/container/base.rb","lineno":300,"method":"dispatch"}

So much like the picture in this blog post, we’ve got lots of logs in all shapes and sizes. Some are huge, some are small and we’re getting them into a uniform format (eg uniform size/length for the log truck) in preparation for getting the most value from them (eg lumber). Up next, shipping an aggregated stream to logstash for parsing these normalized messages

IBM Log Analysis (SCALA) Tuning App v1

doug — Fri, 08 Aug 2014 02:19:50 +0000

As part of this BlueMix Fabric Log Solution project, getting visibility into everything in my log solution architecture is pretty important. I’ve got a lot of instrumentation across the end-to-end pipeline so metrics are overflowing in my environment. I started to work on a simple app to pull all of this together so I can trend and visualize it over time and be able to see the impacts of tuning activities.

This is a first cut at pulling some of the metrics out of the Distributed EIF Receiver / Unity Generic Receiver logs. I’m shipping them with the logstash-forwarder, parsing them in logstash and sending them to the internal elasticsearch server for easy search and visualization using kibana. ELK at its finest!

I’ll update this as I go for other SCALA logs as well as others I’m using frequently such as rsyslog impstats.

Ping me or check my github soon for the configurations.

Swimming in the BIG BlueMix Sea – Design and Deployment of a Log Analysis Solution for IBM’s BlueMix Cloud Foundry PaaS

doug — Thu, 07 Aug 2014 19:06:06 +0000

A few months back, I was asked to help deploy our Log Analysis solution for our BlueMix Fabric Dev/Ops team. Their pain point – getting value and insights from massive amounts of Cloud Foundry (CF) log data across multiple development, staging and production environments in order to provide a highly available BlueMix offering. No problem I thought. A log is a log is a log. I’d done this a number of times for various applications or technologies using our Log Analysis solution. Three months later as we move this into an environment most closely resembling our production environment, things have gotten very interesting to say the least in terms of designing for a scale out log solution supporting 100’s to 1,000’s of GB of log data each day.

I want to share my journey here on my blog so others may benefit who chose to do something similar in their own Cloud Foundry environment or within their other very large application or technology environments using our Log Analysis solution. Parts of this are certainly reusable with other similar log collection, consolidation, search and visualization solutions available today and are not all dependent on use of the IBM Log Analysis solution. The overall architecture and design approach, decisions and many of the configurations are reusable for anyone desiring to design, build and deploy a log management solution using modern products, tools and techniques.

Within most highly dynamic and growth projects, start-ups, etc., management and monitoring stuff is often an afterthought, a “we’ll get to it later” kind of thing. There were no firm business or technical requirements guiding us as we began this project. I think everyone on the global Dev/Ops team knew it should be done and many were trying to attack the problem with scripts and one-off tools to help them keep their heads above water and deal with daily problems. What we need, expect and desire from the solution has evolved over each milestone of this project and will continue to as more of the global Dev/Ops team begins to use the solution on a daily basis. There are however, a few fundamental architecture and design goals that I’ve anchored my work on this project around based on our early experiences in the project:

Architecture Design Goals – We didn’t start with these from day zero, but they quickly became the focus of our work as we discovered the operational characteristics of each BlueMix Cloud Foundry environment.

Support a sustained message volume of XX MB|GB/s < -- not hiding numbers here, we just haven't set a target yet!
Message delivery quality of XX %
End to end source to search availability in XX minutes – when a record is written, when is it available for search?
Absorb a sustained burst in message volume of XX MB|GB/s over XX minutes
Process all rsyslog disk assist cache and/or buffers from burst within XX minutes

Need to Understand – With anything new, there are lots of unanswered questions and concerns from various parts of the Dev/Ops team. We need to work towards being able to answer some fundamental questions such as these.

Total daily message volume (GB/TB), messages/sec, network utilization
Total message volume by CF component type (eg Cloud Controller) day/week/month
Total message volume by end point CCI (with a given CF component)
Retention period (eg 30 days) – system resources required, pruning frequency
Find the high value log types and/or messages needed for problem isolation/resolution and determine parsing/annotation requirements
Find the lower value log types and/or messages and enable filtering and the edge or consolidation elsewhere

Need to Answer – We need to know how to scale up the solution as the overall BlueMix offering grows.

How must the log solution architecture scale with expected BlueMix growth?
What is the impact on current solution resources when new environments, components or end points are added?
How to scale architecture, control costs and provide good UX across a global deployment of BlueMix environments (datacenters)?

That’s a good intro to what I’ve been up to lately along with all of the normal customer and development activity! A lot more to come for sure. Up next, The path to milestone 1 – Sample Cloud Controller and DEA Logs

Bookmarks for March 25th through June 18th

delicious — Wed, 18 Jun 2014 13:00:13 +0000

These are my links for March 25th through June 18th:

OpenStack LumberJack – Part 1 rsyslog | Professional OpenStack – Logging for OpenStack has come quite a ways. What I’m going to attempt to do over a few posts, is recreate and expand a bit on what was discussed at this last OpenStack Summit with regard to Log Management and Mining in OpenStack. For now, that means installing rsyslogd and setting it up to accept remote connections.
rsyslog.conf file –
FailoverSyslogServer – rsyslog wiki –
How to configure failover for rsyslog in Red Hat Enterprise Linux 6? – Red Hat Customer Portal –
Introducing the Solr Scale Toolkit | SearchHub | Lucene/Solr Open Source Search –
Highly Available ELK (Elasticsearch, Logstash and Kibana) Setup | Everything Should Be Virtual –
Logstash configuration dissection –
Splunk Introduces Splunk Enterprise 6.1 – Enabling the Mission-critical Enterprise Multi-site Clustering: Delivers continuous availability for Splunk Enterprise deployments that span multiple sites, countries or continents by replicating raw and indexed data in a clustered configuration. Search Affinity: Provides a performance increase when using multi-site clustering by routing search and analytics requests to the nearest cluster, increasing performance and decreasing network usage. zLinux Forwarder: Allows for application and platform data from IBM mainframes to be easily collected and indexed by Splunk Enterprise. Data Preview with Structured Inputs: Enables previewing of massive data files to verify alignment of fields and headers before indexing to improve data quality and the time it takes to discover critical insights.
Streamlining application logs collection on AWS Elastic Beanstalk with logstash – part 1 | Mob in Tech – However, we like to experiment things, so I decided to try the home made solution for the backend of our new upcoming mobile game. Our backend is a homebrewed Java REST webservices application hosted in an Elastic Beanstalk container, in the us-east-1 region. The final goal is to gather logs from all instances of the Java application into a local (Paris) Elastic Search database, in a scalable manner. In this case, scalable means for us: every single step of the data pipeline has to be horizontally scalable, meaning we can speed up the process by adding additional capacity at each step independently.
How to Pre-Process Logs with Logstash: Part III of “Scalable and Robust Logging for Web Applications” ← #workHard / partyHard – This article is an introduction on how to pre-process logs from multiple sources in logstash before storing them in a data store or analyze them in real time. Some common use cases are unifying time formats across different log sources, anonymizing data, extracting only interesting information from the logs as well as tagging and selective distribution.
Building an Activity Feed System with Storm – Programming – O’Reilly Media – Problem You want to build an activity stream processing system to filter and aggregate the raw event data generated by the users of your application. Solution Streams are a dominant metaphor for presenting information to users of the modern Internet. Used on sites like Facebook and Twitter and mobile apps like Instagram and Tinder, streams are an elegant tool for giving users a window into the deluge of information generated by the applications they use every day.
Wirbelsturm: 1-Click Deployments of Storm and Kafka clusters with Vagrant and Puppet – Michael G. Noll – I am happy to announce the first public release of Wirbelsturm, a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure. Wirbelsturm’s goal is to make tasks such as “I want to deploy a multi-node Storm cluster” simple, easy, and fun. In this post I will introduce you to Wirbelsturm, talk a bit about its history, and show you how to launch a multi-node Storm (or Kafka or …) cluster faster than you can brew an espresso.
RapidEngines Application Analytics – We provide the worlds fastest, most flexible and most scalable time series data platform. Delivered as software or a cloud service to help you visualize and detect application performance events before they impact your business.
SevOne Acquires Log Analytics Provider RapidEngines | Business Wire – SevOne, the leader of scalable performance monitoring solutions to the world’s most connected companies, today announced it has acquired RapidEngines, a leading provider of highly scalable log analytics software for IT enterprises, service providers and application developers. The acquisition is the first from SevOne since closing the $150M investment from Bain Capital which remains one of the largest venture financings of 2013. SevOne’s large customer base will now have access to RapidEngines’ log analytics software granting users the benefit of automatically collecting and organizing log data to better provide a detailed picture of user and machine behavior.
Google Cloud Platform Blog: A New Logs Viewer for Google Cloud Platform – Today we are excited to announce a significantly updated Logs Viewer for App Engine users. Logs from all your instances can be viewed together in near real time, with greatly improved filtering, searching and browsing capabilities. This release includes UI and functional improvements. We’ve added features that simplify navigation and make it easier to find the logs data you’re looking for.
About | LOGSEARCH – What started out as an internal development project from within City Index was soon after released as an open source project for all to benefit. City Index realised the potential value of the information available to them in the log files and required a flexible solution to not only view the log files but rather to view and cross analyse them.
Approaches to Indexing Multiple Logs File Types in Solr and Setting up a Multi Node, Multi Core Solr Cloud – Apache Solr is a widely used open source search platform that internally uses Apache Lucene based indexing. Solr is very popular and provides a database to store indexed data and is a very high scalable, capable search solution for the enterprise platform. This article provides a basic vision for a single and multi-core approach to indexing and querying multiple log file types in Solr. Solr indexes the log files generated by the servers and allows searching the logs for troubleshooting. It has the capability to scale to work in a multi-node cluster set up in a distributed and fault tolerant manner. These capabilities are collectively called SolrCloud. Solr uses Zookeeper for working in a distributed manner
Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop | Cloudera Developer Blog – Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline is an efficient way to consume records (e.g. Flume events, HDFS files, RDBMS tables, or Apache Avro objects), turn them into a stream of records, and pipe the stream of records through a set of easily configurable transformations on the way to a target application such as Solr, for example as outlined in the following figure: In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
Pivotal CF 1.1 Advances Enterprise PaaS with New Capabilities | Pivotal P.O.V. – What’s new in Pivotal CF 1.1:
Improved app event log aggregation – developers can now go to a unified log stream for full application event visibility (Watch) and drain logs to a 3rd party tool like Splunk for analysis (Watch)
elasticsearch-curator 1.0.0 : Python Package Index – Tending your time-series indices in Elasticsearch