knownbugs

Production Support

2020-03-20T00:08:00.001-04:00

I have seen my share of IT processes across a number of companies over the past 25 years. The ones around production support have always been the most interesting, specifically, in situations where teams/organizations get themselves into trouble and then spend their life on crisis/bridge calls trying to get themselves out of it. A lot can be learned about the culture of a company going through one of these ..

Whilst the basics of best practice around IT production support have been enshrined in standards such as ITIL, just following a book of rules has never been a recipe for the extraordinary.

One specific dynamic I have often debated in my mind is around the "segregation of duties". I remember my time in the early 90s where it became a bad thing suddenly for us developers to have access to production. Heaven forbid, we would make changes in production on the fly .. this was true in IT telecom (i.e. at least until telecom became IT), primarily driven via the discipline of managing mission critical networks and related lifeline services. I did have a certain respect for this, especially given the hard-coded culture of service availability in telecom companies.

Of course, in the finance industry, I found that even data-center sysadmins were not trusted with privileged access to the very servers they were meant to administer. Also two-eyes-four-eyes .. something we can thank the banking sector for.

To my point and the picture above. I have tried to capture what I see as an often misunderstood and consequently dysfunctional state of affairs within an IT organization. I define "commando" as the behavior where folks make changes to production without any due diligence or testing. I define "process driven" as doing everything by the book and, in the extreme case, as overly constraining and time consuming (without adding any value).

On the other axis, I define "trial & error" as the mode of analysis/resolution that teams resort to when they lack the technical knowledge/understanding/skills of what they are supporting. This can be methodical and process driven and will ultimately yield a result (however, on speed, you need to be lucky). NB> I don't classify the "bounce the servers" solution as necessarily "trial & error", as it is an effective step in cutting your losses on troubleshooting when you have SLAs. I define "knowledge/skills based" as the state where the highest technical skills (typically the original developers/engineers) are applied and engaged on problem solving. NB> This is NOT a line that defines Tier2 application support vs Tier3.

On target zones, it really depends on the business impact. Typically, however, in most companies, it is a one-size fits all approach. Companies have a hard enough time getting consistency in their performance.

Break-glass is an interesting one and typically is meant as a safety or "panic" button, when normal process doesn't work or speed is required. This allows for the developers to take control and break the "segregation of duties" barriers.

Questions to ask when you assess where you are on the chart :
1. when in a crisis, are your smartest/highest skilled people engaged (and accountable) ?
2. do they have access when needed (and the tools) ?
3. are they allowed to lead or are they muted by process ?

2020-03-12T21:12:00.002-04:00

I am always amazed when I find IT teams with this behavior :
1. Business requires something urgently from IT
2. IT assesses the change
3. IT then designs and solutions the change
4. IT then risk assesses the change as "high risk"
5. IT then presents the solution with a "high risk" profile to the business pretty much scaring the pants off everyone.
6. Business then backs off the ask
7. End = do nothing. IT feels happy they made a good "risk" based decision

Bright futures in such companies ..

barely sufficient

2010-01-28T06:11:00.003-05:00

I observed a very talented engineering team making a basic mistake today.

I think we often misinterpret what 'barely sufficient' means in the context of agile software development and delivery. It is mistakenly interpreted as an excuse to cut down requirements. In reality, I believe it applies more to engineering and design than to requirements. There are two different syndromes to be careful of in a software project - 'scope creep' and 'creeping elegance'.

In an agile methodology, we iterate through cycles of 'design a little', 'code a little', 'test a little'. We embark on writing software often without fully understanding the problem. I am a big fan of this vs. designing everything up front. Personally, I pretty much think on the keyboard, as I believe people learn incrementally. That said, I am always aware that there is a risk in this mode until I have a full grasp of the problem. My energy is always directed towards activities or areas that help me flesh out unknowns. Whilst in this mode, I 'hack' for speed and refactor only after I think I have my arms around the business problem.

What I found the team doing was laying the 'foundation' down, aka building middleware. That by itself wasn't a problem; however, they had taken their eyes off the business problem and failed to deliver the results in the needed timeframe.

Programming competitions are a great way to teach developers this mindset. You have a fixed duration so you have to be quick. You have to focus on the problem and only the problem .. no deviating onto bunny trails. You have to solve the problem in the simplest quickest way. As engineers, we love complexity so, the last discipline is the hardest.

Service granularity and re-use

2009-06-04T13:04:00.002-04:00

Architects in the IT organisation where I work display an interesting tendency to equate re-usability with granularity. The received wisdom seems to be that the more granular a service is, the more re-usable it is. To a certain extent it is useful to have the ability to mix and match just the bits of functionality you require. This becomes detrimental at the point where it starts to push behaviour toward the consuming systems, of which there is usually more than one.

As a case in point, a new system is being written, which (among other things) exposes the ability to look up items in a cache. It employs a side-caching strategy to do this. Its API looks something like this:

MyItemCache
findItem(itemId : String) : Item
createItem(item : Item)

Consumers of this service will call findItem() to check for the item in the cache, using the Item if it is there. If the Item is not there, they are expected to fetch the Item from the source system - which is a different system entirely - and then add it to the cache, using createItem().

There are a number of issues with this approach.

* The fact that the cache is a side cache (and therefore does not front the source system) means that consumers of this cache must also have knowledge of the source system - making the overall architecture more complicated and brittle.

* Each consuming system is expected to implement over again the logic required to check in the cache, then fetch and add the Item if it is not already cached.

While this last point may seem like a small amount of code to write, it should be remembered that forcing each client system to re-implement this logic means that the difficulty of changing this logic is multiplied by the number of client systems, with all the associated co-ordination of teams that this involves. Add to that the impact that differing or buggy implementations might add to the mix and you have a problem far more damaging than the cost of a few lines of code.

An alternative approach, which would eliminate both of these problems, would be to make the cache a through-cache, with the cache itself handling the "if not cached: fetch from source" behaviour. Having eliminated the need to manually add things to the cache, the service interface is simplified, as follows:

MyService
findItem(itemId : String) : Item

Client systems are then relieved of responsibility for implementing this logic over and over, and need not have knowledge of yet another system. Re-use is enhanced, while complexity is reduced. Everyone is happy! :-)
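A minimal sketch of the through-cache idea, in Python. The in-memory dictionary, the `fetch_from_source` callable and the shape of `Item` are assumptions made for illustration; the real service would sit in front of whatever the source system actually is.

```python
from typing import Callable, Dict, Optional


class Item:
    """Hypothetical stand-in for the Item type used in the interface above."""

    def __init__(self, item_id: str, payload: str) -> None:
        self.item_id = item_id
        self.payload = payload


class MyService:
    """Through-cache: consumers only ever call find_item()."""

    def __init__(self, fetch_from_source: Callable[[str], Optional[Item]]) -> None:
        # fetch_from_source is an assumed adapter onto the source system.
        self._fetch_from_source = fetch_from_source
        self._cache: Dict[str, Item] = {}

    def find_item(self, item_id: str) -> Optional[Item]:
        cached = self._cache.get(item_id)
        if cached is not None:
            return cached
        # Cache miss: the service, not the consumer, goes to the source system.
        item = self._fetch_from_source(item_id)
        if item is not None:
            self._cache[item_id] = item
        return item
```

The "check, fetch, add" logic now lives in exactly one place, so changing it (expiry, negative caching, a different source) no longer means co-ordinating every consuming team.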

Performance Engineering

2008-05-02T15:13:00.003-04:00

A good article from Alok Mahajan & Nikhil Sharma from Infosys ! Am quite stunned to be frank.

on colocation

2008-03-18T18:22:00.003-04:00

One of the interesting things I found was the impact of colocation on the productivity of teams.

Consider a team of six people working on a module. Try this out .. place them at random seats within an office floor out of line of sight from each other. What you will find is that they may as well be in different countries.

Now place them in a six person pod or cubicles adjacent to each other. You will be amazed at the difference this makes in their productivity.

Back in 96, I was amazed one day to find one team started the habit of standing up on their desks (across six partitioned cubicles) for a daily quick discussion / debrief. It was quite funny as this was before the days of 'agile' and 'standups' etc. They found it quite effective to just stand up on their desks and confer when they had a quick team decision to make.

Just something that made me smile back then.

running systems integration delivery hubs

2008-03-09T04:56:00.004-04:00

A large SI (systems integration) project typically involves the requirements, design, development, integration, testing and deployment of an integrated enterprise software release. This typically involves the work of a large number of people spread across numerous teams, typically, across the organization and often across organizational boundaries.

Ideally, there are only two roles in this setup. A business role and a technical role. The business role represents the 'user' or the 'customer'. They lead on requirements specification & testing. The technical role leads on design, development, integration and deployment. Of course, each activity involves folks from both roles; however, it is important to keep perspective of who 'leads'.

The main source of inefficiency within any structure is communication. It is the hardest problem to solve. That is why people with multi-dimensional skills are so much more effective than people with a single dimension or specialized roles.

Having separate teams perform each specialized function (requirements, design, development, integration, testing, deployment etc.) causes havoc: imagine the inefficiency and discontinuity created by the hand-offs. Where possible, combine these functions into fewer teams and roles; however, some specialization is inevitable when you scale up.

The establishment of the critical roles (leadership) and the required infrastructure is key to success.

Delivery Hub must-haves.
1> co-location facility with open floor plan, plenty of whiteboards, and 2-3 breakout rooms.

2> strongest presence from the e2e testing team (ideally also the 'users').

3> e2e solution design team (also ideally the integration leads).

4> representation from each of the component teams (project manager + 2-3 key developers)

5> e2e test environments - ideally 3. One being production mirror, another one for development-integration and the last being integration-release. The config and release management for the production mirror and integration-release environment must be formally managed to ensure integrity and optimal uptime that allows testing productivity. Expect the dev-int environment to be the most dynamic of the three with daily drops from component teams.

6> Core delivery hub roles :
a> overall program lead with program management support
b> overall architect / design lead (ideally also the integration lead)
c> overall business lead (requirements, business process etc.)
d> testing lead (ideally also the requirements lead)
e> test/config management lead (manages change control and integrity of the test environments, also ideally leads the software deployment to production)
f> deployment lead (focussed on pulling together all aspects of the deployment - training, data grooming, documentation, systems conversion & cut-over etc.)
g> e2e support lead (service management including e2e monitoring)

7> Daily program stand-up. I always did a 6 pm (1 hr) stand-up every day that helped bring focus to our execution.

In the project, there are three key phases that you will encounter. Am assuming that some form of agile or spiral methodology is used, so, these phases while unique, may not represent a distinct milestone. You will know when you are in each phase and effort must be made to prioritize activities that allow you to move to the next phase.

Phase I: This is where the requirements and design are still under a fair amount of churn and incomplete.
Phase II: The scope is now locked down and the designers have more or less finished their stuff. The developers now have plenty to do and are on the critical path.
Phase III: Testing / integration is critical path with the developers and designers fixing defects.

The diagram above gives you a sense of the dynamics and activities in the project. A couple of core principles to highlight.
a> In phase I, it is the designers that are under stress. Everyone should help to make them productive. What this really means is that component development teams help with the e2e solution design and do not wait to be spoon fed a document. The burden is really more the feeling of being the bottleneck than the technical challenge. This also solves another standard pitfall in that it improves the efficiency of knowledge transfer from design to development.
b> Tight requirement documentation and change control from the start is a must.
c> Test case development starts with the requirements .. ideally, use cases are translated to test scenarios from the start. Data setup demands for the test environments are looked at from the start.
d> In phase II, force early drops from the development teams. Any and all early feedback from the 'users' and 'testers' helps derisk the project. Early integration = SUCCESS ! Get into the specify/design-code-integrate-test spiral as quickly as possible. In this phase, give as much flexibility as possible to the development teams. Anticipate frustration from the testers due to the quality and downtime of test environments. Do not underestimate the value of their involvement though .. it is crucial that they stay engaged and provide any and all early feedback. The only thing to not compromise on with the development teams is releasing into test. In this phase, testers are usually buddies with the developers.
e> Phase III is where testers become antagonistic towards the developers. Here the balance shifts to testing productivity. You must favor discipline in release management for test environments over the development team's inclination to fix (aka change) things constantly.

On the e2e project plan :
The critical artifact for me was what I called the 'component integration matrix'. It was a simple spreadsheet that had a breakdown of functionality of the enterprise release on one axis and the components on the other. The solution design outlines which components are involved for each functional grouping. The program managers working with the component delivery managers would work out the release dates and software versions for their components ready for integration. The easy optimization that this allowed was to align/prioritize the work for each component so that e2e functional threads would be completed as a priority thereby allowing the e2e test teams to get going. Of course, in parallel to the design, the test teams aligned their e2e test cases to each of these horizontals and reported test coverage along these lines. That way, I could in a single spreadsheet see the convergence and critical path of the program from a software development and integration perspective.

This was always more useful than the prettiest gantt charts. It is only finalizing this that gave me any sense of a target date, so, completing this was always a priority.
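A rough sketch of how such a matrix can be read, in Python. The thread names, component names and dates are invented for illustration, and the real spreadsheet also carried software versions and test coverage, which are left out here.

```python
from datetime import date
from typing import Dict

# Rows are functional threads, columns are components; each cell is the date
# the component's drop for that thread is ready for integration.
# All names and dates below are made up for illustration.
Matrix = Dict[str, Dict[str, date]]


def thread_ready_dates(matrix: Matrix) -> Dict[str, date]:
    """An e2e thread can be tested only once its last component has dropped."""
    return {thread: max(cells.values()) for thread, cells in matrix.items()}


def critical_thread(matrix: Matrix) -> str:
    """The thread that converges last bounds the earliest possible target date."""
    ready = thread_ready_dates(matrix)
    return max(ready, key=lambda thread: ready[thread])


matrix: Matrix = {
    "order capture": {"portal": date(2008, 4, 1), "billing": date(2008, 4, 15)},
    "provisioning": {"billing": date(2008, 4, 15), "inventory": date(2008, 5, 6)},
}
```

With this sample data, "order capture" can enter e2e test on 15 April, while "provisioning" waits on the inventory drop and is therefore the thread on the critical path.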

-------------------------------
NB> This is by no means expected to represent a comprehensive view of the subtleties or complexities within a delivery hub ... just some food for thought. This represents 15 years of experience brain dumped within 30 mins so take it for value added.

It's not done until it's deployed and working!

2007-11-20T07:47:00.001-05:00

It may amaze some people, but there are still development teams out there that think their job is done when they hand over to the test team. It should be obvious to everyone that the business derives no value from a system until it is deployed and working. Finger pointing and claiming that "it works on my machine" doesn't make money for the business!

Scripting or otherwise automating the deployment of an application is an invaluable aid to the whole development process. It speeds the process, thereby reducing the code/test feedback cycle. Even more importantly, it makes the process repeatable. The same script used for deployment to test should be used for deployment to production, thereby exercising the deployment scripts as part of the test cycle.

Likewise, if your project has difficulty with deployments, having a developer present during production deployments will pay dividends. There is nothing like first hand experience for bringing home to the development team the issues faced when their application is used in anger.

Until you have confidence that your deployment will go perfectly every time, involve the development team in every production deployment. And make automated deployment a requirement of every development project.
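One way to keep the deployment identical across environments is to hold the steps in one place and let only the data vary. The sketch below is in Python; the host names, database names and step wording are hypothetical, and a real script would execute the steps rather than just list them.

```python
from typing import Dict, List

# Hypothetical per-environment settings; only data differs between targets.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "test": {"host": "app.test.example", "db": "orders_test"},
    "production": {"host": "app.prod.example", "db": "orders"},
}


def deployment_steps(env_name: str, version: str) -> List[str]:
    """Return the ordered steps for a deployment; the logic is the same for every environment."""
    env = ENVIRONMENTS[env_name]
    return [
        f"stop application on {env['host']}",
        f"install version {version} on {env['host']}",
        f"migrate database {env['db']}",
        f"start application on {env['host']}",
        f"run smoke test against {env['host']}",
    ]
```

Because test and production go through the same function, every test deployment is also a rehearsal of the production one.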

wikipedia

2007-11-11T13:09:00.000-05:00

Wikipedia - continues to amaze me every day.

I am often faced with situations I have absolutely no technical information or background on. Believe it or not, in many situations, all I have as starting tools are my instincts. However, this is rapidly corrected by good ol' google search and wikipedia lookups, although I still do miss speed reading documentation in book form.

On vacation with nothing better to do than just relax, I came across these amazing pictures ..
http://en.wikipedia.org/wiki/Wikipedia:Featured_pictures

English Wikipedia Featured Pictures

double your broadband speed

2007-11-02T11:23:00.000-04:00

I have been struggling with my DSL service since I moved into this new home. For some reason, my modem was training to 4M instead of the 8M in my previous home. My first reaction was acceptance that this was due to the distance factor from the exchange (which IS a significant factor). However, after months of resentment that life should be better, I decided to do something about it.

I had suspected that my internal home wiring was a factor. I knew that I should get a boost by changing things around; however, I did not expect the level of impact. 15 minutes of investment boosted my speed from 4 Mbps to 8 Mbps.

So here is what I did and some explanation of what was causing the problem.

In the picture above, scenario A is your typical home internal wiring. The pair comes into the home through a special wall plate (NTE) and is then distributed around the home. This is a spiderweb typically. Worse, there may be other devices using the phone lines .. intercoms, alarm systems, pay-per-view box etc. The ADSL modem is typically on the end of one of these legs so, the signal to the modem has the interference and loss caused by the spiderweb of internal home wiring in addition to the normal loss and interference.

Scenario B uses a special device called an NTE5 central adsl splitter that plugs into the wall-plate where the copper pair from the exchange enters your home (you will have to do a mini external home survey to find where the pair disappears into the wall into your home).

You can see from the diagram that the signal to the modem does not have any of the issues with the internal home wiring. This also eliminates the need to put splitters on each and every outlet that has a phone.

This is a 15 min job. Really trivial. Only limitation is that your ADSL modem now has to be located and plugged into this wall plate only. Here is a good link / site that explains further.

http://www.broadbandzone.co.uk/shop/centralisedfilter.html

The issue of having the ADSL modem locked into one place, possibly far away from a computer, is really a non-issue nowadays. Two options / reasons : most ADSL modems now have Wi-Fi built in (G or N standards will be more than enough). Alternatively, there are powerline ethernet devices that work very well. What these devices do is basically make your internal home power (220V) cables into a transmission network for ethernet. Fairly pricey still, however, extremely flexible and they work !! I use the NetGear Powerline Ethernet HD adapters and have no complaints.

Maturing IT support - framework / model

2007-11-01T08:30:00.001-04:00

I find the model I created useful in evaluating where teams stand in their maturity and the kinds of things I ask them to focus on to move up the value chain and improve their performance.

An example, to move from a state of 'managed' to 'measured', I ask teams to put in place measures in the following areas :

A> Business KPI reporting in the context of the system being measured.
B> Measures around the utilization of the system (beyond CPU etc.). The most basic is a graph of concurrent user logins at 15 min intervals. More sophisticated are transaction-level measures.
C> Systems availability reporting, which of course is always 99%+. A better way is measuring business impact, i.e. #minutes downtime / call centre agent / month.
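A minimal sketch of the business-impact measure, in Python. Weighting each outage by the number of agents it affected is an assumption about how the number would be computed; the function name and the sample figures are invented for illustration.

```python
from typing import Iterable, Tuple


def downtime_minutes_per_agent(
    outages: Iterable[Tuple[float, int]], total_agents: int
) -> float:
    """Average minutes lost per call centre agent over the reporting month.

    Each outage is (duration_minutes, agents_affected).
    """
    if total_agents <= 0:
        raise ValueError("total_agents must be positive")
    lost_agent_minutes = sum(minutes * affected for minutes, affected in outages)
    return lost_agent_minutes / total_agents
```

For example, a 30 minute outage hitting 200 agents plus a 10 minute outage hitting 50, in a centre of 500 agents, gives `downtime_minutes_per_agent([(30, 200), (10, 50)], 500)`, i.e. 13 minutes lost per agent for the month. That says far more to the business than an availability figure with a string of nines.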

I've fallen and I can't get up

2007-11-01T08:06:00.000-04:00

Too often I get a plea for help wherein a development/delivery manager runs into problems taking a system into production. Guess most often where they run into trouble .. yep ! Performance.

When questioned about the technical details, same old pattern. Lack of understanding of underlying middleware, database, 3rd party tools etc.

When a team displays such a lack of understanding, what I hear is the team saying to me - "I can code it, however, I don't know how to actually make it work !"

Performance considerations are intrinsic to good development practices & design. While a focussed effort on performance optimization for a week using a highly skilled team always yields amazing results, it is a bad idea to deliver under that assumption.

This is often the simple difference between average and good teams.

reuse

2007-09-16T13:52:00.000-04:00

On reuse within IT. I seem to be talking about this a lot so .. might as well put this down.

IMHO, I break reuse within IT into the following stages :

Stage 0 : Reuse teams (just this gets you 60% of the way)
Stage 1 : Reuse design patterns (typically effected by having a clearly articulated architecture and some governance frameworks). This however, may be a legacy of waterfall methodologies.
Stage 2 : Reuse software (libraries, SOA, components etc.)

on monitoring enterprise shared/common components

2007-07-07T17:43:00.000-04:00

I had an encounter with a commonly used enterprise component - single sign-on, specifically, a tool called SiteMinder (now owned by Computer Associates).

First my rant, as this cost me 3 days of my life and a weekend away from my family. While quite a nifty tool and what appears to be a highly scalable platform, I was not prepared for the level of 'blindness' to simple things like throughput and response time. You can get all sorts of information about connections, threads etc. however, the information isn't sufficient within the package to fully monitor what is going in and out of this black-box. Also, no historical graphs in the monitor ? Wait, that is another product you have to buy from CA ? Wouldn't it be awesome if only we could reset the stats/counters at run-time ? That way, you could tune, then reset the stats/counters & re-measure.

OK. Got that off my chest and it wasn't completely venomous. Believe me, it's been a tough week.

Seriously, shared infrastructure typically has a really compelling business case and yes, there truly are efficiencies to be gained; however, effective monitoring becomes absolutely critical. All eggs in one basket means you save on baskets and runners, however, the stakes for a mistake go way up !! So you better be careful.

Shared infrastructure is also more complex to model and monitor, specifically when you are dealing with layered distributed systems. E.g., in the SiteMinder model, there are agents that consume transactions (which may be locally cached) from a series of policy servers, which in turn consume transactions from downstream services, e.g. LDAP directory servers, authentication servers ...

In such a framework, I would closely monitor the following aspects :
- daily/hourly transaction arrival rate from each agent (and the cache rate).
- transaction response time variance (this is a sign of downstream bottlenecks)
- resource consumption in the policy server (no. of threads active/in-use) .. monitor at peak
- above three for each of the downstream consumables.

If your application support team can do this, they have the first clue about what is going on within their framework else, guess what, tomorrow you may go through what I just went through this past week.
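As a rough sketch of the first two measures, in Python. The `Txn` record is an assumption: it stands for whatever can be pulled out of logs or counters, and is not something the product exposes in this form.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import pvariance
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Txn(NamedTuple):
    """Hypothetical transaction record, e.g. parsed from a log line."""

    agent: str           # which agent sent the request
    arrived: datetime    # arrival time at the policy server
    response_ms: float   # time taken to answer


def hourly_arrivals(txns: Iterable[Txn]) -> Dict[Tuple[str, datetime], int]:
    """Count transactions per agent per clock hour."""
    counts: Counter = Counter()
    for t in txns:
        hour = t.arrived.replace(minute=0, second=0, microsecond=0)
        counts[(t.agent, hour)] += 1
    return dict(counts)


def response_variance(txns: Iterable[Txn]) -> Dict[str, float]:
    """Population variance of response time per agent."""
    by_agent: Dict[str, List[float]] = defaultdict(list)
    for t in txns:
        by_agent[t.agent].append(t.response_ms)
    return {agent: pvariance(times) for agent, times in by_agent.items()}
```

Tracked over time, a jump in the variance for one agent while its arrival rate stays flat points downstream rather than at load.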

on the dark arts of search engine optimization

2007-04-25T07:15:00.000-04:00

I have been trying to figure out why I cannot search the internet (google, msn, yahoo ...) and get to my blog even after doing an exact search for keyword combinations exclusive to my blog. What opened up to me is a whole new world. I know, I know, I am obsolete .. what world have I been living in ?! I am an old UNIX/C guy and html/xml really doesn't classify as programming to me. I now feel bad about poking fun at the mainframe guys back in the 90s. In saying that, now I really feel old !! I digress, sorry for the soapbox.

My quest is to get a hit in google search using a combination of my name and 'knownbugs' keywords. Should be unique with the top hit bringing me to the website hosted on google blogspot .. big assumption being the search engines give you results ordered by relevance (occurrence of all keywords).

Well, not quite so. So, reading up on recommendations, I first researched tagging. Technorati tags is an emerging 'power player' in the world of blogs. Supposedly, 'labels', 'titles', 'headers' in blog content/articles should be automatically picked up by the Technorati engine (invoked when you 'ping'). Alternatively, you can force a tag by using the 'Technorati Tags' method.

Even so, this only makes your tags visible within Technorati's blog search. For the normal google internet search, blog content hosted on blogspot appears invisible, however, on google's blog search, it works.

I also did the wait 30 days and magic will happen thing. This is what some recommend as the time it takes for spiders to crawl your content.

So, further tricks/tips. I am now in the process of getting a custom domain name. Godaddy.com offers cheap registrar services. My selection - www.knownbugs.org (or .info, .net, .biz .. unfortunately, .com was taken by someone who wants to make money by selling the name).

What a custom domain name will do, is treat the blog content as regular www content and hopefully allow the search engine 'spiders' to index the content making it visible in the regular google search world.

I suspect, I will find other gotchas as there clearly is money at play here.

Instead of us being in the age of 'content is king', we appear to be living in the age of 'content control is king'. Here is where there is a war going on. The behemoths - google, microsoft, yahoo .. all at play.

'Influencing' search engines is worth a lot of money nowadays. A massive amount of complexity behind the scenes with 'SEO' or search engine optimization being a real growth area. I worry !

I worry about such central control on information access, however, hacks like us always have a way of breaking free.

More as my quest progresses !! Am still waiting for the DNS servers to update so expect to be redirected to www.knownbugs.org when you go to knownbugs.blogspot.com shortly.

common pitfalls in outage/crisis situations

2007-04-24T11:31:00.000-04:00

Teams dealing with outages typically suffer the following behavioral problems. Some of these conflict with each other in aims .. there isn't unfortunately a set formula I can come up with as each situation has its own variables/complexities. A standardized process/template however, would be great !

1> Trial and error
Don't fall for this. You know you are in trouble when you get ambiguity from the technical teams. If you are stuck, ask yourself what you can do to get additional information onto the table. Often, you will find teams stuck 'enjoying' the problem because they have no method for infusing new information or experience into the problem diagnosis. Doing things in trial & error mode also forces you into a sequential analysis mode. Also, this leads to the '2 hr ETA bait' (see below).

2> Debugging in live
While prevention of recurrence is key and that requires some investment in analysis / data collection during an outage, it MUST be capped. Do not fall into the trap of ... "give me 20 more minutes, I have almost figured it out ..". You will hate yourself later. Walk into the situation with a time limit in your mind upon which you will trigger a failsafe way of restoring service. Communicate that upfront to the team (I am assuming that you have a failsafe procedure to restart the system to restore service that you will trigger on .. may be something as simple as rebooting the servers). Of course, best practice is that you always execute the failsafe and never debug in live, however, that requires a significant investment in test infrastructures that are capable of reproducing the problem. Remember, there is a value to getting to true root cause as that is the only way you will prevent recurrence.

3> Sequential analysis
Distinguish 'sequential analysis' from 'sequential execution'. Sequential execution is good, sequential analysis is BAD. On the analysis side, you should try to split off multiple teams (assuming you have the resources) on different aspects of the problem. That allows you to cover all bases quickly vs. an elongated recovery path where you are problem solving only one thing at a time. Sequential execution is GOOD because you want to introduce only one variable at a time else, you will break the cause and effect chain. Usually, problem solving is about eliminating variables and then incrementally fixing one thing at a time using a measured/scientific approach.

4> 2 hr ETA bait ...
Setting expectations on 'expected time to restoral' is really really hard. Here is the dark art of estimation at its finest. Setting no expectation is unacceptable (it will be fixed when it is fixed .. attitude). Your business partners/users will not be as upset about an outage as they will be about setting false expectations. Usually, a significant outage will require operational teams to build workaround/catch up plans where they may have to staff overtime or weekends. These plans depend on your estimates.

And .. what's worse is none of your technology suppliers / partners will co-operate.

On crisis situations with financial implications, vendors get very very conservative or worse, clam up ! In a crisis situation, you always will feel the information is inadequate to make a decision or set an expectation. I usually follow my instincts here (of course, harnessing whatever facts are available on the situation). Don't try this unless you have the right technical experience.

enterprise support from technology partners

2007-04-24T11:00:00.000-04:00

Five years ago, I was pounding Microsoft on their lack of understanding of enterprise support. It is amazing how far they have come. I remember a couple of years back, an incident relating to a system based on SQL server. It was terrible !! The answers back from Microsoft were very casual .. try this patch ! Of course, nothing being hot patchable however, luckily not requiring a complete rebuild of WindowsNT server, an hour later when we figured that didn't work, the answer was, OK, try this now. We felt really foolish architecting a mission critical enterprise application on SQL server.

Microsoft has really come a long way since then. I was very pleasantly surprised in a recent encounter by how they have matured. Their crisis technical lead was clear, crisp, unfazed by pressure and clearly knew what he was talking about. That instilled confidence. He knew how to distill and present the facts and avoid making false promises. Also, their follow-the-sun model actually worked !! The transitions were seamless, with knowledge transfer occurring behind the scenes and a warm hand-off with 1 hr overlap. Their account team was on the ball and follow-up and follow-through were perfect. In fact, they chased me !!

Other examples of great support I have received are from BEA. BEA's account manager takes the unique honor of being the only sales guy I know who stuck with me for 36 hours straight during a crisis situation helping with anything he could (including doing the coffee rounds). I never believed a sales guy had that kind of stamina ;-). Oracle's down systems group are also top notch.

Technology partners usually have to support crisis situations remotely. They will depend on you for information and one of the challenges is to be able to supply it to them - real-time. Simple things like file-size limits in your email servers can look like bad ideas in these circumstances. Firewalls are a fact of life so, have a strategy on how your technology partners get access to your systems/intranet when you need them to.

crisis bridge protocol

2007-04-24T10:42:00.000-04:00

When dealing with outage situations, it is important to establish a clear bridge protocol for the participants. Hopefully, you won't have to go through these on each call as this will contribute to your MTTR (remember, you are in an outage/crisis situation).

a> one person speaks at a time
b> identify the lead (hopefully you !)
c> people mute when not speaking
d> nobody puts the bridge on hold (most PBXs will play music for the rest of the participants)
e> mute if you want to have a sidebar conversation
f> remember, if you go to sleep, you will be spotted because of your snoring
g> no calling from a cell phone (or c becomes very important)
h> establish clearly the participants and their role/what function they represent

Traditional conference bridges are slowly evolving into a multimedia facility - IM session in parallel is becoming commonplace with Netmeeting/Livemeeting quickly following. At Qwest, it was nice to have the facility to dial into an 800 number and then select a sub-bridge (option 1..9). That way, the main bridge team could quickly branch sub-teams off without confusion and avoid wasting time on communicating bridge numbers.

Separate out from the start a management bridge, customer bridge and the technical bridge. Chaos ensues if you mix them all into one.

Just my two minutes of brain dump .. will add more as I flesh this out/collect my thoughts.

Systems monitoring - historical views - best practice

2007-04-24T10:16:00.000-04:00

One of the best things I have seen is the standardized use of a monitoring framework with historical reporting for the technical aspects of a system (CPU, i/o, network, memory, database, kernel ..).

There are several tools out there in the marketplace that do this. IBM Tivoli, HP Openview, BMC, EMC SMARTS (and then some), all offer solutions along these lines. The key is to instrument agents / data collectors across the estate (on each server) and have a central database & reporting web-site that allows IT folks to select a node and display historical results around a wide variety of technical aspects. The value seems ambiguous, however, let me tell you that it makes my life much much easier.

When in crisis mode, this data helps immensely. It is crucial data that tells you when something changed. It allows technical teams to quickly focus, analyze and resolve a set of issues that normally are thorny and contribute to large MTTR numbers on problem incidents. Yes, logging into the box and monitoring real time tells you there is a problem, however, a historical view tells you when something changed. Also, this is crucial for another best practice - server capacity management and monitoring.

The key is universal rollout / standardization. Don't get trapped in the technology selection mode. Pick one and implement universally. This isn't difficult work, however, best implemented within your server provisioning process so that anything new automatically has the standardized framework.

It is amazing how telling something as simple as a historical CPU profile is. You see processing/business utilization patterns, exactly when backups occur, batch jobs etc. and more importantly, when something CHANGED.
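A minimal sketch of what "when something CHANGED" can look like once the history is available. It slides over a series of hourly CPU samples and reports the point where the average of the following window differs most from the average of the preceding window. The sample values, the window size and the threshold are all invented for illustration; any real monitoring product will expose its history in its own way.

```python
from statistics import mean

def find_change_point(samples, window=4, threshold=20.0):
    """Return the index with the largest shift between the mean of the
    previous `window` samples and the mean of the next `window` samples,
    provided that shift exceeds `threshold` (CPU percentage points).
    Returns None when no shift is that large."""
    best_index, best_shift = None, threshold
    for i in range(window, len(samples) - window + 1):
        before = mean(samples[i - window:i])
        after = mean(samples[i:i + window])
        shift = abs(after - before)
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index

# Hypothetical hourly CPU utilisation (%) for one server.
cpu_history = [22, 25, 21, 24, 23, 26, 22, 24, 61, 65, 63, 66, 64, 62]

changed_at = find_change_point(cpu_history)
print(f"CPU profile shifted at sample index {changed_at}")
```

With a timestamp attached to each sample, that index becomes the time to line up against the change log.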

database maintenance

2007-03-06T23:19:00.000-05:00

Oracle database maintenance.

One recurring theme I see with junior DBAs is the lack of understanding of the 'analyze table' procedure. This is crucial for proper database performance. It can be scheduled once a week or more frequently based on system profile.

Basically, the 'analyze table' command gathers statistics on the table and stores them internally as hints for the query optimizer. Indexes are picked up based on these statistics. For dynamic tables (large growth or shrinkage or changes to key indexed fields), it is imperative this is performed frequently. For safety, run it anyway after key database operations.
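As a rough sketch of "run it for dynamic tables", the helper below picks the tables whose row count has moved by more than a chosen ratio since statistics were last gathered and prints the matching `ANALYZE TABLE ... COMPUTE STATISTICS` statements. The table names, row counts and the 10% cut-off are assumptions made up for the example, and nothing here connects to a database. Later Oracle releases steer you towards the DBMS_STATS package for gathering optimizer statistics, so treat the generated statements as illustrative.

```python
def tables_needing_analyze(last_analyzed_rows, current_rows, change_ratio=0.10):
    """Return the tables whose row count grew or shrank by more than
    `change_ratio` since statistics were last gathered. Tables with no
    previous count (or a previous count of zero) are always included."""
    stale = []
    for table, now in current_rows.items():
        then = last_analyzed_rows.get(table)
        if not then:
            stale.append(table)
        elif abs(now - then) / then > change_ratio:
            stale.append(table)
    return stale

def analyze_statements(tables):
    """Build one ANALYZE statement per table name."""
    return [f"ANALYZE TABLE {name} COMPUTE STATISTICS;" for name in tables]

# Hypothetical row counts: at the last analyze run, and today.
last_run = {"ORDERS": 1_000_000, "CUSTOMERS": 50_000, "ORDER_AUDIT": 0}
today = {"ORDERS": 1_400_000, "CUSTOMERS": 50_500, "ORDER_AUDIT": 12_000}

for statement in analyze_statements(tables_needing_analyze(last_run, today)):
    print(statement)
```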

Know what 'good' looks like for your system. Baseline and save (better still commit to memory) the key behavioral aspects of your system eg. cpu utilization profile, throughput, response times, i/o levels etc. That way, you will know when this profile changes and will be a key trigger for you to action for early detection of problems. More importantly, when you implement change, you should compare the before and after profile of your system.
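One way to make the baseline something a script can check, rather than something held in memory, is sketched below. The metric names, the numbers and the 20% tolerance are all hypothetical; the point is only that a saved profile can be compared against the current one after each change.

```python
def compare_to_baseline(baseline, current, tolerance=0.20):
    """Return {metric: relative_change} for every metric that moved by
    more than `tolerance` relative to its baseline value."""
    drifted = {}
    for metric, expected in baseline.items():
        change = (current[metric] - expected) / expected
        if abs(change) > tolerance:
            drifted[metric] = round(change, 2)
    return drifted

# Hypothetical profile saved while the system was known to be healthy.
baseline = {"cpu_pct": 35.0, "orders_per_hour": 1200.0,
            "response_ms": 800.0, "io_mb_s": 40.0}

# Hypothetical profile measured after a change went in.
after_change = {"cpu_pct": 38.0, "orders_per_hour": 1150.0,
                "response_ms": 1400.0, "io_mb_s": 41.0}

print(compare_to_baseline(baseline, after_change))
```

In this made-up case only the response time has drifted beyond tolerance, which is exactly the kind of early trigger the baseline is meant to provide.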

Milan Gupta

Change management

2007-03-03T06:14:00.000-05:00

The first question I ask a team when something breaks is: what changed ? 99% of problems end up relating to change. Amazing !!

I did not put change as the top problem within my post on what keeps me busy because I believe that change is good. It is necessary and as inevitable as evolution. It is necessary for healthy growth. So IT cannot take the simplistic approach of 'if it ain't broke, don't fix it!'.

Any decent sized telco typically has an IT estate of 3000+ systems, 50K+ computing assets and 10K+ people all changing stuff. Is it any surprise that systems availability & stability actually go up during Christmas ?

What is key to establishing a high performance IT organization is managing change effectively. It is instrumental to have a proper inventory of systems, more importantly, their inter-dependencies and impact on business process. Most importantly, a team that understands this model.

While end-to-end testing frameworks can flesh out unexpected side-effects of change, it isn't reasonable to expect that all work will go through this framework. With 10K+ employees, there will be leakage and impact. Specifically around stuff that you would least suspect.

A proper and effective change management framework would consist of the following :

A team as described above that performs the function of a change approval board (centralized or decentralized)
An e2e testing framework that certifies each change
An effective communication framework typically a change ticket process that notifies the relevant enterprise pieces that are potentially impacted by the change
Leadership & support from the development teams on implementing change.
Post-implementation verification of change (ideally monitoring key business KPIs before and after the change).
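A small sketch of how an inventory with inter-dependencies helps answer "what changed ?". Given the system that broke, it collects that system plus everything it depends on, then lists the change tickets implemented against any of them in the hours before the incident. The inventory, the ticket records and the 48 hour lookback are hypothetical; a real change ticket system will have its own schema.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: each system maps to the systems it depends on.
DEPENDS_ON = {
    "crm": ["orchestration"],
    "orchestration": ["inventory", "activation"],
    "inventory": [],
    "activation": [],
    "billing": [],
}

def dependency_closure(system, depends_on):
    """Return the system plus everything it depends on, directly or indirectly."""
    seen, stack = set(), [system]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen

def what_changed(system, incident_time, tickets, depends_on, lookback_hours=48):
    """Return ids of tickets implemented within the lookback window against
    the failing system or anything it depends on."""
    scope = dependency_closure(system, depends_on)
    earliest = incident_time - timedelta(hours=lookback_hours)
    return [t["id"] for t in tickets
            if t["system"] in scope and earliest <= t["implemented"] <= incident_time]

# Hypothetical change tickets.
tickets = [
    {"id": "CHG-101", "system": "inventory", "implemented": datetime(2007, 3, 2, 22, 0)},
    {"id": "CHG-102", "system": "billing", "implemented": datetime(2007, 3, 2, 23, 0)},
    {"id": "CHG-103", "system": "activation", "implemented": datetime(2007, 2, 20, 1, 0)},
]

incident = datetime(2007, 3, 3, 6, 0)
print(what_changed("crm", incident, tickets, DEPENDS_ON))
```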

Milan Gupta
milangupta1@gmail.com

proactive vs. reactive application support

2007-03-03T05:23:00.000-05:00

A key aspect of a good application support model that is often missed is the importance of each application support group behaving as a customer of a downstream dependent system.

A common fallacy is to assume that each application and its support group is an independent unit purely acting on a reactive basis. This puts the onus of taking a business user problem and translating it to a specific application problem, onto some form of a centralized service management wrap. Not the most efficient as most problems are first detected within the application support teams (provided they are awake). It is imperative that they drive the resolution of the problem to their downstream dependent system.

As an example, consider the picture, in which the application arena is a simple service order processing chain: system A is some CRM system, system B is an orchestration / workflow layer, system C is an inventory / assignment system and system D is a service activation system. Typically, there is tight coupling and dependencies between each of these layers - any anomalies impact business KPIs and flow-through. The users of the CRM systems will see these anomalies as orders not being completed on time. The orchestration system will see these as orders stuck at a particular stage. The problem may actually lie within the assign / inventory layer or the service activation layer, which will also be noticed by the respective app support teams.

Behavior within the teams should be as follows :

1> Ideally, the bridge monitoring system should have received alarms and alerted the respective systems .. this would represent IT being pro-active.

2> Failsafe on this would be the app-support team for the CRM system creating an IT fault on the orchestration system who in turn would transfer the fault to the assign / inventory system. Not the most effective, however, necessary as a failsafe and reinforces proper organizational behavior. For complex scenarios, a service management layer may be introduced. I still consider this pro-active.

3> Least ideal is 1 & 2 failing with the users reporting the fault - this is IT being reactive.

The usual breakdown I observe is in #2 with companies mostly operating in #3. #1 requires a sophisticated business process monitoring infrastructure, something I consider to be still an industry wide problem given the state of investment and commitment to such projects within an IT portfolio. Breakdown in #2 is usually an artifact of organizational boundaries and/or poor skillsets & focus. Each team operates in a silo and purely on a reactive basis. A truly dangerous place to be for any CIO.
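The failsafe in #2 can be sketched as a fault being handed from one team to the next along the chain. The system names follow the A to D example above; the anomaly flags are invented, and the rule used here (keep transferring while the next downstream system also shows the anomaly) is a simplifying assumption, not a description of any particular ticketing tool.

```python
# Order processing chain from the example: A -> B -> C -> D.
CHAIN = ["crm", "orchestration", "inventory", "activation"]

def route_fault(chain, has_anomaly, raised_by):
    """Starting with the team that noticed the problem, transfer the fault
    to the next downstream system for as long as that system also shows
    the anomaly. Returns the list of teams the fault passed through; the
    last entry is where resolution should be driven."""
    start = chain.index(raised_by)
    hops = [raised_by]
    for system in chain[start + 1:]:
        if not has_anomaly[system]:
            break
        hops.append(system)
    return hops

# Hypothetical state: orders are stuck as far down as the inventory layer,
# while service activation looks healthy.
has_anomaly = {"crm": True, "orchestration": True,
               "inventory": True, "activation": False}

path = route_fault(CHAIN, has_anomaly, raised_by="crm")
print(" -> ".join(path))
```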

Milan Gupta
milangupta1@gmail.com

synchronous vs asynchronous transactions

2007-03-01T19:04:00.000-05:00

If you have ever built call center apps .. you will already have learned this lesson. For some reason, we keep repeating these mistakes over and over and over ...

Remember - synchronous transactions for time-sensitive stuff. For transactions that a call center rep has to wait on (while customer is on the phone) .. use synchronous backplane eg. web-services. For others (non-time sensitive), use asynchronous. Your messaging architecture MUST support both.

The thing that creates havoc the most in call centers is transaction performance variance. Not always just transaction performance. If something consistently takes 90 seconds, you will find your call center reps work around this poor performance by predicting this period of wait and filling it with other work or small talk with the customer. What makes call center agents mad is transaction variance - sometimes it only takes 4 secs, sometimes 300. That's when the customer on the line gets the embarrassed comments - 'my system is slow .. my system has frozen up ..' etc. etc.
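To put a number on variance as opposed to plain performance, the sketch below compares two made-up sets of response times: one slow but steady, one usually fast but erratic. It reports the mean, the standard deviation and their ratio (the coefficient of variation). The sample timings are invented for illustration.

```python
from statistics import mean, pstdev

def response_profile(seconds):
    """Summarise response times: mean, standard deviation, and the
    coefficient of variation (stdev divided by mean)."""
    avg = mean(seconds)
    spread = pstdev(seconds)
    return {
        "mean_s": round(avg, 1),
        "stdev_s": round(spread, 1),
        "cv": round(spread / avg, 2),
    }

# Hypothetical timings, in seconds, for the same transaction.
slow_but_steady = [88, 90, 91, 89, 92, 90]
fast_but_erratic = [4, 5, 300, 6, 4, 180]

print("steady :", response_profile(slow_but_steady))
print("erratic:", response_profile(fast_but_erratic))
```

In this example the erratic system actually has the lower average, yet it is the one the reps cannot plan around. That is why variance deserves its own measure.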

Milan Gupta
milangupta1@gmail.com

systems availability vs. systems effectiveness

2007-03-01T18:29:00.000-05:00

First of a series of posts where I will cover the area of application support & service management. This is probably one of the largest problem areas in an IT portfolio and the number one reason that leads to CIO departures.

Providing excellent day-to-day service ! What is the role of IT in this ? What is the role of a particular application support team ?

A typical CIO challenge is to take the IT group up the value chain within a company. This applies to all disciplines including providing day-to-day support.

Support teams and IT value is typically stuck at the systems availability monitoring and reporting level. 99.95% uptime. Famous words. We've all heard this. Somehow that 0.05% seems to hide a massive amount of operational impact. Putting that under the microscope usually leads to startling revelations.
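For a sense of scale, the arithmetic behind that 0.05% is below. It only converts an availability percentage into allowed downtime for a period; the 30 day month is an assumption made for round numbers.

```python
def allowed_downtime_minutes(availability_pct, period_hours):
    """Minutes of downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours * 60

per_year = allowed_downtime_minutes(99.95, 365 * 24)
per_month = allowed_downtime_minutes(99.95, 30 * 24)

print(f"99.95% allows about {per_year:.1f} minutes a year")
print(f"99.95% allows about {per_month:.1f} minutes in a 30 day month")
```

That is roughly 4.4 hours a year. Whether it lands in one block during the business day or is scattered across quiet nights is invisible in the uptime figure, which is where the operational impact hides.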

An alternate strategy is to focus on systems effectiveness - my name for nothing other than business process monitoring, however, this is a little different. Here, you apply the concepts of business process monitoring in a 'systemy' way.

To clarify, each system typically performs a specific function within a process chain. Systems monitoring at the technical level covers all the engineering aspects of the platform eg. database, hardware, CPU, I/O, filesystem space etc. Usually, this stuff is trapped using tools like HP Openview, BMC etc. and monitored by a 7x24 bridge operation. When alerts are received, automatic callouts are performed with an extra pair of eyes to make sure.

Better groups take this up one level: monitoring of log files for errors eg. SQL errors, core dumps, etc. However, this is also usually insufficient. Even better groups start getting sophisticated around application level capacity monitoring - eg. thread utilization, queuing behavior and other subtleties around bad JVM characteristics eg. full GCs.

However, that also isn't usually sufficient. The trick is to customize a set of measures that are relevant to the business use of the application and monitor for that. Keep your finger on that pulse and magic happens. Your operational partners will no longer care if the system goes up or down .. they are happy for you to measure based on the business performance. An example of this is to measure performance response time and variance for transactions that are time sensitive - eg. those that a call center application calls on the back-office systems. Alternately, in the case of workflow, some measure of cycle time and right first time (on-time being trivial case of RFT).
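For the workflow case, a minimal sketch of the two measures mentioned, cycle time and right first time, computed from order records. The record layout is hypothetical, and RFT is taken here to mean "completed on or before the due time with no rework", which is an assumption about the definition.

```python
from datetime import datetime
from statistics import mean

def workflow_kpis(orders):
    """Average cycle time (hours) and right-first-time percentage over
    completed orders. In-flight orders (completed is None) are ignored."""
    completed = [o for o in orders if o["completed"] is not None]
    cycle_hours = [(o["completed"] - o["created"]).total_seconds() / 3600
                   for o in completed]
    rft = [o for o in completed
           if o["rework_count"] == 0 and o["completed"] <= o["due"]]
    return {
        "avg_cycle_hours": round(mean(cycle_hours), 1),
        "rft_pct": round(100 * len(rft) / len(completed), 1),
    }

# Hypothetical order records.
orders = [
    {"created": datetime(2007, 3, 1, 9), "completed": datetime(2007, 3, 1, 15),
     "due": datetime(2007, 3, 2, 9), "rework_count": 0},   # on time, clean
    {"created": datetime(2007, 3, 1, 10), "completed": datetime(2007, 3, 2, 16),
     "due": datetime(2007, 3, 2, 10), "rework_count": 0},  # late
    {"created": datetime(2007, 3, 1, 11), "completed": datetime(2007, 3, 1, 23),
     "due": datetime(2007, 3, 2, 11), "rework_count": 2},  # needed rework
    {"created": datetime(2007, 3, 1, 12), "completed": None,
     "due": datetime(2007, 3, 2, 12), "rework_count": 0},  # still in flight
]

print(workflow_kpis(orders))
```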

Do this and suddenly, you have gone up the value chain and made your life simpler. Your teams grow as they go from being purely reactive to being proactive. Also, more importantly, they learn the operational side of things and recognize exactly the criticality and value of their system in the larger picture.

This isn't anything fancy. I'm not talking about a full-scale business process monitoring framework here. Full BPM requires a standardization of metrics and process and usually abstracts away from the systems design and implementation. For architectures that are a mix of legacy and new, this is usually never perfect either. I'm talking about a simple application of common sense to what you monitor. The challenge is usually understanding the design of the system and extrapolating the meaningful set of measures that the end-to-end business process depends on. This is usually very specific to the design and implementation of the system as the data must be harvested frequently and usually in real-time.

Milan Gupta
milangupta1@gmail.com

Project Execution

2007-02-14T18:42:00.000-05:00

For any significant development project or classical integration programme, there are a number of necessary ingredients, the absence of which usually are a recipe for disaster.

1> The right leadership.
Any project must have its technical leadership and its business leadership straight. Yes, this boils down to two people who will challenge each other and maintain the necessary checks and balances.

2> Management & Escalation.
One of the biggest blunders, and the source of the most chaotic environments, is having non-technical management manage technical work. It is a recipe for disaster, as you will spend your life on escalations that look complex and scary but are very simply solved. Also, a lack of understanding of the development cycle usually leads to premature questions from management, which in turn lead to premature decision making, needless work etc. Eg. on one of my projects, being the architect / technical lead, I was asked (by the VP) for a technical specification of the system within the first month of what was a 2 yr project !! Understand that managers want to know when things will get done even before they allow anything to start. Developers will not reliably tell you when things will get done until they are actually done. Such is life and the variability of agile. You are welcome to use waterfall if you want 100% schedule predictability, however, understand that you are basically getting 1 unit of work at a cost of 5 (the additional 4 are padding to manage the risks/unknowns which are inherent in most projects). On structure, you have basically two philosophies: the architect / tech lead reports to the manager, or the manager reports to the architect / tech lead. I vote for the latter. As long as the manager / project manager understands their role wrt the tech lead / architect, things typically are fine. Watch for this dynamic very very carefully as this is where bad bad decisions are usually made. You do NOT want a non-technical person making a technical decision.

3> Top talent recruiting - the best attract the best. No one likes carrying dead weight. Once you seed the team correctly, this will be self-correcting. I have never believed, and never will believe, in the 200+ project team size. I have done amazing things with a 40 person development team. Remember, the software design and tools you choose themselves bring limitations on how many developers can work concurrently and productively.

4> Pay attention to the learning curve. Things will not progress at the pace you would expect until you have a seed development team that has matured sufficiently around their understanding of the business problem. These will become your technical leads as your project grows. It is wise to invest this learning in the best technical developers from the start as they are the ones who are going to produce the software.

5> Match your technology choices to your developer team skills. If you want your project to be the guinea pig for the 'next cool new tool / technology' .. OK .. but understand your risks. You need the time to get your developers up to speed on this.

6> Establish the right roles within the team from the start.
Architect/Designer, Developer/Tech Lead, Business SME/Tester, Project Manager, Test/Development Environment Manager, Application support lead, Deployment Lead, Integration Lead (Designer), Business Implementation Lead (Training/Comms/Metrics)

7> Get and stay close to the end-user/customer.
The shorter the communication chain between an end user and a developer, the greater the chance of success.

8> Test from the start - your user stories are really test cases in disguise. Pay attention to test data. Test Director is a decent tool to document your tests and track your coverage.

9> Solve the hard problems first. Focus on the unknowns as early as possible. PANIC EARLY !!
It's only great teams that have a 40 hr work week in their final week before deployment. No magic here .. it comes from spending the weekends before that so that you are coasting in style when you near the finish line.

10> Develop/Test with real data as early as possible.
At the early phases of a project, the developers must have flexibility over the testers. This is the best-effort co-operative testing phase. It is extremely frustrating for the testers as the productivity is low. If you are using end-users, you must have alignment, else, you will be fire-drilling all the time trying to manage perceptions of problems from above. This phase of testing is key as your end goal is to get as much early feedback to the developers. During the last phase of the project, the testers must be the enemy of the developers as they move to the 'antagonistic' testing and the user acceptance testing phase. Here, the testers must be the focus with full support from the developers and the test environment leads. Another best practice is to have a repeatable set of test cases rather than a set of testers. This provides a very easy way to manage stakeholders as anybody who wants a say, can review and add to the test cases. The better they are, the better your chances for a quality delivery.

11> Co-locate as much as possible.
NB> Do not assume co-location means same building or floor. Even team members strewn across the floor randomly will not have the same effectiveness as an integration pod or 6 adjacent cubicles housing a sub-team working on a common area. There is truly an amazing effect on productivity. Make the hard call and force this if you are getting into the red zone. I understand this is the day and age of offshoring, however, in crunch mode, you cannot replace good old co-location with anything. Remember, communication/organizational barriers are the most common cause of integration problems. The main problems will be at the boundaries.

12> Basic software engineering disciplines - makefiles, daily/continuous builds, regression test automation, code reviews, use of software quality and analysis tools (purify, jprobe, ..).

13> If you are using 3rd party tools, understand your risks. There will be issues and it is up to you to design around them. Remember, you will have limited flexibility to fix/modify the 3rd party tool. A third party tool does not relieve you from the need to understand the details.

14> Certain roles go hand in hand with accountabilities and later roles in the project lifecycle eg. it is ideal to have the people who specified the user requirements also be the testers; the solution designers/architects be also intrinsic in the integration / testing / defect resolution of the project.

15> Plan for production - clean up the logfiles so they are meaningful and concise. Write the required business process monitoring reports that allow you to ensure the platform's effectiveness post production. Do this early as this will allow you to identify gaps in the design / functionality that can make your life extremely difficult post production. An easy example: for a workflow based system, have the ability to take a snapshot of the in-flight jobs and where they are. Have a clear model of the expected execution profile so you can catch exceptions, performance issues, bottlenecks, etc.
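A sketch of the in-flight snapshot idea: count jobs by workflow stage and flag those that have sat in a stage longer than the expected execution profile allows. The stage names, the expected maximum durations and the job records are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical expected execution profile: longest normal stay per stage.
EXPECTED_MAX = {
    "validate": timedelta(minutes=5),
    "assign": timedelta(hours=1),
    "activate": timedelta(hours=4),
}

def snapshot(jobs, now):
    """Return (count of in-flight jobs per stage, ids of jobs that have
    exceeded the expected maximum time in their current stage)."""
    by_stage = Counter(job["stage"] for job in jobs)
    stuck = [job["id"] for job in jobs
             if now - job["entered_stage"] > EXPECTED_MAX[job["stage"]]]
    return by_stage, stuck

now = datetime(2007, 2, 14, 18, 0)

# Hypothetical in-flight jobs.
jobs = [
    {"id": "J1", "stage": "validate", "entered_stage": now - timedelta(minutes=2)},
    {"id": "J2", "stage": "assign", "entered_stage": now - timedelta(hours=3)},
    {"id": "J3", "stage": "assign", "entered_stage": now - timedelta(minutes=10)},
    {"id": "J4", "stage": "activate", "entered_stage": now - timedelta(hours=5)},
]

by_stage, stuck = snapshot(jobs, now)
print("in flight by stage:", dict(by_stage))
print("exceeding expected time:", stuck)
```

If the system cannot answer even this much about its own in-flight work, that is a design gap worth finding before go-live rather than after.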

16> Measure before you implement, measure after you implement .. you are not done until you restabilize the business KPIs (and this will take you a month). Your end-users may not notice things as they are new to the system too .. so, don't expect to get the usual level of guidance from operation on 'problem areas / defects'.

Milan Gupta
milangupta1@gmail.com