Tuesday, March 31, 2015

On-boarding an Application for DevOps Support


Application Development Completed, Production Release done, Party … now what
Here comes the bigger phase of any solution, tool or product: the Support phase. Once the production wheel starts spinning for any application (internal, external or COTS), the DevOps or Support workforce is the first to feel the impact of any incident or outage. This is particularly true if the application’s infrastructure is a fairly new stack for the Support team and is intertwined with many other systems inside the infrastructure.
Is the Support team always equipped with the right set of skills, details, credentials and tools to tackle the newly inducted application? By Support team I mean the team that does more than raise tickets against others. Typically that category of team falls in Tier 3 or Tier 4, beyond whom only the developers or the product vendor (who usually own the source code) can help.
Here we look at how a T3/T4 Operations team can be inducted to support a new application, with the right set of details to handle the show better.
Application Eco-system
In the current service-layered assembly of solutions and tools, it’s hard to find any application that serves independently. The team has to be given this eco-system knowledge of the application. It could cover:
·         What is the tool about?
·         Where does the application fit in the company’s overall offering?
·         What are the different tiers of the application? DB, message queue, storage, etc.
·         Is the application clustered or redundant? This tells the team what level of impact restarting or bringing down the application will have.
·         Is the application spread across multiple data centers but still served from one URL?
·         How is the application accessed by end-users? E.g. thick client, web browser, headless, from within other code, etc.
·         Who is the contact person in the architect team to refer to for any clarification?
All these details about the application should be prepared as a knowledge base page, with diagrams detailing the stack.
Sandbox for Practice
Theory alone is not always enough to complete the knowledge. A sandbox or practice system can help the team in many more ways: practicing installation and upgrades, reproducing production issues, etc. When providing such a system we also need to consider the additional license cost, learning materials and training required if it was purchased from an external vendor. Sandboxing may also not be possible for every application or solution, for example if it is a major one or involves cost-incurring inter-dependencies in the infrastructure (e.g. an additional storage unit).
Monitoring Set up
Once a product is rolled out, it will be monitored in one way or another: port, URL, disk usage, counters, queue size, etc. These details have to be shared with the Support team to give them a clear picture of what is being monitored and why each KPI is important to the functioning of the whole system. It’s also better to tell them which monitoring system is being used, such as Netcool or Nagios. Sometimes we may get two conflicting alert indications. At times it’s better to check the monitoring system itself to see whether it has a correct view of the application or whether it’s just a network glitch. This cross-verification helps avoid false alerts and escalations, as well as ignoring a vital alert.
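As a minimal sketch of that cross-verification, assuming Python is available on the support workstation: the helper names (port_is_open, disk_usage_percent, cross_check) and the returned messages are hypothetical and not part of Netcool, Nagios or any other monitoring product. The idea is only to probe the service directly and compare the result with what the monitoring system reported.

```python
import shutil
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_usage_percent(path="/"):
    """Return the used disk space for the filesystem holding path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cross_check(alert_says_down, host, port):
    """Compare what monitoring reported with a direct probe of the service."""
    reachable = port_is_open(host, port)
    if alert_says_down and reachable:
        return "possible false alert: check the monitoring system or network path"
    if alert_says_down and not reachable:
        return "confirmed: service unreachable"
    if not alert_says_down and not reachable:
        return "possible missed alert: service unreachable but no alert raised"
    return "ok"
```

A direct probe from one workstation is itself subject to network glitches, so the result is a hint for where to look next, not a verdict.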
Administration Activities
When T3/T4 teams become the top Support team that everyone falls back on, it’s no wonder they might also be involved in the install, upgrade, recovery, restoration, patching and decommissioning of the application. The team has to be trained well on all these activities. Most vendors nowadays have a common portal for product downloads, license renewal, patch downloads, issue tracking, knowledge base, etc. The portal credentials need to be shared with the right people to enable them to manage the product. If the application is in-house, then an alternative, standard arrangement for all of the above items should be made. A periodic meeting with the vendor, developer or architect would also be fruitful during the initial period of support, while the team gets well versed in managing the application.
Release Activities
Usually a T3/T4 Support team is the one doing the periodic release activities for the application. This involves patching, upgrading, feature additions, deployments, bringing up new instances, etc. The team has to be trained well on the change approval process, the release process and the seriousness of violating them. The release calendar for the company or the application should be posted on a common intranet page, where it is published in advance and always available for reference. The team should also be briefed on the urgent change process, which might be required during outages that impact revenue. This may be completely different from the regular approval and change process.
Support
Here comes the actual day-to-day work. Incidents and Requests always top the list of activities in any application management. Though Requests look less serious than Incidents, they have to be handled carefully, with all approvals in place.
Requests
Before any user addition, group addition or grant of additional permissions, it’s not enough just to get the approvals. We also need to see whether it’s a short-lived or long-lived request, and whether it duplicates existing access. For example, a team might have common credentials for a certain activity, but a member of that team may request a separate privilege for the same activity, which is a duplicate effort. If these kinds of items go unchecked, the application’s authorization system will later have hundreds of users and groups, most of them unused and in a confused state, leading to non-compliance.
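The two checks described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the data shapes (a mapping of group to members, group to permissions, and account to last-used date), the function names and the 90-day idle limit are all hypothetical, and a real authorization system would expose this information through its own interface.

```python
from datetime import date, timedelta


def already_covered(user, permission, group_members, group_permissions):
    """Return the groups that already give `user` the requested permission."""
    return sorted(
        group
        for group, members in group_members.items()
        if user in members and permission in group_permissions.get(group, set())
    )


def stale_accounts(last_used, today, max_idle_days=90):
    """Return accounts whose last use is older than max_idle_days (assumed limit)."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, used in last_used.items() if used < cutoff)
```

A non-empty result from already_covered suggests the request duplicates access the user already has through a group; stale_accounts gives a periodic clean-up list to review before the system fills up with unused entries.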
Incidents
Incidents are the complex activity that the Support team will always struggle with. But with the right set of details and tools, this work can almost be standardized. Most of the time incidents arrive as alerts from monitoring systems, from the T1/T2 teams, or as a failure of another dependent system, a failure inside the application stack, a network issue, etc. The team should be given clarity on all these paths and on how to handle each path in a standard way. Otherwise it is very common for the team to immediately log in to the servers to debug the issue, irrespective of which path the issue came from.
Credentials
Having the right set of credentials for the infrastructure makes a difference in MTTR. When we talk about credentials, it’s not just Production but also the TEST, QA, STAGE and DEV instances. When the application consists of a DB, MQ and protected web services in the backend, the number of credentials the team needs to manage multiplies quickly, since each component needs its own credentials in every environment. On top of this, to be compliant the company might need to change all of these passwords periodically. This poses a new problem of keeping the passwords in sync. Usually individuals use an Excel sheet, KeePass or similar tools to store these kinds of passwords. The best way is to host a password portal behind a single login and have everyone refer to it when required.
Alert Category
Alert categorization is another detail that should not be left to individuals. Though it’s generally agreed that P1 marks business or revenue impact, P2 to P4 are usually the confusing ones. Correct categorization is essential for other teams to understand the issue and work on it according to the SLA.
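One way to take the decision away from individuals is to write the rules down as code or as a table. The sketch below is hypothetical: only the P1 rule (business or revenue impact) comes from the discussion above, while the inputs and the P2 to P4 thresholds are assumptions that each company would replace with its own agreed definitions.

```python
def categorize(revenue_impact, users_affected_pct, workaround_exists):
    """Map a few agreed facts about an alert to a priority label."""
    if revenue_impact:
        return "P1"  # business or revenue impact is always P1
    if users_affected_pct >= 50:
        return "P2"  # assumed threshold: at least half of the users affected
    if users_affected_pct > 0 and not workaround_exists:
        return "P3"  # assumed rule: some users affected and no workaround
    return "P4"      # everything else
```

With the rules in one place, two engineers looking at the same alert arrive at the same priority, and any disagreement becomes a discussion about the rules instead of about the ticket.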
Knowledge Base
This is nothing new for a Support team. Every team will have some knowledge base source, whether on the intranet or in an Excel sheet, document, database, portal, etc. Someone has to be held responsible for creating, updating and purging these records. Usually this kind of system has more consumers than creators; everybody thinks that others are creating the content. There has to be a standard way of creating and updating records, otherwise the knowledge base becomes big, ugly and unreliable.
Troubleshooting tool set
The toolset can range from a simple shell interface on the server to an advanced diagnostic portal. Depending on their system expertise, individuals tend to use their own set of tools to troubleshoot application issues. This may be a command shell, grep, wget, curl, SoapUI, browser built-ins, telnet, netstat, etc. The first step is to make this toolset uniform and available as one package to be deployed on all machines involved in troubleshooting. Next, standard steps should be defined to narrow down the issue using this toolset. The steps should always lead close to a particular root cause. This removes inconsistencies in troubleshooting between individual team members.
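A minimal sketch of such standard steps, assuming the checks are wrapped as small Python functions: the runner below executes them in a fixed order and reports the first one that fails, so every team member narrows the issue down along the same path. The name narrow_down and the example check names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each check is a (name, function) pair; the function returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def narrow_down(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first failing one."""
    for name, check in checks:
        try:
            healthy = check()
        except Exception:
            # A check that cannot even run is treated as a failure at that step.
            healthy = False
        if not healthy:
            return name
    return None  # every check passed
```

For example, a list such as [("network", ping_ok), ("web tier", url_ok), ("database", db_ok)] (with the three functions supplied by the team) would point at the outermost failing layer first.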
User Group to keep in loop
Sometimes it’s just not enough for someone to assign themselves the Incident ticket and start working on it. Some issues require the appropriate stakeholders to be updated on time. These could be internal management staff, customers, vendors, etc. At the same time the team shouldn’t ring the wrong bell. This knowledge about escalation and the keep-informed culture has to be standardized across team members, preferably documented and available in a common area like the intranet.
Hand offs and Continuing Incidents
For a distributed Support team working across geographies, handing off issues is an additional task. Likewise, handing over knowledge about continuing issues (most of the time P1) should be done with care. This is essential for the other team to continue troubleshooting effectively. Often this happens through reading a mail chain or chatting one-to-one between the handing-over and receiving team members. If that is the case, anyone else on the receiving team who wants to identify the root cause or contribute to the troubleshooting is left without enough information. This procedure can be standardized to a large extent through a common portal form, or by filling in a structured document and sending it to the entire distribution list.
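As an illustration of what a structured hand-off document could contain, here is a small Python sketch. The field names and the text layout are assumptions, not a standard; the point is that every hand-off carries the same sections and can be sent as-is to the whole distribution list.

```python
from dataclasses import dataclass, field
from typing import List


def _section(title, items):
    """Render one titled list, making empty sections explicit."""
    return [title + ":"] + ([f"  - {item}" for item in items] or ["  - none recorded"])


@dataclass
class HandoffNote:
    ticket_id: str
    priority: str
    summary: str
    steps_tried: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)
    stakeholders_informed: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the note as plain text suitable for a mail or portal form."""
        lines = [
            f"Ticket: {self.ticket_id} ({self.priority})",
            f"Summary: {self.summary}",
        ]
        lines += _section("Steps tried", self.steps_tried)
        lines += _section("Next steps", self.next_steps)
        lines += _section("Stakeholders informed", self.stakeholders_informed)
        return "\n".join(lines)
```

Printing "none recorded" for an empty section is deliberate: the receiving team can see that nothing was tried or planned, instead of wondering whether the section was forgotten.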
Daily Standups
Standups don’t belong just to Scrum and Agile projects. The Support team can hold a daily standup, probably at the end of their day, to discuss and share the issues that occurred over the day and the ways in which they were resolved. This brings private knowledge to the table and helps in sharing it. It should be just an open discussion; a lead person could take notes of the issues and their resolution steps on a day-to-day basis, so that they can be applied to improve root cause identification, reduce MTTR and duplication of work, and further standardize the process. The same meeting can also be used to communicate upcoming release events related to the application.
Vendor/Developer Meet
The Support team can meet periodically with either the architect/developer (for internal applications) or the vendor, and discuss repeated issues that could be resolved by a design change, feature suggestion, etc. This can also help the Support team receive knowledge from the other side to manage the application better.
Reports & MIS Data
Assume there are no Incidents or Requests for an application for some time and everyone is at peace. Is that really a state of peace unless it is backed by supporting data? This is where MIS data comes into the picture. Application usage, internal resource usage, uptime reports, resolution time graphs, etc. should be captured and discussed periodically to ensure that the application is behaving normally and as expected, and that issues are resolved effectively within the SLA.
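Two of these figures, mean resolution time and the share of incidents resolved within the SLA, are simple to compute once the ticket timestamps are exported. The sketch below assumes each incident is available as an (opened, resolved) pair of datetimes and that a single SLA duration applies; both are simplifying assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolve over (opened, resolved) pairs, or None if empty."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def sla_compliance_pct(incidents, sla):
    """Percentage of incidents resolved within the sla duration, or None if empty."""
    if not incidents:
        return None
    within = sum(1 for opened, resolved in incidents if resolved - opened <= sla)
    return 100.0 * within / len(incidents)
```

Returning None for an empty period keeps "no incidents" distinguishable from "perfect compliance", which is exactly the quiet period that needs other supporting data such as uptime and usage reports.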
Conclusion

Handing over a new application to Support isn't just giving the team a one-day session and letting them fight the war day to day. I discussed sandboxes, the right set of credentials, the application eco-system, architect contacts, daily standups, continuing incidents, application release activities, monitoring setup, etc. When the application is handed over effectively, the team can fight the incidents and contribute to improving how the tool is managed, rather than fighting with access, troubleshooting and narrowing down issues. In fact, "automation" of Support-level tasks can yield a lot of advantages and make many of the issues discussed above disappear. I'll cover that in future posts.