Application Development Completed, Production Release Done,
Party … Now What?
Here comes the bigger phase of any solution, tool, or product:
the Support phase. Once the production wheel starts spinning for any
application (internal, external, or COTS), the DevOps or Support workforce is
the first to feel the impact of any incident or outage. This is particularly
true if the application stack is fairly new to the Support team and
intertwined with many other systems inside the infrastructure.
Is the Support team always equipped with the right set of skills,
details, credentials, and tools to tackle the newly inducted application? By
Support team I mean the team that has no further support team to raise
tickets against. Typically that category of team falls in Tier 3 or Tier 4,
beyond whom only the developers or the product vendor (who usually own the
source code) can help.
Here we look at how a T3/T4 Operations team can be inducted to support
a new application, with the right set of details to better run the show.
Application Eco-system
In today's service-layered assembly of solutions and tools, it's
hard to find any application that serves independently. The team has to be
given this eco-system knowledge of the application. It could cover:
· What is the tool about?
· Where does the application fit in the company's overall offering?
· What are the different tiers of the application? DB, message
queue, storage, etc.
· Is the application clustered or redundant? This indicates what
level of impact restarting or bringing down the application would have.
· Is the application spread across multiple data centers but still
served as one URL?
· How is the application accessed by end users? E.g. thick
client, web browser, headless, from within other code, etc.
· Who is the contact person in the architect team to refer to for
any clarification?
All these details about the application should be prepared as a
knowledge base page, with diagrams detailing the stack.
Sandbox for Practice
Theory alone doesn't always complete the knowledge. A sandbox or
practice system can help the team in many ways: practicing installation and
upgrades, reproducing production issues, etc. When providing such a system we
also need to consider the additional license cost, learning materials, and
training required if it was purchased from an external vendor. Sandboxing may
also not be possible for every application or solution, for example if it is
a major one or involves cost-incurring inter-dependencies in the
infrastructure (e.g. an additional storage unit).
Monitoring Setup
Once a product is rolled out, it will be monitored in one way or
another: port, URL, disk usage, counters, queue size, etc. These details have
to be shared with the Support team to give them a clear picture of what is
being monitored and why each KPI is important to the functioning of the whole
system. It's also good to state which monitoring system is being used, such as
Netcool or Nagios. Sometimes we may get two conflicting alert indications. In
such cases it's better to check the monitoring system itself to see whether it
has a correct view of the application or whether it's just a network glitch.
This cross-verification helps avoid false alerts and escalations, as well as
ignoring a vital alert.
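
To make the cross-verification idea concrete, here is a minimal Python sketch. It assumes the simplest possible direct check (a TCP connect to the application port) and compares it with what the monitoring system reports. The function names and the wording of the verdicts are my own illustration, not part of Netcool, Nagios, or any other monitoring product.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def cross_check(monitor_reports_down: bool, direct_check_ok: bool) -> str:
    """Combine the monitoring alert with a direct check (hypothetical verdicts)."""
    if monitor_reports_down and direct_check_ok:
        return "suspect monitoring or network path; verify before escalating"
    if monitor_reports_down and not direct_check_ok:
        return "confirmed; escalate per alert category"
    if not monitor_reports_down and not direct_check_ok:
        return "missed by monitoring; raise incident and review the check"
    return "healthy"
```

A Support engineer would call `port_is_open` from a machine on a different network path than the monitoring server, then feed both observations into `cross_check` before deciding whether to escalate.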
Administration Activities
When T3/T4 teams become the top Support team for everyone to
fall back on, it's no wonder they may also be involved in the installation,
upgrade, recovery, restoration, patching, and decommissioning of the
application. The team has to be trained well on all these activities. Most
vendors nowadays have a common portal for product downloads, license renewal,
patch downloads, issue tracking, knowledge base, etc. These portal credentials
need to be shared with the right people to enable them to manage the product.
If the application is in-house, then an alternate, standard arrangement for
all the above items should be made. A periodic meeting with the vendor,
developer, or architect would also be fruitful during the initial period of
support, while the team becomes well versed in managing the application.
Release Activities
Usually a T3/T4 Support team is the one doing the periodic
release activities for the application. This involves patching, upgrading,
feature additions, deployments, bringing up new instances, etc. The team has
to be trained well on the change approval process, the release process, and
the seriousness of violating them. The release calendar for the company or the
application should be posted on a common intranet page, where it is provided
in advance and always available for reference. The team should also be advised
on the urgent change process, which may be required for outages impacting
revenue. This may be completely different from the regular approval and change
process.
Support
Here comes the actual day-to-day work. Incidents and Requests
always top the list of activities in any application management. Though
Requests look less serious than Incidents, they have to be handled carefully,
with all approvals in place.
Requests
Before any user addition, group addition, or grant of additional
permissions, it's not enough just to get the approvals; we also need to see
whether it's a short-lived or long-lived request and whether it duplicates an
existing arrangement. For example, a team may have common credentials for a
certain activity, but a member of that team may request a separate privilege
for the same activity, which duplicates effort. If these kinds of items go
unchecked, the application's authorization system will later have hundreds of
users and groups, most of them unused and in a confused state, leading to
non-compliance.
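
One way to keep unused accounts from piling up is a periodic audit. The Python sketch below assumes the authorization system can export a mapping of user name to last login date; the 90-day idle limit and the function name are hypothetical choices for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def find_stale_accounts(
    last_login: Dict[str, Optional[date]],
    today: date,
    max_idle_days: int = 90,  # assumed policy value, adjust to your own
) -> List[str]:
    """Return accounts that never logged in or have been idle too long."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        user
        for user, seen in last_login.items()
        if seen is None or seen < cutoff
    )


if __name__ == "__main__":
    export = {
        "alice": date(2024, 5, 30),
        "bob": date(2023, 11, 1),
        "svc_reports": None,  # account created but never used
    }
    print(find_stale_accounts(export, today=date(2024, 6, 1)))
```

The resulting list is a starting point for a review with the request approvers, not an automatic deletion list.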
Incidents
Incidents are the complex activity that the Support team may
always struggle with. But with the right set of details and tools, this work
can almost be standardized. Most of the time incidents arrive as alerts from
monitoring systems, from T1/T2 teams, from the failure of another dependent
system, from a failure inside the application stack, from a network issue,
etc. The team should be given clarity on all these paths and on how to handle
each in a standard way. It is very common for the team to immediately log in
to servers to debug the issue, irrespective of which path the issue came from.
Credentials
Having the right set of credentials for the infrastructure makes a
difference in MTTR. Credentials are not just applicable to Production, but
also to TEST, QA, STAGE, and DEV instances. When the application consists of a
DB, an MQ, and protected web services in the back end, the number of
credentials the team needs to manage multiplies quickly (roughly the number of
environments times the number of components). On top of this, to be compliant
the company may need to change all of these passwords periodically. This poses
a new problem of keeping the passwords in sync. Usually individuals use an
Excel sheet, KeePass, or similar tools to store these passwords. The best way
is to host a password portal with one set of credentials and have everyone
refer to it when required.
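
To show how quickly the list grows, and how rotation could be tracked, here is a small Python sketch. The environment and component names come from the paragraph above; the 90-day rotation period, the function names, and the "ENV/COMPONENT" naming scheme are assumptions for illustration only.

```python
from datetime import date, timedelta
from itertools import product
from typing import Dict, List

ENVIRONMENTS = ["PROD", "STAGE", "QA", "TEST", "DEV"]
COMPONENTS = ["DB", "MQ", "WEB_SERVICE"]


def credential_inventory() -> List[str]:
    """One credential per environment/component pair (5 x 3 = 15 here)."""
    return [f"{env}/{comp}" for env, comp in product(ENVIRONMENTS, COMPONENTS)]


def due_for_rotation(
    last_changed: Dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed compliance period
) -> List[str]:
    """Return credentials whose password is at least max_age_days old."""
    limit = timedelta(days=max_age_days)
    return sorted(
        name for name, changed in last_changed.items() if today - changed >= limit
    )
```

Only the names and change dates belong in a report like this. The passwords themselves stay in the password portal.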
Alert Category
Alert categorization is another detail that should not be left
to individuals. Though it's generally agreed that P1 marks business or revenue
impact, P2 to P4 are usually the confusing ones. Correct categorization is
essential for other teams to understand the issue and work on it according to
the SLA.
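
Writing the rules down as code (or as a table) removes the guesswork. In the Python sketch below, only the P1 rule comes from the discussion above; the P2 to P4 rules are hypothetical examples that each team would replace with its own agreed definitions.

```python
def categorize(
    revenue_impact: bool,
    service_degraded: bool,
    workaround_available: bool,
) -> str:
    """Map three yes/no questions to a priority (P2-P4 rules are assumed)."""
    if revenue_impact:
        return "P1"  # business or revenue impact
    if service_degraded and not workaround_available:
        return "P2"  # assumed: degraded and users are blocked
    if service_degraded:
        return "P3"  # assumed: degraded but users can work around it
    return "P4"      # assumed: cosmetic or informational
```

Because the inputs are plain yes/no answers, T1 and T2 teams can apply the same rules when they raise the ticket, so everyone reads the priority the same way.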
Knowledge Base
This is nothing new for a Support team. Every team probably has
some knowledge base source, whether on the intranet, in an Excel sheet, a
document, a database, a portal, etc. Someone has to be held responsible for
creating, updating, and purging these records. Usually this kind of system has
more consumers than creators; everybody thinks that others are creating the
content. There has to be a standard way of creating and updating records,
otherwise the knowledge base becomes big, ugly, and unreliable.
Troubleshooting tool set
The toolset can range from a simple shell interface on the server
to an advanced diagnostic portal. Depending on their expertise, individuals
tend to use their own set of tools to troubleshoot application issues. This
may be a command shell, grep, wget, curl, SoapUI, browser built-ins, telnet,
netstat, etc. The first step is to make this toolset uniform and available as
one package to be deployed on all machines involved in troubleshooting. Next,
a standard set of steps should be defined to narrow down the issue using the
above toolset. The steps should always lead close to a particular root cause.
This removes inconsistencies in troubleshooting between individual team
members.
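
The standard steps can be expressed as an ordered runbook where the first failing check points to the area to investigate. This Python sketch is only an illustration: the check names are made up, and in practice each check would wrap one of the tools above (a port test, a curl-style URL fetch, a log grep, and so on).

```python
from typing import Callable, List, Optional, Tuple

# A check is a label plus a function that returns True when that layer is healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return None  # every layer looked healthy


if __name__ == "__main__":
    # Hypothetical runbook, ordered from the outside in.
    runbook: List[Check] = [
        ("network reachable", lambda: True),
        ("application port open", lambda: True),
        ("database responds", lambda: False),
        ("message queue draining", lambda: True),
    ]
    print(first_failure(runbook))
```

Since every team member runs the same checks in the same order, two people looking at the same incident should arrive at the same starting point.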
User Group to keep in loop
Sometimes it's not enough for someone to be assigned the Incident
ticket and start working on it. Some issues require the appropriate
stakeholders to be updated on time. These could be internal management staff,
customers, vendors, etc. At the same time, the team shouldn't ring the wrong
bell. This knowledge about escalation and the keep-informed culture has to be
standardized across team members, preferably documented and available in a
common area such as the intranet.
Hand-offs and Continuing Incidents
For a distributed Support team working across geographies,
handing off issues is an additional task. Likewise, handing over knowledge
about continuing issues (most of the time P1) should be taken care of well.
This is essential for the other team to continue troubleshooting effectively.
Often this hand-off happens through reading a mail chain or chatting
one-to-one between the handing and receiving team members. If that is the
case, anyone else on the receiving team who wants to identify the root cause
or contribute to the troubleshooting is left lacking information. This
procedure can be standardized to an extent through a common portal form, or by
filling in a structured document and sending it to the entire distribution
list.
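
As a rough idea of what such a structured document could contain, here is a Python sketch of a hand-off note that renders to plain text for the distribution list. The field names and the ticket number in the example are hypothetical; a real form would follow whatever the ticketing system already records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffNote:
    ticket_id: str
    priority: str
    summary: str
    steps_tried: List[str] = field(default_factory=list)
    next_actions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the note as plain text suitable for an email body."""
        lines = [
            f"Ticket: {self.ticket_id} ({self.priority})",
            f"Summary: {self.summary}",
            "Steps tried:",
        ]
        lines += [f"  - {s}" for s in self.steps_tried] or ["  - none recorded"]
        lines.append("Next actions:")
        lines += [f"  - {s}" for s in self.next_actions] or ["  - none recorded"]
        return "\n".join(lines)


if __name__ == "__main__":
    note = HandoffNote(
        ticket_id="INC-0001",  # made-up ticket number
        priority="P1",
        summary="Order queue not draining since the morning release",
        steps_tried=["Restarted consumer service", "Checked DB connections"],
        next_actions=["Compare queue config with STAGE"],
    )
    print(note.render())
```

Because every note has the same sections, anyone on the receiving team can pick up the incident without reading the whole mail chain.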
Daily Standups
Standups don't belong only to Scrum and Agile projects. The
Support team can conduct a daily standup, probably at the end of their day, to
discuss and share the issues that occurred over the day and the ways in which
they were resolved. This brings private knowledge to the table and aids in
sharing it. It should be just an open discussion; a lead person could take
note of the issues and their resolution steps on a day-to-day basis, so that
they can be applied to improving root cause identification, reducing MTTR,
avoiding duplication of work, and further standardizing the process. The same
meeting can also be used for communicating upcoming release events related to
the application.
Vendor/Developer Meet
The Support team can meet periodically with either the architect
or developer (for internal applications) or the vendor and discuss repeated
issues that could be resolved by a design change, a feature suggestion, etc.
This can also help the Support team receive knowledge from the other side to
better manage the application.
Reports & MIS Data
Assume there have been no Incidents or Requests for an
application for some time and everyone is at peace. Is that really a state of
peace unless it is backed by supporting data? This is where MIS data comes
into the picture. The application usage and its internal resource usage,
uptime reports, resolution time graphs, etc. should be captured and discussed
periodically to ensure that the application is behaving normally and as
expected, and that issues are resolved effectively within SLA.
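
Two of these numbers, mean time to resolve and SLA compliance, are easy to compute once incident open and resolve times are exported. The Python sketch below assumes each incident is a pair of (opened, resolved) timestamps; the function names and the sample data are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (opened, resolved)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolve across the given incidents."""
    if not incidents:
        raise ValueError("no incidents in the reporting period")
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def sla_compliance(incidents: List[Incident], sla: timedelta) -> float:
    """Fraction of incidents resolved within the SLA window."""
    if not incidents:
        raise ValueError("no incidents in the reporting period")
    met = sum(1 for opened, resolved in incidents if resolved - opened <= sla)
    return met / len(incidents)


if __name__ == "__main__":
    sample = [
        (datetime(2024, 6, 1, 9, 0), datetime(2024, 6, 1, 10, 0)),   # 1 hour
        (datetime(2024, 6, 2, 14, 0), datetime(2024, 6, 2, 17, 0)),  # 3 hours
    ]
    print(mttr(sample))
    print(sla_compliance(sample, sla=timedelta(hours=2)))
```

An empty period raises an error on purpose: "no incidents" should be reported as such, alongside uptime and usage data, rather than as an MTTR of zero.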
Conclusion
Handing over a new application to Support isn't just a matter of
giving the team a one-day session and letting them fight the war day to day. I
discussed sandboxes, the right set of credentials, the application eco-system,
architect contacts, daily standups, continuing incidents, application release
activities, monitoring setup, etc. When the application is handed over
effectively, the team can fight the incidents and contribute to improving the
management of the tool, rather than fighting with access, troubleshooting, and
narrowing down issues. In fact, "automation" of Support-level tasks can yield
a lot of advantages and make the issues discussed above disappear. I'll cover
it in future posts.












