By Girish Muckai, CRO, HEAL Software
Traditional monitoring setups were reactive rather than proactive. They focused on gathering metrics galore, put the onus of setting static thresholds on Operations personnel, and dealt with outages based on information gleaned from alert storms. As a result, NOCs (Network Operations Centers) were always in firefighting mode: analysts were ill-equipped to understand the root cause of an outage and were always a little too late to avoid downtime. These shortcomings are driving increased adoption of AIOps and Observability solutions. Here are some practical considerations to follow while moving your operations setup from traditional monitoring to Observability and AIOps.
Move away from Blackbox monitoring – In today’s complex hybrid environments, Blackbox monitoring does not work well and often fails to identify the root cause of an outage accurately.
One answer to this problem is Whitebox monitoring, that is, reporting data from inside a system. Techniques like instrumenting code, collecting system diagnostics via logs and scripts, and coding for greater testability and debuggability focus on making a system more “monitorable”, as the data reported by systems can result in more meaningful and actionable alerts.
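As a minimal sketch of what instrumenting code can look like, the Python decorator below records call counts, error counts and cumulative latency from inside the application. The names `METRICS`, `instrumented` and `checkout` are hypothetical and used only for illustration; a real setup would export these values to a monitoring backend rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-process metric store (an assumption for this sketch; a real system
# would ship these values to a monitoring backend).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count and cumulative latency under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total):
    # Hypothetical business function: adds a flat shipping fee.
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return cart_total + 5.0
```

Because the numbers come from inside the code path, an alert built on them can say which operation is failing or slowing down, not just that the system looks unhealthy from outside.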
Monitoring everything is an anti-pattern – Traditional monitoring tools do not always focus on “quality” metrics, but instead collect a vast quantity of metrics, with the result that dashboards become overloaded with data, are difficult to understand, and are even more difficult to decode to arrive at a root cause. Operations teams usually suffer from “metric fatigue”: a whole lot of metrics are collected, but the majority are never even looked at and do not contribute in any way to root cause analysis.
It is therefore imperative for operations teams to focus only on the parameters that, at a glance, give analysts an accurate fix on the overall health of a distributed system. This means recording and exposing high-level metrics over time across all components of the system. These metrics are called Golden Signals and typically do not exceed 8-10 in a system.
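To make the idea concrete, here is a small Python sketch that reduces a window of raw request records to a handful of high-level signals. It assumes the four signals commonly cited in SRE practice (latency, traffic, errors, saturation); the article does not enumerate its own set, and the `Request` type, the `golden_signals` function and the use of CPU utilization as the saturation measure are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one time window of requests into four high-level signals."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "saturation": cpu_utilization,  # assumed to be supplied by the host
    }
```

A dashboard built from a summary like this shows four numbers per component instead of hundreds of raw series, which is the point of limiting what is collected.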
Graduate from reactive alerting to proactive detection – In the past, monitoring always meant symptom-based alerting, which means that by the time the system raises an alert, an issue has already occurred and user experience has already degraded. This is a strict no-no today, when downtime has become more expensive than ever.
What companies now require is proactive detection of incidents to avoid downtime as far as possible. This can be done using preventive healing solutions, which go one step beyond AIOps and Observability platforms and enable an enterprise to move closer to complete automation.
Preventive healing is a must-have for any enterprise looking to cut costs and scale digitally. By flagging a potential issue before it even occurs and putting in place techniques to avert it, the enterprise can truly move towards zero-downtime.
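The difference between symptom-based alerting and proactive detection can be illustrated with a deliberately simple Python sketch: instead of waiting for a metric to cross a static threshold, it fits a straight line to recent samples and estimates how long remains before the threshold would be breached. Linear extrapolation is only a stand-in chosen for illustration, not the method used by any particular product, and the function name and one-minute sampling interval are assumptions.

```python
def minutes_until_breach(samples, threshold):
    """Estimate minutes until `threshold` is crossed.

    `samples` are metric values taken one minute apart, oldest first.
    Returns 0.0 if the latest sample already breaches the threshold,
    None if the fitted trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if samples[-1] >= threshold:
        return 0.0
    # Ordinary least-squares fit of value against sample index.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    breach_index = (threshold - intercept) / slope
    # Express the result relative to the most recent sample.
    return max(0.0, breach_index - (n - 1))
```

For example, `minutes_until_breach([60, 62, 64, 66, 68], 90)` estimates how much lead time remains while the metric is still well below the alerting threshold, which is the window in which a preventive action could be triggered.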
Focus on tool integrations – The heavy, proprietary agents that most APMs use for collecting data make it difficult to innovate and to adopt newer tools that are better suited to monitoring cloud and hybrid environments. Most organizations are unable to radically change their monitoring setups because of vendor lock-in and a mandate to protect existing monitoring investments, even if such setups have been rendered virtually obsolete by rapidly changing deployment scenarios.
Questions that every ITOM leader needs to ask before onboarding any new tool include:
o Does the product support multiple deployment options?
o Does the product ingest data from multiple heterogeneous sources, or support such ingestion via APIs?
o Is the product standalone or does it depend on other products to be installed for it to work?
Use metrics to reduce MTTR – The sheer volume of metrics makes it exhausting for IT Ops personnel to continuously monitor dashboards with a plethora of data. It also leads to higher costs, higher MTTR and a drain on revenue.
Monitoring data should always provide a summary of the overall health of a distributed system by recording and exposing high-level metrics across all components of the system. This helps teams significantly reduce MTTR on issues that cannot be predicted in advance. It also gives teams insight into the effect of any fix deployed.
Derive insights from data to scale intelligently – Capacity forecasting is essential for any enterprise to scale intelligently. However, most AIOps tools do capacity planning rather than forecasting. Capacity planning relies mostly on insights generated from historical data and on predicting future values from current system parameters through statistical models. These models do not account for a disproportionate change in workload trends, which might affect system requirements in ways that the predictive model is unable to capture.
The solution for this is to use tools that forecast capacity based on workload trends. Business owners can carry out a what-if analysis of projected workload growth trends, based on which two possibilities may arise:
o The user can clearly understand which system components would probably breach capacity in the coming days if transaction growth rates follow the projected trends.
o The user can also pinpoint over-provisioned servers: those servers which have more than adequate headroom even when transaction volumes are, for instance, doubled or tripled.
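A what-if analysis of this kind can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each server's utilization scales linearly with transaction volume; the function name, the 80% capacity limit, the 50% headroom cut-off and the server names in the usage note are all hypothetical.

```python
def what_if(current_utilization, growth_factor, limit=80.0, headroom=50.0):
    """Project utilization under a workload growth scenario.

    `current_utilization` maps server name to current utilization in percent.
    Returns (at_risk, over_provisioned) as two sorted lists of server names.
    Assumes utilization grows in proportion to workload (a simplification).
    """
    at_risk, over_provisioned = [], []
    for server, utilization in current_utilization.items():
        projected = utilization * growth_factor
        if projected >= limit:
            at_risk.append(server)           # likely to breach capacity
        elif projected <= headroom:
            over_provisioned.append(server)  # ample headroom even after growth
    return sorted(at_risk), sorted(over_provisioned)
```

Calling `what_if({"web-1": 45.0, "db-1": 30.0, "batch-1": 10.0}, growth_factor=2.0)` flags `web-1` as likely to breach capacity if transaction volume doubles and `batch-1` as over-provisioned, while `db-1` falls in between.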
In conclusion, traditional monitoring is passé: it is inadequate for monitoring today’s complex heterogeneous environments. The solution is to adopt AIOps/Observability and Preventive Healing techniques. Doing it right will improve operational efficiency and transform IT Operations teams from cost centers into profit centers.
– Girish Muckai is the CRO of HEAL Software, previously Appnomic. HEAL is a complete preventive healing AIOps solution for modern Indian enterprises that goes beyond traditional APM and uses patented AI and ML algorithms to help ITOps teams prevent incidents even before they occur.