While overseeing Network Operations Centers (NOCs), I often noticed situations where systems were activated without adequate monitoring or input from the engineering teams responsible for building them. These engineers were skilled but often lacked the time and creativity to anticipate potential system failures and implement effective monitoring. As a result, the systems would fail, typically shortly after implementation, leaving management puzzled about why the failure wasn't detected. In response, engineers would quickly develop dashboards to satisfy the NOC and management.
Nevertheless, the strategy of continually watching these dashboards presented challenges, even though some were exceptionally well designed. Each new dashboard joined many others, all aimed at monitoring the systems under the NOC's purview. It was not feasible for NOC personnel to stare at multiple dashboards throughout their shifts.
Many of these dashboards were pattern-based: they displayed general trends that might or might not indicate an existing or imminent issue. I termed these "analyst dashboards," intended for intermittent use by the engineers who built the systems. They were designed to assess overall system health and identify potential problem signals. It was also imperative that the engineers who designed and used these dashboards clearly explain what the signals and patterns conveyed about the system's health.
Given these challenges, despite having dashboards in place, NOCs frequently overlooked system failures. This prompted me to consistently emphasize the distinction between "alert monitors" and "analyst monitors" when communicating with engineers. An alert monitor signifies an actual or imminent failure that demands immediate action. A succinct explanation for an alert monitor is, "If this alert triggers, someone needs to be notified, and an immediate response is essential." NOC engineers could be trained to shift from alert monitors to relevant analyst monitors (dashboards) to diagnose the issue. However, an analyst monitor displaying a general pattern would not convey the urgency of "NOC, take immediate action!"
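The distinction between the two kinds of monitor can be made concrete with a small sketch. The Python below is illustrative only: the names `MonitorKind`, `Monitor`, and `AlertRule`, the example monitor names, and the threshold value are all hypothetical and not taken from any particular monitoring tool. It encodes two rules from the paragraph above: only an alert monitor may notify the NOC, and an alert should point to the analyst dashboards used to diagnose it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class MonitorKind(Enum):
    ALERT = "alert"      # actual or imminent failure; someone must act now
    ANALYST = "analyst"  # pattern view for intermittent review by engineers


@dataclass
class Monitor:
    name: str
    kind: MonitorKind
    # Hypothetical field: analyst dashboards the NOC pivots to after an alert.
    related_dashboards: List[str] = field(default_factory=list)


@dataclass
class AlertRule:
    monitor: Monitor
    threshold: float  # hypothetical limit; real limits come from the system's engineers

    def __post_init__(self) -> None:
        # An analyst monitor shows a general pattern and must never page anyone.
        if self.monitor.kind is not MonitorKind.ALERT:
            raise ValueError("only alert monitors may notify the NOC")

    def evaluate(self, value: float) -> Optional[str]:
        """Return a notification message if the threshold is breached, else None."""
        if value <= self.threshold:
            return None
        dashboards = ", ".join(self.monitor.related_dashboards) or "none listed"
        return (
            f"NOC ACTION REQUIRED: {self.monitor.name} = {value} "
            f"(threshold {self.threshold}); diagnose with: {dashboards}"
        )


if __name__ == "__main__":
    error_rate = Monitor(
        name="checkout_error_rate",
        kind=MonitorKind.ALERT,
        related_dashboards=["checkout-latency-trends"],
    )
    rule = AlertRule(monitor=error_rate, threshold=0.05)
    print(rule.evaluate(0.01))  # below threshold, nothing to send
    print(rule.evaluate(0.20))  # above threshold, message names the dashboard
```

Rejecting analyst monitors at construction time is a design choice in this sketch: it forces the question "does this demand immediate action?" to be answered when the monitor is defined, rather than during an incident.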
To tackle these challenges, I created a table called the "Golden Signals Monitoring Matrix" while working with SRE (Site Reliability Engineering) teams. Together, we used it to document the monitoring strategy for the specific system or application in question.