Operations Manager is a great platform when it comes to end-to-end monitoring. But even if it came in a golden box as a diamond-engraved DVD, it would still be just a tool. Anyone can launch the setup, do a default installation and open the console with a great ‘tada’ effect.
But what comes next? Do I just open the online catalog and start importing management packs like there is no tomorrow? Do I fire critical alerts to my mailbox? Will I throw overrides everywhere whenever it pleases me?
Well, for a small company you might get away with this. But if you are monitoring an enterprise, or are a service provider actively managing customers, this way of working will end up in a monitoring wasteland where noisy alerts wreak havoc and important ones are killed before they have a chance to notify the correct person.
Figure 1: Mismanaged monitoring platforms bring forth bandit alerts!
Equally important to a correctly set up monitoring infrastructure is a process to manage it. There are lots of IT- and business-focused processes, which are usually part of a greater entity called a framework.
The Microsoft Operations Framework provides guidance on how to correctly manage multiple aspects of your IT environment. This encompasses the entire lifecycle: from designing and implementing it to finally managing it. The framework is divided into phases, with each phase containing so-called ‘Service Management Functions’. These SMFs flesh out a specific process within the phase they reside in.
And here it comes: there is an SMF specifically on managing service monitoring, called the ‘Service Monitoring & Control’ SMF!
This SMF offers a process which encompasses the entire monitoring lifecycle. For the techies out there (including myself): this shouldn’t be perceived as a manual for your platform! A framework deals with a much larger picture than technical execution: it influences how people interact with each other, which documents are created, what meetings are held and so on. In short, it determines how a business works.
Nonetheless, a monitoring platform is part of a business, and thus can be tied into this process. I like to see the relation between monitoring platform and framework as a technical implementation validated against a functional set of requirements.
There are 4 steps within the SMC-SMF:

- Define SMC Requirements
- Implement Service
- Operate Service
- Control and Reporting
While I won’t go into too much detail (I’m writing a blog-post, not a book) I will describe each step on both a high-level and a level relevant to a technical implementation of Operations Manager.
Define SMC Requirements
This step is all about identifying the critical business components. What needs to be monitored, why? How does the component work and relate to other components? Who are the stakeholders? What are their requirements?
Specific to Operations Manager, this step must be seen as the design phase of your Operations Manager implementation. The first thing I advise: modularize! Don’t try to get your entire environment monitored as one monolithic slab straight out of ‘A Space Odyssey’. I look at an environment in 2 directions: horizontal and vertical.
Vertically, an environment is layered: fabric (storage, hardware,…), virtualization (hypervisors), platform (OS), application (SQL, IIS) and service (badge system, company websites). The number of layers depends on how complex you want to make the model, of course, but it is always possible to identify them.
Once you know your layers you can fill them up horizontally with the specific technologies that are present. Your fabric layer may contain Dell Storage and HP servers, your platform layer Windows Server and so on.
When initially designing the requirements for your monitoring platform, start at the bottom, work to the right, then move up and repeat! More specifically, first define which technologies you will be monitoring on your lowest layer; when that’s done, take the layer above it.
For each technology you define, a standard checklist should be followed. This contains questions you should ask yourself, like:
- How does the service/technology work, how can it be monitored?
- What are the KPI’s and SLA’s on this technology/service?
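Such a checklist could be captured as a simple per-technology record; a minimal sketch, assuming you extend the fields with your own questions:

```python
from dataclasses import dataclass, field

@dataclass
class TechnologyChecklist:
    """One checklist entry per technology, filled in during the design step."""
    name: str
    layer: str
    how_it_works: str = ""        # how does the service/technology work?
    monitoring_method: str = ""   # how can it be monitored?
    kpis: list = field(default_factory=list)
    slas: list = field(default_factory=list)

    def is_complete(self) -> bool:
        """A design entry is done only when every question has an answer."""
        return all([self.how_it_works, self.monitoring_method,
                    self.kpis, self.slas])
```

Forcing every technology through the same record keeps the design phase consistent across layers.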
As you progress upwards through the layers, you’ll notice that the dependencies and relationships between technologies and services become more complex. Within the service layer, custom monitoring may well have to be built, as each service combines technologies from the other layers.
When custom monitoring is needed, and thus Management Pack authoring comes into play, the design requirements will increase.
When designing (and afterwards operating) a monitoring platform, the following documents might be good to create during this step:
- A general platform design
- One operational document per management pack (overrides, custom monitoring, notifications, views)
Implement Service
This step takes the outputs from the previous one and glues them into the existing monitoring structure. It is the implementation phase of the lifecycle.
For Operations Manager, this is where your designs are solidified into management packs and put into operation. During this step it is very important to communicate and document the changes that are made to the platform. Everyone involved in the monitoring lifecycle should be aware of the impact on their respective responsibilities. Will new alerts be generated? What new capabilities do we obtain? Are there any new SLA’s? The existing monitoring should be reviewed as well and any impact it endures from the new functionality should be documented.
The following documents might be good to create during this step:
- One operational document per management pack (overrides, custom monitoring, notifications, views)
- One high-level manual per custom management pack for the related stakeholders
Operate Service
This is the operational phase of the lifecycle. It focuses heavily on alert management.
In this phase, both the framework and Operations Manager share the same “common sense”: when an alert comes in, what has to be done with it?
Operators should know which alerts they can receive, what SLA applies to them and how they can escalate them. When an alert is deemed unoptimized, unnecessary or false, this should be recorded. Equally (or even more) important is when an incident doesn’t generate an alert. Such a case should be escalated immediately.
It is recommended to offer operators a medium to log monitoring inconsistencies, which can then be reviewed in the next step.
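As a rough sketch of the alert triage this step implies, assuming per-severity SLA windows and a simple in-memory inconsistency log (both are illustrative assumptions, not Operations Manager features):

```python
from datetime import datetime, timedelta

# Assumed SLA windows per severity; adjust to your own agreements.
SLA = {"critical": timedelta(minutes=15), "warning": timedelta(hours=4)}

def needs_escalation(severity: str, raised_at: datetime, now: datetime) -> bool:
    """An alert breaches its SLA when it stays unresolved past its window."""
    return now - raised_at > SLA.get(severity, timedelta(hours=24))

# The "medium to log monitoring inconsistencies" mentioned above,
# reduced to its simplest possible form.
inconsistency_log = []

def log_inconsistency(alert_name: str, reason: str) -> None:
    """Record unoptimized, unnecessary or false alerts for later review."""
    inconsistency_log.append({"alert": alert_name, "reason": reason})
```

The point is not the code itself, but that every alert has a known SLA and every inconsistency leaves a trace for the Control and Reporting step.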
Control and Reporting
This phase compares the initial design documents with the actual implementation and validates the results against the KPIs.
This is the moment where room for improvement is identified and dealt with.
Using reporting, for example on logged false alerts or monitoring server load, one can identify which areas of the monitoring platform need redesign and fine-tuning. By conducting an operational health review, multiple perspectives offer a balanced insight into the overall health of your monitoring platform.
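For instance, a "logged false alerts" report could simply count inconsistencies per management pack to flag candidates for redesign; a sketch, where the record shape and pack names are illustrative assumptions:

```python
from collections import Counter

# Assumed shape of the operator-logged inconsistencies: one record per entry.
logged = [
    {"pack": "SQL Server MP", "reason": "false alert"},
    {"pack": "SQL Server MP", "reason": "false alert"},
    {"pack": "IIS MP", "reason": "noisy threshold"},
]

def packs_needing_review(records, threshold=2):
    """Return management packs with at least `threshold` logged issues."""
    counts = Counter(r["pack"] for r in records)
    return [pack for pack, n in counts.items() if n >= threshold]

print(packs_needing_review(logged))  # prints ['SQL Server MP']
```

A threshold like this keeps the operational health review focused on the packs that actually generate noise.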
All approved improvements are then documented (in the original designs) and executed.
And that is the entire monitoring lifecycle. As you can see, although Operations Manager and the framework operate on very different levels, they are compatible with each other. By applying a process, your investment in Operations Manager becomes cost-efficient and long-lasting!