Maintenance Matters: Monitoring

Nevin Morgan, Senior DevOps Engineer

Article Categories: #Code, #Front-end Engineering, #Back-end Engineering, #Performance, #Tooling

In the digital era, monitoring is not just a technical necessity; it's the backbone of operational excellence. This guide will serve as your reference as you build your monitoring practice and strengthen the foundation of your platforms.

This article is part of a series focusing on how developers can center and streamline software maintenance. The other articles in the Maintenance Matters series are: Continuous Integration, Code Coverage, Documentation, Default Formatting, Building Helpful Logs, Timely Upgrades, Code Reviews, and Good Tests.

Here at Viget, we build platforms that help to make the digital world a better place, and we believe that monitoring plays a critical part in that goal by ensuring that the technology we create operates as expected.

Monitoring plays a crucial role in maintenance, serving as the eyes and ears for teams to ensure optimal performance, security, and reliability of systems. It enables early detection of issues, allowing for proactive management before they escalate into serious problems that can lead to downtime, data loss, or security breaches. By keeping an eye on system performance, network traffic, and application behavior, monitoring provides valuable insights into the health of your platform. The data generated by monitoring tools allows teams to make informed decisions, optimize resources, and plan for future capacity needs. Overall, effective monitoring is fundamental in maintaining system integrity, enhancing user experience, and supporting business continuity.

In this article, we cover the basics of what monitoring entails, along with the tools and best practices that can strengthen your platform. We’ll explore various types of monitoring, including Uptime Monitoring, Log Aggregation, Server Monitoring, Application Performance Monitoring, and System Telemetry, and provide insights into selecting the right tools, setting clear objectives, and automating alerts for efficient system management. Whether you’re new to monitoring or looking to refine your existing strategies, this article aims to equip you with the knowledge and tools necessary to ensure your systems are secure, reliable, and performing at their best.

So what exactly is monitoring? In the context of technology, it’s the continuous observation and analysis of a system’s operational status, performance, and health. It’s all about collecting time-series data from various components of the platform such as the servers, applications, networks, and services so that you can then analyze the trends they reveal. Monitoring aims to identify and address potential issues before they impact users or lead to system failures.

Types of Monitoring

Monitoring can generally be broken down into several different types. These buckets help to categorize your efforts and identify areas that might be missing:

Uptime Monitoring

When monitoring uptime, you are ensuring that systems and services are available and operational by checking the status of servers, devices, or network connections. It helps in the early detection of outages or downtime. Your goal is to know as soon as something is wrong so that you can address it quickly. There is nothing more damaging to your users’ trust than to have them be the source of your uptime monitoring! Typically, users won’t reach out until long after an incident has started.
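To make the idea concrete, here is a minimal sketch of a single uptime probe using only the Python standard library. The `CheckResult` type, the `check_endpoint` function, and the injectable `opener` parameter are names made up for this illustration; a hosted uptime service does far more (multiple regions, retries, scheduling), so treat this as a sketch of the core check rather than a replacement for one.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]   # HTTP status code, if a response came back
    latency_s: float        # wall-clock time spent on the probe
    detail: str = ""


def check_endpoint(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Probe one URL and report whether it answered with a 2xx/3xx status."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
        up, detail = 200 <= status < 400, ""
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        status, up, detail = exc.code, False, str(exc)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        status, up, detail = None, False, str(exc)
    return CheckResult(url, up, status, time.monotonic() - started, detail)
```

Running a probe like this on a schedule, from somewhere outside the platform being checked, is what lets you hear about an outage before your users do. The `opener` argument exists only so the check can be exercised without a network.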

Log Aggregation

Log aggregation is the process of collecting and storing logs from various systems in a centralized location. It allows for improved data analysis, aids in troubleshooting, and can provide insights into system behavior and trends. When brought into a dashboard alongside other monitoring data, it can allow you to quickly locate the relevant logs for an incident — or even help to identify weaknesses and opportunities for improvement in your logging strategy.
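As a rough illustration of what "centralized" means here, the sketch below merges logs from several sources into one time-ordered timeline. It assumes each line starts with an ISO-8601 timestamp followed by a space and the message, and that each source is already sorted by time; the `LogEntry`, `parse_lines`, and `aggregate` names are hypothetical. Real aggregators also handle shipping, storage, and search, none of which is shown.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, Mapping, NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    source: str
    message: str


def parse_lines(source: str, lines: Iterable[str]) -> Iterator[LogEntry]:
    """Parse '<ISO-8601 timestamp> <message>' lines from one source."""
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        yield LogEntry(datetime.fromisoformat(stamp), source, message)


def aggregate(streams: Mapping[str, Iterable[str]]) -> Iterator[LogEntry]:
    """Merge per-source logs (each already time-ordered) into one timeline."""
    parsed = [parse_lines(name, lines) for name, lines in streams.items()]
    return heapq.merge(*parsed, key=lambda entry: entry.timestamp)
```

With the entries interleaved by time, a slow database query shows up right next to the web request that triggered it, which is the payoff of putting logs in one place.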

Server Monitoring

Server monitoring involves tracking and analyzing a server's system resources such as CPU Usage, Memory Consumption, I/O, Network, and Disk Usage to ensure they are functioning smoothly and efficiently. It helps in detecting server failures, outages, or any performance issues that could lead to a significant impact on the end users. The data collected can be used to create reports for an in-depth understanding, to make forecasts about future server health, and to take proactive measures to improve server performance.
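For a feel of what a server monitoring agent collects, here is a small standard-library sketch that reads disk usage and CPU load and compares readings against thresholds. The function names and the threshold values in the usage note are assumptions for illustration, and `os.getloadavg` is only available on Unix-like systems; in practice an agent or exporter would gather these for you.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """1-minute load average divided by CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def over_threshold(readings, thresholds):
    """Return the names of readings that exceed their threshold."""
    return sorted(
        name for name, value in readings.items()
        if value > thresholds.get(name, float("inf"))
    )
```

For example, `over_threshold({"disk": 91.0, "load": 0.4}, {"disk": 85.0, "load": 1.5})` flags only the disk reading. Recording these values over time, rather than checking them once, is what makes reports and forecasts possible.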

Application Performance Monitoring (APM)

This type of monitoring focuses on tracking the performance and availability of software applications. It helps identify bottlenecks, slowdowns, and other issues that could impact user experience. It is critical to have some form of APM to alert you when a code change has a negative side effect, or some critical code path starts taking too long to process. The data from this type of monitoring can be a lifesaver for application teams and help them to fix a bottleneck before it ever makes it to production, or, in the worst-case scenario, to revert a change quickly.
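Commercial APM tools instrument your code automatically, but the underlying idea can be sketched in a few lines: time a critical code path and flag it when it exceeds a budget. The `timed` decorator, the `durations` store, and `slow_paths` below are hypothetical names for this example, and the in-memory dictionary stands in for whatever backend a real tool would report to.

```python
import functools
import time
from collections import defaultdict

# Recorded durations in seconds, keyed by the label given to the decorator.
durations = defaultdict(list)


def timed(label):
    """Record how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                durations[label].append(time.perf_counter() - started)
        return wrapper
    return decorator


def slow_paths(budget_s):
    """Labels whose slowest recorded call exceeded the time budget."""
    return sorted(
        label for label, samples in durations.items()
        if samples and max(samples) > budget_s
    )
```

Decorating a function with `@timed("checkout")` and comparing `slow_paths` before and after a deploy is a crude version of the regression detection that APM gives you.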

System Telemetry

System telemetry monitoring is the process of recording different metrics related to the platform itself that fall outside of the scope of the other buckets. You are looking to collect information like request volume, transaction volume, firewall events, or other metrics that help you gauge your platform's health. Your goal here isn’t to track absolutely everything that happens, but to curate a set of useful heuristics that can alert you to issues within the broader system. For example, you might want to know if the request volume for a resource-constrained service spikes before it becomes overwhelmed.
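The request-volume example can be sketched as a sliding-window counter that reports when traffic exceeds a limit. The `RequestRateWatcher` class and its default values are assumptions for illustration, not numbers from any particular platform; the right window and limit depend on what your resource-constrained service can actually absorb.

```python
from collections import deque


class RequestRateWatcher:
    """Flag a spike when more than `max_requests` arrive within `window_s`."""

    def __init__(self, window_s=60.0, max_requests=1000):
        self.window_s = window_s
        self.max_requests = max_requests
        self._timestamps = deque()

    def record(self, now):
        """Record one request at time `now` (seconds); return True on a spike."""
        self._timestamps.append(now)
        # Drop requests that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests
```

Passing the time in explicitly keeps the sketch easy to reason about; a production version would more likely read a counter that your load balancer or framework already exposes.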

Honorable mention: Network Monitoring

If your platform runs on-premises (i.e., you physically manage your servers and networking), check out this guide from Grafana Labs on how to use SNMP, Prometheus, and Grafana to keep an eye on your network performance.

Monitoring Best Practices

There are some important things to keep in mind when setting up your monitoring solution. To help you build out an effective monitoring practice, here are some guidelines. Think of them like navigational stars — they will help you chart your course, but ultimately the destination is up to you.

Set Clear Objectives

To effectively monitor your systems, it’s important to clearly define what you need to monitor, and most importantly, why. You need to understand what the critical components of your infrastructure and application are that directly impact your operations and customer experience. This will let you build a picture of that all-important "why." For example, the "why" for monitoring your memory usage could be that your application is memory-bound, and as it approaches a certain threshold, your users start to have a worse experience.

The most important thing to keep in mind is to avoid making these measurements the metrics of success. If a measurement that you use to monitor your system becomes the target that your team starts aiming for, the platform can end up overly optimized for that one metric, at which point it no longer tells you anything useful about the health of the overall system.

Establish Baselines

To know when something is going wrong, you need to understand what normal looks like. To do that, observe your monitoring data for roughly a week to a month, and trends should begin to reveal themselves. Once you have your normal operating range, make sure to update it on a schedule that makes sense for your business. For example, if you have a seasonal business, your normal range might look different during the busy season than during the off season.
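One simple way to turn observed data into a "normal range" is a band of a few standard deviations around the mean. This is only one possible approach, assumed here for illustration; the `build_baseline` and `is_anomalous` names and the default tolerance of three standard deviations are not from any specific tool, and metrics with strong daily or weekly cycles usually need something more nuanced.

```python
import statistics


def build_baseline(samples, tolerance=3.0):
    """Return the (low, high) band of mean +/- tolerance standard deviations."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)  # needs at least two samples
    return mean - tolerance * spread, mean + tolerance * spread


def is_anomalous(value, baseline):
    """True when a reading falls outside the baseline band."""
    low, high = baseline
    return not (low <= value <= high)
```

For a seasonal business, you might keep one baseline per season and rebuild each from that season's data, so a busy-season reading is not judged against off-season numbers.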

Automate Alerts and Responses

When those metrics start exceeding the normal range, make sure to have automated alerts that notify the relevant people. Where possible, automate the responses to common issues (think things like restarting services or scaling resources, within ranges). If you have automated responses, make sure that you have limits to those actions and alerts in place to notify people when restarts, scaling, etc. get out of hand.
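The "limits on automated actions" idea can be sketched as a guard that restarts a service automatically, but stops and notifies a person once restarts pile up. `RestartGuard`, its parameters, and the `restart` and `notify` callables are hypothetical; in a real setup these would be wired to your process supervisor and paging tool.

```python
class RestartGuard:
    """Allow automatic restarts, but escalate to a human past a limit."""

    def __init__(self, restart, notify, max_restarts=3, window_s=600.0):
        self.restart = restart        # callable that restarts the service
        self.notify = notify          # callable that alerts a person
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._recent = []             # times of restarts inside the window

    def handle_failure(self, now):
        """Restart automatically, or escalate once the limit is reached."""
        self._recent = [t for t in self._recent if now - t < self.window_s]
        if len(self._recent) >= self.max_restarts:
            self.notify(
                f"{len(self._recent)} restarts in {self.window_s:.0f}s; "
                "automatic restarts paused"
            )
            return "escalated"
        self._recent.append(now)
        self.restart()
        return "restarted"
```

The same shape works for autoscaling: let automation act inside a range, and treat hitting the edge of that range as an alert in its own right.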

Without automation and alerts, you are unlikely to see the value of your monitoring system in the real world. Even with the most comprehensive suite of metrics, there is a very high probability your team will never know there was a problem if no one is alerted to issues. An important lesson here is that users will generally only reach out during the worst-case scenarios of a complete outage, but if your platform just performs poorly they will start looking for alternatives without ever letting you know.

Keep the Documentation Updated

Make sure that you have good documentation for your monitoring setup. The goal here is to have a document that can onboard new team members quickly. I try to write the documentation in such a way that if someone outside the team without domain-specific knowledge reads it, they can fully grasp the system.

As part of this documentation, it’s a good idea to include response procedures (or links to them), as well as the prioritization logic. Not all alerts are created equal, so try to prioritize issues based on their potential impact on your business operations and address the most critical ones first.

Continuous Improvement

Monitoring, like maintenance itself, is not a set-it-and-forget-it task. It’s critical to regularly review your overall monitoring strategy and tools to ensure they meet your business needs today. Learn from your past solutions, incidents, and user/team feedback to refine your approach. This constant improvement will help prevent future issues and catch problems before they can impact your users.

Embrace a Holistic View

To have monitoring serve the whole business and not just one discipline, the strategy has to encompass more than just technical metrics. To help align IT performance with business objectives, the monitoring strategy also needs to include business metrics that indicate the health of the platform from a user’s perspective. When tracked over time, data like the Core Web Vitals (Largest Contentful Paint, Cumulative Layout Shift, and Interaction to Next Paint, which replaced First Input Delay in March 2024) can help to pinpoint when a deployment has had a negative user impact.

Incorporate Log Management

Make sure to collect and regularly analyze logs from your systems. They can provide insights into security incidents, system errors, and performance issues that are not always evident from metrics alone. They become invaluable when you can correlate a log event with a metric alert, which can save you hours or days when tracking down the root cause of an issue within your system. The more your platform grows, the more important your log management strategy becomes.
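Correlating logs with an alert often starts with nothing more than a time window. The sketch below assumes log entries are available as `(timestamp, message)` pairs; the `logs_near_alert` name and the default window sizes are illustrative guesses, not recommendations.

```python
from datetime import datetime, timedelta


def logs_near_alert(alert_time, log_entries,
                    before=timedelta(minutes=5),
                    after=timedelta(minutes=1)):
    """Return the (timestamp, message) entries logged around an alert."""
    start, end = alert_time - before, alert_time + after
    return [(ts, msg) for ts, msg in log_entries if start <= ts <= end]
```

The window looks further back than forward because the cause of an alert usually precedes it. Most dashboarding tools offer this kind of time-linked view out of the box once logs and metrics live side by side.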

Pick Your Tools

Look for monitoring solutions that align with your objectives, your infrastructure, and, most importantly, with how you and your team think. I like to think of monitoring tools like the tools a blacksmith makes for their own shop: they fit the smith, the team, and the work they do perfectly. There is no one-size-fits-all solution here, so consider tools that offer comprehensive coverage, are scalable, and integrate well with your existing systems. Use open-source or commercial options depending on what fits your environment. Oftentimes the best monitoring solutions are some amalgamation of both.

Essential Tools

With all that said, we can hardly leave you without giving you some signposts of tools worth looking into (as of publishing). Keep in mind: this list is not remotely comprehensive, what works well in one context won’t work at all in another, and the monitoring landscape is constantly shifting. Just like a blacksmith crafting a bespoke tool for a challenging job, the solution you end up with will be unique to your platform and organization. You may notice that some of the commercial options on this list do more than one thing, which is great — just make sure that the tradeoffs and assumptions that the tool makes align with your team and organization.

Application Performance Monitoring (APM)

Uptime Monitoring

System Telemetry

Dashboarding

Where to Go From Here

No matter where you are in your monitoring journey, I would recommend reviewing your objectives and baselines and taking a look at the landscape of available tools. If you aren’t using OpenTelemetry exporters already, it might be time to standardize. Review your plans and documentation to ensure that your solution is meeting the needs of your team and organization. Finally, make sure that the solution you have or are building is generating an actionable signal. There is nothing worse than a monitoring system that people ignore because it generates too much noise.

If you are looking for more resources to improve your maintenance and help your team improve the platforms they care for, go check out our Maintenance Matters series for more insights into the approach Viget takes to the care and feeding of everything from MVPs, to startup platforms, to world-class applications.

Nevin Morgan

Nevin is a senior devops engineer working remotely from Ohio. He specializes in automating and codifying the infrastructure and platforms our teams and clients depend upon.
