At Rhythmic, our approach to monitoring isn’t just about keeping an eye on a dashboard or waiting for alerts in Slack. It’s about feeling the rhythm of a system so acutely that even a slight deviation is like nails on a chalkboard.
In this post, we’ll share insights on how we elevate monitoring practices, from traditional techniques to advanced metric-based monitoring, and how you can apply these principles to set up robust monitoring for your own environment. We focus primarily on infrastructure, network, and workload monitoring via metrics, logs, and probes. Other topics such as security, cost, compliance, and user experience are also important, and we will cover them in future posts.
Monitoring Approaches
There are five primary approaches for monitoring in most modern observability platforms:
-
Metric-based: Dimensional time-series data is pulled from cloud providers, application stacks, and other sources. This allows for query-based monitoring that can explore volume, frequency, and occurrence by each dimension or combinations thereof. When coupled with a good tagging strategy (as we’ll discuss shortly), metric-based monitoring can automatically cover new infrastructure resources whether that’s a new instance in an autoscale group or a new Lambda that implements a new function following existing patterns.
-
Event-based: Events, such as a change in configuration, a notification that a storage volume has become corrupt, or an outage reported by a provider can all indicate a potentially production-impacting incident. Events can trigger alerts based on criteria such as matching, frequency, and correlation to other events.
-
Logs: System and service logs produce copious amounts of information that is typically only consulted in response to an outage. Modern platforms allow logs to be correlated in real time. Alerts can be generated based on matching, frequency, and correlation to other logs.
-
Synthetics: Synthetic monitors replicate the behavior of a user or downstream caller. Sophisticated synthetic monitors can record browser sessions and turn them into monitors that fire if a step fails or if performance is not acceptable.
-
Probe-based: Probe monitors interrogate a resource for runtime configuration and status. For example, a monitor that confirms a process is running on a server, or that a port is listening on a server, is a probe. This approach is still valuable and is the basis of traditional monitoring systems.
A good monitoring strategy blends all five together and takes advantage of additional platform capabilities such as anomaly detection, watchdogs, dynamic threshold setting, and other automated methods.
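To make the probe-based approach concrete, here is a minimal sketch of a port probe using only the Python standard library. The function name `port_is_listening` and the default timeout are assumptions for illustration; a real monitoring agent would add scheduling, retries, and reporting.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: the probe only checks that something accepts
    connections, not that the service behind the port is healthy.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

A call such as `port_is_listening("127.0.0.1", 443)` answers a single yes/no question about one resource, which is why probes work best alongside metric-based monitors rather than in place of them.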
Tagging
Tags are essential for effective monitoring. Tags allow you to add dimensions to your telemetry and event data, enabling you to filter, aggregate, and compare metrics in more meaningful ways. For example, by tagging components of your AWS infrastructure, you can easily monitor aggregate performance across multiple hosts or isolate specific services for detailed analysis.
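As a rough illustration of how tags act as dimensions, the sketch below filters and aggregates a handful of in-memory samples by tag. The sample data, the metric name `cpu.percent`, and the function `average_by_tag` are all hypothetical; a monitoring platform does this with its own query language rather than Python.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric samples, each with a value and key:value tags.
samples = [
    {"metric": "cpu.percent", "value": 40.0,
     "tags": {"env": "production", "role": "web-server", "host": "web-1"}},
    {"metric": "cpu.percent", "value": 60.0,
     "tags": {"env": "production", "role": "web-server", "host": "web-2"}},
    {"metric": "cpu.percent", "value": 30.0,
     "tags": {"env": "production", "role": "db", "host": "db-1"}},
    {"metric": "cpu.percent", "value": 90.0,
     "tags": {"env": "staging", "role": "web-server", "host": "web-3"}},
]


def average_by_tag(samples, metric, group_by, **filters):
    """Average a metric per value of the group_by tag, keeping only
    samples whose tags match every key=value pair in filters."""
    groups = defaultdict(list)
    for sample in samples:
        if sample["metric"] != metric:
            continue
        tags = sample["tags"]
        if any(tags.get(key) != value for key, value in filters.items()):
            continue
        if group_by in tags:
            groups[tags[group_by]].append(sample["value"])
    return {tag_value: mean(values) for tag_value, values in groups.items()}


result = average_by_tag(samples, "cpu.percent", "role", env="production")
print(result)  # {'web-server': 50.0, 'db': 30.0}
```

Because the query refers to `role` and `env` rather than to individual hosts, a new production web server that carries the same tags would be included in the average without any change to the query.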
Not following a good tagging strategy for monitoring, especially in dynamic environments like those created by AWS and Kubernetes, can lead to several challenges and inefficiencies:
-
Difficulty in Identifying Resources: Without proper tags, it becomes challenging to quickly identify specific resources even in modestly sized infrastructure. This can lead to confusion and delays when trying to pinpoint which instances, containers, or services are involved in an issue.
-
Ineffective Monitoring and Troubleshooting: Without a coherent tagging strategy, it’s difficult to isolate issues, understand dependencies, and correlate metrics, logs, and traces. This can significantly slow down troubleshooting and resolution times.
-
Challenges in Scaling Operations: As infrastructure grows, a weak tagging approach creates compounding difficulties in managing and monitoring resources efficiently. What might be manageable at a smaller scale can quickly become unmanageable as the number of resources increases.
-
Impaired Alerting and Notifications: Proper tags allow you to create dynamic, context-rich alerts that automatically cover new resources. Without them, you may end up with coverage gaps or with generic alerts that lack specificity.
-
Limited Operational Insights: Tags enable detailed analysis and insights into the performance and utilization of different components of your infrastructure. Without them, gaining actionable insights into how to improve or optimize your systems can be significantly hindered.
Without tagging, metric-based monitoring is no more effective than traditional threshold monitoring and can, in fact, create blind spots. You have to invest the time to get tagging right to monitor effectively. Adopt a tagging strategy that incorporates these best practices:
-
Leverage Platform-Specific Tags: Utilize the “default” tagging of your infrastructure resources (e.g., AWS, Kubernetes) as it provides essential metadata and context automatically.
-
Use Key:Value Pairs: Adopt a structured format for tags, preferably key:value pairs (e.g., “role:web-server”, “env:production”). Avoid key-only tags whenever possible.
-
Standardize Tags Across Resources: Ensure consistency in your tagging scheme across different resources and environments. This helps in effectively grouping and filtering resources for monitoring and analysis.
-
Tag with Purpose: Tag resources with their role, environment, location, and other operational dimensions that are relevant to your monitoring and management needs. This enables quick isolation and analysis of specific components within your infrastructure.
-
Automate Tagging Where Possible: Leverage automation tools and scripts to apply tags, especially in dynamic environments where resources are frequently created and destroyed. This ensures that tagging keeps pace with changes in your infrastructure.
-
Use Tags for Alerting and Notifications: Design your alerting policies to use tags, enabling you to dynamically target alerts to relevant teams or services based on the tagged context. Avoid building alerts tied to specific resources whenever feasible.
-
Review and Clean Up Tags Regularly: As your infrastructure evolves, review your tagging strategy and tags applied to resources regularly to ensure they remain relevant and useful for your operational needs.
-
Educate and Enforce Tagging Policies: Ensure all team members understand the importance of tagging and adhere to the established tagging policies to maintain consistency and effectiveness in your monitoring strategy. Use platform tools such as AWS Config to enforce tagging.
-
Utilize Tags for Dependency Mapping: Use tags to map dependencies and relationships between different components of your application and infrastructure. This is invaluable for root cause analysis and understanding service interactions.
With such a strategy, you will be able to enhance monitoring, management, and operational efficiency. Coverage will improve and anomaly detection and other predictive processes will work more effectively.
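Several of the practices above (key:value pairs, standardized keys, enforcement) can be checked mechanically. The following is a minimal sketch of such a check. The required keys `env`, `role`, and `team` and the function `tag_problems` are assumptions chosen for illustration, not a prescribed policy; in AWS, a managed tool such as AWS Config would typically perform the enforcement itself.

```python
from typing import Iterable, List

# Hypothetical policy: every resource must carry these tag keys.
REQUIRED_KEYS = frozenset({"env", "role", "team"})


def tag_problems(tags: Iterable[str],
                 required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return a list of human-readable policy violations for one resource.

    Tags are expected as "key:value" strings; key-only tags are flagged.
    """
    problems = []
    seen_keys = set()
    for tag in tags:
        key, separator, value = tag.partition(":")
        if not separator or not key or not value:
            problems.append(f"not a key:value pair: {tag!r}")
            continue
        seen_keys.add(key)
    # Report missing keys in a stable, sorted order.
    for key in sorted(set(required) - seen_keys):
        problems.append(f"missing required tag: {key}")
    return problems


print(tag_problems(["env:production", "role:web-server", "team:platform"]))
# []
print(tag_problems(["env:production", "web"]))
# ["not a key:value pair: 'web'", 'missing required tag: role', 'missing required tag: team']
```

A check like this could run in a deployment pipeline or a periodic audit job, so that untagged resources are caught before they become monitoring blind spots.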
Writing Effective Queries
Learning to write effective queries is crucial.
-
Be mindful of platform capabilities: Write queries with the strengths and costs of your monitoring platform in mind. For example, at Rhythmic, we use Datadog most often. Datadog can be expensive to query on high-cardinality data, so it’s important to create queries that do not filter on tags like task or instance ID.
-
Learn the syntax: It is important to learn the syntax of your monitoring platform. Effective queries combine the correct metric(s), tags for filtering, and functions to normalize values into relevant information.
-
Learn the metrics: Cloud services and monitoring agents bring in tons of metrics automatically, including new metrics over time as services are enhanced. Spend time exploring the available metrics. Don’t be afraid to build dashboards with an obnoxious number of metrics when you’re working with a new service until you learn the few metrics that matter for your workload.
-
Layer your queries: Use roll-up functions to create multiple “tiers” of alerts for a given metric. For example, you may want only a notification if a service has a sharp but short spike in errors, but want to get an engineer out of bed if that same service has a smaller but more sustained increase in error rates.
-
Use functions: Use aggregation, vector, and range functions to write more powerful queries. For example, vector functions can look at error rates over intervals of time, dramatically simplifying syntax over other methods. On most modern monitoring platforms, a range of functions can be predictive and consider trends. You could use this to create a disk alarm that fires if the disk is likely to run out of space before daylight hours.
-
Link monitors to dashboards: If a metric is worth alerting on, it is worth graphing. Even though you can pull up a metric graph, it is one more step in investigating an alert. Instead, create dashboards that you reference in your notifications and include the metric in question along with other metrics relevant to that service. This will allow incident responders to quickly assess the situation.
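The layered-query idea in the list above can be sketched in plain Python. This is an illustration of the logic, not Datadog or Prometheus syntax, and the window sizes, thresholds, and the name `evaluate_tiers` are all assumed values you would tune for your own workload.

```python
from statistics import mean


def evaluate_tiers(error_rates,
                   spike_window=5, spike_threshold=0.20,
                   sustained_window=30, sustained_threshold=0.05):
    """Classify a series of per-minute error ratios (oldest first).

    Returns "page" for a sustained elevated error rate, "notify" for a
    short sharp spike, and "ok" otherwise. All thresholds are illustrative.
    """
    if not error_rates:
        return "ok"
    sustained = error_rates[-sustained_window:]
    spike = error_rates[-spike_window:]
    # Tier 2: a lower threshold, but averaged over a long roll-up window.
    if len(sustained) >= sustained_window and mean(sustained) > sustained_threshold:
        return "page"
    # Tier 1: a high threshold over a short window catches brief spikes.
    if max(spike) > spike_threshold:
        return "notify"
    return "ok"


print(evaluate_tiers([0.01] * 29 + [0.50]))  # notify
print(evaluate_tiers([0.08] * 30))           # page
```

The same metric feeds both tiers. Only the roll-up window and the threshold change, which is what keeps the short spike from waking anyone up while the sustained increase still does.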
Leveraging AI and Dynamic Thresholds
Use your platform’s AI-based monitoring capabilities, but do not rely exclusively on them. Anomaly detection, watchdogs, and ML-driven methods can be helpful but are often no more effective than simpler, more cost-effective methods. We tend to stick to anomaly detection for things like microservices and use watchdogs as a failsafe. When watchdogs detect incidents that we otherwise would not have found, we identify primary monitors to cover the gap as part of our post-mortem process.
Though generative AI is just now making its way into monitoring platforms, we are cautious about its role in dynamically defining thresholds at runtime and prefer its promise of helping you craft more effective queries.
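For intuition about what a dynamic threshold does, here is a deliberately simple statistical stand-in: a value is flagged when it falls more than `k` standard deviations from the mean of recent history. This is an assumption-laden sketch, not how any particular platform implements anomaly detection, and the name `is_anomalous` and its defaults are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, k=3.0, min_points=10):
    """Flag latest if it deviates from recent history by more than k sigma.

    history is a list of recent values for the same metric. With too few
    points the function declines to judge and returns False.
    """
    if len(history) < min_points:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) > k * sigma


history = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
print(is_anomalous(history, 104))  # False
print(is_anomalous(history, 120))  # True
```

The threshold moves with the data instead of being fixed in advance. Platform features typically add seasonality and trend handling on top of this basic idea.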
Practical Steps to Elevate Your Monitoring
-
Evaluate your current setup: Make sure you have the right tools in place. We like Prometheus/Loki/Grafana on the open source side and Datadog on the commercial side. But this is the era of peak monitoring, and there are many great options. Make sure yours works for you. We recommend prioritizing overall monitoring capabilities. Do not pick a tool based only on APM or RUM. We very rarely see either used to its full potential in the real world.
-
Get tagging right: We spent a lot of time in this post on tagging for a good reason. The more dynamic, serverless, and otherwise ephemeral your workload becomes, the more likely you are to have gaps if your tagging game isn’t on point.
-
Write advanced metric queries: Learn the query language and use arithmetic and statistical functions, aggregations, and vectors/ranges to increase the quality of your monitors and reduce false positives.
-
Use your logs: Don’t just use logs for investigations. Alert when known errors are logged and create alerts that detect swings in log volume. Logs are one area where watchdogs and other AI methods are markedly more effective than less expensive techniques.
-
Include monitoring in your post-mortem process: Consider how existing monitors served you in anticipating and resolving an incident. Don’t be the person who demands a new critical alert after any incident. But do be honest about what could have worked better and what was helpful. Most incidents have a monitoring lesson to learn.
-
Keep learning and adapting: Monitoring is not a set-and-forget task. Regularly review your monitoring strategies, update your queries, and refine your thresholds based on new insights and changing environments. Be intentional about your monitors and expect that if they’re not changing regularly, their effectiveness is drifting.
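As one example of the statistical functions mentioned in the steps above, the predictive disk alert described earlier in this post can be approximated with a least-squares trend line. The sketch below assumes disk usage samples as `(hours, used)` pairs. The function names and the linear-growth assumption are illustrative, and platform forecast functions are generally more sophisticated.

```python
def hours_until_full(samples, capacity):
    """Estimate hours until usage reaches capacity via a linear fit.

    samples is a list of (hours, used) pairs, oldest first, with used in
    the same unit as capacity. Returns None when there is no upward trend
    or not enough data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity - intercept) / slope
    return max(full_at - samples[-1][0], 0.0)


def should_alert(samples, capacity, horizon_hours):
    """Fire only if the disk is projected to fill within the horizon."""
    remaining = hours_until_full(samples, capacity)
    return remaining is not None and remaining < horizon_hours


usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]  # hypothetical GB used
print(hours_until_full(usage, capacity=100.0))  # 22.0
print(should_alert(usage, capacity=100.0, horizon_hours=8))  # False
```

Setting `horizon_hours` to the time remaining until the next working morning gives the "will it fill before daylight" behavior: a slowly growing disk produces a ticket for later instead of a page at night.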
Conclusion
At Rhythmic, we believe effective monitoring is the backbone of a reliable and efficient cloud infrastructure. By shifting from traditional methods to a more nuanced, metric-based approach, you can achieve a deeper understanding of your systems’ behavior and health. Embrace tagging, effective metric queries, and the judicious use of AI to transform your monitoring from a reactive chore to a proactive strength.
Remember, the goal is not just to monitor but to understand and anticipate. By adopting these strategies, you can ensure your monitoring works for you, maximizing availability and reliability while setting the stage for innovation and growth.