Metrics Monitoring and Alerting System
Designing a Metrics Monitoring and Alerting System
Step 1: Understand the Problem and Establish Design Scope
Pin down what the monitoring system must cover before designing it; an unclear scope leads to unnecessary complexity.
Identifying Stakeholders: Are we building for an internal team or for external SaaS users?
Internal Use: In this case, we focus strictly on operational metrics for the company.
Define Metrics: What types of metrics need to be collected?
Operational Metrics: CPU load, Memory usage, Disk space, Requests per second.
Out of Scope: Business metrics, logs (error/access), and distributed tracing.
Infrastructure Scale:
100 million daily active users
1000 server pools with 100 machines each
Assuming roughly 100 metrics per machine (the figure needed for these numbers to add up), that comes to approximately 10 million metrics to monitor.
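The scale estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the per-machine metric count of 100 is an assumption chosen so that the pool and machine counts reach the stated total of about 10 million.

```python
# Back-of-the-envelope estimate of how many metrics must be monitored.
SERVER_POOLS = 1_000
MACHINES_PER_POOL = 100
METRICS_PER_MACHINE = 100  # assumption: not given explicitly in the notes

machines = SERVER_POOLS * MACHINES_PER_POOL
total_metrics = machines * METRICS_PER_MACHINE

print(f"machines: {machines:,}")
print(f"metrics:  {total_metrics:,}")
```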
Data Retention: How long will metrics data be stored?
Retention Policy: Keep raw data for 7 days, roll it up to 1-minute resolution for the following 30 days, then to 1-hour resolution for 1 year.
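The tiered retention policy implies a roll-up step that averages fine-grained points into coarser buckets as data ages. The sketch below is one possible reading, not a definitive implementation: the function names are hypothetical, the tier boundaries (7, 37, and 365 days) are an interpretation of "the following 30 days" and "1 year", and averaging is assumed as the roll-up function.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def resolution_for_age(age_days):
    """Return the resolution tier for data of a given age, or None if expired."""
    if age_days < 7:
        return "raw"
    if age_days < 37:  # assumed: 30 days following the 7 raw days
        return "1m"
    if age_days < 365:
        return "1h"
    return None


raw = [(1613707200, 0.25), (1613707210, 0.75), (1613707260, 0.5)]
one_minute = downsample(raw, 60)
print(one_minute)
```

The first two raw points fall into the same 60-second bucket and are averaged into one; the third starts a new bucket. Calling `downsample(points, 3600)` would produce the 1-hour tier in the same way.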
Alert Channels: Email, phone, PagerDuty, and webhooks must be supported.
High-Level Requirements and Assumptions
Operational Assumptions:
Infrastructure for the monitored system is large-scale.
Many kinds of metrics are collected (e.g., CPU load, request count).
Non-functional Requirements:
Scalability: Able to expand to accommodate additional metrics and alerts.
Low Latency: Quick query responses for dashboards and alerts.
Reliability: Must reliably detect and alert on critical events.
Flexibility: The pipeline should be able to integrate new technologies as they appear.
Out of Scope:
Log monitoring systems like ELK stack.
Distributed system tracing.
Step 2: Propose High-Level Design and Get Buy-In
Core Components of Metrics Monitoring System
Data Collection:
Gather metrics data from various sources.
Data Transmission:
Transfer data from sources to monitoring back-end.
Data Storage:
Structure and persist incoming data efficiently.
Alerting System:
Analyze incoming data, detect anomalies, and issue alerts.
Visualization:
Create dashboards with data represented in various formats (graphs/charts).
Data Model
Metrics as Time-Series:
Metrics data is recorded as a time series: a sequence of values with their timestamps, identified by a metric name and an optional set of tags.
Data Points Characteristics:
Example of CPU load metric on server:
Metric: cpu.load
Tags: host:i631, env:prod
Timestamp: 1613707265
Value: 0.29
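To make the data model concrete, the data point above can be sketched as a small record type. The `DataPoint` class and its `series_key` method are hypothetical names used for illustration; the key format (metric name followed by sorted tags) is an assumption, not something the notes prescribe.

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    metric: str     # metric name, e.g. "cpu.load"
    tags: dict      # optional tags, e.g. {"host": "i631", "env": "prod"}
    timestamp: int  # Unix time in seconds
    value: float

    def series_key(self) -> str:
        """Identify the time series this point belongs to (name + sorted tags)."""
        tag_part = ",".join(f"{k}:{v}" for k, v in sorted(self.tags.items()))
        return f"{self.metric}{{{tag_part}}}"


point = DataPoint(
    metric="cpu.load",
    tags={"host": "i631", "env": "prod"},
    timestamp=1613707265,
    value=0.29,
)
print(point.series_key())
```

Sorting the tags makes the key independent of the order in which they were reported, so every point from the same host and environment maps to the same series.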
Storing data as time series lets the system answer queries about how a metric behaves over a given time range.
Time Series Visualization Example
Example query: the average CPU load across all web servers in a region over the last 10 minutes:
Select the stored data points whose metric name, tags, and timestamps match,
Average their values to produce the result.
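The query above can be sketched over an in-memory list standing in for the time-series store. Everything here is illustrative: the function name `average_over_window`, the tag names `role` and `region`, and the sample values are assumptions, and a real system would push this filtering and aggregation down into the storage layer.

```python
from statistics import mean


def average_over_window(points, metric, tag_filter, now, window_seconds):
    """Average values of matching points within the last `window_seconds`."""
    start = now - window_seconds
    values = [
        p["value"]
        for p in points
        if p["metric"] == metric
        and start <= p["timestamp"] <= now
        and all(p["tags"].get(k) == v for k, v in tag_filter.items())
    ]
    return mean(values) if values else None


NOW = 1613707265
points = [
    {"metric": "cpu.load", "tags": {"host": "i631", "role": "web", "region": "us-west"},
     "timestamp": NOW - 60, "value": 0.25},
    {"metric": "cpu.load", "tags": {"host": "i632", "role": "web", "region": "us-west"},
     "timestamp": NOW - 300, "value": 0.75},
    # Excluded: different region.
    {"metric": "cpu.load", "tags": {"host": "i700", "role": "web", "region": "us-east"},
     "timestamp": NOW - 120, "value": 0.5},
    # Excluded: older than 10 minutes.
    {"metric": "cpu.load", "tags": {"host": "i631", "role": "web", "region": "us-west"},
     "timestamp": NOW - 3600, "value": 1.0},
]

avg = average_over_window(
    points, "cpu.load", {"role": "web", "region": "us-west"}, NOW, 600
)
print(avg)
```

Only the first two points satisfy both the tag filter and the 10-minute window, so the result is their average.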