Metrics Monitoring and Alerting System

Designing a Metrics Monitoring and Alerting System

Step 1: Understand the Problem and Establish Design Scope
  • Pin down what the monitoring system must cover before designing it; an unclear scope leads to unnecessary complexity.

    • Identifying Stakeholders: Are we building for an internal team or for external SaaS users?

    • Internal Use: In this case, we focus strictly on operational metrics for the company.

    • Define Metrics: What types of metrics need to be collected?

    • Operational Metrics: CPU load, Memory usage, Disk space, Requests per second.

    • Out of Scope: Business metrics, logs (error/access), and distributed tracing.

    • Infrastructure Scale:

    • 100 million daily active users

    • 1000 server pools with 100 machines each

    • That is 100,000 machines and approximately 10 million metrics to monitor, which implies about 100 metrics per machine.

    • Data Retention: How long will metrics data be stored?

    • Retention Policy: Raw data is kept for 7 days, then rolled up to 1-minute resolution for the following 30 days, then to 1-hour resolution for 1 year.

    • Alert Channels: Alerts must be deliverable by email, phone, PagerDuty, or webhooks.
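The scale and retention figures above can be checked with a short sketch. The value of METRICS_PER_MACHINE is an assumption, chosen because it is what the quoted total of 10 million implies. The tier boundaries are one possible reading of the retention policy (cumulative ages of 7 days, 37 days, and 365 days), and resolution_for_age is a hypothetical helper name.

```python
from typing import Optional

# Back-of-the-envelope scale estimate.
SERVER_POOLS = 1_000
MACHINES_PER_POOL = 100
METRICS_PER_MACHINE = 100  # assumption: implied by the ~10 million total

machines = SERVER_POOLS * MACHINES_PER_POOL
total_metrics = machines * METRICS_PER_MACHINE

DAY = 24 * 60 * 60  # seconds

# (maximum age in seconds, resolution), ordered from newest to oldest tier.
# Assumption: the 30-day tier starts after the 7-day raw window, and the
# 1-hour tier ends when the data is one year old.
RETENTION_TIERS = [
    (7 * DAY, "raw"),
    (37 * DAY, "1m"),
    (365 * DAY, "1h"),
]


def resolution_for_age(age_seconds: int) -> Optional[str]:
    """Return the resolution kept for data of this age, or None if expired."""
    for max_age, resolution in RETENTION_TIERS:
        if age_seconds <= max_age:
            return resolution
    return None


print(machines, total_metrics)
print(resolution_for_age(3 * DAY), resolution_for_age(20 * DAY))
```

A lookup table like this keeps the policy in one place, so changing a tier does not require touching the query or downsampling logic.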

High-Level Requirements and Assumptions
  • Operational Assumptions:

    • Infrastructure for the monitored system is large-scale.

    • Metrics of several categories are collected (for example, CPU load and request count).

  • Non-functional Requirements:

    • Scalability: Able to expand to accommodate additional metrics and alerts.

    • Low Latency: Quick query responses for dashboards and alerts.

    • Reliability: Must reliably detect and alert on critical events.

    • Flexibility: The pipeline can integrate new technologies as they emerge.

  • Out of Scope:

    • Log monitoring, such as the ELK (Elasticsearch, Logstash, Kibana) stack.

    • Distributed system tracing.

Step 2: Propose High-Level Design and Get Buy-In
Core Components of Metrics Monitoring System
  1. Data Collection:

    • Gather metrics data from various sources.

  2. Data Transmission:

    • Transfer data from sources to monitoring back-end.

  3. Data Storage:

    • Structure and persist incoming data efficiently.

  4. Alerting System:

    • Analyze incoming data, detect anomalies, and issue alerts.

  5. Visualization:

    • Create dashboards with data represented in various formats (graphs/charts).
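How the components hand data to one another can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not a production design: collect, Store, and check_alerts are hypothetical names, transmission is reduced to a direct function call, the alert rule is a simple fixed threshold, and visualization is left out.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, int, float]  # (metric name, unix timestamp, value)


def collect(sources: Dict[str, Callable[[], float]], now: int) -> List[Sample]:
    """Data collection: read one value from every source."""
    return [(name, now, read()) for name, read in sources.items()]


class Store:
    """Data storage: keep samples in memory, grouped by metric name."""

    def __init__(self) -> None:
        self.series: Dict[str, List[Tuple[int, float]]] = {}

    def write(self, samples: List[Sample]) -> None:
        for name, ts, value in samples:
            self.series.setdefault(name, []).append((ts, value))


def check_alerts(samples: List[Sample], thresholds: Dict[str, float]) -> List[str]:
    """Alerting: report every sample whose value exceeds its metric's threshold."""
    return [
        f"{name} = {value} exceeds {thresholds[name]}"
        for name, _, value in samples
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical sources that return fixed readings.
sources = {"cpu.load": lambda: 0.95, "memory.usage": lambda: 0.40}
samples = collect(sources, now=1613707265)

store = Store()
store.write(samples)  # transmission is just a function call in this sketch
alerts = check_alerts(samples, {"cpu.load": 0.9})
print(alerts)
```

In a real deployment each stage would typically be a separate service, so that collection, storage, and alerting can scale and fail independently.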

Data Model
  • Metrics as Time-Series:

    • Metrics data is recorded as time series, each identified by a metric name and an optional set of tags.

  • Data Points Characteristics:

    • Example of a CPU load data point from one server:

      • Metric: cpu.load

      • Tags: host:i631, env:prod

      • Timestamp: 1613707265

      • Value: 0.29

    • Each time series is structured so that queries about a metric's values over a time range can be answered.
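The data model above could be represented as follows. This is a minimal sketch: the DataPoint class and its series_key method are hypothetical names, and the second data point (its timestamp and value) is invented purely to show two points landing in the same series.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DataPoint:
    metric: str      # e.g. "cpu.load"
    timestamp: int   # Unix time in seconds
    value: float
    tags: Dict[str, str] = field(default_factory=dict)

    def series_key(self) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
        """Identify the time series: metric name plus tags in sorted order."""
        return (self.metric, tuple(sorted(self.tags.items())))


# The example data point from the text.
a = DataPoint("cpu.load", 1613707265, 0.29, {"host": "i631", "env": "prod"})
# A later, made-up reading from the same host; tag order differs on purpose.
b = DataPoint("cpu.load", 1613707275, 0.31, {"env": "prod", "host": "i631"})

print(a.series_key() == b.series_key())
```

Sorting the tags before building the key means the order in which a client reports them does not split one series into several.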

Time Series Visualization Example
  • Example query: the average CPU load across all web servers in a region over the last 10 minutes.

    • Select the stored data points for those servers that fall within the time window.

    • Average the selected values to produce a single figure for the chart.
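The query described above can be sketched over an in-memory list of data points. This is an illustration only: a real time-series database would use indexes rather than a linear scan, and the role and region tags, the host names, and the sample values are all assumptions made for the example.

```python
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

# (metric name, tags, unix timestamp, value)
Point = Tuple[str, Dict[str, str], int, float]


def average(points: Iterable[Point], metric: str, tag_filter: Dict[str, str],
            start: int, end: int) -> Optional[float]:
    """Average the values of matching points with start <= timestamp <= end."""
    values = [
        value
        for name, tags, ts, value in points
        if name == metric
        and start <= ts <= end
        and all(tags.get(k) == v for k, v in tag_filter.items())
    ]
    return mean(values) if values else None


now = 1613707265
points = [
    ("cpu.load", {"host": "i631", "role": "web", "region": "us-west"}, now - 60, 0.25),
    ("cpu.load", {"host": "i632", "role": "web", "region": "us-west"}, now - 120, 0.75),
    # Excluded: different region.
    ("cpu.load", {"host": "i633", "role": "web", "region": "us-east"}, now - 30, 0.90),
    # Excluded: older than 10 minutes.
    ("cpu.load", {"host": "i631", "role": "web", "region": "us-west"}, now - 3600, 0.80),
]

result = average(points, "cpu.load", {"role": "web", "region": "us-west"},
                 start=now - 10 * 60, end=now)
print(result)  # 0.5
```

Returning None for an empty window lets a dashboard distinguish "no data" from an average of zero.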