Best practices for managing metrics

Metric names

Choosing and maintaining a good naming convention helps you identify which metrics to track more closely for indications of system health.

Naming conventions

  • Maintain a consistent naming scheme for your metrics.

    A common naming convention follows this pattern:

    `<application>_<metric_name>_<metric_type>`
  • Order names in a manner that leads to convenient grouping when listing metrics, or searching for metrics:

    advanced_functions_server_data_fetch_latency_bucket
    advanced_functions_server_http_request_fetch_latency_bucket
    advanced_functions_server_model_predict_latency_bucket

    Avoid metric names like these, which encode the outcome in the name:

    azure_devices_iothubs_c2d_twin_read_failure
    azure_devices_iothubs_c2d_twin_read_success
  • Avoid using the words success and failure in the metric name. Instead, save these for labels used by the metric. This approach makes it easier to create queries.

  • Consult the [Prometheus Metric and Label Naming](https://prometheus.io/docs/practices/naming/) documentation for more details on best naming practices.
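As a minimal sketch of the naming guidelines above, the following Python snippet builds a name from the `<application>_<metric_name>_<metric_type>` pattern and moves the outcome into a label. The helper names, the `iothub_twin_read_total` metric, and the `result` label are hypothetical and chosen only for illustration.

```python
def build_metric_name(application: str, metric_name: str, metric_type: str) -> str:
    """Join the three parts of the <application>_<metric_name>_<metric_type> pattern."""
    return "_".join([application, metric_name, metric_type])


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as name{label="value",...} value.

    Simplified: label values are not escaped here.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


# One metric name; the outcome lives in a (hypothetical) "result" label.
name = build_metric_name("iothub", "twin_read", "total")
print(format_sample(name, {"result": "success"}, 120))
print(format_sample(name, {"result": "failure"}, 3))
```

With a single metric name, one query can select both outcomes or filter on the label, instead of joining two differently named metrics.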

Cardinality

Cardinality is the number of unique time series (unique label combinations) stored in the database. Metrics with high cardinality have many unique label combinations. High cardinality can degrade performance and increase resource usage.

A simple way to estimate cardinality is to multiply the number of values of each label. For example, suppose http_requests_total has three labels: method has 4 values, status has 10 values, and path has 100 values. The cardinality estimate for this metric is 4 * 10 * 100 = 4,000.
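The same estimate can be sketched in Python. The function name is an assumption made for illustration. The product is an upper bound, because not every combination of label values necessarily occurs in practice.

```python
import math


def estimate_cardinality(label_value_counts: dict[str, int]) -> int:
    """Multiply the number of distinct values of each label (an upper bound)."""
    return math.prod(label_value_counts.values())


# Label value counts from the http_requests_total example.
http_requests_total_labels = {"method": 4, "status": 10, "path": 100}
estimate = estimate_cardinality(http_requests_total_labels)
print(estimate)  # 4000
```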

A metric with high cardinality can begin to impact performance, depending on the time range of the data that you examine. As a general rule, we consider metrics with a cardinality of 10,000 or more to be high-cardinality metrics.

Using labels

Labels both identify and categorize your metrics. They form the foundation of Prometheus’s dimensional data model.

Consider the following code example:

http_requests_total{method="GET", status="200"} 1234
http_requests_total{method="POST", status="404"} 5

Keep labels meaningful

Make sure your labels provide enough information and context for creating queries and troubleshooting.

Avoid too many unique labels

The number of time series grows with the number of unique combinations of label values. A large number of unique label values, or high cardinality, can quickly consume storage and degrade performance.

Avoid inconsistent labeling

Inconsistent use of labels can lead to difficulties in building and understanding queries.

Avoid overuse of labels

Avoid adding labels that you never use in queries or filters. Unused labels can impact performance by increasing the number of time series, and they often complicate query usage between teams.

Consider label size

Don’t use long labels or values. Be concise, because overly long labels are harder to type and remember.

Conversely, labels that are too short can be cryptic and hard to understand. Strike a balance to achieve clarity.

These guidelines also apply to label values.

Scrape configuration

Use a longer scrape interval of 30s to 60s.

Shorter intervals (10-15 seconds) provide more granular data, but increase the volume of work performed by collectors and the target Kloudfuse cluster. Additionally, finer granularity does not necessarily provide more useful information.
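To get a rough sense of how the scrape interval affects data volume, the following Python sketch counts the samples collected per day. The function name and the figure of 1,000 time series are assumptions for illustration only.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(scrape_interval_seconds: int, series_count: int = 1) -> int:
    """Number of samples collected per day at a fixed scrape interval."""
    return (SECONDS_PER_DAY // scrape_interval_seconds) * series_count


# Compare a short interval with the recommended 30s to 60s range,
# for a hypothetical target that exposes 1,000 time series.
for interval in (15, 30, 60):
    print(f"{interval}s -> {samples_per_day(interval, series_count=1000):,} samples/day")
```

Doubling the interval halves the number of samples that collectors send and that the cluster stores.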

Use histograms to view percentile data

A histogram samples observations, like request durations or response sizes, and counts them in configurable buckets. It also provides a sum of all observed values.

By using a histogram, you can create queries that provide you with more options than a stand-alone counter. Consider using histograms for percentiles.
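To illustrate how cumulative bucket counts yield a percentile, here is a simplified Python sketch that interpolates linearly inside the matching bucket. It is similar in spirit to PromQL's `histogram_quantile()` function, but it is not the Prometheus implementation. The function name and the bucket values are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("q must be in (0, 1] and the histogram must not be empty")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical request-latency buckets in seconds, with cumulative counts.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(p95)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the values you care about.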

Use recording rules for expensive queries

When you have large or expensive queries, consider using recording rules. Pre-computing these queries helps improve performance at runtime.