APM API Queries Documentation

This document provides a comprehensive guide to all Prometheus and GraphQL queries used in the APM (Application Performance Monitoring) system.

Overview

The APM system derives edge_latency_* metrics from raw span data. These metrics power the RED (rate, errors, duration) metrics and the dependency/service graph metrics. They are exposed through Prometheus and queried with PromQL.

Key Metric Types

  • edge_latency_count: Total number of requests

  • edge_latency_sum: Sum of all latencies (for average calculations)

  • edge_latency_bucket: Histogram buckets for percentile calculations

  • edge_latency_max: Maximum latency observed

  • edge_latency_min: Minimum latency observed

Edge Latency Metrics

All APM metrics are based on the edge_latency_* family of metrics derived from span data.

Common Labels

  • service_hash: Unique identifier for a service. The attributes used for hash calculation are configurable. Default: ["kf_platform", "availability_zone", "cloud_account_id", "kube_cluster_name", "kube_namespace", "project", "region", "service_name"]

  • service_name: Human-readable service name

  • client_service_hash: Hash of the calling service (for dependency tracking)

  • client_service_name: Name of the calling service

  • span_type: Type of span (e.g., "db" for database calls)

  • error: Boolean label ("true" or "false") indicating whether the request resulted in an error

  • le: Upper bound of a histogram bucket (for percentile calculations)
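To illustrate how these labels appear in the queries that follow, here is a minimal Python sketch that renders a PromQL label selector from a list of matchers. The helper names (`escape_label_value`, `build_selector`) are hypothetical and are not part of the APM codebase.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes, and newlines for a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_selector(metric: str, matchers: list[tuple[str, str, str]]) -> str:
    """Render metric{label<op>"value",...} from (label, operator, value) triples."""
    allowed_ops = {"=", "!=", "=~", "!~"}  # the four PromQL matcher operators
    parts = []
    for label, op, value in matchers:
        if op not in allowed_ops:
            raise ValueError(f"unsupported matcher operator: {op}")
        parts.append(f'{label}{op}"{escape_label_value(value)}"')
    return metric + "{" + ",".join(parts) + "}"


# Selector used by the service list error-rate numerator.
print(build_selector("edge_latency_count", [("span_type", "!=", "db"), ("error", "=", "true")]))
# prints: edge_latency_count{span_type!="db",error="true"}
```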

Service List Page Queries

P99 Latency Calculation

Description: Calculates the 99th percentile latency per service, excluding database spans

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

Parameters:

  • stepInMs: Time window for the rate calculation, in milliseconds

  • span_type!="db": Excludes database operations
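The percentile queries in this section rely on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in Python on a single set of cumulative buckets. It is a simplified illustration, not Prometheus's implementation: the function name and the sample bucket values are made up, and edge cases such as non-monotonic buckets are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for one service.
sample = [(100.0, 50), (250.0, 90), (500.0, 99), (math.inf, 100)]
print(histogram_quantile(0.75, sample))  # prints 193.75
```

Because the result is interpolated, its precision depends on how finely the bucket boundaries are spaced around the quantile of interest.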

P95 Latency Calculation

histogram_quantile(0.95,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

P90 Latency Calculation

histogram_quantile(0.90,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

P75 Latency Calculation

histogram_quantile(0.75,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

P50 Latency (Median) Calculation

histogram_quantile(0.50,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

Average Latency

sum by (service_hash, service_name) (rate(edge_latency_sum{span_type!="db"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))

Maximum Latency

max(max_over_time(edge_latency_max{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)

Minimum Latency

min(min_over_time(edge_latency_min{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)

Request Count

round(sum by (service_hash, service_name)
  (increase(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
)

Requests Per Second

sum by (service_hash, service_name)
  (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))

Error Rate

sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db",error="true"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
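The request count, requests-per-second, and error-rate queries reduce to simple arithmetic on counter samples taken at the start and end of the window. The Python sketch below shows that arithmetic under simplifying assumptions: it ignores the counter-reset handling and extrapolation that Prometheus applies in rate() and increase(), and the function name and sample numbers are illustrative only.

```python
def red_from_counters(count_start, count_end, errors_start, errors_end, window_seconds):
    """Derive request count, requests/second, and error rate from two counter samples."""
    requests = count_end - count_start          # analogous to increase(edge_latency_count[...])
    errors = errors_end - errors_start          # same, restricted to error="true"
    rps = requests / window_seconds             # analogous to rate(edge_latency_count[...])
    # With no requests in the window the ratio is undefined.
    error_rate = errors / requests if requests > 0 else float("nan")
    return {"requests": requests, "rps": rps, "error_rate": error_rate}


# 600 requests and 30 errors over a 5-minute window.
print(red_from_counters(1000, 1600, 10, 40, 300))
# prints: {'requests': 600, 'rps': 2.0, 'error_rate': 0.05}
```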

APDEX Score

(sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="1.0"}[${stepInMs}ms]))
 + sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="0.5"}[${stepInMs}ms])))
/
(2 * sum by (service_hash, service_name) (increase(edge_latency_count{span_type!="db"}[${stepInMs}ms])))
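The APDEX query adds two cumulative buckets and divides by twice the total. This is algebraically the same as the usual Apdex definition, (satisfied + tolerating / 2) / total, because the le="1.0" bucket already contains every request counted in the le="0.5" bucket. The following Python sketch checks that equivalence on made-up counts; the function names are illustrative.

```python
def apdex_from_buckets(le_satisfied, le_tolerating, total):
    """Apdex as computed by the query: both inputs are cumulative bucket counts."""
    if total == 0:
        return float("nan")
    return (le_satisfied + le_tolerating) / (2 * total)


def apdex_standard(satisfied, tolerating, total):
    """Textbook Apdex: tolerating requests count half."""
    if total == 0:
        return float("nan")
    return (satisfied + tolerating / 2) / total


# Hypothetical window: 800 requests in le="0.5", 950 in le="1.0", 1000 overall.
le_05, le_10, total = 800, 950, 1000
tolerating_only = le_10 - le_05  # requests between the two thresholds

print(apdex_from_buckets(le_05, le_10, total))          # prints 0.875
print(apdex_standard(le_05, tolerating_only, total))    # prints 0.875
```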

Service Details Page Queries

When viewing a specific service, queries are filtered by service_hash:

Service P99 Latency Over Time

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
  by (${property}, le)
)

Parameters:

  • serviceHash: The specific service’s hash

  • property: Grouping property (e.g., "endpoint", "version", "client_service_hash")

  • rateIntervalSeconds: Rate calculation window
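Since these three parameters are interpolated directly into the query text, it can help to validate them before rendering. The Python sketch below shows one possible approach; it is not the APM frontend's actual code. The function name, the `group_by` parameter name (standing in for property), and the simplified duration pattern (single-unit durations only) are assumptions.

```python
import re

LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
# Simplified: accepts single-unit durations such as "5m" or "300s", not "1h30m".
DURATION = re.compile(r"^[0-9]+(ms|s|m|h|d|w|y)$")


def service_p99_query(service_hash: str, group_by: str, rate_interval: str) -> str:
    """Render the service P99 query after validating the interpolated values."""
    if not LABEL_NAME.match(group_by):
        raise ValueError(f"invalid label name: {group_by!r}")
    if not DURATION.match(rate_interval):
        raise ValueError(f"invalid duration: {rate_interval!r}")
    if '"' in service_hash or "\\" in service_hash:
        raise ValueError("service hash must not contain quotes or backslashes")
    return (
        "histogram_quantile(0.99, "
        f'sum(rate(edge_latency_bucket{{service_hash="{service_hash}"}}[{rate_interval}])) '
        f"by ({group_by}, le))"
    )


print(service_p99_query("abc123", "endpoint", "5m"))
```

Rejecting unexpected characters keeps a malformed grouping property or hash from changing the structure of the query.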

Service Request Rate

sum by (${property})
  (rate(edge_latency_count{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))

Service Error Rate

sum by (${property}) (rate(edge_latency_count{service_hash="${serviceHash}",error="true"}[${rateIntervalSeconds}]))
/
sum by (${property}) (rate(edge_latency_count{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))

Downstream Dependencies (Client Services)

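For analyzing which services the current service calls (the current service is matched as client_service_hash, and results are grouped by the called service):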

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{client_service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
  by (service_hash, service_name, le)
)

Trace List Page Queries

Trace queries are handled through GraphQL rather than Prometheus metrics.

GraphQL Queries

Get Services List

query GetServices {
  services(
    filter: {
      attributeFilter: {
        eq: { key: "${customerFilterKey}", value: "${customerFilterValue}" }
      }
    }
    durationSecs: ${durationSecs}
    kfSource: "${kfSource}"
    service: { kfType: "${spanTypeFilter}" }
    timestamp: "${endTime}"
  ) {
    name
    distinctLabels
    labels
    hash
    kfType
  }
}
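A GraphQL query like the one above is usually sent as a JSON body with a "query" field. The Python sketch below builds such a body for a reduced form of the services query, using only the durationSecs and timestamp arguments. Whether the other arguments (filter, kfSource, service) are required is not stated here, so treat the reduced form as an assumption; the function name is hypothetical.

```python
import json


def services_request_body(duration_secs: int, end_time: str) -> str:
    """Build the JSON request body for a reduced `services` query."""
    query = (
        "{ services("
        f"durationSecs: {int(duration_secs)}, "
        # json.dumps yields a double-quoted, escaped string literal.
        f"timestamp: {json.dumps(end_time)}"
        ") { name hash kfType } }"
    )
    # The outer json.dumps escapes the inner quotes for transport.
    return json.dumps({"query": query})


body = services_request_body(3600, "2024-01-01T00:00:00Z")
print(json.loads(body)["query"])
# prints: { services(durationSecs: 3600, timestamp: "2024-01-01T00:00:00Z") { name hash kfType } }
```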

Get Traces

{
  traces(
    durationSecs: ${durationSecs}
    filter: ${buildTracesFilter(...)}
    limit: ${limit}
    pageNum: ${pageNum}
    timestamp: "${endTime}"
    sortField: "${sortBy}"
    sortOrder: ${sortOrder}
  ) {
    traceId
    span {
      spanId
      parentSpanId
      startTimeNs
      endTimeNs
      attributes
      durationNs
      name
      service {
        name
        labels
        hash
        distinctLabels
      }
      statusCode
      method
      endpoint
      rootSpan
    }
    traceMetrics {
      spanCount
      serviceExecTimeNs
    }
  }
}

Get SLOs

{
  listSLOs {
    id
    name
    type
    service {
      name
      hash
      distinctLabels
      kfType
      labels
    }
    goodEventsSLIQuery
    totalEventsSLIQuery
    matchers
    latencyThreshold
    objective
    description
    timeWindow
    alertUid
    contactPoints
  }
}

Database-Specific Queries

For database operations, queries filter by span_type="db":

Database P99 Latency

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{span_type="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

Database Request Count

round(sum by (service_hash, service_name)
  (increase(edge_latency_count{span_type="db"}[${stepInMs}ms]))
)

Common Query Parameters

Time Windows

  • stepInMs: Step size in milliseconds for instant queries; interpolated as the range window [${stepInMs}ms]

  • rateIntervalSeconds: Rate interval as a PromQL duration string that includes its unit (e.g., "5m", "1h")

  • durationSecs: Total duration in seconds for the query window
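As a rough illustration of how these values relate, the Python sketch below derives the window boundaries from an end time and durationSecs, and formats stepInMs as the PromQL range used in the service list queries. It assumes the window is [endTime - durationSecs, endTime], which is inferred from the GraphQL arguments rather than stated explicitly. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def query_window(end_time: datetime, duration_secs: int) -> tuple[str, str]:
    """Return (start, end) as ISO 8601 UTC strings for a window ending at end_time."""
    if end_time.tzinfo is None:
        raise ValueError("end_time must be timezone-aware")
    end_utc = end_time.astimezone(timezone.utc)
    start_utc = end_utc - timedelta(seconds=duration_secs)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start_utc.strftime(fmt), end_utc.strftime(fmt)


def step_to_promql_range(step_in_ms: int) -> str:
    """Format stepInMs the way the queries interpolate it, e.g. 60000 -> '60000ms'."""
    return f"{int(step_in_ms)}ms"


end = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(query_window(end, 3600))       # prints ('2023-12-31T23:00:00Z', '2024-01-01T00:00:00Z')
print(step_to_promql_range(60000))   # prints 60000ms
```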

Filters

  • selectedFacetValuesByName: Key-value pairs for filtering by service attributes

  • customerFilter: Customer-specific filtering

  • spanTypeFilter: Filter by span type ("db", "http", etc.)

Aggregation

  • sumBy: Fields to group by in aggregations (typically includes service identifiers)

Usage Examples

Example 1: Get P99 latency for a specific service

curl -X POST http://api.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(edge_latency_bucket{service_hash="abc123"}[5m])) by (le))'

Example 2: Get all services via GraphQL

curl -X POST http://api.example.com/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ services(durationSecs: 3600, timestamp: \"2024-01-01T00:00:00Z\") { name hash } }"
  }'

Example 3: Calculate service error rate over last hour

curl -X POST http://api.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=sum by (service_name) (rate(edge_latency_count{error="true"}[1h])) / sum by (service_name) (rate(edge_latency_count[1h]))'
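The Prometheus examples can also be issued from Python. The sketch below builds the same POST request with the standard library and URL-encodes the query, so characters such as braces, quotes, and spaces are transmitted safely. It only constructs the request; sending it would require urllib.request.urlopen and a reachable endpoint. The base URL is the placeholder host from the examples, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_instant_query(base_url: str, promql: str) -> Request:
    """Build a form-encoded POST request for the Prometheus instant query endpoint."""
    body = urlencode({"query": promql}).encode("ascii")
    return Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


promql = 'histogram_quantile(0.99, sum(rate(edge_latency_bucket{service_hash="abc123"}[5m])) by (le))'
request = build_instant_query("http://api.example.com/prometheus", promql)

# Decoding the body recovers the original query unchanged.
print(parse_qs(request.data.decode("ascii"))["query"][0])
```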

Notes

  1. All latency values in Prometheus metrics are in milliseconds

  2. GraphQL timestamps use ISO 8601 format

  3. The span_type!="db" filter excludes database operations from general service metrics

  4. APDEX thresholds are typically 0.5s (satisfied) and 1.0s (tolerating)

  5. Rate intervals should be at least 4x the scrape interval for accuracy
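The guideline in note 5 can be applied mechanically when choosing a rate window. The Python sketch below clamps a requested interval to at least four scrape intervals. The function names and the 15-second scrape interval in the example are assumptions, not values taken from the APM configuration.

```python
def min_rate_interval_seconds(scrape_interval_seconds: float, factor: int = 4) -> float:
    """Smallest recommended rate window for a given scrape interval."""
    return factor * scrape_interval_seconds


def effective_rate_interval(requested_seconds: float, scrape_interval_seconds: float) -> float:
    """Use the requested window unless it is shorter than the recommended minimum."""
    return max(requested_seconds, min_rate_interval_seconds(scrape_interval_seconds))


# Assuming a 15s scrape interval, a 30s request is raised to 60s.
print(effective_rate_interval(30, 15))   # prints 60
print(effective_rate_interval(300, 15))  # prints 300
```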