APM API Queries Documentation

This document provides a comprehensive guide to all Prometheus and GraphQL queries used in the APM (Application Performance Monitoring) system.

Overview

The APM system derives edge_latency_* metrics from raw span data. These metrics power the RED (rate, errors, duration) metrics and the dependency/service graph. They are exposed through Prometheus and queried with PromQL.

Key Metric Types

edge_latency_count: Total number of requests
edge_latency_sum: Sum of all latencies (for average calculations)
edge_latency_bucket: Histogram buckets for percentile calculations
edge_latency_max: Maximum latency observed
edge_latency_min: Minimum latency observed

Edge Latency Metrics

All APM metrics are based on the edge_latency_* family of metrics derived from span data.

Common Labels

service_hash: Unique identifier for a service. The attributes used for hash calculation are configurable. Default: ["kf_platform", "availability_zone", "cloud_account_id", "kube_cluster_name", "kube_namespace", "project", "region", "service_name"]
service_name: Human-readable service name
client_service_hash: Hash of the calling service (for dependency tracking)
client_service_name: Name of the calling service
span_type: Type of span (e.g., "db" for database calls)
error: "true" or "false", indicating whether the request resulted in an error
le: Upper bound of a histogram bucket (used for percentile calculations)

Service List Page Queries

P99 Latency Calculation

Description: Calculates the 99th percentile latency for each service, excluding database spans

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)


Parameters:

stepInMs: Time window for the rate calculation, in milliseconds
span_type!="db": Excludes database operations
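
The percentile queries in this section differ only in the quantile argument, so they can be generated from one template. The following is a minimal Python sketch, not the APM system's own code. The helper name `percentile_query` and its parameters are assumptions made for illustration.

```python
def percentile_query(quantile: float, step_in_ms: int,
                     span_filter: str = 'span_type!="db"') -> str:
    """Build a service-list percentile query for one quantile.

    Hypothetical helper: the template mirrors the P99 query in this section.
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    if step_in_ms <= 0:
        raise ValueError("step_in_ms must be positive")
    return (
        f"histogram_quantile({quantile}, "
        f"sum(rate(edge_latency_bucket{{{span_filter}}}[{step_in_ms}ms])) "
        f"by (service_hash, service_name, le))"
    )


# Example: the P95 query over a 60-second window.
p95 = percentile_query(0.95, 60000)
```

Passing 'span_type="db"' as `span_filter` would produce the database variant of the same query.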

P95 Latency Calculation

histogram_quantile(0.95,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)


P90 Latency Calculation

histogram_quantile(0.90,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)


P75 Latency Calculation

histogram_quantile(0.75,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)


P50 Latency (Median) Calculation

histogram_quantile(0.50,
  sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)

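
The percentile values above are estimates: histogram_quantile locates the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that idea in Python for a single series. It is a simplified illustration of classic-histogram interpolation, not Prometheus's actual implementation, and the function name `estimate_quantile` is an assumption.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with an +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative counts only: 100 requests, bucket bounds in the metric's unit.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95_estimate = estimate_quantile(0.95, sample)
```

Because of the interpolation, the accuracy of every percentile depends on how closely the bucket boundaries bracket the real latencies.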

Average Latency

sum by (service_hash, service_name) (rate(edge_latency_sum{span_type!="db"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))


Maximum Latency

max(max_over_time(edge_latency_max{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)


Minimum Latency

min(min_over_time(edge_latency_min{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)


Request Count

round(sum by (service_hash, service_name)
  (increase(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
)


Requests Per Second

sum by (service_hash, service_name)
  (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))


Error Rate

sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db",error="true"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))

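
In PromQL, a division between two vectors only returns series whose labels appear on both sides. A service with no error="true" samples in the window therefore produces no result from the error-rate query, rather than 0. One way a client could fill that gap is sketched below in Python. The function name and the dictionary inputs are assumptions, and this document does not state whether the APM UI does this.

```python
def error_rates(error_rps: dict[str, float],
                total_rps: dict[str, float]) -> dict[str, float]:
    """Return the error rate per service, treating a missing error series as 0."""
    rates = {}
    for service, total in total_rps.items():
        if total > 0:  # skip services with no traffic in the window
            rates[service] = error_rps.get(service, 0.0) / total
    return rates


# Illustrative values only: "cart" has traffic but no error series.
rates = error_rates({"checkout": 0.5}, {"checkout": 10.0, "cart": 4.0, "idle": 0.0})
```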

APDEX Score

(sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="1.0"}[${stepInMs}ms]))
 + sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="0.5"}[${stepInMs}ms])))
/
(2 * sum by (service_hash, service_name) (increase(edge_latency_count{span_type!="db"}[${stepInMs}ms])))

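
The APDEX query relies on histogram buckets being cumulative: the le="1.0" bucket already contains every request counted in the le="0.5" bucket. Adding the two buckets and dividing by twice the total is therefore equivalent to the usual form (satisfied + tolerating / 2) / total. A small Python sketch of the arithmetic follows, with a hypothetical function name and made-up counts.

```python
from typing import Optional


def apdex(le_satisfied: float, le_tolerating: float, total: float) -> Optional[float]:
    """Compute Apdex from cumulative bucket counts.

    le_satisfied:  count in the lower-threshold bucket (le="0.5" in the query)
    le_tolerating: count in the upper-threshold bucket (le="1.0"), which
                   includes the satisfied requests because buckets are cumulative
    total:         all requests (edge_latency_count)
    """
    if total <= 0:
        return None  # no traffic, score undefined
    return (le_satisfied + le_tolerating) / (2 * total)


# 100 requests: 80 satisfied, 15 tolerating (cumulative 95), 5 frustrated.
score = apdex(80, 95, 100)
```

With these counts the usual form gives (80 + 15 / 2) / 100, which is the same value as the cumulative form computed by the function.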

Service Details Page Queries

When viewing a specific service, queries are filtered by service_hash:

Service P99 Latency Over Time

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
  by (${property}, le)
)


Parameters:

serviceHash: The specific service’s hash
property: Grouping property (e.g., "endpoint", "version", "client_service_hash")
rateIntervalSeconds: Rate calculation window, as a Prometheus duration string (e.g., "5m")
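
Since serviceHash and property are interpolated directly into the query text, a value containing a quote or backslash would break the matcher. The Python sketch below shows one way to build the P99 query with the label value escaped. Both function names are hypothetical, and this document does not describe how the APM client itself handles escaping.

```python
def escape_label_value(value: str) -> str:
    """Escape a value for use inside a double-quoted PromQL string."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def service_p99_query(service_hash: str, group_by: str, rate_interval: str) -> str:
    """Build the per-service P99 query, grouped by one property."""
    matcher = f'service_hash="{escape_label_value(service_hash)}"'
    return (
        f"histogram_quantile(0.99, "
        f"sum(rate(edge_latency_bucket{{{matcher}}}[{rate_interval}])) "
        f"by ({group_by}, le))"
    )


# Example: P99 per endpoint over a 5-minute rate window.
query = service_p99_query("abc123", "endpoint", "5m")
```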

Service Request Rate

sum by (${property})
  (rate(edge_latency_count{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))


Service Error Rate

sum by (${property}) (rate(edge_latency_count{service_hash="${serviceHash}",error="true"}[${rateIntervalSeconds}]))
/
sum by (${property}) (rate(edge_latency_count{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))


Downstream Dependencies (Client Services)

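For analyzing which services the current service calls. The filter on client_service_hash selects edges where the current service is the caller, and the results are grouped by the called service: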

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{client_service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
  by (service_hash, service_name, le)
)


Trace List Page Queries

Trace queries are handled through GraphQL rather than Prometheus metrics.

GraphQL Queries

Get Services List

query GetServices {
  services(
    filter: {
      attributeFilter: {
        eq: { key: "${customerFilterKey}", value: "${customerFilterValue}" }
      }
    }
    durationSecs: ${durationSecs}
    kfSource: "${kfSource}"
    service: { kfType: "${spanTypeFilter}" }
    timestamp: "${endTime}"
  ) {
    name
    distinctLabels
    labels
    hash
    kfType
  }
}

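
A GraphQL query is sent as a JSON document with the query text in a "query" field, so string arguments need to be quoted correctly twice: once for GraphQL and once for JSON. The Python sketch below builds such a payload using only the durationSecs and timestamp arguments that Example 2 in this document uses. The helper name is an assumption, and whether the server accepts this reduced argument set is not confirmed here.

```python
import json


def build_services_payload(duration_secs: int, timestamp: str) -> str:
    """Return the JSON request body for a minimal services query."""
    query = (
        "{ services("
        f"durationSecs: {int(duration_secs)}, "
        # json.dumps yields a double-quoted, escaped string literal.
        f"timestamp: {json.dumps(timestamp)}"
        ") { name hash } }"
    )
    return json.dumps({"query": query})


payload = build_services_payload(3600, "2024-01-01T00:00:00Z")
```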

Get Traces

{
  traces(
    durationSecs: ${durationSecs}
    filter: ${buildTracesFilter(...)}
    limit: ${limit}
    pageNum: ${pageNum}
    timestamp: "${endTime}"
    sortField: "${sortBy}"
    sortOrder: ${sortOrder}
  ) {
    traceId
    span {
      spanId
      parentSpanId
      startTimeNs
      endTimeNs
      attributes
      durationNs
      name
      service {
        name
        labels
        hash
        distinctLabels
      }
      statusCode
      method
      endpoint
      rootSpan
    }
    traceMetrics {
      spanCount
      serviceExecTimeNs
    }
  }
}


Get SLOs

{
  listSLOs {
    id
    name
    type
    service {
      name
      hash
      distinctLabels
      kfType
      labels
    }
    goodEventsSLIQuery
    totalEventsSLIQuery
    matchers
    latencyThreshold
    objective
    description
    timeWindow
    alertUid
    contactPoints
  }
}


Database-Specific Queries

For database operations, queries filter by span_type="db":

Database P99 Latency

histogram_quantile(0.99,
  sum(rate(edge_latency_bucket{span_type="db"}[${stepInMs}ms]))
  by (service_hash, service_name, le)
)


Database Request Count

round(sum by (service_hash, service_name)
  (increase(edge_latency_count{span_type="db"}[${stepInMs}ms]))
)


Common Query Parameters

Time Windows

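stepInMs: Window size in milliseconds, used as the range selector (e.g., [60000ms]) in rate, increase, and *_over_time calls
rateIntervalSeconds: Rate interval as a Prometheus duration string (e.g., "5m", "1h"); despite the name, the value carries its own unit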
durationSecs: Total duration in seconds for the query window
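
The three parameters use different units (milliseconds, a duration string, and seconds), so comparing them requires a common unit. The sketch below converts a Prometheus-style duration string to seconds. It is a simplified Python parser written for illustration: the function name is an assumption, and unlike Prometheus it does not check that units appear from longest to shortest.

```python
import re

# Unit sizes in seconds; Prometheus treats a week as 7 days and a year as 365 days.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_to_seconds(text: str) -> float:
    """Convert a duration such as "5m" or "1h30m" to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0.0
    while pos < len(text):
        match = _PART.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


# Example: a "5m" rate interval expressed in the unit used by stepInMs.
five_minutes_ms = duration_to_seconds("5m") * 1000
```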

Filters

selectedFacetValuesByName: Key-value pairs for filtering by service attributes
customerFilter: Customer-specific filtering
spanTypeFilter: Filter by span type ("db", "http", etc.)

Aggregation

sumBy: Fields to group by in aggregations (typically includes service identifiers)

Usage Examples

Example 1: Get P99 latency for a specific service

curl -X POST http://api.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(edge_latency_bucket{service_hash="abc123"}[5m])) by (le))'


Example 2: Get all services via GraphQL

curl -X POST http://api.example.com/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ services(durationSecs: 3600, timestamp: \"2024-01-01T00:00:00Z\") { name hash } }"
  }'


Example 3: Calculate service error rate over last hour

curl -X POST http://api.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=sum by (service_name) (rate(edge_latency_count{error="true"}[1h])) / sum by (service_name) (rate(edge_latency_count[1h]))'

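
The Prometheus examples above can also be issued from code. The sketch below builds the same POST request with Python's standard library and URL-encodes the query so that braces, quotes, and spaces survive transport. The base URL is the placeholder used in the examples, and the helper name is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_instant_query_request(base_url: str, promql: str) -> Request:
    """Build a form-encoded POST request for the /api/v1/query endpoint."""
    body = urlencode({"query": promql}).encode("ascii")
    return Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_instant_query_request(
    "http://api.example.com/prometheus",
    'histogram_quantile(0.99, sum(rate(edge_latency_bucket{service_hash="abc123"}[5m])) by (le))',
)
```

The request is only constructed here. Sending it would be a call to urllib.request.urlopen(request), followed by parsing the JSON response.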

Notes

All latency values in Prometheus metrics are in milliseconds
GraphQL timestamps use ISO 8601 format
The span_type!="db" filter excludes database operations from general service metrics
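APDEX thresholds are typically 0.5s (satisfied) and 1.0s (tolerating); the le values in the APDEX query must be written in the same unit as the histogram bucket boundaries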
Rate intervals should be at least 4x the scrape interval for accuracy
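
The rate-interval guideline can be applied before a query is built by raising any window that is too short. This is a minimal Python sketch, assuming the caller knows the scrape interval. Both function names and the `factor` parameter are hypothetical.

```python
def min_rate_window_ms(scrape_interval_s: float, factor: int = 4) -> int:
    """Smallest recommended rate window, in milliseconds."""
    return int(scrape_interval_s * 1000 * factor)


def clamp_step_ms(step_in_ms: int, scrape_interval_s: float) -> int:
    """Raise step_in_ms to the recommended minimum if it is too small."""
    return max(step_in_ms, min_rate_window_ms(scrape_interval_s))


# With a 15-second scrape interval, a 15-second window is raised to 60 seconds.
step = clamp_step_ms(15000, 15)
```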