APM API Queries Documentation
This document provides a comprehensive guide to all Prometheus and GraphQL queries used in the APM (Application Performance Monitoring) system.
Overview
The APM system derives the edge_latency_* metrics from raw span data. These metrics power the RED (rate, errors, duration) metrics and the dependency/service graph metrics. They are exposed through Prometheus and accessed via PromQL queries.
Edge Latency Metrics
All APM metrics are based on the edge_latency_* family of metrics derived from span data.
Common Labels
- service_hash: Unique identifier for a service. The attributes used for hash calculation are configurable. Default: ["kf_platform", "availability_zone", "cloud_account_id", "kube_cluster_name", "kube_namespace", "project", "region", "service_name"]
- service_name: Human-readable service name
- client_service_hash: Hash of the calling service (for dependency tracking)
- client_service_name: Name of the calling service
- span_type: Type of span (e.g., "db" for database calls)
- error: Boolean indicating if the request resulted in an error
- le: Histogram bucket boundaries (for percentile calculations)
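To illustrate why service_hash is built from several attributes rather than service_name alone, here is a minimal Python sketch. The hash function, the key=value encoding, and the 16-character truncation are all assumptions made for illustration; the actual APM hashing scheme is not described in this document. Only the default attribute list is taken from the label description above.

```python
import hashlib

# Default attribute list, copied from the service_hash label description.
DEFAULT_HASH_ATTRIBUTES = [
    "kf_platform", "availability_zone", "cloud_account_id", "kube_cluster_name",
    "kube_namespace", "project", "region", "service_name",
]


def service_hash(attributes: dict, keys=DEFAULT_HASH_ATTRIBUTES) -> str:
    """Hypothetical hash: SHA-256 over the configured key=value pairs.

    ASSUMPTION: the real hash function and encoding are not documented here.
    """
    # Missing attributes are treated as empty strings so the input stays stable.
    canonical = "\n".join(f"{key}={attributes.get(key, '')}" for key in keys)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Same service_name, different namespace: the two hashes differ.
checkout_prod = service_hash({"service_name": "checkout", "kube_namespace": "prod"})
checkout_dev = service_hash({"service_name": "checkout", "kube_namespace": "dev"})
```

Under these assumptions, two deployments that share a name but differ in any configured attribute are tracked as separate services, which is the behaviour the label description implies.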
Service List Page Queries
P99 Latency Calculation
Description: Calculates the 99th percentile latency for all services
histogram_quantile(0.99,
sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name, le)
)
Parameters:
- stepInMs: Time window for rate calculation
- span_type!="db": Excludes database operations
P95 Latency Calculation
histogram_quantile(0.95,
sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name, le)
)
P90 Latency Calculation
histogram_quantile(0.90,
sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name, le)
)
P75 Latency Calculation
histogram_quantile(0.75,
sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name, le)
)
P50 Latency (Median) Calculation
histogram_quantile(0.50,
sum(rate(edge_latency_bucket{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name, le)
)
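The five percentile queries above differ only in the quantile passed to histogram_quantile, so they can be generated from one template. The following Python sketch shows one way to do that; the function name latency_quantile_query and the 60000 ms step are illustrative choices, not part of the APM codebase.

```python
def latency_quantile_query(quantile: float, step_in_ms: int) -> str:
    """Build the service-list latency percentile query for one quantile."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    if step_in_ms <= 0:
        raise ValueError("step_in_ms must be positive")
    return (
        f"histogram_quantile({quantile}, "
        f'sum(rate(edge_latency_bucket{{span_type!="db"}}[{step_in_ms}ms])) '
        "by (service_hash, service_name, le))"
    )


# One query per percentile listed above, using an example 60-second step.
PERCENTILE_QUERIES = {
    f"p{round(q * 100)}": latency_quantile_query(q, 60_000)
    for q in (0.99, 0.95, 0.90, 0.75, 0.50)
}
```

Keeping le in the by clause is required in every variant, because histogram_quantile needs the bucket boundaries to interpolate.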
Average Latency
sum by (service_hash, service_name) (rate(edge_latency_sum{span_type!="db"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
Maximum Latency
max(max_over_time(edge_latency_max{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)
Minimum Latency
min(min_over_time(edge_latency_min{span_type!="db"}[${stepInMs}ms]))
by (service_hash, service_name)
Request Count
round(sum by (service_hash, service_name)
(increase(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
)
Requests Per Second
sum by (service_hash, service_name)
(rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
Error Rate
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db",error="true"}[${stepInMs}ms]))
/
sum by (service_hash, service_name) (rate(edge_latency_count{span_type!="db"}[${stepInMs}ms]))
APDEX Score
(sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="1.0"}[${stepInMs}ms]))
+ sum by (service_hash, service_name) (increase(edge_latency_bucket{span_type!="db",le="0.5"}[${stepInMs}ms])))
/
(2 * sum by (service_hash, service_name) (increase(edge_latency_count{span_type!="db"}[${stepInMs}ms])))
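The APDEX query above may look different from the usual definition, (satisfied + tolerating / 2) / total. The difference comes from histogram buckets being cumulative: the le="1.0" bucket already contains every request counted in the le="0.5" bucket. The Python sketch below works through the arithmetic with made-up counts (80, 90, and 100 are illustrative numbers, not measured data).

```python
def apdex(satisfied_bucket: float, tolerable_bucket: float, total: float) -> float:
    """Apdex from cumulative histogram counts.

    satisfied_bucket: requests counted in the le="0.5" bucket
    tolerable_bucket: requests counted in the le="1.0" bucket (includes the satisfied ones)
    total:            all requests (edge_latency_count)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Requests that were tolerable but not satisfied.
    tolerating_only = tolerable_bucket - satisfied_bucket
    classic = (satisfied_bucket + tolerating_only / 2) / total
    # Form used by the PromQL query: both cumulative buckets over twice the total.
    from_query = (satisfied_bucket + tolerable_bucket) / (2 * total)
    assert abs(classic - from_query) < 1e-12  # the two forms are algebraically equal
    return from_query


# Illustrative counts: 80 satisfied, 10 more tolerable, 100 requests overall.
score = apdex(satisfied_bucket=80, tolerable_bucket=90, total=100)  # 0.85
```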
Service Details Page Queries
When viewing a specific service, queries are filtered by service_hash:
Service P99 Latency Over Time
histogram_quantile(0.99,
sum(rate(edge_latency_bucket{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
by (${property}, le)
)
Parameters:
- serviceHash: The specific service’s hash
- property: Grouping property (e.g., "endpoint", "version", "client_service_hash")
- rateIntervalSeconds: Rate calculation window
Service Request Rate
sum by (${property})
(rate(edge_latency_count{service_hash="${serviceHash}"}[${rateIntervalSeconds}]))
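Because the service details queries interpolate serviceHash and property directly into PromQL, it can help to escape and validate those values before building the string. The sketch below is one possible approach in Python; the helper names and the allow-list (taken from the examples in the parameter list) are assumptions, and a real deployment may permit other grouping labels.

```python
# ASSUMPTION: only the example properties from the parameter list are allowed.
ALLOWED_PROPERTIES = {"endpoint", "version", "client_service_hash"}


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines for a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def service_request_rate_query(service_hash: str, prop: str, rate_interval: str) -> str:
    """Build the per-service request rate query grouped by one property."""
    if prop not in ALLOWED_PROPERTIES:
        raise ValueError(f"unsupported grouping property: {prop}")
    selector = f'service_hash="{escape_label_value(service_hash)}"'
    return f"sum by ({prop}) (rate(edge_latency_count{{{selector}}}[{rate_interval}]))"


query = service_request_rate_query("abc123", "endpoint", "5m")
```

Rejecting unknown property values keeps user-supplied text out of the by clause, where it cannot be quoted.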
GraphQL Queries
Get Services List
query GetServices {
services(
filter: {
attributeFilter: {
eq: { key: "${customerFilterKey}", value: "${customerFilterValue}" }
}
}
durationSecs: ${durationSecs}
kfSource: "${kfSource}"
service: { kfType: "${spanTypeFilter}" }
timestamp: "${endTime}"
) {
name
distinctLabels
labels
hash
kfType
}
}
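The GetServices query is a template with placeholders such as ${customerFilterKey} and ${endTime}. As a rough sketch, the Python below fills those placeholders and wraps the result in the JSON body that GraphQL servers conventionally accept over HTTP. The function names and the sample argument values are hypothetical, and the endpoint URL is deliberately left out because it is not given in this document. String values are quoted with json.dumps, whose escaping is compatible with GraphQL string literals for ordinary text.

```python
import json


def gql_string(value: str) -> str:
    """Quote a Python string as a GraphQL string literal via JSON escaping."""
    return json.dumps(value)


def build_services_payload(filter_key: str, filter_value: str, duration_secs: int,
                           kf_source: str, span_type: str, end_time: str) -> dict:
    """Fill the GetServices template and return a {"query": ...} request body."""
    query = f"""query GetServices {{
  services(
    filter: {{ attributeFilter: {{ eq: {{ key: {gql_string(filter_key)}, value: {gql_string(filter_value)} }} }} }}
    durationSecs: {int(duration_secs)}
    kfSource: {gql_string(kf_source)}
    service: {{ kfType: {gql_string(span_type)} }}
    timestamp: {gql_string(end_time)}
  ) {{
    name
    distinctLabels
    labels
    hash
    kfType
  }}
}}"""
    return {"query": query}


# All argument values below are illustrative placeholders.
payload = build_services_payload(
    "customer", 'acme "east"', 3600, "apm", "service", "2024-01-01T00:00:00Z"
)
body = json.dumps(payload)  # what would be sent as the POST body
```

Quoting each value instead of pasting it between literal quote characters avoids a broken query when a filter value itself contains a double quote.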
Get Traces
{
traces(
durationSecs: ${durationSecs}
filter: ${buildTracesFilter(...)}
limit: ${limit}
pageNum: ${pageNum}
timestamp: "${endTime}"
sortField: "${sortBy}"
sortOrder: ${sortOrder}
) {
traceId
span {
spanId
parentSpanId
startTimeNs
endTimeNs
attributes
durationNs
name
service {
name
labels
hash
distinctLabels
}
statusCode
method
endpoint
rootSpan
}
traceMetrics {
spanCount
serviceExecTimeNs
}
}
}
Database-Specific Queries
Common Query Parameters
Time Windows
- stepInMs: Step size in milliseconds for instant queries
- rateIntervalSeconds: Rate window, given as a Prometheus duration string (e.g., "5m", "1h")
- durationSecs: Total duration in seconds for the query window
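The time parameters above are related: durationSecs and the end timestamp describe the window, while stepInMs sets the resolution within it. The Python sketch below derives all three from a start and end time. How the APM frontend actually picks stepInMs is not documented here, so dividing the window into a fixed number of points (60 by default) is purely an assumption, as is the function name.

```python
from datetime import datetime, timedelta, timezone


def time_window_params(start: datetime, end: datetime, points: int = 60) -> dict:
    """Derive durationSecs, stepInMs and an ISO 8601 end time from a time range.

    Both datetimes must be timezone-aware.
    ASSUMPTION: stepInMs is the window divided into `points` equal steps.
    """
    if end <= start:
        raise ValueError("end must be after start")
    duration_secs = int((end - start).total_seconds())
    step_in_ms = max(1, duration_secs * 1000 // points)
    end_utc = end.astimezone(timezone.utc)
    return {
        "durationSecs": duration_secs,
        "stepInMs": step_in_ms,
        "endTime": end_utc.isoformat().replace("+00:00", "Z"),
    }


end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
params = time_window_params(end - timedelta(hours=1), end)
# params == {"durationSecs": 3600, "stepInMs": 60000, "endTime": "2024-01-01T12:00:00Z"}
```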
Usage Examples
Example 1: Get P99 latency for a specific service
curl -X POST http://api.example.com/prometheus/api/v1/query \
--data-urlencode 'query=histogram_quantile(0.99, sum(rate(edge_latency_bucket{service_hash="abc123"}[5m])) by (le))'
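For callers that do not use curl, the same request can be prepared with Python's standard library. This is a minimal sketch that only builds the request object and does not send it; the URL is the placeholder host from the curl example, and the function name is illustrative. The query is form-encoded so that characters such as quotes, braces, and spaces survive transport.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# Placeholder host taken from the curl example above.
PROMETHEUS_QUERY_URL = "http://api.example.com/prometheus/api/v1/query"


def build_instant_query_request(promql: str) -> Request:
    """Prepare a form-encoded POST to the Prometheus instant query endpoint."""
    body = urlencode({"query": promql}).encode("ascii")
    return Request(
        PROMETHEUS_QUERY_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


promql = (
    'histogram_quantile(0.99, sum(rate(edge_latency_bucket'
    '{service_hash="abc123"}[5m])) by (le))'
)
request = build_instant_query_request(promql)
# urllib.request.urlopen(request) would send it against a reachable server.
```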
Notes
- All latency values in Prometheus metrics are in milliseconds
- GraphQL timestamps use ISO 8601 format
- The span_type!="db" filter excludes database operations from general service metrics
- APDEX thresholds are typically 0.5s (satisfied) and 1.0s (tolerable)
- Rate intervals should be at least 4x the scrape interval for accuracy
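The last note can be enforced in code by clamping the requested window to a floor of four scrape intervals. The Python sketch below shows the idea; the function names are hypothetical and the 15-second scrape interval is an assumed example value, since the actual scrape interval is not stated in this document.

```python
def min_rate_window_ms(scrape_interval_seconds: float, factor: int = 4) -> int:
    """Smallest rate window, in milliseconds, that spans `factor` scrapes."""
    return int(factor * scrape_interval_seconds * 1000)


def safe_rate_window_ms(step_in_ms: int, scrape_interval_seconds: float) -> int:
    """Use the requested step, but never less than 4x the scrape interval."""
    return max(step_in_ms, min_rate_window_ms(scrape_interval_seconds))


# ASSUMPTION: a 15-second scrape interval, so the floor is 60000 ms.
short_window = safe_rate_window_ms(10_000, 15)   # raised to 60000
long_window = safe_rate_window_ms(300_000, 15)   # kept at 300000
```

A window shorter than the floor may contain too few samples for rate or increase to produce a stable value, which is why the requested step is only used when it is already large enough.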