APM Instrumentation Best Practices
Effective APM instrumentation requires more than attaching an agent—it requires deliberate decisions about service naming, span design, sampling strategy, and attribute cardinality. This page consolidates production-grade guidance for instrumenting services with Kloudfuse APM across Java, Python, and Go.
Naming Conventions
Good naming is the foundation of useful trace data. Span names and service names are indexed by the platform and used as grouping keys—mistakes here are expensive to fix after data is collected.
Service Names
service.name is the single most important resource attribute. Every service must set it explicitly.
- Use a stable, lowercase, kebab-case identifier: checkout-service, inventory-api, payments-worker
- Names must be identical across all horizontally scaled instances of the same service
- Never rely on the SDK default (unknown_service:<executable>); it prevents all per-service filtering
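The naming rules above can be checked in CI or at startup. The following is a minimal sketch, assuming the kebab-case convention described here; `is_valid_service_name` and the regular expression are illustrative names, not part of any OpenTelemetry SDK.

```python
import re

# Assumed convention from the list above: lowercase kebab-case, starting with a letter.
SERVICE_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def is_valid_service_name(name: str) -> bool:
    """Return True for kebab-case names; reject the SDK default and mixed case."""
    if name.startswith("unknown_service"):
        return False  # SDK fallback value, never acceptable
    return bool(SERVICE_NAME_RE.match(name))
```

A deployment script could call this on the value of OTEL_SERVICE_NAME and fail fast when the name is missing or malformed.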
Span Names
Span names must be low-cardinality. They are indexed by the backend and used to group spans for latency histograms and error rates.
| Operation type | Correct span name | Incorrect span name |
|---|---|---|
| HTTP server | | |
| HTTP client | | |
| Database | | |
| Message queue | | |
| Background job | | |
Do not embed user IDs, request IDs, order IDs, timestamps, or full URLs in span names. Those values belong in span attributes where they are stored per-span without affecting aggregation keys.
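As a rough illustration of this split between names and attributes, the sketch below builds an HTTP span name from the method and the matched route template, and keeps the per-request values in attributes. The function names and the `app.user.id` key are hypothetical, chosen only for this example.

```python
from typing import Optional


def span_name(method: str, route_template: Optional[str]) -> str:
    """Build a low-cardinality span name from the method and route template."""
    method = method.upper()
    # With no matched route, fall back to the bare method rather than the raw path.
    return f"{method} {route_template}" if route_template else method


def span_attributes(resolved_path: str, user_id: str) -> dict:
    """High-cardinality values go into attributes, never into the span name."""
    return {"url.path": resolved_path, "app.user.id": user_id}  # app.user.id is illustrative
```

With this shape, every request to the same route shares one span name, while the resolved path and user ID remain searchable per span.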
Resource Attributes
Resource attributes describe the entity that produced telemetry. They are set once at SDK startup and attached to all spans, metrics, and logs.
Core Attributes
| Attribute | Requirement | Description |
|---|---|---|
| service.name | Required | Logical name of the service. Must be set, no exceptions. |
| service.namespace | Recommended | Namespace grouping services by team or domain. |
| service.version | Recommended | Semantic version, git SHA, or build tag. |
| service.instance.id | Recommended | Globally unique identifier for this instance. Use pod name or a UUID. |
| deployment.environment.name | Recommended | Deployment tier. |

Use deployment.environment.name (not deployment.environment, which is deprecated).
Setting Attributes via Environment Variables
All OpenTelemetry SDKs support these standard environment variables:
OTEL_SERVICE_NAME=checkout-service
OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,service.version=1.4.2,deployment.environment.name=production,service.instance.id=pod-abc-123
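To make the format concrete, here is a small sketch of how such a comma-separated key=value string can be parsed and checked before deployment. It is not the SDK's parser; `parse_resource_attributes` and `RECOMMENDED_KEYS` are names assumed for this example.

```python
from urllib.parse import unquote

# Keys this page recommends setting; used only for the check below.
RECOMMENDED_KEYS = {"service.namespace", "service.version", "deployment.environment.name"}


def parse_resource_attributes(raw: str) -> dict:
    """Parse a comma-separated key=value string such as OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        attrs[key.strip()] = unquote(value.strip())  # values may be percent-encoded
    return attrs


def missing_recommended(attrs: dict) -> set:
    """Return the recommended keys that are absent from the parsed attributes."""
    return RECOMMENDED_KEYS - attrs.keys()
```

Running `missing_recommended` against the environment in a pre-deploy check is one way to catch a forgotten attribute before telemetry is emitted.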
Automatic Resource Detection
Resource detectors populate cloud and infrastructure metadata automatically at SDK startup—no manual configuration required for most environments.
Java (javaagent):
# Enable cloud provider detection (off by default)
-Dotel.resource.providers.aws.enabled=true
-Dotel.resource.providers.gcp.enabled=true
-Dotel.resource.providers.azure.enabled=true
Go:
res, err := resource.New(ctx,
    resource.WithFromEnv(),   // reads OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME
    resource.WithProcess(),   // PID, executable name, and other process details
    resource.WithOS(),        // OS type and description
    resource.WithContainer(), // container ID
    resource.WithHost(),      // hostname
)
For Kubernetes environments, use the kf-agent k8sattributes processor to enrich all telemetry with pod name, namespace, node name, and deployment name without requiring any SDK changes.
Sampling Strategy
Sampling controls what fraction of traces are collected and exported. The right strategy balances visibility against cost and storage volume.
Head-Based Sampling
Sampling decisions are made at the root span before any child spans are created. The decision propagates to all downstream services via the traceparent header.
| Sampler | When to use |
|---|---|
| always_on | Low-traffic services or development: capture everything. |
| parentbased_traceidratio | Recommended for production. Samples a configured percentage of new traces while respecting sampling decisions from upstream callers. |
| always_off | Silence a noisy service that adds no debugging value. |
Configure via environment variables:
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.05 # 5% of new root traces
The parentbased_* samplers are critical for distributed tracing consistency: if Service A decides to sample a trace and propagates that decision to Service B, Service B will also sample it. Without parentbased_*, services make independent decisions, producing traces with missing spans.
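The decision logic can be sketched in a few lines. This is a simplified model of a parent-based ratio sampler, assuming the ratio is applied to the low 64 bits of the trace ID; the exact algorithm differs between SDKs, and `should_sample` is a hypothetical name.

```python
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Follow the parent's decision if there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled  # keeps the whole trace consistent across services
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    # Deterministic: the same trace ID always yields the same decision.
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because the root decision depends only on the trace ID and downstream services follow the parent flag, every service in a request reaches the same answer.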
Tail-Based Sampling
Tail-based sampling defers the decision until after the full trace is collected, enabling criteria such as "always capture error traces" or "always capture the slowest 1%." This is implemented in the OpenTelemetry Collector, not in the SDK.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
Use tail-based sampling when you cannot afford to miss error traces or latency outliers.
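The policy evaluation can be pictured as follows. This is an illustrative model only, not the Collector's implementation: the `Span` type and `keep_trace` function are hypothetical, and trace latency is approximated as the time between the earliest span start and the latest span end.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(spans, threshold_ms=1000.0, sampling_percentage=5.0, rng=random.random):
    """Keep a completed trace if any policy matches: errors, latency, then probabilistic."""
    if any(s.is_error for s in spans):
        return True  # errors policy
    if spans:
        duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
        if duration > threshold_ms:
            return True  # latency policy
    return rng() * 100 < sampling_percentage  # probabilistic fallback
```

The `rng` parameter exists only so the probabilistic branch can be exercised deterministically in tests.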
Context Propagation
Trace context must be propagated across service boundaries for distributed traces to assemble correctly.
W3C TraceContext (Default)
All OpenTelemetry SDKs default to W3C TraceContext propagation. Two HTTP headers carry the trace:
- traceparent: contains the trace ID, span ID, and sampling flag
- tracestate: carries vendor-specific metadata including sampling probability
The default propagator configuration (tracecontext,baggage) is appropriate for most deployments:
OTEL_PROPAGATORS=tracecontext,baggage
For systems using Zipkin or older Jaeger agents, add B3 compatibility:
OTEL_PROPAGATORS=tracecontext,baggage,b3
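When debugging propagation, it can help to decode a traceparent value by hand. The sketch below handles the four-field layout (version, trace ID, span ID, flags) and is meant for inspection only; `parse_traceparent` is a hypothetical helper, and SDK propagators should be used in application code.

```python
import re

# version - 32 hex trace ID - 16 hex span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return trace ID, span ID and sampled flag, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context format.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }
```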
Baggage
W3C Baggage propagates key-value pairs through the entire request chain as HTTP headers. Unlike span attributes, baggage values are available in-flight to every service handling the request without reading the trace backend.
Appropriate uses: tenant ID, cost-center tag, feature flag assignment, routing hint.
Baggage travels as plain HTTP headers, including to third-party services. Never put credentials, API keys, session tokens, or personally identifiable information in baggage.
Keep baggage small—prefer short identifiers over objects, and limit entries to 5–10 at most.
Python:
from opentelemetry import baggage, context
# Set baggage on outgoing context
ctx = baggage.set_baggage("tenant.id", "tenant-abc")
context.attach(ctx)
# Read in a downstream service
tenant_id = baggage.get_baggage("tenant.id")
Go:
// Set baggage
member, _ := baggage.NewMember("tenant.id", "tenant-abc")
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)
// Inject into outgoing HTTP request (if not using otelhttp)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
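The size and safety guidance for baggage can be enforced with a small guard before values are set. This is a sketch under assumptions: the entry limit reflects the upper end of the range suggested above, the denylist is illustrative rather than exhaustive, and `build_baggage_header` is a hypothetical helper that only mimics the key=value,key=value header shape.

```python
from urllib.parse import quote

MAX_ENTRIES = 10  # upper end of the guidance on this page
BLOCKED_KEY_PARTS = ("token", "password", "secret", "api_key", "email")  # illustrative denylist


def build_baggage_header(entries: dict) -> str:
    """Validate baggage entries and serialize them as a comma-separated header value."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError(f"too many baggage entries: {len(entries)} > {MAX_ENTRIES}")
    for key in entries:
        if any(part in key.lower() for part in BLOCKED_KEY_PARTS):
            raise ValueError(f"baggage key looks sensitive: {key}")
    # Percent-encode values so commas and equals signs cannot break the header.
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in entries.items())
```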
Error Recording
Two operations are required together when an error occurs. Neither implies the other, and omitting either produces incomplete data.
- RecordError / record_exception: records the exception as a span event with exception.type, exception.message, and exception.stacktrace
- SetStatus(ERROR, description): marks the span as failed for aggregation, alerting, and error rate calculations
If you call RecordError without SetStatus, the span status remains Unset and the span will not appear in error rate metrics.
Java
Span span = tracer.spanBuilder("process-payment").startSpan();
try (Scope scope = span.makeCurrent()) {
processPayment();
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR, e.getMessage());
} finally {
span.end();
}
Python
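from opentelemetry.trace import StatusCode
# The context manager records exceptions that propagate out of the block by default;
# disable that here so the explicit calls below do not produce a duplicate event.
with tracer.start_as_current_span(
    "process-payment",
    record_exception=False,
    set_status_on_exception=False,
) as span:
    try:
        process_payment()
    except Exception as e:
        span.record_exception(e)
        span.set_status(StatusCode.ERROR, str(e))
        raise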
Go
ctx, span := tracer.Start(ctx, "process-payment")
defer span.End()
if err := processPayment(ctx); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
When to Set ERROR Status
- HTTP 5xx responses (on both server and client spans)
- HTTP 4xx responses on the client side (the outbound call failed)
- Unhandled exceptions that terminate the operation
- Database errors, connection timeouts, or message processing failures
Do not set ERROR for HTTP 4xx on the server side—those are client errors handled by the server, not server failures.
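The HTTP rules above reduce to a small decision function. The sketch below is one way to express them; `is_error_status` and the string values for the span kind are assumptions made for this example.

```python
def is_error_status(status_code: int, span_kind: str) -> bool:
    """Decide whether an HTTP response should set the span status to ERROR.

    span_kind is "server" or "client" (illustrative values for this sketch).
    """
    if status_code >= 500:
        return True  # server failures are errors on both sides
    if 400 <= status_code < 500:
        return span_kind == "client"  # a 4xx means the outbound call failed
    return False
```

A 404 therefore marks a client span as failed but leaves the server span that produced it Unset.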
Cardinality Management
High cardinality in span attributes and metric labels degrades backend performance and increases storage cost. It is one of the most common production problems with APM deployments.
Anti-Patterns to Avoid
| Anti-pattern | Problem |
|---|---|
| User ID, session ID, or order ID in span name | Every unique ID creates a new grouping bucket. Aggregations become meaningless. |
| Full URL in span name | Every URL is unique. No two requests share a span name. |
| Request ID as a metric label | Metric series count explodes. Backends run out of memory. |
| Timestamp embedded in any label or name | Always unique. Destroys aggregation. |
| Unbounded enum values in metric labels | Product IDs, SKUs, transaction codes: each new value adds a new series. |
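A quick back-of-the-envelope calculation shows why a single unbounded label is so costly. In the worst case the number of metric series is the product of the distinct values of each label; the label names and counts below are made up for illustration.

```python
from math import prod


def max_series_count(label_cardinalities: dict) -> int:
    """Worst-case number of metric series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"http.request.method": 5, "http.route": 20, "http.response.status_code": 6}
with_request_id = {**bounded, "request.id": 1_000_000}  # hypothetical unbounded label
```

Here `max_series_count(bounded)` is 600, while adding the request ID label raises the ceiling to 600,000,000.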
What to Do Instead
- Put user IDs, order IDs, and request IDs in span attributes, not span names or metric labels
- Use route templates (/users/{id}) as span names, never resolved paths
- Use the OpenTelemetry Collector attributes processor to drop or hash high-cardinality fields before they reach your metrics backend
- Set OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT to prevent large exception stacktraces from bloating per-span storage
OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=1024
OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT=64
OTEL_SPAN_EVENT_COUNT_LIMIT=10
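The drop, hash, and truncate operations described in this section can be modeled in a few lines. This sketch only illustrates the behavior; it is not the Collector processor or any SDK API, and `scrub_attributes` and the attribute keys in the usage note are hypothetical.

```python
import hashlib


def scrub_attributes(attrs: dict, drop_keys=(), hash_keys=(), max_value_length=1024) -> dict:
    """Drop, hash, or truncate attribute values before they leave the process."""
    scrubbed = {}
    for key, value in attrs.items():
        if key in drop_keys:
            continue  # remove the field entirely
        if key in hash_keys:
            # Replace the raw value with a fixed-length SHA-256 digest.
            value = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        elif isinstance(value, str) and len(value) > max_value_length:
            value = value[:max_value_length]  # same idea as a value length limit
        scrubbed[key] = value
    return scrubbed
```

For example, `scrub_attributes(attrs, drop_keys={"request.id"}, hash_keys={"user.id"})` removes the request ID and replaces the user ID with a digest.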
Performance Tuning
BatchSpanProcessor
Always use BatchSpanProcessor in production. The alternative (SimpleSpanProcessor) exports synchronously on every span end, adding latency to every instrumented operation.
The BSP collects spans in a queue and exports them in batches on a background thread. Tune these settings based on your span throughput:
| Environment variable | Default | Guidance |
|---|---|---|
| OTEL_BSP_SCHEDULE_DELAY | 5000 ms | Lower (e.g., 1000 ms) for near-real-time visibility in Kloudfuse; higher to reduce CPU overhead. |
| OTEL_BSP_MAX_QUEUE_SIZE | 2048 | Increase for high-throughput services. |
| OTEL_BSP_MAX_EXPORT_BATCH_SIZE | 512 | Must be ≤ OTEL_BSP_MAX_QUEUE_SIZE. |
| OTEL_BSP_EXPORT_TIMEOUT | 30000 ms | Reduce to 10 s for faster circuit-breaking if the Kloudfuse agent is unreachable. |
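To see how the queue size and batch size interact, consider this simplified model of a bounded span queue that drops new spans once it is full and drains in batches. It is a teaching sketch, not the SDK's BatchSpanProcessor; `BoundedSpanQueue` and its methods are hypothetical names.

```python
from collections import deque


class BoundedSpanQueue:
    """Toy model of a batching queue: bounded capacity, batch-sized drains."""

    def __init__(self, max_queue_size: int = 2048, max_export_batch_size: int = 512):
        if max_export_batch_size > max_queue_size:
            raise ValueError("batch size must be <= queue size")
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self._queue = deque()
        self.dropped = 0

    def offer(self, span) -> bool:
        """Enqueue a span; when the queue is full the span is dropped, not blocked on."""
        if len(self._queue) >= self.max_queue_size:
            self.dropped += 1
            return False
        self._queue.append(span)
        return True

    def next_batch(self) -> list:
        """Remove and return up to max_export_batch_size spans, oldest first."""
        count = min(self.max_export_batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]
```

If the dropped counter grows under load in a model like this, the equivalent production fix is a larger queue or a shorter schedule delay.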
Language-Specific Best Practices
Each language SDK has distinct setup patterns, lifecycle requirements, and common pitfalls. The sections below consolidate the most important production guidance for Java, Python, and Go.
Java
Agent Configuration File
For non-trivial deployments, use a properties file instead of a long list of -D flags. Pass it with -Dotel.javaagent.configuration-file:
# otel-config.properties
otel.service.name=checkout-service
otel.traces.exporter=otlp
otel.exporter.otlp.endpoint=http://kf-agent:4317
otel.exporter.otlp.compression=gzip
otel.traces.sampler=parentbased_traceidratio
otel.traces.sampler.arg=0.05
otel.propagators=tracecontext,baggage
otel.resource.providers.aws.enabled=true
otel.javaagent.logging=application
otel.bsp.schedule.delay=2000
otel.bsp.max.queue.size=4096
java -javaagent:/path/to/opentelemetry-javaagent.jar \
-Dotel.javaagent.configuration-file=/etc/otel/otel-config.properties \
-jar app.jar
Selective Instrumentation
Disable noisy or irrelevant auto-instrumentation without losing everything else:
# Suppress a specific library
-Dotel.instrumentation.log4j-appender.enabled=false
# Start from zero and enable only what you need
-Dotel.instrumentation.common.default-enabled=false
-Dotel.instrumentation.opentelemetry-api.enabled=true
-Dotel.instrumentation.spring-webmvc.enabled=true
-Dotel.instrumentation.jdbc.enabled=true
Span Annotations with @WithSpan
The javaagent intercepts the @WithSpan annotation to create spans without modifying business logic:
@WithSpan("process-payment")
public PaymentResult processPayment(
@SpanAttribute("payment.method") String method,
@SpanAttribute("payment.amount") double amount
) {
// Span is created automatically; parameters become span attributes
}
Cloud Resource Providers
Cloud provider detection is disabled by default to avoid startup latency in non-cloud environments. Enable the provider that matches your deployment:
-Dotel.resource.providers.aws.enabled=true # EC2, ECS, EKS, Lambda
-Dotel.resource.providers.gcp.enabled=true # GCE, GKE, Cloud Run
-Dotel.resource.providers.azure.enabled=true # Azure VMs, AKS
Best Practices
- Set service.name explicitly; never rely on the unknown_service default
- Tag with service.namespace, service.version, and deployment.environment.name in otel.resource.attributes
- Use parentbased_traceidratio with a 5–10% sample rate for high-traffic production services
- Enable otel.exporter.otlp.compression=gzip to reduce network overhead
- Route agent logs to your application logging framework with otel.javaagent.logging=application
- Prefer @WithSpan for business-logic spans; it is less intrusive than the programmatic API and automatically propagates trace context
- Disable instrumentation for high-volume, low-value scheduled jobs (e.g., health checks via Spring Scheduling)
- Enable cloud resource providers matching your environment to populate instance and cluster metadata automatically
See Java Instrumentation → for the full setup guide.
Python
Auto-Instrumentation Setup
Install all detected library instrumentors in one step:
pip install opentelemetry-distro[otlp] opentelemetry-instrumentation
opentelemetry-bootstrap -a install # detects and installs instrumentation packages
Run your application with full auto-instrumentation:
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://kf-agent:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py
Manual SDK Initialization
When you need more control than environment variables provide, initialize the SDK programmatically:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
resource = Resource.create({
SERVICE_NAME: "checkout-service",
"service.version": "1.4.2",
"deployment.environment.name": "production",
})
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://kf-agent:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
Fork Safety (Gunicorn, uWSGI)
The Python BatchSpanProcessor spawns background threads and is not fork-safe. If your application server forks worker processes (Gunicorn, uWSGI, multiprocessing), initializing the SDK before the fork will cause threads to be silently lost in child processes, resulting in no traces exported.
Always initialize the OTel SDK inside the worker initialization hook, after the fork:
# gunicorn.conf.py
def post_fork(server, worker):
    # Initialize OTel here, after the fork, not at module import time
    setup_opentelemetry()
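For servers without a convenient post-fork hook, one possible safeguard is to key initialization on the process ID so that a forked child initializes again. This is a sketch, not an OpenTelemetry API: `ensure_initialized` is a hypothetical helper, and its `setup` argument stands in for a function such as the `setup_opentelemetry()` used above.

```python
import os
from typing import Callable, Optional

_initialized_pid: Optional[int] = None  # PID of the process that last ran setup


def ensure_initialized(setup: Callable[[], None]) -> bool:
    """Run setup once per process; a forked child has a new PID and runs it again."""
    global _initialized_pid
    pid = os.getpid()
    if _initialized_pid == pid:
        return False  # already initialized in this process
    setup()
    _initialized_pid = pid
    return True
```

Calling `ensure_initialized(setup_opentelemetry)` at the start of request handling would then be safe in both the parent and each worker.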
Best Practices
- Use opentelemetry-bootstrap -a install after adding new dependencies; it detects and installs matching instrumentation packages automatically
- Use opentelemetry-instrument for library coverage (HTTP, DB, etc.) and add manual spans only for business logic
- Set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.05 in production
- If using Gunicorn or uWSGI, always initialize the OTel SDK inside post_fork, never at module import time
- Install cloud resource detector packages (opentelemetry-resource-detector-aws, opentelemetry-resource-detector-gcp) to populate environment metadata
See Python Instrumentation → for the full setup guide.
Go
SDK Initialization and Graceful Shutdown
Always call tp.Shutdown() on exit. Without it, the BatchSpanProcessor may not flush spans that are still queued when the process exits.
func initTracer(ctx context.Context) (func(context.Context) error, error) {
exp, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("kf-agent:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res, err := resource.Merge(
resource.Default(),
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("my-service"),
semconv.ServiceVersion("1.0.0"),
attribute.String("deployment.environment.name", "production"),
),
)
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exp),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.05),
)),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
return tp.Shutdown, nil
}
func main() {
ctx := context.Background()
shutdown, err := initTracer(ctx)
if err != nil {
log.Fatal(err)
}
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := shutdown(shutdownCtx); err != nil {
log.Printf("tracer shutdown error: %v", err)
}
}()
// ... application code
}
Context Threading
In Go, the active span lives in context.Context. You must pass context explicitly through every function call. Goroutines do not inherit context automatically—if you forget to pass it, child operations will start new root spans instead of appearing as children.
func handleRequest(ctx context.Context) {
    ctx, span := tracer.Start(ctx, "handle-request")
    defer span.End()
    // Pass ctx to all child operations
    result, err := fetchData(ctx)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return
    }
    _ = result // placeholder: use the result in real code
    // Pass ctx explicitly to goroutines; do not capture it from the outer scope
    go func(ctx context.Context) {
        _, childSpan := tracer.Start(ctx, "async-task")
        defer childSpan.End()
        // ...
    }(ctx)
}
HTTP Instrumentation
Use otelhttp for automatic context propagation on both server and client sides:
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
// Server: wraps the handler and extracts incoming trace context
http.Handle("/api/orders", otelhttp.NewHandler(ordersHandler, "handle-orders"))
// Client: injects trace context into outgoing requests
client := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
For manual propagation without otelhttp:
// Inject into outgoing request
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Extract from incoming request
ctx = otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
Best Practices
- Always defer tp.Shutdown(ctx) in main(); without it the BatchSpanProcessor may drop queued spans on exit
- Pass ctx explicitly through every function call; goroutines do not inherit context automatically
- Use otelhttp.NewHandler and otelhttp.NewTransport instead of manually injecting/extracting headers
- Register both propagation.TraceContext{} and propagation.Baggage{} as composite propagators at startup
- Use sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)) as the sampler for production services
- Use resource.WithFromEnv() alongside programmatic resource attributes so OTEL_RESOURCE_ATTRIBUTES can override values at deploy time without recompiling
See Go Instrumentation → for the full setup guide.
Semantic Conventions Reference
OpenTelemetry defines standard attribute names for common operation types. Using these ensures Kloudfuse and other backends can correctly parse and display trace data.
HTTP
| Attribute | Stability | Use |
|---|---|---|
| http.request.method | Stable | |
| http.response.status_code | Stable | Integer |
| http.route | Stable | Matched route template |
| url.full | Stable | Absolute URL for client spans. Replaces deprecated http.url. |
| url.scheme | Stable | |
| server.address | Stable | Server hostname or IP |
| server.port | Stable | Server port number |
| error.type | Stable | HTTP status code string or exception type for errors |

The attributes http.url, http.scheme, http.method, and http.status_code are deprecated. Use url.full, url.scheme, http.request.method, and http.response.status_code respectively.
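When older instrumentation still emits the deprecated names, the renames listed above can be applied mechanically. The following sketch shows the mapping as a plain dictionary transform; `migrate_attributes` is a hypothetical helper, and in a real pipeline the same renames might instead be done in a Collector processor.

```python
# Deprecated HTTP attribute names and their current replacements, as listed above.
DEPRECATED_TO_CURRENT = {
    "http.url": "url.full",
    "http.scheme": "url.scheme",
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}


def migrate_attributes(attrs: dict) -> dict:
    """Rename deprecated keys; if both old and new keys are present, the new one wins."""
    migrated = {}
    for key, value in attrs.items():
        new_key = DEPRECATED_TO_CURRENT.get(key, key)
        if key != new_key and new_key in migrated:
            continue  # current-name value already present, keep it
        migrated[new_key] = value
    return migrated
```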
Database
| Attribute | Stability | Use |
|---|---|---|
| db.system.name | Stable | |
| db.operation.name | Stable | |
| db.namespace | Stable | Database or schema name |
| db.collection.name | Stable | Table, collection, or index name |
| db.query.text | Stable | Full query text (sanitize before enabling in production) |
| db.query.summary | Stable | Low-cardinality summary |
Messaging
| Attribute | Stability | Use |
|---|---|---|
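| messaging.system | Development | |
| messaging.operation.name | Development | |
| messaging.destination.name | Development | Queue or topic name |
| messaging.destination.template | Development | Low-cardinality template |
| messaging.consumer.group.name | Development | Consumer group identifier |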
Production Checklist
- Set OTEL_SERVICE_NAME explicitly; never rely on the unknown_service default
- Set service.namespace, service.version, and deployment.environment.name in resource attributes
- Use parentbased_traceidratio sampler at 5–10% for most production services
- Use BatchSpanProcessor (default), never SimpleSpanProcessor, in production
- Enable gzip compression: OTEL_EXPORTER_OTLP_COMPRESSION=gzip
- Call both RecordError and SetStatus(ERROR) together, never one without the other
- Use route templates (/users/{id}) as span names, never resolved paths
- Set attribute and value length limits to prevent storage bloat
- Enable cloud resource providers matching your deployment environment
- Initialize W3C TraceContext + Baggage propagators (SDK default)
- Python: initialize OTel SDK after fork in multiprocessing environments
- Go: always defer tp.Shutdown(ctx) in main(); pass ctx explicitly to all goroutines
- Java: use a .properties config file for complex deployments; enable cloud providers explicitly
References
The guidance on this page is drawn from the following OpenTelemetry specifications, SDK documentation, and language-specific guides.
Specifications and Semantic Conventions
- OpenTelemetry Resource Semantic Conventions: resource attribute definitions including service.name, service.namespace, service.version, and service.instance.id
- Deployment Environment Resource Attribute: deployment.environment.name definition and migration from the deprecated deployment.environment
- HTTP Spans Semantic Conventions: span naming, http.request.method, http.response.status_code, http.route, url.full, and deprecated attribute migration
- Database Client Spans Semantic Conventions: db.system.name, db.operation.name, db.namespace, db.query.summary, and span name format
- Messaging Spans Semantic Conventions: messaging.system, messaging.operation.name, messaging.destination.template, and span name format
- Exception Semantic Conventions: exception.type, exception.message, exception.stacktrace event attributes
Concepts and SDK Configuration
- Sampling (OpenTelemetry): head-based vs tail-based sampling, built-in samplers, parentbased_traceidratio explanation
- Baggage (OpenTelemetry): W3C Baggage propagation, use cases, and security considerations
- General SDK Configuration: OTEL_BSP_*, OTEL_EXPORTER_OTLP_*, OTEL_SPAN_* environment variables and defaults
- Sampling Milestones 2025 (OpenTelemetry Blog): probability sampling via W3C tracestate and the ot=th: key
Java
- Java Agent Getting Started: agent download, -javaagent: flag, and first-run configuration
- Java Agent Configuration Reference: all otel.* JVM system properties, environment variable equivalents, and configuration file usage
- Suppressing Instrumentation (Java Agent): otel.instrumentation.common.default-enabled, per-library otel.instrumentation.[name].enabled, and span suppression strategies
- Java SDK Autoconfiguration: SDK-level configuration used when not using the javaagent
Python
- Python Instrumentation Guide: manual SDK initialization, TracerProvider, BatchSpanProcessor, and resource configuration
- Python Distro and Auto-Instrumentation: opentelemetry-distro, opentelemetry-bootstrap, and opentelemetry-instrument CLI usage
Go
- Go Instrumentation Guide: TracerProvider setup, resource.Merge, graceful shutdown, and context propagation patterns