APM Instrumentation Best Practices

Effective APM instrumentation requires more than attaching an agent—it requires deliberate decisions about service naming, span design, sampling strategy, and attribute cardinality. This page consolidates production-grade guidance for instrumenting services with Kloudfuse APM across Java, Python, and Go.

Naming Conventions

Good naming is the foundation of useful trace data. Span names and service names are indexed by the platform and used as grouping keys—mistakes here are expensive to fix after data is collected.

Service Names

service.name is the single most important resource attribute. Every service must set it explicitly.

  • Use a stable, lowercase, kebab-case identifier: checkout-service, inventory-api, payments-worker

  • Names must be identical across all horizontally scaled instances of the same service

  • Never rely on the SDK default (unknown_service:<executable>)—it prevents all per-service filtering
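The naming rules above can be checked at startup or in CI. The following is a minimal sketch, not part of any SDK: the function name `is_valid_service_name` and the exact regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: lowercase alphanumeric words joined by single hyphens.
_KEBAB_CASE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def is_valid_service_name(name: str) -> bool:
    """Return True if name is a lowercase, kebab-case identifier."""
    # Reject the SDK fallback, which has the form unknown_service:<executable>.
    if name.startswith("unknown_service"):
        return False
    return bool(_KEBAB_CASE.match(name))
```

A deployment script could call this on the value of OTEL_SERVICE_NAME and fail fast when the name is missing or malformed.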

Span Names

Span names must be low-cardinality. They are indexed by the backend and used to group spans for latency histograms and error rates.

| Operation type | Correct span name | Incorrect span name |
|---|---|---|
| HTTP server | GET /users/{userId} | GET /users/12345 |
| HTTP client | POST /api/orders | POST https://api.example.com/api/orders?token=abc |
| Database | SELECT users | SELECT * FROM users WHERE id=99 |
| Message queue | publish shop.orders | publish shop.orders msg-uuid-1234 |
| Background job | process-invoice | process-invoice 2024-03-15T10:32:11Z |

Do not embed user IDs, request IDs, order IDs, timestamps, or full URLs in span names. Those values belong in span attributes where they are stored per-span without affecting aggregation keys.
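As an illustration of that split, the sketch below builds the span name from the route template and moves the resolved identifiers into attributes. The helper name `describe_http_request` and the `app.` attribute prefix are hypothetical; `http.request.method` and `http.route` are the semantic-convention names listed later on this page.

```python
from typing import Dict, Tuple


def describe_http_request(
    method: str, route_template: str, path_params: Dict[str, object]
) -> Tuple[str, Dict[str, str]]:
    """Return a low-cardinality span name plus per-span attributes."""
    # The span name uses only the method and the route template.
    span_name = f"{method.upper()} {route_template}"
    attributes = {
        "http.request.method": method.upper(),
        "http.route": route_template,
    }
    # High-cardinality values go into attributes (hypothetical app.* keys).
    for key, value in path_params.items():
        attributes[f"app.{key}"] = str(value)
    return span_name, attributes
```

Every request to the same route then shares one span name, while the individual user ID is still searchable on the span itself.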

Resource Attributes

Resource attributes describe the entity that produced telemetry. They are set once at SDK startup and attached to all spans, metrics, and logs.

Core Attributes

| Attribute | Requirement | Description |
|---|---|---|
| service.name | Required | Logical name of the service. Must be set, no exceptions. |
| service.namespace | Recommended | Namespace grouping services by team or domain (e.g., payments, platform). |
| service.version | Recommended | Semantic version, git SHA, or build tag (e.g., 2.3.1, a01dbef8). |
| service.instance.id | Recommended | Globally unique identifier for this instance. Use pod name or a UUID. |
| deployment.environment.name | Recommended | Deployment tier: production, staging, development. |

Use deployment.environment.name (not deployment.environment, which is deprecated).

Setting Attributes via Environment Variables

All OpenTelemetry SDKs support these standard environment variables:

OTEL_SERVICE_NAME=checkout-service
OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,service.version=1.4.2,deployment.environment.name=production,service.instance.id=pod-abc-123
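To sanity-check an OTEL_RESOURCE_ATTRIBUTES value before deploying it, the string can be split into key-value pairs. This is a simplified sketch of how such a comma-separated list is read, assuming percent-encoded values; the function name `parse_resource_attributes` is hypothetical, and real SDK parsers may handle edge cases differently.

```python
from typing import Dict
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Split a comma-separated key=value list into a dictionary."""
    attributes = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        # Values may be percent-encoded, so decode them.
        attributes[key.strip()] = unquote(value.strip())
    return attributes
```

Running it on the example value above yields four attributes, which makes a missing comma or a misspelled key easy to spot.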

Automatic Resource Detection

Resource detectors populate cloud and infrastructure metadata automatically at SDK startup—no manual configuration required for most environments.

Java (javaagent):

# Enable cloud provider detection (off by default)
-Dotel.resource.providers.aws.enabled=true
-Dotel.resource.providers.gcp.enabled=true
-Dotel.resource.providers.azure.enabled=true

Go:

res, err := resource.New(ctx,
    resource.WithFromEnv(),    // reads OTEL_RESOURCE_ATTRIBUTES
    resource.WithProcess(),    // PID, executable name
    resource.WithOS(),         // OS type
    resource.WithContainer(),  // container ID
    resource.WithHost(),       // hostname
)

For Kubernetes environments, use the kf-agent k8sattributes processor to enrich all telemetry with pod name, namespace, node name, and deployment name without requiring any SDK changes.

Sampling Strategy

Sampling controls what fraction of traces are collected and exported. The right strategy balances visibility against cost and storage volume.

Head-Based Sampling

Sampling decisions are made at the root span before any child spans are created. The decision propagates to all downstream services via the traceparent header.

| Sampler | When to use |
|---|---|
| parentbased_always_on | Low-traffic services or development: capture everything. |
| parentbased_traceidratio | Recommended for production. Samples a configured percentage of new traces while respecting sampling decisions from upstream callers. |
| parentbased_always_off | Silence a noisy service that adds no debugging value. |

Configure via environment variables:

OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.05   # 5% of new root traces

The parentbased_* samplers are critical for distributed tracing consistency: if Service A decides to sample a trace and propagates that decision to Service B, Service B will also sample it. Without parentbased_*, services make independent decisions, producing traces with missing spans.
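The decision logic can be modeled in a few lines. This is a simplified sketch of a parent-based, trace-ID-ratio decision, not the code of any particular SDK: the function name `should_sample` is hypothetical, and the comparison of the low 64 bits of the trace ID against a ratio-derived bound is an assumption about how ratio samplers are commonly implemented.

```python
from typing import Optional

_TWO_POW_64 = 1 << 64


def should_sample(trace_id: int, parent_sampled: Optional[bool], ratio: float) -> bool:
    """Model a parent-based, trace-ID-ratio sampling decision.

    parent_sampled is None for a root span, otherwise the upstream decision.
    """
    if parent_sampled is not None:
        # A parent exists: follow its decision so the trace stays complete.
        return parent_sampled
    # Root span: derive the decision deterministically from the trace ID.
    bound = int(ratio * _TWO_POW_64)
    return (trace_id & (_TWO_POW_64 - 1)) < bound
```

Because the root decision depends only on the trace ID and the ratio, and every child defers to its parent, all services in a request path agree on whether the trace is kept.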

Tail-Based Sampling

Tail-based sampling defers the decision until after the full trace is collected, enabling criteria such as "always capture error traces" or "always capture traces slower than one second." This is implemented in the OpenTelemetry Collector, not in the SDK.

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Use tail-based sampling when you cannot afford to miss error traces or latency outliers.
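The three policies in the configuration above combine as "keep the trace if any policy matches". The sketch below models that logic on a completed trace. It is an illustration only: `SpanSummary` and `keep_trace` are hypothetical names, and the random draw stands in for whatever mechanism the Collector actually uses for its probabilistic policy.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpanSummary:
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[SpanSummary],
    threshold_ms: float = 1000.0,
    sampling_percentage: float = 5.0,
    draw: Callable[[], float] = random.random,
) -> bool:
    """Return True if a completed trace should be kept."""
    if not spans:
        return False
    # Policy 1: keep any trace that contains an error span.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration reaches the threshold.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms >= threshold_ms:
        return True
    # Policy 3: keep a random share of the remaining traces.
    return draw() * 100.0 < sampling_percentage
```

The decision needs every span of the trace, which is why the processor waits (decision_wait) before evaluating the policies.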

Context Propagation

Trace context must be propagated across service boundaries for distributed traces to assemble correctly.

W3C TraceContext (Default)

All OpenTelemetry SDKs default to W3C TraceContext propagation. Two HTTP headers carry the trace:

  • traceparent: contains the trace ID, span ID, and sampling flag

  • tracestate: carries vendor-specific metadata including sampling probability
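When debugging propagation problems it helps to decode the traceparent header by hand. The sketch below parses the version 00 layout (version, trace ID, parent span ID, flags, separated by hyphens). The names `TraceParent` and `parse_traceparent` are hypothetical, and the sketch does not attempt to handle future header versions.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the sampled flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A header ending in -01 carries a positive sampling decision; one ending in -00 tells downstream parent-based samplers to drop the trace.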

The default propagator configuration (tracecontext,baggage) is appropriate for most deployments:

OTEL_PROPAGATORS=tracecontext,baggage

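For systems that still propagate Zipkin B3 headers, add B3 compatibility (older Jaeger clients use their own uber-trace-id header, which the separate jaeger propagator handles):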

OTEL_PROPAGATORS=tracecontext,baggage,b3

Baggage

W3C Baggage propagates key-value pairs through the entire request chain as HTTP headers. Unlike span attributes, baggage values are available in-flight to every service handling the request without reading the trace backend.

Appropriate uses: tenant ID, cost-center tag, feature flag assignment, routing hint.

Baggage travels as plain HTTP headers, including to third-party services. Never put credentials, API keys, session tokens, or personally identifiable information in baggage.

Keep baggage small—prefer short identifiers over objects, and limit entries to 5–10 at most.

Python:

from opentelemetry import baggage, context

# Set baggage on outgoing context
ctx = baggage.set_baggage("tenant.id", "tenant-abc")
context.attach(ctx)

# Read in a downstream service
tenant_id = baggage.get_baggage("tenant.id")

Go:

// Set baggage
member, _ := baggage.NewMember("tenant.id", "tenant-abc")
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)

// Inject into outgoing HTTP request (if not using otelhttp)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
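The baggage guidance in this section (few entries, short values, nothing sensitive) can be enforced with a check that runs before baggage is set. This is a minimal sketch under stated assumptions: the function name `baggage_problems`, the 64-character value limit, and the list of sensitive-looking key fragments are all hypothetical choices, not part of OpenTelemetry.

```python
from typing import Dict, List

MAX_ENTRIES = 10          # upper end of the 5-10 entry guidance
MAX_VALUE_LENGTH = 64     # hypothetical limit for "short identifiers"
SENSITIVE_HINTS = ("password", "secret", "token", "apikey", "api_key", "credential")


def baggage_problems(entries: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means the baggage is fine."""
    problems = []
    if len(entries) > MAX_ENTRIES:
        problems.append(f"too many entries: {len(entries)} > {MAX_ENTRIES}")
    for key, value in entries.items():
        # Key names are only a heuristic; values are not inspected for secrets.
        if any(hint in key.lower() for hint in SENSITIVE_HINTS):
            problems.append(f"{key}: name suggests sensitive data")
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(f"{key}: value longer than {MAX_VALUE_LENGTH} characters")
    return problems
```

A name-based heuristic cannot catch every leak, so treat it as a safety net alongside code review rather than a guarantee.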

Error Recording

Two operations are always required together when an error occurs. They do not imply each other—omitting either produces incomplete data.

  1. RecordError / record_exception — records the exception as a span event with exception.type, exception.message, and exception.stacktrace

  2. SetStatus(ERROR, description) — marks the span as failed for aggregation, alerting, and error rate calculations

If you call RecordError without SetStatus, the span's status stays Unset and the span does not count toward error rate metrics.

Java

Span span = tracer.spanBuilder("process-payment").startSpan();
try (Scope scope = span.makeCurrent()) {
    processPayment();
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR, e.getMessage());
} finally {
    span.end();
}

Python

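from opentelemetry.trace import StatusCode

# Turn off automatic recording so the re-raised exception is not recorded twice.
with tracer.start_as_current_span(
    "process-payment", record_exception=False, set_status_on_exception=False
) as span:
    try:
        process_payment()
    except Exception as e:
        span.record_exception(e)
        span.set_status(StatusCode.ERROR, str(e))
        raise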

Go

ctx, span := tracer.Start(ctx, "process-payment")
defer span.End()

if err := processPayment(ctx); err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
    return err
}

When to Set ERROR Status

  • HTTP 5xx responses (server-side)

  • HTTP 4xx responses on the client side (the outbound call failed)

  • Unhandled exceptions that terminate the operation

  • Database errors, connection timeouts, or message processing failures

Do not set ERROR for HTTP 4xx on the server side—those are client errors handled by the server, not server failures.
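The HTTP rules above reduce to a small decision function. This is an illustrative sketch; the function name `is_error_status` and the plain-string span kinds are assumptions, not an SDK API.

```python
def is_error_status(span_kind: str, http_status: int) -> bool:
    """Decide whether an HTTP span should be marked ERROR.

    span_kind is "server" or "client" (hypothetical string values).
    """
    if http_status >= 500:
        return True  # server failures are errors on both sides
    if 400 <= http_status < 500:
        # 4xx is an error only for the caller whose outbound request failed.
        return span_kind == "client"
    return False
```

For example, a 404 returned by your own handler leaves the server span unmarked, while the same 404 received from a downstream dependency marks the client span as failed.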

Cardinality Management

High cardinality in span attributes and metric labels degrades backend performance and increases storage cost. It is one of the most common production problems with APM deployments.

Anti-Patterns to Avoid

| Anti-pattern | Problem |
|---|---|
| User ID, session ID, or order ID in span name | Every unique ID creates a new grouping bucket. Aggregations become meaningless. |
| Full URL (/users/12345) in span name | Every URL is unique. No two requests share a span name. |
| Request ID as a metric label | Metric series count explodes. Backends run out of memory. |
| Timestamp embedded in any label or name | Always unique. Destroys aggregation. |
| Unbound enum values in metric labels | Product IDs, SKUs, transaction codes: each new value adds a new series. |

What to Do Instead

  • Put user IDs, order IDs, and request IDs in span attributes, not span names or metric labels

  • Use route templates (/users/{id}) as span names, never resolved paths

  • Use the OpenTelemetry Collector attributes processor to drop or hash high-cardinality fields before they reach your metrics backend

  • Set OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT to prevent large exception stacktraces from bloating per-span storage

OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=1024
OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT=64
OTEL_SPAN_EVENT_COUNT_LIMIT=10
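To see what these limits do to a span, the sketch below applies a value-length limit and a count limit to a plain dictionary. It is a simplified model, not SDK code: the function name `apply_span_limits` is hypothetical, and which attributes an SDK discards once the count limit is reached can differ between implementations (this sketch keeps the first ones set).

```python
from typing import Any, Dict


def apply_span_limits(
    attributes: Dict[str, Any],
    value_length_limit: int = 1024,
    count_limit: int = 64,
) -> Dict[str, Any]:
    """Truncate long string values and cap the number of attributes."""
    limited: Dict[str, Any] = {}
    for key, value in attributes.items():
        if len(limited) >= count_limit:
            break  # attributes beyond the count limit are dropped
        if isinstance(value, str) and len(value) > value_length_limit:
            value = value[:value_length_limit]  # keep only the leading characters
        limited[key] = value
    return limited
```

With a 1024-character limit, a multi-kilobyte stack trace is cut to its first 1024 characters, which usually still includes the exception type and the top frames.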

Performance Tuning

BatchSpanProcessor

Always use BatchSpanProcessor in production. The alternative (SimpleSpanProcessor) exports synchronously on every span end, adding latency to every instrumented operation.

The BSP collects spans in a queue and exports them in batches on a background thread. Tune these settings based on your span throughput:

| Environment variable | Default | Guidance |
|---|---|---|
| OTEL_BSP_SCHEDULE_DELAY | 5000 ms | Lower (e.g., 1000 ms) for near-realtime visibility in Kloudfuse; higher to reduce CPU overhead. |
| OTEL_BSP_MAX_QUEUE_SIZE | 2048 | Increase for high-throughput services: spans/sec × expected export latency × 2. Monitor for drops. |
| OTEL_BSP_MAX_EXPORT_BATCH_SIZE | 512 | Must be ≤ MAX_QUEUE_SIZE. Increase to reduce network round-trips. |
| OTEL_BSP_EXPORT_TIMEOUT | 30000 ms | Reduce to 10 s for faster circuit-breaking if the Kloudfuse agent is unreachable. |
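The queue-size rule of thumb (spans/sec × expected export latency × 2) is easy to apply as a calculation. In this sketch the function name `recommended_queue_size` is hypothetical, and never going below the default of 2048 is an added assumption rather than a documented rule.

```python
import math


def recommended_queue_size(
    spans_per_second: float,
    export_latency_seconds: float,
    safety_factor: float = 2.0,
    default_size: int = 2048,
) -> int:
    """Estimate OTEL_BSP_MAX_QUEUE_SIZE from throughput and export latency."""
    needed = math.ceil(spans_per_second * export_latency_seconds * safety_factor)
    # Assumption: keep at least the SDK default.
    return max(needed, default_size)
```

A service producing 5000 spans per second with a 0.5 s export latency needs 5000 × 0.5 × 2 = 5000 slots, more than twice the default, whereas a service at 100 spans per second is comfortably covered by 2048.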

OTLP Export Settings

Enable gzip compression to significantly reduce network bandwidth:

OTEL_EXPORTER_OTLP_COMPRESSION=gzip
OTEL_EXPORTER_OTLP_ENDPOINT=http://kf-agent:4317
OTEL_EXPORTER_OTLP_TIMEOUT=10000
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Language-Specific Best Practices

Each language SDK has distinct setup patterns, lifecycle requirements, and common pitfalls. The sections below consolidate the most important production guidance for Java, Python, and Go.

Java

Agent Configuration File

For non-trivial deployments, use a properties file instead of a long list of -D flags. Pass it with -Dotel.javaagent.configuration-file:

# otel-config.properties
otel.service.name=checkout-service
otel.traces.exporter=otlp
otel.exporter.otlp.endpoint=http://kf-agent:4317
otel.exporter.otlp.compression=gzip
otel.traces.sampler=parentbased_traceidratio
otel.traces.sampler.arg=0.05
otel.propagators=tracecontext,baggage
otel.resource.providers.aws.enabled=true
otel.javaagent.logging=application
otel.bsp.schedule.delay=2000
otel.bsp.max.queue.size=4096

Then start the JVM with the agent and point it at that file:

java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.javaagent.configuration-file=/etc/otel/otel-config.properties \
     -jar app.jar

Selective Instrumentation

Disable noisy or irrelevant auto-instrumentation without losing everything else:

# Suppress a specific library
-Dotel.instrumentation.log4j-appender.enabled=false

# Start from zero and enable only what you need
-Dotel.instrumentation.common.default-enabled=false
-Dotel.instrumentation.opentelemetry-api.enabled=true
-Dotel.instrumentation.spring-webmvc.enabled=true
-Dotel.instrumentation.jdbc.enabled=true

Span Annotations with @WithSpan

The javaagent intercepts the @WithSpan annotation to create spans without modifying business logic:

@WithSpan("process-payment")
public PaymentResult processPayment(
    @SpanAttribute("payment.method") String method,
    @SpanAttribute("payment.amount") double amount
) {
    // Span is created automatically; parameters become span attributes
}

Cloud Resource Providers

Cloud provider detection is disabled by default to avoid startup latency in non-cloud environments. Enable the provider that matches your deployment:

-Dotel.resource.providers.aws.enabled=true    # EC2, ECS, EKS, Lambda
-Dotel.resource.providers.gcp.enabled=true    # GCE, GKE, Cloud Run
-Dotel.resource.providers.azure.enabled=true  # Azure VMs, AKS

Best Practices

  • Set service.name explicitly — never rely on the unknown_service default

  • Tag with service.namespace, service.version, and deployment.environment.name in otel.resource.attributes

  • Use parentbased_traceidratio with a 5–10% sample rate for high-traffic production services

  • Enable otel.exporter.otlp.compression=gzip to reduce network overhead

  • Route agent logs to your application logging framework with otel.javaagent.logging=application

  • Prefer @WithSpan for business-logic spans — it is less intrusive than the programmatic API and automatically propagates trace context

  • Disable instrumentation for high-volume, low-value scheduled jobs (e.g., health checks via Spring Scheduling)

  • Enable cloud resource providers matching your environment to populate instance and cluster metadata automatically

See Java Instrumentation → for the full setup guide.

Python

Auto-Instrumentation Setup

Install all detected library instrumentors in one step:

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install 'opentelemetry-distro[otlp]' opentelemetry-instrumentation
opentelemetry-bootstrap -a install    # detects and installs instrumentation packages

Run your application with full auto-instrumentation:

OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://kf-agent:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py

Manual SDK Initialization

When you need more control than environment variables provide, initialize the SDK programmatically:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

resource = Resource.create({
    SERVICE_NAME: "checkout-service",
    "service.version": "1.4.2",
    "deployment.environment.name": "production",
})

provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://kf-agent:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

Fork Safety (Gunicorn, uWSGI)

The Python BatchSpanProcessor spawns background threads and is not fork-safe. If your application server forks worker processes (Gunicorn, uWSGI, multiprocessing), initializing the SDK before the fork will cause threads to be silently lost in child processes, resulting in no traces exported.

Always initialize the OTel SDK inside the worker initialization hook, after the fork:

# gunicorn.conf.py
def post_fork(server, worker):
    # Initialize OTel here — after fork, not at module import time
    setup_opentelemetry()

Best Practices

  • Use opentelemetry-bootstrap -a install after adding new dependencies — it detects and installs matching instrumentation packages automatically

  • Use opentelemetry-instrument for library coverage (HTTP, DB, etc.) and add manual spans only for business logic

  • Set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.05 in production

  • If using Gunicorn or uWSGI, always initialize the OTel SDK inside post_fork — never at module import time

  • Install the resource detector package for your cloud (for example, opentelemetry-sdk-extension-aws for AWS or opentelemetry-resourcedetector-gcp for GCP) to populate environment metadata

See Python Instrumentation → for the full setup guide.

Go

SDK Initialization and Graceful Shutdown

Always call tp.Shutdown() on exit. Without it, the BatchSpanProcessor may not flush spans that are still queued when the process exits.

func initTracer(ctx context.Context) (func(context.Context) error, error) {
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("kf-agent:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res, err := resource.Merge(
        resource.Default(),
        resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("my-service"),
            semconv.ServiceVersion("1.0.0"),
            attribute.String("deployment.environment.name", "production"),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.05),
        )),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return tp.Shutdown, nil
}

func main() {
    ctx := context.Background()
    shutdown, err := initTracer(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := shutdown(shutdownCtx); err != nil {
            log.Printf("tracer shutdown error: %v", err)
        }
    }()

    // ... application code
}

Context Threading

In Go, the active span lives in context.Context. You must pass context explicitly through every function call. Goroutines do not inherit context automatically—if you forget to pass it, child operations will start new root spans instead of appearing as children.

func handleRequest(ctx context.Context) {
    ctx, span := tracer.Start(ctx, "handle-request")
    defer span.End()

    // Pass ctx to all child operations
    result, err := fetchData(ctx)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return
    }
    _ = result // placeholder: use the result in real code

    // Pass ctx explicitly to goroutines; do not capture from outer scope
    go func(ctx context.Context) {
        _, childSpan := tracer.Start(ctx, "async-task")
        defer childSpan.End()
        // ...
    }(ctx)
}

HTTP Instrumentation

Use otelhttp for automatic context propagation on both server and client sides:

import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

// Server: wraps the handler and extracts incoming trace context
http.Handle("/api/orders", otelhttp.NewHandler(ordersHandler, "handle-orders"))

// Client: injects trace context into outgoing requests
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

For manual propagation without otelhttp:

// Inject into outgoing request
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

// Extract from incoming request
ctx = otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

Best Practices

  • Always defer tp.Shutdown(ctx) in main() — without it the BatchSpanProcessor may drop queued spans on exit

  • Pass ctx explicitly through every function call; goroutines do not inherit context automatically

  • Use otelhttp.NewHandler and otelhttp.NewTransport instead of manually injecting/extracting headers

  • Register both propagation.TraceContext{} and propagation.Baggage{} as composite propagators at startup

  • Use sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)) as the sampler for production services

  • Use resource.WithFromEnv() alongside programmatic resource attributes so OTEL_RESOURCE_ATTRIBUTES can override values at deploy time without recompiling

See Go Instrumentation → for the full setup guide.

Semantic Conventions Reference

OpenTelemetry defines standard attribute names for common operation types. Using these ensures Kloudfuse and other backends can correctly parse and display trace data.

HTTP

| Attribute | Stability | Use |
|---|---|---|
| http.request.method | Stable | GET, POST, PUT, DELETE |
| http.response.status_code | Stable | Integer: 200, 404, 500 |
| http.route | Stable | Matched route template: /users/{id} |
| url.full | Stable | Absolute URL for client spans. Replaces deprecated http.url. |
| url.scheme | Stable | http or https |
| server.address | Stable | Server hostname or IP |
| server.port | Stable | Server port number |
| error.type | Stable | HTTP status code string or exception type for errors |

The attributes http.url, http.scheme, http.method, and http.status_code are deprecated. Use url.full, url.scheme, http.request.method, and http.response.status_code respectively.
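When migrating custom instrumentation, the renames listed above can be applied mechanically. The sketch below rewrites a plain attribute dictionary. The names `DEPRECATED_HTTP_ATTRIBUTES` and `migrate_http_attributes` are hypothetical, and the mapping covers only the four attributes named on this page.

```python
from typing import Any, Dict

# Deprecated name -> current name, as listed above.
DEPRECATED_HTTP_ATTRIBUTES = {
    "http.url": "url.full",
    "http.scheme": "url.scheme",
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}


def migrate_http_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of attributes with deprecated HTTP keys renamed."""
    migrated: Dict[str, Any] = {}
    for key, value in attributes.items():
        new_key = DEPRECATED_HTTP_ATTRIBUTES.get(key, key)
        # If the current name is already present, its value wins.
        if new_key != key and new_key in attributes:
            continue
        migrated[new_key] = value
    return migrated
```

Attributes that are not in the mapping, such as http.route, pass through unchanged.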

Database

| Attribute | Stability | Use |
|---|---|---|
| db.system.name | Stable | postgresql, mysql, mongodb, redis |
| db.operation.name | Stable | SELECT, INSERT, UPDATE, FIND |
| db.namespace | Stable | Database or schema name |
| db.collection.name | Stable | Table, collection, or index name |
| db.query.text | Stable | Full query text (sanitize before enabling in production) |
| db.query.summary | Stable | Low-cardinality summary: SELECT users |

Messaging

| Attribute | Stability | Use |
|---|---|---|
| messaging.system | Development | kafka, rabbitmq, aws_sqs, azure_servicebus |
| messaging.operation.name | Development | publish, receive, process |
| messaging.destination.name | Development | Queue or topic name |
| messaging.destination.template | Development | Low-cardinality template: orders.{region} |
| messaging.consumer.group.name | Development | Consumer group identifier |

Exceptions

| Attribute | Requirement | Use |
|---|---|---|
| exception.type | Conditionally required | Fully-qualified exception class name |
| exception.message | Conditionally required | Exception message text |
| exception.stacktrace | Recommended | Language-native stack trace string |

Production Checklist

  • Set OTEL_SERVICE_NAME explicitly — never rely on the unknown_service default

  • Set service.namespace, service.version, and deployment.environment.name in resource attributes

  • Use parentbased_traceidratio sampler at 5–10% for most production services

  • Use BatchSpanProcessor (default) — never SimpleSpanProcessor in production

  • Enable gzip compression: OTEL_EXPORTER_OTLP_COMPRESSION=gzip

  • Call both RecordError and SetStatus(ERROR) together — never one without the other

  • Use route templates (/users/{id}) as span names — never resolved paths

  • Set attribute and value length limits to prevent storage bloat

  • Enable cloud resource providers matching your deployment environment

  • Initialize W3C TraceContext + Baggage propagators (SDK default)

  • Python: initialize OTel SDK after fork in multiprocessing environments

  • Go: always defer tp.Shutdown(ctx) in main(); pass ctx explicitly to all goroutines

  • Java: use a .properties config file for complex deployments; enable cloud providers explicitly
