# Sample traces
A high-traffic service can produce millions of spans per hour. Most of them are uninteresting — successful requests that all look the same. Sampling is the standard answer: send a representative subset, keep cost manageable, retain the ability to debug.
## Two flavors

### Head-based sampling

The decision is made at trace start. Cheap and deterministic, but blind to outcome: you might sample away the one slow trace.
```js
// OpenTelemetry — sample 10% of traces
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  serviceName: 'demo-api',
  sampler: new TraceIdRatioBasedSampler(0.1), // 10%
  // ...
});
```

Use head sampling for sustained-high-volume services where you can afford to lose individual traces.
### Tail-based sampling

The decision is made after the trace completes. Expensive (you have to buffer everything until you decide), but smart: keep all errors, all slow traces, plus a sample of normal ones.
The OpenTelemetry Collector ships a tail sampling processor; run the collector as a sidecar or daemon between your apps and SiteQwality:
```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlphttp:
    endpoint: https://traces.siteqwality.com/v1/traces
    headers:
      Authorization: Bearer ${SITEQWALITY_API_KEY}

service:
  pipelines:
    traces:
      processors: [tail_sampling]
      exporters: [otlphttp]
```

Use tail sampling when:
- You can run an OTel collector in your infra.
- Most traces are uninteresting but the few interesting ones are critical.
- You want to keep 100% of errors regardless of volume.
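The decision the three policies above encode can be sketched in a few lines. This is illustrative only (the real `tail_sampling` processor is configured in YAML, not hand-written), and the span field names `status` and `durationMs` are assumptions, not the OTel data model:

```javascript
// Illustrative sketch of the tail-sampling decision, made once per
// completed trace. Field names (status, durationMs) are assumptions.
function keepTrace(spans, { thresholdMs = 1000, sampleRate = 0.05 } = {}) {
  if (spans.some((s) => s.status === 'ERROR')) return true;       // errors-policy
  if (spans.some((s) => s.durationMs > thresholdMs)) return true; // slow-policy
  return Math.random() < sampleRate;                              // sample-rest
}
```

Note that the policies are evaluated in order: an error or a slow span forces a keep before the probabilistic roll ever happens.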
## Recommended starting point

For most teams:
- Low volume (under 10 spans/sec): No sampling. Send everything.
- Mid volume (10–500 spans/sec): Head sample at 30–50% in the app; drop the rest.
- High volume (over 500 spans/sec): Tail sample via a collector. Keep 100% errors + 5% successful.
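The tiers above reduce to a tiny helper if you want a starting value in code (hypothetical, for illustration only):

```javascript
// Hypothetical helper encoding the volume tiers above.
function recommendedStrategy(spansPerSec) {
  if (spansPerSec < 10) return 'no sampling';                      // send everything
  if (spansPerSec <= 500) return 'head sampling at 30-50%';
  return 'tail sampling: 100% errors + 5% successes';
}
```

Treat the thresholds as rough guides, not hard cutoffs; what matters is trending in the right direction as volume grows.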
## Per-request override

Sometimes you want to force a trace through regardless of the sampler:
```js
const { trace, SpanKind } = require('@opentelemetry/api');

app.post('/api/critical-thing', async (req, res) => {
  const span = trace.getTracer('demo').startSpan('critical_thing', {
    kind: SpanKind.SERVER,
    attributes: { 'sampling.priority': 1 }, // hint to sampler: keep me
  });
  // ...
});
```

Many sampler implementations honor `sampling.priority`. Check yours.
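One way a sampler can honor the hint is to wrap a base sampler and short-circuit when the attribute is present. A minimal sketch, assuming the `shouldSample` shape from `@opentelemetry/sdk-trace-base`; the numeric decision value below is a stand-in (use the SDK's `SamplingDecision` enum in real code):

```javascript
// Sketch of a wrapper sampler that always keeps sampling.priority === 1.
// Assumption: 2 stands in for SamplingDecision.RECORD_AND_SAMPLED.
const RECORD_AND_SAMPLED = 2;

class PrioritySampler {
  constructor(baseSampler) {
    this.base = baseSampler; // e.g. a TraceIdRatioBasedSampler
  }
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    if (attributes && attributes['sampling.priority'] === 1) {
      return { decision: RECORD_AND_SAMPLED }; // forced keep, skip the base sampler
    }
    // Otherwise defer to the wrapped sampler's decision.
    return this.base.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  }
}
```

Pass an instance as the `sampler` option in place of the base sampler; flagged spans are then kept no matter what ratio the base sampler applies.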
## Cost vs visibility tradeoff

| Sampling rate | Use case |
|---|---|
| 100% | Dev/staging, low-traffic prod, anything < 50 spans/sec. |
| 50% | Mid-traffic prod where you can spot patterns from half the data. |
| 10% | High-traffic prod with head-based sampling. |
| 1–5% + 100% errors | Very high traffic, tail-based. |