<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog - Sematext Community</title>
	<atom:link href="https://sematext.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>https://sematext.com/blog/</link>
	<description>Solr / Elasticsearch Experts - Search &#38; Big Data Analytics</description>
	<lastBuildDate>Tue, 31 Mar 2026 07:06:27 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://sematext.com/wp-content/uploads/2024/12/cropped-ST-favicon-32x32.png</url>
	<title>Blog - Sematext Community</title>
	<link>https://sematext.com/blog/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Pull Request Velocity as a Proxy for AI Usage for Software Development</title>
		<link>https://sematext.com/blog/pull-request-velocity-as-a-proxy-for-ai-usage-for-software-development/</link>
		
		<dc:creator><![CDATA[Otis]]></dc:creator>
		<pubDate>Tue, 31 Mar 2026 07:06:27 +0000</pubDate>
				<category><![CDATA[Engineering]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70751</guid>

					<description><![CDATA[<p>While AI usage has been growing steadily for the last several years, LLMs noticeably improved around the end of 2025. Specifically, they became more viable for software development. We are seeing the results: feature and product delivery has picked up. One way to visualize this is by looking at the number [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/pull-request-velocity-as-a-proxy-for-ai-usage-for-software-development/">Pull Request Velocity as a Proxy for AI Usage for Software Development</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>While AI usage has been growing steadily for the last several years, LLMs noticeably improved around the end of 2025. Specifically, they became more viable for software development. We are seeing the results: feature and product delivery has picked up. One way to visualize this is by looking at the number of pull requests for your organization or software development teams. This chart shows the number of GitHub pull requests created by a team. Can you spot when AI usage increased?</p>
<p><img fetchpriority="high" decoding="async" class="alignnone  wp-image-70752" src="https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-300x86.png" alt="" width="677" height="194" srcset="https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-300x86.png 300w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-1024x293.png 1024w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-768x220.png 768w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-1536x440.png 1536w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-ai-usage-proxy-2048x586.png 2048w" sizes="(max-width: 677px) 100vw, 677px" /></p>
<p>The uptick starts in late November 2025, which marks the beginning of increased AI usage for coding at Sematext. That’s when the LLMs got better. It <i>roughly</i> matches the change in velocity as visualized in JIRA.</p>
<h3 id="individual-ai-adoption">Individual AI Adoption</h3>
<p>The blurred parts are PR author names, which we can use for filtering. If we look at the trends of individuals, we can spot early adopters like this one:</p>
<p><img decoding="async" class="alignnone  wp-image-70753" src="https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-300x65.png" alt="" width="660" height="143" srcset="https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-300x65.png 300w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-1024x221.png 1024w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-768x166.png 768w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-1536x332.png 1536w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-early-ai-user-2048x442.png 2048w" sizes="(max-width: 660px) 100vw, 660px" /></p>
<p>Or another individual who started making more use of AI later:</p>
<p><img decoding="async" class="alignnone  wp-image-70754" src="https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-300x64.png" alt="" width="656" height="140" srcset="https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-300x64.png 300w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-1024x219.png 1024w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-768x164.png 768w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-1536x329.png 1536w, https://sematext.com/wp-content/uploads/2026/03/gh-pr-late-ai-user-2048x438.png 2048w" sizes="(max-width: 656px) 100vw, 656px" /></p>
<h3 id="source-github-webhook-events">Source: GitHub Webhook Events</h3>
<p>This data comes into Sematext via <a href="https://sematext.com/docs/integration/github-webhook-events-integration/">GitHub Webhook Events</a>. It takes about 5-10 minutes to set up, either at the GitHub organization level or for individual repositories.</p>
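<p>For reference, each <code>pull_request</code> webhook delivery carries the fields a chart like the ones above needs, notably the creation time and the author. A trimmed sketch of the payload (the values here are made up):</p>
<pre>{
  "action": "opened",
  "pull_request": {
    "created_at": "2026-03-30T14:05:12Z",
    "user": { "login": "example-dev" }
  },
  "repository": { "full_name": "example-org/example-repo" }
}
</pre>
<p>Counting events by <code>created_at</code> gives the PR velocity chart; grouping by <code>user.login</code> gives the per-author view used for spotting individual adoption.</p>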
<h3 id="a-word-of-caution">A Word of Caution</h3>
<ol>
<li aria-level="1">Software development styles vary. Some people commit frequently and incrementally, while others keep things to themselves until everything is nearly done. This is a fun chart to look at and is helpful when you want to get a feel for the “pulse” of a team or even an individual. But be careful not to judge people on this sort of data alone. Take it with a grain of salt and use it in combination with other inputs, observations, etc.</li>
<li aria-level="1">Creating more code or PRs doesn’t always equal better code or higher effectiveness. A person may be fumbling in the dark, trying to debug or implement something with the help of AI and, in the process, creating a lot of (temporary?) code and PRs.</li>
<li aria-level="1">As velocity increases, so will regressions, unless you take countermeasures. See <a href="https://www.linkedin.com/pulse/faster-coding-ai-increased-regressions-otis-gospodneti%C4%87-bi6ve/" target="_blank" rel="noopener noreferrer">Faster Coding with AI and Increased Regressions</a>.</li>
</ol>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/pull-request-velocity-as-a-proxy-for-ai-usage-for-software-development/">Pull Request Velocity as a Proxy for AI Usage for Software Development</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Running OpenTelemetry at Scale: Architecture Patterns for 100s of Services</title>
		<link>https://sematext.com/blog/running-opentelemetry-at-scale-architecture-patterns-for-100s-of-services/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Tue, 03 Mar 2026 12:06:59 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[Tracing]]></category>
		<category><![CDATA[distributed tracing]]></category>
		<category><![CDATA[microservices]]></category>
		<category><![CDATA[opentelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70550</guid>

					<description><![CDATA[<p>It feels great getting OpenTelemetry working in a demo environment. Spans appear, metrics flow, you connect it to a backend and everything lights up in a satisfying cascade. You write the internal doc, you present it to the team, but it’s just a matter of time before somebody on the team asks: “Great, so how [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/running-opentelemetry-at-scale-architecture-patterns-for-100s-of-services/">Running OpenTelemetry at Scale: Architecture Patterns for 100s of Services</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">It feels great getting OpenTelemetry working in a demo environment. Spans appear, metrics flow, you connect it to a backend and everything lights up in a satisfying cascade. You write the internal doc, you present it to the team, but it’s just a matter of time before somebody on the team asks: “Great, so how do we roll this out to all 100 services?” If you are at that point on your OTel journey, this article will help you roll out OTel to production.</span></p>
<p><span style="font-weight: 400;">Running OTel across a handful of services and running it across a few hundred are genuinely different problems. The instrumentation part stays roughly the same. Everything around it — how you collect the data, how you route it, how you make sure a traffic spike in one region does not take down your entire observability pipeline — that is where teams either build something resilient or spend the next six months fire-fighting because of inadequate planning or suboptimal architecture.</span></p>
<p><span style="font-weight: 400;">I wrote this article to share the patterns that actually hold up at scale: collector tiers, load balancing strategies, sampling at volume, and multi-cluster setups. Everything comes with real config examples because “it depends” is only useful advice if you can see what it depends on.</span></p>
<p><span style="font-weight: 400;">See</span><a href="https://sematext.com/blog/from-debugging-to-slos-how-opentelemetry-changes-the-way-teams-do-observability/" target="_blank" rel="noopener"> <span style="font-weight: 400;">How OpenTelemetry changes the way teams do observability</span></a><span style="font-weight: 400;"> for why OpenTelemetry matters and how it shifts focus from traditional metrics and logs to full, end-to-end observability.</span></p>
<h2 id="why-a-single-collector-falls-apart-and-when"><b>Why a Single Collector Falls Apart (and When)</b></h2>
<p><span style="font-weight: 400;">Most OTel tutorials show you a single collector instance receiving spans from all your services and forwarding everything to a backend. That setup works until about the point where it stops working, which tends to happen quietly and at the worst possible time. You are not going to notice a single collector struggling until it is already dropping data, buffering is maxed out, and your traces have gaps you cannot explain.</span></p>
<p><span style="font-weight: 400;">The core issue is that a single collector is both a single point of failure and a resource bottleneck. At low traffic it sits there looking fine. Add a few dozen services, let traffic spike during a product launch or a retry storm, and you will watch it start falling behind. The exporter queue fills up. Backpressure kicks in. Services start dropping spans rather than blocking on the export. By the time anyone notices, you have lost the exact telemetry you needed to understand what just happened.</span></p>
<div style="background: rgba(220,38,38,0.06); border-left: 3px solid #DC2626; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #7f1d1d;"><strong style="color: #991b1b;">The failure mode is silent.</strong> <span style="font-weight: 400;">When a collector falls behind, it does not usually crash spectacularly. It drops spans without loud errors, your traces become incomplete, and your dashboards show suspiciously clean latency numbers because the slow requests stopped being recorded. If your p99 looks unexpectedly healthy during an incident, check your collector queue depth before trusting it.</span></div>
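<p><span style="font-weight: 400;">One way to catch this early is to watch the collector’s own telemetry instead of waiting for gaps in traces. A minimal sketch, assuming you scrape the collector’s internal Prometheus metrics (exact metric names can vary between collector versions):</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">COLLECTOR QUEUE DEPTH ALERTS (SKETCH)</div>
<pre># A growing exporter queue is the early warning; dropped data is the late one.
# Alert when the queue is consistently over 80% full:
otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8

# Alert on any failed span exports:
rate(otelcol_exporter_send_failed_spans[5m]) > 0
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">If these fire before your backend shows gaps, you have time to scale the collector instead of explaining missing traces after an incident.</div>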
<p><span style="font-weight: 400;">The solution is to stop thinking about the collector as a single process and start thinking about it as a tier. Two tiers cover most production scenarios. Three tiers cover the rest. The architecture you need depends on your traffic, whether you need tail-based sampling, and how many backends you are exporting to.</span></p>
<p><span style="font-weight: 400;">Let me make this more specific: if you have fewer than 20 services and under 500 requests per second total, a single well-configured collector will likely hold up (yes, of course it depends on the underlying hardware/resources). At 20 to 80 services or 500 to 5,000 RPS, the two-tier model becomes worthwhile. Above 80 services or 5,000 RPS, you need the full tiered setup with </span><a href="https://opentelemetry.io/docs/collector/deploy/gateway/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">trace-aware load balancing</span></a><span style="font-weight: 400;"> and </span><a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/#tail-sampling" target="_blank" rel="noopener"><span style="font-weight: 400;">tail-based sampling</span></a><span style="font-weight: 400;"> at the gateway. </span></p>
<p><span style="font-weight: 400;">For more information on common production pitfalls and strategies to prevent them, see </span><a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/" target="_blank" rel="noopener"> <span style="font-weight: 400;">OpenTelemetry Production Monitoring: What Breaks and How to Prevent It</span></a><span style="font-weight: 400;">.</span></p>
<h2 id="collector-tiers-the-architecture-that-actually-scales"><b>Collector Tiers: The Architecture That Actually Scales</b></h2>
<p><span style="font-weight: 400;">The tiered collector model separates two concerns that should never have been combined in the first place: getting data off your services quickly, and doing something intelligent with that data before it hits your backend.</span></p>
<p><span style="font-weight: 400;">Before getting into the architecture, it helps to know that the OTel Collector can run in three modes — and in a scaled setup, you will use all three:</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Agent</b><span style="font-weight: 400;"> — a collector running on the same host as your services, collecting telemetry locally and forwarding it upstream. It stays thin: no heavy processing, just receive-and-forward.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Gateway</b><span style="font-weight: 400;"> — a collector running as a standalone service, receiving data from agents (or directly from SDKs) and doing the heavier work: sampling, routing, fan-out to backends, attribute redaction.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Combined</b><span style="font-weight: 400;"> — the full pattern, where agent collectors feed into gateway collectors. Agents handle what only makes sense per-host (host metrics, file logs, resource detection). Gateways handle what only makes sense centrally (tail-based sampling, cross-service routing, policy management). The </span><a href="https://opentelemetry.io/docs/collector/deploy/gateway/#combined-deployment-of-collectors-as-agents-and-gateways" target="_blank" rel="noopener noreferrer"> <span style="font-weight: 400;">OTel Collector deployment docs</span></a><span style="font-weight: 400;"> call this the combined deployment pattern.</span></li>
</ul>
<p><span style="font-weight: 400;">The tiered setup this article describes is the combined pattern. Here is what it looks like:</span></p>
<div style="margin: 32px 0;">
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase; margin-bottom: 24px;">TWO-TIER COLLECTOR ARCHITECTURE</div>
<p><!-- SERVICES ROW --></p>
<div style="text-align: center; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 8px;">SERVICES</div>
<div style="display: flex; justify-content: center; gap: 12px; margin-bottom: 6px;">
<div style="background: #1e3a5f; border: 2px solid #f59e0b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; margin-bottom: 3px;">Service</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">App A</div>
</div>
<div style="background: #1e3a5f; border: 2px solid #f59e0b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; margin-bottom: 3px;">Service</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">App B</div>
</div>
<div style="background: #1e3a5f; border: 2px solid #f59e0b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; margin-bottom: 3px;">Service</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">App C</div>
</div>
<div style="background: #1e3a5f; border: 2px solid #f59e0b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; margin-bottom: 3px;">Service</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">App N</div>
</div>
</div>
<p><!-- Arrow down --></p>
<div style="text-align: center; color: #94a3b8; font-size: 20px; line-height: 1; margin: 4px 0;">↓</div>
<p><!-- TIER 1 LABEL --></p>
<div style="text-align: center; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 8px;">TIER 1 — AGENT / SIDECAR COLLECTORS</div>
<div style="display: flex; justify-content: center; gap: 12px; margin-bottom: 6px;">
<div style="background: #14532d; border: 2px solid #14532d; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #86efac; margin-bottom: 3px;">Agent</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector</div>
</div>
<div style="background: #14532d; border: 2px solid #14532d; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #86efac; margin-bottom: 3px;">Agent</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector</div>
</div>
<div style="background: #14532d; border: 2px solid #14532d; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #86efac; margin-bottom: 3px;">Agent</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector</div>
</div>
<div style="background: #14532d; border: 2px solid #14532d; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #86efac; margin-bottom: 3px;">Agent</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector</div>
</div>
</div>
<p><!-- Arrow down --></p>
<div style="text-align: center; color: #94a3b8; font-size: 20px; line-height: 1; margin: 4px 0;">↓</div>
<p><!-- TIER 2 LABEL --></p>
<div style="text-align: center; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 8px;">TIER 2 — GATEWAY COLLECTORS</div>
<div style="display: flex; justify-content: center; gap: 12px; margin-bottom: 6px;">
<div style="background: #1e293b; border: 2px solid #1e293b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 110px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 3px;">Gateway</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector (HA)</div>
</div>
<div style="background: #1e293b; border: 2px solid #1e293b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 110px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 3px;">Gateway</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Collector (HA)</div>
</div>
</div>
<p><!-- Arrow down --></p>
<div style="text-align: center; color: #94a3b8; font-size: 20px; line-height: 1; margin: 4px 0;">↓</div>
<p><!-- BACKENDS --></p>
<div style="display: flex; justify-content: center; gap: 12px; margin-bottom: 6px;">
<div style="background: #1e293b; border: 2px solid #1e293b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 3px;">Backend</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Traces</div>
</div>
<div style="background: #1e293b; border: 2px solid #1e293b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 3px;">Backend</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Metrics</div>
</div>
<div style="background: #1e293b; border: 2px solid #1e293b; border-radius: 6px; padding: 10px 18px; text-align: center; min-width: 90px;">
<div style="font-size: 9px; font-weight: bold; letter-spacing: 0.1em; color: #94a3b8; margin-bottom: 3px;">Backend</div>
<div style="font-size: 15px; font-weight: bold; color: #ffffff;">Logs</div>
</div>
</div>
<p><!-- Caption --></p>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; margin-top: 16px; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto;">Tier 1 agents sit close to services and do minimal work. Tier 2 gateways handle sampling, routing, and backend fan-out.</div>
</div>
<h3 id="tier-1-collectors-running-as-agents"><b>Tier 1: Collectors running as agents</b></h3>
<p><span style="font-weight: 400;">The agent tier runs close to the services, typically as a per-node DaemonSet or a sidecar. Its job is exactly one thing: receive telemetry from the services and forward it as fast as possible. No tail-based sampling, no complex routing logic, no fan-out to multiple backends. The only processing you want at this tier is cheap and stateless: adding resource attributes like cluster name, node name, and environment; batching spans to reduce connection overhead; and basic filtering to drop genuinely worthless spans, like health check endpoints generating thousands of spans per minute while telling you nothing.</span></p>
<div style="background: rgba(217,119,6,0.06); border-left: 3px solid #D97706; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #78350f;"><span style="font-weight: 400;">Only stamp resource attributes that are low-cardinality and apply to the whole node or pod — things like environment, cluster name, and region. Adding high-cardinality values like user IDs or request IDs as resource attributes will explode your metrics storage, because each unique value becomes a separate time series.</span></div>
<div>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">TIER 1 AGENT COLLECTOR CONFIG</div>
<pre># Tier 1: runs as DaemonSet, minimal processing
receivers:
  otlp:
    protocols:
      grpc: {endpoint: "0.0.0.0:4317"}
      http: {endpoint: "0.0.0.0:4318"}

processors:
  batch:                    # batch before forwarding
    send_batch_size: 1024
    timeout: 5s
  resourcedetection:        # stamp node/pod metadata
    detectors: [k8snode, env]
  filter/drop_healthchecks:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - {key: http.route, value: ".*/health.*"}

exporters:
  otlp:
    # forward to gateway tier, not directly to backend
    endpoint: "otel-gateway:4317"
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 500

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [resourcedetection, filter/drop_healthchecks, batch]
      exporters:  [otlp]
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">Agent config stays thin. Anything heavier than batching and attribute stamping belongs in the gateway tier.</div>
<h3 id="tier-2-collectors-running-as-gateways"><b>Tier 2: Collectors running as gateways</b></h3>
<p><span style="font-weight: 400;">The gateway tier is where the interesting work happens: tail-based sampling, fan-out to multiple backends, and the routing logic that sends traces, metrics, and logs where they need to go. Once you introduce a gateway tier, it needs careful resource sizing. In practice, that means running at least two gateway collectors behind a load balancer to </span><b>avoid single points of failure</b><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">How you deploy them depends on your environment. In Kubernetes, that typically means a Deployment scaled by load rather than node count. In a VM-based setup, two or more collector processes behind a hardware or software load balancer works just as well. The important thing is that the gateway tier scales horizontally based on traffic, not based on how many hosts you have.</span></p>
<p><span style="font-weight: 400;">Two to four instances is a reasonable starting point for a deployment handling roughly 1,000 to 5,000 spans per second across 20 to 50 services. Beyond that, sizing should be driven primarily by your tail-based sampling configuration — specifically the </span><code>decision_wait</code><span style="font-weight: 400;"> window and the </span><code>num_traces</code><span style="font-weight: 400;"> value — which determine how much trace state each gateway must hold in memory.</span></p>
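<p><span style="font-weight: 400;">As a rough sketch (the exact policies depend on your traffic and budget), a gateway-tier tail-based sampling configuration that keeps error and slow traces while sampling a small share of healthy traffic could look like this:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">TIER 2 GATEWAY TAIL SAMPLING (SKETCH)</div>
<pre># Tier 2 only: never run tail sampling on the agent tier
processors:
  tail_sampling:
    decision_wait: 10s       # how long to buffer spans before deciding
    num_traces: 50000        # max in-flight traces held in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">Memory per gateway scales with decision_wait and num_traces: longer windows and more in-flight traces mean more trace state held in RAM.</div>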
<h2 id="load-balancing-the-subtle-trap-with-tail-based-sampling"><b>Load Balancing: The Subtle Trap with Tail-Based Sampling</b></h2>
<p><span style="font-weight: 400;">If you are using tail-based sampling and running multiple gateway collector instances, standard round-robin load balancing will silently break your sampling decisions. Tail-based sampling works by collecting all spans for a given trace and then making a single keep-or-drop decision once the trace is complete. With round-robin, spans for the same trace end up scattered across different collector instances. Each instance only sees a fragment, so no instance ever has enough context to make a valid decision.</span></p>
<div style="background: rgba(217,119,6,0.06); border-left: 3px solid #D97706; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #78350f;"><i><span style="font-weight: 400;">The symptom is traces that look complete but are not. You will see traces that hit your sampling rate but are missing spans from certain services, because those spans went to a different collector instance that independently decided to drop its fragment. This is one of the harder things to debug because the data loss is structured rather than random.</span></i></div>
<p><span style="font-weight: 400;">The solution is </span><b>trace-aware load balancing</b><span style="font-weight: 400;">, where spans are routed to gateway instances based on their trace ID. The OTel Collector has a </span><a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">loadbalancing exporter</span></a><span style="font-weight: 400;"> built for exactly this. It consistently hashes trace IDs to the same downstream collector, which means all spans for a given trace always end up in the same place regardless of which agent they came from.</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">LOAD BALANCING EXPORTER CONFIG — AGENT TIER</div>
<pre>exporters:
  loadbalancing:
    routing_key: "traceID"   # hash by trace ID, not round-robin
    resolver:
      k8s:                    # auto-discover gateway pods via DNS
        service: "otel-gateway"
        ports: [4317]
    protocol:
      otlp:
        timeout: 1s
        sending_queue:
          enabled: true
          queue_size: 1000
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">The k8s resolver watches the gateway headless service and automatically updates routing when pods scale up or down.</div>
<p><span style="font-weight: 400;">Gateway restarts or scale-in events can occasionally produce incomplete traces.  See </span><a href="https://opentelemetry.io/docs/collector/scaling/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OTel Collector scaling documentation</span></a><span style="font-weight: 400;"> for details.</span></p>
<h2 id="sampling-strategies-at-volume-picking-the-right-one"><b>Sampling Strategies at Volume: Picking the Right One</b></h2>
</div>
<div>
<p><span style="font-weight: 400;">At small scale, sampling feels like an optional optimization. At large scale, it is a </span><b>financial and operational necessity</b><span style="font-weight: 400;">. Sending 100 percent of traces from a service handling 10,000 requests per second generates a staggering volume of data, most of which you will never look at. This is not too different from logs – for example, Sematext’s log pipeline contains the </span><a href="https://sematext.com/docs/logs/sampling-processor/" target="_blank" rel="noopener"><span style="font-weight: 400;">Sampling Processor</span></a><span style="font-weight: 400;"> for the same reason. Getting sampling right means you keep the traces that help you debug real incidents and drop the ones that would just sit there consuming storage.</span></p>
<p><span style="font-weight: 400;">The tricky part is that “keep the useful traces” is not as simple as it sounds. The traces you most need to keep are the ones with errors and high latency, which are often a small fraction of total traffic. If you use pure random sampling at 1 percent, you will statistically drop 99 percent of your error traces along with everything else. That is the core tension that drives the choice between head-based and tail-based sampling.</span></p>
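<p><span style="font-weight: 400;">For head-based strategies, the decision is usually made in the SDK. The OpenTelemetry specification defines standard environment variables for this, so a 1 percent parent-based ratio sampler looks the same in every language:</span></p>
<pre># Head-based sampling via standard OTel SDK env vars
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.01   # sample ~1% of root traces; children follow the parent
</pre>
<p><span style="font-weight: 400;">This is cheap and consistent across services, but it is exactly the setup that drops 99 percent of your error traces along with the healthy ones, which is why error-sensitive keep decisions belong in tail-based sampling at the gateway.</span></p>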
<div style="margin: 32px 0;">
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase; margin-bottom: 10px;">SAMPLING STRATEGY COMPARISON</div>
<table style="width: 100%; border-collapse: collapse; font-family: inherit; font-size: 14px;">
<thead>
<tr style="background: #0f172a;">
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">STRATEGY</th>
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">WHERE</th>
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">KEEPS ERRORS</th>
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">MEMORY COST</th>
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">BEST FOR</th>
<th style="padding: 10px 14px; text-align: left; font-size: 10px; font-weight: bold; letter-spacing: 0.1em; color: #f59e0b; border: 1px solid #334155;">WHAT IT DOES</th>
</tr>
</thead>
<tbody>
<tr style="background: #ffffff;">
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: 600; color: #1e293b;">Always-on</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">SDK</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #16a34a;">YES</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #dc2626;">HIGH</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Dev / staging only</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Keep all spans, no sampling</td>
</tr>
<tr style="background: #f8fafc;">
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: 600; color: #1e293b;">Parent-based</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">SDK</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #f59e0b;">INHERITS</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #16a34a;">LOW</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Consistent decisions across services</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Keep/drop based on parent trace</td>
</tr>
<tr style="background: #ffffff;">
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: 600; color: #1e293b;">Probabilistic</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">SDK/Collector</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #dc2626;">NO</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #16a34a;">LOW</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Volume reduction on healthy traffic</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Randomly keep spans at a fixed rate</td>
</tr>
<tr style="background: #f8fafc;">
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: 600; color: #1e293b;">Rate-limiting</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Collector</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #dc2626;">NO</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #16a34a;">LOW</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Capping ingest cost during spikes</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Keep spans until a fixed rate limit</td>
</tr>
<tr style="background: #ffffff;">
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: 600; color: #1e293b;">Tail-based</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Collector (GW)</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #16a34a;">YES</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; font-weight: bold; color: #dc2626;">HIGH</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Error-aware sampling at scale</td>
<td style="padding: 10px 14px; border: 1px solid #e2e8f0; color: #475569;">Keep spans based on errors &amp; latency</td>
</tr>
</tbody>
</table>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; margin-top: 12px; text-align: center; max-width: 540px; margin-left: auto; margin-right: auto; margin-bottom: 12px;">Most production deployments combine parent-based sampling at the SDK with tail-based sampling at the gateway tier.</div>
<div>
<h3 id="the-combination-that-works-at-scale"><b>The combination that works at scale</b></h3>
<p><span style="font-weight: 400;">Parent-based sampling means the sampling decision is made once at the root span — the first service that receives the request — and every downstream service in that trace inherits the same decision automatically, so you never end up with a trace where some spans were kept and others were dropped by different services making independent choices.</span></p>
<p><span style="font-weight: 400;">Use </span><a href="https://opentelemetry.io/docs/languages/go/sampling/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">parent-based sampling at the SDK level</span></a><span style="font-weight: 400;"> to reduce overall span volume before it even reaches the collector, then use tail-based sampling at the gateway tier to make intelligent keep-or-drop decisions on what makes it through. Two passes of selection — aggressive on volume, smart about what survives.</span></p>
<p><span style="font-weight: 400;">A concrete example: set parent-based sampling at 10 percent for general traffic at the SDK. At the gateway, keep 100 percent of error traces, 100 percent of traces exceeding your latency SLO, and 10 percent of everything else. You end up storing roughly 11 to 12 percent of total trace volume, but with near-complete coverage of the production incidents you actually need to investigate.</span></p>
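<p><span style="font-weight: 400;">On the SDK side, that 10 percent parent-based split usually needs no code changes at all; the standard OTel environment variables from the SDK configuration spec cover it (variable support can vary slightly by language SDK, so verify against yours):</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">SDK SAMPLER CONFIG — ENVIRONMENT VARIABLES</div>
<pre># sample 10% of new traces at the root; child spans inherit the decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
</pre>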
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">TAIL SAMPLING POLICY CONFIG — GATEWAY TIER</div>
<pre>processors:
  tail_sampling:
    decision_wait: 10s      # wait for all spans before deciding
    num_traces: 100000      # traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # always keep error traces
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # always keep slow traces (adjust threshold to your SLO)
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}

      # keep 100% of checkout and payment — business critical
      - name: keep-critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [checkout-api, payment-service]

      # probabilistic baseline for everything else
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">Policies are evaluated in order. A trace is kept if any policy matches. The probabilistic baseline catches everything the specific policies did not select.</div>
<h3 id="memory-sizing-for-tail-based-sampling"><b>Memory sizing for tail-based sampling</b></h3>
<p><span style="font-weight: 400;">The <code>num_traces</code> parameter is the one that will bite you if you undershoot it. It controls how many traces the gateway holds in memory simultaneously while waiting for all their spans to arrive. A rough formula: multiply your expected traces per second by your <code>decision_wait</code> value, then add 20 percent headroom. For 1,000 traces per second with a 10 second wait, you need at least 12,000 slots — not the 1,000 that most tutorial configs show.</span></p>
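<p><span style="font-weight: 400;">That rule of thumb is simple enough to encode as a sanity check before you touch the config (the 20 percent headroom figure is this article's suggestion, not an official collector default):</span></p>

```python
def tail_sampling_num_traces(traces_per_sec: int, decision_wait_s: int,
                             headroom: float = 0.2) -> int:
    """Size num_traces as the expected number of in-flight traces during
    the decision window, plus headroom for traffic spikes."""
    return int(round(traces_per_sec * decision_wait_s * (1 + headroom)))

# 1,000 traces/sec with a 10 s decision_wait needs ~12,000 slots
print(tail_sampling_num_traces(1000, 10))  # → 12000
```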
<p><a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">The tail sampling processor documentation</span></a><span style="font-weight: 400;"> has the full parameter reference including the memory limiter integration, which you absolutely want enabled at the gateway tier to prevent OOM kills during traffic spikes.</span></p>
<h2 id="multi-cluster-setups-when-one-pipeline-is-not-enough"><b>Multi-Cluster Setups: When One Pipeline Is Not Enough</b></h2>
<p><span style="font-weight: 400;">At some point, a single OTel pipeline stops being the right model. Maybe you operate in multiple regions with data residency requirements. Maybe you have a mix of Kubernetes clusters running different workloads with different SLOs. Whatever the reason, multi-cluster OTel setups introduce a layer of complexity that single-cluster thinking does not prepare you for.</span></p>
<p><span style="font-weight: 400;">The fundamental question is where aggregation happens. Aggregate within each cluster and forward summarized telemetry to a global backend, and you keep cross-region bandwidth low but lose the ability to do cross-cluster trace correlation. Forward raw telemetry to a central aggregation layer, and you get full correlation capability at significantly higher egress cost. Most organizations end up with a hybrid: metrics and logs aggregate locally, traces are forwarded to a central tier for correlation.</span></p>
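<p><span style="font-weight: 400;">That hybrid split maps naturally onto per-signal pipelines in the regional gateway's collector config. A minimal sketch, with hypothetical endpoint names:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">HYBRID ROUTING SKETCH — REGIONAL GATEWAY</div>
<pre>exporters:
  otlp/central:      # traces go to the central tier for cross-cluster correlation
    endpoint: central-gateway.example.com:4317
  otlp/regional:     # metrics and logs stay in-region
    endpoint: regional-backend.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/central]
    metrics:
      receivers: [otlp]
      exporters: [otlp/regional]
    logs:
      receivers: [otlp]
      exporters: [otlp/regional]
</pre>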
<h3 id="getting-trace-context-across-cluster-boundaries"><b>Getting trace context across cluster boundaries</b></h3>
<p><span style="font-weight: 400;">Cross-cluster trace correlation only works if your services propagate the </span><a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">W3C traceparent header</span></a><span style="font-weight: 400;"> across cluster boundaries. Internal service mesh traffic usually handles this correctly. However, cross-cluster calls that pass through an API gateway, CDN, or any reverse proxy that strips unknown headers will </span><b>break trace continuity</b><span style="font-weight: 400;"> at that boundary.</span></p>
<p><span style="font-weight: 400;">Diagnosing this is straightforward: if you see a trace starting at an API gateway span and the first downstream service shows a different root span with no parent, there’s a propagation break. To fix it, add </span><span style="font-weight: 400;"><code>traceparent</code></span><span style="font-weight: 400;"> and </span><span style="font-weight: 400;"><code>tracestate</code></span><span style="font-weight: 400;"> to your proxy’s header allowlist.</span></p>
<p><span style="font-weight: 400;">Here is what that looks like in the two most common cases:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">PROXY HEADER CONFIG — NGINX AND ENVOY</div>
<pre># nginx — add inside your proxy_pass block
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate  $http_tracestate;

---

# Envoy — re-add the headers on the route config; %REQ(...)% copies
# the value from the incoming request
route_config:
  request_headers_to_add:
    - header: { key: traceparent, value: "%REQ(traceparent)%" }
    - header: { key: tracestate,  value: "%REQ(tracestate)%" }
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">One of these two covers the vast majority of cases. If you are behind a CDN, check their documentation for custom header passthrough settings.</div>
<h3 id="data-residency-and-the-gdpr-headache"><b>Data residency and the GDPR headache</b></h3>
<p><span style="font-weight: 400;">If you operate in the EU, forwarding raw traces containing user identifiers to a central tier outside the EU can be a compliance problem. The practical solution is to run attribute redaction in your regional gateway before any data leaves the region. The OTel Collector’s transform processor lets you hash, mask, or drop specific attributes before export.</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">PII REDACTION CONFIG — EU GATEWAY PROCESSOR</div>
<pre>processors:
  transform/redact_pii:
    trace_statements:
      - context: span
        statements:
          # hash user IDs rather than drop
          - set(attributes["user.id"], SHA256(attributes["user.id"]))
          # drop email entirely
          - delete_key(attributes, "user.email")
          # truncate IP to /24 for geo without individual tracking
          - replace_pattern(attributes["net.peer.ip"], "\\d+$", "0")
</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">Run PII redaction at the regional gateway, not the central tier. By the time data reaches central, sensitive attributes should already be gone.</div>
<h2 id="keeping-the-pipeline-itself-observable"><b>Keeping the Pipeline Itself Observable</b></h2>
<p><span style="font-weight: 400;">It would be ironic if the observability tooling itself could not be observed. The </span><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OTel Collector</span></a><span style="font-weight: 400;"> exposes its own internal telemetry as a standard OTLP pipeline, which means you can route it to any backend or observability solution you are already using.</span></p>
<p><span style="font-weight: 400;"><code>otelcol_processor_batch_timeout_trigger_send</code> (gotta love this long property name!) tells you whether the batch processor is flushing because the timeout fired rather than because the batch was full. </span><b>A high ratio of timeout-triggered flushes means your traffic volume is lower than your batch config expects, and you are adding unnecessary latency.</b></p>
<p><span style="font-weight: 400;"><code>otelcol_exporter_queue_size</code> is the canary for backpressure. </span><b>When <code>otelcol_exporter_queue_size</code> climbs toward your configured maximum, your exporter is falling behind the ingest rate.</b><span style="font-weight: 400;"> If it hits the maximum, the collector starts dropping data. Set an alert at 80 percent of queue capacity and you will catch pressure building before it becomes data loss.</span></p>
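<p><span style="font-weight: 400;">A minimal Prometheus alert rule for that 80 percent threshold might look like the sketch below. It assumes the collector also exports <code>otelcol_exporter_queue_capacity</code>, which recent collector versions do; check the metric names your version actually emits:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">EXPORTER QUEUE ALERT — PROMETHEUS RULE SKETCH</div>
<pre>groups:
  - name: otel-collector
    rules:
      - alert: OtelExporterQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter queue above 80% of capacity; data loss follows if it fills"
</pre>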
<p><span style="font-weight: 400;"><code>otelcol_processor_tail_sampling_sampling_decision_timer_latency</code> (another awesome long name!) tells you how long the tail sampling processor is taking to make decisions. A sudden increase here usually means the number of active traces in memory has grown past what the processor can efficiently scan — either increase resources or tighten your sampling policy.</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">COLLECTOR SELF-MONITORING CONFIG</div>
<pre>receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]

# Expose collector's own telemetry via its service config
service:
  telemetry:
    metrics:
      level: detailed   # none | basic | normal | detailed
      address: 0.0.0.0:8888
    logs:
      level: warn       # keep collector logs quiet in production

</pre>
<div style="font-size: 13px; color: #94a3b8; font-style: italic; text-align: center; max-width: 480px; margin-left: auto; margin-right: auto; margin-bottom: 24px;">Set telemetry level to ‘detailed’ in staging to understand baseline behavior, then dial back to ‘normal’ in production.</div>
<h2 id="rolling-this-out-without-breaking-everything"><b>Rolling This Out Without Breaking Everything</b></h2>
<p><span style="font-weight: 400;">The migration path from a single collector to a tiered setup does not have to be a big-bang cutover. You could introduce the gateway tier first while keeping the existing single collector in place, route a small percentage of services to the new tier, and validate that data is flowing correctly before moving everything over.</span></p>
<p><span style="font-weight: 400;">I suggest you start with a non-critical service — one that has decent traffic but where gaps in telemetry during the migration window would not cause anyone to lose sleep. Verify spans arrive at the gateway, verify they arrive at the backend with the right resource attributes, and check that your tail sampling policies are making sensible decisions. That validation loop is worth running for a week before you touch any of your critical services.</span></p>
<p><span style="font-weight: 400;">The config change on the service side is usually just updating the OTLP endpoint to the new agent address. If you are using the </span><a href="https://opentelemetry.io/docs/kubernetes/operator/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OTel Operator for Kubernetes</span></a><span style="font-weight: 400;">, you can inject the agent endpoint as an environment variable through the Instrumentation custom resource — no application code changes, no redeployment of service configs when the collector topology changes.</span></p>
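<p><span style="font-weight: 400;">A minimal Instrumentation resource for that pattern looks roughly like this; the endpoint and namespace are placeholders for your own agent service:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">OTEL OPERATOR — INSTRUMENTATION RESOURCE SKETCH</div>
<pre>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
spec:
  exporter:
    # agent DaemonSet service; change the topology here, not in app configs
    endpoint: http://otel-agent.observability:4317
</pre>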
<p><span style="font-weight: 400;">The pattern across all of this — tiered collectors, trace-aware load balancing, layered sampling strategies, regional pipelines — is that scaling OTel is fundamentally an architecture problem, not an instrumentation problem. The instrumentation is the relatively easy part. The hard part is building a pipeline that stays operational under load, degrades gracefully when individual components have problems, and gives you enough visibility into itself that you can tell when something is wrong before it starts affecting the data your engineers depend on during incidents.</span></p>
<p><span style="font-weight: 400;">Once your OpenTelemetry pipeline is running at scale, the next step is learning how to interpret the traces to identify performance bottlenecks and root causes. See </span><a href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/" target="_blank" rel="noopener"><span style="font-weight: 400;">Troubleshooting Microservices with OpenTelemetry Distributed Tracing</span></a><span style="font-weight: 400;"> for in-depth, practical guidance on that subject.</span></p>
</div>
</div>
</div>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/running-opentelemetry-at-scale-architecture-patterns-for-100s-of-services/">Running OpenTelemetry at Scale: Architecture Patterns for 100s of Services</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>From Debugging to SLOs: How OpenTelemetry Changes the Way Teams Do Observability</title>
		<link>https://sematext.com/blog/from-debugging-to-slos-how-opentelemetry-changes-the-way-teams-do-observability/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 10:15:34 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[Tracing]]></category>
		<category><![CDATA[distributed tracing]]></category>
		<category><![CDATA[microservices]]></category>
		<category><![CDATA[opentelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70545</guid>

					<description><![CDATA[<p>At some point in every team’s life, someone gets paged at 2 AM because a service is ‘slow.’ Nobody knows which service. Nobody knows why. Someone opens five different dashboards, pastes a trace ID into a Slack thread, and thirty minutes later you have twelve engineers in a call arguing about whether the problem is [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/from-debugging-to-slos-how-opentelemetry-changes-the-way-teams-do-observability/">From Debugging to SLOs: How OpenTelemetry Changes the Way Teams Do Observability</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">At some point in every team’s life, someone gets paged at 2 AM because a service is ‘slow.’ Nobody knows which service. Nobody knows why. Someone opens five different dashboards, pastes a trace ID into a Slack thread, and thirty minutes later you have twelve engineers in a call arguing about whether the problem is in the database or the API gateway. By the time you find the actual culprit, half the team has memorized each other’s sleep schedules.</span></p>
<p><span style="font-weight: 400;">This is what life looks like when observability is an afterthought: logs in one place, metrics in another, and a custom monitoring agent that only works for two services because the third one was written in a language nobody on the team uses anymore. It works, technically. Until it does not.</span></p>
<p><span style="font-weight: 400;">OpenTelemetry came out of a genuine frustration with this fragmented mess. It is an open-source observability framework that gives you a </span><a href="https://sematext.com/guides/understanding-opentelemetry-a-practical-guide/" target="_blank" rel="noopener"><span style="font-weight: 400;">vendor-neutral, standardized way to instrument your applications</span></a><span style="font-weight: 400;"> and then connect that instrumentation to service health, error budgets, and eventually SLOs that your entire organization actually understands. This article walks through what that shift looks like in practice, and why it matters for more than just the people who are on call.</span></p>
<h2 id="the-old-world-logs-apm-agents-and-the-dashboard-graveyard"><b>The Old World: Logs, APM Agents, and the Dashboard Graveyard</b></h2>
<p><span style="font-weight: 400;">Let’s be direct about how most teams actually do observability before they invest in it properly. You have application logs going into a log management platform, with varying levels of structure depending on who wrote which service. You have an APM tool that auto-instruments some of your services but not all of them, and the traces it produces are siloed within its own ecosystem. And you have a monitoring dashboard that someone built eighteen months ago and that might or might not reflect how the service actually behaves today.</span></p>
<div style="background: rgba(220,38,38,0.06); border-left: 3px solid #DC2626; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #7f1d1d;"><strong style="color: #991b1b;">The real cost is not the outage. It is the investigation.</strong> A 2023 industry study on downtime costs found that engineering teams spend an average of 200-plus hours per year just on incident investigation, separate from the time actually fixing things. A good chunk of that is tool-switching and context-switching because telemetry data lives in silos.</div>
<p><span style="font-weight: 400;">The deeper problem is not the tools themselves; it is that each one has its own instrumentation model. Your APM agent captures HTTP spans one way. Your custom metrics library reports latency percentiles slightly differently. Your logs do not correlate to your traces automatically. So when something breaks, you are stitching together three different narratives instead of reading one coherent story about what happened.</span></p>
<div style="background: rgba(26,86,160,0.06); border-left: 3px solid #1a56dc; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #1e3a5f;">This is actually the origin of Sematext – back in 2012 Sematext was the first to offer both performance monitoring (i.e., metrics) and log monitoring in one observability platform, and then added distributed transaction tracing in 2015.</div>
<h2 id="what-opentelemetry-actually-is-without-the-fluff"><b>What OpenTelemetry Actually Is (Without the Fluff)</b></h2>
<p><span style="font-weight: 400;">OpenTelemetry standardizes how you generate, collect, and export telemetry data. It covers three signal types (with more to come), which are the foundation of everything else in this article:</span></p>
<div style="margin: 36px 0; font-family: inherit;">
<p style="font-family: 'JetBrains Mono', monospace; font-size: 11px; font-weight: 600; letter-spacing: 2.5px; text-transform: uppercase; color: #94a3b8; margin-bottom: 16px;">THE THREE PILLARS OF OPENTELEMETRY</p>
<p><!-- Card grid wrapper --></p>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 12px; padding: 24px; box-sizing: border-box;">
<table style="width: 100%; border-collapse: separate; border-spacing: 14px; table-layout: fixed;">
<tbody>
<tr style="vertical-align: top;"><!-- Traces -->
<td style="border-radius: 10px; padding: 24px 16px 20px; text-align: center; border: 1px solid rgba(59,130,246,0.3); background: rgba(59,130,246,0.05); width: 33.33%;">
<div style="font-size: 28px; margin-bottom: 12px;">🔗</div>
<div style="font-family: 'JetBrains Mono', monospace; font-size: 13px; font-weight: 600; letter-spacing: 1px; color: #2563eb; margin-bottom: 10px;">Traces</div>
<p style="font-size: 13px; color: #475569; line-height: 1.6; margin: 0;">End-to-end request paths across services. Shows exactly where time is spent and where errors propagate.</p>
</td>
<p><!-- Metrics --></p>
<td style="border-radius: 10px; padding: 24px 16px 20px; text-align: center; border: 1px solid rgba(16,185,129,0.3); background: rgba(16,185,129,0.05); width: 33.33%;">
<div style="font-size: 28px; margin-bottom: 12px;">📊</div>
<div style="font-family: 'JetBrains Mono', monospace; font-size: 13px; font-weight: 600; letter-spacing: 1px; color: #059669; margin-bottom: 10px;">Metrics</div>
<p style="font-size: 13px; color: #475569; line-height: 1.6; margin: 0;">Numeric measurements over time: latency histograms, request counts, error rates, resource utilization. The raw material for SLOs.</p>
</td>
<p><!-- Logs --></p>
<td style="border-radius: 10px; padding: 24px 16px 20px; text-align: center; border: 1px solid rgba(245,158,11,0.3); background: rgba(245,158,11,0.05); width: 33.33%;">
<div style="font-size: 28px; margin-bottom: 12px;">📋</div>
<div style="font-family: 'JetBrains Mono', monospace; font-size: 13px; font-weight: 600; letter-spacing: 1px; color: #d97706; margin-bottom: 10px;">Logs</div>
<p style="font-size: 13px; color: #475569; line-height: 1.6; margin: 0;">Structured event records with trace context attached. No more copy-pasting trace IDs; logs link directly to the span that generated them, and an error span links back to every log event emitted during that span.</p>
</td>
</tr>
</tbody>
</table>
</div>
<p><!-- Caption --></p>
<p style="text-align: center; font-size: 13px; font-style: italic; color: #94a3b8; margin-top: 12px;">Traces, Metrics, and Logs share the same context propagation model in OTel, which lets you jump from a log line to its trace in seconds.</p>
</div>
<p><span style="font-weight: 400;">What makes OTel different from what came before is not magic; it is the fact that all three signals share the same </span><a href="https://opentelemetry.io/docs/specs/otel/context/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">context propagation model</span></a><span style="font-weight: 400;">. A trace ID that starts in your frontend propagates through every instrumented microservice call, and if your logs are also emitting that trace ID, you can jump from a log line to its trace in seconds. Not minutes. Seconds. If you are the person doing production troubleshooting, you know how valuable this difference is!</span></p>
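<p><span style="font-weight: 400;">In practice, a trace-aware log record carries that context right alongside the message. The IDs below are the W3C trace-context specification's example values, not real ones:</span></p>
<div style="font-size: 11px; font-weight: bold; letter-spacing: 0.12em; color: #64748b; text-transform: uppercase;">STRUCTURED LOG RECORD WITH TRACE CONTEXT</div>
<pre>{
  "timestamp": "2026-02-23T10:15:34Z",
  "severity": "ERROR",
  "body": "payment authorization failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
</pre>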
<h2 id="slos-what-they-are-and-why-otel-makes-them-achievable"><b>SLOs: What They Are and Why OTel Makes Them Achievable</b></h2>
<p><a href="https://sematext.com/glossary/service-level-objective/" target="_blank" rel="noopener"><span style="font-weight: 400;">Service Level Objectives</span></a><span style="font-weight: 400;"> have been a thing since Google wrote about them in the </span><a href="https://sre.google/sre-book/service-level-objectives/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Site Reliability Engineering book</span></a><span style="font-weight: 400;">, and they have been misunderstood and poorly implemented since roughly the same time. The core idea is simple: you agree on a target for how reliable a service needs to be, you measure it consistently, and you manage your engineering work in relation to how much reliability budget you have consumed or have left.</span></p>
<p><span style="font-weight: 400;">The reason SLOs often fail is not the concept; it is that teams try to define them before they have reliable telemetry. You cannot set a meaningful availability target for a service if your metrics come from three different monitoring agents that measure availability in subtly different ways. You end up with SLOs that nobody trusts, which means nobody uses them to make decisions.</span></p>
<div style="margin: 36px 0;">
<p><!-- Section label --></p>
<p style="font-family: 'JetBrains Mono', monospace; font-size: 11px; font-weight: 600; letter-spacing: 2.5px; text-transform: uppercase; color: #94a3b8; margin-bottom: 16px;">Example SLOs Built on OTel Metrics</p>
<p><!-- Table wrapper --></p>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 12px; padding: 4px; overflow: hidden;">
<table style="width: 100%; border-collapse: collapse; font-size: 14px;">
<thead>
<tr>
<th style="background: #1e3a5f; color: #93c5fd; font-family: 'JetBrains Mono', monospace; font-size: 11px; text-transform: uppercase; letter-spacing: 1.2px; padding: 13px 14px; text-align: left; border-bottom: 1px solid #e2e8f0;">Service</th>
<th style="background: #1e3a5f; color: #93c5fd; font-family: 'JetBrains Mono', monospace; font-size: 11px; text-transform: uppercase; letter-spacing: 1.2px; padding: 13px 14px; text-align: left; border-bottom: 1px solid #e2e8f0;">SLI</th>
<th style="background: #1e3a5f; color: #93c5fd; font-family: 'JetBrains Mono', monospace; font-size: 11px; text-transform: uppercase; letter-spacing: 1.2px; padding: 13px 14px; text-align: left; border-bottom: 1px solid #e2e8f0;">Target</th>
<th style="background: #1e3a5f; color: #93c5fd; font-family: 'JetBrains Mono', monospace; font-size: 11px; text-transform: uppercase; letter-spacing: 1.2px; padding: 13px 14px; text-align: left; border-bottom: 1px solid #e2e8f0;">Error Budget</th>
<th style="background: #1e3a5f; color: #93c5fd; font-family: 'JetBrains Mono', monospace; font-size: 11px; text-transform: uppercase; letter-spacing: 1.2px; padding: 13px 14px; text-align: left; border-bottom: 1px solid #e2e8f0;">Status</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">Checkout API</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">% requests &lt; 500 ms, non-5xx</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">99.5%</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">3 h 36 m remaining</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; vertical-align: middle;"><span style="display: inline-block; font-family: 'JetBrains Mono', monospace; font-size: 10px; font-weight: 600; padding: 3px 9px; border-radius: 12px; letter-spacing: 0.5px; background: rgba(16,185,129,0.1); color: #059669; border: 1px solid rgba(16,185,129,0.3);">HEALTHY</span></td>
</tr>
<tr>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">Auth Service</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">% successful token validations</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">99.9%</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">0 h 22 m remaining</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; vertical-align: middle;"><span style="display: inline-block; font-family: 'JetBrains Mono', monospace; font-size: 10px; font-weight: 600; padding: 3px 9px; border-radius: 12px; letter-spacing: 0.5px; background: rgba(245,158,11,0.1); color: #d97706; border: 1px solid rgba(245,158,11,0.3);">AT RISK</span></td>
</tr>
<tr>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">Search API</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">% queries returning results &lt; 1 s</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #1e293b; vertical-align: middle;">98.0%</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; color: #475569; vertical-align: middle;">Budget exhausted</td>
<td style="padding: 12px 14px; border-bottom: 1px solid #e2e8f0; vertical-align: middle;"><span style="display: inline-block; font-family: 'JetBrains Mono', monospace; font-size: 10px; font-weight: 600; padding: 3px 9px; border-radius: 12px; letter-spacing: 0.5px; background: rgba(220,38,38,0.1); color: #dc2626; border: 1px solid rgba(220,38,38,0.3);">BREACHED</span></td>
</tr>
<tr>
<td style="padding: 12px 14px; color: #1e293b; vertical-align: middle;">Order Worker</td>
<td style="padding: 12px 14px; color: #475569; vertical-align: middle;">% jobs processed without retry</td>
<td style="padding: 12px 14px; color: #1e293b; vertical-align: middle;">99.0%</td>
<td style="padding: 12px 14px; color: #475569; vertical-align: middle;">5 h 12 m remaining</td>
<td style="padding: 12px 14px; vertical-align: middle;"><span style="display: inline-block; font-family: 'JetBrains Mono', monospace; font-size: 10px; font-weight: 600; padding: 3px 9px; border-radius: 12px; letter-spacing: 0.5px; background: rgba(16,185,129,0.1); color: #059669; border: 1px solid rgba(16,185,129,0.3);">HEALTHY</span></td>
</tr>
</tbody>
</table>
</div>
<p><!-- Caption --></p>
<p style="text-align: center; font-size: 13px; font-style: italic; color: #94a3b8; margin-top: 12px;">When SLIs are computed from OTel semantic conventions, every service uses the same measurement logic regardless of language or framework.</p>
</div>
<p><span style="font-weight: 400;">When your </span><a href="https://sematext.com/glossary/service-level-indicator/" target="_blank" rel="noopener"><span style="font-weight: 400;">SLIs</span></a><span style="font-weight: 400;"> are computed from OTel metrics, specifically from the semantic conventions that define how HTTP span duration and status should be recorded, you get consistency across services by default. The latency histogram for your Go service and the one for your .NET service use the same bucket boundaries. The error classification follows the same logic. Suddenly your SLOs are comparing apples to apples, and that changes what you can do with them.</span></p>
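<p><span style="font-weight: 400;">As a sketch of what that consistency buys you, here is how an SLI can be computed directly from explicit-bucket histogram counts. The bucket boundaries and counts below are made up for illustration, not real telemetry:</span></p>

```python
# Sketch: computing a latency SLI from explicit-bucket histogram data,
# the way an SLO pipeline might consume OTel HTTP duration metrics.
# Boundaries and counts are illustrative.

def sli_under_threshold(boundaries, counts, threshold):
    """Fraction of requests at or below `threshold` seconds.

    `boundaries` are upper bucket bounds; `counts` has one extra
    entry for the overflow bucket above the last boundary.
    """
    total = sum(counts)
    if total == 0:
        return 1.0  # no traffic, so nothing violated the SLI
    good = sum(c for b, c in zip(boundaries, counts) if b <= threshold)
    return good / total

# Default-style OTel boundaries in seconds, truncated for brevity.
boundaries = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
counts = [120, 340, 900, 1500, 2100, 800, 200, 40]  # last entry: > 0.5 s

print(round(sli_under_threshold(boundaries, counts, 0.5), 4))
# → 0.9933
```

<p><span style="font-weight: 400;">Because every service emits the same bucket boundaries, the same function works for the Go service and the .NET service alike; that is the whole point of shared semantic conventions.</span></p>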
<h2 id="the-correlation-story-how-one-trace-id-connects-everything"><b>The Correlation Story: How one Trace ID Connects Everything</b></h2>
<p><span style="font-weight: 400;">One of the things that sounds academic until you experience it is </span><a href="https://opentelemetry.io/docs/concepts/context-propagation/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">trace context propagation</span></a><span style="font-weight: 400;">. When a request comes into your frontend and you are using OTel instrumentation, a trace ID gets generated and passed along to every downstream service call via HTTP headers, gRPC metadata, message queue attributes, or whatever transport you are using. Every span in that trace carries the same trace ID, and your logs carry it too if you have set up log correlation.</span></p>
<p><span style="font-weight: 400;">What this means in practice: when your error rate alert fires because the checkout service just breached its error budget, you do not start by guessing. You go to the traces for that time window, filter for error spans, and you are already looking at the full call path: frontend, checkout API, inventory service, payment gateway, with timing for each hop. If the inventory service was slow, you will see a long span there. If the payment gateway returned a 503, you will see that in the span status. No grep-ing through logs trying to find a request ID that someone may or may not have remembered to log. For a step-by-step breakdown of what these patterns look like in real incidents,</span><a href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/" target="_blank" rel="noopener"> <span style="font-weight: 400;">troubleshooting microservices with distributed tracing</span></a><span style="font-weight: 400;"> is a good companion read.</span></p>
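<p><span style="font-weight: 400;">To make the mechanics concrete: the W3C traceparent header that carries this context is simple enough to parse by hand. A minimal sketch, using the example header value from the W3C Trace Context spec:</span></p>

```python
# Sketch: extracting the trace ID from a W3C traceparent header so
# application logs can carry the same ID as the spans.
# Header format: version-traceid-spanid-flags, lowercase hex.

def trace_id_from_traceparent(header):
    parts = header.strip().split("-")
    if len(parts) != 4 or len(parts[1]) != 32:
        return None  # malformed header, no correlation possible
    return parts[1]

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(trace_id_from_traceparent(header))
# → 4bf92f3577b34da6a3ce929d0e0e4736
```

<p><span style="font-weight: 400;">In practice the OTel SDKs propagate and extract this for you; parsing by hand is only useful when wiring trace IDs into a logging setup the SDK does not already cover.</span></p>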
<div style="margin: 36px 0;">
<p><!-- Section label --></p>
<p style="font-family: 'JetBrains Mono', monospace; font-size: 11px; font-weight: 600; letter-spacing: 2.5px; text-transform: uppercase; color: #94a3b8; margin-bottom: 16px;">Before vs After: What Investigation Actually Looks Like</p>
<p><!-- Cards wrapper --></p>
<div style="background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 12px; padding: 24px; box-sizing: border-box;">
<table style="width: 100%; border-collapse: separate; border-spacing: 16px; table-layout: fixed;">
<tbody>
<tr style="vertical-align: top;"><!-- Before -->
<td style="border-radius: 10px; padding: 22px 18px; background: rgba(220,38,38,0.05); border: 1px solid rgba(220,38,38,0.25); width: 50%;">
<div style="font-family: 'JetBrains Mono', monospace; font-size: 11px; font-weight: 600; letter-spacing: 2px; text-transform: uppercase; color: #dc2626; margin-bottom: 14px;">Before OTel</div>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Alert fires. Open APM tool, find service.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Open logging tool, search by timestamp.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Paste trace ID into search; hope the log format includes it.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Cross-reference three tools. Escalate because nobody can reproduce it.</p>
<p style="font-size: 14px; color: #94a3b8; font-style: italic; margin: 0; line-height: 1.6;">MTTR: 45 to 90 min for medium-severity incidents.</p>
</td>
<!-- After -->
<td style="border-radius: 10px; padding: 22px 18px; background: rgba(16,185,129,0.05); border: 1px solid rgba(16,185,129,0.25); width: 50%;">
<div style="font-family: 'JetBrains Mono', monospace; font-size: 11px; font-weight: 600; letter-spacing: 2px; text-transform: uppercase; color: #059669; margin-bottom: 14px;">After OTel</div>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Alert fires with a link to the error budget burn rate.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Click through to traces for that time window.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Follow the trace to the failing span.</p>
<p style="font-size: 14px; color: #475569; margin-bottom: 8px; line-height: 1.6;">Logs automatically surfaced by trace ID.</p>
<p style="font-size: 14px; color: #94a3b8; font-style: italic; margin: 0; line-height: 1.6;">MTTR: 5 to 20 min for the same incidents.</p>
</td>
</tr>
</tbody>
</table>
</div>
<p><!-- Caption --></p>
<p style="text-align: center; font-size: 13px; font-style: italic; color: #94a3b8; margin-top: 12px;">The difference in MTTR is not about effort. It is about whether correlated telemetry exists at all.</p>
</div>
<h2 id="auto-instrumentation-getting-value-without-rewriting-everything"><b>Auto-Instrumentation: Getting Value Without Rewriting Everything</b></h2>
<p><span style="font-weight: 400;">One of the biggest objections to investing in observability is the instrumentation cost. If you have thirty microservices and each one needs to be manually instrumented before you see any benefit, that is a project with a very long feedback loop. This is exactly what we saw with our initial distributed tracing implementation at Sematext back in 2015: adoption was a challenge because of how much work engineers had to invest in instrumenting their applications. OTel&#8217;s auto-instrumentation libraries change that equation significantly.</span></p>
<p><span style="font-weight: 400;">For Java, the </span><a href="https://opentelemetry.io/docs/zero-code/java/agent/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OTel Java agent</span></a><span style="font-weight: 400;"> attaches to your JVM at startup and automatically instruments common frameworks such as Spring Boot, gRPC, JDBC, and Kafka without any code changes. For Python, </span><a href="https://opentelemetry.io/docs/zero-code/python/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">opentelemetry-instrument</span></a><span style="font-weight: 400;"> does the same for Flask, Django, FastAPI, and SQLAlchemy. The .NET ecosystem has similar coverage through the </span><a href="https://opentelemetry.io/docs/zero-code/dotnet/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">automatic instrumentation package</span></a><span style="font-weight: 400;">. You get spans for every incoming HTTP request, every outgoing call, and every database query without touching the application code. If you want to skip the boilerplate and start from something that already works,</span><a href="https://github.com/sematext/sematext-opentelemetry-examples" target="_blank" rel="noopener noreferrer"> <span style="font-weight: 400;">these language-specific OTel examples</span></a><span style="font-weight: 400;"> cover the setup end to end.</span></p>
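<p><span style="font-weight: 400;">For the Python case, the zero-code path looks roughly like this; the service name and endpoint are placeholders for your own values:</span></p>

```shell
# Install the distro and an OTLP exporter, then pull in
# instrumentations for whatever frameworks are already installed.
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app unchanged; the wrapper wires up tracing at startup.
OTEL_SERVICE_NAME=checkout-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```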
<h2 id="what-to-actually-watch-out-for"><b>What to Actually Watch Out For</b></h2>
<p><span style="font-weight: 400;">None of this comes without tradeoffs, and articles that only cover the benefits are setting you up for some unpleasant surprises. A few things will bite you if you do not plan for them.</span></p>
<p><span style="font-weight: 400;">A deep dive into</span><a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/" target="_blank" rel="noopener"> <span style="font-weight: 400;">OpenTelemetry instrumentation best practices</span></a><span style="font-weight: 400;"> covers all of these in detail, but here is the short version.</span></p>
<h3 id="cardinality-explodes-if-you-are-not-careful"><b>Cardinality explodes if you are not careful</b></h3>
<p><span style="font-weight: 400;">OTel metrics support rich attribute sets, which is great for debugging but problematic for storage if you start adding high-cardinality attributes like user IDs or request IDs to your metrics. The OTel metrics spec includes cardinality limits, and you should understand them before you start attaching attributes to everything.</span></p>
<h3 id="sampling-is-necessary-at-scale-and-confusing-to-get-right"><b>Sampling is necessary at scale and confusing to get right</b></h3>
<p><span style="font-weight: 400;">Sending 100 percent of traces when you are handling thousands of requests per second is expensive. Head-based sampling, where you decide at the start of a trace whether to keep it, is simple but means you might drop the interesting traces. Tail-based sampling, where you decide after seeing the whole trace, keeps the errors but requires the OTel Collector to buffer spans, which adds complexity. There is no right answer, only tradeoffs that depend on your volume and budget.</span></p>
<h3 id="auto-instrumentation-vs-manual-instrumentation-the-honest-tradeoff"><b>Auto-instrumentation vs manual instrumentation: the honest tradeoff</b></h3>
<p><span style="font-weight: 400;">Auto-instrumentation gets you running in an afternoon with zero code changes and gives consistent coverage across your entire fleet from day one. The trade-off is that it understands frameworks, not business intent. It can tell you a database query took 800 ms, but not that it was pricing a cart for a high-value customer.</span></p>
<p><span style="font-weight: 400;">Manual instrumentation fills the gaps that actually matter for SLOs: checkout completion time, order processing latency by fulfillment partner, or time to first search result. It takes more effort, but it is what turns a latency alert into a business conversation.</span></p>
<p><span style="font-weight: 400;">In practice, auto-instrumentation provides the foundational 80 percent: requests, error rates, and durations (aka RED) from day one. You then layer manual instrumentation on top for the business-critical signals your SLOs should be measuring.</span></p>
<h3 id="the-collector-configuration-gets-complex-fast"><b>The Collector configuration gets complex fast</b></h3>
<p><span style="font-weight: 400;">Once you start running multiple pipelines, applying transforms, doing tail-based sampling, and exporting to multiple backends, your collector config becomes something that needs to be tested and versioned like application code. Treat it that way from the start.</span></p>
<h2 id="starting-without-starting-over"><b>Starting Without Starting Over</b></h2>
<p><span style="font-weight: 400;">The most common mistake teams make when adopting OTel is treating it as a big-bang migration. You do not need to instrument every service before any of it becomes useful. Pick one service, ideally something that sits in the middle of your call graph so you can see upstream and downstream spans, and get it fully instrumented with OTel, exporting to a collector and from there to whatever backend you already have. Define one or two SLIs for it. Watch them for a week and see if they match your intuition about how the service is performing.</span></p>
<p><span style="font-weight: 400;">That first service will teach you things that no amount of reading can. You will find out how your framework handles context propagation. You will discover that your log format does not include trace IDs and need to fix that. You will learn what your normal latency histogram looks like and be surprised by the long tail. Do that before you roll out to thirty services, and the rollout will go much faster. </span></p>
<p><span style="font-weight: 400;">To get started, see the </span><a href="https://sematext.com/docs/tracing/getting-started/" target="_blank" rel="noopener"><span style="font-weight: 400;">Sematext step-by-step setup guide</span></a><span style="font-weight: 400;"> for OpenTelemetry tracing. Once you have that in place, the article on </span><a href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/#building-a-troubleshooting-workflow-with-sematext-tracing" target="_blank" rel="noopener"><span style="font-weight: 400;">building a troubleshooting workflow with Sematext tracing</span></a><span style="font-weight: 400;"> shows how to use those first traces to investigate issues and iterate on your instrumentation.</span></p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/from-debugging-to-slos-how-opentelemetry-changes-the-way-teams-do-observability/">From Debugging to SLOs: How OpenTelemetry Changes the Way Teams Do Observability</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It</title>
		<link>https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Tue, 17 Feb 2026 11:15:15 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[Tracing]]></category>
		<category><![CDATA[distributed tracing]]></category>
		<category><![CDATA[microservices]]></category>
		<category><![CDATA[opentelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70527</guid>

					<description><![CDATA[<p>OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable. However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/">OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable.</p>
<p>However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start seeing dropped spans. Collector memory climbs until the process gets killed, and if you are running a single-instance collector, you can forget about collecting any telemetry until you bring it back up. Costs climb faster than anyone budgeted for. A few traces look incomplete. The bossman asks why latency increased by 12% after “just adding observability.”</p>
<p>None of this means OpenTelemetry is broken. It means production behaves differently than demos. This guide walks through what actually breaks when OpenTelemetry meets real-world scale, and what you can do about it before it becomes a 2 AM incident. Catching these issues early is the difference between a boring Tuesday and a war room.</p>
<p><span style="font-weight: 400;">For a practical setup of OpenTelemetry in microservices, see our </span><a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/" target="_blank" rel="noopener"><span style="font-weight: 400;">step-by-step guide on distributed tracing with auto-instrumentation</span></a><span style="font-weight: 400;">.</span></p>
<h2 id="the-first-production-surprise-cardinality-explosions"><b>The First Production Surprise: Cardinality Explosions</b></h2>
<p>High cardinality is one of the fastest ways to destabilize an otherwise healthy observability setup, and it almost always starts innocently. Someone with the best intentions adds a genuinely helpful attribute:</p>
<ul>
<li aria-level="1">user_id</li>
<li aria-level="1">session_id</li>
<li aria-level="1">request_uuid</li>
<li aria-level="1">a fully expanded URL path</li>
</ul>
<p><span style="font-weight: 400;">In development, nothing bad happens. In production, that single decision can create </span><i><span style="font-weight: 400;">millions</span></i><span style="font-weight: 400;"> of unique time series. For example, if a request counter is labeled with </span><code>user_id</code><span style="font-weight: 400;"> and you have two million users, you have just created two million distinct metric series for one metric. Multiply that across services and dimensions, and storage, memory, and the performance of your observability tool all degrade quickly.</span></p>
<p><span style="font-weight: 400;">You will notice it in a few ways: dashboards become noisy or slow, request latency increases, storage costs spike, and collector memory usage grows for no obvious reason.</span></p>
<p><span style="font-weight: 400;">The fix is not complicated, but it requires discipline. </span><b>Metrics should use low-cardinality dimensions only</b><span style="font-weight: 400;">, things like environment (prod, staging), service name, endpoint patterns rather than full URLs, and HTTP status classes (2xx, 4xx, 5xx). Anything that is essentially unique per request does not belong on a metric.</span></p>
<p><span style="font-weight: 400;">With auto-instrumentation, you do not always control attribute creation directly, but you can still suppress high-cardinality attributes via agent configuration, or drop and transform attributes in the collector using processors like filter, attributes, or transform. With manual instrumentation, you have full control and full responsibility. If you truly need high-cardinality identifiers, consider hashing or aggregating them before attaching them.</span></p>
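<p><span style="font-weight: 400;">As a sketch of the collector-side option, the attributes processor can delete the identifiers mentioned above before anything reaches your backend (the name suffix after the slash is arbitrary):</span></p>

```yaml
# Collector-side safety net: strip high-cardinality attributes
# even if an SDK or agent attaches them.
processors:
  attributes/drop_high_cardinality:
    actions:
      - key: user_id
        action: delete
      - key: session_id
        action: delete
      - key: request_uuid
        action: delete
```

<p><span style="font-weight: 400;">The processor then has to be listed in the relevant pipeline&#8217;s processors section, like any other processor.</span></p>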
<p><span style="font-weight: 400;">The key habit is to monitor cardinality continuously, not just after a cost spike. Keep an eye on the collector metrics that look like </span><code>processor_accepted_metric_points</code><span style="font-weight: 400;"> broken down by metric name. These reveal which metrics are growing out of control before they degrade performance or inflate your bill.</span></p>
<p><span style="font-weight: 400;">For more guidance on instrumentation hygiene and preventing cardinality issues from the start, see our</span><a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/" target="_blank" rel="noopener"> <span style="font-weight: 400;">OpenTelemetry instrumentation best practices</span></a><span style="font-weight: 400;">.</span></p>
<h2 id="scaling-pressure-in-opentelemetry-production-pipelines"><b>Scaling Pressure in OpenTelemetry Production Pipelines</b></h2>
<p><span style="font-weight: 400;">OpenTelemetry components, SDKs, agents, and collectors, are not magic. They are software services that can be overloaded, and in high-throughput systems they often are.</span></p>
<p><span style="font-weight: 400;">In busy environments, traces can be generated at hundreds of thousands per second. Metrics multiply across services, containers, and pods. If batching, memory limits, and exporter throughput are not tuned, the pipeline itself becomes the bottleneck. The symptoms are predictable: </span><code>processor_refused_spans</code><span style="font-weight: 400;"> starts increasing, collector memory climbs steadily, export failures appear, and telemetry arrives late or gets dropped entirely.</span></p>
<p><span style="font-weight: 400;">To understand where these bottlenecks occur, consider the overall OpenTelemetry production pipeline:</span></p>
<p><img decoding="async" class="alignnone size-large wp-image-70532" src="https://sematext.com/wp-content/uploads/2026/02/opentelemetry-production-pipeline.png" alt="" width="618" height="1024"></p>
<p><span style="font-weight: 400;">If you are using manual SDK instrumentation, you can tune batching and flush intervals directly. Larger batches reduce per-span overhead but increase memory pressure in the application itself, raising the risk of an OOM kill for containerized workloads. Smaller batches reduce memory but increase network calls. There is a balance, and you find it through load testing rather than guesswork.</span></p>
<p><span style="font-weight: 400;">With auto-instrumentation agents, you do not have direct SDK access, but most agents expose equivalent environment variables for batch size and schedule delay. These matter in production just as much as they do with manual instrumentation. A simple example showing where these settings live can save a lot of trial and error:</span></p>
<pre><code>OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
OTEL_BSP_SCHEDULE_DELAY=5000</code></pre>
<p><span style="font-weight: 400;">For detailed information, see </span><a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">Environment Variable Specification</span></a><span style="font-weight: 400;">. </span></p>
<p><span style="font-weight: 400;">Regardless of instrumentation type, the collector itself must be treated like any other production service. Monitor its CPU and memory, scale it horizontally when needed, use load balancing with trace ID based routing so spans for the same trace land on the same collector instance, and watch queue lengths in the batch processor. If your collector is not monitored, you do not have observability, you have a single point of failure. </span></p>
<p><span style="font-weight: 400;">For detailed guidance, see</span><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer"> <span style="font-weight: 400;">OpenTelemetry Collector architecture and best practices</span></a><span style="font-weight: 400;">.</span></p>
<h2 id="sampling-strategies-for-opentelemetry-in-production"><b>Sampling Strategies for OpenTelemetry in Production</b></h2>
<p><span style="font-weight: 400;">At some point, you realize capturing 100% of traces is not sustainable. Sampling becomes necessary. However, sampling is not just a cost decision, it also changes what you can see, so it deserves more thought than simply dialing a number down.</span></p>
<h3 id="agent-level-sampling"><b>Agent-Level Sampling</b></h3>
<p><span style="font-weight: 400;">Agent-level sampling makes the decision immediately when a request starts, before a single span hits the collector. The benefit is immediate volume reduction: CPU, memory, and network overhead all drop. The trade-off is permanent blindness for discarded traces. If an error happens in a trace that was not sampled, it simply does not exist in your backend. There is no way to recover it after the fact.</span></p>
<p><span style="font-weight: 400;">Agent-level sampling works well as a baseline control mechanism. Many production systems start at 5 to 10% and adjust based on throughput and debugging needs. It is particularly useful when throughput is extremely high, infrastructure or observability vendor cost is the primary concern, or you need to protect the collector from being overwhelmed. Just keep in mind that it does not guarantee you will retain slow or rare traces that would have been most useful during an incident.</span></p>
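<p><span style="font-weight: 400;">For SDKs and agents that honor the standard environment variables, a 10% head-sampling baseline is usually just two settings (exact support varies slightly by language):</span></p>

```shell
# Parent-based ratio sampler: sample 10% of new traces, and follow
# the parent's decision for traces started upstream.
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```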
<h3 id="tail-sampling"><b>Tail Sampling</b></h3>
<p><span style="font-weight: 400;">Tail sampling moves the decision to the collector, after the entire trace has been observed. This enables smarter decisions: keep slow traces, keep error traces, retain 100% of traffic from business-critical services, and sample normal traffic probabilistically.</span></p>
<p><span style="font-weight: 400;">This is more powerful, but it comes with real operational weight. The collector has to buffer complete traces in memory while waiting for all spans to arrive, which means memory usage is meaningfully higher than with head-based sampling. It also adds latency to trace delivery, since the collector has to wait for the full trace before deciding whether to keep it. If your typical transaction takes 90 seconds to complete, your collector is buffering 90 seconds of trace data before it can act, which is a lot of memory at scale, and your traces will arrive in your backend 90 or more seconds after the fact. For short-lived transactions this is barely noticeable. For long-running workflows, plan accordingly.</span></p>
<p><span style="font-weight: 400;">In distributed systems, spans for the same trace can arrive at multiple collector instances. If each collector makes independent sampling decisions, traces become fragmented, leaving gaps that make debugging much harder. Using tail sampling with load-balanced routing, where all spans for a trace are routed to the same collector instance using trace ID hashing, keeps traces intact and reliable. To be precise: this </span><b>sticky routing is required for well-functioning tail sampling.</b></p>
<p><img decoding="async" class="alignnone size-large wp-image-70531" src="https://sematext.com/wp-content/uploads/2026/02/opentelemetry-sampling.png" alt="" width="618" height="1024"></p>
<p><span style="font-weight: 400;">The most effective production strategy usually combines both approaches: use agent-level sampling to cut down overall span volume and prevent the collector from being overwhelmed, then use tail sampling at the collector to make sure high-value traces, slow requests, errors, and critical transactions, are preserved. Sampling is not random volume reduction. It is selecting the traces that help you debug real incidents.</span></p>
<p><img decoding="async" class="alignnone size-large wp-image-70533" src="https://sematext.com/wp-content/uploads/2026/02/opentelemetry-lb-routing.png" alt="" width="618" height="1024"></p>
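<p><span style="font-weight: 400;">The trace-ID-sticky routing described above is typically implemented with the collector&#8217;s load-balancing exporter in a first-tier collector layer. A sketch, with a made-up headless-service hostname:</span></p>

```yaml
# First-tier collectors route all spans of a trace to the same
# second-tier instance, so tail sampling sees complete traces.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Illustrative hostname: a headless service resolving to
        # the second-tier collector pods.
        hostname: otel-collector-headless.observability.svc
```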
<p><span style="font-weight: 400;">For the official OpenTelemetry guidance, refer to the </span><a href="https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#sampling" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OpenTelemetry sampling specification</span></a><span style="font-weight: 400;">.</span></p>
<h3 id="how-to-set-tail-sampling-policies-in-practice"><b>How to Set Tail Sampling Policies in Practice</b></h3>
<p><span style="font-weight: 400;">Before writing any tail sampling policy, start by asking yourself a few practical questions: what types of incidents happen most often? Are latency regressions more frequent than hard failures? Which services are business-critical or compliance-sensitive? The answers should guide your sampling decisions, not the other way around.</span></p>
<p><span style="font-weight: 400;">For example, if most of your incidents are latency-related, prioritize keeping slow traces. A common starting point is to retain 100% of traces slower than twice your </span><a href="https://sematext.com/glossary/service-level-objective/"><span style="font-weight: 400;">SLO</span></a><span style="font-weight: 400;">, while sampling just 5 to 10% of normal traffic. For compliance-sensitive endpoints, always keep those traces intact. For business-critical services, bias your sampling to capture a higher proportion of requests, perhaps 50% from your payment service but only 5% from static content services.</span></p>
<p><span style="font-weight: 400;">It is also worth maintaining a small baseline sample across all services, around 5 to 10% of overall traffic, even for well-behaved paths. This gives you trend data and lets you detect unknown failure modes you did not anticipate when writing the policies. Without that baseline, you lose visibility into normal system behavior and can miss gradual degradations that do not trigger your explicit rules.</span></p>
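<p><span style="font-weight: 400;">Translated into collector configuration, the guidance above maps onto tail sampling policies roughly like this; the thresholds and percentages are the starting points from the text, not universal values:</span></p>

```yaml
# Tail sampling: keep errors and slow traces, plus a small
# probabilistic baseline across everything else.
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow       # e.g. 2x a 500 ms latency SLO
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```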
<h2 id="agent-and-collector-stability-the-hidden-risk"><b>Agent and Collector Stability: The Hidden Risk</b></h2>
<p><span style="font-weight: 400;">Agents and collectors are not passive observers. They are active components in your application infrastructure, and they can fail like any other component.</span></p>
<p><span style="font-weight: 400;">The collector is the more straightforward case. OpenTelemetry SDKs instrument your application code directly, and the collector runs as a separate process (or set of processes) that receives, processes, and exports telemetry. When a collector crashes, all buffered data is lost, including any traces that were being held in memory for tail sampling decisions. Memory spikes can trigger </span><a href="https://sematext.com/glossary/linux-out-of-memory-killer/" target="_blank" rel="noopener"><span style="font-weight: 400;">OOM kills</span></a><span style="font-weight: 400;">, and if you are running a single collector instance, the entire observability pipeline goes dark until it recovers.</span></p>
<p><span style="font-weight: 400;">The common causes are predictable: exporters fall behind because the backend is slow or throttling ingest, queues grow, memory fills, and eventually the collector crashes. The practical safeguard against this is the memory limiter processor, which watches the collector’s overall memory consumption and temporarily refuses incoming data when it crosses your configured threshold, giving the collector room to catch up.</span></p>
<pre><code>processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]</code></pre>
<p><span style="font-weight: 400;">This is one of those configurations that feels optional until the day it is not.</span></p>
<p><span style="font-weight: 400;">Auto-instrumentation adds another layer of complexity. Java agents rewrite bytecode at runtime, async context propagation in .NET or Node.js can behave unexpectedly under load, and in high-throughput systems you may spend measurable CPU time just recording spans. This is why load testing your instrumentation matters as much as load testing your application. Before rolling out to production, measure baseline latency without instrumentation, then measure P50, P95, and P99 latency with it enabled. A 5 to 10% latency increase is often acceptable. Triple-digit millisecond overhead per request is not.</span></p>
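<p><span style="font-weight: 400;">As a sketch of that comparison, the snippet below computes percentile overhead from two latency samples. The numbers are placeholders; in practice you would feed in measurements from two identical load-test runs, one with instrumentation attached and one without.</span></p>
<pre><code># Sketch: quantify instrumentation overhead from two load-test runs.
# The latency samples below are placeholders.
from statistics import quantiles

def percentile(samples, p):
    """p in (0, 100); linear interpolation over sorted samples."""
    cuts = quantiles(sorted(samples), n=100, method="inclusive")
    return cuts[int(p) - 1]

baseline = [12.0, 14.1, 13.2, 15.8, 12.9, 14.4, 30.2, 13.7]      # ms, agent off
instrumented = [12.9, 15.0, 14.3, 16.9, 13.8, 15.5, 33.1, 14.6]  # ms, agent on

for p in (50, 95, 99):
    b, i = percentile(baseline, p), percentile(instrumented, p)
    print(f"P{p}: {b:.1f}ms with no agent, {i:.1f}ms with agent ({100 * (i - b) / b:+.1f}%)")</code></pre>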
<p><span style="font-weight: 400;">For detailed instructions by language, see the </span><a href="https://opentelemetry.io/docs/languages/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400;">OpenTelemetry auto-instrumentation documentation</span></a><span style="font-weight: 400;">.</span></p>
<h3 id="exporter-bottlenecks-when-the-backend-cannot-keep-up"><b>Exporter Bottlenecks: When the Backend Cannot Keep Up</b></h3>
<p><span style="font-weight: 400;">Even if your SDKs and collectors are perfectly tuned, the backend you are exporting to may not be. When the backend is slow, throttling requests, or simply unable to absorb your telemetry volume, batches start piling up in the exporter queues inside the collector. Left unchecked, this cascades into collector instability.</span></p>
<p><span style="font-weight: 400;">The signals to watch for are </span><code>otelcol_exporter_send_failed_spans</code><span style="font-weight: 400;"> (a counter visible in the collector’s own self-monitoring metrics), growing exporter queue lengths, increased export latency, and rising memory pressure in the collector process.</span></p>
<p><span style="font-weight: 400;">For self-hosted backends like Elasticsearch, OpenSearch, or Prometheus, ingestion capacity must match telemetry throughput and cardinality. For external vendors, you need to understand their API rate limits, network latency characteristics, and burst handling policies before you are under pressure. An asynchronous exporter with buffering, retry logic, and exponential backoff is essential. Without it, a temporary backend slowdown cascades through the entire pipeline. Your observability stack is only as reliable as its slowest component.</span></p>
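<p><span style="font-weight: 400;">In the collector, that buffering and retry behavior lives in the exporter configuration. A sketch with illustrative starting values (the endpoint is a placeholder):</span></p>
<pre><code>exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318  # placeholder
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000        # batches buffered before data is dropped
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # exponential backoff starts here
      max_interval: 30s
      max_elapsed_time: 300s  # give up on a batch after 5 minutes</code></pre>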
<h3 id="why-this-matters-in-real-systems"><b>Why This Matters in Real Systems</b></h3>
<p><span style="font-weight: 400;">Many OpenTelemetry tutorials and examples show instrumentation working out of the box, which it does, in a demo environment with predictable traffic and no cost constraints. Real production systems are a different beast entirely: high throughput, distributed microservices, partial network failures, uneven traffic spikes, and budgets that someone is accountable for.</span></p>
<p><span style="font-weight: 400;">OpenTelemetry is genuinely powerful, but it requires operational discipline. When you adopt it, you are not just instrumenting a few services. You are operating an observability pipeline that itself needs capacity planning, monitoring, load testing, a clear sampling strategy, and ongoing cardinality governance. Treat it as first-class infrastructure and it becomes a strong foundation for understanding your systems. Treat it as a set-and-forget library and it becomes your next incident.</span></p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/">OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Troubleshooting Microservices with OpenTelemetry Distributed Tracing</title>
		<link>https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Sun, 15 Feb 2026 13:46:17 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[Tracing]]></category>
		<category><![CDATA[distributed tracing]]></category>
		<category><![CDATA[microservices]]></category>
		<category><![CDATA[opentelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70515</guid>

					<description><![CDATA[<p>Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/">Troubleshooting Microservices with OpenTelemetry Distributed Tracing</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Distributed tracing doesn’t just show you what happened. It shows you <i>why</i> things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.</p>
<p>This guide covers practical, trace-based troubleshooting patterns for production microservices. You’ll learn how to use OpenTelemetry distributed traces to diagnose the most common, and most frustrating, problems that surface in distributed architectures.</p>
<p><b>What you’ll learn:</b></p>
<ul>
<li aria-level="1">How to identify latency bottlenecks using trace waterfall analysis</li>
<li aria-level="1">Detecting N+1 query patterns and database performance issues in traces</li>
<li aria-level="1">Diagnosing retry storms, timeout cascades, and circuit breaker failures</li>
<li aria-level="1">Using error propagation traces to find root causes across service boundaries</li>
<li aria-level="1">Spotting connection pool exhaustion, cache misses, and queue backlogs</li>
<li aria-level="1">Correlating traces with logs and metrics for full-context debugging</li>
</ul>
<p>For step-by-step instrumentation setup, see our companion guide: <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a>. For production-hardening your instrumentation, see <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a>.</p>
<h2 id="why-traces-are-the-best-tool-for-microservices-troubleshooting"><b>Why Traces Are the Best Tool for Microservices Troubleshooting</b></h2>
<p>Logs, metrics, and traces each serve a different purpose. But when a production incident hits a distributed system, traces are uniquely positioned to answer the hardest questions, especially those that span service boundaries.</p>
<table>
<thead>
<tr>
<th><b>Troubleshooting Question</b></th>
<th><b>Logs</b></th>
<th><b>Metrics</b></th>
<th><b>Traces</b></th>
</tr>
</thead>
<tbody>
<tr>
<td>Which service is slow?</td>
<td>❌ Scattered across services</td>
<td>✅ Latency dashboards</td>
<td>✅ Waterfall shows exact span</td>
</tr>
<tr>
<td>Why is it slow?</td>
<td>🟡 If you logged enough context</td>
<td>❌ No causal detail</td>
<td>✅ Child spans reveal cause</td>
</tr>
<tr>
<td>Which upstream call caused the error?</td>
<td>❌ Requires correlation IDs</td>
<td>❌ Only shows error rate</td>
<td>✅ Error propagation is visible</td>
</tr>
<tr>
<td>Is it a single request or systemic?</td>
<td>❌ Hard to aggregate</td>
<td>✅ Rate/error trends</td>
<td>✅ Trace grouping by pattern</td>
</tr>
<tr>
<td>What was the exact sequence of calls?</td>
<td>❌ Requires reconstruction</td>
<td>❌ No ordering info</td>
<td>✅ Waterfall shows call graph</td>
</tr>
</tbody>
</table>
<p>The key insight is that traces give you <i>causation</i>, not just <i>correlation</i>. When service A calls service B, which calls service C, and C fails, a trace shows you the entire chain, the timing of each call, and exactly where things went wrong.</p>
<h2 id="anatomy-of-a-troubleshooting-trace"><b>Anatomy of a Troubleshooting Trace</b></h2>
<p>Before diving into specific patterns, let’s establish what you’re looking at in a trace waterfall. Understanding the structure makes pattern recognition faster during incidents.</p>
<p>A distributed trace consists of spans organized in a parent-child hierarchy, as defined by the<a href="https://opentelemetry.io/docs/specs/otel/trace/api/" target="_blank" rel="noopener noreferrer"> OpenTelemetry Trace specification</a>. Each span represents a single operation: an HTTP request, a database query, a cache lookup, a message publish. The root span represents the entry point, and child spans represent downstream operations.</p>
<pre><code>[Root Span: GET /api/orders/12345] ─────────── 1,247ms
├── [auth-service: POST /validate] ── 23ms
├── [order-service: GET /orders/12345] ────── 1,180ms
│     ├── [PostgreSQL: SELECT * FROM orders] ── 12ms
│     ├── [inventory-service: GET /stock] ─── 890ms  ← BOTTLENECK
│     │     ├── [Redis: GET inventory:12345] ── 2ms (miss)
│     │     └── [PostgreSQL: SELECT ...] ── 875ms  ← ROOT CAUSE
│     └── [pricing-service: GET /calculate] ── 45ms
└── [notification-service: POST /email] ── 18ms</code></pre>
<p>In this trace, the total request took 1,247ms. The trace waterfall immediately shows that inventory-service consumed 890ms, and within it, a database query took 875ms following a cache miss. Without the trace, you’d see a slow /api/orders endpoint in your metrics and have to investigate each service individually.</p>
<p><b>Key span attributes to examine during troubleshooting (see the full</b><a href="https://opentelemetry.io/docs/specs/semconv/" target="_blank" rel="noopener noreferrer"> <b>OpenTelemetry Semantic Conventions</b></a><b> for reference):</b></p>
<table>
<thead>
<tr>
<th><b>Attribute</b></th>
<th><b>What It Tells You</b></th>
</tr>
</thead>
<tbody>
<tr>
<td>http.status_code</td>
<td>HTTP response status for service calls</td>
</tr>
<tr>
<td>db.statement</td>
<td>The actual SQL query executed</td>
</tr>
<tr>
<td>db.system</td>
<td>Which database (PostgreSQL, MySQL, Redis)</td>
</tr>
<tr>
<td>http.method + http.url</td>
<td>Which endpoint was called</td>
</tr>
<tr>
<td>otel.status_code = ERROR</td>
<td>Span completed with an error</td>
</tr>
<tr>
<td>exception.message</td>
<td>Error details if an exception occurred</td>
</tr>
<tr>
<td>net.peer.name</td>
<td>Which host the call went to</td>
</tr>
<tr>
<td>messaging.system</td>
<td>Message broker involved (Kafka, RabbitMQ)</td>
</tr>
<tr>
<td>Span duration</td>
<td>How long the operation took</td>
</tr>
</tbody>
</table>
<h2 id="diagnosing-latency-bottlenecks-with-trace-waterfall-analysis"><b>Diagnosing Latency Bottlenecks with Trace Waterfall Analysis</b></h2>
<p>Latency issues are the most common reason teams reach for traces. The waterfall view transforms a vague “the API is slow” complaint into a precise diagnosis.</p>
<h3 id="pattern-the-slow-database-query"><b>Pattern: The Slow Database Query</b></h3>
<p><b>Symptoms in metrics: </b>Elevated p95/p99 latency on a specific endpoint. Database CPU or connection usage may appear normal.</p>
<p><b>What the trace reveals:</b></p>
<pre><code>[order-service: GET /orders] ────────── 2,340ms
├── [PostgreSQL: SELECT o.*, oi.* FROM orders o
│    JOIN order_items oi ON o.id = oi.order_id
│    WHERE o.customer_id = $1
│    ORDER BY o.created_at DESC] ──── 2,280ms  ← Problem
└── [Redis: SET order-cache:customer:789] ── 3ms</code></pre>
<p>The trace shows a single database query consuming 97% of the request time. The db.statement attribute reveals the actual SQL, which is a full table scan joining orders with order items, likely missing an index on customer_id.</p>
<p><b>What to look for in spans:</b></p>
<ul>
<li aria-level="1"><b>db.statement</b>: Check for missing WHERE clauses, full table scans, large JOINs, or unoptimized queries. Use<a href="https://www.postgresql.org/docs/current/sql-explain.html" target="_blank" rel="noopener noreferrer"> EXPLAIN</a> to confirm.</li>
<li aria-level="1"><b>Span duration vs. typical duration</b>: Compare against baseline traces for the same operation</li>
<li aria-level="1"><b>Sequential vs. parallel queries</b>: Are queries running sequentially when they could be parallelized?</li>
</ul>
<h3 id="pattern-sequential-service-calls-missed-parallelization"><b>Pattern: Sequential Service Calls (Missed Parallelization)</b></h3>
<p><b>Symptoms in metrics: </b>High latency that seems disproportionate to what any single service reports.</p>
<p><b>What the trace reveals:</b></p>
<pre><code>[api-gateway: GET /dashboard] ──────────── 1,850ms
├── [user-service: GET /profile] ── 320ms
├── [order-service: GET /recent] ─── 480ms    (starts after user-svc)
├── [notification-svc: GET /unread] ── 410ms  (starts after order-svc)
└── [recommendation-svc: GET /for-you] ── 590ms (starts after notif.)</code></pre>
<p>The waterfall reveals that four independent service calls are executing sequentially. Total time is the sum of all calls (1,800ms) instead of the max (590ms), a 3x penalty. The trace makes this immediately visible because spans don’t overlap.</p>
<p><b>The fix: </b>Refactor to concurrent calls. With parallelization, the trace collapses to ~620ms as all four spans overlap.</p>
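<p>As a minimal Python sketch of that refactor (service names and delays mirror the waterfall above; <code>fetch()</code> is a stand-in for a real HTTP client call):</p>
<pre><code>import asyncio

async def fetch(service: str, delay: float) -> str:
    await asyncio.sleep(delay)  # placeholder for the HTTP roundtrip
    return f"{service}: ok"

async def load_dashboard() -> list[str]:
    # gather() starts all four coroutines at once, so total time is
    # roughly the slowest call (~0.59s) instead of the sum (~1.8s)
    return await asyncio.gather(
        fetch("user-service", 0.32),
        fetch("order-service", 0.48),
        fetch("notification-service", 0.41),
        fetch("recommendation-service", 0.59),
    )

results = asyncio.run(load_dashboard())</code></pre>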
<h3 id="pattern-fan-out-amplification"><b>Pattern: Fan-out Amplification</b></h3>
<p><b>Symptoms in metrics: </b>Latency increases with load, but individual service latencies look normal.</p>
<p>The trace reveals a product catalog page making 50 individual HTTP calls to the inventory service, one per product. Each call is fast (45–60ms), but the accumulated overhead of 50 sequential HTTP roundtrips adds up to over 3 seconds.</p>
<p><b>The fix: </b>Replace individual calls with a batch API (GET /stock?skus=A001,A002,…,A050) or use a GraphQL-style query that returns all needed data in a single request.</p>
<h2 id="detecting-n1-query-patterns-in-traces"><b>Detecting N+1 Query Patterns in Traces</b></h2>
<p>N+1 queries are one of the most common performance killers in microservices, and traces make them trivially easy to spot. The pattern appears as one initial query followed by N repetitive queries, and in the trace waterfall, it’s unmistakable.</p>
<h3 id="pattern-classic-orm-n1"><b>Pattern: Classic ORM N+1</b></h3>
<p><b>What the trace reveals:</b></p>
<pre><code>[order-service: GET /orders] ─────────── 1,890ms
├── [PostgreSQL: SELECT * FROM orders WHERE status = 'active'
│    LIMIT 50] ── 15ms                              (1 query)
├── [PostgreSQL: SELECT * FROM customers WHERE id = 101] ── 8ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 102] ── 9ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 103] ── 7ms
│   ... (47 more identical-pattern queries)
└── [PostgreSQL: SELECT * FROM customers WHERE id = 150] ── 11ms</code></pre>
<p>The trace shows 1 query to fetch orders + 50 individual queries to fetch each order’s customer. ORM lazy loading is the usual culprit. Each query is fast individually, but 51 database roundtrips add up to nearly 2 seconds.</p>
<p><b>How to spot N+1 patterns in your tracing tool:</b></p>
<ul>
<li aria-level="1"><b>High span count on a single trace</b>: A trace with 50+ database spans for a simple endpoint is almost always an N+1</li>
<li aria-level="1"><b>Repetitive db.statement patterns</b>: Same query template with different parameter values</li>
<li aria-level="1"><b>Low individual span duration but high total trace duration</b>: Each query is fast, but there are too many</li>
</ul>
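<p>These heuristics are easy to automate. A hypothetical sketch that normalizes <code>db.statement</code> values and counts repeats of the same query template within a single trace:</p>
<pre><code># Sketch: flag a likely N+1 by collapsing literals in db.statement values
# so that "... WHERE id = 101" and "... WHERE id = 102" become one template.
import re
from collections import Counter

def normalize(statement: str) -> str:
    statement = re.sub(r"'[^']*'", "?", statement)  # quoted strings
    return re.sub(r"\b\d+\b", "?", statement)       # numeric literals

def n_plus_one_suspects(db_statements, threshold=10):
    counts = Counter(normalize(s) for s in db_statements)
    return {tpl: n for tpl, n in counts.items() if n >= threshold}

# The trace above: 1 orders query plus 50 per-customer queries
spans = ["SELECT * FROM orders WHERE status = 'active' LIMIT 50"] + [
    f"SELECT * FROM customers WHERE id = {i}" for i in range(101, 151)
]
print(n_plus_one_suspects(spans))  # flags the customers template, count 50</code></pre>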
<p><b>The fix: </b>Replace lazy loading with eager loading (JOIN or IN clause):</p>
<pre><code>-- Instead of 51 queries, use 1:
SELECT o.*, c.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'active' LIMIT 50</code></pre>
<h3 id="pattern-service-level-n1-microservice-fan-out"><b>Pattern: Service-Level N+1 (Microservice Fan-out)</b></h3>
<p>The N+1 pattern isn’t limited to databases. It manifests across service boundaries too:</p>
<pre><code>[checkout-service: POST /checkout] ───────── 4,100ms
├── [cart-service: GET /cart/items] ── 35ms
│    Response: [{productId: "P1"}, ..., {productId: "P20"}]
├── [product-service: GET /products/P1] ── 120ms
├── [product-service: GET /products/P2] ── 135ms
│   ... (18 more calls)
└── [product-service: GET /products/P20] ── 128ms</code></pre>
<p>The checkout service fetches cart items, then calls the product service individually for each item. The fix: implement a batch endpoint (POST /products/batch accepting a list of IDs) or use request collapsing.</p>
<h2 id="diagnosing-timeout-cascades-and-retry-storms"><b>Diagnosing Timeout Cascades and Retry Storms</b></h2>
<p>Timeout cascades are among the most dangerous failure modes in microservices. Patterns like the<a href="https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker" target="_blank" rel="noopener noreferrer"> circuit breaker</a> exist specifically to contain them. A single slow dependency can cause cascading failures across your entire system, and traces are the fastest way to understand the chain reaction.</p>
<h3 id="pattern-timeout-cascade"><b>Pattern: Timeout Cascade</b></h3>
<p><b>Symptoms in metrics: </b>Multiple services show elevated error rates simultaneously. Latency spikes propagate across services.</p>
<p><b>What the trace reveals:</b></p>
<pre><code>[api-gateway: POST /orders] ──────────── 30,012ms (TIMEOUT)
└── [order-service: POST /create] ─────── 30,005ms (TIMEOUT)
      ├── [inventory-svc: POST /reserve] ──── 30,001ms (TIMEOUT)
      │     └── [PostgreSQL: UPDATE inventory ...] ── 30,000ms
      │           otel.status_code: ERROR
      │           exception.message: "Lock wait timeout exceeded"
      └── [payment-service: POST /charge] (NOT REACHED)</code></pre>
<p>The trace reveals the cascade: a database lock timeout in inventory causes inventory to time out, which causes order-service to time out, which causes the gateway to time out. Without the trace, you’d see three services all timing out and might investigate the wrong one first.</p>
<p><b>Key diagnostic signals in timeout traces:</b></p>
<ul>
<li aria-level="1">Span duration equals the configured timeout value exactly (e.g., 30,000ms), which confirms a timeout rather than slow processing</li>
<li aria-level="1">otel.status_code: ERROR with timeout-related exception messages</li>
<li aria-level="1">Child spans that were never started (like payment-service above), which confirms the timeout interrupted the flow</li>
<li aria-level="1">Multiple parent spans with identical durations, meaning each parent waited for the full timeout of its child</li>
</ul>
<h3 id="pattern-retry-storm"><b>Pattern: Retry Storm</b></h3>
<p><b>Symptoms in metrics: </b>Sudden traffic spike to a downstream service. Error rates increase rather than decrease.</p>
<p><b>What the trace reveals:</b></p>
<pre><code>[order-service: POST /create] ─────────── 12,450ms
├── [inventory-svc: POST /reserve] ── 5,001ms TIMEOUT
├── [inventory-svc: POST /reserve] ── 5,002ms TIMEOUT (retry 1)
├── [inventory-svc: POST /reserve] ── 2,410ms TIMEOUT (retry 2)
│     exception.message: "Connection pool exhausted"
└── Result: ERROR "Failed after 3 retries"</code></pre>
<p>The trace shows the order service retrying the inventory call three times. With 100 concurrent requests all doing the same, the inventory service receives 300 requests instead of 100, a 3x amplification. The connection pool exhaustion on retry 2 confirms the retry storm is making things worse.</p>
<p><b>Multi-layer retry amplification: </b>When multiple layers retry, the multiplication compounds:</p>
<pre><code>Gateway (3 retries) → Order Service (3 retries) → Inventory
= 3 × 3 = 9 requests to inventory per user request</code></pre>
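<p>That multiplication generalizes to any depth. A hypothetical helper, treating each layer&#8217;s figure as the number of attempts it makes per incoming request, as the example above does:</p>
<pre><code>from math import prod

def amplification(attempts_per_layer):
    # Each retrying layer multiplies the request count seen below it
    return prod(attempts_per_layer)

print(amplification([3, 3]))     # 9, the gateway/order-service case above
print(amplification([3, 3, 3]))  # 27 once a third retrying layer is added</code></pre>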
<h2 id="troubleshooting-error-propagation-across-service-boundaries"><b>Troubleshooting Error Propagation Across Service Boundaries</b></h2>
<p>When an error surfaces at the API boundary, the root cause often lies several services deep. Traces let you follow the error propagation chain backwards from symptom to cause.</p>
<h3 id="pattern-hidden-error-origin"><b>Pattern: Hidden Error Origin</b></h3>
<p><b>Symptoms: </b>Users see “Internal Server Error” on the checkout page. Logs show 500 errors cascading through services.</p>
<p><b>What the trace reveals in a single view:</b></p>
<pre><code>[api-gateway: POST /checkout] ─ 500 Internal Server Error
└── [checkout-service: POST /process] ─ 500
      ├── [cart-service: GET /cart] ─ 200 OK (45ms)
      └── [payment-service: POST /charge] ─ 500
            └── [fraud-service: POST /evaluate] ─ 500
                  └── [ML model: POST /predict] ─ 503
                        exception.message: "Model server OOM:
                        cannot allocate 2GB for inference batch"</code></pre>
<p>The trace cuts through four levels of error wrapping and reveals the actual root cause: the ML model server ran out of memory. Without the trace, the on-call engineer would start by investigating the checkout service, then the payment service, before eventually reaching the fraud detection service, potentially losing 30+ minutes following the chain manually.</p>
<h3 id="pattern-silent-error-swallowing"><b>Pattern: Silent Error Swallowing</b></h3>
<p>Sometimes errors don’t propagate. Instead, they get silently caught, and the system returns degraded results instead of errors:</p>
<pre><code>[product-service: GET /product/123] ─ 200 OK (890ms)
├── [PostgreSQL: SELECT ...] ── 12ms ─ 200 OK
├── [review-service: GET /reviews] ── 5,001ms ─ TIMEOUT
│     otel.status_code: ERROR
├── [recommendation-svc: GET /similar] ── 5,002ms ─ TIMEOUT
│     otel.status_code: ERROR
└── [Redis: SET product-cache:123] ── 3ms</code></pre>
<p>The product page returns 200 OK, but the trace reveals two child services timed out. Metrics show 200 OK and ~900ms latency. Only the trace reveals the degraded user experience.</p>
<p><b>To catch this pattern: </b>Filter traces by spans with otel.status_code: ERROR even when the root span shows success.</p>
<h2 id="spotting-connection-pool-exhaustion"><b>Spotting Connection Pool Exhaustion</b></h2>
<p>Connection pool exhaustion is subtle. It doesn’t always produce errors, but it silently adds latency to every request as threads wait for available connections.</p>
<h3 id="pattern-pool-wait-time"><b>Pattern: Pool Wait Time</b></h3>
<p><b>What the trace reveals:</b></p>
<pre><code>[order-service: GET /orders] ───────── 2,340ms
├── [PostgreSQL: SELECT ...] ── 15ms
├── [gap: 1,800ms]  ← No spans, just waiting
└── [PostgreSQL: SELECT ...] ── 12ms</code></pre>
<p>The telltale sign is gaps between spans, periods where the service is doing nothing visible. The 1,800ms gap between the first and second database query indicates the thread was waiting for a connection from the pool.</p>
<p><b>Diagnostic approach: </b>Look for consistent gaps in trace waterfalls that don’t correspond to any span. When you see this pattern across multiple traces for the same service, check connection pool metrics (active connections, wait queue depth, pool size). The trace points you to the exact service experiencing pool pressure, and metrics confirm the diagnosis.</p>
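<p>This check is scriptable as well. A hypothetical sketch that finds the largest dead time between consecutive sibling spans in a trace:</p>
<pre><code>def largest_gap(spans):
    """spans: (start_ms, end_ms) tuples for sibling spans in one trace."""
    spans = sorted(spans)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    return max(gaps, default=0)

# The waterfall above: a 15ms query, a long wait, then a 12ms query
print(largest_gap([(0, 15), (1815, 1827)]))  # 1800</code></pre>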
<h2 id="diagnosing-cache-effectiveness-issues"><b>Diagnosing Cache Effectiveness Issues</b></h2>
<p>Caches are supposed to reduce latency, but misconfigured caches can make things worse. Traces reveal cache behavior that’s invisible in aggregate metrics.</p>
<h3 id="pattern-cache-miss-cascade"><b>Pattern: Cache Miss Cascade</b></h3>
<pre><code>[product-service: GET /product/456] ─────── 1,250ms
├── [Redis: GET product:456] ── 1ms (MISS)
├── [PostgreSQL: SELECT * FROM products ...] ── 85ms
├── [Redis: GET product:456:reviews] ── 1ms (MISS)
├── [review-service: GET /reviews] ── 890ms
│     ├── [PostgreSQL: SELECT ...reviews...] ── 45ms
│     └── [PostgreSQL: SELECT ...users...] ── 830ms  ← Slow join
├── [Redis: SET product:456] ── 2ms
└── [Redis: SET product:456:reviews] ── 1ms</code></pre>
<p>The trace shows: both cache lookups missed, forcing expensive database queries and service calls. The review service’s slow user join (830ms) is the real latency contributor, normally hidden behind a cache hit.</p>
<p><b>To monitor cache effectiveness with traces: </b>Add custom span attributes for cache hit/miss status. Then in your tracing tool, filter and group by this attribute to see miss rates per operation, not just aggregate miss rates.</p>
<pre><code># Python example: Adding cache status to spans
from opentelemetry import trace

tracer = trace.get_tracer("cache-instrumentation")

def get_from_cache(key):
    with tracer.start_as_current_span("cache.lookup") as span:
        span.set_attribute("cache.key", key)
        result = redis_client.get(key)
        span.set_attribute("cache.hit", result is not None)
        return result</code></pre>
<h3 id="pattern-cache-stampede"><b>Pattern: Cache Stampede</b></h3>
<p>When a popular cache key expires, many concurrent requests simultaneously miss the cache and hit the database, a problem known as<a href="https://redis.io/blog/cache-stampede/" target="_blank" rel="noopener noreferrer"> cache stampede</a>. Looking at multiple traces for the same endpoint around the same timestamp reveals the stampede: each trace shows a cache miss, and database query durations increase progressively as the database becomes overloaded. All traces set the same cache key, resulting in redundant writes.</p>
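<p>One common mitigation is a single-flight guard: only one caller recomputes an expired key while the rest wait and then reuse its result. The sketch below uses hypothetical names and is in-process only; guarding against a stampede across instances requires a distributed lock.</p>
<pre><code>import threading
from collections import defaultdict

cache = {}                              # stand-in for the real cache client
key_locks = defaultdict(threading.Lock)

def get_or_compute(key, compute):
    value = cache.get(key)
    if value is not None:
        return value
    with key_locks[key]:                # serialize recomputation per key
        value = cache.get(key)          # re-check: another caller may have filled it
        if value is None:
            value = compute()
            cache[key] = value
    return value</code></pre>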
<h2 id="troubleshooting-message-queue-issues"><b>Troubleshooting Message Queue Issues</b></h2>
<p>Asynchronous messaging adds complexity to troubleshooting because the producer and consumer execute at different times. OpenTelemetry’s context propagation via<a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer"> W3C Trace Context</a> headers connects these spans into a single trace.</p>
<h3 id="pattern-consumer-lag"><b>Pattern: Consumer Lag</b></h3>
<pre><code>[order-service: POST /orders] ─ (publishes to Kafka)
├── [Kafka: produce to orders-topic] ── 5ms
│     messaging.kafka.partition: 3
│     messaging.kafka.offset: 1847293
│
│  ~~~ 45,000ms gap (consumer lag) ~~~
│
└── [fulfillment-svc: consume from orders-topic] ── 120ms
└── [PostgreSQL: INSERT INTO fulfillment_queue] ── 8ms </code></pre>
<p>The trace links the producer span (order-service) to the consumer span (fulfillment-service) through propagated context. The 45-second gap between produce and consume timestamps reveals consumer lag. The consumer itself processes quickly (120ms), so the problem is in<a href="https://kafka.apache.org/documentation/#consumerconfigs" target="_blank" rel="noopener noreferrer"> Kafka consumer group</a> throughput, not processing logic.</p>
<h3 id="pattern-poison-message-dead-letter"><b>Pattern: Poison Message / Dead Letter</b></h3>
<pre><code>[order-service: produce to orders-topic] ── 3ms
→ [fulfillment-svc: consume attempt 1] ── 15ms ── ERROR
│    exception.message: "Invalid product SKU format: null"
→ [fulfillment-svc: consume attempt 2] ── 12ms ── ERROR
→ [fulfillment-svc: consume attempt 3] ── 14ms ── ERROR
→ [dead-letter-queue: produce to orders-dlq] ── 4ms </code></pre>
<p>The trace shows a message being consumed, failing, retried twice, and finally sent to the dead letter queue. The exception message reveals the root cause: a null product SKU, likely a producer-side validation issue.</p>
<h2 id="using-trace-based-alerting-for-proactive-troubleshooting"><b>Using Trace-Based Alerting for Proactive Troubleshooting</b></h2>
<p>Reactive troubleshooting (waiting for users to complain) isn’t good enough. Modern tracing tools support alerting on trace-derived signals that catch issues before they impact users.</p>
<h3 id="alert-on-red-metrics-derived-from-traces"><b>Alert on RED Metrics Derived from Traces</b></h3>
<table>
<thead>
<tr>
<th><b>Alert</b></th>
<th><b>Condition</b></th>
<th><b>What It Catches</b></th>
</tr>
</thead>
<tbody>
<tr>
<td>Error rate spike</td>
<td>Error rate &gt; 5% for 5 minutes</td>
<td>Failed deployments, dependency outages</td>
</tr>
<tr>
<td>Latency degradation</td>
<td>p95 latency &gt; 2x baseline for 10 min</td>
<td>Slow queries, missing indexes, cache failures</td>
</tr>
<tr>
<td>Throughput drop</td>
<td>Request rate &lt; 50% of expected for 5 min</td>
<td>Upstream routing issues, DNS failures</td>
</tr>
<tr>
<td>Error rate by operation</td>
<td>Any operation error rate &gt; 10%</td>
<td>Targeted failures in specific endpoints</td>
</tr>
</tbody>
</table>
<h3 id="trace-specific-alerts"><b>Trace-Specific Alerts</b></h3>
<p>Beyond RED metrics, some conditions are only visible through trace analysis:</p>
<ul>
<li aria-level="1"><b>Span count anomaly</b>: Alert when average spans-per-trace exceeds a threshold, catching N+1 regressions after deployments</li>
<li aria-level="1"><b>New error types</b>: Alert when exception.type values appear that haven’t been seen in the last 7 days</li>
<li aria-level="1"><b>Missing service in trace</b>: Alert when an expected service stops appearing in traces for a critical flow</li>
</ul>
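<p>The first of these is straightforward to compute from trace data. A hypothetical sketch using a mean-plus-three-sigma threshold (an illustrative choice, not a tuned one):</p>
<pre><code># Sketch: flag traces whose span count is far above the recent norm,
# a cheap way to catch N+1 regressions after a deploy.
from statistics import mean, stdev

def span_count_anomalies(span_counts_by_trace, sigma=3.0):
    counts = list(span_counts_by_trace.values())
    limit = mean(counts) + sigma * stdev(counts)
    return [tid for tid, n in span_counts_by_trace.items() if n > limit]

traces = {f"trace-{i}": 6 for i in range(20)}  # normal traces: ~6 spans each
traces["trace-bad"] = 57                       # post-deploy N+1 regression
print(span_count_anomalies(traces))  # only the regressed trace is flagged</code></pre>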
<h2 id="building-a-troubleshooting-workflow-with-sematext-tracing"><b>Building a Troubleshooting Workflow with Sematext Tracing</b></h2>
<p><a href="https://sematext.com/tracing/">Sematext Tracing</a> provides the trace analysis capabilities needed to apply all the patterns described above. Here’s how to build an effective troubleshooting workflow.</p>
<h3 id="step-1-start-with-the-service-overview"><b>Step 1: Start with the Service Overview</b></h3>
<p>The <a href="https://sematext.com/docs/tracing/reports/overview/">Tracing Overview</a> dashboard provides RED metrics (Rate, Error, Duration) across all instrumented services. This is your starting point: identify which service has elevated error rates or latency, and in which time window the problem started.</p>
<p><img decoding="async" class="alignnone size-large wp-image-70391" src="https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-618x1024.png" alt="" width="618" height="1024" srcset="https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-618x1024.png 618w, https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-181x300.png 181w, https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-768x1273.png 768w, https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-927x1536.png 927w, https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-1235x2048.png 1235w, https://sematext.com/wp-content/uploads/2026/01/tracing-overview-01-scaled.png 1544w" sizes="(max-width: 618px) 100vw, 618px" /></p>
<h3 id="step-2-drill-into-the-trace-explorer"><b>Step 2: Drill into the Trace Explorer</b></h3>
<p>Use the <a href="https://sematext.com/docs/tracing/reports/explorer/">Trace Explorer</a> to filter traces by the affected service, time window, and error status. Sort by duration to find the slowest traces, or filter by otel.status_code: ERROR to find failures.</p>
<p><b>Key filters for troubleshooting:</b></p>
<ul>
<li aria-level="1"><b>By service name</b>: Isolate traces involving a specific service</li>
<li aria-level="1"><b>By minimum duration</b>: Find traces exceeding your latency SLO</li>
<li aria-level="1"><b>By status</b>: Filter for error traces only</li>
<li aria-level="1"><b>By operation</b>: Focus on a specific endpoint or database operation</li>
<li aria-level="1"><b>By custom attributes</b>: Filter by customer ID, order ID, or other business context</li>
</ul>
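<p>Conceptually, these filters are just predicates over trace metadata that compose with AND semantics. The sketch below shows the idea in Python; the record shape and field names are illustrative, not Sematext&#8217;s actual schema or query syntax:</p>

```python
def find_suspect_traces(traces, service=None, min_duration_ms=None,
                        errors_only=False, operation=None):
    """Filter trace summaries the way you would in a trace explorer UI.
    Each trace is a plain dict; the keys here are illustrative only."""
    result = []
    for t in traces:
        if service and service not in t["services"]:
            continue
        if min_duration_ms and t["duration_ms"] < min_duration_ms:
            continue
        if errors_only and t["status"] != "ERROR":
            continue
        if operation and t["operation"] != operation:
            continue
        result.append(t)
    # Sort slowest first, as you would when hunting latency outliers
    return sorted(result, key=lambda t: t["duration_ms"], reverse=True)
```

<p>Sorting slowest-first mirrors the typical workflow: the top few results are usually the traces worth opening.</p>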
<p><img decoding="async" class="alignnone size-large wp-image-70517" src="https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-1024x794.png" alt="" width="640" height="496" srcset="https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-1024x794.png 1024w, https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-300x233.png 300w, https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-768x596.png 768w, https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-1536x1191.png 1536w, https://sematext.com/wp-content/uploads/2026/02/trace-explorer-01-2048x1589.png 2048w" sizes="(max-width: 640px) 100vw, 640px" /></p>
<h3 id="step-3-analyze-the-trace-waterfall"><b>Step 3: Analyze the Trace Waterfall</b></h3>
<p>Open the <a href="https://sematext.com/docs/tracing/reports/trace-details/">Trace Details</a> view for a representative trace. The waterfall visualization shows the complete request flow with timing for each span. Look for the patterns described in this guide: long spans, gaps between spans, high span counts, and error spans.</p>
<p><img decoding="async" class="alignnone size-large wp-image-70516" src="https://sematext.com/wp-content/uploads/2026/02/span-details-01-1024x850.png" alt="" width="640" height="531" srcset="https://sematext.com/wp-content/uploads/2026/02/span-details-01-1024x850.png 1024w, https://sematext.com/wp-content/uploads/2026/02/span-details-01-300x249.png 300w, https://sematext.com/wp-content/uploads/2026/02/span-details-01-768x638.png 768w, https://sematext.com/wp-content/uploads/2026/02/span-details-01-1536x1276.png 1536w, https://sematext.com/wp-content/uploads/2026/02/span-details-01-2048x1701.png 2048w" sizes="(max-width: 640px) 100vw, 640px" /></p>
<h3 id="step-4-set-up-alerts"><b>Step 4: Set Up Alerts</b></h3>
<p>Configure <a href="https://sematext.com/alerts/">alerts</a> on the RED metrics derived from your traces. Start with error rate and p95 latency alerts for your most critical services and endpoints, then expand to more specific alerts as you learn your system’s failure patterns.</p>
<h2 id="troubleshooting-checklist-for-production-incidents"><b>Troubleshooting Checklist for Production Incidents</b></h2>
<p>When an incident hits, use this trace-based workflow to minimize time-to-resolution:</p>
<ol>
<li aria-level="1"><b>Identify the scope</b>: Check the service overview: is the issue isolated to one service or affecting multiple? Are error rates or latency elevated?</li>
<li aria-level="1"><b>Find representative traces</b>: Use the trace explorer to filter for affected traces. Sort by duration for latency issues, filter by error status for failures.</li>
<li aria-level="1"><b>Read the waterfall</b>: Open 3–5 representative traces. Look for: the longest span (bottleneck), error spans (root cause), gaps between spans (pool exhaustion), high span counts (N+1 patterns), and missing expected spans (service unreachable).</li>
<li aria-level="1"><b>Check span attributes</b>: Examine db.statement for bad queries, http.status_code for upstream failures, exception.message for error details, and custom attributes for business context.</li>
<li aria-level="1"><b>Correlate with other signals</b>: Jump to logs for detailed error messages and stack traces. Check infrastructure metrics for resource exhaustion. Look at deployment events for recent changes.</li>
<li aria-level="1"><b>Verify the fix</b>: After applying a fix, compare new traces against the problematic ones. Confirm the bottleneck span duration decreased, error spans disappeared, or the N+1 pattern resolved.</li>
</ol>
<h2 id="summary"><b>Summary</b></h2>
<p>Distributed tracing transforms microservices troubleshooting from guesswork into systematic diagnosis. The patterns covered in this guide, including latency bottlenecks, N+1 queries, timeout cascades, retry storms, error propagation, connection pool exhaustion, cache failures, and message queue issues, account for the vast majority of production incidents in distributed systems.</p>
<p>The key is developing pattern recognition: learn what healthy traces look like for your critical flows, and the unhealthy patterns will stand out immediately during incidents. OpenTelemetry auto-instrumentation provides the data foundation, and a capable tracing backend like <a href="https://sematext.com/tracing/">Sematext Tracing</a> gives you the analysis tools to turn that data into fast resolution.</p>
<p><b>Next steps:</b></p>
<ul>
<li aria-level="1">Not yet instrumented? Start with <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a></li>
<li aria-level="1">Need to optimize your instrumentation? Read <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a></li>
<li aria-level="1">Want to extract higher-level insights? See From Raw Traces to Operational Intelligence (coming soon – <a href="mailto:info@sematext.com">contact us</a>)</li>
<li aria-level="1">Ready to try? <a href="https://apps.sematext.com/ui/registration" target="_blank" rel="noopener noreferrer">Start your free Sematext trial</a>, no credit card required</li>
</ul>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/">Troubleshooting Microservices with OpenTelemetry Distributed Tracing</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival</title>
		<link>https://sematext.com/blog/opentelemetry-production-design/</link>
		
		<dc:creator><![CDATA[Otis]]></dc:creator>
		<pubDate>Wed, 11 Feb 2026 10:33:15 +0000</pubDate>
				<category><![CDATA[Logging]]></category>
		<category><![CDATA[Monitoring]]></category>
		<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[Tracing]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70498</guid>

					<description><![CDATA[<p>A lot of talk around OpenTelemetry has to do with instrumentation, especially auto-instrumentation, about OTel being vendor neutral, open, and a de facto standard. But how you use the final output of OTel is what makes a business difference. In other words, how do you use it to make your life as an SRE/DevOps/biz person easier? [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/opentelemetry-production-design/">OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A lot of talk around OpenTelemetry has to do with instrumentation, especially <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">auto-instrumentation</a>, about OTel being vendor neutral, open, and a de facto standard. But how you use the final output of OTel is what makes a business difference.</p>
<p>In other words, how do you use it to make your life as an SRE/DevOps/biz person easier?</p>
<p>How do you have to set things up to truly solve production issues faster?</p>
<p>And does doing that require you to spend more money on observability or can you be smart about how you set things up so that OTel doesn’t break the bank?</p>
<p>While we were putting the finishing touches on Sematext’s OTel support, I asked one of my friends about their experience with and use of OTel in the context of questions like the ones above. The friend, the company, and the monitoring vendor they used will go unnamed, but here are the experiences and the practices my friend shared.</p>
<p>We’re a mid-sized org with about 30 frontend and backend developers. We know our way around observability, but had not adopted OpenTelemetry until late 2025. When we first rolled out OpenTelemetry in production, it felt like we had finally “done observability right.”</p>
<p>Every service was instrumented. OK, almost every service. ;)<br>
Every request had a trace.<br>
Every component had a metric.<br>
Logs were nicely structured and correlated.</p>
<p>It was not quick and easy to set it all up, but we split the work among several team members and we did it.</p>
<p>However, within about two weeks we started observing – pun intended – problems:</p>
<ul>
<li aria-level="1">our storage bill doubled</li>
<li aria-level="1">dashboards became slow</li>
<li aria-level="1">our team stopped opening traces</li>
<li aria-level="1">cardinality exploded</li>
<li aria-level="1">and we started sampling randomly just to survive</li>
</ul>
<p>It became apparent pretty quickly that just adopting OpenTelemetry is not automatically going to give us good monitoring. OpenTelemetry doesn’t give you a signal strategy. Out of the box, with naive usage, it just gives you a firehose and enables you to drown in your own telemetry more quickly.</p>
<p>We kept this new firehose on, but we had to quickly start making decisions around things like:</p>
<ul>
<li aria-level="1">what belongs in metrics</li>
<li aria-level="1">what belongs in traces</li>
<li aria-level="1">what belongs in logs</li>
<li aria-level="1">and, perhaps most importantly, what should never be emitted at all!</li>
</ul>
<h2 id="how-i-think-about-the-three-telemetry-signals-now"><b>How I Think About the Three Telemetry Signals Now</b></h2>
<p>Early on, we treated metrics, logs, and traces as three different ways to describe the same thing. That was a mistake. They are not; they are different tools with different costs and different failure modes.</p>
<p>Now I think about them like this:</p>
<ul>
<li aria-level="1">Metrics answer: “Is the system healthy?” (both from tech/engineering perspective and business – we use metrics to understand the business side of things, too)</li>
<li aria-level="1">Traces answer: “Where did the time go?”</li>
<li aria-level="1">Logs answer: “What exactly happened?”</li>
</ul>
<p>This separation of concerns feels simple and straightforward. As long as the observability tool you’re using has good UX for cross-connecting and correlating these signals, this separation should serve you well.</p>
<h2 id="the-architecture-we-ended-up-with"><b>The Architecture We Ended Up With</b></h2>
<p>This is the shape that finally worked for us:</p>
<p><img decoding="async" class="alignnone wp-image-70499" src="https://sematext.com/wp-content/uploads/2026/02/application-collector-telemetry-300x148.png" alt="" width="867" height="428" srcset="https://sematext.com/wp-content/uploads/2026/02/application-collector-telemetry-300x148.png 300w, https://sematext.com/wp-content/uploads/2026/02/application-collector-telemetry-1024x505.png 1024w, https://sematext.com/wp-content/uploads/2026/02/application-collector-telemetry-768x379.png 768w, https://sematext.com/wp-content/uploads/2026/02/application-collector-telemetry.png 1532w" sizes="(max-width: 867px) 100vw, 867px" /></p>
<p> </p>
<p>The key idea is simple:<br>
<b>Applications emit everything. The collector acts as a filter, among other things, and decides what survives.</b></p>
<p>If you try to enforce strategy in application code, you’ll fail. Teams move too fast, especially now with AI. You need one place where you can say:</p>
<ul>
<li aria-level="1">keep error traces</li>
<li aria-level="1">drop noisy attributes</li>
<li aria-level="1">batch aggressively</li>
<li aria-level="1">deduplicate</li>
<li aria-level="1">enforce memory limits</li>
<li aria-level="1">…</li>
</ul>
<p>That place is the <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer">collector</a>.</p>
<h2 id="metrics-what-we-actually-trust-during-incidents"><b>Metrics: What We Actually Trust During Incidents</b></h2>
<p>The first real incident after we adopted OpenTelemetry was a checkout latency spike. Nobody opened a trace first. We all looked at metrics because our alert notifications pointed us there.</p>
<p>Metrics are what we trust when:</p>
<ul>
<li aria-level="1">we get an alert notification</li>
<li aria-level="1">the CTO asks “are we down?”</li>
<li aria-level="1">a deploy goes wrong</li>
</ul>
<p>So we designed metrics to answer only three questions:</p>
<ul>
<li aria-level="1">How many requests?</li>
<li aria-level="1">How many errors?</li>
<li aria-level="1">How slow are they?</li>
</ul>
<p>Sound familiar? 👌 Yes, <a href="https://thenewstack.io/monitoring-methodologies-red-and-use/" target="_blank" rel="noopener noreferrer">RED</a>!</p>
<p>Here’s a snippet from the relevant Python application.</p>
<h3 id="example-python"><b>Example (Python)</b></h3>
<pre>from opentelemetry import metrics

meter = metrics.get_meter("checkout")

# Rate and Errors: count every request, labeled by route and status
request_counter = meter.create_counter(
    "http.server.requests",
    description="Total HTTP requests"
)

# Duration: a histogram, so we can look at percentiles, not just averages
latency_histogram = meter.create_histogram(
    "http.server.duration",
    unit="ms"
)

def handle_request():
    request_counter.add(1, {"route": "/checkout", "status": "200"})
    latency_histogram.record(245, {"route": "/checkout"})

</pre>
<h3 id="hard-rule-we-learned"><b>Hard Rule We Learned</b></h3>
<p>It’s actually very simple: If a label (aka tag) can be different for every request, it does not belong in metrics.</p>
<p>These caused real problems for us:</p>
<ul>
<li aria-level="1">user_id</li>
<li aria-level="1">email</li>
<li aria-level="1">request_id</li>
<li aria-level="1">order_id</li>
</ul>
<p>You see where this is going? Yeah, cardinality. High cardinality kills storage, makes certain UI elements unusable (think dropdowns with 1000+ values – fun!), and so on.</p>
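<p>One way to enforce this rule centrally is in the collector rather than in application code. The OpenTelemetry Collector’s attributes processor can delete such keys before they ever reach the backend. A sketch, using the label names from the list above (add the processor to the processors list of whichever pipelines it should apply to):</p>

```yaml
processors:
  # Drop per-request identifiers so they can't blow up metric cardinality
  attributes/drop-high-cardinality:
    actions:
      - key: user_id
        action: delete
      - key: email
        action: delete
      - key: request_id
        action: delete
      - key: order_id
        action: delete
```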
<p><span style="font-weight: 400;">See </span><a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/#the-first-production-surprise-cardinality-explosions" target="_blank" rel="noopener"><span style="font-weight: 400;">The First Production Surprise: Cardinality Explosions</span></a><span style="font-weight: 400;"> for more details on cardinality problems in OpenTelemetry.</span></p>
<h2 id="traces-how-we-debugged-slow-requests"><b>Traces: How We Debugged Slow Requests</b></h2>
<p>When it comes to traces, you might think they are like logs: you want them all so you can really dig in when you need to troubleshoot. For us, at least, traces became useful only after we stopped trying to store all of them.</p>
<p>At first, we sampled at 100%. Meaning we didn’t sample at all.<br>
Then we realized how much that was going to cost us.<br>
Then we went for the other extreme and sampled at 1%.<br>
But then we missed the interesting traces.</p>
<p>What finally worked was <a href="https://opentelemetry.io/blog/2022/tail-sampling/" target="_blank" rel="noopener noreferrer"><b>tail-based sampling</b></a>:<br>
We decide after the trace finishes whether it’s worth keeping.</p>
<p>Earlier, I mentioned a collector acting as a filter that decides what survives. This is a perfect example of that. Here’s the collector config for sampling.</p>
<h3 id="tail-sampling-config"><b>Tail Sampling Config</b></h3>
<pre>processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: slow
        type: latency
        latency:
          threshold_ms: 500
</pre>
<p> </p>
<p>So now what we have does this:</p>
<ul>
<li aria-level="1">slow requests survive</li>
<li aria-level="1">failed requests survive</li>
<li aria-level="1">boring 200ms health checks die</li>
</ul>
<p>This changed traces from “expensive noise” into “high-signal debugging data.”</p>
<p>We also learned to be careful with attributes.<br>
Anything that explodes into millions of values makes sampling useless.</p>
<p><span style="font-weight: 400;">For more details on sampling strategies, see our article </span><a href="https://sematext.com/blog/opentelemetry-production-monitoring-what-breaks-and-how-to-prevent-it/#sampling-strategies-for-opentelemetry-in-production" target="_blank" rel="noopener"><span style="font-weight: 400;">Sampling Strategies for OpenTelemetry in Production</span></a><span style="font-weight: 400;">.</span></p>
<h2 id="logs-the-last-mile-of-debugging"><b>Logs: The Last Mile of Debugging</b></h2>
<p>We still rely on logs as much as we did before, except that with tracing in place, logs are often what we read after a trace tells us “this DB call is slow” and we need to know why, beyond what the trace itself can show.</p>
<p>So the big change – the key – for us was <b>correlating logs with traces</b>.</p>
<p>Here’s how we do it with Python. You’d do something like this in any language. Note how we get the trace_id and span_id from the context and include it in the log event.</p>
<h3 id="python-logging-with-trace-context"><b>Python Logging with Trace Context</b></h3>
<pre>from opentelemetry.trace import get_current_span
import logging

logger = logging.getLogger(__name__)

# This must run inside an active span; otherwise the returned
# context is invalid and the IDs will be zero
span = get_current_span()
ctx = span.get_span_context()

logger.error(
    "payment failed",
    extra={
        # Hex-format the IDs so they match what the tracing backend shows
        "trace_id": format(ctx.trace_id, "x"),
        "span_id": format(ctx.span_id, "x"),
        "order_id": 1234
    }
)
</pre>
<p> </p>
<p>Once we did this, debugging became a flow instead of a search:</p>
<p><img decoding="async" class="alignnone wp-image-70500" src="https://sematext.com/wp-content/uploads/2026/02/flow-metrics-traces-logs-300x153.png" alt="" width="839" height="428" srcset="https://sematext.com/wp-content/uploads/2026/02/flow-metrics-traces-logs-300x153.png 300w, https://sematext.com/wp-content/uploads/2026/02/flow-metrics-traces-logs-1024x521.png 1024w, https://sematext.com/wp-content/uploads/2026/02/flow-metrics-traces-logs-768x391.png 768w, https://sematext.com/wp-content/uploads/2026/02/flow-metrics-traces-logs.png 1528w" sizes="(max-width: 839px) 100vw, 839px" /></p>
<p> </p>
<p>Alert → metric → trace → log.<br>
That’s the loop we optimized for.</p>
<h2 id="the-collector-is-where-strategy-lives"><b>The Collector Is Where Strategy Lives</b></h2>
<p>Here’s a simplified version of the <a href="https://opentelemetry.io/docs/collector/configuration/" target="_blank" rel="noopener noreferrer">collector config</a> we ended up with:</p>
<pre>receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    limit_mib: 400
  batch:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]

    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
</pre>
<p>This let us:</p>
<ul>
<li aria-level="1">tune sampling without redeploying apps</li>
<li aria-level="1">cap memory</li>
<li aria-level="1">drop junk centrally</li>
</ul>
<h2 id="what-scaling-telemetry-really-means"><b>What Scaling Telemetry Really Means</b></h2>
<p>When people say “scaling OpenTelemetry,” they usually mean handling “more traffic/observability data.”</p>
<p>Based on our experience, though, what we actually hit first was:</p>
<ul>
<li aria-level="1">cardinality</li>
<li aria-level="1">storage</li>
<li aria-level="1">query performance</li>
<li aria-level="1">human attention</li>
</ul>
<p>And thus, what scaling really meant for us in this context was:</p>
<ul>
<li aria-level="1">having fewer but better metrics</li>
<li aria-level="1">having fewer but selectively chosen traces</li>
<li aria-level="1">well-structured logs that we can not just search but really slice and dice</li>
</ul>
<h2 id="what-id-do-again-and-what-i-wouldnt"><b>What I’d Do Again (and What I Wouldn’t)</b></h2>
<table>
<tbody>
<tr>
<td><b>Decision</b></td>
<td><b>Result</b></td>
</tr>
<tr>
<td>Tail-sample traces</td>
<td>Saved money and sanity</td>
</tr>
<tr>
<td>Golden signal metrics only</td>
<td>Stable dashboards</td>
</tr>
<tr>
<td>Correlate logs with traces</td>
<td>Faster debugging</td>
</tr>
<tr>
<td>Put strategy in collector</td>
<td>Central control</td>
</tr>
<tr>
<td>Let teams emit anything</td>
<td>Mistake (at first)</td>
</tr>
</tbody>
</table>
<p> </p>
<h2 id="the-gist"><b>The Gist</b></h2>
<p>OpenTelemetry is neither an observability <i>strategy</i> nor a <i>solution</i>. It’s a transport, a spec, an implementation in the form of SDKs. It’s just a tool. And one capable of drowning you in your own telemetry.</p>
<p>The strategy is being smart about how you set it up and how you use it. Plan on spending some time on this; it pays off in the long run. Questions to answer:</p>
<ul>
<li aria-level="1">what questions you want answered</li>
<li aria-level="1">what data you’re willing to pay for</li>
<li aria-level="1">what engineers will actually use</li>
</ul>
<p>Metrics tell me when things break.<br>
Traces tell me where they break.<br>
Logs tell me why they break.</p>
<p>Everything else … send it to <a href="https://www.geeksforgeeks.org/linux-unix/what-is-dev-null-in-linux/" target="_blank" rel="noopener noreferrer">/dev/null</a>?</p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/opentelemetry-production-design/">OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>8 Postman Alternatives Reviewed and Compared</title>
		<link>https://sematext.com/blog/best-8-postman-alternatives-reviewed-and-compared/</link>
		
		<dc:creator><![CDATA[Otis]]></dc:creator>
		<pubDate>Sun, 08 Feb 2026 09:07:00 +0000</pubDate>
				<category><![CDATA[Tools & comparisons]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70476</guid>

					<description><![CDATA[<p>Postman is handy and we’ve been using it at Sematext for years to organize requests into collections, define environments for different stages (local, test, prod), and write basic tests around responses. We also use it to share API examples with teammates and to generate docs from those collections. That said, we recently received an email [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/best-8-postman-alternatives-reviewed-and-compared/">8 Postman Alternatives Reviewed and Compared</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Postman is handy and we’ve been using it at Sematext for years to organize requests into collections, define environments for different stages (local, test, prod), and write basic tests around responses. We also use it to share API examples with teammates and to generate docs from those collections. That said, we recently received an email from Postman:</p>
<p><i>On March 1, 2026, we’re updating our plans. These changes affect how the Free plan works for teams.</i></p>
<p><b><i>….Moving forward, the Free plan will be limited to a single user. If you want to continue using Postman with multiple users, you’ll need to move to the Team plan.</i></b></p>
<p>Ooops! 🤬</p>
<p>So we looked at Postman alternatives and tested them. I’m guessing we are not the only ones curious about what else is out there. Below are our reviews.</p>
<div style="background: rgba(220,38,38,0.06); border-left: 3px solid #DC2626; border-radius: 0 8px 8px 0; padding: 22px 26px; margin: 36px 0; font-size: 17px; color: #7f1d1d;"><strong style="color: #991b1b;">💡 Side note:</strong> If you are looking to <a href="https://sematext.com/api-monitoring/">monitor your APIs</a>, whether internal or external, have a look at <a href="https://sematext.com/synthetic-monitoring">Sematext Synthetics</a> – it’s super simple API, uptime, and website monitoring. It’s cheap, has simple HTTP monitors as well as Browser monitors for simulating user journeys (with Playwright), syncing with Github, ability to extract data from API responses (REST or XML) and chart numerical values, supports various API auth methods, alerting on various conditions, has screenshotting capabilities, and so on. Here are the <a href="https://sematext.com/docs/synthetics/">docs</a>. 🤓</div>
<p> </p>
<p>At the bottom of the reviews you will find several comparison tables, too.</p>
<h2 id="postman-alternatives-for-api-testing-development">Postman Alternatives for API Testing &amp; Development</h2>
<p>Postman is the default for many teams, but it’s not always the best fit. Some tools are lighter, some are more code-friendly, and some focus on collaboration or automation. Below are the alternatives I’ve actually seen people use in real workflows.</p>
<h2 id="insomnia"><a href="https://insomnia.rest/" target="_blank" rel="noopener noreferrer"><b>Insomnia</b></a></h2>
<p>Insomnia is a desktop API client focused on request/response workflows without trying to manage your whole API lifecycle. It supports REST, GraphQL, gRPC, and WebSockets and does a good job of keeping the UI clean and predictable. It feels like what Postman used to be before collections, docs, and workspaces took over the interface. You can manage environments, reuse variables, and do light scripting. It works well as a daily tool for debugging APIs locally or against staging. It’s less about automation and more about being a solid interactive client.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">REST, GraphQL, gRPC, WebSocket</li>
<li aria-level="1">Environments and variables</li>
<li aria-level="1">Local, Git, or cloud storage</li>
<li aria-level="1">Request chaining and scripting</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Clean UI</li>
<li aria-level="1">Lightweight compared to Postman</li>
<li aria-level="1">Works offline</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Limited test automation</li>
<li aria-level="1">Collaboration is weaker without cloud tier</li>
</ul>
<p><b>💰 Pricing: </b>Free core app. Paid plans for cloud sync and team features.</p>
<p><b>My opinion:</b><b><br>
</b><em>I like Insomnia for hands-on API work, for rapidly exploring APIs – creating headers, chaining requests, poking GraphQL endpoints. I don’t like it for large automated test suites — it’s clearly optimized for manual usage. I wish its automated test runner were more powerful, but for local testing it’s solid.</em></p>
<h2 id="thunder-client"><a href="https://www.thunderclient.com/" target="_blank" rel="noopener noreferrer"><b>Thunder Client</b></a></h2>
<p>Thunder Client is an API testing extension for VS Code. Instead of running a separate app, you run requests inside your editor. It supports REST and GraphQL, environments, collections, and basic assertions. It’s aimed at developers who already live in VS Code and don’t want context switching. Requests are stored locally, which makes it easier to keep secrets out of cloud tools. It’s not meant to replace a full API testing platform, but it’s very effective for quick feedback while coding.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">REST and GraphQL</li>
<li aria-level="1">VS Code integration</li>
<li aria-level="1">Environments and collections</li>
<li aria-level="1">Basic testing and CLI</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">No app switching</li>
<li aria-level="1">Very fast setup</li>
<li aria-level="1">Local storage</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Limited advanced automation</li>
<li aria-level="1">VS Code only</li>
</ul>
<p><b>💰 Pricing: </b>Free tier. Paid plans for advanced testing and CI features.</p>
<p><b>My opinion:</b><b><br>
</b><em>Not great for big test scenarios or sharing with non-devs. I use this when I’m in the zone in VS Code and need quick checks — hitting an endpoint, validating JSON, testing auth headers. It’s not Postman-level if you’re building complex automated regression suites, but for day-to-day dev work this feels like a good tool.</em></p>
<h2 id="hoppscotch"><a href="https://hoppscotch.com/" target="_blank" rel="noopener noreferrer"><b>Hoppscotch</b></a></h2>
<p>Hoppscotch is a browser-based API client that started as an open-source Postman clone. It supports REST, GraphQL, WebSockets, and SSE and runs entirely in the browser. You don’t need to install anything, which makes it good for quick experiments or debugging on machines where you can’t install tools. It also supports workspaces and collaboration, but its strength is speed and simplicity. It’s more of an API scratchpad than a full testing platform.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">REST, GraphQL, WebSocket, SSE</li>
<li aria-level="1">Browser-based</li>
<li aria-level="1">Environments and history</li>
<li aria-level="1">Collaboration</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Zero install</li>
<li aria-level="1">Open source</li>
<li aria-level="1">Fast startup</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Limited automation</li>
<li aria-level="1">Browser storage constraints</li>
</ul>
<p><b>💰 Pricing: </b>Free and open source.</p>
<p><b>My opinion:</b><b><br>
</b><em>It doesn’t replace a structured test suite, but for ad hoc requests – the “just need to hit this URL quickly” sort of situation – it works great.</em></p>
<h2 id="soapui"><a href="https://www.soapui.org/" target="_blank" rel="noopener noreferrer"><b>SoapUI</b></a></h2>
<p>SoapUI is one of the oldest API testing tools and is heavily used in enterprise environments, especially where SOAP still exists. It supports REST and SOAP with assertions, data-driven tests, and service mocking. Compared to Postman-style tools, it’s more focused on structured testing rather than ad hoc requests. The UI feels dated, but the feature set is deep, especially for integration testing and complex scenarios.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">REST and SOAP</li>
<li aria-level="1">Assertions and data-driven tests</li>
<li aria-level="1">Mock services</li>
<li aria-level="1">Database integration</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Very powerful test tooling</li>
<li aria-level="1">Good for enterprise APIs</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Old-school UI</li>
<li aria-level="1">Steeper learning curve</li>
</ul>
<p><b>💰 Pricing: </b>Open-source version is free. ReadyAPI is commercial.</p>
<p><b>My opinion:</b><b><br>
</b><em>When you need serious integration tests, SoapUI is better than most GUI tools. But for simple REST debugging it can feel clunky.</em></p>
<h2 id="rest-assured"><a href="https://rest-assured.io/" target="_blank" rel="noopener noreferrer"><b>Rest-Assured</b></a></h2>
<p>Rest-Assured is not a GUI tool. It’s a Java DSL for API testing that integrates with JUnit or TestNG. You write tests in code that send HTTP requests and assert on responses. It’s designed for CI pipelines and code-first testing. If your backend is Java, this fits naturally into your test suite. It’s not meant for manual exploration — it’s meant for repeatable, automated verification.</p>
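<p>As a sketch of that code-first style, here is what a minimal Rest-Assured test looks like with JUnit 5. The base URL, path, and response field are hypothetical placeholders, and running it requires the <code>rest-assured</code> and JUnit dependencies plus a service listening on that port:</p>

```java
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.equalTo;

import org.junit.jupiter.api.Test;

class UserApiTest {

    @Test
    void getUserReturnsExpectedName() {
        given()
            .baseUri("http://localhost:8080")   // assumed local service under test
            .header("Accept", "application/json")
        .when()
            .get("/users/42")                   // hypothetical endpoint
        .then()
            .statusCode(200)
            .body("name", equalTo("Jane"));     // JSONPath assertion on the response body
    }
}
```

Because this is plain Java, it runs in the same build as your unit tests and fails CI when the API regresses.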
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">Java DSL</li>
<li aria-level="1">JSON and XML assertions</li>
<li aria-level="1">CI/CD friendly</li>
<li aria-level="1">Code-first testing</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Excellent for automation</li>
<li aria-level="1">Strong typing and IDE support</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">No GUI</li>
<li aria-level="1">Java-only</li>
</ul>
<p><b>💰 Pricing: </b>Free and open source.</p>
<p><b>My opinion:</b><b><br>
</b><em>I like it when API tests are part of the build. I don’t like it when I just want to manually inspect a response. If you’re automating tests as part of CI and writing tests alongside your code, this is often better than UI tools. But I’d still pop open a GUI for quick manual work.</em></p>
<h2 id="bruno"><a href="https://www.usebruno.com/" target="_blank" rel="noopener noreferrer"><b>Bruno</b></a></h2>
<p>Bruno is a desktop API client built around file-based collections. Instead of storing requests in a database or cloud, it stores them as files in your repo. That makes it Git-friendly and easy to review changes in pull requests. It supports REST and GraphQL, environments, and basic scripting. It’s intentionally opinionated: no mandatory cloud sync, no accounts, and minimal UI chrome.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">REST and GraphQL</li>
<li aria-level="1">File-based collections</li>
<li aria-level="1">Environments</li>
<li aria-level="1">Script hooks</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Git-friendly</li>
<li aria-level="1">Offline-first</li>
<li aria-level="1">Simple model</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Smaller ecosystem</li>
<li aria-level="1">Collaboration depends on Git</li>
</ul>
<p><b>💰 Pricing: </b>Free and open source. Paid plans for team features.</p>
<p><b>My opinion:</b><b><br>
</b><em>I like Bruno because it treats API requests like code instead of like cloud artifacts. Being able to review API changes in a pull request is a big win. What I don’t like is that it still feels young — some workflows are rough compared to mature tools like Insomnia or Postman.</em></p>
<h2 id="nokia-api-hub-aka-rapidapi"><a href="https://rapidapi.com/" target="_blank" rel="noopener noreferrer"><b>Nokia API Hub (aka RapidAPI)</b></a></h2>
<p>Nokia API Hub, previously known as RapidAPI, is more of an API marketplace with a built-in client. It’s useful when consuming third-party APIs because it gives you a hosted playground with auth, sample requests, and code snippets. It’s not really for testing your own local APIs. Think of it as interactive documentation with execution.</p>
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">Browser-based request runner</li>
<li aria-level="1">API key management</li>
<li aria-level="1">Code generation</li>
<li aria-level="1">Public API marketplace</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Great for external APIs</li>
<li aria-level="1">No setup</li>
<li aria-level="1">Easy onboarding</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Not for local APIs</li>
<li aria-level="1">Limited testing features</li>
</ul>
<p><b>💰 Pricing: </b>Free tier. Paid plans depend on API usage.</p>
<p><b>My opinion:</b><b><br>
</b><em>I’d use RapidAPI when evaluating an external API quickly. It’s good at “try before you integrate.” I wouldn’t use it as my daily API client because it’s not designed for local dev or structured testing.</em></p>
<h2 id="httpie"><a href="https://httpie.io/" target="_blank" rel="noopener noreferrer"><b>HTTPie</b></a></h2>
<p>HTTPie is a CLI-first API client designed as a more readable alternative to curl. It prints formatted JSON by default and has a cleaner syntax for headers, auth, and bodies. It fits well into shell scripts and automation. There is a desktop app now, but the CLI is the core product.</p>
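<p>To illustrate the cleaner syntax, here is the same JSON POST in curl and in HTTPie. The URL and token are placeholders, not a real endpoint:</p>

```shell
# curl: method, headers, and the JSON body all spelled out by hand
curl -s -X POST https://api.example.com/users \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"name": "Jane", "admin": false}'

# HTTPie: JSON is the default; key=value pairs become string fields,
# and key:=value pairs become raw JSON (numbers, booleans, ...)
http POST https://api.example.com/users \
  "Authorization:Bearer $TOKEN" \
  name=Jane admin:=false
```

The HTTPie form also pretty-prints and colorizes the response by default, which is most of its appeal over curl for interactive use.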
<p><b>👉 Key features:</b></p>
<ul>
<li aria-level="1">CLI-based</li>
<li aria-level="1">Pretty-printed output</li>
<li aria-level="1">Auth helpers</li>
<li aria-level="1">Script-friendly</li>
</ul>
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">Fast</li>
<li aria-level="1">Works in terminals and CI</li>
<li aria-level="1">Easy to automate</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">No visual UI</li>
<li aria-level="1">Harder for complex flows</li>
</ul>
<p><b>💰 Pricing: </b>CLI is free and open source. Desktop app has paid plans.</p>
<p><b>My opinion:</b><b><br>
</b><em>I like it for quick checks and automation. I don’t like using it for complex multi-step workflows; once things get stateful, a GUI tool is still easier to reason about.</em></p>
<p> </p>
<h2 id="tool-comparisons-tables"><b>Tool Comparisons Tables</b></h2>
<p>Here is some of the data from above, presented in tabular format for those who, like me, prefer tables to free-form text. 🙂</p>
<h3 id="tools-compared-by-use-case-and-type"><b>Tools Compared by Use Case and Type</b></h3>
<table>
<tbody>
<tr>
<td><b>Tool</b></td>
<td><b>Primary Use Case</b></td>
<td><b>UI Type</b></td>
</tr>
<tr>
<td>Insomnia</td>
<td>Manual API debugging</td>
<td>Desktop GUI</td>
</tr>
<tr>
<td>Thunder Client</td>
<td>In-editor testing</td>
<td>VS Code extension</td>
</tr>
<tr>
<td>Hoppscotch</td>
<td>Quick browser testing</td>
<td>Web UI</td>
</tr>
<tr>
<td>SoapUI</td>
<td>Structured API testing</td>
<td>Desktop GUI</td>
</tr>
<tr>
<td>Rest-Assured</td>
<td>Automated tests</td>
<td>Code (Java)</td>
</tr>
<tr>
<td>Bruno</td>
<td>Git-based API client</td>
<td>Desktop GUI</td>
</tr>
<tr>
<td>RapidAPI</td>
<td>3rd-party API exploration</td>
<td>Web UI</td>
</tr>
<tr>
<td>HTTPie</td>
<td>Scriptable requests</td>
<td>CLI</td>
</tr>
</tbody>
</table>
<h3 id="automation-vs-manual-testing"><b>Automation vs Manual Testing</b></h3>
<table>
<tbody>
<tr>
<td><b>Tool</b></td>
<td><b>Manual Testing</b></td>
<td><b>Automation</b></td>
</tr>
<tr>
<td>Insomnia</td>
<td>Strong</td>
<td>Weak</td>
</tr>
<tr>
<td>Thunder Client</td>
<td>Strong</td>
<td>Moderate</td>
</tr>
<tr>
<td>Hoppscotch</td>
<td>Strong</td>
<td>Weak</td>
</tr>
<tr>
<td>SoapUI</td>
<td>Moderate</td>
<td>Strong</td>
</tr>
<tr>
<td>Rest-Assured</td>
<td>None</td>
<td>Strong</td>
</tr>
<tr>
<td>Bruno</td>
<td>Strong</td>
<td>Weak</td>
</tr>
<tr>
<td>RapidAPI</td>
<td>Moderate</td>
<td>Weak</td>
</tr>
<tr>
<td>HTTPie</td>
<td>Moderate</td>
<td>Moderate</td>
</tr>
</tbody>
</table>
<p> </p>
<h3 id="best-fit-by-developer-type"><b>Best Fit By Developer Type</b></h3>
<table>
<tbody>
<tr>
<td><b>If you are…</b></td>
<td><b>Tool that fits</b></td>
</tr>
<tr>
<td>Backend dev in Java</td>
<td>Rest-Assured</td>
</tr>
<tr>
<td>VS Code power user</td>
<td>Thunder Client</td>
</tr>
<tr>
<td>Want Git-based collections</td>
<td>Bruno</td>
</tr>
<tr>
<td>Want fast manual client</td>
<td>Insomnia</td>
</tr>
<tr>
<td>Need enterprise testing</td>
<td>SoapUI</td>
</tr>
<tr>
<td>Testing external APIs</td>
<td>RapidAPI</td>
</tr>
<tr>
<td>Terminal-first dev</td>
<td>HTTPie</td>
</tr>
<tr>
<td>Need zero install</td>
<td>Hoppscotch</td>
</tr>
</tbody>
</table>
<p> </p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/best-8-postman-alternatives-reviewed-and-compared/">8 Postman Alternatives Reviewed and Compared</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Top 10 Lightstep (ServiceNow) Alternatives in 2026: Complete Migration Guide</title>
		<link>https://sematext.com/blog/top-10-lightstep-servicenow-alternatives-in-2026-complete-migration-guide/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Wed, 04 Feb 2026 12:23:50 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70419</guid>

					<description><![CDATA[<p>ServiceNow just pulled the plug on Lightstep. Here’s where to go next and how to migrate without re-instrumenting your entire stack. TL;DR ServiceNow announced the end-of-life (EOL) for Lightstep (Cloud Observability) on March 1, 2026. There’s no direct migration path and no replacement planned. If you’re a Lightstep user, you need to start planning your [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/top-10-lightstep-servicenow-alternatives-in-2026-complete-migration-guide/">Top 10 Lightstep (ServiceNow) Alternatives in 2026: Complete Migration Guide</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><i>ServiceNow just pulled the plug on Lightstep. Here’s where to go next and how to migrate without re-instrumenting your entire stack.</i></p>
<h2 id="tldr"><b>TL;DR</b></h2>
<p>ServiceNow announced the end-of-life (EOL) for Lightstep (Cloud Observability) on March 1, 2026. There’s no direct migration path and no replacement planned. If you’re a Lightstep user, you need to start planning your migration now.</p>
<p>The good news? If you’re already using <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer">OpenTelemetry</a> with Lightstep, switching to another OTel-native platform like Sematext takes minutes, not months. No re-instrumentation required.</p>
<h2 id="what-happened-to-lightstep"><b>What Happened to Lightstep?</b></h2>
<p><a href="https://docs.lightstep.com/changelog/eol-notice" target="_blank" rel="noopener noreferrer">On August 25, 2025, ServiceNow made it official</a>: Lightstep (now called Cloud Observability) is sunsetting. Support ends March 1, 2026, or at your contract end date, whichever comes later.</p>
<p>The key points from ServiceNow’s announcement:</p>
<ul>
<li aria-level="1">No direct replacement on the ServiceNow platform</li>
<li aria-level="1">No migration path to other ServiceNow products</li>
<li aria-level="1">Historical data cannot be migrated</li>
<li aria-level="1">ServiceNow is “shifting focus” to Service Reliability Management and Agentic Observability</li>
</ul>
<p>This isn’t just an inconvenience; it’s a reminder of the real risk of vendor lock-in with proprietary observability tools. Lightstep was acquired by ServiceNow in 2021 and rebranded to Cloud Observability in 2023. Three years later, it’s being killed.</p>
<p><b>The lesson: </b>this is exactly why OpenTelemetry matters. When your instrumentation is built on open standards, switching backends is a configuration change, not a re-instrumentation project. Teams using OTel with Lightstep can migrate in minutes, not months.</p>
<h2 id="what-made-lightstep-great-and-what-youll-miss"><b>What Made Lightstep Great (And What You’ll Miss)</b></h2>
<p>Before jumping to alternatives, let’s acknowledge what Lightstep did well:</p>
<h3 id="early-opentelemetry-champion"><b>Early OpenTelemetry Champion</b></h3>
<p>Lightstep was a founding contributor to OpenTelemetry. Ben Sigelman, Lightstep’s co-founder, was also a co-creator of OpenTracing (which later merged into OTel). This means most Lightstep users are already instrumented with OpenTelemetry, so your biggest migration headache is already solved.</p>
<h3 id="unified-observability"><b>Unified Observability</b></h3>
<p>Lightstep unified logs, metrics, and traces in a single platform, letting teams correlate telemetry signals during investigations without context-switching between tools.</p>
<h3 id="change-intelligence"><b>Change Intelligence</b></h3>
<p>Lightstep’s correlation of performance changes with deployments helped teams quickly identify if a recent release caused degradation.</p>
<h3 id="service-diagrams"><b>Service Diagrams</b></h3>
<p>Visual service dependency maps made it easy to understand complex microservices architectures at a glance.</p>
<p><b>A good Lightstep alternative should match or exceed these capabilities </b>while being built on open standards to prevent future vendor lock-in situations.</p>
<h2 id="what-to-look-for-in-a-lightstep-alternative"><b>What to Look for in a Lightstep Alternative</b></h2>
<p>When evaluating your options, prioritize these criteria:</p>
<h3 id="1-opentelemetry-native-support"><b>1. OpenTelemetry-Native Support</b></h3>
<p>This is non-negotiable. If you’re coming from Lightstep, you’re almost certainly using OpenTelemetry. Choose a platform built around OTel, not one that bolted it on as an afterthought. This ensures:</p>
<ul>
<li aria-level="1">Zero code changes during migration</li>
<li aria-level="1">Your instrumentation stays 100% standard OpenTelemetry, no vendor-specific SDKs or code changes</li>
<li aria-level="1">Future portability to any OTel-compatible backend</li>
</ul>
<h3 id="2-unified-logs-metrics-and-traces"><b>2. Unified Logs, Metrics, and Traces</b></h3>
<p>Tracing alone isn’t enough. You need the ability to pivot from a slow trace to related logs and infrastructure metrics with one click, not three different tools.</p>
<h3 id="3-transparent-pricing"><b>3. Transparent Pricing</b></h3>
<p>Observability costs can spiral out of control at scale. Look for straightforward pricing models that let you predict costs as your traffic grows. Avoid complex SKUs with per-host, per-user, AND per-GB charges.</p>
<h3 id="4-easy-migration"><b>4. Easy Migration</b></h3>
<p>If your alternative requires re-instrumenting your application, you’re looking at weeks of engineering work. The right choice accepts your existing OTLP data with a configuration change.</p>
<h3 id="5-no-vendor-lock-in"><b>5. No Vendor Lock-In</b></h3>
<p>You’re migrating from a tool that’s being killed. Don’t repeat the mistake. Choose platforms that embrace open standards and make it easy to export your data.</p>
<h2 id="top-10-lightstep-alternatives-in-2026"><b>Top 10 Lightstep Alternatives in 2026</b></h2>
<h3 id="1-sematext-tracing"><b>1. Sematext Tracing</b></h3>
<p><b>Best for: </b>Teams wanting OpenTelemetry-native tracing with full observability at 50-80% lower cost than enterprise competitors</p>
<p>Sematext Tracing is a modern distributed tracing solution built on OpenTelemetry from the ground up. It’s designed for teams that want deep visibility into their applications without the complexity or cost of traditional APM platforms.</p>
<p><b>Key Features:</b></p>
<ul>
<li aria-level="1"><b>OpenTelemetry-native architecture</b>: Send traces via OTLP (HTTP or gRPC) using any OpenTelemetry-compatible instrumentation. No proprietary agents.</li>
<li aria-level="1"><b>Zero-code auto-instrumentation</b>: Automatic tracing for Java, Python, Node.js, Go, .NET, Ruby, and more—no code changes required.</li>
<li aria-level="1"><b>Powerful trace analysis</b>: Search and filter traces by service, operation, latency, errors, or custom attributes with waterfall visualizations.</li>
<li aria-level="1"><b>Service Map</b>: See how your services communicate, track performance and errors at a glance, and investigate incidents faster.</li>
<li aria-level="1"><b>Trace-log-metric correlation</b>: Navigate from traces to related logs and infrastructure metrics with one click, unified in a single platform.</li>
<li aria-level="1"><b>Intelligent sampling</b>: Configure sampling to always capture errors and high-latency requests while controlling costs.</li>
<li aria-level="1"><b>RED metrics out of the box</b>: Automatically generate Rate, Error, and Duration metrics from trace data.</li>
<li aria-level="1"><b>Migration guides</b>: Dedicated documentation for migrating from<a href="https://sematext.com/docs/tracing/migration/jaeger/"> Jaeger</a>,<a href="https://sematext.com/docs/tracing/migration/zipkin/"> Zipkin</a>,<a href="https://sematext.com/docs/tracing/migration/datadog/"> Datadog</a>,<a href="https://sematext.com/docs/tracing/migration/newrelic/"> New Relic</a>, and<a href="https://sematext.com/docs/tracing/migration/dynatrace/"> Dynatrace</a>.</li>
</ul>
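<p>The sampling policy described above – always keep errors and slow requests, sample the rest – can be expressed in an OpenTelemetry Collector sitting in front of the backend using the contrib <code>tail_sampling</code> processor. The thresholds, percentages, and policy names below are illustrative only:</p>

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # wait for the full trace before deciding
    policies:
      - name: keep-errors       # always keep traces containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # always keep traces slower than 500 ms
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest   # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```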
<p><b>Pros:</b></p>
<ul>
<li aria-level="1">50-80% cheaper than Datadog, Dynatrace, Grafana Cloud or New Relic with no compromise on features. Not empty statements – yes, we compared the costs side by side with each of these vendors.</li>
<li aria-level="1">ClickHouse-powered backend delivers real-time trace analysis at any scale</li>
<li aria-level="1">True OpenTelemetry-native design eliminates vendor lock-in</li>
<li aria-level="1">Works with existing Jaeger, Zipkin, or any OTLP-compatible instrumentation</li>
<li aria-level="1">Unified platform combining logs, metrics, and traces in one UI</li>
<li aria-level="1">Simple, transparent per-span pricing with no hidden fees</li>
<li aria-level="1">14-day free trial with no credit card required</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li aria-level="1">Smaller brand recognition than enterprise incumbents</li>
</ul>
<p><b>Pricing: </b>Starts at $19/month for 2 million spans. Typically 50-80% less expensive than Datadog or New Relic at scale.</p>
<p><b>Migration from Lightstep: </b>Since both platforms are OpenTelemetry-native, migration is a configuration change: update your OTLP exporter endpoint from Lightstep to the Sematext Agent and you’re done. No code changes required.</p>
<h3 id="2-signoz"><b>2. SigNoz</b></h3>
<p><b>Best for: </b>Open-source teams wanting a self-hosted Datadog alternative</p>
<p>SigNoz is an open-source observability platform that unifies logs, metrics, and traces. It’s built on ClickHouse and is OpenTelemetry-native.</p>
<p><b>Pros: </b>Full observability stack in one open-source tool, no vendor lock-in, active community.</p>
<p><b>Cons: </b>Requires operational expertise for self-hosting, younger project.</p>
<p><b>Pricing: </b>Free (open-source self-hosted). SigNoz Cloud starts at $199/month.</p>
<h3 id="3-datadog-apm"><b>3. Datadog APM</b></h3>
<p><b>Best for: </b>Enterprises with deep pockets already invested in the Datadog ecosystem</p>
<p>Datadog APM provides distributed tracing as part of its comprehensive observability platform. It’s known for extensive integrations and polished UX.</p>
<p><b>Pros: </b>Mature, feature-rich platform with excellent UX, strong integration ecosystem.</p>
<p><b>Cons: </b>Pricing complexity with multiple SKUs, can become very expensive at scale (“bill shock” is common).</p>
<p><b>Pricing: </b>APM starts at $31/host/month. Additional charges for indexed spans, profiling, and more.</p>
<h3 id="4-new-relic"><b>4. New Relic</b></h3>
<p><b>Best for: </b>Teams wanting unlimited users with usage-based pricing</p>
<p>New Relic offers distributed tracing as part of its full-stack observability platform with a unique unlimited-users model for basic access.</p>
<p><b>Pros: </b>Unlimited basic users included, comprehensive APM, free tier available.</p>
<p><b>Cons: </b>Pricing complexity with user types (Basic, Core, Full), OTel-compatible but not OTel-native.</p>
<p><b>Pricing: </b>Free tier includes 100GB/month. Paid plans from $0.30/GB plus per-user fees.</p>
<h3 id="5-honeycomb"><b>5. Honeycomb</b></h3>
<p><b>Best for: </b>Teams practicing observability-driven development with high-cardinality data</p>
<p>Honeycomb focuses on high-cardinality data exploration for debugging complex distributed systems.</p>
<p><b>Pros: </b>Excellent for exploring unknown issues, no cardinality limits, BubbleUp for pattern detection.</p>
<p><b>Cons: </b>Different mental model from traditional APM, weaker on infrastructure monitoring.</p>
<p><b>Pricing: </b>Free tier available. Team plan from $130/month.</p>
<h3 id="6-grafana-tempo"><b>6. Grafana Tempo</b></h3>
<p><b>Best for: </b>Teams already using Grafana with Prometheus and Loki</p>
<p>Grafana Tempo is an open-source tracing backend designed for cost-effective trace storage using object storage.</p>
<p><b>Pros: </b>Extremely cost-effective storage, excellent Grafana integration, scales to massive volumes.</p>
<p><b>Cons: </b>Requires Grafana ecosystem investment, steeper learning curve for TraceQL.</p>
<p><b>Pricing: </b>Free (open-source). Grafana Cloud offers hosted Tempo with usage-based pricing.</p>
<h3 id="7-jaeger"><b>7. Jaeger</b></h3>
<p><b>Best for: </b>Teams wanting open-source, self-hosted tracing with full data control</p>
<p>Jaeger is a <a href="https://www.cncf.io/projects/jaeger/" target="_blank" rel="noopener noreferrer">CNCF-graduated distributed tracing platform </a>originally developed at Uber.</p>
<p><b>Pros: </b>Completely free and open-source, no vendor lock-in, mature and production-proven.</p>
<p><b>Cons: </b>Requires operational expertise, no built-in log/metric correlation.</p>
<p><b>Pricing: </b>Free (open-source). Infrastructure costs vary based on scale.</p>
<h3 id="8-elastic-apm"><b>8. Elastic APM</b></h3>
<p><b>Best for: </b>Organizations already using the Elastic Stack (ELK)</p>
<p>Elastic APM provides distributed tracing as part of Elastic Observability.</p>
<p><b>Pros: </b>Deep integration with Elastic Stack, powerful search and analytics.</p>
<p><b>Cons: </b>Complexity of managing Elasticsearch at scale, resource-intensive.</p>
<p><b>Pricing: </b>Free tier available. Elastic Cloud pricing based on deployment size.</p>
<h3 id="9-dynatrace"><b>9. Dynatrace</b></h3>
<p><b>Best for: </b>Large enterprises requiring automatic discovery and AI-driven insights</p>
<p>Dynatrace is an enterprise observability platform known for automatic instrumentation and AI-powered root cause analysis.</p>
<p><b>Pros: </b>Truly automatic instrumentation, AI-driven anomaly detection, strong enterprise features.</p>
<p><b>Cons: </b>Premium pricing, steep learning curve, OneAgent consumes more memory (200-500MB).</p>
<p><b>Pricing: </b>Contact sales. Generally one of the most expensive options.</p>
<h3 id="10-splunk-apm"><b>10. Splunk APM</b></h3>
<p><b>Best for: </b>Enterprises using Splunk for security and log analytics</p>
<p>Splunk APM provides full-fidelity distributed tracing with no sampling required.</p>
<p><b>Pros: </b>Full-fidelity tracing, strong correlation with Splunk logs and SIEM.</p>
<p><b>Cons: </b>Very expensive, complex pricing model.</p>
<p><b>Pricing: </b>Contact sales. Premium end of the market.</p>
<h2 id="quick-comparison-table"><b>Quick Comparison Table</b></h2>
<table>
<tbody>
<tr>
<td><b>Tool</b></td>
<td><b>OTel-Native</b></td>
<td><b>Self-Hosted</b></td>
<td><b>Log/Metric Correlation</b></td>
<td><b>Best For</b></td>
</tr>
<tr>
<td>Sematext</td>
<td>✅ Yes</td>
<td>❌ SaaS</td>
<td>✅ Yes</td>
<td>Cost-conscious teams</td>
</tr>
<tr>
<td>SigNoz</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>Open-source full-stack</td>
</tr>
<tr>
<td>Datadog</td>
<td>⚠️ Supported</td>
<td>❌ No</td>
<td>✅ Yes</td>
<td>Datadog ecosystem</td>
</tr>
<tr>
<td>New Relic</td>
<td>⚠️ Supported</td>
<td>❌ No</td>
<td>✅ Yes</td>
<td>Unlimited basic users</td>
</tr>
<tr>
<td>Honeycomb</td>
<td>✅ Yes</td>
<td>❌ No</td>
<td>⚠️ Partial</td>
<td>High-cardinality exploration</td>
</tr>
<tr>
<td>Grafana Tempo</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>Grafana users</td>
</tr>
<tr>
<td>Jaeger</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>❌ No</td>
<td>Self-hosted K8s teams</td>
</tr>
<tr>
<td>Elastic APM</td>
<td>⚠️ Supported</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>ELK users</td>
</tr>
<tr>
<td>Dynatrace</td>
<td>⚠️ Supported</td>
<td>✅ Yes</td>
<td>✅ Yes</td>
<td>Large enterprises</td>
</tr>
<tr>
<td>Splunk APM</td>
<td>⚠️ Supported</td>
<td>⚠️ Yes</td>
<td>✅ Yes</td>
<td>Splunk enterprises</td>
</tr>
</tbody>
</table>
<p><b>Note on “OTel-Native”: </b>✅ means the platform was built from the ground up on OpenTelemetry with no proprietary agents. ⚠️ means the platform supports OTel but also offers proprietary agents.</p>
<h2 id="how-to-migrate-from-lightstep-to-sematext"><b>How to Migrate from Lightstep to Sematext</b></h2>
<p>The migration is straightforward because both platforms are OpenTelemetry-native. You won’t need to change any application code; just update your exporter configuration.</p>
<h3 id="step-1-create-a-sematext-account"><b>Step 1: Create a Sematext Account</b></h3>
<p>Sign up for a free 14-day trial at <a href="https://apps.sematext.com/ui/registration" target="_blank" rel="noopener noreferrer">https://apps.sematext.com/ui/registration</a>. No credit card required.</p>
<p>Once logged in, create a Tracing App and note your ingestion endpoint and token from Settings → Ingestion Details.</p>
<h3 id="step-2-use-the-sematext-agent"><b>Step 2: Use the Sematext Agent</b></h3>
<p>Deploy the Sematext Agent with OpenTelemetry support. The agent receives OTLP data locally and forwards it securely to Sematext Cloud.</p>
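<p>If your services use the standard OpenTelemetry SDK environment variables, the switch is typically just repointing the exporter. The endpoint and service name below are illustrative; OTLP conventionally listens on port 4317 for gRPC and 4318 for HTTP:</p>

```shell
# Before: the exporter pointed at Lightstep's ingest endpoint, e.g.
#   export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.lightstep.com:443"

# After: point it at the locally deployed Sematext Agent instead
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="checkout-service"   # keep your existing service name
```

Restart the service and new traces flow to the agent; no SDK or code changes are involved.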
<h3 id="step-3-verify-and-decommission"><b>Step 3: Verify and Decommission</b></h3>
<p>Run both platforms in parallel briefly to verify traces are flowing correctly to Sematext. Once confirmed, you can safely disable Lightstep.</p>
<p><b>Important Migration Notes:</b></p>
<ul>
<li aria-level="1">Historical data cannot be migrated from Lightstep. You’ll start fresh in Sematext.</li>
<li aria-level="1">Dashboards and alerts need to be recreated in Sematext. Contact us if you’d like some help with that.</li>
<li aria-level="1">No code changes required if you’re using standard OpenTelemetry instrumentation.</li>
</ul>
<h2 id="why-sematext-for-your-lightstep-migration"><b>Why Sematext for Your Lightstep Migration?</b></h2>
<h3 id="1-true-opentelemetry-native-design"><b>1. True OpenTelemetry-Native Design</b></h3>
<p>Sematext was built from the ground up around OpenTelemetry. There’s no proprietary agent to choose between; OTel is the only instrumentation path. Migration from Lightstep is a configuration change, not a project.</p>
<h3 id="2-unified-observability"><b>2. Unified Observability</b></h3>
<p>Like Lightstep, Sematext provides logs, metrics, and traces in a single platform. Navigate from a slow trace to related logs and infrastructure metrics without switching tools.</p>
<h3 id="3-predictable-affordable-pricing"><b>3. Predictable, Affordable Pricing</b></h3>
<p>Sematext’s pricing is simple: pay per span with volume discounts. No per-host fees, no per-user fees, no surprise charges. Teams typically see 50-80% cost savings compared to Datadog or New Relic.</p>
<h3 id="4-comprehensive-language-support"><b>4. Comprehensive Language Support</b></h3>
<p>Auto-instrumentation SDKs for all major languages: Java, Python, Node.js, Go, .NET, Ruby, Browser JavaScript, and Mobile (iOS/Android).</p>
<h3 id="5-production-ready-today"><b>5. Production-Ready Today</b></h3>
<p>Sematext has been providing observability solutions since 2010. The platform handles massive scale with a ClickHouse-powered backend that delivers fast queries regardless of data volume.</p>
<h2 id="conclusion"><b>Conclusion</b></h2>
<p>The Lightstep EOL is disruptive, but it’s also an opportunity to move to a more resilient, open, and cost-effective observability stack.</p>
<p><b>Key takeaways:</b></p>
<ol>
<li aria-level="1">Don’t panic. You have until March 1, 2026.</li>
<li aria-level="1">Start planning now. Evaluate alternatives and begin migration before the deadline crunch.</li>
<li aria-level="1">Leverage your OpenTelemetry investment. If you’re using OTel with Lightstep, migration to another OTel-native platform is trivial.</li>
<li aria-level="1">Prioritize open standards. Avoid repeating the vendor lock-in mistake.</li>
</ol>
<p>For teams seeking a balance of features, OpenTelemetry-native design, and cost-effectiveness, Sematext Tracing is the ideal Lightstep replacement. It delivers enterprise-grade distributed tracing at a fraction of the cost of incumbents like Datadog and New Relic, with no compromise on functionality.</p>
<p><b>Ready to migrate? Start your free 14-day trial at </b><a href="https://apps.sematext.com/ui/registration" target="_blank" rel="noopener noreferrer"><b>https://apps.sematext.com/ui/registration</b></a><b>, no credit card required.</b></p>
<h2 id="faq"><b>FAQ</b></h2>
<h3 id="can-i-migrate-my-historical-data-from-lightstep"><b>Can I migrate my historical data from Lightstep?</b></h3>
<p>No. Historical trace data cannot be exported from Lightstep. You’ll start fresh with your new provider. This is another reason to avoid proprietary observability platforms: your data should always be portable.</p>
<h3 id="how-long-does-migration-take"><b>How long does migration take?</b></h3>
<p>If you’re already using OpenTelemetry with Lightstep, migration to Sematext takes about 10-15 minutes; it’s just a configuration change. Recreating dashboards and alerts will take longer depending on complexity.</p>
<h3 id="do-i-need-to-re-instrument-my-applications"><b>Do I need to re-instrument my applications?</b></h3>
<p>No. If you’re using OpenTelemetry (which most Lightstep users are), you simply update your OTLP exporter configuration. No code changes required.</p>
<h3 id="what-if-im-using-lightsteps-proprietary-sdk"><b>What if I’m using Lightstep’s proprietary SDK?</b></h3>
<p>Lightstep deprecated their proprietary SDK years ago in favor of OpenTelemetry. If you’re still on the old SDK, you’ll need to migrate to OTel first, but you should have done this already regardless of the Lightstep EOL.</p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/top-10-lightstep-servicenow-alternatives-in-2026-complete-migration-guide/">Top 10 Lightstep (ServiceNow) Alternatives in 2026: Complete Migration Guide</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</title>
		<link>https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 14:15:37 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70399</guid>

					<description><![CDATA[<p>This guide shows you how to implement OpenTelemetry’s auto-instrumentation for complete distributed tracing across your microservices, from initial setup through production optimization and troubleshooting. How Distributed Tracing Works in Microservices At its core, distributed tracing tracks requests as they flow through a distributed system. Each trace captures a complete journey, from the initial API gateway [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This guide shows you how to implement <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer">OpenTelemetry’s</a> auto-instrumentation for complete distributed tracing across your microservices, from initial setup through production optimization and troubleshooting.</p>
<h2 id="how-distributed-tracing-works-in-microservices"><b>How Distributed Tracing Works in Microservices</b></h2>
<p>At its core, distributed tracing tracks requests as they flow through a distributed system. Each trace captures a complete journey, from the initial API gateway request to the last database write or message publication. Inside a trace, individual operations are represented as spans, each capturing duration, attributes and status. By visualizing this information, you can pinpoint latency bottlenecks, identify errors and understand dependencies between services.</p>
<figure id="attachment_70401" aria-describedby="caption-attachment-70401" style="width: 512px" class="wp-caption alignnone"><img decoding="async" class="wp-image-70401 size-full" src="https://sematext.com/wp-content/uploads/2026/02/distributed-tracing-ecommerce.png" alt="" width="512" height="256" srcset="https://sematext.com/wp-content/uploads/2026/02/distributed-tracing-ecommerce.png 512w, https://sematext.com/wp-content/uploads/2026/02/distributed-tracing-ecommerce-300x150.png 300w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-70401" class="wp-caption-text">A distributed trace showing a request flowing through multiple microservices. Each horizontal bar represents a span (operation), with the x-axis showing time and nested spans showing service dependencies. The trace shows: API Gateway (180ms total) coordinating Auth Service (30ms), Cart Service (50ms), Payment Service (70ms) with their respective database calls.</figcaption></figure>
<p>Imagine an e-commerce platform with an API gateway that calls authentication, cart, payment, and notification services. A simple checkout may involve 10–15 different components. If latency spikes, a trace will reveal whether the root cause is the database query in the payment service, a downstream timeout in the email service or an overloaded cache in the cart service. This type of visibility is impossible with logs or metrics alone.</p>
<p>Distributed tracing provides two essential benefits for SREs and DevOps engineers:</p>
<ol>
<li aria-level="1">It dramatically reduces mean time to resolution (MTTR) by exposing the exact point of failure and enabling continuous performance tuning through detailed latency analysis.</li>
<li aria-level="1">It helps teams understand architectural dependencies that emerge organically over time, such as hidden service-to-service calls.</li>
</ol>
<h2 id="what-is-opentelemetry-and-how-does-it-enable-distributed-tracing"><b>What is OpenTelemetry and How Does It Enable Distributed Tracing?</b></h2>
<p><a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer">OpenTelemetry (OTel)</a> is an open-source, <a href="https://www.cncf.io/projects/opentelemetry/" target="_blank" rel="noopener noreferrer">CNCF</a>-graduated project for collecting telemetry data (traces, metrics, and logs) from any application, in any language. It provides the instrumentation libraries, SDKs, and exporters needed to collect and send data to any observability backend.</p>
<p>A span in OpenTelemetry represents a single operation, such as an HTTP request or a database query. A trace is a collection of spans that share the same trace ID, forming a tree that represents the full request path. OpenTelemetry attaches contextual metadata to each span following <a href="https://opentelemetry.io/docs/specs/semconv/" target="_blank" rel="noopener noreferrer">semantic conventions</a>, such as service.name, environment, version and host, allowing you to group and filter traces later.</p>
<figure id="attachment_70400" aria-describedby="caption-attachment-70400" style="width: 512px" class="wp-caption alignnone"><img decoding="async" class="wp-image-70400 size-full" src="https://sematext.com/wp-content/uploads/2026/02/otel-trace-span.png" alt="" width="512" height="320" srcset="https://sematext.com/wp-content/uploads/2026/02/otel-trace-span.png 512w, https://sematext.com/wp-content/uploads/2026/02/otel-trace-span-300x188.png 300w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-70400" class="wp-caption-text">Diagram showing how a single trace contains multiple spans in a tree structure. The root span represents the initial request, with child spans for each service call, database query, and cache operation. Each span contains: Trace ID (shared), Span ID (unique), Parent Span ID, Start/End timestamps, Attributes (http.method, db.statement), and Status.</figcaption></figure>
<p>Context propagation is what allows traces to connect across service boundaries. When Service A calls Service B, the trace context (trace ID and parent span ID) is passed along, usually via the <a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer">W3C traceparent header</a>. Without proper propagation, spans appear isolated and the trace is incomplete.</p>
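<p>To make the header concrete, here is an illustrative sketch (not the OpenTelemetry SDK itself) of how a service might parse the W3C <code>traceparent</code> header, which packs four dash-separated hex fields: version, 16-byte trace ID, 8-byte parent span ID, and trace flags.</p>

```python
# Hypothetical sketch: parse a W3C traceparent header into its four fields.
# Format: version "-" trace-id "-" parent-id "-" trace-flags (lowercase hex).
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,         # shared by every span in the trace
        "parent_span_id": parent_id,  # span that made the outgoing call
        "sampled": int(flags, 16) & 0x01 == 1,
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

<p>A real SDK does this for you: the incoming context becomes the parent of every span the service creates, which is what stitches spans from different services into one trace.</p>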
<p>Every OpenTelemetry setup involves an SDK, one or more exporters, and a collector or backend. The SDK manages spans, processors, and samplers. Exporters send the data to an endpoint using the <a href="https://opentelemetry.io/docs/specs/otlp/" target="_blank" rel="noopener noreferrer">OpenTelemetry Protocol (OTLP)</a> over gRPC or HTTP. The <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer">OpenTelemetry Collector</a> or agent receives this data, processes it, and forwards it to the observability platform.</p>
<h2 id="how-does-auto-instrumentation-work-benefits-and-implementation"><b>How Does Auto-Instrumentation Work? Benefits and Implementation</b></h2>
<p>Auto-instrumentation represents the most significant advancement in making distributed tracing accessible to production environments. Instead of manually adding tracing code throughout your application, auto-instrumentation agents detect and wrap common frameworks, libraries, and protocols automatically. This approach delivers immediate visibility with zero code changes, making it the recommended starting point for any OpenTelemetry implementation.</p>
<p>The magic happens through runtime manipulation, but each language uses a different approach to achieve zero-code instrumentation.</p>
<h3 id="how-auto-instrumentation-works-by-language"><b>How Auto-Instrumentation Works by Language</b></h3>
<table>
<tbody>
<tr>
<td><b>Language</b></td>
<td><b>Instrumentation Method</b></td>
<td><b>How It Works</b></td>
<td><b>Agent Attachment</b></td>
</tr>
<tr>
<td><b>Java</b></td>
<td>Bytecode Instrumentation</td>
<td>Modifies JVM bytecode at runtime</td>
<td>-javaagent:agent.jar</td>
</tr>
<tr>
<td><b>Python</b></td>
<td>Monkey Patching</td>
<td>Replaces functions at import time</td>
<td>opentelemetry-instrument wrapper</td>
</tr>
<tr>
<td><b>Node.js</b></td>
<td>Module Wrapping</td>
<td>Patches require() and wraps exports</td>
<td>--require ./tracing.js</td>
</tr>
<tr>
<td><b>.NET</b></td>
<td>CLR Profiling API</td>
<td>Intercepts method calls via CLR</td>
<td>Environment variables or NuGet</td>
</tr>
<tr>
<td><b>Go</b></td>
<td>Manual wrapping required</td>
<td>No auto-instrumentation available</td>
<td>Compile-time wrapping</td>
</tr>
<tr>
<td><b>Ruby</b></td>
<td>Monkey Patching</td>
<td>Modifies classes at runtime</td>
<td>require 'opentelemetry'</td>
</tr>
<tr>
<td><b>PHP</b></td>
<td>Extension hooks</td>
<td>Uses PHP extension API</td>
<td>extension=opentelemetry.so</td>
</tr>
</tbody>
</table>
<h3 id="bytecode-instrumentation-java-net"><b>Bytecode Instrumentation (Java, .NET)</b></h3>
<p>Bytecode instrumentation is the most powerful auto-instrumentation method, working at the virtual machine level. The agent modifies the <a href="https://www.baeldung.com/java-instrumentation" target="_blank" rel="noopener noreferrer">bytecode</a> of classes as they’re loaded, inserting tracing code without changing source files. This happens transparently when you start your application with the agent:</p>
<pre><code># Java example
java -javaagent:opentelemetry-javaagent.jar -jar myapp.jar

# The agent intercepts class loading and modifies methods like:
# - HttpServlet.service() → wrapped with span creation
# - PreparedStatement.execute() → wrapped with SQL capture
# - KafkaProducer.send() → wrapped with message tracing</code></pre>
<p>This approach provides the deepest integration, capturing everything from servlet containers to JDBC drivers, with zero application code changes.</p>
<h3 id="monkey-patching-python-ruby"><b>Monkey Patching (Python, Ruby)</b></h3>
<p><a href="https://realpython.com/python-monkey-patching/" target="_blank" rel="noopener noreferrer">Monkey patching</a> dynamically modifies classes and modules at runtime by replacing their methods with instrumented versions. The OpenTelemetry SDK wraps your application startup, patching libraries before your code runs:</p>
<pre><code># Python wraps your app at startup
opentelemetry-instrument python myapp.py

# Behind the scenes, it patches libraries:
# - requests.get → wrapped version with span creation
# - django.views → wrapped with request tracing
# - psycopg2.connect → wrapped with database tracing</code></pre>
<p>This method is simple to implement but requires careful ordering – instrumentation must happen before libraries are imported.</p>
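<p>A toy illustration (not the real OpenTelemetry instrumentation) shows the mechanism: keep a reference to the original function, then rebind the name to a wrapper that records a span-like timing entry before delegating.</p>

```python
import time

# Toy monkey-patching sketch: "fetch" stands in for a library call
# such as requests.get; the wrapper records a span-like entry.
spans = []

def fetch(url):
    return f"response from {url}"

_original_fetch = fetch               # keep a reference to the original

def traced_fetch(url):
    start = time.time()
    try:
        return _original_fetch(url)   # delegate to the real function
    finally:
        spans.append({"name": "fetch", "url": url,
                      "duration_s": time.time() - start})

fetch = traced_fetch                  # the patch: callers now hit the wrapper

fetch("https://example.com/api")
```

<p>This also illustrates why ordering matters: if application code had already captured a reference to the original <code>fetch</code> before the patch, its calls would bypass the wrapper entirely.</p>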
<h3 id="module-wrapping-node-js"><b>Module Wrapping (Node.js)</b></h3>
<p><a href="https://nodejs.org/" target="_blank" rel="noopener noreferrer">Node.js</a> auto-instrumentation works by intercepting the require() function and wrapping module exports. When your application loads a library, the instrumentation intercepts it and returns a wrapped version:</p>
<pre><code>// Start with instrumentation
node --require ./tracing.js myapp.js

// The tracing.js file hooks into require():
// - require('express') → returns wrapped Express with tracing
// - require('mysql') → returns wrapped MySQL client
// - require('@aws-sdk/client-s3') → returns wrapped AWS SDK</code></pre>
<p>This approach uses Node.js’s module system, making it reliable across different package managers and module formats.</p>
<h3 id="libraries-and-frameworks-covered-by-opentelemetry-auto-instrumentation"><b>Libraries and Frameworks Covered by OpenTelemetry Auto-Instrumentation</b></h3>
<p>What makes auto-instrumentation particularly powerful is its depth of coverage. The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation" target="_blank" rel="noopener noreferrer">OpenTelemetry Java agent</a>, for instance, instruments over 100 libraries and frameworks out of the box. It captures servlet containers like Tomcat and Jetty, HTTP clients including OkHttp and Apache HttpClient, JDBC connections to any database, message queues like Kafka and RabbitMQ, caching layers such as Redis and Memcached, and even AWS SDK calls. Each instrumentation module understands the semantics of what it’s tracing, adding appropriate attributes like http.method, db.statement, or messaging.destination that make traces immediately useful for debugging.</p>
<h3 id="example-what-gets-traced-in-a-spring-boot-microservice"><b>Example: What Gets Traced in a Spring Boot Microservice</b></h3>
<p>Consider a typical Spring Boot microservice. With auto-instrumentation, a single HTTP request automatically generates spans for the incoming HTTP server request, any Spring MVC controller invocations, JDBC queries with full SQL statements, outgoing HTTP calls to other services, Redis cache operations, and Kafka message publications. The agent also ensures proper context propagation across all these operations, maintaining trace continuity even through asynchronous boundaries.</p>
<h3 id="how-opentelemetry-captures-errors-and-performance-metrics-automatically"><b>How OpenTelemetry Captures Errors and Performance Metrics Automatically</b></h3>
<p>Auto-instrumentation goes beyond basic operation tracking. It captures exceptions and stack traces when errors occur, records response codes and status information, measures queue times and connection pool waiting, and adds resource attributes about the runtime environment. This rich context transforms raw timing data into actionable insights. When a database query shows high latency, you can immediately see the exact SQL statement, the connection pool state, and whether the delay was in acquiring a connection or executing the query itself.</p>
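<p>The error-capture pattern can be sketched in plain Python (a simplified stand-in for what the real agent does): the wrapper records status and exception details on the span record, then re-raises so application behavior is unchanged.</p>

```python
# Simplified stand-in for agent error capture (not the OpenTelemetry API):
# record status and exception type on the span, then re-raise.
spans = []

def traced(name, fn, *args):
    span = {"name": name, "status": "OK", "exception": None}
    try:
        return fn(*args)
    except Exception as exc:
        span["status"] = "ERROR"
        span["exception"] = type(exc).__name__
        raise                          # the application still sees the error
    finally:
        spans.append(span)             # the span is exported either way

def charge(amount):
    if amount <= 0:
        raise ValueError("invalid amount")
    return "charged"

try:
    traced("payment.charge", charge, -5)
except ValueError:
    pass
```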
<h2 id="manual-vs-auto-instrumentation-when-to-use-each-approach"><b>Manual vs Auto-Instrumentation: When to Use Each Approach</b></h2>
<p>Manual instrumentation still has its place, primarily for capturing business-specific operations that auto-instrumentation cannot understand. Examples include domain events like order processing stages, custom caching logic, batch job progress, or proprietary protocol interactions. The key is to use manual instrumentation to supplement auto-instrumentation, not replace it. Most production systems achieve excellent observability with 95% auto-instrumentation and 5% manual additions for critical business logic.</p>
<table>
<tbody>
<tr>
<td><b>Aspect</b></td>
<td><b>Auto-Instrumentation</b></td>
<td><b>Manual Instrumentation</b></td>
</tr>
<tr>
<td><b>Setup Time</b></td>
<td>Minutes, just attach the agent</td>
<td>Hours to days, requires code changes</td>
</tr>
<tr>
<td><b>Code Changes</b></td>
<td>Zero, no application code modified</td>
<td>Extensive, spans added throughout code</td>
</tr>
<tr>
<td><b>Coverage</b></td>
<td>Automatic for all supported libraries</td>
<td>Only what you explicitly instrument</td>
</tr>
<tr>
<td><b>Maintenance</b></td>
<td>Automatically updated with agent</td>
<td>Requires ongoing code maintenance</td>
</tr>
<tr>
<td><b>Business Context</b></td>
<td>Limited to technical operations</td>
<td>Can capture business specific metrics</td>
</tr>
<tr>
<td><b>Performance Impact</b></td>
<td>~2-5% overhead</td>
<td>Variable, depends on implementation</td>
</tr>
<tr>
<td><b>Best For</b></td>
<td>HTTP calls, databases, queues, caches</td>
<td>Business events, custom protocols, domain logic</td>
</tr>
</tbody>
</table>
<p><i>Table: Comparison of Auto-Instrumentation and Manual Instrumentation with OpenTelemetry</i></p>
<p>The optimal approach combines both: use auto-instrumentation for technical coverage, then add manual instrumentation for critical business operations that need specific context.</p>
<h3 id="manual-instrumentation-example"><b>Manual Instrumentation Example</b></h3>
<p>Here’s how to add manual spans to capture business context that auto-instrumentation misses:</p>
<pre><code>// Manual span to augment auto-instrumentation
Span span = tracer.spanBuilder("order.validation").startSpan();
try (Scope scope = span.makeCurrent()) {
  validateInventory(order);
  validatePayment(order);
  span.setAttribute("order.total", order.getTotal());
  span.setAttribute("order.items", order.getItemCount());
} finally {
  span.end();
}</code></pre>
<h2 id="opentelemetry-auto-instrumentation-setup-for-microservices"><b>OpenTelemetry Auto-Instrumentation Setup for Microservices</b></h2>
<p>Implementing auto-instrumentation varies by runtime, but the pattern remains consistent: attach an agent or SDK, configure the export destination, and start your application. The following examples demonstrate production-ready configurations for common platforms. For more detailed SDK documentation, see <a href="https://sematext.com/docs/tracing/sdks/">Sematext OpenTelemetry SDKs</a>.</p>
<h3 id="java-microservices-spring-boot-quarkus-micronaut"><b>Java Microservices (Spring Boot, Quarkus, Micronaut)</b></h3>
<p>The <a href="https://opentelemetry.io/docs/languages/java/automatic/" target="_blank" rel="noopener noreferrer">Java agent</a> works with any JVM application, from Spring Boot to Quarkus to legacy servlet containers. Download the agent JAR and attach it via the -javaagent flag:</p>
<pre><code># Download the agent

curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar -o opentelemetry-javaagent.jar

# Run with full instrumentation

java -javaagent:./opentelemetry-javaagent.jar \
-Dotel.service.name=payment-service \
-Dotel.exporter.otlp.endpoint=http://your-collector:4318 \
-Dotel.exporter.otlp.protocol=http/protobuf \
-Dotel.metrics.exporter=none \
-Dotel.logs.exporter=none \
-Dotel.instrumentation.jdbc.statement-sanitizer.enabled=true \
-Dotel.instrumentation.common.db-statement-sanitizer.enabled=true \
-Dotel.resource.attributes=deployment.environment=production,service.version=2.5.1 \
-Dotel.propagators=tracecontext,baggage \
-Dotel.javaagent.debug=false \
-jar your-application.jar</code></pre>
<p>For containerized environments, integrate the agent directly into your Docker image:</p>
<pre><code>FROM eclipse-temurin:17-jre-alpine
RUN apk add --no-cache curl

# Add the OpenTelemetry agent
RUN curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar \
    -o /opt/opentelemetry-javaagent.jar

# Copy your application
COPY target/payment-service.jar /opt/app.jar

# Configure the agent via environment variables
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/opentelemetry-javaagent.jar"
ENV OTEL_SERVICE_NAME="payment-service"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://sematext-agent:4318"
ENV OTEL_METRICS_EXPORTER="none"
ENV OTEL_LOGS_EXPORTER="none"

ENTRYPOINT ["java", "-jar", "/opt/app.jar"]</code></pre>
<h3 id="node-js-microservices-express-fastify-nestjs"><b>Node.js Microservices (Express, Fastify, NestJS)</b></h3>
<p>The <a href="https://opentelemetry.io/docs/languages/js/getting-started/nodejs/" target="_blank" rel="noopener noreferrer">Node.js instrumentation</a> requires a small initialization file but then automatically instruments all supported packages.</p>
<pre><code>// tracing.js - Initialize before your application code

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const traceExporter = new OTLPTraceExporter({
  url:
    process.env.OTEL_EXPORTER_OTLP_ENDPOINT ||
    'http://localhost:4318/v1/traces',
  headers: {},
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]:
      process.env.SERVICE_NAME || 'api-gateway',
    [SemanticResourceAttributes.SERVICE_VERSION]:
      process.env.SERVICE_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
      process.env.NODE_ENV || 'development',
  }),

  spanProcessor: new BatchSpanProcessor(traceExporter, {
    maxQueueSize: 2048,
    maxExportBatchSize: 512,
    scheduledDelayMillis: 5000,
  }),

  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Too noisy for production
      },

      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) =&gt; {
          span.setAttribute(
            'http.request.body.size',
            request.headers['content-length']
          );
        },

        ignoreIncomingRequestHook: (request) =&gt; {
          // Ignore health checks and metrics endpoints
          return request.url?.match(/^\/(health|metrics|ready|live)/);
        },
      },

      '@opentelemetry/instrumentation-aws-sdk': {
        suppressInternalInstrumentation: true,
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () =&gt; {
  sdk
    .shutdown()
    .then(() =&gt; console.log('Tracing terminated'))
    .catch((error) =&gt;
      console.log('Error terminating tracing', error)
    )
    .finally(() =&gt; process.exit(0));
});
</code></pre>
<p>Start your application with the initialization:</p>
<pre><code>node --require ./tracing.js app.js</code></pre>
<h3 id="python-microservices-fastapi-django-flask"><b>Python Microservices (FastAPI, Django, Flask)</b></h3>
<p><a href="https://opentelemetry.io/docs/languages/python/automatic/" target="_blank" rel="noopener noreferrer">Python auto-instrumentation</a> uses the opentelemetry-instrument command to wrap your application:</p>
<pre><code># Install the required packages
pip install opentelemetry-distro[otlp] opentelemetry-instrumentation
# Bootstrap to install all available instrumentations
opentelemetry-bootstrap --action=install
# Run with auto-instrumentation
OTEL_SERVICE_NAME=cart-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://your-collector:4318 \
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf \
OTEL_METRICS_EXPORTER=none \
OTEL_LOGS_EXPORTER=none \
OTEL_RESOURCE_ATTRIBUTES="service.version=1.2.3,deployment.environment=production" \
opentelemetry-instrument python app.py</code></pre>
<p>For production deployments using Gunicorn or uWSGI:</p>
<pre><code># gunicorn_config.py
import os
from opentelemetry import trace
from opentelemetry.instrumentation.auto_instrumentation import sitecustomize

def post_fork(server, worker):
    # Force re-initialization after fork
    sitecustomize.initialize()

bind = "0.0.0.0:8000"
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"</code></pre>
<h3 id="net-microservices-asp-net-core"><b>.NET Microservices (ASP.NET Core)</b></h3>
<p><a href="https://opentelemetry.io/docs/languages/net/automatic/" target="_blank" rel="noopener noreferrer">.NET instrumentation</a> can be done via NuGet packages or using the automatic instrumentation agent:</p>
<pre><code>// Program.cs

using OpenTelemetry.Exporter;
using OpenTelemetry.Instrumentation.AspNetCore;
using OpenTelemetry.Instrumentation.Http;
using OpenTelemetry.Instrumentation.SqlClient;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Configure OpenTelemetry
builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(resource =&gt; resource
        .AddService("inventory-service", serviceVersion: "2.1.0")
        .AddAttributes(new Dictionary&lt;string, object&gt;
        {
            ["deployment.environment"] = builder.Environment.EnvironmentName,
            ["host.name"] = Environment.MachineName
        }))
    .WithTracing(tracing =&gt; tracing
        .AddAspNetCoreInstrumentation(options =&gt;
        {
            options.Filter = httpContext =&gt;
            {
                // Exclude health checks
                return !httpContext.Request.Path.Value?.Contains("health") ?? true;
            };

            options.RecordException = true;
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(options =&gt;
        {
            options.SetDbStatementForText = true;
            options.RecordException = true;
            options.SetDbStatementForStoredProcedure = true;
        })
        .AddEntityFrameworkCoreInstrumentation()
        .AddRedisInstrumentation()
        .AddOtlpExporter(otlpOptions =&gt;
        {
            otlpOptions.Endpoint = new Uri("http://your-collector:4318");
            otlpOptions.Protocol = OtlpExportProtocol.HttpProtobuf;
        })
        .SetSampler(new TraceIdRatioBasedSampler(0.1))); // 10% sampling

var app = builder.Build();
</code></pre>
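<p>The <code>TraceIdRatioBasedSampler(0.1)</code> configured above makes a deterministic decision from the trace ID, so every service that sees the same trace agrees on whether to keep it. A simplified sketch of the idea (an assumed illustration, not the SDK's exact algorithm):</p>

```python
# Illustrative trace-ID ratio sampling: treat the low 8 bytes of the
# trace ID as an integer and sample if it falls below ratio * 2^64.
# The same trace ID always yields the same answer, so all services
# in a trace make a consistent decision without coordinating.
def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound
```

<p>Because the decision is a pure function of the trace ID, no sampling state needs to be propagated beyond the trace context itself.</p>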
<h2 id="from-instrumentation-to-insights-whats-next"><b>From Instrumentation to Insights: What’s Next?</b></h2>
<p>With OpenTelemetry auto-instrumentation now running across your microservices, you’re collecting comprehensive trace data from every request, database query, and service interaction. The agents are capturing timing, errors, and context automatically. But instrumentation is just the foundation.</p>
<p>The real value of distributed tracing comes from using this data to:</p>
<p><b>Debug Production Issues</b> – Traces reveal performance problems that are invisible in logs or metrics alone. Issues like N+1 database queries, connection pool exhaustion, service dependency bottlenecks, and timeout cascades become immediately apparent in trace visualizations. Learn how to diagnose these issues step-by-step in our guide to <a class="underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current" href="https://sematext.com/blog/troubleshooting-microservices-with-opentelemetry-distributed-tracing/">Troubleshooting Microservices with OpenTelemetry Distributed Tracing</a>.</p>
<p><b>Optimize for Production Scale</b> – While auto-instrumentation works out of the box, production deployments require careful tuning. From implementing intelligent sampling strategies to ensuring context propagation across async boundaries, there are proven patterns for running OpenTelemetry at scale. Learn these critical configurations and avoid common pitfalls in <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a>.</p>
<p><b>Extract Operational Intelligence</b> – Raw traces contain rich insights about your system’s behavior. By analyzing span relationships and attributes, you can build service dependency maps, identify critical paths that impact latency, detect performance regressions between deployments, and understand resource utilization patterns.</p>
<p>The following sections provide a foundation for using your newly instrumented traces effectively, with links to our detailed guides for deeper exploration.</p>
<h2 id="how-sematext-uses-opentelemetry"><b>How Sematext Uses OpenTelemetry</b></h2>
<p>OpenTelemetry with auto-instrumentation provides extensive data collection, but you need a backend to store and analyze this data. While open-source options like <a href="https://www.jaegertracing.io/" target="_blank" rel="noopener noreferrer">Jaeger</a> and <a href="https://zipkin.io/" target="_blank" rel="noopener noreferrer">Zipkin</a> work well for development, and commercial APMs like <a href="https://www.datadoghq.com/" target="_blank" rel="noopener noreferrer">Datadog</a> require proprietary agents, <a href="https://sematext.com/docs/tracing/">Sematext Tracing</a> offers a fully OpenTelemetry-native platform that handles the scale and cardinality of production microservices without vendor lock-in.</p>
<h2 id="frequently-asked-questions"><b>Frequently Asked Questions</b></h2>
<h3 id="does-opentelemetry-impact-microservices-performance"><b>Does OpenTelemetry impact microservices performance?</b></h3>
<p>Auto-instrumentation typically adds 2-5% CPU overhead and 30-50MB memory per service according to <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/benchmark-overhead" target="_blank" rel="noopener noreferrer">official benchmarks</a>. With 10% sampling, the impact is negligible for most production workloads. Performance impact can be further minimized by disabling noisy instrumentations and optimizing batch processor settings; see our guide to <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a> for detailed performance tuning strategies.</p>
<h3 id="opentelemetry-vs-commercial-apm-tools-whats-the-difference"><b>OpenTelemetry vs commercial APM tools – what’s the difference?</b></h3>
<p>OpenTelemetry provides vendor-neutral instrumentation that works with any backend. Commercial APMs use proprietary agents that lock you to their platform. OpenTelemetry gives you freedom to switch backends (i.e. observability vendors) without re-instrumenting your entire stack.</p>
<h3 id="can-opentelemetry-handle-production-scale"><b>Can OpenTelemetry handle production scale?</b></h3>
<p>Yes. Companies like <a href="https://www.uber.com/blog/distributed-tracing/" target="_blank" rel="noopener noreferrer">Uber</a> and Netflix use OpenTelemetry-based tracing at massive scale, processing billions of spans daily. The key is choosing a backend that can handle your data volume and implementing appropriate sampling strategies. Learn how to configure OpenTelemetry for high-volume production deployments in our comprehensive guide: <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a>.</p>
<h3 id="is-opentelemetry-production-ready"><b>Is OpenTelemetry production-ready?</b></h3>
<p>OpenTelemetry tracing reached <a href="https://opentelemetry.io/blog/2021/tracing-stable-ga/" target="_blank" rel="noopener noreferrer">stability</a> in 2021 and is production-ready for all major languages. Major cloud providers and observability vendors now support OTLP natively.</p>
<h2 id="conclusion"><b>Conclusion</b></h2>
<p>OpenTelemetry’s auto-instrumentation agents handle the complexity of trace collection, context propagation, and data formatting. They work across languages and frameworks, providing consistent telemetry regardless of your technology stack. The zero-code approach means you can instrument legacy services, third-party applications, and rapidly evolving microservices with equal ease.</p>
<p>By combining OpenTelemetry auto-instrumentation with an appropriate backend, you create a production-ready observability solution that scales from proof-of-concept to enterprise deployment. Auto-instrumentation provides the data, and modern backends provide the intelligence to deliver the visibility you need to operate distributed systems with confidence.</p>
<p>The future of observability isn’t about instrumenting more code; it’s about extracting more value from the instrumentation that happens automatically.</p>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenTelemetry Instrumentation Best Practices for Microservices Observability</title>
		<link>https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/</link>
		
		<dc:creator><![CDATA[fulya.uluturk]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 14:14:39 +0000</pubDate>
				<category><![CDATA[OpenTelemetry]]></category>
		<category><![CDATA[OpenTelemetry instrumentation best practices]]></category>
		<category><![CDATA[OpenTelemetry microservices]]></category>
		<guid isPermaLink="false">https://sematext.com/?p=70392</guid>

					<description><![CDATA[<p>OpenTelemetry instrumentation is the foundation of modern microservices observability, but getting it right in production requires more than just enabling auto-instrumentation. This guide covers production-tested OpenTelemetry best practices that help engineering teams achieve reliable distributed tracing, control observability costs, and extract maximum value from their telemetry data. Whether you’re optimizing an existing OpenTelemetry deployment or [&#8230;]</p>
<p>The post <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>OpenTelemetry instrumentation is the foundation of modern microservices observability, but getting it right in production requires more than just enabling auto-instrumentation. This guide covers production-tested OpenTelemetry best practices that help engineering teams achieve reliable distributed tracing, control observability costs, and extract maximum value from their telemetry data.</p>
<p>Whether you’re optimizing an existing OpenTelemetry deployment or planning a new observability strategy for your microservices architecture, these instrumentation best practices will help you avoid common pitfalls and build a scalable tracing foundation.</p>
<p><b>What you’ll learn:</b></p>
<ul>
<li aria-level="1">How to optimize OpenTelemetry auto-instrumentation for production workloads</li>
<li aria-level="1">Sampling strategies that balance cost control with debugging capability</li>
<li aria-level="1">Context propagation patterns for complex distributed systems</li>
<li aria-level="1">Security practices for protecting sensitive data in traces</li>
<li aria-level="1">Performance tuning techniques for high-throughput services</li>
</ul>
<p>For step-by-step implementation instructions, see our companion guide: <a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a>.</p>
<h2 id="why-opentelemetry-instrumentation-best-practices-matter"><b>Why OpenTelemetry Instrumentation Best Practices Matter</b></h2>
<p><a href="https://opentelemetry.io/docs/zero-code/" target="_blank" rel="noopener noreferrer">OpenTelemetry auto-instrumentation</a> provides immediate observability value with zero code changes, but production environments demand careful optimization. Without proper instrumentation practices, organizations commonly face:</p>
<ul>
<li aria-level="1"><b>Runaway costs</b> from excessive trace volume overwhelming storage budgets</li>
<li aria-level="1"><b>Missing traces</b> due to context propagation failures across service boundaries</li>
<li aria-level="1"><b>Performance degradation</b> from unbounded span attributes consuming memory</li>
<li aria-level="1"><b>Security risks</b> from inadvertently captured passwords, API keys, and PII</li>
<li aria-level="1"><b>Incomplete visibility</b> when sampling drops critical error traces</li>
</ul>
<p>The difference between a proof-of-concept and a production-grade observability deployment lies in how well you apply these OpenTelemetry best practices. Teams that master instrumentation configuration achieve 50-70% faster mean time to resolution (<a href="https://sematext.com/glossary/mean-time-to-resolution/">MTTR</a>), 80-95% lower observability costs through intelligent sampling, and more reliable insights into service performance.</p>
<figure id="attachment_70393" aria-describedby="caption-attachment-70393" style="width: 800px" class="wp-caption alignnone"><img decoding="async" class="wp-image-70393 size-full" src="https://sematext.com/wp-content/uploads/2026/02/optimization-comparison.png" alt="" width="800" height="462" srcset="https://sematext.com/wp-content/uploads/2026/02/optimization-comparison.png 800w, https://sematext.com/wp-content/uploads/2026/02/optimization-comparison-300x173.png 300w, https://sematext.com/wp-content/uploads/2026/02/optimization-comparison-768x444.png 768w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-70393" class="wp-caption-text">Figure 1: Impact of applying OpenTelemetry instrumentation best practices — 90% cost reduction while improving trace quality</figcaption></figure>
<h2 id="how-to-optimize-opentelemetry-auto-instrumentation-for-production"><b>How to Optimize OpenTelemetry Auto-Instrumentation for Production</b></h2>
<p>Auto-instrumentation captures telemetry from common frameworks and libraries automatically, but not all captured data provides actionable insights. Production optimization focuses on reducing noise while preserving debugging capability.</p>
<h3 id="disable-noisy-opentelemetry-instrumentations"><b>Disable Noisy OpenTelemetry Instrumentations</b></h3>
<p>File system operations, DNS lookups, and internal health checks generate high-volume, low-value trace data. Disabling these instrumentations reduces costs and improves signal-to-noise ratio without sacrificing debugging capability.</p>
<pre><code># Java - Disable verbose instrumentations
-Dotel.instrumentation.logback-appender.enabled=false
-Dotel.instrumentation.runtime-metrics.enabled=false
-Dotel.instrumentation.jdbc-datasource.enabled=false

// Node.js - Configure in SDK setup
instrumentations: [
  getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
    '@opentelemetry/instrumentation-dns': { enabled: false },
  })
]

# Python - Via environment variable
OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="logging,sqlite3"</code></pre>
<p><b>Which OpenTelemetry instrumentations should you disable?</b></p>
<table>
<tbody>
<tr>
<td><b>Instrumentation</b></td>
<td><b>Why Disable</b></td>
<td><b>When to Keep Enabled</b></td>
</tr>
<tr>
<td>Filesystem (fs)</td>
<td>Extremely noisy, rarely aids debugging</td>
<td>File-based workflow troubleshooting</td>
</tr>
<tr>
<td>DNS lookups</td>
<td>Low debugging value, high volume</td>
<td>DNS resolution performance issues</td>
</tr>
<tr>
<td>Internal HTTP calls</td>
<td>Health checks flood trace data</td>
<td>Internal service communication debugging</td>
</tr>
<tr>
<td>Logging appenders</td>
<td>Duplicates data already in logs</td>
<td>Log-trace correlation requirements</td>
</tr>
<tr>
<td>Runtime metrics</td>
<td>Better collected via metrics pipeline</td>
<td>No separate metrics system available</td>
</tr>
</tbody>
</table>
<h3 id="filter-health-check-endpoints-from-opentelemetry-traces"><b>Filter Health Check Endpoints from OpenTelemetry Traces</b></h3>
<p>Kubernetes liveness and readiness probes execute every few seconds. Without filtering, these health checks can account for 30-50% of your trace volume while providing zero debugging value.</p>
<pre><code>// Node.js - Filter health checks in HTTP instrumentation
'@opentelemetry/instrumentation-http': {
  ignoreIncomingRequestHook: (request) =&gt; {
    return request.url?.match(/^\/(health|metrics|ready|live)/) ?? false;
  }
}

// Java - System property for endpoint filtering
-Dotel.instrumentation.http.server.ignore-patterns="/health,/metrics,/ready,/live"</code></pre>
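<p>The same filtering logic can be sketched as a plain Python predicate that you could wire into an instrumentation hook. The <code>is_health_check</code> helper below is hypothetical, not part of the OpenTelemetry SDK; it mirrors the pattern used in the Node.js hook above:</p>

```python
import re

# Probe endpoints hit every few seconds; traces for them add noise, not insight.
# Prefix match mirrors the Node.js regex above (so /livez also matches).
_HEALTH_PATTERN = re.compile(r"^/(health|metrics|ready|live)")

def is_health_check(url_path: str) -> bool:
    """Return True when a request path is a probe endpoint whose trace we skip."""
    return bool(_HEALTH_PATTERN.match(url_path))
```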
<h2 id="opentelemetry-sampling-strategies-for-cost-control"><b>OpenTelemetry Sampling Strategies for Cost Control</b></h2>
<p>Sampling is the most effective lever for controlling distributed tracing costs. The right OpenTelemetry sampling strategy captures the data you need for debugging while reducing storage and processing costs by 80-95%.</p>
<h3 id="understanding-opentelemetry-sampling-types"><b>Understanding OpenTelemetry Sampling Types</b></h3>
<table>
<tbody>
<tr>
<td><b>Sampling Type</b></td>
<td><b>How It Works</b></td>
<td><b>Best Use Case</b></td>
</tr>
<tr>
<td>Head-based sampling</td>
<td>Decision made at trace start</td>
<td>Predictable costs, simple configuration</td>
</tr>
<tr>
<td>Tail-based sampling</td>
<td>Decision after trace completes</td>
<td>Capturing all errors and latency outliers</td>
</tr>
<tr>
<td>Parent-based sampling</td>
<td>Respects upstream sampling decision</td>
<td>Maintaining complete distributed traces</td>
</tr>
<tr>
<td>Rate limiting</td>
<td>Fixed number of traces per second</td>
<td>Protecting backend from traffic spikes</td>
</tr>
</tbody>
</table>
<h3 id="how-to-configure-opentelemetry-sampling-for-production"><b>How to Configure OpenTelemetry Sampling for Production</b></h3>
<p>Start with parent-based sampling that respects upstream decisions while applying your own ratio for new traces. This ensures trace completeness across service boundaries:</p>
<pre><code>// Java - Parent-based sampling with 10% ratio
-Dotel.traces.sampler=parentbased_traceidratio
-Dotel.traces.sampler.arg=0.1

# Python environment variables
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1</code></pre>
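<p>Conceptually, the decision a trace-ID ratio sampler makes can be sketched in a few lines of Python. This is a simplification (the real SDK samplers operate on a portion of the binary trace ID), but it shows why the verdict is deterministic: every service derives the same number from the same trace ID, so a trace is either kept everywhere or dropped everywhere:</p>

```python
# Simplified sketch of a trace-ID ratio sampling decision: interpret the
# 128-bit hex trace ID as a number and keep it if it falls below the ratio
# threshold. Deterministic, so all services agree on the same trace.
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    max_id = (1 << 128) - 1
    return int(trace_id_hex, 16) <= ratio * max_id

# A parent-based wrapper defers to the upstream decision when one exists,
# which is what keeps distributed traces complete across service boundaries.
def parent_based_sample(parent_sampled, trace_id_hex, ratio):
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id_hex, ratio)
```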
<h3 id="opentelemetry-sampling-rate-guidelines-by-environment"><b>OpenTelemetry Sampling Rate Guidelines by Environment</b></h3>
<table>
<tbody>
<tr>
<td><b>Environment</b></td>
<td><b>Recommended Rate</b></td>
<td><b>Rationale</b></td>
</tr>
<tr>
<td>Development</td>
<td>100%</td>
<td>Full visibility for debugging</td>
</tr>
<tr>
<td>Staging</td>
<td>50-100%</td>
<td>Catch issues before production</td>
</tr>
<tr>
<td>Production (low traffic)</td>
<td>25-50%</td>
<td>Balance cost and visibility</td>
</tr>
<tr>
<td>Production (high traffic)</td>
<td>1-10%</td>
<td>Cost control with representative sample</td>
</tr>
<tr>
<td>Critical paths (payments, auth)</td>
<td>100%</td>
<td>Never miss issues in core business logic</td>
</tr>
</tbody>
</table>
<p>For detailed sampling configuration options, see <a href="https://sematext.com/docs/tracing/sampling/">Sematext Tracing Sampling Documentation</a>.</p>
<h2 id="opentelemetry-context-propagation-best-practices"><b>OpenTelemetry Context Propagation Best Practices</b></h2>
<p>Context propagation transforms isolated spans into coherent distributed traces. Without proper propagation, you lose visibility into cross-service request flows—the primary value of distributed tracing.</p>
<figure id="attachment_70394" aria-describedby="caption-attachment-70394" style="width: 800px" class="wp-caption alignnone"><img decoding="async" class="wp-image-70394 size-full" src="https://sematext.com/wp-content/uploads/2026/02/context-propagation-diagram.png" alt="" width="800" height="356" srcset="https://sematext.com/wp-content/uploads/2026/02/context-propagation-diagram.png 800w, https://sematext.com/wp-content/uploads/2026/02/context-propagation-diagram-300x134.png 300w, https://sematext.com/wp-content/uploads/2026/02/context-propagation-diagram-768x342.png 768w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-70394" class="wp-caption-text">Figure 2: OpenTelemetry context propagation across microservices — trace ID flows via W3C traceparent headers</figcaption></figure>
<h3 id="choose-the-right-opentelemetry-propagators"><b>Choose the Right OpenTelemetry Propagators</b></h3>
<p>The <a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer">W3C Trace Context standard</a> is the recommended default for OpenTelemetry context propagation. However, you may need multiple propagators for compatibility with existing systems:</p>
<pre><code>// Configure multiple propagators for compatibility
-Dotel.propagators=tracecontext,baggage,b3multi</code></pre>
<h3 id="troubleshooting-opentelemetry-context-propagation-failures"><b>Troubleshooting OpenTelemetry Context Propagation Failures</b></h3>
<table>
<tbody>
<tr>
<td><b>Symptom</b></td>
<td><b>Likely Cause</b></td>
<td><b>Solution</b></td>
</tr>
<tr>
<td>Traces end at load balancer</td>
<td>Headers stripped by proxy</td>
<td>Configure LB to pass traceparent header</td>
</tr>
<tr>
<td>Missing spans after message queue</td>
<td>No context injection in producer</td>
<td>Add propagation.inject() to message headers</td>
</tr>
<tr>
<td>Duplicate root spans</td>
<td>Propagator mismatch between services</td>
<td>Align propagator configuration across services</td>
</tr>
<tr>
<td>Broken traces at API gateway</td>
<td>Gateway not participating in tracing</td>
<td>Add OpenTelemetry instrumentation to gateway</td>
</tr>
</tbody>
</table>
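<p>The "add propagation.inject() to message headers" fix from the table can be sketched as follows. Real code would use the SDK's propagator API; this hypothetical pair of helpers just shows the W3C <code>traceparent</code> format (<code>version-traceid-spanid-flags</code>) flowing through a message's headers:</p>

```python
import re

# Sketch of W3C Trace Context propagation across a message queue: the
# producer injects a traceparent header, the consumer extracts it and
# continues the same trace instead of starting a new root span.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    m = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not m:
        return None  # no valid context: the consumer starts a new root span
    trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}
```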
<h2 id="opentelemetry-span-attributes-and-cardinality-management"><b>OpenTelemetry Span Attributes and Cardinality Management</b></h2>
<p>Span attributes provide the context that makes distributed traces useful for debugging. However, unbounded or high-cardinality attributes can overwhelm your observability backend and dramatically increase costs.</p>
<h3 id="avoid-high-cardinality-span-attributes"><b>Avoid High-Cardinality Span Attributes</b></h3>
<p>High-cardinality attributes (those with many unique values) cause index explosion and query performance degradation. Never use these as span attributes without transformation:</p>
<table>
<tbody>
<tr>
<td><b>Attribute Type</b></td>
<td><b>Problem</b></td>
<td><b>Best Practice Alternative</b></td>
</tr>
<tr>
<td>User IDs</td>
<td>Millions of unique values</td>
<td>Use baggage for correlation, hash for attribute</td>
</tr>
<tr>
<td>Session IDs</td>
<td>New value per session</td>
<td>Hash or exclude entirely</td>
</tr>
<tr>
<td>Request body content</td>
<td>Unbounded size and uniqueness</td>
<td>Extract only specific, bounded fields</td>
</tr>
<tr>
<td>Full URLs with query params</td>
<td>Query parameters vary widely</td>
<td>Normalize URL path, exclude or hash params</td>
</tr>
</tbody>
</table>
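<p>The "hash for attribute" alternative from the table can be sketched as below. A short, stable digest lets you group and correlate spans per user without indexing millions of raw IDs; the helper name and truncation length are illustrative choices, and truncating the digest deliberately trades uniqueness for lower cardinality:</p>

```python
import hashlib

# Sketch of bounding a high-cardinality attribute: hash the raw user ID
# into a short, stable digest. Same input always yields the same digest,
# so spans for one user still correlate, but the value space stays small.
def bounded_user_attr(user_id: str, length: int = 8) -> str:
    return hashlib.sha256(user_id.encode()).hexdigest()[:length]
```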
<h3 id="configure-opentelemetry-span-attribute-limits"><b>Configure OpenTelemetry Span Attribute Limits</b></h3>
<p>Set explicit limits to prevent runaway attribute sizes from impacting performance and costs:</p>
<pre><code>// Java system properties for attribute limits
-Dotel.attribute.value.length.limit=4096
-Dotel.span.attribute.count.limit=128
-Dotel.span.event.count.limit=128
-Dotel.span.link.count.limit=128</code></pre>
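<p>What those limits enforce inside the SDK can be sketched as a simple transformation: oversized values are truncated and attributes beyond the count limit are dropped. This is an illustrative model only, not the SDK's actual implementation (which applies limits as attributes are set on a span):</p>

```python
# Sketch of span attribute limit enforcement: truncate string values that
# exceed the length limit, and drop attributes once the count limit is hit.
def apply_limits(attrs: dict, value_limit: int = 4096, count_limit: int = 128) -> dict:
    limited = {}
    for key, value in attrs.items():
        if len(limited) >= count_limit:
            break  # excess attributes are dropped
        if isinstance(value, str) and len(value) > value_limit:
            value = value[:value_limit]  # oversized values are truncated
        limited[key] = value
    return limited
```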
<h2 id="protecting-sensitive-data-in-opentelemetry-traces"><b>Protecting Sensitive Data in OpenTelemetry Traces</b></h2>
<p>Distributed tracing can inadvertently capture sensitive data including passwords, API keys, personal information, and financial data. Implement security safeguards before deploying OpenTelemetry to production.</p>
<h3 id="enable-sql-query-sanitization-in-opentelemetry"><b>Enable SQL Query Sanitization in OpenTelemetry</b></h3>
<p>Auto-instrumentation captures SQL statements by default. Enable sanitization to replace sensitive parameter values with placeholders:</p>
<pre><code>// Java - Enable SQL query sanitization
-Dotel.instrumentation.jdbc.statement-sanitizer.enabled=true
-Dotel.instrumentation.common.db-statement-sanitizer.enabled=true

// Result transformation:
// Before: SELECT * FROM users WHERE email = 'user@example.com'
// After:  SELECT * FROM users WHERE email = ?</code></pre>
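<p>The transformation those sanitizers perform can be sketched with a naive regex version. Production sanitizers are tokenizer-based and handle many more SQL dialect quirks; this hypothetical <code>sanitize_sql</code> helper just shows the idea of replacing literals with placeholders before the statement is attached to a span:</p>

```python
import re

# Naive sketch of SQL statement sanitization: replace string and numeric
# literals with "?" placeholders so sensitive values never reach the trace.
def sanitize_sql(sql: str) -> str:
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)     # string literals (incl. '' escapes)
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)  # numeric literals
    return sql
```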
<h2 id="implementing-opentelemetry-best-practices-with-sematext-tracing"><b>Implementing OpenTelemetry Best Practices with Sematext Tracing</b></h2>
<p><a href="https://sematext.com/docs/tracing/">Sematext Tracing</a> provides a production-ready backend for OpenTelemetry traces with powerful analysis capabilities designed to support these best practices.</p>
<p><b>Getting started with Sematext Tracing:</b></p>
<ol>
<li aria-level="1"><a href="https://sematext.com/docs/tracing/create-tracing-app/">Create a Tracing App</a> in Sematext Cloud</li>
<li aria-level="1">Configure your OpenTelemetry SDK to export to the Sematext Agent</li>
<li aria-level="1">Check the <a href="https://sematext.com/docs/tracing/reports/overview/">Traces Overview</a> to understand how your application is performing</li>
<li aria-level="1">Use the <a href="https://sematext.com/docs/tracing/reports/explorer/">Traces Explorer</a> to search and analyze distributed traces</li>
<li aria-level="1">Examine individual requests with <a href="https://sematext.com/docs/tracing/reports/trace-details/">Trace Details</a> for root cause analysis</li>
</ol>
<p><b>Sematext features supporting OpenTelemetry best practices:</b></p>
<ul>
<li aria-level="1"><a href="https://sematext.com/docs/tracing/sampling/">Flexible sampling configuration</a></li>
<li aria-level="1"><a href="https://sematext.com/docs/tracing/cost-optimization/">Cost optimization tools</a> for managing trace volume and storage costs</li>
<li aria-level="1"><a href="https://sematext.com/docs/tracing/sdks/">Native support for all major OpenTelemetry SDKs</a> (Java, Python, Node.js, Go, .NET, Ruby)</li>
<li aria-level="1">Latency analysis with P50, P95, and P99 percentiles</li>
<li aria-level="1">Error tracking with exception details, stack traces, and error rate trends</li>
</ul>
<h2 id="conclusion-building-production-ready-opentelemetry-instrumentation"><b>Conclusion: Building Production-Ready OpenTelemetry Instrumentation</b></h2>
<p>Effective OpenTelemetry instrumentation requires balancing observability coverage with operational constraints. The best practices in this guide help you achieve that balance:</p>
<ol>
<li aria-level="1"><b>Start with auto-instrumentation</b> for immediate visibility, then iteratively optimize</li>
<li aria-level="1"><b>Disable noisy instrumentations</b> that generate low-value trace data</li>
<li aria-level="1"><b>Implement intelligent sampling</b> to control costs while capturing errors and anomalies</li>
<li aria-level="1"><b>Ensure proper context propagation</b> across all service boundaries and async operations</li>
<li aria-level="1"><b>Manage attribute cardinality</b> to prevent index explosion and cost overruns</li>
<li aria-level="1"><b>Protect sensitive data</b> with SQL sanitization and PII redaction</li>
<li aria-level="1"><b>Monitor instrumentation overhead</b> to detect performance impacts early</li>
</ol>
<p>The investment in proper OpenTelemetry instrumentation configuration pays off through faster incident resolution, lower observability costs, and deeper insights into distributed system behavior.</p>
<h2 id="frequently-asked-questions"><b>Frequently Asked Questions</b></h2>
<h3 id="what-is-the-recommended-opentelemetry-sampling-rate-for-production"><b>What is the recommended OpenTelemetry sampling rate for production?</b></h3>
<p>For high-traffic production environments (&gt;1000 requests/second), start with 1-10% sampling using parent-based sampling to maintain trace completeness. Always configure 100% sampling for error traces and critical business paths like payment processing. Low-traffic services can use 25-50% sampling for better visibility.</p>
<h3 id="how-do-i-reduce-opentelemetry-tracing-costs"><b>How do I reduce OpenTelemetry tracing costs?</b></h3>
<p>The most effective cost reduction strategies are: (1) implement intelligent sampling at 1-10% for high-traffic services, (2) disable noisy instrumentations like filesystem and DNS operations, (3) filter health check endpoints, (4) set attribute limits to prevent unbounded span sizes, and (5) use tail-based sampling to capture only interesting traces while dropping routine ones.</p>
<h3 id="why-are-my-distributed-traces-incomplete-or-broken"><b>Why are my distributed traces incomplete or broken?</b></h3>
<p>Incomplete traces are usually caused by context propagation failures. Common causes include: load balancers stripping trace headers, mismatched propagator configurations between services, missing context injection in message queue producers, and async operations that don’t properly bind context. Enable debug logging and verify the traceparent header flows through all service boundaries.</p>
<h3 id="what-opentelemetry-span-attributes-should-i-avoid"><b>What OpenTelemetry span attributes should I avoid?</b></h3>
<p>Avoid high-cardinality attributes that have many unique values: user IDs, session IDs, full request bodies, URLs with query parameters, and timestamps as strings. These cause index explosion and dramatically increase storage costs. Instead, use bounded attributes like user tier, region, or hashed identifiers.</p>
<h3 id="how-much-performance-overhead-does-opentelemetry-add"><b>How much performance overhead does OpenTelemetry add?</b></h3>
<p>Properly configured OpenTelemetry auto-instrumentation typically adds 2-5% CPU overhead. Performance issues usually stem from: synchronous span export (use batch processor instead), creating spans in tight loops, unbounded attribute sizes, or insufficient batch processor queue sizes for traffic volume.</p>
<h3 id="how-do-i-protect-sensitive-data-in-opentelemetry-traces"><b>How do I protect sensitive data in OpenTelemetry traces?</b></h3>
<p>Enable SQL query sanitization to replace parameter values with placeholders. Filter sensitive HTTP headers (authorization, cookies, API keys) from capture. Implement custom span processors to detect and redact PII patterns like emails, SSNs, and credit card numbers before export.</p>
<h2 id="related-resources"><b>Related Resources</b></h2>
<ul>
<li aria-level="1"><a href="https://sematext.com/blog/how-to-implement-distributed-tracing-in-microservices-with-opentelemetry-auto-instrumentation/">How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation</a></li>
<li aria-level="1"><a href="https://sematext.com/docs/tracing/">Sematext Tracing Documentation</a></li>
<li aria-level="1"><a href="https://opentelemetry.io/docs/specs/semconv/" target="_blank" rel="noopener noreferrer">OpenTelemetry Semantic Conventions</a></li>
<li aria-level="1"><a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer">W3C Trace Context Specification</a></li>
<li aria-level="1"><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer">OpenTelemetry Collector Configuration</a></li>
</ul>
<p class="space-top"><a href="https://apps.sematext.com/ui/registration" class="button-big" target="_blank" rel="noopener noreferrer">Start Free Trial</a></p><hr class="hidden"><p>The post <a href="https://sematext.com/blog/opentelemetry-instrumentation-best-practices-for-microservices-observability/">OpenTelemetry Instrumentation Best Practices for Microservices Observability</a> appeared first on <a href="https://sematext.com">Sematext</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
