The Observability Tax: Why Your Monitoring Costs More Than Your App
Observability spending is growing faster than the infrastructure it watches. Teams ship dashboards nobody reads, alerts nobody trusts, and invoices that make the CFO ask hard questions. There is a better architecture -- and it starts with admitting that most of what you collect is waste.
We run 22 production systems. Our combined observability bill is less than a single engineer's monthly coffee budget. Not because we are negligent. Because we are deliberate. This article explains how we got there, and why you might want to follow.
I. The Invoice Nobody Expected
Somewhere around 2023, a pattern emerged across our consulting engagements. Teams that spent years optimizing cloud compute -- rightsizing instances, moving to ARM, adopting serverless -- would open their monthly bill and find a new line item quietly eating their savings.
Datadog. New Relic. Splunk. Elastic Cloud. Dynatrace.
The observability vendor had become the second or third largest line item. In some cases, the largest. A fintech client running 40 microservices on AWS spent $8,200/month on EKS and Fargate. Their Datadog bill was $14,600/month. The tool watching the infrastructure cost 78% more than the infrastructure itself.
This is not an anomaly. This is the market.
Datadog's annual revenue crossed $2.1 billion in 2024. New Relic, Splunk (now Cisco), Elastic, and Dynatrace collectively represent a market north of $40 billion. That money comes from somewhere. It comes from you.
The question is not whether observability matters. It does. The question is whether your observability architecture is proportional to the decisions it enables. For most teams, the answer is no.
II. The Pricing Trap: How Vendors Extract Value
Understanding the cost requires understanding the pricing models. They are not designed for transparency. They are designed for lock-in and surprise.
Per-host pricing. Datadog charges per host per month. Infrastructure monitoring starts around $15/host/month, but APM pushes that to $31-40/host. A 100-node Kubernetes cluster is $3,100-4,000/month before you enable a single custom metric.
Per-GB ingestion. Log management charges by volume. Datadog charges roughly $0.10/GB ingested for the first tier, with retention costs on top. A moderately chatty application producing 500GB/month of logs is $50/month just for ingestion -- before indexing, alerting, or retention.
Per-custom-metric. This is where budgets die. Datadog's base plan includes 100 custom metrics. Each additional custom metric costs roughly $0.05/month. That sounds cheap until a developer instruments a service with Prometheus and accidentally creates 50,000 custom metrics from high-cardinality labels. That is $2,500/month from a single deployment.
Per-span pricing for tracing. Ingesting 1 billion spans/month at $0.0000002/span does not sound expensive. Until you realize a single API request through 8 microservices produces 8+ spans, and at 10,000 requests/second you are generating roughly 6.9 billion spans per day -- over 200 billion per month. The math gets real fast.
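A quick back-of-the-envelope sketch of that span math, using the illustrative traffic numbers above and the quoted per-span rate:

```typescript
// Back-of-the-envelope span volume and cost. All numbers are illustrative.
const requestsPerSecond = 10_000;
const spansPerRequest = 8;        // one API request fanning out through 8 services
const pricePerSpan = 0.0000002;   // $/span, the rate quoted above

const spansPerDay = requestsPerSecond * spansPerRequest * 86_400; // ~6.9 billion
const spansPerMonth = spansPerDay * 30;                           // ~207 billion
const fullFidelityCost = spansPerMonth * pricePerSpan;            // ~$41,500/month

console.log({ spansPerDay, spansPerMonth, fullFidelityCost });
```

Full-fidelity ingestion at that volume runs to five figures a month, which is why the cost model below already assumes a much smaller, heavily sampled span count.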
Cost model for a mid-size SaaS (100 hosts, 50 services):
Infrastructure monitoring: 100 hosts x $23/mo = $ 2,300
APM (50 services): 50 hosts x $40/mo = $ 2,000
Log management (2TB/mo): 2,000 GB x $0.10 = $ 200
Log indexing + retention: = $ 1,800
Custom metrics (15,000): 15,000 x $0.05 = $ 750
Trace ingestion (5B spans): 5B x $0.0000002 = $ 1,000
Synthetics (500 tests): 500 x $12/mo = $ 6,000
RUM (1M sessions): = $ 1,500
─────────────────────────────────────────────────────────
Monthly total: ~$15,550
Annual total: ~$186,600
That is two senior engineers. Or the entire cloud bill for a well-architected system at the same scale. And it grows with every host you add, every metric you create, every log line you emit.
III. Fear-Driven Monitoring and the Dashboard Graveyard
The cost would be defensible if every dollar drove a decision. It does not.
Most observability setups are products of fear. An incident happens. Someone says "we need more visibility." A dashboard gets built. An alert gets created. Nobody deletes the dashboard six months later when the underlying issue has been fixed. Nobody reviews whether the alert ever fired. The monitoring surface grows monotonically.
We have a name for this: the dashboard graveyard. Every organization has one. Rows of Grafana panels nobody has opened in months. Datadog dashboards created by engineers who left the company two years ago. PagerDuty services with alert rules that reference infrastructure that no longer exists.
The cost is not just financial. Alert fatigue is real and measurable.
A 2023 study by PagerDuty's own research team found that the median on-call engineer receives 25+ alerts per shift. Of those, fewer than 3 are actionable. The rest are noise -- threshold breaches on metrics that do not correlate with user impact, transient spikes that self-resolve, and cascading alerts from a single root cause.
The 90/10 rule: 90% of alerts are noise. 10% are actionable. And the noise makes the signal harder to find.
Alert taxonomy (based on audit of 4 client systems, 2024-2025):
Actionable + urgent: 8% -- Real incidents, need human response
Actionable + deferred: 4% -- Real issues, can wait for business hours
Informational: 22% -- Interesting but requires no action
Noise (self-resolving): 41% -- Transient spikes, auto-scaling events
Stale (dead references): 25% -- Alerts on decommissioned services
If a quarter of your alerts reference infrastructure that no longer exists, your observability is not providing visibility. It is providing theater.
IV. The Cargo-Cult Instrumentation Problem
There is a pattern in how teams adopt observability that mirrors how they adopt testing: they start with good intentions and end with coverage metrics that measure effort, not value.
The cargo cult works like this:
- Read a blog post about how Uber instruments their microservices.
- Install the Datadog agent on everything.
- Enable APM on every service.
- Add custom metrics for every counter, gauge, and histogram you can think of.
- Build dashboards that look impressive in sprint demos.
- Never delete anything.
The result is comprehensive instrumentation of a system you do not understand any better than you did before. You have more data. You do not have more insight.
The distinction matters. Data is what your systems emit. Insight is what changes your behavior. A metric that nobody looks at is not observability. It is an accounts payable entry.
Consider the alternative: instead of instrumenting everything and hoping patterns emerge, you start by asking what decisions you need to make and work backward to the minimum data required to make them.
Decision: Is the API healthy from the user's perspective? Minimum data: Request latency P50/P95/P99, error rate, and availability -- three metrics per endpoint.
Decision: Is the database approaching capacity limits? Minimum data: Connection pool utilization, query latency P95, disk usage percentage -- three metrics.
Decision: Did the last deployment cause a regression? Minimum data: Error rate delta and latency delta compared to the previous 24-hour window -- two metrics.
You do not need 15,000 custom metrics to answer these questions. You need fewer than 100.
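To make that concrete, here is a sketch of the first decision -- API health -- instrumented with the OpenTelemetry metrics API. The instrument names and attributes are our illustrative choices, not a mandated convention, and the meter provider and exporter wiring are omitted:

```typescript
import { metrics } from "@opentelemetry/api";

// Three instruments answer "is the API healthy from the user's perspective?"
const meter = metrics.getMeter("api-health");

const requestDuration = meter.createHistogram("http.server.duration_ms", {
  description: "Request latency in ms; P50/P95/P99 are computed by the backend",
  unit: "ms",
});
const requestCount = meter.createCounter("http.server.requests", {
  description: "Total requests; drives traffic and availability",
});
const requestErrors = meter.createCounter("http.server.errors", {
  description: "Requests that returned 5xx; drives error rate",
});

// Call once per request, e.g. from middleware. Attributes stay low-cardinality:
// route template and status class, never raw URLs or user IDs.
export function recordRequest(route: string, statusCode: number, durationMs: number) {
  const attrs = { route, status_class: `${Math.floor(statusCode / 100)}xx` };
  requestDuration.record(durationMs, attrs);
  requestCount.add(1, attrs);
  if (statusCode >= 500) requestErrors.add(1, attrs);
}
```

A handful of instruments with low-cardinality attributes covers the decision. The metric explosion described earlier comes from attaching unbounded labels, not from the decision itself.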
V. The OpenTelemetry Standard: Your Exit Strategy
OpenTelemetry is the most important infrastructure project most teams are ignoring. It is a vendor-neutral, open-source observability framework that standardizes how telemetry data -- traces, metrics, and logs -- is collected, processed, and exported.
Why it matters: it decouples instrumentation from destination. You instrument your code once with OpenTelemetry SDKs, and you can send that data to any backend. Datadog today, Grafana Cloud tomorrow, self-hosted ClickHouse next quarter. No re-instrumentation. No vendor lock-in.
The OpenTelemetry Collector is the architectural linchpin. It sits between your applications and your backends, handling collection, processing, filtering, and routing.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192

  # Drop metrics you don't need before they hit storage
  filter/metrics:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - "http.server.request.body.size"
          - "runtime.cpython.*"
          - "process.runtime.*"

  # Sample traces intelligently -- keep errors, sample successes
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-normal
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

  # Add deployment context to every signal
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: service.version
        from_attribute: DEPLOY_SHA
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo.internal:4317
    tls:
      insecure: false
  prometheusremotewrite:
    endpoint: http://mimir.internal/api/v1/push
  loki:
    endpoint: http://loki.internal/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [filter/metrics, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
The key insight is in the processors section. Before a single byte hits your storage backend, the collector filters out metrics you do not need, samples traces intelligently (keeping errors and slow requests while sampling routine traffic at 5%), and enriches everything with deployment context.
That filtering stage is where you reclaim 60-80% of your observability budget.
VI. Structured Logging That Actually Works
Most logging is narrative. A developer writes logger.info("Processing order for customer") and moves on. That log line is readable by humans and useless to machines. You cannot aggregate it, filter it, or alert on it without regex gymnastics.
Structured logging inverts this. Every log entry is a typed event with named fields. The narrative is still there -- in the message field -- but the actionable data lives in structured attributes.
// Bad: narrative logging
logger.info(`Processing order ${orderId} for customer ${customerId}`);
logger.error(`Payment failed for order ${orderId}: ${error.message}`);
// Good: structured logging with OpenTelemetry semantic conventions
logger.info("order.processing", {
order_id: orderId,
customer_id: customerId,
currency: order.currency,
amount_cents: order.totalCents,
items_count: order.items.length,
});
logger.error("payment.failed", {
order_id: orderId,
customer_id: customerId,
payment_provider: "stripe",
error_code: error.code,
error_category: categorizePaymentError(error),
retry_eligible: isRetryable(error),
attempt_number: attemptCount,
});
The structured version costs the same to emit but is infinitely more useful downstream. You can query all failed payments by provider. You can calculate retry-eligible error rates. You can build alerts on error_category instead of regex-matching error messages that change with every refactor.
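To illustrate the downstream payoff, here is a toy aggregation over a batch of those payment.failed events. The event shape mirrors the example above; asking the same question of narrative logs means parsing free text that changes with every refactor:

```typescript
// Toy example: retry-eligible failure rate per payment provider,
// computed from structured payment.failed events.
interface PaymentFailedEvent {
  payment_provider: string;
  error_category: string;
  retry_eligible: boolean;
}

function retryEligibleRateByProvider(events: PaymentFailedEvent[]): Record<string, number> {
  const totals: Record<string, { failed: number; retryable: number }> = {};
  for (const e of events) {
    const bucket = (totals[e.payment_provider] ??= { failed: 0, retryable: 0 });
    bucket.failed += 1;
    if (e.retry_eligible) bucket.retryable += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([provider, t]) => [provider, t.retryable / t.failed])
  );
}
```

In practice the aggregation runs in your log backend rather than in application code, but the principle is the same: named fields make the question cheap to ask.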
A Log Level Taxonomy That Means Something
Most teams treat log levels as vibes. Here is a taxonomy that ties each level to an operational response:
| Level | Meaning | Operational response | Retention |
|---|---|---|---|
| FATAL | Process cannot continue. Data may be at risk. | Immediate page. All hands. | Forever |
| ERROR | Operation failed. User impact confirmed. | Page on-call. Investigate within SLO. | 90 days |
| WARN | Degraded but functional. Approaching a limit. | Create ticket. Review next business day. | 30 days |
| INFO | Normal operation milestones. State transitions. | No action. Available for investigation. | 14 days |
| DEBUG | Detailed flow for active troubleshooting. | No action. Enable on demand. | 3 days |
The retention column is the cost lever. If DEBUG logs represent 60% of your volume and you retain them for 3 days instead of 30, you just cut storage costs by 54%. If you disable DEBUG in production entirely and enable it per-service when investigating, the savings are larger.
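Here is a minimal sketch of that lever. The logger shape is hypothetical, mirroring the event-plus-fields style above; the point is that the threshold comes from an environment variable, so you can flip one service to DEBUG during an investigation and leave it off everywhere else:

```typescript
// Hypothetical structured logger with a per-service level threshold.
// Set LOG_LEVEL=debug on one service to enable DEBUG there -- nowhere else.
const LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type Level = (typeof LEVELS)[number];

const threshold = (process.env.LOG_LEVEL ?? "info") as Level;

export function log(level: Level, event: string, fields: Record<string, unknown> = {}) {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(threshold)) return; // dropped before it costs anything
  process.stdout.write(
    JSON.stringify({ ts: new Date().toISOString(), level, event, ...fields }) + "\n"
  );
}

// Cheap in production (INFO and above), verbose only where you ask for it.
log("debug", "cache.lookup", { key: "user:42", hit: false }); // dropped unless LOG_LEVEL=debug
log("info", "order.processing", { order_id: "ord_123" });
```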
VII. Edge Telemetry: The Cheapest Observability You Are Not Using
If you deploy to Cloudflare Workers, Vercel Edge Functions, or any CDN with analytics capabilities, you already have a telemetry layer that most teams ignore.
Cloudflare Workers Analytics Engine provides per-request metrics at the edge -- latency, status codes, cache hit ratios, geographic distribution -- at no additional cost beyond your Workers plan. No agent to install. No SDK to integrate. No per-metric pricing.
// Cloudflare Worker: lightweight telemetry at the edge
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const start = Date.now();
const url = new URL(request.url);
try {
const response = await handleRequest(request, env);
const duration = Date.now() - start;
// Write to Analytics Engine -- included in Workers plan
env.TELEMETRY.writeDataPoint({
blobs: [
url.pathname, // index 1: route
response.status.toString(), // index 2: status
request.headers.get("cf-ipcountry") || "XX", // index 3: country
],
doubles: [
duration, // index 1: latency_ms
response.headers.get("content-length")
? parseInt(response.headers.get("content-length")!)
: 0, // index 2: response_bytes
],
indexes: [url.pathname], // queryable index
});
return response;
} catch (err) {
env.TELEMETRY.writeDataPoint({
blobs: [url.pathname, "500", "error"],
doubles: [Date.now() - start, 0],
indexes: [url.pathname],
});
throw err;
}
},
};
You can query this data with SQL through the Cloudflare API or GraphQL. P50/P95 latency by route, error rates by country, traffic patterns by hour. The data you need for 80% of operational decisions, collected at the point closest to the user, for zero marginal cost.
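As a sketch, here is how you might pull request counts and average latency per route for the last hour through the Analytics Engine SQL API. The account ID, API token, and dataset name are placeholders; the dataset is whatever you bound to TELEMETRY in wrangler.toml:

```typescript
// Query Workers Analytics Engine over its SQL API.
// blob1 = route and double1 = latency_ms, matching the writeDataPoint() call above.
const ACCOUNT_ID = process.env.CF_ACCOUNT_ID!;
const API_TOKEN = process.env.CF_API_TOKEN!;

const sql = `
  SELECT
    blob1 AS route,
    sum(_sample_interval) AS requests,
    sum(double1 * _sample_interval) / sum(_sample_interval) AS avg_latency_ms
  FROM worker_telemetry
  WHERE timestamp > NOW() - INTERVAL '1' HOUR
  GROUP BY blob1
  ORDER BY requests DESC
`;

const res = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/analytics_engine/sql`,
  { method: "POST", headers: { Authorization: `Bearer ${API_TOKEN}` }, body: sql }
);
console.log(await res.json());
```

The sum(_sample_interval) idiom accounts for Analytics Engine's adaptive sampling, so the counts stay honest even at high traffic.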
This is the observability inversion: instead of pulling metrics from inside your system outward, you capture them at the boundary where users actually experience your service.
VIII. The Architecture: Cheap Storage, Smart Queries
The commercial observability platforms bundle three things: collection, storage, and querying. They charge premium prices because the bundle is convenient. But each component has a commodity alternative that costs 5-20x less.
The Observability Tax architecture vs. the alternative:
  VENDOR BUNDLE                   UNBUNDLED ARCHITECTURE
┌──────────────────────┐      ┌─────────────────────────────────┐
│  App + Agent         │      │  App + OTel SDK                 │
│        │             │      │        │                        │
│        v             │      │        v                        │
│  Vendor Collector    │      │  OTel Collector (self-hosted)   │
│        │             │      │     │         │       │         │
│        v             │      │     v         v       v         │
│  Vendor Storage      │      │  ClickHouse  S3/R2    Loki      │
│    ($$$$/GB)         │      │  (metrics+   (cold    (logs)    │
│        │             │      │   traces)    archive)           │
│        v             │      │     │         │       │         │
│  Vendor UI           │      │     └─────────┴───────┘         │
│    (included)        │      │               │                 │
│                      │      │               v                 │
│                      │      │            Grafana              │
│                      │      │     (dashboards + alerts)       │
│                      │      │                                 │
│  Cost: ~$15K/mo      │      │  Cost: ~$1.5K-3K/mo             │
└──────────────────────┘      └─────────────────────────────────┘
The components:
ClickHouse for metrics and trace storage. Column-oriented, compression ratios of 10-20x on structured telemetry, query performance that rivals purpose-built TSDB systems. ClickHouse Cloud pricing starts around $0.03/GB for compressed storage -- roughly 10-30x cheaper than commercial APM storage when compression is factored in. A query sketch follows below.
S3/R2/GCS for cold archive. After 7-14 days, move raw telemetry to object storage. Cloudflare R2 charges $0.015/GB/month with zero egress fees. Query on demand with Athena, DuckDB, or ClickHouse external tables.
Grafana for visualization and alerting. Grafana itself is open source. Grafana Cloud's free tier handles 10,000 metrics, 50GB logs, and 50GB traces per month. The paid tier starts at roughly $0.50/1000 metric series -- still a fraction of commercial APM pricing.
Loki for log aggregation. Designed by the Grafana team as a "Prometheus for logs." Does not index log content -- only labels. This makes it dramatically cheaper to operate than Elasticsearch/Splunk for log storage.
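Before the cost comparison, here is an illustrative sketch of the "smart queries" half against the ClickHouse component above, using the @clickhouse/client package. The host, table name, and columns are assumptions -- use whatever schema your collector's exporter actually writes:

```typescript
import { createClient } from "@clickhouse/client";

// Hypothetical spans table; adjust names to your exporter's schema.
const client = createClient({ url: "http://clickhouse.internal:8123" });

const result = await client.query({
  query: `
    SELECT
      service_name,
      quantile(0.95)(duration_ms) AS p95_ms,
      countIf(status_code = 'ERROR') / count() AS error_rate
    FROM otel_spans
    WHERE timestamp > now() - INTERVAL 1 HOUR
    GROUP BY service_name
    ORDER BY p95_ms DESC
  `,
  format: "JSONEachRow",
});
console.log(await result.json());
```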
A real cost comparison
For the same mid-size SaaS from Section II (100 hosts, 50 services):
Commercial APM (Datadog-class): ~$186,600/year
Unbundled architecture:
ClickHouse Cloud (metrics+traces): $ 7,200/year
R2 cold storage (24 months): $ 2,160/year
Grafana Cloud (Pro): $ 6,000/year
Loki (self-hosted on 2 nodes): $ 3,600/year
OTel Collector (3 nodes): $ 2,400/year
──────────────────────────────────────────────────
Total: ~ $21,360/year
Annual savings: ~$165,240
Savings percentage: ~89%
The trade-off is operational complexity. You run the collector. You manage ClickHouse (or pay for ClickHouse Cloud). You configure Grafana dashboards. This requires a team that understands the components. It is not zero effort.
But it is honest effort. You are paying for engineering time instead of vendor margin. And the engineers who run this stack understand their observability deeply -- because they built it. That understanding is worth more than any vendor dashboard.
IX. When Commercial Platforms Are the Right Answer
This article is not a polemic against commercial observability. There are genuine cases where paying the tax is rational:
Early-stage startups with zero ops capacity. If you have 5 engineers and no dedicated ops person, Datadog's agent-install-and-go model saves time you cannot afford to spend. The cost is real but the time cost of self-hosting is higher. Revisit when you hit 50 engineers or $10K/month in observability spend.
Genuinely complex distributed systems. If you run 200+ microservices across multiple clouds with polyglot runtimes, the correlation engine in Datadog or New Relic provides value that is hard to replicate with open-source tooling. The service map, the automatic trace correlation, the anomaly detection -- these features justify a premium when the alternative is a team of three spending their entire quarter building equivalents.
Compliance environments with audit requirements. Some regulated industries require specific retention policies, access controls, and audit trails that commercial platforms provide out of the box. Building SOC 2-compliant log infrastructure from scratch is possible but expensive in a different currency.
Teams with no interest in becoming observability experts. This is valid. Not every team should build their own monitoring stack. If observability is not a core competency and you would rather spend engineering time on product features, the vendor tax is a reasonable trade. Just know you are paying it.
The test is simple: divide your annual observability spend by the number of incidents it helped you resolve. If the cost per incident exceeds the cost of the incident itself, you are over-instrumented.
X. Gothar's Approach: 22 Systems, Minimal Tax
We practice what we preach. Here is how we observe 22 production systems across Cloudflare Workers, AWS, and bare-metal infrastructure:
Layer 1: Edge telemetry. Every Cloudflare Worker writes to Analytics Engine. Route-level latency, status codes, cache performance. Zero marginal cost. This covers 70% of our "is the system healthy?" questions.
Layer 2: Structured logging with context propagation. Every service emits structured JSON logs with trace IDs, request IDs, and deployment SHAs. Logs ship to Loki via the OTel Collector. Retention: 14 days hot, 90 days cold on R2.
Layer 3: Four golden signals per service. Latency (P50/P95/P99), traffic (requests/sec), errors (rate), and saturation (resource utilization). Exported as Prometheus metrics via the OTel SDK. Stored in a lightweight Mimir instance. No custom metrics beyond these four categories.
Layer 4: Traces on demand. We do not trace every request. We trace errors, slow requests (>2s), and a 5% sample of normal traffic. Stored in Tempo. Queried through Grafana when investigating specific incidents.
Layer 5: Synthetic canaries. Simple HTTP checks against critical endpoints, every 60 seconds, from three geographic regions. Cloudflare Health Checks plus a handful of custom Workers. Alerts go to a single Slack channel with an on-call rotation.
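Layer 5 needs almost no machinery. Here is a minimal sketch of one such canary as a scheduled Worker -- the endpoint list and Slack webhook are placeholders, and the 60-second cadence comes from a cron trigger in wrangler.toml:

```typescript
// Minimal synthetic canary: a scheduled Cloudflare Worker that checks endpoints
// and posts failures to Slack. Endpoints and webhook URL are placeholders.
const ENDPOINTS = [
  "https://api.example.com/healthz",
  "https://app.example.com/healthz",
];

export default {
  async scheduled(_controller: ScheduledController, env: { SLACK_WEBHOOK_URL: string }) {
    for (const url of ENDPOINTS) {
      const start = Date.now();
      let failure: string | null = null;
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
        if (!res.ok) failure = `status ${res.status}`;
      } catch (err) {
        failure = err instanceof Error ? err.message : "request failed";
      }
      if (failure) {
        await fetch(env.SLACK_WEBHOOK_URL, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            text: `Canary failed: ${url} (${failure}) after ${Date.now() - start}ms`,
          }),
        });
      }
    }
  },
};
```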
Total observability spend across 22 systems: approximately $340/month. That is not a typo. Edge telemetry is free. Loki, Mimir, and Tempo run on two modest VMs ($85/month each). R2 storage is negligible. Grafana Cloud free tier handles our dashboard and alerting needs.
The constraint that makes this work: we decided what questions we need to answer before we decided what to collect. Discipline at the design stage eliminates waste at the billing stage.
XI. The Practical Migration Path
If you are currently paying the observability tax and want to reduce it, do not rip and replace. Migrate incrementally:
Month 1: Audit. Inventory every dashboard, alert, and metric. For each, answer: when was this last viewed? Did it drive a decision? Delete everything that fails both tests. Most teams can eliminate 40-60% of their observability surface in this step alone, with zero risk.
Month 2: Instrument with OpenTelemetry. Add the OTel SDK alongside your existing vendor agent. Dual-ship telemetry to both your current vendor and a self-hosted collector. This gives you a comparison baseline with zero disruption.
Month 3: Stand up cheap storage. Deploy ClickHouse Cloud or a self-hosted instance, Loki for logs, and Grafana for dashboards. Route the OTel Collector's output to both backends. Verify that your self-hosted stack answers the same questions as the commercial platform.
Month 4: Cut over. Disable the vendor agent on non-critical services first. Monitor for gaps. Expand to all services over 2-4 weeks. Keep the vendor contract active for one more month as insurance.
Month 5: Optimize. Now that you control the pipeline, implement aggressive filtering and sampling in the OTel Collector. Drop metrics nobody queries. Sample routine traces. Shorten retention on debug logs. This is where the 80-90% cost reduction materializes.
XII. What Observability Should Be
Observability exists to inform decisions. Not to provide comfort. Not to generate dashboards. Not to fill a SOC 2 checkbox. To inform decisions.
Every metric should have an owner and a response plan. Every alert should have a runbook. Every dashboard should answer a question that someone actually asks. Everything else is cost without value.
The ancient Greek physician Hippocrates is credited with the principle: "First, do no harm." The observability equivalent is: first, do not waste. Do not collect data you will not query. Do not alert on conditions you will not act on. Do not retain logs beyond the window in which they are useful.
There is a deeper parallel here. In Marcus Aurelius's Meditations, he returns repeatedly to the idea of distinguishing between what is in your control and what is not. The Stoic discipline is to focus attention and energy exclusively on the former. Observability, practiced well, is a Stoic discipline. You cannot prevent every incident. You can control what you measure, what you alert on, and how you respond. The rest is noise -- and noise, in observability as in philosophy, is the enemy of clarity.
The best monitoring system is the one you understand completely, that costs proportionally to the value it provides, and that you could rebuild from scratch in a week. If your observability vendor disappeared tomorrow, would you be lost -- or liberated?
That question deserves an honest answer.
References
- Datadog, Inc. "Pricing." datadoghq.com/pricing
- OpenTelemetry Project. "OpenTelemetry Collector Documentation." opentelemetry.io/docs/collector
- ClickHouse, Inc. "ClickHouse Cloud Pricing." clickhouse.com/pricing
- Grafana Labs. "Grafana Loki Documentation." grafana.com/docs/loki
- Cloudflare. "Workers Analytics Engine." developers.cloudflare.com/analytics/analytics-engine
- Charity Majors, Liz Fong-Jones, George Miranda. Observability Engineering. O'Reilly Media, 2022.
- Google SRE Team. "Monitoring Distributed Systems." sre.google/sre-book/monitoring-distributed-systems
- Cindy Sridharan. Distributed Systems Observability. O'Reilly Media, 2018.
- PagerDuty. "State of Digital Operations Report 2023." pagerduty.com/resources
- Cloudflare. "R2 Pricing." developers.cloudflare.com/r2/pricing
- Grafana Labs. "Grafana Tempo Documentation." grafana.com/docs/tempo
- New Relic, Inc. "New Relic Pricing." newrelic.com/pricing