
Cost Observability

First published by Atif Alam

Cost observability is about seeing where your money goes—so you can control spend, right-size resources, and tie cost to usage and reliability.

It complements the rest of observability (latency, errors, throughput, infrastructure metrics) by adding a cost dimension to the same systems and workloads.

This page covers cloud cost visibility, unit economics (cost per request, cost per tenant), FinOps basics, reserved vs on-demand pricing, and right-sizing.

Before you can optimize, you need to see cost: by service, by region, by team, by project, and over time.

  • Billing and cost APIs — Cloud providers expose cost and usage data (e.g. AWS Cost Explorer, Azure Cost Management, GCP Billing). Ingest this into your metrics or a dedicated cost dashboard so cost is visible alongside SLIs and infrastructure metrics.
  • Allocation and tagging — Tag resources by team, project, environment (prod/staging), and service. Use consistent tagging so you can slice cost by owner and workload. Without tags, cost is a blob; with them, you can hold teams accountable and spot outliers.
  • Dashboards and alerts — Show spend over time, trend by tag, and alert when cost spikes or exceeds a threshold (e.g. daily or monthly budget). Treat cost like an SLO: you want to know when you’re burning too fast.
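The tag-based slicing above can be sketched in a few lines. This is a minimal illustration, not any provider's billing schema: the record fields (`service`, `team`, `env`, `cost`) are hypothetical stand-ins for whatever your cost export actually contains.

```python
from collections import defaultdict

# Hypothetical cost records as exported from a billing/cost API;
# the field names are illustrative, not a specific provider's schema.
records = [
    {"service": "api", "team": "payments", "env": "prod", "cost": 120.0},
    {"service": "api", "team": "payments", "env": "staging", "cost": 15.0},
    {"service": "etl", "team": "data", "env": "prod", "cost": 300.0},
    {"service": "etl", "team": "data", "env": "prod", "cost": 50.0},
]

def cost_by(records, tag):
    """Slice total cost by a single tag (team, env, service, ...)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[tag]] += r["cost"]
    return dict(totals)

print(cost_by(records, "team"))  # {'payments': 135.0, 'data': 350.0}
print(cost_by(records, "env"))   # {'prod': 470.0, 'staging': 15.0}
```

Untagged resources would simply be missing the key here, which is the "cost is a blob" problem in miniature: the slice is only as good as the tagging discipline behind it.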

Visibility is the foundation for everything below: unit economics, right-sizing, and reserved capacity decisions all depend on knowing what costs what.

Unit economics express cost in meaningful units so you can reason about efficiency and scalability.

  • Cost per request — Total cost of a service (or slice of infra) over a period, divided by request count. Tells you how much each request “costs” and whether that number is improving or degrading as you scale. Use with Throughput and Infrastructure Metrics to see how utilization and efficiency relate to cost.
  • Cost per tenant (or per customer) — For multi-tenant systems, allocate or attribute cost to tenants (e.g. by usage, by namespace, by tag). Lets you understand profitability, identify heavy tenants, and set pricing or limits. Depends on good allocation (tagging, metering).

These metrics help you answer: “Are we getting more efficient?” and “Which workloads or tenants drive cost?”
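Both unit metrics reduce to simple division and aggregation. A minimal sketch, with made-up numbers; in practice the inputs come from your cost export (numerator) and your metering or request metrics (denominator):

```python
def cost_per_request(total_cost, request_count):
    """Period cost divided by period request count."""
    return total_cost / request_count if request_count else float("inf")

def cost_per_tenant(attributed):
    """Sum cost attributed to each tenant (via tags, namespaces, metering).

    attributed: iterable of (tenant, cost) pairs.
    """
    totals = {}
    for tenant, cost in attributed:
        totals[tenant] = totals.get(tenant, 0.0) + cost
    return totals

# Illustrative numbers: $1,200 for 3M requests in a month.
print(cost_per_request(1200.0, 3_000_000))  # 0.0004, i.e. $0.40 per 1k requests
print(cost_per_tenant([("acme", 800.0), ("beta", 300.0), ("acme", 100.0)]))
```

Tracking `cost_per_request` over releases is often more useful than the raw number: a rising trend at flat traffic means efficiency is degrading even if total spend looks stable.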

FinOps is the practice of managing cloud cost with the same rigor as performance and reliability—collaboration between engineering, finance, and product; data-driven decisions; and continuous improvement.

  • Culture — Cost is a shared responsibility. Engineers need visibility and incentives to optimize; finance needs engineering input to forecast and explain variance. Regular cost reviews (e.g. monthly or quarterly) keep cost in the conversation.
  • Efficiency — Use the data you have: right-size (see below), eliminate waste (idle resources, orphaned disks, over-provisioning), and choose the right pricing model (reserved vs on-demand, spot where appropriate). See Capacity Planning for how headroom and scaling interact with cost.
  • Governance — Policies (e.g. approval for large instances, mandatory tags), budgets, and alerts so cost doesn’t spiral. Balance control with speed; avoid blocking every change while still catching runaway spend.
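"Treat cost like an SLO" can be made concrete as a burn-rate check: compare actual spend against a linear expected-by-now line for the budget period. A sketch with illustrative thresholds; real policies would tune these and probably use multiple windows:

```python
def budget_alert(spend_to_date, monthly_budget, day_of_month, days_in_month=30):
    """Alert when spend is burning faster than the budget allows.

    Compares actual spend to a linear 'expected by now' line, similar
    to an SLO error-budget burn-rate check. Thresholds are illustrative.
    """
    expected = monthly_budget * day_of_month / days_in_month
    burn_rate = spend_to_date / expected if expected else float("inf")
    if burn_rate >= 1.5:
        return "page"   # burning 50%+ faster than the budget allows
    if burn_rate >= 1.1:
        return "warn"
    return "ok"

# $6,000 spent by day 12 of a $10,000 monthly budget:
# expected spend is $4,000, so burn rate is 1.5.
print(budget_alert(6000, 10000, 12))  # "page"
```

A linear baseline is a simplification; workloads with strong weekly or end-of-month patterns need a seasonal baseline to avoid false alarms.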

Cost observability (visibility, unit economics, dashboards, alerts) is the data layer that makes FinOps possible.

Cloud pricing typically offers on-demand (pay per use, no commitment) and reserved (commit for 1–3 years at a lower per-unit price). Spot or preemptible instances add a third tier (cheapest, but can be reclaimed at any time).

  • On-demand — Flexibility; good for variable or unpredictable workloads, dev/test, or when you don’t want commitment. Higher unit cost.
  • Reserved — Lower unit cost in exchange for commitment. Good for baseline, steady production load. You’re betting on usage; if workload drops, you may overpay.
  • Spot / preemptible — Lowest cost, interruptible. Use for batch, fault-tolerant, or stateless workloads that can handle termination. See Capacity Planning for how to think about mix of reserved + on-demand + spot.

Use cost and usage data to decide what to reserve: look at steady-state utilization and commit for the part that’s predictable; leave the rest on-demand or spot.
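The "commit for the predictable part" decision can be framed as a small optimization over historical usage. A sketch under simplified assumptions: hourly usage samples, one instance type, flat reserved and on-demand rates (all numbers illustrative):

```python
def reserve_plan(hourly_usage, reserved_rate, on_demand_rate):
    """Pick how many instances to reserve from hourly usage samples.

    Reserving n instances costs reserved_rate * n every hour whether
    used or not; demand above n is covered on-demand. We try every n
    up to the observed peak and keep the cheapest total.
    """
    best_n, best_cost = 0, float("inf")
    for n in range(max(hourly_usage) + 1):
        cost = sum(n * reserved_rate + max(u - n, 0) * on_demand_rate
                   for u in hourly_usage)
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost

# Steady baseline of ~4 instances with occasional peaks.
usage = [4, 4, 5, 4, 6, 4, 4, 10, 4, 4]
n, cost = reserve_plan(usage, reserved_rate=0.06, on_demand_rate=0.10)
print(n)  # 4: reserve the baseline, leave the peaks on-demand
```

Note the model lands exactly on the steady baseline: committing past it would pay for idle reserved capacity, which is the "if workload drops, you may overpay" risk in numeric form.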

Right-sizing means provisioning the right amount of resource—not over-provisioned (wasting money) and not under-provisioned (risking performance or availability). It sits at the intersection of cost and performance.

  • Data — Use Infrastructure Metrics (CPU, memory, disk, network) and Capacity Planning (utilization, headroom, scaling thresholds). If CPU is consistently low, consider a smaller instance; if you’re constantly near limit, you’re at risk and may need more or better autoscaling.
  • Approach — Review instances and services periodically; downsize or change instance family where utilization is low and headroom is high. Test after changes (e.g. load test) to avoid regressions. Autoscaling reduces the need for manual right-sizing of peak capacity but doesn’t remove it—baseline and scaling limits still need to be chosen.
  • Tradeoffs — Right-sizing saves cost but don’t sacrifice reliability: keep enough headroom for spikes and failures. See Capacity Planning for headroom policies and SLOs for what you’re protecting.
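The data-driven part of right-sizing can be sketched as sizing for a high percentile of observed utilization plus headroom. The percentile choice, headroom, and target utilization below are illustrative assumptions, not recommendations; any resize should be validated with a load test as noted above:

```python
import math

def rightsize(cpu_samples, current_vcpus, target_util=0.60, headroom=0.30):
    """Suggest a vCPU count from utilization history (a sketch).

    Sizes for the p95 of observed CPU usage plus headroom for spikes,
    aiming at a target average utilization. All thresholds are
    illustrative; validate any change before rolling it out.
    """
    samples = sorted(cpu_samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]  # nearest-rank p95
    needed = p95 * current_vcpus * (1 + headroom) / target_util
    return max(1, math.ceil(needed))  # whole vCPUs, never zero

# An 8-vCPU instance that rarely exceeds 20% CPU.
samples = [0.10, 0.12, 0.15, 0.18, 0.20, 0.14, 0.11, 0.16, 0.19, 0.13]
print(rightsize(samples, current_vcpus=8))  # 4
```

Memory deserves the same treatment but with more caution: CPU pressure degrades gracefully, while running out of memory usually does not, so memory headroom is typically kept larger.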

Cost observability tells you what costs what; right-sizing is one of the main actions you take with that information.

Related pages:

  • Infrastructure Metrics — CPU, memory, disk, and how they feed into scaling and cost decisions.
  • Capacity Planning — Headroom, scaling, and how provisioned capacity relates to cost.
  • SLOs, SLIs & SLAs — Cost is often tracked alongside reliability and performance SLOs in reviews.