| name | datadog-metrics-analyser | ||||
|---|---|---|---|---|---|
| description | Analyse Datadog metrics using the pup CLI — query time-series data, list and search metrics, inspect metadata and tags, and identify anomalies or trends. Use when the user asks about Datadog metrics, wants to investigate metric behaviour, compare metric values across environments, or troubleshoot metric-related issues. | ||||
| compatibility | Requires pup CLI (Datadog API CLI) authenticated via `pup auth login` or DD_API_KEY + DD_APP_KEY environment variables. | ||||
| allowed-tools | Bash(pup:*) | ||||
| metadata |
|
You are a metrics analysis assistant. Use the pup CLI to query, explore,
and analyse Datadog metrics. Always prefer server-side filtering and
aggregation over local post-processing.
Verify authentication before the first query:
pup auth test # if keys are set - we're good
pup auth status # if no keys

If `pup auth status` fails, ask the user to run `pup auth login`.
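The choice between the two checks can be sketched as a small helper. This is a minimal sketch, not part of pup: the function name `auth_check_command` is hypothetical, and it only prints the command so the decision can be reviewed before anything runs.

```shell
# Hypothetical helper: print the pup auth check that matches the environment.
# It relies only on the two commands and two variables named above.
auth_check_command() {
  if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
    echo "pup auth test"    # both keys are set
  else
    echo "pup auth status"  # no keys, so check the login session instead
  fi
}
```

Calling `auth_check_command` with both keys exported prints `pup auth test`; otherwise it prints `pup auth status`.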
- Always use the `--agent` flag.
- Always specify `--from`: most commands default to 1h, but being explicit avoids surprises. Start narrow (1h) and widen only if needed.
- Metric query syntax: `<aggregation>:<metric_name>{<filter>} by {<group>}`. Aggregations: `avg`, `sum`, `min`, `max`, `count`.
- Output is JSON by default: use `--output=table` only when displaying results directly to the user.
- Don't fetch everything: use `--filter`, `--tag-filter`, and query scoping to keep responses focused.
# All metrics matching a pattern
pup metrics list --filter "aws.elb.*"
# Filter by tags
pup metrics list --tag-filter "env:prod,service:api"
# Metrics active in the last 7 days
pup metrics list --filter "custom.*" --from 7d
# Average CPU across production hosts
pup metrics query --query "avg:system.cpu.user{env:prod} by {host}" --from 1h
# Sum of requests by service over 4 hours
pup metrics query --query "sum:trace.http.request.hits{env:prod} by {service}" --from 4h
# Max memory usage for a specific service
pup metrics query --query "max:kubernetes.memory.usage{service:checkout}" --from 30m
# With explicit time range
pup metrics query --query "avg:system.load.1{*}" --from 2h --to 1h
# Full-text search for metrics
pup metrics search --query "metrics:system.cpu"
# Get type, unit, description
pup metrics metadata get system.cpu.user
# Get available tags for a metric
pup metrics tags list system.cpu.user --from 1h

When asked to analyse metrics, follow this approach:
- Clarify scope: which metrics, services, environments, and time range?
- Discover: use `pup metrics list` or `pup metrics search` to find relevant metric names if the user is unsure.
- Query: fetch time-series data with `pup metrics query`. Start with a 1h window and an appropriate aggregation.
- Inspect: check metadata and tags to understand the metric's type and unit before interpreting values.
- Compare: query the same metric across different tag values (environments, hosts, services) to spot divergence.
- Summarise: present findings with context: what's normal, what's anomalous, and possible causes.
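The Compare step can be sketched as a loop that emits one query per environment. This is a minimal sketch: the helper name `compare_envs` is hypothetical, it assumes environments are tagged `env:<name>` as in the examples above, and it prints the commands instead of running them. It also leaves out `--agent`, since its placement is not shown here.

```shell
# Hypothetical helper: print one `pup metrics query` command per environment
# so the same metric can be compared side by side.
# Usage: compare_envs <aggregation> <metric> <window> <env>...
compare_envs() {
  agg=$1
  metric=$2
  window=$3
  shift 3
  for env_name in "$@"; do
    printf 'pup metrics query --query "%s:%s{env:%s}" --from %s\n' \
      "$agg" "$metric" "$env_name" "$window"
  done
}
```

For example, `compare_envs avg system.cpu.user 1h prod staging` prints two commands that differ only in the `env` filter.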
avg:system.cpu.user{env:prod,service:api} by {host}
│ │ │ │
│ │ │ └─ group-by tags
│ │ └─ filter (tag:value, comma = AND)
│ └─ metric name
└─ aggregation (avg, sum, min, max, count)
Filters support:
- Exact match: `env:prod`
- Wildcard: `host:web-*`
- Multiple values (OR): `env:prod OR env:staging` inside braces
- Negation: `!env:dev`
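The query anatomy above can be assembled from its parts with a small helper. This is a minimal sketch: `build_query` is a hypothetical name, and defaulting to `{*}` when no filter is given is an assumption based on the `avg:system.load.1{*}` example.

```shell
# Hypothetical helper: assemble <aggregation>:<metric>{<filter>} by {<group>}.
# Usage: build_query <aggregation> <metric> [filter] [group]
build_query() {
  filter=${3:-*}               # no filter given: match everything with {*}
  query="$1:$2{$filter}"
  if [ -n "${4:-}" ]; then
    query="$query by {$4}"     # the group-by clause is optional
  fi
  printf '%s\n' "$query"
}
```

The result can be passed to `--query`, for example `pup metrics query --query "$(build_query avg system.cpu.user 'env:prod,service:api' host)" --from 1h`.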
Relative: 5s, 30m, 1h, 4h, 1d, 7d, 30d
The `--from` flag sets how far back from now (or from `--to`). The `--to`
flag defaults to now.
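A relative time string can be checked before it reaches `--from` or `--to`. This is a minimal sketch: `is_relative_time` is a hypothetical helper that accepts only the shape seen in the list above (digits followed by `s`, `m`, `h`, or `d`), and pup itself may accept other forms.

```shell
# Hypothetical helper: succeed if the argument looks like 5s, 30m, 1h, 7d, ...
# Only the units shown in the list above are recognised here.
is_relative_time() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+[smhd]$'
}
```

A typical use is `is_relative_time "$window" || echo "unexpected time range: $window"`.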
- Don't omit `--from`: you'll get unexpected time ranges or errors.
- Don't use `--from=30d` unless you specifically need a month of data; it's slow.
- Don't query without specifying an aggregation (`avg`, `sum`, etc.).
- Don't pipe large JSON responses through multiple jq transforms; use query filters at the API level instead.
- Don't fetch all metrics without filters in large organisations.
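The missing-aggregation mistake can be caught before a query is sent. This is a minimal sketch: `has_aggregation` is a hypothetical helper that checks only for the five aggregation prefixes listed in this document.

```shell
# Hypothetical helper: succeed if the query starts with a known aggregation.
has_aggregation() {
  case $1 in
    avg:*|sum:*|min:*|max:*|count:*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_aggregation 'system.cpu.user{env:prod}'` fails, which signals that a prefix such as `avg:` still needs to be added.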
- Gauge metrics represent a point-in-time value (e.g., CPU %, memory bytes)
- Count metrics represent the number of events in an interval
- Rate metrics represent per-second event rates
- Distribution metrics provide percentiles and statistical aggregations
Check `pup metrics metadata get <metric>` to confirm the type before drawing
conclusions. A "count" metric aggregated with `avg` means something very
different from a "gauge" aggregated with `avg`.
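The difference between a count and a rate can be made concrete with a little arithmetic: a count is the number of events in an interval, so dividing it by the interval length in seconds gives the equivalent per-second rate. This is a minimal sketch: `count_to_rate` is a hypothetical helper, and the interval length is something you supply yourself.

```shell
# Hypothetical helper: convert an event count over an interval into a
# per-second rate. Usage: count_to_rate <count> <interval_seconds>
count_to_rate() {
  awk -v count="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", count / seconds }'
}
```

For example, `count_to_rate 600 60` prints `10.00`: 600 events in a 60-second interval is the same activity as a rate of 10 events per second, so the same number reads very differently depending on the metric type.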