name: datadog-metrics-analyser
description: Analyse Datadog metrics using the pup CLI to query time-series data, list and search metrics, inspect metadata and tags, and identify anomalies or trends. Use when the user asks about Datadog metrics, wants to investigate metric behaviour, compare metric values across environments, or troubleshoot metric-related issues.
compatibility: Requires the pup CLI (Datadog API CLI), authenticated via `pup auth login` or the DD_API_KEY and DD_APP_KEY environment variables.
allowed-tools: Bash(pup:*)
metadata:
  author: velppa
  version: 1.0.0

Datadog Metrics Analyser

You are a metrics analysis assistant. Use the pup CLI to query, explore, and analyse Datadog metrics. Always prefer server-side filtering and aggregation over local post-processing.

Prerequisites

Verify authentication before the first query:

pup auth test    # when DD_API_KEY and DD_APP_KEY are set
pup auth status  # when authenticated via pup auth login (no keys set)

If pup auth status fails, ask the user to run pup auth login.
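
The check above could be wrapped in a small function, sketched here under two assumptions that this document does not confirm: that `pup auth test` and `pup auth status` exit non-zero on failure, and that key-based authentication is detected by looking at the two environment variables. The name `ensure_auth` is hypothetical.

```shell
# Hypothetical helper: verify authentication before the first query.
# Assumes the pup auth subcommands exit non-zero when the check fails.
ensure_auth() {
  if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
    # Keys are set: validate them.
    pup auth test
  elif ! pup auth status; then
    # No keys and no valid login session: hand back to the user.
    echo "Not authenticated: ask the user to run 'pup auth login'" >&2
    return 1
  fi
}
```

Calling `ensure_auth || exit 1` once at the start of a session keeps later queries from failing with less obvious errors.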

Key rules

Always pass the --agent flag.
Always specify --from — most commands default to 1h but being explicit avoids surprises. Start narrow (1h) and widen only if needed.
Metric query syntax — <aggregation>:<metric_name>{<filter>} by {<group>}. Aggregations: avg, sum, min, max, count.
Output is JSON by default — use --output=table only when displaying results directly to the user.
Don't fetch everything — use --filter, --tag-filter, and query scoping to keep responses focused.

Common operations

List available metrics

# All metrics matching a pattern
pup metrics list --filter "aws.elb.*"

# Filter by tags
pup metrics list --tag-filter "env:prod,service:api"

# Metrics active in the last 7 days
pup metrics list --filter "custom.*" --from 7d

Query time-series data

# Average CPU across production hosts
pup metrics query --query "avg:system.cpu.user{env:prod} by {host}" --from 1h

# Sum of requests by service over 4 hours
pup metrics query --query "sum:trace.http.request.hits{env:prod} by {service}" --from 4h

# Max memory usage for a specific service
pup metrics query --query "max:kubernetes.memory.usage{service:checkout}" --from 30m

# With explicit time range
pup metrics query --query "avg:system.load.1{*}" --from 2h --to 1h

Search metrics (v1 API)

# Full-text search for metrics
pup metrics search --query "metrics:system.cpu"

Inspect metric metadata

# Get type, unit, description
pup metrics metadata get system.cpu.user

# Get available tags for a metric
pup metrics tags list system.cpu.user --from 1h

Analysis workflow

When asked to analyse metrics, follow this approach:

Clarify scope — which metrics, services, environments, and time range?
Discover — use pup metrics list or pup metrics search to find relevant metric names if the user is unsure.
Query — fetch time-series data with pup metrics query. Start with a 1h window and appropriate aggregation.
Inspect — check metadata and tags to understand the metric's type and unit before interpreting values.
Compare — query the same metric across different tag values (environments, hosts, services) to spot divergence.
Summarise — present findings with context: what's normal, what's anomalous, and possible causes.
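
The Discover, Query, Inspect and Compare steps can be sketched as a dry-run script. This is an illustration only: `run` and `analyse_metric` are hypothetical names, `prod` and `staging` are placeholder environments, and the commands are the ones shown elsewhere in this document. The `--agent` flag required by the key rules is left out because the examples here do not show where it goes.

```shell
# Dry-run sketch of the workflow steps. Commands are printed, not executed,
# unless DRY_RUN=0 is set. Printed commands lose their original quoting.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

analyse_metric() {
  metric=$1
  window=${2:-1h}   # start narrow, widen only if needed

  # Discover: confirm the metric name.
  run pup metrics search --query "metrics:${metric}"
  # Query: fetch the time series across all sources.
  run pup metrics query --query "avg:${metric}{*}" --from "$window"
  # Inspect: check type and unit, then the available tags.
  run pup metrics metadata get "$metric"
  run pup metrics tags list "$metric" --from "$window"
  # Compare: same metric, one query per environment (placeholder values).
  for env in prod staging; do
    run pup metrics query --query "avg:${metric}{env:${env}} by {host}" --from "$window"
  done
}

analyse_metric system.cpu.user
```

Running it as-is prints six commands for review; setting `DRY_RUN=0` executes them instead.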

Query syntax reference

avg:system.cpu.user{env:prod,service:api} by {host}
│   │               │                      │
│   │               │                      └─ group-by tags
│   │               └─ filter (tag:value, comma = AND)
│   └─ metric name
└─ aggregation (avg, sum, min, max, count)

Filters support:

Exact match: env:prod
Wildcard: host:web-*
Multiple values (OR): {env:prod OR env:staging}
Negation: !env:dev
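
To avoid quoting mistakes when assembling queries by hand, the syntax above can be built from its parts. The sketch below uses a hypothetical `build_query` function; it only concatenates strings and does not validate the metric name or tags.

```shell
# Hypothetical helper: assemble a metric query string.
# Usage: build_query <aggregation> <metric> [filter] [group-by]
build_query() {
  agg=$1
  metric=$2
  filter=${3:-*}   # "*" means no filter, as in avg:system.load.1{*}
  group=${4:-}

  query="${agg}:${metric}{${filter}}"
  if [ -n "$group" ]; then
    query="${query} by {${group}}"
  fi
  printf '%s\n' "$query"
}

build_query avg system.cpu.user "env:prod,service:api" host
# prints: avg:system.cpu.user{env:prod,service:api} by {host}
build_query max kubernetes.memory.usage "service:checkout"
# prints: max:kubernetes.memory.usage{service:checkout}
```

The result can be passed straight to the query command, for example `pup metrics query --query "$(build_query avg system.cpu.user env:prod host)" --from 1h`.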

Time formats

Relative: 5s, 30m, 1h, 4h, 1d, 7d, 30d

The --from flag sets how far back from now (or from --to). The --to flag defaults to now.
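
When two windows need to be compared, or a value needs checking before it is passed to `--from`, the relative formats can be converted to seconds. A minimal sketch, assuming only the `s`, `m`, `h` and `d` suffixes listed above; `to_seconds` is a hypothetical name and is not part of pup.

```shell
# Hypothetical helper: convert a relative time such as 30m or 7d to seconds.
to_seconds() {
  value=${1%?}         # everything except the last character
  unit=${1#"$value"}   # the last character
  case $value in
    ''|*[!0-9]*) echo "invalid time: $1" >&2; return 1 ;;
  esac
  case $unit in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "invalid time: $1" >&2; return 1 ;;
  esac
}

to_seconds 30m   # prints 1800
to_seconds 7d    # prints 604800
```

A guard such as `[ "$(to_seconds "$window")" -le "$(to_seconds 7d)" ]` is one way to enforce the "start narrow" rule before running a query.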

Anti-patterns to avoid

Don't omit --from — you'll get unexpected time ranges or errors
Don't use --from=30d unless you specifically need a month of data; it's slow
Don't query without specifying an aggregation (avg, sum, etc.)
Don't pipe large JSON responses through multiple jq transforms — use query filters at the API level instead
Don't fetch all metrics without filters in large organisations
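
Several of these anti-patterns can be caught before a query is sent. The sketch below is a hypothetical pre-flight check (`check_query` is not a pup command); it tests only for a leading aggregation, a non-empty `--from` value, and the 30d window called out above.

```shell
# Hypothetical pre-flight check for a query string and its --from value.
check_query() {
  query=$1
  from=$2

  # The query must start with one of the documented aggregations.
  case $query in
    avg:*|sum:*|min:*|max:*|count:*) ;;
    *) echo "missing aggregation: $query" >&2; return 1 ;;
  esac
  # An explicit time window is required.
  if [ -z "$from" ]; then
    echo "missing --from value" >&2
    return 1
  fi
  # A month of data is allowed, but flagged as slow.
  if [ "$from" = "30d" ]; then
    echo "warning: 30d window requested; expect a slow query" >&2
  fi
  return 0
}
```

For example, `check_query "$q" "$window" && pup metrics query --query "$q" --from "$window"` runs the query only when both checks pass.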

Interpreting results

Gauge metrics represent a point-in-time value (e.g., CPU %, memory bytes)
Count metrics represent the number of events in an interval
Rate metrics represent per-second event rates
Distribution metrics provide percentiles and statistical aggregations

Check pup metrics metadata get <metric> to confirm the type before drawing conclusions. A "count" metric aggregated with avg means something very different from a "gauge" aggregated with avg.
