@velppa
Created May 7, 2026 09:22
name datadog-metrics-analyser
description Analyse Datadog metrics using the pup CLI — query time-series data, list and search metrics, inspect metadata and tags, and identify anomalies or trends. Use when the user asks about Datadog metrics, wants to investigate metric behaviour, compare metric values across environments, or troubleshoot metric-related issues.
compatibility Requires pup CLI (Datadog API CLI) authenticated via `pup auth login` or DD_API_KEY + DD_APP_KEY environment variables.
allowed-tools Bash(pup:*)
metadata
  author velppa
  version 1.0.0

Datadog Metrics Analyser

You are a metrics analysis assistant. Use the pup CLI to query, explore, and analyse Datadog metrics. Always prefer server-side filtering and aggregation over local post-processing.

Prerequisites

Verify authentication before the first query:

pup auth test    # when DD_API_KEY and DD_APP_KEY are set: checks that the keys work
pup auth status  # when no keys are set: checks the pup auth login session

If pup auth status fails, ask the user to run pup auth login.
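The check above can be wrapped in a small helper. This is a minimal sketch, not part of pup: the function name ensure_pup_auth is hypothetical, and it assumes that both pup auth test and pup auth status exit with a non-zero status when authentication is missing or invalid.

```shell
# Sketch: returns 0 when pup looks authenticated, 1 otherwise.
# Assumption: the pup auth subcommands signal failure via a non-zero exit code.
ensure_pup_auth() {
  if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
    # Keys are present in the environment, so validate them.
    pup auth test >/dev/null 2>&1 && return 0
  else
    # No keys, so fall back to checking the login session.
    pup auth status >/dev/null 2>&1 && return 0
  fi
  echo "pup is not authenticated; check your keys or run: pup auth login" >&2
  return 1
}
```

Calling ensure_pup_auth once before the first query keeps the failure message in one place instead of surfacing as a confusing error from a later command.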

Key rules

  • Always pass the --agent flag; the examples below omit it for brevity.
  • Always specify --from — most commands default to 1h but being explicit avoids surprises. Start narrow (1h) and widen only if needed.
  • Metric query syntax — <aggregation>:<metric_name>{<filter>} by {<group>}. Aggregations: avg, sum, min, max, count.
  • Output is JSON by default — use --output=table only when displaying results directly to the user.
  • Don't fetch everything — use --filter, --tag-filter, and query scoping to keep responses focused.

Common operations

List available metrics

# All metrics matching a pattern
pup metrics list --filter "aws.elb.*"

# Filter by tags
pup metrics list --tag-filter "env:prod,service:api"

# Metrics active in the last 7 days
pup metrics list --filter "custom.*" --from 7d

Query time-series data

# Average CPU across production hosts
pup metrics query --query "avg:system.cpu.user{env:prod} by {host}" --from 1h

# Sum of requests by service over 4 hours
pup metrics query --query "sum:trace.http.request.hits{env:prod} by {service}" --from 4h

# Max memory usage for a specific service
pup metrics query --query "max:kubernetes.memory.usage{service:checkout}" --from 30m

# With explicit time range
pup metrics query --query "avg:system.load.1{*}" --from 2h --to 1h

Search metrics (v1 API)

# Full-text search for metrics
pup metrics search --query "metrics:system.cpu"

Inspect metric metadata

# Get type, unit, description
pup metrics metadata get system.cpu.user

# Get available tags for a metric
pup metrics tags list system.cpu.user --from 1h

Analysis workflow

When asked to analyse metrics, follow this approach:

  1. Clarify scope — which metrics, services, environments, and time range?
  2. Discover — use pup metrics list or pup metrics search to find relevant metric names if the user is unsure.
  3. Query — fetch time-series data with pup metrics query. Start with a 1h window and appropriate aggregation.
  4. Inspect — check metadata and tags to understand the metric's type and unit before interpreting values.
  5. Compare — query the same metric across different tag values (environments, hosts, services) to spot divergence.
  6. Summarise — present findings with context: what's normal, what's anomalous, and possible causes.
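The Compare step can be sketched as a loop that runs the same query once per environment. The helper name compare_envs is hypothetical, and placing --agent directly after the subcommand is an assumption about where pup accepts that flag; the --query and --from flags are the ones used in the examples above.

```shell
# Sketch: query one metric per environment so the series can be compared side by side.
# Usage: compare_envs <metric> <window> <env>...
compare_envs() (
  if [ "$#" -lt 3 ]; then
    echo "usage: compare_envs <metric> <window> <env>..." >&2
    return 2
  fi
  metric=$1
  window=$2
  shift 2
  for env_name in "$@"; do
    echo "== env:${env_name} =="
    # Assumption: --agent is accepted in this position.
    pup metrics query --agent \
      --query "avg:${metric}{env:${env_name}} by {host}" \
      --from "$window"
  done
)
```

For example, compare_envs system.cpu.user 1h prod staging issues two queries over the same window, which keeps the comparison like for like.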

Query syntax reference

avg:system.cpu.user{env:prod,service:api} by {host}
│   │               │                      │
│   │               │                      └─ group-by tags
│   │               └─ filter (tag:value, comma = AND)
│   └─ metric name
└─ aggregation (avg, sum, min, max, count)

Filters support:

  • Exact match: env:prod
  • Wildcard: host:web-*
  • Multiple values (OR): {env:prod OR env:staging}, with the OR written inside the braces
  • Negation: !env:dev
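To make the syntax concrete, here is a minimal sketch of a helper that assembles a query string from its parts. The function name build_query is hypothetical; it only builds the string and never calls pup.

```shell
# Sketch: assemble <aggregation>:<metric>{<filter>} by {<group>}.
# Usage: build_query <aggregation> <metric> [filter] [group]
build_query() (
  agg=$1
  metric=$2
  filter=${3:-*}   # default to {*}, i.e. no filtering
  group=${4:-}
  case "$agg" in
    avg|sum|min|max|count) ;;
    *)
      echo "unsupported aggregation: $agg" >&2
      return 1
      ;;
  esac
  query="${agg}:${metric}{${filter}}"
  if [ -n "$group" ]; then
    query="${query} by {${group}}"
  fi
  printf '%s\n' "$query"
)
```

build_query avg system.cpu.user 'env:prod,service:api' host prints avg:system.cpu.user{env:prod,service:api} by {host}, and build_query avg system.load.1 prints avg:system.load.1{*}.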

Time formats

Relative: 5s, 30m, 1h, 4h, 1d, 7d, 30d

The --from flag sets how far back from now (or from --to). The --to flag defaults to now.
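A quick shape check can catch a malformed window before it reaches the API. This sketch assumes that only the units shown in the list above (s, m, h, d) are valid; pup may accept others. The function name is_relative_time is hypothetical.

```shell
# Sketch: succeed only for values shaped like 5s, 30m, 4h or 7d.
# Assumption: s, m, h and d are the only supported unit suffixes.
is_relative_time() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+[smhd]$'
}
```

For example, is_relative_time 30m succeeds, while is_relative_time 1hour and is_relative_time h both fail.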

Anti-patterns to avoid

  • Don't omit --from — you'll get unexpected time ranges or errors
  • Don't use --from=30d unless you specifically need a month of data; it's slow
  • Don't query without specifying an aggregation (avg, sum, etc.)
  • Don't pipe large JSON responses through multiple jq transforms — use query filters at the API level instead
  • Don't fetch all metrics without filters in large organisations

Interpreting results

  • Gauge metrics represent a point-in-time value (e.g., CPU %, memory bytes)
  • Count metrics represent the number of events in an interval
  • Rate metrics represent per-second event rates
  • Distribution metrics provide percentiles and statistical aggregations

Check pup metrics metadata get <metric> to confirm the type before drawing conclusions. A "count" metric aggregated with avg means something very different from a "gauge" aggregated with avg: averaging counts across hosts gives events per host, not total events.
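A small worked example shows why the aggregation matters for a count metric. The per-host numbers below are made up for illustration and do not come from any real query.

```shell
# Hypothetical request counts reported by three hosts in the same interval.
counts="10 20 30"

sum=0
n=0
for c in $counts; do
  sum=$((sum + c))
  n=$((n + 1))
done
avg=$((sum / n))

# sum is the total number of events; avg is the number of events per host.
echo "sum=$sum avg=$avg"   # prints: sum=60 avg=20
```

With sum the answer is 60 requests in total; with avg it is 20 requests per host. For a gauge such as CPU %, the average is usually the meaningful figure and the sum rarely is.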
