Model
Fire a Request
Load Test
What am I looking at?
This tool shows what the GPU infrastructure does when an LLM request comes in, not just the model's final answer.
Time to First Token (TTFT) — how long until the first token arrives. It grows with prompt length, because the entire prompt must be processed (prefill) before decoding can begin.
Load test — fire multiple requests concurrently and watch p95/p99 tail latency climb as the request queue fills. That's the GPU scheduler under pressure.
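The TTFT and throughput cards below can be derived from any streaming response: time the first token against the request start, then divide total tokens by total wall-clock time. A minimal sketch in Python, where fake_stream is a hypothetical stand-in for a real streaming LLM response (not this page's actual backend):

```python
import time

def measure_ttft_and_throughput(token_stream):
    """Consume a token iterator, timing the first token and the overall rate.

    Returns (ttft_seconds, total_seconds, token_count, tokens_per_second).
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # First token observed: this gap is the TTFT.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    rate = count / total if total > 0 else 0.0
    return ttft, total, count, rate

def fake_stream(n=5, delay=0.01):
    # Hypothetical stand-in: yields n tokens with a fixed per-token delay.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"
```

For example, `measure_ttft_and_throughput(fake_stream())` reports a TTFT of roughly one token delay and a throughput near 1/delay tokens per second.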
Backed by NVIDIA NIM on DGX Cloud. For full GPU metrics (VRAM, KV cache, temperature) see the self-hosted container path.
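The load test described above can be approximated client-side: run requests through a bounded worker pool, record each request's wall-clock latency, and report nearest-rank percentiles. A sketch under the assumption that request_fn is any callable issuing one request (the simulated sleep in the usage note is a stand-in, not the real API call):

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(samples, p):
    # Nearest-rank percentile: smallest sample with at least p% of
    # observations at or below it.
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def load_test(request_fn, n_requests=20, concurrency=5):
    """Fire n_requests through `concurrency` workers; return per-request
    latencies in milliseconds. Queueing delay shows up in the tail."""
    def timed(_):
        t0 = time.perf_counter()
        request_fn()
        return (time.perf_counter() - t0) * 1000
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(n_requests)))
```

Usage: `lat = load_test(lambda: time.sleep(0.01))`, then `percentile(lat, 95)` and `percentile(lat, 99)` give the p95/p99 figures the load-test view tracks.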
Last Request — Infrastructure Metrics
Time to First Token
—ms
Total Latency
—ms
Throughput
—tok/s
Tokens Generated
—
Response
Fire a request to see the response here.
The metrics above update in real time as tokens arrive.