Case Study

LLM Inference Bench

A benchmarking suite that measures the real-world performance difference between FP16 and INT4-quantized inference on the same GPU hardware. It tests 18 configurations and visualizes throughput, latency, and quality results in an interactive dashboard.

Arch

Controlled vLLM benchmark harness on A100

Hard

Isolated first-token latency noise via warmup protocol

Win

3.3x throughput gain validated across 18 configurations

What

  • Benchmarking suite that runs Mistral-7B in FP16 and AWQ-Marlin INT4 via vLLM on GCP GPUs, measuring throughput, latency, and memory usage.

  • Results visualized in a Next.js dashboard.
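A minimal sketch of the measurement loop described above, with warmup iterations discarded before timing starts. The `generate` callable, `run_benchmark`, and the metric names are hypothetical stand-ins, not the project's actual code. In a real run, `generate` would wrap a vLLM call for one model format. This version times whole requests, so first-token latency would need a streaming hook that is not shown here.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_benchmark(
    generate: Callable[[str], int],
    prompts: List[str],
    warmup_runs: int = 3,
) -> Dict[str, float]:
    """Time `generate` over `prompts` after discarding warmup calls.

    `generate` is an assumed interface: it takes a prompt and returns
    the number of tokens it produced.
    """
    # Warmup: results are thrown away so one-time startup costs
    # do not leak into the measured latencies.
    for prompt in prompts[:warmup_runs]:
        generate(prompt)

    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    return {
        "requests": float(len(latencies)),
        "tokens_per_second": total_tokens / elapsed,
        "latency_p50_s": statistics.median(latencies),
        "latency_max_s": max(latencies),
    }
```

Running the same function once per format, with the same prompt list, gives two result dictionaries whose `tokens_per_second` values can be divided to get a speedup ratio.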

Why

Quantization claims were easy to find. Clean comparisons were not. The real question was whether INT4 was faster under controlled conditions.

Who

ML engineers choosing serving formats, infrastructure teams managing GPU spend, and builders validating latency before deployment.

When / Where

Useful when GPU cost, latency, throughput, and output quality all affect the serving decision.

Constraints

GPU time on GCP is expensive, so benchmark runs had to be automated and efficient.

Comparisons had to be fair across both quantization modes and different prompt types.
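One way to keep the comparison fair is to generate every workload once and run it under both formats with identical settings. The sketch below assumes that approach. `BenchConfig`, `build_grid`, and all axis values are illustrative placeholders. They are not meant to reproduce the 18 configurations, whose breakdown is not stated here.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class BenchConfig:
    quantization: str    # descriptive label only, not a vLLM flag
    prompt_type: str
    max_new_tokens: int
    seed: int            # shared so both formats sample identically


def build_grid(
    modes: Sequence[str],
    prompt_types: Sequence[str],
    token_budgets: Sequence[int],
    seed: int = 0,
) -> List[BenchConfig]:
    """Cross every workload with every format.

    `modes` varies fastest, so the configs for one workload sit next
    to each other and differ only in the quantization field.
    """
    return [
        BenchConfig(mode, prompt_type, budget, seed)
        for prompt_type, budget, mode in product(prompt_types, token_budgets, modes)
    ]


# Hypothetical axes, for illustration only.
grid = build_grid(["fp16", "int4"], ["short_qa", "long_context"], [128, 512])
```

Because each workload appears once per format with the same seed and token budget, any throughput or latency difference within a pair can be attributed to the format and not to the workload.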

Results had to make sense to people who aren't already deep in the quantization literature.