Case Study
LLM Inference Bench
A benchmarking suite that measures the real-world performance difference between FP16 and INT4 quantized inference on the same GPU hardware. It tests 18 configurations and visualizes throughput, latency, and quality results in an interactive dashboard.
Controlled vLLM benchmark harness on A100
Isolated first-token latency noise via warmup protocol
3.3x throughput gain validated across 18 configurations
What
Benchmarking suite that runs Mistral-7B in FP16 and AWQ-Marlin INT4 via vLLM on GCP GPUs, measuring throughput, latency, and memory usage.
Results visualized in a Next.js dashboard.
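As a rough illustration of the measurement loop, the sketch below times a generation callable and discards the first few runs as warmup, so one-time startup costs do not distort the measured latency. It is a minimal sketch, not the project's actual harness: `benchmark`, the `generate` adapter, and the `warmup_runs` / `measured_runs` defaults are all hypothetical names and values chosen for the example.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    generate: Callable[[List[str]], List[int]],
    prompts: List[str],
    warmup_runs: int = 2,    # assumed default, not taken from the project
    measured_runs: int = 5,  # assumed default, not taken from the project
) -> Dict[str, float]:
    """Time `generate` over `prompts`, discarding the warmup runs.

    `generate` is a hypothetical adapter: it takes a batch of prompts and
    returns the number of tokens generated for each prompt.
    """
    # Warmup: outputs and timings are thrown away so one-time costs
    # (for example lazy initialization) stay out of the measured runs.
    for _ in range(warmup_runs):
        generate(prompts)

    latencies: List[float] = []
    tokens_per_second: List[float] = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        token_counts = generate(prompts)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        tokens_per_second.append(sum(token_counts) / elapsed)

    return {
        "mean_latency_s": statistics.mean(latencies),
        "median_latency_s": statistics.median(latencies),
        "mean_tokens_per_s": statistics.mean(tokens_per_second),
    }
```

With vLLM, the adapter could plausibly wrap `LLM.generate` and return `len(out.outputs[0].token_ids)` for each request output. Passing the same adapter shape, prompts, and run counts to both the FP16 and INT4 engines keeps the two measurements comparable.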
Why
Quantization claims were easy to find. Clean comparisons were not. The real question was whether INT4 was actually faster than FP16 on the same hardware under controlled conditions.
Who
ML engineers choosing serving formats, infrastructure teams managing GPU spend, and builders validating latency before deployment.
When / Where
Useful when GPU cost, latency, throughput, and output quality all affect the serving decision.
Constraints
GPU time on GCP is expensive, so benchmark runs had to be automated and efficient.
Comparisons had to be fair across both quantization modes and different prompt types.
Results had to make sense to people who aren't already deep in the quantization literature.
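One way to keep the comparison fair is to build the run matrix as a full cross product, so every FP16 run has an INT4 counterpart with identical settings, and to compute speedups only between matched pairs. The sketch below assumes that approach. `RunConfig`, `build_grid`, and `paired_speedups` are hypothetical names, and the `prompt_type` and `batch_size` axes are illustrative assumptions rather than the project's actual configuration axes.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    quantization: str  # e.g. "fp16" or "int4"
    prompt_type: str   # hypothetical axis: a label for the prompt category
    batch_size: int    # hypothetical axis: requests submitted per batch


def build_grid(
    quantizations: Iterable[str],
    prompt_types: Iterable[str],
    batch_sizes: Iterable[int],
) -> List[RunConfig]:
    """Cross every axis so each quantization mode sees identical settings."""
    return [
        RunConfig(q, p, b)
        for q, p, b in product(quantizations, prompt_types, batch_sizes)
    ]


def paired_speedups(
    throughput: Dict[RunConfig, float],
    baseline: str = "fp16",
    candidate: str = "int4",
) -> Dict[Tuple[str, int], float]:
    """Return candidate/baseline throughput ratios for matched runs only."""
    speedups: Dict[Tuple[str, int], float] = {}
    for cfg, base_tps in throughput.items():
        if cfg.quantization != baseline:
            continue
        match = RunConfig(candidate, cfg.prompt_type, cfg.batch_size)
        # Skip unmatched runs instead of comparing across different settings.
        if match in throughput:
            speedups[(cfg.prompt_type, cfg.batch_size)] = throughput[match] / base_tps
    return speedups
```

Keying the speedup by the shared settings makes it obvious when a pair is missing, which is easier to catch than an average taken over mismatched runs.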