A Comprehensive Study of Local SLM Inference: Performance vs. Reliability on Integrated GPUs
As Generative AI moves to the edge, running models locally on non-dedicated hardware (Integrated GPUs) is the next deployment frontier. This project was designed to scientifically quantify the practical trade-offs encountered in this environment.
Inference was measured using a standardized 30-prompt evaluation suite categorized into Explanation, Extraction, Logic, and Synthesis tasks.
*(Figure: per-model summary panels — "Llama 3.2 Leader", "Mistral 7B Leader", "Phi-4 Mini Overhead"; the underlying metrics are tabulated below.)*
| Model Name | Parameters | TPS (Throughput) | TTFT (Time to First Token) | Total Response Latency |
|---|---|---|---|---|
| Llama 3.2 | 3.0B | 13.52 tokens/s | 284.35 ms | 6.42 s |
| Phi-4 Mini | 3.8B | 7.42 tokens/s | 255.10 ms | 9.88 s |
| Mistral 7B | 7.2B | 7.71 tokens/s | 204.29 ms | 14.39 s |
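As a concrete sketch of how these metrics are derived, TTFT and TPS can be computed from per-token arrival timestamps (a hypothetical helper; the actual pipeline reads timings from Ollama's streaming responses):

```python
def throughput_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT (ms) and TPS from a request start time and the
    arrival timestamps (in seconds) of each generated token."""
    # TTFT: delay between issuing the request and the first token arriving
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # TPS: total tokens divided by the full generation window
    tps = len(token_times) / (token_times[-1] - request_start)
    return {"ttft_ms": round(ttft_ms, 2), "tps": round(tps, 2)}
```

For example, three tokens arriving 0.25 s, 0.5 s, and 1.0 s after the request give a TTFT of 250 ms and 3.0 tokens/s.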
Local SLM output is intrinsically chaotic. To convert raw text into data usable by an automated software pipeline, we developed a Constrained Generation & Validation Loop.
```python
# Conceptual logic of the SLM Reliability Wrapper
from pydantic import ValidationError

def generate_validated(prompt: str):
    try:
        # 1. Enforcement (Ollama API option: format="json")
        raw_output = ollama.chat(..., format="json")
        # 2. Validation (Pydantic model: TaskResponse)
        validated_json = TaskResponse.model_validate_json(raw_output)
        return validated_json
    except ValidationError as e:
        # 3. Retry logic (exactly one reprompt, with the validation error as feedback)
        return retry_with_feedback(e.json())
    except Exception:
        # 4. Graceful failure (standardized null object)
        return {"status": "fail", "data": None}
```
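For illustration, the kind of structural check Pydantic performs in step 2 can be approximated with stdlib `json` (the fields below are assumptions, not the project's actual `TaskResponse` schema):

```python
import json

# Assumed schema: field name -> required type (illustrative only)
REQUIRED_FIELDS = {"task": str, "result": str, "confidence": float}

def validate_task_response(raw: str) -> dict:
    """Approximate the TaskResponse check: parse, then verify keys and types."""
    data = json.loads(raw)  # raises on malformed JSON (filler text, prose, etc.)
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"field {field!r} missing or not {typ.__name__}")
    return data
```

A response with missing keys or filler text fails this check and routes into the retry branch rather than into the downstream pipeline.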
| Model Name | Zero-Shot JSON Success | Few-Shot JSON Success | Deployment Status |
|---|---|---|---|
| Llama 3.2 (3B) | 0.0% (missing keys, filler text) | 100.0% adherence | Verified for production |
| Phi-4 Mini | 0.0% (structure mismatch) | 100.0% adherence | Verified for production |
| Mistral 7B (v0.3) | 64.0% success | 86.0% success | Rejected (schema hallucination) |
We measured output variance at Temperature $T=0$ (Greedy Decoding) and $T=0.7$ (Multinomial Sampling) across 5 iterations per model, per prompt.
At $T=0$, outputs were totally deterministic (greedy decoding); at $T=0.7$, outputs showed high creative variance (stochastic sampling).
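The variance measure itself can be as simple as the fraction of distinct outputs per prompt (a sketch; the study's exact metric is not specified here):

```python
def output_variance(outputs: list[str]) -> float:
    """Fraction of distinct outputs across repeated generations.
    1/len(outputs) indicates full determinism; 1.0 means every run differed."""
    return len(set(outputs)) / len(outputs)
```

Five identical greedy generations score 0.2 (the minimum for 5 iterations), while five distinct sampled generations score 1.0.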
This section documents the specific engineering constraints encountered when benchmarking Llama and Mistral on the shared memory architecture of an Intel Integrated GPU.
When switching between models or loading higher-parameter models (Mistral 7B), the Ollama runner process frequently terminated with a generic 500 Internal Server Error. We diagnosed the failure as memory fragmentation in the iGPU's shared-memory allocation.
We implemented a service management wrapper that detects allocation failure, gracefully terminates the Ollama service, and enforces a mandatory "Cooling Period" to allow the Intel GPU driver to deallocate memory blocks.
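The recovery flow can be sketched with injected callables (the 30-second cooling period and the `RuntimeError` stand-in for the HTTP 500 are assumptions, not measured values):

```python
import time

def with_memory_recovery(run_inference, restart_service,
                         cooling_period_s: int = 30, sleep=time.sleep):
    """Run inference once; on an allocation failure, restart the Ollama
    service, wait out the cooling period, then retry exactly once."""
    try:
        return run_inference()
    except RuntimeError:             # stand-in for the client's 500 error
        restart_service()            # gracefully terminate and relaunch Ollama
        sleep(cooling_period_s)      # let the Intel GPU driver release shared memory
        return run_inference()
```

Injecting `restart_service` and `sleep` keeps the wrapper testable without touching a live service.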
The initial request to Phi-4 Mini after model acquisition showed a TTFT of 62,161 ms, during which Windows Explorer became unresponsive. The model weights did not fit into the contiguous "shared memory" space on startup.
We modified the benchmarking pipeline to inject a non-measured **"Warm-up Pulse"** request, forcing the model weights to be fully resident in shared VRAM before TTFT or TPS measurements begin and stabilizing the user-experience data.
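The pulse amounts to one discarded request before any timing starts (a sketch with a generic `generate` callable standing in for the model client):

```python
import time

def benchmark_with_warmup(generate, prompts):
    """Issue one non-measured warm-up request, then time each real prompt."""
    generate("warm-up")  # forces weights resident in shared VRAM; result discarded
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return timings
```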
Small Language Models (under 4B parameters) lack the intrinsic instruction-following capacity to infer a complex JSON structure from scratch: Llama 3.2 and Phi-4 Mini both scored **0.0%** zero-shot success.
We developed a few-shot system-prompting strategy. By embedding exactly one valid JSON example in the system message, we gave the model a structural anchor, raising the Llama 3.2 success rate from 0% to 100%.
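The anchoring technique can be sketched as a system-prompt builder (the example schema and wording below are illustrative, not the study's exact prompt):

```python
import json

def build_system_prompt(schema_example: dict) -> str:
    """Embed exactly one valid JSON example as a structural anchor."""
    return (
        "Respond ONLY with a JSON object matching this structure exactly:\n"
        + json.dumps(schema_example, indent=2)
        + "\nNo commentary, markdown fences, or extra keys."
    )
```

A single concrete example pins down key names, nesting, and value types while consuming far fewer context tokens than a formal schema description.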