SLM-Benchmarking v1.0

By Siva Sindhuja Tsundupalli | April 2026

A Comprehensive Study of Local SLM Inference: Performance vs. Reliability on Integrated GPUs.

GPU: Intel Iris Xe iGPU (8 GB shared VRAM)
OS: Windows (MINGW64 / Ollama v0.5+)
Validation: Pydantic v2.x
Automation: Python 3.11

1. Project Objectives

As Generative AI moves to the edge, running models locally on non-dedicated hardware (Integrated GPUs) is the next deployment frontier. This project was designed to scientifically quantify the practical trade-offs encountered in this environment.

2. Inference Performance Benchmarks

Inference was measured using a standardized 30-prompt evaluation suite categorized into Explanation, Extraction, Logic, and Synthesis tasks.

Warm Throughput

13.52 Tokens / Sec

(Llama 3.2 Leader)

Warm Snappiness

204 ms (TTFT)

(Mistral 7B Leader)

Cold Start Penalty

62.1 Seconds

(Phi-4 Mini Overhead)

Detailed Matrix (iGPU Baseline)

| Model Name | Parameters | TPS (Throughput) | TTFT (Snappiness) | Total Response Latency |
| --- | --- | --- | --- | --- |
| Llama 3.2 | 3.0B | 13.52 tokens/s | 284.35 ms | 6.42 s |
| Phi-4 Mini | 3.8B | 7.42 tokens/s | 255.10 ms | 9.88 s |
| Mistral 7B | 7.2B | 7.71 tokens/s | 204.29 ms | 14.39 s |
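Both headline metrics reduce to timestamp arithmetic over a streamed response. A minimal sketch, assuming per-token arrival times are captured with `time.perf_counter()` around each streamed Ollama chunk; the function name and the convention of computing TPS over the generation window (excluding the first token's wait) are our assumptions, since the report does not publish its harness:

```python
def stream_metrics(t_request, token_times):
    """Derive TTFT (ms) and TPS from one streamed response.

    t_request   -- perf_counter() timestamp when the request was dispatched
    token_times -- perf_counter() timestamps of each streamed token

    (Hypothetical helper; not the study's actual measurement code.)
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Time To First Token: dispatch -> first streamed token, in ms
    ttft_ms = (token_times[0] - t_request) * 1000.0
    # Tokens Per Second over the generation window only
    window = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / window if window > 0 else float("inf")
    return ttft_ms, tps
```

In the real loop the timestamps would be collected while iterating `ollama.chat(..., stream=True)`, with the first chunk's arrival marking TTFT.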

3. Structure, Determinism & The Reliability Loop

Raw local SLM output is non-deterministic and loosely formatted. To convert it into data an automated software pipeline can consume, we developed a Constrained Generation & Validation Loop.

# Conceptual logic of the SLM reliability wrapper
import ollama
from pydantic import ValidationError

def reliable_chat(messages):
    try:
        # 1. Enforcement (Ollama API option: format="json")
        response = ollama.chat(model="llama3.2", messages=messages, format="json")
        raw_output = response["message"]["content"]

        # 2. Validation (Pydantic v2 model: TaskResponse)
        return TaskResponse.model_validate_json(raw_output)

    except ValidationError as e:
        # 3. Retry logic (exactly one reprompt with error feedback)
        return retry_with_feedback(e.json())

    except Exception:
        # 4. Graceful failure (standardized null object)
        return {"status": "fail", "data": None}
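The loop leans on two pieces the snippet does not show: the `TaskResponse` schema and the single-retry helper. A hedged sketch of both, with invented schema fields (the report does not publish `TaskResponse`'s actual shape) and a caller-supplied `reprompt` callable; in the real wrapper that callable would be bound to the chat session, which is why the loop above passes only the error report:

```python
from pydantic import BaseModel, ValidationError

class TaskResponse(BaseModel):
    # Illustrative fields only -- the study's actual schema is not published.
    task: str
    answer: str
    confidence: float

def retry_with_feedback(reprompt, error_json):
    """Exactly one reprompt: feed the validator's error report back to
    the model and validate the corrected output.

    reprompt   -- callable that sends the feedback message (e.g. a
                  closure over ollama.chat) and returns raw JSON text
    error_json -- Pydantic's ValidationError serialized via e.json()
    """
    feedback = (
        "Your previous reply was not valid JSON for the required schema.\n"
        f"Validator errors: {error_json}\n"
        "Return ONLY corrected JSON."
    )
    return TaskResponse.model_validate_json(reprompt(feedback))
```

If the corrected output still fails validation, `model_validate_json` raises again and control falls through to the graceful-failure branch.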

Schema Adherence Study

| Model Name | Zero-Shot JSON Success | Few-Shot JSON Success | Deployment Status |
| --- | --- | --- | --- |
| Llama 3.2 (3B) | 0.0% (failed keys / filler text) | 100.0% adherence | Verified Production |
| Phi-4 Mini | 0.0% (structure mismatch) | 100.0% adherence | Verified Production |
| Mistral 7B (v0.3) | 64.0% | 86.0% | Schema Hallucination |

4. Stochasticity Control & Temperature

We measured output variance at Temperature $T=0$ (Greedy Decoding) and $T=0.7$ (Multinomial Sampling) across 5 iterations per model, per prompt.

Temp 0 Stability

1/5 Unique Outputs

Total Determinism (Greedy)

Temp 0.7 Variance

5/5 Unique Outputs

High creative variance (Stochastic)
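The "1/5 unique" versus "5/5 unique" figures reduce to counting distinct completions across the 5 repeated runs of each (model, prompt, temperature) cell. A minimal sketch with synthetic completions standing in for real model output:

```python
def unique_outputs(responses):
    """Number of distinct completions among repeated runs of one prompt.

    Whitespace-stripped exact match, mirroring a strict determinism
    check; the study's actual comparison criterion is an assumption.
    """
    return len({r.strip() for r in responses})

# Greedy decoding (T=0) should collapse every run to the same string:
greedy = ["Paris is the capital of France."] * 5

# Multinomial sampling (T=0.7) produced a different string every run:
sampled = [
    "Paris.",
    "The capital is Paris.",
    "Paris, France.",
    "It's Paris.",
    "France's capital is Paris.",
]
```

With Ollama, the temperature for each cell would be set via `options={"temperature": 0.0}` or `0.7` on the chat call.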

5. Technical Challenges & Solutions

This section documents the specific engineering constraints encountered when benchmarking Llama and Mistral on the shared memory architecture of an Intel Integrated GPU.

Challenge 1 Intel iGPU VRAM Exhaustion (Status 500)

When switching between models or loading high-parameter models (Mistral 7B), the Ollama runner process frequently terminated with a generic 500 Internal Server Error. We diagnosed this as memory fragmentation in the iGPU's shared memory allocation.

Solution iGPU "Clean Slate" Protocol

We implemented a service management wrapper that detects allocation failure, gracefully terminates the Ollama service, and enforces a mandatory "Cooling Period" to allow the Intel GPU driver to deallocate memory blocks.
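The protocol above can be sketched as a recovery routine plus a guarded call site. The service stop/start hooks, the cooldown length, and the use of `RuntimeError` as a stand-in for Ollama's HTTP 500 are all assumptions for illustration; the real wrapper manages the Windows Ollama service:

```python
import time

def clean_slate(stop_service, start_service, cooldown_s=20.0, sleep=time.sleep):
    """'Clean Slate' recovery: stop Ollama, wait out a cooling period so
    the Intel GPU driver can deallocate shared-memory blocks, restart.

    stop_service / start_service are injected process-management hooks;
    cooldown_s is illustrative -- the report does not state the period.
    """
    stop_service()
    sleep(cooldown_s)      # mandatory cooling period for driver deallocation
    start_service()

def guarded_chat(chat_fn, recover, retries=1):
    """Run one inference call; on allocation failure, run the recovery
    protocol and retry once before giving up."""
    for attempt in range(retries + 1):
        try:
            return chat_fn()
        except RuntimeError:          # stand-in for the generic 500 error
            if attempt == retries:
                raise
            recover()
```

Injecting the hooks keeps the protocol testable without touching a live service.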

Challenge 2 The "Cold Load" Stutter (62s Penalty)

The first request to Phi-4 Mini after model acquisition required 62,161 ms to first token, and Windows Explorer became unresponsive during the load. We diagnosed the stall as the model weights failing to fit into contiguous shared-memory space on startup.

Solution Warm-up Pulse Implementation

We modified the benchmarking pipeline to inject an unmeasured **"Warm-up Pulse"** request before each run. This forces the model weights to be fully resident in shared VRAM before TTFT or TPS measurement begins, so the recorded latencies reflect steady-state user experience.
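The pipeline change amounts to one discarded call ahead of the timed loop. A sketch with invented names (`run_benchmark`, the injected `measure` hook); the real harness's structure is not published:

```python
def run_benchmark(chat_fn, prompts, measure):
    """Benchmark loop with an unmeasured warm-up pulse.

    chat_fn(prompt) performs one inference call. The first call forces
    the weights fully into shared VRAM and its result and timing are
    discarded; only subsequent, steady-state calls reach `measure`.
    """
    chat_fn("warm-up")    # warm-up pulse: never timed, never recorded
    return [measure(lambda p=p: chat_fn(p)) for p in prompts]
```

`measure` receives a zero-argument thunk, so the same loop works whether it wraps the call in `time.perf_counter()` timing or the streaming TTFT/TPS capture.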

Challenge 3 Zero-Shot Instruction Incompetence

Small Language Models (under 4B parameters) lack the instruction-following capacity to infer a complex JSON structure from a schema description alone: Llama 3.2 and Phi-4 Mini both scored **0.0%** zero-shot success.

Solution The Few-Shot Breakthrough

We developed a Few-Shot System Prompting strategy. Embedding exactly one valid JSON example in the system message gives the model a structural anchor, raising the Llama 3.2 success rate from 0% to 100%.
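The strategy can be sketched as building the system message around a single serialized example. The prompt wording and the example's fields are our assumptions; only the one-example-in-the-system-message structure comes from the study:

```python
import json

def few_shot_system_prompt(example):
    """System message embedding exactly one valid JSON example as the
    structural anchor (illustrative wording, not the study's prompt)."""
    return (
        "Respond ONLY with JSON matching this example's structure:\n"
        + json.dumps(example, indent=2)
    )

messages = [
    {"role": "system", "content": few_shot_system_prompt(
        {"task": "extraction", "answer": "Ada Lovelace", "confidence": 0.95})},
    {"role": "user", "content": "Who wrote the first computer program?"},
]
# messages would then be passed to
# ollama.chat(model=..., messages=messages, format="json")
```

Pairing the example with `format="json"` combines both layers of the reliability loop: the example fixes the keys and nesting, while constrained decoding guarantees parseable output.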