A Comprehensive Study of Local SLM Inference: Performance vs. Reliability on Integrated GPUs
As Generative AI moves to the edge, running models locally on non-dedicated hardware (Integrated GPUs) is the next deployment frontier. This project was designed to scientifically quantify the practical trade-offs encountered in this environment.
Inference was measured using a standardized 30-prompt evaluation suite categorized into Explanation, Extraction, Logic, and Synthesis tasks.
*(Figure: per-model summary panels — "Llama 3.2 Leader", "Mistral 7B Leader", "Phi-4 Mini Overhead"; the underlying metrics are tabulated below.)*
| Model Name | Parameters | TPS (Throughput) | TTFT (Time to First Token) | Total Response Latency |
|---|---|---|---|---|
| Llama 3.2 | 3.0B | 13.52 tokens/s | 284.35 ms | 6.42 s |
| Phi-4 Mini | 3.8B | 7.42 tokens/s | 255.10 ms | 9.88 s |
| Mistral 7B | 7.2B | 7.71 tokens/s | 204.29 ms | 14.39 s |
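As a concrete sketch of how these metrics are derived, TTFT and TPS can be computed from per-token arrival timestamps (a hypothetical helper; the actual pipeline reads timings from Ollama's streaming responses):

```python
def throughput_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT (ms) and TPS from a request start time and the
    arrival timestamps (in seconds) of each generated token."""
    # TTFT: delay between issuing the request and the first token arriving
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # TPS: total tokens divided by the full generation window
    tps = len(token_times) / (token_times[-1] - request_start)
    return {"ttft_ms": round(ttft_ms, 2), "tps": round(tps, 2)}
```

For example, three tokens arriving 0.25 s, 0.5 s, and 1.0 s after the request give a TTFT of 250 ms and 3.0 tokens/s.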
Local SLM output is intrinsically chaotic. To convert raw text into data usable by an automated software pipeline, we developed a Constrained Generation & Validation Loop.
```python
# Conceptual logic of the SLM Reliability Wrapper
from pydantic import ValidationError

def generate_validated(prompt: str):
    try:
        # 1. Enforcement (Ollama API option: format="json")
        raw_output = ollama.chat(..., format="json")
        # 2. Validation (Pydantic model: TaskResponse)
        validated_json = TaskResponse.model_validate_json(raw_output)
        return validated_json
    except ValidationError as e:
        # 3. Retry logic (exactly one reprompt, with the validation error as feedback)
        return retry_with_feedback(e.json())
    except Exception:
        # 4. Graceful failure (standardized null object)
        return {"status": "fail", "data": None}
```
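For illustration, the kind of structural check Pydantic performs in step 2 can be approximated with stdlib `json` (the fields below are assumptions, not the project's actual `TaskResponse` schema):

```python
import json

# Assumed schema: field name -> required type (illustrative only)
REQUIRED_FIELDS = {"task": str, "result": str, "confidence": float}

def validate_task_response(raw: str) -> dict:
    """Approximate the TaskResponse check: parse, then verify keys and types."""
    data = json.loads(raw)  # raises on malformed JSON (filler text, prose, etc.)
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"field {field!r} missing or not {typ.__name__}")
    return data
```

A response with missing keys or filler text fails this check and routes into the retry branch rather than into the downstream pipeline.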
| Model Name | Zero-Shot JSON Success | Few-Shot JSON Success | Deployment Status |
|---|---|---|---|
| Llama 3.2 (3B) | 0.0% (missing keys, filler text) | 100.0% adherence | Verified for production |
| Phi-4 Mini | 0.0% (structure mismatch) | 100.0% adherence | Verified for production |
| Mistral 7B (v0.3) | 64.0% success | 86.0% success | Rejected (schema hallucination) |
We measured output variance at Temperature $T=0$ (Greedy Decoding) and $T=0.7$ (Multinomial Sampling) across 5 iterations per model, per prompt.
At $T=0$, outputs were totally deterministic (greedy decoding); at $T=0.7$, outputs showed high creative variance (stochastic sampling).
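The variance measure itself can be as simple as the fraction of distinct outputs per prompt (a sketch; the study's exact metric is not specified here):

```python
def output_variance(outputs: list[str]) -> float:
    """Fraction of distinct outputs across repeated generations.
    1/len(outputs) indicates full determinism; 1.0 means every run differed."""
    return len(set(outputs)) / len(outputs)
```

Five identical greedy generations score 0.2 (the minimum for 5 iterations), while five distinct sampled generations score 1.0.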
This section documents the specific engineering constraints encountered when benchmarking Llama and Mistral on the shared memory architecture of an Intel Integrated GPU.
When switching between models or loading higher-parameter models (Mistral 7B), the Ollama runner process frequently terminated with a generic 500 Internal Server Error. We diagnosed the failure as memory fragmentation in the iGPU's shared-memory allocation.
We implemented a service management wrapper that detects allocation failure, gracefully terminates the Ollama service, and enforces a mandatory "Cooling Period" to allow the Intel GPU driver to deallocate memory blocks.
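The recovery flow can be sketched with injected callables (the 30-second cooling period and the `RuntimeError` stand-in for the HTTP 500 are assumptions, not measured values):

```python
import time

def with_memory_recovery(run_inference, restart_service,
                         cooling_period_s: int = 30, sleep=time.sleep):
    """Run inference once; on an allocation failure, restart the Ollama
    service, wait out the cooling period, then retry exactly once."""
    try:
        return run_inference()
    except RuntimeError:             # stand-in for the client's 500 error
        restart_service()            # gracefully terminate and relaunch Ollama
        sleep(cooling_period_s)      # let the Intel GPU driver release shared memory
        return run_inference()
```

Injecting `restart_service` and `sleep` keeps the wrapper testable without touching a live service.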
The initial request to Phi-4 Mini after model acquisition showed a TTFT of 62,161 ms, during which Windows Explorer became unresponsive. The model weights did not fit into the contiguous "shared memory" space on startup.
We modified the benchmarking pipeline to inject a non-measured **"Warm-up Pulse"** request, forcing the model weights to be fully resident in shared VRAM before TTFT or TPS measurements begin and stabilizing the user-experience data.
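The pulse amounts to one discarded request before any timing starts (a sketch with a generic `generate` callable standing in for the model client):

```python
import time

def benchmark_with_warmup(generate, prompts):
    """Issue one non-measured warm-up request, then time each real prompt."""
    generate("warm-up")  # forces weights resident in shared VRAM; result discarded
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return timings
```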
Small Language Models (under 4B parameters) lack the intrinsic instruction-following capacity to infer a complex JSON structure from scratch: Llama 3.2 and Phi-4 Mini both scored **0.0%** zero-shot success.
We developed a few-shot system-prompting strategy. By embedding exactly one valid JSON example in the system message, we gave the model a structural anchor, raising the Llama 3.2 success rate from 0% to 100%.
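The anchoring technique can be sketched as a system-prompt builder (the example schema and wording below are illustrative, not the study's exact prompt):

```python
import json

def build_system_prompt(schema_example: dict) -> str:
    """Embed exactly one valid JSON example as a structural anchor."""
    return (
        "Respond ONLY with a JSON object matching this structure exactly:\n"
        + json.dumps(schema_example, indent=2)
        + "\nNo commentary, markdown fences, or extra keys."
    )
```

A single concrete example pins down key names, nesting, and value types while consuming far fewer context tokens than a formal schema description.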