Local LLM Development with llama.cpp and HuggingFace
> May 20, 2025
Running LLMs in the cloud is convenient, but sometimes you need inference without a cloud dependency — for privacy, cost, or latency reasons. This post walks through building a local LLM inference setup using llama.cpp and HuggingFace Transformers, and then evaluating it against cloud-hosted models.
> Why Go Local?
There are several compelling reasons to run models on-premise:
- **Data privacy** — sensitive documents never leave your infrastructure
- **Cost control** — no per-token API charges at scale
- **Latency** — no network round-trip for time-sensitive applications
- **Air-gapped environments** — regulated industries (finance, healthcare) often prohibit cloud calls
> Setting Up llama.cpp
llama.cpp is a C++ inference engine for running quantized GGUF models. It's blazing fast on CPU and supports GPU offloading via Metal and CUDA.
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j8

# Download a GGUF model, e.g. Mistral-7B-Instruct Q4_K_M
./llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Explain RAG in one paragraph." -n 200
```
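GPU offloading is also exposed on the command line. A minimal sketch, assuming a build with Metal or CUDA enabled; the `-ngl` flag controls how many layers are pushed to the GPU, so tune it to your VRAM:

```bash
# Offload 32 transformer layers to the GPU; the rest stay on the CPU
./llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 32 \
  -p "Explain RAG in one paragraph." -n 200
```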
For a Python interface, use the llama-cpp-python binding:

```python
from llama_cpp import Llama
llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=32)
response = llm("What is retrieval-augmented generation?", max_tokens=256)
print(response["choices"][0]["text"])
```
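llama-cpp-python also exposes a chat-style API that applies the model's chat template from the GGUF metadata. A brief sketch reusing the same `llm` instance:

```python
# Chat-style call; the instruct template is applied automatically
chat = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is retrieval-augmented generation?"},
    ],
    max_tokens=256,
)
print(chat["choices"][0]["message"]["content"])
```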
> Using HuggingFace Transformers Locally

For models that are not yet in GGUF format, HuggingFace Transformers is the go-to library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Explain vector search:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
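If VRAM is tight, the same model can be loaded in 4-bit through bitsandbytes instead of fp16. A rough sketch, assuming the `bitsandbytes` package is installed and a CUDA GPU is available (this is an addition to the setup above, not part of it):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 4-bit quantization brings the 7B weights down to roughly 4 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```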
> Building an Evaluation Pipeline

Comparing local vs. cloud models requires a structured evaluation pipeline. Here's the framework I use:

```python
import time
from dataclasses import dataclass
from typing import Callable
@dataclass
class EvalResult:
    model_name: str
    latency_ms: float
    answer: str
    relevance_score: float

def evaluate_model(
    model_fn: Callable[[str, str], str],
    questions: list[str],
    contexts: list[str],
    model_name: str = "local",
) -> list[EvalResult]:
    results = []
    for q, ctx in zip(questions, contexts):
        start = time.perf_counter()
        answer = model_fn(q, ctx)
        latency_ms = (time.perf_counter() - start) * 1000  # wall-clock latency in milliseconds
        score = compute_relevance(answer, ctx)  # custom metric
        results.append(EvalResult(model_name, latency_ms, answer, score))
    return results
```
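To run the comparison, each backend just needs to be wrapped as a `(question, context) -> answer` callable. A hypothetical sketch reusing the llama-cpp-python `llm` from earlier; `compute_relevance` is stubbed here with a trivial token-overlap score purely for illustration, so swap in your real metric:

```python
def compute_relevance(answer: str, ctx: str) -> float:
    # Toy stand-in: fraction of answer tokens that also appear in the context
    answer_tokens = set(answer.lower().split())
    ctx_tokens = set(ctx.lower().split())
    return len(answer_tokens & ctx_tokens) / max(len(answer_tokens), 1)

def local_model_fn(question: str, ctx: str) -> str:
    # llm is the Llama instance created in the llama-cpp-python section
    prompt = f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"].strip()

questions = ["What is retrieval-augmented generation?"]
contexts = ["RAG pairs a retriever over a document store with a generator model ..."]

results = evaluate_model(local_model_fn, questions, contexts, model_name="mistral-7b-q4")
for r in results:
    print(f"{r.model_name}: {r.latency_ms:.0f} ms, relevance={r.relevance_score:.2f}")
```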
> Benchmark Results

Running Mistral-7B Q4_K_M locally versus Claude 3 Haiku on a 100-question domain QA dataset:
- **Latency**: Local — 1.2s average; Cloud — 0.8s average (with network)
- **Accuracy**: Local — 71%; Cloud — 79%
- **Retrieval relevance** (RAGAS score): Local — 0.68; Cloud — 0.74
The gap is smaller than most people expect, especially for domain-specific retrieval tasks where the context window provides most of the signal.
> Key Takeaways
- **Quantization is your friend** — Q4_K_M gives ~90% of the quality at 25% of the memory footprint.
- **GPU offloading matters** — even partial offload to a consumer GPU cuts latency by 3–5x.
- **Evaluation is non-negotiable** — never deploy a local model without benchmarking it against your specific use case.
Local LLMs are no longer a researcher's toy. With the right tooling, they are production-ready for many enterprise workloads.