~/krishna_dhakal
#AI#LLM#llama.cpp#HuggingFace

Local LLM Development with llama.cpp and HuggingFace

> May 20, 2025

Running LLMs in the cloud is convenient, but sometimes you need inference without cloud dependency — for privacy, cost, or latency reasons. This post walks through building a local LLM inference setup using llama.cpp and HuggingFace Transformers, and then evaluating it against cloud-hosted models.


> Why Go Local?


There are several compelling reasons to run models on-premise:


  • **Data privacy** — sensitive documents never leave your infrastructure
  • **Cost control** — no per-token API charges at scale
  • **Latency** — no network round-trip for time-sensitive applications
  • **Air-gapped environments** — regulated industries (finance, healthcare) often prohibit cloud calls

> Setting Up llama.cpp


llama.cpp is a C++ inference engine for running quantized GGUF models. It's blazing fast on CPU and supports GPU offloading via Metal and CUDA.


git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j8
# Download a GGUF model, e.g. Mistral-7B-Instruct Q4_K_M
./build/bin/llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Explain RAG in one paragraph." -n 200

For a Python interface, use the llama-cpp-python binding:


from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=32)
response = llm("What is retrieval-augmented generation?", max_tokens=256)
print(response["choices"][0]["text"])

> Using HuggingFace Transformers Locally


For models that are not yet in GGUF format, HuggingFace Transformers is the go-to library:


from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain vector search:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

> Building an Evaluation Pipeline


Comparing local vs. cloud models requires a structured evaluation pipeline. Here's the framework I use:


import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    model_name: str
    latency_ms: float
    answer: str
    relevance_score: float

def evaluate_model(model_name: str, model_fn: Callable[[str, str], str],
                   questions: list[str], contexts: list[str]) -> list[EvalResult]:
    results = []
    for q, ctx in zip(questions, contexts):
        start = time.perf_counter()
        answer = model_fn(q, ctx)
        latency = (time.perf_counter() - start) * 1000
        score = compute_relevance(answer, ctx)  # custom metric, defined per use case
        results.append(EvalResult(model_name, latency, answer, score))
    return results
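The `compute_relevance` call above is a placeholder for whatever metric fits your domain. As a minimal sketch (an assumption on my part, not the metric used for the benchmarks below), simple token overlap between answer and context works as a crude stand-in until you wire up something stronger like embedding similarity or RAGAS:

```python
def compute_relevance(answer: str, context: str) -> float:
    """Crude lexical relevance: fraction of answer tokens that also appear
    in the context. A stand-in for a real metric (embedding cosine
    similarity, RAGAS faithfulness, or LLM-as-judge)."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

Lexical overlap is easy to game and blind to paraphrase, but it is deterministic and free, which makes it useful as a smoke test before paying for a judge model.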

> Benchmark Results


Running Mistral-7B Q4_K_M locally versus Claude 3 Haiku on a 100-question domain QA dataset:


  • **Latency**: Local — 1.2s average; Cloud — 0.8s average (with network)
  • **Accuracy**: Local — 71%; Cloud — 79%
  • **Retrieval relevance** (RAGAS score): Local — 0.68; Cloud — 0.74

The gap is smaller than most people expect, especially for domain-specific retrieval tasks where the context window provides most of the signal.
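Reducing the per-question `EvalResult` records to summary numbers like the ones above can be sketched as follows (the 0.5 pass threshold is an illustrative assumption, not the cutoff used for the benchmark):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_name: str
    latency_ms: float
    answer: str
    relevance_score: float

def summarize(results: list[EvalResult], pass_threshold: float = 0.5) -> dict[str, float]:
    """Reduce per-question results to mean latency, mean relevance, and pass rate."""
    n = len(results)
    return {
        "mean_latency_ms": sum(r.latency_ms for r in results) / n,
        "mean_relevance": sum(r.relevance_score for r in results) / n,
        "pass_rate": sum(r.relevance_score >= pass_threshold for r in results) / n,
    }
```

Running `summarize` on the local and cloud result lists side by side gives you the three headline numbers per model in one place.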


> Key Takeaways


  1. **Quantization is your friend** — Q4_K_M gives ~90% of the quality at 25% of the memory footprint.
  2. **GPU offloading matters** — even partial offload to a consumer GPU cuts latency by 3–5x.
  3. **Evaluation is non-negotiable** — never deploy a local model without benchmarking it against your specific use case.
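The memory claim in takeaway 1 follows from back-of-the-envelope arithmetic. Assuming ~4.5 bits per weight for Q4_K_M (a rough figure, and ignoring KV cache and runtime overhead), the estimate looks like:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size in GB: params * bits / 8 bits-per-byte."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(7e9, 16)    # a 7B model at fp16: ~14 GB
q4km = model_size_gb(7e9, 4.5)   # the same model at ~4.5 bits/weight: ~3.9 GB
ratio = q4km / fp16              # roughly a quarter of the footprint
```

That ratio is what puts a 7B model comfortably inside a 16 GB laptop, KV cache included.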

Local LLMs are no longer a researcher's toy. With the right tooling, they are production-ready for many enterprise workloads.