Experimenting with LLMs

Using the llama.cpp inference runtime with small GGUF transformer models on dusty old hardware.

[Photo: dusty old home lab hardware]
[Image: LLM joke]

Hardware

The goal was to run a local LLM on whatever was already sitting around — no GPU, no cloud, no new hardware.

Machine: Lenovo desktop
CPU: Intel Core i3-6100T @ 3.20 GHz
RAM: 3.8 GB
OS: Debian 12
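
The figures above can be reproduced with standard Linux tools; a quick sketch (exact output format varies by distro):

```shell
# CPU model string, total memory, and OS name, via standard Linux tools
lscpu | awk -F': +' '/Model name/ {print "CPU:", $2}'
free -h | awk '/^Mem:/ {print "RAM:", $2}'
. /etc/os-release && echo "OS: $PRETTY_NAME"
```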

Install llama.cpp

llama.cpp is a C++ inference runtime for transformer models in GGUF format. It runs on CPU with no GPU dependencies, making it the right fit for constrained hardware.
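
A GGUF file is a single self-contained binary holding the model's tensors plus metadata, and it begins with the 4-byte ASCII magic "GGUF". That makes a quick download sanity check easy; a small sketch (the model path is illustrative):

```shell
# Every valid GGUF file starts with the 4-byte magic "GGUF".
# Anything else usually means a truncated or HTML-error download.
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "looks like GGUF"
  else
    echo "not a GGUF file (or missing)"
  fi
}
check_gguf models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```

A failed Hugging Face download often leaves an HTML error page with a .gguf filename; this catches that immediately.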

Clone the repository:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build without GPU support:

Build · CPU only
mkdir build
cd build
# CUDA disabled explicitly (newer llama.cpp revisions use -DGGML_CUDA=OFF instead)
cmake .. -DLLAMA_CUBLAS=OFF
# compile; expect a slow build on a two-core CPU
cmake --build . --config Release -j

Model Selection

With under 4 GB of RAM, model selection matters. One of the smallest viable chat models is tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf: a 1.1B-parameter model quantized to 4 bits, which fits within the memory budget with room to spare.

Model weights: ~0.8 GB
KV cache (1024 ctx): ~0.5–0.8 GB
Overhead: ~0.3 GB
Total: ~1.6–1.9 GB
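
The weights line can be sanity-checked from the parameter count. Q4_K_M averages roughly 4.5 bits per weight (an approximation; some tensors are kept at higher precision), so:

```shell
# 1.1e9 weights * ~4.5 bits/weight, converted to MiB
awk 'BEGIN { printf "%.0f MiB\n", 1.1e9 * 4.5 / 8 / 1048576 }'   # ≈ 590 MiB
```

That lands near 0.6 GiB for the pure 4-bit estimate; the higher-precision tensors and runtime alignment account for the gap up to the ~0.8 GB figure.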

For reference, this is a 1.1B-parameter model. GPT-4, the model behind ChatGPT, has been unofficially estimated at around 1.7 trillion parameters, roughly 1,500× larger.
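
That multiplier is just division on the two estimates:

```shell
# Estimated GPT-scale parameter count divided by TinyLlama's
echo $(( 1700000000000 / 1100000000 ))   # prints 1545, i.e. roughly 1,500x
```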

Running the Interface

llama-cli · launch command
# -m: model file · -c: context window (tokens) · -n: max tokens to generate
# -t: threads (one per physical core on the i3-6100T)
./build/bin/llama-cli \
  -m models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -c 1024 \
  -n 256 \
  -t 2
[Screenshot: llama-cli running in a terminal]
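
The -t 2 matches the i3-6100T's two physical cores; the chip exposes four logical CPUs through hyper-threading, but CPU-bound token generation usually gains little from the extra threads. To see what a machine reports:

```shell
# Logical CPUs visible to the OS (hyperthread siblings included)
nproc
# The same count via POSIX getconf
getconf _NPROCESSORS_ONLN
```

On the i3-6100T both report 4; the physical core count (2 here) appears under "Core(s) per socket" in lscpu.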

Prompt & Output

[Screenshot: llama.cpp prompt and output]

The model runs and responds. Output is coherent, but the examples it generates are incorrect, which is expected at this scale: a 1.1B-parameter model has absorbed general language structure but offers limited factual reliability.

What This Demonstrates

A 4-bit-quantized 1.1B-parameter model can run and hold a conversation on a dual-core desktop with under 4 GB of RAM, no GPU, and no cloud dependency. The limiting factor at this scale is factual reliability, not feasibility.

Further reading: Interesting reads on AI