llama.cpp is a high-performance C++ implementation for running large language models (LLMs) locally, enabling fast, offline inference on consumer-grade hardware. Running llama.cpp inside a Docker container ensures a consistent and reproducible environment across different machines, simplifies dependency management, and avoids common build issues.
First, create a directory on the host machine to store the model files:
sudo mkdir /opt/llamacpp
For this tutorial, we'll use the Llama 3.2 1B model from Hugging Face. Download the model with the following command:
sudo curl -Lo /opt/llamacpp/Llama-3.2-1B-Instruct-Q8_0.gguf https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
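Before moving on, it's worth confirming that the file downloaded completely. The Q8_0 quantization of a 1B model is on the order of a gigabyte, so a file that is only a few kilobytes usually means the download was interrupted:

ls -lh /opt/llamacpp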
This model is compact enough to run on most machines while demonstrating how llama.cpp works.
With the model downloaded, you’re ready to run llama.cpp inside a Docker container. The following command mounts the local model directory into the container and launches an interactive session with the specified model. Replace the prompt as needed to test different inputs:
docker run -it --rm -v /opt/llamacpp:/models ghcr.io/ggml-org/llama.cpp:light -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -p "What is an LLM?"
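The command above relies on llama.cpp's default generation settings. The same image accepts the usual llama.cpp CLI flags, so you can tune the run if you like. As a sketch (the parameter values here are just illustrative, and exact flags can vary between llama.cpp releases, so check the image's --help output if one is rejected), the following caps the response at 256 tokens, enlarges the context window, and lowers the sampling temperature:

docker run -it --rm -v /opt/llamacpp:/models ghcr.io/ggml-org/llama.cpp:light -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -c 4096 -n 256 --temp 0.7 -p "What is an LLM?"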
You should see the model generate a response directly in the terminal. To quit, press CTRL+C. This stops the container and returns you to the terminal.
If your system has an NVIDIA GPU, you can run the CUDA version of llama.cpp for significantly faster inference. Use the following command to launch the NVIDIA GPU-accelerated container:
docker run -it --rm -v /opt/llamacpp:/models --gpus all ghcr.io/ggml-org/llama.cpp:light-cuda -ngl 99 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -p "What is an LLM?"
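The -ngl 99 flag tells llama.cpp to offload up to 99 model layers to the GPU, which for a 1B model means the entire network runs on the GPU. To confirm the GPU is actually being used, you can watch GPU memory and utilization from a second terminal while the container is generating (assuming the NVIDIA driver's nvidia-smi utility is available on the host):

watch -n 1 nvidia-smi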
Note: GPU acceleration requires the NVIDIA Container Toolkit to be properly installed. Without it, the --gpus all option will not work. Installation steps for Ubuntu are included in this post.
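If you're not sure whether the toolkit is set up correctly, a quick sanity check (assuming the toolkit's runtime hook has been configured for Docker) is to run nvidia-smi inside a throwaway container:

docker run --rm --gpus all ubuntu nvidia-smi

If this prints your GPU table, the --gpus all flag will also work with the llama.cpp CUDA image; if it fails, revisit the toolkit installation steps.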