llama.cpp is an open-source project that enables efficient inference of large language models (LLMs) on CPUs, and optionally on GPUs, using quantization. Its server component provides a local HTTP interface compatible with the OpenAI API, allowing you to run and interact with LLMs entirely on your own machine.
This tutorial explains how to install the llama.cpp server inside a Docker container on Linux. The commands have been tested on Ubuntu.
Prepare environment
Make sure Docker is installed on your system. If you are using Ubuntu, installation instructions can be found in the post.
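You can quickly confirm that Docker is available, for example, by checking its version:
docker --version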
Install llama.cpp server
Create a directory to store LLM models:
sudo mkdir /opt/llamacpp
Download the Llama 3.2 1B Instruct model (Q8_0 quantization) for testing:
sudo curl -Lo /opt/llamacpp/Llama-3.2-1B-Instruct-Q8_0.gguf https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
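Once the download finishes, you can verify that the model file is in place:
ls -lh /opt/llamacpp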
- Host network
Run the following command to create a container for the llama.cpp server that uses the host network:
docker run -d --name=llamacpp --restart=always --network=host \
-v /opt/llamacpp:/models \
ghcr.io/ggml-org/llama.cpp:server --host 0.0.0.0 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
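Before sending requests, you can check the container logs and query the server's built-in /health endpoint to confirm the model has loaded and the server is ready, for example:
docker logs llamacpp
curl http://localhost:8080/health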
To enable faster inference with NVIDIA GPU support, run the following command:
docker run -d --name=llamacpp --restart=always --network=host \
-v /opt/llamacpp:/models \
--gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda --host 0.0.0.0 -ngl 99 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
Note: GPU acceleration requires the NVIDIA Container Toolkit to be correctly installed on the system. Without it, the --gpus all option will not work. Installation instructions for Ubuntu are provided in the post. The -ngl 99 option offloads all model layers to the GPU.
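To confirm that the GPU is actually being used, you can search the container logs for CUDA device initialization messages, or run nvidia-smi inside the container (the NVIDIA Container Toolkit typically injects this utility when --gpus all is used):
docker logs llamacpp 2>&1 | grep -i cuda
docker exec llamacpp nvidia-smi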
- User-defined bridge network
A user-defined bridge network can be used to expose the server on a different host port. By default, the llama.cpp server listens on port 8080 inside the container; the -p option of docker run maps it to another port on the host, 8081 in this example.
docker network create app-net
docker run -d --name=llamacpp --restart=always --network=app-net \
-p 8081:8080 \
-v /opt/llamacpp:/models \
ghcr.io/ggml-org/llama.cpp:server --host 0.0.0.0 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
To run llama.cpp server with NVIDIA GPU support, use:
docker run -d --name=llamacpp --restart=always --network=app-net \
-p 8081:8080 \
-v /opt/llamacpp:/models \
--gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda --host 0.0.0.0 -ngl 99 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
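With a user-defined bridge network, other containers attached to app-net can reach the server by its container name through Docker's built-in DNS, while clients on the host use the published port 8081. As a quick sketch, using the curlimages/curl image purely as an example client:
docker run --rm --network=app-net curlimages/curl -s http://llamacpp:8080/health
curl http://localhost:8081/health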
Testing llama.cpp server
To test that the llama.cpp server is running correctly and responding to requests, you can send a sample chat completion request using the curl command:
curl http://192.168.0.227:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"stream":false,"messages":[{"role":"user","content":"What is an LLM?"}]}'
Note: Replace the IP address with the actual IP address of the system. If you used the user-defined bridge network setup, use port 8081 instead of 8080.
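You can also list the loaded model via the OpenAI-compatible models endpoint and, assuming Python 3 is available on the client machine, pretty-print the JSON response:
curl -s http://192.168.0.227:8080/v1/models | python3 -m json.tool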
Uninstall llama.cpp server
To completely remove the llama.cpp server, first remove its container:
docker rm --force llamacpp
Remove llama.cpp server images:
docker rmi ghcr.io/ggml-org/llama.cpp:server
docker rmi ghcr.io/ggml-org/llama.cpp:server-cuda
You can also delete downloaded models:
sudo rm -rf /opt/llamacpp
If a user-defined bridge network was created, you can delete it as follows:
docker network rm app-net