Install llama.cpp Server Inside Docker Container on Linux

llama.cpp is an open-source project that enables efficient inference of large language models (LLMs) on CPUs (and optionally on GPUs) using quantization. Its server component provides a local HTTP interface compatible with the OpenAI API, allowing you to run and interact with LLMs entirely on your own machine.

This tutorial explains how to install the llama.cpp server inside a Docker container on Linux. The commands have been tested on Ubuntu.

Prepare environment

Make sure Docker is installed on your system. If you are using Ubuntu, installation instructions can be found in a separate post.
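
To confirm that Docker is installed, you can check its version:

docker --version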

Install llama.cpp server

Create a directory to store LLM models:

sudo mkdir /opt/llamacpp

Download the Llama 3.2 1B Instruct model for testing:

sudo curl -Lo /opt/llamacpp/Llama-3.2-1B-Instruct-Q8_0.gguf https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
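
Optionally, verify that the model file was downloaded completely by checking its size (the Q8_0 quantization of this 1B model should be roughly 1.3 GB):

ls -lh /opt/llamacpp/Llama-3.2-1B-Instruct-Q8_0.gguf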

  • Host network

Run the following command to create a container for the llama.cpp server that uses the host network:

docker run -d --name=llamacpp --restart=always --network=host \
    -v /opt/llamacpp:/models \
    ghcr.io/ggml-org/llama.cpp:server --host 0.0.0.0 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
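
To confirm that the container started and the model loaded, you can inspect the container logs; once the server is ready, its /health endpoint should respond on port 8080:

docker logs llamacpp
curl http://localhost:8080/health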

To enable faster inference with NVIDIA GPU support, run the following command:

docker run -d --name=llamacpp --restart=always --network=host \
    -v /opt/llamacpp:/models \
    --gpus all \
    ghcr.io/ggml-org/llama.cpp:server-cuda --host 0.0.0.0 -ngl 99 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
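
To verify that the model layers were offloaded to the GPU, you can search the container logs for CUDA-related messages (the exact log wording may vary between llama.cpp versions):

docker logs llamacpp 2>&1 | grep -i -E 'cuda|offload'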

Note: GPU acceleration requires the NVIDIA Container Toolkit to be correctly installed on the system. Without it, the --gpus all option will not work. Installation instructions for Ubuntu are provided in a separate post.

  • User-defined bridge network

A user-defined bridge network can be used to expose the server on a different host port. Inside the container, the llama.cpp server listens on port 8080 by default; the Docker -p option maps this container port to another port on the host (8081 in the examples below).

docker network create app-net
docker run -d --name=llamacpp --restart=always --network=app-net \
    -p 8081:8080 \
    -v /opt/llamacpp:/models \
    ghcr.io/ggml-org/llama.cpp:server --host 0.0.0.0 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf

To run the llama.cpp server with NVIDIA GPU support, use:

docker run -d --name=llamacpp --restart=always --network=app-net \
    -p 8081:8080 \
    -v /opt/llamacpp:/models \
    --gpus all \
    ghcr.io/ggml-org/llama.cpp:server-cuda --host 0.0.0.0 -ngl 99 -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf
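
With this setup the server is published on host port 8081 instead of 8080, so requests from the host should target that port, for example:

curl http://localhost:8081/v1/chat/completions -d '{"stream":false,"messages":[{"role":"user","content":"Hello"}]}'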

Testing llama.cpp server

To test that the llama.cpp server is running correctly and responding to requests, you can send a sample chat completion request using the curl command:

curl http://192.168.0.227:8080/v1/chat/completions -d '{"stream":false,"messages":[{"role":"user","content":"What is an LLM?"}]}'

Note: Replace the IP address with the actual IP address of the system.
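
Because the server exposes an OpenAI-compatible API, you can also query its models endpoint to check which model is loaded (again, replace the IP address with the address of your system):

curl http://192.168.0.227:8080/v1/models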

Uninstall llama.cpp server

To completely remove llama.cpp server, remove its container:

docker rm --force llamacpp

Remove llama.cpp server images:

docker rmi ghcr.io/ggml-org/llama.cpp:server
docker rmi ghcr.io/ggml-org/llama.cpp:server-cuda

You can also delete downloaded models:

sudo rm -rf /opt/llamacpp

If a user-defined bridge network was created, you can delete it as follows:

docker network rm app-net
