model.gguf was just a placeholder. You need to download an actual .gguf file, such as:
gpt-oss-20b-mxfp4.gguf
For GPT-OSS 20B, the Hugging Face repo is:
ggml-org/gpt-oss-20b-GGUF
The model page lists gpt-oss-20b-mxfp4.gguf as the file used by llama.cpp / llama-cpp-python, and the repo can also be run directly with llama-server -hf ggml-org/gpt-oss-20b-GGUF. (Hugging Face)
Easiest: let llama.cpp download it
From your llama.cpp directory:
cd ~/llama.cpp
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
The official quick start for this repo is basically:
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 --jinja
For your P40, I would start with:
-c 4096
rather than -c 0, which uses the model's full context window and allocates a much larger KV cache.
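To sanity-check that the MXFP4 weights plus a 4096-token KV cache fit in the P40's 24 GB, you can watch VRAM while the server loads and handles a request, for example:

watch -n 1 nvidia-smi

If there is plenty of headroom, raise -c in steps and re-check.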
Manual download to a real file
Install the Hugging Face CLI:
sudo apt install -y python3-pip
python3 -m pip install --user -U huggingface_hub
Make sure your user-local Python bin path is active:
export PATH="$HOME/.local/bin:$PATH"
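To make that PATH change permanent and confirm the CLI is actually visible (a quick check, assuming you use bash):

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
which huggingface-cli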
Create a model directory:
mkdir -p ~/models/gpt-oss-20b
Download the GGUF file:
huggingface-cli download ggml-org/gpt-oss-20b-GGUF \
  gpt-oss-20b-mxfp4.gguf \
  --local-dir ~/models/gpt-oss-20b \
  --local-dir-use-symlinks False
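Once the download finishes, it is worth confirming the file actually landed and is roughly the expected size (the MXFP4 quant of a 20B model is on the order of 12 GB; treat that figure as approximate):

ls -lh ~/models/gpt-oss-20b/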
Then run it:
cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
Test it
curl http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-oss-20b",
"messages": [
{
"role": "user",
"content": "Write a minimal Go HTTP health check server."
}
],
"temperature": 0.2
}'
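If the chat request hangs or errors, a lighter first check is llama-server's health endpoint, which should return a small JSON status once the model has finished loading:

curl http://127.0.0.1:8080/health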
For Qwen later
Same idea, but choose a Qwen GGUF repo instead. For example, Qwen’s docs show running Qwen models through llama.cpp using GGUF files. (Hugging Face)
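As a concrete sketch (the repo name here is only an illustration; pick whichever Qwen GGUF repo and quant you actually want), the invocation looks the same as for GPT-OSS:

./build/bin/llama-server \
  -hf Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080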
For now, get GPT-OSS 20B working first with:
-hf ggml-org/gpt-oss-20b-GGUF
or with the downloaded file:
-m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf