<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://tech.uvoo.io/index.php?action=history&amp;feed=atom&amp;title=Llama_1</id>
	<title>Llama 1 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://tech.uvoo.io/index.php?action=history&amp;feed=atom&amp;title=Llama_1"/>
	<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_1&amp;action=history"/>
	<updated>2026-05-14T18:16:55Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.2</generator>
	<entry>
		<id>https://tech.uvoo.io/index.php?title=Llama_1&amp;diff=5689&amp;oldid=prev</id>
		<title>Busk: Created page with &quot;`model.gguf` was just a placeholder. You need to download an actual `.gguf` file, such as:  ```text gpt-oss-20b-mxfp4.gguf ```  For GPT-OSS 20B, the Hugging Face repo is:  ```...&quot;</title>
		<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_1&amp;diff=5689&amp;oldid=prev"/>
		<updated>2026-05-13T16:32:10Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;`model.gguf` was just a placeholder. You need to download an actual `.gguf` file, such as:  ```text gpt-oss-20b-mxfp4.gguf ```  For GPT-OSS 20B, the Hugging Face repo is:  ```...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;`model.gguf` was just a placeholder. You need to download an actual `.gguf` file, such as:&lt;br /&gt;
&lt;br /&gt;
```text&lt;br /&gt;
gpt-oss-20b-mxfp4.gguf&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
For GPT-OSS 20B, the Hugging Face repo is:&lt;br /&gt;
&lt;br /&gt;
```text&lt;br /&gt;
ggml-org/gpt-oss-20b-GGUF&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
The model page lists `gpt-oss-20b-mxfp4.gguf` as the file used by `llama.cpp` / `llama-cpp-python`, and the repo can also be loaded directly with `llama-server -hf ggml-org/gpt-oss-20b-GGUF`. ([Hugging Face][1])&lt;br /&gt;
&lt;br /&gt;
## Easiest: let `llama.cpp` download it&lt;br /&gt;
&lt;br /&gt;
From your `llama.cpp` directory:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
cd ~/llama.cpp&lt;br /&gt;
&lt;br /&gt;
./build/bin/llama-server \&lt;br /&gt;
  -hf ggml-org/gpt-oss-20b-GGUF \&lt;br /&gt;
  -ngl 999 \&lt;br /&gt;
  -c 4096 \&lt;br /&gt;
  --host 0.0.0.0 \&lt;br /&gt;
  --port 8080&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
The official quick start for this repo is basically:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 --jinja&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
([Hugging Face][2])&lt;br /&gt;
&lt;br /&gt;
For your P40, I would start with:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
-c 4096&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
rather than `-c 0` (which tells `llama.cpp` to use the model's full trained context), since the KV cache grows with the context window and can eat into the P40's 24 GB of VRAM.&lt;br /&gt;
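&lt;br /&gt;
To sanity-check memory headroom while the model loads, you can watch GPU usage (assuming NVIDIA drivers with `nvidia-smi` are installed):&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
# Refresh GPU memory readings every 2 seconds&lt;br /&gt;
watch -n 2 nvidia-smi --query-gpu=memory.used,memory.total --format=csv&lt;br /&gt;
```&lt;br /&gt;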
&lt;br /&gt;
## Manual download to a real file&lt;br /&gt;
&lt;br /&gt;
Install the Hugging Face CLI:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
sudo apt install -y python3-pip&lt;br /&gt;
python3 -m pip install --user -U huggingface_hub&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
Make sure your user-local Python bin path is active:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
export PATH=&amp;quot;$HOME/.local/bin:$PATH&amp;quot;&lt;br /&gt;
```&lt;br /&gt;
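&lt;br /&gt;
To make that permanent for future shells (assuming bash) and confirm the CLI resolves:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
# Append the export to ~/.bashrc so new shells pick it up&lt;br /&gt;
echo 'export PATH=&amp;quot;$HOME/.local/bin:$PATH&amp;quot;' | tee -a ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
# Should print the path to the executable&lt;br /&gt;
command -v huggingface-cli&lt;br /&gt;
```&lt;br /&gt;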
&lt;br /&gt;
Create a model directory:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
mkdir -p ~/models/gpt-oss-20b&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
Download the GGUF file (recent `huggingface_hub` releases write real files to `--local-dir` by default, so the deprecated `--local-dir-use-symlinks` flag is omitted here):&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
huggingface-cli download ggml-org/gpt-oss-20b-GGUF \&lt;br /&gt;
  gpt-oss-20b-mxfp4.gguf \&lt;br /&gt;
  --local-dir ~/models/gpt-oss-20b&lt;br /&gt;
```&lt;br /&gt;
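&lt;br /&gt;
A quick check that the whole file arrived; the mxfp4 quant is on the order of 12 GB (approximate figure, verify against the repo page):&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
ls -lh ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf&lt;br /&gt;
```&lt;br /&gt;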
&lt;br /&gt;
Then run it:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
cd ~/llama.cpp&lt;br /&gt;
&lt;br /&gt;
./build/bin/llama-server \&lt;br /&gt;
  -m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf \&lt;br /&gt;
  -ngl 999 \&lt;br /&gt;
  -c 4096 \&lt;br /&gt;
  --host 0.0.0.0 \&lt;br /&gt;
  --port 8080&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
## Test it&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
curl http://127.0.0.1:8080/v1/chat/completions \&lt;br /&gt;
  -H 'Content-Type: application/json' \&lt;br /&gt;
  -d '{&lt;br /&gt;
    &amp;quot;model&amp;quot;: &amp;quot;gpt-oss-20b&amp;quot;,&lt;br /&gt;
    &amp;quot;messages&amp;quot;: [&lt;br /&gt;
      {&lt;br /&gt;
        &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,&lt;br /&gt;
        &amp;quot;content&amp;quot;: &amp;quot;Write a minimal Go HTTP health check server.&amp;quot;&lt;br /&gt;
      }&lt;br /&gt;
    ],&lt;br /&gt;
    &amp;quot;temperature&amp;quot;: 0.2&lt;br /&gt;
  }'&lt;br /&gt;
```&lt;br /&gt;
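&lt;br /&gt;
`llama-server` also exposes a `/health` endpoint, which is handy for confirming the model has finished loading before you send chat requests:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
curl http://127.0.0.1:8080/health&lt;br /&gt;
```&lt;br /&gt;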
&lt;br /&gt;
## For Qwen later&lt;br /&gt;
&lt;br /&gt;
Same idea, but point at a Qwen GGUF repo instead; Hugging Face's docs cover serving GGUF models through `llama.cpp` in the same way, and a sketch follows below. ([Hugging Face][3])&lt;br /&gt;
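&lt;br /&gt;
For example, assuming Qwen's official `Qwen/Qwen2.5-7B-Instruct-GGUF` repo (verify the repo and pick the quant you want on its page):&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
./build/bin/llama-server \&lt;br /&gt;
  -hf Qwen/Qwen2.5-7B-Instruct-GGUF \&lt;br /&gt;
  -ngl 999 \&lt;br /&gt;
  -c 4096 \&lt;br /&gt;
  --host 0.0.0.0 \&lt;br /&gt;
  --port 8080&lt;br /&gt;
```&lt;br /&gt;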
&lt;br /&gt;
For now, get GPT-OSS 20B working first with:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
-hf ggml-org/gpt-oss-20b-GGUF&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
or with the downloaded file:&lt;br /&gt;
&lt;br /&gt;
```bash&lt;br /&gt;
-m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf&lt;br /&gt;
```&lt;br /&gt;
&lt;br /&gt;
[1]: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF &amp;quot;ggml-org/gpt-oss-20b-GGUF&amp;quot;&lt;br /&gt;
[2]: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/README.md?download=true &amp;quot;gpt-oss-20b-GGUF README&amp;quot;&lt;br /&gt;
[3]: https://huggingface.co/docs/inference-endpoints/engines/llama_cpp &amp;quot;llama.cpp&amp;quot;&lt;/div&gt;</summary>
		<author><name>Busk</name></author>
	</entry>
</feed>