Llama 1

`model.gguf` was just a placeholder. You need to download an actual `.gguf` file, such as:

```text
gpt-oss-20b-mxfp4.gguf
```

For GPT-OSS 20B, the Hugging Face repo is:

```text
ggml-org/gpt-oss-20b-GGUF
```

The model page lists `gpt-oss-20b-mxfp4.gguf` as the file used by llama.cpp / llama-cpp-python, and the repo can also be loaded directly with `llama-server -hf ggml-org/gpt-oss-20b-GGUF`. (Hugging Face)

Easiest: let llama.cpp download it

From your llama.cpp directory:

```bash
cd ~/llama.cpp

./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
```
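
Once it starts, confirm from another shell that the server is actually up; recent llama.cpp builds expose a simple `/health` endpoint:

```bash
# Should return {"status":"ok"} once the model has finished loading.
curl http://127.0.0.1:8080/health
```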

The official quick start for this repo is basically:

```bash
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 --jinja
```

(Hugging Face) In that command, `-c 0` tells llama-server to use the model's full trained context length, and `--jinja` enables the chat template embedded in the GGUF metadata.

For your P40, I would start with:

```text
-c 4096
```

rather than `-c 0` (the model's full context): the KV cache grows with the context window, and a large context can eat a big chunk of the P40's 24 GB of VRAM.
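
To see what a given context size actually costs, watch VRAM usage in a second terminal while the server loads and during generation:

```bash
# Refresh GPU memory usage every 2 seconds.
watch -n 2 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```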

Manual download to a real file

Install the Hugging Face CLI:

```bash
sudo apt install -y python3-pip
python3 -m pip install --user -U huggingface_hub
```
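
Note: on newer Debian/Ubuntu releases, pip may refuse the `--user` install with an "externally managed environment" error (PEP 668). If you hit that, a virtual environment is a clean workaround; the `~/.venvs/hf` path below is just an example:

```bash
sudo apt install -y python3-venv        # the venv module is not always preinstalled
python3 -m venv ~/.venvs/hf             # dedicated venv for the Hugging Face tooling
~/.venvs/hf/bin/pip install -U huggingface_hub
export PATH="$HOME/.venvs/hf/bin:$PATH" # so huggingface-cli resolves from this venv
```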

Make sure your user-local Python bin path is active (add this line to `~/.bashrc` to make it persistent):

```bash
export PATH="$HOME/.local/bin:$PATH"
```
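
Quick sanity check that the CLI now resolves:

```bash
# Prints the CLI's path if it is on PATH; prints nothing if the install or PATH step failed.
command -v huggingface-cli
```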

Create a model directory:

```bash
mkdir -p ~/models/gpt-oss-20b
```

Download the GGUF file:

```bash
huggingface-cli download ggml-org/gpt-oss-20b-GGUF \
  gpt-oss-20b-mxfp4.gguf \
  --local-dir ~/models/gpt-oss-20b \
  --local-dir-use-symlinks False
```

(Newer huggingface_hub releases deprecate `--local-dir-use-symlinks` and copy real files into `--local-dir` by default, so the flag may just print a warning.)
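
Before launching the server, verify the full file actually arrived; an interrupted download can leave a tiny stub behind:

```bash
# Expect a single multi-gigabyte .gguf file, not a few-kilobyte pointer.
ls -lh ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf
```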

Then run it:

```bash
cd ~/llama.cpp

./build/bin/llama-server \
  -m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
```
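
If you want the server to outlive your SSH session, one minimal option (short of a proper systemd unit) is to background it with `nohup`; the log path here is just an example:

```bash
cd ~/llama.cpp

# Detach from the terminal and log to a file; follow it with: tail -f ~/llama-server.log
nohup ./build/bin/llama-server \
  -m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf \
  -ngl 999 -c 4096 --host 0.0.0.0 --port 8080 \
  > ~/llama-server.log 2>&1 &
```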

Test it

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {
        "role": "user",
        "content": "Write a minimal Go HTTP health check server."
      }
    ],
    "temperature": 0.2
  }'
```
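
The response comes back as an OpenAI-style JSON body. To pull out just the generated text, pipe it through `jq` (install with `sudo apt install -y jq` if needed):

```bash
# -s silences the progress meter; jq extracts the assistant's message text.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}' \
  | jq -r '.choices[0].message.content'
```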

For Qwen later

Same idea, but choose a Qwen GGUF repo instead. For example, Qwen’s docs show running Qwen models through llama.cpp using GGUF files. (Hugging Face)
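
As a sketch, the command has the same shape as before; only the repo changes. The repo name below is one of Qwen's official GGUF repos at the time of writing, but verify the exact repo and quant file names on the Hub first; depending on your llama.cpp version you may also need to pin a specific quant using the `repo:QUANT` form:

```bash
# Same -hf pattern, pointed at a Qwen GGUF repo (repo name is an example; check the Hub).
./build/bin/llama-server \
  -hf Qwen/Qwen2.5-7B-Instruct-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
```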

For now, get GPT-OSS 20B working first with:

```text
-hf ggml-org/gpt-oss-20b-GGUF
```

or with the downloaded file:

```text
-m ~/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf
```