Llama optimization

This is a good stable baseline:

./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 8 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080

For your Threadripper + 128 GB RAM + Tesla P40, I’d optimize for stability first, not max speed. Your previous log showed GPT-OSS 20B fully loaded, with the P40 detected as compute capability 6.1 and about 22.9 GiB VRAM available, so the hardware is basically working.
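
Before launching, you can confirm what the server will see by querying the card directly. The compute_cap field needs a reasonably recent driver; if your nvidia-smi does not support it, drop it from the query:

nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
# A P40 should report roughly: Tesla P40, 6.1, 24576 MiB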

My recommended P40-stable launcher

Use this instead:

cd ~/llama.cpp

GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 12 \
  -b 512 \
  -ub 256 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080

Why:

GGML_CUDA_DISABLE_GRAPHS=1  Avoids CUDA graph edge cases on older GPUs.
-c 2048                     Lower context = lower KV cache pressure (see the sketch after this list).
-np 1                       One request/slot at a time.
-t 12                       Enough CPU threads without pegging the Threadripper.
-b 512 -ub 256              Smaller batches; possibly slower, but less prone to crashes.
--cache-ram 0               Disables the prompt-cache save/load path that appeared near your crash.
--no-warmup                 Skips extra GPU work at startup.
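
On the -c point, a back-of-the-envelope calculation shows why context size dominates KV cache memory. A minimal sketch with hypothetical model dimensions (not the real gpt-oss-20b values; check the model's metadata for those):

# Hypothetical dimensions, for illustration only.
N_LAYER=24; N_HEAD_KV=8; HEAD_DIM=64; N_CTX=2048
BYTES_PER_ELEM=2   # f16 K/V cache
# K and V each store n_layer * n_ctx * n_head_kv * head_dim elements.
KV_BYTES=$(( 2 * N_LAYER * N_CTX * N_HEAD_KV * HEAD_DIM * BYTES_PER_ELEM ))
echo "~$(( KV_BYTES / 1024 / 1024 )) MiB of KV cache per slot at this context"

With these numbers that is about 96 MiB per slot. Doubling -c doubles the figure, and each additional -np slot holds its own cache.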

CUDA graphs can improve performance, but they have also been a source of CUDA-specific edge cases and workarounds in llama.cpp; disabling them is a reasonable stability trade-off on Pascal cards like the P40. NVIDIA's CUDA-graphs writeup explains the performance motivation, and llama.cpp issue and discussion threads show disabling CUDA graphs used as a workaround. (NVIDIA Developer)
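
If you want to measure what disabling CUDA graphs actually costs on your card, llama-bench makes a quick A/B test. A sketch, assuming the -hf download landed in the default cache (typically under ~/.cache/llama.cpp on Linux; adjust the path to the actual file name):

MODEL=~/.cache/llama.cpp/*gpt-oss-20b*.gguf   # adjust to your cached file

# With CUDA graphs (default):
./build/bin/llama-bench -m $MODEL -ngl 999 -p 512 -n 128

# Without CUDA graphs:
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-bench -m $MODEL -ngl 999 -p 512 -n 128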

Auto-detect launcher script

Create this:

cat > ~/run-llama-gpt-oss.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

MODEL_REPO="${MODEL_REPO:-ggml-org/gpt-oss-20b-GGUF}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8080}"

# Detect CPU threads.
CPU_THREADS="$(nproc)"

# Use about half the CPU threads, capped. Good for responsiveness.
if (( CPU_THREADS >= 32 )); then
  THREADS=12
elif (( CPU_THREADS >= 16 )); then
  THREADS=8
else
  THREADS=$(( CPU_THREADS / 2 ))
  (( THREADS < 4 )) && THREADS=4
fi

# Detect system RAM in GiB.
RAM_GB="$(awk '/MemTotal/ { printf "%d", $2/1024/1024 }' /proc/meminfo)"

# Detect NVIDIA VRAM in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
  VRAM_MIB="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1 | tr -d ' ')"
  GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
  VRAM_MIB=0
  GPU_NAME="none"
fi

# Conservative defaults for older GPUs.
CTX=2048
NP=1
BATCH=512
UBATCH=256
NGL=999
CACHE_RAM=0
NO_WARMUP="--no-warmup"
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"

# If VRAM is very small, reduce context/batch.
if (( VRAM_MIB > 0 && VRAM_MIB < 16000 )); then
  CTX=1024
  BATCH=256
  UBATCH=128
fi

# If VRAM is >= 32GB, allow larger context.
if (( VRAM_MIB >= 32000 )); then
  CTX=4096
  BATCH=1024
  UBATCH=512
fi

# Detect Pascal/P40-ish GPU and keep safer settings.
if echo "$GPU_NAME" | grep -Eiq 'P40|P100|Pascal|GTX 10'; then
  CTX=2048
  NP=1
  BATCH=512
  UBATCH=256
  CACHE_RAM=0
  CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
fi

echo "Detected:"
echo "  CPU threads: $CPU_THREADS"
echo "  RAM:         ${RAM_GB} GiB"
echo "  GPU:         $GPU_NAME"
echo "  VRAM:        ${VRAM_MIB} MiB"
echo
echo "Launching llama-server:"
echo "  model repo:  $MODEL_REPO"
echo "  ctx:         $CTX"
echo "  parallel:    $NP"
echo "  threads:     $THREADS"
echo "  batch:       $BATCH"
echo "  ubatch:      $UBATCH"
echo "  cache-ram:   $CACHE_RAM"
echo

cd "$HOME/llama.cpp"

# CUDA_ENV and NO_WARMUP are intentionally unquoted so that empty values
# expand to nothing rather than an empty argument.
exec env $CUDA_ENV ./build/bin/llama-server \
  -hf "$MODEL_REPO" \
  -ngl "$NGL" \
  -c "$CTX" \
  -np "$NP" \
  -t "$THREADS" \
  -b "$BATCH" \
  -ub "$UBATCH" \
  $NO_WARMUP \
  --cache-ram "$CACHE_RAM" \
  --host "$HOST" \
  --port "$PORT"
EOF

chmod +x ~/run-llama-gpt-oss.sh

Run it:

~/run-llama-gpt-oss.sh
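
The script reads MODEL_REPO, HOST, and PORT from the environment, so overrides need no editing. Once it is up, llama-server exposes a /health endpoint you can poll:

# Override defaults via environment, no editing needed:
PORT=8081 ~/run-llama-gpt-oss.sh

# In another terminal, check readiness (errors while the model is still loading):
curl -s http://127.0.0.1:8080/health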

Test command

In another terminal:

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Output only valid Python code. No markdown. Create a small Dog class with name, breed, age, tricks, add_trick(), and __str__()."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 250
  }'
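
With "stream": true the server sends OpenAI-style server-sent events, one data: line per chunk, so raw curl output is noisy. A minimal sketch for extracting just the text, assuming the delta format matches the OpenAI chat schema:

curl -sN http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-20b","stream":true,"max_tokens":100,
       "messages":[{"role":"user","content":"Say hello."}]}' |
  sed -un 's/^data: //p' |
  while IFS= read -r chunk; do
    [ "$chunk" = "[DONE]" ] && break
    jq -rj '.choices[0].delta.content // empty' <<< "$chunk"
  done
echo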

About auto-detecting max_tokens

max_tokens is a per-request parameter, not a server setting. The server caps the context window with -c, but each API request should still set its own max_tokens.

Simple rule:

Small answer:      max_tokens 50-150
Small code:        max_tokens 200-500
Medium code:       max_tokens 800-1500
Long file/design:  max_tokens 2000+

For the P40, I’d keep most tests at:

"max_tokens": 250

Then increase only when needed.

Add a simple curl wrapper

Create this:

cat > ~/ask-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

PROMPT="${*:-Say hello.}"
MAX_TOKENS="${MAX_TOKENS:-300}"

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n \
    --arg prompt "$PROMPT" \
    --argjson max_tokens "$MAX_TOKENS" \
    '{
      model: "gpt-oss-20b",
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      temperature: 0.2,
      max_tokens: $max_tokens
    }')" | jq -r '.choices[0].message.content'
EOF

chmod +x ~/ask-llama.sh

Use it:

~/ask-llama.sh "Output only valid Python code. Create a hello world script."

With more output:

MAX_TOKENS=800 ~/ask-llama.sh "Output only valid Python code. Create a Dog class with name, breed, age, tricks, add_trick(), and __str__()."

My final recommendation

For your P40, use:

GGML_CUDA_DISABLE_GRAPHS=1
-c 2048
-np 1
-t 12
-b 512
-ub 256
--cache-ram 0
--no-warmup

After it is stable for a while, try performance increases one at a time:

1. Increase -t from 12 to 16
2. Increase -c from 2048 to 4096
3. Increase -b 512 to 1024
4. Re-enable CUDA graphs only if everything else is stable

Change one thing at a time; if you tune everything at once, you will not know what caused the next crash.
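
To see whether each change actually helped, time a fixed request between steps. A rough throughput probe, assuming the response includes an OpenAI-style usage object with completion_tokens:

START=$(date +%s.%N)
TOKENS=$(curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-20b","max_tokens":200,
       "messages":[{"role":"user","content":"Count from 1 to 50."}]}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
awk -v t="$TOKENS" -v a="$START" -v b="$END" \
  'BEGIN { printf "%d tokens in %.1fs (%.1f tok/s)\n", t, b-a, t/(b-a) }'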