Llama optimization

This is a good stable baseline:

./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 8 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080

For your Threadripper + 128 GB RAM + Tesla P40, I’d optimize for stability first, not max speed. Your previous log showed GPT-OSS 20B fully loaded, with the P40 detected as compute capability 6.1 and about 22.9 GiB VRAM available, so the hardware is basically working.
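
Before launching, you can confirm what the server will see by querying the card directly. The compute_cap field needs a reasonably recent driver; if your nvidia-smi does not support it, drop it from the query:

nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
# A P40 should report roughly: Tesla P40, 6.1, 24576 MiB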

My recommended P40-stable launcher

Use this instead:

cd ~/llama.cpp

GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 12 \
  -b 512 \
  -ub 256 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080

Why:

GGML_CUDA_DISABLE_GRAPHS=1  Avoids CUDA graph edge cases on older GPUs.
-c 2048                     Lower context = lower KV cache pressure (see the sketch after this list).
-np 1                       One request/slot at a time.
-t 12                       Enough CPU threads without pegging the Threadripper.
-b 512 -ub 256              Smaller batches; possibly slower, but less prone to crashes.
--cache-ram 0               Disables the prompt-cache save/load path that appeared near your crash.
--no-warmup                 Skips extra GPU work at startup.
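
On the -c point, a back-of-the-envelope calculation shows why context size dominates KV cache memory. A minimal sketch with hypothetical model dimensions (not the real gpt-oss-20b values; check the model's metadata for those):

# Hypothetical dimensions, for illustration only.
N_LAYER=24; N_HEAD_KV=8; HEAD_DIM=64; N_CTX=2048
BYTES_PER_ELEM=2   # f16 K/V cache
# K and V each store n_layer * n_ctx * n_head_kv * head_dim elements.
KV_BYTES=$(( 2 * N_LAYER * N_CTX * N_HEAD_KV * HEAD_DIM * BYTES_PER_ELEM ))
echo "~$(( KV_BYTES / 1024 / 1024 )) MiB of KV cache per slot at this context"

With these numbers that is about 96 MiB per slot. Doubling -c doubles the figure, and each additional -np slot holds its own cache.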

CUDA graphs can improve performance, but they have also been a source of CUDA-specific edge cases and workarounds in llama.cpp; disabling them is a reasonable stability trade-off on Pascal cards like the P40. NVIDIA's CUDA-graphs writeup explains the performance motivation, and llama.cpp issue and discussion threads show disabling CUDA graphs used as a workaround. (NVIDIA Developer)
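
If you want to measure what disabling CUDA graphs actually costs on your card, llama-bench makes a quick A/B test. A sketch, assuming the -hf download landed in the default cache (typically under ~/.cache/llama.cpp on Linux; adjust the path to the actual file name):

MODEL=~/.cache/llama.cpp/*gpt-oss-20b*.gguf   # adjust to your cached file

# With CUDA graphs (default):
./build/bin/llama-bench -m $MODEL -ngl 999 -p 512 -n 128

# Without CUDA graphs:
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-bench -m $MODEL -ngl 999 -p 512 -n 128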

Auto-detect launcher script

Create this:

cat > ~/run-llama-gpt-oss.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

MODEL_REPO="${MODEL_REPO:-ggml-org/gpt-oss-20b-GGUF}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8080}"

# Detect CPU threads.
CPU_THREADS="$(nproc)"

# Use about half the CPU threads, capped. Good for responsiveness.
if (( CPU_THREADS >= 32 )); then
  THREADS=12
elif (( CPU_THREADS >= 16 )); then
  THREADS=8
else
  THREADS=$(( CPU_THREADS / 2 ))
  (( THREADS < 4 )) && THREADS=4
fi

# Detect system RAM in GiB.
RAM_GB="$(awk '/MemTotal/ { printf "%d", $2/1024/1024 }' /proc/meminfo)"

# Detect NVIDIA VRAM in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
  VRAM_MIB="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1 | tr -d ' ')"
  GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
  VRAM_MIB=0
  GPU_NAME="none"
fi

# Conservative defaults for older GPUs.
CTX=2048
NP=1
BATCH=512
UBATCH=256
NGL=999
CACHE_RAM=0
NO_WARMUP="--no-warmup"
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"

# If VRAM is very small, reduce context/batch.
if (( VRAM_MIB > 0 && VRAM_MIB < 16000 )); then
  CTX=1024
  BATCH=256
  UBATCH=128
fi

# If VRAM is >= 32GB, allow larger context.
if (( VRAM_MIB >= 32000 )); then
  CTX=4096
  BATCH=1024
  UBATCH=512
fi

# Detect Pascal/P40-ish GPU and keep safer settings.
if echo "$GPU_NAME" | grep -Eiq 'P40|P100|Pascal|GTX 10'; then
  CTX=2048
  NP=1
  BATCH=512
  UBATCH=256
  CACHE_RAM=0
  CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
fi

echo "Detected:"
echo "  CPU threads: $CPU_THREADS"
echo "  RAM:         ${RAM_GB} GiB"
echo "  GPU:         $GPU_NAME"
echo "  VRAM:        ${VRAM_MIB} MiB"
echo
echo "Launching llama-server:"
echo "  model repo:  $MODEL_REPO"
echo "  ctx:         $CTX"
echo "  parallel:    $NP"
echo "  threads:     $THREADS"
echo "  batch:       $BATCH"
echo "  ubatch:      $UBATCH"
echo "  cache-ram:   $CACHE_RAM"
echo

cd "$HOME/llama.cpp"

# CUDA_ENV and NO_WARMUP are intentionally unquoted so that empty values
# expand to nothing rather than an empty argument.
exec env $CUDA_ENV ./build/bin/llama-server \
  -hf "$MODEL_REPO" \
  -ngl "$NGL" \
  -c "$CTX" \
  -np "$NP" \
  -t "$THREADS" \
  -b "$BATCH" \
  -ub "$UBATCH" \
  $NO_WARMUP \
  --cache-ram "$CACHE_RAM" \
  --host "$HOST" \
  --port "$PORT"
EOF

chmod +x ~/run-llama-gpt-oss.sh

Run it:

~/run-llama-gpt-oss.sh
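
The script reads MODEL_REPO, HOST, and PORT from the environment, so overrides need no editing. Once it is up, llama-server exposes a /health endpoint you can poll:

# Override defaults via environment, no editing needed:
PORT=8081 ~/run-llama-gpt-oss.sh

# In another terminal, check readiness (errors while the model is still loading):
curl -s http://127.0.0.1:8080/health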

Test command

In another terminal:

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Output only valid Python code. No markdown. Create a small Dog class with name, breed, age, tricks, add_trick(), and __str__()."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 250
  }'
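
With "stream": true the server sends OpenAI-style server-sent events, one data: line per chunk, so raw curl output is noisy. A minimal sketch for extracting just the text, assuming the delta format matches the OpenAI chat schema:

curl -sN http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-20b","stream":true,"max_tokens":100,
       "messages":[{"role":"user","content":"Say hello."}]}' |
  sed -un 's/^data: //p' |
  while IFS= read -r chunk; do
    [ "$chunk" = "[DONE]" ] && break
    jq -rj '.choices[0].delta.content // empty' <<< "$chunk"
  done
echo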

About auto-detecting max_tokens

max_tokens is a per-request parameter, not a server setting. The server caps the context window with -c, but each API request should still set its own max_tokens.

Simple rule:

Small answer:      max_tokens 50-150
Small code:        max_tokens 200-500
Medium code:       max_tokens 800-1500
Long file/design:  max_tokens 2000+

For the P40, I’d keep most tests at:

"max_tokens": 250

Then increase only when needed.

Add a simple curl wrapper

Create this:

cat > ~/ask-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

PROMPT="${*:-Say hello.}"
MAX_TOKENS="${MAX_TOKENS:-300}"

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n \
    --arg prompt "$PROMPT" \
    --argjson max_tokens "$MAX_TOKENS" \
    '{
      model: "gpt-oss-20b",
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      temperature: 0.2,
      max_tokens: $max_tokens
    }')" | jq -r '.choices[0].message.content'
EOF

chmod +x ~/ask-llama.sh

Use it:

~/ask-llama.sh "Output only valid Python code. Create a hello world script."

With more output:

MAX_TOKENS=800 ~/ask-llama.sh "Output only valid Python code. Create a Dog class with name, breed, age, tricks, add_trick(), and __str__()."

My final recommendation

For your P40, use:

GGML_CUDA_DISABLE_GRAPHS=1
-c 2048
-np 1
-t 12
-b 512
-ub 256
--cache-ram 0
--no-warmup

After it is stable for a while, try performance increases one at a time:

1. Increase -t from 12 to 16
2. Increase -c from 2048 to 4096
3. Increase -b 512 to 1024
4. Re-enable CUDA graphs only if everything else is stable

Change one thing at a time; if you tune everything at once, you will not know what caused the next crash.
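
To see whether each change actually helped, time a fixed request between steps. A rough throughput probe, assuming the response includes an OpenAI-style usage object with completion_tokens:

START=$(date +%s.%N)
TOKENS=$(curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-20b","max_tokens":200,
       "messages":[{"role":"user","content":"Count from 1 to 50."}]}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
awk -v t="$TOKENS" -v a="$START" -v b="$END" \
  'BEGIN { printf "%d tokens in %.1fs (%.1f tok/s)\n", t, b-a, t/(b-a) }'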