# Llama optimization

	<link rel="self" type="application/atom+xml" href="https://tech.uvoo.io/index.php?action=history&amp;feed=atom&amp;title=Llama_optimization"/>
	<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_optimization&amp;action=history"/>
	<updated>2026-05-14T18:13:35Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.2</generator>
	<entry>
		<id>https://tech.uvoo.io/index.php?title=Llama_optimization&amp;diff=5690&amp;oldid=prev</id>
		<title>Busk: Created page with &quot;Yes, that is a good **stable baseline**:  ```bash ./build/bin/llama-server \   -hf ggml-org/gpt-oss-20b-GGUF \   -ngl 999 \   -c 2048 \   -np 1 \   -t 8 \   --no-warmup \   --...&quot;</title>
		<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_optimization&amp;diff=5690&amp;oldid=prev"/>
		<updated>2026-05-13T17:32:49Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Yes, that is a good **stable baseline**:  ```bash ./build/bin/llama-server \   -hf ggml-org/gpt-oss-20b-GGUF \   -ngl 999 \   -c 2048 \   -np 1 \   -t 8 \   --no-warmup \   --...&amp;quot;&lt;/p&gt;
This is a good **stable baseline**:

```bash
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 8 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
```

For your **Threadripper + 128 GB RAM + Tesla P40**, I'd optimize for **stability first**, not maximum speed. Your previous log showed GPT-OSS 20B fully loaded, with the P40 detected as compute capability 6.1 and about 22.9 GiB of VRAM available, so the hardware itself is working.

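To confirm what llama.cpp will see before launching, you can query the card directly. This is only a quick sanity check, not part of the recommended setup; note that the `compute_cap` query field requires a reasonably recent NVIDIA driver.

```bash
# Show GPU model, compute capability, and total VRAM as reported by the driver.
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
```
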
## My recommended P40-stable launcher

Use this instead:

```bash
cd ~/llama.cpp

GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 12 \
  -b 512 \
  -ub 256 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
```

Why:

```text
GGML_CUDA_DISABLE_GRAPHS=1  Disables CUDA graphs, which have caused problems on older GPUs.
-c 2048                     Lower context = lower KV-cache pressure.
-np 1                       One request/slot at a time.
-t 12                       Enough CPU threads without pegging the Threadripper.
-b 512 -ub 256              Smaller batches; possibly slower, but less likely to crash.
--cache-ram 0               Disables the prompt-cache save/load path that appeared near your crash.
--no-warmup                 Skips extra GPU work at startup.
```

CUDA graphs can improve performance, but they are also an area where llama.cpp has had CUDA-specific edge cases and workarounds, so disabling them is a reasonable stability choice on Pascal/P40. NVIDIA's CUDA-graphs writeup explains why they are used for performance, while llama.cpp issue and discussion threads show cases where disabling CUDA graphs is used as a workaround. ([NVIDIA Developer][1])

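If you want to measure what disabling CUDA graphs actually costs on the P40, a rough A/B comparison with `llama-bench` is enough. The model path below is an assumption for illustration; point `-m` at wherever your downloaded GGUF actually lives.

```bash
# Baseline run with CUDA graphs left at the build's default behavior.
./build/bin/llama-bench -m ~/models/gpt-oss-20b.gguf -p 512 -n 128

# Same run with CUDA graphs disabled, matching the stable launcher above.
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-bench -m ~/models/gpt-oss-20b.gguf -p 512 -n 128
```
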
## Auto-detect launcher script

Create this:

```bash
cat > ~/run-llama-gpt-oss.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

MODEL_REPO="${MODEL_REPO:-ggml-org/gpt-oss-20b-GGUF}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8080}"

# Detect CPU threads.
CPU_THREADS="$(nproc)"

# Use about half the CPU threads, capped. Good for responsiveness.
if (( CPU_THREADS >= 32 )); then
  THREADS=12
elif (( CPU_THREADS >= 16 )); then
  THREADS=8
else
  THREADS=$(( CPU_THREADS / 2 ))
  (( THREADS < 4 )) && THREADS=4
fi

# Detect system RAM in GiB.
RAM_GB="$(awk '/MemTotal/ { printf "%d", $2/1024/1024 }' /proc/meminfo)"

# Detect NVIDIA VRAM in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
  VRAM_MIB="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1 | tr -d ' ')"
  GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
  VRAM_MIB=0
  GPU_NAME="none"
fi

# Conservative defaults for older GPUs.
CTX=2048
NP=1
BATCH=512
UBATCH=256
NGL=999
CACHE_RAM=0
NO_WARMUP="--no-warmup"
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"

# If VRAM is very small, reduce context/batch.
if (( VRAM_MIB > 0 && VRAM_MIB < 16000 )); then
  CTX=1024
  BATCH=256
  UBATCH=128
fi

# If VRAM is >= 32 GiB, allow larger context.
if (( VRAM_MIB >= 32000 )); then
  CTX=4096
  BATCH=1024
  UBATCH=512
fi

# Detect Pascal/P40-ish GPUs and keep the safer settings.
if echo "$GPU_NAME" | grep -Eiq 'P40|P100|Pascal|GTX 10'; then
  CTX=2048
  NP=1
  BATCH=512
  UBATCH=256
  CACHE_RAM=0
  CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
fi

echo "Detected:"
echo "  CPU threads: $CPU_THREADS"
echo "  RAM:         ${RAM_GB} GiB"
echo "  GPU:         $GPU_NAME"
echo "  VRAM:        ${VRAM_MIB} MiB"
echo
echo "Launching llama-server:"
echo "  model repo:  $MODEL_REPO"
echo "  ctx:         $CTX"
echo "  parallel:    $NP"
echo "  threads:     $THREADS"
echo "  batch:       $BATCH"
echo "  ubatch:      $UBATCH"
echo "  cache-ram:   $CACHE_RAM"
echo

cd "$HOME/llama.cpp"

exec env $CUDA_ENV ./build/bin/llama-server \
  -hf "$MODEL_REPO" \
  -ngl "$NGL" \
  -c "$CTX" \
  -np "$NP" \
  -t "$THREADS" \
  -b "$BATCH" \
  -ub "$UBATCH" \
  $NO_WARMUP \
  --cache-ram "$CACHE_RAM" \
  --host "$HOST" \
  --port "$PORT"
EOF

chmod +x ~/run-llama-gpt-oss.sh
```

Run it:

```bash
~/run-llama-gpt-oss.sh
```

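Since the script reads `MODEL_REPO`, `HOST`, and `PORT` from the environment, you can override them for a single run without editing the file:

```bash
# Launch on a different port; MODEL_REPO and HOST can be overridden the same way.
PORT=8081 ~/run-llama-gpt-oss.sh
```
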
## Test command

In another terminal:

```bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Output only valid Python code. No markdown. Create a small Dog class with name, breed, age, tricks, add_trick(), and __str__()."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 250
  }'
```

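If the request hangs or errors, first check that the server has finished loading; llama-server exposes a health endpoint for this:

```bash
# Returns a small JSON status once the model is loaded and the server is ready.
curl -s http://127.0.0.1:8080/health
```
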
## About auto-detecting `max_tokens`

`max_tokens` is a **per-request** parameter, not a server setting. The server controls the context window with `-c`, but each API request should still set its own `max_tokens`.

Simple rule:

```text
Small answer:      max_tokens 50-150
Small code:        max_tokens 200-500
Medium code:       max_tokens 800-1500
Long file/design:  max_tokens 2000+
```

For the P40, I'd keep most tests at:

```json
"max_tokens": 250
```

Then increase only when needed.

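To calibrate these numbers, you can check how many tokens a non-streaming request actually used; the OpenAI-compatible endpoint returns a `usage` object alongside the completion (field names assumed to follow the usual OpenAI schema):

```bash
# Send a short non-streaming request and print only the token accounting.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 50
  }' | jq '.usage'
```
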
## Add a simple curl wrapper

Create this:

```bash
cat > ~/ask-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

PROMPT="${*:-Say hello.}"
MAX_TOKENS="${MAX_TOKENS:-300}"

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n \
    --arg prompt "$PROMPT" \
    --argjson max_tokens "$MAX_TOKENS" \
    '{
      model: "gpt-oss-20b",
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      temperature: 0.2,
      max_tokens: $max_tokens
    }')" | jq -r '.choices[0].message.content'
EOF

chmod +x ~/ask-llama.sh
```

Use it:

```bash
~/ask-llama.sh "Output only valid Python code. Create a hello world script."
```

With more output:

```bash
MAX_TOKENS=800 ~/ask-llama.sh "Output only valid Python code. Create a Dog class with name, breed, age, tricks, add_trick(), and __str__()."
```

## My final recommendation

For your P40, use:

```bash
GGML_CUDA_DISABLE_GRAPHS=1
-c 2048
-np 1
-t 12
-b 512
-ub 256
--cache-ram 0
--no-warmup
```

After it has been stable for a while, try performance increases one at a time:

```text
1. Increase -t from 12 to 16
2. Increase -c from 2048 to 4096
3. Increase -b from 512 to 1024
4. Re-enable CUDA graphs only if everything else is stable
```

Do not tune everything at once, or you will not know which change caused the next crash.

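While you test each change, it helps to watch VRAM and GPU utilization in a second terminal, so a crash can be tied to memory pressure rather than guessed at:

```bash
# Refresh the standard nvidia-smi view once per second while the server runs.
watch -n 1 nvidia-smi
```
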
[1]: https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/ "Optimizing llama.cpp AI Inference with CUDA Graphs"