Llama optimization
Yes, that is a good stable baseline:
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 8 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
For your Threadripper + 128 GB RAM + Tesla P40, I’d optimize for stability first, not max speed. Your previous log showed GPT-OSS 20B fully loaded, with the P40 detected as compute capability 6.1 and about 22.9 GiB VRAM available, so the hardware is basically working.
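If you want to double-check what the driver reports before tuning, this one-liner prints the GPU name, total VRAM, and compute capability (the compute_cap query field needs a reasonably recent nvidia-smi; name and memory.total work on older drivers too):
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
On the P40 the output should line up with the compute capability 6.1 and roughly 23 GiB of VRAM seen in your earlier log.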
My recommended P40-stable launcher
Use this instead:
cd ~/llama.cpp
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 12 \
  -b 512 \
  -ub 256 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
Why:
- GGML_CUDA_DISABLE_GRAPHS=1: avoids CUDA graph weirdness on older GPUs.
- -c 2048: lower context means lower KV-cache pressure.
- -np 1: one request/slot at a time.
- -t 12: enough CPU threads without pegging the Threadripper.
- -b 512 -ub 256: smaller batches; maybe slower, but less crash-prone.
- --cache-ram 0: avoids the prompt-cache save/load path that appeared near your crash.
- --no-warmup: avoids extra GPU work at startup.
CUDA graphs can improve performance, but they are also an area where llama.cpp has had GPU-specific edge cases; disabling them is a reasonable stability choice on Pascal cards like the P40. NVIDIA's CUDA-graphs writeup explains why they help performance, and llama.cpp issue/discussion threads show disabling them used as a workaround when they misbehave. (NVIDIA Developer)
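If you want to measure the effect on your own box, the simplest A/B test is to run the same short prompt against the server once with the variable set and once without, keeping every other flag identical, and compare the tokens/sec reported in the server log:
# Run A: graphs disabled (the stable baseline above)
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF -ngl 999 -c 2048 --port 8080
# Run B: graphs left at the default, only after Run A has been stable
./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF -ngl 999 -c 2048 --port 8080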
Auto-detect launcher script
Create this:
cat > ~/run-llama-gpt-oss.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
MODEL_REPO="${MODEL_REPO:-ggml-org/gpt-oss-20b-GGUF}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8080}"
# Detect CPU threads.
CPU_THREADS="$(nproc)"
# Use about half the CPU threads, capped. Good for responsiveness.
if (( CPU_THREADS >= 32 )); then
THREADS=12
elif (( CPU_THREADS >= 16 )); then
THREADS=8
else
THREADS=$(( CPU_THREADS / 2 ))
(( THREADS < 4 )) && THREADS=4
fi
# Detect system RAM in GiB.
RAM_GB="$(awk '/MemTotal/ { printf "%d", $2/1024/1024 }' /proc/meminfo)"
# Detect NVIDIA VRAM in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
VRAM_MIB="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1 | tr -d ' ')"
GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
VRAM_MIB=0
GPU_NAME="none"
fi
# Conservative defaults for older GPUs.
CTX=2048
NP=1
BATCH=512
UBATCH=256
NGL=999
CACHE_RAM=0
NO_WARMUP="--no-warmup"
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
# If VRAM is very small, reduce context/batch.
if (( VRAM_MIB > 0 && VRAM_MIB < 16000 )); then
CTX=1024
BATCH=256
UBATCH=128
fi
# If VRAM is >= 32GB, allow larger context.
if (( VRAM_MIB >= 32000 )); then
CTX=4096
BATCH=1024
UBATCH=512
fi
# Detect Pascal/P40-ish GPU and keep safer settings.
if echo "$GPU_NAME" | grep -Eiq 'P40|P100|Pascal|GTX 10'; then
CTX=2048
NP=1
BATCH=512
UBATCH=256
CACHE_RAM=0
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
fi
echo "Detected:"
echo " CPU threads: $CPU_THREADS"
echo " RAM: ${RAM_GB} GiB"
echo " GPU: $GPU_NAME"
echo " VRAM: ${VRAM_MIB} MiB"
echo
echo "Launching llama-server:"
echo " model repo: $MODEL_REPO"
echo " ctx: $CTX"
echo " parallel: $NP"
echo " threads: $THREADS"
echo " batch: $BATCH"
echo " ubatch: $UBATCH"
echo " cache-ram: $CACHE_RAM"
echo
cd "$HOME/llama.cpp"
exec env $CUDA_ENV ./build/bin/llama-server \
-hf "$MODEL_REPO" \
-ngl "$NGL" \
-c "$CTX" \
-np "$NP" \
-t "$THREADS" \
-b "$BATCH" \
-ub "$UBATCH" \
$NO_WARMUP \
--cache-ram "$CACHE_RAM" \
--host "$HOST" \
--port "$PORT"
EOF
chmod +x ~/run-llama-gpt-oss.sh
Run it:
~/run-llama-gpt-oss.sh
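The script reads MODEL_REPO, HOST, and PORT from the environment, so you can change the port or point it at a different GGUF repo without editing the file (the second repo name below is just a placeholder):
PORT=8081 ~/run-llama-gpt-oss.sh
MODEL_REPO=some-org/some-model-GGUF ~/run-llama-gpt-oss.sh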
Test command
In another terminal:
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-oss-20b",
"stream": true,
"messages": [
{
"role": "user",
"content": "Output only valid Python code. No markdown. Create a small Dog class with name, breed, age, tricks, add_trick(), and __str__()."
}
],
"temperature": 0.2,
"max_tokens": 250
}'
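If the request hangs, first confirm the server is actually up. Recent llama-server builds expose a /health endpoint and an OpenAI-style model listing:
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/models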
About auto-detecting max_tokens
max_tokens is a per-request parameter, not a server setting. The server caps the total context with -c, but each API request should still set its own max_tokens.
Simple rule:
- Small answer: max_tokens 50-150
- Small code: max_tokens 200-500
- Medium code: max_tokens 800-1500
- Long file/design: max_tokens 2000+
For the P40, I’d keep most tests at:
"max_tokens": 250
Then increase only when needed.
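If you want that rule of thumb in script form, here is a minimal sketch; TASK_SIZE is a made-up variable and the numbers are just the buckets above, nothing the server enforces:
# Pick max_tokens from the rough buckets above (hypothetical helper).
case "${TASK_SIZE:-default}" in
  answer)      MAX_TOKENS=150  ;;
  small-code)  MAX_TOKENS=400  ;;
  medium-code) MAX_TOKENS=1200 ;;
  long)        MAX_TOKENS=2000 ;;
  *)           MAX_TOKENS=250  ;;  # conservative P40 default used in the tests
esac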
Add a simple curl wrapper
Create this:
cat > ~/ask-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
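# Requires jq (used to build the request JSON and to extract the reply).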
PROMPT="${*:-Say hello.}"
MAX_TOKENS="${MAX_TOKENS:-300}"
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d "$(jq -n \
--arg prompt "$PROMPT" \
--argjson max_tokens "$MAX_TOKENS" \
'{
model: "gpt-oss-20b",
messages: [
{
role: "user",
content: $prompt
}
],
temperature: 0.2,
max_tokens: $max_tokens
}')" | jq -r '.choices[0].message.content'
EOF
chmod +x ~/ask-llama.sh
Use it:
~/ask-llama.sh "Output only valid Python code. Create a hello world script."
With more output:
MAX_TOKENS=800 ~/ask-llama.sh "Output only valid Python code. Create a Dog class with name, breed, age, tricks, add_trick(), and __str__()."
My final recommendation
For your P40, use:
GGML_CUDA_DISABLE_GRAPHS=1 -c 2048 -np 1 -t 12 -b 512 -ub 256 --cache-ram 0 --no-warmup
After it is stable for a while, try performance increases one at a time:
1. Increase -t from 12 to 16
2. Increase -c from 2048 to 4096
3. Increase -b from 512 to 1024
4. Re-enable CUDA graphs only if everything else is stable
Do not tune everything at once, because then you will not know what caused the next crash.
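When you do change one knob, a crude before/after timing of the wrapper script against a fixed prompt is usually enough to tell whether it helped; run it a couple of times before and after each change:
time MAX_TOKENS=250 ~/ask-llama.sh "Output only valid Python code. Create a hello world script."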