# Llama optimization

	<link rel="self" type="application/atom+xml" href="https://tech.uvoo.io/index.php?action=history&amp;feed=atom&amp;title=Llama_optimization"/>
	<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_optimization&amp;action=history"/>
	<updated>2026-05-14T18:13:35Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.2</generator>
	<entry>
		<id>https://tech.uvoo.io/index.php?title=Llama_optimization&amp;diff=5690&amp;oldid=prev</id>
		<title>Busk: Created page with &quot;Yes, that is a good **stable baseline**:  ```bash ./build/bin/llama-server \   -hf ggml-org/gpt-oss-20b-GGUF \   -ngl 999 \   -c 2048 \   -np 1 \   -t 8 \   --no-warmup \   --...&quot;</title>
		<link rel="alternate" type="text/html" href="https://tech.uvoo.io/index.php?title=Llama_optimization&amp;diff=5690&amp;oldid=prev"/>
		<updated>2026-05-13T17:32:49Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Yes, that is a good **stable baseline**:  ```bash ./build/bin/llama-server \   -hf ggml-org/gpt-oss-20b-GGUF \   -ngl 999 \   -c 2048 \   -np 1 \   -t 8 \   --no-warmup \   --...&amp;quot;&lt;/p&gt;
This is a good **stable baseline**:

```bash
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 8 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
```

For your **Threadripper + 128 GB RAM + Tesla P40**, I'd optimize for **stability first**, not maximum speed. Your previous log showed GPT-OSS 20B fully loaded, with the P40 detected as compute capability 6.1 and about 22.9 GiB of VRAM available, so the hardware itself is working.

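To confirm what llama.cpp will see before launching, you can query the card directly. This is only a quick sanity check, not part of the recommended setup; note that the `compute_cap` query field requires a reasonably recent NVIDIA driver.

```bash
# Show GPU model, compute capability, and total VRAM as reported by the driver.
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
```
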
## My recommended P40-stable launcher

Use this instead:

```bash
cd ~/llama.cpp

GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 2048 \
  -np 1 \
  -t 12 \
  -b 512 \
  -ub 256 \
  --no-warmup \
  --cache-ram 0 \
  --host 0.0.0.0 \
  --port 8080
```

Why:

```text
GGML_CUDA_DISABLE_GRAPHS=1  Disables CUDA graphs, which have caused problems on older GPUs.
-c 2048                     Lower context = lower KV-cache pressure.
-np 1                       One request/slot at a time.
-t 12                       Enough CPU threads without pegging the Threadripper.
-b 512 -ub 256              Smaller batches; possibly slower, but less likely to crash.
--cache-ram 0               Disables the prompt-cache save/load path that appeared near your crash.
--no-warmup                 Skips extra GPU work at startup.
```

CUDA graphs can improve performance, but they are also an area where llama.cpp has had CUDA-specific edge cases and workarounds, so disabling them is a reasonable stability choice on Pascal/P40. NVIDIA's CUDA-graphs writeup explains why they are used for performance, while llama.cpp issue and discussion threads show cases where disabling CUDA graphs is used as a workaround. ([NVIDIA Developer][1])

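If you want to measure what disabling CUDA graphs actually costs on the P40, a rough A/B comparison with `llama-bench` is enough. The model path below is an assumption for illustration; point `-m` at wherever your downloaded GGUF actually lives.

```bash
# Baseline run with CUDA graphs left at the build's default behavior.
./build/bin/llama-bench -m ~/models/gpt-oss-20b.gguf -p 512 -n 128

# Same run with CUDA graphs disabled, matching the stable launcher above.
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-bench -m ~/models/gpt-oss-20b.gguf -p 512 -n 128
```
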
## Auto-detect launcher script

Create this:

```bash
cat > ~/run-llama-gpt-oss.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

MODEL_REPO="${MODEL_REPO:-ggml-org/gpt-oss-20b-GGUF}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8080}"

# Detect CPU threads.
CPU_THREADS="$(nproc)"

# Use about half the CPU threads, capped. Good for responsiveness.
if (( CPU_THREADS >= 32 )); then
  THREADS=12
elif (( CPU_THREADS >= 16 )); then
  THREADS=8
else
  THREADS=$(( CPU_THREADS / 2 ))
  (( THREADS < 4 )) && THREADS=4
fi

# Detect system RAM in GiB.
RAM_GB="$(awk '/MemTotal/ { printf "%d", $2/1024/1024 }' /proc/meminfo)"

# Detect NVIDIA VRAM in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
  VRAM_MIB="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1 | tr -d ' ')"
  GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
  VRAM_MIB=0
  GPU_NAME="none"
fi

# Conservative defaults for older GPUs.
CTX=2048
NP=1
BATCH=512
UBATCH=256
NGL=999
CACHE_RAM=0
NO_WARMUP="--no-warmup"
CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"

# If VRAM is very small, reduce context/batch.
if (( VRAM_MIB > 0 && VRAM_MIB < 16000 )); then
  CTX=1024
  BATCH=256
  UBATCH=128
fi

# If VRAM is >= 32 GiB, allow larger context.
if (( VRAM_MIB >= 32000 )); then
  CTX=4096
  BATCH=1024
  UBATCH=512
fi

# Detect Pascal/P40-ish GPUs and keep the safer settings.
if echo "$GPU_NAME" | grep -Eiq 'P40|P100|Pascal|GTX 10'; then
  CTX=2048
  NP=1
  BATCH=512
  UBATCH=256
  CACHE_RAM=0
  CUDA_ENV="GGML_CUDA_DISABLE_GRAPHS=1"
fi

echo "Detected:"
echo "  CPU threads: $CPU_THREADS"
echo "  RAM:         ${RAM_GB} GiB"
echo "  GPU:         $GPU_NAME"
echo "  VRAM:        ${VRAM_MIB} MiB"
echo
echo "Launching llama-server:"
echo "  model repo:  $MODEL_REPO"
echo "  ctx:         $CTX"
echo "  parallel:    $NP"
echo "  threads:     $THREADS"
echo "  batch:       $BATCH"
echo "  ubatch:      $UBATCH"
echo "  cache-ram:   $CACHE_RAM"
echo

cd "$HOME/llama.cpp"

exec env $CUDA_ENV ./build/bin/llama-server \
  -hf "$MODEL_REPO" \
  -ngl "$NGL" \
  -c "$CTX" \
  -np "$NP" \
  -t "$THREADS" \
  -b "$BATCH" \
  -ub "$UBATCH" \
  $NO_WARMUP \
  --cache-ram "$CACHE_RAM" \
  --host "$HOST" \
  --port "$PORT"
EOF

chmod +x ~/run-llama-gpt-oss.sh
```

Run it:

```bash
~/run-llama-gpt-oss.sh
```

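Since the script reads `MODEL_REPO`, `HOST`, and `PORT` from the environment, you can override them for a single run without editing the file:

```bash
# Launch on a different port; MODEL_REPO and HOST can be overridden the same way.
PORT=8081 ~/run-llama-gpt-oss.sh
```
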
## Test command

In another terminal:

```bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Output only valid Python code. No markdown. Create a small Dog class with name, breed, age, tricks, add_trick(), and __str__()."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 250
  }'
```

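If the request hangs or errors, first check that the server has finished loading; llama-server exposes a health endpoint for this:

```bash
# Returns a small JSON status once the model is loaded and the server is ready.
curl -s http://127.0.0.1:8080/health
```
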
## About auto-detecting `max_tokens`

`max_tokens` is a **per-request** parameter, not a server setting. The server controls the context window with `-c`, but each API request should still set its own `max_tokens`.

Simple rule:

```text
Small answer:      max_tokens 50-150
Small code:        max_tokens 200-500
Medium code:       max_tokens 800-1500
Long file/design:  max_tokens 2000+
```

For the P40, I'd keep most tests at:

```json
"max_tokens": 250
```

Then increase only when needed.

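To calibrate these numbers, you can check how many tokens a non-streaming request actually used; the OpenAI-compatible endpoint returns a `usage` object alongside the completion (field names assumed to follow the usual OpenAI schema):

```bash
# Send a short non-streaming request and print only the token accounting.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 50
  }' | jq '.usage'
```
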
## Add a simple curl wrapper

Create this:

```bash
cat > ~/ask-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

PROMPT="${*:-Say hello.}"
MAX_TOKENS="${MAX_TOKENS:-300}"

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n \
    --arg prompt "$PROMPT" \
    --argjson max_tokens "$MAX_TOKENS" \
    '{
      model: "gpt-oss-20b",
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      temperature: 0.2,
      max_tokens: $max_tokens
    }')" | jq -r '.choices[0].message.content'
EOF

chmod +x ~/ask-llama.sh
```

Use it:

```bash
~/ask-llama.sh "Output only valid Python code. Create a hello world script."
```

With more output:

```bash
MAX_TOKENS=800 ~/ask-llama.sh "Output only valid Python code. Create a Dog class with name, breed, age, tricks, add_trick(), and __str__()."
```

## My final recommendation

For your P40, use:

```bash
GGML_CUDA_DISABLE_GRAPHS=1
-c 2048
-np 1
-t 12
-b 512
-ub 256
--cache-ram 0
--no-warmup
```

After it has been stable for a while, try performance increases one at a time:

```text
1. Increase -t from 12 to 16
2. Increase -c from 2048 to 4096
3. Increase -b from 512 to 1024
4. Re-enable CUDA graphs only if everything else is stable
```

Do not tune everything at once, or you will not know which change caused the next crash.

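While you test each change, it helps to watch VRAM and GPU utilization in a second terminal, so a crash can be tied to memory pressure rather than guessed at:

```bash
# Refresh the standard nvidia-smi view once per second while the server runs.
watch -n 1 nvidia-smi
```
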
[1]: https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/ "Optimizing llama.cpp AI Inference with CUDA Graphs"