# Tesla P40 llama.cpp
Quick recipe: install the CUDA toolkit and build llama.cpp with CUDA on Ubuntu 24.04:

```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
If that build fails because CMake cannot find CUDA, the cause is simple: **your NVIDIA driver is installed, but the CUDA Toolkit compiler is not.**

`nvidia-smi` working means the **driver/runtime** is working. Building `llama.cpp` with CUDA requires **`nvcc`**, which comes from the **CUDA Toolkit**, not from `nvidia-utils`. NVIDIA's CUDA Linux install guide separates the driver from the CUDA development tools/toolkit ([NVIDIA Docs](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)).
Check:
```bash
which nvcc
nvcc --version
```
Right now those probably fail.
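If you want to see the driver/toolkit split explicitly, a quick sanity check (assuming Ubuntu's apt packaging) is to query the driver and then look for any installed toolkit package:

```bash
# Driver side: talks to the kernel driver, does not need the compiler
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# Toolkit side: look for an installed CUDA compiler package
dpkg -l | grep -Ei 'nvidia-cuda-toolkit|cuda-toolkit' || echo "no CUDA toolkit package installed"
```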
## Best fix on Ubuntu 24.04

Since you are using `580-server` for the Tesla P40, I would install the Ubuntu CUDA toolkit package first:
```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
```
Then verify:
```bash
which nvcc
nvcc --version
```
Then rebuild `llama.cpp` cleanly:

```bash
cd ~/llama.cpp

rm -rf build

cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=61

cmake --build build --config Release -j"$(nproc)"
```
The important part for Tesla P40 is:
```bash
-DCMAKE_CUDA_ARCHITECTURES=61
```
P40 is Pascal, compute capability 6.1. Setting that explicitly avoids CMake trying to guess or building unnecessary architectures.
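If you want to confirm the compute capability from the card itself, recent drivers expose it through `nvidia-smi` (the query field below assumes a driver new enough to report `compute_cap`):

```bash
# Should print something like: Tesla P40, 6.1
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```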
## If `nvidia-cuda-toolkit` is too old or missing

Use NVIDIA's CUDA repo/toolkit packages instead. For a P40, avoid chasing CUDA 13: CUDA 13 drops support for Pascal-era cards, so CUDA 12.x is the safer choice for the P40 and llama.cpp.
Example pattern:
```bash
sudo apt install -y cuda-toolkit-12-8
```
Some Ubuntu 24.04 CUDA install docs show installing `cuda-toolkit-12-8` from NVIDIA's repo after adding the CUDA repo/keyring ([docs.leaf.cloud](https://docs.leaf.cloud/en/latest/data-science/installing-gpu-drivers/)).
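The network-repo pattern usually looks roughly like this; check NVIDIA's current Ubuntu 24.04 instructions for the exact keyring package name and version before copying it:

```bash
# Add NVIDIA's CUDA apt repo via the keyring package, then install only the toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8

# This flavor of install puts nvcc under /usr/local/cuda-12.8, so add it to PATH if `which nvcc` still fails
export PATH=/usr/local/cuda-12.8/bin:$PATH
```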
But try this first:
```bash
sudo apt install -y nvidia-cuda-toolkit
```
## Do not reinstall the NVIDIA driver

Do **not** run a broad CUDA installer that replaces your working `580-server` driver with 590/595.
You want:
```text
Driver: nvidia-driver-580-server
CUDA toolkit: nvcc/compiler only
llama.cpp: built with GGML_CUDA=ON
Arch: 61 for Tesla P40
```
After build, test:
```bash
./build/bin/llama-cli --help | grep -i cuda
./build/bin/llama-bench --help
```
Then run a small model or server with:
```bash
./build/bin/llama-server \
    -m /path/to/model.gguf \
    -ngl 999 \
    -c 4096 \
    --host 0.0.0.0 \
    --port 8080
```
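While it loads, a second terminal watching `nvidia-smi` is the quickest way to confirm the layers actually landed in VRAM:

```bash
# VRAM usage on the P40 should jump well above idle once the model finishes loading
watch -n 1 nvidia-smi
```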
Your driver is fine now; you just need `nvcc`.
## Which runtime for the P40?

For a Tesla P40, I would use llama.cpp first for GPT-OSS, Qwen, DeepSeek, Llama, Gemma, etc.
Your P40 is a great cheap 24 GB VRAM card, but it is Pascal, so it lacks Tensor Cores and newer datatype acceleration. That makes llama.cpp a better fit than heavier stacks like vLLM/TGI for most local single-GPU use.
## Recommendation for your P40

Use this stack:

- Best low-friction stack: llama.cpp + GGUF models + CUDA build
- Optional convenience layer: Ollama or LM Studio
- Avoid as first choice on P40: vLLM, TensorRT-LLM, exllamav2 for GPT-OSS
llama.cpp has active support for GPT-OSS and GGUF models, and there are ready GGUF builds for gpt-oss-20b. Hugging Face’s GGUF page for ggml-org/gpt-oss-20b-GGUF shows direct llama.cpp, llama-cpp-python, Ollama, LM Studio, and Jan usage paths. (Hugging Face)
For Qwen, the official Qwen docs also document running Qwen3/Qwen3MoE with llama.cpp and GGUF, with support starting from specific llama.cpp builds. (Qwen)
## What models fit your P40?
With 24 GB VRAM:
| Model type | Recommendation |
|---|---|
| GPT-OSS 20B | Good target |
| GPT-OSS 120B | Not realistic on one P40 |
| Qwen 7B / 14B / 30B-A3B MoE | Good targets depending on quant |
| DeepSeek Coder / Qwen Coder GGUF | Good for coding |
| Llama 3.x 8B / 70B quantized | 8B easy, 70B partially/offload or CPU spill |
| Mixtral / MoE GGUF | Can work, speed varies |
For your use — Go, SQL, Python, React, infra code — I would start with:
1. Qwen coder GGUF model
2. GPT-OSS 20B GGUF
3. DeepSeek coder/distill GGUF
For the P40 specifically, GGUF quantized models are the practical path.
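As a rough back-of-envelope check (my numbers, not from the sources above): a Q4-ish GGUF needs on the order of 0.55-0.6 GB per billion parameters, so a 20B model at Q4 is roughly 11-13 GB of weights, leaving several GB of the 24 GB for KV cache and scratch buffers; a dense 70B at the same quant is around 40 GB and cannot fully fit on one P40.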
## Build llama.cpp with CUDA
On Ubuntu:
```bash
sudo apt update
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
Check that CUDA offload works:

```bash
./build/bin/llama-cli --help | grep -i gpu
```
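Because Pascal has weak FP16 throughput, some P40 users also rebuild with llama.cpp's CMake option to force the quantized matmul (MMQ) kernels instead of FP16 cuBLAS paths. Treat this as an optional experiment to benchmark against the default build, not a required step:

```bash
# Optional rebuild: force MMQ kernels; keep it only if llama-bench shows a win on your card
cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FORCE_MMQ=ON \
    -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```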
## Run GPT-OSS 20B
Example using llama.cpp’s Hugging Face download support:
```bash
./build/bin/llama-server \
    -hf ggml-org/gpt-oss-20b-GGUF \
    -ngl 999 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080
```
Then test the OpenAI-compatible API:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Write a small Go HTTP server with health check."}
    ],
    "temperature": 0.2
  }'
```
If VRAM gets tight, reduce context:

```bash
-c 4096
```

Or reduce GPU layers:

```bash
-ngl 60
```
But on a 24 GB P40, for many 20B-ish quantized GGUFs, this should be okay.
## Run a Qwen coder model
For coding, Qwen is probably where I would spend most of my time. Example:
```bash
./build/bin/llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF \
    -ngl 999 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080
```
The Unsloth Qwen GGUF page notes recent llama.cpp fixes for Qwen output/tool-calling behavior, so use a fresh llama.cpp build rather than an old distro package. (Hugging Face)
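To stay on a fresh build, a pull-and-rebuild with the same flags as above is enough (the `~/llama.cpp` path just follows the earlier examples):

```bash
cd ~/llama.cpp
git pull

# Clean rebuild so CMake picks up any new CUDA sources
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```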
## Should you use Ollama instead?
Ollama is easier, but llama.cpp gives you more direct control.
Use Ollama if you want something simple:

```bash
ollama run hf.co/ggml-org/gpt-oss-20b-GGUF
```
Use llama.cpp directly if you care about:
- GPU layer tuning
- context size tuning
- exact GGUF file selection
- server flags
- benchmarking
- performance debugging
Given how you work, I’d use llama.cpp directly first.
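For example, the kind of knob-turning that is one flag away in `llama-server` (the values here are placeholders to adjust per model, not recommendations):

```bash
# -ngl sets how many layers are offloaded to the GPU; lower it if a model does not fully fit
# -c sets context length, the other big VRAM knob
./build/bin/llama-server \
    -m /path/to/model.gguf \
    -ngl 40 \
    -c 4096 \
    --port 8080
```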
## Should you use vLLM?
For your P40, probably no — not as your first runtime.
vLLM is excellent for newer GPUs and serving many users, but it tends to assume newer CUDA paths and benefits heavily from Tensor Cores / modern attention kernels. On Pascal P40, llama.cpp is usually simpler and more forgiving.
I would use vLLM when you have something like:
- RTX 3090
- RTX 4090
- RTX PRO 4500 Blackwell
- A10/A40/A100/H100/L40S
For the P40:
llama.cpp > Ollama > exllamav2/vLLM experiments
## My practical recommendation
Start with:
```text
Runtime: llama.cpp
Driver: 580-server
Model format: GGUF
First model: gpt-oss-20b-GGUF
Coding model: Qwen coder GGUF
Context: 4096 or 8192 first
GPU layers: -ngl 999
```
Then benchmark:
```bash
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999
```
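If you want to separate prompt processing from token generation throughput, `llama-bench` takes explicit sizes (the values below are just a starting point; see `llama-bench --help`):

```bash
# -p = prompt tokens to process, -n = tokens to generate; compare both at full GPU offload
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999 -p 512 -n 128
```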
For a single Tesla P40, don’t chase the newest serving framework. Use stable 580-server, fresh llama.cpp, and GGUF models. That will give you the fewest headaches and the best compatibility.