# Tesla P40 llama.cpp
Quick recipe: install the CUDA toolkit and build llama.cpp with CUDA on Ubuntu 24.04:

```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
If that build fails because CMake cannot find CUDA, the cause is simple: **your NVIDIA driver is installed, but the CUDA Toolkit compiler is not.**

`nvidia-smi` working means the **driver/runtime** is working. Building `llama.cpp` with CUDA requires **`nvcc`**, which comes from the **CUDA Toolkit**, not from `nvidia-utils`. NVIDIA's CUDA Linux install guide separates the driver from the CUDA development tools/toolkit ([NVIDIA Docs](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)).
Check:
```bash
which nvcc
nvcc --version
```
Right now those probably fail.
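If you want to see the driver/toolkit split explicitly, a quick sanity check (assuming Ubuntu's apt packaging) is to query the driver and then look for any installed toolkit package:

```bash
# Driver side: talks to the kernel driver, does not need the compiler
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# Toolkit side: look for an installed CUDA compiler package
dpkg -l | grep -Ei 'nvidia-cuda-toolkit|cuda-toolkit' || echo "no CUDA toolkit package installed"
```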
## Best fix on Ubuntu 24.04

Since you are using `580-server` for the Tesla P40, I would install the Ubuntu CUDA toolkit package first:
```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
```
Then verify:
```bash
which nvcc
nvcc --version
```
Then rebuild `llama.cpp` cleanly:

```bash
cd ~/llama.cpp

rm -rf build

cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=61

cmake --build build --config Release -j"$(nproc)"
```
The important part for Tesla P40 is:
```bash
-DCMAKE_CUDA_ARCHITECTURES=61
```
P40 is Pascal, compute capability 6.1. Setting that explicitly avoids CMake trying to guess or building unnecessary architectures.
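If you want to confirm the compute capability from the card itself, recent drivers expose it through `nvidia-smi` (the query field below assumes a driver new enough to report `compute_cap`):

```bash
# Should print something like: Tesla P40, 6.1
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```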
## If `nvidia-cuda-toolkit` is too old or missing

Use NVIDIA's CUDA repo/toolkit packages instead. For a P40, avoid chasing CUDA 13: CUDA 13 drops support for Pascal-era cards, so CUDA 12.x is the safer choice for the P40 and llama.cpp.
Example pattern:
```bash
sudo apt install -y cuda-toolkit-12-8
```
Some Ubuntu 24.04 CUDA install docs show installing `cuda-toolkit-12-8` from NVIDIA's repo after adding the CUDA repo/keyring ([docs.leaf.cloud](https://docs.leaf.cloud/en/latest/data-science/installing-gpu-drivers/)).
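The network-repo pattern usually looks roughly like this; check NVIDIA's current Ubuntu 24.04 instructions for the exact keyring package name and version before copying it:

```bash
# Add NVIDIA's CUDA apt repo via the keyring package, then install only the toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8

# This flavor of install puts nvcc under /usr/local/cuda-12.8, so add it to PATH if `which nvcc` still fails
export PATH=/usr/local/cuda-12.8/bin:$PATH
```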
But try this first:
```bash
sudo apt install -y nvidia-cuda-toolkit
```
## Do not reinstall the NVIDIA driver

Do **not** run a broad CUDA installer that replaces your working `580-server` driver with 590/595.
You want:
```text
Driver: nvidia-driver-580-server
CUDA toolkit: nvcc/compiler only
llama.cpp: built with GGML_CUDA=ON
Arch: 61 for Tesla P40
```
After build, test:
```bash
./build/bin/llama-cli --help | grep -i cuda
./build/bin/llama-bench --help
```
Then run a small model or server with:
```bash
./build/bin/llama-server \
    -m /path/to/model.gguf \
    -ngl 999 \
    -c 4096 \
    --host 0.0.0.0 \
    --port 8080
```
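While it loads, a second terminal watching `nvidia-smi` is the quickest way to confirm the layers actually landed in VRAM:

```bash
# VRAM usage on the P40 should jump well above idle once the model finishes loading
watch -n 1 nvidia-smi
```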
Your driver is fine now; you just need `nvcc`.
## Which runtime for the P40?

For a Tesla P40, I would use llama.cpp first for GPT-OSS, Qwen, DeepSeek, Llama, Gemma, etc.
Your P40 is a great cheap 24 GB VRAM card, but it is Pascal, so it lacks Tensor Cores and newer datatype acceleration. That makes llama.cpp a better fit than heavier stacks like vLLM/TGI for most local single-GPU use.
## Recommendation for your P40

Use this stack:

- Best low-friction stack: llama.cpp + GGUF models + CUDA build
- Optional convenience layer: Ollama or LM Studio
- Avoid as first choice on P40: vLLM, TensorRT-LLM, exllamav2 for GPT-OSS
llama.cpp has active support for GPT-OSS and GGUF models, and there are ready GGUF builds for gpt-oss-20b. Hugging Face’s GGUF page for ggml-org/gpt-oss-20b-GGUF shows direct llama.cpp, llama-cpp-python, Ollama, LM Studio, and Jan usage paths. (Hugging Face)
For Qwen, the official Qwen docs also document running Qwen3/Qwen3MoE with llama.cpp and GGUF, with support starting from specific llama.cpp builds. (Qwen)
## What models fit your P40?
With 24 GB VRAM:
| Model type | Recommendation |
|---|---|
| GPT-OSS 20B | Good target |
| GPT-OSS 120B | Not realistic on one P40 |
| Qwen 7B / 14B / 30B-A3B MoE | Good targets depending on quant |
| DeepSeek Coder / Qwen Coder GGUF | Good for coding |
| Llama 3.x 8B / 70B quantized | 8B easy, 70B partially/offload or CPU spill |
| Mixtral / MoE GGUF | Can work, speed varies |
For your use — Go, SQL, Python, React, infra code — I would start with:
1. Qwen coder GGUF model
2. GPT-OSS 20B GGUF
3. DeepSeek coder/distill GGUF
For the P40 specifically, GGUF quantized models are the practical path.
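As a rough back-of-envelope check (my numbers, not from the sources above): a Q4-ish GGUF needs on the order of 0.55-0.6 GB per billion parameters, so a 20B model at Q4 is roughly 11-13 GB of weights, leaving several GB of the 24 GB for KV cache and scratch buffers; a dense 70B at the same quant is around 40 GB and cannot fully fit on one P40.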
## Build llama.cpp with CUDA
On Ubuntu:
```bash
sudo apt update
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
Check that CUDA offload works:

```bash
./build/bin/llama-cli --help | grep -i gpu
```
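Because Pascal has weak FP16 throughput, some P40 users also rebuild with llama.cpp's CMake option to force the quantized matmul (MMQ) kernels instead of FP16 cuBLAS paths. Treat this as an optional experiment to benchmark against the default build, not a required step:

```bash
# Optional rebuild: force MMQ kernels; keep it only if llama-bench shows a win on your card
cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FORCE_MMQ=ON \
    -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```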
## Run GPT-OSS 20B
Example using llama.cpp’s Hugging Face download support:
```bash
./build/bin/llama-server \
    -hf ggml-org/gpt-oss-20b-GGUF \
    -ngl 999 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080
```
Then test the OpenAI-compatible API:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Write a small Go HTTP server with health check."}
    ],
    "temperature": 0.2
  }'
```
If VRAM gets tight, reduce context:

```bash
-c 4096
```

Or reduce GPU layers:

```bash
-ngl 60
```
But on a 24 GB P40, for many 20B-ish quantized GGUFs, this should be okay.
## Run a Qwen coder model
For coding, Qwen is probably where I would spend most of my time. Example:
```bash
./build/bin/llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF \
    -ngl 999 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080
```
The Unsloth Qwen GGUF page notes recent llama.cpp fixes for Qwen output/tool-calling behavior, so use a fresh llama.cpp build rather than an old distro package. (Hugging Face)
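To stay on a fresh build, a pull-and-rebuild with the same flags as above is enough (the `~/llama.cpp` path just follows the earlier examples):

```bash
cd ~/llama.cpp
git pull

# Clean rebuild so CMake picks up any new CUDA sources
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```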
## Should you use Ollama instead?
Ollama is easier, but llama.cpp gives you more direct control.
Use Ollama if you want something simple:

```bash
ollama run hf.co/ggml-org/gpt-oss-20b-GGUF
```
Use llama.cpp directly if you care about:
- GPU layer tuning
- context size tuning
- exact GGUF file selection
- server flags
- benchmarking
- performance debugging
Given how you work, I’d use llama.cpp directly first.
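For example, the kind of knob-turning that is one flag away in `llama-server` (the values here are placeholders to adjust per model, not recommendations):

```bash
# -ngl sets how many layers are offloaded to the GPU; lower it if a model does not fully fit
# -c sets context length, the other big VRAM knob
./build/bin/llama-server \
    -m /path/to/model.gguf \
    -ngl 40 \
    -c 4096 \
    --port 8080
```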
## Should you use vLLM?
For your P40, probably no — not as your first runtime.
vLLM is excellent for newer GPUs and serving many users, but it tends to assume newer CUDA paths and benefits heavily from Tensor Cores / modern attention kernels. On Pascal P40, llama.cpp is usually simpler and more forgiving.
I would use vLLM when you have something like:
- RTX 3090
- RTX 4090
- RTX PRO 4500 Blackwell
- A10/A40/A100/H100/L40S
For the P40:
llama.cpp > Ollama > exllamav2/vLLM experiments
## My practical recommendation
Start with:
```text
Runtime: llama.cpp
Driver: 580-server
Model format: GGUF
First model: gpt-oss-20b-GGUF
Coding model: Qwen coder GGUF
Context: 4096 or 8192 first
GPU layers: -ngl 999
```
Then benchmark:
```bash
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999
```
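If you want to separate prompt processing from token generation throughput, `llama-bench` takes explicit sizes (the values below are just a starting point; see `llama-bench --help`):

```bash
# -p = prompt tokens to process, -n = tokens to generate; compare both at full GPU offload
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999 -p 512 -n 128
```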
For a single Tesla P40, don’t chase the newest serving framework. Use stable 580-server, fresh llama.cpp, and GGUF models. That will give you the fewest headaches and the best compatibility.