Tesla P40 llama.cpp
```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
Yes — for a Tesla P40, I would use llama.cpp first for GPT-OSS, Qwen, DeepSeek, Llama, Gemma, etc.
Your P40 is a great cheap 24 GB VRAM card, but it is Pascal, so it has no Tensor Cores and very slow native FP16; it relies on FP32 and INT8 (DP4A) paths. That makes llama.cpp a better fit than heavier stacks like vLLM/TGI for most local single-GPU use.
Recommendation for your P40
Use this stack:
- Best low-friction stack: llama.cpp + GGUF models + CUDA build
- Optional convenience layer: Ollama or LM Studio
- Avoid as first choice on P40: vLLM, TensorRT-LLM, exllamav2 for GPT-OSS
llama.cpp has active support for GPT-OSS and GGUF models, and there are ready GGUF builds for gpt-oss-20b. Hugging Face’s GGUF page for ggml-org/gpt-oss-20b-GGUF shows direct llama.cpp, llama-cpp-python, Ollama, LM Studio, and Jan usage paths. (Hugging Face)
For Qwen, the official Qwen docs also document running Qwen3/Qwen3MoE with llama.cpp and GGUF, with support starting from specific llama.cpp builds. (Qwen)
What models fit your P40?
With 24 GB VRAM:
| Model type | Recommendation |
|---|---|
| GPT-OSS 20B | Good target |
| GPT-OSS 120B | Not realistic on one P40 |
| Qwen 7B / 14B / 30B-A3B MoE | Good targets, depending on quant |
| DeepSeek Coder / Qwen Coder GGUF | Good for coding |
| Llama 3.x 8B / 70B quantized | 8B easy; 70B needs partial offload / CPU spill |
| Mixtral / MoE GGUF | Can work, speed varies |
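Rough sizing rule of thumb: GGUF file size ≈ parameter count × bits per weight ÷ 8. A Q4_K_M quant averages roughly 4.8 bits/weight, so a 20B model comes out near 20e9 × 4.8 ÷ 8 ≈ 12 GB, leaving ~10 GB of the P40's 24 GB for KV cache, compute buffers, and context.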
For your use — Go, SQL, Python, React, infra code — I would start with:
1. Qwen coder GGUF model
2. GPT-OSS 20B GGUF
3. DeepSeek coder/distill GGUF
For the P40 specifically, GGUF quantized models are the practical path.
Build llama.cpp with CUDA
On Ubuntu:
```bash
sudo apt update
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
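If CMake picks the wrong GPU architecture, you can pin Pascal's compute capability (6.1) explicitly. This uses the standard CMAKE_CUDA_ARCHITECTURES variable, which llama.cpp's CMake build respects:

```bash
# Target Pascal (sm_61, i.e. the P40) explicitly, then rebuild.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```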
Check CUDA offload works:
```bash
./build/bin/llama-cli --help | grep -i gpu
```
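A minimal end-to-end check, assuming you already have some GGUF on disk (the model path is a placeholder):

```bash
# Generate a few tokens with all layers offloaded...
./build/bin/llama-cli -m /path/to/any-model.gguf -ngl 999 -p "Hello" -n 32

# ...then confirm the weights actually landed in VRAM:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```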
Run GPT-OSS 20B
Example using llama.cpp’s Hugging Face download support:
```bash
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```
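The -hf flag downloads the GGUF from Hugging Face on first launch and caches it locally, so the first start takes a while; later starts reuse the cached file.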
Then test OpenAI-compatible API:
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Write a small Go HTTP server with health check."}
    ],
    "temperature": 0.2
  }'
```
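If you have jq installed, you can pull just the assistant text out of the response:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Say hi"}]}' \
  | jq -r '.choices[0].message.content'
```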
If VRAM gets tight, reduce context:
-c 4096
Or reduce GPU layers:
-ngl 60
But on a 24 GB P40, for many 20B-ish quantized GGUFs, this should be okay.
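To see how close you actually are, watch VRAM while the server is loaded; one simple way:

```bash
# Refresh GPU memory usage every second while you send requests.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```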
Run a Qwen coder model
For coding, Qwen is probably where I would spend most of my time. Example:
```bash
./build/bin/llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF \
  -ngl 999 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```
The Unsloth Qwen GGUF page notes recent llama.cpp fixes for Qwen output/tool-calling behavior, so use a fresh llama.cpp build rather than an old distro package. (Hugging Face)
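A quick way to confirm you are on a fresh build rather than a stale binary (exact output format varies by version):

```bash
./build/bin/llama-server --version
# or, from inside the llama.cpp checkout:
git log -1 --oneline
```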
Should you use Ollama instead?
Ollama is easier, but llama.cpp gives you more direct control.
Use Ollama if you want simple:
```bash
ollama run hf.co/ggml-org/gpt-oss-20b-GGUF
```
Use llama.cpp directly if you care about:
- GPU layer tuning
- context size tuning
- exact GGUF file selection
- server flags
- benchmarking
- performance debugging
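As a sketch of what that control looks like in practice (paths and values here are placeholders to tune, not recommendations):

```bash
# -m: exact GGUF file you chose (placeholder path)
# -ngl: offload all layers to the P40
# -c: context size; drop to 4096 if VRAM gets tight
# -t: CPU threads for any non-offloaded work
./build/bin/llama-server \
  -m ~/models/qwen-coder-q4_k_m.gguf \
  -ngl 999 \
  -c 8192 \
  -t "$(nproc)" \
  --host 0.0.0.0 --port 8080
```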
Given how you work, I’d use llama.cpp directly first.
Should you use vLLM?
For your P40, probably no — not as your first runtime.
vLLM is excellent for newer GPUs and serving many users, but it tends to assume newer CUDA paths and benefits heavily from Tensor Cores / modern attention kernels. On Pascal P40, llama.cpp is usually simpler and more forgiving.
I would use vLLM when you have something like:
- RTX 3090
- RTX 4090
- RTX PRO 4500 Blackwell
- A10 / A40 / A100 / H100 / L40S
For the P40:
llama.cpp > Ollama > exllamav2/vLLM experiments
My practical recommendation
Start with:
- Runtime: llama.cpp
- Driver: 580-server
- Model format: GGUF
- First model: gpt-oss-20b-GGUF
- Coding model: Qwen coder GGUF
- Context: 4096 or 8192 first
- GPU layers: -ngl 999
Then benchmark:
```bash
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999
```
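llama-bench accepts comma-separated parameter lists, which makes it easy to compare offload levels in a single run:

```bash
# Compare partial vs full offload; the model path is a placeholder.
./build/bin/llama-bench -m /path/to/model.gguf -ngl 40,60,999
```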
For a single Tesla P40, don’t chase the newest serving framework. Use stable 580-server, fresh llama.cpp, and GGUF models. That will give you the fewest headaches and the best compatibility.