Tesla P40 llama.cpp
```bash
sudo apt update
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
Yes — for a Tesla P40, I would use llama.cpp first for GPT-OSS, Qwen, DeepSeek, Llama, Gemma, etc.
Your P40 is a great cheap 24 GB VRAM card, but it is Pascal, so it has no Tensor Cores and very slow native FP16; it relies on FP32 and INT8 (DP4A) paths. That makes llama.cpp a better fit than heavier stacks like vLLM/TGI for most local single-GPU use.
Recommendation for your P40
Use this stack:
- Best low-friction stack: llama.cpp + GGUF models + CUDA build
- Optional convenience layer: Ollama or LM Studio
- Avoid as first choice on P40: vLLM, TensorRT-LLM, exllamav2 for GPT-OSS
llama.cpp has active support for GPT-OSS and GGUF models, and there are ready GGUF builds for gpt-oss-20b. Hugging Face’s GGUF page for ggml-org/gpt-oss-20b-GGUF shows direct llama.cpp, llama-cpp-python, Ollama, LM Studio, and Jan usage paths. (Hugging Face)
For Qwen, the official Qwen docs also document running Qwen3/Qwen3MoE with llama.cpp and GGUF, with support starting from specific llama.cpp builds. (Qwen)
What models fit your P40?
With 24 GB VRAM:
| Model type | Recommendation |
|---|---|
| GPT-OSS 20B | Good target |
| GPT-OSS 120B | Not realistic on one P40 |
| Qwen 7B / 14B / 30B-A3B MoE | Good targets, depending on quant |
| DeepSeek Coder / Qwen Coder GGUF | Good for coding |
| Llama 3.x 8B / 70B quantized | 8B easy; 70B needs partial offload / CPU spill |
| Mixtral / MoE GGUF | Can work, speed varies |
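Rough sizing rule of thumb: GGUF file size ≈ parameter count × bits per weight ÷ 8. A Q4_K_M quant averages roughly 4.8 bits/weight, so a 20B model comes out near 20e9 × 4.8 ÷ 8 ≈ 12 GB, leaving ~10 GB of the P40's 24 GB for KV cache, compute buffers, and context.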
For your use — Go, SQL, Python, React, infra code — I would start with:
1. Qwen coder GGUF model
2. GPT-OSS 20B GGUF
3. DeepSeek coder/distill GGUF
For the P40 specifically, GGUF quantized models are the practical path.
Build llama.cpp with CUDA
On Ubuntu:
```bash
sudo apt update
sudo apt install -y git cmake build-essential curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
```
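If CMake picks the wrong GPU architecture, you can pin Pascal's compute capability (6.1) explicitly. This uses the standard CMAKE_CUDA_ARCHITECTURES variable, which llama.cpp's CMake build respects:

```bash
# Target Pascal (sm_61, i.e. the P40) explicitly, then rebuild.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j"$(nproc)"
```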
Check CUDA offload works:
```bash
./build/bin/llama-cli --help | grep -i gpu
```
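A minimal end-to-end check, assuming you already have some GGUF on disk (the model path is a placeholder):

```bash
# Generate a few tokens with all layers offloaded...
./build/bin/llama-cli -m /path/to/any-model.gguf -ngl 999 -p "Hello" -n 32

# ...then confirm the weights actually landed in VRAM:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```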
Run GPT-OSS 20B
Example using llama.cpp’s Hugging Face download support:
```bash
./build/bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  -ngl 999 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```
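The -hf flag downloads the GGUF from Hugging Face on first launch and caches it locally, so the first start takes a while; later starts reuse the cached file.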
Then test OpenAI-compatible API:
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Write a small Go HTTP server with health check."}
    ],
    "temperature": 0.2
  }'
```
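If you have jq installed, you can pull just the assistant text out of the response:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Say hi"}]}' \
  | jq -r '.choices[0].message.content'
```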
If VRAM gets tight, reduce context:
-c 4096
Or reduce GPU layers:
-ngl 60
But on a 24 GB P40, for many 20B-ish quantized GGUFs, this should be okay.
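To see how close you actually are, watch VRAM while the server is loaded; one simple way:

```bash
# Refresh GPU memory usage every second while you send requests.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```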
Run a Qwen coder model
For coding, Qwen is probably where I would spend most of my time. Example:
```bash
./build/bin/llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF \
  -ngl 999 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```
The Unsloth Qwen GGUF page notes recent llama.cpp fixes for Qwen output/tool-calling behavior, so use a fresh llama.cpp build rather than an old distro package. (Hugging Face)
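A quick way to confirm you are on a fresh build rather than a stale binary (exact output format varies by version):

```bash
./build/bin/llama-server --version
# or, from inside the llama.cpp checkout:
git log -1 --oneline
```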
Should you use Ollama instead?
Ollama is easier, but llama.cpp gives you more direct control.
Use Ollama if you want simple:
```bash
ollama run hf.co/ggml-org/gpt-oss-20b-GGUF
```
Use llama.cpp directly if you care about:
- GPU layer tuning
- context size tuning
- exact GGUF file selection
- server flags
- benchmarking
- performance debugging
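As a sketch of what that control looks like in practice (paths and values here are placeholders to tune, not recommendations):

```bash
# -m: exact GGUF file you chose (placeholder path)
# -ngl: offload all layers to the P40
# -c: context size; drop to 4096 if VRAM gets tight
# -t: CPU threads for any non-offloaded work
./build/bin/llama-server \
  -m ~/models/qwen-coder-q4_k_m.gguf \
  -ngl 999 \
  -c 8192 \
  -t "$(nproc)" \
  --host 0.0.0.0 --port 8080
```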
Given how you work, I’d use llama.cpp directly first.
Should you use vLLM?
For your P40, probably no — not as your first runtime.
vLLM is excellent for newer GPUs and serving many users, but it tends to assume newer CUDA paths and benefits heavily from Tensor Cores / modern attention kernels. On Pascal P40, llama.cpp is usually simpler and more forgiving.
I would use vLLM when you have something like:
- RTX 3090
- RTX 4090
- RTX PRO 4500 Blackwell
- A10 / A40 / A100 / H100 / L40S
For the P40:
llama.cpp > Ollama > exllamav2/vLLM experiments
My practical recommendation
Start with:
- Runtime: llama.cpp
- Driver: 580-server
- Model format: GGUF
- First model: gpt-oss-20b-GGUF
- Coding model: Qwen coder GGUF
- Context: 4096 or 8192 first
- GPU layers: -ngl 999
Then benchmark:
```bash
./build/bin/llama-bench -m /path/to/model.gguf -ngl 999
```
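llama-bench accepts comma-separated parameter lists, which makes it easy to compare offload levels in a single run:

```bash
# Compare partial vs full offload; the model path is a placeholder.
./build/bin/llama-bench -m /path/to/model.gguf -ngl 40,60,999
```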
For a single Tesla P40, don’t chase the newest serving framework. Use stable 580-server, fresh llama.cpp, and GGUF models. That will give you the fewest headaches and the best compatibility.