GPU acceleration with torch¶
Advanced
Prerequisites: Single deterministic simulation
pylectra's ODE engine has an optional PyTorch backend built on torchdiffeq solvers, which can deliver large speed-ups on machines with NVIDIA GPUs. This page covers when to switch, how to switch, and how to manage GPU memory.
When is torch worth it?¶
| Scenario | scipy | torch CPU | torch CUDA |
|---|---|---|---|
| case39 single sim | ~5 s | ~3 s | ~1 s |
| case118 single sim | ~30 s | ~20 s | ~3 s |
| case2000+ single sim | ~10 min | ~6 min | ~30 s |
| 1000-sample batch (case39) | ~1 h | ~40 min | ~10 min (and lower power) |
Numbers come from measured case39 runs plus extrapolation for the larger cases. Bigger cases and bigger batches benefit most from CUDA.
If your work is just a handful of case39 runs, stay on scipy: torch's CUDA-context startup (~1 s) eats most of the win.
Install torch¶
CPU only:
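A typical CPU-only install, using the index URL from PyTorch's official install selector (the torchdiffeq package is the solver dependency this page assumes):

```shell
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install torchdiffeq
```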
NVIDIA GPU + CUDA 12.x (typical):
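A typical CUDA 12.1 install (the `cu121` tag is PyTorch's naming; match it to your driver as described below):

```shell
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install torchdiffeq
```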
Don't know your CUDA version? Run `nvidia-smi`; the top-right of its header shows "CUDA Version". 12.1 → cu121, 11.8 → cu118.
Verify:
```python
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
```
If `CUDA available: False` despite having a GPU, pip installed the CPU build. Reinstall via the CUDA channel:
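One way to swap builds, assuming CUDA 12.1 (match the `cuXXX` tag to your `nvidia-smi` output):

```shell
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu121
```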
YAML solver swap¶
```yaml
solver:
  kind: torch_dopri5        # or torch_dopri8 / torch_rk4 / torch_euler
  options:
    rtol: 1.0e-6
    atol: 1.0e-8
    chunk_seconds: 0.5      # OOM relief (see below)
    torch_dtype: float64    # cuda default float64; mps requires float32
    device: auto            # auto → cuda → mps → cpu
```
Four torch solvers are available:

| Name | Method | Adaptive | When |
|---|---|---|---|
| `torch_dopri5` | Dormand-Prince 5(4) | yes | Default; matches `scipy_rk45` in accuracy |
| `torch_dopri8` | Dormand-Prince 8(7) | yes | High accuracy; counterpart of `scipy_dop853` |
| `torch_rk4` | Classical RK4 | no | Fixed step; simplest |
| `torch_euler` | Euler | no | Rarely useful; reference only |
Automatic device selection¶
With `device: auto`, pylectra tries, in order:

1. `cuda` (NVIDIA GPU)
2. `mps` (Apple Silicon)
3. `cpu` (fallback)
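The `cuda → mps → cpu` order from the YAML comment above can be sketched with standard torch calls (the `pick_device` helper is ours, not a pylectra API, and the real logic may differ in details):

```python
import torch

def pick_device() -> str:
    """Pick a device in the documented cuda -> mps -> cpu preference order."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on very old torch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```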
Force a specific device:
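A minimal override, reusing the `device` key from the solver options above (the documented values are `auto`, `cuda`, `mps`, `cpu`):

```yaml
solver:
  kind: torch_dopri5
  options:
    device: cuda   # or: cpu / mps — skips auto-detection entirely
```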
Out-of-memory: chunking¶
`torchdiffeq.odeint` can OOM on long simulations because it retains internal RK stage buffers (k1..k6) across the whole call, with size ∝ T·n (time points × state dimension).
The fix: `chunk_seconds` slices each leg into windows and calls `tensor.detach()` between them, so buffers from finished windows can be freed.
| `chunk_seconds` | Memory | Speed |
|---|---|---|
| `null` (default) | O(leg × state dim) | fastest |
| `1.0` | ~half | ~5 % slower |
| `0.5` | ~quarter | ~10 % slower |
| `0.1` | ~1/40 | ~30 % slower |
OOM rescue checklist:

1. Try `chunk_seconds: 0.5`.
2. Still OOM → `chunk_seconds: 0.2`.
3. Still OOM → also lower `dense_n` (output points per leg) and loosen `rtol`.
4. Still OOM → fall back to CPU (`device: cpu`).
Mathematical equivalence is enforced by `tests/integration/test_torch_backend.py::test_chunking_is_numerically_equivalent`: chunked vs. non-chunked agree at `rtol=1e-4`.
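To see why chunking can be numerically equivalent, here is a dependency-free sketch with a fixed-step Euler integrator (`euler` and the 0.5 s window size are illustrative, not pylectra code):

```python
def euler(f, x, t0, dt, n_steps):
    """Fixed-step explicit Euler; a hypothetical stand-in for the torch solver."""
    for k in range(n_steps):
        x = x + dt * f(t0 + k * dt, x)
    return x

f = lambda t, x: -x  # simple linear test system, dx/dt = -x

# One pass over the full 2 s horizon...
whole = euler(f, 1.0, 0.0, 0.001, 2000)

# ...vs the same integration split into four 0.5 s chunks, each restarted from the
# previous chunk's end state (this is the point where a torch backend would call
# tensor.detach() so finished windows' buffers can be freed).
x = 1.0
for c in range(4):
    x = euler(f, x, 0.5 * c, 0.001, 500)

print(whole == x)  # True: chunking changes memory behaviour, not the trajectory
```

With adaptive solvers the restart also resets the step-size controller, which is why the real test only demands agreement at `rtol=1e-4` rather than bit-identity.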
Choosing dtype¶
```yaml
solver:
  options:
    torch_dtype: float64   # default, high accuracy
    # torch_dtype: float32 # 1.5–2× faster, half the memory, ~5 fewer digits
```
Rotor-angle dynamics range over 10⁻³–10⁰ — float32 is usually enough. Use float64 for numerical comparisons or paper-grade results.
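The "~5 fewer digits" claim is just IEEE-754: float32 carries about 7 significant decimal digits versus float64's ~16. A stdlib-only check (no torch needed):

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip a Python float (binary64) through IEEE-754 binary32."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_f32(1.0 + 1e-8) == 1.0)  # True: the 1e-8 increment vanishes at float32 precision
print((1.0 + 1e-8) == 1.0)        # False: float64 resolves it easily
```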
Apple MPS supports `float32` only. With `device: auto` resolving to `mps` and the default `torch_dtype: float64`, pylectra raises a clear error asking you to switch to `cpu` or change the dtype to `float32`.
Running batches on GPU¶
Batch runs combine with the torch backend and joblib, but don't spawn many workers per GPU: each worker creates its own CUDA context (~1 GB VRAM) and you'll OOM fast.
For multi-GPU machines:
```python
# Pick a GPU per worker; needs custom dispatching.
# Must be set before the first CUDA call in the process.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "1"
```
More elaborate multi-GPU schemes are out of pylectra's core scope — wrap your own driver script.
Verify the GPU is actually working¶
Run `nvidia-smi` in a second terminal while the simulation runs. Look for a python process, VRAM usage, and GPU-Util above 0 %; that's the real thing.
Or in code:
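A small sketch using standard `torch.cuda` calls (the `gpu_report` helper is ours, not a pylectra API):

```python
import torch

def gpu_report() -> dict:
    """Summarise CUDA availability and current memory use via torch.cuda."""
    info = {"available": torch.cuda.is_available()}
    if info["available"]:
        info["name"] = torch.cuda.get_device_name(0)
        info["allocated_mb"] = torch.cuda.memory_allocated(0) / 2**20
    return info

print(gpu_report())
```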
scipy vs. torch numerical differences¶
They are not bit-identical, because:
- Different algorithms (dop853 is 8th-order; dopri5 is 5th-order).
- Different initial step controllers.
- Different float64 accumulation orders.
Both are valid solutions of the same ODE. pylectra enforces < 1 % L2 error. For example, `torch_dopri5` vs `scipy_dop853` on case39 with a bus 16 fault:

- Angles: L2 = 0.3 %
- Speeds: L2 = 0.1 %
- Voltages: L2 = 3 % (larger near fault discontinuities)
For paper-grade results, scipy's output is the safer reference; switch to torch when scaling up datasets.
FAQ¶
Q: torch is installed and I have a GPU, but pylectra reports device=cpu¶
Possible causes:

- `pip install torch` defaulted to the CPU build: `pip uninstall torch` and reinstall via the CUDA channel.
- conda's torch doesn't match the system CUDA toolkit (conda's torch ships its own CUDA runtime; pip's requires system CUDA).
- You set `device: cpu` explicitly somewhere.
Q: torch starts slowly¶
The very first `import torch` on a GPU machine pays the CUDA-context cost (~1–2 s). Subsequent calls are fast. If every run is slow, the GPU may be dropping into power-save because persistence mode is off; try `nvidia-smi -pm 1` (or run `nvidia-persistenced`).
Q: torchdiffeq won't install¶
Q: torch on CPU is faster than scipy?¶
Often, yes. torchdiffeq runs the RHS as fully vectorised tensor ops on PyTorch's BLAS backend (OpenBLAS / MKL), which avoids much of scipy's per-step Python callback overhead. You get the win even without a GPU.
Next steps¶
- Tune memory for long simulations — beyond chunk_seconds.
- pylectra.run.run() API — full solver-option reference.