Speed up batch runs with multiple cores¶
Intermediate
Prerequisites: Batch dataset generation
One YAML block¶
output:
  parallel:
    n_jobs: -1        # -1 = all CPU cores
    backend: loky     # loky | multiprocessing | threading
    batch_size: 4     # tasks per dispatch (loky default = auto)
Choosing n_jobs¶
| Value | Meaning |
|---|---|
| 1 | Serial (development / debugging) |
| -1 | Every logical CPU |
| auto | pylectra heuristic: min(cpu_count - 1, 16) |
| 4 | Explicit 4 workers |
Rules of thumb:
- ≤ 16 cores: -1 is usually fastest.
- > 16 cores: auto (capped at 16) avoids diminishing returns.
- Machines with 4 / 8 GB of RAM: each worker takes ~500 MB, so 16 workers will OOM. Cap at n_jobs: 4 or 8.

Check what pylectra detects and recommends:
from pylectra.hardware import recommend_n_jobs, summary
print(summary()) # full hardware summary
print(recommend_n_jobs()) # pylectra's recommendation
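For reference, the auto heuristic from the table above is small enough to sketch. This is an illustrative re-implementation only (the floor of 1 worker is an assumption); prefer recommend_n_jobs() in real code:

import os

def auto_n_jobs() -> int:
    # min(cpu_count - 1, 16) as documented; never drop below 1 worker
    # (that floor is an assumption for single-core machines).
    return max(1, min((os.cpu_count() or 1) - 1, 16))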
Choosing the backend¶
loky (recommended default)¶
joblib's process pool. Works on Windows / macOS / Linux. Each worker is its own Python process — bypasses the GIL and is fastest for CPU-bound tasks (pylectra is one).
multiprocessing¶
Python's stdlib multiprocessing. Similar in spirit but with more fork-related quirks. Avoid — loky patches a long list of fork issues on Windows / macOS.
threading¶
Thread pool. Limited by the GIL — no speed-up for pylectra's mostly-Python ODE driver, even at n_jobs=8.
Exception: if your code calls GIL-releasing C extensions (numpy BLAS on big matrices, scipy solvers, torch), threading does parallelise. Pylectra's main loop is mostly Python, so don't use threading.
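To make the exception concrete, here is a small standalone joblib sketch, independent of pylectra, where threading does pay off because numpy's BLAS matmul releases the GIL:

from joblib import Parallel, delayed
import numpy as np

def blas_heavy(n: int = 2000) -> float:
    # Large matmul: the BLAS call releases the GIL, so threads overlap.
    a = np.random.rand(n, n)
    return float((a @ a).sum())

results = Parallel(n_jobs=4, backend="threading")(
    delayed(blas_heavy)() for _ in range(8)
)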
The Windows non-ASCII username pitfall¶
A Windows username with non-ASCII chars (e.g. 龙宗加) + joblib loky = UnicodeEncodeError.
multiprocessing.resource_tracker encodes IPC messages as ASCII; paths like C:\Users\龙宗加\AppData\Local\Temp\... blow up.
Fix: redirect JOBLIB_TEMP_FOLDER to an ASCII-only path:
# Windows cmd
set JOBLIB_TEMP_FOLDER=D:\joblib_tmp
python -m pylectra run examples/batch_case39.yaml

# Windows PowerShell
$env:JOBLIB_TEMP_FOLDER = "D:\joblib_tmp"

# Linux / macOS (rarely an issue here)
export JOBLIB_TEMP_FOLDER=/tmp/joblib_tmp
Or in Python:
import os

# Must be set before the first parallel call spawns workers.
os.environ["JOBLIB_TEMP_FOLDER"] = r"D:\joblib_tmp"

from pylectra.run import run
run("examples/batch_case39.yaml")
Tuning batch_size¶
batch_size controls how many tasks are dispatched to each worker per chunk.
- auto / default: joblib adapts (usually fine).
- Too small (1): scheduling overhead, idle workers.
- Too large (100): unbalanced load (one slow worker drags the rest).
For very large batches (10 000+), explicit batch_size: 8 or 16 is often 5–10 % faster than auto.
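Pylectra's setting presumably forwards to joblib's own batch_size argument. A standalone joblib sketch (not pylectra's API) of what explicit chunking looks like:

from joblib import Parallel, delayed

def simulate(i: int) -> int:
    return i * i  # stand-in for one simulation task

# batch_size=16: each dispatch hands 16 tasks to a worker,
# cutting scheduler round-trips on very large batches.
results = Parallel(n_jobs=-1, batch_size=16)(
    delayed(simulate)(i) for i in range(10_000)
)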
Memory vs. n_jobs¶
Per worker:
- A separate Python interpreter (~50 MB).
- Copy of the numpy case data (~1 MB).
- joblib internal buffers (~10 MB).
- Simulation state (case-size dependent; case39 ~30 MB).
case39, n_jobs=8 → ~250 MB total. case2000+, n_jobs=8 → easily exceeds 8 GB.
OOM rescue: cap n_jobs; total memory grows roughly linearly with the worker count (see the rules of thumb above). One way to size it from available RAM is sketched below.
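A hedged sketch, assuming psutil is installed and taking ~100 MB per worker as a rough upper bound from the itemized list above:

import psutil  # third-party: pip install psutil

PER_WORKER_MB = 100  # rough case39 upper bound from the list above
avail_mb = psutil.virtual_memory().available // 2**20

# Never exceed what free RAM allows; keep the 16-worker cap from auto.
n_jobs = max(1, min(avail_mb // PER_WORKER_MB, 16))
print(f"RAM-safe n_jobs for this machine: {n_jobs}")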
Verify the speed-up¶
5-sample sanity check:
# Serial
time python -m pylectra run examples/batch_case39.yaml \
    -O 'output={"parallel": {"n_jobs": 1}}'

# Parallel
time python -m pylectra run examples/batch_case39.yaml \
    -O 'output={"parallel": {"n_jobs": -1}}'
Typical on 8 cores: serial ~50 s, parallel ~10 s, about 5× (the ideal 8× minus joblib start-up and I/O overhead).
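The same check from Python, handy in notebooks. This sketch assumes n_jobs is set in the YAML itself rather than via -O:

import time
from pylectra.run import run

t0 = time.perf_counter()
run("examples/batch_case39.yaml")  # n_jobs taken from the YAML
print(f"wall time: {time.perf_counter() - t0:.1f} s")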
On a shared cluster¶
On academic or corporate HPC clusters, a worker rarely owns a whole physical core; other jobs compete for it.
- SLURM / PBS: after --cpus-per-task=8, n_jobs=-1 sees 8 cores (not the whole node).
- Cluster forbids fork? Switch to backend: threading; you give up the parallel speed-up but avoid being killed by the scheduler.
- dask is the better fit for true distributed clusters. pylectra has no built-in dask support, but SingleRunner is picklable, so wrapping it externally is straightforward (see the sketch in the FAQ below).
Long-running setups¶
Set the joblib configuration once at the top of a long-lived session (a notebook, a service):

import joblib

joblib.parallel_config(backend="loky",
                       n_jobs=8,
                       temp_folder="/scratch/joblib")
# All later batch calls inherit this config
FAQ¶
Q: Why is n_jobs=4 only 30 % faster than n_jobs=2 on a 4-core machine?¶
Each individual simulation already consumes ~60 % of the CPU via numpy / scipy BLAS multi-threading, so the workers contend with each other. Disable BLAS-internal threading for more linear scaling:
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
python -m pylectra run examples/batch_case39.yaml
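The same fix from Python. The variables must be set before numpy is first imported, otherwise the BLAS thread pool is already initialised:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

from pylectra.run import run  # import after setting the env vars
run("examples/batch_case39.yaml")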
Q: Stuck on "Joblib starting"¶
First-time loky workers import pylectra in full (every plugin + numpy / scipy) — 10–30 s is normal. Only the first start is slow; subsequent tasks reuse the same workers.
If still stuck after 5 minutes, an importing plugin probably raised. Add -O 'verbose=2' for details.
Q: Distribute across multiple machines?¶
joblib is single-host. For multi-machine runs:
- dask.distributed: submit SingleRunner tasks via client.submit (sketch below).
- Ray: similar.
- SLURM array jobs: each array task runs an independent batch; merge metadata afterwards.
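A hedged sketch of the dask.distributed route. The pylectra specifics are assumptions: the SingleRunner import path, its constructor arguments, and calling it per sample are illustrative, not documented API:

from dask.distributed import Client
from pylectra.run import SingleRunner  # import path is an assumption

client = Client("tcp://scheduler:8786")              # your scheduler address
runner = SingleRunner("examples/batch_case39.yaml")  # picklable per the docs

# One future per sample index; gather() pulls results back to the client.
futures = [client.submit(runner, i) for i in range(100)]
results = client.gather(futures)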
Next steps¶
- Tune memory for long simulations — chunk_seconds + BLAS sharing, etc.
- Batch tutorial — config refresher.