# Scale to Multiple Nodes
Distribute a training run across multiple machines, with one worker process per GPU on each node.
## Prerequisites
- SSH access between nodes
- Shared filesystem for checkpoints
- Fast network (InfiniBand recommended)
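
Before launching, it can save time to verify each prerequisite from the node you plan to launch from. The snippet below is a minimal sketch: the hostname `node2` and the path `/shared/checkpoints` are placeholders, and `ibstat` is only present where the InfiniBand diagnostics tools are installed.

```bash
# Passwordless SSH from the launch node to every other node (node2 is a placeholder hostname)
ssh -o BatchMode=yes node2 hostname

# The checkpoint directory must resolve to the same shared path on every node
# (/shared/checkpoints is a placeholder; use the path from your config)
ls /shared/checkpoints && ssh node2 ls /shared/checkpoints

# If InfiniBand is available, the port state should report "Active"
ibstat | grep -i "state"
```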
## Configuration
```yaml
distributed:
  world_size: 16        # Total GPUs
  tensor_parallel: 4
  pipeline_parallel: 1
  data_parallel: 4
  master_addr: node1
  master_port: 29500
```
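
The parallelism degrees must multiply out to `world_size`: here `tensor_parallel × pipeline_parallel × data_parallel = 4 × 1 × 4 = 16`, which matches the 2 nodes × 8 GPUs used in the launch commands below. A quick shell check of that invariant, with the values hard-coded from the config above:

```bash
# world_size must equal tensor_parallel * pipeline_parallel * data_parallel
tp=4; pp=1; dp=4; world_size=16
if [ $((tp * pp * dp)) -ne "$world_size" ]; then
    echo "error: parallelism degrees multiply to $((tp * pp * dp)), not $world_size" >&2
fi
```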
## Launch with torchrun
```bash
# On node 1 (rank 0)
torchrun --nnodes=2 --node_rank=0 \
    --nproc_per_node=8 \
    --master_addr=node1 --master_port=29500 \
    -m flux.cli train --config config.yaml

# On node 2 (rank 1)
torchrun --nnodes=2 --node_rank=1 \
    --nproc_per_node=8 \
    --master_addr=node1 --master_port=29500 \
    -m flux.cli train --config config.yaml
```
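
As an alternative to passing `--node_rank` explicitly, torchrun also supports its c10d rendezvous backend, where every node runs the same command and ranks are assigned at startup. A sketch, assuming the same 2×8 GPU layout and that `flux.cli` behaves identically under either launch mode (`flux_train` is an arbitrary rendezvous ID):

```bash
# Identical command on every node; no --node_rank needed with c10d rendezvous
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=node1:29500 --rdzv_id=flux_train \
    -m flux.cli train --config config.yaml
```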
## SLURM Script
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun python -m flux.cli train --config config.yaml
```
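
The script above starts 8 tasks per node with `srun`, which assumes the trainer derives its rank, world size, and master address from the SLURM environment. If it does not, a common alternative is to launch `torchrun` once per node and let it spawn the per-GPU workers. A sketch of that pattern, with `flux_train` as a placeholder job name and the same assumptions about `flux.cli` as above:

```bash
#!/bin/bash
#SBATCH --job-name=flux_train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" --rdzv_id="$SLURM_JOB_ID" \
    -m flux.cli train --config config.yaml
```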