Project Demo


Author: Shaswat Gupta
Contact: Email
Project Repository: PyRecover

Download Full Report (PDF)

Overview

PyRecover is a robust distributed checkpointing and job management system for multi-GPU SLURM workloads. It enables efficient, time-aware checkpointing to maximize cluster utilization and prevent loss of training progress.

View on GitHub

Key Features

  • Distributed checkpointing for large models and multi-GPU jobs
  • Time-aware job management to avoid job preemption and maximize resource usage
  • Seamless SLURM integration for easy deployment on HPC clusters
  • Fault-tolerant training with automatic resume from latest checkpoint
  • Support for Flash Attention and other advanced optimizations
  • Flexible configuration for both single-node and multi-node jobs
  • Comprehensive benchmarking and loss convergence tracking
  • Open-source under the MIT License

Installation

# Clone the repository
git clone https://github.com/Shaswat-G/PyRecover
cd PyRecover

# Create and activate conda environment
conda env create -f env.yml
conda activate pyrecover
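
A quick sanity check after activation, assuming the env.yml environment ships PyTorch with CUDA support:

# Verify that PyTorch is importable and sees the GPUs (assumes PyTorch is part of the environment)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"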

Installation with Flash Attention

Ensure the CUDA toolkit, a C++ compiler, and the Python development headers are installed, then run one of:

./setup_flashattention.sh
# or
pip install ".[flash-attention]"
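
To confirm the build, a minimal import check can be used; this assumes the flash-attention extra installs the upstream flash-attn package:

# Should print the installed Flash Attention version without errors
python -c "import flash_attn; print(flash_attn.__version__)"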

Project Structure

pyrecover/
├── train.py                      # Main training script
├── env.yml                       # Conda environment file
├── submit-training-simple.sh     # SLURM submission script
├── setup_flashattention.sh       # Flash Attention setup
├── tests/                        # Benchmark and test scripts
│   └── check_weights_equality.py # Checkpoint equality checker
└── ...                           # Other modules and utilities

Quick Start

Non-distributed Training

sbatch submit-training-simple.sh --exp_name=my_experiment

Distributed Training

sbatch submit-training-simple.sh --distributed --exp_name=distributed_exp

Resume from Checkpoint

sbatch submit-training-simple.sh --distributed --continue --use_torch_distributed_ckpt

Command Line Arguments

The training script (train.py) accepts various arguments:

Argument                        Description (default)
--dataset                       Path to parquet file with text data (default: /capstor/store/cscs/ethz/large-sc/datasets/train_data.parquet)
--sequence-length               Maximum sequence length (default: 2048)
--batch-size                    Batch size per GPU (default: 1)
--learning-rate                 Learning rate (default: 1e-5)
--training-steps                Number of training steps (default: 1000)
--distributed                   Enable distributed training (default: False)
--model-dtype                   Model precision: fp16/bf16/fp32/fp64 (default: "bf16")
--checkpoint-dir                Directory for checkpoints (default: "checkpoints/")
--checkpoint-frequency          Save a checkpoint every N steps (default: 10)
--resume-from-checkpoint        Path to a checkpoint, or "latest" (default: None)
--profile                       Enable profiling support for nsys (default: False)
--experiment_name               Name of the experiment, used as the checkpoint subfolder (default: "default-exp")
--use-torch-distributed-ckpt    Use distributed checkpointing (default: False)
--compile                       Compile the model with torch.compile (default: False)
--fused-optimizer               Use a fused optimizer (default: False)
--use_flash_attention           Use Flash Attention in the model (default: False)
--log-loss-to-csv               Log loss to a CSV for plots/comparison (default: False)
--timeaware-checkpointing       Activate time-aware checkpointing (default: False)

For a complete list, run:

python train.py --help
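
As an illustration of how these flags combine, here is a short single-GPU run that checkpoints every 10 steps and logs the loss to CSV; the experiment name and step count are placeholders, not recommended settings:

# Short illustrative run (values are examples only)
python train.py \
    --training-steps 100 \
    --checkpoint-frequency 10 \
    --log-loss-to-csv \
    --experiment_name demo-exp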

SLURM Submission Script

The script submit-training-simple.sh launches jobs on SLURM clusters. Key parameters:

SLURM parameter       Description
--nodes               Number of nodes to allocate
--ntasks-per-node     Tasks per node (typically 1 per GPU)
--gpus-per-node       GPUs to use per node
--time                Time limit for the job
--partition           SLURM partition to use
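
For orientation, these parameters typically appear as #SBATCH directives at the top of the submission script. The values below are a minimal sketch, not the exact contents of submit-training-simple.sh:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00
#SBATCH --partition=normal

# Launch one task per GPU on every allocated node and forward any extra
# arguments to the training script (forwarding via "$@" is illustrative)
srun python train.py --distributed "$@"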

Checkpointing

  • Vanilla Checkpointing: Standard PyTorch checkpointing (default)
  • Distributed Checkpointing: Faster for large models (enable with --use_torch_distributed_ckpt)
  • Time-Aware Checkpointing: Add --timeaware-checkpointing to auto-save before the job's walltime expires (see the flag examples below)
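
The following commands sketch how each mode is selected. They use only the train.py arguments listed above; the experiment names are illustrative, and multi-GPU runs would normally go through the SLURM submission script rather than a bare python call:

# Vanilla checkpointing every 10 steps (default behaviour)
python train.py --checkpoint-frequency 10 --experiment_name ckpt-demo

# Distributed checkpointing for large multi-GPU jobs
python train.py --distributed --use-torch-distributed-ckpt --experiment_name ckpt-demo

# Time-aware checkpointing, resuming from the most recent checkpoint
python train.py --timeaware-checkpointing --resume-from-checkpoint latest --experiment_name ckpt-demo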

Benchmarks

  • Check equality of weights: python tests/check_weights_equality.py <checkpoint1> <checkpoint2> [--distributed] [--tolerance 1e-7] [--verbose] (see the example below)
  • Loss convergence: Add --log-loss-to-csv to log step-wise loss for analysis
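
An illustrative invocation of the equality checker, comparing an original checkpoint with one produced after resuming; the checkpoint paths are hypothetical:

# Compare two checkpoints within a tolerance (paths are examples)
python tests/check_weights_equality.py \
    checkpoints/demo-exp/step_100 \
    checkpoints/demo-exp/step_100_resumed \
    --tolerance 1e-7 --verbose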

License

This project is licensed under the MIT License.

Acknowledgments

Developed at ETH Zurich for robust, large-scale deep learning on HPC clusters.

View on GitHub
