# PyRecover

Author: Shaswat Gupta
Contact: Email
Project Repository: [PyRecover](https://github.com/Shaswat-G/PyRecover)

## Overview
PyRecover is a robust distributed checkpointing and job management system for multi-GPU SLURM workloads. It enables efficient, time-aware checkpointing to maximize cluster utilization and prevent loss of training progress.
## Key Features
- Distributed checkpointing for large models and multi-GPU jobs
- Time-aware job management to avoid job preemption and maximize resource usage
- Seamless SLURM integration for easy deployment on HPC clusters
- Fault-tolerant training with automatic resume from latest checkpoint
- Support for Flash Attention and other advanced optimizations
- Flexible configuration for both single-node and multi-node jobs
- Comprehensive benchmarking and loss convergence tracking
- Open-source under the MIT License
## Installation

```bash
# Clone the repository
git clone https://github.com/Shaswat-G/PyRecover
cd PyRecover

# Create and activate conda environment
conda env create -f env.yml
conda activate pyrecover
```
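After activating the environment, a quick sanity check confirms that PyTorch is importable and can see the node's GPUs (this assumes the conda environment provides a CUDA-enabled PyTorch build; on a login node without GPUs the device count will simply be 0):

```bash
# Sanity check: PyTorch version, CUDA availability, and visible GPU count
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```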
### Installation with Flash Attention

Ensure the CUDA toolkit, a C++ compiler, and the Python development headers are installed. Then:

```bash
./setup_flashattention.sh
# or
pip install ".[flash-attention]"
```
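To confirm the build succeeded, try importing the package (a minimal check, assuming the standard `flash_attn` import name used by the flash-attention wheel):

```bash
# Verify that flash-attention was built and is importable
python -c "import flash_attn; print(flash_attn.__version__)"
```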
## Project Structure

```
pyrecover/
├── train.py                      # Main training script
├── env.yml                       # Conda environment file
├── submit-training-simple.sh     # SLURM submission script
├── setup_flashattention.sh       # Flash Attention setup
├── tests/                        # Benchmark and test scripts
│   └── check_weights_equality.py # Checkpoint equality checker
└── ...                           # Other modules and utilities
```
## Quick Start

### Non-distributed Training

```bash
sbatch submit-training-simple.sh --exp_name=my_experiment
```

### Distributed Training

```bash
sbatch submit-training-simple.sh --distributed --exp_name=distributed_exp
```

### Resume from Checkpoint

```bash
sbatch submit-training-simple.sh --distributed --continue --use_torch_distributed_ckpt
```
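A typical preemption-safe workflow chains these together: launch a named distributed run, then resubmit the same experiment with `--continue` after an interruption. The flags below are the ones documented above; the experiment name is illustrative:

```bash
# Initial submission of a named distributed run
sbatch submit-training-simple.sh --distributed --exp_name=llm_pretrain

# After preemption or a walltime hit, resubmit and resume from the latest checkpoint
sbatch submit-training-simple.sh --distributed --exp_name=llm_pretrain --continue --use_torch_distributed_ckpt
```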
## Command Line Arguments

The training script (`train.py`) accepts various arguments:
| Argument | Description | Default |
|---|---|---|
| `--dataset` | Path to parquet file with text data | `/capstor/store/cscs/ethz/large-sc/datasets/train_data.parquet` |
| `--sequence-length` | Maximum sequence length | 2048 |
| `--batch-size` | Batch size per GPU | 1 |
| `--learning-rate` | Learning rate | 1e-5 |
| `--training-steps` | Number of training steps | 1000 |
| `--distributed` | Enable distributed training | False |
| `--model-dtype` | Model precision (fp16/bf16/fp32/fp64) | "bf16" |
| `--checkpoint-dir` | Directory for checkpoints | "checkpoints/" |
| `--checkpoint-frequency` | Save checkpoint every N steps | 10 |
| `--resume-from-checkpoint` | Path to checkpoint or "latest" | None |
| `--profile` | Enable profiling support for nsys | False |
| `--experiment_name` | Name of experiment (for checkpoint subfolder) | "default-exp" |
| `--use-torch-distributed-ckpt` | Use distributed checkpointing | False |
| `--compile` | Compile model with torch.compile | False |
| `--fused-optimizer` | Use fused optimizer | False |
| `--use_flash_attention` | Use flash-attention in the model | False |
| `--log-loss-to-csv` | Log loss to a CSV for plots/comparison | False |
| `--timeaware-checkpointing` | Enable time-aware checkpointing | False |
For a complete list, run:

```bash
python train.py --help
```
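For reference, a non-distributed run can also be launched directly on a GPU node without the submission script, using only the flags documented in the table above (the dataset path is a placeholder):

```bash
# Example single-GPU run; all flags are listed in the table above
python train.py \
  --dataset /path/to/train_data.parquet \
  --sequence-length 2048 \
  --batch-size 1 \
  --learning-rate 1e-5 \
  --training-steps 1000 \
  --checkpoint-dir checkpoints/ \
  --checkpoint-frequency 10 \
  --experiment_name my_experiment
```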
## SLURM Submission Script

The script `submit-training-simple.sh` launches jobs on SLURM clusters. Key parameters:
| SLURM Parameter | Description |
|---|---|
| `--nodes` | Number of nodes to allocate |
| `--ntasks-per-node` | Tasks per node (typically 1 per GPU) |
| `--gpus-per-node` | GPUs to use per node |
| `--time` | Time limit for the job |
| `--partition` | SLURM partition to use |
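These parameters map onto the `#SBATCH` directives in the submission script's preamble. A sketch of such a preamble is shown below; the values are illustrative, not the script's defaults:

```bash
#!/bin/bash
#SBATCH --nodes=2                 # Number of nodes to allocate
#SBATCH --ntasks-per-node=4       # Tasks per node (typically 1 per GPU)
#SBATCH --gpus-per-node=4         # GPUs to use per node
#SBATCH --time=01:00:00           # Time limit for the job
#SBATCH --partition=normal        # SLURM partition to use
```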
## Checkpointing

- Vanilla Checkpointing: Standard PyTorch checkpointing (default)
- Distributed Checkpointing: Faster for large models (enable with `--use_torch_distributed_ckpt`)
- Time-Aware Checkpointing: Add `--timeaware-checkpointing` to automatically save a checkpoint before the job's walltime ends
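The modes compose. For example, a long run on a time-limited partition might enable both distributed and time-aware checkpointing; this sketch assumes the submission script forwards these flags to `train.py`, as in the Quick Start examples, and the experiment name is illustrative:

```bash
# Distributed run that writes sharded checkpoints and saves once more shortly before walltime
sbatch submit-training-simple.sh --distributed --exp_name=walltime_safe_run \
  --use_torch_distributed_ckpt --timeaware-checkpointing
```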
## Benchmarks

- Check equality of weights:

  ```bash
  python check_weights_equality.py <checkpoint1> <checkpoint2> [--distributed] [--tolerance 1e-7] [--verbose]
  ```

- Loss convergence: Add `--log-loss-to-csv` to log step-wise loss for analysis
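A common use of the equality checker is to verify that an interrupted-and-resumed run ends up with the same weights as an uninterrupted reference run. The checkpoint paths below are illustrative (the actual layout depends on `--checkpoint-dir` and `--experiment_name`):

```bash
# Compare the final checkpoint of a resumed run against an uninterrupted reference run
python check_weights_equality.py \
  checkpoints/reference_exp/step_1000 \
  checkpoints/resumed_exp/step_1000 \
  --tolerance 1e-7 --verbose
```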
## License

This project is licensed under the MIT License.

## Acknowledgments

Developed at ETH Zurich for robust, large-scale deep learning on HPC clusters.