ParaAttention is a library that speeds up model inference using context parallel attention in conjunction with `torch.compile`. With support for both Ulysses Style and Ring Style parallelism behind a single interface, it targets developers who want to scale attention across multiple GPUs without restructuring their models.
Key Features
- User-friendly Interface: Accelerate model inference with context parallel attention and `torch.compile`, achieving lossless speedups for models like FLUX and Mochi.
- Unified Attention Mechanism: Run context parallel attention through a single consistent interface while retaining peak performance under `torch.compile`.
- Optimized Performance: A cutting-edge attention implementation in Triton that runs up to 50% faster than the original FA2 implementation on GPUs such as the RTX 4090.
Performance Benchmarks
Here’s a glance at the performance improvements you can expect with ParaAttention:
| Model | GPU | Method | Wall Time (s) | Speedup |
| --- | --- | --- | --- | --- |
| FLUX.1-dev | A100-SXM4-80GB | Baseline | 13.843 | 1.00x |
| FLUX.1-dev | A100-SXM4-80GB | torch.compile | 9.997 | 1.38x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) | 8.379 | 1.65x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) | 8.307 | 1.66x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 5.915 | 2.34x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 5.775 | 2.39x |
| mochi-1-preview | A100-SXM4-80GB | Baseline | 196.534 | 1.00x |
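The Speedup column follows directly from the wall times, so it is easy to sanity-check. A quick verification of the FLUX.1-dev rows in Python:

```python
# Verify the reported FLUX.1-dev speedups against the single-GPU baseline.
baseline = 13.843  # seconds

results = {
    "torch.compile": (9.997, 1.38),
    "para-attn (ulysses)": (8.379, 1.65),
    "para-attn (ring)": (8.307, 1.66),
    "para-attn (ulysses) + torch.compile": (5.915, 2.34),
    "para-attn (ring) + torch.compile": (5.775, 2.39),
}

for method, (seconds, reported) in results.items():
    speedup = baseline / seconds
    # Each reported value matches the computed ratio to two decimal places.
    assert abs(speedup - reported) < 0.01, method
    print(f"{method}: {speedup:.2f}x")
```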
These benchmarks show that combining context parallelism across two GPUs with `torch.compile` delivers roughly a 2.4x end-to-end speedup over the single-GPU baseline for FLUX.1-dev.
Getting Started with ParaAttention
ParaAttention is designed for straightforward integration: it is compatible with existing pipelines such as FLUX and Mochi, so you can parallelize a model's attention without rewriting your inference code.
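As a rough sketch of what multi-GPU usage could look like: the module path and function names below are assumptions for illustration, not ParaAttention's confirmed API, so consult the project README for the real entry points. The script would be launched once per GPU, e.g. with `torchrun`:

```python
# Hypothetical sketch of parallelizing a FLUX pipeline across two GPUs.
# The `para_attn` import path and `parallelize_pipe` name are assumptions;
# check the ParaAttention documentation for the actual API.
def run_flux_with_context_parallelism():
    import torch
    import torch.distributed as dist
    from diffusers import FluxPipeline
    from para_attn.context_parallel.diffusers_adapters import parallelize_pipe  # assumed

    # Launched via: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group()
    torch.cuda.set_device(dist.get_rank())

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    # Split attention computation across the participating GPUs.
    parallelize_pipe(pipe)
    pipe.transformer = torch.compile(pipe.transformer)

    image = pipe("a photo of a cat", num_inference_steps=28).images[0]
    if dist.get_rank() == 0:
        image.save("flux.png")
```

All ranks run the same pipeline call; only rank 0 saves the output, since every rank produces the identical image.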