# MLX-vs-PyTorch

Welcome to the MLX-vs-PyTorch repository! This project presents a thorough benchmarking comparison between two leading machine learning frameworks, MLX and PyTorch, running on Apple Silicon devices.
## Purpose
This project is designed to help developers make informed decisions when starting AI projects on Apple computers. We ran a series of benchmarks that simulate day-to-day AI workloads to evaluate the performance of each framework. For details, see the In-Depth Look at Each Benchmark section below.
## Benchmarks Conducted
The following benchmarks were executed to compare the performance of MLX and PyTorch:
- Training a transformer language model (`lm_train.py`).
- Training/fine-tuning BERT (`bert_fine_tune.py`).
- Inference using OpenAI's Whisper model (`whisper_inference.py`).
- Language model inference using TinyLlama (`llm_inference.py`).
- A synthetic benchmark that transfers data between the CPU and GPU for matrix multiplication (`switch_test.py`).
## Results Overview
We ran each benchmark for multiple iterations and report the average execution time, in seconds, across those runs. A sketch of the general timing approach appears after the tables.
### M1 Pro (10-core CPU, 16-core GPU, 32 GB RAM)

| Benchmark | PyTorch Time (s) | MLX Time (s) |
|---|---|---|
| Training a transformer language model | 1806.63 | 1157.00 |
| Training BERT | 751.02 | 718.35 |
| Whisper inference | 31.99 | 8.50 |
| TinyLlama inference | 59.27 | 33.38 |
| CPU/GPU switch | 349.72 | 270.15 |
### M1 Max (10-core CPU, 32-core GPU, 64 GB RAM)

| Benchmark | PyTorch Time (s) | MLX Time (s) |
|---|---|---|
| Training a transformer language model | 1106.75 | 752.25 |
| Training BERT | 793.67 | 499.34 |
| Whisper inference | 21.28 | 6.95 |
| TinyLlama inference | 50.98 | 20.61 |
| CPU/GPU switch | 251.71 | 214.57 |
### M3 Max (16-core CPU, 40-core GPU, 48 GB RAM)

| Benchmark | PyTorch Time (s) | MLX Time (s) |
|---|---|---|
| Training a transformer language model | 912.52 | 426.00 |
| Training BERT | 550.29 | 408.45 |
| Whisper inference | 17.90 | 4.85 |
| TinyLlama inference | 36.18 | 15.41 |
| CPU/GPU switch | 146.35 | 140.51 |
For the raw execution times of each benchmark, please refer to `raw_results.txt`.
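As context for how averages like these are gathered, a generic wall-clock harness of the following shape suffices. This is a sketch, not the repository's actual measurement code; the function name and default iteration count are hypothetical:

```python
import statistics
import time

def average_runtime(fn, iterations: int = 5) -> float:
    """Run fn several times and return the mean wall-clock time in seconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # the benchmark workload under test
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```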
## In-Depth Look at Each Benchmark
### Training a Transformer Language Model
This benchmark trains the transformer model from MLX's TransformerLM example on the Penn Treebank (PTB) corpus. For the detailed configuration, see `lm_train.py`.
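As a rough illustration of what this benchmark exercises, here is a minimal MLX training step in the spirit of the TransformerLM example. The model sizes, optimizer settings, and random stand-in batches are illustrative assumptions, not the benchmark's actual configuration (see `lm_train.py` for that):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TransformerLM(nn.Module):
    def __init__(self, vocab_size: int, dims: int, num_heads: int, num_layers: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dims)
        self.pe = nn.SinusoidalPositionalEncoding(dims)
        self.transformer = nn.TransformerEncoder(num_layers, dims, num_heads)
        self.out_proj = nn.Linear(dims, vocab_size)

    def __call__(self, x):
        seq_len = x.shape[1]
        # Additive causal mask so each position only attends to the past.
        mask = nn.MultiHeadAttention.create_additive_causal_mask(seq_len)
        h = self.embedding(x) + self.pe(mx.arange(seq_len))
        return self.out_proj(self.transformer(h, mask))

def loss_fn(model, inputs, targets):
    return nn.losses.cross_entropy(model(inputs), targets, reduction="mean")

model = TransformerLM(vocab_size=10000, dims=256, num_heads=8, num_layers=4)
optimizer = optim.AdamW(learning_rate=3e-4)
step = nn.value_and_grad(model, loss_fn)

# One training step on random token ids standing in for PTB batches.
x = mx.random.randint(0, 10000, (16, 64))
y = mx.random.randint(0, 10000, (16, 64))
loss, grads = step(model, x, y)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)  # force MLX's lazy computation to run
```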
### Training/Fine-tuning BERT
This benchmark trains the BERT-tiny model, as described in Conneau et al., to classify sentences. Training data is drawn from the NLI dataset, and no pre-trained weights are used, which keeps the pure PyTorch and MLX implementations directly comparable.
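For orientation, a BERT-tiny-sized classifier can be built from scratch with HuggingFace's `transformers`; constructing the model from a config (rather than `from_pretrained`) yields the randomly initialized weights this benchmark uses. The exact sizes and label count below are assumptions, not the benchmark's settings:

```python
from transformers import BertConfig, BertForSequenceClassification

# Assumed BERT-tiny-sized configuration (2 layers, hidden size 128).
config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    num_labels=3,  # e.g. entailment / neutral / contradiction for NLI pairs
)

# Constructing from a config skips pre-trained weights entirely.
model = BertForSequenceClassification(config).to("mps")
```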
### Whisper Inference
The PyTorch benchmark runs the tiny Whisper model through the HuggingFace transformers library, while the MLX benchmark uses the tooling from the MLX examples repository to adapt the model for MLX.
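A minimal sketch of the PyTorch/HuggingFace side might look like the following; the audio file name is a placeholder, and the benchmark's actual pipeline options may differ (see `whisper_inference.py`):

```python
from transformers import pipeline

# Run the tiny Whisper checkpoint on the Apple GPU via the "mps" device.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device="mps",
)

# "audio.wav" stands in for whatever clip you want to transcribe.
print(asr("audio.wav")["text"])
```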
### TinyLlama Inference
This benchmark runs inference with HuggingFace's TinyLlama-1.1B-Chat-v1.0 model, adapted so that the same workload executes under both frameworks.
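For reference, loading this checkpoint with HuggingFace transformers on the Apple GPU looks roughly like the following; the prompt, dtype, and generation length are illustrative, not the benchmark's settings (see `llm_inference.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16  # half precision is an assumption here
).to("mps")

prompt = "What makes Apple Silicon interesting for machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to("mps")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```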
### CPU/GPU Switch
The matrix multiplication benchmark involves continuous switching between CPU and GPU processing, allowing us to assess the efficiency of data transfer mechanisms within each framework.
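A minimal PyTorch sketch of this switching pattern, with assumed matrix sizes and iteration count rather than the benchmark's actual parameters (see `switch_test.py`):

```python
import torch

# Alternate every matrix multiplication between the Apple GPU ("mps") and
# the CPU, so each iteration pays the cost of a device-to-device transfer.
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
for i in range(100):
    device = "mps" if i % 2 == 0 else "cpu"
    a, b = a.to(device), b.to(device)
    a = (a @ b) / 1024.0  # rescale so values stay finite across iterations
torch.mps.synchronize()  # wait for queued GPU work before stopping the clock
```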
Explore our findings to determine which framework aligns best with your AI development needs on Apple Silicon!