MLX-vs-PyTorch
Evaluate AI frameworks MLX and PyTorch on Apple Silicon GPUs.
Pitch

Discover how MLX and PyTorch stack up on Apple Silicon with this benchmark repository. Our comprehensive tests provide insights into training and inference performance across various models. Make informed decisions for your AI projects with data-driven results that highlight the strengths and weaknesses of each framework.

Description

MLX-vs-PyTorch

Welcome to the MLX-vs-PyTorch repository! This project presents a thorough benchmarking comparison between two leading artificial intelligence frameworks, MLX and PyTorch, running on Apple Silicon devices.

Purpose

This project is designed to help developers make informed decisions when starting AI projects on Apple computers. We ran a series of benchmarks simulating day-to-day AI workloads to evaluate the performance of each framework. For details, see the In-Depth Look at Each Benchmark section below.

Benchmarks Conducted

The following benchmarks were executed to compare the performance of MLX and PyTorch:

  1. Training a transformer language model (lm_train.py).
  2. Training/fine-tuning BERT (bert_fine_tune.py).
  3. Inference with OpenAI's Whisper model (whisper_inference.py).
  4. Language model inference with TinyLlama (llm_inference.py).
  5. A synthetic benchmark that repeatedly transfers data between CPU and GPU during matrix multiplication (switch_test.py).

Results Overview

Each benchmark was run multiple times; the tables below report the average execution time across those runs:

M1 Pro (10-core CPU, 16-core GPU, 32 GB RAM)

Benchmark                               PyTorch Time (s)    MLX Time (s)
Training a transformer language model   1806.63             1157.00
Training BERT                           751.02              718.35
Whisper inference                       31.99               8.50
TinyLlama inference                     59.27               33.38
CPU/GPU switch                          349.72              270.15

M1 Max (10-core CPU, 32-core GPU, 64 GB RAM)

Benchmark                               PyTorch Time (s)    MLX Time (s)
Training a transformer language model   1106.75             752.25
Training BERT                           793.67              499.34
Whisper inference                       21.28               6.95
TinyLlama inference                     50.98               20.61
CPU/GPU switch                          251.71              214.57

M3 Max (16-core CPU, 40-core GPU, 48 GB RAM)

Benchmark                               PyTorch Time (s)    MLX Time (s)
Training a transformer language model   912.52              426.00
Training BERT                           550.29              408.45
Whisper inference                       17.90               4.85
TinyLlama inference                     36.18               15.41
CPU/GPU switch                          146.35              140.51

For the raw execution times of each benchmark, please refer to raw_results.txt.

In-Depth Look at Each Benchmark

Training a Transformer Language Model

This benchmark uses the transformer model from MLX's TransformerLM example, trained on the PTB corpus. For the detailed configuration, see lm_train.py.
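As a rough illustration of the workload, here is a minimal PyTorch sketch of a transformer language-model training loop. The hyperparameters and the random-token batches are placeholders, not the benchmark's settings; the real configuration is in lm_train.py.

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters for illustration; see lm_train.py for the real ones.
vocab_size, d_model, seq_len, batch_size = 10000, 256, 128, 16

class TransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask):
        return self.head(self.encoder(self.embed(x), mask=mask))

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = TransformerLM().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(device)

for step in range(10):  # random tokens stand in for the PTB corpus here
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1), device=device)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]     # next-token prediction
    logits = model(inputs, causal_mask)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```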

Training/Fine-tuning BERT

Using the BERT-tiny model described in Conneau et al., this benchmark performs sentence classification. The training data is drawn from the NLI dataset, and no pre-trained weights are used, so the pure-PyTorch and MLX implementations are directly comparable.
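As a sketch of what training such a model from scratch looks like in PyTorch, assuming illustrative dimensions and synthetic stand-in data (the benchmark's actual setup is in bert_fine_tune.py):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; BERT-tiny uses 2 layers with hidden size 128.
vocab_size, hidden, num_classes, seq_len, batch = 30522, 128, 3, 64, 32

class TinyBertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=2, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, ids):
        h = self.encoder(self.embed(ids))
        return self.classifier(h[:, 0])   # classify from the first token, BERT-style

model = TinyBertClassifier()              # random init: no pre-trained weights
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ids = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for NLI sentences
labels = torch.randint(0, num_classes, (batch,))       # stand-in class labels
loss = nn.functional.cross_entropy(model(ids), labels)
loss.backward()
opt.step()
```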

Whisper Inference

On the PyTorch side, the benchmark runs the tiny Whisper model through the HuggingFace transformers library, while the MLX side uses the tooling from the MLX examples repository to convert and run the model.
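For reference, the PyTorch side can be reduced to a few lines with the transformers pipeline API. The audio file name below is a placeholder; the benchmark's exact setup is in whisper_inference.py.

```python
from transformers import pipeline

# Tiny Whisper checkpoint via the HuggingFace pipeline; "audio.wav" is a placeholder.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("audio.wav")["text"])
```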

TinyLlama Inference

This benchmark runs HuggingFace's TinyLlama-1.1B-Chat-v1.0 model, with adaptations so the same workload executes in both frameworks.
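A minimal sketch of the PyTorch side using the HuggingFace API; the prompt and token budget are made up for illustration (see llm_inference.py for the benchmark's actual settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to(device)

# Placeholder prompt and generation length, purely for illustration.
inputs = tok("Explain Apple Silicon in one sentence.", return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```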

CPU/GPU Switch

This benchmark multiplies matrices while repeatedly switching between CPU and GPU execution, which lets us assess the efficiency of each framework's data-transfer mechanisms.
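In PyTorch terms the pattern looks roughly like the sketch below; the matrix size and iteration count are placeholders, and switch_test.py has the real values. (MLX's unified-memory model makes the equivalent MLX code look different: it schedules operations on CPU or GPU streams rather than copying tensors between devices.)

```python
import torch

gpu = "mps" if torch.backends.mps.is_available() else "cpu"
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

for _ in range(100):
    a, b = a.to(gpu), b.to(gpu)       # push operands to the GPU
    c = a @ b                         # matmul runs on the Metal backend
    a, b = c.to("cpu"), b.to("cpu")   # pull the result back to the CPU
    a = a / a.norm()                  # cheap CPU step, keeps values bounded

if gpu == "mps":
    torch.mps.synchronize()           # wait for queued GPU work to finish
```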

Explore our findings to determine which framework aligns best with your AI development needs on Apple Silicon!