This project implements a collection of optimizers for TensorFlow and Keras. They can be used in the same way as the built-in Keras optimizers.
AdaBelief
Overview:
The AdaBelief optimizer is a modification of the Adam optimizer designed to adapt the learning rate to the gradient’s variability. This approach makes it particularly effective for handling noisy gradients and improving generalization. It supports advanced features such as rectification (inspired by RAdam), weight decay, gradient clipping, and the ability to degenerate into SGD when required.
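At its core, the only change relative to Adam is which quantity the second-moment accumulator tracks. Below is a minimal NumPy sketch of a single update step, assuming the standard formulation from the AdaBelief paper; the helper name, signature, and state handling are illustrative, not the class's internals (`t` is the 1-based step count).
import numpy as np

def adabelief_step(param, grad, m, s, t, lr=1e-3,
                   beta_1=0.9, beta_2=0.999, epsilon=1e-16):
    # First moment: identical to Adam.
    m = beta_1 * m + (1 - beta_1) * grad
    # Second moment tracks the squared deviation of the gradient from m
    # (the "belief"), not the raw squared gradient as in Adam.
    s = beta_2 * s + (1 - beta_2) * (grad - m) ** 2 + epsilon
    # Bias correction and parameter update.
    m_hat = m / (1 - beta_1 ** t)
    s_hat = s / (1 - beta_2 ** t)
    param = param - lr * m_hat / (np.sqrt(s_hat) + epsilon)
    return param, m, s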
Parameters:
- learning_rate (float, default=1e-3): The step size for parameter updates.
- beta_1 (float, default=0.9): Exponential decay rate for the first moment estimates.
- beta_2 (float, default=0.999): Exponential decay rate for the second moment estimates.
- epsilon (float, default=1e-16): Small constant for numerical stability.
- weight_decay (float, default=0): Coefficient for weight decay. Applies either decoupled or standard decay based on decoupled_decay.
- amsgrad (bool, default=False): Whether to use the AMSGrad variant.
- decoupled_decay (bool, default=True): Enables decoupled weight decay as described in AdamW.
- fixed_decay (bool, default=False): Uses fixed weight decay instead of scaling it by the learning rate.
- rectify (bool, default=True): Whether to apply rectified updates inspired by RAdam.
- degenerated_to_sgd (bool, default=True): Degenerates into SGD in low-variance scenarios.
- clipnorm (float, optional): Clips gradients by norm.
- clipvalue (float, optional): Clips gradients by value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Whether to apply an Exponential Moving Average (EMA) to model weights.
- ema_momentum (float, default=0.99): Momentum for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting EMA weights.
- loss_scale_factor (float, optional): Factor for scaling the loss during gradient computation.
- gradient_accumulation_steps (int, optional): Steps for accumulating gradients.
- name (str, default="adabelief"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.adabelief import AdaBelief
# Instantiate optimizer
optimizer = AdaBelief(
    learning_rate=1e-3,
    weight_decay=1e-2,
    rectify=True,
    decoupled_decay=True
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
AdamP
Overview:
The AdamP optimizer is a modification of the Adam optimizer that aims to slow down the increase of weight norms in momentum-based optimizers. This is particularly useful for improving generalization and preventing overfitting. The optimizer uses a projection step to decouple sharp and flat components of the gradients, effectively reducing sensitivity to noise.
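The distinguishing piece is the projection: when a weight tensor looks scale-invariant (its gradient is nearly orthogonal to the weight), the radial, norm-growing component of the update is removed and the effective weight decay is reduced. The NumPy sketch below shows the layer-wise form of that test under the formulation in the AdamP paper; the function name and signature are illustrative, not part of the class, and the real optimizer also performs a channel-wise version of the check.
import numpy as np

def adamp_projection(param, grad, update, delta=0.1, wd_ratio=0.1, eps=1e-8):
    # `update` is the raw Adam step direction for this tensor.
    w, g = param.ravel(), grad.ravel()
    cos = np.abs(np.dot(w, g)) / (np.linalg.norm(w) * np.linalg.norm(g) + eps)
    if cos < delta / np.sqrt(w.size):
        # Scale-invariant weight: strip the component of the update that is
        # parallel to the weight, and shrink the effective weight decay.
        w_hat = w / (np.linalg.norm(w) + eps)
        u = update.ravel()
        u = u - np.dot(u, w_hat) * w_hat
        return u.reshape(param.shape), wd_ratio
    return update, 1.0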
Parameters:
- learning_rate (float, default=1e-3): The step size for parameter updates.
- beta_1 (float, default=0.9): Exponential decay rate for the first moment estimates.
- beta_2 (float, default=0.999): Exponential decay rate for the second moment estimates.
- epsilon (float, default=1e-8): Small constant for numerical stability.
- weight_decay (float, default=0): Weight decay coefficient.
- delta (float, default=0.1): Threshold for decoupling sharp and flat gradient components.
- wd_ratio (float, default=0.1): Ratio for scaling weight decay during projection.
- nesterov (bool, default=False): Enables Nesterov momentum.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting model weights with EMA.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps over which gradients are accumulated.
- name (str, default="adamp"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.adamp import AdamP
# Define the optimizer
optimizer = AdamP(
    learning_rate=1e-3,
    weight_decay=1e-2,
    delta=0.1,
    wd_ratio=0.1,
    nesterov=True
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
LaProp
Overview:
The LaProp optimizer is an adaptive gradient algorithm that improves upon Adam by decoupling the momentum update from the adaptive normalization of the gradient. It includes optional features such as centered second moments, AMSGrad stabilization, and weight decay, making it a versatile optimizer for deep learning tasks.
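The key difference from Adam is the order of operations: the gradient is normalized by the second-moment estimate before it enters the momentum buffer, so momentum and adaptivity are decoupled. A minimal NumPy sketch of one step follows, assuming the standard LaProp formulation; the helper name and signature are illustrative, not the class's internals.
import numpy as np

def laprop_step(param, grad, m, v, t, lr=4e-4,
                beta_1=0.9, beta_2=0.999, epsilon=1e-15):
    # Second moment accumulated from raw gradients, as in Adam.
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    v_hat = v / (1 - beta_2 ** t)
    # The gradient is normalized *before* entering the momentum buffer,
    # which decouples momentum from the adaptive step size.
    m = beta_1 * m + (1 - beta_1) * grad / (np.sqrt(v_hat) + epsilon)
    m_hat = m / (1 - beta_1 ** t)
    param = param - lr * m_hat
    return param, m, v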
Parameters:
- learning_rate (float, default=4e-4): Base step size for parameter updates.
- beta_1 (float, default=0.9): Coefficient for the moving average of the first moment (mean of gradients).
- beta_2 (float, default=0.999): Coefficient for the moving average of the second moment (variance of gradients).
- epsilon (float, default=1e-15): Small constant for numerical stability.
- amsgrad (bool, default=False): If True, uses the AMSGrad variant of the optimizer.
- centered (bool, default=False): If True, centers the second-moment estimate for better stability.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting model weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- steps_before_using_centered (int, default=10): Minimum number of steps before centered updates are enabled.
- name (str, default="laprop"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.laprop import LaProp
# Define the optimizer
optimizer = LaProp(
    learning_rate=4e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-15,
    amsgrad=True,
    centered=True,
    weight_decay=1e-2,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
Lars
Overview:
The Lars optimizer is an implementation of Layer-wise Adaptive Rate Scaling (LARS), a variant of stochastic gradient descent (SGD) designed for large-batch training. It combines weight decay and trust region-based learning rate adaptation, ensuring effective scaling for deep learning models with high-dimensional parameters. This implementation also includes optional LARC (Layer-wise Adaptive Rate Clipping), momentum, and Nesterov updates.
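The central quantity is a layer-wise trust ratio that rescales the base learning rate from the norms of the weights and gradients of each tensor. A hedged NumPy sketch of how that ratio is typically computed is shown below (the helper name is illustrative, not the class's API); the weight-decayed gradient is then multiplied by this ratio before the usual SGD/momentum update.
import numpy as np

def lars_local_lr(param, grad, weight_decay=0.0, trust_coeff=0.001,
                  epsilon=1e-8, trust_clip=False):
    # Layer-wise trust ratio that scales the base learning rate.
    w_norm = np.linalg.norm(param)
    g_norm = np.linalg.norm(grad)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # fall back to the plain learning rate
    trust_ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + epsilon)
    if trust_clip:
        trust_ratio = min(trust_ratio, 1.0)  # LARC-style clipping
    return trust_ratio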
Parameters:
- learning_rate (float, default=1.0): Base learning rate for parameter updates.
- momentum (float, default=0): Momentum factor for the optimizer.
- dampening (float, default=0): Dampening factor for momentum.
- epsilon (float, default=1e-8): Small constant for numerical stability.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- nesterov (bool, default=False): Enables Nesterov momentum.
- trust_coeff (float, default=0.001): Trust coefficient for scaling the learning rate based on LARS.
- trust_clip (bool, default=False): If True, clips the trust ratio to a maximum value of 1.0.
- always_adapt (bool, default=False): If True, forces the trust ratio to be computed regardless of weight decay.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting model weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="lars"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.lars import Lars
# Define the optimizer
optimizer = Lars(
    learning_rate=1.0,
    momentum=0.9,
    weight_decay=1e-4,
    trust_coeff=0.001,
    nesterov=True,
    trust_clip=True,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
MADGRAD
Overview:
The MADGRAD optimizer is an advanced optimization algorithm designed for large-scale machine learning tasks. It is based on the paper "Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization" (MADGRAD) and provides benefits for both sparse and dense gradient updates. This implementation is compatible with TensorFlow and includes support for advanced features like weight decay, momentum, and gradient accumulation.
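Below is a hedged NumPy sketch of the dual-averaging form of the update described in the paper: gradients and squared gradients are accumulated with growing weights, the dual-averaging iterate uses a cube-root denominator, and momentum mixes the current parameters toward that iterate. The helper name, state handling, and details such as weight decay are illustrative and may differ from the class's internals (`k` is the 0-based step count, `x0` the initial parameter value).
import numpy as np

def madgrad_step(param, grad, s, nu, x0, k, lr=1e-2, momentum=0.9, epsilon=1e-6):
    lamb = lr * np.sqrt(k + 1)             # growing step-size weight
    s = s + lamb * grad                     # weighted sum of gradients
    nu = nu + lamb * grad ** 2              # weighted sum of squared gradients
    z = x0 - s / (np.cbrt(nu) + epsilon)    # dual-averaging iterate
    c = 1.0 - momentum
    param = (1 - c) * param + c * z         # momentum as an average toward z
    return param, s, nu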
Parameters:
- learning_rate (float, default=1e-2): Base learning rate for parameter updates.
- epsilon (float, default=1e-6): Small constant for numerical stability.
- momentum (float, default=0.9): Momentum factor for the optimizer.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- decoupled_decay (bool, default=False): If True, applies decoupled weight decay.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting model weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="madgrad"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.madgrad import MADGRAD
# Define the optimizer
optimizer = MADGRAD(
    learning_rate=1e-2,
    momentum=0.9,
    weight_decay=1e-4,
    decoupled_decay=True,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
MARS
Overview:
The MARS optimizer implements a novel optimization algorithm designed for training large-scale models effectively. It leverages variance reduction techniques, adaptive learning rates, and supports both AdamW-style and Lion-style updates for parameter optimization. MARS also incorporates specialized mechanisms to handle 1D and 2D gradients differently, ensuring efficiency and accuracy in various scenarios.
This implementation is based on the paper MARS: Unleashing the Power of Variance Reduction for Training Large Models.
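The heart of the method is a variance-reduction correction applied to the raw gradient before the AdamW- or Lion-style update. The NumPy sketch below follows the paper's formulation of that correction and its norm clipping; the helper is illustrative, not the class's API, and the corrected gradient would then be fed into a standard AdamW (or Lion) moment update.
import numpy as np

def mars_corrected_grad(grad, prev_grad, gamma=0.025, beta_1=0.9):
    # Variance-reduced gradient: current gradient plus a scaled difference
    # between the current and previous gradients.
    c = grad + gamma * (beta_1 / (1.0 - beta_1)) * (grad - prev_grad)
    c_norm = np.linalg.norm(c)
    if c_norm > 1.0:          # clip the corrected gradient to unit norm
        c = c / c_norm
    return c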
Parameters:
- learning_rate (float, default=3e-3): The learning rate for optimization.
- beta_1 (float, default=0.9): Coefficient for the first moment estimate.
- beta_2 (float, default=0.99): Coefficient for the second moment estimate.
- epsilon (float, default=1e-8): Small constant for numerical stability.
- weight_decay (float, default=0): Coefficient for weight decay (L2 regularization).
- gamma (float, default=0.025): Coefficient controlling the variance reduction term.
- mars_type (str, default="adamw"): Type of parameter update to use: "adamw" for AdamW-style updates or "lion" for Lion-style updates.
- optimize_1d (bool, default=False): If True, applies MARS-specific updates to 1D parameters.
- lr_1d_factor (float, default=1.0): Scaling factor for the learning rate on 1D parameter updates.
- betas_1d (tuple, optional): Separate (beta_1, beta_2) values for 1D parameters.
- caution (bool, default=False): If True, applies a masking mechanism to stabilize updates.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting model weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="mars"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.mars import Mars
# Initialize the MARS optimizer
optimizer = Mars(
    learning_rate=3e-3,
    beta_1=0.9,
    beta_2=0.99,
    gamma=0.025,
    mars_type="adamw",
    optimize_1d=True,
    lr_1d_factor=0.8,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
NAdam
Overview:
The NAdam optimizer is an implementation of the Nesterov-accelerated Adaptive Moment Estimation (Nadam) algorithm. Nadam extends the widely-used Adam optimizer by incorporating Nesterov momentum, providing faster convergence in some scenarios. This optimizer is particularly useful for tasks where momentum plays a critical role in overcoming saddle points and improving optimization dynamics.
The algorithm is described in:
- "Incorporating Nesterov Momentum into Adam" (PDF link)
- "On the Importance of Initialization and Momentum in Deep Learning" (PDF link)
Parameters:
- learning_rate (float, default=2e-3): Learning rate for the optimizer.
- beta_1 (float, default=0.9): Coefficient for the first moment estimate (momentum term).
- beta_2 (float, default=0.999): Coefficient for the second moment estimate (variance term).
- epsilon (float, default=1e-8): Small constant for numerical stability in divisions.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- schedule_decay (float, default=4e-3): Decay factor for momentum scheduling.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by their global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="nadam"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.nadam import NAdam
# Initialize the NAdam optimizer
optimizer = NAdam(
    learning_rate=2e-3,
    beta_1=0.9,
    beta_2=0.999,
    schedule_decay=4e-3,
    weight_decay=1e-4,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
NvNovoGrad
Overview:
The NvNovoGrad optimizer is an implementation of NovoGrad, an optimization algorithm designed for deep learning that uses layer-wise adaptive moments for efficient and robust training. NovoGrad is particularly effective for large-scale and resource-constrained deep learning tasks, as it combines the benefits of Adam and L2 regularization while being computationally efficient.
The algorithm is described in:
- "Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks" (arXiv link)
This implementation is inspired by NVIDIA's original PyTorch implementation, used in speech recognition models such as Jasper.
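The defining feature of NovoGrad is that the second-moment estimate is a single scalar per layer (per weight tensor) rather than element-wise, and the normalized, weight-decayed gradient is fed into the momentum buffer. A hedged NumPy sketch of one step follows; the helper name and signature are illustrative, not the class's internals.
import numpy as np

def novograd_step(param, grad, m, v, lr=1e-3, beta_1=0.95, beta_2=0.98,
                  epsilon=1e-8, weight_decay=0.0, grad_averaging=False):
    g_norm_sq = float(np.sum(grad ** 2))
    # The second moment is a single scalar per layer, initialized on step 1.
    v = g_norm_sq if v is None else beta_2 * v + (1 - beta_2) * g_norm_sq
    # Normalize the gradient by the layer-wise second moment, then add
    # weight decay before it enters the momentum buffer.
    g_hat = grad / (np.sqrt(v) + epsilon) + weight_decay * param
    if grad_averaging:
        g_hat = (1 - beta_1) * g_hat
    m = beta_1 * m + g_hat
    param = param - lr * m
    return param, m, v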
Parameters:
- learning_rate (float, default=1e-3): Learning rate for the optimizer.
- beta_1 (float, default=0.95): Exponential decay rate for the first moment estimate (momentum term).
- beta_2 (float, default=0.98): Exponential decay rate for the second moment estimate (variance term).
- epsilon (float, default=1e-8): Small constant for numerical stability in divisions.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- grad_averaging (bool, default=False): Enables gradient averaging for smoother updates.
- amsgrad (bool, default=False): Enables the AMSGrad variant for convergence improvements.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by their global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="nvnovograd"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.nvnovograd import NvNovoGrad
# Initialize the NvNovoGrad optimizer
optimizer = NvNovoGrad(
    learning_rate=1e-3,
    beta_1=0.95,
    beta_2=0.98,
    weight_decay=1e-4,
    grad_averaging=True,
    amsgrad=True,
)
# Compile a model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
RAdam
Overview:
The RAdam (Rectified Adam) optimizer is a variant of the Adam optimizer that incorporates a mechanism to rectify the variance of adaptive learning rates. This rectification improves stability and prevents early training instabilities, especially in the initial training phase. RAdam maintains the benefits of Adam while being more robust for a wide range of applications.
The algorithm is described in the paper:
- "On the Variance of the Adaptive Learning Rate and Beyond" (arXiv link)
This implementation is inspired by the original PyTorch implementation.
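The rectification works by estimating, at each step, how reliable the adaptive second moment is and falling back to a plain momentum update while it is not. A hedged NumPy sketch of that logic, following the formulation in the paper, is shown below; the helper name and signature are illustrative, not the class's internals.
import numpy as np

def radam_step(param, grad, m, v, t, lr=1e-3, beta_1=0.9, beta_2=0.999,
               epsilon=1e-8):
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    m_hat = m / (1 - beta_1 ** t)

    # Length of the approximated SMA and its asymptotic value.
    rho_inf = 2.0 / (1.0 - beta_2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta_2 ** t / (1.0 - beta_2 ** t)
    if rho_t > 4.0:
        # Variance is tractable: apply the rectified adaptive step.
        v_hat = np.sqrt(v / (1 - beta_2 ** t))
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                      / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        param = param - lr * r_t * m_hat / (v_hat + epsilon)
    else:
        # Early steps: un-adapted (SGD-with-momentum-like) update.
        param = param - lr * m_hat
    return param, m, v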
Parameters:
- learning_rate (float, default=1e-3): Base learning rate for the optimizer.
- beta_1 (float, default=0.9): Exponential decay rate for the first moment estimate (momentum term).
- beta_2 (float, default=0.999): Exponential decay rate for the second moment estimate (variance term).
- epsilon (float, default=1e-8): Small constant for numerical stability in divisions.
- weight_decay (float, default=0): Weight decay coefficient for L2 regularization.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by their global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum value for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="radam"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.radam import RAdam
# Initialize the RAdam optimizer
optimizer = RAdam(
    learning_rate=1e-3,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
    weight_decay=1e-4,
)
# Compile a model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
SGDP
Overview:
The SGDP (Stochastic Gradient Descent with Projection and Weight Decay) optimizer is a variant of SGD that incorporates decoupled weight decay regularization and gradient projection. These features help control weight norm growth during training, improving convergence and performance.
This algorithm is described in the paper:
- "Slowing Down the Weight Norm Increase in Momentum-based Optimizers" (arXiv link).
The implementation is inspired by the official repository.
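SGDP applies the same cosine-similarity projection sketched above for AdamP, but on top of plain SGD with momentum. Below is a simplified NumPy sketch of one step, with the projection itself omitted and its weight-decay ratio assumed to be 1.0; the helper name, signature, and weight-decay handling are illustrative, not the class's internals.
import numpy as np

def sgdp_step(param, grad, buf, lr=1e-3, momentum=0.9, dampening=0.0,
              weight_decay=0.0, nesterov=False):
    # Standard (optionally Nesterov) momentum buffer.
    buf = momentum * buf + (1.0 - dampening) * grad
    d_p = grad + momentum * buf if nesterov else buf
    # For scale-invariant weights, the projection would strip the radial
    # component of d_p and return a reduced wd_ratio; assume 1.0 here.
    wd_ratio = 1.0
    if weight_decay != 0.0:
        param = param * (1.0 - lr * weight_decay * wd_ratio)
    param = param - lr * d_p
    return param, buf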
Parameters:
- learning_rate (float, default=1e-3): Learning rate for the optimizer.
- momentum (float, default=0): Momentum factor for SGD.
- dampening (float, default=0): Dampening factor to control momentum updates.
- epsilon (float, default=1e-8): Small constant for numerical stability.
- weight_decay (float, default=0): L2 regularization coefficient (weight decay).
- delta (float, default=0.1): Threshold for the cosine similarity in the projection mechanism.
- wd_ratio (float, default=0.1): Weight decay ratio for decoupling.
- nesterov (bool, default=False): If True, enables Nesterov momentum.
- clipnorm (float, optional): Clips gradients by their norm.
- clipvalue (float, optional): Clips gradients by their value.
- global_clipnorm (float, optional): Clips gradients by their global norm.
- use_ema (bool, default=False): Enables Exponential Moving Average (EMA) for model weights.
- ema_momentum (float, default=0.99): Momentum for EMA.
- ema_overwrite_frequency (int, optional): Frequency for overwriting weights with EMA values.
- loss_scale_factor (float, optional): Scaling factor for loss values.
- gradient_accumulation_steps (int, optional): Number of steps for gradient accumulation.
- name (str, default="sgdp"): Name of the optimizer.
Example Usage:
import tensorflow as tf
from optimizers.sgdp import SGDP
# Initialize the SGDP optimizer
optimizer = SGDP(
    learning_rate=1e-3,
    momentum=0.9,
    dampening=0.1,
    weight_decay=1e-4,
    delta=0.1,
    wd_ratio=0.1,
    nesterov=True,
)
# Compile a model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)