The top 5 most used optimization techniques are as follows:
- Learning Rate Scheduling (Warmup): Gradually increases the learning rate at the start of training to stabilize early updates before the main schedule takes over.
- Cosine Annealing: Lowers the learning rate following a cosine decay schedule, which helps the model converge smoothly.
- Gradient Clipping: Caps the gradient norm (or value) at a maximum threshold to prevent exploding gradients and improve training stability.
- AdamW Optimizer: A variant of Adam with decoupled weight decay, often used for large models to reduce overfitting.
- Stochastic Weight Averaging (SWA): Averages weights from multiple points along the training trajectory, yielding a smoother and more robust final model.
These techniques are widely adopted because they improve the stability and efficiency of large-model training while maintaining or improving final performance. Minimal code sketches of each technique follow below.
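For learning rate warmup, a minimal sketch using PyTorch's `LambdaLR` is shown below. The model, base learning rate, and warmup length are illustrative assumptions, not values from the text.

```python
import torch
from torch import nn

# Hypothetical model and base learning rate, for illustration only.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

warmup_steps = 500  # assumed warmup length

def warmup_lambda(step: int) -> float:
    # Scale the learning rate linearly from ~0 up to its base value.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)

for step in range(1000):
    optimizer.step()   # parameter update (forward/backward omitted for brevity)
    scheduler.step()   # advance the warmup schedule each step
```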
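Cosine annealing can be sketched with PyTorch's `CosineAnnealingLR`; the `T_max`, learning rate, and `eta_min` values here are arbitrary placeholders.

```python
import torch
from torch import nn

model = nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Decay the learning rate from 3e-4 down to eta_min over T_max steps,
# following a cosine curve.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10_000, eta_min=1e-6
)

for step in range(10_000):
    optimizer.step()   # forward/backward omitted for brevity
    scheduler.step()
```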
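Gradient clipping is typically applied between the backward pass and the optimizer step. In this sketch the model, data, and `max_norm` value are assumptions chosen for illustration.

```python
import torch
from torch import nn

model = nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

x = torch.randn(32, 128)    # dummy batch
y = torch.randn(32, 10)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm does not exceed 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```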
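AdamW is available directly in PyTorch. The hyperparameters below (learning rate, betas, weight decay) are common illustrative choices, not prescriptions.

```python
import torch
from torch import nn

model = nn.Linear(128, 10)  # placeholder model

# AdamW applies weight decay directly to the weights (decoupled from the
# gradient-based update), unlike plain Adam with L2 regularization.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```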
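PyTorch ships SWA helpers in `torch.optim.swa_utils`. The sketch below uses a hypothetical setup: the model, dataset, SWA start epoch, and SWA learning rate are all assumptions.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

dataset = TensorDataset(torch.randn(256, 128), torch.randn(256, 10))
loader = DataLoader(dataset, batch_size=32)

swa_model = AveragedModel(model)               # holds the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-3)  # constant LR during the SWA phase
swa_start = 5                                  # assumed epoch to begin averaging

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into the average
        swa_scheduler.step()

# Recompute BatchNorm statistics for the averaged weights
# (a no-op here, since this toy model has no BatchNorm layers).
update_bn(loader, swa_model)
```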