1. The Core Role of Batch Size#
Batch Size determines the number of samples used to update the model parameters at each iteration. It directly affects the following aspects:
- Accuracy of Gradient Calculation:
  - Large batches average the gradient over many samples, bringing it closer to the "true gradient" (the gradient over the entire dataset).
  - Small batches have noisier gradients, but the noise can act as a form of regularization and help prevent overfitting.
- Utilization of Hardware Resources:
  - GPUs exploit their parallel computing capability more efficiently with large batches.
  - However, a batch that is too large causes out-of-memory (OOM) errors, so resources must be balanced.
- Convergence Speed and Stability:
  - Large-batch updates are more accurate but may converge to "sharp" minima (poor generalization).
  - Small-batch updates are more frequent and follow a "jittery" convergence path, but may find "flat" minima (better generalization).
2. The Relationship Between Batch Size and Gradient Descent#
1. Mathematical Explanation of Gradient Noise#
Assume the total number of samples is \( N \), the Batch Size is \( B \), and the loss function is \( L \).
- Full-Batch Gradient Descent (\( B = N \)):
  The gradient is noise-free, but each update is computationally expensive.
- Mini-Batch Gradient Descent (\( B \ll N \)):
  The gradient is an unbiased estimate of the true gradient, with noise variance roughly proportional to \( \frac{1}{B} \) (a quick numerical check follows below).
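As a hedged illustration of this scaling, the following NumPy sketch estimates the gradient of a simple squared loss on synthetic data (the 1-D regression setup and all constants are assumptions made only for this demo) and shows that the mini-batch estimate stays centered on the full-batch gradient while its variance shrinks roughly like \( \frac{1}{B} \):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
# Synthetic 1-D regression data (illustrative assumption, not from the article).
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.5, size=N)
w = 0.0  # current parameter value

def batch_grad(idx):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 over the given samples."""
    return np.mean((w * x[idx] - y[idx]) * x[idx])

full_grad = batch_grad(np.arange(N))  # "true" gradient over the whole dataset
print(f"full-batch gradient: {full_grad:+.3f}")

for B in (8, 32, 128, 512):
    grads = [batch_grad(rng.choice(N, size=B, replace=False)) for _ in range(2000)]
    print(f"B={B:4d}  mean={np.mean(grads):+.3f}  var={np.var(grads):.5f}")
# The means stay close to the full-batch gradient (unbiased estimate),
# while the variance shrinks roughly in proportion to 1/B.
```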
2. The Impact of Noise on Training#
- Small Batch (e.g., \( B = 32 \)):
  - High noise → large fluctuations in the parameter update direction → may escape local optima.
  - Similar to "random exploration"; suitable for complex tasks (e.g., small datasets, high-noise data).
- Large Batch (e.g., \( B = 1024 \)):
  - Low noise → stable update direction → fast convergence, but prone to getting stuck in local optima.
  - Similar to "precision guidance"; suitable for large datasets and distributed training.
3. Practical Strategies for Choosing Batch Size#
1. Maximum Batch Size Under Resource Constraints#
- VRAM Estimation Formula:
  - Maximum Batch Size ≈ \( \frac{\text{total VRAM} - \text{model VRAM}}{\text{VRAM per sample}} \)
  - For example: 24GB of GPU VRAM, 4GB used by the model, 0.2GB per sample → Maximum Batch Size ≈ \( (24 - 4)/0.2 = 100 \).
- Tip:
  - Use gradient accumulation: run several forward/backward passes with a small batch and accumulate the gradients before updating the parameters (see the sketch after this list).
    For example: target Batch Size = 64 but the GPU only fits 16 → accumulate gradients for 4 steps before each update.
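The following PyTorch sketch shows the accumulation pattern described above for that example (per-step batch 16, 4 accumulation steps, effective batch 64); the linear model, random data, and optimizer settings are placeholders, not part of the original article:

```python
import torch
from torch import nn

# Placeholder model, data, and optimizer -- substitute your own in practice.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(16, 128), torch.randint(0, 10, (16,))) for _ in range(100)]

accum_steps = 4  # 16 samples/step * 4 steps = effective Batch Size 64
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    # Divide by accum_steps so the accumulated gradient matches the average over the large batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```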
2. Interaction Between Learning Rate and Batch Size#
- Linear Scaling Rule:
  - When the Batch Size is multiplied by \( k \), multiply the learning rate by \( k \) as well.
  - Rationale: the gradient variance of a large batch drops by a factor of \( k \), so the learning rate must grow to keep the effective update step comparable.
  - For example: Batch Size = 64 with learning rate 0.1 → at Batch Size = 256, use learning rate ≈ 0.4.
- Caution:
  - The learning rate cannot be increased indefinitely. In practice, combine it with a warmup schedule that ramps the learning rate up gradually (see the sketch below).
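A minimal PyTorch sketch of this combination, reusing the numbers from the example above (base batch 64 at learning rate 0.1, scaled to batch 256); the placeholder model and the 500-step warmup length are assumptions:

```python
import torch
from torch import nn

base_batch, base_lr = 64, 0.1
batch_size = 256
scaled_lr = base_lr * batch_size / base_batch   # linear scaling rule -> 0.4

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)

warmup_steps = 500                              # assumed warmup length

def warmup(step):
    # Ramp the learning rate linearly from ~0 to scaled_lr, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

# Inside the training loop, call after each optimizer.step():
#     scheduler.step()
```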
3. Empirical Values for Different Tasks#
- Image Classification (ImageNet):
  - Commonly Batch Size = 256 or 512 (requires multiple GPUs in parallel).
  - Smaller models (e.g., MobileNet) can drop to 64~128.
- Object Detection/Segmentation (COCO):
  - Batch Size = 2~16 (high-resolution images use a lot of VRAM).
  - For example, Mask R-CNN typically uses Batch Size = 2~8.
- Natural Language Processing (BERT):
  - Batch Size = 32~512, combined with gradient accumulation.
  - Very large batches (e.g., 8192) require special optimizers (e.g., LAMB).
4. Advanced Effects of Batch Size#
1. Generalization Ability#
- The Generalization Dilemma of Large Batches:
  - Experiments show that large-batch training tends to converge to "sharp" minima, which perform worse on the test set.
  - Solutions:
    - Increase data augmentation.
    - Use Stochastic Weight Averaging (SWA), as sketched below.
    - Introduce explicit regularization (e.g., Label Smoothing).
- Implicit Regularization of Small Batches:
  - Gradient noise acts as a random perturbation on the parameters, similar in effect to Dropout.
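For the SWA item above, a minimal sketch using PyTorch's torch.optim.swa_utils; the placeholder model, data, learning rates, and the choice to start averaging at epoch 5 are all assumptions for illustration:

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.4)
loader = [(torch.randn(64, 128), torch.randint(0, 10, (64,))) for _ in range(50)]
loss_fn = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)                # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)   # learning rate used during the averaging phase
swa_start, epochs = 5, 10                       # start averaging after epoch 5 (assumption)

for epoch in range(epochs):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()

# Recompute BatchNorm statistics for the averaged weights (a no-op for models without BN).
update_bn(loader, swa_model)
```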
2. Coupling with Batch Normalization#
- BN's Dependence on Batch Size:
  - BN normalizes using the mean and variance of the current batch.
  - If the batch size is too small → the statistics are estimated poorly → training becomes unstable.
  - Recommendation: use BN when Batch Size ≥ 32; for smaller batches, consider Group Normalization or Layer Normalization (see the sketch below).
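A small PyTorch comparison of the two options mentioned in the recommendation; the channel count, group count, and toy input are illustrative assumptions:

```python
import torch
from torch import nn

channels = 64

# BatchNorm: statistics come from the current batch, so tiny batches give noisy estimates.
bn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU())

# GroupNorm: statistics are computed per sample over channel groups, independent of batch size.
gn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.GroupNorm(8, channels), nn.ReLU())

x_small = torch.randn(2, 3, 32, 32)   # batch of only 2, where BN statistics become unreliable
print(bn_block(x_small).shape, gn_block(x_small).shape)
```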
3. Batch Size in Distributed Training#
- Data Parallelism:
  - Each GPU processes a sub-batch, and the gradients are synchronized at the end of each step (see the sketch below).
  - Global Batch Size = per-GPU Batch Size × number of GPUs.
  - For example: 4 GPUs, each with Batch Size = 64 → Global Batch Size = 256.
- Extreme Large-Batch Training:
  - For instance, Google's training of ResNet with a Batch Size of 1.5M:
    - Requires the LARS (Layer-wise Adaptive Rate Scaling) optimizer.
    - The learning rate is adapted based on the norm of the weights in each layer.
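A minimal sketch of the data-parallel arithmetic above using torch.distributed and DistributedDataParallel; it is meant to be launched with torchrun, the model and random data are placeholders, and the gloo backend is chosen only to keep the sketch runnable on CPU:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with, e.g.: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="gloo")          # use "nccl" on GPUs
rank, world_size = dist.get_rank(), dist.get_world_size()

per_gpu_batch = 64
global_batch = per_gpu_batch * world_size        # e.g. 4 processes x 64 = 256
if rank == 0:
    print(f"global batch size = {global_batch}")

model = DDP(nn.Linear(128, 10))                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * global_batch / 64)  # linear scaling

# Each process draws its own sub-batch; DDP averages the gradients across processes on backward().
inputs = torch.randn(per_gpu_batch, 128)
targets = torch.randint(0, 10, (per_gpu_batch,))
nn.CrossEntropyLoss()(model(inputs), targets).backward()
optimizer.step()
dist.destroy_process_group()
```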
5. Specific Steps for Debugging Batch Size#
1. Initial Selection#
- Start with common values (e.g., 32 or 64) and observe VRAM usage and training speed.
- If VRAM is insufficient, halve the Batch Size until you no longer hit out-of-memory (OOM) errors (see the sketch below).
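A hedged sketch of that halving heuristic: probe one forward/backward pass and halve the batch whenever a CUDA out-of-memory error is raised; the probe function, its name, and the example model are illustrative assumptions:

```python
import torch
from torch import nn

def find_max_batch_size(model, input_shape, start=64, device="cuda"):
    """Halve the batch size until one forward/backward pass fits in VRAM (illustrative heuristic)."""
    model = model.to(device)
    batch = start
    while batch >= 1:
        try:
            x = torch.randn(batch, *input_shape, device=device)
            model(x).sum().backward()
            return batch
        except RuntimeError as err:              # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(err):
                raise
            torch.cuda.empty_cache()
            batch //= 2
    raise RuntimeError("Even batch size 1 does not fit in VRAM.")

# Example with a placeholder model:
# find_max_batch_size(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)), (3, 224, 224))
```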
2. Monitor Training Dynamics#
- Training Loss Curve:
  - Small batch: the loss fluctuates heavily while decreasing, but the overall trend is downward.
  - Large batch: the loss decreases smoothly but may stagnate early.
- Validation Set Performance:
  - If the training loss decreases but the validation loss does not → likely overfitting (reduce the Batch Size or add data augmentation).
  - If neither decreases → the model capacity may be insufficient, or there may be labeling errors.
3. Hyperparameter Tuning#
- Fix the Batch Size and tune the learning rate:
  - Use a Learning Rate Finder: gradually increase the learning rate and look for the range where the loss drops fastest (a sketch follows this list).
- Joint tuning:
  - Batch Size and learning rate should be adjusted together (refer to the linear scaling rule).
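A minimal sketch of such a learning-rate range test: sweep the learning rate exponentially over one pass through the data and record the loss at each step; the placeholder model, random data, and the 1e-6 to 1.0 sweep range are assumptions:

```python
import torch
from torch import nn

model = nn.Linear(128, 10)                       # placeholder model
loader = [(torch.randn(64, 128), torch.randint(0, 10, (64,))) for _ in range(200)]
loss_fn = nn.CrossEntropyLoss()

lr_min, lr_max, steps = 1e-6, 1.0, len(loader)
optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)

history = []
for step, (inputs, targets) in enumerate(loader):
    # Increase the learning rate exponentially from lr_min to lr_max.
    lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    history.append((lr, loss.item()))

# Pick a learning rate somewhat below the point where the recorded loss starts to blow up.
```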
6. Practical Case: Adjusting Batch Size in Image Segmentation#
Suppose you are training a U-Net for medical image segmentation:
- Hardware: a single GPU with 12GB of VRAM, input size 256x256.
- Estimating the Batch Size:
  - The model itself uses 3GB, leaving 9GB.
  - Each image uses roughly 0.5GB → Maximum Batch Size ≈ 18 → choose 16 (a power of 2).
- Training behavior:
  - The validation IoU fluctuates heavily → the Batch Size may be too small, causing high gradient noise.
  - Try gradient accumulation: accumulate over 4 steps (an effective Batch Size of 64) and scale the learning rate by 4×.
- Result:
  - The loss curve is smoother and IoU improves by 5%.
7. Conclusion#
- Batch Size is a Lever in Training: It requires balancing speed, resources, stability, and generalization ability.
- Core Rules:
  - When resources allow, start from common values (32~256).
  - Large batches require increasing the learning rate, while small batches need attention to gradient noise.
  - Adjust flexibly according to task characteristics and hardware conditions.