Understanding Batch Size in Deep Learning

1. The Core Role of Batch Size#

Batch Size determines the number of samples used to update the model parameters at each iteration. It directly affects the following aspects:

  1. Accuracy of Gradient Calculation:

    • Large batches provide gradients that are averages of multiple samples, closer to the "true gradient" (the gradient direction of the entire dataset).
    • Small batches have greater gradient noise but may provide a regularization effect, preventing overfitting.
  2. Utilization of Hardware Resources:

    • The parallel computing capability of GPUs is more efficient with large batches.
    • However, if the batch size is too large, it can lead to out-of-memory (OOM) errors, requiring a balance of resources.
  3. Convergence Speed and Stability:

    • Large batch updates are more accurate but may converge to "sharp" minima (poor generalization).
    • Small batch updates are more frequent and have a "jittery" convergence path but may find "flat" minima (better generalization).
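
To make this concrete, here is a minimal PyTorch-style sketch (the toy dataset and model are purely illustrative) showing where Batch Size enters a training loop: it is set on the DataLoader, and each iteration computes a gradient from exactly one batch and performs one parameter update.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model, purely for illustration.
X, y = torch.randn(1024, 20), torch.randint(0, 2, (1024,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)  # <- Batch Size

model = nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:              # one iteration = one batch of 64 samples
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)  # loss averaged over the batch
    loss.backward()                # gradient estimated from this batch only
    optimizer.step()               # one parameter update per batch
```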

2. The Relationship Between Batch Size and Gradient Descent#

1. Mathematical Explanation of Gradient Noise#

Assume the total number of samples is ( N ), the Batch Size is ( B ), and the loss on sample ( i ) is ( L_i ).

  • Full Batch Gradient Descent (B=N):
$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta_t)$$

The gradient is noise-free, but the computational cost is high.

  • Mini-Batch Gradient Descent (B≪N):
$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla L_i(\theta_t)$$

The gradient is an unbiased but noisy estimate of the true gradient, with noise variance proportional to ( \frac{1}{B} ).
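
As a quick sanity check on the ( \frac{1}{B} ) claim, the following hedged sketch (a toy linear-regression loss; all names are illustrative) samples many mini-batch gradients and measures how far they scatter around the full-batch gradient as ( B ) grows.

```python
import torch

torch.manual_seed(0)
N, d = 10_000, 5
X = torch.randn(N, d)
w_true = torch.randn(d)
y = X @ w_true + 0.1 * torch.randn(N)
w = torch.zeros(d, requires_grad=True)

def batch_grad(idx):
    """Gradient of the mean-squared loss on the samples in `idx`."""
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()
    g, = torch.autograd.grad(loss, w)
    return g

full_grad = batch_grad(torch.arange(N))    # "true" gradient (B = N)

for B in (32, 128, 512):
    # Sample many mini-batch gradients and measure their spread around the full gradient.
    devs = [((batch_grad(torch.randint(0, N, (B,))) - full_grad) ** 2).sum()
            for _ in range(200)]
    print(f"B={B:4d}  mean squared deviation ≈ {torch.stack(devs).mean().item():.5f}")
# Expect the printed value to shrink roughly 4x each time B grows 4x.
```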

2. The Impact of Noise on Training#

  • Small Batch (B=32):

    • High noise → Large fluctuations in parameter update direction → May escape local optima.
    • Similar to "random exploration," suitable for complex tasks (e.g., small datasets, high-noise data).
  • Large Batch (B=1024):

    • Low noise → Stable update direction → Fast convergence but prone to local optima.
    • Similar to "precision guidance," suitable for large datasets and distributed training.

3. Practical Strategies for Choosing Batch Size#

1. Maximum Batch Size Under Resource Constraints#

  • VRAM Estimation Formula:
$$\text{Maximum Batch Size} = \frac{\text{Available VRAM} - \text{Model VRAM Usage}}{\text{Single Sample VRAM Usage}}$$
  • For example: GPU VRAM 24GB, model usage 4GB, each sample uses 0.2GB → Maximum Batch Size ≈ ( (24-4)/0.2 = 100 ).

  • Tip:

    • Use gradient accumulation: Perform multiple forward passes with small batches to accumulate gradients before updating parameters.
      For example: Target Batch Size=64, actual GPU can only support 16 → Accumulate gradients 4 times before updating.
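
A minimal gradient-accumulation sketch for that example, assuming a standard PyTorch training loop (`loader`, `model`, `loss_fn`, and `optimizer` are placeholders from your own script):

```python
accum_steps = 4          # 16 x 4 = effective Batch Size 64

optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader):          # loader yields batches of 16
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so the summed gradient
                                                  # matches a true batch of 64
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # one update per 4 micro-batches
        optimizer.zero_grad()
```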

2. Interaction Between Learning Rate and Batch Size#

  • Linear Scaling Rule:

    • When Batch Size is multiplied by ( k ), the learning rate should also be multiplied by ( k ).
    • Theoretical basis: when the batch is ( k ) times larger, the gradient variance drops by a factor of ( k ), so the learning rate can be raised proportionally while keeping the update step size consistent.
    • For example: Batch Size=64 with learning rate 0.1 → at Batch Size=256, learning rate = 0.4.
  • Caution:

    • Learning rate cannot be infinitely increased! In practice, it should be combined with a warmup strategy to gradually increase the learning rate.
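
A hedged sketch of the linear scaling rule combined with a short warmup, using PyTorch's built-in schedulers (the model, epoch counts, and baseline numbers are illustrative):

```python
import torch

base_lr, base_batch = 0.1, 64
batch_size = 256
scaled_lr = base_lr * batch_size / base_batch     # 0.1 * 4 = 0.4 (linear scaling)

model = torch.nn.Linear(10, 10)                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

# Warm up linearly from 10% of the target LR over the first 5 epochs,
# then decay with cosine annealing for the remaining epochs.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=85)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(90):
    # ... one training epoch with the large batch ...
    scheduler.step()
```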

3. Empirical Values for Different Tasks#

  • Image Classification (ImageNet):

    • Commonly used Batch Size=256 or 512 (requires multiple GPUs for parallel processing).
    • Smaller models (e.g., MobileNet) can reduce to 64~128.
  • Object Detection/Segmentation (COCO):

    • Batch Size=2~16 (due to high VRAM usage of high-resolution images).
    • For example, Mask R-CNN typically uses Batch Size=2~8.
  • Natural Language Processing (BERT):

    • Batch Size=32~512, combined with gradient accumulation.
    • Large batches (e.g., 8192) require special optimizations (e.g., LAMB optimizer).

4. Advanced Effects of Batch Size#

1. Generalization Ability#

  • The Generalization Dilemma of Large Batches:

    • Experiments show that training with large batches tends to converge to "sharp" minima, resulting in poor performance on the test set.
    • Solutions (a code sketch follows this list):
      • Increase data augmentation.
      • Use Stochastic Weight Averaging (SWA).
      • Introduce explicit regularization (e.g., Label Smoothing).
  • Implicit Regularization of Small Batches:

    • Gradient noise acts as a random perturbation on parameters, similar to the effect of Dropout.
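
A hedged sketch of two of the mitigations listed above, label smoothing and Stochastic Weight Averaging, using PyTorch utilities (the model, learning rates, and epoch counts are placeholders):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(128, 10)                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.4)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # explicit regularization

swa_model = AveragedModel(model)                 # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
swa_start = 75                                   # start averaging late in training

for epoch in range(100):
    # ... one regular large-batch training epoch with `loss_fn` ...
    if epoch >= swa_start:
        swa_model.update_parameters(model)       # fold current weights into the average
        swa_scheduler.step()

# If the network uses BatchNorm, recompute its statistics for the averaged
# weights before evaluation, e.g. with torch.optim.swa_utils.update_bn.
```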

2. Coupling with Batch Normalization#

  • BN's Dependence on Batch Size:
    • BN normalizes based on the mean and variance of the current batch.
    • If the batch size is too small → Statistical estimates become inaccurate → Training becomes unstable.
    • Recommendation: Use BN when Batch Size ≥ 32; if the Batch Size is too small, consider using Group Normalization or Layer Normalization.
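
For illustration, the same convolutional block with BatchNorm versus GroupNorm; GroupNorm computes its statistics per sample, so it behaves the same at Batch Size 2 as at 256 (layer sizes are illustrative):

```python
import torch
from torch import nn

block_bn = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),                             # normalizes over the batch dimension
    nn.ReLU(),
)

block_gn = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=32, num_channels=128),   # no dependence on batch size
    nn.ReLU(),
)

x = torch.randn(2, 64, 32, 32)    # a tiny batch of 2, where BN statistics get noisy
print(block_bn(x).shape, block_gn(x).shape)
```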

3. Batch Size in Distributed Training#

  • Data Parallelism:

    • Each GPU processes a sub-batch, and gradients are synchronized at the end.
    • Global Batch Size = Single GPU Batch Size × Number of GPUs.
    • For example: 4 GPUs, each with Batch Size=64 → Global Batch Size=256 (see the sketch after this list).
  • Extreme Large Batch Training:

    • For instance, large-batch ImageNet training of ResNet with batch sizes on the order of 32K:
      • Requires the LARS (Layer-wise Adaptive Rate Scaling) optimizer.
      • Learning rate is adaptively adjusted based on the norm of weights in each layer.
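
A hedged data-parallel sketch with torch.distributed (assumed to be launched via torchrun on multiple GPUs; the model and learning-rate baseline are illustrative), showing how the global Batch Size and the scaled learning rate follow from the per-GPU batch:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                    # torchrun sets WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

per_gpu_batch = 64
world_size = dist.get_world_size()
global_batch = per_gpu_batch * world_size          # e.g. 4 GPUs -> 256
lr = 0.1 * global_batch / 256                      # linear scaling from a 256 baseline

model = DDP(torch.nn.Linear(10, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# Each rank trains on its own shard with batches of 64; DDP averages gradients
# across ranks during backward(), equivalent to one update with a global batch of 256.
```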

5. Specific Steps for Debugging Batch Size#

1. Initial Selection#

  • Start with common values (e.g., 32 or 64) and observe VRAM usage and training speed.
  • If VRAM is insufficient, gradually halve the Batch Size until no longer experiencing OOM (Out Of Memory).
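
One way to automate the halving step is a small probe like the sketch below (`find_max_batch_size` and `input_shape` are hypothetical names; it assumes a CUDA device and a recent PyTorch that exposes `torch.cuda.OutOfMemoryError`):

```python
import torch

def find_max_batch_size(model, input_shape, start=64, device="cuda"):
    """Halve the batch size until one forward/backward pass fits in VRAM."""
    model = model.to(device)
    bs = start
    while bs >= 1:
        try:
            x = torch.randn(bs, *input_shape, device=device)
            model(x).sum().backward()          # one full forward + backward pass
            return bs                          # succeeded: this batch size fits
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()           # release the failed allocation
            bs //= 2                           # halve and retry
    raise RuntimeError("Even batch size 1 does not fit in VRAM.")
```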

2. Monitor Training Dynamics#

  • Training Loss Curve:

    • Small Batch: Loss decreases with large fluctuations, but the overall trend is downward.
    • Large Batch: Loss decreases smoothly but may stagnate early.
  • Validation Set Performance:

    • If training loss decreases but validation loss does not → May be overfitting (need to reduce Batch Size or augment data).
    • If both do not decrease → Model capacity may be insufficient or there may be labeling errors.

3. Hyperparameter Tuning#

  • Fix Batch Size and Adjust Learning Rate:
    • Use Learning Rate Finder: Gradually increase the learning rate to find the range where loss decreases the fastest.
  • Joint Tuning:
    • Batch Size and learning rate need to be adjusted together (refer to the linear scaling rule).
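
A hedged sketch of a simple learning-rate range test in the spirit of the "Learning Rate Finder" above (`lr_range_test` is a hypothetical helper; `loader`, `model`, and `loss_fn` come from your own setup, and it is best run on a throw-away copy of the model since the weights get modified): the learning rate grows exponentially over a few hundred batches while the loss is recorded, and you pick a value somewhat below the point where the loss blows up.

```python
import math
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-6, lr_max=1.0, num_steps=200):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)   # multiplicative LR step
    history = []
    data_iter = iter(loader)
    for step in range(num_steps):
        try:
            xb, yb = next(data_iter)
        except StopIteration:                        # restart the loader if exhausted
            data_iter = iter(loader)
            xb, yb = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        if not math.isfinite(loss.item()):
            break                                    # loss diverged: stop early
        for group in optimizer.param_groups:
            group["lr"] *= gamma                     # exponential LR increase
    return history                                   # plot loss vs. lr and inspect
```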

6. Practical Case: Adjusting Batch Size in Image Segmentation#

Assuming you are training U-Net for medical image segmentation:

  1. Hardware Conditions: Single GPU with 12GB VRAM, input size 256x256.
  2. Estimate Batch Size:
    • Model itself uses 3GB, leaving 9GB.
    • Each image uses approximately 0.5GB → Maximum Batch Size ≈ 18 → Choose 16 (a power of 2).
  3. Training Effect:
    • Noticed large fluctuations in validation IoU → the Batch Size may be too small, causing high gradient noise.
    • Try gradient accumulation: accumulate over 4 steps (equivalent to Batch Size=64) and scale the learning rate by 4×.
  4. Results:
    • The loss curve became smoother and IoU improved by 5%.

7. Conclusion#

  • Batch Size is a Lever in Training: It requires balancing speed, resources, stability, and generalization ability.
  • Core Rules:
    • When resources allow, start with common values (32~256).
    • Large batches require increasing the learning rate, while small batches need attention to gradient noise.
    • Flexibly adjust according to task characteristics and hardware conditions.


