1. The Core Role of Batch Size#
Batch Size determines the number of samples used to update the model parameters at each iteration. It directly affects the following aspects:
- Accuracy of Gradient Calculation:
  - Large batches average the gradient over many samples, bringing it closer to the "true gradient" (the gradient over the entire dataset).
  - Small batches have noisier gradients, but the noise can act as a form of regularization and help prevent overfitting.
- Utilization of Hardware Resources:
  - GPUs exploit their parallel computing capability more efficiently with large batches.
  - However, a batch that is too large causes out-of-memory (OOM) errors, so resources must be balanced.
- Convergence Speed and Stability:
  - Large-batch updates are more accurate but may converge to "sharp" minima (poor generalization).
  - Small-batch updates are more frequent and follow a "jittery" convergence path, but may find "flat" minima (better generalization).
2. The Relationship Between Batch Size and Gradient Descent#
1. Mathematical Explanation of Gradient Noise#
Assume the total number of samples is \( N \), the Batch Size is \( B \), and the loss function is \( L \).
- Full-Batch Gradient Descent (\( B = N \)):
  The gradient is noise-free, but each update is computationally expensive.
- Mini-Batch Gradient Descent (\( B \ll N \)):
  The gradient is an unbiased estimate of the true gradient, with noise variance roughly proportional to \( \frac{1}{B} \) (a quick numerical check follows below).
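As a hedged illustration of this scaling, the following NumPy sketch estimates the gradient of a simple squared loss on synthetic data (the 1-D regression setup and all constants are assumptions made only for this demo) and shows that the mini-batch estimate stays centered on the full-batch gradient while its variance shrinks roughly like \( \frac{1}{B} \):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
# Synthetic 1-D regression data (illustrative assumption, not from the article).
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.5, size=N)
w = 0.0  # current parameter value

def batch_grad(idx):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 over the given samples."""
    return np.mean((w * x[idx] - y[idx]) * x[idx])

full_grad = batch_grad(np.arange(N))  # "true" gradient over the whole dataset
print(f"full-batch gradient: {full_grad:+.3f}")

for B in (8, 32, 128, 512):
    grads = [batch_grad(rng.choice(N, size=B, replace=False)) for _ in range(2000)]
    print(f"B={B:4d}  mean={np.mean(grads):+.3f}  var={np.var(grads):.5f}")
# The means stay close to the full-batch gradient (unbiased estimate),
# while the variance shrinks roughly in proportion to 1/B.
```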
2. The Impact of Noise on Training#
- Small Batch (e.g., \( B = 32 \)):
  - High noise → large fluctuations in the parameter update direction → may escape local optima.
  - Similar to "random exploration"; suitable for complex tasks (e.g., small datasets, high-noise data).
- Large Batch (e.g., \( B = 1024 \)):
  - Low noise → stable update direction → fast convergence, but prone to getting stuck in local optima.
  - Similar to "precision guidance"; suitable for large datasets and distributed training.
3. Practical Strategies for Choosing Batch Size#
1. Maximum Batch Size Under Resource Constraints#
- VRAM Estimation Formula:
  - Maximum Batch Size ≈ \( \frac{\text{total VRAM} - \text{model VRAM}}{\text{VRAM per sample}} \)
  - For example: 24GB of GPU VRAM, 4GB used by the model, 0.2GB per sample → Maximum Batch Size ≈ \( (24 - 4)/0.2 = 100 \).
- Tip:
  - Use gradient accumulation: run several forward/backward passes with a small batch and accumulate the gradients before updating the parameters (see the sketch after this list).
    For example: target Batch Size = 64 but the GPU only fits 16 → accumulate gradients for 4 steps before each update.
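The following PyTorch sketch shows the accumulation pattern described above for that example (per-step batch 16, 4 accumulation steps, effective batch 64); the linear model, random data, and optimizer settings are placeholders, not part of the original article:

```python
import torch
from torch import nn

# Placeholder model, data, and optimizer -- substitute your own in practice.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(16, 128), torch.randint(0, 10, (16,))) for _ in range(100)]

accum_steps = 4  # 16 samples/step * 4 steps = effective Batch Size 64
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    # Divide by accum_steps so the accumulated gradient matches the average over the large batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```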
2. Interaction Between Learning Rate and Batch Size#
- Linear Scaling Rule:
  - When the Batch Size is multiplied by \( k \), multiply the learning rate by \( k \) as well.
  - Rationale: the gradient variance of a large batch drops by a factor of \( k \), so the learning rate must grow to keep the effective update step comparable.
  - For example: Batch Size = 64 with learning rate 0.1 → at Batch Size = 256, use learning rate ≈ 0.4.
- Caution:
  - The learning rate cannot be increased indefinitely. In practice, combine it with a warmup schedule that ramps the learning rate up gradually (see the sketch below).
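A minimal PyTorch sketch of this combination, reusing the numbers from the example above (base batch 64 at learning rate 0.1, scaled to batch 256); the placeholder model and the 500-step warmup length are assumptions:

```python
import torch
from torch import nn

base_batch, base_lr = 64, 0.1
batch_size = 256
scaled_lr = base_lr * batch_size / base_batch   # linear scaling rule -> 0.4

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)

warmup_steps = 500                              # assumed warmup length

def warmup(step):
    # Ramp the learning rate linearly from ~0 to scaled_lr, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

# Inside the training loop, call after each optimizer.step():
#     scheduler.step()
```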
3. Empirical Values for Different Tasks#
- Image Classification (ImageNet):
  - Commonly Batch Size = 256 or 512 (requires multiple GPUs in parallel).
  - Smaller models (e.g., MobileNet) can drop to 64~128.
- Object Detection/Segmentation (COCO):
  - Batch Size = 2~16 (high-resolution images use a lot of VRAM).
  - For example, Mask R-CNN typically uses Batch Size = 2~8.
- Natural Language Processing (BERT):
  - Batch Size = 32~512, combined with gradient accumulation.
  - Very large batches (e.g., 8192) require special optimizers (e.g., LAMB).
4. Advanced Effects of Batch Size#
1. Generalization Ability#
- The Generalization Dilemma of Large Batches:
  - Experiments show that large-batch training tends to converge to "sharp" minima, which perform worse on the test set.
  - Solutions:
    - Increase data augmentation.
    - Use Stochastic Weight Averaging (SWA), as sketched below.
    - Introduce explicit regularization (e.g., Label Smoothing).
- Implicit Regularization of Small Batches:
  - Gradient noise acts as a random perturbation on the parameters, similar in effect to Dropout.
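For the SWA item above, a minimal sketch using PyTorch's torch.optim.swa_utils; the placeholder model, data, learning rates, and the choice to start averaging at epoch 5 are all assumptions for illustration:

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.4)
loader = [(torch.randn(64, 128), torch.randint(0, 10, (64,))) for _ in range(50)]
loss_fn = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)                # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)   # learning rate used during the averaging phase
swa_start, epochs = 5, 10                       # start averaging after epoch 5 (assumption)

for epoch in range(epochs):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()

# Recompute BatchNorm statistics for the averaged weights (a no-op for models without BN).
update_bn(loader, swa_model)
```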
2. Coupling with Batch Normalization#
- BN's Dependence on Batch Size:
  - BN normalizes using the mean and variance of the current batch.
  - If the batch size is too small → the statistics are estimated poorly → training becomes unstable.
  - Recommendation: use BN when Batch Size ≥ 32; for smaller batches, consider Group Normalization or Layer Normalization (see the sketch below).
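A small PyTorch comparison of the two options mentioned in the recommendation; the channel count, group count, and toy input are illustrative assumptions:

```python
import torch
from torch import nn

channels = 64

# BatchNorm: statistics come from the current batch, so tiny batches give noisy estimates.
bn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU())

# GroupNorm: statistics are computed per sample over channel groups, independent of batch size.
gn_block = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                         nn.GroupNorm(8, channels), nn.ReLU())

x_small = torch.randn(2, 3, 32, 32)   # batch of only 2, where BN statistics become unreliable
print(bn_block(x_small).shape, gn_block(x_small).shape)
```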
3. Batch Size in Distributed Training#
- Data Parallelism:
  - Each GPU processes a sub-batch, and the gradients are synchronized at the end of each step (see the sketch below).
  - Global Batch Size = per-GPU Batch Size × number of GPUs.
  - For example: 4 GPUs, each with Batch Size = 64 → Global Batch Size = 256.
- Extreme Large-Batch Training:
  - For instance, Google's training of ResNet with a Batch Size of 1.5M:
    - Requires the LARS (Layer-wise Adaptive Rate Scaling) optimizer.
    - The learning rate is adapted based on the norm of the weights in each layer.
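A minimal sketch of the data-parallel arithmetic above using torch.distributed and DistributedDataParallel; it is meant to be launched with torchrun, the model and random data are placeholders, and the gloo backend is chosen only to keep the sketch runnable on CPU:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with, e.g.: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="gloo")          # use "nccl" on GPUs
rank, world_size = dist.get_rank(), dist.get_world_size()

per_gpu_batch = 64
global_batch = per_gpu_batch * world_size        # e.g. 4 processes x 64 = 256
if rank == 0:
    print(f"global batch size = {global_batch}")

model = DDP(nn.Linear(128, 10))                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * global_batch / 64)  # linear scaling

# Each process draws its own sub-batch; DDP averages the gradients across processes on backward().
inputs = torch.randn(per_gpu_batch, 128)
targets = torch.randint(0, 10, (per_gpu_batch,))
nn.CrossEntropyLoss()(model(inputs), targets).backward()
optimizer.step()
dist.destroy_process_group()
```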
5. Specific Steps for Debugging Batch Size#
1. Initial Selection#
- Start with common values (e.g., 32 or 64) and observe VRAM usage and training speed.
- If VRAM is insufficient, halve the Batch Size until you no longer hit out-of-memory (OOM) errors (see the sketch below).
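A hedged sketch of that halving heuristic: probe one forward/backward pass and halve the batch whenever a CUDA out-of-memory error is raised; the probe function, its name, and the example model are illustrative assumptions:

```python
import torch
from torch import nn

def find_max_batch_size(model, input_shape, start=64, device="cuda"):
    """Halve the batch size until one forward/backward pass fits in VRAM (illustrative heuristic)."""
    model = model.to(device)
    batch = start
    while batch >= 1:
        try:
            x = torch.randn(batch, *input_shape, device=device)
            model(x).sum().backward()
            return batch
        except RuntimeError as err:              # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(err):
                raise
            torch.cuda.empty_cache()
            batch //= 2
    raise RuntimeError("Even batch size 1 does not fit in VRAM.")

# Example with a placeholder model:
# find_max_batch_size(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)), (3, 224, 224))
```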
2. Monitor Training Dynamics#
- Training Loss Curve:
  - Small batch: the loss fluctuates heavily while decreasing, but the overall trend is downward.
  - Large batch: the loss decreases smoothly but may stagnate early.
- Validation Set Performance:
  - If the training loss decreases but the validation loss does not → likely overfitting (reduce the Batch Size or add data augmentation).
  - If neither decreases → the model capacity may be insufficient, or there may be labeling errors.
3. Hyperparameter Tuning#
- Fix the Batch Size and tune the learning rate:
  - Use a Learning Rate Finder: gradually increase the learning rate and look for the range where the loss drops fastest (a sketch follows this list).
- Joint tuning:
  - Batch Size and learning rate should be adjusted together (refer to the linear scaling rule).
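A minimal sketch of such a learning-rate range test: sweep the learning rate exponentially over one pass through the data and record the loss at each step; the placeholder model, random data, and the 1e-6 to 1.0 sweep range are assumptions:

```python
import torch
from torch import nn

model = nn.Linear(128, 10)                       # placeholder model
loader = [(torch.randn(64, 128), torch.randint(0, 10, (64,))) for _ in range(200)]
loss_fn = nn.CrossEntropyLoss()

lr_min, lr_max, steps = 1e-6, 1.0, len(loader)
optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)

history = []
for step, (inputs, targets) in enumerate(loader):
    # Increase the learning rate exponentially from lr_min to lr_max.
    lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    history.append((lr, loss.item()))

# Pick a learning rate somewhat below the point where the recorded loss starts to blow up.
```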
6. Practical Case: Adjusting Batch Size in Image Segmentation#
Suppose you are training a U-Net for medical image segmentation:
- Hardware: a single GPU with 12GB of VRAM, input size 256x256.
- Estimating the Batch Size:
  - The model itself uses 3GB, leaving 9GB.
  - Each image uses roughly 0.5GB → Maximum Batch Size ≈ 18 → choose 16 (a power of 2).
- Training behavior:
  - The validation IoU fluctuates heavily → the Batch Size may be too small, causing high gradient noise.
  - Try gradient accumulation: accumulate over 4 steps (an effective Batch Size of 64) and scale the learning rate by 4×.
- Result:
  - The loss curve is smoother and IoU improves by 5%.
7. Conclusion#
- Batch Size is a Lever in Training: It requires balancing speed, resources, stability, and generalization ability.
- Core Rules:
  - When resources allow, start from common values (32~256).
  - Large batches require increasing the learning rate, while small batches need attention to gradient noise.
  - Adjust flexibly according to task characteristics and hardware conditions.