Training Neural Networks from Scratch

1. Training Is the Hard Part

Designing a neural network is easy. Training it well is hard.

Most failures in deep learning come from:

  • Poor initialization
  • Wrong learning rate
  • Bad loss choice
  • Overfitting / underfitting

This article breaks training down into concrete, controllable components.


2. The Training Loop (Core of Everything)

Every neural network training process reduces to:

for each batch:
    forward_pass()
    loss = compute_loss()
    backward_pass()
    update_weights()

Understanding each step deeply is more important than knowing any framework.
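
The four steps above can be made concrete with a toy example — a minimal sketch, assuming a one-layer linear model trained with MSE via full-batch gradient descent (NumPy only; the data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))               # one "batch" of 64 samples
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                             # noiseless targets

w = np.zeros(3)                            # parameters to learn
lr = 0.1
for step in range(200):
    y_hat = X @ w                          # forward_pass()
    loss = np.mean((y_hat - y) ** 2)       # compute_loss()
    grad = 2 * X.T @ (y_hat - y) / len(X)  # backward_pass()
    w -= lr * grad                         # update_weights()

print(np.round(w, 3))                      # recovers true_w
```

Every framework hides these four lines behind abstractions, but the structure never changes.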


3. Weight Initialization

Why Initialization Matters

Initialization controls:

  • Signal propagation
  • Gradient flow
  • Speed of convergence

Bad initialization causes:

  • Vanishing gradients
  • Exploding gradients

Modern Initialization Schemes

Xavier (Glorot) – for tanh / sigmoid:

Var(W) = 1 / fan_avg

He Initialization – for ReLU:

Var(W) = 2 / fan_in

Goal: keep activation variance stable across layers.
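
A quick check of that goal — a sketch assuming a 10-layer ReLU stack with He-initialized weights (NumPy only; the width of 256 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Var(W) = 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = rng.normal(size=(1000, 256))                  # unit-variance input batch
for _ in range(10):
    x = np.maximum(x @ he_init(256, 256), 0.0)    # linear layer + ReLU

print(round(float(x.std()), 2))  # stays on the order of 1, neither 0 nor huge
```

Repeating the experiment with a much smaller or larger weight variance makes the activations collapse toward zero or blow up within a few layers.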


4. Loss Functions: What Are You Really Optimizing?

Regression Losses

  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)

Classification Losses

  • Cross Entropy
  • Binary Cross Entropy

Choosing the wrong loss = optimizing the wrong objective.
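
For classification, cross entropy is computed from raw logits — a sketch using a numerically stable log-softmax (NumPy only; the values are illustrative):

```python
import numpy as np

def cross_entropy(logits, labels):
    shifted = logits - logits.max(axis=1, keepdims=True)        # avoid overflow
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()    # mean NLL of true class

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
labels = np.array([0, 1])               # true classes
loss = cross_entropy(logits, labels)    # small, since both rows favor the true class
```

With uniform logits the loss is exactly log(num_classes), a useful sanity value when debugging.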


5. Optimizers: How Parameters Move

Gradient Descent Variants

SGD

  • Simple
  • Noisy updates
  • Better generalization

Momentum

  • Faster convergence
  • Escapes shallow minima

Adam

  • Adaptive learning rates
  • Fast convergence
  • Sometimes generalizes worse than SGD

Rule of thumb:

  • Start with Adam
  • Fine-tune with SGD
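
Written out as explicit single updates, the difference between the two is small — a sketch with the common default hyperparameters (NumPy only; the helper names are illustrative):

```python
import numpy as np

def sgd_momentum(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g                           # accumulate velocity
    return w - lr * v, v

def adam(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                  # first-moment estimate
    s = b2 * s + (1 - b2) * g ** 2             # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias correction, step t >= 1
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

w, g = np.array([1.0]), np.array([0.5])
w2, _ = sgd_momentum(w, g, np.zeros(1))
w3, _, _ = adam(w, g, np.zeros(1), np.zeros(1), t=1)
# Both move w downhill; Adam's step size is ~lr regardless of g's scale.
```

Adam's division by the second-moment estimate is what makes its learning rate "adaptive": large-gradient coordinates get smaller steps.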

6. Learning Rate: The Most Important Hyperparameter

Too small → slow training
Too large → divergence

Common Strategies

  • Step decay
  • Exponential decay
  • Cosine annealing
  • Warm-up

A good learning rate schedule often matters more than model size.
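
The last two strategies combine naturally — a sketch of linear warm-up followed by cosine annealing, written as a pure function of the step index (the function name and values are illustrative, not a framework API):

```python
import math

def lr_at(step, base_lr=0.1, warmup=100, total=1000):
    if step < warmup:
        return base_lr * step / warmup                   # linear warm-up
    progress = (step - warmup) / (total - warmup)        # 0 → 1 after warm-up
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

# Ramps up, peaks at base_lr, decays smoothly toward 0:
schedule = [lr_at(s) for s in (50, 100, 550, 1000)]
```

Warm-up protects the early, unstable steps; the cosine tail lets the model settle into a minimum.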


7. Batch Size Trade-offs

Small Batch             Large Batch
Noisy gradients         Stable gradients
Better generalization   Faster per step
Slower throughput       Needs more memory

Large batches often need learning rate scaling.
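
The most common recipe is the linear scaling heuristic: when the batch grows by k×, scale the learning rate by k. A minimal sketch (a rule of thumb, usually paired with warm-up, not a law):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # k x batch size -> k x learning rate
    return base_lr * new_batch / base_batch

lr = scaled_lr(0.1, 256, 1024)   # 4x batch -> 4x learning rate
```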


8. Regularization: Controlling Overfitting

L2 Regularization

Penalizes large weights:

L_total = L + λ||W||²

Dropout

  • Randomly disables neurons
  • Forces redundancy
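
Both mechanisms are a few lines each — a sketch assuming the "inverted" dropout convention, so that inference needs no rescaling (NumPy only; lam and p are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_total_loss(data_loss, W, lam=1e-4):
    return data_loss + lam * np.sum(W ** 2)    # L + λ‖W‖²

def dropout(x, p=0.5, training=True):
    if not training:
        return x                               # identity at inference
    mask = rng.random(x.shape) >= p            # keep each unit with prob 1-p
    return x * mask / (1.0 - p)                # rescale to preserve expectation

a = np.ones((4, 8))
dropped = dropout(a)   # entries are 0 or 2; the mean stays ≈ 1 in expectation
```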

Data Augmentation

  • Most powerful regularizer
  • Increases effective dataset size

9. Normalization Techniques

Batch Normalization

  • Stabilizes training
  • Allows higher learning rates
  • Reduces sensitivity to initialization

Layer Normalization

  • Better for RNNs & Transformers

Normalization makes deep networks trainable.
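
The two schemes differ only in which axis they normalize over — a sketch without the learnable scale/shift parameters (NumPy only, training-mode statistics):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature across the batch (axis 0)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # normalize each sample across its features (axis 1)
    mu = x.mean(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 16))
# After batch norm every feature has ~0 mean; after layer norm every sample does.
```

Layer norm's independence from the batch dimension is why it suits variable-length sequences and small batches.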


10. Vanishing & Exploding Gradients

Causes

  • Deep networks
  • Saturating activations
  • Poor initialization
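
The saturation cause is easy to demonstrate: back-propagating through a stack of sigmoids multiplies derivatives that never exceed 0.25, so the gradient shrinks geometrically with depth. A minimal sketch (NumPy only, no weights, just the chain rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, grad = 0.0, 1.0
for _ in range(20):            # 20 saturating layers
    s = sigmoid(x)
    grad *= s * (1.0 - s)      # chain rule: one derivative factor per layer
    x = s

print(grad)  # astronomically small after only 20 layers
```

ReLU's derivative is exactly 1 on its active half, which is why swapping the activation removes this particular failure mode.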

Solutions

  • ReLU / GELU
  • Proper initialization
  • Normalization
  • Residual connections

11. Overfitting vs Underfitting

Overfitting

  • Low training loss
  • High validation loss

Underfitting

  • High training loss
  • Model too simple

Fixing requires:

  • More data
  • Better regularization
  • Model capacity tuning

12. Debugging Training Failures

Checklist:

  • Loss decreasing?
  • Gradients exploding?
  • Outputs saturated?
  • Data normalized?

Always overfit a small batch first.
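
A sketch of that check: with enough capacity, training on one tiny batch should drive the loss essentially to zero. Here a 4-parameter linear model memorizes 4 samples (orthonormal inputs keep plain gradient descent provably stable; the setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # one tiny, well-conditioned batch
y = rng.normal(size=4)                          # arbitrary targets to memorize

w = np.zeros(4)
for _ in range(300):
    err = X @ w - y
    w -= 0.5 * 2 * X.T @ err / len(X)           # full-batch GD, lr = 0.5

final_loss = float(np.mean((X @ w - y) ** 2))
print(final_loss < 1e-6)  # prints True; if this fails, debug the pipeline first
```

If even this fails, the bug is in the data, the loss, or the update step, not in the model's capacity.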


13. Training From Scratch: Mental Model

Training is:

Navigating a noisy, high-dimensional surface using local gradient information

Success depends more on stability than cleverness.


14. What Comes Next?

The next article explores deep architectures:

  • CNNs
  • RNNs
  • Transformers
  • Why architecture matters

Article 4: Deep Neural Network Architectures
