1. Why Architecture Matters
Architecture defines:
- What patterns a model can learn
- How efficiently it learns
- What inductive biases it has
A good architecture bakes assumptions about the data directly into the model.
2. Fully Connected Networks (MLPs)
What They Are
Every neuron connects to every neuron in the next layer.
Strengths
- Universal function approximation
- Simple and flexible
Limitations
- Parameter explosion
- Poor inductive bias
- Not scalable for images or sequences
MLPs are rarely used alone for complex data.
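To make the "every neuron connects to every neuron" idea concrete, here is a minimal NumPy sketch of a two-layer MLP forward pass. The layer sizes (4 inputs, 8 hidden units, 3 outputs) are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 input features, 8 hidden units, 3 outputs.
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)), np.zeros(3)

def mlp_forward(x):
    """Two-layer MLP: every input feeds every hidden unit (dense weights)."""
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU non-linearity
    return h @ W2 + b2                # raw output logits

x = rng.standard_normal((2, 4))       # batch of 2 examples
print(mlp_forward(x).shape)           # (2, 3)
```

Note the parameter count: even this toy network has 4·8 + 8 + 8·3 + 3 = 67 weights; scaling the input to a 224×224 image makes the first dense layer alone explode, which is the "parameter explosion" limitation above.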
3. Convolutional Neural Networks (CNNs)
Key Idea: Locality + Weight Sharing
CNNs assume:
- Nearby pixels are related
- Same features appear everywhere
This drastically reduces parameters.
Core Components
- Convolution layers
- Stride & padding
- Pooling layers
A convolution learns feature detectors.
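A minimal NumPy sketch of the sliding-window operation (technically cross-correlation, which is what deep learning frameworks call "convolution"). The vertical-edge kernel and image are illustrative; the key point is that the same 3×3 weights are shared across every position:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (no padding, stride 1), one shared kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge detector: the same weights slide over the whole image.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

image = np.zeros((6, 6))
image[:, 3:] = 1.0                 # bright right half, dark left half
response = conv2d(image, edge_kernel)
print(response.shape)              # (4, 4); strongest response at the edge
```

One 3×3 kernel has 9 parameters regardless of image size, which is exactly the locality + weight-sharing savings described above.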
Why CNNs Work So Well
- Translation equivariance (pooling adds approximate invariance)
- Hierarchical feature learning
Example hierarchy:
- Edges → textures → objects
CNNs dominate computer vision.
4. Recurrent Neural Networks (RNNs)
Motivation: Sequential Data
Data where order matters:
- Text
- Time series
- Speech
RNNs maintain a hidden state that is updated at every step: h_t = tanh(W_x x_t + W_h h_{t-1})
Limitations of Vanilla RNNs
- Vanishing gradients
- Short memory
Training on long sequences is therefore unstable.
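A minimal NumPy sketch of a vanilla RNN: one shared cell applied at every time step, with the hidden state carried forward. Sizes (3 input features, 5 hidden units) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_x = rng.standard_normal((n_in, n_hidden)) * 0.1
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1

def rnn_forward(sequence):
    """Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1}), weights shared over time."""
    h = np.zeros(n_hidden)
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

seq = rng.standard_normal((10, n_in))   # 10 time steps
h_final = rnn_forward(seq)
print(h_final.shape)                    # (5,)
```

Because the same W_h is multiplied in at every step, gradients through long sequences involve repeated products of that matrix, which is where the vanishing-gradient problem comes from.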
5. LSTM & GRU: Fixing RNNs
Long Short-Term Memory (LSTM)
Uses gates to control information flow:
- Forget gate
- Input gate
- Output gate
Allows learning long-term dependencies.
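The three gates can be sketched as a single step in NumPy. This is a simplified cell (bias terms omitted, toy sizes) meant only to show how the gates route information:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM step: gates decide what to forget, admit, and expose."""
    Wf, Wi, Wo, Wg = params
    z = np.concatenate([x, h])   # gate inputs: current x and previous h
    f = sigmoid(z @ Wf)          # forget gate: keep or erase cell memory
    i = sigmoid(z @ Wi)          # input gate: admit new information
    o = sigmoid(z @ Wo)          # output gate: expose memory as h
    g = np.tanh(z @ Wg)          # candidate cell update
    c_new = f * c + i * g        # cell state: gated additive update
    h_new = o * np.tanh(c_new)   # hidden state read out through the gate
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
params = [rng.standard_normal((n_in + n_h, n_h)) * 0.1 for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((6, n_in)):   # 6 time steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)                    # (4,) (4,)
```

The additive update `f * c + i * g` is the key design choice: the cell state can carry information across many steps without being squashed through a nonlinearity at each one, which is what enables long-term dependencies.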
GRU
- Simplified LSTM
- Fewer parameters
- Faster training
Both remain common in speech and time-series modeling.
6. Attention Mechanism
The Core Idea
Not all inputs matter equally.
Attention computes a weighted sum of values, with weights derived from query–key similarity: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
This allows models to focus on the most relevant parts of the input.
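Scaled dot-product attention fits in a few lines of NumPy. The query/key/value counts and dimensions below are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the weights say which inputs matter."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))   # 2 queries, dimension 8
K = rng.standard_normal((5, 8))   # 5 keys ...
V = rng.standard_normal((5, 8))   # ... each with a value
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                  # (2, 8); each row of w sums to 1
```

Each output row is a convex combination of the value rows, so "focusing" literally means putting more softmax weight on the relevant values.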
7. Transformers: The Modern Standard
Why Transformers Changed Everything
- No recurrence
- Fully parallelizable
- Long-range dependencies
Key building blocks:
- Self-attention
- Positional encoding
- Feed-forward layers
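Of the building blocks above, positional encoding is the easiest to show in full. A minimal sketch of the sinusoidal scheme from the original Transformer paper (sequence length and model dimension are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=8)
print(pe.shape)   # (16, 8)
```

Because self-attention itself is order-agnostic, these position vectors are added to the token embeddings so the model can tell the first token from the tenth.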
Self-Attention Explained
Each token attends to every other token.
This enables:
- Global context
- Better representations
Transformers power modern LLMs.
8. Residual Connections
The Problem
Very deep networks degrade.
The Solution
Residual connections add a layer's input back to its output: y = x + F(x)
They:
- Improve gradient flow
- Enable very deep models
Used everywhere today.
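The y = x + F(x) pattern in a minimal NumPy sketch, with a toy one-layer F (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1

def residual_block(x):
    """y = x + F(x): the identity path lets signal (and gradient) skip F."""
    fx = np.maximum(0.0, x @ W)   # F(x): a toy one-layer transform
    return x + fx

x = rng.standard_normal(8)
y = residual_block(x)
print(y.shape)   # (8,)
```

The design intuition: if F learns nothing useful, the block can still pass x through unchanged, so stacking many blocks never makes the network worse than a shallower one, and gradients always have a direct path back through the identity term.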
9. Encoder–Decoder Architectures
Used in:
- Translation
- Summarization
- Speech recognition
The encoder builds a representation of the input; the decoder generates the output from it.
Transformers use this pattern extensively.
10. Choosing the Right Architecture
| Data Type | Architecture |
| --- | --- |
| Tabular | MLP |
| Images | CNN / Vision Transformer |
| Text | Transformer |
| Time Series | LSTM / Transformer |
| Audio | CNN + Transformer |
Architecture choice matters more than depth.
11. Architectural Trade-offs
- CNNs → strong inductive bias
- Transformers → flexible but expensive
- RNNs → sequential bottlenecks
Modern trend: transformers everywhere.
12. Architecture Evolution
Timeline:
- MLPs → CNNs → RNNs
- LSTMs → Attention → Transformers
Progress comes from removing bottlenecks.
13. Mental Model
Architectures are:
Structured ways of restricting the hypothesis space
Better structure → faster learning → better generalization.
14. What Comes Next?
Next article dives into representation learning:
- Embeddings
- Self-supervised learning
- Why features emerge automatically
➡ Article 5: Representation Learning & Embeddings