Linear Algebra Essentials for AI
A comprehensive, intensive reference synthesising leading sources on every linear algebra concept that powers modern machine learning, deep learning, and generative AI — from first principles to real-world applications.
Linear algebra is the silent workhorse of artificial intelligence. Every image a model sees, every word it understands, every prediction it makes — all encoded as vectors and matrices, processed through linear operations.
“Linear algebra is the mathematics of data. Without it, modern machine learning as we know it simply would not exist.”— Stanford Professor Andrew Ng
Machine learning fundamentally does three things: represents data, transforms data, and optimises model parameters. Linear algebra is the language for all three. When you train a neural network, the forward pass is matrix multiplication. When you reduce dimensions with PCA, you compute eigenvectors. When gradient descent updates weights, it follows a gradient vector. Understanding linear algebra means understanding AI at its deepest level.
From IBM’s research: “Whether training a neural network, building a recommendation system or applying PCA to a complex high-dimensional dataset, practitioners are using linear algebra to perform massive calculations.” Data rarely comes as a single number — it arrives as datasets of vectors, matrices of features, tensors of images. Linear algebra provides the tools to organise, manipulate and analyse all of it efficiently.
- Data Representation: Scalars, vectors, matrices and tensors encode every type of real-world data, such as a customer profile, an image, a sentence, or a sound clip.
- Transformation: Matrix operations reshape, rotate, scale and project data into forms that reveal patterns.
- Optimisation: Gradients, eigenvalues and matrix factorisations power the learning algorithms that tune model weights.
- Foundational: Finds the best-fit hyperplane by solving a system of linear equations using matrix operations and least squares.
- Deep Learning: Every layer performs y = Wx + b, a matrix multiplication with weights plus a bias vector.
- Unsupervised: PCA and SVD use eigenvectors to compress high-dimensional data into its most informative directions.
- Applied AI: SVD decomposes user-item matrices into latent factors that predict what users will enjoy.
The building blocks of all linear algebra in AI are four data structures that differ only in how many dimensions they occupy. Everything in machine learning (a data point, an image, a neural network weight) is one of these four structures.
1.1 Scalar
A scalar is the simplest building block — a single numerical value, such as 5 or 2.3. In machine learning, scalars represent individual parameters (like a learning rate of 0.001), loss function outputs (a single error value), scaling factors, or single measurements. When you see the temperature today as 37°C or a model’s accuracy as 0.94 — those are scalars.
1.2 Vector
A vector is an ordered list of numbers, written as a column or row. Vectors are the workhorse data structure of machine learning — they represent data points, feature sets, model weights, directions, embeddings, and probability distributions. A customer described by age, income, and purchase count becomes the vector [34, 75000, 12].
Vectors in ML represent: a single data sample, a row in a dataset, model weights in a layer, directions for gradient descent, and word embeddings (where each word is a vector of ~300–1536 numbers).
1.3 Matrix
A matrix is a two-dimensional array of numbers arranged in rows and columns. A dataset where each row is a data point and each column is a feature naturally forms a matrix. Matrices are central to linear algebra because they allow for efficient storage and transformation of data.
In neural networks, the weight matrix W of a layer defines how input signals are mixed and projected to the output. Multiplying an input vector x by W applies a learned linear transformation: the foundation of every deep learning forward pass.
1.4 Tensor
A tensor is a generalisation of scalars, vectors and matrices to higher dimensions. A colour image is a 3D tensor: height × width × 3 (RGB channels). A batch of 32 such images becomes a 4D tensor: 32 × height × width × 3. In deep learning, tensors are the standard data structure — all inputs, weights, activations, and gradients in PyTorch and TensorFlow are tensors.
Scalar (0D Tensor)
Single number. Examples: loss value, learning rate, accuracy score.
Vector (1D Tensor)
Ordered list. Examples: feature vector, word embedding, bias term.
Matrix (2D Tensor)
Table of numbers. Examples: dataset, weight matrix, confusion matrix.
3D Tensor
Stacked matrices. Examples: colour image (H×W×C), time series batch.
4D Tensor
Batch of 3D. Examples: batch of images (N×H×W×C) fed into CNNs.
5D Tensor
Video data. Examples: batch × frames × height × width × channels.
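As a small illustration of the four structures above, the sketch below builds one of each in NumPy and prints its number of dimensions and shape. The variable names and the 28×28 image size are assumptions chosen for the example.

```python
import numpy as np

scalar = np.array(0.001)                   # 0D: e.g. a learning rate
vector = np.array([34.0, 75000.0, 12.0])   # 1D: the customer vector [age, income, purchases]
matrix = np.zeros((4, 3))                  # 2D: 4 data points x 3 features
image = np.zeros((28, 28, 3))              # 3D: height x width x RGB channels
batch = np.zeros((32, 28, 28, 3))          # 4D: a batch of 32 such images

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("batch", batch)]:
    print(name, "ndim =", t.ndim, "shape =", t.shape)
```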
Matrix operations are the verbs of linear algebra — they transform, combine, and reshape data. Every forward pass in a neural network, every regression update, every embedding comparison executes one or more of these operations at massive scale.
2.1 Matrix Multiplication
Matrix multiplication is the single most important operation in deep learning. When input data flows through a neural network layer, it is multiplied by the weight matrix. The result encodes a learned combination of every input feature — this is how networks detect patterns.
For two matrices A (m×n) and B (n×p), the product C = AB is an m×p matrix where each element C[i,j] is the dot product of row i of A with column j of B. The inner dimension (n) must match — this constraint drives the design of neural network architecture shapes.
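The shape rule and the row-by-column definition can be checked directly. This is a minimal sketch with arbitrary random matrices; the sizes 2×3 and 3×4 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))   # m x n
B = rng.normal(size=(3, 4))   # n x p (inner dimension n = 3 matches)
C = A @ B                     # result is m x p

# Recompute one entry by hand: row i of A dotted with column j of B.
i, j = 1, 2
c_ij = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))

print(C.shape, np.isclose(C[i, j], c_ij))
```

Swapping the operands (`B @ A`) raises a `ValueError` here because the inner dimensions (4 and 2) no longer match.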
2.2 Dot Product
The dot product of two vectors produces a scalar — the sum of element-wise products: a·b = Σ(aᵢ × bᵢ). It measures how much two vectors point in the same direction. Crucially, it is the building block of cosine similarity, which measures whether two vectors (like two word embeddings) are semantically similar.
In transformer models, the attention mechanism computes dot products between query and key vectors to determine which parts of the input to attend to — the mathematical basis of self-attention in GPT and BERT.
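A short sketch of the dot product and the cosine similarity built from it. The helper name `cosine_similarity` and the three toy vectors are assumptions for this example, not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice as long
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

dot_ab = float(a @ b)            # 1*2 + 2*4 + 3*6 = 28
print(dot_ab, cosine_similarity(a, b), cosine_similarity(a, c))
```

Cosine similarity ignores vector length, which is why `a` and `b` score 1 even though `b` is longer.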
2.3 Transpose
The transpose of a matrix flips its rows and columns (Aᵀ[i,j] = A[j,i]). This is used constantly to align dimensions for multiplication and to uncover structural patterns. In the normal equation for linear regression, the formula involves Xᵀ to align the data matrix for self-multiplication.
2.4 Matrix Inverse
The inverse of a square matrix A, denoted A⁻¹, satisfies A·A⁻¹ = I (the identity matrix). It exists only when the determinant is non-zero. In linear regression, the optimal weights are found by: w = (XᵀX)⁻¹ Xᵀy — the Normal Equation, which directly solves for the best-fit parameters using matrix inversion.
Real-world data is often ill-conditioned or the matrix is rectangular (not square), making direct inversion impossible. The Moore-Penrose Pseudoinverse (A⁺) solves this — it’s used in linear regression when data is overdetermined (more equations than unknowns) or underdetermined (fewer equations than unknowns), and is the practical replacement for exact inversion in numerical computing.
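The sketch below compares three ways of fitting the same overdetermined problem in NumPy: the normal equation with an explicit inverse, the pseudoinverse, and the library least-squares solver. The synthetic data, noise level and `w_true` are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 equations, 3 unknowns
w_true = np.array([2.0, -1.0, 0.5])             # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=50)     # targets with a little noise

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y     # normal equation, explicit inverse
w_pinv = np.linalg.pinv(X) @ y                  # Moore-Penrose pseudoinverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # library least-squares solver

print(w_normal, w_pinv, w_lstsq)
```

All three agree on well-conditioned data like this. In practice `lstsq` or `pinv` is usually preferred because neither forms XᵀX explicitly.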
2.5 Determinant
The determinant of a square matrix is a single scalar that encodes important geometric and algebraic information. If det(A) ≠ 0, the matrix is invertible and the linear system Ax = b has a unique solution. If det(A) = 0, the matrix is singular — the data has redundant dimensions, and solutions may be infinite or nonexistent. Geometrically, the determinant measures the volume scaling factor of the linear transformation.
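A minimal sketch of both cases, using two hand-picked 2×2 matrices (assumptions for illustration): one invertible, one singular because its rows are multiples of each other.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])       # scales areas by 2 * 3 = 6
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # second row is twice the first: redundant

det_A = np.linalg.det(A)
det_S = np.linalg.det(S)

x = np.linalg.solve(A, np.array([4.0, 9.0]))   # unique solution of A x = b
rank_S = np.linalg.matrix_rank(S)              # only one independent row

print(det_A, det_S, x, rank_S)
```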
2.6 Identity and Zero Matrices
The identity matrix I is a square matrix with 1s on the diagonal and 0s elsewhere. Multiplying any matrix by I leaves it unchanged — it is the multiplicative identity for matrices, analogous to multiplying a number by 1. The identity matrix plays a crucial role in defining inverses and in the convergence proofs of many optimisation algorithms.
| Operation | Notation | Where Used in ML | Key Property |
|---|---|---|---|
| Matrix Multiplication | AB or A·B | Neural network layers, linear regression | Non-commutative: AB ≠ BA in general |
| Dot Product | a·b or aᵀb | Attention mechanisms, cosine similarity, SVM | Returns scalar; measures vector alignment |
| Transpose | Aᵀ | Normal equation, covariance matrix, attention | (AB)ᵀ = BᵀAᵀ |
| Matrix Inverse | A⁻¹ | Normal equation, Kalman filter | Only exists if det(A) ≠ 0 |
| Hadamard Product | A⊙B | LSTM gates, element-wise operations | Element-wise; same dimensions required |
| Outer Product | abᵀ | Rank-1 updates, gradient computation | Produces matrix from two vectors |
| Trace | tr(A) | Loss functions, Gaussian log-likelihood | Sum of diagonal elements |
Many machine learning problems ultimately reduce to solving, exactly or approximately, a system of linear equations. Understanding how these systems behave (when they have a unique solution, no solution, or infinitely many solutions) is fundamental to understanding why ML models work.
A system of linear equations can be compactly written as Ax = b, where A is the coefficient matrix, x is the vector of unknowns, and b is the right-hand side. This is the universal form of almost every ML problem setup.
3.1 Solution Types
A linear system Ax = b has three possible outcomes, depending on A and b. Unique solution: when A is square and invertible (full rank), there is exactly one answer. No solution: when the equations are contradictory, which is typical of overdetermined systems (more equations than unknowns) built from noisy real-world data. Infinitely many solutions: when a consistent system is underdetermined (fewer independent equations than unknowns), as is common in deep learning, where models have more parameters than training examples.
3.2 Gaussian Elimination & Row Reduction
Gaussian elimination is the foundational algorithm for solving linear systems. By applying elementary row operations (scaling, swapping, and adding rows), it transforms the coefficient matrix into Row Echelon Form (REF) or Reduced Row Echelon Form (RREF), from which solutions can be read directly. In ML, this principle is used computationally in solving the normal equation and in understanding the structure of data matrices.
3.3 Least Squares: The ML Default
In practice, real data rarely satisfies an exact system of equations — there is always noise and measurement error. The least squares method finds the x that minimises ||Ax – b||² — the squared difference between predictions and targets. This is the foundational objective function of linear regression and underpins the training of most supervised learning models.
The normal equation gives an exact solution but requires forming and inverting XᵀX, which costs roughly O(nd² + d³) time for n examples and d features. For large datasets (millions of examples, thousands of features), this becomes computationally expensive. Gradient descent instead finds an approximate solution iteratively, at O(nd) cost per full-batch step and less with mini-batches. Deep learning models rely on gradient descent for a further reason: their non-linear layers mean that no closed-form solution like the normal equation exists.
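To make the comparison concrete, here is a hedged sketch of full-batch gradient descent on a small least-squares problem, checked against NumPy's direct solver. The data, the step size `lr` and the iteration count are assumptions picked so the toy problem converges.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])     # noise-free targets, for clarity

w = np.zeros(3)
lr = 0.1                               # hypothetical step size
for _ in range(500):
    # Gradient of the mean squared error (1/n) * ||Xw - y||^2 with respect to w.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w -= lr * grad                     # step against the gradient

w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)   # direct least-squares solution
print(w, w_closed)
```

Each iteration needs only matrix-vector products with X, which is what makes the iterative route attractive when X is very large.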
A linear transformation is a function that maps vectors from one space to another while preserving the structure of addition and scalar multiplication. Every neural network layer, every image preprocessing step, every coordinate change is a linear transformation.
Formally, a function T is a linear transformation if T(u + v) = T(u) + T(v) and T(cu) = cT(u) for all vectors u, v and scalar c. Critically, every linear transformation between finite-dimensional spaces can be represented as matrix multiplication — which is why matrices are so central to machine learning.
4.1 Geometric Interpretations
Linear transformations can be visualised geometrically as operations on space:
- Rotation: Rotates vectors around the origin by angle θ — used in data augmentation for computer vision
- Scaling: Stretches or compresses vectors along each axis — used in feature normalisation
- Reflection: Flips vectors across a line or plane — another augmentation technique
- Shearing: Slants the coordinate system — distorts images for training robustness
- Projection: Collapses vectors onto a subspace — the core operation of PCA and dimensionality reduction
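Three of the transformations listed above can be written as 2×2 matrices. The sketch applies each to the same vector; the angle, scale factors and choice of the x-axis as the projection target are assumptions for illustration.

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotate 90 degrees counter-clockwise
S = np.diag([2.0, 0.5])                           # stretch x by 2, compress y by half
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])                        # project onto the x-axis

v = np.array([1.0, 1.0])
rotated, scaled, projected = R @ v, S @ v, P @ v
print(rotated, scaled, projected)
```

Applying `P` twice gives the same result as applying it once (`P @ P` equals `P`), the defining property of a projection.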
4.2 Composition of Transformations
Applying one transformation after another is called composition, represented by matrix multiplication. A deep neural network with n layers applies n consecutive linear transformations (each followed by a non-linear activation function). The composition of all linear layers can be expressed as a single large matrix — which is why adding more layers doesn’t help without non-linear activations, since the composition of linear functions is still linear.
Pure linear transformations can only learn linear relationships — a neural network made entirely of matrix multiplications, no matter how deep, can only fit a linear model. This is why activation functions (ReLU, sigmoid, tanh) are inserted after each linear layer — they break the linearity and allow the network to approximate arbitrarily complex functions. Linear algebra provides the structure; non-linearity provides the expressive power.
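The collapse of stacked linear layers into a single matrix can be verified on a tiny example. The matrices below are hand-picked assumptions so the arithmetic is easy to follow; biases are omitted.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [0.0,  1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

two_layers = W2 @ (W1 @ x)      # W1 @ x = [-1, 2], then W2 gives [1]
one_layer = (W2 @ W1) @ x       # W2 @ W1 = [[1, 0]], which also gives [1]

def relu(z):
    return np.maximum(z, 0.0)

with_relu = W2 @ relu(W1 @ x)   # relu([-1, 2]) = [0, 2], so W2 gives [2]
print(two_layers, one_layer, with_relu)
```

Without the activation the two layers are exactly one matrix. With ReLU in between, the output differs, so no single matrix reproduces the network for all inputs.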
Eigenvalues and eigenvectors reveal the fundamental modes of behaviour of a linear transformation — they are the directions a matrix naturally acts on, and the magnitudes by which it stretches or compresses those directions. They are the DNA of data matrices.
5.1 Intuitive Understanding
Imagine a rubber sheet drawn with lines. Pull it in some direction — a linear transformation. Some lines stretch a lot, some barely move, some stay exactly in place. Those special directions that don’t rotate — only scale — are the eigenvectors. The amount they stretch is the corresponding eigenvalue.
An eigenvector of a square matrix A is a non-zero vector v such that Av = λv — multiplying A by v only scales v by a factor λ without changing its direction. The eigenvalue λ tells you the factor by which the eigenvector is scaled. Large eigenvalues correspond to directions of strong transformation; small eigenvalues to weak transformation. Zero eigenvalues correspond to directions that collapse entirely.
5.2 Eigendecomposition
Eigendecomposition breaks a diagonalisable square matrix A into A = QΛQ⁻¹, where Q is a matrix whose columns are the eigenvectors, and Λ is a diagonal matrix of eigenvalues. This factorisation reveals the intrinsic structure of the transformation: the principal axes and their scaling factors. It applies only to square matrices (and not every square matrix is diagonalisable), and it is the basis for PCA, spectral clustering, and PageRank.
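A minimal sketch of the factorisation on a small symmetric matrix, chosen as an assumption because symmetric matrices are always diagonalisable. It uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, which returns eigenvalues in ascending order.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(A)         # eigenvalues [1, 3]; columns of Q are eigenvectors
Lam = np.diag(eigvals)

reconstructed = Q @ Lam @ np.linalg.inv(Q)   # A = Q Lam Q^-1
v = Q[:, 1]                                  # eigenvector for the largest eigenvalue

print(eigvals)
print(np.allclose(A @ v, eigvals[1] * v))    # A only scales v, it does not rotate it
```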
5.3 Where Eigenvalues Appear in ML
| Algorithm | Role of Eigenvalues/Eigenvectors | Practical Effect |
|---|---|---|
| PCA | Eigenvectors of covariance matrix = principal components; eigenvalues = variance captured | Dimensionality reduction, feature compression |
| Spectral Clustering | Eigenvectors of graph Laplacian matrix define cluster structure | Non-convex cluster detection |
| Google PageRank | PageRank vector is the dominant eigenvector of the web link matrix | Web page importance ranking |
| Optimisation (Hessian) | Eigenvalues of Hessian indicate local curvature; all positive at a critical point = local minimum | Convergence diagnosis for gradient descent |
| Stability Analysis | Largest absolute eigenvalue of the recurrent weight matrix (spectral radius) governs RNN stability | Preventing vanishing/exploding gradients |
High-dimensional data is both powerful and treacherous — more features can improve accuracy, but also cause the curse of dimensionality, slow training, and overfitting. Principal Component Analysis (PCA) uses linear algebra to distill complex data into its most informative directions.
6.1 The Curse of Dimensionality
As the number of dimensions grows, data points become increasingly sparse in the feature space. The concept of “closeness” loses meaning — in high dimensions, all points are roughly equidistant. With 100 features, many may be redundant or correlated. Training on such data is computationally expensive, memory-intensive, and prone to overfitting, where a model memorises noise instead of patterns.
Imagine 10,000 customers described by 100 features each. Analysing all 100 is slow and most features are redundant — “sports gear interest” overlaps with “outdoor equipment interest.” PCA reduces this to 3-5 principal components capturing 95% of the variance, enabling fast visualisation and efficient downstream modelling. The geometry of customer behaviour can be understood as a 3D shape, not a 100-dimensional cloud.
6.2 How PCA Works — Step by Step
- Standardise the data: Subtract the mean of each feature and divide by its standard deviation so all features are on the same scale.
- Compute the covariance matrix: C = (1/(n−1))XᵀX, where X is the standardised data with n rows. It captures how features vary together. A high covariance between two features means they carry overlapping information.
- Eigendecompose the covariance matrix: Find eigenvectors (principal components) and eigenvalues (variance explained by each component).
- Sort by eigenvalue: Order components from largest to smallest eigenvalue — the first principal component captures the most variance.
- Project the data: Multiply the original data by the top k eigenvectors to get a k-dimensional representation.
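The five steps above can be sketched in a few lines of NumPy. The function name `pca`, the synthetic three-feature dataset, and the use of the sample covariance (dividing by n − 1, which only rescales the eigenvalues) are assumptions made for this illustration.

```python
import numpy as np

def pca(X, k):
    """Return the k-dimensional projection and the variance ratio of every component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # 1. standardise
    C = Z.T @ Z / (len(Z) - 1)                              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)                    # 3. eigendecompose (ascending)
    order = np.argsort(eigvals)[::-1]                       # 4. sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return Z @ eigvecs[:, :k], explained                    # 5. project onto top k

rng = np.random.default_rng(0)
t = rng.normal(size=(300, 1))
# Feature 2 is nearly a multiple of feature 1; feature 3 is independent noise.
X = np.hstack([t, 2 * t + 0.1 * rng.normal(size=(300, 1)), rng.normal(size=(300, 1))])

scores, explained = pca(X, k=2)
print(scores.shape, explained)
```

Because two of the three features are almost redundant, the last component explains almost no variance and can be dropped with little loss.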
6.3 Interpreting PCA Results
The percentage of variance explained by each component tells you how much information you retain when keeping that component. In practice, practitioners look for the “elbow” in a scree plot — the point where adding more components yields diminishing returns. Keeping components that explain 95% of total variance is a common heuristic.
PCA is not just used for dimensionality reduction. It is also used for data visualisation (reducing to 2D or 3D for plotting), noise reduction (discarding low-variance components removes noise), and as a preprocessing step for algorithms sensitive to high dimensions like K-nearest neighbours.
Singular Value Decomposition is one of the most important and versatile tools in all of linear algebra. Unlike eigendecomposition, SVD works on any matrix — not just square ones — and provides a complete structural picture of any linear transformation.
7.1 What SVD Reveals
SVD breaks any matrix A into three components, A = UΣVᵀ: the columns of V (the rows of Vᵀ) describe the directions in the input space (what patterns the transformation looks for), Σ contains the singular values on its diagonal (the magnitude/importance of each pattern), and the columns of U describe the directions in the output space (what the patterns map to). The singular values in Σ are always non-negative and conventionally sorted in decreasing order, so the first corresponds to the most “important” direction.
7.2 Low-Rank Approximation
One of SVD’s most powerful applications is low-rank approximation. By keeping only the top k singular values and their corresponding vectors, we get the best possible rank-k approximation of A as measured by the Frobenius norm (the square root of the sum of squared element differences). This principle is used everywhere from image compression to noise reduction in sensor data.
A 512×512 greyscale image is a matrix with 262,144 values. SVD decomposes it into components ranked by information content. Keeping only the top 50 singular values (out of 512) typically gives a recognisable image using just (512×50 + 50 + 50×512) / (512×512) ≈ 19.6% of the original storage, while preserving the main visual content. The discarded components carry mostly fine detail and noise.
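The storage arithmetic and the rank-k truncation can be reproduced with `np.linalg.svd`. The matrix below is random noise standing in for an image (an assumption, since no image file is part of this text), so it compresses far worse than a real photograph would; the point is the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 512))                 # stand-in for a 512x512 greyscale image
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 50
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # keep only the top k singular triplets

stored = 512 * k + k + k * 512                  # entries of U_k, the k values, and V_k
fraction = stored / (512 * 512)

# Eckart-Young: the Frobenius error equals the root of the sum of the
# squared singular values that were discarded.
err = np.linalg.norm(A - A_k, "fro")
expected_err = np.sqrt(np.sum(s[k:] ** 2))

print(round(fraction, 4), np.isclose(err, expected_err))
```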
7.3 SVD in Recommendation Systems
Netflix, Amazon, and Spotify use variants of SVD to power their recommendation engines. A user-item interaction matrix (rows = users, columns = movies/products, values = ratings) is decomposed into latent factor matrices. The left matrix U captures latent user preferences; the right matrix V captures latent item characteristics; Σ scales their importance. Unobserved ratings can then be predicted as the dot product of a user’s latent vector with an item’s latent vector.
7.4 SVD vs PCA Relationship
PCA can be computed using SVD applied to the data matrix X (after centring). The right singular vectors of X are exactly the principal components; the squared singular values divided by (n-1) are the eigenvalues of the covariance matrix. SVD is numerically more stable than eigendecomposition and is the preferred implementation for PCA in production ML libraries like scikit-learn.
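The relationship can be checked numerically. This sketch compares the SVD of a centred data matrix with the eigendecomposition of its covariance matrix; the random correlated data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
Xc = X - X.mean(axis=0)                                   # centre each column
n = len(Xc)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = s ** 2 / (n - 1)                # squared singular values / (n - 1)

C = Xc.T @ Xc / (n - 1)                        # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)           # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # flip to descending

print(np.allclose(eig_from_svd, eigvals))
# Rows of Vt match the eigenvectors up to sign, so |Vt @ eigvecs| is the identity.
print(np.allclose(np.abs(Vt @ eigvecs), np.eye(4), atol=1e-6))
```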
| Application | How SVD Is Used | Example System |
|---|---|---|
| Recommender Systems | Decomposes user-item matrix into latent factor spaces | Netflix, Spotify, Amazon |
| Image Compression | Low-rank approximation keeps only the largest singular components | Image databases |
| Natural Language Processing | Latent Semantic Analysis (LSA) uses SVD on term-document matrix | Topic modelling, search engines |
| Data Denoising | Small singular values correspond to noise; truncating removes it | Scientific data processing |
| PCA Implementation | Numerically stable way to compute principal components | scikit-learn’s PCA class |
| Pseudoinverse | A⁺ = V Σ⁺ Uᵀ — handles non-invertible matrices | Least squares regression |
Norms measure the size of vectors. Distances measure how far apart data points are. Projections find the nearest point on a subspace. These three related concepts are the geometric intuition behind loss functions, regularisation, clustering, and regression.
8.1 Vector Norms
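A norm is a function that assigns a non-negative length to a vector, satisfying three properties: non-negativity (||x|| ≥ 0, with equality only for the zero vector), homogeneity (||cx|| = |c|·||x||), and the triangle inequality (||x+y|| ≤ ||x|| + ||y||). The most important norms in ML are the L1 norm ||x||₁ = Σ|xᵢ| (sum of absolute values) and the L2 norm ||x||₂ = √(Σxᵢ²) (Euclidean length).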
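As a quick illustration, the sketch below computes the L1 and L2 norms with `np.linalg.norm` and checks the triangle inequality on one pair of vectors. The L∞ (max) norm is included as an extra example and is an addition not discussed in the text.

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])

l1 = np.linalg.norm(x, 1)          # |3| + |-4| = 7
l2 = np.linalg.norm(x)             # sqrt(9 + 16) = 5 (default is the L2 norm)
linf = np.linalg.norm(x, np.inf)   # max(|3|, |-4|) = 4

# Triangle inequality for the L2 norm on this pair.
triangle_ok = np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
print(l1, l2, linf, triangle_ok)
```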
8.2 Norms in Regularisation
Regularisation prevents overfitting by adding a penalty term to the loss function based on model weights. L2 regularisation (Ridge) adds λ||w||₂² to the loss — penalising large weights equally in all directions, pushing all weights toward zero smoothly. L1 regularisation (Lasso) adds λ||w||₁ to the loss — producing sparse solutions where many weights become exactly zero, effectively performing automatic feature selection.
L2 (Ridge) Regularisation
Penalises large weights. Produces dense solutions where all weights are small. Best when many features contribute. Loss: L + λ||w||₂²
L1 (Lasso) Regularisation
Produces sparse solutions. Many weights become exactly zero. Best for feature selection. Loss: L + λ||w||₁
Elastic Net
Combination of L1 and L2. Gets both sparsity and stability. Loss: L + λ₁||w||₁ + λ₂||w||₂²
8.3 Projections
A projection is the “shadow” of one vector onto another — the closest point to a vector within a given subspace. Projections are the geometry behind regression (projecting the target vector onto the column space of features) and PCA (projecting data onto principal component directions).
Shine a flashlight directly above an object — the shadow on the ground is the projection of the object onto the 2D plane. In linear regression, the predictions ŷ = Xw are exactly the projection of the true labels y onto the column space of the feature matrix X. The residuals (y – ŷ) are perpendicular (orthogonal) to every column in X. This geometric interpretation explains why least squares minimises squared error.
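The orthogonality of the residuals can be verified numerically. This is a sketch on random data (an assumption for illustration), comparing the least-squares fit with the explicit projection matrix P = X(XᵀX)⁻¹Xᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))       # two feature columns
y = rng.normal(size=20)            # arbitrary targets, not in the column space

w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w                      # predictions: the projection of y
residual = y - y_hat

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix onto the column space of X

print(X.T @ residual)              # close to zero: residual is orthogonal to every column
print(np.allclose(P @ y, y_hat))
```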
Gradient descent is how neural networks learn. It is fundamentally a linear algebra operation — computing the gradient vector of the loss function and taking a step in the negative direction. Understanding gradients is understanding how every ML model is trained.
9.1 The Gradient Vector
The gradient of a scalar-valued function f(w) is a vector ∇f(w) pointing in the direction of steepest ascent. Each element ∂f/∂wᵢ is the partial derivative — how much the loss changes when weight wᵢ changes. For a loss function L with model parameters w₁, w₂, …, wₙ, the gradient ∇L is an n-dimensional vector telling the model exactly how to adjust each weight to increase the loss. Training moves opposite to the gradient — toward lower loss.
9.2 Gradient Descent Variants
| Variant | Gradient Computed Over | Properties | Used In |
|---|---|---|---|
| Batch GD | All training examples | Stable but slow; exact gradient | Small datasets, convex problems |
| Stochastic GD (SGD) | One random example | Fast, noisy; can escape local minima | Online learning, large datasets |
| Mini-batch GD | Batch of 32-512 examples | Balance of speed and stability | Deep learning standard |
| Adam | Mini-batch with adaptive rates | Adapts learning rate per parameter | Most modern deep learning |
| RMSProp | Mini-batch with RMS scaling | Good for non-stationary problems | RNNs, online learning |
9.3 Backpropagation: Chain Rule as Matrix Operations
Backpropagation computes gradients through a neural network by applying the chain rule of calculus layer by layer, and each step is a matrix operation. For a linear layer z = Wx with error signal δ_next = ∂L/∂z arriving at its output, the error passed back to its input is δ = Wᵀ · δ_next (multiplied element-wise by the activation derivative where a non-linearity sits in between). The weight update is ΔW = -η · δ_next · xᵀ, the outer product of the output error and the input vector, scaled by the learning rate η.
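A hedged sketch of these two matrix operations for a single linear layer with a squared-error loss, checked against a finite-difference estimate. The names `delta_out` and `delta_in` (error at the layer's output and input) and the loss function are assumptions for this example; no activation function is included.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
target = rng.normal(size=3)

def loss(W_):
    z = W_ @ x
    return 0.5 * np.sum((z - target) ** 2)

z = W @ x
delta_out = z - target             # dL/dz: error signal at the layer output
grad_W = np.outer(delta_out, x)    # dL/dW: outer product of error and input
delta_in = W.T @ delta_out         # dL/dx: error passed back to the previous layer

# Finite-difference check of one weight gradient entry.
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
print(numeric, grad_W[1, 2])
```

A gradient descent step would then be `W -= eta * grad_W` for some learning rate `eta`.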
Gradient descent at scale is almost entirely matrix multiplication: products such as WᵀX computed for huge numbers of parameters across millions of examples. GPUs excel at exactly this, performing thousands of floating-point multiply-accumulate operations in parallel. A modern A100 GPU can perform up to 312 teraFLOPS of FP16 matrix operations. Training the largest language models takes thousands of such GPUs running for weeks or months, and most of that compute is matrix multiplication.
Rank, span, and linear independence answer a fundamental question about data: how much unique information does it actually contain? Redundant features inflate dimension without adding information — these concepts identify and measure that redundancy.
10.1 Linear Independence
A set of vectors is linearly independent if no vector in the set can be expressed as a linear combination of the others. If you can write v₃ = 2v₁ + 3v₂, then v₃ adds no new information — it is redundant. In a dataset, linearly dependent features (like height in cm and height in inches) don’t add information; they waste computation and can destabilise regression solutions.
10.2 Rank
The rank of a matrix is the number of linearly independent rows (or columns) — the number of unique dimensions of information. A rank-deficient matrix (rank < min(rows, cols)) means data has redundant features. This leads to unstable solutions in regression (XᵀX is not invertible) and poor generalisation. Understanding rank helps diagnose data quality issues early.
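A small sketch of rank deficiency using the height-in-cm and height-in-inches idea: the two columns are exact multiples of each other, so the matrix has three columns but only two independent ones. The numbers are made up for illustration.

```python
import numpy as np

height_cm = np.array([150.0, 160.0, 170.0, 180.0])
height_in = height_cm / 2.54                  # same information in different units
weight_kg = np.array([55.0, 72.0, 64.0, 90.0])

X = np.column_stack([height_cm, height_in, weight_kg])
rank_full = np.linalg.matrix_rank(X)               # 2, not 3: one column is redundant
rank_reduced = np.linalg.matrix_rank(X[:, [0, 2]]) # still 2 after dropping the duplicate

print(X.shape, rank_full, rank_reduced)
```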
10.3 Span and Column Space
The span of a set of vectors is all possible linear combinations of those vectors — the entire region of space you can reach by mixing them. In regression, predictions always lie in the column space of the feature matrix X. If the true labels cannot be expressed as a linear combination of the features, the regression can only find the closest approximation — the projection onto the column space.
“If you only know how to walk north and east, you can never reach south-west. Your possible movement space is limited — and so are your model’s predictions if features don’t span the necessary directions.”
— Sayan Chowdhury, Towards AI
Orthogonal vectors are the ideal building blocks for numerical computation: they are independent, non-redundant, and simplify calculations dramatically. Most of the numerical stability in modern ML algorithms comes from ensuring computations happen in orthogonal bases.
11.1 Orthogonality
Two vectors are orthogonal if their dot product is zero: a·b = 0. Geometrically, they meet at a 90° angle. Orthogonal vectors don’t interfere with each other — they act as clean, independent directions. A set of mutually orthogonal unit vectors is called an orthonormal basis, and working in such a basis makes projections, inversions, and distance calculations computationally efficient and numerically stable.
11.2 Gram-Schmidt Process
The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal set spanning the same space. It works by iteratively subtracting the projection of each new vector onto all previously processed vectors, leaving only the component orthogonal to all of them, then normalising. It is the foundation for QR decomposition.
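A compact sketch of the process. The function name `gram_schmidt` is an assumption, and the loop subtracts projections from the running vector (the "modified" ordering, which is mathematically equivalent to the description above but behaves better in floating point).

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalise the columns of V, assumed to be linearly independent."""
    Q = np.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        q = V[:, j].astype(float)
        for i in range(j):
            q = q - (Q[:, i] @ q) * Q[:, i]   # remove the component along Q[:, i]
        Q[:, j] = q / np.linalg.norm(q)       # normalise to unit length
    return Q

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = gram_schmidt(V)
print(np.round(Q.T @ Q, 10))                  # identity: columns are orthonormal
```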
11.3 QR Decomposition
QR decomposition factorises a matrix A into A = QR, where Q has orthonormal columns (for a square Q, Qᵀ = Q⁻¹) and R is upper triangular. It is used for solving linear systems and least squares problems in a numerically stable way, much more stable than computing (XᵀX)⁻¹ directly. Many least-squares solvers are built on QR, and the standard method for computing eigenvalues (the QR algorithm) applies it repeatedly.
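A minimal sketch of least squares via QR in NumPy, compared with the library solver. With X = QR, minimising ||Xw − y|| reduces to solving the small triangular system Rw = Qᵀy. The random data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

Q, R = np.linalg.qr(X)                 # reduced QR: Q is 30x3, R is 3x3
w_qr = np.linalg.solve(R, Q.T @ y)     # solve R w = Q^T y
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_qr, w_ref))
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(R, np.triu(R)))
```

`np.linalg.solve` is used here for brevity; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of R.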
When feature columns are nearly linearly dependent (collinear), the matrix XᵀX becomes nearly singular, and small measurement errors cause wildly unstable coefficient estimates. Working with orthonormal bases (as QR decomposition creates) avoids forming XᵀX, whose condition number is the square of that of X, and so greatly reduces this problem. Principal components share the advantage: as eigenvectors of a symmetric covariance matrix they are orthogonal to each other, which keeps computations with them stable and interpretable.
The gradient tells you which direction to descend. The Hessian tells you the shape of the terrain you are descending — flat, steep, curved, or saddle-shaped. Second-order methods use this curvature information to take more intelligent optimisation steps.
12.1 The Hessian Matrix
The Hessian H of a scalar function L(w) is the matrix of second derivatives: H[i,j] = ∂²L/∂wᵢ∂wⱼ. It captures the curvature of the loss surface in every direction. For a model with n parameters, the Hessian is an n×n matrix. For GPT-4, with a reported ~1.8 trillion parameters, storing the Hessian is utterly infeasible (about 3.2×10²⁴ entries). This is why practical optimisers rely on cheap curvature approximations (K-FAC, or the per-parameter scaling in Adam) rather than the full Hessian.
12.2 Saddle Points in Deep Learning
A critical insight for deep learning: most “stuck” points in high-dimensional loss surfaces are saddle points, not local minima. A saddle point has some positive and some negative Hessian eigenvalues, so the loss surface curves up in some directions and down in others. SGD and Adam tend to escape saddle points because the random noise in mini-batch gradients perturbs the path away from them. This is one reason stochastic methods often outperform full-batch methods in deep learning.
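The eigenvalue test can be sketched on the textbook saddle f(w) = w₀² − w₁², whose Hessian is constant. The function `classify` and its tolerance are assumptions for this example.

```python
import numpy as np

# f(w) = w0^2 - w1^2 has a critical point at the origin.
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])            # its Hessian, the same everywhere
eigvals = np.linalg.eigvalsh(H)        # eigenvalues of a symmetric matrix

def classify(eigvals, tol=1e-12):
    """Classify a critical point from the signs of its Hessian eigenvalues."""
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"              # some eigenvalues are (near) zero

print(eigvals, classify(eigvals))
```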
The largest eigenvalue of the Hessian limits how large a learning rate can be before steps overshoot, and the singular values of the weight matrices determine whether gradients grow or shrink as they pass back through the layers. If they are very large, gradients explode; if very small, gradients vanish and learning stalls. This is why gradient clipping (limiting gradient norm), careful weight initialisation (Xavier, He), and batch normalisation are essential techniques: each helps keep gradient magnitudes in a workable range.
In deep learning frameworks, everything is a tensor — inputs, outputs, weights, activations, gradients. Understanding tensor operations is the practical skill that bridges linear algebra theory to actual model implementation in PyTorch or TensorFlow.
13.1 Tensor Shapes in Practice
| Tensor Shape | Meaning | Example |
|---|---|---|
| (32,) | 1D: vector of 32 values | Batch of 32 loss values |
| (784,) | 1D: flattened 28×28 image | MNIST digit as vector |
| (100, 50) | 2D: matrix | Weight matrix: 100 inputs → 50 outputs |
| (32, 28, 28, 3) | 4D: batch of colour images | 32 RGB images of 28×28 pixels |
| (768, 768) | 2D: transformer weight | Attention projection matrix in BERT-base |
| (32, 512, 768) | 3D: batch of sequences | 32 sequences, 512 tokens, 768-dim embeddings |
13.2 Essential Tensor Operations
- Reshape/View: Change tensor dimensions without changing data. Used to flatten images before fully connected layers.
- Transpose/Permute: Reorder dimensions. Essential for attention: (batch, seq, head, dim) → (batch, head, seq, dim).
- Broadcasting: Automatically expands tensors of different shapes for element-wise operations — adds a (1,10) bias to a (32,10) batch.
- Matrix Multiply (matmul): The core operation of every linear layer and attention head.
- Concatenation: Joins tensors along a dimension — used in skip connections (ResNet) and feature fusion.
- Einsum: Einstein summation notation — expresses any tensor contraction in one line, used for batched attention computation.
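The operations listed above, sketched in NumPy rather than PyTorch or TensorFlow so the example has one dependency; the framework equivalents behave analogously. All shapes are small made-up values, and the 4D attention layout (batch, seq, head, dim) is an assumption.

```python
import numpy as np

images = np.zeros((32, 28, 28))
flat = images.reshape(32, -1)                  # reshape: flatten each image to 784 values

x = np.zeros((2, 5, 3, 4))                     # (batch, seq, head, dim)
per_head = x.transpose(0, 2, 1, 3)             # permute to (batch, head, seq, dim)

acts = np.ones((32, 10))
bias = np.arange(10.0).reshape(1, 10)
out = acts + bias                              # broadcasting: (1, 10) expands to (32, 10)

a = np.ones((32, 5, 4))
b = np.ones((32, 4, 6))
prod = np.einsum("bij,bjk->bik", a, b)         # einsum: batched matrix multiply
same = a @ b                                   # matmul gives the same result

joined = np.concatenate([acts, out], axis=1)   # concatenation along the feature axis

print(flat.shape, per_head.shape, out.shape, prod.shape, joined.shape)
```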
The transformer attention mechanism is almost entirely linear algebra. Given input X, we compute Q = XW_Q, K = XW_K, V = XW_V (three matrix multiplications). Then attention weights A = softmax(QKᵀ / √d), where d is the dimension of the query and key vectors: dot products between queries and keys, scaled and then normalised row by row. Finally, output = AV (a weighted sum of values). This entire sequence, the foundation of GPT, BERT, and all modern LLMs, is matrix multiplications and dot products plus a single softmax nonlinearity.
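The attention computation just described can be written as a minimal single-head sketch in NumPy. The function names, the random inputs and the dimensions are assumptions for illustration; production implementations add masking, multiple heads and batching.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention for one sequence."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # three matrix multiplications
    d = Q.shape[-1]                          # query/key dimension
    A = softmax(Q @ K.T / np.sqrt(d))        # (seq, seq) attention weights
    return A @ V, A                          # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d = 4, 8, 6                # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) for _ in range(3))

output, A = self_attention(X, W_Q, W_K, W_V)
print(output.shape, A.shape)
```

Each row of A is non-negative and sums to 1, so every output row is a convex combination of the value vectors.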
Every major AI and ML algorithm has linear algebra at its core. This section traces the specific operations that power each algorithm — making the abstract mathematics concrete and actionable.
Linear Regression
Data stored as matrix X; weights as vector w. Training solves the normal equation: w = (XᵀX)⁻¹Xᵀy, which requires XᵀX to be invertible (linearly independent feature columns). Predictions are: ŷ = Xw. Core operations: matrix multiplication, transpose, matrix inversion.
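The normal equation can be checked on synthetic data. The sketch below assumes made-up weights and a small amount of noise; it solves the linear system (XᵀX)w = Xᵀy with `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer route, and compares against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic data: an intercept column plus two features (illustrative only).
features = rng.normal(size=(n_samples, 2))
X = np.column_stack([np.ones(n_samples), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=n_samples)

# Normal equation: solve (X^T X) w = X^T y without computing an inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w   # predictions
print(w)
```

The recovered weights should land close to `true_w`, and both solvers should agree to within floating-point tolerance.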
Neural Networks
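Each layer: output = activation(Wx + b). Backprop: the gradient with respect to the layer input is Wᵀ × upstream_gradient, and the gradient with respect to W is the outer product of upstream_gradient and the input x, where upstream_gradient is the gradient at the pre-activation Wx + b. Core operations: batched matrix multiply, outer products for weight updates.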
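The forward and backward pass of one dense layer can be sketched as follows. ReLU is assumed as the activation and the upstream gradient is random, purely for illustration; the variable names are not from the sources.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))     # layer maps 4 inputs to 3 outputs
b = rng.normal(size=3)
x = rng.normal(size=4)

# Forward pass: output = activation(Wx + b), with ReLU as the activation.
z = W @ x + b
a = np.maximum(z, 0.0)

# Backward pass, given a hypothetical upstream gradient dL/da.
grad_a = rng.normal(size=3)
grad_z = grad_a * (z > 0)       # ReLU passes gradient only where z > 0
grad_W = np.outer(grad_z, x)    # dL/dW: outer product of gradient and input
grad_b = grad_z                 # dL/db
grad_x = W.T @ grad_z           # dL/dx: transpose of W times the gradient

print(grad_W.shape, grad_x.shape)
```

Note that `grad_W` has the same shape as `W` and `grad_x` the same shape as `x`, which is a quick sanity check on any hand-written backward pass.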
Convolutional Networks
Convolution is a linear operation, so applying a filter can be written as multiplication by a structured sparse (Toeplitz) matrix. Feature maps are computed as matrix products. Pooling uses max/mean aggregation.
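The Toeplitz view can be illustrated in one dimension. The sketch below assumes the deep learning convention, in which "convolution" slides the kernel without flipping it (cross-correlation); the helper name `correlation_matrix` is hypothetical.

```python
import numpy as np

def correlation_matrix(kernel, n):
    """Build the banded Toeplitz matrix that slides `kernel` over a length-n signal."""
    k = len(kernel)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        T[i, i:i + k] = kernel   # each row is the kernel shifted by one position
    return T

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])   # simple difference filter

T = correlation_matrix(kernel, len(signal))
as_matmul = T @ signal
reference = np.correlate(signal, kernel, mode="valid")

print(T)
print(as_matmul)   # [-2. -2. -2.]
```

Most entries of T are zero and every row reuses the same three weights, which is why convolutional layers need far fewer parameters than a dense layer of the same input and output size.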
Transformers & LLMs
Self-attention: A = softmax(QKᵀ/√d); output = AV. Feedforward: two matrix multiplications with a nonlinearity between them. Positional encoding: vector addition. Almost all of GPT-3's 175B parameters are entries of weight and embedding matrices.
Support Vector Machines
Finds the optimal separating hyperplane via dot products between support vectors. Kernel trick uses inner product functions in feature space. Core operation: dot product.
Recommendation Systems
Matrix factorisation (SVD) decomposes user-item ratings into latent factors. Predictions: r̂ᵤᵢ = uᵤ · vᵢ (dot product of user and item embeddings). Used by Netflix, Amazon.
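A truncated SVD on a tiny ratings matrix shows the latent-factor idea. The ratings below are invented, and the matrix is assumed to be fully observed; real recommender data has missing entries, which plain SVD does not handle and which production systems address with other factorisation methods.

```python
import numpy as np

# Rows are users, columns are items (made-up ratings for illustration).
ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2  # number of latent factors to keep
# Split the singular values evenly between user and item embeddings.
user_factors = U[:, :k] * np.sqrt(s[:k])        # shape (users, k)
item_factors = Vt[:k, :].T * np.sqrt(s[:k])     # shape (items, k)

# Predicted rating for user u and item i is the dot product u_u . v_i.
predicted = user_factors @ item_factors.T
r_hat_03 = user_factors[0] @ item_factors[3]

print(np.round(predicted, 2))
```

Keeping only the top k singular values gives the best rank-k approximation of the ratings matrix in the least-squares sense, which is what makes the low-dimensional embeddings meaningful.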
NLP & Word Embeddings
Words mapped to vectors (Word2Vec, GloVe, BERT embeddings). Semantic similarity = cosine similarity = normalised dot product. SVD on co-occurrence matrix gives LSA.
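Cosine similarity as a normalised dot product takes only a few lines. The three-dimensional "embeddings" below are toy values invented for illustration; real word vectors have hundreds of dimensions and are learned from text.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, not real embeddings.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
apple = np.array([0.1, 0.2, 0.9])

sim_king_queen = cosine_similarity(king, queen)
sim_king_apple = cosine_similarity(king, apple)

print(round(sim_king_queen, 3), round(sim_king_apple, 3))
```

Because the norms are divided out, scaling a vector does not change its cosine similarity to any other vector; only direction matters.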
Generative AI (GANs & VAEs)
GANs: generator and discriminator are neural networks using matrix ops. VAEs: encoder/decoder use linear projections into/out of latent space. Latent space is a linear algebraic structure.
Linear Algebra Across Industries
Healthcare & Genomics
PCA analyses gene expression data from thousands of genes. MRI reconstruction uses SVD. Drug molecule generation uses VAE latent space exploration.
Finance & Trading
Risk modelling with covariance matrices. Portfolio optimisation solves linear systems. Fraud detection uses high-dimensional feature vectors and SVMs.
Autonomous Vehicles
Computer vision uses CNNs (matrix ops). LIDAR point clouds are stored as N×3 arrays of 3D coordinates. Sensor fusion applies Kalman filters (matrix inversions). Path planning uses linear programming.
ML practitioners don’t implement linear algebra from scratch. A rich ecosystem of optimised Python libraries handles all the computation — letting you focus on models and insights. Knowing which function does what is essential practical knowledge.
NumPy: The foundational numerical computing library. Provides arrays, matrices, and the complete numpy.linalg sub-module for linear algebra operations.
SciPy: Extends NumPy with more specialised algorithms: sparse matrix operations, iterative solvers, and advanced decompositions via scipy.linalg.
PyTorch: Deep learning framework where all computations are tensor operations. Automatic differentiation (autograd) computes gradients automatically for any tensor graph.
scikit-learn: ML library implementing PCA, SVD, linear regression, and SVMs, all using optimised linear algebra under the hood via LAPACK and BLAS.
Essential NumPy Functions for Linear Algebra
| Function | Operation | ML Use Case |
|---|---|---|
| np.dot(A, B) | Matrix/vector dot product | Layer forward pass, attention scores |
| np.matmul(A, B) or A @ B | Matrix multiplication | Neural network layers |
| A.T or np.transpose(A) | Transpose | Normal equation, backprop |
| np.linalg.inv(A) | Matrix inverse | Normal equation solution |
| np.linalg.eig(A) | Eigenvalues & eigenvectors | PCA, spectral methods |
| np.linalg.svd(A) | Singular Value Decomposition | Dimensionality reduction, recommender |
| np.linalg.norm(x) | Vector/matrix norm | Regularisation, gradient clipping |
| np.linalg.det(A) | Determinant | Check invertibility |
| np.linalg.solve(A, b) | Solve linear system Ax=b | Linear regression, systems of equations |
| np.linalg.matrix_rank(A) | Matrix rank | Check data quality, linear independence |
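Several of these functions can be exercised on a small system to see how they relate. The 2×2 matrix below is an arbitrary symmetric example chosen for illustration. Note that NumPy's rank function is `np.linalg.matrix_rank`, and that `np.linalg.eigh` is the variant of `eig` intended for symmetric matrices.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)            # solves Ax = b directly
x_via_inv = np.linalg.inv(A) @ b     # same answer, usually less accurate and slower

det_A = np.linalg.det(A)             # nonzero, so A is invertible
rank_A = np.linalg.matrix_rank(A)    # full rank for this matrix

eigenvalues, eigenvectors = np.linalg.eigh(A)   # A is symmetric
U, s, Vt = np.linalg.svd(A)
norm_b = np.linalg.norm(b)           # L2 norm by default

print(x)        # [0.8 1.4]
print(rank_A)   # 2
```

In practice `np.linalg.solve` is generally preferred over multiplying by `np.linalg.inv(A)`, since it avoids forming the inverse at all.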
PyTorch and TensorFlow can execute tensor operations on NVIDIA GPUs through CUDA once the tensors and model have been placed on the device. The key insight: the operations discussed in this document (matrix multiplication, SVD, gradient computation) are highly parallel, and large matrix multiplications in particular typically run far faster on a GPU than on a CPU. Modern AI training is feasible largely because GPUs are well suited to the matrix operations at the core of linear algebra.
You don’t need to master all of linear algebra at once. A structured progression — from arithmetic foundations through advanced decompositions — builds intuition before rigor, and ensures every concept is reinforced by practical application.
Arithmetic Foundations (Week 1–2)
Real numbers, angles, trigonometry, Pythagorean theorem, Cartesian coordinate system, Euclidean distance, norm vs distance distinction. Goal: comfort with geometric intuition in 2D and 3D space.
Vector Fundamentals (Week 3–4)
Vector addition, scalar multiplication, dot product, cross product, unit vectors, linear combinations, linear independence, span, basis, and dimension. Code: implement all operations in NumPy.
Matrix Operations (Week 5–7)
Matrix arithmetic, transpose, inverse, determinant, rank, null space, column space, Gaussian elimination, REF/RREF. Implement linear regression with the normal equation from scratch.
Linear Transformations (Week 8–9)
Linear maps, their matrix representations, geometric interpretations (rotation, scaling, reflection, projection). Gram-Schmidt, orthogonality, QR decomposition. Code: image transformations.
Advanced Decompositions (Week 10–12)
Eigenvalues, eigenvectors, eigendecomposition, SVD. Implement PCA from scratch. Build a basic recommender system using SVD. Understand the relationship between SVD and PCA.
ML Integration (Week 13–16)
Gradient vectors, Hessian matrices, optimisation landscapes. Implement gradient descent, understand backpropagation as matrix operations. Build a neural network layer from scratch using NumPy.
Visual Learning: 3Blue1Brown’s “Essence of Linear Algebra” on YouTube is the best visual introduction that exists — 16 videos, each 8–15 minutes, that build geometric intuition from scratch. Theory: Gilbert Strang’s MIT OCW Linear Algebra (18.06) is the gold standard academic course, available free. ML-Focused: The “Mathematics for Machine Learning” textbook (Deisenroth et al.) is freely available and specifically targets ML applications. Practice: freeCodeCamp’s Linear Algebra course includes hands-on Python coding throughout.
Self-Assessment Checklist
- Can you explain what a dot product measures geometrically and name 3 ML uses?
- Can you implement linear regression from scratch using matrix operations in NumPy?
- Can you explain what eigenvalues represent and how PCA uses them?
- Can you describe SVD and name 3 applications beyond PCA?
- Can you explain why gradient descent works using the concept of gradient vectors?
- Can you explain what linear independence means and why it matters for regression?
- Can you explain how a transformer’s attention mechanism uses dot products?
- Can you identify the shapes of weight tensors in a simple 3-layer neural network?
Sources & References
Comprehensive coverage of scalars, vectors, matrices, tensors, eigenvalues, PCA, SVD and optimisation from IBM’s AI research team.
Introduction series by Ebrahim Mousavi covering the role of linear algebra in ML, Python libraries, and model examples.
Tatev Aslanyan’s structured roadmap covering core concepts, real-world applications, and recommended learning resources.
Deep dive into the mathematical foundations of AI including learning path and reference material for practitioners.
Sayan Chowdhury’s accessible guide covering 16 key concepts with real-world analogies and ML application examples.
Beginner-friendly coverage of how linear algebra powers ML algorithms with practical code examples and FAQ.
Enterprise perspective on linear algebra applications in AI for business transformation, covering key concepts and industry use cases.
Academic course notes from the University of Pennsylvania covering rigorous mathematical foundations for computer science applications.