00
Foundation
Why Linear Algebra Powers AI

Linear algebra is the silent workhorse of artificial intelligence. Every image a model sees, every word it understands, every prediction it makes — all encoded as vectors and matrices, processed through linear operations.

“Linear algebra is the mathematics of data. Without it, modern machine learning as we know it simply would not exist.”
— Stanford Professor Andrew Ng

Machine learning fundamentally does three things: represents data, transforms data, and optimises model parameters. Linear algebra is the language for all three. When you train a neural network, the forward pass is matrix multiplication. When you reduce dimensions with PCA, you compute eigenvectors. When gradient descent updates weights, it follows a gradient vector. Understanding linear algebra means understanding AI at its deepest level.

15 Core Concepts
Dimensions Possible
100B+ Parameters in GPT-4
1950s: Linear Algebra in Computing Began

From IBM’s research: “Whether training a neural network, building a recommendation system or applying PCA to a complex high-dimensional dataset, practitioners are using linear algebra to perform massive calculations.” Data rarely comes as a single number — it arrives as datasets of vectors, matrices of features, tensors of images. Linear algebra provides the tools to organise, manipulate and analyse all of it efficiently.

📐 The Three Pillars of Linear Algebra in ML

  • Data Representation: Scalars, vectors, matrices and tensors encode every type of real-world data: a customer profile, an image, a sentence, a sound clip.
  • Transformation: Matrix operations reshape, rotate, scale and project data into forms that reveal patterns.
  • Optimisation: Gradients, eigenvalues and matrix factorizations power the learning algorithms that tune model weights.

🧮
Linear Regression

Finds the best-fit hyperplane by solving a system of linear equations using matrix operations and least squares.

Foundational
🧠
Neural Networks

Every layer performs y = Wx + b — a matrix multiplication with weights plus a bias vector.

Deep Learning
📉
Dimensionality Reduction

PCA and SVD use eigenvectors to compress high-dimensional data into its most informative directions.

Unsupervised
🎯
Recommendation Systems

SVD decomposes user-item matrices into latent factors that predict what users will enjoy.

Applied AI
01
Core Concept One
Scalars, Vectors, Matrices & Tensors

The building blocks of all linear algebra in AI are four data structures that differ only in how many dimensions they occupy. Everything in machine learning — a data point, an image, a neural network weight — is one of these four structures.

1.1 Scalar

A scalar is the simplest building block — a single numerical value, such as 5 or 2.3. In machine learning, scalars represent individual parameters (like a learning rate of 0.001), loss function outputs (a single error value), scaling factors, or single measurements. When you see the temperature today as 37°C or a model’s accuracy as 0.94 — those are scalars.

Scalar Example
s = 5.7     (a single real number; zero-dimensional, it has no axes)

1.2 Vector

A vector is an ordered list of numbers, written as a column or row. Vectors are the workhorse data structure of machine learning — they represent data points, feature sets, model weights, directions, embeddings, and probability distributions. A customer described by age, income, and purchase count becomes the vector [34, 75000, 12].

Vectors in ML represent: a single data sample, a row in a dataset, model weights in a layer, directions for gradient descent, and word embeddings (where each word is a vector of ~300–1536 numbers).

Column Vector (3-dimensional) x = [1200, 2, 8]ᵀ    ← house: size 1200 sqft, 2 bedrooms, 8 years old

1.3 Matrix

A matrix is a two-dimensional array of numbers arranged in rows and columns. A dataset where each row is a data point and each column is a feature naturally forms a matrix. Matrices are central to linear algebra because they allow for efficient storage and transformation of data.

In neural networks, the weight matrix W of a layer defines how input signals are mixed and projected to the output. Multiplying an input vector x by W applies a learned linear transformation: the foundation of every deep learning forward pass.

Data Matrix (4 samples × 3 features)

X = ⎡ 1200  2   8 ⎤
    ⎢  950  3   5 ⎥
    ⎢ 1800  4  12 ⎥
    ⎣  700  1   2 ⎦

1.4 Tensor

A tensor is a generalisation of scalars, vectors and matrices to higher dimensions. A colour image is a 3D tensor: height × width × 3 (RGB channels). A batch of 32 such images becomes a 4D tensor: 32 × height × width × 3. In deep learning, tensors are the standard data structure — all inputs, weights, activations, and gradients in PyTorch and TensorFlow are tensors.

0️⃣

Scalar (0D Tensor)

Single number. Examples: loss value, learning rate, accuracy score.

1️⃣

Vector (1D Tensor)

Ordered list. Examples: feature vector, word embedding, bias term.

2️⃣

Matrix (2D Tensor)

Table of numbers. Examples: dataset, weight matrix, confusion matrix.

3️⃣

3D Tensor

Stacked matrices. Examples: colour image (H×W×C), time series batch.

4️⃣

4D Tensor

Batch of 3D. Examples: batch of images (N×H×W×C) fed into CNNs.

5️⃣

5D Tensor

Video data. Examples: batch × frames × height × width × channels.
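To make the hierarchy concrete, here is a minimal NumPy sketch of the first five levels. The 28×28 image size and the variable names are illustrative assumptions, not values from a particular dataset.

```python
import numpy as np

scalar = np.array(0.001)                 # 0D: e.g. a learning rate
vector = np.array([1200.0, 2.0, 8.0])    # 1D: the house feature vector from section 1.2
matrix = np.array([[1200, 2, 8],
                   [950, 3, 5],
                   [1800, 4, 12],
                   [700, 1, 2]])          # 2D: 4 samples x 3 features
image = np.zeros((28, 28, 3))            # 3D: height x width x RGB (illustrative size)
batch = np.zeros((32, 28, 28, 3))        # 4D: a batch of 32 such images

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("batch", batch)]:
    print(f"{name:6s} ndim={t.ndim} shape={t.shape}")
```

The `ndim` attribute is the number of axes, which is what "0D", "1D" and so on refer to in the list above.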

02
Core Concept Two
Key Matrix Operations

Matrix operations are the verbs of linear algebra — they transform, combine, and reshape data. Every forward pass in a neural network, every regression update, every embedding comparison executes one or more of these operations at massive scale.

2.1 Matrix Multiplication

Matrix multiplication is the single most important operation in deep learning. When input data flows through a neural network layer, it is multiplied by the weight matrix. The result encodes a learned combination of every input feature — this is how networks detect patterns.

For two matrices A (m×n) and B (n×p), the product C = AB is an m×p matrix where each element C[i,j] is the dot product of row i of A with column j of B. The inner dimension (n) must match — this constraint drives the design of neural network architecture shapes.

Matrix Multiplication: Neural Network Forward Pass
output = W · x + b
where W is [neurons_out × neurons_in], x is the input vector, b is the bias vector
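A minimal NumPy sketch of this forward pass, using the [neurons_out × neurons_in] shape convention for W. The layer sizes and random weights are assumptions chosen only to make the shapes visible.

```python
import numpy as np

rng = np.random.default_rng(0)
neurons_in, neurons_out = 3, 2

W = rng.normal(size=(neurons_out, neurons_in))  # weight matrix (random, for illustration)
b = np.zeros(neurons_out)                        # bias vector
x = np.array([1200.0, 2.0, 8.0])                 # one input sample

output = W @ x + b           # (2x3) @ (3,) + (2,) -> (2,)

# The same layer applied to a batch stored row-wise: X is (n_samples x neurons_in)
X = np.stack([x, 2 * x])
batch_output = X @ W.T + b   # (2x3) @ (3x2) -> (2x2), one row per sample
```

The inner dimensions must agree in both forms, which is why the batch version multiplies by `W.T` when samples are stored as rows.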

2.2 Dot Product

The dot product of two vectors produces a scalar — the sum of element-wise products: a·b = Σ(aᵢ × bᵢ). It measures how much two vectors point in the same direction. Crucially, it is the building block of cosine similarity, which measures whether two vectors (like two word embeddings) are semantically similar.

In transformer models, the attention mechanism computes dot products between query and key vectors to determine which parts of the input to attend to — the mathematical basis of self-attention in GPT and BERT.

Dot Product & Cosine Similarity
a·b = a₁b₁ + a₂b₂ + … + aₙbₙ
cos(θ) = (a·b) / (||a|| × ||b||)   ← similarity: +1 = same direction, 0 = orthogonal, -1 = opposite
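The two formulas translate almost directly into code. This is a small sketch; the function name `cosine_similarity` is our own, and it assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); undefined for zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
same = cosine_similarity(a, 2 * a)             # same direction, different length
orthogonal = cosine_similarity([1, 0], [0, 1]) # perpendicular vectors
opposite = cosine_similarity(a, -a)            # opposite direction
print(same, orthogonal, opposite)
```

Scaling a vector changes its dot product with another vector but not the cosine similarity, which is why the latter is preferred for comparing embeddings of different magnitudes.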

2.3 Transpose

The transpose of a matrix flips its rows and columns (Aᵀ[i,j] = A[j,i]). This is used constantly to align dimensions for multiplication and to uncover structural patterns. In the normal equation for linear regression, the formula involves Xᵀ to align the data matrix for self-multiplication.

2.4 Matrix Inverse

The inverse of a square matrix A, denoted A⁻¹, satisfies A·A⁻¹ = I (the identity matrix). It exists only when the determinant is non-zero. In linear regression, the optimal weights are found by: w = (XᵀX)⁻¹ Xᵀy — the Normal Equation, which directly solves for the best-fit parameters using matrix inversion.

💡 When Matrix Inversion Isn’t Possible

Real-world data is often ill-conditioned or the matrix is rectangular (not square), making direct inversion impossible. The Moore-Penrose Pseudoinverse (A⁺) solves this — it’s used in linear regression when data is overdetermined (more equations than unknowns) or underdetermined (fewer equations than unknowns), and is the practical replacement for exact inversion in numerical computing.
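As a sketch of the overdetermined case, the following fits a line through four points with the pseudoinverse and compares the result with a dedicated least squares solver and with the normal equation. The data values are made up for illustration.

```python
import numpy as np

# Overdetermined system: 4 equations, 2 unknowns (slope and intercept)
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

w_pinv = np.linalg.pinv(A) @ y                    # A+ y
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # dedicated least squares solver
w_normal = np.linalg.inv(A.T @ A) @ A.T @ y       # normal equation; safe here, A has full column rank

print(w_pinv)   # slope and intercept of the best-fit line
```

All three routes agree on this well-conditioned example. They diverge in behaviour when columns of A are nearly dependent, where the explicit inverse is the least reliable of the three.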

2.5 Determinant

The determinant of a square matrix is a single scalar that encodes important geometric and algebraic information. If det(A) ≠ 0, the matrix is invertible and the linear system Ax = b has a unique solution. If det(A) = 0, the matrix is singular — the data has redundant dimensions, and solutions may be infinite or nonexistent. Geometrically, the determinant measures the volume scaling factor of the linear transformation.

2.6 Identity and Zero Matrices

The identity matrix I is a square matrix with 1s on the diagonal and 0s elsewhere. Multiplying any matrix by I leaves it unchanged — it is the multiplicative identity for matrices, analogous to multiplying a number by 1. The identity matrix plays a crucial role in defining inverses and in the convergence proofs of many optimisation algorithms.

| Operation | Notation | Where Used in ML | Key Property |
| --- | --- | --- | --- |
| Matrix Multiplication | AB or A·B | Neural network layers, linear regression | Non-commutative: AB ≠ BA in general |
| Dot Product | a·b or aᵀb | Attention mechanisms, cosine similarity, SVM | Returns scalar; measures vector alignment |
| Transpose | Aᵀ | Normal equation, covariance matrix, attention | (AB)ᵀ = BᵀAᵀ |
| Matrix Inverse | A⁻¹ | Normal equation, Kalman filter | Only exists if det(A) ≠ 0 |
| Hadamard Product | A⊙B | LSTM gates, element-wise operations | Element-wise; same dimensions required |
| Outer Product | abᵀ | Rank-1 updates, gradient computation | Produces matrix from two vectors |
| Trace | tr(A) | Loss functions, Gaussian log-likelihood | Sum of diagonal elements |
03
Core Concept Three
Linear Systems & Equations

Many machine learning models reduce to, or are built on top of, solving systems of linear equations. Understanding how these systems behave (when they have a unique solution, no solution, or infinitely many solutions) is fundamental to understanding why ML models work.

A system of linear equations can be compactly written as Ax = b, where A is the coefficient matrix, x is the vector of unknowns, and b is the right-hand side. This is the universal form of almost every ML problem setup.

General Form of a Linear System
Ax = b
A = coefficient matrix, x = unknowns, b = targets/outputs
Example (House Price Prediction): price = w₁·(sqft) + w₂·(bedrooms) + b

3.1 Solution Types

A linear system Ax = b has three possible outcomes, determined by A together with b. Unique solution: when the system is consistent and A has full column rank (for a square A, this means A is invertible), there is exactly one answer. No solution: when the equations are contradictory. This is typical of overdetermined systems (more equations than unknowns), which is the usual situation with noisy real ML data. Infinite solutions: when the system is consistent but A lacks full column rank. This is typical of underdetermined systems (fewer equations than unknowns), common in deep learning where models have more parameters than training examples.
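One standard way to tell the three cases apart is to compare the rank of A with the rank of the augmented matrix [A | b]. The helper below is a sketch of that test; the function name `classify_system` is our own and the example systems are chosen by hand.

```python
import numpy as np

def classify_system(A, b):
    """Classify Ax = b by comparing rank(A) with the rank of the augmented matrix [A | b]."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.hstack([A, b]))
    if rank_A < rank_aug:
        return "no solution"                 # b is outside the column space of A
    if rank_A == A.shape[1]:
        return "unique solution"             # consistent, full column rank
    return "infinitely many solutions"       # consistent, free directions remain

print(classify_system([[1, 1], [1, -1]], [2, 0]))   # two independent equations
print(classify_system([[1, 1], [1, 1]], [2, 3]))    # contradictory equations
print(classify_system([[1, 1]], [2]))               # one equation, two unknowns
```

Rank is computed numerically with a tolerance, so for noisy or badly scaled data the classification should be read as a diagnostic rather than an exact answer.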

3.2 Gaussian Elimination & Row Reduction

Gaussian elimination is the foundational algorithm for solving linear systems. By applying elementary row operations (scaling, swapping, and adding rows), it transforms the coefficient matrix into Row Echelon Form (REF) or Reduced Row Echelon Form (RREF), from which solutions can be read directly. In ML, this principle is used computationally in solving the normal equation and in understanding the structure of data matrices.

3.3 Least Squares: The ML Default

In practice, real data rarely satisfies an exact system of equations — there is always noise and measurement error. The least squares method finds the x that minimises ||Ax – b||² — the squared difference between predictions and targets. This is the foundational objective function of linear regression and underpins the training of most supervised learning models.

Normal Equation: Closed-Form Least Squares Solution
w* = (XᵀX)⁻¹ Xᵀy
Finds the optimal weights w* that minimise mean squared error on the training data

⚡ Why Gradient Descent Instead of the Normal Equation?

The normal equation gives an exact solution but requires forming and inverting XᵀX. For m examples and d features, that costs roughly O(m·d²) to build XᵀX plus O(d³) to invert it, which becomes prohibitive when d is large. Gradient descent instead finds an approximate solution iteratively at O(m·d) cost per full-batch step, or less with mini-batches. Deep learning models rely on gradient descent for a further reason: their losses are non-linear in the weights, so no closed-form solution like the normal equation exists.
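The sketch below fits the same small regression problem both ways. The data is synthetic, and the learning rate and iteration count are assumptions that happen to work for this well-scaled example; they are not general recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3                                   # examples, features
X = rng.normal(size=(m, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=m)      # targets with a little noise

# Closed form: solve (X^T X) w = X^T y rather than forming the inverse explicitly
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error
w_gd = np.zeros(d)
eta = 0.1                                       # learning rate (assumed)
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w_gd - y)     # gradient of mean((Xw - y)^2)
    w_gd -= eta * grad

print(w_closed, w_gd)
```

On a problem this small the closed form is both faster and exact. The iterative route earns its keep when d is large or when the model is no longer linear in its weights.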

04
Core Concept Four
Linear Transformations

A linear transformation is a function that maps vectors from one space to another while preserving the structure of addition and scalar multiplication. The weight-matrix step of every neural network layer, many image preprocessing steps, and every change of coordinate basis are linear transformations.

Formally, a function T is a linear transformation if T(u + v) = T(u) + T(v) and T(cu) = cT(u) for all vectors u, v and scalar c. Critically, every linear transformation between finite-dimensional spaces can be represented as matrix multiplication — which is why matrices are so central to machine learning.

4.1 Geometric Interpretations

Linear transformations can be visualised geometrically as operations on space:

  • Rotation: Rotates vectors around the origin by angle θ — used in data augmentation for computer vision
  • Scaling: Stretches or compresses vectors along each axis — used in feature normalisation
  • Reflection: Flips vectors across a line or plane — another augmentation technique
  • Shearing: Slants the coordinate system — distorts images for training robustness
  • Projection: Collapses vectors onto a subspace — the core operation of PCA and dimensionality reduction
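Each of these operations corresponds to a small matrix in two dimensions. The sketch below applies one example of each to the vector [1, 1]; the specific angle and scale factors are arbitrary choices for illustration.

```python
import numpy as np

theta = np.pi / 2  # 90 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
scaling = np.diag([2.0, 0.5])                 # stretch x, compress y
reflection = np.array([[1.0, 0.0],
                       [0.0, -1.0]])          # flip across the x-axis
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                # slant along x
projection = np.array([[1.0, 0.0],
                       [0.0, 0.0]])           # collapse onto the x-axis

v = np.array([1.0, 1.0])
for name, M in [("rotation", rotation), ("scaling", scaling),
                ("reflection", reflection), ("shear", shear),
                ("projection", projection)]:
    print(f"{name:10s} {np.round(M @ v, 3)}")
```

Note that the projection matrix satisfies P·P = P: projecting a second time changes nothing, because the vector already lies in the subspace.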

4.2 Composition of Transformations

Applying one transformation after another is called composition, represented by matrix multiplication. A deep neural network with n layers applies n consecutive linear transformations (each followed by a non-linear activation function). The composition of all linear layers can be expressed as a single large matrix — which is why adding more layers doesn’t help without non-linear activations, since the composition of linear functions is still linear.

🧠 The Role of Non-Linearity

Pure linear transformations can only learn linear relationships — a neural network made entirely of matrix multiplications, no matter how deep, can only fit a linear model. This is why activation functions (ReLU, sigmoid, tanh) are inserted after each linear layer — they break the linearity and allow the network to approximate arbitrarily complex functions. Linear algebra provides the structure; non-linearity provides the expressive power.
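A short numerical check of both claims, using random weights as stand-ins for a trained network (the layer sizes are arbitrary assumptions): two stacked linear layers equal one matrix, and inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # layer 2: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # apply the layers one after the other
one_layer = (W2 @ W1) @ x      # a single equivalent 2x3 matrix

def relu(z):
    return np.maximum(z, 0.0)

def f(v):
    """Two layers with a ReLU in between."""
    return W2 @ relu(W1 @ v)

# A linear map must satisfy f(x + u) = f(x) + f(u). Test it with u = -x,
# where the left-hand side is f(0) = 0.
u = -x
is_additive = np.allclose(f(x + u), f(x) + f(u))
print(np.allclose(two_layers, one_layer), is_additive)
```

Because the ReLU network fails the additivity test, no single matrix can reproduce it, which is exactly what gives the extra layers their value.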

05
Core Concept Five
Eigenvalues & Eigenvectors

Eigenvalues and eigenvectors reveal the fundamental modes of behaviour of a linear transformation — they are the directions a matrix naturally acts on, and the magnitudes by which it stretches or compresses those directions. They are the DNA of data matrices.

Eigenvalue Equation
Av = λv
A = square matrix, v = eigenvector (non-zero), λ = eigenvalue (scalar)
“Multiplying A by v only scales v by λ; it doesn’t change direction”

5.1 Intuitive Understanding

Imagine a rubber sheet drawn with lines. Pull it in some direction — a linear transformation. Some lines stretch a lot, some barely move, some stay exactly in place. Those special directions that don’t rotate — only scale — are the eigenvectors. The amount they stretch is the corresponding eigenvalue.

An eigenvector of a square matrix A is a non-zero vector v such that Av = λv — multiplying A by v only scales v by a factor λ without changing its direction. The eigenvalue λ tells you the factor by which the eigenvector is scaled. Large eigenvalues correspond to directions of strong transformation; small eigenvalues to weak transformation. Zero eigenvalues correspond to directions that collapse entirely.

5.2 Eigendecomposition

Eigendecomposition breaks a square matrix A into A = QΛQ⁻¹, where Q is a matrix whose columns are the eigenvectors, and Λ is a diagonal matrix of eigenvalues. This factorisation reveals the intrinsic structure of the transformation: the principal axes and their scaling factors. It applies only to square matrices that are diagonalisable, meaning they have a full set of linearly independent eigenvectors; symmetric matrices such as covariance matrices always qualify. It is the basis for PCA, spectral clustering, and PageRank.
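A minimal check of the factorisation on a 2×2 symmetric matrix, chosen by hand so the eigenvalues are easy to verify. The sketch uses NumPy's `eigh`, which is intended for symmetric matrices; for a general square matrix `np.linalg.eig` would be the counterpart.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric, so eigenvalues are real

eigenvalues, Q = np.linalg.eigh(A)   # eigenvalues in ascending order, eigenvectors as columns of Q
Lam = np.diag(eigenvalues)

# Each column v of Q should satisfy A v = lambda v
checks = [np.allclose(A @ v, lam * v) for lam, v in zip(eigenvalues, Q.T)]

# Reconstruct A = Q Lam Q^-1 (for symmetric A, Q^-1 equals Q^T)
A_rebuilt = Q @ Lam @ np.linalg.inv(Q)
print(eigenvalues, all(checks))
```

For this matrix the eigenvalues are 1 and 3, with eigenvectors along the two diagonals of the plane.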

5.3 Where Eigenvalues Appear in ML

| Algorithm | Role of Eigenvalues/Eigenvectors | Practical Effect |
| --- | --- | --- |
| PCA | Eigenvectors of covariance matrix = principal components; eigenvalues = variance captured | Dimensionality reduction, feature compression |
| Spectral Clustering | Eigenvectors of graph Laplacian matrix define cluster structure | Non-convex cluster detection |
| Google PageRank | PageRank vector is the dominant eigenvector of the web link matrix | Web page importance ranking |
| Optimisation (Hessian) | Eigenvalues of Hessian indicate local curvature; all positive at a critical point = local minimum | Convergence diagnosis for gradient descent |
| Stability Analysis | Largest eigenvalue magnitude of the weight matrix (spectral radius) governs RNN stability | Preventing vanishing/exploding gradients |
06
Core Concept Six
PCA & Dimensionality Reduction

High-dimensional data is both powerful and treacherous — more features can improve accuracy, but also cause the curse of dimensionality, slow training, and overfitting. Principal Component Analysis (PCA) uses linear algebra to distill complex data into its most informative directions.

6.1 The Curse of Dimensionality

As the number of dimensions grows, data points become increasingly sparse in the feature space. The concept of “closeness” loses meaning — in high dimensions, all points are roughly equidistant. With 100 features, many may be redundant or correlated. Training on such data is computationally expensive, memory-intensive, and prone to overfitting, where a model memorises noise instead of patterns.

📊 Concrete Example of Dimensionality

Imagine 10,000 customers described by 100 features each. Analysing all 100 is slow and most features are redundant — “sports gear interest” overlaps with “outdoor equipment interest.” PCA reduces this to 3-5 principal components capturing 95% of the variance, enabling fast visualisation and efficient downstream modelling. The geometry of customer behaviour can be understood as a 3D shape, not a 100-dimensional cloud.

6.2 How PCA Works — Step by Step

  1. Standardise the data: Subtract the mean of each feature and divide by its standard deviation so all features are on the same scale.
  2. Compute the covariance matrix: C = (1/(n−1))XᵀX, where X is the standardised data matrix with n rows. It captures how features vary together. A high covariance between two features means they carry overlapping information.
  3. Eigendecompose the covariance matrix: Find eigenvectors (principal components) and eigenvalues (variance explained by each component).
  4. Sort by eigenvalue: Order components from largest to smallest eigenvalue, so the first principal component captures the most variance.
  5. Project the data: Multiply the standardised data by the top k eigenvectors to get a k-dimensional representation.

PCA: Covariance Matrix & Projection
C = (1/(n−1)) Xᵀ X   ← covariance matrix of the standardised data X
Cv = λv              ← find eigenvectors v and eigenvalues λ
Z = X · V_k          ← project onto top-k components
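The five steps can be sketched in a few lines of NumPy. This is an illustrative implementation, not a replacement for a library routine: the function name `pca` is our own, the data is synthetic, and the covariance uses the sample (n − 1) denominator. Choosing 1/n instead only rescales the eigenvalues and leaves the components unchanged.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix, following the five steps."""
    # 1. Standardise each feature (assumes no feature is constant)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardised data
    n = Xs.shape[0]
    C = Xs.T @ Xs / (n - 1)
    # 3. Eigendecompose (eigh, because C is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # 4. Sort from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project onto the top-k components
    Z = Xs @ eigenvectors[:, :k]
    explained = eigenvalues[:k] / eigenvalues.sum()
    return Z, explained

# Synthetic data: feature 2 is almost a copy of feature 1, feature 3 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])
Z, explained = pca(X, k=2)
print(Z.shape, np.round(explained, 3))
```

Because two of the three features are nearly redundant, the first component alone accounts for roughly two thirds of the total variance, and two components capture almost all of it.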

6.3 Interpreting PCA Results

The percentage of variance explained by each component tells you how much information you retain when keeping that component. In practice, practitioners look for the “elbow” in a scree plot — the point where adding more components yields diminishing returns. Keeping components that explain 95% of total variance is a common heuristic.

PCA is not just used for dimensionality reduction. It is also used for data visualisation (reducing to 2D or 3D for plotting), noise reduction (discarding low-variance components removes noise), and as a preprocessing step for algorithms sensitive to high dimensions like K-nearest neighbours.

07
Core Concept Seven
Singular Value Decomposition (SVD)

Singular Value Decomposition is one of the most important and versatile tools in all of linear algebra. Unlike eigendecomposition, SVD works on any matrix — not just square ones — and provides a complete structural picture of any linear transformation.

SVD Decomposition of any Matrix A (m×n)
A = U Σ Vᵀ
U = left singular vectors (m×m orthogonal)
Σ = diagonal matrix of singular values (m×n)
V = right singular vectors (n×n orthogonal)

7.1 What SVD Reveals

SVD breaks any matrix A into three components. The columns of V (the rows of Vᵀ) are orthonormal directions in the input space: the patterns the transformation looks for. Σ contains the singular values on its diagonal: the magnitude or importance of each pattern. The columns of U are orthonormal directions in the output space: what each pattern maps to. The singular values in Σ are always non-negative and, by convention, sorted in decreasing order, so the first is always the most “important” direction.

7.2 Low-Rank Approximation

One of SVD’s most powerful applications is low-rank approximation. By keeping only the top k singular values and their corresponding vectors, we get the best possible rank-k approximation of A as measured by the Frobenius norm (the square root of the sum of squared element differences), a result known as the Eckart-Young theorem. This principle is used everywhere from image compression to noise reduction in sensor data.

📸 Image Compression with SVD

A 512×512 greyscale image is a matrix with 262,144 values. SVD decomposes it into components ranked by information content. Keeping only the top 50 singular values (out of 512) requires storing (512×50 + 50 + 50×512) / (512×512) ≈ 20% of the original values. For typical photographs this still gives a recognisable image that preserves the main visual content, because the discarded components mostly carry fine detail and noise.
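A sketch of truncated SVD in NumPy. A random 64×64 matrix stands in for an image here (an assumption made to keep the example self-contained; random data compresses far worse than a real photograph), and the helper name `low_rank` is our own. The last lines recompute the storage fraction for the 512×512, k = 50 case.

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation of A from its truncated SVD; also returns all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))   # synthetic stand-in for a small greyscale image
k = 8
A_k, s = low_rank(A, k)

# The Frobenius error equals the energy in the discarded singular values
error = np.linalg.norm(A - A_k, "fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))

# Storage for the 512x512 example: k columns of U, k singular values, k rows of V^T
m = n = 512
storage_fraction = (m * 50 + 50 + 50 * n) / (m * n)
print(round(storage_fraction, 4))
```

The equality between `error` and `expected_error` is what makes the singular values a direct measure of how much is lost at each truncation level.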

7.3 SVD in Recommendation Systems

Netflix, Amazon, and Spotify use variants of SVD to power their recommendation engines. A user-item interaction matrix (rows = users, columns = movies/products, values = ratings) is decomposed into latent factor matrices. The left matrix U captures latent user preferences; the right matrix V captures latent item characteristics; Σ scales their importance. Unobserved ratings can then be predicted as the dot product of a user’s latent vector with an item’s latent vector.

7.4 SVD vs PCA Relationship

PCA can be computed using SVD applied to the data matrix X (after centring). The right singular vectors of X are exactly the principal components; the squared singular values divided by (n-1) are the eigenvalues of the covariance matrix. SVD is numerically more stable than eigendecomposition and is the preferred implementation for PCA in production ML libraries like scikit-learn.
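The relationship can be checked numerically. The sketch below computes the variances both ways on synthetic correlated data; the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # synthetic correlated features
Xc = X - X.mean(axis=0)                                   # centre each column
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigenvalues = np.linalg.eigvalsh(C)[::-1]                 # sorted largest first

# Route 2: SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
variances_from_svd = s ** 2 / (n - 1)

print(np.allclose(eigenvalues, variances_from_svd))
```

The rows of `Vt` are the principal directions, so the SVD route delivers both components and variances without ever forming the covariance matrix.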

| Application | How SVD Is Used | Example System |
| --- | --- | --- |
| Recommender Systems | Decomposes user-item matrix into latent factor spaces | Netflix, Spotify, Amazon |
| Image Compression | Low-rank approximation discards noise components | Image databases |
| Natural Language Processing | Latent Semantic Analysis (LSA) uses SVD on term-document matrix | Topic modelling, search engines |
| Data Denoising | Small singular values correspond to noise; truncating removes it | Scientific data processing |
| PCA Implementation | Numerically stable way to compute principal components | scikit-learn’s PCA class |
| Pseudoinverse | A⁺ = V Σ⁺ Uᵀ, which handles non-invertible matrices | Least squares regression |
08
Core Concept Eight
Norms, Distances & Projections

Norms measure the size of vectors. Distances measure how far apart data points are. Projections find the nearest point on a subspace. These three related concepts are the geometric intuition behind loss functions, regularisation, clustering, and regression.

8.1 Vector Norms

A norm is a function that assigns a non-negative length to a vector, satisfying three properties: non-negativity (||x|| ≥ 0), homogeneity (||cx|| = |c|·||x||), and the triangle inequality (||x+y|| ≤ ||x|| + ||y||). The most important norms in ML:

The Key Norms
L2 norm (Euclidean): ||x||₂ = √(x₁² + x₂² + … + xₙ²)   ← straight-line distance
L1 norm (Manhattan): ||x||₁ = |x₁| + |x₂| + … + |xₙ|   ← sum of absolute values
L∞ norm (Max norm): ||x||∞ = max(|x₁|, |x₂|, …, |xₙ|)   ← largest absolute element
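All three norms are available through `np.linalg.norm` via its `ord` argument. A small sketch on a hand-picked vector, with a check of the triangle inequality:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
y = np.array([1.0, 2.0, -2.0])

l2 = np.linalg.norm(x)                  # Euclidean norm (the default for vectors)
l1 = np.linalg.norm(x, ord=1)           # Manhattan norm
linf = np.linalg.norm(x, ord=np.inf)    # max norm
print(l2, l1, linf)                     # L2 = 5, L1 = 7, L-infinity = 4

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
triangle_ok = np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
```

For any vector the three norms are ordered L∞ ≤ L2 ≤ L1, which is visible in the values above.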

8.2 Norms in Regularisation

Regularisation prevents overfitting by adding a penalty term to the loss function based on model weights. L2 regularisation (Ridge) adds λ||w||₂² to the loss — penalising large weights equally in all directions, pushing all weights toward zero smoothly. L1 regularisation (Lasso) adds λ||w||₁ to the loss — producing sparse solutions where many weights become exactly zero, effectively performing automatic feature selection.

🎯

L2 (Ridge) Regularisation

Penalises large weights. Produces dense solutions where all weights are small. Best when many features contribute. Loss: L + λ||w||₂²

✂️

L1 (Lasso) Regularisation

Produces sparse solutions. Many weights become exactly zero. Best for feature selection. Loss: L + λ||w||₁

🔀

Elastic Net

Combination of L1 and L2. Gets both sparsity and stability. Loss: L + λ₁||w||₁ + λ₂||w||₂²

8.3 Projections

A projection is the “shadow” of one vector onto another — the closest point to a vector within a given subspace. Projections are the geometry behind regression (projecting the target vector onto the column space of features) and PCA (projecting data onto principal component directions).

💡 Projection Analogy

Shine a flashlight directly above an object: the shadow on the ground is the projection of the object onto the 2D plane. In linear regression, the predictions ŷ = Xw are exactly the projection of the true labels y onto the column space of the feature matrix X. The residuals (y – ŷ) are perpendicular (orthogonal) to every column in X. This orthogonality is exactly the condition that makes ŷ the closest point to y within the column space, which is why the least squares solution minimises squared error.
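The orthogonality of the residuals is easy to verify numerically. The sketch below uses synthetic data with an intercept column; the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
y = rng.normal(size=50)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w                  # projection of y onto the column space of X
residual = y - y_hat

# X^T (y - y_hat) should be zero up to rounding
print(np.round(X.T @ residual, 10))

# Any other weight vector gives a point in the column space that is farther from y
other_distance = np.linalg.norm(y - X @ (w + 0.1))
```

Setting Xᵀ(y − Xw) = 0 and solving for w is another way to arrive at the normal equation.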

09
Core Concept Nine
Gradients & Optimisation

Gradient descent is how neural networks learn. It is fundamentally a linear algebra operation — computing the gradient vector of the loss function and taking a step in the negative direction. Understanding gradients is understanding how every ML model is trained.

9.1 The Gradient Vector

The gradient of a scalar-valued function f(w) is a vector ∇f(w) pointing in the direction of steepest ascent. Each element ∂f/∂wᵢ is the partial derivative: how much the function changes when weight wᵢ changes. For a loss function L with model parameters w₁, w₂, …, wₙ, the gradient ∇L is an n-dimensional vector whose entries say how sensitive the loss is to each weight; stepping along it increases the loss fastest. Training therefore moves opposite to the gradient, toward lower loss.

Gradient Descent Weight Update
w ← w − η · ∇L(w)
η = learning rate (step size scalar)
∇L(w) = gradient vector (direction of steepest ascent)
Repeat until convergence → minimum of loss function

9.2 Gradient Descent Variants

| Variant | Gradient Computed Over | Properties | Used In |
| --- | --- | --- | --- |
| Batch GD | All training examples | Stable but slow; exact gradient | Small datasets, convex problems |
| Stochastic GD (SGD) | One random example | Fast, noisy; can escape local minima | Online learning, large datasets |
| Mini-batch GD | Batch of 32-512 examples | Balance of speed and stability | Deep learning standard |
| Adam | Mini-batch with adaptive rates | Adapts learning rate per parameter | Most modern deep learning |
| RMSProp | Mini-batch with RMS scaling | Good for non-stationary problems | RNNs, online learning |

9.3 Backpropagation: Chain Rule as Matrix Operations

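Backpropagation computes gradients through a neural network by applying the chain rule of calculus layer by layer, and in practice it is a sequence of matrix operations. For a linear layer z = Wx, let δ_next = ∂L/∂z be the error signal at the layer's output. The error passed back to the layer's input is δ = Wᵀ · δ_next, and the gradient with respect to the weights is the outer product δ_next · xᵀ, giving the update ΔW = −η · δ_next · xᵀ. If the layer's output passes through an activation function, the incoming error is first multiplied element-wise by the activation's derivative to obtain δ_next.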
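A sketch that checks the outer-product form of the weight gradient against finite differences, for a single linear layer with a squared-error loss. The loss choice, layer sizes and names (`delta`, `grad_W`, `grad_x`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = rng.normal(size=3)
target = rng.normal(size=2)

def loss(W_):
    """Squared-error loss of the linear layer W_ @ x against the target."""
    return 0.5 * np.sum((W_ @ x - target) ** 2)

delta = W @ x - target        # error signal at the layer output, dL/d(Wx)
grad_W = np.outer(delta, x)   # dL/dW as an outer product
grad_x = W.T @ delta          # error passed back to the layer input

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)

print(np.allclose(grad_W, numeric, atol=1e-6))
```

A gradient check of this kind is a common way to validate a hand-written backward pass before trusting it in training.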

⚡ Why GPUs Accelerate Training

Training at scale is dominated by matrix multiplications: products such as WX and Wᵀδ evaluated over millions of parameters and examples. GPUs excel at exactly this: performing thousands of floating-point multiply-accumulate operations in parallel. An NVIDIA A100 GPU has a peak of 312 teraFLOPS for dense FP16 Tensor Core matrix operations. Training a model at GPT-4's scale reportedly takes thousands of such GPUs running for months, with most of that compute spent on matrix multiplication.

10
Core Concept Ten
Rank, Span & Linear Independence

Rank, span, and linear independence answer a fundamental question about data: how much unique information does it actually contain? Redundant features inflate dimension without adding information — these concepts identify and measure that redundancy.

10.1 Linear Independence

A set of vectors is linearly independent if no vector in the set can be expressed as a linear combination of the others. If you can write v₃ = 2v₁ + 3v₂, then v₃ adds no new information — it is redundant. In a dataset, linearly dependent features (like height in cm and height in inches) don’t add information; they waste computation and can destabilise regression solutions.

10.2 Rank

The rank of a matrix is the number of linearly independent rows (or columns) — the number of unique dimensions of information. A rank-deficient matrix (rank < min(rows, cols)) means data has redundant features. This leads to unstable solutions in regression (XᵀX is not invertible) and poor generalisation. Understanding rank helps diagnose data quality issues early.
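Rank deficiency from a duplicated feature can be detected directly. The sketch below builds a tiny dataset in which height appears in both centimetres and inches; the numbers are invented for illustration.

```python
import numpy as np

height_cm = np.array([160.0, 172.0, 181.0, 168.0])
height_in = height_cm / 2.54                     # same information, different unit
weight_kg = np.array([55.0, 80.0, 77.0, 62.0])

X = np.column_stack([height_cm, height_in, weight_kg])
rank = np.linalg.matrix_rank(X)
condition = np.linalg.cond(X.T @ X)              # huge when X^T X is (nearly) singular

print(rank, X.shape[1])                          # rank is 2 although there are 3 columns
print(f"{condition:.2e}")
```

Dropping either height column restores full column rank and makes XᵀX safely invertible.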

10.3 Span and Column Space

The span of a set of vectors is all possible linear combinations of those vectors — the entire region of space you can reach by mixing them. In regression, predictions always lie in the column space of the feature matrix X. If the true labels cannot be expressed as a linear combination of the features, the regression can only find the closest approximation — the projection onto the column space.

“If you only know how to walk north and east, you can never reach south-west. Your possible movement space is limited — and so are your model’s predictions if features don’t span the necessary directions.”

— Sayan Chowdhury, Towards AI
11
Core Concept Eleven
Orthogonality & QR Decomposition

Orthogonal vectors are the ideal building blocks for numerical computation — they are independent, non-redundant, and simplify calculations dramatically. Most of the numerical stability in modern ML algorithms comes from ensuring computations happen in orthogonal bases.

11.1 Orthogonality

Two vectors are orthogonal if their dot product is zero: a·b = 0. Geometrically, they meet at a 90° angle. Orthogonal vectors don’t interfere with each other — they act as clean, independent directions. A set of mutually orthogonal unit vectors is called an orthonormal basis, and working in such a basis makes projections, inversions, and distance calculations computationally efficient and numerically stable.

11.2 Gram-Schmidt Process

The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal set spanning the same space. It works by iteratively subtracting the projection of each new vector onto all previously processed vectors, leaving only the component orthogonal to all of them, then normalising. It is the foundation for QR decomposition.

11.3 QR Decomposition

QR decomposition factorises a matrix A into A = QR, where Q is orthogonal (Qᵀ = Q⁻¹) and R is upper triangular. It is used for solving linear systems and least squares problems in a numerically stable way, much more stable than computing (XᵀX)⁻¹ directly. Orthogonal factorisations of this kind sit underneath much of numerical linear algebra: many least squares solvers are built on QR, and the standard dense eigenvalue routines in LAPACK, which NumPy and SciPy call, are based on the QR algorithm.
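To connect Gram-Schmidt, QR and least squares, here is a teaching sketch. The `gram_schmidt` helper is our own and uses the "modified" ordering (subtracting projections from the running vector); production code would call `np.linalg.qr`, which relies on more robust Householder-based routines.

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalise the columns of A (assumed linearly independent)."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)              # astype returns a copy, so A is untouched
        for i in range(j):
            v -= (Q[:, i] @ v) * Q[:, i]       # remove the component along earlier directions
        Q[:, j] = v / np.linalg.norm(v)        # normalise what is left
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # synthetic features
y = rng.normal(size=20)

Q = gram_schmidt(X)
R = Q.T @ X                                    # upper triangular up to rounding, so X = Q R

# Least squares via QR: solve R w = Q^T y instead of inverting X^T X
w_qr = np.linalg.solve(R, Q.T @ y)
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_qr, w_ref))
```

Here Q is the "thin" 20×3 factor with orthonormal columns rather than a full square orthogonal matrix, which is all that least squares needs.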

🔢 Why Orthonormal Bases Matter for Numerical Stability

When feature columns are nearly linearly dependent (collinear), the matrix XᵀX becomes nearly singular, and small measurement errors cause wildly unstable coefficient estimates. Working with orthonormal bases (as QR decomposition creates) avoids forming XᵀX, whose condition number is the square of X's, so far less precision is lost. Principal components are likewise mutually orthogonal, because they are eigenvectors of a symmetric covariance matrix, which makes computations with them stable and the components uncorrelated.

12
Core Concept Twelve
Hessians & Second-Order Curvature

The gradient tells you which direction to descend. The Hessian tells you the shape of the terrain you are descending — flat, steep, curved, or saddle-shaped. Second-order methods use this curvature information to take more intelligent optimisation steps.

12.1 The Hessian Matrix

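The Hessian H of a scalar function L(w) is the matrix of second derivatives: H[i,j] = ∂²L/∂wᵢ∂wⱼ. It captures the curvature of the loss surface in every direction. For a model with n parameters, the Hessian is an n×n matrix. For a model at GPT-4's scale (reported, though not officially confirmed, at ~1.8 trillion parameters), the Hessian would have (1.8×10¹²)² ≈ 3.2×10²⁴ entries, so storing it is utterly infeasible. This is why practical methods avoid the full Hessian: K-FAC approximates curvature with structured factors, and Adam, a first-order method, rescales each parameter's step using running gradient statistics.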

Hessian Eigenvalues & Loss Surface Shape

At a critical point (where the gradient is zero), the signs of the Hessian eigenvalues determine the local shape:

  • All eigenvalues > 0 → positive definite → local minimum (convex bowl)
  • All eigenvalues < 0 → negative definite → local maximum
  • Mixed signs → indefinite → saddle point (not a minimum!)
  • Some eigenvalues = 0 → semi-definite → flat directions (slow convergence; the second-derivative test is inconclusive)
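The eigenvalue test can be written directly in NumPy. The helper name `classify_critical_point` and the tolerance are assumptions for illustration, and the three Hessians are those of simple quadratic functions.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)   # eigvalsh assumes a symmetric matrix
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "degenerate (zero curvature in some direction)"

H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # L = x^2 + y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # L = x^2 - y^2
H_flat = np.array([[2.0, 0.0], [0.0, 0.0]])     # L = x^2 (flat along y)

labels = [classify_critical_point(H) for H in (H_bowl, H_saddle, H_flat)]
```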

12.2 Saddle Points in Deep Learning

A critical insight for deep learning: in high-dimensional loss surfaces, the critical points where training appears “stuck” are far more often saddle points than poor local minima. A saddle point has some positive and some negative Hessian eigenvalues: the loss surface curves up in some directions and down in others. SGD and Adam tend to escape saddle points because the random noise in mini-batch gradients perturbs the path away from them, although progress near a saddle can still be slow. This is one commonly cited reason why stochastic methods often outperform full-batch methods in deep learning.
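A toy example makes the role of noise visible. The sketch below runs gradient descent on L(x, y) = x² − y², which has a saddle at the origin, starting exactly on the axis that leads into the saddle. The function names, step size and noise scale are assumptions, and a two-parameter toy says nothing rigorous about real networks.

```python
import numpy as np

def grad(p):
    # Gradient of L(x, y) = x^2 - y^2, which has a saddle point at (0, 0).
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def descend(noise_scale, steps=60, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array([1.0, 0.0])          # start on the x-axis, where y stays 0 without noise
    for _ in range(steps):
        g = grad(p) + noise_scale * rng.normal(size=2)   # noisy gradient estimate
        p = p - lr * g
    return p

p_plain = descend(noise_scale=0.0)    # converges to the saddle: y remains exactly 0
p_noisy = descend(noise_scale=1e-3)   # noise seeds a y-component that then grows
```

Without noise the iterate slides into the saddle and stays there; with a tiny perturbation the component along the downhill direction is amplified at every step.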

📉 Vanishing and Exploding Gradients

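Two related quantities govern these failure modes. The largest eigenvalue of the Hessian measures the sharpest curvature of the loss: if the learning rate is too large relative to it, gradient steps overshoot and training diverges. Separately, backpropagation multiplies the gradient by one weight matrix (and one activation derivative) per layer, so the singular values of those matrices control how the gradient norm changes with depth. If they are consistently larger than 1, gradients grow exponentially (exploding gradients). If they are consistently smaller than 1, gradients vanish and learning stalls. This is why careful weight initialisation (Xavier, He) and batch normalisation are essential techniques, since they keep the scale of activations and gradients roughly constant across layers, and why gradient clipping (limiting gradient norm) is used as a safeguard against occasional very large gradients.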
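The depth effect is easy to reproduce in a simplified setting. The sketch below assumes a stack of 50 identical linear layers with no activation functions, whose weight matrix is a scaled orthogonal matrix so that every singular value equals `scale`. It also includes a minimal norm-clipping helper. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix
g0 = rng.normal(size=n)
g0 /= np.linalg.norm(g0)                       # unit-norm upstream gradient

def gradient_norm_after(scale, depth=50):
    W = scale * Q              # every singular value of W equals `scale`
    g = g0.copy()
    for _ in range(depth):
        g = W.T @ g            # backprop through one linear layer
    return np.linalg.norm(g)   # equals scale ** depth for this W

def clip_by_norm(g, max_norm):
    """Rescale g so its norm does not exceed max_norm."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

exploding = gradient_norm_after(1.1)   # grows like 1.1 ** 50
vanishing = gradient_norm_after(0.9)   # shrinks like 0.9 ** 50
stable = gradient_norm_after(1.0)      # norm preserved
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
```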

13
Core Concept Thirteen
Tensors in Deep Learning

In deep learning frameworks, everything is a tensor — inputs, outputs, weights, activations, gradients. Understanding tensor operations is the practical skill that bridges linear algebra theory to actual model implementation in PyTorch or TensorFlow.

13.1 Tensor Shapes in Practice

| Tensor Shape | Meaning | Example |
| --- | --- | --- |
| (32,) | 1D: vector of 32 values | Batch of 32 loss values |
| (784,) | 1D: flattened 28×28 image | MNIST digit as vector |
| (100, 50) | 2D: matrix | Weight matrix: 100 inputs → 50 outputs |
| (32, 28, 28, 3) | 4D: batch of colour images | 32 RGB images of 28×28 pixels (channels-last layout) |
| (768, 768) | 2D: transformer weight | Attention projection matrix in BERT-base (hidden size 768) |
| (32, 512, 768) | 3D: batch of sequences | 32 sequences, 512 tokens, 768-dim embeddings |

13.2 Essential Tensor Operations

  • Reshape/View: Change tensor dimensions without changing data. Used to flatten images before fully connected layers.
  • Transpose/Permute: Reorder dimensions. Essential for multi-head attention: (batch, seq, heads, head_dim) → (batch, heads, seq, head_dim).
  • Broadcasting: Automatically expands tensors of different shapes for element-wise operations — adds a (1,10) bias to a (32,10) batch.
  • Matrix Multiply (matmul): The core operation of every linear layer and attention head.
  • Concatenation: Joins tensors along a dimension — used in skip connections (ResNet) and feature fusion.
  • Einsum: Einstein summation notation — expresses any tensor contraction in one line, used for batched attention computation.
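
The operations listed above map onto NumPy calls as sketched below (PyTorch and TensorFlow offer close equivalents). The shapes are small, made-up examples chosen only to keep the arrays tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reshape: flatten a batch of 28x28 RGB images into vectors.
images = np.zeros((32, 28, 28, 3))
flat = images.reshape(32, -1)                       # (32, 2352)

# Transpose/permute: (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim).
x = rng.normal(size=(2, 5, 4, 8))
x_heads_first = x.transpose(0, 2, 1, 3)             # (2, 4, 5, 8)

# Broadcasting: a (1, 10) bias is added to every row of a (32, 10) batch.
acts = np.ones((32, 10))
bias = np.arange(10.0).reshape(1, 10)
shifted = acts + bias                               # (32, 10)

# Matrix multiply: a linear layer mapping 10 features to 4.
W = rng.normal(size=(10, 4))
out = acts @ W                                      # (32, 4)

# Concatenation: join two feature blocks along the feature axis.
fused = np.concatenate([acts, shifted], axis=1)     # (32, 20)

# Einsum: batched query-key dot products in one line.
q = rng.normal(size=(2, 4, 5, 8))
k = rng.normal(size=(2, 4, 5, 8))
scores = np.einsum("bhqd,bhkd->bhqk", q, k)         # (2, 4, 5, 5)
```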
🤖 Attention Mechanism as Linear Algebra

The transformer attention mechanism is almost entirely linear algebra. Given input X, we compute Q = XW_Q, K = XW_K, V = XW_V (three matrix multiplications). Then attention weights A = softmax(QKᵀ / √d_k), where d_k is the dimension of the key vectors: dot products between queries and keys, scaled, then normalised row by row with softmax (the only non-linear step). Finally, output = AV (weighted sum of values). This entire sequence, the foundation of GPT, BERT, and other transformer-based LLMs, is matrix multiplications and dot products plus a single softmax.
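The formulas above translate almost line for line into NumPy. This is a minimal single-head sketch with random weights and made-up sizes. It omits masking, multiple heads and the output projection that real transformer layers add.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # three matrix multiplications
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))       # (seq, seq) attention weights
    return A @ V, A                           # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 6
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, A = attention(X, W_q, W_k, W_v)
```

Each row of `A` is a probability distribution over the input positions, so every output row is a convex combination of the rows of V.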

14
Real-World Applications
Linear Algebra in AI & ML Models

Every major AI and ML algorithm has linear algebra at its core. This section traces the specific operations that power each algorithm — making the abstract mathematics concrete and actionable.

📈

Linear Regression

Data stored as matrix X; weights as vector w. Training solves the normal equation: w = (XᵀX)⁻¹Xᵀy, which requires XᵀX to be invertible. Predictions are: ŷ = Xw. Core operations: matrix multiplication, transpose, matrix inversion (in practice, a linear solve).
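A short sketch of the normal equation on synthetic data follows. It solves the system (XᵀX)w = Xᵀy with `np.linalg.solve` instead of forming the inverse explicitly, which is a common substitution. The data and the intercept column are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
features = rng.uniform(0, 10, size=(n, 2))
X = np.column_stack([np.ones(n), features])   # column of ones models the intercept
w_true = np.array([1.0, 2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)     # noisy targets

# Normal equation: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w                                 # predictions
residual = y - y_hat                          # orthogonal to every column of X
```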

🧠

Neural Networks

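Each layer: output = activation(Wx + b). Backprop: gradient with respect to the layer input = Wᵀ × upstream_gradient; gradient with respect to W = outer product of upstream_gradient and x. Core operations: batched matrix multiply, outer products for weight updates.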
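The forward and backward passes of one layer can be sketched as follows. The example assumes a ReLU activation, a single input vector, and the toy loss "sum of the outputs", so that the upstream gradient is a vector of ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)
x = rng.normal(size=n_in)

def loss(W, b, x):
    # Toy loss: sum of the ReLU outputs of one layer.
    return np.maximum(W @ x + b, 0.0).sum()

# Forward pass
z = W @ x + b                 # pre-activation
a = np.maximum(z, 0.0)        # ReLU activation

# Backward pass
grad_a = np.ones(n_out)       # dL/da for L = sum(a)
grad_z = grad_a * (z > 0)     # ReLU passes gradient only where z > 0
grad_W = np.outer(grad_z, x)  # outer product gives dL/dW, shape (n_out, n_in)
grad_b = grad_z               # dL/db
grad_x = W.T @ grad_z         # W^T times the upstream gradient gives dL/dx
```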

🖼️

Convolutional Networks

Convolution is a matrix operation — the filter is applied via a structured sparse matrix (Toeplitz). Feature maps computed as matrix products. Pooling uses max/mean aggregation.
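The matrix view is easiest to see in one dimension. The sketch below builds the banded (Toeplitz) matrix for a length-3 filter and checks it against `np.correlate`. The helper name `conv_matrix` is an assumption, and the operation shown is cross-correlation without padding, which is what deep learning libraries usually call convolution.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Matrix T with T @ x equal to the 'valid' cross-correlation of a length-n x with kernel."""
    k = len(kernel)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        T[i, i:i + k] = kernel    # each row is the kernel shifted one step right
    return T

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])     # simple difference filter

T = conv_matrix(kernel, len(x))         # shape (3, 5), mostly zeros
via_matrix = T @ x
via_numpy = np.correlate(x, kernel, mode="valid")
```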

🔍

Support Vector Machines

Finds the optimal separating hyperplane via dot products between support vectors. Kernel trick uses inner product functions in feature space. Core operation: dot product.

🎬

Recommendation Systems

Matrix factorisation (SVD) decomposes user-item ratings into latent factors. Predictions: r̂ᵤᵢ = uᵤ · vᵢ (dot product of user and item embeddings). Used by Netflix, Amazon.
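A truncated SVD on a tiny ratings matrix illustrates the idea. The ratings are invented, and the matrix is assumed fully observed. Production recommenders have to handle missing ratings, which plain SVD does not, so this is only a sketch of the linear algebra.

```python
import numpy as np

# Rows are users, columns are items (made-up ratings with two taste groups).
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                             # number of latent factors kept
user_factors = U[:, :k] * np.sqrt(s[:k])          # one k-dim embedding per user
item_factors = Vt[:k, :].T * np.sqrt(s[:k])       # one k-dim embedding per item

R_hat = user_factors @ item_factors.T             # rank-k reconstruction
pred_u0_i2 = user_factors[0] @ item_factors[2]    # predicted rating: a dot product
```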

🗣️

NLP & Word Embeddings

Words mapped to vectors (Word2Vec, GloVe, BERT embeddings). Semantic similarity = cosine similarity = normalised dot product. SVD on co-occurrence matrix gives LSA.
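Cosine similarity is a one-line computation. The three-dimensional "embeddings" below are invented for illustration. Real word vectors have hundreds of dimensions and come from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the norms.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, not taken from any real model.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.1])
apple = np.array([0.1, 0.0, 0.9])

sim_related = cosine_similarity(king, queen)     # close to 1
sim_unrelated = cosine_similarity(king, apple)   # much smaller
```

Because of the normalisation, the score depends only on direction: scaling either vector leaves it unchanged.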

Linear Algebra Across Industries

🏥

Healthcare & Genomics

PCA analyses gene expression data from thousands of genes. MRI reconstruction uses SVD. Drug molecule generation uses VAE latent space exploration.

💰

Finance & Trading

Risk modelling with covariance matrices. Portfolio optimisation solves linear systems. Fraud detection uses high-dimensional feature vectors and SVMs.

🚗

Autonomous Vehicles

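Computer vision uses CNNs (matrix ops). LIDAR point clouds are stored as N×3 arrays of 3D coordinates. Sensor fusion applies Kalman filters (matrix multiplications and inversions). Path planning can be formulated as an optimisation problem, in some cases a linear programme.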

15
Implementation
Python Tools & Libraries

ML practitioners don’t implement linear algebra from scratch. A rich ecosystem of optimised Python libraries handles all the computation — letting you focus on models and insights. Knowing which function does what is essential practical knowledge.

🔢
NumPy

The foundational numerical computing library. Provides arrays, matrices, and the complete numpy.linalg sub-module for linear algebra operations.

Foundation
🔬
SciPy

Extends NumPy with more specialised algorithms: sparse matrix operations, iterative solvers, and advanced decompositions via scipy.linalg.

Scientific
🤖
PyTorch

Deep learning framework where all computations are tensor operations. Automatic differentiation (autograd) computes gradients automatically for any graph built from differentiable tensor operations.

Deep Learning
🧪
scikit-learn

ML library implementing PCA, SVD, linear regression, and SVMs — all using optimised linear algebra under the hood via LAPACK and BLAS.

ML

Essential NumPy Functions for Linear Algebra

| Function | Operation | ML Use Case |
| --- | --- | --- |
| np.dot(A, B) | Matrix/vector dot product | Layer forward pass, attention scores |
| np.matmul(A, B) or A @ B | Matrix multiplication | Neural network layers |
| A.T or np.transpose(A) | Transpose | Normal equation, backprop |
| np.linalg.inv(A) | Matrix inverse | Normal equation solution |
| np.linalg.eig(A) | Eigenvalues & eigenvectors | PCA, spectral methods |
| np.linalg.svd(A) | Singular Value Decomposition | Dimensionality reduction, recommender |
| np.linalg.norm(x) | Vector/matrix norm | Regularisation, gradient clipping |
| np.linalg.det(A) | Determinant | Check invertibility |
| np.linalg.solve(A, b) | Solve linear system Ax=b | Linear regression, systems of equations |
| np.linalg.matrix_rank(A) | Matrix rank | Check data quality, linear independence |
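
A few of these functions in action on a small symmetric matrix (values chosen arbitrarily). Note that the rank function is spelled `np.linalg.matrix_rank`, and that `np.linalg.eigh` is the variant of `eig` intended for symmetric matrices.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)             # solves Ax = b directly
x_via_inv = np.linalg.inv(A) @ b      # same answer, but solve is usually preferred

eigvals, eigvecs = np.linalg.eigh(A)  # ascending eigenvalues of a symmetric matrix
U, s, Vt = np.linalg.svd(A)           # singular values in descending order

rank = np.linalg.matrix_rank(A)       # 2: the columns are linearly independent
det = np.linalg.det(A)                # 4*3 - 1*1 = 11
norm_b = np.linalg.norm(b)            # Euclidean length, sqrt(5)
```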
💻 GPU Acceleration

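PyTorch and TensorFlow can execute tensor operations on NVIDIA GPUs through CUDA. TensorFlow places operations on an available GPU by default, while PyTorch expects you to move tensors and models to the device explicitly (for example with .to("cuda")). The key insight: the operations discussed in this document, such as matrix multiplication, SVD and gradient computation, are highly parallel, so for large matrices a GPU often runs them one to two orders of magnitude faster than a CPU, with the exact speed-up depending on hardware, matrix size and numeric precision. Modern AI training is feasible largely because GPUs are well suited to the dense matrix operations that are the core of linear algebra.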

16
Learning Path
Your Linear Algebra Learning Roadmap

You don’t need to master all of linear algebra at once. A structured progression — from arithmetic foundations through advanced decompositions — builds intuition before rigor, and ensures every concept is reinforced by practical application.

Phase 1

Arithmetic Foundations (Week 1–2)

Real numbers, angles, trigonometry, Pythagorean theorem, Cartesian coordinate system, Euclidean distance, norm vs distance distinction. Goal: comfort with geometric intuition in 2D and 3D space.

Phase 2

Vector Fundamentals (Week 3–4)

Vector addition, scalar multiplication, dot product, cross product, unit vectors, linear combinations, linear independence, span, basis, and dimension. Code: implement all operations in NumPy.

Phase 3

Matrix Operations (Week 5–7)

Matrix arithmetic, transpose, inverse, determinant, rank, null space, column space, Gaussian elimination, REF/RREF. Implement linear regression with the normal equation from scratch.

Phase 4

Linear Transformations (Week 8–9)

Linear maps, their matrix representations, geometric interpretations (rotation, scaling, reflection, projection). Gram-Schmidt, orthogonality, QR decomposition. Code: image transformations.

Phase 5

Advanced Decompositions (Week 10–12)

Eigenvalues, eigenvectors, eigendecomposition, SVD. Implement PCA from scratch. Build a basic recommender system using SVD. Understand how SVD relates to PCA.

Phase 6

ML Integration (Week 13–16)

Gradient vectors, Hessian matrices, optimisation landscapes. Implement gradient descent, understand backpropagation as matrix operations. Build a neural network layer from scratch using NumPy.

📚 Recommended Resources

  • Visual Learning: 3Blue1Brown’s “Essence of Linear Algebra” on YouTube is a widely recommended visual introduction: 16 short videos that build geometric intuition from scratch.
  • Theory: Gilbert Strang’s MIT OCW Linear Algebra (18.06) is the gold standard academic course, available free.
  • ML-Focused: The “Mathematics for Machine Learning” textbook (Deisenroth et al.) is freely available and specifically targets ML applications.
  • Practice: freeCodeCamp’s Linear Algebra course includes hands-on Python coding throughout.

Self-Assessment Checklist

  • Can you explain what a dot product measures geometrically and name 3 ML uses?
  • Can you implement linear regression from scratch using matrix operations in NumPy?
  • Can you explain what eigenvalues represent and how PCA uses them?
  • Can you describe SVD and name 3 applications beyond PCA?
  • Can you explain why gradient descent works using the concept of gradient vectors?
  • Can you explain what linear independence means and why it matters for regression?
  • Can you explain how a transformer’s attention mechanism uses dot products?
  • Can you identify the shapes of weight tensors in a simple 3-layer neural network?

Sources & References

01
IBM Think — Linear Algebra for Machine Learning

Comprehensive coverage of scalars, vectors, matrices, tensors, eigenvalues, PCA, SVD and optimisation from IBM’s AI research team.

02
Medium — Mastering Linear Algebra Part 1

Introduction series by Ebrahim Mousavi covering the role of linear algebra in ML, Python libraries, and model examples.

03
freeCodeCamp — Practical Guide to Linear Algebra in Data Science and AI

Tatev Aslanyan’s structured roadmap covering core concepts, real-world applications, and recommended learning resources.

04
GeoGo Blog — Mathematical Notion of AI: Linear Algebra

Deep dive into the mathematical foundations of AI including learning path and reference material for practitioners.

05
Towards AI — All Linear Algebra Concepts for ML

Sayan Chowdhury’s accessible guide covering 16 key concepts with real-world analogies and ML application examples.

06
Learning Labb — Linear Algebra in ML for Beginners

Beginner-friendly coverage of how linear algebra powers ML algorithms with practical code examples and FAQ.

07
Infosys BPM — Linear Algebra in AI (PDF)

Enterprise perspective on linear algebra applications in AI for business transformation, covering key concepts and industry use cases.

08
UPenn CIS5150 — Linear Algebra I (PDF)

Academic course notes from the University of Pennsylvania covering rigorous mathematical foundations for computer science applications.