Almost six years ago (which means this blog is … old), I wrote what has become one of my favourite blog posts, explaining the linear algebra approach to linear regression (specifically to OLS, or ordinary least squares regression). I was recently rereading it, and I realised that there are some steps I skipped which might not be intuitive for people without a background in linear algebra. As such, in this blog post I will build this algorithm from scratch in Python, showing you step-by-step how the algorithm works. To really show you exactly what's going on under the hood, I'll be building all methods from scratch rather than using the corresponding Numpy operations.
What is OLS regression?
If you've used scikit-learn's LinearRegression or statsmodels' OLS function, you already know what regression does: it finds the best-fitting line (or hyperplane) through your data to predict a continuous target variable from one or more features. When you call .fit(), the model is calculating a set of coefficients that define this line. But what's actually happening inside that .fit() method?
OLS regression solves this problem by finding coefficients that minimize the sum of squared residuals, that is, the squared distances between each actual data point and the predicted value on the line (hence the name). The resulting linear equation takes the familiar form:
$$\hat{y} = b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n$$where:
- $\hat{y}$ is the predicted value for the target;
- $X_1$ to $X_n$ are your features;
- $b_0$ is the intercept (where the line crosses the y-axis when all features are zero);
- $b_1$ to $b_n$ are the beta-coefficients that tell you how much the prediction changes for each one-unit increase in each feature.
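To make "sum of squared residuals" concrete, here's a minimal sketch in Python (the helper name `sum_squared_residuals` is mine, the first candidate line is made up for comparison, and the second uses the coefficients we'll derive later in this post):

```python
import numpy as np

# Tiny illustrative dataset: one feature, three observations
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 7.0])

def sum_squared_residuals(b0: float, b1: float) -> float:
    """Sum of squared distances between actual and predicted values."""
    y_hat = b0 + b1 * x
    return float(np.sum((y - y_hat) ** 2))

# OLS picks the coefficients that make this quantity as small as possible
print(sum_squared_residuals(0.0, 1.0))   # a poor candidate line: 36.0
print(sum_squared_residuals(4/3, 2.0))   # the OLS solution: 2/3
```

Any other pair of coefficients you try will give a larger value than the second call, which is exactly what "least squares" means.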
The challenge that OLS solves is: how do we calculate these beta-coefficients? As I described in that post I referenced above, there's an elegant mathematical solution using linear algebra, and in this tutorial, I'll walk you through implementing it step-by-step in pure Python.
To keep things simple and easy to follow, I'll focus on a regression problem with just one feature. While production implementations (like those in scikit-learn) generalize to any number of features, working through the single-feature case will make each step much clearer.
Let's get started!
Defining a tiny dataset for OLS regression
In the last blog post, we imagined we were working with a dataset which has one feature with the values $1, 2, 3$, and a target with the values $3, 6, 7$. We'll get started by assigning each of these to Numpy arrays, with $X$ being the feature array and $y$ being the target array.
import numpy as np
X = np.array([[1, 1], [1, 2], [1, 3]])
X
array([[1, 1],
[1, 2],
[1, 3]])
y = np.array([[3], [6], [7]])
y
array([[3],
[6],
[7]])
You might notice something odd about $X$: it has two columns, even though we only have one feature! The first column contains only 1's, and it represents the intercept term. A feature matrix augmented with this column is called the design matrix.
This might seem strange, but it's a mathematical requirement: by including this column of 1's, we can calculate both the intercept $b_0$ and the feature coefficient $b_1$ using the same matrix equation. Think of it this way: when we multiply this column of 1's by $b_0$, we get $1 \times b_0 = b_0$, which is exactly the intercept term in our equation.
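In practice you wouldn't type the column of 1's by hand; here's a quick sketch of building the design matrix from a raw feature vector using Numpy (one of several equivalent ways to do it):

```python
import numpy as np

feature = np.array([1, 2, 3])            # raw feature values
ones = np.ones_like(feature)             # intercept column of 1's
X = np.column_stack([ones, feature])     # design matrix: one [1, x] row per observation
print(X)
```

This produces the same 3 × 2 array we defined manually above.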
We now have two arrays. $X$ is a 3 × 2 array, with 3 rows and 2 columns. $y$ is a 3 × 1 array, with 3 rows and 1 column. In linear algebra terminology, $X$ is called a matrix (essentially a table of numbers), and $y$ is called a vector (a single column or row of numbers). Don't worry if you're not familiar with these terms, as we'll be working through the operations using them step-by-step. (If you want to dive deeper into matrices and vectors, you can check out this post, but it's not necessary to follow along.)
We can see the dimensions if we call the shape attribute on each of them.
X.shape
(3, 2)
y.shape
(3, 1)
As I explained in my previous blog post on OLS regression, we're essentially trying to solve the equation $X^TXb = X^Ty$ for $b$, which are our beta coefficients. This is called the normal equation, and it's the standard linear algebra expression for the solution to OLS regression. (If you're curious about how we derive this equation, I really recommend reading the post I mentioned at the start of this post, as it goes through the maths step-by-step.)
Breaking down the normal equation, we can see that it has three elements (besides $X$ itself, which we've already defined):
- $X^T$: this is the transpose of the array $X$, which essentially means it's an array that contains the same values as $X$ but with the rows and columns reversed. In the case of $X^T$, this means that it will contain 2 rows and 3 columns. If you want to read more about matrix transposition, you can check out this post.
- $y$: this is simply our target array, which we've already created.
- $b$: this is the coefficient array, containing the beta coefficients for the intercept ($b_0$) and the feature ($b_1$). This is what we're trying to find by solving this equation.
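As a preview of where we're headed, here's the normal equation solved in one line with Numpy's built-in solver; the rest of this post reproduces this answer by hand:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([[3], [6], [7]])

# Solve (X^T X) b = X^T y for b directly, as a reference answer
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # intercept b_0 and slope b_1
```

Keep these two numbers in mind: every step below is working towards them.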
Step 1: Transpose the design matrix ($X^T$)
So, first things first, we need to get $X^T$ by transposing $X$. Let's start by creating an empty array $X^T$ which has the same number of columns as $X$ has rows, and the same number of rows as $X$ has columns. In other words, as stated above, as $X$ has 3 rows and 2 columns, $X^T$ will have 2 rows and 3 columns.
n_rows = X.shape[0]
n_cols = X.shape[1]
X_T = np.empty([n_cols, n_rows])
X_T.shape
(2, 3)
After this, we're going to create a nested loop which iterates over each row, then each column, in $X$, and reverses (transposes) the position of each element. For example, the value 3 is at position $(2, 1)$ in $X$, so the transposing loop would place it at position $(1, 2)$ in $X^T$.
np.where(X == 3)
(array([2]), array([1]))
Let's create $X^T$ using this loop now:
for i in range(0, n_rows):
for j in range(0, n_cols):
X_T[j][i] = X[i][j]
We can see that $X^T$ is indeed the transpose of $X$ by checking it below.
X_T
array([[1., 1., 1.],
[1., 2., 3.]])
And that 3 is now in position $(1, 2)$:
np.where(X_T == 3)
(array([1]), array([2]))
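As a sanity check, our hand-rolled transpose should agree with Numpy's built-in `X.T`, the operation we're deliberately avoiding (this is a self-contained version of the loop above):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])

# Hand-rolled transpose: swap the row and column index of every element
X_T = np.empty([X.shape[1], X.shape[0]])
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        X_T[j][i] = X[i][j]

print(np.array_equal(X_T, X.T))  # → True
```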
Step 2: Compute $X^TX$ and $X^Ty$ (normal equation components)
Now that we have $X^T$, we need to calculate $X^TX$. This is going to involve multiplying the two matrices $X^T$ and $X$. You can read more about matrix multiplication in this post, but at a high level, what we're trying to do is to match up each row of $X^T$ with its corresponding column in $X$, and multiply and add their elements.
Let's see what this would look like concretely. As we saw above, $X^T$ has 2 rows and 3 columns, and $X$ has 3 rows and 2 columns. So first, we'd take the dot product of the first row of $X^T$ and the first column of $X$. In this notation, $X^T_{11}$ refers to the first element of the first row of $X^T$, $X^T_{12}$ refers to the second element, and so on. Let's see what this would look like:
$$\begin{align} &X^T_{11}X_{11} + X^T_{12}X_{21} + X^T_{13}X_{31} \\ &= 1 \times 1 + 1 \times 1 + 1 \times 1 \\ &= 3 \end{align}$$We then need to repeat this for the first row of $X^T$ and the second column of $X$, the second row of $X^T$ and the first column of $X$, and finally, the second row of $X^T$ and the second column of $X$.
Now, let's think about how we might implement this in Python. The most obvious way to do this is by looping over pairs of elements, multiplying them, and adding each multiplication. To get started, we can first create a loop that indexes each row of $X^T$:
for i in range(0, X_T.shape[0]):
print(f"X_T row: {i}")
X_T row: 0
X_T row: 1
We can then nest another loop inside that also indexes each column of $X$:
for i in range(0, X_T.shape[0]): # Rows
for j in range(0, X.shape[1]): # Columns
print(f"X_T row: {i}, X column: {j}")
X_T row: 0, X column: 0
X_T row: 0, X column: 1
X_T row: 1, X column: 0
X_T row: 1, X column: 1
We must now go through every item in these rows and columns. This can be achieved by placing a third loop within the existing two, with its range set to the number of columns in $X^T$. (Alternatively, we could have used the number of rows in $X$, as they are identical.) To access each element, we index each matrix using these loops.
for i in range(0, X_T.shape[0]): # Rows
for j in range(0, X.shape[1]): # Columns
print(f"\nX_T row: {i}, X column: {j}")
for k in range(0, X_T.shape[1]): # Elements
print(f"X_T element: {X_T[i][k]}, X element: {X[k][j]}")
X_T row: 0, X column: 0
X_T element: 1.0, X element: 1
X_T element: 1.0, X element: 1
X_T element: 1.0, X element: 1
X_T row: 0, X column: 1
X_T element: 1.0, X element: 1
X_T element: 1.0, X element: 2
X_T element: 1.0, X element: 3
X_T row: 1, X column: 0
X_T element: 1.0, X element: 1
X_T element: 2.0, X element: 1
X_T element: 3.0, X element: 1
X_T row: 1, X column: 1
X_T element: 1.0, X element: 1
X_T element: 2.0, X element: 2
X_T element: 3.0, X element: 3
Now that we have a (kind of complex!) nested for loop which accesses all the correct elements, we need to add the code to get each dot product.
for i in range(0, X_T.shape[0]): # Rows
for j in range(0, X.shape[1]): # Columns
print(f"\nX_T row: {i}, X column: {j}")
dot_product = 0
for k in range(0, X_T.shape[1]): # Elements
dot_product += X_T[i][k] * X[k][j]
print(dot_product)
X_T row: 0, X column: 0
3.0
X_T row: 0, X column: 1
6.0
X_T row: 1, X column: 0
6.0
X_T row: 1, X column: 1
14.0
Now finally, we assign this to a new matrix which has 2 rows and 2 columns:
X_T_X = np.empty([2, 2])
for i in range(0, X_T.shape[0]): # Rows
for j in range(0, X.shape[1]): # Columns
dot_product = 0
for k in range(0, X_T.shape[1]): # Elements
dot_product += X_T[i][k] * X[k][j]
X_T_X[i][j] = dot_product
X_T_X
array([[ 3., 6.],
[ 6., 14.]])
You might have noticed that the size of this new matrix is the same as the number of rows in $X^T$ and the number of columns in $X$. This is always the case for matrix multiplication: the resultant matrix has the number of rows from the first matrix and the number of columns from the second. This is why we needed $X^T$ to have shape $(2, 3)$ and $X$ to have shape $(3, 2)$: the inner dimensions (both 3) must match for multiplication to work.
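This shape rule generalises: an $(m, n)$ matrix times an $(n, p)$ matrix gives an $(m, p)$ matrix. A quick illustrative sketch with dummy matrices:

```python
import numpy as np

A = np.ones((2, 3))   # shape (2, 3)
B = np.ones((3, 4))   # shape (3, 4): the inner dimensions (both 3) match
C = A @ B             # multiplication is allowed, and the inner dimensions disappear
print(C.shape)        # the outer dimensions survive: (2, 4)
```

Trying `B @ B` instead would raise an error, because the inner dimensions (4 and 3) don't match.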
As we're going to do a few matrix multiplications, let's turn this code into a function.
def matrix_multiplication(matrix_1: np.ndarray,
matrix_2: np.ndarray
) -> np.ndarray:
output_array = np.empty([matrix_1.shape[0], matrix_2.shape[1]])
for i in range(0, matrix_1.shape[0]): # Rows
for j in range(0, matrix_2.shape[1]): # Columns
dot_product = 0
for k in range(0, matrix_1.shape[1]): # Elements
dot_product += matrix_1[i][k] * matrix_2[k][j]
output_array[i][j] = dot_product
return output_array
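Before relying on it, we can check the function against Numpy's matrix-multiplication operator `@` (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

def matrix_multiplication(matrix_1: np.ndarray,
                          matrix_2: np.ndarray) -> np.ndarray:
    # Same triple-loop implementation as defined above
    output_array = np.empty([matrix_1.shape[0], matrix_2.shape[1]])
    for i in range(matrix_1.shape[0]):          # rows of the first matrix
        for j in range(matrix_2.shape[1]):      # columns of the second
            dot_product = 0
            for k in range(matrix_1.shape[1]):  # paired elements
                dot_product += matrix_1[i][k] * matrix_2[k][j]
            output_array[i][j] = dot_product
    return output_array

X = np.array([[1, 1], [1, 2], [1, 3]])
X_T = X.T.astype(float)

print(np.array_equal(matrix_multiplication(X_T, X), X_T @ X))  # → True
```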
With that done, we can put our new function to work immediately. Having calculated $X^TX$ from the left-hand side of our equation, we can also calculate $X^Ty$ from the right-hand side. Let's go ahead and do this:
X_T_y = matrix_multiplication(X_T, y)
X_T_y
array([[16.],
[36.]])
Step 3: Solve $X^TXb = X^Ty$ with $LU$ decomposition
Now, we've gone through a lot of steps, but let's pause and remember what we're trying to accomplish. We started with the equation $X^TXb = X^Ty$, and we've now calculated both $X^TX$ and $X^Ty$. The only unknown left is our coefficients, $b$. So how do we extract $b$ from this equation?
If we represent this equation with the actual values we've calculated for $X^TX$ and $X^Ty$, it looks like this:
$$\underbrace{\begin{bmatrix}3 & 6 \\6 & 14\end{bmatrix}}_{X^TX}\;\underbrace{\begin{bmatrix}b_0 \\b_1\end{bmatrix}}_{\mathbf{b}}\;=\;\underbrace{\begin{bmatrix}16 \\36\end{bmatrix}}_{\mathbf{X^Ty}}$$This matrix equation is actually a compact way of writing two regular equations. When we multiply the matrix by the vector, each row of the matrix produces one equation. We want to solve both of these for $b_0$ and $b_1$:
$$\begin{align} 3b_0 + 6b_1 &= 16 \\ 6b_0 + 14b_1 &= 36 \end{align}$$There are a few ways of doing this (we could actually do it directly as an algebra problem!), but let's use a more generalisable method called $LU$ decomposition, which breaks $X^TX$ down into two simpler matrices.
Step 3a: Applying $LU$ decomposition to $X^TX$
The reasons for this decomposition are a bit beyond this blog post (but if you are interested in reading more, this article explains $LU$ decomposition very nicely). However, the key idea is that solving $X^TXb = X^Ty$ directly would require matrix inversion, which is computationally expensive and can be numerically unstable. Instead, $LU$ decomposition breaks down $X^TX$ into two simpler matrices ($L$ and $U$) that transform our problem into two easier equations that can be solved with simple forward and backward substitution. This approach is both faster and more numerically stable, and it's the kind of factorisation that numerical libraries rely on under the hood (Numpy's np.linalg.solve, for example, is built on an $LU$ factorisation with pivoting).
First things first, let's break down $X^TX$ into $L$ and $U$:
$$\begin{align} L &= \begin{bmatrix}1 & 0\\\ell_{21} & 1\end{bmatrix} \\ U &= \begin{bmatrix}u_{11} & u_{12}\\0 & u_{22}\end{bmatrix} \end{align}$$The structure of these matrices - with 1's on the diagonal and 0's in specific positions - isn't arbitrary. The $L$ matrix (lower triangular) has 1's on its diagonal and 0's in the upper right, while the $U$ matrix (upper triangular) has 0's in the lower left. This special structure is what makes forward and backward substitution possible.
We can then calculate $LU$ (which is equal to $X^TX$) using matrix multiplication, giving us:
$$ LU = \begin{bmatrix}1 & 0\\\ell_{21} & 1\end{bmatrix} \begin{bmatrix}u_{11} & u_{12}\\0 & u_{22}\end{bmatrix} = \begin{bmatrix}u_{11} & u_{12}\\\ell_{21}u_{11} & \ell_{21}u_{12} + u_{22}\end{bmatrix} $$As you can see, this gives us the four unknowns we need to solve for: $\ell_{21}$, $u_{11}$, $u_{12}$, and $u_{22}$. We can extract these from the values in $X^TX$. $u_{11}$ and $u_{12}$ are easy: we just need to substitute in the values from $X^TX$:
u_11 = X_T_X[0, 0]
u_12 = X_T_X[0, 1]
Following this, we can do a bit of algebra to get $\ell_{21}$ and $u_{22}$.
l_21 = X_T_X[1, 0] / u_11
u_22 = X_T_X[1, 1] - (l_21 * u_12)
Ok, great, we have this new $LU$ matrix, and the values that go into it. But how does this help us solve for $b$? Well, if we substitute these values, $\ell_{21}$, $u_{11}$, $u_{12}$, and $u_{22}$, back into $L$ and $U$ separately, we can use them to solve two separate sets of equations: $Lz = X^Ty$ and $Ub = z$. This is much easier than solving $X^TXb = X^Ty$ directly. As mentioned above, the lower triangular structure of $L$ is what lets us solve for unknowns one at a time from top to bottom (forward substitution), and the upper triangular structure of $U$ lets us solve from bottom to top (backward substitution).
Step 3b: Solving $Lz = X^Ty$ for $z$
Let's start with $Lz = X^Ty$. We first define $L$ and $U$ in Python:
L = np.array([[1, 0], [l_21, 1]])
U = np.array([[u_11, u_12], [0, u_22]])
L
array([[1., 0.],
[2., 1.]])
U
array([[3., 6.],
[0., 2.]])
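A quick check of the factorisation: multiplying $L$ and $U$ back together should reproduce $X^TX$ exactly (the factors are recomputed here so the snippet stands alone):

```python
import numpy as np

X_T_X = np.array([[3.0, 6.0], [6.0, 14.0]])

# Recompute the LU factors as above
u_11 = X_T_X[0, 0]
u_12 = X_T_X[0, 1]
l_21 = X_T_X[1, 0] / u_11
u_22 = X_T_X[1, 1] - (l_21 * u_12)

L = np.array([[1, 0], [l_21, 1]])
U = np.array([[u_11, u_12], [0, u_22]])

print(np.array_equal(L @ U, X_T_X))  # → True
```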
We can now define our new equation:
$$\underbrace{\begin{bmatrix}1 & 0 \\2 & 1\end{bmatrix}}_{L}\;\underbrace{\begin{bmatrix}z_0 \\z_1\end{bmatrix}}_{\mathbf{z}}\;=\;\underbrace{\begin{bmatrix}16 \\36\end{bmatrix}}_{\mathbf{X^Ty}}$$and expand this into two simultaneous equations again:
$$\begin{align} 1z_0 + 0z_1 &= 16 \\ 2z_0 + 1z_1 &= 36 \end{align}$$You can see these are much easier to solve. If we simplify the first equation, we get $z_0$ out the gate, with the equation becoming $z_0 = 16$. If we substitute that into the second equation, we get:
$$\begin{align} 2z_0 + 1z_1 &= 36 \\ 2(16) + z_1 &= 36 \\ z_1 &= 36 - 32 \\ &= 4 \end{align}$$And with that, we've solved for $z$! We can generalise this way of solving for $z$ to any 2 x 2 matrix using the following code:
z_0 = X_T_y[0]
z_1 = X_T_y[1] - L[1, 0] * z_0
z = np.array([z_0, z_1])
z
array([[16.],
[ 4.]])
However, keep in mind that we'd need to do quite a bit more work to generalise this method beyond one feature.
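For reference, here's what that generalisation might look like: forward substitution for any $n \times n$ lower-triangular system (the function name is mine, and this is a sketch rather than a production implementation):

```python
import numpy as np

def forward_substitution(L: np.ndarray, rhs: np.ndarray) -> np.ndarray:
    """Solve L z = rhs for z, where L is lower triangular."""
    n = L.shape[0]
    z = np.zeros(n)
    for i in range(n):
        # Each z[i] depends only on the z's already solved above it
        z[i] = (rhs[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

L = np.array([[1.0, 0.0], [2.0, 1.0]])
print(forward_substitution(L, np.array([16.0, 36.0])))  # z_0 = 16, z_1 = 4
```

The division by `L[i, i]` is a no-op for our $L$ (its diagonal is all 1's), but keeping it makes the function work for any lower-triangular matrix.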
Step 3c: Solving $Ub = z$ for $b$
Ok! Now let's move onto solving the second equation to get $b$, $Ub = z$.
$$\underbrace{\begin{bmatrix}3 & 6 \\0 & 2\end{bmatrix}}_{U}\;\underbrace{\begin{bmatrix}b_0 \\b_1\end{bmatrix}}_{\mathbf{b}}\;=\;\underbrace{\begin{bmatrix}16 \\4\end{bmatrix}}_{\mathbf{z}}$$which again, we can expand into:
$$\begin{align} 3b_0 + 6b_1 &= 16 \\ 0b_0 + 2b_1 &= 4 \end{align}$$We can solve for $b_1$ immediately, by simplifying the second equation:
$$\begin{align} 0b_0 + 2b_1 &= 4 \\ 2b_1 &= 4 \\ b_1 &= 2 \end{align}$$We can then substitute this into the first equation to solve for $b_0$:
$$\begin{align} 3b_0 + 6(2) &= 16 \\ 3b_0 &= 16 - 12 \\ 3b_0 &= 4 \\ b_0 &= \frac{4}{3} \end{align}$$So phew! We have finally solved for $b$! And again, we can write some Python code to generalise this to any 2 x 2 matrix:
b_1 = z[1] / U[1, 1]
b_0 = (z[0] - (U[0, 1] * b_1))/U[0, 0]
b = np.array([b_0, b_1])
b
array([[1.33333333],
[2. ]])
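As with forward substitution, this generalises to any $n \times n$ upper-triangular system; here's a sketch (again, the function name is mine):

```python
import numpy as np

def backward_substitution(U: np.ndarray, rhs: np.ndarray) -> np.ndarray:
    """Solve U b = rhs for b, where U is upper triangular."""
    n = U.shape[0]
    b = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Each b[i] depends only on the b's already solved below it
        b[i] = (rhs[i] - U[i, i+1:] @ b[i+1:]) / U[i, i]
    return b

U = np.array([[3.0, 6.0], [0.0, 2.0]])
b = backward_substitution(U, np.array([16.0, 4.0]))
print(b)  # b_0 = 4/3, b_1 = 2
```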
Step 4: Predict with $\hat{y} = Xb$
Now that we have our coefficients $b$, we can use them for what regression is all about: making predictions! Remember our regression equation from the beginning: $\hat{y} = b_0 + b_1X_1$? In matrix form, this is:
$$ \hat{y} = Xb $$This single matrix multiplication gives us predictions for all of our data points at once. Let's calculate it:
matrix_multiplication(X, b)
array([[3.33333333],
[5.33333333],
[7.33333333]])
And that's the last step! With all of that work, we've finally finished our regression problem and made our predictions.
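To close the loop on the "least squares" name, we can check the residuals of these predictions: they sum to (almost) zero, and their squared sum is the minimum achievable by any line through this data. A quick check, using Numpy operations for brevity:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([[3.0], [6.0], [7.0]])
b = np.array([[4/3], [2.0]])  # the coefficients we just derived

y_hat = X @ b
residuals = y - y_hat
print(residuals.sum())         # ~0: the errors balance out
print(np.sum(residuals ** 2))  # 2/3: the minimised sum of squared residuals
```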
Sanity check: verify with scikit-learn
As a final check, let's see what answer we get using scikit-learn:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
Extracting $b$, we can see that we get the exact same answer as when we calculated it manually.
b = np.array([[reg.intercept_[0]], [reg.coef_[0, 1]]])
b
array([[1.33333333],
[2. ]])
In addition, the values that scikit-learn found for $\hat{y}$ are the same as the ones we found.
reg.predict(X)
array([[3.33333333],
[5.33333333],
[7.33333333]])
Naturally, this was an overly simplified version of OLS regression: we used inefficient implementations of each operation, we limited ourselves to one feature, and we haven't touched regularisation (although if you're keen, you might want to dig into how to do this yourself!).
With that said, I hope this has given you a good intuition of how OLS regression works at each step. Along with my previous post, you hopefully now have the mathematical intuition of how this algorithm works, as well as a concrete understanding of what's being done at each step.
And keep an eye out, as I'm planning on showing how more machine learning algorithms work under the hood in future posts!
