Blog

Using schemas to speed up reading into Spark DataFrames

While Spark is the best thing since sliced bread for dealing with big data, I definitely realise I have a lot to learn before I can use it to its full potential. One trick I recently discovered was using explicit schemas to speed up how fast PySpark …

Posted on August 24, 2020 • 3 minutes read

Reading S3 data into a Spark DataFrame using Sagemaker

I recently finished Jose Portilla's excellent Udemy course on PySpark, and of course I wanted to try out some things I learned in the course. I have been transitioning over to AWS Sagemaker for a lot of my work, but I haven't tried using it with …

Posted on August 10, 2020 • 5 minutes read

Simplifying the normal equation with Gram-Schmidt

In the last post I talked about how to find the coefficients that give us the line of best fit for an OLS regression problem using the normal equation. The core of this approach is the equation: $$ X^TXb = X^Ty $$ The way we solved this in the previous …

Posted on July 27, 2020 • 8 minutes read

Solving OLS regression with linear algebra

When I first learned least-squares linear regression in my undergrad degree, I remember that we approached it in the "calculus" way: taking the sum of the squared differences for each observation and solving a massive (and tedious) equation until we …

Posted on July 13, 2020 • 9 minutes read

Working with matrices: powers and transposition

Part of the series Linear Algebra Basics:
1. Working with matrices: addition, subtraction and multiplication
2. Working with matrices: inversion
3. Working with matrices: powers and transposition

Today, we'll complete our series on basic matrix …

Posted on June 29, 2020 • 5 minutes read