Tag: Pyspark

2 posts with this tag

Using schemas to speed up reading into Spark DataFrames

While Spark is the best thing since sliced bread for dealing with big data, I definitely realise I have a lot to learn before I can use it to its full potential. One trick I recently discovered was using explicit schemas to speed up how fast PySpark …

Posted on August 24, 2020 • 3-minute read

Reading S3 data into a Spark DataFrame using Sagemaker

I recently finished Jose Portilla's excellent Udemy course on PySpark, and naturally I wanted to try out some of the things I learned. I have been transitioning over to AWS SageMaker for a lot of my work, but I haven't tried using it with …

Posted on August 10, 2020 • 5-minute read