Spark Interview Question 5: What is caching in Spark?

Data Cat
3 min read · Mar 5, 2024

Date: March 5th, 2024

Hi everyone! I have started posting about Spark interview questions for SWEs and Data Engineers, focusing mainly on Spark optimization. I aim to write about ten posts on the topic. By the end of the series, you should be well prepared for technical interviews that cover Spark! Although these posts are aimed at technical interview rounds, any Spark user should find the series insightful and helpful for learning.

“Disclaimer: The views and opinions expressed in this blog post are solely my own and do not reflect those of any entity with which I have been, am now, or will be affiliated. This content was written during a period in which I was not affiliated with, and did not belong to, any organization that could influence my perspectives. As such, these are my personal insights, shared without any external bias or influence.”

What is Caching?

If you’re working with DataFrames or Datasets that you reuse several times, consider caching them in memory after applying the initial transformations. Spark evaluates transformations lazily, so without caching it recomputes a DataFrame from its source every time an action runs on it. Caching avoids that repeated work during iterative processing. Once you have cached your DataFrame, later operations on it, such as select, filter, or groupBy, read from the cached data instead of recomputing it.
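To make the idea concrete, here is a toy model in plain Python. It is not Spark code: `LazyDataset`, `compute_calls`, and the other names are hypothetical and exist only for illustration. It mimics just two behaviours, that every action recomputes the data unless caching was requested, and that `cache()` itself does no work until the first action runs.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT a Spark class)."""

    def __init__(self, compute):
        self._compute = compute        # rebuilds the rows from scratch
        self._cache_requested = False
        self._cached_rows = None
        self.compute_calls = 0         # how many times the rows were rebuilt

    def cache(self):
        # Lazy: only records the request, nothing is computed here.
        self._cache_requested = True
        return self

    def unpersist(self):
        # Drop the stored rows and stop caching.
        self._cache_requested = False
        self._cached_rows = None
        return self

    def _materialize(self):
        if self._cached_rows is not None:
            return self._cached_rows   # served from the cache
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached_rows = rows   # the first action fills the cache
        return rows

    # Two "actions" that each need the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


def build_rows():
    return [x * 2 for x in range(5)]


uncached = LazyDataset(build_rows)
uncached.count()
uncached.total()
print(uncached.compute_calls)  # 2: each action rebuilt the rows

cached = LazyDataset(build_rows).cache()
cached.count()
cached.total()
print(cached.compute_calls)    # 1: the second action reused the cache
```

Real Spark adds partitioning, storage levels, and eviction on top of this, but the trade-off is the same: caching only pays off when the same data is used by more than one action.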

How do you use caching with a Spark DataFrame?

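Example:

from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook), otherwise create one
spark = SparkSession.builder.appName("caching-example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Mark the DataFrame for caching; cache() is lazy, so nothing is stored yet
df.cache()
# The first action computes the DataFrame and fills the cache
average_age = df.selectExpr("avg(Age)").collect()[0][0]
print("Average Age:", average_age)
# Optional: Unpersist (remove) the DataFrame from cache once it is no longer needed
df.unpersist()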

Does Spark automatically unpersist() a DataFrame when it is no longer used?

The answer may depend on which platform you use. AWS Glue, at least, is designed to optimize resource usage and automatically handle DataFrame caching and unpersisting. Glue keeps track of the DataFrames and their caching status within a Glue job, and it will automatically unpersist…

Written by Data Cat

Software Engineer in Data Platform
