Spark Interview Question 4: What is partitioning? Coalesce() vs Repartition()

Data Cat
6 min read · Mar 5, 2024

Date: March 5th, 2024

Hi everyone! I have started posting about Spark interview questions for SWEs and Data Engineers, focusing mainly on Spark optimization. I plan to write about ten posts on the topic. After working through the series, you should be ready to ace technical interviews related to Spark! Although these posts are aimed at interview preparation, any Spark user should find the series insightful and helpful for learning.

“Disclaimer: The views and opinions expressed in this blog post are solely my own and do not reflect those of any entity with which I have been, am now, or will be affiliated. This content was written during a period in which I was not affiliated with, and did not belong to, any organization that could influence my perspectives. As such, these are my personal insights, shared without any external bias or influence.”

What is Partitioning?

Partitioning in Spark refers to dividing a large dataset into smaller chunks (called partitions) that can be processed in parallel across the nodes of a Spark cluster. Each partition is processed by a single task, so the number of partitions sets an upper bound on how much of the work can run at the same time. This is essential for distributed computing, because it lets Spark spread an operation over many machines instead of processing the whole dataset in one place.
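
To make the idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use Spark's API: the function names (`split_into_partitions`, `process_partition`, `parallel_sum`) are made up for illustration, and a thread pool stands in for the cluster's workers. It only models the pattern described above: split the data, handle each partition independently, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous, roughly equal chunks."""
    size, remainder = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra record.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Work done independently on one partition: here, a partial sum."""
    return sum(partition)


def parallel_sum(records, num_partitions=4):
    """Sum records by processing every partition in its own worker."""
    partitions = split_into_partitions(records, num_partitions)
    # One worker per partition, loosely mirroring one task per partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(process_partition, partitions))
    # Combine the per-partition results into the final answer.
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1, 101))
    print([len(p) for p in split_into_partitions(data, 4)])
    print(parallel_sum(data))
```

With four partitions, at most four pieces of work can run at once, which is why the partition count matters for parallelism. In a real PySpark job you would not split the data by hand; Spark manages partitions for you, and you can inspect the current count with `df.rdd.getNumPartitions()`.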
