Spark Interview Question: How to Deal with Multi-Tagging Data, Delimiter Strings, and Arrays in a DataFrame
Hi Spark users! I recently navigated a challenging scenario involving data with delimiter strings and arrays in DataFrames. It required a transformation and aggregation process that was both demanding and enlightening. In this post, I’ll share insights and strategies for handling such data effectively.
“Disclaimer: The views and opinions expressed in this blog post are solely my own and do not reflect those of any entity with which I have been, am now, or will be affiliated. This content was written during a period in which the author was not affiliated with, nor belonged to, any organization that could influence their perspective. As such, these are the author’s personal insights, shared without any external bias or influence.”
Background
Why is the use of delimited strings and array data increasing?
Recently I have seen delimiter-string and array use cases increase as people move from data warehouses to data lakes and open table formats such as Delta Lake or Apache Iceberg. Data lakes and open formats do not necessarily enforce a schema on insert; instead, the schema is applied when the data is read, an approach known as schema-on-read. As a result, file formats are becoming more semi-structured or unstructured…
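To make the idea concrete, here is a minimal plain-Python sketch of the split-and-explode pattern this kind of multi-tagging data calls for. The sample rows are invented for illustration; in Spark the same steps map onto the `split` and `explode` functions in `pyspark.sql.functions`.

```python
from collections import Counter

# Hypothetical sample rows: each record carries a pipe-delimited tag string,
# as often seen in semi-structured files landed in a data lake.
rows = [
    ("item1", "spark|sql"),
    ("item2", "sql|streaming"),
    ("item3", "spark"),
]

# Schema-on-read style: the tag column is just a string until we parse it.
# Split each delimited string into an array of tags.
tagged = [(item, tags.split("|")) for item, tags in rows]

# "Explode" each array into one (item, tag) pair per tag, then aggregate.
exploded = [(item, tag) for item, tags in tagged for tag in tags]
tag_counts = Counter(tag for _, tag in exploded)
```

The same three steps — split, explode, aggregate — are what the DataFrame version performs, just distributed across partitions instead of in a single Python list.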