
Spark Interview Question: like() vs rlike()? How Spark filters at the word-token level for NLP

Data Cat
3 min read · Jan 7, 2024


Hi everyone, this is another entry in my Spark series, where I keep journaling my experience and learning. Today's topic is a great feature in Spark: token-level filtering without actually doing a token split.

“Disclaimer: The views and opinions expressed in this blog post are solely my own and do not reflect those of any entity with which I have been, am now, or will be affiliated. This content was written during a period in which the author was not affiliated with, nor belonged to, any organization that could influence their perspective. As such, these are the author's personal insights, shared without any external bias or influence.”

Problem Statement

Suppose you have a dataset and one of its fields is a description. You want to filter for rows whose description contains any of a set of keywords, so you can use those rows for your machine learning training. The description is long and complex.

The straightforward brute-force way is to do a token split first, then, for each word in your token_list, check whether the word is in your keywords. A sketch of that approach is below.
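Here is a minimal PySpark sketch of the brute-force approach. The sample data, column names, and keyword list are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("keyword-filter").getOrCreate()

# Illustrative data: an id plus a long free-text description.
df = spark.createDataFrame(
    [(1, "the model converges quickly on sparse data"),
     (2, "cats are great pets"),
     (3, "gradient descent with momentum")],
    ["id", "description"],
)

keywords = ["gradient", "sparse"]

# Brute force: split the description into tokens, then keep rows whose
# token array overlaps the keyword list.
tokens = F.split(F.lower(F.col("description")), r"\s+")
keyword_array = F.array(*[F.lit(k) for k in keywords])
df.where(F.arrays_overlap(tokens, keyword_array)).show(truncate=False)
```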

There is nothing wrong with this approach, but personally I always found the process frustrating. Is there an alternative to fetch target rows in one…
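Spoiler from the title: the alternative is rlike(). Unlike like(), which only understands the SQL wildcards % and _, rlike() accepts a Java regex, so a word-boundary pattern can filter at the token level without ever materializing a token array. A minimal sketch, assuming the same illustrative data and keywords as above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rlike-filter").getOrCreate()

df = spark.createDataFrame(
    [(1, "the model converges quickly on sparse data"),
     (2, "a sparsely documented API"),
     (3, "gradient descent with momentum")],
    ["id", "description"],
)

keywords = ["gradient", "sparse"]

# rlike() takes a Java regex. \b anchors on word boundaries, so "sparse"
# matches as a whole token while "sparsely" does not -- no split() needed.
pattern = r"\b(" + "|".join(keywords) + r")\b"
df.where(F.lower(F.col("description")).rlike(pattern)).show(truncate=False)
```

By contrast, F.col("description").like("%sparse%") would also match "sparsely", because LIKE patterns have no notion of word boundaries. That gap is the crux of the like() vs rlike() interview question.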



Written by Data Cat

Software Engineer in Data Platform
