scala - Extract substring based on regex to use in RDD.filter
I am trying to filter out rows of a text file whose second column value begins with one of the words in a list.
I have a list such as:
val mylist = List("inter", "intra")
If I have a row like:
cricket inter-house
Since inter is in the list, the row should be filtered out by the rdd.filter
operation. I am using the following regex:
`[a-zA-Z0-9]+`
I tried using """[a-zA-Z0-9]+""".r
to extract the substring, but the result is a non-empty iterator.
My question is: how do I access the above result in the filter operation?
You need to construct a regular expression such as ".* inter.*".r,
since """[a-zA-Z0-9]+"""
just matches any word. Here is a working example; I hope it helps:
val mylist = List("inter", "intra")
val textrdd = sc.parallelize(List("cricket inter-house", "cricket int-house", "aaa bbb", "cricket intra-house"))

// Map over the list to dynamically construct the regular expressions and check
// whether each one matches the text, then use reduce to make sure none of the
// patterns occurs in the text. Call collect() to see the result, or take(5) if
// you only want to see the first 5 results.
textrdd.filter(text =>
  mylist.map(word => !(".* " + word + ".*").r.pattern.matcher(text).matches)
    .reduce(_ && _)
).collect()
// res1: Array[String] = Array(cricket int-house, aaa bbb)
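As for reading the regex result inside the filter itself: a minimal sketch, assuming the columns are separated by whitespace and that a row should be dropped only when the leading word of its second column is exactly one of the listed words. The names PrefixFilter, blocked, leadingWord and keepRow are made up for illustration. It uses findFirstIn, which returns an Option rather than an iterator, and runs on plain Scala collections so it can be tried without a Spark context.

```scala
// Illustrative sketch: PrefixFilter, blocked, leadingWord and keepRow are hypothetical names.
object PrefixFilter {
  // Words that cause a row to be dropped (same values as mylist in the question).
  val blocked: Set[String] = Set("inter", "intra")

  // Anchored version of the question's regex: the leading alphanumeric run of a value.
  private val leadingWord = """^[a-zA-Z0-9]+""".r

  // Returns true when the row should be kept.
  // Assumption: columns are separated by one or more whitespace characters.
  def keepRow(row: String): Boolean = {
    val columns = row.trim.split("\\s+")
    if (columns.length < 2) true // no second column, nothing to check
    else leadingWord.findFirstIn(columns(1)) match {
      case Some(word) => !blocked.contains(word) // drop the row if the word is listed
      case None       => true                    // second column has no leading word
    }
  }

  val rows: List[String] =
    List("cricket inter-house", "cricket int-house", "aaa bbb", "cricket intra-house")

  // Plain-collection stand-in for the RDD; filter takes the same kind of predicate.
  val kept: List[String] = rows.filter(keepRow)
}
```

With Spark, the same predicate should be usable as textrdd.filter(PrefixFilter.keepRow _). Note that this exact-word comparison is stricter than the ".* inter.*" pattern above: a value such as international-house would be kept here but dropped by the pattern.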