Spark flatMapToPair vs [filter + mapToPair]
What is the performance difference between the two blocks of code below?
1. flatMapToPair: this block uses a single transformation. The filter condition lives inside it and returns an empty list, so matching elements effectively do not progress along the RDD.
    rdd.flatMapToPair(element -> {
        if (<condition>)
            return Lists.newArrayList();
        return Lists.newArrayList(new Tuple2<>(key, element));
    });

2. [filter + mapToPair]: this block uses two transformations. The first filters with the same condition as in the block above, and mapToPair runs after the filter.
    rdd.filter(
        (element) -> <condition>
    ).mapToPair(
        (element) -> new Tuple2<>(key, element)
    );

Is Spark intelligent enough to execute both of these blocks the same way, regardless of the number of transformations, or does code block 2 perform worse because it uses two transformations?
Thanks.
Spark will actually perform worse in the first case, because it has to initialize and then garbage-collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.
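One way to cut that allocation is to return the shared, immutable `Collections.emptyList()` / `Collections.singletonList(...)` instead of a fresh ArrayList per record. The sketch below shows the same flatMap shape in plain `java.util.stream` rather than Spark itself, so it can run standalone; the condition (drop even records), the key, and the class name are invented for illustration.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FlatMapAllocDemo {
    // Hypothetical stand-in for the question's <condition>: drop even records.
    static boolean condition(int element) {
        return element % 2 == 0;
    }

    // Same shape as the flatMapToPair block, but the empty and single-element
    // cases reuse Collections' shared/cheap list instances instead of
    // allocating a new ArrayList for every record.
    static List<Map.Entry<String, Integer>> run(List<Integer> records) {
        return records.stream()
                .flatMap(element -> {
                    if (condition(element)) {
                        // Shared immutable empty list: no per-record allocation.
                        return Collections.<Map.Entry<String, Integer>>emptyList().stream();
                    }
                    return Collections.<Map.Entry<String, Integer>>singletonList(
                            new AbstractMap.SimpleEntry<>("key", element)).stream();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4)));
    }
}
```

The same substitution should apply inside the Spark lambda, since flatMapToPair only requires some collection of `Tuple2`s back, not specifically an ArrayList.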
Otherwise, Spark is "intelligent enough" to use lazy data structures and to combine multiple transformations that don't require shuffles into a single stage.
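The per-element fusion can be seen in miniature with plain Java streams, which are lazy in the same spirit: filter and map interleave on each element in one traversal, rather than filter making a full pass first. This is only an analogy, not Spark; the values and the logging are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyFusionDemo {
    static final List<String> log = new ArrayList<>();

    // A filter + map pipeline over a lazy stream; the log records the order
    // in which the two operations actually run.
    static List<Integer> run() {
        log.clear();
        return Stream.of(1, 2, 3)
                .filter(x -> { log.add("filter:" + x); return x % 2 == 1; })
                .map(x -> { log.add("map:" + x); return x * 10; })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run());
        // The log interleaves filter and map per element, showing the two
        // operations were fused into a single pass over the data.
        System.out.println(log);
    }
}
```

In Spark the analogous effect is that filter + mapToPair are pipelined within one stage, so splitting them into two transformations does not cost an extra pass over the RDD.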
There are situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep the lineage shorter), but this is not one of them.