Spark flatMapToPair vs [filter + mapToPair]
What is the performance difference between the two blocks of code below?
1. flatMapToPair: this block uses a single transformation. The filter condition lives inside it and returns an empty list, so matching elements effectively do not progress along the RDD.
    rdd.flatMapToPair(element -> {
        if (<condition>)
            return Lists.newArrayList();
        return Lists.newArrayList(new Tuple2<>(key, element));
    });

2. [filter + mapToPair]: this block uses two transformations. The first filters with the same condition as in the block above, and mapToPair runs after the filter.
    rdd.filter(
        (element) -> <condition>
    ).mapToPair(
        (element) -> new Tuple2<>(key, element)
    );

Is Spark intelligent enough to execute both of these blocks the same way, regardless of the number of transformations, or does code block 2 perform worse because it uses two transformations?
Thanks.
Spark will actually perform worse in the first case, because it has to initialize and then garbage-collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.
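One way to cut that allocation is to return the shared, immutable `Collections.emptyList()` / `Collections.singletonList(...)` instead of a fresh ArrayList per record. The sketch below shows the same flatMap shape in plain `java.util.stream` rather than Spark itself, so it can run standalone; the condition (drop even records), the key, and the class name are invented for illustration.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FlatMapAllocDemo {
    // Hypothetical stand-in for the question's <condition>: drop even records.
    static boolean condition(int element) {
        return element % 2 == 0;
    }

    // Same shape as the flatMapToPair block, but the empty and single-element
    // cases reuse Collections' shared/cheap list instances instead of
    // allocating a new ArrayList for every record.
    static List<Map.Entry<String, Integer>> run(List<Integer> records) {
        return records.stream()
                .flatMap(element -> {
                    if (condition(element)) {
                        // Shared immutable empty list: no per-record allocation.
                        return Collections.<Map.Entry<String, Integer>>emptyList().stream();
                    }
                    return Collections.<Map.Entry<String, Integer>>singletonList(
                            new AbstractMap.SimpleEntry<>("key", element)).stream();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4)));
    }
}
```

The same substitution should apply inside the Spark lambda, since flatMapToPair only requires some collection of `Tuple2`s back, not specifically an ArrayList.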
Otherwise, Spark is "intelligent enough" to use lazy data structures and to combine multiple transformations that don't require shuffles into a single stage.
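The per-element fusion can be seen in miniature with plain Java streams, which are lazy in the same spirit: filter and map interleave on each element in one traversal, rather than filter making a full pass first. This is only an analogy, not Spark; the values and the logging are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyFusionDemo {
    static final List<String> log = new ArrayList<>();

    // A filter + map pipeline over a lazy stream; the log records the order
    // in which the two operations actually run.
    static List<Integer> run() {
        log.clear();
        return Stream.of(1, 2, 3)
                .filter(x -> { log.add("filter:" + x); return x % 2 == 1; })
                .map(x -> { log.add("map:" + x); return x * 10; })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run());
        // The log interleaves filter and map per element, showing the two
        // operations were fused into a single pass over the data.
        System.out.println(log);
    }
}
```

In Spark the analogous effect is that filter + mapToPair are pipelined within one stage, so splitting them into two transformations does not cost an extra pass over the RDD.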
There are situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep the lineage shorter), but this is not one of them.