Spark flatMapToPair vs [filter + mapToPair]

What is the performance difference between the two blocks of code below?
1. flatMapToPair: this block uses a single transformation and applies the filter condition inside it by returning an empty list, so a filtered-out element technically never progresses along the RDD:

    rdd.flatMapToPair(element -> {
        if (<condition>) return Lists.newArrayList();
        return Lists.newArrayList(new Tuple2<>(key, element));
    });
2. [filter + mapToPair]: this block uses two transformations: the first filters with the same condition as the block above, and mapToPair runs after the filter:

    rdd.filter(element -> <condition>)
       .mapToPair(element -> new Tuple2<>(key, element));
Is Spark intelligent enough to execute both of these blocks the same way regardless of the number of transformations, or does it perform worse on block 2 because of its two transformations?
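To make the comparison concrete, here is a minimal, self-contained sketch of the two shapes using plain Java streams as a stand-in for Spark RDDs (an assumption so the example runs without a Spark cluster; `drop`, `key`, and `Map.Entry` are hypothetical stand-ins for the `<condition>`, the key extraction, and `Tuple2`):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;
import java.util.stream.Collectors;

public class PairPipelines {
    // Hypothetical stand-ins for the question's <condition> and key extraction.
    static boolean drop(int element) { return element % 2 == 0; }
    static String key(int element)   { return "k" + element; }

    // Variant 1: a single flatMap that drops elements by returning an empty list.
    static List<Map.Entry<String, Integer>> flatMapStyle(List<Integer> input) {
        return input.stream()
                .flatMap(e -> {
                    if (drop(e)) {
                        // A fresh (empty) list is still allocated for dropped records.
                        return new ArrayList<Map.Entry<String, Integer>>().stream();
                    }
                    List<Map.Entry<String, Integer>> one = new ArrayList<>();
                    one.add(new SimpleEntry<>(key(e), e));
                    return one.stream();
                })
                .collect(Collectors.toList());
    }

    // Variant 2: filter first, then map survivors to key/value pairs.
    static List<Map.Entry<String, Integer>> filterMapStyle(List<Integer> input) {
        return input.stream()
                .filter(e -> !drop(e))
                .<Map.Entry<String, Integer>>map(e -> new SimpleEntry<>(key(e), e))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        // Both variants produce the same pairs: [k1=1, k3=3, k5=5]
        System.out.println(flatMapStyle(input));
        System.out.println(filterMapStyle(input));
    }
}
```

Both variants are functionally equivalent; the question is purely about execution cost.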
Thanks.
Actually, Spark performs worse in the first case, because it has to initialize and then garbage-collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.

Otherwise, Spark is "intelligent enough": it uses lazy data structures and combines multiple transformations that don't require shuffles into a single stage.

There are situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep a shorter lineage), but this is not one of them.
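The per-record allocation cost can be made visible by counting list creations. This is a sketch under the same assumption as before (plain Java streams rather than a real Spark job; `newTrackedList` is a hypothetical counting wrapper standing in for `Lists.newArrayList()`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class AllocationCount {
    static final AtomicLong allocations = new AtomicLong();

    // Hypothetical counting stand-in for Lists.newArrayList().
    static <T> List<T> newTrackedList() {
        allocations.incrementAndGet();
        return new ArrayList<>();
    }

    // flatMap style: one list is allocated for EVERY record, kept or dropped.
    public static long countFlatMapAllocations(int records) {
        allocations.set(0);
        IntStream.range(0, records).boxed()
                .flatMap(e -> {
                    List<Integer> out = newTrackedList();
                    if (e % 2 != 0) out.add(e);
                    return out.stream();
                })
                .collect(Collectors.toList());
        return allocations.get();
    }

    // filter + map style: no per-record list is allocated at all.
    public static long countFilterMapAllocations(int records) {
        allocations.set(0);
        IntStream.range(0, records).boxed()
                .filter(e -> e % 2 != 0)
                .map(e -> "k" + e)
                .collect(Collectors.toList());
        return allocations.get();
    }

    public static void main(String[] args) {
        System.out.println("flatMap lists:    " + countFlatMapAllocations(1_000_000));
        System.out.println("filter+map lists: " + countFilterMapAllocations(1_000_000));
    }
}
```

For a million records, the flatMap variant allocates a million short-lived lists that the garbage collector must reclaim, while filter + map allocates none.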