java - How to improve performance of an application on a Spark cluster?
I have a 3-node cluster (1 master + 2 workers) with:
driver-memory = 12 GB
executor-memory = 12 GB
Input data: 12 GB total (15 files of 800 MB each).
Two reference files (400 MB and 70 MB) are stored in two maps and broadcast to the workers. Each record of the input data is looked up in the broadcast maps; on a match, it contributes to a pair RDD, say JavaPairRDD<Object1, List<Object2>> r1.
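Roughly, this step looks like the following (loadMap1/loadMap2, parseKey/parseValues and the file paths are simplified placeholders for my actual parsing code; Object1/Object2 stand for my real domain types):

    import java.util.List;
    import java.util.Map;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    // Load the two reference files into in-memory maps and broadcast them.
    Map<String, Object1> map1 = loadMap1("/path/to/400mb-file");   // placeholder parser
    Map<String, Object2> map2 = loadMap2("/path/to/70mb-file");    // placeholder parser
    Broadcast<Map<String, Object1>> bMap1 = jsc.broadcast(map1);
    Broadcast<Map<String, Object2>> bMap2 = jsc.broadcast(map2);

    // Look up each input record in the broadcast maps; keep only matches.
    JavaRDD<String> input = jsc.textFile("/path/to/input/*");
    JavaPairRDD<Object1, List<Object2>> r1 = input
            .mapToPair(line -> {
                Object1 key = bMap1.value().get(parseKey(line));          // placeholder
                List<Object2> values = parseValues(line, bMap2.value());  // placeholder
                return new Tuple2<>(key, values);
            })
            .filter(t -> t._1 != null);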
From r1 I need two outputs:
First, r2 = r1.map(new Output1Mapper()).
Second, for each key Object1 in r1, its value List<Object2> must be written to a file named after that key. So I collect the keys into a Set<Object1>, and for each Object1 in the set I call lookup() on r1 and write the resulting List<Object2> to an output file.
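A sketch of the two output steps as I currently have them (Output1Mapper is my own mapper class; OutputRecord is my output type; writeToFile is a placeholder for the code that writes one file per key):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.spark.api.java.JavaRDD;

    // First output: a plain map over r1.
    JavaRDD<OutputRecord> r2 = r1.map(new Output1Mapper());

    // Second output: collect the distinct keys to the driver, then
    // lookup() each key. Note that every lookup() call launches its own job.
    Set<Object1> keys = new HashSet<>(r1.keys().distinct().collect());
    for (Object1 key : keys) {
        List<List<Object2>> valueLists = r1.lookup(key);
        writeToFile(key.toString(), valueLists);  // placeholder writer
    }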
Problem:
If I load all 15 files into one RDD and run the processing above, it fails with a java heap space error. Increasing spark.driver.maxResultSize to 3 GB fails as well.
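For reference, this is how I set that limit before creating the context (the app name is arbitrary):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
            .setAppName("LookupJob")                     // arbitrary name
            .set("spark.driver.maxResultSize", "3g");    // the value I tried
    JavaSparkContext jsc = new JavaSparkContext(conf);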
As a workaround, I perform the operations above on each input file separately and generate the outputs per file; this takes roughly 35 minutes on the cluster.
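Roughly, the workaround loops over the files and repeats the pipeline (buildR1/writeOutputs are placeholder helpers wrapping the steps shown earlier):

    // Process the 15 input files one at a time instead of all at once.
    String[] inputFiles = { "/path/to/part-01", /* ... */ "/path/to/part-15" };
    for (String path : inputFiles) {
        JavaRDD<String> input = jsc.textFile(path);
        JavaPairRDD<Object1, List<Object2>> r1 = buildR1(input, bMap1, bMap2);  // lookup step above
        writeOutputs(r1);  // both outputs, as above
    }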
I need pointers on how to improve the performance and on which approach is best.