java - How to improve the performance of an application on a Spark cluster?


I have a 3-node cluster (1 master + 2 workers).

Assigned driver-memory = 12 GB

Assigned executor-memory = 12 GB

Input data size = 12 GB in total (15 files of 800 MB each)

Two further files of 400 MB and 70 MB are loaded into 2 maps and broadcast to the workers. Each record of the input data is looked up in the broadcast maps, and each match is returned in a pair RDD, say JavaPairRDD<Object1, List<Object2>> r1.
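Roughly, that step looks like the following minimal sketch. Object1/Object2 here are placeholder stand-ins for my real classes, and keyOf() is a hypothetical key extractor; the real parsing does not matter for the question:

    import java.io.Serializable;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    import scala.Tuple2;

    // Placeholder types standing in for the real Object1 / Object2.
    class Object1 implements Serializable {
        final String id;
        Object1(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Object1 && ((Object1) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    class Object2 implements Serializable {
        final String payload;
        Object2(String payload) { this.payload = payload; }
    }

    public class BroadcastLookup {

        // Hypothetical key extraction: first comma-separated field of a record.
        static String keyOf(String rec) {
            return rec.split(",", 2)[0];
        }

        static JavaPairRDD<Object1, List<Object2>> buildR1(
                JavaSparkContext sc,
                JavaRDD<String> input,
                Map<String, Object1> map1,       // built from the 400 MB file
                Map<String, List<Object2>> map2  // built from the 70 MB file
        ) {
            // Ship both lookup maps to every executor once.
            final Broadcast<Map<String, Object1>> b1 = sc.broadcast(map1);
            final Broadcast<Map<String, List<Object2>>> b2 = sc.broadcast(map2);

            return input
                    // Keep only records that hit in both broadcast maps...
                    .filter(rec -> b1.value().containsKey(keyOf(rec))
                                && b2.value().containsKey(keyOf(rec)))
                    // ...and emit (Object1, List<Object2>) pairs: this is r1.
                    .mapToPair(rec -> new Tuple2<Object1, List<Object2>>(
                            b1.value().get(keyOf(rec)),
                            b2.value().get(keyOf(rec))));
        }
    }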

From r1 I need two outputs:

First, r2 = r1.map(output1Mapper()).

Second, I need to write each value List<Object2> of r1 to a file named after its key Object1.

Hence I collect the keys into a Set<Object1> keys,

and for each Object1 in keys I do a lookup() in RDD r1 and write the resulting List<Object2> to that key's output file.
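Concretely, the two outputs are produced like this (a sketch reusing r1 and the placeholder types from the previous snippet; the mapper body and the per-key file writing are simplified stand-ins for my real code):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    public class TwoOutputs {

        static void produceOutputs(JavaPairRDD<Object1, List<Object2>> r1)
                throws IOException {
            // Output 1: a record-level map over r1 (stands in for output1Mapper()).
            JavaRDD<String> r2 = r1.map(pair -> pair._1().id + "\t" + pair._2().size());
            r2.saveAsTextFile("output1");

            // Output 2, step 1: collect the distinct keys onto the driver.
            Set<Object1> keys = new HashSet<>(r1.keys().distinct().collect());

            // Output 2, step 2: one lookup() per key, i.e. one Spark job per key,
            // each shipping its result back to the driver before it is written out.
            for (Object1 key : keys) {
                List<List<Object2>> values = r1.lookup(key);
                try (PrintWriter out = new PrintWriter(key.id + ".out")) {
                    for (List<Object2> list : values) {
                        for (Object2 v : list) {
                            out.println(v.payload);
                        }
                    }
                }
            }
        }
    }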

Problem:

If I load all 15 files into one RDD and process it as described above, the job fails with a "Java heap space" error; after increasing spark.driver.maxResultSize to 3 GB it still fails.
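For reference, this is roughly how I raise that limit programmatically; the same setting can also be passed as --conf spark.driver.maxResultSize=3g to spark-submit (the app name is just a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AppContext {
        static JavaSparkContext create() {
            // spark.driver.maxResultSize caps the total size of results that
            // actions such as collect() and lookup() may bring back to the driver.
            SparkConf conf = new SparkConf()
                    .setAppName("lookup-pipeline")            // hypothetical name
                    .set("spark.driver.maxResultSize", "3g"); // the value tried above
            return new JavaSparkContext(conf);
        }
    }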

As a workaround, I perform the above operations on each input file separately and generate the outputs that way (loop sketched below); this takes approximately 35 minutes on the cluster.
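The per-file workaround is essentially this loop, reusing the sketches above; inputPaths holds the 15 input file paths, and map1/map2 are the lookup maps built beforehand:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PerFileRun {
        static void run(JavaSparkContext sc,
                        List<String> inputPaths,             // the 15 input files
                        Map<String, Object1> map1,
                        Map<String, List<Object2>> map2) throws IOException {
            // Run the whole pipeline once per file, so only one file's
            // results ever sit on the driver at a time.
            for (String path : inputPaths) {
                JavaRDD<String> input = sc.textFile(path);
                JavaPairRDD<Object1, List<Object2>> r1 =
                        BroadcastLookup.buildR1(sc, input, map1, map2);
                TwoOutputs.produceOutputs(r1);
            }
        }
    }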

I need pointers on how to increase performance, and whether this approach is the best one.

