java - How to improve the performance of an application on a Spark cluster?


I have a 3-node cluster (1 master + 2 workers).

Assigned driver memory = 12 GB

Assigned executor memory = 12 GB

Input data size = 12 GB total (15 files, each about 800 MB)

Two further files of 400 MB and 70 MB are stored in two maps and broadcast to the workers. Each input record is looked up in the broadcast maps; the matches are returned as a pair RDD, say JavaPairRDD<Object1, List<Object2>> r1.
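A minimal sketch of that broadcast-and-lookup step, assuming String stands in for the real Object1/Object2 types; the input path and the way the maps are filled are hypothetical placeholders:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    import scala.Tuple2;

    public class BroadcastLookupSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("broadcast-lookup");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Lookup tables built on the driver from the 400 MB and
                // 70 MB files (loading code omitted).
                Map<String, String> map1 = new HashMap<>();
                Map<String, String> map2 = new HashMap<>();

                // Broadcast once so each executor keeps a single read-only
                // copy instead of the maps being re-shipped with every task.
                Broadcast<Map<String, String>> bMap1 = sc.broadcast(map1);
                Broadcast<Map<String, String>> bMap2 = sc.broadcast(map2);

                JavaRDD<String> input = sc.textFile("hdfs:///input/*");

                // Match each record against the broadcast maps and emit
                // (key, list-of-values) pairs, i.e. the r1 of the question.
                JavaPairRDD<String, List<String>> r1 = input.mapToPair(record -> {
                    String key = bMap1.value().get(record);
                    String value = bMap2.value().get(record);
                    return new Tuple2<>(key, Collections.singletonList(value));
                });

                r1.cache(); // reused below for both outputs
            }
        }
    }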

From r1 I need two outputs.

First, r2 = r1.map(new Output1Mapper()).

Second, for each key Object1 in r1, I need to write its value List<Object2> to a file named after that key.

Hence I collect the keys into a Set<Object1> keys,

and for each Object1 in keys I call lookup() on the RDD r1 and write the resulting List<Object2> to an output file, as sketched below.
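Roughly, continuing from the sketch above (formatOutput1 and writeListToFile are hypothetical stand-ins for the real mapper and writer; java.util.HashSet and java.util.Set would need to be imported):

    // First output: a plain map over r1 (the Output1Mapper step).
    JavaRDD<String> r2 = r1.map(pair -> formatOutput1(pair._1(), pair._2()));

    // Second output: collect the distinct keys on the driver...
    Set<String> keys = new HashSet<>(r1.keys().distinct().collect());

    // ...then run one lookup() per key and write its values to a file named
    // after the key. Every lookup() is a separate Spark job that scans r1
    // (cheaper once r1 is cached, but still one job per key).
    for (String key : keys) {
        List<List<String>> values = r1.lookup(key);
        writeListToFile(key, values);
    }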

Problem:

If I load all 15 files into one RDD and process them as above, the job fails with a Java heap space error; after increasing spark.driver.maxResultSize to 3 GB it fails again.
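For reference, that limit can be raised with --conf spark.driver.maxResultSize=3g on spark-submit or set programmatically; results returned to the driver by actions such as collect() and lookup() all count against it:

    SparkConf conf = new SparkConf()
        .setAppName("my-job") // hypothetical app name
        // Cap on the total size of serialized results that actions such as
        // collect() and lookup() may return to the driver (default 1g).
        .set("spark.driver.maxResultSize", "3g");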

The way I use now is to perform the above operations on each input file separately and generate the outputs, which takes approximately 35 minutes on the cluster.
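That per-file workaround presumably amounts to a driver-side loop along these lines (the path pattern is a hypothetical placeholder):

    // Run the whole pipeline once per file instead of once over all 15,
    // so no single job returns more to the driver than it can hold.
    for (int i = 1; i <= 15; i++) {
        JavaRDD<String> input = sc.textFile("hdfs:///input/file-" + i + ".txt");
        // ... same broadcast lookup, map, and per-key writes as above ...
    }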

I need pointers on how to increase the performance, and on which approach is best.

