pyspark - Spark Job Wrapping a Transformation With Local Operations (Very Slow and OOM Problems)


I'm trying to write a Spark job that analyzes a time series in a variety of ways. Through a series of transformations, I take a DataFrame and drop down to RDDs, so that I have rows structured as:

row[((tuple, key) t1, t2, t3, t4, t5, ...)]

Let's call this RDD: rdd.

I call rdd.flatMap(foo(r[1:])), and I expect foo to take as its input a list of times and deliver a list of analytics as its output.

Let's say:

def foo(times):
    return [average(times), percentile(times, 25)]
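For concreteness, here is a minimal self-contained sketch of this pattern (NumPy's mean/percentile stand in for my average/percentile helpers, the sample rows are made up, and foo is wrapped in a lambda so flatMap receives a function of the row):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def foo(times):
    # NumPy mean/percentile as stand-ins for the real analytics
    return [float(np.mean(times)), float(np.percentile(times, 25))]

# hypothetical sample rows shaped like ((tuple, key), t1, t2, t3, ...)
rdd = sc.parallelize([
    (("series", "a"), 1.0, 2.0, 3.0, 4.0),
    (("series", "b"), 10.0, 20.0, 30.0),
])

# flatMap needs a function of the row, so foo is wrapped in a lambda
analytics = rdd.flatMap(lambda r: foo(r[1:]))
print(analytics.collect())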

When I run the job, it takes forever, and OOM errors cause it to fail. times should have no more than 600k items, and that's a super-outlier case; most have between 10k and 100k.

I can't use reduceByKey() because the operations I need to perform require looking at the entire time series and going back and forth through it multiple times.

Does anyone have recommendations on a way to solve the OOM/slowness problem?

Assuming I've read the question correctly, you have an RDD where each row is a list of tuples, and rows can have up to 600k tuples.

Without knowing your cluster configuration or looking at the actual code, I can only speculate. My best guess is that since Spark partitions by row, rows with huge numbers of columns can't have those columns distributed among partitions, which causes the out-of-memory errors.

If that's the cause, you may need to increase your cluster capacity or restructure your data so that each tuple is on its own row.
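As a rough sketch of the restructuring option (assuming the first element of each row is the key and the rest are the time values), each wide row can be exploded into one (key, value) pair per time point, which lets Spark spread a single series across many partitions:

# sketch only: r[0] is assumed to be the key, r[1:] the time values
pairs = rdd.flatMap(lambda r: [(r[0], t) for t in r[1:]])

# if a whole-series computation is still needed per key, the values can be
# pulled back together, but that reintroduces the per-key memory cost
series = pairs.groupByKey().mapValues(list)

Whether that helps depends on whether the downstream analytics really need the whole series on one executor at once; if they do, more executor memory (for example a higher --executor-memory on spark-submit) may be the only option.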

