pyspark - Spark Job Wrapping a Transformation With Local Operations (Very Slow and OOM Problems)


I'm trying to write a Spark job that analyzes a time series in a variety of ways. Through a series of transformations, I take a DataFrame, drop down to RDDs, and end up with rows structured as:

row[((tuple, key) t1, t2, t3, t4, t5, ...)]

Let's call this RDD: rdd.

I call rdd.flatMap(foo(r[1:])), and I expect foo to take as its input a list of times and to deliver a list of analytics as its output.

Let's say:

def foo(times):
    return [average(times), percentile(times, 25)]
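A minimal sketch of the kind of call I mean, assuming numpy stands in for average and percentile, that r[0] holds the (tuple, key) and r[1:] holds the times, and that the transformation is written with a lambda (none of this is the actual job code):

import numpy as np

def foo(times):
    # Local, whole-series analytics computed on a single worker.
    times = np.asarray(list(times), dtype=float)
    return [np.mean(times), np.percentile(times, 25)]

# One analytics list per row; flatMap would instead flatten the
# analytics into individual records.
analytics = rdd.map(lambda r: (r[0], foo(r[1:])))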

When I run the job, it takes forever and OOM errors cause it to fail. times should have no more than 600k items, and that's a super-outlier case; most have between 10k and 100k.

I can't use reduceByKey() because the operations I need to perform require looking at the entire time series and going back and forth over it multiple times. To illustrate what I mean, see the toy example below.
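reduceByKey only ever combines two values at a time, so it fits associative statistics but not something like a percentile that needs the whole series at once. Here pair_rdd is a made-up (key, time) pair RDD and sc is the SparkContext, purely for illustration:

pair_rdd = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0)])

# Fine: addition is associative, so a pairwise reduce can compute it.
sums = pair_rdd.reduceByKey(lambda a, b: a + b)

# Not expressible this way: a 25th percentile needs every value for a
# key at once, not two at a time.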

Does anyone have recommendations on a way to solve the OOM/slowness problem?

Assuming I've read the question correctly, you have an RDD where each row is a list of tuples, and rows can have up to 600k tuples.

Without knowing your cluster configuration or looking at the actual code, I can only speculate. My best guess is that since Spark partitions by row, rows with huge numbers of columns can't have those columns distributed among partitions, causing out-of-memory errors.

If that's the cause, you may need to increase cluster capacity or restructure your data so that each tuple is on its own row, as in the sketch below.
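A rough sketch of what I mean by restructuring, assuming the key lives in r[0], the times in r[1:], and that foo is your whole-series analytics function (all placeholders, not your actual code):

# Explode each wide row into small (key, time) records so the values
# are spread across partitions instead of sitting in one huge row.
pairs = rdd.flatMap(lambda r: [(r[0], float(t)) for t in r[1:]])

# Regroup per key only at the point where the whole series is needed.
per_key_analytics = pairs.groupByKey().mapValues(lambda ts: foo(list(ts)))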

