pyspark - Spark Job Wrapping a Transformation With Local Operations (Very Slow and OOM Problems)
I'm trying to write a Spark job that analyzes a time series in a variety of ways. Through a series of transformations, I take a DataFrame and drop down to RDDs, to get rows structured as:

    Row[((tuple, key), t1, t2, t3, t4, t5, ...)]

Let's call this RDD rdd.
I call rdd.flatMap(lambda r: foo(r[1:])), and expect foo to take as its input a list of times and deliver a list of analytics as its output. Let's say:

    def foo(times): return [average(times), percentile(times, 25)]
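
For concreteness, a minimal runnable sketch of what foo and the call look like, using NumPy as a stand-in for the real analytics (np.mean and np.percentile here are assumptions, not my actual computations):

    import numpy as np

    def foo(times):
        # times: the time values from one row (everything after the key)
        # returns a flat list of analytics for that series
        return [float(np.mean(times)), float(np.percentile(times, 25))]

    # flatMap takes a function of the whole row, so the slice happens inside
    results = rdd.flatMap(lambda r: foo(r[1:]))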
When I run the job, it takes forever and OOM errors cause it to fail. times should have no more than 600k items, and that's a super-outlier case; most have between 10k and 100k.
I can't use reduceByKey() because the operations I need to perform require looking at the entire time series and going back and forth over it multiple times.
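
To make that constraint concrete, a toy sketch (this RDD is a stand-in, not my real data): reduceByKey fits associative, pairwise-combinable operations like a sum, but a percentile has no pairwise combine:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([("k", 1.0), ("k", 2.0), ("k", 10.0)])

    # a sum is associative, so reduceByKey can merge two values at a time
    totals = pairs.reduceByKey(lambda a, b: a + b)

    # a 25th percentile depends on the whole sorted series at once; there is
    # no pairwise combine that produces it, so the full series must be local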
Does anyone have recommendations on a way to solve the OOM and slowness problems?
Assuming I've read the question correctly, you have an RDD where each row is a list of tuples, and a row can have up to 600k tuples.

Without knowing your cluster configuration or looking at the actual code, I can only speculate. My best guess: since Spark partitions by row, a row with a huge number of columns cannot have those columns distributed among partitions, which causes the out-of-memory errors.

If that's the cause, you may need to increase cluster capacity or restructure your data so that each tuple is on its own row.
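
As a rough sketch of that restructuring (the name to_pairs is illustrative, and I'm assuming the (tuple, key) element is the first field of each row):

    import numpy as np

    def to_pairs(row):
        # row[0] is the (tuple, key) identifier; row[1:] are the time values
        key = row[0]
        return [(key, t) for t in row[1:]]

    # one (key, time) record per output row, so no row is 600k items wide
    pairs = rdd.flatMap(to_pairs)

    def series_analytics(times):
        ts = list(times)  # materialize the grouped iterable once
        return [float(np.mean(ts)), float(np.percentile(ts, 25))]

    # groupByKey reassembles each full series in a single task; a series
    # still has to fit in one executor's memory, but partitions no longer
    # carry entire giant rows
    analytics = pairs.groupByKey().mapValues(series_analytics)

Whether groupByKey is acceptable depends on how big a single series can get; the point is the per-tuple row layout.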