pyspark - Spark Job Wrapping a Transformation With Local Operations (Very Slow and OOM Problems)


I'm trying to write a Spark job that analyzes a time series in a variety of ways. Through a series of transformations, I take a DataFrame and drop down to RDDs, so that I have rows structured as:

row[((tuple, key) t1, t2, t3, t4, t5, ...)]

Let's call this RDD: rdd.

I call rdd.flatMap(foo(r[1:])), and I expect foo to take as its input a list of times and deliver a list of analytics as its output.

Let's say:

def foo(times):
    return [average(times), percentile(times, 25)]
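For concreteness, here is a minimal self-contained sketch of this pattern (NumPy's mean/percentile stand in for my average/percentile helpers, the sample rows are made up, and foo is wrapped in a lambda so flatMap receives a function of the row):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def foo(times):
    # NumPy mean/percentile as stand-ins for the real analytics
    return [float(np.mean(times)), float(np.percentile(times, 25))]

# hypothetical sample rows shaped like ((tuple, key), t1, t2, t3, ...)
rdd = sc.parallelize([
    (("series", "a"), 1.0, 2.0, 3.0, 4.0),
    (("series", "b"), 10.0, 20.0, 30.0),
])

# flatMap needs a function of the row, so foo is wrapped in a lambda
analytics = rdd.flatMap(lambda r: foo(r[1:]))
print(analytics.collect())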

When I run the job, it takes forever, and OOM errors cause it to fail. times should have no more than 600k items, and that's a super-outlier case; most have between 10k and 100k.

I can't use reduceByKey() because the operations I need to perform require looking at the entire time series and going back and forth through it multiple times.

Does anyone have recommendations on a way to solve the OOM/slowness problem?

Assuming I've read the question correctly, you have an RDD where each row is a list of tuples, and rows can have up to 600k tuples.

Without knowing your cluster configuration or looking at the actual code, I can only speculate. My best guess is that since Spark partitions by row, rows with huge numbers of columns can't have those columns distributed among partitions, which causes the out-of-memory errors.

If that's the cause, you may need to increase your cluster capacity or restructure your data so that each tuple is on its own row.
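As a rough sketch of the restructuring option (assuming the first element of each row is the key and the rest are the time values), each wide row can be exploded into one (key, value) pair per time point, which lets Spark spread a single series across many partitions:

# sketch only: r[0] is assumed to be the key, r[1:] the time values
pairs = rdd.flatMap(lambda r: [(r[0], t) for t in r[1:]])

# if a whole-series computation is still needed per key, the values can be
# pulled back together, but that reintroduces the per-key memory cost
series = pairs.groupByKey().mapValues(list)

Whether that helps depends on whether the downstream analytics really need the whole series on one executor at once; if they do, more executor memory (for example a higher --executor-memory on spark-submit) may be the only option.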

