python - How to select certain columns from a CSV file in PySpark based on a list of column indexes, and then determine their distinct lengths
I have some code in PySpark in which I pass in the index values of columns as a list. Now I want to select the columns from a CSV file at these corresponding indexes:
    import sys

    from pyspark import SparkContext


    def ml_test(input_col_index):
        sc = SparkContext(master='local', appName='test')
        input_data = sc.textFile('hdfs://localhost:/dir1') \
            .zipWithIndex() \
            .filter(lambda pair: pair[1] >= 0) \
            .map(lambda pair: pair[0])


    if __name__ == '__main__':
        input_col_index = sys.argv[1]  # example - ['1','2','3','4']
        ml_test(input_col_index)
Now, if I had a static or hardcoded set of columns to select from the above CSV file, I could do it, but here the indexes of the desired columns are passed in as a parameter. I also have to calculate the distinct length of each of the selected columns, which I know can be done for a single hardcoded column as:

    colmn_1 = input_data.map(lambda x: x[0]).distinct().collect()
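For context, a minimal runnable sketch of that hardcoded case (assuming a comma-separated file and the same hypothetical HDFS path as above, and filling in the line-splitting step the one-liner assumes) might look like:

    from pyspark import SparkContext

    sc = SparkContext(master='local', appName='test')

    # split each raw line into its comma-separated fields
    input_data = sc.textFile('hdfs://localhost:/dir1') \
        .map(lambda line: line.split(','))

    # distinct values of the first column; its "distinct length"
    # is just the size of that list
    colmn_1 = input_data.map(lambda x: x[0]).distinct().collect()
    print(len(colmn_1))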
How do I do the same when the set of columns is not known in advance and is instead determined from the index list passed in at runtime?
Note: I have to calculate the distinct length of the columns because I have to pass those lengths as parameters to PySpark's RandomForest algorithm.
You can use list comprehensions:
    # given a list of indices...
    indices = [int(i) for i in input_col_index]

    # select the columns of each row
    rdd = rdd.map(lambda x: [x[idx] for idx in indices])

    # over all rows, choose the longest value in each column
    longest_per_column = rdd.reduce(
        lambda x, y: [max(a, b, key=len) for a, b in zip(x, y)])

    # lengths of the longest values per column
    print([len(x) for x in longest_per_column])
The reducing function takes two lists, loops over their values simultaneously, and creates a new list by selecting, for each column, whichever of the two values is longer.
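To see what a single reduction step does, here it is on two plain Python rows, with the result shown in the comment:

    x = ['apple', 'hi', 'cc']
    y = ['fig', 'hello', 'd']

    # for each column position, keep the longer of the two values
    print([max(a, b, key=len) for a, b in zip(x, y)])
    # ['apple', 'hello', 'cc']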
Update: To pass those lengths to the RandomForest constructor, you can do this:
    column_lengths = [len(x) for x in longest_per_column]

    model = RandomForest.trainRegressor(
        categoricalFeaturesInfo=dict(enumerate(column_lengths)),
        maxBins=max(column_lengths),
        # ...
    )
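Two hedges on the snippet above. First, RandomForest.trainRegressor (from pyspark.mllib.tree) also requires the training data (an RDD of LabeledPoint) and numTrees, which the # ... elides; training_data and num_trees below are placeholders, not values from the original post. Second, per the MLlib docs, categoricalFeaturesInfo maps a feature index to its number of categories, so if "distinct length" means the count of distinct values per column, a variant that computes those counts directly would be:

    from pyspark.mllib.tree import RandomForest

    # number of distinct values per selected column; j=j pins the
    # loop variable so each lambda captures its own column index
    distinct_counts = [
        rdd.map(lambda row, j=j: row[j]).distinct().count()
        for j in range(len(indices))
    ]

    # training_data (an RDD of LabeledPoint) and num_trees are
    # placeholders -- supply your own
    model = RandomForest.trainRegressor(
        training_data,
        categoricalFeaturesInfo=dict(enumerate(distinct_counts)),
        numTrees=num_trees,
        maxBins=max(distinct_counts))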