python - How to select certain columns from a CSV file in PySpark based on a list of column indexes and then determine their distinct lengths


I have code in PySpark in which I pass the index values of columns as a list. I want to select the columns of a CSV file that correspond to these indexes:

import sys
from pyspark import SparkContext

def ml_test(input_col_index):
    sc = SparkContext(master='local', appName='test')
    # Read the file as lines of text; zipWithIndex pairs each line with its row number
    input_data = (sc.textFile('hdfs://localhost:/dir1')
                  .zipWithIndex()
                  .filter(lambda pair: pair[1] >= 0)
                  .map(lambda pair: pair[0]))

if __name__ == '__main__':
    input_col_index = sys.argv[1]  # example - ['1','2','3','4']
    ml_test(input_col_index)

Now, if I have a static or hardcoded set of columns that I want to select from the above CSV file, I can do that, but here the indexes of the desired columns are being passed as a parameter. I also have to calculate the distinct length of each of the selected columns. I know this can be done with colmn_1 = input_data.map(lambda x: x[0]).distinct().collect(), but how do I do it for a set of columns that is not known in advance and is determined by the index list passed at runtime?

Note: I have to calculate the distinct length of the columns because I have to pass that length as a parameter to PySpark's RandomForest algorithm.

You can use list comprehensions.

# given a list of indices...
indices = [int(i) for i in input_col_index]

# select only those columns from each row
rdd = rdd.map(lambda x: [x[idx] for idx in indices])

# across all rows, choose the longest value in each column
longest_per_column = rdd.reduce(
    lambda x, y: [max(a, b, key=len) for a, b in zip(x, y)])

# get the lengths of the longest values
print([len(x) for x in longest_per_column])

The reducing function takes two lists, loops over their values simultaneously, and creates a new list by selecting (for each column) whichever value is longer.
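To see what that reduction does without a cluster, here is a minimal pure-Python sketch that uses functools.reduce in place of rdd.reduce. The helper names (parse_indices, select_columns, longest) and the sample rows are made up for illustration. It assumes that the index list arrives as a single command-line string such as "['0','2']" and that each row is a plain comma-separated line with no quoted fields.

```python
import ast
from functools import reduce

def parse_indices(arg):
    """Turn a command-line string such as "['1','2','3']" into [1, 2, 3]."""
    return [int(i) for i in ast.literal_eval(arg)]

def select_columns(line, indices, sep=","):
    """Split one CSV line (naively, no quoting support) and keep the requested columns."""
    fields = line.split(sep)
    return [fields[idx] for idx in indices]

def longest(x, y):
    """For each column, keep whichever of the two values is longer."""
    return [max(a, b, key=len) for a, b in zip(x, y)]

indices = parse_indices("['0','2']")
lines = ["red,10,apple", "green,200,fig", "blue,3,banana"]  # made-up sample rows
rows = [select_columns(line, indices) for line in lines]

# Same reducing function as above, applied with functools.reduce instead of rdd.reduce
longest_per_column = reduce(longest, rows)
column_lengths = [len(x) for x in longest_per_column]
print(column_lengths)  # [5, 6]
```

On an RDD of text lines, the same two helpers could presumably be passed to map and reduce, since both are ordinary functions of their arguments.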

Update: to pass the lengths to the RandomForest constructor, you can do this:

from pyspark.mllib.tree import RandomForest

column_lengths = [len(x) for x in longest_per_column]

model = RandomForest.trainRegressor(
    categoricalFeaturesInfo=dict(enumerate(column_lengths)),
    maxBins=max(column_lengths),
    # ...
)
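The question's own example uses distinct(), which suggests that "distinct length" may mean the number of distinct values in a column rather than the length of its longest string. MLlib documents categoricalFeaturesInfo as a map from feature index to number of categories, so the distinct count could be the more useful figure. If so, the following is one possible sketch, not the method given in the answer above. The functions seq_op and comb_op are hypothetical helpers written in the shape that rdd.aggregate(zeroValue, seqOp, combOp) expects. They are exercised here with functools.reduce on made-up rows so the example runs without Spark.

```python
from functools import reduce

def seq_op(seen, row):
    """Fold one row into the per-column sets of values seen so far."""
    for column_values, value in zip(seen, row):
        column_values.add(value)
    return seen

def comb_op(left, right):
    """Merge the per-column sets produced by two partitions."""
    return [a | b for a, b in zip(left, right)]

def distinct_counts(rows, n_columns):
    """Pure-Python stand-in for rdd.aggregate(zero, seq_op, comb_op)."""
    zero = [set() for _ in range(n_columns)]
    return [len(s) for s in reduce(seq_op, rows, zero)]

# Made-up rows that already contain only the selected columns
sample_rows = [["a", "x", "1"], ["b", "x", "2"], ["a", "y", "3"]]
counts = distinct_counts(sample_rows, 3)
print(counts)  # [2, 2, 3]
```

With an RDD of selected columns, the assumed equivalent would be rdd.aggregate([set() for _ in indices], seq_op, comb_op), followed by taking len of each resulting set. This holds every distinct value in memory, so it only suits columns with a modest number of categories.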
