python - How to select certain columns from a CSV file in PySpark based on a list of column indexes and then determine their distinct lengths


I have code in PySpark in which I pass the index values of columns as a list. I want to select the columns of a CSV file that correspond to these indexes:

    import sys
    from pyspark import SparkContext

    def ml_test(input_col_index):
        sc = SparkContext(master='local', appName='test')
        # Pair each line with its row number, keep rows with rownum >= 0,
        # then drop the row number again.
        inputData = (sc.textFile('hdfs://localhost:/dir1')
                     .zipWithIndex()
                     .filter(lambda pair: pair[1] >= 0)
                     .map(lambda pair: pair[0]))

    if __name__ == '__main__':
        input_col_index = sys.argv[1]  # example - ['1','2','3','4']
        ml_test(input_col_index)

Now, if I had a static or hardcoded set of columns to select from the above CSV file, I could do that, but here the indexes of the desired columns are passed as a parameter. I also have to calculate the distinct length of each of the selected columns. I know this can be done for a single column with colmn_1 = input_data.map(lambda x: x[0]).distinct().collect(), but how do I do it for a set of columns that is not known in advance and is determined by the index list passed at runtime?

Note: I have to calculate the distinct length of the columns because I have to pass that length as a parameter to PySpark's RandomForest algorithm.

You can use list comprehensions.

    # Given a list of indices...
    indices = [int(i) for i in input_col_index]

    # Select only those columns from each row
    rdd = rdd.map(lambda x: [x[idx] for idx in indices])

    # Across all rows, choose the longest value in each column
    longest_per_column = rdd.reduce(
        lambda x, y: [max(a, b, key=len) for a, b in zip(x, y)])

    # Get the lengths of the longest values
    print([len(x) for x in longest_per_column])

The reducing function takes two lists, loops over their values simultaneously, and creates a new list by selecting (for each column) whichever value is longer.
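To see what the reducing function does without a Spark cluster, here is a minimal sketch that runs the same logic with functools.reduce over a plain Python list. The sample rows, the index list, and the helper name longer_per_column are made up for illustration, and the rows are assumed to be already split into fields.

```python
from functools import reduce

# Hypothetical sample rows, already split into fields (stand-in for the RDD)
rows = [
    ["a", "bbb", "cc", "d"],
    ["aaaa", "b", "c", "dd"],
    ["aa", "bb", "ccccc", "d"],
]
indices = [0, 2]  # hypothetical index list passed at runtime

# Select only the requested columns from each row
selected = [[row[idx] for idx in indices] for row in rows]

def longer_per_column(x, y):
    """For each column position, keep whichever of the two values is longer."""
    return [max(a, b, key=len) for a, b in zip(x, y)]

# Same pairwise combination that rdd.reduce would apply
longest_per_column = reduce(longer_per_column, selected)
column_lengths = [len(value) for value in longest_per_column]

print(longest_per_column)
print(column_lengths)
```

Because the function only compares two rows at a time and the result has the same shape as its inputs, it should be safe to apply in any order, which is what a distributed reduce requires.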

Update: to pass the lengths to the RandomForest constructor, you can do this:

    from pyspark.mllib.tree import RandomForest

    column_lengths = [len(x) for x in longest_per_column]

    model = RandomForest.trainRegressor(
        categoricalFeaturesInfo=dict(enumerate(column_lengths)),
        maxBins=max(column_lengths),
        # ...
    )
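As a small illustration of the two arguments built above, the sketch below shows what dict(enumerate(...)) and max(...) produce for a made-up list of lengths. The values in column_lengths and the names categorical_features_info and max_bins are hypothetical. It assumes the selected columns become features 0, 1, 2, ... in the same order.

```python
# Hypothetical per-column values, in the order the columns were selected
column_lengths = [4, 5, 2]

# enumerate pairs each value with its position, so the dict maps
# feature index -> value for that feature
categorical_features_info = dict(enumerate(column_lengths))

# The largest value across all columns
max_bins = max(column_lengths)

print(categorical_features_info)
print(max_bins)
```

In MLlib, categoricalFeaturesInfo is documented as a map from feature index to the number of categories of that feature, and maxBins has to be at least the largest number of categories, which is why the maximum is used here.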
