python - How to select certain columns from a CSV file in PySpark based on a list of column indexes, and then determine their distinct lengths
I have some code in PySpark in which I pass in the index values of columns as a list. Now I want to select the columns from a CSV file at these corresponding indexes:
    import sys

    from pyspark import SparkContext


    def ml_test(input_col_index):
        sc = SparkContext(master='local', appName='test')
        input_data = sc.textFile('hdfs://localhost:/dir1') \
            .zipWithIndex() \
            .filter(lambda pair: pair[1] >= 0) \
            .map(lambda pair: pair[0])


    if __name__ == '__main__':
        input_col_index = sys.argv[1]  # example - ['1','2','3','4']
        ml_test(input_col_index)
Now, if I had a static or hardcoded set of columns to select from the above CSV file, I could do it, but here the indexes of the desired columns are passed in as a parameter. I also have to calculate the distinct length of each of the selected columns, which I know can be done for a single hardcoded column as:

    colmn_1 = input_data.map(lambda x: x[0]).distinct().collect()
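For context, a minimal runnable sketch of that hardcoded case (assuming a comma-separated file and the same hypothetical HDFS path as above, and filling in the line-splitting step the one-liner assumes) might look like:

    from pyspark import SparkContext

    sc = SparkContext(master='local', appName='test')

    # split each raw line into its comma-separated fields
    input_data = sc.textFile('hdfs://localhost:/dir1') \
        .map(lambda line: line.split(','))

    # distinct values of the first column; its "distinct length"
    # is just the size of that list
    colmn_1 = input_data.map(lambda x: x[0]).distinct().collect()
    print(len(colmn_1))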
How do I do the same when the set of columns is not known in advance and is instead determined from the index list passed in at runtime?
Note: I have to calculate the distinct length of the columns because I have to pass those lengths as parameters to PySpark's RandomForest algorithm.
You can use list comprehensions:
    # given a list of indices...
    indices = [int(i) for i in input_col_index]

    # select the columns of each row
    rdd = rdd.map(lambda x: [x[idx] for idx in indices])

    # over all rows, choose the longest value in each column
    longest_per_column = rdd.reduce(
        lambda x, y: [max(a, b, key=len) for a, b in zip(x, y)])

    # lengths of the longest values per column
    print([len(x) for x in longest_per_column])
The reducing function takes two lists, loops over their values simultaneously, and creates a new list by selecting, for each column, whichever of the two values is longer.
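To see what a single reduction step does, here it is on two plain Python rows, with the result shown in the comment:

    x = ['apple', 'hi', 'cc']
    y = ['fig', 'hello', 'd']

    # for each column position, keep the longer of the two values
    print([max(a, b, key=len) for a, b in zip(x, y)])
    # ['apple', 'hello', 'cc']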
Update: To pass those lengths to the RandomForest constructor, you can do this:
    column_lengths = [len(x) for x in longest_per_column]

    model = RandomForest.trainRegressor(
        categoricalFeaturesInfo=dict(enumerate(column_lengths)),
        maxBins=max(column_lengths),
        # ...
    )
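Two hedges on the snippet above. First, RandomForest.trainRegressor (from pyspark.mllib.tree) also requires the training data (an RDD of LabeledPoint) and numTrees, which the # ... elides; training_data and num_trees below are placeholders, not values from the original post. Second, per the MLlib docs, categoricalFeaturesInfo maps a feature index to its number of categories, so if "distinct length" means the count of distinct values per column, a variant that computes those counts directly would be:

    from pyspark.mllib.tree import RandomForest

    # number of distinct values per selected column; j=j pins the
    # loop variable so each lambda captures its own column index
    distinct_counts = [
        rdd.map(lambda row, j=j: row[j]).distinct().count()
        for j in range(len(indices))
    ]

    # training_data (an RDD of LabeledPoint) and num_trees are
    # placeholders -- supply your own
    model = RandomForest.trainRegressor(
        training_data,
        categoricalFeaturesInfo=dict(enumerate(distinct_counts)),
        numTrees=num_trees,
        maxBins=max(distinct_counts))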