python - Creating a Parquet file with PySpark on an AWS EMR cluster
I'm trying to spin up a Spark cluster with Databricks' spark-csv package so that I can create Parquet files (and do other Spark things, obviously).
This is being done within AWS EMR, and I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up:

```
spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g
```

I've tried adding it as a Spark step — is that correct?
If the cluster spins up without the package installed, how do I start PySpark with it? Is this correct:

```
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
```

? I can't tell whether it's installed or not, and I'm not sure which functions to call to test it.
And with regard to using the package, is this correct for creating a Parquet file:

```python
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema=customSchema)

# Option 1
df.write.parquet("s3n://bucketname/nation_parquet.parquet")

# Option 2
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('s3n://bucketname/nation_parquet.tbl')
```
I'm not able to find any recent (mid-2015 and later) documentation about writing a Parquet file.
edit:
Okay, now I'm not sure whether I'm creating my DataFrame correctly. If I try to run select
queries on it and show the result set, I don't get that; instead I get an error. Here's what I tried running:
```python
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema=customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql("""SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1""")
tcp_interactions.show()
# I get a weird Java error:
# Caused by: java.lang.NumberFormatException: Input string: "0|algeria|0| haggle. final deposits detect slyly agai|"
```
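The string quoted in the NumberFormatException is an entire line of nation.tbl, which suggests the file is pipe-delimited while spark-csv defaults to comma, so the whole row is parsed as a single column and the cast to an integer `nation_id` fails. A quick plain-Python check of that idea (the sample line is taken from the error message above):

```python
# The line from the error message: with the default ',' delimiter it stays
# one field, but splitting on '|' yields the expected columns.
line = "0|algeria|0| haggle. final deposits detect slyly agai|"

comma_fields = line.split(",")   # spark-csv's default delimiter
pipe_fields = line.split("|")    # the delimiter the file appears to use

print(len(comma_fields))  # 1 -> the whole row is a single "column"
print(pipe_fields[:3])    # ['0', 'algeria', '0'] -> parses cleanly
```

If that's the cause, spark-csv's `delimiter` option (e.g. `.options(header='false', delimiter='|')`) would be the place to fix the read.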