python - Creating a Parquet file with PySpark on an AWS EMR cluster -


I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create Parquet files (and do other Spark stuff, obviously).

This is being done within AWS EMR, and I don't think I'm putting these options in the correct place.

This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting it in a Spark step; is that correct?
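One thing to note: spark-shell is an interactive REPL, so it can't run as an EMR step; a step would run spark-submit through command-runner.jar instead, with the same --packages flag. A hedged sketch of adding such a step with boto3, assuming an already-running cluster; the cluster id, step name, and script path are placeholders, not from the question:

```python
import boto3

# Sketch only: submit a Spark job as an EMR step. The --packages flag makes
# the cluster pull spark-csv from Maven at job start, so nothing needs to be
# preinstalled on the nodes.
emr = boto3.client('emr', region_name='us-east-1')   # region is an assumption
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',                     # your cluster id
    Steps=[{
        'Name': 'csv-to-parquet',                    # hypothetical step name
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--packages', 'com.databricks:spark-csv_2.10:1.4.0',
                     '--master', 'yarn',
                     '--driver-memory', '4g',
                     '--executor-memory', '2g',
                     's3://bucketname/jobs/convert.py'],  # hypothetical script
        },
    }])
```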

If the cluster is spun up without the package installed, how do I start pyspark with the package? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it's installed or not, and I'm not sure which functions to test with.
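One quick smoke test, assuming an interactive pyspark session started with that --packages flag: attempt a read in the spark-csv format. If the jar was not loaded, the call fails immediately with a Py4J error wrapping a java.lang.ClassNotFoundException for the data source, rather than a file-related error. The bucket path below is the one from the question and is only illustrative:

```python
# Sketch: probe whether spark-csv is on the classpath in this session.
try:
    sqlContext.read.format('com.databricks.spark.csv') \
        .load('s3n://bucketname/nation.tbl')
    print('spark-csv loaded')
except Exception as e:
    # A ClassNotFoundException buried in the Py4J traceback means the
    # --packages flag did not take effect; a file/permissions error
    # means the package itself is fine.
    print(e)
```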

And with regard to using the package, is this the correct way to create a Parquet file:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)

# is it option 1
df.write.parquet("s3n://bucketname/nation_parquet.parquet")

# or option 2
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
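For comparison, a hedged sketch of both write paths, reusing the question's paths and customSchema: option 1 is already the idiomatic DataFrameWriter call, while option 2 as written chains .write.parquet(...) with .save(...), which is not a valid combination; the format name belongs in .format(), and .parquet() takes the output path. This assumes a Spark 1.x SQLContext as in the question:

```python
# Sketch only: assumes sqlContext and customSchema are already defined,
# and that the S3 paths from the question exist.
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='false') \
    .load('s3n://bucketname/nation.tbl', schema=customSchema)

# Option 1: idiomatic; writes every column of the DataFrame as Parquet.
df.write.parquet('s3n://bucketname/nation_parquet.parquet')

# Option 2, corrected: select a subset of columns, then either call
# .parquet(path) directly or use .format('parquet').save(path).
df.select('nation_id', 'name', 'some_int', 'comment') \
    .write.format('parquet') \
    .save('s3n://bucketname/nation_parquet.tbl')
```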

I'm not able to find any recent (mid-2015 and later) documentation about writing a Parquet file.

Edit:

Okay, I'm not sure if I'm creating the DataFrame correctly. If I try to run select queries on it and show the result set, I don't get one, and instead get an error. Here's what I tried running:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)

df.registerTempTable("region2")

tcp_interactions = sqlContext.sql("""
    SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1
""")
tcp_interactions.show()

# I get a weird Java error:
# Caused by: java.lang.NumberFormatException: For input string: "0|algeria|0| haggle. final deposits detect slyly agai|"
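For what it's worth, that NumberFormatException points at a delimiter problem rather than a query problem: nation.tbl is pipe-delimited, while spark-csv defaults to a comma, so the entire record lands in the first column and the cast to the schema's integer type fails. Passing delimiter='|' in .options() should fix it. A pure-Python illustration of the failure mode, using the line quoted in the error message:

```python
# The record that appears in the NumberFormatException.
line = '0|algeria|0| haggle. final deposits detect slyly agai|'

# Default comma delimiter: the line contains no commas, so the whole
# record becomes a single field that cannot be cast to an integer.
fields_default = line.split(',')
print(len(fields_default))      # 1 field: the entire line

# Pipe delimiter: the fields separate cleanly and the key column parses.
fields_piped = line.split('|')
nation_id = int(fields_piped[0])
print(nation_id)                # 0
```

In the pyspark reader this corresponds to .options(header='false', delimiter='|') on the spark-csv format.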

