python - PySpark socket write error


I'm trying to read a file (a ~600 MB CSV file) with PySpark and am getting the following error.

Surprisingly, the same code works correctly in Scala.

I found the issue page https://issues.apache.org/jira/browse/spark-12261, but the suggestions there did not work.

Reading code:

    import os
    from pyspark import SparkContext
    from pyspark import SparkConf

    datasetDir = 'd:\\datasets\\movielens\\ml-latest\\'
    ratingFile = 'ratings.csv'

    conf = SparkConf().setAppName("movie_recommendation-server").setMaster('local[2]')
    sc = SparkContext(conf=conf)

    ratingRDD = sc.textFile(os.path.join(datasetDir, ratingFile))
    print(ratingRDD.take(1)[0])

I am getting this error:

    16/04/25 09:00:04 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
    java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
        at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
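Since the crash happens while the JVM is streaming the file's contents to the Python worker, one sanity check (independent of Spark) is to confirm that the file itself is readable from plain Python, by streaming only the first record with the standard library rather than loading the whole ~600 MB file. The sketch below uses a small stand-in file written to a temp directory; on the actual machine you would point `path` at `os.path.join(datasetDir, ratingFile)` from the question instead.

```python
import csv
import os
import tempfile

# Stand-in for the real ratings.csv from the question; replace `path`
# with os.path.join(datasetDir, ratingFile) on the actual machine.
sample = "userId,movieId,rating,timestamp\n1,31,2.5,1260759144\n"
path = os.path.join(tempfile.mkdtemp(), "ratings.csv")
with open(path, "w", newline="") as f:
    f.write(sample)

# Stream just the first record instead of reading the whole file,
# mirroring what ratingRDD.take(1)[0] asks Spark to do.
with open(path, newline="") as f:
    first_row = next(csv.reader(f))

print(first_row)
```

If this works but the PySpark job still crashes, the file itself is fine and the problem is in the JVM-to-Python worker pipe rather than the data.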

