Spark Streaming from Kafka Source Go Back to a Checkpoint or Rewinding -


when streaming spark dstreams consumer kafka source, 1 can checkpoint spark context when app crashes (or affected kill -9), app can recover context checkpoint. if app 'accidentally deployed bad logic', 1 might want rewind last topic+partition+offset replay events kafka topic's partitions' offset positions working fine before 'bad logic'. how streaming apps rewound last 'good spot' (topic+partition+offset) when checkpointing in effect?

note: in (heart) logs, jay kreps writes using parallel consumer (group) process starts @ diverging kafka offset locations until caught original , killing original. (what 2nd spark streaming process respect starting partition/offset locations?)

sidebar: question may related mid-stream changing configuration check-pointed spark stream similar mechanism may need deployed.

you not going able rewind stream in running sparkstreamingcontext. it's important keep these points in mind (straight docs):

  • once context has been started, no new streaming computations can set or added it.
  • once context has been stopped, cannot restarted.
  • only 1 streamingcontext can active in jvm @ same time.
  • stop() on streamingcontext stops sparkcontext. stop streamingcontext, set optional parameter of stop() called stopsparkcontext false.
  • a sparkcontext can re-used create multiple streamingcontexts, long previous streamingcontext stopped (without stopping sparkcontext) before next streamingcontext created

instead, going have stop current stream, , create new one. can start stream specific set of offsets using 1 of versions of createdirectstream takes fromoffsets parameter signature map[topicandpartition, long] -- it's starting offset mapped topic , partition.

another theoretical possibility use kafkautils.createrdd takes offset ranges input. "bad logic" started @ offset x , fixed @ offset y. use cases, might want createrdd offsets x y , process results, instead of trying stream.


Comments

Popular posts from this blog

Load Balancing in Bluemix using custom domain and DNS SRV records -

oracle - pls-00402 alias required in select list of cursor to avoid duplicate column names -

python - Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>] error -