DEV Community

Discussion on: How to recover from a deleted _spark_metadata folder in Spark Structured Streaming

Ashish (@gupash) • Edited

Great article! Two cents I'd like to add:
All of the methods mentioned here only remove or defer the error for the Spark producer job (the one writing data to S3). Any consumer job that wants to read the data already written to S3 will still face one of the issues below:
1. If you create a blank `0` file:

```
Exception in thread "main" java.lang.IllegalStateException: Failed to read log file /Spark-Warehouse/_spark_metadata/0. Incomplete log file
```

2. If you don't create the blank file:

a. If only one batch was present:

```
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet at . It must be specified manually
```

b. If multiple batches were present and you deleted only one metadata file:

```
Exception in thread "main" java.lang.IllegalStateException: /Documents/Spark-Warehouse/_spark_metadata/0 doesn't exist (latestId: 1, compactInterval: 10)
```
Kevin Wallimann

Hi @gupash,
Thanks for your comment.
Indeed, if you create a blank 0 file, it will throw the error you posted. However, the dummy log file I described in the article contains the string "v1". In that case, no error should be thrown on the reader's side. Maybe I could have pointed this out more clearly.
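To make the difference concrete, here is a minimal sketch of creating such a dummy log file. The metadata path is hypothetical; point it at your own sink's `_spark_metadata` directory. Per the article, a file containing just the version header `v1` is read as a valid (empty) batch log, whereas a fully blank file triggers the "Incomplete log file" error:

```python
from pathlib import Path

# Hypothetical location; use your streaming sink's _spark_metadata directory.
metadata_dir = Path("/tmp/Spark-Warehouse/_spark_metadata")
metadata_dir.mkdir(parents=True, exist_ok=True)

# A blank file fails with "Incomplete log file"; a file holding only the
# version header "v1" is accepted as a batch log with no entries.
(metadata_dir / "0").write_text("v1")
```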

jmagana2000

I was missing files 0 through 5, so I just copied file 6, renamed the copies to 0 through 5, and that worked.
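That copy-and-rename workaround can be sketched as follows. The directory path is hypothetical, and the block creates a stand-in for the surviving batch log `6` so it runs on its own; note that the recreated logs 0 through 5 simply repeat file 6's contents, so this is a stopgap rather than a faithful reconstruction of the lost batches:

```python
import shutil
from pathlib import Path

# Hypothetical location; use your streaming sink's _spark_metadata directory.
metadata_dir = Path("/tmp/Spark-Warehouse/_spark_metadata")
metadata_dir.mkdir(parents=True, exist_ok=True)

# Demo stand-in for the surviving batch log "6" (version header only).
surviving = metadata_dir / "6"
if not surviving.exists():
    surviving.write_text("v1")

# Recreate the missing batch logs 0-5 as copies of the surviving file 6.
for batch_id in range(6):
    shutil.copyfile(surviving, metadata_dir / str(batch_id))
```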