Great article. Two cents I would like to add:
All of the methods mentioned here only remove/defer the error for the Spark producer job (the one writing data to S3). Any consumer job that wants to read the data already written to S3 will still face one of the issues below:
1. If you create the blank 0 file:
Error: Exception in thread "main" java.lang.IllegalStateException: Failed to read log file /Spark-Warehouse/_spark_metadata/0. Incomplete log file
2. If you don't create the blank file:
a. If only 1 batch was present:
Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet at . It must be specified manually
b. If multiple batches were present and you deleted only 1 metadata file:
Error: Exception in thread "main" java.lang.IllegalStateException: /Documents/Spark-Warehouse/_spark_metadata/0 doesn't exist (latestId: 1, compactInterval: 10)
Hi @gupash
Thanks for your comment.
Indeed, if you create a blank 0 file, it will throw the error you posted. However, the dummy log file that I described in the article contains the string "v1". In that case, no error should be thrown on the reader's side. Maybe I could have pointed out this fact more clearly.
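For anyone following along, here is a minimal sketch of what that dummy file looks like. The warehouse path and the helper name are just illustrative assumptions; the only essential detail is that `_spark_metadata/0` starts with the version marker "v1" rather than being empty, so Spark's metadata log reader does not reject it as an incomplete log file:

```python
from pathlib import Path

def write_dummy_metadata(warehouse_dir: str) -> Path:
    """Create a _spark_metadata/0 log file containing only the "v1" header."""
    metadata_dir = Path(warehouse_dir) / "_spark_metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    log_file = metadata_dir / "0"
    # "v1" is the version header; an empty file triggers "Incomplete log file"
    log_file.write_text("v1")
    return log_file

if __name__ == "__main__":
    # Assumed local path for illustration; in practice this would be the S3 sink path.
    path = write_dummy_metadata("/tmp/Spark-Warehouse")
    print(path)
```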
I was missing files 0 through 5, so I just copied file 6, renamed the copies to 0 through 5, and that worked.