<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Wallimann</title>
    <description>The latest articles on DEV Community by Kevin Wallimann (@kevinwallimann).</description>
    <link>https://dev.to/kevinwallimann</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F429135%2F55d759e1-9b89-4fc5-9a67-d2fa89e85e8b.png</url>
      <title>DEV Community: Kevin Wallimann</title>
      <link>https://dev.to/kevinwallimann</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevinwallimann"/>
    <language>en</language>
    <item>
      <title>How to recover from a Kafka topic reset in Spark Structured Streaming</title>
      <dc:creator>Kevin Wallimann</dc:creator>
      <pubDate>Wed, 13 Jul 2022 20:40:43 +0000</pubDate>
      <link>https://dev.to/kevinwallimann/how-to-recover-from-a-kafka-topic-reset-in-spark-structured-streaming-3phd</link>
      <guid>https://dev.to/kevinwallimann/how-to-recover-from-a-kafka-topic-reset-in-spark-structured-streaming-3phd</guid>
      <description>&lt;p&gt;Kafka topics should not routinely be deleted and recreated or offsets reset. Should it be necessary, care must be taken how and when to update the offsets in Spark Structured Streaming's checkpoints, in order to avoid data loss.&lt;/p&gt;

&lt;p&gt;Since such an offset reset happens outside of Spark, the offsets in the checkpoints are not automatically updated to reflect the change. This may cause unexpected behaviour, because the offsets are no longer in the expected range.&lt;/p&gt;

&lt;h1&gt;
  
  
  When and how data loss may occur
&lt;/h1&gt;

&lt;p&gt;Let's assume a Spark Structured Streaming query with a once trigger (&lt;code&gt;Trigger.Once&lt;/code&gt;) has consumed 500 records from the Kafka topic &lt;code&gt;test-topic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The checkpointed offset file for micro-batch 0 contains the following:&lt;/p&gt;

&lt;p&gt;{checkpoint-dir}/offsets/0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"test-topic":{"0":500}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the Kafka topic, the beginning offset is 0 and the end offset is 500.&lt;/p&gt;

&lt;p&gt;So far, so good.&lt;/p&gt;
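
&lt;p&gt;The behaviour in the scenarios below can be modelled in a few lines of Python. This is only an illustration of the bookkeeping, not Spark's actual implementation: the start offset comes from the latest checkpoint, while the end offset is read from the topic itself.&lt;/p&gt;

```python
# Illustrative model of how the Kafka source picks the next offset range
# after a restart (not Spark's actual code).
def next_offset_range(checkpointed_end, topic_end):
    start = checkpointed_end             # taken from {checkpoint-dir}/offsets/N
    records = max(0, topic_end - start)  # records the next micro-batch reads
    # If the topic's end offset dropped below the checkpointed offset,
    # offsets must have been reset; Spark flags this as possible data loss.
    possible_data_loss = start > topic_end
    return start, records, possible_data_loss

# Scenarios from this article: the checkpoint says 500.
print(next_offset_range(500, 300))  # topic reset to 300: 0 records, loss flagged
print(next_offset_range(500, 800))  # topic reset to 800: 300 records, no warning
```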

&lt;h2&gt;
  
  
  New end offset &amp;lt; checkpointed offset
&lt;/h2&gt;

&lt;p&gt;Now, let's assume the Kafka topic has been deleted and recreated, and 300 messages have been produced, such that the new end offset on the topic is 300. If the Spark query is restarted, it will fail with the following error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java.lang.IllegalStateException: Partition test-topic-0's offset was changed from 500 to 300, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
 data was aged out by Kafka or the topic may have been deleted before all the data in the
 topic was processed. If you don't want your streaming query to fail on such cases, set the
 source option "failOnDataLoss" to "false".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on its checkpoint from batch 0, Spark would have expected messages with offsets 500 and higher, but only found lower offsets. Spark does not know if all messages have been consumed before the offset reset and by default assumes that messages could have been lost, failing with the above exception.&lt;/p&gt;

&lt;p&gt;What happens if the query is rerun with &lt;code&gt;failOnDataLoss=false&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;This time, Spark only prints a warning&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARN KafkaMicroBatchReader: Partition test-topic-0's offset was changed from 500 to 300, some data may have been missed.

Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the checkpoint folder, a new file for batchId 1 is created, which contains&lt;/p&gt;

&lt;p&gt;{checkpoint-dir}/offsets/1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"test-topic":{"0":300}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But how many records were consumed?&lt;/p&gt;

&lt;p&gt;Zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;22/07/13 10:29:59 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "5762b467-2d58-4b62-937b-427f99b38659",
  "runId" : "fc98564a-f2f0-42be-91e2-6d1f97446372",
  "name" : null,
  "timestamp" : "2022-07-13T08:29:57.863Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 709,
    "getBatch" : 45,
    "queryPlanning" : 301,
    "triggerExecution" : 1372
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[test]]",
    "startOffset" : {
      "test-topic" : {
        "0" : 500
      }
    },
    "endOffset" : {
      "test-topic" : {
        "0" : 300
      }
    },
    "numInputRows" : 0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.kafka010.KafkaSourceProvider@43bbf133"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see from the log, Spark takes 500 as the start offset for this micro-batch, i.e. the offset from the previous checkpoint rather than the topic-partition's current beginning offset. From that point of view, it makes sense that no record is ingested, but the 300 messages at offsets 0-299 are lost nonetheless.&lt;/p&gt;

&lt;h2&gt;
  
  
  New end offset &amp;gt; checkpointed offset
&lt;/h2&gt;

&lt;p&gt;It's also possible that the new end offset is greater than the offset in the latest checkpoint. For example, let's assume that the new end offset is 800. In this case, the user probably expects to ingest all 800 new records. However, if the Spark query is restarted, it will succeed but only ingest the 300 records at offsets 500-799. The log may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;22/07/13 16:39:46 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "5762b467-2d58-4b62-937b-427f99b38659",
  "runId" : "1822bd34-269c-4df0-8dbe-b63e19df0e77",
  "timestamp" : "2022-07-13T14:39:43.074Z",
  "batchId" : 2,
  "numInputRows" : 300,
  "processedRowsPerSecond" : 96.32674030310815,
  "durationMs" : {
    "addBatch" : 2747,
    "getBatch" : 7,
    "getEndOffset" : 0,
    "queryPlanning" : 315,
    "setOffsetRange" : 331,
    "triggerExecution" : 3892,
    "walCommit" : 213
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[test]]",
    "startOffset" : {
      "test-topic" : {
        "0" : 500
      }
    },
    "endOffset" : {
      "test-topic" : {
        "0" : 800
      }
    },
    "numInputRows" : 300,
    "processedRowsPerSecond" : 96.32674030310815
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.kafka010.KafkaSourceProvider@4554d7c4"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark will not even print a warning: from its perspective, nothing indicates that the offsets have been reset, and it looks as if 300 new records had simply been added to the topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to avoid data loss
&lt;/h2&gt;

&lt;p&gt;We have seen that in both cases, whether the new end offset is smaller or larger than the checkpointed offset, data loss may occur. Fundamentally, Spark cannot distinguish between offsets from before and after the recreation of a topic, so especially in the latter case, where the new end offset on the topic-partition is larger than the latest offset in the checkpoint, there is no general solution. One could manually modify the latest offset in the checkpoint or delete the checkpoints altogether; however, this may also require deleting the &lt;code&gt;_spark_metadata&lt;/code&gt; folder in the case of a file sink.&lt;/p&gt;

&lt;p&gt;A special case occurs when the new end offset is 0. In that case, there can be no data loss, because no new data has been produced to the topic yet. A possible strategy for deleting and recreating a topic could therefore be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure all data has been ingested from the topic.&lt;/li&gt;
&lt;li&gt;Delete and recreate the topic.&lt;/li&gt;
&lt;li&gt;Restart the Spark Structured Streaming query that consumes from the topic. Spark will write a new checkpoint with offset 0.&lt;/li&gt;
&lt;li&gt;Only now start producing to the recreated topic.&lt;/li&gt;
&lt;li&gt;In the next microbatch, Spark will consume from offset 0.&lt;/li&gt;
&lt;/ol&gt;
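
&lt;p&gt;Sketched with the stock Kafka CLI tools, the procedure could look like this. The broker address and topic settings are assumptions; adapt them to your cluster:&lt;/p&gt;

```shell
# Sketch of the strategy above, assuming a broker at localhost:9092.

# 1. Verify the query has caught up, then stop all producers.

# 2. Delete and recreate the topic.
kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic test-topic
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic test-topic \
  --partitions 1 --replication-factor 1

# 3. Restart the Spark query now, while the topic is still empty:
#    it writes a new checkpoint with offset 0 for every partition.

# 4. Only then resume producing; the next micro-batch reads from offset 0.
```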

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;The offsets in Spark Structured Streaming's checkpoints are not automatically updated when an offset reset happens on a Kafka topic.&lt;/li&gt;
&lt;li&gt;Queries with once triggers that are restarted periodically may be oblivious to an offset reset.&lt;/li&gt;
&lt;li&gt;The best way to keep Spark's offsets up-to-date is to restart the query before any new data has been published on the reset topic.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kafka</category>
      <category>spark</category>
    </item>
    <item>
      <title>How to recover from a deleted _spark_metadata folder in Spark Structured Streaming</title>
      <dc:creator>Kevin Wallimann</dc:creator>
      <pubDate>Thu, 11 Mar 2021 15:30:28 +0000</pubDate>
      <link>https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j</link>
      <guid>https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j</guid>
      <description>&lt;p&gt;Warning: The described procedures have been tested on Spark 2.4.3 and 3.0.1, but otherwise not on all possible environments. Be mindful of what you're doing on your system. Having said that, I'd be grateful for any feedback if you find caveats.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Spark Structured Streaming guarantees exactly-once processing for file outputs. One element in maintaining that guarantee is a folder called &lt;code&gt;_spark_metadata&lt;/code&gt;, which is located in the output folder. The folder &lt;code&gt;_spark_metadata&lt;/code&gt; is also known as the "Metadata Log" and its files as "metadata log files". It may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
/tmp/destination/_spark_metadata/2
/tmp/destination/_spark_metadata/3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A metadata log file may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v1
{"path":"file:///tmp/destination/part-00000-5ee05bb5-3c65-4028-9c9e-dbc99f5fdbca.c000.snappy.parquet","size":3919,"isDir":false,"modificationTime":1615462080000,"blockReplication":1,"blockSize":33554432,"action":"add"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Spark writes a file to the output folder, it writes the absolute path of the added file to the metadata log file of the current micro-batch.&lt;/p&gt;

&lt;p&gt;If a partial write occurs, that filename will not be added to the metadata log, and that's how Spark can maintain exactly-once semantics.&lt;/p&gt;

&lt;p&gt;When Spark reads a file from the output folder, it only reads from files that are referenced in the metadata log. At least that's the idea. For more details on that topic, see &lt;a href="https://dev.to/kevinwallimann/is-structured-streaming-exactly-once-well-it-depends-noe"&gt;https://dev.to/kevinwallimann/is-structured-streaming-exactly-once-well-it-depends-noe&lt;/a&gt;&lt;/p&gt;
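
&lt;p&gt;To see which output files a query considers committed, a metadata log file can be parsed by hand. The following is a minimal Python sketch for the v1 format shown above; it is not the reader Spark itself uses (&lt;code&gt;CompactibleFileStreamLog&lt;/code&gt;):&lt;/p&gt;

```python
import json
import os
import tempfile

# Minimal reader for a v1 metadata log file: a "v1" header line followed
# by one JSON object per committed output file.
def referenced_files(log_path):
    with open(log_path) as f:
        header = f.readline().strip()
        assert header == "v1", "unexpected metadata log version"
        entries = [json.loads(line) for line in f if line.strip()]
    # Only files recorded with action "add" count as committed output.
    return [e["path"] for e in entries if e.get("action") == "add"]

# Demo with a synthetic log file (path and contents are made up):
log = os.path.join(tempfile.mkdtemp(), "0")
with open(log, "w") as f:
    f.write('v1\n{"path":"file:///tmp/destination/part-00000.snappy.parquet",'
            '"size":3919,"isDir":false,"action":"add"}\n')
print(referenced_files(log))
```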

&lt;h1&gt;
  
  
  Deleting the &lt;code&gt;_spark_metadata&lt;/code&gt; folder
&lt;/h1&gt;

&lt;p&gt;I hope it's clear by now that this folder should not be deleted. It should not be deleted!&lt;/p&gt;

&lt;p&gt;Anyway, let's see what happens if we delete it nonetheless.&lt;br&gt;
For this scenario, let's assume we have a structured streaming query, writing to a folder called &lt;code&gt;/tmp/destination&lt;/code&gt; and a checkpoint folder called &lt;code&gt;/tmp/checkpoint-location&lt;/code&gt;. After two micro-batches, the folder structure for the checkpoint-folder and the &lt;code&gt;_spark_metadata&lt;/code&gt; folder looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/checkpoint-location/commits
/tmp/checkpoint-location/commits/0
/tmp/checkpoint-location/commits/1
/tmp/checkpoint-location/metadata
/tmp/checkpoint-location/offsets
/tmp/checkpoint-location/offsets/0
/tmp/checkpoint-location/offsets/1
/tmp/checkpoint-location/sources
/tmp/checkpoint-location/sources/0
/tmp/checkpoint-location/sources/0/0

/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, for some reason, the &lt;code&gt;_spark_metadata&lt;/code&gt; folder in the destination is deleted or moved, but the corresponding checkpoint folder is not.&lt;/p&gt;

&lt;p&gt;The following exception will be thrown sooner or later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Caused by: java.lang.IllegalStateException: /tmp/destination/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.$anonfun$compact$3(CompactibleFileStreamLog.scala:187)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.$anonfun$compact$2(CompactibleFileStreamLog.scala:185)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.$anonfun$compact$2$adapted(CompactibleFileStreamLog.scala:183)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.$anonfun$compact$1(CompactibleFileStreamLog.scala:183)
    at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:181)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:156)
    at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:75)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:215)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the checkpoint folder, we see the following files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/checkpoint-location/commits
/tmp/checkpoint-location/commits/0
/tmp/checkpoint-location/commits/1
/tmp/checkpoint-location/commits/2
/tmp/checkpoint-location/commits/3
/tmp/checkpoint-location/commits/4
/tmp/checkpoint-location/commits/5
/tmp/checkpoint-location/commits/6
/tmp/checkpoint-location/commits/7
/tmp/checkpoint-location/commits/8
/tmp/checkpoint-location/metadata
/tmp/checkpoint-location/offsets
/tmp/checkpoint-location/offsets/0
/tmp/checkpoint-location/offsets/1
/tmp/checkpoint-location/offsets/2
/tmp/checkpoint-location/offsets/3
/tmp/checkpoint-location/offsets/4
/tmp/checkpoint-location/offsets/5
/tmp/checkpoint-location/offsets/6
/tmp/checkpoint-location/offsets/7
/tmp/checkpoint-location/offsets/8
/tmp/checkpoint-location/offsets/9
/tmp/checkpoint-location/sources
/tmp/checkpoint-location/sources/0
/tmp/checkpoint-location/sources/0/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, the destination folder contains&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/destination/_spark_metadata/2
/tmp/destination/_spark_metadata/3
/tmp/destination/_spark_metadata/4
/tmp/destination/_spark_metadata/5
/tmp/destination/_spark_metadata/6
/tmp/destination/_spark_metadata/7
/tmp/destination/_spark_metadata/8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the &lt;code&gt;_spark_metadata&lt;/code&gt; folder is missing the files &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;, which were previously deleted.&lt;br&gt;
Instead of simply writing &lt;code&gt;/tmp/destination/_spark_metadata/9&lt;/code&gt;, Spark tries to concatenate the files &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, ..., &lt;code&gt;8&lt;/code&gt; into a file called &lt;code&gt;9.compact&lt;/code&gt; to improve reading efficiency and avoid the small-files problem. This process is called log compaction. That's when the exception is thrown, because the files &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt; unexpectedly don't exist. Log compaction doesn't happen in every micro-batch; its frequency is determined by the &lt;code&gt;compactInterval&lt;/code&gt; setting, which is 10 by default.&lt;/p&gt;
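
&lt;p&gt;The compaction schedule can be expressed in one line. The following Python snippet mirrors the rule used by Spark's &lt;code&gt;CompactibleFileStreamLog&lt;/code&gt;, shown here purely for illustration:&lt;/p&gt;

```python
# With the default compactInterval of 10, batches 9, 19, 29, ... are
# compaction batches, which matches "compacting batch 9" in the error above.
def is_compaction_batch(batch_id, compact_interval=10):
    return (batch_id + 1) % compact_interval == 0

print([b for b in range(30) if is_compaction_batch(b)])  # [9, 19, 29]
```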
&lt;h1&gt;
  
  
  How to fix the problem
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;1. Restore the files of the removed &lt;code&gt;_spark_metadata&lt;/code&gt; folder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;_spark_metadata&lt;/code&gt; folder was only moved and its files can be recovered, move them back into the recreated &lt;code&gt;_spark_metadata&lt;/code&gt; folder. There should be no overlapping filenames.&lt;/p&gt;

&lt;p&gt;After restoring the files, the &lt;code&gt;_spark_metadata&lt;/code&gt; folder should look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
/tmp/destination/_spark_metadata/2
/tmp/destination/_spark_metadata/3
/tmp/destination/_spark_metadata/4
/tmp/destination/_spark_metadata/5
/tmp/destination/_spark_metadata/6
/tmp/destination/_spark_metadata/7
/tmp/destination/_spark_metadata/8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the query can be restarted and should finish without errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create dummy log files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.&lt;br&gt;
In our example, this could be done like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;0..1&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;v1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/destination/_spark_metadata/&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or on HDFS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;0..1&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;v1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; hdfs dfs &lt;span class="nt"&gt;-copyFromLocal&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/destination/_spark_metadata/&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create the files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the query can be restarted and should finish without errors.&lt;/p&gt;

&lt;p&gt;Note that the information from the metadata log files &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt; is definitely lost, so the exactly-once guarantee no longer holds for micro-batches 0 and 1, and you need to address that problem separately. But at least the query can continue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deferring compaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it's the middle of the night and you simply need the query to continue, or you have no write access to the filesystem, you can buy yourself some time by deferring the compaction. However, this does not address the root cause.&lt;/p&gt;

&lt;p&gt;By default, the &lt;code&gt;compactInterval&lt;/code&gt; is 10. You can increase it to e.g. 100 by restarting the query with this additional config&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-submit &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.streaming.fileSink.log.compactInterval&lt;span class="o"&gt;=&lt;/span&gt;100 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same exception will be thrown again at the next compaction batch, so this is really just a temporary fix to keep the query running for a few more micro-batches.&lt;/p&gt;

&lt;p&gt;Eventually, the missing log files have to be recreated.&lt;/p&gt;

</description>
      <category>spark</category>
    </item>
    <item>
      <title>Upgrading ABRiS from version 3 to version 4</title>
      <dc:creator>Kevin Wallimann</dc:creator>
      <pubDate>Tue, 15 Dec 2020 10:56:04 +0000</pubDate>
      <link>https://dev.to/kevinwallimann/upgrading-abris-from-version-3-to-version-4-4gl4</link>
      <guid>https://dev.to/kevinwallimann/upgrading-abris-from-version-3-to-version-4-4gl4</guid>
      <description>&lt;p&gt;With release &lt;a href="https://github.com/AbsaOSS/ABRiS/releases/tag/v4.0.1"&gt;v4.0.1&lt;/a&gt;, a new fluent API was introduced to ABRiS to reduce configuration errors and provide more type safety. While this change is a huge improvement going forward, it causes a breaking change for users migrating from version 3. This article walks you through an upgrade of some common use-cases of ABRiS.&lt;/p&gt;

&lt;p&gt;More information can be found on the &lt;a href="https://github.com/AbsaOSS/ABRiS"&gt;GitHub page&lt;/a&gt;. More usage examples can be found on the &lt;a href="https://github.com/AbsaOSS/ABRiS/tree/master/documentation"&gt;documentation pages&lt;/a&gt;. Documentation for version 3 can be found under &lt;a href="https://github.com/AbsaOSS/ABRiS/tree/branch-3.2"&gt;branch 3.2&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading records
&lt;/h1&gt;

&lt;p&gt;A common use-case is to read data from a topic with both key and value schema. In ABRiS 3, this could be done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;keyConfig&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_TOPIC&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"example_topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_URL&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_KEY_SCHEMA_NAMING_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"topic.name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_KEY_SCHEMA_ID&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;valueConfig&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_TOPIC&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"example_topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_URL&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_NAMING_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"topic.record.name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_ID&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_NAME_FOR_RECORD_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"record.name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"record.namespace"&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.functions.from_confluent_avro&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;from_confluent_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;keyConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;from_confluent_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;valueConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With ABRiS 4, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;keyConfig&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FromAvroConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AbrisConfig&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromConfluentAvro&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;downloadReaderSchemaByLatestVersion&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;andTopicNameStrategy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topicName"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isKey&lt;/span&gt;&lt;span class="k"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;usingSchemaRegistry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;valueConfig&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FromAvroConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AbrisConfig&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromConfluentAvro&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;downloadReaderSchemaByLatestVersion&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;andTopicRecordNameStrategy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topicName"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"record.name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"record.namespace"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;usingSchemaRegistry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.functions.from_avro&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
   &lt;span class="nf"&gt;from_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;keyConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="nf"&gt;from_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;valueConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First and foremost, a new object was introduced, &lt;code&gt;AbrisConfig&lt;/code&gt;. This is the entry point for the new fluent API.&lt;/p&gt;

&lt;p&gt;Second, the method &lt;code&gt;from_confluent_avro&lt;/code&gt; was removed and should be replaced with &lt;code&gt;from_avro&lt;/code&gt;. To use the Confluent format, specify &lt;code&gt;.fromConfluentAvro&lt;/code&gt; on &lt;code&gt;AbrisConfig&lt;/code&gt;. If you've been using plain vanilla Avro, choose &lt;code&gt;.fromSimpleAvro&lt;/code&gt; instead.&lt;/p&gt;

&lt;p&gt;Third, notice the second parameter of &lt;code&gt;.andTopicNameStrategy&lt;/code&gt;. The default value of &lt;code&gt;isKey&lt;/code&gt; is &lt;code&gt;false&lt;/code&gt;, which is correct for value schemas. For key schemas, however, &lt;code&gt;isKey&lt;/code&gt; must be set to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
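&lt;p&gt;For context, the Confluent format that &lt;code&gt;.fromConfluentAvro&lt;/code&gt; refers to is the Confluent wire format: each Kafka message is the Avro payload prefixed with a magic byte (0) and the 4-byte big-endian schema id. Here is a rough, framework-free Python sketch of that framing (for illustration only, not part of ABRiS):&lt;/p&gt;

```python
import struct

MAGIC_BYTE = 0

def wrap_confluent(schema_id, avro_payload):
    # Confluent wire format: 1 magic byte (0) + 4-byte big-endian schema id + Avro payload
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unwrap_confluent(message):
    # Split a Confluent-framed message into (schema_id, avro_payload)
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message[5:]

framed = wrap_confluent(42, b"\x02\x06foo")
assert unwrap_confluent(framed) == (42, b"\x02\x06foo")
```

&lt;p&gt;This header is also why &lt;code&gt;.fromSimpleAvro&lt;/code&gt; exists as a separate option: plain Avro payloads carry no such prefix.&lt;/p&gt;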

&lt;h1&gt;
  
  
  Writing records
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Using an existing schema
&lt;/h2&gt;

&lt;p&gt;In ABRiS 3, writing records providing an existing schema id could be done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writeAvro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFrame&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_TOPIC&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"example_topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_URL&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_NAMING_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"topic.record.name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_ID&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"42"&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.functions.to_confluent_avro&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;tail&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="kt"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;to_confluent_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allColumns&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In ABRiS 4, it's like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writeAvro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFrame&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ToAvroConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AbrisConfig&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toConfluentAvro&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;downloadSchemaById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;usingSchemaRegistry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.functions.to_avro&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;tail&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="kt"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;to_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allColumns&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here again, the method &lt;code&gt;to_confluent_avro&lt;/code&gt; was removed; instead, configure &lt;code&gt;.toConfluentAvro&lt;/code&gt; on &lt;code&gt;AbrisConfig&lt;/code&gt; and use &lt;code&gt;to_avro&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating the schema from the records
&lt;/h2&gt;

&lt;p&gt;In ABRiS 3, it was incredibly easy (too easy!) to have ABRiS generate the schema from the records if you didn't provide one, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writeAvro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFrame&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_TOPIC&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"example_topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_SCHEMA_REGISTRY_URL&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;SchemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;PARAM_VALUE_SCHEMA_NAMING_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"topic.record.name"&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;tail&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="kt"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;to_confluent_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allColumns&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generating and registering the schema had to happen during the evaluation of the Spark expression, which was inefficient. Therefore, this functionality was removed in v4: the schema now needs to be registered before the evaluation phase, and its id passed to the ABRiS config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.avro.SchemaConverters.toAvroType&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.read.confluent.SchemaManagerFactory&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.abris.avro.registry.SchemaSubject&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writeAvro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFrame&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// generate schema&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="kt"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;expression&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;expr&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toAvroType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;expression&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;dataType&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;expression&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;// register schema&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;schemaRegistryClientConfig&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; 
&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;AbrisConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;SCHEMA_REGISTRY_URL&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;schemaManager&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;SchemaManagerFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schemaRegistryClientConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;subject&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;SchemaSubject&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;usingTopicNameStrategy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isKey&lt;/span&gt;&lt;span class="k"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;schemaId&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;schemaManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;register&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;// create config&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AbrisConfig&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toConfluentAvro&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;downloadSchemaById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schemaId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;usingSchemaRegistry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8081"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;tail&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="kt"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;dataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;to_avro&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allColumns&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="ss"&gt;'value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that we used the topic name strategy in this example. &lt;code&gt;SchemaSubject&lt;/code&gt; also offers methods for the record name strategy (&lt;code&gt;.usingRecordNameStrategy&lt;/code&gt;) and the topic record name strategy (&lt;code&gt;.usingTopicRecordNameStrategy&lt;/code&gt;).&lt;/p&gt;
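&lt;p&gt;The chosen strategy determines the subject name under which the schema is registered and looked up in Schema Registry. A minimal Python sketch of the three naming conventions, following Confluent's documented rules (for illustration only):&lt;/p&gt;

```python
def topic_name_strategy(topic, is_key=False):
    # TopicNameStrategy: subject is derived from the topic alone
    return topic + ("-key" if is_key else "-value")

def record_name_strategy(record_name, record_namespace):
    # RecordNameStrategy: subject is the fully-qualified record name
    return record_namespace + "." + record_name

def topic_record_name_strategy(topic, record_name, record_namespace):
    # TopicRecordNameStrategy: topic plus fully-qualified record name
    return topic + "-" + record_namespace + "." + record_name

subject = topic_name_strategy("example_topic", is_key=True)  # "example_topic-key"
```

&lt;p&gt;This also makes clear why &lt;code&gt;isKey&lt;/code&gt; only matters for the topic name strategy: it is the only strategy whose subject name distinguishes keys from values.&lt;/p&gt;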

</description>
    </item>
    <item>
      <title>Is Structured Streaming Exactly-Once? Well, it depends...</title>
      <dc:creator>Kevin Wallimann</dc:creator>
      <pubDate>Fri, 06 Nov 2020 10:06:53 +0000</pubDate>
      <link>https://dev.to/kevinwallimann/is-structured-streaming-exactly-once-well-it-depends-noe</link>
      <guid>https://dev.to/kevinwallimann/is-structured-streaming-exactly-once-well-it-depends-noe</guid>
      <description>&lt;h1&gt;
  
  
  TLDR
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Yes, but only for the file sink, not for the Kafka sink or the Foreach sink&lt;/li&gt;
&lt;li&gt;Whether you actually get exactly-once semantics depends on how you read from the sink&lt;/li&gt;
&lt;li&gt;If you read using a globbed path or directly from a partition subdirectory, exactly-once semantics does not apply&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Exactly-once semantics depends on the reader
&lt;/h1&gt;

&lt;p&gt;One of the key features of Spark Structured Streaming is its support for exactly-once semantics, meaning that no row will be missing or duplicated in the sink after recovery from failure.&lt;/p&gt;

&lt;p&gt;As per the documentation, this feature is only available for the file sink, while the Kafka sink and the Foreach sink only support at-least-once semantics (&lt;a href="https://spark.apache.org/docs/3.0.0/structured-streaming-programming-guide.html#output-sinks"&gt;https://spark.apache.org/docs/3.0.0/structured-streaming-programming-guide.html#output-sinks&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Let's demonstrate exactly-once semantics using the spark-shell:&lt;br&gt;
First, we'll write some streaming data to a destination. We add a literal column and partition by it, just for the sake of having a partition subdirectory. Finally, we repartition the dataframe to get multiple parquet files in the output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; import org.apache.spark.sql.execution.streaming.MemoryStream
scala&amp;gt; import org.apache.spark.sql.streaming.Trigger
scala&amp;gt; import org.apache.spark.sql.functions._
scala&amp;gt; val input = MemoryStream[Int](1, spark.sqlContext)
scala&amp;gt; input.addData(1 to 100)
scala&amp;gt; val df = input.toDF().
     | withColumn("partition1", lit("value1")).
     | repartition(4)
scala&amp;gt; val query = df.writeStream.
     | partitionBy("partition1").
     | trigger(Trigger.Once).
     | option("checkpointLocation", "/tmp/checkpoint").
     | format("parquet").
     | start("/tmp/destination")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can go ahead and count those values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; query.awaitTermination()
scala&amp;gt; spark.read.parquet("/tmp/destination").count
res1: Long = 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, we get 100 as the result.&lt;/p&gt;

&lt;p&gt;We should now see 4 parquet files in the destination and one file in the metadata log, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% ls -R /tmp/destination
_spark_metadata   partition1=value1

/tmp/destination/_spark_metadata:
0

/tmp/destination/partition1=value1:
part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000.snappy.parquet
part-00001-d2b67dae-3fe9-40ed-8e6a-75c4a36e8300.c000.snappy.parquet
part-00002-275dd640-4148-4947-96ca-3cad4feae215.c000.snappy.parquet
part-00003-bd18be1e-3906-4c49-905b-a9d1c37d3282.c000.snappy.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metadata log file should reference exactly the four files, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% cat /tmp/destination/_spark_metadata/0
v1
{"path":"file:///tmp/destination/partition1=value1/part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000.snappy.parquet","size":498,"isDir":false,"modificationTime":1604655052000,"blockReplication":1,"blockSize":33554432,"action":"add"}
{"path":"file:///tmp/destination/partition1=value1/part-00001-d2b67dae-3fe9-40ed-8e6a-75c4a36e8300.c000.snappy.parquet","size":498,"isDir":false,"modificationTime":1604655052000,"blockReplication":1,"blockSize":33554432,"action":"add"}
{"path":"file:///tmp/destination/partition1=value1/part-00002-275dd640-4148-4947-96ca-3cad4feae215.c000.snappy.parquet","size":498,"isDir":false,"modificationTime":1604655052000,"blockReplication":1,"blockSize":33554432,"action":"add"}
{"path":"file:///tmp/destination/partition1=value1/part-00003-bd18be1e-3906-4c49-905b-a9d1c37d3282.c000.snappy.parquet","size":498,"isDir":false,"modificationTime":1604655052000,"blockReplication":1,"blockSize":33554432,"action":"add"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can simulate a partial write by copying one of the four parquet files. Now, there are 5 parquet files in the destination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% cd /tmp/destination/partition1=value1
% cp part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000.snappy.parquet part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000-copy.snappy.parquet
% ls /tmp/destination/partition1=value1
part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000-copy.snappy.parquet
part-00000-54c74e55-7cdb-44f0-9c6f-2db62e2901aa.c000.snappy.parquet
part-00001-d2b67dae-3fe9-40ed-8e6a-75c4a36e8300.c000.snappy.parquet
part-00002-275dd640-4148-4947-96ca-3cad4feae215.c000.snappy.parquet
part-00003-bd18be1e-3906-4c49-905b-a9d1c37d3282.c000.snappy.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reading with globbed paths
&lt;/h2&gt;

&lt;p&gt;Exactly-once semantics guarantees that we will still only read 100 rows. Let's check that&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; spark.read.parquet("/tmp/destination").count
res2: Long = 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, we get 100. &lt;/p&gt;

&lt;p&gt;What about this query (notice the star)?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; spark.read.parquet("/tmp/destination/*").count
res3: Long = 125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, this is why exactly-once semantics depends on how you read. As shown above, the destination directory really does contain 5 parquet files. When reading without a globbed path, Spark consults the _spark_metadata directory (the metadata log) and only reads the parquet files listed there. With a globbed path, the metadata log is not consulted, so exactly-once semantics does not apply and we read duplicated data.&lt;/p&gt;
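&lt;p&gt;To make this concrete, here is a rough Python sketch of the filtering a metadata-aware reader performs, based on the &lt;code&gt;_spark_metadata/0&lt;/code&gt; file shown above (a &lt;code&gt;v1&lt;/code&gt; header followed by one JSON entry per line). This is an illustration, not Spark's actual implementation:&lt;/p&gt;

```python
import json

def committed_files(metadata_log_text):
    # Parse one batch file of the metadata log: skip the "v1" version header,
    # then keep the path of every entry whose action is "add".
    lines = metadata_log_text.strip().splitlines()
    entries = [json.loads(line) for line in lines[1:]]
    return [e["path"] for e in entries if e.get("action") == "add"]

log = (
    'v1\n'
    '{"path":"file:///tmp/destination/partition1=value1/part-00000.parquet","action":"add"}\n'
)
assert committed_files(log) == ["file:///tmp/destination/partition1=value1/part-00000.parquet"]
```

&lt;p&gt;The copied file never appears in the log, so a reader that filters on the log ignores it, while a reader that merely lists the directory picks it up.&lt;/p&gt;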

&lt;h2&gt;
  
  
  Reading from partition subdirectory
&lt;/h2&gt;

&lt;p&gt;What about filtering by the partition? All of our values are in the same partition, so we should count 100 elements when we filter for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; spark.read.parquet("/tmp/destination").filter("partition1='value1'").count
res4: Long = 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And indeed, it works as expected. Now, in non-streaming Spark you could also read directly from the partition subdirectory and arrive at the same result. Does this work with streaming as well?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scala&amp;gt; spark.read.parquet("/tmp/destination/partition1=value1").count
res5: Long = 125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No. The reason is the same as above: Spark does not consult the metadata log when you read from a subdirectory, and therefore cannot determine whether any of the parquet files stem from partial writes and are possible duplicates.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As we have seen, exactly-once semantics is only guaranteed when the _spark_metadata directory is consulted. In other words, it depends on the reader whether exactly-once semantics applies. In the case of Spark, the metadata log is only consulted when you read from the root directory, without globbed paths. Whether this behaviour is a bug or a feature is not entirely clear to me. In practice, parquet files from partial writes should occur only rarely since Spark 3.0, which added a best-effort cleanup on task abortion (&lt;a href="https://issues.apache.org/jira/browse/SPARK-27210"&gt;https://issues.apache.org/jira/browse/SPARK-27210&lt;/a&gt;). However, it's important to stress that this cleanup is best-effort, not guaranteed.&lt;/p&gt;

</description>
      <category>spark</category>
    </item>
    <item>
      <title>How to make a column non-nullable in Spark Structured Streaming</title>
      <dc:creator>Kevin Wallimann</dc:creator>
      <pubDate>Sat, 11 Jul 2020 10:06:30 +0000</pubDate>
      <link>https://dev.to/kevinwallimann/how-to-make-a-column-non-nullable-in-spark-structured-streaming-4b62</link>
      <guid>https://dev.to/kevinwallimann/how-to-make-a-column-non-nullable-in-spark-structured-streaming-4b62</guid>
      <description>&lt;h1&gt;
  
  
  TLDR
&lt;/h1&gt;

&lt;p&gt;Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.functions.col

dataFrame
  .withColumn(columnName, new Column(AssertNotNull(col(columnName).expr)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Changing column nullability in Batch mode
&lt;/h1&gt;

&lt;p&gt;For Spark in Batch mode, one way to change column nullability is by creating a new dataframe with a new schema that has the desired nullability. &lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt; val schema = dataframe.schema
 // modify [[StructField] with name `cn`
 val newSchema = StructType(schema.map {
   case StructField( c, t, _, m) if c.equals(cn) 
        =&amp;gt;  StructField( c, t, nullable = nullable, m)
   case y: StructField =&amp;gt; y
 })
 // apply new schema
 df.sqlContext.createDataFrame( df.rdd, newSchema )
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/a/33195510/13532243"&gt;https://stackoverflow.com/a/33195510/13532243&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, this approach is not supported for a Structured Streaming dataframe and fails with the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Make a column nullable in structured streaming
&lt;/h1&gt;

&lt;p&gt;In the same stackoverflow thread, another answer shows how to make a non-nullable column nullable, and this approach works for Structured Streaming queries.&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataframe.withColumn("col_name", when(col("col_name").isNotNull,
  col("col_name")).otherwise(lit(null)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/a/46119565/13532243"&gt;https://stackoverflow.com/a/46119565/13532243&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a neat trick: Spark has to account for the possibility that a value could be null and therefore marks the column nullable, even though the column may not contain any null values in practice.&lt;/p&gt;

&lt;h1&gt;
  
  
  Make a column non-nullable in structured streaming
&lt;/h1&gt;

&lt;p&gt;If you know that a nullable column in fact only contains non-nullable values, you may want to make that column non-nullable. Here's the trick with &lt;code&gt;AssertNotNull&lt;/code&gt; again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import  org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.functions.col

dataFrame
  .withColumn(columnName, new Column(AssertNotNull(col(columnName).expr)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;How does it work? Looking at its implementation &lt;a href="https://github.com/apache/spark/blob/3fdfce3120f307147244e5eaf46d61419a723d50/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1591-L1628"&gt;https://github.com/apache/spark/blob/3fdfce3120f307147244e5eaf46d61419a723d50/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1591-L1628&lt;/a&gt;, the key is that &lt;code&gt;AssertNotNull&lt;/code&gt; overrides &lt;code&gt;nullable&lt;/code&gt; and always returns &lt;code&gt;false&lt;/code&gt;. That's how Spark determines this column to be non-nullable. Of course, if your column unexpectedly contains null values, the query will fail with a &lt;code&gt;NullPointerException&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>spark</category>
    </item>
  </channel>
</rss>
