In Spark, is it possible to share data between two executors?

I have a very large read-only dataset that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Under the hood, does a broadcast share the data between executors on the same node? If so, how is the data shared between the JVMs of the executors running on that node?


Answer 1:

Yes, you can use broadcast variables, given that your data is read-only (immutable). A broadcast variable must satisfy the following properties:

  • Fit in memory
  • Immutable
  • Distributed to the cluster

So the only real condition here is that your data has to fit in memory on one node. That means the data should NOT be anything extremely large or beyond the memory limits, such as a massive table.

Each executor receives its own copy of the broadcast variable, and all the tasks in that executor read that copy. Executors are separate JVMs, so two executors on the same node each hold their own copy rather than sharing one. It’s like sending a large, read-only piece of data to all the worker nodes in the cluster, i.e., it is shipped to each executor only once instead of with every task, and the executor’s tasks read the data from there.
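
A minimal PySpark sketch of this pattern, assuming an existing SparkSession is passed in as `spark`. The function names `lookup_names` and `run_with_broadcast` and the small lookup table are hypothetical, chosen only for illustration. The table is broadcast once, and every task reads it through `bc.value`:

```python
def lookup_names(codes, table):
    """Resolve each code against a read-only lookup table."""
    return [table.get(code, "unknown") for code in codes]


def run_with_broadcast(spark, codes, table):
    """Broadcast `table` once and resolve `codes` on the executors.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    sc = spark.sparkContext
    bc = sc.broadcast(table)  # shipped to each executor once, not with every task
    try:
        return (
            sc.parallelize(codes)
            # Each task reads the executor-local copy through bc.value.
            .mapPartitions(lambda part: lookup_names(part, bc.value))
            .collect()
        )
    finally:
        bc.unpersist()  # drop the cached copies on the executors
```

With a local session, for example one created by `SparkSession.builder.master("local[2]").getOrCreate()`, calling `run_with_broadcast(spark, ["US", "XX"], {"US": "United States"})` should give the same result as calling `lookup_names` directly on the driver.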

Answer 2:

I assume you are asking how executors can share mutable state. If you only need to share immutable data, you can just refer to @Stanislav’s answer.
If you need mutable state shared between executors, there are quite a few approaches:

  1. shared external FS/DB
  2. stateful streaming (see the Databricks doc)
  3. mutable distributed shared cache (e.g. Ignite RDD)
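
As a rough sketch of the first approach, the snippet below keeps shared counters in an external store. SQLite from the Python standard library is used here only as a stand-in, and that is an assumption: on a real cluster the store would have to be a database or file system reachable from every executor. The table name `counters` and the helper names are hypothetical.

```python
import sqlite3


def init_store(db_path):
    """Create the shared counters table if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS counters ("
                "key TEXT PRIMARY KEY, value INTEGER NOT NULL)"
            )
    finally:
        conn.close()


def add_counts(db_path, keys):
    """Increment the counter for every key, using one connection and one transaction."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait if another writer holds the lock
    try:
        with conn:
            for key in keys:
                conn.execute(
                    "INSERT OR IGNORE INTO counters (key, value) VALUES (?, 0)", (key,)
                )
                conn.execute(
                    "UPDATE counters SET value = value + 1 WHERE key = ?", (key,)
                )
    finally:
        conn.close()


def read_counts(db_path):
    """Return the current counters as a dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM counters"))
    finally:
        conn.close()
```

In a Spark job, a helper like `add_counts` would typically be called from `rdd.foreachPartition(lambda part: add_counts(db_path, part))`, so that each partition opens a single connection instead of one per record.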