Off-Heap persistence

One of the most important capabilities in Spark is persisting (or caching) datasets in memory across operations. Each persisted RDD can be stored using a different storage level. One option is to store RDDs off-heap in serialized form. Compared to storing data inside the Spark executor JVM, off-heap storage reduces garbage collection overhead, allows executors to be smaller, and lets them share a pool of memory. This makes it attractive in environments with large heaps or multiple concurrent applications.
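As a rough illustration of how storage levels differ, the sketch below prints the flags that Spark's StorageLevel class exposes for a few predefined levels. It only assumes that spark-core is on the classpath. The object name StorageLevelFlags is made up for this example, and the exact flag values depend on the Spark version, so no output is shown.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of Spark or InsightEdge.
object StorageLevelFlags {
  def main(args: Array[String]): Unit = {
    // A few of the predefined storage levels shipped with Spark.
    val levels = Seq(
      "MEMORY_ONLY"     -> StorageLevel.MEMORY_ONLY,
      "MEMORY_ONLY_SER" -> StorageLevel.MEMORY_ONLY_SER,
      "OFF_HEAP"        -> StorageLevel.OFF_HEAP
    )

    // Each level is a combination of flags: where the blocks live
    // and whether they are kept as deserialized objects.
    for ((name, level) <- levels) {
      println(
        s"$name: useMemory=${level.useMemory}, " +
        s"useOffHeap=${level.useOffHeap}, " +
        s"deserialized=${level.deserialized}"
      )
    }
  }
}
```

No SparkContext is needed to run this, because it only inspects the level definitions and does not persist any data.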

InsightEdge provides the capability to store RDDs off-heap in the Data Grid.

Off-Heap configuration

To configure Data Grid Off-Heap persistence, set SparkConf’s spark.externalBlockStore.blockManager property to

val sparkConf = new SparkConf()
  .set("spark.externalBlockStore.blockManager", "")

RDD persistence

To persist RDDs with the OFF_HEAP storage level, you can use the regular Spark API:

val sc = new SparkContext(sparkConf)

val rdd = sc.parallelize((1 to 10).map { i =>
  Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean())