Memory Configurations
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
spark.driver.memory
spark.driver.memoryOverhead
spark.executor.memoryOverhead
spark.executor.memory
spark.memory.fraction
spark.memory.storageFraction
spark.executor.cores
spark.memory.offHeap.enabled
spark.memory.offHeap.size
spark.executor.pyspark.memory
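How these settings combine can be sketched with a little arithmetic. The helper below is an illustration only, not Spark code: the function name and dictionary keys are hypothetical, and it assumes the documented defaults (overhead of 10% of executor memory with a 384 MiB floor, 300 MiB of reserved heap, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5). The container total assumes Spark 3.x on YARN, where off-heap and PySpark memory are added to the request.

```python
RESERVED_MIB = 300       # heap reserved by Spark before the unified region is sized
MIN_OVERHEAD_MIB = 384   # documented floor for spark.executor.memoryOverhead


def executor_memory_layout(executor_memory_mib, memory_fraction=0.6,
                           storage_fraction=0.5, overhead_factor=0.10,
                           offheap_mib=0, pyspark_mib=0):
    """Estimate how one executor's memory is split up (all values in MiB)."""
    # Default overhead: a fraction of the heap, but never below the floor.
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    # Unified (execution + storage) region inside the JVM heap.
    unified = (executor_memory_mib - RESERVED_MIB) * memory_fraction
    # Portion of the unified region protected from eviction by execution.
    storage = unified * storage_fraction
    # Assumed container request: heap + overhead + off-heap + PySpark memory.
    container = executor_memory_mib + overhead + offheap_mib + pyspark_mib
    return {
        "overhead_mib": overhead,
        "unified_mib": unified,
        "storage_mib": storage,
        "container_mib": container,
    }


layout = executor_memory_layout(8192)  # spark.executor.memory=8g
```

The resulting container_mib has to fit within yarn.scheduler.maximum-allocation-mb, and the containers on a node together within yarn.nodemanager.resource.memory-mb, otherwise YARN will not grant the request.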
Adaptive Query Execution
spark.sql.adaptive.enabled=true
spark.sql.shuffle.partitions=10
spark.sql.autoBroadcastJoinThreshold=10MB
spark.sql.adaptive.coalescePartitions.enabled
spark.sql.adaptive.coalescePartitions.initialPartitionNum
spark.sql.adaptive.coalescePartitions.minPartitionNum
spark.sql.adaptive.localShuffleReader.enabled=true
spark.sql.adaptive.advisoryPartitionSizeInBytes
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB
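The skew-join settings work together: per the Spark documentation, a shuffle partition counts as skewed only if it is larger than skewedPartitionFactor times the median partition size and also larger than skewedPartitionThresholdInBytes. The sketch below restates that rule in plain Python; the function name is hypothetical and the median here is Python's, which may differ slightly from Spark's internal calculation.

```python
from statistics import median

MB = 1024 * 1024


def skewed_partitions(sizes_bytes, factor=5, threshold_bytes=256 * MB):
    """Return indexes of partitions that satisfy both skew conditions."""
    med = median(sizes_bytes)
    return [
        i for i, size in enumerate(sizes_bytes)
        # Both conditions must hold: relative to the median AND absolute size.
        if size > factor * med and size > threshold_bytes
    ]


# One 300 MB partition among ~10 MB partitions is flagged.
flagged = skewed_partitions([10 * MB, 12 * MB, 11 * MB, 300 * MB])
# A 100 MB partition is far above the median but below the 256 MB threshold.
not_flagged = skewed_partitions([1 * MB, 1 * MB, 1 * MB, 100 * MB])
```

Requiring both conditions keeps small jobs from being split needlessly: a partition that is large relative to its siblings but small in absolute terms is left alone.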
Partitioning & Caching
Pruning:
spark.sql.optimizer.dynamicPartitionPruning.enabled
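Cache vs Persist
- cache() is persist() with the default storage level; persist() accepts an explicit storage level
- Storage level flags in persist():
  - Use Disk
  - Use Memory
  - Deserialized
  - Replication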
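To make the four flags concrete, here is a small stand-in for a storage level. It is a hypothetical class written for illustration, not the real Spark StorageLevel (which also has an off-heap flag, omitted here); the constants mirror the flag combinations of the Scala levels with the same names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLevelSketch:
    """Illustrative stand-in for a storage level (not Spark's own class)."""
    use_disk: bool
    use_memory: bool
    deserialized: bool
    replication: int = 1


# Flag combinations behind some common level names (Scala API values).
MEMORY_ONLY = StorageLevelSketch(False, True, True)
MEMORY_AND_DISK = StorageLevelSketch(True, True, True)
MEMORY_AND_DISK_SER = StorageLevelSketch(True, True, False)
DISK_ONLY = StorageLevelSketch(True, False, False)
MEMORY_AND_DISK_2 = StorageLevelSketch(True, True, True, replication=2)


def describe(level):
    """Summarize where the data lives, its form, and how many copies exist."""
    places = [name for name, used in (("memory", level.use_memory),
                                      ("disk", level.use_disk)) if used]
    form = "deserialized" if level.deserialized else "serialized"
    return f"{' + '.join(places)}, {form}, x{level.replication}"
```

For example, describe(MEMORY_AND_DISK_2) yields "memory + disk, deserialized, x2": partitions that do not fit in memory spill to disk, and each partition is kept on two nodes.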
Hints & Accumulators
Partitioning Hints:
- COALESCE
- REPARTITION
- REPARTITION_BY_RANGE
- REBALANCE
Join Hints:
- BROADCAST (aliases: BROADCASTJOIN, MAPJOIN)
- MERGE (aliases: SHUFFLE_MERGE, MERGEJOIN)
- SHUFFLE_HASH
- SHUFFLE_REPLICATE_NL
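In Spark SQL, both partitioning and join hints are written as a /*+ ... */ comment directly after SELECT. The helper below only builds the query text so the syntax is visible; the function name, table names, and column names are hypothetical.

```python
def with_hint(select_body, hint, *args):
    """Render 'SELECT /*+ HINT(args) */ <select_body>' as a string."""
    # Hints without arguments (for example REBALANCE) take no parentheses.
    rendered = f"{hint}({', '.join(str(a) for a in args)})" if args else hint
    return f"SELECT /*+ {rendered} */ {select_body}"


# Join hint: ask Spark to broadcast the relation aliased as d.
broadcast_sql = with_hint("* FROM orders o JOIN dim d ON o.k = d.k", "BROADCAST", "d")
# Partitioning hint: repartition into 8 partitions by user_id.
repartition_sql = with_hint("* FROM events", "REPARTITION", 8, "user_id")
# Partitioning hint without arguments.
rebalance_sql = with_hint("* FROM events", "REBALANCE")
```

The resulting strings would be passed to spark.sql(...). Hints are requests rather than commands, so Spark may ignore one it cannot honor.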
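Accumulators – updates made inside actions are guaranteed to be applied exactly once per task; updates made inside transformations may be applied more than once if a task or stage is re-executed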
Speculative Execution
spark.speculation=true
spark.speculation.interval=100ms
spark.speculation.multiplier=1.5
spark.speculation.quantile=0.75
spark.speculation.minTaskRuntime=100ms
spark.speculation.task.duration.threshold=None
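The way quantile, multiplier, and minTaskRuntime interact can be sketched as a simplified decision function. This is an approximation of the scheduler's check, not its actual code: the function name is hypothetical, spark.speculation.task.duration.threshold is not modeled, and the real check runs once every spark.speculation.interval.

```python
import math
from statistics import median


def should_speculate(finished_durations_ms, num_tasks, running_ms,
                     quantile=0.75, multiplier=1.5, min_task_runtime_ms=100):
    """Decide whether a still-running task is slow enough to speculate."""
    # Speculation only starts once `quantile` of the stage's tasks have finished.
    min_finished = max(math.floor(quantile * num_tasks), 1)
    if len(finished_durations_ms) < min_finished:
        return False
    # A task is a candidate once it runs longer than multiplier x median,
    # and never before it has run for at least minTaskRuntime.
    threshold = max(multiplier * median(finished_durations_ms), min_task_runtime_ms)
    return running_ms > threshold


# 3 of 4 tasks finished (median 11 s), so the threshold is 16.5 s.
slow = should_speculate([10_000, 12_000, 11_000], 4, running_ms=20_000)
fine = should_speculate([10_000, 12_000, 11_000], 4, running_ms=15_000)
```

Raising the multiplier or the quantile makes speculation more conservative, which reduces duplicate work at the cost of waiting longer on stragglers.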
Dynamic Resource Allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.schedulerBacklogTimeout=1s
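Settings like these are commonly passed to spark-submit as repeated --conf key=value pairs. The sketch below turns a dictionary of the values listed above into that argument list; the helper name is hypothetical and the values are the ones from this section, not recommendations.

```python
def to_submit_args(conf):
    """Turn a dict of Spark settings into spark-submit '--conf' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    # Shuffle tracking allows dynamic allocation without an external shuffle service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Executors idle for this long are released.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Pending tasks older than this trigger a request for more executors.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
}

submit_args = to_submit_args(dynamic_allocation)
```

The list can be spliced into a command such as ["spark-submit", *submit_args, ...], or the same dictionary can be applied through SparkSession.builder.config before the session is created.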
Spark Schedulers
spark.scheduler.mode=FAIR