Feature: spark shuffle service #374

jafreck · 2018-02-07T01:12:16Z

paselem · 2018-02-07T06:37:54Z

config/spark-defaults.conf

@@ -31,3 +31,5 @@ spark.jars                      /home/spark-current/jars/azure-storage-2.0.0.jar
 # Note: Default filesystem master HA
 spark.deploy.recoveryMode       FILESYSTEM
 spark.deploy.recoveryDirectory  /root/
+
+spark.shuffle.service.enabled   true


As part of this change, should we also move the default directory off the OS drive? It worries me a bit that we would intentionally put data there.
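
One way to make this concern checkable is to parse the config and see whether a scratch directory is set explicitly. The sketch below is only an illustration: it assumes the directory in question is `spark.local.dir` (Spark's documented default for it is `/tmp`), it only handles whitespace-separated `key value` lines, and the helper names `parse_spark_defaults` and `scratch_dir_is_explicit` are hypothetical, not part of aztk.

```python
# Sample text copied from the tail of config/spark-defaults.conf in this diff.
SAMPLE = """\
# Note: Default filesystem master HA
spark.deploy.recoveryMode       FILESYSTEM
spark.deploy.recoveryDirectory  /root/

spark.shuffle.service.enabled   true
"""


def parse_spark_defaults(text):
    """Parse whitespace-separated `key value` lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def scratch_dir_is_explicit(settings):
    """Return True when spark.local.dir is set, so Spark does not fall back to its default."""
    return "spark.local.dir" in settings


settings = parse_spark_defaults(SAMPLE)
print("shuffle service enabled:", settings.get("spark.shuffle.service.enabled"))
print("scratch dir set explicitly:", scratch_dir_is_explicit(settings))
```

With the hunk as shown, no scratch directory is configured, which is what the question above is pointing at.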

paselem · 2018-02-07T06:38:16Z

node_scripts/install/spark.py

+def start_shuffle_service():
+    exe = os.path.join(spark_home, "sbin", "start-shuffle-service.sh")
+    print("Starting the shuffle service with {}".format(exe))
+    call([exe, " &"])


What happens if this fails to start?

If the shuffle service fails to start, it just won't be used.
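
If a failed start should at least be visible in the node logs, a variant along these lines might help. This is a sketch under assumptions, not the code in this PR: `spark_home` is passed as a parameter instead of read from the module, and the `run` parameter is a hypothetical hook added so the function can be exercised without a Spark install. It also drops the `" &"` element, because `call` with a list and no shell passes it to the script as a literal argument instead of backgrounding anything.

```python
import os
import subprocess


def start_shuffle_service(spark_home, run=subprocess.call):
    """Run the shuffle service start script and report whether it exited cleanly."""
    exe = os.path.join(spark_home, "sbin", "start-shuffle-service.sh")
    print("Starting the shuffle service with {}".format(exe))
    try:
        # Only the script path is passed; no shell is involved.
        return_code = run([exe])
    except OSError as e:
        # Raised when the script is missing or not executable.
        print("Could not launch the shuffle service: {}".format(e))
        return False
    if return_code != 0:
        print("Shuffle service script exited with code {}".format(return_code))
        return False
    return True
```

The caller can ignore the return value to keep the current "just won't be used" behavior, or act on it if a missing shuffle service should fail node setup.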

* Feature: on node user creation (#303)
  * client side on node user creation
  * start create user on node implementation
  * fix on node user creation
  * remove debug statements
  * remove commented code
  * line too long
  * fix spinner password prompt ui bug
  * set wait to false by default, formatting
  * encrypt password on client, decrypt on node
  * update docs, log warning if password used
* Fix list-apps crash (#364)
* Allow submitting jobs into a VNET (#365)
  * Add subnet_id to job submission cluster config
  * add some docs
* Feature: Spark mixed mode support (#350)
  * add support for aad creds for storage on node
  * add mixed mode support
  * add docs
  * switch error order
  * add dedicated to get_cluster
  * remove mixed mode in print_cluster_conf
* Feature: spark init docker repo customization (#358)
  * customize docker_repo based on init args
  * whitespace
  * add some docs
  * r-base to r
  * case insensitive r flag, typo fix
* Bug: Load default Jars for job submission CLI (#367)
  * load jars in .aztk/ by default
  * rewrite loading config files
* Feature: Cluster Run and Copy (#304)
  * start implementation of cluster run
  * fix cluster_run
  * start debug sequential user add and delete
  * parallelize user creation and deletion, start implementation of cluster scp
  * continue cluster_scp implementation
  * debug statements, disconnect error: permission denied
  * untesteed parakimo implementation of clus_run
  * continue debugging user creation bug
  * fix bug with pool user creation, start concurrent implementation
  * start fix of paramiko cluster_run and cluster_copy
  * working paramiko cluster_run implementation, start cluster_scp
  * fix cluster_scp command
  * update requirements, rename cluster_run function
  * remove unused shell functions
  * parallelize run and scp, add container_name, create logs wrapper
  * change scp to copy, clean up
  * sort imports
  * remove asyncssh from node requirements
  * remove old import
  * remove bad error handling
  * make cluster user management methods private
  * remove comment
  * remove accidental commit
  * fix merge, move delete to finally clause
  * add docs
  * formatting
* Feature: Refactor cluster config to use ClusterConfiguration model (#343)
* Bug: fix core-site.xml typo (#378)
  * fix typo
  * crlf->lf
* Bug: fix regex for is_gpu_enabled (#380)
  * fix regex for is_gpu_enabled
  * crlf->lf
* Bug: spark SDK example fix (#383)
  * start fix sdk
  * fix sdk example
  * crlf->lf
* Fix: Custom scripts not read from cluster.yaml (#388)
* Feature: spark shuffle service (#374)
  * start shuffle service by default
  * whitespace, delete misplaced file
  * crlf->lf
  * crlf->lf
  * move spark scratch space off os drive
* Feature: enable dynamic allocation by default (#386)
* Bug: stop using mutable default parameters (#392)
* Bug: always upload spark job logs errors (#395)
* Bug: spark submit upload error log type error (#397)
* Bug: Spark Job list apps exit code 0 (#396)
* Bug: fix spark-submit cores args (#399)
* Fix: Trying to add user before master is ready show better error (#402)
* Bug: move spark.local.dir to location usable by rstudioserver (#407)
* Feature: SDK support for file-like configuration objects (#373)
  * add support for filelike objects for conifguration files
  * fix custom scripts
  * remove os.pathlike
  * merge error
* Feature: Basic Cluster and Job Submission SDK Tests (#344)
  * add initial cluster tests
  * add cluster tests, add simple job submission test scenario
  * sort imports
  * fix job tests
  * fix job tests
  * remove pytest from travis build
  * cluster per test, parallel pytest plugin
  * delete cluster after tests, wait until deleted
  * fix bugs
  * catch right error, change cluster_id to base_cluster_id
  * fix test name
  * fixes
  * move tests to intregration_tests dir
  * update travis to run non-integration tests
  * directory structure, decoupled job tests
  * fix job tests, issue with submit_job
  * fix bug
  * add test docs
  * add cluster and job delete to finally clause
* Feature: Spark add worker on master option (#415)
  * Add worker_on_master to ClusterConfiguration
  * add worker_on_master to JobConfiguration
* Feature: task affinity to master node (#413)
* Release: v0.6.0 (#416)
  * update changelog and version
  * underscores to stars

jafreck added 4 commits February 6, 2018 16:57

start shuffle service by default

5c988d6

whitespace, delete misplaced file

1e3c046

crlf->lf

6cfdf8b

crlf->lf

90bc907

jafreck added the work in progress and in progress labels Feb 7, 2018

jafreck changed the title ~~Feature/spark shuffle service~~ Feature: spark shuffle service Feb 7, 2018

paselem reviewed Feb 7, 2018

jafreck modified the milestone: v0.5.2 Feb 8, 2018

move spark scratch space off os drive

dd2af1b

jafreck removed the work in progress label Feb 8, 2018

Merge branch 'master' into feature/spark-shuffle-service

006ac99

paselem approved these changes Feb 9, 2018

jafreck added 4 commits February 8, 2018 17:34

Merge remote-tracking branch 'upstream/master' into feature/spark-shuffle-service

e33d0f8

Merge branch 'master' into feature/spark-shuffle-service

650816b

Merge remote-tracking branch 'upstream/master' into feature/spark-shuffle-service

ec40945

Merge branch 'feature/spark-shuffle-service' of github.com:jafreck/aztk into feature/spark-shuffle-service

426bfed

jafreck merged commit d75ae44 into Azure:master Feb 9, 2018

jafreck removed the in progress label Feb 9, 2018
