-
Notifications
You must be signed in to change notification settings - Fork 66
Conversation
…ive to mount HDFS (Azure#242)
* initial take on installing azure files * fix cluster.yaml parsing of files shares * remove test code * add docs for Azure Files
* initial refactor * rename cli_fe to cli * add docs for sdk client * typo * remove conflict * fix zip node scripts bug, add sdk_example program * start models docs * add ClusterConfiguration docs, fix merge bug * Application docs update * added Application and SparkConfiguration docs * whitespace * rename cli.py and spark/cli * add docstring for load_spark_client
* conditionally install and use nvidia-docker * status statements, and -y flag for install * add example, remove unnecessary ppa * rename custom script, remove print statement, update example * add Dockerfile * fix path in Dockerfile * update Docker images to use service account * updated docs, changed default docker repo for gpu skus * make timing statements more verbose * remove unnecessary script * added gpu docs * fix up docs and numba example
* update docker-image readme with new images * update docs
* Update 60-gpu.md make sure is available in region * Update 60-gpu.md
* Added rstudio server script * Added rstudio server port to aztk sdk * Added R dockerfiles * Added new line on dockerfiles * Pointing dockerfiles to new aztk-base * allow any user or application in the server to write to the history server log directory
* Retry asking for password when it doesn't match or is empty * Limit to 3 retries and let user know of add-user command on failure * Throw error on failure
* fix wrong path for global secrets * load spark_conf files correctly * docker-image docs fix * docker-image docs fix * move load_aztk_spark_config function to config.py
* add default filesystem master ha * move settings to spark-defaults.conf * whitespace
* Update README.md streamline and update main readme.md * Update README.md * Update README.md * Update 13-configuration.md * Update 12-docker-image.md * Update 12-docker-image.md * Update README.md * Create README.md * Update README.md * Update 10-clusters.md
* add feedback for cluster create wait * whitespace * alphasort imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, per the discussion in #256. I think you are right about adding validation (specifically to SecretsConfig object in config.py) in and error checking throughout. But this is definitely on track.
* gpu sample + jupyter mnt point * rename jupyter gpu sample
I've added some checks so that it fails if both service principal and shared key auth is provided for either batch or storage. What other checks/validation do you think are needed? (for easy reference, this is how it compares to master: 89f0e54...emlyn:vnet). |
I've also added some checks that the correct servicePrincipal keys are present. |
@emlyn yeah this is fine to merge. We will do a final review before merging into master which might include some small changes like variable/yaml key renames. But everything else looks great! |
@jafreck Thanks. don't think I have permission to merge, could you (or someone else with permission) do it please? |
@emlyn Sorry -- didn't realize that. I'll take care of it. |
What's the next step, shall I open a PR from the feature branch onto master? It's already a bit behind again, so there are other things showing up in the diff - maybe it's better if someone with appropriate access merges master in first? |
@paselem I've successfully used it to start a couple of clusters and run a few spark jobs, but haven't done any thorough testing. |
Thanks for the update @emlyn. We have another huge change being worked on in parallel for 'serverless jobs', and we are kind of in the same position. Lots of ad-hoc testing, but not enough thorough testing. I think we need to start to seriously think about having good automated tests especially for very large changes like this one. We have a sync up today to discuss immediate next steps for this project - specifically how do we get the 'serverless jobs' and 'vnet' branches merged into master and released. My main concern is around the amount of churn and testing, so we'll need to get that addressed before we can be confident about the quality of the feature going in. I'm hopeful that by the end of the day we'll have a plan down to getting this merged in and can move forward. I'll update this thread once we know more. |
@emlyn - we discussed these changes today. I did some quick testing and so far things look good. I have a local change w/ master merged into it and will probably push that back into this branch soon. We agreed that we want to push this into master by the end of the week. @jafreck has signed off and @timotheeguerin will do some additional testing and reviews. We need to figure out which change will go into master first (this or the serverless jobs). If we sign off on this quickly my guess is that this will make it in sooner. |
I think there's still a bit of cleaning up and error checking remaining, but it would be good to get feedback on what's here so far.