Update VNET feature branch (#284)
* remove redundant setting in non-master code section and use non-os drive to mount HDFS (#242)

* Feature: Azure Files (#241)

* initial take on installing azure files

* fix cluster.yaml parsing of files shares

* remove test code

* add docs for Azure Files

* Feature: Rename SDK (#231)

* initial refactor

* rename cli_fe to cli

* add docs for sdk client

* typo

* remove conflict

* fix zip node scripts bug, add sdk_example program

* start models docs

* add ClusterConfiguration docs, fix merge bug

* Application docs update

* added Application and SparkConfiguration docs

* whitespace

* rename cli.py and spark/cli

* add docstring for load_spark_client

* Bug: fix bad reference to FileShare (#245)

* Feature: Spark GPU (#206)

* conditionally install and use nvidia-docker

* status statements, and -y flag for install

* add example, remove unnecessary ppa

* rename custom script, remove print statement, update example

* add Dockerfile

* fix path in Dockerfile

* update Docker images to use service account

* updated docs, changed default docker repo for gpu skus

* make timing statements more verbose

* remove unnecessary script

* added gpu docs

* fix up docs and numba example

* Feature: update docker image doc (#251)

* update docker-image readme with new images

* update docs

* Update 60-gpu.md (#253)

* Update 60-gpu.md

make sure the GPU VM size is available in the region

* Update 60-gpu.md

* Feature: Sparklyr (#243)

* Added rstudio server script

* Added rstudio server port to aztk sdk

* Added R dockerfiles

* Added new line on dockerfiles

* Pointing dockerfiles to new aztk-base

* allow any user or application in the server to write to the history server log directory

* Retry asking for password when it doesn't match or is empty (#252)

* Retry asking for password when it doesn't match or is empty

* Limit to 3 retries and let user know of add-user command on failure

* Throw error on failure

* Bug: fix wrong path for global secrets (#265)

* fix wrong path for global secrets

* load spark_conf files correctly

* docker-image docs fix

* docker-image docs fix

* move load_aztk_spark_config function to config.py

* Feature: Default Spark filesystem master HA (#271)

* add default filesystem master ha

* move settings to spark-defaults.conf

* whitespace

* Docs: update (#263)

* Update README.md

streamline and update main readme.md

* Update README.md

* Update README.md

* Update 13-configuration.md

* Update 12-docker-image.md

* Update 12-docker-image.md

* Update README.md

* Create README.md

* Update README.md

* Update 10-clusters.md

* Feature: add feedback for cluster create wait (#273)

* add feedback for cluster create wait

* whitespace

* alphasort imports

* Bug: fix loading local spark config (#282)

* Fix secrets.yaml format and add service principal for storage

* Feature: update to v0.5.0 (#283)

* Pass credentials through to node scripts

* Bug: History server parse file not exist (#288)

* jupyter azfiles bug + gpu sample (#291)

* gpu sample + jupyter mnt point

* rename jupyter gpu sample

* Check for both service principal and shared key auth

* More checks

* Bug: fix logic for worker custom scripts (#295)

* Bug: suppress warning on add-user (#302)

* Bug: fix alignment in get print cluster (#312)
emlyn authored and jafreck committed Jan 9, 2018
1 parent 5083f4d commit 4b7cb1f
Showing 88 changed files with 2,535 additions and 513 deletions.
86 changes: 47 additions & 39 deletions README.md
@@ -3,17 +3,15 @@ Azure Distributed Data Engineering Toolkit (AZTK) is a python CLI application fo

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Currently, this toolkit is designed to run batch Spark jobs that require additional on-demand compute. Eventually we plan to support other distributed data engineering frameworks in a similar vein. Please let us know which frameworks you'd like for us to support in the future.

## Notable Features
- Spark cluster provision time of 5 minutes on average
- Spark clusters run in Docker containers
- Run Spark on a GPU enabled cluster
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Built in support for Azure Blob Storage and Azure Data Lake connection
- Optional Jupyter Notebook for pythonic interactive experience
- [coming soon] Optional RStudio Server for an interactive experience in R
- Tailored Docker image for PySpark and [coming soon] SparklyR
- [Tailored pythonic experience with PySpark, Jupyter, and Anaconda](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [Tailored R experience with SparklyR, RStudio-Server, and Tidyverse](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)
- Ability to run _spark submit_ directly from your local machine's CLI

## Setup
@@ -47,6 +45,7 @@ The core experience of this package is centered around a few commands.
```sh
# create your cluster
aztk spark cluster create
aztk spark cluster add-user
```
```sh
# monitor and manage your clusters
@@ -60,64 +59,70 @@ aztk spark cluster ssh
aztk spark cluster submit
```

### Create and setup your cluster
### 1. Create and setup your cluster

First, create your cluster:
```bash
aztk spark cluster create \
--id <my_cluster_id> \
--size <number_of_nodes> \
--vm-size <vm_size>
```
You can find more information on VM sizes [here.](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes) Please note that you must use the official SKU name when setting your VM size - they usually come in the form: "standard_d2_v2".

You can also create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using `--size-low-pri` instead of `--size` (we currently do not support mixed low-priority and dedicated VMs):
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
```
aztk spark cluster create \
--id <my_cluster_id> \
--size-low-pri <number_of_low-pri_nodes> \
--vm-size <vm_size>
```

By default, this package runs Spark 2.2.0 on an Ubuntu16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.

NOTE: The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
- See our available VM sizes [here.](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes)
- The `--vm-size` argument must be the official SKU name, which usually comes in the form: "standard_d2_v2"
- You can create [low-priority VMs](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) at an 80% discount by using `--size-low-pri` instead of `--size`
- By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info [here](/docker-image)
- By default, AZTK will create a user (with the username **spark**) for your cluster if the argument `--wait` is true
- The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
- By default, you cannot create clusters of more than 20 cores in total. Visit [this page](https://docs.microsoft.com/en-us/azure/batch/batch-quota-limit#view-batch-quotas) to request a core quota increase.

More information regarding using a cluster can be found in the [cluster documentation](./docs/10-clusters.md)

### Check on your cluster status
### 2. Check on your cluster status
To check your cluster status, use the `get` command:
```bash
aztk spark cluster get --id <my_cluster_id>
aztk spark cluster get --id my_cluster
```

### Submit a Spark job
### 3. Submit a Spark job

When your cluster is up, you can submit jobs to run against the cluster:
When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of the spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:
```bash
// submit a java application
aztk spark cluster submit \
--id my_cluster \
--name my_java_job \
--class org.apache.spark.examples.SparkPi \
--executor-memory 20G \
path\to\examples.jar 1000

// submit a python application
aztk spark cluster submit \
--id <my_cluste_id> \
--name <my_job_name> \
[options] \
<app jar | python file> \
[app arguments]
--id my_cluster \
--name my_python_job \
--executor-memory 20G \
path\to\pi.py 1000
```
NOTE: The job name (`--name`) must be atleast 3 characters long, can only contain alphanumeric characters including hyphens but excluding underscores, and cannot contain uppercase letters. Each job you submit **must** have a unique name.
- The `aztk spark cluster submit` command takes the same parameters as the standard [`spark-submit` command](https://spark.apache.org/docs/latest/submitting-applications.html), except instead of specifying `--master`, AZTK requires that you specify your cluster `--id` and a unique job `--name`
- The job name (`--name`) must be at least 3 characters long
- It can only contain alphanumeric characters, including hyphens but excluding underscores
- It cannot contain uppercase letters
- Each job you submit **must** have a unique name
- Use the `--no-wait` option for your command to return immediately

The output of spark-submit will be streamed to the console. Use the `--no-wait` option to return immediately. More information regarding monitoring your job can be found in the [spark submit documentation.](./docs/20-spark-submit.md)
Learn more about the spark submit command [here](./docs/20-spark-submit.md)

To start testing this package, try out a Spark job from the [./examples](./examples) folder. The examples are a curated list of samples from Spark-2.2.0.

### Log in and Interact with your Spark Cluster
### 4. Log in and Interact with your Spark Cluster
Most users will want to work interactively with their Spark clusters. With the `aztk spark cluster ssh` command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:
```bash
aztk spark cluster ssh --id <my_cluster_id>
aztk spark cluster ssh --id my_cluster --user spark
```
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*.

You can configure these settings in the *.aztk/ssh.yaml* file.

### Manage your Spark cluster
NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server, depending on whether you are a Python or R user. To do so, you need to set up your cluster with the appropriate Docker image and custom scripts:
- [how to setup Jupyter with Pyspark](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [how to setup RStudio-Server with Sparklyr](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)

### 5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:
```
@@ -141,6 +146,9 @@ aztk spark cluster delete --id <my_cluster_id>
- [How do I interact with my Spark cluster using a password instead of an SSH-key?](./docs/10-clusters.md#interactive-mode)
- [How do I change my cluster default settings?](./docs/13-configuration.md)
- [How do I modify my *spark-env.sh*, *spark-defaults.conf* or *core-site.xml* files?](./docs/13-configuration.md)
- [How do I use GPUs with AZTK?](./docs/60-gpu.md)
- [I'm a Python user and want to use PySpark, Jupyter, and Anaconda packages for a Pythonic experience.](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [I'm an R user and want to use SparklyR, RStudio, and Tidyverse packages for an R experience.](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)

## Next Steps
You can find more documentation [here](./docs)
3 changes: 1 addition & 2 deletions aztk/__init__.py
@@ -1,2 +1 @@
import aztk.logger
log = aztk.logger.root

28 changes: 0 additions & 28 deletions aztk/aztklib.py

This file was deleted.

30 changes: 18 additions & 12 deletions aztk_sdk/client.py → aztk/client.py
@@ -1,10 +1,10 @@
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
import azure.batch.models as batch_models
import aztk_sdk.utils.azure_api as azure_api
import aztk_sdk.utils.helpers as helpers
import aztk_sdk.utils.constants as constants
import aztk_sdk.utils.get_ssh_key as get_ssh_key
import aztk_sdk.models as models
import aztk.utils.azure_api as azure_api
import aztk.utils.helpers as helpers
import aztk.utils.constants as constants
import aztk.utils.get_ssh_key as get_ssh_key
import aztk.models as models


class Client:
@@ -14,16 +14,22 @@ def __init__(self, secrets_config: models.SecretsConfiguration):
self.blob_config = azure_api.BlobConfig(
account_key=self.secrets_config.storage_account_key,
account_name=self.secrets_config.storage_account_name,
account_suffix=self.secrets_config.storage_account_suffix
account_suffix=self.secrets_config.storage_account_suffix,

tenant_id=self.secrets_config.service_principal_tenant_id,
client_id=self.secrets_config.service_principal_client_id,
credential=self.secrets_config.service_principal_credential,
resource_id=self.secrets_config.storage_account_resource_id
)
self.batch_config = azure_api.BatchConfig(
service_url=self.secrets_config.batch_service_url,
account_key=self.secrets_config.batch_account_key,
account_name=self.secrets_config.batch_account_name,
resource_url=self.secrets_config.batch_resource_url,

tenant_id=self.secrets_config.service_principal_tenant_id,
client_id=self.secrets_config.service_principal_client_id,
credential=self.secrets_config.service_principal_credential
credential=self.secrets_config.service_principal_credential,
resource_id=self.secrets_config.batch_account_resource_id
)

self.batch_client = azure_api.make_batch_client(self.batch_config)
@@ -62,7 +68,7 @@ def __create_pool_and_job(self, cluster_conf, software_metadata_key: str, start_
"""
Create a pool and job
:param cluster_conf: the configuration object used to create the cluster
:type cluster_conf: aztk_sdk.models.ClusterConfiguration
:type cluster_conf: aztk.models.ClusterConfiguration
:param software_metadata_key: the id of the software being used on the cluster
:param start_task: the start task for the cluster
:param VmImageModel: the type of image to provision for the cluster
@@ -156,14 +162,14 @@ def __create_user(self, pool_id: str, node_id: str, username: str, password: str
is_admin=True,
password=password,
ssh_public_key=get_ssh_key.get_user_public_key(ssh_key, self.secrets_config),
expiry_time=datetime.now() + timedelta(days=365)))
expiry_time=datetime.now(timezone.utc) + timedelta(days=365)))

def __get_remote_login_settings(self, pool_id: str, node_id: str):
"""
Get the remote_login_settings for node
:param pool_id
:param node_id
:returns aztk_sdk.models.RemoteLogin
:returns aztk.models.RemoteLogin
"""
result = self.batch_client.compute_node.get_remote_login_settings(pool_id, node_id)
return models.RemoteLogin(ip_address=result.remote_login_ip_address, port=str(result.remote_login_port))
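The diff above threads new service-principal and resource-id fields from `SecretsConfiguration` into both `BlobConfig` and `BatchConfig`. The sketch below shows how a caller might populate them, assuming the keyword arguments accepted by the `SecretsConfiguration` constructor in the `aztk/models.py` diff further down; the resource-id strings and credential values are illustrative placeholders.

```python
# Hypothetical wiring of service-principal secrets into the base client.
# Field names follow the SecretsConfiguration constructor shown in the
# aztk/models.py diff below; all values are placeholders.
import aztk.models as models
from aztk.client import Client

secrets = models.SecretsConfiguration(
    service_principal_tenant_id="<tenant-id>",
    service_principal_client_id="<client-id>",
    service_principal_credential="<client-secret>",
    batch_account_resource_id=(
        "/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.Batch/batchAccounts/<batch-account>"),
    storage_account_resource_id=(
        "/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
)

# Client.__init__ builds BatchConfig and BlobConfig from these secrets,
# as shown in the diff above.
client = Client(secrets)
```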
File renamed without changes.
19 changes: 16 additions & 3 deletions aztk_sdk/models.py → aztk/models.py
@@ -1,7 +1,16 @@
from typing import List
import aztk_sdk.utils.constants as constants
import aztk.utils.constants as constants
import azure.batch.models as batch_models

class FileShare:
def __init__(self, storage_account_name: str = None,
storage_account_key: str = None,
file_share_path: str = None,
mount_path: str = None):
self.storage_account_name = storage_account_name
self.storage_account_key = storage_account_key
self.file_share_path = file_share_path
self.mount_path = mount_path

class CustomScript:
def __init__(self, name: str = None, script: str = None, run_on=None):
@@ -14,6 +23,7 @@ class ClusterConfiguration:
def __init__(
self,
custom_scripts: List[CustomScript] = None,
file_shares: List[FileShare] = None,
cluster_id: str = None,
vm_count=None,
vm_low_pri_count=None,
@@ -22,6 +32,7 @@ def __init__(
docker_repo: str=None):

self.custom_scripts = custom_scripts
self.file_shares = file_shares
self.cluster_id = cluster_id
self.vm_count = vm_count
self.vm_size = vm_size
@@ -42,13 +53,14 @@ def __init__(
batch_account_name=None,
batch_account_key=None,
batch_service_url=None,
batch_resource_url=None,
storage_account_name=None,
storage_account_key=None,
storage_account_suffix=None,
service_principal_tenant_id=None,
service_principal_client_id=None,
service_principal_credential=None,
batch_account_resource_id=None,
storage_account_resource_id=None,
docker_endpoint=None,
docker_username=None,
docker_password=None,
@@ -58,7 +70,6 @@ def __init__(
self.batch_account_name = batch_account_name
self.batch_account_key = batch_account_key
self.batch_service_url = batch_service_url
self.batch_resource_url = batch_resource_url

self.storage_account_name = storage_account_name
self.storage_account_key = storage_account_key
@@ -67,6 +78,8 @@ def __init__(
self.service_principal_tenant_id = service_principal_tenant_id
self.service_principal_client_id = service_principal_client_id
self.service_principal_credential = service_principal_credential
self.batch_account_resource_id = batch_account_resource_id
self.storage_account_resource_id = storage_account_resource_id

self.docker_endpoint = docker_endpoint
self.docker_username = docker_username
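The new `FileShare` model and the `file_shares` parameter on `ClusterConfiguration` back the Azure Files feature listed in the commit message. Below is a minimal sketch of attaching a share to a cluster configuration; the account values, share name, and mount path are placeholders.

```python
# Sketch of the Azure Files support added in this commit: a FileShare is
# passed to ClusterConfiguration via the new file_shares parameter.
# Account name, key, share name, and mount path are placeholders.
import aztk.models as models

share = models.FileShare(
    storage_account_name="<storage-account>",
    storage_account_key="<storage-key>",
    file_share_path="<share-name>",
    mount_path="/mnt/<share-name>",
)

cluster_conf = models.ClusterConfiguration(
    cluster_id="my_cluster",
    vm_count=5,
    vm_size="standard_d2_v2",
    file_shares=[share],
)
```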
2 changes: 2 additions & 0 deletions aztk/spark/__init__.py
@@ -0,0 +1,2 @@
from . import models
from .client import Client
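With the `aztk_sdk` package renamed to `aztk`, the spark sub-package now re-exports its `Client` and `models`. The sketch below is an assumption about how that client might be consumed; the shared-key secrets fields come from the `SecretsConfiguration` diff above, while the `create_cluster` call is based on the SDK docs referenced in the "Rename SDK" commit and is not shown in this diff.

```python
# Hypothetical SDK usage after the rename (aztk_sdk -> aztk). The shared-key
# secrets fields are taken from the SecretsConfiguration diff above; the
# spark Client's method names are assumptions, not part of this diff.
import aztk.models
import aztk.spark

secrets = aztk.models.SecretsConfiguration(
    batch_account_name="<batch-account>",
    batch_account_key="<batch-key>",
    batch_service_url="https://<batch-account>.<region>.batch.azure.com",
    storage_account_name="<storage-account>",
    storage_account_key="<storage-key>",
    storage_account_suffix="core.windows.net",
)

spark_client = aztk.spark.Client(secrets)
# cluster_conf = aztk.models.ClusterConfiguration(...)  # as in the sketch above
# spark_client.create_cluster(cluster_conf)             # assumed API; see SDK docs
```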