Update VNET feature branch (#284)
* remove redundant setting in non-master code section and use non-os drive to mount HDFS (#242)

* Feature: Azure Files (#241)

* initial take on installing azure files

* fix cluster.yaml parsing of files shares

* remove test code

* add docs for Azure Files

* Feature: Rename SDK (#231)

* initial refactor

* rename cli_fe to cli

* add docs for sdk client

* typo

* remove conflict

* fix zip node scripts bug, add sdk_example program

* start models docs

* add ClusterConfiguration docs, fix merge bug

* Application docs update

* added Application and SparkConfiguration docs

* whitespace

* rename cli.py and spark/cli

* add docstring for load_spark_client

* Bug: fix bad reference to FileShare (#245)

* Feature: Spark GPU (#206)

* conditionally install and use nvidia-docker

* status statements, and -y flag for install

* add example, remove unnecessary ppa

* rename custom script, remove print statement, update example

* add Dockerfile

* fix path in Dockerfile

* update Docker images to use service account

* updated docs, changed default docker repo for gpu skus

* make timing statements more verbose

* remove unnecessary script

* added gpu docs

* fix up docs and numba example

* Feature: update docker image doc (#251)

* update docker-image readme with new images

* update docs

* Update 60-gpu.md (#253)

* Update 60-gpu.md

make sure the GPU VM size is available in the region

* Update 60-gpu.md

* Feature: Sparklyr (#243)

* Added rstudio server script

* Added rstudio server port to aztk sdk

* Added R dockerfiles

* Added new line on dockerfiles

* Pointing dockerfiles to new aztk-base

* allow any user or application in the server to write to the history server log directory

* Retry asking for password when it doesn't match or is empty (#252)

* Retry asking for password when it doesn't match or is empty

* Limit to 3 retries and let user know of add-user command on failure

* Throw error on failure

* Bug: fix wrong path for global secrets (#265)

* fix wrong path for global secrets

* load spark_conf files correctly

* docker-image docs fix

* docker-image docs fix

* move load_aztk_spark_config function to config.py

* Feature: Default Spark filesystem master HA (#271)

* add default filesystem master ha

* move settings to spark-defaults.conf

* whitespace

* Docs: update (#263)

* Update README.md

streamline and update main readme.md

* Update README.md

* Update README.md

* Update 13-configuration.md

* Update 12-docker-image.md

* Update 12-docker-image.md

* Update README.md

* Create README.md

* Update README.md

* Update 10-clusters.md

* Feature: add feedback for cluster create wait (#273)

* add feedback for cluster create wait

* whitespace

* alphasort imports

* Bug: fix loading local spark config (#282)

* Fix secrets.yaml format and add service principal for storage

* Feature: update to v0.5.0 (#283)

* Pass credentials through to node scripts

* Bug: History server parse file not exist (#288)

* jupyter azfiles bug + gpu sample (#291)

* gpu sample + jupyter mnt point

* rename jupyter gpu sample

* Check for both service principal and shared key auth

* More checks

* Bug: fix logic for worker custom scripts (#295)

* Bug: suppress warning on add-user (#302)

* Bug: fix alignment in get print cluster (#312)
emlyn authored and jafreck committed Jan 9, 2018
1 parent 5083f4d commit 4b7cb1f
Showing 88 changed files with 2,535 additions and 513 deletions.
86 changes: 47 additions & 39 deletions README.md
@@ -3,17 +3,15 @@ Azure Distributed Data Engineering Toolkit (AZTK) is a python CLI application fo

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Currently, this toolkit is designed to run batch Spark jobs that require additional on-demand compute. Eventually we plan to support other distributed data engineering frameworks in a similar vein. Please let us know which frameworks you'd like for us to support in the future.

## Notable Features
- Spark cluster provision time of 5 minutes on average
- Spark clusters run in Docker containers
- Run Spark on a GPU enabled cluster
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Built in support for Azure Blob Storage and Azure Data Lake connection
- Optional Jupyter Notebook for pythonic interactive experience
- [coming soon] Optional RStudio Server for an interactive experience in R
- Tailored Docker image for PySpark and [coming soon] SparklyR
- [Tailored pythonic experience with PySpark, Jupyter, and Anaconda](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [Tailored R experience with SparklyR, RStudio-Server, and Tidyverse](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)
- Ability to run _spark submit_ directly from your local machine's CLI

## Setup
@@ -47,6 +45,7 @@ The core experience of this package is centered around a few commands.
```sh
# create your cluster
aztk spark cluster create
aztk spark cluster add-user
```
```sh
# monitor and manage your clusters
@@ -60,64 +59,70 @@ aztk spark cluster ssh
aztk spark cluster submit
```

### Create and setup your cluster
### 1. Create and setup your cluster

First, create your cluster:
```bash
aztk spark cluster create \
--id <my_cluster_id> \
--size <number_of_nodes> \
--vm-size <vm_size>
```
You can find more information on VM sizes [here.](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes) Please note that you must use the official SKU name when setting your VM size - they usually come in the form: "standard_d2_v2".

You can also create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using `--size-low-pri` instead of `--size` (we currently do not support mixed low-priority and dedicated VMs):
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
```
aztk spark cluster create \
--id <my_cluster_id> \
--size-low-pri <number_of_low-pri_nodes> \
--vm-size <vm_size>
```

By default, this package runs Spark 2.2.0 on an Ubuntu16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.

NOTE: The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
- See our available VM sizes [here.](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes)
- The `--vm-size` argument must be the official SKU name, which usually comes in the form: "standard_d2_v2"
- You can create [low-priority VMs](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) at an 80% discount by using `--size-low-pri` instead of `--size`
- By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info [here](/docker-image)
- By default, AZTK will create a user (with the username **spark**) for your cluster if the argument `--wait` is true
- The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
- By default, you cannot create clusters of more than 20 cores in total. Visit [this page](https://docs.microsoft.com/en-us/azure/batch/batch-quota-limit#view-batch-quotas) to request a core quota increase.

More information regarding using a cluster can be found in the [cluster documentation](./docs/10-clusters.md)

### Check on your cluster status
### 2. Check on your cluster status
To check your cluster status, use the `get` command:
```bash
aztk spark cluster get --id <my_cluster_id>
aztk spark cluster get --id my_cluster
```

### Submit a Spark job
### 3. Submit a Spark job

When your cluster is up, you can submit jobs to run against the cluster:
When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of the spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:
```bash
// submit a java application
aztk spark cluster submit \
--id my_cluster \
--name my_java_job \
--class org.apache.spark.examples.SparkPi \
--executor-memory 20G \
path\to\examples.jar 1000

// submit a python application
aztk spark cluster submit \
--id <my_cluste_id> \
--name <my_job_name> \
[options] \
<app jar | python file> \
[app arguments]
--id my_cluster \
--name my_python_job \
--executor-memory 20G \
path\to\pi.py 1000
```
NOTE: The job name (`--name`) must be atleast 3 characters long, can only contain alphanumeric characters including hyphens but excluding underscores, and cannot contain uppercase letters. Each job you submit **must** have a unique name.
- The `aztk spark cluster submit` command takes the same parameters as the standard [`spark-submit` command](https://spark.apache.org/docs/latest/submitting-applications.html), except instead of specifying `--master`, AZTK requires that you specify your cluster `--id` and a unique job `--name`
- The job name (`--name`) must be at least 3 characters long
- It can only contain alphanumeric characters, including hyphens but excluding underscores
- It cannot contain uppercase letters
- Each job you submit **must** have a unique name
- Use the `--no-wait` option for your command to return immediately

The output of spark-submit will be streamed to the console. Use the `--no-wait` option to return immediately. More information regarding monitoring your job can be found in the [spark submit documentation.](./docs/20-spark-submit.md)
Learn more about the spark submit command [here](./docs/20-spark-submit.md)

To start testing this package, try out a Spark job from the [./examples](./examples) folder. The examples are a curated list of samples from Spark-2.2.0.

### Log in and Interact with your Spark Cluster
### 4. Log in and Interact with your Spark Cluster
Most users will want to work interactively with their Spark clusters. With the `aztk spark cluster ssh` command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:
```bash
aztk spark cluster ssh --id <my_cluster_id>
aztk spark cluster ssh --id my_cluster --user spark
```
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*.

You can configure these settings in the *.aztk/ssh.yaml* file.

### Manage your Spark cluster
NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server, depending on whether you are a Python or R user. To do so, you need to set up your cluster with the appropriate Docker image and custom scripts:
- [how to setup Jupyter with Pyspark](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [how to setup RStudio-Server with Sparklyr](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)

### 5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:
```
@@ -141,6 +146,9 @@ aztk spark cluster delete --id <my_cluster_id>
- [How do I interact with my Spark cluster using a password instead of an SSH-key?](./docs/10-clusters.md#interactive-mode)
- [How do I change my cluster default settings?](./docs/13-configuration.md)
- [How do I modify my *spark-env.sh*, *spark-defaults.conf* or *core-site.xml* files?](./docs/13-configuration.md)
- [How do I use GPUs with AZTK?](./docs/60-gpu.md)
- [I'm a Python user and want to use PySpark, Jupyter, and Anaconda packages for a Pythonic experience.](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK)
- [I'm an R user and want to use SparklyR, RStudio, and Tidyverse packages for an R experience.](https://github.com/Azure/aztk/wiki/SparklyR-on-Azure-with-AZTK)

## Next Steps
You can find more documentation [here](./docs)
3 changes: 1 addition & 2 deletions aztk/__init__.py
@@ -1,2 +1 @@
import aztk.logger
log = aztk.logger.root

28 changes: 0 additions & 28 deletions aztk/aztklib.py

This file was deleted.

30 changes: 18 additions & 12 deletions aztk_sdk/client.py → aztk/client.py
@@ -1,10 +1,10 @@
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
import azure.batch.models as batch_models
import aztk_sdk.utils.azure_api as azure_api
import aztk_sdk.utils.helpers as helpers
import aztk_sdk.utils.constants as constants
import aztk_sdk.utils.get_ssh_key as get_ssh_key
import aztk_sdk.models as models
import aztk.utils.azure_api as azure_api
import aztk.utils.helpers as helpers
import aztk.utils.constants as constants
import aztk.utils.get_ssh_key as get_ssh_key
import aztk.models as models


class Client:
@@ -14,16 +14,22 @@ def __init__(self, secrets_config: models.SecretsConfiguration):
self.blob_config = azure_api.BlobConfig(
account_key=self.secrets_config.storage_account_key,
account_name=self.secrets_config.storage_account_name,
account_suffix=self.secrets_config.storage_account_suffix
account_suffix=self.secrets_config.storage_account_suffix,

tenant_id=self.secrets_config.service_principal_tenant_id,
client_id=self.secrets_config.service_principal_client_id,
credential=self.secrets_config.service_principal_credential,
resource_id=self.secrets_config.storage_account_resource_id
)
self.batch_config = azure_api.BatchConfig(
service_url=self.secrets_config.batch_service_url,
account_key=self.secrets_config.batch_account_key,
account_name=self.secrets_config.batch_account_name,
resource_url=self.secrets_config.batch_resource_url,

tenant_id=self.secrets_config.service_principal_tenant_id,
client_id=self.secrets_config.service_principal_client_id,
credential=self.secrets_config.service_principal_credential
credential=self.secrets_config.service_principal_credential,
resource_id=self.secrets_config.batch_account_resource_id
)

self.batch_client = azure_api.make_batch_client(self.batch_config)
@@ -62,7 +68,7 @@ def __create_pool_and_job(self, cluster_conf, software_metadata_key: str, start_
"""
Create a pool and job
:param cluster_conf: the configuration object used to create the cluster
:type cluster_conf: aztk_sdk.models.ClusterConfiguration
:type cluster_conf: aztk.models.ClusterConfiguration
:param software_metadata_key: the id of the software being used on the cluster
:param start_task: the start task for the cluster
:param VmImageModel: the type of image to provision for the cluster
@@ -156,14 +162,14 @@ def __create_user(self, pool_id: str, node_id: str, username: str, password: str
is_admin=True,
password=password,
ssh_public_key=get_ssh_key.get_user_public_key(ssh_key, self.secrets_config),
expiry_time=datetime.now() + timedelta(days=365)))
expiry_time=datetime.now(timezone.utc) + timedelta(days=365)))

def __get_remote_login_settings(self, pool_id: str, node_id: str):
"""
Get the remote_login_settings for node
:param pool_id
:param node_id
:returns aztk_sdk.models.RemoteLogin
:returns aztk.models.RemoteLogin
"""
result = self.batch_client.compute_node.get_remote_login_settings(pool_id, node_id)
return models.RemoteLogin(ip_address=result.remote_login_ip_address, port=str(result.remote_login_port))
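The diff above threads new service-principal and resource-id fields from `SecretsConfiguration` into both `BlobConfig` and `BatchConfig`. The sketch below shows how a caller might populate them, assuming the keyword arguments accepted by the `SecretsConfiguration` constructor in the `aztk/models.py` diff further down; the resource-id strings and credential values are illustrative placeholders.

```python
# Hypothetical wiring of service-principal secrets into the base client.
# Field names follow the SecretsConfiguration constructor shown in the
# aztk/models.py diff below; all values are placeholders.
import aztk.models as models
from aztk.client import Client

secrets = models.SecretsConfiguration(
    service_principal_tenant_id="<tenant-id>",
    service_principal_client_id="<client-id>",
    service_principal_credential="<client-secret>",
    batch_account_resource_id=(
        "/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.Batch/batchAccounts/<batch-account>"),
    storage_account_resource_id=(
        "/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
)

# Client.__init__ builds BatchConfig and BlobConfig from these secrets,
# as shown in the diff above.
client = Client(secrets)
```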
File renamed without changes.
19 changes: 16 additions & 3 deletions aztk_sdk/models.py → aztk/models.py
@@ -1,7 +1,16 @@
from typing import List
import aztk_sdk.utils.constants as constants
import aztk.utils.constants as constants
import azure.batch.models as batch_models

class FileShare:
def __init__(self, storage_account_name: str = None,
storage_account_key: str = None,
file_share_path: str = None,
mount_path: str = None):
self.storage_account_name = storage_account_name
self.storage_account_key = storage_account_key
self.file_share_path = file_share_path
self.mount_path = mount_path

class CustomScript:
def __init__(self, name: str = None, script: str = None, run_on=None):
@@ -14,6 +23,7 @@ class ClusterConfiguration:
def __init__(
self,
custom_scripts: List[CustomScript] = None,
file_shares: List[FileShare] = None,
cluster_id: str = None,
vm_count=None,
vm_low_pri_count=None,
@@ -22,6 +32,7 @@ def __init__(
docker_repo: str=None):

self.custom_scripts = custom_scripts
self.file_shares = file_shares
self.cluster_id = cluster_id
self.vm_count = vm_count
self.vm_size = vm_size
@@ -42,13 +53,14 @@ def __init__(
batch_account_name=None,
batch_account_key=None,
batch_service_url=None,
batch_resource_url=None,
storage_account_name=None,
storage_account_key=None,
storage_account_suffix=None,
service_principal_tenant_id=None,
service_principal_client_id=None,
service_principal_credential=None,
batch_account_resource_id=None,
storage_account_resource_id=None,
docker_endpoint=None,
docker_username=None,
docker_password=None,
@@ -58,7 +70,6 @@ def __init__(
self.batch_account_name = batch_account_name
self.batch_account_key = batch_account_key
self.batch_service_url = batch_service_url
self.batch_resource_url = batch_resource_url

self.storage_account_name = storage_account_name
self.storage_account_key = storage_account_key
@@ -67,6 +78,8 @@ def __init__(
self.service_principal_tenant_id = service_principal_tenant_id
self.service_principal_client_id = service_principal_client_id
self.service_principal_credential = service_principal_credential
self.batch_account_resource_id = batch_account_resource_id
self.storage_account_resource_id = storage_account_resource_id

self.docker_endpoint = docker_endpoint
self.docker_username = docker_username
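The new `FileShare` model and the `file_shares` parameter on `ClusterConfiguration` back the Azure Files feature listed in the commit message. Below is a minimal sketch of attaching a share to a cluster configuration; the account values, share name, and mount path are placeholders.

```python
# Sketch of the Azure Files support added in this commit: a FileShare is
# passed to ClusterConfiguration via the new file_shares parameter.
# Account name, key, share name, and mount path are placeholders.
import aztk.models as models

share = models.FileShare(
    storage_account_name="<storage-account>",
    storage_account_key="<storage-key>",
    file_share_path="<share-name>",
    mount_path="/mnt/<share-name>",
)

cluster_conf = models.ClusterConfiguration(
    cluster_id="my_cluster",
    vm_count=5,
    vm_size="standard_d2_v2",
    file_shares=[share],
)
```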
2 changes: 2 additions & 0 deletions aztk/spark/__init__.py
@@ -0,0 +1,2 @@
from . import models
from .client import Client
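With the `aztk_sdk` package renamed to `aztk`, the spark sub-package now re-exports its `Client` and `models`. The sketch below is an assumption about how that client might be consumed; the shared-key secrets fields come from the `SecretsConfiguration` diff above, while the `create_cluster` call is based on the SDK docs referenced in the "Rename SDK" commit and is not shown in this diff.

```python
# Hypothetical SDK usage after the rename (aztk_sdk -> aztk). The shared-key
# secrets fields are taken from the SecretsConfiguration diff above; the
# spark Client's method names are assumptions, not part of this diff.
import aztk.models
import aztk.spark

secrets = aztk.models.SecretsConfiguration(
    batch_account_name="<batch-account>",
    batch_account_key="<batch-key>",
    batch_service_url="https://<batch-account>.<region>.batch.azure.com",
    storage_account_name="<storage-account>",
    storage_account_key="<storage-key>",
    storage_account_suffix="core.windows.net",
)

spark_client = aztk.spark.Client(secrets)
# cluster_conf = aztk.models.ClusterConfiguration(...)  # as in the sketch above
# spark_client.create_cluster(cluster_conf)             # assumed API; see SDK docs
```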