Dask helm deployment not working in AKS #386

Open
aravindrp opened this issue Mar 30, 2021 · 14 comments

Comments

@aravindrp

aravindrp commented Mar 30, 2021

What happened:

I am trying to deploy Dask Gateway to Azure by following the documentation: https://gateway.dask.org/install-kube.html

We already have an AKS cluster that is configured to use a Traefik ingress. To avoid a duplicate deployment of Traefik, I downloaded the latest version of the chart and created a modified version by removing the contents of the template/traefik folder. Everything else is the same as the official Helm chart.

I deployed Dask Gateway successfully, and the pods are running without crashing. I then tried to access the deployed Dask Gateway instance from a Jupyter notebook that is also deployed in the same cluster. Since I only need to access it from within the cluster, I tried accessing the ClusterIP service api--dask-gateway directly, but it fails with a 403 Forbidden error.

Could you please help in resolving this issue?

What you expected to happen:

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version: dask helm chart 0.9.0
  • Python version:
  • Operating System:
  • Install method (conda, pip, source): helm
@jacobtomlinson
Member

Could you share your Dask Gateway config? Particularly your auth config.

@aravindrp
Author

aravindrp commented Mar 30, 2021

Hi @jacobtomlinson ,

Thanks for looking into this. Please find the values.yaml and the helm debug output below.

gateway:
  # Number of instances of the gateway-server to run
  replicas: 1

  # Annotations to apply to the gateway-server pods.
  annotations: {}

  # Resource requests/limits for the gateway-server pod.
  resources: {}

  # Path prefix to serve dask-gateway api requests under
  # This prefix will be added to all routes the gateway manages
  # in the traefik proxy.
  prefix: /

  # The gateway server log level
  loglevel: INFO

  # The image to use for the gateway-server pod.
  image:
    name: <azure_container_registry>/daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent

  # Image pull secrets for gateway-server pod
  imagePullSecrets: []

  # Configuration for the gateway-server service
  service:
    annotations: {}

  auth:
    # The auth type to use. One of {simple, kerberos, jupyterhub, custom}.
    type: simple

    simple:
      # A shared password to use for all users.
      password: null

    kerberos:
      # Path to the HTTP keytab for this node.
      keytab: null

    jupyterhub:
      # A JupyterHub api token for dask-gateway to use. See
      # https://gateway.dask.org/install-kube.html#authenticating-with-jupyterhub.
      apiToken: null

      # JupyterHub's api url. Inferred from JupyterHub's service name if running
      # in the same namespace.
      apiUrl: null

    custom:
      # The full authenticator class name.
      class: null

      # Configuration fields to set on the authenticator class.
      options: {}

  livenessProbe:
    # Enables the livenessProbe. 
    enabled: true
    # Configures the livenessProbe. 
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 6
  readinessProbe:
    # Enables the readinessProbe.
    enabled: true
    # Configures the readinessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 3

  backend:
    # The image to use for both schedulers and workers.
    image:
      name: <azure_container_registry>/daskgateway/dask-gateway
      tag: 0.9.0
      pullPolicy: IfNotPresent

    # The namespace to launch dask clusters in. If not specified, defaults to
    # the same namespace the gateway is running in.
    namespace: null

    # A mapping of environment variables to set for both schedulers and workers.
    environment: null

    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig: {}

      # Any extra configuration for the scheduler container.
      # Sets `c.KubeClusterConfig.scheduler_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for the scheduler.
      cores:
        request: null
        limit: null

      # Memory request/limit for the scheduler.
      memory:
        request: null
        limit: null

    worker:
      # Any extra configuration for the worker pod. Sets
      # `c.KubeClusterConfig.worker_extra_pod_config`.
      extraPodConfig: {}

      # Any extra configuration for the worker container. Sets
      # `c.KubeClusterConfig.worker_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for each worker.
      cores:
        request: null
        limit: null

      # Memory request/limit for each worker.
      memory:
        request: null
        limit: null

  # Settings for nodeSelector, affinity, and tolerations for the gateway pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

  # Any extra configuration code to append to the generated `dask_gateway_config.py`
  # file. Can be either a single code-block, or a map of key -> code-block
  # (code-blocks are run in alphabetical order by key, the key value itself is
  # meaningless). The map version is useful as it supports merging multiple
  # `values.yaml` files, but is unnecessary in other cases.
  extraConfig: {}

# Configuration for the gateway controller
controller:
  # Whether the controller should be deployed. Disabling the controller allows
  # running it locally for development/debugging purposes.
  enabled: true

  # Any annotations to add to the controller pod
  annotations: {}

  # Resource requests/limits for the controller pod
  resources: {}

  # Image pull secrets for controller pod
  imagePullSecrets: []

  # The controller log level
  loglevel: INFO

  # Max time (in seconds) to keep around records of completed clusters.
  # Default is 24 hours.
  completedClusterMaxAge: 86400

  # Time (in seconds) between cleanup tasks removing records of completed
  # clusters. Default is 5 minutes.
  completedClusterCleanupPeriod: 600

  # Base delay (in seconds) for backoff when retrying after failures.
  backoffBaseDelay: 0.1

  # Max delay (in seconds) for backoff when retrying after failures.
  backoffMaxDelay: 300

  # Limit on the average number of k8s api calls per second.
  k8sApiRateLimit: 50

  # Limit on the maximum number of k8s api calls per second.
  k8sApiRateLimitBurst: 100

  # The image to use for the controller pod.
  image:
    name: <azure_container_registry>/daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent

  # Settings for nodeSelector, affinity, and tolerations for the controller pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

# Configuration for the traefik proxy
traefik:
  # Number of instances of the proxy to run
  replicas: 1

  # Any annotations to add to the proxy pods
  annotations: {}

  # Resource requests/limits for the proxy pods
  resources: {}

  # The image to use for the proxy pod
  image:
    name: traefik
    tag: 2.1.3

  # Any additional arguments to forward to traefik
  additionalArguments: []

  # The proxy log level
  loglevel: WARN

  # Whether to expose the dashboard on port 9000 (enable for debugging only!)
  dashboard: false

  # Additional configuration for the traefik service
  service:
    type: LoadBalancer
    annotations: {}
    spec: {}
    ports:
      web:
        # The port HTTP(s) requests will be served on
        port: 80
        nodePort: null
      tcp:
        # The port TCP requests will be served on. Set to `web` to share the
        # web service port
        port: web
        nodePort: null

  # Settings for nodeSelector, affinity, and tolerations for the traefik pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

rbac:
  # Whether to enable RBAC.
  enabled: true

  # Existing names to use if ClusterRoles, ClusterRoleBindings, and
  # ServiceAccounts have already been created by other means (leave set to
  # `null` to create all required roles at install time)
  controller:
    serviceAccountName: null

  gateway:
    serviceAccountName: null

  traefik:
    serviceAccountName: null

Best Regards,
Aravind

@TomAugspurger
Member

You're getting a 403 error. How are you authenticating with Dask Gateway?

@aravindrp
Author

I was trying to test this component in AKS by following this document: https://gateway.dask.org/install-kube.html. I haven't configured any authenticator, so I believe the simple authenticator is used by default. I was trying to connect to dask-gateway using the code below from a Jupyter notebook instance deployed in the same cluster, in another namespace:

from dask_gateway import Gateway
gateway = Gateway(address="http://:8000")
gateway.list_clusters()

I presume since no password is configured I can omit the auth parameter.
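
For readers following along, here is a minimal sketch of the connection step with the address and credentials spelled out. It assumes the default cluster domain `cluster.local`; the service and namespace names are hypothetical placeholders, and port 8000 is taken from the snippet above. With the simple authenticator, the dask-gateway client can pass credentials through `BasicAuth`, so stating them explicitly removes one variable when chasing a 403.

```python
from typing import Optional


def in_cluster_address(service: str, namespace: str, port: int = 8000) -> str:
    """Build the cluster-internal URL for a Service that lives in another namespace."""
    if not service or not namespace:
        raise ValueError("service and namespace must both be non-empty")
    # Assumes the default Kubernetes cluster domain, cluster.local.
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def connect(address: str, username: str, password: Optional[str] = None):
    """Create a Gateway client with explicit basic-auth credentials."""
    # Imported lazily so the helper above works without dask-gateway installed.
    from dask_gateway import BasicAuth, Gateway

    return Gateway(address=address, auth=BasicAuth(username=username, password=password))
```

For example, `connect(in_cluster_address("my-api-service", "dask"), username="alice")`, where both names stand in for the real Service and namespace.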

@jacobtomlinson
Member

Yeah I would expect this to work.

Could you also share the pod logs for the gateway server?

@aravindrp
Author

@jacobtomlinson please find the logs below:

gateway_logs.zip

@aravindrp
Author

@jacobtomlinson just to clarify: if we are accessing the gateway from within the cluster, there is no need for the Traefik components to be deployed, right? As I mentioned in the issue description, I removed the YAML files in the template/traefik folder because the cluster already had a Traefik deployment.

@jacobtomlinson
Member

@aravindrp ah I missed that! So you've modified the chart? I think you will need the Traefik components here; they are used to do some specific proxying of the scheduler.

@aravindrp
Author

@jacobtomlinson, yes, I modified it since we already have a Traefik ingress configured in our cluster. In that case, how should I do it? It was not clear to me from the documentation how to integrate it with an existing Traefik ingress.

@jacobtomlinson
Member

Dask gateway does not use traefik as an ingress, just as a service to proxy traffic.

Configure it the same way you would any other service.
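
As a rough illustration of treating the proxy as an ordinary Service, the sketch below builds the two addresses a client would use when going through it. The Service name, the namespace, and the `gateway://` scheme for `proxy_address` are assumptions modeled on the dask-gateway documentation; port 80 comes from the values.yaml shared earlier in the thread.

```python
def traefik_addresses(service: str, namespace: str, port: int = 80) -> dict:
    """Return the API and scheduler-proxy addresses for a proxy Service."""
    host = f"{service}.{namespace}"
    return {
        # HTTP API requests go to the proxy's web entrypoint.
        "address": f"http://{host}:{port}",
        # Scheduler traffic goes through the same Service (assumed scheme).
        "proxy_address": f"gateway://{host}:{port}",
    }


def connect_via_proxy(service: str, namespace: str):
    """Create a Gateway client that routes everything through the proxy Service."""
    # Imported lazily so the address helper works without dask-gateway installed.
    from dask_gateway import Gateway

    return Gateway(**traefik_addresses(service, namespace))
```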

@aravindrp
Author

aravindrp commented Mar 31, 2021

@jacobtomlinson I added back the files below from the traefik templates folder.

  • template/traefik/dashboard.yaml
  • template/traefik/service.yaml

I believe the deployment and RBAC YAML files are not required, as the cluster already has a Traefik ingress.

After the new deployment, the Traefik LoadBalancer service is also available.
[image]

But when I try to connect to the URL using the external IP of the service, I still get a 403 error.
[image]

When I use curl, the error has some more details.
[image]

@jacobtomlinson
Member

@aravindrp please do not modify the YAML in the chart. It makes it much harder for us to test and support it.

If you need to disable things please do so in the config, and if it's not possible to disable in config then please raise an issue so we can get that fixed.

Please could you try installing the vanilla chart without modifications and let us know how you get on.

@aravindrp
Author

@jacobtomlinson sorry for the delay in responding. I used the shortcut of manually modifying the Helm chart because I was still evaluating Dask Gateway. I plan to do it the proper way in the final implementation.

After some further analysis, this looks like a problem with the aiohttp package used within dask-gateway: when I use curl or the urllib3 package to connect to the API server, it works without any issues. I have posted a question on Stack Overflow about this. Meanwhile, have you seen any similar behavior before?

https://stackoverflow.com/questions/67115594/forbidden-error-while-trying-to-access-a-url-using-aiohttp
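
One way to narrow this down is to probe the same URL with different clients and compare the status codes. The sketch below is a diagnostic under stated assumptions, not a fix: the URL is whichever endpoint returns the 403, and the function names are illustrative. One difference worth ruling out is proxy handling, since urllib (like curl) honors proxy environment variables such as `http_proxy` by default, while aiohttp ignores them unless `trust_env=True` is set on the session.

```python
import asyncio
import urllib.error
import urllib.request
from typing import Optional


def probe_status(url: str, timeout: float = 5.0, use_env_proxy: bool = True) -> Optional[int]:
    """Return the HTTP status of a GET on url, or None if the host is unreachable."""
    # An empty ProxyHandler mapping disables proxies picked up from the environment.
    handlers = [] if use_env_proxy else [urllib.request.ProxyHandler({})]
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return exc.code
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return None


async def probe_status_aiohttp(
    url: str, timeout: float = 5.0, trust_env: bool = False
) -> Optional[int]:
    """Same probe through aiohttp (imported lazily; it must be installed)."""
    import aiohttp

    client_timeout = aiohttp.ClientTimeout(total=timeout)
    try:
        async with aiohttp.ClientSession(timeout=client_timeout, trust_env=trust_env) as session:
            async with session.get(url) as resp:
                return resp.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None
```

Running `probe_status(url)`, `probe_status(url, use_env_proxy=False)`, and `asyncio.run(probe_status_aiohttp(url))` side by side shows whether the 403 depends on the client or on the route the request takes.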

@consideRatio
Collaborator

@aravindrp thanks for summarizing that this may be related to aiohttp. It may be relevant to know that it's being updated in #423, but we also need to get a release out after that is merged.

As we have not arrived at a clear action point to take with regards to the code in this repo, I suggest we close this issue at this point.
