Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for CSI Cluster Registry #2514

Closed
wants to merge 3 commits into from
Closed
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
# CSI Cluster Registry Design Doc

***Status:*** Pending

***Version:*** Alpha

***Author:*** Saad Ali ([@saad-ali](https://github.com/saad-ali), saadali@google.com)

## Terminology

Term | Definition
---|---
Container Storage Interface (CSI) | A specification attempting to establish an industry standard interface that Container Orchestration Systems (COs) can use to expose arbitrary storage systems to their containerized workloads.
CSI Volume Plugin | A new, in-tree volume plugin that acts as an adapter and enables out-of-tree, third-party CSI volume drivers to be used in Kubernetes.
CSI Volume Driver | An out-of-tree CSI compatible implementation of a volume plugin that can be used in Kubernetes through the Kubernetes CSI Volume Plugin.

# Summary

This document proposes a "CSI cluster registration mechanism".

This mechanism will enable:
* A CSI volume driver to register itself with Kubernetes when it is deployed.
* A CSI volume driver to customize how Kubernetes interacts with it (e.g. skip attach process because the driver doesn't support ControllerPublish, etc.).
* A user or cluster admin to easily discover which CSI volume drivers are deployed on their Kubernetes cluster.
* Be optional -- a driver may choose not to use it (and will get the default set of behaviors).


## Background & Motivations

Kubernetes supports the Container Storage Interface (CSI) to enable third party storage developers to deploy volume drivers exposing new storage systems in Kubernetes without having to touch the core Kubernetes code.
Support for CSI was introduced as alpha in Kubernetes v1.9, and moved to beta in v1.10.
See "CSI Volume Plugins in Kubernetes Design Doc" in the "Links" section below for details on how an arbitrary CSI volume driver is deployed and interacts with Kubernetes.
The beta implementation of CSI has a number of limitations, including:
* CSI drivers must be deployed with the provided CSI external-attacher sidecar container, even if they don’t implement ControllerPublishVolume.
* Meaning a CSI volume driver deployed on kubernetes has to deploy a CSI `external-attacher` container even if the volume driver doesn't require a "volume attach" operation.
* In this case, the CSI `external-attacher` container basically does a no-op in response to Kubernetes `VolumeAttachment` objects to allow Kubernetes to continue with the mounting process.
* Users and cluster admins have no easy way to discover what CSI volume drivers are deployed on their Kubernetes cluster.

The proposed "CSI cluster registration mechanism" should address these issues.

### Links

* [Container Storage Interface (CSI) Spec](https://github.com/container-storage-interface/spec/blob/master/spec.md)
* [CSI Volume Plugins in Kubernetes Design Doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md)

## Objective

The objective of this document is to document all the requirements for enabling a cluster level registry for a CSI compliant volume drivers in Kubernetes.

## Goals

* Allow a CSI volume driver (that opts-in to new mechanism) to configure how Kubernetes should interact with it.
* Improve discoverability by users/cluster admins of a CSI volume driver (that opts-in to new mechanism) deployed on the Kubernetes cluster.

## Non-Goals

* Define how CSI volume driver should be deployed on Kubernetes.
* Require use of new CSI cluster registry mechanism for all CSI volume drivers deployed on Kubernetes (the mechanism will be optional/opt-in).

## Design Overview

A new custom resource will be automatically be installed on Kubernetes clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a new custom resource definition? CR is optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


Upon deployment a driver must create a new custom resource object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Driver may create a new object? It's optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


## Design Details

### `CSIDriver` Object

#### Proposed API

```go
// CSIDriver captures information about a Container Storage Interface (CSI)
// volume driver deployed on the cluster.
//
// CSIDriver objects are non-namespaced.
type CSIDriver struct {
metav1.TypeMeta `json:",inline"`

// Standard object metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong indentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
metav1.ObjectMeta `json:"metadata"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we require that ObjectMeta.Name is the CSI driver name? How else can we ensure that each CSI driver has just one CSIDriver instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should require it. We should have validation webhook to verify that you can't create two objects with the same CSIDriverSpec.Driver fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is IMO racy. There may be several API servers running in parallel.

  1. API server A gets POST CSIDriver.name=X for a driver.
  2. API server B gets POST CSIDriver.name=Y for the same driver at the same time.
  3. Both call validation webhook.
  4. The webhook gets two queries, possibly in parallel. Let's say the webhook synchronizes the requests and processes X fist.
  5. The webhook can GET CSIDriver to check if it already exists, but in this case it does not return anything useful, X and Y were not created yet.
  6. The webhook allows X to be created. It remembers that X exists in local cache.
  7. The webhook processes Y, it can check in the API server if X exists. It does not exist yet, either the API server is slow or it was rejected by another webhook / validation.
  8. The webhook can check internal cache if it allowed X. It was allowed, so it rejects Y.

And this is wrong, because X might have been rejected by API server (e.g. another webhook failed) and Y should have been allowed.


// Specification describing the CSI volume driver and any custom
// configuration for it.
Spec CSIDriverSpec `json:"spec"`

// Status of the CSI volume driver.
Status CSIDriverStatus `json:"status,omitempty"`
}

// CSIDriverList is a collection of CSIDriver objects.
type CSIDriverList struct {
metav1.TypeMeta `json:",inline"`

// Standard list metadata
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

// Items is the list of CSIDrivers
Items []CSIDriver `json:"items"`
}

// CSIDriverSpec is the specification of a CSIDriver.
type CSIDriverSpec struct {
// Driver indicates the name of the CSI driver that this object refers to.
// This MUST be the same name returned by the CSI GetPluginName() call for
// that driver.
Driver string `json:"driver"`

// Indicates this CSI volume driver requires an attach operation (because it
// implements the CSI ControllerPublishVolume() method), and that Kubernetes
// should call attach and wait for any attach operation to complete before
// proceeding to mounting.
// If value is not specified, default is true -- meaning attach will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a default as in the normal defaults (will be set in the object) or is it the implied behavior? To set a default will require admission control... Just clarify.

// called.
// +optional
AttachRequired *bool `json:"attachRequired"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a dump question.

Isn't this value already exposed from ControllerServiceCapability as PUBLISH_UNPUBLISH_VOLUME? Can't kubelet make decision based on that value?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is where the call is being made. Kubelet is not doing the attach. A pod floating in the cluster takes attachment requests, converts them into the driver calls for attach/detach.

You could have the attacher sidecar running, probe the driver as you say, and then determine that it doesn't ever have to do anything. That does work today.

But that's a fair amount of extra pluming running in the cluster that isn't needed.

The flag above tells Kubernetes to skip entirely any of the handshaking at the cluster level required to do that attach so the sidecar that does nothing and its driver doesn't even need to exist.


// Indicates this CSI volume driver requires additional pod information
// (like podName, podUID, etc.) during mount operations.
// If this is set to true, Kubelet will pass pod information as
// VolumeAttributes in the CSI NodePublishVolume() calls.
// If value is not specified, default is false -- meaning pod information
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass this always?

// will not be passed on mount.
// +optional
PodInfoRequiredOnMount *bool `json:"podInfoRequiredOnMount"`
}
Copy link
Member

@vladimirvivien vladimirvivien Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDITED:
Saad, please add field to indicate whether volumeHandle needs to be automatically generated. I am proposing the following:

type VolumeHandleMode string

const (
    AutomaticVolumeHandleGeneration VolumeHandleMode = "automaticVolumeHandleGeneration"
    NoVolumeHandleGeneration VolumeHandleMode = "noVolumeHandleGeneration"
)

< snip >

type CSIDriverSpec struct {
   < snip >

    // Indicates how the volumeHandle for inline CSI volumes will be handled during volume operations.
    // If set to AutomaticVolumeHandleGeneration, this provides a hint to the Kubelet to 
    // automatically generate handles for the inline volume.
    //
    // If set to `NoVolumeHandleGeneration`, this is a hint to Kubelet not to automatically 
    // generate handles for volumes when not provided.
    VolumeHandleMode *VolumeHandleMode   `json:"volumeHandleMode"`
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be limited only to inline volumes? Or handles for PVs should be (optionally) autogenerated too?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... pv's too.... say someone used the image driver with a pv (for some reason). It would break horribly if it wasn't given proper handles.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now the language and implementation should be limited to Inline.


// CSIDriverStatus is the status of a CSIDriver.
type CSIDriverStatus struct {
// Indicates the volume driver has been successfully deployed on the cluster
// and is ready to use.
Ready bool `json:"ready"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is consumer of this information? Shall A/D controller postpone any attach operation until CSI plugin gets Ready? Shall Kubelet postpone mounting of volumes? Who/when sets it to true? Most importantly, who/when sets it to false? What should A/D controller and kubelet do when the plugin becomes Unready?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the initial version no consumers. In the future we can use this as one of the bits for smarter CSI volume aware scheduling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this should be an condition, with timestamp when the driver has become (un-)ready, with a message why and so on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant on expanding the API more for a hypothetical use case. I'd prefer to leave it simple or drop it altogether until we have a real use case. If I drop it, there would be nothing in CSIPluginStatus so I would have to drop that for now too. What do you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am torn on this. Currently I don't see any useful Status to report. But I may come in future. Can we have an object with .Spec and no .Status?

If we use Ready, we need clear semantics what it means and most importantly what is it good for. Is it enough to have one node with CSI driver installed to have the global object ready? Is it enough to have one node with CSI driver broken (failed liveness probe, driver not installed there, crash loop...) to have the global object not-ready?

In the future we can use this as one of the bits for smarter CSI volume aware scheduling.

For smarter CSI volume aware scheduling I would use decision based on availability / status of drivers on nodes, not a global object. What should scheduler do when the global object is not ready? All it can do is to stop scheduling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still unclear to me what "Ready" means. "the volume driver has been successfully deployed" - so all containers (DaemonSet with node components and Deployment with controller components) are running? What one of them fails, who sets Ready=false? What if I add a new node, shall something set it to Ready=false, wait until the DaemonSet runs a new pod with node components there and set it to Ready=true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove

}

```

#### CRD Installation

The `CSIDriver` object schema will checked in to kubernetes/kubernetes repo under a storage directory.
The `CSIDriver` `CustomResourceDefinition` (CRD) will be installed by a new Kubernetes controller that is responsible for ensuring required CRDs are installed.
The controller will periodically verify that the CRD is still installed, and recreate it, if it is not.

#### CR Creation

When a CSI volume driver is deployed on Kubernetes, it may optionally, register it self with Kubernetes by creating a new `CSIDriver` object.
The Kubernetes team will modify the CSI [`driver-registrar` container](https://github.com/kubernetes-csi/driver-registrar) to automatically do this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.... Thinking about this some more, with my operator hat on, I'd rather it not. If the driver-registrar were to do it, then we'd have to rbac it well in order to restrict its access to a safe subset of things. but if you are going through that much effort, you could instead just upload the crd directly in whatever csi packaging is provided, like via k8s yaml / helm chart to the cluster using your admin credentials instead and not worry about it at all. Similar to some of the reasoning Rook took when they tightened security (https://github.com/rook/rook/blob/master/design/security-model.md for details)

If kubelet driver registration is enabled, the kubelet will automatically do this as part of plugin registration.
The driver must set the `Driver` field to the same name returned by the CSI `GetPluginName()` call for that driver.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this mean that a CSI driver has to ship with some mechanism of instantiating CRs when it is deployed? Currently obviously that is not part of CSI spec, so how will that get done?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this mean that a CSI driver has to ship with some mechanism of instantiating CRs when it is deployed?

CSI driver authors can choose to but don't have to

Currently obviously that is not part of CSI spec, so how will that get done?

The same way all custom k8s behavior is done, side car containers provided by us (e.g. driver-registrar, external-attacher, etc.).

The driver may set any optional configuration fields (like `SkipAttach`) as appropriate.
When the driver is ready to serve, it must set `Ready` in the status to `true`.

#### Upgrade/Downgrade
TODO

## Alternatives Considered
TODO