
feat: docker-compose to work off repo Dockerfile #27434

Merged into master on Mar 11, 2024 (4 commits)
Conversation

@mistercrunch (Member) commented Mar 8, 2024:

SUMMARY

This PR improves docker-compose support for development, testing and staging, but takes a stance against using docker-compose for production use cases. The reality is that the challenges around building a functional dev setup and a stable production environment are intricately similar yet intricately diverging. More segmentation here will lead to more sanity on both sides and more safety. docker-compose is now 100% focused on supporting development workflows: fast startup, fully loaded builds with testing/dev tools where needed, ROOT access in-Docker so that you can bash into the container to debug, debug flags on, ...

For production support, see our helm chart and installation docs.


Now, currently our docker-compose setup pulls images that have been built
recently on the master branch (apache/superset:${TAG:-latest}). While this works in most cases, it's
non-deterministic and not guaranteed to always work. For example, if I
merge a PR to master that removes a certain python library,
people out there doing development on branches that still have that
dependency will end up with a broken setup.

In this PR, I change the docker-compose setup(s) to:

  • reference the local Dockerfile
  • point to the right cache location (apache/superset-cache:....)
  • make that DRY since it's repeated many times across the docker-compose
    files
  • touch up both docker-compose.yml and docker-compose-non-dev.yml with
    the same approach
  • don't download Chromium by default when booting up superset-node; that added significant
    load time at every docker-compose up, when in most cases we don't need that headless browser.
    I think it's there to support Puppeteer, and it can still be controlled with the env var
    PUPPETEER_SKIP_CHROMIUM_DOWNLOAD, which seemed declared but orphaned as nothing referenced it
  • merge docker/.env and docker/.env-non-dev, as it's all meant for development now

This should work across platforms, though I could only validate on linux/arm64.
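To illustrate the direction, here is a minimal sketch of the kind of shared build block this implies; the anchor name is illustrative and the cache tag mirrors the one discussed below, so treat the actual compose files in the PR as the source of truth:

```yaml
# Sketch of a DRY build block shared across services via a YAML anchor.
x-superset-build: &superset-build
  context: .
  target: dev
  cache_from:
    - apache/superset-cache:3.9-slim-bookworm

services:
  superset:
    build: *superset-build
  superset-worker:
    build: *superset-build
```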

TESTING INSTRUCTIONS

As far as testing goes, I made sure this builds and that the resulting
setup is functional. It was also very fast in my experience; the cache
was clearly leveraged here.

context: .
target: dev
cache_from:
- apache/superset-cache:3.9-slim-bookworm
Member:

oh I was not aware of this, what process pushes to it?

Member Author (@mistercrunch):

Anything that uses scripts/build_docker.py (a CLI that wraps the docker build CLI) will use the cache-from and cache-to, but can only push if it's logged in (push or pull_request against the main repo). Currently I think all the GitHub Actions that build images (pull_request, push on master, and releases) will use this thing and hopefully use the cache.

Docker-compose can piggyback on this cache here, which should really speed up the builds since in most cases most layers can be re-used from the master builds.

Member Author (@mistercrunch), Mar 9, 2024:

Side note - one thing I noticed is that the cache doesn't always seem to hit when I think it should; I'm guessing we have some limits / intelligent cache pruning that's preventing cache hits from always working. Cache hit rate is still pretty decent, and build times aren't awful either when missing the cache.

@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
x-superset-image: &superset-image apachesuperset.docker.scarf.sh/apache/superset:${TAG:-latest}
Member:

I personally think it's kind of cool to have non-dev point to a pre-built image TAG. Also, this docker-compose file does not mount the current code into the container like docker-compose.yml does, so the non-deterministic cases probably do not apply here.

Member Author (@mistercrunch):

About this, I think both use cases are valid. To me, if I'm in a repo on a specific ref (a branch, a release tag, or my own little branch with a feature) and I run some docker-related thing (whether it's docker build or something docker-compose related), I'm assuming that what I'm building is the particular ref I'm on right now.

I think the 2 options I want to provide here are really just "interactive", where we mount the code, and "non-interactive", where it's an immutable set of containers that gets me a fully working, testable cluster lined up with the branch.

Now maybe we should ADD a new way to do this, a docker-compose-any-image.yml that would work along with a TAG env var.
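If that route were taken, such a file could be as small as the sketch below; the file name is hypothetical, and the image anchor simply mirrors the existing one in the repo:

```yaml
# Hypothetical docker-compose-any-image.yml: pull a published image selected via TAG
# instead of building from the local Dockerfile.
x-superset-image: &superset-image apachesuperset.docker.scarf.sh/apache/superset:${TAG:-latest}

services:
  superset:
    image: *superset-image
```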

Member (@dpgaspar), Mar 8, 2024:

fine by me! makes sense

Member Author (@mistercrunch):

ok, I'm making a bunch of changes here and re-writing the docs too...

@geido self-requested a review March 8, 2024 18:54

@rtexelm (Member) commented Mar 8, 2024:

Suggestions from our test run:

  • Have a way to avoid npm run build if the instance is meant for dev
  • Have an option to run the dev server locally and use a specific docker compose for the backend processes only
    • Maybe altering docker-frontend.sh, removing calls to npm install and npm run dev (everything after line 24)

@mistercrunch (Member Author) commented Mar 9, 2024:

Notes about docker-compose viability as the main tool for development -> I just did a session with @rtexelm, and we found that his 8GB MacBook M1 struggles quite a bit running docker-compose up, especially when it comes to running npm i and npm run dev INSIDE docker, specifically this stuff -> https://github.com/apache/superset/blob/master/docker/docker-frontend.sh#L25-L29 . I'm guessing it was swapping like crazy. Super super slow.

Historically npm i; npm run dev, even with more memory, was crippled by the filesystem virtualization being super slow on Macs too. Running those commands means touching millions of files in the node_modules folder, and that didn't work well. So maybe it's memory/swapping and/or a Docker version / IO issue.

In any case, I think it'd be worth considering an alternative approach, the one where you run these two commands on the host as opposed to inside Docker. This has tradeoffs, but on @rtexelm's machine it was MUCH faster.

So.

  1. I'm looking into having 2 modes - build-frontend-in-docker or build-frontend-in-host - maybe directed by an ENV var and documented properly (see the sketch after this list). I think build-frontend-in-docker should be the default, and people with less memory would flip the switch to build-frontend-in-host
  2. currently we waste some work since docker build has npm i; npm run build in a layer, but we redo it after mounting locally; it'd be nice if we could skip that step, but it may be tricky to have conditions/composition in the Dockerfile
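A minimal sketch of how mode 1 could be wired is below. The variable name matches the one that ended up documented later in this thread; the default value and the plumbing into docker/docker-frontend.sh are assumptions:

```yaml
# Sketch: docker-compose.yml passes the switch into the node container,
# and docker/docker-frontend.sh would skip npm install / npm run dev when it is "false".
services:
  superset-node:
    environment:
      BUILD_SUPERSET_FRONTEND_IN_DOCKER: ${BUILD_SUPERSET_FRONTEND_IN_DOCKER:-true}
```

With the flag set to false, the developer would run npm i && npm run dev in superset-frontend/ on the host instead.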

@mistercrunch (Member Author):
oh! saw your comment after I posted mine. Let's get this done.

@pull-request-size bot added size/L and removed size/M labels Mar 9, 2024
@github-actions bot added the doc (Namespace | Anything related to documentation) label Mar 9, 2024
@@ -0,0 +1,101 @@
#
Member Author (@mistercrunch):

this file is effectively the old docker-compose-non-dev.yml just renamed to be more clear

@mistercrunch (Member Author):
@dpgaspar I evolved this PR quite a bit, taking a harder stance against using docker-compose in production. Curious to hear your thoughts.

@craig-rueda (Member):
The slowness you're seeing is likely caused by the fact that we're running amd/linux containers on arm hardware. I would check the images that are being pulled down to ensure they're the "arm" variants.

pre-built images from docker-hub

More on these two approaches after setting up the requirements for either.

Member:

nice! ^^^

Superset (which is running in its docker container). Other databases may have slightly different
configurations but gist would be same and boils down to 2 steps -

1. **(Mac users may skip this step)** Configuring the local postgresql/database instance to accept
Member:

nit: before we had:

1. ** ......
2. ...

but now there's only 1. does it still make sense?

@dpgaspar requested a review from sfirke March 11, 2024 14:35

@mistercrunch (Member Author):
The slowness you're seeing is likely caused by the fact that we're running amd/linux containers on arm hardware. I would check the images that are being pulled down to ensure they're the "arm" variants.

Interestingly, @rtexelm's setup and mine were night and day in terms of build time. Both Apple silicon, but he has 8GB of RAM, so I assumed he was memory constrained and swapping. Though it could be that we used different base images - like he's virtualizing amd64 and I'm on an arm base. Not impossible.

It's a bit buried but I added an option and documented it here ->

By default, we mount the local superset-frontend folder here and run npm install as well as npm run dev which triggers webpack to compile/bundle the frontend code. Depending on your local setup, especially if you have less than 16GB of memory, it may be very slow to perform those operations. In this case, we recommend you set the env var BUILD_SUPERSET_FRONTEND_IN_DOCKER to false, and to run this locally instead in a terminal. Simply trigger npm i && npm run dev, this should be MUCH faster.

Also note the cache we have from the master pushes (merges) should be multi-platform as all CI docker builds are now multi-platform.

@mistercrunch (Member Author):
@rtexelm can you check whether you are/were virtualizing amd64 on your arm host? It'd be great to clarify.

@rtexelm (Member) commented Mar 11, 2024:

I was. I used the setting DOCKER_DEFAULT_PLATFORM=linux/amd64 in the past to get it to work on my system, so it must still be in effect.

@mistercrunch (Member Author):
Alright, this is dandy. Mergin'

@mistercrunch merged commit b1adede into master Mar 11, 2024
23 checks passed
@rusackas deleted the docker-compose branch March 12, 2024 03:12
@sfirke (Member) commented Mar 14, 2024:

Sorry I'm late to respond here. I am fine with the substance of these changes if they improve things for developers. I agree with "don't try to run this docker-compose file in production".

I may reintroduce some content about running docker compose in production, for people interested in that. I think Airflow has great language here:

You have Running Airflow in Docker where you can see an example of Quick Start which you can use to start Airflow quickly for local testing and development. However, this is just for inspiration. Do not expect to use this docker-compose.yml file for production installation, you need to get familiar with Docker Compose and its capabilities and build your own production-ready deployment with it if you choose Docker Compose for your deployment.

I think "you'll need to modify this and know what you're doing" is more nuanced than "don't use docker compose in production" -- IMO that's for companies to decide, if the tradeoffs of extra complexity of Kubernetes are worth it for them. Me & my org could never have tried or deployed Superset if not for docker compose, fortunately that approach was explained in the docs and I could make the necessary modifications (e.g., use an Azure Postgres instance for my metadata db).

@clayheaton commented Mar 15, 2024:

It makes sense to me that there's a docker compose setup intended entirely for development. I'm also of the opinion that, for many people who would like to deploy Superset in a small production setting, it is easiest to do so using docker compose on a hosted VM that meets the specs for serving the entire stack (minus, perhaps, the production metadata store). The reality is that Superset is relatively difficult to deploy in production right now. I don't think that it has to be that way.

One idea would be to have a docker-compose-dev.yml setup that includes a functioning superset_config_dev.py, plus a set of templated docker-compose-prod.yml and superset_config_prod.py files with a README giving clear instructions about which variables need to be set to have a minimum viable self-contained production deployment. In a lot of cases, this would simply be setting 10-15 variables (these could include paths to SSL certs, etc.).

It is true, of course, that people who aim to use Superset in production should understand the nuances of docker compose. However, if you contrast the ease of self-deploying Superset vs. Discourse, for example, I think that Superset has some room for improvement in supporting a preconfigured type of "default" production deployment.

p.s. I'm new to the Superset community and still learning about the codebase, though I hope to contribute once I understand everything that's happening.

@mistercrunch (Member Author) commented Mar 15, 2024:

Trying to list out what you'd need for a production use case on docker-compose:

  • MUST change/set your SECRET_KEY
  • none of that Postgres-in-docker funky business - I doubt anyone would argue otherwise (?), so you'd want to terraform/RDS your thing and configure it in superset_config. Already you probably want a repo to store that terraform or equivalent
  • you'd want to hop off the dev docker (bloated and insecure with root access) and onto lean, but you probably need to bake in some-but-not-all of the subpackages, maybe create your own Dockerfile inheriting from lean? Pick and choose the apt-get packages and python libs you need.
  • your own observability stuff, or maybe you don't care (?)
  • secret management? or are you just putting your postgres password on that EC2 instance?
  • point to a docker image? point to a branch? point to a specific version?

All of these configuration hooks become hard for us to manage. To me that complexity belongs in your environment, ideally in the form of terraform/helm/k8s constructs in a git repo of your own.

And now that you need to do all this regardless, why not go k8s/helm so that you can evolve into supporting some elasticity and more resilience?
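Purely to make the list above concrete - this is not a supported setup, and the variable names are assumptions that your own superset_config.py would have to consume - the kind of overrides a hypothetical docker-compose-prod.yml would need starts roughly like this:

```yaml
# Hypothetical, illustrative only - not a supported production setup.
services:
  superset:
    image: apache/superset:3.1.1                     # pin a released tag, never latest/master
    environment:
      SUPERSET_SECRET_KEY: ${SUPERSET_SECRET_KEY}    # must be set and kept out of the repo
      DATABASE_URL: ${DATABASE_URL}                  # assumed var pointing at an external managed Postgres
```

And even that leaves out observability, secret management, and the lean-image question from the list above.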

@mistercrunch (Member Author):

Thinking about it some more, knowing that docker-compose is designed (AFAIK) as a single-host solution, I don't think it's fair to call this "production" by any standard around high availability. Knowing that in most settings you'd grow a need for multi-host support, either to support HA or as usage/demand grows beyond what a single host can serve, I think trying to support production-type use cases with docker-compose is doing a disservice to people in the community.

What I'd suggest as a stance for maintainers is:

  • docker-compose is great and supported for development use cases - here we aim for simplicity and reproducibility - we don't recommend that you use or modify these files for production use cases
  • for production support, we recommend k8s and offer a helm chart. We also offer a variety of multi-platform docker images pointing to releases that you can use. We do not offer higher-level constructs like terraform scripts at this time.

@sfirke (Member) commented Mar 16, 2024:

That list is a good resource for the pitfalls / customization needs of deploying with docker compose. In my case, we can address those shortcomings easily enough (e.g., point to image 3.1.1 in our docker-compose.yml, use an Azure Postgres server). The only feature we don't get vs. Kubernetes is scalability ... and we just don't experience significant growth or fluctuations that make this an issue. We wouldn't gain anything by adopting k8s.

On the other hand, for a more old-school org like ours, the increased complexity of Kubernetes would make deploying Superset unfeasible. There are fewer people in general with that skill set vs. familiarity with docker, and we don't have anyone on staff positioned to stand up a K8s deployment. If the Superset project stance were "docker compose is unacceptable", we would have gone with Metabase or PowerBI. I find docker compose the simplest option to install and maintain, more so than PyPI, and feel like keeping it as an option -- while acknowledging the downsides -- is ultimately good for Superset as a project as it gets more people in the door.

I mentioned above liking Airflow's approach. I just looked at more peer projects off the top of my head, they varied but none were explicitly "don't use docker":

  • Metabase: offers docker install, not Kubernetes
  • Lightdash: suggests that people use Kubernetes but also documents lightweight docker deployment
  • Datahub: offers both docker and Kubernetes

I wonder if our Scarf telemetry tells us anything at least directionally useful about the share of installations that are docker compose vs. helm vs. pip.

@mistercrunch (Member Author):
How about using something like MiniKube https://minikube.sigs.k8s.io/docs/start/ ?

@clayheaton commented Mar 17, 2024:

Regarding this statement:

Knowing that in most setting you'd grow a need for multi-host support, either to support HA or as usage/demand would grow to where it can't be served by a single host, I think trying to support production-type use cases with docker-compose is doing a disservice to people in the community.

I respect your opinion and understand your viewpoint here, but I have to disagree. "Production" and "High Availability" have drastically different meanings in different environments. In my 25+ year career working with data at about a dozen different companies (large, small, government, private, across industries), I have yet to be in an environment where it would be unacceptable to run a well-maintained instance of pretty much any type of software on a single server. In my current role, I work for a large and well-known entertainment company with > 20k employees and all of our internal tools are run on single servers on a private OpenStack cloud. There's not even a provision for easily deploying with k8. This is common in companies that have established enterprise infrastructure. We get by just fine. Is it ideal? Of course not, but definitely not a deal breaker.

It worries me that the project would not provide a clear and documented route to individuals who want to run this in production on a single server. It doesn't matter if that is with MiniKube or Docker Compose or some other tech. However, like Sam mentioned, I do think this becomes somewhat of a gatekeeper to using Superset when there are other options available that may have a lower barrier to entry.

I am not a long-term member of this community and realize that my opinion here carries less weight. I do hope you will consider how to make Superset a more welcoming project for people who do not need to define "production" and "high-availability" with multi-node k8 clusters that, in my opinion, simply are not feasible in a lot of environments.

Edit: Let me provide slightly more detail about our use case in the aforementioned org... We are a team of about 50 people working on a specialized function in the larger organization. We will only ever need about 75 people, maximum, to be able to view Superset dashboards and/or receive email reports. There will only ever be 5-7 people actually connecting datasets and creating charts and dashboards for the others to view. Datasets for analysis through Superset contain from 1 to 10 billion rows of data, are staged for Superset usage (by the same 5-7 people), and mostly reside on Clickhouse servers that exclusively serve this team's needs. Other datasets may reside in Snowflake. We self-host Clickhouse on our own dedicated hardware both for performance reasons and to avoid query costs associated with Snowflake. The company is quite siloed like this, sometimes for good reason, so it's common to see other teams of similar size taking similar approaches to their work.

@mistercrunch (Member Author):

Loud and clear. Thank you for taking the time to write this - sometimes, coming out of large companies with very large infra teams, we forget the preferences and constraints of smaller environments. To be clear, we absolutely want to provide a clear path to production to as many orgs as possible, while providing the guarantees and flexibility that people need. The desire to run on a single host is indeed totally reasonable - though here I'd love to also offer an easy path to take that single-host setup towards a multi-host setup without having to switch the stack.

Some more thoughts:

  • having clear segmentation between supporting developers and supporting production use cases seemed important to me, as the needs are intricately similar yet intricately different, and the consequences of crossing the wrong wires are really bad
  • I still think helm/k8s/minikube is a better path for us to share assets and documentation to productionize things, along with good docs and examples. I agree though that we need a better path for a "production quickstart" where orgs can get to solid production quickly
  • about using docker-compose to support production use cases in smaller environments - personally I'd rather push that on the k8s side of the fence, and find a good viable solution for single-host deployments on that side, hoping Minikube could support those cases. The other option would be to craft another docker-compose-prod.yml and try to make the shortcomings and must-dos clear. If we go that route, I'd advocate to not share assets with the dev setup so they can evolve their own way without risks of crossing wires

It's always difficult for communities like this one to support the variety of constraints people have. The matrix of possible environments is crazy-complicated.

@clayheaton:
Thanks for the reply.

I think this is one area where it's preferable if the suggested single-machine deployment is highly opinionated, with links to information about different possible configurations and technologies.

It's reasonable to tell people interested in this type of deployment what specs to have on a machine (or VM), what distro of Linux to use if they want to follow the instructions exactly, and exactly which config values must be set for a minimally-customized deployment. For example, it may say "Use an Ubuntu VM with at least 16 GB of RAM and a 40 GB hard drive. Install microk8s, following steps 1-3 on these instructions. Put your .crt and .key files in /some/directory and then assign their paths to SOME_CONSTANTS in the config file." Etc.

The issue with a k8 backbone is that there are many additional options that can provide quality of life (automatic certificate renewal, etc) that are not core to getting Superset running. The only option I would provide is to suggest that users have a metadata database that is not part of the k8 setup, showing them how to include the connection string from an envvar, but also provide a pathway for running Postgres in the k8 cluster as part of the deployment. I would go so far as to say that even the security of envvars is not the core business of Superset, so perhaps instruct users to put them in an .env file for the rapid deployment and then link to resources for how to better secure them.

As you mention, there are myriad possible configurations. That's why I personally feel like taking an opinionated approach here is best: it cuts through the noise and helps provide more of a quick start for a small single-machine production deployment.

Incidentally, from what I've read the past few days, microk8s seems more suitable and tuned for a single-machine k8 cluster than minikube, which often is mentioned as being more appropriate for testing. This is, of course, a matter of opinion.

However this ends up, it will be a benefit to the Superset project, in my opinion, because it will lower the entry barrier for smaller teams. I remain of the opinion that a docker compose setup is more simple, requires fewer resources, and can be less opinionated, but if that's antithetical to the high-availability goals of the project, then so be it.

@mistercrunch (Member Author):

Ok, so assuming that postgres-on-docker is NOT viable, and we want to be opinionated, we would force the user to provide a postgres host/username/password (in the form of a sqlalchemy url) AND a secret key at a minimum.

Curious if:

  • is minikube the best option for an easy-yet-solid small k8s instance?
  • how far are we from getting this setup to work super simply on both a Mac and an EC2 instance?
  • the state of out-of-the-boxness of our current helm chart

Let's get off this PR and let me start a discussion on "Get Superset / Helm to work on MiniKube and document the process" -> #27570

@mistercrunch (Member Author):

I just tried minikube and was up and running in like 10 minutes, and I have little experience with k8s. Pretty much just brew install minikube && minikube start and followed our docs for k8s. Smooth and easy.
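For anyone following along, the k8s docs path mostly comes down to a couple of helm values on top of minikube. A heavily abbreviated, illustrative sketch of a values override is below (configOverrides is a documented chart option; everything a real setup needs beyond it, such as an external metadata DB and ingress, is omitted here):

```yaml
# my-values.yaml - minimal, illustrative helm values for the Superset chart
configOverrides:
  secret: |
    SECRET_KEY = 'CHANGE-ME-TO-A-LONG-RANDOM-STRING'
```

applied via the usual helm install/upgrade flow described in the k8s installation docs.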

@clayheaton:
I found it similarly easy on my laptop but struggled to get the Ingress working when I tried it on a VM. I also found it a bit of a challenge to debug when containers weren’t starting properly.

@mistercrunch (Member Author) commented Mar 21, 2024:

struggled to get the Ingress

Curious what the issue was and the workaround if you found any. Looks like between kubectl port-forward superset-xxxx-yyyy :8088 and minikube start --listen-address=0.0.0.0 you should be able to get something going.

I also found it a bit of a challenge to debug when containers weren’t starting properly

Personally I thought this was fantastic with kubectl. Overall kubectl is fantastic once you get the hang of it. I'm guessing you probably have a GUI helping out with that, showing the deployments, pods, status, logs, .... The ecosystem is so much richer on that side.

sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
@clayheaton:
I have no qualms with the k8 ecosystem but I find it heavy for a single-machine deployment. In my specific case, I have to set up an SSL certificate and a variety of compliance logging tools and related security tech (such as an HCP Vault, etc), configure SSO with Active Directory, etc.

It was unclear how to get this smoothly working with the kubernetes instructions on the Superset website and I don't have a ton of experience working with k8. With regards to Ingress, I tried about 30 combinations of different setups, verified the firewall, etc. and was never able to connect except on an insecure port.

My schedule is busy right now, so I was not able to devote more time to it. However, I was able to get it all working through Docker compose in about 20 minutes, in a manner that will work well for my team. When I have more free time, I'll return to it and try to get it working with minikube.

@mistercrunch (Member Author):

In my specific case, I have to set up an SSL certificate and a variety of compliance logging tools and related security tech (such as an HCP Vault, etc), configure SSO with Active Directory, etc.

This sounds like it's specific to your internal policy / k8s setup (?) Also, all these things seem virtuous and important, though maybe overkill for a sandbox/POC-type environment like you're seeking. I'm guessing a quick-and-dirty minikube-on-an-EC2-host wouldn't have those requirements, and would emulate the docker-compose setup pretty well, no (?) One positive thing from that setup is that whatever you'd do or customize in minikube would be more transferable to your internal k8s if you ever commit to a more proper production setup.

Another thing to think about is that it may make sense for your organization to also have some infra for a more relaxed k8s setup for a lower tier of internal applications that don't need the same level of rigor as top-tier internal apps. I know that sounds like a bit of an investment, but if the org doesn't provide the right level of infra (in this case something lightweight), you end up with oddball services in the dusty corners of your cloud. In my experience, making it easy for people to do the right things goes a long way in terms of overall efficiency.

Take the SECRET_KEY for instance, and the creds to your data warehouse that someone could obtain by mishandling it - unless you're working on public data, those things are pretty important. Like, if you don't care about these things being secure, then maybe the docker-compose setup on the dev docker layer that's a "do not use in prod" setup is also fine (?)

Everyone seems to think k8s has to be this huge complicated thing, and has to be one big cluster to rule them all with all of the policies and compliance enforced. But you know, your refrigerator probably runs it, your car, ...

@clayheaton commented Mar 23, 2024:

I don't mean to derail this to my specific use case, though I appreciate and agree with most of your comments. In fact, the org is working on an internal k8 deployment platform, though it may be a while before it is ready, and the features available on it are not yet published.

The company handles a lot of PII. Like many companies, there is a tier of BI analysts who mostly handle scrubbed data, there are those who handle sensitive financial data, and there are teams who work with all of the data. Some of the latter are legal, customer support, and security, others are marketing, business dev, etc. in a targeting role. As a company based in the EU, there is a huge emphasis on data security because of the GDPR. In my experience, this is relatively common, even for companies based elsewhere.

That is to say that there really is no such thing as a "lower tier" of internal application, because all applications are viewed as exposed surfaces from a security standpoint and are considered to increase the potential risk of having malicious actors gain entry to the intranet (which is an ongoing, usually-detectable threat). A sufficiently bad breach could threaten the company's existence.

From a security hygiene point of view, this is a good thing because it encourages good habits. Were it easy enough to do, we should all strive to use valid SSL certificates on servers that are properly configured and have encrypted secrets. These are disparate pieces of tech and getting them all to work well together can be a challenge, as you know.

With regards to Superset, my opinion largely has two simple layers: 1) It should be as easy as possible for as many people as possible to get it up and running 2) in a reasonably secure manner that has a clear upgrade path for patching security vulnerabilities. Getting it running often might initially be for prototype purposes, though hardening it for a simple production deployment ideally should be a straightforward and/or well-documented step. Whether that is with docker compose or k8 does not matter much as long as it can be achieved on accessible hardware.

To hop back to what I said earlier in this thread, I don't think Superset is far away from what is needed to support these use cases. There are plenty of teams in the world that will be able to clone the repo and fully customize their setup without any handholding. There are teams that will simply sign up with Preset (or Tableau or PowerBI) and outsource the problem. Then there are the teams (of all sizes) who are going to want to self-deploy and need a well-documented (opinionated is fine) secure way to do it.

Remember how Wordpress became all the rage in 2004 or so and then became the vector for countless hacks? There are basically two approaches to avoid that: make it so the software is too difficult for an average person to get running on their own or provide the clear route and documentation to deploy and maintain in a manner that reduces the chances of that happening. This has always been one of the lessons that Discourse took to heart, resulting in software that is remarkably easy both to deploy and to keep up-to-date.

[Screenshot: Discourse admin page]

Over the years, I've worked in various roles in game development. I recall from the late 1990s, in my first role, when people would want to come in and pitch game ideas to us... one of the lead designers told me before a pitch meeting: "A document is fine, a picture is worth a thousand documents, but a working prototype is worth a thousand pictures." That's been an excellent lesson for my entire career; instead of telling somebody what you're going to do, actually do it in a prototype capacity and then explain to them how to make it a reality.

There's such a hunger for high-quality BI and data visualization these days. Superset is so close to being a platform that is easy to stand up and show as a prototype, though I'm of the opinion that it's just a little bit too hard to deploy right now.

...and then comes the biggest problem of them all, regardless of business size: software meant to be only a prototype ends up in production usage and suddenly you've got a problem because it wasn't deployed in a way to fit that intention. This is where it absolutely matters that it can be secured easily and that people who don't normally deploy production software have access to opinionated resources for making it happen.

Apologies for this being too long and rambling. I haven't had enough coffee yet. :) It's a rainy weekend, so I'll see if I can work on documenting a more tangible approach to deploying in a secure manner with minikube.

@mistercrunch (Member Author):
Richest comment I've ever seen in a PR. For real. 🏆!

Love the idea of having a better version checker and showing admins they need to upgrade.

The more I think about things, the more I think that the approach I'm pushing forward - using docker-compose for sandboxing and development, and k8s for production - is a good one. For people who want to go lower level and run straight on metal or EC2 equivalents, they can take the helm chart as a recipe for how to do this.

qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024