Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Triage database provisioner #840

Merged
merged 10 commits into from
Aug 26, 2021
Merged

Triage database provisioner #840

merged 10 commits into from
Aug 26, 2021

Conversation

thcrock
Copy link
Contributor

@thcrock thcrock commented Mar 30, 2021

Not quite ready for merge (I would add some documentation around this and maybe clean up the code a bit), but I wanted to get thoughts on this interface for a local Dockerized database provisioner. The idea is Triage users could use this to easily set up a database outside of a Dirtyduck context but without having to do Postgres on their own. With just docker as a dependency, they can:

  1. Bring up the thing from scratch; build the image, run Postgres through the image, and write a database.yaml file using the credentials created by the script.
  2. Start the container if if stopped
  3. Rebuild the container if it was removed

All using the command 'triage db up'.

Note: I don't know what's going on with the Crosstabs CLI but I couldn't get this to run while using those imports. Did that crosstab code get removed? It's obviously unrelated to this PR but if I can't run the CLI with those lines in there I had to take them out.

@nanounanue
Copy link
Contributor

Thanks for this @thcrock

  1. The idea here is to create just a dockerized DB or create the dockerized dirtyduck DB?
  2. I am not a big fan of database.yaml thing (is another file, prone to errors, security risk, etc) but if we want to keep that interface we should support the environment variables (DATABASE_URL or the PG_... ones)

@nanounanue
Copy link
Contributor

@thcrock weird the crosstabs thingy, last week I ran it from the repo without problems. I will check it to track the bug

@thcrock
Copy link
Contributor Author

thcrock commented Apr 5, 2021

It does appear that I'm the only one who's had the crosstabs problem so I will try building my environment again.

As for environment variables: we do support environment variables for database credentials in Triage, but this database provisioner is just creating the credentials, not using them. Any way of attempting to automatically set the environment variables for the user would be shell-specific and OS-specific, and not work until they restart the shell. Do you have something in mind? Just print the credentials to screen and tell the user to set those environment variables?

@thcrock
Copy link
Contributor Author

thcrock commented Apr 5, 2021

@nanounanue about the first question: This came up at the meeting. Ultimately it seems the team wants to use it for Dirty duck (as in, remove the docker-compose and bastion dependencies from Dirty Duck). That would require much more work than is present here. However, we did decide that we can merge this PR, as currently scoped, to just be an option for regular Triage users for now. So you might do this once you finish the tutorial and want to start running Triage on your own data.

@shaycrk
Copy link
Contributor

shaycrk commented Apr 6, 2021

Yeah, just installed triage from master in a new environment and didn't run into issues with crosstabs in the CLI either.

In terms of testing out here, I did a pip install git+https://github.com/dssg/triage.git@triage_db in a new environment and then tried running triage db up but hit this error:

(temp-3.7.2) code $ triage db up
[ProcessExecutionError] Command line: ['/usr/local/bin/docker', 'build', './dirtyduck/food_db/', '-t', 'triage_db']
Exit code: 1
Stderr:  | unable to prepare context: path "./dirtyduck/food_db/" not found

I think we might need to change it to look for food_db relative to the code path?

@thcrock
Copy link
Contributor Author

thcrock commented Apr 10, 2021

@shaycrk So far I've been unable to get the contents of the dirtyduck folder to show up in the installed package, which is necessary for that docker image to be buildable.

But maybe I don't need to. For this purpose, are PostGIS and the dirtyduck data necessary? I could more easily just have the triage db command use a vanilla Postgres image, at least for now.

@nanounanue
Copy link
Contributor

nanounanue commented Apr 10, 2021 via email

@thcrock
Copy link
Contributor Author

thcrock commented Apr 28, 2021

@nanounanue I spent some more time on the Manifest. Apparently the 'dirtyduck' directory is successfully getting into the 'sdist' (what the Manifest.in is supposed to do) but is not getting copied to the site-packages/triage directory. This is despite 'include_package_data' being set to True in setup.py

@thcrock
Copy link
Contributor Author

thcrock commented May 5, 2021

@nanounanue Here is the deal with the manifest. The 'dirtyduck' directory wasn't getting into the build directory despite the MANIFEST.in and include_package_data=True combo because it's not within a package directory. If I move it to src/triage, it works.

I tried this based on this reply on a related issue: pypa/sampleproject#30 (comment)

Moving it into that directory may change some other things. Before deciding to do this, I think we should confirm that we want the database provisioner to actually include the dirtyduck data scripts. If the user types triage db up, should they get a vanilla working postgres database or should they get the food database sample data (including the one or two minutes it might take for the system to download the inspections data and load it)? @shaycrk

@shaycrk
Copy link
Contributor

shaycrk commented May 5, 2021

Thanks @thcrock! I've definitely been thinking about this mostly as motivated by making the dirty duck tutorial a little more seamless (as well as making it possible to run after a pip install triage without having to clone the repo), though I can certainly see some utility in providing an empty database, too. My sense is it might be nice to support both unless that seems unreasonable (e.g., maybe triage tutorial up vs triage db up?). Anyway, we also have our planning meeting tomorrow, so might be good to talk it through then if you can both make it?

@thcrock thcrock marked this pull request as ready for review May 13, 2021 03:37
@thcrock
Copy link
Contributor Author

thcrock commented May 13, 2021

Hey @shaycrk this is ready for review again, with just the vanilla db.

One thing you'll note is that the docs/sources/index.md changed a bunch. This is because I ran 'manage docs', which copies README.md to docs/sources/index.md. This is something I put in a while ago to solve the problem of either 1. having content split between the documentation site and the README or 2. having the README be a stub and punt the user to the documentation site.

We can iterate on this process depending on what the team wants, but for now it looks README and docs/sources/index.md were out of sync. I'm not sure which version of this page correct. README was modified much more recently than index.md, so I'm guessing README is the right version to use for now, and if that's true this PR syncs them up again. But check it out in case anything is wrong.

The command does require some version of Docker. We recommend getting it from the [official Docker downloads page](https://docs.docker.com/get-docker/).

### Can't log in
Because of the way Docker volumes work, if you manually remove the Docker container created by `triage db up`, the volume will still be around. This is usually fine, but the superuser credential information will persist as well, which means the next time you spawn the database, *the Postgres server will not take the new credential information into account*. Under normal usage (simply calling `triage db up` and never removing the container), you will never run into this situation. But if you do, and you would like to use a new username/password, you will have to remove the volume before recreating. This can be done with `docker volume rm db-data`. This will also remove all of the stored data in Postgres, so beware!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we might want to name the volume something like triage-db-data as db-data feels a little too generic/possible to clash with something else a user might have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return
if args.password:
password = getpass(prompt='Enter a password for your new database user: ')
logger.info(self.local['docker']['run', '-d', '-p', '5432:5432', '-e', 'POSTGRES_HOST=0.0.0.0', '-e', 'POSTGRES_USER=triage_user', '-e', 'POSTGRES_PORT=5432', '-e', f'POSTGRES_PASSWORD={password}', '-e', 'POSTGRES_DB=triage', '-v', 'db-data:/var/lib/postgresql/data', '--name', 'triage_db', 'postgres:12']())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe for the future, but would it be useful to allow the user pass an argument to map the postgres data directory to (e.g., possibly allowing them to use a separate, larger drive) as well as a custom port (e.g., if they do have a local postgres running but don't want to use that for triage)

(retcode, stdout, stderr) = self.local['docker']['container', 'inspect', '-f', "'{{.State.Status}}'", 'triage_db'].run(retcode=None)
if retcode != 0:
logger.info('Container does not exist')
if os.path.exists('database.yaml'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is checking in the working directory where the triage command is being run from, right? That's probably fine, but should make sure the documentation calls out that the CLI cares what directory it's being run from

'db': 'triage'
}
out_fd.write(yaml.dump(config))
logger.info('New database created with credentials saved to database.yaml. You can watch it boot up with "docker logs triage_db --follow", and wait until it says "database system is ready to accept connections. At that point, you can either psql into it using the credentials in database.yaml, or use other triage commands which will look for the credentials in database.yaml')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing " after "database system is ready to accept connections

`triage db up`

This command attempts to use docker to spawn a new Postgres 12 database. If successful, it will prompt you for a password to use for a user, and populate the connection information in database.yaml. The next time you run `triage db up`, it will look for the existing container and reuse it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful to provide some info/link out to installing psql and/or a GUI like dbeaver for people who aren't too familiar with using postgres generally?

@shaycrk
Copy link
Contributor

shaycrk commented May 22, 2021

Thanks @thcrock! Added a few inline comments, mostly around things like docs and naming, but looks good to me overall

@thcrock
Copy link
Contributor Author

thcrock commented May 23, 2021

Thanks @shaycrk I just uploaded some changes based on your review

@shaycrk
Copy link
Contributor

shaycrk commented May 28, 2021

Thanks @thcrock -- looks good to me! Feel free to merge if you're happy with it

@shaycrk shaycrk merged commit a994f3e into master Aug 26, 2021
@shaycrk shaycrk deleted the triage_db branch August 26, 2021 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants