Skip to content
This repository has been archived by the owner on Aug 26, 2022. It is now read-only.

Bug 1509430: Kuma changes for the new version of KumaScript #5182

Merged
merged 1 commit into from
Jan 22, 2019

Conversation

davidflanagan
Copy link

mdn/kumascript#1035 is a big patch that
rewrites and modernizes the KumaScript server. This is a smaller PR
that updates the Kuma repo for those changes.

The primary change is that the KumaScript server no longer supports
a GET endpoint for rendering documents, so Kuma now needs to send
document content to KumaScript for rendering with a POST request. This
change is primarily in wiki/kumascript.py, but it causes small
follow-on changes in a nubmer of other files as well.

The patch also includes some new useful scripts I wrote while testing.

And it changes docker-compose.yml to remove memcached which is no
longer required.

Copy link
Contributor

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davidflanagan, I'm still looking into this. It needs a rebase, and there's a couple of other adjustments that stand out.

There are some changes to the Dockerfile from PR #1035 (https://github.com/davidflanagan/kuma/blob/b64743a052f2a1f8c8953d56b6980ce2defff8b4/docker/images/kumascript2/Dockerfile) that should be applied (removing Python and updating the command).

We don't update the kumascript submodule in Kuma PRs, except when shipping to production. For changes like this (which are usually rare), reviewers know to checkout the relevant kumascript branch.

docker-compose.yml Show resolved Hide resolved
# may need to add alphabetical_attributes=True to the serialize call
# below so that attributes are output in a deterministic order. This was
# a particular problem for <iframe> tags, presumably because of the
# iframe filtering above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding alphabetical_attributes=True would be a good stand-alone change, then we could get rid of this comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it might only be iframes that have their attribute ordering changed? If so, then forcing alphabetical might drive writers nuts when they look at their diffs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of alphabetical_attributes=True as a standalone change, it seems like something we'd want for better diffs between revisions.

@davidflanagan
Copy link
Author

@jwhitlock: it turns out that removing the python from the Dockerfile was a mistake in the older version of the patch. It is needed by the build script for the newrelic extension that gives us internal nodejs stats apparently. And I've changed from index.js back to run.js so that is intentional.

The kumascript update was a screwup, though. I'll fix that.

@davidflanagan
Copy link
Author

@jwhitlock: I've rebased to fix the kumascript issue and the docker-compose.yml merge conflict.

I'm waiting for more feedback from you on the alphabetical attributes issue before doing anything there.

@jwhitlock
Copy link
Contributor

It looks like there were some changes way back in 2013 to allow html5lib to preserve source attribute order. However, it appears that none of the parsers preserve attribute order - not etree (the default and the one we're using), not dom, and probably not lxml. So alphabetical_attributes=True would be needed to get a consistent order, and would probably cause a lot of chaos in our diffs.

I think there's no action for this PR, and the comment can stay. I'll have to sleep on if we want alphabetical_attributes=True in the near future.

Copy link
Contributor

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working locally in conjunction with the kumascript changes.

I would prefer this as several PRs:

  1. Add scripts/dump_local_db.sh (along with extracting the DATABASE_URL parsing into a shared file)
  2. Add scripts/extract_documents.py (preferably converted to a Django management command)
  3. Remove the caching infrastructure from kumascript.py get, along with tests and utility functions
  4. Add a comment on kuma/scrape/scraper.py (or better yet, implement --per-second=15, but that could get complex)
  5. Add a comment on kuma/wiki/content.py (or implement alphabetical_attributes=True)
  6. Convert to standard environment variables instead of underscored config overrides for configuring KumaScript
  7. Convert to POST for all kumascript rendering calls (and maybe transitioning get() to render, but that does seem easier to do after the API change.
  8. Drop memcache

Changes 1-5 are independent of the kumascript PR, and could be merged in any order. Changes 6 and 7 are required at the same time as the kumascript PR, but I suspect they would pair to more incremental changes on the kumascript side as well. Change 8 could be done well after the KS changes are merged, deployed, and de-bugged.

Bundling these as a single PR makes it easy to submit, but challenging to review other than "yep mostly works" and merge a bunch of unrelated changes.

I'm wavering between Approve, since it all works, and Request Changes, since they are things I'd request if this was broken up and wasn't blocking forward movement. So you get a Comment for now 😝. I'll return after I wrestle with the kumascript PR.

# it useful to uncomment the following two lines. The sleep
# helps to avoid rate-limiting, and the print allows you to
# monitor the progress of your scrape.
# time.sleep(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it faster to sleep(1) for each request, rather than hit the rate limit and wait for it to clear? If so, we could try to add some smarter self-limiting.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed faster to me, but maybe just becasue of the print statement so I could see the steady progress. The 1 minute sleeps were a drag.

Not a change I was confident enough in to uncomment, and not something that seems worth spending more time on. So I could also just remove the comment entirely and leave the file unmodified.

exit 1
fi

# Parse DATABASE_URL for database connection parts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: copied from scripts/create_sample_db.sh. I'm not sure what the function reuse story is for bash.

from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'kuma.settings.testing')
django.setup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggests this should be a management command rather than a stand-alone script.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was useful to me while developing and testing this change. But I'm not sure whether it is generally useful enough to warrant further work. If you don't want to land this as-is, I'd be inclined to just remove it rather than invest more time in it.

to kuma and/or kumascript do not cause rendering changes to the documents.

In order to make it work it may be necessary to modify the KUMASCRIPT_TIMEOUT
setting in kuma/settings/common.py to be at least 60 seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

django-constance is used to set variables from the database, with the default values in the settings file. The preferred way to adjust the setting is in the Django admin or with ./manage.py constance set KUMASCRIPT_TIMEOUT 60.

I find environment variables to be a better way to configure Django dynamically, and it is a personal goal to get rid of Constance or drastically reduce its usage. KUMASCRIPT_TIMEOUT is one of the harder ones to transition, because a lot of tests use KUMASCRIPT_TIMEOUT=0 to avoid live calls to the kumascript service during tests.

@davidflanagan
Copy link
Author

This is working locally in conjunction with the kumascript changes.

I would prefer this as several PRs:

1. Add `scripts/dump_local_db.sh` (along with extracting the `DATABASE_URL` parsing into a shared file)

2. Add `scripts/extract_documents.py` (preferably converted to a Django management command)

3. Remove the caching infrastructure from `kumascript.py` `get`, along with tests and utility functions

4. Add a comment on `kuma/scrape/scraper.py` (or better yet, implement `--per-second=15`, but that could get complex)

5. Add a comment on `kuma/wiki/content.py` (or implement `alphabetical_attributes=True`)

6. Convert to standard environment variables instead of underscored config overrides for configuring KumaScript

7. Convert to `POST` for all kumascript rendering calls (and maybe transitioning `get()` to `render`, but that does seem easier to do after the API change.

8. Drop memcache

Changes 1-5 are independent of the kumascript PR, and could be merged in any order. Changes 6 and 7 are required at the same time as the kumascript PR, but I suspect they would pair to more incremental changes on the kumascript side as well. Change 8 could be done well after the KS changes are merged, deployed, and de-bugged.

Bundling these as a single PR makes it easy to submit, but challenging to review other than "yep mostly works" and merge a bunch of unrelated changes.

I'm wavering between Approve, since it all works, and Request Changes, since they are things I'd request if this was broken up and wasn't blocking forward movement. So you get a Comment for now 😝. I'll return after I wrestle with the kumascript PR.

Changes 1, 2, 4 and 5 could also just be deleted from the PR entirely. They were scripts and notes that were useful to me during development, but none of them seem worth spending more time (PRs, reviews, deploys) on to me.

I agree that #8 (modifying the docker-compose.yml file to remove memcache) can be done separately. And that seems like a good idea: make sure everything still works before we start changing the infrastructure config.

Since 6 and 7 have to land at the same time, it seems wrong to have them in separate commits to me--it just increases the number of places in the commit logs where a rollback would be broken.

And I see that 3 could be done separately, but if I'm understanding you that would mean that we'd leave the get() method and all of its caching apparatus intact, but modify the rest of the code so that nothing calls it anymore? I'm not sure, but I suspect we'd also have to delete tests that test the get method as part of 6 and 7. I would have expected you'd want to delete the unused code when it became unused.

So if we're going to break this up, I'd recommend doing 3, 6, and 7 in one PR, 8 in a second PR, and just punting on the other stuff.

kuma/wiki/kumascript.py Outdated Show resolved Hide resolved
@davidflanagan
Copy link
Author

This patch has changes to docker-compose.yml: changing kumascript to use redis instead of memcached, for example.

And I suspect that means that I also need to modify the various docker-compose variants as well.

@@ -40,20 +38,26 @@ def should_use_rendered(doc, params, html=None):
(force_macros or (not no_macros and not show_raw)))


def post(request, content, locale=settings.LANGUAGE_CODE):
def _post(content, env_vars, cache_control='',
Copy link
Contributor

@escattone escattone Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have used cache_control=None here (the None object can be compared to a string -- of course the comparison will always return False), but strings are also immutable and using an empty string better indicates the type of what is expected. On the other hand, I think all of the functions above this in the call chain use cache_control=None, so it would be more consistent to use that. I could go either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _post(content, env_vars, cache_control='',
def _post(content, env_vars, cache_control=None,

Copy link
Contributor

@escattone escattone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all of your work on this @davidflanagan! This looks good and works for me locally when paired with mdn/kumascript#1035. I think I'd prefer to remove scripts/dump_local_db.sh and scripts/extract_documents.py, unless you feel they're useful to keep around. I'm happy merging as is (or with those scripts removed), but I'll defer to you and @jwhitlock if you'd like to split into two chunks as last discussed here.

mdn/kumascript#1035 is a big patch that
rewrites and modernizes the KumaScript server. This is a smaller PR
that updates the Kuma repo for those changes.

The primary change is that the KumaScript server no longer supports
a GET endpoint for rendering documents, so Kuma now needs to send
document content to KumaScript for rendering with a POST request. This
change is primarily in wiki/kumascript.py, but it causes small
follow-on changes in a nubmer of other files as well.

The patch also includes some new useful scripts I wrote while testing.

And it changes docker-compose.yml to remove memcached which is no
longer required.
@jwhitlock
Copy link
Contributor

Thanks @davidflanagan for this work (and for removing the non-essential bits), and to @escattone for the review.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants