This repository has been archived by the owner on Aug 22, 2022. It is now read-only.

Load Balancer, Instant Switchover & SSL Support #145

Merged: 32 commits, Nov 7, 2016

Conversation

smarnach (Contributor):

This implements support for load-balancing servers.

For background on the configuration of the load-balancing server, please see https://github.com/open-craft/ansible-load-balancer/blob/master/README.md.

If you prefer, you can review the first commit separately. It is a self-contained refactoring that doesn't change any functionality.

The algorithm for selecting a load-balancing server is very simple at the moment: it just picks one at random. For now this is enough, since we only have a single server.
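As a rough illustration, the random choice described above might look like this (plain Python, not the actual codebase; the function and argument names are hypothetical stand-ins for the Django queryset logic):

```python
import random

def select_load_balancing_server(servers):
    """Pick one load-balancing server at random from the available ones.

    `servers` stands in for something like LoadBalancingServer.objects.all();
    with a single server configured, the choice is trivially deterministic.
    """
    if not servers:
        raise ValueError("no load-balancing servers configured")
    return random.choice(servers)
```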

Testing instructions

  1. Add the following settings to the .env file on your vagrant instance:

    LOAD_BALANCER_FRAGMENT_NAME_PREFIX='$USERNAME-'
    PRELIMINARY_PAGE_SERVER_IP='149.202.186.237'
    

    $USERNAME is some name identifying you, and is used to recognise your configuration file fragments on the dev load-balancing server, which is shared between all of us.

  2. Run make migrate.

  3. Add the public SSH key of your vagrant instance (probably /home/vagrant/.ssh/id_rsa.pub) to the authorised keys for the ubuntu user (/home/ubuntu/.ssh/authorized_keys) on the server haproxy-dev.opencraft.hosting. You can use the "ssh key for databases instances" from the secrets repository to log into that server.

  4. Test that you can log into haproxy-dev.opencraft.hosting from your vagrant instance using ssh ubuntu@haproxy-dev.opencraft.hosting.

  5. In the Django shell, run

    LoadBalancingServer.objects.create(
        domain="haproxy-dev.opencraft.hosting",
        ssh_username="ubuntu")
    
  6. Create a new instance and spawn an appserver, e.g.

    instance = OpenEdXInstance.objects.create(
        sub_domain="$USERNAME-haproxy-test",
        name="haproxy test",
        configuration_source_repo_url="https://github.com/open-craft/configuration.git",
        configuration_version="opencraft-release/eucalyptus.2",
        edx_platform_repository_url="https://github.com/open-craft/edx-platform.git",
        edx_platform_commit="opencraft-release/eucalyptus.2",
        openedx_release="open-release/eucalyptus.2",
    )
    from instance.tasks import spawn_appserver
    spawn_appserver(instance.ref.pk)
    

    The DNS record will be set up immediately, but will of course take some time to propagate. Once the load-balancing server can verify that the DNS record points to it, it will request an SSL certificate for the new domain names. Once this has happened, you should be able to access the link to the LMS in the instance manager, and it should show the page informing you that the instance is being provisioned via HTTPS. (In the real production setup, we have a wildcard DNS entry for *.opencraft.hosting, so users will see the splash screen even before DNS propagation, via plain HTTP.)

  7. Activate the appserver once it's fully deployed. Since no DNS propagation is necessary to activate the appserver, it should become available under the domain name in less than one minute.

@mtyaka (Member) left a comment:

@smarnach The code looks great. I am currently provisioning a new instance to test this manually.

Two questions:

  • We have a protocol field on OpenEdXInstance and OpenEdXAppServer models. Does this change make that field redundant (since all instances will only be available over https)?
  • Should we display the load balancer that is associated to the instance in the GUI? It probably won't be very helpful unless we start using more than one load balancer instance.

def process(self, msg, kwargs):
msg, kwargs = super().process(msg, kwargs)
msg = self.extra['obj'].format_log_message(msg)
return msg, kwargs
Member:

Nice refactoring. Since we only ever pre-pend extra annotation, we could simplify the code even further by requiring models to implement log_message_annotation rather than format_log_message:

def process(self, msg, kwargs):
    msg, kwargs = super().process(msg, kwargs)
    annotation = self.extra['obj'].log_message_annotation()
    if annotation:
        msg = "{} | {}".format(annotation, msg)
    return msg, kwargs

It would be less general, but would cover all current use cases.

Contributor Author:

@mtyaka I first considered doing this, but there is actually one corner case that isn't covered by the less general approach: The AppServer class might not annotate the log message at all if it's not owned by an instance.

I copied this corner case from the old code, but thinking about it, I don't think this can ever happen, so I can make the change you suggested.

return "load_balancer={} ({!s:.15}) | {}".format(self.pk, self.domain, msg)

def get_configuration(self):
"""Collect the backend maps and configuration fragments from all assoicated instances."""
Member:

Typo: assoicated

@antoviaque (Member) left a comment:

@smarnach Nice to see this!

config_script = self.get_config_script()
try:
subprocess.run(
["ssh", "-T", "-o", "PasswordAuthentication=no", "-l", self.ssh_username, self.domain, "sudo sh"],
Member:

Could you do this via ansible, like @bdero did in #144 ? It would be good to use a single approach for executing commands on a VM, and to factorize the common parts.

Contributor Author:

Using Ansible was what I originally suggested in the design doc. However, I found that it has quite a bit of overhead for this simple use case. Even when reusing the bits that are already there, the code will become more complex, a lot slower, and probably more error-prone. It will also take a while to implement this.

Alternatively, I could factor out running SSH into a separate function. (I should do this anyway to be able to mock it out for testing.)

Member:

@smarnach Brandon's work on #144 already implemented this, so a part of the work is already done - you would still need to rebase on top of his PR and factorize it, but the ground work is there? If we use different approaches for executing code on VMs, they will each grow/evolve on their own, and it will be harder to reconcile the work later on. I'd rather have a single one, which we can work on making more robust if needed - that would benefit both your work and Brandon's.

@@ -84,6 +86,8 @@ class OpenEdXInstance(Instance, OpenEdXAppConfiguration, OpenEdXDatabaseMixin,
external_lms_preview_domain = models.CharField(max_length=100, blank=True)
external_studio_domain = models.CharField(max_length=100, blank=True)

load_balancing_server = models.ForeignKey(LoadBalancingServer, null=True, blank=True, on_delete=models.PROTECT)
Member:

Can you move the logic/data that aren't specific to an Open edX instance to a LoadBalancedInstance mixin?

Contributor Author:

@antoviaque That was my original approach, but it turned out that all the logic I added needs access to either the domain names or the active appserver, and it is all Open edX-specific. The generic bits are implemented in LoadBalancingServer. If I were to move this to a mixin, it would literally only contain this field, and maybe an abstract prototype for get_load_balancer_configuration(), but no actual code.

The implementation of LoadBalancingServer actually isn't completely generic, since it uses the openedxinstance_set attribute to retrieve a list of all backends it load balances. The only alternative to this approach I can think of is to add the load_balancing_server field to InstanceReference, which definitely feels wrong as well. I have thought about how to solve this, but I didn't manage to come up with a really nice and clean solution.

Contributor Author:

I noticed that it's actually quite easy to make LoadBalancingServer completely generic by using Django model reflection to find the subclasses of Instance that refer to LoadBalancingServer, so I've implemented this now. To make another Instance subclass load-balanced, all that's needed is to add a ForeignKey field like this one and implement get_load_balancer_configuration().
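A minimal sketch of that reflection pattern, using plain-Python stand-ins for the Django machinery (all names are illustrative; in the real code the reverse relations would be discovered via `self._meta.get_fields()`):

```python
class Instance:
    """Stand-in for the abstract Instance base class."""

class OpenEdXInstance(Instance):
    """Stand-in for a concrete subclass with a ForeignKey to the load balancer."""

class LoadBalancingServer:
    def __init__(self, reverse_relations):
        # In Django, the reverse one-to-many relations would be found by
        # iterating over self._meta.get_fields(); here they are passed in
        # explicitly so the sketch runs without Django.
        self.reverse_relations = reverse_relations

    def get_instances(self):
        """Yield all instances configured to use this load balancer."""
        for model, instances in self.reverse_relations.items():
            if issubclass(model, Instance):
                yield from instances
```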

Does this settle this point, or do you think I should add a mixin class containing only this field?

Member:

Cf my comment below, at https://github.com/open-craft/opencraft/pull/145/files#diff-b66ef4710ed40f1b980e3ab3a34ec812R273 - yes, the mixin could have the field, but it could also contain the logic specific to the load balancer, and only have the OpenEdXInstance object expose the information the load balancer mixin needs to generate the config.

Contributor Author:

I've replied on the linked comment. I'll introduce a LoadBalancedInstance class containing only the field for now.

self.disable_monitoring()
self.active_appserver = None
self.save()
self.load_balancing_server.reconfigure()
Member:

Might be worth syncing up with @itsjeyd on this, since he's adding shutdown logic in https://github.com/open-craft/opencraft/pull/143/files#diff-9842c8f4a538c1c1735bdb68328d427aR84

Member:

@itsjeyd @smarnach Have you done this?

Contributor Author:

There are a few clean-up related things that need to be sorted out. E.g. we currently don't remove DNS entries for old instances, and we need to remove the haproxy configuration fragments for instances that get terminated. The clean-up code needs some reconciliation. Maybe we can do this once all the current PRs are merged? It's kind of difficult to get this done with several PRs in flight.

Member:

@smarnach I don't think I had replied to this, but since @itsjeyd's PR was merged more than a week ago, there has been no blocker on this for some time now? It was one of my very first review comments, so I'd like to see it addressed before this is merged.

Btw I don't think Tim is seeing our pings - he might be ignoring notifications from PRs he is not involved in, not expecting a ping here?

@@ -95,3 +95,17 @@ def emit(self, record):
# TODO: Filter out log entries for which the user doesn't have view rights
# TODO: More targetted events - only emit events for what the user is looking at
publish_data('log', log_event)


class ModelLoggerAdapter(logging.LoggerAdapter):
Member:

Nice refactoring!

@smarnach (Contributor Author):

@mtyaka

We have a protocol field on OpenEdXInstance and OpenEdXAppServer models. Does this change make that field redundant (since all instances will only be available over https)?

I don't think so. The instance manager is currently completely unaware that the instances are served over HTTPS. The links to the instances are still HTTP links (and I don't plan to change this, since it would break the splash screen before DNS propagation). The load balancer still talks HTTP to the servers as well. For now, I would leave the protocol field untouched.

Should we display the load balancer that is associated to the instance in the GUI? It probably won't be very helpful unless we start using more than one load balancer instance.

Yes, we should do this. I've added it to my notes.

@smarnach (Contributor Author):

@mtyaka @antoviaque This PR introduces, for the first time, settings stored in the database: the domain name and SSH username of the load-balancing server. This raises the question of how to configure this for the integration tests. One option would be to add a management command to add a load-balancing server, and add a call to that command to circle.yml.

And which load-balancing server should I use for the integration tests? haproxy-dev or haproxy-stage? The integration tests use the same OpenStack credentials as stage, so I'd tend to use haproxy-stage.

@smarnach (Contributor Author):

@mtyaka @antoviaque I've chosen a different approach for the integration tests: I've introduced a Django setting that can be used to create a LoadBalancingServer in the database. This makes it easy to configure the integration tests in the same way as we've always done it.

# -*- coding: utf-8 -*-
#
# OpenCraft -- tools to aid developing and hosting free software projects
# Copyright (C) 2015 OpenCraft <xavier@opencraft.com>
Member:

2016 : )


from instance.logging import ModelLoggerAdapter
from .instance import Instance
from .utils import ValidateModelMixin
Member:

If I remember correctly, we settled on always using absolute imports, no?

Contributor Author:

I didn't know about that. There are several relative imports in the code though.

Pylint doesn't seem to have a feature to warn about explicit relative imports (probably because they are generally considered a good thing). Without a pylint warning, it will be hard to enforce the policy of only using absolute imports. (To be clear, I don't mind the policy.)

I'll replace all relative imports I introduced.

Member:

Yes, it's not too bad, more of a stylistic decision. We can always add checks to pylints if we want to enforce it more strongly, but for now just through reviews when we add/change code would be enough imho.


def get_instances(self):
"""Yield all instances configured to use this load balancer."""
for field in self._meta.get_fields():
if field.one_to_many and issubclass(field.related_model, Instance):
Member:

issubclass(field.related_model, LoadBalancedInstance) ?

Contributor Author:

If we are introducing that class, I could do that. Since I'm only iterating over the classes that have ForeignKey fields pointing to LoadBalancedInstance, it doesn't make a difference.

"""
if self.active_appserver:
backend_name = "be-{}".format(self.active_appserver.server.name)
server_name = "appserver-{}".format(self.pk)
Member:

self is an instance not an appserver, no?

)

def get_load_balancer_configuration(self):
Member:

This could also be moved to the LoadBalancedInstance mixin - the instance could just have generic domain_names and ip_addresses properties, so that it doesn't need to know about the internals of the load balancer?

Contributor Author:

The configuration template is also specific to Open edX. The stickiness set-up relies on the name of the session cookie, and for other kinds of instances, other haproxy configuration settings might make sense, possibly requiring other information from the instances than the information passed to the template for the case of Open edX.

As I've said before, the code is split up as follows: the parts specific to Open edX are on the OpenEdXInstance class, and the generic bits are on LoadBalancingServer. Even if I were to make the change you proposed here, the generic bits of the code should be moved to LoadBalancingServer, and not to a third place.

It would be possible to move some of the current code from LoadBalancingServer to a new class LoadBalancedInstance. However, I don't see how this would improve the code. You would just have to look in one more place to understand how the parts are working together.

Member:

@smarnach Alright, I see - I was hoping the load balancer configuration could be more independent of Open edX, but since that isn't the case, leaving that method on the OpenEdXInstance object works for me.

It would be possible to move some of the current code from LoadBalancingServer to a new class LoadBalancedInstance. However, I don't see how this would improve the code. You would just have to look in one more place to understand how the parts are working together.

If the LoadBalancer takes a raw configuration from the instance, you will end up with logic common to all the Instance objects that can't easily be moved to the LoadBalancer object. For example, in your current get_load_balancer_configuration method, the preliminary page server logic would be generic to all types of instances imho; but if you move that logic to the LoadBalancer object, you would have to return a special value to signify that a given instance wants to use the preliminary page. It would be easier to understand and more maintainable imho to have this logic in a method on LoadBalancingInstanceMixin which the get_load_balancer_configuration method can call. It is an additional place to look at, but it would help to explicitly differentiate between the generic parts and the ones specific to Open edX.
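A sketch of the split being proposed here (plain Python, not the actual codebase; class names, backend naming, and the placeholder IP are all illustrative):

```python
PRELIMINARY_PAGE_SERVER_IP = "192.0.2.1"  # placeholder, not the real server IP

class LoadBalancedInstance:
    """Generic mixin: logic shared by all load-balanced instance types."""

    def get_preliminary_page_config(self):
        # Route this instance's domains to the "please wait" page server.
        return "be-preliminary-page-{}".format(self.pk), PRELIMINARY_PAGE_SERVER_IP

class OpenEdXInstanceSketch(LoadBalancedInstance):
    """Open edX-specific part: knows about the active appserver."""

    def __init__(self, pk, active_appserver_ip=None):
        self.pk = pk
        self.active_appserver_ip = active_appserver_ip

    def get_load_balancer_configuration(self):
        # Use the active appserver if there is one; otherwise fall back to
        # the generic preliminary-page logic on the mixin.
        if self.active_appserver_ip:
            return "be-appserver-{}".format(self.pk), self.active_appserver_ip
        return self.get_preliminary_page_config()
```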

Contributor Author:

@antoviaque Yes, true, it's possible to split out the logic for the preliminary page, and it makes sense to move this to the LoadBalancedInstance mixin. It's not clear to me this logic will ever be reused, so I didn't split it out in the spirit of You aren't gonna need it, but I will do so now.

Contributor Author:

I tried moving the configuration for the preliminary page to the LoadBalancedInstance mixin. However, this results in a "no self use" pylint warning, so I have to make it a module-level function.




LOAD_BALANCER_FRAGMENT_NAME_PREFIX = env('LOAD_BALANCER_FRAGMENT_NAME_PREFIX', default='opencraft-')
LOAD_BALANCER_CONF_DIR = env('LOAD_BALANCER_CONF_DIR', default='/etc/haproxy/conf.d')
LOAD_BALANCER_BACKENDS_DIR = env('LOAD_BALANCER_BACKENDS_DIR', default='/etc/haproxy/backends')
Member:

If command execution is done through ansible, these paths could go to the ansible role for this command. Since that would eventually be reconciled with the load balancer ansible role, to allow to manage/redeploy load balancers via OpenCraft IM, they could remain internal to ansible and not require OpenCraft IM to know about these or make them configurable.

Contributor Author:

Yep, the paths are a bit out of place here. I thought about introducing commands like add-haproxy-conf-fragment and add-haproxy-backend-map-fragment on the load balancer, and making these commands the public interface of the load balancer. That would also make the flock locking on the load balancer (meant to avoid reading half-written fragments) completely transparent.

Member:

@smarnach Now that you use ansible, you don't need these two anymore, right? I don't see them used anywhere in the PR anymore.


# The load-balancing server given in the form ssh_username@server.domain will be created
# in the database if it does not exist yet.
DEFAULT_LOAD_BALANCING_SERVER = env('DEFAULT_LOAD_BALANCING_SERVER', default=None)
Member:

I've chosen a different approach for the integration tests: I've introduced a Django setting that can be used to create a LoadBalancingServer in the database. This makes it easy to configure the integration tests in the same way as we've always done it.

Yup, I also prefer this approach - keeping a single method of configuration helps keep things simple, without additional mental overhead.
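The setting's `ssh_username@server.domain` format could be parsed along these lines (a sketch; the actual helper name in the codebase may differ):

```python
def parse_load_balancing_server(spec):
    """Split 'ssh_username@server.domain' into its two parts."""
    ssh_username, separator, domain = spec.partition("@")
    if not separator or not ssh_username or not domain:
        raise ValueError(
            "expected 'ssh_username@server.domain', got {!r}".format(spec)
        )
    return ssh_username, domain
```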

@@ -475,3 +475,15 @@
# A new relic admin user's API key, used to set up availability monitoring
# with Synthetics
NEWRELIC_ADMIN_USER_API_KEY = env('NEWRELIC_ADMIN_USER_API_KEY', default=None)


# Load balancing ##############################################################
Member:

Can you add documentation about these settings to the README?

raise


def _create_default_load_balancing_server():
Member:

These two functions should go on a LoadBalancer model manager?

else:
backend_name = "be-preliminary-page-{}".format(self.pk)
server_name = "preliminary-page"
ip_address = settings.PRELIMINARY_PAGE_SERVER_IP
Member:

So we are using the load balancer to serve the "please wait" page then, at the end? So we should change the default wildcard for *.opencraft.hosting to point to the load balancer? That would also ensure it can get the HTTPS certificates more quickly, right?

Contributor Author:

The "please wait" page can only be served via HTTPS once a certificate has been retrieved, and a certificate can only be retrieved once the DNS entry points to the load balancer.

With the current setup, the following is happening: The links in the instance manager are still HTTP links (and I don't plan to change this). Clicking on the link before DNS has propagated will result in the "please wait" page being served via HTTP. Once the load balancer detects that DNS has propagated, it will request a cert and start serving the page via HTTPS (well, it doesn't actually "serve" it, but you know what I mean). Since users and the load-balancer might see the results of DNS propagation at different times, and since the load balancer only checks every five minutes whether it needs to request new certs, there can be a gap between the two ways of serving the page.

If we change the wildcard entry to point to the load balancer instead, the load balancer will be able to request a certificate more or less immediately, and then be able to serve the "please wait" page without interruption, so yes, that might be slightly better.

Member:

Yup, that matches what I had in mind - sounds good then, thanks. Can you update the wildcard when you deploy this? It might be worth adding it to a TODO here or on the task to not forget.

Member:

@smarnach Has a TODO entry been added somewhere?

Contributor Author:

Yes, it's on the Jira ticket (and in my paper notes as well).

@mtyaka (Member) left a comment:

I ran into some issues when testing, but I think the problem is with my configuration. I will try again.

return settings.LOAD_BALANCER_FRAGMENT_NAME_PREFIX + self.fragment_name_postfix

def get_config_script(self):
"""Render the configuration script to be executed on the loader balancer."""
Member:

typo: "loader" -> "load"



def select_load_balancing_server():
"""Select a load-balancing server for a new instance.
Member:

nit: We usually put a newline after """.

Contributor Author:

Hmm, the code base doesn't seem to be consistent on this, so I went with the official convention in PEP 257. I'm happy to change it, though.

@mtyaka (Member), Oct 24, 2016:

Ah sorry, never mind then. While looking at the diff I had openedx_instance.py opened, which uses newlines consistently, but if we are not consistent in other places it doesn't make sense to change it.

Contributor Author:

I had a look again, and it indeed looks like the overwhelming majority of modules use the other style, so I'll change it anyway. I just happen to have stumbled upon the few inconsistent places.

# We unconditionally set the DNS records here, though this would only be strictly needed
# when the first AppServer is spawned. However, there is no easy way to tell whether the
# DNS records have already been successfully set, and it doesn't hurt to alway do it.
self.set_dns_records()
Member:

In theory setting DNS entries and configuring the load balancer could be done when creating the instance (even before spawning any app servers).

I think it makes sense to set DNS records every time when spawning a new appserver because instance domains may change (for example instance uses *.opencraft.hosting domain at first, but later switches to custom/external domains).

Member:

I just realized we don't seem to remove DNS entries from gandi when instances are destroyed. It has nothing to do with this PR, but it would make sense to implement pruning of DNS records when instances are destroyed.

Contributor Author:

Setting DNS records should only happen inside a worker task, since it is potentially slow. That's the main reason why I decided not to do it implicitly upon instance creation.

I also noticed that DNS records don't get cleaned up. The load balancer configuration fragments also need to be cleaned up. Apparently @itsjeyd is currently working on refactoring the clean-up code, so we should coordinate the approaches to do that.

Member:

Yup, @itsjeyd seems to have missed my two other pings about this on the current PR, so could you grab him on IRC/Jira about this? In the meantime his PR has been merged though, so this would need to be done in this PR (rebased against master to not conflict with his work), or added as a ticket in the backlog. The former is better as long as it doesn't make this spill over, so we can avoid the overhead of a new task while this is fresh in mind.

@log_exception
def spawn_appserver(self):
"""
Provision a new AppServer.

Returns the ID of the new AppServer on success or None on failure.
"""
if not self.load_balancing_server:
self.load_balancing_server = select_load_balancing_server()
self.load_balancing_server.reconfigure()
@mtyaka (Member), Oct 24, 2016

I think we should always reconfigure the load balancer to be able to change domains and be able to transparently reassign an instance from one load balancer to another?

edit: nevermind, we invoke reconfigure() when setting active appserver, doing it here would be wrong.

@mtyaka (Member) left a comment

I think I found out why the splash page for my instance wasn't working. The instance also failed to provision, but it looks like that was a random problem with one of the package servers. I'm trying again.

@log_exception
def spawn_appserver(self):
"""
Provision a new AppServer.

Returns the ID of the new AppServer on success or None on failure.
"""
if not self.load_balancing_server:
self.load_balancing_server = select_load_balancing_server()
@mtyaka (Member), Oct 24, 2016

We have to add a self.save() right after assigning the load balancer, otherwise the reconfigure() invocation on the line below won't find this instance and won't include it in the haproxy configuration.

Contributor Author

Right, thanks!
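The fix agreed above can be illustrated with a minimal, Django-free simulation (a registry set stands in for the database; the class and method names are simplified assumptions, not the actual models):

```python
# Toy simulation of the review fix: reconfigure() only sees *saved*
# instances, so the assignment must be persisted first.
SAVED_INSTANCES = set()

class LoadBalancingServer:
    def reconfigure(self):
        # The real method regenerates the haproxy config from all
        # instances saved in the database; unsaved ones are invisible.
        return sorted(SAVED_INSTANCES)

class Instance:
    def __init__(self, name):
        self.name = name
        self.load_balancing_server = None

    def save(self):
        SAVED_INSTANCES.add(self.name)

    def spawn_appserver(self):
        if not self.load_balancing_server:
            self.load_balancing_server = LoadBalancingServer()
            self.save()  # persist *before* reconfigure(), per the review
        return self.load_balancing_server.reconfigure()

backends = Instance("demo").spawn_appserver()
print(backends)  # → ['demo']
```

Without the save() call, reconfigure() would return an empty backend list for the new instance.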

@smarnach (Contributor Author)

The integration tests are currently failing because the backend names of the two instance deployment tests running in parallel clash. I need a small refactoring to fix this.

@smarnach (Contributor Author)

@mtyaka I think I've addressed all review comments, except for these open issues:

  • Using Ansible for running the commands on the load balancer. I'd like to wait for Added management command for collecting activity stats. #144 to be merged before doing this, to avoid having to rebase repeatedly.
  • The cleanup-code isn't complete yet, since I wanted to wait for @itsjeyd to comment.
  • I don't know yet whether the integrations tests are passing now.

@mtyaka (Member) commented Oct 24, 2016

@smarnach Thanks for all the updates!

Conditional 👍, assuming these are taken care of before this gets merged:

  • I tested this: I successfully created a new instance and spawned a new app server. The splash page was (almost) immediately visible. The LMS/Studio were accessible right after new app server was successfully spawned.
  • I read through the code
  • Includes documentation

@smarnach (Contributor Author)

The integration tests failed because DNS propagation took three hours, so the load balancer wasn't able to request SSL certificates in time. The negative caching time does indeed seem to be set to three hours:

$ dig @a.dns.gandi.net non-exisitng-subdomain.integration.plebia.net

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @a.dns.gandi.net non-exisitng-subdomain.integration.plebia.net
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14461
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;non-exisitng-subdomain.integration.plebia.net. IN A

;; AUTHORITY SECTION:
plebia.net.     10800   IN  SOA a.dns.gandi.net. hostmaster.gandi.net. 1477324491 10800 3600 604800 10800

;; Query time: 105 msec
;; SERVER: 173.246.98.1#53(173.246.98.1)
;; WHEN: Mon Oct 24 22:23:02 CEST 2016
;; MSG SIZE  rcvd: 133

I have no idea why the DNS negative caching time is set that high, or whether we can change it.

I will create a wildcard DNS record for *.integration.plebia.net pointing to the load balancer, so we don't have to wait for DNS propagation. The downside is that we also won't be able to test DNS propagation, but we won't be able to do that anyway, since the integration tests on CircleCI are limited to two hours.
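The negative-caching time can be read directly off the SOA record in the dig output above; per RFC 2308 resolvers cache NXDOMAIN answers for the SOA "minimum" field (the last number), which is 10800 seconds here:

```python
# SOA record fields from the dig answer above (RFC 1035 / RFC 2308):
# mname rname serial refresh retry expire minimum
soa = "a.dns.gandi.net. hostmaster.gandi.net. 1477324491 10800 3600 604800 10800"
fields = soa.split()
minimum = int(fields[-1])  # negative-caching TTL, in seconds
print(minimum // 3600)  # → 3 (hours), matching the observed delay
```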

@smarnach (Contributor Author)

Another problem is that the integration tests are using up our Let's Encrypt quota for plebia.net relatively quickly. We will have to switch to the staging instance of Let's Encrypt for the integration tests, and convince the requests library to somehow accept the invalid certificates issued by them. That's going to be fun. :-)

@antoviaque (Member)

Another problem is that the integration tests are using up our Let's Encrypt quota for plebia.net relatively quickly. We will have to switch to the staging instance of Let's Encrypt for the integration tests, and convince the requests library to somehow accept the invalid certificates issued by them. That's going to be fun. :-)

What uses the quota? Would adding plebia.net to the same list as opencraft.hosting be an option to avoid this issue?

@antoviaque (Member)

I have no idea why the DNS negative caching time is set that high, or whether we can change it.
I will create a wildcard DNS record for *.integration.plebia.net pointing to the load balancer

It's probably still using the default value (10800). If you can edit the plebia.net domain on Gandi, you can set it to 1800 instead like we do for most of our domains, which should help? You might even be able to use a lower value, to get the integration test to complete more quickly.

@@ -94,7 +94,8 @@ test_unit: clean static_external
honcho -e .env.test run coverage run --source='.' --omit='*/tests/*' ./manage.py test --noinput
coverage html
@echo -e "\nCoverage HTML report at file://`pwd`/build/coverage/index.html\n"
@coverage report --fail-under 94 || (echo "\nERROR: Coverage is below 94%\n" && exit 2)
# Temporarily relax required coverage to force all tests to run on CircleCI
@coverage report --fail-under 93 || (echo "\nERROR: Coverage is below 94%\n" && exit 2)
@antoviaque (Member), Oct 25, 2016

I assume this is temporary :) (re-reading: yes, it's actually written in the comment above.) Remember to revert before merging.

Contributor Author

I've added an item to the check list on the ticket to remember this.

return servers[random.randrange(count)]


class LoadBalancingServer(ValidateModelMixin, models.Model):
Member

Could this be:

class LoadBalancingServer(ValidateModelMixin, TimeStampedModel)
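For illustration, a Django-free sketch of what the suggested TimeStampedModel base buys (the real class comes from django-model-utils; this toy analogue only mimics its created/modified bookkeeping):

```python
from datetime import datetime, timezone

class TimeStampedModel:
    """Toy stand-in for django-model-utils' TimeStampedModel: records
    creation and last-modification timestamps automatically."""
    def __init__(self):
        self.created = datetime.now(timezone.utc)
        self.modified = self.created

    def save(self):
        self.modified = datetime.now(timezone.utc)

class LoadBalancingServer(TimeStampedModel):
    def __init__(self, domain, ssh_username):
        super().__init__()
        self.domain = domain
        self.ssh_username = ssh_username

lb = LoadBalancingServer("haproxy-dev.opencraft.hosting", "ubuntu")
lb.save()
assert lb.modified >= lb.created
```

With the real mixin, the timestamps are model fields updated on every save(), which is useful for auditing when a load balancer was added or last touched.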

@smarnach (Contributor Author)

@antoviaque

What uses the quota? Would adding plebia.net to the same list as opencraft.hosting be an option to avoid this issue?

The integration tests make the load balancer request certificates for the instances they bring up. Each time the integration tests are run, two certificates are requested, so we could only run the integration tests ten times per week.

Adding plebia.net to the Public Suffix List would not solve the problem. We would have to add integration.plebia.net to that list instead. However, that wouldn't be the intended use of the list, since no mutually non-trusting parties can publicly register subdomains of integration.plebia.net, so I wouldn't want to do that.

Using the staging instance of Let's Encrypt is actually the correct solution. It's meant to be used for this purpose. Injecting another root certificate to accept to the requests library probably isn't going to be difficult; I haven't looked into it yet. If it turns out to be difficult, I'll just use a different library.
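Pointing the requests library at an extra root certificate is indeed straightforward: a Session's verify attribute (or the REQUESTS_CA_BUNDLE environment variable) accepts a path to a custom CA bundle. The bundle path below is a hypothetical example, not a file in this repository:

```python
import requests

# Hypothetical path to a PEM bundle containing the Let's Encrypt staging
# root certificate ("Fake LE Root"); assumed to exist on the test machine.
STAGING_CA_BUNDLE = "/etc/ssl/certs/lets-encrypt-staging-roots.pem"

session = requests.Session()
# Every request made on this session validates server certificates
# against the custom bundle instead of the system default store.
session.verify = STAGING_CA_BUNDLE

# session.get("https://demo.integration.plebia.net/")
# would now accept certificates issued by the staging CA.
```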

@smarnach (Contributor Author)

@antoviaque

It's probably still using the default value (10800). If you can edit the plebia.net domain on Gandi, you can set it to 1800 instead like we do for most of our domains, which should help? You might even be able to use a lower value, to get the integration test to complete more quickly.

I've set a wildcard entry for *.integration.plebia.net pointing to haproxy-stage.opencraft.hosting, which solves the problem.

I don't think I can change the 10800 seconds used for negative caching anywhere, and it is rather unusual to use the same TTL for negative and positive replies, which is what Gandi seems to be doing.

@smarnach smarnach force-pushed the smarnach/load-balancer branch 2 times, most recently from 69ad196 to a77db29 Compare October 30, 2016 22:42
@antoviaque (Member)

@smarnach

Using the staging instance of Let's Encrypt is actually the correct solution. It's meant to be used for this purpose.

Sounds good, thanks for the explanation.

I don't think I can change the 10800 seconds used for negative caching anywhere, and it is rather unusual to use the same TTL for negative and positive replies, which is what Gandi seems to be doing.

Ah, true, this seems to actually be defined in the DNS SOA record, so we could only change it if we were to use our own DNS servers. Your approach with a wildcard is much simpler.

@smarnach (Contributor Author)

@mtyaka While I already got a conditional approval, there have been quite a few changes since the last time you looked. I've rebased on top of master and added the commits starting from "Factor out load balancer configuration of the preliminary page." Could you please review again?

@antoviaque changed the title from "Smarnach/load balancer" to "Load Balancer, Instant Switchover & SSL Support" on Nov 1, 2016
from instance.models.server import Status as ServerStatus
from instance.utils import sufficient_time_passed
from .instance import Instance
from .mixins.load_balanced import LoadBalancedInstance
Member

Still a relative import here. I imagine you did this for consistency's sake - but in that case I'd rather change the rest of the imports next to it.


log_lines = []
with ansible.run_playbook(
return ansible.capture_playbook_output(
Member

Great that this allowed to factorize the call to ansible here, too - thanks!


LOAD_BALANCER_FRAGMENT_NAME_PREFIX = env('LOAD_BALANCER_FRAGMENT_NAME_PREFIX', default='opencraft-')
LOAD_BALANCER_CONF_DIR = env('LOAD_BALANCER_CONF_DIR', default='/etc/haproxy/conf.d')
LOAD_BALANCER_BACKENDS_DIR = env('LOAD_BALANCER_BACKENDS_DIR', default='/etc/haproxy/backends')
Member

@smarnach Now that there you use ansible, you don't need these two anymore, right? I don't see them used anywhere in the PR anymore.

self.disable_monitoring()
self.active_appserver = None
self.save()
self.load_balancing_server.reconfigure()
Member

@smarnach I don't think I had replied to this, but since @itsjeyd's PR has been merged more than a week ago, there is no blocker on this for some time now? It was one of my very first review comments, so I'd like to see this addressed before this is merged.

Btw I don't think Tim is seeing our pings - he might be ignoring notifications from PRs he is not involved in, not expecting a ping here?

# Delete server 1:
server1_id = server1.pk
server1.delete()
# Now its log entry should be deleted:
entries = LogEntry.objects.order_by('pk').all().values_list('text', flat=True)
for entry_text in entries:
self.assertNotIn('Line #2', entry_text)
self.assertIn('Line #1, on instance', entries[0])
self.assertIn('Line #3, on server 2', entries[1])
self.assertIn('Line #1, on instance', entries[1])
Member

What's entries[0] now? Should that be tested?

Contributor Author

The new line comes from the implicit creation of the default load balancer when calling OpenEdXInstanceFactory. This test function verifies that log lines for a server get deleted when the server is deleted, so I didn't feel that a line resulting from something happening in the setUp() method was related. I'll add an assertion just to prevent people from wondering what that line might be.

@smarnach (Contributor Author) commented Nov 3, 2016

@itsjeyd Could you please take a look at the last commit in this PR? I've reorganised the clean-up code to also deprovision the databases when an instance is shut down. Does this make sense? (I haven't fixed the unit tests yet, because I first want to wait for your feedback.)

self.remove_dns_records()
self.deprovision_mysql()
self.deprovision_mongo()
self.deprovision_swift()
Member

@smarnach Is there a specific reason why mysql/mongo/swift need to be deprovisioned before terminating app servers belonging to this instance? At first glance it seems like it would make sense to keep the previous order (terminate app servers first, then deprovision databases and storage that they use), to avoid a (short) period of time where an app server is still up but the databases and Swift container are already down.

Contributor Author

There is no reason for this order, and doing it the other way around sounds good.

However, I don't understand your comment about "keeping the previous order". The old version did not call the deprovision_*() methods at all when calling shut_down(), but only when the instance was deleted. The delete() and shut_down() methods both had clean-up code that was mostly disjoint, and I consolidated that code. I'm not sure we actually want to deprovision the databases when calling shut_down(), but I think we do.

Member

@smarnach By "keeping the previous order" I was referring to the order of steps in the delete method, which used to terminate app servers first and then call the deprovision_* methods (see 651acde#diff-b66ef4710ed40f1b980e3ab3a34ec812L276).

Re: the question about deprovisioning databases when shutting down an instance, I think it makes sense to do that. The shut_down method only gets called automatically for instances associated with a merged PR, and there is a grace period of 7 days to save any data in case it is needed beyond the EOL of an instance. Also forwarding this question to @antoviaque though, to see what he thinks :)

Member

@smarnach @itsjeyd Thanks for asking - for now I'd rather not delete the DB/Swift data when we shut down an instance; if we mistakenly shut one down, that would make it more difficult to bring it back up (theoretically the data will be in the backups, but not necessarily the very latest version of it, and in any case restoring it would be a lengthy process).

Specifically, we have just promised to the betatesters that they could resume their instance within one year for the inactive ones we are currently shutting down, so keeping the data would make it safer to keep true to that promise : )
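The ordering that emerged from this thread (terminate app servers first; deprovision databases and storage only on delete(), not on shut_down()) can be sketched schematically in plain Python, with simplified, hypothetical method names:

```python
calls = []

class InstanceSketch:
    # Schematic only: each step records its name instead of doing real work.
    def disable_monitoring(self):    calls.append("disable_monitoring")
    def terminate_appservers(self):  calls.append("terminate_appservers")
    def remove_dns_records(self):    calls.append("remove_dns_records")
    def deprovision_databases(self): calls.append("deprovision_databases")

    def shut_down(self):
        # Keeps databases and Swift data intact, so a mistakenly
        # shut-down instance can still be revived.
        self.disable_monitoring()
        self.terminate_appservers()

    def delete(self):
        # Full teardown: shut down first, then remove external resources,
        # so no app server is ever up while its databases are gone.
        self.shut_down()
        self.remove_dns_records()
        self.deprovision_databases()

InstanceSketch().delete()
print(calls)
# → ['disable_monitoring', 'terminate_appservers',
#    'remove_dns_records', 'deprovision_databases']
```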

"""
self.disable_monitoring()
Member

@smarnach Based on just this one commit it looks like this is getting removed completely? I think shut_down should call disable_monitoring (before terminating any app servers).

Contributor Author

I added a set_appserver_inactive() method as the counterpart to set_appserver_active(). This method gets called implicitly when the active appserver is terminated, and handles disabling monitoring.

Member

@smarnach Sounds good!

@itsjeyd (Member) commented Nov 3, 2016

@smarnach I had a look at the commit. Grouping all of the deprovisioning logic in a single method (shut_down) and just calling that from delete is a good idea. See the remaining comments for a couple things I noticed/had questions about.

I hope this helps, and sorry for missing earlier pings! Somewhat ironically, this happened because of the fact that I get notifications about all of the pull requests against this repo -- I like to stay in the loop about what's happening, but if I'm not a reviewer on a PR, I usually only skim the notifications, and that's how I missed that you pinged me directly on this PR.

@smarnach (Contributor Author) commented Nov 3, 2016

@itsjeyd Thanks for taking a look! I suggest you add a mail filter that flags messages with mention@noreply.github.com in the cc header for your attention, since this indicates that you have been pinged. See https://help.github.com/articles/about-notification-emails/#the-cc-address for details.

@bdero (Contributor) commented Nov 4, 2016

@smarnach I tested this using your instructions and everything seems to be working well for me.

On my initial pass through the code I didn't see anything alarming, but that may be because I'm still new to a large portion of the codebase. It also looks like most of this has already been exhaustively reviewed.

I will take a deeper look to make sure there are no gaps in my understanding, but functionality-wise, this has a 👍 from me. I've already reviewed all of the other haproxy PRs and confirmed that the certificate management script/letsencrypt/haproxy-config and all the other pieces are working on that side of the fence.

@antoviaque (Member) left a comment

@smarnach Thanks for addressing my remaining comments! All good on my side now besides #145 (comment) - I'm 👍 once it is addressed.

@smarnach smarnach force-pushed the smarnach/load-balancer branch 4 times, most recently from 1ea6cc9 to 27bc36c Compare November 4, 2016 15:58
@bdero bdero mentioned this pull request Nov 4, 2016
1 task
@bdero (Contributor) left a comment

@smarnach I looked over your code and got to test it pretty extensively while working on #152, so I'm giving this a thumbs up if there are no other major changes from here.

👍

  • I tested this: I followed the testing instructions and provisioned new instances, witnessed the playbook run, and working TLS support for the domains magically happened.
  • I read through the code
  • I checked for accessibility issues N/A
  • Includes documentation

):
"""
Convenience wrapper for run_playbook() that captures the output of the playbook run.
"""
Contributor

Thanks for this - it's nice to have this easy generalized "run a playbook" method. :)

@smarnach smarnach merged commit 2bfa62c into master Nov 7, 2016
@bradenmacdonald bradenmacdonald deleted the smarnach/load-balancer branch November 9, 2016 17:38