This repository has been archived by the owner on Aug 22, 2022. It is now read-only.

Load Balancer, Instant Switchover & SSL Support #145

Merged: 32 commits, Nov 7, 2016

Conversation

smarnach (Contributor):

This implements support for load-balancing servers.

For background on the configuration of the load-balancing server, please see https://github.com/open-craft/ansible-load-balancer/blob/master/README.md.

If you prefer, you can review the first commit separately. It is a self-contained refactoring that doesn't change any functionality.

The algorithm for selecting a load-balancing server is very simple at the moment: it just picks one at random. For now this is enough, since we only have a single server.
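As a rough illustration, the random choice described above might look like this (plain Python, not the actual codebase; the function and argument names are hypothetical stand-ins for the Django queryset logic):

```python
import random

def select_load_balancing_server(servers):
    """Pick one load-balancing server at random from the available ones.

    `servers` stands in for something like LoadBalancingServer.objects.all();
    with a single server configured, the choice is trivially deterministic.
    """
    if not servers:
        raise ValueError("no load-balancing servers configured")
    return random.choice(servers)
```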

Testing instructions

  1. Add the following settings to the .env file on your vagrant instance:

    LOAD_BALANCER_FRAGMENT_NAME_PREFIX='$USERNAME-'
    PRELIMINARY_PAGE_SERVER_IP='149.202.186.237'
    

    $USERNAME is some name identifying you, and is used to recognise your configuration file fragments on the dev load-balancing server, which is shared between all of us.

  2. Run make migrate.

  3. Add the public SSH key of your vagrant instance (probably /home/vagrant/.ssh/id_rsa.pub) to the authorised keys for the ubuntu user (/home/ubuntu/.ssh/authorized_keys) on the server haproxy-dev.opencraft.hosting. You can use the "ssh key for databases instances" from the secrets repository to log into that server.

  4. Test that you can log into haproxy-dev.opencraft.hosting from your vagrant instance using ssh ubuntu@haproxy-dev.opencraft.hosting.

  5. In the Django shell, run

    LoadBalancingServer.objects.create(
        domain="haproxy-dev.opencraft.hosting",
        ssh_username="ubuntu")
    
  6. Create a new instance and spawn an appserver, e.g.

    instance = OpenEdXInstance.objects.create(
        sub_domain="$USERNAME-haproxy-test",
        name="haproxy test",
        configuration_source_repo_url="https://github.com/open-craft/configuration.git",
        configuration_version="opencraft-release/eucalyptus.2",
        edx_platform_repository_url="https://github.com/open-craft/edx-platform.git",
        edx_platform_commit="opencraft-release/eucalyptus.2",
        openedx_release="open-release/eucalyptus.2",
    )
    from instance.tasks import spawn_appserver
    spawn_appserver(instance.ref.pk)
    

    The DNS record will be set up immediately, but will of course take some time to propagate. Once the load-balancing server can verify that the DNS record points to it, it will request an SSL certificate for the new domain names. Once this has happened, you should be able to access the link to the LMS in the instance manager, and it should show the page informing you that the instance is being provisioned via HTTPS. (In the real production setup, we have a wildcard DNS entry for *.opencraft.hosting, so users will see the splash screen even before DNS propagation, via plain HTTP.)

  7. Activate the appserver once it's fully deployed. Since no DNS propagation is necessary to activate the appserver, it should become available under the domain name in less than one minute.

@mtyaka (Member) left a comment:

@smarnach The code looks great. I am currently provisioning a new instance to test this manually.

Two questions:

  • We have a protocol field on OpenEdXInstance and OpenEdXAppServer models. Does this change make that field redundant (since all instances will only be available over https)?
  • Should we display the load balancer that is associated to the instance in the GUI? It probably won't be very helpful unless we start using more than one load balancer instance.

def process(self, msg, kwargs):
msg, kwargs = super().process(msg, kwargs)
msg = self.extra['obj'].format_log_message(msg)
return msg, kwargs
Member:

Nice refactoring. Since we only ever pre-pend extra annotation, we could simplify the code even further by requiring models to implement log_message_annotation rather than format_log_message:

def process(self, msg, kwargs):
    msg, kwargs = super().process(msg, kwargs)
    annotation = self.extra['obj'].log_message_annotation()
    if annotation:
        msg = "{} | {}".format(annotation, msg)
    return msg, kwargs

It would be less general, but would cover all current use cases.

Contributor Author:

@mtyaka I first considered doing this, but there is actually one corner case that isn't covered by the less general approach: The AppServer class might not annotate the log message at all if it's not owned by an instance.

I copied this corner case from the old code, but thinking about it, I don't think this can ever happen, so I can make the change you suggested.

return "load_balancer={} ({!s:.15}) | {}".format(self.pk, self.domain, msg)

def get_configuration(self):
"""Collect the backend maps and configuration fragments from all assoicated instances."""
Member:

Typo: assoicated

@antoviaque (Member) left a comment:

@smarnach Nice to see this!

config_script = self.get_config_script()
try:
subprocess.run(
["ssh", "-T", "-o", "PasswordAuthentication=no", "-l", self.ssh_username, self.domain, "sudo sh"],
Member:

Could you do this via ansible, like @bdero did in #144 ? It would be good to use a single approach for executing commands on a VM, and to factorize the common parts.

Contributor Author:

Using Ansible was what I originally suggested in the design doc. However, I found that it has quite a bit of overhead for this simple use case. Even when reusing the bits that are already there, the code will become more complex, a lot slower, and probably more error-prone. It will also take a while to implement this.

Alternatively, I could factor out running SSH into a separate function. (I should do this anyway to be able to mock it out for testing.)

Member:

@smarnach Brandon's work on #144 already implemented this, so a part of the work is already done - you would still need to rebase on top of his PR and factorize it, but the ground work is there? If we use different approaches for executing code on VMs, they will each grow/evolve on their own, and it will be harder to reconcile the work later on. I'd rather have a single one, which we can work on making more robust if needed - that would benefit both your work and Brandon's.

@@ -84,6 +86,8 @@ class OpenEdXInstance(Instance, OpenEdXAppConfiguration, OpenEdXDatabaseMixin,
external_lms_preview_domain = models.CharField(max_length=100, blank=True)
external_studio_domain = models.CharField(max_length=100, blank=True)

load_balancing_server = models.ForeignKey(LoadBalancingServer, null=True, blank=True, on_delete=models.PROTECT)
Member:

Can you move the logic/data that aren't specific to an Open edX instance to a LoadBalancedInstance mixin?

Contributor Author:

@antoviaque That was my original approach, but it turned out that all the logic I added needs access to either the domain names or the active appserver, and it is all Open edX-specific. The generic bits are implemented in LoadBalancingServer. If I were to move this to a mixin, it would literally only contain this field, and maybe an abstract prototype for get_load_balancer_configuration(), but no actual code.

The implementation of LoadBalancingServer actually isn't completely generic, since it uses the openedxinstance_set attribute to retrieve a list of all backends it load balances. The only alternative to this approach I can think of is to add the load_balancing_server field to InstanceReference, which definitely feels wrong as well. I have thought about how to solve this, but I didn't manage to come up with a really nice and clean solution.

Contributor Author:

I noticed that it's actually quite easy to make LoadBalancingServer completely generic by using Django model reflection to find the subclasses of Instance that refer to LoadBalancingServer, so I've implemented this now. To make another Instance subclass load-balanced, all that's needed is to add a ForeignKey field like this one and implement get_load_balancer_configuration().
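A minimal sketch of that reflection pattern, using plain-Python stand-ins for the Django machinery (all names are illustrative; in the real code the reverse relations would be discovered via `self._meta.get_fields()`):

```python
class Instance:
    """Stand-in for the abstract Instance base class."""

class OpenEdXInstance(Instance):
    """Stand-in for a concrete subclass with a ForeignKey to the load balancer."""

class LoadBalancingServer:
    def __init__(self, reverse_relations):
        # In Django, the reverse one-to-many relations would be found by
        # iterating over self._meta.get_fields(); here they are passed in
        # explicitly so the sketch runs without Django.
        self.reverse_relations = reverse_relations

    def get_instances(self):
        """Yield all instances configured to use this load balancer."""
        for model, instances in self.reverse_relations.items():
            if issubclass(model, Instance):
                yield from instances
```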

Does this settle this point, or do you think I should add a mixin class containing only this field?

Member:

Cf my comment below, at https://github.com/open-craft/opencraft/pull/145/files#diff-b66ef4710ed40f1b980e3ab3a34ec812R273 - yes, the mixin could have the field, but it could also contain the logic specific to the load balancer, and only have the OpenEdXInstance object expose the information the load balancer mixin needs to generate the config.

Contributor Author:

I've replied on the linked comment. I'll introduce a LoadBalancedInstance class containing only the field for now.

self.disable_monitoring()
self.active_appserver = None
self.save()
self.load_balancing_server.reconfigure()
Member:

Might be worth syncing up with @itsjeyd on this, since he's adding shutdown logic in https://github.com/open-craft/opencraft/pull/143/files#diff-9842c8f4a538c1c1735bdb68328d427aR84

Member:

@itsjeyd @smarnach Have you done this?

Contributor Author:

There are a few clean-up related things that need to be sorted out. E.g. we currently don't remove DNS entries for old instances, and we need to remove the haproxy configuration fragments for instances that get terminated. The clean-up code needs some reconciliation. Maybe we can do this once all the current PRs are merged? It's kind of difficult to get this done with several PRs in flight.

Member:

@smarnach I don't think I had replied to this, but since @itsjeyd's PR was merged more than a week ago, there has been no blocker on this for some time now? It was one of my very first review comments, so I'd like to see it addressed before this is merged.

Btw I don't think Tim is seeing our pings - he might be ignoring notifications from PRs he is not involved in, not expecting a ping here?

@@ -95,3 +95,17 @@ def emit(self, record):
# TODO: Filter out log entries for which the user doesn't have view rights
# TODO: More targetted events - only emit events for what the user is looking at
publish_data('log', log_event)


class ModelLoggerAdapter(logging.LoggerAdapter):
Member:

Nice refactoring!

@smarnach (Contributor Author):

@mtyaka

We have a protocol field on OpenEdXInstance and OpenEdXAppServer models. Does this change make that field redundant (since all instances will only be available over https)?

I don't think so. The instance manager is currently completely unaware that the instances are served over HTTPS. The links to the instances are still HTTP links (and I don't plan to change this, since it would break the splash screen before DNS propagation). The load balancer still talks HTTP to the servers as well. For now, I would leave the protocol field untouched.

Should we display the load balancer that is associated to the instance in the GUI? It probably won't be very helpful unless we start using more than one load balancer instance.

Yes, we should do this. I've added it to my notes.

@smarnach (Contributor Author):

@mtyaka @antoviaque This PR introduces, for the first time, settings stored in the database: the domain name and SSH username of the load-balancing server. This raises the question of how to configure this for the integration tests. One option would be to add a management command to add a load-balancing server, and add a call to that command to circle.yml.

And which load-balancing server should I use for the integration tests? haproxy-dev or haproxy-stage? The integration tests use the same OpenStack credentials as stage, so I'd tend to use haproxy-stage.

@smarnach (Contributor Author):

@mtyaka @antoviaque I've chosen a different approach for the integration tests: I've introduced a Django setting that can be used to create a LoadBalancingServer in the database. This makes it easy to configure the integration tests in the same way as we've always done it.

# -*- coding: utf-8 -*-
#
# OpenCraft -- tools to aid developing and hosting free software projects
# Copyright (C) 2015 OpenCraft <xavier@opencraft.com>
Member:

2016 : )


from instance.logging import ModelLoggerAdapter
from .instance import Instance
from .utils import ValidateModelMixin
Member:

If I remember correctly, we settled on always using absolute imports, no?

Contributor Author:

I didn't know about that. There are several relative imports in the code though.

Pylint doesn't seem to have a feature to warn about explicit relative imports (probably because they are generally considered a good thing). Without a pylint warning, it will be hard to enforce the policy of only using absolute imports. (To be clear, I don't mind the policy.)

I'll replace all relative imports I introduced.

Member:

Yes, it's not too bad, more of a stylistic decision. We can always add checks to pylints if we want to enforce it more strongly, but for now just through reviews when we add/change code would be enough imho.


def get_instances(self):
"""Yield all instances configured to use this load balancer."""
for field in self._meta.get_fields():
if field.one_to_many and issubclass(field.related_model, Instance):
Member:

issubclass(field.related_model, LoadBalancedInstance) ?

Contributor Author:

If we are introducing that class, I could do that. Since I'm only iterating over the classes that have ForeignKey fields pointing to LoadBalancedInstance, it doesn't make a difference.

"""
if self.active_appserver:
backend_name = "be-{}".format(self.active_appserver.server.name)
server_name = "appserver-{}".format(self.pk)
Member:

self is an instance not an appserver, no?

)

def get_load_balancer_configuration(self):
Member:

This could also be moved to the LoadBalancedInstance mixin - the instance could just have generic domain_names and ip_addresses properties, so that it doesn't need to know about the internals of the load balancer?

Contributor Author:

The configuration template is also specific to Open edX. The stickiness set-up relies on the name of the session cookie, and for other kinds of instances, other haproxy configuration settings might make sense, possibly requiring other information from the instances than the information passed to the template for the case of Open edX.

As I've said before, the code is split up as follows: the parts specific to Open edX are on the OpenEdXInstance class, and the generic bits are on LoadBalancingServer. Even if I were to make the change you proposed here, the generic bits of the code should be moved to LoadBalancingServer, and not to a third place.

It would be possible to move some of the current code from LoadBalancingServer to a new class LoadBalancedInstance. However, I don't see how this would improve the code. You would just have to look in one more place to understand how the parts are working together.

Member:

@smarnach Alright, I see - I was hoping the load balancer configuration could be more independent of Open edX, but since that isn't the case, leaving that method on the OpenEdXInstance object works for me.

It would be possible to move some of the current code from LoadBalancingServer to a new class LoadBalancedInstance. However, I don't see how this would improve the code. You would just have to look in one more place to understand how the parts are working together.

If the LoadBalancer takes a raw configuration from the instance, you will end up with logic common to all the Instance objects that can't easily be moved to the LoadBalancer object. For example, in your current get_load_balancer_configuration method, the preliminary page server logic would be generic to all types of instances imho; but if you move that logic to the LoadBalancer object, you would have to return a special value to signify that a given instance wants to use the preliminary page. It would be easier to understand and more maintainable imho to have this logic in a method on LoadBalancingInstanceMixin which the get_load_balancer_configuration method can call. It is an additional place to look at, but it would help to explicitly differentiate between the generic parts and the ones specific to Open edX.
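A sketch of the split being proposed here (plain Python, not the actual codebase; class names, backend naming, and the placeholder IP are all illustrative):

```python
PRELIMINARY_PAGE_SERVER_IP = "192.0.2.1"  # placeholder, not the real server IP

class LoadBalancedInstance:
    """Generic mixin: logic shared by all load-balanced instance types."""

    def get_preliminary_page_config(self):
        # Route this instance's domains to the "please wait" page server.
        return "be-preliminary-page-{}".format(self.pk), PRELIMINARY_PAGE_SERVER_IP

class OpenEdXInstanceSketch(LoadBalancedInstance):
    """Open edX-specific part: knows about the active appserver."""

    def __init__(self, pk, active_appserver_ip=None):
        self.pk = pk
        self.active_appserver_ip = active_appserver_ip

    def get_load_balancer_configuration(self):
        # Use the active appserver if there is one; otherwise fall back to
        # the generic preliminary-page logic on the mixin.
        if self.active_appserver_ip:
            return "be-appserver-{}".format(self.pk), self.active_appserver_ip
        return self.get_preliminary_page_config()
```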

Contributor Author:

@antoviaque Yes, true, it's possible to split out the logic for the preliminary page, and it makes sense to move this to the LoadBalancedInstance mixin. It's not clear to me this logic will ever be reused, so I didn't split it out in the spirit of You aren't gonna need it, but I will do so now.

Contributor Author:

I tried moving the configuration for the preliminary page to the LoadBalancedInstance mixin. However, this results in a "no self use" pylint warning, so I have to make it a module-level function.




LOAD_BALANCER_FRAGMENT_NAME_PREFIX = env('LOAD_BALANCER_FRAGMENT_NAME_PREFIX', default='opencraft-')
LOAD_BALANCER_CONF_DIR = env('LOAD_BALANCER_CONF_DIR', default='/etc/haproxy/conf.d')
LOAD_BALANCER_BACKENDS_DIR = env('LOAD_BALANCER_BACKENDS_DIR', default='/etc/haproxy/backends')
Member:

If command execution is done through ansible, these paths could go to the ansible role for this command. Since that would eventually be reconciled with the load balancer ansible role, to allow to manage/redeploy load balancers via OpenCraft IM, they could remain internal to ansible and not require OpenCraft IM to know about these or make them configurable.

Contributor Author:

Yep, the paths are a bit out of place here. I thought about introducing commands like add-haproxy-conf-fragment and add-haproxy-backend-map-fragment on the load balancer, and making these commands the public interface of the load balancer. That would also make the flock locking on the load balancer (meant to avoid reading half-written fragments) completely transparent.

Member:

@smarnach Now that you use ansible, you don't need these two anymore, right? I don't see them used anywhere in the PR anymore.


# The load-balancing server given in the form ssh_username@server.domain will be created
# in the database if it does not exist yet.
DEFAULT_LOAD_BALANCING_SERVER = env('DEFAULT_LOAD_BALANCING_SERVER', default=None)
Member:

I've chosen a different approach for the integration tests: I've introduced a Django setting that can be used to create a LoadBalancingServer in the database. This makes it easy to configure the integration tests in the same way as we've always done it.

Yup, I also prefer this approach - keeping a single method of configuration helps keep things simple, without additional mental overhead.
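The setting's `ssh_username@server.domain` format could be parsed along these lines (a sketch; the actual helper name in the codebase may differ):

```python
def parse_load_balancing_server(spec):
    """Split 'ssh_username@server.domain' into its two parts."""
    ssh_username, separator, domain = spec.partition("@")
    if not separator or not ssh_username or not domain:
        raise ValueError(
            "expected 'ssh_username@server.domain', got {!r}".format(spec)
        )
    return ssh_username, domain
```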

@@ -475,3 +475,15 @@
# A new relic admin user's API key, used to set up availability monitoring
# with Synthetics
NEWRELIC_ADMIN_USER_API_KEY = env('NEWRELIC_ADMIN_USER_API_KEY', default=None)


# Load balancing ##############################################################
Member:

Can you add documentation about these settings to the README?

raise


def _create_default_load_balancing_server():
Member:

These two functions should go on a LoadBalancer model manager?

else:
backend_name = "be-preliminary-page-{}".format(self.pk)
server_name = "preliminary-page"
ip_address = settings.PRELIMINARY_PAGE_SERVER_IP
Member:

So we are using the load balancer to serve the "please wait" page then, at the end? So we should change the default wildcard for *.opencraft.hosting to point to the load balancer? That would also ensure it can get the HTTPS certificates more quickly, right?

Contributor Author:

The "please wait" page can only be served via HTTPS once a certificate has been retrieved, and a certificate can only be retrieved once the DNS entry points to the load balancer.

With the current setup, the following is happening: The links in the instance manager are still HTTP links (and I don't plan to change this). Clicking on the link before DNS has propagated will result in the "please wait" page being served via HTTP. Once the load balancer detects that DNS has propagated, it will request a cert and start serving the page via HTTPS (well, it doesn't actually "serve" it, but you know what I mean). Since users and the load-balancer might see the results of DNS propagation at different times, and since the load balancer only checks every five minutes whether it needs to request new certs, there can be a gap between the two ways of serving the page.

If we change the wildcard entry to point to the load balancer instead, the load balancer will be able to request a certificate more or less immediately, and then be able to serve the "please wait" page without interruption, so yes, that might be slightly better.

Member:

Yup, that matches what I had in mind - sounds good then, thanks. Can you update the wildcard when you deploy this? It might be worth adding it to a TODO here or on the task to not forget.

Member:

@smarnach Has a TODO entry been added somewhere?

Contributor Author:

Yes, it's on the Jira ticket (and in my paper notes as well).

@mtyaka (Member) left a comment:

I ran into some issues when testing, but I think the problem is with my configuration. I will try again.

return settings.LOAD_BALANCER_FRAGMENT_NAME_PREFIX + self.fragment_name_postfix

def get_config_script(self):
"""Render the configuration script to be executed on the loader balancer."""
Member:

typo: "loader" -> "load"



def select_load_balancing_server():
"""Select a load-balancing server for a new instance.
Member:

nit: We usually put a newline after """.

Contributor Author:

Hmm, the code base doesn't seem to be consistent on this, so I went with the official convention in PEP 257. I'm happy to change it, though.

@mtyaka (Member), Oct 24, 2016:

Ah sorry, never mind then. While looking at the diff I had openedx_instance.py opened, which uses newlines consistently, but if we are not consistent in other places it doesn't make sense to change it.

Contributor Author:

I had a look again, and it indeed looks like the overwhelming majority of modules use the other style, so I'll change it anyway. I just happen to have stumbled upon the few inconsistent places.

# We unconditionally set the DNS records here, though this would only be strictly needed
# when the first AppServer is spawned. However, there is no easy way to tell whether the
# DNS records have already been successfully set, and it doesn't hurt to alway do it.
self.set_dns_records()
Member:

In theory setting DNS entries and configuring the load balancer could be done when creating the instance (even before spawning any app servers).

I think it makes sense to set DNS records every time when spawning a new appserver because instance domains may change (for example instance uses *.opencraft.hosting domain at first, but later switches to custom/external domains).

Member:

I just realized we don't seem to remove DNS entries from gandi when instances are destroyed. It has nothing to do with this PR, but it would make sense to implement pruning of DNS records when instances are destroyed.

Contributor Author:

Setting DNS records should only happen inside a worker task, since it is potentially slow. That's the main reason why I decided not to do it implicitly upon instance creation.

I also noticed that DNS records don't get cleaned up. The load balancer configuration fragments also need to be cleaned up. Apparently @itsjeyd is currently working on refactoring the clean-up code, so we should coordinate the approaches to do that.

Member:

Yup, @itsjeyd seems to have missed my two other pings about this on the current PR, so could you grab him on IRC/Jira about this? In the meantime his PR has been merged though, so this would need to be done in this PR (rebased against master to not conflict with his work), or added as a ticket in the backlog. The former is better as long as it doesn't make this spill over, so we can avoid the overhead of a new task while this is fresh in mind.

@log_exception
def spawn_appserver(self):
"""
Provision a new AppServer.

Returns the ID of the new AppServer on success or None on failure.
"""
if not self.load_balancing_server:
self.load_balancing_server = select_load_balancing_server()
self.load_balancing_server.reconfigure()
@mtyaka (Member), Oct 24, 2016

I think we should always reconfigure the load balancer to be able to change domains and be able to transparently reassign an instance from one load balancer to another?

edit: nevermind, we invoke reconfigure() when setting active appserver, doing it here would be wrong.

@mtyaka (Member) left a comment

I think I found out why the splash page for my instance wasn't working. The instance also failed to provision, but it looks like that was a random problem with one of the package servers. I'm trying again.

@log_exception
def spawn_appserver(self):
"""
Provision a new AppServer.

Returns the ID of the new AppServer on success or None on failure.
"""
if not self.load_balancing_server:
self.load_balancing_server = select_load_balancing_server()
@mtyaka (Member), Oct 24, 2016

We have to add a self.save() right after assigning the load balancer, otherwise the reconfigure() invocation on the line below won't find this instance and won't include it in the haproxy configuration.

Contributor Author

Right, thanks!
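The fix agreed above can be illustrated with a minimal, Django-free simulation (a registry set stands in for the database; the class and method names are simplified assumptions, not the actual models):

```python
# Toy simulation of the review fix: reconfigure() only sees *saved*
# instances, so the assignment must be persisted first.
SAVED_INSTANCES = set()

class LoadBalancingServer:
    def reconfigure(self):
        # The real method regenerates the haproxy config from all
        # instances saved in the database; unsaved ones are invisible.
        return sorted(SAVED_INSTANCES)

class Instance:
    def __init__(self, name):
        self.name = name
        self.load_balancing_server = None

    def save(self):
        SAVED_INSTANCES.add(self.name)

    def spawn_appserver(self):
        if not self.load_balancing_server:
            self.load_balancing_server = LoadBalancingServer()
            self.save()  # persist *before* reconfigure(), per the review
        return self.load_balancing_server.reconfigure()

backends = Instance("demo").spawn_appserver()
print(backends)  # → ['demo']
```

Without the save() call, reconfigure() would return an empty backend list for the new instance.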

@smarnach (Contributor Author)

The integration tests are currently failing because the backend names of the two instance deployment tests running in parallel clash. I need a small refactoring to fix this.

@smarnach (Contributor Author)

@mtyaka I think I've addressed all review comments, except for these open issues:

  • Using Ansible for running the commands on the load balancer. I'd like to wait for Added management command for collecting activity stats. #144 to be merged before doing this, to avoid having to rebase repeatedly.
  • The cleanup-code isn't complete yet, since I wanted to wait for @itsjeyd to comment.
  • I don't know yet whether the integrations tests are passing now.

@mtyaka (Member) commented Oct 24, 2016

@smarnach Thanks for all the updates!

Conditional 👍, assuming these are taken care of before this gets merged:

  • I tested this: I successfully created a new instance and spawned a new app server. The splash page was (almost) immediately visible. The LMS/Studio were accessible right after new app server was successfully spawned.
  • I read through the code
  • Includes documentation

@smarnach (Contributor Author)

The integration tests failed because DNS propagation took three hours, so the load balancer wasn't able to request SSL certificates in time. The negative caching time does indeed seem to be set to three hours:

$ dig @a.dns.gandi.net non-exisitng-subdomain.integration.plebia.net

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @a.dns.gandi.net non-exisitng-subdomain.integration.plebia.net
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14461
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;non-exisitng-subdomain.integration.plebia.net. IN A

;; AUTHORITY SECTION:
plebia.net.     10800   IN  SOA a.dns.gandi.net. hostmaster.gandi.net. 1477324491 10800 3600 604800 10800

;; Query time: 105 msec
;; SERVER: 173.246.98.1#53(173.246.98.1)
;; WHEN: Mon Oct 24 22:23:02 CEST 2016
;; MSG SIZE  rcvd: 133

I have no idea why the DNS negative caching time is set that high, or whether we can change it.

I will create a wildcard DNS record for *.integration.plebia.net pointing to the load balancer, so we don't have to wait for DNS propagation. The downside is that we also won't be able to test DNS propagation, but we won't be able to do that anyway, since the integration tests on CircleCI are limited to two hours.
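The negative-caching time can be read directly off the SOA record in the dig output above; per RFC 2308 resolvers cache NXDOMAIN answers for the SOA "minimum" field (the last number), which is 10800 seconds here:

```python
# SOA record fields from the dig answer above (RFC 1035 / RFC 2308):
# mname rname serial refresh retry expire minimum
soa = "a.dns.gandi.net. hostmaster.gandi.net. 1477324491 10800 3600 604800 10800"
fields = soa.split()
minimum = int(fields[-1])  # negative-caching TTL, in seconds
print(minimum // 3600)  # → 3 (hours), matching the observed delay
```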

@smarnach (Contributor Author)

Another problem is that the integration tests are using up our Let's Encrypt quota for plebia.net relatively quickly. We will have to switch to the staging instance of Let's Encrypt for the integration tests, and convince the requests library to somehow accept the invalid certificates issued by them. That's going to be fun. :-)

@antoviaque (Member)

Another problem is that the integration tests are using up our Let's Encrypt quota for plebia.net relatively quickly. We will have to switch to the staging instance of Let's Encrypt for the integration tests, and convince the requests library to somehow accept the invalid certificates issued by them. That's going to be fun. :-)

What uses the quota? Would adding plebia.net to the same list as opencraft.hosting be an option to avoid this issue?

@antoviaque (Member)

I have no idea why the DNS negative caching time is set that high, or whether we can change it.
I will create a wildcard DNS record for *.integration.plebia.net pointing to the load balancer

It's probably still using the default value (10800). If you can edit the plebia.net domain on Gandi, you can set it to 1800 instead like we do for most of our domains, which should help? You might even be able to use a lower value, to get the integration test to complete more quickly.

@@ -94,7 +94,8 @@ test_unit: clean static_external
honcho -e .env.test run coverage run --source='.' --omit='*/tests/*' ./manage.py test --noinput
coverage html
@echo -e "\nCoverage HTML report at file://`pwd`/build/coverage/index.html\n"
@coverage report --fail-under 94 || (echo "\nERROR: Coverage is below 94%\n" && exit 2)
# Temporarily relax required coverage to force all tests to run on CircleCI
@coverage report --fail-under 93 || (echo "\nERROR: Coverage is below 94%\n" && exit 2)
@antoviaque (Member), Oct 25, 2016

I assume this is temporary :) (re-reading: yes, it's actually written in the comment above.) Remember to revert before merging.

Contributor Author

I've added an item to the check list on the ticket to remember this.

return servers[random.randrange(count)]


class LoadBalancingServer(ValidateModelMixin, models.Model):
Member

Could this be:

class LoadBalancingServer(ValidateModelMixin, TimeStampedModel)
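For illustration, a Django-free sketch of what the suggested TimeStampedModel base buys (the real class comes from django-model-utils; this toy analogue only mimics its created/modified bookkeeping):

```python
from datetime import datetime, timezone

class TimeStampedModel:
    """Toy stand-in for django-model-utils' TimeStampedModel: records
    creation and last-modification timestamps automatically."""
    def __init__(self):
        self.created = datetime.now(timezone.utc)
        self.modified = self.created

    def save(self):
        self.modified = datetime.now(timezone.utc)

class LoadBalancingServer(TimeStampedModel):
    def __init__(self, domain, ssh_username):
        super().__init__()
        self.domain = domain
        self.ssh_username = ssh_username

lb = LoadBalancingServer("haproxy-dev.opencraft.hosting", "ubuntu")
lb.save()
assert lb.modified >= lb.created
```

With the real mixin, the timestamps are model fields updated on every save(), which is useful for auditing when a load balancer was added or last touched.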

@smarnach (Contributor Author)

@antoviaque

What uses the quota? Would adding plebia.net to the same list as opencraft.hosting be an option to avoid this issue?

The integration tests make the load balancer request certificates for the instances they bring up. Each time the integration tests are run, two certificates are requested, so we could only run the integration tests ten times per week.

Adding plebia.net to the Public Suffix List would not solve the problem. We would have to add integration.plebia.net to that list instead. However, that wouldn't be the intended use of the list, since no mutually non-trusting parties can publicly register subdomains of integration.plebia.net, so I wouldn't want to do that.

Using the staging instance of Let's Encrypt is actually the correct solution. It's meant to be used for this purpose. Injecting another root certificate to accept to the requests library probably isn't going to be difficult; I haven't looked into it yet. If it turns out to be difficult, I'll just use a different library.
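Pointing the requests library at an extra root certificate is indeed straightforward: a Session's verify attribute (or the REQUESTS_CA_BUNDLE environment variable) accepts a path to a custom CA bundle. The bundle path below is a hypothetical example, not a file in this repository:

```python
import requests

# Hypothetical path to a PEM bundle containing the Let's Encrypt staging
# root certificate ("Fake LE Root"); assumed to exist on the test machine.
STAGING_CA_BUNDLE = "/etc/ssl/certs/lets-encrypt-staging-roots.pem"

session = requests.Session()
# Every request made on this session validates server certificates
# against the custom bundle instead of the system default store.
session.verify = STAGING_CA_BUNDLE

# session.get("https://demo.integration.plebia.net/")
# would now accept certificates issued by the staging CA.
```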

@smarnach (Contributor Author)

@antoviaque

It's probably still using the default value (10800). If you can edit the plebia.net domain on Gandi, you can set it to 1800 instead like we do for most of our domains, which should help? You might even be able to use a lower value, to get the integration test to complete more quickly.

I've set a wildcard entry for *.integration.plebia.net pointing to haproxy-stage.opencraft.hosting, which solves the problem.

I don't think I can change the 10800 seconds used for negative caching anywhere, and it is rather unusual to use the same TTL for negative and positive replies, which is what Gandi seems to be doing.

@smarnach smarnach force-pushed the smarnach/load-balancer branch 2 times, most recently from 69ad196 to a77db29 Compare October 30, 2016 22:42
@antoviaque (Member)

@smarnach

Using the staging instance of Let's Encrypt is actually the correct solution. It's meant to be used for this purpose.

Sounds good, thanks for the explanation.

I don't think I can change the 10800 seconds used for negative caching anywhere, and it is rather unusual to use the same TTL for negative and positive replies, which is what Gandi seems to be doing.

Ah, true, this seems to actually be defined in the DNS SOA record, so we could only change it if we were to use our own DNS servers. Your approach with a wildcard is much simpler.

@smarnach (Contributor Author)

@mtyaka While I already got a conditional approval, there have been quite a few changes since the last time you looked. I've rebased on top of master and added the commits starting from "Factor out load balancer configuration of the preliminary page." Could you please review again?

@antoviaque changed the title from "Smarnach/load balancer" to "Load Balancer, Instant Switchover & SSL Support" on Nov 1, 2016
from instance.models.server import Status as ServerStatus
from instance.utils import sufficient_time_passed
from .instance import Instance
from .mixins.load_balanced import LoadBalancedInstance
Member

Still a relative import here. I imagine you did this for consistency's sake - but in that case I'd rather change the rest of the imports next to it.


log_lines = []
with ansible.run_playbook(
return ansible.capture_playbook_output(
Member

Great that this allowed to factorize the call to ansible here, too - thanks!


LOAD_BALANCER_FRAGMENT_NAME_PREFIX = env('LOAD_BALANCER_FRAGMENT_NAME_PREFIX', default='opencraft-')
LOAD_BALANCER_CONF_DIR = env('LOAD_BALANCER_CONF_DIR', default='/etc/haproxy/conf.d')
LOAD_BALANCER_BACKENDS_DIR = env('LOAD_BALANCER_BACKENDS_DIR', default='/etc/haproxy/backends')
Member

@smarnach Now that there you use ansible, you don't need these two anymore, right? I don't see them used anywhere in the PR anymore.

self.disable_monitoring()
self.active_appserver = None
self.save()
self.load_balancing_server.reconfigure()
Member

@smarnach I don't think I had replied to this, but since @itsjeyd's PR has been merged more than a week ago, there is no blocker on this for some time now? It was one of my very first review comments, so I'd like to see this addressed before this is merged.

Btw I don't think Tim is seeing our pings - he might be ignoring notifications from PRs he is not involved in, not expecting a ping here?

# Delete server 1:
server1_id = server1.pk
server1.delete()
# Now its log entry should be deleted:
entries = LogEntry.objects.order_by('pk').all().values_list('text', flat=True)
for entry_text in entries:
self.assertNotIn('Line #2', entry_text)
self.assertIn('Line #1, on instance', entries[0])
self.assertIn('Line #3, on server 2', entries[1])
self.assertIn('Line #1, on instance', entries[1])
Member

What's entries[0] now? Should that be tested?

Contributor Author

The new line comes from the implicit creation of the default load balancer when calling OpenEdXInstanceFactory. This test function verifies that log lines for a server get deleted when the server is deleted, so I didn't feel that a line resulting from something happening in the setUp() method was related. I'll add an assertion just to prevent people from wondering what that line might be.

@smarnach (Contributor Author) commented Nov 3, 2016

@itsjeyd Could you please take a look at the last commit in this PR? I've reorganised the clean-up code to also deprovision the databases when an instance is shut down. Does this make sense? (I haven't fixed the unit tests yet, because I first want to wait for your feedback.)

self.remove_dns_records()
self.deprovision_mysql()
self.deprovision_mongo()
self.deprovision_swift()
Member

@smarnach Is there a specific reason why mysql/mongo/swift need to be deprovisioned before terminating app servers belonging to this instance? At first glance it seems like it would make sense to keep the previous order (terminate app servers first, then deprovision databases and storage that they use), to avoid a (short) period of time where an app server is still up but the databases and Swift container are already down.

Contributor Author

There is no reason for this order, and doing it the other way around sounds good.

However, I don't understand your comment about "keeping the previous order". The old version did not call the deprovision_*() methods at all when calling shut_down(), but only when the instance was deleted. The delete() and shut_down() methods both had clean-up code that was mostly disjoint, and I consolidated that code. I'm not sure we actually want to deprovision the databases when calling shut_down(), but I think we do.

Member

@smarnach By "keeping the previous order" I was referring to the order of steps in the delete method, which used to terminate app servers first and then call the deprovision_* methods (see 651acde#diff-b66ef4710ed40f1b980e3ab3a34ec812L276).

Re: the question about deprovisioning databases when shutting down an instance, I think it makes sense to do that. The shut_down method only gets called automatically for instances associated with a merged PR, and there is a grace period of 7 days to save any data in case it is needed beyond the EOL of an instance. Also forwarding this question to @antoviaque though, to see what he thinks :)

Member

@smarnach @itsjeyd Thanks for asking - for now I'd rather not delete the DB/Swift data when we shut down an instance; if we mistakenly shut one down, that would make it more difficult to bring it back up (theoretically the data will be in the backups, but not necessarily the very latest version of it, and in any case restoring it would be a lengthy process).

Specifically, we have just promised to the betatesters that they could resume their instance within one year for the inactive ones we are currently shutting down, so keeping the data would make it safer to keep true to that promise : )
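The ordering that emerged from this thread (terminate app servers first; deprovision databases and storage only on delete(), not on shut_down()) can be sketched schematically in plain Python, with simplified, hypothetical method names:

```python
calls = []

class InstanceSketch:
    # Schematic only: each step records its name instead of doing real work.
    def disable_monitoring(self):    calls.append("disable_monitoring")
    def terminate_appservers(self):  calls.append("terminate_appservers")
    def remove_dns_records(self):    calls.append("remove_dns_records")
    def deprovision_databases(self): calls.append("deprovision_databases")

    def shut_down(self):
        # Keeps databases and Swift data intact, so a mistakenly
        # shut-down instance can still be revived.
        self.disable_monitoring()
        self.terminate_appservers()

    def delete(self):
        # Full teardown: shut down first, then remove external resources,
        # so no app server is ever up while its databases are gone.
        self.shut_down()
        self.remove_dns_records()
        self.deprovision_databases()

InstanceSketch().delete()
print(calls)
# → ['disable_monitoring', 'terminate_appservers',
#    'remove_dns_records', 'deprovision_databases']
```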

"""
self.disable_monitoring()
Member

@smarnach Based on just this one commit it looks like this is getting removed completely? I think shut_down should call disable_monitoring (before terminating any app servers).

Contributor Author

I added a set_appserver_inactive() method as the counterpart to set_appserver_active(). This method gets called implicitly when the active appserver is terminated, and handles disabling monitoring.

Member

@smarnach Sounds good!

@itsjeyd (Member) commented Nov 3, 2016

@smarnach I had a look at the commit. Grouping all of the deprovisioning logic in a single method (shut_down) and just calling that from delete is a good idea. See the remaining comments for a couple things I noticed/had questions about.

I hope this helps, and sorry for missing earlier pings! Somewhat ironically, this happened because of the fact that I get notifications about all of the pull requests against this repo -- I like to stay in the loop about what's happening, but if I'm not a reviewer on a PR, I usually only skim the notifications, and that's how I missed that you pinged me directly on this PR.

@smarnach (Contributor Author) commented Nov 3, 2016

@itsjeyd Thanks for taking a look! I suggest you add a mail filter that flags messages with mention@noreply.github.com in the cc header for your attention, since this indicates that you have been pinged. See https://help.github.com/articles/about-notification-emails/#the-cc-address for details.

@bdero (Contributor) commented Nov 4, 2016

@smarnach I tested this using your instructions and everything seems to be working well for me.

On my initial pass through the code I didn't see anything alarming, but that may be because I'm still new to a large portion of the codebase. It also looks like most of this has already been exhaustively reviewed.

I will take a deeper look to make sure there are no gaps in my understanding, but functionality-wise, this has a 👍 from me. I've already reviewed all of the other haproxy PRs and confirmed that the certificate management script/letsencrypt/haproxy-config and all the other pieces are working on that side of the fence.

@antoviaque (Member) left a comment

@smarnach Thanks for addressing my remaining comments! All good on my side now besides #145 (comment) - I'm 👍 once it is addressed.

@smarnach smarnach force-pushed the smarnach/load-balancer branch 4 times, most recently from 1ea6cc9 to 27bc36c Compare November 4, 2016 15:58
@bdero bdero mentioned this pull request Nov 4, 2016
1 task
@bdero (Contributor) left a comment

@smarnach I looked over your code and got to test it pretty extensively while working on #152, so I'm giving this a thumbs up if there are no other major changes from here.

👍

  • I tested this: I followed the testing instructions and provisioned new instances, witnessed the playbook run, and working TLS support for the domains magically happened.
  • I read through the code
  • I checked for accessibility issues N/A
  • Includes documentation

):
"""
Convenience wrapper for run_playbook() that captures the output of the playbook run.
"""
Contributor

Thanks for this - it's nice to have this easy generalized "run a playbook" method. :)

@smarnach smarnach merged commit 2bfa62c into master Nov 7, 2016
@bradenmacdonald bradenmacdonald deleted the smarnach/load-balancer branch November 9, 2016 17:38