
Would like to get this working with OpenShift Origin 3.7 #27

Closed
rberlind opened this issue Nov 29, 2017 · 28 comments

Comments

@rberlind

When I try to use release-3.7 or release-3.7.0day by specifying them in the git clone command inside install-from-bastion.sh, I end up getting an error at the end of the Ansible run:

TASK [template_service_broker : Reconcile with RBAC file] **********************
fatal: [master.openshift.local]: FAILED! => {"changed": true, "cmd": "oc process -f "/tmp/tsb-ansible-keZijh/rbac-template.yaml" | oc auth reconcile -f -", "delta": "0:00:00.285904", "end": "2017-11-29 12:45:42.125009", "failed": true, "rc": 1, "start": "2017-11-29 12:45:41.839105", "stderr": "Error: unknown shorthand flag: 'f' in -f\n\n\nUsage:\n oc auth [options]\n\nAvailable Commands:\n can-i Check whether an action is allowed\n\nUse "oc --help" for more information about a given command.\nUse "oc options" for a list of global command-line options (applies to all commands).", "stderr_lines": ["Error: unknown shorthand flag: 'f' in -f", "", "", "Usage:", " oc auth [options]", "", "Available Commands:", " can-i Check whether an action is allowed", "", "Use "oc --help" for more information about a given command.", "Use "oc options" for a list of global command-line options (applies to all commands)."], "stdout": "", "stdout_lines": []}
to retry, use: --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry

This seems related to openshift/openshift-ansible#6086.

I tried pinning the commit to 56b529e (which someone on that ticket said fixed the problem) by running git checkout 56b529e after the git clone command, but I got the same error.

Can anyone suggest a workaround to get this working with OpenShift Origin 3.7? The problem is not with Terraform itself, but with the openshift-ansible code.

@rberlind
Author

I should add that I don't know Ansible at all. I was unsure what the instruction at the end about retrying with --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry meant. Does it mean to retry the single oc process -f "/tmp/tsb-ansible-keZijh/rbac-template.yaml" | oc auth reconcile -f - command with that extra argument? Or does it mean to retry the entire make openshift command, or something else?

Also, the end of the Installer Status section has "This phase can be restarted by running: playbooks/byo/openshift-cluster/service-catalog.yml". Perhaps I should run ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/openshift-cluster/service-catalog.yml?
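If I've understood the Ansible docs correctly, the two hints would translate to something like this (unverified guesses on my part, with the paths exactly as the installer printed them):

```shell
# My reading (unverified): --limit re-runs the SAME top-level playbook,
# restricted to the hosts recorded in the .retry file -- it does not mean
# re-running the single failed oc command.
RETRY_CMD='ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/config.yml --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry'

# The installer-status hint would instead restart only the failed phase:
PHASE_CMD='ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/openshift-cluster/service-catalog.yml'

echo "$RETRY_CMD"
echo "$PHASE_CMD"
```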

@rberlind
Author

rberlind commented Dec 1, 2017

I tried adding openshift_repos_enable_testing=true to the inventory.template.cfg file as suggested in response to openshift/openshift-ansible#6086. That got me past the RBAC error, but I then saw:

FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master.openshift.local]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.010827", "end": "2017-12-01 13:09:25.018259", "failed": true, "msg": "non-zero return code", "rc": 7, "start": "2017-12-01 13:09:24.007432", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}

Several things occur to me:

  1. The DNS name apiserver.openshift-template-service-broker.svc is not known.
  2. The port perhaps needs to be 8443.
    Note that adding the DNS name to /etc/hosts on the master and then using port 8443 gave me back "ok".
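A reconstruction of that manual check (the IP below is illustrative; use the master's real private IP from the inventory):

```shell
# Map the TSB service DNS name to the master locally, then probe the
# health endpoint on 8443 instead of 443.
TSB_HOST="apiserver.openshift-template-service-broker.svc"
MASTER_IP="10.0.1.44"   # illustrative, not necessarily your master's IP
HEALTH_URL="https://${TSB_HOST}:8443/healthz"

# Append this mapping line to /etc/hosts on the master (needs sudo there):
echo "${MASTER_IP} ${TSB_HOST}"
# Then the probe that returned "ok" for me:
echo "curl -k ${HEALTH_URL}"
```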

Additionally, the advanced installation docs for OpenShift Origin indicate that one is supposed to set openshift_template_service_broker_namespaces if enabling the template service broker. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-template-service-broker.

But I tried adding openshift_template_service_broker_namespaces=['openshift'] and still got the same error.

Now, I'm going to try disabling the service catalog and template service broker with:

openshift_enable_service_catalog=false
template_service_broker_install=false

I'm also going to explicitly set the ports with:
openshift_master_api_port=8443
openshift_master_console_port=8443

@rberlind
Author

rberlind commented Dec 1, 2017

That did not work either, but I now think I had added the new variables in the wrong part of the file, under [nodes] instead of under [OSEv3:vars]. I will retry.

@rberlind
Author

rberlind commented Dec 1, 2017

Unfortunately, even when I put the variables in the right place, the TSB could not be verified as running. However, the good news is that I was able to install OpenShift Origin 3.7 by disabling the service catalog and TSB with:

openshift_enable_service_catalog=false
template_service_broker_install=false

One other note for you: I think you should technically include "etcd" under [OSEv3:children] at the top of the inventory template.
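Concretely, the top of my inventory ended up looking roughly like this (a sketch with other variables omitted; masters/nodes are the standard openshift-ansible group names):

```ini
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
openshift_enable_service_catalog=false
template_service_broker_install=false
```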

@dwmkerr
Owner

dwmkerr commented Dec 8, 2017

Hmm OK cool I'll take a look at this, thanks for sharing @rberlind!

@rberlind
Author

rberlind commented Dec 8, 2017

No problem. Thanks for putting together this repo. It was very helpful to me.

@ghost

ghost commented Jan 22, 2018

Hi,

did anyone solve this issue so far? The service catalog is a viable feature…

@mtbvang

mtbvang commented Feb 13, 2018

Hi,

Are you guys still having this problem? Release 3.7 worked for me.

@rberlind
Author

rberlind commented Feb 13, 2018 via email

@ghost

ghost commented Feb 13, 2018

I'm on holiday at the moment and have no access to my company's git repository. But I've worked around this issue by making a small change to the playbook. Please ask me again in mid-March if you need more information.

@dwmkerr
Owner

dwmkerr commented Feb 16, 2018

Looking at this as well at the moment, note to self:

https://docs.openshift.org/latest/install_config/configuring_aws.html#aws-cluster-labeling

May need to update the labelling logic introduced in #33

Also check this for notes on dynamic aws tag names (particularly the limitations for how we can manage this in terraform):

hashicorp/terraform#14516 (comment)

@dwmkerr
Owner

dwmkerr commented Feb 21, 2018

Hi @yves-vogl @rberlind @mtbvang,

Can you let me know the changes you had to make to get this to work? At the moment, when I try to install 3.7, I always get this issue:

RUNNING HANDLER [openshift_master : restart master controllers] ****************
        to retry, use: --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
ip-10-0-1-137.ec2.internal : ok=54   changed=8    unreachable=0    failed=0
ip-10-0-1-44.ec2.internal  : ok=294  changed=106  unreachable=0    failed=1
localhost                  : ok=11   changed=0    unreachable=0    failed=0


INSTALLER STATUS ***************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : In Progress
        This phase can be restarted by running: playbooks/byo/openshift-master/config.yml



Failure summary:


  1. Hosts:    ip-10-0-1-44.ec2.internal
     Play:     Configure masters
     Task:     restart master api
     Message:  Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.

My current work-in-progress branch for this is here (I've opened a PR to make it easy to see the changes):

#43

The key changes so far are:

  1. Set openshift_clusterid=${cluster_id} in the playbook (see here)
  2. Set the new tags required for OC 3.7 (see here)

That's basically it. I've found the following issues which seem to be potential causes:

I've attempted the following workarounds:

  1. Explicitly setting etcd_version=3.1.9
  2. Explicitly setting etcd_version=3.2.7
  3. Explicitly setting the SDN CIDR to one which will not overlap with the VPC CIDR (osm_cluster_network_cidr=11.0.0.0/16)
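For reference, here's roughly how those attempts translate into the inventory's [OSEv3:vars] section (a sketch; I set only one etcd_version at a time):

```ini
[OSEv3:vars]
openshift_clusterid=${cluster_id}
# tried separately: etcd_version=3.1.9
etcd_version=3.2.7
# SDN CIDR chosen so it cannot overlap the VPC CIDR:
osm_cluster_network_cidr=11.0.0.0/16
```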

So far no luck. Any pointers would be super helpful!

@rberlind
Author

Well, I cheated by disabling some things. Specifically, I was able to install OpenShift Origin 3.7 by disabling the service catalog and TSB with:

openshift_enable_service_catalog=false
template_service_broker_install=false

I had also set openshift_repos_enable_testing=true in the inventory.template.cfg file.

@sumitshatwara

I also faced the same issue around the service catalog. I disabled the parameters below, as mentioned by @rberlind, in the host inventory file:

openshift_enable_service_catalog=false
template_service_broker_install=false

And the deployment of OpenShift Origin 3.7 was successful:

INSTALLER STATUS *******************************************************************************************************************************************************
Initialization : Complete
Health Check : Complete
etcd Install : Complete
Master Install : Complete
Master Additional Install : Complete
Node Install : Complete
Hosted Install : Complete

My question: Is service catalog a mandatory feature for OpenShift environment?
Use Case: I want to test FlexVolume driver of K8s only.

@dwmkerr
Owner

dwmkerr commented Feb 22, 2018

@rberlind I've tried this just now but no joy! Any chance you can share your inventory so I can take a look?

@sumitshatwara You should be fine - the service catalog is an optional feature and you can test volumes without it, let me know how it goes!!

@mtbvang

mtbvang commented Feb 22, 2018

@dwmkerr

I just applied my playbook again for OpenShift 3.7, for the modified version that runs on CentOS instead of RHEL, and it worked. I did get it running on RHEL before making the changes to CentOS. I'm not hitting the issues that everyone else is, so I'm a bit confused. The code I'm working with is in the centos branch of my fork.

Lower down I list my development setup and the few changes that I made in commit mtbvang@8445f08. Git did a few weird things with the image files and I ended up committing them again, maybe because I'm working in a Vagrant VM. Here's the output from the run, and the address of the cluster: https://54.93.200.118.xip.io:8443

The only other thing I can see is that I made some SSH key changes, naming the key terraform-aws-openshift. I work in a Vagrant VM, and this key is copied into the guest VM where I run Terraform from. Below are more details about my Vagrant setup. These are the differences I see in my setup:

  • Vagrant VM setup
  • default AWS region set to default = "eu-central-1"
  • In main.tf, added private_key_path = "${var.private_key_path}"
PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master.openshift.local     : ok=644  changed=265  unreachable=0    failed=0   
node1.openshift.local      : ok=191  changed=65   unreachable=0    failed=0   
node2.openshift.local      : ok=191  changed=65   unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : Complete
Master Additional Install  : Complete
Node Install               : Complete
Hosted Install             : Complete
Service Catalog Install    : Complete

# Now the installer is done, run the postinstall steps on each host.
cat ./scripts/postinstall-master.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@master.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.185' to the list of known hosts.
Adding password for user admin
cat ./scripts/postinstall-node.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@node1.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.155' to the list of known hosts.
cat ./scripts/postinstall-node.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@node2.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.150' to the list of known hosts.

oc version on the master:

oc v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.openshift.local:8443
openshift v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62

In my fork, my current setup is in the centos branch. I've added a vagrant folder with a Vagrantfile that spins up a CentOS development VM, installing the AWS CLI version 1.14.36, Terraform 0.11.3, and the OpenShift client v3.7.0. You'll need to build the Vagrant box using Packer. There's a packer task in the vagrant/build.gradle file that can be run with the Gradle wrapper from your host:

./gradlew packer

Once Packer finishes:

vagrant ssh
cd /vagrant
terraform init
terraform get
terraform plan
terraform apply
make openshift

I hope this helps in figuring this out, or ping me if you need any more information.

@piyushkv1

I'm also seeing the issue with the 3.7 version:
TASK [template_service_broker : Verify that TSB is running] ******************************************************************************
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (2 retries left).
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [openshift.node.1]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.036646", "end": "2018-02-25 14:29:54.090730", "msg": "non-zero return code", "rc": 7, "start": "2018-02-25 14:29:53.054084", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}

@rberlind
Author

rberlind commented Feb 27, 2018

Hi @dwmkerr,
I've attached my inventory.template.cfg file and my install-from-bastion.sh script.

I did this back in December, but I believe the key changes I made were the following:

inventory.template.cfg:

#Enable use of testing repos so that 3.7 will be used
#Note that this was before 3.7 was released, so it might not be needed anymore.
openshift_repos_enable_testing=true

openshift_enable_service_catalog=false
template_service_broker_install=false

install-from-bastion.sh:

# I was cloning from my own fork of openshift-ansible, but I think you should be able to use
# https://github.com/openshift/openshift-ansible
git clone -b release-3.7 https://github.com/rberlind/openshift-ansible

inventory-and-script.zip

@junsionzhang

Hi @rberlind, have you tried the latest version now? For me it's the same version, same problem.

@rberlind
Author

I have not, @junsionzhang. I have started working with this again, but have been creating a quite different version in which I trigger ansible-playbook and all the other installation steps with Terraform remote-exec provisioners. I have not put any of this on GitHub yet.

@stanvarlamov
Contributor

stanvarlamov commented Apr 20, 2018

I suggest bypassing 3.7 for those who are not tied to a particular version and just want to build a working cluster on a supported version of OpenShift.

3.7 seems to have a number of packaging issues, and the 3.9 release created additional problems for the 3.7 install.

Things to change in the 3.7 branch here so that it can be used for the 3.9 install (amazingly, just a few):

  1. Fix 00-tags.tf: remove the obsolete "KubernetesCluster", "${var.cluster_id}", or change it to "KubernetesCluster", "${var.cluster_name}",

  2. Update inventory.template.cfg:

openshift_deployment_type=origin
openshift_release=v3.9

  3. In install-from-bastion.sh, change the clone version to 3.9 and replace the ansible call with these two:

ANSIBLE_HOST_KEY_CHECKING=False /usr/bin/ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/prerequisites.yml
ANSIBLE_HOST_KEY_CHECKING=False /usr/bin/ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/deploy_cluster.yml
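For step 1, the 00-tags.tf edit is just swapping the tag value; a sketch (attribute syntax as in the pre-0.12 Terraform this repo uses, variable names taken from the description above):

```hcl
# 00-tags.tf: the newer cloud provider expects the cluster name, not
# the old cluster_id, as the KubernetesCluster tag value.
tags {
  KubernetesCluster = "${var.cluster_name}"
}
```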

@rberlind
Author

Interesting that you mention the need to add "openshift_release=v3.9", @stanvarlamov. I just hit this the day before yesterday when I started getting errors about short_version 3.9 not being valid when using the 3.7 version of openshift-ansible. To keep using that version, I had to set openshift_release=v3.7. I also noticed that the documentation suggested using "openshift_deployment_type" instead of "deployment_type" and also made that change.

By the way, another change that should be made is that the provisioning of the aws_instances should use "vpc_security_group_ids" instead of "security_groups" so that subsequent applies will not trigger destroy/create against the EC2 instances. See https://www.terraform.io/docs/providers/aws/r/instance.html#security_groups.
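i.e. something along these lines in each aws_instance resource (the resource and security group names here are guesses, not the repo's exact identifiers):

```hcl
resource "aws_instance" "master" {
  # ... ami, instance_type, subnet_id, etc. unchanged ...

  # ID-based attribute: avoids the destroy/create cycle that the
  # name-based "security_groups" attribute triggers for VPC instances.
  vpc_security_group_ids = ["${aws_security_group.openshift-vpc.id}"]
}
```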

@stanvarlamov
Contributor

I think, in the spirit of this repo being an excellent source of procedures for getting a working OpenShift version up on AWS with a basic configuration, we should move the master branch to 3.9, as it was released recently. I find the 3.9 install process much improved compared to 3.6 and 3.7, and overall the 3.9 features and look and feel are more appealing. That is, basically, my suggestion.

@Spengreb

@stanvarlamov I think you're right on this one.

A few months back I forked this project to make a multi-master setup using CentOS instead of RHEL. I got all that working, and when I noticed you had added dynamic PV support, I merged your changes into mine. I should mention that before the merge I was using 3.7 just fine, with few issues. After the merge a lot went wrong for me, including problems you had already been through and some seemingly random failures, which is how I ended up in this thread. I switched to the 3.9 release instead and got a lot further.

I managed to get this project working with CentOS, 2 master nodes, 3 compute nodes (though not in an ideal setup due to budget constraints), and OpenShift 3.9 with metrics and dynamic PV support.

I'm still testing to make sure everything works OK, but I can open a PR if you want to take a look, though I may have changed too much, as it's very specific to my needs.

@stanvarlamov
Contributor

@Spengreb OpenShift 3.9 appears to be much faster than 3.6-3.7, and more stable; PV resize also seems to be an important feature that's now available. CentOS, metrics, and logging work as a one-click install, which is pretty amazing considering the amount of time it took to work through the 3.6 Ansible bugs and inconsistencies. Dynamic PV based on EBS is a sore point, though. It essentially assumes you are single-AZ, which defeats the purpose of HA in a multi-AZ setup, so I don't consider that a production-ready feature at this point. But overall, I think 3.9 is really a game changer. Highly recommended.

@dwmkerr
Owner

dwmkerr commented Apr 30, 2018

@stanvarlamov @Spengreb yep, agreed! Will start on the 3.9 setup shortly. 3.7 has been a total pain to get working, so I'm all in favour of skipping it for now. If for some reason someone really needs 3.7, we can always go back and try again once more of the Ansible issues are sorted, but for now 3.9 sounds like a sensible option!

@dwmkerr
Owner

dwmkerr commented May 1, 2018

I've updated master to install 3.9. I've also raised:

#48

To track the issue @rberlind mentioned about the security_groups setting.

If this works then let me know guys and I'll close the issue!

@rberlind
Author

rberlind commented May 1, 2018

You can close.

@rberlind rberlind closed this as completed May 1, 2018