sonic-cfggen isn't able to render template sporadically when the image is installed from ONIE for the first time #13791

Open
stephenxs opened this issue Feb 13, 2023 · 10 comments

@stephenxs
Collaborator

stephenxs commented Feb 13, 2023

Description

Steps to reproduce the issue:

  1. Install a new image on the switch from ONIE
  2. Reboot the switch

Describe the results you received:

The error message buffermgrd: ERROR (spawn error) was observed. It is caused by asic_table.json not being rendered from the template.
From the dump we see that asic_table.json is empty, which is not expected.

asic_table.json should be rendered from the template asic_table.j2 by sonic-cfggen when the swss docker container is created.
The template asic_table.j2 is built into the image, so it should be available as long as the image is good. The image starts correctly most of the time, which suggests the image is good and asic_table.j2 is available.
The only cause I can see is that sonic-cfggen was unable to render the template and generate asic_table.json.
The command that generates the json is sonic-cfggen -d -t /usr/share/sonic/templates/asic_table.j2 > /etc/sonic/asic_table.json. The template only requires DEVICE_METADATA|localhost[platform].
The DEVICE_METADATA table should be good, because the log shows that both platform and hwsku, which are both in DEVICE_METADATA, are available and correct.
In sonic-cfggen, only two if branches are executed:

  1. args.from_db - load all items from the CONFIG_DB
  2. args.print_data - dump the rendered dict to a json file.

Branch 2 is simple and branch 1 is complicated. I suspect the process somehow exited, so nothing was output and asic_table.json was left empty.
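
The suspicion above is consistent with how shell redirection behaves: the shell creates or truncates the target of `>` before the command runs, so a command that exits without writing anything leaves a zero-byte file. A minimal sketch, using `false` as a stand-in for a failing renderer (the temporary file is hypothetical, not a path from this issue):

```shell
#!/bin/bash
# Sketch: the redirect target is truncated before the command starts,
# so a command that fails without writing leaves an empty file behind.
out="$(mktemp)"
echo "stale content" > "$out"

# `false` is a stand-in for a renderer that exits before producing output.
status=0
false > "$out" || status=$?

if [ ! -s "$out" ]; then
    result="empty"
else
    result="non-empty"
fi
echo "exit status: $status, output file is $result"
```

The earlier content of the file is gone even though the command failed, which matches an empty asic_table.json with no accompanying message in the syslog.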

Describe the results you expected:

asic_table.json should be available.

Output of show version:

The issue has been observed on master and 202211 since last December. It occurred 4 times in the past 2 months.

Build date: Fri Feb 10 10:08:46 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04277
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 15:00:00 up 5 min,  1 user,  load average: 4.60, 3.40, 1.48
Date: Fri 10 Feb 2023 15:00:00

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202211_RC4.2-83cb318e2_Internal   10e845dd3d54   922MB
docker-syncd-mlnx                                  latest                            10e845dd3d54   922MB
docker-dhcp-relay                                  latest                            65f16c1b1cb9   509MB
docker-orchagent                                   202211_RC4.2-83cb318e2_Internal   a579f39c356c   529MB
docker-orchagent                                   latest                            a579f39c356c   529MB
docker-fpm-frr                                     202211_RC4.2-83cb318e2_Internal   6e107ab059bd   540MB
docker-fpm-frr                                     latest                            6e107ab059bd   540MB
docker-teamd                                       202211_RC4.2-83cb318e2_Internal   cd85e7fb0d5a   510MB
docker-teamd                                       latest                            cd85e7fb0d5a   510MB
docker-macsec                                      latest                            820e8d5fa373   512MB
docker-sonic-telemetry                             202211_RC4.2-83cb318e2_Internal   5deb764e2549   791MB
docker-sonic-telemetry                             latest                            5deb764e2549   791MB
docker-platform-monitor                            202211_RC4.2-83cb318e2_Internal   81727410e630   920MB
docker-platform-monitor                            latest                            81727410e630   920MB
docker-snmp                                        202211_RC4.2-83cb318e2_Internal   72418bff1490   539MB
docker-snmp                                        latest                            72418bff1490   539MB
docker-eventd                                      202211_RC4.2-83cb318e2_Internal   326438dfb6fe   493MB
docker-eventd                                      latest                            326438dfb6fe   493MB
docker-lldp                                        202211_RC4.2-83cb318e2_Internal   30e086e77134   535MB
docker-lldp                                        latest                            30e086e77134   535MB
docker-mux                                         202211_RC4.2-83cb318e2_Internal   8f34ff38e6a1   541MB
docker-mux                                         latest                            8f34ff38e6a1   541MB
docker-database                                    202211_RC4.2-83cb318e2_Internal   df28aac82383   493MB
docker-database                                    latest                            df28aac82383   493MB
docker-sonic-p4rt                                  202211_RC4.2-83cb318e2_Internal   c0808efe3ad0   575MB
docker-sonic-p4rt                                  latest                            c0808efe3ad0   575MB
docker-router-advertiser                           202211_RC4.2-83cb318e2_Internal   7a03d5ac5d1c   493MB
docker-router-advertiser                           latest                            7a03d5ac5d1c   493MB
docker-sonic-mgmt-framework                        202211_RC4.2-83cb318e2_Internal   6fcac32351a8   611MB
docker-sonic-mgmt-framework                        latest                            6fcac32351a8   611MB
docker-nat                                         202211_RC4.2-83cb318e2_Internal   6f5a251b1c76   480MB
docker-nat                                         latest                            6f5a251b1c76   480MB
docker-sflow                                       202211_RC4.2-83cb318e2_Internal   ff6e122bd91e   478MB
docker-sflow                                       latest                            ff6e122bd91e   478MB


@stephenxs stephenxs changed the title sonic-cfggen isn't able to render template sporadically when the image is installed for the first time from ONIE sonic-cfggen isn't able to render template sporadically when the image is installed from ONIE for the first time Feb 13, 2023
@qiluo-msft
Collaborator

Do you have direct proof that "sonic-cfggen isn't able to render template"?

@stephenxs
Collaborator Author

stephenxs commented Feb 24, 2023

Do you have direct proof that "sonic-cfggen isn't able to render template"?

No, I don't.
Unlike issue #13674, which has a backtrace of sonic-cfggen as strong evidence of a failure to access the redis server, I do not see such a failure in my dump.
So this is only a conclusion drawn from the facts in the dump and the code.
Another possibility is that the template asic_table.j2 is empty, but that can hardly happen because the template comes from the image. I also opened a PR that copies the template into the dump in order to rule out this possibility.

@liuh-80
Contributor

liuh-80 commented Mar 2, 2023

After a minor change in sonic-cfggen, I rendered the template with empty data, and here is the result:

admin@vlab-01:~$ sonic-cfggen -d -t /home/admin/asic_table.j2
[
]

So, after checking the code, sonic-cfggen will never render an empty asic_table.json.

@liuh-80
Contributor

liuh-80 commented Mar 2, 2023

After checking the syslog and the creation time of asic_table.json in the show tech dump, I found that sonic-cfggen rendered asic_table.json correctly:

  1. asic_table.json was rendered at Feb 10 14:55:05.772684

Feb 10 14:55:05.772684 r-panther-16 INFO swss.sh[5270]: Creating new swss container with HWSKU Mellanox-SN2700

According to the following code, this log is output immediately before the json file is rendered:

{%- else %}
echo "Creating new ${DOCKERNAME} container with HWSKU $HWSKU"
{%- endif %}

{%- if docker_container_name == "swss" %}
# Generate the asic_table.json and peripheral_table.json
if [ ! -f /etc/sonic/asic_table.json ] && [ -f /usr/share/sonic/templates/asic_table.j2 ]; then
    sonic-cfggen -d -t /usr/share/sonic/templates/asic_table.j2 > /etc/sonic/asic_table.json
fi

  2. After the json file was rendered, buffermgrd started successfully, which also proves the json file was rendered correctly:

Feb 10 14:55:26.943901 r-panther-16 NOTICE swss#buffermgrd: :- main: --- Starting buffermgrd ---
Feb 10 14:55:27.051946 r-panther-16 NOTICE swss#buffermgrd: :- readPgProfileLookupFile: Read lookup configuration file...
Feb 10 14:55:27.059340 r-panther-16 NOTICE swss#buffermgrd: :- readPgProfileLookupFile: PG profile for speed 10000 and cable 5m is: size:19456, xon:19456, xoff:22528, th:0, xon_offset:
Feb 10 14:55:27.532236 r-panther-16 INFO swss#supervisord 2023-02-10 12:55:27,531 INFO success: buffermgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

  3. However, buffermgrd was stopped for an unknown reason, and after the restart the json file seems to be missing:

Feb 10 14:59:01.768622 r-panther-16 INFO swss#supervisord 2023-02-10 12:59:01,767 INFO waiting for buffermgrd to stop
Feb 10 14:59:01.770757 r-panther-16 INFO swss#supervisord 2023-02-10 12:59:01,769 INFO stopped: buffermgrd (terminated by SIGTERM)
Feb 10 14:59:02.659597 r-panther-16 NOTICE syncd#SDK: :- threadFunction: time span 0 ms for 'SET:QUEUE_STAT_COUNTER:oid:0x15000000000627'
Feb 10 14:59:04.262279 r-panther-16 INFO swss#supervisord 2023-02-10 12:59:04,261 INFO spawned: 'buffermgrd' with pid 488
Feb 10 14:59:04.364069 r-panther-16 NOTICE swss#buffermgrd: :- main: --- Starting buffermgrd ---
Feb 10 14:59:04.367923 r-panther-16 ERR swss#buffermgrd: :- loadJsonFromFile: Unable to parse json from the input stream: parse error - unexpected end of input
Feb 10 14:59:04.372851 r-panther-16 INFO swss#supervisord: buffermgrd Usage: buffermgrd <-l pg_lookup.ini|-a asic_table.json [-p peripheral_table.json] [-z zero_profiles.json]>

Also, the system rebooted at 14:54:

Feb 10 14:54:35.114710 r-panther-16 NOTICE kernel: [ 0.000000] Linux version 5.10.0-18-2-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.140-1 (2022-09-02)
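
One observation about the quoted guard, offered as an assumption rather than something established in this thread: `[ ! -f ... ]` tests only that the file exists, so a zero-byte asic_table.json left by an earlier failed render would count as already generated if the script ran again. The sketch below contrasts `-f` with `-s` (true only for a non-empty file) on a hypothetical scratch path, not the real /etc/sonic location:

```shell
#!/bin/bash
# Hypothetical scratch directory; the real script works on /etc/sonic.
workdir="$(mktemp -d)"
json="$workdir/asic_table.json"
: > "$json"    # simulate a zero-byte file left by an earlier failed render

# Existence check, as in the quoted snippet: an empty file counts as present.
if [ ! -f "$json" ]; then f_guard="regenerate"; else f_guard="skip"; fi

# Size check: -s is true only for files that exist and are non-empty.
if [ ! -s "$json" ]; then s_guard="regenerate"; else s_guard="skip"; fi

echo "-f guard: $f_guard, -s guard: $s_guard"
```

Whether switching the guard to `-s` is appropriate for swss.sh is a design question for the maintainers; the sketch only shows how the two tests differ on an empty file.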

@liuh-80
Contributor

liuh-80 commented Mar 2, 2023

@stephenxs, based on the analysis of the tech dump files, this does not seem to be a sonic-cfggen issue.

@stephenxs
Collaborator Author

Hi @liuh-80
Thank you for checking it.
I understand that an empty json [] will be generated if sonic-cfggen gets an empty DEVICE_METADATA|localhost.platform from config_db. I'm just curious whether sonic-cfggen failed to access the database and somehow exited without generating an error message. In that case, asic_table.json can be empty.
I also suspect that another possible cause is an empty template. I opened another PR, sonic-net/sonic-utilities#2686, to collect information in order to prove this or rule it out.
Meanwhile, the traditional buffer manager does not need asic_table.json. During deployment, the system starts the traditional buffer manager by default, then reloads QoS and starts the dynamic buffer manager. The dynamic buffer manager failed because it needs asic_table.json.
From the dump, we can see that asic_table.json is empty.
Maybe we can have a meeting to discuss it?

@liuh-80
Contributor

liuh-80 commented Mar 3, 2023

Hi @stephenxs
Thanks for the update. I understand now that the buffermgrd started at 14:55:26 is the traditional buffer manager. I will do more checks and ping you for a meeting to discuss this.

@liuh-80
Contributor

liuh-80 commented Mar 3, 2023

Here is an update. We had an offline discussion: when sonic-cfggen raises an exception, an empty json file is generated. I created the following script to catch and show the error:

admin@vlab-01:~$ cat test.sh
#!/bin/bash

echo "start"
sonic-cfggen -d -t /home/admin/asic_table.j2 > /home/admin/asic_table.json 2> errorlog.txt
if [ $? -gt 0 ]; then
    # Print whatever sonic-cfggen wrote to stderr
    echo "cfggen error happened:"
    cat errorlog.txt
fi
echo "end"
admin@vlab-01:~$

And here is the reproduction:
admin@vlab-01:~$ ./test.sh
start
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 454, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 410, in main
    raise Exception('test')
Exception: test
end
admin@vlab-01:~$ ls
asic_table.j2 asic_table.json test.sh
admin@vlab-01:~$ cat asic_table.json
admin@vlab-01:~$

So Stephen will help reproduce the issue and get the error message, then we can continue.

@stephenxs
Collaborator Author

Thanks Hua.
Now we understand that there won't be any error message in the syslog even if sonic-cfggen fails to connect to the redis db server, so the issue is more likely to be the same as #13674.
As the issue is very difficult to reproduce, I would like to update my PR with both a workaround and the captured debugging info. The logic will be:

sonic-cfggen -d -t /home/admin/asic_table.j2 > /home/admin/asic_table.json 2> errorlog.txt
if [ $? -gt 0 ]; then
    echo "cfggen error happen:"
    echo "$(cat errorlog.txt)"
    sonic-cfggen -a '{"DEVICE_METADATA":{"localhost":{"platform":"'$PLATFORM'"}}}' -t /usr/share/sonic/templates/asic_table.j2 > /etc/sonic/asic_table.json
fi
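
A variation on the logic above, sketched under assumptions: render into a temporary file and move it into place only when the result is non-empty, so a failed render never leaves an empty target behind. `render_from_db` and `render_from_args` are hypothetical stand-ins for the two sonic-cfggen invocations, and all paths are scratch paths rather than the real ones:

```shell
#!/bin/bash
# Hypothetical stand-ins for the two sonic-cfggen calls quoted above.
render_from_db()   { echo "simulated redis failure" >&2; return 1; }
render_from_args() { printf '[\n]\n'; }

workdir="$(mktemp -d)"
target="$workdir/asic_table.json"
tmp="$workdir/asic_table.json.tmp"
errlog="$workdir/errorlog.txt"

if render_from_db > "$tmp" 2> "$errlog"; then
    source_used="db"
else
    # Surface the captured error, then try the fallback render.
    echo "cfggen error:" >&2
    cat "$errlog" >&2
    if render_from_args > "$tmp"; then
        source_used="fallback"
    else
        source_used="none"
    fi
fi

# Publish only a non-empty result; otherwise leave the target untouched.
if [ "$source_used" != "none" ] && [ -s "$tmp" ]; then
    mv "$tmp" "$target"
else
    rm -f "$tmp"
fi
echo "rendered via: $source_used"
```

Testing the command directly in `if` avoids relying on `$?` from a previous line, and the final `-s` check keeps a zero-byte file from ever replacing the target.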

@stephenxs
Collaborator Author

Hi @liuh-80
FYI.
PR #13888 updated with error message captured. Could you please review it?
Thanks.

@judyjoseph judyjoseph added Triaged this issue has been triaged MSFT labels Mar 15, 2023
liat-grozovik pushed a commit that referenced this issue Apr 24, 2023
…13888)

- Why I did it
We suspect issue #13791 is caused by the redis server being temporarily unavailable during system initialization, so for now we do not use -d in sonic-cfggen, to avoid accessing the redis server.

- How I did it
Provide a string containing required json data when calling sonic-cfggen

- How to verify it
Manually test it

Signed-off-by: Stephen Sun <stephens@nvidia.com>
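
As an illustration of "provide a string containing required json data", the sketch below builds the minimal DEVICE_METADATA string with `printf` and passes it through the `-a` and `-t` options already used in the workaround discussed in this thread. The PLATFORM value is copied from the `show version` output in this issue for illustration only, and the exact form used in the merged PR may differ:

```shell
#!/bin/bash
# PLATFORM is normally provided by the startup environment; this value is
# taken from the `show version` output in this issue, for illustration only.
PLATFORM="x86_64-mlnx_msn2700-r0"

# Build the minimal data the template needs, without touching redis.
json_data="$(printf '{"DEVICE_METADATA":{"localhost":{"platform":"%s"}}}' "$PLATFORM")"
echo "$json_data"

# Render only where sonic-cfggen is actually installed.
if command -v sonic-cfggen > /dev/null 2>&1; then
    sonic-cfggen -a "$json_data" -t /usr/share/sonic/templates/asic_table.j2
fi
```

Building the string in a variable first keeps the quoting in one place and makes the data easy to log when a render fails.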
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Apr 26, 2023
…onic-net#13888)
