Metron branch is broken #243

Open
gkatsikas opened this issue Mar 24, 2020 · 15 comments
Labels
metron wait-for-op Additional information from the OP are needed

Comments

@gkatsikas
Collaborator

Even the simplest FastClick app is broken in the Metron branch.
Issues occur with conf/metron/metron-dispatcher-flow.click when launching secondary processes.

sudo gdb --args bin/click --dpdk -w 0000:03:00.0 -- conf/dpdk/dpdk-bounce.click

GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/click...done.
(gdb) r

Starting program: /home/katsikas/nfv/projects/fastclick/bin/click --dpdk -w 0000:03:00.0 -- conf/dpdk/dpdk-bounce.click
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
EAL: Detected 16 lcore(s)
EAL: Detected 2 NUMA nodes
[New Thread 0x7ffff0199700 (LWP 10006)]
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
[New Thread 0x7fffef998700 (LWP 10007)]
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
[New Thread 0x7fffef197700 (LWP 10008)]
[New Thread 0x7fffee996700 (LWP 10009)]
[New Thread 0x7fffee195700 (LWP 10010)]
[New Thread 0x7fffed994700 (LWP 10011)]
[New Thread 0x7fffed193700 (LWP 10012)]
[New Thread 0x7fffec992700 (LWP 10013)]
[New Thread 0x7fffec191700 (LWP 10014)]
[New Thread 0x7fffeb990700 (LWP 10015)]
[New Thread 0x7fffeb18f700 (LWP 10016)]
[New Thread 0x7fffea98e700 (LWP 10017)]
[New Thread 0x7fffea18d700 (LWP 10018)]
[New Thread 0x7fffe998c700 (LWP 10019)]
[New Thread 0x7fffe918b700 (LWP 10020)]
[New Thread 0x7fffe898a700 (LWP 10021)]
[New Thread 0x7fffe8189700 (LWP 10022)]
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1017 net_mlx5
Initializing flow parser...
Initializing DPDK
Ingress traffic on port 0 is not restricted anymore to the defined flow rules
deleted virtual method called
terminate called without an active exception

Thread 1 "click" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff5ced801 in __GI_abort () at abort.c:79
#2 0x00007ffff66e0957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff66e6ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff66e6b21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff66e791f in __cxa_deleted_virtual () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00005555560aa48a in Router::initialize (this=<optimized out>, errh=0x555556de0f90) at ../lib/router.cc:1451
#7 0x00005555560444db in parse_configuration (text=..., text_is_expr=<optimized out>, hotswap=<optimized out>, errh=0x555556de0f90) at click.cc:404
#8 0x00005555556fb58e in main (argc=<optimized out>, argv=<optimized out>) at click.cc:739

@tbarbette
Owner

Which binutils? Which GCC? NSLab racks? I would bet on binutils, those developers are cowboys ^^

@gkatsikas
Collaborator Author

gkatsikas commented Mar 24, 2020

nslrack06-07-08 (Ubuntu 18.04.4, kernel 4.15.0-91-generic)
gcc: 7.5
binutils: 2.30
DPDK: 20.02 (also failed with older versions)

Is there a known issue with binutils?
I will try other racks with different versions.

@tbarbette
Owner

Yes, with 2.30 the code will crash with Xeon Skylake and higher, because they did an incorrect AVX512 optimization. With 2.34 we observed a problem similar to this one where some code was optimized out but actually still called... Reverting to 2.32 worked in that case :p

But then maybe it's different here. Will take a look.
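
To compare a machine against the versions mentioned in this thread, a small check like the following may help. It is only a sketch: the `REPORTED` table restates what was said above (it is not an authoritative compatibility list), and the function names are made up for illustration.

```python
import re

# Binutils versions mentioned in this thread and what was reported for them.
# This only restates the comments above; it is not a complete or verified list.
REPORTED = {
    (2, 30): "reported to crash on Xeon Skylake and newer (AVX512 optimization)",
    (2, 32): "reported to work after reverting from 2.34",
    (2, 34): "reported to optimize out code that is still called",
}

def parse_binutils_version(banner):
    """Extract (major, minor) from the first line of `ld --version` output."""
    match = re.search(r"(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no version number found in: %r" % banner)
    return int(match.group(1)), int(match.group(2))

def thread_note(banner):
    """Return the note from this thread for the given banner, or None."""
    return REPORTED.get(parse_binutils_version(banner))
```

Pass it the first line printed by `ld --version`; a `None` result only means the version was not discussed here, not that it is known to be safe.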

@gkatsikas
Collaborator Author

Racks 06-08 are Haswell-based; this problem probably holds for that architecture too.

@gkatsikas
Collaborator Author

Also, I noticed something that is not related to the bug but caught my attention.
The configuration flag --enable-cpu-load is not recognized by configure.in anymore. Did you change anything that slipped my attention, or is it a problematic merge that we should roll back?

@gkatsikas
Collaborator Author

Interestingly, on rack14 (Skylake) dpdk-bounce works fine, but Metron is still problematic when spawning secondary processes (try_slave() method):

EAL: PCI device 0000:17:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1017 net_mlx5
Device 0000:17:00.0 is not driven by the primary process
net_mlx5: can not attach rte ethdev
net_mlx5: probe of PCI device 0000:17:00.0 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:17:00.0 cannot be used
Continuing initialization...
Successful initialization!

@tbarbette
Owner

--enable-cpu-load suffered a bad merge for sure.

I'm finishing something and then will look at it.

@tbarbette
Owner

It works for me. Maybe you should do a clean rebuild of both DPDK and Click, on the same machine?

@gkatsikas
Collaborator Author

Did you also try Metron with a Mellanox NIC? Which machine did you use?

@tbarbette
Owner

I just tried launching it on rack 05 (with a Mellanox NIC, yes) and did not get the messages you had.

@gkatsikas
Collaborator Author

Problem found:

When passing the following configuration to the Metron element:
SLAVE_DPDK_ARGS "-w0000:03:00.0"
one should be careful to omit any space between -w and the PCI ID of the NIC (i.e., 0000:03:00.0)
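
Why the space matters can be illustrated with a getopt-style parser. The sketch below assumes that the whole `SLAVE_DPDK_ARGS` string reaches the secondary process as a single argv element; that is an assumption about Metron's behavior, not something confirmed in this thread. Python's `getopt` is used only as a stand-in for the option parsing done during DPDK initialization, and `parse_whitelist` is a hypothetical helper.

```python
import getopt

def parse_whitelist(argv):
    """Return the value a getopt-style parser sees for -w in argv."""
    opts, _ = getopt.getopt(argv, "w:")
    return dict(opts).get("-w")

# One argv element without a space: the PCI ID is parsed cleanly.
good = parse_whitelist(["-w0000:03:00.0"])

# One argv element containing a space: the space becomes part of the value,
# so the parsed string is no longer a plain PCI ID.
bad = parse_whitelist(["-w 0000:03:00.0"])

print(repr(good))
print(repr(bad))
```

Under that assumption, the second form hands the parser a device string with a leading space, which would explain why the device is not matched.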

@gkatsikas gkatsikas reopened this Mar 27, 2020
@gkatsikas
Collaborator Author

RSS- and VMDq-based service chain deployments crash in the run_service_chain() method (child part, just before or during DPDK initialization). See the output below (RSS-based deployment):

Writing configuration: elementclass MetronSlave {
input[0] -> MarkIPHeader(OFFSET 14) -> filter0 :: IPFilter(allow ((ip ttl >= 2 && ip ttl <= 255)), deny all); filter0 -> IPRewriter(pattern 10.0.0.4 1000-65535 - - 0 0) -> DecIPTTL() -> EtherRewrite(SRC 50:6B:4B:43:88:CA, DST 50:6B:4B:43:8A:DA) -> [0]output; filter0[1] -> Discard;
};

slave :: MetronSlave();

slaveFD0C0 :: FromDPDKDevice(0, QUEUE 0, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 1);
StaticThreadSched(slaveFD0C0 0);
slaveFD0C0 -> [0]slave;
slaveFD0C1 :: FromDPDKDevice(0, QUEUE 1, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C1 1);
slaveFD0C1 -> [0]slave;
slaveFD0C2 :: FromDPDKDevice(0, QUEUE 2, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C2 2);
slaveFD0C2 -> [0]slave;
slaveFD0C3 :: FromDPDKDevice(0, QUEUE 3, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C3 3);
slaveFD0C3 -> [0]slave;
slaveFD0C4 :: FromDPDKDevice(0, QUEUE 4, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C4 4);
slaveFD0C4 -> [0]slave;
slaveFD0C5 :: FromDPDKDevice(0, QUEUE 5, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C5 5);
slaveFD0C5 -> [0]slave;
slaveFD0C6 :: FromDPDKDevice(0, QUEUE 6, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C6 6);
slaveFD0C6 -> [0]slave;
slaveFD0C7 :: FromDPDKDevice(0, QUEUE 7, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C7 7);
slaveFD0C7 -> [0]slave;

slaveTD0 :: ExactCPUSwitch();
slaveTD0C0 :: ToDPDKDevice(0, QUEUE 0, VERBOSE 99, MAXQUEUES 1);slaveTD0[0] -> slaveTD0C0;
slaveTD0C1 :: ToDPDKDevice(0, QUEUE 1, VERBOSE 99, MAXQUEUES 1);slaveTD0[1] -> slaveTD0C1;
slaveTD0C2 :: ToDPDKDevice(0, QUEUE 2, VERBOSE 99, MAXQUEUES 1);slaveTD0[2] -> slaveTD0C2;
slaveTD0C3 :: ToDPDKDevice(0, QUEUE 3, VERBOSE 99, MAXQUEUES 1);slaveTD0[3] -> slaveTD0C3;
slaveTD0C4 :: ToDPDKDevice(0, QUEUE 4, VERBOSE 99, MAXQUEUES 1);slaveTD0[4] -> slaveTD0C4;
slaveTD0C5 :: ToDPDKDevice(0, QUEUE 5, VERBOSE 99, MAXQUEUES 1);slaveTD0[5] -> slaveTD0C5;
slaveTD0C6 :: ToDPDKDevice(0, QUEUE 6, VERBOSE 99, MAXQUEUES 1);slaveTD0[6] -> slaveTD0C6;
slaveTD0C7 :: ToDPDKDevice(0, QUEUE 7, VERBOSE 99, MAXQUEUES 1);slaveTD0[7] -> slaveTD0C7;
slave[0] -> slaveTD0;

Initializing flow parser...
:2: While configuring ‘slave/filter0 :: IPFilter’:
pattern 0: warning: relation ‘<= 255’ is always true (range 0-255)
FromDPDKDevice : remove StaticThreadSched to use FastClick's auto-thread assignment
slaveFD0C1: using queues from 1 to 1
slaveFD0C1: Queue 1 handled by th 1
click: ../include/click/vector.hh:291: T& Vector<T, ALIGNMENT>::operator[](Vector<T, ALIGNMENT>::size_type) [with T = QueueDevice::QueueInfo; long unsigned int ALIGNMENT = 64; Vector<T, ALIGNMENT>::size_type = int]: Assertion `(unsigned) i < (unsigned) vm_.n_' failed.
Could not read from control socket: Error 0
Could not launch service chain...
Cannot instantiate service chain with ID e82807f5-b89e-438a-b22d-583448a1542c

@tbarbette
Owner

Could you run it under gdb, compiled with "-O1 -g"? Since it is the slave, you can prefix its command line with
"gdb -ex run -ex "signal 2" -ex bt -batch -args " so that it starts without any input and prints the stack trace upon failure.
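
If the slave command line is generated by a script, the prefix can be attached there. A minimal sketch, assuming the slave is launched from an argv list; `gdb_wrap` and the example slave command are hypothetical, while the gdb options are the ones quoted in this comment.

```python
import shlex

# gdb options quoted in the comment above: start the program right away,
# pass signal 2 to it, print a backtrace, then exit (-batch).
GDB_PREFIX = ["gdb", "-ex", "run", "-ex", "signal 2", "-ex", "bt", "-batch", "-args"]

def gdb_wrap(slave_argv):
    """Prepend the gdb prefix to the argv list of the process to debug."""
    return GDB_PREFIX + list(slave_argv)

# Hypothetical slave command line, for illustration only.
cmd = gdb_wrap(["bin/click", "--dpdk", "-w0000:03:00.0", "--", "slave.click"])

# Printable form with correct shell quoting (requires Python 3.8+).
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with the `signal 2` argument, which must stay a single argv element.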

@tbarbette
Owner

Is this fixed?

@gkatsikas
Collaborator Author

I could not get the stack trace of the slave, so I gave up on it.
I need to revisit it at some point.

@tbarbette tbarbette added the wait-for-op Additional information from the OP are needed label Nov 17, 2020