omnios-r151046-rc3 PXE Boot fails on Simply NUC Topaz 2 i7 #246

Open
JohnConnett opened this issue Apr 29, 2023 · 15 comments

@JohnConnett

JohnConnett commented Apr 29, 2023

PXE boot fails on a Simply NUC Topaz 2 i7. The firmware displays "NBP file downloaded successfully." and then drops through to the next boot option without requesting any further files via tftp. Both pxeboot and pxegrub fail. Attached is a Wireshark capture for pxeboot (the capture for pxegrub is similar apart from the packet count, so it is not attached). The server (192.168.88.1) and topaz (192.168.88.2) are connected back-to-back with a single cable.

Possibly the same underlying problem as the issue "omnios-r151046-rc3.usb-dd fails on Simply NUC Topaz 2 i7".

I have successfully PXE booted Ubuntu Server 23.04 (Lunar Lobster) as far as the selection screen on the same configuration.

topaz.pcapng.gz

@danmcd
Member

danmcd commented Apr 29, 2023

The product page says, "2x 2.5Gb Ethernet port". This means it likely has the Intel I225 or I226 Ethernet chipset, which illumos does not yet support. I cannot necessarily explain the usb-boot installer failure, but while illumos may boot on this machine, it cannot use the built-in I225/I226 Ethernet just yet.

@JohnConnett
Author

Good point. It has 2 x Intel I225-LM. I have also observed some MAC address strangeness with these two devices.

I'm not familiar with how pxeboot works. If the switch from a UEFI network driver to another driver happens within pxeboot, that would explain why no further files were requested via tftp.

I should have mentioned that I have disabled secure boot (another potential source of problems).

@JohnConnett
Author

JohnConnett commented Apr 30, 2023

Tried another approach. I have a Plugable USBC-E2500 so I tried using that as suggested on the iPXE Forum. This also fails with both pxeboot and pxegrub. Here is the output for pxegrub:

Shell> FS0:
FS0:\> ncm.efi
iPXE initialising devices...ok

iPXE 1.21.1+ (gbd136) -- Open Source Network Boot Firmware -- https://ipxe.org
Features: DNS HTTP iSCSI TFTP VLAN AoE EFI Menu

net0: 8c:ae:4c:dd:3e:31 using cdc-ncm on 0000:00:0d.0-3-2.0 (Ethernet) [open]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Unknown (https://ipxe.org/1a086194)]
Waiting for link-up on net0... ok
Configuring (net0 8c:ae:4c:dd:3e:31)...... ok
net0: 192.168.199.254/255.255.255.0 gw 192.168.199.1
net0: fe80::8eae:4cff:fedd:3e31/64
Next server: 192.168.199.2
Filename: pxegrub
tftp://192.168.199.2/pxegrub... ok
pxegrub : 139032 bytes
Could not boot image: Exec format error (https://ipxe.org/2e008081)
No more network devices

FS0:\> 

It appears that iPXE is unhappy with the format of pxeboot and pxegrub. Perhaps the EFI boot expects EFI format files?
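One way to test the "wrong format" theory before involving the firmware is to look at the first bytes of each file. The sketch below is only an illustration: UEFI executables are PE/COFF images that begin with "MZ", while ELF files begin with 0x7F 'E' 'L' 'F'. The names `image_kind` and `classify_image` are made up for this example and do not come from iPXE or illumos, and a thorough PE check would also follow the DOS header to the "PE\0\0" signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for this sketch; not taken from iPXE or illumos. */
enum image_kind { IMAGE_UNKNOWN, IMAGE_PE_COFF, IMAGE_ELF };

/*
 * Classify a boot image by its leading magic bytes.
 * UEFI executables are PE/COFF files, which begin with "MZ";
 * ELF files begin with 0x7F 'E' 'L' 'F'.
 */
enum image_kind
classify_image(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);

	if (len >= 4 && memcmp(buf, "\177ELF", 4) == 0)
		return (IMAGE_ELF);
	if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z')
		return (IMAGE_PE_COFF);
	return (IMAGE_UNKNOWN);
}
```

Reading the first few bytes of pxeboot, pxegrub and a known-good .efi file into a buffer and passing each to `classify_image()` would show which of them a UEFI image loader could plausibly accept.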

Does OmniOS have drivers for USB CDC-NCM or CDC-ECM devices?

@JohnConnett
Author

Perhaps the EFI boot expects EFI format files?

Built a simple x86_64 UEFI application and replaced pxeboot with my-uefi-app.efi. When PXE booted using either the Intel I225-LM or the Plugable USBC-E2500, the expected "Hello world!" was displayed, followed by a 10 second pause. This suggests that a UEFI replacement for pxeboot might be required ...

@JohnConnett
Author

Thought I would see if it was possible to PXE boot the Plugable USBC-E2500 without using iPXE. Copied the appropriate UEFI UNDI Driver to \EFI\Boot\RtkUndiDxe.efi on the EFI System Partition then updated the EFI variables (from within Ubuntu) to load it using:

root@topaz:~# efibootmgr --driver --disk /dev/nvme0n1 --part 1
No DriverOrder is set
root@topaz:~# efibootmgr --driver --disk /dev/nvme0n1 --part 1 --create --label 'USB 10/100/1G/2.5G LAN' --loader '\EFI\Boot\RtkUndiDxe.efi'
DriverOrder: 0000
Driver0000* USB 10/100/1G/2.5G LAN
root@topaz:~# efibootmgr --driver --disk /dev/nvme0n1 --part 1 --verbose
DriverOrder: 0000
Driver0000* USB 10/100/1G/2.5G LAN	HD(1,GPT,bbeadd11-388c-48fd-ac8c-fca17f8cfce0,0x800,0x32000)/File(\EFI\Boot\RtkUndiDxe.efi)
root@topaz:~# 

This loads the driver during boot but, unfortunately, it doesn't support PXE. The EFI Variables can be removed with:

root@topaz:~# efibootmgr --driver --disk /dev/nvme0n1 --part 1 --bootnum 0000 --delete-bootnum
No DriverOrder is set
root@topaz:~# 

It took me longer than I expected to find out how to set and clear DriverOrder and Driver#### so I have recorded it here in the hope that it will help others avoid a similar search.

@JohnConnett
Author

As pxeboot is the wrong format for EFI, I looked at the .iso media contents. It contains boot/efiboot.img which is a FAT filesystem image that contains efi/boot/bootx64.efi. Tried using that as the filename and it went a little further. Here's the console output:

Consoles: EFI console
Command line arguments: loader64.efi
Image base: 0x51034000
EFI version: 2.80
EFI Firmware: American Megatrends (rev 5.25)

illumos/amd64 EFI loader, Revision 1.1
   Load Path:
   Load Device: PciRoot(0x0)/Pci(0x1C,0x4)/Pci(0x0,0x0)/MAC(A8a159D0E637,0x1)/IPv4(0.0.0.0)
   BootCurrent: 0008
   BootOrder: 0008[*] 0002 0000 0009 000a 000b 0007 0001
Can't find device by handle
Setting currdev to net0:
-

It seems that OmniOS Installation via PXE Boot may need updating for EFI systems.

As it appears that OmniOS has drivers for neither the Intel I225-LM nor the Plugable USBC-E2500, I suspect I won't be able to investigate further with the limited amount of hardware I have available. I did try on a Hyper-V Generation 2 VM, which did something broadly similar. It would be interesting to know if anyone can get further on supported hardware.

@JohnConnett
Author

Screen capture from the PXE boot on the Hyper-V Generation 2 VM at the point where it sticks. I know that the install from the .iso media doesn't work either. However, it does get as far as loading unix so there might be some insights into why the PXE install fails.
[Screenshot: Hyper-V]

@JohnConnett
Author

Monitored the tftp requests and discovered that there were files missing from /tftpboot as populated by kayak. Here are the results after adding the missing files:

May/11/2023 15:47:24 tftp,debug     requested file(binary): boot/loader64.efi access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): boot/loader64.efi access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/defaults/loader.conf access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/defaults/loader.conf access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/fonts.dir access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/fonts.dir access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/8x16.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/8x16.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/8x14.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/8x14.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/6x12.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/6x12.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/16x32.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/16x32.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/14x28.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/14x28.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/12x24.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/12x24.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/11x22.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/11x22.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/10x20.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/10x20.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/10x18.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/10x18.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/12x24.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/fonts/12x24.fnt access: allowed
May/11/2023 15:47:24 tftp,debug     requested file(binary): //boot/forth/boot.4th access: denied
May/11/2023 15:47:26 tftp,debug     requested file(binary): //boot/forth/boot.4th access: denied
May/11/2023 15:47:30 tftp,debug     requested file(binary): //boot/forth/boot.4th access: denied
May/11/2023 15:47:36 tftp,debug     requested file(binary): //boot/forth/boot.4th access: denied

It then looped requesting //boot/forth/boot.4th. I was serving dhcp and tftp from my MikroTik router running RouterOS 7.9. Discovered that their tftp implementation has a nasty feature where it seems to assume that filenames are of the form name.extension. For example, miniroot.gz and miniroot.gz.hash will both deliver the contents of miniroot.gz! I suspect this may have confused the installer ...

Changed to serving dhcp, tftp and http from an OmniOS VM. Much better! Both the Simply NUC Topaz 2 i7 and the Hyper-V Generation 2 VM get as far as starting the PXE Installer (see attached).

[Screenshot: PXE_Installer]

So it looks like adding the missing files to kayak will fix that part of the problem. It would also be good to change the comment in the first line of /usr/share/kayak/sample/000000000000.sample to point to the current documentation (Maybe Kayak Client Configuration?).

Neither PXE installation attempt runs to completion. I'll provide details in a later comment.

@JohnConnett
Author

There were some .png files missing, too. I copied /boot/*.png to /tftpboot/boot. Here's my latest list in the order requested:

boot/loader64.efi
boot/defaults/loader.conf
boot/fonts/fonts.dir
boot/fonts/8x16.fnt
boot/fonts/8x14.fnt
boot/fonts/6x12.fnt
boot/fonts/16x32.fnt
boot/fonts/14x28.fnt
boot/fonts/12x24.fnt
boot/fonts/11x22.fnt
boot/fonts/10x20.fnt
boot/fonts/10x18.fnt
boot/fonts/12x24.fnt
boot/forth/boot.4th (*)
boot/forth/boot.4th.gz (*)
boot/forth/boot.4th (*)
boot/loader.rc
boot/forth/loader.4th
boot/forth/support.4th
boot/forth/screen.4th
boot/forth/color.4th
boot/forth/delay.4th
boot/forth/check-password.4th
boot/forth/screen.4th
boot/forth/efi.4th
boot/forth/beadm.4th
boot/loader.rc.local (*)
boot/loader.rc.local.gz (*)
boot/loader.rc.local (*)
boot/solaris/bootenv.rc
boot/defaults/loader.conf
boot/loader.conf
boot/loader.conf.gz
boot/loader.conf
boot/loader.conf.local
boot/transient.conf (*)
boot/transient.conf.gz (*)
boot/transient.conf (*)
boot/forth/beastie.4th
boot/forth/menu.rc
boot/forth/version.4th
boot/forth/brand.4th
boot/forth/menu.4th
boot/forth/frames.4th
boot/forth/menu-commands.4th
boot/forth/menusets.4th
boot/forth/shortcuts.4th
boot/forth/logo-omnios.4th
boot/fonts/10x18.fnt
boot/fenix.png
boot/illumos-small.png
boot/forth/brand-omnios.4th
boot/ooce.png
boot/menu.lst (*)
boot/menu.lst.gz (*)
boot/menu.lst (*)
boot/menu.lst (*)
boot/menu.lst.gz (*)
boot/menu.lst (*)
boot/menu.rc.local (*)
boot/menu.rc.local.gz (*)
boot/menu.rc.local (*)
boot/platform/i86pc/kernel/amd64/unix

Those tagged with (*) are still missing, which may be expected?

@JohnConnett
Author

Next hurdle. Failing to load /boot/platform/i86pc/kernel/amd64/unix. Here's the console output for the Simply NUC Topaz 2 i7:

Loading /boot/platform/i86pc/kernel/amd64/unix...
failed to allocate 1319024760 bytes for staging area: 9
cant load file '/boot/platform/i86pc/kernel/amd64/unix': cannot allocate memory

And for the Hyper-V Generation 2 VM:

Loading /boot/platform/i86pc/kernel/amd64/unix...
failed to allocate 4092125864 bytes for staging area: 9
cant load file '/boot/platform/i86pc/kernel/amd64/unix': cannot allocate memory

My wild guess is that it is using an uninitialised variable. Looking at efi_loadaddr() in illumos-omnios/usr/src/boot/efi/loader/copy.c, there is this snippet of code:

        if (type == LOAD_ELF)
                return (0);     /* not supported */

Might be barking up the wrong tree, but the kernel appears to be an ELF 64-bit LSB executable.

@JohnConnett
Author

Rebuilt /boot/loader64.efi with these changes:

diff --git a/usr/src/boot/common/load_elf_obj.c b/usr/src/boot/common/load_elf_obj.c
index f32388e170..ca7a3eabf6 100644
--- a/usr/src/boot/common/load_elf_obj.c
+++ b/usr/src/boot/common/load_elf_obj.c
@@ -137,8 +137,10 @@ __elfN(obj_loadfile)(char *filename, u_int64_t dest,
 		goto oerr;
 	}
 
-	if (archsw.arch_loadaddr != NULL)
+	if (archsw.arch_loadaddr != NULL) {
+		printf("Reached: %s %i\n", __FILE__, __LINE__);
 		dest = archsw.arch_loadaddr(LOAD_ELF, hdr, dest);
+	}
 	else
 		dest = roundup(dest, PAGE_SIZE);
 
diff --git a/usr/src/boot/efi/loader/copy.c b/usr/src/boot/efi/loader/copy.c
index 491c6787c6..18f8795e9b 100644
--- a/usr/src/boot/efi/loader/copy.c
+++ b/usr/src/boot/efi/loader/copy.c
@@ -172,8 +172,10 @@ efi_loadaddr(uint_t type, void *data, vm_offset_t addr)
 	if (addr == 0)
 		return (addr);	/* nothing to do */
 
-	if (type == LOAD_ELF)
+	if (type == LOAD_ELF) {
+		printf("Reached: %s %i\n", __FILE__, __LINE__);
 		return (0);	/* not supported */
+	}
 
 	if (type == LOAD_MEM)
 		size = *(size_t *)data;

Neither "Reached" message was displayed. Looks like I was barking up the wrong tree.

@JohnConnett
Author

Think I may have found a problem with the Multiboot2 header in omnios-r151046.unix. Here's the header extracted from the file:

file offset: 0x190 (400)
magic: 0xE85250D6    MULTIBOOT_HEADER_MAGIC
architecture: 0x00000000
header_length: 0x00000090
checksum: 0x17ADAE9A - Good!
tags:
    type: 0x0001    MULTIBOOT_HEADER_TAG_INFORMATION_REQUEST
    flags: 0x0000
    size: 0x00000020
    mbi_tag_types: [0x00000001; 0x00000003; 0x00000005; 0x00000006; 0x00000008; 0x00000004]
    type: 0x0002    MULTIBOOT_HEADER_TAG_ADDRESS
    flags: 0x0000
    size: 0x00000018
    mbi_tag_types: [0x00C00038; 0x00BFFEA8; 0x00000000; 0x00000000]
    type: 0x0003    MULTIBOOT_HEADER_TAG_ENTRY_ADDRESS
    flags: 0x0000
    size: 0x0000000C
    mbi_tag_types: [0x00C00000]
    padding: 0x0026748D
    type: 0x0004    MULTIBOOT_HEADER_TAG_CONSOLE_FLAGS
    flags: 0x0000
    size: 0x0000000C
    mbi_tag_types: [0x00000002]
    padding: 0x00000000
    type: 0x0005    MULTIBOOT_HEADER_TAG_FRAMEBUFFER
    flags: 0x0000
    size: 0x00000014
    mbi_tag_types: [0x00000000; 0x00000000; 0x00000000]
    padding: 0x00000000
    type: 0x0006    MULTIBOOT_HEADER_TAG_MODULE_ALIGN
    flags: 0x0000
    size: 0x00000008
    type: 0x0000    MULTIBOOT_HEADER_TAG_END
    flags: 0x0000
    size: 0x00000008

and here's the output of objdump -h omnios-r151046.unix:

unix:     file format elf64-x86-64-sol2

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .data         00019a4c  0000000000c00000  0000000000c00000  00000158  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  1 .text         000de0e1  fffffffffb800000  0000000000400000  0001a000  2**12
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .dynamic      000001b0  fffffffffb8de0e8  00000000004de0e8  000f80e8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .hash         0000b4c0  fffffffffb8de298  00000000004de298  000f8298  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .dynsym       00021e28  fffffffffb8e9758  00000000004e9758  00103758  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 .dynstr       000147e7  fffffffffb90b580  000000000050b580  00125580  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 .SUNW_reloc   00019e30  fffffffffb91fd68  000000000051fd68  00139d68  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 .rodata       0002f177  fffffffffb939bc0  0000000000539bc0  00153bc0  2**6
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 set_tsc_calibration_set 00000020  fffffffffb968d38  0000000000568d38  00182d38  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 .data         00023e98  fffffffffbc00000  0000000000800000  00183000  2**12
                  CONTENTS, ALLOC, LOAD, DATA
 10 .bss          000827a0  fffffffffbc24000  0000000000824000  001a7000  2**12
                  ALLOC
 11 .note         00000020  0000000000000000  0000000000000000  001a6e98  2**2
                  CONTENTS, READONLY
 12 .comment      00000032  0000000000000000  0000000000000000  001a6eb8  2**0
                  CONTENTS, READONLY

Looking at the MULTIBOOT_HEADER_TAG_ADDRESS entry the fields are:

        +-------------------+
u16     | type = 2          |     0x0002
u16     | flags             |     0x0000
u32     | size              | 0x00000018
u32     | header_addr       | 0x00C00038
u32     | load_addr         | 0x00BFFEA8
u32     | load_end_addr     | 0x00000000
u32     | bss_end_addr      | 0x00000000
        +-------------------+

The file offset of .data is 0x158 and the file offset of the Multiboot2 header is 0x190, a difference of 0x38, so header_addr looks reasonable. The specification describes load_addr as follows: "Contains the physical address of the beginning of the text segment. The offset in the OS image file at which to start loading is defined by the offset at which the header was found, minus (header_addr - load_addr)." However, header_addr - load_addr is not the expected 0x38 but 0x190!

This may not be the source of the problem but it looks like an error.
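For anyone who wants to re-check the numbers, here is a small sketch of the two calculations involved. It assumes only what the Multiboot2 specification states: the header checksum is valid when magic, architecture, header_length and checksum sum to zero in unsigned 32-bit arithmetic, and the file offset at which loading starts is the header's file offset minus (header_addr - load_addr). The function names are invented for this example and are not from the illumos loader.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Multiboot2 header checksum rule: magic + architecture + header_length +
 * checksum must equal 0 in unsigned 32-bit arithmetic.
 */
int
mb2_checksum_ok(uint32_t magic, uint32_t arch, uint32_t header_length,
    uint32_t checksum)
{
	return ((uint32_t)(magic + arch + header_length + checksum) == 0);
}

/*
 * File offset at which loading starts, per the address tag description:
 * header file offset minus (header_addr - load_addr).
 */
uint32_t
mb2_load_file_offset(uint32_t header_file_off, uint32_t header_addr,
    uint32_t load_addr)
{
	uint32_t delta;

	assert(header_addr >= load_addr);
	delta = header_addr - load_addr;
	assert(header_file_off >= delta);
	return (header_file_off - delta);
}
```

With the values from the dump, `mb2_checksum_ok(0xE85250D6, 0, 0x90, 0x17ADAE9A)` holds, and `mb2_load_file_offset(0x190, 0x00C00038, 0x00BFFEA8)` evaluates to 0, meaning the formula would start loading at the very beginning of the file. Whether that is what the kernel build intends is not something this arithmetic can settle.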

@JohnConnett
Author

The immediate cause of the EFI_OUT_OF_RESOURCES (9) error is this code in the function efi_loadaddr() in the file boot/efi/loader/copy.c:

$ pr -n -t boot/efi/loader/copy.c
[...]
  178           if (type == LOAD_MEM)
  179                   size = *(size_t *)data;
  180           else {
  181                   stat(data, &st);
  182                   size = st.st_size;
  183           }
[...]

type is LOAD_KERN; the (unchecked!) call to stat() fails; size is set to an uninitialized value; if that value happens to be big enough then EFI_OUT_OF_RESOURCES is the result.
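To illustrate the failure mode, here is a minimal sketch of the checked pattern. It is written against hosted POSIX stat() rather than the loader's libsa, and `file_size_checked` is a hypothetical helper, not a proposed patch: the point is only that the size is written when stat() succeeds and the caller gets an error code otherwise, so an indeterminate value can never reach the allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: store the file size in *size_out and return 0,
 * or return the errno from stat() and leave *size_out untouched.
 */
int
file_size_checked(const char *path, size_t *size_out)
{
	struct stat st;

	assert(path != NULL && size_out != NULL);

	if (stat(path, &st) != 0)
		return (errno);	/* caller must not use *size_out */

	*size_out = (size_t)st.st_size;
	return (0);
}
```

A caller would bail out (or fall back to another way of sizing the image) on a non-zero return instead of passing the size on to the staging-area allocation.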

More generally, I'm puzzled how this code is expected to load the kernel and why the size of the file containing the kernel is used. I need to take a closer look at the Multiboot2 specification.

@JohnConnett
Author

The errno from stat() is EBUSY, which appears to come from the function stat() in the file boot/libsa/stat.c. The call to open() fails because the file is already open, detected in the function tftp_open() in the file boot/libsa/tftp.c.

I don't think that the value of st.st_size would be available at this point as only enough of the kernel image file has been read to obtain the contents of the Multiboot2 header. Using tftp the whole file would have to be read to obtain its size.

Maybe a better approach would be to make the loader ELF aware? This is hinted at in the description of the address tag in section 3.1.5, "The address tag of Multiboot2 header", of the Multiboot2 Specification version 2.0: "Note: This information does not need to be provided if the kernel image is in ELF format, but it must be provided if the image is in a.out format or in some other format. When the address tag is present it must be used in order to load the image, regardless of whether an ELF header is also present. Compliant boot loaders must be able to load images that are either in ELF format or contain the address tag embedded in the Multiboot2 header."
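As a rough sketch of what "ELF aware" could mean for sizing the staging area: the memory needed can be derived from the loadable program headers near the start of the image, without knowing the size of the whole file. Everything below is illustrative. `struct sketch_phdr` is a trimmed-down, hypothetical stand-in for the real ELF64 program header, and `loadable_span` is not a function from the illumos loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_PT_LOAD	1	/* same value as PT_LOAD in the ELF specification */

/* Hypothetical, trimmed-down view of an ELF64 program header. */
struct sketch_phdr {
	uint32_t type;		/* segment type */
	uint64_t paddr;		/* physical load address */
	uint64_t memsz;		/* size in memory, including .bss */
};

/*
 * Return the number of bytes spanned by all loadable segments, from the
 * lowest physical address to the highest end address; 0 if there are none.
 */
uint64_t
loadable_span(const struct sketch_phdr *ph, size_t n)
{
	uint64_t lo = UINT64_MAX, hi = 0;
	size_t i;

	assert(ph != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (ph[i].type != SKETCH_PT_LOAD)
			continue;
		if (ph[i].paddr < lo)
			lo = ph[i].paddr;
		if (ph[i].paddr + ph[i].memsz > hi)
			hi = ph[i].paddr + ph[i].memsz;
	}
	return (hi > lo ? hi - lo : 0);
}
```

Because the program headers sit close to the beginning of an ELF file, a loader taking this route would only need the first part of the image to decide how much memory to reserve, which avoids reading the entire file over tftp just to learn its size.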

Additional tags may be required in the Multiboot2 header such as 3.1.8 EFI amd64 entry address tag of Multiboot2 header and 3.1.12 EFI boot services tag.

@JohnConnett
Author

As an experiment, I have tried to use grub2 from Fedora Linux 38. This is what I tried from the grub2 interactive prompt:

grub> multiboot2 EFI/omnios/unix -B install_media=http://192.168.199.110/kayak/omnios-r151046.zfs.xz,install_config=http://192.168.199.110/kayak
grub> module2 EFI/omnios/miniroot.gz
grub> boot

For the Hyper-V Generation 2 VM the following was displayed on the console:

krtld: failed to open '-B'
krtld: bind_primary(): no relocation information found for module -B
krtld: error during initial load/link phase

krtld could neither locate or resolve symbols for:
    -B
in the boot archive. Please verify that this file
matches what is found in the boot archive.
You may need to boot using the Solaris failsafe to fix this.
Unable to boot.
Press any key to reboot.

I'm not sure how kernel command line arguments are supposed to be passed in grub2 ...
