Linux instability #182

clbr · 2020-12-25T17:11:31Z

When running my port of Linux on (patched) cen64, it's unstable in ways that real hw is not. The random hangs don't happen on hw and are very hard to track down. There's a small chance the patches are at fault, but now that the port is ready, others can try too.

I suspect the TLB logic contains more bugs, given how many I already found there, but the cause could be elsewhere too.

https://github.com/clbr/n64bootloader/releases

I have the following patches applied to cen64 currently. I'll be submitting PRs as the old ones get reviewed.

Teach the profiler about L1D misses
Implement ll/lld/sc/scd
Implement trap instructions (by James Lambert)
Implement Reserved Instruction exception
Implement fpu prid
commented out the TLB valid check lines
changed cen64_one_hot_lut[] TLB fetch to __builtin_ffs
corrected TLB mod exception behavior
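
The `__builtin_ffs` item above can be illustrated as follows. This is a minimal sketch, not the actual patch: it assumes GCC or Clang (where `__builtin_ffs` is available), and the helper name `one_hot_to_index` is hypothetical. It shows why a lookup table is not needed to turn a one-hot match mask into an entry index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from cen64): turn a match mask into an index.
 * Returns -1 when no bit is set. If more than one bit is set, the lowest
 * set bit wins. */
static int one_hot_to_index(uint32_t mask)
{
    /* __builtin_ffs returns 1 + the index of the least significant set
     * bit, or 0 when its argument is zero. */
    return __builtin_ffs((int)mask) - 1;
}
```

For example, `one_hot_to_index(0x8)` yields 3, and an empty mask yields -1, which a caller could treat as a TLB miss.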

tj90241 · 2020-12-26T16:15:34Z

Yeah, finding the last hidden issues is quite the endeavor.

Sometime in 2015-2016, when I was actively working on this code, I had the VR4300 component isolated and booting a Linux kernel all the way to initrd loading, and that was successful in fuzzing out some CP0 issues. I broke the TLB valid check (the issue you found) after that particular fuzzing endeavor. I'm surprised the TLB mod exception issue never turned up, though; that's a new one.

Because the code models the pipeline and cache closely enough that you could almost write synthesizable logic around it, there is always the possibility that the bug is an instruction not getting squashed correctly, or something similar. This particular case works, I think, but as an example of how this gets tricky:

lw $at, 0($s0)  # assume this raises an exception
tlbwi  # this instruction must be squashed in the pipeline

... but TLBWI writes to CP0 while it's in the EX stage:
https://github.com/n64dev/cen64/blob/master/vr4300/cp0.c#L267

so we inject a fault when the lw exception is raised in the DC stage and propagate it along:
https://github.com/n64dev/cen64/blob/master/vr4300/fault.c#L50

which is used on the next cycle to prevent the EX stage from executing:
https://github.com/n64dev/cen64/blob/master/vr4300/pipeline.c#L437
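
The squash mechanism described above can be sketched with a toy model. Everything here is an assumption made for illustration: the struct, the function names, and the single `fault_pending` flag are hypothetical and much simpler than cen64's real pipeline state, and the stage ordering is collapsed into two calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; these are not cen64's data structures. */
struct toy_pipeline {
    bool fault_pending; /* set when an older instruction faults in DC */
    int tlb_writes;     /* counts TLBWI side effects that reached CP0 */
};

/* DC stage: if the load raises an exception, flag the younger stages. */
static void toy_dc_stage(struct toy_pipeline *p, bool load_faults)
{
    if (load_faults)
        p->fault_pending = true;
}

/* EX stage: TLBWI would write CP0 here, unless it has been squashed. */
static void toy_ex_stage_tlbwi(struct toy_pipeline *p)
{
    if (p->fault_pending)
        return; /* squashed: the CP0 write never happens */
    p->tlb_writes++;
}

/* Run "lw; tlbwi" through the two stages and report how many TLB
 * writes took effect. */
static int toy_run(bool load_faults)
{
    struct toy_pipeline p = { false, 0 };
    toy_dc_stage(&p, load_faults);
    toy_ex_stage_tlbwi(&p);
    return p.tlb_writes;
}
```

With a faulting load, `toy_run(true)` reports no TLB write; without the fault, `toy_run(false)` reports one. A bug in the equivalent of `fault_pending` propagation would let the write through, which is the kind of issue described above.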

tj90241 · 2020-12-26T16:23:35Z

Some gotchas with sign extension too, here's another one I found during my initial Linux fuzzing:
9d9655c
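
The contents of that commit are not shown here, so the following is not a reconstruction of it. It is only a generic sketch of this class of bug, assuming the usual 64-bit MIPS rule that 32-bit results are held sign-extended in 64-bit registers. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Correct: widen a 32-bit result by sign extension. */
static uint64_t sext32(uint32_t value)
{
    return (uint64_t)(int64_t)(int32_t)value;
}

/* Buggy variant: zero extension silently drops the upper ones. */
static uint64_t zext32(uint32_t value)
{
    return (uint64_t)value;
}
```

A kseg0 address such as 0x80000000 should appear as 0xFFFFFFFF80000000 in a 64-bit register. `zext32` produces 0x0000000080000000 instead, which compares unequal and can go unnoticed until code depends on the upper bits.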

bryanperris · 2021-01-10T16:31:39Z

I am wondering why cen64 always shifts the virtual address by 13, when the MIPS docs say to shift the offset off based on the PageMask register. When I do the math, shifting by 13 seems to give me the correct VPN2 value to search on, while shifting by the page size (16 bits) gives me a value that is far too small. I know that shifting by 13 bits works for EntryHi.

cen64/arch/x86_64/tlb/tlb.c

Line 32 in a109ac0

vpn2 =

tj90241 · 2021-01-10T17:07:46Z

@bryanperris That's just an optimization-related thing. x86 SSE does not allow per-lane variable shifts (the shift count is either a constant coded into the instruction word or a single count shared by every lane).
So, instead, we say "let's just shift off what we know will be an offset into the page (4k pages, 2 pages per TLB entry = 13 bits) and then AND off dynamically" to work around the fact that we cannot shift dynamically. This is what check_l = _mm_and_si128(vpn, page_mask_l); is accomplishing. So, ultimately, the comparison (check_l = _mm_cmpeq_epi32(check_l, vpn_l);) is still done with regard to the page mask.
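
The "constant shift, then AND" idea can be shown in scalar form. This is a sketch under stated assumptions, not cen64's SSE code: the struct, its field names, and both helpers are hypothetical, and one entry is checked at a time rather than several lanes at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-entry data; field names are illustrative only. */
struct toy_tlb_entry {
    uint32_t vpn2;     /* (vaddr >> 13) of the mapping, already masked */
    uint32_t vpn_mask; /* keeps only the bits that are VPN2 for this size */
};

/* Mask for pages of 2^page_bits bytes (12 for 4K, 14 for 16K, ...).
 * After the fixed shift by 13, (page_bits - 12) low bits are still part
 * of the page offset, so they are cleared. */
static uint32_t toy_vpn_mask(unsigned page_bits)
{
    return ~((UINT32_C(1) << (page_bits - 12)) - 1);
}

static bool toy_tlb_match(const struct toy_tlb_entry *e, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> 13;            /* constant shift */
    return (vpn & e->vpn_mask) == e->vpn2; /* dynamic part via AND */
}
```

For 4K pages the mask is all ones, so the shift by 13 alone yields VPN2. For a 16K entry whose pair covers 0x8000 to 0xFFFF, both 0x8000 and 0xFFFF match, while 0x7FFF and 0x10000 do not.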

bryanperris · 2021-01-10T17:50:14Z

@tj90241 Thanks, that makes sense now. In the case of 4K pages, why shift off the 13th bit when the mask for offset is 0xFFF? Is that to apply the divide by 2 for the VPN?

tj90241 · 2021-01-10T20:05:14Z

Correct, it's because in MIPS the smallest page size is 4k (12 bits), and each physical TLB entry provides a mapping for 2 pages ("VPN2"), which is where the 13th bit comes in. The SSE lookup is just trying to find the "VPN2" entry in hardware -- once the DC stage has a hit, it will use the full address (again) to determine if EntryLo/EntryHi matches, etc.

bryanperris · 2021-01-10T21:20:56Z

Looking at your pipeline code, it calls the tlb_probe function to find the index of the matching entry. Does cen64 only handle 4K pages?

tj90241 · 2021-01-10T22:53:49Z

Right - tlb_probe is only responsible for finding the hardware entry. Then the pipeline uses the attributes of that entry to select the right page/etc.:

      tlb_miss = tlb_probe(&vr4300->cp0.tlb, vaddr, asid, &index);
      page_mask = vr4300->cp0.page_mask[index];
      select = ((page_mask + 1) & vaddr) != 0;
...
      cached = ((vr4300->cp0.state[index][select] & 0x38) != 0x10);
      paddr = (vr4300->cp0.pfn[index][select]) | (vaddr & page_mask);
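
To make the even/odd page selection concrete, here is a small standalone sketch of the same two expressions. It assumes that page_mask holds the in-page offset mask (0xFFF for 4K pages), as the `page_mask + 1` expression suggests. The function names and the PFN values in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Select the even (0) or odd (1) page of the pair: the bit just above
 * the page offset decides. */
static unsigned toy_select(uint32_t vaddr, uint32_t page_mask)
{
    return ((page_mask + 1) & vaddr) != 0;
}

/* Combine the selected page frame with the in-page offset. */
static uint32_t toy_translate(uint32_t vaddr, uint32_t page_mask,
                              const uint32_t pfn[2])
{
    return pfn[toy_select(vaddr, page_mask)] | (vaddr & page_mask);
}
```

With made-up frames { 0x00100000, 0x00200000 } and 4K pages, 0x00004ABC translates to 0x00100ABC (bit 12 clear, even page) and 0x00005ABC to 0x00200ABC (bit 12 set, odd page). With a 16K mask of 0x3FFF the deciding bit moves to bit 14, so no 4K-specific logic is needed.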

awsms · 2024-07-30T16:26:32Z

Has anyone been able to compile it on Linux? Even the Debian build task fails on GitHub.

clbr · 2024-07-31T05:19:22Z

awsms, this report is about running Linux on cen64. If you're trying to compile cen64 on Linux, please open a new one.
