NAND/timing fixes #52

aejsmith · 2015-07-22T12:16:18Z

Here are a bunch of fixes/workarounds for various issues:

NAND corruption/ECC errors during boot.
Incorrect calibration of delay loop on second core results in broken kernel *delay() functions.
CPU usage is constantly reported at 100% on the second core (issue #35, "CPU usage statistics wrong on second CPU in 3.18").

The Ethernet requires the RD/WE pins to be configured, plus its chip select pin, CS6. These already get configured correctly at boot by U-Boot, so this doesn't actually fix any issue; add them for the sake of correctness.

Signed-off-by: Alex Smith <alex.smith@imgtec.com>

There is no need to read the OOB first: reading the page data followed by the OOB works fine. Reading the OOB first requires an extra read command cycle, which we can avoid.

Signed-off-by: Alex Smith <alex.smith@imgtec.com>

If nand_wait_ready() times out, the timeout is silently ignored and its caller proceeds to read from or write to the chip before it is ready. This can result in corruption with no indication as to why.

While a 20ms timeout seems like it should be plenty, certain behaviour can cause it to expire much earlier than expected. The situation which prompted this change was that CPU 0, which is responsible for updating jiffies, was holding interrupts disabled for a fairly long time while writing to the console during a printk, causing several jiffies updates to be delayed. If CPU 1 happens to enter the timeout loop in nand_wait_ready() just before CPU 0 re-enables interrupts and updates jiffies, CPU 1 will immediately time out when the delayed jiffies updates are made. The result is that nand_wait_ready() actually waits less time than the NAND chip would normally take to become ready, and read_page() then proceeds to read out bad data from the chip.

The situation described above may seem unlikely, but in fact it can be reproduced on almost every boot on the MIPS Creator Ci20. Debugging this was made more difficult by the misleading comment above nand_wait_ready() stating "The timeout is caught later": no timeout was ever reported, so I did not initially think that this would be the cause of the problem.

Therefore, this patch increases the timeout to 200ms, which should be enough to cover cases where jiffies updates get delayed. Additionally, add a pr_warn() when a timeout does occur so that it is easier to pinpoint any future problems caused by the chip not becoming ready.

Signed-off-by: Alex Smith <alex.smith@imgtec.com>
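The race described in this commit message can be illustrated with a small userspace simulation. This is only a sketch, not the MTD code: `struct sim`, `sim_chip_ready()`, `sim_wait_ready()` and the `SIM_HZ` value of 100 are hypothetical stand-ins, and the real function uses the kernel's jiffies helpers rather than the hand-rolled comparison shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; this is not the real MTD code. */
#define SIM_HZ 100                               /* assumed: 10 ms per jiffy */
#define MS_TO_TICKS(ms) ((unsigned long)(ms) * SIM_HZ / 1000)

struct sim {
    unsigned long jiffies;    /* simulated tick counter */
    unsigned long pending;    /* ticks held back while "CPU 0" has IRQs off */
    int polls_until_ready;    /* polls left before the chip reports ready */
};

/* One poll of the ready line. The delayed ticks all land at once here,
 * modelling CPU 0 re-enabling interrupts just after CPU 1 enters the loop. */
static bool sim_chip_ready(struct sim *s)
{
    s->jiffies += s->pending;
    s->pending = 0;
    if (s->polls_until_ready > 0) {
        s->polls_until_ready--;
        return false;
    }
    return true;
}

/* Poll until ready or until the deadline passes. Unlike the old behaviour,
 * a timeout is reported instead of being silently ignored. The simulation
 * only advances time via `pending`, so it relies on polls_until_ready
 * being finite to terminate. */
static bool sim_wait_ready(struct sim *s, unsigned int timeout_ms)
{
    unsigned long deadline = s->jiffies + MS_TO_TICKS(timeout_ms);

    assert(s->polls_until_ready >= 0);
    do {
        if (sim_chip_ready(s))
            return true;
    } while ((long)(s->jiffies - deadline) < 0);

    fprintf(stderr, "timeout while waiting for chip to become ready\n");
    return false;
}
```

With 5 delayed ticks (50 ms under the assumed tick rate) and a chip that needs a few more polls, a 20 ms deadline is already in the past after the first poll, while a 200 ms deadline leaves enough headroom for the chip to become ready.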

Commit 600e7a2 sets a bit in Config7 which stops the XBurst core from special-casing short loops to avoid branch target buffer lookups. The default behaviour vastly slows down tight loops and thus results in a low BogoMIPS/loops_per_jiffy value, which is used for the *delay() functions. Setting this bit results in a higher BogoMIPS/loops_per_jiffy value.

However, even though that bit also gets set on the second core, for reasons I cannot figure out, it does not appear to take effect until later on. The result is that when the delay calibration is run on the second core it calculates a low value for loops_per_jiffy (and at that point delays using the calibrated value will delay for the correct amount of time), but later on delays will last for far too short a time. This can be observed with udelay_test: using taskset to restrict udelay_test.sh to core 0 results in all tests passing, whereas on core 1 all tests fail.

Ingenic's kernel does not suffer from this problem, yet I cannot see why: they set the short loop BTB lookup flag in the same place we do, but their kernel correctly calibrates the delay loop.

So, for now, as a workaround, copy the loops_per_jiffy value from core 0 to core 1. With this, both cores are able to pass udelay_test.

Signed-off-by: Alex Smith <alex.smith@imgtec.com>
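To see why a stale loops_per_jiffy value makes delays too short, the arithmetic can be sketched in userspace C. Everything here is illustrative: the `SIM_HZ` value, the `struct sim_cpu` type, the helper names and the loop rates used in the example are assumptions, not values measured on the Ci20 or names taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_HZ 100u   /* assumed tick rate, for illustration only */

/* Hypothetical per-CPU record holding the calibrated loops_per_jiffy. */
struct sim_cpu {
    uint64_t lpj;
};

/* Number of loop iterations a delay loop spins to cover `usecs`,
 * given the calibrated loops_per_jiffy value. */
static uint64_t loops_for_usecs(uint64_t usecs, uint64_t lpj)
{
    return usecs * lpj * SIM_HZ / 1000000u;
}

/* Microseconds those iterations really take if the core actually
 * executes `true_lpj` iterations per jiffy. */
static uint64_t actual_usecs(uint64_t loops, uint64_t true_lpj)
{
    assert(true_lpj > 0);
    return loops * 1000000u / (true_lpj * SIM_HZ);
}

/* The workaround in sketch form: reuse the boot CPU's calibration. */
static void copy_lpj_from_boot_cpu(struct sim_cpu *cpus, unsigned int cpu)
{
    cpus[cpu].lpj = cpus[0].lpj;
}
```

For example, if core 1 was calibrated at 1,000,000 loops per jiffy but later runs at core 0's rate of 6,000,000, a requested 1000 us delay spins 100,000 iterations and finishes after roughly 166 us. After `copy_lpj_from_boot_cpu(cpus, 1)` the same request spins 600,000 iterations and lasts the full 1000 us.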

… IPIs

The majority of SMP platforms handle their IPIs through do_IRQ(), which calls irq_{enter,exit}(). When a call function IPI is received, smp_call_function_interrupt() is called, which also calls irq_{enter,exit}(), meaning irq_count is raised twice. When tick broadcasting is used (which is implemented via a call function IPI), this incorrectly causes all CPU idle time on the core receiving broadcast ticks to be accounted as time spent servicing IRQs, as account_process_tick() will account it as such if irq_count is greater than 1. This results in 100% CPU usage being reported on a core which receives its ticks via broadcast.

This patch removes the smp_call_function_interrupt() wrapper which calls irq_{enter,exit}(). Platforms which handle their IPIs through do_IRQ() now call generic_smp_call_function_interrupt() directly to avoid incrementing irq_count a second time. Platforms which don't (loongson, sgi-ip27, sibyte) call generic_smp_call_function_interrupt() wrapped in irq_{enter,exit}().

Signed-off-by: Alex Smith <alex.smith@imgtec.com>
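The double increment described above can be modelled with a nesting counter. This is a simplified sketch under stated assumptions: the `sim_*` functions and the `enum bucket` type are hypothetical, and the accounting rule is reduced to the single condition given in the commit message (a count greater than 1 is treated as IRQ time).

```c
#include <assert.h>

enum bucket { ACCT_IDLE, ACCT_IRQ };

static int irq_nesting;   /* stand-in for the hardirq part of irq_count */

static void sim_irq_enter(void) { irq_nesting++; }
static void sim_irq_exit(void)  { assert(irq_nesting > 0); irq_nesting--; }

/* Tick accounting on an otherwise idle CPU: one level of nesting is the
 * interrupt delivering the tick itself; anything deeper counts as IRQ time. */
static enum bucket sim_account_idle_tick(void)
{
    return irq_nesting > 1 ? ACCT_IRQ : ACCT_IDLE;
}

/* Models the generic call-function handler running a broadcast tick. */
static enum bucket sim_generic_call_function(void)
{
    return sim_account_idle_tick();
}

/* Models the removed wrapper, which added its own enter/exit pair. */
static enum bucket sim_wrapped_call_function(void)
{
    enum bucket b;

    sim_irq_enter();
    b = sim_generic_call_function();
    sim_irq_exit();
    return b;
}

/* Models a do_IRQ()-style entry point that already brackets the handler. */
static enum bucket sim_do_irq(enum bucket (*handler)(void))
{
    enum bucket b;

    sim_irq_enter();
    b = handler();
    sim_irq_exit();
    return b;
}
```

In this model, `sim_do_irq(sim_wrapped_call_function)` reaches a nesting depth of 2 and the idle tick lands in the IRQ bucket, whereas `sim_do_irq(sim_generic_call_function)` stays at depth 1 and the tick is counted as idle. Calling `sim_wrapped_call_function()` directly, as a platform without a do_IRQ()-style entry would, also stays at depth 1.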

NAND + timing + other smp fixes

Alex Smith added 3 commits July 22, 2015 13:12

mtd: nand: jz4780: Switch NAND_ECC_HW_OOB_FIRST to NAND_ECC_HW

b3a635a

There is no need to read the OOB first: reading the page data followed by the OOB works fine. Reading the OOB first requires an extra read command cycle, which we can avoid.

Signed-off-by: Alex Smith <alex.smith@imgtec.com>

aejsmith force-pushed the ci20-v3.18-fixes branch from 227390c to a75f4b1 Compare July 22, 2015 12:19

Alex Smith added 2 commits July 22, 2015 13:25

aejsmith force-pushed the ci20-v3.18-fixes branch from a75f4b1 to 8f5ac1a Compare July 22, 2015 12:25

ZubairLK added a commit that referenced this pull request Jul 24, 2015

Merge pull request #52 from aejsmith/ci20-v3.18-fixes

351e3a7

NAND + timing + other smp fixes

ZubairLK merged commit 351e3a7 into MIPS:ci20-v3.18 Jul 24, 2015
