Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zhanghaohit
Copy link
Contributor

@zhanghaohit zhanghaohit commented Apr 29, 2020

Bug

Unrestrained #pragma HLS DEPENDENCE in VTA core waiving all loop-carried dependencies, causing wrong results in 32x32 configuration

Solution

make acc dependence configurable, and also do corresponding checking in VTA runtime (see here).

@zhanghaohit
Copy link
Contributor Author

@tqchen @tmoreau89 Could you kindly help review this PR? Thanks.

@tmoreau89
Copy link
Contributor

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above

@zhanghaohit
Copy link
Contributor Author

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

I think the dependence distance here is defined by the user, to tell the compiler (e.g., XILINX vivado) that, to what extent, the software layer (i.e., TVM code generator) can guarantee there is no acc memory conflict. Say, if the distance = 4, meaning that the software will guarantee there is no conflict within 4 iterations. So the compiler can generate hardware code with smaller II as it is not necessary for the compiler to generate code with larger II to eliminate memory conflict from the hardware perspective.

Thus, I also add the check in VTA runtime to do the VerifyDep.

@remotego
Copy link

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

Thanks @tmoreau89 for the advice. We also thought of putting the value into the pkg_config. However, later we found out the dependence distance is not tightly related to the device family/type.

In my opinion, the compiler will decide the II based on multiple factors. But eventually, it is based on the number of cycles on the datapath from "Read Mem" -> "Perform Ops" -> "Write Back", as we need to avoid RAW data hazard.

In the "Read Mem" and "Write Back" stage, it is definitely related to the properties of the memory (eg. device family, uram vs bram, etc). But it is also related to the H/W circuit of accessing the data. For example, multiplexers as there are multiple accesses. The compile will judge based on the complexity of the overall circuit, and it could add registers (increase II) if the desired frequency target could not be satisfied.

In "Perform Ops" stage, the cycles needed is mainly related to the op itself. For example, a 32-bit multiplication may require more cycles than a 8-bit multiplication.

@tmoreau89
Copy link
Contributor

Thank you for the explanation. Do you mind breaking down what happens in the 32x32 configuration that causes correctness to fail (since the dependence analysis waive is unbounded)?

I agree that there should be a constant set for hardware and software to agree on, but I'm not convinced that it should be set by a user, and instead be derived automatically by the hardware target we choose in pkg_config. This is to ensure that we can achieve an iteration interval of II=1 for the GEMM pipeline, which is constrained by the SRAM write to read latency, and then enforced by the software.

The example I can use is that if we are say we want to accumulate SRAM at address A and B for 3 consecutive cycles. We could do it (1) by updating each address consecutively, or (2) by interlacing the updates.

// initial contents of SRAM at address A is 20, and B is 14
// here we assume 2 cycles to propagate a write to being able to read it
t=0: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=1: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=2: SRAM[A] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=3: SRAM[B] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=4: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 14
t=5: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 15
t=6: noop // SRAM[A] is 22, SRAM[B] is 15
t=7: noop // SRAM[A] is 22, SRAM[B] is 16

// Results are incorrect: A final value is 22 instead of 23, and B final value is 16 instead of 17

Now in the case where these back to back writes are detected in the runtime, we would have an invalid schedule. So assuming the TVM compiler produces a valid schedule, we would end up with interlaced writes to A and B:

// initial contents of SRAM at address A is 20, and B is 14
// here we assume 2 cycles to propagate a write to being able to read it
t=0: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=1: SRAM[B] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=2: SRAM[A] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=3: SRAM[B] += 1 // SRAM[A] is 21, SRAM[B] is 15
t=4: SRAM[A] += 1 // SRAM[A] is 22, SRAM[B] is 15
t=5: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 16
t=6: noop // SRAM[A] is 23, SRAM[B] is 16
t=7: noop // SRAM[A] is 23, SRAM[B] is 17

// Results are now correct because the compiler correctly ensured enough wait time was allowed between consecutive writes and reads to the same address in SRAM.

So therefore the point I'm hoping to make is that this 2 cycle wait time in enforced by the SRAM, and should not be exposed to the user to be set as a parameter in VTA_config.

@tmoreau89
Copy link
Contributor

@remotego I just read your comment that must have been posted as I wrote my reply. Your explanation makes sense, ultimately the critical path in the accumulation is READ -> ADD to GEMM result -> WRITE, which can be more than 2 cycles if we account for the addition logic, and various multiplexers etc.

@tmoreau89
Copy link
Contributor

I assume that if you have an FPGA that needs to be clocked at a high frequency, you'll likely end up with more cycles.

@tmoreau89
Copy link
Contributor

So to conclude, I agree with you that we may want this parameter to be larger than 2 cycles if the hardware pipeline becomes more complicated.

So the big question here is: how can we ensure that it is set correctly moving forward. For instance a DISTANCE=2 works for 16x16, but might break for 32x32. Is there a way that the user as she performs design space exploration does not have to think about updating this parameter? Or at least receives some guidance to update it to the right value?

@tmoreau89
Copy link
Contributor

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

@tmoreau89
Copy link
Contributor

I'll think some more about this problem overnight, thanks for bringing attention to this issue @zhanghaohit @remotego

@remotego
Copy link

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

I love this idea.

Our ultimate goal is to achieve ii = 1, thus it would be great if we could achieve it by some intelligent scripting... only problem is that we need to introduce a way to feedback the results to S/W side.

@zhanghaohit
Copy link
Contributor Author

zhanghaohit commented May 17, 2020

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

Thanks @tmoreau89 for the advice.

For this issue, do you suggest to do it manually? For example, we have a config (e.g., ACC_DEP_DISTANCE) set with a default value (say, 3). If II > 1, we just issue a warning message and ask the user to enlarge this config if he want to achieve II = 1?

@liangfu
Copy link
Member

liangfu commented May 31, 2020

I would agree with

derive the value from the VTA target (i.e. FPGA type) in pkg_config

, and avoid requiring a user to specify the ACC_DEP_DISTANCE parameter.

@tmoreau89
Copy link
Contributor

@zhanghaohit I think that the warning approach is a reasonable compromise between user-defined flexibility and making sure that they set the value correctly.

To echo @liangfu , it would be ideal to derive the value in pkg_config based on simple rules of FPGA target, and matrix multiplication size. We can tweak those rules if say we discover that the dependence distance needs to be increased.

If you agree I suggest we go ahead with these changes.

@tqchen tqchen changed the base branch from master to main October 11, 2020 17:43
jinhongyii pushed a commit that referenced this pull request Sep 5, 2023
This PR add support for cross compiling dylib to x86 macos

Co-authored-by: tqchen <tqchenml@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants