Increase the 'carry' size, decrease the number of queues, and add the ability to push the same piece of memory through the layers. The code could be made smarter still, but this version is a good starting point. This change gives a 4x speedup.