"test" + "add" in a single call #68
Conversation
I would expect option (2) to be faster, not slower. We should be within the same cache and there isn't any real I/O. Still, I can't argue with the numbers -- I suspect the difference is accounted for by extra Python interpreter work, rather than just doing it all in C.
        return -1;
    }
    DTYPE *chunk = array->vector + (size_t)(array->preamblesize + bit / (sizeof(DTYPE) << 3));
    int byte = 1 << (bit % (sizeof(DTYPE) << 3));
can you use the _vector_byte static inline function here?
There is I/O -- processor caches can make a huge difference, and I think any write invalidates the cache line, even when the write is a no-op (no bits changed). Memory writes are expensive.
        errno = EINVAL;
        return -1;
    }
    DTYPE *chunk = array->vector + (size_t)(array->preamblesize + bit / (sizeof(DTYPE) << 3));
and please use the _vector_offset static inline function here as well.
If you want to earn brownie points -- try writing a Cython version that calls out to the C test & add and see what the performance improvement is :)
Done. Let me know if you can replicate the performance improvements; I wonder how other HW / compiler factors play into this (OS X here, Apple LLVM version 7.0.0, MacBookPro11,3). I merged the 32-bit fix as well -- it modifies the same file, and I don't want conflicts later. Weirdly enough, even the pypy test passed now.
I think you're right -- I checked a version that simply writes instead of branching first. So either it was just Python overhead that made this faster, or the branch is even more expensive than a write (which kinda makes sense). Either way, the write-through version is faster.
Should make it the default, if it's typically faster.
In the (common?) scenario of "test if the item is already in the filter; if not, add it; run additional business logic based on whether the element is new or not", we now have the choice of:

1) calling test first, and then calling add only when the item is new, or
2) calling add unconditionally and using its return value to tell whether the item was already present.
There's not much difference in performance between the two (hashing is fast, and the memory caches are primed), and the second approach is nicer. However, it sets the array bits even if they are already set (it always writes through). This is unfortunate: the write access is unnecessary for duplicates and makes things more complicated and slower (more I/O, cache stalls).
This PR adds a new method, contains_add, which acts just like add but only sets the bits if they are not already set.

According to benchmarks, it's also about 11% faster than option 2) above when expecting 20% duplicates.