Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Augment raw_entry API for indexing use case #44

Closed
sujayakar opened this issue Feb 15, 2019 · 3 comments
Closed

Augment raw_entry API for indexing use case #44

sujayakar opened this issue Feb 15, 2019 · 3 comments

Comments

@sujayakar
Copy link

I'm writing a string-interning data structure in a memory-constrained environment. The strings effectively have 'static lifetime, so I don't need to support deallocation. Therefore, I just append null-terminated strings to a Vec<u8> buffer and identify a string by its (u32) offset into this buffer.

Then, we can query a string by its identifier by starting at the given offset and finding the null byte. But, we'd also like to see if a string is already in the buffer and get its offset if so. I'd like to use hashbrown for maintaining this relation from String to u32 without duplicating each string in memory twice.

The raw entry API is almost there for supporting this. When looking up a string, we can compute its hash and then use RawEntryBuilder::from_hash.

struct Offset(u32);

impl Hash for Offset {
     fn hash<H: Hasher>(&self, h: &mut H) {
         panic!("We shouldn't be hashing the offsets directly!");
     }
}

struct InternedTable {
    buf: Vec<u8>,
    index: HashMap<Offset, ()>,
}

impl InternedTable {
    fn lookup(&self, key: &str) -> Option<u32> {
        let key_hash = ...;
        let result = self.index.raw_entry().from_hash(key_hash, |offset| {
            let offset = offset.0 as usize;
            key.as_bytes() == &self.buf[offset...min(offset + key.len(), buf.len())]
        });
        result.map(|(k, v)| k)
    }
}

However, doing the same for insertions doesn't quite work. The essential issue is that the raw entry API only lets us control the hash of a single key, and inserting may require recomputing hashes of other keys when resizing.

impl InternedTable {
    fn insert(&mut self, key: &str) -> u32 {
        let key_hash = ...;
        let entry = self.index.raw_entry_mut().from_hash(key_hash, |offset| { ... });
        let vacant_entry = match entry {
            // String is already in the table, so return its existing offset.
            RawEntryMut::Occupied(entry) => return *entry.key(),
            RawEntryMut::Vacant(entry) => entry,
        };

        let offset = self.buf.len() as u32;
        self.buf.extend(key.as_bytes());
        self.buf.append(0u8);

        // Panics here from `Hash` implementation called from `RawTable::resize`
        vacant_entry.insert_hashed_nocheck(key_hash, Offset(offset), ());
        offset
    }
}

Today, I get around this by sneaking in a pointer to buf via a thread-local to the Hash implementation. I could also stash a Rc<RefCell<...>> to the buffer on the hash builder for the same effect. However, both of these solutions are ugly, and given how close it is, it feels to me like the RawEntry API should support this use case.

Let me know if accommodating this is acceptable and I can submit a PR. It looks like it should be pretty easy since RawTable already takes in a hasher: impl Fn(&T) -> u64 for many of its methods.

@Amanieu
Copy link
Member

Amanieu commented Feb 15, 2019

That sounds good, I would be happy to accept a PR which adds a new method to RawVacantEntryMut that allows passing in a hasher function like RawTable does.

You should be able to make it so that the dummy Hash implementation is not required at all.

bors bot added a commit that referenced this issue Mar 17, 2019
54: Raw entry hasher r=Amanieu a=sujayakar

This PR allows the user to create a `HashMap` where `K` doesn't implement `Hash`, so long as they are okay with only using the raw entry API.

The first step is to move around some methods that didn't actually depend on `K: Hash`.  Then, the only non-plumbing change was to defer the `.reserve(1)` call on `raw_entry_mut` to right before insertion.  Then, we expose a new method `RawVacantEntryMut::insert_with_hasher` that takes in the custom hashing function and passes it in to `reserve` before inserting.

@Amanieu do these API changes merit a version bump?  I think we're only allowing these methods to be used on _more_ types, but let me know.

Closes #44 

Co-authored-by: Sujay Jayakar <sujayakar@dropbox.com>
@bors bors bot closed this as completed in #54 Mar 17, 2019
@CAD97
Copy link
Contributor

CAD97 commented Mar 22, 2020

I've run into a hitch with insert_with_hasher: I would still like access to the hasher of the map, so I can hash with the configured hasher.

Specifically, I'd like an adjustment from

pub fn RawVacantEntryMut::insert_with_hasher<H>(
    self,
    hash: u64,
    key: K,
    value: V,
    hasher: H,
) -> (&'a mut K, &'a mut V)
where
    H: Fn(&K) -> u64,

to

pub fn RawVacantEntryMut::insert_with_hasher<H>(
    self,
    hash: u64,
    key: K,
    value: V,
    hasher: H,
) -> (&'a mut K, &'a mut V)
where
    H: Fn(&K, &S) -> u64,

The impl I'm doing is an interner that requires information outside of the hashmap to properly hash the keys. [gist]

(While we're (potentially) touching the function API, is there a reason for it to take Fn rather than FnMut?)

@Amanieu
Copy link
Member

Amanieu commented Mar 22, 2020

You can keep the hasher outside the map and use () as the hasher in the map.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants