Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/serialization hash collision #35

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

cmhulbert
Copy link
Contributor

@cmhulbert cmhulbert commented Sep 10, 2024

There were some issues with occasional hash collisions when serializing LabelMultisetType images. This aims to solve in the following ways:

  • implement comparable so LabelMultisetEntryList is comparable by the following rules:
    • lists with fewer entries are less than lists with more entries
    • lists with the same number entries compare for each entry on their ID
      • if the IDs at an entry are the same, compare against the counts for that ID

This lets us use a TreeMap in the serialization logic, instead of a HashMap. The comparator is sorted first by size to ensure smaller lists are checked for equality first.

@cmhulbert
Copy link
Contributor Author

@axtimwalde @tpietzsch let me know if you have any thoughts, otherwise I'll plan to release this later this week, after using it a bit more

It's actually slower in many cases, since hashcode calculation always iterates over all entries.
Meanwhile, `compareTo` can return early if it's clear they are not equal.
refactor: Comparable<LabelMultisetEntryList> organization

the `put` operation of `TreeMap` is expensive, since `compareTo` is expensive. So we want to reduce the number of calls to this. To that end we don't want to check equality BEFORE calling `put` since both will calculate `compareTo`. So we `putIfAbsent` on the list. However, the list at this point is a reference that can change, so we need to copy the data. But we don't want to always copy, since most lists (in most cases) are not unique, and we only need a single copy to reference in the `TreeMap`. So only copy the data `via addAll` when the list is a newly added one.

In addition, move the `Comparable` implementation to a package-private subclass that also reuses a single ref from construction to do the comparison.

smarter way of detecting when all entries are empty (`size()` was expensive, since it requires iterating over all entries)
@cmhulbert
Copy link
Contributor Author

@axtimwalde some changes relevant to our conversation on Monday

  • LabelMultisetEntryList no longer directly implements Comparable
  • ComparableLabelMultisetEntryList package-private subclass
    • keeps Comparable internal
    • allows for reuse of LabelMultisetEntry reference during compareTo

Performance wise, it is a bit slower in the tests that are included in this library. In Paintera, with a real dataset it is only slightly slower on average (about 10%).

Test Type Relative Performance (ratio TreeView / HashMap)
Random Multiple Entry as Variable 1.22
Random Multiple Entry as Image 2.0
Empty Image 1.06
Single ID Image 1.06
Random Single Entry Image 2.8
Invalid ID 0.9
Paintera 1.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant