Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve Dictionary perfomance #27149

Closed
wants to merge 7 commits into from
Closed

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Oct 12, 2019

  • Dictionary avoid second bounds check in Get methods
  • Hoist collision max count check
  • Avoid mod operator when fast alternative available
|        Method | Toolchain |      Mean | Ratio |
|-------------- |---------- |----------:|------:|
| KNucleotide_1 |      base | 368.1  ms |  1.00 |
| KNucleotide_1 |      diff | 336.9  ms |  0.92 |
| KNucleotide_9 |      base | 104.05 ms |  1.00 |
| KNucleotide_9 |      diff |  89.71 ms |  0.86 |

System.Collections.TryGetValueTrue_String, String

|     Method | Toolchain | Size |     Mean | Ratio |
|----------- |---------- |----- |---------:|------:|
| Dictionary |      base |  512 | 15.72 us |  1.00 |
| Dictionary |      diff |  512 | 12.87 us |  0.82 |

System.Collections.TryGetValueFalse_String, String

|     Method | Toolchain | Size |     Mean | Ratio |
|----------- |---------- |----- |---------:|------:|
| Dictionary |      base |  512 | 13.84 us |  1.00 |
| Dictionary |      diff |  512 | 10.61 us |  0.77 |

System.Collections.TryGetValueTrue_Int32, Int32

|     Method | Toolchain | Size |     Mean | Ratio |
|----------- |---------- |----- |---------:|------:|
| Dictionary |      base |  512 | 5.901 us |  1.00 |
| Dictionary |      diff |  512 | 3.642 us |  0.62 |

System.Collections.TryGetValueFalse_Int32, Int32

|     Method | Toolchain | Size |     Mean | Ratio |
|----------- |---------- |----- |---------:|------:|
| Dictionary |      base |  512 | 8.511 us |  1.00 |
| Dictionary |      diff |  512 | 4.194 us |  0.49 |

More #27149 (comment)

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Oct 12, 2019

For reference, here's the before and after codegen for the get_Item method, courtesy Sharplab. If you drill into the linked code on Sharplab, I mocked up Ben's changes inside a dummy type just to see what would happen to the codegen.

public static int Before(Dictionary<int, int> d) {
    return d[42];
}

public static int After(MockDictionary<int, int> d)
{
    return d[42];
}
C.Before(System.Collections.Generic.Dictionary`2<Int32,Int32>)
    L0000: push rsi
    L0001: sub rsp, 0x20
    L0005: mov rsi, rcx
    L0008: mov ecx, [rsi]
    L000a: mov rcx, rsi
    L000d: mov edx, 0x2a
    L0012: call System.Collections.Generic.Dictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].FindEntry(Int32)
    L0017: test eax, eax
    L0019: jl L0035
    L001b: mov rcx, [rsi+0x10]
    L001f: cmp eax, [rcx+0x8]
    L0022: jae L0040
    L0024: movsxd rax, eax
    L0027: shl rax, 0x4
    L002b: mov eax, [rcx+rax+0x1c]
    L002f: add rsp, 0x20
    L0033: pop rsi
    L0034: ret
    L0035: mov ecx, 0x2a
    L003a: call System.ThrowHelper.ThrowKeyNotFoundException[[System.Int32, System.Private.CoreLib]](Int32)
    L003f: int3
    L0040: call 0x7ff95393ef00
    L0045: int3

C.After(MockDictionary`2<Int32,Int32>)
    L0000: sub rsp, 0x28
    L0004: mov edx, [rcx]
    L0006: mov edx, 0x2a
    L000b: call MockDictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].FindValueOrNull(Int32)
    L0010: test rax, rax
    L0013: jz L001c
    L0015: mov eax, [rax]
    L0017: add rsp, 0x28
    L001b: ret
    L001c: mov ecx, 0x2a
    L0021: call MockDictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].ThrowKeyNotFoundException[[System.Int32, System.Private.CoreLib]](Int32)
    L0026: int3

@benaadams
Copy link
Member Author

benaadams commented Oct 13, 2019

Wasn't happy with the asm this was generating; and the gnarly bool handling logic that it generated whenever it inlined. So trying a different approach

The side-effect of the new that I'm not sure about is how it effects uses of ContainsKey as it introduces an extra byref param, and again introduces some unnecessary bool handling which looks in part due to wanting the register to clear the out value? e.g.

-   mov      rdx, rsi
-   call     [Dictionary`2:FindEntry(ref):int:this]
-   test     eax, eax
-   jge      SHORT G_M9502_IG08
+   lea      r8, bword ptr [rsp+20H]
+   mov      rdx, rsi
+   call     [Dictionary`2:TryGetValue(ref,byref):bool:this]
+   movzx    rcx, al
+   xor      rax, rax
+   mov      gword ptr [rsp+20H], rax
+   test     ecx, ecx
+   jne      SHORT G_M9502_IG08

Otherwise the diffs look good

Total bytes of diff: -891 (-0.03% of base)
    diff is an improvement.

Total byte diff includes -3935 bytes from reconciling methods
  Base had    4 unique methods,     3935 unique bytes
  Diff had    0 unique methods,  0 unique bytes

Top file improvements by size (bytes):
  -891 : System.Private.CoreLib.dasm (-0.03% of base)

Copy link

@IDisposable IDisposable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, one minor nit

@jkotas
Copy link
Member

jkotas commented Oct 13, 2019

Wasn't happy with the asm this was generating; and the gnarly bool handling logic that it generated whenever it inlined.

It is not unusual for the JIT to generate sub-optimal code for rarely used or brand new patterns. The JIT typically needs tweaking for best results.

@benaadams
Copy link
Member Author

Shaved off a little more

Total bytes of diff: -1053 (-0.04% of base)
   diff is an improvement.

Total byte diff includes -3935 bytes from reconciling methods
   Base had    4 unique methods,     3935 unique bytes
   Diff had    0 unique methods,   0 unique bytes

Top file improvements by size (bytes):
  -1053 : System.Private.CoreLib.dasm (-0.04% of base)

1 total files with size differences (1 improved, 0 regressed), 0 unchanged.

Top method regressions by size (bytes):
   1845 (444.58% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(ref,byref):bool:this (5 methods)
    861 (337.65% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(long,byref):bool:this (3 methods)
    269 (316.47% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(int,byref):bool:this
    195 (110.17% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(struct,byref):bool:this (2 base, 1 diff methods)
    108 ( 9.35% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.IDictionary.Contains(ref):bool:this (11 methods)
     92 ( 7.91% of base) : System.Private.CoreLib.dasm - PseudoCustomAttribute:GetCustomAttributes(ref,ref,byref) (4 methods)
     92 (10.17% of base) : System.Private.CoreLib.dasm - PseudoCustomAttribute:IsDefined(ref,ref):bool (4 methods)
     45 (51.72% of base) : System.Private.CoreLib.dasm - KeyCollection:System.Collections.Generic.ICollection<TKey>.Contains(long):bool:this (3 methods)
     40 (24.39% of base) : System.Private.CoreLib.dasm - Unsafe:AsRef(long):byref (41 base, 51 diff methods)
     33 (16.50% of base) : System.Private.CoreLib.dasm - KeyCollection:System.Collections.Generic.ICollection<TKey>.Contains(ref):bool:this (6 methods)
     31 (18.67% of base) : System.Private.CoreLib.dasm - Attribute:CopyToArrayList(ref,ref,ref)
     27 ( 3.77% of base) : System.Private.CoreLib.dasm - MemberInfoCache`1:PopulateEvents(struct,ref,ref,byref):this
     24 (25.26% of base) : System.Private.CoreLib.dasm - KeyCollection:System.Collections.Generic.ICollection<TKey>.Contains(struct):bool:this (2 methods)
     23 (11.79% of base) : System.Private.CoreLib.dasm - EventRegistrationTokenTable`1:AddEventHandlerNoLock(ref):struct:this
     23 ( 2.15% of base) : System.Private.CoreLib.dasm - EventSource:DebugCheckEvent(byref,ref,ref,ref,ref,int)

Top method improvements by size (bytes):
  -2270 (-100.00% of base) : System.Private.CoreLib.dasm - Dictionary`2:FindEntry(ref):int:this (5 base, 0 diff methods)
  -1008 (-100.00% of base) : System.Private.CoreLib.dasm - Dictionary`2:FindEntry(long):int:this (3 base, 0 diff methods)
   -336 (-100.00% of base) : System.Private.CoreLib.dasm - Dictionary`2:FindEntry(struct):int:this (1 base, 0 diff methods)
   -321 (-100.00% of base) : System.Private.CoreLib.dasm - Dictionary`2:FindEntry(int):int:this (1 base, 0 diff methods)
   -234 (-15.50% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.Generic.ICollection<System.Collections.Generic.KeyValuePair<TKey,TValue>>.Contains(struct):bool:this (11 methods)
   -196 (-11.48% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.Generic.ICollection<System.Collections.Generic.KeyValuePair<TKey,TValue>>.Remove(struct):bool:this (11 methods)
   -128 (-8.34% of base) : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.IDictionary.get_Item(ref):ref:this (11 methods)
   -114 (-7.88% of base) : System.Private.CoreLib.dasm - Dictionary`2:ContainsValue(ref):bool:this (6 methods)
    -75 (-3.24% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryInsert(long,ref,ubyte):bool:this (3 methods)
    -54 (-25.00% of base) : System.Private.CoreLib.dasm - Dictionary`2:get_Item(long):ref:this (3 methods)
    -26 (-1.30% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryInsert(ref,struct,ubyte):bool:this (2 methods)
    -25 (-3.24% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryInsert(struct,ref,ubyte):bool:this
    -24 (-3.18% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryInsert(int,ref,ubyte):bool:this
    -20 (-8.77% of base) : System.Private.CoreLib.dasm - Dictionary`2:get_Item(ref):struct:this (2 methods)
    -18 (-26.09% of base) : System.Private.CoreLib.dasm - Dictionary`2:get_Item(int):ref:this

65 total methods with size differences (30 improved, 35 regressed), 18583 unchanged.

@benaadams
Copy link
Member Author

benaadams commented Oct 13, 2019

It is not unusual for the JIT to generate sub-optimal code for rarely used or brand new patterns. The JIT typically needs tweaking for best results.

I see this pattern come up a bit; I think its post inlining and in this case from EqualityComparer<TKey>.Default.Equals

   mov      ecx, dword ptr [r13+12]
   cmp      ecx, esi
   sete     cl
   movzx    rcx, cl
   test     ecx, ecx
   jne      G_M59263_IG12

Would I correct in thinking it could just be:

   mov      ecx, dword ptr [r13+12]
   cmp      ecx, esi
   je       G_M59263_IG12

Where the cmp and je would microsfuse to 1 instuction? Will raise issue if so

@benaadams
Copy link
Member Author

There is a further optimisation to be had here as the test could be dropped from the collision testing sequences

dec      r13d
test     r13d, r13d
jl       SHORT G_M59263_IG15

though doesn't look like the Jit currently does that optimization https://github.com/dotnet/coreclr/issues/7566

@benaadams benaadams changed the title Dictionary avoid second bounds check in Get methods Improve Dictionary perfomance Oct 14, 2019
@benaadams
Copy link
Member Author

ContainsKey which I was worried about looks ok

System.Collections.ContainsKeyTrue_String, String

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 14.86 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 13.01 us |  0.88 |

System.Collections.ContainsKeyFalse_String, String

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 11.48 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 10.79 us |  0.94 |

System.Collections.ContainsKeyTrue_Int32, Int32

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 4.655 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 3.826 us |  0.82 |

System.Collections.ContainsKeyFalse_Int32, Int32

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 4.815 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 3.336 us |  0.69 |

TryGetValue is happy

System.Collections.TryGetValueTrue_String, String

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 15.72 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 12.87 us |  0.82 |

System.Collections.TryGetValueFalse_String, String

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 13.84 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 10.61 us |  0.77 |

System.Collections.TryGetValueTrue_Int32, Int32

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 5.901 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 3.642 us |  0.62 |

System.Collections.TryGetValueFalse_Int32, Int32

|     Method |                   Toolchain | Size |     Mean | Ratio |
|----------- |---------------------------- |----- |---------:|------:|
| Dictionary | \Core_Root-base\CoreRun.exe |  512 | 8.511 us |  1.00 |
| Dictionary | \Core_Root-diff\CoreRun.exe |  512 | 4.194 us |  0.49 |

@benaadams
Copy link
Member Author

KNucleotide changes

|        Method |                   Toolchain |      Mean | Ratio |
|-------------- |---------------------------- |----------:|------:|
| KNucleotide_1 | \Core_Root-base\CoreRun.exe | 368.1  ms |  1.00 |
| KNucleotide_1 | \Core_Root-diff\CoreRun.exe | 336.9  ms |  0.92 |
| KNucleotide_9 | \Core_Root-base\CoreRun.exe | 104.05 ms |  1.00 |
| KNucleotide_9 | \Core_Root-diff\CoreRun.exe |  89.71 ms |  0.86 |

/cc @danmosemsft

@benaadams benaadams marked this pull request as ready for review October 14, 2019 03:02
// more costly idiv instruction.

const uint CrcReflected = 0xEDB88320;
return (uint)((Sse42.Crc32(CrcReflected, index) * (ulong)range) >> 32);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikedn @AndyAyersMS running the hashcode through Crc32 gets around the distribution issue with unknown hashcodes being high entropy in their LSB when using @lemire's fastrange. (int being an example here)

Though alas we can only do this when hardware accelerated as otherwise Crc32 is too slow; so fall back to modulo when its not supported in hardware.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm from my experience that fastrange is very dangerous when used as a hash reducer. I tried it in the past and it resulted in massive collisions in common cases like sequential numbers, zip codes, etc. - i.e. stuff that has entropy in low bits.

I did not consider hardware-accelerated mixer like Crc32 - cannot comment on that. It was not available at the time. It is an interesting idea.

My guess it may still regress cases like when dictionary is used as a sparse list and keys have clusters. Modulus, for good or bad, happens to preserve the locality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to reverse the bits to reverse the entropy; and then you get something closer to the behaviour of modulus with fastrange. Alas there still isn't a fast way to do that in hardware; and reversing the endianness still leaves the wrong order entropy per byte.

Re: locality #27149 (comment) for Dictionary specifically it has a lesser impact; than for example Hashtable, as its only changing the locality of the smaller int[] array that contains the indexes and the larger Entry[] array would maintain the same very packed locality. But yes it would have an impact.

Copy link
Member Author

@benaadams benaadams Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach could be to use a bucket array that was of the correct type for the capacity (e.g. dictionary size 1-255 could be a byte[], 256-65536 could be an ushort[], etc)

@benaadams
Copy link
Member Author

One wrinkle with this change is it also does Crc32 on top of Marvin32 for strings which is a bit weird; but shouldn't really make it it more predictable? /cc @GrabYourPitchforks

@GrabYourPitchforks
Copy link
Member

We'd need to run the CRC32 logic by the crypto board to make sure that we're not improperly compressing the range of the hash function. The final step of CRC32 is a finite field polynomial division, which likely introduces a slight bias toward zero. It's probably safe but needs an extra set of eyes from somebody more steeped in this field.


const uint CrcReflected = 0xEDB88320;
return (uint)((Sse42.Crc32(CrcReflected, index) * (ulong)range) >> 32);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a neat idea. I can think about two potential problems with this:

  • This will make Dictionary slower when tiered JIT is disabled, or during startup until tiered JIT kicks in. @mjsabby Are you concerned about Dictionary being slower with tiered JIT disabled?
  • There may be cases today where the div gives a good locality for Dictionary access. This locality will disappear with crc32 and the overall performance will regress. I do not know what is the chance of this happening in real world workloads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make Dictionary slower when tiered JIT is disabled, or during startup until tiered JIT kicks in.

Does the support check get cached in a static rather than always being a CPUID? Looking at the R2R asm (call to jmp) still have quite a few cycles to spare vs the div branch:

       call     [Sse42:get_IsSupported():bool]
       test     al, al
       je       SHORT G_M59265_IG10
       mov      eax, 0xD1FFAB1E
       crc32    eax, r12d
       mov      edx, r13d
       imul     rax, rdx
       shr      rax, 32
       mov      ecx, eax
       jmp      SHORT G_M59265_IG11

G_M59265_IG10:		;; bbWeight=0.50
       mov      eax, r12d
       xor      rdx, rdx
       div      edx:eax, r13d
       mov      ecx, edx

G_M59265_IG11:		;; bbWeight=0.50

Actually, can test it. Will COMPlus_TieredCompilation=0 mean it will always use the R2R version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be cases today where the div gives a good locality for Dictionary access. This locality will disappear with crc32 and the overall performance will regress.

This should have a lower impact with Dictionary; than for example Hashtable as its only changing the locality of the smaller int[] array that contains the indexes and the larger Entry[] array would maintain the same very packed locality. But yes it would have an impact.

Probably couldn't use in in Hashtable due to this greater effect as it would additionally scatter the data. Although... Hashtable calls modulo for every collision (whereas Dictionary only calls it once); so might it might balance?

However; I wouldn't want to do the work to verify that 😄

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Dictionary is one of the hottest pieces of code we run and we run with tiered jitting disabled and would have a pretty negative perf impact. I would love to see this without a regression to Dictionary in the non tiered case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still is an improvement for non-tiered (I assume Sse42 is available, its just tiering that's disabled?)

|        Method | Toolchain |      Mean | Ratio |
|-------------- |---------- |----------:|------:|
| KNucleotide_1 |      base | 375.6  ms |  1.00 |
| KNucleotide_1 |      diff | 340.6  ms |  0.91 |
| KNucleotide_9 |      base | 100.13 ms |  1.00 |
| KNucleotide_9 |      diff |  97.05 ms |  0.97 |

System.Collections.TryGetValueFalse_Int, Int

|     Method |  Toolchain | Size |     Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary |       base |  512 | 7.033 us |  1.00 |
| Dictionary |       diff |  512 | 3.381 us |  0.48 |

System.Collections.TryGetValueFalse_String, String

|     Method |  Toolchain | Size |     Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary |       base |  512 | 16.46 us |  1.00 |
| Dictionary |       diff |  512 | 13.72 us |  0.83 |

System.Collections.TryGetValueTrue_Int, Int

|     Method |  Toolchain | Size |     Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary |       base |  512 | 5.965 us |  1.00 |
| Dictionary |       diff |  512 | 3.755 us |  0.63 |

System.Collections.TryGetValueTrue_String, String

|     Method |  Toolchain | Size |     Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary |       base |  512 | 17.92 us |  1.00 |
| Dictionary |       diff |  512 | 15.09 us |  0.84 |

@adamsitnik does setting the COMPlus_TieredCompilation=0 environment variable stick for the https://github.com/dotnet/performance repository or should I be passing a flag?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems like the perf repo tests are using unique randomly distributed keys. Might be worth running some perf examples where we have less well distributed key sets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this hashcode-to-bucket mapping change to be a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure why we are using the reflected polynomial of CRC-32 (CrcReflected) as the initial hash value for CRC-32C calculation implemented by the intrinsic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems like the perf repo tests are using unique randomly distributed keys. Might be worth running some perf examples where we have less well distributed key sets?

Added some test for that, and large values dotnet/performance#938

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Dictionary is one of the hottest pieces of code we run

Makes my top 6 (the two waits don't count as they are the main thread and blocked forever)

image

Typos

Co-Authored-By: Joni <jonijpn@gmail.com>
@adamsitnik
Copy link
Member

does setting the COMPlus_TieredCompilation=0 environment variable stick for the https://github.com/dotnet/performance repository or should I be passing a flag?

Yes, this should work fine (the host process is going to derive the env var from your shell and the benchmark process from the host process).

The alternative is to use our python3 script which allows for even more:

  --dotnet-compilation-mode {Default,Tiered,NoTiering,FullyJittedNoTiering,MinOpt}
                        Different compilation modes that can be set to change
                        the .NET compilation behavior. The default
                        configurations have changed between releases of .NET.
                        These flags enable ensuring consistency when running
                        more than one runtime. The different modes are:
                        Default: no environment variables are set; Tiered:
                        tiering is enabled. NoTiering: tiering is disabled,
                        but includes R2R code, and it is useful for comparison
                        against Tiered; FullyJittedNoTiering: This is JIT-
                        only, useful for comparison against Tiered and NoTier
                        for changes to R2R code or tiering; MinOpt: uses
                        minopt-JIT for methods that do not have pregenerated
                        code, and useful for startup time comparisons in
                        scenario benchmarks that include a startup time
                        measurement (probably not for microbenchmarks),
                        probably not useful for a PR.
py .\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter $theFilter --coreRun $coreRuns --dotnet-compilation-mode NoTiering

private struct Entry
{
public uint hashCode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered the fields ( I assume by the order of access) and made layout Auto. Just curious if this was on purpose or a typo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On purpose; order of access, but also Auto in case the TKey and TValue are unusual size structs so the reordering doesn't work for the packing (e.g. byte, byte)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Auto considers the initial layout (I do not know), it would be a nice trick to learn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered too, maybe worth a comment.

@benaadams
Copy link
Member Author

Also seems like the perf repo tests are using unique randomly distributed keys. Might be worth running some perf examples where we have less well distributed key sets?

Sequential keys
Dictionary sizes 17 and 3000
Keys are: int
Values are: int, (long, long, long, long) and (object, object, object, object)

TryGetValue looks good;

|                             Method | Toolchain |      Mean | Ratio |
|----------------------------------- |---------- |----------:|------:|                                            
|             TryGetValue_17_Int_Int |      base | 10.157 ns |  1.00 |
|             TryGetValue_17_Int_Int |      diff |  6.109 ns |  0.60 |
|             TryGetValue_3k_Int_Int |      base |  9.563 ns |  1.00 |
|             TryGetValue_3k_Int_Int |      diff |  6.651 ns |  0.70 |
|                                    |           |           |       |
|     TryGetValue_17_Int_32ByteValue |      base | 10.633 ns |  1.00 |
|     TryGetValue_17_Int_32ByteValue |      diff |  6.772 ns |  0.64 |
|     TryGetValue_3k_Int_32ByteValue |      base | 10.132 ns |  1.00 |
|     TryGetValue_3k_Int_32ByteValue |      diff |  7.526 ns |  0.74 |
|                                    |           |           |       |
| TryGetValue_17_Int_32ByteRefsValue |      base | 14.538 ns |  1.00 |
| TryGetValue_17_Int_32ByteRefsValue |      diff | 11.143 ns |  0.77 |
| TryGetValue_3k_Int_32ByteRefsValue |      base | 14.161 ns |  1.00 |
| TryGetValue_3k_Int_32ByteRefsValue |      diff | 12.178 ns |  0.86 |

What is the performance going to be for larger TValue? E.g. ValueTuple<object, object, object, object> ? I would expect that e.g. ContainsKey will regress for this case.

Indeed it does

|                             Method | Toolchain |      Mean | Ratio |
|----------------------------------- |---------- |----------:|------:|
|           ContainsValue_17_Int_Int |      base |  8.196 ns |  1.00 |
|           ContainsValue_17_Int_Int |      diff |  6.019 ns |  0.74 |
|           ContainsValue_3k_Int_Int |      base |  7.773 ns |  1.00 |
|           ContainsValue_3k_Int_Int |      diff |  6.716 ns |  0.86 |
|                                    |           |           |       |
|     ContainsKey_17_Int_32ByteValue |      base |  7.713 ns |  1.00 |
|     ContainsKey_17_Int_32ByteValue |      diff |  7.209 ns |  0.93 |
|     ContainsKey_3k_Int_32ByteValue |      base |  7.518 ns |  1.00 |
|     ContainsKey_3k_Int_32ByteValue |      diff |  8.251 ns |  1.10 |
|                                    |           |           |       |
| ContainsKey_17_Int_32ByteRefsValue |      base |  7.624 ns |  1.00 |
| ContainsKey_17_Int_32ByteRefsValue |      diff | 11.860 ns |  1.56 |
| ContainsKey_3k_Int_32ByteRefsValue |      base |  7.583 ns |  1.00 |
| ContainsKey_3k_Int_32ByteRefsValue |      diff | 12.622 ns |  1.67 |

@@ -228,26 +232,36 @@ public void Clear()
}

public bool ContainsKey(TKey key)
=> FindEntry(key) >= 0;
=> TryGetValue(key, out _);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this would do an unnecessary TValue copy on success.

@VSadov
Copy link
Member

VSadov commented Oct 14, 2019

Another way to avoid copying the value could be by using bool FindValue(key, ref value) - with ref instead of out, so that calee could just return false.

It may not be as efficient as dealing with unsafe, but could be not that much worse.

@benaadams
Copy link
Member Author

Another way to avoid copying the value could be by using bool FindValue(key, ref value)

Doesn't help with ContainsKey as the caller then has to default the ref; and when true it would additionally do the assignment (unless additionally passed an enum to indicate mode, but then that eats a register).

Returning ref TValue allows ContainsKey to avoid any struct copies.

@VSadov
Copy link
Member

VSadov commented Oct 15, 2019

Yes, that is why it may not be as efficient as unsafe version.

TryGet code path will be ok, but in the case of ContainsKey, caller will need to have space for the value and positive case will do a redundant copy.

That is just in case you (or JIT) stop liking unsafe for some reason :-)

@benaadams
Copy link
Member Author

#27149 (comment)

I liked your previous approach where this returned ref TValue better.

😉

@VSadov
Copy link
Member

VSadov commented Oct 15, 2019

My C# background cries when I see ref Unsafe.AsRef<TValue>(null).

From IL point of view it is ok though and I see how it is safe and efficient in this case.

@AntonLapounov
Copy link
Member

My C# background cries when I see ref Unsafe.AsRef(null).

@VSadov That should be ref? TValue value = default;. ;-)

@benaadams
Copy link
Member Author

bool FindValue(key, ref ref value) would work too 😉

@VSadov
Copy link
Member

VSadov commented Oct 15, 2019

I think ref? was actually briefly considered as a nullable ref. But then why stop half way and not bring in a real notion of a byref type, with math and comparisons, etc.

Then TValue^ pValue = null; starts looking attractive...

We are digressing :-)

@benaadams
Copy link
Member Author

I would prefer this hashcode-to-bucket mapping change to be a separate PR.

PR without the hashcode-to-bucket mapping change, will follow up with that in a different PR #27195

@gahe88
Copy link

gahe88 commented Oct 15, 2019

@benaadams I just came across an interesting way to map hash values to smaller ranges. Linked it in case you want to have a look:
https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

@benaadams
Copy link
Member Author

Opened PR for the other half #27299

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.