Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Intrinsicify SpanHelpers.Char #22187

Closed
wants to merge 11 commits into from

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Jan 24, 2019

Use System.Runtime.Intrinsics.X86 in addition to System.Numerics.Vector and apply the learnings of byte #22118 and #22127 to char.

Split this into separate commits so its easier to follow; thought the first one is kinda messy.

  1. Prepare SpanHelpers.Char for intrinsics 6bb19c3
  • Mirror SpanHelpers.Byte methods and remove pinning; so its easier to use the same intrinsic patterns as used for byte. (Also rename nLength to lengthToExamine as it was confusing me)
  1. Intrinsicify SpanHelpers.IndexOf(char) aa80e6c
  2. Intrinsicify SpanHelpers.SequenceCompareTo(char) 88a25d2
  3. Intrinsicify SpanHelpers.IndexOfAny(char,char) 981cc45
  4. Intrinsicify SpanHelpers.IndexOfAny(char,char,char) e02e9ab
  5. Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) 733a137
  6. Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) 991ced6
  7. Optimize IndexOfAny(2, 3) 6ac7daf
  8. Optimize .IndexOfAny(4,5) for short lengths eebbd2f
  9. Reduce sign extensions 07a282a
  10. Align IndexOf buffers so String.wcslen can use it (Vectorize wcslen Vectorize wcslen #21730) c7836d5

Methods improved

internal partial class SpanHelpers
{
    int IndexOfAny(ref byte, byte, byte, int);

    int IndexOf(ref char, char, int);
    int IndexOfAny(ref char, char, char, int);
    int IndexOfAny(ref char, char, char, char, int);
    int IndexOfAny(ref char, char, char, char, char, int);
    int IndexOfAny(ref char, char, char, char, char, char, int);
    int SequenceCompareTo(ref char, int, ref char, int) 
}

Perf numbers #22187 (comment)

/cc @CarolEidt @fiigii @tannergooding @jkotas

@benaadams benaadams force-pushed the SpanHelpers.Char branch 7 times, most recently from 156f782 to 4660ed6 Compare January 25, 2019 22:39
@benaadams benaadams changed the title [WIP] Prepare SpanHelpers.Char for intrinsics [WIP] Intrinsicify SpanHelpers.Char Jan 26, 2019
@benaadams benaadams changed the title [WIP] Intrinsicify SpanHelpers.Char Intrinsicify SpanHelpers.Char Jan 26, 2019
@benaadams benaadams force-pushed the SpanHelpers.Char branch 2 times, most recently from 6b9b723 to 6f62eb3 Compare February 2, 2019 23:38
@jkotas
Copy link
Member

jkotas commented Feb 3, 2019

Any perf numbers?

@benaadams
Copy link
Member Author

benaadams commented Feb 3, 2019

               Method | Position |       Mean |
 -------------------- |--------- |-----------:|
-      String_IndexOf |        7 |   5.031 ns |
+      String_IndexOf |        7 |   4.403 ns |
- String_IndexOfAny_2 |        7 |   9.353 ns |
+ String_IndexOfAny_2 |        7 |   9.097 ns |
- String_IndexOfAny_3 |        7 |  11.177 ns |
+ String_IndexOfAny_3 |        7 |  10.270 ns |
- String_IndexOfAny_4 |        7 |  13.400 ns |
+ String_IndexOfAny_4 |        7 |  11.986 ns |
- String_IndexOfAny_5 |        7 |  14.540 ns |
+ String_IndexOfAny_5 |        7 |  12.865 ns |

-      String_IndexOf |        8 |   5.053 ns |
+      String_IndexOf |        8 |   4.400 ns |
- String_IndexOfAny_2 |        8 |   9.716 ns |
+ String_IndexOfAny_2 |        8 |   9.133 ns |
- String_IndexOfAny_3 |        8 |  11.476 ns |
+ String_IndexOfAny_3 |        8 |  11.137 ns |
- String_IndexOfAny_4 |        8 |  14.236 ns |
+ String_IndexOfAny_4 |        8 |  13.529 ns |
- String_IndexOfAny_5 |        8 |  15.362 ns |
+ String_IndexOfAny_5 |        8 |  13.997 ns |

-      String_IndexOf |       15 |   8.817 ns |
+      String_IndexOf |       15 |   4.577 ns |
- String_IndexOfAny_2 |       15 |  11.978 ns |
+ String_IndexOfAny_2 |       15 |   8.620 ns |
- String_IndexOfAny_3 |       15 |  12.835 ns |
+ String_IndexOfAny_3 |       15 |  10.462 ns |
- String_IndexOfAny_4 |       15 |  14.271 ns |
+ String_IndexOfAny_4 |       15 |  11.411 ns |
- String_IndexOfAny_5 |       15 |  15.660 ns |
+ String_IndexOfAny_5 |       15 |  11.894 ns |

-      String_IndexOf |       16 |   8.223 ns |
+      String_IndexOf |       16 |   4.697 ns |
- String_IndexOfAny_2 |       16 |  11.919 ns |
+ String_IndexOfAny_2 |       16 |   8.616 ns |
- String_IndexOfAny_3 |       16 |  12.857 ns |
+ String_IndexOfAny_3 |       16 |  10.538 ns |
- String_IndexOfAny_4 |       16 |  14.544 ns |
+ String_IndexOfAny_4 |       16 |  11.411 ns |
- String_IndexOfAny_5 |       16 |  15.032 ns |
+ String_IndexOfAny_5 |       16 |  11.899 ns |

-      String_IndexOf |       31 |   9.292 ns |
+      String_IndexOf |       31 |   5.226 ns |
- String_IndexOfAny_2 |       31 |  12.521 ns |
+ String_IndexOfAny_2 |       31 |   9.206 ns |
- String_IndexOfAny_3 |       31 |  14.332 ns |
+ String_IndexOfAny_3 |       31 |  11.389 ns |
- String_IndexOfAny_4 |       31 |  15.877 ns |
+ String_IndexOfAny_4 |       31 |  12.327 ns |
- String_IndexOfAny_5 |       31 |  17.119 ns |
+ String_IndexOfAny_5 |       31 |  13.023 ns |

-      String_IndexOf |       32 |   9.276 ns |
+      String_IndexOf |       32 |   5.224 ns |
- String_IndexOfAny_2 |       32 |  12.519 ns |
+ String_IndexOfAny_2 |       32 |   9.208 ns |
- String_IndexOfAny_3 |       32 |  14.990 ns |
+ String_IndexOfAny_3 |       32 |  11.387 ns |
- String_IndexOfAny_4 |       32 |  15.861 ns |
+ String_IndexOfAny_4 |       32 |  13.056 ns |
- String_IndexOfAny_5 |       32 |  16.744 ns |
+ String_IndexOfAny_5 |       32 |  13.114 ns |

-      String_IndexOf |       63 |  11.307 ns |
+      String_IndexOf |       63 |   6.991 ns |
- String_IndexOfAny_2 |       63 |  14.935 ns |
+ String_IndexOfAny_2 |       63 |  10.207 ns |
- String_IndexOfAny_3 |       63 |  16.546 ns |
+ String_IndexOfAny_3 |       63 |  12.833 ns |
- String_IndexOfAny_4 |       63 |  18.757 ns |
+ String_IndexOfAny_4 |       63 |  14.488 ns |
- String_IndexOfAny_5 |       63 |  20.503 ns |
+ String_IndexOfAny_5 |       63 |  16.153 ns |

-      String_IndexOf |       64 |  11.196 ns |
+      String_IndexOf |       64 |   6.990 ns |
- String_IndexOfAny_2 |       64 |  14.828 ns |
+ String_IndexOfAny_2 |       64 |  10.332 ns |
- String_IndexOfAny_3 |       64 |  17.834 ns |
+ String_IndexOfAny_3 |       64 |  13.441 ns |
- String_IndexOfAny_4 |       64 |  18.735 ns |
+ String_IndexOfAny_4 |       64 |  14.534 ns |
- String_IndexOfAny_5 |       64 |  20.200 ns |
+ String_IndexOfAny_5 |       64 |  15.891 ns |

-      String_IndexOf |      127 |  14.386 ns |
+      String_IndexOf |      127 |   9.050 ns |
- String_IndexOfAny_2 |      127 |  21.806 ns |
+ String_IndexOfAny_2 |      127 |  12.716 ns |
- String_IndexOfAny_3 |      127 |  24.639 ns |
+ String_IndexOfAny_3 |      127 |  15.658 ns |
- String_IndexOfAny_4 |      127 |  29.164 ns |
+ String_IndexOfAny_4 |      127 |  19.461 ns |
- String_IndexOfAny_5 |      127 |  33.225 ns |
+ String_IndexOfAny_5 |      127 |  23.109 ns |

-      String_IndexOf |      128 |  14.643 ns |
+      String_IndexOf |      128 |   9.228 ns |
- String_IndexOfAny_2 |      128 |  21.848 ns |
+ String_IndexOfAny_2 |      128 |  12.544 ns |
- String_IndexOfAny_3 |      128 |  27.099 ns |
+ String_IndexOfAny_3 |      128 |  15.608 ns |
- String_IndexOfAny_4 |      128 |  29.107 ns |
+ String_IndexOfAny_4 |      128 |  19.520 ns |
- String_IndexOfAny_5 |      128 |  33.325 ns |
+ String_IndexOfAny_5 |      128 |  23.233 ns |

-      String_IndexOf |      255 |  21.395 ns |
+      String_IndexOf |      255 |  13.919 ns |
- String_IndexOfAny_2 |      255 |  30.815 ns |
+ String_IndexOfAny_2 |      255 |  17.898 ns |
- String_IndexOfAny_3 |      255 |  38.530 ns |
+ String_IndexOfAny_3 |      255 |  22.278 ns |
- String_IndexOfAny_4 |      255 |  40.402 ns |
+ String_IndexOfAny_4 |      255 |  30.889 ns |
- String_IndexOfAny_5 |      255 |  47.078 ns |
+ String_IndexOfAny_5 |      255 |  36.649 ns |

-      String_IndexOf |      256 |  21.452 ns |
+      String_IndexOf |      256 |  13.768 ns |
- String_IndexOfAny_2 |      256 |  30.896 ns |
+ String_IndexOfAny_2 |      256 |  17.715 ns |
- String_IndexOfAny_3 |      256 |  33.635 ns |
+ String_IndexOfAny_3 |      256 |  22.412 ns |
- String_IndexOfAny_4 |      256 |  40.521 ns |
+ String_IndexOfAny_4 |      256 |  30.565 ns |
- String_IndexOfAny_5 |      256 |  45.958 ns |
+ String_IndexOfAny_5 |      256 |  36.775 ns |

-      String_IndexOf |     1023 |  68.075 ns |
+      String_IndexOf |     1023 |  49.797 ns |
- String_IndexOfAny_2 |     1023 |  88.839 ns |
+ String_IndexOfAny_2 |     1023 |  68.134 ns |
- String_IndexOfAny_3 |     1023 |  97.165 ns |
+ String_IndexOfAny_3 |     1023 |  76.172 ns |
- String_IndexOfAny_4 |     1023 | 116.962 ns |
+ String_IndexOfAny_4 |     1023 | 113.140 ns |
- String_IndexOfAny_5 |     1023 | 131.489 ns |
+ String_IndexOfAny_5 |     1023 | 133.238 ns |

-      String_IndexOf |     1024 |  68.530 ns |
+      String_IndexOf |     1024 |  49.741 ns |
- String_IndexOfAny_2 |     1024 |  89.023 ns |
+ String_IndexOfAny_2 |     1024 |  68.261 ns |
- String_IndexOfAny_3 |     1024 |  97.223 ns |
+ String_IndexOfAny_3 |     1024 |  76.760 ns |
- String_IndexOfAny_4 |     1024 | 117.060 ns |
+ String_IndexOfAny_4 |     1024 | 113.144 ns |
- String_IndexOfAny_5 |     1024 | 131.619 ns |
+ String_IndexOfAny_5 |     1024 | 132.635 ns |

@benaadams
Copy link
Member Author

coreclr-ci error is EventPipeAndEtw; added to issue https://github.com/dotnet/coreclr/issues/19302#issuecomment-460021040 though can't retrigger it

@benaadams
Copy link
Member Author

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@benaadams benaadams mentioned this pull request Feb 7, 2019
// Note that MoveMask has converted the equal vector elements into a set of bit flags,
// So the bit position in 'matches' corresponds to the element offset.
// We preform the Or at non-Vector level as we are using the maximum number of non-preserved registers,
// and more causes them first to be pushed to stack and then popped on exit to preseve their values.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the prolog/epilog code have critical perf impact? I suggest doing vector-level OR to avoid the long latency MoveMask.

Btw, only Windows has callee-saved SIMD registers, so vector-level OR should improve more on Linux.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#22187 (comment)

The issue is the push and pops are a fixed cost; so it adds 64 bytes of write to stack and 64 bytes of read from stack even if the string is only 8 bytes long, so it has a significant impact for short lengths.

For longer lengths ( > 512 bytes) it gets amortized; but its use to check for invalid chars on short lengths quite often (html chars in a element name, special chars in a file search pattern etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't mean there isn't scope for a follow up 😉

Copy link

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. But I guess we probably can replace more MoveMask by Or.

@benaadams
Copy link
Member Author

benaadams commented Feb 7, 2019

@jkotas you asked in #21730 the performance effects of switching wcslen over to this implementation (last commit c7836d5)

           Method | Position |       Mean |     Error |    StdDev |
----------------- |--------- |-----------:|----------:|----------:|
   wcslen_Current |        1 |   2.735 ns | 0.0459 ns | 0.0429 ns |
 wcslen_Intrinsic |        1 |   5.238 ns | 0.0165 ns | 0.0155 ns |

   wcslen_Current |        4 |   4.406 ns | 0.0778 ns | 0.0728 ns |
 wcslen_Intrinsic |        4 |   5.231 ns | 0.0134 ns | 0.0126 ns |

   wcslen_Current |        5 |   4.688 ns | 0.0457 ns | 0.0427 ns |
 wcslen_Intrinsic |        5 |   5.228 ns | 0.0123 ns | 0.0109 ns |

   wcslen_Current |        6 |   4.790 ns | 0.0734 ns | 0.0686 ns |
 wcslen_Intrinsic |        6 |   5.235 ns | 0.0171 ns | 0.0151 ns |

   wcslen_Current |        7 |   5.714 ns | 0.0132 ns | 0.0117 ns |
 wcslen_Intrinsic |        7 |   5.230 ns | 0.0124 ns | 0.0104 ns |

   wcslen_Current |        8 |   5.256 ns | 0.0177 ns | 0.0148 ns |
 wcslen_Intrinsic |        8 |   5.231 ns | 0.0108 ns | 0.0101 ns |

   wcslen_Current |       15 |   5.814 ns | 0.0415 ns | 0.0388 ns |
 wcslen_Intrinsic |       15 |   6.500 ns | 0.0139 ns | 0.0109 ns |

   wcslen_Current |       16 |   6.193 ns | 0.0418 ns | 0.0391 ns |
 wcslen_Intrinsic |       16 |   6.510 ns | 0.0211 ns | 0.0187 ns |

   wcslen_Current |       24 |   7.794 ns | 0.1751 ns | 0.1638 ns |
 wcslen_Intrinsic |       24 |   6.525 ns | 0.0328 ns | 0.0291 ns |

   wcslen_Current |       31 |   9.328 ns | 0.0739 ns | 0.0692 ns |
 wcslen_Intrinsic |       31 |   7.153 ns | 0.0214 ns | 0.0200 ns |

   wcslen_Current |       32 |   9.400 ns | 0.0809 ns | 0.0717 ns |
 wcslen_Intrinsic |       32 |   6.686 ns | 0.0447 ns | 0.0418 ns |

   wcslen_Current |       33 |   8.757 ns | 0.1533 ns | 0.1434 ns |
 wcslen_Intrinsic |       33 |   6.703 ns | 0.0371 ns | 0.0347 ns |

   wcslen_Current |       63 |  13.125 ns | 0.2083 ns | 0.1846 ns |
 wcslen_Intrinsic |       63 |   7.953 ns | 0.0377 ns | 0.0334 ns |

   wcslen_Current |       64 |  13.409 ns | 0.1879 ns | 0.1666 ns |
 wcslen_Intrinsic |       64 |   8.333 ns | 0.0757 ns | 0.0709 ns |

   wcslen_Current |      127 |  26.741 ns | 0.3233 ns | 0.3024 ns |
 wcslen_Intrinsic |      127 |  10.215 ns | 0.0358 ns | 0.0335 ns |

   wcslen_Current |      128 |  25.681 ns | 0.2717 ns | 0.2541 ns |
 wcslen_Intrinsic |      128 |  10.652 ns | 0.0207 ns | 0.0194 ns |

   wcslen_Current |      255 |  48.372 ns | 0.2936 ns | 0.2746 ns |
 wcslen_Intrinsic |      255 |  15.141 ns | 0.0495 ns | 0.0463 ns |

   wcslen_Current |      256 |  48.583 ns | 0.3894 ns | 0.3642 ns |
 wcslen_Intrinsic |      256 |  14.756 ns | 0.0915 ns | 0.0714 ns |

   wcslen_Current |     1023 | 142.610 ns | 0.5968 ns | 0.5291 ns |
 wcslen_Intrinsic |     1023 |  55.348 ns | 0.1238 ns | 0.1098 ns |

   wcslen_Current |     1024 | 142.140 ns | 0.5755 ns | 0.5383 ns |
 wcslen_Intrinsic |     1024 |  55.931 ns | 0.1007 ns | 0.0942 ns |

I tried a preamble to align to UIntPtr then processed UIntPtr sizes with bittwiddling in the Sequential section; but that made it worse (branch forwards were better than the always run extra instructions)

@benaadams
Copy link
Member Author

Should be good to go/review; mostly copying byte version, though also changed byte version to do the Ors in vector space.

Ors are done in non-vector space where it hits the preserved vector registers; which would then impact short lengths #22187 (comment)

/cc @CarolEidt @tannergooding @jkotas

@benaadams
Copy link
Member Author

Going to split this in to separate PRs

@benaadams
Copy link
Member Author

Closed in favour of #22875, #22505, #22876, #22877, #22878, #22879 and #22880 which should hopefully make them easier to review

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants