
ARM64: Optimize IndexOf(byte) and IndexOf(char) APIs using intrinsics. #37624

Merged (8 commits) on Jun 12, 2020

Conversation

kunalspathak
Member

@kunalspathak kunalspathak commented Jun 8, 2020

Optimize IndexOf() using ARM64 intrinsics. The only portion left to optimize here is searching 128 bits at a time for a byte/char and, if found, locating the first lane that matched. SSE/AVX has MoveMask, which makes it easy to find the matching index, but I didn't find an equivalent instruction/intrinsic for that in ARM64. Hence I did a little masking to find the matching index. Here is my relevant C# code that selects the first lane that was set by CompareEqual():

private static int SelectLane(Vector128<ushort> vector)
{
    var mask = Vector128.Create((ushort)0, 1, 2, 3, 4, 5, 6, 7);
    var not = AdvSimd.Not(vector);
    var or = AdvSimd.Or(not, mask);
    return AdvSimd.Arm64.MinAcross(or).ToScalar();
}

which generates the following code:

G_M2140_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
        3D8007A0          str     q0, [fp,#16]
                                                ;; bbWeight=1    PerfScore 2.50
G_M2140_IG02:
        9C000190          ldr     q16, [@RWD16]
        3DC007B1          ldr     q17, [fp,#16]
        6E205A31          mvn     v17.16b, v17.16b
        4EB01E30          orr     v16.8h, v17.8h, v16.8h
        6E71AA10          uminv   h16, v16.8h
        0E023E00          umov    w0, v16.h[0]
                                                ;; bbWeight=1    PerfScore 9.00
G_M2140_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       000h, 000h, 001h, 000h, 002h, 000h, 003h, 000h, 004h, 000h, 005h, 000h, 006h, 000h, 007h, 000h
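The lane arithmetic above can be sanity-checked with a scalar model. The following Python sketch (an editor's illustration, not part of the PR; `select_lane` is a hypothetical name) mimics the Not / Or(index mask) / MinAcross semantics on the 8 ushort lanes of a CompareEqual result:

```python
def select_lane(cmp_lanes):
    """Scalar model of the Not -> Or(index mask) -> MinAcross sequence.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    Returns the index of the first matching lane, or 0xFFFF if none matched.
    """
    index_mask = range(8)                                # Vector128.Create(0, 1, ..., 7)
    inverted = [(~lane) & 0xFFFF for lane in cmp_lanes]  # AdvSimd.Not: match -> 0x0000, miss -> 0xFFFF
    merged = [inv | idx for inv, idx in zip(inverted, index_mask)]  # match -> its index, miss -> 0xFFFF
    return min(merged)                                   # MinAcross picks the smallest, i.e. first match
```

Because a missed lane stays at 0xFFFF after the Or, the minimum across lanes is exactly the first matching index.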

I also checked ARM's optimized-routines in LLVM, and it is equivalent to the following C# code:

 private static int Baseline(Vector128<ushort> vector)
{
    // var themask = Vector128.Create((byte)1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64);
    var themask = Vector128.Create((uint)0x40100401).AsByte();
    var masked = AdvSimd.And(vector.AsByte(), themask);
    var pair1 = AdvSimd.Arm64.AddPairwise(masked, masked);
    var pair2 = AdvSimd.Arm64.AddPairwise(pair1, pair1);
    var toscalar = pair2.AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(toscalar) >> 2;
}

The generated code for it is:

G_M3369_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
        3D8007A0          str     q0, [fp,#16]
                                                ;; bbWeight=1    PerfScore 2.50
G_M3369_IG02:
        9C0001F0          ldr     q16, [@RWD16]
        3DC007B1          ldr     q17, [fp,#16]
        4E301E30          and     v16.16b, v17.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E083E00          umov    x0, v16.d[0]
        DAC00000          rbit    x0, x0
        DAC01000          clz     x0, x0
        13027C00          asr     w0, w0, #2
                                                ;; bbWeight=1    PerfScore 12.00
G_M3369_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h
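A scalar model of this masked addp sequence, for reference (an editor's Python sketch of the NEON byte semantics, with hypothetical names; the real code checks for a match before computing the index, so the `-1` sentinel here is an assumption of the sketch):

```python
def first_match_addp(cmp_lanes):
    """Scalar model of And(mask) -> AddPairwise x2 -> TrailingZeroCount >> 2.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    """
    data = [b for lane in cmp_lanes for b in lane.to_bytes(2, 'little')]  # 16 bytes, little-endian
    mask = [0x01, 0x04, 0x10, 0x40] * 4               # 0x40100401 repeated across the vector
    masked = [d & m for d, m in zip(data, mask)]
    s = [masked[2*i] + masked[2*i + 1] for i in range(8)]    # 1st AddPairwise (low half)
    pair1 = s + s                                            # both operands identical -> two copies
    t = [pair1[2*i] + pair1[2*i + 1] for i in range(8)]      # 2nd AddPairwise (low half)
    scalar = int.from_bytes(bytes(t), 'little')
    if scalar == 0:
        return -1                                    # no lane matched (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 2        # TrailingZeroCount(scalar) >> 2
```

Each char lane ends up owning 4 bits of the 64-bit scalar (even lanes contribute 1+4, odd lanes 16+64 within a result byte), so the trailing-zero count divided by 4 is the lane index.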

I don't see why this code would be any better than the one this PR produces. @TamarChristinaArm, let me know what you think.

Here are the performance results; a negative change means less time, i.e. faster. Benchmark code borrowed from here.

Performance results
| Position | Before | After | Change |
|----------|--------|-------|--------|
| 0 | 7.371 | 6.873 | -7% |
| 1 | 7.373 | 8.822 | 20% |
| 2 | 9.609 | 8.925 | -7% |
| 3 | 7.668 | 8.822 | 15% |
| 4 | 8.367 | 8.926 | 7% |
| 7 | 8.811 | 8.96 | 2% |
| 8 | 13.145 | 12.322 | -6% |
| 9 | 10.157 | 8.925 | -12% |
| 10 | 13.093 | 12.276 | -6% |
| 11 | 13.094 | 12.319 | -6% |
| 12 | 10.34 | 12.275 | 19% |
| 13 | 13.094 | 12.275 | -6% |
| 14 | 10.426 | 12.27 | 18% |
| 15 | 10.242 | 12.274 | 20% |
| 16 | 13.649 | 12.274 | -10% |
| 17 | 11.436 | 9.992 | -13% |
| 18 | 13.694 | 12.321 | -10% |
| 19 | 11.436 | 10.155 | -11% |
| 20 | 13.609 | 10.05 | -26% |
| 21 | 13.639 | 12.276 | -10% |
| 22 | 11.452 | 12.335 | 8% |
| 23 | 13.602 | 9.992 | -27% |
| 24 | 14.924 | 9.988 | -33% |
| 25 | 12.147 | 13.433 | 11% |
| 26 | 14.92 | 13.43 | -10% |
| 27 | 12.09 | 13.434 | 11% |
| 28 | 14.961 | 10.042 | -33% |
| 29 | 12.083 | 13.48 | 12% |
| 30 | 14.917 | 13.433 | -10% |
| 31 | 14.925 | 9.988 | -33% |
| 32 | 13.306 | 13.43 | 1% |
| 33 | 15.545 | 13.48 | -13% |
| 34 | 13.301 | 13.475 | 1% |
| 35 | 15.544 | 11.132 | -28% |
| 36 | 15.571 | 11.136 | -28% |
| 37 | 15.541 | 13.429 | -14% |
| 38 | 15.544 | 13.428 | -14% |
| 39 | 15.573 | 13.429 | -14% |
| 40 | 14.038 | 14.589 | 4% |
| 41 | 13.995 | 14.621 | 4% |
| 42 | 16.795 | 14.584 | -13% |
| 43 | 14.045 | 14.629 | 4% |
| 44 | 16.836 | 14.622 | -13% |
| 45 | 16.795 | 11.324 | -33% |
| 46 | 16.794 | 14.62 | -13% |
| 47 | 16.839 | 11.356 | -33% |
| 48 | 17.457 | 14.644 | -16% |
| 49 | 17.488 | 14.584 | -17% |
| 50 | 17.462 | 14.583 | -16% |
| 51 | 17.463 | 14.584 | -16% |
| 52 | 17.494 | 14.621 | -16% |
| 53 | 17.455 | 14.587 | -16% |
| 54 | 17.455 | 14.584 | -16% |
| 55 | 17.456 | 14.588 | -16% |
| 56 | 18.752 | 15.746 | -16% |
| 57 | 18.753 | 15.743 | -16% |
| 58 | 18.73 | 15.746 | -16% |
| 59 | 18.743 | 15.743 | -16% |
| 60 | 18.757 | 12.287 | -34% |
| 61 | 18.713 | 15.746 | -16% |
| 62 | 18.756 | 15.742 | -16% |
| 63 | 18.768 | 15.747 | -16% |
| 64 | 19.415 | 15.743 | -19% |
| 65 | 19.385 | 13.483 | -30% |
| 66 | 19.429 | 15.742 | -19% |
| 67 | 19.42 | 15.785 | -19% |
| 68 | 19.415 | 15.746 | -19% |
| 69 | 17.19 | 15.743 | -8% |
| 70 | 19.43 | 13.437 | -31% |
| 71 | 19.383 | 15.747 | -19% |
| 72 | 20.691 | 16.901 | -18% |
| 73 | 17.841 | 16.899 | -5% |
| 74 | 18.03 | 16.896 | -6% |
| 75 | 20.66 | 16.933 | -18% |
| 76 | 20.652 | 16.895 | -18% |
| 77 | 18.029 | 16.901 | -6% |
| 78 | 17.886 | 13.44 | -25% |
| 79 | 20.64 | 13.441 | -35% |
| 80 | 19.114 | 16.902 | -12% |
| 81 | 21.305 | 14.6 | -31% |
| 82 | 21.308 | 14.64 | -31% |
| 83 | 21.342 | 14.6 | -32% |
| 84 | 19.191 | 16.945 | -12% |
| 85 | 21.302 | 16.901 | -21% |
| 86 | 21.313 | 16.901 | -21% |
| 126 | 26.477 | 16.909 | -36% |
| 127 | 26.507 | 16.904 | -36% |
| 128 | 24.971 | 20.412 | -18% |
| 129 | 27.155 | 18.109 | -33% |
| 130 | 27.146 | 20.369 | -25% |
| 131 | 27.111 | 20.37 | -25% |
| 250 | 41.815 | 29.792 | -29% |
| 251 | 39.094 | 29.636 | -24% |
| 252 | 41.754 | 26.197 | -37% |
| 253 | 39.046 | 29.704 | -24% |
| 254 | 41.844 | 29.716 | -29% |
| 255 | 74.629 | 53.029 | -29% |

Here is another run for a different array. Again, code borrowed from here.

Performance results
| Length | Before | After | Change |
|--------|--------|-------|--------|
| 32 | 16.13 | 10.06 | -38% |
| 33 | 12.45 | 17.17 | 38% |
| 34 | 19.06 | 12.01 | -37% |
| 35 | 13.8 | 12.78 | -7% |
| 36 | 13.84 | 18.05 | 30% |
| 37 | 14.59 | 18.51 | 27% |
| 38 | 15.05 | 18.93 | 26% |
| 39 | 15.85 | 19.7 | 24% |
| 40 | 15.35 | 13.43 | -13% |
| 41 | 16.79 | 14.52 | -14% |
| 42 | 17.62 | 15.31 | -13% |
| 43 | 18.38 | 16.08 | -13% |
| 44 | 17.63 | 16.35 | -7% |
| 45 | 17.76 | 16.11 | -9% |
| 46 | 18.52 | 17.23 | -7% |
| 47 | 19.27 | 17.96 | -7% |
| 48 | 18.05 | 16.67 | -8% |
| 49 | 20.23 | 18.43 | -9% |
| 50 | 15.18 | 19.2 | 26% |
| 51 | 21.78 | 19.98 | -8% |
| 52 | 21.05 | 19.58 | -7% |
| 53 | 16.52 | 19.93 | 21% |
| 54 | 17.12 | 20.41 | 19% |
| 55 | 17.83 | 21.14 | 19% |
| 56 | 16.54 | 14.59 | -12% |
| 57 | 18.75 | 15.52 | -17% |
| 58 | 18.65 | 16.38 | -12% |
| 59 | 20.32 | 17.13 | -16% |
| 60 | 19.49 | 17.26 | -11% |
| 61 | 19.7 | 17.32 | -12% |
| 62 | 20.41 | 18.11 | -11% |
| 63 | 21.21 | 18.81 | -11% |
| 64 | 15.92 | 17.73 | 11% |
| 65 | 16.3 | 19.68 | 21% |
| 66 | 17.09 | 20.53 | 20% |
| 67 | 17.65 | 21.32 | 21% |
| 68 | 22.95 | 20.72 | -10% |
| 69 | 23.12 | 20.81 | -10% |
| 70 | 19.06 | 21.51 | 13% |
| 71 | 19.76 | 22.29 | 13% |
| 72 | 18.43 | 15.79 | -14% |
| 73 | 19.76 | 16.66 | -16% |
| 74 | 20.56 | 17.44 | -15% |
| 75 | 21.11 | 18.2 | -14% |
| 76 | 21.46 | 18.15 | -15% |
| 77 | 21.57 | 18.47 | -14% |
| 78 | 22.37 | 19.09 | -15% |
| 79 | 23.08 | 19.89 | -14% |
| 80 | 21.91 | 18.86 | -14% |
| 81 | 18.15 | 14.66 | -19% |
| 82 | 19.02 | 15.45 | -19% |
| 83 | 25.66 | 22.39 | -13% |
| 84 | 19.63 | 21.76 | 11% |
| 85 | 20.38 | 21.97 | 8% |
| 86 | 20.94 | 22.58 | 8% |
| 126 | 28.1 | 22.48 | -20% |
| 127 | 29.01 | 23.52 | -19% |
| 128 | 27.69 | 22.32 | -19% |
| 129 | 24.03 | 24.22 | 1% |
| 130 | 24.78 | 25.01 | 1% |
| 131 | 25.37 | 25.79 | 2% |
| 250 | 42.77 | 30.21 | -29% |
| 251 | 43.43 | 31.85 | -27% |
| 252 | 42.57 | 31.11 | -27% |
| 253 | 42.74 | 31.3 | -27% |
| 254 | 43.49 | 35.61 | -18% |
| 255 | 44.25 | 32.7 | -26% |

Contributes to #33308 and #33707.

Update:

JIT_before_indexof.txt
JIT_after_indexof.txt

Here are the revised performance results; a negative change means less time, i.e. faster. Benchmark code borrowed from here.

Revised performance results
| Length | Before | After | Change |
|--------|--------|-------|--------|
| 32     | 16.15  | 11.44 | -29%         |
| 33     | 12.45  | 12.21 | -2%          |
| 34     | 19.09  | 12.97 | -32%         |
| 35     | 19.88  | 13.76 | -31%         |
| 36     | 19.1   | 13.36 | -30%         |
| 37     | 14.61  | 13.73 | -6%          |
| 38     | 15.05  | 14.52 | -4%          |
| 39     | 20.86  | 15.28 | -27%         |
| 40     | 15.55  | 14.16 | -9%          |
| 41     | 16.79  | 16.1  | -4%          |
| 42     | 17.63  | 16.89 | -4%          |
| 43     | 17.42  | 17.64 | 1%           |
| 44     | 17.67  | 17.12 | -3%          |
| 45     | 18.08  | 17.54 | -3%          |
| 46     | 18.61  | 18.29 | -2%          |
| 47     | 19.36  | 19.08 | -1%          |
| 48     | 18.07  | 12.97 | -28%         |
| 49     | 14.38  | 13.35 | -7%          |
| 50     | 21.07  | 14.12 | -33%         |
| 51     | 15.75  | 14.87 | -6%          |
| 52     | 21.07  | 14.5  | -31%         |
| 53     | 16.53  | 14.92 | -10%         |
| 54     | 17.07  | 15.66 | -8%          |
| 55     | 22.84  | 16.47 | -28%         |
| 56     | 16.54  | 15.33 | -7%          |
| 57     | 17.92  | 17.27 | -4%          |
| 58     | 19.53  | 18.02 | -8%          |
| 59     | 20.32  | 18.78 | -8%          |
| 60     | 19.28  | 18.29 | -5%          |
| 61     | 19.99  | 18.71 | -6%          |
| 62     | 20.54  | 19.5  | -5%          |
| 63     | 21.33  | 20.26 | -5%          |
| 64     | 15.94  | 14.2  | -11%         |
| 65     | 22.11  | 14.52 | -34%         |
| 66     | 23     | 15.29 | -34%         |
| 67     | 17.67  | 16.06 | -9%          |
| 68     | 17.74  | 16.07 | -9%          |
| 69     | 23.26  | 16.45 | -29%         |
| 70     | 24     | 17.21 | -28%         |
| 71     | 19.77  | 18.01 | -9%          |
| 72     | 19.4   | 16.84 | -13%         |
| 73     | 20.73  | 18.81 | -9%          |
| 74     | 20.53  | 19.57 | -5%          |
| 75     | 21.15  | 20.35 | -4%          |
| 76     | 21.2   | 19.84 | -6%          |
| 77     | 21.92  | 20.23 | -8%          |
| 78     | 22.46  | 21.01 | -6%          |
| 79     | 23.17  | 21.81 | -6%          |
| 80     | 17.87  | 15.68 | -12%         |
| 81     | 18.18  | 16.17 | -11%         |
| 82     | 18.97  | 16.97 | -11%         |
| 83     | 25.64  | 17.64 | -31%         |
| 84     | 24.89  | 17.2  | -31%         |
| 85     | 25.08  | 17.62 | -30%         |
| 86     | 25.84  | 18.41 | -29%         |
| 126    | 28.25  | 24.88 | -12%         |
| 127    | 29.09  | 25.66 | -12%         |
| 128    | 27.71  | 19.54 | -29%         |
| 129    | 23.96  | 19.91 | -17%         |
| 130    | 30.69  | 20.71 | -33%         |
| 131    | 31.45  | 21.46 | -32%         |
| 250    | 41.6   | 34.23 | -18%         |
| 251    | 42.38  | 34.98 | -17%         |
| 252    | 42.36  | 34.52 | -19%         |
| 253    | 42.74  | 38.53 | -10%         |
| 254    | 43.8   | 42.19 | -4%          |
| 255    | 44.41  | 36.51 | -18%         |

Here is another run for a different array. Again, code borrowed from here.

Revised performance results
| Position | Before | After | Change |
|----------|--------|-------|--------|
| 0        | 7.671  | 6.536  | -15%         |
| 1        | 9.649  | 7.329  | -24%         |
| 2        | 9.614  | 7.371  | -23%         |
| 3        | 9.614  | 10.364 | 8%           |
| 4        | 9.605  | 10.287 | 7%           |
| 7        | 8.906  | 8.89   | 0%           |
| 8        | 10.175 | 13.434 | 32%          |
| 9        | 10.354 | 13.43  | 30%          |
| 10       | 10.253 | 10.263 | 0%           |
| 11       | 13.249 | 13.434 | 1%           |
| 12       | 10.356 | 13.431 | 30%          |
| 13       | 10.229 | 13.433 | 31%          |
| 14       | 10.414 | 13.43  | 29%          |
| 15       | 10.212 | 10.292 | 1%           |
| 16       | 11.338 | 13.434 | 18%          |
| 17       | 13.677 | 13.43  | -2%          |
| 18       | 13.677 | 13.438 | -2%          |
| 19       | 11.343 | 13.473 | 19%          |
| 20       | 13.677 | 11.429 | -16%         |
| 21       | 13.68  | 13.476 | -1%          |
| 22       | 13.642 | 11.507 | -16%         |
| 23       | 11.341 | 13.48  | 19%          |
| 24       | 14.829 | 14.627 | -1%          |
| 25       | 12.108 | 14.599 | 21%          |
| 26       | 12.129 | 14.586 | 20%          |
| 27       | 12.132 | 14.637 | 21%          |
| 28       | 12.108 | 11.509 | -5%          |
| 29       | 14.804 | 11.458 | -23%         |
| 30       | 14.804 | 14.6   | -1%          |
| 31       | 14.823 | 14.637 | -1%          |
| 32       | 15.56  | 14.635 | -6%          |
| 33       | 15.577 | 14.596 | -6%          |
| 34       | 15.566 | 14.642 | -6%          |
| 35       | 13.28  | 13.014 | -2%          |
| 36       | 15.576 | 13.006 | -16%         |
| 37       | 15.56  | 14.592 | -6%          |
| 38       | 13.259 | 14.592 | 10%          |
| 39       | 15.566 | 12.991 | -17%         |
| 40       | 14.02  | 16.128 | 15%          |
| 41       | 16.726 | 16.128 | -4%          |
| 42       | 16.726 | 16.131 | -4%          |
| 43       | 16.745 | 16.131 | -4%          |
| 44       | 16.737 | 16.166 | -3%          |
| 45       | 16.73  | 16.129 | -4%          |
| 46       | 16.749 | 16.172 | -3%          |
| 47       | 16.749 | 13.028 | -22%         |
| 48       | 15.185 | 16.166 | 6%           |
| 49       | 17.494 | 16.132 | -8%          |
| 50       | 17.496 | 16.128 | -8%          |
| 51       | 17.488 | 16.173 | -8%          |
| 52       | 17.495 | 16.131 | -8%          |
| 53       | 15.181 | 16.127 | 6%           |
| 54       | 17.493 | 16.132 | -8%          |
| 55       | 17.489 | 14.137 | -19%         |
| 56       | 18.653 | 17.287 | -7%          |
| 57       | 18.665 | 14.145 | -24%         |
| 58       | 18.672 | 17.286 | -7%          |
| 59       | 15.962 | 14.137 | -11%         |
| 60       | 15.971 | 17.334 | 9%           |
| 61       | 18.651 | 17.286 | -7%          |
| 62       | 18.676 | 17.287 | -7%          |
| 63       | 18.655 | 17.286 | -7%          |
| 64       | 19.429 | 17.33  | -11%         |
| 65       | 19.415 | 17.328 | -11%         |
| 66       | 19.414 | 17.33  | -11%         |
| 67       | 19.421 | 15.674 | -19%         |
| 68       | 19.43  | 15.699 | -19%         |
| 69       | 19.412 | 15.675 | -19%         |
| 70       | 19.415 | 17.329 | -11%         |
| 71       | 19.412 | 17.286 | -11%         |
| 72       | 20.608 | 15.665 | -24%         |
| 73       | 20.587 | 18.863 | -8%          |
| 74       | 17.9   | 18.818 | 5%           |
| 75       | 20.589 | 18.819 | -9%          |
| 76       | 18.074 | 18.819 | 4%           |
| 77       | 20.593 | 18.827 | -9%          |
| 78       | 17.898 | 18.818 | 5%           |
| 79       | 20.594 | 18.87  | -8%          |
| 80       | 19.046 | 18.819 | -1%          |
| 81       | 21.352 | 18.866 | -12%         |
| 82       | 21.352 | 18.868 | -12%         |
| 83       | 21.337 | 18.822 | -12%         |
| 84       | 19.05  | 18.818 | -1%          |
| 85       | 21.418 | 16.867 | -21%         |
| 86       | 19.053 | 18.853 | -1%          |
| 126      | 26.355 | 22.747 | -14%         |
| 127      | 26.349 | 19.562 | -26%         |
| 128      | 27.12  | 22.737 | -16%         |
| 129      | 27.12  | 22.694 | -16%         |
| 130      | 27.123 | 22.745 | -16%         |
| 131      | 27.122 | 22.699 | -16%         |
| 250      | 41.753 | 30.331 | -27%         |
| 251      | 41.753 | 30.376 | -27%         |
| 252      | 39.081 | 33.504 | -14%         |
| 253      | 39.18  | 30.366 | -22%         |
| 254      | 41.749 | 33.514 | -20%         |
| 255      | 74.301 | 60.107 | -19%         |

@kunalspathak
Member Author

@dotnet/jit-contrib , @TamarChristinaArm , @tannergooding , @GrabYourPitchforks, @danmosemsft

@TamarChristinaArm
Contributor

I don't see why this code would be any better than the one this PR produces. @TamarChristinaArm, let me know what you think.

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

@kunalspathak
Member Author

kunalspathak commented Jun 9, 2020

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

Thanks @TamarChristinaArm. So switching to the other version would probably be the sensible thing to do? #Resolved

@TamarChristinaArm
Contributor

TamarChristinaArm commented Jun 9, 2020

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

Thanks @TamarChristinaArm. So switching to the other version would probably be the sensible thing to do?

I would say yes; even though you may see a smaller improvement on the hardware you are benchmarking on now, in the long run it would be the better sequence. #Resolved

@AntonLapounov
Member

AntonLapounov commented Jun 9, 2020

uminv is a reasonably expensive operation and addp has been made cheaper.

@TamarChristinaArm Is there any table explaining instruction costs? #Resolved

@kunalspathak
Member Author

@TamarChristinaArm Is there any table explaining instruction costs?

So far I was referring to the Cortex-A55 optimization guide, but looking at the Cortex-A76 optimization guide, uminv takes 6 cycles vs. 2 cycles for addp.

@tannergooding
Member

tannergooding commented Jun 9, 2020

Overall LGTM, just a few questions/comments 😄 #Resolved

@kunalspathak
Member Author

Overall LGTM, just a few questions/comments 😄

Thanks @tannergooding . As Tamar suggested, I am going to switch over to the other implementation and then address your comments.

Also moved the code into a common method, `FindFirstMatchedLane`, that can be used in other APIs as well.
@AntonLapounov
Member

private static int Baseline(Vector128<ushort> vector)
{
    // var themask = Vector128.Create((byte)1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64);
    var themask = Vector128.Create((uint)0x40100401).AsByte();
    var masked = AdvSimd.And(vector.AsByte(), themask);
    var pair1 = AdvSimd.Arm64.AddPairwise(masked, masked);
    var pair2 = AdvSimd.Arm64.AddPairwise(pair1, pair1);
    var toscalar = pair2.AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(toscalar) >> 2;
}

I do not think we need to use a mask at all. The input vector has 8 elements, each of which is either 0000 or ffff. Would the following work?

var zip = AdvSimd.Arm64.AddPairwise(vector.AsByte(), vector.AsByte()); // Sequence of 00 and fe bytes, two identical copies
var scalar = zip.AsUInt64().ToScalar(); // Number like 0x00fe00fefe000000, each char corresponds to 1 byte = 8 bits
return BitOperations.TrailingZeroCount(scalar) >> 3;
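A scalar sketch of this maskless variant (an editor's Python model of the byte semantics, with hypothetical names): 0xFF + 0xFF wraps to 0xFE in a byte lane, so each matching char leaves bit 1 of its byte set, and the trailing-zero count shifted right by 3 recovers the lane index.

```python
def first_match_char(cmp_lanes):
    """Scalar model of the maskless AddPairwise for Vector128<ushort>.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    """
    data = [b for lane in cmp_lanes for b in lane.to_bytes(2, 'little')]   # 16 bytes
    zipped = [(data[2*i] + data[2*i + 1]) & 0xFF for i in range(8)]        # 0xFF + 0xFF -> 0xFE
    scalar = int.from_bytes(bytes(zipped), 'little')   # e.g. a number like 0x00fe00fefe000000
    if scalar == 0:
        return -1                                      # no match (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 3  # TrailingZeroCount(scalar) >> 3
```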

@kunalspathak
Member Author

I do not think we need to use a mask at all. On input vector has 4 elements, each of them is either 0000 or ffff. Would the following work?

var zip = AdvSimd.Arm64.AddPairwise(vector.AsByte(), vector.AsByte()); // Sequence of 00 and fe bytes, two identical copies
var scalar = zip.AsUInt64().ToScalar(); // Number like 0x00fe00fefe000000, each char corresponds to 1 byte = 8 bits
return BitOperations.TrailingZeroCount(scalar) >> 3;

Thanks for your suggestion, Anton. Did you mean to also zip something first, given that you collect the result of AddPairwise in a zip variable? Regardless, we need two AddPairwise calls to aggregate the result into the 0th lane so we can use AsUInt64().ToScalar() to extract it. The mask is needed to identify the lanes; without it, AddPairwise could add consecutive set bits, resulting in a loss of information about the lane.

@AntonLapounov
Member

@kunalspathak Why do we need two AddPairwise? One AddPairwise reduces the size from 128 bits to 64 bits, which gives you the scalar. Two AddPairwise would give you a 32-bit value. And there is no information loss, since my AddPairwise adds two identical bytes.

@kunalspathak
Member Author

@kunalspathak Why do we need two AddPairwise? One AddPairwise reduces the size from 128 bit to 64 bit, which gives you the scalar. Two AddPairwise would give you a 32 bit value. And there is no information lost since my AddPairwise adds two identical bytes.

Ah, I see what you mean. I was thinking your code was for byte, but it is for char. Yes, you are right, we don't need the 2nd AddPairwise. However, for byte we still need a mask to find which lane is selected, though I can get rid of one AddPairwise. Here is what it looks like for byte now:

var themask = Vector128.Create((byte)1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16);
var masked = AdvSimd.And(vector, themask);
var zip = AdvSimd.Arm64.AddPairwise(masked, masked);
return BitOperations.TrailingZeroCount(zip.AsUInt64().ToScalar()) >> 2;
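As a scalar sketch of why this works (an editor's Python model, not part of the PR): with the alternating 1/16 mask, a matching byte lane k lands on bit 4k of the 64-bit scalar, so the trailing-zero count shifted right by 2 recovers k.

```python
def first_match_byte(cmp_lanes):
    """Scalar model of And(1,16,...) -> AddPairwise -> TrailingZeroCount >> 2.

    cmp_lanes: 16 byte lanes from CompareEqual, each 0x00 or 0xFF.
    """
    mask = [0x01, 0x10] * 8
    masked = [v & m for v, m in zip(cmp_lanes, mask)]            # match -> 1 (even lane) or 16 (odd lane)
    zipped = [masked[2*i] + masked[2*i + 1] for i in range(8)]   # AddPairwise (low half)
    scalar = int.from_bytes(bytes(zipped), 'little')
    if scalar == 0:
        return -1                                     # no match (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 2            # TrailingZeroCount(scalar) >> 2
```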

@AntonLapounov
Member

@kunalspathak Yes, I was thinking about the char case only. Sorry for not being clear.

@AntonLapounov
Member

AntonLapounov commented Jun 12, 2020

Looks great to me. One related thing I noticed is that the Contains method is not optimized the same way as the IndexOf method while a comment on the former says "Adapted from IndexOf(...)". I am wondering if we need to update Contains methods in these two files as well. @danmosemsft

@kunalspathak
Member Author

Looks great to me. One related thing I noticed is that the Contains method is not optimized the same way as the IndexOf method while a comment on the former says "Adapted from IndexOf(...)". I am wondering if we need to update Contains methods in these two files as well. @danmosemsft

IndexOf got optimized in dotnet/coreclr#22118, but Contains never was. While working on IndexOfAny(byte), I noticed that we don't optimize IndexOfAny(char) with x86 intrinsics either.

@kunalspathak kunalspathak merged commit 3228c46 into dotnet:master Jun 12, 2020
@kunalspathak kunalspathak deleted the spanhelpers branch June 12, 2020 18:53
@kunalspathak
Member Author

Found the following tracking issues:
