
ARM64: Optimize IndexOf(byte) and IndexOf(char) APIs using intrinsics. #37624

Merged (8 commits) on Jun 12, 2020

Conversation

kunalspathak
Member

@kunalspathak kunalspathak commented Jun 8, 2020

Optimize IndexOf() using ARM64 intrinsics. The only portion left to optimize here is searching 128 bits at a time for a byte/char and, if found, locating the first lane that matched. SSE/AVX has MoveMask, which makes it easy to find the matching index, but I didn't find an equivalent instruction/intrinsic for that in ARM64. Hence I did a little masking to find the matching index. Here is my relevant C# code that selects the first lane that was set by CompareEqual():

private static int SelectLane(Vector128<ushort> vector)
{
    var mask = Vector128.Create((ushort)0, 1, 2, 3, 4, 5, 6, 7);
    var not = AdvSimd.Not(vector);
    var or = AdvSimd.Or(not, mask);
    return AdvSimd.Arm64.MinAcross(or).ToScalar();
}

which generates the following code:

G_M2140_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
        3D8007A0          str     q0, [fp,#16]
                                                ;; bbWeight=1    PerfScore 2.50
G_M2140_IG02:
        9C000190          ldr     q16, [@RWD16]
        3DC007B1          ldr     q17, [fp,#16]
        6E205A31          mvn     v17.16b, v17.16b
        4EB01E30          orr     v16.8h, v17.8h, v16.8h
        6E71AA10          uminv   h16, v16.8h
        0E023E00          umov    w0, v16.h[0]
                                                ;; bbWeight=1    PerfScore 9.00
G_M2140_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       000h, 000h, 001h, 000h, 002h, 000h, 003h, 000h, 004h, 000h, 005h, 000h, 006h, 000h, 007h, 000h
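The lane arithmetic above can be sanity-checked with a scalar model. The following Python sketch (an editor's illustration, not part of the PR; `select_lane` is a hypothetical name) mimics the Not / Or(index mask) / MinAcross semantics on the 8 ushort lanes of a CompareEqual result:

```python
def select_lane(cmp_lanes):
    """Scalar model of the Not -> Or(index mask) -> MinAcross sequence.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    Returns the index of the first matching lane, or 0xFFFF if none matched.
    """
    index_mask = range(8)                                # Vector128.Create(0, 1, ..., 7)
    inverted = [(~lane) & 0xFFFF for lane in cmp_lanes]  # AdvSimd.Not: match -> 0x0000, miss -> 0xFFFF
    merged = [inv | idx for inv, idx in zip(inverted, index_mask)]  # match -> its index, miss -> 0xFFFF
    return min(merged)                                   # MinAcross picks the smallest, i.e. first match
```

Because a missed lane stays at 0xFFFF after the Or, the minimum across lanes is exactly the first matching index.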

I also checked ARM's optimized-routines in LLVM, and it is equivalent to the following C# code:

 private static int Baseline(Vector128<ushort> vector)
{
    // var themask = Vector128.Create((byte)1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64);
    var themask = Vector128.Create((uint)0x40100401).AsByte();
    var masked = AdvSimd.And(vector.AsByte(), themask);
    var pair1 = AdvSimd.Arm64.AddPairwise(masked, masked);
    var pair2 = AdvSimd.Arm64.AddPairwise(pair1, pair1);
    var toscalar = pair2.AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(toscalar) >> 2;
}

The generated code for it is:

G_M3369_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
        3D8007A0          str     q0, [fp,#16]
                                                ;; bbWeight=1    PerfScore 2.50
G_M3369_IG02:
        9C0001F0          ldr     q16, [@RWD16]
        3DC007B1          ldr     q17, [fp,#16]
        4E301E30          and     v16.16b, v17.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E30BE10          addp    v16.16b, v16.16b, v16.16b
        4E083E00          umov    x0, v16.d[0]
        DAC00000          rbit    x0, x0
        DAC01000          clz     x0, x0
        13027C00          asr     w0, w0, #2
                                                ;; bbWeight=1    PerfScore 12.00
G_M3369_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h, 001h, 004h, 010h, 040h
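A scalar model of this masked addp sequence, for reference (an editor's Python sketch of the NEON byte semantics, with hypothetical names; the real code checks for a match before computing the index, so the `-1` sentinel here is an assumption of the sketch):

```python
def first_match_addp(cmp_lanes):
    """Scalar model of And(mask) -> AddPairwise x2 -> TrailingZeroCount >> 2.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    """
    data = [b for lane in cmp_lanes for b in lane.to_bytes(2, 'little')]  # 16 bytes, little-endian
    mask = [0x01, 0x04, 0x10, 0x40] * 4               # 0x40100401 repeated across the vector
    masked = [d & m for d, m in zip(data, mask)]
    s = [masked[2*i] + masked[2*i + 1] for i in range(8)]    # 1st AddPairwise (low half)
    pair1 = s + s                                            # both operands identical -> two copies
    t = [pair1[2*i] + pair1[2*i + 1] for i in range(8)]      # 2nd AddPairwise (low half)
    scalar = int.from_bytes(bytes(t), 'little')
    if scalar == 0:
        return -1                                    # no lane matched (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 2        # TrailingZeroCount(scalar) >> 2
```

Each char lane ends up owning 4 bits of the 64-bit scalar (even lanes contribute 1+4, odd lanes 16+64 within a result byte), so the trailing-zero count divided by 4 is the lane index.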

I don't see why this code would be any better than the one this PR produces. @TamarChristinaArm, let me know what you think.

Here are the performance results; a negative change means less time, i.e. faster. Benchmark code borrowed from here.

Performance results
| Position | Before | After | Change |
|----------|--------|-------|--------|
| 0 | 7.371 | 6.873 | -7% |
| 1 | 7.373 | 8.822 | 20% |
| 2 | 9.609 | 8.925 | -7% |
| 3 | 7.668 | 8.822 | 15% |
| 4 | 8.367 | 8.926 | 7% |
| 7 | 8.811 | 8.96 | 2% |
| 8 | 13.145 | 12.322 | -6% |
| 9 | 10.157 | 8.925 | -12% |
| 10 | 13.093 | 12.276 | -6% |
| 11 | 13.094 | 12.319 | -6% |
| 12 | 10.34 | 12.275 | 19% |
| 13 | 13.094 | 12.275 | -6% |
| 14 | 10.426 | 12.27 | 18% |
| 15 | 10.242 | 12.274 | 20% |
| 16 | 13.649 | 12.274 | -10% |
| 17 | 11.436 | 9.992 | -13% |
| 18 | 13.694 | 12.321 | -10% |
| 19 | 11.436 | 10.155 | -11% |
| 20 | 13.609 | 10.05 | -26% |
| 21 | 13.639 | 12.276 | -10% |
| 22 | 11.452 | 12.335 | 8% |
| 23 | 13.602 | 9.992 | -27% |
| 24 | 14.924 | 9.988 | -33% |
| 25 | 12.147 | 13.433 | 11% |
| 26 | 14.92 | 13.43 | -10% |
| 27 | 12.09 | 13.434 | 11% |
| 28 | 14.961 | 10.042 | -33% |
| 29 | 12.083 | 13.48 | 12% |
| 30 | 14.917 | 13.433 | -10% |
| 31 | 14.925 | 9.988 | -33% |
| 32 | 13.306 | 13.43 | 1% |
| 33 | 15.545 | 13.48 | -13% |
| 34 | 13.301 | 13.475 | 1% |
| 35 | 15.544 | 11.132 | -28% |
| 36 | 15.571 | 11.136 | -28% |
| 37 | 15.541 | 13.429 | -14% |
| 38 | 15.544 | 13.428 | -14% |
| 39 | 15.573 | 13.429 | -14% |
| 40 | 14.038 | 14.589 | 4% |
| 41 | 13.995 | 14.621 | 4% |
| 42 | 16.795 | 14.584 | -13% |
| 43 | 14.045 | 14.629 | 4% |
| 44 | 16.836 | 14.622 | -13% |
| 45 | 16.795 | 11.324 | -33% |
| 46 | 16.794 | 14.62 | -13% |
| 47 | 16.839 | 11.356 | -33% |
| 48 | 17.457 | 14.644 | -16% |
| 49 | 17.488 | 14.584 | -17% |
| 50 | 17.462 | 14.583 | -16% |
| 51 | 17.463 | 14.584 | -16% |
| 52 | 17.494 | 14.621 | -16% |
| 53 | 17.455 | 14.587 | -16% |
| 54 | 17.455 | 14.584 | -16% |
| 55 | 17.456 | 14.588 | -16% |
| 56 | 18.752 | 15.746 | -16% |
| 57 | 18.753 | 15.743 | -16% |
| 58 | 18.73 | 15.746 | -16% |
| 59 | 18.743 | 15.743 | -16% |
| 60 | 18.757 | 12.287 | -34% |
| 61 | 18.713 | 15.746 | -16% |
| 62 | 18.756 | 15.742 | -16% |
| 63 | 18.768 | 15.747 | -16% |
| 64 | 19.415 | 15.743 | -19% |
| 65 | 19.385 | 13.483 | -30% |
| 66 | 19.429 | 15.742 | -19% |
| 67 | 19.42 | 15.785 | -19% |
| 68 | 19.415 | 15.746 | -19% |
| 69 | 17.19 | 15.743 | -8% |
| 70 | 19.43 | 13.437 | -31% |
| 71 | 19.383 | 15.747 | -19% |
| 72 | 20.691 | 16.901 | -18% |
| 73 | 17.841 | 16.899 | -5% |
| 74 | 18.03 | 16.896 | -6% |
| 75 | 20.66 | 16.933 | -18% |
| 76 | 20.652 | 16.895 | -18% |
| 77 | 18.029 | 16.901 | -6% |
| 78 | 17.886 | 13.44 | -25% |
| 79 | 20.64 | 13.441 | -35% |
| 80 | 19.114 | 16.902 | -12% |
| 81 | 21.305 | 14.6 | -31% |
| 82 | 21.308 | 14.64 | -31% |
| 83 | 21.342 | 14.6 | -32% |
| 84 | 19.191 | 16.945 | -12% |
| 85 | 21.302 | 16.901 | -21% |
| 86 | 21.313 | 16.901 | -21% |
| 126 | 26.477 | 16.909 | -36% |
| 127 | 26.507 | 16.904 | -36% |
| 128 | 24.971 | 20.412 | -18% |
| 129 | 27.155 | 18.109 | -33% |
| 130 | 27.146 | 20.369 | -25% |
| 131 | 27.111 | 20.37 | -25% |
| 250 | 41.815 | 29.792 | -29% |
| 251 | 39.094 | 29.636 | -24% |
| 252 | 41.754 | 26.197 | -37% |
| 253 | 39.046 | 29.704 | -24% |
| 254 | 41.844 | 29.716 | -29% |
| 255 | 74.629 | 53.029 | -29% |

Here is another run for a different array. Again, code borrowed from here.

Performance results
| Length | Before | After | Change |
|--------|--------|-------|--------|
| 32 | 16.13 | 10.06 | -38% |
| 33 | 12.45 | 17.17 | 38% |
| 34 | 19.06 | 12.01 | -37% |
| 35 | 13.8 | 12.78 | -7% |
| 36 | 13.84 | 18.05 | 30% |
| 37 | 14.59 | 18.51 | 27% |
| 38 | 15.05 | 18.93 | 26% |
| 39 | 15.85 | 19.7 | 24% |
| 40 | 15.35 | 13.43 | -13% |
| 41 | 16.79 | 14.52 | -14% |
| 42 | 17.62 | 15.31 | -13% |
| 43 | 18.38 | 16.08 | -13% |
| 44 | 17.63 | 16.35 | -7% |
| 45 | 17.76 | 16.11 | -9% |
| 46 | 18.52 | 17.23 | -7% |
| 47 | 19.27 | 17.96 | -7% |
| 48 | 18.05 | 16.67 | -8% |
| 49 | 20.23 | 18.43 | -9% |
| 50 | 15.18 | 19.2 | 26% |
| 51 | 21.78 | 19.98 | -8% |
| 52 | 21.05 | 19.58 | -7% |
| 53 | 16.52 | 19.93 | 21% |
| 54 | 17.12 | 20.41 | 19% |
| 55 | 17.83 | 21.14 | 19% |
| 56 | 16.54 | 14.59 | -12% |
| 57 | 18.75 | 15.52 | -17% |
| 58 | 18.65 | 16.38 | -12% |
| 59 | 20.32 | 17.13 | -16% |
| 60 | 19.49 | 17.26 | -11% |
| 61 | 19.7 | 17.32 | -12% |
| 62 | 20.41 | 18.11 | -11% |
| 63 | 21.21 | 18.81 | -11% |
| 64 | 15.92 | 17.73 | 11% |
| 65 | 16.3 | 19.68 | 21% |
| 66 | 17.09 | 20.53 | 20% |
| 67 | 17.65 | 21.32 | 21% |
| 68 | 22.95 | 20.72 | -10% |
| 69 | 23.12 | 20.81 | -10% |
| 70 | 19.06 | 21.51 | 13% |
| 71 | 19.76 | 22.29 | 13% |
| 72 | 18.43 | 15.79 | -14% |
| 73 | 19.76 | 16.66 | -16% |
| 74 | 20.56 | 17.44 | -15% |
| 75 | 21.11 | 18.2 | -14% |
| 76 | 21.46 | 18.15 | -15% |
| 77 | 21.57 | 18.47 | -14% |
| 78 | 22.37 | 19.09 | -15% |
| 79 | 23.08 | 19.89 | -14% |
| 80 | 21.91 | 18.86 | -14% |
| 81 | 18.15 | 14.66 | -19% |
| 82 | 19.02 | 15.45 | -19% |
| 83 | 25.66 | 22.39 | -13% |
| 84 | 19.63 | 21.76 | 11% |
| 85 | 20.38 | 21.97 | 8% |
| 86 | 20.94 | 22.58 | 8% |
| 126 | 28.1 | 22.48 | -20% |
| 127 | 29.01 | 23.52 | -19% |
| 128 | 27.69 | 22.32 | -19% |
| 129 | 24.03 | 24.22 | 1% |
| 130 | 24.78 | 25.01 | 1% |
| 131 | 25.37 | 25.79 | 2% |
| 250 | 42.77 | 30.21 | -29% |
| 251 | 43.43 | 31.85 | -27% |
| 252 | 42.57 | 31.11 | -27% |
| 253 | 42.74 | 31.3 | -27% |
| 254 | 43.49 | 35.61 | -18% |
| 255 | 44.25 | 32.7 | -26% |

Contributes to #33308 and #33707.

Update:

JIT_before_indexof.txt
JIT_after_indexof.txt

Here are the revised performance results; a negative change means less time, i.e. faster. Benchmark code borrowed from here.

Revised performance results
| Length | Before | After | Change |
|--------|--------|-------|--------|
| 32     | 16.15  | 11.44 | -29%         |
| 33     | 12.45  | 12.21 | -2%          |
| 34     | 19.09  | 12.97 | -32%         |
| 35     | 19.88  | 13.76 | -31%         |
| 36     | 19.1   | 13.36 | -30%         |
| 37     | 14.61  | 13.73 | -6%          |
| 38     | 15.05  | 14.52 | -4%          |
| 39     | 20.86  | 15.28 | -27%         |
| 40     | 15.55  | 14.16 | -9%          |
| 41     | 16.79  | 16.1  | -4%          |
| 42     | 17.63  | 16.89 | -4%          |
| 43     | 17.42  | 17.64 | 1%           |
| 44     | 17.67  | 17.12 | -3%          |
| 45     | 18.08  | 17.54 | -3%          |
| 46     | 18.61  | 18.29 | -2%          |
| 47     | 19.36  | 19.08 | -1%          |
| 48     | 18.07  | 12.97 | -28%         |
| 49     | 14.38  | 13.35 | -7%          |
| 50     | 21.07  | 14.12 | -33%         |
| 51     | 15.75  | 14.87 | -6%          |
| 52     | 21.07  | 14.5  | -31%         |
| 53     | 16.53  | 14.92 | -10%         |
| 54     | 17.07  | 15.66 | -8%          |
| 55     | 22.84  | 16.47 | -28%         |
| 56     | 16.54  | 15.33 | -7%          |
| 57     | 17.92  | 17.27 | -4%          |
| 58     | 19.53  | 18.02 | -8%          |
| 59     | 20.32  | 18.78 | -8%          |
| 60     | 19.28  | 18.29 | -5%          |
| 61     | 19.99  | 18.71 | -6%          |
| 62     | 20.54  | 19.5  | -5%          |
| 63     | 21.33  | 20.26 | -5%          |
| 64     | 15.94  | 14.2  | -11%         |
| 65     | 22.11  | 14.52 | -34%         |
| 66     | 23     | 15.29 | -34%         |
| 67     | 17.67  | 16.06 | -9%          |
| 68     | 17.74  | 16.07 | -9%          |
| 69     | 23.26  | 16.45 | -29%         |
| 70     | 24     | 17.21 | -28%         |
| 71     | 19.77  | 18.01 | -9%          |
| 72     | 19.4   | 16.84 | -13%         |
| 73     | 20.73  | 18.81 | -9%          |
| 74     | 20.53  | 19.57 | -5%          |
| 75     | 21.15  | 20.35 | -4%          |
| 76     | 21.2   | 19.84 | -6%          |
| 77     | 21.92  | 20.23 | -8%          |
| 78     | 22.46  | 21.01 | -6%          |
| 79     | 23.17  | 21.81 | -6%          |
| 80     | 17.87  | 15.68 | -12%         |
| 81     | 18.18  | 16.17 | -11%         |
| 82     | 18.97  | 16.97 | -11%         |
| 83     | 25.64  | 17.64 | -31%         |
| 84     | 24.89  | 17.2  | -31%         |
| 85     | 25.08  | 17.62 | -30%         |
| 86     | 25.84  | 18.41 | -29%         |
| 126    | 28.25  | 24.88 | -12%         |
| 127    | 29.09  | 25.66 | -12%         |
| 128    | 27.71  | 19.54 | -29%         |
| 129    | 23.96  | 19.91 | -17%         |
| 130    | 30.69  | 20.71 | -33%         |
| 131    | 31.45  | 21.46 | -32%         |
| 250    | 41.6   | 34.23 | -18%         |
| 251    | 42.38  | 34.98 | -17%         |
| 252    | 42.36  | 34.52 | -19%         |
| 253    | 42.74  | 38.53 | -10%         |
| 254    | 43.8   | 42.19 | -4%          |
| 255    | 44.41  | 36.51 | -18%         |

Here is another run for a different array. Again, code borrowed from here.

Revised performance results
| Position | Before | After | Change |
|----------|--------|-------|--------|
| 0        | 7.671  | 6.536  | -15%         |
| 1        | 9.649  | 7.329  | -24%         |
| 2        | 9.614  | 7.371  | -23%         |
| 3        | 9.614  | 10.364 | 8%           |
| 4        | 9.605  | 10.287 | 7%           |
| 7        | 8.906  | 8.89   | 0%           |
| 8        | 10.175 | 13.434 | 32%          |
| 9        | 10.354 | 13.43  | 30%          |
| 10       | 10.253 | 10.263 | 0%           |
| 11       | 13.249 | 13.434 | 1%           |
| 12       | 10.356 | 13.431 | 30%          |
| 13       | 10.229 | 13.433 | 31%          |
| 14       | 10.414 | 13.43  | 29%          |
| 15       | 10.212 | 10.292 | 1%           |
| 16       | 11.338 | 13.434 | 18%          |
| 17       | 13.677 | 13.43  | -2%          |
| 18       | 13.677 | 13.438 | -2%          |
| 19       | 11.343 | 13.473 | 19%          |
| 20       | 13.677 | 11.429 | -16%         |
| 21       | 13.68  | 13.476 | -1%          |
| 22       | 13.642 | 11.507 | -16%         |
| 23       | 11.341 | 13.48  | 19%          |
| 24       | 14.829 | 14.627 | -1%          |
| 25       | 12.108 | 14.599 | 21%          |
| 26       | 12.129 | 14.586 | 20%          |
| 27       | 12.132 | 14.637 | 21%          |
| 28       | 12.108 | 11.509 | -5%          |
| 29       | 14.804 | 11.458 | -23%         |
| 30       | 14.804 | 14.6   | -1%          |
| 31       | 14.823 | 14.637 | -1%          |
| 32       | 15.56  | 14.635 | -6%          |
| 33       | 15.577 | 14.596 | -6%          |
| 34       | 15.566 | 14.642 | -6%          |
| 35       | 13.28  | 13.014 | -2%          |
| 36       | 15.576 | 13.006 | -16%         |
| 37       | 15.56  | 14.592 | -6%          |
| 38       | 13.259 | 14.592 | 10%          |
| 39       | 15.566 | 12.991 | -17%         |
| 40       | 14.02  | 16.128 | 15%          |
| 41       | 16.726 | 16.128 | -4%          |
| 42       | 16.726 | 16.131 | -4%          |
| 43       | 16.745 | 16.131 | -4%          |
| 44       | 16.737 | 16.166 | -3%          |
| 45       | 16.73  | 16.129 | -4%          |
| 46       | 16.749 | 16.172 | -3%          |
| 47       | 16.749 | 13.028 | -22%         |
| 48       | 15.185 | 16.166 | 6%           |
| 49       | 17.494 | 16.132 | -8%          |
| 50       | 17.496 | 16.128 | -8%          |
| 51       | 17.488 | 16.173 | -8%          |
| 52       | 17.495 | 16.131 | -8%          |
| 53       | 15.181 | 16.127 | 6%           |
| 54       | 17.493 | 16.132 | -8%          |
| 55       | 17.489 | 14.137 | -19%         |
| 56       | 18.653 | 17.287 | -7%          |
| 57       | 18.665 | 14.145 | -24%         |
| 58       | 18.672 | 17.286 | -7%          |
| 59       | 15.962 | 14.137 | -11%         |
| 60       | 15.971 | 17.334 | 9%           |
| 61       | 18.651 | 17.286 | -7%          |
| 62       | 18.676 | 17.287 | -7%          |
| 63       | 18.655 | 17.286 | -7%          |
| 64       | 19.429 | 17.33  | -11%         |
| 65       | 19.415 | 17.328 | -11%         |
| 66       | 19.414 | 17.33  | -11%         |
| 67       | 19.421 | 15.674 | -19%         |
| 68       | 19.43  | 15.699 | -19%         |
| 69       | 19.412 | 15.675 | -19%         |
| 70       | 19.415 | 17.329 | -11%         |
| 71       | 19.412 | 17.286 | -11%         |
| 72       | 20.608 | 15.665 | -24%         |
| 73       | 20.587 | 18.863 | -8%          |
| 74       | 17.9   | 18.818 | 5%           |
| 75       | 20.589 | 18.819 | -9%          |
| 76       | 18.074 | 18.819 | 4%           |
| 77       | 20.593 | 18.827 | -9%          |
| 78       | 17.898 | 18.818 | 5%           |
| 79       | 20.594 | 18.87  | -8%          |
| 80       | 19.046 | 18.819 | -1%          |
| 81       | 21.352 | 18.866 | -12%         |
| 82       | 21.352 | 18.868 | -12%         |
| 83       | 21.337 | 18.822 | -12%         |
| 84       | 19.05  | 18.818 | -1%          |
| 85       | 21.418 | 16.867 | -21%         |
| 86       | 19.053 | 18.853 | -1%          |
| 126      | 26.355 | 22.747 | -14%         |
| 127      | 26.349 | 19.562 | -26%         |
| 128      | 27.12  | 22.737 | -16%         |
| 129      | 27.12  | 22.694 | -16%         |
| 130      | 27.123 | 22.745 | -16%         |
| 131      | 27.122 | 22.699 | -16%         |
| 250      | 41.753 | 30.331 | -27%         |
| 251      | 41.753 | 30.376 | -27%         |
| 252      | 39.081 | 33.504 | -14%         |
| 253      | 39.18  | 30.366 | -22%         |
| 254      | 41.749 | 33.514 | -20%         |
| 255      | 74.301 | 60.107 | -19%         |

@kunalspathak
Member Author

@dotnet/jit-contrib , @TamarChristinaArm , @tannergooding , @GrabYourPitchforks, @danmosemsft

@TamarChristinaArm
Contributor

I don't see why this code would be any better than the one this PR produces. @TamarChristinaArm, let me know what you think.

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

@kunalspathak
Member Author

kunalspathak commented Jun 9, 2020

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

Thanks @TamarChristinaArm. So switching to the other version would probably be the sensible thing to do? #Resolved

@TamarChristinaArm
Contributor

TamarChristinaArm commented Jun 9, 2020

@kunalspathak the answer is... it depends. On older cores such as the Cortex-A55 your sequence will probably be faster; on newer cores like the Cortex-A76 (which is what AoR is tuned for) the addp version will beat the uminv one by a decent margin. uminv is a reasonably expensive operation and addp has been made cheaper. Even though it has more instructions, the addp one will take less time to get through the pipeline.

Thanks @TamarChristinaArm. So switching to the other version would probably be the sensible thing to do?

I would say yes; even though you may see a smaller improvement on the hardware you are benchmarking on now, in the long run it would be the better sequence. #Resolved

@AntonLapounov
Member

AntonLapounov commented Jun 9, 2020

uminv is a reasonably expensive operation and addp has been made cheaper.

@TamarChristinaArm Is there any table explaining instruction costs? #Resolved

@kunalspathak
Member Author

@TamarChristinaArm Is there any table explaining instruction costs?

So far I was referring to the Cortex-A55 optimization guide, but looking at the Cortex-A76 optimization guide, uminv takes 6 cycles vs. 2 cycles for addp.

@tannergooding
Member

tannergooding commented Jun 9, 2020

Overall LGTM, just a few questions/comments 😄 #Resolved

@kunalspathak
Member Author

Overall LGTM, just a few questions/comments 😄

Thanks @tannergooding . As Tamar suggested, I am going to switch over to the other implementation and then address your comments.

Also moved the code into a common method, `FindFirstMatchedLane`, that can be used in other APIs as well.
@AntonLapounov
Member

private static int Baseline(Vector128<ushort> vector)
{
    // var themask = Vector128.Create((byte)1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64, 1, 4, 16, 64);
    var themask = Vector128.Create((uint)0x40100401).AsByte();
    var masked = AdvSimd.And(vector.AsByte(), themask);
    var pair1 = AdvSimd.Arm64.AddPairwise(masked, masked);
    var pair2 = AdvSimd.Arm64.AddPairwise(pair1, pair1);
    var toscalar = pair2.AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(toscalar) >> 2;
}

I do not think we need to use a mask at all. The input vector has 8 elements, each of which is either 0000 or ffff. Would the following work?

var zip = AdvSimd.Arm64.AddPairwise(vector.AsByte(), vector.AsByte()); // Sequence of 00 and fe bytes, two identical copies
var scalar = zip.AsUInt64().ToScalar(); // Number like 0x00fe00fefe000000, each char corresponds to 1 byte = 8 bits
return BitOperations.TrailingZeroCount(scalar) >> 3;
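A scalar sketch of this maskless variant (an editor's Python model of the byte semantics, with hypothetical names): 0xFF + 0xFF wraps to 0xFE in a byte lane, so each matching char leaves bit 1 of its byte set, and the trailing-zero count shifted right by 3 recovers the lane index.

```python
def first_match_char(cmp_lanes):
    """Scalar model of the maskless AddPairwise for Vector128<ushort>.

    cmp_lanes: 8 ushort lanes from CompareEqual, each 0x0000 or 0xFFFF.
    """
    data = [b for lane in cmp_lanes for b in lane.to_bytes(2, 'little')]   # 16 bytes
    zipped = [(data[2*i] + data[2*i + 1]) & 0xFF for i in range(8)]        # 0xFF + 0xFF -> 0xFE
    scalar = int.from_bytes(bytes(zipped), 'little')   # e.g. a number like 0x00fe00fefe000000
    if scalar == 0:
        return -1                                      # no match (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 3  # TrailingZeroCount(scalar) >> 3
```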

@kunalspathak
Member Author

I do not think we need to use a mask at all. On input vector has 4 elements, each of them is either 0000 or ffff. Would the following work?

var zip = AdvSimd.Arm64.AddPairwise(vector.AsByte(), vector.AsByte()); // Sequence of 00 and fe bytes, two identical copies
var scalar = zip.AsUInt64().ToScalar(); // Number like 0x00fe00fefe000000, each char corresponds to 1 byte = 8 bits
return BitOperations.TrailingZeroCount(scalar) >> 3;

Thanks for your suggestion, Anton. Did you mean to also zip something first, given that you collect the result of AddPairwise in a zip variable? Regardless, we need two AddPairwise calls to aggregate the result into the 0th lane so we can use AsUInt64().ToScalar() to extract it. The mask is needed to identify the lanes; without it, AddPairwise could add consecutive set bits, resulting in a loss of information about the lane.

@AntonLapounov
Member

@kunalspathak Why do we need two AddPairwise? One AddPairwise reduces the size from 128 bits to 64 bits, which gives you the scalar. Two AddPairwise would give you a 32-bit value. And there is no information loss, since my AddPairwise adds two identical bytes.

@kunalspathak
Member Author

@kunalspathak Why do we need two AddPairwise? One AddPairwise reduces the size from 128 bit to 64 bit, which gives you the scalar. Two AddPairwise would give you a 32 bit value. And there is no information lost since my AddPairwise adds two identical bytes.

Ah, I see what you mean. I was thinking your code was for byte, but it is for char. Yes, you are right, we don't need the 2nd AddPairwise. However, for byte we still need a mask to find which lane is selected, though I can get rid of one AddPairwise. Here is what it looks like for byte now:

var themask = Vector128.Create((byte)1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16, 1, 16);
var masked = AdvSimd.And(vector, themask);
var zip = AdvSimd.Arm64.AddPairwise(masked, masked);
return BitOperations.TrailingZeroCount(zip.AsUInt64().ToScalar()) >> 2;
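As a scalar sketch of why this works (an editor's Python model, not part of the PR): with the alternating 1/16 mask, a matching byte lane k lands on bit 4k of the 64-bit scalar, so the trailing-zero count shifted right by 2 recovers k.

```python
def first_match_byte(cmp_lanes):
    """Scalar model of And(1,16,...) -> AddPairwise -> TrailingZeroCount >> 2.

    cmp_lanes: 16 byte lanes from CompareEqual, each 0x00 or 0xFF.
    """
    mask = [0x01, 0x10] * 8
    masked = [v & m for v, m in zip(cmp_lanes, mask)]            # match -> 1 (even lane) or 16 (odd lane)
    zipped = [masked[2*i] + masked[2*i + 1] for i in range(8)]   # AddPairwise (low half)
    scalar = int.from_bytes(bytes(zipped), 'little')
    if scalar == 0:
        return -1                                     # no match (sentinel for this sketch)
    return ((scalar & -scalar).bit_length() - 1) >> 2            # TrailingZeroCount(scalar) >> 2
```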

@AntonLapounov
Member

@kunalspathak Yes, I was thinking about the char case only. Sorry for not being clear.

@AntonLapounov
Member

AntonLapounov commented Jun 12, 2020

Looks great to me. One related thing I noticed is that the Contains method is not optimized the same way as the IndexOf method while a comment on the former says "Adapted from IndexOf(...)". I am wondering if we need to update Contains methods in these two files as well. @danmosemsft

@kunalspathak
Member Author

Looks great to me. One related thing I noticed is that the Contains method is not optimized the same way as the IndexOf method while a comment on the former says "Adapted from IndexOf(...)". I am wondering if we need to update Contains methods in these two files as well. @danmosemsft

IndexOf got optimized in dotnet/coreclr#22118, but Contains never was. While working on IndexOfAny(byte), I noticed that we don't optimize IndexOfAny(char) with x86 intrinsics either.

@kunalspathak kunalspathak merged commit 3228c46 into dotnet:master Jun 12, 2020
@kunalspathak kunalspathak deleted the spanhelpers branch June 12, 2020 18:53
@kunalspathak
Member Author

Found the following tracking issues:
