
ggml : synchronize threads using barriers #7993

Merged: 5 commits from sl/test-omp-sync into master on Jun 19, 2024
Conversation

@slaren (Collaborator) commented Jun 18, 2024

Replaces the spinlocks in the CPU backend with barriers, using openmp if possible.

| CPU | Model | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | stories260k F32 | 8 | pp512 | 37153.86 | 34901.28 | 0.94 |
| i9-13900K | stories260k F32 | 8 | tg128 | 2926.66 | 4913.27 | 1.68 |
| i9-13900K | stories260k F32 | 16 | pp512 | 20069.17 | 21167.47 | 1.05 |
| i9-13900K | stories260k F32 | 16 | tg128 | 1450.93 | 3405.28 | 2.35 |
| i9-13900K | stories260k F32 | 24 | pp512 | 24257.18 | 26325.85 | 1.09 |
| i9-13900K | stories260k F32 | 24 | tg128 | 1022.16 | 2734.91 | 2.68 |
| i9-13900K | 1B Q8_0 | 8 | pp128 | 340.93 | 336.43 | 0.99 |
| i9-13900K | 1B Q8_0 | 8 | tg32 | 70.62 | 73.01 | 1.03 |
| i9-13900K | 1B Q8_0 | 16 | pp128 | 437.92 | 424.76 | 0.97 |
| i9-13900K | 1B Q8_0 | 16 | tg32 | 58.12 | 64.29 | 1.11 |
| i9-13900K | 1B Q8_0 | 24 | pp128 | 521.29 | 506.86 | 0.97 |
| i9-13900K | 1B Q8_0 | 24 | tg32 | 53.88 | 63.47 | 1.18 |
| i9-13900K | 7B Q4_0 | 8 | pp128 | 49.77 | 49.81 | 1.00 |
| i9-13900K | 7B Q4_0 | 8 | tg32 | 21.07 | 21.41 | 1.02 |
| i9-13900K | 7B Q4_0 | 16 | pp128 | 63.52 | 64.56 | 1.02 |
| i9-13900K | 7B Q4_0 | 16 | tg32 | 20.71 | 21.93 | 1.06 |
| i9-13900K | 7B Q4_0 | 24 | pp128 | 73.48 | 73.90 | 1.01 |
| i9-13900K | 7B Q4_0 | 24 | tg32 | 17.93 | 19.24 | 1.07 |
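
For context, a minimal sketch of the synchronization pattern involved (illustrative only, not the actual ggml code; compute_node is a hypothetical stand-in for the per-thread work): with OpenMP, every thread computes its slice of a graph node and then waits at a barrier before moving on to the next node, instead of spinning on a shared atomic counter.

```c
#include <omp.h>
#include <stdio.h>

// Hypothetical stand-in for the per-thread slice of work for one graph node.
static void compute_node(int node, int ith, int nth) {
    (void)node; (void)ith; (void)nth;
}

// Each node is processed by all threads; the barrier makes every thread wait
// until the node is fully computed before any thread starts the next one.
static void compute_graph(int n_threads, int n_nodes) {
    #pragma omp parallel num_threads(n_threads)
    {
        const int ith = omp_get_thread_num();
        for (int node = 0; node < n_nodes; node++) {
            compute_node(node, ith, n_threads);
            #pragma omp barrier
        }
    }
}

int main(void) {
    compute_graph(8, 4); // e.g. 8 threads over 4 nodes
    printf("done\n");
    return 0;
}
```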

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 18, 2024
@Green-Sky (Collaborator)

> spinlocks

Is there a reason it is not using mutexes?

@slaren (Collaborator, Author) commented Jun 18, 2024

Since these waits are often very short, spinlocks resulted in better performance. OpenMP is also likely spinning for a while before sleeping.
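
To illustrate the trade-off being described, a hedged sketch (not code from this PR): for very short waits, busy-polling an atomic flag avoids the cost of parking and waking a thread through the kernel, and a common compromise is to spin for a bounded number of iterations before yielding, which is roughly what OpenMP runtimes do before blocking.

```c
#include <stdatomic.h>
#include <sched.h>

// Spin-then-yield wait on a flag. For waits of a few microseconds this tends
// to beat a mutex/condvar, which has to go through the kernel to park and
// wake the thread. The spin threshold here is arbitrary, for illustration.
static void wait_for_flag(atomic_bool * done) {
    int spins = 0;
    while (!atomic_load_explicit(done, memory_order_acquire)) {
        if (++spins > 1000) {
            sched_yield(); // give up the time slice once the wait drags on
        }
    }
}
```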

@Green-Sky (Collaborator) commented Jun 18, 2024

It's very hard to get faster than most mutex implementations, since they use staggered strategies: most will spin for a while and then switch to something else...

Was this actually measured?

@slaren (Collaborator, Author) commented Jun 18, 2024

There were several PRs to change the synchronization mechanism in the past, but they always resulted in worse performance.

@github-actions github-actions bot added the devops improvements to build systems and github actions label Jun 18, 2024
@slaren (Collaborator, Author) commented Jun 18, 2024

The fallback seems to perform ok on the M3 Max.

LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1

| CPU | Model | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M3 Max | stories260k F32 | 4 | pp512 | 24415.91 | 24354.48 | 1.00 |
| M3 Max | stories260k F32 | 4 | tg128 | 6134.50 | 7296.25 | 1.19 |
| M3 Max | stories260k F32 | 8 | pp512 | 40683.65 | 41209.17 | 1.01 |
| M3 Max | stories260k F32 | 8 | tg128 | 2503.32 | 3272.53 | 1.31 |
| M3 Max | stories260k F32 | 12 | pp512 | 55364.55 | 55204.87 | 1.00 |
| M3 Max | stories260k F32 | 12 | tg128 | 1977.22 | 2406.93 | 1.22 |
| M3 Max | stories260k F32 | 16 | pp512 | 44830.02 | 49383.96 | 1.10 |
| M3 Max | stories260k F32 | 16 | tg128 | 1058.14 | 1436.44 | 1.36 |
| M3 Max | 1B Q8_0 | 4 | pp128 | 201.11 | 190.80 | 0.95 |
| M3 Max | 1B Q8_0 | 4 | tg64 | 100.37 | 101.50 | 1.01 |
| M3 Max | 1B Q8_0 | 8 | pp128 | 359.73 | 353.41 | 0.98 |
| M3 Max | 1B Q8_0 | 8 | tg64 | 95.07 | 99.61 | 1.05 |
| M3 Max | 1B Q8_0 | 12 | pp128 | 425.86 | 455.97 | 1.07 |
| M3 Max | 1B Q8_0 | 12 | tg64 | 93.11 | 97.74 | 1.05 |
| M3 Max | 1B Q8_0 | 16 | pp128 | 408.75 | 494.34 | 1.21 |
| M3 Max | 1B Q8_0 | 16 | tg64 | 26.22 | 59.93 | 2.29 |
| M3 Max | 7B Q4_0 | 4 | pp128 | 25.15 | 24.98 | 0.99 |
| M3 Max | 7B Q4_0 | 4 | tg64 | 22.10 | 22.08 | 1.00 |
| M3 Max | 7B Q4_0 | 8 | pp128 | 46.31 | 46.11 | 1.00 |
| M3 Max | 7B Q4_0 | 8 | tg64 | 29.73 | 30.26 | 1.02 |
| M3 Max | 7B Q4_0 | 12 | pp128 | 58.64 | 58.82 | 1.00 |
| M3 Max | 7B Q4_0 | 12 | tg64 | 29.12 | 29.91 | 1.03 |
| M3 Max | 7B Q4_0 | 16 | pp128 | 65.84 | 67.42 | 1.02 |
| M3 Max | 7B Q4_0 | 16 | tg64 | 6.70 | 24.69 | 3.69 |
| M3 Max | 7B F32 | 4 | pp128 | 9.65 | 9.66 | 1.00 |
| M3 Max | 7B F32 | 4 | tg64 | 4.75 | 4.77 | 1.00 |
| M3 Max | 7B F32 | 8 | pp128 | 17.40 | 17.87 | 1.03 |
| M3 Max | 7B F32 | 8 | tg64 | 4.75 | 4.77 | 1.00 |
| M3 Max | 7B F32 | 12 | pp128 | 22.86 | 22.93 | 1.00 |
| M3 Max | 7B F32 | 12 | tg64 | 4.70 | 4.74 | 1.01 |
| M3 Max | 7B F32 | 16 | pp128 | 23.05 | 24.70 | 1.07 |
| M3 Max | 7B F32 | 16 | tg64 | 0.77 | 4.62 | 5.99 |

@slaren slaren marked this pull request as ready for review June 18, 2024 18:55
@slaren (Collaborator, Author) commented Jun 18, 2024

The server thread sanitizer run failed the first time, but I don't understand the error, I cannot reproduce it locally, and the second run passed. It may have been a fluke, but it's something to look at if it continues failing intermittently after this is merged.

@slaren slaren changed the title ggml : synchronize using openmp barriers ggml : synchronize threads using barriers Jun 18, 2024
@slaren (Collaborator, Author) commented Jun 18, 2024

With MSVC:

| CPU | Model | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | stories260k F32 | 8 | pp512 | 29900.69 | 29218.33 | 0.98 |
| i9-13900K | stories260k F32 | 8 | tg128 | 2168.78 | 3604.36 | 1.66 |
| i9-13900K | stories260k F32 | 16 | pp512 | 14942.76 | 15333.24 | 1.03 |
| i9-13900K | stories260k F32 | 16 | tg128 | 651.22 | 2253.29 | 3.46 |
| i9-13900K | stories260k F32 | 24 | pp512 | 18434.45 | 19764.31 | 1.07 |
| i9-13900K | stories260k F32 | 24 | tg128 | 418.43 | 1637.96 | 3.91 |
| i9-13900K | 1B Q8_0 | 8 | pp128 | 298.91 | 304.23 | 1.02 |
| i9-13900K | 1B Q8_0 | 8 | tg64 | 60.63 | 65.54 | 1.08 |
| i9-13900K | 1B Q8_0 | 16 | pp128 | 254.52 | 257.34 | 1.01 |
| i9-13900K | 1B Q8_0 | 16 | tg64 | 43.94 | 58.29 | 1.33 |
| i9-13900K | 1B Q8_0 | 24 | pp128 | 355.98 | 368.96 | 1.04 |
| i9-13900K | 1B Q8_0 | 24 | tg64 | 41.27 | 64.36 | 1.56 |
| i9-13900K | 7B Q4_0 | 8 | pp128 | 37.51 | 38.40 | 1.02 |
| i9-13900K | 7B Q4_0 | 8 | tg64 | 15.62 | 15.96 | 1.02 |
| i9-13900K | 7B Q4_0 | 16 | pp128 | 32.53 | 32.61 | 1.00 |
| i9-13900K | 7B Q4_0 | 16 | tg64 | 11.88 | 12.99 | 1.09 |
| i9-13900K | 7B Q4_0 | 24 | pp128 | 47.63 | 48.02 | 1.01 |
| i9-13900K | 7B Q4_0 | 24 | tg64 | 12.60 | 15.55 | 1.23 |

@slaren (Collaborator, Author) commented Jun 19, 2024

I was surprised that the performance with MSVC had fallen so much behind, and I found that the reason is llamafile.

MSVC without llamafile:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 8 | pp128 | 337.27 ± 4.21 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 8 | tg64 | 71.37 ± 0.59 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 16 | pp128 | 417.18 ± 11.49 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 16 | tg64 | 65.59 ± 0.45 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 24 | pp128 | 502.06 ± 17.67 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 24 | tg64 | 67.69 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp128 | 46.79 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg64 | 21.08 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp128 | 61.67 ± 0.54 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg64 | 21.56 ± 0.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 24 | pp128 | 73.49 ± 0.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 24 | tg64 | 19.43 ± 0.02 |

@ggerganov (Owner)

I noticed that the thread sanitizer workflow also failed last week and it reported a data race in this case:

4bfe50f#comments

So maybe it's not necessarily related to the changes from this PR.

@slaren (Collaborator, Author) commented Jun 19, 2024

That one seems to be caused by using OpenMP. Thread sanitizer is not directly compatible with OpenMP and produces false positives. I changed the workflow to avoid using OpenMP with thread sanitizer, so that shouldn't happen anymore. It was already done in build.yml in the initial PR, but I missed the server CI.
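
For reference, the kind of change meant here is building the sanitizer job without OpenMP; a hedged example of such an invocation (the option names LLAMA_SANITIZE_THREAD and LLAMA_OPENMP are assumptions and may not match the actual workflow exactly):

cmake -B build -DLLAMA_SANITIZE_THREAD=ON -DLLAMA_OPENMP=OFF && cmake --build build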

@ggerganov (Owner) commented Jun 19, 2024

Results on M1 Pro and M2 Ultra:

LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 ./scripts/compare-commits.sh master sl/test-omp-sync -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
| CPU | Model | Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | llama 1B F16 | 2.05 | 4 | pp128 | 113.37 | 109.21 | 0.96 |
| M1 Pro | llama 1B F16 | 2.05 | 4 | tg64 | 44.83 | 45.88 | 1.02 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | pp128 | 204.39 | 172.73 | 0.85 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | tg64 | 38.98 | 49.44 | 1.27 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | pp128 | 128.83 | 123.33 | 0.96 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | tg64 | 90.91 | 90.20 | 0.99 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | pp128 | 206.68 | 203.71 | 0.99 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | tg64 | 117.38 | 122.94 | 1.05 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | pp128 | 160.17 | 143.42 | 0.90 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | tg64 | 71.12 | 71.98 | 1.01 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | pp128 | 257.38 | 237.44 | 0.92 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | tg64 | 60.25 | 84.80 | 1.41 |

| CPU | Model | Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | llama 1B F16 | 2.05 | 4 | pp128 | 118.08 | 121.59 | 1.03 |
| M2 Ultra | llama 1B F16 | 2.05 | 4 | tg64 | 58.73 | 53.21 | 0.91 |
| M2 Ultra | llama 1B F16 | 2.05 | 8 | pp128 | 238.14 | 239.27 | 1.00 |
| M2 Ultra | llama 1B F16 | 2.05 | 8 | tg64 | 66.46 | 69.03 | 1.04 |
| M2 Ultra | llama 1B F16 | 2.05 | 16 | pp128 | 439.46 | 432.57 | 0.98 |
| M2 Ultra | llama 1B F16 | 2.05 | 16 | tg64 | 72.97 | 82.59 | 1.13 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 4 | pp128 | 147.08 | 147.61 | 1.00 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 4 | tg64 | 90.74 | 91.91 | 1.01 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 8 | pp128 | 285.33 | 285.09 | 1.00 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 8 | tg64 | 136.46 | 145.17 | 1.06 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 16 | pp128 | 526.63 | 526.64 | 1.00 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | 16 | tg64 | 136.58 | 164.73 | 1.21 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 4 | pp128 | 174.99 | 175.82 | 1.00 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 4 | tg64 | 76.69 | 80.51 | 1.05 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 8 | pp128 | 336.11 | 336.49 | 1.00 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 8 | tg64 | 92.01 | 98.79 | 1.07 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 16 | pp128 | 595.29 | 612.25 | 1.03 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | 16 | tg64 | 108.28 | 123.60 | 1.14 |

The M2 Ultra seems to respond better to the changes.

| CPU | Model | Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 4 | pp128 | 113.46 | 115.32 | 1.02 |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 4 | tg64 | 14.61 | 14.67 | 1.00 |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 8 | pp128 | 207.30 | 207.07 | 1.00 |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 8 | tg64 | 14.20 | 14.31 | 1.01 |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 16 | pp128 | 302.62 | 300.31 | 0.99 |
| Ryzen 5950X 16C | 1B F16 | 2.05 | 16 | tg64 | 13.69 | 13.92 | 1.02 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 4 | pp128 | 99.55 | 101.64 | 1.02 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 4 | tg64 | 46.12 | 45.91 | 1.00 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 8 | pp128 | 191.73 | 193.95 | 1.01 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 8 | tg64 | 45.50 | 47.11 | 1.04 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 16 | pp128 | 320.21 | 327.24 | 1.02 |
| Ryzen 5950X 16C | 1B Q4_0 | 0.59 | 16 | tg64 | 43.68 | 46.12 | 1.06 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 4 | pp128 | 122.47 | 124.77 | 1.02 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 4 | tg64 | 26.48 | 26.57 | 1.00 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 8 | pp128 | 233.34 | 234.81 | 1.01 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 8 | tg64 | 25.78 | 26.25 | 1.02 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 16 | pp128 | 370.48 | 373.76 | 1.01 |
| Ryzen 5950X 16C | 1B Q8_0 | 1.09 | 16 | tg64 | 24.87 | 25.63 | 1.03 |

@slaren (Collaborator, Author) commented Jun 19, 2024

Does it help if the sched_yield is removed from ggml_barrier?
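
For readers following along, a hypothetical counter-based barrier of the kind a non-OpenMP fallback might use (illustrative names and structure, not the actual ggml_barrier code), showing where a sched_yield in the wait loop would sit:

```c
#include <stdatomic.h>
#include <sched.h>

// Hypothetical counter-based barrier (illustrative, not the actual ggml_barrier).
typedef struct {
    atomic_int n_arrived; // threads that have reached the barrier this round
    atomic_int n_rounds;  // number of completed barrier rounds
    int        n_threads;
} barrier_t;

static void barrier_wait(barrier_t * b) {
    const int gen = atomic_load(&b->n_rounds);

    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        // last thread to arrive: reset the counter and release the others
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->n_rounds, 1);
    } else {
        // spin until the round counter advances; this sched_yield() is the
        // call being asked about - removing it turns this into a pure busy-wait
        while (atomic_load(&b->n_rounds) == gen) {
            sched_yield();
        }
    }
}
```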

@ggerganov (Owner)

After removing sched_yield, here are the results:

| CPU | Model | Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | llama 1B F16 | 2.05 | 4 | pp128 | 113.82 | 113.25 | 0.99 |
| M1 Pro | llama 1B F16 | 2.05 | 4 | tg64 | 44.87 | 45.90 | 1.02 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | pp128 | 220.78 | 188.19 | 0.85 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | tg64 | 38.35 | 39.43 | 1.03 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | pp128 | 136.69 | 126.87 | 0.93 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | tg64 | 91.48 | 92.21 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | pp128 | 230.98 | 209.10 | 0.91 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | tg64 | 85.29 | 124.21 | 1.46 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | pp128 | 164.59 | 153.72 | 0.93 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | tg64 | 70.59 | 72.08 | 1.02 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | pp128 | 289.09 | 251.00 | 0.87 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | tg64 | 55.05 | 57.00 | 1.04 |

@slaren (Collaborator, Author) commented Jun 19, 2024

The results with the M1 Pro seem to vary quite a bit between runs; in some cases the differences between master in the previous run and this one are larger than the difference between master and the PR, even though it is the same test.

@ggerganov (Owner)

Hm, I might be doing something wrong in the benchmarks. I now ran llama-bench alone (i.e. without scripts/compare-commits.sh) and got the following results on this branch as it is (i.e. without removing sched_yield):

LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 make -j && ./llama-bench -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 4 | pp128 | 112.98 ± 1.06 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 4 | tg64 | 45.37 ± 0.57 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 8 | pp128 | 213.35 ± 1.99 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 8 | tg64 | 49.28 ± 0.51 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 4 | pp128 | 131.81 ± 2.81 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 4 | tg64 | 91.77 ± 0.12 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 8 | pp128 | 213.46 ± 1.45 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 8 | tg64 | 125.84 ± 1.08 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 4 | pp128 | 161.83 ± 1.58 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 4 | tg64 | 72.21 ± 1.45 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 8 | pp128 | 263.58 ± 1.95 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 8 | tg64 | 85.07 ± 1.19 |

build: 4d8a0c2 (3185)

@slaren (Collaborator, Author) commented Jun 19, 2024

I very often have a lot of trouble getting reproducible results on the M3 Max, and I think it is due to thermal throttling. Increasing the number of repetitions with -r can help somewhat.
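
For example, the comparison above can be re-run with more repetitions by passing llama-bench's -r flag through the script:

LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 ./scripts/compare-commits.sh master sl/test-omp-sync -m models/tinyllama-1b-chat/ggml-model-f16.gguf -p 128 -n 64 -t 4,8 -r 20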

@ggerganov (Owner) left a comment

Indeed, increasing the repetitions to -r 20 makes the results consistent:

| CPU | Model | Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | llama 1B F16 | 2.05 | 4 | pp128 | 113.01 | 112.20 | 0.99 |
| M1 Pro | llama 1B F16 | 2.05 | 4 | tg64 | 45.01 | 45.32 | 1.01 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | pp128 | 176.56 | 180.62 | 1.02 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | tg64 | 38.65 | 49.41 | 1.28 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | pp128 | 132.44 | 131.86 | 1.00 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | tg64 | 91.38 | 92.86 | 1.02 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | pp128 | 210.17 | 212.74 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | tg64 | 118.79 | 125.84 | 1.06 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | pp128 | 151.81 | 152.35 | 1.00 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | tg64 | 71.58 | 73.54 | 1.03 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | pp128 | 246.67 | 249.50 | 1.01 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | tg64 | 56.88 | 83.92 | 1.48 |

@slaren slaren merged commit 9c77ec1 into master Jun 19, 2024
68 checks passed
@slaren slaren deleted the sl/test-omp-sync branch June 19, 2024 13:04
@slaren (Collaborator, Author) commented Jun 19, 2024

Initially I removed the updates of perf_node_start_cycles and perf_start_time_us for simplicity, which are used with GGML_PERF, and then I forgot to add them again. Is this feature still being used? I can easily add it back, but maybe it would be preferable to remove GGML_PERF entirely instead.

@ggerganov (Owner)

It's quite obsolete at this point so it's fine to remove it. PR welcome

@slaren (Collaborator, Author) commented Jun 19, 2024

I will remove it as part of a refactor to remove the initialize and finalize tasks in favor of using barriers in the ops.
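
Roughly the direction described, as an illustrative sketch only (the names op_ctx, init_shared_state, compute_chunk and the barrier stub are hypothetical, not the actual ggml API): the op performs its setup step on one thread and synchronizes with a barrier, instead of the scheduler running separate init/finalize passes over the graph.

```c
typedef struct { int dummy; /* shared per-op state, e.g. the barrier handle */ } op_ctx;

// Hypothetical stand-ins for the real synchronization and compute routines.
static void barrier          (op_ctx * ctx)                   { (void)ctx; }
static void init_shared_state(op_ctx * ctx)                   { (void)ctx; }
static void compute_chunk    (op_ctx * ctx, int ith, int nth) { (void)ctx; (void)ith; (void)nth; }

// One op, executed by all threads: what used to be the INIT task runs on
// thread 0, a barrier makes its result visible before the main COMPUTE work,
// and a final barrier stands in for the old FINALIZE pass.
static void op_forward(op_ctx * ctx, int ith, int nth) {
    if (ith == 0) {
        init_shared_state(ctx);   // e.g. zero a shared accumulation buffer
    }
    barrier(ctx);
    compute_chunk(ctx, ith, nth);
    barrier(ctx);
}
```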

@slaren (Collaborator, Author) commented Jun 20, 2024

> The server thread sanitizer run failed the first time, but I don't understand the error, I cannot reproduce it locally, and the second run passed. It may have been a fluke, but it's something to look at if it continues failing intermittently after this is merged.

Since it happened again in #8018, I am trying to understand what this failure means.
The name of the test is "consistent results with same seed", and the error says Assertion Failed: invalid slot 0 expected[state] != slot[state] = 0 != 1. Here, state should be the state reported by /slots, so the test is complaining that the slot is in SLOT_STATE_PROCESSING when it expected SLOT_STATE_IDLE. In other words, it is not complaining that the generation differs from what was expected, but that the state of the slot does. I wonder if this is a race condition in the test framework: thread sanitizer likely makes the server responses much slower, and the test may be reading the state before the server has had a chance to update it.

@ggerganov (Owner)

Yes, it seems to me that the step_all_slots_status step should have a timeout and ping the server a few times before asserting (similar to wait_for_health_status).
