
Usage of OpenMP barrier for thread sync to avoid atomic operations #8002

Open · wants to merge 1 commit into base: master
Conversation

abhishek-rn

The current ggml compute threading requires explicit synchronization of all threads after every stage of operation.
With the recent introduction of the OpenMP library into this repo, we can leverage OpenMP's barrier function and effectively remove the atomic operations that introduce delay.

Command line:
./bin/llama-cli -m ../models/7B/ggml-model-q8_0.gguf -n 100 -s 01 -b 7 -t <num_of_threads> -p "AI is going to"
Running on an AWS Graviton 3 machine with 64 cores.

| Threads | Master (tokens/sec) | PR changes (tokens/sec) |
|--------:|--------------------:|------------------------:|
| 4       | 6.2                 | 6.19                    |
| 8       | 11.75               | 11.7                    |
| 16      | 20.94               | 20.95                   |
| 24      | 23.97               | 25.03                   |
| 32      | 25.7                | 27.46                   |
| 48      | 26.26               | 29.71                   |

[Image: PR_Results chart]

We see improvements in tokens/sec of up to ~3.5 tokens/sec over the existing implementation (26.26 → 29.71 at 48 threads); the gains grow with thread count.
This is a small change in line with existing OpenMP library usage.
Requesting review and feedback.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 19, 2024
@slaren
Collaborator

slaren commented Jun 19, 2024

It is already being done in #7993

@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 19, 2024