
[Setup] Multiple Apple Silicon Macs: Questions #67

Open
s04 opened this issue May 26, 2024 · 1 comment

s04 commented May 26, 2024

Hi, been dreaming of a project like this.

Some questions:

  1. Apple Silicon Macs are pretty fast for this kind of workload and benefit from unified memory. If I've got a MacBook with 24GB of RAM and one with 36GB, could I technically run ~60GB models? I assume I won't get the performance of MLX Llama implementations, but can I assume performance similar to llama.cpp, granted I've got fast internet?
  2. The --workers flag in the examples only has one IP address; do I add comma-separated values or a space-separated list?

Thanks in advance; I'll post in Discussions with results if I get some answers. Might try to pool a few colleagues' Macs together to see how far we can push it.

AWESOME PROJECT. Massive respect.

@DifferentialityDevelopment
Contributor

You just separate them with spaces like so:
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

You can also run several from the same IP, like so:
./dllama inference ... --workers 10.0.0.1:9996 10.0.0.1:9997 10.0.0.1:9998
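
Each of those addresses is a worker process that's already running on that machine. From memory (double-check the README; the port and thread count below are just placeholders), on each worker Mac you'd start something like:

./dllama worker --port 9998 --nthreads 4

and then launch the inference command above on the root node once every worker is listening.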

As for 1., workers with unified memory will perform better thanks to the higher memory bandwidth.
The root node consumes a bit more memory than the workers, so I'd use the 36GB MacBook as the root node. Typically the memory required to load the model is divided by the number of workers, but the number of workers needs to be a power of 2 (2, 4, 8 workers, etc.).
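
As a rough back-of-the-envelope for your setup (assuming an even split and ignoring KV cache and inference buffers): a ~60GB model spread across your 2 Macs works out to roughly 30GB per machine, so the 24GB one would also need headroom left for the OS and buffers.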

Also, it's worth experimenting with the number of threads you specify. In my case I have 6 cores and 12 threads, but I get the best performance with 8 threads.
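
If you want to find the sweet spot quickly, a throwaway shell loop does the trick (just a sketch; I'm assuming the thread flag is --nthreads, and the ... stands in for your usual model/tokenizer arguments):

for t in 4 6 8 12; do
  ./dllama inference ... --workers 10.0.0.2:9998 --nthreads $t
done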

Larger models require more data to be transferred during each inference pass. Something like Q80 Llama 70B might already hit the limits of gigabit Ethernet, and the switching capacity of your Ethernet switch also becomes a factor at that point.
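
If you want to know what your links can actually sustain before you start, something like iperf3 (a separate tool, not part of this project) between two of the Macs gives you a quick number:

iperf3 -s               # on one Mac
iperf3 -c 10.0.0.2      # on the other, pointing at the first Mac's IP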
