
sync : ggml (ggml-backend) #3548

Merged: ggerganov merged 2 commits into master from sync on Oct 8, 2023
Conversation

ggerganov
Owner

@slaren Should we merge what we already have for ggml-backend? If there aren't any breaking changes, we might want to do so in order to keep things in sync more easily.

@FSSRepo
Collaborator

FSSRepo commented Oct 8, 2023

Hello, I have some questions about backends. They are general questions (they apply to any project, not just LLM inference): I have seen that ggml-cuda and the other backends in llama.cpp differ from the implementation in the ggml repository, with some backend-specific kernels. I am planning to implement the CUDA backend in stable-diffusion.cpp, but I have a lot of doubts. Could you please take some time to clarify them for me?

@slaren
Collaborator

slaren commented Oct 8, 2023

@ggerganov yes, I think we should merge this here to avoid conflicts in the future. I already tested the changes to ggml-alloc here yesterday, but I'll run some additional tests to be sure.

@FSSRepo most of the development of the backends happens in this repository, but the changes to ggml are regularly synced between the ggml and llama.cpp repositories. I am not sure if that answers your question; if you have other doubts, just ask.

@ggerganov ggerganov marked this pull request as ready for review October 8, 2023 17:19
@ggerganov ggerganov merged commit db3abcc into master Oct 8, 2023
35 checks passed
@ggerganov ggerganov deleted the sync branch October 8, 2023 17:19
zshannon added a commit to zshannon/llama.cpp that referenced this pull request Oct 8, 2023
@cebtenzzre
Collaborator

cebtenzzre commented Oct 13, 2023

This change broke offloading Falcon-7B with ngl=34 (V cache on the GPU but not the K cache). I'm getting this assertion failure:

GGML_ASSERT: ggml-cuda.cu:7068: src0->backend == GGML_BACKEND_GPU

@slaren
Collaborator

slaren commented Oct 14, 2023

@cebtenzzre I think it was already broken before; this change just made it assert rather than generate nonsense. The source of the problem is that the packed wqkv weight doesn't play well with CUDA: it needs the Kcur portion on the CPU, but currently I don't see a good way to copy a view back to the CPU. This should be fixed in the future with ggml-backend.
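To make the problem concrete, here is a minimal, hypothetical C sketch (tensor names and dimensions are invented; this is not the actual llama.cpp graph code). The fused QKV projection produces one tensor, and Kcur is only a view into it, so its data cannot live on a different backend than the rest of the projection:

// Hypothetical sketch of the fused-QKV view problem, using ggml's C API.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n_embd = 64, n_tokens = 8; // invented sizes

    // stand-ins for the packed wqkv weight and the layer input
    struct ggml_tensor * wqkv = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, 3*n_embd);
    struct ggml_tensor * inp  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);

    // one fused projection yields Q, K and V packed into a single tensor
    struct ggml_tensor * cur  = ggml_mul_mat(ctx, wqkv, inp); // [3*n_embd, n_tokens]

    // Kcur is only a view into cur: if cur is computed on the GPU, the view's
    // data is GPU memory too, even though the K cache it must be copied into
    // stays on the CPU when not all layers are offloaded
    struct ggml_tensor * Kcur = ggml_view_2d(ctx, cur, n_embd, n_tokens, cur->nb[1],
                                             n_embd*ggml_element_size(cur));

    (void) Kcur; // copying this view back to a CPU-side K cache is what asserts
    ggml_free(ctx);
    return 0;
}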

@cebtenzzre
Collaborator

cebtenzzre commented Nov 7, 2023

> This change broke offloading Falcon-7B with ngl=34

Maybe Falcon was already broken (I don't remember if I checked), but LLaMA 7B with ngl=34 was definitely producing correct output before this PR: #3820

// check if a tensor is allocated by this buffer
static bool ggml_allocr_is_own(struct ggml_allocr * alloc, const struct ggml_tensor * tensor) {
-    void * ptr = tensor->data;
-    return ptr >= alloc->data && (char *)ptr < (char *)alloc->data + alloc->max_size;
+    return tensor->buffer == alloc->buffer;
}
Collaborator

xaedes commented Dec 21, 2023


In train & finetune I create a new allocator, allocate tensors, free the allocator, and then repeat that for different groups of tensors.
Before this change, the actual data pointer was used to check ownership; this pointer differed between successive allocators because each used a different data buffer.
But the buffer member used here may not differ, because it is malloc'd by ggml_backend_buffer_init: it can happen that the previous memory address is reused, because the previous allocator along with its buffer was freed.

So with this change, ggml_allocr_is_own does not work correctly with the use case of a temporary allocator.
The allocator must be kept around as long as the tensors themselves.

Small example:

std::vector<char> buf0;
std::vector<char> buf1;
buf0.resize(...);
buf1.resize(...);
struct ggml_tensor * a = ...;
struct ggml_tensor * b = ...;
struct ggml_allocr * alloc;

// first allocator, backed by buf0
alloc = ggml_allocr_new(buf0.data(), buf0.size(), alignment);
ggml_allocr_alloc(alloc, a);
ggml_allocr_free(alloc); // frees the malloc'd ggml_backend_buffer

// second allocator, backed by buf1; malloc may reuse the freed buffer address
alloc = ggml_allocr_new(buf1.data(), buf1.size(), alignment);
ggml_allocr_alloc(alloc, b);
// Now ggml_allocr_is_own(alloc, a) can be unexpectedly `true`
ggml_allocr_free(alloc);

This results in the allocator freeing tensors that it does not actually own.
I wrote that it 'can' happen, but I actually observe this behaviour in train & finetune; whether it manifests probably depends on the malloc implementation.
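As a side note, the address reuse that makes this intermittent is easy to reproduce in isolation. Here is a minimal, self-contained C illustration (not ggml code; note that even comparing a freed pointer is formally undefined, which is part of why the behaviour depends on the allocator):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void * buf0 = malloc(64);             // buffer of the first allocator
    printf("first  buffer: %p\n", buf0);
    free(buf0);                           // first allocator and its buffer freed

    void * buf1 = malloc(64);             // second allocator's buffer: malloc
    printf("second buffer: %p\n", buf1);  // often hands back the address just freed

    if (buf0 == buf1) {
        // a tensor still holding the old buffer pointer would now compare
        // equal to the new allocator's buffer
        printf("same address reused\n");
    }
    free(buf1);
    return 0;
}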

I can change the train & finetune programs to not free the allocators, which would solve the problem.

But in the above example it is still a bit weird that ggml_allocr_is_own(alloc, a) can unexpectedly return true after the previous allocator was freed.

Collaborator

Maybe I missed other cases, but this should have been fixed in #4486.

The problem is that checking the addresses doesn't really work with measure allocators. The previous implementation used virtual memory as a stopgap, but that has other problems. Ultimately, the issue is that ggml_allocr_new creates a temporary ggml_backend_buffer that is freed together with the allocator, and from that point on the tensors cannot be used anymore, since they point to a freed buffer. When using ggml-backend it is possible to free the allocator if you supply your own buffer, but otherwise the ggml-alloc should be kept alive for the lifetime of the tensors.
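A sketch of that pattern, assuming the ggml-backend API of this period (ggml_backend_cpu_init, ggml_backend_alloc_buffer, ggml_allocr_new_from_buffer); because the caller owns the buffer, the allocator itself can be freed early:

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

void alloc_with_own_buffer(struct ggml_tensor * t) {
    struct ggml_backend * backend = ggml_backend_cpu_init();

    // the buffer is owned by the caller, not by the allocator
    struct ggml_backend_buffer * buf = ggml_backend_alloc_buffer(backend, 16*1024*1024);

    struct ggml_allocr * alloc = ggml_allocr_new_from_buffer(buf);
    ggml_allocr_alloc(alloc, t);
    ggml_allocr_free(alloc);       // safe: t's data lives in buf, not in the allocator

    // ... use t here ...

    ggml_backend_buffer_free(buf); // free only once the tensors are no longer needed
    ggml_backend_free(backend);
}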

Collaborator

Indeed, this mostly solves the issues for finetune, but there are some more cases. I will push a PR after I come back from a holiday family visit.
