
Support MiniCPM3. #9322

Merged: 1 commit merged into ggerganov:master on Sep 16, 2024

Conversation

@CarryFun (Contributor) commented Sep 5, 2024

github-actions bot added the python (python script changes) label on Sep 5, 2024
(Outdated review threads on convert_hf_to_gguf.py and src/llama.cpp, all resolved)
@bioinformatist

@CarryFun Will this PR be merged soon? I'm not sure if it's blocked by #9396. If the only problems are with CI, I can help fix them. Would you mind if I submit a PR to https://github.com/OpenBMB/llama.cpp/tree/minicpm3?

@HanClinto (Collaborator)

Looks like the only CI failures are the linter failing whitespace checks:

./convert_hf_to_gguf.py:1838:1: E302 expected 2 blank lines, found 1
./convert_hf_to_gguf.py:1874:1: W293 blank line contains whitespace
./convert_hf_to_gguf.py:1891:1: E302 expected 2 blank lines, found 1
convert_hf_to_gguf.py:
	1874: Trailing whitespace
src/llama.cpp:
	6968: Trailing whitespace

Once these get resolved, I imagine we can move forward on merging this in?

Comment on lines +12917 to +12918
const float scale_embd = 12.0f;
const float scale_depth = 1.4f;
Owner
After #9412 we will have different names for these, but we can fix this later.
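
As a rough sketch of how constants like these typically enter the graph builder (illustrative only; variable names and placement are assumed here, not taken from the PR diff):

// illustrative sketch, not the exact PR code
const float scale_embd  = 12.0f;
const float scale_depth = 1.4f;

// scale the token embeddings once, at the start of the graph
inpL = ggml_scale(ctx0, inpL, scale_embd);

// scale each layer's output before it is added back into the residual stream
cur = ggml_scale(ctx0, cur, scale_depth / sqrtf((float) n_layer));
cur = ggml_add(ctx0, cur, ffn_inp);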

@@ -12825,6 +12909,215 @@ struct llm_build_context {
return gf;
}

struct ggml_cgraph * build_minicpm3() {
Owner
Is this graph very similar to build_deepseek2()? Can we do some code de-duplication?

Collaborator

Doing a diff between this and build_deepseek2(), there is a lot in common between the two.

Some notable differences that I'm seeing on a first pass:

minicpm3 supports scaling at several stages -- scaling the input embeddings, scaling the hidden states near the end of each layer, and finally scaling just prior to the output. These scaling factors could be set to 1 for deepseek2 (or the scaling skipped entirely), and we could probably reach parity this way.

deepseek2 supports a "lite" mode that simplifies the q calculation in each layer by a decent bit. This option would simply be disabled in the minicpm3 branch.

deepseek2 supports MoE / shared expert calculations for generating ffn_out, and -- like "lite" mode -- this wouldn't be needed in the minicpm3 branch.

deepseek2 does some prescaling on the kq_scale and attn_factor that minicpm3 doesn't need to do. Not sure if this could be aligned or not -- will need to dig into #7416 more before I understand this.

In calls to ggml_rope_ext, minicpm3 uses rope factors that deepseek2 simply sets to null. This could be aligned pretty easily (see the sketch below).

And finally, the final output is calculated differently in each network -- deepseek2 gets the result output with a call to ggml_mul_mat, and minicpm3 uses a call to llm_build_lora_mm -- that may be one of the largest structural changes that I saw when comparing the two.

I think these two implementations could be aligned, but it may take a bit of refactoring of the deepseek2 code as well. Not sure how to weigh the value of this effort vs. just maintaining two separate branches of code.
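
On the rope-factors point specifically, the two call shapes being compared look roughly like this (a sketch assuming the usual llm_build_context variables such as inp_pos, freq_base, attn_factor, etc.; not the literal diff):

// deepseek2 today: no per-dimension frequency factors (third tensor argument is NULL)
q_pe = ggml_rope_ext(ctx0, q_pe, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);

// minicpm3: passes a per-layer rope-factors tensor instead of NULL;
// aligning the two is mostly a matter of which value gets passed here
q_pe = ggml_rope_ext(ctx0, q_pe, inp_pos, rope_factors,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);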

Collaborator

My view is that the best place to look for code de-duplication is in the generic llm_build_xxx and other similar functions that represent the basic building blocks of the models. Conditional branches in the implementation depending on the arch make the code harder to follow and may have a higher maintenance cost than some code duplication.

Additionally, I think it would be preferable to create an abstract interface for the models, and move the implementation of each one to a separate file without any coupling between models, but that will be harder to implement if the different models share the same code in this way.
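
A hypothetical sketch of the kind of per-model separation being described here (not an existing llama.cpp interface, purely an illustration of the idea):

// hypothetical -- illustrates the suggestion; the names are not a concrete proposal
struct llm_graph_builder {
    virtual ~llm_graph_builder() = default;
    virtual struct ggml_cgraph * build(struct llm_build_context & bctx) = 0;
};

// would live in its own file, with no arch-specific branches shared with other models
struct llm_graph_builder_minicpm3 : llm_graph_builder {
    struct ggml_cgraph * build(struct llm_build_context & bctx) override {
        // construct the full MiniCPM3 graph here, reusing only the generic
        // llm_build_xxx helpers (norm, ffn, kv cache, etc.)
        return nullptr; // placeholder in this sketch
    }
};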

@slaren (Collaborator) commented Sep 12, 2024

And finally, the final output is calculated differently in each network -- deepseek2 gets the result output with a call to ggml_mul_mat, and minicpm3 uses a call to llm_build_lora_mm -- that may be one of the largest structural changes that I saw when comparing the two.

That's probably a bug that will make loras that modify this tensor not work properly. llm_build_lora_mm is the right function to use when performing matrix multiplications with the weights.
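
To make the distinction concrete, the two forms compared above look like this in sketch form (assuming the usual builder variables lctx, ctx0, model.output and cur):

// plain matmul against the base weight -- any LoRA delta attached to model.output is ignored
cur = ggml_mul_mat(ctx0, model.output, cur);

// LoRA-aware form: base matmul plus the scaled deltas of any active adapters
cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);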

@@ -1818,6 +1818,58 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter

return [(self.map_tensor_name(name), data_torch)]

@Model.register("MiniCPM3ForCausalLM")
@HanClinto (Collaborator) commented Sep 13, 2024

Suggested change
@Model.register("MiniCPM3ForCausalLM")
@Model.register("MiniCPM3ForCausalLM")

Lint is currently breaking on this -- need to add an additional blank line above your class definition.

Contributor Author

Thank you for your comments.

weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
.swapaxes(1, 2)
.reshape(weights.shape)
)
Collaborator

Suggested change
)
)

Lint is also breaking on this -- need to add an additional blank line below your class definition as well.

Contributor Author

Thank you for your comments.

@HanClinto (Collaborator)

LGTM -- all CI is passing now!

There are some refactoring optimizations that can probably come later, but for now I think this is probably good enough for getting the new model added.

Anyone else willing to weigh in on it? I'm not confident enough to mark it as approved on my own.

@ggerganov (Owner)

@HanClinto Thanks, will merge soon. Just want to first fix the Docker CI on master before pushing big changes.

@ggerganov merged commit 95ca851 into ggerganov:master on Sep 16, 2024
54 checks passed
Labels: python (python script changes)
6 participants