support more attributes from the Encoding structure #5

Merged on Nov 15, 2023 (8 commits)

clems4ever (Contributor)

MiniLM needs the attention mask for its mean pooling step, as shown in the model card at
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
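
For context, here is a minimal sketch of why the mask matters for mean pooling: padding positions have to be left out of the average, otherwise they skew the sentence embedding. The `meanPool` helper, its parameter names, and the `[]uint32` mask type are assumptions made for illustration. They are not part of this library's API, and the embeddings would come from whatever runtime runs the model.

```go
package main

import "fmt"

// meanPool averages token embeddings, counting only positions whose
// attention-mask entry is non-zero (real tokens, not padding).
// Hypothetical helper for illustration; it assumes every embedding row
// has the same length.
func meanPool(tokenEmbeddings [][]float32, attentionMask []uint32) []float32 {
	if len(tokenEmbeddings) == 0 {
		return nil
	}
	pooled := make([]float32, len(tokenEmbeddings[0]))
	var count float32
	for i, emb := range tokenEmbeddings {
		// Skip padding positions and positions the mask does not cover.
		if i >= len(attentionMask) || attentionMask[i] == 0 {
			continue
		}
		for j, v := range emb {
			pooled[j] += v
		}
		count++
	}
	if count == 0 {
		// Every position was masked out; return the zero vector
		// instead of dividing by zero.
		return pooled
	}
	for j := range pooled {
		pooled[j] /= count
	}
	return pooled
}

func main() {
	// Two real tokens followed by one padding token.
	embeddings := [][]float32{{1, 2}, {3, 4}, {100, 100}}
	mask := []uint32{1, 1, 0}
	fmt.Println(meanPool(embeddings, mask)) // prints [2 3]
}
```

Without the mask, the padding row would be averaged in and the result would be dominated by values that carry no meaning.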


clems4ever commented Nov 14, 2023

Before

$ go test . -bench=. -benchmem -benchtime=10s
goos: linux
goarch: amd64
pkg: github.com/daulet/tokenizers
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkEncodeNTimes-8           885213             13053 ns/op             224 B/op         11 allocs/op
BenchmarkEncodeNChars-8         1000000000               2.351 ns/op           0 B/op          0 allocs/op
BenchmarkDecodeNTimes-8          2108638              5758 ns/op              96 B/op          3 allocs/op
BenchmarkDecodeNTokens-8        15591064               761.3 ns/op             7 B/op          0 allocs/op
PASS
ok      github.com/daulet/tokenizers    59.096s

After

$ go test . -bench=. -benchmem -benchtime=10s
goos: linux
goarch: amd64
pkg: github.com/daulet/tokenizers
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkEncodeNTimes-8           935011             12774 ns/op             232 B/op         12 allocs/op
BenchmarkEncodeNChars-8         1000000000               1.962 ns/op           0 B/op          0 allocs/op
BenchmarkDecodeNTimes-8          2098053              5676 ns/op              96 B/op          3 allocs/op
BenchmarkDecodeNTokens-8        15740354               742.0 ns/op             7 B/op          0 allocs/op
PASS
ok      github.com/daulet/tokenizers    57.765s

daulet (Owner) left a comment

Thank you for your contribution!

daulet merged commit 38a9a14 into daulet:main on Nov 15, 2023