proposal: utf8: RuneStartLen to get the length of the rune from the first byte #68716

Open
aymanbagabas opened this issue Aug 2, 2024 · 2 comments

aymanbagabas commented Aug 2, 2024

Proposal Details

I find myself needing a way to determine how many bytes a UTF-8 encoded rune occupies, given only its first byte, when iterating over the bytes of a string. Following RFC 3629, we could implement something like utf8.RuneStartLen(b byte) int.

Zig and Rust both implement equivalent functionality. Go could offer something similar.

// RuneStartLen reports the number of bytes in the encoded rune that begins
// with first byte b. It returns a value between 1 and 4, or -1 if the byte
// is not a valid UTF-8 first byte.
func RuneStartLen(b byte) int {
	if b <= 0b0111_1111 { // 0x00-0x7F
		return 1
	} else if b >= 0b1111_1000 { // 0xF8-0xFF: never a valid first byte
		return -1
	} else if b >= 0b1111_0000 { // 0xF0-0xF7
		return 4
	} else if b >= 0b1110_0000 { // 0xE0-0xEF
		return 3
	} else if b >= 0b1100_0000 { // 0xC0-0xDF
		return 2
	}
	return -1
}
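
To illustrate the use case described above, here is a minimal sketch of splitting a byte slice into per-rune chunks by looking only at each first byte. Both `runeStartLen` and `splitRunes` are hypothetical names used for illustration; neither exists in `unicode/utf8`. The sketch assumes the input is mostly valid UTF-8 and falls back to single-byte chunks for invalid or truncated sequences.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStartLen mirrors the function proposed in this issue.
// It is a hypothetical helper, not part of unicode/utf8.
func runeStartLen(b byte) int {
	switch {
	case b <= 0x7F: // 0x00-0x7F
		return 1
	case b >= 0xF8: // 0xF8-0xFF: never a valid first byte
		return -1
	case b >= 0xF0: // 0xF0-0xF7
		return 4
	case b >= 0xE0: // 0xE0-0xEF
		return 3
	case b >= 0xC0: // 0xC0-0xDF
		return 2
	}
	return -1 // 0x80-0xBF: continuation byte
}

// splitRunes cuts s into per-rune chunks using only the first byte of each
// sequence. Invalid first bytes and truncated sequences become 1-byte chunks.
func splitRunes(s []byte) [][]byte {
	var out [][]byte
	for i := 0; i < len(s); {
		n := runeStartLen(s[i])
		if n < 1 || i+n > len(s) {
			n = 1
		}
		out = append(out, s[i:i+n])
		i += n
	}
	return out
}

func main() {
	for _, chunk := range splitRunes([]byte("aé€😀")) {
		// DecodeRune is used only to cross-check the chunk length.
		r, size := utf8.DecodeRune(chunk)
		fmt.Printf("%q: chunk of %d bytes, DecodeRune size %d\n", r, len(chunk), size)
	}
}
```

Unlike `utf8.DecodeRune`, this approach does not validate continuation bytes, so it is only suitable when the caller can tolerate or separately detect malformed input.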
gopherbot added this to the Proposal milestone Aug 2, 2024

gabyhelp commented Aug 2, 2024

Related Issues and Documentation

  • proposal: utf8.RuneIndexToByteIndex() #31879 (closed)
  • [Package utf8 > func RuneCount](https://go.dev/pkg/unicode/utf8/?m=old#RuneCount)
  • [Package utf8 > func RuneLen](https://go.dev/pkg/unicode/utf8/?m=old#RuneLen)
  • [Package utf8 > func RuneStart](https://go.dev/pkg/unicode/utf8/?m=old#RuneStart)
  • unicode/utf16: add RuneLen #44940 (closed)
  • proposal: unicode/utf8: rune count in a valid UTF-8 string #57896 (closed)
  • [Package utf8 > func DecodeLastRune](https://go.dev/pkg/unicode/utf8/?m=old#DecodeLastRune)
  • [Package utf8 > func DecodeRune](https://go.dev/pkg/unicode/utf8/?m=old#DecodeRune)
  • [Package utf8 > func RuneCountInString](https://go.dev/pkg/unicode/utf8/?m=old#RuneCountInString)

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

seankhliao changed the title from "proposal: utf8: given the first byte, determine how many bytes in the UTF-8 string" to "proposal: utf8: RuneStartLen to get the length of the rune from the first byte" Aug 4, 2024
Member

adonovan commented Aug 5, 2024

This is a reasonable function, but it is rarely needed except by clients that are doing something unusually sophisticated, and it's a trivial consequence of the four constants that appear in the compact pictorial summary of UTF-8 found in any document on the subject--especially if you simplify each else if cond1 && cond2 to else if cond2. (Each first condition is trivially true as a consequence of the control flow.)
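
As a rough illustration of the "four constants" point, the pictorial summary of UTF-8 leading bytes (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) can be written down as a small mask/pattern table. This is only a sketch: `leadPatterns` and `leadLen` are hypothetical names, not anything in `unicode/utf8`, and like the proposal it classifies first bytes by bit pattern alone without rejecting bytes such as 0xC0 or 0xF5 that can never begin a well-formed sequence.

```go
package main

import "fmt"

// leadPatterns encodes the four leading-byte shapes of UTF-8.
// A byte b matches a row when b&mask == pattern.
var leadPatterns = [...]struct {
	mask, pattern byte
	length        int
}{
	{0b1000_0000, 0b0000_0000, 1}, // 0xxxxxxx
	{0b1110_0000, 0b1100_0000, 2}, // 110xxxxx
	{0b1111_0000, 0b1110_0000, 3}, // 1110xxxx
	{0b1111_1000, 0b1111_0000, 4}, // 11110xxx
}

// leadLen returns the sequence length implied by first byte b, or -1 if b
// matches none of the four patterns (continuation bytes and 0xF8-0xFF).
func leadLen(b byte) int {
	for _, p := range leadPatterns {
		if b&p.mask == p.pattern {
			return p.length
		}
	}
	return -1
}

func main() {
	for _, b := range []byte{0x41, 0xC3, 0xE2, 0xF0, 0x80, 0xFF} {
		fmt.Printf("%#x -> %d\n", b, leadLen(b))
	}
}
```

The table form trades a few extra comparisons for a direct correspondence with the bit layouts, which may make the logic easier to audit than a chain of range checks.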
