Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does default_para_level correspond to TR9#HL1? #129

Open
dhardy opened this issue Feb 24, 2024 · 6 comments
Open

Does default_para_level correspond to TR9#HL1? #129

dhardy opened this issue Feb 24, 2024 · 6 comments

Comments

@dhardy
Copy link

dhardy commented Feb 24, 2024

Presumably it does, but it is not clear how (in particular, the name may be wrong).

HL1 states:

Override P3, and set the paragraph embedding level explicitly. This does not apply when deciding how to treat FSI in rule X5c.

... then gives some suggestions for how this may do so, but no specific requirement. One of those suggestions is to follow P2 and P3, but default to level 1 (RTL).

What I find from testing is that providing Some value via default_para_level to BidiInfo::new overrides the paragraph level. Given the name, I would expect that instead it only acts as a default (i.e. when P2 does not find a strongly directional character).


Suggestions:

  1. Rename default_para_level to override_para_level
  2. Make default_para_level act only when no strongly-directional (L/AL/R) character is found. (In this case, it does not need to be optional.)
  3. Revise to support both cases: pub fn new(text: &str, default_para_level: Level, override_para_level: Option<Level>) -> BidiInfo<'_>

From a usage point-of-view, I can see value in being able to specify a default line direction (i.e. suggestion 2) for e.g. an input field when it may be empty or contain only formatting characters. The current behaviour (override) achieves the same thing, but also messes up formatting when the directionality is detectable (e.g. a sentence with a full stop as in #52), for little gain.

@dhardy
Copy link
Author

dhardy commented Feb 24, 2024

Test case (extra line-breaks inserted to allow correct rendering in the browser):

Example text in multiple languages.

مثال على نص بلغات متعددة.

Пример текста на нескольких языках.

טקסט לדוגמא במספר שפות.

Level runs:

None:
        Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
        Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
        Level(1), text[81..81]: '', HardBreak, breaks=[]
        Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
        Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]

Some(LTR_LEVEL):
        Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
        Level(1), text[36..80]: 'مثال على نص بلغات متعددة', NoBreak, breaks=[68, 57, 52, 45]
        Level(0), text[80..81]: '.', NoBreak, breaks=[]
        Level(0), text[81..81]: '', HardBreak, breaks=[]
        Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
        Level(1), text[148..189]: 'טקסט לדוגמא במספר שפות', NoBreak, breaks=[181, 170, 157]
        Level(0), text[189..190]: '.', breaks=[]

Some(RTL_LEVEL):
        Level(2), text[0..34]: 'Example text in multiple languages', NoBreak, breaks=[8, 13, 16, 25]
        Level(1), text[34..35]: '.', HardBreak, breaks=[]
        Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
        Level(1), text[81..81]: '', HardBreak, breaks=[]
        Level(2), text[82..146]: 'Пример текста на нескольких языках', NoBreak, breaks=[95, 108, 113, 134]
        Level(1), text[146..147]: '.', HardBreak, breaks=[]
        Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]

@Manishearth
Copy link
Member

Yes, it does. The naming comes from the standard naming for HL1 from other bidi implementations, including ICU4C.

While typically HL1 is viewed as a higher level algorithm, in practice it really should just be an input to the main algorithm. This is a bit of a discrepancy between the spec and how real world users tend to conceptualize the algorithm.

It's "default" because further embedding will change the level: it is not the level of the paragraph, it is the level the paragraph starts with. "override" carries a stronger connotation here.

I would prefer to just document this better.

@dhardy
Copy link
Author

dhardy commented Feb 27, 2024

Personally I'd still like to see something like this:

pub fn BidiInfo::new(text: &str, base_para_level: Option<Level>, default_base_para_level: Level) -> Self

Doc:

When base_para_level is None, the base paragraph direction is inferred from the first strongly-directional character of text. If no such character is found, default_base_para_level is used. If, instead, base_para_level = Some(level), then the base paragraph direction is defined by level; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).

Or maybe just:

pub fn BidiInfo::new(text: &str, infer: bool, base_para_level: Level) -> Self

Doc:

When infer = true, the base paragraph direction is inferred from the first strongly-directional character of text. If no such character is found, base_para_level is used. If, instead, infer = false, then the base paragraph direction is defined by base_para_level; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).

Then BidiInfo::new(text, false, default_para_level.unwrap_or(LTR_LEVEL)) would be equivalent to the current functionality.

@Manishearth
Copy link
Member

This doesn't really match typical usage patterns of the bidi algorithm.

Basically, either you have a signal from the embedding environment (e.g. your html dir attribute, or surrounding bidi text in the case of embedded inline content) as to what level to use, or you don't.

I don't want to overrotate on how the spec conceptualizes these things. The spec probably should have HL1 clarified.

@dhardy
Copy link
Author

dhardy commented Feb 27, 2024

html dir attribute

In this case, the current behaviour makes sense. For CSS's direction property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.

I see you already have this in your code-base:

    /// TODO: Support auto-RTL base direction

Personally, I've decided to use the following:

/// Directionality of text
///
/// Texts may be bi-directional as specified by Unicode Technical Report #9.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Ord, PartialOrd, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[repr(u8)]
pub enum Direction {
    /// Auto-detect direction
    ///
    /// The text direction is inferred from the first strongly-directional
    /// character. In case no such character is found, the text will be
    /// left-to-right.
    #[default]
    Auto = 2,
    /// Auto-detect, default right-to-left
    ///
    /// The text direction is inferred from the first strongly-directional
    /// character. In case no such character is found, the text will be
    /// right-to-left.
    AutoRtl = 3,
    /// The base text direction is left-to-right
    ///
    /// If the text contains right-to-left content, this will be considered an
    /// embedded right-to-left run. Non-directional leading and trailing
    /// characters (e.g. a full stop) will normally not be included within this
    /// right-to-left section.
    ///
    /// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 0.
    Ltr = 0,
    /// The base text direction is right-to-left
    ///
    /// If the text contains left-to-right content, this will be considered an
    /// embedded left-to-right run. Non-directional leading and trailing
    /// characters (e.g. a full stop) will normally not be included within this
    /// left-to-right section.
    ///
    /// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 1.
    Rtl = 1,
}

IIUC AutoRtl can be emulated by calling get_base_direction first.

@Manishearth
Copy link
Member

In this case, the current behaviour makes sense. For CSS's direction property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.

because direction doesn't support auto? Yeah, the direction style needs some love, but also it's not recommended in HTML so i don't think there's much impetus to improve it.


I believe calling it the base direction in the API would be fine though if we're doing that I would want documentation that uses both names.


It may be worth documenting how to get AutoRtl though really I think this is a thing that would benefit from a new API. Unfortunately it's not easy to change the current one without it being a breaking change.

Note that the basic enum is insufficient for this crate: there are valid use cases for setting a base level that is greater than 1, typically for doing incremental updates to embedded content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants