Rename or replace `str::words` to side-step the ambiguity of “a word”.
1 parent cf25ad8, commit 2973b99
Showing 1 changed file with 67 additions and 0 deletions.
@@ -0,0 +1,67 @@
- Feature Name: str-words
- Start Date: 2015-04-10
- RFC PR:
- Rust Issue:

# Summary

Rename or replace `str::words` to side-step the ambiguity of “a word”.

# Motivation

The [`str::words`](http://doc.rust-lang.org/std/primitive.str.html#method.words) method
is currently marked `#[unstable(reason = "the precise algorithm to use is unclear")]`.
Indeed, the concept of “a word” is not easy to define in the presence of punctuation
or in languages with differing conventions, including those that do not use spaces at all to separate words.

[Issue #15628](https://github.com/rust-lang/rust/issues/15628) suggests
changing the algorithm to be based on [the *Word Boundaries* section of
*Unicode Standard Annex #29: Unicode Text Segmentation*](http://www.unicode.org/reports/tr29/#Word_Boundaries).

While a Rust implementation of UAX#29 would be useful, it belongs on crates.io more than in `std`:

* It carries significant complexity that may be surprising from something that looks as simple
  as a parameter-less “words” method in the standard library.
  Users may not be aware of how subtle defining “a word” can be.
* It is not a definitive answer. The standard itself notes:

  > It is not possible to provide a uniform set of rules that resolves all issues across languages
  > or that handles all ambiguous situations within a given language.
  > The goal for the specification presented in this annex is to provide a workable default;
  > tailored implementations can be more sophisticated.

  and gives many examples of such ambiguous situations.

Therefore, `std` would be better off avoiding the question of defining word boundaries entirely.

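To make the ambiguity concrete, here is a minimal sketch of what whitespace-based splitting returns for two inputs. The helper name `whitespace_pieces` and the sample strings are illustrative assumptions, not part of this RFC; the body is the plain split-and-filter chain on whitespace.

```rust
// Illustrative helper (hypothetical name): split on whitespace and drop
// the empty pieces, which is what a whitespace-based "words" amounts to.
fn whitespace_pieces(text: &str) -> Vec<&str> {
    text.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    // Punctuation stays attached to the neighbouring letters.
    println!("{:?}", whitespace_pieces("Hello, world!"));

    // A sentence written without spaces comes back as a single piece.
    println!("{:?}", whitespace_pieces("今日は良い天気です"));
}
```

Neither result matches what most readers would call “words”, which is the gap a UAX#29 implementation would try to close.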
# Detailed design

Rename the `words` method to `split_whitespace`, and keep the current behavior unchanged.
(That is, return an iterator equivalent to `s.split(char::is_whitespace).filter(|s| !s.is_empty())`.)

Rename the return type `std::str::Words` to `std::str::SplitWhitespace`.

Optionally, keep a `words` wrapper method for a while, both `#[deprecated]` and `#[unstable]`,
with an error message that suggests `split_whitespace` or the chosen alternative.

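The equivalence stated in this section can be checked with a short sketch. It assumes a toolchain in which the method and its return type are available under the proposed names `split_whitespace` and `std::str::SplitWhitespace`; the function names `explicit_chain` and `renamed_method` and the sample input are made up for illustration.

```rust
use std::str::SplitWhitespace;

/// The explicit chain quoted in the design text.
fn explicit_chain(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|s| !s.is_empty())
        .collect()
}

/// The same operation through the method under its proposed name.
fn renamed_method(s: &str) -> Vec<&str> {
    // The iterator type carries the proposed `SplitWhitespace` name.
    let pieces: SplitWhitespace<'_> = s.split_whitespace();
    pieces.collect()
}

fn main() {
    // Leading, trailing, and repeated whitespace of several kinds.
    let input = "  Mary had\ta little  \n lamb ";

    // Both spellings yield the same non-empty pieces.
    assert_eq!(explicit_chain(input), renamed_method(input));
    println!("{:?}", renamed_method(input));
}
```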
# Drawbacks

`split_whitespace` is very similar to the existing `str::split<P: Pattern>(&self, P)` method,
and having a separate method seems like weak API design. (But see below.)

# Alternatives

* Replace `str::words` with `struct Whitespace;` with a custom `Pattern` implementation,
  which can be used in `str::split`.
  However, this requires the `Whitespace` symbol to be imported separately.
* Remove `str::words` entirely and tell users to use
  `s.split(char::is_whitespace).filter(|s| !s.is_empty())` instead.

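As a rough sketch of the second alternative, this is what user code might look like if the method were removed outright. The helper `whitespace_fields` is hypothetical and would live in user code, not in `std`; the example also shows why the `filter` step matters, since a plain `split` on whitespace yields empty pieces between consecutive separators.

```rust
/// Hypothetical user-side helper, not part of `std`: the chain users
/// would be told to write if `str::words` were removed.
fn whitespace_fields<'a>(s: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
}

fn main() {
    let input = "a  b"; // two spaces between the letters

    // Plain `split` keeps the empty piece between the two spaces.
    let plain: Vec<&str> = input.split(char::is_whitespace).collect();
    println!("{:?}", plain);

    // With the filter, only the non-empty pieces remain.
    let fields: Vec<&str> = whitespace_fields(input).collect();
    println!("{:?}", fields);
}
```

Every caller would need to remember the `filter` step, which is one argument for keeping a dedicated method despite its overlap with `str::split`.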
# Unresolved questions

Is there a better alternative?