Rename or replace str::words to side-step the ambiguity of “a word”.
SimonSapin committed Apr 10, 2015
1 parent cf25ad8 commit 2973b99
Showing 1 changed file with 67 additions and 0 deletions.
67 changes: 67 additions & 0 deletions text/0000-str-words.md
@@ -0,0 +1,67 @@
- Feature Name: str-words
- Start Date: 2015-04-10
- RFC PR:
- Rust Issue:

# Summary

Rename or replace `str::words` to side-step the ambiguity of “a word”.


# Motivation

The [`str::words`](http://doc.rust-lang.org/std/primitive.str.html#method.words) method
is currently marked `#[unstable(reason = "the precise algorithm to use is unclear")]`.
Indeed, the concept of “a word” is not easy to define in the presence of punctuation
or languages with various conventions, including not using spaces at all to separate words.
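As a small illustration of both problems, the sketch below splits on whitespace only. The helper name `whitespace_pieces` and the sample strings are made up for this example, and it assumes a toolchain where whitespace splitting is available as `split_whitespace`.

```rust
/// Hypothetical helper for illustration: split a string on whitespace only.
fn whitespace_pieces(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

fn main() {
    // Punctuation stays attached to the neighbouring "word".
    assert_eq!(whitespace_pieces("Hello, world!"), vec!["Hello,", "world!"]);

    // A sentence written without spaces comes back as a single piece,
    // even though a reader would see several words in it.
    assert_eq!(whitespace_pieces("我喜欢编程").len(), 1);
}
```

Neither result is "wrong" for whitespace splitting, but neither matches what most people would call the words of the text.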

[Issue #15628](https://github.com/rust-lang/rust/issues/15628) suggests
changing the algorithm to be based on [the *Word Boundaries* section of
*Unicode Standard Annex #29: Unicode Text Segmentation*](http://www.unicode.org/reports/tr29/#Word_Boundaries).

While a Rust implementation of UAX#29 would be useful, it belongs on crates.io more than in `std`:

* It carries significant complexity that may be surprising from something that looks as simple
as a parameter-less “words” method in the standard library.
Users may not be aware of how subtle defining “a word” can be.
* It is not a definitive answer. The standard itself notes:

> It is not possible to provide a uniform set of rules that resolves all issues across languages
> or that handles all ambiguous situations within a given language.
> The goal for the specification presented in this annex is to provide a workable default;
> tailored implementations can be more sophisticated.

It also gives many examples of such ambiguous situations.

Therefore, `std` would be better off avoiding the question of defining word boundaries entirely.


# Detailed design

Rename the `words` method to `split_whitespace`, and keep the current behavior unchanged.
(That is, return an iterator equivalent to `s.split(char::is_whitespace).filter(|s| !s.is_empty())`.)

Rename the return type `std::str::Words` to `std::str::SplitWhitespace`.
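
A minimal sketch of the equivalence stated above, assuming the method exists under the proposed name `split_whitespace`. The two helper functions and the sample input are hypothetical and exist only to compare the two spellings.

```rust
/// Hypothetical helper: the proposed method, collected into a Vec.
fn via_method(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Hypothetical helper: the longer expression the method is equivalent to.
fn via_split_filter(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    // Leading, trailing, and repeated whitespace of several kinds.
    let text = "  Mary   had\ta little  \n lamb ";

    assert_eq!(via_method(text), via_split_filter(text));
    assert_eq!(via_method(text), vec!["Mary", "had", "a", "little", "lamb"]);
}
```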

Optionally, keep a `words` wrapper method for a while, both `#[deprecated]` and `#[unstable]`,
with a deprecation message that suggests `split_whitespace` or the chosen alternative.
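
The wrapper could look roughly like the sketch below. It is written as a free function in user code, not as the actual `std` method: the `#[unstable]` attribute is only usable inside the standard library, so it is omitted here, and the wording of the deprecation note is an assumption.

```rust
use std::str::SplitWhitespace;

/// Hypothetical stand-in for the deprecated `str::words` wrapper.
/// It simply forwards to the renamed method.
#[deprecated(note = "use `split_whitespace` instead")]
fn words(s: &str) -> SplitWhitespace<'_> {
    s.split_whitespace()
}

// Callers get a deprecation warning unless they opt out, as done here.
#[allow(deprecated)]
fn main() {
    let pieces: Vec<&str> = words("a  b\tc").collect();
    assert_eq!(pieces, vec!["a", "b", "c"]);
}
```

Existing callers keep compiling during the transition, and the compiler warning points them at the new name.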


# Drawbacks

`split_whitespace` is very similar to the existing `str::split<P: Pattern>(&self, P)` method,
and having a separate method seems like weak API design. (But see below.)


# Alternatives

* Replace `str::words` with a `struct Whitespace;` that has a custom `Pattern` implementation
and can be passed to `str::split`.
However, this requires the `Whitespace` symbol to be imported separately.
* Remove `str::words` entirely and tell users to use
`s.split(char::is_whitespace).filter(|s| !s.is_empty())` instead.
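
For the last alternative, the `filter` step is not optional. The sketch below (with hypothetical helper names and input) shows that a bare `split(char::is_whitespace)` yields empty strings around leading and repeated whitespace, so users would have to remember the whole expression.

```rust
/// Hypothetical helper: split on whitespace without filtering.
fn raw_split(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace).collect()
}

/// Hypothetical helper: the full expression users would be told to write.
fn filtered_split(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    let text = " a  b";

    // One empty piece before the leading space, one between the two spaces.
    assert_eq!(raw_split(text), vec!["", "a", "", "b"]);

    // Filtering removes them and leaves only the non-empty pieces.
    assert_eq!(filtered_split(text), vec!["a", "b"]);
}
```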


# Unresolved questions

Is there a better alternative?
