Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locale-invariant parsing and formatting #251

Closed
wants to merge 105 commits into from

Conversation

dkhalanskyjb
Copy link
Collaborator

@dkhalanskyjb dkhalanskyjb commented Feb 26, 2023

Fixes #39
Fixes #58
Fixes #90
Fixes #128
Fixes #133
Fixes #139
Fixes #211
Fixes #240
Fixes #83
Fixes #276

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, thanks for your nice work. I'm wondering whether there are any reference documents for this String format? Or is it created just for kotlinx.datetime library? As a user, it might be tooooooo complex to have to learn specific syntax separately in order to just use a datetime library. And that makes it inconsistent with other languages.
For example, I hove my formatter yyyy-MM-dd HH:mm:ss which works in Java/JavaScript/Python/MySQL/etc. also works here, instead of changing it to "ld<yyyy'-'mm'-'dd> lt<hh:mm:ss>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is it created just for kotlinx.datetime library?

Yes.

yyyy-MM-dd HH:mm:ss

This is the Unicode format: https://unicode-org.github.io/icu/userguide/format_parse/datetime/#datetime-format-syntax It is ubiquitous, as you say, but it has a fair share of issues. For example, many confuse YYYY with yyyy (or with uuuu) and hh with HH. This leads to real issues that we try to avoid by providing a more intuitive and clear format.

However, we do plan to provide a way to construct a formatter using the Unicode format strings and some others, for the purposes of graceful migration.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for replying. I have to admit that the current Unicode version dose confuse sometimes, and maybe improving it is a good advice. And will it better to discuss with community about the new norm while (or before) continuing writing the code? I mean, it is a hard work because most people need to agree to the new style and gradually spread it (to other languages, or even to the new standard). (Kotlin's user is relatively not so much, especially for Kotlin Multiplatform now)
Hope everything works better in the future.

internal class DayDirective(minDigits: Int) :
UnsignedIntFieldFormatDirective<DateFieldContainer>(DateFields.dayOfMonth, minDigits)

internal object DateFormatBuilderSpec: BuilderSpec<DateFieldContainer>(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it support some common used formats like Y, M, W/w defined in Java's DateTimeFormatter in the future?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

M is just the month, formatting months is certainly a must-have. Regarding the other ones you listed, they are not at all common. Y and w are for week-based years (a representation of dates so niche that no one has even asked us to support it yet), and W (week of month) is only used four times across thousands of repositories in which we checked the usage of format strings (for comparison, the day-of-month directive was used more than 20_000 times).

If you need Y, W, or w, whether for formatting or as fields of LocalDate, please file an issue describing your use cases.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for detailed explanation, got it. I don't have the use cases, and it might be enough of what we already have.

put(LocalDateTime(-123456, 1, 1, 13, 44, 0, 0), ("-123456-01-01 13:44:00" to setOf()))
}
test(LocalDateTimeFormat.fromFormatString("ld<yyyy'-'mm'-'dd> lt<hh:mm:ss>"), dateTimes)
test(LocalDateTimeFormat.build {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while looking on this test, I was thinking about API shape:

  1. why not to just use something like this LocalDateTimeFormat { ... } for building format? Similar to how kotlinx.serialization create formats (Json { ... }) or buildingSerializersModule. Future stdlib HexFormat API also uses the same convention https://github.com/JetBrains/kotlin/blob/d3eef139c7b316aa9733f0272c5f8a5f8f844d30/libraries/stdlib/test/text/BytesHexFormatTest.kt#L28
  2. What is the reason to add append to all methods? I mean, if looking from formatter perspective it looks logical (because when formatting, we append elements, so it's ok), but if looking from parser perspective it looks odd. IMO, just dropping append word looks good from first sight, because then DSL looks like you are just describing the schema for formatting.
    So if to apply my comments, creating format for this test will look:
LocalDateTimeFormat {
  year(4)
  literal('-')
  monthNumber(2)
  literal('-')
  dayOfMonth(2)
  literal(' ')
  hour(2)
  literal(':')
  minute(2)
  literal(':')
  second(2)
}

If Im missing something, I would be interested to understand :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is said in the opening post,

Most importantly: the naming and the shape of user-visible API needs to be discussed.

So, you may well be right, and I can't give an authoritative answer, just explain the reasoning used in the draft. So far, the naming is this way because we'll want to extend the API with localized formats: #253

why not to just use something like this LocalDateTimeFormat { ... } for building format

I think this format implies that every LocalDateTimeFormat can be constructed with this builder, but this may change in the future. I think it doesn't make sense to mix machine communication formats with human-readable formats in a single constructor, given how disparate the use cases for them are, but it also probably would be excessive to have an entirely separate LocalDateTimeLocalizedFormat class. Two distinct functions, build and buildLocalized, seem disparate enough.

Also, we don't have API like String { +"x"; +"y" }. Instead, it's buildString { append("x"); append("y") }.

What is the reason to add append to all methods?

It's appending to the format. First, the format is "four-digit year," then, it's "four-digit year, followed by a -," and so on. It's called append to signify that the order is important. When showing a message to the user, you want to reorder the format components, but when parsing logs, you very much don't. If it's appendYear(4); appendLiteral('-'); appendMonth(2), it's clear that it won't be reordered to appendMonth(2); appendLiteral('-'); appendYear(4). If it was just year, literal, and month, I wouldn't bet one way or the other.

@jbruchanov
Copy link

@dkhalanskyjb

Amazing piece of work.
I was wondering, wouldn't be worthy to release it as a separate library, at least at this early stage ?
It has been awhile since this PR was created and I guess it will still take some time to get it in.

You could get meantime valuable feedback about API/bugs and community might help to add more features.

@mahozad
Copy link

mahozad commented Jul 13, 2023

I have a few questions.

How is this library going to support other calendars (like Solar calendar)?

JavaScript supports formatting dates to other calendars based on the given locale:

let formatted = new Date().toLocaleDateString('fa');

See https://stackoverflow.com/q/35570884/8583692

Also, does it correctly output digits based on the locale?
For example, when the locale is set to Arabic, the digits should be
Eastern Arabic digits. Or, when the locale is set to "fa", the
digits should be Farsi (Persian).

I'm saying this because I've seen this problem in other places too.
See https://youtrack.jetbrains.com/issue/KT-58122/StringtoDouble-does-not-parse-Eastern-Arabic-and-Persian-Farsi-valid-numbers

@dkhalanskyjb
Copy link
Collaborator Author

calendars

locale

"locale-invariant" means that locales are not supported. This implementation is for machine-machine communications, like parsing logs or formatting dates in a specific format for API endpoints.

For a discussion of localized formatting capabilities, see #253

core/common/src/format/migration/Unicode.kt Outdated Show resolved Hide resolved
core/common/src/format/Format.kt Outdated Show resolved Hide resolved
core/common/src/internal/Repr.kt Outdated Show resolved Hide resolved
core/common/src/internal/dateCalculations.kt Show resolved Hide resolved
core/common/src/LocalDate.kt Outdated Show resolved Hide resolved
core/common/src/LocalDate.kt Outdated Show resolved Hide resolved
@dkhalanskyjb
Copy link
Collaborator Author

Status: the API was discussed, and the changes are implemented. The only thing left is to decide on the names of functions and such, with maybe small tweaks here and there.

This PR has names like appendYear, and they are too long when the builder is the only means of constructing a format. I didn't rename all such functions to something shorter for one reason: when we decide on the name, appendYear is much easier to find and replace automatically with a proper name of our choice than something like year. So, until we decide on the names, these ones stay as placeholders.

@dkhalanskyjb dkhalanskyjb force-pushed the locale-invariant-formatting branch 3 times, most recently from 6214827 to 30b182b Compare November 17, 2023 13:20
@dkhalanskyjb dkhalanskyjb changed the title WIP: first draft of locale-invariant parsing and formatting Locale-invariant parsing and formatting Nov 17, 2023
@dkhalanskyjb dkhalanskyjb marked this pull request as ready for review November 17, 2023 14:29
@dkhalanskyjb dkhalanskyjb added this to the 0.6.0 milestone Nov 29, 2023
@dkhalanskyjb dkhalanskyjb added the formatters Related to parsing and formatting label Nov 29, 2023
@dkhalanskyjb dkhalanskyjb self-assigned this Nov 29, 2023
core/common/src/format/migration/Unicode.kt Outdated Show resolved Hide resolved
core/common/src/internal/Repr.kt Outdated Show resolved Hide resolved
core/native/test/ThreeTenBpLocalDateTest.kt Outdated Show resolved Hide resolved
core/common/src/internal/math.kt Outdated Show resolved Hide resolved
second()
optional {
char('.')
secondFractionInternal(1, 9, FractionalSecondDirective.GROUP_BY_THREE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks concerning that it is not possible to construct a standard format without resorting to an internal function. Or is there a public alternative?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't.

core/common/src/internal/math.kt Show resolved Hide resolved
Copy link
Member

@ilya-g ilya-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed stage 1 of review: public API and tests.
So far the most concerning part is the usage of internal functions when building the standard formats. This would make it hard for users to copy-paste and tweak them.

* https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/time/format/DateTimeFormatterBuilder.html#appendPattern(java.lang.String)
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#datetime-format-syntax
*/
internal sealed interface UnicodeFormat {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking this PR, but I wonder, would it be possible to replace such data structure with a single visitor if it is only used once to build our format?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, sure, via Church encoding, it's always possible for non-polymorphic types.

internal interface UnicodeFormatInterpreter<Result> {
    fun literal(string: String): Result
    fun sequence(listCollector: List<Result>): Result
    fun optionalGroup(inner: Result): Result
}

internal object UnicodeFormatInterpreterInDateTimeFormatBuilder: UnicodeFormatInterpreter<(DateTimeFormatBuilder) -> Unit> {
    override fun literal(string: String): (DateTimeFormatBuilder) -> Unit = { it.chars(string) }

    override fun optionalGroup(inner: (DateTimeFormatBuilder) -> Unit): (DateTimeFormatBuilder) -> Unit = {
        it.alternativeParsing({}) { inner(this) }
    }

    override fun sequence(listCollector: List<(DateTimeFormatBuilder) -> Unit>): (DateTimeFormatBuilder) -> Unit = {
        listCollector.forEach { subFormat -> subFormat(it) }
    }
}

Or we can recreate the standard mutable approach to visitors that allows us to reify the stack and avoid all the lambdas:

internal interface UnicodeFormatInterpreter<State> {
    fun literal(state: State, string: String)
    fun enterOptionalGroup(state: State)
    fun leaveOptionalGroup(state: State)
}

internal class UnicodeFormatInterpreterInDateTimeFormatBuilder<T, E: AbstractDateTimeFormatBuilder<T, E>>:
    UnicodeFormatInterpreter<MutableList<AbstractDateTimeFormatBuilder<T, E>>>
{
    override fun literal(state: MutableList<AbstractDateTimeFormatBuilder<T, E>>, string: String) {
        state.last().chars(string)
    }

    override fun enterOptionalGroup(state: MutableList<AbstractDateTimeFormatBuilder<T, E>>) {
        state.add(state.last().createEmpty())
    }

    override fun leaveOptionalGroup(state: MutableList<AbstractDateTimeFormatBuilder<T, E>>) {
        val main = state.removeLast().actualBuilder.build()
        val others = listOf(ConcatenatedFormatStructure(emptyList<NonConcatenatedFormatStructure<T>>()))
        state.last().actualBuilder.add(AlternativesParsingFormatStructure(main, others))
    }
}

core/common/src/format/DateTimeFormat.kt Outdated Show resolved Hide resolved
core/common/src/format/DateTimeComponents.kt Outdated Show resolved Hide resolved
core/common/src/format/DateTimeComponents.kt Outdated Show resolved Hide resolved
core/common/src/format/DateTimeComponents.kt Outdated Show resolved Hide resolved
/**
* The fractional part of the second without the leading dot.
*
* When formatting, the decimal fraction will round the number to fit in the specified [maxLength] and will add
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be worth to mention rounding mode used in secondFraction overloads.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those overloads are internal, so no reason to mention them from the public documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see, secondFraction overloads are public here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also may be better "the decimal fraction will be rounded"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you mean these overloads. Sure, added the mentions and improved the wording.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still is worth to mention the rounding mode. HALF_UP, I suppose?

core/common/test/format/LocalDateTimeFormatTest.kt Outdated Show resolved Hide resolved
core/common/src/Instant.kt Outdated Show resolved Hide resolved
@dkhalanskyjb
Copy link
Collaborator Author

So far the most concerning part is the usage of internal functions when building the standard formats.

Ok, I removed this functionality from the standard formats and also filed an issue to investigate whether we should also stop doing this in toString: #333

Comment on lines +22 to +26
@Benchmark
fun formatIso() = LocalDateTime.Formats.ISO.format(localDateTime)

@Benchmark
fun parseIso() = LocalDateTime.Formats.ISO.parse(formatted)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to commit the benchmark set in this state?
Also, I'd prefer if we dogfood kotlinx-benchmark here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do run them occasionally.

kotlinx-benchmark

Maybe we could do that, but I think it's a problem for after we've published a release with this PR.

Copy link

@lppedd lppedd Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just know that kotlinx-benchmark is not ready outside of JVM. Too many issues still, especially in JS.
But I'd be happy to see it adopted here, it may speed up the resolution of such issues.

Before this change, we had two `parse` overloads: one that accepted
an ISO string and one that accepted some string and the
corresponding formatter. Now, we only have one overload that uses
the ISO format by default.

As a workaround for KT-65484, we have to hide the actual default
parameters behind a redundant `internal` function.
Before this change, you'd observe
`date(LocalDate.Formats.ISO)char('T')`, without a newline
in-between.
Also check that Java returns the same string as toUnicodePattern()
@dkhalanskyjb
Copy link
Collaborator Author

Merged in #343

@dkhalanskyjb dkhalanskyjb deleted the locale-invariant-formatting branch February 29, 2024 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment