Fixed width formatting #133

Open
jagerber48 opened this issue Jan 21, 2024 · 16 comments · May be fixed by #139

Comments

@jagerber48
Owner

jagerber48 commented Jan 21, 2024

There seems to be generic demand for functions that format numbers but work very hard to preserve the overall width of the string. sciform could support a function like this. See e.g. lmfit gformat and an example usage. This was brought up here.


NOTICE (2024/02/05): This issue is on hold until some discussion can happen around how the edge cases described below can be handled. It would also be helpful (but not strictly necessary) to see example use cases that aren't covered by applying normal python string left/right/center padding to the string (actually FormattedNumber) that is output by sciform.


It's not totally clear to me exactly how this could be done, and I'm worried there are some edge cases that can't perfectly satisfy the length requirements. For example, suppose we want to display -112345678 (10 characters) in 7 characters

-112345678  # 10 characters

-1e+08  # 6 characters
-1.1e+08  # 8 characters
-11e+07  # 7 characters

We definitely can't do it in fixed point notation because there are 9 digits already before the decimal symbol, so we must use scientific notation. We see that with one sig fig we require 6 characters, but with 2 sig figs we require 8 characters because adding the second sig fig introduced a decimal symbol. The resolution is we have to carefully choose the exponent and number of sig figs so that there is no decimal symbol to get around this "characters jumping by 2" problem. However, it might be that this problem is unique to length=7 when the exponent requires 2 digits to express (i.e. -99 <= exp <= +99).
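The two-character jump is easy to reproduce with Python's built-in `e` format type. This is only a quick standard-library check of the character counts above, not sciform code:

```python
value = -112345678

# Precision 0 shows one sig fig and no decimal symbol. Precision 1 adds a
# digit AND the decimal symbol, so the length grows by two characters.
lengths = []
for precision in range(3):
    text = f"{value:.{precision}e}"
    lengths.append(len(text))
    print(f"{text!r}: {len(text)} characters")
```

After the first jump each extra sig fig adds exactly one character, so only the step from one to two sig figs skips a length.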

What then should be the exact procedure for getting a fixed length representation? I suggest the following procedure which would be relatively simple with sciform.

  • The user specifies a target length (the string will be no shorter than this, but it may be necessary for it to be longer)
  • The user passes in a Formatter (with any settings whatsoever)
  • The user passes in an ordered sequence (list or tuple) of exp_mode values to try, e.g. ["fixed_point", "engineering", "scientific"]
  • The code then just brute forces formatting using the Formatter but overriding (1) the exp_mode with the entries in the list above (2) the round_mode to be sig_fig and (3) ndigits to vary from 1 up to length.
  • If it is possible to exactly hit the target length then the above procedure should do it.
  • The code then looks for any mode combinations which hit the target length. It then picks the mode combinations that hit the target length with the most number of sig figs. If there is a tie between exp_mode then the ordered list is used to select a winner. If no mode combination can hit the target length the mode combination which hits the shortest length with the most sig figs and with precedence given to the exp_mode list is selected.

This algorithm will probably miss the -1.1e+08 -> -11e+07 optimization described above, but I think this is an ok price to pay (one sig fig) for staying strictly in standard exponent modes like "scientific" (with exp_val=AutoExpVal unless otherwise specified) or "engineering".
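The procedure above can be sketched roughly as follows. To keep the sketch runnable without sciform it uses only the standard library: `format_sig_figs` is a hypothetical stand-in for calling a Formatter with round_mode set to sig figs, `brute_force_format` is a made-up name, and only the "fixed_point" and "scientific" modes are covered. It is an illustration of the guess-and-check idea under those assumptions, not the eventual sciform implementation.

```python
from decimal import Decimal


def format_sig_figs(value, sig_figs, exp_mode):
    """Hypothetical stand-in for a Formatter call that rounds to sig figs."""
    scientific = f"{value:.{sig_figs - 1}e}"  # already rounded to sig_figs
    if exp_mode == "scientific":
        return scientific
    if exp_mode != "fixed_point":
        raise ValueError(f"unsupported exp_mode in this sketch: {exp_mode}")
    # Fixed point: re-express the rounded value without an exponent.
    exponent = int(scientific.split("e")[1])
    decimals = max(sig_figs - 1 - exponent, 0)
    return f"{Decimal(scientific):.{decimals}f}"


def brute_force_format(value, target_length,
                       exp_modes=("fixed_point", "scientific")):
    # Try every exponent mode with 1..target_length sig figs.
    candidates = [
        (format_sig_figs(value, sig_figs, mode), sig_figs, priority)
        for priority, mode in enumerate(exp_modes)
        for sig_figs in range(1, target_length + 1)
    ]
    pool = [c for c in candidates if len(c[0]) == target_length]
    if not pool:
        # No exact hit: take the shortest strings that are still long enough.
        longer = [c for c in candidates if len(c[0]) > target_length]
        longer = longer or candidates
        shortest = min(len(c[0]) for c in longer)
        pool = [c for c in longer if len(c[0]) == shortest]
    # Most sig figs wins; ties go to the earlier entry in exp_modes.
    return max(pool, key=lambda c: (c[1], -c[2]))[0]


print(brute_force_format(-112345678, 7))
print(brute_force_format(0.000202754321, 10))
```

For -112345678 with a target of 7 the sketch overshoots to 8 characters, which is the missed optimization mentioned above. For the second value, fixed point and scientific tie on sig figs, so the order of `exp_modes` decides.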

The performance of the above algorithm is very poor; it's just brute force. But I think a brute force approach would be a good starting point for getting some tests/examples written. I think some simple optimizations could be done, like guessing close to the right number of sig figs using the magnitude of the number and a static analysis of the character "overhead" for each mode, then checking +/- 2 sig figs or so to cover strange edge cases like the one described above. Note also that non-trivial upper/lower separators will introduce even more edge cases, making the need for a guess-and-check algorithm even more pressing.

@newville

@jagerber48 Yes, going below 7 characters is very hard, and probably not worth supporting. The bare minimum (sign, one digit, decimal, 'e', exponent sign, 2-digit exponent) uses 7 characters. So there won't be very many general-use cases for fewer than 7 characters, and it's probably fine to just assert that this is not supported. I think it would even be OK to say 8 or more characters.

@jagerber48
Owner Author

@newville thanks for checking this out and responding. Yes... I think if the target length is >= 8 then you can always represent the number (assuming -99<=exp<=99) in scientific notation with length-6 sig figs. The trick is just figuring out exactly when you need to switch between fixed point and scientific notation (which I think has been worked out in gformat).

All of that said, I think there are use cases around (I recall reading them on e.g. stack overflow) where people want numbers formatted to a fixed width, and I think, naively, they're thinking about fixed point numbers. That is, they have numbers like 123, 0.23, 0.001, and 1 and they want to see them listed as

123.0  # 4 sig figs
0.230  # 3 sig figs
0.001  # 1 sig fig
1.000  # 4 sig figs

Here the target length is 5, the minimum length that can be used to display all of the numbers in fixed point format. If the target length were e.g. 4 then 0.001 would overflow no matter how it is represented.
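For this simpler fixed point case, the listing above can be reproduced with built-in formatting by choosing the number of decimals from the width. `fixed_point_to_width` is a hypothetical helper written for this comment, not a sciform function, and it ignores rounding that carries into a new integer digit (e.g. 9.9999):

```python
def fixed_point_to_width(values, width):
    """Hypothetical helper: add decimals until each string is `width` wide."""
    formatted = []
    for value in values:
        sign_chars = 1 if value < 0 else 0
        integer_digits = len(str(int(abs(value))))
        # One character is reserved for the decimal symbol.
        decimals = max(width - sign_chars - integer_digits - 1, 0)
        formatted.append(f"{value:.{decimals}f}")
    return formatted


# Use the widest minimal representation in the list as the shared width.
width = len("0.001")
result = fixed_point_to_width([123, 0.23, 0.001, 1], width)
print(result)
```

Note that when no room is left for decimals the decimal symbol disappears too, so a value like 1234 comes out one character short of the width instead of hitting it.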

I'll consider which of the following should be in sciform

  • The general purpose guess-and-check algorithm described in the original post
  • A formatting function like gformat which only allows target length >= 8 but guarantees to hit that length with fixed format mode if possible, or scientific notation if not
  • A helper function that accepts a list of numbers + a target length and formats all the numbers to hit the same shortest possible length >= the target length

@jagerber48 jagerber48 linked a pull request Jan 29, 2024 that will close this issue
@jagerber48
Owner Author

jagerber48 commented Jan 29, 2024

@newville See PR #139.

I haven't explicitly tested this, but looking at the link you provided, it looks like

val = 0.000202754321
gformat(val)
#  2.0275e-04
format_to_target_length(
    val,
    target_length=11,
    allowed_exp_modes=["fixed_point", "scientific"],
    base_formatter=Formatter(sign_mode=" "),
)
#  0.00020275

format_to_target_length(
    val / 10,
    target_length=11,
    allowed_exp_modes=["fixed_point", "scientific"],
    base_formatter=Formatter(sign_mode=" "),
)
#  2.0275e-05

So the two formatting algorithms make different decisions, at least in this case, about when to switch from fixed_point to scientific notation. Maybe changing allowed_exp_modes to ["scientific", "fixed_point"] would bring agreement, but I don't know if that would result in decision reversals on different tests. I guess I have an internal bias towards preferring "fixed_point" if possible, which is why I wrote the code above like that.

I plan to also include a function that accepts a list of input numbers (or number/uncertainty pairs) to format as well as format settings (like target_length, etc.) and it will format all of the numbers to the best minimum possible matching length.

Neither of these functions provides a guarantee that the target length will be hit, but I will include an example either in the docstring or readthedocs or both showing that if you select allowed_exp_modes=["fixed_point", "scientific"] and target_length=8 (or greater, need to double check this is the right number) then you are guaranteed to hit the target_length.

An almost exact gformat drop in helper function could be defined like:

def sciform_gformat(val, length=11):
    return format_to_target_length(
        val,
        target_length=length,
        allowed_exp_modes=["fixed_point", "scientific"],
        base_formatter=Formatter(sign_mode=" "),
    )

However, I don't think I'll provide something like this in sciform since some of the choices are a bit up to user preference.

@newville

Right, I think it is ambiguous at the 1e-4 level, giving the same precision with both formats. Anyway, thanks!

@jagerber48
Owner Author

Two issues I've been pondering while trying to think about a function that formats a collection/sequence of numbers or number/uncertainty pairs to have the same lengths.


Issue 1: Value formatting with thousandths separators

When thousandths separators are used it may not be possible to hit any sufficiently large width just by adding sig figs.

-123.456_789    # 12 characters
-123.456_789_0  # 14 characters

This could be compensated by left padding with zeros or spaces

-0123.456_789  # 13 characters
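The skipped lengths in Issue 1 can be listed with a small standard-library sketch. `with_lower_separator` is a hypothetical helper that mimics a thousandths separator by grouping the fractional digits in threes; it is not how sciform implements separators.

```python
def with_lower_separator(value, decimals, separator="_"):
    """Hypothetical helper: fixed point text with grouped fractional digits."""
    text = f"{value:.{decimals}f}"
    if "." not in text:
        return text
    integer, fraction = text.split(".")
    groups = [fraction[i:i + 3] for i in range(0, len(fraction), 3)]
    return f"{integer}.{separator.join(groups)}"


lengths = {}
for decimals in range(1, 8):
    text = with_lower_separator(-123.456789, decimals)
    lengths[decimals] = len(text)
    print(f"{text!r}: {len(text)} characters")
```

Every time a new group of three fractional digits starts, the separator costs one extra character, so one length is skipped at each group boundary.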

Issue 2: Value/Uncertainty formatting even/odd issue

For value/uncertainty pairs the problem is worse

-92 ± 1         # 7 characters
-92 ± 11        # 8 characters
-92.0 ± 11.1    # 12 characters

-92.0 ± 1.0     # 11 characters
-92.0 ± 11.0    # 12 characters
-92.00 ± 11.10  # 14 characters

Adding a sig fig always adds characters equally to both the value and the uncertainty, so it is impossible for certain value/uncertainty pairs to be made to have the same length. This problem could be addressed by left padding one of the value or the uncertainty, but that may not be possible given the behavior of left_pad_dec_place.
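The parity problem in Issue 2 shows up directly when the lengths are counted. This sketch formats the pair with plain built-in fixed point formatting and a literal " ± " between the two parts, as a rough stand-in for value/uncertainty output:

```python
value, uncertainty = -92, 11.1

lengths = []
for decimals in range(4):
    # The same number of decimals is applied to both parts, as in the
    # examples above, so each step adds a character on both sides.
    text = f"{value:.{decimals}f} ± {uncertainty:.{decimals}f}"
    lengths.append(len(text))
    print(f"{text!r}: {len(text)} characters")
```

Once both parts carry a decimal symbol, the total length changes by two per step and therefore never changes parity.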


Another possibly even worse case is when doing value/uncertainty formatting with thousandths separators.

I'm not sure what algorithm can be used to address both of these issues. I think worst case scenario two value/uncertainty pairs might be off by 3 or 4 with respect to their target or neighboring numbers. Possibly you can always do better than 3 with a clever algorithm but it's not obvious to me.


My knee-jerk reaction to this feature request was to not be too excited about it. Scientific number formatting should be driven by uncertainty orders of magnitude and significant figures. It should not be driven by string length. The python built-in format specification mini-language has functionality for padding strings to a fixed width, but sciform takes the point of view that that kind of thing is the job of a string formatter, not a number formatter, so this sort of feature was specifically kept out of sciform. See https://sciform.readthedocs.io/en/stable/fsml.html#incompatibilities-with-built-in-format-specification-mini-language.

I'm tempted to say that a number formatting package shouldn't provide functionality for controlling string widths and that modifying number formatting based on string width is an anti-pattern based on the philosophy above.

Of course, the left_pad feature is preserved in sciform, and really the only use case for that is matching string widths. The reason that was preserved is it is padding to a decimal place that appears between the sign symbol and the decimal symbol, so it would not be possible for an end user to do that padding on their own after the fact like they could if they just wanted to left/right/center pad the entire output as a string.

Curious for your thoughts @newville. For the specific use case of output result formatting for lmfit, I'm curious if you considered the tabulate package, which can pretty easily turn arrays/dicts/etc. of data into properly spaced ASCII or unicode (I think?) tables and might make the whole "getting strings to match lengths for reporting" problem go away.

@newville

@jagerber48

Scientific number formatting should be driven by uncertainty orders of magnitude and significant figures. It should not be driven by string length

Hm, do you have a citation for that.

I'm tempted to say that a number formatting package shouldn't provide functionality for controlling string widths and that modifying number formatting based on string width is an anti-pattern based on the philosophy above.

Hm, that's confusing. Is the point of sciform something other than "convert numbers to strings so that they can be more easily read by humans?" If that is the goal, there could be many ways to make numbers more readable. One of those ways would be to specify the width, so that, for example, the downstream programmer had a very easy time creating tables of numbers.

If that is not the goal, then is it fair to say that sciform should not be used in __str__ or __repr__ methods? Those are specifically for "string representations" of objects.

I think I may have misunderstood the goal of the project. Good luck and all the best!

@jagerber48
Owner Author

@newville thanks for the feedback. My opinions in my last post possibly came off too strong. I'd say I'm on the fence about these issues. Thinking as a scientist, the character width of a number should never matter in the least. But, as you say, if sciform is supposed to make numbers more readable, maybe character width does matter.

In any case, so far sciform has focused a little more on the "science-oriented" approach. In large part because it's a little more well-defined and avoids thorny issues like above.

If I could ask your opinion once more: In simple cases it is possible to format individual numbers to a specified character width (like gformat does). But I have raised a few examples where adding significant figures may cause the character width to increase by more than one with each sig fig (e.g. value/uncertainty or thousandths separators). These examples throw wrenches in the "general-purpose-target-length" formatters I'm trying to engineer. But maybe I should forget about a general-purpose target-length formatter because it is under-specified and focus on a more specialized formatter, like gformat, that could already be broadly useful?

I think I may have misunderstood the goal of the project. Good luck and all the best!

I don't think you've misunderstood. sciform is still a young project, I'm not the most experienced developer, and I'm still trying to figure out what directions sciform should go. I certainly want it to be helpful for people! So far it has been helpful for me to push back against including various helper functions in sciform because they took effort away from core functionality. But now I think most of the core functionality is in place and it is a good time to put in these sorts of helper functions that will make sciform very useful for people.

@jagerber48
Owner Author

Said another way: If there was an obvious algorithm to format value and value/uncertainty pairs to specified lengths for all sciform formatter options I would implement it with no questions asked. But given my perceived impossibility of achieving that task, (even admitting the target width must be above some minimum value for a given set of formatting options) I'm not sure how to proceed.

@newville

@jagerber48

Thinking as a scientist, the character width of a number should never matter in the least.

Thinking as a scientist, communication of non-trivial numerical results is the most important function we perform.

When discussing communication of scientific results, I cannot think of any attribute X where I would ever agree with a statement like "X should never matter in the least".

If the width of a string should "never matter in the least", then should the height of the string also not ever matter? How about a "numerical base"? How would you feel about "x = 14401.5 +/- 0x632"?

When I look at the aim of this project, the formatting of numerical values to communicate scientific results seems to be the point. I must have misunderstood. Good luck and all the best.

@jagerber48
Owner Author

@newville my apologies. I don't want to waste your time. Leaving the philosophical discussion about string/number formatting aside, if you are able to spare any more time for this discussion I would really appreciate your feedback specifically on how sciform should provide character-length-controlled formatting in light of the edge cases I presented above (e.g. where a certain value or value/uncertainty pair might skip one or more lengths as the number of sig figs is increased). If I had a clear direction on what exactly should be promised to the user by character-length-controlled formatting then I would have fewer qualms about implementing it. Perhaps your answer is simply and pragmatically that you would appreciate seeing a function with the same scope and behavior as gformat in sciform (i.e. the place where this conversation started before I brought in my more generic/larger scoped ideas)?

  • We've established that when formatting a number alone, with both fixed point and scientific notation available, it is possible to deterministically hit any target length >= 8.
  • We've established that introducing thousandths separators means certain lengths are impossible to hit for certain values.
  • We've established that when formatting value/uncertainty pairs there will always be certain lengths that are impossible to hit for any given value/uncertainty pair (there is an even/odd parity).

Perhaps the conclusion is that sciform should provide a format-to-target-width function that covers the first case since it is tractable, but just totally ignore the latter two cases since they are not? What are your thoughts?


If you care to answer, I have a question about your lmfit use case. In the lmfit table I see that you use gformat so that each number in a given table has a certain width. The left-most and right-most digits of all the numbers are aligned. However, you could have foregone gformat and just used regular string padding (e.g. f"{a_string:<12s}") to left align all the numbers but then guarantee that, as strings, they extend far enough to the right to keep the table aligned. Did you consider this, and if so, why did you want to go through the extra effort of writing gformat? Was it specifically important to you that the left-most and right-most digits shown for all the numbers are aligned?

There is a sciform example using tabulate to accomplish a similar goal. Here I transcribe the table but with modifications to make the strings more "ragged".

+---------+-------------------+------------------+---------------------+
| color   | curvature         | x0               | y0                  |
+=========+===================+==================+=====================+
| red     | (2001.7(1.9))e+12 | (-42.7(4.8))e-06 | (1.0000060(46))e+09 |
+---------+-------------------+------------------+---------------------+
| blue    | (18.4(1.0))e+12   | (-6.9(1.6))e-06  | (1.0262(29))e+09    |
+---------+-------------------+------------------+---------------------+
| purple  | (2.7(1.7))e+12    | (15.1(2.5))e-06  | (1.000246(44))e+09  |
+---------+-------------------+------------------+---------------------+

Here tabulate automatically right pads strings (it can right/left or center justify by user request) so that the table is aligned. Maybe you find the lmfit approach where the left and right digits are aligned to be more pleasing, even in cases where the decimal points aren't aligned?
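For comparison, the same kind of right padding can be done by hand with `str.ljust`, without tabulate. This is a minimal standard-library sketch using two columns of the table above; the " | " column separator is an arbitrary choice for the example:

```python
rows = [
    ("red", "(2001.7(1.9))e+12"),
    ("blue", "(18.4(1.0))e+12"),
    ("purple", "(2.7(1.7))e+12"),
]

# Pad every cell to the width of the widest entry in its column.
widths = [max(len(row[i]) for row in rows) for i in range(2)]
lines = [
    " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
    for row in rows
]
print("\n".join(lines))
```

The number strings themselves are untouched here; only whitespace is added, which is the division of labor between number formatting and string formatting described earlier in this thread.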

@newville

@jagerber48

Leaving the philosophical discussion about string/number formatting aside

It seems to me that you wrote pages of philosophical discussion. I do not know what you would like to set aside.

I cannot wrap my head around the idea that someone would want to write and discuss software for formatting numbers to strings and then say
"a number formatting package shouldn't provide functionality for controlling string widths".

But you said rather a lot of "should not" statements.

The width of the output string is one of the fundamental characteristics that a formatting process must address.
See, for example, https://docs.python.org/3/library/string.html#format-specification-mini-language. Width is one of the key parameters. It is not the only consideration, but it is one of the things that many people do, in fact, regularly consider.

I'll have to leave it there. Best of luck with the project. You might do well to find people to work with.

@jagerber48
Owner Author

@newville Sorry if I've been unclear and come on overly strong in some of my statements. My main interest in this issue has been trying to do some brainstorming about the specific edge cases that have been bothering me and blocking me from including general character width formatting in sciform. But I haven't been able to get any feedback from you on those edge cases which I'm disappointed by because I feel like, with your experience with lmfit which is very central within scientific python, your feedback would be very valuable. I said multiple times that if I could see a clear path through these edge cases (which could include not supporting them..) I would be happy to include character-width formatting, but it seems like you've fixated on disagreement with some of my individual sentences that I said more in the spirit of exploring ideas than establishing hard go/no-go statements on certain things in sciform. Again, I'm sorry that my statements came off overly strong or decisive. I don't feel decisive at all about these questions, I'm needing to bounce ideas off people.

I'll have to leave it there. Best of luck with the project. You might do well to find people to work with.

Thanks for the well wishes. @newville, I would love to find people to work with. I'm actively seeking feedback and collaboration for sciform. sciform is wrapping up review with PyOpenSci which has allowed me to get some more eyes on the code and have some very fruitful discussions about it. I'm trying to advertise to get more users on places like stackoverflow and reddit but don't have the most cohesive strategy. I'm not sure where/how to advertise to recruit partner maintainers. Any advice you have along these lines would be very much appreciated. (One idea which occurred to me is to advertise a call-for-contributors/partners in the readme. Right now I'm explicitly working to get more users, but haven't put serious effort into finding contributors/partners.)


Going forward I'll continue an investigation into if/how it makes sense to include character-width formatting in sciform. I may collect my ideas in this issue. Feel free to unsubscribe or not as you like if you find the brainstorming annoying. If I do end up including character-width formatting that covers the gformat usage I'll tag you here.

Also, since this arose out of the uncertainties migration conversation, I want to say the following. I'm still interested in contributing to uncertainties whether any relationship develops with sciform or not. I'm interested in the development topics there beyond just formatting. Once that migration gets going I will probably re-raise the possibility of a relationship there with you and the others. I think the format-by-width discussion we're having here is mostly (but not entirely) orthogonal to the question of whether sciform would be useful as a backend for uncertainties formatting.

@jagerber48
Owner Author

Collecting my thoughts on by-overall-width formatting for the moment. The first comment will have a lengthy history of how sciform came to be and why the choice was originally made to exclude by-overall-width formatting. The next comment will have a summary of the findings in this issue and the outstanding edge cases.

sciform was originally inspired by the uncertainties package, which has great value/uncertainty formatting, encoded in a nice conservative extension* of the built-in format specification mini language (FSML), but had a few shortcomings, including no support for engineering notation and issues with formatting non-finite values or uncertainties. The built-in FSML also has some issues, including not being able to easily format numbers to a fixed number of sig figs in fixed point notation, and a lack of support for engineering notation. The shortcomings of the built-in FSML are discussed in this thread. Prior to that thread I had been kicking around some code for this type of formatting, but that thread inspired me to build the code out into a nice PyPI package to fill the gaps I saw. I've learned a lot along the way and still have a LOT to learn about open source software.

Originally sciform was meant to be a conservative extension of the built-in FSML just like uncertainties formatting was a conservative extension. However, two things happened. First, I took the decision pretty early on that I didn't want sciform to be making decisions "under-the-hood". I wanted all decisions to be presented up front as explicit options to the user. This "philosophical" point is at odds with the built-in FSML g formatting style, which makes a choice under-the-hood whether to use fixed-point or scientific formatting. Second, this post in the above thread inspired me to move away from an FSML-based formatting approach and towards an approach based on a highly configurable Formatter object. The main justification being that there are just too many options possible to be neatly captured in an FSML.

Free from the constraint of making an extension to the built-in FSML, I could reconsider from the ground up what exactly a value or value/uncertainty formatter should do. The main shortcoming of the built-in FSML is that you can't independently control the exponent mode (e.g. fixed point or scientific notation) and the number of significant figures displayed. If you use e formatting then the number of sig figs will always be prec + 1. If you use g formatting then you can always control the number of sig figs (it will always be prec I think), but you can't always control the exponent mode (it switches between fixed point and scientific notation based on some rules). sciform allows you independent control over these two things. exp_mode controls the exponent mode and round_mode controls whether ndigits sets the number of digits appearing after the decimal place or the total number of sig figs displayed. There is no exact equivalent of g formatting mode in sciform.
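The coupling between precision, sig figs, and exponent mode in the built-in FSML can be checked with a few standard-library f-strings:

```python
examples = [123456, 123.456, 0.000123456, 0.0000123456]

table = []
for number in examples:
    # ".3e" always shows prec + 1 = 4 sig figs in scientific notation.
    # ".3g" shows 3 sig figs but picks fixed point or scientific itself.
    table.append((f"{number:.3e}", f"{number:.3g}"))
    print(table[-1])
```

With g formatting the last two inputs differ only by a factor of ten, yet one comes out in fixed point and the other in scientific notation, which is the under-the-hood switch described above.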

The next part of the built-in FSML I considered was the width control. After developing the round_mode I had gotten into a numbers-based mindset of thinking about number formatting in terms of decimal places. That is, decimal place rounding specifies how many decimal places past the decimal symbol to display. Sig fig rounding specifies the number of digits between the most and least significant decimal places. And the code reflected this emphasis on decimal places. While in that mindset, I got it in my head that it made sense to have an option to pad a number up to a certain decimal place with either zeros or spaces between the most significant digit and the sign symbol (or the start of the string if no sign symbol is present). This was inspired by (but different from) the = align mode in the built-in FSML. In this way, the round_mode + ndigits controls how many decimal places to the right of either the decimal symbol or the most significant digit are included, and the left_pad options control how many decimal places to the left of the most-significant digit are included.

Looking at the built-in FSML was strange through this lens. The width specifier in the built-in FSML is not parametrized in terms of decimal places at all, but it is rather trying to control the overall character-width of the string, including incidental characters like the sign symbols, decimal symbols, thousands/thousandths separators, exponent symbols, etc. At this point I elected to include the left_pad by decimal place formatting in sciform, but to exclude padding to a certain overall character width. However, this was done with the specific knowledge that a user could ALWAYS take a sciform output string and format it to a fixed width using python string padding functionality.

from sciform import Formatter

formatter = Formatter(
    exp_mode="engineering",
    pdg_sig_figs=True
)

formatted = formatter(123000, 456)

print(formatted)
# '(123.0 ± 0.5)e+03'

print(f"{formatted:<30s}")
# '(123.0 ± 0.5)e+03             '

print(f"{formatted:^30s}")
# '      (123.0 ± 0.5)e+03       '

print(f"{formatted:>30s}")
# '             (123.0 ± 0.5)e+03'

So the idea was that sciform would handle the numeric side of formatting, allowing users to control characters to the left and right of the most significant digit in a way parametrized by number properties like decimal places. If the user had need for the overall string to be formatted to a certain overall width (e.g. the result string is appearing in an aligned table), then, partially in the interest of keeping sciform lean, that duty would be abdicated to python string formatting. This is what I meant above when I tried to say something along the lines that sciform should handle numerical formatting while python handles generic string formatting.

However, importantly, one thing is lost by having sciform only be able to specify left padding in terms of a specific decimal place. The user can no longer specify a target overall length and have the string automatically be padded between the sign symbol and the most significant digit. That is, the user must explicitly control which decimal places are populated. Decimal places can no longer be automatically populated to hit a certain string width. This means, if the user wants to both pad the string to a certain length and have the padding appear between the most significant digit and the sign symbol, then sciform cannot help them unless they do some ad-hoc calculations up front, using knowledge of the number of "overhead" symbols like exponent characters etc., to carefully tune the left_pad_dec_place and ndigits to the right values to hit an overall string length. This is sort of like the reverse of how, if you want to format a fixed point number to a certain number of sig figs using the built-in FSML, you have to do an up front calculation of the desired precision (number of digits after the decimal place) using the magnitude of the number and the desired number of sig figs.
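The up front calculation mentioned for the built-in FSML looks roughly like this. `fixed_point_sig_figs` is a hypothetical helper for illustration; it does not handle rounding that carries into a new leading digit (e.g. 99.96 at 3 sig figs comes out as 100.0).

```python
import math


def fixed_point_sig_figs(value, sig_figs):
    """Hypothetical helper: fixed point text with `sig_figs` sig figs."""
    if value == 0:
        return f"{0:.{sig_figs - 1}f}"
    # The magnitude of the number decides how many decimals are needed.
    exponent = math.floor(math.log10(abs(value)))
    decimals = sig_figs - 1 - exponent
    if decimals >= 0:
        return f"{value:.{decimals}f}"
    # Fewer sig figs than integer digits: round to the left of the decimal.
    return f"{round(value, decimals):.0f}"


print(fixed_point_sig_figs(123.456, 4))
print(fixed_point_sig_figs(0.0012345, 2))
print(fixed_point_sig_figs(123456, 2))
```

The analogous calculation for hitting an overall string width would have to start from the desired length and subtract the overhead characters instead.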

I have so-far concluded that adding padding symbols between the sign symbol and the most significant digit to reach a certain overall length is out-of-scope for sciform. Hence, sciform is not a conservative extension of the built-in FSML. There is at least one thing the built-in FSML can do that sciform can't (there are other things too).
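For reference, this is the built-in FSML behavior in question: with the = align mode the fill characters are placed between the sign and the digits until the requested overall width is reached.

```python
value = -12.5

zero_padded = f"{value:0=9.2f}"   # fill with zeros after the sign
space_padded = f"{value: =9.2f}"  # fill with spaces after the sign
right_aligned = f"{value:>9.2f}"  # ordinary padding, before the sign

for text in (zero_padded, space_padded, right_aligned):
    print(repr(text), len(text))
```

All three strings have the same overall width; only the first two put the padding between the sign symbol and the most significant digit.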

I am curious to learn more about use cases for controlling overall string width by controlling the number of decimal places occupied to the left and right of the most significant digit. Why does left/right/center string padding not suffice?

The lmfit use case seems to prefer left- and right-most symbols to match up within a given column. This looks nice at a glance, but I don't know if it helps readers glean information more easily, especially since more sig figs are displayed than are probably necessary for the reader in the interest of matching string lengths. In the lmfit case the numbers in different rows may have wildly varying orders of magnitude and may not need to be compared to each other. If one is comparing numbers between rows, the ease of comparison may be spoiled by the facts that (1) the decimal points don't line up and (2) some numbers might be displayed in fixed point and some in scientific notation. If lmfit were interested in dropping the gformat function I think it could just use regular python number formatting for individual values along with python string left or right padding to get the tables aligned (possibly using tabulate, which does all of the calculation automatically). For the value/uncertainty formatting lmfit could use uncertainties or sciform value/uncertainty formatting and then, again, use python string formatting to pad the resulting strings to make the tables aligned (or use tabulate). The left/right digit alignment will be lost, but I don't think it would be a net loss to readability (including fewer sig figs could be a win, and using uncertainties or sciform for value/uncertainty formatting would definitely be a win). But I fully admit that a lot of this is a matter of preference. Maybe many users have a preference for the clean look of aligned left-most and right-most digits, and that preference alone could justify the inclusion of a feature to control string length using the number of decimals included. In any case, I have some code starting to demonstrate what these suggestions would look like for lmfit reports as an example for this issue.

If instead users had a collection of numbers of similar order of magnitude that they wanted to display in columns for comparison, I would still argue that formatting by decimal places like sciform does would be nice, and the numbers could then be formatted using python string formatting techniques to align either their right-most digits or their decimal points. Or better yet, the numbers could be left padded to the largest decimal place present and rounded to the least significant digit present.
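As a minimal sketch of the two padding-based alignments mentioned above, using only built-in string methods: the helper name `align_decimal_points` is hypothetical, and the input strings are assumed to be already formatted (here they are borrowed from the fixed-point value column later in this thread).

```python
def align_decimal_points(strings):
    """Pad number strings with spaces so their decimal points share a column."""
    # Split each string into (whole part, ".", fractional part).
    parts = [s.partition(".") for s in strings]
    left = max(len(whole) for whole, _, _ in parts)
    right = max(len(sep + frac) for _, sep, frac in parts)
    return [
        whole.rjust(left) + (sep + frac).ljust(right)
        for whole, sep, frac in parts
    ]


numbers = ["13.89", "5.440", "0.125", "0.00996"]

# Option 1: align the right-most digits with plain right padding.
width = max(len(s) for s in numbers)
right_aligned = [s.rjust(width) for s in numbers]

# Option 2: align the decimal points.
point_aligned = align_decimal_points(numbers)

for a, b in zip(right_aligned, point_aligned):
    print(f"|{a}|   |{b}|")
```

In both cases the number of displayed digits is decided first and the column width follows from it, rather than the other way around.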

!!!!!
To all readers: If python left/right/center padding to control string width does not suffice for you, and you would instead prefer controlling string width by controlling displayed decimal places, I am curious to know more details about why!
!!!!!

*By conservative extension I mean that the uncertainties FSML will accept any built-in format specification string and format numbers the same way as the built-in FSML. It only extends the FSML and has different behavior for format specification strings outside of the built-in FSML (e.g. the u format type).

@jagerber48
Owner Author

jagerber48 commented Jan 31, 2024

After that long comment, more thoughts on the specific issue at hand.

The previous comment was a long history of why I haven't included by-total-length formatting SO FAR. That doesn't mean I'm 100% opposed to including it ever. As evidence see #139. If it was very easy to include total-length formatting as a helper/wrapper around sciform functionality I would do so without much complaint. However, there are a few edge cases that prevent me from implementing a general by-total-length formatter.

  • As discussed early in this thread, and in Feature/fixed width formatter #139, it is possible to write a guess-and-check wrapper function that adds sig figs until a string exceeds a certain length. The function can try different exponent modes and select the most preferred exponent mode which comes closest to the target length (while at least matching it) and has the most significant figures. If fixed point and scientific exponent modes are allowed, only a single number is being formatted, the target length is >= 8, and no thousand or thousandths separators are included, then this function is guaranteed to be able to hit the target length.
  • When formatting value/uncertainty pairs using +/- (as opposed to parentheses) format, adding a significant figure always adds an even number of characters to the string since the new decimal place must be present in both the value and uncertainty. This means, depending on whether the base format has an even or odd number of characters, it will be impossible to hit the desired length in about half the cases. There will be at least an off-by-one error in some cases. If there is a need to jump across a decimal symbol or thousandths symbol then I think there may be an off-by-three error.
  • Similar to the above case, sometimes adding a new sig fig will introduce a thousandths separator. This means the number of characters will jump by two on a single value. This may result in an off-by-one error when trying to hit a target length.
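The first two bullet points can be illustrated with a rough sketch built on Python's built-in `e` and `f` format types. This is not the code from #139: the names `format_to_width` and `pm_lengths` are hypothetical, and only scientific notation is tried, whereas the wrapper described above would also compare other exponent modes.

```python
def format_to_width(value: float, width: int, max_sig_figs: int = 17) -> str:
    """Add sig figs in scientific notation until the string reaches `width`.

    The result may be longer than `width` when adding one sig fig makes the
    length jump by more than one character.
    """
    candidate = ""
    for sig_figs in range(1, max_sig_figs + 1):
        candidate = f"{value:.{sig_figs - 1}e}"
        if len(candidate) >= width:
            break
    return candidate


def pm_lengths(value: float, uncertainty: float, max_decimals: int = 4) -> list:
    """String lengths of 'value +/- uncertainty' for 1..max_decimals decimals."""
    return [
        len(f"{value:.{d}f} +/- {uncertainty:.{d}f}")
        for d in range(1, max_decimals + 1)
    ]


# The second sig fig introduces a decimal symbol, so length 7 is skipped.
print(format_to_width(-112345678.0, 7))   # -1.1e+08 (8 characters)
print(format_to_width(-112345678.0, 10))  # -1.123e+08 (10 characters)

# Each extra decimal place adds two characters, one per number.
print(pm_lengths(13.8904759, 0.24410753))  # [12, 14, 16, 18]
```

With the +/- format every reachable length here is even, so an odd target length can only be missed by at least one character.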

How should these cases be addressed? Ideas:

  • sciform should only provide a formatter that narrowly covers the guaranteed case in the first bullet point above
  • sciform should provide a formatter that makes a best effort to hit a certain length, but might overshoot by up to 3 characters. This is already implemented in Feature/fixed width formatter #139. However, would this even be useful? If this function is being used for putting numbers in a table it needs to hit the correct length! Otherwise the user will need to manually left or right pad the resulting string with spaces. And if the user already has to do that, why not just forget by-total-length formatting and just format the number and left/right pad it as described above?
  • The off-by-three error could be reduced to an off-by-one error by removing a single sig fig, but now the overall length would be less than the target length. Maybe the by-length formatter could be advertised as returning the string that gets closest to the target length, whether that is lesser or greater than the target length.
  • Perhaps the formatter could return the string which is closest to the target length but smaller, and then any difference could be made up for by e.g. prepending or appending spaces.
    • An issue with this approach is that there is no guarantee that you can format a string to be smaller than a target value by removing sig figs. You can always format a string to be larger than the target value by adding more sig figs, but not the other way around, since there is a minimum "character overhead" for any formatting method.
  • In all of the above cases, what should happen if the target length is missed? Should an exception or warning be raised? Or should it just be documented behavior that the function only makes a best-effort to hit the target length? Again, if it is only best-effort is that function even useful?
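One of the ideas above, "closest to the target length but not longer, then pad", could be sketched as follows. The function name `format_within_width` is hypothetical, only scientific notation via the built-in `e` format type is considered, and raising `ValueError` is just one possible answer to the open question of what to do when the target cannot be met.

```python
def format_within_width(value: float, width: int, max_sig_figs: int = 17) -> str:
    """Return the longest scientific-notation string that fits in `width`,
    left padded with spaces so the result is exactly `width` characters."""
    best = None
    for sig_figs in range(1, max_sig_figs + 1):
        candidate = f"{value:.{sig_figs - 1}e}"
        if len(candidate) > width:
            break
        best = candidate
    if best is None:
        # Even one sig fig exceeds the target: the minimum character
        # overhead of this notation is larger than `width`.
        raise ValueError(f"cannot format {value!r} in {width} characters")
    return best.rjust(width)


print(repr(format_within_width(-112345678.0, 7)))  # ' -1e+08'

try:
    format_within_width(-112345678.0, 5)
except ValueError as err:
    print(err)
```

The padded result always has the requested length, but the padding step is the same left/right padding a user could apply to any formatted string.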

@jagerber48
Owner Author

On my local fork of lmfit I reworked some code in printfuncs.py to drop usage of gformat and pick up usage of tabulate (at least for the example_fit_with_bounds.py example script). This is to demonstrate (1) how the table would look if the left- and right-most digits of the numbers are not matched up (but tabulate controls white space so the table is still aligned) and (2) how the code would look with lmfit not needing to worry about getting the whitespace right for the table to look good.

See jagerber48/lmfit-py#1. See especially the diff on correl_table to see how much easier the code is to follow when the table formatting whitespace work is outsourced to tabulate.

Old fit results:

[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 79
    # data points      = 1500
    # variables        = 4
    chi-square         = 11301.3646
    reduced chi-square = 7.55438813
    Akaike info crit   = 3037.18756
    Bayesian info crit = 3058.44044
[[Variables]]
    amp:     13.8904759 +/- 0.24410753 (1.76%) (init = 13), model_value = 14
    period:  5.44026387 +/- 0.01416106 (0.26%) (init = 2), model_value = 5.4321
    shift:   0.12464389 +/- 0.02414210 (19.37%) (init = 0), model_value = 0.12345
    decay:   0.00996363 +/- 2.0275e-04 (2.03%) (init = 0.02), model_value = 0.01
[[Correlations]] 
  +----------+----------+----------+----------+----------+
  | Variable | amp      | period   | shift    | decay    |
  +----------+----------+----------+----------+----------+
  | amp      | +1.0000  | -0.0700  | -0.0870  | +0.5757  |
  | period   | -0.0700  | +1.0000  | +0.7999  | -0.0404  |
  | shift    | -0.0870  | +0.7999  | +1.0000  | -0.0502  |
  | decay    | +0.5757  | -0.0404  | -0.0502  | +1.0000  |
  +----------+----------+----------+----------+----------+

New fit results:

[[Fit Statistics]]
    ------------------  ----------
    # fitting method    leastsq
    # function evals    79
    # data points       1500
    # variables         4
    chi-square          11301.3646
    reduced chi-square  7.5544
    Akaike info crit    3037.1876
    Bayesian info crit  3058.4404
    ------------------  ----------
[[Variables]]
     Name          Value          Percent Uncertainty    Constraint    Init Val    Model Val
    ------  -------------------  ---------------------  ------------  ----------  -----------
     amp    (1.389+/-0.024)e+01          1.76%              Vary          13          14
    period  (5.440+/-0.014)e+00          0.26%              Vary          2         5.4321
    shift    (1.25+/-0.24)e-01          19.37%              Vary          0         0.12345
    decay    (9.96+/-0.20)e-03           2.03%              Vary         0.02        0.01
[[Correlations]] 
    ┌────────┬─────────┬──────────┬─────────┬─────────┐
    │        │ amp     │ period   │ shift   │ decay   │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ amp    │ +1.0000 │ -0.0700  │ -0.0870 │ +0.5757 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ period │ -0.0700 │ +1.0000  │ +0.7999 │ -0.0404 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ shift  │ -0.0870 │ +0.7999  │ +1.0000 │ -0.0502 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ decay  │ +0.5757 │ -0.0404  │ -0.0502 │ +1.0000 │
    └────────┴─────────┴──────────┴─────────┴─────────┘

If lmfit does not want to adopt sciform then I would recommend moving to tabulate but preserving usage of gformat for formatting the floats in the fit statistics table, so that they all have the same number of sig figs like they do in the old fit results table above.

If lmfit does want to adopt sciform then the floats in the fit statistics table could be configured to use the same number of sig figs by using exp_mode="fixed_point" and round_mode="sig_fig" (default). I would recommend ndigits=4 to show only 4 significant figures (noting that the string widths will not always agree if the numbers in the table span many orders of magnitude). The sig fig formatting could be extended to Init Val and Model Val columns if desired. And of course sciform could be used to control the value/uncertainty formatting in the value column. This could even be done in an lmfit user configurable way if the user prefers, e.g. (1.389(24))e+01 over (1.389+/-0.024)e+01 or prefers engineering notation, etc.
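To make the remark about string widths concrete, here is a standard-library stand-in for fixed-point formatting at a fixed number of significant figures. It is not sciform's implementation, and the helper name `fixed_point_sig_figs` is hypothetical; it only shows that a shared sig fig count gives strings of different lengths once the values differ in order of magnitude.

```python
import math


def fixed_point_sig_figs(value: float, sig_figs: int = 4) -> str:
    """Format `value` in fixed-point notation rounded to `sig_figs` sig figs."""
    if value == 0:
        return "0"
    # Decimal place of the most significant digit, e.g. 4 for 11301.3646.
    exponent = math.floor(math.log10(abs(value)))
    # Decimal places to keep; negative means rounding left of the decimal point.
    decimals = sig_figs - 1 - exponent
    rounded = round(value, decimals)
    return f"{rounded:.{max(decimals, 0)}f}"


stats = [11301.3646, 7.55438813, 3037.18756, 3058.44044]
strings = [fixed_point_sig_figs(v) for v in stats]
print(strings)                    # ['11300', '7.554', '3037', '3058']
print([len(s) for s in strings])  # [5, 5, 4, 4]
```

Any remaining width differences would then be absorbed by padding or by tabulate rather than by adding sig figs.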

@jagerber48
Owner Author

jagerber48 commented Jan 31, 2024

Here is how the table might look using sciform + tabulate (but without controlling overall string widths by adjusting sig figs). Here I'm demonstrating sciform's superscript=True behavior to show exponents as unicode superscripts.

[[Fit Statistics]]
    ------------------  -------
    # fitting method    leastsq
    # function evals    79
    # data points       1500
    # variables         4
    chi-square          11300
    reduced chi-square  7.554
    Akaike info crit    3037
    Bayesian info crit  3058
    ------------------  -------
[[Variables]]
     Name          Value          Percent Uncertainty    Constraint    Init Val    Model Val
    ------  -------------------  ---------------------  ------------  ----------  -----------
     amp    (1.389 ± 0.024)×10¹          1.76%              Vary          13          14
    period  (5.440 ± 0.014)×10⁰          0.26%              Vary          2         5.4321
    shift   (1.25 ± 0.24)×10⁻¹          19.37%              Vary          0         0.12345
    decay   (9.96 ± 0.20)×10⁻³           2.03%              Vary         0.02        0.01
[[Correlations]] 
    ┌────────┬─────────┬──────────┬─────────┬─────────┐
    │        │ amp     │ period   │ shift   │ decay   │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ amp    │ +1.0000 │ -0.0700  │ -0.0870 │ +0.5757 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ period │ -0.0700 │ +1.0000  │ +0.7999 │ -0.0404 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ shift  │ -0.0870 │ +0.7999  │ +1.0000 │ -0.0502 │
    ├────────┼─────────┼──────────┼─────────┼─────────┤
    │ decay  │ +0.5757 │ -0.0404  │ -0.0502 │ +1.0000 │
    └────────┴─────────┴──────────┴─────────┴─────────┘

Or with the variable table in fixed_point mode:

[[Variables]]
     Name         Value         Percent Uncertainty    Constraint    Init Val    Model Val
    ------  -----------------  ---------------------  ------------  ----------  -----------
     amp      13.89 ± 0.24             1.76%              Vary          13          14
    period    5.440 ± 0.014            0.26%              Vary          2         5.4321
    shift     0.125 ± 0.024           19.37%              Vary          0         0.12345
    decay   0.00996 ± 0.00020          2.03%              Vary         0.02        0.01
