-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework C++ Runtime Interval, IntervalSet, ATN and ATNState #1930
Conversation
- Remove the readonly status from IntervalSet. - Remove virtual functions from IntervalSet and Interval. These are passed by value throughout the C++ runtime; meaningful inheritance is not possible anyway. - Moving the atomic flag into ATNState as a "now cached" flag. - Return a const reference from ATN::nextStates(ATNState*) so the readonly status is enforced by the compiler not at runtime in the code. - Use value semantics using std::move to reduce the number of copies performed, constent with how these classes are used in the C++ runtime source. - Remove type-unsafe varargs constructor in IntervalSet, replace with type-safe varadic templates implementation.
The Xcode build failure seems to be a timeout ... |
I ran my test program at https://github.com/ksedgwic/antlr-cpp-threading-example using this branch (21a53c2). I'm still seeing crashes when the program is multi-threaded
|
@ksedgwic Thanks for testing and for the stack trace. This bug is this sequence of events: locking to acquire an iterator, dropping the lock, and then dereferencing the iterator. In the mean time the iterator has been invalidated. I'll add a patch to this pull request. (On looking, there is another case in the code where the same pattern occurs; fixing that one too.) |
Iterators on an unordered_map were being dereferenced after dropping a read lock, leading to races where iterators could be invalidated before they were used.
@ksedgwic I believe the latest commit in this pull request should solve the crash you are seeing. |
@@ -6,6 +6,7 @@ | |||
#pragma once | |||
|
|||
#include "misc/IntervalSet.h" | |||
#include <atomic> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System headers should go in antlr4-common.h
. I love to have them all together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will change.
private: | ||
/// Used to cache lookahead during parsing, not used during construction. | ||
|
||
misc::IntervalSet nextTokenWithinRule; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: private vars use a leading underscore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will change.
For some reasons I left my review pending and had to finish it now. |
std::cout << std::string("reuse state ") << s->stateNumber << std::string(" edge to ") << iterator->second->stateNumber << std::endl; | ||
} | ||
if (iterator != s->edges.end()) { | ||
std::cout << std::string("reuse state ") << s->stateNumber << std::string(" edge to ") << iterator->second->stateNumber << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, that's still code from the automatic conversion. We don't need to create std::string
here explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again -- The string conversion is from the existing code. I added the iterator check to avoid dereferencing a bad iterator in the debug code. That's a bugfix (if only in an error message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment wasn't specifically for your patch. More loudly thinking... Would be nice if things like that get fixed while someone comes across them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the read-only flag might not work as you expect and in fact I have a bad feeling about this change. Interval sets can be made read only dynamically. That cannot be enforced by the compiler. Let's not deviate so much from the Java code, even if that seems like a nice beautification. Intervals are stored in many places and it needs really careful checks to ensure changes don't break anything.
@@ -26,59 +26,56 @@ namespace misc { | |||
ssize_t b; | |||
|
|||
Interval(); | |||
explicit Interval(size_t a_, size_t b_); // For unsigned -> signed mappings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that one in my first review. Why did you remove the explicit
keyword? I certainly don't want automatic conversion between size_t
and ssize_t
! Compilers behave differently here. Please put that back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to put it back. However: I don't understand what you mean about size_t
and ssize_t
conversions. My understanding of the explicit
keyword is that it will prevent:
IntervalSize s = { 1, 2 };
from compiling, and require:
IntervalSize s(1, 2);
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntervalSet
of course ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explicit
keyword marks a c-tor so that no automatic type conversion is applied. So when the compiler looks up a c-tor with e.g. int values that explicit c-tor is not considered a valid match. This way we can ensure that unsigned values only go with the explicit c-tor, while signed ones use the other c-tor (which is important because of the internal handling). I wish we could avoid using negative values. From the pure semantic of that class there cannot be a negative interval, but such values denote invalid intervals and are used when sorting intervals in an interval set (which is why we cannot use something like (size_t)-1
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explicit
keyword's type conversion refers to conversion to the target class through a converting constructor, and whether that conversion can be implicit or must be explicit. I'm not sure how you are using this in the code. For example:
explicit IntervalSet(size_t a, size_t b);
will prevent both:
IntervalSet s = { 1, 2 };
IntervalSet s = { -1, -2 }
and will allow both:
IntervalSet s(1, 2);
IntervalSet s(-1, -2);
What is the form you want to disallow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's surprising. From my tests I found that the explicit constructor will be used for IntervalSet s(1, 2)
.
Anyway, the idea is simple: use the c-tor for size_t
only with unsigned values and the one with ssize_t
for signed ones. That means usually that the one for unsigned values is used (since most values are unsigned) and only in specific cases take the signed version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we haven't finished this one. Are you sure we don't need the explicit
keyword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't do what you think, re: signed vs. unsigned comparisons, however, adding it back in because it doesn't hurt.
} | ||
|
||
IntervalSet::IntervalSet(const std::vector<Interval> &intervals) : IntervalSet() { | ||
_intervals = intervals; | ||
IntervalSet::IntervalSet(const IntervalSet &set) : _intervals(set._intervals) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the changes to the c-tors. I took care to use delegating c-tors everywhere and want to stay with them, instead of repeating init of internal values over and over again. Please revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you insist. However: This a class with a single member that's being constructed empty and then assigned to in a constructor body. I don't see a repeated init here. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this one is so simple, but I'd like to keep that pattern throughout the runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pls change this back also? As I said already, I like to stay with delegating c-tors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
for (int i = 0; i < n; i++) { | ||
add(va_arg(vlist, int)); | ||
} | ||
IntervalSet::IntervalSet(std::vector<Interval>&& intervals) : _intervals(std::move(intervals)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned a replacement of the variadic variant with template params, but that's not it. So where is the new code? Also using move
here is not a good idea. What if you wanna pass in a constant? Better use const&
for the parameter (if we stay with this variant).
Actually, the new code changes the semantic significantly. The variadic variant uses single value intervals (in fact it actually uses single elements, which are later converted to single element intervals). Your new code allows for any interval list (a case we already handled in a different c-tor). This change is not improving anything IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The varadic code starts on line 42 of IntervalSet.h, but I think you found it from your later comment.
The constructor that takes a vector rvalue reference is a private constructor used internally. If we want to use a constant here it will be local to this class and can be an explicit decision.
Removing type-unsafe va_args code certainly seems like an improvement to me: The old code is likely to fail outright, especially on machines where sizeof(int) != sizeof(ssize_t). The new code is type-safe, provides compile-time diagnostics, and allows single and multi-value intervals. For the same arguments I don't see a semantic change (excluding failures because of va_args usage problems).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I didn't ask for the variadic code, but the templated one (which I don't see in your patch). However, as you say, the new code is a private c-tor, which is not that obvious from the patch. I wonder, though, why you could remove the variadic code. Is it not used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the varadic code with a templated version that is type safe and provides the same interface. There are two changes that are being mixed here because they're close by in the diff:
- The va_arg code was replaced with a templated version in the header file. The old va_arg code was removed. The interface is the same and any code using the old va_args based interface will now use the templated version (including type safety).
- At the same place in the file an internal constructor was placed to move in a vector of intervals.
They are distinct changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My fault. I looked at the wrong places. The templated code is just fine (even though a bit less readable IMO). But type safety is a good thing.
On the read-only flag: IntervalSets seem to copied by value throughout the code base, with const references from the cached values. The actual read only flag is only set in the Am I missing something? |
Ok, I checked again about the read-only flag and it seems you are right. I confused that with the |
I have a pending patch that also removes read only from ATNConfig. I spent a while making sure it was OK, but that was a while ago now. |
@mike-lischke Changes made as requested; sorry about the delays, had to do other work! |
@mike-lischke I meant to commit the whitespace and mode changes to my pending branch, did it here by accident. They remove the byte order mark from checked in files (eg. they're useless on UTF8) and removes the execute bit from source file mode flags. A side-effect of using Perforce for our local copy and moving changes around. |
@jm-mikkelsen no, please remove [b106f8e]. That's a big useless patch. Next time these files are saved the BOMs are re-added. Don't let your SCMS manage styles (BOM, line endings and the like). |
@mike-lischke But isn't that the point of "charset = utf-8" in .editorconfig? (Although I agree it hasn't been set up for .h and .cpp, but I suspect that's just an oversight.) From the editorconfig.org front page:
|
BOMs are really not an issue. Why fixing something that is not broken? We cannot forsee what editor settings are used and I really don't want to be forced to check this everytime I use a new editor (or when I get a patch). But regardless of that, the BOM removal doesn't belong to this patch and must be reverted. |
@mike-lischke Where do we stand on this PR? |
I asked @jm-mikkelsen to revert patch b106f8e. This one changes almost 300 files only because of BOM clean up (+ some file mod changes). I recommend not to merge until this is removed. |
@parrt @mike-lischke Sorry about the absence, had project deadlines and then was travelling for 6 weeks. BOM changes reverted, merged changes from head. Back to normal now, so can respond to changes quickly again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a few small things left. I think we are close now.
} | ||
|
||
IntervalSet::IntervalSet(const std::vector<Interval> &intervals) : IntervalSet() { | ||
_intervals = intervals; | ||
IntervalSet::IntervalSet(const IntervalSet &set) : _intervals(set._intervals) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pls change this back also? As I said already, I like to stay with delegating c-tors.
@@ -26,59 +26,56 @@ namespace misc { | |||
ssize_t b; | |||
|
|||
Interval(); | |||
explicit Interval(size_t a_, size_t b_); // For unsigned -> signed mappings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we haven't finished this one. Are you sure we don't need the explicit
keyword?
|
||
private: | ||
void InitializeInstanceFields(); | ||
void addItems() { /* No-op */ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one seems strange. Can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is necessary for the variadic template constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, to end the iteration. OK.
Hmm...all of the copyright headers are showing diff so looks like huge PR. can you undo that part? |
@parrt The headers have been taken back already. I see only 10 files changed in the patch. |
@mike-lischke ok, ping me when this is ready to merge. |
- Add "explicit" to Interval(size_t, size_t) constructor. - Change an IntervalSet constructor to delegate part of the construction - Add "explicit" to Interval(size_t, size_t) constructor. - Change an IntervalSet constructor to delegate part of the construction
@mike-lischke OK, changes made! |
|
||
private: | ||
void InitializeInstanceFields(); | ||
void addItems() { /* No-op */ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, to end the iteration. OK.
@parrt I think we are good here. |
@jm-mikkelsen The byte order marks are not unnecessary, and serve a very valuable purpose. In a world where the default encoding for everyone is UTF-8, it is unnecessary, but we are nowhere close to that. If the byte order mark is omitted, eventually someone will open up a file and make an edit in an encoding other than UTF-8, leading to eventual misinterpretation of the text content and loss of information. The place I see this most often is in the names of people from European countries, but I've seen it happen in other cases as well. Despite the wording provided by EditorConfig for the |
@sharwell I agree that byte order marks serve a purpose, but that purpose is limited to detecting the difference between UTF-16LE and UTF-16BE. UTF-8 byte order marks just get in the way; if you want byte order marks, add them during the conversion to UTF-16. Perforce, for example, will do this. I don't see how you get the "UTF-8 is discouraged" reading from the EditorConfig text, and I don't see how UTF-8 leads to corruption. The editor count seems wrong too: If an editor doesn't handle UTF-8, how does it handle a UTF-8 BOM? A quick check of my antlr4 source tree shows 243 out of 1918 files have byte order marks. 205 of these are C++ source code (.cpp/.h). None of these are Java source code. My reading is that byte order marks have been incorrectly added for languages which are not listed in .editorconfig, with the majority of these in the C++ runtime. |
passed by value throughout the C++ runtime; meaningful inheritance is
not possible anyway.
status is enforced by the compiler not at runtime in the code.
constent with how these classes are used in the C++ runtime source.
type-safe varadic templates implementation.