-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement fast-failover for MessageRecvManager and DataClientManager #243
Conversation
Codecov Report
@@ Coverage Diff @@
## master #243 +/- ##
============================================
+ Coverage 85.83% 85.85% +0.02%
Complexity 3232 3232
============================================
Files 344 344
Lines 12072 12076 +4
Branches 1087 1088 +1
============================================
+ Hits 10362 10368 +6
+ Misses 1185 1182 -3
- Partials 525 526 +1
... and 3 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution~
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/QueuedMessageSender.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java
Outdated
Show resolved
Hide resolved
Hi, @javeme , thanks for your review again, i've made some changes based on your comments, PTAL : ) AND
I've reverted this part to only catch InterruptedException with original log message. |
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
Hi, all, really thanks for @coderzc's suggestion, I've modified the implementation in a more elegant way. Like describe above in PR description:
PTAL : ) |
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSender.java
Outdated
Show resolved
Hide resolved
Hi, @coderzc , during the process of sending vertices and edges, two control message sent from the same worker share the same public void finishSend(MessageType type, boolean sendControlMessage) {
Map<Integer, MessageSendPartition> all = this.buffers.all();
MessageStat stat = this.sortAndSendLastBuffer(all, type);
Set<Integer> workerIds = all.keySet().stream()
.map(this.partitioner::workerId)
.collect(Collectors.toSet());
if (sendControlMessage == true) {
this.sendControlMessageToWorkers(workerIds, MessageType.FINISH);
}
LOG.info("Finish sending message(type={},count={},bytes={})",
type, stat.messageCount(), stat.messageBytes());
} or we separate the call of |
@Radeity Do you mean to merge send vertex and send edge into the same session? |
Yes, can we do that? |
@coderzc it seems that a lot to change, maybe we should keep using |
I agree with this, then only add a |
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
this.expectedFinishMessages); | ||
this.finishMessagesFuture = new CompletableFuture<>(); | ||
if (!this.finishMessageCount.compareAndSet(0, this.expectedFinishMessages)) { | ||
throw new ComputerException("The origin count must be 0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The origin finishMessageCount ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UT will fail if using compareAndSet()
here. Also, I find it doesn't have concurrency issue, so I modify to directly use set()
.
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/network/DataClientManager.java
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
"finish-messages received in %s ms in superstep %s", | ||
e, | ||
this.expectedFinishMessages, | ||
this.waitFinishMessagesTimeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer to catch timeout exception individually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should catch them with different error message?
MessageRecvManager
and DataClientManager
for fast-fail
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
@Radeity Please fix the code style, you can import https://github.com/apache/incubator-hugegraph/blob/master/hugegraph-style.xml to IDEA |
Good reminder 🤔 Update: put the idea-config-doc here first Also, |
I use IDEA for remote development, which really annoy me that I can not import files. May I ask do we have other ways to make this format script effective? BTW, maybe we can consider to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
But the CI still always fail.. add a TODO issue/task for it
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Purpose of the PR
MessageRecvManager
andDataClientManager
Fast-Fail #211Main Changes
This PR implements method
exceptionCaught
ofMessageRecvManager
andDataClientManager
in different ways.ComputerJob
will get Fail status (according to default configurationAUTO_DESTROY_POD
is true, the CR will be directly deleted).I find it's the most light weight implementation, IN FACT, it can be seen as a lazy-fail, rather than block as before.
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO
Doc - Done
Doc - NO Need