Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement fast-failover for MessageRecvManager and DataClientManager #243

Merged
merged 6 commits into from
May 25, 2023

Conversation

Radeity
Copy link
Member

@Radeity Radeity commented May 21, 2023

Purpose of the PR

Main Changes

This PR implements method exceptionCaught of MessageRecvManager and DataClientManager in different ways.

  • Both set an exceptional future. The main thread will sense this exception later when sending(sender) and waiting(receiver) controlling message.
  • JVM process will exit and ComputerJob will get Fail status (according to default configuration AUTO_DESTROY_POD is true, the CR will be directly deleted).

I find it's the most light weight implementation, IN FACT, it can be seen as a lazy-fail, rather than block as before.

Verifying these changes

  • This change is already covered by existing tests.

Does this PR potentially affect the following parts?

  • Nope
  • Dependencies (add/update license info)
  • Modify configurations
  • The public API
  • Other affects (typed here)

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - NO Need

@Radeity
Copy link
Member Author

Radeity commented May 21, 2023

Hi, @coderzc, @imbajin, would you like to help review this PR when available : )

@codecov
Copy link

codecov bot commented May 21, 2023

Codecov Report

Merging #243 (26aca44) into master (cc5a7e7) will increase coverage by 0.02%.
The diff coverage is 60.86%.

@@             Coverage Diff              @@
##             master     #243      +/-   ##
============================================
+ Coverage     85.83%   85.85%   +0.02%     
  Complexity     3232     3232              
============================================
  Files           344      344              
  Lines         12072    12076       +4     
  Branches       1087     1088       +1     
============================================
+ Hits          10362    10368       +6     
+ Misses         1185     1182       -3     
- Partials        525      526       +1     
Impacted Files Coverage Δ
...graph/computer/core/network/DataClientManager.java 85.36% <0.00%> (ø)
...re/network/netty/ChannelFutureListenerOnWrite.java 80.00% <ø> (ø)
...graph/computer/core/sender/MessageSendManager.java 78.40% <0.00%> (ø)
...raph/computer/core/sender/QueuedMessageSender.java 73.68% <0.00%> (-2.29%) ⬇️
...aph/computer/core/receiver/MessageRecvManager.java 84.88% <82.35%> (+3.48%) ⬆️

... and 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@javeme javeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution~

@Radeity
Copy link
Member Author

Radeity commented May 21, 2023

Hi, @javeme , thanks for your review again, i've made some changes based on your comments, PTAL : )

AND

please also update the exception message at line 206

I've reverted this part to only catch InterruptedException with original log message.

@Radeity Radeity marked this pull request as draft May 22, 2023 11:44
@Radeity Radeity marked this pull request as ready for review May 22, 2023 16:02
@Radeity
Copy link
Member Author

Radeity commented May 22, 2023

Hi, all, really thanks for @coderzc's suggestion, I've modified the implementation in a more elegant way. Like describe above in PR description:

Both set an exceptional future. The main thread will sense this exception later when sending(sender) and waiting(receiver) controlling message.

PTAL : )

@Radeity
Copy link
Member Author

Radeity commented May 23, 2023

Maybe we should find finishMessageFuture through connectionId?

Hi, @coderzc , during the process of sending vertices and edges, two control message sent from the same worker share the same ConnectionId. How about add a boolean variable sendControlMessage in finishSend, like:

public void finishSend(MessageType type, boolean sendControlMessage) { 
     Map<Integer, MessageSendPartition> all = this.buffers.all(); 
     MessageStat stat = this.sortAndSendLastBuffer(all, type); 
  
     Set<Integer> workerIds = all.keySet().stream() 
                                 .map(this.partitioner::workerId) 
                                 .collect(Collectors.toSet()); 
     if (sendControlMessage == true) {
         this.sendControlMessageToWorkers(workerIds, MessageType.FINISH); 
     }
     LOG.info("Finish sending message(type={},count={},bytes={})", 
              type, stat.messageCount(), stat.messageBytes()); 
 } 

or we separate the call of sendControlMessageToWorkers and finishSend, only when finish sending edges, we call sendControlMessageToWorkers. In this way, we can use connectionId to find a future, another advantage is for both inputStep and superStep, we can maintain a future array with same length, cuz expectedFinishMessages keeps the same, WDYT?

@coderzc
Copy link
Member

coderzc commented May 23, 2023

Maybe we should find finishMessageFuture through connectionId?

Hi, @coderzc , during the process of sending vertices and edges, two control message sent from the same worker share the same ConnectionId. How about add a boolean variable sendControlMessage in finishSend, like:

public void finishSend(MessageType type, boolean sendControlMessage) { 
     Map<Integer, MessageSendPartition> all = this.buffers.all(); 
     MessageStat stat = this.sortAndSendLastBuffer(all, type); 
  
     Set<Integer> workerIds = all.keySet().stream() 
                                 .map(this.partitioner::workerId) 
                                 .collect(Collectors.toSet()); 
     if (sendControlMessage == true) {
         this.sendControlMessageToWorkers(workerIds, MessageType.FINISH); 
     }
     LOG.info("Finish sending message(type={},count={},bytes={})", 
              type, stat.messageCount(), stat.messageBytes()); 
 } 

or we separate the call of sendControlMessageToWorkers and finishSend, only when finish sending edges, we call sendControlMessageToWorkers. In this way, we can use connectionId to find a future, another advantage is for both inputStep and superStep, we can maintain a future array with same length, cuz expectedFinishMessages keeps the same, WDYT?

@Radeity Do you mean to merge send vertex and send edge into the same session?

@Radeity
Copy link
Member Author

Radeity commented May 23, 2023

@Radeity Do you mean to merge send vertex and send edge into the same session?

Yes, can we do that?

@Radeity Radeity marked this pull request as draft May 23, 2023 08:32
@Radeity
Copy link
Member Author

Radeity commented May 23, 2023

@coderzc it seems that a lot to change, maybe we should keep using finishMessageCount, WDYT?

@coderzc
Copy link
Member

coderzc commented May 23, 2023

@coderzc it seems that a lot to change, maybe we should keep using finishMessageCount, WDYT?

I agree with this, then only add a future and complete this future when finishMessageCount==0

@Radeity Radeity marked this pull request as ready for review May 23, 2023 09:37
this.expectedFinishMessages);
this.finishMessagesFuture = new CompletableFuture<>();
if (!this.finishMessageCount.compareAndSet(0, this.expectedFinishMessages)) {
throw new ComputerException("The origin count must be 0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The origin finishMessageCount ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UT will fail if using compareAndSet() here. Also, I find it doesn't have concurrency issue, so I modify to directly use set().

"finish-messages received in %s ms in superstep %s",
e,
this.expectedFinishMessages,
this.waitFinishMessagesTimeout,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to catch timeout exception individually

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should catch them with different error message?

@Radeity Radeity changed the title feat: implement MessageRecvManager and DataClientManager for fast-fail feat: implement fast-failover for MessageRecvManager and DataClientManager May 23, 2023
@coderzc
Copy link
Member

coderzc commented May 23, 2023

@Radeity Please fix the code style, you can import https://github.com/apache/incubator-hugegraph/blob/master/hugegraph-style.xml to IDEA

@imbajin
Copy link
Member

imbajin commented May 23, 2023

@Radeity Please fix the code style, you can import apache/incubator-hugegraph@master/hugegraph-style.xml to IDEA

Good reminder 🤔
I'll add a full doc for devs to use IDEA effectively in Wiki first (then move to website/doc)

Update: put the idea-config-doc here first

Also, spotless could add in the TODO Task and assign to someone to test it

@Radeity
Copy link
Member Author

Radeity commented May 23, 2023

Good reminder 🤔
I'll add a full doc for devs to use IDEA effectively in Wiki first (then move to website/doc)

I use IDEA for remote development, which really annoy me that I can not import files. May I ask do we have other ways to make this format script effective? BTW, maybe we can consider to use mvn spotless in the future.

coderzc
coderzc previously approved these changes May 23, 2023
Copy link
Member

@coderzc coderzc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

But the CI still always fail.. add a TODO issue/task for it

@coderzc coderzc requested a review from javeme May 24, 2023 01:54
@Radeity Radeity dismissed stale reviews from javeme and coderzc via 26aca44 May 24, 2023 14:58
Copy link
Contributor

@javeme javeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@coderzc coderzc merged commit f7d10a8 into apache:master May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Feature] Implement MessageRecvManager and DataClientManager Fast-Fail
4 participants