Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JobManager related fix #4742

Merged
merged 2 commits into from
Oct 19, 2022
Merged

JobManager related fix #4742

merged 2 commits into from
Oct 19, 2022

Conversation

critical27
Copy link
Contributor

@critical27 critical27 commented Oct 18, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Description:

Two bugs are fixed:

  1. Some job do not support stop, e.g. COMPACT/DOWNLOAD/INGEST and so on. Previously in make 'stop job' work as expected #4460, we have already report error. However, sadly, when we show job, the job is marked as STOPPED, which is not expected. It looks like as below:

image

After this PR, these unstoppable job will still keep in RUNNING state.
2. Some job need to run in storage leader. There is a corner case that, when we create a new space, before storage report leader distribution to meta, we start a job, e.g. STATS. For now JobManager can't select a storage to run the job, and the task won't be sent to storage for execution. Since no storage ever run it, the job will always stay in RUNNING.

How do you solve it?

  1. Only save STOPPED when it really stopped
  2. If no leader found, report error.

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

ex. Fixed the bug .....

@critical27 critical27 added ready-for-testing PR: ready for the CI test ready for review cherry-pick-v3.3 PR: need cherry-pick to this version labels Oct 18, 2022
@critical27 critical27 requested a review from a team as a code owner October 18, 2022 12:08
@@ -81,6 +81,12 @@ ErrOrHosts StorageJobExecutor::getLeaderHost(GraphSpaceID space) {
it->second.emplace_back(partId);
}
}
// If storage has not report leader distribution to meta and we don't report error here,
// JobMananger will think of the job consists of 0 task, and the task will not send to any
// storage. And the job will always be RUNNING.
Copy link
Contributor

@SuperYoko SuperYoko Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I got an issue for @liangliangc . That Explorer will execute "SUBMIT JOB STATS" after create space and wait a heartbeat, and job is always running, what do you think? Do we need to handle this?

Copy link
Contributor Author

@critical27 critical27 Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is what I told him. To be precise, a heartbeat is not enough, we must wait until leader distribution in meta is not empty. Dashboard or explorer don't want to handle this logic (by checking show hosts or show parts)... So I changed in core. It is indeed a rare case, if storage report even once, it won't reproduce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I am not sure whether we should check leader count same as partition count here... WDYT?

Copy link
Contributor

@SuperYoko SuperYoko Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a flag to identify a space is ready to supply service.
Maybe use leader count == partition,both okay for me.

@@ -853,6 +871,15 @@ TEST_F(JobManagerTest, StoppableJob) {
code = jobMgr->stopJob(spaceId, jobId);
ASSERT_EQ(code, nebula::cpp2::ErrorCode::SUCCEEDED);

// check job status again, it still should be running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, the comments here should be be stopped.

@Sophie-Xie Sophie-Xie merged commit dbaca57 into vesoft-inc:master Oct 19, 2022
caton-hpg pushed a commit to caton-hpg/nebula that referenced this pull request Oct 19, 2022
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
@critical27 critical27 added the doc affected PR: improvements or additions to documentation label Oct 19, 2022
@foesa-yang foesa-yang self-assigned this Oct 20, 2022
Sophie-Xie added a commit that referenced this pull request Oct 20, 2022
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
Sophie-Xie added a commit that referenced this pull request Oct 20, 2022
* fix dep of loop in go planner (#4736)

* fix inappropriate error code from raft (#4737)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Fix variable types collected and graph crash (#4724)

* Fix variable types collected and graph crash

add test cases

small fix

* unskip some test cases related to multiple query parts

* small delete

* fmt

* Fix ldbc BI R5 implementation

Co-authored-by: Harris.Chu <1726587+HarrisChu@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* stats handle the flag use_vertex_key (#4738)

* JobManager related fix (#4742)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* download job related fix (#4754)

* fixed case-when error (#4744)

* fixed case-when error

* fix tck

* fix tck

* fix tck

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Refine go planner (#4750)

* refine go planner

* update

* fix ctest

Co-authored-by: jie.wang <38901892+jievince@users.noreply.github.com>
Co-authored-by: Doodle <13706157+critical27@users.noreply.github.com>
Co-authored-by: kyle.cao <kyle.cao@vesoft.com>
Co-authored-by: Harris.Chu <1726587+HarrisChu@users.noreply.github.com>
Co-authored-by: canon <87342612+caton-hpg@users.noreply.github.com>
@SuperYoko SuperYoko mentioned this pull request Oct 20, 2022
11 tasks
@foesa-yang
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-v3.3 PR: need cherry-pick to this version doc affected PR: improvements or additions to documentation ready for review ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants