Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Remove blocking calls from BookieRackAffinityMapping #22846

Merged
merged 4 commits into from
Jun 5, 2024

Conversation

merlimat
Copy link
Contributor

@merlimat merlimat commented Jun 5, 2024

Motivation

There are a couple of blocking calls to metadata in BookieRackAffinityMapping class, in places where we're not supposed to make any blocking calls.

This is made a bit hard by the BK rack-affinity interface that doesn't allow to use futures.

Modifications

During the initial setConf() load the list of bookie-racks info asynchronously, and set up the async watch. This seems the most sensible approach, barring a complete redesign of the rack-affinity interface in BK.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@merlimat merlimat added type/bug The PR fixed a bug or issue reported a bug ready-to-test labels Jun 5, 2024
@merlimat merlimat added this to the 3.4.0 milestone Jun 5, 2024
@merlimat merlimat self-assigned this Jun 5, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 5, 2024

bookieMappingCache = store.getMetadataCache(BookiesRackConfiguration.class);
store.registerListener(this::handleUpdates);
bookieMappingCache.get(BOOKIE_INFO_ROOT_PATH)
Copy link
Contributor

@rdhabalia rdhabalia Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this async processing may cause a race condition and bk-client may create ledgers before it updates bookie with rack info, and that can cause ledgers to be created on the wrong bookies and that will not work for us because we need strict bookie affinity and we can't afford ledgers created on wrong bookies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid point. I guess this is hard to solve since the interface is synchronous in Bookkeeper. I guess there are ways to workaround the problem and still make it asynchronous. In that case, the initialization would have to somehow happen before the bk-client is made ready.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. I think we should still change the sync call in the watcher event. It's unlikely to have a cache miss there, though it's technically possible.

Without changing BK client constructor to reflect the blocking operation that happens, I think we could change the BK provider class to do the creation in background.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat can you please review #22842 for now as it makes bk-client creation async and avoids blocking call at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.asyncOpen which will prevent the deadlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does not solve the problem completely because the calling thread is still blocked.

The BK client creation is moved to a background thread and it's ok, thought the thread waiting for the BK client to be created is still blocked.

eg:

  1. IO thread is initiating the creation of topic
  2. We trigger creation of BK client
  3. background-thread is creating BK client
  4. IO thread is still blocked waiting for completion.

I'm working on a change to return future from BK client factory that would avoid the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merging this for now since it's solves a (small) part of the problem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BK client creation is moved to a background thread and it's ok, thought the thread waiting for the BK client to be created is still blocked.

No, this solves the issue. deadlock was happening at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.asyncOpen at ledgers.computeIfAbsent
Moving bk-client creation async will not hold the lock at concurrentHashMap and that will also not cause metadata-thread waiting.

infact this PR should not be merged. Making change in bk-client and releasing bk version takes time and not all previous version can take that change. so, we need a fix which can solve the problem and #22842 should do it.
so, if there is no issue with #22842 then can we merge it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdhabalia

infact this PR should not be merged.

This is solving a problem your PR was not addressing, the use of blocking call in the notification watcher.

nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 17, 2024
)

(cherry picked from commit aece67e)
(cherry picked from commit efa5e8b)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 25, 2024
)

(cherry picked from commit aece67e)
(cherry picked from commit efa5e8b)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 1, 2024
)

(cherry picked from commit aece67e)
(cherry picked from commit efa5e8b)
luky116 pushed a commit to luky116/pulsar that referenced this pull request Jul 29, 2024
@AlexStocks
Copy link

related stack
1267792802456608768.md

@TakaHiR07
Copy link
Contributor

TakaHiR07 commented Sep 11, 2024

I encounter rack information lost issue with this pr's modification, and do an analysis. Can you take a look? @merlimat @rdhabalia I think we can not make the watch async.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants