DB::Exception: Cannot schedule a task #6833
Comments
Can you provide a listing of […]? This exception means that you have exceeded the queue size of the global thread pool, which is 10k.
Sorry for the delay in responding to this. Just got into this state again and […]
@abyss7 Is there a way to configure or increase the global thread pool limit?
@markcorwin-iex There is no way to configure this limit right now.
The global limit is 10,000 threads. It should be enough in all but pathological cases. Probably you are using a Merge table over a set of Distributed tables (you can use a Distributed table on top of Merge tables instead).
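For illustration, a minimal sketch of that suggested layout; the cluster, database, and table names here are hypothetical:

```sql
-- Several local MergeTree tables, one Merge table that unions them,
-- and a single Distributed table on top.
CREATE TABLE events_shard_1 (d Date, id UInt64, v String)
ENGINE = MergeTree() PARTITION BY toYYYYMM(d) ORDER BY id;

CREATE TABLE events_shard_2 (d Date, id UInt64, v String)
ENGINE = MergeTree() PARTITION BY toYYYYMM(d) ORDER BY id;

-- Merge reads from every table in the database matching the regexp.
CREATE TABLE events_local AS events_shard_1
ENGINE = Merge(currentDatabase(), '^events_shard_');

-- One Distributed table over the Merge table, rather than a Merge table
-- over many Distributed tables, which multiplies connections and threads.
CREATE TABLE events_all AS events_local
ENGINE = Distributed(my_cluster, default, events_local);
```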
https://www.altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse Following this article, I set up circular replication with 3 shards across a 3-node cluster. Node 1 had shards 1 and 2, Node 2 had shards 2 and 3, and Node 3 had shards 3 and 1. These were all merge tables, and one Distributed table on each node pointed to the local shards.
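For context, the article's layout boils down to roughly the following on each node (the schema and names are made up here; `{shard}` and `{replica}` come from per-node macros in config.xml, and the cluster definition maps each shard to the two nodes carrying it):

```sql
-- One replicated local table per node; the macros differ per node so that
-- each node hosts two of the three shards (e.g. node 1 hosts shards 1 and 2).
CREATE TABLE events_replicated (d Date, id UInt64, v String)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(d) ORDER BY id;

-- A Distributed table on each node fans queries out to the shards.
CREATE TABLE events_dist AS events_replicated
ENGINE = Distributed(circular_cluster, default, events_replicated, rand());
```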
It also happened to me, and AFAIR the […]. At that time I could not find the culprit; once this is reproduced again I will try to find something.
Hm, am I missing something, or will any unhandled exception from a job scheduled in the thread pool shut it down? (And the exception will not even be logged.)
I have the same issue. How to reproduce: […]
Otherwise GlobalThreadPool can be terminated (for example due to an exception from ParallelInputsHandler::onFinish/onFinishThread, or from ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread, since writeToTemporaryFile() can definitely throw), and the server will not accept new connections (and/or execute queries) anymore.

Here is a possible stack trace (it is a bit inaccurate, due to optimizations I guess, and it was obtained with DB::tryLogCurrentException() in the catch block of ThreadPoolImpl::worker()):

```
2020.02.16 22:30:40.415246 [ 45909 ] {} <Error> ThreadPool: Unhandled exception in the ThreadPool(10000,1000,10000) the loop will be shutted down: Code: 241, e.displayText() = DB::Exception: Memory limit (total) exceeded: would use 279.40 GiB (attempt to allocate chunk of 4205536 bytes), maximum: 279.40 GiB, Stack trace (when copying this message, always include the lines below):

1. Common/Exception.cpp:35: DB::Exception::Exception(...)
...
6. Common/Allocator.h:102: void DB::PODArrayBase<8ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::reserve<>(unsigned long) (.part.0)
7. Interpreters/Aggregator.cpp:1040: void DB::Aggregator::writeToTemporaryFileImpl<...>(...)
8. Interpreters/Aggregator.cpp:719: DB::Aggregator::writeToTemporaryFile(...)
9. include/memory:4206: DB::Aggregator::writeToTemporaryFile(...)
10. DataStreams/ParallelInputsProcessor.h:223: DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(...)
```

Refs: ClickHouse#6833 (comment) (reference to a particular comment, since I'm not sure about the initial issue)
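Judging by that trace, the throw happens while external aggregation spills to disk under memory pressure. A hedged sketch of settings that exercise this path (the table name is hypothetical and the thresholds are arbitrary):

```sql
-- Force GROUP BY to spill to temporary files past the threshold; if a
-- memory limit is hit during the spill itself, the exception propagates
-- out of Aggregator::writeToTemporaryFile(), as in the trace above.
SET max_bytes_before_external_group_by = 10000000000; -- 10 GB, arbitrary
SET max_memory_usage = 20000000000;                   -- 20 GB, arbitrary

SELECT id, uniqExact(v)
FROM big_table   -- hypothetical large table
GROUP BY id;
```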
Looks like the original issue should be fixed by #9154, since according to the comment it does not look like too many Buffer/Distributed tables acquiring tons of threads (and there is a comment that a restart fixes the issue).
Fixed in master. |
Describe the bug or unexpected behaviour
While running a 3-node ClickHouse cluster with replicated tables, we saw a large number of occurrences of the included stack trace. It seems that we reached a resource/config limit, but it's not clear which one (not CPU, memory, or disk). There did appear to be a large number of inter-node TCP connections, and a significant number of open files (~3 million per cluster node).
Mostly just looking to see if there is an explanation for the runaway TCP connections, and what can be expected in terms of inter-node connections/communication.
We did see a large number of waiting TCP connections to each node in the cluster (~20k).
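For anyone hitting this, the live counts are visible in system tables; the metric names below exist in current versions, though older releases may lack some of them:

```sql
-- Snapshot of open connection counts as the server itself sees them,
-- for comparison against what netstat/ss reports on each node.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('TCPConnection', 'HTTPConnection', 'InterserverConnection');
```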
How to reproduce
Which ClickHouse server version to use
19.13.2.19
Which interface to use, if matters
HTTP
Non-default settings, if any
The uncompressed cache is enabled, as the queries typically return a small result set (a couple hundred rows).
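(For reference, that toggle is the use_uncompressed_cache query setting; the cache size itself is the server-level uncompressed_cache_size option.)

```sql
-- Enable the uncompressed-block cache for these short queries; only
-- sufficiently small reads use it (see merge_tree_max_rows_to_use_cache).
SET use_uncompressed_cache = 1;
```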
Queries to run that lead to unexpected result
NOTE: These queries target a distributed table.
Also the queried table contains about 220 million rows.
Expected behavior
Error message and/or stacktrace
Additional context
If the problem is too many TCP connections, would chproxy be a recommended solution?
https://github.com/Vertamedia/chproxy