A minimal C++ example to reproduce the problem in #273 #331

Open
shendiaomo opened this issue Sep 15, 2020 · 2 comments
shendiaomo commented Sep 15, 2020

As #273 explains, when the Go runtime migrates the main goroutine from one OS thread to another, many threads get created and the process footprint grows large.

In fact, the problem can also be reproduced by simulating the situation in C++. As a result, it may degrade performance in an online inference setting. Of course, we can always set OMP_NUM_THREADS to 1 to avoid the problem.

#include <stdlib.h>

#include <chrono>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include "torch/torch.h"


namespace nn = torch::nn;

using namespace std::chrono_literals;  // for the literal `ms`

std::mutex mu;

int main(int argc, char* argv[]) {
  std::string argv0 = argv[0];
  if (auto pos = argv0.rfind('/'); pos != std::string::npos) {
    argv0 = argv0.substr(pos + 1);
  }
  std::stringstream thread_count_command;
  thread_count_command << "ps -T | grep " << argv0 << " | wc -l";
  std::cout << "Thread count command: " << thread_count_command.str() << std::endl;
  std::cout << std::string(20, '-') << std::endl;

  std::vector<std::thread> pool;
  auto model = nn::Conv2d(nn::Conv2dOptions(3, 64, 1).stride(1).bias(false));

  int total = static_cast<int>(std::thread::hardware_concurrency());
  if (argc > 1) total = std::atoi(argv[1]);

  for (int i = 0; i < total; ++i) {
    pool.push_back(std::thread([&, i] {
      int step = 0;
      while (true) {
        step += 1;
        {
          std::lock_guard<std::mutex> lock(mu);
          std::cout << "Thread "<< i << "(" << std::this_thread::get_id()
                    << "), step " << step << std::endl;
          std::cout << "#Threads before `forward`:" << std::endl;
          auto _ = system(thread_count_command.str().c_str());
          std::vector<torch::Tensor> data;
          while (data.size() < 32) data.push_back(torch::rand({3, 599, 599}));
          auto output = model->forward(torch::stack(data));
          std::cout << "#Threads after `forward`:" << std::endl;
          _ = system(thread_count_command.str().c_str());
          std::cout << std::string(20, '-') << std::endl;
        }
        std::this_thread::sleep_for(10ms); // Yield to another thread
      }
    }));
  }
  for (auto& t: pool) t.join();
}

Compile under the gotorch/cgotorch directory:

g++ -std=c++17 -I .. -I libtorch/include -I libtorch/include/torch/csrc/api/include -L linux/libtorch/lib many_threads.cpp  -O  -Wl,-rpath,libtorch/lib -lc10 -ltorch -ltorch_cpu -pthread

A typical output of the program on a Docker container with 6 cores:

Thread count command: ps -T|grep a.out| wc -l
--------------------
Thread 0(140561573615360), step 1
#Threads before `forward`:
7
#Threads after `forward`:
12
--------------------
Thread 1(140561565222656), step 1
#Threads before `forward`:
12
#Threads after `forward`:
17
--------------------
Thread 3(140561548437248), step 1
#Threads before `forward`:
17
#Threads after `forward`:
22
--------------------
Thread 4(140561540044544), step 1
#Threads before `forward`:
22
#Threads after `forward`:
27
--------------------
Thread 2(140561556829952), step 1
#Threads before `forward`:
27
#Threads after `forward`:
32
--------------------
Thread 5(140561461802752), step 1
#Threads before `forward`:
32
#Threads after `forward`:
37
--------------------
Thread 0(140561573615360), step 2
#Threads before `forward`:
37
#Threads after `forward`:
37
--------------------

wangkuiyi commented Sep 15, 2020

Without the expected output from the above program, I am not sure I understand what it reveals.

On my iMac with a quad-core Intel i5, I built and ran this program. The main function created 4 threads as expected, and there were always 6 threads in total -- I am not sure whether 6 counts as "a lot of threads"?

I re-ran the program with OMP_NUM_THREADS set to 1; the result was the same -- the main function created 4 threads and the process had 6 threads in total.

Then I set both OMP_NUM_THREADS and MKL_NUM_THREADS to 1; the result was the same again.

The steps to build and run the above program include:

  1. Copy-n-paste it to /tmp/a.cc.
  2. cp -r $GOPATH/src/github.com/wangkuiyi/gotorch/cgotorch/libtorch /tmp/
  3. make with the attached Makefile.
a : a.cc
	${CXX} -std=c++17 \
	-I .. \
	-I libtorch/include \
	-I libtorch/include/torch/csrc/api/include \
	-L libtorch/lib \
	-fPIC \
	$< \
	-o $@ \
	-Wl,-rpath,libtorch/lib \
	-lc10 -ltorch -ltorch_cpu \
	-pthread \
	-D_GLIBCXX_USE_CXX11_ABI=1

shendiaomo commented:

This problem is very likely caused by the function lazy_init_num_threads introduced in https://github.com/pytorch/pytorch/pull/37461/files#diff-7678d6e1a6fd4451bb1c23d73b3240a0R38-R45
This function is called by parallel_for and parallel_reduce, which are in turn called by aten/src/ATen/native/ConvolutionMM2d.cpp and many other ops.
