Code discussion #39
I was just trying to implement the NSFnets paper (March 2020) as part of my learning process. They use a very large number of points in the domain (144,000). The GPU I am using has 12 GB of memory, and I have 4 such GPUs. If I use just one, I get an OOM error for more than 40,000 points. The results I am getting with fewer points are not satisfactory, so I wanted to use more points. Also, I am currently using the `anchors` functionality in your code to specify points, rather than specifying `num_domain`, `num_boundary`, and `num_initial`, because I felt that uniformly placed points in time and space might lead to lower loss values.
The total training loss for me never goes below 10^-2, while in the paper the loss is a lot smaller (in the range of 10^-4). I have read the paper you mentioned in point 2. I will try to read the paper mentioned in point 4.
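For reference, this is roughly how I pass the anchors (a sketch with a placeholder 1D heat-equation setup; only the `anchors` argument matters here, the rest follows DeepXDE's `TimePDE` API, and module paths vary slightly across versions):

```python
import numpy as np
import deepxde as dde

geom = dde.geometry.Interval(0, 1)
timedomain = dde.geometry.TimeDomain(0, 1)
geomtime = dde.geometry.GeometryXTime(geom, timedomain)

def pde(x, y):
    dy_t = dde.grad.jacobian(y, x, i=0, j=1)
    dy_xx = dde.grad.hessian(y, x, i=0, j=0)
    return dy_t - 0.3 * dy_xx

bc = dde.DirichletBC(geomtime, lambda x: 0, lambda _, on_boundary: on_boundary)
ic = dde.IC(geomtime, lambda x: np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial)

# A uniform (x, t) grid supplied as anchors instead of num_domain points.
xs, ts = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
anchors = np.vstack((xs.ravel(), ts.ravel())).T  # 40,000 domain points

data = dde.data.TimePDE(
    geomtime, pde, [bc, ic],
    num_domain=0, num_boundary=100, num_initial=100,
    anchors=anchors,
)
```
|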
Unless you are trying to do something very similar to this paper, it may not be a good example for learning, because they are solving complex equations. I am not involved in the NSFnets paper, but here are some details that might be useful for you to reproduce their results.
@xiaoweijin Welcome to add more details here! |
@lululxvi Regarding your comment 3 about implementing mini-batch: currently your code is structured so that the position of points in the generated dataset (`train_x`) determines which error function (IC, Dirichlet, Neumann) is used. So if I specify just `batch_size`, the code logic would probably break, because the losses are calculated based on the positions of points in each batch. Also, anyone implementing `batch_size` should ideally sample initial, boundary, and domain points in each batch, so that the batch is representative of the entire dataset. For these reasons, I felt it would be easier to run on multiple GPUs with few code modifications than to implement mini-batch. Also, in the pde.py file there is a decorator, `run_if_all_none("train_x", "train_y")`, which means the training points are generated only once. |
It is easy to use mini-batch. Let me first explain the details of the following code for generating the training points:

```python
@run_if_all_none("train_x", "train_y")
def train_next_batch(self, batch_size=None):
    self.train_x = self.train_points()
    self.train_x = np.vstack((self.bc_points(), self.train_x))
    self.train_y = self.soln(self.train_x) if self.soln else None
    return self.train_x, self.train_y
```

The BC/IC points from `bc_points()` are stacked in front of the PDE points, so a point's position in `train_x` determines which loss it contributes to. Note: certain points will be repeated more than once if they are used for both the PDE and the BCs. Next, let us see how to implement mini-batch. So, to use mini-batch, there are two options: (1) resample a fresh set of training points at every iteration, i.e., let `train_next_batch` generate new points on each call instead of caching them behind `@run_if_all_none`; or (2) generate a large dataset once and draw a random subset from it at each iteration, keeping the layout (all BC points first, with `num_bcs` unchanged) intact.
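A minimal sketch of the second option (assuming DeepXDE's `PDE` data class; `MiniBatchPDE` and the cached attributes are hypothetical names, and the import path of `PDE` varies across versions):

```python
import numpy as np
from deepxde.data.pde import PDE  # path may differ in your version

class MiniBatchPDE(PDE):  # hypothetical subclass, for illustration
    # No @run_if_all_none decorator, so this runs on every call.
    def train_next_batch(self, batch_size=None):
        if getattr(self, "full_pde_x", None) is None:
            # Generate the full point set once.
            self.full_bc_x = self.bc_points()      # also sets self.num_bcs
            self.full_pde_x = self.train_points()  # PDE residual points
        if batch_size is None:
            pde_x = self.full_pde_x
        else:
            # Random subset of PDE points; keep all BC points so the block
            # layout implied by num_bcs still matches the loss construction.
            idx = np.random.choice(len(self.full_pde_x), batch_size, replace=False)
            pde_x = self.full_pde_x[idx]
        self.train_x = np.vstack((self.full_bc_x, pde_x))
        self.train_y = self.soln(self.train_x) if self.soln else None
        return self.train_x, self.train_y
```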
|
If you want to use multiple GPUs: I don't have much experience with TensorFlow's multi-GPU support, but if you use Horovod, then the code can be modified easily (maybe several lines of code).
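The basic TF1 recipe from Horovod's documentation is a few lines like this (a sketch; the toy loss stands in for the PINN loss):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU

# Pin each process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy loss, standing in for the PINN loss.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
train_op = hvd.DistributedOptimizer(opt).minimize(loss)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)
```
|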
Regarding mini-batch, you mentioned two points.
I also tried using Horovod, but it threw a few errors. I believe Horovod and its dependencies are installed correctly. Don't you think this part of your code requires some modifications? Thanks again for your inputs. |
For the second approach of mini-batch, make sure that each time you have the same `num_bcs`, so that the loss terms are still matched to the right blocks of points. Another useful piece of information: according to the author of the NSFnets paper, a large batch size does not lead to better results. The docs for `decay` read: "List. Name and parameters of decay to the initial learning rate. One of the following options: ..."
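For example, in `Model.compile` (a sketch; the option name "inverse time" and the `(name, decay_steps, decay_rate)` layout follow DeepXDE's docs, but check your installed version):

```python
# Adam with inverse-time decay of the initial learning rate:
# the rate is decayed with factor 0.5 every 1000 steps.
model.compile("adam", lr=1e-3, decay=("inverse time", 1000, 0.5))
```
|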
I tried running it on multiple GPUs using Horovod and made the following changes in the code:

```python
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()
```

In `train()`:

```python
def train(self, ...):
    if self.train_state.step == 0:
        print("Initializing variables...")
        self.sess.run(tf.global_variables_initializer())
        ### Horovod
        self.hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    ...
    print("Training model...\n")
    # Save checkpoints only on worker 0 to prevent other workers from corrupting them.
    self.checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
```

And in `_open_tfsession()`:

```python
def _open_tfsession(self):
    tfconfig = tf.ConfigProto()
    tfconfig.gpu_options.allow_growth = True
    tfconfig.gpu_options.visible_device_list = str(hvd.local_rank())  # added
    self.config = tfconfig                                            # added
    self.sess = tf.Session(config=tfconfig)
    self.saver = tf.train.Saver(max_to_keep=None)
    self.train_state.set_tfsession(self.sess)
```

Then I made the following change where the train op is built:

```python
with tf.control_dependencies(update_ops):
    lr = lr * hvd.size()
    optim = _get_optimizer(optimizer, lr)
    train_op = hvd.DistributedOptimizer(optim).minimize(loss, global_step=global_step)
    return train_op
```

With only these changes, the code ran on multiple GPUs, but the workers were not in sync, and the epoch output was printed four times, once per GPU. I ran the file using the launcher command shown in the Horovod docs, which give an example for TF1.
After that, I made the following change in `_train_sgd`. I replaced this code:

```python
def _train_sgd(self, epochs, display_every, uncertainty):
    for i in range(epochs):
        self.callbacks.on_epoch_begin()
        self.callbacks.on_batch_begin()
        self.train_state.set_data_train(
            *self.data.train_next_batch(self.batch_size)
        )
        self.sess.run(
            self.train_op,
            feed_dict=self._get_feed_dict(
                True, True, 0, self.train_state.X_train, self.train_state.y_train
            ),
        )
```

with:

```python
def _train_sgd(self, epochs, display_every, uncertainty):
    for i in range(epochs):
        self.callbacks.on_epoch_begin()
        self.callbacks.on_batch_begin()
        self.train_state.set_data_train(
            *self.data.train_next_batch(self.batch_size)
        )
        with tf.train.MonitoredTrainingSession(
            checkpoint_dir=self.checkpoint_dir, config=self.config, hooks=self.hooks
        ) as mon_sess:
            while not mon_sess.should_stop():
                # Perform synchronous training.
                mon_sess.run(
                    self.sess.run(
                        self.train_op,
                        feed_dict=self._get_feed_dict(
                            True, True, 0,
                            self.train_state.X_train,
                            self.train_state.y_train,
                        ),
                    )
                )
```

It throws the following error at the `MonitoredTrainingSession` step above:
Do you have any idea why this happens? Thanks. |
@XuhuiM Any idea? |
@kpratik41 The error could be from the nested session calls: `self.sess.run(...)` is passed as the argument to `mon_sess.run(...)`, whereas the train op should be run through the monitored session directly. Another thing is that on different GPUs you should use different data for PDE/BC, but keep the sizes the same. So make sure `train_next_batch` does not return identical points on every worker.
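A minimal sketch of the corrected inner loop (assuming the Horovod-modified `Model` from the comments above):

```python
# Run the train op through the monitored session directly,
# instead of nesting self.sess.run() inside mon_sess.run().
with tf.train.MonitoredTrainingSession(
    checkpoint_dir=self.checkpoint_dir, config=self.config, hooks=self.hooks
) as mon_sess:
    for i in range(epochs):
        self.train_state.set_data_train(
            *self.data.train_next_batch(self.batch_size)
        )
        mon_sess.run(
            self.train_op,
            feed_dict=self._get_feed_dict(
                True, True, 0, self.train_state.X_train, self.train_state.y_train
            ),
        )
```
|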
How did you manage to run it on a single GPU? Did you have to change the code, or simply run it on a machine with a GPU? Thanks. |
If you correctly install the GPU version of TF with all the required libraries, the code will run on the GPU without any change. Check your GPU usage to see whether the GPU is being used. |
I came across this discussion while searching for how to make my results more accurate. I thought I was applying mini-batch in my code, but after reading this post I just want to make sure I'm implementing it correctly. Do I still need to modify the function `train_next_batch`? Currently, I'm just using the resampler callback lines.
Is this sufficient, or do I need to modify `train_next_batch`? |
The discussion is out of date. |
Thank you for the clarification @lululxvi ! |
It is the `PDEPointResampler` callback.
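For example (a sketch; in older DeepXDE versions this callback is named `PDEResidualResampler`, and older releases use `epochs=` instead of `iterations=` in `train`):

```python
import deepxde as dde

# Resample the PDE training points every 100 iterations.
resampler = dde.callbacks.PDEPointResampler(period=100)
losshistory, train_state = model.train(iterations=20000, callbacks=[resampler])
```
|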
Hello, I am trying to integrate mini-batch training by splitting the dataset into two mini-batches. I am well aware of the PDE and Model classes, and I am fairly sure what I have done is correct (shuffling the domain and BC points before each epoch and then splitting the BC points, so that each mini-batch has the same proportion of BC and domain points). I also keep `num_bcs` constant in order to use `bc.error()` correctly. My problem is that I keep getting stuck on a certain BC, in the sense that its loss is not going down. I would like to ask why PDEresampler is the correct way to go while splitting the dataset is not working (at least in my case). |
|
Hello @lululxvi, if I understood you correctly, the resampler draws a new set of training points every `period` iterations, and the network trains on each set in between? If yes, should I train longer / use smaller periods to let the network learn these intermediate values better? |
Yes.
Probably yes, there are hyperparameters you should tune. |
IMHO, "mini-batch" here is misleading. This callback resamples the data points used for training and feeds them all at once into the NN. In the long run it may have the same effect on the gradients as mini-batch training, but that is another discussion! |
Thank you very much for your amazing work. I had a few questions.
1. I was able to successfully run the code on a single GPU, but I got an OOM error when I increased the number of domain points to a large value. How do I make it run on multiple GPUs (using TF 1.14)? Could you give me some direction on this? Have you ever run this code on multiple GPUs?
2. After going through the code, I believe that mini-batch is currently not implemented. What sort of effort do you think would be required to implement it?
3. Dropout is used in neural networks to prevent overfitting. If we are solving a heat equation, we want the network to learn all the small variations in space and time over the entire domain; essentially, we want to overfit the neural network on that domain to get the most accurate solution. So should we use dropout during training? (I sketch below how dropout would be enabled, for reference.)
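If I read the network code correctly, `FNN` exposes a dropout option; a sketch (the `dropout_rate` argument follows DeepXDE's `FNN` signature, but check your installed version):

```python
import deepxde as dde

# dropout_rate > 0 enables dropout during training. For forward PDE problems,
# where we effectively want to fit the whole domain closely, it may hurt accuracy.
net = dde.maps.FNN([2] + [50] * 3 + [1], "tanh", "Glorot uniform", dropout_rate=0.1)
```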