Why do I have heavy DeserializeSparse phase after EagerKernelExecutes on the multiple GPU training?

Question

I'm trying to train a small TF2.x model on 4 GPUs (AWS g4dn.12xlarge) that takes both dense and sparse tensors as input. When I used only dense features, my distributed training code worked well without any performance degradation. After including the sparse features, however, I found numerous unexpected chunks in the TensorBoard Profiler's trace_viewer. I've attached the profiler screenshot.

The main problem is that, although all the GPUs seem to compute their batches fine, there is a large time gap between each pair of computation blocks on the host side. There are 17x4 EagerExecute:DeserializeSparse ops, each ending with _Send input 0 from /job:localhost/replica:0/task:0/device:GPU:{gpu_number} to /job:localhost/replica:0/task:0/device:CPU:0. Here, 17 is the number of sparse features the model receives, and 4 is the number of GPUs in use. In addition, lots of MemcpyD2H ops (the small pink blocks in the screenshot) occupy each GPU and are not parallelized. That gap is about 6x as long as the actual forward pass.

Below is how the model treats sparse tensor inputs:

def call(self, inputs: tf.sparse.SparseTensor):
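  # Pin the sparse ops (hash lookup and embedding lookup) to the CPU.
  with tf.device("/cpu:0"):
    x = self.hash_inputs_from_static_hash_table(inputs)
    x = self.embedding_lookup_sparse(x)
  # The prediction head runs outside the CPU device scope.
  return self.prediction_head(x)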

The data is never big (batch size = 128 per replica, and the sparse feature embedding dimension is < 10). I also tried moving all sparse-related operations to the CPU so as not to burden the GPUs, but the problem persists exactly as it did before I moved those ops manually.

I want to know why those chunks appear after the GPU computations, and hopefully remove them to fully benefit from distributed training on multiple GPUs. It seems I'm still missing something that can be optimized, and this situation is probably not unique in distributed training, so I'm asking a broader audience for help.

anonymous · Answer

The heavy "DeserializeSparse" phase after the "EagerKernelExecutes" in multi-GPU training is likely caused by the serialization and deserialization of sparse tensor data as it is transferred between the GPUs and the CPU. In distributed training, data parallelism is often used to split the batch across multiple GPUs, and each GPU computes its own part of the batch. However, when the computation on the GPU is finished, the results need to be aggregated on the CPU for the next step, and this requires the serialized data to be deserialized on the CPU.

The large time gap between a pair of computation blocks on the host side appears because deserialization can be slow and may become a bottleneck when there is a lot of sparse tensor data to process. Moreover, the large number of "MemcpyD2H" operations suggests that data transfer between the GPU and the CPU is not fully parallelized, which further slows down deserialization.

To optimize the performance of your distributed training code, you can try the following:

- Use the TensorFlow Dataset API to create input pipelines that preprocess and batch the data efficiently before it is fed to the model. This can help reduce the amount of data that needs to be serialized and deserialized during training.
- Consider using a sparse-aware optimizer, such as the "Adagrad" optimizer with the "tf.IndexedSlices" data structure, to update the sparse feature embeddings. This can help reduce the memory footprint and improve performance.
- Use TensorFlow's distributed training strategies, such as "MirroredStrategy" or "ParameterServerStrategy", which provide built-in support for data parallelism and can optimize the data transfer between the GPUs and the CPU.
- Consider using mixed precision training, which can help reduce the memory footprint and speed up training.
- Use profiling tools, such as TensorBoard Profiler, to identify performance bottlenecks and optimize your code accordingly.

By applying these optimizations, you should be able to improve the performance of your distributed training code and reduce the heavy "DeserializeSparse" phase after the "EagerKernelExecutes".
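
The input-pipeline suggestion can be made concrete with a small sketch. It assumes, without confirmation from the question, that the DeserializeSparse ops come from SparseTensor batches being unpacked once per feature per replica, so it hashes and pads each sparse feature into a dense id matrix inside `Dataset.map` and lets only dense tensors reach the model. `NUM_BUCKETS`, `EMBEDDING_DIM`, `densify_sparse_feature`, `masked_mean_embedding`, and the toy batch are hypothetical, and `tf.strings.to_hash_bucket_fast` merely stands in for the asker's static hash table.

```python
import tensorflow as tf

NUM_BUCKETS = 1000   # hypothetical hash space for one sparse feature
EMBEDDING_DIM = 8    # hypothetical; the question only says "< 10"


def densify_sparse_feature(sparse_strings: tf.sparse.SparseTensor) -> tf.Tensor:
    """Hash a string SparseTensor and pad it into a dense int64 id matrix.

    Id 0 is reserved for padding, so real ids fall in [1, NUM_BUCKETS].
    """
    ids = tf.strings.to_hash_bucket_fast(sparse_strings.values, NUM_BUCKETS) + 1
    hashed = tf.sparse.SparseTensor(
        sparse_strings.indices, ids, sparse_strings.dense_shape)
    return tf.sparse.to_dense(hashed, default_value=0)


def masked_mean_embedding(table: tf.Tensor, ids: tf.Tensor) -> tf.Tensor:
    """Mean-pool the embeddings of the non-padding ids in each row."""
    mask = tf.cast(ids > 0, table.dtype)[..., tf.newaxis]   # [batch, max_ids, 1]
    vectors = tf.nn.embedding_lookup(table, ids) * mask     # [batch, max_ids, dim]
    counts = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)   # [batch, 1]
    return tf.reduce_sum(vectors, axis=1) / counts          # [batch, dim]


# Stand-in for one batch of one sparse feature: 3 examples, up to 4 tokens each.
example_batch = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]],
    values=["a", "b", "c", "a", "d", "e"],
    dense_shape=[3, 4],
)

# The densifying step runs inside the tf.data pipeline, before the model sees the batch.
dataset = (
    tf.data.Dataset.from_tensors(example_batch)
    .map(densify_sparse_feature, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

dense_ids = next(iter(dataset))                               # dense int64, shape [3, 4]
table = tf.random.normal([NUM_BUCKETS + 1, EMBEDDING_DIM])    # row 0 is the padding row
pooled = masked_mean_embedding(table, dense_ids)              # shape [3, EMBEDDING_DIM]
```

Here `masked_mean_embedding` plays the role of a sparse embedding lookup with a mean combiner. Whether this actually removes the gap in the trace would have to be checked in the profiler on the real model.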
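
As a rough illustration of the optimizer suggestion, the sketch below shows that the gradient of an embedding lookup arrives as `tf.IndexedSlices` and that an Adagrad step then changes only the rows that were looked up. The table size, ids, and learning rate are hypothetical, and this says nothing about whether the optimizer is related to the DeserializeSparse ops in the question.

```python
import tensorflow as tf

# Hypothetical sizes, chosen only for illustration.
NUM_ROWS, EMBEDDING_DIM = 10, 4

embeddings = tf.Variable(tf.ones([NUM_ROWS, EMBEDDING_DIM]), name="embeddings")
before = embeddings.numpy().copy()
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.1)

ids = tf.constant([1, 3])  # only rows 1 and 3 are looked up
with tf.GradientTape() as tape:
    looked_up = tf.nn.embedding_lookup(embeddings, ids)
    loss = tf.reduce_sum(looked_up)

grad = tape.gradient(loss, embeddings)
# A gather-style lookup yields a sparse gradient: values plus the row indices they belong to.
is_sparse_gradient = isinstance(grad, tf.IndexedSlices)

optimizer.apply_gradients([(grad, embeddings)])
after = embeddings.numpy()

# Rows that were never looked up keep their original values.
changed_rows = [i for i in range(NUM_ROWS) if (before[i] != after[i]).any()]
```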
