Recently, I’ve been working on some projects for fun, trying out some things I hadn’t worked with before, such as profiling.
After profiling my code, I found that my average GPU activity is only around 50%. Apparently, the training loop frequently stalls for a few hundred milliseconds waiting on the dataloader process. I’ve tried a few things in the dataloader, such as increasing and decreasing the number of workers and setting pin_memory to true or false, but none of them seems to make a real difference. I have an NVMe drive, so I don’t think the disk is the problem either. I’ve concluded that the bottleneck must be the CPU.
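One way to double-check that the stalls really come from data loading is to time the wait for each batch separately from the training step. The following is only a minimal sketch: `measure_data_wait` and `train_step` are hypothetical names, and `batches` stands for any iterable, such as a dataloader. On a GPU, the step would also need to synchronize (for example with `torch.cuda.synchronize()`) before the timer is read, since CUDA kernels run asynchronously.

```python
import time


def measure_data_wait(batches, train_step):
    """Return (seconds spent waiting for batches, seconds spent in train_step)."""
    wait_seconds = 0.0
    compute_seconds = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        got_batch = time.perf_counter()
        train_step(batch)  # forward/backward/optimizer step would go here
        done = time.perf_counter()
        wait_seconds += got_batch - start
        compute_seconds += done - got_batch
    return wait_seconds, compute_seconds
```

If the first number is a large fraction of the total, the input pipeline rather than the model is what limits throughput.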
Now, I’ve read that pre-processing the data might help, for example so that the dataloader doesn’t have to decode the images on every epoch, but I don’t really know how to go about this. I have around 2TB of NVMe storage, and I’ve got a couple of datasets on the disk (ImageNet and iNaturalist are the two biggest ones), so I don’t suppose I’ll be able to store them on the disk uncompressed.
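For a rough sense of scale, the uncompressed size of a dataset that has been resized once to a fixed resolution can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes the ImageNet in question is the ILSVRC-2012 training split (1,281,167 images), that images are stored as 8-bit RGB, and `uncompressed_size_gb` is a hypothetical helper name.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # assumption: ILSVRC-2012 (ImageNet-1k) training split


def uncompressed_size_gb(num_images, height, width, channels=3, bytes_per_value=1):
    """Size in decimal gigabytes of raw pixel data at a fixed resolution."""
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 1e9


size_at_256 = uncompressed_size_gb(IMAGENET_TRAIN_IMAGES, 256, 256)
size_at_224 = uncompressed_size_gb(IMAGENET_TRAIN_IMAGES, 224, 224)
print(f"256x256: {size_at_256:.0f} GB, 224x224: {size_at_224:.0f} GB")
```

Under these assumptions the pre-resized training split comes to roughly 250 GB at 256x256, which is a small part of a 2TB drive. Storing the images at their original resolution would take far more.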
Is there anything I can do to lighten the load on the CPU during training so that I can take advantage of the 50% of the GPU that I’m not using at the moment?
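As a starting point, the number of workers should match the number of CPU cores. The preprocessing should be done in the dataset class (in its __getitem__ method), not in the dataloader, so that it runs inside the worker processes.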
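A minimal sketch of that setup in PyTorch could look like the following. `TransformingDataset`, `normalize`, and `make_loader` are hypothetical names used only for illustration, and the tensors stand in for real image data. Using `os.cpu_count()` is a starting value to tune from, not a guaranteed optimum.

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset


class TransformingDataset(Dataset):
    """Map-style dataset that applies its transform per sample in __getitem__."""

    def __init__(self, samples, labels, transform=None):
        self.samples = samples
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        if self.transform is not None:
            # With num_workers > 0 this runs in a worker process, in parallel.
            sample = self.transform(sample)
        return sample, self.labels[index]


def normalize(sample):
    """Example transform: uint8 pixel values to floats in [0, 1]."""
    return sample.float() / 255.0


def make_loader(dataset, batch_size=32, num_workers=None):
    """Build a DataLoader with one worker per CPU core unless told otherwise."""
    if num_workers is None:
        num_workers = os.cpu_count() or 0
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
        persistent_workers=num_workers > 0,  # only valid with worker processes
    )
```

On platforms that start workers with spawn (Windows, macOS), the code that iterates over the loader has to sit under an `if __name__ == "__main__":` guard.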