A team of alum Andrew Shaw, DIU researcher Yaroslav Bulatov, and I have managed to train Imagenet to 93% accuracy in just 18 minutes, using 16 public AWS cloud instances, each with 8 NVIDIA V100 GPUs, running the fastai and PyTorch libraries. This is a new speed record for training Imagenet to this accuracy on publicly available infrastructure, and is 40% faster than Google’s DAWNBench record on their proprietary TPU Pod cluster. Our approach uses the same number of processing units as Google’s benchmark (128) and costs around $40 to run.

DIU and will be releasing software to allow anyone to easily train and monitor their own distributed models on AWS, using the best practices developed in this project. The main training methods we used (details below) are:’s progressive resizing for classification, and rectangular image validation; NVIDIA’s NCCL with PyTorch’s all-reduce; Tencent’s weight decay tuning; a variant of Google Brain’s dynamic batch sizes, gradual learning rate warm-up (Goyal et al 2018, and Leslie Smith 2018). We used the classic ResNet-50 architecture, and SGD with momentum.

Please follow and like us:

Fillip Technologies

leave a comment

Create Account

Log In Your Account