In a paper titled “Parameter Efficient Multimodal Transformers for Video Representation Learning,” researchers discuss how they reduced the size of a multimodal transformer by 97 percent, making it practical to train the model on video clips of 30 seconds (480 frames, sampled at 16 frames per second). This is a major improvement over existing models, which can process video sequences of only 10 seconds or less. Microsoft and Nvidia point out that learning to understand video is one of the biggest challenges for AI. Getting AI to learn multimodal representations efficiently is fundamental to understanding video content such as actions, objects, and audio. Recent multimodal transformers have become good at understanding certain aspects of video sequences, such as vision and language or image recognition. Still, training multimodal transformers remains a challenge because it requires a large amount of memory. Microsoft says in a blog post that many existing transformers simply tap into already pretrained models for their training.
Reducing Model Loads
Microsoft and Nvidia have made significant improvements that allow models to train on new video sequences more efficiently. The model has five components: audio and visual convolutional neural networks (CNNs), audio and visual transformers, and a multimodal transformer. The CNNs encode visual and audio signals from one-second video segments, while the transformers encode visual, audio, and audio-visual signals across input sequences of 30 seconds. Microsoft admits that training the model is still memory intensive on GPUs because it contains 155 million weight parameters, and the three transformers together consume 128 million (82.6 percent) of the total. Instead of accepting small batches and longer training cycles, Microsoft and Nvidia decided to share and factorize the weight parameters to cut down the model size. Researchers from both companies used two strategies to achieve this, quoted below after a rough sketch of the overall layout.
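The paper and blog post give the full architecture details; purely as an illustration, a five-component pipeline of this kind might be wired together roughly as follows. All module choices, dimensions, and names here are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultimodalVideoModel(nn.Module):
    """Illustrative sketch of the five-component layout described above:
    audio/visual CNNs over one-second segments, audio/visual transformers,
    and a multimodal transformer over the full 30-second sequence.
    Sizes and backbones are placeholder assumptions, not the paper's."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Per-second feature extractors (stand-ins for the real CNN backbones)
        self.visual_cnn = nn.Sequential(
            nn.Conv3d(3, d_model, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=9, padding=4),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.visual_transformer = encoder()      # contextualizes 30 visual tokens
        self.audio_transformer = encoder()       # contextualizes 30 audio tokens
        self.multimodal_transformer = encoder()  # fuses both token streams

    def forward(self, video_secs, audio_secs):
        # video_secs: (B, 30, 3, 16, H, W) -- one entry per second, 16 frames each
        # audio_secs: (B, 30, 1, samples_per_second)
        T = video_secs.shape[1]
        v = torch.stack([self.visual_cnn(video_secs[:, t]) for t in range(T)], dim=1)
        a = torch.stack([self.audio_cnn(audio_secs[:, t]) for t in range(T)], dim=1)
        v = self.visual_transformer(v)           # (B, 30, d_model)
        a = self.audio_transformer(a)            # (B, 30, d_model)
        return self.multimodal_transformer(torch.cat([v, a], dim=1))
```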
“The first strategy shares weights across layers within each transformer, treating a transformer as an unrolled recurrent network. As shown in Figure 2(b), this reduces the parameters by 83 percent (from 128 million to 22 million). The second strategy involves partial weight sharing with low-rank factorization, where we factorize each weight matrix of the transformer into a form W = UV and shared U across transformers while keeping V private to each transformer.”
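Read together, the two strategies amount to reusing one layer's weights at every depth of a transformer, and replacing each remaining weight matrix with the product of a shared low-rank factor and a private one. A minimal sketch of both ideas, using our own illustrative dimensions rather than the paper's, might look like this:

```python
import torch
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """Strategy 1: a single layer's weights are reused at every depth step,
    treating the transformer as an unrolled recurrent network."""
    def __init__(self, d_model=512, n_heads=8, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.depth = depth                 # how many times the one layer is applied

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)              # identical parameters at every step
        return x

class FactorizedLinear(nn.Module):
    """Strategy 2: W = U V with rank r << d; U can be shared across the audio,
    visual, and multimodal transformers while V stays private to each one."""
    def __init__(self, shared_U, d_out):
        super().__init__()
        d_in, r = shared_U.shape
        self.U = shared_U                                      # shared factor
        self.V = nn.Parameter(torch.randn(r, d_out) * 0.02)    # private factor

    def forward(self, x):
        return x @ self.U @ self.V         # equivalent to x @ W, with W = U V

# One shared factor reused by several projections (illustrative sizes): a full
# 512x512 weight has 262,144 parameters; sharing U (512x64) and keeping only
# V (64x512) private costs 32,768 parameters per private projection.
shared_U = nn.Parameter(torch.randn(512, 64) * 0.02)
visual_proj = FactorizedLinear(shared_U, 512)
audio_proj = FactorizedLinear(shared_U, 512)
```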
In total, the research team was able to reduce the parameter usage of the transformers from 128 million to just 4 million.