Memory usage is a common bottleneck for large ML models. Especially in academia, we have to use resources wisely and make the most of what is available. While working on the KL objective of my mixture model, I had to make some less common optimizations to reduce memory usage.
Setup

The decoder outputs a large tensor \(O\) of shape \((M \times B \times L \times D)\), where \(M\) is the number of clusters, \(B\) is the batch size, \(L\) is the sequence length and \(D\) is the model output dimension.
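
To get a feel for the scale, here is a minimal sketch that estimates how much memory a dense tensor of this shape takes. The sizes below and the float32 assumption (4 bytes per element) are made up for illustration and are not the values from my model; `tensor_bytes` is just a helper name I picked.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the storage size in bytes of a dense tensor with the given shape.

    bytes_per_element defaults to 4, i.e. float32 (an assumption).
    """
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


# Hypothetical sizes, chosen only for illustration.
M, B, L, D = 10, 64, 128, 512

size = tensor_bytes((M, B, L, D))
print(f"O takes {size / 2**20:.0f} MiB")  # O takes 160 MiB
```

The footprint grows linearly in \(M\), so every additional cluster adds another full \((B \times L \times D)\) slice. This counts a single copy of \(O\) only; any intermediate result of the same shape costs as much again.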