With Lorenzo Foglianti, Papercup

At Papercup, we aim to translate the world’s content. In practice, this means translating audio from an input language into an output language. In this talk, we focus on what we consider the most interesting part of the problem: the function that maps text to audio. Over the past few years, machine learning research has greatly improved the quality of synthesised audio compared with more traditional methods. However, these models are inherently autoregressive: each audio sample is generated conditioned on the samples before it, so synthesis cannot be parallelised on modern hardware. Synthesis time is therefore limited by the nature of the model rather than by the hardware, and as a result these methods can rarely be deployed in practice. We present a new class of models, called Flows, which allows us to generate audio in a non-autoregressive way. We will also play sample audio synthesised by state-of-the-art models.
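
To make the contrast between autoregressive and non-autoregressive generation concrete, here is a minimal toy sketch in Python with NumPy. It is an illustration only and not the models presented in the talk: the function names, the AR(1) recurrence, and the elementwise affine transform are all assumptions chosen for brevity, whereas real flow-based synthesisers stack learned invertible layers conditioned on text.

```python
import numpy as np


def autoregressive_sample(n_samples, coeff=0.9, seed=0):
    """Toy autoregressive sampler (hypothetical AR(1) process).

    Each output depends on the previous output, so the loop has to run
    one step at a time and cannot be parallelised across time steps.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    audio = np.empty(n_samples)
    previous = 0.0
    for t in range(n_samples):
        previous = coeff * previous + noise[t]
        audio[t] = previous
    return audio


def flow_sample(n_samples, log_scale, shift, seed=0):
    """Toy flow sampler: one invertible affine map applied to noise.

    Every output is computed from its own noise value, so the whole
    signal is produced in a single vectorised operation.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return z * np.exp(log_scale) + shift


def flow_inverse(x, log_scale, shift):
    """Inverse of the affine map above, recovering the noise from the output."""
    return (x - shift) * np.exp(-log_scale)
```

The first function needs as many sequential steps as there are samples, while the second produces all samples at once and can be inverted exactly, which is the property that flow models rely on for training.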

