You don't need to train from scratch. Google provides a pre-trained "solo violin" and "singer" model.
The adoption of DDSP vocoders has been explosive in the AI music and speech communities. Here is why this technology is superior for specific use cases. ddsp vocoder
The network does not learn to output raw audio waveforms. Instead, it learns to output control signals for traditional synthesizers (e.g., "Amplitude envelope," "Fundamental frequency," "Filter coefficients"). You don't need to train from scratch