(Hint: In this text, the word "phone" means the phonetic phone, not the telephone device. And please do not confuse a phonetic phone with a phoneme, which is indeed a similar but distinct concept: a phoneme is an abstract sound category of a language, while a phone is a concrete realization of such a sound.)
Today I'm going to tell you about my singable Text-To-Speech engine, which I have also integrated into my own DAW. For simplicity's sake, I will primarily explain how MIDI events become the phone data, including the f0 fundamental frequency envelopes, which my implementation of the MBROLA algorithm can then process directly, without re-parsing. And since a company whose quite well-known DAW is also implemented in Object Pascal is interested in my technology, I will not share code from my actual implementation this time; the short Pascal sketches below are only simplified illustrations. Let's start with the explanation of the procedure.
The first step is to extract all lyric text MIDI events from the rest of the MIDI events and put them into a separate array, where each lyric text item has three fields: a UTF-8 text string, a 64-bit integer start time position and a 64-bit integer end time position. The time positions are given in milliseconds.
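To make the data layout concrete, here is a rough sketch in Free Pascal (illustrative only, not my engine code; the MIDI event record is heavily simplified and real MIDI parsing is omitted):

type
  TLyricTextItem = record
    Text: UTF8String; // the lyric text
    StartTime: Int64; // start position in milliseconds
    EndTime: Int64;   // end position in milliseconds
  end;
  TLyricTextItems = array of TLyricTextItem;

  // Heavily simplified stand-in for a parsed MIDI event
  TMIDIEvent = record
    TimeMS: Int64;
    IsLyric: Boolean;
    LyricText: UTF8String;
  end;

function ExtractLyricItems(const Events: array of TMIDIEvent): TLyricTextItems;
var
  i: Integer;
begin
  Result := nil;
  for i := 0 to High(Events) do
    if Events[i].IsLyric then begin
      SetLength(Result, Length(Result) + 1);
      with Result[High(Result)] do begin
        Text := Events[i].LyricText;
        StartTime := Events[i].TimeMS;
        EndTime := Events[i].TimeMS; // fixed up later via the Note-Off events
      end;
    end;
end;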
And with this data, the MIDI Note-On and Note-Off events are assigned to the lyric text items. The last Note-Off event assigned to an item also determines the item's end time, unless another lyric text item starts before that Note-Off event; in that case, the start time of the new lyric text item determines the end time of the previous one.
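Sketched, reusing TLyricTextItem from above and assuming that LastNoteOff[i] already holds the time of the last Note-Off event assigned to item i:

procedure ResolveEndTimes(var Items: TLyricTextItems;
  const LastNoteOff: array of Int64);
var
  i: Integer;
begin
  for i := 0 to High(Items) do begin
    // The last assigned Note-Off normally ends the item ...
    Items[i].EndTime := LastNoteOff[i];
    // ... unless the next lyric text item starts earlier.
    if (i < High(Items)) and (Items[i + 1].StartTime < Items[i].EndTime) then
      Items[i].EndTime := Items[i + 1].StartTime;
  end;
end;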
Then, time gaps between lyric text items are filled with silence (using the phone "_").
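This step is simple enough to sketch in full, with a small append helper:

procedure AppendItem(var A: TLyricTextItems; const Item: TLyricTextItem);
begin
  SetLength(A, Length(A) + 1);
  A[High(A)] := Item;
end;

function FillGapsWithSilence(const Items: TLyricTextItems): TLyricTextItems;
var
  i: Integer;
  Silence: TLyricTextItem;
begin
  Result := nil;
  for i := 0 to High(Items) do begin
    if (i > 0) and (Items[i].StartTime > Items[i - 1].EndTime) then begin
      Silence.Text := '_'; // the silence phone
      Silence.StartTime := Items[i - 1].EndTime;
      Silence.EndTime := Items[i].StartTime;
      AppendItem(Result, Silence);
    end;
    AppendItem(Result, Items[i]);
  end;
end;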
Afterwards, all lyric text items are converted into a phone stream array, whereby the phones of each lyric text item are scaled and stretched in time to fill the item's duration.
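A sketch of the per-item time scaling, assuming the text-to-phone step has already produced phones with nominal durations:

type
  TPhone = record
    Symbol: AnsiString; // e.g. 'a', 't', or '_' for silence
    DurationMS: Int64;
  end;
  TPhoneStream = array of TPhone;

// Stretch the nominal phone durations so that together they exactly
// fill the lyric text item's time span.
procedure ScalePhonesToItem(var Phones: TPhoneStream; ItemDurationMS: Int64);
var
  i: Integer;
  Total: Int64;
  Factor: Double;
begin
  Total := 0;
  for i := 0 to High(Phones) do
    Inc(Total, Phones[i].DurationMS);
  if Total = 0 then
    Exit;
  Factor := ItemDurationMS / Total;
  for i := 0 to High(Phones) do
    Phones[i].DurationMS := Round(Phones[i].DurationMS * Factor);
  // Note: a real implementation would also distribute the rounding
  // remainder so that the sum matches ItemDurationMS exactly.
end;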
When this is done, the MIDI events, namely Note-On, Note-Off, Pitch-Bend and the pitch-bend-range RPN controllers, are converted into a global f0 fundamental frequency envelope. Simple frequency-linear note slides can be achieved by overlapping active notes. The global f0 fundamental frequency envelope is then used to fill in the phone-specific, time-relative f0 fundamental frequency envelopes.
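The core of the pitch conversion is just the standard MIDI-note-to-frequency formula plus the pitch-bend offset; sketched below, where the bend range in semitones comes from the pitch-bend-sensitivity RPN (it defaults to 2 semitones):

uses Math;

// MIDI note number plus 14-bit pitch bend value (0..16383, center 8192)
// to fundamental frequency in Hz.
function MIDIToF0(NoteNumber: Integer; PitchBend14: Integer;
  BendRangeSemitones: Double): Double;
var
  BendSemitones: Double;
begin
  BendSemitones := ((PitchBend14 - 8192) / 8192.0) * BendRangeSemitones;
  Result := 440.0 * Power(2.0, (NoteNumber + BendSemitones - 69.0) / 12.0);
end;

// A frequency-linear slide across the overlap of two active notes,
// with T running from 0.0 (old note) to 1.0 (new note).
function SlideF0(OldF0, NewF0, T: Double): Double;
begin
  Result := OldF0 + (NewF0 - OldF0) * T;
end;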
Then the phone durations are checked against a maximum duration; any phone that is too long is split into several copies, with the original duration divided equally among the copies so that none of them exceeds the maximum. Without this, over-long phones could be truncated at the maximum duration, which would throw the whole timing out of sync.
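Sketched, reusing the TPhone type from above; MaxDurationMS stands for whatever maximum the synthesis stage can handle, the concrete limit is an assumption here:

uses Math;

// Split one over-long phone into just enough copies that no copy
// exceeds MaxDurationMS, while the copies together keep the exact
// original duration.
function SplitLongPhone(const Phone: TPhone;
  MaxDurationMS: Int64): TPhoneStream;
var
  Pieces, i: Integer;
  Base, Remainder: Int64;
begin
  Pieces := Ceil(Phone.DurationMS / MaxDurationMS);
  if Pieces < 1 then
    Pieces := 1;
  SetLength(Result, Pieces);
  Base := Phone.DurationMS div Pieces;
  Remainder := Phone.DurationMS mod Pieces;
  for i := 0 to Pieces - 1 do begin
    Result[i] := Phone;
    Result[i].DurationMS := Base;
    if i < Remainder then
      Inc(Result[i].DurationMS); // distribute the division remainder
  end;
end;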
After that, it is ensured that the f0 envelope of every phone correctly covers the normalized time range 0.0 ... 1.0 of the phone's duration; where points are missing, they are filled in with values interpolated from the global envelope at the corresponding absolute times.
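Sketched for the two endpoint cases; SampleGlobalF0 stands for a hypothetical function that interpolates the global f0 envelope at an absolute time in milliseconds:

type
  TF0Point = record
    Position: Double; // normalized 0.0 .. 1.0 within the phone
    F0: Double;       // Hz
  end;
  TF0Points = array of TF0Point;
  TGlobalF0Sampler = function(TimeMS: Int64): Double;

procedure EnsureEnvelopeEndpoints(var Points: TF0Points;
  PhoneStartMS, PhoneEndMS: Int64; SampleGlobalF0: TGlobalF0Sampler);
var
  i: Integer;
  P: TF0Point;
begin
  // Missing start point: prepend one sampled from the global envelope.
  if (Length(Points) = 0) or (Points[0].Position > 0.0) then begin
    P.Position := 0.0;
    P.F0 := SampleGlobalF0(PhoneStartMS);
    SetLength(Points, Length(Points) + 1);
    for i := High(Points) downto 1 do
      Points[i] := Points[i - 1];
    Points[0] := P;
  end;
  // Missing end point: append one sampled from the global envelope.
  if Points[High(Points)].Position < 1.0 then begin
    P.Position := 1.0;
    P.F0 := SampleGlobalF0(PhoneEndMS);
    SetLength(Points, Length(Points) + 1);
    Points[High(Points)] := P;
  end;
end;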
And finally we're done, and we can synthesize the vocals with the diphone-based MBROLA algorithm.
And here are some workflow examples:
And here is the final result:
I hope you found this interesting.