Looking for a developer to join our team (building text-to-voice engine)

DSP, Plugin and Host development discussion.

Post

Komponant is looking for a lead developer / CTO to join our team (1 audio expert + 1 coder + 1 designer). Here is our demo (currently just a very short teaser of the “sing” mode); we'd be happy to share a demo of the “speech” mode as well. www.soundcloud.com/komponant/teaser


We're building a revolutionary Text-to-Voice engine. Current TTS technology is one of the key obstacles to natural, enjoyable voice interfaces. We are offering a ground-floor opportunity to help solve this problem by building a technology that can speak and sing just like humans do.

What you will do:
Define our technology roadmap, stack and toolsets
Architect and lead development of our entire software ecosystem
Implement our technology on various platforms, starting with an AU plugin and a web app
Integrate third-party technologies (e.g. ARA, Splice, speech recognition, ML)
Grow and inspire a team of developers to help realize your technology vision
Design, develop, advise and lead on all things technological

You have:
5+ years' experience as a software developer for desktop and web apps
Expert level OO language skills (C++)
Previous roles as CTO and/or in technology startups
Experience in using audio software and music production tools
Experience with audio engineering & DSP
An interest in music production, bots or AI
An amazing, collaborative work ethic (but you are also totally self-directed as befits a distributed team and leadership role)

What we offer:
Equity and cofounder status (salary to follow at next fund-raise in 2017Q2)
100% Remote working if you want. Our team is currently distributed across the globe (Paris, Bangkok, Philadelphia)
A place where you can learn, grow, and drive innovation
A chance to participate at the very beginning of an exploding technology area

We are a small, diverse and experienced team, consisting of an audio expert, a coder, and a designer. We have some initial, limited funding and the backing of Techgrind (South-East Asia).

The list of applications for a TTV engine that actually speaks - and sings - like a human is endless (think conversational UIs, education, entertainment, music production, etc.). In a context where machines are learning to process natural language and recognize emotions, the inanimate responses of current TTS engines are increasingly jarring and destructive to the experience. And when it comes to singing, the existing offering (essentially Vocaloid) doesn't meet the market's requirements in terms of quality and workflow.
So we're building the most advanced and most versatile vocal synthesis technology, one that can power the next generation of conversational UIs.

Interested? Let's talk.

Post

rodyy wrote: Interested? Let's talk.
cute :)

but i really don't get it. it doesn't make any sense whatsoever, does it.


i mean, i'm one guy, who taps around on a keyboard in my spare time, certainly never for a salary,

and i've explored *hundreds* of audio synthesis and processing methods. as far as voice goes, my last model is 7 bandpasses based on the very, very old method, which extends the signal to ~12kHz and imo sounds real nice. mine isn't quite wonderful as i'm too lazy to finish the consonant amplitudes.

takes a bit of fiddling but i'm sure that a source-filter model would provide much more flexibility than the hilariously awful sampling methods predominating today. but no one asked me.
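[for readers unfamiliar with the idea: a "bandpass bank" source-filter voice of the kind described above can be sketched in a few lines. this is only an illustrative Python/NumPy sketch; the formant frequencies and bandwidths are textbook values for an /a/ vowel, not xoxos's actual model.]

```python
import numpy as np

def resonator(x, fc, bw, sr):
    """Two-pole resonant bandpass: one 'formant' of the filter bank."""
    r = np.exp(-np.pi * bw / sr)              # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * fc / sr)
    a2 = r * r
    g = 1.0 - r                               # crude gain normalisation
    y = np.zeros(len(x) + 2)                  # two zero samples of filter state
    for n in range(len(x)):
        y[n + 2] = g * x[n] - a1 * y[n + 1] - a2 * y[n]
    return y[2:]

sr = 16000
# glottal-ish source: plain impulse train at 110 Hz
src = np.zeros(sr // 2)                       # half a second
src[::sr // 110] = 1.0
# rough formants for an /a/ vowel (frequency Hz, bandwidth Hz) - illustrative values
formants = [(730, 90), (1090, 110), (2440, 170)]
# parallel bandpass bank: sum the resonator outputs
voice = sum(resonator(src, fc, bw, sr) for fc, bw in formants)
```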


the thing is, i cannot see how three people (apparently on salary) can't accomplish this, even if there were only one of them. why does it take three people sitting around on a salary to make a singing voice synthesizer? i made one, mine's alright, and i'm working on a 2001 netbook in poverty.

i know someone, they work for a company that made a video game, roblox iirc.

they have worked for this company for years. roblox.

how can a game like roblox require a fleet of persons to operate and maintain, and amazingly awesome complex games are produced in spare time by people who are totally independent?

the thing is, when salary becomes involved,

when *life support* becomes involved,

we're no longer talking about pursuit of an objective other than population control.




right now i do physical labor for minimum wage, and i'm getting on in years, but i have the freedom, when not exhausted, to develop whatever i want, which has been some "really neat things," all over the place, and there's pretty much sod all for it. even if i made something amazing, it would be ignored, like many cultural contributors today.

but here's someone, they're paying someone, to sit around and make the thing,

with three other people!

on salary!


well it's a voice and it speaks the words or sings.

how difficult can it be? a collection of analysis and rendering methods, from the myriad options documented.. i could pull off a dozen rendering methods, and i've only as much math as trig.

this would translate as, i'm too paranoid to work for you. everything in culture is wrong. i'm swinging a pickaxe for $8 an hour and people are getting paid to have their productivity obliterated with a "team" paradigm. roblox, seriously?
you come and go, you come and go. amitabha neither a follower nor a leader be tagore "where roads are made i lose my way" where there is certainty, consideration is absent.

Post

xoxos wrote: everything in culture is wrong. i'm swinging a pickaxe for $8 an hour and people are getting paid to have their productivity obliterated with a "team" paradigm. roblox, seriously?
Well, that's unbelievable, you could easily find a much higher paying job.
The thing is, some of it involves:

- first, write an 800-page-long software requirements specification
- link those requirements to a customer-specified requirements doc so that something called a traceability matrix (which, unsurprisingly, nobody reads) can be produced
- then write an even longer testing document that pretty much replicates every scenario in it, for morons err testing engineers who wouldn't know which button to click or what value to type in the edit boxes
- dealing with QA personnel who ask why public methods like this don't have a unit test:
void setValue(int x) { m_X = x; }
- implement, test and integrate it all in 6 months, after a number of years spent producing a lot of useless stuff
- getting fired for being too tired to continue this silly thing
- did I mention that this wasn't supposed to be a waterfall model from the 1970s?

the "good" side of it is that it's a team job, consisting of 4-5 people, packed together with other teams in a large room, all chit-chatting (about the work, of course, fortunately) and destroying each other's concentration.

that's what it takes to be really inefficient. There's no way that a team of 3 people working remotely could achieve all this, so perhaps you have been a bit unjust in criticising the ad. they are just learning this thing, you know? not very surprising. ;-)
~stratum~

Post

xoxos wrote:we're no longer talking about pursuit of an objective other than population control.
Very cute too :)

But seriously, I'd be curious to hear what the source-filter model you mentioned sounds like. The right solution may actually be a mix of modeling and sampling

Post

Source-filter model is probably just another name for subtractive synthesis. Not that I'd know; I've never looked at it. But here's what I learned about speech-related software while trying to write a recognizer:

- You need to find a lot of data, or record yourself.
- Find a way to label the data automatically, because doing it by hand is time-consuming (have a look at http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf for one method)
- Extract the relevant parameters from the automatic labelling result for each phoneme you are interested in
- Phonemes are not what you see in a dictionary, they are context dependent and affected by their surroundings (i.e. other phonemes in a word or in word boundaries)
- Somebody else solved the problem decades ago, so there is a lot of information out there, but the problem is, there is a lot to read
- All the software you need is available in some open source package or another, so look for it before trying to reinvent your own, because it requires specialist knowledge to do it right, especially if it is a recognition algorithm. Text-to-speech is probably easier.
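To make the labelling and context-dependency points concrete, here's a small illustrative Python sketch. The label data is hypothetical (HTK-style "start end phone" lines, times in 100 ns ticks) and the "left-phone+right" triphone naming is just one common convention, not anything from a specific package:

```python
# hypothetical HTK-style labels for "hello": start end phone, times in 100 ns ticks
lab = """0 1200000 sil
1200000 1900000 hh
1900000 2600000 eh
2600000 3400000 l
3400000 4300000 ow
4300000 5000000 sil"""

def parse_labels(text, tick=1e-7):
    """Parse 'start end phone' lines into (start_s, end_s, phone) tuples."""
    segs = []
    for line in text.strip().splitlines():
        start, end, phone = line.split()
        segs.append((int(start) * tick, int(end) * tick, phone))
    return segs

def triphones(segs):
    """Context-dependent phone names: each phone tagged with its neighbours,
    since a phone's realisation depends on what surrounds it."""
    phones = [p for _, _, p in segs]
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        out.append(f"{left}-{p}+{right}")
    return out

segs = parse_labels(lab)
tris = triphones(segs)   # e.g. the 'hh' segment becomes 'sil-hh+eh'
```

From segments like these you'd then extract per-phoneme parameters (durations, formants, etc.) for whatever model you're training.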
~stratum~

Post

rodyy wrote: But seriously, I'd be curious to hear what the source-filter model you mentioned sounds like. The right solution may actually be a mix of modeling and sampling
this is the kind of talk i like - how to get things done :)

same technique as the 1938 (or thereabouts) world's fair model. for this build i used a sinc impulse train osc (equal gain for all harmonics); in the past i've used a simple shaped osc approximating the signal the glottis produces (a rounded pulse with some dc overshoot, more or less). equal gain is good theoretically ;)

source-filter models allow all sorts of familiar synthesis techniques to modify them, and adapt easily if you want to use or combine phoneme sets. while other voice synthesis techniques are much more sophisticated, it gets the job done well.
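[for reference: an equal-gain band-limited impulse train like the one described can be generated as a sum of cosines, one per harmonic up to Nyquist (a closed-form Dirichlet-kernel version also exists). a minimal Python/NumPy sketch, not xoxos's actual oscillator:]

```python
import numpy as np

def impulse_train(f0, sr, dur):
    """Band-limited impulse train: every harmonic up to Nyquist at equal gain."""
    n_harm = int((sr / 2) // f0)              # harmonics that fit below Nyquist
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros(len(t))
    for k in range(1, n_harm + 1):
        y += np.cos(2.0 * np.pi * k * f0 * t)
    return y / n_harm                         # normalised so the peak at t=0 is 1.0

# half a second of 110 Hz source, ready to feed into a formant filter bank
src = impulse_train(110.0, 16000, 0.5)
```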

flat phonemes with my last build
http://xoxos.net/temp/syng3.mp3

course the effect isn't impressive while it's static, it sounds like a computer. i built it using headphones (and with a head cold), and by the time i got to hear it on speakers i was on to other things.. several of the consonants are much too loud. if you can hear "past" the imperfections, i think the technique hasn't been sufficiently explored by the public for "decent quality generalised parametric voice synthesis".

only thing i've done with it
http://xoxos.net/temp/syng3tyg2.mp3

it just kills me that the s.o.t.a. in commercial audio synthesis is sample based. c. 2000 a japanese developer shared some audio of their model with me - while i can only attest to its quality, it has contributed to my pursuit of this method instead of articulatory modeling.

especially for singing, these kinds of models can render all the nonphonemic vocalisations you can ask for.. sighs, giggling, fading words into aspiration, stuff that sample based models "do not do easily". imo lots of mileage for your development investment.
you come and go, you come and go. amitabha neither a follower nor a leader be tagore "where roads are made i lose my way" where there is certainty, consideration is absent.
