Technology

Text to Speech Software Copies Any Human Voice 299

mindpixel writes "A New York Times Report (registration required) states that AT&T Labs will start selling speech software that it says is so good at reproducing the sounds, inflections and intonations of a human voice that it can recreate voices and even bring the voices of long-dead celebrities back to life. The software, which turns printed text into synthesized speech, makes it possible for a company to use recordings of a person's voice to utter things that the person never actually said."
This discussion has been archived. No new comments can be posted.

  • I think 99.5% accuracy would suck, personally. If it were that close, so that only a couple of words in, say, 200 or 500 or 1000 were wrong, you would still have to go and find them. And if the software is that good, it would probably make errors like to->too or their->they're: the kind that aren't easily caught by a spell checker program.

    No, I'd rather have such horrible accuracy that you *know* you're going to have to go through and correct the document. *OR* I'd take 100% accuracy, so I'd be completely confident that there were no errors.

    Just my $0.02
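
    A quick sketch of the arithmetic behind that complaint, in Python. It assumes (my assumption, not the poster's) that "99.5% accuracy" means each word is independently right 99.5% of the time; real recognizers tend to cluster their errors, so treat this as a rough guide only.

```python
def expected_errors(word_count, accuracy):
    """Average number of wrong words in a document of word_count words."""
    return word_count * (1.0 - accuracy)


def chance_of_any_error(word_count, accuracy):
    """Probability that at least one word is wrong.

    Assumes every word is recognized independently of the others.
    """
    return 1.0 - accuracy ** word_count


if __name__ == "__main__":
    for n in (200, 500, 1000):
        print(n, expected_errors(n, 0.995), chance_of_any_error(n, 0.995))
```

    Under that assumption the expected count is about 1, 2.5 and 5 wrong words for 200, 500 and 1000 words, which is exactly the "couple of words you still have to hunt for" situation.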
  • I already get prerecorded voice messages. Talk about the ultimate annoyance: phone spam is bad enough, but you don't even have someone on the other end whose time you can waste, mind you can play with, and other ... er, someone on the other end to demand you're taken off their lists, etc.

    Thus, it is very interesting to learn about this part of the TCPA... any idea who I can file a formal complaint with next time I get one of these calls?


    --

  • There are instrument synths that have been out for a while that actually acoustically model the instrument being synthesized, and instead of altering the frequency/amplitude of the generated noise, actually change the model's airflow, resonance, etc.

    Having extensive experience with digital "pianos," I can testify that the technology to realistically produce an authentic piano sound is a long, long, long way off. The synth is close, but any trained pianist can easily tell the difference. I have a fairly modern (about 1.5 years old) professional digital piano and every time I use it I lament the limitations of the resonance reproduction. It's just not there.

    Acoustic instruments (including voice) are very complex beasts. To reproduce the qualities of a piano's strings, pedal effects, soundboard and overall resonance is not easily done. That's not even taking into account temperature and humidity. In the end it would probably take more memory than is practical, if it is even possible.

    Better to treat these things as they are: another class of instrument. I don't call my keyboard a piano. I call it a keyboard.

    --

  • Or put them together, and get a REAL telephone voice changer... What celebrity do you want to pretend to be your secretary?

  • One of the features of HL is that the voice that you hear over the PA system throughout the game is actually several different samples; it's possible to program the scripting engine to say any combination of words that have been generated as sound files; many mod authors used this to give a bit more unique feel to their maps. Yes, there was no intonation, but for a loudspeaker voice, this worked well.

  • but there are different voices for festival, some of which sound quite fine.

    http://mandrake.net/demo_voice.wav
    produced with festival
    http://mandrake.net/demo_2.wav
    produced with festival
    --
    Geoff Harrison (http://mandrake.net)
  • soon festival will have higher quality voices available freely.

    see http://mandrake.net/demo_voice.wav
    and http://mandrake.net/demo_2.wav
    for samples.

    --
    Geoff Harrison (http://mandrake.net)
  • a few people in here so far have been comparing Festival to AT&T NextGen saying that NextGen sounds better, etc. I find that funny, especially considering that NextGen is built on top of festival, and festival (the open source speech synthesizer) can be made to sound just as good or better.
    --
    Geoff Harrison (http://mandrake.net)
  • by Mandrake ( 3939 ) <mandrake@mandrake.net> on Tuesday July 31, 2001 @03:01PM (#2179452) Homepage Journal
    Also, building a voice from speakers who you do not control the coverage on (particularly the mention of reviving dead actors, etc) would be problematic at best. You could not get the proper coverage (nor the quality) to really do anything useful.
    --
    Geoff Harrison (http://mandrake.net)
  • AT&T's synthesis system actually contains Edinburgh University's Festival Speech Synthesis System (http://festvox.org/festival), although the synthesis technique in NextGen is not in Festival (as it's proprietary). However, there is work from Carnegie Mellon, by Kevin Lenzo and Alan Black (http://www.festvox.org), that provides all the tools (for free) that allow you to build your own voice in Festival. For simple domains the tools really work well and easily capture the quality of the original speaker; for a whole general voice that can say anything it is a *lot* of work, but is possible from the tools. This is what we are doing in our company Cepstral (http://www.cepstral.com).

    Actually there is even an example of Hemos himself, doing a talking clock, on http://www.festvox.org/ldom/ldom_time.html
    --
    Geoff Harrison (http://mandrake.net)
  • Very true with regard to movie and TV acting, but there's always live performances. Perhaps this could spur a greater public interest in theatre acting. After all, all these trained actors and actresses would still want jobs.
  • You mean they aren't already?
  • festival is kinda cool but it sounds like Charlie Brown's teacher compared to the AT&T voice.
  • both of these voices are hands above the default :)
  • supposedly the DoD has had this capability for years, including in foreign languages. The idea being that the US can intercept enemy radio communications and replace them with confusing or erroneous instructions, *in real time*, in the original radio operator's voice.

    If this were ever used it would be a violation of the Geneva Convention (the idea that you could use it to give fake orders to the enemy, or impersonate leaders telling their people to surrender, etc). Not that the United States cares at all about the Geneva Convention, what with our history of detaining and even executing foreign nationals without ever letting them speak to their consulates, in direct violation of said Convention.

    Nevertheless, the U.S. military (and this is indeed ironic) has been more inclined to respect the Geneva Convention (at least officially) than the civilian government. Developing this sort of technology flies in the face of that, however, which makes me suspect it is being driven more by one of the spook agencies (CIA, NSA, FBI) than the DoD ... but in today's ethic-free climate, who can tell?
    --
  • While you can certainly automate intonation so that the sentences come out with a natural-sounding intonation, you still have the problem of choosing which intonation, based on the meaning and context.

    Look at MarkusQ's examples again. "Yeah, right!" is the best one since it's so simple. There is more than one correct way to intone it. How do you know whether it's meant ironically or as an exclamation of revelation? I think you need Intelligence to correctly make that decision.


    ---
  • What happens when you get a sample of some General's voice

    That's why you also need to know the secret key "OPE" to get past the CRM-114 discriminator.


    ---
  • "limited problem domain" is the key. Given a limited problem domain, computers are better than anyone at anything. It's all in how you define the problem.

    Most AI research is useless for this very reason. Data sets that are known to do well are re-used as "proof" that a particular algorithm works well. Heck, the act of inputting the data into the computer for processing usually involved human interaction which skews the data.

    When it comes to performing tasks that a 5 year old can do, computers still suck.

    -jon

  • by ivan256 ( 17499 ) on Tuesday July 31, 2001 @09:15AM (#2179469)
    It's interesting that their precooked demos sound great, but the speech generated in the interactive demo still sounds like a classic text-to-speech program with a few enhancements. This doesn't seem like a significant improvement over, say, what ships with MacOS by default. I'm not impressed.

  • Who will need an overpaid difficult celebrity when you can resurrect dead ones or invent new ones for movies, sitcoms, etc?

    It'll never happen. Would you have gone to see Final Fantasy if it hadn't starred Ben Affleck and a RealDoll, plus the cast from Aliens?

    --
    Evan

  • They already did something similar in the game Dune II: The Building of a Dynasty. Words and phrases are atomized and then combined.

    e.g.

    [Ordos] [Unit] [Destroyed].
    [Harkonnen] [Unit] [Destroyed].

    Each of the three voice actors used had the same thing applied. Different phrases have different intonations so that it doesn't sound (too) robotic (unlike automated phone attendants).

    I thought the effect was pretty good, much better than the AT&T samples I tried IMHO.

    -Shieldwolf
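
    For anyone curious how that kind of word-level stitching can be organized, here is a minimal Python sketch. Everything in it is hypothetical: the clip file names, the `CLIPS` table and the idea of keying each recording by its position in the phrase (as a stand-in for "different intonations") are my assumptions, not how the game actually stores its audio.

```python
# Hypothetical inventory: (word, position in phrase) -> recorded clip.
# Recording a word once per position is one simple way to keep the
# intonation from sounding flat when clips are played back to back.
CLIPS = {
    ("Ordos", "start"): "ordos_start.wav",
    ("Harkonnen", "start"): "harkonnen_start.wav",
    ("Unit", "middle"): "unit_middle.wav",
    ("Destroyed", "end"): "destroyed_end.wav",
}


def position_of(index, length):
    """Classify a word slot as the start, middle or end of the phrase."""
    if index == 0:
        return "start"
    if index == length - 1:
        return "end"
    return "middle"


def build_playlist(words):
    """Return the clip names to play, in order, for a list of words."""
    playlist = []
    for index, word in enumerate(words):
        key = (word, position_of(index, len(words)))
        if key not in CLIPS:
            raise KeyError(f"no recording of {word!r} for position {key[1]!r}")
        playlist.append(CLIPS[key])
    return playlist


if __name__ == "__main__":
    print(build_playlist(["Ordos", "Unit", "Destroyed"]))
```

    The obvious limit is the one other posters point out: the scheme can only say what has been recorded, in the positions it was recorded for.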

  • This is extremely difficult in practice, as it equates to acting. The speech may sound pretty good, but it can't insert all the aural cues that a human actor can. You may as well expect a virtual actor to act out a virtual movie using nothing but the text input of the script.

    Instead of TTS, what is needed is a program that takes the speech input of one actor and modifies it to sound like another. This is already sort of done with vocoders used for music production.

  • From now on I will have James Earl Jones read me all my e-mail. Occasionally he can also utter "Come to the Dark side" to keep me amused :)
  • With a good speech recognition package, this would be a good way to get extremely high compression for voice. Record your voice, convert to text, compress text, spit over the net, change back to *your* voice on the other end. It would require initially transmitting your voice profile. However, it would not work well with current technology because the lag during speech recognition would be quite noticeable. Also, you would have to detect inflection in the speech recognition phase and encode that in the text.

    This could also be very useful for deaf telephone users. Currently, a deaf person relies on a human relay to talk to a non-TDD equipped person. With good speech-to-text and text-to-speech technology the human middle-man could be removed, saving a ton of money.
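
    A back-of-envelope sketch of how much that voice-to-text-to-voice scheme might save, in Python. The numbers are assumptions: 8 kHz, 8-bit mono audio stands in for uncompressed telephone-quality voice, and 150 words per minute is a guessed speaking rate. It ignores the one-time voice profile and any inflection markup the poster mentions.

```python
import zlib

SAMPLE_RATE_HZ = 8000      # assumed telephone-quality sampling rate
BYTES_PER_SAMPLE = 1       # assumed 8-bit samples
WORDS_PER_MINUTE = 150     # assumed speaking rate


def speaking_seconds(transcript):
    """Estimate how long it takes to say the transcript aloud."""
    return len(transcript.split()) / WORDS_PER_MINUTE * 60


def audio_bytes(seconds):
    """Size of the raw audio for a given duration."""
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def text_bytes(transcript):
    """Size of the transcript after zlib compression."""
    return len(zlib.compress(transcript.encode("utf-8")))


def compression_ratio(transcript):
    """Raw audio size divided by compressed text size."""
    return audio_bytes(speaking_seconds(transcript)) / text_bytes(transcript)


if __name__ == "__main__":
    sample = "Record your voice, convert to text, compress text, spit over the net"
    print(audio_bytes(speaking_seconds(sample)), text_bytes(sample))
    print(compression_ratio(sample))
```

    Even for a single sentence the text is hundreds of times smaller than the raw audio under these assumptions, so the recognition lag and the lost inflection are the real costs, not the bandwidth.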
  • I'm _not_ saying the device should be illegal. I'm not even saying that using it and publishing the results should necessarily be banned, no matter what use it's put to. I'm honestly not sure.

    All I'm saying is that I can see an argument for making it illegal to post a statement which poses as being from someone when (a) it isn't and (b) it causes them harm. Whether I agree with that opinion I'm really not sure.
  • by GregWebb ( 26123 ) on Tuesday July 31, 2001 @08:47AM (#2179477)
    I'm honestly not sure what to think here, but do I have a right to my voice?

    Let's say someone wanted to make me say something in direct contradiction to my normal views, then publish that. Now, I don't consider myself famous enough for this to be a problem ;-) but the possibilities are obvious. The technical liberal in me says that this is fine. The, erm, other part of me says that this could cause some serious problems and harm for people, so shouldn't be allowed. Which do people think here?

    The flipside for law enforcement is perhaps even more scary. What if I published a recording, generated in this way, of (for example) Gary Condit (sp?) confessing to having killed Chandra Levy (again, sp?)? For a parallel (and I never thought I'd cite Lois & Clark... Promise I'm not a fan, my sister used to watch it over meals so we all had to, I have a weird memory, honest really...), consider the episode where a photographer produces a pre-wedding image of them in bed which could have been taken properly but was actually faked due to a lost film.

    This has been coming for years, I know, but it's still a nasty big can of worms.
  • When a Russian company writes software that can be misused to copy a book, a programmer gets arrested and sits in jail.

    When AT&T writes software that can be misused to copy somebody's IDENTITY, they are hailed as great innovators.

    Something is wrong with this picture.

  • by __aadkms7016 ( 29860 ) on Tuesday July 31, 2001 @09:01AM (#2179480)

    Read it on Yahoo without registration here [yahoo.com].

  • I tried a bit of Shakespeare:

    O, for a muse of fire that would ascend the brightest heaven of invention! A kingdom for a stage, princes to act, and monarchs to behold the swelling scene. -- Henry V, I:1

    and

    Can such things be, and o'ercome us like a summer's cloud, without our special wonder? -- Macbeth, III:4

    Now, mind you, it sounded like a TV weatherman reading it, rather than anything like a Shakespearean actor (no, not even Kevin Costner ;-) ) -- but if you think that this is intended to be a generic male voice...hey, maybe they could take Ian McKellen's or Patrick Stewart's or Emma Thompson's or (God forbid) Keanu Reeves's voices. Who knows?

    Well, I'm impressed...

    Now, if you want to have some fun, try some Bushisms with it. ;-)

    cya

    Ethelred [macnews.de]

  • From the NYT article:
    "...a person must first go to a studio where engineers record 10 to 40 hours of readings. Texts range from business news reports to nonsense babble."

    I wouldn't be too concerned about someone faking my voice (yet---wait for next year), but this still raises the issue that what we hear and see may no longer be reality at all. This reminds me of the technology that the media is using to insert ads into sports events, and which CBS used to cover up an NBC billboard [slashdot.org] during the "millennium" New Year's celebration.

    It's not too long before we'll be able to completely fake the voice and image of whomever we please. Then it's just the credibility of the source that will matter. Content alone will carry little weight.

  • This thing has only marginal improvement over the old System 8 MacSpeak I remember playing with in high school. It pauses too long at commas and has trouble with contractions and plurals...as evidenced by the industry-wide standard of "The Oscar Mayer Wiener" song. True, this ATT thing doesn't need funky spelling to say it properly ("Meier weener" being the MacSpeak solution), but the demo doesn't attempt to sing or say it in rhythm. MacSpeak actually hit some of the notes and beats!
    ------------------------
  • by Monthenor ( 42511 ) <monthenor@@@gogeek...org> on Tuesday July 31, 2001 @08:46AM (#2179492) Homepage
    ...it still stumbles over the relatively simple "Gonna bust a cap in this bizatch's shizass."
    ------------------------
  • by cr0sh ( 43134 ) on Tuesday July 31, 2001 @11:16AM (#2179493) Homepage
    I looked up Klatt, like the AC mentioned - here are some links for the rest of us...

    GPL'd Klatt Synth Source [bham.ac.uk]

    RSynth Speech Synthesizer - Klatt based synth - go to /soundapps to download gzipped code [peak.org]

    KPE80 - A Klatt Synthesiser and Parameter Editor [ucl.ac.uk]

    Worldcom [worldcom.com] - Generation Duh!
  • by cr0sh ( 43134 ) on Tuesday July 31, 2001 @09:40AM (#2179494) Homepage
    Prior to this, the best sounding speech synthesis I had heard was from the Festival [ed.ac.uk] system, which is still pretty good - especially considering it has an open source license, something the AT&T system doesn't.

    Another good speech synthesizer, no doubt an early version of the AT&T one (possibly?), is by Lucent [bell-labs.com].

    Still, I am amazed at the quality of the AT&T system - it sounds almost perfectly natural. To the naysayers that say "No, it isn't natural" - what all of you have to realize is that this simple demo doesn't allow you to tweak all the variables that would really allow the inflections or type of voice (like whispering, etc) to really come through - it is too bad they don't give an advanced interface with a FAQ or some other form of documentation to allow this, but I imagine that if they did, it would probably take quite a while to compose even a simple sentence (I remember the hell you had to go through with an old Radio Shack speech synth for the Color Computer, specifying individual phonemes just to get proper speech to come out - it could pronounce many words, but others it just fell flat on its face).

    Finally - something I want everyone to ponder. Take a look at this old article [slashdot.org] (it was about Square redubbing FFTM) - once it loads, search for "cr0sh" and "I dare say" - you will come across a series of comments about what I think may happen in the future - what is funny is that the comments in reply to my take on things sound like your typical naysayers. How many computers were we supposed to only need back in the 60's? How much memory would people "only" need again Mr. Gates?

    What I predict will come about - probably sooner than we can all imagine. It may not be cheap enough to do it now, at a quality that people would watch, fast enough to be done quicker than what can be done with live actors - but it is all software and hardware - this stuff will get faster and cheaper. Anybody who has been in this business long enough knows that it will happen. There might still be a need for actors, and voice artists, and such - but they probably won't have the "god" status society seems to confer on them now (with the exception, perhaps, of stage acting - which will probably enjoy a huge comeback).

    Worldcom [worldcom.com] - Generation Duh!
  • by wiredog ( 43288 ) on Tuesday July 31, 2001 @08:52AM (#2179495) Journal
    I used to be in the army.

    A general can't just call up the guard post and order the person on duty to let unknown people in. I once was on duty in a radio room and we had a Very Important Senior Officer come by to see what we were doing. He wasn't on the access list, so we wouldn't let him in, even though we recognized him. He had to go get the Colonel, who was on the list, to get in. We got attaboys from him, the Colonel, and our NCOs for that. If we'd let him in, we'd have been in deep doo doo.

  • by ncc74656 ( 45571 ) <scott@alfter.us> on Tuesday July 31, 2001 @11:18AM (#2179497) Homepage Journal
    Its main use is for telephony (surprise!) but I suppose it'll be turning up in new and exciting places.

    On the radio this morning, CBS ran a short blurb about this system, including hypothetical news and sports reports. It sounded pretty good, too...if you've done anything with TTS before, the speech quality of this system was considerably ahead of what's been done before. (Light years ahead of Speak & Spell, but that's almost a given at this point. Compared to more modern systems such as Festival, it still comes out ahead quite a bit.)

    The announcer posited that, one day, his job could be in danger from this kind of technology. With some broadcasters' penchants for cutting costs any way possible (somebody either here or on K5 posted a link about Clear Channel and its shenanigans a while back, but I can't find it), DJs could end up going the way of the dodo as well.

  • The software might very well be able to copy a voice, but how does it copy emotion? Can it whisper? Can it shout? Can it sound happy or sad? Can it sing?

    Do we have a more enhanced vocal technology, or a real voice? Considering where the Amiga was in 1985 with synthesised voices, I would have hoped that a lot could have happened in the 16 years since...

  • For what it's worth, SpeechWorks International licensed an earlier version of the AT&T synthesizer. You can find demos here [speechworks.com]. The version in the NYT seems to have been developed with different constraints. Many TTS engines are designed to achieve real time play back or to use limited amounts of CPU. For instance, synthesized speech during game play should only use 5% or maybe 10% of the processor. Whereas a system for Hollywood may demand considerable CPU power to produce small utterances (say 100 CPU seconds per second of speech). This is completely acceptable for many purposes where perceived quality is the primary criterion.

    There is also an open source TTS engine called Festival, developed at the University of Edinburgh and at Carnegie Mellon University. You can find out more here [cmu.edu]. Or, just download the source [festvox.org].
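
    To put the CPU figures from this comment side by side, here is a small Python sketch using a "real-time factor" (CPU seconds spent per second of speech produced). The function names are mine, and the 5%, 10% and 100x numbers are simply the illustrative ones quoted above, not measurements of any real engine.

```python
def real_time_factor(cpu_seconds, speech_seconds):
    """CPU seconds needed per second of synthesized speech."""
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return cpu_seconds / speech_seconds


def fits_cpu_budget(rtf, cpu_budget):
    """True if speech can be generated live within a share of one CPU.

    cpu_budget is a fraction, e.g. 0.10 for 10% of the processor.
    """
    return rtf <= cpu_budget


def render_time(speech_seconds, rtf):
    """CPU seconds needed to render a clip offline."""
    return speech_seconds * rtf


if __name__ == "__main__":
    game_engine = real_time_factor(0.05, 1)    # 5% of a CPU
    film_engine = real_time_factor(100, 1)     # 100 CPU seconds per second
    print(fits_cpu_budget(game_engine, 0.10))  # within a 10% game budget
    print(fits_cpu_budget(film_engine, 0.10))  # far outside it
    print(render_time(60, film_engine))        # one minute of dialogue
```

    At the quoted 100x factor, a minute of speech costs 6000 CPU seconds, which is fine for offline rendering but rules out anything interactive.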

  • Last week at the O'Reilly Open Source conference, I heard two examples of TTS singing from Carnegie Mellon University and the University of Colorado. Unfortunately, I don't have any links I can refer you to. Let's just say, they were very rough, but quite humorous.

  • I'm honestly not sure what to think here, but do I have a right to my voice?

    Yes, you do. That is until you sign a contract as an aspiring actor/actress with no leverage which requires that you sign over future rights to the studio.

    There should be some interesting legal cases over the next 3-5 years.

  • Ensign Crusher did that aboard the NCC-1701-D years ago. He had a synthesizer that'd reproduce Picard's voice, and he'd send himself all kinds of orders.

    News, earl gray, lukewarm.
  • I had even more fun. I started with a little T.S. Eliot and it sounded pretty good. But I noticed it handled some words better than others, so I decided to try...
    Twas brillig, and the slithy toves did gyre and gimble in the wabe. All mimsy were the borogoves, and the mome raths outgrabe.
    Naturally, it didn't fare much better than any other TTS synthesizer I've heard. That is, a jittery, obviously artificial monotone. Apparently, it can only produce inflection for words that are already in its vocabulary.
  • That's a good point, and in some sense, you're right -- I'm sure, even given textual context, that this program can't always figure out the right intonation for "yeah, right". It takes more -- far more -- than what we have now to figure out how to say "yeah, right" correctly in a given context. Out of context (as in a single sentence), it should be able to do pretty well.

    However, I still claim that it takes far less than full AI to determine that, from purely textual context. That is, it takes far less than full AI to get pretty good at looking at the same set of text given to a human, and determine the correct intonation for the text. If a human can't figure it out (as I couldn't from your isolated textual "yeah, right"), I don't expect a machine to, ever.

    No, I'm not claiming we're there now. I'm only claiming that we won't need a "thinking machine" to get there (to the ability of the average human) -- just one with significantly enhanced ability to analyze language and context. But exactly when we will have achieved "full AI", and how our work on AI will progress in general, is hardly determined, so I guess we'll just have to wait and see. :)

    -Puk
  • by Puk ( 80503 ) on Tuesday July 31, 2001 @09:33AM (#2179517)
    That's patently false. Speech synthesis systems are getting better and better at (or, technically, their creators are getting better at creating systems which) generate speech with very similar intonation to what a human would, based on sentence structure analysis and concatenation of recorded subword units with various intonations (there aren't as many as you might think).

    Of course, it would need a corpus of recorded and (possibly automatically) tagged speech from the person they wish to imitate, but that's not that impossible. Ever notice how the generated speech on some speech recognizing phone system (such as American Airlines) is getting better and better, with more and more human-like pronunciation and intonation? And these are the production systems -- not the research systems. I'm not saying they're perfect (and, of course, they're dealing with multiple intonations of fully recorded words, not subwords), but the problem is a far cry from "true AI", and the work on it is getting better all the time.

    Check out http://www.sls.lcs.mit.edu/sls/publications/1998/mengthesis-jonyi.pdf [mit.edu] for some more detailed info on such research. (Other papers and theses at http://www.sls.lcs.mit.edu/sls/publications/index.html [mit.edu] may be relevant as well.)

    -Puk

    p.s. If this gets modded up, I could cap my karma on this. :P

  • I tried to make it say, "Go and boil your bottoms, sons of a silly person.". Pronounced everything right, even sounded halfway realistic, but it sounded much more like a radio newscaster announcing the current stock quotes or something. :P
  • Not to mention, I seem to remember reading that the Army looked into modifying people's vocal cords to get around voice-based security systems which is why the armed services don't have any kind of Star Trek "authorization Picard alpha zero" voice-authentication for their secure areas. Fingerprint or retinal scan or galvanic skin response or something.

    But anyway, beside the point, commands are no good no matter whose voice they are in because they have to give the appropriate code words or the order is immediately ignored and the channel is closed.

    - JoeShmoe
  • Actually, in the past two years or so, TTS has again become more important. The cost of a voice talent in relation to the cost of a developed system is really cheap. But, it would be really hard and get more expensive to use prerecorded words and phrases to read things back like email, for example.

    I believe that Yahoo! and AOL have phone systems (touch-tone and speech, respectively) that read back email to people.

    For the most part, using a real voice talent is the best bet; there are some fantastic people working out there.

    Todd
  • Disable those voice passwords on your machines, kids. Your pr0n is now exposed.

    TomatoMan
  • Are we watching a little too much T.V.? Do you think:

    A: Voice activation is what gets you into a military installation

    B: If voice activation were useful to get you into an installation that a recording of someone's voice, in the traditional manner, wouldn't be sufficient?

    C: If voice activation were useful to get you into an installation that recordings or impersonations would get past algorithms that search for exactly this thing?

    Remember one thing: Voice is pretty much useless for security. Fingerprints are much more useful. Why? Ever get a bad cold? What happens to your voice? I went to a wedding recently where I drank and smoked too much. I came back and pretty much lost my voice. My friends didn't recognize me over the phone. Do you think a computer can do better than a human being at voice recognition? If so, you're living in the Star Trek universe. Doesn't happen.

  • Pedrito, Pedrito, te pica el culito?

    Claro que si ;-)

  • Boy, that's sad and disturbing. I don't think I'll sleep better at night. While easily intimidated, I never hesitate to talk to the "higher-ups." I'd like to think there are more like me, but I haven't been through boot camp, so I can't say. Maybe that would have changed my behavior.

  • This doesn't really have a lot of bearing on that; you still have court-appointed witnesses to such testimony who can vouch for its authenticity, just like anyone can alter a printed will but having it witnessed and notarized creates an official copy.

    Now video evidence, that is something else entirely.
  • by Tiroth ( 95112 ) on Tuesday July 31, 2001 @10:56AM (#2179535) Homepage
    I think that is a very interesting idea, but there are a lot of subtleties to consider. Languages don't share a common sound set...if you were dubbing English into German, there just isn't a sound for the glottal stop. How would you infer how the "actor model" should sound? I'm guessing this is a very nontrivial problem.

    One solution would be to get demo reels of the actors saying various sounds in the target language. The downside is that they will come across speaking the foreign language with a terrible accent...a Japanese actor might be fairly unintelligible speaking English since they are missing so many sounds (la=da=ra, no th-, etc)

    It's definitely a neat idea though.
  • by jaydho ( 98032 ) on Tuesday July 31, 2001 @05:24PM (#2179537) Homepage

    If you haven't already, listen to the AT&T Customized Voice Product Demo (U.S. English, Male: "Rich") [att.com], truly amazing.

    With online news feeds coming in to the local radio station and the quality of the "Rich" custom voice, I have a feeling a lot of announcers may be going bye bye. In these samples he's way better than our local guy. Plus, since Shoutcast and such already have all the song info, think of the cool DJ announcing you could have.

    My roommate and I used the older online AT&T TTS to do our answering machine message for the dorm... It did pretty well with "This is mack daddy JD and phat daddy John's room"; that's the only message we've ever had that people would call back just to hear. With the old AT&T system you could adjust the pitch and various other settings to get it to sound good; I can't imagine what their new system will do!

    If you don't think too good, don't think too much.

    KingoftheBongo.com [kingofthebongo.com]
  • Wow! It actually does a pretty good rendition of:

    I teleported home one night,
    With Ron and Sid and Meg.
    Ron stole Meggie's heart away,
    And I got Sidney's leg.

    That is exceedingly cool.
  • by 11thangel ( 103409 ) on Tuesday July 31, 2001 @08:40AM (#2179543) Homepage
    So you have a computer program that takes binary (or ascii converted to binary) and makes it into a sound. Get me something that turns a sound into text with more than 90% accuracy and under 5 minutes of training routines, and I'll buy it.
  • This is what we refer to as "nostalgia". I used Macs for years, and there is no way in hell you can tell me that Macspeak sounds as good as this thing. It's not even close.

    --- egomaniac
  • Searched the web for "innocent until prooven guilty". Results 1 - 7 of about 11. Search took 0.26 seconds.
    Searched the web for "guilty until prooven innocent". Results 1 - 7 of about 9. Search took 0.33 seconds.

    Spelling errors are FUN!

  • by Mr. Sketch ( 111112 ) <`mister.sketch' `at' `gmail.com'> on Tuesday July 31, 2001 @08:43AM (#2179554)
    On AT&T Speech Labs' [att.com] website, they have a little demo [att.com] where you can enter your own text and have it play for you using their software (30 word limit). Way Cool!!

    They also have recorded demos you can listen to, but I thought the interactive demo was pretty nifty.


    --BEGIN SIG BLOCK--
    I'd rather be trolling for goatse.cx [slashdot.org].
  • ... but how about Natalie Wood's voice saying "I'll have a few drinks at the party... but I won't go overboard"

    /me ducks as karma goes whizzing away.../
  • Imagine software following actors around through their career, watching their movies and public appearances and learning their style and their history and developing a database to draw on to simulate them.

    The actor hits their third blockbuster at 28 and the computer says "I think I can take it from here."

    -Erik
  • by mr_gerbik ( 122036 ) on Tuesday July 31, 2001 @09:08AM (#2179561)
    "i guess this can only mean more fraudulent accounts of his-story."

    His-story.. I hate that term. Who are you? Michael Jackson?
  • Comment removed based on user account deletion
  • Generals do not call up guards.

    And bosses don't send attachments saying "I love you", but that never stopped people from believing it anyway.

  • It's been a long time coming and it's still not that great. It still has that little bendy creaky quality at the end of syllables.

    The main problem is that sampling became so cheap that some of the incentive for pushing it beyond a 1983 Commodore 64 running the all-software S.A.M. was lost. Now that paying for voice talent is maybe the limiting factor, this will improve.

    Jar Jar, the first all-computer major character in a full-length flick, my ass; his annoying voice was voiced just like the Flintstones.
    --
  • i guess this can only mean more fraudulent accounts of his-story.

    more astronomical accounts of what 'might have' been
    said.

    maybe now the G8 can fake the sounds of that protester shouting 'yeah shoot me i wanna die'

    and maybe they can fake Dmitri Sklyarov shouting 'jail me im bad'
    of course hearing all my fave dead celebs
    selling coca cola will be so good for humanity too

  • Well, every day more and more telemarketing call centers are being installed outside the US, where this law doesn't apply.

  • On AT&T Speech Labs website, they have a little demo ...

    I heard that they keep a log of the stuff people enter into that demo and that it's almost always the worst, most grotesque, violent, sickening verbiage people can think of. I bet it won't be as bad as what they're going to see today from /.ers via your link though.

  • That's something I have no experience with. Can you (or someone else) briefly explain the perjury laws, as they would apply in this case?

    The short non-technical answer:
    Everything offered as evidence, unless both sides otherwise agree (if the court lets them), has to have a live person testifying about it, to vouch for its accuracy and authenticity. Physical evidence found at the crime scene? The cop who bagged it testifies as to where and when he found it, the condition it was in, etc. Surveillance video or audio tape? Someone has to testify as to how and when it was made, how accurate the process is, etc. Those witnesses, of course, are subject to cross-examination, and are subject to the laws against perjury.
    --------------------
    WWW.TETSUJIN.ORG [tetsujin.org]

  • Another interesting point is that with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses...

    Actually, the concept of "replicating a voice" is a bit short-sighted. For example, years ago we were able to replicate the sound of a piano with computers/synthesizers. That doesn't mean that the computer becomes a great piano player; it's just a narrow replication of the sound. The same (to some degree) applies to replicating a voice. Sure, I can make a voice resemble an actor's voice, but no computer can generate a persona as annoying as Chris Tucker :).
  • Yes, we can give you any celebrity as your own personal plaything. All you have to do is send us the script (or enter it on our website) and we'll give you 5 minutes to remember. 5.99/minute. Long distance charges may apply.
  • If some grad student did this every major company with a text->speech or speech recognition product would be jumping all over him for the potential copyright && || patent violations.

    And this pretty much kills the security by voice recognition methods, doesn't it? Maybe they can invent little balls with LCD displays in them to trick retina scanners.

    Ya know, a good quality text to speech program was all we really needed. Something that didn't sound like R2D2 on a cell phone. The potential for abuse is way too great with this.
  • You're on your own, there. That bulgy eyed creature ain't fan material in my book.
  • She doesn't have any of those, though. Just bulgy eyes and funny teeth.
  • Well, it's good that we finally (after decades of research) get realistic-sounding Text to Speech. On the other hand I can't imagine Stephen Hawking speaking in a non-metallic voice. Am I weird?

    That's an interesting idea. If Stephen Hawking has recorded samples of his voice from when he could talk, they could change his synthesizer to use his own voice. Interesting idea, may have actual applications for the disabled.

    Now, Stephen Hawking talking like John Wayne - that would be weird...

  • Is it just me, or does the speech synthesis part sound a lot like a refinement of a venerable old TI speech synthesizer (TMS5220 I believe?) that they used in the old Star Wars arcade machine? They were able to get a fairly reasonable approximation of the Star Wars cast members' voices out of that...

    -- Shamus

    This space for rent, EZ terms!
  • ... when I hear a TTS say "we are the knights who say ... NI ! [nbci.com]" with the proper intonation :-)
  • Does this mean we have to see John Wayne in more crappy beer commercials? Please, don't drag the Duke through the mud much longer...
  • A crude example would be, say, a chat log. If someone were to just hand in an ASCII or HTML transcript of things that were said online, I don't see how that would be admissible evidence, since it only takes a word processor and a little bit of time to forge/alter. Even with IP logging, THEN you have to prove that no one was spoofing the IP address.

    In the American courts, FACTS are determined by the Jury. Whether or not you were speeding. Whether or not OJ really did kill his wife. All determined by the Jury. For the most part, they simply sum it up as "yes he's guilty" or "no he's not", but to reach that the jury gets to listen to all of the evidence that the judge allows in, which is usually just about anything that isn't an outright lie or illegally obtained.

    Perjury is the crime of giving false testimony. To be convicted, you just need to give testimony in a trial, lie, and then have a DA take the time to convince a jury that you did so.

    If you want to know more about the perjury laws, you might want to talk to a law school. If you're a US citizen, you can probably call up the local bar association (or the police) and, if they have time, they can probably point you towards someone who can explain the laws to you.

    If you're not a US citizen, you might want to just dig around on the 'net, since US perjury laws will probably never affect you. Do a search on Google for "US criminal laws" and you'll probably get a few decent hits.
  • Expect video testimony to become useless in court cases... I mean, with a bit of photo work anyone can fake the jerky security camera footage--

    No, wait. We already have laws that cover this. I think they're called perjury...
  • by DreamingReal ( 216288 ) <dreamingreal&yahoo,com> on Tuesday July 31, 2001 @09:12AM (#2179621) Homepage
    Dr. Rabiner said he was excited about the possibility of resurrecting renowned voices, like that of Harry Caray, the Chicago Cubs announcer who delivered rousing play-by-play broadcasts. "There are probably hours of recordings in archives," he said. Wouldn't it be great, he asked, if Harry Caray's voice could again be broadcasting in Wrigley Field?

    Absolutely not. And for the same reason that second-printings, plastic surgery, and fake breasts all suck - they're not the real deal.

    And as a die-hard Cubs fan since the age of 4, might I also add that the World Series drought for the last half century has taken on a sort of religious significance, not unlike the 40 years the Hebrews spent wandering in the desert. And Harry Caray was our Moses - resurrecting his voice without the man behind it is tantamount to sacrilege (not to mention unbelievably morbid!).


    -------

  • by AFCArchvile ( 221494 ) on Tuesday July 31, 2001 @08:49AM (#2179626)
    I quote from U.S. Code, Title 47, Section 227 [cornell.edu], otherwise known as the Telephone Consumer Protection Act:

    "(b) (1) It shall be unlawful for any person within the United States
    (B) to initiate any telephone call to any residential telephone line using an artificial or prerecorded voice to deliver a message without the prior express consent of the called party, unless the call is initiated for emergency purposes or is exempted by rule or order by the Commission under paragraph (2)(B); ..."

    You hear that? There is to be no telemarketing use of this technology!

  • by AFCArchvile ( 221494 ) on Tuesday July 31, 2001 @08:55AM (#2179627)
    Just imagine how much less space some of the more involving computer games like Half-Life and Deus Ex would take up if all the dialog was synthesized with key samples from the voice actor (or, should I say, the "phoneme source"). That saved space could be used toward other things, like textures or ambient sounds. Of course, the biggest challenge would be to allocate some processing power for the synthesis. Still, it's probably in the works.
  • by DaneelGiskard ( 222145 ) on Tuesday July 31, 2001 @08:42AM (#2179631) Homepage
    You can try out the "research version of Next-Generation Text-To-Speech (TTS) from AT&T Labs." here [att.com].

    I'm sure it's not the same thing as the one mentioned in the article, but I'm pretty sure the one in the article is at least based on this one.

    Try it out!
  • by DaneelGiskard ( 222145 ) on Tuesday July 31, 2001 @09:03AM (#2179632) Homepage
    Some links to other online demos, so you can compare:

    http://www.elantts.com/indemo.htm [elantts.com]
    http://www.cstr.ed.ac.uk/projects/festival/userin.html [ed.ac.uk]
    http://www.flexvoice.com/demo.html [flexvoice.com]
    http://www.acuvoice.com/downloads/ttsdemo.html [acuvoice.com]


    I searched for good TTS software to give voice to some of the 3d animations I did in max ... but I did not find anything satisfactory... :(
  • by KarmaBlackballed ( 222917 ) on Tuesday July 31, 2001 @09:09AM (#2179635) Homepage Journal
    expect the same audience as if Tom Hanks were doing the character

    And who says Tom Hanks ever has to fade away? It could be a brave new world where your future kids and mine grow up watching the same stars we have today and some from yesterday. I can imagine my grandchildren raving about that new Humphrey Bogart action film. Not so far fetched really.

    And for those that wonder about the legal aspects ... I think Tom Hanks would not mind getting paid nice royalty fees for the use of his young persona when he is retired in his 80's.


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~ the real world is much simpler ~~
  • by KarmaBlackballed ( 222917 ) on Tuesday July 31, 2001 @09:19AM (#2179636) Homepage Journal
    One neat application would be to dub foreign language films in the target language using the voice of the original actor even though they do not know the target language. They could start doing that today.

    They could start by fixing all those old Chinese and Japanese action/monster flicks dubbed by the same guy talking in false baritone and falsetto.


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~ the real world is much simpler ~~
  • by Auckerman ( 223266 ) on Tuesday July 31, 2001 @08:56AM (#2179639)
    George Bush: All your Scuds are belong to us!

    Saddam Hussein: Somebody set us up the bomb!

    God help us all!

  • by 3-State Bit ( 225583 ) on Tuesday July 31, 2001 @09:16AM (#2179640)
    I could understand it if they said "We can take a sample of speech, for instance, an actor reading a script in a dead celebrity's role, and then digitize it into an inflection and reproduce the same inflection in a different voice."
    But this isn't what they're saying:
    "The software [...] turns printed text into synthesized speech"
    Which begs the question "How does the software know what inflection to associate with the printed text?"
    I know that the same words can sound radically different. Take the phrase "one, two, or three" in each of the following contexts (note that none begins or ends a sentence):
    • "I can't imagine why ANYONE would want four subnets in their own house. I mean one, two, or three I can imagine, but four??"
    • "Please press one, two, or three at the tone."
    • "Okay, so it was in the early morning before 4. But can you be at all more specific? Do you have any idea whether it was at around one, two, or three AM?"
    • "Settings of four or five are considered dangerous, while settings of one, two, or three are considered to be within acceptable parameters."
    I think that if you record yourself saying the above phrases, then crop out just the highlighted phrase, you'll find a different inflection in each one. Without understanding what a sentence says, or, more precisely, what the person means who is saying the sentence, the fact that you can produce any inflection won't help you determine which one is right.

    I found Liz and Ike playing Scrabble while very drunk, and putting on all sorts of nonsensical words. I even saw "Zisis's", using a piece of rice for an apostrophe! (Zisis is a Greek convenience store near us).
    I told Liz and Ike that I thought they were crazy. "Heheh, yeah we're crazy", Ike says, "but each of us only put one word down that broke the rules in a major way."
    "Which words were those?"
    " 'Zisis's' and 'Windology' "
    Since Liz was the crazier of the two, I ventured a guess, "Liz's is Zisis's, isn't it?"
    "Nope. Liz's is 'windology'. 'Zisis's' is mine." Ike replied proudly.

    Anyway, the point of this exercise is to show that a human reader reading this can make the phrase "Liz's is Zisis's, isn't it" sound natural, but I bet any speech-synthesizing software that just follows rules will make it sound incomprehensible. That's because speech is more than reading things by set rules -- it is reading things to reflect your internal parsing of the sentence.

    Not to mention the fact that actors can read the same line in a thousand different ways to show a thousand different "interpretations" (states of the character who speaks it, or parsings of the sentence). How will this software produce them, if it only has the same text to parse?

    Either someone will manually give it an inflection, or it needs (or would need before truly being able to make good on its claim) a human oral reading to "mimic", where it can use the synthesized voice to sound the same inflection in a different voice. Now that would, as the old mis-translated Coke slogan goes, "bring your dead relatives back alive."

    Mere dancing with power brooms? Ha, now celebrities will be telling you about how easy to use AOL is. So easy to use, no wonder it's number 1 -- even among the dead!

    Gee, I can hardly wait.




    (It was intended to sound like "coca cola" when its Chinese characters are pronounced).

    --
  • by corvi42 ( 235814 ) on Tuesday July 31, 2001 @09:58AM (#2179645) Homepage Journal
    Ask not what your country ... can do ... for you but what
    what
    what
    what
    what
    what
    you can do for for your country.
  • by Bonker ( 243350 ) on Tuesday July 31, 2001 @08:56AM (#2179649)
    Another interesting point is that with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses,

    Good for them... Better for us! Who wants dumpy Sandra Bullock, bug-eyed Steve Buscemi, or smarmy Ben Affleck when we can have perfect, artist produced, fan-boy (and fan-girl) material like Aki from FF?
  • by Tin Weasil ( 246885 ) on Tuesday July 31, 2001 @08:37AM (#2179652) Homepage Journal
    While this is a really great leap in TTS technologies... which is sure to make computers for the blind even more accessible than ever... the idea of being able to reproduce any voice is very scary.

    What happens when you get a sample of some General's voice and then use a synthesiser to call up the poor kid on guard duty and get him to let a bunch of terrorists enter the base?
  • This is great news. For too long TTS has been held back by questionable voice quality. Microsoft's engine was a huge step forward, but still wasn't quite there. As the technology advances and requires less CPU power (or more CPU power is fit into a smaller space), I can imagine this will rapidly show up in places where voice prompts would be nice but are not so critical as to justify deploying a bad-sounding technology.
  • by dachshund ( 300733 ) on Tuesday July 31, 2001 @08:54AM (#2179664)
    Actually, this isn't a very exciting thing for the blind. For most practical uses, the visually impaired tend to prefer speed over quality. It doesn't have to sound great as long as it can read several times faster than "normal" speed. The AT&T TTS isn't really designed for this purpose.

    Its main use is for telephony (surprise!) but I suppose it'll be turning up in new and exciting places.

  • by Anixamander ( 448308 ) on Tuesday July 31, 2001 @08:44AM (#2179690) Journal
    What happens when you get a sample of some General's voice and then use a synthesiser to call up the poor kid on guard duty and get him to let a bunch of terrorists enter the base?

    Obviously if this does happen, then all their bases...aww, forget it.
    --
  • by MarkusQ ( 450076 ) on Tuesday July 31, 2001 @08:50AM (#2179692) Journal
    Match the intonation of any human voice, without a sample of that voice saying the phrase in the desired intonation, just from the text?

    "Yeah, right!"

    "Officer, it is clear to me that you are in fact the one who is inebriated."

    "I found it that way. Honest."

    "Now, nothing has really changed since the last contract, we just cleaned up a few details; Please sign and return ASAP."

    "But Billy got one...why can't I? Please?"

    "Would you like to move to the sofa?"

    I don't buy it for a minute. To do what they claim would require real AI(tm).

    -- MarkusQ

  • by Nihilanth ( 470467 ) <chaoswave2&aol,com> on Tuesday July 31, 2001 @08:40AM (#2179725)
    Well kids, say goodbye to phone taps, voice mail, and important business being conducted over the phone. If this technology really accomplishes what the above says, voice recordings wouldn't be able to hold up in court because..well..it would be difficult/impossible to prove that they were really recordings of the person's voice.

    Of course, I don't think this kind of technology should be "outlawed" or "restricted"; that will only make it easier to use maliciously, as with any technological advancement.

    Another interesting point is that with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses, since they can be reproduced on a computer with frightening quality and clarity. Perhaps this applies to voice reproduction as well.

    This sounds like a very beneficial technology, especially for games, where a high-quality voice synth could replace volumes of digitally recorded and compressed audio files..but it opens the door for some really frightening possibilities of fraud, social engineering, and copyright side-stepping.
