Full-Text Audio Search
Captain Chad writes "The latest print edition (12/16/2002) of InfoWorld has an interesting article about an audio search program by Fast-Talk Communications. (The article is not yet available on the InfoWorld web site, but the Fast-Talk site has some good info, including a downloadable trial version.) The product works by breaking the audio stream into phonemes, which are the 'basic units of sound in a language.' The search is then performed for a specific sequence of phonemes. This method is faster and far superior to traditional audio searches, which convert to text and then perform a normal text search. The author of the InfoWorld article, Jon Udell, tried a variety of searches that were surprisingly successful. If this technology is as good as he claims, there is a reasonable chance it will revolutionize the way we store data. Maybe there will even be an 'Audio' tab on Google." Here's the InfoWorld article.
How long... (Score:2, Interesting)
/sarcasm
Just one more step on the road to TIA (Score:5, Insightful)
Yay!
Re:Just one more step on the road to TIA (Score:5, Insightful)
Re:Just one more step on the road to TIA (Score:3, Interesting)
Yes and no; more no than yes:
Until about two years ago, speaker-independent telephone speech dictation for accurate word-spotting wasn't good enough to run on mass volumes of calls.
So, the story goes that it finally got tested in 2000. Even with fairly high accuracy rates, the utility of the extracted data was near or below zero, in that the amount of time a human agent would spend reviewing the tagged calls (international calls can be eavesdropped without a warrant) was less effective by some measure than if the same agent had been following other readily available leads.
The whole problem is that most of the people who talk about {drugs, bombs, nerve gas, etc.} are not the people engaged in the manufacture and smuggling of that contraband; usually such people use code words. A code book with coordinates, maps, and timetables sent by fax between anonymous hotel business centers can completely confound even the most concerted traditional eavesdropping scheme, let alone an automated word-spotting system. So, the agents ended up reviewing hundreds of calls between people talking about cocaine, but not one call between people talking about shipping or producing cocaine.
Speech recognition has a place in law enforcement by mass-eavesdropping, but I don't think that place is found, yet. I predict it will probably end up being used more to ferret gays out of the military than anything else.
Re:Just one more step on the road to TIA (Score:2)
Digitised phone conversations (Score:1)
</wide-eyed innocence>
Ironically, I'm posting this over a dial-up which modulates my digital data to analogue. The signal is then digitally encoded at the telephone exchange, with the whole process being reversed as the signal reaches my ISP.
Re:Digitised phone conversations (Score:1)
Sheesh.
Re:Just one more step on the road to TIA (Score:4, Interesting)
Let's see, given 5000 billion dial equipment minutes in 2001, we'd have around 150 trillion seconds of conversations. Assuming you could code everything at a bitrate of 8kbps, this would mean roughly 150 petabytes of compressed data for 2001 alone. Presumably the storage would be distributed at the switches where you record the conversations. So the problem is now to compress, transcribe, index, search, decompress, and access 150 petabytes of distributed storage.
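For anyone who wants to check the arithmetic, here is a rough sketch in Python. Halving the dial equipment minutes (on the assumption that each call minute is counted once at each end) is only a guess at how the 150 trillion seconds figure was reached; the 8 kbps bitrate is the one from the post. On those assumptions the total comes to about 1.5e17 bytes, which is on the order of 150 petabytes.

```python
# Back-of-envelope check of the storage estimate.
DEM_MINUTES = 5000e9     # dial equipment minutes in 2001 (figure from the post)
ENDS_PER_CALL = 2        # ASSUMPTION: each call minute is counted at both ends
BITRATE_BPS = 8_000      # 8 kbps speech codec, as in the post


def conversation_seconds(dem_minutes=DEM_MINUTES, ends=ENDS_PER_CALL):
    """Seconds of actual conversation implied by the minute count."""
    return dem_minutes / ends * 60


def storage_bytes(seconds, bitrate_bps=BITRATE_BPS):
    """Bytes needed to store that many seconds at the given bitrate."""
    return seconds * bitrate_bps / 8


seconds = conversation_seconds()
total = storage_bytes(seconds)
print(f"{seconds:.2e} seconds of conversation")
print(f"{total / 1e12:,.0f} TB, i.e. {total / 1e15:,.0f} PB")
```

Change ENDS_PER_CALL to 1 if you think the minute count is already per conversation; the result doubles.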
And keep in mind that doing a phoneme transcription rather than full-blown speech-to-text is likely to generate a whole lot of nonsense transcriptions, precisely because you don't have any guiding information from the words in the conversation.
While I enjoy Popular Paranoia as much as the next guy, the whole TIA thing does not really get to me. My reaction is mostly: bring it on, if you really think you can convince yourself that it can be done.
Re:Just one more step on the road to TIA (Score:1)
Re:Just one more step on the road to TIA (Score:1)
Re: (Score:2)
Re:Just looking at kazaa right now (Score:1)
Re:Just one more step on the road to TIA (Score:1)
Maybe the idea that they CAN'T do it right is the exact idea that should worry you. Think about this: They can't execute the war on (some) drugs correctly because it's a bullshit idea, same as TIA. Now we have jack-boot thugs busting in on completely innocent 70-year-old couples and scaring them (literally) to death... And we have agents (of Matrix quality) imprisoning, denying medication, and consequently killing medical MJ patients who are using the drug legitimately (Peter McWilliams [hazlitt.org]).
In short, things the government does that it is not supposed to do or cannot do correctly are the exact things you should worry about.
Echelon (Score:1)
How long... (Score:2, Interesting)
Re:How long... (Score:1, Informative)
Another similar product is already on the market from Scansoft [scansoft.com] (formerly Dragon Systems), but it uses a completely different approach from the Fast-Talk product. We are actually using the Scansoft product where I work to index all of the media (audio and video) files on the corporate LAN.
Re:How long... (Score:1)
With the way Bush is going, how long before he gets a "homeland" tab on his own version of Google?
-Jason
Point? (Score:1, Interesting)
Now, if I could hum a tune into my computer and have it find what song I was humming (for those songs you just can't remember the lyrics to), I would be much happier.
That's called query-by-humming (Score:2, Interesting)
Re:Point? (Score:2)
there would be many uses for this technology...... (Score:1)
2 - data retrieval - a phonetic query language - cool !!!
3 - pr0n video spider - just listen out for lots of Ooohs, Aaahs, and the like (sorry, could not resist)
Re:Point? (Score:2, Insightful)
You hold a meeting where each person's channel is recorded and stored as part of the meeting info. Upon saving the meeting minutes, the software builds a phonetic index of the entire conversation.
Searches later on would be no more taxing on the server than a fulltext search in MySQL is now.
Useful? Definitely. And that's just one possibility.
Re:Point? (Score:3, Insightful)
For a professional rather than personal use, imagine how useful this could be to radio stations if they keep digital archives of their programs--if someone wanted to look up a particular program based on a vague memory of some of the text, a tool like this would be invaluable.
Re:Point? (Score:2)
Conversion? (Score:2, Interesting)
Re:Conversion? (Score:1)
oh, what am I saying, of course it would!
and in answer to your question, you index it once which can be done in pretty much 1:1 time, then you save the index files and just search *those* to find things - the index files tell you which recording and timecode the result is found at, then you playback from there.
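A minimal sketch of that index-once, search-many idea, in Python. Everything in it is illustrative: the recording names, the ARPAbet-style phoneme symbols, the made-up 100 ms spacing and the trigram index are assumptions, not a description of how any particular product works.

```python
from collections import defaultdict

N = 3  # index phoneme trigrams (an arbitrary choice for this sketch)


def build_index(recordings):
    """Map each phoneme trigram to the (recording, position) pairs where it starts.

    recordings: {name: [(timecode_ms, phoneme), ...]}
    """
    index = defaultdict(list)
    for name, stream in recordings.items():
        phones = [p for _, p in stream]
        for i in range(len(phones) - N + 1):
            index[tuple(phones[i:i + N])].append((name, i))
    return index


def search(index, recordings, query):
    """Return (recording, timecode_ms) for each exact occurrence of the query."""
    query = list(query)
    if len(query) < N:
        raise ValueError(f"query needs at least {N} phonemes")
    hits = []
    # Look up the first trigram in the index, then verify the rest of the query.
    for name, i in index.get(tuple(query[:N]), []):
        stream = recordings[name]
        if [p for _, p in stream[i:i + len(query)]] == query:
            hits.append((name, stream[i][0]))
    return hits


def timed(phonemes, step_ms=100):
    """Attach made-up timecodes, one phoneme every step_ms milliseconds."""
    return [(i * step_ms, p) for i, p in enumerate(phonemes)]


recordings = {
    "tape1": timed("DH AH HH EH R R AE N".split()),  # "the hare ran"
    "tape2": timed("HH EH R K AH T".split()),        # "hair cut"
}
index = build_index(recordings)  # built once; in practice you would save this
print(search(index, recordings, ["HH", "EH", "R"]))
```

The search only ever touches the index and the phoneme streams, never the audio; the timecode in each hit is where you would start playback.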
Converting to Text (Score:1, Insightful)
Somebody who knows about the subject, please post and explain the process.
Re:Converting to Text (Score:1)
If we convert the text to phonemes instead, hare and hair resolve to the same result. So a search for either of those words will produce a match.
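To make the hare/hair point concrete, here is a toy Python sketch. The pronunciation table and its ARPAbet-style symbols are assumptions made up for the example; a real system would derive phonemes from a full pronunciation model rather than a five-word table.

```python
# Toy pronunciation table (hypothetical, for illustration only).
LEXICON = {
    "hare": ("HH", "EH", "R"),
    "hair": ("HH", "EH", "R"),
    "pair": ("P", "EH", "R"),
    "pear": ("P", "EH", "R"),
    "ran": ("R", "AE", "N"),
}


def to_phonemes(word):
    """Look up the phoneme sequence for a query word."""
    return LEXICON[word.lower()]


def find(word, stream):
    """Positions in a phoneme stream where the word's phonemes occur."""
    target = list(to_phonemes(word))
    n = len(target)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == target]


# Phonemes for someone saying "the hare ran".
stream = "DH AH HH EH R R AE N".split()
print(find("hare", stream), find("hair", stream))
```

Both queries land on the same position, because the spelling never enters into the comparison.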
In hindsight, this is an obvious idea. Like many obvious ideas, the person who spotted it was a genius.
Re:Converting to Text (Score:1)
Re:Converting to Text (Score:1)
Google's Voice Search (Score:1, Informative)
Re:Google's Voice Search (Score:5, Informative)
Re:Google's Voice Search (Score:1)
Woohoo. (Score:2, Funny)
Flux that snip... (Score:1)
linx ear [hazardfactory.org]
Yes, but... (Score:5, Funny)
-----BEGIN PGP MESSAGE-----
and I wouldn't be able to tell the difference.
Worse yet (Score:1)
Re:Worse yet (Score:2)
The trench oily guy feels great (Score:3, Interesting)
the fee Cult to longer stained syrups and Hussein marmot pervert sucks eggs rat. Intact, eye amusing into dick tape his pest of flash snot.
- - -
It has no problems at all figuring out those difficult to understand lyrics and has an almost perfect success rate. In fact, I am using it to dictate this post to Slashdot.
Patches... (Score:4, Funny)
Re:Patches... (Score:3, Funny)
Re:Patches... (Score:1)
And party like it's 1986! Don't go there! (nightmare visions of a zx spectrum enter my mind...."look, it's the yellow and blue pattern and the random static noise! that means it's loading the program!") ;-)
Re:Patches... (Score:1)
w00t NPR, and NLP feasibility? (Score:2)
On a serious note. I really didn't think NLP software was to the point to make this plausible. I've never actually used NLP tools, but what I've heard in the mainstream is that while they work they aren't perfect. This is fine for someone staring at a screen while talking or someone who is going to review the transcription, but it seems like it would break any automated system when there is no system of checks in place, since this involves a human.
Re:w00t NPR, and NLP feasibility? (Score:2)
Dictation (the only application where you need/want 100% accuracy) is only one small application for speech recognition.
Re:w00t NPR, and NLP feasibility? (Score:2)
the one in wetware, between your ears) isn't 100% accurate, and that doesn't stop you from understanding people, right?
The wetware has knowledge of context which helps reduce the choices of a potential vocal sound. This is one of the hardest things to build into an NLP system and afaik it currently isn't anywhere near perfect, otherwise we'd have voice interfaces to a lot more things in the common world.
Re:w00t NPR, and NLP feasibility? (Score:2)
It really depends on the problem. If you want to guarantee that if a person says "frobnitz" that you're going to find it, yes, you need 100% accuracy. But even human listeners aren't going to give that to you. If you want to find people mentioning words associated with terrorism, your accuracy rate can drop somewhat. If you want to find people talking about the *topic* of terrorism, your accuracy can drop even more.
Your comment about context is very perceptive, however -- I would say that NO current ASR system has essentially ANY real-world context, and as you said this is a tremendous boost in how humans interpret speech. Once that breakthrough is made, NLP in general will take a quantum step forward.
Re:w00t NPR, and NLP feasibility? (Score:2)
I see how your description would work in a key-word environment, but what about a phrasal aspect? Or is our best bet ever going to be a fuzzy system?
Re:w00t NPR, and NLP feasibility? (Score:2)
In the case of the NPR show, it's likely you'd know the general topic, or a speaker, or some proper names that were mentioned. All of these can be used to augment your search, and all of them can contribute to the accuracy of your results.
Re:w00t NPR, and NLP feasibility? (Score:2)
However, I've seen other threads where they have, I guess, assumed that this technology would be used for music. I think that remembering a single line would be a more apt example in such a case.
Re:w00t NPR, and NLP feasibility? (Score:1)
Search the Geeks in Space archive! (Score:1)
Dialects and Foreign Language application? (Score:3, Interesting)
Of course, noble thoughts aside, I keep thinking how useful it would have been to have such technology in college when I had to transcribe long lectures from my chicken scratch.
Hmmm
Re:Dialects and Foreign Language application? (Score:2)
It's been done already. [ectaco.com]
Re:Dialects and Foreign Language application? (Score:2)
Phonemes are pretty much language independent. One particular sound sounds the same in different languages. It might be spelled differently, and it certainly falls into different places in different words, but it's made the same way in the mouth, and it has the same acoustic pattern (there are some variations, since some languages make distinctions between different sounds, and others don't, like [l] and [r] are the same phoneme in Japanese, but not in English, and [p] and [p^h] (aspirated [p]) are the same in English, but not in Hindi, but this mostly doesn't matter, since in a given language each form tends to be used in the same words regardless of the speaker). Converting an audio stream to a sequence of phonemes is basically a solved problem (given lack of inflection/emphasis, no background noise, etc.), this is just a new, useful application of an established technique. The problem of translation lies in finding the meaning behind the phonemic sequences.
Re:Dialects and Foreign Language application? (Score:1)
Cognition and translation are VERY complex.
Phonemes are a unifying constant of speech, not cognition of language. By definition, words are converted to audible phonemes, not spelling with etymological, grammatical and syntactical meanings.
Any words used to search are converted to phonemes and then searched against a phoneme transcript of human speech, simplifying and broadening the chance of a match.
This is MUCH easier than the reverse, of converting phonemes into words, particularly homonyms, e.g. "pair" and "pear", "two", "to", and "too". The complexity of extracting correct words, grammar and syntax makes understanding the original spoken message VERY hard. We have yet to reliably solve computer cognition of (machine readable) written or spoken language.
We often use proper nouns such as people and place names. Is "Victor" some guy or the battle winner?
What would a universal translator think of a spoken question "Do you listen to Phish?"
How about the amazing French subtitles for "Pulp Fiction"? To paraphrase Travolta's pun joke... A family of tomatoes is walking down the street, and the baby tomato keeps dawdling and getting distracted, and after two warnings to keep up with the family from the Daddy tomato, the Daddy finally just loses his composure on the third time and pounds the pulp out of the baby tomato and yells 'Ketchup!'. A pun for catch-up and tomato ketchup.
The French subtitles change the joke from tomatoes to lemons, err... rather "Citron", and on doling out punishment the Daddy yells the pun "Citron pressé!" meaning "Lemon hurry up!" and "lemonade!"
Universal translators are years off. High level translations will require humans for the foreseeable future.
Mac Refugee, Paper MCSE, Linux wanna be
Imagine... (Score:4, Insightful)
Now if you record street conversations or all types of public conversations... Do a search on 'bomb'... How appealing is that to Big Brother?
All right... I'm learning sign language. Now.
Re:Imagine... (Score:2)
Re:Imagine... (Score:1)
Too bad
Re:Imagine... (Score:2)
Now, try that same tactic on every conversation in America. The utility would be some order of magnitude LESS than the crap you got back from google! (if you can have utility less than zero, that is!)
Re:Imagine... (Score:1)
Re:Imagine... (Score:3, Insightful)
I just did the same. Got 5,580,000 results, only three hours later.
At that rate of growth, (50,000 bombs per hour, or about 14 bombs per second) there's going to be an awful lot of poor bastards at the FBI/CIA/NSA chasing noise...
Exciting Implications (Score:3, Insightful)
I just hope one of those nuisance lawsuits from Tzsvestaeya Zolskovova, the eccentric widow of Sergei Zolskovova (Russian linguist who coined the word phoneme), over the use of the term "phoneme" doesn't hobble progress in this fascinating area.
Phonetic search engines (Score:3, Interesting)
I guess this is a similar idea. Pretty cool tech.
Re:Phonetic search engines (Score:1)
Re:Phonetic search engines (Score:2)
Perl also has the nifty String::Approx module which is great for catching typos instead of homonyms.
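For the typo-catching side, here is a rough Python sketch of approximate matching by edit distance. It is plain Levenshtein distance written from scratch to show the idea; it is not String::Approx's API or its algorithm, and the word list is made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def approx_matches(word, candidates, max_dist=1):
    """Candidates within max_dist edits of word, in their original order."""
    return [c for c in candidates if levenshtein(word, c) <= max_dist]


print(approx_matches("serch", ["search", "perch", "lurch", "audio"]))
```

Note the difference from a phonetic match: "perch" is one edit from "serch" and so it matches here, even though it sounds nothing like "search".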
Re:No you didn't! (Score:2)
Wanna bet? [php.net]
I've often wondered about using a tech like this.. (Score:3, Interesting)
If nothing else, putting the computer to work on the 'condense 100 hours of footage into pieces of paper' stage would be a nice step.
If it prevented just one assistant editor from going insane, it'd be worth it. Do it for the children.
-Brett
What we need now is (Score:3, Funny)
Re:What we need now is (Score:1)
Google for it and I'm sure you'll find some interesting articles.
Re:What we need now is (Score:1)
Re:What we need now is (Score:2)
Wow, cool idea (Score:3, Insightful)
Someone mentioned it can be used by the government for TIA stuff - agreed, but same with any technology. It has its positive and negative uses. I don't think we are all going to revert to cavemen to get away from it.
Re:Wow, cool idea (Score:2)
This is of somewhat limited utility. Perhaps if you had 2000 hours of John Gotti talking about lasagne, and one minute where he was talking about rubbing out Sammy the Bull, you could speed forward to find that bit. But if he talks about Sammy a whole lot, it's going to be easier to *read* and skim than listen.
Re:Wow, cool idea (Score:2)
I consider audio/video data "fuzzy" -- there is no clear-cut method of interpreting such data. Here's a real world example: tell a computer to determine if a movie is pornographic or better still find out if a picture is showing a vagina, penis, etc.
The Simpsons (Score:1)
Audio search (Score:2)
I've seen a couple of web sites which offer tune searches, but they all work on the index system used in fake-books: start from the first note, and then from there, say whether the next note is higher, lower, or the same. But this system has problems: a reasonably short search will match a whole lot of songs; it's often hard to tell whether certain extra notes are considered part of the tune; and some songs have an obscure beginning and an easily recognizable theme farther in, and you don't know which one is indexed.
These sites have also tended to only index very well-known tunes - usually, folk songs, show tunes, and a few jazz standards.
One site allows you to send them a recording of you whistling the tune, which seems like an improvement, but it actually just translates it into the up-down-repeat notation.
My ideal music search would be something that would take large quantities of music (let's ignore for the moment where it gets the large quantities of music without pissing off the RIAA) and scan each song for prominent tunes. You could then search these with perhaps the up-down-repeat notation, but also by inputting music notation, for people who know it. The search would have to be key-insensitive, and allow fairly fuzzy matches.
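The up-down-repeat notation (often called Parsons code) is easy to play with. Below is a small Python sketch of a key-insensitive, slightly fuzzy search that matches anywhere inside a tune rather than only at the first note. The two-tune catalogue, the MIDI note numbers and the mismatch threshold are all assumptions for illustration; it says nothing about how the sites mentioned above work.

```python
def contour(pitches):
    """Up/Down/Repeat string for a sequence of MIDI note numbers."""
    return "".join(
        "U" if b > a else "D" if b < a else "R"
        for a, b in zip(pitches, pitches[1:])
    )


# Tiny made-up catalogue: opening notes only.
TUNES = {
    "Twinkle Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
}


def best_mismatch(query, target):
    """Fewest differing symbols over all alignments of query inside target."""
    if len(query) > len(target):
        return None
    return min(
        sum(q != t for q, t in zip(query, target[i:i + len(query)]))
        for i in range(len(target) - len(query) + 1)
    )


def search(hummed_pitches, max_mismatches=0):
    """Names of tunes whose contour contains the hummed contour, allowing
    up to max_mismatches wrong symbols."""
    query = contour(hummed_pitches)
    results = []
    for name, pitches in TUNES.items():
        score = best_mismatch(query, contour(pitches))
        if score is not None and score <= max_mismatches:
            results.append(name)
    return results


# A fragment from the middle of Ode to Joy, hummed two semitones too high.
print(search([67, 69, 69, 67, 66]))
```

Because only the direction of each step is kept, transposing the hum changes nothing. The price is the problem already noted above: short contours match a whole lot of songs, which is why a real system would probably want intervals or rhythm as well.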
If it could give me the name of that pop song/jazz tune/classical piece I just heard on the radio, it'd be pretty good.
But if it works really well, it'd be a blessing for music composers - they could just search for that tune that just popped into their head, instead of worrying over whether they're subconsciously ripping off another song.
Re:Audio search (Score:1)
It would be better if the radio station just transmitted the name and other relevant information along with the tune itself. Digital radio must be able to do this, surely? And if not, why not?
Re:Audio search (Score:2)
I can just see composers checking the database every time they 'hear' something familiar, and never ignoring the fact that 'I just ripped off an Elvis Costello chorus. Oh, well. It worked for him.' Just write the damn song; if it is good it will stand on its own merits. Everything old is new again! The public has an ever-shorter memory, and when you've got just 12 notes, you like it that way.
Besides, every musician has a friend who will pipe up and say, "That sounds just like that Archies song!" Don't be afraid to rip off a couple of notes, otherwise Dylan and the Beatles own the world.
Bad artists borrow, good artists steal. (Thanks to whoever this sig belongs to..;)
Where have you been? (Score:2)
Re:Where have you been? (Score:1)
Refined twist on an old idea (Score:3, Informative)
Why are more and more
So... (Score:5, Funny)
old technique (Score:2)
Streamsage has this technology (Score:1)
InfoWorld article (Score:3, Informative)
Actually it is. InfoWorld: The Power of Voice [infoworld.com].
I looked into this recently (Score:4, Informative)
Because phoneme recognition is not particularly accurate (for example, it's hard to tell the difference between "hard d" as in "Dan" and "hard b" as in "Ban" over a noisy phone line), traditional speech to text systems use several approaches to improve accuracy. One is to improve the accuracy of the basic phoneme recognition by "training" it for a specific voice. Another is to use all sorts of hairy language-specific grammar / syntax algorithms.
Computationally, it's the matching of the phonemes against the dictionary that's the most difficult, and the larger the dictionary, the less accurate and more CPU-chomping it becomes. In addition, searching the resulting text for specific matches grows less accurate as the search string increases in length, due to the likelihood of transcription errors.
The cool thing that Fast Talk has done is to store and index the phoneme meta-data, rather than complete the recognition to text. When you enter search words, they break the search string into phonemes and look for matches that way. This has several positive benefits:
1. Computational resources are dramatically lessened, since the "phoneme recognition algorithms" are fast and there's no dictionary matching.
2. The matching doesn't depend on having the right words in the dictionary at input time. It works just as well for unusual proper names and technical jargon as it does for common words, since they're all formed from the same basic phonemes.
3. The longer the search string, the greater probability of an accurate match.
4. No need for accurate search string spelling. It doesn't matter if you know how to spell a word, as long as you can write it down phonetically.
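A toy calculation in Python shows why query length pulls in opposite directions for the two approaches. The numbers are assumptions picked for illustration (90% per-word transcription accuracy, a 40-phoneme inventory, a billion-phoneme archive, independent and uniformly distributed errors and phonemes); none of them come from Fast Talk, and real phoneme recognition is noisy enough that a real system would score approximate matches rather than exact ones.

```python
def p_text_query_intact(word_accuracy, n_words):
    """Chance that every word of an n-word phrase was transcribed correctly,
    assuming independent per-word errors."""
    return word_accuracy ** n_words


def expected_chance_hits(archive_phonemes, query_len, inventory=40):
    """Expected number of purely accidental matches for a query of query_len
    phonemes, assuming uniformly random, independent phonemes."""
    return archive_phonemes * (1 / inventory) ** query_len


# Text search: a longer phrase is more likely to contain a transcription error.
for n in (1, 3, 5):
    print(n, "words intact with probability", round(p_text_query_intact(0.9, n), 3))

# Phoneme search: a longer query is less likely to match by accident.
for k in (3, 5, 8):
    print(k, "phonemes, expected chance hits:", expected_chance_hits(1e9, k))
```

Under these assumptions a five-word phrase survives transcription intact only about 59% of the time, while each extra phoneme in the query cuts the accidental matches by a factor of 40.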
In theory, the system should work for any language, but the reality is that different languages do have different sets of phonemes, and I think Fast Talk has only really worked on English. So languages like Spanish that are fairly similar phonetically to English would probably work pretty well, but tonal languages like Mandarin Chinese or those with non-vocal sounds like the clicks and pops of the African Bushmen would require a rework of the phoneme recognition code.
The main downside of their system is that it doesn't actually produce text... which means that you'd need another speech-to-text system if you wanted transcripts, or want the data to be searchable with whatever standard text-based search engine you are using on your intranet. But they appear to be aiming at applications where that's not necessary. One of my favorite ideas is integrating it with a video editing suite and being able to jump to different cues in your video clip library simply by stating the dialogue that's found there.
Of course, one of the most obvious applications is for intelligence and security. So far it doesn't appear that the company is pushing too hard in that direction -- it was founded by an academic group that originally developed the technology for a library project at Georgia Tech. However, I'm betting that's where the real money is, and it's only a matter of time before their ideas are found in your favorite national department of big-brotherhood.
-R
Actually, (Score:2, Informative)
And this doesn't even begin to deal with "Engrish" speakers =]
Sound search? (Score:2)
The benefits of having actual sound? If it's just going to use a soundex-type formula in the core functioning, the sound would just be a gimmick, and a storage-taking one at that. Sure, compression has gotten amazing, but will the sound of Smith really take anything near the same 4 bytes as "S530"?
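For reference, here is a short Python sketch of the standard American Soundex rules, written from memory of the usual description rather than copied from any library. Under these rules "Smith" and "Smyth" both come out as S530, which is the kind of 4-byte key being compared against the audio.

```python
# Letter groups used by American Soundex; vowels and H, W, Y have no code.
CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Four-character Soundex key: first letter plus up to three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return ("".join(out) + "000")[:4]


for name in ("Smith", "Smyth", "Robert", "Rupert"):
    print(name, soundex(name))
```

Soundex works on spelling, not sound, so it only approximates what a phoneme index gets directly from the audio.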
CallMiner (Score:1)
potential for medical applications is exciting (Score:3, Interesting)
Imagine a system that listens to a consultation in real time, making helpful suggestions for diagnosis based on analysis of the patient and the doctor's phoneme streams! And no tedious data entry, just an unobtrusive microphone.
I've been waiting for this.
wordspotting (Score:2, Insightful)
Note that I'm not saying the GATech technology used by this company is derivative - I haven't looked at the specifics of this approach.
Shhhhhh, don't bother me. I'm "grepping". . . . (Score:1)
KFG
reading, writing... audio what? (Score:2)
I have yet to meet anyone in good health who prefers getting ten voice mails over ten emails.
What the world needs is fewer karma whores and more good friends.
Go ahead, friend.
Old News (Score:2, Interesting)
Melody search (Score:2)
I think there should be three tabs instead of one 'Audio' one:
Soundex (Score:1)
Paul
Last Post! (Score:1)
programmer is told about the Tao and searches for it. The foolish programmer
is told about the Tao and laughs at it. If it were not for laughter, there
would be no Tao.
The highest sounds are the hardest to hear. Going forward is a way to
retreat. Greater talent shows itself late in life. Even a perfect program
still has bugs.
-- Geoffrey James, "The Tao of Programming"
- this post brought to you by the Automated Last Post Generator...
Re:Most Likely New Application (Score:1)
Terrorist (Score:2, Funny)