
Open Source Codec Encodes Voice Into Only 700 Bits Per Second (rowetel.com) 128

Longtime Slashdot reader Bruce Perens writes: David Rowe VK5DGR has been working on ultra-low-bandwidth digital voice codecs for years, and his latest quest has been to come up with a digital codec that would compete well with single-sideband modulation used by ham contesters to score the longest-distance communications using HF radio. A new codec records clear, but not hi-fi, voice in 700 bits per second -- that's 88 bytes per second. Connected to an already-existing Open Source digital modem, it might beat SSB. Obviously there are other uses for recording voice at ultra-low-bandwidth. Many smartphones could record your voice for your entire life using their existing storage. A single IP packet could carry 15 seconds of speech. Ultra-low-bandwidth codecs don't help conventional VoIP, though. The payload size for low-latency voice is only a few bytes, and the packet overhead will be at least 10 times that size.
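The packet figures in the summary can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the 1500-byte MTU, the 40-byte IPv4 + UDP + RTP header stack, and the 40 ms frame duration are assumptions chosen for illustration, not values taken from the summary.

```python
# Back-of-the-envelope check of the summary's packet claims.
BITRATE_BPS = 700            # codec bit rate from the summary
MTU_BYTES = 1500             # assumption: typical Ethernet MTU
HEADER_BYTES = 20 + 8 + 12   # assumption: IPv4 + UDP + RTP headers, no options
FRAME_SECONDS = 0.040        # assumption: one low-latency voice frame per packet


def payload_bytes(seconds, bitrate_bps=BITRATE_BPS):
    """Bytes of encoded speech produced in `seconds` at `bitrate_bps`."""
    return bitrate_bps * seconds / 8


# Bulk case: 15 seconds of speech in a single packet.
bulk = payload_bytes(15)
fits_in_one_packet = bulk + HEADER_BYTES <= MTU_BYTES

# Low-latency case: one short frame per packet.
frame = payload_bytes(FRAME_SECONDS)
overhead_ratio = HEADER_BYTES / frame

print(f"15 s of speech: {bulk:.1f} bytes, fits in one packet: {fits_in_one_packet}")
print(f"One frame: {frame:.1f} bytes, header overhead: {overhead_ratio:.1f}x payload")
```

Under these assumptions, 15 seconds of speech stays below the MTU, while a single low-latency frame is a few bytes against 40 bytes of headers, which is consistent with the "at least 10 times" overhead remark.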

  • Can this be used for two-way comms? Counting conversion from analog to the bitstream, transit across the net, and conversion back to voice, what's the delay?
  • Specific to English? (Score:5, Interesting)

    by MichaelSmith ( 789609 ) on Friday January 13, 2017 @05:51PM (#53663917) Homepage Journal

    I wonder how it performs on tonal languages like Cantonese.

    • by Stele ( 9443 )

      Or more importantly, atonal languages like Klingon!

    • It includes poorly translated Engrish subtitles.
    • You can try it pretty easily, if you speak such a language. There are test programs that work on sound files.
    • I wonder how it performs on tonal languages like Cantonese.

      I don't see any reason it shouldn't work. It encodes pitch (you really can't avoid that if you're encoding speech, which will include "voiced" sounds that have a fundamental frequency), and some casual reading about how it encodes suggest that it captures more specific information in the lower frequencies than in the higher ones, which also matches how our (logarithmic) perception of frequency works. That being said, the English sample I heard doesn't sound fantastic: think of a phone conversation in which

  • by JoeyRox ( 2711699 ) on Friday January 13, 2017 @05:51PM (#53663923)
    I've been way thing for a new cold deck for joyce recordings.
  • 70 years * 365 days (roughly) * 24 * 60 * 60 * 88 bytes/sec / 1024 / 1024 / 1024 = 181GB

    Is my math off or are they assuming such people will only have a 15 year life span?

    • by darkain ( 749283 )

      There are 256GiB MicroSD cards on the market right now. So yes, this is entirely possible.

      • by Cramer ( 69040 )

        Only if that SD card were used EXCLUSIVELY for recording your voice, and it's ACTUALLY 256GB of usable space (capacity is always a lie, filesystems take up space too, etc.), and it doesn't fail over the decades, AND you don't live more than ~98 years, sure.

    • Nobody is assuming a 15 year life span.

      The question is, why do you assume that people talk nonstop 24 hours per day?

    • I got the same as you. 2.59GB/year
      Still damn impressive as 250GB m2 SSDs would hold ~ a century of voice.

      Now, assuming that you are not talking continuously (say you talk 1/3 of the day; 8 hours of continuous talking; that's a lot) then you're at 60 GB/70Yr and that *is* valid for a high(ish) end smartphone.

      • MicroSD capacity should increase faster than the rate data is added to the device.

      • I've been programming all day, and haven't said many words at all. There are people who talk for their entire work day, but they generally spend half their time listening and more processing something, so they may actually do 4 hours of speech or less in the work day. Most people don't really speak for more than a few hours per day.
        • > [I] haven't said many words at all.

          And this is from a guy who is famous largely for saying stuff!* Well known for talking about Morse code, talking about free software and open source, talking about Debian's principles, talking at conferences, probably talking to Congress ... and even you don't talk more than a few hours per week.

          * and also of course for DOING a lot of things, including doing things like founding organizations - which requires a lot of talking.

          Actually, that got me curious, what do y

          • A couple of typos made that hard to read. Let me try again:

            What do you think first / most really got your name out there?
            Why did you start getting so much press attention, etc, compared to other people who also did important work?

            Not that you aren't worth listening to. I'm not saying you don't "deserve" the attention or whatever. I'd just like to know your thoughts on how and why someone like yourself becomes a bit of a celebrity in the field.

            • Being at Pixar, being Debian project leader, my technical work on Debian, and announcing Open Source. Those things interested a lot of people. And founding No-Code International stirred up a lot of controversy in the radio amateur world.
              • Thanks for that. Sounds like I have a lot of work to do to become nerd famous. ;)

                  I just checked out your blog and found the bit about switching power supplies interesting. I knew about switching *regulators*, but didn't realize common power supplies could actually run on DC. I'll have to check your blog more often.

    • You don't record the pauses. You do sleep, you know :-)
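The storage arithmetic in this thread is easy to reproduce. The sketch below uses the thread's own figures (88 bytes per second, 365-day years, binary gigabytes); the function name and the `talk_fraction` parameter are made up for illustration.

```python
# Lifetime voice storage at the codec's rate, using the thread's figures.
BYTES_PER_SECOND = 88                  # 700 bit/s rounded up, as in the summary
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years, as the thread does
GIB = 1024 ** 3


def storage_gib(years, talk_fraction=1.0):
    """GiB needed to store `years` of speech, talking `talk_fraction` of the time."""
    seconds = years * SECONDS_PER_YEAR * talk_fraction
    return seconds * BYTES_PER_SECOND / GIB


per_year = storage_gib(1)            # continuous recording for one year
lifetime = storage_gib(70)           # 70 years, around the clock
one_third = storage_gib(70, 1 / 3)   # 70 years, talking 8 hours a day
years_on_256 = 256 / per_year        # how long a full 256 GiB card would last

print(f"{per_year:.2f} GiB/year, {lifetime:.0f} GiB for 70 years")
print(f"{one_third:.0f} GiB at one third duty, {years_on_256:.0f} years on 256 GiB")
```

This agrees with the numbers quoted above: about 2.59 GB per year, about 181 GB for 70 years of continuous recording, and about 60 GB if speech fills a third of the day.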
  • 15s/IP packet - this should lower operational cost for our government.
  • How does it sound? (Score:4, Interesting)

    by jandrese ( 485 ) <kensama@vt.edu> on Friday January 13, 2017 @05:56PM (#53663963) Homepage Journal
    That's starting to approach feeding the sentence into a speech-to-text system at one end, then sending the text over the air to be fed into a text-to-speech converter at the other.
    • by Anonymous Coward

      It's right there in TFA (samples, that is). The answer appears to vary from muffled but understandable if you listen closely, to a bad-phone-connection, breaking-up level of unintelligibility. It's impressive but not really something you'd want to listen to if there was an alternative.

    • good point. I suppose the low limit would be doing that while compressing the text stream via a pre-shared library and assuming optimum (no ECC required) communication channel?

    • by ezdiy ( 2717051 )
      Look at the codec diagram - if you ignore the entropy coder, it largely resembles the input filters of voice recognition systems - before feeding the NN input terminals, the signal is decimated to extremely low bandwidth vectors with only the psychoacoustic essentials of human voice - quantized to very few dominating tones and their attack/release values. The NN model does the final step of "compressing" the result only by a factor of around 100 into text. It is popularly conjectured that compression is, in fact, a ML prob [hutter1.net]
  • by Anonymous Coward

    Good old POTS had 3k of audio bandwidth. What is the bandwidth of this CODEC? It's hard to be impressed without knowing the details.

    • Re:Bandwidth? (Score:4, Interesting)

      by dlleigh ( 313922 ) on Friday January 13, 2017 @06:22PM (#53664107)

      To compute the channel capacity, you need to know the channel's signal-to-noise ratio as well as its bandwidth.

      The Shannon channel capacity [wikipedia.org] formula is: C = B * log_2(1 + SNR) where C is the channel's capacity in bits/second, B is its bandwidth in hertz, log_2 is the base-2 logarithm and SNR is the channel's signal-to-noise ratio.

      If we assume an SNR of 48 dB for a reasonable POTS line, its capacity would be C = 3 kHz * log_2(1 + 48 dB) ~= 3000 * log_2(63097) which is almost 48,000 bits per second.

      This is a theoretical limit that realizable systems can only approach, but never equal or exceed. A practical system would also use extra bits for forward error correction purposes; I doubt that this codec deals gracefully with bit errors.

      For back-of-the-envelope purposes, assume you could use this codec to send a single voice signal in 700 Hz of bandwidth on a channel with low SNR, or you could send 60 voice signals over a regular POTS line.

      • Actually, the modem does deal gracefully with bit errors. It protects the most important bits and lets the less important ones get clobbered. In a high bit error situation you get speech that sounds wrong but can still be understood. FEC actually falls down sooner than this scheme.
      • just couple the codec with gold plated monster cables that will eliminate those bit errors.
      • POTS is traditionally converted to a 64 kbit/s digital signal, e.g. in ISDN, but also in the digital back-end used for the POTS network these days.
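The Shannon capacity estimate in this thread can be reproduced directly. This is a minimal sketch using the figures assumed in the comment (3 kHz of bandwidth, 48 dB SNR); the function name is hypothetical, and the only step the comment leaves implicit is the conversion from decibels to a linear power ratio.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Shannon limit C = B * log2(1 + SNR), with SNR supplied in decibels."""
    snr_linear = 10 ** (snr_db / 10)  # dB to linear power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


# Figures assumed in the comment above for a reasonable POTS line.
pots_capacity = shannon_capacity_bps(3000, 48)

# Upper bound on 700 bit/s voice streams, ignoring error correction and framing.
streams = pots_capacity / 700

print(f"POTS capacity: {pots_capacity:.0f} bit/s")
print(f"700 bit/s streams at the Shannon limit: {streams:.1f}")
```

The result lands just under 48,000 bit/s, matching the comment. Dividing by 700 gives roughly 68 streams at the theoretical limit, so the "60 voice signals" figure leaves some margin for a real system.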

  • Close (Score:4, Funny)

    by fahrbot-bot ( 874524 ) on Friday January 13, 2017 @06:11PM (#53664045)

    A new codec records clear, but not hi-fi, voice in 700 bits per second -- that's 88 bytes per second.

    It's 87.5 bytes/s and it's that odd 1/2 byte that keeps it from being too fuzzy sounding for hi-fi.

    • How low could they make the bit rate if they made their system parse phonemes, transmit only them, and reproduce them on the other side?
      • Re:Close (Score:5, Informative)

        by Bruce Perens ( 3872 ) <bruce@perens.com> on Saturday January 14, 2017 @02:42AM (#53666195) Homepage Journal

        Lots of people ask about this. If we did pure speech-to-text and text-to-speech, it would take about half the bandwidth but everybody would have the same synthesized voice. Once you start trying to add parameters to the synthesized voice such as pitch, speed, and tonality, those take as much bandwidth as we are using for the entire codec, because they are essentially the same parameters.

        • Lots of people ask about this. If we did pure speech-to-text and text-to-speech, it would take about half the bandwidth but everybody would have the same synthesized voice. Once you start trying to add parameters to the synthesized voice such as pitch, speed, and tonality, those take as much bandwidth as we are using for the entire codec, because they are essentially the same parameters.

          Doesn't Motorola have a low bandwidth FM mode using phonemes? I've listened to a few radios using something like that, and they are pretty unpleasant to use.

        • When the input is voice, we assume it's a human-generated voice. A specific human's sound-generating apparatus (vocal cords, etc.) has a specific signature (common parlance calls it an accent). If the software can capture this and send it along, you can reasonably reconstruct, in the text-to-speech part, something resembling or unique to the original voice. And this info is independent of the size of the sample - whether he/she speaks 10 words or a thousand, the accent info stays the same.
        • Or you could leave out the text-to-speech part, and just let the other person read it. Much faster, and you can grep it.
    • It sounds strange in our digital world based on whole bytes, but those odd half-bytes encode naturally onto vinyl and add warmth and feeling to the intonation.

  • by Bryan Ischo ( 893 ) * on Friday January 13, 2017 @06:15PM (#53664075) Homepage

    They're skirting the bottom edge of comprehensibility; the voice in the samples is by no means "clear". You have to focus very closely to understand what is being said much of the time, and even then, repeated listenings are sometimes necessary.

    • Though that's often true of amateur radio generally.

    • by msauve ( 701917 )
      "You have to focus very closely to understand what is being said much of the time, and even then, repeated listenings are sometimes necessary."

      You're describing all of the tech support calls I've had to make in the past few years.
    • by tlhIngan ( 30335 ) <slashdot@wor[ ]et ['f.n' in gap]> on Friday January 13, 2017 @06:46PM (#53664269)

      They're skirting the bottom edge of comprehensibility; the voice in the samples is by no means "clear". You have to focus very closely to understand what is being said much of the time, and even then, repeated listenings are sometimes necessary.

      In other words, it's being efficient.

      The brain has a very powerful voice and audio decoder. (In fact, the brain's wetware is so powerful to compensate for relatively poor sensors - but coupled with the power of the brain, they become much more powerful detection devices. The downside to the economy in hardware with powerful software combination is artifacting - though we usually call those things illusions).

      So the codec basically saves transmission bytes by making the brain do a lot of the signal recovery work.

      Of course, in Amateur Radio, SSB can be really bad and you have to do a lot of deciphering anyhow.

      • That's the theory. The modem also degrades gracefully in a way that lets you use your "ears" to recover information when there are bit errors. There's no on-off behavior like most digital codecs have; in fact, one of the samples is rendered with 1% bit errors, which might kill a normal codec or at least require a packet repeat. We have higher bit rate versions of the codec that don't make you work so hard.
        • I am sure the tech is very useful, and being able to transmit understandable voice (even if it takes some concentration to understand it) in a very low number of bits is cool. I just thought the slashdot summary exaggerated a little bit.

          • All the hams I spoke with this evening are wondering why you find it difficult to copy. No kidding. We seem to have trained our ears on the analog radios over marginal paths.
            • All the hams I spoke with this evening are wondering why you find it difficult to copy. No kidding. We seem to have trained our ears on the analog radios over marginal paths.

              It is a training thing. I am pretty deaf, with two separate tinnitus tones; what does get to my brain sounds like a cracked speaker, with tremendous loss above 2 kHz. Yet I am able to hear a lot of transmissions that inexperienced people with good hearing cannot. This is proven time and again when contesting with a noob helper.

              The issue I find with low bandwidth signals is that they cause fatigue over time. It's like when I wear a hearing aid. After 20 minutes, I'm ready to scream - This is likely because

  • " A single IP packet could carry 15 seconds of speec"

    great

  • A stream of sounds is difficult to parse. Converting it via various codecs won't change that or make it more useful. Converting the analog wave sounds into meaningful digital data (in the form of words as text, musical notation, specific fart parameters, a database of whale or bird calls, etc) is more helpful and efficient. Meaning can be extracted and/or analyzed. As someone else suggests, those can be converted back to a semblance of the original sequential stream of sounds (but why?).

    If you are communica

    • This is not, however, a waveform codec. It models the human voice tract, and encodes the parameters of that, rather than any waveform.
  • Do we finally have a 2400b mode? Would love to do digital, but with existing FM transceivers. Due to my HOA I can't (and yes, I have tried) do HF reliably.

    • by pe1rxq ( 141710 )

      I have been experimenting with 2400b on UHF for almost a year now, especially since it allows mixed voice and data.

  • Codec source code (Score:4, Informative)

    by TypoNAM ( 695420 ) on Friday January 13, 2017 @07:19PM (#53664399)

    Here's a link to the current source code, as it wasn't straightforward to find: https://svn.code.sf.net/p/free... [sf.net]

    Licensed under GNU LGPL v2.1.

  • by Rick Schumann ( 4662797 ) on Friday January 13, 2017 @07:39PM (#53664469) Journal
    That's who'll be interested in technology like this. They could compress and store the conversations of every person in the U.S., 24/7/365, for decades, without having to upgrade their data storage capacity.

    Just to show I'm not all gloom-and-doom: I'd think NASA, and private spaceflight companies like SpaceX, would be interested, since a low datarate for voice communications would be great, I'd think, for interplanetary distances. With higher datarates available you could have multiple conversations happening simultaneously.
    • since a low datarate for voice communications would be great, I'd think, for interplanetary distances

      If you're looking at waiting minutes for any reply, you might as well just use text. If you're on another planet, and incapacitated in such a way that you can't type, and you need help from home, you're probably pretty much boned already.

      I certainly wouldn't want to rely on this codec to get any emergency information across clearly.

    • There are commercial codecs that get to slightly lower data rates, which the government presently uses.

      I once had to ask the Pakistani military to not use the mailing list to ask questions, as I didn't want our ham radio project to get in ITAR trouble. Of course they can still use the code, it's Open Source. But they have to get help elsewhere.

    • by ajb44 ( 638669 )
      Codecs designed for conversation are limited in how much they can compress because they can't use as much correlation over a long period - to avoid long latencies. The Intelligence agencies have probably designed their own compression algorithm focussed on offline storage. My guess at the reasons that low-bitrate codecs are export controlled are 1) submarines and 2) covert channels.
  • by wonkey_monkey ( 2592601 ) on Friday January 13, 2017 @07:58PM (#53664573) Homepage

    Those samples are anything but "clear." It's still impressive, given the compression ratio, but there's no need to go overboard. You wouldn't want to have to rely on your understanding of one of these samples

  • by Anonymous Coward

    I wonder if Google could pair Codec 2 700c and RAISR (Rapid and Accurate Super Image Resolution) for YouTube videos that use even lower bandwidth than the 144p that exist already. Or, they could use the same technology to reduce the bandwidth necessary to stream 1080p/4k/8k videos and further embarrass the data capping ISPs.

  • by jensend ( 71114 ) on Friday January 13, 2017 @11:01PM (#53665423)

    I guess it's impressive to get anything other than straight noise out of less than 1kbps. But I've wondered why Rowe hasn't focused more on quality at more moderate (e.g. 2-3kbps) bitrates rather than continuing to seek ways to trade away some quality for an ever lower bitrate. It's been a couple years since I tried it out and came to that conclusion; this looks like that trend has continued.

    I couldn't get my encoded samples to sound nearly as good as the samples posted on the codec2 site. And it seemed like the second-lowest bitrate at the time (1400?) sounded essentially just as good as the highest (3200), which meant it wasn't making effective use of the additional bits. The quality jump between its highest mode and the lowest Opus mode (at 6kbps) was huge. (EVS would be a big jump over that.)

    From what I understand, codec2's most prominent competition operates at 2.4kbps and up and sounds noticeably better at those rates than codec2 does.

    • The jump in intelligibility and voice quality going from 4kHz narrowband to 6kHz mediumband is big- probably bigger than going from mediumband to 20kHz fullband. The distinguishing features of many consonants are between 3.5 and 6 kHz.

      Finding some way to take advantage of information beyond narrowband - even if not trying to encode much of it - could be a distinct advantage for a low bitrate codec over existing competition.

  • What a weird summary:

    The new codec isn't "competing with single-sideband modulation".

    Normal SSB is unprocessed speech. So the codec is simply competing with natural speech.

    The claim that SSB "is used by ham contesters to score the longest-distance communications using HF radio" is just plain wacky. So they use natural speech to talk to each other???

  • by Anonymous Coward

    This calls for an implementation on those ESP8266 and similar modules: an ADC and DAC (or PWM if absent) to interface with a headset, and that codec to send voice over IP, sparing as much bandwidth as possible for other data and/or degraded link conditions.
    Also an Arduino or other cheap platform and a couple serial rf modules could be an interesting way to tinker with the protocol and explore applications.

  • Will this run on a 6502 or more importantly is this what bender uses?
