Canonical's Upcoming AI Tool: Talk to Ubuntu Instead of Typing (itsfoss.com) 56
This week the Ubuntu desktop's director of engineering announced they're bringing speech-to-text dictation to Ubuntu Desktop, aiming for an experience "that feels like a natural part of the desktop while respecting user privacy and running entirely on local hardware."
"Speech recognition has become a common feature on modern platforms, and we think it should be a first-class experience on Ubuntu Desktop as well."
More details from the blog It's FOSS: For Ubuntu 26.10, the initial version of Myna is expected to be a desktop dictation tool built around GNOME on Wayland with a push-to-talk mechanism gatekeeping when your microphone accepts input. Using it means holding a hotkey, speaking, and letting go. A small activity indicator shows while it is listening, and the transcribed text lands wherever the cursor was sitting when dictation started.
Recognition itself happens inside a sandboxed component called the Canonical Inference Snap, while a Speech Orchestrator manages the session and an Audio Adapter handles whatever the microphone picks up, denoising and chunking it before it ever reaches the model... Speech recognition will happen locally, and an internet connection is not needed once the appropriate model is installed... The audio data won't be sticking around either, being stored in a small in-memory buffer that gets discarded the moment the session ends. Features like dictation into password fields, wake words, continuous listening, voice assistants, voice commands, translation, speaker identification, and automatic language detection are all off the table...
You should also know that Canonical is looking for feedback before the specs for Myna are finalized, especially from people who already rely on dictation or assistive tools on Linux.
This Is Why I Ditched Ubuntu (Score:2, Insightful)
Re: (Score:2)
So, "People With Disabilities Don't Exist" then?
My father was recently paralyzed by Guillain-Barré, so I'll let him know, thanks.
Re: (Score:3)
I see what you are saying, but also I'm not sure if I'd say it's a disability, but I certainly find typing stuff out is easier than saying it. Especially as I can edit it before submission, whereas what I say is what the AI responds to immediately.
It's a lot like talking on the phone, which I also dislike. Face to face, people can see your expressions and when you look like you are trying to think of what to say or how to rephrase something, they can wait. On the phone, or talking to an AI, that isn't poss
Re: (Score:2)
A stroke might change your opinion - even a very mild one. Stay healthy and stay lucky.
Re: (Score:2)
What keeps me up at night is the fate of Tim Smith from the Cardiacs. In my view possibly the closest pop music ever got to a Frank Zappa league songwriter. The man wrote crazy complicated, surreal and energetic music that admittedly is an acquired taste for people. Anyway, at the height of the band's fame, he was out one night at a Sisters of Mercy concert, and he got robbed for his wallet, and immediately had a massive heart attack. Rushed to hospital and clinically dead for 7 minutes before they revived h
Re: (Score:2)
Re: (Score:2)
Me too!
Re: (Score:2)
You don't have to have a disability to use one of these tools. I did start using one because I was typing way, way too much. But it turns out that they're actually extremely useful. So I compose all my emails and do all of my development work with prompts using a tool like this. It doesn't need to be bundled with Ubuntu.
They all work essentially the same way. They are using some sort of an API connection into an LLM. But it's not full AI. It's essentially a type of speech recognition, as far as I understand
Re: (Score:3)
All current (and all actually working) speech recognition is neural networks and is therefore clearly in the category "full AI". In principle the old pattern matching approaches are also AI (which is a way broader category than most people think).
Re: (Score:2)
Users do want this. Watch the final part of the recent Linus Tech Tips Linux Challenge. Three of them switched to Linux for a month, and they all kept using it afterwards. Previous challenges had them going back to Windows.
The two big things that changed are Proton making games work, and AI making solving Linux problems less painful. They all commented on the reaction they get when asking questions of the Linux community. It's often hostile and unhelpful, telling them that they don't actually want to do wha
Re: (Score:2)
I didn’t like the unattended update service that ran on occasion. Only discovered that because manually running apt generated an error about the database being locked. That’s some Windows shit.
Re: (Score:1)
> I didn’t like the unattended update service that ran on occasion. Only discovered that because manually running apt generated an error about the database being locked. That’s some Windows shit.
sudo systemctl disable unattended-upgrades
I guess you are one of the people that needs AI.
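For anyone copying that command: as far as I can tell, on a stock Ubuntu install the periodic runs are started by APT's systemd timers and gated by the APT::Periodic settings, so disabling that one unit may not be enough by itself. A minimal sketch, assuming the stock config file is /etc/apt/apt.conf.d/20auto-upgrades (check your own system first; auto_upgrades_off is just a name made up for this example):

```shell
# Sketch only: print the APT::Periodic settings that switch off the
# automatic package-list refresh and the unattended upgrade run.
# Assumption: the stock file is /etc/apt/apt.conf.d/20auto-upgrades.
auto_upgrades_off() {
  printf '%s\n' \
    'APT::Periodic::Update-Package-Lists "0";' \
    'APT::Periodic::Unattended-Upgrade "0";'
}

# Review the output, then apply it by hand if it is what you want, e.g.:
#   auto_upgrades_off | sudo tee /etc/apt/apt.conf.d/20auto-upgrades
#   sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
auto_upgrades_off
```

The function only prints the two settings; nothing on the system changes until you pipe the output into the file yourself.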
Re: (Score:1)
Makes me wonder if AI dev teams finished fixing "AI agents that can configure OS settings for you directly with admin privileges" to the point where they're safe enough to use (i.e. won't change something destructive by accident).
I remember seeing news about Claude based agents still doing weird shit with unintended destructive operations just a few months ago.
Re: (Score:2)
unattended-upgrades should only install security updates (unless Ubuntu had some special ideas again ...).
The problem is snap automatically updating whether you want it or not. I think without low level tampering (disabling snap basically) you can only pause updates for a limited time, but not disable them. That feels like iOS/Android (yeah, also Android. Try to disable Google play service updates
Re: (Score:2, Redundant)
So I already use a tool like this. It's called Voicy. I use it because I've been writing so many long prompts that I developed relatively severe tendonitis in my left arm. And also I was leaning on my desk so much when typing that I developed bursitis in my elbow. So I got this application, and then I got a microphone, and now I can make very long, large prompts. It has actually sped up the development of a game that I'm working on by an astronomical amount. In fact, I'm using it right now. It's not perfect
Re: (Score:2)
Re: (Score:2)
> So I already use a tool like this. It's called Voicy. I use it because I've been writing so many long prompts that I developed relatively severe tendonitis in my left arm.
Have you ever used a computer before LLMs became a thing?
If yes, how did you manage to not hurt yourself before your life was nothing but writing prompts?
(Maybe the solution is to stop writing prompts and go back to doing what you did before, is what I'm suggesting)
=Smidge=
Re: (Score:2)
yes yes "i'm old and cranky"
Re: (Score:2)
I suspect it's for everyone from people with disabilities to people who struggle with fast typing on keyboards (a shocking number of Gen Z and Gen Alpha, who are used to on-screen keyboards, are in this category).
For us older dudes who grew up doing work with keyboards, we probably type faster than we can speak clearly. And with fewer errors.
But we're not the whole population. Not by a long shot. And for those who are less keyboard-inclined, this may be a useful feature.
Re: This Is Why I Ditched Ubuntu (Score:1)
Re: (Score:2)
They are trying so hard to be Windows they might as well call this "Torcana."
Perfect for corporate use (Score:4)
This feature is great in an office that uses small cubicles. Even better for open-plan offices!
But seriously, apart from disabled users who might not be able to use a keyboard, I don't see a use case for this. The reason we use dictation on mobile devices is that they typically have poor keyboards. If you have a good keyboard, you can be far more efficient with it than with voice input.
Re: (Score:3)
How about the script writer sitting in the coffee shop, working on a porn video?
Re: (Score:2)
Re: (Score:2)
I have a good keyboard, and on a good day, I can do 60 words a minute. I completely and fundamentally disagree with you. Using a microphone to speak what you want to appear on the screen can be, if you use it correctly, much, much faster than typing it. Using a keyboard is great now for certain types of things, but these modern tools recognize things like when you use commas and are pausing, when a sentence ends, and so on. You don't need to actually say the sentence and then say the word period after it to
Re: (Score:3)
I guess it depends on what you're typing. For English text, it's probably fine. But I do a lot of programming and doubt that speech input would be effective for that.
Re: Perfect for corporate use (Score:2)
Happily, I retired just before AI would've started to raise my ire at work daily, but I cannot imagine tolerating the Hell of trying to think while on a corporate floor full of people babbling at their computers all day.
Someone please invent the cone of silence. What?
Re: (Score:2)
Me too! Retired in 2023... perfect timing, I think!
Re: (Score:2)
Re: (Score:2)
Do you know how many people never learned to touch type? I prefer a keyboard very much and am way faster with it. But most people type veeeery slow.
Re: (Score:2)
I see doctors using dictation more and more. It saves them a lot of time which can be spent on actually interacting with patients.
Re: (Score:2)
Re: (Score:2)
"If you have a good keyboard, you can be far more efficient with it than with voice input."
-1, provably false
https://www.medrxiv.org/conten... [medrxiv.org]
Re: (Score:2)
Relevant (Score:2)
I use Linux on everything. So how relevant is Canonical's announcement for me?
1) I don't use Gnome
2) I don't use Wayland
3) I don't use SNAP
4) I don't use Ubuntu
5) I have no use for desktop dictation since I can type much faster than speaking something, then reading it all again to edit and correct all the mistakes and add all the missing punctuation/etc.
At least they kept it "local" and perhaps some people might find the tool useful. So wake us up when it is a real/native package, can be used on any Linux,
Re: (Score:3)
Must be incredibly relevant. You went out of your way to post how it doesn't affect you.
Re: (Score:2)
>"Must be incredibly relevant. You went out of you way to post how it doesn't affect you."
I am probably not in the minority in my view of its relevance, and I specifically wrote it could be useful for some people. But, whatever :)
Re: (Score:3)
People understanding that snaps suck and people using Ubuntu probably have very little overlap ;)
Re: (Score:2)
I hate snap, but for an average computer user (not Linux enthusiast) there is nothing wrong with it. Steam was an issue for a while but that's been resolved.
Re: (Score:2)
It's slower and it increases RAM usage. That is more than enough "wrong" for an average user. As the average user is not likely to go find out the details, it also does not appear as "uses more RAM than other installation methods" but as "Ubuntu is slow" to them, which is bad marketing for desktop Linux. You don't need to go into technical details to notice the effects of the problems.
okay... where? (Score:4, Insightful)
> You should also know that Canonical is looking for feedback before the specs for Myna are finalized, especially from people who already rely on dictation or assistive tools on Linux.
OK, how do we provide this feedback? The article is chock-full of links, but not one for that. It gives strong "get fucked" energy.
Since it's not worth putting out the effort to figure out where to submit some comments they definitely won't give a fuck about anyway: In no way is it a "first class" anything when it's only for GNOME and only in a snap. Let us know when it's ready for prime time so we can test it out and decide if we care. There's a 0% chance I'm going to use GNOME or snap.
Re: (Score:3)
>"In no way is it a "first class" anything when it's only for GNOME and only in a snap. [...]There's a 0% chance I'm going to use GNOME or snap."
^^THIS
If it were a project that mapped to many/all Linux distros, using a native package (not container, especially not a SNAP container), worked on any Linux desktop environment (and yes, X11 too), then it would be far more interesting. I might even check it out and give feedback.
Hooray! (Score:4, Insightful)
I'm all for a speech to text feature. I've wanted one for years. But, it has to not suck. The speech recognition in my car is dog shit. The speech recognition in Windows is dog shit. The speech recognition in Google has, after decades, reached a point where it is good. But, not great.
If Ubuntu can put it into the desktop, make it good, and not require 64GB of DDR5 (with a street value of a squillion dollars) I'll be happy to see it.
Re: (Score:1)
In my experience, it's very language dependent. Big popular languages like English? Big players in the field like Google got their AI good enough to take dictation even when speaking quickly after minimal training.
But smaller languages like Finnish? The level of "oh no, it's retarded" is over 9000.
Also needs a decent quality mic and reasonably clean background noise levels in most cases.
Honestly doesn't seem that bad? (Score:2)
Feels to me like It's FOSS baited people by calling this "Canonical's New AI Tool" when Canonical's announcement doesn't use the term 'AI' anywhere in it. They call it "Speech To Text", which is what it is. It probably uses advances in programming from the AI industry to try and improve the speech recognition but why should that be a problem? It's free software and operates entirely locally. If this counts as AI then this is the type of AI I can live with.
Re: (Score:2)
"Canonical's announcement doesn't use the term 'AI' anywhere in it."
STTs are not LLMs, but they are AI that use CNNs and RNNs.
The models Canonical mentions by name are Whisper, Parakeet, Nemotron, and Qwen3-ASR (https://github.com/canonical/myna/blob/main/docs/architecture/Myna%20-%20System%20Architecture.png).
Re: (Score:2)
Some are even multimodal LLMs. You can transcribe audio using Gemma 4 for example. While it is not the primary purpose of an LLM, it has the advantage that it can do more than speech, like describing other sounds, and the context an LLM knows that a simpler STT engine does not can prevent transcription errors. Every silly error you see in automatic captions that is obvious nonsense can be caught by an LLM. The subtle ones stay subtle, of course.
Re: (Score:2)
> Every silly error you see in automatic captions that is obvious nonsense can be caught by an LLM.
Every silly error I see in automatic captions today is on Youtube and was created by an LLM. Why didn't it catch them?
Re: (Score:2)
Because they do not use an LLM for that, as that is more time (and thereby energy and cost) consuming than using a simpler AI system.
Don't Need AI for Voice Recognition (Score:2)
Sue Doe: Are Em, Dash, Star and the rest here? (Score:1)
What? My secretary's name is Sue Doe.
What about obscenities? (Score:2)