792 points · meetpateltech · 13 hours ago
mistral.ai
simonw
bytesandbits
dmix
Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
iagooar
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
janalsncm
Obertr
pietz
observationist
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
mnbbrown
The dataset is ~100 8 kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.
Where it does fall down seems to be the latency distribution but I'm testing against the API. Running it locally will no doubt improve that?
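For anyone wanting to put numbers on a latency distribution like this, here is a minimal sketch using a nearest-rank percentile. The `percentile` helper and the sample latencies are hypothetical, made up purely for illustration; they are not measurements of this model or its API.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so p=0 returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 135, 150, 140, 900, 130, 145, 125, 138, 142]

print("p50:", percentile(latencies_ms, 50))  # 138
print("p95:", percentile(latencies_ms, 95))  # 900
```

A median that looks fine next to a much worse p95 is the usual sign of a long tail, which is easier to compare between API and local runs than a single average.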
cyp0633
krick
jiehong
We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.
I can no longer take at face value the cherry-picked comparisons companies use to show off their new models.
For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.
mdrzn
Is it better? Worse? Why do they only compare to gpt4o mini transcribe?
yko
But whatever I tried, it could not recognise my Ukrainian and would default to Russian, producing an absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material and zero Ukrainian. Made me really sad.
fph
satvikpendem
antirez
gwerbret
serf
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
XCSme
If you transcribe a minute of conversation, you'll have like 5 words transcribed wrongly. In an hour-long podcast, that is 300 wrongly transcribed words.
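A rough sketch of that arithmetic, in case anyone wants to plug in their own numbers. The function names are hypothetical, and the 150 words-per-minute speaking rate is an assumption of mine, not something stated in the thread.

```python
def errors_in_recording(errors_per_minute, minutes):
    """Scale a per-minute error count to a recording of the given length."""
    return errors_per_minute * minutes

def implied_wer(errors_per_minute, words_per_minute=150):
    """Word error rate implied by an error count per minute.

    words_per_minute=150 is an assumed conversational speaking rate.
    """
    return errors_per_minute / words_per_minute

print(errors_in_recording(5, 60))   # 300 wrong words in a one-hour podcast
print(f"{implied_wer(5):.1%}")      # 3.3% under the assumed speaking rate
```

The error rate only sounds small as a percentage; the absolute count is what you end up correcting by hand.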
aavci
mijoharas
Maybe this'll get wrapped into a nice tool later.
Does anyone have any recommendations?
harry8
Seems like fundamental info for any model announcement. Did I just miss it? Does everyone just know except me?
_blackhawk_
maxdo
How does it compare to sparrow-1?
sgt
sbinnee
ccleve
siddbudd
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
[deleted]
[deleted]
Archelaos
What estimates do others use?
upcoming-sesame
Rapzid
yewenjie
jszymborski
[deleted]
blobinabottle
numbers
asah
Voxtral Transcribe 2:
Light up our guns, bring your friends, it's fun to lose and to pretend. She's all the more selfish, sure to know how the dirty world. I wasn't what I'd be best before this gift I think best A little girl is always been Always will until again Well, the lights out, it's a stage And we are now entertainers. I'm just stupid and contagious. And we are now entertainers. I'm a lot of, I'm a final. I'm a skater, I'm a freak. Yeah! Hey! Yeah. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind I know, I know, I know, I know, I know Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd.
Google/Musixmatch:
Load up on guns, bring your friends It's fun to lose and to pretend She's over-bored, and self-assured Oh no, I know a dirty word Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey I'm worse at what I do best And for this gift, I feel blessed Our little group has always been And always will until the end Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey And I forget just why I taste Oh yeah, I guess it makes me smile I found it hard, it's hard to find Oh well, whatever, never mind Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido A denial, a denial A denial, a denial A denial, a denial A denial, a denial A denial
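To put a number on a comparison like this instead of eyeballing it, one option is word error rate (WER): word-level edit distance divided by the reference length. Below is a minimal sketch; the `tokenize` and `word_error_rate` helpers are hypothetical names, the normalization (lowercasing, dropping punctuation) is my own simplification, and only the first line of each transcript is used as input.

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "Load up on guns, bring your friends"
hypothesis = "Light up our guns, bring your friends"
print(round(word_error_rate(reference, hypothesis), 3))  # 0.286 (2 substitutions over 7 words)
```

Sung lyrics are a harsh test for any speech model, so a score from this would say more about music robustness than about ordinary speech.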
tallesborges92
derac
atentaten
ewuhic
scotty79
This combo has almost unbeatable accuracy and it rejects noises in the background really well. It can even reject people talking in the background.
The only better thing I've seen is the Ursa model from Speechmatics. Not open weights, unfortunately.
antirez
P.S. Even the demo uses a remote server via WebSocket.
dumpstate
boringg
[deleted]
Don't be confused if it says "no microphone": the moment you click the record button, it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?