👨‍🏫 Tutorial Microsoft just dropped an open-source voice AI that actually slaps ElevenLabs and it's 100% FREE

Zyphyrs

Established
Microsoft released an open-source voice AI, and it does things ElevenLabs charges you for.


Transcribe 60 minutes of audio in a single pass with speaker labels, timestamps, and custom hotwords
Generate 90 minutes of multi-speaker audio (up to 4 speakers) that sounds natural
Real-time TTS with ~300ms latency for live applications
50+ languages supported

FREE. Open-source. 35.5K stars on GitHub.


[link hidden]
 
Imagine, it only uses a Qwen2.5 1.5B model to understand the text and speech flow, plus an internal tokenizer running at a 7.5 Hz frame rate (that is, 7.5 "audio tokens" per second of speech), to produce 24 kHz audio output. That is why it can generate 90 minutes of audio in a single pass using just a PC's GPU, unlike most TTS models, which need 50 to 100 tokens per second and are limited to 1 to 5 minute clips. Its strong suit is English and Chinese (bilingual), thanks to 1.8M hours of bilingual training data, so those two are more polished than other languages such as Filipino.

Use the VV-Realtime 0.5B model for testing; it can run on 2 GB of VRAM. This is what I use: [link hidden], and I call its API from another app for TTS with the microsoft/VibeVoice-Realtime-0.5B model from Hugging Face. A CPU mode is also possible using an ONNX model of VV. Check out the different ways to run it: [link hidden]. The usual options you'll see online are Comfy and Pinokio, but there are many UIs that can use the TTS model.

It's not a replacement for 11labs, which is the gold standard/premium option; it's the best open-source alternative. 11labs is good at perfect speech flow, so run your text through it first, then use a Gradio web UI for zero-shot cloning with VibeVoice. Look for the app on "vibevoice-community/VibeVoice" sites like this one: [link hidden]. It's really good at mimicking, he he.
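
To put the 7.5 Hz figure mentioned above in perspective, here is a quick back-of-the-envelope calculation. It uses only the rates quoted in the post (7.5 audio tokens per second versus the 50 to 100 of most models). The function name is hypothetical, and real models add text tokens and other overhead, so treat it as a rough sketch rather than an exact budget.

```python
def audio_tokens(minutes: float, tokens_per_second: float) -> int:
    """Number of audio tokens needed to cover `minutes` of speech."""
    return round(minutes * 60 * tokens_per_second)


if __name__ == "__main__":
    # Compare the 7.5 Hz tokenizer with the 50-100 tokens/s range quoted above.
    for rate in (7.5, 50, 100):
        total = audio_tokens(90, rate)
        print(f"{rate:>5} tokens/s -> {total:,} tokens for 90 minutes")
```

At 7.5 tokens per second, 90 minutes works out to 40,500 audio tokens, against 270,000 to 540,000 at 50 to 100 tokens per second.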
 
Hey, I'll take a look at this, boss! Ty ty
 
Is it user-friendly for someone low-tech like me?
For beginners, no, because these are just models, not an app. You do the setup yourself. The GitHub link covers the 3 featured models: ASR (Automatic Speech Recognition, for speech-to-text), TTS, and Realtime TTS (both for text-to-speech). Each has its own use. The ASR model comes with a supplied package, but for TTS you'll have to look for guides because they removed it due to misuse. What I described is just my own way of doing it.
If you want something ready-made, the GitHub link has clickable buttons to try under "VibeVoice: Open-Source Frontier Voice AI". Or type this into a search engine: free vibevoice online. For any app you want to try, as long as you know its name, just use a search engine rather than waiting for someone else to answer. You'll get what you're after faster. VibeVoice has been out since August 2025, so many websites already offer free demo links or even paid services.
 
I looked for a guide on using the GitHub link directly:
[link hidden]
Make sure you have the hardware to try it and the initial requirements such as Python. Download the right model!

For VibeVoice-Realtime-TTS, check here:
[link hidden]
You can run it straight from Python after doing step #2, or via Colab. Just download the model here first and set the model path in the script.
[link hidden]
The full feature set is for Linux, but it works fine on Windows as long as you follow the basic guides. Otherwise you'll end up with Docker, he he. Try the other demo scripts there too.
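
One possible way to do the "download the model first and set the path" step from Python is sketched below. It assumes the model meant here is the microsoft/VibeVoice-Realtime-0.5B repository on Hugging Face named earlier in the thread (the actual download link is hidden), and that the `huggingface_hub` package is installed. The helper names and the `models` folder are hypothetical.

```python
from pathlib import Path


def local_model_dir(repo_id: str, models_root: str = "models") -> Path:
    """Folder where the model is stored, e.g. models/VibeVoice-Realtime-0.5B."""
    return Path(models_root) / repo_id.split("/")[-1]


def download_model(repo_id: str = "microsoft/VibeVoice-Realtime-0.5B",
                   models_root: str = "models") -> Path:
    """Download a model snapshot from Hugging Face and return its local folder."""
    # Imported lazily so local_model_dir() works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    target = local_model_dir(repo_id, models_root)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target


if __name__ == "__main__":
    # Pass the printed folder as the model path in the demo script.
    print(download_model())
```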

If it's too slow because of hardware limits, try its ONNX models in other TTS apps that support them. Hugging Face has them. As far as I know, there are now TTS wrappers that can use any TTS model as long as it fits the architecture protocol. Comfy is the only one I know of for VibeVoice.

If that still doesn't work, switch to [link hidden]. It supports a lot of models, from Kokoro, Piper, Melo-TTS, and Coqui to others: [link hidden]
 
Thanks for this, boss!
 
Since you're into the same things I am, he he: watch whether those TTS engines can get the speech flow right, because no matter how good the voice is, if it suddenly pauses midway, your audio is ruined and unpleasant to listen to. That is the limit of TTS engines that seem to buffer because they ran out of tokens to continue generating speech, or for some other reason. At that point you need LLM pre-processing or manual editing of the text to clean it up and add pauses or pause tags such as [pause:1.0s], etc. Sites like 11labs have that (internally). A few apps, like Chatterbox or Pocket-TTS, use pause parsing. Maybe a "custom voice-agent pipeline" could be added to VibeVoice just in case it stumbles on long texts, he he. But I think it speaks smoothly because of its compressed tokenization, which is economical with tokens.
But a simple LLM pre-processing pipeline (before the TTS audio conversion) will do the trick, using this prompt template to add the markers: "Rewrite this dialogue for a TTS engine. Insert [pause:X.X] tags where a human would naturally take a breath or pause for dramatic effect." It's very effective. There are other prompts people use, but this will do for now.
I used this with Kokoro-TTS before, but when Kokoro-TTS-Pause ([link hidden]) came out, the results got better.
If you're interested in speech tools, check out Vocalis or Voice-Agent on GitHub. It's like having a Class B mini-11labs, he he.
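
The pre-processing step described above could be wired up roughly like this. It is only a sketch: the prompt text is the template quoted in the post, while `build_pause_prompt` and `has_pause_tags` are hypothetical helper names, and the actual LLM call is left out because it depends on whichever model or API you use.

```python
import re

# Prompt template quoted in the post above.
PAUSE_PROMPT = (
    "Rewrite this dialogue for a TTS engine. Insert [pause:X.X] tags where a "
    "human would naturally take a breath or pause for dramatic effect."
)

# Matches tags such as [pause:0.5] or [pause:1.0s].
PAUSE_TAG = re.compile(r"\[pause:\d+(?:\.\d+)?s?\]")


def build_pause_prompt(dialogue: str) -> str:
    """Combine the instruction template with the text to be rewritten."""
    return f"{PAUSE_PROMPT}\n\n{dialogue.strip()}"


def has_pause_tags(llm_output: str) -> bool:
    """Cheap sanity check that the LLM inserted at least one pause tag."""
    return PAUSE_TAG.search(llm_output) is not None


if __name__ == "__main__":
    print(build_pause_prompt("Speaker 1: Welcome back. Today we have a big topic."))
```

Send the built prompt to your LLM, check the reply with `has_pause_tags`, and only then hand the text to a TTS engine that understands the tags.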
 
Well, well, haha, that's good stuff, boss! So you like exploring too. I'll check it out and then plug it into my AI agent. Right now I'm fixing its memory first, because you know the problem with AI agents: they hallucinate and go crazy when you give them too many tasks! Hahaha
 
I've done every kind of exploration. This one is just a home project so I have something to tinker with, he he. My testing is limited once the models get big. Something this size I can still handle. It's better to run them as a local API server so they're easy to use from any AI UI with TTS integration. Almost all of them have an OpenAI-compatible API anyway.
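
As an illustration of the OpenAI-compatible setup, here is a minimal client sketch using only the standard library. The request shape (POST to /v1/audio/speech with `model`, `input`, `voice`, `response_format`) follows OpenAI's speech endpoint. The base URL, port, model name, and voice name are placeholder assumptions; what a given local server accepts depends on that server.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000/v1",
                         model="tts-1", voice="alloy"):
    """Build a POST request for an OpenAI-style /audio/speech endpoint."""
    payload = {
        "model": model,            # placeholder; use the name your server expects
        "input": text,
        "voice": voice,            # placeholder voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

Because the request shape is the same for every compatible server, switching TTS engines should mostly come down to changing `base_url`, `model`, and `voice`.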

What I didn't mention: Kokoro-TTS understands pause tags and so on, but for other TTS engines, check their docs first to see whether they have that feature. Not all of them do, whether SSML or inline tags.

Comparison Table: Open Source TTS Engines

Engine | Pause control | Expression and emotion control | Best for
[link hidden] | Natural language tags like [pause] | Free-form inline tags (e.g., [excited], [whisper]) | High-quality, expressive narration
[link hidden] | Fine-grained control over pauses | Supports interjections like [laughter], [sighs], and [laughs] | Conversational AI and dialogue
Piper | Support for <break> tags in newer versions | Limited; focuses on speed and offline efficiency | Fast, low-resource IoT/Edge devices
IndexTTS-2 | Precise duration control for video dubbing | Disentangled control over timbre and emotion via prompts | Video dubbing and duration-specific tasks
Bark | Uses non-verbal cues for natural pauses | Generates non-speech sounds (laughs, sighs) and music | Creative and expressive "text-to-audio"
[link hidden] | Engine-dependent (XTTS-v2 supports styles) | Supports emotion/style transfer in XTTS-v2 | Production apps needing
Kokoro-TTS | Uses punctuation (;:,.!?) for natural pacing. Specific [1s] or PAUSE tags are supported via wrappers like Kokoro-FastAPI. | Supports custom phonemes like [Kokoro](/kˈOkəɹO/) and stress marks (ˈ, ˌ) to manually adjust intonation. | Lightweight (82M params) and extremely fast.
[link hidden] | Supports custom tags like [pause] (1s) or [pause:ms] for specific millisecond durations through its official and community wrappers. | Leverages an LLM (Qwen2.5) to maintain natural prosody and "turn-taking" over long-form conversations up to 90 minutes. | Optimized for multi-speaker (up to 4) "podcast-style" audio.

I'll add this VibeVoice guide that comes from Comfy:

VibeVoice: Tag Configuration
VibeVoice supports two primary tags for controlling audio pacing and expression. These are typically handled by community wrappers like VibeVoice-Athena or [link hidden], which parse these markers before processing.

  • Pause Tags:
    • [pause]: Inserts a default 1-second silence.
    • [pause:ms]: Inserts a custom duration in milliseconds (e.g., [pause:500] for half a second).
  • Tone Tags: You can influence prosody by adding [tone:STYLE] before a sentence.
    • Supported styles: excited, calm, sad, whisper, shout, and curious.
Example Input Script:

Code:
Speaker 1: [tone:calm] Welcome to the demonstration. [pause]
Speaker 2: [tone:excited] It is great to be here! [pause:1500]
Speaker 1: [tone:whisper] Let's keep this between us.
Check the docs for the rest.
At least that gives you an idea.
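
To make the tag syntax above concrete, here is a small parser sketch that pulls the speaker, tone, spoken text, and total pause out of one script line. It is an assumption about how such markers could be handled, not the wrappers' actual code; the `ScriptLine` and `parse_line` names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches [pause], [pause:500] and [tone:calm].
TAG = re.compile(r"\[(pause|tone)(?::([^\]]+))?\]")
DEFAULT_PAUSE_MS = 1000  # a bare [pause] means one second


@dataclass
class ScriptLine:
    speaker: str
    tone: Optional[str]
    text: str
    pause_ms: int


def parse_line(raw: str) -> ScriptLine:
    """Split 'Speaker N: ...' into speaker, tone, clean text and pause length."""
    speaker, _, body = raw.partition(":")
    tone, pause_ms = None, 0
    for kind, value in TAG.findall(body):
        if kind == "tone":
            tone = value
        else:
            pause_ms += int(value) if value else DEFAULT_PAUSE_MS
    # Remove the tags and collapse the leftover whitespace.
    text = " ".join(TAG.sub(" ", body).split())
    return ScriptLine(speaker.strip(), tone, text, pause_ms)


if __name__ == "__main__":
    print(parse_line("Speaker 2: [tone:excited] It is great to be here! [pause:1500]"))
```

A wrapper built along these lines would synthesize `text` in the requested tone and then append `pause_ms` of silence to the audio.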

Note: In that 2-speaker script, VibeVoice assigns the voice models automatically. If you want a custom voice, use zero-shot cloning by assigning sample WAV files to the specific speaker(s) you want. You can do that with their realtime_model_inference_from_file.py script.

Here is the sample bash command provided, assuming you are in the working folder in your terminal:
Code:
python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path sample.txt \
  --speaker_name "Alice" \
  --output_dir ./outputs
I haven't tried it yet, but I trust this will work.
The procedure is much the same for other TTS engines, unless you use professional-grade UIs like [link hidden] or [link hidden] to avoid doing so much by hand, he he.
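
If you would rather drive the command above from Python, for example to convert a whole folder of text files, a sketch could look like this. The flags are copied from the command in the post and are just as untested here. It assumes you run it from the repository root; `build_command` and `run_batch` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "demo/realtime_model_inference_from_file.py"


def build_command(txt_path, speaker_name, output_dir="./outputs",
                  model_path="microsoft/VibeVoice-Realtime-0.5B"):
    """Assemble the argument list for the demo script (flags as in the post)."""
    return [
        sys.executable, SCRIPT,
        "--model_path", model_path,
        "--txt_path", str(txt_path),
        "--speaker_name", speaker_name,
        "--output_dir", str(output_dir),
    ]


def run_batch(text_dir, speaker_name):
    """Run the demo script once for every .txt file in text_dir."""
    for txt in sorted(Path(text_dir).glob("*.txt")):
        subprocess.run(build_command(txt, speaker_name), check=True)
```

Passing the arguments as a list also sidesteps shell differences: the backslash line continuation in the command above is bash syntax and does not work in Windows cmd.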

For those who are getting dizzy from everything I've been saying, test VibeVoice here. Don't expect much from Tagalog TTS because support for our language is poor. We should have active support too so we don't get left behind by AI trends; we're always missing from the TTS lists or just tacked on, hence the low quality, he he:
[link hidden]
Compare it with others:
[link hidden]
[link hidden] - choose the PH model
...
But for me the best for Tagalog is Inworld, followed by 11labs, then Open-TTS or Google-TTS via API. Collect the APIs to take advantage of these premium TTS services. For English, there are plenty of free options. They're just a search away, or you can run them locally if they're open source. Then just use FFmpeg or Audacity to encode so the audio is crisp! You can force a TTS to speak Tagalog using zero-shot cloning with your own sample audio files for it to imitate; sometimes it works, sometimes it doesn't. Qwen-TTS can be forced too. AI is versatile these days. There are free online services that offer this, or you can make your own. Just ask an AI with the right prompts and it will do it for you. You won't learn if you don't try!
 
I asked my AI agent, and it suggested Fish and other TTS models too. It also depends on your machine's capacity sometimes, and yes, it's nice if they're OpenAI-compatible. You're right about that, boss.
 
AI agents usually go by what's trending, not by the AI leaderboards, where testing is stricter. Fish Audio really is OK too, but these models usually excel in English because it has far more training data than other languages; TTS voice models aren't all trained equally. The same is true of ASR. And the tests are run on the bare models, not the service.
OpenAI TTS, even though it isn't very expressive, has high quality in grammar and in its use of pauses without added tags, and its English speech is fluent. English is usually the standard in tests. With other languages, things change, and touch-ups are needed, he he. 11labs mixes models, voices, and engines in its service. Their voice training is their strong suit, plus the extras you get when you use their web platform interface. Through the API, you only get the TTS model, the same as with Gemini and OpenAI (as-is-where-is). My view is that if you're going to use TTS, it needs accessories such as text pre-processing, presets, and adjustable controls like volume, pitch, speed, etc., so the user has leeway to tweak things to taste, not just a model that converts text to audio. That's my point.
 
I ended up digging into what you said, hoping for a freebie. Your AI agent is right. Fish Audio was also at the top of the open-weights category when they released S2 Pro, but its GPU requirement is high, he he (NVIDIA, 12 to 24 GB VRAM).
[link hidden]
There's also a newly released open-source one: [link hidden]. It's like VibeVoice, with up to 12 minutes in one pass.
I don't test local TTS much anymore because it's free through APIs, which you can turn into a SAPI 5 TTS voice on Windows. My PC can't cope when local AI runs alongside my other AI projects.
But for me, VibeVoice is good enough because the standard model can handle up to 90 minutes in one pass and VV-Realtime TTS handles 10 minutes. Kokoro-TTS only goes up to a minute. Their voices aren't far apart, so the only things left to argue about are expression, emotion, and speech flow.
 
