❓ Help Is this AI? PLEASE HELP ME

What do you think this audio is?

  • [TYPE] Text to Speech

    Votes: 2 66.7%
  • [TYPE] Real audio of a person

    Votes: 0 0.0%
  • [TYPE] Voice changer

    Votes: 0 0.0%
  • [FEEL] Sounds normal

    Votes: 0 0.0%
  • [FEEL] Sounds like AI, but not obviously

    Votes: 0 0.0%
  • [FEEL] Obviously AI

    Votes: 2 66.7%
  • [FEEL] Not sure

    Votes: 0 0.0%

  • Total voters
    3
  • Poll closed.
It's obviously AI from the sound of the voice, he he: a TTS engine with a bit of expression training.
There are plenty of AI voice detectors online to test with, and the average result is 82% AI.

If I have time tomorrow, I'll try to replicate it with the AI that was used!
 
I can't get the script perfect! Test it on a TTS engine you know, like OpenAI's "alloy" voice; it gets a bit closer after a few rounds of tinkering.
<speak>
<express-as type="shouting">Uy, uy!</express-as>
<break time="20ms"/>
<express-as type="fast">May dalawa sa mid.</express-as>
<break time="1200ms"/>
<express-as type="excited">Abangan niyo lang, baka mang-flank yung mga yun eh.</express-as>
<break time="20ms"/>
Check niyo yung gilid-gilid.
<break time="1360ms"/>
<express-as type="sad">Ay naku, talo na naman tayo.</express-as>
</speak>
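If you'd rather generate a script like this than hand-type it, here is a minimal Python sketch. It assumes the same tag and attribute names used in the script above (`express-as` with `type`, and `break` with `time`); engines differ in which tags they accept, so treat those names as placeholders. The `SEGMENTS` list and the `build_ssml` helper are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

# (style, text, pause_after_ms) tuples copied from the script above.
# style=None means plain text with no express-as wrapper.
SEGMENTS = [
    ("shouting", "Uy, uy!", 20),
    ("fast", "May dalawa sa mid.", 1200),
    ("excited", "Abangan niyo lang, baka mang-flank yung mga yun eh.", 20),
    (None, "Check niyo yung gilid-gilid.", 1360),
    ("sad", "Ay naku, talo na naman tayo.", 0),
]


def build_ssml(segments):
    """Return an SSML string built from (style, text, pause_ms) tuples."""
    speak = ET.Element("speak")
    last = None  # most recently appended child element
    for style, text, pause_ms in segments:
        if style is None:
            # Plain text lives in the tail of the previous element,
            # or directly in <speak> if nothing precedes it.
            if last is None:
                speak.text = (speak.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(speak, "express-as", {"type": style})
            last.text = text
        if pause_ms > 0:
            last = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(SEGMENTS)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Building the markup through an XML library means special characters get escaped and every tag is closed, which is easy to get wrong by hand.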
 
Thank you for this. I'm currently using ElevenLabs v3 (alpha), and I was really amazed by its features. Here's an example I found on Reddit.

 
I've been working really hard to find ways to use TTS and configure it by adding effects (like distortion) via Python or a similar app, so the audio sounds almost real.

I'll take a look at the OpenAI "alloy" voice you mentioned, thanks! If you have more ideas, let me know, hahaha.
 
It's always trial and error when you're starting out, he he. I'm not sure OpenAI TTS was used there, because the expressions are realistic and the audio sounds high resolution (stereo). It may be a premium TTS such as Google's Chirp model (not ElevenLabs). It could even be a mix, human at the start and AI at the end, then enhanced afterwards.

Try Gemini TTS; it also supports that SSML script, like OpenAI. I only used SSML prosody for the trial expressions. It's still missing things like speaking rate, pitch, volume, etc., which you need to control the audio so it sounds human-like. I haven't tried straight SRT with some expression commands yet, because each of these has a different SDK. I only tested with what I currently know about TTS.

ElevenLabs doesn't accept SSML. Google has gotten stingy with TTS API usage in AI Studio, so I can't test it for long. Do your testing on puter.js instead; it's free (and unlimited). It covers almost every AI, but you'll have to edit the code there yourself to get the audio out of the TTS. At least the demos for the text, image, and audio AI APIs are working scripts. With a little AI assistance, the JavaScript will run, or you can copy and paste the code snippets. If you want Python with Puter, just check GitHub. Create a Python environment and test this: [link hidden], plus whatever else you find on GitHub...

I'll study that again for a change. It's easier to learn in English, because many TTS engines support expressions and emotions, even automatically, since the model is an MoE that understands text structure, like the latest GPT and Gemini TTS models. Others are pure TTS models, so you have to add the expression commands and so on (prosody) yourself.

Good luck.
 
Here's one more thing to take with you, using Python, if you haven't tested it yet: [link hidden]
It supports SRT and SSML. Just deploy it with Docker, or run its API server in Python. It uses the OpenAI SDK. Once deployed, no API key is needed! It's like a hybrid of Microsoft Azure and OpenAI, he he.
Since SSML is the only option for controlling aspects of speech such as pronunciation, pitch, speaking rate, and even pauses, you'll need to learn SSML scripting. I'm in the same boat, he he.


There are plenty of free online SRT-to-SSML converters; add the prosody manually at the end. That's it for now.
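If you want to see what such a converter does under the hood, here is a rough Python sketch. It assumes standard SRT timestamps (`HH:MM:SS,mmm`) and turns the gap between one cue's end and the next cue's start into a `<break>`. The sample timings are invented for illustration, and `parse_srt` and `srt_to_ssml` are hypothetical helper names, not any particular tool's API.

```python
import re
from xml.sax.saxutils import escape

# Matches "00:00:01,020 --> 00:00:02,500" (comma or dot before the ms part).
TIME_RE = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)

# Made-up sample subtitles; the timings are illustrative only.
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:01,000
Uy, uy!

2
00:00:01,020 --> 00:00:02,500
May dalawa sa mid.

3
00:00:03,700 --> 00:00:06,000
Abangan niyo lang, baka mang-flank yung mga yun eh.
"""


def _to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(srt_text):
    """Return a list of (start_ms, end_ms, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIME_RE.search(line)
            if match:
                g = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((_to_ms(*g[:4]), _to_ms(*g[4:]), text))
                break
    return cues


def srt_to_ssml(srt_text):
    """Emit <speak> markup with a <break> for each gap between cues."""
    cues = parse_srt(srt_text)
    parts = ["<speak>"]
    for idx, (_start, end, text) in enumerate(cues):
        parts.append(escape(text))  # escape &, <, > in subtitle text
        if idx + 1 < len(cues):
            gap = cues[idx + 1][0] - end
            if gap > 0:
                parts.append(f'<break time="{gap}ms"/>')
    parts.append("</speak>")
    return "\n".join(parts)


ssml = srt_to_ssml(SAMPLE_SRT)
print(ssml)
```

Note that the synthesized speech will not last exactly as long as each subtitle cue, so the gaps only approximate the original pacing. Prosody (rate, pitch, volume) still has to be added by hand afterwards.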

My advanced project there is to find an AI workflow that goes from movie audio to SSML with prosodic elements (including transcription into other languages), so that when it's turned back into audio, or into a different voice, it sounds reasonably good. There's no one-click process for that yet, but it's possible. The hard part is finding FREE tools, he he. But on GitHub, nothing is impossible! You might even beat me to it.
 
I have a correction to what I said in posts #3 and #6. Just because OpenAI TTS converted the SSML input to audio doesn't mean it supports XML; from what I've read, SSML isn't supported at the moment. Its system simply ignored the tags and processed only the plain text inside them. My mistake, he he. Even on ElevenLabs, phoneme tag support is specific to the model version. Voice customization in TTS/STT isn't well established yet, even outside SSML methods. For OpenAI, these are the basic instructions I got:
To achieve your desired expressions and timing using OpenAI's native standards as of early 2026, use the following techniques:
1. Emotional Expression (Creative Wording & Punctuation)
OpenAI TTS does not use tags for "shouting" or "surprised". Instead, it reacts to punctuation and contextual cues:
  • Shouting: Use all caps and multiple exclamation points (e.g., "UY, UY!!!") to trigger a more energetic delivery.
  • Surprise: Use interjections and question marks (e.g., "Hala! Abangan niyo lang...") to shift the pitch higher.
  • Sadness/Disappointment: Use longer, slower sentences with periods to create a trailing, somber tone (e.g., "Ay naku... talo na naman tayo.").
2. Controlling Timing and Pauses
Since <break> is ignored, you must use textual fillers or punctuation to simulate pauses:
  • Short Pauses (20ms - 500ms): Use commas (,) or extra periods (.) between words.
  • Long Pauses (1s+): Use ellipses (...) or dashes. The model interprets an ellipsis as a natural break in thought.
  • Manual Segmentation: For exact 1.5s timing, it is standard practice to send separate API requests for each segment and stitch the audio files together with the desired silence in between.
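The manual segmentation idea above can be sketched in Python with only the standard library. This is a minimal sketch, not a definitive implementation: the real TTS requests are replaced by generated sine tones so it runs offline, the 24 kHz 16-bit mono format is an assumption that must match whatever your engine actually returns, and `tone`, `silence`, `stitch`, and `write_wav` are hypothetical helper names.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed; must match your TTS engine's output


def silence(ms, sample_rate=SAMPLE_RATE):
    """Return ms milliseconds of 16-bit mono silence as raw PCM bytes."""
    return b"\x00\x00" * (sample_rate * ms // 1000)


def tone(ms, freq=440.0, sample_rate=SAMPLE_RATE):
    """Stand-in for one TTS segment: a sine tone as 16-bit mono PCM."""
    n = sample_rate * ms // 1000
    return b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n)
    )


def stitch(segments, pauses_ms, sample_rate=SAMPLE_RATE):
    """Join PCM segments, adding pauses_ms[i] of silence after segment i."""
    out = bytearray()
    for i, pcm in enumerate(segments):
        out += pcm
        if i < len(pauses_ms):
            out += silence(pauses_ms[i], sample_rate)
    return bytes(out)


def write_wav(target, pcm, sample_rate=SAMPLE_RATE):
    """Write raw 16-bit mono PCM to a path or file-like object as WAV."""
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


# Two placeholder "segments" with an exact 1.5 s pause between them.
segments = [tone(300), tone(500, 330.0)]
pcm = stitch(segments, [1500])
buffer = io.BytesIO()
write_wav(buffer, pcm)
total_ms = len(pcm) // 2 * 1000 // SAMPLE_RATE
print(total_ms)
```

In real use, each entry in `segments` would be the decoded PCM from one API response; pass a file path instead of the `BytesIO` buffer to save the result to disk.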
3. Using the Realtime API (Instructions)
If you are using the OpenAI Realtime API, you can influence the voice via the system prompt rather than tags:
  • Instructions: "Speak like a gamer in a high-stakes match. Use a panicked tone when spotting enemies and a defeated, slow tone when losing."
  • Result: The model will "steer" the voice based on these instructions while reading your text.
Revised Example Script
To get the best result without SSML, format your input like this:
"UY, UY!!! ... May dalawa sa mid. ...... Hala! Abangan niyo lang, baka mang-flank yung mga yun eh. ... Check niyo yung gilid-gilid. ......... Ay naku... talo na naman tayo."
Summary Comparison

Feature | SSML (tag-based) | OpenAI native approach
Pacing | <break time="1s"/> | Punctuation (...) or split API calls
Emotion | <express-as type="sad"> | Word choice and punctuation
Volume | <prosody volume="loud"> | Capitalization and exclamation marks
Instruction | Embedded in script | Provided in System Prompt (Realtime API)
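To move an existing SSML script over to the punctuation-based style, a small Python sketch like the one below can do the first pass. It assumes the only rule needed is the 1s+ guideline from the instructions above: breaks of one second or more become an ellipsis, and shorter breaks are dropped because the sentences already end in punctuation. The `ssml_to_plain` function and the `LONG_PAUSE_MS` threshold are hypothetical names for this example.

```python
import re

LONG_PAUSE_MS = 1000  # assumed cutoff, taken from the "Long Pauses (1s+)" rule

SSML = """<speak>
<express-as type="shouting">Uy, uy!</express-as>
<break time="20ms"/>
<express-as type="fast">May dalawa sa mid.</express-as>
<break time="1200ms"/>
<express-as type="sad">Ay naku, talo na naman tayo.</express-as>
</speak>"""


def ssml_to_plain(ssml):
    """Strip SSML tags, turning long <break> pauses into ellipses."""

    def replace_break(match):
        value, unit = float(match.group(1)), match.group(2)
        ms = value * 1000 if unit == "s" else value
        # Short pauses rely on the punctuation already in the text.
        return " ... " if ms >= LONG_PAUSE_MS else " "

    text = re.sub(r'<break\s+time="([\d.]+)(ms|s)"\s*/>', replace_break, ssml)
    text = re.sub(r"<[^>]+>", "", text)  # drop remaining tags, keep inner text
    return re.sub(r"\s+", " ", text).strip()


plain = ssml_to_plain(SSML)
print(plain)
```

The emotional cues (all caps, extra exclamation points, interjections) still have to be written in by hand, since a tag such as type="sad" has no mechanical punctuation equivalent.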

 
I think ElevenLabs, but v2, supports the XML-style syntax, the kind with tags and attributes.
 
I still ended up using ElevenLabs v3, really. Here's a sample audio.

RAW SCRIPT
"UY, UY!!! ... May dalawa sa mid. ...... Hala! Abangan niyo lang, baka mang-flank yung mga yun eh. ... Check niyo yung gilid-gilid. ......... Ay naku... talo na naman tayo."

[Vocaroo audio embed]

SCRIPT WITH TAGS
[annoyed] uy, uy!!! ... May dalawa sa mid. [surprised] ...... Hala! Abangan niyo lang, baka mang-flank yung mga yun eh. ... Check niyo yung gilid-gilid. [sighs] ......... Ay naku... talo na naman tayo.

[Vocaroo audio embed]

THESE ARE RAW AUDIO CLIPS, with no distortion or other configuration yet.

---

For context, this little project will be used for quality assurance purposes; I'm trying to look for a loophole in a website that pays for audio.
I've come to realize that we can probably identify it because we speak Tagalog, but for non-Tagalog speakers it might sound real, haha.
 
I haven't tried the new v3 models on ElevenLabs yet. BTW, the AI detector on their site says your audio wasn't made by 11Labs, he he.

The tone is almost the same on the "uy, uy" alone. Looks like you got the process down using a script with expression tags... nice. You probably noticed the difference in the script between "UY, UY!" and "uy, uy!!!".
Puter.js only supports up to v2, and only selected models at that. Even if I put in the model ID, it won't work. Even for OpenAI TTS there are only 3 models.

You're right. We can tell because we're native speakers. Others who only understand the meaning, like an AI, must learn the actual expression of the words in different conditions and situations.

So for that short script, we use special text-writing techniques to control the TTS audio results. Forget about grammar, which is what makes it challenging!

Wait, which voice model did you use with V3 Enhanced (alpha)? I only have access to the English defaults.
 
Actually, following your suggestion in your previous comment to use an online audio checker, I used 11Labs' own tester:

1. RAW file from 11Labs

[Attachment: 1768671695215.webp]

2. RAW file from 11Labs + distortion and some configurations

[Attachment: 1768671790991.webp]


For the voice model, I'm using Kael, but here's a list of voices: [link hidden]

There are so many choices, as a matter of fact. I'm also considering using Korean, Japanese, or Chinese, haha. I bet the validator (the admin who checks the validity of the audio) wouldn't even know whether the audio is AI or not.

Here's a sample of Chinese audio (using a Chinese voice model).

RAW VOICE, no configurations yet.
[Vocaroo audio embed]
 
To make this even more believable, I'm planning to add distorted background sound effects of gunshots, like in an FPS game.
 
There are many ways to cloak good AI audio, from adding simple imperfections or background noise to any kind of dirty processing. There are also plenty of high-end vocoders, and of course remixing and editing in a DAW. Too many to mention.
(Uy, uy!!! So the model was Uncle Kael all along, he he.)
My interest in AI audio is actually music generation, not dialogue. Our topic is still related, since lyrics also take metatags and special instructions that follow standard music prompts to control the final style and arrangement of the song. In Suno, isn't there a combination of genre/mood in the Style field and structural/instrumental tags within the Lyrics field that controls the final composition? It's the same idea when generating artificial dialogue with ElevenLabs, which is designed for that purpose.
That's the only place with decent Filipino models and active work on emotional synthesis/empathic AI. My access there is limited. I'm still looking for a local AI that will run on my outdated PC, he he. Their development came from open source, but it's closed to public view. Hugging Face is the only hope, if someone shares a tip for longer testing.
The important thing is that we learn the hands-on process to replicate it using a bit of code and a few commands.
 
PS. It turns out puter.js already has V3. I managed to replicate it a bit too, he he.
Good thing ChatGPT was kind to me, and now I have an option to download the audio.

"uy, uy!!! ... May dalawa sa mid!!! ...... Hala!!! Abangan niyo lang, baka mang-flank yung mga yun eh. ... Check niyo yung gilid-gilid!!! ......... Ay naku!!!... talo na naman tayo!!!"

provider: "elevenlabs",
model: "eleven_v3",
voice: "53HEM9cpXMMsKDVvXwHV",
output_format: "mp3_44100_128"
 

Agreed. So far this is the most decent one when it comes to our local language, and their English version isn't bad either.
 
Aight, hahaha. Thanks again for your input.
 
Same here. I also discovered something new and came back to your thread, he he.
As a backup (for variety), don't forget the Hugging Face Inference API using transformers (for TTS and more) so you stay up to date with trends; it comes with decent free credits for testing, or you can pay. It's very easy to use with Python once you're used to editing code. I'm more inclined to use open-source AI locally than commercial services, unless the AI API is free. I tried paying for my children, but it wasn't worth it given the free alternatives.
Maybe next time, he he.

PS. It turns out HF is fairly strict with TTS (endpoints) now, he he, so you'll have to go to the developer site for more decent API credits, or go to the underground dens of "diyescord". For those who don't know: only small LLMs are allowed, up to 300 requests (without exceeding the $0.10 in credits), or you get a payment warning.
 

About this Thread

  • 17
    Replies
  • 433
    Views
  • 2
    Participants
Last reply from:
alist1986
