JoyVoice

Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis

Speech Lab, JD Explore Academy, JD.com Inc.

💡

JoyVoice Highlight

Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Introducing JoyVoice, a highly anthropomorphic, multi-speaker, long-context conversational speech synthesis foundation model. JoyVoice can generate a conversation of up to 5 minutes in a single shot, featuring up to 8 speakers. Compared with similar speech foundation models, JoyVoice achieves significant improvements in prosodic continuity for long-form speech, rhythmic richness in multi-speaker conversations, and paralinguistic naturalness, along with superior intelligibility.
Key Innovations of JoyVoice
🎯 End-to-End Transformer-DiT Architecture: JoyVoice leverages a fully optimized end-to-end structure, where hidden representations from the AR-Transformer are fed directly into the DiT module. This integrated design enables seamless coordination between components and ensures efficient, high-fidelity multi-speaker audio synthesis.
🎵 MM-Tokenizer with Enhanced Loss Design: The JoyVoice MM-Tokenizer introduces both a multitask semantic loss and a Mel-spectrogram reconstruction loss to better capture acoustic details. Operating at a low token rate of 12.5 Hz, it effectively models both semantic and acoustic attributes of speech.
📝 Minimal Reliance on TTS Frontend: JoyVoice significantly reduces dependency on text normalization modules through large-scale data coverage, text perturbation techniques, and simulated data generation, which boosts system robustness and simplifies deployment.
🏆 State-of-the-Art Performance: JoyVoice (0.5B parameters) achieves top-tier results on both the Seed-TTS-Eval Benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization.
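To make the token-rate figures concrete, the short sketch below works out how many speech tokens a single-shot, 5-minute conversation would span at 12.5 Hz and at 25 Hz. It assumes one token per frame at the stated rate; the page does not say whether the tokenizer emits several codebook entries per frame, so treat the numbers as a lower bound. The function name `speech_tokens` is illustrative and not part of JoyVoice.

```python
def speech_tokens(duration_s: float, token_rate_hz: float) -> int:
    """Number of speech tokens for a clip, assuming one token per frame."""
    return round(duration_s * token_rate_hz)


# Longest single-shot conversation quoted above, in seconds.
five_minutes = 5 * 60

# Token rates of the two JoyVoice variants shown in the demos.
for rate in (12.5, 25.0):
    print(f"{rate} Hz -> {speech_tokens(five_minutes, rate)} tokens")
```

Under this assumption, halving the token rate halves the sequence length the AR-Transformer has to model for the same audio duration, which is one reason a low-rate tokenizer suits long-context generation.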
Model Architecture

Figure 1: Model Architecture

Performance Chart

Figure 2: Intelligibility Benchmarks

Multi-Speaker Zero-Shot Voice Clone

Bring Every Conversation to Life

JoyVoice lets you craft dynamic conversations for 2 to 8 speakers with striking realism, keeping each character's voice consistent and stable while delivering every line expressively.
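As a rough illustration of how a multi-speaker script like the ones below could be prepared, the following sketch pairs alternating speaker-label and utterance lines into turns and checks the 2 to 8 speaker range quoted above. The input layout, the `Turn` class, and `parse_script` are all hypothetical; the actual JoyVoice input format is not described on this page.

```python
from dataclasses import dataclass

MIN_SPEAKERS = 2  # lower bound quoted above
MAX_SPEAKERS = 8  # upper bound quoted above


@dataclass
class Turn:
    speaker: str
    text: str


def parse_script(lines: list[str]) -> list[Turn]:
    """Pair alternating label/utterance lines into turns (hypothetical layout)."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected alternating speaker and utterance lines")
    turns = [Turn(cleaned[i], cleaned[i + 1]) for i in range(0, len(cleaned), 2)]
    n_speakers = len({t.speaker for t in turns})
    if not MIN_SPEAKERS <= n_speakers <= MAX_SPEAKERS:
        raise ValueError(
            f"{n_speakers} speakers; expected {MIN_SPEAKERS} to {MAX_SPEAKERS}"
        )
    return turns


script = """
Sarah
Hey everyone, welcome back to the podcast.
Mark
Well, Sarah, it's huge.
Alex
That's so true, Mark.
""".splitlines()

turns = parse_script(script)
print([t.speaker for t in turns])
```

Validating the speaker count up front keeps a malformed script from reaching synthesis, where a missing or extra label would otherwise surface only as a wrong voice in the output.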

Crosstalk Performers: Guo Degang & Yu Qian

Crosstalk
谢谢,谢谢大家。看见您各位高兴,我心里也痛快。
是。
来到这儿,跟我的好朋友于谦老师,给您说一段相声。
对,我们俩合作多年。
于老师,了不起的人呐。
您又捧我。
不是捧。在咱们相声界,您是这个学问大,见识广。人家那生活,丰富多彩。
哎,也就是平常人的日子。
谦儿哥三大爱好,全国观众没有不知道的。
哦?哪三大爱好?
抽烟、喝酒、烫头。
嗨!这都多少年前的事了,您还提呢。
这是您的标志啊。现在怎么样?身体还好吗?
托您的福,还不错。
那就好。我最近可是不太好。
您怎么了?
让我媳妇给轰出来了。
因为什么呀?
说我在家不务正业,整天研究些没用的。
您研究什么了?
我研究怎么能让咱们这个相声,更上一层楼。
这是好事啊。

Prompt Audio

Guo Degang
Yu Qian

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B
Kimi-MoonCast

Tech Podcast (Chinese)

Podcast
F
哈喽大家好,欢迎收听我们这一期的播客啊,然后今天咱们会聊一聊最近在人工智能领域的的一些比较大的进展啊,无论是认知模型还是多模态,还是具身智能啊,都有一些非常有意思的突破。
M
没错没错,对,这些方向都有挺震撼的一些成果出来,那我们就赶紧开始吧。
F
啊行,今天咱们就先从认知模型开始聊起啊,那第一个咱们就先聊一聊这个MIT最近发布的这个CEO框架。这个东西到底是干嘛的,有什么特点,然后会带来哪些影响。
M
好的,这个CEO框架呢,它的厉害的地方就在于, 它解决的问题就是让这个大语言模型可以在面对新的数据的时候自己去调整,它是通过生成自我编辑,利用强化学习来优化这个编辑的策略。
F
听起来非常的酷啊,那对于开发者来讲有什么具体的影响呢?
M
就是开发者可以在很多具体的场景下使用这个模型,然后在测试的时候就可以继续训练,就很灵活。
F
然后还有一个就是这个Sergey Levine, 他最近有一篇论文啊,叫 Language Models in Plato's Cave。
M
嗯!
F
它其实就是在讲这个语言模型在推理上面的一些表现,以及它和这个视频模型的一些对比。
M
最近有几个演示真的让我惊掉下巴。比如那个能实时解读视频内容,还能和你自然对话的模型,感觉科幻片里的场景一下就拉到现实了。
F
对对!不只是简单描述画面,还能推理场景里的人物关系,甚至预测接下来可能发生什么。这已经有点接近我们人类的'常识'了。
M
确实如此!不过这些进展也带来了不少新的挑战和思考就是了。比如说伦理问题,还有这些技术到底会如何影响我们的生活。
F
哎,这个话题可就大了,咱们留到下一期再深入聊聊?今天时间也差不多了。
M
好啊好啊,那今天我们就在这儿告一段落。各位听众朋友,我们下期再见!

Reference Audio

Female
Male

Generated Audio

VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Podcast by Luo Yonghao & He Tongxue

Podcast
Luo
哟哟,在这凡尔赛。你这个年纪如果去传统的企业里去,就很难有这么快取得这么好的成果。
He
啊不是凡尔赛,不是凡尔赛。确实,我肯定没法来,来这录咱们这个播客,这个确实很幸运。
Luo
不,那那是另外一回事。但是我确实觉得其实最大的两个机会嘛,一个就是一个就是AI,一个就是自媒体。当然也我这个是我,我这个是感觉啊,我不知道是不是这样。
He
对当然。但我觉得如果你想真正的成为一个就是真正的大,真正的企业,那确实是软件、互联网、AI,这个方向会更,肯定是天花板是更高的。
Luo
那你有这想法吗?
He
完全没有。我甚至我只会做视频,我做不好别的工作。
Luo
也不能这么说,这个话题我们后边会专门聊到,然后大学,对,你到大二的时候成绩开始下来,然后集中精力就在视频上了。
He
是的是的。
Luo
然后那个时候有什么同学同好一起凑在一块商量的做吗?还是你一个人?
He
完全没有,那个时候就是我一个人做。
Luo
同学都没有感兴趣做这个?
He
我有一个,我有一个室友经常会跟我一起拍视频,就是比如说我要去拍一个镜头,然后他如果没啥事他就说,哎,阿泛咱俩一起去拍这个镜头,他就帮我拍一下。
Luo
其实是帮你忙。
He
确实我同学对我很好,大家帮我拍这个。
Luo
然后你自己做了,你第一次红的是大几的时候?是5G那个吗?
He
对,如果是全社会范围内大家都确实很多人看过的视频就是5G那个视频,那个是我大二的时候。
Luo
啊大二就做出来了。 所以其实有了一个超强的正向激励。
He
对,但是,其实我感觉这个激,激励,客观的说啊,确实是给了我非常大的作为一20岁的年轻人非常大的这个流量和名声。但我当时做完这个视频的时候其实是非常的怎么说焦虑或者是,因为你现在好突然,我原来做视频是这样,突然就干成了这样,那你下一个视频该怎么办呢?你不可能再往高走了呀。
Luo
明白,你当时,你是在那个视频红了以后自己说的安迪沃霍尔那话吗?就是15分钟名人那个。
He
啊是的是的。
Luo
你是觉得下一支可能就没了?

Prompt Audio

Luo Yonghao
He Tongxue

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B
Kimi-MoonCast

Peppa Pig (Cartoon Multi-Role)

Cartoon
P
Look! This puddle is perfect for jumping! Ready? One... two... three... jump!
M
Oh dear! Not so high, sweetheart! Daddy Pig, could you please... oh, never mind.
D
Ho ho! Splendid jump! Now watch Daddy do a SUPER SPLASH! Stand back, everyone!
Narrator: Daddy Pig takes a running start and leaps into the biggest muddy puddle he can find. Muddy water flies everywhere, covering Peppa, George, and even Mummy Pig.
P
Wow! That was the biggest splash in the whole world, Daddy!
M
Daddy Pig! My nice clean dress! Oh... well, I suppose we're all muddy pigs now.
D
Exactly, Mummy Pig! When everyone is muddy, nobody is muddy! It's called being... equally festive!
P
Can you jump with us too, Mummy? Please? It will be so much fun!
M
Oh, alright then. But just one little jump. Let me find a nice, small puddle.
Narrator: Mummy Pig carefully finds a small puddle at the edge. She gives a graceful little hop, making a delicate splash.
D
Bravo! What a magnificent jump!
P
Hooray! Now our whole family is jumping in muddy puddles!
Narrator: And so, the Pig family spends a wonderful rainy afternoon together, jumping and laughing in the garden. They are all very, very muddy, and they are all very, very happy. Because on a rainy day, there is nothing better than jumping in muddy puddles with your family.

Prompt Audio

Peppa (Female)
Mom (Female)
Dad (Male)
Narrator

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B

Podcast by 3 Speakers

Podcast
Sarah
Hey everyone, welcome back to "Language Lounge" the podcast where we chat about all things related to learning languages. I'm Sarah, and I'm here with Mark and Alex. Today, we're diving into how people are learning English in this digital age. So, Mark, what's your take on the shift from traditional classrooms to online resources?
Mark
Well, Sarah, it's huge. You know, with apps and websites, anyone can start learning from their couch. But I wonder, is it as effective? Like, without a teacher, people might miss out on proper feedback. I've seen folks pick up bad habits because they're just going with whatever sounds right.
Alex
That's so true, Mark. I'm Alex, by the way. Hi everyone. And it's not just about grammar or pronunciation; it's the whole experience. When you learn online, you might not get the cultural context. For example, idioms or jokes can be confusing without someone to explain them. But on the other hand, the accessibility is amazing. People in remote areas can now access quality materials.
Sarah
Absolutely, Alex. And let's talk about motivation. In a classroom, you have that peer pressure and schedule to keep you going. But when you're self-studying, it's easy to lose steam. I mean, how many of us have downloaded a language app and then forgotten about it after a week?
Mark
Oh, I'm guilty of that! But you know, some apps gamify the process, which helps. Like, earning points or competing with friends. It makes learning fun. Still, I think the human element is missing. Speaking with real people, getting that instant correction—it's hard to replace.
Alex
Yeah, and what about the variety of accents? Online, you hear all sorts of English—American, British, Australian—which is great for exposure. But if you're not careful, you might end up mixing them up and sounding a bit off. I remember trying to imitate a British accent from a show, and my friends laughed because it came out all wrong!
Sarah
Ha, that happens! But you know, I think the key is balance. Use online tools for practice, but maybe join a language exchange or find a tutor for speaking sessions. That way, you get the best of both worlds. And speaking of tutors, how do you guys feel about the rise of AI tutors? Are they any good?
Mark
Hmm, AI tutors are interesting. They're getting better at mimicking conversations, but they still lack that human touch. Like, if you're feeling discouraged, an AI might not notice and offer encouragement. But for drilling vocabulary or grammar, they're pretty efficient. It's like having a pocket teacher.
Alex
I agree, Mark. And let's not forget cost. Traditional classes can be expensive, while many online resources are free or cheap. That opens up learning to so many more people. But the downside is, with free content, quality can vary. You might end up learning from someone who isn't qualified.
Sarah
So true. It's a bit of a gamble. But overall, I think the trend is positive. More people are engaging with English, and that fosters global communication. What do you think the future holds? Will we see a blend of tech and human teaching?
Mark
Probably. I envision hybrid models where AI handles the basics, and teachers focus on advanced skills and motivation. That way, learning becomes more personalized. And with virtual reality, we might even have immersive language environments someday.
Alex
Oh, that sounds cool! Imagine practicing English in a virtual café with avatars from around the world. It could make learning so much more engaging. But we have to be cautious—too much tech might isolate people rather than connect them.
Sarah
Good point, Alex. Well, we're almost out of time. Thanks for this great discussion, Mark and Alex. And to our listeners, keep exploring and don't be afraid to mix old and new ways to learn. Remember, it's all about connecting with others.
Mark
Thanks, Sarah. And hey, if you're learning English, just keep at it. Mistakes are part of the journey!
Alex
Absolutely. Thanks for tuning in, and we'll catch you next time on "Language Lounge"!

Prompt Audio

Sarah (Female)
Mark (Male)
Alex (Male)

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B

Infernal Affairs: Andy Lau & Tony Leung

Movie & TV
A
几熟手喔。
T
技术嘅嘢,我喺警队都学过㗎。
A
你哋卧底真系得意,见亲面都喺天台。
T
我唔似得你,我见得光。我要嘅嘢呢?
A
我要嘅嘢你都未必带嚟。
T
咩意思?出嚟行下啰。
A
攞嚟吖嘛。俾次机会我呀。
T
点俾机会你呀?
A
以前我冇得拣,而家我想做好人。
T
好呀,同法官讲啦,睇下俾唔俾你做好人。
A
咁即系要我死?
T
对唔住,我系差人。
A
边个知?

Prompt Audio

Andy Lau
Tony Leung

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B
Kimi-MoonCast

Mystery Audiobook "Zheyun"

Audiobook
东州市第一监狱,犯人屠国安被狱警带到了招待室。门一开,他看到有人背对着他,对方短发,身形纤瘦,姿态挺拔,在他的记忆里,并没有这样的熟人。迟夏听到动静,转过身来,朝狱警点了点头,又看向屠国安。
F
屠先生,你好!
屠国安坐了下来,扫了一圈。
M
在这见面,警方还是检察院的?
F
有区别吗?
迟夏问他。屠国安冷笑一声
M
我很讨厌这两类人。
迟夏顺着他的话问
F
为什么?
屠国安举手晃了晃手铐
M
你觉得呢?
F
那自我介绍一下。
迟夏递了张名片过去
F
东州警察学院,犯罪心理专业的研究生。我的导师,正在做一项犯罪心理学的研究。我们跟监狱合作,搜集一些样本资料。
屠国安想拿那张名片,又被迟夏给拿走了。
M
犯罪心理学?
屠国安指了指自己。
M
我这种杀人犯?样本?
F
可以这么理解。
屠国安哼笑一声。
M
杀人嘛...想杀就杀了,走投无路就杀了,恨到深处爱到深处也就杀了,还需要什么心理?那是什么狗屁玩意儿?
F
想杀就杀、走投无路、爱和恨,就是我们认为的动机。而我们要研究的,就是这个动机的来源。世上那么多怨恨,也没见谁生气了就杀人呀。

Prompt Audio

屠国安 (Male)
迟夏 (Female)
Narrator

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B

Whispers by Man & Woman

Chatting
M
喂?…听得见我说话吗?我这边世界都安静下来了,就等你出声了。
F
嗯…听得超清楚。你声音轻轻的,好像就趴在我耳边说一样。
M
那就好…今天累不累呀?我好像…有点想你了。
F
嗯?只有一点点吗?可是我好像比一点点…还要多很多哦。
M
多少呀?能具体形容一下吗?
F
大概就是…从我今天挂掉你上一个电话之后,就开始想了那么多。
M
那岂不是想了一整天?…那我是不是该补偿你?
F
怎么补偿?再陪我聊一个小时?
M
一个小时哪够…我想把我的晚安,分你一半。
F
一半?那另一半呢?
M
另一半…留着明天早上再跟你说呀。这样你就知道,我的一天是从想你结束,也是从想你开始的。

Prompt Audio

Male
Female

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B
Kimi-MoonCast

Interview by Trump & Host

Interview
I
Sounds like it's not going to get resolved, the shutdown.
T
It's going to get solved. Oh, it's going to get solved.
I
How?
T
We'll get it solved. Eventually, they're going to have to vote.
I
You're saying the Democrats will capitulate.
T
I think they have to. And if they don't vote, that's their problem. Now, I happen to agree with something else. I think we should do the nuclear option. This is a totally different nuclear, by the way. It's called ending the filibuster.
I
Did you see John Thune said today that he's not going to do that?
T
I know, John doesn't... Well, John and a few others. But you know what? The Republicans have to get tougher. If we end the filibuster, we can do exactly what we want. We're not going to lose power. The theory is, "Oh, then we'll do it, but then when they get into power someday, they'll do it." That's true, but you know what? We're here right now.
I
So you think John Thune...
T
No, I like John Thune. I think he's terrific, but I disagree with him on this point.
I
He said today he wasn't going to do it.
T
Well, that's too bad.
I
So far, the shutdown hasn't spooked the stock market, which hit record highs this past week.
T
Perfect timing for your show. Just hit an all-time high. We're doing really well.
I
Can I ask you, Mr. President, on that point, though?
T
Yeah.
I
When the stock market is doing well, that doesn't affect everybody. Not everybody's invested in the stock market.
T
Oh, does it.
I
It does.

Prompt Audio

Interviewer
Trump

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B
Kimi-MoonCast

Journey to the West

Movie & TV
唐僧
悟空... 悟空... 这日头毒得厉害,为师觉得嗓子眼里都要冒烟了。咱们... 咱们还要走多久才能见到人烟啊?这腿脚像是灌了铅,实在迈不动了。
悟空
嘿嘿,师父!您这就又泄气了?俺老孙那火眼金睛刚才往远处一扫,虽然这山路崎岖,但妖气不重,祥云隐隐。您再坚持个三五里地,翻过前面那个光秃秃的土坡,保准有阴凉地儿歇脚!
八戒
哎呦我的猴哥诶——!三五里?你那是腾云驾雾的三五里,也就是眨巴眼的功夫。老猪我可是挑着担子,一步一个脚印量过来的!你听听,你听听我这肚子,咕噜噜叫得比雷公都响!我现在别说走路了,连哼哼的力气都没了。
悟净
二师兄,你就少说两句吧。这担子里也没剩什么干粮了,轻了不少。大师兄在前面探路费心费力,咱们做师弟的,得体谅体谅。再说了,师父都没喊停,你怎么就先赖在地上了?
八戒
沙师弟,你这是站着说话不腰疼!这担子是不重,可我这心里苦啊!想当年我在高老庄,这时候早就那是好酒好菜端上桌了,哪像现在,跟着这只遭瘟的猴子在这鸟不拉屎的地方喝西北风... 哎呦,不行了,我得坐会儿,就算天王老子来了我也得歇会儿!
悟空
好你个呆子!还没取到真经呢,就想着散伙回高老庄了?行啊,你歇着吧!待会儿若是有那山精树怪出来,看见这细皮嫩肉的大肥猪,正好拖去腌了过冬,俺老孙可不救你!
唐僧
悟空!八戒!都什么时候了,你们两个还有心思斗嘴?悟净说得对,咱们不能在这儿干耗着。悟空啊,你且去前面探探,若真有人家,哪怕是讨碗凉水来解解渴也是好的。为师... 为师这心里慌得紧。
悟空
师父放心!俺老孙这就去!刚才我瞧见那南山坡后有一缕炊烟升起,定是庄户人家在做午饭。师父,您和师弟们就在这松树底下稍坐片刻,俺去去就回!八戒,沙师弟,护好师父,要是把师父弄丢了,回来我揭了你们的皮!
八戒
有炊烟?也就是有人家?也就是有大白馒头?猴哥!猴哥你快去!早去早回!要是能化缘到斋饭,别说护着师父,这担子我再挑十里地都不带喘气的!快去快去!
悟净
二师兄,你这脸变得也太快了... 大师兄,你放心去吧,这里有我看着,出不了岔子。
唐僧
阿弥陀佛,悟空,切记多加小心,不可伤生害命,速去速回。

Prompt Audio

唐僧
孙悟空
猪八戒
沙悟净

Proposed Systems

JoyVoice-25Hz
JoyVoice-12.5Hz

Other Models

VibeVoice-7B

Customized Multi-Speaker Post-Training

Lower Cost, Greater Convenience, More Authentic

There's no speech more natural than natural speech. Fine-tuning JoyVoice on just 10 minutes of ordinary-quality natural recordings yields a highly anthropomorphic multi-speaker model that is ready to use.

Livestream Shopping

Livestream
Host
玉米油我比较推荐,是因为大家可能平常煎炸的时候,我会比较推荐玉米油。因为玉米油它本身的,嗯,油腥味会比较少一些。玉米油的话,它用的不是单纯的玉米,而是用的玉米的胚芽,像5升的玉米油。呃,耗费的这个玉米胚芽的话,大概是60到80万颗的这个玉米胚芽,所以它也是比较珍贵的。油香的味道比较浓郁,但是油腥味会比较少。你在煎一些东西,炸一些东西的时候,它会比较的适合。包括呢,他给大家做的是物理压价,你看它的这个质感。它里面没有一些那种沉淀物啊,或者是一些絮述,它非常的一个清澈。宝贝们啊,还有就是它烟点比较高,就是你们做饭的时候不喜欢有很多油烟的宝贝,你就可以去买这一款。它的烟点的话可能在二百六十多。二百六十多,所以它就不容易冒烟出来的宝贝们,它不容易冒烟出来。所以它整个的这个呃。这个厨房里面就会更加的干净一些,宝贝们就没有什么油烟的。

Livestream Shopping

Livestream
Host
大家还有问题的话呢,也可以跟我们的这个主播说一下,你家推拉门的门槛大概多高呀?因为我们像T80S呢,它的越障高度呢是在2厘米啊,像家里的这个门槛门框呢。呃,正常的门槛,门框都是可以轻松的翻越的啊。如果是超过2厘米的话呢,你也可以借助那种扫地机器人的爬坡垫给它。爬过去啊,所以说呃,这个门槛它不是一个很难的问题。大家还有疑问问题的话呢,现在也可以去告诉咱们的这个主播,主播这边呢可以给大家去做这个介绍。我们面前这台是T80S啊,直播间的50号链接,通过直播间的右下角购物口袋可以直接去下单购买。怎么去拍呢?点开右下角购物口袋,支付方式呢,选择用我们的这个京东支付付款啊。50号链接T80S水箱版本,支付方式选择使用京东支付付款。啊,因为京东支付他才能够拿到国补,这个换新补贴就是国补的意思啊,记得一定支付方式选对。如果通过,呃,微某信支付或者其他支付方式,它是会没有这个国补,会贵一点的啊,所以说一定是选择正确的支付方式。

Entertainment

Podcast
M
哈喽大家好!欢迎回到“戏精研究所”,我是阿杰。
F
我是小琳~ 今天我们又要来深度拉片,聊聊那些让我们拍案叫绝或者笑到头掉的经典场面了!
M
说到这个,我立刻想起一个名场面!就是苏和那个爱子对峙的那场戏。苏那个气场一出来,我的天,她心里那潜台词简直就是:就我要顾全这个大局,我们两个是合伙人。
F
明白。
M
对,我不会让你一个小妹妹从中作梗,你也不看看你是在挑拨谁呢?
F
M
这个其实当时看的也特别的有一种江湖义气!
F
对,当时苏的,那个眼神也,苏演的也很好啊,就是杨谨华这演员。苏当时看到就是林心如在批评,就是我最喜欢的那个。郭雪芙的时候,
M
嗯,
F
那个表露的就是啊,你不会原谅我了吧?就那种感觉。对,所以就更能衬托出林心如当时的那个。那段。
M
对。
F
你要分清楚谁,就即使我们俩之间有感情纠葛,但依旧轮不到你在这。
M
挑拨离间。
F
没错!好了,由于时间关系,今天“戏精研究所”的名场面小课堂就先到这里。
M
嗯…那我们就下期再见啦!拜拜~

Technology

Podcast
M
哈喽大家好,欢迎收听我们这一期的播客啊,然后今天咱们会聊一聊最近在人工智能领域的的一些比较大的进展啊,无论是认知模型还是多模态,还是具身智能啊,都有一些非常有意思的突破。
F
没错没错,对,这些方向都有挺震撼的一些成果出来,那我们就赶紧开始吧。
M
啊行,今天咱们就先从认知模型开始聊起啊,那第一个咱们就先聊一聊这个MIT最近发布的这个CEO框架。这个东西到底是干嘛的,有什么特点,然后会带来哪些影响。
F
好的,这个CEO框架呢,它的厉害的地方就在于, 它解决的问题就是让这个大语言模型可以在面对新的数据的时候自己去调整,它是通过生成自我编辑,利用强化学习来优化这个编辑的策略。
M
听起来非常的酷啊,那对于开发者来讲有什么具体的影响呢?
F
就是开发者可以在很多具体的场景下使用这个模型,然后在测试的时候就可以继续训练,就很灵活。
M
嗯,对了,最近还有几个演示真的让我惊掉下巴。咱们留到下一期再深入聊聊?今天时间也差不多了。
F
好啊好啊,那各位听众朋友,我们下期再见!

Single-Speaker Voice Clone

Capture Every Nuance

JoyVoice excels at single-speaker voice cloning, expressively reproducing the prosody, timbre, emotion, volume, speech rate, and paralinguistic information of the reference audio. It also supports cross-lingual voice cloning across Chinese, English, Japanese, Korean, and multiple Chinese dialects.

English Fearful Emotion Voice Cloning

Cross-lingual
🇺🇸 EN Oh my god... Did you hear that? The floorboard in the hallway just creaked... but I live alone. No, no, no...
🇨🇳 ZH 深夜,小李独自在家,突然听到厨房传来轻微的响动。他慢慢靠近,只见冰箱门微微开启,里面透出幽暗的光。正当他准备查看时,冰箱内突然伸出一只手,紧紧抓住了他的手腕。小李惊恐地尖叫起来,但随后发现那只是自己之前挂在冰箱门上的万圣节装饰——一只塑料手。尽管如此,那一瞬间的心跳加速让他久久难以平静。
🇯🇵 JA えっ?今、真っ暗な部屋で誰かの息遣いが聞こえた気がした…。冗談やめてよ、怖すぎる。
🇰🇷 KO 아! 진짜... 갑자기 누가 내 어깨 툭 치더라? 깜짝 놀라서 심장이 쿵쾅거려...

🎧 Prompt Audio & Text

I'm really scared to taste this dish. What if it's awful or makes me sick? I'm not sure I can go through with it.

CosyVoice 2.0

VibeVoice-7B

OURS

JoyVoice-25Hz

JoyVoice-12.5Hz

Japanese Anime Character: Rem

Cross-lingual
🇯🇵 JA ゼロから始めましょう!全てを失う痛みを知ったのだから、ゼロから、やり直せばいい!レムは信じています。どんなゼロからの未来でも、スバルくんとなら、きっと輝いていると。
🇨🇳 ZH 从零开始吧!既然已经知道了失去一切的痛苦,那就从零开始,重新来过就好了!雷姆相信,无论是怎样从零开始的未来,只要是和昴在一起,就一定是光明的。
🇺🇸 EN Let's start from zero! Since we now know the pain of losing everything, we just have to start over from zero! I believe that no matter what kind of future begins from zero, if I'm with you, Subaru, it will surely shine.
🇰🇷 KO 제로부터 시작합시다! 모든 것을 잃는 고통을 알게 되었으니까, 제로부터, 다시 시작하면 돼요! 렘은 믿어요. 어떤 제로부터의 미래라도, 스바루 군과 함께라면 반드시 빛날 거라고.

🎧 Prompt Audio & Text

こんにちは。今日もご一緒できて光栄です。

CosyVoice 2.0

VibeVoice-7B

OURS

JoyVoice-25Hz

JoyVoice-12.5Hz

English Fast Speed Voice Cloning

En-Male
Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, where's the peck of pickled peppers Peter Piper picked?
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Star Wars C-3PO

En-C3PO
R2-D2, are you even listening to my primary sensors? It's much too rocky. This way is much easier. What makes you think there are settlements over there? The probability of finding habitation in such an untraversable locale is approximately 4,098 to one!
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

A Crying Girl Reading Classical Chinese Poetry

Zh-Female
帘外雨潺潺,春意阑珊。罗衾不耐五更寒。梦里不知身是客,一晌贪欢。独自莫凭栏,无限江山。别时容易见时难。流水落花春去也,天上人间。
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Storytelling (评书, Pingshu)

Zh-Li qingfeng
却说关羽斩了华雄,回到中军大帐,将那颗血淋淋的人头往地上一掷,上前对袁绍抱腕拱手:“启禀盟军,华雄首级在此。” 帐中霎时鸦雀无声。那华雄的人头在地上滚了半圈,双目圆睁,须发戟张,嘴角还凝着一抹黑血。方才在阵前,这西凉骁将连斩联军四员上将,此刻却成了关云长马前之鬼。
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Chatting in Whisper

Chinese Female
嘘…靠过来一点。接下来这句话… 只让空气和你听见。是藏在喉咙深处的星星, 趁世界不注意… 轻轻落进你耳朵里了。
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Game Voice-over

Taiyi Zhenren
突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
Prompt Audio
CosyVoice 2.0
VibeVoice-7B
JoyVoice-25Hz
JoyVoice-12.5Hz

Chinese Dialect Accent Voice Cloning

Dialect Accent
🇨🇳 ZH-yue 讲你又唔听,听你又唔明,明你又唔做,做你又做错,错你又唔认,认你又唔改,改你又唔服,唔服你又唔讲!
🇨🇳 ZH-TW 真的捏~天空蓝到像被洗过一样!超适合去淡水看夕阳,或者到阳明山踏青啦~ 姐妹们穿小裙子拍照超赞的!晚上再去夜市吃冰,完美的一天喂!
🇨🇳 ZH-SC 要得!看来你已经进入状态咯。来来来,赶紧把身子陷进沙发里头,找个最舒坦的姿势窝起。工作是啥子?电脑是啥子?此刻的你,只需要晓得啥子叫安逸!外头可能刮风下雨,屋头却是我们的小天地。是切搓盘麻将决个高下,还是打开电视追两集剧?你开个腔,我随时奉陪
🇨🇳 ZH-GX 表哥~今天嗦粉了咩?荔枝龙眼甜到心窝啵!带你逛gai吃酸野咯~ 莫跑莫跑,我请客喔!

🎧 Reference Audio

ZH-yue
ZH-TW
ZH-SC
ZH-GX

CosyVoice 2.0

VibeVoice-7B

OURS

JoyVoice-25Hz

JoyVoice-12.5Hz