Text-to-speech, or TTS, is the backbone of Cabinet of Wonders — this is what drives our audio guide experience.

Behind the scenes, we use the API of ElevenLabs, a company that creates natural-sounding TTS using deep learning. They have just released their new voice model V3, and we decided to adopt it right away!

Before we dive in, see the results of this update:

You might also want to look at the updated Czech version we played with:

How do text-to-speech models affect the audio experience?

Like everything in the deep learning and AI world, text-to-speech systems are built around models, which determine the quality of the engine's output.
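
For context, in the ElevenLabs API the model is simply a parameter of the generation request, so switching models does not change the integration much. Below is a minimal sketch of such a request against the public text-to-speech endpoint, not our production code: the voice ID is a placeholder and the model identifiers are our assumptions, so check them against the current ElevenLabs documentation.

```python
# Minimal sketch of an ElevenLabs text-to-speech request (not production code).
# The voice ID is a placeholder and the model IDs are assumptions; check the docs.
import requests

API_KEY = "your-elevenlabs-api-key"   # in practice, read from env/config
VOICE_ID = "placeholder-voice-id"
MODEL_ID = "eleven_v3"                # assumed identifier of the new V3 model;
                                      # Turbo V2.5 is typically "eleven_turbo_v2_5"

def synthesize(text: str, voice_id: str = VOICE_ID, model_id: str = MODEL_ID) -> bytes:
    """Send text to ElevenLabs and return the generated audio bytes (MP3)."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={"text": text, "model_id": model_id},
        timeout=120,
    )
    response.raise_for_status()
    return response.content

if __name__ == "__main__":
    with open("sample.mp3", "wb") as f:
        f.write(synthesize("Welcome to the gallery."))
```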

Before this change, we had used ElevenLabs' multilingual Turbo V2.5 model for English narration for more than a year, and the results were quite solid. There were known issues, though:

  • If a sample combines names in different languages, the voice often tries to read the “foreign” names using the rules of the main language of the text
  • Dates are often pronounced as plain numbers
  • Abbreviations and special symbols sometimes just confuse the TTS engine, resulting in incomprehensible sounds

ElevenLabs mention these problems in their announcement. Note also that such problems vary across languages. Regardless, their new V3 model promised better accuracy in exactly these areas. We were curious!

How did we evaluate the models?

TTS accuracy is a defining factor when picking a voice model, and we gave it extensive testing! But it is not the only metric. Each generated audio sample takes space to store and time to produce. Since we generate samples on the fly and then store them, we need to keep both in mind to keep the user experience smooth and avoid wasting storage.

From the file size perspective, we didn't notice a big difference between the models. Generated audio came to roughly 1 MB per 1,000 characters of text for both Turbo V2.5 and V3. This is a reasonable size.

Time-wise, there are tradeoffs, though. Turbo models are optimized for generation speed, somewhat sacrificing precision. V3 maximizes output quality at the cost of speed: generation of our samples took roughly 10 times longer, about 40 seconds per 1,000 characters. That is quite a difference, and we had to take it into account when deciding which models to pick for which languages.
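
To put those numbers in perspective, here is a quick back-of-the-envelope estimate for an example narration of 1,500 characters (an illustrative length, not a measured average); the constants are approximations from our own samples, not guarantees.

```python
# Back-of-the-envelope estimate based on the rough figures above:
# ~1 MB of audio and ~40 s of V3 generation time per 1,000 characters,
# with Turbo V2.5 roughly 10x faster to generate.
MB_PER_1K_CHARS = 1.0
V3_SECONDS_PER_1K_CHARS = 40.0
TURBO_SPEEDUP = 10.0

def estimate(chars: int) -> dict:
    """Rough storage and generation-time estimate for a narration of `chars` characters."""
    k = chars / 1000
    return {
        "storage_mb": round(k * MB_PER_1K_CHARS, 1),
        "v3_generation_s": round(k * V3_SECONDS_PER_1K_CHARS),
        "turbo_generation_s": round(k * V3_SECONDS_PER_1K_CHARS / TURBO_SPEEDUP),
    }

# An illustrative 1,500-character narration: ~1.5 MB, ~60 s with V3, ~6 s with Turbo.
print(estimate(1500))
```

A minute of generation is noticeable when a sample is produced on the fly for its first listener, which is exactly why generation time mattered in our choice of models per language.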

How did we pick the voices?

ElevenLabs have A LOT of voices, and it can be hard to find your way around at first. Fortunately, there are useful filters: age, gender and the languages the voices are trained on. Some voices also have custom rates and live moderation, which affect generation cost and latency respectively.

An important caveat is that the same voices sound quite different with V2.5 and V3. This is why we decided to replace all our existing voices with new ones, under new names, to avoid any confusion among our customers.

ElevenLabs have a few dozen recommended voices for V3. We could not pick all of them, though, since we have an additional requirement: the voices must also sound good in Czech, as most of the museums we work with are currently in the Czech Republic.

So we still had to do some manual casting... we actually spent a whole week just listening to the generated samples! Besides filtering out voices with not-so-good Czech, we also dropped those that sounded too fast, too slow, or too cartoonish (ElevenLabs have a lot of those!).
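
To give you an idea of what that casting looked like in practice, here is a rough sketch of the kind of loop we used: render the same test text with each candidate voice and save the results for side-by-side listening. As before, the voice IDs are placeholders and the model identifier is an assumption.

```python
# Rough sketch of a casting loop: render the same test text with each candidate
# voice and save the results for side-by-side listening. Not production code;
# voice IDs are placeholders and the model ID is an assumption.
import requests

API_KEY = "your-elevenlabs-api-key"
MODEL_ID = "eleven_v3"  # assumed identifier of the V3 model

CANDIDATE_VOICES = {
    "candidate-a": "placeholder-voice-id-1",
    "candidate-b": "placeholder-voice-id-2",
}

TEST_TEXT = "Welcome to the gallery. The exhibit dates from the 20th century..."  # full text shown later in this post

for name, voice_id in CANDIDATE_VOICES.items():
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={"text": TEST_TEXT, "model_id": MODEL_ID},
        timeout=120,
    )
    response.raise_for_status()
    with open(f"casting_{name}.mp3", "wb") as f:
        f.write(response.content)
```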

Where are the new voices?

Eventually, we picked 12 new adult voices, balanced for age and gender. Experimentally, we also added two kid voices (Lucy and Tommy). They are popular community choices but not very battle-tested just yet. We will keep an eye on them!

You can listen to the demo right here, on our website. These are narrations for typical museum exhibits, and you can see (or rather hear) for yourself how much difference a voice can make in the listening experience.

And what about TTS accuracy?

We talked about “hard cases” for text-to-speech engines, like dates and special symbols. In the context of cultural exhibitions, some of those (e.g. currencies) are more relevant than others (e.g. license plates). Eventually, we created our own tongue-twisters to challenge the voices.

Here is the English sample text you saw in the video demos:

Welcome to the gallery. The exhibit dates from the 20th century, measures 29x22x10 cm and weighs 3 kg. The temperature in the room is 21 °C. The author is Alfons Mucha, inspired by Paris and Art Nouveau. Note the mention of the Musée d'Orsay gallery. Mucha wrote about his work to Paul Gauguin, František Kupka and Auguste Rodin. The entrance fee to the museum is 300 CZK or 12.50 € (as of 1.3.2026). For more information, visit www.muzeum.cz, or reach out to the museum administrator, Ing. Jana Nováková.

As you heard, V3 handles the combination of Czech, English and French names in the text. The only thing it still struggles with is Czech academic degrees, but those are a strange legacy system that is heavily made fun of anyway; we bet ElevenLabs didn't train their models on it!

Looking forward

This was a very fun exercise to do! But also, our ears and brains hurt from listening to the same texts in dozens of voices...

TTS technology is making sizable progress, and we hope the next ElevenLabs models will be even smoother. We would also like to involve AI more in sample evaluation: currently it is good at comparing technicalities (e.g. sample rate or pause ratio) but cannot reliably verify word-for-word correctness.
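
To illustrate the “technicalities” part, here is a minimal sketch of the kind of check that is already easy to automate: comparing the sample rate and pause ratio of two generated files with pydub. The file names are hypothetical, and the silence threshold and minimum pause length are arbitrary values that would need tuning per voice.

```python
# Minimal sketch: compare easily measurable "technicalities" of generated samples,
# here the sample rate and the pause (silence) ratio, using pydub.
# File names are hypothetical; threshold and pause length would need tuning.
from pydub import AudioSegment
from pydub.silence import detect_silence

def pause_ratio(audio: AudioSegment, silence_thresh_db: int = -40, min_pause_ms: int = 300) -> float:
    """Fraction of the sample spent in pauses longer than `min_pause_ms`."""
    pauses = detect_silence(audio, min_silence_len=min_pause_ms, silence_thresh=silence_thresh_db)
    paused_ms = sum(end - start for start, end in pauses)
    return paused_ms / len(audio)  # len() of an AudioSegment is its duration in milliseconds

for path in ["sample_turbo.mp3", "sample_v3.mp3"]:
    audio = AudioSegment.from_file(path)
    print(path, audio.frame_rate, "Hz, pause ratio:", round(pause_ratio(audio), 3))
```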

We are excited about the new voices and very happy to improve the experience of the Cabinet users in museums!