The founders of Spext are tech enthusiasts Ashutosh Trivedi and Anup Gosavi, who believe that the best content today is not just written, but spoken – in podcasts, panel discussions and interviews. “Unlike text, this voice content is not able to reach a larger audience because tools to edit, produce and share media are hard to use and expensive,” says Anup Gosavi, Co-founder, Spext. He points out that a text editor is a familiar interface for everyone. With developments in deep learning algorithms, the founders thought they could create a text-based media editor to make editing, transcribing and sharing voice media easy.
Spext developed technology for a text-based voice editor that works in the browser: it transcribes voice automatically and then accurately syncs the spoken words with the transcript. Spext has also developed technology that lets users type in new words to change the spoken words in a voice-over. “This is however not available publicly to prevent misuse,” says Gosavi.
The current product, a Voice Marketing Platform, has the following features: accurate automatic transcription; a text-based voice editor to edit voice content; post-production features like noise reduction and voice leveling to make the voice sound studio quality; and a clip creator to easily create short clips that can be shared on social media. “We can also create custom synthetic, digital voices for brands that can be used for personal assistants, audio books etc.,” states Gosavi.
Customer segments
Customers fall into two categories. The first comprises law firms, marketers and journalists, who use Spext primarily for automatic transcription and for turning the transcript into blog posts, subtitles or short clips. The second comprises media creators – podcasters and media production houses – who use Spext to edit and produce voice content. Spext improves their workflow and saves them time and costs in editing and producing voice media. Gosavi explains, “Currently, it takes 7-8 hours to edit, transcribe and share one hour of voice media because the tools are hard to use and need professional engineers. So, most of the voice content remains locked and is not widely published. Spext reduces the time and cost to edit and transcribe voice media by 80 per cent. So, it’s affordable and easy to share this content.”
The advantages range from cost savings on transcription, to productivity improvements in editing and clipping voice content, to easy correction of mistakes in voice-overs without having to rebook the artist and studio.
Focus on R&D
Gosavi says that Spext is doing research in an emerging field called AI media production, which uses deep learning and other machine learning (ML) techniques to reduce the cost of producing, editing and repurposing media. Some of the techniques used by Spext are:
- ML techniques to classify parts of audio/video as silence, music, speech and filler words (umm, uhh, etc.)
- A deep learning model to automatically punctuate long-form media (up to 4 hours)
- Accurate insertion of a new word in the same voice by just typing it in (in alpha)
- Special audio algorithms for noise reduction and audio leveling to make an audio recording sound professional and production quality
- Browser-based optimisation so that media can be edited accurately by editing the text, at high speed and without using much memory
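The idea of editing media by editing its text can be illustrated with a minimal sketch. It assumes the transcript carries word-level timestamps and that deleting a word in the text editor should drop the matching stretch of audio. The `Word` type, the `keep_ranges` function and the sample timings are hypothetical, chosen only for illustration; this is not Spext's implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def keep_ranges(words, deleted):
    """Return merged (start, end) audio ranges to keep.

    words: list of Word in time order.
    deleted: set of word indices removed in the text editor.
    """
    ranges = []
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if ranges and (i - 1) not in deleted:
            # Previous word was kept too, so extend the current range.
            ranges[-1] = (ranges[-1][0], word.end)
        else:
            # First kept word, or first kept word after a deletion.
            ranges.append((word.start, word.end))
    return ranges


# Hypothetical transcript: removing the filler word "umm" (index 1).
words = [
    Word("So", 0.0, 0.3),
    Word("umm", 0.3, 0.9),
    Word("welcome", 0.9, 1.5),
    Word("back", 1.5, 1.9),
]
print(keep_ranges(words, {1}))  # [(0.0, 0.3), (0.9, 1.9)]
```

An audio tool could then concatenate only the kept ranges to produce the edited recording.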
Regarding future plans, Gosavi shares, “Our most immediate plan is to release the voice synthesizer feature – you can type in new words to change spoken words in a voice over. This will make it easy to correct mistakes in the recordings without having to rebook the artist/studio again.” The feature has broader applications in voice personalisation as well. “Imagine a children’s audio book. You can type in the kid’s name and she becomes the hero of the story. You can do it for everyone who buys the audio book. There are ways to misuse this feature, but security features will make it clear which edits are official and which are not,” he explains. He also foresees a great opportunity in local languages.