Tuesday, March 01, 2011

SpeakerText: Extracting text from video, with speed and savvy



Rocky and I go around the world shooting videos. But it’s hard to consume the information in a video unless you have the 20 minutes to watch it, and Google can’t search the words spoken in my interviews. Today we’re going to see how SpeakerText solves the video/text conundrum.




SpeakerText is an online, on-demand, video-to-text platform. “What we’ve done is built this virtual assembly line that combines the power of crowdsourcing, with the parts of speech recognition and artificial intelligence that actually work,” says Matt Mireles, founder and CEO of SpeakerText. They do that by breaking the video’s audio into chunks and sending those chunks off to sites like Mechanical Turk, where workers transcribe them in units just five or ten seconds long. Then they reassemble the pieces, have editors check for errors, use phonetic speech recognition to timestamp each word, and use natural language processing to figure out sentence boundaries. And that’s just the platform.
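To make the assembly line concrete, here is a minimal sketch of the chunk-and-reassemble step only. It is not SpeakerText's code: the `Chunk` class, the function names, and the ten-second default are assumptions made up for illustration, and the transcription itself (the human part) is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int      # position of the chunk in the original audio (hypothetical field)
    start: float    # start time in seconds
    end: float      # end time in seconds
    text: str = ""  # filled in later by a human transcriber


def split_into_chunks(duration: float, chunk_len: float = 10.0) -> list[Chunk]:
    """Cut [0, duration) into consecutive chunks of at most chunk_len seconds."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append(Chunk(index=len(chunks), start=start, end=end))
        start = end
    return chunks


def reassemble(chunks: list[Chunk]) -> str:
    """Join transcribed chunks in audio order, whatever order they came back in."""
    ordered = sorted(chunks, key=lambda c: c.index)
    return " ".join(c.text.strip() for c in ordered if c.text.strip())
```

Because every chunk carries its own index, the pieces can come back from workers in any order and still be stitched together correctly.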

“On top of that, we’ve built this application layer,” explains Mireles. “We’ve built this widget that taps into the JavaScript API of a video player, any player, whether it’s Brightcove, Ooyala, YouTube, blip.tv, even self-hosted videos. As the video plays back, it highlights each sentence as a video plays, and scrolls through the transcript. You can click on the transcript, and it will jump to the exact moment you’re interested in.” That also allows you to find a great quote in a video and tweet a link to that exact moment.
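The highlight-and-jump behavior boils down to a lookup between playback time and sentence start times. Here is a rough sketch of that lookup, assuming the transcript is stored as a sorted list of sentence start times in seconds. The function names and the timestamps are invented for illustration; a real widget would do this in JavaScript against the player's own API.

```python
from bisect import bisect_right

# Hypothetical sentence start times in seconds, sorted ascending.
sentence_starts = [0.0, 4.2, 9.8]


def active_sentence(starts, t):
    """Index of the sentence being spoken at playback time t.

    Returns None if t falls before the first sentence starts.
    """
    i = bisect_right(starts, t) - 1
    return i if i >= 0 else None


def seek_time(starts, index):
    """Playback time to jump to when the reader clicks sentence `index`."""
    return starts[index]
```

On each time update from the player, the widget would call something like `active_sentence` to decide which sentence to highlight; a click on a sentence goes the other way, through `seek_time`.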

Mireles points out that this process, which he calls “distributed human computation,” allows SpeakerText to transcribe a two-hour-long video just about as fast as one that’s only a minute long. “There are all these things where people are trying to create truly artificial intelligence, and it’s not there,” says Mireles. “It always reaches maybe 85 percent accuracy that is okay, but not complete. The way these problems actually get solved is by layering on the human component.”
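Why would a two-hour video take about as long as a one-minute one? A back-of-the-envelope model helps, assuming every chunk takes the same time to transcribe and there is a pool of workers handling chunks in parallel. The function and all of its parameters are hypothetical, and it ignores the editing and timestamping stages.

```python
import math


def wall_clock_seconds(video_seconds, chunk_seconds, workers, seconds_per_chunk):
    """Rough transcription time when chunks are handled in parallel rounds."""
    chunks = math.ceil(video_seconds / chunk_seconds)
    rounds = math.ceil(chunks / workers)  # each worker takes one chunk per round
    return rounds * seconds_per_chunk
```

As long as there are at least as many workers as chunks, everything finishes in a single round, so the total depends on how long one chunk takes rather than on how long the video is. With fewer workers than chunks, the time grows with the number of rounds.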

More info:
SpeakerText web site:

