Creating and Editing Captions

This is the fourth in a series of posts about Capturing and Captioning a Screen Reader Demo.

I should start by saying that there is a difference between captions and subtitles, although the terms are often used interchangeably. Subtitles provide only the spoken text, while captions also describe the other sounds in the video. Captions can also be open or closed. Open captions are “burned into” the video and are always visible. Closed captions must be activated by the user. That is a very basic overview. Even though I keep referring to captions, I am actually creating subtitles: a time-based transcription of the spoken text in my demonstration video.

I have the recorded demo in .mp4 format; now I need to create the captions. There are services that will do this for you for about $1/minute and provide the caption file in various formats. This may be the best way to proceed, as manually creating captions can be tedious. I didn’t investigate these services because IBM has an automated tool that I can use. Plus, I figured I should understand the whole process.

There are some free online services that help you create the caption file as well. The HTML5 Video Caption Maker is one example. However, your video file must be available via a public URL. Since it is a Microsoft tool, I am guessing the video must be in the IE-supported .mp4 format, but I didn’t verify that. I tried it out in Safari with a .mp4 file.

Basically, using the video controls, you play, pause, and replay segments of the video and type what you hear in each segment. Once you are satisfied you have the correct text, you save that segment and continue the process for the entire video. When you save a segment, the program saves the beginning and ending time stamps for that section along with the text. This program will output TTML (Timed Text Markup Language) and WebVTT (Web Video Text Tracks) formats. I won’t try to explain the differences here. A simple search will reveal others who have covered that topic. WebVTT is the emerging format being used with the HTML5 video tag, so that is what you want for embedding your captioned video on a Web page.

The goal of the captioning/subtitling process is to obtain a text file that contains a time stamp for the beginning and ending of each spoken segment and the actual text for that segment. Using this file with the <track> element within the HTML5 <video> element will overlay the captions on top of the video during the time that the words are spoken. Thus, it is important that each segment is not too long or the words will wrap on the screen.
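One way to catch segments that are likely to wrap is to check their length before publishing. The following is only a sketch: the `Cue` class, the `find_long_cues` helper, and the 42-character limit are illustrative assumptions rather than anything defined by WebVTT or by a particular player, and the second cue's text is made up for the demonstration.

```python
from dataclasses import dataclass

# Hypothetical per-cue limit; tune it for your player, font size, and video width.
MAX_CHARS = 42


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # spoken text shown during this interval


def find_long_cues(cues, max_chars=MAX_CHARS):
    """Return the cues whose text is longer than max_chars characters."""
    return [cue for cue in cues if len(cue.text) > max_chars]


cues = [
    Cue(7.119, 9.059, "[Becky] so here I am in firefox 35"),
    # Invented text, deliberately too long, to show what gets flagged.
    Cue(9.090, 10.670, "and I'm running JAWS 15 and this invented sentence keeps going on"),
]

for cue in find_long_cues(cues):
    print(f"Cue starting at {cue.start:.3f}s has {len(cue.text)} characters")
```

Any cue that is flagged can then be split into two shorter segments with their own time stamps.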

I am fortunate to have access to an IBM tool, Media Captioner and Editor (MCE), that will create the time-stamped text file for me. I upload the video to MCE and, using speech-to-text conversion, MCE transcribes the text and creates the time segments for me. The process isn’t perfect. I still need to edit the caption text and identify different speakers, but it certainly makes the process much faster than doing it myself! I was amazed at how well MCE captured the text, and it seems to improve each time I use it.

The only drawback right now is that MCE provides me with a file of text and timestamps in TTML format. I need it in WebVTT for use with HTML5. It looks like there is another conversion in order! MCE will output WebVTT in the near future.

The WebVTT format is straightforward. In its most basic form it is a text file of time stamps and text segments, with the segments separated by blank lines. Here is a simple example from the start of my video:


WEBVTT

00:00:00.560 --> 00:00:02.242
[JAWS] navigation toolbar toolbar

00:00:02.292 --> 00:00:04.079
search or enter address edit combo

00:00:04.110 --> 00:00:04.960

00:00:05.010 --> 00:00:06.159

00:00:07.119 --> 00:00:09.059
[Becky] so here I am in firefox 35

00:00:09.090 --> 00:00:10.670
and I'm running JAWS 15

WebVTT format Basics:

  • The text file must be encoded as UTF-8 and start with WEBVTT followed by a blank line.
  • The format of the time stamp is: HH:MM:SS.mmm or hours:minutes:seconds.milliseconds. The hour is optional.
  • The time duration format is: time stamp start --> time stamp end.
  • The text immediately follows the time stamp line, with a blank line separating the segments.
  • The file is saved with a .vtt extension. There are many formatting and styling options available.
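The basics above can be sketched in a few lines of Python. This is a minimal illustration, not a full WebVTT writer: the `format_timestamp` and `build_webvtt` helpers are hypothetical names, the cues are assumed to be plain (start, end, text) tuples with times in seconds, and none of the optional styling features are handled.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    blocks = ["WEBVTT"]  # required first line of the file
    for start, end, text in cues:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # A blank line separates the header and each segment.
    return "\n\n".join(blocks) + "\n"


cues = [
    (0.560, 2.242, "[JAWS] navigation toolbar toolbar"),
    (2.292, 4.079, "search or enter address edit combo"),
]
print(build_webvtt(cues))
```

To meet the encoding requirement, the result would be written with something like `open("demo.vtt", "w", encoding="utf-8")`, where `demo.vtt` is just a placeholder file name.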

This recent post, WebVTT, from the Mozilla Developer Network provides a thorough explanation without resorting to the full W3C specification.

The TTML format is much more verbose but contains all of the necessary information. I used my text editor and macros to convert the TTML-formatted file from MCE to a WebVTT file. I just stripped out everything but the time stamps and text. With the rise of WebVTT, I suspect someone will create an XSLT transform to convert TTML to WebVTT. So far I haven’t found one and don’t have the energy or desire to create one myself. Don’t worry if XSLT (Extensible Stylesheet Language Transformations) doesn’t mean anything to you; it is just the geek in me escaping. Basically, it is a language for transforming XML files, of which TTML is one, into other formats.
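For anyone who would rather script the conversion than use editor macros, here is a rough Python sketch of the same idea: keep only the time stamps and text. It is not the process I used, and it rests on several assumptions: that the TTML uses the `http://www.w3.org/ns/ttml` namespace, that each segment is a `<p>` element with `begin` and `end` attributes, and that those attributes are already in HH:MM:SS.mmm form. I have not checked these against MCE's actual output, and the `ttml_to_webvtt` name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; older TTML drafts used different namespace URIs.
TTML_NS = "{http://www.w3.org/ns/ttml}"


def ttml_to_webvtt(ttml_text):
    """Convert simple TTML to WebVTT, keeping only time stamps and text."""
    root = ET.fromstring(ttml_text)
    blocks = ["WEBVTT"]
    for p in root.iter(TTML_NS + "p"):
        begin, end = p.get("begin"), p.get("end")
        if begin is None or end is None:
            continue  # skip paragraphs without explicit timing
        # Flatten nested markup and collapse runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        # WebVTT cue text needs "&" and "<" escaped.
        text = text.replace("&", "&amp;").replace("<", "&lt;")
        blocks.append(f"{begin} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


sample = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:00.560" end="00:00:02.242">[JAWS] navigation toolbar toolbar</p>
      <p begin="00:00:02.292" end="00:00:04.079">search or enter address edit combo</p>
    </div>
  </body>
</tt>"""

print(ttml_to_webvtt(sample))
```

TTML allows other time formats (offsets such as `1.5s`, or frame counts), so a file that uses those would need its time stamps converted as well as copied.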

This has been longer than I expected! Now that I have my video file and my WebVTT file of time-stamped text, I can put it all together on an HTML5 page. See Adding Captioned Video to an HTML5 page.

About Becka11y

Web Accessibility Architect at IBM. Have worked on various open source projects including Dojo, Apache Cordova/PhoneGap, and Readium
