VSM Cover Story
Give Your Users a Voice
.NET 3.0 introduces several new features that simplify utilizing speech in your applications.
- By Jeff Certain
- 06/01/2007
Technology Toolbox: VB .NET, XML, Windows XP or Windows Vista, Visual Studio 2005 with the Windows Presentation Foundation Libraries, A microphone and a sound card.
The potential of speech in applications is enormous -- and almost completely untapped. One reason for this underutilization is that many developers' initial experiences with speech recognition were unsatisfying. As recently as a decade ago, the state of the art in voice recognition was a dedicated ISA sound card that performed the recognition. More recently, most people's experience with speech recognition has been limited to end-user technologies, such as automated telephone-menu systems and voice-to-text dictation packages. These products have a number of flaws from a developer's point of view. In general, they provide little -- if any -- support for developers; they require specialized and esoteric knowledge to perform the requisite phoneme mappings; they require significant training of both the user and the speech recognition engine; and they are expensive to deploy.
Contrast that with the speech API (SAPI 5.1) that shipped with Windows XP. (The SAPI 5.1 SDK is available from Microsoft as a free download.) This version provides a solid speech-recognition engine that gives developers flexibility (albeit poorly documented flexibility), accepts English words without any mappings, can be used with minimal or no training of the user or the engine, and -- best of all from my project manager's viewpoint! -- is free to deploy.
The speech API has improved significantly in its latest iteration. The current offering from Microsoft not only exposes more powerful toys, such as the ability to use a WAV file as the input to the engine, but it also gives developers coarse-grained functionality to perform common tasks. It's hard to criticize more powerful toys that get easier to use.
Unfortunately, this great technology is under-publicized. One of the least mentioned features of .NET 3.0 is the inclusion of the System.Speech namespace in WPF, which uses the speech APIs (SAPI) built into Windows XP (SAPI 5.1) and Windows Vista (SAPI 5.3) to provide speech recognition and text-to-speech (TTS) functionality (Table 1).
These recognition engines, wrapped by System.Speech, allow dictation, custom grammars, and rules-based recognition. You can use either a microphone or a WAV file as the input to the engine, which simplifies transcription. The dictation mode provides a dictionary of about 60,000 English words and requires only three lines of code to use. Rules-based recognition lets you create a flowchart of words or phrases, similar to the phone menus we've all grown to despise. Neither of these features is particularly compelling; as mentioned earlier, applications providing them have been around for quite some time.
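To put the "three lines of code" claim in concrete terms, here's a minimal dictation sketch (separate from the sample application described in this article). It uses SpeechRecognitionEngine rather than the SpeechRecognizer class wrapped later in the article, because the engine class is the one that accepts a WAV file as input; the commented-out line and its file path are placeholders for illustration:
Imports System.Speech.Recognition

' A minimal dictation sketch; assumes a working default microphone.
Public Class DictationDemo
    Private WithEvents engine As New SpeechRecognitionEngine()

    Public Sub StartDictation()
        engine.LoadGrammar(New DictationGrammar())
        engine.SetInputToDefaultAudioDevice()
        ' To transcribe a recording instead, substitute something like:
        ' engine.SetInputToWaveFile("C:\recordings\sample.wav")
        engine.RecognizeAsync(RecognizeMode.Multiple)
    End Sub

    Private Sub engine_SpeechRecognized(ByVal sender As Object, _
        ByVal e As SpeechRecognizedEventArgs) _
        Handles engine.SpeechRecognized
        Console.WriteLine(e.Result.Text)
    End Sub
End Class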
That said, the System.Speech namespace offers desktop application developers a tremendous amount of power. Specifically, the namespace enables you to generate custom grammars dynamically. You can use configuration files to generate sets of dynamic controls and the accompanying grammar; once you create a relation between the phrases in the grammar and the controls, you have a tremendously flexible interface that lets users activate your controls by voice. (Note that this functionality is distinct from Microsoft's server offerings. For Web-based and telephony applications, Microsoft has a separate offering: Speech Server. The latest version [Speech Server 2007] is currently in beta.)
The System.Speech namespace approach works particularly well for data-collection applications. Voice-activated controls let users record data without having to look at the computer. If the user is performing a visual task, he or she can keep looking at the relevant item(s) instead of shifting focus to the computer to record data. Through careful design, you can create applications that output either values (such as abbreviations or database keys for the selected items) or the phrases used to activate the controls.
Getting Started
Before I drill down on the code for the sample application, I should note that the sample uses controls that extend the basic Windows Forms buttons. This gives you a fall-back plan for voice-enabled applications. If a user breaks his or her microphone, gets mud in the microphone jack, or steps on the headset, it's nice to know your application remains usable because you've provided redundancy in the user interface.
The sample described in this article includes three prerequisites: Windows XP or Windows Vista; Visual Studio 2005 with the Windows Presentation Foundation (WPF) libraries for Visual Studio installed; and a microphone and a sound card.
The sample app uses a configuration file to generate grammars dynamically, as well as controls mapped to these grammars. The task itself is somewhat involved, so the best way to approach your application design is to break it into distinct parts -- five, in this case.
The class diagram for the speech application is relatively straightforward (Figure 1). You need to define a data structure for the dynamic grammars before you can begin generating your grammars dynamically (Figure 2). You should include both the "text" (phrase) and the "value" in the configuration file. This makes it easier to integrate your application data with other systems:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="node" type="node" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="node">
    <xs:sequence>
      <xs:element name="text" type="xs:string" />
      <xs:element name="value" type="xs:string" />
      <xs:element name="subnodes" type="node" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
You use a recursive algorithm to fill a SpeechItem object when you read the configuration file. A SpeechItem contains a phrase/value pair and a list of SpeechItems that you want to display when it's active (Listing 1). The SpeechDataReader helper class reads the XML file and returns a fully populated SpeechItem object (Listing 2).
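Listings 1 and 2 contain the complete classes. As a rough sketch -- with member names such as SubItems assumed from the way the rest of this article uses the objects -- the pair might look like this:
Imports System.Collections.Generic
Imports System.Xml

' Sketch only; Listings 1 and 2 contain the real implementations. The member
' names (Text, Value, SubItems, GetTextItems) are inferred from later usage.
Public Class SpeechItem
    Public Text As String
    Public Value As String
    Public SubItems As New List(Of SpeechItem)

    ' Returns the phrases for the SpeechItems one level below this one.
    Public Function GetTextItems() As List(Of String)
        Dim phrases As New List(Of String)
        For Each item As SpeechItem In SubItems
            phrases.Add(item.Text)
        Next
        Return phrases
    End Function
End Class

Public Class SpeechDataReader
    ' Loads the configuration file and returns the fully populated root SpeechItem.
    Public Shared Function Load(ByVal fileName As String) As SpeechItem
        Dim doc As New XmlDocument
        doc.Load(fileName)
        Return ReadNode(DirectCast( _
            doc.DocumentElement.SelectSingleNode("node"), XmlElement))
    End Function

    ' Recursively builds a SpeechItem from a <node> element and its <subnodes>.
    Private Shared Function ReadNode(ByVal node As XmlElement) As SpeechItem
        Dim item As New SpeechItem
        item.Text = node("text").InnerText
        item.Value = node("value").InnerText
        For Each child As XmlNode In node.SelectNodes("subnodes")
            item.SubItems.Add(ReadNode(DirectCast(child, XmlElement)))
        Next
        Return item
    End Function
End Class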
Construct the Recognizer Class
So far, you've created the configuration file. The next step is to create a wrapper around the System.Speech.SpeechRecognizer class; your wrapper's constructor will accept a list of strings to create the grammar. In previous versions of SAPI, creating a dynamic grammar required that you generate an XML file, save it to disk, and then load it into the recognition engine. This required about fifty lines of code, as well as two disk I/O calls (for interest's sake, the SAPI 5.1 version of generating and loading a dynamic grammar is included in the sample code).
The System.Speech namespace provides some great new classes to support this functionality. You can see several of these classes at work in the LoadGrammar() method:
Public Sub LoadGrammar(ByVal options _
    As List(Of String))
    Dim choices As New Choices(options.ToArray)
    Dim gb As New GrammarBuilder(choices)
    Dim g As New Grammar(gb)
    reco.LoadGrammarAsync(g)
End Sub
Much of the beauty of the new System.Speech classes lies in their simplicity. The Choices class constructor accepts an array of strings. (I prefer to use the new generic collections wherever possible, so LoadGrammar takes a list of strings instead of an array of strings as the argument.) You pass the Choices object to the GrammarBuilder class constructor. The GrammarBuilder object is, in turn, passed to the Grammar class's constructor. Next, you load the resulting Grammar object into the SpeechRecognizer using the LoadGrammarAsync method. The remainder of the Recognizer class provides plumbing for functionality such as starting and stopping the SpeechRecognizer (Listing 3).
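Listing 3 contains the full Recognizer class. Stripped to its essentials -- and with the SpeechEventArgs type and its Result member assumed from the way the wrapper is used in the next code sample -- it might look like this:
Imports System.Collections.Generic
Imports System.Speech.Recognition

' Sketch of the Recognizer wrapper; Listing 3 has the full implementation.
' SpeechEventArgs and its Result member are assumed from later usage.
Public Class SpeechEventArgs
    Inherits EventArgs
    Public Result As String
End Class

Public Class Recognizer
    Private WithEvents reco As New SpeechRecognizer()

    Public Event SpeechRecognized(ByVal sender As Object, _
        ByVal e As SpeechEventArgs)

    Public Sub New(ByVal options As List(Of String))
        LoadGrammar(options)
    End Sub

    ' The LoadGrammar method shown above.
    Public Sub LoadGrammar(ByVal options As List(Of String))
        Dim choices As New Choices(options.ToArray)
        Dim gb As New GrammarBuilder(choices)
        Dim g As New Grammar(gb)
        reco.LoadGrammarAsync(g)
    End Sub

    Public Sub StartListening()
        reco.Enabled = True
    End Sub

    Public Sub StopListening()
        reco.Enabled = False
    End Sub

    ' Re-raise the recognition result as a simple string for consumers.
    Private Sub reco_SpeechRecognized(ByVal sender As Object, _
        ByVal e As SpeechRecognizedEventArgs) _
        Handles reco.SpeechRecognized
        Dim args As New SpeechEventArgs
        args.Result = e.Result.Text
        RaiseEvent SpeechRecognized(Me, args)
    End Sub
End Class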
You configure the Recognizer class to recognize a simple list of phrases by passing a list of strings to the constructor and then activating the SpeechRecognizer:
Public Class MyFirstSpeechForm
    Private WithEvents reco As Recognizer

    Private Sub MyFirstSpeechForm_Load( _
        ByVal sender As Object, ByVal e _
        As System.EventArgs) Handles Me.Load
        Dim phrases As New List(Of String)
        phrases.Add("Speech is easy")
        phrases.Add("Nothing to it")
        phrases.Add("Easy as pie")
        reco = New Recognizer(phrases)
        reco.StartListening()
    End Sub

    Private Sub reco_SpeechRecognized(ByVal sender _
        As Object, ByVal e As SpeechEventArgs) _
        Handles reco.SpeechRecognized
        MessageBox.Show(e.Result)
    End Sub
End Class
Once you load the page, speaking the phrases should cause a message box to appear. Don't forget to make sure you have your microphone plugged in!
The microphone settings can be somewhat finicky; you need to have the input volume set so that, when speaking normally, you receive enough input to enable recognition, but not so much that you overdrive the input channel.
The SpeechButton control inherits from the System.Windows.Forms.Button control. Add a SpeechItem property to allow each SpeechButton control to contain the SpeechItem that displays when a user clicks on the button. You must also provide a read-only property to expose the SpeechItem's value:
Public Class SpeechButton
    Inherits System.Windows.Forms.Button

    Private _speechItem As SpeechItem

    Public Property SpeechItem() As SpeechItem
        Get
            Return _speechItem
        End Get
        Set(ByVal value As SpeechItem)
            If Not value Is Nothing Then
                _speechItem = value
                Me.Text = value.Text
            End If
        End Set
    End Property

    Public ReadOnly Property Value() As String
        Get
            Return SpeechItem.Value
        End Get
    End Property
End Class
Note that the Button control already has a Text property; the SpeechItem property's setter uses it to display the SpeechItem's Text value.
Create the SpeechPanel
Your next major hurdle is to create the SpeechPanel. The SpeechPanel container control coordinates the loading of the data from the XML file, generates SpeechButton controls (including "constant" buttons), creates dynamic grammars, configures the Recognizer object, associates the SpeechButton controls with the phrases in the grammar, stores results, and exposes fine-grained control over the data-collection process.
When the SpeechPanel control is loaded, the SpeechDataReader helper class loads the data from the XML file specified in the DataFileName property. The returned SpeechItem object is stored in the SpeechPanel's RootSpeechItem property. The ReturnToRoot() method resets the SpeechPanel control to display the SpeechButtons for the root SpeechItem.
Once you populate the SpeechItem object, the SpeechPanel creates the SpeechButton controls. During the process of creating the SpeechButton controls, you map each button to the word or phrase displayed as the SpeechButton's text using a Dictionary(Of String, SpeechButton). This allows the appropriate SpeechButton control to be activated when the Recognizer object recognizes a word or phrase. If FullWidthConstantButtons is set to True, the constant buttons span the full width of the control and have a height equal to the value of the ConstantButtonHeight property. If FullWidthConstantButtons is False, the constant buttons are treated like any other SpeechButton control for purposes of layout and resizing.
In addition to creating a button for each SpeechItem belonging to the CurrentSpeechItem, you also create buttons for the "constant" buttons. Constant buttons (specified in the ConstantButtons property) are displayed with every set of buttons in the SpeechPanel. You can configure the size and display mode for these buttons using the ConstantButtonHeight and FullWidthConstantButtons properties.
When a constant button is activated, the ConstantButtonClicked event is raised. This event enables you to take control of the execution from outside the SpeechPanel. This is extremely useful in cases where you want to record a particular piece of information without regard to sequence, or when you want to short-circuit the data-collection process if not every field is required. (You can see this technique at work in the code sample's use of the "Save Data" constant button.)
Similarly, when the user reaches the end of the data-collection sequence (i.e., the current SpeechItem has no SpeechItems of its own), the EndOfLogicTree event is raised, allowing you to determine what happens at the end of the data-collection process. For example, you might choose to present a textbox with free-form dictation enabled for the user to input comments. Or, you might choose to get the results of the data collection process by calling GetTextResults() or GetValueResults().
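The code download contains the full SpeechPanel implementation. In sketch form, the routing from a recognized phrase to the mapped SpeechButton -- along with declarations for the two events just described -- might look like this (the buttonMap field name and the event signatures are assumptions):
' Sketch of the phrase-to-button routing inside the SpeechPanel; the download
' has the real implementation. The buttonMap field name and event signatures
' are assumptions for illustration.
Private WithEvents recognizer As Recognizer
Private buttonMap As New Dictionary(Of String, SpeechButton)

Public Event ConstantButtonClicked(ByVal sender As Object, _
    ByVal buttonText As String)
Public Event EndOfLogicTree(ByVal sender As Object, ByVal e As EventArgs)

Private Sub recognizer_SpeechRecognized(ByVal sender As Object, _
    ByVal e As SpeechEventArgs) Handles recognizer.SpeechRecognized
    Dim btn As SpeechButton = Nothing
    If buttonMap.TryGetValue(e.Result, btn) Then
        ' Clicking the mapped button keeps the mouse and voice paths identical.
        btn.PerformClick()
    End If
End Sub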
You generate the required grammar once the SpeechButtons are created. Calling the GetTextItems() method retrieves the list of phrases in the CurrentSpeechItem. You add the phrases for the constant buttons, if any, and pass the list of phrases to the Recognizer's LoadGrammar method:
Private Sub LoadGrammar()
    Dim phrases As List(Of String) = _
        CurrentSpeechItem.GetTextItems
    If Not ConstantButtons Is Nothing AndAlso _
        ConstantButtons.Length > 0 Then
        phrases.AddRange(ConstantButtons)
    End If
    If recognizer Is Nothing Then
        recognizer = New Recognizer(phrases)
    Else
        recognizer.LoadGrammar(phrases)
    End If
End Sub
A similar sequence, minus the reading of the data from the XML file, takes place when a SpeechButton is clicked. You set the SpeechPanel's CurrentSpeechItem property to the activated SpeechButton's SpeechItem. At this point, the sequence described above begins again with the creation of the SpeechButton controls.
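In sketch form, that click handling might look like the following; the results list and the CreateButtons helper are assumptions, and LoadGrammar is the method shown earlier:
' Sketch of the SpeechButton click handling inside the SpeechPanel. The
' results list and CreateButtons helper are assumptions based on the behavior
' described in this article.
Private results As New List(Of SpeechItem)

Private Sub SpeechButton_Click(ByVal sender As Object, ByVal e As EventArgs)
    Dim btn As SpeechButton = DirectCast(sender, SpeechButton)
    If btn.SpeechItem Is Nothing Then
        ' Constant buttons carry no SpeechItem in this sketch.
        RaiseEvent ConstantButtonClicked(Me, btn.Text)
    ElseIf btn.SpeechItem.SubItems.Count = 0 Then
        results.Add(btn.SpeechItem)
        RaiseEvent EndOfLogicTree(Me, EventArgs.Empty)   ' end of the sequence
    Else
        results.Add(btn.SpeechItem)
        CurrentSpeechItem = btn.SpeechItem   ' descend one level in the tree
        CreateButtons()                      ' rebuild the SpeechButton controls
        LoadGrammar()                        ' reload the matching grammar
    End If
End Sub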
Much of the remaining code provides the SpeechPanel's necessary plumbing code, such as control resizing and layout. You can find the full code for the SpeechPanel in the code download.
At this point, you have the SpeechPanel control. The next step is to run this control through its paces. Begin by building your project so your new controls are available, then create a new form called SpeechPanelDemo. Next, add a SpeechPanel to the form and (for the purposes of this demo) set the SpeechPanel's Dock property to Fill. Now add "Save Data" to the ConstantButtons property and set the FullWidthConstantButtons property to True. In your form's Load event, set the DataFile property to the path of the XML grammar file that you created with the DataBuilder (or the location where you've saved the file provided as a sample).
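In code, the Load event might look like the following sketch, assuming the property name DataFile used in this walkthrough; the file path is only a placeholder for wherever you saved your XML file:
' Sketch of the demo form's Load event; the path is a placeholder.
Private Sub SpeechPanelDemo_Load(ByVal sender As Object, _
    ByVal e As System.EventArgs) Handles Me.Load
    SpeechPanel1.DataFile = "C:\SpeechDemo\SpeechData.xml"
End Sub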
Note that there is a bug with System.Speech that results in an InvalidCastException when the DataFile property is set in the designer. The Microsoft speech recognition team has been informed of this bug and is working on correcting this issue.
Next, you want to create handlers for the EndOfLogicTree and ConstantButtonClicked events. When either event is raised, you display the selected text and values, then clear the results and start the data-collection process over. Accomplishing this requires that you create a simple subroutine called ShowDataAndResetPanel. You call this subroutine from both the EndOfLogicTree and ConstantButtonClicked event handlers. You begin by showing the text of the triggered phrases, and then you clear the SpeechPanel's results. End by calling ReturnToRoot, which resets the SpeechPanel to the root, or first, SpeechItem:
Private Sub ShowDataAndResetPanel()
    With SpeechPanel1
        MessageBox.Show(.GetTextResults)
        .ClearResults()
        .ReturnToRoot()
    End With
End Sub
Note that the code becomes slightly more complicated with multiple constant buttons. In this case, you need to determine which constant button raised the event. It requires only a few lines of code to create the SpeechPanelDemo form (Listing 4).
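For example, using the ConstantButtonClicked signature assumed in the earlier sketches (which passes the constant button's text), the handler might branch like this; the "Cancel" button is hypothetical:
' Sketch of a handler for multiple constant buttons, using the event signature
' assumed earlier; the "Cancel" constant button is hypothetical.
Private Sub SpeechPanel1_ConstantButtonClicked(ByVal sender As Object, _
    ByVal buttonText As String) _
    Handles SpeechPanel1.ConstantButtonClicked
    Select Case buttonText
        Case "Save Data"
            ShowDataAndResetPanel()
        Case "Cancel"
            SpeechPanel1.ClearResults()
            SpeechPanel1.ReturnToRoot()
    End Select
End Sub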
That covers the basics of creating custom grammars dynamically and the associated controls. This certainly isn't a definitive approach -- rather, my goal is to show you how easy it is to provide a powerful extension of the traditional desktop UI. There's plenty of room to extend the tools presented here. For example, a richer event model is required to make this fit into real-world applications, so you might extend the sample by making the activated SpeechButton change color. Or, you might want to provide a mechanism to order the buttons alphabetically rather than in the order they appear in the configuration file.
The future of speech-enabled applications is bright. I'm confident that, as the hardware gets more powerful, more sophisticated algorithms will evolve to improve recognition accuracy. Just as significant -- if not more so -- are the features that will be accessible directly from VB .NET and C#. For example, will Microsoft give us a way to use extension methods in Orcas to speech-enable every control in an existing application, and do so without rewriting the application? Or will future .NET controls include the ability to activate speech recognition out of the box? I'm optimistic about the answers to these questions, given how far and how quickly speech applications have progressed in Windows.