Confessions of a Speech Recognition Consultant

Time for an update!

2012-07-19T19:27:00.000-07:00

Well it's been far too long since I last blogged . . . . I've been buried in several large projects with clients over the past year and have barely looked up. As they've all been wrapped up, I've decided to start a new chapter in my career and have accepted an offer to join Lumenvox LLC of San Diego as Senior Director of Client Services.

Essentially, I'll be doing much of the same work that I have for the past 8 years in my consulting practice, only with a great team working for me to support the efforts. My new team includes the technical support staff, training team and professional service delivery teams. Our collective job is to enable Lumenvox customers and partners to succeed in their speech recognition based projects.

Lumenvox produces ASR, TTS and Call Progress Analysis engines along with the supporting tools and services. You can check them out at http://www.lumenvox.com.

For now, I'll be working part of the time in San Diego at Lumenvox's headquarters and part of the time remotely in Seattle, where I live.

If you're planning a trip to New York for SpeechTek 2012, be sure to stop by Lumenvox's booth and say "hello". Also check out the program at SpeechTek; I'll be making a couple of presentations.

Stay tuned and I'll try to ramp back up my blogging and tweeting, especially from SpeechTek.

Cheers!

Observations from SpeechTek 2011 - New York

2011-08-17T16:33:00.000-07:00

I'm just back from SpeechTek, the major industry conference in the Speech Recognition / Text-To-Speech / Voice Biometrics industry. I spent three great days attending sessions, catching up with friends in the industry and seeing the latest offerings from vendors in the space. I've been attending this conference for better than a dozen years now and it's interesting to see how the industry has evolved and matured during that time. As I flew home to Seattle, I jotted down a few of my thoughts and observations. Three themes seemed to run through the conference:

· Cloud Computing
· Analytics, Analytics & Analytics
· Smart phones (and multi-modal applications)

and each of these themes converged to produce a trend I'd call Adaptive Personalization.

Cloud Computing
I've said it before and it's worth repeating: the clouds are gathering! By that, I mean that the speech recognition industry (and its related applications) is running full speed towards cloud computing. In fact, I think it may be the vanguard of that advance. So many major customer self-service applications today run in the cloud on platforms like Microsoft's TellMe, Nuance's BeVocal, Voxeo, Angel.com or others that it would be impossible to argue it's not a full-fledged trend. Millions of automated self-service calls (both inbound and outbound) pass through each of these today. Supporting this growth of Cloud Computing related to speech recognition is the parallel movement of applications and data to the cloud that's being driven by the advent of Apple's iPad (and other tablet computers) along with the ever growing use of Smart Phones. Both share a common trait: much of their application smarts or functionality comes from cloud based services and data, using a model in which the device is primarily a presentation layer and the functional work and data storage are largely handled in a cloud based platform or platforms. Many of these applications are even mashups which aggregate data and services from multiple cloud based applications. A whole new generation of speech applications is cloud based, using the cloud for application functionality, speech recognition, voice biometrics and data aggregation from multiple sources. This approach allows for incredibly rich applications with access to large data sets far beyond the limited processing power and storage capabilities of the typical individual smart phone.

Analytics, Analytics & Analytics
If there was a single buzzword that prevailed at SpeechTek it was Analytics. The use of the term was so prevalent and so overloaded that it almost lost all meaning (the true sign of a buzzword). Every presentation, every piece of product literature, every vendor booth in the exhibit hall had some reference to analytics. Despite the overuse of the term, it was clear that it represents a major trend in the industry, and I believe one that offers the potential of significant benefit to the end users of these systems. Perhaps we can look to the web and the evolution of e-commerce for some clues for what lies ahead in the speech industry. Analytics has found wide use on the Internet as a tool to understand user behavior and customer needs and to help companies provide more carefully filtered and tailored information to users.

In reality, I think we saw three distinct applications of analytics. (Analytics is defined as the science of analysis; a simple and practical definition, however, would be the application of computer technology, operational research, and statistics to solve problems.) They were: (1) using analytics as a discovery tool in customer service operations to help identify hot spots or problems (such as issues in self-service speech applications or e-commerce web sites), (2) using analytics (and computerized semantic processing) to process data from a variety of channels (Twitter, Facebook, email, blogs, speech based self-service applications, etc.) to identify trends and customer issues and (3) using analytics and data from all customer interface modalities (web, smartphone, IVR, call center agents, SMS messages, Twitter, etc.) to model and infer meaning & intent for individual customers. I believe that this third use is the most significant and potentially most game changing of the three.

Smart phones (and multi-modal applications)
With the rapid growth of smart phones, iPads and similar data/voice enabled portable devices, we're seeing a new generation of applications emerge. The availability of voice, Internet and background access to large amounts of data (especially real time data) is enabling a new generation of mobile applications that are truly multi-modal; that is, they are capable of accepting typed and spoken inputs and delivering visual and audible outputs. This gives users a choice in their preferred communications channel and opens up these devices to more effective and efficient means of delivering complex data, such as lists which don't lend themselves to audio output. A good example of this type of mixed mode application is Nuance's "Dragon Go!" which is available on the iPhone or iPad. With this application, you can speak a simple query phrase. The application captures your utterance, ships it off to be processed in "the cloud" using natural language understanding and returns search results from multiple data sources in visual form. You can get more information about the application from Nuance's web site or Apple's App Store.

Adaptive Personalization
The convergence of these three (Cloud Computing, Analytics and Multi-modal applications) offers us the most compelling theme of all. By having access to large amounts of data and computing power in the cloud, combined with the "intelligence" that can be gleaned from analytics (which can process information about the user from a variety of sources and channels) and with the powerful presentation and input possibilities of multi-modal applications, we can make a leap forward to a "brave new world" where applications understand the context of our actions across multiple channels and products and present us with information, help or services tailored to exactly what we need and exactly when we need it. I'm calling this trend Adaptive Personalization. This kind of personalization goes far beyond the kind of customizing we see in things like a search query using your location data to constrain the choices presented.

An example of this kind of adaptive personalization might be the customer of a financial services company or bank who is applying for a loan on the institution's web site when they encounter a question or issue not addressed in the online application process. Imagine that they might grab their cell phone and call the institution's customer service number for assistance.

When they reach the customer service number, the application identifies the caller from their cell phone ANI information. Then, rather than presenting them with a natural language question or deep menu of choices, the application can use analytics to see that their most recent activity was on the loan application process on the web and offer them the option of being transferred directly to a loan specialist to assist. One early example of a product that supports this is Genesys's Conversation Manager.
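The routing decision in that scenario can be sketched in a few lines of Python. This is only an illustration of the idea, not how any vendor's product works: the activity store, the task-to-queue mapping, the 30 minute freshness window and all of the names below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    channel: str       # e.g. "web", "ivr", "sms"
    task: str          # e.g. "loan_application"
    minutes_ago: int   # how long ago the activity happened


# Hypothetical cross-channel activity store, keyed by the caller's ANI.
RECENT_ACTIVITY = {
    "4255550100": Activity(channel="web", task="loan_application", minutes_ago=4),
}

# Hypothetical mapping from a task to the agent queue best able to help.
TASK_TO_QUEUE = {"loan_application": "loan_specialist"}


def route_call(ani: str, max_age_minutes: int = 30) -> str:
    """Offer a direct transfer when recent activity suggests why the caller
    is phoning; otherwise fall back to the normal main menu."""
    activity = RECENT_ACTIVITY.get(ani)
    if activity is not None and activity.minutes_ago <= max_age_minutes:
        queue = TASK_TO_QUEUE.get(activity.task)
        if queue is not None:
            return f"offer_transfer:{queue}"
    return "main_menu"
```

A caller whose ANI is unknown, or whose last activity is stale, simply gets the usual menu, so the personalization only ever shortens the path and never blocks it.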

I don't think it will be too many years before this will be commonplace in advanced customer service environments. When melded with information about customer channel preferences and proactive notification it will completely turn the customer experience inside out, and in a good way.

That's my two cents worth. Let me know what you think or feel free to add your own ideas and observations in the comments. If you'd like to see my tweets from the conference (and those of the other attendees) search using the tag #SpeechTek.

GM = Google Motors?

2010-05-10T14:41:00.000-07:00

Thanks to Dan Miller at Opus Research for bringing this to my attention. I've included a hyper-link back to his original blog of the story.

After Ford showcased the full spectrum SYNC services on a sub-$16K Fiesta (even taking Kara Swisher for a test sit), GM appears prepared to counter with a broad variety of wireless mobile apps offered in conjunction with Google. In this article in Motor Trend Todd Lassa lays out the basics of a relationship whereby the “open” Android operating system would be licensed for use in GM automobiles.

Lassa asserts that the GM/Google relationship would place emphasis on a better phone-to-car interface, as opposed to the voice control and voice user interface that Microsoft’s Speech Application Group has played up. Thus GM’s approach will enable drivers to use their phones to do such things as start or turn off their cars, lock and unlock doors, and make other adjustments. It was not spelled out explicitly in the article, but given Google’s efforts to invoke automated speech recognition whenever a keyboard comes into play on a mobile device, it is highly likely that all of these functions can be voice controlled – making starting your car another “speechable moment”.

As for the supposition that Android in the car spells the end of OnStar, that is highly unlikely. Lassa notes that turn-by-turn directions through OnStar would become unnecessary because Android phones using Google Maps and a special mount have been successfully deployed for in-car navigation. But OnStar has been sold more as a safety feature and remote diagnostic service. The Android operating system in the car is more likely to augment, rather than compete with OnStar.

The prospect of more automobile-based Android apps is provocative. The car is destined to be the most fertile spawning ground for speech-based apps and the prospects for Android-oriented developers to define a range of “hands-on-the-wheel/eyes-facing-forward” capabilities and activities are very promising. Meanwhile, Ford remains ahead of the game with a well-defined, and now time tested, suite of voice control applications for frequent activities like carrying out phone conversations, messaging and controlling the car’s entertainment system.

Voice Biometrics Conference Next Week

2010-04-30T11:43:00.000-07:00

I'll be in the New York area next week for Opus Research's Voice Biometrics Conference. It's being held at the Hyatt Regency Jersey City on the Hudson. If you're attending and we've not had a chance to meet in person, feel free to say hello. It's not too late to register at http://www.voicebiocon.com/. There is a great lineup of speakers and sessions.

You can follow my comments live from the conference on Twitter at http://twitter.com/jeff_hopper or using the tag #voicebiocon.

Nuance Shutters SpinVox Consumer Service

2010-03-30T10:09:00.000-07:00

It's been just over 3 months since Nuance acquired SpinVox, saving the company from a death spiral. Since the acquisition a string of news stories has surfaced about the financial shenanigans that went on at SpinVox prior to Nuance's purchase. It's a shame that SpinVox fell to such lows. They had a great idea and were developing credible technological solutions.

A recent and not too surprising post on SpinVox's website indicates that they will discontinue their consumer offerings, allowing them to focus on their carrier and network operator business.

A Twitter post from SpinVox stated: "We regret to inform you that SpinVox is no longer supporting individual user accounts. Your account will expire in 7 days."

No word yet on what they intend to do with consumer customers of Jott, the Seattle based competitor of SpinVox that Nuance also purchased.

Speaker Authentication using Voice Biometrics - Now's the time!

2010-01-20T15:22:00.000-08:00

I'm working on a project for one of my clients who's interested in using voice biometrics to authenticate callers. Voice biometrics uses the unique ways that individuals formulate phonemes to create a voice signature that can be used to validate a person's identity at a later time. It seems that the actual use cases today are fairly narrow, mostly password reset applications such as those from Nuance Communications or access to corporate based auto-attendants.

Based on the research I've been doing for my client, several recent environmental and economic changes make this a compelling time to investigate integrating voice biometric based authentication into your transactional environment.

Three factors make this so:

The advent of so many hosted SaaS (software as a service) offerings from experienced voice services firms like Voxeo, TradeHarbor, BeVocal, Convergys, Authentify, Angel.com, CSIdentity, PhoneFactor and others.
Ubiquity of telephones (both land line and cellular) as a transaction end point for authentication across all channels. Even internet based transactions can use outbound phone calls to reach a user to authenticate them.
The ability to combine speech recognition and voice authentication to achieve true multi-factor authentication, with correspondingly higher confidence in the security provided: speech recognition gathers content (something the user knows) and voice authentication verifies the speaker (something the user is).
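To make the third factor concrete, here is a minimal Python sketch of a two-factor decision. It assumes a speech recognizer has already returned the caller's spoken answer as text and a voice biometric engine has already returned a match score between 0 and 1; the function name, the score scale and the 0.8 threshold are all illustrative assumptions, not any vendor's API.

```python
import hmac


def authenticate(spoken_answer: str, expected_answer: str,
                 voice_score: float, threshold: float = 0.8) -> bool:
    """Pass only when both factors succeed.

    Factor 1 (something the user knows): the recognized answer matches
    the answer on file, ignoring case and surrounding whitespace.
    Factor 2 (something the user is): the voiceprint match score from
    the biometric engine meets the threshold.
    """
    said = spoken_answer.strip().lower().encode("utf-8")
    expected = expected_answer.strip().lower().encode("utf-8")
    knows = hmac.compare_digest(said, expected)  # constant-time comparison
    is_speaker = voice_score >= threshold
    return knows and is_speaker
```

In practice the threshold would be tuned against the false accept and false reject rates the business can tolerate, and a near miss on either factor might trigger a fallback question rather than an outright rejection.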

The move to SaaS offerings is a real game changer, significantly lowering the barrier to entry by lowering the cost and the integration complexity with existing applications - regardless of the channel. Since it's voice and there is a phone available almost everywhere to use as the authentication end-point, there is no need to invest in expensive dedicated hardware like fingerprint scanners and cameras for facial recognition.

When faced with the need for more secure access to transactions across a variety of channels (phone, web, smart phones, etc.) voice based authentication can provide high confidence, secure, multi-factor authentication with a lower capital expenditure, less complexity and quicker time to implementation than any other biometric solution that I've examined.

On the BBC news story bashing SpinVox

2009-07-24T11:20:00.000-07:00

I read with interest this morning a BBC story about SpinVox which suggests that the majority of messages on its platform have been heard and transcribed by call centre staff in South Africa and the Philippines rather than being transcribed into text using speech recognition technology.

The article goes on to say that the fact that messages appear to have been read by workers outside of the European Union raises questions about the firm's data protection policy. SpinVox's entry on the UK Data Protection Register says it does not transfer anything outside the European Economic Area.

Anyone with a working knowledge of the voice mail to text transcription industry (which includes other vendors like Jott Networks and Google Voice) understands that no speech recognition process available today can achieve perfectly accurate automated transcriptions for large numbers of voice mail messages from thousands of different callers and the wide variety of audio quality typical of phone calls, especially over poor cellular connections.

Today, almost everyone working in this space uses a combination of speech recognition technologies and human (read: call center agent) based quality assurance (q/a) to obtain transcriptions of a usable quality. The human touch adds two elements: first, it can edit out errors from the automated transcription process and second, the markup data from the human q/a agents can be used to further refine the recognition process.
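One common way to combine the two, sketched below in Python, is to deliver only the transcriptions the recognizer is confident about and queue the rest for a human q/a pass. I don't know how any particular vendor actually splits the work; the per-message confidence value, the 0.85 cutoff and the names here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed recognizer confidence, 0.0 to 1.0


def triage(transcripts, min_confidence=0.85):
    """Split transcripts into those delivered as-is and those sent to a
    human q/a agent for correction before delivery."""
    deliver, review = [], []
    for transcript in transcripts:
        if transcript.confidence >= min_confidence:
            deliver.append(transcript)
        else:
            review.append(transcript)
    return deliver, review
```

The corrected text coming back from the review queue is also the markup data mentioned above: paired with the original audio, it becomes training material for the recognizer.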

In the rare cases where no human q/a is used before delivering the transcription to the end users, the quality of the transcription almost always suffers. By example, has anyone seen "Great" transcription yet from Google Voice?

Unfortunately, the economic model playing out in this industry forces this q/a work to off-shore or third world call centers.

The BBC story is important in its discussion of the data security issues. So far, none of these services has provided sufficient details about the processes they use to assure data security and it does appear on the surface that SpinVox may be violating the EU Data Protection Policy that it's committed to. To quote Ross Perot: "The nut's in the detail". We've not yet seen enough detail to know much about the "nut".

I've used Jott, Google Voice and SpinVox myself (in fact I currently use SpinVox on my cellular voice mail) and I've found all to be useful but none to be superbly accurate in their transcriptions. However, the services with human based q/a have fared much better.

What's your experience been? Are you concerned about the security of your message content when using these services for voicemail transcription? I'd be interested in hearing your comments!

I Finally Got a Google Voice Invitation

2009-07-17T10:35:00.000-07:00

I finally got an invitation to use the beta of Google Voice. It's all set up now and I've added a control to the right hand panel on this blog that allows you to call me with it. Just click on the control and it will prompt you for your phone number. Google Voice will then place a call to you at the number you specified and at the same time will call me and then bridge the two calls together.

I've configured my Google Voice account with all of my possible phone numbers: work, home, mobile and client work site. When it's trying to send me a call, it tries all of them simultaneously and then drops all the lines that I don't answer on. If I'm simply not available, it will route the inbound call to my Google Voice voice mail box. Once you've left your message, it will send me an email and an SMS message (to my mobile phone) with the transcription of your message. It will be interesting to see how Google's voice mail transcription compares to Jott and SpinVox. I'll keep you posted on that.
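The "ring everything at once, keep whichever line answers, drop the rest" behavior is easy to picture as a small concurrency pattern. The Python sketch below simulates it with asyncio; it says nothing about how Google actually implements it, and the `ring` simulation, the delays and the function names are all invented for the example.

```python
import asyncio
from typing import Optional


async def ring(number: str, answer_after: Optional[float]) -> str:
    """Simulate ringing one phone. answer_after is the delay in seconds
    before someone picks up, or None if nobody ever answers."""
    if answer_after is None:
        await asyncio.sleep(3600)  # keeps ringing until cancelled
    else:
        await asyncio.sleep(answer_after)
    return number


async def find_me(phones: dict, timeout: float) -> str:
    """Ring every phone at once and return the first one answered,
    or "voicemail" if none is answered within the timeout."""
    tasks = [asyncio.create_task(ring(n, d)) for n, d in phones.items()]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the lines that were not answered
    await asyncio.gather(*pending, return_exceptions=True)
    if done:
        return done.pop().result()
    return "voicemail"
```

Running `asyncio.run(find_me({"work": None, "mobile": 0.01, "home": 0.05}, timeout=1.0))` picks the mobile line in this simulation, because it is the first to answer.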

If you'd like my Google Voice direct phone number for your address book, it's (425) 502-5613.

Useful Speakers Notes for Presentations

2009-06-29T07:59:00.000-07:00

If you've done much presenting using Microsoft PowerPoint, you know how much work is involved in writing "good" speaker notes to augment the content of your slides. While slides need to be brief in their content and are meant to convey the essence of a point, the speaker's notes provide you with the ability to expand on the details that you actually discuss with your audience.

In my job I do lots of presentations and have found a great solution for capturing the speaker's notes and at the same time giving myself a chance to rehearse "out loud" my talking points for a specific slide. I use Dragon NaturallySpeaking 10 to capture my discussion for each slide in that slide's "speaker notes" field.

I bring up the slides, put my headset on and capture my presentation for each slide. I don't worry about any corrections as I present, rather I focus on the flow of the presentation and how it sounds.

After presenting each slide, I can quickly edit out any minor issues with the captured text and have a great set of detailed notes for my audience and a chance to rehearse and see if there are any refinements I should make to the content.

TellMe Rolls Out New Features

2009-04-29T09:52:00.000-07:00

Tellme®, a subsidiary that Microsoft® purchased two years ago, has announced a Spring Release of new features and product improvements that tighten the integration between TellMe & its parent and take advantage of the deep technical skills from both TellMe and Microsoft's speech research group.

The new release offers improvements focused on:

Positioning TellMe as a service offering in the emerging Cloud Computing arena
Improving core speech recognition quality
Adding new multi-slot recognition capability
Reducing telecom costs

The enhancements include a new VoIP offering through a partnership with Global Crossing; a new TTS voice which leverages the Microsoft Text-to-Speech (TTS) engine; new acoustic models, phonetic dictionaries and grammar products that increase accuracy; a new multi-slot dialog capability which will allow callers a more natural conversational experience while keeping recognition rates high; and new mobile services, including the Windows Mobile 6.5 application. TellMe will now be able to offer not only Nuance and IBM recognition engines but also Microsoft's engine.

I've hyperlinked to the announcement press release in the opening paragraph so you can take a look at the changes in more detail.

TellMe has always been a strong hosting company and these additions will only further strengthen their position in the market.

Talking Gadget Theater II: The Wrath of Kindle

2009-04-28T22:30:00.001-07:00

Apple's iPod Shuffle and Amazon's Kindle II TTS play a scene from Star Trek: Wrath of Khan

The Clouds are Gathering - Speech 2.0

2009-04-27T07:27:00.000-07:00

When I lived in north Texas, it was easy to see a storm coming – the thunderheads gathered on the horizon to the west and you could see them building for hours before they arrived.

After eComm 2009: Emerging Communications Conference I saw the same kind of storm clouds gathering in the west. This time, rather than bringing wind, rain and lightning, I believe they mark a significant shift in the evolution of speech recognition – from hosted VXML based platforms to Telephony “in the cloud” or what’s being called Speech 2.0.

This shift builds on the trend towards cloud computing (the hosting of applications and, more importantly, services in the cloud) and provides a number of potential benefits including:

Reduced or no front end capital investment
Reduced operating expenses and easier upgrade/scalability
Rapid speed to market deployment for new applications
Easier integration with and synchronization with existing web self-service
Access to best-of-breed technologies
Superior network reliability and redundancy
Gigabit interfaces for fast reliable operations
Scalability to meet high or rapidly changing call volumes
Compatibility with existing telephone and Web infrastructure

Several specific Speech 2.0 platforms were presented at eComm, each with its own twist on features, supported interfaces and hosting options. All of them offer some form of free developer account so that you can sign up and try them yourself.

The platforms that I took note of were:

Tropo – Voxeo introduced Tropo, an in-the-cloud development platform that lets users create and deploy speech and telephony applications using a simple API (application programming interface). The API supports application development in Groovy, JavaScript, PHP, Python, and Ruby and is designed as an alternative to the standard XML- and VoiceXML-based platforms that have become so common in the last few years. Applications can incorporate inbound calling via the public switched telephone network, Session Initiation Protocol (SIP), Skype, and iNum, while also providing appropriate connections for outbound calling. Capabilities include robust call control, playing and recording audio, touchtone entry, speech recognition, text-to-speech, and mashups with Web services. Planned application capabilities include call recording, conferencing, and Web services.

Twilio - Twilio provides an in-cloud API for voice communications that leverages existing web development skills, resources and infrastructure. Designed to enable web applications to interact with phone callers, Twilio allows you to use your existing web development skills, existing code, existing servers, existing databases and existing karma to solve these problems quickly without the need to learn some foreign telecom programming languages, or set up an entire stack of PBX software. Twilio provides the infrastructure; you provide the business logic via HTTP. Currently supported interfaces: PHP and REST.
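To give a feel for "business logic via HTTP", here is a small Python sketch of the kind of web endpoint such a platform would call when a phone call arrives. It is written as a plain WSGI application using only the standard library. The XML it returns uses Twilio's `<Response>` and `<Say>` markup, and it reads the caller's number from a form field named `From`; treat both as my assumptions about the request format and check the current Twilio documentation before relying on them. The function names and greeting text are made up for the example.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape


def build_twiml(message: str) -> str:
    """Return a minimal XML document asking the platform to speak a message."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(message)}</Say></Response>"
    )


def app(environ, start_response):
    """WSGI entry point: read the posted form and answer with XML."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
    # "From" is assumed to carry the caller's number; fall back if absent.
    caller = form.get("From", ["unknown caller"])[0]
    body = build_twiml(f"Thanks for calling. Your number is {caller}.").encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/xml"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it; the telephony platform would then be pointed at that URL.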

IfByPhone – IfByPhone is a hosted voice application and platform company with a simplified approach to the deployment of stand-alone and web-integrated voice services for small and medium sized businesses (SMB). Through a combination of telephony and web services IfByPhone offers you prebuilt applications or a programmable API which enables you to create inbound or outbound calls or other IVR functionality. The configuration and deployment tools look and feel just like Web applications, and require no previous knowledge of telephony programming or terminology.

In the spirit of full disclosure, I use a mash-up of IfByPhone’s applications, a “Click to Call” button on my own website which allows callers to connect to me directly from my website. It prompts the caller for their phone number, places a call to them, and then invokes a “Find Me” application that places simultaneous calls to me at several possible numbers and bridges whichever number I answer on to the caller.

Jaduka – Jaduka provides a SOAP-based Web Services interface which enables companies to easily blend voice into their workflow activities. Jaduka's Web services API makes adding the benefits of voice communication to enterprise applications as easy as constructing a mash-up.

I’m sure there are other platforms that I’ve missed. If I’ve overlooked you, drop me a note.

These cloud-based telephony providers start with a hosted platform, accessible via the open Internet, and provide an API whose premise is to enable developers not familiar with speech recognition or telephony to incorporate a speech based channel into their existing application infrastructure without having to learn a great deal of domain specific skills or development languages, like VXML or CCXML, that are typically used today to build voice enabled applications. Each varies slightly but in general the primary interfaces are built on common web based development languages and protocols such as REST, SOAP, PHP, Groovy, JavaScript, Python, and Ruby. In addition, some of these platforms have also developed more complete applets or small applications which can be used right out of the box or with a nominal amount of configuration.

In theory, this approach makes it easy for developers to add voice interfaces to existing applications. But as your mother reminded you when you were a child: “just because you can does not mean you should”. Speech enabled applications have taken a long time to reach mainstream success. There are many reasons for this, one of which was the need to develop experience in what’s required to provide users with an interface that works well for the caller. There is as much art to this as there is science and technology. As with many endeavors, our first attempts often fail to meet expectations. Developers using these cloud based platforms will still need to design and implement good user interface practices for voice, which are much different than those applied in visual application interfaces. Much has been written about this, so I’ll leave that topic to another conversation.

This evolution in platforms, when added to the strong uptake in hosted platforms, is sure to have significant impact on the business models in the speech recognition and telephony industries. With the concentration of fewer and larger consumers of the core technology, pricing and volume of sales will no doubt change for firms like Nuance and the various IVR hardware vendors.

Over the next month, I'll explore these new cloud based platforms individually by building an application on each and sharing my experience and results with you. I'll wrap up this series of posts with some kind of comparison post giving you my perspective on the pros and cons of each. Stay tuned!

If you're already using or experimenting with one of these cloud based telephony platforms I'd love to hear from you about your experience as well.

Speech Technology At Its Finest!

2009-04-24T11:19:00.001-07:00

Thanks to the Speech Tech Blog for bringing this to my attention and enjoyment. I thought I'd pass it along.

Amazon's Kindle 2 TTS & Apple's iPod Shuffle TTS perform a scene from Blade Runner . . .

Collecting SLM Data in an existing self service application

2009-04-09T10:36:00.000-07:00

I had the opportunity to call Comcast (my internet service provider) this week about a problem with my cable modem.

The call was answered by the usual self-service application that asked me a series of questions (a directed dialog style) including my home phone number and then if I was calling about my cable TV service, Internet service or phone service. At that point the application took an unexpected change of direction by asking me to state why I was calling. After I said "Internet not working" the application informed me that it collected that information for a future application enhancement, then went back to the normal directed dialog style with 5 or 6 more questions. No doubt, the data collected will be paired with my ultimate destination and purpose in the call to help develop a natural language component.

I immediately thought of a post by Phillip Hunter this week on his Design-Outloud blog and the related article in SpeechTech Magazine: Is it Natural? I'd encourage you to read both, especially Phillip's blog post. These will give you a good overview of the challenges to this approach.

It's obvious that Comcast is gathering data to add some kind of "Natural Language" capability to their customer self service application. This would imply that Comcast has a very high call volume for this application, as it typically costs a large 6 figure number to build a Natural Language call steering application based on SLM (statistical language modeling). It's often difficult to justify the expense of building and maintaining this kind of approach.

While this approach is often very productive and cost justified for high volume callers, an equally workable approach with a slightly narrower focus can be developed using a typical SRGS grammar for a much lower cost, using a technique known as grammatical inference.

Essentially grammatical inference tools rely on artificial intelligence to build your SRGS and GSL grammars automatically based on example utterances. Just as you build an SLM grammar using caller utterances, with grammatical inference, you feed the utterances into a ‘grammar learning tool’ which outputs a set of grammar rules in whatever format you require (e.g. GSL or SRGS). The grammar learner has a fundamental knowledge of the language that you are building the grammar for (e.g. English) and combines this with your utterances to produce a set of grammar rules. Unlike an SLM approach, grammatical inference allows you to build a usable grammar with only a very small number of sample utterances (‘tens of utterances’ rather than ‘tens of thousands’). Of course, if more training data is available, you can feed the grammar learner as much as you like. One such tool that I've used is offered by Inference Communications Pty. Ltd. I'm sure others are out there waiting to be found.
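To show the shape of what goes in and what comes out, here is a deliberately naive Python sketch that turns a handful of sample utterances into an SRGS XML grammar. It only enumerates the distinct phrases as alternatives; a real grammatical inference tool generalizes well beyond the examples it is given, and this is not how the commercial tool I mentioned works. The function name, the rule id and the default language are my own choices for the illustration.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def utterances_to_srgs(utterances, rule_id="reason", lang="en-US"):
    """Build an SRGS XML grammar whose root rule accepts any one of the
    distinct sample utterances (normalized to lower case)."""
    ET.register_namespace("", SRGS_NS)  # emit SRGS as the default namespace
    grammar = ET.Element(f"{{{SRGS_NS}}}grammar", {
        "version": "1.0",
        "root": rule_id,
        "mode": "voice",
        f"{{{XML_NS}}}lang": lang,
    })
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule",
                         {"id": rule_id, "scope": "public"})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    phrases = sorted({u.strip().lower() for u in utterances if u.strip()})
    for phrase in phrases:
        ET.SubElement(one_of, f"{{{SRGS_NS}}}item").text = phrase
    return ET.tostring(grammar, encoding="unicode")
```

Feeding it a few caller phrases such as "Internet not working" and "my bill is wrong" yields a grammar with one `<item>` per distinct phrase, which is roughly the starting point an inference tool would then generalize from.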

When working with my clients, I often find that major speech vendors jump straight to a recommendation of Natural Language (with glimmers of large licensing and professional services fees in their eyes) when a more conservative and less expensive approach "may" work just as well.

Regardless of the direction you choose, consult with a skilled voice interaction designer who can help you parse through the pros & cons for each approach and choose the right one for you and your callers.

P.S. Also worth mentioning this week is a new web site from The Association for Voice Interaction Design. If you have anything more than a passing interest in voice interaction design, they have some great references to other blogs, websites and publications.

Despite being a skeptic . . . I'm impressed

2009-04-02T14:50:00.001-07:00

You would think that since I've made a career of working with speech recognition technology, I would be wildly enthusiastic about it and, just like a carpenter with a hammer, would see every problem as a nail. Not so with me!

I got into this technology reluctantly and have remained largely skeptical ever since. It's just in the past few years I have become truly comfortable in recommending voice recognition technology as a viable and solid technology to be used in business solutions. Even then, I'm careful and conservative in when and how I make that recommendation. I have to be 100 percent confident that the proposed solution can succeed and will prove successful both in the mind of the user and the business.

Along the way I've tried, sampled and tested almost every offering, including the desktop recognition packages from IBM's ViaVoice, Nuance's Dragon Naturally Speaking and most recently the speech recognition included in Microsoft's Windows Vista. After every one of these trials, I've uninstalled or disabled the recognition package and gone back to my keyboard. While they worked to a greater or lesser degree, they never "excited me" and often were more trouble than they were worth.

Despite previous mediocrity, I keep trying new versions or products. This week I finally got around to installing Nuance's Dragon Naturally Speaking 10, after it lay on my desk for months.

The installation process was quick and easy with no complications. Once installed, the product walked me through a basic setup process to set and calibrate the audio input source (I have multiple devices with microphones attached to my Dell Vostro with Windows Vista) and complete a short training exercise by reading a Dave Barry column so that Naturally Speaking could create a user profile for me. All done in less than half an hour from start to finish, including time to download and install the Microsoft Visual C++ Redistributable.

For those of you who know me in person, you'll no doubt remember that I have a distinct southern accent, having grown up near Memphis, TN. One of the user configuration options was a choice of languages that included General US or Southern US. I resisted temptation and chose General just to test Naturally Speaking a bit more.

Now after a few days of use, I have to admit amazement! The recognition is incredibly accurate and quick. I've not typed an instant message or email so far this week. All have been dictated using Nuance's Dragon Naturally Speaking 10. In fact this entire blog entry was done using speech dictation.

I've only experienced two problems:

1) Some issues with product names or proper names like Vostro, which require correction.

2) I think I need a more highly directional microphone for my headset. I listen to my local public radio station in the background while at my desk and if I'm quiet without telling the software to stop listening, it will start transcribing the conversation on KUOW 94.9 that's playing in the background. This might make it difficult to use Dragon in a loud office or call center environment.

That's my two cents worth . . . and if you're interested, I have a copy of Dragon Naturally Speaking 9.0 that I never got around to installing. It's for sale on eBay!

The Good, The Bad and The Ugly - Real World Examples of Speech Enabled Self-Service

2009-04-02T10:44:00.001-07:00

When discussing issues related to a particular approach or dialog in a speech recognition based application with my clients, I find it's very helpful to have them call and interact with examples in a real application. Often interacting with a real world example illuminates their thinking more clearly than any intellectual discussion. It makes the point I'm trying to illustrate very visceral for them, having experienced the pain or the pleasure.

I keep a document with several dozen of these real world examples which I give my clients as a reference. It's time for me to refresh the content for this document and I thought I'd ask for your input. I've put together a survey to collect this information. Click Here to take survey

Once I've gathered the results and updated my reference document, I'll post a link to it here and discuss some of the most noteworthy examples.

Apple's new Text-to-speech interface for the iPod shuffle

2009-03-11T10:40:00.000-07:00

Apple has introduced an all-new iPod® shuffle, which features a Text-To-Speech interface (called VoiceOver). With the press of a button, you can play, pause, adjust volume, switch playlists and hear the name of the song and artist. The new Shuffle can speak 14 languages including English, Czech, Dutch, French, German, Greek, Italian, Japanese, Mandarin Chinese, Polish, Portuguese, Spanish, Swedish, and Turkish.

You can read much more about the new iPod on Apple's website and you can hear samples of VoiceOver on the Apple website.

(Thanks to Adam B. at SpeechTechMag.com's blog for bringing this news to my attention).

While discussing this on Twitter this morning (Yes, I admit I've become a Twitterholic!) I sparked a conversation thread about what various folks think about TTS in general and the new VoiceOver feature on the iPod® shuffle.

I have my own opinions, but as a good user interface student, I thought I'd ask for your opinion first. Once I've gathered a good sample of responses, I'll post the results and add my own "two cents worth".

Use the link above to listen to the new VoiceOver interface then take my short survey on SurveyMonkey.com

Take a few minutes to listen to the new iPod® interface, take the survey and let me know what you think!

Our menus have changed....more IVR Humor

2008-11-11T11:48:00.000-08:00

It's almost a cliche to hear an opening statement in an IVR that says something to the effect of "please pay careful attention as our menus have recently changed". We've all heard that kind of introduction when calling a self-service application and probably thought, "Why? Just get to the menu items."

Most experienced voice user interface designers agree that you should never include this kind of statement in your prompting. It's simply not needed and can be problematic in some situations. If your callers are infrequent users, they couldn't care less about the message and in fact it can distract from the real menu items. If your callers are "power users" who call frequently, they will discover the changes for themselves immediately and course correct at once without the need for the warning prompts.

I ran across a great parody of the classic "our menus have changed" application at 888-583-2801. Give it a call and have a little chuckle!

chasing change: Nailing the Revenue Model: Jott.com

2008-11-08T10:52:00.000-08:00

chasing change: Nailing the Revenue Model: Jott.com

A Little IVR Humor . . .

2008-11-06T11:04:00.000-08:00

It's sad, but far too many people really expect to find something like this when they find an automated customer service system answering at the other end of the phone call . . . I ran across this on the web and it had me laughing out loud.

Thanks to vCom Solutions, the sponsor of this little bit of humor.

It is a perfect example of how not to do self-service and why speech is a better option when compared to DTMF (touch tone) self-service.

Click on the posting title to hear for yourself . . . you'll be taken to a new URL.

Enjoy!

California Bans Texting and Driving

2008-10-03T08:09:00.000-07:00

We all know in our "gut" that texting on our cell phones while driving or conducting any other task that requires high concentration and motor skills is seriously distracting. A good example of this is the recent fatal train accident in southern California. The NTSB (National Transportation Safety Board) has determined that the engineer of the train that caused the accident was sending text messages from his cell phone during the last few minutes before the accident and the belief is that this may have contributed to the accident.

Last fall at Nuance Communications' CONVERSATIONS Conference, one of the keynote sessions involved a demonstration of just how dangerous and distracting "texting" was. You can see a video of this demonstration on YouTube at Amazing Race: Distracted Driving.

This past week California's Governor, Arnold Schwarzenegger, said "Hasta la vista" to texting while driving and terminated a loophole in California's vehicle code that banned drivers from talking on cell phones while driving without a hands-free device but let them communicate via text messages. The Associated Press reported that the governor signed the law, which will take effect on January 1, 2009. My home state of Washington had already banned text messaging while driving.

Given the usefulness of these devices it's difficult to imagine that we'll break our addiction to the BlackBerry or iPhone -- they don't call it a "CrackBerry" without a reason. That said, speech recognition technology offers a straightforward way to improve safety in the way we use these devices without taking away their convenience. Several services and embedded applications have already reached the market which address this problem. One example that I've blogged about before is Jott Networks. Jott allows you to dial their service and then send notes to yourself and others entirely using voice recognition based navigation and dictation. Other companies with similar offerings include SpinVox and SimulSays. When combined with voice based dialing and hands free access, these services remove most of the physical contact required to interact with mobile devices, eliminating much of the distraction that occurs when you interact via the keyboard.

This lesson can be carried over to the customer service world. Many IVR based self service applications in use today require serious use of the DTMF keypad for entry of things like account numbers, choices from lists, ticker symbols, etc. and pose the same risk to drivers as other kinds of text messaging. Given the prevalence in the use of cell phones these days, I think a strong case can be made for speech enabling applications which have complex DTMF (touch tone) entry requirements simply as a safety step for callers and to avoid potential legal issues which may become a problem as more and more states impose similar bans on non-hands-free use of cell phones.

Jott - The Ultimate Speech Self Service Application

2008-08-13T09:41:00.000-07:00

While I'm a strong believer in using speech recognition, I'm not easily given over to hyperbole about speech applications. Frankly I find too few applications that wow me with their simplicity, their elegance or their usefulness. Even less frequently do I run across one that I find useful enough that I actually put the number in my cell phone's dialing directory (yes, it's speech enabled) and use often. That said, I've become addicted to Jott!

For those of you who haven't yet run across it, Jott is the brainchild of a Seattle based company, Jott Networks, that operates a voice to text service that makes staying organized and in touch easy. Simply put, Jott converts your voice into emails, text messages, reminders, lists and appointments.

After signing up for Jott's service at their web site, you simply call the toll free number and your account is recognized from your caller ID information. The application asks you "Who you want to Jott", recognizes that request and then records your message. Once recorded, your message is passed through a speech recognition process to convert it to text, then sent on to the person, application or list you specified. It's an elegantly simple interface that's intuitive and easy to use.

I use Jott to save reminder messages to myself, keep a record of business expenses and vehicle mileage throughout the day and add entries to my calendar without the need to pull out my laptop and find a WiFi connection. It's also great for sending "To Do" list items to my family members or co-workers. Direct application links are provided to dozens of tools like Google Calendar, Blogger, Twitter and many more so that your Jotts can be sent directly to them. They also have a developer's site so that you can develop web service links to other applications that aren't already provided.

I find it especially useful from my cell phone since its user interface requires no use of hands, so with my headset I'm completely hands free for the entire transaction.

For you iPhone addicts, Jott has just added a mobile notepad that turns your voice into notes on your iPhone. It's available for download on Apple's App Store and there's a link to it on Jott's website at www.jott.com.

Hands Off - it's the law!

2008-06-30T11:02:00.000-07:00

Beginning this week, drivers in California and Washington join those in a list of other states who can't use their cell phones while driving to talk or send text messages unless they're using them in a hands free mode. In my home state of Washington, drivers who read and compose text messages or talk on a cell phone without a hands-free device could face a $101 ticket. The text-messaging ban took effect Jan. 1; the cell-phone law will be enforced starting on July 1st. Drivers are exempt in some situations, including emergencies, and neither offense will be enough to get a driver pulled over by the police.

Several of my clients still have complex DTMF applications in service that require lots of DTMF entry and serve large cell phone caller populations. They are moving rapidly to migrate those applications to speech recognition based self-service, both to avoid any potential liability issues and to make sure that the applications remain available during the periods when callers actually want to use them, namely while driving or otherwise in transit.

A stunning 80 percent of mobile phone owners talk while driving, according to a recent survey by the Nationwide Mutual Insurance Company. It's a major distraction—some have even equated using a cell phone behind the wheel with driving under the influence, since reaction times can be slowed during a call.

According to the National Highway Traffic Safety Administration, there are 115 road fatalities each day in the United States and distracted driving causes 80 percent of road accidents.

I've been making the recommendation to my clients that these facts alone justify the move from DTMF to speech recognition based functionality for any self service application with modest to high cellular caller populations and more than the simplest of input requirements.

I'm curious if anyone else is seeing this move to hands free cellular use as much of a driver for moving DTMF applications to speech recognition based self service or if anyone else has had the issue of liability raised.

Despite having a reputation as an early adopter . . .

2007-03-27T14:59:00.000-07:00

. . . of almost any technology, I'm late in joining the world of blogging.

For just over a year now, I've been helping a client navigate the diverse world of speech recognition. They began with a capacity problem in their existing call centers and the idea of adding speech recognition based self-service, both to relieve the capacity problem and to improve their overall customer service experience.

It's been an unusual experience for me. My typical engagement is very focused. I come in to help solve some specific problem or issue and depart as soon as that job is completed. This engagement has been very different in that I joined with my client for their entire journey from idea to completed delivery and post-implementation management. It's been one of those rare opportunities to start with a blank page and to apply everything I've ever learned about "doing this the right way". It's been my job to act as a mentor, educating them, guiding them and making sure they "stayed between the white lines on the road" so to speak.

I decided to start this blog as a way of capturing what I've learned in the process and share that experience in a way that would help others new to speech recognition accelerate their own learning curve. Both my client and I still have some unanswered questions and I thought that a blog might be an interesting way to create a community of others with similar questions, issues and experiences who are interested in sharing their own journey.

As this moves forward I hope to share with you what we've learned and explore what we've yet to answer.

Cheers,
-Jeff