Friday, July 24, 2009

On the BBC news story bashing SpinVox

I read with interest this morning a BBC story about SpinVox which suggests that the majority of messages on its platform have been heard and transcribed by call centre staff in South Africa and the Philippines, rather than being converted into text by speech recognition technology.

The article goes on to say that the apparent reading of messages by workers outside the European Union raises questions about the firm's data protection policy. SpinVox's entry on the UK Data Protection Register states that it does not transfer anything outside the European Economic Area.

Anyone with a working knowledge of the voicemail-to-text transcription industry (which includes other vendors like Jott Networks and Google Voice) understands that no speech recognition process available today can produce perfectly accurate automated transcriptions at this scale: large numbers of voice mail messages from thousands of different callers, the wide variety of audio quality typical of phone calls, and especially poor cellular connections all work against it.

Today, almost everyone working in this space uses a combination of speech recognition technology and human (read: call center agent) quality assurance (q/a) to obtain transcriptions of usable quality. The human touch adds two elements: first, it edits out errors from the automated transcription; second, the markup data from the human q/a agents can be used to further refine the recognition process.
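
To make the hybrid model concrete, here's a minimal sketch (in Python, with entirely hypothetical object names; this is no vendor's actual pipeline) of how confidence-based routing to human q/a typically works: messages the recognizer is sure about go straight out, the rest get a human pass, and the corrections are banked as training data.

```python
# Hypothetical sketch of a hybrid transcription pipeline. The recognizer,
# qa_queue and training_log objects are stand-ins, not any vendor's API.

CONFIDENCE_THRESHOLD = 0.85  # below this score, a human reviews the message

def transcribe_message(audio, recognizer, qa_queue, training_log):
    text, confidence = recognizer.transcribe(audio)   # automated first pass
    if confidence >= CONFIDENCE_THRESHOLD:
        return text                                   # deliver as-is
    corrected = qa_queue.review(audio, text)          # human agent edits the draft
    training_log.append((audio, text, corrected))     # markup feeds model tuning
    return corrected
```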

In the rare cases where no human q/a is applied before the transcription is delivered to the end user, quality almost always suffers. For example, has anyone seen a "great" transcription yet from Google Voice?

Unfortunately, the economic model playing out in this industry pushes this q/a work to offshore or third-world call centers.

The BBC story is important in its discussion of the data security issues. So far, none of these services has provided sufficient details about the processes they use to assure data security, and it does appear on the surface that SpinVox may be violating the EU data protection policy it has committed to. To quote Ross Perot: "The nut's in the detail". We've not yet seen enough detail to know much about the "nut".

I've used Jott, Google Voice and SpinVox myself (in fact, I currently use SpinVox on my cellular voice mail) and I've found all of them useful but none superbly accurate in their transcriptions. However, the services with human-based q/a have fared much better.

What's your experience been? Are you concerned about the security of your message content when using these services for voicemail transcription? I'd be interested in hearing your comments!

Friday, July 17, 2009

I Finally Got a Google Voice Invitation

I finally got an invitation to use the beta of Google Voice. It's all set up now, and I've added a control to the right-hand panel of this blog that allows you to call me with it. Just click on the control and it will prompt you for your phone number. Google Voice will then place a call to you at the number you specified, call me at the same time, and bridge the two calls together.

I've configured my Google Voice account with all of my possible phone numbers: work, home, mobile and client work site. When it's trying to send me a call, it tries all of them simultaneously and then drops all the lines that I don't answer on. If I'm simply not available, it will route the inbound call to my Google Voice voice mail box. Once you've left your message, it will send me an email and an SMS message (to my mobile phone) with the transcription of your message. It will be interesting to see how Google's voice mail transcription compares to Jott and SpinVox. I'll keep you posted on that.

If you'd like my Google Voice direct phone number for your address book, it's (425) 502-5613.

Monday, June 29, 2009

Useful Speaker Notes for Presentations

If you've done much presenting with Microsoft PowerPoint, you know how much work is involved in writing "good" speaker notes to augment the content of your slides. While slides need to be brief and are meant to convey the essence of a point, the speaker notes give you room to expand on the details that you actually discuss with your audience.

In my job I do lots of presentations, and I've found a great solution for capturing speaker notes that at the same time gives me a chance to rehearse my talking points for a specific slide "out loud". I use Dragon NaturallySpeaking 10 to capture my discussion of each slide in that slide's "speaker notes" field.

I bring up the slides, put my headset on and capture my presentation for each slide. I don't worry about any corrections as I present, rather I focus on the flow of the presentation and how it sounds.

After presenting each slide, I can quickly edit out any minor issues in the captured text. The result is a great set of detailed notes for my audience, plus a chance to rehearse and see whether there are any refinements I should make to the content.

Wednesday, April 29, 2009

Tellme Rolls Out New Features

Tellme®, a subsidiary that Microsoft® purchased two years ago, has announced a Spring Release of new features and product improvements that tighten the integration between Tellme and its parent and take advantage of the deep technical skills of both Tellme and Microsoft's speech research group.

The new release offers improvements focused on:
  • Positioning Tellme as a service offering in the emerging Cloud Computing arena
  • Improving core speech recognition quality
  • Adding new multi-slot recognition capability
  • Reducing telecom costs

The enhancements include a new VoIP offering through a partnership with Global Crossing; a new TTS voice that leverages the Microsoft Text-to-Speech (TTS) engine; new acoustic models, phonetic dictionaries and grammar products that increase accuracy; a new multi-slot dialog capability, which gives callers a more natural conversational experience while keeping recognition rates high; and new mobile services, including a Windows Mobile 6.5 application. Tellme will now be able to offer not only the Nuance and IBM recognition engines but also Microsoft's own engine.
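
For readers who haven't run into the term, "multi-slot" means a single utterance can fill several dialog fields at once, instead of the caller answering one directed question per field. Here's a toy Python illustration of the idea (my own sketch, not Tellme's technology, which uses real grammars and statistical models rather than pattern matching):

```python
import re

# Illustrative only: "multi-slot" recognition lets one utterance fill
# several dialog fields at once. A real platform does this with SRGS
# grammars or statistical models; this toy just pattern-matches.

UTTERANCE = "I'd like to fly from Seattle to Boston on Friday"

pattern = re.compile(r"from (?P<origin>\w+) to (?P<destination>\w+) on (?P<day>\w+)")
match = pattern.search(UTTERANCE)
if match:
    slots = match.groupdict()
    print(slots)  # {'origin': 'Seattle', 'destination': 'Boston', 'day': 'Friday'}
```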

I've hyperlinked to the announcement press release in the opening paragraph so you can take a look at the changes in more detail.

Tellme has always been a strong hosting company, and these additions will only further strengthen its position in the market.

Tuesday, April 28, 2009

Talking Gadget Theater II: The Wrath of Kindle

Apple's iPod Shuffle and Amazon's Kindle 2 TTS play a scene from Star Trek II: The Wrath of Khan

Monday, April 27, 2009

The Clouds are Gathering - Speech 2.0

When I lived in north Texas, it was easy to see a storm coming – the thunderheads gathered on the horizon to the west and you could see them building for hours before they arrived.

After eComm 2009: Emerging Communications Conference, I saw the same kind of storm clouds gathering in the west. This time, rather than bringing wind, rain and lightning, I believe they mark a significant shift in the evolution of speech recognition – from hosted VXML-based platforms to telephony “in the cloud”, or what’s being called Speech 2.0.

This shift builds on the trend toward cloud computing – the hosting of applications and, more importantly, services in the cloud – and provides a number of potential benefits, including:


  • Reduced or no front-end capital investment
  • Reduced operating expenses and easier upgrades/scalability
  • Rapid speed-to-market for new applications
  • Easier integration and synchronization with existing web self-service
  • Access to best-of-breed technologies
  • Superior network reliability and redundancy
  • Gigabit interfaces for fast, reliable operations
  • Scalability to meet high or rapidly changing call volumes
  • Compatibility with existing telephone and Web infrastructure

Several specific Speech 2.0 platforms were presented at eComm, each with its own twist on features, supported interfaces and hosting options. All of them offer some form of free developer account, so you can sign up and try them yourself.

The platforms that I took note of were:

Tropo - Voxeo introduced Tropo, an in-the-cloud development platform that lets users create and deploy speech and telephony applications using a simple API (application programming interface). The API supports application development in Groovy, JavaScript, PHP, Python, and Ruby and is designed as an alternative to the standard XML- and VoiceXML-based platforms that have become so common in the last few years. Applications can incorporate inbound calling via the public switched telephone network, Session Initiation Protocol (SIP), Skype, and iNum, while also providing appropriate connections for outbound calling. Capabilities include robust call control, playing and recording audio, touchtone entry, speech recognition, text-to-speech, and mashups with Web services. Planned application capabilities include call recording, conferencing, and Web services.
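
To give a flavor of the scripting style, here's roughly what a trivial Tropo application looks like in Python. I'm writing this from memory, so treat the function names and the options syntax as assumptions and check Tropo's documentation before relying on them:

```python
# A sketch of a Tropo-hosted Python script. Tropo maps telephony actions
# onto plain function calls; the exact names and option syntax here are
# recalled from the docs and may not be precise.

answer()                                      # pick up the inbound call
say("Welcome to my demo application.")        # text-to-speech prompt
result = ask("Say sales or support.",         # speech recognition against choices
             {"choices": "sales, support"})
say("You asked for " + result.value)          # read back the recognized choice
hangup()                                      # end the call
```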

Twilio - Twilio provides an in-cloud API for voice communications that leverages existing web development skills, resources and infrastructure. Designed to enable web applications to interact with phone callers, Twilio lets you use your existing web development skills, existing code, existing servers, existing databases and existing karma to solve these problems quickly, without the need to learn a foreign telecom programming language or set up an entire stack of PBX software. Twilio provides the infrastructure; you provide the business logic via HTTP. Currently it offers PHP and REST interfaces.
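
The model is worth a quick sketch: when a call arrives, Twilio fetches a URL on your web server, and your application replies with TwiML (Twilio's XML vocabulary of verbs like Say and Gather) telling it what to do next. Here's a minimal, self-contained Python example using the standard library's wsgiref server; the Say verb is documented TwiML, but consider the whole thing a sketch rather than production code:

```python
# Minimal sketch of the Twilio request/response model: Twilio requests your
# URL when a call comes in, and you reply with TwiML telling it what to do.
from wsgiref.simple_server import make_server

TWIML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>Hello from my web application.</Say>
</Response>"""

def app(environ, start_response):
    # Every incoming call hits this handler; we always speak one line.
    start_response("200 OK", [("Content-Type", "text/xml")])
    return [TWIML.encode("utf-8")]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()  # point Twilio at this URL
```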

IfByPhone - IfByPhone is a hosted voice application and platform company with a simplified approach to the deployment of stand-alone and web-integrated voice services for small and medium-sized businesses (SMBs). Through a combination of telephony and web services, IfByPhone offers prebuilt applications or a programmable API which enables you to create inbound or outbound calls or other IVR functionality. The configuration and deployment tools look and feel just like Web applications and require no previous knowledge of telephony programming or terminology.

In the spirit of full disclosure, I use a mash-up of IfByPhone’s applications: a “Click to Call” button on my own website which allows visitors to connect to me directly. It prompts the caller for their phone number, places a call to them, and then invokes a “Find Me” application that places simultaneous calls to me at several possible numbers and bridges whichever number I answer to the caller.
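
The “Find Me” logic is easy to describe in code terms: ring several numbers at once, keep whichever leg answers first, drop the rest, and bridge the winner to the waiting caller. A hypothetical Python sketch (not IfByPhone’s actual API; the platform object and its methods are invented for illustration):

```python
# Hypothetical "Find Me" logic; the platform object and its methods are
# invented for illustration and do not correspond to any real API.

def find_me(platform, caller_leg, my_numbers):
    legs = [platform.dial(number) for number in my_numbers]  # simultaneous calls
    answered = platform.wait_for_first_answer(legs)          # first one to pick up
    for leg in legs:
        if leg is not answered:
            leg.hangup()                                     # drop the unanswered legs
    platform.bridge(caller_leg, answered)                    # connect the caller to me
```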

Jaduka - Jaduka provides a SOAP-based Web Services interface which enables companies to easily blend voice into their workflow activities. Jaduka's Web services API makes adding the benefits of voice communication to enterprise applications as easy as constructing a mash-up.

I’m sure there are other platforms that I’ve missed. If I’ve overlooked you, drop me a note.

These cloud-based telephony providers start with a hosted platform accessible via the open Internet and provide an API whose premise is to let developers unfamiliar with speech recognition or telephony incorporate a speech channel into their existing application infrastructure, without having to learn the domain-specific skills or development languages, like VXML or CCXML, typically used today to build voice-enabled applications. Each varies slightly, but in general the primary interfaces are built on common web development languages and protocols such as REST, SOAP, PHP, Groovy, JavaScript, Python, and Ruby. In addition, some of these platforms have also developed more complete applets or small applications which can be used right out of the box or with a nominal amount of configuration.

In theory, this approach makes it easy for developers to add voice interfaces to existing applications. But as your mother reminded you when you were a child: “just because you can doesn't mean you should”. Speech-enabled applications have taken a long time to reach mainstream success. There are many reasons for this, one of which was the need to develop experience in what’s required to give users an interface that works well for the caller. There is as much art to this as there is science and technology. As with many endeavors, our first attempts often fail to meet expectations. Developers using these cloud-based platforms will still need to design and implement good voice user interface practices, which are much different from those applied in visual application interfaces. Much has been written about this, so I’ll leave that topic for another conversation.

This evolution in platforms, added to the strong uptake of hosted platforms, is sure to have a significant impact on the business models in the speech recognition and telephony industries. With the concentration of fewer and larger consumers of the core technology, pricing and sales volumes will no doubt change for firms like Nuance and the various IVR hardware vendors.

Over the next month, I'll explore these new cloud-based platforms individually by building an application on each and sharing my experience and results with you. I'll wrap up the series with a comparison post giving you my perspective on the pros and cons of each. Stay tuned!

If you're already using or experimenting with one of these cloud-based telephony platforms, I'd love to hear about your experience as well.

Friday, April 24, 2009

Speech Technology At Its Finest!

Thanks to the Speech Tech Blog for bringing this to my attention and enjoyment. I thought I'd pass it along.

Amazon's Kindle 2 TTS & Apple's iPod Shuffle TTS perform a scene from Blade Runner . . .


Thursday, April 09, 2009

Collecting SLM Data in an existing self service application

I had the opportunity to call Comcast (my internet service provider) this week about a problem with my cable modem.

The call was answered by the usual self-service application, which asked me a series of questions (in a directed dialog style), including my home phone number and whether I was calling about my cable TV service, Internet service or phone service. At that point the application took an unexpected change of direction by asking me to state why I was calling. After I said "Internet not working", the application informed me that it was collecting that information for a future application enhancement, then went back to the normal directed dialog style with 5 or 6 more questions. No doubt the data collected will be paired with my ultimate destination and purpose in the call to help develop a natural language component.

I immediately thought of a post by Phillip Hunter this week on his Design-Outloud blog and the related article in SpeechTech Magazine: Is it Natural? I'd encourage you to read both, especially Phillip's blog post. They will give you a good overview of the challenges of this approach.

It's obvious that Comcast is gathering data to add some kind of "Natural Language" capability to its customer self-service application. This would imply that Comcast has a very high call volume for this application, as it typically costs a large six-figure sum to build a Natural Language call steering application based on SLM (statistical language modeling). It's often difficult to justify the expense of building and maintaining this kind of approach.

While this approach is often very productive and cost-justified at high call volumes, an equally workable approach with a slightly narrower focus can be developed for much less using a typical SRGS grammar and a technique known as grammatical inference.

Essentially, grammatical inference tools rely on artificial intelligence to build your SRGS and GSL grammars automatically from example utterances. Just as you build an SLM grammar using caller utterances, with grammatical inference you feed the utterances into a ‘grammar learning tool’ which outputs a set of grammar rules in whatever format you require (e.g. GSL or SRGS). The grammar learner has a fundamental knowledge of the language you are building the grammar for (e.g. English) and combines this with your utterances to produce the rules. Unlike an SLM approach, grammatical inference lets you build a usable grammar from a very small number of sample utterances (‘tens of utterances’ rather than ‘tens of thousands’). Of course, if more training data is available, you can feed the grammar learner as much as you like. One such tool that I've used is offered by Inference Communications Pty. Ltd. I'm sure others are out there waiting to be found.
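
To give a flavor of the workflow (a toy of my own, not how Inference's tool works internally), a grammar learner generalizes a handful of example utterances into rules by aligning their shared structure:

```python
# Toy illustration of grammatical inference: align a few sample utterances
# word-by-word and emit a rule with alternatives where they differ. Real
# tools bring linguistic knowledge to bear; this only shows the idea of
# "small sample in, grammar rule out".

utterances = [
    "pay my bill",
    "pay my invoice",
    "pay my statement",
]

def induce_rule(samples):
    tokenized = [s.split() for s in samples]
    rule = []
    for column in zip(*tokenized):                      # compare position by position
        words = sorted(set(column))
        if len(words) == 1:
            rule.append(words[0])                       # constant word
        else:
            rule.append("(" + " | ".join(words) + ")")  # alternatives
    return " ".join(rule)

print(induce_rule(utterances))  # pay my (bill | invoice | statement)
```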

When working with my clients, I often find that major speech vendors jump straight to a recommendation of Natural Language (with glimmers of large licensing and professional services fees in their eyes) when a more conservative and less expensive approach "may" work just as well.

Regardless of the direction you choose, consult with a skilled voice interaction designer who can help you weigh the pros & cons of each approach and choose the right one for you and your callers.

P.S. Also worth mentioning this week is a new website from The Association for Voice Interaction Design. If you have anything more than a passing interest in voice interaction design, they have some great references to other blogs, websites and publications.

Thursday, April 02, 2009

Despite being a skeptic . . . I'm impressed

You would think that, since I've made a career of working with speech recognition technology, I would be wildly enthusiastic about it and, just like a carpenter with a hammer, would see every problem as a nail. Not so with me!

I got into this technology reluctantly and have remained largely skeptical ever since. It's only in the past few years that I have become truly comfortable recommending voice recognition as a viable and solid technology for business solutions. Even then, I'm careful and conservative about when and how I make that recommendation. I have to be 100 percent confident that the proposed solution can succeed and will prove successful both in the mind of the user and for the business.

Along the way I've tried, sampled and tested almost every offering, including the desktop recognition packages IBM ViaVoice and Nuance Dragon NaturallySpeaking and, most recently, the speech recognition included in Microsoft's Windows Vista. After every one of these trials, I've uninstalled or disabled the recognition package and gone back to my keyboard. While they worked to a greater or lesser degree, they never "excited me" and were often more trouble than they were worth.

Despite the previous mediocrity, I keep trying new versions and products. This week I finally got around to installing Nuance's Dragon NaturallySpeaking 10, after it lay on my desk for months.

The installation process was quick and easy with no complications. Once installed, the product walked me through a basic setup process to select and calibrate the audio input source (I have multiple devices with microphones attached to my Dell Vostro running Windows Vista) and to complete a short training exercise, reading a Dave Barry column so that NaturallySpeaking could create a user profile for me. All done in less than half an hour from start to finish, including the time to download and install the Microsoft Visual C++ Redistributable.

For those of you who know me in person, you'll no doubt remember that I have a distinct southern accent, having grown up near Memphis, TN. One of the user configuration options was a choice of accents that included General US and Southern US. I resisted temptation and chose General, just to test NaturallySpeaking a bit more.

Now, after a few days of use, I have to admit amazement! The recognition is incredibly accurate and quick. I've not typed an instant message or email so far this week; all have been dictated using Nuance's Dragon NaturallySpeaking 10. In fact, this entire blog entry was done using speech dictation.

I've only experienced two problems:

1) Some issues with product names or proper names like Vostro, which require correction.

2) I think I need a more highly directional microphone for my headset. I listen to my local public radio station in the background while at my desk, and if I fall quiet without telling the software to stop listening, it will start transcribing the conversation on KUOW 94.9 playing in the background. This might make it difficult to use Dragon in a loud office or call center environment.

That's my two cents worth . . . and if you're interested, I have a copy of Dragon NaturallySpeaking 9.0 that I never got around to installing. It's for sale on eBay!

The Good, The Bad and The Ugly - Real World Examples of Speech Enabled Self-Service

When discussing issues related to a particular approach or dialog in a speech recognition application with my clients, I find it's very helpful to have them call and interact with examples in a real application. Interacting with a real-world example often illuminates their thinking more clearly than any intellectual discussion; having experienced the pain or the pleasure makes the point I'm trying to illustrate very visceral for them.

I keep a document with several dozen of these real-world examples, which I give my clients as a reference. It's time for me to refresh the content of this document, and I thought I'd ask for your input. I've put together a survey to collect this information. Click Here to take the survey.

Once I've gathered the results and updated my reference document, I'll post a link to it here and discuss some of the most noteworthy examples.

Wednesday, March 11, 2009

Apple's new Text-to-speech interface for the iPod shuffle

Apple has introduced an all-new iPod® shuffle, which features a Text-to-Speech interface (called VoiceOver). With the press of a button, you can play, pause, adjust volume, switch playlists and hear the name of the song and artist. The new shuffle can speak 14 languages: English, Czech, Dutch, French, German, Greek, Italian, Japanese, Mandarin Chinese, Polish, Portuguese, Spanish, Swedish, and Turkish.

You can read much more about the new iPod on Apple's website, and you can hear samples of VoiceOver there as well.

(Thanks to Adam B. at SpeechTechMag.com's blog for bringing this news to my attention).

While discussing this on Twitter this morning (Yes, I admit I've become a Twitterholic!) I sparked a conversation thread about what various folks think about TTS in general and the new VoiceOver feature on the iPod® shuffle.

I have my own opinions, but as a good user interface student, I thought I'd ask for your opinion first. Once I've gathered a good sample of responses, I'll post the results and add my own "two cents worth".

Use the link above to listen to the new VoiceOver interface, then take my short survey on SurveyMonkey.com.

Take a few minutes to listen to the new iPod® interface, take the survey and let me know what you think!