Innovation in speech recognition
“James and Janet Baker, now in their 60s, are computer speech revolutionaries,” wrote The New York Times in a July 2012 article about the tragic consequences of the sale of Dragon Systems, Inc., the company founded by the Bakers. Indeed, for over 30 years the Bakers have been considered the preeminent leaders in speech recognition technology. Their technology became the industry state of the art, the standard against which all other speech recognition products were measured. And the technology has stood the test of time.
This post is part of “The impact of CMU” series, sponsored by Larry Jennings, trustee.
The Holy Grail of speech recognition is the ability to speak continuously, no matter your gender, your accent, volume, vocabulary, or context. Ideally, a computer, phone or other device recognizes what you say and communicates back to you, reading your voice mail, taking messages, calling someone, or typing the words that you dictate into a word processing program. Being able to speak into a device and have it accurately understand what you just said was a technical challenge and an elusive target for years. Dragon Systems, the company founded in 1982 by the Bakers, came as close as any firm to achieving the goal of continuous speech recognition using a wide vocabulary of over 100,000 words. The Dragon software has enabled millions of people and businesses to employ speech recognition in daily activities and important business functions.
Dragon Systems grew from just the two Bakers in their Victorian home outside of Boston to nearly $70 million in annual revenue and approximately 400 employees by 1999. The company doesn’t exist now, having been split up and acquired in several transactions, but the technology and products are still in use worldwide. Today, you probably use the technology in your phone or your tablet computer. And the legacy of that technology stems from Carnegie Mellon University (CMU).
CMU connection. James (Jim) and Janet received their PhDs from Carnegie Mellon in 1974. While Dragon Systems was founded years after the Bakers completed their doctoral degrees, the core of the technology was based on the couple’s work and discoveries in their lab at CMU. It was there that the couple developed the technology for continuous speech recognition (meaning without having to pause between words) that was later used as the foundation for the Dragon Systems software products. That technology, known today as the software Dragon NaturallySpeaking®, is now sold by Nuance. Used by millions of people to communicate with their computers and with each other, Dragon NaturallySpeaking was the first, and reputedly the best, continuous speech recognition software on the market.
Why speech recognition? The Bakers’ interest in speech stemmed from Janet’s graduate work at The Rockefeller University in a neuroscience laboratory on animal sounds and vocalizations. Her work gradually migrated to human speech. Studying mathematics, Jim got interested in the visual patterns displayed by human speech that he saw in Janet’s graphs. In order to analyze the signals mathematically he had to get the signals onto a computer. Jim started writing computer programs that could digitize the signals and analyze the speech.
Speech at CMU. The Bakers were wooed to CMU, which was an early leader in speech recognition as a branch of classical AI research with autonomous sources of knowledge as the underlying philosophy. Although the Bakers endorsed the philosophy of applying multiple autonomous knowledge sources, Jim wanted to represent the knowledge from each of these sources in terms of stochastic processes and probabilities. By imposing this framework on all of the knowledge sources, each of their contributions to the task could be assessed and evaluated rigorously and consistently. Furthermore, optimization techniques could then be easily applied to improve performance with whatever knowledge sources were available. Jim concluded that, “If we want to mathematically represent uncertainty, then we need to model the probability theory. We need to model speech as a statistical process.” Other universities researching speech had different approaches, but no one in academia at the time had used mathematical models to recognize speech. Jim, Janet and CMU were the first.
The idea. Jim’s probabilistic approach was to use a statistical method that he was familiar with, called a hidden Markov model. This class of algorithm was based on a mathematical model that enables predictions of individual words, phrases, and sentences. Applying this approach resulted in the development of a flexible technology that could represent acoustics, semantics, syntax, and context. Surprisingly to many, these probabilistic models can be used to represent many different types of knowledge! Since each of the knowledge sources is imperfect, these probabilistic representations allow the system to reflect that uncertainty with fuzzy models. These models can often be substantially improved by incorporating more data into them. Automatic learning and dynamic adaptation can further improve the models, their predictions, and system performance. All of this is encompassed within a simple underlying framework. The Bakers called the technology the Dragon Speech Recognition System. The first glimpse of Dragon Systems emerged.
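To make the hidden Markov idea concrete, here is a minimal sketch in Python. It is not the Bakers’ system: the two hidden sound classes, the three acoustic labels, and every probability are invented for illustration. It shows the standard Viterbi procedure, which finds the most likely sequence of hidden states behind a sequence of noisy observations.

```python
import math

# Hypothetical toy model: all states, labels, and numbers are made up.
STATES = ("vowel", "consonant")
START = {"vowel": 0.4, "consonant": 0.6}
TRANS = {
    "vowel": {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.6, "consonant": 0.4},
}
EMIT = {
    "vowel": {"loud": 0.7, "soft": 0.2, "hiss": 0.1},
    "consonant": {"loud": 0.1, "soft": 0.4, "hiss": 0.5},
}


def viterbi(observations):
    """Return (most likely state path, its log probability)."""
    # best[s] holds (log probability, path) of the best path ending in state s.
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
        for s in STATES
    }
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor state that gives the highest score for s.
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            step[s] = (score + math.log(EMIT[s][obs]), path + [s])
        best = step
    score, path = max(best.values(), key=lambda pair: pair[0])
    return path, score


path, log_prob = viterbi(["hiss", "loud", "soft"])
print(path, math.exp(log_prob))
```

The scores are kept as log probabilities so that long utterances do not underflow to zero. Better estimates of the tables, learned from more data, improve the decoding without changing the algorithm.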
Interestingly, at around the same time, a group at IBM developed a similar mathematical modeling approach to speech recognition. This became important when the Bakers attracted the attention of IBM. Upon graduation from CMU in 1974, the Bakers joined IBM to work in its continuous speech recognition group. Their industry experience continued when they went to Verbex Voice Systems, a company that was founded in the 1970s as Dialog Systems and later acquired by Exxon Enterprises. When Exxon decided to divest itself of all of its “office technologies”, the Bakers had to decide what to do next. The Bakers had developed a lot of research and management experience at IBM and Verbex, and they decided to found a start-up of their own. Of course they couldn’t use any of the technology from IBM or Verbex in the newly formed Dragon Systems; they had to go back to what they had done at CMU to use as a base for the start-up.
Innovation. The revelation that Jim had around math and speech was the key to understanding how speech recognition could be productized for the masses. “It’s extremely difficult, even today, to recognize dialects, accents, genders, and how I pronounce a word, say ‘Carnegie’ (kahr-ni-gee) versus how you might pronounce it (kahr-ney-gee). We built Dragon around the idea that you can calculate to a high degree of accuracy the mathematical probability of one sound following another. That’s a relatively simple idea, and it turns out that it works.”
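The idea of calculating the probability of one sound following another can be sketched in a few lines of Python. This is an illustrative assumption, not Dragon’s code: the phoneme-like symbols and the three sample transcriptions are hypothetical, and a real recognizer would estimate such probabilities from large amounts of recorded speech.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next sound | previous sound) from example sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent pair (previous sound, next sound).
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for prev, followers in counts.items()
    }


# Hypothetical transcriptions of "Carnegie" spoken two different ways.
SAMPLES = [
    ["k", "ar", "n", "i", "g", "ee"],
    ["k", "ar", "n", "ey", "g", "ee"],
    ["k", "ar", "n", "ey", "g", "ee"],
]

probs = transition_probabilities(SAMPLES)
print(probs["n"])
```

In this toy data, the sound after “n” is “ey” in two of the three samples and “i” in one, so the model captures both pronunciations and assigns each a probability instead of treating one as wrong.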
The Bakers had not set out to become entrepreneurs. But even before they were married they had talked about what they wanted to accomplish in their lives. Janet tells the story:
“We always had very long-term goals. Before we got married, we discussed several different problems that we were interested in, that we thought we could solve. We set out to do something that mattered, something that was doable in our lifetimes, something that would be challenging and interesting and useful. We chose automatic speech recognition. Our estimate was that it would take between 25 and 40 years to get to our long-term goals. It actually took us 26 years. That was the introduction of Dragon NaturallySpeaking in 1997, 15 years after founding Dragon Systems. Having formed the company actually speeded up the process of that discovery.”
Jim thinks about the entrepreneurial effort a bit differently,
“We did Dragon because the research was our goal. If we were just trying to be successful entrepreneurs and maximize the success of the business we wouldn’t have paid attention to goals more than 10 years away. We would have had to do it differently, and we didn’t want to compromise.”
With Dragon Systems, the Bakers had three long-term goals for speech recognition. They were prepared to push the boundaries and limitations of speech recognition to achieve: 1) continuous speech; 2) large vocabularies; and 3) unrestricted text. These goals were born from their work at CMU, and they became even more engrained by the couple’s industry experience at IBM and Verbex. The three criteria can be lumped together in what is termed natural language. The Bakers wanted speech recognition to be like natural language. They wanted people to be able to speak normally and have a device recognize what they were saying. To be really useful, speech recognition has to be easy to use, accurate, and have a broad vocabulary. It has to be able to recognize speech as it is spoken, naturally and without artifice.
Building the business. Dragon Systems’ first products were developed to work on the PCs of the day, starting with the Apple II. What set Dragon apart from other systems and from the research of the day was this focus on popularizing the application on affordable computers. Jim relates:
“Nobody could do large vocabularies in real time on a PC. Our goal in 1985 was a vocabulary of 10,000 words using continuous speech. By 1990, with the release of our first speech-to-text product, DragonDictate®, we got the vocabulary up to 30,000 words, surpassing anything else on the market.”
Beneath the hood, Dragon is still the speech recognition technology of choice.
The DragonDictate software required the speaker to use discrete speech where the user had to pause between speaking each word. It wasn’t until the release of Dragon NaturallySpeaking in 1997 that the Bakers realized their long-term goal of general-purpose continuous speech recognition. NaturallySpeaking could handle normal speech and featured a broad vocabulary equivalent to a typical dictionary. In 1998, Dragon released a multi-lingual version of NaturallySpeaking, incorporating six languages. The company grew.
The Bakers grew Dragon organically from revenue, and mostly without debt. “When you live within your means, it means you start small and stay small. You allow the natural course of events to dictate how you manage a business,” Jim states. Janet adds, “We knew that we didn’t fit into the typical VC five-year timeframe. Speech recognition at the time entailed small vocabularies; we thought that until we got to a 10,000-word vocabulary the market wouldn’t be large enough to interest top VCs.”
The company grew slowly but steadily. In the 1990s, sales started to take off. The annual growth rate was 50-75%. They took some outside investment from only one investor, Seagate Technology, the disk drive manufacturer, which bought 25% of the company in 1994, and later added another round in the final R&D stages of Dragon NaturallySpeaking.
In 1997, when Dragon NaturallySpeaking launched, the company had several hundred employees. The product was highly successful by every measure: market share, user satisfaction, comparative reviews in trade journals, and awards. IBM was only a bit behind, but Dragon won 90% of the comparative reviews and awards worldwide. The Dragon system was considered easier to use and more accurate.
By the year 2000, Dragon had developed a system that ran full dictation on a handheld device (Compaq). The company also found a niche in helping people with disabilities. “We, even today, get thank you letters from people who are able to go to work, go to school, or stay in social contact who wouldn’t be able to use computers if they didn’t have our technology,” Janet relates.
Dragon Systems peaked at almost 400 employees by 2000 and approximately $70 million in revenue. After 18 years, the Bakers started to explore going public or selling the company. It was a hard decision because Dragon Systems was really their third child. Founded when they already had two children, and housed initially in their own home, Dragon was a deeply personal story. Even the name was personal. Janet had always loved dragons. She loved the mystery and magic of the mythical beasts that breathe fire. Their home today exhibits hundreds of dragon images and figures – even dragon wallpaper.
The company engaged in several false attempts at an exit. In 2001, The New York Times reported on how good deals turned bad for Dragon,
“Lernout & Hauspie was not, in fact, Dragon’s suitor of choice. Visteon, a Ford Motor Company spinoff with $18 billion in annual revenue, seemed more attractive. But Visteon left Dragon at the corporate altar in February 2000; Visteon would later argue in court papers that it did so at the urging of Lernout & Hauspie.”
Dragon returned to Lernout & Hauspie, and the two struck a deal that valued the company at about $600 million in L&H stock. The deal had tragic consequences several months later, when it was discovered that Lernout & Hauspie had committed accounting fraud, overstating its revenue by hundreds of millions of dollars. The revelations resulted in L&H filing for bankruptcy protection several months later.
Part of the Dragon technology was eventually sold to Visteon; another part went to ScanSoft, which later merged with Nuance and took its name. Nuance had revenues of $1.4 billion in 2011. It is Nuance’s deal with Apple that makes the Bakers believe that the Siri technology uses some of the original Dragon technology. Nuance kept not only the technology but also the Dragon branding. Its latest speech recognition program for PCs is still Dragon NaturallySpeaking. On a Macintosh, the continuous speech recognizer is called DragonDictate. “It’s not the same technology as the original Dictate, but they loved the name.” In 2012 the Sunday NY Times did a cover story on the Bakers and the debacle of the deal with Lernout & Hauspie, which also involved Goldman Sachs, entitled “The $580 Million Black Hole.” The Bakers have been through a lot with and for their company.
The future of speech recognition. Speech recognition has come a long way. Starting with voice typewriters in the 1800s, people have been researching voice and speech recognition in the hopes of achieving the Holy Grail of natural language processing – perfect and continuous recognition without limitations. As Janet says, “nobody wants to type with a toothpick; talking is easier than using your thumbs.” When the Bakers reached their goal in 1997, they realized that it wasn’t an end point but a starting point, “We had worked in anticipation in getting to that new era which has started blossoming, and it will continue to do so.” The Bakers see lots in store for the future of speech recognition.
The industry has changed, not because of new, breakthrough technology, but because of the data available via the Internet. Jim tells me, “With all the data that is available you can now capture the variability of speech. You can use the same techniques but take advantage of the increased data to better characterize and represent it.”
Janet believes that we will be talking more with objects around us in the future. In an interview with AudioBoo, Janet predicts,
“We are going to find ourselves talking to a lot more devices, equipment and gadgets that we want to use whether it’s in our automobile, or whether it’s helping to provide literacy support for people who don’t have it, or children learning to read, and for a host of other things. Speech by itself is a tool just as a keyboard is on a piano. It helps you create something else. And it can be used in conjunction, and many times benefits from being used in conjunction, with other modalities…A way of measuring the success of voice interface is as it disappears into the devices and the equipment that we interact with around us, and you get to the point that you don’t think about using it or having it being used for you.”
Today, the legacy Dragon speech recognition technology runs on PCs and servers, and is embedded in portable devices used by millions in many languages. The technology also has been used for years in the automotive industry and many industrial applications. The Bakers also created the first audio mining technology: “If you have pre-recorded speech or you are recording on the fly, you can do a transcript, and then locate particular areas of semantic interest. You can find all areas about a particular company or industry; you can pull out the concepts of what is being discussed with a high level of accuracy.”
The Bakers pioneered the field of speech recognition and they are still talking today, pushing the field to the next level and seeing their technology flourish as it gets incorporated into dozens of others’ systems, products, and applications.