Mary Parks on Voice User Interfaces

“When it comes to spoken language or any form of language, it's very deeply tied to our identity.”

My guest today is Mary Parks. Mary’s background is in communications and applied linguistics, and for the past twenty years she has worked on designing voice user interfaces for digital systems. In this conversation, we focus on what voice interfaces are and how voice-driven systems work.

Jorge: So Mary, welcome to the show.

Mary: Thank you.

Jorge: For folks who don’t know you, would you please tell us about yourself?

Mary: I am a voice user interface designer, and if I had to say where my voice user interface design background starts, it’s actually in applied linguistics. So sometimes I would call myself an applied linguist. To give more background on that, I started out my education going into cross-cultural communications for my bachelor’s degree, and then got interested in becoming an English as a second language teacher. I knew that I needed a degree in applied linguistics; a lot of people that teach English as a second language pursue applied linguistics degrees. So that’s what I started working on my master’s degree in: applied linguistics.

And it took me a while actually to get my master’s degree, because it turned out that you could start teaching right away in that field. So, with a bachelor’s in cross-cultural communications, I started teaching English as a second language, and I did that for 15 years. I got my master’s, took a lot of linguistics classes, loved linguistics, got interested in theoretical linguistics, and I kept thinking about what I wanted to do, because after a while I felt like I wanted to do something else with linguistics. I loved teaching. I was teaching at the university level, teaching students writing, listening, all kinds of skills that they would need to succeed in a university. But I also taught other students who were coming to the United States to learn conversational English, or business English, or other types of specialized English.

I bring this up because it’s relevant to my work actually, just that whole teaching background. But as I said, I kind of wanted to do something else in linguistics. So, I was aware as the internet wave in the 90s was happening, I knew that a lot of linguists were being hired in different roles in technology companies. So, I thought, “Oh, maybe I could do something in technology.” And kind of fell into voice user interface design because of various reasons. It was just one of the jobs that I applied for. But it was funny because I didn’t have a background in design, nor did I have… I didn’t think I had a background in design, and I didn’t think that I had a background in UI. Like I didn’t know what a UI was. And clearly, I didn’t even have a technology background.

The good news is that when I started working at… And the first company that hired me as a voice user interface designer was a startup here in San Francisco. Because it was a startup, I was able to just dive in and just start working in speech technology and seeing overlaps and bringing in… I knew then that I had a lot of relevant background. I had to learn, you know, what a UI is, I had to learn a lot about design. I just started reading a ton, going to workshops, going to conferences, doing all I could. I had mentors who were helping me learn. So, the first two years, my brain was on fire in this new career. And yeah, that was back in 2000.

So, I’ve been working since then in voice user interface design. There was the first startup. I went to another startup here in the Bay area, and then joined a company called Nuance Communications. It wasn’t called that at the time, but it was basically one of the primary vendors for speech recognition technology, and they had a large professional services organization. So, I was there for 10 years. It was great, to just get into a regular rhythm and practice of how to build these applications.

And then, what I didn’t say was, when I started in this field in 2000, I quickly learned about the notion of the internet of things that was already being talked about back then. So, living here in San Francisco, walking around using public transportation, I was constantly thinking about how could voice interfaces fit in an internet of things world. And I wasn’t really thinking of a cloud-based internet of things world, I was thinking more about ground-based computing — “fog” computing, if you wanted to call it that.

So that was already something going around in my mind. And then, a few years back, I ended up joining Honeywell, and was working there on what I would describe as location-based and internet of things-based applications of voice. I was there for a few years, and then I started contracting at a large tech company here in the Bay Area, working in voice and multimodal experiences, and I’ve been doing that for a couple of years. So it’s coming up on about 20 years in this area.

Jorge: This notion that you had — sounds like very early on — that there was a future where voice was going to overlap with the internet of things… Or rather, that there was such a thing as computing devices distributed all around us that would be driven by voice — seems very prescient.

Mary: Yeah. The startup that I was at was called Vocal Points, and there were a lot of people there who had already been thinking about this for years. They were very much thinking about how do we make it so that voice is available everywhere. You know, it’s just there. So how do you do that? And so, I was lucky, and I want to say: I think luck plays a lot in my story. I was just lucky that I was at a startup that was attacking that problem at the time.

Jorge: But it also sounds like you had the right background. What should folks know about applied linguistics?

Mary: Oh yeah. Linguistics is interesting. Applied linguistics is kind of like the difference between physics and applied physics. There’s a part of linguistics that goes after theory, what you could call lab or field research; there are lots of different types of linguistics. And then there’s the applied side, which tends to be, for example: how do you take what we know about linguistics, what we know about language learning, and apply that to helping people learn another language? Or how do you apply it to helping people learn what they need to know in their own language? There are also lots of clinical ways that you can use linguistics, for example in speech pathology. And forensic linguistics is another field that I would describe as applied linguistics.

So, it’s basically asking: how do you apply linguistics to solve real-world needs? And to me, voice user interface design is a place where linguistics gets applied. You don’t have to have a linguistics background to do that work, but I think it’s always good to have some linguists around as designers who know how to apply linguistics to it.

Jorge: I’m assuming that when you talk about applied linguistics, that covers both spoken and written language. Is that right?

Mary: Yeah. It covers all language phenomena.

Jorge: When I hear you say that you are a voice user interface designer, that to me speaks to the verbal part of that dichotomy. Yes?

Mary: Yeah. Yeah. It’s speech. Well, when you get into the UI design part, there are two big components. One is the speech input, what people input into the machine, and the other is the speech output, what comes out of the machine. So, technically it’s usually divided that way, and it has to be. But it’s not like you have to have speech output, though. You could have speech input, and the machine does something else. It doesn’t talk to you; it might just do something. So those are the two different components of it: speech input and speech output.

Jorge: And you work on both?

Mary: Mm, yes. I think it’s really important, in order to make sure the machine behaves as desired. In other words, it’s not just about the speech output side or the machine behavior side; it’s about deciding how the machine is going to behave based on the input. So, if you have one group of people working on input and then the designers are over here deciding behavior, you can end up with a lot of trouble, because you can’t do them independently of each other. And there are multiple layers to talking about that.

So, you can talk about it at the information architecture level: you have to architect how the speech input works. You could ask, what are the tags that are going to be associated with certain types of utterances? You can have all this input coming in, and it all has to be bucketed into different tags. Well, if that structure is being made in a way that doesn’t really suit how the output needs to come, or how the machine behavior needs to be, you’re going to have a complete disaster. Right? So, it’s really important to be able to help guide how the input is handled, and then work on the pure output side of things as well. Long story short: you have to do both together.

Jorge: The voice interactions that I’m most familiar with are with my phone. When I’m in my commute, I’ll be walking along with earphones and I will be listening to podcasts or an audio book or something like that and something will come up that makes me think I want to follow up with that idea. And I will speak into the air, I will say, “Hey, name, remind me to look up blah blah blah blah blah.” And the command “remind me to” is… I know it’s a trigger that causes this voice-based assistant to place a transcription of whatever follows into my to do list. And I’m wondering if that “remind me to” trigger is what you’re thinking of when you say “tags”?

Mary: Okay, yes. That’s a really great example of a use case, right? Basically, that’s a tag. There’s a tag somewhere that says there’s a bucket of utterances that will help the system know, “Okay, this is what the person wanted to do.” And it doesn’t have to be “remind me to.” It’s interesting: as you play around with systems more, you start to realize you could probably say something like, “I need to pick up some milk,” and the system will know to put that in a list. That’s kind of interesting.

Recently I tested something on an Apple interface. Instead of saying something like, “timer, two minutes” (that’s kind of how I normally set timers; I just say, “timer, blah”), this time I just said, “two minutes.” And bam! It got it. I didn’t even have to say “timer.” So, the system knows that I’m referencing a time, and it knows what to do with that: it knows that when I say, “two minutes,” it’s most likely that I wanted a timer. And there’s some risk here, because what we’re talking about, and what I think is really important to understand, is that speech systems are probabilistic.

So, when I brought up this notion of input… There are at least two systems that have to work together to get the magic of voice working. One is the magic of speech recognition, which takes an acoustic signal and tries to figure out how to map that acoustic signal to a text string. Now, about that text string: if you remember, earlier in our discussion we talked about speech and written forms of language being two different animals. But unfortunately, our speech systems don’t work off the acoustic side of it. They take the acoustics and translate that into a text string. So, they don’t actually keep a lot of the signal about how you said it. If you were yelling at it, the system only knows there’s a text string of something.

And the important thing about the input being probabilistic is that the system is trying to make the best guess, based on what it has learned to date, about translating acoustic signals into text strings. So, it has made its best guess about what that text string is, and then that text string is being put into one of these tags, which is usually called an intent. What those tags are called doesn’t matter. It’s just this notion that there’s a bucket of strings that are assumed to fall into a certain behavior that is desired by the end user.
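The bucketing Mary describes can be sketched in a few lines. This is a toy illustration, not any production system: the intent names, cue phrases, and scoring here are invented, standing in for the learned models that real recognizers use to assign a probability to each intent.

```python
# Toy intent classifier: buckets a recognized text string into an "intent"
# (the tags discussed above). Real systems use trained models; simple
# keyword cues stand in for that machinery here.

INTENT_CUES = {
    "create_reminder": ["remind me to", "i need to"],
    "set_timer": ["timer", "minutes", "seconds"],
}

def classify(text: str) -> tuple[str, float]:
    """Return (intent, confidence) — a best guess, never a certainty."""
    text = text.lower()
    # Score each intent by how many of its cue phrases appear in the text.
    scores = {
        intent: sum(cue in text for cue in cues)
        for intent, cues in INTENT_CUES.items()
    }
    total = sum(scores.values())
    if total == 0:
        return ("unknown", 0.0)  # nothing matched any bucket
    best = max(scores, key=scores.get)
    return (best, scores[best] / total)
```

Note that even a bare “two minutes” lands in the timer bucket, mirroring Mary’s example: the cue is probabilistic context, not a required command word.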

What’s interesting about the example that you gave, “remind me to blah,” is that when you get to the part about what you’re reminding, there are two ways that a system can handle it. So, I told you there’s the speech technology that takes the acoustics, and now you suddenly have a text string. There are still going to be some probabilistic things done with that text string as well in modern systems. It’s not just rule-based; you could be using machine learning on that part as well. Well, you are, actually. Machine learning has been around for a long time, so there’s machine learning throughout the system.

So, when you’ve got to the text string part, with certain utterances like “remind me to pick up milk” or “I need to pick up milk,” the system goes, “Oh, okay. I’ve got this string.” It sees “remind me to” or “I need to” and goes, “Oh, okay. That’s a reminder request. We’re putting it into that bucket.” And then there’s the other part, “to pick up milk.” So it’s almost like what you said before doesn’t matter; it’s about the “to pick up milk.” It’s a reminder.

Now the system can do two things with that part, “to pick up milk.” It could run what’s called natural language understanding, which means that it tries to take that string, understand it, and do something with it that goes into a bucket of its own. Or the system can be built so that it literally takes the exact string that you said and puts it in word for word.

So, this matters because there are different types of technologies going on. The case where you have a text string and the system is just trying to understand it: that’s a lot of what goes on in our interactions with these devices that have voice interfaces on them. You say something, it tries to put it in a bucket, and it does something with it.

There’s a whole other type of technology, and this is most commonly felt when we use dictation. That’s a different technology; it’s not the same. You can feel this the most when you’re doing text messaging through one of these assistants. You could just open up your phone, tap the input field, hit the microphone, and start dictating. When you do that, you know you’re dictating; you see it happening. Or you might’ve started it on the phone and then put the phone away while you’re dictating to the machine, and it’s an amazing feat.

Or if you’ve ever used dictation on a computer, with a product such as Dragon NaturallySpeaking, which came from Nuance — that’s an amazing technology. Apple has their own dictation; Microsoft has their own dictation; a lot of these companies have their own dictation algorithms, let’s put it that way. And that’s a different thing. That’s taking your utterances and putting them in word for word, and then there’s this kind of modeling taking place that makes it easier for the system to figure out what you want to say as dictation.

So that’s one way of doing dictation. When you’re asking Siri, or some other device, to do dictation as part of an interaction with the assistant, in that case you’re saying something like, “Hello, so-and-so.” You wake up the device, you get its attention, you press a button, and then you say, “Send a message to my mother. Tell her I miss her and I’m going to call her tomorrow.”

So, you say that, and then suddenly everything you said is written down word for word: “I miss you. I’ll call her tomorrow.” And you look at it and you go, “Oh no. Oh no. That’s not what I wanted.” See what’s going on there? There’s no interpretation of your input. It’s doing other things to try to get the dictation right, but it’s purely dictation, and it’s not the same as the whole part that came before. The whole part that came before was an interaction with the assistant, or voice UI, to let it know, “Hey, I’m trying to send a message. Hey, it’s to this person.” So, it’s filling out fields, right? It’s going, “Okay, they want to send a message, and oh, here’s the person they want to send it to.” So, it’s got those fields handled. But for the whole message part to be done correctly, you have to suddenly start dictating, not telling the voice UI what you want said in the message, because it can’t do that yet. That technology isn’t available.

Jorge: It sounds to me like there are two distinct modalities of interacting using your voice with these digital systems. In one modality, you’re dictating to the system and it’s trying to capture what you’re saying verbatim or as close to verbatim as possible. And in some cases, it’s giving you live visual feedback of what it thinks it’s hearing you say. And in the other modality, it’s trying to guess — based on the intents that it’s been programmed to understand — it’s trying to guess from what has come before, what utterances you’ve made before, trying to guess what it is you’re trying to get it to do.

Mary: Yes. And it’s really important to realize that there may not be a visual component. But basically, you can imagine that you have a form. You have one sort of speech that comes in, and the system goes, “Okay, we’re filling out a form. This is the sendee, this is the sender, this is…” And then maybe, “Oh, it’s a text message.” So, there are certain things being filled out that the system knows. And then there are these other parts where it’s really just trying to capture what you’re saying verbatim.

Jorge: Knowing the workings of these systems puts you on a different level than the rest of us when interacting with them. And I’m wondering how this knowledge of how these things work has influenced the way that you yourself interact using voice.

Mary: Ah, that’s a great question. It’s really funny. Speech is an amazing thing because we all use it automatically; it took a long time for us to learn it. And there are other forms of language (sign language, for example), so it’s not that all of us learn speech. But speech is what I specialize in, so when I talk about this, I’m just going to focus on speech. It’s amazing because all of us use it automatically, without thinking. And it’s best when it’s that way, because speech is this mechanism where you have a thought, or your brain has a focus on something, and it has a mechanism to take that thought and translate it into the stuff that comes out of your mouth. So, it’s amazing.

So, it’s really hilarious, because I run into the same problems that anybody else would. I’m constantly sending text messages where I realize, “Oh, I said that, but I should have said…” The one that always cracks me up is when I say, “Send a message to my husband. Tell him I love him.” I hear it read back to me, and it says, “Okay, message saying: I love him.” I always laugh when I hear that. And I always correct it, because I would hate for him to get that message, so I have to say, “No, change it to: I love you.” And I like to send that message a lot, so for me, maybe one out of 10 times it comes out the wrong way.

So, part of this is: why do I say it the right way most of the time? And I don’t know. I really don’t know the percentage of how it performs in the wild, to be honest. And I think platforms matter. I don’t know what the Apple teams versus the Google teams run into. But it’s funny: because I’m aware of what’s going on, maybe I’m less bothered by it, or I have fewer questions in my mind.

But speech itself is spontaneous. That’s the beauty of it. It’s this wonderful thing where you have a thought and it comes out of your mouth, and when you’re talking with a person, their brain catches it, and then quickly some interaction happens, and it’s a chain of these spoken packets, let’s put it that way, that builds up a conversation through time. So, for me personally, I probably have less curiosity or less perplexity when I’m interacting with these systems. But as a human, when I’m talking, I run into the exact same problems as everybody else does.

Jorge: That is such a great example that you’ve brought up because it points out how the current state of these systems leads to this kind of uncanny valley of conversation, where this utterance, “tell my spouse that I love him or her” is perfectly normal if you’re speaking with a conscious entity who will relay your message.

Mary: Yeah.

Jorge: But this is not a conscious entity that is receiving the message, right? It’s a machine. It’s an algorithm.

Mary: Yeah.

Jorge: And it does not know to change the subject of the sentence by itself.

Mary: Right. I may be wrong, but I think the Pixel 4 phone might be doing some of this. You know, when you’re typing on your Apple phone, there’s all that predicting what you wanted to say and filling it in. Or you do this in search, for example, and it predicts what you wanted and fills it in ahead for you. You know what I’m talking about, right?

Jorge: Yes.

Mary: So that type of predictive stuff is happening now. But it raises a hornet’s nest of issues, because do you want a system to take what you say, interpret it, and make a best guess? Or do you want it to take what you said and deliver it exactly as you said it? And I don’t know. I mean, here’s the thing: I think when it comes to spoken language, or any form of language, it’s very deeply tied to our identity. We didn’t get to get into this aspect of it.

The moment we open our mouths, a massive amount of identifying information is in the speech utterance, in the first two seconds of it. Whenever we talk, there’s a ton of information there. You hear things in the sound of the voice that tell you who the person is: elements of their identity, including perhaps the region they’re from. All kinds of things come up. And if you know the person, your brain goes, “Oh, I know this voice.” You can hear just two seconds of a voice, and if it’s somebody you really know, you’ll know who it is right away, with pretty high confidence. So identity and language are deeply tied.

And I bring this up because of where voice interfaces are right now. It seems like we’re at this phase where people are looking for some sort of universal way to have voice interactions meet the needs of all human beings. And I think we’re going to have to get into adaptive interfaces that adapt to individual people, because I don’t think everyone wants the same thing. Some people will love it that the system takes what they say, interprets it, and delivers it more neatly. Other people will prefer more control, and having it more literal to what they want when they’re using these types of technologies.

The other problem going on right now, as we’re talking about the dictation method — and even when we talk about text strings — is that there’s an assumption that the people using these systems are literate. But a lot of people are not literate, or they have lower rates of literacy. So then how will these systems really meet their needs?

And these are big questions. I also want to make a plug for this: designers need to realize that we come from our own backgrounds and our own communities, and that when we are in our workplace, the way we behave and the way we talk is different from the people we are designing for. So, it’s important to recognize our own speech and our own behaviors when we’re working on these systems. If we try to imagine how our end users talk or behave, there’s probably a huge gap. And so, a ton of work is needed just to be in the field, out there with the people we’re designing for, in order to get how they talk and how they behave.

Jorge: That is such a great observation, and I just feel like there’s so much more that we could talk about on this subject. I think we need to do a second part to this show. Alas, we’re coming up to the end of our time together.

Mary: Yeah.

Jorge: So, Mary, where can folks follow up with you?

Mary: Ah! Best place for me is on Twitter and I’m Mary Parks.

Jorge: Fantastic. I will link your Twitter account from the show notes. It’s been such a pleasure having you on the show. Thank you.

Mary: I’m so honored that you asked me to be on, and I really appreciate it. Thank you for this opportunity.