Phillip Hunter is a strategy and innovation consultant focused on conversational systems. He has a long trajectory working on such systems; among other roles, he was head of user experience for Amazon Alexa Skills. In this conversation, we focus on conversation itself, and how to design systems that converse.

This episode's transcript was produced by an AI. If you notice any errors, please get in touch.

Read the transcript

Jorge: Phillip, welcome to the show.

Phillip: Oh, thank you. So great to be here. So great to be with you.

Jorge: I'm excited to have you. For folks who might not know you, can you please tell us about yourself?

About Phillip

Phillip: Sure. So, I've been doing many sorts of design activities and jobs and things over the course of about 25 years. I actually started my career as a developer, but quickly learned the difference between building computer programs as a hobby versus work. I enjoy one of them! So, I ended up getting into design interestingly, just because I complained so much about the applications that we were building at the company I was with, and how they just didn't make sense to me. And all of a sudden, someone hands me a book from Brenda Laurel and says, you need to read this. And that literally changed my life. I got to meet her one day and tell her that.

But that was about the same time the company I was with, which was building interactive voice response systems (which started as those touchtone systems that everybody hates, for your bank or airlines or insurance companies), started adopting speech recognition as part of the platform. And to me, that opened up so many new possibilities. I learned some really interesting things from the get-go about how designing for that was so different from the things I had been used to before. Now, this was around when Don Norman coined the term "user experience design," and so it wasn't well-known. But while everyone was also getting used to designing for the web and then later for mobile, I was in that, but I was also getting used to designing for conversation: what does it mean to exchange things by voice that's different from how we think about information being presented on screens?

So, to speed things up a little bit, I've done that sort of work in startups and then in big companies like Microsoft and Amazon. I was around for the early days of Cortana, before it was public. And I've worked on Alexa as part of their developer third-party focused team. But along the way, I've always also been fascinated by large systems. So I worked at Amazon Web Services for a while where, at the time I started, there were about 35 different offerings that they had, and now it's somewhere around 150. It's just amazing growth over the past 10 years for them. And this idea of how all of these services would come together in different permutations based on who was using it and what they were using it for just really fascinated me as, you know, beyond the Lego-block metaphor into each of these things are by themselves an advanced technology, and then, how do you use all these things to run a business or create a product or serve customers or all the things that we normally do in business, but now we're doing them with these really amazing technologies. And so, conversation itself is also a system.

And so, it was interesting to me to get into the systems thinking from a pure technology standpoint. I've read other things about human systems and economies and healthcare and physiology and things like that. But I'm in tech and I soon began seeing in a different way some of the systemic elements of conversation. And so, for me, the past four, five years has really been amazing in terms of my own personal growth around what it means to interact with machines, including by voice and text, as well as just starting to see the power of systems in our lives. And you know, with technologies like Google Assistant and Alexa, now technology infrastructure, along with our mobile phones, along with our laptops and ways we interact with the worldwide web, all of these things are now very much in our homes every day for many of us. So, they've started to cross some interesting boundaries that make everything that I've talked about way more interesting and way more pervasive. So, today I'm consulting in that. I've got some product ideas that I'm working on as well to explore where things go now that machine learning is really a big component, and artificial intelligence, whatever phrase you want to put on it, is now a real factor in the mix. 10 years ago, it was still sort of, you know, science fiction more than a daily practice. But now we have... well, for a number of reasons, we have these things, and we have to say, "okay, what's the impact here? What does this mean for our lives too?" So, that's the kind of thing I'm working on, and it's really exciting.

Conversation as a system

Jorge: You said, "conversation is a system." What do you mean by that?

Phillip: So, most of us who speak, no matter what age, we started learning how to speak and interpret speech very early on. Certainly, before we started reading; some of us start reading two to three years after we learned how to interact by voice. And by then, interestingly, we are already experts at conversation, which raises the question of what are we experts at?

Well, so it turns out through the study of things like conversational analysis or through practices like that and linguistics and psycholinguistics... it turns out that language is not just a bunch of sounds that we make spontaneously. In our minds and between each other, we are actually doing some really intricate dancing and processing of emotions, information, contextual settings, history, all of these things that you know are part of our daily lives, and to process those effectively with each other and, some would say for ourselves even, we have developed this system of how conversation works.

And the way I think about it in my current work — and this is not a re-statement of anything that I've read necessarily — but there are essentially three levels of where a system is operative. One is types of conversations we have. So, you and I are having a sort of conversation, call it an interview, or a structured discussion, things like that. There are casual, "how are you doing?" You mentioned teaching earlier, lecture as a type of conversation, usually followed by questions and answers. And so, there are types of conversations and at the opposite end of that, there are the linguistic structures that help us understand: this is a noun, this is a verb, this is a modifier. Most of us probably hated studying those sorts of things in school, but we learned them, and we understand the basics there. And we know how to use them. We're experts in them, even if we don't necessarily like to study how it works.

In the middle, there's something we don't typically think about, which is how conversations have a structure on an individual level. And so... I'll just use what you and I did. We joined this Zoom call and we started exchanging words that we both probably could have predicted we were going to exchange: 'how are you, how is life? What are you doing these days?' All of these things are, some would call it chit chat, some would call it small talk, some people would call it social niceties, but it's also giving us time to understand where each other is currently in our lives. Like I can... you know, especially if you're meeting a friend, let's say, you can see: is this person in the usual mood? You know, are they presenting to me how they usually come across? Is something different? Why is it different? How is it different?

So maybe think about a loved one. You come home from – back in the day when we came home from work – you come home from work and you see a concerned look on your partner's face and right away you start to pick up something is going on. But maybe you start with a greeting, “hey, how are you?” But at some point, you're going to probably say something like, “what's going on? Is there something...” So, there's these elements of conversation where we connect, we survey, we assess, then we get into things like, a section called negotiation.

What are we going to talk about? How do we know what each other means? Do we need to clarify something? So, for us, for you and I today, you know, at some point you said, “hey, here's how this is going to work.” Which is a statement again that I expected, but it doesn't mean I know the answer. So, you gave me an outline of how we would use our time today, and now we're doing it, right? Now you're asking me questions, I'm giving you thoughts and answers, and at some point, we'll move, and you said it yourself, we'll move to close out the interview. And almost all conversations have a closure. One of the things I like to point out to people is how often do you… again, when we would run into someone in the hall at work and we'd say, "hey! Oh, I've been meaning to talk to you about this." And now you've raised a topic, maybe you talk about a few specific items and you say, "okay, great. You know what? Let's catch up on that next week." "Sure. I'll put some time on your calendar." "Oh! Hey, by the way, how did that thing go?"

And so, you have this other... you have this transition. You're talking about another subject. Then you both start to feel like, okay, we've spent enough time doing this. And what do you do? You return to the first topic and you say, "it'll be great to talk to you next week when we meet about such and such." And you're like, "sure, looking forward to it." And that's your signal that it's over.

So, all of these are well-documented, and for people who study this, fairly well-understood components of conversations. They're not the types of conversations because they occur across many different types of conversations. They're not the linguistic elements of what sounds and what are the individual meanings of those sounds and how they work together. It's somewhere in the middle around how does a conversation work and these systems are actually incredibly important, for reasons that I can go into in a minute, but that's what I mean about conversation: conversation is a system or more accurately, as is the case with many systems, a collection of systems that's at work. And part of our skill at being able to converse well is a tacit understanding that there are those systems and that we can and should use them to be effective in our day-to-day lives with other people.
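
As an illustration only, the middle structure described here (connect, negotiate, talk, close) can be sketched as a small state machine. The phase names and the allowed transitions below are assumptions made for this sketch; they are not a model taken from the episode or from the conversation-analysis literature.

```python
from enum import Enum


class Phase(Enum):
    OPENING = "opening"          # greeting, connecting, assessing
    NEGOTIATION = "negotiation"  # agreeing on a topic, clarifying meaning
    BODY = "body"                # the exchange itself
    CLOSING = "closing"          # signalling that the conversation is over


# Assumed transition table: a change of topic sends the body back
# through negotiation, as in the hallway example.
TRANSITIONS = {
    Phase.OPENING: {Phase.NEGOTIATION},
    Phase.NEGOTIATION: {Phase.BODY, Phase.CLOSING},
    Phase.BODY: {Phase.NEGOTIATION, Phase.CLOSING},
    Phase.CLOSING: set(),
}


def is_well_formed(phases):
    """Return True if the sequence opens, closes, and only makes allowed moves."""
    if not phases or phases[0] is not Phase.OPENING or phases[-1] is not Phase.CLOSING:
        return False
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(phases, phases[1:]))


# The hallway chat: greet, raise a topic, talk, switch topic, talk, close.
hallway_chat = [
    Phase.OPENING, Phase.NEGOTIATION, Phase.BODY,
    Phase.NEGOTIATION, Phase.BODY, Phase.CLOSING,
]
# An exchange that jumps straight to the content, with no opening or closing.
abrupt_exchange = [Phase.BODY]

print(is_well_formed(hallway_chat))
print(is_well_formed(abrupt_exchange))
```

The same four phases can occur in an interview, a lecture, or a casual chat, which is why they sit between conversation types and linguistic elements rather than belonging to either.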

Protocols

Jorge: The word that came to my mind when you were describing this middle between those two extremes is the word "protocol." It's like, well, we're establishing a protocol, right?

Phillip: Yes!

Jorge: And the image that came to my mind, I think that you and I are both of the vintage where we remember these modems where you would connect to the phone line and you would hear this awful screeching noise as the modems were trying to figure out if they were compatible.

Phillip: Right, right, right. Yes! It's a connection and negotiation, which is, for nearly every conversation we have, a crucial step, even when it's someone who we talk to on a frequent basis. Now, it will adapt based on who you are talking to. And certainly, for meeting someone new for the first time, it has a very different feel to it than if it's somebody you talk to multiple times a day. But yeah, it's really important. And I'm really glad you said the word protocol too, because we can breach it. And it causes something else to happen. It may not be a problem. But it almost always is a signal that you have to adapt. The direction you thought it was going to go is not how it's going to go, and you need to figure out what is happening. And that's again, where negotiation becomes a key part of the conversational ability that we have.

Jorge: When you broach the subject of conversation in the context of user experience design, I think of two things. One, I think of the "assistant in a cylinder" that you've touched on earlier, right? We have a HomePod here at home, so we have the Apple variant of that. And I also think of chatbots, which are not oral, but they're text based.

Phillip: Yes!

Jorge: And I'm getting the sense from hearing you that the type of protocol that you're talking about is mostly the verbal one, the one when we speak to each other.

Phillip: Right.

Channels

Jorge: Do we have different protocols for chatting via text versus talking?

Phillip: We do, we do. And I'll mention several books as we go along. And the first one I mention is Because Internet, the author is Gretchen McCulloch and she has studied the evolution of language on the internet, going back to the sixties and seventies, when some of the first chat systems, text-chat systems, were being created and all the way up through modern texting and messaging platforms.

So, the difference between how we converse verbally and how we converse via text is a long-standing thing… And so, even if you go back to when we wrote letters and things like that, conversational protocols were different then, but written was still very clearly different from verbal. So, there are some different protocols and some of them are different because the establishment of context is clearer from the get-go. Meaning that, if I go to someone's website and then I look for this chatbot thing, and I open it, well, I've already sort of taken a step into a context, right? I’ve visited a website. I know it's a company. I don't want to do any of the other stuff. I'm making this implicit statement of, "I don't want that stuff," by choosing this other thing explicitly.

But with many chatbots, you still see a greeting, "hi, this is Jojo-bot. I'm here to help you with your questions about X, Y, Z company." So, the idea is there's still some semblance of this because it's about acknowledgement, a statement of presence; here I am. You can even say things to me. And then there are the added protocol differences: we have no emotional context, right?

And now emojis are a valid expression of emotion and conversational meaning, but we can't appreciate them with the nuance and the subtlety that we can by viewing another human's face and hearing their tone of voice. So, as we all know, when you go to text, like... first of all, when we go from visibility to invisibility. So, if you and I weren't looking at each other during this podcast, there would be a channel, a signal, that is no longer available to us, right? And then in text, it's the same thing. But now we also don't have some of the audio signals that we can get from somebody's voice. So, we replace some of these by emojis in some cases, but we also tend to read a lot into certain ways of phrasing.

One of the fascinating things that's going on right now in the world of text messaging is that periods, or full stops, indicate to teenagers (or maybe even to 20- and maybe 30-year-olds) a different emotional tone than the lack of periods or full stops. And, you know, this becomes just... for me, somebody like me, extremely fascinating, to think about the incredible subtlety that that brings. Part of the problem is like... I mean, one of my kids said this to me. I typed a period in a text message and the question was, "are you upset?" I was like, "No! I just typed a period!" He was like, "oh, well, periods usually mean that somebody is upset." Like, oh! Okay. Not upset! Also, ignorant! So please, excuse me!

So, it's not so much that we... well, yeah, I think you said it: we have different protocols. And we do adjust our protocols based on the channel and what signals are available to us, because at some point, there may be some information we need that might've come in as a signal through a different channel, visual or tone of voice, and now we're just in text, so we might need to be more explicit. This becomes a problem because, and we all know this, those of us who've been working in tech for a long time, we've known how we can misread emails, right? You see an email and you think, "Oh man, there is something wrong here." And you go talk to the person and they're like, "no, everything's great." "Well, your email just made it sound like..." and we use those phrases, "made it sound like." There was no sound involved in this.

So, we have some understanding intuitively that the different channels mean different things for us, and if we are missing some, then we have to adapt. But we aren't necessarily good at that. We don't necessarily think — and this is one of the downfalls of conversational technology right now — we think that it's the words alone that matter the most. And I won't quote the stats about like how much of meaning comes across in other channels but suffice it to say that when we have sort of full bandwidth conversations, we are actively using all of the channels available to us. But it doesn't mean that we understand that we're using them or that we are necessarily capable of adapting well to the channel loss or the signal loss. So long-winded answer, sorry about that! But yeah, it's quite different.

Jorge: One of the things that I'm hearing there is that there are at least two dimensions that you can use to think about a channel. One dimension has to do with the bandwidth that is available to communicate these nuances that we're talking about. And what I'm getting from what you're saying there is that text — something like a chatbot — is a fairly low bandwidth channel, right?

Phillip: Yes.

Jorge: Like, we lose a lot of nuance. And another dimension has to do with context, with the amount of context that you have when engaging in that channel. And I'm saying this because, the way I envisioned it when you were talking about it, was that the mere fact that the chatbot is popping up in this website already sets boundaries for what you’re expected to deal with, right? Like you don't come to it expecting that it will play your favorite song.

Phillip: Right. That's right.

Jorge: It's going to be a conversation related to that thing, right?

Phillip: Right. And nor do we — for those of us who've used or worked in customer service over the telephone — sometimes where we have these little conversations about, "Oh, where are you? How's the weather, how are you?" So, we incorporate some of these things. You don't see that as much in text-only chatbots. And the other thing, that's a challenge there is the fact that we communicate at very different rates of speed verbally than we do typing and reading. We're much faster verbally. And the other thing is we are much more tolerant verbally of rambling and sort of things that would show up as incoherence if it were typed out. We repeat words, we pause in funny places, we gather our thoughts in the middle of a sentence and take a turn on a dime. And we keep up with that, verbally. Like we're really, really, really good at it! We don't understand how good we are, but we are really good at it. And translating that into text sometimes is just a trainwreck, even if we're doing almost the exact same behaviors.

Jorge: Yeah, I can relate to that, having to go through the transcripts for this show and make them legible. It's like, “Wow! There's a lot of repetition happening here."

Phillip: Yeah! And I can almost guarantee you that I'm going to be a tough one for you, even though I do this for a living. Sorry about that!

Jorge: No, it's fascinating. And it's inherent in the... I suspect that it's inherent in the channel, right? Like you're, it's almost like you're down sampling to a different channel.

Phillip: Yeah! That's an excellent way to think about it. Exactly. And to get techie for a second, when I first dealt with speech recognition, over the telephone... the telephone, because of economics, is a tremendously downsampled version of audio. You can ask anybody who works in music or who's an audiophile. It's just the telephone bandwidth is terrible when it comes to the higher and lower frequencies. So, it's just squished down to this middle. And yeah, it's very similar to that. And so, in speech recognition technology, we just lost all of the signal that was available for processing. If you recorded something into a microphone, we had that nice 44K bandwidth; it’s so much richer than something that comes out of the telephone. And so, yeah. It's very similar, just that signal compression, the signal loss. And our brains are, again, just really, really expert at doing things with it that we don't understand that it's doing. And so, because we don't understand it, we don't necessarily notice the loss of it, but part of our brain does. And it's like, "but I don't know what to do now because I'm so used to that being there."
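
To put rough numbers on that signal loss: a sampled signal can only represent frequencies up to half its sampling rate (the Nyquist limit). The sketch below assumes 8 kHz sampling for classic narrowband telephony and reads "44K" as the 44.1 kHz rate used for CD audio; both figures are assumptions for illustration, not values given in the conversation. Real telephone channels are narrower still, since they also cut the low end (roughly 300 to 3400 Hz passes through).

```python
def max_frequency_hz(sample_rate_hz):
    """Highest frequency a sampled signal can represent (the Nyquist limit)."""
    return sample_rate_hz / 2


TELEPHONE_RATE_HZ = 8_000  # assumed: classic narrowband telephony
RECORDING_RATE_HZ = 44_100  # assumed: the "44K" microphone recording

phone_limit = max_frequency_hz(TELEPHONE_RATE_HZ)
recording_limit = max_frequency_hz(RECORDING_RATE_HZ)

print(f"telephone: up to {phone_limit:.0f} Hz")
print(f"44.1 kHz recording: up to {recording_limit:.0f} Hz")
print(f"share of the recording's band that survives: {phone_limit / recording_limit:.0%}")
```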

Designing for conversation

Jorge: We've been talking about protocols and we've been talking about the signal and there's all these different aspects to this, and it also sounds like the channels are quite different. I'm wondering how one goes about designing for conversation. How do you prototype this stuff? How do you model it?

Phillip: Yeah! Right, right. Well, this is great. To start this, I'll touch on something that I think you asked, and I'm not sure I addressed earlier. But when we think about these systems, conversational systems, whether it's the cylindrical devices that we have, or whatever shape they are, how are those different from what we have available to us in human-to-human conversation? Well, a lot of it is that we focused on sort of the nugget of action. So, that's why a lot of these systems, what are they used mostly for? For playing music, getting weather, news, maybe opening an audiobook or listening to a podcast, or turning on lights. You know, all these sorts of things. To do that, the command sequence is all fairly straightforward, right? It's, "whatever-the-name-is, turn on this light" or "play this station or artist" or "start reading my book". And then whatever audiobook it was last reading will open up.

And so, what we're not designing currently, and what is not designed into any of these systems is really anything about that middle structure of conversation. We have different types of conversation. You can play a game. You can do this command kind of interaction. You know, there are ways to simulate interviews and things like that. And certainly, there's this undergirding of linguistic information, right? You have to know what the words are and what roles they usually play in a conversation.

This is an interesting experiment: If you take the words in a sentence that makes sense in the normal order, standard order, and then you mix them around, it's interesting to see what these assistants understand and don't. I'll tell you that most of them don't pay a lot of attention to the order of words, but the order does also matter somewhat. But what they don't have is this clear establishment of context and negotiation ability, where you can clarify or correct. The interactions really just sort of jump right to what we consider the meat of a conversation. And then we don't really... closure isn't really part of this either. You can see a little bit more of it in customer service type applications where someone dials a phone number and there's a greeting like, "hi, you've reached such and such, what can I do for you?" There's a... like you said a minute ago, chatbots have a limited range of things that are expected or understood. Mental model mismatch is a thing, but for the sake of this, we'll just keep it narrow.
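
One way to see why shuffled words can still "work" is a toy keyword matcher that ignores word order completely. This is a deliberately simplified sketch with invented intents and keywords; it is not how Alexa, Google Assistant, or any real assistant is implemented, and real systems do take order into account to some degree.

```python
import random

# Hypothetical intents and keywords, invented for this illustration.
INTENT_KEYWORDS = {
    "lights_on": {"turn", "on", "light"},
    "play_music": {"play", "music"},
}


def guess_intent(utterance):
    """Pick the intent whose keywords overlap most with the words spoken."""
    words = set(utterance.lower().split())
    scores = {name: len(words & keys) for name, keys in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


original = "turn on the kitchen light"
shuffled_words = original.split()
random.shuffle(shuffled_words)
shuffled = " ".join(shuffled_words)

# Word order never enters the score, so both versions get the same guess.
print(guess_intent(original), guess_intent(shuffled))
```

Nothing in this matcher can ask a clarifying question or accept a correction, which is the missing negotiation step.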

So, there's just a little bit of this sort of... we're giving some lip service to the greeting — pun intended. We're giving some negotiation, you know, of what's available and what's desired. And then it moves very quickly into action. And then at the end of the action, it might... the closure might be, "is there anything else I could do for you? If not, you know, have a great day." But with our virtual assistants, that shows up very rarely. It is there in some cases, but it's very rare.

So, I say all that to say, one of the first big steps in designing is — like with all other design — is really understanding what's the context, what's the goal, who's participating, what knowledge might they have? What knowledge do we expect them not to have? What do they want? Why do they want it? All of these sorts of questions that are fundamental to any sort of true design activity that we're doing, are still important. The thing now though, is instead of saying, “Okay, well that means we're going to have certain kinds of boxes or certain content on our screen," we're saying, “How do we translate all of that into words that we can exchange fairly easily?" And right now, I’ve got to say, we're mostly doing a really terrible job of it.

But your question was about prototyping. So, first of all, fundamentally we can prototype very simply. I'm a big, big fan of doing basically the equivalent of conversational sketching, which looks like a screenplay. And it doesn't matter if you write this out by hand (and there's some benefits to that) or, for speed, you type it up. But it basically looks like a back and forth of a screenplay and then you go try it with someone, ideally several someones. Someone who might know the technology and help give you some pointers from that angle, but also people who don't care or don't know about the technology. What you're looking for is how quickly can you come to that establishment of sort of clarity of context and purpose and meaning, so that you can proceed into the conversation. That's what those upfront sections are about. The early prototyping is just simulating this conversation with another human.

You can expand that into running that in a way, we call "Wizard of Oz" testing, which is where I'm pretending to be the system, and people are going to interact with me, but they don't know it's me and they can't see me. So, whether it's picking up a telephone that's connected to a different phone in the next room and, you know, pretending to talk to the phone or whether it's, you know, pretending to talk to the cylinder and I can pipe something back into the room... the idea is now you're simulating more of the end context, which is a person and a machine or a device. And there's a couple of different ways we can do that.
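
A minimal sketch of that setup in code, assuming a text exchange rather than a phone line or speaker: the participant's words are relayed to a hidden person, whose replies come back as the "system" and are logged as a screenplay-style transcript. The function names and the scripted stand-in for the hidden person are hypothetical, invented so the example can run unattended.

```python
def run_wizard_session(participant_turns, wizard_reply):
    """Relay each participant utterance to a hidden human and log the exchange.

    `wizard_reply` is any callable that takes the participant's words and
    returns the "system" response; in a real session a person supplies it.
    """
    transcript = []
    for utterance in participant_turns:
        transcript.append(("participant", utterance))
        transcript.append(("system", wizard_reply(utterance)))
    return transcript


def scripted_wizard(utterance):
    """Stand-in for the hidden human so the sketch runs unattended."""
    if "weather" in utterance.lower():
        return "It looks sunny this afternoon."
    return "Sorry, could you say that another way?"


session = run_wizard_session(
    ["What's the weather like?", "Book me a table."],
    scripted_wizard,
)
for speaker, line in session:
    print(f"{speaker.upper()}: {line}")
```

Replacing `scripted_wizard` with a function that prompts a person, for example one built on Python's `input()`, turns this into a live session; the logged transcript then shows where the participant and the "system" failed to establish context.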

We used to do that in some ways involving Keynote and PowerPoint and recordings and things like that. But today, there are also some tools that we can use that are prototyping tools for voice. Adobe XD has some of that built in, or there are tools like Voiceflow and Botmock that are available to do some of this as well. And they... they're a little bit more system-centric, in that they're representing capabilities of the end system where you might deploy this.

So, they have some built-in constraints. And then like all tools, they have philosophies and other things built into the tool that, when you're an experienced designer, you have to learn how to see, or how to work around. So, those limitations: the tool designer doesn't necessarily understand all the situations you're going to use the tool for. But those tools are available and some of them can be ported directly to one of these devices, in a private setting, so you can test them yourself. You can interact with them. They use text-to-speech technology to give the audio, although you can do human recordings with some of them as well. And really that's sort of the... that's where prototyping ends.

There are other tools out there. Google has Dialogflow and there's the Alexa Skills Kit tools, which I helped create. All of those are much more system-centric because you're starting to access the assets of those technologies and platforms. But they also have some level of simulation. They have beta modes where you can release it to a certain number of people to interact with it and get feedback on it, so you can make some changes before it goes live. And then they also have some amount of automated testing available too, where you can start to see holes in the application because you didn't specify some sort of action or maybe you didn't take care of a certain condition that might arise, but, you know, that's getting further into the end stage of development, away from prototyping.

Closing

Jorge: Well, this is all so fascinating. It feels like there's material here for us to go on, but unfortunately, we need to wrap things up.

Phillip: Right, right! Sure!

Jorge: Where can folks follow up with you, Phillip?

Phillip: Well, my consultancy is called Conversational Collaborative AI Services. Clearly, I am focused on some of the underlying artificial intelligence and machine learning things, and that's at ccaiservices.com, and I am Phillip, with two L's, at ccaiservices.com. And I'm also on Twitter as designoutloud, no hyphens or anything, just all one word, and I'm always happy to connect and discuss things on LinkedIn. So pretty easy to find there. I think I was lucky enough to get Phillip Hunter as my LinkedIn URL so you can find me there, and I love to talk about this stuff! Also, my site has... I've got a fair amount of content out there about these topics, where I go much deeper on... okay, once you understand these principles, how do you really start to apply them and how do you have an iterative and thoughtful design approach to writing for voice and text interaction. So yeah, any of those ways would be great.

Jorge: Well, this is fantastic. Thank you so much for this conversation about conversation!

Phillip: Well, it's my pleasure. And obviously I have a lot to say! And yes, we could go on for quite a while. In fact, I might even forget you're there and just keep talking while you're, while you're sleeping or, you know, petting your cat or whatever I saw…

Jorge: Maybe we need to do a part two.

Phillip: Well, maybe so! Let's see what kind of response we get, but I'd be happy to, and you know, it is a fascinating thing to think about and analyze. And if anyone wants to dive in, I have some great resources. There's a book called How We Talk by N.J. Enfield, that is also just really, really fascinating. And I'm currently reading another book that, so far, is very promising, but I'd want to finish it before I recommend it.

But I guess the other thing I want to say here to people is, don't just study the tools and the technology. You need to study people and conversation to really be good at this, if you want to get into it. It's way more sophisticated than anything we have done for standard web and mobile design. As important and as difficult as that work is, conversation has some really special and deep challenges. So, don't limit yourself to just understanding the technology and how to apply it.

Jorge: That seems like a great admonition and a good place to end it. Thank you so much!

Phillip: Oh, you're very welcome.