When Alexa went rogue: The importance of context design
Wednesday, September 5, 2018
I recently read a story on Daring Fireball about Amazon Alexa secretly recording a conversation and emailing it. Amazon’s official explanation was that Alexa overheard an ongoing conversation and interpreted it as instructions to record an ‘audio memo’ and email it. Surprise!
For this article, I’ll assume Amazon’s official explanation is true and it wasn’t some CIA conspiracy gone wrong. The explanation is believable: Alexa sits in busy places such as kitchens and living rooms, lots of people own these devices, the audio memo feature exists, and Alexa is easily triggered by accident. Statistically, something like this was bound to happen. Or as the Mythbusters say, plausible.
The crime unfolds
Here’s how I imagine it went down. A couple sit in their living room near the kitchen, talking passionately about hardwood floors. (This was the actual reported topic of conversation.) They want the floors to ‘accentuate’ the other natural wood in the house (or some other word that sounds something like ‘Alexa’). Meanwhile, Alexa sits unnoticed up on the kitchen counter. Alexa’s volume is set low, so the device’s voice feedback is drowned out by the ongoing chatter. As the humans’ conversation continues, Alexa thinks it’s part of the conversation. Inevitably, the right combination of words is uttered.
Conversation is more than words
I’ve got young kids. It takes time for them to understand they can’t just barge into your conversation to start telling you about their latest LEGO creation. It takes time for them to learn when they need to speak up, or use a quiet voice, and that two people can’t communicate by talking over one another.
The key lessons they have learnt are:
- Moderate your voice volume to the situation;
- There is an ‘inside voice’ and an ‘outside voice’;
- If you’re talking to someone far away, speak louder; and
- If that person talks over you when you speak, something is amiss.
(Regarding the first bullet point: I read a tweet somewhere, which I can’t find now, where someone said they whispered to Alexa and it whispered back. I’m unable to reproduce this behaviour, although I have got Alexa to whisper by asking it to.)
Let’s imagine Alexa also knew these key lessons about conversation etiquette. If Alexa heard a distant conversation that it thought it was part of, it would raise its volume to match. If Alexa butted loudly into your conversation, you would notice. And if you kept talking over it while it spoke, it would realise something was wrong and could do something smarter.
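Here’s a rough sketch of that etiquette in Python. Every name and number in it is invented for illustration (ListeningContext, the volume scale, the one-step-per-2-metres rule); none of it comes from anything Alexa actually exposes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ListeningContext:
    speaker_distance_m: float  # hypothetical estimate of how far away the speaker is
    talked_over: bool          # True if someone kept speaking while the assistant spoke


def choose_response(ctx: ListeningContext,
                    base_volume: int = 3,
                    max_volume: int = 10) -> Tuple[str, int]:
    """Return an (action, volume) pair for the assistant's next reply."""
    if ctx.talked_over:
        # Being talked over suggests nobody is addressing the assistant,
        # so stop and ask rather than carry on with the command.
        return ("stop_and_confirm", base_volume)
    # Speak louder for distant speakers: one step per 2 m (an arbitrary choice).
    extra = int(ctx.speaker_distance_m // 2)
    return ("speak", min(max_volume, base_volume + extra))
```

The point isn’t the particular numbers. It’s that being talked over is treated as a signal in its own right, one that can stop a command before it does something like send an email.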
Other apps with context?
Google Maps offers three different search types: walking, transit and car. I find myself repeatedly forgetting to switch the means of travel. This mistake requires me to back out of a search and to perform it again. Argh.
Note: as this post sat in my drafts, a new version of the UI appeared which allows you to change your transportation type after selecting a destination. Woohoo!
Apple Maps goes one step further as it can detect when you’re walking and switch to walking directions. This is a good start.
But let’s imagine these apps had a more robust understanding of the context of the user. What are some of the inputs or clues to help understand the intention of the user?
✔ Phone in near vertical or portrait orientation and largely still (car holder)
✔ Known bluetooth network from a car sound system
✔ Receiving a charge
✔ Destination exceeds ‘normal’ walkable distance
✔ Phone on the move, or being hand-held in a way that implies the user is active/walking
✔ Device currently charging
⛌ No previous trip taken to this general area by car
✔ User at a known transit hub or train station
✔ Searched destination exceeds normal walking distance
✔ Searched destination within normal walking distance
✔ Phone on the move, or being hand-held in a way that implies the user is active/walking
⛌ Device not being charged
Yes. There are many reasons why this is a fuzzy problem.
- I have an external portable battery for my phone;
- I use a powered wheelchair where I mount my phone;
- My car park is on top of a transit centre; or
- I prefer to walk long distances.
I’d encourage product teams not to give up at this point. With the right support, context can be detected robustly.
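To make that less abstract, here’s a minimal Python sketch of how clues like the ones above might be combined. It’s a toy scoring scheme, not how Google Maps or Apple Maps work: the Signals fields, the weights and the 2 km ‘walkable’ cut-off are all my own assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

WALKABLE_KM = 2.0  # assumed cut-off for a 'normal' walkable distance


@dataclass
class Signals:
    mounted_and_still: bool  # near vertical and largely still (car holder)
    car_bluetooth: bool      # connected to a known car sound system
    charging: bool
    handheld_moving: bool    # held in a way that implies walking
    at_transit_hub: bool
    destination_km: float


def guess_travel_mode(s: Signals) -> Tuple[str, bool]:
    """Return (best_guess, confident). Booleans count as 0 or 1 in the sums."""
    far = s.destination_km > WALKABLE_KM
    scores = {
        "car": 2 * s.car_bluetooth + s.mounted_and_still + s.charging + far,
        "transit": 2 * s.at_transit_hub + far,
        "walking": s.handheld_moving + (not far) + (not s.charging),
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    # Only claim confidence when the winner is clearly ahead of the runner-up.
    return best, (top - runner_up) >= 2
```

The second return value matters as much as the first. When my car park is on top of a transit centre, the scores tie, and the app should fall back to asking me or to my last choice instead of guessing.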
Scannable: A case study
Here’s a story about how I learnt the value of context.
While at Evernote, I worked on the document-scanning app Scannable. Scannable is an example of an app that has some understanding of context. User content provides a great deal of context. For Scannable, there is a set of document types that users commonly scan. Scannable classifies these into buckets, including magazine, black and white text document, photo, illustration, business card, whiteboard and post-it note. You can see this as Scannable defaults to descriptive filenames such as photo.jpg or whiteboard.png. Anyone who’s used the post-it notes feature will also know that Evernote can sort scans based on post-it colour.
Context pays dividends
Developing a system in Scannable to classify the content (what’s being scanned) wasn’t cheap or fast. But knowing what the heck the user is trying to scan allowed us to deliver a powerful set of user experience improvements:
- Image cleanup switches between appropriate greyscale or multicolour configurations, producing a better-looking ‘scan’. This also produces better OCR and searchability.
- File compression vastly improves when you know what you’re compressing. Black and white text documents (the most common kind of thing scanned) are greyscale compressed to produce astoundingly small PNG files with excellent legibility.
- As mentioned before, filenames default to something more human, meaning documents become more searchable.
- Knowing a document is a ‘business card’ allows Scannable to trigger a special UI flow that fetches metadata based on OCRing the card.
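One way to picture this is as a lookup from the classified document type to a handling profile. The Python below is my own illustration, not Scannable’s actual code. Only the photo.jpg and whiteboard.png filenames and the greyscale PNG for text documents come from the description above; the other values, the ScanProfile type and the ‘contact_lookup’ flow name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanProfile:
    colour_mode: str                    # "greyscale" or "colour"
    file_format: str                    # file extension to save as
    default_name: str                   # human-friendly filename stem
    special_flow: Optional[str] = None  # extra UI flow to trigger, if any


PROFILES = {
    "text_document": ScanProfile("greyscale", "png", "document"),
    "photo": ScanProfile("colour", "jpg", "photo"),
    "whiteboard": ScanProfile("colour", "png", "whiteboard"),
    # Assumed values: the card triggers a metadata lookup after OCR.
    "business_card": ScanProfile("colour", "jpg", "business card",
                                 special_flow="contact_lookup"),
}
FALLBACK = ScanProfile("colour", "jpg", "scan")  # used for unrecognised types


def plan_scan(doc_type: str) -> ScanProfile:
    """Pick the handling profile for a classified document type."""
    return PROFILES.get(doc_type, FALLBACK)


def default_filename(doc_type: str) -> str:
    profile = plan_scan(doc_type)
    return f"{profile.default_name}.{profile.file_format}"
```

Once the classifier exists, each new benefit is one more field in the profile. The expensive part is the classification; everything downstream is cheap.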
The future of things like Alexa
Machine learning is a great tool for classification. I can imagine many applications that could benefit from the classification of their main content or context into a handful of categories.
What if a podcast player knew it was playing music and not spoken words? What if a weather app knew if it was inside or out? What if a camera app knew it had a dirty lens?
My experience with Scannable shows how valuable context is to an app. I encourage you to also look for opportunities to classify what your users are doing into some common activities.
Designing invisible solutions
So, we know these fuzzy systems can often make obtuse mistakes. We also know, with effort, these systems can have a sense of the context of their small worlds. So how do we build technology that doesn’t do the equivalent of walking through a preschool with its pants off?
There’s a commonly quoted phrase from Steve Jobs: “Most people make the mistake of thinking design is what it looks like. People think it’s this veneer — that the designers are handed this box and told: ‘Make it look good!’ That’s not what we think design is. It’s not just what it looks like and feels like. Design is how it works.”
Carefully tuned systems take months to develop and are often created by just one or two people. They’re systems and designs that have been prototyped in the real world and tweaked by their carers.
But like parenting, it’s not sexy. And it often doesn’t produce visible, virtuosic results. Your kid might not be able to play Bach’s Cello Suites, but she doesn’t wipe her nose on the curtains. If you’re not in the industry, you may not guess that many of the apps you use are built by ‘app shops’ or designed by one company and outsourced to another. And in these scenarios, there’s neither the time, budget nor incentive to fine-tune the ‘how it works’.
For designers-for-hire, there’s no sexy UI you can print and stick on the wall or in your folio. Developers often simply don’t have time to polish at this level. There’s a tonne of work to make something great, and there’s no numerical endpoint to help get product management off your back.
The solution is for design and development to meld, either in a single person or a tight partnership, and to be allowed enough time to really fine-tune one of these solutions. Management needs to understand the results may not be visually marketable, but the value in real-world usability will be noticed by customers.
And it’s increasingly important in design. Personalised computing, the Internet of Things and wearables are all context-dependent. My startup at teampurr.com is about remote work: a place in computing where context is sorely lacking. Today, computers know (or care?) almost nothing about the human’s state, but it’s time to change that.