Serverless use cases — Voice apps

This article describes a popular serverless use case, voice applications, and identifies the key components and integrations needed to follow best practices.

Pablo Iorio
4 min read · May 25, 2022

This article is part of a series of writings about serverless. Previously, I wrote about what serverless is, design and technical trade-offs, business benefits, a comparison of market offerings, service-full vs no-code, and integration patterns: orchestration and choreography. The idea is to go through scenarios to understand when and how to get the best out of serverless.

What is a voice application?

A voice application is any software component that makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and answer questions. A voice-activated device is a device controlled through a voice user interface. [1]

There are two types of voice applications. The first covers conversational experiences (e.g. “Is it going to rain tomorrow?”, “What time is it on Mars right now?”). The second covers control of IoT devices such as smart light bulbs, air conditioners, garage doors, and coffee machines (e.g. “Turn the light on/off”).

Why is serverless a good fit for voice applications?

Voice applications are a good serverless scenario because these voice-first solutions rely on complex voice interactions that need Speech Language Understanding (SLU), Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and Text-to-Speech (TTS) conversions. It is simply too complex to build all of these components yourself.

Secondly, it is preferable to rely on out-of-the-box solutions that can elastically scale up and down to meet uncertain demand, since you cannot control how many users will be making requests at the same time.

Finally, you want to focus on your voice application (how to make it engaging, improvements, support, and so on) and not on the infrastructure.

What are the components of a voice application solution?

A voice app solution includes:

  • User devices (smart TVs, smart assistants, mobile phones)
  • Speech Language Understanding (SLU), Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and Text-to-Speech (TTS) conversions
  • A component that controls the user experience, including a custom interaction model and intents
  • A back-end component, usually an API, that can respond to queries

Solution design

The following diagram describes a generic voice application solution design.

Generic voice application solution design

How does it work?

  1. Alexa users speak (or, on devices such as mobile phones, type) to ask for what they want, for instance, “Alexa, tell me a joke”.
  2. Alexa-enabled devices, such as smart TVs, assistants like the Echo and Echo Dot, or mobile phones with the Alexa application installed, listen for a wake word and activate as soon as it is recognized.
  3. The Amazon Alexa Service performs common Speech Language Understanding (SLU) processing on behalf of the Alexa Skill, including Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and Text-to-Speech (TTS) conversion.
  4. The Alexa Custom Skill, built with the Alexa Skills Kit, controls the user experience, including a custom interaction model, intents, and Alexa Conversations. Within the Alexa Skills Kit, you can also develop Alexa Smart Home Skills to control IoT devices.
  5. The Skill Lambda function is the brains of the architecture. It processes the different types of requests sent from the Alexa Service and builds speech responses (a minimal sketch follows this list). Images and special audio effects can be stored in S3.
  6. DynamoDB (a NoSQL data store) is used to persist user state, sessions, or any other required data.
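
To make step 5 concrete, here is a minimal sketch of a Skill Lambda handler in Python, written against the raw JSON request/response format that the Alexa Service sends to the skill endpoint (in practice you would typically use the ASK SDK). The intent name TellJokeIntent is an assumed custom intent, not part of the built-in model.

```python
# Minimal sketch of the Skill Lambda function (step 5), assuming a custom
# intent named TellJokeIntent defined in the skill's interaction model.

def build_response(speech_text, end_session=True):
    """Wrap plain text in the response envelope the Alexa Service expects."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    """Entry point invoked by the Alexa Skills Kit trigger."""
    request_type = event["request"]["type"]

    if request_type == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        return build_response("Welcome! Ask me to tell you a joke.",
                              end_session=False)

    if request_type == "IntentRequest":
        intent_name = event["request"]["intent"]["name"]
        if intent_name == "TellJokeIntent":
            return build_response("Why do programmers prefer dark mode? "
                                  "Because light attracts bugs.")

    # SessionEndedRequest and anything unrecognized fall through here.
    return build_response("Sorry, I didn't get that.")
```

State that must survive across turns or sessions (step 6) would typically be read and written from DynamoDB inside this handler, keyed by the user ID carried in the request envelope.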

Key integration points

Alexa Skills Endpoint

Alexa developer console screenshot

The above screenshot corresponds to point (4) of the solution design. As highlighted, the endpoint receives requests when a user speaks (1) and then calls your logic implemented in the Lambda function (5). You could also implement your logic behind a separate HTTPS endpoint (e.g. a Google Cloud Function).
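
For reference, the envelope that arrives at the endpoint is JSON. Below is a trimmed, hypothetical example (IDs are placeholders, and many real fields such as session attributes and device context are omitted), which also doubles as a local test input for the handler sketched earlier.

```python
# Trimmed, hypothetical example of the request the Alexa Service sends to
# the endpoint (4); IDs are placeholders, and many real fields are omitted.
sample_event = {
    "version": "1.0",
    "session": {
        "new": True,
        "sessionId": "amzn1.echo-api.session.example",  # placeholder
    },
    "request": {
        "type": "IntentRequest",
        "requestId": "amzn1.echo-api.request.example",  # placeholder
        "locale": "en-US",
        "intent": {"name": "TellJokeIntent", "slots": {}},
    },
}

# Invoke the handler from the earlier sketch locally, outside of Lambda.
print(lambda_handler(sample_event, None))
```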

Lambda function

Lambda function overview

This is an important integration point: the function must allow invocation by the Alexa Skills Kit; otherwise, the skill won’t work.

The function overview lets you see the triggers, layers, and destinations attached to your function. Triggers are AWS services or resources that invoke the function (5), such as the Alexa Skills Kit (4). In this view you can also configure destinations, which are AWS resources that receive a record of an invocation after success or failure. Layers are resources that contain libraries, a custom runtime, or other dependencies.
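
This permission can also be granted programmatically rather than through the console. Below is a sketch using boto3, where the function name and skill ID are placeholders; the EventSourceToken restricts invocations to one specific skill, which is the recommended practice.

```python
import boto3

lambda_client = boto3.client("lambda")

# Grant the Alexa Skills Kit permission to invoke the function.
# FunctionName and EventSourceToken below are placeholders.
lambda_client.add_permission(
    FunctionName="my-alexa-skill-function",
    StatementId="alexa-skills-kit-trigger",
    Action="lambda:InvokeFunction",
    Principal="alexa-appkit.amazon.com",
    # Ties the permission to a specific skill ID so only that
    # skill can trigger the function.
    EventSourceToken="amzn1.ask.skill.00000000-0000-0000-0000-000000000000",
)
```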

Conclusion

Voice applications are growing in popularity, and serverless is a great fit for implementing them: reusability lets you focus on the engaging parts of the app, elasticity copes with uncertain workloads, and together these deliver speed to market. In their basic shape, these solutions are easy to implement, with few integration points, which reduces complexity and errors. However, it can be harder to create a unique and engaging experience.

References

  • [1] What is a voice application? — PentaTech Voice
  • [2] How are Alexa voice applications designed? — PentaTech Voice
  • [3] AWS Well-Architected Framework, Serverless Applications Lens — Alexa Skills

Disclaimers

This is a personal article. The opinions expressed here represent my own and not those of my employer.
