Build a Smart Voice Assistant: Planning the PoC prototype

Part of delivering effective Voice and A.I. solutions to our clients means experimenting with the tools and technologies we have at our disposal and finding ways to develop new and exciting capabilities.


Last year, we began working on a Proof of Concept (PoC) project for a new Smart Voice Assistant called Omochi, with the aim of integrating voice recognition, natural language, biometrics, and voice synthesis into a single automatic dialog management system. By placing these functionalities in a single system, we hope to showcase our different technologies and improve the way they interact with each other.

This blog series will share our journey throughout the project, so you can see how some of the services we offer originate, develop and mature into full working solutions.




The Plan

For PoC projects like this one, we start by outlining an initial plan, which includes listing the features and functionalities we’d like the solution to include, as well as some self-imposed constraints we’ll need to adhere to. 



We’re aiming for the prototype of Omochi to include:

– General assistance capabilities, such as weather checking.

– Ability to answer questions about a specific domain.

– Ability to answer chit-chat and trivia.

– Must support several languages at the same time, in addition to English. We’ve challenged ourselves to incorporate Japanese, which will be quite complex because it uses a different writing system.

– Activation using a wake-up word.

– Ability to register and recognize speakers, with personalized interactions.

– Must work in real time, since it has to interact with humans.

– Ability to provide user feedback with lights.


These requirements and the nature of the solution mean we’ll face time and technology constraints:

1. Must be built using consumer-grade hardware: Raspberry Pi™ 3 and the Jabra Speak 410 microphone.

2. The device must be portable and wireless.

3. Prototype to be developed in less than a month.


We started by gathering an interdepartmental team of 13 brave and talented people with different skills and competences, each eager to dedicate their valuable time and energy. To stay within our time frame, we decided to keep management overhead to a minimum and maximize development time. In practice, this meant short daily stand-ups and open demos to improve communication and collaboration.

When it came to designing the prototype of Omochi, we started by brainstorming human-machine dialogs that would showcase all the features our technology can provide. People from our marketing team also helped us explore how the product could be positioned in the future. From there, we were able to create an initial map of the software system.


The system features Verbio’s CSR, TTS, Dialog Manager, and Biometry modules, along with new services such as wake-up word detection, voice activity detection, and language detection, a smart-voice application that connects each element of the system, and an audio client that runs on the physical device.
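To make the description above a little more concrete, here is a minimal Python sketch of how a smart-voice application might chain such modules for a single user turn. It is an illustration only, not Omochi's actual code: the `Turn` fields, the stage functions, their order, and the placeholder values are all hypothetical, and we assume wake-up word detection and voice activity detection have already selected the audio for the turn.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Turn:
    """Data carried through one interaction. All field names are hypothetical."""
    audio: bytes
    language: Optional[str] = None
    speaker: Optional[str] = None
    transcript: Optional[str] = None
    reply_text: Optional[str] = None
    reply_audio: Optional[bytes] = None
    trace: List[str] = field(default_factory=list)


# A stage receives the turn, fills in its own part, and hands the turn on.
Stage = Callable[[Turn], Turn]


def language_detection(turn: Turn) -> Turn:
    turn.language = "ja-JP"  # placeholder result, not a real detector
    turn.trace.append("language_detection")
    return turn


def biometrics(turn: Turn) -> Turn:
    turn.speaker = "guest"  # placeholder for a speaker-recognition result
    turn.trace.append("biometrics")
    return turn


def speech_recognition(turn: Turn) -> Turn:
    turn.transcript = "<transcript>"  # placeholder for the recognizer output
    turn.trace.append("speech_recognition")
    return turn


def dialog_manager(turn: Turn) -> Turn:
    # Personalized reply based on the (placeholder) recognized speaker.
    turn.reply_text = f"Hello, {turn.speaker}"
    turn.trace.append("dialog_manager")
    return turn


def tts(turn: Turn) -> Turn:
    # Stand-in for synthesis: encode the text instead of producing audio.
    turn.reply_audio = (turn.reply_text or "").encode("utf-8")
    turn.trace.append("tts")
    return turn


# Assumed ordering for illustration; the real system may differ.
STAGES: List[Stage] = [
    language_detection,
    biometrics,
    speech_recognition,
    dialog_manager,
    tts,
]


def run_pipeline(turn: Turn, stages: List[Stage]) -> Turn:
    """Pass the turn through each stage in order."""
    for stage in stages:
        turn = stage(turn)
    return turn


if __name__ == "__main__":
    result = run_pipeline(Turn(audio=b"\x00\x00"), STAGES)
    print(result.trace)
    print(result.reply_text)
```

Keeping each module behind the same small interface is one way to let a team swap a stub for a real service without touching the rest of the chain, which can help when many people develop in parallel.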

As we moved into the development phase, we encountered some technical challenges – many of which are typical when creating this kind of solution. These include sending audio from the device to the server, adding language detection and deciding how to market the product.
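As a rough illustration of the first challenge, a common way to send audio from a device to a server in near real time is to cut the raw PCM stream into small fixed-size frames. The sketch below shows only that framing step. The sample rate, sample width, and frame length are assumptions chosen for the example, not Omochi's actual settings, and the function names are hypothetical.

```python
from typing import Iterator

# Assumed audio format for this example: 16 kHz, 16-bit mono PCM, 20 ms frames.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
FRAME_MS = 20


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    frame_ms: int = FRAME_MS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Number of bytes in one frame of mono PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # one second of silence
    frames = list(iter_frames(one_second, frame_size_bytes()))
    print(len(frames), "frames of", frame_size_bytes(), "bytes")
```

Smaller frames reduce the delay before the server can start processing, at the cost of more messages per second, so the frame length is a trade-off to tune for the network and hardware in use.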

Stay tuned for part two, where we’ll share how we were able to solve these challenges and push the project forward.


Written by Pere Comas, Mònica Sensat and Carlos Quintana
