Did I Just Have a Video Chat with an AI?
BY: Steve Callan | 11/13/23
A Series on Artificial Intelligence: Part 1
Um....was I just interviewed by an AI over video?
A few days ago, I was asked to speak on behalf of SilverTech about the current state of AI and how our clients perceive and use it today. The conversation took place over video and was preceded by several emails and a pre-generated list of the questions that would be asked. The gentleman I spoke with was polite, attentive to my answers, and displayed natural human inflection, facial expressions, and engagement. However, as someone experienced in AI, there were moments in our dialogue that led me to question whether I was conversing with a real person or an avatar powered by a sophisticated language model.
I’m sure my background in technology has me paranoid....right?
At SilverTech, we are tech geeks. We love solving difficult problems for our clients with technology, and we often find ourselves exploring new technologies in our free time. One of the hallmarks of the people at SilverTech is that we enjoy deconstructing things to understand how they work. Most often that means taking apart an existing technology to see how it ticks, but sometimes it's a technology that doesn't exist yet, and we ask ourselves, "if it did exist, what would it look like inside if we took it apart?"
Considering our AI interviewer (assuming that was indeed his nature), how would one construct such a system? To begin, I would break it down into high-level components: the core technologies involved and how I would integrate them to simulate a believable human:
Language and Conversation
The ease of human interaction through language has already been demonstrated; large language models such as ChatGPT and Meta's Llama2 have shown us that we can engage with AI using written dialogue and receive responses that are convincingly human-like. When these responses are imbued with distinct personality traits, intentions, and motivations, distinguishing between a real human and an AI becomes nearly impossible.
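To make that concrete, here is a minimal sketch of how an avatar backend might wrap a language model in an interviewer persona, using the common "system/user/assistant" chat message format. The persona text, helper name, and question are illustrative placeholders, not taken from any real product.

```python
# Sketch: wrapping a chat-style language model in an interviewer persona.
# The message schema mirrors the widely used system/user/assistant chat
# format; the persona and questions are illustrative placeholders.

PERSONA = (
    "You are a polite, attentive video interviewer. Ask one question at a "
    "time, acknowledge the guest's previous answer, and keep replies brief."
)

def build_interview_turn(history: list[dict], guest_answer: str) -> list[dict]:
    """Assemble the message list sent to the model for the next turn."""
    messages = [{"role": "system", "content": PERSONA}]
    messages.extend(history)                                    # prior Q&A turns
    messages.append({"role": "user", "content": guest_answer})  # latest answer
    return messages

# Example: the second turn of a hypothetical interview.
history = [
    {"role": "assistant", "content": "How do your clients use AI today?"},
]
turn = build_interview_turn(history, "Mostly for content and personalization.")
```

The persona prompt is what supplies the "distinct personality traits, intentions, and motivations" described above; everything else is ordinary conversation bookkeeping.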
Voice Synthesis
Voice synthesis technologies like Azure Text to Speech and Amazon Polly have been around for some time, but believability has always been a stumbling block due to their distinctly computerized sound. Recently, however, breakthroughs have been made. Technologies now on the market such as Descript's Lyrebird and ElevenLabs can create or emulate a human voice with uncanny accuracy. These technologies are propelling the AI landscape forward with generative audio models that give our ChatGPT interactions human-like speech.
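As a rough illustration of how these services are driven, the sketch below builds the SSML payload that a service like Azure Text to Speech accepts. The voice name is just an example neural voice; a real integration would send this document to the service's REST endpoint with an API key, which is omitted here.

```python
# Sketch: constructing an SSML request body for a neural text-to-speech
# service. SSML lets the caller control voice, rate, and pitch, which is
# part of what makes modern synthesis sound less robotic.
from xml.sax.saxutils import escape

def build_ssml(text: str, voice: str = "en-US-JennyNeural", rate: str = "0%") -> str:
    """Return an SSML document asking for `text` spoken by `voice`."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        "</voice></speak>"
    )

ssml = build_ssml("Thanks for joining me today. Shall we begin?")
```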
Facial Capture and Expressions
The surge in VR and augmented reality has brought forth an array of technologies designed to digitize real-world objects and people for use in 3D environments. Volumetric capture, which once required a visit to a specialized studio and extended post-production timelines, can now be achieved in near real-time with incredible accuracy. Tools such as Epic's MetaHuman, as recently demonstrated at State of Unreal 2023, show how these high-fidelity digital humans can be easily animated to produce speech and incredibly lifelike facial expressions. By synchronizing the expected sounds of speech to these digital faces, much like in animation, we edge closer to producing videos with AI-generated dialogue.
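That synchronization step is usually done by mapping the phonemes in the audio to "visemes", the mouth shapes a digital face can display. Here is a toy version of that mapping with a deliberately tiny, simplified viseme set; production pipelines use standardized sets such as the Oculus or ARKit visemes.

```python
# Sketch: mapping phonemes to visemes (mouth shapes) for facial animation.
# This phoneme/viseme table is a deliberately small, simplified example.
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "smile",      # as in "see"
    "UW": "round",      # as in "you"
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth-lip", "V": "teeth-lip",
}

def to_viseme_track(phonemes: list[str]) -> list[str]:
    """Convert a phoneme sequence into the mouth shapes to animate."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# The phonemes of a sound like "mba" become: closed, closed, open.
track = to_viseme_track(["M", "B", "AA"])
```

Played back against the audio with the right timing, a track like this is what makes the digital face appear to speak.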
Real-Time Integration
The ultimate goal is a system that can interact in real time, synthesizing text, voice, and facial expressions on the fly. This involves integrating several complex technologies, each requiring communication channels to interact, which introduces latency. During my conversation with what might have been an AI, the responses were relatively immediate, without the delays one would expect from such a system. To achieve this, one would need to condense these technologies into a unified stack or push the models out to edge networks. Despite the inherent challenges, the prospect isn't far-fetched. Given the expanding array of models available today, and efforts from companies such as Cloudflare to push AI models onto edge networks, we are not far away.
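To see why latency is the crux, consider a back-of-the-napkin latency budget. The per-stage numbers below are illustrative guesses, not measurements; the point is simply that the stages add up, and a conversational pause starts to feel unnatural once it stretches much past half a second.

```python
# Sketch: a back-of-the-napkin latency budget for an AI video interviewer.
# Stage timings are illustrative guesses in milliseconds, not measurements.
PIPELINE_MS = {
    "speech-to-text": 150,
    "language model first token": 300,
    "voice synthesis": 200,
    "facial animation + video": 100,
    "network round trips": 120,
}

CONVERSATIONAL_BUDGET_MS = 500  # rough threshold before a pause feels awkward

def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum the per-stage delays a single reply would pass through."""
    return sum(stages.values())

total = total_latency_ms(PIPELINE_MS)          # 870 ms with these guesses
over_budget = total > CONVERSATIONAL_BUDGET_MS  # True: the pipeline is too slow
```

Under these assumed numbers the pipeline overshoots the budget, which is exactly why collapsing the hops and moving models to the edge (attacking the network term) matters so much.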
This type of innovation is not somewhere in the distant future; the only things it lacks are someone connecting the dots between divergent technologies and a reason to do it. As architects of this new digital era, it's our responsibility to wield this power wisely, exercising ethical judgment in service of the collective good while keeping pace with the rapid innovation around us.
This outline represents only high-level, back-of-the-napkin thinking and doesn't cover every aspect of delivering on this reality. Over the next several years, however, we'll begin to see these connected technologies come together with increasing levels of fidelity. Soon it will be impossible to distinguish fact from fiction.
Perhaps the most thought-provoking question of all, though, is what becomes of digital communications when we can no longer trust any form of it?