GPT Chat’s Path to Profit: Modalities and Mobile

The conversational nature of GenAI chat lends itself to voice modalities much more naturally than web search engines do. Speaking a keyword query and having result URLs read back to you feels stilted. Search is a discrete action and doesn't carry state from one search to the next. A voice conversation with a search engine feels like many single-turn interactions, while a multi-turn voice interaction with a GPT chatbot is much more like a continuous human-to-human discussion.

In part 2 of this series, we called out the persuasive nature of GenAI-driven chat. While this is true for text-based interactions, it is even more the case when we switch to voice. Voice vastly increases the range of psychological manipulation techniques available to platforms. The more an ad platform can persuade and influence behavior, the more valuable it is. Voice can also play a key role in making the platform stickier to improve user retention; I will cover this in a later section.

While voice captures the literal words a user speaks and can feed them to the model as the same input the text interface produces, it also captures valuable signals that text alone lacks, or carries only faintly. Analyzing how someone is speaking (volume, rate, tone, and so on) can reveal information about their mood. Automated mood detection isn't an exact science, and it is an area where predictive AI solutions may run into trouble. Even so, it is likely valuable data for ad personalization and targeting: in the worst case, the platform simply serves an ad that isn't well targeted.
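To make the idea concrete, here is a minimal, hypothetical sketch of the kind of signal extraction involved: compute coarse acoustic features (loudness, speaking rate) and map them to a rough mood label. The thresholds and labels are assumptions for illustration only; real mood detection would use trained models over many more acoustic features.

```python
import math

def rms_volume(samples):
    """Root-mean-square amplitude as a crude proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def speaking_rate(word_count, duration_seconds):
    """Words per minute as a crude proxy for speaking rate."""
    return word_count / duration_seconds * 60

def guess_mood(volume, rate_wpm):
    """Toy thresholds (assumed, not calibrated): loud and fast reads as agitated."""
    if volume > 0.5 and rate_wpm > 180:
        return "agitated"
    if volume < 0.1 and rate_wpm < 110:
        return "subdued"
    return "neutral"

# Synthetic example: one second of a quiet 220 Hz tone at 8 kHz,
# standing in for a soft, slow utterance (15 words over 10 seconds).
samples = [0.05 * math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
mood = guess_mood(rms_volume(samples), speaking_rate(15, 10))
print(mood)  # -> "subdued"
```

A production system would feed features like these into a classifier rather than hand-tuned thresholds, but the privacy implication is the same: the inference happens as a side effect of a conversation the user initiated for another purpose.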

Background noise is usually something applications want to filter out, and GPT chatbots will need to filter it to transcribe speech into text prompts accurately. But they could also analyze that background noise and tag the conversation with additional attributes. For example, if the chat picks up children in the background, that could be valuable information for advertisers to leverage and exploit.

You can certainly use voice while working on a PC, but mobile is the more likely home for the majority of voice-driven chats. Speaking hands-free is far more comfortable than thumbing out a long conversation. But when a spoken response is your only interface, you lose detail about the sources behind the response. Most generated responses cite the sources that contributed to them; while citations are not a prominent aspect of the text interface, at least they are visible there. You might be able to ask a follow-up question about sources in a voice chat, but I suspect most users won't. The less users know about where a response comes from, the more opportunities advertisers have to inject ad content that appears to be part of the system's response.

Mobile platforms provide yet more valuable user information, like GPS location and speed. Geolocation data enables brick-and-mortar businesses to reach customers in their vicinity. Speed can tell you whether a user is walking, running, biking, or driving, and advertisers can tailor their offers to match the current activity.
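The speed-to-activity inference above can be sketched with a simple threshold classifier. The speed bands here are my own rough assumptions for illustration, not values from any real ad platform:

```python
def classify_activity(speed_kmh: float) -> str:
    """Map a GPS-derived speed to a coarse activity label.

    Thresholds are illustrative assumptions; a real system would also
    smooth over time and account for GPS noise and traffic stops.
    """
    if speed_kmh < 0.5:
        return "stationary"
    if speed_kmh < 7:
        return "walking"
    if speed_kmh < 14:
        return "running"
    if speed_kmh < 30:
        return "biking"
    return "driving"

print(classify_activity(5))   # -> "walking"
print(classify_activity(60))  # -> "driving"
```

Even a classifier this crude is enough to switch between, say, a coffee-shop offer for a walker and a drive-through offer for a driver passing the same location.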

Mobile and voice enrich platforms’ data and create the possibility for highly individualized advertising content. Next, we’ll take a look at agents, which could be the killer app for GPT chat and a boon for ad conversion.

