ChatGPT now accepts voice commands and image prompts.

Photo by Emiliano Vittoriosi on Unsplash

OpenAI’s ChatGPT, once solely a text-based chatbot, is now evolving to understand queries through both voice commands and pictures. This marks a shift in the way users interact with the platform.

Historically, enhancements to ChatGPT centered on its capabilities, such as the range of questions it could answer and the depth of its information base. The current update, however, focuses on the user experience. OpenAI is launching a version in which users can engage with the AI by voice or by uploading an image. These features will initially be available to premium users, with a wider release to follow soon.

The voice interaction is straightforward. Users press a button and speak their query, which ChatGPT transcribes into text. Once the AI generates a response, it is converted back into speech and played to the user. The interaction is reminiscent of assistants like Alexa or Google Assistant, but OpenAI aims to deliver better responses by relying on its more advanced underlying model. With many virtual assistants now transitioning to large language models (LLMs), OpenAI is positioned at the front of that shift.
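The three stages described above (speech to text, response generation, text to speech) can be sketched as a simple pipeline. This is only an illustration of the flow, not OpenAI's implementation: the `VoicePipeline` class and the three `fake_*` stand-in functions are hypothetical names invented for this example, and the stand-ins just pass text through as bytes so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical three-stage voice loop: audio in, audio out."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage (e.g. a Whisper-like model)
    respond: Callable[[str], str]        # language-model stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)    # 1. spoken query -> text
        reply = self.respond(text)       # 2. text query -> text answer
        return self.synthesize(reply)    # 3. text answer -> spoken reply


# Stand-ins (assumptions for illustration only): real stages would call models.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_respond, fake_synthesize)
spoken_reply = pipeline.handle(b"what is the weather")
```

Keeping each stage behind its own function makes the design choice visible: the language model only ever sees and produces text, so the speech components on either side can be swapped without touching it.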

The voice feature relies in large part on OpenAI’s Whisper model, which converts speech to text. In addition, a forthcoming text-to-speech model will produce realistic audio from just text and a brief audio sample. Users can pick from five voice options for ChatGPT, but the potential doesn’t stop there. A collaboration with Spotify will enable podcast translation while preserving the character of the original speaker’s voice. The development of such synthetic voices, however, raises concerns about misuse. OpenAI acknowledges risks such as impersonation and fraud, and will therefore limit access to this model, reserving it for specific applications and collaborations.

The image search feature resembles Google Lens. Users can capture a photo, and ChatGPT will attempt to infer the implied question and answer it. A drawing tool in the app helps users clarify what they are asking about. The conversational design of ChatGPT makes for a more interactive search experience: instead of running static searches, users can iteratively refine their queries based on the bot’s feedback, much as in Google’s multimodal search. Still, there are challenges. When users ask about individuals, for instance, OpenAI has intentionally constrained ChatGPT from making definitive remarks, prioritizing accuracy and privacy.

Nearly a year after ChatGPT’s debut, OpenAI remains in an exploratory phase, aiming to expand ChatGPT’s functionality without introducing new problems. By placing intentional limits on the model’s capabilities, OpenAI is progressing cautiously. However, as the user base grows and ChatGPT evolves into a comprehensive virtual assistant, striking a balance between innovation and safety will inevitably become more difficult.