Voice-Guided Image Search with CLIP

Streamlit app for noun learning using CLIP, Azure STT, and multilingual APIs

Overview

Built an interactive Streamlit application that combines speech recognition and a vision-language model for educational noun learning across multiple languages.

Key Features

  • CLIP Integration: Leveraged OpenAI’s CLIP model for image-text matching (see the sketch after this list)
  • Voice Input: Integrated Azure Speech-to-Text (STT) for voice-guided interaction
  • Multilingual Support: Enabled learning in multiple languages through translation APIs
  • Interactive UI: Built with Streamlit for easy web-based access
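
A minimal sketch of the CLIP matching step is shown below. It assumes the Hugging Face transformers implementation of CLIP with a public checkpoint; the checkpoint name, helper function, and image paths are illustrative assumptions, not taken from the project itself.

```python
# Minimal sketch of CLIP image-text matching (assumes the Hugging Face
# `transformers` CLIP wrapper and a public checkpoint; the original app
# may load CLIP differently).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(word: str, image_paths: list[str]) -> list[tuple[str, float]]:
    """Score candidate images against a single vocabulary word, best match first."""
    images = [Image.open(path) for path in image_paths]
    inputs = processor(text=[word], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_images): similarity of the word to each image.
    scores = outputs.logits_per_text[0].tolist()
    return sorted(zip(image_paths, scores), key=lambda pair: pair[1], reverse=True)

# Example with placeholder image paths:
# rank_images("apple", ["images/apple.jpg", "images/dog.jpg", "images/car.jpg"])
```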

How It Works

  1. Voice Input: User speaks a noun in their native language
  2. Speech Recognition: Azure STT converts speech to text
  3. Translation: A translation API converts the recognized word into the target language
  4. Image Search: The CLIP model ranks candidate images against the translated word (steps 2–4 are sketched after this list)
  5. Learning: User sees images associated with the vocabulary word
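
The sketch below walks through steps 2–4 under stated assumptions: the speech calls use the azure-cognitiveservices-speech SDK's recognize-once flow, translate() is a hypothetical stand-in for whichever translation API the app actually uses, and rank_images() is the helper from the previous sketch.

```python
# Hedged sketch of steps 2-4: Azure STT -> translation -> CLIP ranking.
# `translate()` is a hypothetical placeholder, and `rank_images()` comes
# from the CLIP sketch above.
import azure.cognitiveservices.speech as speechsdk

def recognize_noun(key: str, region: str, language: str = "en-US") -> str:
    """Capture one utterance from the default microphone and return its transcript."""
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_recognition_language = language
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)
    result = recognizer.recognize_once_async().get()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Speech was not recognized: {result.reason}")
    return result.text

def translate(word: str, target_language: str = "en") -> str:
    """Hypothetical wrapper around a translation service (e.g. Azure Translator)."""
    raise NotImplementedError  # swap in the translation API of choice

def voice_to_images(key: str, region: str, image_paths: list[str]):
    spoken = recognize_noun(key, region)          # step 2: speech -> text
    target_word = translate(spoken)               # step 3: translate to target language
    return rank_images(target_word, image_paths)  # step 4: CLIP image search
```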

Technologies Used

  • Framework: Streamlit (a minimal UI sketch follows this list)
  • ML Model: CLIP (Contrastive Language-Image Pre-training)
  • Speech: Azure Speech-to-Text API
  • Languages: Python
  • Translation: Multilingual APIs
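
For completeness, a minimal Streamlit front end could tie the pieces together roughly as follows; the widget labels, secret names, and image paths are assumptions for illustration, not the app's actual interface.

```python
# Minimal Streamlit wrapper around the pipeline sketched above; labels,
# secrets names, and image paths are illustrative only.
import streamlit as st

st.title("Voice-Guided Image Search")

if st.button("Speak a noun"):
    word = recognize_noun(st.secrets["azure_key"], st.secrets["azure_region"])
    st.write(f"Heard: {word}")
    ranked = rank_images(translate(word), ["images/apple.jpg", "images/dog.jpg"])
    for path, score in ranked[:3]:
        st.image(path, caption=f"CLIP score: {score:.2f}")
```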

Publication

This project contributed to a publication:

“Exploring the Use of CLIP Model for Images Recommendation in Noun Memorization using Various Learning Context,” Bulletin of Research Center for Computing and Multimedia Studies, Hosei University, 2023.