Voice-Guided Image Search with CLIP
Streamlit app for noun learning using CLIP, Azure STT, and multilingual APIs
Overview
Built an interactive Streamlit application that combines voice recognition and vision-language models for educational noun learning across multiple languages.
Key Features
- CLIP Integration: Leveraged OpenAI’s CLIP model for image-text matching (a minimal sketch appears after this list)
- Voice Input: Integrated Azure Speech-to-Text (STT) for voice-guided interaction
- Multilingual Support: Enabled learning in multiple languages through translation APIs
- Interactive UI: Built with Streamlit for easy web-based access
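A minimal sketch of how the CLIP image-text matching step might look, assuming the Hugging Face `transformers` implementation of CLIP and a local pool of candidate images. The checkpoint name, prompt template, and `rank_images` helper are illustrative, not the exact code used in the app:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the app may use a different CLIP variant.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def rank_images(word: str, image_paths: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k images that CLIP scores as the best match for `word`."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    # Simple prompt template; the published study compares several learning contexts.
    inputs = processor(text=[f"a photo of a {word}"], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_text[0]            # shape: (num_images,)
    order = scores.argsort(descending=True)[:top_k]
    return [image_paths[i] for i in order]
```

With a curated image pool, `rank_images("dog", pool)` would return the paths to display in the learner’s view.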
How It Works
- Voice Input: The user speaks a noun in their native language
- Speech Recognition: Azure STT transcribes the utterance to text
- Translation: A translation API renders the recognized noun in the target language (these two steps are sketched after this list)
- Image Search: The CLIP model ranks candidate images by how well they match the translated word
- Learning: The user sees the top-ranked images alongside the vocabulary word
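A hedged sketch of the speech-recognition and translation steps, assuming the `azure-cognitiveservices-speech` SDK and a placeholder `translate_word` helper standing in for whichever translation service is configured; the key, region, and language codes are illustrative:

```python
import azure.cognitiveservices.speech as speechsdk

def recognize_noun(wav_path: str, source_language: str = "ja-JP") -> str:
    """Transcribe a short spoken noun from a WAV file with Azure STT."""
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    speech_config.speech_recognition_language = source_language
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Speech not recognized: {result.reason}")
    return result.text.strip()

def translate_word(word: str, target_language: str = "en") -> str:
    """Placeholder for the translation step; the project description only
    says 'multilingual APIs', so the actual service is an assumption."""
    raise NotImplementedError("wire up the translation service used by the app")
```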
Technologies Used
- Framework: Streamlit (a wiring sketch after this list shows how the pieces might fit together)
- ML Model: CLIP (Contrastive Language-Image Pre-training)
- Speech: Azure Speech-to-Text API
- Languages: Python
- Translation: Multilingual APIs
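One way the pieces might be wired together in Streamlit, reusing the `rank_images`, `recognize_noun`, and `translate_word` sketches above. The file names, image pool, and page layout are assumptions, not the app’s actual structure:

```python
import tempfile
import streamlit as st
# rank_images, recognize_noun, translate_word: the sketches defined earlier on this page.

st.title("Voice-Guided Noun Learning")

uploaded = st.file_uploader("Upload a short recording of the noun (WAV)", type=["wav"])
if uploaded is not None:
    # Persist the upload so the Azure SDK can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(uploaded.read())
        wav_path = tmp.name

    spoken = recognize_noun(wav_path)     # noun in the learner's native language
    target = translate_word(spoken)       # word in the target language
    st.write(f"Heard: **{spoken}** → learning: **{target}**")

    # Candidate images; in practice these come from the app's image dataset.
    image_pool = ["images/dog.jpg", "images/cat.jpg", "images/car.jpg"]
    for path in rank_images(target, image_pool):
        st.image(path, caption=target)
```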
Publication
This project contributed to a publication:
“Exploring the Use of CLIP Model for Images Recommendation in Noun Memorization using Various Learning Context” - Bulletin of Research Center for Computing and Multimedia Studies, Hosei University (2023)