Voice-Guided Image Search with CLIP

Streamlit app for noun learning using CLIP, Azure STT, and multilingual APIs

Overview

Built an interactive Streamlit application that combines speech recognition and a vision-language model for educational noun learning across multiple languages.

Key Features

  • CLIP Integration: Leveraged OpenAI’s CLIP model for image-text matching (see the sketch after this list)
  • Voice Input: Integrated Azure Speech-to-Text (STT) for voice-guided interaction
  • Multilingual Support: Enabled learning in multiple languages through translation APIs
  • Interactive UI: Built with Streamlit for easy web-based access
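
A minimal sketch of the CLIP matching step is shown below. It assumes the Hugging Face transformers implementation of CLIP with a public checkpoint; the checkpoint name, helper function, and image paths are illustrative assumptions, not taken from the project itself.

```python
# Minimal sketch of CLIP image-text matching (assumes the Hugging Face
# `transformers` CLIP wrapper and a public checkpoint; the original app
# may load CLIP differently).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(word: str, image_paths: list[str]) -> list[tuple[str, float]]:
    """Score candidate images against a single vocabulary word, best match first."""
    images = [Image.open(path) for path in image_paths]
    inputs = processor(text=[word], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_images): similarity of the word to each image.
    scores = outputs.logits_per_text[0].tolist()
    return sorted(zip(image_paths, scores), key=lambda pair: pair[1], reverse=True)

# Example with placeholder image paths:
# rank_images("apple", ["images/apple.jpg", "images/dog.jpg", "images/car.jpg"])
```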

How It Works

  1. Voice Input: User speaks a noun in their native language
  2. Speech Recognition: Azure STT converts speech to text
  3. Translation: A translation API converts the recognized word into the target language
  4. Image Search: The CLIP model ranks candidate images against the translated word (steps 2–4 are sketched after this list)
  5. Learning: User sees images associated with the vocabulary word
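
The sketch below walks through steps 2–4 under stated assumptions: the speech calls use the azure-cognitiveservices-speech SDK's recognize-once flow, translate() is a hypothetical stand-in for whichever translation API the app actually uses, and rank_images() is the helper from the previous sketch.

```python
# Hedged sketch of steps 2-4: Azure STT -> translation -> CLIP ranking.
# `translate()` is a hypothetical placeholder, and `rank_images()` comes
# from the CLIP sketch above.
import azure.cognitiveservices.speech as speechsdk

def recognize_noun(key: str, region: str, language: str = "en-US") -> str:
    """Capture one utterance from the default microphone and return its transcript."""
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_recognition_language = language
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)
    result = recognizer.recognize_once_async().get()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Speech was not recognized: {result.reason}")
    return result.text

def translate(word: str, target_language: str = "en") -> str:
    """Hypothetical wrapper around a translation service (e.g. Azure Translator)."""
    raise NotImplementedError  # swap in the translation API of choice

def voice_to_images(key: str, region: str, image_paths: list[str]):
    spoken = recognize_noun(key, region)          # step 2: speech -> text
    target_word = translate(spoken)               # step 3: translate to target language
    return rank_images(target_word, image_paths)  # step 4: CLIP image search
```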

Technologies Used

  • Framework: Streamlit (a minimal UI sketch follows this list)
  • ML Model: CLIP (Contrastive Language-Image Pre-training)
  • Speech: Azure Speech-to-Text API
  • Languages: Python
  • Translation: Multilingual APIs
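
For completeness, a minimal Streamlit front end could tie the pieces together roughly as follows; the widget labels, secret names, and image paths are assumptions for illustration, not the app's actual interface.

```python
# Minimal Streamlit wrapper around the pipeline sketched above; labels,
# secrets names, and image paths are illustrative only.
import streamlit as st

st.title("Voice-Guided Image Search")

if st.button("Speak a noun"):
    word = recognize_noun(st.secrets["azure_key"], st.secrets["azure_region"])
    st.write(f"Heard: {word}")
    ranked = rank_images(translate(word), ["images/apple.jpg", "images/dog.jpg"])
    for path, score in ranked[:3]:
        st.image(path, caption=f"CLIP score: {score:.2f}")
```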

Publication

This project contributed to a publication:

“Exploring the Use of CLIP Model for Images Recommendation in Noun Memorization using Various Learning Context,” Bulletin of Research Center for Computing and Multimedia Studies, Hosei University, 2023.