Multimodal QA systems (text + image + video)
QA pipelines that ingest text, images, and video for unified retrieval, reasoning, and answer generation.
Muhammad Zeeshan
Technologies Used
Python
LLMs
Vision
RAG
FastAPI
Key Features
1. Cross-modal ingestion and embedding
2. Unified retrieval across modalities
3. Grounded answers with source references
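The three features above can be illustrated with a single in-memory index. Every chunk in it, whatever its original modality, carries a vector and a source reference. This is a minimal sketch under stated assumptions, not the project's implementation. `toy_embed` is a hypothetical bag-of-words stand-in for a real shared text/image encoder (a CLIP-style model would be one option). Images are assumed to be represented by caption text. `UnifiedIndex`, `grounded_context`, and `cited_sources` are illustrative names, and the sample paths and timestamps are made up.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in encoder: L2-normalized bag-of-words vector.

    A real pipeline would use a shared text/image encoder so that images,
    transcripts, and documents land in the same vector space.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {token: v / norm for token, v in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit-length, so the dot product is the cosine similarity.
    return sum(w * b.get(token, 0.0) for token, w in a.items())


@dataclass
class Chunk:
    modality: str              # "text", "image", or "video"
    content: str               # document text, image caption, or transcript window
    source: str                # reference returned to the user
    vector: dict[str, float]


class UnifiedIndex:
    """One index for every modality, ranked with a single similarity score."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add(self, modality: str, content: str, source: str) -> None:
        self.chunks.append(Chunk(modality, content, source, toy_embed(content)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, Chunk]]:
        q = toy_embed(query)
        scored = [(cosine(q, c.vector), c) for c in self.chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def grounded_context(hits: list[tuple[float, Chunk]]) -> str:
    """Number each retrieved chunk so an LLM can cite it as [n]."""
    return "\n".join(
        f"[{i}] ({c.modality}, {c.source}) {c.content}"
        for i, (_, c) in enumerate(hits, start=1)
    )


def cited_sources(answer: str, hits: list[tuple[float, Chunk]]) -> list[str]:
    """Map [n] markers in a generated answer back to retrieved sources.

    A marker that points at no retrieved chunk raises ValueError, a simple
    guard against citations the retrieval step cannot back up.
    """
    sources = []
    for n in sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)}):
        if not 1 <= n <= len(hits):
            raise ValueError(f"answer cites [{n}], which was not retrieved")
        sources.append(hits[n - 1][1].source)
    return sources


# Usage with made-up sample content (hypothetical paths and timestamps).
index = UnifiedIndex()
index.add("text", "To reset your password open Settings and choose Security", "docs/account.md#reset")
index.add("image", "Screenshot of the Security tab with the Reset password button", "screens/security.png")
index.add("video", "Next I click reset password and confirm the email link", "walkthrough.mp4@01:12-01:30")
index.add("text", "Billing invoices are emailed on the first of each month", "docs/billing.md")

hits = index.search("How do I reset my password?", k=3)
print(grounded_context(hits))
print(cited_sources("Open Settings [2], then click Reset password [1].", hits))
```

Ranking every chunk in one list, rather than querying a separate store per modality, lets a video segment outrank a document when it matches the question better. A production system would replace the linear scan with a vector database.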
Implementation
Combined vision encoders, transcript extraction, and vector retrieval so that users can query mixed-media corpora through a single interface.
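Transcript extraction typically yields time-stamped segments. Grouping them into short windows keeps each video chunk small enough to embed while preserving a timestamp reference for grounding. The sketch below assumes segments arrive as (start, end, text) in seconds, as many speech-to-text tools produce. `Segment`, `window_transcript`, and the 30-second default are hypothetical choices, not details taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video (assumed input format)
    end: float
    text: str


def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS for a human-readable source reference."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def _close_window(video_id: str, window: list[Segment]) -> tuple[str, str]:
    # Join the window's text and record the time span it covers.
    text = " ".join(seg.text.strip() for seg in window)
    source = f"{video_id}@{fmt_ts(window[0].start)}-{fmt_ts(window[-1].end)}"
    return source, text


def window_transcript(
    video_id: str, segments: list[Segment], max_seconds: float = 30.0
) -> list[tuple[str, str]]:
    """Group consecutive transcript segments into windows of at most max_seconds.

    Returns (source, text) pairs ready to embed. A single segment longer than
    max_seconds becomes its own window rather than being split.
    """
    windows: list[tuple[str, str]] = []
    current: list[Segment] = []
    for seg in segments:
        # Start a new window if adding this segment would exceed the limit.
        if current and seg.end - current[0].start > max_seconds:
            windows.append(_close_window(video_id, current))
            current = []
        current.append(seg)
    if current:
        windows.append(_close_window(video_id, current))
    return windows


# Usage with a made-up transcript (hypothetical file name and text).
segments = [
    Segment(0.0, 10.0, "Open the app."),
    Segment(10.0, 20.0, "Go to Settings."),
    Segment(20.0, 35.0, "Choose Security."),
    Segment(35.0, 50.0, "Click Reset password."),
]
for source, text in window_transcript("walkthrough.mp4", segments):
    print(source, "|", text)
```

Each window's source string can be stored alongside its vector, so an answer can point users to the exact span of video it drew on. Overlapping windows are a common alternative when context tends to straddle window boundaries.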
Results & Impact
Enabled support and analytics teams to query documentation, screenshots, and walkthrough videos in a single flow.