International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

End-to-End Transformer Pipeline for Image Captioning and Text-to-Image Generation

Author(s) Prof. R Swathi, Mr. P Anish, Mr. Kanishkar J, Mr. Sanjay M, Mr. Sasi Kumar R, Mr. Shreesha V B
Country India
Abstract In this project, we present a multimodal AI framework that integrates text generation (Cohere), image creation (FLUX), speech synthesis (FastSpeech2), and object detection (DETR with a ResNet-50 backbone) into a single system for richer, more human-like interaction. By combining expressive language, illustrative imagery, realistic synthesized voice, and detailed object recognition, the framework targets applications in education, accessibility, content creation, and virtual assistants.
Keywords Multimodal AI, Vision-Language Integration, Transformer Models, BLEU Score, ROUGE, Image Captioning, Speech-to-Text, Object Detection
Field Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In Volume 16, Issue 3, July-September 2025
Published On 2025-08-24
DOI https://doi.org/10.71097/IJSAT.v16.i3.7687
Short DOI https://doi.org/g9x32s
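
The abstract describes four components combined into one system. The sketch below shows one way such an orchestration layer could be structured. It is a minimal illustration, not the authors' implementation: the class names, the stage signatures, and the order of data flow (text first, then image and speech from that text, then detection on the generated image) are all assumptions, and the stub functions stand in for the real Cohere, FLUX, FastSpeech2, and DETR backends.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; none of these names come from the paper.
TextGenerator = Callable[[str], str]            # prompt -> text (e.g. Cohere)
ImageGenerator = Callable[[str], bytes]         # text -> image bytes (e.g. FLUX)
SpeechSynthesizer = Callable[[str], bytes]      # text -> audio bytes (e.g. FastSpeech2)
ObjectDetector = Callable[[bytes], List[str]]   # image bytes -> labels (e.g. DETR)


@dataclass
class MultimodalResult:
    """Outputs of one pass through all four stages."""
    text: str
    image: bytes
    audio: bytes
    objects: List[str]


@dataclass
class MultimodalPipeline:
    """Chains four pluggable stages; the ordering is an assumed design."""
    generate_text: TextGenerator
    generate_image: ImageGenerator
    synthesize_speech: SpeechSynthesizer
    detect_objects: ObjectDetector

    def run(self, prompt: str) -> MultimodalResult:
        text = self.generate_text(prompt)      # language output
        image = self.generate_image(text)      # illustrate the generated text
        audio = self.synthesize_speech(text)   # narrate the generated text
        objects = self.detect_objects(image)   # describe what the image contains
        return MultimodalResult(text, image, audio, objects)


# Stub backends so the sketch runs without any model or network access.
def stub_text(prompt: str) -> str:
    return f"A description of {prompt}."


def stub_image(text: str) -> bytes:
    # Placeholder "image": just the UTF-8 bytes of the text.
    return text.encode("utf-8")


def stub_speech(text: str) -> bytes:
    # Placeholder "audio": a tagged copy of the text bytes.
    return b"AUDIO:" + text.encode("utf-8")


def stub_detect(image: bytes) -> List[str]:
    # Placeholder detector: reports "cat" if the fake image mentions one.
    return ["cat"] if b"cat" in image else []


pipeline = MultimodalPipeline(stub_text, stub_image, stub_speech, stub_detect)
result = pipeline.run("a cat")
print(result.text, result.objects)
```

Keeping each stage behind a plain callable means any one backend can be swapped or mocked without touching the others, which is the main reason for this layout.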
