End-to-End Transformer Pipeline for Image Captioning and Text-to-Image Generation

Author(s)	Prof. R Swathi, Mr. P Anish, Mr. Kanishkar J, Mr. Sanjay M, Mr. Sasi Kumar R, Mr. Shreesha V B
Country	India
Abstract	In this project, we present a multimodal AI framework that integrates text generation (Cohere), image creation (FLUX), speech synthesis (FastSpeech2), and object detection (DETR with ResNet-50) into a single system to support richer, more human-like interactions. By combining expressive language, illustrative imagery, natural-sounding speech, and detailed object recognition, the framework is useful for education, accessibility, content creation, and virtual assistants.
Keywords	Multimodal AI, Vision-Language Integration, Transformer Models, BLEU Score, ROUGE, Image Captioning, Speech-to-Text, Object Detection
Field	Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In	Volume 16, Issue 3, July-September 2025
Published On	2025-08-24
DOI	https://doi.org/10.71097/IJSAT.v16.i3.7687

International Journal on Science and Technology