Text-to-Image Generation Using Artificial Intelligence: A Systematic Review
Abstract
This study examines the different approaches used for text-to-image generation with artificial intelligence, with particular emphasis on the semantic relationship between textual descriptions and the images that text-to-image models generate. It also reviews the reliability of the metrics used to evaluate model performance, with the aim of identifying their current capabilities and limitations. The research followed the PRISMA methodology, through which 18 articles were selected according to predefined criteria. These studies addressed diffusion architectures, semantic control mechanisms, phrase-level attention, and prompt engineering. The results indicate that diffusion-based models are the most widely used, while GAN and VAE models are applied mainly in niche applications. Based on the analysis, three levels of control were identified: visual attributes, composition, and style. However, several limitations remain in the metrics used to evaluate semantic alignment, and certain biases associated with pretrained models persist. The conclusions indicate that diffusion models dominate the recent literature and that techniques such as LoRA help improve text-image coherence. These findings suggest that further research on relational attention is still needed, particularly with regard to the development of standardized metrics in future studies.
Copyright (c) 2026 Innovation and Software

This work is licensed under a Creative Commons Attribution 4.0 International License.
The authors exclusively grant the right to publish their article to the Innovation and Software Journal, which may formally edit or modify the approved text prior to publication to comply with its own editorial standards and with universal grammatical standards. Likewise, the journal may translate the approved manuscripts into as many languages as it deems necessary and disseminate them in several countries, always giving public recognition to the author or authors of the research.