The digital era has brought an influx of images carrying textual information in many languages, creating a need for new computational methods to interpret them. The problem becomes more difficult for languages that differ significantly in typography and structure, such as English, Hindi, and Sanskrit. This paper presents a complete framework that uses Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to detect and recognize text in images across these languages. To ensure robustness, the model is trained on a diverse collection of images containing English, Hindi, and Sanskrit text in a variety of styles, sizes, orientations, and lighting conditions. Evaluated on a multilingual dataset, the CNN-RNN model outperformed previous approaches to text detection and recognition, achieving strong accuracy, recall, and F1 scores in all three languages, notably in handling the complex conjunct characters of Hindi and the diacritical marks of Sanskrit. Furthermore, the study highlights the role of attention mechanisms in improving both model interpretability and accuracy, laying the groundwork for future advances in this field.
Keywords: Text Detection, Text Recognition, Natural Image, Convolutional Neural Networks, LSTM, Neural Network Architectures, Recurrent Neural Networks (RNN).
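To make the CNN-RNN pipeline described above concrete, the following is a minimal NumPy sketch of the forward pass: a convolutional layer extracts a feature map from a text-line image, each column of that map is treated as one timestep, and a recurrent layer emits a per-timestep distribution over character classes. Everything here is illustrative, not the paper's implementation: the layer sizes, the 10-class character set, and the random, untrained weights are all hypothetical, a plain tanh RNN stands in for the LSTM named in the keywords, and a real system would stack multiple convolutional layers and decode with CTC or attention.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2D convolution of a single-channel image, naive loops for clarity."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(seq, Wxh, Whh, Why):
    """Plain tanh RNN over a sequence of feature vectors; softmax per timestep."""
    h = np.zeros(Whh.shape[0])
    outs = []
    for x in seq:
        h = np.tanh(Wxh @ x + Whh @ h)
        outs.append(softmax(Why @ h))
    return np.array(outs)

# Hypothetical grayscale text-line image, 32 pixels tall, 100 wide.
img = rng.standard_normal((32, 100))
kernel = rng.standard_normal((3, 3)) * 0.1

# CNN stage: one conv + ReLU -> feature map of shape (30, 98).
feat = relu(conv2d(img, kernel))

# RNN stage: each feature-map column is one timestep (98 steps of dim 30).
seq = feat.T
n_classes, hidden = 10, 16          # hypothetical character-set and state sizes
Wxh = rng.standard_normal((hidden, seq.shape[1])) * 0.1
Whh = rng.standard_normal((hidden, hidden)) * 0.1
Why = rng.standard_normal((n_classes, hidden)) * 0.1

# (98, 10): one class distribution per image column, ready for a CTC/attention decoder.
probs = rnn_forward(seq, Wxh, Whh, Why)
```

The column-as-timestep convention is what lets a single recurrent decoder read scripts of varying width, which is the part of the pipeline that must cope with Hindi conjuncts and Sanskrit diacritics spanning several columns.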