

Advances in Multimodal Recognition Systems: Integrating Text, Voice, and Images

 

 

Special Issue Editors

 

Dr. Basanta Joshi
Department of Electronics and Computer Engineering,
Institute of Engineering, Tribhuvan University,
Pulchowk Campus, Kathmandu, Nepal.
Email: BassantaJoshi@hotmail.com, basanta@ioe.edu.np
Google Scholar

 

Dr. Sri Redjeki
Department of Informatics Engineering,
Universitas Teknologi Digital,
Jawa Barat, Indonesia.
Email: dzeky@utdi.ac.id
Google Scholar

 

Prof. Cheruiyot, Wilson Kipruto
School of Science and Informatics,
Taita Taveta University,
Voi, Kenya.
Email: wilchery68@gmail.com
Google Scholar

 

 

Special Issue Information  

This special issue examines some of the most challenging applications of deep learning, such as multimodal processing and language understanding. Although progress in these areas has been slower than in speech and image recognition, they are being transformed by new ideas from deep learning, particularly continuous-space embeddings. Multimodal systems interpret and operate on data from several human communication channels at various levels of abstraction. Such systems can automatically extract meaning from multimodal raw input data and, conversely, generate perceivable output from the abstract representations derived from that input. A multimodal system may take the form of a multimodal voice system or a multimodal interface.

This special issue addresses the various stages of fusion and the likely scenarios in a multimodal sensor system. It covers the different modes of operation, the fusion techniques used to combine the evidence, and the problems arising in the design and deployment of such systems. Biometrics, the science of identifying a person from physiological or behavioural characteristics, is increasingly recognised as a valid technique for establishing a person's identity. A multimodal system can operate in one of three modes: serial, parallel, or hierarchical. Natural and adaptable human-machine interaction is only one of many significant applications that depend on voice recognition and machine-based speaker identification. Most advances in automated speech recognition have considered only the acoustic signal as input, ignoring visual speech. Acoustic recognition alone, however, can suffer from weaknesses, especially in challenging environments, that make it unsuitable for many real-world scenarios. Fusing the auditory and visual modalities promises higher recognition accuracy and robustness than either modality can achieve on its own. Multimodal recognition is therefore seen as an essential component of future speech and language systems.
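The audio-visual fusion described above is often realised at the score level, where each modality produces per-class confidence scores that are then combined. The sketch below illustrates one common variant, a weighted-sum (late) fusion; the scores, class labels, and weights are illustrative assumptions, not values from any system discussed in this special issue.

```python
# Minimal sketch of score-level (late) fusion for audio-visual recognition.
# All scores, labels, and weights here are hypothetical illustrations.

def fuse_scores(audio_scores, visual_scores, audio_weight=0.6):
    """Weighted-sum fusion of per-class confidence scores from two modalities.

    audio_scores / visual_scores: dicts mapping class label -> confidence.
    audio_weight: relative trust in the acoustic modality (visual gets the rest).
    """
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_scores[label]
               + visual_weight * visual_scores[label]
        for label in audio_scores
    }

# Hypothetical speaker-identification example: each modality scores
# two candidate speakers independently, then the scores are fused.
audio = {"alice": 0.70, "bob": 0.30}
video = {"alice": 0.40, "bob": 0.60}

fused = fuse_scores(audio, video, audio_weight=0.5)
best = max(fused, key=fused.get)  # class with the highest fused score
```

In a noisy acoustic environment, lowering `audio_weight` shifts trust toward the visual channel, which is precisely the robustness argument made above for combining the two modalities.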

This special issue also presents a new technique for constructing multimodal corpora for audio-visual speech recognition in driver-monitoring systems, with the aim of improving the user experience. It includes an examination of voice-driven interfaces and speech recognition systems for driver monitoring, drawing on both audio and visual data. Multimodal speech recognition makes it possible to rely on video data in acoustically noisy environments and on audio data when the video data is unusable. A new framework for building audio-visual corpora is presented, which outlines the essential procedures and prerequisites for multimodal corpus design.

Topics of interest for the special issue include, but are not limited to, the following:

  • Deep learning: from multimodal processing and language recognition to voice recognition
  • A multimodal information fusion implementation for emotion recognition
  • Multimodal analysis: an extensive assessment employing physiological, acoustic, visual, and textual cues
  • An overview of multimodal learning in computer vision: developments and patterns
  • A multimodal speech-based facial emotion recognition system using infrared images
  • Incorporating visual, auditory, and textual emotions in multimodal recognition
  • Multimodal interfaces: an overview of ideas, architectures, and approaches
  • Fusion of visual and voice signals for a multimodal biometric device
  • Development of a multimodal database for audio-visual speech detection in automobile interiors
  • Hierarchical neural networks combined with multimodal fusion for audio-visual emotion recognition
  • Engaging with multimodal content: thoughts on image and text

Deadline for manuscript submissions: 31 December 2025.

To submit your manuscript, click here