
Computer vision

An Extensive Analysis of Research on Efficient Multimodal Large Language Models

Multimodal large language models (MLLMs) are advanced artificial intelligence systems that combine language and vision capabilities, improving performance across a range of tasks. The ability of these models to handle diverse data types marks a significant milestone in AI. However, their extensive resource requirements remain a substantial barrier to widespread adoption. Models like…

Read More

OmniGlue: The First Image Matcher Designed with Generalization as a Core Principle

Local feature image matching techniques often fall short on out-of-domain data, degrading model performance. Given the high cost of collecting extensive datasets for every image domain, researchers are focusing on improving model architecture to enhance generalization. Historically, local feature models like SIFT, SURF, and ORB were used in…
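The classical matchers named above pair descriptors by nearest-neighbor distance and filter unreliable pairs with Lowe's ratio test. A minimal pure-Python sketch of that matching step, using toy ORB-style binary descriptors (all data here is illustrative, not from any real image):

```python
# Sketch of classical local-feature matching with Lowe's ratio test.
# Descriptors are toy 8-bit binary strings packed as ints (ORB-style);
# real pipelines use 256-bit ORB or 128-D float SIFT descriptors.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors packed as ints."""
    return bin(a ^ b).count("1")

def match_descriptors(query, train, ratio=0.75):
    """Return (query_idx, train_idx) pairs that pass Lowe's ratio test:
    the best match is kept only if it is clearly closer than the second best."""
    matches = []
    for qi, q in enumerate(query):
        order = sorted(range(len(train)), key=lambda ti: hamming(q, train[ti]))
        best, second = order[0], order[1]
        if hamming(q, train[best]) < ratio * hamming(q, train[second]):
            matches.append((qi, best))
    return matches

# Toy example: each query descriptor has one near-identical counterpart.
query = [0b10110010, 0b01001101]
train = [0b10110011, 0b11111111, 0b01001100]
print(match_descriptors(query, train))  # [(0, 0), (1, 2)]
```

The ratio test is exactly the kind of hand-tuned heuristic that fails out of domain, which is the gap learned matchers such as OmniGlue aim to close.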

Read More


This AI Research Paper from the National University of Singapore Presents MambaOut: A System that Improves the Efficiency of Vision Models While Preserving Accuracy

Recent advancements in neural networks such as Transformers and Convolutional Neural Networks (CNNs) have been instrumental in improving the performance of computer vision in applications like autonomous driving and medical imaging. A major challenge, however, lies in the quadratic complexity of the attention mechanism in transformers, making them inefficient in handling long sequences. This problem…
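The quadratic cost mentioned above comes from the fact that self-attention scores every token against every other token, so the score matrix has n × n entries. A pure-Python toy with scalar "embeddings" (real models use vector embeddings and batched matrix ops) makes the scaling concrete:

```python
# Sketch of why self-attention is O(n^2) in sequence length n:
# every token attends to every other token, producing an n x n score matrix.
import math

def attention_scores(tokens):
    """Full n x n matrix of pairwise dot-product scores (scalars here)."""
    n = len(tokens)
    return [[tokens[i] * tokens[j] for j in range(n)] for i in range(n)]

def softmax(row):
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

n = 8
scores = attention_scores([0.1 * i for i in range(n)])
weights = [softmax(row) for row in scores]

# Doubling the sequence length quadruples the work: 8 -> 64 entries, 16 -> 256.
print(len(scores) * len(scores[0]))  # 64
```

Linear-time alternatives such as state-space models (the Mamba family that MambaOut examines) avoid materializing this n × n matrix entirely.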

Read More

Decoding Vision-Language Models: A Comprehensive Examination

A team of researchers from Hugging Face and Sorbonne Université has conducted in-depth studies on vision-language models (VLMs), aiming to better understand the critical factors that impact their performance. These models, capable of processing both images and text, have become popular in a variety of areas, ranging from information retrieval in scanned documents to code…

Read More

CinePile: A Novel Dataset and Benchmark Purpose-Built for Authentic Long-Form Video Understanding

Video understanding, a branch of artificial intelligence research, involves equipping machines to analyze and comprehend visual content. Specific tasks under this umbrella include recognizing objects, interpreting human behavior, and understanding events within a video. This field has applications across several industries, including autonomous driving, surveillance, and entertainment. The need for such advances arises from the challenge…

Read More

THRONE: Progress in Assessing Hallucinations in Vision-Language Models

The rapidly evolving field of research on hallucinations in vision-language models (VLMs), instances in which these artificial intelligence (AI) systems generate coherent but factually incorrect responses, is gaining increasing attention. Especially when these models are applied in critical domains like medical diagnostics or autonomous driving, the accuracy of the outputs of VLMs, which combine text and visual inputs, is…

Read More
