Computer vision

Are We Assessing Large Vision-Language Models the Right Way? Chinese AI Researchers Present MMStar: An Elite Vision-Indispensable Multi-Modal Benchmark.

Researchers have identified gaps in how Large Vision-Language Models (LVLMs) are evaluated. Primarily, they note that many evaluation samples can be answered without the visual content at all, and that benchmark data may have unintentionally leaked into training sets. They also indicate the limitations of single-task benchmarks for accurately assessing the multi-modal capabilities…

A computer scientist is advancing the limits of geometry.

Over 2,000 years ago, the Greek mathematician Euclid profoundly shaped how we perceive shapes. Building on these ancient teachings, Justin Solomon is applying modern geometric methods to complex problems that often have nothing to do with shapes. As an Associate Professor in the MIT Department of Electrical Engineering and Computer Science and a member of MIT’s…

A computer scientist is pushing the limits of geometry.

Drawing on ideas more than 2,000 years old, MIT Professor Justin Solomon is building on the work of the Greek mathematician Euclid, the father of geometry, using modern geometric techniques to tackle difficult problems that are often unrelated to shapes. Solomon works in the Department of Electrical Engineering and Computer Science as part of the Computer Science…

MathVerse: A Comprehensive Visual Math Benchmark Crafted for Fair, Thorough Assessment of Multi-modal Large Language Models (MLLMs)

The ability of Multimodal Large Language Models (MLLMs) to tackle visual math problems is currently the subject of intense interest. While MLLMs have performed remarkably well in visual scenarios, the extent to which they can fully understand and solve visual math problems remains unclear. To address these challenges, benchmarks such as GeoQA and MathVista have…

This AI paper from the Max Planck Institute for Intelligent Systems, Adobe, and UCSD proposes Time Reversal Fusion (TRF) for exploring the interplay of time and space.

Researchers from the Max Planck Institute for Intelligent Systems, Adobe, and the University of California, San Diego have introduced a diffusion image-to-video (I2V) framework for what they call training-free bounded generation. The approach aims to create detailed videos from given start and end frames without assuming any specific motion direction, a task known as bounded generation,…

Cobra for Multimodal Language Learning: Streamlining Multimodal Large Language Models (MLLMs) with Linear Computational Complexity

The rapid advancement of Multimodal Large Language Models (MLLMs) has triggered a transformation in numerous domains. Models like ChatGPT, which are predominantly built on Transformer networks, show great potential but are hindered by quadratic computational complexity, which limits their efficiency. On the other hand, language-only Large Language Models (LLMs) lack adaptability due to their sole dependence on…

Researchers from Alibaba and Renmin University of China have unveiled mPLUG-DocOwl 1.5, a unified framework for understanding documents without the need for Optical Character Recognition (OCR).

Researchers from Alibaba Group and Renmin University of China have developed an advanced Multimodal Large Language Model (MLLM) to better understand and interpret text-rich images. Named DocOwl 1.5, the model uses Unified Structure Learning to improve MLLM performance across five distinct domains: document, webpage, table, chart,…

FeatUp: A Machine Learning Algorithm that Increases the Resolution of Deep Neural Network Features for Better Performance on Computer Vision Tasks

Deep features have vastly expanded the capabilities of computer vision research: they can unlock image semantics and facilitate diverse tasks, even with minimal data. Techniques to extract features from a range of data types – for example, images, text, and audio – have been developed and underpin a number of applications in…

Seeing Everything: LLaVA-UHD Can Perceive High-Resolution Images in Any Aspect Ratio

Large language models like GPT-4, while powerful, often struggle with basic visual perception tasks such as counting objects in an image. This may stem from the way these models process high-resolution images. Most current AI systems perceive images only at a fixed low resolution, leading to distortion, blurriness, and loss of detail when the…

Arc2Face Leads the Way in Realistic Face Image Generation Using ID Embeddings

The production of realistic human facial images has been a long-standing challenge for researchers in machine learning and computer vision. Earlier techniques like Eigenfaces utilised Principal Component Analysis (PCA) to learn statistical priors from data, yet they notably struggled to capture the complexities of real-world factors such as lighting, viewpoints, and expressions beyond frontal poses.…

UC Berkeley and Microsoft Research are rethinking visual understanding. Their approach, Scaling on Scales, is proving more effective than simply building larger models.

In the ever-evolving fields of computer vision and artificial intelligence, the prevailing approach favors larger models for advanced visual understanding. The underlying assumption is that larger models can extract more powerful representations, which has prompted the construction of enormous vision models. However, a recent study challenges this conventional wisdom, with a closer look at the practice of…

MinusFace: Transforming Facial Recognition Privacy through Feature Subtraction and Channel Mixing – Research by Fudan University and Tencent

The increasing use of facial recognition technologies is a double-edged sword: it provides unprecedented convenience but also poses a significant risk to personal privacy, as facial data could unintentionally reveal private details about an individual. As such, there is an urgent need for privacy-preserving measures in these face recognition systems. A pioneering approach to this…
