Models such as CLIP (Radford et al., 2021), which fuse visual and language data to solve complex tasks, show great potential but degrade when presented with unseen or out-of-distribution (OOD) data. This is particularly concerning when models encounter novel categories absent from their training set, which can raise safety issues. Prior attempts to improve OOD detection have included scaling up the model (Fort et al., 2021) or incorporating an additional text generator.
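As a minimal illustration of this failure mode, the sketch below uses the open-source `clip` package to score an image against a fixed set of known class prompts (the image path, prompt template, and class list are illustrative). Because the softmax is taken only over known classes, an image from an unseen category is still forced into one of them, often with misleadingly high confidence.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Known (in-distribution) classes only; an OOD image has no "correct" option here.
known_classes = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in known_classes]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Softmax over known classes: probabilities always sum to 1, so even an
    # out-of-distribution image receives a confident-looking label.
    logits = model.logit_scale.exp() * image_feat @ text_feat.t()
    probs = logits.softmax(dim=-1)

print(dict(zip(known_classes, probs.squeeze(0).tolist())))
```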
A recent approach called OGEN aims to improve accuracy on both in-distribution (ID) and OOD data. Without proper regularization, finetuned vision-language models are prone to overfitting to known classes, which limits their generalization to unknown categories. In response, OGEN introduces a novel method for synthesizing image features of unknown classes, combined with an effective model-regularization scheme.
The method includes a class-conditional feature generator that synthesizes image features for unknown classes, operating in the joint image-text feature space learned by CLIP. To make OOD feature generation tractable, an “extrapolating bias” is applied: the features of an unknown class are extrapolated from those of its most similar known classes.
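A rough sketch of this extrapolating bias is shown below, assuming normalized CLIP text and image features for the known classes are already available. The function name, the softmax-weighted k-NN combination, and the toy tensors are illustrative simplifications, not the paper's actual conditional generator.

```python
import torch
import torch.nn.functional as F

def extrapolate_unknown_features(unknown_text_feat, known_text_feats,
                                 known_image_feats, k=3):
    """Synthesize an image feature for an unknown class by extrapolating
    from its k nearest known classes in CLIP's text feature space.

    unknown_text_feat: (d,)   text embedding of the unknown class name
    known_text_feats:  (C, d) text embeddings of the known classes
    known_image_feats: (C, d) per-class image feature prototypes (e.g. class means)
    """
    # Cosine similarity between the unknown class and every known class.
    sims = F.cosine_similarity(unknown_text_feat.unsqueeze(0), known_text_feats, dim=-1)
    top_sims, top_idx = sims.topk(k)

    # Softmax-weighted combination of the neighbors' image features serves as a
    # crude extrapolation; OGEN instead learns this mapping with a generator.
    weights = top_sims.softmax(dim=-1)                                  # (k,)
    synthetic = (weights.unsqueeze(-1) * known_image_feats[top_idx]).sum(dim=0)
    return F.normalize(synthetic, dim=-1)

# Toy usage with random features at CLIP ViT-B/32's 512-dim embedding size.
d, num_known = 512, 10
unknown_txt = F.normalize(torch.randn(d), dim=-1)
known_txt = F.normalize(torch.randn(num_known, d), dim=-1)
known_img = F.normalize(torch.randn(num_known, d), dim=-1)
print(extrapolate_unknown_features(unknown_txt, known_txt, known_img).shape)  # torch.Size([512])
```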
Two feature synthesis variants, ‘extrapolating per class’ and ‘extrapolating jointly’, were tested for synthesizing unknown-class features, with the latter proving more effective. To further curb overfitting during joint optimization, an adaptive self-distillation strategy is employed, which regularizes the model being finetuned against its own earlier optimization states.
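The following is a simplified sketch of what such a self-distillation regularizer can look like in practice, assuming a teacher built as an exponential moving average of earlier student states; OGEN's adaptive teacher construction differs in detail, and all names and hyperparameters here are illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def update_teacher(teacher, student, momentum=0.999):
    # Teacher parameters track a moving average of the student's history,
    # providing a smoothed earlier state to distill from.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the teacher's and student's softened predictions;
    # added to the task loss to damp overfitting during joint optimization.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage with a linear classifier head standing in for the finetuned model.
student = torch.nn.Linear(512, 100)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

features, labels = torch.randn(8, 512), torch.randint(0, 100, (8,))
s_logits, t_logits = student(features), teacher(features)
loss = F.cross_entropy(s_logits, labels) + 0.5 * distillation_loss(s_logits, t_logits)
loss.backward()
update_teacher(teacher, student)
```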
OGEN proved effective across various settings and diverse datasets with CLIP-like models. In particular, it excels in two challenging settings: base-to-new class generalization within the same dataset and cross-dataset generalization. OGEN improves both ID and OOD performance without overfitting, striking a balance between preserving classification accuracy on existing classes and maximizing generalization to new ones.
In terms of future work, OGEN could be evaluated across different finetuning methods and explored for its effectiveness in modeling uncertainty on unseen data. OGEN represents an innovative solution for enhancing out-of-distribution generalization in vision-language models, and the study makes a significant contribution to the field.