Image captioning and text-to-image search have long fascinated machine-learning practitioners and businesses. Open-source models such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) were among the first to produce near-human results on these tasks. More recently, multimodal models built with generative techniques have been used to map text and images into the same embedding space, so that semantically related content from either modality can be compared directly. Amazon has…
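
To make the shared embedding space concrete, here is a minimal sketch using the open-source CLIP model through the Hugging Face transformers library. This illustrates the general technique, not the specific service discussed in this post; the image file name is hypothetical.

```python
# Minimal sketch: embed an image and candidate texts into CLIP's shared
# embedding space and compare them with cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image file
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same vector space, so cosine similarity
# between an image embedding and a text embedding is meaningful.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)  # higher score = better text match for the image
```

Because text and images share one space, the same similarity computation supports text-to-image search (embed the query text, then rank stored image embeddings against it) as well as the reverse direction.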
