
StrokeNUWA Unveiled by Microsoft Scientists: The Tokenization of Strokes for Vector Graphic Generation

Researchers from Soochow University, Microsoft Research Asia, and Microsoft Azure AI have developed a new method for image generation using transformer-based Large Language Models (LLMs). LLMs have been making advances in Natural Language Processing and in other fields such as robotics, audio, and medicine. They are also being used to generate visual data, with modules like VQ-VAE and VQ-GAN converting visual pixels into discrete grid tokens that an LLM can process.
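The grid-token idea mentioned above can be illustrated with a minimal sketch of vector quantization. This is an assumption-laden toy example, not the actual VQ-VAE/VQ-GAN code: each continuous patch embedding is mapped to the index of its nearest codebook vector, yielding a discrete token an LLM can consume.

```python
import numpy as np

# Hypothetical sizes: 8 codebook vectors and 6 patch embeddings, dimension 4.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # learned code vectors (here: random stand-ins)
patches = rng.normal(size=(6, 4))    # continuous patch embeddings from an encoder

# Nearest-neighbour lookup: squared L2 distance to every codebook entry,
# then take the argmin to get one discrete token per patch.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)

print(tokens)  # six integers in [0, 8), one grid token per patch
```

In a real model the codebook is learned jointly with the encoder and decoder; here the random vectors only demonstrate the quantization step itself.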

The researchers developed a new method built on vector graphics, an alternative image format that preserves the semantic content of images better than pixel-based formats. They split images into a series of interconnected ‘stroke’ tokens, each carrying complete semantic information. The stroke-token approach offers several advantages: more intuitive semantic segmentation of images, high compressibility without loss of quality, and ease of processing for LLMs.
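To make the stroke-token idea concrete, here is a minimal sketch, and an assumption on our part rather than StrokeNUWA's actual VQ-Stroke module: a serialized SVG path is split into per-command stroke segments, each of which would then be quantized into a discrete stroke token.

```python
import re

# A hypothetical serialized SVG path: a simple closed square.
svg_path = "M10 10 L90 10 L90 90 L10 90 Z"

# Each drawing command (M = move, L = line, Z = close) starts one stroke
# segment; the segment carries the command plus its coordinate arguments.
strokes = [s.strip() for s in re.findall(r"[MLZ][^MLZ]*", svg_path)]

print(strokes)  # ['M10 10', 'L90 10', 'L90 90', 'L10 90', 'Z']
```

Each such segment is a semantically meaningful unit (one edge of the square), which is what makes the representation both compact and easy for an LLM to model as a token sequence.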

The team then proposed StrokeNUWA, a model that generates vector graphics without relying on a visual module. StrokeNUWA’s architecture combines an Encoder-Decoder model with a VQ-Stroke module, which condenses serialized vector graphic data into SVG tokens. The study showed that stroke tokens can produce visually rich semantic content, outperforming LLM-based baselines, and achieved an inference speed-up of up to 94× thanks to the compressibility of vector graphics.

The researchers’ future goal is to further improve stroke token quality using advanced visual tokenization techniques and expand stroke tokens to other domains and tasks. Credit for the research goes to the project’s researchers, and more details can be seen in their paper. This news was originally published on MarkTechPost.
