Large Language Models (LLMs) have revolutionized natural language processing, delivering strong performance across benchmarks and practical applications. However, these models come with their own set of challenges, largely stemming from the autoregressive paradigm they rely on. The sequential nature of autoregressive token generation slows inference considerably, limiting practicality in high-throughput settings. In addition, autoregressive models are susceptible to exposure bias, which can degrade the coherence and quality of the generated text.
Researchers have explored a range of techniques to address these issues, improving generation speed and tackling the sampling challenges associated with LLMs. These include efficient implementations that boost throughput, low-precision inference to reduce computational cost, new architectures designed for faster processing, and multi-token prediction strategies that generate several tokens at once. More recently, some researchers have turned to diffusion models for text generation as an alternative to the traditional autoregressive approach.
Researchers from CLAIRE have been studying Score Entropy Discrete Diffusion (SEDD) and show it to be a promising alternative to traditional autoregressive generation in language models. SEDD's strength lies in the trade-off it offers between two critical aspects: generation quality and computational efficiency. This makes it well suited to applications where a verifier is available, such as combinatorial problem-solving scenarios.
SEDD employs a transformer backbone similar to GPT-2's and is trained on the OpenWebText dataset. Comparative evaluations show that SEDD matches or surpasses GPT-2's likelihood on several test sets, including LAMBADA, WikiText2, PTB, WikiText103, and 1BW. SEDD's sampling process can use far fewer steps than the sequence length: with 32 sampling steps it achieves better perplexity than GPT-2 on 1024-token sequences. Unlike autoregressive models, which must generate tokens one after another, SEDD supports non-causal token generation and a more flexible definition of the forward process.
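To make the parallel-sampling idea concrete, here is a minimal toy sketch of a masked (absorbing-state) discrete diffusion sampler in Python. Everything in it, including the `model` callable, the vocabulary constants, and the linear unmasking schedule, is an illustrative assumption; SEDD's actual reverse process is built on its learned score-entropy parameterization rather than this simple scheme. The point it shows is only that all positions are refined in parallel, so the number of denoising rounds (e.g. 32) can be far smaller than the sequence length (e.g. 1024).

```python
import numpy as np

def diffusion_sample(model, seq_len=1024, num_steps=32,
                     vocab_size=50257, mask_id=50256, seed=0):
    """Toy parallel denoising loop: every position starts as a mask token and the
    whole sequence is refined over num_steps rounds, so the number of model calls
    (e.g. 32) can be far smaller than the sequence length (e.g. 1024)."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, mask_id, dtype=np.int64)
    for step in range(num_steps):
        # The model sees the whole partially-masked sequence at once (non-causal)
        # and returns a (seq_len, vocab_size) matrix of per-position distributions.
        probs = model(tokens)
        proposal = np.array([rng.choice(vocab_size, p=p) for p in probs])
        # Reveal a slice of the still-masked positions each round; the final
        # round (denominator 1) unmasks everything that remains.
        masked = np.flatnonzero(tokens == mask_id)
        if len(masked) == 0:
            break
        n_reveal = int(np.ceil(len(masked) / (num_steps - step)))
        chosen = rng.choice(masked, size=n_reveal, replace=False)
        tokens[chosen] = proposal[chosen]
    return tokens

# Usage with a stand-in "model" that returns uniform per-position distributions.
if __name__ == "__main__":
    uniform = lambda toks: np.full((len(toks), 50257), 1.0 / 50257)
    sample = diffusion_sample(uniform, seq_len=64, num_steps=8)
    print(sample[:10])
```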
In unconditional generation quality, SEDD matches or outperforms GPT-2, offering lower perplexity when sampling without annealing and similar likelihood with 1024 sampling steps. In conditional generation, SEDD scores slightly lower on the MAUVE metric but shows comparable accuracy on other tasks. However, SEDD's outputs are less diverse than GPT-2's: the repetition rate rises and the unigram entropy falls as the number of sampling steps increases. For conditional generation with short prompts, SEDD is also slightly weaker than GPT-2.
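The diversity measurements mentioned above can be illustrated with a short, self-contained sketch. The exact definitions used in the paper may differ; here unigram entropy is taken as the Shannon entropy (in nats) of the token frequency distribution, and repetition is approximated as the fraction of bigrams that occur more than once:

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (in nats) of the unigram distribution of a token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def repetition_rate(tokens, n=2):
    """Illustrative repetition measure: fraction of n-grams occurring more than once.
    This is only a stand-in; the paper's exact repetition metric may differ."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A more repetitive sample scores higher repetition and lower unigram entropy.
diverse = ["the", "cat", "sat", "on", "a", "mat", "near", "the", "door"]
loopy   = ["the", "cat", "the", "cat", "the", "cat", "the", "cat", "the"]
print(unigram_entropy(diverse), repetition_rate(diverse))
print(unigram_entropy(loopy), repetition_rate(loopy))
```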
While SEDD demonstrates promising results, challenges remain around sampling efficiency. Matching the quality of GPT-2's unconditional text generated with nucleus sampling requires significantly more sampling steps, making generation slower than GPT-2 with KV-caching. Nonetheless, this study makes a strong case that text diffusion models such as SEDD are a viable and promising alternative to autoregressive models.
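For context, the GPT-2 baseline referenced above decodes with nucleus (top-p) sampling. The following is a minimal NumPy sketch of standard top-p sampling, not tied to either model's actual implementation:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Standard nucleus (top-p) sampling: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches p, renormalize, and sample from that set."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.exp(logits - logits.max())              # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

# Example: with a peaked distribution, only the top few tokens survive the cutoff.
logits = np.array([5.0, 4.5, 1.0, 0.5, 0.1])
print(nucleus_sample(logits, p=0.9, rng=np.random.default_rng(0)))
```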