Generating Text from Language Models

Definition: Language Model is a probability distribution over strings.

Objective of language generators: Steer strings distributions towards text with specific attributes without retraining/fine-tuning the language model. Or, in simple words, make the language model to generate what we want without compute intensive model retraining.

Discrete distributions

Distributions could be classified into “small” (on the left, used for vocabularies) and “large” (on the right, used for chain-of-thoughts).

Entropy

Definition: measures uncertainty of the probability distribution. The higher uncertainty, the higher entropy.

Example of entropy value for language generation tasks

Scenario 1 (document generation, news generation):

Distribution: web documents
Entropy: high
Good generations: many

Scenario 2 (machine translation):

Distribution: sentence translations
Entropy: low
Good generations: few

Entropy definition equation

LM evaluation

The quality of LMs is usually measured with perplexity (the lower, the better). It basically means “how far from the top the ground-truth token is.”

LM evaluation shortcomings

Difficult to measure what exactly the LMs learn about the language;
LMs could “hallucinate” (produce wrong facts without warning);
Often, researchers/developers are not working with super large LMs, because of limited budget.

Text generation

📝 Colab notebook

The generation of tokens usually happens autoregressively (token-by-token).

E.g., you pass string <BOS>* my to the LM, and it provides you with a probability distribution, what token most likely will come next.

(* BOS - beginning of the string)

Scoring functions

📝 Colab notebook

Selecting the right next token is not that obvious. It depends on our objectives and the token with the highest probability not always that we want.

Top-k sampling

Simply truncate the tail by selecting the k tokens with the largest probability.

Perplexity and generation quality

Tune k so that the surprisal of the generated text is close to a target value τ

Nucleus sampling

Select tail size dynamically by only considering the nucleus of the distribution.

Prompting

This is the way of steering the LM generation into required direction by providing more context.

The main reason why prompting is working is because context (often) lowers the entropy.

Prompting examples

(The LM will accept the following as input and will finish the sentence.)

Translation:

English: My name is Edward. French:

Summarization:

Toronto police reported …

TL;DR:

Demonstrations (concatenate x,y pairs and give them as model input, i.e., few-shot learning)

Tweet: “I hate it when my phone battery dies.”

Sentiment: Negative

Tweet: “My day has been 👍”

Sentiment: Positive

Tweet: “This is the link to the article”

Sentiment: Neutral

Tweet: “This new music video was incredibile”

Sentiment:

Chain-of-Thought

Controlled generation

📝 Colab notebook

Using a set of scoring functions and algorithms to change the token probability distribution towards the desired output.

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts.

Combine the outputs from Anti-Experts and Experts, lowering the most probable tokens from Anti-Experts.

FUDGE: Controlled Text Generation With Future Discriminators

Using classifiers for eliminating tokens that we don’t want.

Energy functions

Guiding the generation to regions with minimum energy, i.e., increasing specific tokens probability.

Measuring the quality of generations

📝 Colab notebook

Low-entropy Text Generation Evaluation

Exact match between generation and ground-truth tokens (mostly not applicable in real-world applications);
Levenshtein distance: minimum number of edits required to change a string into another;
BLEU: how many n-grams are in generated text (still used, but the main shortcoming of the approach is that semantically agnostic, i.e., texts could be semantically similar, but used words are very different. Thus, BLEU score is low.);
BERTScore: generates embeddings using LM (e.g., BERT) and then compare similarity (e.g., cosine distance). This metric takes semantics into account;
Mauve (Pillutla et. al. 2022) is a language generator evaluation metric that empirically exhibits high correlation with human judgments. Uses clustering for generated embeddings, get distributions from them and then compute similarity.

ACL 2023 Day 0 (part 2)

Generating Text from Language Models

Discrete distributions

Entropy

Example of entropy value for language generation tasks

Entropy definition equation

LM evaluation

LM evaluation shortcomings

Text generation

Scoring functions

Top-k sampling

Perplexity and generation quality

Nucleus sampling

Prompting

Prompting examples

Translation:

Summarization:

Demonstrations (concatenate x,y pairs and give them as model input, i.e., few-shot learning)

Chain-of-Thought

Controlled generation

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts.

FUDGE: Controlled Text Generation With Future Discriminators

Energy functions

Measuring the quality of generations

Low-entropy Text Generation Evaluation