Introduction
This article outlines the use of transition matrices to generate a sequence of words from a root text. We use a first-order Markov process, acknowledging that better results can be obtained with Hidden Markov Models. As a demonstration, we generate a Lorem Ipsum-style text from the root text of Don Quixote.
Notation and Assumptions
- Let $W = \{w_1, w_2, \dots, w_N\}$ be the space of $N$ distinct words, extracted in order of first appearance from a given root text.
- The input corpus is a sequence of tokens $s_1, s_2, \dots, s_T$, where each $s_t \in W$.
- We assume a first-order Markov property: $P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_1) = P(s_{t+1} \mid s_t)$.
Extracting Bigrams and Counting Frequencies
Define the sequence of all observed ordered pairs (bigrams), including a self-pair at the end: $p_t = (s_t, s_{t+1})$ for $1 \le t \le T - 1$, and $p_T = (s_T, s_T)$.
Let $C : W \times W \to \mathbb{N}$ be the count function: $C(a, b) = |\{\, t : p_t = (a, b) \,\}|$.
In code, we build:
```python
def read_text(text):
    """Read a text string and return (all tokens in order, unique words)."""
    word_list = []
    unique_words = []
    for word in text.split():
        word_list.append(word)
        if word not in unique_words:
            unique_words.append(word)
    return word_list, unique_words
```
and then count each bigram to obtain $C(a, b)$.
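As a sketch of this counting step, `collections.Counter` does the same job more compactly (the toy corpus below is hypothetical, not the actual root text):

```python
from collections import Counter

tokens = "a b a b a".split()  # hypothetical toy corpus
# Consecutive pairs, plus the final self-pair, matching the definition above
bigrams = list(zip(tokens, tokens[1:])) + [(tokens[-1], tokens[-1])]
counts = Counter(bigrams)
# counts[("a", "b")] == 2 and counts[("a", "a")] == 1
```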
Transition Probabilities
For each $a \in W$, define the total outgoing count: $C(a) = \sum_{b \in W} C(a, b)$.
Then the conditional probability of transitioning from $a$ to $b$ is: $P(b \mid a) = C(a, b) / C(a)$.
This yields a probability distribution over the next word given the current word.
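On a small hypothetical corpus, this normalisation can be sketched with plain dictionaries (a sketch of the formula, not the article's implementation):

```python
from collections import Counter

tokens = "a b a b a".split()  # hypothetical toy corpus
pairs = list(zip(tokens, tokens[1:])) + [(tokens[-1], tokens[-1])]
counts = Counter(pairs)
# Total outgoing count C(a) for each first word
totals = Counter()
for (a, _), c in counts.items():
    totals[a] += c
# Conditional probability P(b | a) = C(a, b) / C(a)
probs = {(a, b): c / totals[a] for (a, b), c in counts.items()}
# probs[("a", "b")] == 2/3 and probs[("b", "a")] == 1.0
```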
```python
def calculate_transition_probabilities(word_list, unique_words):
    """Calculate transition probabilities between consecutive words."""
    # Create pairs of consecutive words; the final word is paired with
    # itself so that every word has at least one outgoing transition.
    word_pairs = []
    for i in range(len(word_list)):
        if i < len(word_list) - 1:
            word_pairs.append((word_list[i], word_list[i + 1]))
        else:
            word_pairs.append((word_list[i], word_list[i]))
    # Count occurrences of each distinct pair
    unique_pairs = []
    pair_counts = []
    for word_pair in word_pairs:
        if word_pair not in unique_pairs:
            unique_pairs.append(word_pair)
            pair_counts.append(1)
        else:
            pair_counts[unique_pairs.index(word_pair)] += 1
    # Normalise each count by the total outgoing count of its first word
    pair_probabilities = [0.0] * len(pair_counts)
    for index, (first_word, _) in enumerate(unique_pairs):
        outgoing_total = sum(
            count
            for pair, count in zip(unique_pairs, pair_counts)
            if pair[0] == first_word
        )
        pair_probabilities[index] = pair_counts[index] / outgoing_total
    return word_pairs, unique_pairs, pair_counts, pair_probabilities
```
The Transition Matrix
We index words by $i \in \{1, \dots, N\}$ such that $W = \{w_1, \dots, w_N\}$. Then define the transition matrix $M \in \mathbb{R}^{N \times N}$ with entries: $M_{ij} = P(w_j \mid w_i)$.
In code:
```python
import numpy as np

def build_transition_matrix(pair_probabilities, unique_words, unique_pairs):
    """Build the transition matrix from word-pair probabilities."""
    # Square matrix: one row and one column per unique word
    matrix = np.zeros((len(unique_words), len(unique_words)))
    # Fill in the transition probability for each observed bigram
    for index, pair in enumerate(unique_pairs):
        row = unique_words.index(pair[0])
        col = unique_words.index(pair[1])
        matrix[row][col] = pair_probabilities[index]
    return matrix
```
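A useful sanity check is that every row of $M$ sums to 1, since each row is a probability distribution over next words. A minimal hand-built example with a hypothetical two-word vocabulary:

```python
import numpy as np

# Transition matrix counted by hand from the toy corpus "a b a b a"
# (the final word pairs with itself): (a,b) x2, (b,a) x2, (a,a) x1
matrix = np.array([
    [1 / 3, 2 / 3],  # row for "a": P(a|a), P(b|a)
    [1.0,   0.0],    # row for "b": P(a|b), P(b|b)
])
assert np.allclose(matrix.sum(axis=1), 1.0)
```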
Sampling the Next Word
Given the current word $w_i$ (index $i$), we draw the next word index $j$ with probability $M_{ij}$.
Concretely, if $M_{i,\cdot}$ denotes row $i$ of $M$, then $s_{t+1} \sim \mathrm{Categorical}(M_{i,\cdot})$.
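The categorical draw itself can be sketched directly with NumPy's random generator (a seeded sketch on a hypothetical two-word vocabulary, independent of the full pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded for reproducibility
unique_words = ["a", "b"]         # hypothetical vocabulary
row = np.array([1 / 3, 2 / 3])    # transition distribution for the current word
next_word = rng.choice(unique_words, p=row)
```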
```python
import numpy.random as nprand

def generate_text(initial_word, transition_matrix, unique_words, iterations=10):
    """Generate text by repeatedly sampling from the Markov chain."""
    predicted_text = [initial_word]
    current_word = initial_word
    row_number = unique_words.index(current_word)
    for _ in range(iterations):
        # Draw the next word from the current word's transition distribution
        next_word = nprand.choice(unique_words, p=transition_matrix[row_number])
        predicted_text.append(next_word)
        current_word = next_word
        row_number = unique_words.index(current_word)
    return " ".join(predicted_text)
```
Generated Text
And finally we get:
“The farmer, “call on Sundays, made more firmly bound to him, one in the Manchegan horizon, when out or would adopt the people who were to see and fixing bars of Gaul. Master Andres,” said to see if he mounted, and that they seemed in him, moreover, that this was like heretics.” “So say in these troutlets enough,” said the novice knight that ye shall be she might be, all that, his sword, which he could not thy captive Abindarraez gave Rocinante was like all the carrier’s head that he felt himself he mounted, and went off the landlord, seeing the shield until by my heart tell the Taverns of people, who, as the ceremonial of the artifice of La Mancha, the approach of chivalry with his thinking, lofty, sonorous, and deducted three of the bed is true,” said it was saying his intention of his horn to have been at last the ground, and said, the blood-lettings when he was it was proceeding to dark, poring over with such fury and the two gay damsels who hath to-day righted the man blustering in and covered all was—give him that this sort the valiant and matters little,” replied that special purpose. But these words and quiet, saying to give him, fell on Saturdays, lentils on their lucidity of in the title of himself, “If, for the oath on a thing so wantonly lashing that he considered, he bore them to set forth in and robbing everyone he might from both hands and took, because as he come in remembrance this sort the spot, and backpiece, but Pedro Alonso your High Magnificence,” replied Don Quixote braced his stirrups, got to stay two hours only, while his hand on this harangue of knighthood before him away to the youth that ye can against”
Final Thoughts
- This implementation relies heavily on linear searches (using .index() multiple times), which would be inefficient for large texts. Using dictionaries or hash maps instead of repeated linear searches would significantly improve performance.
- A dense $N \times N$ transition matrix grows quadratically with vocabulary size, and most of its entries are zero, so this approach is inefficient for larger data sets; a sparse representation would be preferable.
- calculate_transition_probabilities is a long function and could be broken up into smaller functions.
- We do not cover a comparison to the classic Lorem Ipsum in any way.
- It would be interesting to explore how text size affects generation quality.
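The dictionary-based approach suggested above could look roughly like this (a sketch under the same bigram model, not a drop-in replacement for the article's functions):

```python
from collections import Counter, defaultdict
import random

def build_chain(text):
    """Map each word to a Counter of its successors (last word maps to itself)."""
    tokens = text.split()
    chain = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        chain[a][b] += 1
    chain[tokens[-1]][tokens[-1]] += 1
    return chain

def generate(chain, word, iterations=10, rng=random):
    """Walk the chain, drawing successors with probability proportional to count."""
    out = [word]
    for _ in range(iterations):
        successors = chain[word]
        # Weighted draw directly from counts: no explicit matrix is needed
        word = rng.choices(list(successors), weights=successors.values())[0]
        out.append(word)
    return " ".join(out)
```

Because lookups are constant time and only observed bigrams are stored, this avoids both the repeated `.index()` scans and the dense matrix.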
Mathematical Summary
- Bigram counts: $C(a, b)$.
- Outgoing totals: $C(a) = \sum_{b \in W} C(a, b)$.
- Transition probability: $P(b \mid a) = C(a, b) / C(a)$.
- Sampling update: $s_{t+1} \sim \mathrm{Categorical}(M_{i,\cdot})$, where $i$ is the index of the current word $s_t$.
Citations
- Cervantes Saavedra, M. de. (2003). Don Quixote (E. Grossman, Trans.). HarperCollins. (Original work published 1605)