Markov Chains - NLP | text generation from scratch - Heuristic Analyst

Today I would like to create a simple, primitive text generator using Markov chains (I explained them in a previous post).

Necessary steps for this project:

Get text to learn from
Create the transition matrix
Generate text

I will do everything from scratch, but I will a library called “tqdm” to show a progress bar of a for loop.

Get text to learn from:

First we need to get data. I will use Wikipedia to mine free text data. The output will be a .txt file with the text as one big string – I will not go into much detail here since it is not the main topic of this post. I will use the Wikipedia pages of Psychology, Philosophy and Sigmund_Freud:

# Import package
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from tqdm import tqdm

def downloadWikiText(urls):
    text = ''
    for url in tqdm(urls):
        # Specify url of the web page
        source = urlopen(url).read()
        # Make a soup
        soup = BeautifulSoup(source,'lxml')
        # Extract the plain text content from paragraphs
        for paragraph in soup.find_all('p'):
            text += paragraph.text

        # Clean text
        text = re.sub(r'\[.*?\]+', '', text)
        text = text.replace('\n', '')
        # Append space for next text
        text += " "
    # Remove last " " and then save text as a file
    text = text[:-1]
    with open('wiki_text.txt', 'w', encoding='utf-8') as f:
        f.write(text)

urls = ["https://en.wikipedia.org/wiki/Psychology",
        "https://en.wikipedia.org/wiki/Philosophy",
        "https://en.wikipedia.org/wiki/Sigmund_Freud"]
downloadWikiText(urls)

Now the text is saved in “./wiki_text.txt”.

Create the transition matrix:

The following code will the these steps:

Open wiki_text.txt and save text in a variable
delete every non-alphanumeric symbol and convert string text to a list with each word as an element of the text (all words will be lowercase)
Create a list will all unique words
Create the transition matrix which will be a square matrix with the length of the unique words list
Iterate through the wiki text:
1. Find current word
2. Find next word
3. Add the occurence of the transition (from current to next word) to the transition matrix by adding 1 to the corresponding cell
Since the probability of all next words from the current word should be 1 -> divide transition matrix rows by its sum
Save unique words and corresponding transition matrix

import re
from tqdm import tqdm

# open text file with gathered wiki text
with open('wiki_text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# get only words out of it - replaces all symbols like !,? etc...
# first substitute every non-alphanumeric character with " ", then:
# if double, triple, ... spaces exist just replace them with one space, afterwards split text into a list
text = re.sub(" +", " ", re.sub(r"[^a-zA-Z0-9]", " ", text)).split(" ")
# delete all "" in list
text = list(filter(lambda a: a != "", text))
# make every word lowercase
text = [x.lower() for x in text]
# get every word one time and sort the list alphabetically
unique_words = sorted(list(set(text)))

# create transition matrix - square matrix
transition_matrix = [[0 for j in range(len(unique_words))] for i in range(len(unique_words))]
# add each occurrence of state transition into the transition matrix
for i in tqdm(range(len(text)-1)):
    # current word in text (index in unique word list)
    current_state = unique_words.index(text[i])
    # next word in text (index in unique word list)
    future_state = unique_words.index(text[i+1])
    # add 1 to state transition in matrix
    transition_matrix[current_state][future_state] += 1

# divide each row by the sum of each row
for i in tqdm(range(len(transition_matrix))):
    if text[-1] != unique_words[i]:
        sum_row = 0
        for number_of_occurrences in transition_matrix[i]:
            sum_row += number_of_occurrences
        for j in range(len(transition_matrix[i])):
            transition_matrix[i][j] /= sum_row

# save unique words and transition matrix
with open('unique_words.txt', 'w', encoding='utf-8') as f:
    f.write(str(unique_words))
with open('transition_matrix.txt', 'w', encoding='utf-8') as f:
    f.write(str(transition_matrix))

Generate text:

The following code will do these steps:

Open unique words list and transition matrix list/array – both as a string
Convert both string lists to actual lists with the “json” library
Set a starting phrase/word and number of words as well as number of sentences to be generated
Create loop for each sentence:
1. Create loop for each word:
  1. get current state
  2. get to next state by random choice with probability distribution from transition matrix
  3. add word to sentence
  4. set next state to current state
  5. Repeat until number of words and number of sentences are generated
Print sentences – will differ every time

import json
import random
from tqdm import tqdm

# open unique words and calculated transition matrix
with open('unique_words.txt', 'r', encoding='utf-8') as f:
    unique_words = f.read()
with open('transition_matrix.txt', 'r', encoding='utf-8') as f:
    transition_matrix = f.read()

# convert string representation of list to actual list using json.loads()
unique_words = json.loads(unique_words.replace("'", '"'))
print("Unique words:", len(unique_words))
transition_matrix = json.loads(transition_matrix)

# generate random text by giving a start word (must be included in the training text)
starting_phrase = "Sigmund Freud"
length_of_sentences = 20    # words
number_of_sentences = 10
# get "Freud" as the starting word
start_word = starting_phrase.split(" ")[-1]
current_word = start_word.lower()
sentences = [starting_phrase.lower() for i in range(number_of_sentences)]
for i in tqdm(range(number_of_sentences)):
    for j in range(length_of_sentences):
        current_state = unique_words.index(current_word)
        try:
            next_state = random.choices(unique_words, transition_matrix[current_state])[0]
            current_word = next_state
            sentences[i] += " " + current_word
        except:
            break
for i in range(len(sentences)):
    print(str(i+1)+". sentence:", sentences[i])

# Output
Unique words: 6402
1. sentence: sigmund freud s grandson ernst simmel founded in an example newton s directorship the dora a person s daughter s method and
2. sentence: sigmund freud ego it was cremated at johns hopkins university led to one independent variable at golders green crematorium s sense philosophy
3. sentence: sigmund freud of questions about the brain was dominated by the feminine mystique 1963 declared assets and then various traditions were a
4. sentence: sigmund freud major trait or innate drives a legacy though without which sought as it the most articles recommending medical doctors psychology
5. sentence: sigmund freud emerged as military the theoretical work permits were the qing dynasty through stories and to engineer the nose and the
6. sentence: sigmund freud importance of castration complex concludes that male self as a much information quickly found that allow one s case they
7. sentence: sigmund freud are we come readily to work of neuroanatomy 42 countries for freud s early buddhist philosophy include selective attrition the
8. sentence: sigmund freud set out the nature of freud theorized that freud hoped to vienna general hospital in the complex this period began
9. sentence: sigmund freud smoking and behavioral neuroscience uses the birth which had first to increase through to test freud he had a small
10. sentence: sigmund freud number of his patients assured him from western electric s ideas are content of the group of such an illusion

As you can see the sentences do not make sense, but their grammar is mostly correct though. A very good outcome, considering we only worked with 6402 unique words!

This was a first step into understanding NLP. As always the code can be found here on my github (www.github.com/Heuristic-Analyst/…). Cheers!