Today I would like to create a simple, primitive text generator using Markov chains (I explained them in a previous post).
Necessary steps for this project:
- Get text to learn from
- Create the transition matrix
- Generate text
I will do everything from scratch, but I will a library called “tqdm” to show a progress bar of a for loop.
Get text to learn from:
First we need to get data. I will use Wikipedia to mine free text data. The output will be a .txt file with the text as one big string – I will not go into much detail here since it is not the main topic of this post. I will use the Wikipedia pages of Psychology, Philosophy and Sigmund_Freud:
# Import package
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from tqdm import tqdm
def downloadWikiText(urls):
text = ''
for url in tqdm(urls):
# Specify url of the web page
source = urlopen(url).read()
# Make a soup
soup = BeautifulSoup(source,'lxml')
# Extract the plain text content from paragraphs
for paragraph in soup.find_all('p'):
text += paragraph.text
# Clean text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
# Append space for next text
text += " "
# Remove last " " and then save text as a file
text = text[:-1]
with open('wiki_text.txt', 'w', encoding='utf-8') as f:
f.write(text)
urls = ["https://en.wikipedia.org/wiki/Psychology",
"https://en.wikipedia.org/wiki/Philosophy",
"https://en.wikipedia.org/wiki/Sigmund_Freud"]
downloadWikiText(urls)
Now the text is saved in “./wiki_text.txt”.
Create the transition matrix:
The following code will the these steps:
- Open wiki_text.txt and save text in a variable
- delete every non-alphanumeric symbol and convert string text to a list with each word as an element of the text (all words will be lowercase)
- Create a list will all unique words
- Create the transition matrix which will be a square matrix with the length of the unique words list
- Iterate through the wiki text:
- Find current word
- Find next word
- Add the occurence of the transition (from current to next word) to the transition matrix by adding 1 to the corresponding cell
- Since the probability of all next words from the current word should be 1 -> divide transition matrix rows by its sum
- Save unique words and corresponding transition matrix
import re
from tqdm import tqdm
# open text file with gathered wiki text
with open('wiki_text.txt', 'r', encoding='utf-8') as f:
text = f.read()
# get only words out of it - replaces all symbols like !,? etc...
# first substitute every non-alphanumeric character with " ", then:
# if double, triple, ... spaces exist just replace them with one space, afterwards split text into a list
text = re.sub(" +", " ", re.sub(r"[^a-zA-Z0-9]", " ", text)).split(" ")
# delete all "" in list
text = list(filter(lambda a: a != "", text))
# make every word lowercase
text = [x.lower() for x in text]
# get every word one time and sort the list alphabetically
unique_words = sorted(list(set(text)))
# create transition matrix - square matrix
transition_matrix = [[0 for j in range(len(unique_words))] for i in range(len(unique_words))]
# add each occurrence of state transition into the transition matrix
for i in tqdm(range(len(text)-1)):
# current word in text (index in unique word list)
current_state = unique_words.index(text[i])
# next word in text (index in unique word list)
future_state = unique_words.index(text[i+1])
# add 1 to state transition in matrix
transition_matrix[current_state][future_state] += 1
# divide each row by the sum of each row
for i in tqdm(range(len(transition_matrix))):
if text[-1] != unique_words[i]:
sum_row = 0
for number_of_occurrences in transition_matrix[i]:
sum_row += number_of_occurrences
for j in range(len(transition_matrix[i])):
transition_matrix[i][j] /= sum_row
# save unique words and transition matrix
with open('unique_words.txt', 'w', encoding='utf-8') as f:
f.write(str(unique_words))
with open('transition_matrix.txt', 'w', encoding='utf-8') as f:
f.write(str(transition_matrix))
Generate text:
The following code will do these steps:
- Open unique words list and transition matrix list/array – both as a string
- Convert both string lists to actual lists with the “json” library
- Set a starting phrase/word and number of words as well as number of sentences to be generated
- Create loop for each sentence:
- Create loop for each word:
- get current state
- get to next state by random choice with probability distribution from transition matrix
- add word to sentence
- set next state to current state
- Repeat until number of words and number of sentences are generated
- Create loop for each word:
- Print sentences – will differ every time
import json
import random
from tqdm import tqdm
# open unique words and calculated transition matrix
with open('unique_words.txt', 'r', encoding='utf-8') as f:
unique_words = f.read()
with open('transition_matrix.txt', 'r', encoding='utf-8') as f:
transition_matrix = f.read()
# convert string representation of list to actual list using json.loads()
unique_words = json.loads(unique_words.replace("'", '"'))
print("Unique words:", len(unique_words))
transition_matrix = json.loads(transition_matrix)
# generate random text by giving a start word (must be included in the training text)
starting_phrase = "Sigmund Freud"
length_of_sentences = 20 # words
number_of_sentences = 10
# get "Freud" as the starting word
start_word = starting_phrase.split(" ")[-1]
current_word = start_word.lower()
sentences = [starting_phrase.lower() for i in range(number_of_sentences)]
for i in tqdm(range(number_of_sentences)):
for j in range(length_of_sentences):
current_state = unique_words.index(current_word)
try:
next_state = random.choices(unique_words, transition_matrix[current_state])[0]
current_word = next_state
sentences[i] += " " + current_word
except:
break
for i in range(len(sentences)):
print(str(i+1)+". sentence:", sentences[i])
# Output
Unique words: 6402
1. sentence: sigmund freud s grandson ernst simmel founded in an example newton s directorship the dora a person s daughter s method and
2. sentence: sigmund freud ego it was cremated at johns hopkins university led to one independent variable at golders green crematorium s sense philosophy
3. sentence: sigmund freud of questions about the brain was dominated by the feminine mystique 1963 declared assets and then various traditions were a
4. sentence: sigmund freud major trait or innate drives a legacy though without which sought as it the most articles recommending medical doctors psychology
5. sentence: sigmund freud emerged as military the theoretical work permits were the qing dynasty through stories and to engineer the nose and the
6. sentence: sigmund freud importance of castration complex concludes that male self as a much information quickly found that allow one s case they
7. sentence: sigmund freud are we come readily to work of neuroanatomy 42 countries for freud s early buddhist philosophy include selective attrition the
8. sentence: sigmund freud set out the nature of freud theorized that freud hoped to vienna general hospital in the complex this period began
9. sentence: sigmund freud smoking and behavioral neuroscience uses the birth which had first to increase through to test freud he had a small
10. sentence: sigmund freud number of his patients assured him from western electric s ideas are content of the group of such an illusion
As you can see the sentences do not make sense, but their grammar is mostly correct though. A very good outcome, considering we only worked with 6402 unique words!
This was a first step into understanding NLP. As always the code can be found here on my github (www.github.com/Heuristic-Analyst/…). Cheers!