A delicious hoax was recently perpetrated on the highbrow literary community. It really took the cake. Sometime in 2002, Arnold Harvey invented an 1862 meeting between Charles Dickens and Fyodor Dostoyevsky and had 'evidence' of the meeting published in a respectable literary journal. In 2011 his fabrication was briefly taken as fact and appeared in at least two Dickens biographies and numerous book reviews.
Part of Mr. Harvey's genius was his sly reverse psychology. The meeting is mentioned in a nonchalant, matter-of-fact way towards the end of an utterly commonplace piece of scholarly boffinhood titled Dickens's Villains: A Confession and a Suggestion. The author of the article, a pseudonymous "Stephanie Harvey," quotes a letter from Dostoyevsky which she supposedly translated from Russian. In the letter, Dostoyevsky recalls his meeting with Dickens sixteen years after the fact. Mr/Ms. Harvey's article appeared in vol. 98 of the literary journal The Dickensian, where it went unremarked upon for almost ten years before the biographer Claire Tomalin discovered it. She found the anecdote so "irresistible" that she put it in her tome Charles Dickens: A Life. From there the "remarkable" encounter wound up in the opening paragraph of the New York Times review, in various other reviews and biographies (including Michael Slater's Charles Dickens: A Life Defined by Writing), and it will probably continue to be recounted as fact forever in the endless echo chamber of the Interwebs. (For the interested reader, the Times Literary Supplement contains a lengthy investigation, along with speculation that Mr. Harvey is some sort of rogue scholar-vigilante.)
The tantalising prospect of such a meeting seems to have intoxicated many otherwise sober critics. Nobody asked practical questions, like: what language did they communicate in? Or: had Dickens ever even heard of Dostoyevsky? The London Review of Books wrote that Ms. Tomalin "might have been less susceptible had she not so badly wanted it to be true." And we want it to be true, too! Sadly, absent a time machine, there's no way to make it so. But we can do the next best thing: we can train machines on the words of these two authors and then set those machines to chatting with one another.
- FYODOR: "This hatred for Russia has been already embodied in the narrative as it stands so far and the other my own ."
- CHARLES: "We know what Russia means sir says Podsnap we know what England is."
- FYODOR: "You thirsted while in Switzerland for your home country for Russia you read doubtless many books about Russia ."
- CHARLES: "Although I saw him every day it was for some time longer to settle myself for the present in Switzerland ."
- FYODOR: "It was a recollection of Switzerland ."
- CHARLES: "I tingle again from head to foot as my recollection turns that corner and my pen shakes in my hand ."
- FYODOR: "But I had supposed that laying aside my pen and saying farewell to my readers I should be heard ..."
- CHARLES: "Upon my life the whole social system as the men call it when they make speeches in Parliament is a system of Prince 's nails !"
- FYODOR: "Why in the English Parliament a Member got up last week and speaking about the Nihilists asked the Ministry whether it was not high time to intervene to educate this barbarous people ."
- CHARLES: "Do n't you know that people die there ?"
- FYODOR: "But excuse me I 'll make merry till I die !"
Isn't this conversation just what we'd expect?! It's lively, and moves quickly from Mother Russia to writing to English politics to colonialism... And what strikes me is how the character of the two authors is present in their words: Dostoyevsky's brooding existentialism, Dickens' concern with social justice. Their words could have come straight from their books... because they did: this conversation was automatically generated from the two authors' oeuvres.
Before we proceed with the technical details, pause for a moment and think about the possibilities of this technology. It could open up an entirely new area of scholarship in the humanities: the study of speculative conversations. For example, how much would we learn if Joyce and Homer shared a glass of wine-dark liquor? How entertaining to eavesdrop on Ikkyuu and Hunter S. Thompson. Why stop at two authors? The magical "dead authors' dinner party" can be a reality.
Automatic text generation with a Markov process
The text above was generated using a Markov process. The concept is very simple: imagine a network that links every word with the word which appears after it. Here's an illustration using the first line of a certain Dickens novel:
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness,"
# initialise
%pylab inline
from nltk import word_tokenize
import networkx as nx

text = "It was the best of times , it was the worst of times ,\
 it was the age of wisdom , it was the age of foolishness ,\
 it was the epoch of belief , it was the epoch of incredulity ,\
 it was the season of Light , it was the season of Darkness ,"

# Tokenize and load into tuples
tokens = word_tokenize(text)

# Create network chart of 1st order Markov network
tuples = []
for i in range(len(tokens)-1):
    gram = tuple(tokens[i:i+2])
    tuples.append(gram)
G = nx.DiGraph()
for tup in tuples:
    G.add_node(tup[0])
    G.add_node(tup[1])
    G.add_edge(tup[0], tup[1])
nx.draw_spring(G, node_size=0, alpha=0.4, edge_color='r', font_size=12, node_shape='o', iterations=50)
You can see how many words lead back to "of," which in turn points onwards to many other words. For a Markov process, the original order of the words is irrelevant. The process creates strings of words by following links between them at random. You might imagine it as a recursive decision tree. Here's an illustration, along with a simple implementation using the Dickens excerpt and some example generated text:
# create a dict mapping each word to the words that follow it
from random import choice

unigramDict = dict()
for i in range(len(tokens) - 1):
    gram = tokens[i]
    next = tokens[i+1]
    if gram in unigramDict:
        unigramDict[gram].append(next)
    else:
        unigramDict[gram] = [next]
print "Possible words subsequent to 'the': " + ", ".join(unigramDict['the'])
print "Possible words subsequent to 'of': " + ", ".join(unigramDict['of'])

# Generation of text
print "\nThree sentences generated at random, each adding 12 tokens after 'it':"
for i in range(3):
    sentence = ['it']
    for j in range(12):
        nextword = choice(unigramDict[sentence[-1]])
        sentence.append(nextword)
    print "* " + " ".join(sentence)
### Illustration of recursive decision tree
print "\n"
plt.figure(figsize=(16, 10))
G = nx.DiGraph()
## networkx doesn't allow >1 node with same name, so we must fool it by
## creating nodes with numeric names then replacing with labels
node_labels = {}
tier_counter = 0
node_counter = 0
# start network with rootnode
rootnode = 'it'
node_num = str(tier_counter) + str(node_counter)
G.add_node(node_num)
node_labels[node_num] = rootnode
# populate remaining tiers, using the Markov dict and the last sentence
# generated above to pick which branch of the tree to expand
tierlist = []
for i in range(0, 10):
    tierlist.append(list(set(unigramDict[sentence[i]])))
for tier_down in tierlist:
    tier_counter = tier_counter + 1
    node_counter = 0
    for word in tier_down:
        node_num = str(tier_counter) + str(node_counter)
        G.add_node(node_num)
        node_labels[node_num] = word
        node_counter = node_counter + 1
        # link each node in this tier back to the first node of the tier above
        G.add_edge(str(tier_counter-1) + str(0), node_num)
plt.title("A recursive decision tree: building a sentence from 'it'")
pos = nx.graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=False, node_size=0, alpha=0.4, edge_color='r', font_size=12, labels=node_labels)
Possible words subsequent to 'the': best, worst, age, age, epoch, epoch, season, season
Possible words subsequent to 'of': times, times, wisdom, foolishness, belief, incredulity, Light, Darkness
Three sentences generated at random, each adding 12 tokens after 'it':
* it was the age of times , it was the season of Darkness
* it was the age of incredulity , it was the age of wisdom
* it was the age of times , it was the season of foolishness
The sentences are pretty similar and there's not much variety, which you'd expect with such a tiny corpus. They aren't always quite sensible, either: "it was the age of times" feels like it might make sense, but it doesn't. Below we'll create a network based on six of Dickens' books, and as you can imagine it will be a lot larger: common words like "of" will have hundreds of possible paths.
That's the basic idea. Markov chains are mindless, memoryless things, which is why it's so surprising that they generate even half-convincing text. I say 'half-convincing' because what you really get is word salad, with occasionally lucid bits by chance. Text generated in this way reminds me of the old Surrealist game, Exquisite Corpse; it has the same eerie feeling of half-sense.
Below we'll add a couple of extra bells and whistles to try to reduce the 'word salad' effect. For example, we'll create Markov processes based not only on pairs of words, but also on trios and quartets. It turns out that if you use longer chains, you tend to generate more coherent text.
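To make the point concrete, here's a minimal sketch (using the tokens and unigramDict built from the excerpt above) of a 2nd order dict keyed on word pairs. Note how the branching collapses: a single word like "the" has many possible successors, while a pair of words has very few, so longer keys keep generated text closer to real phrases.
# A sketch of a 2nd order (bigram-keyed) Markov dict, built from the
# 'tokens' list of the Tale of Two Cities excerpt above
bigramDict = dict()
for i in range(len(tokens) - 2):
    gram = tuple(tokens[i:i+2])
    nextword = tokens[i+2]
    if gram in bigramDict:
        bigramDict[gram].append(nextword)
    else:
        bigramDict[gram] = [nextword]
# single words branch wildly; word pairs barely branch at all
print "After 'the': " + ", ".join(unigramDict['the'])
print "After ('the', 'season'): " + ", ".join(bigramDict[('the', 'season')])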
Further reading: some amusing examples of Markov-generated text:
- Gnoetry: human-computer collaborative poetry
- Garkov: Markov-generated Garfield strips
- Building a markov-chain IRC bot
Download, import and clean up text
Feel free to skip ahead
import urllib2
import re
import collections
from nltk import pos_tag, ne_chunk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from nltk.corpus import stopwords
from scipy import stats
from random import random, sample, choice
import pygraphviz
# Specify texts that will form basis of talking machine
DickensTexts = ['http://www.gutenberg.org/cache/epub/1400/pg1400.txt', # Great expectations
'http://www.gutenberg.org/cache/epub/98/pg98.txt', # A Tale of Two Cities
'http://www.gutenberg.org/cache/epub/730/pg730.txt', # Oliver Twist
'http://www.gutenberg.org/cache/epub/1023/pg1023.txt', # Bleak House
'http://www.gutenberg.org/cache/epub/766/pg766.txt', # David Copperfield
'http://www.gutenberg.org/cache/epub/883/pg883.txt'] # Our Mutual Friend
FyodorTexts = ['http://www.gutenberg.org/cache/epub/37536/pg37536.txt', # The House of the Dead
'http://www.gutenberg.org/cache/epub/2197/pg2197.txt', # The Gambler
'http://www.gutenberg.org/cache/epub/8117/pg8117.txt', # The Possessed
'http://www.gutenberg.org/cache/epub/600/pg600.txt', # Notes from the Underground
'http://www.gutenberg.org/files/28054/28054.txt', # Brothers Karamazov
'http://www.gutenberg.org/cache/epub/2638/pg2638.txt'] # The Idiot
## download and concatenate texts
def downloadTexts(textList):
    allTexts = ""
    for i in textList:
        text = urllib2.urlopen(i).read().replace('\n', ' ').replace('\r', ' ')
        # Strip out Project Gutenberg header & footer
        textStart = re.search(r'START OF [\w]+ PROJECT GUTENBERG EBOOK', text).end()
        textEnd = re.search(r'END OF [\w]+ PROJECT GUTENBERG EBOOK', text).start()
        allTexts = allTexts + text[textStart:textEnd]
    return allTexts

DickensString = downloadTexts(DickensTexts)
FyodorString = downloadTexts(FyodorTexts)
## remove punctuation
def cleanup(text):
    # Strip out punctuation except for that which marks the end of a sentence:
    # !, ? and . _except_ where the . comes after Mr, Mrs, Ms, etc.
    text = re.sub(r'[;:()",_`]', ' ', text)
    text = re.sub(r'\s[\']', ' ', text)  # remove apostrophes where they are used as quotation marks...
    text = re.sub(r'\'[\']', ' ', text)  # ...which in some texts appear, weirdly, side by side in imitation of double quotes...
    text = re.sub(r'[\']\s', ' ', text)  # ...but leave them where they are contractions or possessives
    text = re.sub(r'[-+]', ' ', text)    # remove hyphens where they occur on their own
    return text

DickensString = cleanup(DickensString)
FyodorString = cleanup(FyodorString)
## Convert text to a list of sentences, where each sentence is tokenized
# Sentences are defined as ending where !, ? or . appears, except where the period follows Mr/Mrs/Ms/etc.
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'st'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)

# Split into sentences, and split those sentences into word tokens,
# saving end-of-sentence punctuation as separate tokens
# Note that this will split words on apostrophes, annoyingly
FyodorSentences = [word_tokenize(i) for i in sentence_splitter.tokenize(FyodorString)]
print "\nSome sentences from the Dostoyevsky corpus:"
for sentence in FyodorSentences[500:505]:
    print "* " + " ".join(sentence)
DickensSentences = [word_tokenize(i) for i in sentence_splitter.tokenize(DickensString)]
print "Some sentences from the Dickens corpus:"
for sentence in DickensSentences[2000:2005]:
    print "* " + " ".join(sentence)
Some sentences from the Dostoyevsky corpus:
On seeing me the little girl blushed and murmured a few words into her mother 's ear who stopped and took from a basket a kopeck which she gave to the little girl .
The little girl ran after me .
Here poor man she said take this in the name of Christ .
I took the money which she slipped into my hand .
The little girl returned joyfully to her mother .
Some sentences from the Dickens corpus:
I thought it best to hint through the medium of a meditative look that this might be occasioned by circumstances over which I had no control .
She said no more at the time but she presently stopped and looked at me again and presently again and after that looked frowning and moody .
On the next day of my attendance when our usual exercise was over and I had landed her at her dressing table she stayed me with a movement of her impatient fingers Tell me the name again of that blacksmith of yours .
`Joe Gargery ma'am .
`Meaning the master you were to be apprenticed to ?``
Load corpora into ngram-searchable dicts
We load the tokenized sentences into a large Python dict. We load ngrams with all their possible subsequent words:
- Unigrams, e.g. "I" --> "have", "can", "want", ...
- Bigrams, e.g. "I", "have" --> "a", "seen", "never", ...
- Trigrams, e.g. "I", "have", "a" --> "sharp", "good", ...
- 4-grams, e.g. "I", "have", "a", "sharp" --> "harpoon", "knife"
By the time we arrive at 4-grams, there's often only one or two possible subsequent terms. Even with an oeuvre of six books, it's pretty rare for more than one sentence to contain the same string of four consecutive words.
## This function creates a dict of ngram-keys with possible subsequent words
# the maxNgramDepth parameter determines the depth of ngrams to stop at: unigrams only (1),
# unigrams and bigrams (2), unigrams, bigrams AND trigrams (3), etc.
def ngramSentences(sentences, maxNgramDepth):
    ngramDict = dict()
    for ngramDepth in range(1, maxNgramDepth+1):
        # for each sentence, record every ngram of this depth with the word that follows it
        for sentence in sentences:
            for i in range(len(sentence) - ngramDepth):
                if ngramDepth == 1: gram = sentence[i]
                else: gram = tuple(sentence[i:i+ngramDepth])  # For 2nd order+, we store ngrams in tuples
                next = sentence[i+ngramDepth]  # get the element after the gram
                if gram in ngramDict:
                    ngramDict[gram].append(next)
                else:
                    ngramDict[gram] = [next]
    return ngramDict
## create 1st-4th order ngram dicts for both authors
# we can call these with unigrams or bigrams or trigrams or...
DickensDict = ngramSentences(DickensSentences, 4)
FyodorDict = ngramSentences(FyodorSentences, 4)

# print samples
print "Samples of ngram-keys with their possible subsequent word(s) from Dickens dict:"
for (key, wordlist) in DickensDict.items()[0:5]:
    print "* " + str(key) + ": " + str(wordlist)
print "\nSamples of ngram-keys with their possible subsequent word(s) from Dostoyevsky dict:"
for (key, wordlist) in FyodorDict.items()[500000:500005]:
    print "* " + str(key) + ": " + str(wordlist)
Samples of ngram-keys with their possible subsequent word(s) from Dickens dict:
('be', 'eyed'): ['as', 'as']
('himself', 'a', 'little', 'shake'): ['as']
('and', 'since', 'that'): ['whenever']
('taken', 'air', 'as'): ['I']
('with', 'me', 'we'): ['can']
Samples of ngram-keys with their possible subsequent word(s) from Dostoyevsky dict:
('A', 'few', 'hours'): ['earlier']
('B.', 'loved', 'and', 'esteemed'): ['him']
('You', "'d", 'better', 'hold'): ['your']
('be', 'a', 'fool', 'strikes'): ['a']
('a', 'loving', 'heart'): ['.', '.']
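The samples above hint at this, and we can quantify it with a quick sanity check (a sketch only, not part of the pipeline): compute the average number of distinct successors per key at each ngram depth in the Dickens dict. Unigram keys should branch heavily, 4-gram keys hardly at all.
## A rough check of the branching claim, using the DickensDict built above
## (assumes unigram keys are strings and deeper keys are tuples, as above)
branching = dict()
for key, words in DickensDict.items():
    depth = 1 if isinstance(key, str) else len(key)
    branching.setdefault(depth, []).append(len(set(words)))
for depth in sorted(branching.keys()):
    counts = branching[depth]
    print "depth %d: %.2f distinct successors on average" % (depth, sum(counts) / float(len(counts)))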
Now we're ready to start building sentences. For a given string of up to four words, we can search an author's dict for possible subsequent words from their oeuvre. As the sentence grows longer, we use only its last few words as input to find the next word.
Stringing together sentences
To generate sentences, we don't need many rules:
- Generate the sentence from a starting word until punctuation is encountered; this ends the sentence.
- Where the length of the sentence permits, call for new words using the last 4 words. If that fails, use the last 3 words. And so on.
### Get the next word for a list of words
## Work through ngrams in hierarchy: e.g. for a string of words, try to use 4-grams first, then trigrams, and so on
## To prevent too much repetition, add a little randomness that occasionally skips 4-grams, etc.

# Create a PMF to generate a starting ngram length of between 1 and 4, where 4 is most likely
# Use this number to determine where to start on hierarchy of dict calls for subsequent word
xk = [1, 2, 3, 4]
pk = (0.01, 0.05, 0.1, 0.84)
custm = stats.rv_discrete(name='custm', values=(xk, pk))

def getNextWords(startWords, authorDict):
    randNgram = custm.rvs()  # Random choice of ngram length to start with
    wordsLen = len(startWords)  # length of input sentence
    startNgram = min(randNgram, wordsLen)
    # Try to get next word using most recent 4 words
    if startNgram == 4:
        lastWords = tuple(startWords[wordsLen-4:wordsLen])
        if authorDict.has_key(lastWords):
            return authorDict[lastWords]
    # Else, try with most recent 3 words
    if startNgram >= 3:
        lastWords = tuple(startWords[wordsLen-3:wordsLen])
        if authorDict.has_key(lastWords):
            return authorDict[lastWords]
    # if not possible, try with most recent 2 words
    if startNgram >= 2:
        lastWords = tuple(startWords[wordsLen-2:wordsLen])
        if authorDict.has_key(lastWords):
            return authorDict[lastWords]
    # if not possible, get next word with last word
    lastWord = startWords[wordsLen-1]
    if authorDict.has_key(lastWord):
        return authorDict[lastWord]
    # failing that, return nothing
    return None
Now that we have a function which suggests a new word for any string of words, it's trivial to call it repeatedly and build a sentence:
def buildsentencefwd(startWord, authorDict, maxLength=100):
    # before starting, check if word appears at all in dict
    if not authorDict.has_key(startWord): return []
    chainSentence = [startWord]
    sentenceLength = len(chainSentence)
    while sentenceLength <= maxLength:
        nextWords = getNextWords(chainSentence, authorDict)
        if not nextWords: break
        nextWord = choice(nextWords)
        chainSentence.append(nextWord)
        if any([punct in nextWord for punct in [".", "!", "?"]]):  # if next word is or contains sentence-end punctuation
            break
        sentenceLength = len(chainSentence)
    return chainSentence
# Show some example sentences
subject = 'London'
print "\nA sentence by Dickens on the subject of '" + subject + "':"
# generate sentence _forward_ from keyword
CharlesTalk = buildsentencefwd(subject, DickensDict)
print 'CHARLES: "' + " ".join(CharlesTalk) + '"'

subject = 'Petersburg'
print "\nA sentence by Dostoyevsky on the subject of '" + subject + "':"
FyodorTalk = buildsentencefwd(subject, FyodorDict)
print 'FYODOR: "' + " ".join(FyodorTalk) + '"'
A sentence by Dickens on the subject of 'London':
CHARLES: "London streets so crowded with people and so brilliantly lighted in the dusk of the ninth evening ."
A sentence by Dostoyevsky on the subject of 'Petersburg':
FYODOR: "Petersburg with slip shod government clerks discharged military men beggars of the higher class and drunkards of all sorts that he visited their filthy families spent days and nights in the belly of the whale and you get the character of that thinker who lay across the road ."
So now we can generate snippets of text that are sometimes more, sometimes less convincing. At this point, admittedly, it's mostly less. The most obvious flaw is that sentences tend to start jarringly, in the middle of a subject. How to solve this? An idea I picked up from the MegaHAL 'conversational simulator' is to grow each sentence backwards from its starting word using the same process. Our current process generates a sentence forwards until it encounters punctuation; let's do something almost identical in reverse.
# Make more dicts, this time for going _backwards_ in sentences
# I want to use punctuation to mark the _start_ of sentences now
def ngramSentencesBack(text, maxNgramDepth):
    ngramDict = dict()
    # word_tokenize raw text
    tokens = word_tokenize(text)
    # reverse order of text
    tokens = tokens[::-1]
    for ngramDepth in range(1, maxNgramDepth+1):
        for i in range(len(tokens) - ngramDepth):
            if ngramDepth == 1: gram = tokens[i]
            else: gram = tuple(tokens[i:i+ngramDepth])  # For 2nd order+, we store ngrams in tuples
            next = tokens[i+ngramDepth]  # the element after the gram, i.e. the word _before_ it in the original text
            if gram in ngramDict:
                ngramDict[gram].append(next)
            else:
                ngramDict[gram] = [next]
    return ngramDict
## create 1st-4th order REVERSE ngram dicts for both authors
DickensDictBack = ngramSentencesBack(DickensString, 4)
FyodorDictBack = ngramSentencesBack(FyodorString, 4)

print "Samples of *reversed* ngram-keys with their possible *prior* word(s) from Dickens dict:"
for (key, wordlist) in DickensDictBack.items()[0:5]:
    print "* " + str(key) + ": " + str(wordlist)
print "\nSamples of *reversed* ngram-keys with their possible *prior* word(s) from Dostoyevsky dict:"
for (key, wordlist) in FyodorDictBack.items()[500000:500005]:
    print "* " + str(key) + ": " + str(wordlist)
## Define a function to build a sentence _backwards_ from a list of words
# We can still call on our 'getNextWords' function
def buildsentenceback(startSentence, authorDictBack, maxLength=100):
    # before starting, check that we have something to work with and that
    # at least the first word appears in the dict
    if not startSentence or not authorDictBack.has_key(startSentence[0]): return startSentence
    chainSentence = startSentence
    sentenceLength = len(chainSentence)
    while sentenceLength <= maxLength:
        nextWords = getNextWords(chainSentence[::-1], authorDictBack)
        if not nextWords: break
        nextWord = choice(nextWords)
        if any([punct in nextWord for punct in [".", "!", "?"]]):  # if prior word is or contains sentence-end punctuation
            break
        chainSentence.insert(0, nextWord)
        sentenceLength = len(chainSentence)
    return chainSentence
# Show some example sentences
subject = 'mankind'
print "\nA full sentence by Dickens on the subject of '" + subject + "':"
# generate sentence _forward_ from keyword
CharlesTalk = buildsentencefwd(subject, DickensDict)
# generate sentence _backward_ from keyword
CharlesTalk = buildsentenceback(CharlesTalk, DickensDictBack)
print 'CHARLES: "' + " ".join(CharlesTalk) + '"'

print "\nA full sentence by Dostoyevsky on the subject of '" + subject + "':"
FyodorTalk = buildsentencefwd(subject, FyodorDict)
FyodorTalk = buildsentenceback(FyodorTalk, FyodorDictBack)
print 'FYODOR: "' + " ".join(FyodorTalk) + '"'
Samples of reversed ngram-keys with their possible prior word(s) from Dickens dict:
('looking', 'and', 'returned'): ['I']
('sealed', 'I', 'letter'): ['This']
('parallel', 'two', 'into', 'vibrate'): ['to']
('the', 'of', 'heap', 'dust'): ['common']
('dice', 'her', 'with', 'knuckles'): ['the']
Samples of reversed ngram-keys with their possible prior word(s) from Dostoyevsky dict:
('painted', 'Gania', 'as', 'black'): ['so']
('us', 'suits', 'that', 'and'): ['fathers']
('It', 'girl.'): ['imperious', 'little']
('dance', 'to', 'used'): ['arms']
('up', 'went', 'he', 'and'): ['instant']
A full sentence by Dickens on the subject of 'mankind':
CHARLES: "I am above the rest of mankind in such a case than merely as a poor vagabond which any one can be ."
A full sentence by Dostoyevsky on the subject of 'mankind':
FYODOR: "We sat like that for ten roubles any day not to speak of the suffering of mankind generally in such a disinterested and magnanimous and it 's no matter to you ."
Generating replies to speech that are on-topic (or not)
Now that we can generate sentences, the remaining challenge is to make a conversation out of them. To make the conversation flow, our machine-conversationalists need to be able to comprehend what has been said to them and generate an on-topic reply. Teaching a machine to comprehend is far beyond my skills as a programmer or a philosopher. As a playful substitute, let's try simply pulling keywords (read: nouns) out of a piece of text, then using those to generate possible replies.
Also, there's always the chance that a subject could come up about which one of our authors has absolutely nothing to say. In that case, like any good conversationalist, they'll have a fall-back list of conversation topics. What better universal subject than travel? Let's create a list of the three most-mentioned places in each author's oeuvre.
### create a function to pick out keywords from any piece of text
def getKeywords(sentence):
    sentenceTagged = pos_tag(sentence)
    # if possible, pull nouns
    keywords = [word for word, pos in sentenceTagged if pos in ['NN', 'NNS', 'NNP', 'NNPS']]
    # if no nouns, take verbs
    if not keywords:
        keywords = [word for word, pos in sentenceTagged if pos == 'VB']
    return sample(keywords, min(5, len(keywords)))
print "Make keywords examples:"
speech = buildsentencefwd("I", DickensDict)
print 'CHARLES: "' + " ".join(speech) + '"'
keywords = getKeywords(speech)
print "Keywords: " + str(keywords)
speech = buildsentencefwd("I", FyodorDict)
print '\nFYODOR: "' + " ".join(speech) + '"'
keywords = getKeywords(speech)
print "Keywords: " + str(keywords)
Make keywords examples:
CHARLES: "I found it smelt exactly as if it had gradually decomposed into that nightmare condition out of the overflowings of the polluted stream ."
Keywords: ['condition', 'stream', 'overflowings']
FYODOR: "I can not think that everything I saw on the stage of our little progressive corners in Petersburg they were prepared to throw anything overboard so soon as they get into the hands of the French ."
Keywords: ['Petersburg', 'stage', 'corners', 'anything', 'everything']
The following code uses NLTK's impressive (but not perfect) out-of-the-box named entity tagger to find the three most commonly mentioned places in our authors' texts.
## Define function to traverse ne_chunked text and pull out locations
# with thanks to yhat blog for help (http://blog.yhathq.com/posts/named-entities-in-law-and-order-using-nlp.html)
def find_places(chunks):
    def traverse(tree):
        "recursively traverses an nltk.tree.Tree to find named entities"
        place_names = []
        if hasattr(tree, 'node') and tree.node:
            if tree.node in ['LOCATION', 'GPE']:
                place_names.append(' '.join([child[0] for child in tree]))
            else:
                for child in tree:
                    place_names.extend(traverse(child))
        return place_names
    named_places = []
    for chunk in chunks:
        entities = sorted(list(set([word for tree in chunk
                                    for word in traverse(tree)])))
        for e in entities:
            if e not in named_places:
                named_places.append(e)
    return named_places
## Define function to extract the three most-mentioned places
def favePlaces(sentences):
    chunked = [ne_chunk(pos_tag(sentence)) for sentence in sentences]
    places = find_places(chunked)
    counts = collections.Counter(places)
    return [noun for noun, count in counts.most_common(3)]

# use a random sample of 30,000 sentences from each corpus to keep runtime down
DickensPlaces = favePlaces(sample(DickensSentences, 30000))
print "Dickens' most-mentioned places: " + ", ".join(DickensPlaces)
FyodorPlaces = favePlaces(sample(FyodorSentences, 30000))
print "Dostoyevsky's most-mentioned places: " + ", ".join(FyodorPlaces)
Dickens' most-mentioned places: Chiltern Hundreds, South Foreland, North Pole
Dostoyevsky's most-mentioned places: North Cape, Caucasus, Western
The final function ties it all together: it is what makes a conversational reply to a preceding comment. It extracts keywords and generates a large quiver of potential replies from those keywords. Then, before an awkward silence descends on the conversation, it must choose a single reply to say aloud.
### For each keyword, generate some random sentences that could be a reply
def makeReply(priorSpeech, authorDict, authorDictBack, authorPlaces):
    replies = []
    # extract keywords from speech
    keywords = getKeywords(priorSpeech)
    # remove keywords that don't appear in author dict
    keywords = [word for word in keywords if authorDict.has_key(word)]
    # if no keywords left, then use author favourites
    if not keywords: keywords = authorPlaces
    # generate 5 replies for each of the keywords
    for word in keywords:
        for i in range(5):
            replyFwd = buildsentencefwd(word, authorDict)
            replyBack = buildsentenceback(replyFwd, authorDictBack)
            replies.append(replyBack)
    ## Choose the reply
    # Try to (1) get one that is a reasonable length, (2) get one that is interesting and addresses the prior speech
    # drop the shortest 25% and the longest 25% of replies
    lenReplies = [len(reply) for reply in replies]
    lowerCutoff = stats.mstats.mquantiles(lenReplies, prob=[0.25])
    upperCutoff = stats.mstats.mquantiles(lenReplies, prob=[0.75])
    midReplies = [reply for reply in replies if len(reply) > lowerCutoff and len(reply) < upperCutoff]
    # if the cutoffs leave nothing (e.g. all replies the same length), fall back to the full list
    return choice(midReplies if midReplies else replies)
## Generate some example rejoinders
print "Three possible rejoinders by Dostoyevsky on the subject of 'Russia':"
print "* " + " ".join(makeReply(['Russia'], FyodorDict, FyodorDictBack, FyodorPlaces))
print "* " + " ".join(makeReply(['Russia'], FyodorDict, FyodorDictBack, FyodorPlaces))
print "* " + " ".join(makeReply(['Russia'], FyodorDict, FyodorDictBack, FyodorPlaces))
Three possible rejoinders by Dostoyevsky on the subject of 'Russia':
Russia will be overwhelmed with darkness the earth will weep for its old gods ...
If this is true if Russia and her justice are such she may go forward with good cheer !
Evidently he was delighted to get hold of someone upon whom to vent his rage with the calmer men more gracious interpreters of the modern Sclav who like Ivan Tourguenieff were able to see Russia on purpose to get the proper instruments .
Generating conversations
Let's put it all together and generate a few brief conversations that start on some of the big issues. Of course, who knows where the conversation will lead!
## Create a conversation
topic = ['mankind']
conversationLength = 2
print "The two authors start on the subject of '" + topic[0] + "':\n"
reply = topic
for i in range(conversationLength):
    # Dostoyevsky speaks
    reply = makeReply(reply, FyodorDict, FyodorDictBack, FyodorPlaces)
    print '- FYODOR: "' + " ".join(reply) + '"'
    # Dickens speaks
    reply = makeReply(reply, DickensDict, DickensDictBack, DickensPlaces)
    print '- CHARLES: "' + " ".join(reply) + '"'
The two authors start on the subject of 'mankind':
- FYODOR: "The only gain of civilisation for mankind is the greater capacity for variety of sensations and absolutely nothing more ."
- CHARLES: "Well my child you used to complain with bitterness of the proneness of mankind to cheat him him invested with the dignity of Labour !"
- FYODOR: "And so many ages mankind had prayed with faith and fervor O Lord our God hasten Thy coming so many ages called upon Him that in His infinite mercy He came once more among men in that morning ."
- CHARLES: "As it pleased Heaven in its mercy to restore him so soon I should have great hope ."
On the subject of 'history':
- FYODOR: "Do you hear do you hear that majestic voice from the past century of our glorious history ?"
- CHARLES: "It was a history of the lives and trials of great criminals and the pages were soiled and thumbed with use ."
- FYODOR: "The nearest cab stand the trials of this life are even effaced from my memory ."
- CHARLES: "Crackit went to the window and lean his arms on the open window of a cab ."
On the subject of 'authors':
- FYODOR: "Well all the classical authors have been translated into all languages not of course on account of their actual literary merit but because of the great events !"
- CHARLES: "Those allied powers were considerably astonished when they arrived within a couple of pages they would have possessed the inestimable merit of being the most concise and faithful specimen of biography extant in the literature of any age or country ."
On the subject of 'Russia':
- FYODOR: "Let me say then it was not a question of showing that Pushkin is stupid or that Russia must be torn in pieces ."
- CHARLES: "The climate affected his dye it did very well in Russia but it was no go here ."
- FYODOR: "It 's night I am in my room with a candle and suddenly there are devils all over the place in the paling where you can take a board out he gets through no one sees ."
- CHARLES: "As was natural the head quarters and great gathering place of Monseigneur in London was Tellson 's Bank ."
On the subject of the 'soul':
- FYODOR: "Do you understand anything of my soul did this murder actually take place ?"
- CHARLES: "Do n't add new injuries to the long long list of injuries you have done anything to the contrary ."
- FYODOR: "That is just how it is with people who like Dmitri have never had anything to do with the matter and know nothing about it though ."
- CHARLES: "There 'll be Murder the matter too replied the hump backed man coolly if you do n't begin somebody else must ."
Conclusions
OK, clearly this is not going to open up new avenues of scholarship in the humanities. Pataphysics, maybe... But it is amazing how such simple Markov processes can generate often-convincing, occasionally-awesome text. Perhaps we shouldn't be surprised. Language is a formulaic creature: it has to be, otherwise we wouldn't be able to understand one another. A few simple formulae can go 80% of the way to mastering it. It's the remaining 20%, though, where things get insanely difficult. Then there's the question of how to teach computers to 'understand' a piece of text, whatever that might mean...
Thoughts on how to improve this simulator
- Source material appropriate for the purpose: my sense is that the training corpus isn't nearly large enough to generate convincing text, nor is it the right sort of material for generating speech. Most of our corpus involves third-person text about fictional characters, and the names of those characters continually turn up in the generated text. In short, we need more "I" and "you" sentences of the sort that turn up in conversations. The MegaHAL conversational simulator was partly trained on things like snippets of dialogue from scripts and popular quotations. In this case, perhaps we could use the authors' correspondence...?
- Reply creation: It's impossible to parse the subject of speech when said speech has been randomly generated by Markov processes, because there is no subject. Were this a real conversational simulator, I think we could do much better than selecting nouns at random. Perhaps we could parse named entities and concentrate on them. We might also include synonyms for each keyword, and it would be smart to interpret tenses too; perhaps we could run a lemmatiser over everything (see the sketch after this list).
- Convincing speech: I suspect that a little effort put into grammar would go a long way towards reducing the 'word-salad' feel of the text. Perhaps a handful of grammatical 'templates' to which generated sentences must conform.
- Reply selection: Lastly, I would like to make the reply selection function a little smarter. Perhaps it could select replies using a system that scores them for (1) grammatical likelihood and/or (2) 'interestingness'.
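On the lemmatiser idea mentioned above, here's a minimal sketch of what that normalisation step might look like, using NLTK's WordNetLemmatizer. This is a hypothetical add-on, not part of the simulator as it stands: collapsing inflected keywords to their base forms would let a keyword like 'trials' match sentences that mention 'trial'.
# A hypothetical keyword-normalisation step using NLTK's WordNet lemmatizer
# (a sketch of the idea only; not wired into the simulator above)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def normaliseKeywords(keywords):
    # collapse plurals and other inflections to their base (noun) forms,
    # so 'trials' and 'trial' become the same search key
    return list(set([lemmatizer.lemmatize(word) for word in keywords]))

print normaliseKeywords(['trials', 'trial', 'corners', 'mankind'])
# e.g. -> ['corner', 'trial', 'mankind'] (set order may vary)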