How to roll an imaginary 6-sided die in your head

I don’t know Perl from mindfuck, but I wrote a Python script that downloads a book from Project Gutenberg and runs the same analysis:

from urllib import request
from nltk import word_tokenize
import numpy as np
from scipy.stats import chi2
from tabulate import tabulate

# get book from Project Gutenberg
url = "http://www.gutenberg.org/files/2554/2554-0.txt" # Crime and Punishment
#url = "https://www.gutenberg.org/files/1342/1342-0.txt" # Pride and Prejudice
#url = "https://www.gutenberg.org/files/84/84-0.txt" # Frankenstein
#url = "https://www.gutenberg.org/files/98/98-0.txt" # A Tale of Two Cities

response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
tokens = word_tokenize(raw)

# filter out non-alpha characters and make all lowercase
filt_words = [w.lower() for w in tokens if w.isalpha()]
# filter out all words length 2 or less
filt_words = [w for w in filt_words if len(w)>2]
    
mod9List = []
mod6List = []

for word in filt_words:
    # letter sum with a=1 ... z=26 (ord('a') is 97);
    # non-ASCII letters that pass isalpha() get values above 26
    letter_sum = sum(ord(ch) - 96 for ch in word)

    if letter_sum % 9 < 6: # only keep 0-5, discard 6-8
        mod9List.append(letter_sum % 9)

    mod6List.append(letter_sum % 6)

# convert to np arrays
mod9List = np.array(mod9List)
mod6List = np.array(mod6List)

## calculate chi-squared for each list
# total number of numbers in each list
N9 = len(mod9List)
N6 = len(mod6List)

# make similar array using numpy.random.randint
randArray = np.random.randint(0,6,N6)

# expected number of rolls for each number if perfectly fair and random
expect9 = N9/6
expect6 = N6/6

# count actual number of times each number appears
mod9Num = np.zeros(6)
mod6Num = np.zeros(6)
randNum = np.zeros(6)
for I in range(6):
    mod9Num[I] = np.count_nonzero(mod9List==I)
    mod6Num[I] = np.count_nonzero(mod6List==I)
    randNum[I] = np.count_nonzero(randArray==I)

# sum of squares of [difference between actual and ideal] divided by ideal
SSE6 = np.sum(np.square(mod6Num-expect6)/expect6)
SSE9 = np.sum(np.square(mod9Num-expect9)/expect9)
SSErand = np.sum(np.square(randNum-expect6)/expect6)
SSE = [SSE6,SSE9,SSErand]
methodName = ['mod(N,6) method','mod(N,9) method','rand function']

# calculate confidence critical value
confLim = 0.9999
confVal = chi2.ppf(confLim,5)

# print results to screen as table
arr1 = [methodName[0]]
arr1.extend(mod6Num.tolist())
arr2 = [methodName[1]]
arr2.extend(mod9Num.tolist())
arr3 = [methodName[2]]
arr3.extend(randNum.tolist())
tableDat = [arr1,arr2,arr3]
print('Book has {:d} words'.format(len(filt_words)))
print('')
# column k (a die face) holds the count for remainder k-1
print(tabulate(tableDat, headers = ['method',1,2,3,4,5,6]))
print('')
for I in range(len(SSE)):
    print('Normalized sum of squared error (SSE) for {} = {:5g}'.format(methodName[I],SSE[I]))
print('Critical value of chi-squared distribution for 99.99% confidence level = {:5g}'.format(confVal))
print('')
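for I in range(len(SSE)):
    if SSE[I]>confVal:
        printstr = '{} SSE is greater than critical value, {:5g}% confidence level that numbers are not distributed randomly'
    else:
        # not exceeding the critical value only means no evidence of non-uniformity was found
        printstr = '{} SSE is less than critical value, {:5g}% confidence level that numbers are distributed randomly'
    print(printstr.format(methodName[I],100*confLim))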

The thing is, I end up with wildly non-random distributions. For Crime and Punishment:

method               1      2      3      4      5      6
---------------  -----  -----  -----  -----  -----  -----
mod(N,6) method  24500  43816  23020  29030  23315  15064
mod(N,9) method  15804  20820  14081  14960  19924  12867
rand function    26610  26573  26465  26457  26359  26281

With the following chi-squared analysis:

Normalized sum of squared error (SSE) for mod(N,6) method = 17510
Normalized sum of squared error (SSE) for mod(N,9) method = 3183.74
Normalized sum of squared error (SSE) for rand function = 2.92951
Critical value of chi-squared distribution for 99.99% confidence level = 25.7448

mod(N,6) method SSE is greater than critical value, 99.99% confidence level that numbers are not distributed randomly
mod(N,9) method SSE is greater than critical value, 99.99% confidence level that numbers are not distributed randomly
rand function SSE is less than critical value, 99.99% confidence level that numbers are distributed randomly

Both are horrible compared to just using numpy’s randint (the “rand function” row), and the mod(N,6) method is even worse than the mod(N,9) method. I got similar results from Pride and Prejudice, Frankenstein, and A Tale of Two Cities.
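
One thing that might explain part of the spike in column 2: a given word always produces the same number, so the most frequent words can dominate the counts. Here’s a minimal sketch using a hand-picked list of common English words (the list is my assumption, not something counted from the book) and the same a=1 ... z=26 letter sum:

```python
from collections import Counter

# Hypothetical sample of very common English words (hand-picked for
# illustration, not measured from the book).
common_words = ["the", "and", "that", "was", "you", "her",
                "not", "had", "but", "his", "with", "for"]

def letter_sum(word):
    """Sum of letter positions with a=1 ... z=26, as in the script above."""
    return sum(ord(ch) - 96 for ch in word)

# remainder mod 6 for each word; the table's column k is remainder k-1
mod6 = {w: letter_sum(w) % 6 for w in common_words}
tally = Counter(mod6.values())

for w in common_words:
    print(w, letter_sum(w), mod6[w])
print(tally)
```

Eight of these twelve words land on remainder 1, which is the column labelled 2 in the table. That doesn’t prove anything about the full text, but it hints that the skew may come from word frequency rather than from a mistake in the counting.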

I wonder if I did something wrong in my calculation method? I ran the chi-squared test on your numbers, and both the %6 and %9 methods passed the test quite well, with normalized sums of squared errors of 6.5 and 2.4 respectively. That’s within the range I get from the rand function, which usually gives values between 2 and 6.
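
As a sanity check on the calculation itself, the counts in the table can be fed back through the same formula in plain Python, with no numpy involved. This is only a sketch: the counts are copied by hand from the Crime and Punishment table above, and chi_squared_uniform is a helper name I made up.

```python
# Counts copied from the Crime and Punishment table above.
counts = {
    "mod(N,6) method": [24500, 43816, 23020, 29030, 23315, 15064],
    "mod(N,9) method": [15804, 20820, 14081, 14960, 19924, 12867],
    "rand function":   [26610, 26573, 26465, 26457, 26359, 26281],
}

def chi_squared_uniform(observed):
    """Pearson chi-squared statistic against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

stats = {name: chi_squared_uniform(obs) for name, obs in counts.items()}
for name, value in stats.items():
    print("{}: {:.6g}".format(name, value))
```

These come out to about 17510, 3183.74 and 2.92951, matching the output above, so the arithmetic in the script looks consistent with its own counts. Calling scipy.stats.chisquare on one row of counts should return the same statistic plus a p-value.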

Anyway, since I really went overkill on this, I uploaded it to GitHub here if anyone is interested in tinkering with it.
