Numpy Gems, Part 2

Trying out something new here with a Jupyter notebook blog post. We’ll keep this short. Let’s see how it goes!

In this episode, we’ll be exploring random number generators.

Usually, you use psuedo-random number generators (PRNGs) to simulate randomness for simulations. In general, randomness is a great way of avoiding doing integrals because it’s cheaper to average a few things than integrate over the whole space, and things tend to have accurate averages after just a few samples… This is the Monte Carlo Method.

That said, since the priority is speed here, and the more samples, the better, we want to take as many samples as possible, so parallelism seems viable.

This occurs in lots of scenarios:

  • Stochastic simulations of physical systems for risk assessment
  • Machine learning experiments (e.g., to show a new training method is consistently effective)
  • Numerical estimation of integrals for scientific equations
  • Bootstrap estimation in statistics

For all of these situations, we also usually want replicatable studies.

Seeding is great for making the random PRNG sequence deterministic for one thread, but how do you do this for multiple threads?

import numpy as np
from multiprocessing import Pool
from scipy.stats import ttest_1samp

def something_random(_):
    return np.random.randn()

n = 2056
print("stddev {:.5f}".format(1 / np.sqrt(n)))

with Pool(4) as p:
    mu = np.mean(p.map(something_random, range(n)))

mu
stddev 0.02205
-0.03392958488974697

OK, so not seeding (using the system default of time-based seeding) gives us dependent trials, and that can really mess up the experiment and it prevents the very determinism we need!

np.random.seed(1)

n = 256
seeds = np.random.randint(2 ** 32, size=n)

def something_random(i):
    np.random.seed(seeds[i])
    return np.random.randn()

with Pool(8) as p:
    mu = np.mean(p.map(something_random, range(n)))
    
print(mu * np.sqrt(n))
-0.6038931772504026

The common solution I see for this is what we see above, or using i directly as the seed. It kind of works, in this case, but for the default numpy PRNG, the Mersenne Twister, it’s not a good strategy.

Here’s the full discussion in the numpy docs.

To short circuit to the “gem” ahead of time, the solution is to use the new API.

from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)
n = 2 ** 16
child_seeds = ss.spawn(n)

def something_random(s):
    rng = default_rng(s)
    return rng.normal()

with Pool(4) as p:
    mu = np.mean(p.map(something_random, child_seeds))
    
print(mu * np.sqrt(n))
-0.11130135587093562

That said, I think the fun part is in trying to break the old PRNG seeding method to make this gem more magical.

That is, the rest of this blog post is going to be trying to find non-randomness that occurs when you seed in a n invalid way.

# aperitif numpy trick -- get bits, fast!
def fastbits(n):
    nbytes = (n + 7) // 8 # == ceil(n / 8) but without using floats (gross!)
    return np.unpackbits(np.frombuffer(np.random.bytes(nbytes), np.uint8))[:n]
%%timeit
np.random.randint(2, size=(10 * 1000 * 1000))
39.5 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
fastbits(10 * 1000 * 1000)
2.29 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Attempt 1: will lining up random
# streams break a chi-square test?

n = 1000 * 1000 * 10

np.random.seed(1)
x1 = fastbits(n)
x2 = fastbits(n)
np.random.seed(2)
y1 = fastbits(n)

from scipy.stats import chisquare

def simple_pairwise(a, b):
    # do a simple pairwise check on equilength arrays dof = 4 - 1
    # build a contingency table for cases 00 10 01 11
    c = np.bincount(a + b * 2)
    return chisquare(c)

print('random', simple_pairwise(x1, x2))
print('seeds 1-2', simple_pairwise(x1, y1))
random Power_divergenceResult(statistic=6.848932, pvalue=0.07687191550956339)
seeds 1-2 Power_divergenceResult(statistic=10000003.551559199, pvalue=0.0)
# And now let's try another approach!

import tempfile
import os

def size(x):
    if os.path.isfile('/tmp/x.bz2'):
        os.remove('/tmp/x.bz2')
    with open('/tmp/x', 'wb') as f:
        f.write(x.tobytes())
    ! bzip2 -z /tmp/x
    return os.path.getsize('/tmp/x.bz2')

def rbytes(n):
    return np.frombuffer(np.random.bytes(n), np.uint8)
    
trials = 256
np.random.seed(trials)
n = 1000 * 1000
print('random', size(rbytes(n * trials)))

re_seeded = []
for i in range(trials):
    np.random.seed(i)
    re_seeded.append(rbytes(n))

a = np.concatenate(re_seeded)
print('seeds 0-255', size(a))
random 257131407
seeds 0-255 257135234

OK, so zip isn’t easily able to untangle any correlation between the streams (in which case, the compressed file of bits from random streams from sequential seeds would presumably be able to compress better).

We’ll need another approach.

There’s a lot of investment in PRNG quality tests.

However, we’re not interested in evaluating whether individual streams are random-looking, which they very well might be. Instead, we want to find out if there’s any dependence between streams. Above we just tried two tests for independence, but they didn’t work well (there’s a lot of ways to be dependent, including ways that don’t fail the chi squared test or bz2-file-size test).

That said, we can use a simple trick, which is to interleave streams from the differently-seeded PRNGs. If the streams are dependent, the resulting interleaved stream is not going to be a realistic random stream. This is from the TestU01 docs. Unfortunately, my laptop couldn’t really handle running the full suite of tests… Hopefully someone else can break MT for me!