{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numpy Gems, Part 2\n",
    "\n",
    "Trying out something new here with a Jupyter notebook blog post. We'll keep this short. Let's see how it goes!\n",
    "\n",
    "In this episode, we'll be exploring random number generators.\n",
    "\n",
    "Usually, you use psuedo-random number generators (PRNGs) to simulate randomness for simulations. In general, randomness is a great way of avoiding doing integrals because it's cheaper to average a few things than integrate over the whole space, and things tend to have accurate averages after just a few samples... This is the [Monte Carlo Method](https://en.wikipedia.org/wiki/Monte_Carlo_method).\n",
    "\n",
    "That said, since the priority is speed here, and the more samples, the better, we want to take as many samples as possible, so parallelism seems viable.\n",
    "\n",
    "This occurs in lots of scenarios:\n",
    "\n",
    "* Stochastic simulations of physical systems for risk assessment\n",
    "* Machine learning experiments (e.g., to show a new training method is consistently effective)\n",
    "* Numerical estimation of integrals for scientific equations\n",
    "* Bootstrap estimation in statistics\n",
    "\n",
    "For all of these situations, we also usually want replicatable studies.\n",
    "\n",
    "Seeding is great for making the random PRNG sequence deterministic for one thread, but how do you do this for multiple threads?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stddev 0.02205\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "-0.03392958488974697"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from multiprocessing import Pool\n",
    "from scipy.stats import ttest_1samp\n",
    "\n",
    "def something_random(_):\n",
    "    return np.random.randn()\n",
    "\n",
    "n = 2056\n",
    "print(\"stddev {:.5f}\".format(1 / np.sqrt(n)))\n",
    "\n",
    "with Pool(4) as p:\n",
    "    mu = np.mean(p.map(something_random, range(n)))\n",
    "\n",
    "mu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so not seeding (using the system default of time-based seeding) gives us dependent trials, and that can really mess up the experiment and it prevents the very determinism we need!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-0.6038931772504026\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(1)\n",
    "\n",
    "n = 256\n",
    "seeds = np.random.randint(2 ** 32, size=n)\n",
    "\n",
    "def something_random(i):\n",
    "    np.random.seed(seeds[i])\n",
    "    return np.random.randn()\n",
    "\n",
    "with Pool(8) as p:\n",
    "    mu = np.mean(p.map(something_random, range(n)))\n",
    "    \n",
    "print(mu * np.sqrt(n))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The common solution I see for this is what we see above, or using `i` directly as the seed. It kind of works, in this case, but for the default numpy PRNG, the Mersenne Twister, it's not a good strategy.\n",
    "\n",
    "[Here's the full discussion](https://docs.scipy.org/doc/numpy/reference/random/parallel.html#seedsequence-spawning) in the numpy docs.\n",
    "\n",
    "To short circuit to the \"gem\" ahead of time, the solution is to use the new API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-0.11130135587093562\n"
     ]
    }
   ],
   "source": [
    "from numpy.random import SeedSequence, default_rng\n",
    "\n",
    "ss = SeedSequence(12345)\n",
    "n = 2 ** 16\n",
    "child_seeds = ss.spawn(n)\n",
    "\n",
    "def something_random(s):\n",
    "    rng = default_rng(s)\n",
    "    return rng.normal()\n",
    "\n",
    "with Pool(4) as p:\n",
    "    mu = np.mean(p.map(something_random, child_seeds))\n",
    "    \n",
    "print(mu * np.sqrt(n))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That said, I think the fun part is in trying to break the old PRNG seeding method to make this gem more magical.\n",
    "\n",
    "That is, the rest of this blog post is going to be trying to find non-randomness that occurs when you seed in a n invalid way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# aperitif numpy trick -- get bits, fast!\n",
    "def fastbits(n):\n",
    "    nbytes = (n + 7) // 8 # == ceil(n / 8) but without using floats (gross!)\n",
    "    return np.unpackbits(np.frombuffer(np.random.bytes(nbytes), np.uint8))[:n]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "39.5 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "np.random.randint(2, size=(10 * 1000 * 1000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.29 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "fastbits(10 * 1000 * 1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "random Power_divergenceResult(statistic=6.848932, pvalue=0.07687191550956339)\n",
      "seeds 1-2 Power_divergenceResult(statistic=10000003.551559199, pvalue=0.0)\n"
     ]
    }
   ],
   "source": [
    "# Attempt 1: will lining up random\n",
    "# streams break a chi-square test?\n",
    "\n",
    "n = 1000 * 1000 * 10\n",
    "\n",
    "np.random.seed(1)\n",
    "x1 = fastbits(n)\n",
    "x2 = fastbits(n)\n",
    "np.random.seed(2)\n",
    "y1 = fastbits(n)\n",
    "\n",
    "from scipy.stats import chisquare\n",
    "\n",
    "def simple_pairwise(a, b):\n",
    "    # do a simple pairwise check on equilength arrays dof = 4 - 1\n",
    "    # build a contingency table for cases 00 10 01 11\n",
    "    c = np.bincount(a + b * 2)\n",
    "    return chisquare(c)\n",
    "\n",
    "print('random', simple_pairwise(x1, x2))\n",
    "print('seeds 1-2', simple_pairwise(x1, y1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "random 257131407\n",
      "seeds 0-255 257135234\n"
     ]
    }
   ],
   "source": [
    "# Ok... not so easy, clearly dependence is not just \"pointwise\"\n",
    "# between streams but across streams... Maybe a generic\n",
    "# compression algorithm will notice the difference if we just\n",
    "# appended\n",
    "\n",
    "import tempfile\n",
    "import os\n",
    "\n",
    "def size(x):\n",
    "    if os.path.isfile('/tmp/x.bz2'):\n",
    "        os.remove('/tmp/x.bz2')\n",
    "    with open('/tmp/x', 'wb') as f:\n",
    "        f.write(x.tobytes())\n",
    "    ! bzip2 -z /tmp/x\n",
    "    return os.path.getsize('/tmp/x.bz2')\n",
    "\n",
    "def rbytes(n):\n",
    "    return np.frombuffer(np.random.bytes(n), np.uint8)\n",
    "    \n",
    "trials = 256\n",
    "np.random.seed(trials)\n",
    "n = 1000 * 1000\n",
    "print('random', size(rbytes(n * trials)))\n",
    "\n",
    "re_seeded = []\n",
    "for i in range(trials):\n",
    "    np.random.seed(i)\n",
    "    re_seeded.append(rbytes(n))\n",
    "\n",
    "a = np.concatenate(re_seeded)\n",
    "print('seeds 0-255', size(a))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so zip isn't easily able to untangle any correlation between the streams (in which case, the compressed file of bits from random streams from sequential seeds would presumably be able to compress better).\n",
    "\n",
    "We'll need another approach.\n",
    "\n",
    "There's a lot of investment in PRNG quality tests.\n",
    "\n",
    "However, we're not interested in evaluating whether _individual_ streams are random-looking, which they very well might be. Instead, we want to find out if there's any dependence between streams. Above we just tried two tests for independence, but they didn't work well (there's a lot of ways to be dependent, including ways that don't fail the chi squared test or bz2-file-size test).\n",
    "\n",
    "That said, we can use a simple trick, which is to interleave streams from the differently-seeded PRNGs. If the streams are dependent, the resulting interleaved stream is not going to be a realistic random stream. This is from the [TestU01](https://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf) docs. Unfortunately, my laptop couldn't really handle running the full suite of tests... Hopefully someone else can break MT for me!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}