***********************************************************************
1. Testing with PractRand
   1.A - Connecting your RNG to PractRand
      1.A.1 - Testing a program's output
      1.A.2 - Testing a file
      1.A.3 - Testing your RNG at full speed
      1.A.4 - Options NOT supported by PractRand
   1.B - Interpreting PractRand Test Results
      1.B.1 - RNG_test results for people in a hurry
      1.B.2 - RNG_test results in more detail
      1.B.3 - Notes on RNG_test p-values
      1.B.4 - TestResult objects from calling PractRand directly
      1.B.5 - Misc PractRand test result quality notes
      1.B.6 - Metatests performed on PractRand p-values
      1.B.7 - What specific failures might mean about a PRNG
2. General Discussion of RNG Testing
3. General Information on Interpreting Test Results
   3.A - Forms of test results
   3.B - False positives
   3.C - Knowing the false-positive rates is IMPORTANT
   3.D - Metatests
4. Brief Reviews of Non-PractRand Test Suites
***********************************************************************


***********************************************************************
1. Testing with PractRand
***********************************************************************

1.A - Connecting your RNG to PractRand

If you're testing an RNG that is not built into PractRand, then your
first challenge is getting the data to PractRand in a form that it can
test.  The three general ways to interface an external RNG to a tester
are dumping random data to a file, piping random data, or linking the
RNG to the tester.

1.A.1 - Testing a program's output:

If you have a program that generates raw random data and sends it to
standard output, then you can pipe that output into the PractRand RNG
testing tool.
If your program is called My_RNG then the command line looks like this
on Linux:
	My_RNG | ./RNG_test stdin
Or like this on Windows (from a command prompt):
	My_RNG | RNG_test stdin

If your program stops sending random data before the PractRand RNG
testing tool is done testing, then it may print "error reading input".
In that case you might want to add command line options like
"-tlmin 1KB", which tells RNG_test that you want it to start printing
output as soon as possible.  It will still print "error reading input"
when the input stops, but at least you'll get some test results, as
long as the input is at least 1024 bytes long.

If you know that your program produces data in 8 bit chunks then you
should change "stdin" to "stdin8".  Similar variants exist for 16, 32,
and 64 bit chunks.  If you don't know the format of your data, or none
of those make any more sense than the others, then just use "stdin".

1.A.2 - Testing a file:

If you have a file of random data named My_Data that you want to test,
technically PractRand doesn't support that.  But a file is easily
adapted to piped-in data using standard tools on most OSes.
On Linux, try:
	cat My_Data | ./RNG_test stdin
On Windows instead use:
	type My_Data | RNG_test stdin
The comments that apply to testing data from a program apply here as
well.  In fact the comment on "error reading input" becomes much more
important.  Note that reading random data from a file tends to be much
slower than getting it by any other method.

1.A.3 - Testing your RNG at full speed:

It's also possible to link your RNG directly with PractRand and/or
RNG_test.  This eliminates the overhead of piping the data around,
increasing performance significantly for fast RNGs.  However, it tends
to require more work, as you have to understand some of the subtleties
of PractRand's internals.  The easiest way may be to edit the "dummy"
RNG in RNG_test.cpp, but other ways are possible too.
1.A.4 - Options NOT supported by PractRand

PractRand supports testing ONLY packed binary data as input.  You can
not give it ASCII 0s and 1s, or ASCII hex, or anything else that
doesn't have 8 random bits per byte.

1.B - Interpreting PractRand Test Results

The next challenge is figuring out what the results mean.

1.B.1 - RNG_test results for people in a hurry

If you are using RNG_test, the simplest way to interpret test results
is to look for the word "FAIL" in the output.  It will appear on the
right-hand side and be obvious, like this:

  Test Name                     Raw       Processed     Evaluation
  BCFN(2+0,13-0)                R= -6.5   p =1-3.5e-3   unusual
  [Low4/16]BCFN(2+0,13-0)       R= +39.2  p =  4.4e-20    FAIL !!!
  [Low4/16]BCFN(2+1,13-1)       R= +28.8  p =  3.1e-14    FAIL !!
  [Low1/16]BCFN(2+0,13-1)       R= +27.1  p =  2.1e-13    FAIL !!
  ...and 50 test result(s) without anomalies

1.B.2 - RNG_test results in more detail

RNG_test produces a series of interim result summaries as it goes
along.  Each result summary has a header showing the RNG name, the
number of bytes tested, the time taken, and the RNG seed used (so you
can reproduce the results later if desired).  The body of the result
summary is a table showing all anomalous results, followed by a
statement of how many results were omitted from the table because they
were not anomalous.  If the table would have zero entries then the
table is skipped.  If no results were omitted then the count of
omitted results is skipped.

The threshold for what kinds of results are considered anomalous is
adjustable via command line parameters.  Adding "-p 1" will force all
results to be shown in the table regardless of how suspicious they
were or weren't, while adding "-e 0.001" will make it show only highly
suspicious results.  The default is "-e 0.1".

The table has four columns:
  "Test Name" - a name for the subtest the line corresponds to.
  "Raw" - not of much use to end users, but I use this a lot.
  "Processed" - either a p-value, or "pass", or "fail".
    However, see 1.B.3 for more details on RNG_test p-values.
  "Evaluation" - an English word or phrase describing the result.
    "FAIL" means that the RNG unambiguously failed that subtest, while
    "suspicious" means that the result should not happen often on a
    good RNG but should happen occasionally.

The formula used to assign evaluations is not adjustable from the
command line, and is a relatively complex function (not just a
hardwired p-value threshold).  The most extreme evaluations get an
extra space or two of indentation to make them easier to spot when
scanning a large chart by eye.  The evaluations treat all PractRand
p-values as having two symmetric error regions - both zero and one are
considered equally bad, despite the fact that PractRand tests are
designed to produce p-values near zero for most expected failure
circumstances.

1.B.3 - Notes on RNG_test p-values

When RNG_test prints a p-value it can be in any of the following
forms:
* normal p-value (e.g. "p = 0.188") - exactly what you'd expect.
* p-value near zero (e.g. "p = 2.3e-4") - scientific notation in a
  fairly standard form.  Note however that the exponent can reach
  extreme values (up to 9999) that do not fit in normal floating
  point formats.
* p-value near one (e.g. "p =1-2.7e-3") - one minus a number in
  scientific notation.  This format is a little less standard but
  still fairly self-explanatory, and it allows finer resolution to be
  displayed for numbers close to one.
* p-value equal to zero (e.g. "p = 0") - values VERY close to zero
  may get rounded to zero.
* p-value equal to one (e.g. "p = 1") - values VERY close to one may
  get rounded to one.
* p-values that are only very crudely approximated - these can be
  distinguished from normal p-values by the "~" between the "p" and
  the "=" in the Processed column of RNG_test output.  The current
  version should never produce these, but past versions did and
  future versions may.  Such results may be shown with fewer
  significant digits than normal p-values.
* "pass" or "fail" in place of a p-value - some subtests can't
  produce p-values.  If you must have a p-value, assume that "pass"
  corresponds to a p-value uniformly distributed between 0 and 1,
  while "fail" corresponds to a p-value of exactly zero.  Generally,
  PractRand subtests that produce this kind of output are very boring
  tests that never fail any but the most absurdly extreme RNGs
  (things like RNGs that produce only zeroes).  These results are
  necessary because some normal subtest results cannot be calculated
  at all under such absurd circumstances, but the test still needs
  some way to report those circumstances occurring.

An example results chart showing one of each category of result:

  Test Name         Raw       Processed     Evaluation
  DummyTest         R= +0.4   p~= 0.4       normal
  DC6-9x1Bytes-1    R= +0.4   p =1-1.1e-3   unusual
  Gap-16:!          R= +0.0   "pass"        normal
  Gap-16:A          R= +0.5   p = 0.511     normal
  Gap-16:B          R= +3.3   p = 9.9e-3    normal

1.B.4 - TestResult objects from calling PractRand directly

PractRand tests natively produce sets of results of type
PractRand::TestResult.  Each of these objects corresponds to one
subtest result.  The simplest thing to do with them is call their
get_pvalue() method.  See tests.h for details on their interface, or
RNG_test.cpp for examples of more complicated things you can do with
them.

1.B.5 - Misc PractRand test result quality notes:

The various tests within PractRand vary widely in the quality of the
p-values they produce - the tests used in the normal test set tend to
produce cleaner p-values than other tests in PractRand, because I
spent more effort on them.  In all cases, however, I have made an
effort to minimize any chance of the tests themselves contributing to
a false positive - when there is uncertainty about what the true
p-value should be, I attempt to report the least extreme possibility
within the range of uncertainty.
The worst p-value qualities tend to come from tests on short
sequences of non-power-of-2 lengths (usually 6 KB) or on very long
sequences (hundreds of gigabytes or more).

Do NOT assume that p-values from the same results summary have zero
correlation.  While the normal test set is *mostly* orthogonal, it is
not *completely* orthogonal.  The significant intra-correlations in
the recommended test set are:
* Gap-16:A is correlated with Gap-16:B.  The correlation is very
  strong for shorter sequences and slowly decreases as sequence
  length increases.  The Gap-16 results should have no significant
  correlation with the other standard tests.
* BCFN(X1,Y1) should correlate with BCFN(X2,Y2) if and only if X1 and
  X2 are similar - if they differ by 1 then the correlation is
  moderate, if they differ by 2 then the correlation is minor, if
  they differ by 3 then the correlation should be insignificant, and
  at 4 or more the correlation should generally be zero.  The Y
  values are generally not very relevant.
* DC6-9x1Bytes-1 is slightly correlated with BCFN(2,whatever) and
  insignificantly correlated with BCFN(3,whatever).  Higher-X BCFNs
  should have zero correlation with DC6-9x1Bytes-1.
* All of the correlations for pairs of subtests described above also
  exist for pairs of subtests that have the same names with matching
  [LowX/Y] prefixes added on.
* A subtest with a prefix of the form [LowX/Y] may correlate with
  subtests of the same name but different or no prefixes.  The degree
  of correlation is (at most) moderate among such cases in the
  standard test sets.

(Why, you ask, are there moderate and strong correlations present in
an optimized selection of tests?  Because the moderate and strong
correlations are between subtests that enjoy synergies allowing them
to be calculated together more efficiently - the efficiency bonuses
they enjoy are only slightly offset by the non-orthogonality
penalties.
The strongest correlation that is not a byproduct of a major
efficiency boost is the one between DC6-9x1Bytes-1 and
BCFN(2,whatever), which is only a slight correlation.)

The expanded test set is not very orthogonal at all - many of its
tests are strongly correlated with each other.

1.B.6 - Metatests performed on PractRand p-values:

Normal PractRand p-values are usually fine for performing metatests
on.  However, using large numbers of p-values in a single metatest
may be enough to amplify biases in the p-values to noticeable levels,
depending upon the particular type of metatest used and how tolerant
you are of bias in the result.  50 p-values should generally be okay,
5 should always be okay, and 50,000 should only be acceptable on a
rather limited range of tests and metatests (check empirically).  If
RNG_test printed a "~", or (for those linking with PractRand
directly) the TestResult had a type with BAD in the name (e.g.
TYPE_BAD_P), or the result was a pass/fail result, then metatests are
probably a bad idea.

Results summaries from the same run of RNG_test are based upon some
of the same RNG output and thus have some degree of correlation.
Normally two consecutive results summaries have 50% overlap,
producing a substantial degree of correlation, but that can vary with
the command line parameters used.

1.B.7 - What specific failures might mean about a PRNG

Failures on test names that begin with [Low1/??] (replacing the ??
with 8, 16, 32, or 64) typically mean low-bit issues - not enough
right-ward mixing (barrel shifts and right-shifts mainly, though also
table lookups and some more complex / slower math) in the PRNG.
However, there are other possibilities - it could be that the bit
focused on didn't need to be the lowest bit, it just had to be one
bit per word to emphasize the relationship at that particular
spacing.
Simultaneous failures on multiple unrelated tests often mean that the
PRNG's cycle length was exhausted, or that it otherwise changed
behavior to something awful near the end of testing.

Failures on short-ranged tests, especially DC6 parameterizations, are
typical of small chaotic PRNGs with insufficient mixing / avalanche.
Failures on FPF parameterizations can have a similar meaning.
However, both of those find bias in a wide variety of PRNGs,
sometimes for non-obvious reasons.

BCFN failures where the first parameter listed is high (say, 5 or
more) typically reflect a cyclic buffer being traversed, where
patterns in the contents of that buffer show up at specific spacings.
On the other hand, BCFN failures where the first parameter listed is
low are typically similar in meaning to a failure on a DC6
parameterization (see the previous paragraph).  Note that when I
refer to the value of the first BCFN parameter, I mean the evaluated
value listed in the output - for a test result listed under
"BCFN(2+3,13-1)" the first parameter would be 5 (2+3).

Gap-16 failures with p-values near 1 generally mean that the PRNG
output was too regular, typically reflecting a single-cycle PRNG
nearing the end of its cycle length.  Those with p-values near zero,
on the other hand, indicate that some values showed up at more
extreme spacings than expected, seen most often on single-cycle PRNGs
and sometimes on RC4-like PRNGs.

Failures on BRank suggest that in some way the PRNG output, or at
least part of it, was extremely linear - producible strictly by
xoring bits of previous PRNG output.  This is classically seen on
LFSRs / GFSRs, but it sometimes shows up on PRNGs that are *supposed*
to be chaotic but actually have some component that isn't really.

Most failures on the core tests (other than BRank) should have
started small and gotten steadily worse as more bytes were tested,
until failure was reached.
Thus, if "FAIL" was reported at 2**36 bytes, then in the previous
results summary (at 2**35 bytes) there should have been some kind of
anomaly on the same subtest, maybe a "suspicious" or "very
suspicious" test result.  If there was no preceding warning and the
failure was particularly extreme, that may mean that the RNG somehow
worsened its behavior shortly before the failure was reported,
similar to exhausting the cycle length, but perhaps in a way
detectable by only one of the tests in PractRand.

***********************************************************************
2. General Discussion of RNG Testing
***********************************************************************

The classic source of empirical statistical tests for random number
generators is "The Art of Computer Programming" by Donald Knuth.
Several empirical tests are described in that book.

The purpose of such tests is to distinguish data produced by a truly
random process from data produced by a non-random process.  They look
for patterns associated with the kinds of not-entirely-random systems
that the data might have come from instead.

There is a theoretical problem here - there are an infinite number of
such patterns and possible not-entirely-random systems, and if you
considered ALL of them then every block of data would be equally
likely to be random or non-random.  In practice, however, this is not
actually a problem - most of those infinitely many patterns are
infinitely complex, utterly impractical to test for, and don't really
make much sense in the context of math and hardware as we know them.
In practice, looking for patterns that tend to arise from non-random
processes of limited complexity and very finite state size, built out
of standard mathematical operations (particularly those that are most
efficient in binary logic), does a good job.

In empirical testing, a classic is the Diehard program by Marsaglia.
That program is badly outdated today and should not be used anymore,
but it was, so far as I know, the first real standardized battery of
such tests.  A standardized battery of tests is just a more clearly
defined (and usually more convenient) version of running multiple
tests on a block of data.  The basic results of a standardized
battery can be communicated quickly and effectively with just the
name and version of the battery and the length of the sequence tested
(though some batteries operate on fixed-size sequences and don't need
the last).  That is much easier to interpret (not to mention more
concise) than a listing of every major test name / version /
implementation / parameterization tried.

***********************************************************************
3. General Information on Interpreting Test Results
***********************************************************************

3.A - Forms of test results

Most such tests produce results in a form called a "p-value".
P-values are numbers between 0 and 1 that represent where a test's
result falls within the range of results expected on truly random
data (i.e. truly random data should produce p-values uniformly
distributed between 0 and 1).  The results can never truly lie
outside the range expected from truly random data, but they can lie
absurdly close to an edge of the expected range.  Tests are generally
constructed such that the pattern they are looking for produces
p-values very close to zero.  Results that are extremely close to 1
are frequently also considered suspicious.  Exactly how close to zero
or one might be considered a failure varies extremely widely - I have
seen sources that differed by more than 8 orders of magnitude on the
issue.

Some tests will produce flat pass/fail or pass/ambiguous/fail results
in place of p-values.
Any pass/fail result should in theory correspond to whether a
hypothetical p-value would be above or below some specific threshold
(the percentage of the time that truly random data would produce a
fail result).

Typically p-value calculations are not 100% accurate (i.e. not 100%
uniformly distributed between 0 and 1 on truly random data).  Instead
they have subtle and potentially complex patterns of deviation from
uniformity.  In some cases these inaccuracies are insignificant; in
other cases they are very important.  Similarly, a pass/fail result
will fail a percentage of truly random data that differs slightly
from the intended percentage.

A p-value can be converted to a pass/fail result simply by comparing
it to an arbitrary threshold.  Is p less than 0.001?  Then it failed.
That kind of thing.  The exact value of the arbitrary threshold
varies widely depending upon who picks it - some prefer thresholds of
0.05, others prefer thresholds of 0.0000000001.  This assumes
single-ended failure criteria, in which a p-value is only considered
bad near zero, not near one.  With double-ended failure criteria you
might have one threshold at 0.0001 (anything below that is considered
a failure) and another at 0.9999 (anything above that is considered a
failure).

3.B - False positives

The theoretical false-positive rate is the chance that a perfect RNG
would fail a test.  All meaningful evaluations of RNG quality have a
non-zero false-positive rate, though it can be very, very close to
zero in some cases.  In theory, if you're dealing with p-values
(under the idealized assumption that only values near zero are
failures, values near one are not), then the false-positive rate is
equal to the failure threshold.  A failure threshold of 0.001 means a
0.1% false-positive rate.  Unfortunately, that does not take into
account inaccuracies in the calculation of the p-value, which can
have large effects on false-positive rates.
(In PractRand, the target false-positive rate is set on a
per-results-summary basis rather than the usual per-p-value basis,
and is currently set so that approximately one in 300 million result
summaries should show a false positive - meaning that if 300 million
result summaries are produced from testing good PRNGs, approximately
1 of those summaries would include a "FAIL" evaluation.  At that
rate, if you have PractRand producing a results summary every 2
seconds, it would take an average of about 20 CPU-years to produce a
single false positive.  In practice, because of biases in PractRand's
p-values, the actual false-positive rate should be substantially
lower than that when working with the standard test set.)

3.C - Knowing the false-positive rates is IMPORTANT

The important point of test evaluation is that you need to be aware
of false-positive rates.  Any meaningful scheme for evaluating test
results must have a non-zero false-positive rate - a chance of
incorrectly producing a failure on good (i.e. produced by a truly
random process) data.  If the false-positive rate is significantly
greater than zero, then your test results may not mean much on their
own.  How much greater than zero is significant depends upon the cost
of making an error and how many chances to make an error you will
have.  If you are producing a million test results and want the
chance of any errors at all to be on the order of one in a million,
then you want a false-positive rate on the order of one in a trillion
or less.

The standard approach to dealing with false-positive rates that are
too high is to run tests repeatedly (on different data from the same
RNG) and ignore failures that did not occur repeatedly, with the
number of failures required to be considered significant adjusted for
the expected false-positive rate.  This is a convenient way to handle
the issue that often meets people's needs.
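The arithmetic behind that repeated-run adjustment is a binomial
tail.  The sketch below is illustrative only (PractRand does not do
this for you); it computes the chance of seeing at least k failures
in n independent re-runs, given a per-run false-positive rate:

```cpp
#include <cmath>

// P(at least k of n independent runs falsely fail), where each run
// has false-positive rate fp_rate.  Uses lgamma for the binomial
// coefficients so that larger n won't overflow.
double binomial_tail(int n, int k, double fp_rate) {
    double total = 0.0;
    for (int i = k; i <= n; i++) {
        double log_comb = std::lgamma(n + 1.0) - std::lgamma(i + 1.0)
                        - std::lgamma(n - i + 1.0);
        total += std::exp(log_comb + i * std::log(fp_rate)
                          + (n - i) * std::log1p(-fp_rate));
    }
    return total;
}
```

With a per-run rate of 0.001, for example, requiring at least 2
failures out of 5 runs drops the effective false-positive rate to
roughly 1e-5 - at the cost of running every test five times.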
However, there are pitfalls to this approach that may not be
intuitively obvious - having the tests run a variable number of times
can undermine reproducibility to a surprising degree, while running
them a constant number of times can slow things down more than you
might expect if your desired effective false-positive rate is
extremely low.  This sort of thing is a type of metatest.

(PractRand attempts to set its false-positive rate for "FAIL"
messages low enough to make it unlikely that any user will ever see a
false positive from it.  The effectiveness of this strategy is
limited by two things:
1. I did say "attempts" to set the false-positive rate that low.  In
   practice, for the standard test set, I think I've even succeeded.
   The expanded test set, and tests not used in either, don't
   necessarily meet the same standard though.
2. Users rightly pay attention to the number of less extreme result
   evaluations as well, as a kind of informal done-by-eye metatest,
   for which there is no quantified false-positive rate.)

3.D - Metatests

Sometimes, instead of treating a test's p-values as a final product,
someone will use them as random data to be fed into another set of
statistical tests.  Such a test, using other tests' p-values as
input, is referred to as a metatest.  P-values for data from good
RNGs are supposed to be uniformly distributed between 0 and 1, and
test results on data produced by different seeds should show no
correlation.  Thus p-values from a good test run on different
seedings of a good RNG should behave like the output of a good RNG.
Care must be taken to avoid using correlated (or at least strongly
correlated) test results this way, but that is generally a manageable
problem for a canny user.

Some test suites have a habit of completely hiding the p-values
produced directly by their basic statistical tests on the random
numbers and displaying metatest results instead.  But why would such
a thing be desirable?
Here are a few reasons, and some associated drawbacks of each:
* If the original tests' p-values have biases with known properties,
  this can be an effective way to remove those biases.
  - However, this can amplify errors if done incorrectly, and it is
    not necessarily the most efficient way to correct well-understood
    p-value biases.
* By combining multiple tests, the sensitivity of a test can be
  scaled up.  For some tests, this may be the *only* practical way to
  scale up their sensitivity.
  - However, in practice this is a very poor way of scaling up a
    test, so poor that in every case I have examined you would be at
    least as well off leaving the test unscaled or abandoning the
    test entirely.
* If there are multiple tests, we may not want to look at their
  individual results but at a combined result instead.
  - I have found extremely crude metatests to be more useful for this
    purpose than sophisticated metatests, due to the costs imposed by
    having to eliminate all correlation between the various base
    tests (sophisticated metatests generally require that all
    inter-p-value correlations be completely understood, which is
    generally only possible if all such correlations are
    indistinguishable from zero).

(PractRand does not encourage the use of metatests by end users;
metatests are used in the validation and calibration of PractRand
subtests though.)

***********************************************************************
4. Brief Reviews of Non-PractRand Test Suites
***********************************************************************

gjrand:
Very good.  As good as PractRand, perhaps even a hair better.  I've
only used it on Linux; I suspect a Windows build would be difficult.
Handles all types of RNGs well in my experience.  DO NOT RUN MULTIPLE
TESTS AT ONCE - it will silently fail, producing incorrect results.
I pipe my data in to it; I think that's the way it's intended to be
used.

TestU01:
Good, but not as good as PractRand or gjrand in my experience.
Difficult to build on Windows.  Slightly worse on chaotic PRNGs than
non-chaotic PRNGs.  I recommend its SmallCrush, Crush, and BigCrush
batteries of tests.  I link my PRNGs with it, but I think you can
give it files or pipe in data too.

RaBiGeTe:
If the new version is out yet, it's okay (I've been using a
pre-release copy) - not as good as TestU01 or PractRand or gjrand,
but sometimes worth using.  Windows only; recent versions are
closed-source.  Does not need many bits - a big bonus on slow PRNGs.
Much worse on chaotic PRNGs than non-chaotic PRNGs; it tends to work
best in my experience on LCG-like PRNGs.  Lots of bit-level tests,
maybe good for slow non-byte-oriented PRNGs?  Numerous minor bugs and
presentation issues, but no show-stoppers.  I compile my PRNGs into
.dll files for it, but you can also give it files.

Dieharder:
Not good, but parts of it show promise.  *nix-only.  I *strongly*
recommend using a customized set of tests, never the -a option.  I
recommend using the dab_ tests and only those tests.  dab_monobit2 is
the best test; dab_monobit2 + dab_dct is the best pair of tests.
Adjust their sequence lengths - max out the possible sequence length
on dab_monobit2 (it's a 32 bit signed value IIRC).  Beware of false
positives.  I pipe in my PRNG output, but you can also link with
Dieharder (it has a viral license for linking though, IIRC).

NIST STS:
Don't bother.  No, seriously, don't use this.  You have been warned.
Beware of false positives.

Diehard:
Don't bother.  It's of historic value only - very badly outdated.