Tuesday, November 18, 2014

A Simple Example

In a recent blog, I suggested that in principle the Erdős–Kac Theorem should have a complementary application with respect to the distribution of prime numbers where the normal (Gaussian) distribution can be used to explain behaviour. To demonstrate this important aspect, we take repeated samples (of same size) within a relatively restricted region of the number system, where changes in the average gap as between primes is so small as to be discounted.

So starting with the n = 1,000,000,000 I took 100 samples of size t = 1000 up to n = 1,000,100,000.

For convenience in identifying the number of primes in each sample, I took them in strict sequence with the accompanying values listed below. As the primes are themselves distributed in a random fashion this would seem permissible in this instance.
Alternatively - at least in theory - 100 repeated random samples of 1000 (with replacement) could be taken within the same range i.e. 100,000,000 to 100,100,000. However in practice this would be very difficult.

So the first row for example represents the 10 samples in sequence from 100,000,000 to 100,010,000 with the last row, for example representing the final 10 sample values from 100,090,000 to 100,100,000.

 54 56 57 55 57 61 57 56 47 51 61 55 57 49 43 54 56 43 58 54 51 56 51 49 43 54 56 43 58 54 60 56 55 57 54 52 59 56 51 56 49 43 59 64 55 63 62 53 49 51 63 54 40 54 51 56 52 54 42 57 53 73 55 50 54 53 61 49 52 56 54 59 44 57 50 56 56 53 52 54 57 56 52 54 63 43 54 52 51 48 49 55 57 52 54 52 56 48 56 66

Now in general terms, the mean number of primes in each sample can be approximated as t/log n

= 1000/18.42 = 54.29 (approx)

This equates well with actual value averaged over the 100 sample results above = 54.11.

Much more problematic however is the provision of a general formula to approximate the standard deviation (for all values of n).

Though I experienced doubts on several occasions with respect to my initial "hunch", repeated empirical testing seems to suggest it as perhaps the simplest and best estimate,

i.e. √{t/(2log n)}

This would give the  standard deviation as 5.21 and compares well enough with the estimated standard deviation (based on the 100 sample values) = 5.46. This does not of course constitute a proof, and indeed a much greater degree of sampling would be required to truly establish it as the most likely estimate.

However in principle by now using the normal distribution, we could estimate the probabilities associated for example of sample values lying within any prescribed distance from the mean (on both sides).

For example we would expect for the above a little in excess of 2/3 of sample values to lie within 1 standard deviation of the mean value.

This would suggest therefore the probability that 2/3 of sample values (for frequency of primes occurring) would lie in the range of 49 - 59 (approx).

Addendum (5/3/2016).  Having returned to this issue in recent days, I feel I can bring more clarity to a situation that I did not eally feel had been properly dealt with, first time around.

Though I was hoping that the standard deviation would correspond to √{t/(log n)}, I was led - largely through the empirical evidence of a small sample - to adjust it somewhat to fit the data.

However on reflection this was not warranted. Even just a few stray "outliers" with respect to this data would have a vey large influence on the standard deviation. Therefore it was unrealistic to expect that the empirical example would fit in with theoretical explanations.

The  Erdős–Kac Theorem states that if ω(n) is the number of (distinct) prime factors of n, the probability distribution of

$\frac{\omega(n) - \log\log n}{\sqrt{\log\log n}}$

is the normal distribution.

In like manner I am suggesting that if  ω(n) is the number of primes in a sample of size t. (i.e. where samples are taken in the region of n) the probaibility distribution of

ω(n) - (t/log t)

√{(t/logt)}

is the normal distribution.

The Erdős–Kac Theoremwould suggest that where the average number of (distinct) primes is approximately 100 (with standard deviation 10), then we would expect just alittle more than 2/3 of all factors to lie within 1 standard deviation of 100 either side of the mean (i.e. between 90 and 110).

In like manner if the average mean value of the number of primes in each sample is 100 (where samples are taken in the region of n) then we would likewise expect again that in a little more than 2/3 of samples, the number of primes would lie within 1 standard deviation (i.e. 10) of  100 (i.e. between 90 and 110).