→Katz family of distributions: Moving text
: <math> \mathrm{ CI } = t \left( \frac { P(x) ( 1 - P(x) ) } { N } \right)^{ 1 / 2 } </math>

where ''CI'' is the confidence interval, ''t'' is the critical value taken from the t distribution and ''N'' is the total sample size.

===Katz family of distributions===

Katz in 1963<ref name=Katz1963>Katz L (1963) Unified treatment of a broad class of discrete probability distributions. ''in'' Proceedings of the International Symposium on Discrete Distributions. Montreal</ref> proposed a family of distributions (the [[Katz family]]) with 2 parameters ( ''w''<sub>1</sub>, ''w''<sub>2</sub> ). This family of distributions includes the Bernoulli, Pascal and Poisson distributions. The mean and variance of a Katz distribution are

: <math> m = \frac { w_1 } { 1 - w_2 } </math>

: <math> s^2 = \frac { w_1 } { ( 1 - w_2 )^2 } </math>

where ''m'' is the mean and ''s''<sup>2</sup> is the variance. If the population obeys a Katz distribution then ''s''<sup>2</sup> = ''m'' / ( 1 - ''w''<sub>2</sub> ), so the coefficients of Taylor's law are log ''a'' = - log ( 1 - ''w''<sub>2</sub> ) and ''b'' = 1.

===Sampling size estimators===
Taylor's law is an empirical law in ecology that relates the between sample variance in density to the overall mean density of a sample of organisms in a study area.[1] Taylor described this relationship in 1961[2] and it has been found to be true for many species since.[3] It has also been found to be true in other areas including transmission of infectious diseases, human sexual behavior, childhood leukemia, cancer metastases, blood flow heterogeneity, genomic distributions of single nucleotide polymorphisms and gene structures.[4][5][6] This law is also known in the literature as the power law (in the biological literature) or the fluctuation scaling law (in the physics literature). Despite its apparent widespread utility no satisfactory theoretical basis for this law has been proposed to date.[7]
The first to propose an empirical relationship of this type between the mean and variance was Smith in 1938 while studying crop yields.[8] Smith proposed the relationship
where Vₓ is the variance of yield for plots of x units, V₁ is the variance of yield per unit area and x is the size of the plots. The slope (b) is the index of heterogeneity. The value of b in this relationship lies between 0 and 1. Where the yields are highly correlated b tends to 0; when they are uncorrelated b tends to 1.
Fracker and Brischle in 1944[9] and Hayman and Lowe in 1961[10] independently described relationships between the mean and variance that are now known as Taylor's law.
The law itself is named after the ecologist L. R. Taylor (1924–2007). The name 'Taylor's law' was coined by Southwood in 1966.[11] Taylor's original name for this relationship was the law of the mean.
It appears that Taylor's law is an example of Stigler's law of eponymy.
In symbols
: <math> s_i^2 = a m_i^b </math>
where sᵢ² is the variance of the density of the ith sample, mᵢ is the mean density of the ith sample and a and b are constants.
In logarithmic form
: <math> \log s_i^2 = \log a + b \log m_i </math>
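The logarithmic form, log s² = log a + b log m, is a straight line, so a and b can be estimated by ordinary least squares on the log-transformed sample means and variances. The following is a minimal sketch in Python; the function name and the data are hypothetical, and the data are generated exactly from s² = 2 m^1.5 so that the fit should recover values close to a = 2 and b = 1.5.

```python
import math

def fit_taylor_law(means, variances):
    """Ordinary least squares fit of log10(s^2) = log10(a) + b * log10(m).

    Returns (a, b). Samples with a zero mean or zero variance must be
    dropped by the caller because their logarithm is undefined.
    """
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = 10 ** (y_bar - b * x_bar)      # back-transformed intercept
    return a, b

# Hypothetical sample means, with variances built from s^2 = 2 * m^1.5
means = [1.0, 4.0, 9.0, 16.0, 25.0]
variances = [2.0 * m ** 1.5 for m in means]
a, b = fit_taylor_law(means, variances)
print(round(a, 6), round(b, 6))
```

Because both log s² and log m are estimated with error, this ordinary least squares slope is subject to the underestimation discussed in the text.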
A refinement in the estimation of the slope b has been proposed by Rayner.[12]
where r is the Pearson product-moment correlation coefficient between log(s²) and log m, f is the ratio of sample variances in log(s²) and log m and φ is the ratio of the errors in log(s²) and log m.
Ordinary least squares regression assumes that φ = ∞. This tends to underestimate the value of b because the estimates of both log(s²) and log m are subject to error.
An extension of Taylor's law has been proposed by Ferris et al. when multiple samples are taken[13]
where s2 and m are the variance and mean respectively, b, c and d are constants and n is the number of samples taken. To date this proposed extension has not been verified to be as applicable as the original version of Taylor's law.
Slope values (b) significantly > 1 indicate clumping of the organisms.
In Poisson-distributed data b = 1.[14] If the population follows a lognormal or gamma distribution then b = 2.
For populations that are experiencing constant per capita environmental variability, the regression of log variance on log mean abundance should give a line with b = 2.
Most populations that have been studied have b < 2 (usually 1.5–1.6) but values of 2 have been reported.[5] Occasionally cases with b > 2 have been reported.[15] b values below 1 are uncommon but have also been reported ( b = 0.93 ).[16]
The origin of the slope (b) in this regression remains unclear. Two hypotheses have been proposed to explain it. One suggests that b arises from the species behavior and is a constant for that species. The alternative suggests that it is dependent on the sampled population. Despite the considerable number of studies carried out on this law (>1000) this question remains open.
It is known that both a and b are subject to change due to age-specific dispersal, mortality and sample unit size.[17]
This law may be a poor fit if the values are small. For this reason an extension to Taylor's law has been proposed by Hanski which improves the fit of Taylor's law at low densities.[18]
Binomial sampling is popular where there is a large number of units (crops, trees) to be examined and where counts of individuals of interest (typically insects) may be difficult (frequently because the insects fly away before they can be accurately counted).
A form of Taylor's law applicable to binary sampling (presence/absence of at least one individual in a sample unit) has been proposed.[19] In a binomial distribution the theoretical variance is
: <math> s^2 = np ( 1 - p ) </math>
where s² is the variance, n is the sample size and p is the proportion of sample units with at least one individual. The proposed binary form of Taylor's law is
: <math> \mathrm{var}_{obs} = a ( \mathrm{var}_{bin} )^b </math>
where var_obs is the observed variance and var_bin is that expected from the binomial distribution. When both a and b are equal to 1, then a random spatial pattern is suggested and is best described by the binomial distribution. When b = 1 and a > 1, there is overdispersion with no dependence on the mean incidence (p). When both a and b are > 1, the degree of aggregation varies with p.
Because of the ubiquitous occurrence of Taylor's law in biology it has found a variety of uses some of which are listed here.
It has been recommended based on simulation studies[20] that in applications using Taylor's law that:
(1) the total number of organisms studied be >15
(2) the minimum number of groups of organisms studied be >5
(3) the density of the organisms should vary by at least 2 orders of magnitude within the sample
It is commonly assumed (at least initially) that a population is randomly distributed in the environment. If a population is randomly distributed then the mean (m) and variance (s²) of the population are equal and the proportion of samples that contain at least one individual (p) is
: <math> p = 1 - e^{-m} </math>
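For a randomly (Poisson) distributed population the probability of an empty sample unit is exp(−m), so the expected proportion of occupied units is p = 1 − exp(−m). A short Python sketch (the function name is illustrative) tabulates this curve, which is the reference line when p is plotted against m:

```python
import math

def poisson_occupancy(m):
    """Expected proportion of sample units with at least one individual
    when counts are Poisson distributed with mean m."""
    return 1.0 - math.exp(-m)

# Occupancy rises steeply at low density and saturates at high density.
for m in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(m, round(poisson_occupancy(m), 3))
```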
When a species with a clumped pattern is compared with one that is randomly distributed with equal overall densities, p will be less for the species having the clumped distribution pattern. Conversely, when comparing a uniformly and a randomly distributed species at equal overall densities, p will be greater for the uniformly distributed population, because a uniform pattern leaves fewer sample units empty. This can be graphically tested by plotting p against m. Wilson and Room developed a binomial model that incorporates Taylor's law.[21] The basic relationship is
where the log is taken to the base e.
Including Taylor's law this relationship becomes
The dispersion parameter (k)[22] is
: <math> k = \frac{m^2}{s^2 - m} </math>
where m is the sample mean and s² is the variance. If k⁻¹ is > 0 the population is considered to be aggregated; if k⁻¹ = 0 the population is considered to be random; and if k⁻¹ is < 0 the population is considered to be uniformly distributed.
Wilson and Room, assuming that Taylor's law applied to the population, gave an alternative estimator for k:[21]
: <math> k = \frac{m}{a m^{b - 1} - 1} </math>
where a and b are the constants from Taylor's law.
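The two estimators of k are linked by substituting the Taylor's law prediction s² = a m^b into the moment estimate k = m² / (s² − m). The Python sketch below illustrates this; the function names are hypothetical, and the classification simply follows the sign of 1/k as described in the text.

```python
def k_moments(m, s2):
    """Method-of-moments estimate k = m^2 / (s^2 - m).

    Returns None when s^2 == m, where 1/k = 0 and k is undefined
    (the random, Poisson-like case).
    """
    if s2 == m:
        return None
    return m * m / (s2 - m)

def k_taylor(m, a, b):
    """Same estimate with s^2 replaced by the Taylor's law value a * m^b."""
    return k_moments(m, a * m ** b)

def classify(m, s2):
    """Classify the spatial pattern from the sign of 1/k."""
    inv_k = (s2 - m) / (m * m)
    if inv_k > 0:
        return "aggregated"
    if inv_k < 0:
        return "uniform"
    return "random"

print(k_moments(4.0, 12.0), k_taylor(4.0, 1.5, 1.5), classify(4.0, 12.0))
```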
Jones[23] using the estimate for k above along with the relationship Wilson and Room developed for the probability of finding a sample having at least one individual[21]
derived an estimator for the probability of a sample containing x individuals per sampling unit. Jones's formula is
where P(x) is the probability of finding x individuals per sampling unit, k is estimated from the Wilson and Room equation and m is the sample mean. The probability of finding zero individuals P(0) is estimated with the negative binomial distribution
: <math> P(0) = \left( 1 + \frac{m}{k} \right)^{-k} </math>
Jones also gives confidence intervals for these probabilities.
: <math> \mathrm{CI} = t \left( \frac{P(x) ( 1 - P(x) )}{N} \right)^{1/2} </math>
where CI is the confidence interval, t is the critical value taken from the t distribution and N is the total sample size.
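As an illustration, the zero-class probability P(0) = (1 + m/k)^(−k) and the interval CI = t · sqrt(P(x)(1 − P(x)) / N) can be computed as follows. This is a sketch only: the higher classes are filled in with the standard negative binomial recurrence, which is an assumption here and may not match the exact form of Jones's expression, and the value of t must be supplied by the user.

```python
import math

def nb_probabilities(m, k, x_max):
    """Negative binomial probabilities P(0), ..., P(x_max) for mean m
    and exponent k, built from P(0) with the standard recurrence
    P(x) = P(x - 1) * (k + x - 1) / x * m / (m + k)."""
    probs = [(1.0 + m / k) ** (-k)]
    ratio = m / (m + k)
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * (k + x - 1) / x * ratio)
    return probs

def ci_half_width(p, n_total, t):
    """CI = t * sqrt(P(x) * (1 - P(x)) / N) for one class probability."""
    return t * math.sqrt(p * (1.0 - p) / n_total)

probs = nb_probabilities(m=2.0, k=2.0, x_max=5)
# t = 2.0 is an illustrative critical value, not a recommendation.
for x, p in enumerate(probs):
    print(x, round(p, 4), round(ci_half_width(p, n_total=100, t=2.0), 4))
```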
Katz in 1963[24] proposed a family of distributions (the Katz family) with 2 parameters (w₁, w₂). This family of distributions includes the Bernoulli, Pascal and Poisson distributions. The mean and variance of a Katz distribution are
: <math> m = \frac{w_1}{1 - w_2} </math>
: <math> s^2 = \frac{w_1}{( 1 - w_2 )^2} </math>
where m is the mean and s² is the variance. If the population obeys a Katz distribution then s² = m / (1 − w₂), so the coefficients of Taylor's law are log a = −log(1 − w₂) and b = 1.
The degree of precision (D) is defined to be s / m where s is the standard deviation and m is the mean. The degree of precision is known as the coefficient of variation in other contexts. In ecology research it is recommended that D be in the range 10–25%.[25] The desired degree of precision is important in estimating the required sample size where an investigator wishes to test if Taylor's law applies to the data. The required sample size has been estimated for a number of simple distributions, but where the population distribution is not known or cannot be assumed more complex formulae may be needed to determine the required sample size.
Where the population is Poisson distributed the sample size (n) needed is
where t is the critical level of the t distribution for the type 1 error with the degrees of freedom that the mean (m) was calculated with.
If the population is distributed as a negative binomial distribution then the required sample size is
where k is the parameter of the negative binomial distribution.
A more general sample size estimator has also been proposed[26]
where a and b are derived from Taylor's law.
An alternative has been proposed by Southwood[27]
where n is the required sample size, a and b are the Taylor's law coefficients and D is the desired degree of precision.
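The estimators in this section share a common structure, which can be sketched in Python. The sketch assumes the usual fixed-precision form n = (t/D)² s² / m², with the variance supplied by the assumed distribution: s² = m for the Poisson, s² = m + m²/k for the negative binomial and s² = a m^b under Taylor's law. The function names are illustrative, and the expressions may differ in detail from those given in the cited papers.

```python
def n_fixed_precision(m, s2, D, t):
    """Sample size for relative precision D, assuming n = (t / D)^2 * s^2 / m^2."""
    return (t / D) ** 2 * s2 / (m * m)

def n_poisson(m, D, t):
    """Poisson case: the variance equals the mean."""
    return n_fixed_precision(m, m, D, t)

def n_negative_binomial(m, k, D, t):
    """Negative binomial case: variance m + m^2 / k."""
    return n_fixed_precision(m, m + m * m / k, D, t)

def n_taylor(m, a, b, D, t=1.0):
    """Taylor's law case: variance a * m^b.
    With t = 1 this reduces to a * m^(b - 2) / D^2."""
    return n_fixed_precision(m, a * m ** b, D, t)

# Illustrative values only: mean 4, precision 25 %, t = 2.
print(n_poisson(4.0, 0.25, 2.0))
print(n_negative_binomial(4.0, 2.0, 0.25, 2.0))
print(n_taylor(4.0, 1.5, 1.5, 0.25, 2.0))
```

In practice the result is rounded up to the next whole sample unit.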
Karandinos proposed two similar estimators for n.[28] The first was modified by Ruesink to incorporate Taylor's law.[29]
where d is the ratio of half the desired confidence interval (CI) to the mean. In symbols
The second estimator is used in binomial (presence-absence) sampling. The desired sample size (n) is
where dp is the ratio of half the desired confidence interval to the proportion of sample units with individuals, p is the proportion of samples containing individuals and q = 1 - p. In symbols
Sequential analysis is a method of statistical analysis where the sample size is not fixed in advance. Instead samples are taken in accordance with a predefined stopping rule. Taylor's law has been used to derive a number of stopping rules.
A formula for fixed precision in serial sampling to test Taylor's law was derived by Green in 1970.[30]
where T is the cumulative sample total, D is the level of precision, n is the sample size and a and b are obtained from Taylor's law.
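A fixed-precision stopping line of this kind can be derived from Taylor's law under one stated assumption: that D is the standard error of the mean divided by the mean, D² = s² / (n m²). Substituting s² = a m^b and m = T / n and solving for T gives T = (D² / a)^(1/(b − 2)) · n^((b − 1)/(b − 2)). The Python sketch below implements this rearrangement; it is an illustration under that assumption and may differ in form from Green's published expression.

```python
def green_stop_total(n, a, b, D):
    """Cumulative total T at which precision D is reached after n samples,
    assuming s^2 = a * m^b and D^2 = s^2 / (n * m^2)."""
    if b == 2:
        raise ValueError("the stopping line is undefined for b = 2")
    return (D * D / a) ** (1.0 / (b - 2.0)) * n ** ((b - 1.0) / (b - 2.0))

def achieved_precision(T, n, a, b):
    """Precision implied by a cumulative total T after n samples."""
    m = T / n
    return (a * m ** b / n) ** 0.5 / m

# Illustrative coefficients: a = 2, b = 1.5, target precision 25 %.
T = green_stop_total(n=10, a=2.0, b=1.5, D=0.25)
print(T, achieved_precision(T, 10, 2.0, 1.5))
```

For b < 2, sampling would stop once the running total reaches or exceeds this line.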
As an aid to pest control Wilson et al. developed a test that incorporated a threshold level where action should be taken.[31] The required sample size is
where a and b are the Taylor coefficients, || is the absolute value, m is the sample mean, T is the threshold level and t is the critical level of the t distribution. The authors also provided a similar test for binomial (presence-absence) sampling
where p is the probability of finding a sample with pests present and q = 1 - p.
Green derived another sampling formula for sequential sampling based on Taylor's law[32]
where D is the degree of precision, a and b are the Taylor's law coefficients, n is the sample size and T is the total number of individuals sampled.
A number of other methods for detecting relationships between the variance and mean in biological samples have been proposed. To date none have achieved the popularity of Taylor's law.
Bartlett in 1936[33] and Iwao in 1968[34] both proposed an alternative relationship between the variance and the mean. In symbols
: <math> s_i^2 = a m_i + b m_i^2 </math>
where sᵢ² is the variance in the ith sample and mᵢ is the mean of the ith sample.
When the population follows a negative binomial distribution, whose variance is m + m²/k, a = 1 and b = 1/k (where k is the exponent of the negative binomial distribution).
This alternative formulation has not been found to be as good a fit as Taylor's law in most studies.
Nachman proposed a relationship between the mean density and the proportion of samples with zero counts:[35]
where p0 is the proportion of the sample with zero counts, m is the mean density, a is a scale parameter and b is a dispersion parameter. If a = b = 0 the distribution is random. This relationship is usually tested in its logarithmic form
A negative binomial model has also been proposed.[36] The dispersion parameter (k) using the method of moments is m² / (s² − m) and pᵢ is the proportion of samples with counts > 0. The s² used in the calculation of k are the values predicted by Taylor's law. pᵢ is plotted against 1 − (k / (k + m))^k and the fit of the data is visually inspected.
Perry and Taylor have proposed an alternative estimator of k based on Taylor's law.[37]
A better estimate of the dispersion parameter can be made with the method of maximum likelihood. For the negative binomial it can be estimated from the equation[22]
where Aₓ is the total number of samples with more than x individuals, N is the total number of individuals, x is the number of individuals in a sample, m is the mean number of individuals per sample and k is the exponent. The value of k has to be estimated numerically.
Goodness of fit of this model can be tested in a number of ways including using the chi-square test. As these may be biased by small samples an alternative is the U statistic, the difference between the sample variance and the variance expected under the negative binomial distribution. The expected variance of this distribution is m + m²/k and
: <math> U = s^2 - m - \frac{m^2}{k} </math>
where s² is the sample variance, m is the sample mean and k is the negative binomial parameter.
The variance of U is[22]
where p = m / k, q = 1 + p, R = p / q and N is the total number of individuals in the sample. The expected value of U is 0. For large sample sizes U is distributed normally.
As noted above binary sampling is not uncommonly used in ecology. In 1958 Kono and Sugino derived an equation that relates the proportion of samples without individuals to the mean density of the samples.[38]
where p0 is the proportion of the sample with no individuals, m is the mean sample density, a and b are constants. Like Taylor's law this equation has been found to fit a variety of populations including ones that obey Taylor's law.
The predicted estimates of m from this equation are subject to bias[39] and it is recommended that the adjusted mean ( ma ) be used instead[40]
where var() is the variance of the sample unit means ( mi ) and m is the overall mean.
Hughes and Madden have proposed testing a similar relationship also applicable to binary sampling (presence/absence in a unit)[41][42]
where a, b and c are constants, s2 is the variance and p is the proportion of units with at least one individual. In logarithmic form this relationship is
This proposed relationship has not been subjected to the extensive testing that Taylor's law has had and because of this its general applicability remains uncertain.
Lloyd's index of mean crowding (IMC) is the average number of other points contained in the sample unit that contains a randomly chosen point.[43]
: <math> \mathrm{IMC} = m + \frac{s^2}{m} - 1 </math>
where m is the sample mean and s² is the variance.
Lloyd's index of patchiness (IP)[43] is
: <math> \mathrm{IP} = \frac{\mathrm{IMC}}{m} </math>
It is a measure of pattern intensity that is unaffected by thinning (random removal of points).
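Both of Lloyd's indices follow directly from the sample mean and variance: IMC = m + s²/m − 1 and IP = IMC / m. A minimal Python sketch, using hypothetical counts and the sample (n − 1) variance as an assumption:

```python
import statistics

def lloyd_indices(counts):
    """Return (IMC, IP) for a list of counts per sample unit.

    IMC = m + s^2 / m - 1 and IP = IMC / m, with s^2 taken here as
    the sample variance (n - 1 denominator).
    """
    m = statistics.mean(counts)
    s2 = statistics.variance(counts)
    imc = m + s2 / m - 1.0
    return imc, imc / m

# Hypothetical counts from five sample units
imc, ip = lloyd_indices([0, 0, 1, 3, 6])
print(imc, ip)
```

An IP above 1 points to aggregation, since it means s² exceeds m.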
If the population obeys Taylor's law then
Iwao proposed a patchiness regression to test for clumping[34][44]
Let
: <math> y_i = m_i + \frac{s_i^2}{m_i} - 1 </math>
yᵢ here is Lloyd's index of mean crowding.[43] Perform an ordinary least squares regression of yᵢ against mᵢ.
In this regression the value of the slope (b) is an indicator of clumping: the slope = 1 if the data is Poisson-distributed. The constant (a) is the number of individuals that share a unit of habitat at infinitesimal density and may be < 0, 0 or > 0. These values represent regularity, randomness and aggregation of populations in spatial patterns respectively. A value of a < 1 is taken to mean that the basic unit of the distribution is a single individual.
Where the statistic s² / m is not constant it has been recommended instead to regress Lloyd's index against am + bm², where a and b are constants.[45]
The sample size (n) for a given degree of precision (D) for this regression is given by[45]
where a is the constant in this regression, b is the slope, m is the mean and t is the critical value of the t distribution.
Iwao has proposed a sequential sampling test based on this regression.[46] The upper and lower limits of this test are based on critical densities m_c where control of a pest requires action to be taken.
where Nu and Nl are the upper and lower bounds respectively, a is the constant from the regression, b is the slope and i is the number of samples.
Kuno has proposed an alternative sequential stopping test also based on this regression.[47]
where Tn is the total sample size, D is the degree of precision, n is the number of samples units, a is the constant and b is the slope from the regression respectively.
Kuno's test is subject to the condition that n ≥ (b − 1) / D².
The dispersion parameter (k)[22] is
: <math> k = \frac{m^2}{s^2 - m} </math>
where m is the sample mean and s² is the variance. If k⁻¹ is > 0 the population is considered to be aggregated; if k⁻¹ = 0 the population is considered to be random; and if k⁻¹ is < 0 the population is considered to be uniformly distributed.
Southwood has recommended regressing k against the mean and a constant[27]
where kᵢ and mᵢ are the dispersion parameter and the mean of the ith sample respectively, to test for the existence of a common dispersion parameter (kc). A slope (b) value significantly > 0 indicates the dependence of k on the mean density.
An alternative method was proposed by Elliot who suggested plotting (s² − m) against (m² − s²/n).[48] kc is equal to 1/slope of this regression.
Southwood's index of spatial aggregation (k) is defined as
where m is the mean of the sample and m* is Lloyd's index of crowding.[27]
Fisher's index of dispersion[48] is
This index may be used to test for over dispersion of the population. It is recommended that in applications n > 5.[49] It can be applied both to the overall population and to the individual areas sampled individually. The use of this test on the individual sample areas should also include the use of a Bonferroni correction factor.
The index is equal to n and is distributed as the chi-square distribution with n − 1 degrees of freedom when the population is Poisson distributed.[49] It is equal to the scale parameter when the population obeys the gamma distribution.
A related statistic suggested by de Oliveria is the difference of the variance and the mean.[50] If the population is Poisson distributed then
where t is the Poisson parameter, s² is the variance, m is the mean and n is the sample size. The expected value of s² − m is zero. This statistic is distributed normally.[51]
If the population obeys Taylor's law then
The index of cluster size (ICS) was created by David and Moore.[52] Under a random distribution ICS is expected to equal 0. Positive values indicate a clumped distribution; negative values indicate a uniform distribution.
: <math> \mathrm{ICS} = \frac{s^2}{m} - 1 </math>
where s² is the variance and m is the mean.
If the population obeys Taylor's law
: <math> \mathrm{ICS} = a m^{b - 1} - 1 </math>
Green’s index (GI) is a modification of the index of cluster size that is independent of n, the number of sample units.[53]
: <math> \mathrm{GI} = \frac{s^2 / m - 1}{nm - 1} </math>
This index equals 0 if the distribution is random, 1 if it is maximally aggregated and -1 / ( nm - 1 ) if it is uniform.
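The limiting values quoted for Green's index can be checked numerically. The Python sketch below assumes ICS = s²/m − 1 and GI = ICS / (nm − 1), where nm is the total number of individuals, and uses the sample (n − 1) variance; the function names and counts are hypothetical.

```python
import statistics

def index_of_cluster_size(counts):
    """ICS = s^2 / m - 1, using the sample variance."""
    m = statistics.mean(counts)
    return statistics.variance(counts) / m - 1.0

def greens_index(counts):
    """GI = ICS / (n * m - 1), where n * m is the total count."""
    total = sum(counts)
    return index_of_cluster_size(counts) / (total - 1)

clumped = [10, 0, 0, 0, 0]    # every individual in one unit
uniform = [2, 2, 2, 2, 2]     # identical counts in every unit
print(greens_index(clumped))  # maximally aggregated case
print(greens_index(uniform))  # equals -1 / (n * m - 1)
```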
If the population obeys Taylor's law
Morisita’s index of dispersion (I_m) is the scaled probability that two points chosen at random from the whole population are in the same sample.[54] Higher values indicate a more clumped distribution.
: <math> I_m = \frac{n \sum x ( x - 1 )}{nm ( nm - 1 )} </math>
where n is the total sample size, m is the sample mean and x are the individual values with the sum taken over the whole sample. It is also equal to
: <math> I_m = \frac{n \, \mathrm{IMC}}{nm - 1} </math>
where IMC is Lloyd's index of crowding.[43]
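A sketch of the calculation in Python, assuming the form I_m = n Σx(x − 1) / (nm(nm − 1)), where nm is the total number of individuals. The second function checks the stated link to Lloyd's index, I_m = n · IMC / (nm − 1), which holds exactly when IMC is computed with the population (n denominator) variance. The counts are hypothetical.

```python
import statistics

def morisita_index(counts):
    """I_m = n * sum(x * (x - 1)) / (T * (T - 1)), with T = n * m."""
    n = len(counts)
    total = sum(counts)
    return n * sum(x * (x - 1) for x in counts) / (total * (total - 1))

def morisita_from_imc(counts):
    """Same index obtained through Lloyd's mean crowding,
    using the population variance so the identity is exact."""
    n = len(counts)
    m = statistics.mean(counts)
    imc = m + statistics.pvariance(counts) / m - 1.0
    return n * imc / (n * m - 1)

counts = [0, 0, 1, 3, 6]
print(morisita_index(counts), morisita_from_imc(counts))
```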
A significance test for this index has been developed for large samples.[55]
where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.
A function for its calculation is available in the R statistical language.
Binary sampling (presence/absence) is frequently used where it is difficult to obtain accurate counts. The dispersal index (D) is used when the study population is divided into a series of equal samples (number of samples = N; number of units per sample = n; total population size = n × N).[19] The theoretical variance of a sample from a population with a binomial distribution is
: <math> s^2 = np ( 1 - p ) </math>
where s² is the variance, n is the number of units sampled and p is the mean proportion of sampling units with at least one individual present. The dispersal index (D) is defined as the ratio of observed variance to the expected variance. In symbols
: <math> D = \frac{\mathrm{var}_{obs}}{\mathrm{var}_{bin}} </math>
where var_obs is the observed variance and var_bin is the expected variance. The expected variance is calculated with the overall mean of the population. Values of D > 1 are considered to suggest aggregation. D( n - 1 ) is distributed as the chi squared variable with n - 1 degrees of freedom where n is the number of units sampled.
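The dispersal index can be computed from the number of occupied units found in each sample. The Python sketch below assumes N samples of n units each, takes p as the overall proportion of occupied units, uses n p (1 − p) as the binomial variance and the sample (N − 1) variance of the per-sample counts as the observed variance. The function name and data are hypothetical.

```python
import statistics

def dispersal_index(positives, n):
    """D = observed variance / binomial variance.

    positives: number of occupied units in each sample
    n:         number of units per sample
    """
    N = len(positives)
    p = sum(positives) / (n * N)          # overall proportion occupied
    var_bin = n * p * (1.0 - p)           # expected binomial variance
    var_obs = statistics.variance(positives)
    return var_obs / var_bin

# Four hypothetical samples of 10 units each
print(dispersal_index([2, 4, 6, 8], n=10))
```

A value above 1, as in this example, would be read as suggesting aggregation.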
An alternative test is the C test.
where D is the dispersal index, n is the number of units per sample and N is the number of samples. C is distributed normally. A statistically significant value of C indicates overdispersion of the population.[56]