In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox (Wilcox 1967, Wilcox 1973), who requires the following standardization properties to be satisfied:
Variation varies between 0 and 1.
Variation is 0 if and only if all cases belong to a single category.
Variation is 1 if and only if cases are evenly divided across all categories.[1]
In particular, the value of these standardized indices does not depend on the number of categories or number of samples.
For any index, the closer the distribution is to uniform, the larger the variation; the larger the differences in frequencies across categories, the smaller the variation.
Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.
One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.
Wilcox gives a number of formulae for various indices of QV (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.
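As an illustration, the following minimal sketch computes a deviation-from-the-mode index from category counts. It assumes the commonly quoted standardized form DM = K(N − fm) / (N(K − 1)), where fm is the modal frequency; the function name and the convention that zero counts still count as categories are assumptions made for this example, not part of the source.

```python
def deviation_from_mode(counts):
    """Standardized variation ratio, assuming DM = K*(N - f_m) / (N*(K - 1)).

    counts: list of category frequencies; zero entries still count as categories.
    """
    k = len(counts)                 # K, number of categories (assumed known)
    n_total = sum(counts)           # N, total number of cases
    if k < 2 or n_total == 0:
        raise ValueError("need at least two categories and one case")
    f_mode = max(counts)            # f_m, frequency of the modal category
    return k * (n_total - f_mode) / (n_total * (k - 1))


print(deviation_from_mode([10, 10, 10, 10]))  # evenly divided cases
print(deviation_from_mode([40, 0, 0, 0]))     # all cases in one category
```

Under this assumed form the index is 0 when every case falls in one category and 1 when cases are spread evenly, matching the standardization properties listed earlier.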
This is an analog of the mean difference, that is, the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it depends on the spread of the variate values among themselves and not on the deviations from some central value.[3]
where fi and fj are the ith and jth frequencies respectively.
where K is the number of categories and pi is the proportion of observations that fall in a given category i.
M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category,[9] so this formula for IQV is a standardized likelihood of a random pair falling in different categories. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.
where K is the number of categories and pi is the proportion of observations that fall in a given category i. The factor of K/(K − 1) is for standardization.
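The two indices can be sketched in a few lines. This example assumes the usual forms M1 = 1 − Σ pi² and M2 = (K/(K − 1))(1 − Σ pi²); the function names `m1` and `m2` are chosen here for illustration only.

```python
def m1(counts):
    """Assumed form: M1 = 1 - sum(p_i^2), with p_i = count_i / N."""
    n_total = sum(counts)
    return 1.0 - sum((c / n_total) ** 2 for c in counts)


def m2(counts):
    """Assumed form: M2 = K/(K - 1) * M1, rescaling M1 so its maximum is 1."""
    k = len(counts)                 # K includes categories with zero count
    if k < 2:
        raise ValueError("M2 needs at least two categories")
    return k / (k - 1) * m1(counts)


counts = [10, 10, 10, 10]
print(m1(counts), m2(counts))
```

For an even split across K categories, M1 reaches only 1 − 1/K, which is why the standardizing factor is needed for the index to reach 1.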
M1 and M2 can be interpreted in terms of variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
where K is the number of categories, Xi is the number of data points in the ith category, N is the total number of data points, || is the absolute value (modulus) and
This formula can be simplified
where pi is the proportion of the sample in the ith category.
In practice M1 and M6 tend to be highly correlated, which militates against their combined use.
has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology[11]
where fi is the count of the ith grapheme in the text and n is the total number of graphemes in the text.
M1
The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,[13] Simpson's measure of diversity,[14] Bachi's index of linguistic homogeneity,[15] Mueller and Schuessler's index of qualitative variation,[16] Gibbs and Martin's index of industry diversification,[17] Lieberson's index[18] and Blau's index in sociology, psychology and management studies.[19] The formulation of all these indices is identical.
Simpson's D is defined as
where n is the total sample size and ni is the number of items in the ith category.
For large n we have
Another statistic that has been proposed is the coefficient of unalikeability which ranges between 0 and 1.[20]
where n is the sample size and c(x,y) = 1 if x and y are unalike and 0 otherwise.
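A minimal sketch of the coefficient of unalikeability follows. It assumes the form u = Σ c(xi, xj) / (n² − n), summed over all ordered pairs of distinct observations; both function names are hypothetical. The second function computes the same quantity from category counts, which under this assumption equals one minus Σ ni(ni − 1) / (n(n − 1)).

```python
from collections import Counter
from itertools import permutations


def unalikeability_pairs(values):
    """Count unalike ordered pairs directly (assumed definition of u)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # permutations works on positions, so equal values at different
    # positions are still compared with each other
    unalike = sum(1 for x, y in permutations(values, 2) if x != y)
    return unalike / (n * n - n)


def unalikeability_counts(values):
    """Same quantity from category counts: 1 - sum n_i(n_i - 1) / (n(n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    alike = sum(c * (c - 1) for c in Counter(values).values())
    return 1.0 - alike / (n * (n - 1))


data = ["a", "a", "b", "c"]
print(unalikeability_pairs(data), unalikeability_counts(data))
```

The pairwise version is quadratic in n and is shown only to make the definition concrete; the count-based version is the practical one.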
For large n we have
where K is the number of categories.
Another related statistic is the quadratic entropy
where K is the number of categories, L is the number of subtypes, Oij and Eij are the number observed and expected respectively of subtype j in the ith category, ni is the number in the ith category and pj is the proportion of subtype j in the complete sample.
Note: This index was designed to measure women's participation in the work place: the two subtypes it was developed for were male and female.
The Berger–Parker index equals the maximum value in the dataset, i.e. the proportional abundance of the most abundant type.[23] This corresponds to the weighted generalized mean of the values when q approaches infinity, and hence equals the inverse of true diversity of order infinity (1/∞D).
This index is strictly applicable only to entire populations rather than to finite samples. It is defined as
where N is the total number of individuals in the population, ni is the number of individuals in the ith category and N! is the factorial of N.
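Factorials of realistic population sizes overflow quickly, so an index of this kind is usually computed on the log scale. The sketch below assumes the index described here is Brillouin's index in the form (ln N! − Σ ln ni!) / N; the function name is chosen for illustration.

```python
import math


def brillouin_index(counts):
    """Assumed form: (ln N! - sum ln n_i!) / N, computed via log-gamma."""
    n_total = sum(counts)
    if n_total == 0:
        raise ValueError("population must not be empty")
    # math.lgamma(x + 1) equals ln(x!) for non-negative integers x
    log_n_factorial = math.lgamma(n_total + 1)
    log_category_factorials = sum(math.lgamma(c + 1) for c in counts)
    return (log_n_factorial - log_category_factorials) / n_total


print(brillouin_index([50, 30, 20]))
```

Using `math.lgamma` avoids forming N! explicitly, which matters once N exceeds a few hundred.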
Brillouin's index of evenness is defined as
where S is the number of data types in the sample and N is the total size of the sample.[26]
In linguistics this index is identical to the Kuraszkiewicz index (Guiard index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined.[27][28] This index can be derived as a special case of the Generalised Torquist function.[29]
This is a statistic invented by Kempton and Taylor[30] that involves the quartiles of the sample. It is defined as
where R1 and R2 are the 25% and 75% quartiles respectively on the cumulative species curve, nj is the number of species in the jth category, nRi is the number of species in the class where Ri falls (i = 1 or 2).
where N is the total number in the sample and pi is the proportion in the ith category.
In ecology where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.
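For reference, a minimal sketch of the Shannon index computed from category counts, assuming the standard form H = −Σ pi ln pi with natural logarithms (the base is an assumption; other bases only rescale H). The function name is illustrative.

```python
import math


def shannon_index(counts):
    """H = -sum(p_i * ln p_i); empty categories contribute nothing."""
    n_total = sum(counts)
    if n_total == 0:
        raise ValueError("sample must not be empty")
    return -sum(
        (c / n_total) * math.log(c / n_total)
        for c in counts
        if c > 0                    # 0 * ln 0 is taken as 0 by convention
    )


print(shannon_index([10, 10, 10, 10]))  # equals ln 4 for four equal categories
```

With natural logarithms, a value of 3.5 corresponds to roughly e^3.5 ≈ 33 equally abundant categories, which gives a sense of why values above 4 are rare in field data.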
An approximate formula for the standard deviation (SD) of H is
where pi is the proportion made up by the ith category and N is the total in the sample.
A more accurate approximate value of the variance of H (var(H)) is given by[31]
where N is the sample size and K is the number of categories.
A related index is the Pielou J defined as
One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.
where ni is the number in the ith category and K is the number of categories.
He also proposed several normalized versions of this index. First is D:
where N is the total sample size.
This index has the advantage of expressing the observed diversity as a proportion of the absolute maximum diversity at a given N.
Another proposed normalization is E, the ratio of observed diversity to the maximum possible diversity for a given N and K (i.e., if all species are equal in number of individuals):
This was the first index to be derived for diversity.[33]
where K is the number of categories and N is the number of data points in the sample. Fisher's α has to be estimated numerically from the data.
The expected number of individuals in the rth category where the categories have been placed in increasing size is
where X is an empirical parameter lying between 0 and 1. While X is best estimated numerically an approximate value can be obtained by solving the following two equations
where K is the number of categories and N is the total sample size.
This index (Dw) is the distance between the Lorenz curve of species distribution and the 45 degree line. It is closely related to the Gini coefficient.[35]
In symbols it is
where max() is the maximum value taken over the N data points, K is the number of categories (or species) in the data set and ci is the cumulative total up and including the ith category.
In a rarefied sample a random subsample of n items is chosen from the total N items. Some groups may necessarily be absent from this subsample. Let Xn be the number of groups still present in the subsample of n items. Xn is less than K, the number of categories, whenever at least one group is missing from this subsample.
The rarefaction curve is defined as:
Note that 0 ≤ f(n) ≤ K.
Furthermore,
Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.[40]
This is a z type statistic based on Shannon's entropy.[41]
where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou
where pi is the proportion made up by the ith category and N is the total in the sample.
This index is used to compare the relationship between hosts and their parasites.[42] It incorporates information about the phylogenetic relationship amongst the host species.
where s is the number of host species used by a parasite and ωij is the taxonomic distinctness between host species i and j.
This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972.[43] The index is a weighted average of the sample entropies.
Let
and
where pi is the proportion of type i in the ath sample, r is the total number of samples, ni is the size of the ith sample, N is the size of the population from which the samples were obtained and E is the entropy of the population.
Indices for comparison of two or more data types within a single sample
Let A and B be two types of data item. Then the index of dissimilarity is
where
Ai is the number of data type A at sample site i, Bi is the number of data type B at sample site i, K is the number of sites sampled and || is the absolute value.
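A short sketch of the index of dissimilarity, assuming the usual form D = ½ Σ |Ai/A − Bi/B|, where A and B are the totals of each type over all K sites. The function and argument names are hypothetical.

```python
def dissimilarity_index(a_counts, b_counts):
    """Assumed form: D = 0.5 * sum(|A_i/A - B_i/B|) over the K sites."""
    if len(a_counts) != len(b_counts):
        raise ValueError("both types must be counted at the same sites")
    a_total = sum(a_counts)
    b_total = sum(b_counts)
    if a_total == 0 or b_total == 0:
        raise ValueError("each type must occur at least once")
    return 0.5 * sum(
        abs(a / a_total - b / b_total)
        for a, b in zip(a_counts, b_counts)
    )


print(dissimilarity_index([10, 0], [0, 10]))   # complete separation
print(dissimilarity_index([5, 5], [10, 10]))   # identical distribution
```

The value can be read as the share of one type that would have to move between sites for the two distributions to match.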
This index is probably better known as the index of dissimilarity (D).[44] It is closely related to the Gini index.
This index is biased as its expectation under a uniform distribution is > 0.
A modification of this index has been proposed by Gorard and Taylor.[45] Their index (GT) is
This index ( Lxy ) was invented by Lieberson in 1981.[48]
where Xi and Yi are the variables of interest at the ith site, K is the number of sites examined and Xtot is the total number of variate of type X in the study.
where px is the proportion of the sample made up of variates of type X and
where Nx is the total number of variates of type X in the study, K is the number of samples in the study and xi and pi are the number of variates and the proportion of variates of type X respectively in the ith sample.
This is a binary form of the cosine index.[50] It is used to compare presence/absence data of two data types (here A and B). It is defined as
where a is the number of sample units where both A and B are found, b is number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.
This coefficient was invented by Stanisław Kulczyński in 1927[51] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as
where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.
This index was invented by Yule in 1900.[52] It concerns the association of two different types (here A and B). It is defined as
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.
Because the denominator may be zero, Lienert and Sporer have recommended adding 1 to each of a, b, c and d.[53]
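The following sketch assumes the standard form of Yule's Q, Q = (ad − bc) / (ad + bc), and exposes the add-one correction as an optional flag. The function name and the `add_one` parameter are illustrative choices.

```python
def yules_q(a, b, c, d, add_one=False):
    """Assumed form: Q = (ad - bc) / (ad + bc).

    add_one=True adds 1 to every cell first, which guarantees a
    non-zero denominator.
    """
    if add_one:
        a, b, c, d = a + 1, b + 1, c + 1, d + 1
    denominator = a * d + b * c
    if denominator == 0:
        raise ZeroDivisionError("ad + bc is zero; consider add_one=True")
    return (a * d - b * c) / denominator


print(yules_q(10, 0, 0, 10))               # perfect positive association
print(yules_q(5, 5, 5, 5))                 # no association
print(yules_q(0, 0, 0, 0, add_one=True))   # empty table, corrected
```

Note that the correction changes the value of Q for sparse tables, so corrected and uncorrected values should not be mixed in one analysis.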
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
This index was invented by Baroni-Urbani and Buser in 1976.[54] It varies between 0 and 1 in value. It is defined as
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
When d = 0, this index is identical to the Jaccard index.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
where b is the number of samples where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size (N = a + b + c + d).
A modification of this coefficient which does not require the knowledge of d has been proposed by Alroy[56]
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A, d is the sample count where neither type A nor type B are present, n equals a + b + c + d and || is the modulus (absolute value) of the difference.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
In 1975 Hawkin and Dotson proposed the following coefficient
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A. Min(b, c) is the minimum of b and c.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A. K is a normalizing parameter. N is the sample size.
This index is also known as the coefficient of arithmetic means.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.
where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A.
Indices for comparison between two or more samples
This is also known as the Bray–Curtis index, Schoener's index, least common percentage index, index of affinity or proportional similarity. It is related to the Sørensen similarity index.
where xi and xj are the number of species in sites i and j respectively and the minimum is taken over the number of species in common between the two sites.
The Canberra distance is a weighted version of the L1 metric. It was introduced in 1966[58] and refined in 1967[59] by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors, here two sites with K categories within each site.
The Canberra distance d between vectors p and q in a K-dimensional real vector space is
where pi and qi are the values of the ith category of the two vectors.
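A minimal sketch, assuming the usual form d(p, q) = Σ |pi − qi| / (|pi| + |qi|). Terms where both entries are zero are skipped here; that convention is an assumption of this example, and the function name is illustrative.

```python
def canberra_distance(p, q):
    """Assumed form: sum(|p_i - q_i| / (|p_i| + |q_i|)) over K categories."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of categories")
    total = 0.0
    for p_i, q_i in zip(p, q):
        denominator = abs(p_i) + abs(q_i)
        if denominator > 0:         # treat 0/0 terms as contributing 0
            total += abs(p_i - q_i) / denominator
    return total


print(canberra_distance([1, 0, 3], [3, 0, 1]))
```

Because each term is divided by the size of the entries being compared, differences in rare categories weigh as much as differences in abundant ones.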
This is a measure of the similarity between two samples:
where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.
This index was invented in 1902 by the Swiss botanist Paul Jaccard.[60]
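Jaccard's index can be sketched with set operations, assuming the standard form J = A / (A + B + C), with A, B and C as defined above. The function name is illustrative, and duplicate items within a sample are ignored because the index uses presence only.

```python
def jaccard_index(sample_1, sample_2):
    """J = A / (A + B + C), where A counts shared items and B, C unique ones."""
    set_1, set_2 = set(sample_1), set(sample_2)
    shared = len(set_1 & set_2)     # A
    only_first = len(set_1 - set_2) # B
    only_second = len(set_2 - set_1)  # C
    total = shared + only_first + only_second
    if total == 0:
        raise ValueError("both samples are empty")
    return shared / total


print(jaccard_index(["a", "b", "c"], ["b", "c", "d"]))
```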
Under a random distribution the expected value of J is[61]
The standard error of this index with the assumption of a random distribution is
This is a measure of the similarity between two samples:
where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.
Morisita's index of dispersion ( Im ) is the scaled probability that two points chosen at random from the whole population are in the same sample.[62] Higher values indicate a more clumped distribution.
An alternative formulation is
where n is the total sample size, m is the sample mean and x are the individual values with the sum taken over the whole sample. It is also equal to
is distributed as a chi-squared variable with n − 1 degrees of freedom.
An alternative significance test for this index has been developed for large samples.[64]
where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.
Morisita's overlap index is used to compare overlap among samples.[65] The index is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats.
xi is the number of times species i is represented in the total X from one sample.
yi is the number of times species i is represented in the total Y from another sample.
Dx and Dy are the Simpson's index values for the x and y samples respectively.
S is the number of unique species
CD = 0 if the two samples do not overlap in terms of species, and CD = 1 if the species occur in the same proportions in both samples.
Smith-Gill developed a statistic based on Morisita's index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows[67]
First determine Morisita's index ( Id ) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values
where χ2 is the chi square value for n − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence.
The standardised index ( Ip ) is then calculated from one of the formulae below
When Id ≥ Mc >1
When Mc > Id ≥ 1
When 1 > Id ≥ Mu
When 1 > Mu > Id
Ip ranges between +1 and −1 with 95% confidence intervals of ±0.5. Ip has the value of 0 if the pattern is random; if the pattern is uniform, Ip < 0 and if the pattern shows aggregation, Ip > 0.
This is related to the Manhattan distance. It was described by Prevosti et al. and was used to compare differences between chromosomes.[73] Let P and Q be two collections of r finite probability distributions. Let these distributions have values that are divided into k categories. Then the distance DPQ is
where r is the number of discrete probability distributions in each population, kj is the number of categories in distributions Pj and Qj and pji (respectively qji) is the theoretical probability of category i in distribution Pj (Qj) in population P(Q).
Its statistical properties were examined by Sanchez et al.[74] who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.
Leik's measure of dispersion (D) is one such index.[76] Let there be K categories, let pi be fi/N where fi is the number in the ith category, and let the categories be arranged in ascending order. Let
where a ≤ K. Let da = ca if ca ≤ 0.5 and da = 1 − ca otherwise. Then
The potential-for-conflict Index (PCI) describes the ratio of scoring on either side of a rating scale's centre point.[77] This index requires at least ordinal data. This ratio is often displayed as a bubble graph.
The PCI uses an ordinal scale with an odd number of rating points (−n to +n) centred at 0. It is calculated as follows
where Z = 2n, |·| is the absolute value (modulus), r+ is the number of responses in the positive side of the scale, r− is the number of responses in the negative side of the scale, X+ are the responses on the positive side of the scale, X− are the responses on the negative side of the scale and
Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral center point and an equal number of response options on either side of it. Also a uniform distribution of responses does not always yield the midpoint of the PCI statistic but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively.
The first of these problems is relatively minor, as most ordinal scales with an even number of responses can be extended (or reduced) by a single value to give an odd number of possible responses. Scales can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.
where K is the number of categories, ki is the number in the ith category, dij is the distance between the ith and jth categories, and δ is the maximum distance on the scale multiplied by the number of times it can occur in the sample. For a sample with an even number of data points
and for a sample with an odd number of data points
where N is the number of data points in the sample and dmax is the maximum distance between points on the scale.
Vaske et al. suggest a number of possible distance measures for use with this index.[78]
if the signs (+ or −) of ri and rj differ. If the signs are the same dij = 0.
where p is an arbitrary real number > 0.
if sign(ri ) ≠ sign(rj ) and p is a real number > 0. If the signs are the same then dij = 0. m is D1, D2 or D3.
The difference between D1 and D2 is that the first does not include neutrals in the distance while the latter does. For example, respondents scoring −2 and +1 would have a distance of 2 under D1 and 3 under D2.
The use of a power (p) in the distances allows for the rescaling of extreme responses. These differences can be highlighted with p > 1 or diminished with p < 1.
In simulations with variates drawn from a uniform distribution the PCI2 has a symmetric unimodal distribution.[78] The tails of its distribution are larger than those of a normal distribution.
Vaske et al. suggest the use of a t test to compare the values of the PCI between samples if the PCIs are approximately normally distributed.
This measure is a weighted average of the degree of agreement of the frequency distribution.[79] A ranges from −1 (perfect bimodality) to +1 (perfect unimodality). It is defined as
where U is the unimodality of the distribution, S the number of categories that have nonzero frequencies and K the total number of categories.
The value of U is 1 if the distribution has any of the three following characteristics:
all responses are in a single category
the responses are evenly distributed among all the categories
the responses are evenly distributed among two or more contiguous categories, with the other categories with zero responses
With distributions other than these the data must be divided into 'layers'. Within a layer the responses are either equal or zero. The categories do not have to be contiguous. A value for A for each layer (Ai) is calculated and a weighted average for the distribution is determined. The weights (wi) for each layer are the number of responses in that layer. In symbols
A uniform distribution has A = 0; when all the responses fall into one category, A = +1.
One theoretical problem with this index is that it assumes that the intervals are equally spaced. This may limit its applicability.
If there are n units in the sample and they are randomly distributed into k categories (n ≤ k), this can be considered a variant of the birthday problem.[80] The probability (p) of all the categories having only one unit is
If c is large and n is small compared with k^(2/3) then to a good approximation
This approximation follows from the exact formula as follows:
Sample size estimates
For p = 0.5 and p = 0.05 respectively the following estimates of n may be useful
This analysis can be extended to multiple categories. For p = 0.5 and p = 0.05 we have respectively
where ci is the size of the ith category. This analysis assumes that the categories are independent.
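For the basic case of n units placed at random into k categories, the exact probability that no category receives more than one unit is the product (k/k)((k − 1)/k)…((k − n + 1)/k). The sketch below computes that product and compares it with the exponential approximation exp(−n²/(2k)); the specific form of the approximation is an assumption here, and both function names are illustrative.

```python
import math


def prob_all_distinct(n, k):
    """Exact probability that n random units occupy n different categories."""
    if n > k:
        return 0.0                  # pigeonhole: some category must repeat
    probability = 1.0
    for i in range(n):
        probability *= (k - i) / k  # i categories are already taken
    return probability


def prob_all_distinct_approx(n, k):
    """Assumed approximation exp(-n^2 / (2k)), for n small relative to k."""
    return math.exp(-(n * n) / (2 * k))


print(prob_all_distinct(23, 365))
print(prob_all_distinct_approx(23, 365))
```

With k = 365 this is the classical birthday problem: the exact probability drops just below 0.5 at n = 23.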
If the data is ordered in some fashion, then for at least one event occurring in two categories lying within j categories of each other, a probability of 0.5 or 0.05 requires a sample size (n) respectively of[81]
The adjusted Rand index is the corrected-for-chance version of the Rand index.[83][84][85] Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.[86]
Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely X = {X1, X2, …, Xr} and Y = {Y1, Y2, …, Ys}, the overlap between X and Y can be summarized in a contingency table [nij], where each entry nij denotes the number of objects in common between Xi and Yj: nij = |Xi ∩ Yj|.
The adjusted form of the Rand Index, the Adjusted Rand Index, is
more specifically
where nij, ai and bj are values from the contingency table (ai and bj being its row and column sums).
Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that X and Y will agree on a randomly chosen pair.
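A minimal sketch of the adjusted Rand index computed from two label lists. It assumes the standard contingency-table form: with I = Σ C(nij, 2), A = Σ C(ai, 2), B = Σ C(bj, 2) and T = C(n, 2), the index is (I − AB/T) / (½(A + B) − AB/T). The function name, and the choice to return 1.0 when the denominator is zero, are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index from two equal-length lists of cluster labels."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both partitions must label the same elements")
    n = len(labels_x)
    if n < 2:
        raise ValueError("need at least two elements to form a pair")

    cell_counts = Counter(zip(labels_x, labels_y))   # n_ij
    row_sums = Counter(labels_x)                     # a_i
    col_sums = Counter(labels_y)                     # b_j

    pairs_cells = sum(comb(v, 2) for v in cell_counts.values())
    pairs_rows = sum(comb(v, 2) for v in row_sums.values())
    pairs_cols = sum(comb(v, 2) for v in col_sums.values())

    expected = pairs_rows * pairs_cols / comb(n, 2)
    maximum = 0.5 * (pairs_rows + pairs_cols)
    if maximum == expected:
        return 1.0  # assumption: degenerate partitions are treated as agreeing
    return (pairs_cells - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, relabelled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

The second call gives a negative value, illustrating that the adjusted index is not bounded below by 0.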
Different indices give different values of variation, and may be used for different purposes: several are used and critiqued in the sociology literature especially.
If one wishes to simply make ordinal comparisons between samples (is one sample more or less varied than another), the choice of IQV is relatively less important, as they will often give the same ordering.
Where the data is ordinal a method that may be of use in comparing samples is ORDANOVA.
In some cases it is useful to not standardize an index to run from 0 to 1, regardless of number of categories or samples (Wilcox 1973, pp. 338), but one generally so standardizes it.
^Friedman WF (1925) The incidence of coincidence and its applications in cryptanalysis. Technical Paper. Office of the Chief Signal Officer. United States Government Printing Office.
^Gini CW (1912) Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuricici della R. Universita de Cagliari
^Bachi R (1956) A statistical analysis of the revival of Hebrew in Israel. In: Bachi R (ed) Scripta Hierosolymitana, Vol III, Jerusalem: Magnus press pp 179–247
^Hutcheson K (1970) A test for comparing diversities based on the Shannon formula. J Theo Biol 29: 151–154
^McIntosh RP (1967). An Index of Diversity and the Relation of Certain Concepts to Diversity. Ecology, 48(3), 392–404
^Fisher RA, Corbet A, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. Animal Ecol 12: 42–58
^Anscombe (1950) Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 37: 358–382
^Theil H (1972) Statistical decomposition analysis. Amsterdam: North-Holland Publishing Company
^Duncan OD, Duncan B (1955) A methodological analysis of segregation indexes. Am Sociol Review, 20: 210–217
^Gorard S, Taylor C (2002b) What is segregation? A comparison of measures in terms of 'strong' and 'weak' compositional invariance. Sociology, 36(4), 875–895
^Hutchens RM (2004) One measure of segregation. International Economic Review 45: 555–578
^Lieberson S (1981). "An asymmetrical approach to segregation". In Peach C, Robinson V, Smith S (eds.). Ethnic segregation in cities. London: Croom Helm. pp. 61–82.
^Bell, W (1954). "A probability model for the measurement of ecological segregation". Social Forces. 32 (4): 357–364. doi:10.2307/2574118. JSTOR 2574118.
^Ochiai A (1957) Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull Jpn Soc Sci Fish 22: 526–530
^Kulczynski S (1927) Die Pflanzenassoziationen der Pieninen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, Classe des Sciences
^Yule GU (1900) On the association of attributes in statistics. Philos Trans Roy Soc
^Lienert GA and Sporer SL (1982) Interkorrelationen seltener Symptome mittels Nullfeldkorrigierter Yule-Koeffizienten. Psychologische Beiträge 24: 411–418
^Forbes SA (1907) On the local distribution of certain Illinois fishes: an essay in statistical ecology. Bulletin of the Illinois State Laboratory of Natural History 7:272–303
^Alroy J (2015) A new twist on a very old binary similarity coefficient. Ecology 96 (2) 575–586
^Carl R. Hausman and Douglas R. Anderson (2012). Conversations on Peirce: Reals and Ideals. Fordham University Press. p. 221. ISBN 9780823234677.
^Lance, G. N.; Williams, W. T. (1967). "Mixed-data classificatory programs I.) Agglomerative Systems". Australian Computer Journal: 15–20.
^Jaccard P (1902) Lois de distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles 38: 67–130
^Archer AW and Maples CG (1989) Response of selected binomial coefficients to varying degrees of matrix sparseness and to matrices with known data interrelationships. Mathematical Geology 21: 741–753
^ a b Morisita M (1959) Measuring the dispersion and the analysis of distribution patterns. Memoirs of the Faculty of Science, Kyushu University Series E. Biol 2: 215–235
^Lloyd M (1967) Mean crowding. J Anim Ecol 36: 1–30
^Pedigo LP & Buntin GD (1994) Handbook of sampling methods for arthropods in agriculture. CRC Boca Raton FL
^Morisita M (1959) Measuring of the dispersion and analysis of distribution patterns. Memoirs of the Faculty of Science, Kyushu University, Series E Biology. 2: 215–235
^Horn, HS (1966). "Measurement of "Overlap" in comparative ecological studies". The American Naturalist. 100 (914): 419–424. doi:10.1086/282436. S2CID 84469180.
^Smith-Gill SJ (1975). "Cytophysiological basis of disruptive pigmentary patterns in the leopard frog Rana pipiens. II. Wild type and mutant cell specific patterns". J Morphol. 146 (1): 35–54. doi:10.1002/jmor.1051460103. PMID 1080207. S2CID 23780609.
^Peet (1974) The measurements of species diversity. Annu Rev Ecol Syst 5: 285–307
^Monostori K, Finkel R, Zaslavsky A, Hodasz G and Patke M (2002) Comparison of overlap detection techniques. In: Proceedings of the 2002 International Conference on Computational Science. Lecture Notes in Computer Science 2329: 51–60
^Bernstein Y and Zobel J (2004) A scalable system for identifying co-derivative documents. In: Proceedings of 11th International Conference on String Processing and Information Retrieval (SPIRE) 3246: 55–67
^Sanchez, A; Ocana, J; Utzet, F; Serra, L (2003). "Comparison of Prevosti genetic distances". Journal of Statistical Planning and Inference. 109 (1–2): 43–65. doi:10.1016/s0378-3758(02)00297-5.
^HaCohen-Kerner Y, Tayeb A and Ben-Dror N (2010) Detection of simple plagiarism in computer science papers. In: Proceedings of the 23rd International Conference on Computational Linguistics pp 421–429
^Leik R (1966) A measure of ordinal consensus. Pacific sociological review 9 (2): 85–90
^Manfredo M, Vaske JJ, Teel TL (2003) The potential for conflict index: A graphic approach to practical significance of human dimensions research. Human Dimensions of Wildlife 8: 219–228
^ a b c Vaske JJ, Beaman J, Barreto H, Shelby LB (2010) An extension and further validation of the potential for conflict index. Leisure Sciences 32: 240–254
^Van der Eijk C (2001) Measuring agreement in ordered rating scales. Quality and quantity 35(3): 325–341
^Von Mises R (1939) Über Aufteilungs- und Besetzungs-Wahrscheinlichkeiten. Revue de la Faculté des Sciences de l'Université d'Istanbul NS 4: 145–163
^Sevast'yanov BA (1972) Poisson limit law for a scheme of sums of dependent random variables. (trans. S. M. Rudolfer) Theory of Probability and Its Applications 17: 695–699
^Hoaglin DC, Mosteller F and Tukey JW (1985) Exploring data tables, trends, and shapes. New York: John Wiley
Lieberson, Stanley (December 1969), "Measuring Population Diversity", American Sociological Review, 34 (6): 850–862, doi:10.2307/2095977, JSTOR 2095977
Swanson, David A. (September 1976), "A Sampling Distribution and Significance Test for Differences in Qualitative Variation", Social Forces, 55 (1): 182–184, doi:10.2307/2577102, JSTOR 2577102
Wilcox, Allen R. (June 1973). "Indices of Qualitative Variation and Political Measurement". The Western Political Quarterly. 26 (2): 325–343. doi:10.2307/446831. JSTOR 446831.