→Formulae: Corrections to formulae

Line 19:
Wilcox gives a number of formulae for various indices of QV {{Harv|Wilcox|1973}}. The first, which he designates DM for "Deviation from the Mode", is a standardized form of the [[variation ratio]] and is analogous to [[variance]] as deviation from the mean.

The formula for this is derived as follows:

:<math> M = \sum_{ i = 1 }^K ( f_m - f_i ) </math>

where ''f''<sub>m</sub> is the modal frequency, ''K'' is the number of categories and ''f''<sub>i</sub> is the frequency of the ''i''<sup>th</sup> group.
||
Line 27:

This can be simplified to

:<math> M = Kf_m - N</math>

where ''N'' is the total size of the sample.
||
Freeman's index is<ref name=Freemen1965>Freeman LC (1965) Elementary applied statistics. New York: John Wiley and Sons pp 40-43</ref>

: <math> v = 1 - \frac{ f_m }{ N } </math>
||
This is related to M as follows:

:<math> \frac{ ( \frac{ f_m }{ N } ) - \frac{ 1 }{ K } }{ \frac{ N }{ K }\frac{ ( K - 1 )} { N } } = \frac{ M }{ N( K - 1 ) }</math>
||
The ModVR is then defined as

:<math> ModVR = 1 - \frac{ Kf_m - N }{ N( K - 1 ) }</math>

Low values of ModVR correspond to small amounts of variation and high values to larger amounts of variation.
|||
One formula for IQV,<ref>[http://www.xycoon.com/qualitative_variation.htm IQV at xycoon]</ref> given as M2 in {{Harv|Gibbs|1975|p=472}} is:
An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. There are a variety of these, but they have been relatively little-studied in the statistics literature. The simplest is the variation ratio, while the most sophisticated is the information entropy.
There are various indices of qualitative variation; a number are summarized and devised by Wilcox (Wilcox 1967), (Wilcox 1973), who requires the following standardization properties to be satisfied:
In particular, the value of these standardized indices does not depend on the number of categories or number of samples.
For any index, the closer to uniform the distribution, the larger the variance, and the larger the differences in frequencies across categories, the smaller the variance.
Indices of qualitative variation are in this sense analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.
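The behaviour of entropy at the two extremes can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and dividing by ln ''K'' to obtain a value between 0 and 1 is an assumption made here for comparison with standardized indices, not a definition taken from the cited sources.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural logarithm) of a list of category counts."""
    total = sum(counts)
    # Empty categories contribute nothing (the limit of p*ln(p) as p -> 0 is 0).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_entropy(counts):
    """Entropy divided by its maximum ln(K), so the result lies in [0, 1]."""
    k = len(counts)
    if k < 2:
        return 0.0
    return shannon_entropy(counts) / math.log(k)


single_category = [10, 0, 0, 0]   # all cases in one category
uniform = [5, 5, 5, 5]            # cases spread evenly

print(normalized_entropy(single_category))
print(normalized_entropy(uniform))
```

Under these assumptions the first distribution gives a value of zero and the second a value of one (up to floating-point rounding).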
One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.
Wilcox gives a number of formulae for various indices of QV (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.
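The formula for this is derived as follows:

:<math> M = \sum_{ i = 1 }^K ( f_m - f_i ) </math>

where ''f''<sub>m</sub> is the modal frequency, ''K'' is the number of categories and ''f''<sub>i</sub> is the frequency of the ''i''<sup>th</sup> group.

This can be simplified to

:<math> M = Kf_m - N</math>

where ''N'' is the total size of the sample.

Freeman's index is[2]

: <math> v = 1 - \frac{ f_m }{ N } </math>

This is related to M as follows:

:<math> \frac{ ( \frac{ f_m }{ N } ) - \frac{ 1 }{ K } }{ \frac{ N }{ K }\frac{ ( K - 1 )} { N } } = \frac{ M }{ N( K - 1 ) }</math>

The ModVR is then defined as

:<math> ModVR = 1 - \frac{ Kf_m - N }{ N( K - 1 ) }</math>

Low values of ModVR correspond to small amounts of variation and high values to larger amounts of variation.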
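The mode-based quantities in this section can be computed directly from a list of category frequencies. The Python sketch below is illustrative only: the function name and the example counts are hypothetical, and it assumes M = ''Kf''<sub>m</sub> − ''N'', Freeman's ''v'' = 1 − ''f''<sub>m</sub>/''N'' and ModVR = 1 − M/(''N''(''K'' − 1)), with at least two categories and a non-empty sample.

```python
def mode_based_indices(counts):
    """Return (M, v, ModVR) for a list of category frequencies."""
    k = len(counts)          # K: number of categories
    n = sum(counts)          # N: total sample size
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and a non-empty sample")
    f_m = max(counts)        # modal frequency
    m = k * f_m - n                  # M = K*f_m - N
    v = 1 - f_m / n                  # Freeman's index
    mod_vr = 1 - m / (n * (k - 1))   # ModVR
    return m, v, mod_vr


m, v, mod_vr = mode_based_indices([5, 3, 2])
print(m, v, mod_vr)
```

For the hypothetical counts 5, 3 and 2 this gives M = 5, ''v'' = 0.5 and ModVR = 0.75. A sample concentrated in one category gives ModVR = 0, and an evenly divided sample gives ModVR = 1.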
One formula for IQV,[3] given as M2 in (Gibbs 1975, p. 472), is:

:<math> M2 = \frac{ K }{ K - 1 } \left( 1 - \sum_{ i = 1 }^K p_i^2 \right) </math>

where ''K'' is the number of categories, and ''p''<sub>i</sub> is the proportion of observations that fall in a given category ''i''. The factor of <math> \frac{ K }{ K - 1 } </math> is for standardization.

The unstandardized index, <math> 1 - \sum_{ i = 1 }^K p_i^2 </math>, denoted as M1 (Gibbs 1975, p. 471), can be interpreted as the likelihood that a random pair of samples will belong to different categories (Lieberson 1969, p. 851), so this formula for IQV is a standardized likelihood of a random pair falling in different categories. M1 and M2 can be interpreted in terms of variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model").

The sum

:<math> \sum_{ i = 1 }^K p_i^2 </math>

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter-Gaston index in microbiology.[4]
The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,[5] Simpson's measure of diversity,[6] Bachi's index of linguistic homogeneity,[7] Mueller and Schuessler's index of qualitative variation,[8] Gibbs and Martin's index of industry diversification,[9] Lieberson's index,[10] and Blau's index in sociology, psychology and management studies.[11] The formulations of all these indices are identical.
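As a rough illustration, the sum of squared proportions and the two indices built from it can be computed as follows. This Python sketch assumes M1 = 1 − Σ ''p''<sub>i</sub><sup>2</sup> and M2 = ''K''/(''K'' − 1) · M1; the function name and the example counts are hypothetical.

```python
def squared_proportion_indices(counts):
    """Return (sum of squared proportions, M1, M2) for category counts."""
    k = len(counts)
    n = sum(counts)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and a non-empty sample")
    proportions = [c / n for c in counts]
    sum_sq = sum(p * p for p in proportions)   # Simpson / Herfindahl sum
    m1 = 1 - sum_sq                            # unstandardized index (assumed form)
    m2 = k / (k - 1) * m1                      # standardized to reach 1 when uniform
    return sum_sq, m1, m2


sum_sq, m1, m2 = squared_proportion_indices([5, 3, 2])
print(sum_sq, m1, m2)
```

For counts of 5, 3 and 2 the sum of squares is about 0.38, so M1 is about 0.62 and M2 about 0.93. The standardizing factor matters mainly when ''K'' is small, since ''K''/(''K'' − 1) approaches 1 as the number of categories grows.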
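Simpson's D is defined as

:<math> D = 1 - \sum_{ i = 1 }^K \frac{ n_i ( n_i - 1 ) }{ n ( n - 1 ) } </math>

where ''n'' is the total sample size and ''n''<sub>i</sub> is the number of items in the ''i''<sup>th</sup> category.

For large ''n'' we have

:<math> D \approx 1 - \sum_{ i = 1 }^K p_i^2 </math>

with ''p''<sub>i</sub> = ''n''<sub>i</sub> / ''n''.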
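The difference between the finite-sample form and its large-sample approximation can be seen numerically. The Python sketch below assumes the form D = 1 − Σ ''n''<sub>i</sub>(''n''<sub>i</sub> − 1) / (''n''(''n'' − 1)) and the approximation 1 − Σ (''n''<sub>i</sub>/''n'')<sup>2</sup>; the function names and counts are hypothetical.

```python
def simpson_d(counts):
    """Finite-sample Simpson's D (pairs drawn without replacement)."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    return 1 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def simpson_d_large_sample(counts):
    """Large-sample approximation: one minus the sum of squared proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


small = [5, 3, 2]
large = [5000, 3000, 2000]   # same proportions, 1000 times the sample size

print(simpson_d(small), simpson_d_large_sample(small))
print(simpson_d(large), simpson_d_large_sample(large))
```

With the same proportions, the two forms differ noticeably for ten observations and agree closely for ten thousand.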
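Another statistic that has been proposed is the coefficient of unalikeability, which ranges between 0 and 1:[12]

:<math> u = \frac{ \sum_{ i \ne j } c( x_i, x_j ) }{ n^2 - n } </math>

where ''n'' is the sample size and ''c''(''x'', ''y'') = 1 if ''x'' and ''y'' are unalike and 0 otherwise.

For large ''n'' we have

:<math> u \approx 1 - \sum_{ i = 1 }^K p_i^2 </math>

where ''K'' is the number of categories.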
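A direct, if inefficient, way to compute such a coefficient is to enumerate all ordered pairs of distinct observations. The Python sketch below assumes that ''c''(''x'', ''y'') counts a pair when the two observations differ (the convention under which the coefficient is 0 for a completely homogeneous sample) and that the sum is divided by ''n''<sup>2</sup> − ''n''; the function name and data are hypothetical.

```python
from itertools import permutations


def unalikeability(observations):
    """Share of ordered pairs of distinct observations whose values differ."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    # permutations(..., 2) yields every ordered pair of distinct positions,
    # so there are n*n - n pairs in total.
    unlike_pairs = sum(1 for x, y in permutations(observations, 2) if x != y)
    return unlike_pairs / (n * n - n)


data = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
print(unalikeability(data))
```

The enumeration costs on the order of ''n''<sup>2</sup> comparisons, so for large samples it is cheaper to work from the category counts instead.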
Greenberg's monolingual non-weighted index of linguistic diversity[13] is the M2 statistic defined above.
The Berger-Parker index equals the maximum ''p''<sub>i</sub> value in the dataset, i.e. the proportional abundance of the most abundant type. This corresponds to the weighted generalized mean of the ''p''<sub>i</sub> values when ''q'' approaches infinity, and hence equals the inverse of true diversity of order infinity (1/<sup>∞</sup>''D'').
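The Rényi entropy is a generalization of the Shannon entropy to other values of ''q'' than unity. It can be expressed:

:<math> {}^qH = \frac{ 1 }{ 1 - q } \ln \left( \sum_{ i = 1 }^K p_i^q \right) </math>

which equals

:<math> {}^qH = \ln \left( \left( \sum_{ i = 1 }^K p_i^q \right)^{ \frac{ 1 }{ 1 - q } } \right) = \ln ( {}^qD ) </math>

This means that taking the logarithm of true diversity based on any value of ''q'' gives the Rényi entropy corresponding to the same value of ''q''.

The value of <math> {}^qD </math> is also known as the Hill number.[14]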
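The relationship between true diversity, the Rényi entropy and the Berger-Parker index can be sketched in a few lines of Python. The sketch assumes that true diversity of order ''q'' is (Σ ''p''<sub>i</sub><sup>''q''</sup>)<sup>1/(1 − ''q'')</sup>, treats ''q'' = 1 as the limiting case given by the exponential of the Shannon entropy, and treats infinite ''q'' as the inverse of the largest proportion; the function names and example proportions are hypothetical.

```python
import math


def hill_number(proportions, q):
    """True diversity (Hill number) of order q for a list of proportions."""
    p = [x for x in proportions if x > 0]   # empty categories are ignored
    if q == 1:
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit as q -> infinity: inverse of the largest proportion.
        return 1 / max(p)
    return sum(x ** q for x in p) ** (1 / (1 - q))


def renyi_entropy(proportions, q):
    """Rényi entropy of order q, computed as the log of the Hill number."""
    return math.log(hill_number(proportions, q))


p = [0.5, 0.3, 0.2]
print(hill_number(p, 0))             # number of non-empty categories
print(renyi_entropy(p, 2))           # equals -ln(sum of squared proportions)
print(1 / hill_number(p, math.inf))  # Berger-Parker index
```

Computing the entropy through the Hill number keeps the two definitions consistent by construction, at the cost of one extra exponentiation.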
Different indices give different values of variation, and may be used for different purposes: several are used and critiqued in the sociology literature especially.
If one wishes to simply make ordinal comparisons between samples (is one sample more or less varied than another), the choice of IQV is relatively less important, as they will often give the same ordering.
In some cases it is useful not to standardize an index to run from 0 to 1 regardless of the number of categories or samples (Wilcox 1973, p. 338), but one generally does standardize it.