Dice-Sørensen coefficient







The Dice-Sørensen coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Lee Raymond Dice[1] and Thorvald Sørensen,[2] who published in 1945 and 1948 respectively.

Name

The index is known by several other names, especially Sørensen–Dice index,[3] Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending (the Danish letter ø is phonetically equivalent to the German/Swedish ö, which can be written as oe in ASCII).

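Other names include:

  • F1 score
  • Czekanowski's binary (non-quantitative) index[4]
  • Measure of genetic similarity[5]
  • Zijdenbos similarity index,[6][7] referring to a 1994 paper of Zijdenbos et al.[8][3]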

Formula

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.
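The set definition above can be sketched in a few lines of Python. The function name `dice_coefficient` and the convention of returning 1.0 for two empty sets (where the formula is 0/0) are assumptions made for illustration, not part of the original definition.

```python
def dice_coefficient(x, y):
    """Sørensen-Dice coefficient of two collections, treated as sets."""
    x, y = set(x), set(y)
    if not x and not y:
        # Assumption: two empty sets count as identical (the formula is 0/0).
        return 1.0
    # Twice the number of shared elements over the sum of the set sizes.
    return 2 * len(x & y) / (len(x) + len(y))


# Two shared elements, three elements in each set: 2*2 / (3+3), about 0.667.
print(dice_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```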

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}.$$

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. The DSC is a quotient of similarity and ranges between 0 and 1.[9] It can be viewed as a similarity measure over sets.
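As a rough illustration of the Boolean form, the sketch below counts TP, FP and FN from two equal-length label sequences and evaluates both 2TP / (2TP + FP + FN) and the Jaccard index TP / (TP + FP + FN). The function name `dice_and_jaccard` and the choice to return 1.0 when there are no positives at all are assumptions for this example.

```python
def dice_and_jaccard(truth, predicted):
    """Return (Dice, Jaccard) for two equal-length sequences of Boolean labels."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)      # true positives
    fp = sum(1 for t, p in pairs if not t and p)  # false positives
    fn = sum(1 for t, p in pairs if t and not p)  # false negatives
    if tp + fp + fn == 0:
        # Assumption: with no positives on either side, treat the labels as identical.
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)  # true positives counted twice
    jaccard = tp / (tp + fp + fn)       # true positives counted once
    return dice, jaccard


# TP = 2, FP = 1, FN = 1: Dice = 4/6 (about 0.667), Jaccard = 2/4 = 0.5.
print(dice_and_jaccard([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```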

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$s_v = \frac{2|a \cdot b|}{|a|^2 + |b|^2}$$

which gives the same outcome over binary vectors and also provides a more general similarity measure over arbitrary vectors.
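A minimal sketch of the vector form, assuming it is taken as twice the absolute dot product over the sum of the squared Euclidean norms. The function name `dice_vectors` and the handling of two all-zero vectors are illustrative assumptions.

```python
def dice_vectors(a, b):
    """Vector form of the Dice coefficient: 2|a.b| / (|a|^2 + |b|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_sq = sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    if norm_sq == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return 2 * abs(dot) / norm_sq


# Indicator vectors of {a, b, c} and {b, c, d} over the universe (a, b, c, d):
# dot product 2, squared norms 3 and 3, so the result matches the set form, 4/6.
print(dice_vectors([1, 1, 1, 0], [0, 1, 1, 1]))
```

For 0/1 vectors the dot product counts the shared elements and each squared norm counts the elements of one set, which is why the two forms agree.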

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities:[10]

$$\operatorname{Dice}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}$$

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:[11]

$$s = \frac{2 n_t}{n_x + n_y}$$

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula gives s = (2 · 1) / (4 + 4) = 0.25.
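The worked example can be reproduced with a short sketch. It follows the text in using sets of bigrams, so a bigram that repeats within one string is counted once; the helper names `bigrams` and `dice_bigram` are assumptions for illustration.

```python
def bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram(x, y):
    """Dice string similarity based on shared character bigrams."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        # Assumption: strings too short to contain a bigram are treated as identical.
        return 1.0
    return 2 * len(bx & by) / (len(bx) + len(by))


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(dice_bigram("night", "nacht"))  # 0.25
```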

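Continuous Dice Coefficient

For a discrete ground truth A and a continuous measure B, the following formula can be used:[12]

$$cDC = \frac{2|A \cap B|}{c|A| + |B|}$$

where c can be computed as follows:

$$c = \frac{\sum_i a_i b_i}{\sum_i a_i \operatorname{sign}(b_i)}$$

Here a_i and b_i are the values of A and B at element i. If $\sum_i a_i \operatorname{sign}(b_i) = 0$, which means there is no overlap between A and B, c is set to 1 arbitrarily.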
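A minimal sketch of the continuous variant, assuming the formulation cDC = 2|A∩B| / (c|A| + |B|) with |A∩B| = Σ a_i b_i, |A| = Σ a_i, |B| = Σ b_i and c = Σ a_i b_i / Σ a_i sign(b_i). It further assumes that a holds 0/1 ground-truth values and b holds values in [0, 1]; the function name `continuous_dice` is hypothetical.

```python
def continuous_dice(a, b):
    """Continuous Dice coefficient for binary ground truth a and scores b in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    intersection = sum(ai * bi for ai, bi in zip(a, b))  # sum of a_i * b_i
    size_a, size_b = sum(a), sum(b)
    # Sum of a_i * sign(b_i); for non-negative b, sign(b_i) is 1 where b_i > 0.
    overlap = sum(ai for ai, bi in zip(a, b) if bi > 0)
    # With no overlap, c is set to 1 arbitrarily, as in the text.
    c = intersection / overlap if overlap > 0 else 1.0
    denominator = c * size_a + size_b
    if denominator == 0:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return 2 * intersection / denominator


truth = [1, 1, 0, 0]
scores = [0.8, 0.6, 0.1, 0.0]
# intersection = 1.4, c = 0.7, denominator = 0.7*2 + 1.5 = 2.9
print(continuous_dice(truth, scores))
```

When b is itself binary, c equals 1 wherever the two overlap, and the expression reduces to the ordinary Dice coefficient.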

Difference from Jaccard

This coefficient is not very different in form from the Jaccard index. In fact, both are equivalent in the sense that given a value S for the Sørensen–Dice coefficient, one can calculate the respective Jaccard index value J and vice versa, using the equations $J = S/(2 - S)$ and $S = 2J/(1 + J)$.
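The conversion between the two indices, J = S / (2 − S) and S = 2J / (1 + J), can be sketched as a pair of helper functions. The names `dice_to_jaccard` and `jaccard_to_dice` are assumptions, and both functions expect values in the range 0 to 1.

```python
def dice_to_jaccard(s):
    """Jaccard index corresponding to a Sørensen-Dice value s in [0, 1]."""
    return s / (2 - s)


def jaccard_to_dice(j):
    """Sørensen-Dice value corresponding to a Jaccard index j in [0, 1]."""
    return 2 * j / (1 + j)


# {a, b, c} versus {b, c, d}: Jaccard = 2/4 = 0.5, Dice = 4/6.
print(jaccard_to_dice(0.5))     # about 0.667
print(dice_to_jaccard(2 / 3))   # about 0.5
```

Both maps are strictly increasing on the unit interval, so ranking pairs by either index gives the same order.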

Since the Sørensen–Dice coefficient does not satisfy the triangle inequality, it can be considered a semimetric version of the Jaccard index.[4]

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

$$d = 1 - \frac{2|X \cap Y|}{|X| + |Y|}$$

is not a proper distance metric as it does not satisfy the triangle inequality.[4] The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, and the distance between the third and each of the others is one-third. To satisfy the triangle inequality, the sum of any two of these three sides must be greater than or equal to the remaining side. However, the distance between {a} and {a,b} plus the distance between {b} and {a,b} equals 2/3, which is less than the distance between {a} and {b}, which is 1.
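The counterexample can be checked mechanically. The sketch below uses exact fractions and compares the Dice distance, 1 minus the coefficient, with the Jaccard distance on the same three sets; the function names are assumptions, and the helpers expect non-empty sets.

```python
from fractions import Fraction


def dice_distance(x, y):
    """1 minus the Sørensen-Dice coefficient, as an exact fraction."""
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


def jaccard_distance(x, y):
    """1 minus the Jaccard index, as an exact fraction."""
    return 1 - Fraction(len(x & y), len(x | y))


def triangle_holds(dist, p, q, r):
    """True if going from p to r via q is at least as long as going directly."""
    return dist(p, q) + dist(q, r) >= dist(p, r)


a, b, ab = {"a"}, {"b"}, {"a", "b"}
print(dice_distance(a, b))                       # 1
print(dice_distance(a, ab) + dice_distance(ab, b))  # 2/3
print(triangle_holds(dice_distance, a, ab, b))      # False
print(triangle_holds(jaccard_distance, a, ab, b))   # True
```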

Applications

The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960[13]). Justification for its use is primarily empirical rather than theoretical, although it can be justified theoretically as the intersection of two fuzzy sets.[14] Compared to Euclidean distance, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers.[15] More recently, the Dice score (and variations such as logDice, which takes its logarithm) has become popular in computer lexicography for measuring the lexical association score of two given words.[16] logDice is also used as part of the Mash distance for genome and metagenome distance estimation.[17] Finally, Dice is used in image segmentation, in particular for comparing algorithm output against reference masks in medical applications.[8]

Abundance version

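The expression is easily extended to abundance instead of presence/absence of species. This quantitative version is known by several names:

  • Quantitative Sørensen–Dice index[4]
  • Quantitative Sørensen index[4]
  • Quantitative Dice index[4]
  • Bray–Curtis similarity (1 minus the Bray–Curtis dissimilarity)[4]
  • Czekanowski's quantitative index[4]
  • Steinhaus index[4]
  • Pielou's percentage similarity[4]
  • 1 minus the Hellinger distance[18]
  • Proportion of specific agreement[19] or positive agreement[20]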
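One common way to write the abundance form is the Bray–Curtis similarity, 2 Σ min(x_i, y_i) / (Σ x_i + Σ y_i), where x_i and y_i are the abundances of species i at two sites. The article does not spell this formula out, so the sketch below is an assumption about which form is meant; the function name `quantitative_dice` and the sample counts are made up for illustration.

```python
def quantitative_dice(x, y):
    """Abundance-based Dice (Bray-Curtis similarity) for two count vectors."""
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        # Assumption: two empty samples are treated as identical.
        return 1.0
    # Shared abundance: for each species, the smaller of the two counts.
    shared = sum(min(xi, yi) for xi, yi in zip(x, y))
    return 2 * shared / total


# Hypothetical counts of three species at two sites:
# shared = 6 + 0 + 4 = 10, total = 17 + 16 = 33, so the result is 20/33.
print(quantitative_dice([6, 7, 4], [10, 0, 6]))
```

With 0/1 presence/absence vectors the minimum counts the shared species, so this expression reduces to the set-based coefficient.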

See also

References

  1. Dice, Lee R. (1945). "Measures of the Amount of Ecologic Association Between Species". Ecology. 26 (3): 297–302. doi:10.2307/1932409. JSTOR 1932409. S2CID 53335638.
  2. Sørensen, T. (1948). "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons". Kongelige Danske Videnskabernes Selskab. 5 (4): 1–34.
  3. Carass, A.; Roy, S.; Gherman, A.; Reinhold, J.C.; Jesson, A.; et al. (2020). "Evaluating White Matter Lesion Segmentations with Refined Sørensen-Dice Analysis". Scientific Reports. 10 (1): 8242. Bibcode:2020NatSR..10.8242C. doi:10.1038/s41598-020-64803-w. ISSN 2045-2322. PMC 7237671. PMID 32427874.
  4. Gallagher, E.D. (1999). COMPAH Documentation. University of Massachusetts, Boston.
  5. Nei, M.; Li, W.H. (1979). "Mathematical model for studying genetic variation in terms of restriction endonucleases". PNAS. 76 (10): 5269–5273. Bibcode:1979PNAS...76.5269N. doi:10.1073/pnas.76.10.5269. PMC 413122. PMID 291943.
  6. Prescott, J.W.; Pennell, M.; Best, T.M.; Swanson, M.S.; Haq, F.; Jackson, R.; Gurcan, M.N. (2009). "An automated method to segment the femur for osteoarthritis research". 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE. pp. 6364–6367. doi:10.1109/iembs.2009.5333257. PMC 2826829.
  7. Swanson, M.S.; Prescott, J.W.; Best, T.M.; Powell, K.; Jackson, R.D.; Haq, F.; Gurcan, M.N. (2010). "Semi-automated segmentation to assess the lateral meniscus in normal and osteoarthritic knees". Osteoarthritis and Cartilage. 18 (3): 344–353. doi:10.1016/j.joca.2009.10.004. ISSN 1063-4584. PMC 2826568. PMID 19857510.
  8. Zijdenbos, A.P.; Dawant, B.M.; Margolin, R.A.; Palmer, A.C. (1994). "Morphometric analysis of white matter lesions in MR images: method and validation". IEEE Transactions on Medical Imaging. 13 (4): 716–724. doi:10.1109/42.363096. ISSN 0278-0062. PMID 18218550.
  9. http://www.sekj.org/PDF/anbf40/anbf40-415.pdf
  10. van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths. ISBN 3-642-12274-4.
  11. Kondrak, Grzegorz; Marcu, Daniel; Knight, Kevin (2003). "Cognates Can Improve Statistical Translation Models" (PDF). Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. pp. 46–48.
  12. Shamir, Reuben R.; Duchin, Yuval; Kim, Jinyoung; Sapiro, Guillermo; Harel, Noam (2018-04-25). "Continuous Dice Coefficient: a Method for Evaluating Probabilistic Segmentations": 306977. arXiv:1906.11031. doi:10.1101/306977. S2CID 90993940.
  13. Looman, J.; Campbell, J.B. (1960). "Adaptation of Sorensen's K (1948) for estimating unit affinities in prairie vegetation". Ecology. 41 (3): 409–416. doi:10.2307/1933315. JSTOR 1933315.
  14. Roberts, D.W. (1986). "Ordination on the basis of fuzzy set theory". Vegetatio. 66 (3): 123–131. doi:10.1007/BF00039905. S2CID 12573576.
  15. McCune, Bruce; Grace, James (2002). Analysis of Ecological Communities. Mjm Software Design. ISBN 0-9721290-0-6.
  16. Rychlý, P. (2008). "A lexicographer-friendly association score". Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing RASLAN 2008: 6–9.
  17. Ondov, Brian D.; et al. (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (1): 1–14.
  18. Bray, J. Roger; Curtis, J. T. (1957). "An Ordination of the Upland Forest Communities of Southern Wisconsin". Ecological Monographs. 27 (4): 326–349. doi:10.2307/1942268. JSTOR 1942268.
  19. Ayappa, Indu; Norman, Robert G (2000). "Non-Invasive Detection of Respiratory Effort-Related Arousals (RERAs) by a Nasal Cannula/Pressure Transducer System". Sleep. 23 (6): 763–771. doi:10.1093/sleep/23.6.763. PMID 11007443.
  20. John Uebersax. "Raw Agreement Indices".
    This page was last edited on 24 June 2024, at 02:41 (UTC).
