




URI normalization










URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.

Types of URI normalization.

Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons (e.g. to more easily intercept security risks coming from client requests, or to use only one absolute file name for each resource stored in their caches, named in log files, etc.).

Normalization process


There are several types of normalization that may be performed. Some of them always preserve semantics and some may not.

Normalizations that preserve semantics


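The following normalizations are described in RFC 3986 [1] to result in equivalent URIs:

  • Converting percent-encoded triplets to uppercase: http://example.com/foo%2a → http://example.com/foo%2A
  • Converting the scheme and host to lowercase: HTTP://User@Example.COM/Foo → http://User@example.com/Foo
  • Decoding percent-encoded triplets of unreserved characters: http://example.com/%7Efoo → http://example.com/~foo
  • Removing dot-segments: http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
  • Converting an empty path to a "/" path: http://example.com → http://example.com/
  • Removing the default port: http://example.com:80/ → http://example.com/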
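The six transformations above can be sketched in Python with the standard `urllib.parse` module. This is a minimal illustration, not a complete RFC 3986 implementation: the names `normalize`, `remove_dot_segments` and `DEFAULT_PORTS` are assumptions made for this example, only the http and https default ports are handled, and dot-segment removal is applied to absolute paths only.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed table of default ports; only http and https are covered here.
DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_percent(match):
    """Decode a triplet for an unreserved character, else uppercase its hex digits."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path."""
    if not path.startswith("/"):
        return path
    segments = path.split("/")[1:]
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if output:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." leaves a directory, so keep the final slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/" + "/".join(output)


def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Lowercase the host but leave the userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    hostport = hostport.lower()
    default = DEFAULT_PORTS.get(scheme)
    if default is not None and hostport.endswith(f":{default}"):
        hostport = hostport[: -len(f":{default}")]
    path = remove_dot_segments(_PCT.sub(_fix_percent, parts.path))
    if parts.netloc and not path:
        path = "/"
    query = _PCT.sub(_fix_percent, parts.query)
    fragment = _PCT.sub(_fix_percent, parts.fragment)
    return urlunsplit((scheme, userinfo + at + hostport, path, query, fragment))


print(normalize("HTTP://User@Example.COM/Foo"))
print(normalize("http://example.com/foo/./bar/baz/../qux"))
```

Percent-encoding is normalized before dot-segments are removed, so that an encoded dot such as %2E is decoded first and then treated as an ordinary dot-segment.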

Normalizations that usually preserve semantics


For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but the standards do not guarantee that they will:

  • Adding a trailing "/" to a non-empty path: http://example.com/foo → http://example.com/foo/
    However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
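The redirect check mentioned above can be sketched as a pure function. The helper names `with_trailing_slash` and `slash_variant_confirmed` are hypothetical, and the example assumes the caller has already observed the redirect target (for instance from a Location header recorded during a crawl); no network request is made here.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(uri):
    """Return the URI with "/" appended to its path if it is not already there."""
    parts = urlsplit(uri)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(parts._replace(path=path))


def slash_variant_confirmed(uri, redirect_location):
    """True when an observed redirect points at the trailing-slash form of uri."""
    if redirect_location is None:
        return False
    return redirect_location == with_trailing_slash(uri)


# The redirect target below is made-up input for illustration.
print(slash_variant_confirmed("http://example.com/foo", "http://example.com/foo/"))
```

A redirect is only an indication of equivalence, so a cautious normalizer might apply the rewrite only to URIs for which such a redirect has actually been seen.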

Normalizations that change semantics


Applying the following normalizations results in a semantically different URI, although it may refer to the same resource:

  • Removing the directory index: http://example.com/a/index.html → http://example.com/a/ and http://example.com/default.asp → http://example.com/
  • Removing the fragment: http://example.com/bar.html#section1 → http://example.com/bar.html
    However, AJAX applications frequently use the value in the fragment.
  • Replacing an IP address with a domain name: http://208.77.188.166/ → http://example.com/
    The reverse replacement is rarely safe due to virtual web servers.
  • Limiting protocols: https://example.com/ → http://example.com/
  • Removing duplicate slashes: http://example.com/foo//bar.html → http://example.com/foo/bar.html
  • Removing or adding "www" as the first domain label: http://www.example.com/ → http://example.com/
  • Sorting the query parameters: http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en
    However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.[9]
  • Removing unused query variables: http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
    Note that a parameter without a value is not necessarily an unused parameter.
  • Removing default query parameters: http://example.com/display?id=&sort=ascending → http://example.com/display
  • Removing the "?" when the query is empty: http://example.com/display? → http://example.com/display
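A few of these semantics-changing rewrites can be sketched with `urllib.parse`. The function name `aggressive_normalize`, the `INDEX_NAMES` tuple and the `drop_params` argument are assumptions made for this example; which index file names or query variables are safe to drop depends on the site. The IP address, protocol, "www" and default-parameter rewrites are left out because they need site-specific knowledge.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed directory-index file names; a real site may use others.
INDEX_NAMES = ("index.html", "default.asp")


def aggressive_normalize(uri, drop_params=()):
    parts = urlsplit(uri)
    # Collapse runs of slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path)
    # Drop a trailing directory-index file name.
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        path = head + "/"
    # Remove caller-specified variables, then sort by name. The sort is
    # stable, so repeated variables keep their relative order.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(key, value) for key, value in pairs if key not in drop_params]
    pairs.sort(key=lambda pair: pair[0])
    # Passing an empty fragment drops it from the result.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(pairs), ""))


print(aggressive_normalize("http://example.com/display?lang=en&article=fred"))
print(aggressive_normalize("http://example.com/a/index.html"))
```

Note that `urlencode` re-encodes the query, so characters in parameter values may come out percent-encoded differently from the input. Sorting by name only, rather than by name and value, is a deliberate choice here so that a variable that appears several times is not reordered.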

Normalization based on URI lists


Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI

http://example.com/story?id=xyz

appears in a crawl log several times along with

http://example.com/story_xyz

we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.
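Applying a site-specific rule of this kind to a URI list might look like the sketch below. It is not the DustBuster algorithm, which discovers rules from the list; the rule here is written by hand for the story example above, and the names `RULES`, `apply_rules` and `deduplicate` are hypothetical.

```python
import re

# Hand-written, hypothetical rule: story?id=X is treated as the same
# resource as story_X. A rule-discovery heuristic would produce such
# rules automatically; this example only applies one.
RULES = [
    (re.compile(r"^(http://example\.com)/story\?id=([^&#]+)$"), r"\1/story_\2"),
]


def apply_rules(uri, rules=RULES):
    """Rewrite a URI with every (pattern, replacement) rule in order."""
    for pattern, replacement in rules:
        uri = pattern.sub(replacement, uri)
    return uri


def deduplicate(uris, rules=RULES):
    """Return the canonical forms of the URIs, without repeats, in first-seen order."""
    seen = set()
    unique = []
    for uri in uris:
        canonical = apply_rules(uri, rules)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


crawl_log = [
    "http://example.com/story?id=xyz",
    "http://example.com/story_xyz",
]
print(deduplicate(crawl_log))
```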


References

  • RFC 3986, Section 6.2.2.1. Case Normalization
  • RFC 3986, Section 6.2.2.1. Case Normalization
  • RFC 3986, Section 6.2.2.3. Path Segment Normalization
  • RFC 3986, Section 5.2.4. Remove Dot Segments
  • RFC 3986, Section 6.2.2.3. Path Segment Normalization
  • RFC 3986, Section 6.2.3. Scheme-Based Normalization
  • RFC 3986, Section 6.2.3. Scheme-Based Normalization
  • "jQuery 1.4 $.param demystified". Ben Alman. December 20, 2009. Retrieved August 24, 2013.

    Last edited on 16 January 2024, at 07:23  






    Content is available under CC BY-SA 4.0 unless otherwise noted.


