URL normalization

URL normalization is the process of modifying and standardizing URLs in a consistent manner. Its goal is to transform a "raw" URL into a normalized form, so that two syntactically different URLs can be recognized as equivalent.

Our programs follow the rules below to normalize URLs according to RFC 3986.

scheme - normalized to lowercase.
userinfo - non-ASCII characters are converted to UTF-8 and percent-encoded.
host - normalized to lowercase; internationalized domain names are converted to IDNA (Punycode) encoding.
port - removed if empty or equal to the scheme's default port.
path - non-ASCII characters are converted to UTF-8 and percent-encoded; dot segments are removed.
query - non-ASCII characters are encoded in the charset of the page itself and then percent-encoded.
fragment - non-ASCII characters are converted to UTF-8 and percent-encoded.
