java.net.URI
Represents a Uniform Resource Identifier (URI) reference.
Aside from some minor deviations noted below, an instance of this class represents a URI reference as defined by RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, amended by RFC 2732: Format for Literal IPv6 Addresses in URLs. The Literal IPv6 address format also supports scope_ids. The syntax and usage of scope_ids is described in the Inet6Address class documentation.
This class provides constructors for creating URI instances from their components or by parsing their string forms, methods for accessing the various components of an instance, and methods for normalizing, resolving, and relativizing URI instances. Instances of this class are immutable.
URI syntax and components
At the highest level a URI reference (hereinafter simply "URI") in string form has the syntax
[scheme:]scheme-specific-part[#fragment]
where square brackets [...] delineate optional components and the characters : and # stand for themselves.
An absolute URI specifies a scheme; a URI that is not absolute is said to be relative. URIs are also classified according to whether they are opaque or hierarchical.
An opaque URI is an absolute URI whose scheme-specific part does not begin with a slash character ('/'). Opaque URIs are not subject to further parsing. Some examples of opaque URIs are:
mailto:[email protected]
news:comp.lang.java
urn:isbn:096139210x
A hierarchical URI is either an absolute URI whose scheme-specific part begins with a slash character, or a relative URI, that is, a URI that does not specify a scheme. Some examples of hierarchical URIs are:
http://java.sun.com/j2se/1.3/
docs/guide/collections/designfaq.html#28
../../../demo/jfc/SwingSet2/src/SwingSet2.java
file:///~/calendar
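The two classifications above can be checked at run time with the isAbsolute and isOpaque methods. The following is a minimal sketch; the class name UriKindDemo and the helper classify are illustrative names chosen here, not part of the API.

```java
import java.net.URI;

class UriKindDemo {
    // Describes a URI string as absolute/relative and opaque/hierarchical.
    static String classify(String s) {
        URI u = URI.create(s);
        String scope = u.isAbsolute() ? "absolute" : "relative";
        String kind = u.isOpaque() ? "opaque" : "hierarchical";
        return scope + " " + kind;
    }

    public static void main(String[] args) {
        String[] samples = {
            "news:comp.lang.java",
            "urn:isbn:096139210x",
            "http://java.sun.com/j2se/1.3/",
            "docs/guide/collections/designfaq.html#28",
            "file:///~/calendar"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A relative URI is always reported as hierarchical, since an opaque URI must specify a scheme.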
A hierarchical URI is subject to further parsing according to the syntax
[scheme:][//authority][path][?query][#fragment]
where the characters :, /, ?, and # stand for themselves. The scheme-specific part of a hierarchical URI consists of the characters between the scheme and fragment components.
The authority component of a hierarchical URI is, if specified, either server-based or registry-based. A server-based authority parses according to the familiar syntax
[user-info@]host[:port]
where the characters @ and : stand for themselves. Nearly all URI schemes currently in use are server-based. An authority component that does not parse in this way is considered to be registry-based.
The path component of a hierarchical URI is itself said to be absolute if it begins with a slash character ('/'); otherwise it is relative. The path of a hierarchical URI that is either absolute or specifies an authority is always absolute.
All told, then, a URI instance has the following nine components:
Component | Type
scheme | String
scheme-specific-part | String
authority | String
user-info | String
host | String
port | int
path | String
query | String
fragment | String
In a given instance any particular component is either undefined or defined with a distinct value. Undefined string components are represented by null, while undefined integer components are represented by -1. A string component may be defined to have the empty string as its value; this is not equivalent to that component being undefined.
Whether a particular component is or is not defined in an instance depends upon the type of the URI being represented. An absolute URI has a scheme component. An opaque URI has a scheme, a scheme-specific part, and possibly a fragment, but has no other components. A hierarchical URI always has a path (though it may be empty) and a scheme-specific-part (which at least contains the path), and may have any of the other components. If the authority component is present and is server-based then the host component will be defined and the user-information and port components may be defined.
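As an illustration of the nine components and of how undefined values appear, the sketch below prints every component of a URI through the standard accessor methods. The class name UriComponentsDemo, the helper components, and the example.com address are assumptions made for this example only.

```java
import java.net.URI;

class UriComponentsDemo {
    // Lists all nine components; undefined strings print as "null", an undefined port as -1.
    static String components(String s) {
        URI u = URI.create(s);
        return String.join(" | ",
            "scheme=" + u.getScheme(),
            "ssp=" + u.getSchemeSpecificPart(),
            "authority=" + u.getAuthority(),
            "userInfo=" + u.getUserInfo(),
            "host=" + u.getHost(),
            "port=" + u.getPort(),
            "path=" + u.getPath(),
            "query=" + u.getQuery(),
            "fragment=" + u.getFragment());
    }

    public static void main(String[] args) {
        // Hierarchical URI with a server-based authority (hypothetical address).
        System.out.println(components("http://user@example.com:8080/docs/index.html?lang=en#top"));
        // Opaque URI: only scheme and scheme-specific part are defined.
        System.out.println(components("urn:isbn:096139210x"));
        // Relative URI: no scheme, but a path is always defined.
        System.out.println(components("docs/guide/"));
    }
}
```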
Operations on URI instances
The key operations supported by this class are those of normalization, resolution, and relativization.
Normalization is the process of removing unnecessary "." and ".." segments from the path component of a hierarchical URI. Each "." segment is simply removed. A ".." segment is removed only if it is preceded by a non-".." segment. Normalization has no effect upon opaque URIs.
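The normalization rules can be tried out with the normalize method. This is a small sketch; NormalizeDemo and normalized are illustrative names, and the sample paths are made up for the example.

```java
import java.net.URI;

class NormalizeDemo {
    // Parses the string, normalizes the path, and returns the string form.
    static String normalized(String s) {
        return URI.create(s).normalize().toString();
    }

    public static void main(String[] args) {
        // "." is dropped; "b/.." cancels out.
        System.out.println(normalized("a/./b/../c"));
        // Leading ".." segments have no preceding non-".." segment, so they stay.
        System.out.println(normalized("../../x/./y"));
        // Works on the path of an absolute hierarchical URI as well.
        System.out.println(normalized("http://java.sun.com/j2se/1.3/docs/../demo/./jfc"));
        // Opaque URI: returned unchanged.
        System.out.println(normalized("urn:a/./b"));
    }
}
```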
Resolution is the process of resolving one URI against another, base URI. The resulting URI is constructed from components of both URIs in the manner specified by RFC 2396, taking components from the base URI for those not specified in the original. For hierarchical URIs, the path of the original is resolved against the path of the base and then normalized. The result, for example, of resolving
docs/guide/collections/designfaq.html#28 (1)
against the base URI http://java.sun.com/j2se/1.3/ is the result URI
http://java.sun.com/j2se/1.3/docs/guide/collections/designfaq.html#28
Resolving the relative URI
../../../demo/jfc/SwingSet2/src/SwingSet2.java (2)
against this result yields, in turn,
http://java.sun.com/j2se/1.3/demo/jfc/SwingSet2/src/SwingSet2.java
Resolution of both absolute and relative URIs, and of both absolute and relative paths in the case of hierarchical URIs, is supported. Resolving the URI file:///~calendar against any other URI simply yields the original URI, since it is absolute. Resolving the relative URI (2) above against the relative base URI (1) yields the normalized, but still relative, URI
demo/jfc/SwingSet2/src/SwingSet2.java
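The resolution examples above can be reproduced with the resolve method. The sketch below reuses the URIs from the text; ResolveDemo and its helper are illustrative names only.

```java
import java.net.URI;

class ResolveDemo {
    // Resolves the reference against the base and returns the result as a string.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://java.sun.com/j2se/1.3/";
        String rel1 = "docs/guide/collections/designfaq.html#28";          // (1)
        String rel2 = "../../../demo/jfc/SwingSet2/src/SwingSet2.java";    // (2)

        String first = resolve(base, rel1);
        System.out.println(first);
        // Resolving (2) against the previous result.
        System.out.println(resolve(first, rel2));
        // An absolute reference is returned as is.
        System.out.println(resolve(base, "file:///~calendar"));
        // Relative base and relative reference: the result stays relative.
        System.out.println(resolve(rel1, rel2));
    }
}
```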
Relativization, finally, is the inverse of resolution: For any two normalized URIs u and v,
u.relativize(u.resolve(v)).equals(v) and
u.resolve(u.relativize(v)).equals(v).
This operation is often useful when constructing a document containing URIs that must be made relative to the base URI of the document wherever possible. For example, relativizing the URI
http://java.sun.com/j2se/1.3/docs/guide/index.html
against the base URI
http://java.sun.com/j2se/1.3
yields the relative URI docs/guide/index.html.
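The relativization example can be checked with the relativize method, as in the sketch below. RelativizeDemo and its helper are illustrative names, and the example.com URI is a made-up address used to show the case where the target does not lie under the base.

```java
import java.net.URI;

class RelativizeDemo {
    // Relativizes the target against the base and returns the result as a string.
    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        String base = "http://java.sun.com/j2se/1.3";
        // The target lies under the base, so a relative URI comes back.
        System.out.println(relativize(base, "http://java.sun.com/j2se/1.3/docs/guide/index.html"));
        // Different authority: the target is returned unchanged.
        System.out.println(relativize(base, "http://example.com/other"));
    }
}
```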
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
alpha | The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
digit | The US-ASCII decimal digit characters, '0' through '9'
alphanum | All alpha and digit characters
unreserved | All alphanum characters together with those in the string "_-!.~'()*"
punct | The characters in the string ",;:$&+="
reserved | All punct characters together with those in the string "?/[]@"
escaped | Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
other | The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
Escaped octets, quotation, encoding, and decoding
RFC 2396 allows escaped octets to appear in the user-info, path, query, and fragment components. Escaping serves two purposes in URIs:
· To encode non-US-ASCII characters when a URI is required to conform strictly to RFC 2396 by not containing any other characters.
· To quote characters that are otherwise illegal in a component. The user-info, path, query, and fragment components differ slightly in terms of which characters are considered legal and illegal.
These purposes are served in this class by three related operations:
· A character is encoded by replacing it with the sequence of escaped octets that represent that character in the UTF-8 character set. The Euro currency symbol ('\u20AC'), for example, is encoded as "%E2%82%AC". (Deviation from RFC 2396, which does not specify any particular character set.)
· An illegal character is quoted simply by encoding it. The space character, for example, is quoted by replacing it with "%20". UTF-8 contains US-ASCII, hence for US-ASCII characters this transformation has exactly the effect required by RFC 2396.
· A sequence of escaped octets is decoded by replacing it with the sequence of characters that it represents in the UTF-8 character set. UTF-8 contains US-ASCII, hence decoding has the effect of de-quoting any quoted US-ASCII characters as well as that of decoding any encoded non-US-ASCII characters. If a decoding error occurs when decoding the escaped octets then the erroneous octets are replaced by '\uFFFD', the Unicode replacement character.
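The three operations can be observed on a single path, as in the following sketch. It assumes a made-up path containing a space and the Euro sign; EscapeDemo and its helpers are illustrative names. The multi-argument constructor is used to quote the space, toASCIIString to encode the Euro sign, and getPath to decode.

```java
import java.net.URI;
import java.net.URISyntaxException;

class EscapeDemo {
    // Quotes illegal characters and encodes non-US-ASCII characters as UTF-8 escaped octets.
    static String encode(String path) {
        try {
            return new URI(null, null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes the escaped octets of a raw path.
    static String decode(String rawPath) {
        return URI.create(rawPath).getPath();
    }

    public static void main(String[] args) {
        String raw = encode("/price list/\u20AC");
        System.out.println(raw);
        System.out.println(decode(raw).equals("/price list/\u20AC"));
    }
}
```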
These operations are exposed in the constructors and methods of this class as follows:
· The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
· The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
· The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.
· The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.
· The toString method returns a URI string with all necessary quotation but which may contain other characters.
· The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
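The difference between the constructors and between the raw and decoded accessors can be seen in the sketch below. The host example.com, the path, and the fragment are hypothetical values, and QuotingDemo with its helpers are illustrative names.

```java
import java.net.URI;
import java.net.URISyntaxException;

class QuotingDemo {
    // Multi-argument constructor: the space and the percent character are quoted for us.
    static URI build() {
        try {
            return new URI("http", "example.com", "/a b/100%", "sec 1");
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Single-argument constructor: illegal characters must already be quoted.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        URI u = build();
        System.out.println(u);                    // string form with quotation applied
        System.out.println(u.getRawPath());       // raw: escaped octets kept
        System.out.println(u.getPath());          // decoded: escaped octets interpreted
        System.out.println(u.getRawFragment());
        System.out.println(u.getFragment());
        System.out.println(parses("http://example.com/a b"));    // unquoted space
        System.out.println(parses("http://example.com/a%20b"));  // quoted space
    }
}
```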
Identities
For any URI u, it is always the case that
new URI(u.toString()).equals(u).
For any URI u that does not contain redundant syntax such as two slashes before an empty authority (as in file:///tmp/) or a colon following a host name but no port (as in http://java.sun.com:), and that does not encode characters except those that must be quoted, the following identities also hold:
new URI(u.getScheme(),
u.getSchemeSpecificPart(),
u.getFragment())
.equals(u)
in all cases,
new URI(u.getScheme(),
u.getAuthority(),
u.getPath(), u.getQuery(),
u.getFragment())
.equals(u)
if u is hierarchical, and
new URI(u.getScheme(),
u.getUserInfo(), u.getHost(), u.getPort(),
u.getPath(), u.getQuery(),
u.getFragment())
.equals(u)
if u is hierarchical and has either no authority or a server-based authority.
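The identities can be exercised directly, as in this sketch. It assumes sample URIs that contain no redundant syntax and whose authority, if any, is server-based; the example.com address and the names IdentityDemo and identitiesHold are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdentityDemo {
    // Checks the string round trip and the three-argument identity for any URI,
    // plus the five- and seven-argument identities when the URI is hierarchical.
    static boolean identitiesHold(String s) {
        try {
            URI u = new URI(s);
            boolean ok = new URI(u.toString()).equals(u)
                && new URI(u.getScheme(), u.getSchemeSpecificPart(), u.getFragment()).equals(u);
            if (!u.isOpaque()) {
                ok = ok
                    && new URI(u.getScheme(), u.getAuthority(),
                               u.getPath(), u.getQuery(), u.getFragment()).equals(u)
                    && new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                               u.getPath(), u.getQuery(), u.getFragment()).equals(u);
            }
            return ok;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(identitiesHold("http://user@example.com:8080/docs/index.html?lang=en#top"));
        System.out.println(identitiesHold("urn:isbn:096139210x"));
        System.out.println(identitiesHold("docs/guide/index.html#intro"));
    }
}
```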
URIs, URLs, and URNs
A URI is a uniform resource identifier while a URL is a uniform resource locator. Hence every URL is a URI, abstractly speaking, but not every URI is a URL. This is because there is another subcategory of URIs, uniform resource names (URNs), which name resources but do not specify how to locate them. The mailto, news, and isbn URIs shown above are examples of URNs.
The conceptual distinction between URIs and URLs is reflected in the differences between this class and the URL class.
An instance of this class represents a URI reference in the syntactic sense defined by RFC 2396. A URI may be either absolute or relative. A URI string is parsed according to the generic syntax without regard to the scheme, if any, that it specifies. No lookup of the host, if any, is performed, and no scheme-dependent stream handler is constructed. Equality, hashing, and comparison are defined strictly in terms of the character content of the instance. In other words, a URI instance is little more than a structured string that supports the syntactic, scheme-independent operations of comparison, normalization, resolution, and relativization.
An instance of the URL class, by contrast, represents the syntactic components of a URL together with some of the information required to access the resource that it describes. A URL must be absolute, that is, it must always specify a scheme. A URL string is parsed according to its scheme. A stream handler is always established for a URL, and in fact it is impossible to create a URL instance for a scheme for which no handler is available. Equality and hashing depend upon both the scheme and the Internet address of the host, if any; comparison is not defined. In other words, a URL is a structured string that supports the syntactic operation of resolution as well as the network I/O operations of looking up the host and opening a connection to the specified resource.
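The stream-handler requirement is one visible difference between the two classes. In the sketch below, a URI with the made-up scheme "foo" parses without complaint, while converting it to a URL fails because no handler is registered for that scheme. HandlerDemo and hasHandler are illustrative names, and the example assumes no custom handler for "foo" has been installed.

```java
import java.net.MalformedURLException;
import java.net.URI;

class HandlerDemo {
    // Returns true if the URI (already parsed successfully) can be turned into a URL.
    static boolean hasHandler(String s) {
        URI u = URI.create(s); // purely syntactic, scheme-independent parsing
        try {
            u.toURL();         // requires a stream handler for the scheme
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasHandler("http://java.sun.com/j2se/1.3/"));
        System.out.println(hasHandler("foo://host/path")); // hypothetical scheme
    }
}
```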
Since:
1.4
Author:
Mark Reinhold
See Also:
RFC 2279: UTF-8, a transformation format of ISO 10646,
RFC 2373: IPv6 Addressing Architecture,
RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax,
RFC 2732: Format for Literal IPv6 Addresses in URLs,
URISyntaxException
java.net.URL
Class URL represents a Uniform Resource Locator, a pointer to a "resource" on the World Wide Web. A resource can be something as simple as a file or a directory, or it can be a reference to a more complicated object, such as a query to a database or to a search engine. More information on the types of URLs and their formats can be found at:
http://www.socs.uts.edu.au/MosaicDocs-old/url-primer.html
In general, a URL can be broken into several parts. The previous example of a URL indicates that the protocol to use is http (HyperText Transfer Protocol) and that the information resides on a host machine named www.socs.uts.edu.au. The information on that host machine is named /MosaicDocs-old/url-primer.html. The exact meaning of this name on the host machine is both protocol dependent and host dependent. The information normally resides in a file, but it could be generated on the fly. This component of the URL is called the path component.
A URL can optionally specify a "port", which is the port number to which the TCP connection is made on the remote host machine. If the port is not specified, the default port for the protocol is used instead. For example, the default port for http is 80. An alternative port could be specified as:
http://www.socs.uts.edu.au:80/MosaicDocs-old/url-primer.html
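The port rule can be sketched with the getPort and getDefaultPort methods of URL. PortDemo and effectivePort are illustrative names, and the URL is obtained through URI.toURL() as an assumption of this example rather than a requirement. No connection is opened.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class PortDemo {
    // Returns the explicit port if one is given, otherwise the protocol's default port.
    static int effectivePort(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // No port in the URL: getPort() is -1, so the http default applies.
        System.out.println(effectivePort("http://www.socs.uts.edu.au/MosaicDocs-old/url-primer.html"));
        // Explicit port, here a hypothetical 8080.
        System.out.println(effectivePort("http://www.socs.uts.edu.au:8080/MosaicDocs-old/url-primer.html"));
    }
}
```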
The syntax of URL is defined by RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, amended by RFC 2732: Format for Literal IPv6 Addresses in URLs. The Literal IPv6 address format also supports scope_ids. The syntax and usage of scope_ids is described in the Inet6Address class documentation.
A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters. For example,
http://java.sun.com/index.html#chapter1
This fragment is not technically part of the URL. Rather, it indicates that after the specified resource is retrieved, the application is specifically interested in that part of the document that has the tag chapter1 attached to it. The meaning of a tag is resource specific.
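The fragment is available separately from the file part through the getRef method, as in this short sketch. FragmentDemo and fileAndRef are illustrative names.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class FragmentDemo {
    // Returns the file part and the fragment ("ref") of a URL; the ref is null if absent.
    static String[] fileAndRef(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            return new String[] { url.getFile(), url.getRef() };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = fileAndRef("http://java.sun.com/index.html#chapter1");
        System.out.println("file = " + parts[0]);
        System.out.println("ref  = " + parts[1]);
    }
}
```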
An application can also specify a "relative URL", which contains only enough information to reach the resource relative to another URL. Relative URLs are frequently used within HTML pages. For example, if the contents of the URL:
http://java.sun.com/index.html
contained within it the relative URL:
FAQ.html
it would be a shorthand for:
http://java.sun.com/FAQ.html
The relative URL need not specify all the components of a URL. If the protocol, host name, or port number is missing, the value is inherited from the fully specified URL. The file component must be specified. The optional fragment is not inherited.
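One way to expand a relative URL is to resolve it as a URI and then convert the result, as sketched below. This route through java.net.URI is a choice made for the example; RelativeUrlDemo and expand are illustrative names.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class RelativeUrlDemo {
    // Resolves the relative reference against the base and returns the resulting URL.
    static URL expand(String base, String relative) {
        try {
            return URI.create(base).resolve(relative).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Protocol and host are inherited from the base.
        System.out.println(expand("http://java.sun.com/index.html", "FAQ.html"));
        // The fragment of the base is not inherited.
        System.out.println(expand("http://java.sun.com/index.html#chapter1", "FAQ.html"));
    }
}
```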
The URL class does not itself encode or decode any URL components according to the escaping mechanism defined in RFC 2396. It is the responsibility of the caller to encode any fields that need to be escaped prior to calling URL, and also to decode any escaped fields that are returned from URL. Furthermore, because URL has no knowledge of URL escaping, it does not recognise equivalence between the encoded or decoded form of the same URL. For example, the two URLs:
http://foo.com/hello world/ and http://foo.com/hello%20world
would be considered not equal to each other.
Note, the java.net.URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use java.net.URI, and to convert between these two classes using toURI() and URI.toURL().
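The recommended round trip might look like the following sketch: a multi-argument URI constructor does the quoting, URI.toURL() produces the URL, and URL.toURI() is used again when a decoded component is needed. EscapedUrlDemo and its helpers are illustrative names; the host foo.com is taken from the example above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class EscapedUrlDemo {
    // Builds an http URL from an unescaped path; URI quotes the illegal characters.
    static URL toUrl(String host, String path) {
        try {
            return new URI("http", host, path, null).toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Recovers the decoded path by converting the URL back to a URI.
    static String decodedPath(URL url) {
        try {
            return url.toURI().getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = toUrl("foo.com", "/hello world/");
        System.out.println(url);              // escaped form, safe to hand to URL
        System.out.println(decodedPath(url)); // decoded by URI, not by URL
    }
}
```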
The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC 2396.
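The difference between the two encodings shows up already for a space character, as in this sketch. FormEncodingDemo and its helpers are illustrative names; the UTF-8 charset is an assumption of the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // HTML form encoding: a space becomes '+'.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses HTML form encoding.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // URI quoting of a path: a space becomes "%20".
    static String uriQuotePath(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("hello world"));
        System.out.println(uriQuotePath("hello world"));
        System.out.println(formDecode("hello+world"));
    }
}
```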
Since:
JDK1.0
Author:
James Gosling