URI in Java

java.net.URI

Represents a Uniform Resource Identifier (URI) reference.

Aside from some minor deviations noted below, an instance of this class represents a URI reference as defined by RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, amended by RFC 2732: Format for Literal IPv6 Addresses in URLs. The Literal IPv6 address format also supports scope_ids. The syntax and usage of scope_ids is described in the documentation of the Inet6Address class.

This class provides constructors for creating URI instances from their components or by parsing their string forms, methods for accessing the various components of an instance, and methods for normalizing, resolving, and relativizing URI instances. Instances of this class are immutable.

URI syntax and components

At the highest level a URI reference (hereinafter simply "URI") in string form has the syntax

[scheme:]scheme-specific-part[#fragment]

where square brackets [...] delineate optional components and the characters : and # stand for themselves.

An absolute URI specifies a scheme; a URI that is not absolute is said to be relative. URIs are also classified according to whether they are opaque or hierarchical.

An opaque URI is an absolute URI whose scheme-specific part does not begin with a slash character ('/'). Opaque URIs are not subject to further parsing. Some examples of opaque URIs are:

mailto:[email protected]

news:comp.lang.java

urn:isbn:096139210x

A hierarchical URI is either an absolute URI whose scheme-specific part begins with a slash character, or a relative URI, that is, a URI that does not specify a scheme. Some examples of hierarchical URIs are:

http://java.sun.com/j2se/1.3/
docs/guide/collections/designfaq.html#28
../../../demo/jfc/SwingSet2/src/SwingSet2.java
file:///~/calendar
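
As a rough illustration of the classification above, the following sketch runs the isOpaque and isAbsolute methods of java.net.URI over some of the example strings. The class name UriKinds and the helper classify are made up for this example and are not part of the API.

```java
import java.net.URI;

class UriKinds {
    /** Labels a URI string as opaque, or as an absolute or relative hierarchical URI. */
    static String classify(String s) {
        URI u = URI.create(s);
        if (u.isOpaque()) {
            return "opaque";
        }
        return u.isAbsolute() ? "hierarchical, absolute" : "hierarchical, relative";
    }

    public static void main(String[] args) {
        String[] samples = {
            "news:comp.lang.java",
            "urn:isbn:096139210x",
            "http://java.sun.com/j2se/1.3/",
            "docs/guide/collections/designfaq.html#28",
            "../../../demo/jfc/SwingSet2/src/SwingSet2.java",
            "file:///~/calendar"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```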

A hierarchical URI is subject to further parsing according to the syntax

[scheme:][//authority][path][?query][#fragment]

where the characters :, /, ?, and # stand for themselves. The scheme-specific part of a hierarchical URI consists of the characters between the scheme and fragment components.

The authority component of a hierarchical URI is, if specified, either server-based or registry-based. A server-based authority parses according to the familiar syntax

[user-info@]host[:port]

where the characters @ and : stand for themselves. Nearly all URI schemes currently in use are server-based. An authority component that does not parse in this way is considered to be registry-based.
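
The following sketch shows how a server-based authority is split into its parts, and what happens when the authority does not parse that way. The URIs, the user name jdoe, and the names AuthorityDemo and serverParts are assumptions chosen for illustration. The second URI uses an underscore, which is not legal in a host name, so the authority is kept as a whole and the host stays undefined.

```java
import java.net.URI;

class AuthorityDemo {
    /** Returns "user-info|host|port" for the given URI string. */
    static String serverParts(String s) {
        URI u = URI.create(s);
        return u.getUserInfo() + "|" + u.getHost() + "|" + u.getPort();
    }

    public static void main(String[] args) {
        // Hypothetical URI with a server-based authority.
        System.out.println(serverParts("http://jdoe@example.com:8080/docs/index.html"));

        // Not parseable as [user-info@]host[:port], so treated as registry-based.
        URI reg = URI.create("http://my_host/index.html");
        System.out.println(reg.getAuthority() + " / host=" + reg.getHost());
    }
}
```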

The path component of a hierarchical URI is itself said to be absolute if it begins with a slash character ('/'); otherwise it is relative. The path of a hierarchical URI that is either absolute or specifies an authority is always absolute.

All told, then, a URI instance has the following nine components:

Component               Type
scheme                  String
scheme-specific-part    String
authority               String
user-info               String
host                    String
port                    int
path                    String
query                   String
fragment                String

In a given instance any particular component is either undefined or defined with a distinct value. Undefined string components are represented by null, while undefined integer components are represented by -1. A string component may be defined to have the empty string as its value; this is not equivalent to that component being undefined.

Whether a particular component is or is not defined in an instance depends upon the type of the URI being represented. An absolute URI has a scheme component. An opaque URI has a scheme, a scheme-specific part, and possibly a fragment, but has no other components. A hierarchical URI always has a path (though it may be empty) and a scheme-specific-part (which at least contains the path), and may have any of the other components. If the authority component is present and is server-based then the host component will be defined and the user-information and port components may be defined.
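
The difference between an undefined component and an empty one can be checked directly. This is a minimal sketch; the class UndefinedComponents and the helper describe are hypothetical names.

```java
import java.net.URI;

class UndefinedComponents {
    /** Describes a string component as undefined (null), empty, or its value. */
    static String describe(String component) {
        if (component == null) {
            return "undefined";
        }
        return component.isEmpty() ? "empty" : component;
    }

    public static void main(String[] args) {
        URI opaque = URI.create("news:comp.lang.java");
        System.out.println("path:  " + describe(opaque.getPath()));   // opaque URIs have no path
        System.out.println("port:  " + opaque.getPort());             // undefined int component
        System.out.println("ssp:   " + describe(opaque.getSchemeSpecificPart()));

        URI noPath = URI.create("http://java.sun.com");
        System.out.println("path:  " + describe(noPath.getPath()));   // defined, but empty
        System.out.println("query: " + describe(noPath.getQuery()));

        URI relative = URI.create("docs/guide/index.html");
        System.out.println("scheme: " + describe(relative.getScheme()));
    }
}
```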

Operations on URI instances

The key operations supported by this class are those of normalization, resolution, and relativization.

Normalization is the process of removing unnecessary "." and ".." segments from the path component of a hierarchical URI. Each "." segment is simply removed. A ".." segment is removed only if it is preceded by a non-".." segment. Normalization has no effect upon opaque URIs.
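
A short sketch of these rules using URI.normalize. The sample paths and the names NormalizeDemo and norm are assumptions made for the example.

```java
import java.net.URI;

class NormalizeDemo {
    /** Parses the string, normalizes the URI, and returns its string form. */
    static String norm(String s) {
        return URI.create(s).normalize().toString();
    }

    public static void main(String[] args) {
        System.out.println(norm("a/./b/../c"));   // "." removed, "b/.." collapsed
        System.out.println(norm("../../a/b"));    // leading ".." segments are kept
        System.out.println(norm("http://java.sun.com/j2se/1.3/./docs/../demo/"));
        System.out.println(norm("urn:a/./b"));    // opaque: unchanged
    }
}
```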

Resolution is the process of resolving one URI against another, base URI. The resulting URI is constructed from components of both URIs in the manner specified by RFC 2396, taking components from the base URI for those not specified in the original. For hierarchical URIs, the path of the original is resolved against the path of the base and then normalized. The result, for example, of resolving

docs/guide/collections/designfaq.html#28          (1)

against the base URI http://java.sun.com/j2se/1.3/ is the result URI

http://java.sun.com/j2se/1.3/docs/guide/collections/designfaq.html#28

Resolving the relative URI

../../../demo/jfc/SwingSet2/src/SwingSet2.java    (2)

against this result yields, in turn,

http://java.sun.com/j2se/1.3/demo/jfc/SwingSet2/src/SwingSet2.java

Resolution of both absolute and relative URIs, and of both absolute and relative paths in the case of hierarchical URIs, is supported. Resolving the URI file:///~calendar against any other URI simply yields the original URI, since it is absolute. Resolving the relative URI (2) above against the relative base URI (1) yields the normalized, but still relative, URI

demo/jfc/SwingSet2/src/SwingSet2.java
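
The resolution examples above can be reproduced with URI.resolve. The sketch below is only a convenience wrapper; ResolveDemo and its resolve helper are hypothetical names.

```java
import java.net.URI;

class ResolveDemo {
    /** Resolves ref against base and returns the result in string form. */
    static String resolve(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String rel1 = "docs/guide/collections/designfaq.html#28";          // (1)
        String rel2 = "../../../demo/jfc/SwingSet2/src/SwingSet2.java";    // (2)

        String first = resolve("http://java.sun.com/j2se/1.3/", rel1);
        System.out.println(first);
        System.out.println(resolve(first, rel2));

        // An absolute URI resolves to itself, whatever the base is.
        System.out.println(resolve("http://java.sun.com/j2se/1.3/", "file:///~calendar"));

        // Relative against relative stays relative.
        System.out.println(resolve(rel1, rel2));
    }
}
```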

Relativization, finally, is the inverse of resolution: For any two normalized URIs u and v,

u.relativize(u.resolve(v)).equals(v)  and
u.resolve(u.relativize(v)).equals(v)  .

This operation is often useful when constructing a document containing URIs that must be made relative to the base URI of the document wherever possible. For example, relativizing the URI

http://java.sun.com/j2se/1.3/docs/guide/index.html

against the base URI

http://java.sun.com/j2se/1.3

yields the relative URI docs/guide/index.html.
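
A minimal sketch of relativization with URI.relativize follows. It writes the base with a trailing slash so that resolving the relative result against the same base leads back to the original URI. The host example.org, the class RelativizeDemo, and the helper relativize are assumptions for illustration.

```java
import java.net.URI;

class RelativizeDemo {
    /** Relativizes target against base and returns the result in string form. */
    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        String base = "http://java.sun.com/j2se/1.3/";
        String target = "http://java.sun.com/j2se/1.3/docs/guide/index.html";

        String relative = relativize(base, target);
        System.out.println(relative);

        // Resolving the relative form against the same base restores the target.
        System.out.println(URI.create(base).resolve(relative));

        // A URI under a different authority cannot be relativized and is returned as given.
        System.out.println(relativize(base, "http://example.org/x"));
    }
}
```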

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:

alpha
    The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'

digit
    The US-ASCII decimal digit characters, '0' through '9'

alphanum
    All alpha and digit characters

unreserved
    All alphanum characters together with those in the string "_-!.~'()*"

punct
    The characters in the string ",;:$&+="

reserved
    All punct characters together with those in the string "?/[]@"

escaped
    Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')

other
    The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.

Escaped octets, quotation, encoding, and decoding

RFC 2396 allows escaped octets to appear in the user-info, path, query, and fragment components. Escaping serves two purposes in URIs:

·        To encode non-US-ASCII characters when a URI is required to conform strictly to RFC 2396 by not containing any other characters.

·        To quote characters that are otherwise illegal in a component. The user-info, path, query, and fragment components differ slightly in terms of which characters are considered legal and illegal.

These purposes are served in this class by three related operations:

·        A character is encoded by replacing it with the sequence of escaped octets that represent that character in the UTF-8 character set. The Euro currency symbol ('\u20AC'), for example, is encoded as "%E2%82%AC". (Deviation from RFC 2396, which does not specify any particular character set.)

·        An illegal character is quoted simply by encoding it. The space character, for example, is quoted by replacing it with "%20". UTF-8 contains US-ASCII, hence for US-ASCII characters this transformation has exactly the effect required by RFC 2396.

·        A sequence of escaped octets is decoded by replacing it with the sequence of characters that it represents in the UTF-8 character set. UTF-8 contains US-ASCII, hence decoding has the effect of de-quoting any quoted US-ASCII characters as well as that of decoding any encoded non-US-ASCII characters. If a decoding error occurs when decoding the escaped octets then the erroneous octets are replaced by '\uFFFD', the Unicode replacement character.

These operations are exposed in the constructors and methods of this class as follows:

·        The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.

·        The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.

·        The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.

·        The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.

·        The toString method returns a URI string with all necessary quotation but which may contain other characters.

·        The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
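
The behaviour listed above can be observed with a small experiment. This is a sketch under assumptions: example.com is a placeholder host, and QuotingDemo, build, and parses are names invented for the example. It uses the four-argument constructor (scheme, host, path, fragment) for the multi-argument case.

```java
import java.net.URI;
import java.net.URISyntaxException;

class QuotingDemo {
    /** Builds an http URI from a path; the multi-argument constructor quotes illegal characters. */
    static URI build(String path) {
        try {
            return new URI("http", "example.com", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns true if the single-argument constructor accepts the string as given. */
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        URI u = build("/a b/\u20AC");                     // a space and the Euro sign
        System.out.println(u.getRawPath());               // space quoted, Euro sign preserved
        System.out.println(u.getPath());                  // decoded form
        System.out.println(u.toASCIIString());            // Euro sign encoded as UTF-8 octets
        System.out.println(build("/100%").getRawPath());  // '%' is always quoted
        System.out.println(parses("http://example.com/a b"));    // unquoted space is rejected
        System.out.println(parses("http://example.com/a%20b"));  // quoted form is accepted
    }
}
```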

Identities

For any URI u, it is always the case that

new URI(u.toString()).equals(u) .

For any URI u that does not contain redundant syntax such as two slashes before an empty authority (as in file:///tmp/ ) or a colon following a host name but no port (as in http://java.sun.com: ), and that does not encode characters except those that must be quoted, the following identities also hold:

new URI(u.getScheme(),
        u.getSchemeSpecificPart(),
        u.getFragment())
.equals(u)

in all cases,

new URI(u.getScheme(),
        u.getAuthority(),
        u.getPath(), u.getQuery(),
        u.getFragment())
.equals(u)

if u is hierarchical, and

new URI(u.getScheme(),
        u.getUserInfo(), u.getHost(), u.getPort(),
        u.getPath(), u.getQuery(),
        u.getFragment())
.equals(u)

if u is hierarchical and has either no authority or a server-based authority.
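
The identities can be checked for a concrete URI. The sketch below assumes a hypothetical hierarchical URI with a server-based authority and no escaped octets; IdentityDemo and check are made-up names. It uses the three-, five-, and seven-argument constructors of java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdentityDemo {
    /** Rebuilds the URI in four ways and reports whether each result equals the original. */
    static boolean[] check(String s) {
        try {
            URI u = new URI(s);
            return new boolean[] {
                new URI(u.toString()).equals(u),
                new URI(u.getScheme(), u.getSchemeSpecificPart(), u.getFragment()).equals(u),
                new URI(u.getScheme(), u.getAuthority(),
                        u.getPath(), u.getQuery(), u.getFragment()).equals(u),
                new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                        u.getPath(), u.getQuery(), u.getFragment()).equals(u)
            };
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URI chosen for illustration.
        for (boolean ok : check("http://jdoe@example.com:8080/docs/guide?x=1#top")) {
            System.out.println(ok);
        }
    }
}
```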

URIs, URLs, and URNs

A URI is a uniform resource identifier while a URL is a uniform resource locator. Hence every URL is a URI, abstractly speaking, but not every URI is a URL. This is because there is another subcategory of URIs, uniform resource names (URNs), which name resources but do not specify how to locate them. The mailto, news, and isbn URIs shown above are examples of URNs.

The conceptual distinction between URIs and URLs is reflected in the differences between this class and the URL class.

An instance of this class represents a URI reference in the syntactic sense defined by RFC 2396. A URI may be either absolute or relative. A URI string is parsed according to the generic syntax without regard to the scheme, if any, that it specifies. No lookup of the host, if any, is performed, and no scheme-dependent stream handler is constructed. Equality, hashing, and comparison are defined strictly in terms of the character content of the instance. In other words, a URI instance is little more than a structured string that supports the syntactic, scheme-independent operations of comparison, normalization, resolution, and relativization.

An instance of the URL class, by contrast, represents the syntactic components of a URL together with some of the information required to access the resource that it describes. A URL must be absolute, that is, it must always specify a scheme. A URL string is parsed according to its scheme. A stream handler is always established for a URL, and in fact it is impossible to create a URL instance for a scheme for which no handler is available. Equality and hashing depend upon both the scheme and the Internet address of the host, if any; comparison is not defined. In other words, a URL is a structured string that supports the syntactic operation of resolution as well as the network I/O operations of looking up the host and opening a connection to the specified resource.
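
One visible consequence of this distinction is that not every URI can be converted to a URL. The following sketch uses URI.toURL and assumes the running JDK has no protocol handler registered for the urn scheme, which is the usual case. ToUrlDemo and tryToUrl are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URI;

class ToUrlDemo {
    /** Attempts the conversion and returns either the URL or the name of the exception raised. */
    static String tryToUrl(String s) {
        try {
            return "URL " + URI.create(s).toURL();
        } catch (IllegalArgumentException | MalformedURLException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryToUrl("http://java.sun.com/index.html")); // a handler for http exists
        System.out.println(tryToUrl("urn:isbn:096139210x"));            // no handler for urn
        System.out.println(tryToUrl("docs/guide/index.html"));          // relative: not a URL
    }
}
```

No host lookup or connection is made by any of these calls; the conversion is purely a matter of parsing and finding a handler for the scheme.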

Since:

1.4

Author:

Mark Reinhold

See Also:

RFC 2279: UTF-8, a transformation format of ISO 10646,
RFC 2373: IPv6 Addressing Architecture,
RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax,
RFC 2732: Format for Literal IPv6 Addresses in URLs,
URISyntaxException

 


 

java.net.URL

 

Class URL represents a Uniform Resource Locator, a pointer to a "resource" on the World Wide Web. A resource can be something as simple as a file or a directory, or it can be a reference to a more complicated object, such as a query to a database or to a search engine. More information on the types of URLs and their formats can be found at:

http://www.socs.uts.edu.au/MosaicDocs-old/url-primer.html

In general, a URL can be broken into several parts. The previous example of a URL indicates that the protocol to use is http (HyperText Transfer Protocol) and that the information resides on a host machine named www.socs.uts.edu.au. The information on that host machine is named /MosaicDocs-old/url-primer.html. The exact meaning of this name on the host machine is both protocol dependent and host dependent. The information normally resides in a file, but it could be generated on the fly. This component of the URL is called the path component.

A URL can optionally specify a "port", which is the port number to which the TCP connection is made on the remote host machine. If the port is not specified, the default port for the protocol is used instead. For example, the default port for http is 80. An alternative port could be specified as:

     http://www.socs.uts.edu.au:80/MosaicDocs-old/url-primer.html
 

The syntax of URL is defined by RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, amended by RFC 2732: Format for Literal IPv6 Addresses in URLs. The Literal IPv6 address format also supports scope_ids. The syntax and usage of scope_ids is described in the documentation of the Inet6Address class.

A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters. For example,

     http://java.sun.com/index.html#chapter1
 

This fragment is not technically part of the URL. Rather, it indicates that after the specified resource is retrieved, the application is specifically interested in that part of the document that has the tag chapter1 attached to it. The meaning of a tag is resource specific.

An application can also specify a "relative URL", which contains only enough information to reach the resource relative to another URL. Relative URLs are frequently used within HTML pages. For example, if the contents of the URL:
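
The parts described so far (protocol, host, port, path, and fragment) are available through accessor methods of java.net.URL. The sketch below creates the URLs through URI.create(...).toURL(), which involves no network access; UrlParts and its url helper are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlParts {
    /** Creates a URL from a string by way of java.net.URI. */
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL withPort = url("http://www.socs.uts.edu.au:80/MosaicDocs-old/url-primer.html");
        System.out.println(withPort.getProtocol());
        System.out.println(withPort.getHost());
        System.out.println(withPort.getPort());
        System.out.println(withPort.getPath());

        // Without an explicit port, getPort() reports -1 and getDefaultPort() the protocol default.
        URL noPort = url("http://www.socs.uts.edu.au/MosaicDocs-old/url-primer.html");
        System.out.println(noPort.getPort() + " / " + noPort.getDefaultPort());

        // The fragment is exposed as the "ref".
        System.out.println(url("http://java.sun.com/index.html#chapter1").getRef());
    }
}
```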

     http://java.sun.com/index.html
 

contained within it the relative URL:

     FAQ.html
 

it would be a shorthand for:

     http://java.sun.com/FAQ.html
 

The relative URL need not specify all the components of a URL. If the protocol, host name, or port number is missing, the value is inherited from the fully specified URL. The file component must be specified. The optional fragment is not inherited.

The URL class does not itself encode or decode any URL components according to the escaping mechanism defined in RFC 2396. It is the responsibility of the caller to encode any fields that need to be escaped prior to calling URL, and also to decode any escaped fields that are returned from URL. Furthermore, because URL has no knowledge of URL escaping, it does not recognise equivalence between the encoded or decoded form of the same URL. For example, the two URLs:
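
The FAQ.html example can be reproduced as follows. As an assumption made for this sketch, the relative reference is resolved with URI.resolve and then converted with toURL(), rather than with the URL(URL, String) constructor, which has been deprecated since Java 20. RelativeUrlDemo and resolve are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class RelativeUrlDemo {
    /** Resolves a relative reference against a base and converts the result to a URL. */
    static URL resolve(String base, String relative) {
        try {
            return URI.create(base).resolve(relative).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Protocol and host are inherited from the base.
        System.out.println(resolve("http://java.sun.com/index.html", "FAQ.html"));

        // The fragment of the base is not inherited.
        System.out.println(resolve("http://java.sun.com/index.html#chapter1", "FAQ.html").getRef());
    }
}
```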

    http://foo.com/hello world/ and http://foo.com/hello%20world

would be considered not equal to each other.

Note, the java.net.URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use java.net.URI, and to convert between these two classes using URL.toURI() and URI.toURL().

The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC 2396.
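
The difference between the two encoding schemes shows up with a single space character. In this sketch, foo.com is the host from the example above, while EncodingDemo, formEncode, and uriWithPath are names invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

class EncodingDemo {
    /** HTML form encoding, as performed by URLEncoder. */
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** RFC 2396 quoting of a path, as performed by a multi-argument URI constructor. */
    static URI uriWithPath(String path) {
        try {
            return new URI("http", "foo.com", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(formEncode("hello world"));   // space becomes '+'

        URI uri = uriWithPath("/hello world/");
        System.out.println(uri);                         // space becomes %20

        // Convert to a URL for I/O, and back to a URI to decode components.
        URL url = uri.toURL();
        System.out.println(url.toURI().getPath());
    }
}
```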

Since:

JDK1.0

Author:

James Gosling
