2-URLs and Resources

Please indicate the source: http://blog.csdn.net/gaoxiangnumber1

Welcome to my github: https://github.com/gaoxiangnumber1

2.1 Navigating the Internet’s Resources

  • URIs comprise two main subsets: URLs and URNs.
    1. URLs identify resources by describing where resources are located. HTTP applications deal only with the URL subset of URIs.
    2. URNs identify resources by name, regardless of where they currently reside.

  • Figure 2-1. http://www.joes-hardware.com/seasonal/index-fall.html.
    1. The first part “http” is the URL scheme that tells a web client how to access the resource. “http” means using the HTTP protocol.
    2. The second part “www.joes-hardware.com” is the server location that tells the web client where the resource is hosted.
    3. The third part “/seasonal/index-fall.html” is the resource path that tells what particular local resource on the server is being requested.
  • URLs can direct you to resources available through protocols other than HTTP. They can point you to any resource on the Internet, from a person’s email account:
    mailto:[email protected]
    to files that are available through other protocols, such as File Transfer Protocol(FTP):
    ftp://ftp.lots-o-books.com/pub/complete-price-list.xls

2.1.1 The Dark Days Before URLs

2.2 URL Syntax

  • URLs provide a means of locating any resource on the Internet, but these resources can be accessed by different schemes(e.g., HTTP, FTP, SMTP), and URL syntax varies from scheme to scheme.
  • Most URL schemes base their URL syntax on this nine-part general format:
    <scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
    The three most important parts are the scheme, the host, and the path. Table 2-1.
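
As a rough illustration of the nine-part general format, the sketch below splits a URL into its components with Python's standard `urllib.parse.urlparse`. The URL itself (including the user "joe" and password "secret") is a made-up example, not one taken from the text.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises all nine parts of the general format.
url = "http://joe:secret@www.joes-hardware.com:80/tools/index.html;type=d?item=12731#drills"

parts = urlparse(url)

# Map the parsed result onto the nine component names used in this chapter.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "frag": parts.fragment,
}

for name, value in components.items():
    print(f"{name:8} {value}")
```

Note that `urlparse` only separates params from the last path segment; params attached to earlier segments stay inside the path.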

2.2.1 Schemes: What Protocol to Use

  • The scheme tells the application interpreting the URL what protocol it needs to speak. It must start with an alphabetic character, and it is separated from the rest of the URL by the first “:” character. Scheme names are case-insensitive: “http://www.joes-hardware.com” and “HTTP://www.joes-hardware.com” are equivalent.
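
A small sketch of the case-insensitivity rule, assuming Python's `urllib.parse.urlsplit` is used to isolate the scheme. The helper name `same_scheme` is hypothetical.

```python
from urllib.parse import urlsplit

def same_scheme(url_a: str, url_b: str) -> bool:
    """Compare the scheme components of two URLs, ignoring case."""
    return urlsplit(url_a).scheme.lower() == urlsplit(url_b).scheme.lower()

print(same_scheme("http://www.joes-hardware.com", "HTTP://www.joes-hardware.com"))
```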

2.2.2 Hosts and Ports

  • The host component identifies the host machine that has access to the resource. The name can be provided as a hostname(“www.example.com”) or as an IP address. The following two URLs point to the same resource:
    http://www.joes-hardware.com:80/index.html = http://161.58.228.45:80/index.html
  • The port component identifies the network port on which the server is listening. For HTTP, which uses the underlying TCP protocol, the default port is 80.
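
One way an application might fill in the default port is sketched below. The `DEFAULT_PORTS` table and the helper `host_and_port` are assumptions for illustration; only the HTTP default of 80 comes from the text.

```python
from urllib.parse import urlsplit

# Default ports by scheme. Only the HTTP entry comes from the text above;
# extend the table for other schemes as needed.
DEFAULT_PORTS = {"http": 80}

def host_and_port(url: str):
    """Return (host, port), falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port

print(host_and_port("http://www.joes-hardware.com/index.html"))
print(host_and_port("http://161.58.228.45:80/index.html"))
```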

2.2.3 Usernames and Passwords

  • ftp://ftp.prep.ai.mit.edu/pub/gnu
    This URL has no user or password component. If an application uses a URL scheme that requires a username and password, such as FTP, it typically inserts default values when they aren’t supplied. An FTP client, for example, inserts “anonymous” as the username and sends a default password.
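
A minimal sketch of the username fallback, assuming the URL is parsed with `urllib.parse.urlsplit`. The helper name `ftp_username` is hypothetical, and the default password is deliberately left out because the text does not say what it is.

```python
from urllib.parse import urlsplit

def ftp_username(url: str) -> str:
    """Return the user from the URL, or "anonymous" when none is supplied."""
    parts = urlsplit(url)
    return parts.username or "anonymous"

print(ftp_username("ftp://ftp.prep.ai.mit.edu/pub/gnu"))
```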

2.2.4 Paths

  • The path specifies where on the server machine the resource lives. The path resembles a hierarchical filesystem path. The path for HTTP URLs can be divided into path segments separated by “/” characters, as in a file path on a Unix filesystem.

2.2.5 Parameters

  • Applications interpreting URLs often need protocol parameters to access the resource; without them, the server on the other side might not service the request, or might service it incorrectly.
  • URLs’ params component is a list of name/value pairs in the URL, separated from the rest of the URL(and from each other) by “;” characters.
    ftp://prep.ai.mit.edu/pub/gnu;type=d
    There is one param “type=d”: name = “type” and value = “d”.
  • The path component for HTTP URLs can be broken into path segments and each segment can have its own params.
    http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true
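
Per-segment params can be pulled apart by hand, as in the sketch below. The function `segment_params` is a hypothetical helper; it assumes every param has the form name=value and does no percent-decoding.

```python
from urllib.parse import urlsplit

def segment_params(url: str):
    """Return a list of (segment, {name: value}) pairs for the URL's path."""
    path = urlsplit(url).path  # urlsplit leaves ";params" inside the path
    result = []
    for segment in path.split("/"):
        if not segment:
            continue  # skip the empty piece before the leading "/"
        name, *raw_params = segment.split(";")
        params = {}
        for raw in raw_params:
            key, _, value = raw.partition("=")
            params[key] = value
        result.append((name, params))
    return result

print(segment_params("http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true"))
```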

2.2.6 Query Strings

  • Figure 2-2 shows an example of a query component being passed to a server that is acting as a gateway to Joe’s Hardware’s inventory-checking application. The query is checking whether a particular item, 12731, is in inventory in size large and color blue.
  • There is no requirement for the format of the query component except that some characters are illegal(later in this chapter). By convention, many gateways expect the query string to be formatted as a series of “name=value” pairs, separated by “&” characters.
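
The conventional “name=value” layout can be decoded with Python's `urllib.parse.parse_qsl`, as sketched below. The URL is an assumed reconstruction of the inventory check described above; the script name `inventory-check.cgi` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed example URL: item 12731, color blue, size large.
url = "http://www.joes-hardware.com/inventory-check.cgi?item=12731&color=blue&size=large"

query = urlsplit(url).query   # everything between "?" and "#"
pairs = parse_qsl(query)      # list of (name, value) tuples, in order

for name, value in pairs:
    print(name, "=", value)
```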

2.2.7 Fragments

  • Some resource types can be divided further than just the resource level. To allow referencing of parts or fragments of a resource, URLs support a frag component to identify pieces within a resource.
  • A fragment dangles off the right-hand side of a URL, preceded by a # character.
    http://www.joes-hardware.com/tools.html#drills
    The fragment drills references a portion of the /tools.html web page located on the Joe’s Hardware web server. The portion is named “drills”.

  • Figure 2-3. Because HTTP servers deal only with entire objects, not with fragments of objects, clients don’t pass fragments along to servers. After your browser gets the entire resource from the server, it uses the fragment to display the part of the resource in which you are interested.
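
A client-side sketch of this behavior, using Python's `urllib.parse.urldefrag` to separate the part that is sent to the server from the part that is kept locally. The variable names are illustrative only.

```python
from urllib.parse import urldefrag

url = "http://www.joes-hardware.com/tools.html#drills"

# request_url is what the client would ask the server for;
# fragment stays on the client and is used after the resource arrives.
request_url, fragment = urldefrag(url)

print(request_url)
print(fragment)
```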

2.3 URL Shortcuts

  • Relative URLs are convenient for specifying a resource within a resource. Many browsers support “automatic expansion” of URLs, where the user can type in a key part of a URL, and the browser fills in the rest(Section 2.3.2).

2.3.1 Relative URLs

  • URLs come in two flavors: absolute and relative.
    1. An absolute URL has all the information you need to access a resource.
    2. Relative URLs are incomplete. To get all the information needed to access a resource from a relative URL, you must interpret it relative to its base URL.
  • Example 2-1 contains an example HTML document with an embedded relative URL.
Example 2-1. HTML snippet with relative URLs 
<HTML>
<HEAD><TITLE>Joe's Tools</TITLE></HEAD>
<BODY>
<H1> Tools Page </H1>
<H2> Hammers </H2>
<P> Joe's Hardware Online has the largest selection of <A href="./hammers.html">hammers</A></P>
</BODY>
</HTML>
  • Example 2-1: base URL is http://www.joes-hardware.com/tools.html
    In the HTML document, the URL “./hammers.html” is a legal relative URL. It can be interpreted relative to the URL of the document in which it is found; in this case, relative to the resource /tools.html on the Joe’s Hardware web server. Figure 2-4.

  • Relative URLs are only fragments or pieces of URLs. Applications that process URLs need to be able to convert between relative and absolute URLs.

2.3.1.1 Base URLs

  • The first step in the conversion process is to find a base URL that serves as a point of reference for the relative URL. It can come from a few places:
    1. Explicitly provided in the resource: An HTML document may include a <BASE> HTML tag that defines the base URL against which all relative URLs in that document are resolved.
    2. Base URL of the encapsulating resource: If a relative URL is found in a resource that does not explicitly specify a base URL, it can use the URL of the resource in which it is embedded as a base.
    3. No base URL: This often means that you have an absolute URL; sometimes, however, you may simply have an incomplete or broken URL.
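
The first two sources of a base URL can be sketched as follows: look for an explicit `<BASE>` tag in the HTML, and otherwise fall back to the URL of the document itself. The class `BaseFinder`, the function `find_base_url`, and the sample URLs are assumptions for illustration, built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser

class BaseFinder(HTMLParser):
    """Remember the href of the first <BASE> tag seen, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

def find_base_url(html_text: str, document_url: str) -> str:
    """Prefer an explicit <BASE> tag; otherwise use the document's own URL."""
    finder = BaseFinder()
    finder.feed(html_text)
    return finder.base_href or document_url

page = '<HTML><HEAD><BASE HREF="http://www.joes-hardware.com/tools/"></HEAD></HTML>'
print(find_base_url(page, "http://www.joes-hardware.com/tools.html"))
```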

2.3.1.2 Resolving relative references

  • The next step in converting a relative URL into an absolute one is to break up both the relative and base URLs into their component pieces. After that, you can apply the algorithm in Figure 2-5 to finish the conversion.

  • This algorithm converts a relative URL to its absolute form, which can then be used to reference the resource. This algorithm was specified in RFC 1808 and incorporated into RFC 2396.
  • With ./hammers.html from Example 2-1:
    1. Path is ./hammers.html; base URL is http://www.joes-hardware.com/tools.html.
    2. Scheme is empty; proceed down left half of chart and inherit the base URL scheme HTTP.
    3. At least one component is non-empty; proceed to bottom, inheriting host and port components.
    4. Combining the components we have from the relative URL(path: ./hammers.html) with what we have inherited(scheme: http, host: www.joes-hardware.com, port: 80), we get our absolute URL: http://www.joes-hardware.com/hammers.html.
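
The same resolution can be checked with Python's `urllib.parse.urljoin`, which implements relative-reference resolution in the standard library. This is a sketch of the outcome, not a transcription of the flowchart in Figure 2-5.

```python
from urllib.parse import urljoin

base_url = "http://www.joes-hardware.com/tools.html"
relative_url = "./hammers.html"

# Scheme and host are inherited from the base; the relative path replaces
# the last segment of the base path.
absolute_url = urljoin(base_url, relative_url)

print(absolute_url)
```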

2.3.2 Expandomatic URLs

  • Some browsers try to expand URLs automatically, either after you submit the URL or while you’re typing. These expandomatic features come in two flavors.
    1. Hostname expansion
      In hostname expansion, the browser can expand the hostname you type in into the full hostname by using some simple heuristics.
      E.g. if you type “yahoo” in the address box, your browser can automatically insert “www.” and “.com” onto the hostname, creating “www.yahoo.com”.
    2. History expansion
      Another technique that browsers use is to store a history of the URLs that you have visited in the past. As you type in the URL, they can offer you completed choices to select from by matching what you type to the prefixes of the URLs in your history.
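
Both flavors can be approximated in a few lines. The sketch below is a toy heuristic, not how any particular browser works; `expand_hostname`, `history_matches`, and the sample history are hypothetical.

```python
def expand_hostname(typed: str) -> str:
    """Wrap a bare word in "www." and ".com"; leave dotted names alone."""
    if "." in typed:
        return typed
    return f"www.{typed}.com"

def history_matches(typed: str, history: list) -> list:
    """Return the visited URLs that start with what the user has typed."""
    return [url for url in history if url.startswith(typed)]

history = [
    "http://www.joes-hardware.com/tools.html",
    "http://www.joes-hardware.com/hammers.html",
    "http://www.yahoo.com/",
]

print(expand_hostname("yahoo"))
print(history_matches("http://www.joes", history))
```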

2.4 Shady Characters

  • URLs were designed to be portable and uniformly name all the resources on the Internet, which means that they will be transmitted through various protocols. Because different protocols have different mechanisms for transmitting their data, URLs should be designed to be transmitted safely through any Internet protocol.
  • Safe transmission means that URLs can be transmitted without the risk of losing information. So, URLs are permitted to contain only characters from a small, universally safe alphabet.
  • What’s more, URLs should be readable by people. So invisible, non-printing characters are prohibited in URLs. Non-printing characters include whitespace(RFC 2396 recommends that applications ignore whitespace).
  • URLs also need to be complete. URL designers realized there would be times when people would want URLs to contain binary data or characters outside of the universally safe alphabet. So, an escape mechanism was added, allowing unsafe characters to be encoded into safe characters for transport.

2.4.1 The URL Character Set

  • Historically, many computer applications have used the US-ASCII character set that uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signaling.
  • US-ASCII is portable, but it doesn’t support the inflected characters common in other languages(European languages, Chinese…).
  • Furthermore, some URLs may need to contain arbitrary binary data. Recognizing the need for completeness, the URL designers have incorporated escape sequences that allow the encoding of arbitrary character values or data using a restricted subset of the US-ASCII character set, yielding portability and completeness.

2.4.2 Encoding Mechanisms

  • The encoding represents the unsafe character by an “escape” notation, consisting of a percent sign(%) followed by two hexadecimal digits that represent the ASCII code of the character. Table 2-2.
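
The escape notation is simple enough to write by hand. The helper `percent_encode` below is a hypothetical sketch that handles ASCII input only (it raises an error for anything else); decoding uses the standard `urllib.parse.unquote`.

```python
from urllib.parse import unquote

def percent_encode(text: str) -> str:
    """Escape every character as "%" plus its two-digit hex ASCII code."""
    return "".join(f"%{byte:02X}" for byte in text.encode("ascii"))

print(percent_encode("~"))   # tilde
print(percent_encode(" "))   # space
print(percent_encode("%"))   # the percent sign itself
print(unquote("http://www.joes-hardware.com/%7Ejoe"))
```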

2.4.3 Character Restrictions

  • Characters are restricted within a URL for several reasons:
    1. Several characters have been reserved to have special meaning inside of a URL.
    2. Others are not in the defined US-ASCII printable set.
    3. Still others are known to confuse some Internet gateways and protocols, so their use is discouraged.
  • Table 2-3 lists characters that should be encoded in a URL before you use them for anything other than their reserved purposes.

2.4.4 A Bit More

  • For some transport protocols, nothing is wrong when you use characters that are unsafe. For instance, you can visit Joe’s home page at:
    http://www.joes-hardware.com/~joe
    and not encode the “~” character.
  • It is best for client applications to convert any unsafe or restricted characters before sending any URL to any other application. Once all the unsafe characters have been encoded, the URL is in a canonical form that can be shared between applications.
  • The original application that gets the URL from the user is best placed to determine which characters need to be encoded. Because each component of the URL may have its own safe/unsafe characters, and which characters are safe/unsafe is scheme-dependent, only the application receiving the URL from the user can determine what needs to be encoded.
  • The other extreme is to encode all characters. While this is not recommended, there is no hard rule against encoding characters that are considered safe already. But in practice this can lead to odd and broken behavior, because some applications may assume that safe characters will not be encoded.
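
The points above can be illustrated with Python's `urllib.parse.quote` and `unquote`. The sample strings are made up; the sketch shows encoding a single path segment, the effect of encoding twice, and why needlessly encoded safe characters can trip up applications that compare URLs as plain strings.

```python
from urllib.parse import quote, unquote

# Encode one path segment. With safe="" even "/" is escaped, so it cannot
# be mistaken for a segment separator.
segment = quote("price list/new", safe="")

# Encoding an already encoded string escapes the "%" again.
twice = quote(quote("a b"))

# "%68" is an encoded "h": it decodes to the same name, but the raw strings
# differ, which can confuse code that assumes safe characters are never encoded.
over_encoded = "%68ammers.html"
same_after_decoding = unquote(over_encoded) == "hammers.html"
same_as_raw_strings = over_encoded == "hammers.html"

print(segment, twice, same_after_decoding, same_as_raw_strings)
```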

2.5 A Sea of Schemes

  • Appendix A gives a list of schemes and references to their individual documentation. Table 2-4 summarizes some of the most popular schemes.

2.6 The Future

  • URLs are addresses, not true names. A URL tells where something is located for the moment. It provides you with the name of a specific server on a specific port, where you can find the resource. If the resource is moved, the URL is no longer valid.
  • The Internet Engineering Task Force(IETF) has been working on a new standard, uniform resource names(URNs), that provides a stable name for an object, regardless of where that object moves.
  • Persistent uniform resource locators(PURLs) are an example of how URN functionality can be achieved using URLs. The concept is to introduce another level of indirection in looking up a resource, using an intermediary resource locator server that catalogues and tracks the actual URL of a resource. A client can request a persistent URL from the locator, which can then respond with a resource that redirects the client to the actual and current URL for the resource(Figure 2-6).
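
The extra level of indirection can be pictured as a lookup table kept by the locator server. The sketch below is a toy model only: the `LOCATOR` table, the `purl.example.org` address, and both helper functions are hypothetical, and a real locator would answer over HTTP rather than through a function call.

```python
# Hypothetical locator table: persistent URL -> current actual URL.
LOCATOR = {
    "http://purl.example.org/joes/price-list": "http://www.joes-hardware.com/pub/price-list.html",
}

def resolve(purl: str):
    """Return the current URL for a persistent URL, or None if unknown."""
    return LOCATOR.get(purl)

def move(purl: str, new_url: str) -> None:
    """Record that the resource moved; the persistent URL itself is unchanged."""
    LOCATOR[purl] = new_url

purl = "http://purl.example.org/joes/price-list"
before = resolve(purl)
move(purl, "http://www.joes-hardware.com/archive/price-list.html")
after = resolve(purl)

print(before)
print(after)
```

Clients keep using the same persistent URL before and after the move; only the locator's table changes.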

2.6.1 If Not Now, When?

2.7 For More Information
