The World Wide Web
This section provides a brief introduction to the evolution of the World Wide Web.
Origins
In 1989, a small group of people led by Tim Berners-Lee at Conseil Européen pour la Recherche Nucléaire (CERN), the European Organization for Nuclear Research, proposed a new protocol for the Internet, as well as a system of document access to use it.
The intent of this new system, which the group named the World Wide Web, was to allow scientists around the world to use the Internet to exchange documents describing their work. The proposed new system was designed to allow a user anywhere on the Internet to search for and retrieve documents from databases on any number of different document-serving computers connected to the Internet. By late 1990, the basic ideas for the new system had been fully developed and implemented on a NeXT computer at CERN. In 1991, the system was ported to other computer platforms and released to the rest of the world.
For the form of its documents, the new system used hypertext, which is text with embedded links to text, either in the same document or in another document, to allow nonsequential browsing of textual material. The idea of hypertext had been developed earlier and had appeared in Xerox’s NoteCards and Apple’s HyperCard in the mid-1980s.
From here on, we will refer to the World Wide Web simply as the Web. The units of information on the Web have been referred to by several different names; among them, the most common are pages, documents, and resources. Perhaps the best of these is documents, although that seems to imply only text. Pages is widely used, but it is misleading in that Web units of information often have more than one of the kind of pages that make up printed media. There is some merit to calling these units resources, because that covers the possibility of nontextual information. This book will use documents and pages more or less interchangeably, but we prefer documents in most situations.
Documents are sometimes just text, usually with embedded links to other documents, but they often also include images, sound recordings, or other kinds of media. When a document contains nontextual information, it is called hypermedia. In an abstract sense, the Web is a vast collection of documents, some of which are connected by links.
It is important to understand that the Internet and the Web are not the same thing. The Internet is a collection of computers and other devices connected by equipment that allows them to communicate with each other. The Web is a collection of software and protocols that has been installed on most, if not all, of the computers on the Internet. Some of these computers run Web servers, which provide documents, but most run Web clients, or browsers, which request documents from servers and display them to users. The Internet was quite useful before the Web was developed, and it is still useful without it. However, most users of the Internet now use it through the Web.
Web servers are programs that provide documents to requesting browsers. Servers are slave programs: They act only when requests are made to them by browsers running on other computers on the Internet.
The most commonly used Web servers are Apache, which has been implemented for a variety of computer platforms, and Microsoft’s Internet Information Server (IIS), which runs under Windows operating systems. As of October 2013, there were over 150 million active Web hosts in operation, of which about 65 percent ran Apache, about 16 percent ran IIS, and about 14 percent ran nginx (pronounced “engine-x”), a server developed in Russia.
Operation
Although having clients and servers is a natural consequence of information distribution, this configuration offers some additional benefits for the Web. While serving information does not take a great deal of time, displaying information on client screens is time consuming. Because Web servers need not be involved in this display process, they can handle many clients. So, it is both a natural and efficient division of labor to have a small number of servers provide documents to a large number of clients.
Web browsers initiate network communications with servers by sending them URLs. A URL can specify one of two different things: the address of a data file stored on the server that is to be sent to the client, or a program stored on the server that the client wants executed, with the output of the program returned to the client.
When two computers communicate over some network, in many cases one acts as a client and the other as a server. The client initiates the communication, which is often a request for information stored on the server, which then sends that information back to the client. The Web, as well as many other systems, operates in this client-server configuration.
Documents provided by servers on the Web are requested by browsers, which are programs running on client machines. They are called browsers because they allow the user to browse the resources available on servers. The first browsers were text based: they were not capable of displaying graphic information, nor did they have a graphical user interface. This limitation effectively constrained the growth of the Web. In early 1993, things changed with the release of Mosaic, the first browser with a graphical user interface. Mosaic was developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. Mosaic’s interface provided convenient access to the Web for users who were neither scientists nor software developers. The first release of Mosaic ran on UNIX systems using the X Window system. By late 1993, versions of Mosaic for Apple Macintosh and Microsoft Windows systems had been released. Finally, users of the computers connected to the Internet around the world had a powerful way to access anything on the Web anywhere in the world. The result of this power and convenience was explosive growth in Web usage.
A browser is a client on the Web because it initiates the communication with a server, which waits for a request from the client before doing anything. In the simplest case, a browser requests a static document from a server. The server locates the document among its servable documents and sends it to the browser, which displays it for the user. However, more complicated situations are common.
For example, the server may provide a document that requests input from the user through the browser. After the user supplies the requested input, it is transmitted from the browser to the server, which may use the input to perform some computation and then return a new document to the browser to inform the user of the results of the computation. Sometimes a browser directly requests the execution of a program stored on the server. The output of the program is then returned to the browser.
Characteristics
Most of the available servers share common characteristics, regardless of their origin or the platform on which they run. This section provides brief descriptions of some of these characteristics.
The file structure of a Web server has two separate directories. The root of one of these is the document root. The file hierarchy that grows from the document root stores the Web documents to which the server has direct access and normally serves to clients. The root of the other directory is the server root. This directory, along with its descendant directories, stores the server and its support software.
The files stored directly in the document root are those available to clients through top-level URLs. Typically, clients do not access the document root directly in URLs; rather, the server maps requested URLs to the document root, whose location is not known to clients.
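The mapping from a requested URL to a file under the document root can be sketched in a few lines of Python. This is only an illustration, not how any particular server implements it: the document root /var/www/html and the function name resolve are assumptions made for the example.

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

# Hypothetical document root; a real server reads this from its configuration.
DOCUMENT_ROOT = PurePosixPath("/var/www/html")


def resolve(url_path, root=DOCUMENT_ROOT):
    """Map the path part of a requested URL to a file path under the document root."""
    # Keep only the path (drop any query string) and decode %xx escapes.
    path = unquote(urlsplit(url_path).path)
    # Normalizing an absolute path collapses "." and ".." segments and cannot
    # climb above "/", so the result always stays inside the document root.
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return root / normalized.lstrip("/")


print(resolve("/products/list.html"))
print(resolve("/../../etc/passwd"))
```

The client only ever sees /products/list.html; the location of the document root itself never appears in the URL.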
Uniform Resource Locators
Uniform (or universal) Resource Identifiers (URIs) are used to identify resources (often documents) on the Internet. URIs are used for two different purposes. One is to name a resource, in which case they are often called URIs, even though they could be more accurately called Uniform Resource Names (URNs). The other, more common, purpose is to provide a path to, or location of, a resource, in which case they are called Uniform Resource Locators (URLs). The general forms of URIs and URLs are similar, and URIs are often confused with URLs.
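The difference between a locator and a name shows up when the two are split into parts. The following sketch uses Python's standard urllib.parse module; both identifiers are made-up examples, and the helper name parts is an assumption for illustration.

```python
from urllib.parse import urlsplit


def parts(uri):
    """Return the (scheme, host, path) components of a URI."""
    split = urlsplit(uri)
    return split.scheme, split.netloc, split.path


# A URL carries a host and a path, which together say where the resource is.
print(parts("http://www.example.com/docs/index.html"))

# A URN only names the resource; it has no host part.
print(parts("urn:isbn:0451450523"))
```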
Multipurpose Internet Mail Extensions
A browser needs some way to determine the format of a document it receives from a Web server. Without knowing the form of the document, the browser would not be able to render it, because different document formats require different rendering software. The forms of these documents are specified with Multipurpose Internet Mail Extensions (MIME).
Type Specifications
MIME was developed to specify the format of different kinds of documents to be sent via Internet mail. These documents could contain various kinds of text, video data, or sound data. Because the Web has needs similar to those of Internet mail, MIME was adopted as the way to specify document types transmitted over the Web.
A Web server attaches a MIME format specification to the beginning of the document that it is about to provide to a browser. When the browser receives the document from a Web server, it uses the included MIME format specification to determine what to do with the document. If the content is text, for example, the MIME code tells the browser that it is text and also indicates the particular kind of text it is. If the content is sound, the MIME code tells the browser that it is sound and then gives the particular representation of sound so the browser can choose a program to which it has access to produce the transmitted sound.
MIME specifications have the following form:
type/subtype
The most common MIME types are text, image, and video. The most common text subtypes are plain and html. Some common image subtypes are gif and jpeg. Some common video subtypes are mpeg and quicktime. A list of MIME specifications is stored in the configuration files of every Web server. In the remainder of this book, when we say document type, we mean both the type and subtype of the document.
Servers determine the type of a document by using the file name extension as the key into a table of types. For example, the extension .html tells the server that it should attach text/html to the document before sending it to the requesting browser.
Browsers also maintain a conversion table for looking up the type of a document by its file name extension. However, this table is used only when the server does not specify a MIME type, which may be the case with some older servers. In all other cases, the browser gets the document type from the MIME header provided by the server.
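The extension-keyed table lookup described above can be sketched as follows. The table contents, the fallback type, and the function names are assumptions chosen for illustration; a real server loads its table from configuration files.

```python
import os

# Hypothetical table of types, keyed by file name extension.
MIME_TABLE = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".gif": "image/gif",
    ".jpeg": "image/jpeg",
    ".mpeg": "video/mpeg",
    ".mov": "video/quicktime",
}

# Assumed fallback for extensions that are not in the table.
DEFAULT_TYPE = "application/octet-stream"


def document_type(filename):
    """Look up the type/subtype of a document from its file name extension."""
    _, extension = os.path.splitext(filename)
    return MIME_TABLE.get(extension.lower(), DEFAULT_TYPE)


def content_type_header(filename):
    """Build the header field a server would attach before sending the document."""
    return "Content-Type: " + document_type(filename)


print(content_type_header("storefront.html"))
```

A browser-side table would work the same way, but would be consulted only when the response arrives without a type.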
Types
Experimental subtypes are sometimes used. The name of an experimental subtype begins with x-, as in video/x-msvideo. Any Web provider can add an experimental subtype by having its name added to the list of MIME specifications stored in the Web provider’s server. For example, a Web provider might have a handcrafted database whose contents he or she wants to make available to others through the Web. Of course, this raises the issue of how the browser can display the database. As might be expected, the Web provider must supply a program that the browser can call when it needs to display the contents of the database. These programs either are external to the browser, in which case they are called helper applications, or are code modules that are inserted into the browser, in which case they are called plug-ins.
Every browser has a set of MIME specifications (file types) it can handle. All can deal with text/plain (unformatted text) and text/html (HTML files), among others. Sometimes a particular browser cannot handle a specific document type, even though the type is widely used. These cases are handled in the same way as the experimental types described previously. The browser determines the helper application or plug-in it needs by examining the browser configuration file, which provides an association between file types and their required helpers or plug-ins. If the browser does not have an application or a plug-in that it needs to render a document, an error message is displayed.
The Internet and the Web are fertile grounds for security problems. On the Web server side, anyone on the planet with a computer, a browser, and an Internet connection can request the execution of software on any server computer. He or she can also access data and databases stored on the server computer. On the browser end, the problem is similar: Any server to which the browser points can download software to be executed on the browser host machine. Such software can access parts of the memory and memory devices attached to that machine that are not related to the needs of the original browser request. In effect, on both ends, it is like allowing any number of total strangers into your house and trying to prevent them from leaving anything in the house, taking anything from the house, or altering anything in the house. The larger and more complex the design of the house, the more difficult it will be to prevent any of those activities. The same is true for Web servers and browsers: The more complex they are, the more difficult it is to prevent security breaches. Today’s browsers and Web servers are indeed large and complex software systems, so security is a significant problem in Web applications. The subject of Internet and Web security is extensive and complicated, so much so that numerous books have been written on the topic. Therefore, this one section of one chapter of one book can give no more than a brief sketch of some of the subtopics of security. One aspect of Web security is the matter of getting one’s data from the browser to the server and having the server deliver data back to the browser without anyone or any device intercepting or corrupting those data along the way. Consider a simple case of transmitting a credit card number to a company from which a purchase is being made. The security issues for this transaction are as follows:
1. Privacy: It must not be possible for the credit card number to be stolen on its way to the company’s server.
2. Integrity: It must not be possible for the credit card number to be modified on its way to the company’s server.
3. Authentication: It must be possible for both the purchaser and the seller to be certain of each other’s identity.
4. Nonrepudiation: It must be possible to prove legally that the message was actually sent and received.
The basic tool to support privacy and integrity is encryption.
Data to be transmitted is converted into a different form, or encrypted, such that someone (or some computer) who is not supposed to access the data cannot decrypt it. So, if data is intercepted while en route between Internet nodes, the interceptor cannot use the data because he or she cannot decrypt it. Both encryption and decryption are done with a key and a process (applying the key to the data). Encryption was developed long before the Internet existed. Julius Caesar crudely encrypted the messages he sent to his field generals while at war. Until the mid-1970s, the same key was used for both encryption and decryption, so the initial problem was how to transmit the key from the sender to the receiver.
This problem was solved in 1976 by Whitfield Diffie and Martin Hellman of Stanford University, who developed public-key encryption, a process in which a public key and a private key are used, respectively, to encrypt and decrypt messages. A communicator (say, Joe) has an inversely related pair of keys, one public and one private. The public key can be distributed to all organizations that might send messages to Joe. All of them can use the public key to encrypt messages to Joe, who can decrypt the messages with his matching private key. This arrangement works because the private key need never be transmitted and also because it is virtually impossible to compute the private key from its corresponding public key. The technical wording for this situation is that it is computationally infeasible to determine the private key from its public key.
The most widely used public-key algorithm is named RSA, developed in 1977 by three MIT professors, Ron Rivest, Adi Shamir, and Leonard Adleman, the first letters of whose last names were used to name the algorithm. Most large companies now use RSA for e-commerce.
Another, completely different security problem for the Web is the intentional and malicious destruction of data on computers attached to the Internet.
The number of different ways this can be done has increased steadily over the life span of the Web. The sheer number of such attacks has also grown rapidly. There is now a continuous stream of new and increasingly devious Denial-of-Service (DoS) attacks, viruses, and worms being discovered, which have caused billions of dollars of damage, primarily to businesses that use the Web heavily. Of course, huge damage also has been done to home computer systems through Web intrusions.
DoS attacks can be created simply by flooding a Web server with requests, overwhelming its ability to operate effectively. Most DoS attacks are conducted with the use of networks of virally infected zombie computers, whose owners are unaware of their sinister use. So, DoS and viruses are often related.
Viruses are programs that often arrive in a system in attachments to electronic mail messages or attached to free downloaded programs. Then they attach to other programs. When executed, they replicate and can overwrite memory and attached memory devices, destroying programs and data alike. Worms damage memory, like viruses, but spread on their own, rather than being attached to other files. Perhaps the most famous worm so far has been the Blaster worm, spawned in 2003.
DoS, virus, and worm attacks are created by malicious people referred to as hackers. The incentive for these people apparently is simply the feeling of pride and accomplishment they derive from being able to cause huge amounts of damage by outwitting the designers of Web software systems.
Protection against viruses and worms is provided by antivirus software, which must be updated frequently so that it can detect and protect against the continuous stream of new viruses and worms.
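Returning to the public-key scheme described earlier in this section, the relationship between the two keys can be made concrete with a toy RSA key pair. This is a sketch for illustration only: the primes, the public exponent, and the function names are assumptions, the numbers are far too small to be secure, and real systems add padding and use keys thousands of bits long. The modular-inverse form of pow requires Python 3.8 or later.

```python
# Toy RSA with tiny primes; for illustration only, never for real use.
p, q = 61, 53                # two secret primes (assumed values)
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)      # used only while generating the keys
e = 17                       # public exponent, chosen coprime to phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi

public_key = (e, n)          # may be handed to anyone
private_key = (d, n)         # never transmitted


def encrypt(message, key):
    """Encrypt an integer message (smaller than n) with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Recover the message with the matching private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


ciphertext = encrypt(65, public_key)
print(ciphertext, decrypt(ciphertext, private_key))
```

With numbers this small, the private key can be found by factoring n in an instant; the scheme is only safe when n is large enough that factoring it is computationally infeasible.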
The Hypertext Transfer Protocol
All Web communications transactions use the same protocol: the Hypertext Transfer Protocol (HTTP). The current version of HTTP is 1.1, formally defined as RFC 2616, which was approved in June 1999. RFC 2616 is available at the Web site for the World Wide Web Consortium (W3C), http://www.w3.org. This section provides a brief introduction to HTTP.
HTTP consists of two phases: the request and the response. Each HTTP communication (request or response) between a browser and a Web server consists of two parts: a header and a body. The header contains information about the communication; the body contains the data of the communication if there is any.
The Request Phase
The general form of an HTTP request is as follows:
1. HTTP method, path part of the URL, HTTP version
2. Header fields
3. Blank line
4. Message body
The following is an example of the first line of an HTTP request:
GET /storefront.html HTTP/1.1
Only a few request methods are defined by HTTP, and an even smaller number of these are typically used. Table 1.1 lists the most commonly used methods.
Table 1.1 HTTP request methods
Method    Description
GET       Returns the contents of a specified document
HEAD      Returns the header information for a specified document
POST      Executes a specified document, using the enclosed data
PUT       Replaces a specified document with the enclosed data
DELETE    Deletes a specified document
Among the methods given in Table 1.1, GET and POST are the most frequently used. POST was originally designed for tasks such as posting a news article to a newsgroup. Its most common use now is to send form data from a browser to a server, along with a request to execute a server-resident program that will process the data.
Following the first line of an HTTP communication is any number of header fields, most of which are optional. The format of a header field is the field name followed by a colon and the value of the field. There are four categories of header fields:
1. General: For general information, such as the date
2. Request: Included in request headers
3. Response: For response headers
4. Entity: Used in both request and response headers
One common request field is the Accept field, which specifies a preference of the browser for the MIME type of the requested document. More than one Accept field can be specified if the browser is willing to accept documents in more than one format. For example, we might have any of the following:
Accept: text/plain
Accept: text/html
Accept: image/gif
A wildcard character, the asterisk (*), can be used in place of a part of a MIME type, in which case that part can be anything. For example, if any kind of text is acceptable, the Accept field could be as follows:
Accept: text/*
The Host: host name request field gives the name of the host. The Host field is required for HTTP 1.1.
The If-Modified-Since: date request field specifies that the requested file should be sent only if it has been modified since the given date.
If the request has a body, the length of that body must be given with a Content-length field, which gives the length of the request body in bytes. Requests that use the POST method require this field because they send data to the server.
The header of a request must be followed by a blank line, which is used to separate the header from the body of the request. Requests that use the GET, HEAD, and DELETE methods do not have bodies. In these cases, the blank line signals the end of the request.
A browser is not necessary to communicate with a Web server; telnet can be used instead. Consider the following command, given at the command line of any widely used operating system:
> telnet blanca.uccs.edu http
This command creates a connection to the http port on the blanca.uccs.edu server. The server responds with the following:
Trying 18.104.22.168 ...
Connected to blanca
Escape character is '^]'.
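Once such a connection is open, the text of a request has to follow the general form given earlier: a first line, header fields, a blank line, and an optional body. The sketch below assembles that text in Python without sending it anywhere. The function name build_request is an assumption for illustration, and requesting /storefront.html from this particular host is a made-up combination of the examples in this section.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble the bytes of an HTTP/1.1 request.

    HTTP separates lines with a carriage return and line feed, and an
    empty line marks the end of the header.
    """
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # Field names are case-insensitive, so Content-Length and
        # Content-length mean the same thing to a server.
        lines.append(f"Content-Length: {len(body)}")
    header = "\r\n".join(lines) + "\r\n\r\n"
    return header.encode("ascii") + body


request = build_request(
    "GET", "/storefront.html", "blanca.uccs.edu", {"Accept": "text/*"}
)
print(request.decode("ascii"))
```

A GET request ends at the blank line. A POST request built with a body would carry the length field and the data after the blank line.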
The Web Programmer’s Toolbox
Overview of HTML
At the onset, it is important to realize that HTML is not a programming language: it cannot be used to describe computations. Its purpose is to describe the general form and layout of documents to be displayed by a browser.
The word markup comes from the publishing world, where it is used to describe what production people do with a manuscript to specify to a printer how the text, graphics, and other elements in the book should appear in printed form. HTML is not the first markup language used with computers. TeX and LaTeX are older markup languages for use with text; they are now used primarily to specify how mathematical expressions and formulas should appear in print.
An HTML document is a mixture of content and controls. The controls are specified by the tags of HTML. The name of a tag specifies the category of its content. Most HTML tags consist of a pair of syntactic markers that are used to delimit particular kinds of content. The pair of tags and their content together are called an element. For example, a paragraph element specifies that its content, which appears between its opening tag, <p>, and its closing tag, </p>, is a paragraph. A browser has a default style (font, font style, font size, and so forth) for paragraphs, which is used to display the content of a paragraph element.
Some tags include attribute specifications that provide some additional information for the browser. In the following example, the src attribute specifies the location of the img tag’s image content:
<img src = "redhead.jpg"/>
In this case, the image document stored in the file redhead.jpg is to be displayed at the position in the document in which the tag appears.
Tools for Creating HTML Documents
HTML documents can be created with a general-purpose text editor. There are two kinds of tools that can simplify this task: HTML editors and What-You-See-Is-What-You-Get (WYSIWYG, pronounced wizzy-wig) HTML editors.
HTML editors provide shortcuts for producing repetitious tags such as those used to create the rows of a table. They also may provide a spell-checker and a syntax-checker, and they may color code the HTML in the display to make it easier to read and edit.
A more powerful tool for creating HTML documents is a WYSIWYG HTML editor. Using a WYSIWYG HTML editor, the writer can see the formatted document that the HTML describes while he or she is writing the HTML code. WYSIWYG HTML editors are very useful for beginners who want to create simple documents without learning HTML and for users who want to prototype the appearance of a document. Still, these editors sometimes produce poor-quality HTML. In some cases, they create proprietary tags that some browsers will not recognize.
Two examples of WYSIWYG HTML editors are Microsoft Expression Web and Adobe Dreamweaver. Both allow the user to create HTML-described documents without requiring the user to know HTML. They cannot handle all the tags of HTML, but they are very useful for creating many of the common features of documents. Information on Dreamweaver is available at http://www.adobe.com/; information on Expression Web is available at http://www.microsoft.com/.
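Whatever tool produces it, an HTML document is ultimately the mixture of tags, attributes, and content described in the overview of HTML. One way to see that structure is to list the opening tags a parser finds in a fragment. The sketch below uses Python's standard html.parser module; the class and function names are assumptions for illustration, and the fragment reuses the paragraph and image examples from the overview.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Record each opening tag together with its attribute specifications."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


def list_elements(html_text):
    """Return (tag name, attributes) for every opening tag in the text."""
    lister = ElementLister()
    lister.feed(html_text)
    lister.close()
    return lister.seen


for tag, attributes in list_elements('<p>A paragraph.</p><img src="redhead.jpg"/>'):
    print(tag, attributes)
```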
Plug-ins and Filters
Two different kinds of converter tools can be used to create HTML documents: plug-ins and filters.
One kind of tool is a filter, which converts an existing document in some form, such as LaTeX or Microsoft Word, to HTML. Filters are never part of the editor or word processor that created the document, which is an advantage because the filter can then be platform independent. For example, a WordPerfect user working on a Macintosh computer can use a filter running on a UNIX platform to produce HTML documents with the same content on that machine. The disadvantage of filters is that creating HTML documents with a filter is a two-step process: First you create the document, and then you use a filter to convert it to HTML.
Neither plug-ins nor filters produce HTML documents that, when displayed by browsers, have the identical appearance of that produced by the word processor. The two advantages of both plug-ins and filters, however, are that existing documents produced with word processors can be easily converted to HTML and that users can use a word processor with which they are familiar to produce HTML documents. This obviates the need to learn to format text by using HTML directly. For example, once you learn to create tables with your word processor, it is easier to use that process than to learn to define tables directly in HTML.
The HTML output produced by either filters or plug-ins often must be modified, usually with a simple text editor, to perfect the appearance of the displayed document in the browser. Because this new HTML file cannot be converted to its original form (regardless of how it was created), you will have two different source files for a document, inevitably leading to version problems during maintenance of the document. This is clearly a disadvantage of using converters.
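The two-step nature of a filter can be illustrated with a deliberately small one that converts plain text to HTML. This is a sketch under simple assumptions, not a model of any real converter: the function name text_to_html is made up, paragraphs are taken to be separated by blank lines, and formats such as LaTeX or Word would need far more work.

```python
import html


def text_to_html(text, title="Untitled"):
    """Convert plain text, with blank lines between paragraphs, to an HTML document."""
    paragraphs = [block for block in text.split("\n\n") if block.strip()]
    body_lines = []
    for block in paragraphs:
        # Join wrapped lines, then escape characters that are special in HTML.
        content = html.escape(" ".join(block.split()))
        body_lines.append(f"<p>{content}</p>")
    body = "\n".join(body_lines)
    return (
        "<html>\n"
        f"<head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )


print(text_to_html("Fish & chips\nserved daily.\n\nClosed on Mondays.", "Menu"))
```

The output is a second source file derived from the first, which is exactly the situation that leads to the version problems described above once the HTML is edited by hand.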
Overview of PHP
Overview of Rails