The HTTP Protocol

Version 1.0 of the HTTP protocol is specified by RFC1945. You can obtain a copy from the World Wide Web Consortium (see [WWWC]). Version 1.1 of the protocol is specified in RFC2616. Also see RFC2396 for the general URI syntax.

The sections below describe the protocol together with the restrictions I will make on it.

URL Syntax

The full monty for the URL syntax, from the RFC, is

URL            = ( absoluteURL | relativeURL ) [ "#" fragment ]

absoluteURL    = scheme ":" *( uchar | reserved )

relativeURL    = net_path | abs_path | rel_path

net_path       = "//" net_loc [ abs_path ]
abs_path       = "/" rel_path
rel_path       = [ path ] [ ";" params ] [ "?" query ]

path           = fsegment *( "/" segment )
fsegment       = 1*pchar
segment        = *pchar

params         = param *( ";" param )
param          = *( pchar | "/" )

scheme         = 1*( ALPHA | DIGIT | "+" | "-" | "." )
net_loc        = *( pchar | ";" | "?" )
query          = *( uchar | reserved )
fragment       = *( uchar | reserved )

pchar          = uchar | ":" | "@" | "&" | "=" | "+"
uchar          = unreserved | escape
unreserved     = ALPHA | DIGIT | safe | extra | national

escape         = "%" HEX HEX
reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
safe           = "$" | "-" | "_" | "."
unsafe         = CTL | SP | <"> | "#" | "%" | "<" | ">"
national       = <any OCTET excluding ALPHA, DIGIT,
                 reserved, extra, safe, and unsafe>

In this syntax "*()" means zero or more repetitions and "1*()" means one or more. The URL syntax allows national characters such as accented letters as long as they are 8-bit characters from a character set that includes ASCII. For example, ISO-8859-1 ("Latin 1") would be fine. This doesn't restrict the characters allowed in pages, though. Those are constrained only by the MIME type of the page.

Since I am only implementing the "http" scheme I will actually be implementing this syntax:

http_URL       = "http:" "//" host [ ":" port ] [ abs_path ]

host           = <A legal Internet host domain name
                 or IP address (in dotted-decimal form),
                 as defined by Section 2.1 of RFC 1123>

port           = *DIGIT

The host and scheme names are case-insensitive. If the port is empty or not given, port 80 is assumed. Only TCP connections will be used. Only absolute paths are allowed and they are case-sensitive.

The canonical form for "http" URLs is obtained by converting any uppercase alphabetic characters in the host name to their lowercase equivalent (host names are case-insensitive), eliding the [":" port] if the port is 80, and replacing an empty abs_path with "/".
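As a rough illustration of these rules, the following Python sketch canonicalises an "http" URL. The function name canonical_http_url and the regular expression are assumptions made for this example. They are not taken from the RFC, and the host is not validated against RFC 1123.

```python
import re

# Hypothetical pattern for: "http:" "//" host [ ":" port ] [ abs_path ]
# The host is only checked to be non-empty and free of "/" and ":".
_HTTP_URL = re.compile(r"http://([^/:]+)(?::(\d*))?(/.*)?", re.IGNORECASE)

def canonical_http_url(url):
    """Return the canonical form of an http URL, or raise ValueError."""
    m = _HTTP_URL.fullmatch(url)
    if m is None:
        raise ValueError("not an http URL: %r" % (url,))
    host, port, abs_path = m.groups()
    host = host.lower()                       # host names are case-insensitive
    port_number = int(port) if port else 80   # empty or missing port means 80
    if port_number != 80:
        host = "%s:%d" % (host, port_number)  # keep only a non-default port
    return "http://" + host + (abs_path or "/")   # empty abs_path becomes "/"
```

The path is passed through untouched because paths are case-sensitive. Only the scheme, host and port are normalised.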

Characters may be encoded by the "%" escape sequence if they are unsafe or reserved. When parsing a URL the path will be split up according to the reserved characters before escapes are interpreted. So the path /%2Fabc/def has /abc as the name of its first segment and def as the name of the second segment. I will reject URLs whose decoded segments contain a forward slash or a NUL character so that the segments can be mapped directly to file names.
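The order of operations matters here: split first, decode second. The Python sketch below shows one way to do it. The names split_segments and check_segments are hypothetical, and the sketch handles only the path part, ignoring any ";" params or "?" query.

```python
from urllib.parse import unquote_to_bytes

def split_segments(abs_path):
    """Split an abs_path on "/" first, then decode "%" escapes per segment."""
    if not abs_path.startswith("/"):
        raise ValueError("only absolute paths are allowed")
    # Splitting happens on the raw text, so an escaped "%2F" stays inside
    # its segment instead of creating a new one.
    return [unquote_to_bytes(raw) for raw in abs_path[1:].split("/")]

def check_segments(segments):
    """Reject decoded names that could not be used directly as file names."""
    for name in segments:
        if b"/" in name or b"\x00" in name:
            raise ValueError("unusable segment name: %r" % (name,))
    return segments
```

With these definitions split_segments("/%2Fabc/def") gives the two names b"/abc" and b"def", and check_segments then rejects the first of them.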

HTTP Requests

Each request is made over a separate TCP connection to the server. (Version 1.1 of the protocol allows more than one request per connection, which is a lot more efficient.) The RFC says:

    ... current practice requires that the connection be established
    by the client prior to each request and closed by the server
    after sending the response. Both clients and servers should be
    aware that either party may close the connection prematurely,
    due to user action, automated time-out, or program failure, and
    should handle such closing in a predictable fashion. In any case,
    the closing of the connection by either or both parties always
    terminates the current request, regardless of its status.

All lines in the message are supposed to be terminated with a CR-LF character pair but applications must also accept a single CR or LF character. In the body of the page the line termination will depend on the MIME type but CRLF should be used for text types.
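A tolerant line splitter along these lines might look like the following Python sketch. The name split_lines is an assumption for illustration, and the input is assumed to be already decoded text.

```python
import re

# CR-LF must be tried first so that it counts as one terminator, not two.
_LINE_END = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text on CR-LF, a lone CR, or a lone LF."""
    return _LINE_END.split(text)
```

An explicit pattern is used rather than Python's str.splitlines because that method also breaks on other characters, such as form feed and vertical tab, which are not line terminators here.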

A request message looks like:

Full-Request   = Request-Line
             *( General-Header
              | Request-Header
              | Entity-Header )
             CRLF
             [ Entity-Body ]

Request-Line = Method SP Request-URL SP HTTP-Version CRLF

Method       =  "GET"
              | "HEAD"
              | "POST"

General-Header = Date 
              | Pragma

Request-Header = Authorization
              | From
              | If-Modified-Since
              | Referer
              | User-Agent

Entity-Header  = Allow
              | Content-Encoding
              | Content-Length
              | Content-Type
              | Expires
              | Last-Modified
              | extension-header

An example request line is

GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.0

After the request line come zero or more headers and then a blank line to terminate the headers. The entity body is only used to supply data for the POST method.
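To make the Request-Line syntax concrete, here is a minimal Python sketch that splits a request line into its three parts. The function name parse_request_line and the choice to return the version as a pair of integers are assumptions for this example.

```python
import re

METHODS = ("GET", "HEAD", "POST")          # the only methods in the grammar
_VERSION = re.compile(r"HTTP/(\d+)\.(\d+)", re.ASCII)

def parse_request_line(line):
    """Return (method, request_url, (major, minor)) or raise ValueError."""
    parts = line.rstrip("\r\n").split(" ")  # exactly one SP between the parts
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % (line,))
    method, url, version = parts
    if method not in METHODS:               # method names are case-sensitive
        raise ValueError("unsupported method: %r" % (method,))
    m = _VERSION.fullmatch(version)
    if m is None:
        raise ValueError("bad HTTP version: %r" % (version,))
    return method, url, (int(m.group(1)), int(m.group(2)))
```

Applied to the example request line, this yields the method "GET", the URL unchanged, and the version (1, 0).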

Each header consists of a name followed immediately by a colon (":"), a single space (SP) character, and the field value. Field names are case-insensitive. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT (horizontal tab), though this is not recommended.

HTTP-header    = field-name ":" [ field-value ] CRLF

field-name     = token
field-value    = *( field-content | LWS )

field-content  = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, tspecials, and quoted-string>

In this syntax the following definitions are used.

token          = 1*<any character except CTLs or tspecials>

tspecials      = "(" | ")" | "<" | ">" | "@"
                  | "," | ";" | ":" | "\" | <">
                  | "/" | "[" | "]" | "?" | "="
                  | "{" | "}" | SP | HT

TEXT           = <any OCTET except CTLs, but including LWS>

CTL            = a control character or DEL (ASCII 127)

LWS            = [CRLF] 1*( SP | HT )

quoted-string  = Any sequence of characters except double-quote and CTLs,
                 but including LWS, enclosed in double-quote characters.
                 There is no backslash quoting of characters within strings.
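The header syntax, including continuation lines, could be handled roughly as in the Python sketch below. The names is_token and parse_headers are hypothetical. The sketch assumes the header block has already been split into lines, without terminators and without the final blank line, and it joins a continuation to the previous value with a single space.

```python
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')

def is_token(text):
    """True if text is one or more characters that are not CTLs or tspecials."""
    return bool(text) and all(
        32 < ord(c) < 127 and c not in TSPECIALS for c in text)

def parse_headers(lines):
    """Return a list of (lower-cased name, value) pairs, or raise ValueError."""
    headers = []
    for line in lines:
        if line[:1] in (" ", "\t"):          # continuation of the previous field
            if not headers:
                raise ValueError("continuation line with no header before it")
            name, value = headers[-1]
            headers[-1] = (name, (value + " " + line.strip()).strip())
            continue
        name, colon, value = line.partition(":")
        if not colon or not is_token(name):
            raise ValueError("malformed header line: %r" % (line,))
        # Field names are case-insensitive, so store them in lower case.
        headers.append((name.lower(), value.strip()))
    return headers
```

A list of pairs is returned rather than a dictionary so that repeated field names and their order are preserved.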

The general headers are applicable to both requests and responses. They pertain to the message itself rather than the entity being transferred. The request headers provide extra information about the request. The entity headers provide information about the entity itself. I will use them only in the response. The next sections describe the headers.

The Date Header

This provides the date and time at which the message was originated. The preferred format of the date is the RFC822 format used in e-mail. For example:

    Date: Tue, 15 Nov 1994 08:12:31 GMT

A well behaved server should accept all of the following date formats:

Sun, 06 Nov 1994 08:49:37 GMT    ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT   ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994         ; ANSI C's asctime() format

All times are GMT (UTC). The following syntax describes all of the allowed date formats.

HTTP-date      = rfc1123-date | rfc850-date | asctime-date

rfc1123-date   = wkday "," SP date1 SP time SP "GMT"
rfc850-date    = weekday "," SP date2 SP time SP "GMT"
asctime-date   = wkday SP date3 SP time SP 4DIGIT

date1 = 2DIGIT SP month SP 4DIGIT          ; day month year
date2 = 2DIGIT "-" month "-" 2DIGIT        ; day-month-year
date3 = month SP ( 2DIGIT | ( SP 1DIGIT )) ; month day

time  = 2DIGIT ":" 2DIGIT ":" 2DIGIT       ; 00:00:00 - 23:59:59

wkday          = "Mon" | "Tue" | "Wed"
               | "Thu" | "Fri" | "Sat" | "Sun"

weekday        = "Monday" | "Tuesday" | "Wednesday"
               | "Thursday" | "Friday" | "Saturday" | "Sunday"

month          = "Jan" | "Feb" | "Mar" | "Apr"
               | "May" | "Jun" | "Jul" | "Aug"
               | "Sep" | "Oct" | "Nov" | "Dec"
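One way to accept all three formats is simply to try each in turn. The Python sketch below does this with datetime.strptime. The function names are assumptions for this example, and the sketch relies on the default "C" locale for the English day and month names. It does not check that the weekday agrees with the date.

```python
from datetime import datetime, timezone

_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",   # rfc1123-date
    "%A, %d-%b-%y %H:%M:%S GMT",   # rfc850-date (two-digit year)
    "%a %b %d %H:%M:%S %Y",        # asctime-date
)

def parse_http_date(text):
    """Parse any of the three HTTP date formats into a UTC datetime."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue                # not this format, try the next one
    raise ValueError("unrecognised HTTP date: %r" % (text,))

def format_http_date(when):
    """Format a datetime in the preferred rfc1123-date form."""
    return when.astimezone(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
```

All three example dates shown earlier parse to the same instant, and format_http_date turns that instant back into the first (preferred) form.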

The Pragma Header

This is usually "Pragma: no-cache". In a request it tells any intermediate cache to forward the request to the origin server rather than answer from a cached copy. I won't generate it.

The Authorization Header

This header provides information such as a password to access secure information. I will support basic password protection. Typically what happens is that after a request has been received, if a password is needed, the server returns a status code of 401 along with a challenge header looking like:

WWW-Authenticate: Basic realm="WallyWorld"

The client must resend the request with an Authorization header such as

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

which contains a user id and password encoded as Base64[1]. (It decodes to "Aladdin:open sesame".) The syntax for the Authorization header is:

basic-credentials = "Basic" SP basic-cookie

basic-cookie      = <base64 encoding of userid-password,
                    except not limited to 76 char/line>

userid-password   = [ token ] ":" *TEXT

The client can send the Authorization with the initial request if it has already prompted the user for a password. See RFC1945 for more details for HTTP 1.0 or RFC2617 for HTTP 1.1.
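Decoding the credentials is straightforward. The Python sketch below is one possible approach. The function name parse_basic_authorization is an assumption, and so is the choice to interpret the decoded bytes as ISO-8859-1.

```python
import base64

def parse_basic_authorization(value):
    """Return (userid, password) from an Authorization field value."""
    scheme, _, cookie = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("unsupported authorization scheme: %r" % (scheme,))
    try:
        decoded = base64.b64decode(cookie.strip(), validate=True)
    except ValueError:              # also covers binascii.Error, a subclass
        raise ValueError("credentials are not valid Base64") from None
    # userid-password = [ token ] ":" *TEXT, so split at the first colon only;
    # the password itself may contain further colons.
    userid, colon, password = decoded.partition(b":")
    if not colon:
        raise ValueError("credentials lack the ':' separator")
    return userid.decode("latin-1"), password.decode("latin-1")
```

For the example header above, the field value "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==" yields the user id "Aladdin" and the password "open sesame".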

The From Header

This identifies the person sending the request.

From: webmaster@w3.org

It's not normally used but I'll recognise it and pass it on.

The If-Modified-Since Header

The If-Modified-Since request-header field is used with the GET method to make it conditional: if the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned from the server; instead, a 304 (not modified) response will be returned without any Entity-Body.

An example of the field is:

If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT

I'll recognise but ignore this header.
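For comparison, a server that did honour the header would make a decision along the lines of this Python sketch. The function name status_for_get is hypothetical. Both arguments are assumed to be timezone-aware UTC datetimes, and the header value is assumed to have been parsed already.

```python
from datetime import datetime, timezone

def status_for_get(last_modified, if_modified_since=None):
    """Return 304 if the resource is unchanged since the given time, else 200."""
    if if_modified_since is not None:
        # HTTP dates have one-second resolution, so drop any microseconds
        # from the file time before comparing.
        if last_modified.replace(microsecond=0) <= if_modified_since:
            return 304              # not modified: no Entity-Body is sent
    return 200                      # send the resource as usual
```

Ignoring the header, as this server does, is equivalent to always taking the 200 branch.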

The Referer Header

This header provides the URL from which the request originated, if appropriate. For example, if a user clicks on a link in a web page then the referer URL is the URL of that page. This is sometimes used to control access to pages, for example to prevent a page from being accessed unless the user has passed through a sign-on page.

An example is

Referer: http://www.w3.org/hypertext/DataSources/Overview.html

I'll recognise the header and pass it on.

The User-Agent Header

This identifies the kind of browser or whatever that generated the request. I'll recognise the header and pass it on.

The Allow Header

This is used in responses. I won't generate it. See the RFC for more details.

The Content-Encoding Header

This is used to indicate if the entity is compressed or otherwise encoded. I won't generate it in responses. An example is:

Content-Encoding: x-gzip

The Content-Length Header

This provides the size of the entity in bytes, counted from the first byte after the blank line (CR-LF) that terminates the headers. I will always generate a content length. An example is:

Content-Length: 3495

The Content-Type Header

This provides the MIME type for the entity. An example is:

Content-Type: text/html

I will generate: text/plain, text/directory, text/html, image/jpeg, image/gif, image/png where appropriate.
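The text does not say how the type is chosen. One common approach, shown in the Python sketch below purely as an assumption, is to look at the file name extension. The table, the function name content_type_for, the use of text/directory for directory listings, and the text/plain fallback are all hypothetical choices for this example.

```python
import os.path

# Hypothetical extension table covering only the types listed above.
CONTENT_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".png": "image/png",
}

def content_type_for(path, is_directory=False):
    """Guess a Content-Type value from a file name."""
    if is_directory:
        return "text/directory"     # assumed use: a listing of the directory
    extension = os.path.splitext(path)[1].lower()
    return CONTENT_TYPES.get(extension, "text/plain")   # assumed fallback
```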

The Expires Header

This is used in a response to tell the client how long to cache the document. I won't generate this.

The Last-Modified Header

This provides the date and time when the entity was last modified. I'll generate this. An example is:

Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT

Extension Headers

Any other headers are allowed as long as their syntax is valid. I'll just ignore them.

HTTP Responses

A response looks a lot like a request.

Full-Response  = Status-Line 
                *( General-Header
                 | Response-Header
                 | Entity-Header )
                CRLF
                [ Entity-Body ]

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

Response-Header = Location
               | Server
               | WWW-Authenticate

The first line returns the status of the request: success, failure etc. A numeric status code is provided for programs to read. A textual version is provided for (some kind of) human readability in error messages.

The standard status codes are:

Status-Code    = "200"   ; OK
              | "201"   ; Created
              | "202"   ; Accepted
              | "204"   ; No Content
              | "301"   ; Moved Permanently
              | "302"   ; Moved Temporarily
              | "304"   ; Not Modified
              | "400"   ; Bad Request
              | "401"   ; Unauthorized
              | "403"   ; Forbidden
              | "404"   ; Not Found
              | "500"   ; Internal Server Error
              | "501"   ; Not Implemented
              | "502"   ; Bad Gateway
              | "503"   ; Service Unavailable

Full details of the status codes can be found in the RFC. I'll just describe the few that the server will use.

200 - OK

The entity follows in the Entity-Body section.

204 - No Content

The request succeeded but there is no entity to return. The Entity-Body section is omitted.

401 - Unauthorized

The client must supply a password to get the URL.

404 - Not Found

You know what this means.

500 - Internal Server Error

General cop-out.

501 - Not Implemented

I'll have a lot of this.
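Putting the status line, headers and body together for the handful of codes listed above might look like the Python sketch below. The function name build_response, its parameters, and the order in which headers are emitted are assumptions for this example.

```python
REASONS = {
    200: "OK",
    204: "No Content",
    401: "Unauthorized",
    404: "Not Found",
    500: "Internal Server Error",
    501: "Not Implemented",
}

def build_response(status, body=None, content_type="text/html", extra_headers=()):
    """Return a complete HTTP/1.0 response message as bytes."""
    lines = ["HTTP/1.0 %d %s" % (status, REASONS[status])]   # Status-Line
    lines.extend("%s: %s" % (name, value) for name, value in extra_headers)
    if body is not None:
        lines.append("Content-Type: %s" % content_type)
        # Content-Length counts bytes of the entity, not characters.
        lines.append("Content-Length: %d" % len(body))
    head = "\r\n".join(lines) + "\r\n\r\n"    # blank line ends the headers
    return head.encode("latin-1") + (body or b"")
```

For example, build_response(401, extra_headers=[("WWW-Authenticate", 'Basic realm="WallyWorld"')]) produces a challenge with no entity, while build_response(200, b"<p>hi</p>") produces a page with its type and length.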

The response headers provide extra details for the response itself such as elaborating on the status code. They are described in the following sections. I will only use the WWW-Authenticate header.

The Location Header

This provides the location for status codes that redirect the client to some other location such as the 30x codes. An example is:

Location: http://www.w3.org/hypertext/WWW/NewLocation.html

The Server Header

This provides identification for the server e.g. its name and version. I won't be using this.

The WWW-Authenticate Header

This header is returned along with a 401 status code to request the client to authenticate itself. More details can be found in the section called The Authorization Header.

Notes

[1]

See RFC1521 for a description of Base64 encoding.