/* -*- text -*- */ /**@MODULEPAGE "msg" - Message Parser Module @section msg_meta Module Meta Information This module contains parser and functions for manipulating messages and headers for text-based protocols like SIP, HTTP or RTSP. It also provides parsing of MIME headers and MIME multipart messages common to these protocols. @CONTACT Pekka Pessi @STATUS @SofiaSIP Core library @LICENSE LGPL @par Contributor(s): - Pekka Pessi @section msg_contents Contents of msg Module The msg module contains the public header files as follows: - base message interfaces - message and header struct definitions and typedefs - prototypes of header-specific functions for generic headers - function prototypes and macros for manipulating message headers - functions for accessing network addresses and I/O vectors associated with the message - types and functions for handling dates and times - types, function prototypes and macros for MIME headers and @ref msg_multipart "multipart messages" - prototypes of MIME-header-specific functions In addition to this interface, the @ref msg_parser "parser documentation" contains description of the functionality required when an existing parser is extended by a new header or a parser is created for a completely new protocol. It is possible to add new headers to the parser or extend the definition of existing ones. The header files used for constructing these parsers are as follows: - parsing functions, macros - message factory object definition - hashing of header names @section msg_overview Parsers, Messages and Headers The Sofia @b msg module contains interface to the text-based parsers for RFC822-like message, the header and message objects. Currently, there are three parsers defined: SIP, HTTP, and MIME. The C structure corresponding to each header is defined either in a or in a protocol-specific header file. These protocol-specific header files include , , and . For each header, there is defined a @em header @em class structure, some standard functions, and tags for including them in tag lists. As a convention, all the identifiers for SIP headers start with prefix @c sip and all the macros with @c SIP. Same thing holds for HTTP, too: it uses prefix @c http. However, the MIME headers and the functions related to them are defined within the @b msg module and they use prefix @c msg. If a SIP or HTTP header uses a structure defined in , there is a typedef suitable for the particular protocol, for example @b Accept header is defined multiple times: @code typedef struct msg_accept_s sip_accept_t; typedef struct msg_accept_s http_accept_t; @endcode For header @e X of protocol @e NS, there are types, functions, macros and header class as follows: - @c ns_X_t is the structure used to store parsed header, - @c ns_hclass_t @c ns_X_class[] contains the @em header @em class for header X, - @c NS_X_INIT() initializes a static instance of @c ns_X_t, - @c ns_X_init() initializes a dynamic instance of @c ns_X_t, - @c ns_is_X() tests if header object is instance of header X, - @c ns_X_make() creates a header X object by decoding given string, - @c ns_X_format() creates a header X object by decoding given @c printf() list, - @c ns_X_dup() duplicates (deeply copies) the header X, - @c ns_X_copy() copies the header X, - @c NSTAG_X() is used include instance of @c ns_X_t in a tag list, and - @c NSTAG_X_STR() is used to include string containing value header in a tag list. The declarations of header tags and the prototypes for these functions can be imported separately from the type definitions, for instance, the tags related to SIP headers are declared in the include file , and the header-specific functions in . @section parser_intro Parsing Text Messages Sofia text parser follows @em recursive-descent principle. In other words, it is a program that descends the syntax tree top-down recursively. (All syntax trees have root at top and they grow downwards.) In the case of SIP, HTTP and other similar protocols, such a parser is very efficient. The parser can choose between different forms based on each token, as the protocol syntax is carefully designed so that it requires only minimal scan-ahead. It is also easy to extend a recursive-descent parser via a standard API, unlike, for instance, a LALR parser generated by @em Bison. The abstract message module @b msg contains a high-level parser engine that drives the parsing process and invokes the protocol-specific parser for each header. As there is no low-layer framing between the RFC822-style messages, the parser considers any received data, be it a UDP datagram or a TCP stream, as a @em byte @em stream. The protocol-specific parsers controls how a byte stream is split into separate messages or if it consists of a single message only. The parser engine works by separating stream into fragments, then passing the fragment to a suitable parser. A fragment is a piece of message that is parsed during a single step: the first line, each header, the empty line between headers and message body, the message body. (In case of HTTP, the message body can consists of multiple fragments known as chunks.) The parser starts by separating the first line (e.g., request or status line) from the byte stream, then passing the line to the suitable parser. After first line comes the message headers. The parser continues parsing process by extracting headers, each on their own line, from the stream and passing contents of each header to its parser. The message structure is populated based on the parsing results. When an empty line - indicating end of headers - is encountered, the control is passed to the protocol-specific parser. Protocol-specific functions take care of extracting the possible message body from the byte stream. After parsing process is completed, it can be given to the upper layers (typically a protocol state machine). The parser continues processing the stream and feeding the messages to protocol engine until the end of the stream is reached. @image html sip-parser.gif Separating byte stream to messages @image latex sip-parser.eps Separating byte stream to messages When the parsing process has completed, the first line, each header, separator and the message body are all in their own fragment structure. The fragments form a dual-linked list known as @e fragment @e chain as shown in the above figure. The memory buffers for the message, the fragment chain, and a whole lot of other stuff is held by the generic message type, #msg_t, defined in . The internal structure of #msg_t is known only within @b msg module and it is opaque to other modules. The @b msg parser engine also drives the reverse process, invoking the encoding method of each fragment so that the whole outgoing message can be encoded properly. @section msg_header_struct Message Header as a C struct Just separating headers from each other and from the message body is not usually enough. When a header contains structured data, the header contents should be converted to a form that is convenient to use from C programs. For that purpose, the message parser needs a parsing function specific to each individual header. This parsing function divides the contents of the header into semantically meaningful segments and stores the result in the structure specific to each header. The parser engine passes the fragment contents to the parsing function after it has separated the fragment from the rest of the message. The parser engine selects correct @e header @e class either by implication (in case of first line), or it searches for the header class from the hash table using the header name as the hash key. The @e header @e class contains a pointer to the parsing function. The parser has also special header classes for headers with errors and @e unknown headers, header with a name that is not regocnized by the parser. For instance, the Accept header has following syntax: @code Accept = "Accept" ":" #( media-range [ accept-params ] ) media-range = ( "*" "/" "*" | ( type "/" "*" ) | ( type "/" subtype ) ) *( ";" parameter ) accept-params = ";" "q" "=" qvalue *( accept-extension ) accept-extension = ";" token [ "=" ( token | quoted-string ) ] @endcode When an Accept header is parsed, the header parser function (msg_accept_d()) separates the @e type, @e subtype, and each parameter in the list to strings. The parsing result is assigned to a #msg_accept_t structure, which is defined as follows: @code typedef struct msg_accept_s { msg_common_t ac_common[1]; //< Common fragment info msg_accept_t *ac_next; //< Pointer to next Accept header char const *ac_type; //< Pointer to type/subtype char const *ac_subtype; //< Points after first slash in type msg_param_t const *ac_params; //< List of parameters msg_param_t ac_q; //< Value of q parameter } msg_accept_t; @endcode The string containing the @e type is put into the @c ac_type field, the @e subtype after slash in the can be found in the @c ac_subtype field, and the list of @e accept-params (together with media-specific-parameters) is put in the @c ac_params array. If there is a @e q parameter present, a pointer to the @c qvalue is assigned to @c ac_q field. In the beginning of the header structure there are two boilerplate members. The @c ac_common[1] contains information common to all message fragments. The @c ac_next is a pointer to next header field with the same name, in case a message contains multiple @b Accept headers or multiple comma-separated header fields are located in a single line. @section msg_object_example Representing a Message as a C struct It is not enough to represent a message as a list of headers following each other. The programmer also needs a convenient way to access certain headers at the message level, for example, accessing directly the @b Accept header instead of going through all headers and examining their name. The structured view to the message is provided via a message-specific C struct. In general, its type is msg_pub_t (it provides public view to message). The protocol-specific type is #sip_t, #http_t or #msg_multipart_t for SIP, HTTP and MIME, respectively. So, a single message is represented by two objects, first object (#msg_t) is private to the @b msg module and opaque by an application programmer, second (#sip_t, #http_t or #msg_multipart_t) is a public protocol-specific structure accessible by all. @note The application programmer can obtain a pointer to the protocol-specific structure from an #msg_t object using msg_public() function. The msg_public() takes a protocol tag, a well-known identifier, as its argument. The SIP, HTTP and MIME already define a wrapper around msg_public(), for example, a #sip_t structure can be obtained with sip_object() function (or macro). As an example, the #sip_t structure is defined as follows: @code typedef struct sip_s { msg_common_t sip_common[1]; // Used with recursive inclusion msg_pub_t *sip_next; // Ditto void *sip_user; // Application data unsigned sip_size; // Size of the structure with // extension headers int sip_flags; // Parser flags sip_error_t *sip_error; // Erroneous headers sip_request_t *sip_request; // Request line sip_status_t *sip_status; // Status line sip_via_t *sip_via; // Via (v) sip_route_t *sip_route; // Route sip_record_route_t *sip_record_route; // Record-Route sip_max_forwards_t *sip_max_forwards; // Max-Forwards ... } sip_t; @endcode As you can see above, the public #sip_t structure contains the common header members that are also found in the beginning of a header structure. The @e sip_size indicates the size of the structure - the application can extend the parser and #sip_t structure beyond the original size. The @e sip_flags contains various flags used during the parsing and printing process. They are documented in the . These boilerplate members are followed by the pointers to various message elements and headers. @section msg_parsing_example Result of Parsing Process Let us now show how a simple message is parsed and presented to the applications. As an exampe, we choose a SIP request message with method BYE, including only the mandatory fields: @code BYE sip:joe@example.com SIP/2.0 Via: SIP/2.0/UDP sip.example.edu;branch=d7f2e89c.74a72681 Via: SIP/2.0/UDP pc104.example.edu:1030;maddr=110.213.33.19 From: Bobby Brown ;tag=77241a86 To: Joe User ;tag=7c6276c1 Call-ID: 4c4e911b@pc104.example.edu CSeq: 2 @endcode The figure below shows the layout of the BYE message above after parsing: @image html sip-parser2.gif BYE message and its representation in C @image latex sip-parser2.eps BYE message and its representation in C The leftmost box represents the message of type #msg_t. Next box from the left reprents the #sip_t structure, which contains pointers to a header objects. The next column contains the header objects. There is one header object for each message fragment. The rightmost box represents the I/O buffer used when the message was received. Note that the I/O buffer may be non-continous and composed of many separate memory areas. The message object has link to the public message structure (@a m_object), to the dual-linked fragment chain (@a m_frags) and to the I/O buffer (@a m_buffer). The public message structure contains pointers to the headers according to their type. If there are multiple headers of the same type (like there are two Via headers in the above message), the headers are put into a single-linked list. Each fragment has pointers to successing and preceding fragment. It also contains pointer to the corresponding data within the I/O buffer and its length. The main purpose of the fragment chain is to preserve the original order of the headers. If there were an third Via header after CSeq in the message, the fragment representing it would be after the CSeq header in the fragment chain but after second Via in the header list. @section msg_parsing_memory Example: Parsing a Complete Message The following code fragment is an example of parsing a complete message. The parsing process is more hairy when there is stream to be parsed. @code msg_t *parse_memory(msg_mclass_t const *mclass, char const data[], int len) { msg_t *msg; int m; msg_iovec_t iovec[2] = {{ 0 }}; msg = msg_create(mclass, 0); if (!msg) return NULL; m = msg_recv_iovec(msg, iovec, 2, n, 1); if (m < 0) { msg_destroy(msg); return NULL; } assert(m <= 2); assert(iovec[0].mv_len + iovec[1].mv_len == n); memcpy(iovec[0].mv_base, data, n = iovec[0].mv_len); if (m == 2) memcpy(iovec[1].mv_base + n, data + n, iovec[1].mv_len); msg_recv_commit(msg, iovec[0].mv_len + iovec[1].mv_len, 1); m = msg_extract(msg); assert(m != 0); if (m < 0) { msg_destroy(msg); return NULL; } return msg; } @endcode Let's go through this simple function, step by step. First, we get the @a data pointer and its size in bytes, @a len. We first initialize an I/O vector used to represent message with the parser. @code msg_t *parse_memory(msg_mclass_t const *mclass, char const data[], int len) { msg_t *msg; int m; msg_iovec_t iovec[2] = {{ 0 }}; @endcode The message class @a mclass (a parser driver object, #msg_mclass_t) is used to represent a particular protocol-specific parser instance. When a message object is created, it is given as an argument to msg_create() function: @code msg = msg_create(mclass, 0); if (!msg) return NULL; @endcode Next we obtain a memory buffer for data with msg_recv_iovec(). The memory buffer is usually a single continous memory area, but in some cases it may consist of two distinct areas. Therefore the @a iovec is used here to pass the buffers around. The @a iovec is also very handly as it can be directly passed to various system I/O calls. @code m = msg_recv_iovec(msg, iovec, 2, n, 1); if (m < 0) { msg_destroy(msg); return NULL; } @endcode These assumptions hold always true when you call msg_recv_iovec() first time with a complete message: @code assert(m >= 1 && m <= 2); assert(iovec[0].mv_len + iovec[1].mv_len == n); @endcode Next, we copy the data to the I/O vector and commit the copied data to the message. Earlier with msg_recv_iovec() we allocated buffer space for data, now calling msg_recv_commit() indicates that valid data has been copied to the buffer. The last parameter to msg_recv_commit() indicates that the end of stream is encountered and no more data is to be expected. @code memcpy(iovec[0].mv_base, data, n = iovec[0].mv_len); if (m == 2) memcpy(iovec[1].mv_base + n, data + n, iovec[1].mv_len); msg_recv_commit(msg, iovec[0].mv_len + iovec[1].mv_len, 1); @endcode We call msg_extract() next; it takes care of parsing the message. A fatal parsing error is indicated by returning -1. If the message is incomplete, msg_extract() returns 0. When a complete message has been parsed, a positive value is returned. We know that a message cannot be incomplete, as a call to msg_recv_commit() indicated to the parser that the end-of-stream has been encountered. @code m = msg_extract(msg); assert(m != 0); if (m < 0) { msg_destroy(msg); return NULL; } return msg; } @endcode */ /**@class msg_s msg.h * * @brief Message object. * * The message object is used by Sofia parsers for SIP and HTTP * protocols. The message object has an abstract, protocol-independent * inteface type #msg_t, and a separate public protocol-specific interface * #msg_pub_t (which is typedef'ed to #sip_t or #http_t depending * on the protocol). * * The main interface to abstract messages is defined in . The * network I/O interface used by transport protocols is defined in * . The protocol-specific parser table, also known as message * class, is defined in . (The message class is used as a * factory object when a message object is created with msg_create()). */ /**@typedef typedef struct msg_s msg_t; * * Message object. * * The @a msg_t is the type of a message object used by Sofia signaling * protocols and parsers. Its contents are not directly accessible. */ /**@typedef typedef struct msg_common_s msg_common_t; * * Common part of header. * * The @a msg_common_t is the base type of a message headers used by * protocol parsers. Instead of #msg_common_t, most interfaces use * #msg_header_t, which is supposed to be a union of all possible headers. */ /** * @defgroup msg_parser Parser Building Blocks * * This submodule contains the functions and types for building a * protocol-specific parser. */ /** * @defgroup msg_headers Headers * * This submodule contains the functions and types for handling message * headers and other elements. */ /** * @defgroup msg_mime MIME Headers * * This submodule contains the header classes, functions and types for * handling MIME headers (@RFC2045) and MIME multipart (@RFC2046) processing. * * The MIME headers implemented are as follows: * - @ref msg_accept "@b Accept header" * - @ref msg_accept_charset "@b Accept-Charser header" * - @ref msg_accept_encoding "@b Accept-Encoding header" * - @ref msg_accept_language "@b Accept-Language header" * - @ref msg_content_disposition "@b Content-Disposition header" * - @ref msg_content_encoding "@b Content-Encoding header" * - @ref msg_content_id "@b Content-ID header" * - @ref msg_content_location "@b Content-Location header" * - @ref msg_content_language "@b Content-Language header" * - @ref msg_content_md5 "@b Content-MD5 header" * - @ref msg_content_transfer_encoding "@b Content-Transfer-Encoding header" * - @ref msg_mime_version "@b MIME-Version header" */ /** * @defgroup test_msg Testing Parser * * This submodule contains the functions and types for building a * parser objects for testing purposes. */