.NC "The Design of Unix IPC" .sh 1 "General" .pp The ARGO implementation of TP and CLNP was designed to fit into the AOS kernel as easily as possible. All the standard protocol hooks are used. To understand the design, it is useful to have read Leffler, Joy, and Fabry: \*(lq4.2 BSD Networking Implementation Notes\*(rq July 1983. This section describes the design of the IPC support in the AOS kernel. .sh 1 "Functional Unit Overview" .pp The AOS kernel is a monolithic program of considerable size and complexity. The code can be separated into parts of distinct function, but there are no kernel processes per se. The kernel code is either executed on behalf of a user process, in which case the kernel was entered by a system call, or it is executed on behalf of a hardware or software interrupt. The following sections describe briefly the major functional units of the kernel. .\" FIGURE .so figs/func_units.nr .CF shows the arrangement of these kernel units and their interactions. .sh 2 "The file system." .pp .sh 2 "Virtual memory support." .pp This includes protection, swapping, paging, and text sharing. .sh 2 "Blocked device drivers (disks, tapes)." .pp All these drivers share some minor functional units, such as buffer management and bus support for the various types of busses on the machine. .sh 2 "Interprocess communication (IPC)." .pp This includes support for various protocols, buffer management, and a standard interface for inter-protocol communication. .sh 2 "Network interface drivers." .pp These drivers are closely tied to the IPC support. They use the IPC's buffer management unit rather than the buffers used by the blocked device drivers. The interface between these drivers and the rest of the kernel differs from the interface used by the blocked devices. .sh 2 "Tty driver" .pp This is terminal support, including the user interface and the device drivers. .sh 2 "System call interface." .pp This handles signals, traps, and system calls. .sh 2 "Clock." .pp The clock is used in various forms by many other units. .sh 2 "User process support (the rest)." .pp This includes support for accounting, process creation, control, scheduling, and destruction. .pp .sh 2 "IPC" .pp The major functional unit that supports IPC can be divided into the following smaller functional units. .sh 3 "Buffer management." .pp All protocols share a pool of buffers called \fImbufs\fR: .(b \fC .TS tab(+); l s s s. struct mbuf { .T& l l l l. +struct mbuf+*m_next;+/* next buffer in chain */ +u_long+m_off;+/* offset of data */ +short+m_len;+/* amount of data */ +short+m_type;+/* mbuf type (0 == free) */ +u_char+m_dat[MLEN];+/* data storage */ +struct mbuf+*m_act;+/* link in 2-d structure */ }; .TE \fR .)b .pp There are two forms of mbufs - small ones and large ones. Small ones are 128 octets in AOS and 256 octets in the ARGO release. Small mbufs are copied by byte-to-byte copies. The data in these mbufs are kept in the character array field \fIm_dat\fR in the mbuf structure itself. For this type of mbuf, the field \fIm_off\fR is positive, and is the offset to the beginning of the data from the beginning of the mbuf structure itself. Large mbufs, called \fIclusters\fR, are page-sized and page-aligned. They may be \*(lqcopied\*(rq by multiply mapping the pages they occupy. They consist of a page of memory plus a small mbuf structure whose fields are used to link clusters into chains, but whose \fIm_dat\fR array is not used. The \fIm_off\fR field of the structure is the offset (positive or negative) from the beginning of the mbuf structure to the beginning of the data page part of the cluster. In the case of clusters, the offset is always out of the bounds of the \fIm_dat\fR array and so it is alway possible to tell from the \fIm_off\fR field whether an mbuf structure is part of a cluster or is a small mbuf. All mbufs permanently reside in memory. The mbuf management unit manages its own page table. The mbuf manager keeps limited statistics on the quantities and types of buffers in use. Mbufs are used for many purposes, and most of these purposes have a type associated with them. Some of the types that buffers may take are MT_FREE (not allocated), MT_DATA, MT_HEADER, MT_SOCKET (socket structure), MT_PCB (protocol control block), MT_RTABLE (routing tables), and MT_SOOPTS (arguments passed to \fIgetsockopt()\fR and \fIsetsockopt()\fR. Data are passed among functional units by means of queues, the contents of which are either chains of mbufs or groups of chains of mbufs. Mbufs are linked into chains with the \fIm_next\fR field. Chains of mbufs are linked into groups with the \fIm_act\fR field. The \fIm_act\fR field allows a protocol to retain packet boundaries in a queue of mbufs. .sh 3 "Routing." .pp Routing decisions in the kernel are made by the procedure \fIrtalloc()\fR. This procedure will scan the kernel routing tables (stored in mbufs) looking for a route. A route is represented by .(b \fC .TS tab(+); l s s s. struct rtentry { .T& l l l l. +u_long+rt_hash;+/* to speed lookups */ +struct sockaddr+rt_dst;+/* key */ +struct sockaddr+rt_gateway;+/* value */ +short+rt_flags;+/* up/down?, host/net */ +short+rt_refcnt;+/* # held references */ +u_long+rt_use;+/* raw # packets forwarded */ +struct ifnet+*rt_ifp;+/* interface to use */ } .TE \fR .)b When looking for a route, \fIrtalloc()\fR will first hash the entire destination address, and scan the routing tables looking for a complete route. If a route is not found, then \fIrtalloc()\fR will rescan the table looking for a route which matches the \fInetwork\fR portion of the address. If a route is still not found, then a default route is used (if present). .pp If a route is found, the entity which called \fIrtalloc()\fR can use information from the \fIrtentry\fR structure to dispatch the datagram. Specifically, the datagram is queued on the interface identified by the interface pointer \fIrt_ifp\fR. .sh 3 "Socket code." .pp This is the protocol-independent part of the IPC support. Each communication endpoint (which may or may not be associated with a connection) is represented by the following structure: .(b \fC .TS tab(+); l s s s. struct socket { .T& l l l l. +short+so_type;+/* type, e.g. SOCK_DGRAM */ +short+so_options;+/* from socket call */ +short+so_linger;+/* time to linger @ close */ +short+so_state;+/* internal state flags */ +caddr_t+so_pcb;+/* network layer pcb */ +struct protosw+*so_proto;+/* protocol handle */ +struct socket+*so_head;+/* ptr to accept socket */ +struct socket+*so_q0;+/* queue of partial connX */ +short+so_q0len;+/* # partials on so_q0 */ +struct socket+*so_q;+/* queue of incoming connX */ +short+so_qlen;+/* # connections on so_q */ +short+so_qlimit;+/* max # queued connX */ +struct sockbuf+{ ++short+sb_cc;+/* actual chars in buffer */ ++short+sb_hiwat;+/* max actual char count */ ++short+sb_mbcnt;+/* chars of mbufs used */ ++short+sb_mbmax;+/* max chars of mbufs to use */ ++short+sb_lowat;+/* low water mark (not used yet) */ ++short+sb_timeo;+/* timeout (not used ) */ ++struct mbuf+*sb_mb;+/* the mbuf chain */ ++struct proc+*sb_sel;+/* process selecting */ ++short+sb_flags;+/* flags, see below */ +} so_rcv, so_snd; +short+so_timeo;+/* connection timeout */ +u_short+so_error;+/* error affecting connX */ +short+so_oobmark;+/* oob mark (TCP only) */ +short+so_pgrp;+/* pgrp for signals */ } .TE \fR .)b .pp The socket code maintains a pair of queues for each socket, \fIso_rcv\fR and \fIso_snd\fR. Each queue is associated with a count of the number of characters in the queue, the maximum number of characters allowed to be put in the queue, some status information (\fIsb_flags\fR), and several unused fields. For a send operation, data are copied from the user's address space into chains of mbufs. This is done by the socket module, which then calls the underlying transport protocol module to place the data on the send queue. This is generally done by appending to the chain beginning at \fIsb_mb\fR. The socket module copies data from the \fIso_rcv\fR queue to the user's address space to effect a receive operation. The underlying transport layer is expected to have put incoming data into \fIso_rcv\fR by calling procedures in this module. .in -5 .sh 3 "Transport protocol management." .pp All protocols and address types must be \*(lqregistered\*(rq in a common way in order to use the IPC user interface. Each protocol must have an entry in a protocol switch table. Each entry takes the form: .(b \fC .TS tab(+); l s s s. struct protosw { .T& l l l l. +short+pr_type;+/* socket type used for */ +short+pr_family;+/* protocol family */ +short+pr_protocol;+/* protocol # from the database */ +short+pr_flags;+/* status information */ +++/* protocol-protocol hooks */ +int+(*pr_input)();+/* input (from below) */ +int+(*pr_output)();+/* output (from above) */ +int+(*pr_ctlinput)();+/* control input */ +int+(*pr_ctloutput)();+/* control output */ +++/* user-protocol hook */ +int+(*pr_usrreq)();+/* user request: see list below */ +++/* utility hooks */ +int+(*pr_init)();+/* initialization hook */ +int+(*pr_fasttimo)();+/* fast timeout (200ms) */ +int+(*pr_slowtimo)();+/* slow timeout (500ms) */ +int+(*pr_drain)();+/* free some space (not used) */ } .TE \fR .)b .pp Associated with each protocol are the types of socket abstractions supported by the protocol (\fIpr_type\fR), the format of the addresses used by the protocol (\fIpr_family\fR), the routines to be called to perform a standard set of protocol functions (\fIpr_input\fR,...,\fIpr_drain\fR), and some status information (\fIpr_flags\fR). The field pr_flags keeps such information as SS_ISCONNECTED (this socket has a peer), SS_ISCONNECTING (this socket is in the process of establishing a connection), SS_ISDISCONNECTING (this socket is in the process of being disconnected), SS_CANTSENDMORE (this socket is half-closed and cannot send), SS_CANTRCVMORE (this socket is half-closed and cannot receive). There are some flags that are specific to the TCP concept of out-of-band data. A flag SS_OOBAVAIL was added for the ARGO implementation, to support the TP concept of out-of-band data (expedited data). .sh 3 "Network Interface Drivers" .pp The drivers for the devices attaching a Unix machine to a network medium share a common interface to the protocol software. There is a common data structure for managing queues, not surprisingly, a chain of mbufs. There is a set of macros that are used to enqueue and dequeue mbuf chains at high priority. A driver delivers an indication to a protocol entity when an incoming packet has been placed on a queue by issuing a software interrupt. .sh 3 "Support for individual protocols." .pp Each protocol is written as a separate functional unit. Because all protocols share the clock and the mbuf pool, they are not entirely insulated from each other. The details of TP are described in a section that follows. .\"***************************************************** .\" FIGURE .so figs/unix_ipc.nr .pp .CF shows the arrangement of the IPC support. .pp The AOS IPC was designed for DoD Internet protocols, all of which run over DoD IP. The assumptions that DoD Internet is the domain and that DoD IP is the network layer appear in the code and data structures in numerous places. For example, it is assumed that addresses can be compared by a bitwise comparison of 4 octets. Another example is that the transport protocols all directly call IP routines. There are no hooks in the data structures through which the transport layer can choose a network level protocol. A third example is that the host's local addresses are stored in the network interface drivers and the drivers have only one address - an Internet address. A fourth example is that headers are assumed to fit in one small mbuf (112 bytes for data in AOS). A fifth example is this: It is assumed in many places that buffer space is managed in units of characters or octets. The user data are copied from user address space into the kernel mbufs amorphously by the socket code, a protocol-independent part of the kernel. This is fine for a stream protocol, but it means that a packet protocol, in order to \*(lqpacketize\*(rq the data, must perform a memory-to-memory copy that might have been avoided had the protocol layer done the original copy from user address space. Furthermore, protocols that count credit in terms of packets or buffers rather than characters do not work efficiently because the computation of buffer space is not in the protocol module, but rather it is in the socket code module. This list of examples is not complete. .pp To summarize, adding a new transport protocol to the kernel consists of adding entries to the tables in the protocol management unit, modifying the network interface driver(s) to recognize new network protocol identifiers, adding the new system calls to the kernel and to the user library, and adding code modules for each of the protocols, and correcting deficiencies in the socket code, where the assumptions made about the nature of transport protocols do not apply.