broadcast can be carried out in time $O(n)$. Other algorithms can be implemented using divide and conquer trees. A divide and conquer tree is a tree whose leaves are labeled with the $p \in S_n$, and if $p$ is the label of an interior vertex then its sons are labeled $p, p(1i)$ for some $i$. Essentially optimal divide and conquer trees are constructed in [7], of height $O(n \log n)$; this construction can readily be adapted to $tQ$ graphs.

Algorithms which can be implemented using a divide and conquer tree thus run with hypercube performance on a star. Examples are numerous, and include single link synchronous broadcast, associative binary operation on one element per processor, and all prefixes for the latter where the processors are ordered according to the order of the leaves of the divide and conquer tree. It is an interesting question whether there are divide and conquer trees of height $O(n \log n)$, whose leaves are ordered according to the standard enumeration of $S_n$. We suspect that these might exist.

A universal sequence is a sequence of transpositions $(1i), \ldots , (1v)$ such that every permutation in $S_v$ occurs as a product of a subsequence. Universal sequences of length $O(n \log n)$ are constructed in [1]. They can also be found by finding a common supersequence of the branches of a routing tree. Determining the shortest common supersequence of a set of sequences is NP complete [15]. For small $n$ however they may easily be found by computer search. For example for $n = 5$ the shortest universal sequence of this type has length 12; an example is

$$23452345342.$$ 

A universal sequence for $S_n$ yields one for a $tQ$ graph, but there may be shorter ones.

As mentioned in [1], given a universal sequence a divide and conquer tree can be derived. The height of the tree is bounded by the length of the universal sequence. To derive the divide and conquer tree, start with $1$ and the first transposition in the sequence. For each leaf $p$ of the tree constructed so far, if $p(1i)$ does not occur in the tree, where $(1i)$ is the current transposition in the universal sequence, add sons labeled $p, p(1i)$ and advance the current transposition. Continue until all $p$ occur in the tree. There are algorithms which require a universal sequence and not just a divide and conquer tree, such as the multiprobe broadcast algorithm of [7].

The mesh embedding of Section III has been used to obtain algorithms for some problems. These run on $tQ$ graphs, but since many of them have been superseded by [4] we will not consider this further.

X. Conclusion

We have given what appears to be a useful notion of a "standard" enumeration of the star graph $S_n$. A bound is given on the diameter of prefixes and suffixes, and an $O(n)$ interval broadcast algorithm is given. The prefixes of length $t(n - 1)$ for some $t$ are especially well behaved. These graphs would be the same in any cost respecting enumeration, but the standard enumeration is useful in discussing them, as it is for $S_n$ itself.

REFERENCES


Prevention of Congestion in Packet-Switched Multistage Interconnection Networks

Jyh-Charn Liu, Kang G. Shin, and Charles C. Chang

Abstract—This paper proposes a simple, yet effective scheme to prevent congestion in a packet-switched multistage interconnection network (MIN) caused by hot spots. In this scheme, switches in the second and third stages of the MIN monitor their buffer occupancy to detect any notable nonuniform access behavior. When a switch detects congestion, packets generated by processors will be blocked from entering the congested switch until the congestion is cleared. Our scheme is compared with two well known schemes [1], [2], and shown to exhibit significantly better performance than these two.

Index Terms—Congestion tree, packet-switching, multistage interconnection networks (MIN), hot spots, buffer occupancy.

Manuscript received February 26, 1993; revised May 16, 1994. This work was supported in part by the Texas A&M College of Engineering Excellence Fund, and by the National Science Foundation under Grant MIP-9203895. J.-C. Liu is with the Department of Computer Science, Texas A&M University, College Station, TX 77843 USA.

K. G. Shin is with Real-Time Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109 USA.

C. C. Chang is with the Department of Electrical Engineering, Texas A&M University, College Station, TX 77843 USA.

IEEE Log Number 9405874.
I. INTRODUCTION

The multistage interconnection networks (MIN's) are an important class of high-bandwidth networks that can be used for multiprocessor/multicomputer systems. The MIN has a good performance when the network traffic pattern is uniform, so that network resources (link bandwidths and buffers) may be shared by different communication channels in a fair and efficient manner. However, when demands of different sources are not uniform, communication channels may not be able to share the network resources fairly. Even worse, if the service demand on a resource exceeds its capacity, packets blocked by this resource may build up quickly in the network. The blocking effect can quickly propagate backward, and most paths in the network will soon be blocked, thereby resulting in a severe drop of network throughput.

The network congestion caused by nonuniform traffic patterns is also called the hot spot problem in a multiprocessor system, where an MIN is used as the interconnection network between processors and memory modules. In this case, a hot spot is a memory module whose service rate is exceeded by the rate of requests to use it. Once a hot spot is formed, most packets in the network would be blocked, and the network throughput may drop to nearly zero. When a network becomes congested due to the presence of a hot spot in the network, the set of switches on the paths to the hot spot forms a congestion tree with the hot spot as the root and processors as the leaves. In a congestion tree, packets destined for the root are called hot packets and the other packets are called cold packets.

Many researchers had studied the hot spot problem using stochastic models and simulations [3]-[7], [10], [11], and the combining switch was then proposed to handle excessive hot packets. Since this initial effort, several software [16], [17] and hardware based solutions [1], [2], [10], [12]-[15], [18], [19] to the hot spot problem had been proposed, and some of the representative solutions would be reviewed below.

In this paper, we propose a distributed algorithm for detection and prevention of network congestions caused by hot spots. Similar to the schemes proposed by Scott and Sohi [1], we adopt a feedback control principle to prevent network congestions. However, our congestion detection scheme uses the buffer occupancy for congestion detection, while Scott and Sohi's scheme uses a continuous stream of hot packets arriving at their destinations. Through extensive simulations we find that in an uncontrolled network, most packets are blocked in switches very close to network input. Therefore, the buffer occupancy of switches close to the network input is used for early detection of the congestions caused by hot spots. Each buffer-occupancy monitor controls a set of processors, and occupancy monitors (at the same stage) make packet-blocking decisions independently. Therefore, at any instant, only a fraction of the processors in the system can (not) input packets to the network. Our scheme thus has a much smoother throughput than other schemes. We compare the performance of our scheme with those of the two schemes proposed in [1], [2]. Our simulation results show that the three schemes have a similar behavior when the hot access rate is low. However, our scheme has a much more stable, and significantly better performance, than those of the other schemes at higher hot-access rates.

II. DETECTION AND PREVENTION OF NETWORK CONGESTION

The MIN under consideration is assumed to be constructed with $2 \times 2$ switches. Each input port of a switch has a packet buffer. The Banyan, or any other topologically-equivalent topology of Banyan, is assumed for the network. Processors are connected to the network input ports, and memory modules are connected to network output ports. It is assumed that the duration of a memory cycle is the same as a network cycle, and a memory module can serve only one memory-access request in one memory cycle.

Network congestion is a statistical phenomenon, and its effect is measured by how fast it affects the performance of a network. Thus, we define network congestion as a condition under which the communication delay of cold packets becomes very large because of excessive demands of hot packets for network resources. A congested area is composed of a congested switch and those switches that have forward paths to the congested switch. In a congested area, the number of cold packets may not be significantly smaller than that of hot packets, but cold and hot packets are blocked for a very long time, and hence, the throughput of the area is very low. Based on the above definition, switches close to a hot spot are not considered as congested, because although most packets are destined for the hot spot, the throughput of switches in this area may be relatively higher than that of the rest of the network. It is more appropriate to consider the network input as the congested area, since most cold packets in an uncontrolled and congested network are blocked close to network input. In this case, hot spots cause the network congestion.

Our design prevents network congestion based on an optimal combination of congestion-detection, congestion-resolution and congestion-blocking. Our objective is to maintain a stable network throughput and to provide good quality of service to these packets already in the network. A distributed control approach is helpful to achieve this objective. For congestion detection we need to define the optimal conditions under which the other two schemes should be activated to handle congestion. The congestion-detection scheme must consistently detect congestion, whether the congestion-resolution and blocking mechanisms are activated or not. The congestion-resolution scheme is used to handle packets in the network after detecting a network congestion. This could be done through specially designed switches, or dynamic assignment of packet priority, or simply doing nothing. The congestion-blocking scheme is used to stop packets from entering the network at proper time instants. The efficiency and simplicity of these schemes are particularly important, so that they can be used in high-speed networks without appreciable interference with the normal network operation. As it will become clear later, we only need to use the congestion-detection and congestion-blocking mechanisms for our design. Hence, we will not consider the congestion-resolution scheme any further.

Explicit recognition of excessive packets destined for a particular memory module [1] is the most commonly-suggested approach for detecting a hot spot. This approach was shown to offer a fairly good performance when the hot-access rate is low, but its performance deteriorates rapidly with increase of the hot-access rate, as observed in our simulation. To avoid this problem, the switch buffer occupancy for congestion detection. The occupancy of a buffer increases only when the packet input rate is higher than the packet output rate of the corresponding switch. A continual increase of the buffer occupancy in a switch implies that the buffer space of the switch may soon be used up and all inbound packets would be blocked. Therefore, either the packet input rate to the congested area should be lowered, or the output rate of the switch should be increased, until the buffer occupancy drops to an acceptable level. This is true even when the congestion-blocking mechanism is activated.

Next, we decide on how and where to detect network congestions to maximize the network throughput and minimize the packet delay. To reduce the possibility of false alarms, we use different rules for congestion detection before, during, and after the network becomes congested due to the hot spot(s). In an uncontrolled network, if the network becomes congested due to a hot spot, most switches can be in one of the two states: heavily-loaded or under-loaded. Switches would be heavily-loaded if they are on the congestion tree, or would be extremely under-loaded otherwise. The under-loaded switches are called the free switches, and they are empty for most of the time.
because most packets destined for them have been blocked in the congestion tree.

Through interactive graphical simulations, we observed that, when a congestion tree is being formed, the buffer occupancy of switches on the congestion tree becomes sharply different from those of their adjacent free switches in a very short time period. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. This change in the buffer occupancy of adjacent switches becomes less dramatic after the congestion-blocking mechanisms are activated. This difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. This change in the buffer occupancy of adjacent switches becomes less dramatic after the congestion-blocking mechanisms are activated. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated.

To implement the congestion detection rules A and C, the occupancy monitor in a switch needs to examine its own buffer occupancy. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated. Therefore, the difference in the buffer occupancies in adjacent switches (of certain adjacent stages) is used as the congestion detection rule, called rule A, before the congestion-blocking mechanisms are activated.

To implement rule A, we note that any switch on an arbitrary congestion tree is also connected to a free switch through one of its output ports. That is, any switch on a congestion tree has one output port connected to a switch on the congestion tree, and the other connected to a free switch. If the switch size is $r \times r$, then a switch on a congestion tree has one of its output ports connected to another switch on the same congestion tree, and all other output ports connected to free switches. Therefore, to implement rule A, the occupancy monitor in a switch also needs to be connected to the occupancy monitors of its adjacent switches at the next stage. Moreover, the occupancy monitor of a switch may also need to inform its upstream neighboring switches of its status as either congested, or not-congested. This way the congestion condition can be propagated to the processors to control traffic flow.

A. Hardware Implementation

We propose a sliding-window based CUTH design, where the window size is $w$ clock cycles. The occupancy monitor can be implemented with an $w$-stage shift-register array and a counter, where $w$ is the window size. Each register is $m = \log_2 (q + 1)$ bits wide, where $q$ is the number of buffers in a packet queue. The number of packets in a packet queue is sampled into the shift-register array of the occupancy monitor, and the register array is shifted forward one stage in each clock cycle. The counter storing the buffer occupancy is updated to be $C = C + v_t - v_{t-1}$ in each cycle, where $C$ is the current value of the counter, and $v_t$ and $v_{t-1}$ are the input and output of the shift-register array, respectively.

If an occupancy monitor is in the passive state, the current counter value is stored into its memory $M$ periodically as $p_h$. When the occupancy monitor becomes active, and the value of its counter becomes higher than $p_h$, all the flow-throttles in the same CUTH will have to be activated to block packets from entering the congested area. When an occupancy monitor detects congestion, it needs to inform all processors that can reach the switch of its location, so that flow-throttles can block packets destined for the congested area.

This communication mechanism can be implemented in two different ways. In the first approach, each switch is assigned a subspace ID, which indicates the address subspace of the memory modules that can be reached by the switch [20]. Counting from input to output stages, a switch at the $i$th stage has $i$ bits of subspace ID. When the occupancy monitor at stage $i$ detects a congestion, it can send a message, which contains its subspace ID, back to all processors that can reach this switch. The flow-throttle in a processor blocks any messages whose addresses are in the received subspace ID(s). If several hot spots become active at the same time, then packets can be divided into groups based on the subspace ID(s) of the (detected) congested areas. A new hot-spot group is added to the processor after a new hot spot is detected. The hot-spot group can be deleted when rule C with respect to the hot spot detects the disappearance of the hot spot. The overhead in dealing with the hot spots in the communication buffer of the processor should not impose any serious performance penalty. This is because when hot accesses begin, the effective packet generation rate will drop to lower than the original packet generation rate of processors. The drop of packet generation rate can be substantial for most existing schemes, as our simulation results indicate. Therefore, our scheme outperforms many existing schemes as long as the time for sorting packets is smaller than the average packet interdeparture rate (from the processor buffer to the...
An alternative design for coordination of occupancy monitors and flow-throttles is by direct propagation of control signals. To pass congestion condition from an occupancy monitor to its flow-throttles, an active occupancy monitor located at stage $h$ asserts a block-signal to its neighbors at stage $h-1$. A switch $S_h$ at stage $h-1$ that receives an asserted block-signal propagates it backward to its neighbors at stage $h-2$, with one additional hot-address line denoting the outport of $S_h$ that is connected to the asserted occupancy monitor. Then, $S_{h1}$ and $S_{h2}$, both of which are connected to the input ports of $S_h$, continue to propagate the block-signal and the hot-address signals one more stage, after adding one more bit denoting the output of $S_{h1} \neq (S_{h2})$ that is connected to $S_h$, to the received hot-address lines. This approach is more suitable for the case when the occupancy monitors are very close to processors. Since our simulation results show that placing the occupancy monitors on stages 2 and 3 yields best performance even under extremely high hot-access rates, the hardware costs of both schemes are reasonably low. A 16 $\times$ 16 MIN with embedded CUTH's is illustrated in Fig. 1.

To reduce the packet blocking effect under congestion, we modify the conventional first-in-first-out (FIFO) buffers into a parallel-access-buffers (PAB) structure. Based on the PAB structure, cold packets need not be blocked in congested switches in the presence of hot spots. The circuit design of the PAB is illustrated in Fig. 2, in which a packet-selector is associated with each buffer in the PAB to control movement of the packet, if any. RD, denotes the existence of a packet in the i-th buffer of a PAB. The Packet.in, In-enable, queue-full and RD, 's together decide where to insert a new packet into the PAB. This part of control circuits is also necessary for conventional FIFO-buffers. The rest of control signals are used for dynamic routing of blocked packets as follows. Whenever two packets from the head of the two PAB's contend for the same output port, only one of the two requests will be granted, and the lost PAB is allowed to route another packet, if any, destined for the other output port. To realize this operation, a tristate transmission gate array needs to be added to the output of each packet buffer, in addition to control signals. At the beginning of each routing cycle, a PAB issues a request signal, which can be the RD bit of the leading buffer, to the switching arbitrator. In the meantime the routing tag of the leading packet is also made available to the entire PAB. If the PAB wins the contention, then its leading packet is routed to the next stage. Otherwise, the Alternate signal will be set to 1 and propagated backward to the PAB. The first packet buffer with Alternate.in = 1, which has a different tag value from that of the leading packet, should assert its out-enable to route its packet to the switch at the next stage. In the meantime, the packet buffer needs to set its alternate.out to 0. Packets behind the routed packets will be shifted up one register after the grant-signal returns. A buffer can accept a shifted-up packet if its Advance signal is asserted.

After a new packet arrives, the empty buffer next to the last occupied buffer will receive the packet and change its RD bit accordingly. Note that the additional complexity of the PAB over the FIFO buffers is limited, because only a small amount of combinational circuits would be needed to implement the algorithm. This way network resources can be released for servicing other packets in the shortest time possible. We also note that the buffer-occupancy monitors are quite flexible, because they can be embedded into any existing switch design in a nonintrusive manner. Their interference with the normal operation of switches is minimal. On the other hand, most existing, if not all, hot spot prevention algorithms are based on the detection of a large number of packets destined for a particular network output port. This usually leads to a solution which indeed tries to resolve the congestion already formed in the network.

III. PERFORMANCE COMPARISON

We compare the performance of our scheme against that of two simple yet effective schemes proposed in [1] and [2] by simulation.
processors, we obtain the best performance in all the cases studied. This is because we can detect the blockage of cold and hot packets whole network. However, by placing the buffer-occupancy monitors and input blocking were embedded in the buffer-occupancy monitors. The system performance was not good if only at the second and third stages, and the flow-throttles in the different experimental runs.

We observed that the two schemes being compared worked well when the hot-packet request rates were relatively low, but their network throughputs were reduced to 0.1 when the hot-packet request rate from each processor is greater than 0.1. For Scott and Sohi’s scheme, this is mainly because of the relatively slow detection of hot spots, and thus, at a high hot-access rate; there were already too many hot packets in the congestion tree before detecting a hot memory, as pointed out in [1]. For Peir-Lee’s scheme, hot packets could enter the network even after a hot spot was detected, thus making it difficult to dissipate the hot packets already in the network when the hot-packet generation rate was high. When the hot-memory request rate was 0.03 or lower, our scheme was only slightly better than the other two schemes being compared. However, our control scheme was shown to be much better than the other two when the hot-memory request rate was higher than 0.03, regardless of the number of hot spots in the system. The network throughputs packet delays under the three schemes for a 512 x 512 MIN are plotted in Fig. 3, in which \( p \) and \( h \) denote the cold and hot packet generation rates by each processor, respectively. In addition, buffer-occupancy monitors are placed at the second and third stages. Since

![Diagram](image-url)

**Fig. 2.** Network throughput and packet delay with eight hot memory modules \((N = 512, p = 0.7, h = 0.1)\).
the network was far less congested in our scheme, the effective (cold) packet generation rate—which was the actual (cold) packet output rate from processors—under our scheme was much higher than that of the two schemes being compared, as can be seen from Fig. 4.

In these figures, \( p \) denotes the background packet generation rate, and \( h \) denotes the hot packet generation rate in each processor. After comparing our scheme with the other two, we proceeded to study the performance impact of network sizes on our control scheme, and we observed little performance difference.

All the performance results reported so far are based on the assumption that it takes one cycle network for the congestion monitor to inform the flow throttles. The network performances under different combinations of the window size and message delay are also evaluated through simulation. When the message delay was in the range of 1 to 4, no notable performance degradation was observed. So, it was important to keep the message delay as short as possible.

IV. CONCLUSION

We have introduced a cost-effective congestion detection and prevention scheme for packet-switched MIN's. Our simulation results show that our scheme can sustain the network throughput under extremely nonuniform traffic. The effectiveness of our scheme stems from the facts that 1) it detects network congestion caused by hot spots, instead of the hot spots themselves; and 2) excessive packets are not allowed to enter the network, so that packets blocked in the network can be dissipated quickly. Our scheme can be implemented with simple, inexpensive hardware mechanisms, making it an attractive solution to prevention of network congestion.

REFERENCES


Abstract—In a recent development, El-Amawy [3], [5] introduced a novel clock distribution scheme called Branch and Combine (or BaC for short) which is the first to guarantee a constant upper skew bound irrespective of network size. The scheme distributes the global clock through a set of simple interconnected nodes whose main function is to process the clock signal such that it adheres to certain timing constraints. In [3]–[5], three BaC network designs for clocking a 2-D mesh of processors have been introduced. A single network was analyzed in detail in [5] and shown to be stable under the assumed model. Also skew was shown to be bounded above by a constant for each of the three networks. In [6], BaC clocking was compared to H-tree clocking of a 2-D mesh of processors in a VLSI context. That study shows that the BaC clocking is superior to the H-tree in almost all aspects, for medium and large size meshes.

I. INTRODUCTION

In order to meet the ever increasing demand for higher computing power, the size and complexity of computing systems have grown steadily. Clock skew has been widely recognized as a major problem in implementing very large synchronous systems. Typically, a single global clock is used to provide the basic timing and synchronization mechanism for different parts of the system. Due to several factors such as variation in buffer delays, variation in signal propagation delays on wires, and different threshold voltages, clocking events generally reach clocked cells at different times. The maximum difference in arrival times at two cells, which directly communicate, is what we mean by clock skew. Many researchers have investigated the clock skew problem as exemplified by the work in [1]–[8], [10]–[13]. In one study Fisher and Kung [7] have shown that if an H-Tree is used to clock a 2-D mesh of processors, it is impossible to guarantee a constant skew upper bound, provided that small delay variations are not ignored.

In a recent development, El-Amawy [3], [5] introduced a novel clock distribution scheme called Branch and Combine (or BaC for short) which is the first to guarantee a constant upper skew bound irrespective of network size. The scheme distributes the global clock through a set of simple interconnected nodes whose main function is to process the clock signal such that it adheres to certain timing constraints. In [3]–[5], three BaC network designs for clocking a 2-D mesh of processors have been introduced. A single network was analyzed in detail in [5] and shown to be stable under the assumed model. Also skew was shown to be bounded above by a constant for each of the three networks. In [6], BaC clocking was compared to H-tree clocking of a 2-D mesh of processors in a VLSI context. That study shows that the BaC clocking is superior to the H-tree in almost all aspects, for medium and large size meshes.

In this paper we extend, generalize, and improve on the previous work on BaC networks. We define a general class of BaC networks. We do not confine our study to any particular topology or dimensionality. Instead we base our study on a graph theoretic model which only requires that each pair of adjacent vertices be included in a directed cycle of finite length. We define a network model which specifies the underlying assumptions such as node function and clock signal constraints which are dependent on the properties of the underlying graph representation. We then utilize the network model to derive important properties of the generalized class of BaC networks. We prove that a network adhering to our general model is stable (will not oscillate) despite its cyclic nature. We also prove that no tree of any kind can be used to distribute the clock in two or more dimensions such that skew bound is constant. The paper then exploits the derived properties to describe the inherent interplay between topology, timing, node function, and skew bound.

Index Terms—Branch-and-combine network, clock distribution, skew bound, synchronous system, VLSI, large system, network stability, cyclic clock networks.