# THE COMMUNICATION COMPLEXITY OF VLSI CIRCUITS Thomas Lengauer A 82/14 This material was presented as a Lecture within the AMS Short Course on Computer Communications, Denver, Colorado, Jan.3-4, 1983 Fachbereich 10 Universität des Saarlandes D-6600 Saarbrücken West Germany ## THE COMPUNICATION COMPLEXITY OF VEST CIRCUITS ## Thomas Lengauer Abstract: Very Large Scale Integration (VLSI) is a quickly energing discipline in Computer Science that also raises many theoretical questions. The concept of a VLSI computation is very much different from classical concepts of a (sequential) computation. A VLSI computation is performed by many switching elements (s.e.'s) that are laid out on the planar chip surface and connected among each other with wires. These s.e.'s can perform computations in parallel. The computation of any s.e. depends on data received over wires from other s.e.'s. Several measures of complexity are of interest in this context: The chip area A, i.e., the area of wires and s.e.'s, the computing time I and the switching energy E. Clearly there exist tradeoffs between A and T. In this paper we survey lower bounds on the combined complexity measure AT<sup>2</sup>. The lower bounds are proved by accounting for the amount of work necessary just to communicate intermediate results between s.e.'s. In many cases the lower bounds we get are (asymptotically) tight, giving evidence for the fact that communication cost dominates the complexity of many VLSI computations. #### TABLE OF CONTENTS: - 1. Introduction - 2. The VLSI Model - The Lower Bound Argument - 4. A Geometric Separator Theorem - 5. The Communication Complexity of Boolean Predicates - 6. The Communication Complexity of Boolean Functions With Marry Cutputs - 7. A Lower Bound on the Switching Energy of YESI Chips # INTRODUCTION In recent years the area of Very Large Scale Integration (VLSI) which originally was reserved to electrical engineering gained increasing interest among computer scientists. The book of Head and Comway ([MC80]] triggered a lot of activity in VLSI design at the computer science departments of the universities and also focussed the attention of theoretical computer scientists and mathematicians on the area. Several authors proposed theoretical models for VLSI computations ([TRO,BK8], CM81,LS81,LM81]). These models are designed to capture the essential characteristics of the VLSI technology such as for instance the planarity of the chip. On the other hand they abstract from extraneous technological detail. Their purpose is to facilitate general statements on VLSI computations, specifically, lower bounds on their complexity. The main complexity measures studied are the chip area A and the computing time T. For A we measure the area on the chip surface that is taken up by switches and wires, the so-called active area ([T80,LM81]), resp., the area of the enclosing rectangle, the so-called total area ([SK8],CN81,LS81]). The first area measure is directly related to the chip yield during fabrication. The secand area measure accounts for the total size of the chip. For our purposes both area measures are equivalent ([LMB1]). The question how to measure time is more difficult to answer. Most authors only study synchronous circuits. Here a global clock counts the synchronous steps of the computation performed by the switching elements. [T90,8K81,LS81,LM81] interpret T as the number of clock cycles spent on the computation. Note that this is not a "physical" measure of time, since the length of a clock cycle may itself vary with the size of the chip. However, as long as the delay of signal propagation along wires is not significantly longer than the delay of the switching elements the number of clock cycles gives a good representation of the time spent, i.e., is asymptotically accurate. As the number of transistors on a chip increases this ceases to be the case, however. [CM81] introduce a physical time measure T by measuring time in seconds under the assumption that signals are propagated along wires at a constant speed. They get dramatically different lower bounds on circuit complexity. Not only are their lower bounds larger, as we expect, but the optimal chips according to their complexity measures differ significantly from the optimal chips if T is measured in clock cycles. This is, because [CM81] pay a penalty for long wires across which communication is expensive. The time measure of [CN81] may become technologically significant eventually, as the speed of light becomes the limiting factor in signal propagation on VLSE chips. In the meantime, however, other physical time measures may be more appropriate, such as the one introduced in [MC80] that is based on the capacitive properties of VLSI structures. In this paper we will only count clock cycles in synchronous VLSI circuits. Our main observation, namely that the chip complexity is dominated by the cost of communication, holds for the other time measures as well. However, the time measure we consider has the most developed body of theory and thus is best suited to make our point. Since synchronous circuits can be pipelined another complexity measure gains significance, namely the period P. A pipelined circuit receives input to a rew problem to be solved before it has finished solving the old problem. P is the number of clock cycles between the first input to two consecutive problems (see Figure 1). As a circuit solves many problems, i.e., a number n of problems such that nf>>T, P assume the role of T since the time T+(n-1)P to solve all n problems does not differ significantly from nP. We will not discuss P explicitly in this paper but just mention that the lower bounds we prove hold even if we substitute P for T and make the appropriate modifications in the relevant definitions. Finally the switching energy E has been considered as a complexity measure for VLSI computations. The switching energy E is measured based on the following assumption: Every unit of active chip area consumes one unit of switching energy each time it changes its state from 0 to 1 or vice versa. This complexity measure is closely related to the energy dissipated when charging a wire. In technologies without high d.c. currents the switching energy dominates the total energy dissipation on the chip. Lower bounds on A can be obtained in many cases [[V80,LM81]]. However, the respective arguments are based on measuring the storage capacity necessary for the computation and do not involve communication complexity. We therefore do not discuss these results here. Looking for lower bounds on T reduces to the same problem in Boolean circuit complexity, which is known to be very hard. A lower bound on E will be proved in Section 7 of this paper. It stems from [LM81]. The above complexity measures can be combined to form new interesting complexity measures. Clearly there is a tradeoff between A and T, since more available chip area allows greater on-chip concurrency. Thus the quantity AT is an interesting complexity measure. However, with respect to AT many circuits are optimal that we intuitively would not consider good VLSI chips. (An example is the bit- serial adder.) This is, because concurrency in the computation is not given sufficient priority. We can make the complexity more sensitive to concurrency by giving the T component a heavier weight. This motivates the investigation of AT<sup>2</sup> as a complexity measure. (T80) has presented a proof method for deriving lower bounds on the AT<sup>2</sup> complexity of VESI chips, and later authors have elaborated on it. This method will be the main subject of this paper. The method measures the amount of information that has to be exchanged between different components of the chip and relates it to the chip area and computing time. It only accounts for the cost of communication, not for the cost of computation or storage. The fact that many of the derived lower bounds are (asymptotically) tight shows that the complexity of VLSI computations is in many cases dominated by the cost for communicating intermediate results across the chip. In Section 2 we present the model for synchronous VLSI circuits we are using. In Section 3 we give an intuitive outline of the method for proving AT<sup>2</sup> lower bounds. In Sections 4 to 6 we provide the formal tools for proving the lower bounds. Section 4 contains the geometric tool in the form of a separator theorem. Sections 5 and 6 provide the information theoretic tools for Boolean predicates resp. for Boolean functions with many outputs. Finally, in Section 7 we use the ideas developed in the earlier chapters to prove a lower bound on the switching energy consumed by VLSI chips. ## 2. THE VLST MODEL Our VLSI model is based on Boolean circuits. This choice is adequate also for modelling more general "multi-directional" VLSI structures, e.g., buses ([LM81]). DEFINITION 1: A chip $\chi = (\Gamma, \Lambda, \Delta)$ consists of three structures: - a) The execute F: A synchronous Boolean circuit with feedback and unbounded fanin. Formally, this is a directed bipartite graph F={V,E} where V is partitioned into a set S of switches and a set W of wires. Here S=PuG where P is a set of ports labelled in or out and G is a set of grates labelled and, or, nand, or nor. For seS and weW, if {s,w}eE then w is called an output of s, if {w,s}eE then w is called an input of s. All gates have out-degree 1, all wires have in-degree 1. Each input port has one input and one output. Each output port has two inputs and no output. The "additional" input signal for the ports is an enable signal computed on the chip that activates the port. - b) The Anyone A: The layout maps every vertex in the circuit into a compact connected region in the plane. Furthermore, each point in the region lies inside some square of side length A>O that is completely contained in the region. (This provision models the finite resolution of the fabrication process for VLSI chips.) Each point in the plane belongs to the interior of at most $v \ge 2$ regions. (The parameter v is another fabrication process specific constant representing the number of functional layers on the chip. Since $v \ge 2$ also non-planar circuits $\Gamma$ can be laid out.) Two regions touch exactly if the vertices they represent are neighbours in $\Gamma$ . (We say that regions $R_1$ and $R_2$ touch if $R_1 \cap R_2 = \emptyset$ but $R_1^0 \cap R_2^0 = \emptyset$ , where $R_1^0$ , $R_2^0$ are the interiors of $R_1$ resp. $R_2$ .) c) The manual a: The manual is a set of directions for the communication between the chip and its environment. It contains for every input port a sequence of numbers that identify the input bits that enter through this port. The sequence also determines the order in which the input bits enter. When its enable signal is raised the port requests the next input bit in the sequence to enter. Analogously, for each output port the manual contains a sequence of numbers identifying the output bits produced at that port and their order. When its enable signal is raised the port produces the next output bit in the sequence. After defining all components of the chip we can define the operation of the circuit. To this end we associate with each port a word web\*, 8={0,1}, that we call its history. Furthermore, we label each wire with an initial Boolean value from Bu(X). (X stands for the undefined Boolean value, 0.X=0, 1vX=1, 0vX=1.X=X.) Such a labelling we call a state of the circuit. The initial state is most often the completely undefined state. In the i-th cycle the circuit does the following. Each gate "reads" the values on all its input wires and computes the Boolean operation given by its label. The resulting value is put on its output wire. Each input port puts an X on its output wire if its enable signal is 0, otherwise it puts the ext bit from its history on its output wire. Each output port checks if its enable signal is 1, and if so it puts its other input at the end of its history. Thus input ports consume their histories and output ports produce them. All actions of the gates happen in parallel. Thus a new state is reached on which the {i+1}-st cycle of the computation is started. Finally we discuss, what A,T and E mean in this context. The area A is the area of the set M of all points in the plane that belong to at least one region in the layout. (Thus we measure active area.) The time T is the number of steps taken by the circuit to produce the desired outputs in its output histories. The switching energy E is based on the area of the regions in the layout. There has been some confusion about different kinds of manuals in the past. Therefore we discuss the issue here at greater length. The manual used in Definition I is called atrongly where-oblivious in (LMSI). Many authors use manuals of a different kind to prove the same lower bounds. [LS81] call a manual where-oblivious (when-oblivious) if the port (cycle) at which an input bit is requested resp. an output bit is produced is independent of the input. In this paper we will also consider manuals which are both whereand when-oblivious. Such manuals simply list for each I/O bit the time and port at which it crosses the chip boundary. These data are independent of the input. (Mote that strongly where-oblivious manuals are in general not when-oblivious.) If we restrict ourselves to manuals which are where- and when-oblivious them we do not have to require the I/O ports to be activated by enable signals which are produced by the circuit. We can instead assume that all ports are active all the time, and that they scretimes consume resp. produce bits which are insignificant for the problem to be solved. This assumption takes some computational burden off the chip, because it can "blindly" consume and produce bits without having to compute the knowledge on what data it is currently working. Even so, we can still prove the same lower bounds as for strongly where-oblivious chips. This is, because if the manual is where- and when-oblivious then even on "blind" chips solutions to different problems are always represented by different I/O histories. If the manual is strongly where-oblivious this is not the case, however, and the chip has to "know" what bits cross its boundary when and where. If this knowledge were not computed by the chip but could be found inside the manual, every problem could be solved by a trivial chip with a ronual that contains all the computation. (The chip takes no input, runs for two cycles and produces a O in the first and a 1 in the second cycle at its unique output port. One of these two values is the unique output and the namual says which one it is, depending on the input.) We distinguish two kinds of complexity analysis: Worst case analysis and average case analysis. In worst case analysis we ask for the greatest value of a complexity measure (A,T,E) for all possible input sequences of the same length. In average case analysis we average the complexity measure over all input sequences of the same length under the assumption that they are equally likely. For the area A worst case analysis and average case analysis coincide. For the switching energy E they may differ. For I they coincide if the manual is whereand when-oblivious, and they may differ, if the manual is strongly where-oblivious. #### 3. THE LOWER BOUND ARGUMENT In this section we give an intuitive outline of the argument for proving lower bounds on the AT<sup>Z</sup> complexity of VLSI chips. Let us assume that we have a rectangular chip computing some Boolean predicate p:8<sup>n</sup>=8. The chip has side lengths a and b, a≤b. Let us furthermore assume that the chip has n input ports and one output port, and that a unique [/O bit is assigned to each port. Clearly we can cut the chip into two halves L and R by a line C parallel to the sides of length a of the chip, such that about half the input ports lie on the side L resp. R.We first observe that the length of the cut is lq(C)-as/A. The chip is where-oblivious and therefore we can also assign to each input bit a side among L.R through which it enters the chip, independently of the input. This partitions the input bits into two classes X and Y corresponding to the sides L and R. Let us assume that the output port lies on side R. If we now can show that in order to compute the output bit we have to know many, e.g. w of the input bits in X we can argue as follows: Since the chip is planar the input bits in X can only be communicated to the side R across the cut C. By our layout model C can only be crossed by a number of wires that is proportional to its length. Thus during each cycle only O(lg(C)) bits can be connumicated across C. Since ω bits have to cross C this takes time T=Ω(u/lq(C)). By our bound on lg(C) we get $T-\Omega(\omega/\sqrt{A})$ , or $AT^2=\Omega(\omega^2)$ . Thus the argument is divided into two parts. The first part is the proof of a geometric separator theorem that says that any well behaved (e.g. rectangular) region in the plane with area A representing the chip layout can be cut by a nicely behaved (e.g. straight) line into two "balanced" halves (e.g. with respect to the number of input ports in them). Furthermore the length of the cut line is small, i.e., O(-A). We call such a theorem a separator theorem. The notion of a separator is known from graph theory ([LT79]], and in non-trivial cases the methods from graph theory can be carried over to prove geometric separator theorems. We will prove a general separator theorem in Section 4 that allows the application of the lower bound argument for the case that we are interested in active area. The obvious separator theorem for rectangles contained in the above outline applies to the notion of total area. The second part of the lower bound argument lies in the definition of a certain notion of "communication complexity" for the Boolean function p to be computed. Several different notions are possible here. We will discuss them in detail in Section 5 and 6. Intuitively, the communication complexity of a function p is the amount of information that has to be exchanged between two cooperating processors that each have a partial knowledge of the input and that want to aid each other in computing p. The communicating processors are in this case the two halves of the chip as divided by the cut C. The communication complexity of p then takes the part of the quantity u in the above outline. These ideas will be made more precise in the following sections. ## 4. A GEOMETRIC SEPARATOR THEOREM This section gives a separator theorem for compact plane regions. The theorem is a special case of a more general separator theorem presented in [LN81]. The proof is patterned after the proof of a similar separator theorem for planar graphs given in [LT79]. THEOREM 1: Let M be a compact plane region. Let $p_1, \ldots, p_r$ be r different points inside the interior of M of which n are specially "marked". Then M can be cut by a set C for three straight cuts into two compact sets M, and M, such that: - 1. М<sub>2</sub> ∪ М<sub>2</sub> − И М<sub>2</sub> ∩ М<sub>2</sub> + С - 2. $lg(C) \le 2\sqrt{A}$ where A is the area of M. - M<sub>2</sub>. M<sub>p</sub> both have at most 2n/3 marked points of p<sub>1</sub>,...,p<sub>p</sub> in their interior and none of the points p<sub>1</sub>,...,p<sub>p</sub> on their border. PROOF: First we note that because the set $P=\{p_1,\ldots,p_p\}$ is finite M can be put into a Cartesian coordinate system such that no line parallel to the x- or y- axis contains more than one point in P. We devise the following procedure for cutting M. W.l.o.g. assume that [n/2] of the marked points lie above the x-axis, [n/2] of the marked points lie below the x-axis, and no point in P lies on the x-axis. Let N<sub>x</sub> = $\{(x,y)\in N; y\ge 0\}$ and N<sub>x</sub> = $\{(x,y)\in N; y\ge 0\}$ and N<sub>x</sub> = $\{(x,y)\in N; y\ge 0\}$ . Let A<sub>x</sub> be the area of M<sub>x</sub> and A<sub>x</sub> be the area of N<sub>x</sub> $\{A_x+A_y=A\}$ . The method for cutting N has two steps: Step 1: For any real 4, let $C_{\underline{t}} = \{(x,y) \in M; y=2\}$ . We choose two cuts of M parallel to the x-axis at distances $y=2_{\underline{t}}>0$ and $y=2_{\underline{t}}<0$ . $C_{\underline{t}}$ and $C_{\underline{t}}$ cut M into three parts $M_{\underline{t}}$ , $M_{\underline{t}}$ , and $M_{\underline{b}}$ as suggested by Figure Z. We choose $\ell_t$ such that $\log(C_{\hat{\chi}_{\hat{\xi}}}) * \ell_t \leq \sqrt{2}A_+$ and $\ell_t \leq \sqrt{2}A_+$ . Analogously we choose $\ell_b$ such that $\log(C_{\hat{\xi}_{\hat{\xi}}}) * \ell_b \leq \sqrt{2}A_-$ and $\ell_b \geq -\sqrt{2}A_-$ . Such $\ell_t$ and $\ell_b$ can be found. For assume that such an $\ell_t$ does not exist. Then $$A_{+} = \int_{0}^{\infty} \log(C_{2}) dz \ge \int_{0}^{\sqrt{2}A_{+}} \log(C_{2}) dz > \int_{0}^{\sqrt{2}A_{+}} (\sqrt{2}A_{+} - x) dz = A_{+}$$ which is a contradiction. In fact we can assume that no point of P lies on $C_{\xi_{t}}$ since the set $(2;C<2\leq/2R_{+})$ and $\log(C_{\xi})+2\leq\sqrt{2R_{+}}$ ) must have a positive measure. An analogous argument applies to $L_{h}$ . Now, if $M_{\rm m}$ contains no more than 2n/3 marked points of P then we are done. Because $M_{\rm t}$ and $M_{\rm b}$ both have at most n/2 marked points of P we only have to choose $M_{\rm t}$ to be the one of the three sets with the most marked points of P in it. If, however, $M_{\rm m}$ has more than 2n/3 marked points in it then we have to continue butting. Step 2: We bisect $M_m$ with a line x-L such that half the marked points in $N_m$ are on either side on the line. Nore precisely, the cut consists of the segment $\{(x,y); x=L_y \text{ and } L_{t} \geq y \geq L_{b}\}$ . Now clearly the half of $M_m$ containing the most marked points of P will serve as $M_s$ . The total length of the cut C is bounded by $$\lg(c) \leq \lg(c_{2_t}) + \lg(c_{2_b}) + (\epsilon_t - \epsilon_b) \leq \sqrt{2}A_t + \sqrt{2}X_t \leq \sqrt{2}A_t + \sqrt{2}(A - A_t) \leq 2\sqrt{A}$$ since the function $t(x)=\sqrt{x}+\sqrt{1-x}$ has a maximum of $\sqrt{2}$ at x=1/2 or ## 5. THE COMMUNICATION COMPLEXITY OF BOOLEAN PREDICATES As notivated in Section 3 we start from the following definition. DEFINITION 2 ([Y79]): Let $p:X\times Y+B$ be a Boolean predicate. We call X the set of left inputs and Y the set of right inputs. Let L and R be two computing agents one of which knows xX and the other of which knows Y. L and R want to compute p(x,y) by exchanging information between each other. We are primarily interested in deterministic computations: A deterministic algorithm is given by two reponse functions $o_1: x \times 0^* \cdot 0$ and $o_R: Y \times 0^* \cdot 0$ and the partial output function $a: 0^* \cdot 0$ . For computing p(x,y) L starts sending the bit $w_1 = o_1(x,c)$ to R. R responds with $w_2 = o_2(y,w_1)$ ; L returns $w_3 = o_1(x,w_2)$ and so on, until $w_1,\dots,w_{k(X,Y)} \in dow(a)$ . At this point the computation stops with the result $a(w_1,...,w_{k(x,y)})=p(x,y)$ . k(x,y) is the length of the computation. The deterministic two-way communication complexity of p is given by where A is any deterministic algorithm computing p and $k_{A}(x,y)$ is the length of the computation on input (x,y), if algorithm A is used. For theoretical purposes it is often interesting to introduce non-determinism into the model. In an non-deterministic algorithm we are allowed guesses as to the outcome of certain computations. These guesses then only have to be verified. W.l.o.g. we can assume for non-deterministic algorithms that information is only sent from L to R. (L could guess all of the information sent from R and R just has to verify these guesses.) DEFINITION 3: A non-deterministic (one-way) algorithm is given by a set $\rho_L(x) \in B$ for every x and by a response function b:Y×8+8. On input (x,y), L sends some word w $\in \rho_L(x)$ to R and R outputs b $\{y,w\}$ . In order for R to know when L has finished sending its information we assume $\rho_L(x)$ to be prefix-free.) The non-deterministic algorithm computes p if p(x,y)=1 exactly if there is a w $\in \rho_L(x)$ such that b $\{y,w\}=1$ . The non-deterministic communication complexity of p is $$C_{\text{ndet}}(f, L+R) = \min \max_{\substack{x \in X \\ y \in Y \\ f(x,y)=1}} \min \{twi; w \in p_{L,A}(x) \text{ and } b_A(y,w)=1\}$$ A non-deterministic algorithm is sommbiguous if the last minimum in the above equation is taken over singletons (and can thus be omitted). We define $$C_{\text{unamb}}(f,L+R) = \min \max_{\substack{x \in X \\ \text{unamb}}} \{w(x,y)\}$$ $\sum_{\substack{x \in X \\ \text{unamb}}} f(x,y) = 1$ where w(x,y) is the unique w such that $w\in p_{1,A}(x)$ and b(y,w)=1. A third class of algorithms which are of interest are "Las Yegas" algorithms. Las Yegas algorithms use internal randomization (coin flipping) to determine the result. However, the randomization does not affect the outcome of the computation but just its complexity. Las Yegas algorithms are not only of theoretical interest but of practical significance since the realization of true coin flipping devices in VLSI technologies seems in the realm of the possible. DEFINITION 4: A Las Vegas algorithm is given by two response functions $o_{\underline{t}}: X \times B^{+} \times B^{+} \to B$ and $o_{\underline{t}}: Y \times B^{+} \times B^{+} \to B$ and the partial output function $a: B^{+} \to B$ . We assume that both L and R first toss t coins to determine the third arguments $t_{\underline{t}}$ , $t_{\underline{R}}$ in the response functions, and then start a deterministic computation as in Definition 2. The computation ends when $\{w_{1}, \dots, w_{k}(x, y, t_{\underline{t}}, t_{\underline{R}})\} \in dom(a)$ . Its result is $a(w_{1}, \dots, w_{k})$ . The Las Vegas communication complexity of p is $$C_{\{\gamma\}}(p,l \leftarrow R) = min \qquad f \qquad k(x,y,t_{i},t_{R})/2^{2t}$$ $$K \qquad LV-alg$$ We can also define a deterministic average case communication complexity of p by just averaging over all inputs instead of maximizing over them: The relation of the quantities defined above to the AT<sup>2</sup> complexity of chips will be discussed later in this section. However, they are also of use for concerns outside the YLSI area. For instance, they apply to problems in one-tape Turing machine complexity. Indeed, they can be used as a basis for a theory of communication complexity, and as such we will consider them now. This kind of complexity theory has been studied in [Y79,Y81,MY82]. We now study methods for finding lower bounds on the communication complexity of Boolean predicates. The methods consider the Boolean predicate p as an $|X| \times |Y|$ matrix P of zeroes and ones. The input x selects the row, y selects the column of the matrix. Thus $P_{xy} = p(x,y)$ . METHOD 1 (Crossing Sequences): This method is an adaptation of the crossing sequence technique for showing lower time bounds for one-tape Turing machine complexity. It was first presented in our context in [S81]. It is based on the following theorem. INSECREM 2: Consider the matrix P defining the Boolean predicate p. By parameting rows and columns independently put P into the form $\begin{pmatrix} I & B \\ A & C \end{pmatrix}$ where 1 is an most identity matrix and m is as large as possible. Then $C_{\det}(p,L \leftrightarrow R) \ge \log m$ . PROOF: The proof is an application of the pigeon hole principle. Assume that $C_{\det}(p,L^{-}R)<\log m$ . Then there is a deterministic algorithm computing p that always exchanges less than log m bits. Thus there are two inputs $(x_1,y_1)$ and $(x_2,y_2)$ corresponding to diagonal elements in the identity matrix 1 of P for which the same information is exchanged between L and R. By Definition 2 the same information is also exchanged on inputs $(x_1,y_2)$ resp. $(x_2,y_1)$ . Thus the alrithm yields the same result for all four input values. Since $(x_1,y_2)$ and $(x_2,y_1)$ are off the diagonal in I the algorithm must therefore yield incorrect results on some of the four input values. COROLLARY 1: The above theorem is still true if we substitute $C_{ndet}$ or $C_{unamb}$ for $C_{det}$ . PROOF: Analogous to the proof of Theorem 2.0 Corollary 1 is the manifestation for communication complexity of the well known fact that the crossing sequence technique is not able to differentiate between determinism and non-determinism. Note that Method 1 is not strong enough to show average case lower bounds. This is, because we bound the communication complexity of a Boolean predicate from below by showing that it entails an identity predicate $id(x,y):=(x\cdot y)$ for strings of some length. But identity predicates are easy to compute on the average. In fact, most often the inputs (x,y) of a identity predicate will not agree and this can be found out very fast. (Discrepancies will occur in the first bit half the time.) One easily shows that $C_{aup}(id,L \mapsto R) \le 2$ . Although Corollary 1 may at first sight seem to show a strength of Nethod 1, it really exhibits a weakness. Mon-deterministic lower bounds are in general weaker than deterministic lower bounds because non-determinism is a more powerful concept than determinism. Thus Nethod I will yield bad results for predicates that are difficult to compute deterministically but easy to compute non-determimistically. Sometimes one can get better results by complementing the predicate. i.e., by considering $\vec{p}$ = 1-p instead of p. $\vec{p}$ has deterministically the same communication complexity as p but may be harder non-deterministically. An example for such a predicate is again the identity predicate. By Corollary 1 we have for the identity predicate on strings of length $n \in C_{ndet}(1d,L \longrightarrow R) \ge n$ . However, we can show that Condet (3d, L+R) < log n + 1. This is, because in order to show that x+y, L only has to guess one bit position in which x and y differ, [t then sends the number of this position (log n bits) and the corresponding bit value of x (1 bit) to R. Thus 3d is easy to compute non-deterministically. However, we can still show a good deterministic lower bound on $C_{\text{det}}(\overline{1d},L\leftarrow R)$ by applying Method I to its complement id, which is difficult to compute non-deterministically. Moneyer, there are natural Boolean predicates p such that $$C_{\text{ndet}}(p,L \cdot R)$$ , $C_{\text{ndet}}(\overline{p},L \cdot R) \ll C_{\text{det}}(p,L \cdot \cdot R)$ . For such predicates Method 1 fails to prove good deterministic lower bounds. For the purpose of proving lower bounds in such cases a second method is presented. METHOD 2 (Rank Lower Bound): This method has been presented in [NN82]. DEFINITION 5: Let r be a ring and let $r^{(n,n)}$ be the set of non matrices over r. The rank of $AEr^{(n,n)}$ over r is the minimum k such that A can be written as $A = C \cdot D$ , where $CEr^{(n,k)}$ and $DEr^{(k,n)}$ . If r is a field the above definition coincides with the definition of matrix rank known from linear algebra. Method 2 is based on the following theorem. THEOREM 3: Let p be a Boolean predicate, and let P be its associated matrix. Then $C_{det}(p,L \leftrightarrow R) \ge \log \operatorname{rank}_R(P) \ge \log \operatorname{rank}_R(P)$ where r is any field. PROOF: The second inequality is known from algebra. For the proof of the first inequality we state the following lemma. LEMMA 1: Let r be a ring, $A \in r^{(n,m)}$ , $B \in r^{(n,k)}$ , and $C \in r^{(k,m)}$ . Then $$rank_r((A B)) \le rank_r(A) + rank_r(B)$$ $$\operatorname{rank}_{r}(\binom{A}{C})$$ $\leq \operatorname{rank}_{r}(A) + \operatorname{rank}_{r}(C)$ PROOF: If A=D1-E1 and B=D2-E2 then The proof of the second inequality is analogous.o Now consider any deterministic algorithm for computing p. Inductively on the length of wEB\* we define the matrix P<sub>o</sub> as follows: iwi>0: If Iwi=22 then $P_{w0}$ $(P_{w1})$ is obtained from $P_{w}$ by selecting all rows x with $\rho_{L}(x,w)=0$ $(\rho_{L}(x,w)=1)$ . If Iwi=22+1 then $P_{w0}$ $(P_{w1})$ is obtained from $P_{w}$ by selecting all columns y with $\rho_{D}(y,w)=0$ $(\rho_{D}(y,w)=1)$ . By Lemma 1 we have max $\{\operatorname{rank}_{\mathbb{N}}(\mathsf{P}_{\mathbb{N}^0}), \operatorname{rank}_{\mathbb{N}}(\mathsf{P}_{\mathbb{N}^1})\}\geq \operatorname{rank}_{\mathbb{N}}(\mathsf{P}_{\mathbb{N}})/2$ . Noreover, if wildow[a] then $\mathsf{P}_{\mathbb{N}}$ must be monochromatic, i.e., all of the entries in $\mathsf{P}_{\mathbb{N}}$ must have the same value $\mathsf{a}(\mathbb{N})$ , and thus $\operatorname{rank}_{\mathbb{N}}(\mathsf{P}_{\mathbb{N}})\leq 1$ must hold. Thus there is a wildow[a] such that $|\mathsf{w}|\geq \log_2 \operatorname{rank}_{\mathbb{N}}(\mathsf{P})$ . In [MM82] it is shown that in fact $C_{unamb}(p,L+R)=[\log rank_{N}(P)]$ . On the other hand $C_{ndet}(p,L+R)$ can be much smaller, as the following example shows. DEFINITION 6: Let nek and $x=Y=\{0:2^n-1\}^n$ . For $x=(x_1,...,x_n)\in X$ and $y=(y_1,...,y_n)\in Y$ let $$P_1(x,y) = \begin{cases} 1 & \text{if } x_i = y_i \text{ for some } i, 1 \le i \le n \\ 0 & \text{otherwise} \end{cases}$$ The above definition is the result of modifying the identity predicate such that it and its complement are non-deterministically about equally hard. Instead of checking the identity of one pair of n-bit words we now check the identity of n pairs of n-bit words r If any pair is identical we output 1 else 0. THEOREM 4: a) $$C_{det}(p_1,L\leftrightarrow R) \ge n^2$$ - 6) C<sub>adet</sub>(p<sub>1</sub>.L+R) = O(n+log n) - c) Codet(P1,L+R) O(n log n) - d) $C_{i,v}(p,L\rightarrow R) = O(n(\log n)^2)$ e) $C_{ave}(p,L\leftrightarrow R) = O(n)$ PROOF: a) Since $P_1 \in \mathbb{R}^{2^{(n^2)}}$ we only have to show that $\operatorname{rank}_{GF(2)}P_1 \geq 2^{(n^2)}$ , where GF(2) is the field of characteristic 2. Let a denote addition modulo 2. We transform the matrix $\overline{P_1}$ associated with $\overline{p_1}$ into the identity matrix of size $2^{(n^2)}\times 2^{(n^2)}$ by means of linear transformations. LEMMA 2: Let $$w_1, \dots, w_n, y_1, \dots, y_n \in \{0:2^n-1\}$$ . Define $$g(w_1, \dots, w_n, y_1, \dots, y_n) := \begin{cases} \vdots & \cdots & p_1 \in \{x_1, \dots, x_n, y_1, \dots, y_n\} \\ x_1 & \cdots & x_n \end{cases}$$ $$x_1^{*w_1} \qquad x_n^{*w_n}$$ Then $$g(w_1, ..., w_n, y_1, ..., y_n) = id(w_1, ..., w_n; y_1, ..., y_n)$$ . PROOF: Omitted.a - b) L guesses the number $i \in [1:n]$ such that $x_i = y_i$ and sends i and $x_i$ to R. If $x_i = y_i$ then R outputs 1. - c) For each $i \in \{1:n\}$ L guesses a position $j_i \in \{1:n\}$ such that $x_i$ and $y_i$ differ in position $j_i$ . L sends $j_i$ and the corresponding bit of $x_i$ to R. R verifies that the bits indeed differ. - d) See [MN82]. - e) For i=1,...,n L and R alternate sending bits of $x_i$ resp. $y_i$ until one bit is found in which $x_i$ and $y_i$ differ. Then L and R proceed with $\{x_{i+1},y_{i+1}\}$ . If $x_i = y_i$ is verified then the computation stops immediately. This deterministic algorithm computes n independent identity predicates. We already noted that the average case communication complexity of the identity predicate is less than 2. Thus $C_{x_i \neq x_i}(p_1, L \leftrightarrow R) \le 2n \cdot n$ Part e) of Theorem 4 shows that Hethod 2 is not strong enough to show average case lower bounds either. With the above two methods for showing lower bounds on the communication complexity of Boolean predicates we are now able to devise strategies for making statements about the AT<sup>2</sup> complexity of chips computing such predicates. DEFINITION 7: Let p be a Boolean predicate with n "marked" input bits. p is called u-separable if for every partition of the input bits into sets X, Y such that both X and Y contain at most 2n/3 of the marked input bits we have $$C_{det}(p(x,y),L\leftrightarrow R) \ge \omega.$$ The following Theorem emerged over a series of papers ([180,8K51,V80,S81, LS81,K82]). THEOREM 5: Let p be an $\omega$ -separable Boolean predicate. Then $AT^2 = \Omega(\omega^2)$ for every (where-and when-oblivious or strongly where-oblivious) chip computing p. PROOF: Let $\chi=\{\Gamma,\Lambda;\Delta\}$ be a chip computing p. Let $\pi$ be an input port and let $R_{\pi}$ be its associated region in the layout $\Lambda$ . Let the input bit $\chi_{i}$ enter the chip through input port $\pi$ . With each such $\chi_{i}$ we associate a point $p_{i}$ in the interior of $R_{\pi}$ such that different points are associated with different input bits. For the purposes of the lower bound proof we will consider the bit $\chi_{i}$ to enter the chip through point $p_{i}$ . If $\chi_{i}$ is a marked input bit we consider $p_{i}$ a marked point. We now apply Theorem 1. Thus we cut the layout area M into two halves $M_{\chi}$ and $M_{\chi}$ , and induce a partition of the input bits as follows: $$X = \{x_i; p_i \in N_i\}$$ $Y = \{x_i; p_i \in N_p\}$ . Furthermore X,Y contain at most 2n/3 marked inputs. Since the cut C has a length of at most 2/A, at most h=0(.'A) regions of A associated with components of I can intersect C. This is, because by the VLSI model, only O(1) regions can intersect any square of unit area in A. Now consider the computation of p by $\chi$ . At each cycle we associate two values with C, a left and a right "crossing" value. The left (right) crossing value $\mathbf{v}^{\xi}:(\mathbf{v}_1^{\xi},\ldots,\mathbf{v}_h^{\xi})^*$ ( $\mathbf{v}^{r}:(\mathbf{v}_1^{r},\ldots,\mathbf{v}_h^{r})$ ) contains a component $\mathbf{v}_i^{\xi}$ ( $\mathbf{v}_i^{r}$ ) for each region $R_i$ intersecting C. If $R_i$ is a (<u>nand,nor,and,or</u>) gate then $\mathbf{v}_i^{\xi}$ ( $\mathbf{v}_i^{r}$ ) is the (nand, nor,and,or) of all input values during the last cycle whose regions intersect $H_{\xi}$ ( $\mathbf{v}_r^{r}$ ). If $R_i$ is a wire then $\mathbf{v}_i^{\xi}$ ( $\mathbf{v}_i^{r}$ ) is the above value for its input gate. Ports act as and-gates in this context. With these definitions the computation of p by x can be regarded as a deterministic algorithm in the sense of Definition 2. The information exchanged between L and R are the crossing values. The left crossing values are sent from L to R, and the right crossing values are sent from R to L. After the last cycle L and R exchange two more bits which make the result of the computation known globally. If the length of the VLSI computation is T the communication length of the associated deterministic algorithm is O(Th). We get $Th = T/A = \Omega(\omega)$ , or $AT^2 = \Omega(\omega^2)$ . Note that if we substitute C<sub>ave</sub> for C<sub>det</sub> in Definition 7 Theorem 5 yields an average case lower bound. However, as mentioned before we have no methods of proving good average case lower bounds on the communication complexity of Boolean predicates. The identity predicate id(x,y) that served as an example for the application of Method 1 is not $\Omega(n)$ -separable because we can partition the input bits cleverly to assure that little information has to be exchanged between L and R. Specifically, if we choose $X=\{x_1,\ldots,x_{n/2},y_1,\ldots,y_{n/2}\}$ and Y accordingly L has to send just one bit to R, namely the bit $x_1=y_1 \wedge x_2=y_2 \wedge \ldots \wedge x_{n/2}=y_{n/2}$ . We can modify the identity predicate to yield an $\Omega(n)$ -separable Boolean predicate. OEFINITION 8: Let id<sub>1</sub>(x,y,k) be a predicate of three inputs, x and y are n-bit strings (n-2<sup>n</sup>) and k is an m-bit string encoding a number Osk<n. We define $$id_1(x,y,k) =$$ $$\begin{cases} 1 & \text{if for all } i=0,\dots,n-1 \text{ we have } x_i = y_{\{i+k\}} \text{ mod } n \\ 0 & \text{otherwise} \end{cases}$$ The prodicate id tests whether y is the same as x rotated cyclically by k positions. THEOREM 6: With the choice of x and y as the marked input bits of $id_1$ , $id_1$ is $\Omega(n)$ -separable. PROOF: The proof is an application of Theorem 10 in Section 6 to the cyclic shifts function defined in Section 6.0 By Theorem 5, for every chip computing $id_1$ we have $AT^2=\Omega(n^2)$ . We can modify the predicate $p_1$ in a similar way to carry Theorem 4 over to VLSI. In particular, this yields a predicate $p_2$ of N inputs such that for every chip computing $p_2$ we have $AT^2 \circ R(N^2)$ . However, in a straightforward modification of the VLSI model to allow coin tossing on a chip there is a chip that computes $p_2$ with $AT^2 \circ O(N^{3/2} \operatorname{poly}(\log N))$ . For details see [NN82]. # 6. THE COMMUNICATION COMPLEXITY OF BOOLEAN FUNCTIONS WITH MANY DUTPUTS Boolean functions with many outputs do not fit the framework outlined in Section 5. This is, because just to put the output bits on the communication channel will cost a lot. But in VLSI an output bit does not have to be made known to the whole chip. Rather it just has to be known to the output port that produces it. Thus in multiple output functions output bits may be computed "locally" and we have to reasure how much information has to be gathered from far away to compute each output bit. Therefore when dividing the inputs up between L and R we assign sides to the output bits as well. The corresponding definition is the following. DEFINITION 9: Let $f:X\times Y\to B^{m}$ . Let $f_{*}J$ be a partition of the output bits. We assume that L has to compute I and R has to compute J. A deterministic algorithm is again given by two response functions as in Definition 2. But now for each input xEX and output bit iEI we have a partial output function $a(x,1):B^*+B$ . Similarly, for each input yEY and output bit jEJ we have a partial output function $a(y,j):B^*+B$ . The computation proceeds as in Definition 2. To avoid conflicts we assume that if w is a prefix of w' and a(x,i)(w) is defined then a(x,i)(w') is defined and a(x,i)(w')+a(x,i)(w). (Similarly for a(y,j).) The computation stops as soon as a(x,i)(w) is defined for all iEI and a(y,j)(w) is defined for all jEJ. $C_{\text{det}}(f,L++R)$ is defined as in Definition 2. Definition 9 differs from Definition 2 in that the partial inputs x resp. y can help in determining an output. If m=1 then Definition 8 yields a notion of communication complexity which is no greater and at most 2 less than the communication complexity defined in Definition 2. This is, because once a side has computed the unique output bit it can make it globally known by transmitting it to the other side. The concept of non-determinism does not make sense for multiple output functions. Las Vegas algorithms can be defined analogously to Definition 4. There is one method for proving lower bounds on the communication complexity of multiple output functions. METHOD 3 (Crossing Sequences for Multiple Output Functions): DEFINITION 10:Let $f: X \times X \cdot B^{II}$ and let 1.J be a partition of the output bits of f, f has $x \cdot f = f \cdot B$ if there is a partial input $y \in Y$ such that f restricted to $X \times \{y\}$ in its domain and J in its range has were than $2^{x \cdot y \cdot 1}$ different points in its range. THEOREM 7: If f has $\omega$ -flow then $C_{det}(f_*L \leftrightarrow R) \ge \omega$ . PROOF: The proof is similar to the proof of Theorem 2. Assume that there is a deterministic algorithm computing f that has a communication length of less than w. Then for two inputs $(x_1,y)$ , $(x_2,y)$ generating different output configurations in J the same communication sequence w is generated. Thus for some $j \in J$ $f_j(x_1,y) * f_j(x_2,y)$ , but the algorithm computes $f_j(x_1,y) * a(y,j)(w) * f_j(x_2,y)$ . Thus the algorithm does not compute f correctly, a contradiction. Theorem 7 can be applied to get lower bounds on the AT<sup>2</sup> complexity of chips computing multiple output functions. The corresponding definition and Theorem are the following. DEFINITION 11 Let f:8<sup>th</sup>+8<sup>th</sup>. Let A be a subset of the input bits and B be a subset of the output bits containing the "marked" I/O bits. f is called u-separable if for all partitions of the I/O bits into sets X,Y such that both X and Y contain at most 2/3 of all marked I/O bits we have $$C_{det}(f,L\rightarrow R) \geq \omega$$ . THEOREM 8: If $f:B^{n}+B^{m}$ is an separable then $AT^{2}+\Omega(\omega^{2})$ for every chip computing f. PROOF: Analogous to the proof of Theorem 5.0 [V80] gives an example of a class of functions to which Method 3 applies. DEFINITION 12: Let $f(x_1,\dots,x_n,s_1,\dots,s_m)=(y_1,\dots,y_n)$ be a Boolean function. f computes a permutation group G on n elements if for all gEG there is an assignment $a_1,\dots,a_m$ to the $s_1,\dots,s_m$ such that $f(x_1,\dots,x_n,a_1,\dots,a_m)=(x_{g(1)},\dots,x_{g(n)})$ for all choices of $x_1,\dots,x_n$ . We call $x_1,\dots,x_n$ the permutation inputs and $s_1,\dots,s_k$ the control inputs. f is called \*\*runsitive\*\* of degree n if G is a transitive group, i.e., if for all i,j=1,...,n there is a gEG such that g(i)=j. The most straightforward example of a transitive function of degree n is the cyclic shift function $cs(x_1,...,x_n,s_1,...,s_n)=(y_1,...,y_n)$ where $n=2^n$ and the $s_1,...,s_n$ encode a number k, Osk-n and $y_1=x_{\{i+k\}} \mod n$ os computes the transitive group of cyclic permutations. Other examples of transitive functions of degree $\Omega(n)$ are the multiplication of n-bit integers, the multiplication of three $\sqrt{n} \times \sqrt{n}$ matrices, the sorting of n numbers between 0 and n etc. THEOREM 9: Let $f:8^{m+m}+8^m$ be transitive of degree n. Then f is n/6-separable. PROOF: Let G be the transitive group computed by f. The equivalence relation g(i)-h(i) for fixed but arbitrary i\(\bar{e}(1,...,n)\) divides G into n equivalence classes of size IGI/n. Let A be the set of all permutation input bits and let B be the set of all output bits. Let X,Y be any partition of AUB such that IXI,:Y|\(\sigma 4n\)/3. For each input bit i in X and output bit j in Y there are |\(\Gamma i/n\) group elements g\(\Gamma G\) such that g(i)=j. Let w.l.o.g. X be no greater than Y and assume that X contains at least as many input bits as output bits. (The other cases can be argued similarly.) Let S be the set of input bits in X and S' be the set of output bits in Y. Then $$|S| \cdot |S| \ge n/3 \cdot n/2 = n^2/6$$ . For each of the pairs (i,j)65×5° there are 6 /n group elements matching them. Since there are only a total of 161 group elements there must be one element gothere realizing at least n/5 matchings between inputs in 5 and outputs in 5'. The partial input y realizing the flow sets the control input bits in Y such together with appropriate assignments to the control input bits in X they encode this element go. The other input bits in Y are assigned arbitrarily.o We can establish a relationship between Method 3 and Method 1. DEFINITION 13: Let $f: X \rightarrow B^{m}$ . The viannoteristic predicate of f is the predicate $p_{\phi}: X \times E^{m} \rightarrow B$ such that $$p_f(x,y) = \begin{cases} 1 & \text{if } y - f(x) \\ 0 & \text{otherwise.} \end{cases}$$ The characteristic predicate of the cyclic shift function cs is the predicate 1d, defined in Definition 8. THEOREM 10: Let $f: X \times Y + B^{-1}$ . Let $p_f$ be the characteristic predicate of f. Let $f: X \times Y + B^{-1}$ . Let $p_f$ be the characteristic predicate of f. Let $f: X \times Y + B^{-1}$ . Let $p_f$ be the characteristic predicate of f. Let $f: X \times Y + B^{-1}$ . Let $p_f$ be the characteristic predicate of $p_f$ . (An input bit of $p_f$ corresponding to an output f of f enters on the side on which the output bit is produced.) If f has $p_f$ for $p_f$ for $p_f$ for $p_f$ the premise of Theorem 2 with $p_f$ with $p_f$ the premise of Theorem 2 with $p_f$ in $p_f$ the premise of Theorem 2 with PROOF: The rows-and columns of the identity matrix are given by the $2^{M-1}$ input configurations of $\rho_{g}$ that correspond to the $2^{M}$ input configurations of f generating all different output configurations in J, together with the correct guesses of all outputs.0 An application of Theorem 10 to the cyclic shift function cs proves Theorem 6. Theorem 10 shows that as Nethod 1, Nethod 3 is also rather weak: If we can prove a lower bound on the communication complexity of f then we can prove the same non-deterministic lower bound on the communication complexity of p<sub>d</sub>. In general proving average case lower bounds on the AT<sup>2</sup>-complexity of chips is hard. However, there are examples where it can be done. [TBO] proves an average case lower bound for the Discrete Fourier Transform and for Sorting in a modified VLSI model. We will now prove an average case lower bounds for the cyclic shift function. The proof is based on two observations. First we note that for the cyclic shift function every assignment of the central input bits implements exactly one group element. Second we observe that the worst case lower bound on the number of matchings realized simultaneously by a group element (see proof of Theorem 9) can be extended to an average case lower bound. LEANA 3: Let f be a transitive function of degree a computing the transitive group G. Then for each partition of the I/O bits of f as in Theorem 9 at least IGI/II group elements realize each at least n/I2 matchings across the cut. PROOF: Otherwise at most (161/11 - 1)n + (10161/11 + 1)(n/12 - 1) < n161/6 matchings across the cut are realized in total. $\alpha$ THEOREM 11: For any chip computing the cyclic shift function as we have $AT^2-\Omega(n^2)$ on the average. PROOF: Let C be the cut bisecting the chip. Let gCG be one of the n/11 group elements that realize n/12 matchings across the cut C. Consider the assignment to the control inputs that encodes g, and any assignment to the lln/12 permutation inputs that are not matched across C. By varying the n/12 permutation input bits that are matched across C we have to generate 2<sup>m/12</sup> different sequences of bits across C. The average length of each such bit sequence is thus at least $$0 \int_{0}^{\infty} z^{n/12} j2^{j} / z^{n/12} \ge n/12 - 2.$$ Thus the average length for all inputs whose control part encodes g is n/12 - 2. This means that $C_{ave}(cs,L\rightarrow R) \ge (n/12 - 2)/11$ . The rest follows as in the proof of Theorem 5.0 Note that by the argument known to us from Section 5 it can be shown that Cave(id<sub>1</sub>,L→R)=O(I) across any cut bisecting a chip computing id<sub>1</sub>. Indeed, one can devise chips that compute id<sub>1</sub> with AT<sup>2</sup>=O(n poly(log n)) on the average. Nethod 3 also applies to many functions that are not transitive, e.g., matrix multiplication ([SS1]). In most cases the lower bounds thus proved can be (asymptotically) matched with tight upper bounds ([PV90,PV8])). This shows that indeed the communication complexity often dominates the AT<sup>2</sup> complexity of a VLSI chip. ## 7. A LOMER BOUND ON THE SWITCHING ENERGY OF YEST CHIPS In Section 6 we proved lower bounds on the AT<sup>2</sup> complexity of chips computing Boolean functions with multiple outputs by measuring the amount of information that has to flow across a cut bisecting the chip. We argued that each bit needs some time, namely one cycle, to cross the cut, and that because of the limited length of the cut only a limited number of bits can cross simultaneously. The switching energy essentially counts the number of state changes happening in the circuit. More precisely, the switching energy is defined by the following statement: Each unit of chip area consumes one unit of switching energy every time it changes its state from 0 to 1 or vice versa. Thus, if we want to prove a lower bound on the switching energy considering one cut is not enough. Many of the state changes could happen in circuit components that do not cross the cut. Therefore we must consider many cuts to cover a large part of the circuit. By proving a lower bound on the amount of information flowing across each cut, and summing over all cuts, we can account for many of the state changes happening in the circuit. Since we have to consider many cuts at once it will not suffice to consider functions with a great communication complexity in the worst case. However, a property such as the one proved for transitive functions in Lemma 3 will do. THEOREM 12 ([LN81]): [et f be a transitive function of degree m. Let E be the worst case switching energy consumed by any chip computing f. Let A be the [active] chip area and let T be the worst case computing time. Then $$c_1 AT^2 \ge ET \ge \frac{c_2 n^2}{\log \frac{c_3 AT}{n^2}} \ge 0$$ for appropriate constants $c_1, c_2, c_3>0$ . PROOF: Let $\chi=\{f_*A_*\Delta\}$ be any chip computing $f_*$ . The upper bound on ET is trivial. We prove the lower bound in two steps. First we defined a set of rectilinear cuts through the chip. Second we count state changes in the circuit components crossing the cuts. In the First step we cut the chip according to a technique devised by [779]. To this end we overlay the chip with a square grid of mesh size $\lambda$ . Then we define cuts $C_1, \ldots, C_6$ bisecting the chip as shown in Figure 3. Hereby each cut consists of several sections as shown in Figure 4. The middle section of cut $C_i$ has a length of at most $(2i-1)\lambda$ . The vertical sections of all cuts $C_i$ are disjoint. Each cut induces a partition of the permutation input bits A into sets X,Y and of the output bits B into sets [,J such that |XI+|II|, $|YI+|J| \le 1\pi/3$ . We will count state changes happening on the vertical section of the cuts. Let $L_i$ be the number of circuit components crossing $C_i$ and let $L_i$ be the number of circuit components crossing the vertical sections of $C_i$ . Then $L_i \ge L_i - c_i$ for some appropriate constant $c_i > 0$ . In the second step we start by noting that for each cut $C_1$ ( $1 \le i \le 0$ ) by Lemma 3 there are 1GI/11 group elements each realizing at least n/12 matchings across $C_2$ . Therefore there has to be one group element $g_0 \in G$ realizing at least n/12 matchings across a set $\phi$ of Q/11 cuts. We will prove a lower bound on the average case switching energy consumed for computing $g_0$ on any configuration of the permutation inputs. This is then also a lower bound for the worst case energy for computing f on any input. Before we start the actual argument we discuss an encoding that expresses bit strings in terms of the state changes happening in them. DEFINITION 10: Let w be an arbitrary bit string w=0 $^{11}1^{12}0^{13}...^{1}$ . - a) s(w) is the number of state changes in w, i.e.\_s(w)\*t. - b) bin(w) is the bit string obtained from w after substituting each 0 with 00 and each 1 with 11. - c) compress(w) = 0 bin(l<sub>1</sub>) 01 bin(l<sub>2</sub>) 01 bin(l<sub>3</sub>) .01 ... 01 bin(l<sub>1</sub>). If the first bit of w is a 1 so is the first bit of compress(w). Lemma 4: |compress(w)| $$\leq 4s(w) + 2s(w) \log \frac{w}{s(w)}$$ PROOF: |compress(w)| $\leq 2s(w) + 2 \int_{j=1}^{s(w)} (1 + \log \epsilon_j)$ $\leq 4s(w) + 2 \int_{j=1}^{s(w)} \log_2 2_j$ $\leq 4s(w) + 2s(w) \log \frac{w}{s(w)}$ . Here the last inequality is derived using the well known fact that the geometric mean is no greater than the arithmetic mean.c We now consider an arbitrary but fixed cut $C_i$ . We choose an arbitrary but fixed assignment of the 11n/12 permutation inputs that are not natched across $C_i$ and let the n/12 permutation inputs matched across $C_i$ vary in all $2^{n/12}$ possible ways. Thus we generate $2^{n/12}$ different bit sequences $\{w_{ij},h_{ij}\}$ across $C_i$ . Here $w_{ij}$ summarizes all bits crossing $C_i$ in the vertical sections and $h_{ij}$ summarizes all bits crossing $C_i$ in the middle section.(1 $\le$ 5 $\le$ 2 $^{n/12}$ ). LEMMA 5: let $\{w_1, \dots, w_r\}$ be a set of r different bit strings. Then PROOF: The mapping w - compress(w) is injective. Thus Using Lemma 5 we now compute an upper bound on the length of compress(w $_{ij}$ ) averaged over all $1 \le j \le 2^{n/12}$ . LENNA 6: $$\sum_{1 \le j \le 2^{n/12}} |compress(w_{1j})| \ge (\frac{n}{12} - c_0)|T - 3| 2^{n/12}.$$ PROOF: Since the length of the middle section of cut $C_i$ is at most $(2i-1)\lambda$ there are at most $c_0i$ circuit components that intersect the middle section of cut $C_i$ . Thus there are at most t, $1 \le t \le 2^{C_0iT}$ , different sequences $h_{ij}$ . For each such sequence there is a number $u_i$ of different sequences $u_{ij}$ such that $(u_{ij},h_{ij})$ is generated by some input as described above. We have $U:=\sum_{1\le k\le 1}u_k\ge 2^{n/12}$ and by Lemma 5 Here we used the fact that $\sum_{1 \le k \le t} u_k \log u_k$ is minimum if for all $k u_k = 0/t$ . $\sigma$ Using the upper bound on Icompress(w)| derived in Lemma 4 we can now give a lower bound on the average number of state changes in all sequences w<sub>ij</sub>. Let $$S_i = \sum_{1 \le j \le 2^{n/12}} s(w_{i,j})/2^{n/12}$$ be the average number of state changes occurring LEMMA 7: $$(\pi/12 - c_0 iT - 3) \le 4S_i + 2S_i \log \frac{T_2}{S_i}$$ . PROOF: Apply Lemma 4 and Lemma 6 and observe that $\sum_{1 \le j \le 2^{n/12}} s(w_{i,j}) \log \frac{T_i^2}{s(w_{i,j})}$ is maximum if for all j, $s(v_{ij}) = S_i \cdot o$ We now sum over all cuts $C_i$ in $\phi$ . Let $\Phi:=\{0\}, 0\}$ and $\Phi:=\{0\}, 0\}$ 0. Thus $\Phi:=\{0\}, 0\}$ is the average number of state changes in the sequences $w_{i,j}$ across all cuts in $\phi$ . PROOF: He manipulate the formula in a way similar to the proof of Lemma 7.0 Choosing Q=(n-36)/24cnT yields: LEMMA 9: For suitable constants c<sub>2</sub>,c<sub>3</sub>>0 and sufficiently large n we get $$ST \ge \frac{-c_2 n^2}{\log \frac{c_3 T^2 n}{n^2}}.$$ PROOF: Substituting the value for Q into Lemma 8 yields $$\left(\frac{n-36}{24}\right)^2 \frac{1}{c_0 T} \ge 4S \left(1 + \frac{1}{2} \log \frac{TE}{S}\right)$$ (\*) Thus $$\log S \ge \log \left( \left( \frac{n-36}{24} \right)^2 \frac{1}{4c_0 T} \right) - \log \left( 1 + \frac{1}{2} \log \frac{Tt}{5} \right)$$ . Applying the inequality log(1+x)sx we get $$\log S \ge 2 \log \left( \left( \frac{n-36}{24} \right)^2 \frac{1}{4c_0 T} \right) - \log(T2)$$ . Substituting this into (+) proves the lemma.o We are left with relating S to the switching energy and bounding 2 from above, Both can be done with the following argument. Let c be a component crossing the vertical section of cut $C_i$ . It can be shown that we can charge to c an area of size $\Omega(1)$ inside the layout of component c such that areas charged to different components overlap at most v-fold. From this immediately follows that $\Omega(A)$ , Furthermore S=0(E+A). (We have to add A here for the state changes in $w_{ij}$ that can occur when swutching from one component crossing $C_i$ to the next. Those state changes in $w_{ij}$ do not correspond to quanta of switching energy dissipated on the chip.) Thus ET $$\geq \frac{-c_2n^2}{\log \frac{c_3n^2}{n^2}} - c_4n$$ . If we assume that on each circuit component there will be at least one state change, say, during initialization, then the last term can be omitted proving the theorem.c Note that if the chip is $AT^2$ optimal, i.e., if $AT^2 \cdot O(n^2)$ then the above lower backd on E is tight up to a constant factor. This shows that $AT^2$ optimal chips use their computing resources to capacity (up to a constant factor). #### REFERENCES - (BKS1) R.P.Brent, H.T.Xung, "The Area-Time Complexity of Binary Multiplication," Journal ACM Vol.28 No.3 (1981), 521-534. - [CM81] 8.Chazelle, L.Monier, "A Model of Computation for VLS1 with Related Complexity Results," 13th Annual ACH-STOC Conference (1981), 318-325. - (K82) R.Kolla, "Untere Schranken für VLS1," Diplomarbeit, Fachbereich 10, Univ. d. Saarl., Saarbrücken, West Germany (1982). - (LM81) T.Lengauer, K.Nehlhorn, "The Complexity of VLSI Computations." CMU-Conference on VLSI Systems and Computations (1981), 89-99. - [LS81] R.J.Lipton, R.Sedgewick, "Lower Bounds for VLSI," 13th Annual ACM-STOC Conference (1981) 300-307. - (LT79) R.J.Lipton, R.E.Tarjan, "A Separator Theorem for Planar Graphs." SIAN J. Appl. Math. Vol.36 (1979), 177-189. - [NCSO] C.Mead, L.Conway, Introduction to VLSI Systems, Addison Wesley, Reading, Nass. (1980). - [MM92] K.Nehlhorn, E.Meineche-Schnidt, "Las Vegas is Better Than Determinism in VLSI and Distributed Computing," 14th Annual ACN-STOC Conference (1982), 330-337. - (PYSO) F.P.Preparata, J.Vuillemin, "Area-Time Optimal VLSI Networks for Multiplying Matrices," [nf. Proc. Let. Vol.11 No.2 (1980), 77-80. - [PV81] F.P.Preparata, J.Vuillemin, "Area-Time Optimal VLSI Networks for Computing Integer Multiplication and Discrete Fourier Transform," 8th Int. Colloq. on Automata Theory Languages and Programming (Springer Lecture Notes in Comp. Sci. No.115) (1981), 29-40. - [S8]] J.E.Savage, "Planar Circuit Complexity and the Performance of YLSI Algorithms," CMU-Conference on YLSI Systems and Computations (1981), 61-68. - [T80] C.D.Thompson, A Complexity Theory For VLSI, Ph.O. Thesis, Dep. of Comp. Sci., Carregie-Mellon University (1980). - (Y80) J. Vuillemin, "A Combinatorial Limit to the Computing Power of YLSI Circuits." 21st Annual IEEE-FOCS Symposium (1980), 294-200. - [Y79] A.C.Yao, "Some Complexity Questions Related to Distributive Computing," 11th Annual ACM-STOC Conference (1979), 209-213. - [Y81] A.C.Yao, "The Entropic Limitations on VLSI Computations," 13th Annual ACM-STOC Conference (1981), 308-311. FACHSEREICH 10 UNIVERSITÄT DES SAARLANDES 6600 SAARBROCKEN WEST GERMANY