ADVANCED ELECTRONIC PACKAGING: With Emphasis on Multi-Chip Modules
Editor: W. D. Brown

CHAPTER 12
TESTING AND QUALIFICATION
S. Kolluru and D. Berleant
12.1 INTRODUCTION
12.1.1 Testing of Highly Integrated Packages: General Considerations
12.1.2 Test Issues for Multichip Modules
12.1.3 Testability and Dependability Considerations and Their Interaction
12.1.4 Dependability in MCM-Based Systems From a Testing Perspective
12.1.4.1 Dependability vs. testing
12.1.5 Fault Tolerance
12.2 TESTING: GENERAL CONCEPTS
12.2.1 Fault Models
12.2.1.1 Stuck-at fault models
12.2.1.2 Bridging fault models
12.2.1.3 Open fault models
12.2.1.4 Delay fault models
12.2.2 Fault Collapsing
12.3 TESTING OF MEMORY CHIPS
12.3.1 The Zero-One Test
12.3.2 The Checkerboard Test
12.3.3 The Walking 1/0 Test
12.4 DESIGN FOR TESTABILITY
12.4.1 Scan Design
12.4.1.1 Multiplexed scan design
12.4.1.2 Level sensitive scan design
12.4.1.3 Random access scan
12.4.1.4 Partial scan
12.4.2 Built-In Self Test (BIST)
12.4.2.1 Pseudorandom test generation
12.4.2.2 Pseudoexhaustive testing
12.4.2.3 Output response analysis
12.4.2.4 Signature analysis
12.4.2.5 BIST test structures I: built-in logic block observation (BILBO)
12.4.2.6 Circular self test path (CSTP)
12.5 OTHER ASPECTS OF FUNCTIONAL TESTING
12.5.1 Approaches to Testing MCMs
12.5.2 Staged Testing
12.5.3 MCM Substrate Testing
12.5.3.1 Manufacturing defects in MCM substrates
12.5.3.2 Contact testing
12.5.3.2.1 Bed-of-nails testing
12.5.3.2.2 Single probe testing and two probe testing
12.5.3.3 Non-contact testing
12.5.3.3.1 Electron beam testing
12.5.3.4 Wear of MCM substrates
12.5.4 Die Testing
12.5.4.1 Chip carriers
12.5.5 Bond Testing
12.5.6 Testing Assembled MCMs
12.5.6.1 Test pads
12.5.6.1.1 Test pads and performance
12.5.6.1.2 Test pad number and accessibility
12.5.6.1.3 Test pad summary
12.6 CRITICAL ISSUE: BOUNDARY SCAN
12.6.1 The Boundary Scan Concept
12.6.2 Boundary Scan for MCMs
12.7 CRITICAL ISSUE: KNOWN GOOD DIE
12.8 SUMMARY
12.9 EXERCISES/PROBLEMS
12.10 REFERENCES
Key terms: integrated, package, dependability, testing, fault tolerance, fault model, stuck-at fault, bridging fault, open fault, delay fault, fault collapsing, zero-one test, checkerboard test, walking 1/0 test, design rule, design verification, scan design, design for test, scan path, level sensitive scan design, random access, partial scan, BIST, signature, syndrome testing, BILBO, MISR, SRSG, STUMPS, CSTP, functional testing, speed sorting, substrate, contact testing, bed-of-nails probe, non-contact testing, die, dice, parameter testing, chip carrier, bond, staged testing, qualification, test pad, boundary scan, known good die, KGD.
12.1 INTRODUCTION
The
main purpose of testing is to assess quality. This assessment can be with
respect to an entire system or device, or with respect to smaller or larger
parts of it, as when attempting to find the location of a fault. The assessment
can produce a quantitative value, as when chips are to be sorted into speed
categories based on the highest clock rate for which each will function
properly, or it can be (and often is) simply a qualitative determination of
whether something works or not. Assessing quality is obviously important in
applications for which avoiding failure is critical. Perhaps less obviously but no less
importantly, assessing quality can reduce costs. For example, it is costly to
sell bad units and have to refund or replace them, and it is costly to complete
the fabrication of a unit that could have been discarded due to defects early
in the fabrication process.
While
the concept of testing is useful in a wide range of applications, we will limit
our discussion to testing of microelectronic devices, and especially to testing
issues surrounding advanced electronic packages such as MCMs.
Testing of advanced electronic packages,
like testing of other complex electronic systems, begins with informal
critiques of a design concept, ends with verifying repairs to deployed units,
and covers numerous intermediate stages. Figure 12.1 outlines the testing
stages for MCMs, one type of advanced electronic package.
****Insert
Fig. 12.1****
12.1.1 Testing of Highly Integrated Packages: General
Considerations
Testing
and the related area of dependability are well‑known and important topics
in the computing fields. Issues of
dependability and testability become more acute for highly integrated packages
(such as MCMs) than for traditional printed circuit boards due to a general heuristic
("rule of thumb") principle:
Heuristic 1: As component density increases, the
individual components tend to become harder to test and fix.
This
heuristic holds because components get smaller and more concealed by other
components and packaging. Fortunately,
this is offset by another heuristic principle:
Heuristic 2: As component density increases,
elementary parts become cheaper and more efficiently used.
The
tendency toward more efficient use of elementary components holds because of
decreased need for components assigned to interfacing, broadly defined
to include packaging, bonds, connections, I/O drivers, etc.
We classify the elementary parts that Heuristic 2 refers to into four categories:
1) Electronic
parts, such as transistors, resistors, capacitors, etc.
2) Electrical
nets, which connect other parts together. They share important properties with the
other categories of elementary parts, such as finite dependability, non‑zero
cost, and performance of important duties.
3) Electrical
bonds, such as the short wires (wire bonds) that may be used to connect
an IC and its pins, or an IC die and an MCM substrate. Bonds also share important properties with
other kinds of parts like electrical nets, and even perform similar functions,
yet differ from nets from the standpoint of fabrication, testing, and
reliability.
4) Physical
parts, such as pins, physical package parts, etc.
Integration
increases component density, and at the same time reduces the number of
elementary parts. For example, integrating the functions that were previously
performed by two chips into one chip eliminates the need for some of the
interfacing electronics, which in turn reduces the number of required nets,
electronic parts, and bonds. Having one package instead of two also reduces the
number of physical package components like pins and ceramic or plastic parts.
Placing two chips on an MCM substrate (a lesser degree of integration than
having one new chip with the functionality of the previous two) also reduces
the total number of elementary parts such as pins, bonds, and plastic or
ceramic parts.
Heuristic
1 suggests that increased integration tends to lead to problems with
dependability and testability, and hence to higher costs. Counteracting this
tendency is Heuristic 2, which suggests that increased integration tends to
lead to improvements in dependability, testability, and cost.
As
the technology and experience in support of a given level of technology
improve, the balance shifts in favor of Heuristic 2, and the degree of
integration that is most cost effective tends to increase over time.
In
this chapter we emphasize multichip modules (MCMs) and other advanced packages,
and their testing and testability as compared with functionally equivalent
single chip integrated circuits (ICs) on circuit boards (CBs), which is the
traditional genre of electronic integration. The heuristic principles are
useful because they provide basic concepts that give broad guidance and
structure for understanding this area.
12.1.2 Test
Issues for Multichip Modules
Testing
is currently a serious bottleneck in MCM design, manufacture, and deployment.
Testing has always played a major role in electronic systems, yet there are unique characteristics of MCMs that lend a distinctive character to the testing problem (see Fig. 12.1.2).
****Insert
Fig. 12.1.2****
As
Fig. 12.1.2 indicates, nets are less accessible for probing on an MCM than we
might wish. This is because nets are small and pass through the substrate,
rather than large and over the surface as in the case of printed circuit
boards. Nevertheless, the accessibility of nets for testing in an MCM is
greater than the accessibility of nets in a single chip (or wafer), because a
test pad can be built for any given net in an MCM, providing an externally
accessible point for probing that net.
This is much more difficult with a chip, where as a rule a net can be
made accessible for probing only if an entire pin is connected to that
net. Yet probe points are important for
electrical testing. For example, during the MCM manufacturing process, it is
useful to perform tests on individual dice that have just been mounted (see the
section on staged testing) and those tests require access to the nets that
connect to them.
As device complexity increases, it is
difficult to perform a full functional test, as the number of test vectors
required becomes astronomical. This led to the need to increase the testability
of internal circuits. The boundary scan method, BIST (Built In Self Test),
adding test points on an MCM substrate exterior ("test pads") and
pinning out all internal I/O to test pads are some of the ways to increase
testability [15]. MCM testing is broadly divided into two categories: those
based on software simulations and those applied directly to the devices
themselves. Simulation based test methods help ensure the functionality and
specifications compliance of the design before manufacturing. Direct test
methods perform functional testing on the MCM during and after fabrication.
12.1.3 Testability and Dependability Considerations
and Their Interaction
The
connection between testability and dependability is that improving
dependability tends to reduce the effort and expense needed for testing, and
improving testability tends to reduce (but not eliminate) the importance of
dependability. Since testing of advanced electronic packages is often
challenging, dependability is an important consideration from a testing
perspective: we can control testing needs to some degree by controlling
dependability.
While
the output of a manufacturing process cannot in general be guaranteed to work,
different manufacturing lines can and do produce artifacts of widely varying
dependabilities. The dependability of an engineered artifact is determined by
both the quality of the manufacturing process, and by intrinsic properties of
the artifact being produced. An important intrinsic property influencing
dependability is the complexity of the artifact. High complexity tends to cause
lowered dependability, and vice versa. Since the complexity of advanced
electronic packages is so high, achieving adequate dependability is an
important problem. Therefore, let us
review dependability from a testing perspective. For further discussion, see the chapter on
dependability.
12.1.4 Dependability in MCM‑Based Systems from
a Testing Perspective
Like
all electronic systems, MCM‑based systems can be viewed at different
levels. At the lowest level is analog circuitry at the circuit level
(MacDougall 1987, p. 1 [54]). The
abstraction hierarchy proceeds upwards to the system level (see Fig.
12.1.4). Dependability problems can occur due to faults in the building blocks of
any level in the hierarchy, leading to errors and failures of the overall
system.
****Insert
Fig. 12.1.4****
A
dependable system requires dependability of the building blocks and their
interconnections in each level of the hierarchy. For the circuit, gate, and
register‑transfer levels, the issues for MCM‑based systems are
similar in many ways to those for other integrated circuit based electronics.
However, a significant difference exists: for MCMs, the least replaceable unit (LRU) is now an entire MCM, which is more complex, and therefore more expensive, than the least replaceable unit on a printed circuit board.
When the LRU is an MCM, dependability
and testing of its components prior to mounting them, and staged testing
and reliability at intermediate stages of the assembly process become
more important. Staged testing refers to
verifying that components and interactions among components meet standards at
intermediate stages during the assembly of an MCM or other system.
Reworkability refers to the ease with which a bad component, bad connection, or
other defect found during staged testing can be fixed or replaced during the
assembly process.
12.1.4.1
Dependability vs. testing
It
is impossible or nearly so to repair a faulty chip. This makes it more
important than it otherwise might be for chips to work dependably. Chip
dependability is even more important when the chip is mounted in an MCM because
not only are bad chips mounted in an MCM difficult and expensive to replace in
comparison to their replacement on ordinary circuit boards, but just one bad
chip of the several contained in the MCM will usually make the whole MCM bad,
and the probability that any one of the several chips is bad is much higher
than the probability that a given chip is bad (see equations 12.9-1 &
12.9-2). Compounding the problem is that chips are hard to test before they are
mounted in an MCM, a problem of sufficient magnitude as to make testing of
unmounted chips ("bare dice") a critical issue in making MCMs economically viable (the "known good die" problem; see section 12.7).
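To see why one bad chip among several dominates the risk, here is a minimal sketch (Python; the independence assumption and the 95% die yield are hypothetical illustration values, not figures from this chapter) of how per-die yield compounds across a module:

    # Minimal sketch: probability that an MCM is good when all of its n dice
    # must be good. Assumes die failures are independent; y is hypothetical.
    def mcm_yield(die_yield, n_dice):
        return die_yield ** n_dice

    y = 0.95  # hypothetical probability that a single bare die is good
    for n in (1, 4, 8):
        print(n, "dice:", round(mcm_yield(y, n), 3))
    # 1 die: 0.95; 4 dice: 0.815; 8 dice: 0.663 -- the chance that at least
    # one of several dice is bad far exceeds the chance that one given die is bad.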
MCM
dependability and testing needs are also impacted by fabrication, operating
environment, and maintainability factors. In particular, fabrication factors
include the dependabilities and testabilities of the component chips, the bonds
which provide electrical connections between chip and wiring, the substrate or
board and its wiring, and the bonds which provide electrical contact between
the MCM and its pins. Other fabrication
related factors include the interconnection technology (e.g. optical
vs. electrical), the type of bonding
(e.g. flip chip, TAB, or wire bonding), the type of substrate (e.g. MCM-D, or deposited (Roy 1991 [56]); MCM-D/C, or thin film copper polyimide deposited on ceramic; MCM-C, or cofired ceramic; and MCM-L, or laminate), and the type of package (e.g. hermetic vs. non-hermetic).
The
impact of operating environment is similar in many ways to its effects
on printed circuit board dependability, in that many of the same environmental
factors are issues in both cases. Such environmental factors include heat and
heat cycling, humidity, shock, vibration, and cosmic rays. However, specifics often differ, so that
existing knowledge of how environmental factors influence printed circuit board
dependability must be augmented with results applicable to MCMs.
Maintainability
factors include testability, reworkability, and
repairability.
Rework
is important when testing uncovers a defective component of a partially or
completely fabricated MCM. For MCMs, rework is a much more difficult and higher
technology process than for printed circuit boards. MCM rework ranges from
technically feasible for TAB (Tape Automated Bonding) and flip chip bonding
technologies, and for the thin film copper polyimide deposited on ceramic and
cofired ceramic packaging technologies, to technically more difficult (for wire
bonding) or currently uneconomical (for laminate substrates) (Trent 1992 [15]).
From
the standpoint of repairing failed systems, replacing a failed chip can be done
when it is mounted on a fully manufactured and deployed printed circuit board,
but is much more difficult with a fully manufactured and deployed MCM.
12.1.5 Fault Tolerance
Considerable
progress remains to be made in fault tolerant architectures for MCMs. This is
partly because MCM technology, in its present state, is often too expensive for
the substantial extra circuitry required for some forms of fault tolerance to
be financially feasible. Yet, other forms of fault tolerant design do not
require significantly more silicon real estate. The perspective that might
profitably be taken is one of optimizing the tradeoff between the expense of
adding in fault tolerance, and the expense of lowered dependabilities and
increased needs for testing of non-fault-tolerant architectures.
The
basic idea in fault-tolerant design is to use redundancy to counteract the
tendency of individual faults to cause improper functioning of the unit.
Previous work on fault tolerance in multichip modules is reported by Carey
(1991, p. 29 [51]), who discusses redundant interconnections, by Pearson and
Malek (1992: pp. 2‑3 [55]), who discuss redundancy within individual
chips on a specialized MCM design, and by Yamamoto (1991 [58]), who discusses
redundant refrigeration units for increased reliability of cryogenic MCMs. More
recent work suggests that the great increases in yield achievable by adding
redundant chips to an MCM design can be cost effective (Kim and Lombardi 1995
[50]). We describe these various approaches to MCM fault tolerant design next.
One
approach to maximizing the probability that a chip will work once mounted is to
include redundant circuitry on the chip that can take over the function of
faulty circuitry if and when other circuitry on the chip becomes faulty. Hence, dice (unpackaged chips) used for
placement in an MCM may have their own built‑in fault tolerance. This approach to fault tolerance is efficient
in terms of the increase in size it implies for the MCM, since an incremental
increase in the size of a die leads to a relatively small increase in the area
of the MCM substrate that is required to hold the slightly larger die. However, such redundant designs are highly
specific to the particular chip. In
summary, on-chip redundancy to enhance yield (Thibeault et al., 1990 [57]) is
particularly applicable when chips must be reliable but are hard to acquire in tested form, as is often true for bare dice intended for use in MCMs. An MCM design utilizing this approach is
proposed by Pearson and Malek (1992 [55]).
Fault
tolerance can also be built into the MCM substrate, in the form of redundant
interconnection paths. If the substrate is found to have an open path, for
example, there might be another functionally identical path that can be used
instead. Actual MCMs have been fabricated implementing this capability (Carey
1991 [51]). This approach need not lead to increased MCM area at all, since if
less than 100% of the substrate's interconnect capacity is needed for a non‑fault-tolerant
design, the remaining capacity could be used for holding redundant
interconnections. In the event that capacity exists for only some
interconnections to be duplicated, duplication of longer ones should be
preferred since the probability of a fault in a path increases with the length
of the path (Carey 1993 [50]).
This
redundant routing approach has been shown to enhance MCM yields significantly
(Carey 1991 [51]). Since the dependability of nets in the MCM substrate
decreases as net length increases, Carey (1991 [51]) duplicated long paths in
preference to short ones. Since designs will often have some unused routing capacity, the capacity that remains can be devoted to fault-tolerant redundancy.
Redundant
conductors have been used in MCMs not only for routing through the MCM
substrate, but also for wire bonds. Redundant wire bonds are described by Hagge
and Wagner (1992, pp. 1980-1981). A large substrate was designed as four quadrants, so that the yield for each relatively smaller quadrant was higher than for a single large substrate containing all four sections. However,
connecting the four quadrants must be done dependably in order for the
connected quadrants to compete with the large single substrate design.
Connections were done with double wire bonds for increased dependability
over single wire bonds. This redundant bond concept could be investigated for
use with die‑to‑substrate connections as well. A potential
disadvantage is that double bonds may require larger bond pads. However, bonds
would require little or no additional substrate area.
The more chips there are in an MCM
design, the more risk there is of lowered yield. However, a design with more
chips may actually have a higher yield than one with fewer, if the extra chips
are there for the express purpose of providing redundancy, the increment in
chip number is modest, and an appropriate staged testing technique is employed.
Indeed, Kim and Lombardi (1995 [50]) found that very high yields were possible,
and provide analytical results establishing this.
The
MCMs of the future may be liquid nitrogen cooled, for speed and eventually to
support superconductivity and its varied benefits. The refrigeration system on
which such MCMs depend must be reliable. This motivated a dual refrigeration
unit design in the MCM system built by Yamamoto (1991 [58]). If one
refrigerator breaks down, the low required operating temperatures can still be
maintained by the other refrigerator.
Finally,
MCM fabrication lines must provide reliable control of the manufacturing
equipment. An uncontrolled shutdown can have serious negative effects on the
facility. When computers are used for
control, redundancy should be built into the fabrication line control system to
prevent the destructive effects of unanticipated shutdowns due to computer
crashes, since such crashes will tend to occur occasionally due to software bugs,
as software of significant complexity is almost impossible to produce without
bugs.
12.2 TESTING:
GENERAL CONCEPTS
We
begin with some basic definitions:
Fault detection -- the action of determining that there is a defect present.

Fault location -- the action of determining where a defect is.

Fault detection coverage -- the proportion of defects that a fault detection method can discover.

Fault location coverage -- the proportion of faults which can be successfully located. Successful location does not necessarily mean finding the exact location. Usually it means finding a sub-unit (e.g. chip, board, or other component) which contains the fault and hence needs to be replaced.

Destructive testing -- any testing method which causes units to fail in order to measure how well they resist failure.

Non-destructive testing -- any method of testing which does not intend to cause units to fail.
Defects
may occur during the manufacture of any system. In IC manufacturing,
defects may occur during any of the
various physical, chemical and thermal processes involved. A defect may occur in the original silicon
wafer, by oxidation or diffusion, or during photolithography, metallization, or
packaging. Not all manufacturing defects
affect circuit operation, and it may not be feasible or even particularly
desirable to test for such faults. We
discuss only those defects which do.
12.2.1
Fault Models
Fault
analysis can be made independent of the technology by modeling physical
faults as logical faults whose effects approximate the effects of common actual
faults. Fault models are used to specify well defined representations of faulty
circuits that can then be simulated. Fault models can also be used to assist in
generating test patterns [1]. A good fault model has the following properties
[1]:
1. The level of abstraction of the fault
model should match the level of abstraction at which it is to be used (Figure
12.1.4 exemplifies different levels of abstraction).
2. The computational complexity (amount
of computation required to make deductions) of algorithms that use the fault
model should be low enough that results can be achieved in a reasonable amount
of time.
3. The fault model should accurately represent the great majority of actual faults.
Typical faults in VLSI circuits are
stuck‑at‑faults, opens, and shorts. The ability of a set of test
patterns to reveal faults in the circuit is measured by fault coverage. 100%
fault coverage in complex VLSI circuits is usually impractical, as this would
require astronomical amounts of testing. In practice, a tradeoff exists between
the fault coverage and the amount of testing effort expended.
Since for complex circuits it is not
reasonably possible to apply a large enough set of tests to achieve full fault
coverage, a subset of all possible tests must be chosen. A good choice of such
a subset will provide better fault coverage than a less good subset of the same
size. Various algorithms have been proposed for choosing good tests for various
kinds of ICs. The D‑algorithm, PODEM (Path Oriented DEcision Making)
algorithms [5], the FAN algorithm [17], the CONT algorithm [18], and the
subscripted D‑algorithm [19] are for combinational circuits. Test
generation for sequential circuits is more complex than for combinational
circuits because they contain memory elements, and also they need to be
initialized. Early algorithms for test generation for sequential circuits used
iterative combinational circuits to represent them, and employed modified
combinational test algorithms [20,21,22]. Test patterns for memory devices can
be generated by checkerboard algorithms, the Static Pattern Sensitive Fault
algorithm, etc. [23].
No test pattern generation algorithm can
ever fully solve the VLSI testing problem because the problem is NP-complete, and thus not solvable in reasonable time for large instances [24]. Partitioning
the circuit into modules and testing each module independently is one way to
reduce the problem size. Partitioning is not always a workable approach,
however. As an example, it is non-trivial to test a circuit consisting of a
cascade of two devices, from tests for the constituent devices. Another
approach is to include circuitry in the design whose purpose is to facilitate
testing of the device. Design for testability methods include BIST (Built In
Self Test) and boundary scan, both of which are described later. Now, we review
some well-known fault models.
12.2.1.1
The stuck‑at fault model
Suppose
any line in a circuit under test could always have the same logical value (0 or
1) due to a fault. This relatively simple fault model is termed the stuck-at fault model. A line that is stuck at a logical value of 1
because of a fault is called stuck‑at‑1, and a line that is
stuck at a logical value of 0 because of a fault is called stuck‑at‑0.
To make test generation computationally tractable, a simpler version of the
stuck-at fault model called the single stuck‑at fault model assumes
that only one line in a circuit is faulty. This is often a reasonable
assumption because a faulty circuit often does have just one fault. The single
stuck-at fault model is more computationally tractable because there are many
fewer faults to consider under this model than under a more complex model (the
multiple stuck-at fault model) which allows for more than one fault to be
present at once. Consider as an example a circuit with k lines. Each line can be either properly working, stuck-at 1, or stuck-at 0, so 3**k - 1 distinct fault conditions (plus one fault-free condition) must be considered. On the other hand, consider the same circuit under the single stuck-at model. Each of the k lines can be either working, stuck-at 1, or stuck-at 0, but if one of the lines is stuck, all the others are assumed to be working. This leads to only 2k distinct fault conditions, 2 (stuck-at 1 and stuck-at 0) for each line. Fortunately, single fault tests have reasonably high fault detection coverage of multiple faults as well [3].
The
basic concept in stuck-at fault testing is to set up the inputs to the circuit
so that the line under test should have the opposite logical value from the logical
value which it is hypothesized to be stuck at, and further, so that the effect
of that line being stuck at the wrong value is to cause an incorrect logical
value downstream at an output line so that faulty circuit operation can be
observed. The process of setting the inputs so that the line under test is set
to the opposite value is called sensitizing the fault. It might be
pointed out that if a stuck-at fault cannot lead to an observable error in the
output, then the circuit is tolerant of that fault and for many purposes the
fault does not matter.
As
an example, consider the circuit shown in Figure 12.2.1.1. A stuck-at fault on
input X2 cannot be detected at the output, as you can see by tracing logical
values through the circuit. For this circuit, the output is determined by input
X1. On the other hand, a stuck-at fault on line X1 can be detected at the
output.
****Insert
Fig. 12.2.1.1****
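This kind of reasoning can be checked mechanically. The sketch below (Python; the two-input circuit is hypothetical, constructed only so that X2 is logically redundant, in the spirit of Figure 12.2.1.1) tries every input vector against every single stuck-at fault on the inputs:

    from itertools import product

    # Hypothetical two-input circuit: output = X1 AND (X2 OR NOT X2).
    # The X2 term is a tautology, so X2 cannot affect the output.
    def circuit(x1, x2):
        return x1 & (x2 | (1 - x2))

    def detectable(line, stuck):
        """True if some input vector exposes the given single stuck-at fault."""
        for vector in product((0, 1), repeat=2):
            faulty = list(vector)
            faulty[line] = stuck            # model the fault by forcing the line
            if circuit(*vector) != circuit(*faulty):
                return True                 # fault observable at the output
        return False

    for line, name in ((0, "X1"), (1, "X2")):
        for stuck in (0, 1):
            print(name, "stuck-at-" + str(stuck), "detectable:",
                  detectable(line, stuck))
    # X1 faults are detectable; X2 faults are not, so the circuit tolerates them.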
12.2.1.2
Bridging fault models
Short circuits in VLSI are called bridging faults because they are usually caused by unwanted conducting "bridges" between physically adjacent lines. Because the small size of modern circuit components places lines very close together, bridging faults are common. Bridging fault models typically assume the
effect of a short is to create a logical AND or a logical OR between the lines
that are shorted together. An AND would result when circuit characteristics
require both inputs to be high for the shorted lines to be forced high. An OR
would result when circuit characteristics allow the lines to be forced high if
the input to either line is high. Usually the resistance of a bridge is assumed
zero, although this assumption may not actually hold in practice [4]. Bridging
fault modeling is more complicated when the resistance of the short is to be
accounted for. High resistance shorts may result in degraded noise resistance
or other degradations in circuit performance without affecting logical levels
[4]. Sometimes bridging faults can
convert a combinational circuit into a sequential one, leading to oscillations
or other sequential behaviors. Stuck-at testing covers many but not all
bridging faults [7].
To
illustrate a case where all stuck-at faults can be detected by a set of test
vectors but a bridging fault would be missed, consider the circuit of Figure
12.2.1.2. The test vectors 0110, 1001, 0111, and 1110 applied to inputs A, B,
C, and D (a test vector describes the value applied to each input) will detect
all stuck-at faults. However, since all those test vectors apply the same value
to inputs B and C, a bridging fault between B and C will not be detected.
****Insert
Fig. 12.2.1.2****
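The escape can be seen from the vectors alone. This sketch (Python; it does not need the internals of the circuit in Figure 12.2.1.2, only the four vectors from the text) shows that B and C always receive equal values, so neither an AND-type nor an OR-type bridge between them is ever exercised:

    # The four test vectors from the text, applied to inputs A, B, C, D.
    vectors = ["0110", "1001", "0111", "1110"]

    for v in vectors:
        a, b, c, d = (int(ch) for ch in v)
        bridged_and = (a, b & c, b & c, d)  # AND-type bridge between B and C
        bridged_or = (a, b | c, b | c, d)   # OR-type bridge between B and C
        print(v, "B == C:", b == c,
              "bridge changes nothing:", bridged_and == bridged_or == (a, b, c, d))
    # Every vector drives B and C with the same value, so the bridged circuit
    # sees exactly the same inputs as the good circuit and the fault escapes.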
12.2.1.3
Open fault models
The
major VLSI defect types are shorts and opens. Usually, opens are assumed to
have infinite resistance. Leakage current can be modeled with a resistance
[4]. Opens can be modeled with a
resistance and a capacitance connected in parallel.
In NMOS circuits, open faults may be modeled as stuck-at faults. Opens in CMOS circuits cannot be, however, and in fact such faults often give the circuit sequential behavior [4].
12.2.1.4
Delay fault models
A delay
fault causes signals to propagate more slowly than they should. Detection
may occur when this delay is great enough that signal propagation cannot keep
up with the clock rate [9]. Two fault models that account for delay faults are the single‑gate delay
fault model and the path-oriented delay fault model.
Single‑gate
delay fault models attempt to account for the effects of individual slow gates.
Path-oriented delay fault models attempt
to account for the cumulative delay in a path through a circuit. Gate-level
models often work better for large circuits because the large number of paths
that can be present can make path-oriented approaches impractical [10].
12.2.2 Fault Collapsing
Recall that a circuit with P lines can have as many as 3**P - 1 possible multiple stuck-at faults alone. It
is difficult and time consuming to test for a large number of possible faults
and, in practical terms impossible for a circuit of significant size. By
"collapsing" equivalent faults into a single fault to test for, the
total number of faults to test for can be decreased. Faults
that are equivalent can be collapsed [5]. Faults are equivalent if they have the same effects on the outputs, and therefore cannot be distinguished from each other by
examining the outputs. Therefore, a test vector that detects some fault will
also detect any equivalent fault. As a simple example, consider a NAND gate
with inputs A and B and output Z. Under the stuck-at fault model, each of A, B,
and Z may be working, stuck-at 0, or stuck-at 1, implying 3**3 - 1 = 27 - 1 = 26 possible multiple stuck-at faults (considering a single stuck-at fault to be one variety of multiple stuck-at fault). Note that if either input is stuck-at 0, the output Z will have the value 1. Therefore, input A stuck-at-0, input B
stuck-at-0, and output Z stuck-at-1 are equivalent, in addition to some
multiple stuck-at faults such as A stuck-at 0 and B stuck-at 0, etc.
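The collapse can be computed by brute force. This sketch (Python) injects each single stuck-at fault into a 2-input NAND gate and groups faults whose faulty truth tables coincide:

    from itertools import product
    from collections import defaultdict

    def nand(a, b):
        return 1 - (a & b)

    def faulty_table(site, stuck):
        """Truth table of a 2-input NAND with one single stuck-at fault."""
        table = []
        for a, b in product((0, 1), repeat=2):
            if site == "A":
                a = stuck
            if site == "B":
                b = stuck
            z = nand(a, b)
            if site == "Z":
                z = stuck
            table.append(z)
        return tuple(table)

    classes = defaultdict(list)
    for site in ("A", "B", "Z"):
        for stuck in (0, 1):
            classes[faulty_table(site, stuck)].append(site + "/sa" + str(stuck))

    for table, faults in classes.items():
        print(faults, "->", table)
    # A/sa0, B/sa0, and Z/sa1 all produce the constant-1 table, so the six
    # single stuck-at faults collapse into four equivalence classes.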
No
fault detection coverage is lost by collapsing equivalent faults (assuming we
only have access to outputs). However, we might want to collapse some faults
that are not equivalent, saving on testing time at the expense of some loss in
coverage. For example, let us postulate two faults f1 and f2. If any test for f1 will also detect f2, but a test for f2 does not necessarily detect f1, then f1 dominates f2. (Occasionally the term is used oppositely, so that f2 would be said to dominate f1 [5].) As an
example, consider the NAND gate with input A stuck‑at‑1. The fault
is detectable at the output only by setting A to 0 and B to 1. The output Z
should be 1, but the fault makes it 0.
Note that the same test detects Z stuck-at 0. Another test for Z stuck-at 0 would be
setting B to 0, but that test will not detect A stuck-at-1. Therefore a stuck‑at
1 fault on A dominates a stuck-at-0 fault on Z because every test (of which there is only one) that detects a stuck-at 1 fault on A also detects a stuck-at 0 fault on Z.
Fault
equivalence and dominance both guide the "collapsing" of various
different faults into one fault, in that testing for that one fault also
detects the others. Fault collapsing is a useful idea because it reduces the
total number of faults that must be explicitly tested for to obtain a given fault
coverage.
12.3
TESTING OF MEMORY CHIPS
Testing of memory chips is a well defined
testing task that in some respects serves to exemplify testing of conventional
chips. Here are some kinds of faults that can cause failure in the storage
cells (faults could also appear in other parts of the memory, such as the
address decoder).
- Stuck-at fault (SAF)
- Transition fault (TF)
- Coupling fault (CF)
- Neighborhood pattern sensitive fault (NPSF)
In
a stuck‑at fault, the logic value of a cell is forced by a physical
defect to always be zero (stuck-at-0) or one (stuck-at-1). A transition fault resembles a stuck-at fault: it is present if a memory cell (or a line) will not
change value either from 0 to 1 or from 1 to 0. If it won't transition from 0
to 1, it is called an up transition fault, and if it won't transition
from 1 to 0 it is called a down transition fault. If a cell is in the
state from which it will not transition after power is applied, it acts like a
stuck-at fault. Otherwise, it can have one transition after which it remains
stuck. A coupling fault is present if the state of one cell affects the
state of another cell. If k cells together can affect the state of some other
cell, the coupling fault is called a k‑coupling fault. One kind of
k-coupling fault is the neighborhood pattern sensitive fault. If a
cell's state is influenced by any particular configuration of values or changes
to values in neighboring cells, a neighborhood pattern sensitive fault is
present.
Here are some basic tests that have been
used to detect memory faults.
12.3.1
The Zero‑One Test
This
test consists of writing 0s and 1s to the memory. The algorithm is shown below
(Figure 12.3.1). The algorithm is easy to implement, but has low fault
coverage. However, this test will detect stuck-at faults if the
address decoder is working properly.
****Insert Figure 12.3.1 here.****
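Here is a minimal sketch consistent with that description (Python; the memory is modeled as a plain list, so the test passes unless a fault is injected into the model; a real test would drive the device under test instead):

    def zero_one_test(mem):
        """Write all 0s and verify, then write all 1s and verify."""
        for background in (0, 1):
            for addr in range(len(mem)):
                mem[addr] = background
            for addr in range(len(mem)):
                if mem[addr] != background:
                    print("fault at address", addr)
                    return False
        return True

    memory = [0] * 16          # toy memory model
    print(zero_one_test(memory))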
12.3.2
The Checkerboard Test
In
the checkerboard test the cells in memory are written with alternating values,
so that each cell is surrounded on four sides with cells whose value is
different. The algorithm for the checkerboard test is shown in Figure 12.3.2.
The checkerboard test detects stuck-at faults as well as such coupling faults
as shorts between adjacent cells if the address decoder is working properly.
****Insert Figure 12.3.2 here****
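A sketch of the checkerboard test follows (Python; it assumes the logical row/column layout matches the physical cell adjacency, which real memories often scramble, so a production test would use the device's topological map):

    def checkerboard_test(mem, rows, cols):
        """Write a checkerboard and verify it, then repeat with the inverse."""
        for phase in (0, 1):
            for r in range(rows):
                for c in range(cols):
                    mem[r * cols + c] = (r + c + phase) % 2
            for r in range(rows):
                for c in range(cols):
                    if mem[r * cols + c] != (r + c + phase) % 2:
                        print("fault at row", r, "column", c)
                        return False
        return True

    print(checkerboard_test([0] * 64, rows=8, cols=8))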
12.3.3
The Walking 1/0 Test
In
the walking I/O test, the memory is written with all 0s (or 1s) except for a
"base" cell, which contains the opposite logic value. This base cell
is "walked" or stepped through the memory. All cells are read for
each step. The GALPAT (GALloping PATtern) test is like the Walking 1/0 test
except that, in GALPAT, after each read the base cell is also read. Since the
base cell is also read, address faults and coupling faults can be located. This
test is done first with a background of 0s to the base cell value of 1, and
then with a background of 1s to a base cell value of 0.
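A sketch of the walking 1/0 test (Python; the GALPAT variant would additionally re-read the base cell after each read of another cell, which is what lets it locate address and coupling faults):

    def walking_test(mem, base_value):
        """Walk one base_value cell through a background of its complement."""
        background = 1 - base_value
        for base in range(len(mem)):
            for addr in range(len(mem)):   # set up background plus base cell
                mem[addr] = base_value if addr == base else background
            for addr in range(len(mem)):   # read every cell at each step
                expected = base_value if addr == base else background
                if mem[addr] != expected:
                    print("fault: base cell", base, "read at address", addr)
                    return False
        return True

    memory = [0] * 16
    print(walking_test(memory, 1) and walking_test(memory, 0))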
12.4 DESIGN
FOR TESTABILITY
Design
for Testability (DFT) attempts to facilitate testing of circuits by
incorporating features in the design for the purpose of making verification of
the circuit easier. Generally, the strategy is to make points in the circuit
controllable and observable. Here is a more specific albeit still informal
characterization of testability:
"A circuit is `testable' if a
set of test patterns can be generated,
evaluated and applied in such a way
as to satisfy pre‑defined
levels of performance, defined in
terms of fault-detection, fault-location,
and test application criteria,
within a pre‑defined cost budget and time scale" [28].
Factors
that affect testability include difficulty of test generation, difficulty of
fault coverage estimation, number of test vectors required, time needed to
apply a particular test, and the cost of test equipment. The more complex the circuit, the lower its testability tends to be because, as we saw earlier, observability and controllability decrease. Observability is how easy it is to determine the state of a test point in question by observing other locations (usually outputs); controllability is how easy it is to cause a test point in question to take the value 0 or 1 by controlling circuit inputs [29]. There are various methods of
design for testability. We review some of them next.
12.4.1 Scan design
Scan
design uses extra shift registers in the circuit to shift in test input data to
points within the circuit and to shift out values inside the circuit. The shift
registers provide access to internal points in a circuit. Test vectors may be
applied using those points as inputs and responses to tests may be taken using
those points as outputs.
The shift register may consist of D flip
flops (i.e. latches) that are used as storage elements in the circuit, which
are connected using extra hardware into a "scan chain" so that in
test mode, test vectors can be shifted in serially, and so that the internal
state of the circuit, once latched into the latches in parallel, can be
serially shifted back out so that the state can be observed from outside. See
Figure 12.4.1.
****Insert Fig. 12.4.1
****
Thus,
1. Latches themselves can be tested.
2. Outputs of the latches can be set
independently of their inputs.
3. Inputs to the latches can be observed.
12.4.1.1
Scan Path and Multiplexed Scan Design technique
A
multiplexer is connected to each latch, and an extra control line, the scan
select, is used to set the circuit for scan (test) mode. When the scan select
line is off, the multiplexers connect the lines from the combinational logic to
the latches so that the circuit works normally. When the scan select line is
on, the latches are connected together to form a serial in, serial out shift
register. The test vector can now be input by serially shifting in the test
vector. The test output can be output by shifting it serially out the scan
output, that is, the last latch's output.
Here
is a summary of the method:
1.
Put the circuit into scan mode by inputting a 1 on the scan select line.
2.
Test the scan circuitry itself by shifting in a vector of 1s and then a vector
of 0s, to check that none of the latches have stuck-at faults.
3.
Shift a test vector in.
4.
Put the circuit in normal mode by inputting a 0 on the scan select line. Apply the primary inputs needed for that test vector, and check the outputs.
5.
Clock the latches so that they capture their inputs, which are the circuit's
internal responses to the test.
6.
Put the circuit into scan mode and shift out the captured responses. For
efficiency, clock in the next test vector as the responses to the previous one
are clocked out. Check the responses for correctness.
7.
Apply more test sequences by looping back to step 4. (The shift-in, capture, shift-out cycle is sketched in code below.)
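The following sketch (Python; the three-latch chain and its combinational next-state functions are hypothetical, chosen only for illustration) mimics the cycle just described:

    def shift_in(chain, vector):
        """Scan mode: shift a test vector serially into the latch chain."""
        for bit in vector:
            chain = [bit] + chain[:-1]    # each latch loads its predecessor
        return chain

    def capture(chain):
        """Normal mode, one clock: latches capture combinational responses."""
        s0, s1, s2 = chain                # hypothetical combinational block
        return [s0 & s1, s1 ^ s2, s2 | s0]

    state = [0, 0, 0]
    state = shift_in(state, [1, 1, 0])    # step 3: load the test vector
    state = capture(state)                # steps 4-5: apply and capture
    print("shifted-out response:", state) # step 6: observe serially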
Scan design has some disadvantages. These
include:
1. Additional circuitry is needed for the
scan latches and multiplexors.
2. Extra pins are needed for test vector
input and output, and for setting the circuit to scan mode or normal mode.
3. The circuit operation is slower than it
would otherwise be, because of the extra logic (e.g. multiplexors) which
signals must traverse.
12.4.1.2 Level sensitive scan design (LSSD)
In
level sensitive scan design, state changes in the circuit are caused by clock
values being high, rather than transitions in clock values (edges). To reduce
the possibility that analog properties such as rise and fall times and
propagation delays can lead to races or hazards, level sensitivity can be a
useful design criterion. Another positive characteristic of level sensitive
design is that steady state response does not depend on the order of changes to input values [28]. The basic storage
element used in circuits that adhere to LSSD is as shown in Fig. 12.4.1.2-1.
****Insert
Fig. 12.4.1.2-1****
The
clock values (note there are three of them) determine whether the storage
element is used as a normal circuit component or for test purposes.
To form a scan chain, the double latch
storage elements are connected into a chain configuration whereby the L2 output
of one element feeds into the L1 input of the next element. This chain
configuration is activated only during test mode and allows clocking in a
series of values to set the values of the elements in the chain. Figure
12.4.1.2-2 illustrates.
****Insert
Fig. 12.4.1.2-2****
For
proper operation of a level sensitive circuit, certain constraints must be
placed on the clocks [30], including:
(1)
Two storage elements may be adjacent in the chain only if their scan related
clocks (Scan Clock and Clock 2 in Figure 12.4.1.2-2) are different, to avoid
race conditions.
(2)
The output of a storage element may enable a clock signal only if the clock
driving that element is not derived from the clock signal it is activating
[30].
12.4.1.3 Random access scan
In
random access scan, storage elements in the circuit can be addressed
individually for reading and writing [28]. This is in contrast to other scan
design approaches such as level sensitive scan design and scan path design,
described earlier, in which the test values of the storage elements must be read
in sequentially and iteratively passed down the shift register formed by the
chain of storage elements until the register is full. In random access scan
design, storage elements are augmented with addressing, scan mode read, and
scan mode write capability (see Fig. 12.4.1.3).
****Insert
Fig. 12.4.1.3****
An
address decoder selects a storage element which is then readable or writeable
via the scan input and output lines. A disadvantage of random access scan
design is the
extra
logic required to implement the random access scan capabilities. Another
disadvantage is the need for additional primary input lines, for example the
address lines for choosing which storage element to access [30].
12.4.1.4
Partial Scan
Fully implemented scan design requires substantial
extra chip area for additional circuitry, about 30% [31]. If, however, only
some of the storage elements in the circuit are given scan capability, the
extra area overhead can be reduced somewhat. Where full scan design involves
connecting all latches into a shift register, called the "scan
chain," in partial scan some are
excluded from the chain [31]. Partial scan test vectors are shorter than those
that would be needed for a full scan design, since there are fewer latches to
be manipulated. Test sequences tend to be shorter as well, since the test vectors are shorter and fewer of them are needed. Since in partial scan, some
storage elements in the circuit cannot be read/written via the scan circuitry,
and since the importance of test access to a latch depends on its role in the
circuit, a particular partial scan design must make an intelligent choice of
which storage elements should be in the scan path.
Partial scan compared to full scan leads
to reduced area and faster circuit operation. The speed-up in circuit operation
is because those storage elements that are in critical paths may be left out of
the scan path so as not to slow down those paths.
12.4.2
Built-In Self Test
BIST (Built‑In Self Test) is a class
of design-for-testability methods involving hardware support within the circuit
for generating tests, analyzing test results, and controlling test application
for that circuit [39]. The purpose is to facilitate testing and maintenance. By
building test capability into the hardware, the speed and efficiency of testing
can be enhanced. BIST techniques have
costs as well as benefits, however. In particular, the extra circuitry for
implementing the BIST capability increases the chip area needed, leading to
decreased yield and decreased reliability of the resulting chips. On the other
hand, BIST can reduce testing related costs.
Test vectors may be either stored in
read-only memory (ROM) or generated as needed. Storing them in ROM requires large amounts of ROM and may be undesirable for that reason; however, it does potentially provide high fault coverage and advantages in special cases [39].

We consider two illustrative ways to generate test vectors. Pseudorandom testing picks test vectors without an obvious pattern. Exhaustive testing leads to better fault coverage but is more time consuming.
12.4.2.1
Pseudorandom test generation
A linear feedback shift register (LFSR)
can generate apparently random test vectors. An LFSR is typically made of D
flip‑flops and XOR gates. Each flip-flop feeds into either the next
flip-flop, an XOR gate, or both, and each flip-flop takes as input the output of either the previous flip-flop or an XOR gate. The overall form of
the circuit is a ring of flip-flops and XOR gates with some connections into
the XOR gates from across the ring because XOR gates have more than one input.
If there is no external input to the circuit,
it is called an autonomous linear feedback shift register (ALFSR) and the output is simply the values of the
flip-flops (see end of chapter exercise 9). The pattern generated by an LFSR is
determined by the mathematics of LFSR theory (see [39] for a brief description
and [40] for a detailed treatment), and LFSRs can generate test vectors that
are pseudorandom (or exhaustive).
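A minimal ALFSR sketch follows (Python; the 4-bit register with feedback from bits 3 and 2 is a standard maximal-length example, not a structure taken from this chapter):

    def alfsr(seed=0b0001, width=4):
        """Autonomous LFSR; taps at bits 3 and 2 give a maximal-length
        sequence, cycling through all 2**width - 1 nonzero states."""
        state = seed
        for _ in range(2 ** width - 1):
            yield state
            feedback = ((state >> 3) ^ (state >> 2)) & 1
            state = ((state << 1) | feedback) & (2 ** width - 1)

    for pattern in alfsr():
        print(format(pattern, "04b"))   # pseudorandom-looking test vectors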
12.4.2.2
Pseudoexhaustive testing
Testing exhaustively requires, given a
combinational circuit with n inputs, providing 2**n test vectors of n bits each
(in other words, every possible input combination). Pseudoexhaustive testing
means testing comprehensively but taking advantage of circuit properties to do
this with less than 2**n input vectors.
If the circuit is such that no output is
affected by all n inputs, it is termed a partial dependent circuit and any
given output line can be comprehensively tested with less than 2**n input
vectors. The exact number depends on how many inputs affect that output line.
If k inputs affect it, then 2**k vectors will suffice, comprising every
possible combination of values for the inputs that affect that output, with the
values for the other input lines being irrelevant (to testing that output
line). Each output line may be tested in this way. Thus, if the circuit has 20
inputs and 20 outputs, but each output relies on exactly 10 of the inputs,
2**10 tests for each of the 20 outputs implies that 20 x 2**10 or approximately
20,000 tests can be comprehensive, compared to 2**20 or approximately 1,000,000
tests for an exhaustive testing sequence which would be no more comprehensive.
Other pseudoexhaustive techniques can
improve on this even more. For example, if there are two input lines which
never affect the same output line, they can always be given the same value with
no decrement in the comprehensiveness of the test sequence. More generally,
test vectors for testing one output line can also be used for other output
lines, reducing the number of additional test vectors that must be generated
for those other output lines. An approach to doing that is described, for
example, in [41].
As a concrete example, Figure 12.4.2.2
illustrates a partial dependent circuit.
****Insert Figure 12.4.2.2
here****
The circuit shown has an output f which is determined by inputs w and x, and an output g which is determined by inputs x and y. Neither output is affected by both w and y, so nothing is lost by connecting w and y together so that they both always have the same value. With that done, only four vectors, instead of 2**3=8, provide an exhaustive test sequence.
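The compressed vector set can be enumerated directly (Python, following the w/x/y example above):

    from itertools import product

    # f depends on (w, x); g depends on (x, y). Tying w and y together leaves
    # two independent inputs, so 2**2 = 4 vectors suffice instead of 2**3 = 8.
    vectors = [(tied, x, tied) for tied, x in product((0, 1), repeat=2)]
    print(vectors)  # [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

    # Check: each output's input cone still sees every combination.
    print(sorted(set((w, x) for w, x, y in vectors)))  # all 4 (w, x) pairs
    print(sorted(set((x, y) for w, x, y in vectors)))  # all 4 (x, y) pairs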
When a circuit is not partial dependent
(that is, some output depends on all inputs), the circuit is termed complete
dependent. In this case, pseudoexhaustive testing may be done by a technique
involving partitioning the circuit [42]. This method is more complex.
12.4.2.3
Output response analysis
Consider a circuit with one output line.
Checking for faults means checking the response sequence of the circuit to a
sequence of tests. One possibility is to have a fault dictionary consisting of
the sequence of correct outputs to the tests. However, this is impractical for
a complex circuit due to the large amount of data that would need to be stored.
One way to address this problem is to compact the response sequence so
that it takes less memory to store. The
compacted form of an output response pattern is called its signature. This
concept is known as response compression [43]. Since there are fewer bits in
the signature than in the actual output sequence, there are fewer possible signatures
than there are actual potential outputs. This results in a problem known as
aliasing. In aliasing, the signature of a faulty circuit is the same as the
signature of the correct circuit. The faulty output signature is then called an
alias. Aliasing leads to a loss of fault coverage. One approach to using
compaction is "signature analysis," described next.
12.4.2.4
Signature analysis
Signature analysis has been a commonly
used compaction technique in BIST. An LFSR (Linear Feedback Shift Register) may be used to read
in an output response and output its
signature, a shorter pattern determined by the test output response pattern.
Since the signature is determined by the
test output pattern, if a fault results in a different test output pattern,
then the fault is likely (but not certain) to have a different signature. If a
fault has a different test output pattern but its signature is the same as the
proper test output, aliasing is said to have occurred. Aliasing reduces test coverage. Figure 12.4.2.4-1
depicts an LFSR with an input for the test response pattern and contents which
form the signature.
Many circuits have multiple output lines,
and for these the way an LFSR is used for signature generation must be changed.
One way is to feed the different output lines into different points in the LFSR
simultaneously (Figure 12.4.2.4-2). An alternative approach uses a multiplexer
to feed the value of each output line in turn into a one-input LFSR, a process
which must be followed for each test input vector.
****Insert Figure 12.4.2.4-1 here****
****Insert Figure 12.4.2.4-2 here****
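A serial single-input version is easy to sketch (Python; it reuses the hypothetical 4-bit taps from the pseudorandom example above, so both the taps and the response streams are illustration values):

    def signature(response_bits, width=4):
        """Compact a serial test response into a width-bit signature by
        XORing each bit into the LFSR feedback (taps assume width=4)."""
        state = 0
        for bit in response_bits:
            feedback = (((state >> 3) ^ (state >> 2)) & 1) ^ bit
            state = ((state << 1) | feedback) & (2 ** width - 1)
        return state

    good = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical fault-free response
    bad = [1, 0, 1, 1, 0, 1, 1, 0]    # same stream with one bit flipped
    print(format(signature(good), "04b"), format(signature(bad), "04b"))
    # The two signatures differ here, but with only 2**width possible
    # signatures, some faulty streams inevitably alias to the good one.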
12.4.2.5
BIST test structures I: built‑in logic block observation (BILBO)
BILBO has features of scan path, level
sensitive scan design, and signature analysis. A BILBO register containing 3 D
flip-flops (latches, labeled DFF), one for each input, appears in Figure
12.4.2.5-1. Z1, Z2, and Z3 are the parallel inputs to the flip flops and Q1, Q2
and Q3 are the parallel outputs from the flip flops. Control is provided
through lines B1 and B2. If B1=1 and B2=1 the BILBO register operates in
function (non-test) mode. If B1=0 and B2=0 the BILBO register operates as a
linear shift register and a sequence of bits can be shifted in from Sin to serve
for example as a scan string. If B1=0 and B2=1 the BILBO register is in reset mode and its flip flops are reset to 0. If B1=1 and B2=0 the BILBO register is in signature analysis mode and the MUX is set to select Sout as the input to
Sin, forming a linear feedback shift register (LFSR) with external inputs Z1,
Z2, and Z3. See Figure 12.4.2.5-2 and end of chapter problem 10.
****Insert Figure 12.4.2.5-1 here.****
****Insert Figure 12.4.2.5-2 here.****
The BILBO approach relies on the
suitability of pseudorandom inputs for testing combinational logic. Therefore,
when the BILBO control inputs cause it to operate in signature analysis mode
(that is, to be an LFSR), the pseudorandom patterns it produces can be used as
test vectors. For example, Figure 12.4.2.5-3 shows a circuit with two
combinational blocks, testable with two BILBO registers.
****Insert figure 12.4.2.5-3
here****
In figure 12.4.2.5-3, the first BILBO is
set via input vector pn to generate pseudorandom test vectors for the
combinational block it feeds into. The second BILBO is set via input vector sa
for signature analysis purposes. The first BILBO is therefore used to apply a
sequence of test patterns, after which the second BILBO is used to store the resulting
outputs of the combinational block, followed by scanning out those outputs (the
signature). When combinational block 1 has been tested, block 2 can be
tested similarly by simply reversing the roles of the BILBO registers.
BILBO has an interesting advantage over
many other types of scan discipline. Using BILBO, if N test vectors are applied
before scanning out the results, the number of scan outs for those N vectors is
1, compared with the N scan outs required by other scan disciplines. However,
BILBO requires more extra circuitry than LSSD, as well as leading to relatively
more signal delays because of the gates connected to the flip flop inputs [30].
12.4.2.6
Circular self test path (CSTP)
CSTP [25] connects some (or all) storage
cells in the circuit together, forming one large circular register. A cell of the circular register may contain
one D flip flop or two arranged as a master and slave. The cells form a feedback shift register,
hence the use of the term "circular." The circular path is augmented with a gate at the input of each cell that, during test mode, XORs the cell's functional input (the sole input during non-test operation) with the output of the preceding cell in the circular path. This causes the outputs of the
flip flops during test mode to change in a difficult to predict way, so that
they can be used as test inputs to the circuit. When operated in the normal
mode, the cells feed inputs through to the combinational blocks. When operated
in the test mode, the cells feed test values into the combinational blocks.
Once the test pattern has propagated through the circuitry, the response is fed
into the circular register which compacts the response into a signature. The test
response is combined with its present state via the XOR gates to produce its
next state and next output. The circular path can now apply the next test
vector which is its current contents. After repeating this some number of times
the register contents can be checked for correctness. Correctness might be
determined by matching against the contents for a known working circuit, for
example. The creators of CSTP cite as significant advantages of CSTP that:
1)
the complexity of the on-chip mode control circuitry is minimized by the fact
that a full test can be done in one test session.
2)
The hardware overhead is low compared to other multifunctional register test
methods like the BILBO technique, because the cells are simpler as they need
only be able to load data and compact data. As a caveat, this assumes the
circuit can be reset into a known state from which to begin testing.
The test pattern generated by the
circular path is neither pseudorandom nor purely random, but instead is
determined by the logic of the circuit. The authors defend this choice by analyzing its effect in comparison to exhaustive testing (that is, applying all possible test input vectors),
concluding that with a testing time of 4X what would be needed for exhaustive
testing, 98% of the possible test vectors will be applied, and with a testing
time of 8X, 99.9+% of the possible test vectors will be applied. The problem of
test pattern repetition must be dealt with because if it occurs then the entire
preceding sequence of test vectors will also then repeat. Then longer test
times will result in no improvement in coverage. The authors of this approach
found that this is unlikely to occur, can be identified if it does occur, and
can be avoided by changing the initial state of the circular register.
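The circular feedback structure is easy to simulate. The following Python sketch assumes a 6-cell circular register and a stand-in combinational function of our own invention; only the XOR compaction and circular feedback follow the CSTP idea.

# Minimal sketch of a circular self-test path (CSTP) update.

def circuit_logic(state):
    """Stand-in for the combinational blocks fed by the cells: each response
    bit depends on two neighboring cell outputs (hypothetical logic)."""
    n = len(state)
    return [state[i] & state[(i + 1) % n] for i in range(n)]

def cstp_step(state):
    """Next state: cell i XORs its functional input (circuit response bit i)
    with the output of the preceding cell in the circular path."""
    response = circuit_logic(state)
    n = len(state)
    return [response[i] ^ state[(i - 1) % n] for i in range(n)]

state = [1, 0, 0, 1, 0, 1]           # initial (reset) state
for _ in range(20):                  # each step applies one test vector
    state = cstp_step(state)
print("signature after 20 steps:", state)  # compare against a known-good run

The final register contents serve as the signature, to be matched against the contents obtained from a known working circuit.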
12.5 OTHER
ASPECTS OF FUNCTIONAL TESTING
The self-test methods described above
facilitate functional testing, in which we test an actual device to
ensure its behavior conforms to specifications. This contrasts with speed testing,
in which properly working circuits are sorted depending upon how fast they will
run, and with destructive testing, in which circuits under test are destroyed
in a process which aims to find out what the limits of the circuit are. In this
section we address some additional aspects of functional testing, emphasizing
MCMs.
Functional testing is important not only
for screening out defective units but for quality control, production line
problem diagnosis, and fault location within larger systems.
Functional testing occurs after all
design rules are satisfied, all design specifications are met during the
simulation and analysis phase, and the physical design goes through part or all
of the manufacturing process. In MCMs, functional testing is primarily done at
the substrate level, die level, and module level.
Staged testing, in which proper functioning of each die on an MCM is checked after it is mounted but before the next die is mounted, can help catch problems early. Testing of fully assembled units verifies that the completed system works.
12.5.1 Approaches
To Testing MCMs
Testing methods can be classified
as built‑in or external.
Built-in (e.g. BIST) approaches may be preferable in some cases; however, they make the design process more difficult, since they require extra hardware on the MCM beyond the dice and their connections. External test methods will be preferable in many cases due to lower design and production costs.
Testing methods can alternatively be
classified as concurrent or non‑concurrent. In concurrent
testing, the device is tested as it runs, such as by a testing program that
runs using clock cycles that would otherwise go unused. In contrast, non‑concurrent
testing is run on a unit that is not being used. Concurrent testing makes the
design task more difficult, yet can enhance dependability by automatic
detection of faults when they occur, as is necessary e.g. for fault tolerance
methods requiring on-the-fly reconfiguration.
Non‑concurrent testing is easier and will probably have a role in
MCM testing indefinitely.
Testing
methods can also be classified as static or dynamic. Static
testing deals with DC characteristics of devices that are not actually running.
In MCMs, this can be used for testing substrates prior to die installation. MCM
testing also requires dynamic testing, that is, testing while the MCM is in
operation.
Still another way to classify testing
methods is functional vs. parametric. Functional testing involves testing to
see if a device can do the things it is supposed to (that is, perform its
functions). Parameter testing is testing to see whether various parameters fall
within range. For example, a parametric test might measure rise and fall times
to check that they will support operation at a specified frequency.
Let us now look at staged testing, in which components of MCMs are tested as the module is built, in section 12.5.2, then at ways of testing the various components of MCMs in sections 12.5.3 through 12.5.5, before moving on to testing of entire MCMs in section 12.5.6.
12.5.2
Staged Testing
The general strategy of testing earlier
in the construction of a complex circuit rather than later is intended to
minimize wasted work (and hence expense). Taking MCMs as an example, early
detection of faults means less likelihood of mounting dice on bad substrates,
less likelihood of mounting bad dice, less chance of sealing MCMs with bad
components, less likelihood of selling bad MCMs, less chance of embedding bad
MCMs in a larger system, etc. Detection of faults as early as feasible is thus
an important part of an overall testing philosophy.
Increasing the feasibility of early
testing has its own costs. In the case of MCMs, a staged approach to testing in
which each die is tested after it is installed (instead of testing the whole
MCM after all the dice are installed) requires test pads to be located on the
substrate to facilitate test access to each die. This means using potentially
valuable substrate area for the pads, a more complex substrate design, and
potentially slower operation due to the capacitance and cross talk increase
caused by the extra metal in the pads and the conductance paths that lead to
them.
Taking
the early testing strategy further, we might test each die prior to
installation. This would not completely eliminate the need for testing it after
installation, and hence the need for test pads, because dice can be damaged by
the installation process, but it would avoid performing the installation
process on a die that is already bad. But the cost of this is high because
testing a die prior to installation is a difficult problem in itself. In fact
this problem has a name: the known good die problem (KGD). This important
problem is described later in the chapter.
12.5.3 MCM substrate testing
MCM
substrates are like miniaturized printed circuit boards in that they connect
together all the component parts of the MCM as well as serve as a platform on which
to mount those parts. Substrates should be tested for defects before ICs are mounted on them, both because such testing is relatively easy to do and because of the substantial cost of going through the rest of the fabrication process, which would be wasted if the substrate were bad.
12.5.3.1 Manufacturing defects in MCM
substrates
The substrate contains nets that
should be tested for opens and shorts. These nets terminate at the substrate surface in pads to which components such as dice will be connected. Those
connections may use wire bonds, flip chip bonding technology, or tape automated
bonding (TAB). While many pads are used
as connections to dice, some are used to connect with the pins of the MCM. A net may be tested for opens, shorts to
other nets, and high resistance opens or shorts by probing those test pads.
High frequency test signals can be applied to test for characteristics like
impedance, crosstalk and signal propagation delays.
There
are a number of approaches to testing nets, which are reviewed in the following
paragraphs. Each has its own advantages and disadvantages. These approaches may
be classified into the two broad categories of contact and non-contact methods.
12.5.3.2 Contact testing
In
contact testing a substrate is tested by making physical contact with the pads.
Resistance and capacitance measurements are done using probes to contact the
pads and locate opens, shorts and high resistance defects in the nets. For
example, a net demonstrating an unexpectedly low capacitance likely has a break in it. As another example, by moving two probes to two pads, the tester
can verify that continuity exists or that no short exists, as desired.
12.5.3.2.1 Bed‑of‑nails testing
Bed‑of‑nails
testing uses a probe consisting of an array of stiff wires. Each wire contacts
a different pad on a device, so that all (or many) of the pads needing to be
probed are contacted by a different wire at once. Multiplexing allows the
testing device to select which wires to use for sending or receiving test signals,
allowing measurements of resistance or impedance between a pair of pads or
between any two sets of pads.
Suppose there are N nets on a substrate to be tested, with Pk pads in the k-th net. The number of tests required to certify the k-th net for opens is Pk - 1. Therefore the total number of tests to certify all N nets on the substrate for opens is Σ(Pk - 1). Given an average of p pads per net, N(p - 1) tests are needed to test for opens. To test for shorts, each net must be checked for infinite resistance to each other net, unless auxiliary information about the spatial layout of the MCM is available that allows the testing procedure to skip pairs of nets that are far apart spatially. In the absence of such information, N(N - 1)/2 tests for checking shorts on the substrate are needed (provided the nets have no opens). As an example, suppose a substrate has 100 nets with an average of 5 pads per net. Then 100x(5-1) = 400 tests are needed for open circuit testing, and 100x99/2 = 4,950 tests for short circuit testing. The number of tests needed for short circuit testing increases quadratically with the number of nets. As the number of tests becomes large, bed-of-nails test probes save increasing amounts of time, because the probe need not be moved from place to place: each pad is already connected to one of the probes in the bed-of-nails. Packages for which the test pads form a regular grid with a fixed
center are better suited to bed-of-nails testers than idiosyncratic
arrangements of pads because idiosyncratic arrangements require the probe head
to be custom built [13]. Packages with small, densely packed pads are harder to
use with bed-of-nails testers because the probe becomes more complex and
expensive to make.
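The test-count arithmetic above is simple to script; a minimal Python sketch (the function names are ours):

def open_tests(num_nets, avg_pads_per_net):
    # P_k - 1 tests per net, so N(p - 1) in total
    return num_nets * (avg_pads_per_net - 1)

def short_tests(num_nets):
    # every unordered pair of nets: N(N - 1)/2
    return num_nets * (num_nets - 1) // 2

print(open_tests(100, 5))   # 400 open-circuit tests
print(short_tests(100))     # 4950 short-circuit tests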
Because
bed-of-nails testers are relatively complex and expensive, yet the probe need
not be mechanically (therefore slowly) moved around for each separate test,
bed-of-nails testing is most suited to situations requiring testing a large
volume of circuits quickly, so that the high cost is distributed over many
tested circuits [13].
12.5.3.2.2 Single probe testing and two probe
testing
Nets are separated by non-conducting dielectric material. This implies a capacitance between a pair of nets or between a net and the reference plane. If a testing procedure applies an AC signal to a net, typically from 1 kHz to 10 MHz, the impedance can be measured [14]. This measurement can be compared with the corresponding measurement from
another copy of the same device which is known to be good, or perhaps with a
statistical characterization of the
corresponding measurement from a number of other copies of the device. Lower
than expected capacitance suggests an open circuit, while higher than expected
capacitance suggests a short circuit.
To check for shorts, one measurement for
each net is required. To check for opens, one measurement for each pad is
required. If doubt exists as to whether the flow of current created by
application of AC represents only the normal capacitance of the net or includes
a high resistance short, AC of a different frequency may be applied. The
difference in current flow I1-I2 that this creates will be a function of the
capacitance C, the frequencies F1 and F2, and the resistance R. If R is
infinite, then I1/I2=F1/F2, and any deviation from that is due to resistance
(and inductance).
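The two-frequency check can be sketched numerically. The following Python example assumes a simple parallel R-C model for the net under probe; the component values are made up for illustration.

import math

def current(v, freq, c, r=math.inf):
    """|I| for a parallel R-C admittance driven at voltage v."""
    omega = 2 * math.pi * freq
    g = 0.0 if math.isinf(r) else 1.0 / r
    return v * math.hypot(g, omega * c)

v, c = 1.0, 50e-12                     # 1 V drive, 50 pF net capacitance
f1, f2 = 1e4, 1e6
for r in (math.inf, 3e5):              # no short vs. a 300 kilohm resistive short
    i1, i2 = current(v, f1, c, r), current(v, f2, c, r)
    print(f"R={r}: I1/I2={i1/i2:.4f}  F1/F2={f1/f2:.4f}")
# With R infinite the two ratios match; a resistive path makes I1/I2 deviate.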
Single
probe testing is not as affected as bed-of-nails testing by high pad density or
small pad size, but there are also some disadvantages [13]. One disadvantage is
that, if nominal test values are derived from actual copies of the circuit,
design faults will not be detected. Another disadvantage is that if the
substrate has pads on both sides then it must be turned over during the testing
process.
Two
probe testing has all the capabilities of one probe testing and then some, at
the price of a modestly more complex mechanism that can mechanically handle two
probes at once. Shorts can be isolated to the two offending nets by probing
both of them at once.
With
single and dual probe testers, the probes must be mechanically moved from pad
to pad. This limits the speed of testing [13]. To maximize speed, minimize the
total travel distance of the probes. An optimal minimization requires solving the famous Traveling Salesman Problem, which is known to be computationally intractable.
Flying probe technologies are becoming
more popular as control of the impedances of lines in a substrate becomes more
important due to modern high signal frequencies. Flying probe heads provide
control over the impedance of the probe itself, to facilitate sensitive
measurements of the nets [44].
12.5.3.3
Non‑contact testing
Testing using probes that make mechanical
contact with pads on a circuit can damage the pads, which in turn can prevent
good contact between the pad and a connection to it later in the manufacturing
process. This is one reason why a non-contact testing method is attractive.
Another reason is that in some MCM technologies it is desirable to test the
substrate at various stages in its manufacture before pads are present. This
may not be practical with mechanical testers due to the small size of the metal
areas to be probed. In non-contact testing, electrical properties of nets are
tested without making actual physical contact with them.
12.5.3.3.1
Electron beam testing
Electron beam testing (e.g. [46]) works
somewhat like the picture tube on a television set or computer terminal. A hot
negatively charged piece of metal is used as a source of electrons, which are
directed toward a target, which is the circuit in the case of a tester or the
screen in the case of a television. Magnetic deflection coils or
electrostatically charged plates can move the beam, back and forth and up and
down for a television, or however is needed for a tester. By directing the
electron beam at a particular place on a circuit, a net can be charged up. If
the charge then appears on another net, or does not appear on part of the
charged net, there is a short or open.
Electron beam testing is like single
probe testing in some ways, because the electron beam is analogous to the
single probe. However because there are no moving parts it can operate much
faster than a mechanical device. Another difference is that the electron beam
is DC whereas single probe testers typically use AC. However both varieties of
tester rely on the capacitance of the circuit structures to hold charge, and both can thus mistake high-resistance paths for shorts.
A disadvantage of electron beam testing
not shared by contact methods is the need for the circuit to be in a vacuum
chamber. This can mean a delay of minutes to pump out the air in the chamber
before the testing process can begin. One solution to this is to have an air
lock on the vacuum chamber. The circuit is placed in the relatively small air
lock which can be evacuated much faster than the larger test chamber. After the
air lock is evacuated the circuit is moved into the test chamber proper, which
has been in a vacuum all along.
Electron beam testers appear to be entering the marketplace. The Alcedo company estimates it can sell them for $1.2 million each, and was completing one for the US Air Force as of this writing [45].
12.5.3.4 Wear of MCM substrates
The
substrate contains the wiring used to connect all the other components on the
MCM. Improper fabrication can lead to gradual corrosion of nets, and eventually to failure. For substrates that are properly manufactured and initially found working, however, tested reliability is remarkably high.
Roy (1991 [56]) subjected MCM-D (deposited), HDI (high density interconnect) MCM substrates to HAST (highly accelerated stress tests) for thermal, moisture resistance, salt atmosphere, and thin film adhesion reliability characterization, and found that the MIL-STD-883C and JEDEC-STD-22 reliability standards were easily exceeded, with expected substrate lifetimes of over 20 years.
12.5.4 Die testing
An
MCM is populated with unpackaged chips ("bare dice") which are
mounted on the substrate. These bare dice should be good, because if they are
not there is substantial extra cost involved in removing and replacing them.
This is a problem, because bare dice are not widely available in tested form,
as they are usually tested by the manufacturer only after they are mounted in
the typical one-die package. There is more than one reason for this:
1)
It is much easier to test a packaged chip than an unpackaged bare die.
2)
Manufacturers make much of their money from the packaging, and so are not very
interested in selling the unpackaged dice.
3)
Manufacturers prefer not to sell untested bare dice because doing so not only risks their reputation for reliability, but raises the fear that the MCM manufacturer might damage dice during its own testing and then blame the die supplier for supplying bad dice! Such concerns are real.
ICs
intended for mounting on an MCM may also be designed differently from ICs
intended for standard use. Because they are so close together the paths between
them will tend to have low capacitance, meaning that the dice can be designed
with low power drivers. It is more difficult to test such dice because their
loads must have high impedance to match the drivers [13]. Another MCM-specific
testing difficulty is that manufacturers sometimes change the chip dimensions
without warning, requiring the MCM maker to reactively change their test setup
on short notice.
As discussed elsewhere in this book, die
yield has a major impact on MCM yield. In fact, the yield of the MCM will be
significantly lower than the yield of the dice it contains. Furthermore, the
rework required in removing and replacing bad dice is expensive. So verification
of bare dice before mounting is important despite the difficulties.
MCMs are usually intended to operate at
high frequencies, and so high frequency testing is an important part of an MCM
test strategy. High frequency testing is more difficult than standard testing
due to the interference posed by the impedances in the test equipment.
12.5.4.1 Chip carriers
A
chip carrier (Figure 12.5.4.1) is a die package which is close in size to the
die it carries. Simple in principle, it connects to densely packed perimeter
bond pads on a die and runs leads to a less densely packed area array
about the size of the die itself. This less densely packed area array package
provides a surface mount device (SMD) that is easily assembled into test
sockets or directly onto MCM substrates. This provides easier access to the I/O
ports for either testing or mounting on MCM substrates than is provided by the
bare dice, yet does not change the area of the device to be mounted
significantly because the area array is layered over the die itself. Easier
testing means dice that are not yet mounted on a substrate can be tested before
mounting, thus helping to address the problem of providing known good die
(KGD). If the chip carrier with its mounted die passes the tests, the entire
carrier package may be mounted as is on an MCM substrate, with connections
between the substrate and the die mediated by the area array provided by the
carrier. The carrier is thus a permanent package which is acceptable for mounting
on an MCM because its size is insignificantly larger than the die it contains.
One drawback is that getting from the die to the MCM substrate now requires two connections, one from the die to the carrier and one from the carrier to the substrate, instead of the single die-to-substrate connection needed without the carrier. Since two connections must be made successfully instead of one, connection reliability suffers somewhat and yield is lowered. Chip carrier philosophy and current technology are reviewed by Gilleo (1993 [52]).
****Insert Figure 12.5.4.1 here.****
12.5.5 Bond Testing
70‑80% of MCM faults are
interconnection faults (assuming use of known good die, which often is not
actually the case) (Hewlett Packard 1991 [53]), so this kind of testing is
useful, even though it does not directly target faulty dice. Interconnection
faults are faults in the connections between a die and the substrate. Testing
for interconnection faults is the responsibility of the MCM manufacturer.
Open
bonds could be tested by applying a probe to the net on the MCM substrate to
which it is supposed to be attached, and measuring the capacitance. A properly
working bond will cause the measured capacitance to be the sum of the
capacitance of the net, the wire bond or other bond material, and the
capacitance of the input gate or output driver on the die to which it makes a
connection. The capacitance measurement is facilitated by the fact that
resistive current will be negligible in CMOS circuits, which are the most
common kind. For ordinary dice, input gate capacitances run about 2 pF, whereas
output drivers have capacitances on the order of 100 pF. (Dice made
specifically for mounting in MCMs do not need powerful output drivers and so
their output capacitances can be significantly less, but such dice are not
generally available at present.) Thus open bonds that disconnect output drivers
should be relatively easy to detect. However, if an output driver and an input
gate are both bonded to the same net, the presence of the higher capacitance
output driver could dominate the capacitance measurement, precluding a reliable
conclusion about whether the bond to the input gate is open or not. If this bond testing approach is to be used, however, a die with an input gate on a given net could be mounted on the substrate before a die with an output driver to be bonded to the same net, so that bonds to low capacitance input gates can be capacitance tested before high capacitance output drivers are present. This would be a form of staged testing.
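A small Python sketch of the capacitance-based bond check described above, using the rough figures from the text (~2 pF input gates, ~100 pF output drivers); the threshold and net values are hypothetical.

def expected_capacitance(net_pf, bonds_pf, gate_pf_list):
    """Good-bond expectation: net + bond metal + attached gate/driver loads."""
    return net_pf + bonds_pf + sum(gate_pf_list)

def bond_looks_open(measured_pf, expected_pf, tolerance=0.25):
    """Flag an open if the measurement falls well below expectation."""
    return measured_pf < expected_pf * (1 - tolerance)

expected = expected_capacitance(net_pf=5.0, bonds_pf=0.5, gate_pf_list=[2.0])
print(bond_looks_open(4.8, expected))    # True: input-gate bond open on a quiet net
# With a 100 pF driver already on the same net, a missing 2 pF gate is invisible:
expected = expected_capacitance(5.0, 0.5, [2.0, 100.0])
print(bond_looks_open(104.0, expected))  # False: the open gate bond is masked

The second call illustrates the masking problem that motivates mounting the input-gate die first.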
This form of staged testing does have its
drawbacks, mainly due to the fact that some bonds of a given die will need to
be created before, and others after, bonds of other dice are made. Process
lines are better suited to handling all of the bonds to a die in one stage, and
only then going on to another die. Yet flexible, integrated manufacturing lines
should become increasingly viable in the future as automation increasingly
assists manufacturing processes.
A more serious drawback is that this
approach will not work with flip chip processes.
Mechanical testing of a particular kind of bond, the wire bond (a small wire going between a pad on a die and a pad on an interconnect), involves pulling on it to make sure it is physically well attached at both ends.
12.5.6 Testing Assembled MCMs
Even when all components (substrate,
dice, bonds, pins . . .) are working properly, they still may not necessarily
interact properly. Thus a complete MCM or other system must be tested even if
its components are known to be good.
Working
parts may also become bad during the process of assembling a larger system and
so it is useful for the system to support testing of its parts even if they
were tested before assembly, and especially if they were not, or not
completely. MCMs are a good example of such systems due to the difficulty of testing
some of the component parts prior to assembly. Components of completed MCMs can
be hard to test because it is hard to observe the states of the interconnects, which are hidden within the MCM and much smaller than interconnects on printed circuit boards.
Thus
a module must be tested after it is assembled. This requires testing of
components in case they were damaged during assembly, even if they were known
to be working prior to assembly. A burn-in may be provided to help detect
latent defects.
Various
test strategies are suited for testing assembled MCMs. The test strategy chosen
will be affected by built-in testability features if any, die quality,
substrate quality, reliability requirements, fault location coverage requirement, and rework cost [13].
Testing exhaustively is equivalent to fully testing each module component plus verifying that the components work properly together. This provides high fault coverage; however, it is impractical for all but certain MCM designs, such as a bus-connected microprocessor and SRAM, due to the complexity of fully testing all internal logic from the MCM pins [13].
Building
the MCM using components with built in testability features can facilitate MCM
testing significantly. Ways to incorporate testability include boundary scan, built-in self test (BIST), and providing external access to internal I/O
(e.g. through test pads, discussed next). Methods such as these facilitate
fault detection coverage, fault location coverage, and faster and simpler
testing.
12.5.6.1 Test pads
The
testability of an MCM can be increased by bringing out internal nets to the
exterior of the package. This is done by having test pads, which are
small contact points on the outside of the MCM package that are each connected
to some internal net. This provides observability and controllability to
internal connections not connected to any of the MCM pins. This can help test
engineers to isolate internal faults. Test pads can be connected directly to
each net in an MCM. This makes the MCM testing problem analogous to the printed
circuit board test problem, in that all nets are accessible for probing. Unfortunately, while test pads connected to
all or many nets in an MCM for testability may be feasible to manufacture, they
have some drawbacks. These drawbacks include the following:
*
Test pads increase capacitance and cross‑talk, adversely affecting
performance.
*
Test pads can be hard to access for test purposes simply because of their necessarily small size (4 mils might be a typical test pad dimension), which results from cramming all such pads into the small external area provided by the MCM package.
Let us deal with each of these issues in turn.
12.5.6.1.1 Test pads and performance
While test pads are useful in MCM testing, they have the disadvantage of decreasing performance. One way to avoid this tradeoff would be to build test pads on the MCM, use them for testing, and remove them when testing is concluded, for example with a fabrication step that chemically etches away the exposed pads while leaving everything else intact.
12.5.6.1.2 Test pad number and accessibility
More
test pads means crowded, smaller, and therefore less accessible test pads. By
providing fewer test pads, the ones provided could be made larger and therefore
more accessible. Thus there is a
tradeoff between the number of test points and their accessibility. Consequently, progress on MCM testing via
test pads must take one of two broad strategies:
Strategy
1: Better ways to access small pads arranged in dense arrays.
Strategy
2: Dealing with an incomplete set of test pads.
Regarding
strategy 1, here are some ways to access pads:
* A
small number of probes (for example, two) which can be moved from pad to pad
efficiently (the moving probe approach).
*
Many probes, one for each test pad, to be applied all at once. This avoids the problem of moving probes
around from pad to pad, but at the price of having to build a dense, precisely
arranged, expensive set of probes ("bed‑of‑nails") that
can reliably connect to their respective pads. A collapsing column
approach to constructing each probe is one way to do this.
*
Electron beam ("E‑beam") use. A high technology and non‑trivial
undertaking.
These
methods were discussed previously in this chapter.
With regard to strategy 2, here are some
possibilities for maximizing testing effectiveness given limited access to
substrate nets. Judicious use of available pads is required. Approaches for
this include:
*
Design the "right" test pads into the MCM. Some nets will be more
important to test than others, and part of the design process would be to
decide which are the most important ones and provide pads for those. Artificial intelligence work on sensor placement, such as [47], might be applicable here.
*
Clever diagnostic use of existing test pads. Artificial intelligence work on
diagnosis, especially of digital circuits, could come into play here. Hamscher
[48] describes one approach and reviews previous work.
*
Vias could be manufactured for test pads on all paths, while actual pads are fabricated only for some. This would make it easier to change which paths are provided with test pads, since no redesign of vias would be needed; only the MCM layer providing the pads themselves would need to be redesigned.
12.5.6.1.3 Test pad summary
There are tradeoffs between the desirable
goal of high fault coverage and its undesirable price of small, numerous,
difficult‑to‑access pads.
This tradeoff could be optimized by providing pads for the more
important probe points in preference to the less important ones. This optimization
process also involves a tradeoff: there is benefit to be gained from providing
access to only important probe points, but at a cost in the design phase of
finding out what those probe points are. Its utility depends on the number of
MCMs to be produced from a given design: a greater number means more
benefit. For MCM designs in which cost
and efficiency are not the overriding factors, it would seem reasonable to
provide test pads for all nets. This
might apply, for example, to small runs
of experimental MCMs.
12.6 CRITICAL
ISSUE: BOUNDARY SCAN
It has been said that if boundary scan were implemented on chips that were made available as known good die (KGD), MCMs would suddenly be the packaging technology of choice for many applications. While there are other issues involved, there is a good deal of truth to the belief that widespread use of boundary scan (and availability of known good die) could alleviate the MCM testing problem to the degree that MCMs would be a much more competitive packaging option than they presently are. That is why we characterize use of boundary scan (and availability of known good die - see Section 12.7) as "critical issues."
12.6.1 The Boundary Scan Concept
Boundary
scan [26,37,38], formally known as IEEE/ANSI Standard 1149.1‑1990, and
informally often referred to as "JTAG," is a set of hardware design rules that improve testing time and cost. Boundary scan allows testing
at the IC level, the PCB (printed circuit board) level, and the system level, as
long as each has a "boundary" consisting of input and output lines.
The
basic boundary scan architecture appears in Figure 12.6.1-1. The main modules
are the Test Access Port (TAP) controller, the instruction register, and the
data registers which include the boundary-scan register, bypass register, MUX,
and two optional registers ("device ID" and "design
specific").
****Insert
Fig. 12.6.1-1 here.****
The Test Access Port (TAP) includes the extra pins added to the package to communicate with the internal boundary scan logic. These are called the test clock (TCK), test data input (TDI), test mode select (TMS), and test data output (TDO) lines. The boundary scan logic is controlled through the TCK and TMS pins, and data is shifted into and out of the logic via the TDI and TDO pins [37].
The
TAP controller is a 16 state finite state machine (FSM) (Figure 12.6.1-2). The
TAP controller changes state synchronously on a Test Clock rising edge, or
asynchronously if an optional Test Reset pin is also included in the pins
comprising the test access port. The state of the TAP controller machine
determines the mode of operation of the overall boundary scan logic.
The
set of TAP controller states include a state in which a boundary scan
instruction is shifted into the Instruction Register one bit at a time from the Test Data In (TDI) line. The Instruction Register can also be initialized to 01
by having the TAP controller enter another state for this purpose. The
Instruction Register contains a shift register for shifting in the instruction,
and output latches for storing the instruction and making it accessible to the
rest of the boundary scan logic in order to determine its specific behavior.
However the Instruction Register is loaded, the loaded values must be moved to
the output latches of the register for them to determine the test function of
the boundary scan circuitry. This is done with yet another state of the TAP
controller. Once the Instruction Register is properly set, another TAP
controller state can be entered (via signals sent through the test access port
pins, of course) in which the contents of the Instruction Register determine
the specific behavior of the boundary scan logic.
The
data registers include two registers which are required by the boundary scan
standard. These are the boundary register and the bypass register. The boundary
register is a set of cells, one per I/O pin on the tested device, except for
the TAP pins. The boundary scan logic allows these cells to act as a shift
register so that test input data can be shifted into the cells, and test output
data shifted out of them, using the TDI and TDO pins. Each boundary register
cell can also read data from the pin to which it is connected or the internal
logic whose output goes to the pin. Thus the boundary cell can pass values
through, allowing the circuit to act in normal mode, or can shift test data in
or out, or can provide values to the inputs of the tested circuitry or read
values from it. The bypass register has only one cell and provides a short path
from TDI to TDO that bypasses the boundary register.
A
transition diagram for the 16 states of the TAP controller appears in Figure
12.6.1-2. The label on an arc shows the logical value required on the TMS line
for the indicated transition to occur [37]. The transition occurs on the rising
edge of the TCK signal. Depending on the state of the TAP controller, data may
be shifted in at TDI, may be parallel
loaded into the instruction register, etc. Depending on the TAP controller
state, activity may also occur on the falling edge of TCK. For example, data
that has been shifted into the shift rank of a register through TDI may be
latched into the hold rank of the register where it is stored and made
available to other parts of the circuit. As another example, on the falling
edge of TCK data may be shifted out on TDO (although shifting in through TDI
only occurs on a rising edge). In the figure, notice the similarity between the
two vertical columns, the data column and the instruction column. These columns
represent states in which analogous activities are performed on the data or
instruction registers.
****Insert
Fig. 12.6.1-2****
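The TAP controller is naturally expressed as a lookup table. The following Python sketch encodes the transition diagram of the standard IEEE 1149.1 TAP state machine (Figure 12.6.1-2); the TMS walk at the end is just one illustrative path.

TAP = {  # state: (next state if TMS=0, next state if TMS=1)
    "TEST-LOGIC-RESET": ("RUN-TEST/IDLE",  "TEST-LOGIC-RESET"),
    "RUN-TEST/IDLE":    ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
    "SELECT-DR-SCAN":   ("CAPTURE-DR",     "SELECT-IR-SCAN"),
    "CAPTURE-DR":       ("SHIFT-DR",       "EXIT1-DR"),
    "SHIFT-DR":         ("SHIFT-DR",       "EXIT1-DR"),
    "EXIT1-DR":         ("PAUSE-DR",       "UPDATE-DR"),
    "PAUSE-DR":         ("PAUSE-DR",       "EXIT2-DR"),
    "EXIT2-DR":         ("SHIFT-DR",       "UPDATE-DR"),
    "UPDATE-DR":        ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
    "SELECT-IR-SCAN":   ("CAPTURE-IR",     "TEST-LOGIC-RESET"),
    "CAPTURE-IR":       ("SHIFT-IR",       "EXIT1-IR"),
    "SHIFT-IR":         ("SHIFT-IR",       "EXIT1-IR"),
    "EXIT1-IR":         ("PAUSE-IR",       "UPDATE-IR"),
    "PAUSE-IR":         ("PAUSE-IR",       "EXIT2-IR"),
    "EXIT2-IR":         ("SHIFT-IR",       "UPDATE-IR"),
    "UPDATE-IR":        ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
}

state = "TEST-LOGIC-RESET"
for tms in [0, 1, 1, 0, 0]:   # walk from reset into instruction shifting
    state = TAP[state][tms]   # next state on each rising edge of TCK
print(state)                  # SHIFT-IR: TDI now shifts in the instruction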
The tests supported by boundary scan can be placed in three major modes [26].
1.
External: Stuck-at, open circuit, and bridging fault conditions related to the
connections between devices on an MCM and other such devices or the outside world
are detectable.
2.
Internal: Individual devices on an MCM can be tested despite their being
already mounted on a substrate and sealed into the MCM without their I/O ports
connected directly to MCM pins (or test pads). Devices may have test data set
up on their ports via the TDI pin, which can shift data into the boundary register. However, only the I/O ports (the "boundary") are accessible through boundary scan. If the device also contains BIST capability, then the BIST facility can be controlled and used by the boundary scan circuitry to do a more thorough and faster test of the internal circuitry of the device.
3.
Sample Mode: The values at the I/O ports of a device (i.e. what would be the
pins if the die was packaged individually in the usual fashion) can be read by
the cells of the boundary register, and those values shifted out for analysis,
while the device is operating normally. In this mode the boundary scan logic
does not affect circuit operation, but rather provides visibility to values
entering and leaving the device even though its I/O may not be directly
accessible from outside the MCM.
12.6.2 Boundary Scan for MCMs
The
individual chips can be tested in isolation by linking the TDO port of one chip
to the TDI port of another, forming a chain of chips. Figure 12.6.2 shows how
this chain is constructed. A test vector is clocked in at the TDI of the first
chip in the chain, and clocking in continues until all chips have their test
data. Then the chips are run for a desired number of clock cycles. Finally, the
resulting outputs of the chips are clocked out of the TDO port of the last chip
in the chain, with clocking continuing until all the boundary register contents
of all the chips are clocked out for external analysis.
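The shifting mechanics of such a chain are easy to simulate. This Python sketch assumes two chips with hypothetical 4-cell and 6-cell boundary registers; the overflow of each chip's register feeds the next chip's TDI.

def scan_chain(chain_cells, bits):
    """Shift `bits` in at the first chip's TDI. Returns the final cell
    contents and the bits that emerged at the last chip's TDO."""
    cells = [c[:] for c in chain_cells]
    tdo_stream = []
    for b in bits:
        carry = b
        for reg in cells:             # ripple through each chip in turn
            reg.insert(0, carry)      # shift in at this chip's TDI
            carry = reg.pop()         # last cell feeds the next chip's TDI
        tdo_stream.append(carry)      # off the end of the final chip
    return cells, tdo_stream

chips = [[0] * 4, [0] * 6]            # two chips: 4- and 6-cell registers
loaded, out = scan_chain(chips, [1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(loaded)   # the test data now sits in every boundary cell of both chips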
Shorts between MCM substrate
interconnects should be tested before opens, because operating the MCM with
shorts present can damage or shorten the lifetimes of components. Thus shorts
should be detected as soon as possible so the problem can be corrected before
further damage occurs, if it hasn't already. The basic idea of testing for
shorts is to use boundary scan to clock in an appropriate test vector, then
clock it out again to see if it contains values that have changed. A changed value
would be due to the wrong value being present at a boundary cell because the
corresponding chip I/O port is shorted to an interconnect that is set to that
value. An algorithm and its explanation are provided for example in Parker
(1992 [37] section 3.2.2.1). Testing for shorts reveals many open faults as
well, but not all. Testing for opens is thus necessary. This is done by
ensuring that values set in one location are propagated to another location
that is supposed to be connected, where both locations are accessible by the
boundary scan logic (that is, both locations are die I/O ports).
****Insert
Fig. 12.6.2****
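A simplified sketch of the counting-sequence idea behind such shorts tests follows; this is our own illustration rather than the specific algorithm of [37], and it assumes a wired-AND short model (real drivers may resolve shorts differently). Each net is driven with its binary index over ceil(log2 N) scan cycles; nets whose received ID differs from the driven one are flagged.

from math import ceil, log2

def shorts_test(num_nets, shorted_groups):
    width = ceil(log2(num_nets))
    driven = {n: [(n >> b) & 1 for b in range(width)] for n in range(num_nets)}
    received = {n: v[:] for n, v in driven.items()}
    for group in shorted_groups:          # wired-AND of all shorted drivers
        for b in range(width):
            val = min(driven[n][b] for n in group)
            for n in group:
                received[n][b] = val
    return [n for n in range(num_nets) if received[n] != driven[n]]

print(shorts_test(8, shorted_groups=[(3, 5)]))   # nets 3 and 5 are flagged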
12.7 CRITICAL
ISSUE: KNOWN GOOD DIE
Unpackaged
chips (plural either "bare die" or
"bare dice") are usually available from manufacturers only in
untested form, if at all. The eventual availability of pre‑tested dice,
which are termed "known good," is expected to be an important factor
in making MCMs economical for other than high-end applications, since even one
bad die on an MCM almost always means the whole MCM will not work. Testing of
bare die is easier for the die manufacturer than for the MCM assembler. This is
because
(1) the die manufacturer will be more likely
to already have testing capabilities for the chip, even if only for its
packaged form, and
(2) generating tests for a chip is easier
for the manufacturer when, as is frequently the case, the design of the chip is
known only to its manufacturer and kept proprietary.
The importance of mounting known good die ("KGD") on an MCM is due to the rapid degradation in MCM yield as the number of dice increases and the yield of each die decreases. The concept of yield is dealt with elsewhere in this book, but due to its relevance to this section it is briefly reviewed here.
The proportion of all units that are fault
free is the yield of a manufacturing process. A yield for bare die that
would make them well suited to placement on MCMs would be something like .999
[26]. A figure like that is high enough that a set of dice, each with that yield individually, would all work (making the resulting MCM work also) with reasonably high probability.
mounting them on the MCM does not degrade the yield of the MCM unduly, and that
the substrate has already been verified as fault free. However, such a high
yield for bare die may not be easily approached. One reason for that is the low demand for
known good die (KGD). This means IC manufacturers are not impelled by market
forces to produce them. Only about 0.001% of IC sales are in the form of KGD
[49]. As the MCM market increases, this may lead to increased demand for KGD,
leading to more availability of KGD, leading in turn to more economical MCM
production, leading to yet more market increase for MCMs. Thus there is a
feedback relationship between the availability of KGD and the market share of
MCMs. This feedback relationship could lead to an explosion in MCM production,
or could impede it instead, resulting in continued relatively small,
specialized market niches, depending on whether the critical point in MCM
production and KGD production can be exceeded by other, smaller forces.
Unpackaged
ICs (i.e. bare dice) are harder to test than ICs in individual packages. Any
testing unpackaged ICs do get is typically while they are still on the wafers,
which are the relatively large slices of silicon on which a number of identical
chips are manufactured prior to their being sawed apart into individual dice.
Those tests are usually of limited scope, due to the relative difficulty
compared to the packaged chips of testing at various temperatures, removing
heat generated by the chip while it is operating, and adequately accessing the
I/O ports of the chip [25]. The result of these problems is fault coverage much
lower than for packaged ICs, which can be more thoroughly tested. This low
fault coverage leads directly to a higher percentage of faulty dice passing the
limited tests they are given. Unfortunately the yield of a module must be lower
than the yield of the least reliable die mounted on it. Yield of an MCM depends
in part on the yields of its constituent ICs in accordance with
Ym = (Yd)^n                                  (12.7-1)
where Ym is the module yield, Yd is the expected die yield, and n is the number of dice mounted on the MCM. This is just the probability that all
the individual dice are working. For example, given a bare chip yield of 99% and 20 chips on a module, the module yield is only about 82%. Given a bare chip yield of 95% and 20 chips, the module yield is about 36%. The MCM yield Ym decreases exponentially as the number of dice on the MCM increases.
The
above formula can be modified to account for different dice of different
yields. In that case, the yield of the MCM is
Ym = Y1 * Y2 * ... * Yn                      (12.7-2)
where
there are n dice on the MCM, all must be working for the MCM to work,
and Yx is the yield of die number x.
These equations calculate module yield from die yield alone; in reality, module yield also depends on other factors, in particular the interconnects, the substrate, and the assembly processes [25].
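The arithmetic of equations (12.7-1) and (12.7-2) is simple to check; a minimal Python sketch:

def module_yield(die_yields):
    y = 1.0
    for yd in die_yields:   # all n dice must work: Ym = Y1*Y2*...*Yn
        y *= yd
    return y

print(module_yield([0.99] * 20))   # ~0.82: 20 dice at 99% each
print(module_yield([0.95] * 20))   # ~0.36: yield collapses exponentially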
Here are some of the technical problems
described in Williams (1992 [49]) that inhibit the availability of KGD.
(1) DC parametric testing of dice is useful but does not verify functional performance of a die. At-speed functional testing (perhaps at different temperatures) is important in achieving the high yield of bare die necessary for high yield of the resulting MCMs.
(2) Proper burn-in of die, especially since
bare die may have different thermal characteristics than they do after mounting
on the MCM.
(3) Test vector acquisition from the
manufacturers of the die.
(4) Compatibility issues of different test
equipment.
Testing of bare die can be facilitated
through design for testability. BIST, for example, will become more cost effective as more chips
are used in MCMs, due to the difficulties in testing MCMs compared to
individually packaged chips.
Only an MCM testing process with 100%
fault coverage will detect all faulty MCMs. However, the increasing complexity
of modern integrated electronic circuitry, exemplified by MCMs, makes 100%
fault detection coverage difficult. The defect level of MCMs passing the final testing process is determined by both the yield of the MCM itself and the fault coverage of that final test, and the yield of the MCM is determined in large part by the yield of its component dice. Consequently, testing only the assembled MCM results in a lowered probability that a shipped MCM is fault free when its component dice are not known good.
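Reference [27] gives a classic quantitative relation between defect level, yield, and fault coverage: DL = 1 - Y^(1-T), where Y is the true yield and T the fault coverage. A sketch assuming that model (the 60% yield figure is hypothetical):

def defect_level(yield_, coverage):
    # Williams-Brown relation [27]: DL = 1 - Y**(1 - T)
    return 1 - yield_ ** (1 - coverage)

for t in (0.80, 0.95, 0.999):
    print(f"coverage {t:.3f}: {defect_level(0.60, t):.4%} of shipped parts bad")
# Even at 95% coverage, a 60%-yield module stream ships ~2.5% defective units.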
12.8 SUMMARY
This
chapter reviews many of the topics related to testing of MCMs and other complex
forms of electronic circuitry. The more miniaturized and integrated an electronic circuit is, the harder it is to test. On the other hand, greater integration means fewer elementary components and hence greater potential dependability, since fewer components means fewer ways the circuit can have faults.
Fault coverage refers to the ability of a
testing method to find faults in the circuit. Since it is impractical to catch every possible fault in a complex circuit while testing, fault coverage is less than 100%. One reason for this is the reliance of testing
methodologies on fault models, which only encompass some of the diverse
possible kinds of real faults. Typical fault models include stuck-at-fault
models, bridging fault models, open fault models, and delay fault models.
An increasingly important approach to
testing is designing circuitry from the very beginning in a way that supports
testing later. Approaches to designing for testability (DFT) include multiplexed scan design, level sensitive scan design, random access scan, partial scan, and built-in self test (BIST) techniques. BIST means including hardware support for testing that is more sophisticated than the simpler aforementioned approaches. The most important current BIST method is boundary scan.
External testers continue to be extremely
important in testing. MCM substrate testing verifies the substrate prior to the
expensive process of mounting dice (unpackaged chips) onto it. That and
subsequent stages of testing can use contact testing methods (e.g.
bed-of-nails, single probe, two probe, or flying probe methods), and
non-contact testing methods, such as electron beam testing. Major purposes of
such testing are to verify functionality and to determine the speed at which
the circuit can operate.
12.9
EXERCISES/PROBLEMS
1 Consider
Figure 12.2.1.2 and section 12.2.1.2. Explain how the test vectors 0110, 1001,
0111, and 1110 can detect all stuck-at faults.
2 Consider
an AND gate and its three lines. List all single and multiple stuck-at faults
that are equivalent to input line A stuck-at 0.
3 Consider
a NAND gate with inputs A and B and output Z. List all tests that detect a
stuck-at 1 on A. List all tests that detect a stuck-at 0 on Z. Why does the
stuck-at 1 on A dominate the stuck-at 0 on Z?
4 Consider
the RUN‑TEST/IDLE state of the TAP controller in the boundary scan
architecture. Does the term "RUN" go with "TEST" only or
does "RUN" go with both "TEST" and "IDLE"? Why?
5 Why
is the TAP controller designed with states EXIT1‑IR, EXIT2‑IR,
EXIT1‑DR, and EXIT2‑DR? That is, why not remove those states from
the TAP controller design in order to make it simpler?
6 What
is the purpose of the TAP controller states UPDATE‑IR and UPDATE‑DR?
Hint: Registers have an input "rank," or portion for shifting in
data, and an output rank for providing logic values to other parts of the
system.
7 Michael
and Michelle Chipkin suggest the following test approach. Critique it. Their
"revolutionary" approach is to store a "brain scan" of a
known working (that is, a "gold standard") MCM and compare it to the
"brain scan" of the MCM under test.
To get such a "brain scan," chart the frequency spectrum above
each point on the surface of the MCM. If the MCM under test has unexpected
differences, these differences indicate areas that are not operating properly.
8 Consider
the issue of known good dice in the following light. The competitiveness of MCM
technology is dependent in considerable degree on the availability of KGD. Yet
the availability of KGD depends in considerable degree on demand for them in
the form of MCMs. Thus there is a feedback cycle that tends to inhibit or
promote the growth of MCM technology depending on the values for KGD
availability (which we might model imperfectly as price) and MCM use (which we
might model imperfectly as some percent of all IC manufacturing). Write a
computer program that implements a model of this situation. For your model, is there a value for MCM use and a value for KGD price that will just tip the model into a positive feedback situation in which MCM use suddenly increases very quickly, and if so, what are those values? Use any numbers you like in setting up the variables of
your model, or better, obtain numbers from the current literature. Consider
this problem as a thesis topic.
9 It
is the future. The McDonald's corporation McModule division has decided MCMs
will play an important role in the next wave of computer technology, ubiquitous
computing. Their motto is, "a
hamburger in every pot and an MCM in every plate," evoking the idea that
complex electronic modules will be everywhere, even embedded in your plate to
monitor the food on it. Figure 12.9 shows the floor plan of their multipurpose
MCM, using components selling for less than a dozen for a penny, for use in
plates and other everyday items. How will the overall size of the MCM change if various test methodologies are used? How does yield change if yield is assumed proportional to size?
****Insert Figure 12.9 here****
10 From the description of LFSRs in this chapter, draw a diagram of an ALFSR containing three latches and two XOR
gates. Assuming a starting state in which all the latches output the value 0,
what is the next state of the circuit? What are the next 10 states of the
circuit?
11 Consider
the 4 control input combinations possible for a BILBO circuit. Explain how each
of the four causes the circuit to behave. Refer to the BILBO section of this
chapter.
12 Section
12.3 describes up transition faults. Give an analogous description of down
transition faults.
13 Give
an example of a memory fault that the 0-1 test will not find.
14 Consider
figure 12.2.1.1. Explain why a stuck-at fault on line X2 cannot be detected,
and how a stuck-at fault on line X1 can be detected.
12.10
REFERENCES
1. John
P. Hayes, "Fault modeling," IEEE Design & Test, pp. 88‑95,
Apr. 1985.
2. Kenneth
M. Butler and M. Ray Mercer, Assessing Fault Model and Test Quality, Kluwer
Academic Publishers, 1992.
3. V.
K. Agarwal and A. S. F. Fung, "Multiple fault testing of large circuits by
single fault tests," IEEE Trans. Comp., Vol C‑30, No. 11, pp. 855‑865,
Nov. 1981.
4. Rochit
Rajsuman, Digital Hardware Testing: Transistor‑level Fault Modeling and
Testing, Artech House, Boston, 1992.
5. Melvin
A. Breuer and Arthur D. Friedman, Diagnosis and Reliable Design of Digital
Systems, Computer Science Press, 1976.
6. J.
P. Shen, W. Maly, and F. J. Ferguson, "Inductive fault analysis of MOS
integrated circuits," IEEE Design & Test, Vol. 2, No. 6, pp. 13‑26,
Dec. 1985.
7. S.
D. Millman and E. J. McCluskey, "Detecting bridging faults with stuck‑at
test sets," in Proc. 1988 IEEE Int. Test Conference, pp. 773‑783,
Sept. 1988.
8. S.
D. Millman and E. J. McCluskey, "Detecting stuck‑open faults with
stuck‑at test sets," in Proc. 1989 IEEE Custom Integrated Circuits
Conference, pp. 22.3.1‑22.3.4, May 1989.
9. J.
D. Lesser and J. J. Schedletsky, "An experimental delay test generator for
LSI," IEEE Trans. Comp., Vol. C‑29, No. 3, pp. 235‑248, Mar.
1980.
10. V.
S. Iyengar, B. K. Rosen, and J. A. Waicukauski, "On computing the sizes of
detected delay faults," IEEE Trans. CAD, Vol. 9, No. 3, pp. 299‑312,
Mar. 1990.
11. Kenyon
C.-Y. Mei, "Bridging and stuck‑at faults," IEEE Trans. Comp.,
Vol C‑23, No.7, pp. 720‑727, July 1974.
12. Donald
R. Schertz and Gernot Metze, "A new representation for faults in
combinational digital circuits," IEEE Trans. Computers, Vol. C‑21,
No.8, pp. 858‑866, Aug. 1972.
13. Thomas
C. Russell and Yenting Wen, "Electrical testing of multichip
modules," in Daryl Ann Doane and Paul D. Franzon (Editors), Multichip
Module Technologies and Alternatives, pp. 615‑660.
14. Frank
Crnic and Thomas H. Morrison, "Electrical test of multi‑chip
substrates," ICEMM Proceedings '93, pp. 422‑428.
15. James
R. Trent, "Test philosophy for multichip modules," International
Journal of Microcircuits and Electronic Packaging, vol. 15, no. 4, 1992, pp.
239‑248.
16. H.
T. Nagle, S. C. Roy, C. F. Hawkins, M. G. McNamer, and R. R. Fritzemeier,
"Design for testability and built‑In self test: a review," IEEE
Transactions on Industrial Electronics, Vol. 36, No. 2, May 1989, pp. 129‑140.
17. H.
Fujiwara and T. Shimono, "On the acceleration of test generation
algorithms," IEEE Trans. on Computers, Vol C‑32, pp. 1137‑1144,
Dec. 1983.
18. Y.
Takamatsu and K. Kinoshita, "CONT: a concurrent test generation
algorithm," Fault-Tolerant Computing Symp. (FTCS‑17) Digest of
papers, Pittsburgh, PA, pp. 22‑27, July 1987.
19. C.
Benmehrez and J. F. McDonald, "The subscripted D‑algorithm: ATPG
with multiple independent control paths," ATPG Workshop Proceedings, pp.
71‑80, 1983.
20. H.
Kubo, "A procedure for generating test sequences to detect sequential
circuit failures," NEC Res & Dev (12), pp. 69‑78, Oct 1968.
21. G.
R. Putzolu and J. P. Roth, "A heuristic algorithm for the testing of
asynchronous circuits," IEEE Trans. on Computers, Vol C‑20, pp. 639‑647,
1971.
22. P.
Muth, "A nine‑valued circuit model for test generation," IEEE
Trans. on Computers, Vol C‑25, pp. 630‑636, June 1976.
23. Manoj
Franklin and Kewel K. Saluja, "Built‑in self‑testing of random‑access
memories," Computer (IEEE Computer Society Press), Vol 23, No.10, pp. 45‑55,
Oct. 1990.
24. B.
Konemann, J. Mucha, and G. Zwiehoff, "Built‑in test for complex
digital integrated circuits," IEEE Journal of Solid‑State Circuits,
Vol SC‑15, No.3, pp. 315‑318, June 1980.
25. A.
Krasniewski, Circular Self-Test Path: A Low-Cost BIST Technique for VLSI
Circuits, IEEE Transactions on Computer-Aided Design, Vol. 8, no. 1, Jan. 1989,
pp. 46-55.
26. G.
Messner, I. Turlik, J. Balde, and P. E. Garrou, Thin Film Multichip Modules,
International Society for Hybrid Microelectronics, 1993.
27. T.
W. Williams and N. C. Brown, "Defect level as a function of fault
coverage," IEEE Trans. on Computers, Vol. 30, pp. 987‑988, Dec.
1981.
28. R.
G. Bennetts, Design of Testable Logic Circuits, Addison‑Wesley, 1984.
29. Joseph
Di Giacomo, Designing with High Performance ASICs, Prentice Hall, Englewood
Cliffs, New Jersey, 1992.
30. B.
W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison‑Wesley,
1989.
31. V.
D. Agarwal, K.-T. Cheng, D. D. Johnson, and T. Lin, "Designing circuits
with partial scan," IEEE Design & Test of Computers, pp. 9‑15,
Apr. 1988.
32. R.
Gupta and M. A. Breuer, "The BALLAST methodology for structured partial
scan design," IEEE Trans. on Computers, Vol 39, No 4, pp. 538‑544,
Apr. 1990.
33. Edward
J. McCluskey, "Built‑in self‑test structures," IEEE
Design & Test, pp. 29‑36, Apr.
1985.
34. Andrew
Flint and William Blood, Jr., "MCM test strategy: board test in an IC
environment," ICEMM Proceedings '93, pp. 429‑434.
35. R.
W. Bassett, P. S. Gillis, and J. J. Shushereba, "Testing and diagnosis of
high‑density CMOS multichip modules," International Test Conference,
1991, pp. 530‑539.
36. David
Karpenske and Chris Tallot, "Testing and diagnosis of multichip
modules," Solid State Technology, June 91, pp. 24‑26.
37. Kenneth
Parker, Boundary‑Scan Handbook, Kluwer Academic Publishers, 1992.
38. John
K. Hagge and Russell J. Wagner, "High‑yield assembly of multichip
modules through known‑good IC's and effective test strategies,"
Proc. of IEEE, Vol. 80, No. 12, Dec 92, pp. 1234‑1245.
39. V.
D. Agrawal, C. R. Kime, and K. L. Saluja, "A tutorial on built-in self
test, part I: principles," IEEE Design and Test of Computers, March 1993.
40. Elwyn
R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, NY, 1968.
41. Edward
J. McCluskey, "Verification testing -- a pseudoexhaustive test
technique," IEEE Transactions on Computers, Vol. C-33, No. 6, June 1984,
pp. 541-546.
42. E.
J. McCluskey and S. Bozorgui-Nesbat, "Design for autonomous test,"
IEEE Transactions on Computers, Vol. C-30, pp. 866-875, Nov. 1981.
43. E.
J. McCluskey, "Built-in self-test techniques," IEEE Design and Test
of Computers, April 1985.
44. Clive
Shipley, "Flying probes," Advanced Packaging, Fall 1992, pp. 30-35.
45. Alcedo,
WWW site http://www.businessexchange.com/filesavce/beamtest.html.
46. A.
B. El-Kareh, Testing printed circuit boards, MCM's and FPD's with electron beams,
Alcedo, 485 Macara Ave. Suite 903, Sunnyvale CA.
47. R.
Doyle, U. Fayyad, D. Berleant, L. Charest, L. de Mello, H. Porta, and M.
Wiesmeyer, "Sensor selection in complex system monitoring using
information quantification and causal reasoning," in: Faltings &
Struss, eds., Recent Advances in Qualitative Physics, MIT Press, 1992, pp. 229‑244.
48. W.
Hamscher, "Modelling digital circuits for troubleshooting," Artificial
Intelligence, vol. 51 (1991), pp. 223-271.
49. T.
A. Williams, "Securing known good die," Advanced Packaging, Fall
1992, pp. 52-59.
50. S.
Kim and F. Lombardi, Modeling Intermediate Tests for Fault-Tolerant Multichip
Module Systems, IEEE Transactions on Components, Packaging, and Manufacturing
Technology - Part B, Vol. 18, no. 3, Aug. 1995, pp. 448-455.
51. D.
Carey, "Programmable multichip module technology," Hybrid Circuit
Technology (August 1991) 25‑29.
52. K.
Gilleo, "The SMT chip carrier: enabling technology for the MCM,''
Electronic Packaging & Production, September 1993, pp. 88‑89.
53. Hewlett
Packard, Semiconductor Systems Center US‑SSC VL‑MTC and VL-CM,
12/3/91.
54. M.
MacDougall, Simulating Computer Systems, MIT Press, 1987.
55. R.
Pearson and H. Malek, "Active silicon substrate multi‑chip module
packaging for spaceborne signal/data processors," Government Microcircuit
Applications Conference (GOMAC), 1992.
56. K.
K. Roy, "Multichip module deposited ‑‑‑ reliability
issues," Materials Developments in Microelectronic Packaging Conference
Proceedings (Montreal, August 19‑22, 1991), pp. 305‑309.
57. C.
Thibeault, Y. Savaria, and J. L., Houle, "Impact of reconfiguration logic
on the optimization of defect‑tolerant integrated circuits," Fault‑Tolerant
Computing: The Twentieth International Symposium, IEEE Computer Society Press,
1990, pp. 158‑165.
58. Haruhiko
Yamamoto, "Multichip module packaging for cryogenic computers," 1991
IEEE International Symposium on Circuits and Systems V. 4 (IEEE Service Center,
Piscataway, NJ cat. no. 91CH3006‑4), pp. 2296‑2299.
write 0 in all cells;
read all cells;
write 1 in all cells;
read all cells;
Figure 12.3.1: Zero‑One algorithm.
write 1 in all cells in group 1 and 0
in all cells in group 2;
read all cells;
write 0 in all cells in group 1
and 1 in all cells in group 2;
read all cells;
Figure 12.3.2: Checkerboard test algorithm.
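For concreteness, here is a hedged Python rendering of the two figure algorithms above, modeling the memory as a flat list; the checkerboard grouping by address parity is an assumption (real checkerboard tests group physically adjacent cells, which may not correspond to consecutive addresses).

def zero_one_test(mem):
    # Figure 12.3.1: write all 0s, read back; write all 1s, read back.
    for value in (0, 1):
        for addr in range(len(mem)):
            mem[addr] = value
        for addr in range(len(mem)):
            if mem[addr] != value:
                return False        # read mismatch: faulty cell
    return True

def checkerboard_test(mem):
    # Figure 12.3.2: alternate 1s and 0s between the two groups, then swap.
    for phase in (0, 1):
        for addr in range(len(mem)):
            mem[addr] = (addr + phase) % 2
        for addr in range(len(mem)):
            if mem[addr] != (addr + phase) % 2:
                return False
    return True

memory = [0] * 64
print(zero_one_test(memory), checkerboard_test(memory))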