ADVANCED ELECTRONIC PACKAGING: With Emphasis on Multi-Chip Modules
Editor: W. D. Brown

CHAPTER 12
TESTING AND QUALIFICATION
S. Kolluru and D. Berleant
12.1 INTRODUCTION
12.1.1 Testing of Highly Integrated Packages: General Considerations
12.1.2 Test Issues for Multichip Modules
12.1.3 Testability and Dependability Considerations and Their Interaction
12.1.4 Dependability in MCM-Based Systems From a Testing Perspective
12.1.4.1 Dependability vs. testing
12.1.5 Fault Tolerance
12.2 TESTING: GENERAL CONCEPTS
12.2.1 Fault Models
12.2.1.1 Stuck-at fault models
12.2.1.2 Bridging fault models
12.2.1.3 Open fault models
12.2.1.4 Delay fault models
12.2.2 Fault Collapsing
12.3 TESTING OF MEMORY CHIPS
12.3.1 The Zero-One Test
12.3.2 The Checkerboard Test
12.3.3 The Walking 1/0 Test
12.4 DESIGN FOR TESTABILITY
12.4.1 Scan Design
12.4.1.1 Multiplexed scan design
12.4.1.2 Level sensitive scan design
12.4.1.3 Random access scan
12.4.1.4 Partial scan
12.4.2 Built-In Self Test (BIST)
12.4.2.1 Pseudorandom test generation
12.4.2.2 Pseudoexhaustive testing
12.4.2.3 Output response analysis
12.4.2.4 Signature analysis
12.4.2.5 BIST test structures I: built-in logic block observation (BILBO)
12.4.2.6 Circular self test path (CSTP)
12.5 OTHER ASPECTS OF FUNCTIONAL TESTING
12.5.1 Approaches to Testing MCMs
12.5.2 Staged Testing
12.5.3 MCM Substrate Testing
12.5.3.1 Manufacturing defects in MCM substrates
12.5.3.2 Contact testing
12.5.3.2.1 Bed-of-nails testing
12.5.3.2.2 Single probe testing and two probe testing
12.5.3.3 Non-contact testing
12.5.3.3.1 Electron beam testing
12.5.3.4 Wear of MCM substrates
12.5.4 Die Testing
12.5.4.1 Chip carriers
12.5.5 Bond Testing
12.5.6 Testing Assembled MCMs
12.5.6.1 Test pads
12.5.6.1.1 Test pads and performance
12.5.6.1.2 Test pad number and accessibility
12.5.6.1.3 Test pad summary
12.6 CRITICAL ISSUE: BOUNDARY SCAN
12.6.1 The Boundary Scan Concept
12.6.2 Boundary Scan for MCMs
12.7 CRITICAL ISSUE: KNOWN GOOD DIE
12.8 SUMMARY
12.9 EXERCISES/PROBLEMS
12.10 REFERENCES
Key terms: integrated, package, dependability, testing, fault tolerance, fault model, stuck-at fault, bridging fault, open fault, delay fault, fault collapsing, zero-one test, checkerboard test, walking 1/0 test, design rule, design verification, scan design, design for test, scan path, level sensitive scan design, random access, partial scan, BIST, signature, syndrome testing, BILBO, MISR, SRSG, STUMPS, CSTP, functional testing, speed sorting, substrate, contact testing, bed-of-nails probe, non-contact testing, die, dice, parameter testing, chip carrier, bond, staged testing, qualification, test pad, boundary scan, known good die, KGD.
12.1 INTRODUCTION
The
main purpose of testing is to assess quality. This assessment can be with
respect to an entire system or device, or with respect to smaller or larger
parts of it, as when attempting to find the location of a fault. The assessment
can produce a quantitative value, as when chips are to be sorted into speed
categories based on the highest clock rate for which each will function
properly, or it can be (and often is) simply a qualitative determination of
whether something works or not. Assessing quality is obviously important in
applications for which avoiding failure is critical. Perhaps less obviously but no less
importantly, assessing quality can reduce costs. For example, it is costly to
sell bad units and have to refund or replace them, and it is costly to complete
the fabrication of a unit that could have been discarded due to defects early
in the fabrication process.
While
the concept of testing is useful in a wide range of applications, we will limit
our discussion to testing of microelectronic devices, and especially to testing
issues surrounding advanced electronic packages such as MCMs.
Testing of advanced electronic packages,
like testing of other complex electronic systems, begins with informal
critiques of a design concept, ends with verifying repairs to deployed units,
and covers numerous intermediate stages. Figure 12.1 outlines the testing
stages for MCMs, one type of advanced electronic package.
****Insert
Fig. 12.1****
12.1.1 Testing of Highly Integrated Packages: General
Considerations
Testing
and the related area of dependability are well‑known and important topics
in the computing fields. Issues of
dependability and testability become more acute for highly integrated packages
(such as MCMs) than for traditional printed circuit boards due to a general heuristic
("rule of thumb") principle:
Heuristic 1: As component density increases, the
individual components tend to become harder to test and fix.
This
heuristic holds because components get smaller and more concealed by other
components and packaging. Fortunately,
this is offset by another heuristic principle:
Heuristic 2: As component density increases,
elementary parts become cheaper and more efficiently used.
The
tendency toward more efficient use of elementary components holds because of
decreased need for components assigned to interfacing, broadly defined
to include packaging, bonds, connections, I/O drivers, etc.
We classify the elementary parts that Heuristic 2 refers to into four categories:
1) Electronic
parts, such as transistors, resistors, capacitors, etc.
2) Electrical
nets, which connect other parts together. They share important properties with the
other categories of elementary parts, such as finite dependability, non‑zero
cost, and performance of important duties.
3) Electrical
bonds, such as the short wires (wire bonds) that may be used to connect
an IC and its pins, or an IC die and an MCM substrate. Bonds also share important properties with
other kinds of parts like electrical nets, and even perform similar functions,
yet differ from nets from the standpoint of fabrication, testing, and
reliability.
4) Physical
parts, such as pins, physical package parts, etc.
Integration
increases component density, and at the same time reduces the number of
elementary parts. For example, integrating the functions that were previously
performed by two chips into one chip eliminates the need for some of the
interfacing electronics, which in turn reduces the number of required nets,
electronic parts, and bonds. Having one package instead of two also reduces the
number of physical package components like pins and ceramic or plastic parts.
Placing two chips on an MCM substrate (a lesser degree of integration than
having one new chip with the functionality of the previous two) also reduces
the total number of elementary parts such as pins, bonds, and plastic or
ceramic parts.
Heuristic
1 suggests that increased integration tends to lead to problems with
dependability and testability, and hence to higher costs. Counteracting this
tendency is Heuristic 2, which suggests that increased integration tends to
lead to improvements in dependability, testability, and cost.
As
the technology and experience in support of a given level of technology
improve, the balance shifts in favor of Heuristic 2, and the degree of
integration that is most cost effective tends to increase over time.
In
this chapter we emphasize multichip modules (MCMs) and other advanced packages,
and their testing and testability as compared with functionally equivalent
single chip integrated circuits (ICs) on circuit boards (CBs), which is the
traditional genre of electronic integration. The heuristic principles are
useful because they provide basic concepts that give broad guidance and
structure for understanding this area.
12.1.2 Test
Issues for Multichip Modules
Testing
is currently a serious bottleneck in MCM design, manufacture, and deployment.
Testing has always played a major role in electronic systems, yet there are unique characteristics of MCMs that lend a distinctive character to the testing problem (see Fig. 12.1.2).
****Insert
Fig. 12.1.2****
As
Fig. 12.1.2 indicates, nets are less accessible for probing on an MCM than we
might wish. This is because nets are small and pass through the substrate,
rather than large and over the surface as in the case of printed circuit
boards. Nevertheless, the accessibility of nets for testing in an MCM is
greater than the accessibility of nets in a single chip (or wafer), because a
test pad can be built for any given net in an MCM, providing an externally
accessible point for probing that net.
This is much more difficult with a chip, where as a rule a net can be
made accessible for probing only if an entire pin is connected to that
net. Yet probe points are important for
electrical testing. For example, during the MCM manufacturing process, it is
useful to perform tests on individual dice that have just been mounted (see the
section on staged testing) and those tests require access to the nets that
connect to them.
As device complexity increases, it is
difficult to perform a full functional test, as the number of test vectors
required becomes astronomical. This led to the need to increase the testability
of internal circuits. The boundary scan method, BIST (Built In Self Test),
adding test points on an MCM substrate exterior ("test pads") and
pinning out all internal I/O to test pads are some of the ways to increase
testability [15]. MCM testing is broadly divided into two categories: those
based on software simulations and those applied directly to the devices
themselves. Simulation based test methods help ensure the functionality and
specifications compliance of the design before manufacturing. Direct test
methods perform functional testing on the MCM during and after fabrication.
12.1.3 Testability and Dependability Considerations
and Their Interaction
The
connection between testability and dependability is that improving
dependability tends to reduce the effort and expense needed for testing, and
improving testability tends to reduce (but not eliminate) the importance of
dependability. Since testing of advanced electronic packages is often
challenging, dependability is an important consideration from a testing
perspective: we can control testing needs to some degree by controlling
dependability.
While
the output of a manufacturing process cannot in general be guaranteed to work,
different manufacturing lines can and do produce artifacts of widely varying
dependabilities. The dependability of an engineered artifact is determined by
both the quality of the manufacturing process, and by intrinsic properties of
the artifact being produced. An important intrinsic property influencing
dependability is the complexity of the artifact. High complexity tends to cause
lowered dependability, and vice versa. Since the complexity of advanced
electronic packages is so high, achieving adequate dependability is an
important problem. Therefore, let us
review dependability from a testing perspective. For further discussion, see the chapter on
dependability.
12.1.4 Dependability in MCM‑Based Systems from
a Testing Perspective
Like
all electronic systems, MCM‑based systems can be viewed at different
levels. At the lowest level is analog circuitry at the circuit level
(MacDougall 1987, p. 1 [54]). The
abstraction hierarchy proceeds upwards to the system level (see Fig.
12.1.4). Dependability problems can occur due to faults in the building blocks of
any level in the hierarchy, leading to errors and failures of the overall
system.
****Insert
Fig. 12.1.4****
A
dependable system requires dependability of the building blocks and their
interconnections in each level of the hierarchy. For the circuit, gate, and
register‑transfer levels, the issues for MCM‑based systems are
similar in many ways to those for other integrated circuit based electronics.
However, a significant difference exists: for MCMs, the least replaceable unit (LRU) is now an entire MCM, which is more complex, and therefore more expensive, than the least replaceable unit on a printed circuit board.
When the LRU is an MCM, dependability
and testing of its components prior to mounting them, and staged testing
and reliability at intermediate stages of the assembly process become
more important. Staged testing refers to
verifying that components and interactions among components meet standards at
intermediate stages during the assembly of an MCM or other system.
Reworkability refers to the ease with which a bad component, bad connection, or
other defect found during staged testing can be fixed or replaced during the
assembly process.
12.1.4.1
Dependability vs. testing
It
is impossible or nearly so to repair a faulty chip. This makes it more
important than it otherwise might be for chips to work dependably. Chip
dependability is even more important when the chip is mounted in an MCM because
not only are bad chips mounted in an MCM difficult and expensive to replace in
comparison to their replacement on ordinary circuit boards, but just one bad
chip of the several contained in the MCM will usually make the whole MCM bad,
and the probability that any one of the several chips is bad is much higher
than the probability that a given chip is bad (see equations 12.9-1 &
12.9-2). Compounding the problem is that chips are hard to test before they are
mounted in an MCM, a problem of sufficient magnitude as to make testing of
unmounted chips ("bare dice") a critical issue in making MCMs economically viable (the "known good die" problem; see section 12.7).
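To see why one bad chip among several dominates the risk, here is a minimal sketch (Python; the independence assumption and the 95% die yield are hypothetical illustration values, not figures from this chapter) of how per-die yield compounds across a module:

    # Minimal sketch: probability that an MCM is good when all of its n dice
    # must be good. Assumes die failures are independent; y is hypothetical.
    def mcm_yield(die_yield, n_dice):
        return die_yield ** n_dice

    y = 0.95  # hypothetical probability that a single bare die is good
    for n in (1, 4, 8):
        print(n, "dice:", round(mcm_yield(y, n), 3))
    # 1 die: 0.95; 4 dice: 0.815; 8 dice: 0.663 -- the chance that at least
    # one of several dice is bad far exceeds the chance that one given die is bad.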
MCM
dependability and testing needs are also impacted by fabrication, operating
environment, and maintainability factors. In particular, fabrication factors
include the dependabilities and testabilities of the component chips, the bonds
which provide electrical connections between chip and wiring, the substrate or
board and its wiring, and the bonds which provide electrical contact between
the MCM and its pins. Other fabrication
related factors include the interconnection technology (e.g. optical
vs. electrical), the type of bonding
(e.g. flip chip, TAB, or wire bonding), the type of substrate (e.g. MCM-D, or deposited (Roy 1991 [56]); MCM-D/C, or thin film copper polyimide deposited on ceramic; MCM-C, or cofired ceramic; and MCM-L, or laminate), and the type of package (e.g. hermetic vs. non-hermetic).
The
impact of operating environment is similar in many ways to its effects
on printed circuit board dependability, in that many of the same environmental
factors are issues in both cases. Such environmental factors include heat and
heat cycling, humidity, shock, vibration, and cosmic rays. However, specifics often differ, so that
existing knowledge of how environmental factors influence printed circuit board
dependability must be augmented with results applicable to MCMs.
Maintainability
factors include testability, reworkability, and
repairability.
Rework
is important when testing uncovers a defective component of a partially or
completely fabricated MCM. For MCMs, rework is a much more difficult and higher
technology process than for printed circuit boards. MCM rework ranges from
technically feasible for TAB (Tape Automated Bonding) and flip chip bonding
technologies, and for the thin film copper polyimide deposited on ceramic and
cofired ceramic packaging technologies, to technically more difficult (for wire
bonding) or currently uneconomical (for laminate substrates) (Trent 1992 [15]).
From
the standpoint of repairing failed systems, replacing a failed chip can be done
when it is mounted on a fully manufactured and deployed printed circuit board,
but is much more difficult with a fully manufactured and deployed MCM.
12.1.5 Fault Tolerance
Considerable
progress remains to be made in fault tolerant architectures for MCMs. This is
partly because MCM technology, in its present state, is often too expensive for
the substantial extra circuitry required for some forms of fault tolerance to
be financially feasible. Yet, other forms of fault tolerant design do not
require significantly more silicon real estate. The perspective that might
profitably be taken is one of optimizing the tradeoff between the expense of
adding in fault tolerance, and the expense of lowered dependabilities and
increased needs for testing of non-fault-tolerant architectures.
The
basic idea in fault-tolerant design is to use redundancy to counteract the
tendency of individual faults to cause improper functioning of the unit.
Previous work on fault tolerance in multichip modules is reported by Carey
(1991, p. 29 [51]), who discusses redundant interconnections, by Pearson and
Malek (1992: pp. 2‑3 [55]), who discuss redundancy within individual
chips on a specialized MCM design, and by Yamamoto (1991 [58]), who discusses
redundant refrigeration units for increased reliability of cryogenic MCMs. More
recent work suggests that the great increases in yield achievable by adding
redundant chips to an MCM design can be cost effective (Kim and Lombardi 1995
[50]). We describe these various approaches to MCM fault tolerant design next.
One
approach to maximizing the probability that a chip will work once mounted is to
include redundant circuitry on the chip that can take over the function of
faulty circuitry if and when other circuitry on the chip becomes faulty. Hence, dice (unpackaged chips) used for
placement in an MCM may have their own built‑in fault tolerance. This approach to fault tolerance is efficient
in terms of the increase in size it implies for the MCM, since an incremental
increase in the size of a die leads to a relatively small increase in the area
of the MCM substrate that is required to hold the slightly larger die. However, such redundant designs are highly
specific to the particular chip. In
summary, on-chip redundancy to enhance yield (Thibeault et al., 1990 [57]) is
particularly applicable when chips must be reliable but are hard to acquire in tested form, as is often true for bare dice intended for use in MCMs. An MCM design utilizing this approach is
proposed by Pearson and Malek (1992 [55]).
Fault
tolerance can also be built into the MCM substrate, in the form of redundant
interconnection paths. If the substrate is found to have an open path, for
example, there might be another functionally identical path that can be used
instead. Actual MCMs have been fabricated implementing this capability (Carey
1991 [51]). This approach need not lead to increased MCM area at all, since if
less than 100% of the substrate's interconnect capacity is needed for a non‑fault-tolerant
design, the remaining capacity could be used for holding redundant
interconnections. In the event that capacity exists for only some
interconnections to be duplicated, duplication of longer ones should be
preferred since the probability of a fault in a path increases with the length
of the path (Carey 1993 [50]).
This
redundant routing approach has been shown to enhance MCM yields significantly
(Carey 1991 [51]). Since the dependability of nets in the MCM substrate
decreases as net length increases, Carey (1991 [51]) duplicated long paths in
preference to short ones. Since designs will often have some unused routing capacity, the capacity that remains can be devoted to fault-tolerant redundancy.
Redundant
conductors have been used in MCMs not only for routing through the MCM
substrate, but also for wire bonds. Redundant wire bonds are described by Hagge
and Wagner (1992, pp. 1980-1981). A large substrate was designed as four quadrants, so that the yield for each relatively smaller quadrant was higher than for a single large substrate containing all four sections. However,
connecting the four quadrants must be done dependably in order for the
connected quadrants to compete with the large single substrate design.
Connections were done with double wire bonds for increased dependability
over single wire bonds. This redundant bond concept could be investigated for
use with die‑to‑substrate connections as well. A potential
disadvantage is that double bonds may require larger bond pads. However, bonds
would require little or no additional substrate area.
The more chips there are in an MCM
design, the more risk there is of lowered yield. However, a design with more
chips may actually have a higher yield than one with fewer, if the extra chips
are there for the express purpose of providing redundancy, the increment in
chip number is modest, and an appropriate staged testing technique is employed.
Indeed, Kim and Lombardi (1995 [50]) found that very high yields were possible,
and provide analytical results establishing this.
The
MCMs of the future may be liquid nitrogen cooled, for speed and eventually to
support superconductivity and its varied benefits. The refrigeration system on
which such MCMs depend must be reliable. This motivated a dual refrigeration
unit design in the MCM system built by Yamamoto (1991 [58]). If one
refrigerator breaks down, the low required operating temperatures can still be
maintained by the other refrigerator.
Finally,
MCM fabrication lines must provide reliable control of the manufacturing
equipment. An uncontrolled shutdown can have serious negative effects on the
facility. When computers are used for
control, redundancy should be built into the fabrication line control system to
prevent the destructive effects of unanticipated shutdowns due to computer
crashes, since such crashes will tend to occur occasionally due to software bugs,
as software of significant complexity is almost impossible to produce without
bugs.
12.2 TESTING:
GENERAL CONCEPTS
We
begin with some basic definitions:
Fault detection -- the action of determining that there is a defect present.

Fault location -- the action of determining where a defect is.

Fault detection coverage -- the proportion of defects that a fault detection method can discover.

Fault location coverage -- the proportion of faults which can be successfully located. Successful location does not necessarily mean finding the exact location. Usually it means finding a sub-unit (e.g. chip, board, or other component) which contains the fault and hence needs to be replaced.

Destructive testing -- any testing method which causes units to fail in order to measure how well they resist failure.

Non-destructive testing -- any method of testing which does not intend to cause units to fail.
Defects
may occur during the manufacture of any system. In IC manufacturing,
defects may occur during any of the
various physical, chemical and thermal processes involved. A defect may occur in the original silicon
wafer, by oxidation or diffusion, or during photolithography, metallization, or
packaging. Not all manufacturing defects
affect circuit operation, and it may not be feasible or even particularly
desirable to test for such faults. We
discuss only those defects which do.
12.2.1
Fault Models
Fault
analysis can be made independent of the technology by modeling physical
faults as logical faults whose effects approximate the effects of common actual
faults. Fault models are used to specify well defined representations of faulty
circuits that can then be simulated. Fault models can also be used to assist in
generating test patterns [1]. A good fault model has the following properties
[1]:
1. The level of abstraction of the fault
model should match the level of abstraction at which it is to be used (Figure
12.1.4 exemplifies different levels of abstraction).
2. The computational complexity (amount
of computation required to make deductions) of algorithms that use the fault
model should be low enough that results can be achieved in a reasonable amount
of time.
3. The fault model should accurately represent the great majority of actual faults.
Typical faults in VLSI circuits are
stuck‑at‑faults, opens, and shorts. The ability of a set of test
patterns to reveal faults in the circuit is measured by fault coverage. 100%
fault coverage in complex VLSI circuits is usually impractical, as this would
require astronomical amounts of testing. In practice, a tradeoff exists between
the fault coverage and the amount of testing effort expended.
Since for complex circuits it is not
reasonably possible to apply a large enough set of tests to achieve full fault
coverage, a subset of all possible tests must be chosen. A good choice of such
a subset will provide better fault coverage than a less good subset of the same
size. Various algorithms have been proposed for choosing good tests for various
kinds of ICs. The D‑algorithm, PODEM (Path Oriented DEcision Making)
algorithms [5], the FAN algorithm [17], the CONT algorithm [18], and the
subscripted D‑algorithm [19] are for combinational circuits. Test
generation for sequential circuits is more complex than for combinational
circuits because they contain memory elements, and also they need to be
initialized. Early algorithms for test generation for sequential circuits used
iterative combinational circuits to represent them, and employed modified
combinational test algorithms [20,21,22]. Test patterns for memory devices can
be generated by checkerboard algorithms, the Static Pattern Sensitive Fault
algorithm, etc. [23].
No test pattern generation algorithm can
ever fully solve the VLSI testing problem because the problem is NP-complete, and thus not solvable in reasonable time for large instances [24]. Partitioning
the circuit into modules and testing each module independently is one way to
reduce the problem size. Partitioning is not always a workable approach,
however. As an example, it is non-trivial to test a circuit consisting of a
cascade of two devices, from tests for the constituent devices. Another
approach is to include circuitry in the design whose purpose is to facilitate
testing of the device. Design for testability methods include BIST (Built In
Self Test) and boundary scan, both of which are described later. Now, we review
some well-known fault models.
12.2.1.1
The stuck‑at fault model
Suppose
any line in a circuit under test could always have the same logical value (0 or
1) due to a fault. This relatively simple fault model is termed the stuck-at fault model. A line that is stuck at a logical value of 1
because of a fault is called stuck‑at‑1, and a line that is
stuck at a logical value of 0 because of a fault is called stuck‑at‑0.
To make test generation computationally tractable, a simpler version of the
stuck-at fault model called the single stuck‑at fault model assumes
that only one line in a circuit is faulty. This is often a reasonable
assumption because a faulty circuit often does have just one fault. The single
stuck-at fault model is more computationally tractable because there are many
fewer faults to consider under this model than under a more complex model (the
multiple stuck-at fault model) which allows for more than one fault to be
present at once. Consider as an example a circuit with k lines. Each line can be either properly working, stuck-at 1, or stuck-at 0, so 3**k - 1 distinct fault conditions (plus one fault-free condition) must be considered. On the other hand, consider the same circuit under the single stuck-at model. Each of the k lines can be either working, stuck-at 1, or stuck-at 0, but if one of the lines is stuck, all the others are assumed to be working. This leads to only 2k distinct fault conditions, 2 (stuck-at 1 and stuck-at 0) for each line. Fortunately, single fault tests have reasonably high fault detection coverage of multiple faults as well [3].
The
basic concept in stuck-at fault testing is to set up the inputs to the circuit
so that the line under test should have the opposite logical value from the logical
value which it is hypothesized to be stuck at, and further, so that the effect
of that line being stuck at the wrong value is to cause an incorrect logical
value downstream at an output line so that faulty circuit operation can be
observed. The process of setting the inputs so that the line under test is set
to the opposite value is called sensitizing the fault. It might be
pointed out that if a stuck-at fault cannot lead to an observable error in the
output, then the circuit is tolerant of that fault and for many purposes the
fault does not matter.
As
an example, consider the circuit shown in Figure 12.2.1.1. A stuck-at fault on
input X2 cannot be detected at the output, as you can see by tracing logical
values through the circuit. For this circuit, the output is determined by input
X1. On the other hand, a stuck-at fault on line X1 can be detected at the
output.
****Insert
Fig. 12.2.1.1****
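This kind of reasoning can be checked mechanically. The sketch below (Python; the two-input circuit is hypothetical, constructed only so that X2 is logically redundant, in the spirit of Figure 12.2.1.1) tries every input vector against every single stuck-at fault on the inputs:

    from itertools import product

    # Hypothetical two-input circuit: output = X1 AND (X2 OR NOT X2).
    # The X2 term is a tautology, so X2 cannot affect the output.
    def circuit(x1, x2):
        return x1 & (x2 | (1 - x2))

    def detectable(line, stuck):
        """True if some input vector exposes the given single stuck-at fault."""
        for vector in product((0, 1), repeat=2):
            faulty = list(vector)
            faulty[line] = stuck            # model the fault by forcing the line
            if circuit(*vector) != circuit(*faulty):
                return True                 # fault observable at the output
        return False

    for line, name in ((0, "X1"), (1, "X2")):
        for stuck in (0, 1):
            print(name, "stuck-at-" + str(stuck), "detectable:",
                  detectable(line, stuck))
    # X1 faults are detectable; X2 faults are not, so the circuit tolerates them.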
12.2.1.2
Bridging fault models
Short circuits in VLSI are called bridging faults because they are usually caused by unwanted conducting "bridges" between physically adjacent lines. Because the small size of modern circuit components places lines very close together, bridging faults are common. Bridging fault models typically assume the
effect of a short is to create a logical AND or a logical OR between the lines
that are shorted together. An AND would result when circuit characteristics
require both inputs to be high for the shorted lines to be forced high. An OR
would result when circuit characteristics allow the lines to be forced high if
the input to either line is high. Usually the resistance of a bridge is assumed
zero, although this assumption may not actually hold in practice [4]. Bridging
fault modeling is more complicated when the resistance of the short is to be
accounted for. High resistance shorts may result in degraded noise resistance
or other degradations in circuit performance without affecting logical levels
[4]. Sometimes bridging faults can
convert a combinational circuit into a sequential one, leading to oscillations
or other sequential behaviors. Stuck-at testing covers many but not all
bridging faults [7].
To
illustrate a case where all stuck-at faults can be detected by a set of test
vectors but a bridging fault would be missed, consider the circuit of Figure
12.2.1.2. The test vectors 0110, 1001, 0111, and 1110 applied to inputs A, B,
C, and D (a test vector describes the value applied to each input) will detect
all stuck-at faults. However, since all those test vectors apply the same value
to inputs B and C, a bridging fault between B and C will not be detected.
****Insert
Fig. 12.2.1.2****
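The escape can be seen from the vectors alone. This sketch (Python; it does not need the internals of the circuit in Figure 12.2.1.2, only the four vectors from the text) shows that B and C always receive equal values, so neither an AND-type nor an OR-type bridge between them is ever exercised:

    # The four test vectors from the text, applied to inputs A, B, C, D.
    vectors = ["0110", "1001", "0111", "1110"]

    for v in vectors:
        a, b, c, d = (int(ch) for ch in v)
        bridged_and = (a, b & c, b & c, d)  # AND-type bridge between B and C
        bridged_or = (a, b | c, b | c, d)   # OR-type bridge between B and C
        print(v, "B == C:", b == c,
              "bridge changes nothing:", bridged_and == bridged_or == (a, b, c, d))
    # Every vector drives B and C with the same value, so the bridged circuit
    # sees exactly the same inputs as the good circuit and the fault escapes.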
12.2.1.3
Open fault models
The
major VLSI defect types are shorts and opens. Usually, opens are assumed to
have infinite resistance. Leakage current can be modeled with a resistance
[4]. Opens can be modeled with a
resistance and a capacitance connected in parallel.
In NMOS circuits, open faults may be modeled as stuck-at faults. Opens in CMOS circuits cannot be, however, and in fact such faults often give the circuit sequential behavior [4].
12.2.1.4
Delay fault models
A delay
fault causes signals to propagate more slowly than they should. Detection
may occur when this delay is great enough that signal propagation cannot keep
up with the clock rate [9]. Two fault models that account for delay faults are the single‑gate delay
fault model and the path-oriented delay fault model.
Single‑gate
delay fault models attempt to account for the effects of individual slow gates.
Path-oriented delay fault models attempt
to account for the cumulative delay in a path through a circuit. Gate-level
models often work better for large circuits because the large number of paths
that can be present can make path-oriented approaches impractical [10].
12.2.2 Fault Collapsing
Recall that a circuit with P lines can have as many as 3**P - 1 possible multiple stuck-at faults alone. It
is difficult and time consuming to test for a large number of possible faults
and, in practical terms impossible for a circuit of significant size. By
"collapsing" equivalent faults into a single fault to test for, the
total number of faults to test for can be decreased. Faults
that are equivalent can be collapsed [5]. Faults are equivalent if they have the same effects on the outputs, and therefore cannot be distinguished from each other by
examining the outputs. Therefore, a test vector that detects some fault will
also detect any equivalent fault. As a simple example, consider a NAND gate
with inputs A and B and output Z. Under the stuck-at fault model, each of A, B,
and Z may be working, stuck-at 0, or stuck-at 1, implying 3**3 - 1 = 27 - 1 = 26 possible multiple stuck-at faults (considering a single stuck-at fault to be one variety of multiple stuck-at fault). Note that if either input is stuck-at 0, the output Z will have the value 1. Therefore, input A stuck-at-0, input B
stuck-at-0, and output Z stuck-at-1 are equivalent, in addition to some
multiple stuck-at faults such as A stuck-at 0 and B stuck-at 0, etc.
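The collapse can be computed by brute force. This sketch (Python) injects each single stuck-at fault into a 2-input NAND gate and groups faults whose faulty truth tables coincide:

    from itertools import product
    from collections import defaultdict

    def nand(a, b):
        return 1 - (a & b)

    def faulty_table(site, stuck):
        """Truth table of a 2-input NAND with one single stuck-at fault."""
        table = []
        for a, b in product((0, 1), repeat=2):
            if site == "A":
                a = stuck
            if site == "B":
                b = stuck
            z = nand(a, b)
            if site == "Z":
                z = stuck
            table.append(z)
        return tuple(table)

    classes = defaultdict(list)
    for site in ("A", "B", "Z"):
        for stuck in (0, 1):
            classes[faulty_table(site, stuck)].append(site + "/sa" + str(stuck))

    for table, faults in classes.items():
        print(faults, "->", table)
    # A/sa0, B/sa0, and Z/sa1 all produce the constant-1 table, so the six
    # single stuck-at faults collapse into four equivalence classes.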
No
fault detection coverage is lost by collapsing equivalent faults (assuming we
only have access to outputs). However, we might want to collapse some faults
that are not equivalent, saving on testing time at the expense of some loss in
coverage. For example, let us postulate two faults f1 and f2. If any test for f1 will also detect f2, but a test for f2 does not necessarily detect f1, then f1 dominates f2. (Occasionally the term is used oppositely, so that f2 would be said to dominate f1 [5].) As an
example, consider the NAND gate with input A stuck‑at‑1. The fault
is detectable at the output only by setting A to 0 and B to 1. The output Z
should be 1, but the fault makes it 0.
Note that the same test detects Z stuck-at 0. Another test for Z stuck-at 0 would be
setting B to 0, but that test will not detect A stuck-at-1. Therefore a stuck‑at
1 fault on A dominates a stuck-at-0 fault on Z because every test (of which there is only one) that detects a stuck-at 1 fault on A also detects a stuck-at 0 fault on Z.
Fault
equivalence and dominance both guide the "collapsing" of various
different faults into one fault, in that testing for that one fault also
detects the others. Fault collapsing is a useful idea because it reduces the
total number of faults that must be explicitly tested for to obtain a given fault
coverage.
12.3
TESTING OF MEMORY CHIPS
Testing of memory chips is a well defined
testing task that in some respects serves to exemplify testing of conventional
chips. Here are some kinds of faults that can cause failure in the storage
cells (faults could also appear in other parts of the memory, such as the
address decoder).
- Stuck-at fault (SAF)
- Transition fault (TF)
- Coupling fault (CF)
- Neighborhood pattern sensitive fault (NPSF)
In
a stuck‑at fault, the logic value of a cell is forced by a physical
defect to always be zero (stuck-at-0) or one (stuck-at-1). A transition fault resembles a stuck-at fault: it is present if a memory cell (or a line) will not
change value either from 0 to 1 or from 1 to 0. If it won't transition from 0
to 1, it is called an up transition fault, and if it won't transition
from 1 to 0 it is called a down transition fault. If a cell is in the
state from which it will not transition after power is applied, it acts like a
stuck-at fault. Otherwise, it can have one transition after which it remains
stuck. A coupling fault is present if the state of one cell affects the
state of another cell. If k cells together can affect the state of some other
cell, the coupling fault is called a k‑coupling fault. One kind of
k-coupling fault is the neighborhood pattern sensitive fault. If a
cell's state is influenced by any particular configuration of values or changes
to values in neighboring cells, a neighborhood pattern sensitive fault is
present.
Here are some basic tests that have been
used to detect memory faults.
12.3.1
The Zero‑One Test
This
test consists of writing 0s and 1s to the memory. The algorithm is shown below
(Figure 12.3.1). The algorithm is easy to implement, but has low fault
coverage. However, this test will detect stuck-at faults if the
address decoder is working properly.
****Insert Figure 12.3.1 here.****
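Here is a minimal sketch consistent with that description (Python; the memory is modeled as a plain list, so the test passes unless a fault is injected into the model; a real test would drive the device under test instead):

    def zero_one_test(mem):
        """Write all 0s and verify, then write all 1s and verify."""
        for background in (0, 1):
            for addr in range(len(mem)):
                mem[addr] = background
            for addr in range(len(mem)):
                if mem[addr] != background:
                    print("fault at address", addr)
                    return False
        return True

    memory = [0] * 16          # toy memory model
    print(zero_one_test(memory))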
12.3.2
The Checkerboard Test
In
the checkerboard test the cells in memory are written with alternating values,
so that each cell is surrounded on four sides with cells whose value is
different. The algorithm for the checkerboard test is shown in Figure 12.3.2.
The checkerboard test detects stuck-at faults as well as such coupling faults
as shorts between adjacent cells if the address decoder is working properly.
****Insert Figure 12.3.2 here****
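A sketch of the checkerboard test follows (Python; it assumes the logical row/column layout matches the physical cell adjacency, which real memories often scramble, so a production test would use the device's topological map):

    def checkerboard_test(mem, rows, cols):
        """Write a checkerboard and verify it, then repeat with the inverse."""
        for phase in (0, 1):
            for r in range(rows):
                for c in range(cols):
                    mem[r * cols + c] = (r + c + phase) % 2
            for r in range(rows):
                for c in range(cols):
                    if mem[r * cols + c] != (r + c + phase) % 2:
                        print("fault at row", r, "column", c)
                        return False
        return True

    print(checkerboard_test([0] * 64, rows=8, cols=8))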
12.3.3
The Walking 1/0 Test
In
the walking I/O test, the memory is written with all 0s (or 1s) except for a
"base" cell, which contains the opposite logic value. This base cell
is "walked" or stepped through the memory. All cells are read for
each step. The GALPAT (GALloping PATtern) test is like the Walking 1/0 test
except that, in GALPAT, after each read the base cell is also read. Since the
base cell is also read, address faults and coupling faults can be located. This
test is done first with a background of 0s to the base cell value of 1, and
then with a background of 1s to a base cell value of 0.
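A sketch of the walking 1/0 test (Python; the GALPAT variant would additionally re-read the base cell after each read of another cell, which is what lets it locate address and coupling faults):

    def walking_test(mem, base_value):
        """Walk one base_value cell through a background of its complement."""
        background = 1 - base_value
        for base in range(len(mem)):
            for addr in range(len(mem)):   # set up background plus base cell
                mem[addr] = base_value if addr == base else background
            for addr in range(len(mem)):   # read every cell at each step
                expected = base_value if addr == base else background
                if mem[addr] != expected:
                    print("fault: base cell", base, "read at address", addr)
                    return False
        return True

    memory = [0] * 16
    print(walking_test(memory, 1) and walking_test(memory, 0))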
12.4 DESIGN
FOR TESTABILITY
Design
for Testability (DFT) attempts to facilitate testing of circuits by
incorporating features in the design for the purpose of making verification of
the circuit easier. Generally, the strategy is to make points in the circuit
controllable and observable. Here is a more specific albeit still informal
characterization of testability:
"A circuit is `testable' if a
set of test patterns can be generated,
evaluated and applied in such a way
as to satisfy pre‑defined
levels of performance, defined in
terms of fault-detection, fault-location,
and test application criteria,
within a pre‑defined cost budget and time scale" [28].
Factors
that affect testability include difficulty of test generation, difficulty of
fault coverage estimation, number of test vectors required, time needed to
apply a particular test, and the cost of test equipment. The more complex the circuit, the lower its testability tends to be because, as we saw earlier, observability and controllability decrease. Observability is how easy it is to determine the state of a test point in question by observing other locations (usually outputs); controllability is how easy it is to cause a test point in question to take the value 0 or 1 by controlling circuit inputs [29]. There are various methods of
design for testability. We review some of them next.
12.4.1 Scan design
Scan
design uses extra shift registers in the circuit to shift in test input data to
points within the circuit and to shift out values inside the circuit. The shift
registers provide access to internal points in a circuit. Test vectors may be
applied using those points as inputs and responses to tests may be taken using
those points as outputs.
The shift register may consist of D flip
flops (i.e. latches) that are used as storage elements in the circuit, which
are connected using extra hardware into a "scan chain" so that in
test mode, test vectors can be shifted in serially, and so that the internal
state of the circuit, once latched into the latches in parallel, can be
serially shifted back out so that the state can be observed from outside. See
Figure 12.4.1.
****Insert Fig. 12.4.1
****
Thus,
1. Latches themselves can be tested.
2. Outputs of the latches can be set
independently of their inputs.
3. Inputs to the latches can be observed.
12.4.1.1
Scan Path and Multiplexed Scan Design technique
A
multiplexer is connected to each latch, and an extra control line, the scan
select, is used to set the circuit for scan (test) mode. When the scan select
line is off, the multiplexers connect the lines from the combinational logic to
the latches so that the circuit works normally. When the scan select line is
on, the latches are connected together to form a serial in, serial out shift
register. The test vector can now be input by serially shifting in the test
vector. The test output can be output by shifting it serially out the scan
output, that is, the last latch's output.
Here
is a summary of the method:
1.
Put the circuit into scan mode by inputting a 1 on the scan select line.
2.
Test the scan circuitry itself by shifting in a vector of 1s and then a vector
of 0s, to check that none of the latches have stuck-at faults.
3.
Shift a test vector in.
4.
Put the circuit in normal mode by inputting a 0 on the scan select line. Apply the primary inputs needed for that test vector, and check the outputs.
5.
Clock the latches so that they capture their inputs, which are the circuit's
internal responses to the test.
6.
Put the circuit into scan mode and shift out the captured responses. For
efficiency, clock in the next test vector as the responses to the previous one
are clocked out. Check the responses for correctness.
7.
Apply more test sequences by looping back to step 4. (The shift-in, capture, shift-out cycle is sketched in code below.)
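The following sketch (Python; the three-latch chain and its combinational next-state functions are hypothetical, chosen only for illustration) mimics the cycle just described:

    def shift_in(chain, vector):
        """Scan mode: shift a test vector serially into the latch chain."""
        for bit in vector:
            chain = [bit] + chain[:-1]    # each latch loads its predecessor
        return chain

    def capture(chain):
        """Normal mode, one clock: latches capture combinational responses."""
        s0, s1, s2 = chain                # hypothetical combinational block
        return [s0 & s1, s1 ^ s2, s2 | s0]

    state = [0, 0, 0]
    state = shift_in(state, [1, 1, 0])    # step 3: load the test vector
    state = capture(state)                # steps 4-5: apply and capture
    print("shifted-out response:", state) # step 6: observe serially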
Scan design has some disadvantages. These
include:
1. Additional circuitry is needed for the
scan latches and multiplexors.
2. Extra pins are needed for test vector
input and output, and for setting the circuit to scan mode or normal mode.
3. The circuit operation is slower than it
would otherwise be, because of the extra logic (e.g. multiplexors) which
signals must traverse.
12.4.1.2 Level sensitive scan design (LSSD)
In
level sensitive scan design, state changes in the circuit are caused by clock
values being high, rather than transitions in clock values (edges). To reduce
the possibility that analog properties such as rise and fall times and
propagation delays can lead to races or hazards, level sensitivity can be a
useful design criterion. Another positive characteristic of level sensitive
design is that steady state response does not depend on the order of changes to input values [28]. The basic storage
element used in circuits that adhere to LSSD is as shown in Fig. 12.4.1.2-1.
****Insert
Fig. 12.4.1.2-1****
The
clock values (note there are three of them) determine whether the storage
element is used as a normal circuit component or for test purposes.
To form a scan chain, the double latch
storage elements are connected into a chain configuration whereby the L2 output
of one element feeds into the L1 input of the next element. This chain
configuration is activated only during test mode and allows clocking in a
series of values to set the values of the elements in the chain. Figure
12.4.1.2-2 illustrates.
****Insert
Fig. 12.4.1.2-2****
For
proper operation of a level sensitive circuit, certain constraints must be
placed on the clocks [30], including:
(1)
Two storage elements may be adjacent in the chain only if their scan related
clocks (Scan Clock and Clock 2 in Figure 12.4.1.2-2) are different, to avoid
race conditions.
(2)
The output of a storage element may enable a clock signal only if the clock
driving that element is not derived from the clock signal it is activating
[30].
12.4.1.3 Random access scan
In
random access scan, storage elements in the circuit can be addressed
individually for reading and writing [28]. This is in contrast to other scan
design approaches such as level sensitive scan design and scan path design,
described earlier, in which the test values of the storage elements must be read
in sequentially and iteratively passed down the shift register formed by the
chain of storage elements until the register is full. In random access scan
design, storage elements are augmented with addressing, scan mode read, and
scan mode write capability (see Fig. 12.4.1.3).
****Insert
Fig. 12.4.1.3****
An
address decoder selects a storage element which is then readable or writeable
via the scan input and output lines. A disadvantage of random access scan
design is the
extra
logic required to implement the random access scan capabilities. Another
disadvantage is the need for additional primary input lines, for example the
address lines for choosing which storage element to access [30].
12.4.1.4
Partial Scan
Fully implemented scan design requires substantial
extra chip area for additional circuitry, about 30% [31]. If, however, only
some of the storage elements in the circuit are given scan capability, the
extra area overhead can be reduced somewhat. Where full scan design involves
connecting all latches into a shift register, called the "scan
chain," in partial scan some are
excluded from the chain [31]. Partial scan test vectors are shorter than those
that would be needed for a full scan design, since there are fewer latches to
be manipulated. Test sequences tend to be shorter as well, since the test vectors are shorter and fewer of them are needed. Since in partial scan, some
storage elements in the circuit cannot be read/written via the scan circuitry,
and since the importance of test access to a latch depends on its role in the
circuit, a particular partial scan design must make an intelligent choice of
which storage elements should be in the scan path.
Partial scan compared to full scan leads
to reduced area and faster circuit operation. The speed-up in circuit operation
is because those storage elements that are in critical paths may be left out of
the scan path so as not to slow down those paths.
12.4.2
Built-In Self Test
BIST (Built‑In Self Test) is a class
of design-for-testability methods involving hardware support within the circuit
for generating tests, analyzing test results, and controlling test application
for that circuit [39]. The purpose is to facilitate testing and maintenance. By
building test capability into the hardware, the speed and efficiency of testing
can be enhanced. BIST techniques have
costs as well as benefits, however. In particular, the extra circuitry for
implementing the BIST capability increases the chip area needed, leading to
decreased yield and decreased reliability of the resulting chips. On the other
hand, BIST can reduce testing related costs.
Test vectors may be either stored in
read-only memory (ROM) or generated as needed. Storing them in ROM requires large amounts of ROM and may be undesirable for that reason; however, it does potentially provide high fault coverage and advantages in special cases [39].

We consider two illustrative ways to generate test vectors. Pseudorandom testing picks test vectors without an obvious pattern. Exhaustive testing leads to better fault coverage but is more time consuming.
12.4.2.1
Pseudorandom test generation
A linear feedback shift register (LFSR)
can generate apparently random test vectors. An LFSR is typically made of D
flip‑flops and XOR gates. Each flip-flop feeds into either the next
flip-flop, an XOR gate, or both, and each flip-flop takes as input the output of either the previous flip-flop or an XOR gate. The overall form of
the circuit is a ring of flip-flops and XOR gates with some connections into
the XOR gates from across the ring because XOR gates have more than one input.
If there is no external input to the circuit,
it is called an autonomous linear feedback shift register (ALFSR) and the output is simply the values of the
flip-flops (see end of chapter exercise 9). The pattern generated by an LFSR is
determined by the mathematics of LFSR theory (see [39] for a brief description
and [40] for a detailed treatment), and LFSRs can generate test vectors that
are pseudorandom (or exhaustive).
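A minimal ALFSR sketch follows (Python; the 4-bit register with feedback from bits 3 and 2 is a standard maximal-length example, not a structure taken from this chapter):

    def alfsr(seed=0b0001, width=4):
        """Autonomous LFSR; taps at bits 3 and 2 give a maximal-length
        sequence, cycling through all 2**width - 1 nonzero states."""
        state = seed
        for _ in range(2 ** width - 1):
            yield state
            feedback = ((state >> 3) ^ (state >> 2)) & 1
            state = ((state << 1) | feedback) & (2 ** width - 1)

    for pattern in alfsr():
        print(format(pattern, "04b"))   # pseudorandom-looking test vectors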
12.4.2.2
Pseudoexhaustive testing
Testing exhaustively requires, given a
combinational circuit with n inputs, providing 2**n test vectors of n bits each
(in other words, every possible input combination). Pseudoexhaustive testing
means testing comprehensively but taking advantage of circuit properties to do
this with less than 2**n input vectors.
If the circuit is such that no output is
affected by all n inputs, it is termed a partial dependent circuit and any
given output line can be comprehensively tested with less than 2**n input
vectors. The exact number depends on how many inputs affect that output line.
If k inputs affect it, then 2**k vectors will suffice, comprising every
possible combination of values for the inputs that affect that output, with the
values for the other input lines being irrelevant (to testing that output
line). Each output line may be tested in this way. Thus, if the circuit has 20
inputs and 20 outputs, but each output relies on exactly 10 of the inputs,
2**10 tests for each of the 20 outputs implies that 20 x 2**10 or approximately
20,000 tests can be comprehensive, compared to 2**20 or approximately 1,000,000
tests for an exhaustive testing sequence which would be no more comprehensive.
Other pseudoexhaustive techniques can
improve on this even more. For example, if there are two input lines which
never affect the same output line, they can always be given the same value with
no decrement in the comprehensiveness of the test sequence. More generally,
test vectors for testing one output line can also be used for other output
lines, reducing the number of additional test vectors that must be generated
for those other output lines. An approach to doing that is described, for
example, in [41].
As a concrete example, Figure 12.4.2.2
illustrates a partial dependent circuit.
****Insert Figure 12.4.2.2
here****
The circuit shown has an output f which is determined by inputs w and x, and an output g which is determined by inputs x and y. Neither output is affected by both w and y, so nothing is lost by connecting w and y together so that they both always have the same value. With that done, only four vectors, instead of 2**3=8, provide an exhaustive test sequence.
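The compressed vector set can be enumerated directly (Python, following the w/x/y example above):

    from itertools import product

    # f depends on (w, x); g depends on (x, y). Tying w and y together leaves
    # two independent inputs, so 2**2 = 4 vectors suffice instead of 2**3 = 8.
    vectors = [(tied, x, tied) for tied, x in product((0, 1), repeat=2)]
    print(vectors)  # [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

    # Check: each output's input cone still sees every combination.
    print(sorted(set((w, x) for w, x, y in vectors)))  # all 4 (w, x) pairs
    print(sorted(set((x, y) for w, x, y in vectors)))  # all 4 (x, y) pairs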
When a circuit is not partial dependent
(that is, some output depends on all inputs), the circuit is termed complete
dependent. In this case, pseudoexhaustive testing may be done by a technique
involving partitioning the circuit [42]. This method is more complex.
12.4.2.3
Output response analysis
Consider a circuit with one output line.
Checking for faults means checking the response sequence of the circuit to a
sequence of tests. One possibility is to have a fault dictionary consisting of
the sequence of correct outputs to the tests. However, this is impractical for
a complex circuit due to the large amount of data that would need to be stored.
One way to address this problem is to compact the response sequence so
that it takes less memory to store. The
compacted form of an output response pattern is called its signature. This
concept is known as response compression [43]. Since there are fewer bits in
the signature than in the actual output sequence, there are fewer possible signatures
than there are actual potential outputs. This results in a problem known as
aliasing. In aliasing, the signature of a faulty circuit is the same as the
signature of the correct circuit. The faulty output signature is then called an
alias. Aliasing leads to a loss of fault coverage. One approach to using
compaction is "signature analysis," described next.
12.4.2.4
Signature analysis
Signature analysis has been a commonly
used compaction technique in BIST. An LFSR (Linear Feedback Shift Register) may be used to read
in an output response and output its
signature, a shorter pattern determined by the test output response pattern.
Since the signature is determined by the
test output pattern, if a fault results in a different test output pattern,
then the fault is likely (but not certain) to have a different signature. If a
fault has a different test output pattern but its signature is the same as the
proper test output, aliasing is said to have occurred. Aliasing reduces test coverage. Figure 12.4.2.4-1
depicts an LFSR with an input for the test response pattern and contents which
form the signature.
Many circuits have multiple output lines,
and for these the way an LFSR is used for signature generation must be changed.
One way is to feed the different output lines into different points in the LFSR
simultaneously (Figure 12.4.2.4-2). An alternative approach uses a multiplexer
to feed the value of each output line in turn into a one-input LFSR, a process
which must be followed for each test input vector.
****Insert Figure 12.4.2.4-1 here****
****Insert Figure 12.4.2.4-2 here****
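A serial single-input version is easy to sketch (Python; it reuses the hypothetical 4-bit taps from the pseudorandom example above, so both the taps and the response streams are illustration values):

    def signature(response_bits, width=4):
        """Compact a serial test response into a width-bit signature by
        XORing each bit into the LFSR feedback (taps assume width=4)."""
        state = 0
        for bit in response_bits:
            feedback = (((state >> 3) ^ (state >> 2)) & 1) ^ bit
            state = ((state << 1) | feedback) & (2 ** width - 1)
        return state

    good = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical fault-free response
    bad = [1, 0, 1, 1, 0, 1, 1, 0]    # same stream with one bit flipped
    print(format(signature(good), "04b"), format(signature(bad), "04b"))
    # The two signatures differ here, but with only 2**width possible
    # signatures, some faulty streams inevitably alias to the good one.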
12.4.2.5
BIST test structures I: built‑in logic block observation (BILBO)
BILBO has features of scan path, level
sensitive scan design, and signature analysis. A BILBO register containing 3 D
flip-flops (latches, labeled DFF), one for each input, appears in Figure
12.4.2.5-1. Z1, Z2, and Z3 are the parallel inputs to the flip flops and Q1, Q2
and Q3 are the parallel outputs from the flip flops. Control is provided
through lines B1 and B2. If B1=1 and B2=1 the BILBO register operates in
function (non-test) mode. If B1=0 and B2=0 the BILBO register operates as a
linear shift register and a sequence of bits can be shifted in from Sin to serve
for example as a scan string. If B1=0 and B2=1 the BILBO register is in reset mode and its flip flops are reset to 0. If B1=1 and B2=0 the BILBO register is in signature analysis mode and the MUX is set to select Sout as the input to
Sin, forming a linear feedback shift register (LFSR) with external inputs Z1,
Z2, and Z3. See Figure 12.4.2.5-2 and end of chapter problem 10.
****Insert Figure 12.4.2.5-1 here.****
****Insert Figure 12.4.2.5-2 here.****
The BILBO approach relies on the
suitability of pseudorandom inputs for testing combinational logic. Therefore,
when the BILBO control inputs cause it to operate in signature analysis mode
(that is, to be an LFSR), the pseudorandom patterns it produces can be used as
test vectors. For example, Figure 12.4.2.5-3 shows a circuit with two
combinational blocks, testable with two BILBO registers.
****Insert figure 12.4.2.5-3
here****
In figure 12.4.2.5-3, the first BILBO is
set via input vector pn to generate pseudorandom test vectors for the
combinational block it feeds into. The second BILBO is set via input vector sa
for signature analysis purposes. The first BILBO is therefore used to apply a
sequence of test patterns, after which the second BILBO is used to store the resulting
outputs of the combinational block, followed by scanning out those outputs (the
signature). When combinational block 1 has been tested, block 2 can be
tested similarly by simply reversing the roles of the BILBO registers.
BILBO has an interesting advantage over
many other types of scan discipline. Using BILBO, if N test vectors are applied
before scanning out the results, the number of scan outs for those N vectors is
1, compared with the N scan outs required by other scan disciplines. However,
BILBO requires more extra circuitry than LSSD, as well as leading to relatively
more signal delays because of the gates connected to the flip flop inputs [30].
12.4.2.6
Circular self test path (CSTP)
CSTP [25] connects some (or all) storage
cells in the circuit together, forming one large circular register. A cell of the circular register may contain
one D flip flop or two arranged as a master and slave. The cells form a feedback shift register,
hence the use of the term "circular." The circular path is augmented with a gate at the input of each cell that, during test mode, XORs the cell's functional input (the sole input during non-test operation) with the output of the preceding cell in the circular path. This causes the outputs of the
flip flops during test mode to change in a difficult to predict way, so that
they can be used as test inputs to the circuit. When operated in the normal
mode, the cells feed inputs through to the combinational blocks. When operated
in the test mode, the cells feed test values into the combinational blocks.
Once the test pattern has propagated through the circuitry, the response is fed
into the circular register which compacts the response into a signature. The test
response is combined with its present state via the XOR gates to produce its
next state and next output. The circular path can now apply the next test
vector which is its current contents. After repeating this some number of times
the register contents can be checked for correctness. Correctness might be
determined by matching against the contents for a known working circuit, for
example. The creators of CSTP cite as significant advantages of CSTP that:
1)
the complexity of the on-chip mode control circuitry is minimized by the fact
that a full test can be done in one test session.
2)
The hardware overhead is low compared to other multifunctional register test
methods like the BILBO technique, because the cells are simpler as they need
only be able to load data and compact data. As a caveat, this assumes the
circuit can be reset into a known state from which to begin testing.
The test pattern generated by the
circular path is neither pseudorandom nor purely random, but instead is
determined by the logic of the circuit. The authors defend this choice by analyzing its effect in comparison to exhaustive testing (that is, applying all possible test input vectors),
concluding that with a testing time of 4X what would be needed for exhaustive
testing, 98% of the possible test vectors will be applied, and with a testing
time of 8X, 99.9+% of the possible test vectors will be applied. The problem of
test pattern repetition must be dealt with because if it occurs then the entire
preceding sequence of test vectors will also then repeat. Then longer test
times will result in no improvement in coverage. The authors of this approach
found that this is unlikely to occur, can be identified if it does occur, and
can be avoided by changing the initial state of the circular register.
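The circular feedback structure is easy to simulate. The following Python sketch assumes a 6-cell circular register and a stand-in combinational function of our own invention; only the XOR compaction and circular feedback follow the CSTP idea.

# Minimal sketch of a circular self-test path (CSTP) update.

def circuit_logic(state):
    """Stand-in for the combinational blocks fed by the cells: each response
    bit depends on two neighboring cell outputs (hypothetical logic)."""
    n = len(state)
    return [state[i] & state[(i + 1) % n] for i in range(n)]

def cstp_step(state):
    """Next state: cell i XORs its functional input (circuit response bit i)
    with the output of the preceding cell in the circular path."""
    response = circuit_logic(state)
    n = len(state)
    return [response[i] ^ state[(i - 1) % n] for i in range(n)]

state = [1, 0, 0, 1, 0, 1]           # initial (reset) state
for _ in range(20):                  # each step applies one test vector
    state = cstp_step(state)
print("signature after 20 steps:", state)  # compare against a known-good run

The final register contents serve as the signature, to be matched against the contents obtained from a known working circuit.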
12.5 OTHER
ASPECTS OF FUNCTIONAL TESTING
The self-test methods described above
facilitate functional testing, in which we test an actual device to
ensure its behavior conforms to specifications. This contrasts with speed testing,
in which properly working circuits are sorted depending upon how fast they will
run, and with destructive testing, in which circuits under test are destroyed
in a process which aims to find out what the limits of the circuit are. In this
section we address some additional aspects of functional testing, emphasizing
MCMs.
Functional testing is important not only
for screening out defective units but for quality control, production line
problem diagnosis, and fault location within larger systems.
Functional testing occurs after all
design rules are satisfied, all design specifications are met during the
simulation and analysis phase, and the physical design goes through part or all
of the manufacturing process. In MCMs, functional testing is primarily done at
the substrate level, die level, and module level.
Staged testing, in which proper functioning of each die on an MCM is checked after it is mounted but before the next die is mounted, can help catch problems early. Testing of fully assembled units verifies that the completed system works.
12.5.1 Approaches
To Testing MCMs
Testing methods can be classified
as built‑in or external.
Built-in (e.g. BIST) approaches may be preferable in some cases; however, they make the design process more difficult, since they require extra hardware on the MCM beyond the dice and their connections. External test methods will be preferable in many cases due to lower design and production costs.
Testing methods can alternatively be
classified as concurrent or non‑concurrent. In concurrent
testing, the device is tested as it runs, such as by a testing program that
runs using clock cycles that would otherwise go unused. In contrast, non‑concurrent
testing is run on a unit that is not being used. Concurrent testing makes the
design task more difficult, yet can enhance dependability by automatic
detection of faults when they occur, as is necessary e.g. for fault tolerance
methods requiring on-the-fly reconfiguration.
Non‑concurrent testing is easier and will probably have a role in
MCM testing indefinitely.
Testing
methods can also be classified as static or dynamic. Static
testing deals with DC characteristics of devices that are not actually running.
In MCMs, this can be used for testing substrates prior to die installation. MCM
testing also requires dynamic testing, that is, testing while the MCM is in
operation.
Still another way to classify testing
methods is functional vs. parametric. Functional testing involves testing to
see if a device can do the things it is supposed to (that is, perform its
functions). Parameter testing is testing to see whether various parameters fall
within range. For example, a parametric test might measure rise and fall times
to check that they will support operation at a specified frequency.
Let us now look at staged testing, in which components of MCMs are tested as the module is built, in section 12.5.2, then at ways of testing the various components of MCMs in sections 12.5.3 through 12.5.5, before moving on to testing of entire MCMs in section 12.5.6.
12.5.2
Staged Testing
The general strategy of testing earlier
in the construction of a complex circuit rather than later is intended to
minimize wasted work (and hence expense). Taking MCMs as an example, early
detection of faults means less likelihood of mounting dice on bad substrates,
less likelihood of mounting bad dice, less chance of sealing MCMs with bad
components, less likelihood of selling bad MCMs, less chance of embedding bad
MCMs in a larger system, etc. Detection of faults as early as feasible is thus
an important part of an overall testing philosophy.
Increasing the feasibility of early
testing has its own costs. In the case of MCMs, a staged approach to testing in
which each die is tested after it is installed (instead of testing the whole
MCM after all the dice are installed) requires test pads to be located on the
substrate to facilitate test access to each die. This means using potentially
valuable substrate area for the pads, a more complex substrate design, and
potentially slower operation due to the capacitance and cross talk increase
caused by the extra metal in the pads and the conductance paths that lead to
them.
Taking
the early testing strategy further, we might test each die prior to
installation. This would not completely eliminate the need for testing it after
installation, and hence the need for test pads, because dice can be damaged by
the installation process, but it would avoid performing the installation
process on a die that is already bad. But the cost of this is high because
testing a die prior to installation is a difficult problem in itself. In fact
this problem has a name: the known good die problem (KGD). This important
problem is described later in the chapter.
12.5.3 MCM substrate testing
MCM
substrates are like miniaturized printed circuit boards in that they connect
together all the component parts of the MCM as well as serve as a platform on which
to mount those parts. Substrates should be tested for defects before ICs are mounted on them, both because such testing is relatively easy to do and because of the substantial cost of going through the rest of the fabrication process, which would be wasted if the substrate were bad.
12.5.3.1 Manufacturing defects in MCM
substrates
The substrate contains nets that
should be tested for opens and shorts. These nets terminate at the substrate surface in pads to which components such as dice will be connected. Those
connections may use wire bonds, flip chip bonding technology, or tape automated
bonding (TAB). While many pads are used
as connections to dice, some are used to connect with the pins of the MCM. A net may be tested for opens, shorts to
other nets, and high resistance opens or shorts by probing those test pads.
High frequency test signals can be applied to test for characteristics like
impedance, crosstalk and signal propagation delays.
There
are a number of approaches to testing nets, which are reviewed in the following
paragraphs. Each has its own advantages and disadvantages. These approaches may
be classified into the two broad categories of contact and non-contact methods.
12.5.3.2 Contact testing
In
contact testing a substrate is tested by making physical contact with the pads.
Resistance and capacitance measurements are done using probes to contact the
pads and locate opens, shorts and high resistance defects in the nets. For
example, a net demonstrating an unexpectedly low capacitance likely has a break in it. As another example, by moving two probes to two pads, the tester
can verify that continuity exists or that no short exists, as desired.
12.5.3.2.1 Bed‑of‑nails testing
Bed‑of‑nails
testing uses a probe consisting of an array of stiff wires. Each wire contacts
a different pad on a device, so that all (or many) of the pads needing to be
probed are contacted by a different wire at once. Multiplexing allows the
testing device to select which wires to use for sending or receiving test signals,
allowing measurements of resistance or impedance between a pair of pads or
between any two sets of pads.
Suppose there are N nets on a substrate to be tested, with Pk pads in the k-th net. The number of tests required to certify the k-th net for opens is Pk - 1. Therefore the total number of tests to certify all N nets on the substrate for opens is Σ(Pk - 1). Given an average of p pads per net, N(p - 1) tests are needed to test for opens. To test for shorts, each net must be checked for infinite resistance to each other net, unless auxiliary information about the spatial layout of the MCM is available that allows the testing procedure to skip pairs of nets that are far apart spatially. In the absence of such information, N(N - 1)/2 tests for checking shorts on the substrate are needed (provided the nets have no opens). As an example, suppose a substrate has 100 nets with an average of 5 pads per net. Then 100x(5-1) = 400 tests are needed for open circuit testing, and 100x99/2 = 4,950 tests for short circuit testing. The number of tests needed for short circuit testing increases quadratically with the number of nets. As the number of tests becomes large, bed-of-nails test probes save increasing amounts of time, because the probe need not be moved from place to place: each pad is already connected to one of the probes in the bed-of-nails. Packages for which the test pads form a regular grid with a fixed
center are better suited to bed-of-nails testers than idiosyncratic
arrangements of pads because idiosyncratic arrangements require the probe head
to be custom built [13]. Packages with small, densely packed pads are harder to
use with bed-of-nails testers because the probe becomes more complex and
expensive to make.
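The test-count arithmetic above is simple to script; a minimal Python sketch (the function names are ours):

def open_tests(num_nets, avg_pads_per_net):
    # P_k - 1 tests per net, so N(p - 1) in total
    return num_nets * (avg_pads_per_net - 1)

def short_tests(num_nets):
    # every unordered pair of nets: N(N - 1)/2
    return num_nets * (num_nets - 1) // 2

print(open_tests(100, 5))   # 400 open-circuit tests
print(short_tests(100))     # 4950 short-circuit tests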
Because
bed-of-nails testers are relatively complex and expensive, yet the probe need
not be mechanically (therefore slowly) moved around for each separate test,
bed-of-nails testing is most suited to situations requiring testing a large
volume of circuits quickly, so that the high cost is distributed over many
tested circuits [13].
12.5.3.2.2 Single probe testing and two probe
testing
Nets are separated by non-conducting dielectric material. This implies a capacitance between a pair of nets or between a net and the reference plane. If a testing procedure applies an AC signal to a net, typically from 1 kHz to 10 MHz, the impedance can be measured [14]. This measurement can be compared with the corresponding measurement from
another copy of the same device which is known to be good, or perhaps with a
statistical characterization of the
corresponding measurement from a number of other copies of the device. Lower
than expected capacitance suggests an open circuit, while higher than expected
capacitance suggests a short circuit.
To check for shorts, one measurement for
each net is required. To check for opens, one measurement for each pad is
required. If doubt exists as to whether the flow of current created by
application of AC represents only the normal capacitance of the net or includes
a high resistance short, AC of a different frequency may be applied. The
difference in current flow I1-I2 that this creates will be a function of the
capacitance C, the frequencies F1 and F2, and the resistance R. If R is
infinite, then I1/I2=F1/F2, and any deviation from that is due to resistance
(and inductance).
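The two-frequency check can be sketched numerically. The following Python example assumes a simple parallel R-C model for the net under probe; the component values are made up for illustration.

import math

def current(v, freq, c, r=math.inf):
    """|I| for a parallel R-C admittance driven at voltage v."""
    omega = 2 * math.pi * freq
    g = 0.0 if math.isinf(r) else 1.0 / r
    return v * math.hypot(g, omega * c)

v, c = 1.0, 50e-12                     # 1 V drive, 50 pF net capacitance
f1, f2 = 1e4, 1e6
for r in (math.inf, 3e5):              # no short vs. a 300 kilohm resistive short
    i1, i2 = current(v, f1, c, r), current(v, f2, c, r)
    print(f"R={r}: I1/I2={i1/i2:.4f}  F1/F2={f1/f2:.4f}")
# With R infinite the two ratios match; a resistive path makes I1/I2 deviate.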
Single
probe testing is not as affected as bed-of-nails testing by high pad density or
small pad size, but there are also some disadvantages [13]. One disadvantage is
that, if nominal test values are derived from actual copies of the circuit,
design faults will not be detected. Another disadvantage is that if the
substrate has pads on both sides then it must be turned over during the testing
process.
Two
probe testing has all the capabilities of one probe testing and then some, at
the price of a modestly more complex mechanism that can mechanically handle two
probes at once. Shorts can be isolated to the two offending nets by probing
both of them at once.
With
single and dual probe testers, the probes must be mechanically moved from pad
to pad. This limits the speed of testing [13]. To maximize speed, minimize the
total travel distance of the probes. An optimal minimization requires solving the famous Traveling Salesman Problem, which is known to be computationally intractable.
Flying probe technologies are becoming
more popular as control of the impedances of lines in a substrate becomes more
important due to modern high signal frequencies. Flying probe heads provide
control over the impedance of the probe itself, to facilitate sensitive
measurements of the nets [44].
12.5.3.3
Non‑contact testing
Testing using probes that make mechanical
contact with pads on a circuit can damage the pads, which in turn can prevent
good contact between the pad and a connection to it later in the manufacturing
process. This is one reason why a non-contact testing method is attractive.
Another reason is that in some MCM technologies it is desirable to test the
substrate at various stages in its manufacture before pads are present. This
may not be practical with mechanical testers due to the small size of the metal
areas to be probed. In non-contact testing, electrical properties of nets are
tested without making actual physical contact with them.
12.5.3.3.1
Electron beam testing
Electron beam testing (e.g. [46]) works
somewhat like the picture tube on a television set or computer terminal. A hot
negatively charged piece of metal is used as a source of electrons, which are
directed toward a target, which is the circuit in the case of a tester or the
screen in the case of a television. Magnetic deflection coils or
electrostatically charged plates can move the beam, back and forth and up and
down for a television, or however is needed for a tester. By directing the
electron beam at a particular place on a circuit, a net can be charged up. If
the charge then appears on another net, or does not appear on part of the
charged net, there is a short or open.
Electron beam testing is like single
probe testing in some ways, because the electron beam is analogous to the
single probe. However because there are no moving parts it can operate much
faster than a mechanical device. Another difference is that the electron beam
is DC whereas single probe testers typically use AC. However both varieties of
tester rely on the capacitance of the circuit structures to hold charge, and both can thus mistake high-resistance paths for shorts.
A disadvantage of electron beam testing
not shared by contact methods is the need for the circuit to be in a vacuum
chamber. This can mean a delay of minutes to pump out the air in the chamber
before the testing process can begin. One solution to this is to have an air
lock on the vacuum chamber. The circuit is placed in the relatively small air
lock which can be evacuated much faster than the larger test chamber. After the
air lock is evacuated the circuit is moved into the test chamber proper, which
has been in a vacuum all along.
Electron beam testers appear to be entering the marketplace. The Alcedo company estimates it can sell them for $1.2 million each, and was completing one for the US Air Force as of this writing [45].
12.5.3.4 Wear of MCM substrates
The
substrate contains the wiring used to connect all the other components on the
MCM. Improper fabrication can lead to gradual corrosion of nets, and eventually to failure. For substrates that are properly manufactured and initially found working, however, tested reliability is remarkably high.
Roy (1991 [56]) subjected MCM-D (deposited), HDI (high density interconnect) MCM substrates to HAST (highly accelerated stress tests) for thermal, moisture resistance, salt atmosphere, and thin film adhesion reliability characterization, and found that the MIL-STD-883C and JEDEC-STD-22 reliability standards were easily exceeded, with expected substrate lifetimes of over 20 years.
12.5.4 Die testing
An
MCM is populated with unpackaged chips ("bare dice") which are
mounted on the substrate. These bare dice should be good, because if they are
not there is substantial extra cost involved in removing and replacing them.
This is a problem, because bare dice are not widely available in tested form,
as they are usually tested by the manufacturer only after they are mounted in
the typical one-die package. There is more than one reason for this:
1)
It is much easier to test a packaged chip than an unpackaged bare die.
2)
Manufacturers make much of their money from the packaging, and so are not very
interested in selling the unpackaged dice.
3)
Manufacturers prefer not to sell untested bare dice because doing so not only risks their reputation for reliability, but raises the fear that the MCM manufacturer might damage dice during its own testing and then blame the die supplier for supplying bad dice! Such concerns are real.
ICs
intended for mounting on an MCM may also be designed differently from ICs
intended for standard use. Because they are so close together the paths between
them will tend to have low capacitance, meaning that the dice can be designed
with low power drivers. It is more difficult to test such dice because their
loads must have high impedance to match the drivers [13]. Another MCM-specific
testing difficulty is that manufacturers sometimes change the chip dimensions
without warning, requiring the MCM maker to reactively change their test setup
on short notice.
As discussed elsewhere in this book, die
yield has a major impact on MCM yield. In fact, the yield of the MCM will be
significantly lower than the yield of the dice it contains. Furthermore, the
rework required in removing and replacing bad dice is expensive. So verification
of bare dice before mounting is important despite the difficulties.
MCMs are usually intended to operate at
high frequencies, and so high frequency testing is an important part of an MCM
test strategy. High frequency testing is more difficult than standard testing
due to the interference posed by the impedances in the test equipment.
12.5.4.1 Chip carriers
A
chip carrier (Figure 12.5.4.1) is a die package which is close in size to the
die it carries. Simple in principle, it connects to densely packed perimeter
bond pads on a die and runs leads to a less densely packed area array
about the size of the die itself. This less densely packed area array package
provides a surface mount device (SMD) that is easily assembled into test
sockets or directly onto MCM substrates. This provides easier access to the I/O
ports for either testing or mounting on MCM substrates than is provided by the
bare dice, yet does not change the area of the device to be mounted
significantly because the area array is layered over the die itself. Easier
testing means dice that are not yet mounted on a substrate can be tested before
mounting, thus helping to address the problem of providing known good die
(KGD). If the chip carrier with its mounted die passes the tests, the entire
carrier package may be mounted as is on an MCM substrate, with connections
between the substrate and the die mediated by the area array provided by the
carrier. The carrier is thus a permanent package which is acceptable for mounting
on an MCM because its size is insignificantly larger than the die it contains.
One drawback is that getting from the die to the MCM substrate now requires two connections, one from the die to the carrier and one from the carrier to the substrate, instead of the single die-to-substrate connection needed without the carrier. Since two connections must be made successfully instead of one, connection reliability suffers somewhat and yield is lowered. Chip carrier philosophy and current technology are reviewed by Gilleo (1993 [52]).
****Insert Figure 12.5.4.1 here.****
12.5.5 Bond Testing
70‑80% of MCM faults are
interconnection faults (assuming use of known good die, which often is not
actually the case) (Hewlett Packard 1991 [53]), so this kind of testing is
useful, even though it does not directly target faulty dice. Interconnection
faults are faults in the connections between a die and the substrate. Testing
for interconnection faults is the responsibility of the MCM manufacturer.
Open
bonds could be tested by applying a probe to the net on the MCM substrate to
which it is supposed to be attached, and measuring the capacitance. A properly
working bond will cause the measured capacitance to be the sum of the
capacitance of the net, the wire bond or other bond material, and the
capacitance of the input gate or output driver on the die to which it makes a
connection. The capacitance measurement is facilitated by the fact that
resistive current will be negligible in CMOS circuits, which are the most
common kind. For ordinary dice, input gate capacitances run about 2 pF, whereas
output drivers have capacitances on the order of 100 pF. (Dice made
specifically for mounting in MCMs do not need powerful output drivers and so
their output capacitances can be significantly less, but such dice are not
generally available at present.) Thus open bonds that disconnect output drivers
should be relatively easy to detect. However, if an output driver and an input
gate are both bonded to the same net, the presence of the higher capacitance
output driver could dominate the capacitance measurement, precluding a reliable
conclusion about whether the bond to the input gate is open or not. If this bond testing approach is to be used, however, a die with an input gate on a given net could be mounted on the substrate before a die with an output driver to be bonded to the same net, so that bonds to low capacitance input gates can be capacitance tested before high capacitance output drivers are present. This would be a form of staged testing.
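A small Python sketch of the capacitance-based bond check described above, using the rough figures from the text (~2 pF input gates, ~100 pF output drivers); the threshold and net values are hypothetical.

def expected_capacitance(net_pf, bonds_pf, gate_pf_list):
    """Good-bond expectation: net + bond metal + attached gate/driver loads."""
    return net_pf + bonds_pf + sum(gate_pf_list)

def bond_looks_open(measured_pf, expected_pf, tolerance=0.25):
    """Flag an open if the measurement falls well below expectation."""
    return measured_pf < expected_pf * (1 - tolerance)

expected = expected_capacitance(net_pf=5.0, bonds_pf=0.5, gate_pf_list=[2.0])
print(bond_looks_open(4.8, expected))    # True: input-gate bond open on a quiet net
# With a 100 pF driver already on the same net, a missing 2 pF gate is invisible:
expected = expected_capacitance(5.0, 0.5, [2.0, 100.0])
print(bond_looks_open(104.0, expected))  # False: the open gate bond is masked

The second call illustrates the masking problem that motivates mounting the input-gate die first.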
This form of staged testing does have its
drawbacks, mainly due to the fact that some bonds of a given die will need to
be created before, and others after, bonds of other dice are made. Process
lines are better suited to handling all of the bonds to a die in one stage, and
only then going on to another die. Yet flexible, integrated manufacturing lines
should become increasingly viable in the future as automation increasingly
assists manufacturing processes.
A more serious drawback is that this
approach will not work with flip chip processes.
Mechanical testing of a particular kind of bond, the wire bond (a small wire going between a pad on a die and a pad on an interconnect), involves pulling on it to make sure it is physically well attached at both ends.
12.5.6 Testing Assembled MCMs
Even when all components (substrate,
dice, bonds, pins . . .) are working properly, they still may not necessarily
interact properly. Thus a complete MCM or other system must be tested even if
its components are known to be good.
Working
parts may also become bad during the process of assembling a larger system and
so it is useful for the system to support testing of its parts even if they
were tested before assembly, and especially if they were not, or not
completely. MCMs are a good example of such systems due to the difficulty of testing
some of the component parts prior to assembly. Components of completed MCMs can
be hard to test because it is hard to observe the states of the interconnects, which are hidden within the MCM and much smaller than interconnects on printed circuit boards.
Thus
a module must be tested after it is assembled. This requires testing of
components in case they were damaged during assembly, even if they were known
to be working prior to assembly. A burn-in may be provided to help detect
latent defects.
Various
test strategies are suited for testing assembled MCMs. The test strategy chosen
will be affected by built-in testability features if any, die quality,
substrate quality, reliability requirements, fault location coverage requirement, and rework cost [13].
Testing exhaustively is equivalent to fully testing each module component plus verifying that the components work properly together. This provides high fault coverage; however, it is impractical for all but certain MCM designs, such as a bus-connected microprocessor and SRAM, due to the complexity of fully testing all internal logic from the MCM pins [13].
Building
the MCM using components with built in testability features can facilitate MCM
testing significantly. Ways to incorporate testability include boundary scan, built-in self test (BIST), and providing external access to internal I/O
(e.g. through test pads, discussed next). Methods such as these facilitate
fault detection coverage, fault location coverage, and faster and simpler
testing.
12.5.6.1 Test pads
The
testability of an MCM can be increased by bringing out internal nets to the
exterior of the package. This is done by having test pads, which are
small contact points on the outside of the MCM package that are each connected
to some internal net. This provides observability and controllability to
internal connections not connected to any of the MCM pins. This can help test
engineers to isolate internal faults. Test pads can be connected directly to
each net in an MCM. This makes the MCM testing problem analogous to the printed
circuit board test problem, in that all nets are accessible for probing. Unfortunately, while test pads connected to
all or many nets in an MCM for testability may be feasible to manufacture, they
have some drawbacks. These drawbacks include the following:
*
Test pads increase capacitance and cross‑talk, adversely affecting
performance.
*
Test pads can be hard to access for test purposes simply because of their necessarily small size (4 mils might be a typical test pad dimension), which results from cramming all such pads into the small external area provided by the MCM package.
Let us deal with each of these issues in turn.
12.5.6.1.1 Test pads and performance
While test pads are useful in MCM testing, they have the disadvantage of decreasing performance. One way to avoid this tradeoff would be to build test pads on the MCM, use them for testing, and remove them when testing is concluded, for example with a fabrication step that chemically etches away the exposed pads while leaving everything else intact.
12.5.6.1.2 Test pad number and accessibility
More
test pads means crowded, smaller, and therefore less accessible test pads. By
providing fewer test pads, the ones provided could be made larger and therefore
more accessible. Thus there is a
tradeoff between the number of test points and their accessibility. Consequently, progress on MCM testing via
test pads must take one of two broad strategies:
Strategy
1: Better ways to access small pads arranged in dense arrays.
Strategy
2: Dealing with an incomplete set of test pads.
Regarding
strategy 1, here are some ways to access pads:
* A
small number of probes (for example, two) which can be moved from pad to pad
efficiently (the moving probe approach).
*
Many probes, one for each test pad, to be applied all at once. This avoids the problem of moving probes
around from pad to pad, but at the price of having to build a dense, precisely
arranged, expensive set of probes ("bed‑of‑nails") that
can reliably connect to their respective pads. A collapsing column
approach to constructing each probe is one way to do this.
*
Electron beam ("E‑beam") use. A high technology and non‑trivial
undertaking.
These
methods were discussed previously in this chapter.
With regard to strategy 2, here are some
possibilities for maximizing testing effectiveness given limited access to
substrate nets. Judicious use of available pads is required. Approaches for
this include:
*
Design the "right" test pads into the MCM. Some nets will be more
important to test than others, and part of the design process would be to
decide which are the most important ones and provide pads for those. Artificial intelligence work on sensor placement, such as [47], might be applicable here.
*
Clever diagnostic use of existing test pads. Artificial intelligence work on
diagnosis, especially of digital circuits, could come into play here. Hamscher
[48] describes one approach and reviews previous work.
*
Vias could be manufactured for test pads on all paths, while actual pads are fabricated only for some. This would make it easier to change which paths are provided with test pads, since no redesign of vias would be needed; only the MCM layer providing the pads themselves would need to be redesigned.
12.5.6.1.3 Test pad summary
There are tradeoffs between the desirable
goal of high fault coverage and its undesirable price of small, numerous,
difficult‑to‑access pads.
This tradeoff could be optimized by providing pads for the more
important probe points in preference to the less important ones. This optimization
process also involves a tradeoff: there is benefit to be gained from providing
access to only important probe points, but at a cost in the design phase of
finding out what those probe points are. Its utility depends on the number of
MCMs to be produced from a given design: a greater number means more
benefit. For MCM designs in which cost
and efficiency are not the overriding factors, it would seem reasonable to
provide test pads for all nets. This
might apply, for example, to small runs
of experimental MCMs.
12.6 CRITICAL
ISSUE: BOUNDARY SCAN
It has been said that if boundary scan were implemented on chips that were made available as known good die (KGD), MCMs would suddenly be the packaging technology of choice for many applications. While there are other issues involved, there is a good deal of truth to the belief that widespread use of boundary scan (and availability of known good die) could alleviate the MCM testing problem to the degree that MCMs would be a much more competitive packaging option than they presently are. That is why we characterize use of boundary scan (and availability of known good die - see Section 12.7) as "critical issues."
12.6.1 The Boundary Scan Concept
Boundary
scan [26,37,38], formally known as IEEE/ANSI Standard 1149.1‑1990, and
informally often referred to as "JTAG," is a set of hardware design rules that improve testing time and cost. Boundary scan allows testing
at the IC level, the PCB (printed circuit board) level, and the system level, as
long as each has a "boundary" consisting of input and output lines.
The
basic boundary scan architecture appears in Figure 12.6.1-1. The main modules
are the Test Access Port (TAP) controller, the instruction register, and the
data registers which include the boundary-scan register, bypass register, MUX,
and two optional registers ("device ID" and "design
specific").
****Insert
Fig. 12.6.1-1 here.****
The Test Access Port (TAP) includes the extra pins added to the package to communicate with the internal boundary scan logic. These are called the test clock (TCK), test data input (TDI), test mode select (TMS), and test data output (TDO) lines. The boundary scan logic is controlled through the TCK and TMS pins, and data is shifted into and out of the logic via the TDI and TDO pins [37].
The
TAP controller is a 16 state finite state machine (FSM) (Figure 12.6.1-2). The
TAP controller changes state synchronously on a Test Clock rising edge, or
asynchronously if an optional Test Reset pin is also included in the pins
comprising the test access port. The state of the TAP controller machine
determines the mode of operation of the overall boundary scan logic.
The
set of TAP controller states include a state in which a boundary scan
instruction is shifted into the Instruction Register one bit at a time from the Test Data In (TDI) line. The Instruction Register can also be initialized to 01
by having the TAP controller enter another state for this purpose. The
Instruction Register contains a shift register for shifting in the instruction,
and output latches for storing the instruction and making it accessible to the
rest of the boundary scan logic in order to determine its specific behavior.
However the Instruction Register is loaded, the loaded values must be moved to
the output latches of the register for them to determine the test function of
the boundary scan circuitry. This is done with yet another state of the TAP
controller. Once the Instruction Register is properly set, another TAP
controller state can be entered (via signals sent through the test access port
pins, of course) in which the contents of the Instruction Register determine
the specific behavior of the boundary scan logic.
The
data registers include two registers which are required by the boundary scan
standard. These are the boundary register and the bypass register. The boundary
register is a set of cells, one per I/O pin on the tested device, except for
the TAP pins. The boundary scan logic allows these cells to act as a shift
register so that test input data can be shifted into the cells, and test output
data shifted out of them, using the TDI and TDO pins. Each boundary register
cell can also read data from the pin to which it is connected or the internal
logic whose output goes to the pin. Thus the boundary cell can pass values
through, allowing the circuit to act in normal mode, or can shift test data in
or out, or can provide values to the inputs of the tested circuitry or read
values from it. The bypass register has only one cell and provides a short path
from TDI to TDO that bypasses the boundary register.
A
transition diagram for the 16 states of the TAP controller appears in Figure
12.6.1-2. The label on an arc shows the logical value required on the TMS line
for the indicated transition to occur [37]. The transition occurs on the rising
edge of the TCK signal. Depending on the state of the TAP controller, data may
be shifted in at TDI, may be parallel
loaded into the instruction register, etc. Depending on the TAP controller
state, activity may also occur on the falling edge of TCK. For example, data
that has been shifted into the shift rank of a register through TDI may be
latched into the hold rank of the register where it is stored and made
available to other parts of the circuit. As another example, on the falling
edge of TCK data may be shifted out on TDO (although shifting in through TDI
only occurs on a rising edge). In the figure, notice the similarity between the
two vertical columns, the data column and the instruction column. These columns
represent states in which analogous activities are performed on the data or
instruction registers.
****Insert
Fig. 12.6.1-2****
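The TAP controller is naturally expressed as a lookup table. The following Python sketch encodes the transition diagram of the standard IEEE 1149.1 TAP state machine (Figure 12.6.1-2); the TMS walk at the end is just one illustrative path.

TAP = {  # state: (next state if TMS=0, next state if TMS=1)
    "TEST-LOGIC-RESET": ("RUN-TEST/IDLE",  "TEST-LOGIC-RESET"),
    "RUN-TEST/IDLE":    ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
    "SELECT-DR-SCAN":   ("CAPTURE-DR",     "SELECT-IR-SCAN"),
    "CAPTURE-DR":       ("SHIFT-DR",       "EXIT1-DR"),
    "SHIFT-DR":         ("SHIFT-DR",       "EXIT1-DR"),
    "EXIT1-DR":         ("PAUSE-DR",       "UPDATE-DR"),
    "PAUSE-DR":         ("PAUSE-DR",       "EXIT2-DR"),
    "EXIT2-DR":         ("SHIFT-DR",       "UPDATE-DR"),
    "UPDATE-DR":        ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
    "SELECT-IR-SCAN":   ("CAPTURE-IR",     "TEST-LOGIC-RESET"),
    "CAPTURE-IR":       ("SHIFT-IR",       "EXIT1-IR"),
    "SHIFT-IR":         ("SHIFT-IR",       "EXIT1-IR"),
    "EXIT1-IR":         ("PAUSE-IR",       "UPDATE-IR"),
    "PAUSE-IR":         ("PAUSE-IR",       "EXIT2-IR"),
    "EXIT2-IR":         ("SHIFT-IR",       "UPDATE-IR"),
    "UPDATE-IR":        ("RUN-TEST/IDLE",  "SELECT-DR-SCAN"),
}

state = "TEST-LOGIC-RESET"
for tms in [0, 1, 1, 0, 0]:   # walk from reset into instruction shifting
    state = TAP[state][tms]   # next state on each rising edge of TCK
print(state)                  # SHIFT-IR: TDI now shifts in the instruction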
The tests supported by boundary scan can be placed in three major modes [26].
1.
External: Stuck-at, open circuit, and bridging fault conditions related to the
connections between devices on an MCM and other such devices or the outside world
are detectable.
2.
Internal: Individual devices on an MCM can be tested despite their being
already mounted on a substrate and sealed into the MCM without their I/O ports
connected directly to MCM pins (or test pads). Devices may have test data set
up on their ports via the TDI pin, which can shift data into the boundary register. However, only the I/O ports (the "boundary") are accessible through boundary scan. If the device also contains BIST capability, then the BIST facility can be controlled and used by the boundary scan circuitry to do a more thorough and faster test of the internal circuitry of the device.
3.
Sample Mode: The values at the I/O ports of a device (i.e. what would be the
pins if the die was packaged individually in the usual fashion) can be read by
the cells of the boundary register, and those values shifted out for analysis,
while the device is operating normally. In this mode the boundary scan logic
does not affect circuit operation, but rather provides visibility to values
entering and leaving the device even though its I/O may not be directly
accessible from outside the MCM.
12.6.2 Boundary Scan for MCMs
The
individual chips can be tested in isolation by linking the TDO port of one chip
to the TDI port of another, forming a chain of chips. Figure 12.6.2 shows how
this chain is constructed. A test vector is clocked in at the TDI of the first
chip in the chain, and clocking in continues until all chips have their test
data. Then the chips are run for a desired number of clock cycles. Finally, the
resulting outputs of the chips are clocked out of the TDO port of the last chip
in the chain, with clocking continuing until all the boundary register contents
of all the chips are clocked out for external analysis.
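The shifting mechanics of such a chain are easy to simulate. This Python sketch assumes two chips with hypothetical 4-cell and 6-cell boundary registers; the overflow of each chip's register feeds the next chip's TDI.

def scan_chain(chain_cells, bits):
    """Shift `bits` in at the first chip's TDI. Returns the final cell
    contents and the bits that emerged at the last chip's TDO."""
    cells = [c[:] for c in chain_cells]
    tdo_stream = []
    for b in bits:
        carry = b
        for reg in cells:             # ripple through each chip in turn
            reg.insert(0, carry)      # shift in at this chip's TDI
            carry = reg.pop()         # last cell feeds the next chip's TDI
        tdo_stream.append(carry)      # off the end of the final chip
    return cells, tdo_stream

chips = [[0] * 4, [0] * 6]            # two chips: 4- and 6-cell registers
loaded, out = scan_chain(chips, [1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(loaded)   # the test data now sits in every boundary cell of both chips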
Shorts between MCM substrate
interconnects should be tested before opens, because operating the MCM with
shorts present can damage or shorten the lifetimes of components. Thus shorts
should be detected as soon as possible so the problem can be corrected before
further damage occurs, if it hasn't already. The basic idea of testing for
shorts is to use boundary scan to clock in an appropriate test vector, then
clock it out again to see if it contains values that have changed. A changed value
would be due to the wrong value being present at a boundary cell because the
corresponding chip I/O port is shorted to an interconnect that is set to that
value. An algorithm and its explanation are provided for example in Parker
(1992 [37] section 3.2.2.1). Testing for shorts reveals many open faults as
well, but not all. Testing for opens is thus necessary. This is done by
ensuring that values set in one location are propagated to another location
that is supposed to be connected, where both locations are accessible by the
boundary scan logic (that is, both locations are die I/O ports).
****Insert
Fig. 12.6.2****
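A simplified sketch of the counting-sequence idea behind such shorts tests follows; this is our own illustration rather than the specific algorithm of [37], and it assumes a wired-AND short model (real drivers may resolve shorts differently). Each net is driven with its binary index over ceil(log2 N) scan cycles; nets whose received ID differs from the driven one are flagged.

from math import ceil, log2

def shorts_test(num_nets, shorted_groups):
    width = ceil(log2(num_nets))
    driven = {n: [(n >> b) & 1 for b in range(width)] for n in range(num_nets)}
    received = {n: v[:] for n, v in driven.items()}
    for group in shorted_groups:          # wired-AND of all shorted drivers
        for b in range(width):
            val = min(driven[n][b] for n in group)
            for n in group:
                received[n][b] = val
    return [n for n in range(num_nets) if received[n] != driven[n]]

print(shorts_test(8, shorted_groups=[(3, 5)]))   # nets 3 and 5 are flagged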
12.7 CRITICAL
ISSUE: KNOWN GOOD DIE
Unpackaged
chips (plural either "bare die" or
"bare dice") are usually available from manufacturers only in
untested form, if at all. The eventual availability of pre‑tested dice,
which are termed "known good," is expected to be an important factor
in making MCMs economical for other than high-end applications, since even one
bad die on an MCM almost always means the whole MCM will not work. Testing of
bare die is easier for the die manufacturer than for the MCM assembler. This is
because
(1) the die manufacturer will be more likely
to already have testing capabilities for the chip, even if only for its
packaged form, and
(2) generating tests for a chip is easier
for the manufacturer when, as is frequently the case, the design of the chip is
known only to its manufacturer and kept proprietary.
The importance of mounting known good die ("KGD") on an MCM is due to the rapid degradation in MCM yield as the number of dice increases and the yield of each die decreases. The concept of yield is dealt with elsewhere in this book, but due to its relevance to this section it is briefly reviewed here.
The proportion of all units that are fault
free is the yield of a manufacturing process. A yield for bare die that
would make them well suited to placement on MCMs would be something like .999
[26]. A figure like that is high enough that a set of dice, each with that yield individually, would all work (making the resulting MCM work also) with reasonably high probability.
mounting them on the MCM does not degrade the yield of the MCM unduly, and that
the substrate has already been verified as fault free. However, such a high
yield for bare die may not be easily approached. One reason for that is the low demand for
known good die (KGD). This means IC manufacturers are not impelled by market
forces to produce them. Only about 0.001% of IC sales are in the form of KGD
[49]. As the MCM market increases, this may lead to increased demand for KGD,
leading to more availability of KGD, leading in turn to more economical MCM
production, leading to yet more market increase for MCMs. Thus there is a
feedback relationship between the availability of KGD and the market share of
MCMs. This feedback relationship could lead to an explosion in MCM production,
or could impede it instead, resulting in continued relatively small,
specialized market niches, depending on whether the critical point in MCM
production and KGD production can be exceeded by other, smaller forces.
Unpackaged
ICs (i.e. bare dice) are harder to test than ICs in individual packages. Any
testing unpackaged ICs do get is typically while they are still on the wafers,
which are the relatively large slices of silicon on which a number of identical
chips are manufactured prior to their being sawed apart into individual dice.
Those tests are usually of limited scope, due to the relative difficulty
compared to the packaged chips of testing at various temperatures, removing
heat generated by the chip while it is operating, and adequately accessing the
I/O ports of the chip [25]. The result of these problems is fault coverage much
lower than for packaged ICs, which can be more thoroughly tested. This low
fault coverage leads directly to a higher percentage of faulty dice passing the
limited tests they are given. Unfortunately the yield of a module must be lower
than the yield of the least reliable die mounted on it. Yield of an MCM depends
in part on the yields of its constituent ICs in accordance with
Ym = (Yd)^n                                  (12.7-1)
where Ym is the module yield, Yd is the expected die yield, and n is the number of dice mounted on the MCM. This is just the probability that all
the individual dice are working. For example, given a bare chip yield of 99% and 20 chips on a module, the module yield is only about 82%. Given a bare chip yield of 95% and 20 chips, the module yield is about 36%. The MCM yield Ym decreases exponentially as the number of dice on the MCM increases.
The
above formula can be modified to account for different dice of different
yields. In that case, the yield of the MCM is
Ym = Y1 * Y2 * ... * Yn                      (12.7-2)
where
there are n dice on the MCM, all must be working for the MCM to work,
and Yx is the yield of die number x.
These equations calculate module yield from die yield alone; in reality, module yield also depends on other factors, in particular the interconnects, the substrate, and the assembly processes [25].
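The arithmetic of equations (12.7-1) and (12.7-2) is simple to check; a minimal Python sketch:

def module_yield(die_yields):
    y = 1.0
    for yd in die_yields:   # all n dice must work: Ym = Y1*Y2*...*Yn
        y *= yd
    return y

print(module_yield([0.99] * 20))   # ~0.82: 20 dice at 99% each
print(module_yield([0.95] * 20))   # ~0.36: yield collapses exponentially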
Here are some of the technical problems
described in Williams (1992 [49]) that inhibit the availability of KGD.
(1) DC parametric testing of dice is useful but does not verify functional performance of a die. At-speed functional testing (perhaps at different temperatures) is important in achieving the high yield of bare die necessary for high yield of the resulting MCMs.
(2) Proper burn-in of die, especially since
bare die may have different thermal characteristics than they do after mounting
on the MCM.
(3) Test vector acquisition from the
manufacturers of the die.
(4) Compatibility issues of different test
equipment.
Testing of bare die can be facilitated
through design for testability. BIST, for example, will become more cost effective as more chips
are used in MCMs, due to the difficulties in testing MCMs compared to
individually packaged chips.
Only an MCM testing process with 100%
fault coverage will detect all faulty MCMs. However, the increasing complexity
of modern integrated electronic circuitry, exemplified by MCMs, makes 100%
fault detection coverage difficult. The defect level of MCMs passing the final testing process is determined by both the yield of the MCM itself and the fault coverage of that final test, and the yield of the MCM is determined in large part by the yield of its component dice. Consequently, testing only the assembled MCM results in a lowered probability that a shipped MCM is fault free when its component dice are not known good.
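Reference [27] gives a classic quantitative relation between defect level, yield, and fault coverage: DL = 1 - Y^(1-T), where Y is the true yield and T the fault coverage. A sketch assuming that model (the 60% yield figure is hypothetical):

def defect_level(yield_, coverage):
    # Williams-Brown relation [27]: DL = 1 - Y**(1 - T)
    return 1 - yield_ ** (1 - coverage)

for t in (0.80, 0.95, 0.999):
    print(f"coverage {t:.3f}: {defect_level(0.60, t):.4%} of shipped parts bad")
# Even at 95% coverage, a 60%-yield module stream ships ~2.5% defective units.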
12.8 SUMMARY
This
chapter reviews many of the topics related to testing of MCMs and other complex
forms of electronic circuitry. The more miniaturized and integrated an electronic circuit is, the harder it is to test. On the other hand, greater integration means fewer elementary components and hence greater potential dependability, since fewer components means fewer ways the circuit can have faults.
Fault coverage refers to the ability of a
testing method to find faults in the circuit. Since it is impractical to catch every possible fault in a complex circuit while testing, fault coverage is less than 100%. One reason for this is the reliance of testing
methodologies on fault models, which only encompass some of the diverse
possible kinds of real faults. Typical fault models include stuck-at-fault
models, bridging fault models, open fault models, and delay fault models.
An increasingly important approach to
testing is designing circuitry from the very beginning in a way that supports
testing later. Approaches to designing for testability (DFT) include multiplexed scan design, level sensitive scan design, random access scan, partial scan, and built-in self test (BIST) techniques. BIST means including hardware support for testing that is more sophisticated than the simpler aforementioned approaches. The most important current BIST method is boundary scan.
External testers continue to be extremely
important in testing. MCM substrate testing verifies the substrate prior to the
expensive process of mounting dice (unpackaged chips) onto it. That and
subsequent stages of testing can use contact testing methods (e.g.
bed-of-nails, single probe, two probe, or flying probe methods), and
non-contact testing methods, such as electron beam testing. Major purposes of
such testing are to verify functionality and to determine the speed at which
the circuit can operate.
12.9
EXERCISES/PROBLEMS
1 Consider
Figure 12.2.1.2 and section 12.2.1.2. Explain how the test vectors 0110, 1001,
0111, and 1110 can detect all stuck-at faults.
2 Consider
an AND gate and its three lines. List all single and multiple stuck-at faults
that are equivalent to input line A stuck-at 0.
3 Consider
a NAND gate with inputs A and B and output Z. List all tests that detect a
stuck-at 1 on A. List all tests that detect a stuck-at 0 on Z. Why does the
stuck-at 1 on A dominate the stuck-at 0 on Z?
4 Consider
the RUN‑TEST/IDLE state of the TAP controller in the boundary scan
architecture. Does the term "RUN" go with "TEST" only or
does "RUN" go with both "TEST" and "IDLE"? Why?
5 Why
is the TAP controller designed with states EXIT1‑IR, EXIT2‑IR,
EXIT1‑DR, and EXIT2‑DR? That is, why not remove those states from
the TAP controller design in order to make it simpler?
6 What
is the purpose of the TAP controller states UPDATE‑IR and UPDATE‑DR?
Hint: Registers have an input "rank," or portion for shifting in
data, and an output rank for providing logic values to other parts of the
system.
7 Michael
and Michelle Chipkin suggest the following test approach. Critique it. Their
"revolutionary" approach is to store a "brain scan" of a
known working (that is, a "gold standard") MCM and compare it to the
"brain scan" of the MCM under test.
To get such a "brain scan," chart the frequency spectrum above
each point on the surface of the MCM. If the MCM under test has unexpected
differences, these differences indicate areas that are not operating properly.
8 Consider
the issue of known good dice in the following light. The competitiveness of MCM
technology is dependent in considerable degree on the availability of KGD. Yet
the availability of KGD depends in considerable degree on demand for them in
the form of MCMs. Thus there is a feedback cycle that tends to inhibit or
promote the growth of MCM technology depending on the values for KGD
availability (which we might model imperfectly as price) and MCM use (which we
might model imperfectly as some percent of all IC manufacturing). Write a
computer program that implements a model of this situation. For your model, is there a value for MCM use and a value for KGD price that will just tip the model into a positive feedback situation in which MCM use suddenly increases very quickly, and if so, what are those values? Use any numbers you like in setting up the variables of
your model, or better, obtain numbers from the current literature. Consider
this problem as a thesis topic.
9 It
is the future. The McDonald's corporation McModule division has decided MCMs
will play an important role in the next wave of computer technology, ubiquitous
computing. Their motto is, "a
hamburger in every pot and an MCM in every plate," evoking the idea that
complex electronic modules will be everywhere, even embedded in your plate to
monitor the food on it. Figure 12.9 shows the floor plan of their multipurpose
MCM, using components selling for less than a dozen for a penny, for use in
plates and other everyday items. How will the overall size of the MCM change if various test methodologies are used? How does yield change if yield is assumed proportional to size?
****Insert Figure 12.9 here****
10 From the description of LFSRs in this chapter, draw a diagram of an ALFSR containing three latches and two XOR
gates. Assuming a starting state in which all the latches output the value 0,
what is the next state of the circuit? What are the next 10 states of the
circuit?
11 Consider
the 4 control input combinations possible for a BILBO circuit. Explain how each
of the four causes the circuit to behave. Refer to the BILBO section of this
chapter.
12 Section
12.3 describes up transition faults. Give an analogous description of down
transition faults.
13 Give
an example of a memory fault that the 0-1 test will not find.
14 Consider
figure 12.2.1.1. Explain why a stuck-at fault on line X2 cannot be detected,
and how a stuck-at fault on line X1 can be detected.
12.10
REFERENCES
1. John
P. Hayes, "Fault modeling," IEEE Design & Test, pp. 88‑95,
Apr. 1985.
2. Kenneth
M. Butler and M. Ray Mercer, Assessing Fault Model and Test Quality, Kluwer
Academic Publishers, 1992.
3. V.
K. Agarwal and A. S. F. Fung, "Multiple fault testing of large circuits by
single fault tests," IEEE Trans. Comp., Vol C‑30, No. 11, pp. 855‑865,
Nov. 1981.
4. Rochit
Rajsuman, Digital Hardware Testing: Transistor‑level Fault Modeling and
Testing, Artech House, Boston, 1992.
5. Melvin
A. Breuer and Arthur D. Friedman, Diagnosis and Reliable Design of Digital
Systems, Computer Science Press, 1976.
6. J.
P. Shen, W. Maly, and F. J. Ferguson, "Inductive fault analysis of MOS
integrated circuits," IEEE Design & Test, Vol. 2, No. 6, pp. 13‑26,
Dec. 1985.
7. S.
D. Millman and E. J. McCluskey, "Detecting bridging faults with stuck‑at
test sets," in Proc. 1988 IEEE Int. Test Conference, pp. 773‑783,
Sept. 1988.
8. S.
D. Millman and E. J. McCluskey, "Detecting stuck‑open faults with
stuck‑at test sets," in Proc. 1989 IEEE Custom Integrated Circuits
Conference, pp. 22.3.1‑22.3.4, May 1989.
9. J.
D. Lesser and J. J. Schedletsky, "An experimental delay test generator for
LSI," IEEE Trans. Comp., Vol. C‑29, No. 3, pp. 235‑248, Mar.
1980.
10. V.
S. Iyengar, B. K. Rosen, and J. A. Waicukauski, "On computing the sizes of
detected delay faults," IEEE Trans. CAD, Vol. 9, No. 3, pp. 299‑312,
Mar. 1990.
11. Kenyon
C.-Y. Mei, "Bridging and stuck‑at faults," IEEE Trans. Comp.,
Vol C‑23, No.7, pp. 720‑727, July 1974.
12. Donald
R. Schertz and Gernot Metze, "A new representation for faults in
combinational digital circuits," IEEE Trans. Computers, Vol. C‑21,
No.8, pp. 858‑866, Aug. 1972.
13. Thomas
C. Russell and Yenting Wen, "Electrical testing of multichip
modules," in Daryl Ann Doane and Paul D. Franzon (Editors), Multichip
Module Technologies and Alternatives, pp. 615‑660.
14. Frank
Crnic and Thomas H. Morrison, "Electrical test of multi‑chip
substrates," ICEMM Proceedings '93, pp. 422‑428.
15. James
R. Trent, "Test philosophy for multichip modules," International
Journal of Microcircuits and Electronic Packaging, vol. 15, no. 4, 1992, pp.
239‑248.
16. H.
T. Nagle, S. C. Roy, C. F. Hawkins, M. G. McNamer, and R. R. Fritzemeier,
"Design for testability and built‑In self test: a review," IEEE
Transactions on Industrial Electronics, Vol. 36, No. 2, May 1989, pp. 129‑140.
17. H.
Fujiwara and T. Shimono, "On the acceleration of test generation
algorithms," IEEE Trans. on Computers, Vol C‑32, pp. 1137‑1144,
Dec. 1983.
18. Y.
Takamatsu and K. Kinoshita, "CONT: a concurrent test generation
algorithm," Fault-Tolerant Computing Symp. (FTCS‑17) Digest of
papers, Pittsburgh, PA, pp. 22‑27, July 1987.
19. C.
Benmehrez and J. F. McDonald, "The subscripted D‑algorithm: ATPG
with multiple independent control paths," ATPG Workshop Proceedings, pp.
71‑80, 1983.
20. H.
Kubo, "A procedure for generating test sequences to detect sequential
circuit failures," NEC Res & Dev (12), pp. 69‑78, Oct 1968.
21. G.
R. Putzolu and J. P. Roth, "A heuristic algorithm for the testing of
asynchronous circuits," IEEE Trans. on Computers, Vol C‑20, pp. 639‑647,
1971.
22. P.
Muth, "A nine‑valued circuit model for test generation," IEEE
Trans. on Computers, Vol C‑25, pp. 630‑636, June 1976.
23. Manoj
Franklin and Kewel K. Saluja, "Built‑in self‑testing of random‑access
memories," Computer (IEEE Computer Society Press), Vol 23, No.10, pp. 45‑55,
Oct. 1990.
24. B.
Konemann, J. Mucha, and G. Zwiehoff, "Built‑in test for complex
digital integrated circuits," IEEE Journal of Solid‑State Circuits,
Vol SC‑15, No.3, pp. 315‑318, June 1980.
25. A.
Krasniewski, Circular Self-Test Path: A Low-Cost BIST Technique for VLSI
Circuits, IEEE Transactions on Computer-Aided Design, Vol. 8, no. 1, Jan. 1989,
pp. 46-55.
26. G.
Messner, I. Turlik, J. Balde, and P. E. Garrou, Thin Film Multichip Modules,
International Society for Hybrid Microelectronics, 1993.
27. T.
W. Williams and N. C. Brown, "Defect level as a function of fault
coverage," IEEE Trans. on Computers, Vol. 30, pp. 987‑988, Dec.
1981.
28. R.
G. Bennetts, Design of Testable Logic Circuits, Addison‑Wesley, 1984.
29. Joseph
Di Giacomo, Designing with High Performance ASICs, Prentice Hall, Englewood
Cliffs, New Jersey, 1992.
30. B.
W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison‑Wesley,
1989.
31. V.
D. Agarwal, K.-T. Cheng, D. D. Johnson, and T. Lin, "Designing circuits
with partial scan," IEEE Design & Test of Computers, pp. 9‑15,
Apr. 1988.
32. R.
Gupta and M. A. Breuer, "The BALLAST methodology for structured partial
scan design," IEEE Trans. on Computers, Vol 39, No 4, pp. 538‑544,
Apr. 1990.
33. Edward
J. McCluskey, "Built‑in self‑test structures," IEEE
Design & Test, pp. 29‑36, Apr.
1985.
34. Andrew
Flint and William Blood, Jr., "MCM test strategy: board test in an IC
environment," ICEMM Proceedings '93, pp. 429‑434.
35. R.
W. Bassett, P. S. Gillis, and J. J. Shushereba, "Testing and diagnosis of
high‑density CMOS multichip modules," International Test Conference,
1991, pp. 530‑539.
36. David
Karpenske and Chris Tallot, "Testing and diagnosis of multichip
modules," Solid State Technology, June 91, pp. 24‑26.
37. Kenneth
Parker, Boundary‑Scan Handbook, Kluwer Academic Publishers, 1992.
38. John
K. Hagge and Russell J. Wagner, "High‑yield assembly of multichip
modules through known‑good IC's and effective test strategies,"
Proc. of IEEE, Vol. 80, No. 12, Dec 92, pp. 1234‑1245.
39. V.
D. Agrawal, C. R. Kime, and K. L. Saluja, "A tutorial on built-in self
test, part I: principles," IEEE Design and Test of Computers, March 1993.
40. Elwyn
R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, NY, 1968.
41. Edward
J. McCluskey, "Verification testing -- a pseudoexhaustive test
technique," IEEE Transactions on Computers, Vol. C-33, No. 6, June 1984,
pp. 541-546.
42. E.
J. McCluskey and S. Bozorgui-Nesbat, "Design for autonomous test,"
IEEE Transactions on Computers, Vol. C-30, pp. 866-875, Nov. 1981.
43. E.
J. McCluskey, "Built-in self-test techniques," IEEE Design and Test
of Computers, April 1985.
44. Clive
Shipley, "Flying probes," Advanced Packaging, Fall 1992, pp. 30-35.
45. Alcedo,
WWW site http://www.businessexchange.com/filesavce/beamtest.html.
46. A.
B. El-Kareh, Testing printed circuit boards, MCM's and FPD's with electron beams,
Alcedo, 485 Macara Ave. Suite 903, Sunnyvale CA.
47. R.
Doyle, U. Fayyad, D. Berleant, L. Charest, L. de Mello, H. Porta, and M.
Wiesmeyer, "Sensor selection in complex system monitoring using
information quantification and causal reasoning," in: Faltings &
Struss, eds., Recent Advances in Qualitative Physics, MIT Press, 1992, pp. 229‑244.
48. W.
Hamscher, "Modelling digital circuits for troubleshooting," Artificial
Intelligence, vol. 51 (1991), pp. 223-271.
49. T.
A. Williams, "Securing known good die," Advanced Packaging, Fall
1992, pp. 52-59.
50. S.
Kim and F. Lombardi, Modeling Intermediate Tests for Fault-Tolerant Multichip
Module Systems, IEEE Transactions on Components, Packaging, and Manufacturing
Technology - Part B, Vol. 18, no. 3, Aug. 1995, pp. 448-455.
51. D.
Carey, "Programmable multichip module technology," Hybrid Circuit
Technology (August 1991) 25‑29.
52. K.
Gilleo, "The SMT chip carrier: enabling technology for the MCM,''
Electronic Packaging & Production, September 1993, pp. 88‑89.
53. Hewlett
Packard, Semiconductor Systems Center US‑SSC VL‑MTC and VL-CM,
12/3/91.
54. M.
MacDougall, Simulating Computer Systems, MIT Press, 1987.
55. R.
Pearson and H. Malek, "Active silicon substrate multi‑chip module
packaging for spaceborne signal/data processors," Government Microcircuit
Applications Conference (GOMAC), 1992.
56. K.
K. Roy, "Multichip module deposited ‑‑‑ reliability
issues," Materials Developments in Microelectronic Packaging Conference
Proceedings (Montreal, August 19‑22, 1991), pp. 305‑309.
57. C.
Thibeault, Y. Savaria, and J. L., Houle, "Impact of reconfiguration logic
on the optimization of defect‑tolerant integrated circuits," Fault‑Tolerant
Computing: The Twentieth International Symposium, IEEE Computer Society Press,
1990, pp. 158‑165.
58. Haruhiko
Yamamoto, "Multichip module packaging for cryogenic computers," 1991
IEEE International Symposium on Circuits and Systems V. 4 (IEEE Service Center,
Piscataway, NJ cat. no. 91CH3006‑4), pp. 2296‑2299.
write 0 in all cells;
read all cells;
write 1 in all cells;
read all cells;
Figure 12.3.1: Zero‑One algorithm.
write 1 in all cells in group 1 and 0
in all cells in group 2;
read all cells;
write 0 in all cells in group 1
and 1 in all cells in group 2;
read all cells;
Figure 12.3.2: Checkerboard test algorithm.
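For concreteness, here is a hedged Python rendering of the two figure algorithms above, modeling the memory as a flat list; the checkerboard grouping by address parity is an assumption (real checkerboard tests group physically adjacent cells, which may not correspond to consecutive addresses).

def zero_one_test(mem):
    # Figure 12.3.1: write all 0s, read back; write all 1s, read back.
    for value in (0, 1):
        for addr in range(len(mem)):
            mem[addr] = value
        for addr in range(len(mem)):
            if mem[addr] != value:
                return False        # read mismatch: faulty cell
    return True

def checkerboard_test(mem):
    # Figure 12.3.2: alternate 1s and 0s between the two groups, then swap.
    for phase in (0, 1):
        for addr in range(len(mem)):
            mem[addr] = (addr + phase) % 2
        for addr in range(len(mem)):
            if mem[addr] != (addr + phase) % 2:
                return False
    return True

memory = [0] * 64
print(zero_one_test(memory), checkerboard_test(memory))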