The system health state indicates whether a system will be able to meet the performance specifications now and for some future time. It goes beyond just monitoring its current state and testing for correctness. But also, it predicts how long it will continue to function correctly. This type of predictive evaluation is motivated by the increasing reliability requirements and widespread use of semiconductor systems. The current state of the art employs sensors and monitors to gather, interpret, analyze, and utilize data streams in order to evaluate the system health state at runtime. Power consumption, current, temperature, and timing behavior of a circuit are among the health indicators of the circuit. This information is, however, collected with an additional hardware overhead and with a performance penalty due to additional workload. Additionally, every additional hardware component is another lifetime challenge since they will all age, as well.
In this doctoral project, one of the main objectives is to reduce the additional overhead and costs of the instrumentation and extract as much information as possible from the usual system behavior. Built-in-self-test (BIST) is a popular technique which allows hardware to test itself and evaluate the responses. A BIST architecture consists of a pattern generator that produces required patterns; a large volume of test responses is then compressed into signatures through different strategies. We utilize the BIST architecture in this study to assess the system health state at a lower hardware and performance cost. We use adaptive voltage and frequency scaling mechanisms to keep track of how the circuit behavior changes over time and investigate the effect of each degradation on the signatures of a BIST architecture. We use this investigation to train Machine Learning (ML) models to predict the health state.
The main research challenges comprise Analysis and Classification. As we explore defect mechanisms, our aim is to investigate how reliability threats behave under variant conditions. We can characterize these different reliability threats by studying their behavior under various frequencies and voltages.
Machine Learning techniques will be used to classify observed changes over time and to estimate the remaining lifetime.
The following scientific challenges need to be addressed in this doctoral study in order to assess a system's health
- Analyzing the hardware, and looking in defects that how they may involve overtime
- Analyzing the circuit behavior and see how it changes overtime based on the defect
- Measurement of the aging process using the data from the early life stages
- Classification of defects according to their behavior
- Runtime prediction of the system health state using Machine Learning methods
An extensive number of instruments is needed to facilitate cost-efficient bring-up, debug, test, diagnosis, and adaptivity in the field. Reconfigurable Scan Networks (RSNs)[1, 2] provide a exible way to access such instruments. At the same time, the dependability properties of the device-under-test (DUT) can be effected through an improper RSN integration. In this doctoral project, the problem of dependable RSN integration and its use during runtime operation are investigated. The functional safety and fault-tolerance, as well as security compliance of the RSN with the properties of the functional design, are addressed.
The main goal is to investigate the dependability properties of RSNs and to provide an automated framework, capable to generate a dependable RSN for a given DUT. Various dependability aspects of RSNs and corresponding methods and algorithms for the RSN analysis and consequent integration are investigated. The doctoral project aims to achieve the following goals:
- Security-compliant RSN: An RSN can be potentially misused to sniff or spoof the data if additional data channels between instruments are introduced through the RSN integration. Only those dependencies between instruments, which are present in the DUT, should exist in the RSN. An accurate security compliance analysis of an RSN must be developed to guide the construction of a secure RSN. Both methods to repair an existing RSN and to automatically generate a dependable RSN, for a given DUT must be investigated.
- RSN for offline and online test: Reconfigurable Scan Networks have been initially proposed for offline test and diagnosis of integrated circuits. However, RSNs can also be efficiently reused for online test and diagnosis. Optimizing criteria for RSN generation as access latency, throughput or area overhead, must be formalized and the corresponding RSN architectures have to be developed.
- Functionally safe & fault-tolerant RSN: A single fault in a scan design can lead to inaccessibility of a safety-critical instrument and consequently might lead to a system failure. An integrated RSN must be fault-tolerant to provide reliable test and diagnosis results for DUT even in presence of faults during online and offline system operation. Moreover, the RSN itself has to remain testable and diagnosable.
- Dependable RSN: Pareto-optimal solution, considering the aforementioned dependability aspects and hardware costs, must be proposed.
To ensure the security compliance of a given RSN with the DUT we developed an approach to accurately calculate the data paths between the instruments introduced through the RSN integration and analyzed, whether those data paths are a subset of initial paths in the DUT. If any new paths are introduced, the RSN has to be modified in a most cost-efficient way .
To ensure efficient operation of an RSN for offline test and diagnosis, such criteria as hardware overhead or power con- sumption can be considered. Safety-critical components must be tested periodically and concurrently to the main functionality of the design. They must be accessed in runtime through an RSN having a low access latency. For periodic access to such instruments as BIST, a heuristic must set up an optimal schedule, considering the retargeting capabilities of RSNs. RSNs are also used to perform access to the aging monitors. Since the number of monitors can be large, access to them must be optimized to approach all of them in a reasonable amount of time.
To guarantee the reliable operation of DUT such fault-tolerance techniques as selective hardening, DMR/TMR or ECC can be applied, and reliability instruments, such as aging monitors or Built-In Self-Test (BIST), can be integrated on chip. Error detection and error correction codes can be used to detect erroneous output values of the original design and mask them to a limited extent. An RSN structure, capable to transport error indicators efficiently and autonomously, will be developed. The dependable RSN must be at least as fault-tolerant as the original DUT and certain aforementioned methods with acceptable hardware overhead, e.g. selective hardening, can be applied to an RSN. If a fault occurs in an RSN itself, the fault location must be detected and possibly corrected within a short time. Robust hierarchical RSN can be augmented with additional error propagation paths, providing short fault detection and reaction period.
This doctoral project is funded by the German Research Foundation (DFG) under grant WU 245/17-2 (ACCESS).
- "IEEE Standard for Test Access Port and Boundary-Scan Architecture," IEEE Std 1149.1-2013 (Revision of IEEE Std 1149.1-2001), pp. 1-444, May 2013.
- "IEEE Standard for Access and Control of Instrumentation Embedded within a Semiconductor Device," IEEE Std 1687-2014, pp. 1-283, Dec. 2014.
- N. Lylina, A. Atteya, P. Raiola, M. Sauer, B. Becker, and H.-J. Wunderlich, "Security Compliance Analysis of Reconfigurable Scan Networks," in Proceedings of the IEEE International Test Conference (ITC'19), 2019.
1. Main goal: As the technology shrinks, SoCs suffer more from reliability vulnerabilities such as early life failures. Current test methods using conventional fault models like "stuck at fault" or "transient fault", even with high fault coverage, still result in high rate of Early Life Failure (ELF) which shows that conventional fault models are not sufficient anymore. Circuit timing as an indicator for reliability can be used to consider another group of defects which are called marginal defects. Marginal defects may produce small amount of extra delay to the paths of a circuit but it's not big enough to be detected in the conventional manufacturing test. Due to their special characteristics which is indicating some marginal hardware and imperfections, they degrade further when they go in the field and make the device fail after a short period of time.
On the other hand, all modern devices suffer from some process variations. Tests without considering the impact of process variation are no longer valid. That's why variation-aware test came into action. Process variation results into some deviation from the nominal delay of a circuit. This delay can be small enough to pass the manufacturing test which is not a thread to the reliability, since they will stay the same during the life time of the circuit and make no failure. The problem here is that, the duration of the delays due to process variation can be in the same order of magnitude with the delays due to marginal defects. This makes the detection of marginal defects even more challenging. Furthermore over detecting of the slow cells due to process variation will decrease the yield, whereas they can be appropriately speed binned and used as lower performance part.
In this work, we aim to present a novel variation-aware test method which is able to distinguish faulty circuits with marginal defects from fault-free circuits which are just slow due to process variation.
2. Description: The whole process will come in 3 main steps: 1-Modeling, 2-Characterizing, 3-Classification
In order to classify the faulty and fault-free devices (in our case faulty means devices with marginal defects and fault-free means slow circuits due to process variation), first these classes should be modeled to be able to investigate their behavior under different controllable or settable conditions. Therefore standard cells will be defined to model cell with process variation as the fault free class and with additional marginal defect injected to them to model the faulty class.
Next step after modeling the fault-free and faulty classes, is to investigate their behavior under different conditions which gives the characterizing of these two classes. Some controllable points which may have different impact on these two classes need to be investigated. There are many controllable points that can be considered such as "supply voltage" or "temperature". Circuit timing and in particular "propagation delay" is an important observation vector which is used to investigate and compare the behavior of these two classes. Time simulation is being used first at transistor level for standard cells and then at higher abstraction levels to simulate the behavior of the circuit. The time simulation and then the fault simulation with injected marginal faults will provide the characteristics of the cells which are the "functions" of the controllable point. As an example different supply voltages are being applied to these two classes, and two classes of propagation delays which are functions of supply voltage will be collected.
Next and final step is classiffication. These two "classes of functions" will be used as training data to the machine learning algorithms. These machine learning algorithms then are able to classify new cells by using their functions of the controllable points such as supply voltage function, to determine which class they belong to.
With having the characteristics of the cells and being able to classify them, final goal which is the classification of a complete circuits can be achieved. Cell characteristics can be used to make faulty models, BIST or offline testing of a complete circuits. Direct mapping from defined fault-free and faulty cells to the cells used in the circuit under test (CUT) and in addition, interconnects between cells will be investigated to apply the test for a complete CUT.
As a result of all discussed before in this part, a novel and intelligent variation-aware test will be introduced which is able to detect realistic marginal defects in presence of process variation. This will reduce the early life failure of a circuit and as a result increase the reliability while keeping the yield high.
This doctoral project is funded by the German Research Foundation (DFG) under grant WU 245/19-1 (FAST).