\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{biblatex}
\addbibresource{fault_injection_async25.bib}
%\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{orcidlink}
\usepackage[shortcuts,acronym]{glossaries}
\usepackage{subcaption}
% Tikz because graphs are fun
\usepackage{tikz}
\usepackage{tikz-timing}
\usepackage{tikz-timing-overlays}
\usepackage{tikz-timing-advnodes}
\usetikzlibrary{positioning}
\makeglossaries
% Acronyms for the document
\newacronym{dut}{DUT}{Design Under Test}
\newacronym{api}{API}{Application Programming Interface}
\newacronym{wchb}{WCHB}{Weakly Conditioned Half Buffer}
\newacronym{qdi}{QDI}{Quasi Delay Insensitive}
\newacronym{set}{SET}{Single Event Transient}
\newacronym{seu}{SEU}{Single Event Upset}
\newacronym{sed}{SED}{Single Event Delay}
\newacronym{prs}{PRS}{Production Rule Set}
\newacronym{uvm}{UVM}{Universal Verification Methodology}
% Simple citation required command
\newcommand{\citationneeded}{\textcolor{red}{[citation needed]}}
\newcommand{\referenceneeded}{\textcolor{red}{[reference needed]}}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\title{Lights, camera, \texttt{action}: A novel workload distribution tool applied to fault space exploration\\
\thanks{Identify applicable funding agency here. If none, delete this.}
}
\author{\IEEEauthorblockN{Fabian Posch\, \orcidlink{0009-0009-9272-1633}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fabian.posch@student.tuwien.ac.at}
\and
\IEEEauthorblockN{Florian Huemer\, \orcidlink{0000-0002-2776-7768}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fhuemer@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Andreas Steininger\, \orcidlink{0000-0002-3847-1647}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
steininger@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Rajit Manohar}
\IEEEauthorblockA{Computer Systems Lab\\
Yale University\\
New Haven, CT 06520, USA\\
rajit.manohar@yale.edu}
}
\maketitle
\begin{abstract}
As a leading toolchain for asynchronous logic development, ACT offers a comprehensive environment for chip design and research. Its open nature allows for extensive customizability, enabling optimizations beyond what industry-grade tools typically provide. Building on this foundation, we introduce \texttt{action}, a new addition to the ACT toolchain that enables distributed build and compute tasks. To demonstrate its flexible extension interface, we developed a transient-fault-injection engine that significantly improves upon previous designs, both through deeper integration with the ACT tools and through better injection distribution heuristics.
These innovations eliminate the need for additional injection-related logic within the design while also reducing development effort, as testing infrastructure for behavioral validation can simply be reused. Additionally, only the design under test needs to be simulated at the gate level, while the auxiliary testing harness can stay at higher levels of abstraction. Finally, we also reduce the number of necessary injections by targeting high-fanout signals more heavily, discovering more faults per injection.
To validate our setup, we benchmarked it against existing fault-injection tools, demonstrating its performance in both simulation efficiency and the overall number of injections needed to achieve representative results.
% Alternate abstract after test results
% To validate our setup, we benchmarked it against existing fault-injection tools, demonstrating substantial improvements in both simulation efficiency and the overall number of injections needed to achieve representative results, thus enabling better scaling as target designs grow more complex.
\end{abstract}
\begin{IEEEkeywords}
asynchronous circuits, SET, fault-tolerance, cluster computing, computer aided design, parallel computing
\end{IEEEkeywords}
\section{Introduction}
To make new things, we require tools. But while commercial tools offer access to the current state of the industry, they are usually not customizable enough (as they tend to be closed source) or, for more specialized applications, not available at all. This problem is well understood for asynchronous logic, as the commercial offerings' focus on synchronous designs limits functionality for everything outside their scope. And while many of these problems have been mitigated by the publication of the open-source ACT toolchain by the Yale AVLSI group \cite{manoharOpenSourceDesign}, local compute often does not suffice for more complex tasks.
Especially for tasks that lend themselves to a high degree of parallelization, cluster computing offers substantial potential speedups. For this reason, we have built a tool that does just that, while offering a simple \acs{api} to vastly extend its functionality. Our goal was to create a framework to build on, and we present a real-world use case to demonstrate this capability.
Exposing digital circuits to environments like space can break some of the most basic assumptions we make when designing them. Given current levels of miniaturization, high-energy particles raining upon the millions of interconnects in an average design can introduce unexpected behavior. These undesired deviations from the design specification, or \emph{failures}, need to be well understood to make predictions about a design's robustness.
Synchronizing logic to a clock cycle, while potentially compromising on average-case performance, has the helpful side effect of creating a temporal mask for logic faults. This means that when an erroneous value is induced on a wire, only a small window of time exists in which this value can propagate beyond the next logic buffer. \\
In asynchronous logic, we unfortunately lack this convenient abstraction. While we assume that temporal masking also plays a role in asynchronous logic, albeit a much less obvious one \cite{huemerIdentificationConfinementFault2020}, environmentally induced faults still pose a much higher risk than in a clock-synchronized design.
But often much more important than knowing \emph{if} a design can fail under certain (extreme) circumstances is \emph{how} exactly these failure modes play out. Certain use cases might call for, or even mandate, safety in the form of known failure modes for critical systems. While multiple attempts have been made in the past to create tooling for fault-space exploration \cite{behalExplainingFaultSensitivity2021}, these tools still have several shortcomings, which we feel need to be addressed.
\section{Related Work}
\texttt{action} is an addition to the ACT toolchain initially presented in \cite{manoharOpenSourceDesign}. ACT aims to be a collection of tools for an end-to-end chip design workflow. While the main focus of its tools is asynchronous designs, it is powerful enough to also map designs to synchronous logic families without issue \cite{vezzoliDesigningEnergyEfficientFullyAsynchronous2024}. The current version of the ACT toolflow does include a scripting environment \cite{heInteractInteractiveDesign}; it does, however, not contain a solution for distributing computing tasks, which would be helpful for testing and verification.
Focusing on our specific demo use case, the tool presented in \cite{behalExplainingFaultSensitivity2021} is a fault-injection and fault-space exploration tool, aiming to characterize the fault behavior of a given circuit. It is thus quite similar to the demo use case we show in this paper. It distinguishes the fault classes \emph{timing deviation}, \emph{value fault}, \emph{code fault}, \emph{glitch}, \emph{deadlock}, and \emph{token count error}, which we largely reuse for this paper (more on our system model in Section \ref{sec:system_model/failures}). The core simulator used is QuestaSim (version 10.6c), a commercial simulation tool. To reduce the overall simulation runtime, a cluster-based approach is employed to parallelize simulations over multiple machines. The tool has been designed for the \texttt{pypr} toolchain developed by Huemer at TU Wien \cite{huemerContributionsEfficiencyRobustness2022}, a production-rule-based circuit description framework in Python. Notably, the system calculates the number of required injections from an average injection density, independently of which signal is targeted. This is one of the main points we will try to improve upon.\\
% should i include work in master thesis?
An iteration of this system can be found in \cite{schwendingerEvaluationDifferentTools2022a}. While based on the same core toolflow, Behal adds limited bridging logic to the ACT toolchain, using \texttt{prsim} \cite{manoharOpenSourceDesign} as an alternative simulator. This change requires low-level simulation of additional logic, as certain required features were not supported by \texttt{prsim} and no extension to the core simulator code was written. This again is a major point for potential improvement.
Finally, we want to briefly touch on different fault-mitigation techniques seen in the literature. \\
Bainbridge and Salisbury \cite{bainbridgeGlitchSensitivityDefense2009} discuss the basic possibilities for fault behavior in \ac{qdi} circuits. Much like \cite{behalExplainingFaultSensitivity2021}, they identify specific scenarios that can occur when a \ac{set} is injected into a circuit. We will come back to this in Section \ref{sec:system_model/failures} as well. They then lay out basic mitigation techniques, which largely focus on either introducing some form of redundancy into the circuit or reducing the temporal size of the window in which faults are converted into failure behavior (the sensitivity window).
In a similar fashion, Huemer et al. \cite{huemerIdentificationConfinementFault2020} present interlocking and deadlocking versions of a \ac{wchb}. These are also meant to reduce the sensitivity window size, as well as to prevent the propagation of illegal symbols. We will use their implementations of interlocking and deadlocking \acp{wchb} in this paper (more in Section \ref{sec:experiment_setup}).
% should we maybe put this a bit further up the paper? I mean we want this to be the main point, no?
\section{Tooling}
\label{sec:tooling}
\begin{figure}
\centering
\includegraphics[width=.45\textwidth]{graphics/action_architecture.pdf}
\caption{High level architecture of an \texttt{action} cluster}
\label{fig:tooling/architecture}
\end{figure}
\texttt{action} itself is a tool flow framework. Its main job is to provide a build system that can act both locally and remotely, shifting computing tasks away from the end-user machine. This means that the user can perform other tasks, or the connection to the user can be interrupted, while computation continues remotely without further intervention.
To configure \texttt{action} for a certain task, a sequence of tool invocations is defined in a \emph{pipeline} file in YAML syntax. While primarily meant for use with the ACT toolchain, \texttt{action} is at its core tool agnostic. As long as a corresponding tool adapter, as well as handling capability for the used data types, is provided, any tool (commercial or open source) can be invoked by it. This makes it particularly useful as a base framework for highly parallel and/or computationally intense applications. It also alleviates interaction with clustering software for everyday tasks, as only the local command-line tool needs to be invoked to perform pipeline execution.
On a high level, \texttt{action} in its base architecture consists of the client application, a controller/database, and several compute nodes (see Figure \ref{fig:tooling/architecture}). On invocation, the client tool first loads input data, performs local tasks, and then uploads the required data to the controller database. From there, the compute nodes fetch open tasks and upload their results back to the controller node when done. For simulation tasks, the nodes already perform a pre-analysis of the logs to reduce the amount of required post-processing.
The fault-injection tool presented in this paper is a demonstration of \texttt{action}'s testcase generation engine, as well as of its distributed computing capability, with \texttt{actsim} as the target tool. \texttt{action} is currently in its early stages, but the process of getting the code ready for a future open-source release is well underway. \texttt{action} also only uses open-source dependencies, enabling cheap and easy scaling for any application without concern about the potential financial impact.
In addition to the build system itself, we present a new simulation library, which is already being shipped with \texttt{actsim}\footnote{\url{https://github.com/asyncvlsi/actsim}}, and which we use for harnessing the \ac{dut} in our tests. Compared to previous attempts, using \texttt{actsim} as our simulator has the additional advantage of allowing mixed-fidelity simulation: only the \ac{dut} itself is simulated at the gate level, while supporting logic (testbench, data sources) is simulated at a higher level of abstraction.
To support \ac{set} injection in \texttt{actsim}, we have added the functionality as a core command to the open-source simulation engine. This offers great performance advantages, as no additional logic has to be simulated, nor does the simulation engine have to be halted. Injections are treated as an additional type of event in the simulator event queue, allowing the injection timing and location to be specified before the simulation engine is started. In addition, we have implemented a \acf{sed} command, which forces a node delay to a specified value once. This is not a new class of transient faults, but a specific sub-class of \acp{set}. While we do not make use of targeted timing changes in this paper, their inclusion in the simulator engine might prove useful in future investigations.
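As a purely conceptual illustration of this event-based view (a minimal sketch of a generic discrete-event core, not \texttt{actsim}'s actual implementation), injections can simply share the simulator's time-ordered event queue with regular signal updates:
\begin{verbatim}
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: int
    action: object = field(compare=False)

class EventQueue:
    # regular updates and fault injections share one
    # time-ordered queue, so no extra injection logic
    # is simulated and the engine never has to halt
    def __init__(self):
        self.events = []

    def schedule(self, time, action):
        heapq.heappush(self.events, Event(time, action))

    def run(self):
        while self.events:
            ev = heapq.heappop(self.events)
            ev.action(ev.time)

# schedule a transient on a victim node up front:
# force at t=100, release again at t=105
q = EventQueue()
q.schedule(100, lambda t: print(t, "force node to 1"))
q.schedule(105, lambda t: print(t, "release node"))
q.run()
\end{verbatim}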
Finally, \texttt{actsim} can now check for violations of \texttt{excl-hi} constraints through the invocation of a new command-line option, which is used to detect erroneous codings on m-of-n coded channels.
\section{System Model}
\label{sec:system_model}
It is important to note the fundamental difference between a fault and a failure in this context. A failure is the inability of a system to perform its specified task. Failures are caused by faults in the system, which can stem from design errors as well as from external conditions \cite{nelsonFaulttolerantComputingFundamental1990}. For this paper, we will only consider faults caused by external factors, as opposed to internal design faults.
\subsection{Fault model}
\label{sec:system_model/faults}
This work mainly focuses on the effects of \acfp{set}, where a wire is forced to a specific value, independently of the value the design would dictate. These transient effects can occur due to physical effects like ionizing radiation exposure or electromagnetic interference. After the \ac{set} ends, the wire returns to normal operation. This is an important distinction from a \acf{seu}, where a faulty value is latched by a memory cell. An \ac{set} can, but does not necessarily, lead to an \ac{seu}.
\begin{figure}
\centering
\begin{subfigure}{0.4\textwidth}
\begin{center}
\scalebox{0.85}{\input{graphics/tri_fork_possible.tex}}
\end{center}
\caption{Node with 3-fork, only one leaf receives the transient}
\label{fig:model/3_fork_possible}
\end{subfigure}
\bigskip
\begin{subfigure}{0.4\textwidth}
\begin{center}
\scalebox{0.85}{\input{graphics/tri_fork_impossible.tex}}
\end{center}
\caption{Node with 3-fork, two leaves receive the transient}
\label{fig:model/3_fork_impossible}
\end{subfigure}
\caption{Fanout configurations of nodes and representation in model}
\label{fig:model/forks}
\end{figure}
We simulate our \ac{dut} on the \ac{prs} level. In the simulator, one \ac{prs} node consists of a pull-up, a pull-down, a weak pull-up, and a weak pull-down stack, resulting in one output value. When an \ac{set} is injected, this output value is manually overridden; such an override does, however, not necessarily incur an output value change. If an \ac{set} is injected into a node, the output value of the node must be different from the forced value for the \ac{set} to be visible at the output. For an \ac{set} to propagate, the child nodes of the victim must be sensitive to the respective input, meaning that a change of the targeted parent would also change their value. Since we attack only one output per simulation, we can only represent scenarios where, given a fanout tree free of non-sensitive children, one node expresses an incorrect output value. A visual representation of these limitations is shown in Figure \ref{fig:model/forks}, where the scenario in Subfigure \ref{fig:model/3_fork_possible} can be simulated in our model, while the one in Subfigure \ref{fig:model/3_fork_impossible} cannot.
We feel comfortable with this limitation, as transients usually occur either in the junctions of a given gate or on a transmission wire \cite{ferlet-cavroisSingleEventTransients2013}, which leads to fault behavior similar to what our model can produce.
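Stated compactly, the visibility and propagation conditions described above amount to the following predicates (a sketch; the names are ours):
\begin{verbatim}
def set_visible(out_val, forced):
    # the transient only shows at the victim's output
    # if it differs from the currently driven value
    return out_val != forced

def set_propagates(out_val, forced, child_sensitive):
    # ...and only reaches a child whose own output
    # would change if this particular input changed
    return set_visible(out_val, forced) \
        and child_sensitive
\end{verbatim}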
\subsection{Failure model}
\label{sec:system_model/failures}
When a transient is injected into a node, and given that it is neither masked by the current state of the wire nor by the child currently being outside its sensitivity window, we can differentiate between the following outcomes:
\begin{itemize}
\item Nothing; the glitch is masked by the target already being at, and organically staying at, the forced value for the duration of the glitch.
\item The node is already at the forced value but would naturally transition during the glitch; the circuit experiences a temporary slowdown but should be able to resume normal operation afterwards. We classify this as a \emph{timing failure} (or a potential \emph{value failure} in non-DI logic families)
\item The glitch removes the spacing between value tokens and one of them is lost; this manifests as a \emph{token count failure} (and potentially a \emph{value failure})
\item The glitch acts as additional spacing between what is perceived as two data tokens, injecting an additional one into the pipeline. This would also result in a \emph{token count failure} (and potentially a \emph{value failure})
\item The glitch changes the value of a data line and thus creates a \emph{value failure} or, in a DI-coding, a potential \emph{coding failure}
\item A non-recoverable state is reached in the circuit, resulting in a \emph{deadlock}
\item Additionally, if a static-0 or static-1 hazard propagates to the edge of the design, we register it as a \emph{glitch}
\end{itemize}
These failure modes are in accordance with the potential states of a circuit from \cite{bainbridgeGlitchSensitivityDefense2009}; we thus reuse the same failure classification as already presented in \cite{behalExplainingFaultSensitivity2021}.
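During the log pre-analysis on the compute nodes, these outcomes can be tagged with a simple enumeration (illustrative only; the names mirror the classification above):
\begin{verbatim}
from enum import Enum, auto

class Failure(Enum):
    TIMING = auto()       # temporary slowdown only
    VALUE = auto()        # wrong data value observed
    CODING = auto()       # illegal DI code word
    TOKEN_COUNT = auto()  # token lost or inserted
    GLITCH = auto()       # hazard reaches the outputs
    DEADLOCK = auto()     # non-recoverable state
\end{verbatim}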
\subsection{Injection Strategy}
To improve our failure detection per simulated injection, we target our simulation efforts based on signal fanout. We use this metric both for signal selection prioritization (if the injector is not set to select all signals) and for determining the number of injections necessary for a given signal. This is in contrast to previous efforts \cite{behalExplainingFaultSensitivity2021,schwendingerEvaluationDifferentTools2022a}.
Signal selection is based on weighted reservoir sampling \cite{efraimidisWeightedRandomSampling2006}: each signal is assigned a random key $\omega$ that falls off exponentially towards low fanouts,
\begin{align}
\omega = R^{1/\tilde{F}}, \qquad \tilde{F} = \frac{F - F_{min}}{F_{max}},
\end{align}
where $R$ is drawn uniformly at random from $[0,1)$ and $\tilde{F}$ is the normalized fanout of the signal, computed from its fanout $F$ and the minimum and maximum fanouts $F_{min}$ and $F_{max}$. The signals with the largest keys are selected.
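A minimal Python sketch of this selection step (illustrative only, not \texttt{action}'s implementation; the variable names and the small \texttt{eps} guarding the minimum-fanout signal are our own choices) could look as follows:
\begin{verbatim}
import heapq, random

def select_signals(fanouts, k, eps=1e-9):
    # fanouts: dict mapping signal name -> fanout
    # k: number of signals to select
    f_min = min(fanouts.values())
    f_max = max(fanouts.values())
    heap = []  # min-heap of (key, signal)
    for sig, f in fanouts.items():
        # normalized fanout, clamped to stay positive
        f_norm = max((f - f_min) / f_max, eps)
        # Efraimidis-Spirakis key: R ** (1 / weight)
        key = random.random() ** (1.0 / f_norm)
        if len(heap) < k:
            heapq.heappush(heap, (key, sig))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, sig))
    return [sig for _, sig in heap]
\end{verbatim}
Keeping the $k$ largest keys in a min-heap yields the usual one-pass weighted reservoir behavior, so high-fanout signals are preferentially retained.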
To calculate the number of required injections per signal, we use the Coupon Collector's problem, with the expected failure modes playing the role of the coupons (tokens). The tool is set to expect a fixed number $m$ of failure modes per unit of node fanout. From this, we first calculate the expected number of required injections and then use the Markov inequality to bound this number such that all tokens are found with a specified probability.
\begin{align}
n &= F \cdot m \\
E(T) &= n H_n \\
P(T \geq c n H_n) &\leq \frac{1}{c}
\end{align}
where $F$ represents the fanout, $m$ is the assumed number of failure modes per unit of fanout, $n$ is the total number of expected failure modes for this signal, $T$ is the actual number of draws required to collect all tokens, and $H_n$ is the $n$th harmonic number. Rearranging this leads to the estimated total number of injections $\hat{T}$ being calculated as
\begin{align}
\hat{T} = n H_n \cdot \frac{1}{(1 - P_{cov}) \cdot P_{hit}}
\end{align}
where $P_{cov}$ is the desired probability of observing all expected failure modes (the coverage target) and $P_{hit}$ is the probability of an injection hitting a sensitive window. We have set the latter to $0.001$ based on previous experiments \cite{behalExplainingFaultSensitivity2021}. As $\hat{T}$ is calculated per selected signal, the number of required injections grows linearly with the number of selected signals (given identical fanout for all of them) and approximately as $n \log n$ in the fanout of a signal, for a fixed assumed number of failure modes per fanout, coverage probability, and hit probability.
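For illustration (the coverage target $P_{cov} = 0.9$ as well as $F$ and $m$ are chosen here purely as example values), a signal with fanout $F = 4$ and $m = 2$ assumed failure modes per fanout yields
\begin{align*}
n &= F \cdot m = 8, & H_8 &\approx 2.72, & n H_n &\approx 21.7,\\
\hat{T} &\approx \frac{21.7}{(1 - 0.9) \cdot 0.001} \approx 2.2 \times 10^{5}
\end{align*}
injections for that single signal, showing that in this example the hit probability $P_{hit}$ dominates the overall injection count.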
\section{Experiment Setup}
\label{sec:experiment_setup}
\begin{figure}
\centering
\includegraphics[width=.45\textwidth]{graphics/testbench.pdf}
\caption{Testbench architecture}
\label{fig:setup/testbench}
\end{figure}
To test our new tool, we ran our setup against previous results by Behal et al. \cite{behalExplainingFaultSensitivity2021}. We simulated the same multiplier circuit, using the same buffer styles as well. We did, however, not sweep over any form of pipeline load factor, as we found the definition of this metric ambiguous, especially once the circuit contains non-linear pipelines. For this reason, we have opted to exclude it from further testing.
The \ac{dut} is wrapped in a \acs{uvm}-like testbench setup, which is provided by our new simulation library. Our future ambition is to enable a write-once, use-everywhere architecture, where wrapper code has to be written once and can then be reused arbitrarily for all tests and verification procedures. The overall architecture of the test setup can be seen in Figure \ref{fig:setup/testbench}. Since asynchronous logic, contrary to synchronous logic, inherently uses a message-passing abstraction just like \acs{uvm}, we do not require much additional logic in the way of sequencers or monitors to interface with the \ac{dut}. Input tokens are directly forwarded to the \ac{dut}, model, and scoreboard.
Points to talk about
\begin{itemize}
\item how much more detail should i mention about the target circuit and families? we're starting to run low on space
\item can I have the tikz graphics that were in the Behal paper for deadlocking, interlocking\dots
\end{itemize}
\section{Results}
Points to talk about
\begin{itemize}
\item Compared to Behal: how many failures were we able to find with our new tool vs with the old tool
\item Compared to Behal: how efficient (failures found / injection) is this setup compared to previous attempts
\item Dflow: how do certain families of async and sync compare
\end{itemize}
\section{Conclusion}
\printacronyms
\printbibliography
\end{document}