
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{biblatex}
\addbibresource{fault_injection_async25.bib}
%\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{orcidlink}
\usepackage[shortcuts,acronym]{glossaries}
\makeglossaries

% Acronyms for the document
\newacronym{dut}{DUT}{Design Under Test}

% Simple citation required command
\newcommand{\citationneeded}{\textcolor{red}{[citation needed]}}

\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\title{Lights, camera, \texttt{action}: Towards an efficient tool for fault space exploration\\
\thanks{Identify applicable funding agency here. If none, delete this.}
}

\author{\IEEEauthorblockN{Fabian Posch\, \orcidlink{0009-0009-9272-1633}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fabian.posch@student.tuwien.ac.at}
\and
\IEEEauthorblockN{Florian Huemer\, \orcidlink{0000-0002-2776-7768}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fhuemer@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Andreas Steininger\, \orcidlink{0000-0002-3847-1647}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
steininger@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Rajit Manohar}
\IEEEauthorblockA{Computer Systems Lab\\
Yale University\\
New Haven, CT 06520, USA\\
rajit.manohar@yale.edu}
}

\maketitle

\begin{abstract}
ACT is a versatile and powerful toolchain for the design, simulation and manufacturing of asynchronous circuits. When fault tolerance is a priority, being able to simulate the possible effects of faults as part of the design process is a significant advantage.\\
In this paper we present an extension of ACT that enables flexible and comprehensive (transient) fault injection. Key features of this tool are native integration into the existing language framework, distributed computation to reduce wait times, and an improved fault distribution model compared to previous attempts. We describe the targeted fault model, the implementation, the issues that had to be overcome, and an application example that demonstrates the use and capabilities of the tool.
\end{abstract}

\begin{IEEEkeywords}
NEEDS TO BE CHANGED
\end{IEEEkeywords}

\section{Introduction}

Exposing digital circuits to harsh environments such as space can break some of the most basic assumptions we make when designing them. Given the level of miniaturization available today, high-energy particles striking the millions of interconnects in an average design can introduce unexpected behavior. These undesired deviations from the design specification, or \emph{failures}, need to be well understood in order to assess a design's robustness.

Synchronizing logic to a clock, while potentially compromising average-case performance compared to asynchronous logic, has the helpful side effect of creating a temporal mask for logic faults: when an erroneous value is induced on a wire, only a small window of time exists in which this value can propagate beyond the next logic buffer. \\
In asynchronous logic, we unfortunately lack this convenient abstraction. While we assume that temporal masking also plays a role in asynchronous logic, albeit a much less obvious one \citationneeded, environmentally induced faults still pose a much higher potential risk than in a clock-synchronized design.

Often, however, knowing \emph{how} exactly a design fails under certain (extreme) circumstances is much more important than knowing \emph{if} it can fail. Certain use cases call for, or even mandate, safety in the form of known failure modes for systems that are critical in their area of application. While multiple attempts have been made in the past to create tooling for fault-space exploration \citationneeded, these tools still have several shortcomings that we feel need to be addressed. \\
First, such tooling should be a native part of the ACT suite, published by the Yale AVLSI group \citationneeded, which is slowly emerging as the go-to standard toolchain for asynchronous logic design. While previous attempts have partially integrated with it \citationneeded, the base toolchain has since made significant progress, such as a new simulator \citationneeded. Additionally, the old tool was more of an adapter between ACT and the original workflow \citationneeded, which we feel can be improved. \\
Second, the previous tool does not account for the potentially complex knock-on effects a given signal can have on the rest of the \ac{dut}. Instead, it uses the average insertion density as a stand-in metric to decide whether enough tests have been performed. We feel this can be improved upon using a more sophisticated stochastic framework.

\section{Related Work}

Points to talk about

\begin{itemize}
    \item ACT toolchain in a nutshell
    \item previous works by TU Wien
    \item what fault model did they use
\end{itemize}

\section{Fault Model}

\subsection{On fault nomenclature}

Points to talk about

\begin{itemize}
    \item different types of faults that can occur
    \item upset vs transient
    \item single event delay (if we want to throw that in)
\end{itemize}


\subsection{Per-Node Fault Space}

Points to talk about

\begin{itemize}
    \item a fault is injected as the output of one node diverging from its specification
    \item show why this makes sense: only certain input combinations activate a gate in a way that lets it produce an erroneous output; everything else is logically masked $\rightarrow$ simulating those cases is pointless anyway
    \item which fault scenarios can and cannot be simulated
    \item show some graphs for this
    \item talk about the coupon collector's problem and certainty of coverage, Markov's inequality\dots
\end{itemize}
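
A rough estimate of the number of injections needed for full coverage follows from the coupon collector's problem. The sketch below is illustrative only and rests on an idealized assumption that is not taken from the tool itself: each injection selects one of $n$ fault sites independently and uniformly at random. The symbols $n$, $T$, $H_n$, and $c$ are introduced here purely for illustration. Let $T$ denote the number of injections until every site has been hit at least once. Then
\begin{equation}
\mathbb{E}[T] = n \sum_{k=1}^{n} \frac{1}{k} = n H_n \approx n \ln n + \gamma n + \tfrac{1}{2},
\end{equation}
where $\gamma \approx 0.5772$ is the Euler--Mascheroni constant. Markov's inequality bounds the probability of needing many more injections than expected: for any $c > 0$,
\begin{equation}
\Pr\left[T \geq c \, n H_n\right] \leq \frac{1}{c}.
\end{equation}
A union bound over all sites, each of which is missed by $t$ injections with probability $(1 - 1/n)^t \leq e^{-t/n}$, yields the sharper tail estimate
\begin{equation}
\Pr\left[T > \lceil n \ln n + c n \rceil\right] \leq e^{-c}.
\end{equation}
As a worked example under the same assumption, $n = 100$ sites give $\mathbb{E}[T] \approx 519$ injections. Choosing $c = \ln 100 \approx 4.61$ in the union bound shows that $922$ injections suffice to leave some site untouched with probability at most $1\%$, whereas Markov's inequality alone would require roughly $100 \cdot 519 \approx 51{,}900$ injections for the same guarantee.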

\subsection{Types of failure behavior}

Points to talk about

\begin{itemize}
    \item types of failures observed at the output
\end{itemize}

\subsection{Discussion of Pipeline Load Factor}

Points to talk about

\begin{itemize}
    \item when does PLF make sense to begin with
    \item when does it not make sense
    \item why have we not really included it in this analysis
\end{itemize}

\subsection{Injection Strategy}

Points to talk about

\begin{itemize}
    \item fault distribution: skewed by node fanout instead of average injection density
    \item how does runtime scale with circuit size, linear with number of nodes
\end{itemize}
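
One possible way to make a fanout-skewed distribution concrete is sketched below. The weighting rule is an assumption made purely for illustration and is not confirmed as the tool's behavior: each injection independently selects node $i$ with probability proportional to its fanout $f_i \geq 1$. The symbols $f_i$, $p_i$, $N$, and $\delta$ are hypothetical names introduced for this sketch.
\begin{equation}
p_i = \frac{f_i}{\sum_{j=1}^{n} f_j}.
\end{equation}
After $N$ injections, node $i$ receives $N p_i$ injections in expectation, and the probability that it is never selected is
\begin{equation}
(1 - p_i)^N \leq e^{-N p_i}.
\end{equation}
Hence $N \geq \ln(1/\delta) / p_i$ injections suffice to keep the probability of missing node $i$ below $\delta$, so under this rule the node with the smallest fanout dictates the overall budget. For comparison, uniform weights ($p_i = 1/n$) turn this per-node budget into $n \ln(1/\delta)$, which grows linearly in $n$ for a fixed $\delta$.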

\section{Proposed Fault-Injection Tool}

Points to talk about

\begin{itemize}
    \item workflow: setup of harness, similarity to UVM, testbench design intended as design once, use for entire verification workflow
    \item why is this better than before? Performance improvements, not everything is simulated at gate level anymore, actsim is a mixed level simulator; \acs{dut} is simulated at gate level, while harness is simulated at higher level of abstraction
    \item changes to actsim? Addition of value overriding, addition of delay overriding; Addition of bounded stochastic delay?
    \item Using dflowmap means we can easily target different families of asynchronous circuits and even synchronous circuits and compare
    \item results database and post-processing
\end{itemize}

\section{Experiment Setup}

Points to talk about

\begin{itemize}
    \item what was the target circuit
\end{itemize}


\section{Results}

Points to talk about

\begin{itemize}
    \item how many failures were we able to find with our new tool vs with the old tool
    \item how efficient (failures found / injection) is this setup compared to previous attempts
    \item how do certain families of async and sync compare
\end{itemize}

\section{Conclusion}

\printacronyms

\printbibliography

\end{document}