
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{biblatex}
\addbibresource{fault_injection_async25.bib}
%\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{orcidlink}
\usepackage[shortcuts,acronym]{glossaries}
\makeglossaries

% Acronyms for the document
\newacronym{dut}{DUT}{Design Under Test}
\newacronym{api}{API}{Application Programming Interface}
\newacronym{wchb}{WCHB}{Weak-Conditioned Half Buffer}
\newacronym{qdi}{QDI}{Quasi Delay Insensitive}
\newacronym{set}{SET}{Single Event Transient}

% Simple citation required command
\newcommand{\citationneeded}{\textcolor{red}{[citation needed]}}

\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\title{Lights, camera, \texttt{action}: Towards an efficient tool for fault space exploration\\
\thanks{Identify applicable funding agency here. If none, delete this.}
}

\author{\IEEEauthorblockN{Fabian Posch\, \orcidlink{0009-0009-9272-1633}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fabian.posch@student.tuwien.ac.at}
\and
\IEEEauthorblockN{Florian Huemer\, \orcidlink{0000-0002-2776-7768}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
fhuemer@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Andreas Steininger\, \orcidlink{0000-0002-3847-1647}}
\IEEEauthorblockA{Institute for Computer Engineering\\
TU Wien\\
Vienna, Austria \\
steininger@ecs.tuwien.ac.at}
\and
\IEEEauthorblockN{Rajit Manohar}
\IEEEauthorblockA{Computer Systems Lab\\
Yale University\\
New Haven, CT 06520, USA\\
rajit.manohar@yale.edu}
}

\maketitle

\begin{abstract}
As a leading toolchain for asynchronous logic development, ACT offers a comprehensive environment for chip design and research. Its open nature allows for extensive customizability, enabling optimizations beyond what industry-grade tools typically provide. Building on this foundation, we introduce \texttt{action}, a new addition to the ACT toolchain that enables distributed build and compute tasks. To demonstrate its flexible extension interface, we developed a transient-fault-injection engine that significantly improves upon previous designs, both through deeper integration with ACT tools and through better injection distribution heuristics.

These innovations eliminate the need for additional injection-related logic within the design while also reducing development effort, as testing infrastructure for behavioral validation can simply be reused. Additionally, only the design under test needs simulation at the gate-level, while the auxiliary testing harness can stay at higher levels of abstraction. Finally, we also achieve a reduction in necessary injections by targeting high-fanout signals more heavily, discovering more faults per injection.

To validate our setup, we benchmarked it against existing fault-injection tools, demonstrating its performance in both simulation efficiency and the overall number of injections needed to achieve representative results.

% Alternate abstract after test results
% To validate our setup, we benchmarked it against existing fault-injection tools, demonstrating substantial improvements in both simulation efficiency and the overall number of injections needed to achieve representative results, thus enabling better scaling as target designs grow more complex.
\end{abstract}

\begin{IEEEkeywords}
asynchronous circuits, SET, fault tolerance, cluster computing, computer-aided design, parallel computing
\end{IEEEkeywords}

\section{Introduction}

To make new things, we require tools. While commercial tools offer access to the current state of the industry, they are usually not customizable enough (as they tend to be closed source) or, for more specialized applications, not available at all. This problem is well understood for asynchronous logic, where the commercial offerings' focus on synchronous designs limits functionality for everything outside their scope. Many of these problems have been mitigated by the publication of the open-source ACT toolchain by the Yale AVLSI group \cite{manohar_open_2019}, but local compute often does not suffice for more complex tasks.

Cluster computing offers large potential speedups, especially for tasks that lend themselves to a high degree of parallelization. For this reason, we have built a tool that distributes such tasks across a cluster while offering a simple \acs{api} for extending its functionality. Our goal was to create a framework to build on, and here we present a real-world use case to demonstrate this capability.

Exposing digital circuits to environments like space can break some of the most basic assumptions we make when designing digital circuits. Given the level of miniaturization we have access to, having high energy particles rain upon the millions of interconnects in an average design can introduce unexpected behavior. These undesired deviations from design specification, or \emph{failures}, need to be well understood to make predictions about a design's robustness.

Synchronizing logic to a clock, while potentially compromising on average-case performance, has the helpful side effect of creating a temporal mask for logic faults. This means that when an erroneous value is induced on a wire, only a small window of time exists in which this value can propagate beyond the next logic buffer. \\
In asynchronous logic, we unfortunately lack this convenient abstraction. While we assume that temporal masking also plays a role in asynchronous logic, albeit a much less obvious one \cite{huemer_identification_2020}, environmentally induced faults still pose a much higher potential risk than in a clock-synchronized design.

Often, however, knowing \emph{how} exactly a design fails is much more important than knowing \emph{if} it can fail under certain (extreme) circumstances. Certain use cases might call for, or even enforce, safety in the form of known failure modes on critical systems. While multiple attempts have been made in the past to create tooling for fault-space exploration \cite{behal_towards_2021}, these tools still have several shortcomings, which we feel need to be addressed.

\section{Related Work}

\texttt{action} is an addition to the ACT toolchain initially presented in \cite{manohar_open_2019}. ACT aims to be a collection of tools for an end-to-end chip design workflow. While the main focus of its tools is asynchronous designs, it is powerful enough to also map to synchronous logic families without issue \cite{vezzoli_designing_2024}. While the current version of the ACT toolflow includes a scripting environment \cite{he_interact_nodate}, it does not contain a solution for distributed computing, which would be helpful for testing and verification tasks.

Focusing on our specific demo use case, the tool presented in \cite{behal_towards_2021} is a fault-injection and fault-space exploration tool that aims to explore the fault types in a given circuit. It is quite similar to the demo use case we show in this paper. It distinguishes the fault classes \emph{timing deviation}, \emph{value fault}, \emph{code fault}, \emph{glitch}, \emph{deadlock}, and \emph{token count error}, which are largely reused in this paper (more on our system model in Section \ref{sec:system_model}). The core simulator used is QuestaSim (version 10.6c), a commercial simulation tool. To reduce the overall runtime, a cluster-based approach is employed to parallelize simulations over multiple machines. This tool has been designed for the \texttt{pypr} toolchain developed by Huemer at TU Wien \cite{huemer_contributions_2022}, a production-rule-based circuit description framework in Python. Notably, the tool calculates the number of required injections from an average injection density, independently of which signal is being targeted. This is one of the main points we aim to improve upon.\\
% should i include work in master thesis?
An iteration of this system can be found in \cite{schwendinger_evaluation_2022}. While based on the same core toolflow, Behal adds limited bridging logic to the ACT toolchain, using \texttt{prsim} \cite{manohar_open_2019} as an alternative simulator. This change requires low-level simulation of additional logic, as certain required features were not supported by \texttt{prsim} and no extension to the core simulator code was written. This again is a major point for potential improvement.

Finally, we want to briefly touch on different fault-mitigation techniques seen in the literature. \\
Bainbridge and Salisbury \cite{bainbridge_glitch_2009} discuss the basic possibilities for fault behavior in \ac{qdi} circuits. Much like \cite{behal_towards_2021}, they identify specific scenarios that can occur when a \ac{set} is injected into a circuit. We will come back to this in Section \ref{sec:system_model} as well. They then lay out basic mitigation techniques, which largely focus on either introducing some form of redundancy in the circuit or reducing the temporal size of the window in which faults are converted into failure behavior (sensitivity window).

In a similar fashion, Huemer et al.\ \cite{huemer_identification_2020} present interlocking and deadlocking versions of a \ac{wchb}. These are also meant to reduce the sensitivity window size, as well as to prevent the propagation of illegal symbols. We will use their implementations of interlocking and deadlocking \acp{wchb} in this paper (more in Section \ref{sec:experiment_setup}).


\section{System Model}
\label{sec:system_model}

\subsection{On fault nomenclature}

Points to talk about

\begin{itemize}
    \item different types of faults that can occur
    \item upset vs transient
    \item single event delay (if we want to throw that in)
\end{itemize}

Talk about fault outcomes in \cite{bainbridge_glitch_2009}

\subsection{Per-Node Fault Space}

Points to talk about

\begin{itemize}
    \item fault is injected as output from one node diverges from specification
    \item show why this makes sense: only certain input combinations would activate a gate in a way where it could create erroneous output; everything else is logically masked $\rightarrow$ simulation doesn't make sense anyway
    \item which fault scenarios can and cannot be simulated
    \item show some graphs for this
\end{itemize}
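
A minimal sketch of the logical-masking argument in the notes above, using the standard Boolean difference. The symbols $f$, $x_i$, $a$, $b$, and $z$ are notation introduced here for illustration only, and the sketch assumes a purely combinational gate and ignores pulse width and timing. For a gate with output function $f(x_1,\dots,x_k)$, a flipped input $x_i$ changes the output only if
\begin{equation*}
\frac{\partial f}{\partial x_i} = f|_{x_i=0} \oplus f|_{x_i=1} = 1 .
\end{equation*}
For a two-input AND gate $z = a \land b$, this gives $\partial z / \partial a = (0 \land b) \oplus (1 \land b) = b$, so a transient on $a$ is logically masked whenever $b = 0$. Under these assumptions, every input fault that is not masked shows up as a deviation at the gate output, which suggests that injecting at node outputs covers the relevant cases.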


First, these tools should be natively part of the toolchain slowly emerging as the go-to standard in asynchronous logic design, the ACT suite, published by the Yale AVLSI group \citationneeded. While previous attempts have partially integrated with it \citationneeded, significant progress, such as a new simulator \citationneeded, has been made in the base toolchain. Additionally, the old tool was more of an adapter between ACT and the original workflow \citationneeded, which we feel can be improved. \\
Second, the previous tool does not account for the potential complexity of knock-on effects a given signal might have in the grander scheme of the \ac{dut}. Average insertion density is used as a stand-in metric to determine whether or not enough tests have been performed. We feel this can be improved upon using a more sophisticated stochastic framework.


\subsection{Types of failure behavior}

Points to talk about

\begin{itemize}
    \item types of failures observed at the output
\end{itemize}

\subsection{Discussion of Pipeline Load Factor}

Points to talk about

\begin{itemize}
    \item when does PLF make sense to begin with
    \item when does it not make sense
    \item (why have we not really included it in this analysis)
\end{itemize}

\subsection{Injection Strategy}

Points to talk about

\begin{itemize}
    \item fault distribution: skewed by node fanout instead of average injection density
    \item talk about coupon collector's problem and certainty of coverage, Markov's inequality\dots
    \item how does runtime scale with circuit size, linear with number of nodes
\end{itemize}
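
A hedged sketch of the coverage argument the notes above point to. The notation is assumed here for illustration: $n$ injection targets, fanout $f_i$ and selection probability $p_i$ of target $i$, $m$ injections, and $T$ for the number of injections until every target has been hit at least once. Injections are assumed to be drawn independently. In the uniform case $p_i = 1/n$, the classical coupon collector's problem gives
\begin{equation*}
\mathbb{E}[T] = n H_n = n \sum_{k=1}^{n} \frac{1}{k} = n \ln n + O(n) ,
\end{equation*}
and Markov's inequality bounds the tail by
\begin{equation*}
\Pr[T \geq c \, n H_n] \leq \frac{1}{c} \quad \text{for any } c > 0 .
\end{equation*}
If targets are instead weighted by fanout, for example $p_i = f_i / \sum_{j} f_j$ (one possible weighting, not necessarily the one used by the tool), a union bound limits the probability that some target is still untested after $m$ injections:
\begin{equation*}
\Pr[T > m] \leq \sum_{i=1}^{n} (1 - p_i)^m \leq \sum_{i=1}^{n} e^{-p_i m} .
\end{equation*}
The bound is dominated by the targets with the smallest $p_i$, so a skewed distribution trades slower coverage of low-fanout nodes for more frequent injections into nodes with more downstream effect.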

\section{Proposed Fault-Injection Tool}

Points to talk about

\begin{itemize}
    \item workflow: setup of harness, similarity to UVM, testbench design intended as design once, use for entire verification workflow
    \item why is this better than before? Performance improvements, not everything is simulated at gate level anymore, actsim is a mixed level simulator; \acs{dut} is simulated at gate level, while harness is simulated at higher level of abstraction
    \item changes to actsim? Addition of value overriding, addition of delay overriding; Addition of bounded stochastic delay?
    \item Using dflowmap means we can easily target different families of asynchronous circuits and even synchronous circuits and compare
    \item results database and post-processing
\end{itemize}

\section{Experiment Setup}
\label{sec:experiment_setup}

Points to talk about

\begin{itemize}
    \item what was the target circuit
    \item what metrics did we sweep
    \item what was the entire workflow (in case we do two showcases, directly compared to Behal and one for a more full ACT-style workflow)
\end{itemize}


\section{Results}

Points to talk about

\begin{itemize}
    \item Compared to Behal: how many failures were we able to find with our new tool vs with the old tool
    \item Compared to Behal: how efficient (failures found / injection) is this setup compared to previous attempts
    \item Dflow: how do certain families of async and sync compare
\end{itemize}

\section{Conclusion}

\printacronyms

\printbibliography

\end{document}