% Update on Overleaf — commit fdd972181f (parent 2416a8e49e); 1 changed file with 59 additions and 25 deletions.
\usepackage{tikz-timing-overlays}
\usepackage{tikz-timing-advnodes}
\usetikzlibrary{positioning}
\usepackage{cleveref}

\makeglossaries
\newcommand{\citationneeded}{\textcolor{red}{[citation needed]}}
\newcommand{\referenceneeded}{\textcolor{red}{[reference needed]}}

\makeatletter
\newcommand{\linebreakand}{%
  \end{@IEEEauthorhalign}
New Haven, CT 06520, USA\\
rajit.manohar@yale.edu}
}

\crefname{figure}{Figure}{Figures}

\maketitle

\begin{abstract}
An iteration of this system can be found in \cite{schwendingerEvaluationDifferentTools2022a}. While based on the same core toolflow, Schwendinger adds limited bridging logic to the ACT toolchain, using \texttt{prsim} \cite{manoharOpenSourceDesign} as an alternative simulator. This change requires low-level simulation of additional logic, as certain required features were not supported by \texttt{prsim} and no extension to the core simulator code was written. This again is a major point for potential improvement.

Finally, we want to briefly touch on different fault-mitigation techniques seen in the literature.
Bainbridge and Salisbury \cite{bainbridgeGlitchSensitivityDefense2009} talk about the basic possibilities for fault behavior in \ac{qdi} circuits. Much like \cite{behalExplainingFaultSensitivity2021}, they identify specific scenarios which can occur when a \ac{set} is injected into a circuit. We will come back to this in Section \ref{sec:system_model/failures} as well. They then lay out basic mitigation techniques, which largely focus on either introducing some form of redundancy in the circuit or reducing the size of the time window in which faults are converted into failure behavior (the sensitivity window).

In a similar fashion, Huemer et al. \cite{huemerIdentificationConfinementFault2020} present interlocking and deadlocking versions of a \ac{wchb}. These are also meant to reduce the sensitivity window size, as well as to prevent the propagation of illegal symbols. We will use their implementations for interlocking and deadlocking \acp{wchb} in this paper (more in Section \ref{sec:experiment_setup}).
\label{fig:tooling/architecture}
\end{figure}

\texttt{action} itself is a tool flow framework. Its main purpose is to provide a build system which can run both locally and remotely, shifting computing tasks away from the end-user machine. This means that other tasks can be performed by the user, or the connection to the user can be interrupted, while computation continues remotely without further intervention.
To configure \texttt{action} for a certain task, a string of tool invocations is defined in a \emph{pipeline} in a YAML file. \texttt{action}, while primarily meant for use with the ACT toolchain, is at its core tool agnostic. As long as a corresponding tool adapter, as well as handling capability for the data types used, is provided, any tool (commercial or open source) can be invoked by it. This makes it particularly useful as a base framework for highly parallel and/or computationally intense applications. It can alleviate interaction with clustering software for everyday tasks, as only the local command line tool needs to be invoked to perform pipeline execution.
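To make the pipeline concept concrete, a file along these lines could chain generation and simulation steps. This is purely illustrative: the actual \texttt{action} pipeline schema is not shown in this paper, so every key name below is a placeholder, not the tool's real grammar.

```yaml
# Hypothetical pipeline sketch — all keys are invented for illustration.
pipeline:
  - tool: pypr          # generate the multiplier's PRS in an ACT container
    inputs: [mult4.py]
  - tool: actsim        # run the fault-injection simulations
    inputs: [mult4.act]
    jobs: 16            # distributed across the compute nodes
```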
On a high level, \texttt{action} in its base architecture consists of the client application, a controller/database, and several compute nodes (see \cref{fig:tooling/architecture}). On invocation, the client tool first loads input data, performs local tasks, then uploads the required data into the controller database. From there, nodes can fetch open tasks and re-upload their results to the controller node when done. For simulation tasks, the nodes already perform pre-analysis of logs to reduce the amount of required post-processing.

The fault injection tool presented in this paper is a demonstration of \texttt{action}'s testcase generation engine, as well as its distributed computing capability, with \texttt{actsim} as the target tool. \texttt{action} is currently in its early stages, but the process to get the code ready for future open-source release is well underway. It exclusively uses open-source dependencies, enabling cheap and easy scaling for any application without worry about potential financial impact.
\subsection{Fault model}
\label{sec:system_model/faults}

This work mainly focuses on the effects of \acfp{set}, where a circuit node is manually forced to a specific value, independently of what value the design and its current state would dictate.
These transient effects can occur due to physical effects like ionizing radiation exposure or electromagnetic interference.
After the \ac{set} ends, the circuit node is released and returns to normal operation.
This is an important distinction from a \acf{seu}, where a faulty value is latched by a memory cell.
An \ac{set} can, but does not necessarily, lead to an \ac{seu}.

\begin{figure}
\centering
\label{fig:model/forks}
\end{figure}

We simulate our \ac{dut} on the \ac{prs} level. In the simulator, one \ac{prs} node consists of a pull-up, a pull-down, a weak pull-up, and a weak pull-down stack, resulting in one output value.
When an \ac{set} is injected, this output value is manually overridden; such an override does, however, not necessarily incur an output value change.
If an \ac{set} is injected into a node, the output value of the node must be different from the forced value for the \ac{set} to be visible at the output. For an \ac{set} to propagate, the child nodes of the victim must be activated on their respective input, meaning a change of the targeted parent would also change their value. Since we attack only a single node output per simulation, all children of the victim observe the same incorrect value; we can therefore only represent scenarios in which the entire fanout tree of one node carries the incorrect value, not scenarios in which only some of its branches do.
A visual representation of these limitations is shown in \cref{fig:model/forks}, where Subfigure \ref{fig:model/3_fork_possible} can be simulated in our model, while Subfigure \ref{fig:model/3_fork_impossible} cannot.

We feel comfortable with this limitation, as transients usually occur in either the junctions of a given gate or a transmission wire \cite{ferlet-cavroisSingleEventTransients2013}, which leads to fault behavior similar to what our model can produce.
\subsection{Failure model}
\label{sec:system_model/failures}

We are only interested in faults that affect the primary outputs of a circuit.
When a transient is injected into a node, it can be masked.
Fault masking can be caused by one of the following effects:
\begin{itemize}
\item The current state of the node already matches the polarity of the injected value.
\item None of the node's children is currently inside their sensitive window.
\item The duration of the injected pulse is too short to overcome the inertial delay of the connected gates. Note that pulse shortening can also occur inside a circuit.
\end{itemize}
However, even if the fault is not masked and affects a primary output in some way, it does not necessarily follow that the observed effect must be classified as a failure.
In asynchronous QDI circuits, if a node is already at the forced value but would naturally transition during the fault period, the circuit experiences a temporary slowdown but resumes normal operation afterwards.
If this slowdown is visible at an output, we classify the effect as a \emph{timing deviation}.
Note that in the context of non-QDI circuits, this is a potential value error.

All other observable effects are classified according to the following categories:

\begin{itemize}
\item The pulse caused by the transient changes the value of a data line and thus creates a \emph{value failure}.
\item If the transient produces an illegal state on a DI-encoded signal (e.g., setting both the true and the false rail of a dual-rail signal), a \emph{coding failure} is observed.
\item A non-recoverable state is reached in the circuit, resulting in a \emph{deadlock}.
\item If a static-0 or static-1 hazard propagates to the edge of the design, we register it as a \emph{glitch}.
\item The transient removes the spacing between value tokens and one is lost; this manifests in a \emph{token count failure} (and potentially a \emph{value failure}).
\item The transient acts as additional spacing between what is perceived as two data tokens, injecting an additional token into the pipeline. This also results in a \emph{token count failure} (and potentially a \emph{value failure}).
\end{itemize}

These failure modes are in accordance with the potential states of a circuit from \cite{bainbridgeGlitchSensitivityDefense2009}; we thus reuse the same failure classification as already presented in \cite{behalExplainingFaultSensitivity2021}.
\subsection{Injection Strategy}

To improve the probability of a simulated injection generating a failure, we aim to target our simulation efforts based on signal fanout. We use this metric both for signal selection prioritization (if the injector is not set to select all signals) and for determining the number of injections necessary for a given signal. This sets our approach apart from previous efforts \cite{behalExplainingFaultSensitivity2021,schwendingerEvaluationDifferentTools2022a}.

Signal selection is based on weighted reservoir sampling \cite{efraimidisWeightedRandomSampling2006}: each signal is assigned the key $\omega$, where $R$ is uniformly random in $(0, 1)$ and $\tilde{F} = (F - F_{min})/F_{max}$ is the normalized fanout of the signal, and the signals with the largest keys are selected.

\begin{align}
\omega = R^{1/\tilde{F}} = R^{F_{max}/(F - F_{min})}
\end{align}
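A minimal sketch of such a selection, following the Efraimidis--Spirakis key construction with the weight above. The function and signal names are illustrative only and are not taken from the actual tool:

```python
import random

def select_signals(fanouts, k):
    """Pick k victim signals by weighted reservoir sampling
    (Efraimidis & Spirakis): key = R**(1/weight), keep the largest keys.
    `fanouts` maps signal name -> fanout F; the weight is the
    normalized fanout (F - F_min) / F_max, as in the equation above."""
    f_min = min(fanouts.values())
    f_max = max(fanouts.values())
    keyed = []
    for name, f in fanouts.items():
        w = (f - f_min) / f_max  # normalized fanout F~
        # a zero weight (minimal fanout) gets key 0 -> selected last
        key = random.random() ** (1.0 / w) if w > 0 else 0.0
        keyed.append((key, name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]
```

Signals with a larger fanout receive keys closer to one and are therefore selected with higher probability, while low-fanout signals are still occasionally drawn.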
To calculate the number of required injections per signal, we use the Token Collector's problem. The tool is configured with an assumed number of failure modes per node fanout, $m$ (tokens), i.e., the number of distinct failure behaviors we expect to be reachable through each fanout branch.
From this, we first calculate the expected number of required injections, then use the Markov inequality to bound this number such that all tokens are found with a specified probability (the coverage probability $P_{cov}$).

\begin{align}
n &= F \cdot m \\
P(T \geq c n H_n) &\leq \frac{1}{c}
\end{align}
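The $n H_n$ term in the Markov bound is the standard coupon-collector expectation: the $i$-th new token is found after a geometrically distributed number of draws with success probability $(n-i+1)/n$, so

```latex
\begin{align*}
\mathrm{E}[T] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{k=1}^{n} \frac{1}{k} = n H_n .
\end{align*}
```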
Here $F$ represents the fanout, $m$ is the assumed number of failure modes per fanout, $n$ is the total number of expected failure modes for this signal, $T$ is the actual total number of draws required to collect all tokens, and $H_n$ is the $n$-th harmonic number. Reforming this leads to the assumed total number of injections $\hat{T}$ being calculated as

\begin{align}
\hat{T} = n H_n \cdot \frac{1}{(1 - P_{cov}) \cdot P_{hit}}
\end{align}

where $P_{hit}$ additionally describes the probability of an injection hitting a sensitive window. We have set this value to $0.001$ based on previous experiments \cite{behalExplainingFaultSensitivity2021}.
As $\hat{T}$ is calculated per selected signal, the number of required injections grows linearly with the number of selected signals (given identical fanout across all of them) and approximately as $n \log n$ in the fanout of a signal, for a fixed assumed number of failure modes per fanout, coverage probability, and hit probability.
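The calculation of $\hat{T}$ can be sketched in a few lines. The function name and default values are illustrative (defaults mirror the configuration table), not the tool's actual API:

```python
import math

def required_injections(fanout, m=1, p_cov=0.5, p_hit=0.001):
    """Sketch of the token-collector bound for one signal:
    n = F * m expected failure modes, expected draws n * H_n,
    scaled by 1 / ((1 - P_cov) * P_hit) as in the paper's equation."""
    n = fanout * m
    h_n = sum(1.0 / i for i in range(1, n + 1))  # n-th harmonic number
    return math.ceil(n * h_n / ((1.0 - p_cov) * p_hit))
```

Since $H_n$ grows like $\log n$, the per-signal injection count indeed scales as roughly $n \log n$ in the fanout.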
\section{Experiment Setup}
\label{sec:experiment_setup}

\label{fig:setup/testbench}
\end{figure}

To test our new tool, we simulated the same multiplier circuit as Behal et al. \cite{behalExplainingFaultSensitivity2021}. We did, however, not sweep over any form of pipeline load factor.
While that parameter may make sense for the analysis of specific blocks, it is not useful for target circuits containing non-linear pipelines.
For consistency, the multiplier was generated using \texttt{pypr} \cite{huemerContributionsEfficiencyRobustness2022}, which is able to generate \acs{prs} rules in an ACT container. We elected to simulate four different versions: a 4-bit multiplier with unit delays (10 time steps) in \acs{dims}, a 4-bit multiplier with randomized delays ($\pm 5\%$, \acs{prs} node delay between 95 and 105 time steps) in \acs{dims}, an 8-bit multiplier with unit delays (10 time steps) in \acs{dims}, and a 4-bit multiplier with unit delays (10 time steps) in \acs{nclx}.

The \ac{dut} is wrapped in a \acs{uvm}-like testbench setup, which is provided by our new simulation library. Our future ambition is to enable a write-once, use-everywhere architecture, where wrapper code has to be written once and can then be reused arbitrarily for all tests and verification procedures. The overall architecture of the test setup can be seen in Figure \ref{fig:setup/testbench}. Since asynchronous logic, contrary to synchronous logic, inherently uses a message-passing abstraction just like \acs{uvm}, we do not require much additional logic in the way of sequencers or monitors to interface with the \ac{dut}. Input tokens are directly forwarded to the \ac{dut}, model, and scoreboard.
\hline
\textbf{Parameter} & \textbf{Default setting}\\
\hline
Hit probability ($P_{hit}$) & $0.8$ \\
\hline
Modes per fork ($m$) & $1$ \\
\hline
Coverage certainty & $0.2$ \\
\hline
Victim coverage ($P_{cov}$) & $0.5$ \\
\hline
\end{tabularx}
\caption{Default generation engine configuration}
\label{tab:setup/config}
\end{table}
In our testing, this setup has shown itself quite capable as a cluster simulation tool. When running a batch of 13317 simulations on 4 nodes with 4 jobs each, we measured a total execution time of 1 minute and 32 seconds. This equates to almost exactly 9 simulations per second per core, which is in large part due to \texttt{actsim}'s high performance.

Failure distributions for the examined circuits can be seen in \cref{fig:results/aggregated}. The simulation configuration shown in this graph was set to a lower assumed hit probability ($0.1$, otherwise identical to Table~\ref{tab:setup/config}) to increase the number of simulations, in hopes of establishing an accurate baseline. From there, varying the hit probability, the assumed failure modes per fanout, and the coverage certainty ultimately equates to a difference in injections per selected signal.

\cref{fig:res/deviation_num_sims_dims,fig:res/deviation_num_sims_nclx} show how the observed failure mode distribution changes when the number of injections per signal is decreased. We observe that for both logic families, the deviation is less than a single percentage point when going from over 20000 simulations down to about 3000.

As a similar exercise, we established another baseline test for varying the percentage of signals to be targeted. For this, we configured the injection engine to select all signals, then gradually lowered the percentage of selected signals (see \cref{fig:res/deviation_sel_signals_dims,fig:res/deviation_sel_signals_nclx}). While not as stable as varying the number of simulations, the deviation still stayed within limits down to about 50\%, with the number of glitches observed in \acs{nclx} deviating by about $2.9$ percentage points.