    \documentclass[a4paper,12pt]{article}
    \title{Assignment 2: Detection of Attacks on Power System Grids}
    \author{Charlie Britton}
    \date{June 2023}
    \usepackage[margin=0.5in]{geometry}
    \usepackage[parfill]{parskip}
    \usepackage{listings}
    \usepackage{xcolor}
    \usepackage{hyperref}
    
    \newcommand\todo[1]{\textcolor{red}{\textbf{#1}}}
    \lstdefinestyle{mystyle}{
        basicstyle=\ttfamily\footnotesize,
        breakatwhitespace=false,
        breaklines=true,
        captionpos=b,
        keepspaces=true,
        numbersep=5pt,
        showspaces=false,
        showstringspaces=false,
        showtabs=false,
        tabsize=2
    }
    
    \lstset{style=mystyle}
    \begin{document}
    
    \maketitle
    
    This coursework applies machine learning to cyber-physical power grids, with the goal of classifying traces described by 128 input features as either an attack or a normal event (binary classification, for Part A), then extending this to a multiclass problem space.
    
    Although the introduction gives information on each of the input features in $\textbf{x}$, the details of these are not important. My solution initially makes use of the following components:
    \begin{itemize}
    	\item Scaling -- this standardizes every input feature to zero mean and unit variance (via \texttt{StandardScaler}) so that no single feature dominates the model
    	\item Principal Component Analysis -- this reduces dimensionality by projecting the data onto a small number of orthogonal components that capture most of its variance
    	\item Hyperparameter search -- this allows us to automatically tune the hyperparameters used for the model to get the highest accuracy possible
    	\item Fitting the model -- my solution makes use of logistic regression as the final stage to fit the model and classify the traces as either normal or an attack, and is subsequently adapted for Part B.
    \end{itemize}
    
    This problem can be viewed as an optimization problem, where the goal is to predict a label $y_i$ from a set of explanatory variables $\textbf{x}_i$.
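    The logistic-regression stage used below fits this by minimizing a regularized negative log-likelihood; one standard formulation (with labels coded as $y_i \in \{-1, 1\}$ -- this equation is illustrative rather than taken from the coursework) is
    \[
        \min_{\textbf{w},\,b} \; \frac{1}{2}\textbf{w}^\top\textbf{w} + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i\left(\textbf{w}^\top \textbf{x}_i + b\right)}\right)
    \]
    where $C$ controls the strength of regularization.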
    
    \section{Part A}
    By following the examples used in the labs with some slight adaptation, I was able to import my dataset using Pandas, run it through a \texttt{StandardScaler}, apply principal component analysis, and then fit a logistic regression.
    
    With an 80\%/20\% train and test split, the model was able to attain an 89.6\% accuracy score.
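    A minimal sketch of this pipeline, using synthetic data in place of the coursework CSV (the step names and dataset shape here are assumptions, not taken from my submitted code), looks like the following:
    \begin{lstlisting}[language=python]
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the 128-feature grid dataset.
    X, y = make_classification(n_samples=500, n_features=128, random_state=0)

    pipe = Pipeline([
        ("scaler", StandardScaler()),   # zero mean, unit variance per feature
        ("pca", PCA(n_components=30)),  # dimensionality reduction
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # 80/20 train/test split as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))
    \end{lstlisting}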
    
    The next part of the implementation replaced the logistic regression with a support vector machine (\texttt{SVC}) in the pipeline. This brought the accuracy up to 90.3\%.
    
    Switching from \texttt{SVC} to \texttt{LinearSVC} raised the accuracy to 91.6\%.
    
    Changing the algorithm from \texttt{LinearSVC} to K-nearest neighbors and optimizing the hyperparameters yielded 93.5\% accuracy. The search space was constructed from the following variables:
    \begin{lstlisting}[language=python]
    param_grid = {
        "pca__n_components": [5, 15, 30, 45, 60],
        "neighbors__n_neighbors": [5, 10],
        "neighbors__weights": ["uniform", "distance"],
        "neighbors__algorithm": ["ball_tree", "kd_tree", "brute"],
        "neighbors__leaf_size": [10, 20, 30, 40, 50],
        "neighbors__p": [1, 2]
    }
    \end{lstlisting}
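    A sketch of how such a search can be wired up with scikit-learn's \texttt{GridSearchCV} (synthetic data, and a reduced grid so the example runs quickly; the pipeline step names are chosen to match the \texttt{pca\_\_} and \texttt{neighbors\_\_} prefixes in the grid above):
    \begin{lstlisting}[language=python]
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=128, random_state=0)

    # Step names must match the "pca__" / "neighbors__" grid key prefixes.
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("neighbors", KNeighborsClassifier()),
    ])

    # Reduced version of the full grid shown above.
    param_grid = {
        "pca__n_components": [15, 30],
        "neighbors__n_neighbors": [5, 10],
        "neighbors__weights": ["uniform", "distance"],
    }

    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)
    \end{lstlisting}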
    and after running the search, the following hyperparameters were picked:
    \begin{lstlisting}
    {
        'neighbors__algorithm': 'ball_tree',
        'neighbors__leaf_size': 10,
        'neighbors__n_neighbors': 5,
        'neighbors__p': 1,
        'neighbors__weights': 'distance',
        'pca__n_components': 30
    }
    \end{lstlisting}
    
    By changing the train/test split to 90/10, the accuracy rose to 94.6\%. When labelling the data for the report submission, the entire dataset can be used for training: this won't cause overfitting on the submitted test data, but it does mean I cannot see accuracy metrics for the final training.
    
    This is okay as my training doesn't appear to be overfitting.
    
    By keeping the 90/10 split and changing to a bagging classifier with 15 estimators, accuracy reaches 97\%.
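    The bagging variant can be sketched as follows (synthetic data again; the text above does not specify the base estimator, so scikit-learn's default decision trees are assumed here):
    \begin{lstlisting}[language=python]
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=128, random_state=0)
    # 90/10 split, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=0)

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=30)),
        ("bag", BaggingClassifier(n_estimators=15, random_state=0)),
    ])
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))
    \end{lstlisting}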
    
    \section{Part B}
    Due to my use of the K-neighbors classifier, which handles multiclass labels natively, it was simple to substitute the multiclass labels for the binary ones, obtaining 90.4\% accuracy on an 80/20 split of the dataset.
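    Because \texttt{KNeighborsClassifier} supports multiclass targets out of the box, nothing in the pipeline needs to change; a minimal sketch with a synthetic three-class dataset (the coursework's actual label set may differ) is:
    \begin{lstlisting}[language=python]
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic three-class stand-in for the multiclass dataset.
    X, y = make_classification(n_samples=500, n_features=128, n_classes=3,
                               n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(clf.classes_)              # all three labels are learned
    print(clf.score(X_test, y_test))
    \end{lstlisting}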
    
    \section{Ambiguity in the Specification}
    The specification doesn't make it clear whether the submitted file should include all of the test data, so my submission contains 129 columns: the first 128 hold the test data and the 129th holds the predicted label.
    
    In the event that this is not what is required, I have also added two further files to my Git repository, which contain only the predicted label for each row.
    
    These can be found in the \texttt{predictions/} subdirectory of my Git repository, with the suffix \texttt{LabelOnly}.
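    The two submission formats can be produced with Pandas roughly as follows (\texttt{X\_test} and \texttt{preds} are hypothetical stand-ins for the test data and predicted labels, and the name of the full 129-column file is an assumption):
    \begin{lstlisting}[language=python]
    import os
    import numpy as np
    import pandas as pd

    # Hypothetical stand-ins for the 128-column test data and predictions.
    X_test = pd.DataFrame(np.zeros((4, 128)))
    preds = pd.Series([0, 1, 0, 1], name="label")

    # 129-column file: 128 feature columns plus the predicted label.
    full = X_test.copy()
    full[128] = preds.values
    full.to_csv("TestingResultsBinary.csv", index=False, header=False)

    # Label-only file in the predictions/ subdirectory.
    os.makedirs("predictions", exist_ok=True)
    preds.to_csv("predictions/TestingResultsBinaryLabelOnly.csv",
                 index=False, header=False)
    \end{lstlisting}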
    
    \section{Predictions for Part A}
    \lstinputlisting{predictions/TestingResultsBinaryLabelOnly.csv}
    
    \section{Predictions for Part B}
    \lstinputlisting{predictions/TestingResultsMultiLabelOnly.csv}
    
    \section{Git Source Code}
    Please find this at \url{https://git.soton.ac.uk/charlie/comp3217-cw2}
    \end{document}