Commit 356fee65 authored by yyl1c20

Upload New File

parent e48bcb6e
%% Cell type:markdown id: tags:
<h1>1. Loading Datasets</h1>
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
mTrain = pd.read_csv("TrainingDataMulti.csv")
mTest = pd.read_csv("TestingDataMulti.csv")
print("\n[ TrainingDataMulti.csv info ]")
mTrain.info()
print("\n[ TestingDataMulti.csv info ]")
mTest.info()
```
%% Output
[ TrainingDataMulti.csv info ]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Columns: 129 entries, R1-PA1:VH to marker
dtypes: float64(112), int64(17)
memory usage: 5.9 MB
[ TestingDataMulti.csv info ]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 128 entries, R1-PA1:VH to snort_log4
dtypes: float64(104), int64(24)
memory usage: 100.1 KB
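%% Cell type:markdown id: tags:
The two `info()` summaries differ by one column: the training file has 129 columns ending in `marker`, while the testing file has 128 ending in `snort_log4`. A quick way to confirm that the label is the only difference might look like the sketch below. The helper name `feature_mismatch` and the toy frames are hypothetical stand-ins, not part of the original notebook.
%% Cell type:code id: tags:
```python
import pandas as pd


def feature_mismatch(train: pd.DataFrame, test: pd.DataFrame, label: str = "marker"):
    """Return (missing_in_test, extra_in_test) relative to the training features."""
    train_features = [c for c in train.columns if c != label]
    missing = [c for c in train_features if c not in test.columns]
    extra = [c for c in test.columns if c not in train_features]
    return missing, extra


# Toy frames standing in for the real CSV files (hypothetical values)
toy_train = pd.DataFrame({"R1-PA1:VH": [70.4, 73.7], "snort_log4": [0, 0], "marker": [0, 1]})
toy_test = pd.DataFrame({"R1-PA1:VH": [71.0], "snort_log4": [0]})

missing, extra = feature_mismatch(toy_train, toy_test)
print("missing in test:", missing, "| extra in test:", extra)
```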
%% Cell type:markdown id: tags:
<h1>1.1 Analysing the Data</h1>
%% Cell type:code id: tags:
``` python
mTrain.dtypes
```
%% Output
R1-PA1:VH float64
R1-PM1:V float64
R1-PA2:VH float64
R1-PM2:V float64
R1-PA3:VH float64
...
snort_log1 int64
snort_log2 int64
snort_log3 int64
snort_log4 int64
marker int64
Length: 129, dtype: object
%% Cell type:code id: tags:
``` python
mTrain['marker'].value_counts()
```
%% Output
marker
0 3000
2 1500
1 1500
Name: count, dtype: int64
%% Cell type:code id: tags:
``` python
mTrain
```
%% Output
R1-PA1:VH R1-PM1:V R1-PA2:VH R1-PM2:V R1-PA3:VH
0 70.399324 127673.0908 -49.572308 127648.0176 -169.578319 \
1 73.688102 130280.7109 -46.300719 130255.6377 -166.278082
2 73.733939 130305.7842 -46.254883 130280.7109 -166.232245
3 74.083443 130581.5902 -45.899649 130556.5169 -165.882741
4 74.553268 131083.0556 -45.424094 131057.9823 -165.424375
... ... ... ... ... ...
5995 116.889120 131860.3269 -3.076783 131810.1804 -123.094253
5996 116.849013 131810.1804 -3.116890 131760.0339 -123.128630
5997 116.384917 131734.9606 -3.586716 131684.8140 -123.586996
5998 111.125164 130506.3704 -8.846468 130456.2238 -128.858208
5999 110.878793 130481.2971 -9.092840 130456.2238 -129.104580
R1-PM3:V R1-PA4:IH R1-PM4:I R1-PA5:IH R1-PM5:I ...
0 127723.2374 65.689611 605.91099 -57.003571 626.78553 ... \
1 130355.9307 71.831719 483.59351 -50.947407 500.98896 ...
2 130381.0040 71.808800 483.59351 -50.913030 500.98896 ...
3 130656.8100 72.152575 482.86107 -50.437475 499.15786 ...
4 131158.2754 72.118198 484.50906 -50.013486 497.69298 ...
... ... ... ... ... ... ...
5995 131910.4735 114.780635 376.10794 -5.254023 374.82617 ...
5996 131885.4002 114.769176 376.29105 -5.322778 374.82617 ...
5997 131785.1071 114.299351 376.47416 -5.849899 374.82617 ...
5998 130556.5169 106.667553 478.83265 -13.464508 477.73399 ...
5999 130556.5169 106.392533 478.83265 -13.750987 477.91710 ...
control_panel_log4 relay1_log relay2_log relay3_log relay4_log
0 0 0 0 0 0 \
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ... ...
5995 0 0 0 0 0
5996 0 0 0 0 0
5997 0 0 0 0 0
5998 0 0 0 0 0
5999 0 0 0 0 0
snort_log1 snort_log2 snort_log3 snort_log4 marker
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ... ...
5995 0 0 0 0 0
5996 0 0 0 0 0
5997 0 0 0 0 0
5998 0 0 0 0 0
5999 0 0 0 0 0
[6000 rows x 129 columns]
%% Cell type:code id: tags:
``` python
mTrain.isnull().sum()
mTrain = mTrain.dropna()
mTrain
```
%% Output
R1-PA1:VH R1-PM1:V R1-PA2:VH R1-PM2:V R1-PA3:VH
0 70.399324 127673.0908 -49.572308 127648.0176 -169.578319 \
1 73.688102 130280.7109 -46.300719 130255.6377 -166.278082
2 73.733939 130305.7842 -46.254883 130280.7109 -166.232245
3 74.083443 130581.5902 -45.899649 130556.5169 -165.882741
4 74.553268 131083.0556 -45.424094 131057.9823 -165.424375
... ... ... ... ... ...
5995 116.889120 131860.3269 -3.076783 131810.1804 -123.094253
5996 116.849013 131810.1804 -3.116890 131760.0339 -123.128630
5997 116.384917 131734.9606 -3.586716 131684.8140 -123.586996
5998 111.125164 130506.3704 -8.846468 130456.2238 -128.858208
5999 110.878793 130481.2971 -9.092840 130456.2238 -129.104580
R1-PM3:V R1-PA4:IH R1-PM4:I R1-PA5:IH R1-PM5:I ...
0 127723.2374 65.689611 605.91099 -57.003571 626.78553 ... \
1 130355.9307 71.831719 483.59351 -50.947407 500.98896 ...
2 130381.0040 71.808800 483.59351 -50.913030 500.98896 ...
3 130656.8100 72.152575 482.86107 -50.437475 499.15786 ...
4 131158.2754 72.118198 484.50906 -50.013486 497.69298 ...
... ... ... ... ... ... ...
5995 131910.4735 114.780635 376.10794 -5.254023 374.82617 ...
5996 131885.4002 114.769176 376.29105 -5.322778 374.82617 ...
5997 131785.1071 114.299351 376.47416 -5.849899 374.82617 ...
5998 130556.5169 106.667553 478.83265 -13.464508 477.73399 ...
5999 130556.5169 106.392533 478.83265 -13.750987 477.91710 ...
control_panel_log4 relay1_log relay2_log relay3_log relay4_log
0 0 0 0 0 0 \
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ... ...
5995 0 0 0 0 0
5996 0 0 0 0 0
5997 0 0 0 0 0
5998 0 0 0 0 0
5999 0 0 0 0 0
snort_log1 snort_log2 snort_log3 snort_log4 marker
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ... ...
5995 0 0 0 0 0
5996 0 0 0 0 0
5997 0 0 0 0 0
5998 0 0 0 0 0
5999 0 0 0 0 0
[6000 rows x 129 columns]
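%% Cell type:markdown id: tags:
In the cell above, the result of `mTrain.isnull().sum()` is not displayed, because a notebook only shows the value of a cell's last expression. The row count is 6000 both before and after `dropna()`, which suggests that no rows contained missing values. A minimal sketch of reporting this explicitly, using a hypothetical toy frame rather than the real data:
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Toy frame with a single missing cell (hypothetical values)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [0, 1, 2]})

# Total number of missing cells across all columns
n_missing = int(toy.isnull().sum().sum())

# Drop rows containing any missing value and count how many were removed
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(f"missing cells: {n_missing}, rows dropped: {rows_dropped}")
```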
%% Cell type:code id: tags:
``` python
X = mTrain.drop(columns = 'marker')
X
```
%% Output
R1-PA1:VH R1-PM1:V R1-PA2:VH R1-PM2:V R1-PA3:VH
0 70.399324 127673.0908 -49.572308 127648.0176 -169.578319 \
1 73.688102 130280.7109 -46.300719 130255.6377 -166.278082
2 73.733939 130305.7842 -46.254883 130280.7109 -166.232245
3 74.083443 130581.5902 -45.899649 130556.5169 -165.882741
4 74.553268 131083.0556 -45.424094 131057.9823 -165.424375
... ... ... ... ... ...
5995 116.889120 131860.3269 -3.076783 131810.1804 -123.094253
5996 116.849013 131810.1804 -3.116890 131760.0339 -123.128630
5997 116.384917 131734.9606 -3.586716 131684.8140 -123.586996
5998 111.125164 130506.3704 -8.846468 130456.2238 -128.858208
5999 110.878793 130481.2971 -9.092840 130456.2238 -129.104580
R1-PM3:V R1-PA4:IH R1-PM4:I R1-PA5:IH R1-PM5:I ...
0 127723.2374 65.689611 605.91099 -57.003571 626.78553 ... \
1 130355.9307 71.831719 483.59351 -50.947407 500.98896 ...
2 130381.0040 71.808800 483.59351 -50.913030 500.98896 ...
3 130656.8100 72.152575 482.86107 -50.437475 499.15786 ...
4 131158.2754 72.118198 484.50906 -50.013486 497.69298 ...
... ... ... ... ... ... ...
5995 131910.4735 114.780635 376.10794 -5.254023 374.82617 ...
5996 131885.4002 114.769176 376.29105 -5.322778 374.82617 ...
5997 131785.1071 114.299351 376.47416 -5.849899 374.82617 ...
5998 130556.5169 106.667553 478.83265 -13.464508 477.73399 ...
5999 130556.5169 106.392533 478.83265 -13.750987 477.91710 ...
control_panel_log3 control_panel_log4 relay1_log relay2_log
0 0 0 0 0 \
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
... ... ... ... ...
5995 0 0 0 0
5996 0 0 0 0
5997 0 0 0 0
5998 0 0 0 0
5999 0 0 0 0
relay3_log relay4_log snort_log1 snort_log2 snort_log3 snort_log4
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
... ... ... ... ... ... ...
5995 0 0 0 0 0 0
5996 0 0 0 0 0 0
5997 0 0 0 0 0 0
5998 0 0 0 0 0 0
5999 0 0 0 0 0 0
[6000 rows x 128 columns]
%% Cell type:code id: tags:
``` python
y = mTrain['marker']
```
%% Cell type:markdown id: tags:
Stratified Train-Test Split
The train-test split is stratified so that each class makes up nearly the same proportion of the training and test sets as it does in the full dataset. This is desirable when class sizes are unequal, as they are here (3000 samples of class 0 versus 1500 each of classes 1 and 2).
For the same reason, stratified K-fold cross-validation is used later instead of plain K-fold cross-validation.
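The effect of stratification can be checked on synthetic labels with the same class counts as the training file. The sketch below is illustrative only: the feature column is a placeholder and the `toy_` names are not part of the original notebook. It shows that each class keeps its share of the data in both parts of the split.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the class counts above: 3000 / 1500 / 1500
toy_y = np.repeat([0, 1, 2], [3000, 1500, 1500])
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.15, random_state=1, stratify=toy_y
)

# Per-class counts in each part of the split
train_counts = np.bincount(toy_y_train)
test_counts = np.bincount(toy_y_test)
print("train:", train_counts, "test:", test_counts)
```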
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=0.15, stratify=y)
```
%% Cell type:code id: tags:
``` python
y_train.value_counts()
```
%% Output
marker
0 2550
2 1275
1 1275
Name: count, dtype: int64
%% Cell type:code id: tags:
``` python
y_test.value_counts()
```
%% Output
marker
0 450
2 225
1 225
Name: count, dtype: int64
%% Cell type:markdown id: tags:
<h1>3. Choosing a Model: Gradient Boosting, Training, and Evaluation</h1>
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import GradientBoostingClassifier

# Gradient-boosted trees; random_state fixes the seed for reproducibility
gb_clf = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.5, max_depth=8,
    random_state=10, subsample=1.0, max_features='log2',
)
gb_clf.fit(X_train, y_train)
# Predictions on the held-out 15% split
y_pred = gb_clf.predict(X_test)
```
%% Cell type:code id: tags:
``` python
gb_clf.score(X_test, y_test)
```
%% Output
0.9722222222222222
%% Cell type:markdown id: tags:
<h1>4. Improving</h1>
%% Cell type:markdown id: tags:
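A randomized search over the gradient boosting hyperparameters (number of estimators, maximum leaf nodes, and learning rate) can be used to tune the model; the search in the next cell is left commented out.
Afterwards, stratified K-fold cross-validation is applied, and a confusion matrix is used for evaluation.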
%% Cell type:code id: tags:
``` python
# from scipy.stats import loguniform
# from sklearn.model_selection import RandomizedSearchCV
# from sklearn.ensemble import GradientBoostingClassifier
# param_distributions = {
# "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
# "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
# "learning_rate": loguniform(0.01, 1),
# }
# # The target is a class label, so a classifier and a classification metric are used
# search_cv = RandomizedSearchCV(
# GradientBoostingClassifier(), param_distributions=param_distributions,
# scoring="f1_macro", n_iter=20, random_state=0, n_jobs=2
# )
# search_cv.fit(X_train, y_train)
# columns = [f"param_{name}" for name in param_distributions.keys()]
# columns += ["mean_test_score", "std_test_score"]
# cv_results = pd.DataFrame(search_cv.cv_results_)
# cv_results[columns].sort_values(by="mean_test_score", ascending=False)
```
%% Cell type:markdown id: tags:
<h1> 5. Metric Evaluation</h1>
%% Cell type:markdown id: tags:
Confusion Matrix
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
confusion_matrix(y_test, y_pred)
```
%% Output
array([[448, 2, 0],
[ 0, 218, 7],
[ 8, 8, 209]], dtype=int64)
%% Cell type:code id: tags:
``` python
print(classification_report(y_test, y_pred))
```
%% Output
              precision    recall  f1-score   support
           0       0.98      1.00      0.99       450
           1       0.96      0.97      0.96       225
           2       0.97      0.93      0.95       225
    accuracy                           0.97       900
   macro avg       0.97      0.96      0.97       900
weighted avg       0.97      0.97      0.97       900
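%% Cell type:markdown id: tags:
The figures in the report can be reproduced by hand from the confusion matrix shown earlier, where rows are true classes and columns are predicted classes (the scikit-learn convention). The sketch below copies the matrix values from that output; the variable names are illustrative.
%% Cell type:code id: tags:
```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm_values = np.array([[448, 2, 0],
                      [0, 218, 7],
                      [8, 8, 209]])

true_pos = np.diag(cm_values)

# Recall: correct predictions divided by the number of true samples per class
recall = true_pos / cm_values.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class
precision = true_pos / cm_values.sum(axis=0)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: all correct predictions divided by all samples
accuracy = true_pos.sum() / cm_values.sum()

print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1:       ", f1.round(2))
print("accuracy: ", round(float(accuracy), 4))
```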
%% Cell type:code id: tags:
``` python
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cvScore = cross_val_score(gb_clf, X_train, y_train, cv=skf, scoring='f1_macro')
print(cvScore)
# scoring='f1_macro', so the values reported are macro-averaged F1 scores, not accuracy
print("StratifiedKFold Cross-Validation Macro F1: %0.2f%% | Standard Deviation: %0.2f%%" % (100*cvScore.mean(), 100*cvScore.std()))
```
%% Output
[0.94703762 0.96512283 0.96383989 0.95301525 0.94614962]
StratifiedKFold Cross-Validation Macro F1: 95.50% | Standard Deviation: 0.81%
%% Cell type:markdown id: tags:
<h1> 6. Testing Data</h1>
%% Cell type:code id: tags:
``` python
y_testpred = gb_clf.predict(mTest.values)
y_testpred = pd.DataFrame(y_testpred, columns=['predicted marker'])
y_testpred.value_counts()
```
%% Output
C:\Users\60172\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\base.py:439: UserWarning: X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names
warnings.warn(
predicted marker
1 47
0 31
2 22
Name: count, dtype: int64
%% Cell type:code id: tags:
``` python
y_testpred.to_csv('testresult.csv')
```