This article introduces an unsupervised anomaly detection method based on z-score computation and walks through applying it, step by step, to a credit card transaction dataset using Python.
Anomaly detection techniques can be applied to a variety of challenging business problems: detecting fraud in insurance claims, travel expenses, and purchases/deposits; spotting cyber intrusions; catching bots that generate fake reviews; flagging abnormal energy consumption; and so on.
Unsupervised learning is key in an imperfect world in which the majority of data is unlabeled. It is also much harder to evaluate an unsupervised learning solution than a supervised one, which we will discuss in more detail later on.
Outline
- The hypothesis of z-score method in anomaly detection
- Quick glance at the dataset
- Feature scaling with standardization
- Feature selection by visualizing outliers/inliers distribution
- PCA transformation and feature selection II
- Build and train the z-score model
- Label your prediction and evaluate with multiple thresholds
- Run on the test dataset
- Conclusion
The hypothesis of the z-score method in anomaly detection is that the data follow a roughly Gaussian distribution (possibly with some skewness and kurtosis), and that anomalies are the data points far away from the mean of the population.
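Concretely, a z-score measures how many standard deviations a value lies from the mean of its feature: z = (x − mean) / std. A minimal sketch of the computation on toy data (the numbers here are purely illustrative):

```python
import numpy as np

def z_score(x, mean, std):
    # number of standard deviations x lies from the mean
    return (x - mean) / std

values = np.array([10.0, 11.0, 9.5, 10.2, 35.0])         # toy data; 35.0 is the obvious outlier
scores = np.abs(z_score(values, values.mean(), values.std()))
print(scores)                                             # the last entry scores far higher than the rest
```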
To make this intuitive, the following image was adapted from the Standard score Wikipedia page. The blue dots represent inliers, while the red dots are outliers. The more standard deviations a data point lies from the mean, the more anomalous it is. In other words, the further away from the centre, the higher the probability of being an outlier.
The core idea is so straightforward that applying the z-score method is like picking low-hanging fruit compared to other approaches, for example, LOC, isolation forest, and ICA. It is also practical to use the z-score as a benchmark in an unsupervised learning system that ensembles multiple algorithms to produce the final anomaly scores.
Quick glance at the dataset
The credit card fraud detection dataset can be downloaded from this Kaggle link.
The shape of the dataset is (284807, 31).
“Time”: Number of seconds elapsed between this transaction and the first transaction in the dataset
“Amount”: Transaction amount
“Class”: 1 for fraud, 0 otherwise
“V1” ~ “V28”: Output of a PCA dimensionality reduction on original raw data to protect user identities and sensitive features
Check the completeness of the data. There are no missing values, which saves us some work.
Check how skewed the class distribution is. As shown below, there are 1.72 fraudulent transactions in every 1,000 transactions.
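A minimal sketch of loading and inspecting the data (the file name creditcard.csv matches the Kaggle download, but adjust the path to your own setup):

```python
import pandas as pd

df = pd.read_csv('creditcard.csv')      # Kaggle credit card fraud dataset
print(df.shape)                         # (284807, 31)
print(df.isnull().sum().sum())          # 0 -> no missing values
print(df['Class'].mean() * 1000)        # ~1.72 frauds per 1,000 transactions
```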
Visualize the distribution of variables “Time” and “Amount”.
Visualize the distribution of variables “V1” to “V28”.
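One possible way to produce these plots (assuming a recent version of seaborn; the grid layout is just an illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# distributions of "Time" and "Amount"
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(df['Time'], ax=axes[0], kde=True)
sns.histplot(df['Amount'], ax=axes[1], kde=True)

# distributions of "V1" ~ "V28"
v_cols = [f'V{i}' for i in range(1, 29)]
fig, axes = plt.subplots(7, 4, figsize=(16, 20))
for col, ax in zip(v_cols, axes.ravel()):
    sns.histplot(df[col], ax=ax, kde=True)
plt.tight_layout()
```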
Feature scaling with standardization
As illustrated in the figures above, real-life data rarely follow a perfect normal distribution. Moreover, the features “Time”, “Amount”, and “V1” ~ “V28” sit at different scales.
The code below standardizes all features except “Class” so that they are centred around 0 with a standard deviation of 1. I would like to emphasize that standardization is an important step and a general requirement for many machine learning algorithms.
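Since the snippet itself is not reproduced here, the following is a minimal sketch of how this step might look with scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

feature_cols = [c for c in df.columns if c != 'Class']            # everything except the label
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[feature_cols] = scaler.fit_transform(df[feature_cols])  # mean 0, std 1 per feature
```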
Feature selection by visualizing outliers/inliers distribution
The red bars represent the fraudulent transactions while the blue bars are the normal transactions.
We can see that some features are not able to separate the outliers from the inliers, for example, “Time”, “Amount”, “V19”, and “V26”: the red bars (outliers) overlap with the blue bars (inliers). These features should not be fed into the z-score model because they cannot differentiate the red from the blue.
Based on the same reasoning, I would pick the features [“V9”, “V10”, “V11”, “V12”, “V14”, “V16”, “V17”, “V18”] and ignore the others. We can expect this subset to pick up a good portion of the anomalies, since the final score relies on the “intersection” of the signals from these features.
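A possible sketch of the visualization behind this selection, overlaying the two classes for each standardized feature (assuming df_scaled and feature_cols from the previous step):

```python
selected = ['V9', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17', 'V18']

fig, axes = plt.subplots(8, 4, figsize=(16, 24))
for col, ax in zip(feature_cols, axes.ravel()):
    # blue: normal transactions, red: frauds
    sns.histplot(df_scaled.loc[df_scaled['Class'] == 0, col], color='b', stat='density', ax=ax)
    sns.histplot(df_scaled.loc[df_scaled['Class'] == 1, col], color='r', stat='density', ax=ax)
    ax.set_title(col)
plt.tight_layout()
```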
PCA transformation and feature selection II
Perform a PCA transformation on the selected features to obtain a dimensionality-reduced dataset. Since I set n_components to 8, which equals the number of selected features, we do not actually reduce the dimensionality here.
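A minimal sketch of that transformation with scikit-learn (the component column names are just placeholders):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=8)
X_pca = pca.fit_transform(df_scaled[selected])          # same dimensionality, rotated axes
pca_cols = [f'pca_{i}' for i in range(8)]
pca_df = pd.DataFrame(X_pca, columns=pca_cols, index=df_scaled.index)
```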
From the plot below, we can see that the features [“V11”, “V14”, “V16”] do not separate the blue from the red clearly, so they will be excluded before feeding the model.
Build and train the z-score model
- Build and run a z-score model to get an anomaly score for each feature (a sketch of these first steps follows after this list).
- Then average the per-feature scores into an overall score across all features, stored in the column “all_cols_zscore”.
- After that, concatenate the score dataset with the label “Class”: 1 for fraud, 0 otherwise.
- Check the cumulative distribution of the scores we computed.
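A minimal sketch of the first three steps (the five remaining feature columns follow from the selections above; the exact scoring formula and the name train_df for the training split are assumptions):

```python
import numpy as np
from scipy import stats

# the eight features selected earlier minus V11, V14, V16
model_cols = ['V9', 'V10', 'V12', 'V17', 'V18']

def zscore_anomaly_scores(frame, cols):
    """Per-feature |z| scores rescaled to [0, 1], plus their average."""
    scores = pd.DataFrame(index=frame.index)
    for col in cols:
        z = np.abs(stats.zscore(frame[col]))
        scores[col + '_zscore'] = z / z.max()            # one plausible [0, 1] rescaling
    scores['all_cols_zscore'] = scores.mean(axis=1)      # overall score across features
    return scores

zScore_df = zscore_anomaly_scores(train_df, model_cols)  # train_df: the training split (assumption)
zScore_df = pd.concat([zScore_df, train_df['Class']], axis=1)
```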
```python
# plot the cumulative histogram
fig2, ax2 = plt.subplots(figsize=(12, 6))
x = zScore_df['all_cols_zscore']
ax2 = sns.kdeplot(x, shade=True, color="r", cumulative=True)

# tidy up the figure
ax2.grid(True)
ax2.legend(loc='right')
ax2.set_title('Cumulative Distribution of Scores', fontsize=18)
ax2.set_xlabel('zScore_df[\'all_cols_zscore\']', fontsize=14)
ax2.set_ylabel('Likelihood of score', fontsize=14);
```
As illustrated in the cumulative distribution of scores, around 90% of the scores are smaller than 0.1, and almost 100% of the scores are under 0.2.
Label your prediction
Fun part! How do we label predictions using the scores we computed above?
- Sort the dataset by “all_cols_zscore” in descending order, because the higher the score, the more abnormal the transaction.
- Label the top 350 rows with “predictClass” equal to 1 and the rest with 0, which means we predict the top 350 transactions as fraudulent.
- Calculate and print the precision and recall (a code sketch follows after the results below).
- The shape of the training dataset is 190820 rows with 5 features.
- There are 1.72 fraudulent transactions in every 1000 transactions.
- The precision when we label 350 cases as fraud is 55.14%, which means that out of every 100 transactions we flag as fraud, about 55 are fraudulent in reality.
- The recall when we label 350 cases as fraud is 58.48%, which means that out of every 100 fraudulent transactions, about 58 are successfully detected by the algorithm.
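A possible sketch of the labeling and evaluation steps above (assuming the zScore_df built earlier, with its “Class” column attached):

```python
from sklearn.metrics import precision_score, recall_score

N = 350
ranked = zScore_df.sort_values('all_cols_zscore', ascending=False)
ranked['predictClass'] = 0
ranked.iloc[:N, ranked.columns.get_loc('predictClass')] = 1      # flag the top-N scores as fraud

precision = precision_score(ranked['Class'], ranked['predictClass'])
recall = recall_score(ranked['Class'], ranked['predictClass'])
print(f'precision: {precision:.2%}, recall: {recall:.2%}')
```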
The result may seem a bit poor. To be fair, this is a highly imbalanced dataset, so we are looking for a needle in a haystack. The performance is not ideal, but it is often good enough in the real business world.
One step further: evaluate with multiple thresholds
We evaluated the model when labeling the top 350 highest-scoring transactions as fraud. What if we label the top 400, 450, 500, … cases as fraud? How does the z-score method perform under different thresholds?
- Create the threshold list used for labeling.
- Calculate and print the precision and recall under multiple thresholds.
- Save the precision and recall in a performance dataset (see the sketch below).
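A sketch of the sweep, reusing the ranked dataframe from the previous step (the list of cutoffs is illustrative):

```python
thresholds = [350, 400, 450, 500, 550, 600]              # top-N cutoffs to try
performance = []

for n in thresholds:
    ranked['predictClass'] = 0
    ranked.iloc[:n, ranked.columns.get_loc('predictClass')] = 1
    p = precision_score(ranked['Class'], ranked['predictClass'])
    r = recall_score(ranked['Class'], ranked['predictClass'])
    performance.append({'top_n': n, 'precision': p, 'recall': r})
    print(f'top {n}: precision {p:.2%}, recall {r:.2%}')

performance_df = pd.DataFrame(performance)
```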
By comparing the precision and recall under different thresholds, you can pick an appropriate threshold for your project. Precision measures the purity of your predictions, and recall measures the completeness of detection.
In practice, I would suggest leaning a bit more toward recall than precision, because anomalies are usually rare in the population and you want to catch as many of them as possible.
Run on the test dataset
How does our z-score method perform on unseen data? Let's run it on the test dataset.
Impressively, it performs even better on the test dataset, reaching 67.90% recall when the threshold is set to 0.17%. For every 100 fraudulent transactions, we are able to catch about 68 using the z-score method we built.
Compute the average precision (AP) from the prediction scores stored in “all_cols_zscore”.
Plot the ROC curve.
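A minimal sketch of these two evaluation steps with scikit-learn (test_zScore_df is a placeholder for the scores and labels computed on the test split):

```python
from sklearn.metrics import average_precision_score, roc_curve, auc

y_true = test_zScore_df['Class']
y_score = test_zScore_df['all_cols_zscore']

ap = average_precision_score(y_true, y_score)
print(f'Average precision: {ap:.4f}')

fpr, tpr, _ = roc_curve(y_true, y_score)
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, label=f'ROC curve (AUC = {auc(fpr, tpr):.3f})')
ax.plot([0, 1], [0, 1], linestyle='--', color='grey')    # chance line
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend(loc='lower right');
```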
Conclusion
Yeah! We went through a lot of interesting things in this article. To recap, we talked about:
- the core idea of z-score method in anomaly detection
- feature scaling with standardization
- feature selection and PCA transformation
- building and training the model on the train/test datasets
- evaluation with multiple thresholds
Actually, it does not seem like that much once summarized. Anyway, the procedure used above is:
standardization -> feature selection I -> PCA -> feature selection II -> train model -> evaluation -> run on test data -> evaluation
You could run experiments using other possible procedures, for example,
- procedure I: standardization -> train model
- procedure II: standardization -> PCA -> train model
- procedure III: standardization -> feature selection I -> train model
The goal is that, by comparing the precision and recall of each procedure, you can build a sense of how standardization, feature selection, and PCA can significantly affect the performance of the same model.
Unfortunately, for unsupervised learning problems in the real world, the absence of labels means we cannot select features by visualizing the distribution of outliers vs. inliers for each feature. Usually the underlying business process gives us a sense of which features are more relevant when we don't have labels. Be mindful of the potential bias and variance, though.
To summarize, if there is only one thing you take away, it should be these two procedures. The procedure for anomaly detection with the z-score method when labels are available (the supervised setting):
standardization -> feature selection I -> PCA -> feature selection II -> train model -> evaluation -> run on test data -> evaluation
and the procedure for anomaly detection with the z-score method when labels are not available (the unsupervised setting):
standardization -> train model -> evaluation -> run on test data -> evaluation