Bank Account Fraud Detection

Robust, reproducible machine learning for binary fraud detection on large-scale tabular data, with a focus on out-of-time generalization, class imbalance, and clear, business-relevant evaluation (F2 score, PR-AUC, precision@k, calibrated thresholds).

Problem

Given an account application event (one row = one application at time t), predict fraud_bool ∈ {0, 1} (fraud vs. legit) using applicant attributes, device/channel signals, and time-window aggregates (for example, application velocities in the last 6h/24h/4w). The positive-class (fraud) rate is ≈ 0.01 (≈ 1%).

Data description

"The Bank Account Fraud (BAF) suite of datasets has been published at NeurIPS 2022 and it comprises a total of 6 different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind!"

Dataset source

High-level overview

Each row in the dataset represents a single bank account application. The prediction target is whether the application is fraudulent (binary target).
Each data point is associated with a month, and the dataset spans 8 months. To replicate a practical application, we separated the train and test datasets by time: the first 6 months serve as training data (794989 observations) and the last 2 months as test data (205011 observations).
In the training dataset, the incidence rate of fraud is ~0.01, so our models are tackling a rare-event detection problem.
The raw dataset has 1000000 observations and 32 columns. We use fraud_bool as our outcome (target) variable, the indicator of whether fraud occurred. Excluding month, this leaves 30 initial features before preprocessing. The variables in the raw dataset are as follows:
| Variable | Role | Description |
| --- | --- | --- |
| fraud_bool | target/outcome | Fraud label (1 = fraud, 0 = legit). |
| income | predictor | Annual income of applicant in quantiles [0, 1]. |
| name_email_similarity | predictor | Similarity between email and applicant's name [0, 1]. |
| prev_address_months_count | predictor | Months in previous registered address [-1, 380] (-1 = missing). |
| current_address_months_count | predictor | Months in currently registered address [-1, 406] (-1 = missing). |
| customer_age | predictor | Applicant's age in decade bins (for example, 20). |
| days_since_request | predictor | Days passed since application [0, 78]. |
| intended_balcon_amount | predictor | Initial transferred amount for application [-1, 108]. |
| payment_type | predictor | Credit payment plan type (5 anonymized values). |
| zip_count_4w | predictor | Applications within same zip code in last 4 weeks [1, 5767]. |
| velocity_6h | predictor | Average applications per hour in the last 6 hours. |
| velocity_24h | predictor | Average applications per hour in the last 24 hours [1329, 9527]. |
| velocity_4w | predictor | Average applications per hour in the last 4 weeks [2779, 7043]. |
| bank_branch_count_8w | predictor | Total applications in the selected bank branch in last 8 weeks [0, 2521]. |
| date_of_birth_distinct_emails_4w | predictor | Emails for applicants with same DOB in last 4 weeks [0, 42]. |
| employment_status | predictor | Employment status (7 anonymized values). |
| credit_risk_score | predictor | Internal score of application risk [-176, 387]. |
| email_is_free | predictor | Domain of application email (free or paid). |
| housing_status | predictor | Current residential status (7 anonymized values). |
| phone_home_valid | predictor | Validity of provided home phone (True / False). |
| phone_mobile_valid | predictor | Validity of provided mobile phone (True / False). |
| bank_months_count | predictor | Age of previous account in months [-1, 31] (-1 = missing). |
| has_other_cards | predictor | If applicant has other cards from same banking company (0/1). |
| proposed_credit_limit | predictor | Applicant's proposed credit limit [200, 2000]. |
| foreign_request | predictor | If origin country of request differs from bank's country (0/1). |
| source | predictor | Online source ("INTERNET" or "APP"). |
| session_length_in_minutes | predictor | Length of user session in minutes [-1, 107]. |
| device_os | predictor | OS of device ("Windows", "Macintosh", "Linux", "X11", or "other"). |
| keep_alive_session | predictor | User option on session logout (0/1). |
| device_distinct_emails_8w | predictor | Distinct emails from used device in last 8 weeks [0, 3]. |
| device_fraud_count | predictor | Fraudulent applications with used device [0, 1]. |
| month | NOT a predictor | Month application was made [0, 7]. |

Common preprocessing steps

  • Train/test split: first 6 months for training, last 2 months for testing (794989 / 205011 split).
  • Drop the month column.
  • Drop the device_fraud_count column; it is constant-valued and therefore useless for prediction.
  • One-hot encode categorical columns.
More feature engineering and preprocessing steps can be found in the methodology section of each model, as each model has different requirements (for example, mean imputation).
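A minimal sketch of these common steps with pandas, assuming the BAF base table is stored in a CSV named Base.csv (the file name is illustrative) and uses the column names listed above:

```python
import pandas as pd

# Illustrative file name; adjust to wherever the BAF "Base" CSV is stored.
df = pd.read_csv("Base.csv")

# Out-of-time split: months 0-5 for training, months 6-7 for testing.
train_df = df[df["month"] <= 5].copy()
test_df = df[df["month"] >= 6].copy()

# Drop the month column and the constant-valued device_fraud_count column.
drop_cols = ["month", "device_fraud_count"]
train_df = train_df.drop(columns=drop_cols)
test_df = test_df.drop(columns=drop_cols)

# One-hot encode categorical columns, aligning test columns to the training columns.
cat_cols = train_df.select_dtypes(include="object").columns
train_df = pd.get_dummies(train_df, columns=cat_cols)
test_df = pd.get_dummies(test_df, columns=cat_cols)
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)

X_train, y_train = train_df.drop(columns="fraud_bool"), train_df["fraud_bool"]
X_test, y_test = test_df.drop(columns="fraud_bool"), test_df["fraud_bool"]
```

Aligning the one-hot columns of the test set to those of the training set guards against category values that appear in only one of the two splits.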

Methodology

Performance objective
The fraud incidence rate in this dataset is ~0.01. For rare-event detection, accuracy is a far less meaningful metric than precision and recall. For bank account fraud in particular, missing a fraud instance is far more financially costly than falsely flagging a legitimate one, so we weight recall significantly more than precision. Given this, we chose the F2 metric as the optimization goal for our models, reflecting real-world business objectives.
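For reference, Fβ = (1 + β²) · precision · recall / (β² · precision + recall); with β = 2, recall is weighted twice as heavily as precision. A minimal sketch of computing F2 with scikit-learn (the label arrays are placeholders):

```python
from sklearn.metrics import fbeta_score

# Placeholder labels/predictions purely for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

# beta=2 weights recall more heavily than precision.
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f"F2 = {f2:.3f}")
```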

Logistic Regression

Additional data preprocessing

  • Replace all -1 values in the dataset (which indicate missing values) with 0.
  • Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.
  • We obtain 49 features.

Final model training and evaluation

After preprocessing, we fit logistic regression with L2 regularization on the training data, using the Newton–Cholesky solver to speed up computation. We then evaluate the model on the 205011 observations in the test dataset and obtain accuracy ≈ 0.986. However, the F2 score is only 0.004: recall is very low, so the model finds almost none of the actual fraud cases.
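A minimal sketch of this model, assuming X_train, y_train, X_test, y_test are the preprocessed splits described above (scikit-learn ≥ 1.2 is assumed for the newton-cholesky solver):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score

# L2-regularized logistic regression; the newton-cholesky solver is fast for
# binary problems where n_samples >> n_features.
clf = LogisticRegression(penalty="l2", solver="newton-cholesky", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F2:", fbeta_score(y_test, y_pred, beta=2))
```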

K-Nearest Neighbor

Data preprocessing

  • Train/test split: for training data, we kept all fraudulent samples but downsampled non-fraudulent samples to a 1:10 fraud-to-non-fraud ratio, giving a training size of 89661 (fraud = 8151, non-fraud = 81510).
  • For the columns prev_address_months_count, bank_months_count, and session_length_in_minutes, we added a missingness indicator for each, replaced the -1 sentinel with NaN, and imputed all NaN values with SimpleImputer(strategy="median") (see the sketch after this list).
  • All features were standardized using the training set statistics and applied to both training and test sets.
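A minimal sketch of this preprocessing, assuming X_train and X_test are DataFrames that still contain the -1 sentinels in the three columns above:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

missing_cols = ["prev_address_months_count", "bank_months_count", "session_length_in_minutes"]

for col in missing_cols:
    for df in (X_train, X_test):
        # Missingness indicator, then turn the -1 sentinel into NaN for numeric imputation.
        df[col + "_missing"] = (df[col] == -1).astype(int)
        df[col] = df[col].replace(-1, np.nan)

# Median imputation and standardization are fit on the training set only.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train_proc = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_proc = scaler.transform(imputer.transform(X_test))
```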

Cross-validation and hyperparameter tuning

  • Cross-validation: we used StratifiedKFold with n_splits = 5 to preserve class proportions in each fold.
  • Hyperparameter grid:
    • k ∈ {3, 5, 7, 9, 11, 21, 31, 41, 51, 61, 81, 101, 121}
    • weights ∈ {"uniform", "distance"}
    • In total, 26 hyperparameter combinations were evaluated.
  • For each fold:
    • Train kNN with the specific (k, weights) combination.
    • Sweep the thresholds in threshold_grid = {0.01, 0.02, …, 0.50}, choose the one that yields the best F2 score, and record that threshold together with the (k, weights) pair.
  • After all 5 folds, we average the F2 scores across folds and select the (k, weights) pair, together with its threshold, that achieves the highest mean F2 (see the sketch below).
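A minimal sketch of the per-fold threshold sweep, assuming knn is the kNN model fitted on the current fold's training portion and (X_val, y_val) is the held-out fold (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Candidate decision thresholds 0.01, 0.02, ..., 0.50.
threshold_grid = np.arange(0.01, 0.51, 0.01)

# Predicted fraud probabilities on the validation fold.
val_prob = knn.predict_proba(X_val)[:, 1]

# Pick the threshold that maximizes F2 on this fold.
f2_scores = [fbeta_score(y_val, (val_prob >= t).astype(int), beta=2) for t in threshold_grid]
best_idx = int(np.argmax(f2_scores))
best_threshold, best_f2 = threshold_grid[best_idx], f2_scores[best_idx]
```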

Final model training and evaluation

  • We trained the final kNN model with the selected hyperparameters (k = 81, weights = "distance") on the rebalanced (downsampled) training subset of the full training set from months 0–5.
  • Evaluation was done on the test set using the optimal threshold τ = 0.134, and we reported F2, recall, precision, and accuracy.

Linear-Kernel Support Vector Machine

Additional data preprocessing

  • Create indicator columns for columns with missing values.
  • Standardize columns.
  • Mean imputation of missing values.

After preprocessing

  • Training data count: 794989.
  • Test data count: 205011.
  • Number of features: 49.

Modelling

  • Model: LinearSVC(C=C, dual=False, class_weight={0: 1, 1: R}) from sklearn.svm.
  • Hyperparameters: C, R.
  • Coarse-grained, stratified 3‑fold search over logarithmically spaced grids C ∈ {0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0} and R ∈ {1.0, 10.0, 100.0, 1000.0}, using F2 as the evaluation metric.
  • Fine-grained, stratified 3‑fold search over linearly spaced grids in the best-magnitude neighborhoods of C and R, again using F2.
  • Train the final model on all of the training data using the best (C, R) pair (a sketch of the coarse search follows this list).
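A minimal sketch of the coarse-grained stage, assuming the preprocessed X_train, y_train from above; the fine-grained stage works the same way with narrower, linearly spaced grids:

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, fbeta_score

f2_scorer = make_scorer(fbeta_score, beta=2)

# Coarse logarithmic grids for the regularization strength C and positive-class weight R.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
    "class_weight": [{0: 1, 1: R} for R in [1.0, 10.0, 100.0, 1000.0]],
}

search = GridSearchCV(
    LinearSVC(dual=False),
    param_grid,
    scoring=f2_scorer,
    cv=StratifiedKFold(n_splits=3),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```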

RBF-Kernel Support Vector Machine Ensemble

Additional data preprocessing

  • Identical to the linear-kernel SVM.

Modelling

Due to the high memory cost of storing Gram matrices when training kernel SVMs, we split the training data into manageably sized chunks, train an SVM model on each chunk, and predict 1 if at least K out of the N SVM models predict 1.
  • Fix N = 7.
  • Divide the training dataset into stratified folds of size ≈ 10000. Create 3 splits of 8 stratified folds each. For each split, 7 of the folds are used to train 7 SVM models, and the remaining fold is used as a validation set. This requires 24 folds; with ≈ 794989 training observations, roughly 79 folds are available.
  • Hyperparameters and ranges:
    • C ∈ [10^-2, 10]
    • R ∈ [10, 1000]
    • gamma ∈ [10^-3, 1] (RBF kernel)
    • K ∈ {1, …, 7} (the number of learners needed to make a positive prediction)
  • Because an exhaustive grid search is infeasible, we use random search: we randomly sample 10 different (C, R, gamma) combinations from their logspaces and combine each with K ∈ {1, …, 7}, producing 70 four-dimensional candidates (C, R, gamma, K).
  • After finding the optimal (C, R, gamma, K) combination, we create 7 stratified folds of size ≈ 20000, train 7 RBF SVM models with (C, R, gamma), and define the final ensemble to predict 1 if at least K of the 7 models predict 1.
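A minimal sketch of the K-of-N voting rule, assuming folds is a list of the 7 stratified (X_fold, y_fold) training chunks and C, R, gamma, K hold the tuned values (the variable names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Train one RBF-kernel SVM per stratified chunk of the training data.
models = [
    SVC(kernel="rbf", C=C, gamma=gamma, class_weight={0: 1, 1: R}).fit(X_fold, y_fold)
    for X_fold, y_fold in folds
]

def ensemble_predict(X, models, K):
    """Predict 1 when at least K of the N member SVMs predict 1."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return (votes >= K).astype(int)

y_test_pred = ensemble_predict(X_test, models, K)
```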

Tree-Based Methods – Decision Trees

Additional data preprocessing

  • Replace all -1 values in the dataset (which indicate missing values) with 0.
  • Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.

Cross-validation

  • Implement a 10‑fold stratified cross-validation using recall to find the best model.
  • Use the gini criterion for impurity and set the number of features to consider at each split to sqrt(n_features).
  • The best depth of the decision tree was max_depth = 90.
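A minimal sketch of this cross-validated depth search; the max_depth grid values shown are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", max_features="sqrt", random_state=0),
    param_grid={"max_depth": [10, 30, 50, 70, 90, 110]},  # illustrative grid
    scoring="recall",
    cv=StratifiedKFold(n_splits=10),
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_
```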
We then use the model to predict on the 205011 observations from the test dataset and obtain accuracy ≈ 0.97126. The F2 score is 0.108, limited mainly by recall, which is the quantity F2 emphasizes. This is better than logistic regression, but still leaves considerable room for improvement.

Tree-Based Methods – Random Forest

Data preprocessing

  • Train/test split: all observations from months 0–5 as the training set and months 6–7 as the test set.
  • For prev_address_months_count, bank_months_count, and session_length_in_minutes, add missingness indicators, replace -1 with NaN, and use median SimpleImputer.
  • Standardize all features using the training set statistics.

Cross-validation and hyperparameter tuning

  • Cross-validation: StratifiedKFold with n_splits = 3.
  • Hyperparameter grid:
    • n_estimators ∈ {100, 300, 500}
    • max_depth ∈ {None, 10, 20}
    • min_samples_leaf ∈ {1, 5}
    • In total, 18 hyperparameter combinations.
  • For each combination and fold, train a Random Forest, predict with threshold 0.5, compute F2, and average F2 across folds (see the sketch after this list).
  • Best setting: n_estimators = 500, max_depth = 10, min_samples_leaf = 1.
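A minimal sketch of the fold-averaged F2 selection, assuming X_train and y_train are NumPy arrays produced by the preprocessing above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, ParameterGrid
from sklearn.metrics import fbeta_score

param_grid = ParameterGrid({
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
})
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = []
for params in param_grid:
    fold_f2 = []
    for tr_idx, va_idx in cv.split(X_train, y_train):
        rf = RandomForestClassifier(**params, n_jobs=-1, random_state=0)
        rf.fit(X_train[tr_idx], y_train[tr_idx])
        # Predict with the default 0.5 threshold and score with F2.
        prob = rf.predict_proba(X_train[va_idx])[:, 1]
        fold_f2.append(fbeta_score(y_train[va_idx], (prob >= 0.5).astype(int), beta=2))
    results.append((np.mean(fold_f2), params))

best_f2, best_params = max(results, key=lambda r: r[0])
```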

Final model training and evaluation

  • Train the final Random Forest model with the best hyperparameters on the full training set (months 0–5).
  • Evaluate on the test set and report F2, recall, precision, and accuracy.

Tree-Based Methods – XGBoost

Additional data preprocessing

  • No additional preprocessing steps beyond the common ones.

After preprocessing

  • Number of features: 45.

Modelling

```python
import xgboost as xgb

params = {
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "min_child_weight": min_child_weight,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "scale_pos_weight": scale_pos_weight,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",  # PR AUC as the evaluation / early-stopping metric
    "tree_method": "hist",
    "verbosity": 0,
}

# Train the final booster on the full training DMatrix with the tuned number of rounds.
model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=best_iteration_int,
)

# Predict fraud probabilities on the test DMatrix and binarize at the tuned threshold.
y_test_prob = model.predict(dtest)
y_test_pred = (y_test_prob >= best_threshold).astype(int)
```
We first tune the following hyperparameters using stratified 5‑fold cross-validation, with F2 as the optimization metric:
  • learning_rate (range: logspace from 0.03 to 0.2)
  • max_depth (range: integers in [3, 8])
  • min_child_weight (range: [1, 10])
  • subsample (range: [0.7, 1.0])
  • colsample_bytree (range: [0.6, 1.0])
  • scale_pos_weight (range: logspace [10, 200])
  • best_threshold (range: linspace [0, 1])
The value best_iteration_int is implicitly tuned via early stopping when we average the optimal number of boosting rounds across splits.
Because XGBoost does not support F2 directly as an early-stopping metric, we use PR AUC as a surrogate for early stopping, since it also rewards good precision and recall across thresholds.
Due to the size of the hyperparameter space, we use random search: we randomly sample 20 different (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight) combinations, then pair each combination with multiple thresholds from [0, 1], yielding candidate tuples (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight, threshold).
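One way to realize this random search with PR-AUC early stopping is xgb.cv; a minimal sketch for a single sampled combination, assuming dtrain is the training DMatrix and sample_params() is a hypothetical helper that draws one (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight) combination from the ranges above:

```python
import xgboost as xgb

sampled = sample_params()  # hypothetical helper drawing one hyperparameter combination
params = {
    **sampled,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",  # PR AUC used as the early-stopping surrogate for F2
    "tree_method": "hist",
}

# Stratified 5-fold CV with early stopping on PR AUC; the number of completed rounds
# gives the boosting-round budget for this hyperparameter combination.
cv_results = xgb.cv(
    params=params,
    dtrain=dtrain,
    num_boost_round=2000,
    nfold=5,
    stratified=True,
    early_stopping_rounds=50,
    seed=0,
)
best_iteration_int = len(cv_results)
best_pr_auc = cv_results["test-aucpr-mean"].iloc[-1]
```

The tuned number of rounds can then be passed as num_boost_round when training the final booster, as in the code block above.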
We then train the final model on the full training data with the best hyperparameters.

Neural Networks

Data preprocessing

  • Replace all -1 values (missing values) with 0.
  • Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.
  • One-hot encode categorical variables.
  • Drop device_fraud_count as it is constant-valued.
  • Train/test split: first 6 months for training, last 2 months for testing.
  • Number of features: 49.

Model architecture and loss

  • Fully connected feed-forward neural network with three hidden layers, with dropout on the first and last layers and ReLU activations.
  • Output layer: single neuron with linear output, followed by a sigmoid to convert logits to probabilities.
  • To address class imbalance, we tried combinations of sampling and loss functions:
      1. Oversampling with surrogate loss.
      2. Oversampling with BCEWithLogitsLoss.
      3. No oversampling with BCEWithLogitsLoss.
  • The best configuration was no_oversampling + BCEWithLogitsLoss with positive-class weight pos_weight = (# of negatives / # of positives); a sketch of this setup follows below.
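A minimal sketch of the architecture and weighted loss described above, assuming 49 input features and n_neg / n_pos as the negative/positive counts in the training set:

```python
import torch
import torch.nn as nn

class FraudNet(nn.Module):
    """Three hidden layers with ReLU; dropout on the first and last hidden layers."""

    def __init__(self, n_features, hidden_sizes=(256, 128, 64), dropout=0.3):
        super().__init__()
        h1, h2, h3 = hidden_sizes
        self.net = nn.Sequential(
            nn.Linear(n_features, h1), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h3, 1),  # single linear output neuron (logit)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = FraudNet(n_features=49)

# Class-imbalance handling: weight the positive class by (# negatives / # positives).
pos_weight = torch.tensor([n_neg / n_pos], dtype=torch.float32)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# At inference time, apply a sigmoid to the logits and threshold at the tuned tau:
# probs = torch.sigmoid(model(x_batch)); preds = (probs >= 0.84).int()
```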

Hyperparameter tuning and training

  • Small grid search with 27 combinations:
    • Hidden layer sizes: (64, 32, 16), (128, 64, 32), (256, 128, 64)
    • Dropout probability: 0.2, 0.3, 0.5
    • Learning rate: 1e-4, 1e-3, 3e-3
  • For each configuration:
    • Optimizer: Adam with weight_decay = 1e-4.
    • Batch size: 4096.
    • Up to 20 epochs with early stopping based on validation F2.
    • After each epoch, compute validation probabilities, sweep a threshold grid between 0.01 and 0.99, and record the threshold that maximizes F2.
  • Best setting: hidden_sizes = (256, 128, 64), dropout = 0.3, learning_rate = 0.001 with optimal threshold τ = 0.84.

Final evaluation

We keep the best model (highest validation F2) and evaluate it on the test set. Using the tuned threshold τ = 0.84, we report test F2, recall, precision, and accuracy. This neural network serves as a high-capacity predictive model (less interpretable than kNN or Random Forest, which we use for interpretability and comparison).

Final Results

| Model | Accuracy | Recall | Precision | F2 Score | ROC AUC | PR AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.9860 | 0.003 | 0.625 | 0.004 | 0.5017 | 0.016 |
| KNN | 0.8277 | 0.7318 | 0.0574 | 0.2186 | 0.859 | 0.136 |
| SVM – Linear Kernel | 0.9632 | 0.399 | 0.165 | 0.3108 | 0.8838 | 0.1662 |
| SVM – RBF Kernel Ensemble | 0.9568 | 0.435 | 0.147 | 0.3128 | 0.7314 | 0.1046 |
| Decision Tree | 0.9713 | 0.114 | 0.089 | 0.108 | 0.5471 | 0.022 |
| Random Forest | 0.8643 | 0.6959 | 0.0692 | 0.2475 | 0.869 | 0.144 |
| XGBoost | 0.9533 | 0.5083 | 0.1522 | 0.3463 | 0.896 | 0.200 |
| Neural Network | 0.9626 | 0.4051 | 0.1638 | 0.3130 | 0.881 | 0.170 |

Interpretable feature analysis

We filtered for the features with the strongest coefficient signals (absolute value ≥ 0.1) from the linear SVM model, as it is the easiest to interpret among the models with better F2 scores.

Positive signals for fraud

  • Missing values in prev_address_months_count (months in previous registered address) are a relatively strong signal for fraud. One potential reason is that fraudsters tend to avoid providing a traceable address history.
  • The device OS being "Windows" is a relatively strong signal for fraud. One possible explanation is that fraudsters often operate from cheap and widely available Windows environments.

Negative signals for fraud

  • Keep_alive_session = 1 (the user keeping the session alive on logout) is a negative indicator of fraud. Fraudsters may prefer not to stay connected longer than necessary.
  • Higher name_email_similarity is a negative indicator of fraud. Fraudsters often use randomly generated or non-identifying email handles.
  • phone_home_valid = 1 (valid home phone numbers) is a negative indicator of fraud, as fraudsters may provide fake or burner numbers.
  • Has_other_cards = 1 (applicant has other cards with the same bank) is a negative signal of fraud, possibly because multiple cards under the same identity increase detectability.
  • Certain housing_status categories (for example, "BE", "BB", "BC" in the anonymized encoding) are negative indicators of fraud, but since these are anonymized, we cannot interpret them semantically.

Conclusions

Across models, we see modest, non-negligible improvements over simple baselines. However, the Neural Network, XGBoost, linear-kernel SVM, and the RBF-kernel SVM ensemble all achieve similar F2 scores, suggesting limited remaining predictive signal in the current tabular features.
In other words, we are likely close to the Pareto frontier with respect to our current feature set. Significant future improvements will likely come from feature engineering rather than model choice: deriving new features, exploring interactions, and incorporating richer temporal or behavioral signals.

Contributions

Source

Source code and experiment artifacts are maintained in a GitHub repository.