Robust, reproducible machine-learning for binary fraud detection on large-scale tabular data. Focus on out-of-time generalization, class imbalance, and clear, business-relevant evaluation (F2 score, PR-AUC, precision@k, calibrated thresholds).
Problem
Given an account application event (one row = one application at time
t), predict fraud_bool ∈ {0,1} (fraud vs. legit) using applicant attributes, device/channel signals, and time-window aggregates (for example, application velocities in the last 6h/24h/4w). Positive-class rate ≈ 0.01 (≈ 1%).
Data description
"The Bank Account Fraud (BAF) suite of datasets has been published at NeurIPS 2022 and it comprises a total of 6 different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind!"
Dataset source
Bank Account Fraud Dataset from Kaggle (Base.csv) https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022
High-level overview
Each row in the dataset represents a single bank account application. The binary prediction target indicates whether the application is fraudulent.
Each data point is associated with the month in which the application was made, and the dataset spans 8 months. To replicate a practical application, we separated the train and test datasets by time: we used the first 6 months as training data (794989 observations) and the last 2 months as test data (205011 observations).
In the training dataset, the fraud incidence rate was ~0.01, so our models are tackling a rare-event detection problem.
The raw dataset has 1000000 observations and 32 columns. We use
fraud_bool as our outcome or target variable, as it indicates whether fraud occurred. We had a total of 30 initial features (excluding month) before preprocessing. The variables in the raw dataset are as follows:

| Variable | Description |
| --- | --- |
| Fraud_bool (target/outcome variable) | Fraud label (1 = fraud, 0 = legit) |
| Income (predictor feature) | Annual income of applicant in quantiles [0, 1] |
| Name_email_similarity (predictor feature) | Similarity between email and applicant’s name [0, 1] |
| Prev_address_months_count (predictor feature) | Months in previous registered address [-1, 380] (-1 = missing) |
| Current_address_months_count (predictor feature) | Months in currently registered address [-1, 406] (-1 = missing) |
| Customer_age (predictor feature) | Applicant’s age in decade bins (for example, 20) |
| Days_since_request (predictor feature) | Days passed since application [0, 78] |
| Intended_balcon_amount (predictor feature) | Initial transferred amount for application [-1, 108] |
| Payment_type (predictor feature) | Credit payment plan type (5 anonymized values) |
| Zip_count_4w (predictor feature) | Applications within same zip code in last 4 weeks [1, 5767] |
| Velocity_6h (predictor feature) | Average applications per hour in the last 6 hours |
| Velocity_24h (predictor feature) | Average applications per hour in the last 24 hours [1329, 9527] |
| Velocity_4w (predictor feature) | Average applications per hour in the last 4 weeks [2779, 7043] |
| Bank_branch_count_8w (predictor feature) | Total applications in the selected bank branch in last 8 weeks [0, 2521] |
| Date_of_birth_distinct_emails_4w (predictor feature) | Emails for applicants with same DOB in last 4 weeks [0, 42] |
| Employment_status (predictor feature) | Employment status (7 anonymized values) |
| Credit_risk_score (predictor feature) | Internal score of application risk [-176, 387] |
| Email_is_free (predictor feature) | Domain of application email (free or paid) |
| Housing_status (predictor feature) | Current residential status (7 anonymized values) |
| Phone_home_valid (predictor feature) | Validity of provided home phone (True / False) |
| Phone_mobile_valid (predictor feature) | Validity of provided mobile phone (True / False) |
| Bank_months_count (predictor feature) | Age of previous account in months [-1, 31] (-1 = missing) |
| Has_other_cards (predictor feature) | If applicant has other cards from same banking company (0/1) |
| Proposed_credit_limit (predictor feature) | Applicant’s proposed credit limit [200, 2000] |
| Foreign_request (predictor feature) | If origin country of request differs from bank’s country (0/1) |
| Source (predictor feature) | Online source ("INTERNET" or "APP") |
| Session_length_in_minutes (predictor feature) | Length of user session in minutes [-1, 107] |
| Device_os (predictor feature) | OS of device ("Windows", "Macintosh", "Linux", "X11", or "other") |
| Keep_alive_session (predictor feature) | User option on session logout (0/1) |
| Device_distinct_emails_8w (predictor feature) | Distinct emails from used device in last 8 weeks [0, 3] |
| Device_fraud_count (predictor feature) | Fraudulent applications with used device [0, 1] |
| Month (NOT a predictor feature) | Month application was made [0, 7] |
Common preprocessing steps
- Train/test split: first 6 months for training, last 2 months for testing (794989 / 205011 split).
- Drop the month column.
- Drop the device_fraud_count column; it is constant-valued and therefore useless for prediction.
- One-hot encode categorical columns.
More feature engineering and preprocessing steps can be found in the methodology section of each model, as each model has different requirements (for example, mean imputation).
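A minimal sketch of these shared steps, assuming the raw Base.csv has been downloaded from Kaggle (the file path and variable names below are illustrative):

```python
import pandas as pd

# Load the raw BAF Base.csv (path is illustrative).
df = pd.read_csv("Base.csv")

# Out-of-time split: months 0-5 for training, months 6-7 for testing.
train_df = df[df["month"] <= 5].copy()
test_df = df[df["month"] >= 6].copy()

# Drop the month column and the constant-valued device_fraud_count column.
train_df = train_df.drop(columns=["month", "device_fraud_count"])
test_df = test_df.drop(columns=["month", "device_fraud_count"])

# One-hot encode categoricals, aligning the test columns to the training columns.
cat_cols = train_df.select_dtypes(include="object").columns
train_df = pd.get_dummies(train_df, columns=cat_cols)
test_df = pd.get_dummies(test_df, columns=cat_cols)
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)

X_train, y_train = train_df.drop(columns="fraud_bool"), train_df["fraud_bool"]
X_test, y_test = test_df.drop(columns="fraud_bool"), test_df["fraud_bool"]
```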
Methodology
Performance objective
The fraud incidence rate in this dataset is ~0.01. For rare-event detection, accuracy becomes a far less meaningful metric than precision and recall. For bank account fraud in particular, missing a fraud instance is far more financially costly than falsely flagging a non-fraud instance. Therefore, we weigh recall significantly more than precision. Given this, we chose the F2 metric as the optimization goal for our models, reflecting real-world business objectives.
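For reference, F_beta = (1 + beta^2) · precision · recall / (beta^2 · precision + recall); with beta = 2, recall counts four times as much as precision. A tiny, self-contained example of computing it with scikit-learn (the toy labels are illustrative):

```python
from sklearn.metrics import fbeta_score

# Toy labels: 4 actual frauds, 3 caught, 1 false alarm.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# precision = 3/4, recall = 3/4, so F2 = 0.75 here.
print(fbeta_score(y_true, y_pred, beta=2))
```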
Logistic Regression
Additional data preprocessing
- Replace all -1 values in the dataset (which indicate missing values) by setting them to 0.
- Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.
- We obtain 49 features.
Final model training and evaluation
After preprocessing, we used logistic regression with L2 regularization on the training data. To optimize computation speed, we used the Newton–Cholesky solver. We then tested the model on the 205011 observations in the test dataset and obtained accuracy ≈ 0.986. However, the F2 score is only 0.004 because recall is very low; the model is not very good at finding actual fraud cases.
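A minimal sketch of this setup, assuming X_train, y_train, X_test, and y_test hold the preprocessed (standardized, 49-feature) matrices and labels, and scikit-learn ≥ 1.2 for the newton-cholesky solver:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score

# L2-regularized logistic regression with the fast Newton-Cholesky solver.
clf = LogisticRegression(penalty="l2", solver="newton-cholesky", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F2:", fbeta_score(y_test, y_pred, beta=2))
```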
K-Nearest Neighbor
Data preprocessing
- Train/test split: for training data, we kept all fraudulent samples but downsampled non-fraudulent samples to a 1:10 ratio, resulting in a training size of 89661 (fraud = 8151, non_fraud = 81510).
- For columns prev_address_months_count, bank_months_count, and session_length_in_minutes, we added a missingness indicator for each, replaced -1 (the missing-value marker) with NaN, and imputed all NaN values with SimpleImputer(strategy="median").
- All features were standardized using the training set statistics and applied to both training and test sets.
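A sketch of the missingness handling and scaling (the 1:10 downsampling step is omitted here; X_train and X_test are assumed to be pandas DataFrames of the raw features):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

missing_cols = ["prev_address_months_count", "bank_months_count",
                "session_length_in_minutes"]

# Add a 0/1 missingness indicator per column, then treat -1 as NaN.
for frame in (X_train, X_test):
    for col in missing_cols:
        frame[col + "_missing"] = (frame[col] == -1).astype(int)
        frame[col] = frame[col].replace(-1, np.nan)

# Median imputation and standardization, both fit on the training set only.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_train_proc = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_proc = scaler.transform(imputer.transform(X_test))
```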
Cross-validation and hyperparameter tuning
- Cross-validation: we used StratifiedKFold with n_splits = 5 to preserve class proportions in each fold.
- Hyperparameter grid: k ∈ {3, 5, 7, 9, 11, 21, 31, 41, 51, 61, 81, 101, 121}, weights ∈ {"uniform", "distance"}.
In total, 26 hyperparameter combinations were evaluated.
- For each fold:
- Train kNN with the specific (k, weights) combination.
- Choose the threshold from threshold_grid = {0.01, 0.02, …, 0.50} that yields the best F2 score, and record that threshold together with the hyperparameter pair.
- After all 5 folds, we average the best F2 scores and select the corresponding k value and weights with the optimal threshold value (see the sketch below).
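A sketch of this selection loop, reusing X_train_proc and y_train from the preprocessing above (assuming they refer to the downsampled training set):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import fbeta_score

ks = [3, 5, 7, 9, 11, 21, 31, 41, 51, 61, 81, 101, 121]
weight_opts = ["uniform", "distance"]
thresholds = np.arange(0.01, 0.51, 0.01)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}  # (k, weights) -> (mean best F2, mean best threshold)
for k in ks:
    for w in weight_opts:
        fold_f2, fold_thr = [], []
        for tr_idx, va_idx in cv.split(X_train_proc, y_train):
            knn = KNeighborsClassifier(n_neighbors=k, weights=w)
            knn.fit(X_train_proc[tr_idx], y_train.iloc[tr_idx])
            proba = knn.predict_proba(X_train_proc[va_idx])[:, 1]
            # Sweep the threshold grid and keep the best F2 for this fold.
            f2s = [fbeta_score(y_train.iloc[va_idx], proba >= t, beta=2)
                   for t in thresholds]
            best = int(np.argmax(f2s))
            fold_f2.append(f2s[best])
            fold_thr.append(thresholds[best])
        results[(k, w)] = (np.mean(fold_f2), np.mean(fold_thr))

best_k, best_w = max(results, key=lambda kw: results[kw][0])
```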
Final model training and evaluation
- We trained the final kNN model with the selected hyperparameters (k = 81, weights = "distance") on a class-balanced training subset of the full training set from months 0–5.
- Evaluation was done on the test set using the optimal threshold τ = 0.134, and we reported F2, recall, precision, and accuracy.
Linear-Kernel Support Vector Machine
Additional data preprocessing
- Create indicator columns for columns with missing values.
- Standardize columns.
- Mean imputation of missing values.
After preprocessing
- Training data count: 794989.
- Test data count: 205011.
- Number of features: 49.
Modelling
- Model: LinearSVC(C=C, dual=False, class_weight={0: 1, 1: R}) from sklearn.svm.
- Hyperparameters: C, R.
- Coarse-grained, stratified 3‑fold logspace search for C ∈ {0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0} and R ∈ {1.0, 10.0, 100.0, 1000.0}, using F2 as the evaluation metric.
- Fine-grained, stratified 3‑fold linspace search in the best-magnitude neighborhoods for C and R, again using F2.
- Train the final model on all of the training data using the best (C, R) pair (see the sketch below).
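A sketch of the coarse pass, assuming X_train_proc and y_train hold the preprocessed training features and labels; the fine-grained pass repeats the same loop over a narrower linspace grid:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import fbeta_score

C_grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
R_grid = [1.0, 10.0, 100.0, 1000.0]
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

best_pair, best_f2 = None, -1.0
for C in C_grid:
    for R in R_grid:
        scores = []
        for tr_idx, va_idx in cv.split(X_train_proc, y_train):
            svm = LinearSVC(C=C, dual=False, class_weight={0: 1, 1: R})
            svm.fit(X_train_proc[tr_idx], y_train.iloc[tr_idx])
            pred = svm.predict(X_train_proc[va_idx])
            scores.append(fbeta_score(y_train.iloc[va_idx], pred, beta=2))
        if np.mean(scores) > best_f2:
            best_pair, best_f2 = (C, R), np.mean(scores)

# Refit on all training data with the best (C, R).
final_svm = LinearSVC(C=best_pair[0], dual=False,
                      class_weight={0: 1, 1: best_pair[1]})
final_svm.fit(X_train_proc, y_train)
```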
RBF-Kernel Support Vector Machine Ensemble
Additional data preprocessing
- Identical to the linear-kernel SVM.
Modelling
Due to the high memory cost of storing Gram matrices when training kernel SVMs, we chunk the training data into manageably sized subsets, train an SVM model on each of them, then predict 1 if at least K out of N of the SVM models predict 1.
- Fix N = 7.
- Divide the training dataset into stratified folds of size ≈ 10000. Create 3 splits of 8 stratified folds each. For each split, 7 of the folds are used to train 7 SVM models, and the remaining fold is used as a validation set. This requires 24 folds; we have approximately 79 folds available (794989 / 10000 ≈ 79).
- Hyperparameters and ranges: C ∈ [10^-2, 10], R ∈ [10, 1000], gamma ∈ [10^-3, 1] (RBF kernel), and K ∈ {1, …, 7} (the number of learners needed to make a positive prediction).
- Because an exhaustive grid search is infeasible, we use random search: we randomly sample 10 different (C, R, gamma) combinations from their logspaces and combine each with K ∈ {1, …, 7}, producing 4‑D candidates (C, R, gamma, K).
- After finding the optimal (C, R, gamma, K) combination, we create 7 stratified folds of size ≈ 20000, train 7 RBF SVM models with (C, R, gamma), and define the final ensemble to predict 1 if at least K of the 7 models predict 1 (see the sketch below).
Tree-Based Methods – Decision Trees
Additional data preprocessing
- Replace all -1 values in the dataset (which indicate missing values) by setting them to 0.
- Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.
Cross-validation
- Implement 10‑fold stratified cross-validation using recall to find the best model.
- Use the Gini criterion for impurity and set the number of features to consider at each split to sqrt(n_features).
- The best depth of the decision tree was max_depth = 90 (see the sketch below).
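A sketch of the search, assuming X_train_proc and y_train as before; the depth grid shown here is illustrative (our best value was max_depth = 90):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stratified 10-fold CV over tree depth, scored by recall.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", max_features="sqrt", random_state=0),
    param_grid={"max_depth": [30, 50, 70, 90, 110]},  # illustrative grid
    scoring="recall",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_train_proc, y_train)
best_tree = search.best_estimator_
```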
We then use the model to predict on the 205011 observations from the test dataset and obtain accuracy ≈ 0.97126. The F2 score is 0.108; since recall is our primary optimization objective, this is better than logistic regression but still leaves substantial room for improvement.
Tree-Based Methods – Random Forest
Data preprocessing
- Train/test split: all observations from months 0–5 as the training set and months 6–7 as the test set.
- For prev_address_months_count, bank_months_count, and session_length_in_minutes, add missingness indicators, replace -1 with NaN, and use a median SimpleImputer.
- Standardize all features using the training set statistics.
Cross-validation and hyperparameter tuning
- Cross-validation: StratifiedKFold with n_splits = 3.
- Hyperparameter grid: n_estimators ∈ {100, 300, 500}, max_depth ∈ {None, 10, 20}, min_samples_leaf ∈ {1, 5}. In total, 18 hyperparameter combinations.
- For each combination and fold, train a Random Forest, predict with threshold 0.5, compute F2, and average F2 across folds.
- Best setting: n_estimators = 500, max_depth = 10, min_samples_leaf = 1 (see the sketch below).
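A sketch of this grid search; because the decision threshold is fixed at 0.5, a plain F2 scorer over .predict() reproduces the procedure (variable names as before):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, fbeta_score

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid=param_grid,
    scoring=make_scorer(fbeta_score, beta=2),  # F2 at the default 0.5 threshold
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_train_proc, y_train)
best_rf = search.best_estimator_
```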
Final model training and evaluation
- Train the final Random Forest model with the best hyperparameters on the full training set (months 0–5).
- Evaluate on the test set and report F2, recall, precision, and accuracy.
Tree-Based Methods – XGBoost
Additional data preprocessing
- No additional preprocessing steps beyond the common ones.
After preprocessing
- Number of features: 45.
Modelling
```python
import xgboost as xgb

params = {
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "min_child_weight": min_child_weight,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "scale_pos_weight": scale_pos_weight,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "tree_method": "hist",
    "verbosity": 0,
}

model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=best_iteration_int,
)

y_test_prob = model.predict(dtest)
y_test_pred = (y_test_prob >= best_threshold).astype(int)
```
We first tune the following hyperparameters using stratified 5‑fold cross-validation, with F2 as the optimization metric:
- learning_rate (range: logspace from 0.03 to 0.2)
- max_depth (range: integers in [3, 8])
- min_child_weight (range: [1, 10])
- subsample (range: [0.7, 1.0])
- colsample_bytree (range: [0.6, 1.0])
- scale_pos_weight (range: logspace [10, 200])
- best_threshold (range: linspace [0, 1])
The value best_iteration_int is implicitly tuned via early stopping: we average the optimal number of boosting rounds across the CV splits. Because XGBoost does not support F2 directly as an early-stopping metric, we use PR AUC as a surrogate; it also rewards good overall precision and recall.
Due to the size of the hyperparameter space, we use random search: we randomly sample 20 different (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight) combinations, then pair each combination with multiple thresholds from [0, 1], yielding candidate tuples (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight, threshold). We then train the final model on the full training data with the best hyperparameters (see the sketch below).
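A sketch of the random search using xgb.cv with early stopping on PR AUC; the per-candidate threshold sweep on held-out predictions is omitted for brevity, and dtrain is the same kind of DMatrix as in the training snippet above (variable names assumed):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
dtrain = xgb.DMatrix(X_train_proc, label=y_train)

best = {"aucpr": -np.inf}
for _ in range(20):
    params = {
        "learning_rate": float(10 ** rng.uniform(np.log10(0.03), np.log10(0.2))),
        "max_depth": int(rng.integers(3, 9)),            # integers in [3, 8]
        "min_child_weight": float(rng.uniform(1, 10)),
        "subsample": float(rng.uniform(0.7, 1.0)),
        "colsample_bytree": float(rng.uniform(0.6, 1.0)),
        "scale_pos_weight": float(10 ** rng.uniform(1, np.log10(200))),
        "objective": "binary:logistic",
        "eval_metric": "aucpr",
        "tree_method": "hist",
    }
    # Stratified 5-fold CV; early stopping on PR AUC fixes the number of rounds.
    cv_res = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                    stratified=True, early_stopping_rounds=50, seed=0)
    score = cv_res["test-aucpr-mean"].iloc[-1]
    if score > best["aucpr"]:
        best = {"aucpr": score, "params": params, "num_boost_round": len(cv_res)}
```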
Neural Networks
Data preprocessing
- Replace all -1 values (missing values) with 0.
- Normalize all numerical variables by subtracting their mean and dividing by their standard deviation.
- One-hot encode categorical variables.
- Drop device_fraud_count as it is constant-valued.
- Train/test split: first 6 months for training, last 2 months for testing.
- Number of features: 49.
Model architecture and loss
- Fully connected feed-forward neural network with three hidden layers, with dropout on the first and last layers and ReLU activations.
- Output layer: single neuron with linear output, followed by a sigmoid to convert logits to probabilities.
- To address class imbalance, we tried combinations of sampling and loss functions:
- Oversampling with surrogate loss.
- Oversampling with BCEWithLogitsLoss.
- No oversampling with BCEWithLogitsLoss.
- The best configuration was no oversampling + BCEWithLogitsLoss with positive-class weight pos_weight = (# of negatives / # of positives); a sketch of this setup follows.
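A sketch of the architecture and weighted loss in PyTorch, shown under the best configuration; n_features = 49 and the y_train counts are assumed from the preprocessing above:

```python
import torch
import torch.nn as nn

class FraudNet(nn.Module):
    def __init__(self, n_features, hidden=(256, 128, 64), p_drop=0.3):
        super().__init__()
        h1, h2, h3 = hidden
        self.net = nn.Sequential(
            nn.Linear(n_features, h1), nn.ReLU(), nn.Dropout(p_drop),  # first hidden layer
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(), nn.Dropout(p_drop),          # last hidden layer
            nn.Linear(h3, 1),  # single logit; apply a sigmoid at inference time
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Class-weighted loss: pos_weight = (# negatives) / (# positives).
n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_neg / n_pos))

model = FraudNet(n_features=49)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```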
Hyperparameter tuning and training
- Small grid search with 27 combinations:
- Hidden layer sizes: (64, 32, 16), (128, 64, 32), (256, 128, 64)
- Dropout probability: 0.2, 0.3, 0.5
- Learning rate: 1e-4, 1e-3, 3e-3
- For each configuration:
- Optimizer: Adam with weight_decay = 1e-4.
- Batch size: 4096.
- Up to 20 epochs with early stopping based on validation F2.
- After each epoch, compute validation probabilities, sweep a threshold grid between 0.01 and 0.99, and record the threshold that maximizes F2.
- Best setting: hidden_sizes = (256, 128, 64), dropout = 0.3, learning_rate = 0.001, with optimal threshold τ = 0.84.
Final evaluation
We keep the best model (highest validation F2) and evaluate it on the test set. Using the tuned threshold τ = 0.84, we report test F2, recall, precision, and accuracy. This neural network serves as a high-capacity predictive model (less interpretable than kNN or Random Forest, which we use for interpretability and comparison).
Final Results
| Model | Accuracy | Recall | Precision | F2 Score | ROC AUC | PR AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.9860 | 0.003 | 0.625 | 0.004 | 0.5017 | 0.016 |
| KNN | 0.8277 | 0.7318 | 0.0574 | 0.2186 | 0.859 | 0.136 |
| SVM – Linear Kernel | 0.9632 | 0.399 | 0.165 | 0.3108 | 0.8838 | 0.1662 |
| SVM – RBF Kernel Ensemble | 0.9568 | 0.435 | 0.147 | 0.3128 | 0.7314 | 0.1046 |
| Decision Tree | 0.9713 | 0.114 | 0.089 | 0.108 | 0.5471 | 0.022 |
| Random Forest | 0.8643 | 0.6959 | 0.0692 | 0.2475 | 0.869 | 0.144 |
| XGBoost | 0.9533 | 0.5083 | 0.1522 | 0.3463 | 0.896 | 0.200 |
| Neural Network | 0.9626 | 0.4051 | 0.1638 | 0.3130 | 0.881 | 0.170 |
Interpretable feature analysis
We filtered for the features with the strongest coefficient signals (absolute value ≥ 0.1) from the linear SVM model, as it is the easiest to interpret among the models with better F2 scores.
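A sketch of that filter, assuming final_svm is the fitted LinearSVC from the linear-SVM sketch and feature_names lists the 49 post-preprocessing column names:

```python
import pandas as pd

# Pair each coefficient with its feature name and keep the strong signals.
coefs = pd.Series(final_svm.coef_.ravel(), index=feature_names)
strong = coefs[coefs.abs() >= 0.1].sort_values()
print(strong)  # positive values point toward fraud, negative values against it
```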
Positive signals for fraud
- Missing values in prev_address_months_count (months in previous registered address) are a relatively strong signal for fraud. One potential reason is that fraudsters tend to avoid providing a traceable address history.
- The device OS being "Windows" is a relatively strong signal for fraud. One possible explanation is that fraudsters often operate from cheap and widely available Windows environments.
Negative signals for fraud
- Keep_alive_session = 1 (the user keeping the session alive on logout) is a negative indicator of fraud. Fraudsters may prefer not to stay connected longer than necessary.
- Higher name_email_similarity is a negative indicator of fraud. Fraudsters often use randomly generated or non-identifying email handles.
- phone_home_valid = 1 (a valid home phone number) is a negative indicator of fraud, as fraudsters may provide fake or burner numbers.
- Has_other_cards = 1 (applicant has other cards with the same bank) is a negative signal of fraud, possibly because multiple cards under the same identity increase detectability.
- Certain housing_status categories (for example, "BE", "BB", "BC" in the anonymized encoding) are negative indicators of fraud, but since these are anonymized, we cannot interpret them semantically.
Conclusions
Across models, we see modest, non-negligible improvements over simple baselines. However, the Neural Network, XGBoost, linear-kernel SVM, and the RBF-kernel SVM ensemble all achieve similar F2 scores, suggesting limited remaining predictive signal in the current tabular features.
In other words, we are likely close to the Pareto frontier with respect to our current feature set. Significant future improvements will likely come from feature engineering rather than model choice: deriving new features, exploring interactions, and incorporating richer temporal or behavioral signals.
Contributions
- Jerry Bao (@bao-jerry)
- Saim Zafar (@s29zafar)
Source
Source code and experiment artifacts are maintained in a GitHub repository.