Auto Machine Learning

Feature_scML automl func

automl stands for Auto Machine Learning and combines the fs module and the cls module. Both the train data and the test data must be in CSV format. Hyperparameter optimization is integrated into the training process. Test data is optional: if no test data is given, the train data is split into train and test sets (train:test = 8:2). To simplify the machine learning process, the module automatically applies the selected feature selection method and trains machine learning models on incrementally larger feature sets. The user obtains the feature data set ranked by the feature selection method and a result table of the incremental feature training.
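
The split and the incremental schedule described above can be pictured with a short sketch. This is not Feature_scML's own code: the helper names split_train_test and feature_counts are hypothetical, the plain shuffled split is an assumption (the tool may split differently, for example with stratification), and treating the end of the range as inclusive is inferred from the output files shown in the examples.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into train and test parts (8:2 by default)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def feature_counts(n_features, start=10, end=None, step=10):
    """Return the feature counts to evaluate, from start to end inclusive."""
    if end is None:  # assumed meaning of "default=all features"
        end = n_features
    end = min(end, n_features)
    return list(range(start, end + 1, step))


train, test = split_train_test(range(100))
print(len(train), len(test))                           # 80 20
print(feature_counts(100, start=20, end=60, step=20))  # [20, 40, 60]
```

With the defaults and 100 ranked features, feature_counts(100) yields 10, 20, ..., 100, which matches the feature numbers printed in the first example.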

$Feature_scML automl -h
usage: automl

optional arguments:
-h, --help            show this help message and exit
-i INPUT_TRAIN, --input_train INPUT_TRAIN
                        Input train data (CSV)
-t INPUT_TEST, --input_test INPUT_TEST
                        Input test data (CSV)
--method {fscore,pca,cv2,rfc,mic,turf,linearsvm}, -m {fscore,pca,cv2,rfc,mic,turf,linearsvm}
                        Select a feature selection method
--start START         Feature Number start (default=10)
--end END             Feature Number end (default=all features)
--step STEP           Feature Number step (default=10)
--njobs NJOBS         Number of jobs to run in parallel (default=1)
--classifier {svm,rf,gnb,lr}, -c {svm,rf,gnb,lr}
                        Select a machine learning method:
                        lr (Logical Regression)
                        svm (Support Vector Machine)
                        rf (Random Forest)
                        gnb (Gaussian Naive Bayes)
--getmodel GETMODEL   Generate model files (default=False)
-o OUTPUT, --output OUTPUT
                        Output directory (default=current directory)

Command

Parameters         Optional                                     Description
-----------------  -------------------------------------------  ----------------------------------------------------
--input_train, -i  filename path                                Input train data filename path (CSV format)
--input_test, -t   filename path                                Input test data filename path (CSV format)
--method, -m       fscore, pca, cv2, rfc, mic, turf, linearsvm  The details of the methods are in the fs module
--classifier, -c   lr, svm, rf, gnb                             lr (Logistic Regression), svm (Support Vector Machine),
                                                                rf (Random Forest), gnb (Gaussian Naive Bayes)
--start            int, default=10                              Minimal number of features
--end              int, default=all features                    Maximum number of features
--step             int, default=10                              Step size of incremental feature training
--output, -o       output directory                             Output directory (default: current directory)
--njobs            int, default=1                               The number of jobs to run in parallel
--getmodel         True or False                                If True, the model file will be saved
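
When automl is called from a script, the options listed above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of Feature_scML: the function name build_automl_command is invented for illustration, and it treats the train file, classifier and method as required only because every example in this section passes them. It uses only flags that appear in the help text.

```python
import shlex


def build_automl_command(input_train, classifier, method, input_test=None,
                         start=None, end=None, step=None, njobs=None,
                         getmodel=None, output=None):
    """Assemble the argument list for `Feature_scML automl`."""
    cmd = ["Feature_scML", "automl", "-i", input_train,
           "-c", classifier, "-m", method]
    optional = [("-t", input_test), ("--start", start), ("--end", end),
                ("--step", step), ("--njobs", njobs),
                ("--getmodel", getmodel), ("-o", output)]
    for flag, value in optional:
        if value is not None:  # omitted flags fall back to the tool's defaults
            cmd += [flag, str(value)]
    return cmd


cmd = build_automl_command("example.csv", "svm", "cv2",
                           start=20, step=20, end=60, njobs=20, getmodel=True)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where Feature_scML is installed.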

Example

# default start, end, and step
$Feature_scML automl -i example.csv -c svm -m cv2
The identity link function does not respect the domain of the Gamma family.
feature number: 10
train accuracy: 0.7923
test_accuracy: 0.7636
best parameters: {'C': 8192, 'gamma': 0.00048828125}
feature number: 20
train accuracy: 0.8276
test_accuracy: 0.7773
best parameters: {'C': 8, 'gamma': 0.5}
...
feature number: 100
train accuracy: 0.8824
test_accuracy: 0.8682
best parameters: {'C': 512, 'gamma': 0.001953125}
DONE!

The run produces a result table (dataframe) with the columns feature number, train_accuracy, test_accuracy and parameters (the optimal hyperparameters).

feature number  train_accuracy      test_accuracy       parameters
--------------  ------------------  ------------------  -------------------------------------
10              0.7922727272727272  0.7636363636363637  "{'C': 8192, 'gamma': 0.00048828125}"
20              0.8276233766233766  0.7772727272727272  "{'C': 8, 'gamma': 0.5}"
100             0.8824285714285715  0.8681818181818182  "{'C': 512, 'gamma': 0.001953125}"
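
A typical next step is to pick the feature number with the highest test accuracy. The following is a minimal sketch using only the Python standard library; the helper name best_row is hypothetical, the sample rows are copied from the result table above, and it is an assumption that the CSV written by the tool uses exactly these column names and stores the parameters as a quoted dictionary string.

```python
import ast
import csv
import io

# Sample rows copied from the result table above; a real run writes a CSV file.
SAMPLE = '''feature number,train_accuracy,test_accuracy,parameters
10,0.7922727272727272,0.7636363636363637,"{'C': 8192, 'gamma': 0.00048828125}"
20,0.8276233766233766,0.7772727272727272,"{'C': 8, 'gamma': 0.5}"
100,0.8824285714285715,0.8681818181818182,"{'C': 512, 'gamma': 0.001953125}"
'''


def best_row(csv_file):
    """Return (feature number, test accuracy, parameters) of the best row."""
    rows = list(csv.DictReader(csv_file))
    best = max(rows, key=lambda r: float(r["test_accuracy"]))
    return (int(best["feature number"]),
            float(best["test_accuracy"]),
            ast.literal_eval(best["parameters"]))  # safely parse the dict string


print(best_row(io.StringIO(SAMPLE)))
```

For the sample rows this selects the 100-feature model. To read a real result file, pass an open file handle instead of the io.StringIO object.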

# If no output directory is given, the model files are saved in the current directory
# (here example_20_svm.joblib, example_40_svm.joblib and example_60_svm.joblib).
# start = 20, step = 20, end = 60
$Feature_scML automl -i example.csv -c svm -m cv2 --start 20 --step 20 --end 60 --njobs 20 --getmodel True
$ls
20-60_cv2_SVM_accuracy.csv  example.csv  example_40_svm.joblib  example_cv2.csv
example_20_svm.joblib       example_60_svm.joblib  example_cv2_data.csv
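
The listing above suggests that each model file is named after the input file, the feature number and the classifier. As a rough sketch under that assumption (the pattern is inferred from this one listing, and the helper name models_by_feature_count is hypothetical), the saved models can be indexed by feature number:

```python
import re

# File names copied from the `ls` output above.
listing = ["20-60_cv2_SVM_accuracy.csv", "example.csv", "example_40_svm.joblib",
           "example_cv2.csv", "example_20_svm.joblib", "example_60_svm.joblib",
           "example_cv2_data.csv"]

# Assumed pattern: <input name>_<feature number>_<classifier>.joblib
MODEL_NAME = re.compile(r"^(?P<prefix>.+)_(?P<n>\d+)_(?P<clf>[a-z]+)\.joblib$")


def models_by_feature_count(names):
    """Map feature number -> model file name for every matching .joblib entry."""
    found = {}
    for name in names:
        match = MODEL_NAME.match(name)
        if match:
            found[int(match.group("n"))] = name
    return dict(sorted(found.items()))


print(models_by_feature_count(listing))
```

Files with the .joblib extension are normally opened with joblib.load; what kind of object Feature_scML stores in them is not documented in this section.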