Auto Machine Learning
Feature_scML automl func
automl (Auto Machine Learning) combines the fs module and the cls module. Training data and test data must be in CSV format. Hyperparameter optimization is integrated into the training process. Test data is optional: if no test data is supplied, the training data is split into training and test sets (train:test = 8:2). To simplify the machine learning workflow, the module automatically runs the selected feature selection method and then trains machine learning models on incrementally larger sets of the ranked features. The user obtains the feature data set ranked by the feature selection method and a result table of the incremental feature training.
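The workflow described above can be sketched with scikit-learn. This is a minimal illustration, not the Feature_scML implementation: the function name `incremental_feature_training`, the ANOVA F-score ranking (a stand-in for the selectable methods), the small parameter grid, and the use of mean cross-validated accuracy as "train accuracy" are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


def incremental_feature_training(X, y, start=10, end=None, step=10, seed=0):
    """Rank features, then train on the top-k features for growing k."""
    end = X.shape[1] if end is None else end

    # Hold out 20% of the rows as test data (train:test = 8:2).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    # Rank features by ANOVA F-score, highest first. This is a stand-in for
    # the feature selection method; ranking on the training split only is a
    # choice made in this sketch.
    scores, _ = f_classif(X_train, y_train)
    ranking = np.argsort(scores)[::-1]

    # Hypothetical small grid; the grid used by the tool is not documented here.
    param_grid = {"C": [0.5, 8, 512], "gamma": [2 ** -9, 2 ** -5, 0.5]}

    results = []
    for k in range(start, end + 1, step):
        cols = ranking[:k]
        search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
        search.fit(X_train[:, cols], y_train)
        results.append({
            "feature number": k,
            # Assumption: mean cross-validated accuracy on the training split.
            "train_accuracy": search.best_score_,
            "test_accuracy": search.score(X_test[:, cols], y_test),
            "parameters": search.best_params_,
        })
    return results


# Synthetic data so the sketch runs without an input file.
X, y = make_classification(
    n_samples=200, n_features=40, n_informative=8, random_state=0
)
for row in incremental_feature_training(X, y, start=10, step=10):
    print(row)
```

Each printed row corresponds to one line of the result table: the number of top-ranked features used, the two accuracies, and the best hyperparameters found for that feature count.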
$Feature_scML automl -h
usage: automl
optional arguments:
-h, --help show this help message and exit
-i INPUT_TRAIN, --input_train INPUT_TRAIN
Input train data (CSV)
-t INPUT_TEST, --input_test INPUT_TEST
Input test data (CSV)
--method {fscore,pca,cv2,rfc,mic,turf,linearsvm}, -m {fscore,pca,cv2,rfc,mic,turf,linearsvm}
Select a feature selection method
--start START Feature Number start (default=10)
--end END Feature Number end (default=all features)
--step STEP Feature Number step (default=10)
--njobs NJOBS Number of jobs to run in parallel (default=1)
--classifier {svm,rf,gnb,lr}, -c {svm,rf,gnb,lr}
Select a machine learning method:
lr (Logical Regression)
svm (Support Vector Machine)
rf (Random Forest)
gnb (Gaussian Naive Bayes)
--getmodel GETMODEL Generate model files (default=False)
-o OUTPUT, --output OUTPUT
Output directory (default=current directory)
Command
| Parameters | Options | Description |
|---|---|---|
| `--input_train`, `-i` | filename path | Input train data filename path (CSV format) |
| `--input_test`, `-t` | filename path | Input test data filename path (CSV format) |
| `--method`, `-m` | fscore, pca, cv2, rfc, mic, turf, linearsvm | The details of the methods are in the fs module |
| `--classifier`, `-c` | lr, svm, rf, gnb | |
| `--start` | int, default=10 | Minimal number of features |
| `--end` | int, default=all features | Maximum number of features |
| `--step` | int, default=10 | Step size of incremental feature training |
| `--output`, `-o` | output directory | Output directory (default: current directory) |
| `--njobs` | int, default=1 | The number of jobs to run in parallel |
| `--getmodel` | True or False | If True, model file will be saved |
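As a rough illustration of how `--start`, `--end`, and `--step` interact, the feature counts that get evaluated can be sketched as an inclusive range. The helper `feature_counts` and the inclusive upper bound are assumptions inferred from the example runs (10, 20, ..., 100 and 20, 40, 60), not code taken from the tool.

```python
def feature_counts(n_features, start=10, end=None, step=10):
    """Return the feature-set sizes to evaluate, upper bound included."""
    # When no end is given, every feature is allowed (default=all features).
    end = n_features if end is None else end
    return list(range(start, end + 1, step))


# n_features=100 is a hypothetical data set size used only for this example.
print(feature_counts(100, start=20, end=60, step=20))  # [20, 40, 60]
print(len(feature_counts(100)))                        # 10
```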
Example
# default start, end, and step
$Feature_scML automl -i example.csv -c svm -m cv2
The identity link function does not respect the domain of the Gamma family.
feature number: 10
train accuracy: 0.7923
test_accuracy: 0.7636
best parameters: {'C': 8192, 'gamma': 0.00048828125}
feature number: 20
train accuracy: 0.8276
test_accuracy: 0.7773
best parameters: {'C': 8, 'gamma': 0.5}
...
feature number: 100
train accuracy: 0.8824
test_accuracy: 0.8682
best parameters: {'C': 512, 'gamma': 0.001953125}
DONE!
The run produces a result table (a dataframe) with the columns feature number, train_accuracy, test_accuracy, and parameters (the optimal hyperparameters).
| feature number | train_accuracy | test_accuracy | parameters |
|---|---|---|---|
| 10 | 0.7922727272727272 | 0.7636363636363637 | "{'C': 8192, 'gamma': 0.00048828125}" |
| 20 | 0.8276233766233766 | 0.7772727272727272 | "{'C': 8, 'gamma': 0.5}" |
| … | … | … | … |
| 100 | 0.8824285714285715 | 0.8681818181818182 | "{'C': 512, 'gamma': 0.001953125}" |
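A result table like this one can be post-processed to pick the best feature count. The sketch below uses only the Python standard library and parses three rows copied from the table. The exact CSV layout (comma-separated, these header names, no index column) is an assumption, so the column handling may need adjusting for the real output file.

```python
import ast
import csv
import io

# Rows copied from the result table; in practice this text would be read
# from the accuracy CSV in the output directory (layout assumed, see above).
RESULT_CSV = '''feature number,train_accuracy,test_accuracy,parameters
10,0.7922727272727272,0.7636363636363637,"{'C': 8192, 'gamma': 0.00048828125}"
20,0.8276233766233766,0.7772727272727272,"{'C': 8, 'gamma': 0.5}"
100,0.8824285714285715,0.8681818181818182,"{'C': 512, 'gamma': 0.001953125}"
'''


def load_results(text):
    """Parse the result table into a list of typed dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "feature number": int(row["feature number"]),
            "train_accuracy": float(row["train_accuracy"]),
            "test_accuracy": float(row["test_accuracy"]),
            # The parameters column holds a dict written as a string.
            "parameters": ast.literal_eval(row["parameters"]),
        })
    return rows


rows = load_results(RESULT_CSV)
# Select the feature count with the highest test accuracy.
best = max(rows, key=lambda r: r["test_accuracy"])
print(best["feature number"], best["parameters"])
# 100 {'C': 512, 'gamma': 0.001953125}
```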
# If no output directory is given, model files are saved in the current directory
# (e.g. example_20_svm.joblib is saved in the current directory).
# start = 20, step = 20, end = 60
$Feature_scML automl -i example.csv -c svm -m cv2 --start 20 --step 20 --end 60 --njobs 20 --getmodel True
$ls
20-60_cv2_SVM_accuracy.csv example.csv example_40_svm.joblib example_cv2.csv
example_20_svm.joblib example_60_svm.joblib example_cv2_data.csv