On Kaggle there is an example dataset on the quality of red wine. I wrote some code for it using scikit-learn and pandas:
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Read dataset
    wine = pd.read_csv('~/Downloads/winequality-red.csv', sep=';')
    attrs = wine.drop(['quality'], axis=1)
    header = list(attrs)
    attrs = attrs.values

    # Use scaler to normalize data
    # (scaled_attrs is computed here, but the models below are fit on attrs)
    scaler = StandardScaler()
    scaled_attrs = scaler.fit_transform(attrs)
    quality = wine['quality'].values

    # SVM classifier
    svr = SVC(kernel='rbf', max_iter=-1)
    svr.fit(attrs, quality)

    # Randomized decision trees classifier
    dt = ExtraTreesClassifier()
    dt.fit(attrs, quality)
    ls = list(zip(dt.feature_importances_, header))
    ls.sort(key=lambda x: x[1])  # sorts by feature name, not by importance
    for importance, name in ls:
        print(name, importance)
    print('\n\n')

    # Cross validation on these two classifiers
    for reg in [svr, dt]:
        scores = cross_val_score(reg, attrs, quality,
                                 scoring='neg_mean_squared_error', cv=10)
        mse = -scores  # negated neg-MSE is the per-fold MSE, not the RMSE
        print(reg)
        print(mse.mean(), mse.std())
        print('\n')
The snippet above prints:
    alcohol 0.1438906634767823
    chlorides 0.07953780339531004
    citric acid 0.07979101058207233
    density 0.0846765183778148
    fixed acidity 0.07686725880938272
    free sulfur dioxide 0.07178658192019563
    pH 0.07797509374376276
    residual sugar 0.0796105749270121
    sulphates 0.11872569296381115
    total sulfur dioxide 0.0993798893196299
    volatile acidity 0.08775891248422625

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
    0.6983420378445301 0.04803296683789781

    ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
      max_depth=None, max_features='auto', max_leaf_nodes=None,
      min_impurity_decrease=0.0, min_impurity_split=None,
      min_samples_leaf=1, min_samples_split=2,
      min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
      oob_score=False, random_state=None, verbose=0, warm_start=False)
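The importances above come out in alphabetical order because the sort key is the feature name. To read off a ranking, the same pairs can be sorted by importance instead. A minimal sketch, with the values copied from the output above (the names `importances` and `ranked` are only for illustration):

```python
# Feature importances copied from the ExtraTreesClassifier output above
importances = {
    'alcohol': 0.1438906634767823,
    'chlorides': 0.07953780339531004,
    'citric acid': 0.07979101058207233,
    'density': 0.0846765183778148,
    'fixed acidity': 0.07686725880938272,
    'free sulfur dioxide': 0.07178658192019563,
    'pH': 0.07797509374376276,
    'residual sugar': 0.0796105749270121,
    'sulphates': 0.11872569296381115,
    'total sulfur dioxide': 0.0993798893196299,
    'volatile acidity': 0.08775891248422625,
}

# Sort by importance (the second element of each pair), largest first
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, importance in ranked:
    print(f'{name:22s}{importance:.4f}')
```

In the snippet itself, the equivalent change would be `ls.sort(key=lambda x: x[0], reverse=True)`. Since `ExtraTreesClassifier` is randomized and no `random_state` is set, the exact values will differ from run to run.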
It looks like the most important feature for predicting the quality of red wine is ‘alcohol’. Intuitive, right?
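Two details of the snippet are worth keeping in mind when reading these numbers: `scaled_attrs` is computed, but both models are fitted and cross-validated on the unscaled `attrs`, and negating `neg_mean_squared_error` gives the MSE rather than the RMSE. Below is a rough sketch of how scaling could be folded into cross-validation with a pipeline and how a per-fold RMSE could be derived. It uses synthetic data from `make_classification` as a stand-in so that it runs without the CSV (an assumption, not the wine data), and the names `X`, `y`, and `model` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption): 11 features, 3 classes
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)

# Putting the scaler in a pipeline refits it on the training part of each fold
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

scores = cross_val_score(model, X, y,
                         scoring='neg_mean_squared_error', cv=10)
mse = -scores        # per-fold mean squared error
rmse = np.sqrt(mse)  # per-fold root mean squared error
print(rmse.mean(), rmse.std())
```

With the real data, `X` and `y` would be `attrs` and `quality` from the snippet.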