With another Kaggle Playground competition in the books, I'm proud to share my results and strategies from this month's competition!
I achieved my best Playground competition score yet, finishing in the top 33% of over 2,700 competitors, and along the way I learned quite a bit about working on multi-class classification problems. This month's competition was a multiclass classification problem where we predicted the academic risk of students in higher education, given a dataset of common student features. The dataset was provided by Kaggle and generated from a dataset collected by the Polytechnic Institute of Portalegre in Portugal and archived by the University of California, Irvine; a link to the original dataset can be found HERE.
I started the competition with some exploratory data analysis to gain insights into the dataset, identify which features might be important, and determine any transformations that would be needed to feed the data into my models. I then began model building with a single RandomForestClassifier, which I developed as a baseline model because of its simplicity and ease of use for classification tasks. Then I built a few more machine learning models, such as Gradient Boosting classification models, and finally my best model: a Categorical Boosting (CatBoost) classification model. This CatBoostClassifier achieved an accuracy score of 0.83513, which placed me at 879 out of 2,739 competitors (top 33%)! This model was rather simple to build, as I didn't need to specify hyperparameters or engineer features specifically for it. I believe the main reason this model performed so well is the many categorical features used to predict our categorical target, which falls squarely within the model's specialization. For further improvement, I could do more feature selection as well as tune the model's hyperparameters using Optuna to reach a higher accuracy score.
From there, I developed ensemble models using a Voting Classifier and a Stacking Classifier that combined many of the models I had built previously, as well as others I was testing to see if they would improve my score. However, due to the amount of noise, the ensemble method was unable to beat my CatBoostClassifier score: the ensemble models had a tendency to overfit to the training data, whereas the CatBoostClassifier was robust enough to fit both the training and testing data equally well. To improve my ensembles, I would like to do further testing and tuning, as well as run cross-validated grid searches to find the optimal hyperparameters for each model.
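The voting-versus-stacking setup described above can be sketched with scikit-learn. This is a generic sketch on synthetic three-class data, not the exact model list from the competition notebook:

```python
# Sketch of combining base classifiers with VotingClassifier and
# StackingClassifier; the synthetic data stands in for the real features.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0, stratify=y)

base = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0))]

# Soft voting averages the base models' predicted class probabilities...
voter = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)
# ...while stacking fits a meta-learner on out-of-fold base predictions.
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression(max_iter=1000),
                             cv=5).fit(X_tr, y_tr)

vote_acc = accuracy_score(y_val, voter.predict(X_val))
stack_acc = accuracy_score(y_val, stacker.predict(X_val))
print(f"voting: {vote_acc:.3f}  stacking: {stack_acc:.3f}")
```

One design note: stacking's out-of-fold scheme (the `cv=5` argument) exists precisely to limit the overfitting tendency mentioned above, but with noisy data and correlated base models it can still underperform a single strong learner.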
Lastly, I built a Deep Learning classification model using TensorFlow. This model took the most time by far, as I ran into quite a few issues while tuning my base DNN model. I struggled with supplying class weights for the multiclass model, which led me to try various resampling methods to smooth out the imbalanced dataset. The most successful resampling method ended up being SMOTE, which surprised me, as I have generally believed that SMOTE tends to add unnecessary noise to a model, which can cause accuracy to decrease. Each resampling method vastly improved our ability to predict the minority class, "Enrolled", which our model otherwise struggled to predict. However, even with this boost in precision and recall for the "Enrolled" class, the resampling methods were unable to beat my base DNN model's overall accuracy. In addition, this base DNN couldn't beat my ensemble or CatBoostClassifier accuracy scores.
This project taught me a lot, such as: simpler can perform better, SMOTE actually works well when you have a class imbalance (compared to plain over- or under-sampling), and DNNs are not always an improvement over a standard ML model. This last point is important, as a DNN requires far more compute than a CatBoostClassifier, making it more costly for worse results in this case. This Playground Series competition gave me some solid experience using different techniques to improve a classification score, which will be very valuable for future models and competitions.
I'm extremely excited for the next Playground Series competition and the chance to continue improving my machine learning and data science skills. Hopefully I can improve on this month's score as I strive to become a more complete data scientist. You can find the code I created for this competition on my GitHub under the kaggle_comps repository, or by clicking this link HERE.
#Kaggle #DataScience #MachineLearning #TensorFlow #DataAnalysis #Competition