Introduction
On 22 July 2016, my friend Anchal Gupta and I participated in an Analytics Vidhya hackathon named ‘The Smart Recruits’.
This blog is about our approach to this particular problem. The problem statement was about Fintro, a financial distribution company that has built an offline distribution channel across India over the last 10 years. They sell financial products to consumers by hiring agents into their network, and they were looking for data scientists like us (the participants) to predict the target variable for each potential agent, which would help them identify the right agents to hire based on their previous recruitment process.
Link: https://datahack.analyticsvidhya.com/contest/the-smart-recruits/
About the Data
The data was a blend of various parameters such as Manager DOB, Manager Grade, and further details about the managers and the applicants. The target variable was Business Sourced, indicating whether business was sourced by the agent or not.
Our approach
We started by glancing at the data and discovered many NA fields in the various manager details, affecting around 1507 data points. We tackled this by setting the NA fields aside and doing a multivariate analysis on the remaining data. We treated the outliers using the mean and median values of the respective variables, handled the missing values by visualizing the data points and imputing appropriate values, and then created dummy variables. We did the predictive modelling on the resulting 51 variables by applying 100 bags of the XGBoost algorithm, which gave an AUC of roughly 0.62. Later we applied glmnet for feature selection and kept 18 variables, which improved our AUC to 0.684 on the public leaderboard.
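To make the modelling steps concrete, here is a minimal sketch of that kind of pipeline in Python. Our actual competition code lives in the GitHub repository linked at the end and may differ in detail; the file name, column names such as Business_Sourced and ID, and the hyperparameters below are placeholders rather than the exact values we used.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder file and column names for the competition data.
train = pd.read_csv("train.csv")
target = train.pop("Business_Sourced")
train = train.drop(columns=["ID"], errors="ignore")  # ID ignored here, as in our first model

# Impute missing values: median for numeric columns, most frequent value for categoricals.
for col in train.columns:
    if train[col].dtype == object:
        train[col] = train[col].fillna(train[col].mode()[0])
    else:
        train[col] = train[col].fillna(train[col].median())

# One-hot encode categoricals (the "dummy variables" step).
X = pd.get_dummies(train)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, target, test_size=0.2, stratify=target, random_state=0)

# "100 bags of XGBoost": train 100 models on bootstrap samples of the
# training rows and average their predicted probabilities.
n_bags = 100
val_pred = np.zeros(len(X_val))
for bag in range(n_bags):
    rng = np.random.RandomState(bag)
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    model = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                              learning_rate=0.05, random_state=bag)
    model.fit(X_tr.iloc[idx], y_tr.iloc[idx])
    val_pred += model.predict_proba(X_val)[:, 1] / n_bags

print("Bagged XGBoost validation AUC:", roc_auc_score(y_val, val_pred))
```

For the glmnet step, a rough scikit-learn stand-in (an assumption, not our exact code) is L1-penalised logistic regression, keeping only the features whose coefficients stay non-zero:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# The L1 penalty drives uninformative coefficients to exactly zero; the
# surviving columns form the reduced feature set (18 variables in our run).
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_tr_scaled, y_tr)

selected = X_tr.columns[lasso.coef_[0] != 0]
print(len(selected), "features selected:", list(selected))
```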
Later, by looking at the winner’s approach, we found that this dataset was really about finding the leakage in the data and selecting the golden features, which is reflected in the winner’s private leaderboard score of roughly 0.76. From experience, we all tend to ignore the ID column while making predictions and use it only at the end to build the submission file. In this particular problem, however, the ID variable revealed the working pattern of each manager throughout a single day, i.e. it showed that some managers got lazy in the second half of the day. That was the golden feature which could reduce over-fitting and improve the private leaderboard AUC.
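As an illustration only (this is our reconstruction after reading the winner’s write-up, not their actual code, and the column names Manager_ID, Application_Receipt_Date and ID are assumptions), such a golden feature could be derived by ranking the application IDs within each manager and day, so that a high normalised rank roughly corresponds to applications handled in the second half of the day:

```python
import pandas as pd

# Hypothetical reconstruction of the ID-based "golden feature": within each
# (manager, application date) group, rank the IDs to approximate the order in
# which a manager processed applications during the day. A high normalised
# rank roughly means the application was handled in the second half of the day.
df = pd.read_csv("train.csv")

grp = df.groupby(["Manager_ID", "Application_Receipt_Date"])["ID"]
df["id_rank_in_day"] = grp.rank(method="first")
df["day_position"] = df["id_rank_in_day"] / grp.transform("count")
```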
Result
With this prediction, we secured 14th rank on the public leaderboard. As there was a little overfitting, our AUC on the private leaderboard was roughly 0.62, which finally put us at 30th rank.
What we learnt
The biggest thing we learnt from this hackathon is to not always ignore the ID column, as it might carry important insights. Also, there are better approaches to selecting important features than glmnet, since glmnet gives widely varying results every time we run it.
End Notes
You can refer to my GitHub repository for the code:
https://github.com/aman1391/Smart–Recruits