Arief Rachman Hakim
As employees, it helps to know which factors can affect our salary. One of them is years of experience: in general, the more experience we have in a field, the higher the salary we can expect. In this post I make some predictions using a Kaggle dataset that covers the relationship between years of experience and salary.
1. Download Dataset
This dataset pairs each employee's years of experience with their salary, so we can see how one affects the other.
2. Import Library
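The first step is to import the Python libraries we need: pandas for loading the data, matplotlib for plotting, and scikit-learn for splitting the data, fitting the model, and computing the metrics.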
3. Read Dataset
Now we read the dataset with the pandas library. The dataset has two columns: YearExperience and Salary.
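As a minimal sketch of this step, the snippet below loads a CSV with pandas. The file contents are a made-up stand-in for the downloaded Kaggle file (the values are invented for illustration), and the column names follow the ones used in this post.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the downloaded CSV file.
csv_text = """YearExperience,Salary
1.1,39000
2.0,43500
3.2,54500
4.0,57000
5.1,67000
6.0,82000
"""

# With the real file, pass its path to pd.read_csv instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())     # first rows of the table
print(df.shape)      # (number of rows, number of columns)
```

With the real dataset, only the argument to pd.read_csv changes.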
4. EDA
Now perform an Exploratory Data Analysis (EDA). First we check whether any null values are present, then look at the column information, and finally describe the data to see the mean, standard deviation, minimum, maximum, and so on.
As we can see, the data doesn’t have any null values.
The data types are float for YearExperience and int64 for Salary.
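The three checks described above can be sketched as follows. The DataFrame here is built from made-up sample values rather than the real Kaggle data, so the printed numbers will differ from the ones in this post.

```python
import pandas as pd

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Column names, non-null counts, and data types.
df.info()

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)
```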
Now visualize YearExperience against Salary using the matplotlib scatter plot function.
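A scatter plot like this could be drawn as in the sketch below. The data values are made up, and the Agg backend line is only there so the snippet runs without a display; in a notebook you would drop it and call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})

fig, ax = plt.subplots()
ax.scatter(df["YearExperience"], df["Salary"])  # one point per employee
ax.set_xlabel("YearExperience")
ax.set_ylabel("Salary")
ax.set_title("Years of experience vs. salary")
```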
5. Prepare Data
To prepare the data, we separate it into the independent and dependent features. X stores the independent feature (YearExperience) and y stores the dependent feature, or target (Salary).
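Here is a small sketch of that separation, again on made-up sample values. One detail worth noting: scikit-learn expects X to be two-dimensional, which is why the column is selected with double brackets.

```python
import pandas as pd

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})

X = df[["YearExperience"]]  # double brackets keep X as a 2-D DataFrame
y = df["Salary"]            # single brackets give a 1-D Series

print(X.shape, y.shape)
```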
6. Split Data
Then split the data into training and testing sets using the train_test_split function, which takes parameters such as X, y, random_state, and test_size. X is the independent feature and y is the dependent feature. random_state seeds the random shuffling so that the split is reproducible, and test_size sets the fraction of the data held out for testing.
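The split can be sketched like this. The values test_size=0.2 and random_state=42 are assumptions chosen for illustration, not necessarily the ones used for the results in this post, and the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})
X = df[["YearExperience"]]
y = df["Salary"]

# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Running the split again with the same random_state returns the same rows, which makes the later results easier to compare.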
7. Define the Model
Now define a LinearRegression model with its default parameters and train it on the training data (X_train and y_train). Then test the model on the testing data (X_test) and display the predicted values alongside the actual ones.
Next, calculate the difference between the actual and predicted salaries, and build a DataFrame that shows the actual salary, the predicted salary, and the difference between them.
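A minimal sketch of training, predicting, and building the comparison table is shown below. The data, the split settings, and the column names of the comparison DataFrame ("Actual", "Predicted", "Difference") are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})
X = df[["YearExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

# Fit a straight line, Salary = intercept + slope * YearExperience.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict salaries for the held-out rows.
y_pred = model.predict(X_test)

# Side-by-side table of actual values, predictions, and their difference.
comparison = pd.DataFrame({
    "Actual": y_test.to_numpy(),
    "Predicted": y_pred,
    "Difference": y_test.to_numpy() - y_pred,
})
print(comparison)
```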
8. Visualize Model
Now visualize the training data: plot all the training points and draw the best-fit line through them. The difference between a training point and the best-fit line is the error for that point, more precisely called the residual.
Now visualize the testing data: plot all the testing points against the same best-fit line and look at the errors.
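Both plots can be sketched as below, with the training and testing points in separate panels and the same fitted line drawn over each. The gap between each point and the line is that point's error (the residual). The data and split settings are made up, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})
X = df[["YearExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)
model = LinearRegression().fit(X_train, y_train)

fig, (ax_train, ax_test) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
panels = [(ax_train, X_train, y_train, "Training data"),
          (ax_test, X_test, y_test, "Testing data")]
for ax, X_part, y_part, title in panels:
    ax.scatter(X_part["YearExperience"], y_part, label="actual")
    # Sort by x so the line is drawn left to right.
    X_sorted = X_part.sort_values("YearExperience")
    ax.plot(X_sorted["YearExperience"], model.predict(X_sorted),
            color="red", label="best-fit line")
    ax.set_xlabel("YearExperience")
    ax.set_title(title)
    ax.legend()
ax_train.set_ylabel("Salary")

# Residuals on the training set: actual minus fitted value for each point.
residuals = y_train - model.predict(X_train)
print(residuals)
```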
9. Model Evaluation
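Check how well the model does on the testing data, where it scores close to 98%. For a regression model this figure is best read as the R² score rather than a classification-style accuracy. Also compute the mean squared error and r2_score from the actual and predicted values.
I use RMSE and the R² score because they are standard evaluation metrics for regression: RMSE reports the typical size of the error in salary units, and R² reports the share of the variance in salary that the model explains.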
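The metrics can be computed as in this sketch. The data and split settings are made up, so the numbers it prints will not match the score reported above. RMSE is taken here as the square root of the mean squared error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})
X = df[["YearExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # average squared error
rmse = np.sqrt(mse)                       # same units as Salary
r2 = r2_score(y_test, y_pred)             # share of variance explained

# For a regressor, model.score returns the same R² value.
score = model.score(X_test, y_test)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.4f}")
```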
10. Prediction with custom data
The last step is to test the model on custom data. I gave the model three different values for years of experience (3, 4, and 5) and checked the salary it predicts for an employee with each.
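Predicting for custom inputs could look like the sketch below. Because the training data here is made up, the salaries it prints are illustrative only and are not the predictions from the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up sample values standing in for the Kaggle data.
df = pd.DataFrame({
    "YearExperience": [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3],
    "Salary": [39000, 43500, 54500, 57000, 67000, 82000, 98000, 113000, 105500, 122000],
})
X = df[["YearExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)
model = LinearRegression().fit(X_train, y_train)

# Custom inputs use the same column name the model was trained on.
new_data = pd.DataFrame({"YearExperience": [3, 4, 5]})
predicted = model.predict(new_data)

for years, salary in zip(new_data["YearExperience"], predicted):
    print(f"{years} years of experience -> predicted salary {salary:.2f}")
```

Since the model is a straight line, each extra year adds the same amount (the fitted slope, model.coef_[0]) to the predicted salary.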
In conclusion, an employee's years of experience affect how high a salary the employee can get.
11. Code Documentation
https://drive.google.com/drive/folders/15_guXA36jMqTVgmdMgpySbUZh0uH0jH9?usp=sharing