Ghalbin Al Ghafiqi
In this experiment I will analyze the Top 5 Ranked Games by Category. The analysis is part of obtaining the DSBIZ certification and consists of an EDA on a dataset taken from kaggle.com, which contains the top 100 games from each game category on the Google Play Store along with their ratings and other data such as price and number of installs.
Before doing Exploratory Data Analysis (EDA), first download the dataset (https://bit.ly/TopGamesOnGooglePlayStore).
Here are some steps for Exploratory Data Analysis using Google Colab:
First, let's begin by importing the necessary libraries.
The libraries used to explore data with Python are pandas, matplotlib, seaborn and numpy. pandas and numpy make it easier to process and analyze numeric data, while matplotlib and seaborn are used to create visualizations.
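As a minimal sketch, the import cell could look like the following. The aliases (np, pd, plt, sns) are the common community conventions, and the set_theme call is an optional styling choice of mine, not something the original notebook is known to use.

```python
# Core libraries for numeric work and tabular data
import numpy as np
import pandas as pd

# Plotting libraries: matplotlib is the base layer, seaborn builds on top of it
import matplotlib.pyplot as plt
import seaborn as sns

# Optional (assumption): a consistent default look for all charts
sns.set_theme(style="whitegrid")
```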
Second, mount Google Drive so that the dataset can be accessed.
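A hedged sketch of the mount step is shown here. The google.colab module only exists inside a Colab runtime, so the import is guarded and the snippet does nothing harmful elsewhere. The IN_COLAB flag is a name I introduce for illustration.

```python
# Mounting only works inside a Google Colab runtime, so the import is guarded
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Asks for authorization, then exposes Drive under /content/drive
    drive.mount("/content/drive")
else:
    print("Not running in Colab; skipping the Drive mount.")
```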
Third, it's time to load the dataset and read through it to get an overview.
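Loading is typically done with pd.read_csv. In the sketch below, a tiny in-memory CSV stands in for the real file so the example runs anywhere. The rows are made up, and the column names 'rank', 'title' and 'category' are my assumptions about the file layout.

```python
import io
import pandas as pd

# In Colab this would be a path on the mounted Drive instead, for example
# "/content/drive/MyDrive/<your-file>.csv" (hypothetical location).
# The rows below are invented sample data, not values from the real dataset.
sample_csv = io.StringIO(
    "rank,title,total ratings,installs,category,paid\n"
    "1,Game A,1000,10.0 M,GAME ACTION,False\n"
    "2,Game B,800,5.0 M,GAME ACTION,False\n"
    "1,Game C,500,1.0 M,GAME PUZZLE,True\n"
)

df = pd.read_csv(sample_csv)
print(df)
```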
The head() method returns a specified number of rows, starting from the top.
The tail() method returns a specified number of rows from the end.
I usually use df.head(n) and df.tail(n) to validate that the dataset loaded correctly and looks the way it should.
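A small illustration of both methods, using a made-up ten-row table in place of the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the games table: ten rows ranked 1..10
df = pd.DataFrame({
    "rank": range(1, 11),
    "title": [f"Game {i}" for i in range(1, 11)],
})

first_rows = df.head(3)    # first 3 rows
last_rows = df.tail(3)     # last 3 rows
default_rows = df.head()   # with no argument, head() returns 5 rows

print(first_rows)
print(last_rows)
```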
df.columns returns the label of each column in the DataFrame
The dataset consists of 1730 rows and 15 columns. The names of the 15 columns can be found with the df.columns attribute, and the index is a range that starts at 0 and stops before 1730 with step = 1, so the row labels run from 0 to 1729. Based on the data above, the 'total ratings' column is the dependent variable, while the rest are independent variables.
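The same checks can be sketched on a toy DataFrame (three invented rows instead of 1730). Note that the stop value of the RangeIndex is exclusive.

```python
import pandas as pd

# Toy data; the real dataset has 1730 rows and 15 columns
df = pd.DataFrame({
    "total ratings": [1000, 800, 500],
    "installs": ["10.0 M", "5.0 M", "1.0 M"],
})

n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(df.columns.tolist())
print(df.index)  # RangeIndex(start=0, stop=3, step=1)
```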
Fourth, before analyzing the data, check the dataset for missing values.
The number of missing values in each variable is 0; in other words, there are no missing values in any column. Note that isna() detects missing values (such as NaN or None) in the data. A value of 0 is not detected as missing, because 0 is a valid value, not a missing one.
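To make the difference between 0 and a missing value concrete, here is a small sketch on invented data. Unlike the real dataset, this toy table deliberately contains missing entries.

```python
import numpy as np
import pandas as pd

# Invented data: 0.0 is a real value, NaN and None are missing
df = pd.DataFrame({
    "price": [0.0, 1.99, np.nan],
    "installs": ["1.0 M", None, "5.0 M"],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```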
Now we can display all columns and their data types
df.info() returns information about the DataFrame.
The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of non-null cells in each column.
We can easily see from df.info():
- There are 15 columns
- 1730 entries (rows)
- No missing values
- 8 int64, 3 float64, 3 object and 1 bool type column in the dataframe
- There seem to be 11 numeric columns (8 int64 and 3 float64)
- We may make some calculations with numeric columns
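The points above can be reproduced on a toy DataFrame. This is a sketch with invented rows; select_dtypes(include="number") is one way to list the numeric columns, and it leaves out bool columns.

```python
import pandas as pd

# Invented rows with one column of each kind discussed above
df = pd.DataFrame({
    "rank": [1, 2, 3],
    "price": [0.0, 0.0, 1.99],
    "installs": ["10.0 M", "5.0 M", "1.0 M"],
    "paid": [False, False, True],
})

df.info()  # prints column labels, non-null counts, dtypes and memory usage

# Integer and float columns count as numeric; bool and text columns do not
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```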
Fifth, data visualization.
Necessary adjustments before analysis
Let's go further with the 'installs' column. It displays numeric values, but its type is 'object', so we have to fix it.
Via the 'in_thousand' function, we convert the values to millions.
We can see that the 'installs' data is now expressed in millions.
Next, we check again that the 'installs' column now has a numeric data type and holds values in millions.
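The body of 'in_thousand' is not shown here, so the following is only a sketch of one way such a conversion could work. It assumes the raw strings look like '500.0 k' or '10.0 M' (a number, a space, then a 'k' or 'M' suffix). The helper installs_to_millions and the UNIT_IN_MILLIONS table are hypothetical names of mine.

```python
import pandas as pd

# Assumed suffixes and how much each one is worth in millions
UNIT_IN_MILLIONS = {"k": 0.001, "M": 1.0}

def installs_to_millions(value: str) -> float:
    """Convert a string such as '500.0 k' or '10.0 M' to a number of millions."""
    number, unit = value.split()
    return float(number) * UNIT_IN_MILLIONS[unit]

# Invented sample values in the assumed format
df = pd.DataFrame({"installs": ["500.0 M", "10.0 M", "500.0 k"]})
df["installs"] = df["installs"].apply(installs_to_millions)

print(df["installs"].tolist())
print(df["installs"].dtype)  # now a float column instead of text
```

Splitting on whitespace keeps the helper short, but it will raise an error on any value that does not follow the assumed format, which is arguably useful during EDA because unexpected values surface immediately.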
Next are the 'paid' and 'price' columns
- The 'paid' and 'price' columns seem to carry similar information. If we look at the values of these columns, we can easily see that almost all the games are free. We can drop one of them.
Sixth, dataset normalization on 'price' data
Almost all games are free; only 7 games are paid. Whether a game is paid or not does not seem to be a distinguishing feature for EDA. I'm dropping the 'price' column, as I think the 'paid' column alone will be enough for further analysis.
Then check the number of rows and columns after the 'price' column has been dropped.
It can be seen that the number of columns has decreased by one.
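A sketch of this step on invented data: count free versus paid games, confirm that 'paid' and 'price' tell the same story, then drop 'price' and compare the shape before and after.

```python
import pandas as pd

# Invented rows: three free games and one paid game
df = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C", "Game D"],
    "price": [0.0, 0.0, 0.0, 1.99],
    "paid": [False, False, False, True],
})

# How many games are free (False) and how many are paid (True)
print(df["paid"].value_counts())

# In this toy data 'paid' is True exactly when 'price' is above zero
columns_agree = bool(((df["price"] > 0) == df["paid"]).all())
print(columns_agree)

shape_before = df.shape
df = df.drop(columns=["price"])
shape_after = df.shape
print(shape_before, "->", shape_after)
```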
As seen above:
- A new float64 type column was created from the 'installs' column by feature engineering
- The 'price' column has been dropped for simplicity, since the 'paid' and 'price' columns have similar content
- Now we have 1730 rows (games) and 14 columns (features)
Seventh, Exploratory Data Analysis
To view summary information about the columns, I am going to use a user-defined function.
Almost all categories have the same size.
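The original helper is not reproduced here, so this is only a sketch of what such a summary function might look like. The name summarize_column is hypothetical, and the category labels are invented. It reports how many rows fall into each distinct value and what share of the table that is.

```python
import pandas as pd

def summarize_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and percentage share of each distinct value in a column."""
    counts = frame[column].value_counts()
    shares = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})

# Invented categories of nearly equal size
df = pd.DataFrame({
    "category": ["GAME ACTION"] * 3 + ["GAME PUZZLE"] * 3 + ["GAME WORD"] * 2
})

summary = summarize_column(df, "category")
print(summary)
```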
Next, we display the analysis results for the Top 5 Ranked Games by Category.
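One possible way to obtain the top 5 per category is to sort by rank and keep the first five rows of each group. The sketch below uses invented data and assumes the columns are called 'category', 'rank' and 'title', with rank 1 being the best.

```python
import pandas as pd

# Invented data: 7 action games and 6 puzzle games, each ranked from 1
df = pd.DataFrame({
    "category": ["GAME ACTION"] * 7 + ["GAME PUZZLE"] * 6,
    "rank": list(range(1, 8)) + list(range(1, 7)),
    "title": [f"Action {i}" for i in range(1, 8)]
             + [f"Puzzle {i}" for i in range(1, 7)],
})

# Sort so the best rank comes first, then keep 5 rows per category
top5 = (
    df.sort_values(["category", "rank"])
      .groupby("category")
      .head(5)
      .reset_index(drop=True)
)
print(top5)
```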
Eighth, the correlation analysis of the data.
From the heatmap above, the brighter the color on the box, the higher the correlation value between the 2 variables.
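A heatmap of this kind can be sketched as follows. The data is randomly generated for illustration (the '5 star ratings' column is built to follow 'total ratings' closely), so the numbers are not those of the real dataset. With seaborn's default color map, lighter cells correspond to higher values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data, generated only to illustrate the technique
rng = np.random.default_rng(0)
total = rng.integers(1_000, 100_000, size=50)
df = pd.DataFrame({
    "total ratings": total,
    "5 star ratings": (total * 0.7 + rng.normal(0, 500, size=50)).round(),
    "rank": rng.permutation(50) + 1,
    "title": [f"Game {i}" for i in range(50)],  # text column, excluded below
})

# Pairwise Pearson correlation between the numeric columns only
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax)
ax.set_title("Correlation between numeric columns")
```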
The pair of variables with the highest correlation is total ratings and 5 star ratings. This means that as total ratings increases, the 5 star ratings value also tends to increase. To check this, you can use a scatter plot to see the relationship between the total ratings and 5 star ratings variables.
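The scatter plot check could look like this. Again, the data is synthetic and built so that the two columns move together; with the real dataset you would use the loaded DataFrame instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data, generated only to illustrate the technique
rng = np.random.default_rng(1)
total = rng.integers(1_000, 100_000, size=50)
df = pd.DataFrame({
    "total ratings": total,
    "5 star ratings": (total * 0.7 + rng.normal(0, 2_000, size=50)).round(),
})

# Pearson correlation coefficient between the two columns
r = df["total ratings"].corr(df["5 star ratings"])

fig, ax = plt.subplots()
ax.scatter(df["total ratings"], df["5 star ratings"], alpha=0.7)
ax.set_xlabel("total ratings")
ax.set_ylabel("5 star ratings")
ax.set_title(f"Pearson r = {r:.2f}")
```

Points that line up along a rising diagonal indicate a strong positive linear relationship, which is what a correlation close to 1 in the heatmap suggests.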
Conclusion