BISA AI - AI For Everyone

Student Performance Index

Herlambang Saputra


Summary

The Student Performance Index dataset available on Kaggle is a synthetic dataset designed for analyzing the factors that influence students' academic performance. It contains about 10,000 student records with several relevant variables: hours studied, hours of sleep, participation in extracurricular activities, previous exam scores, and the number of sample question papers practiced. These variables serve as predictors of the performance index, a numeric score that reflects a student's academic performance. Because the data is synthetic, it is already clean (no missing values or significant anomalies), so users can move straight to data exploration and statistical modeling. The main purpose of the dataset is to serve as learning material for multiple linear regression, in which more than one independent variable is used to predict a dependent variable. In other words, it helps researchers and students understand how external factors and study habits can influence a student's academic results.
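To make the idea of multiple linear regression concrete, here is a minimal sketch that fits a model of the form y = b0 + b1*x1 + b2*x2 by ordinary least squares. The numbers, coefficients, and variable names are illustrative assumptions and are not taken from the Kaggle file.

```python
import numpy as np

# Hypothetical toy data: two predictors per student (not taken from the dataset).
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
previous_scores = np.array([50.0, 65.0, 55.0, 80.0, 70.0, 90.0])

# Build the target from known coefficients so the fit can be checked.
true_intercept, true_b1, true_b2 = -30.0, 3.0, 1.0
performance = true_intercept + true_b1 * hours_studied + true_b2 * previous_scores

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(hours_studied), hours_studied, previous_scores])

# Ordinary least squares: minimise ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, performance, rcond=None)
intercept, b1, b2 = beta
print(f"intercept={intercept:.2f}, hours coef={b1:.2f}, previous-score coef={b2:.2f}")
```

Because the toy target is an exact linear function of the two predictors, the fit recovers the coefficients that were used to build it. Real data adds noise, so the estimates would only approximate the underlying relationship.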

The dataset is also flexible enough for a range of machine learning experiments. Although its main focus is multiple linear regression, many data practitioners use it to try other predictive models such as a decision tree regressor, random forest, or modern ensemble methods such as XGBoost. Its structure is simple and clear, and it mirrors the shape of a typical student performance prediction task, although the records themselves are synthetic. It is well suited to exploratory data analysis (EDA), for example showing relationships between variables with scatter plots or a correlation heatmap. It can further serve as an example for evaluating the importance of choosing the right predictor variables, validating a model, and interpreting regression coefficients to understand each factor's effect on learning outcomes. Student Performance Index therefore works not only as a practice dataset but also as an effective teaching tool for strengthening the understanding of linear regression and other predictive analysis techniques in data science.

Description

1. Mounting Data

from google.colab import drive

drive.mount('/content/drive')

2. Read the CSV Dataset

import pandas as pd

# Path to the CSV file on Google Drive; adjust it if the file is stored elsewhere
csv_file_path = '/content/drive/MyDrive/Student_Performance.csv'

try:
    df = pd.read_csv(csv_file_path)
    print("CSV file read successfully!")
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {csv_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

3. Preprocessing Data

# Get the number of rows before dropping missing values

initial_rows = df.shape[0]

# Drop rows with any missing values

df.dropna(inplace=True)

# Get the number of rows after dropping missing values

rows_after_dropping = df.shape[0]

print(f"Number of rows before dropping missing values: {initial_rows}")

print(f"Number of rows after dropping missing values: {rows_after_dropping}")

# Display the first few rows of the modified DataFrame

display(df.head())
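Before dropping rows, it can help to see how many values are actually missing in each column. The sketch below does this on a small hypothetical frame; the column names mirror the dataset, but the values (and the gaps) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics some of the dataset's column names.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, np.nan, 5],
    "Previous Scores": [99, 82, 51, 52],
    "Performance Index": [91.0, 65.0, 45.0, np.nan],
})

# Count missing values per column before deciding whether to drop anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace returns a new frame and leaves the original untouched.
toy_clean = toy.dropna()
print(f"Rows removed: {len(toy) - len(toy_clean)}")
```

If every count is zero, as the summary suggests for this synthetic dataset, the `dropna` step leaves the row count unchanged.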

4. Feature Selection Using Correlation

import seaborn as sns

import matplotlib.pyplot as plt

# Drop non-numeric columns that are not suitable for correlation calculation

df_numeric = df.drop(['Extracurricular Activities'], axis=1)

# Calculate the correlation matrix

correlation_matrix = df_numeric.corr()

# Display the correlation of features with the target variable 'Performance Index'

print("Feature correlation with 'Performance Index':")

print(correlation_matrix['Performance Index'].sort_values(ascending=False))

# Visualize the correlation matrix using a heatmap (optional, but helpful for understanding)

plt.figure(figsize=(12, 10))

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

plt.title('Correlation Matrix of Features')

plt.show()

# Based on the correlation matrix, select the top 2 features most correlated with 'Performance Index'
# (excluding 'Performance Index' itself)
target_correlation = correlation_matrix['Performance Index'].sort_values(ascending=False)
top_2_features = target_correlation.drop('Performance Index').head(2).index.tolist()

print(f"\nTop 2 features most correlated with 'Performance Index': {top_2_features}")

# Create a new DataFrame with the selected features and the target variable

df_selected_features = df_numeric[top_2_features + ['Performance Index']]

display(df_selected_features.head())
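The step above drops 'Extracurricular Activities' because it is not numeric. An alternative, sketched here, is to encode it as 0/1 so it can take part in the correlation ranking. The "Yes"/"No" labels and the sample rows are assumptions made for illustration; check the actual values in the column before relying on this mapping.

```python
import pandas as pd

# Hypothetical rows; the "Yes"/"No" labels are an assumption about the column's values.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 2, 9],
    "Extracurricular Activities": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "Performance Index": [91.0, 65.0, 88.0, 60.0, 40.0, 95.0],
})

# Map the two labels to 1/0 so the column can enter the correlation matrix.
toy["Extracurricular Activities"] = toy["Extracurricular Activities"].map({"Yes": 1, "No": 0})

# Correlation of every feature with the target, with the target itself excluded.
target_corr = toy.corr()["Performance Index"].drop("Performance Index")
print(target_corr.sort_values(ascending=False))
```

Note that Pearson correlation only captures linear association, so a low value does not by itself prove a feature is useless to the model.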

5. Selecting the Independent and Dependent Variables

# Separate the selected features from the target variable
# 'Previous Scores' and 'Hours Studied' as independent variables (X)
# 'Performance Index' as the dependent variable (y)

X = df_selected_features[['Previous Scores', 'Hours Studied']]

y = df_selected_features['Performance Index']

print("Independent variables (X):")

display(X.head())

print("\nDependent variable (y):")

display(y.head())

6. Split the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)

print("Shape of X_test:", X_test.shape)

print("Shape of y_train:", y_train.shape)

print("Shape of y_test:", y_test.shape)

7. Preprocessing Model

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler

# Define the columns to be scaled

feature_columns = X.columns

# Create a ColumnTransformer to apply StandardScaler to the feature columns

preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), feature_columns),
    ],
    remainder='passthrough',
)

print("Preprocessor created successfully:")

print(preprocessor)
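To show what the `StandardScaler` inside the preprocessor does, the following sketch scales a small hypothetical feature matrix. The array values and the `_toy` names are made up for illustration; only the `StandardScaler` calls correspond to the step above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: the columns stand in for two numeric features.
X_train_toy = np.array([[40.0, 1.0], [60.0, 3.0], [80.0, 5.0], [100.0, 7.0]])
X_test_toy = np.array([[70.0, 4.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train_toy)
# Reuse those training statistics on unseen rows to avoid leaking test information.
X_test_scaled = scaler.transform(X_test_toy)

print("Training means after scaling:", X_train_scaled.mean(axis=0))
print("Training std after scaling:", X_train_scaled.std(axis=0))
print("Scaled test row:", X_test_scaled)
```

For ordinary least squares, scaling does not change the predictions; its main benefit here is that the fitted coefficients become comparable across features measured in different units.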

8. Model Selection

from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression object

model = LinearRegression()

print("Linear Regression model instantiated successfully:")

print(model)

9. Building the Pipeline

from sklearn.pipeline import Pipeline

# Create a pipeline object named pipeline that sequentially applies the preprocessor and the model.

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

('model', model)])

# Print the created pipeline object to verify its structure.

print("Pipeline created successfully:")

print(pipeline)

10. Training the Pipeline

# Train the pipeline using the training data

pipeline.fit(X_train, y_train)

print("Pipeline trained successfully.")
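Once a pipeline is fitted, the regression coefficients can be read back from the model step and interpreted. The sketch below uses hypothetical data generated from a known linear rule and a simplified pipeline (a plain `StandardScaler` step instead of a `ColumnTransformer`), so the names `toy_pipeline`, `X_toy`, and `y_toy` are assumptions rather than objects from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data generated from a known linear rule, not the Kaggle file.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Previous Scores": rng.uniform(40, 99, size=200),
    "Hours Studied": rng.uniform(1, 9, size=200),
})
y_toy = 1.0 * X_toy["Previous Scores"] + 3.0 * X_toy["Hours Studied"] - 30.0

toy_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
toy_pipeline.fit(X_toy, y_toy)

# named_steps exposes each fitted step by the name given in the pipeline.
fitted_model = toy_pipeline.named_steps["model"]
fitted_scaler = toy_pipeline.named_steps["scaler"]

# Coefficients are per standard deviation of each feature because of the scaling.
scaled_coefs = pd.Series(fitted_model.coef_, index=X_toy.columns)
# Dividing by each feature's standard deviation recovers per-unit effects.
per_unit_coefs = scaled_coefs / fitted_scaler.scale_
print(per_unit_coefs)
```

With the pipeline built in step 9, the model would be reached through `pipeline.named_steps['model']`, and the fitted scaler through the `named_transformers_` attribute of the preprocessor step.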

11. Model Evaluation

from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the testing data

y_pred = pipeline.predict(X_test)

# Calculate the Mean Squared Error (MSE)

mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")

# Calculate the R-squared score

r2 = r2_score(y_test, y_pred)

print(f"R-squared score: {r2}")
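MSE is expressed in squared units of the target, which makes it hard to read on its own. A common complement, sketched here on hypothetical true and predicted scores, is to report RMSE and MAE as well, since both are in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted scores, for illustration only.
y_true_toy = np.array([50.0, 60.0, 70.0, 80.0])
y_pred_toy = np.array([52.0, 58.0, 73.0, 79.0])

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)
# RMSE is in the same units as the target, which makes it easier to read than MSE.
rmse_toy = np.sqrt(mse_toy)
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)
r2_toy = r2_score(y_true_toy, y_pred_toy)

print(f"MSE={mse_toy:.2f}, RMSE={rmse_toy:.2f}, MAE={mae_toy:.2f}, R2={r2_toy:.3f}")
```

In this toy case the errors are 2, -2, 3, and -1 points, so the MSE is 4.5 and the MAE is 2.0. The same calls can be applied to `y_test` and `y_pred` from the step above.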

## Summary:

### Data Analysis Key Findings

* The data was split into training and testing sets, with 80% for training and 20% for testing. For the roughly 10,000 rows described in the summary above, this gives about 8,000 rows in `X_train` and `y_train` and about 2,000 rows in `X_test` and `y_test`, with 2 columns in each feature set.
* Preprocessing steps were defined using a `ColumnTransformer` to apply `StandardScaler` to the feature columns.
* A `LinearRegression` model was chosen for the regression task.
* A `Pipeline` was created to sequentially apply the defined preprocessing steps and the `LinearRegression` model.
* The pipeline was successfully trained using the training data.
* The trained model was evaluated on the testing data with Mean Squared Error (MSE) and the R-squared score. The values recorded here, an MSE of approximately 990,354,280,095.51 and an R-squared score of approximately 0.0289, are not plausible for this dataset: an MSE of that size implies typical errors near one million, while Performance Index is a score ranging from 10 to 100. They should be replaced with the output of step 11.

### Insights or Next Steps

* Re-run the evaluation in step 11 and record the resulting MSE and R-squared score before judging model quality.
* If those metrics do turn out poor (high MSE, low R-squared), the next steps would be further feature engineering, exploration of different models, or hyperparameter tuning, along with a check for issues in the data or the chosen features.
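One way to follow up on "exploration of different models" is to compare candidates with cross-validation rather than a single train/test split. The sketch below does this on hypothetical data with a linear signal plus noise; the data, the two candidate models, and their settings are illustrative assumptions, not results from the Student Performance dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data with a linear signal plus noise, standing in for the real features.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 2.0 * X_toy[:, 0] + 5.0 * X_toy[:, 1] + rng.normal(0, 1.0, size=300)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R-squared gives a steadier estimate than one train/test split.
cv_scores = {
    name: cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

On the real data, `X` and `y` from step 5 would take the place of the toy arrays, and the ranking of the models may differ from what this synthetic example shows.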

Related Course Information

Category: Artificial Intelligence
Course: Persiapan Ujian Sertifikasi Internasional DSBIZ - AIBIZ
