Herlambang Saputra
The Student Performance Index dataset available on Kaggle is a synthetic dataset designed for analyzing the factors that influence students' academic performance. It contains about 10,000 student records with several relevant variables: hours studied, hours of sleep, participation in extracurricular activities, previous exam scores, and how often the student practiced sample questions. These variables serve as predictors for estimating the performance index, a numeric score that reflects a student's academic performance. Because the data is synthetic, it is already clean (no missing values and no significant anomalies), so users can move straight to data exploration and statistical modeling. The main purpose of the dataset is to serve as learning material for multiple linear regression, in which more than one independent variable is used to predict a dependent variable. In other words, it helps researchers and students understand how external factors and study habits can affect a student's academic results.
The dataset is also flexible enough for a range of machine learning experiments. Although its main focus is multiple linear regression, many data practitioners use it to try other predictive models, such as a decision tree regressor, random forest, or modern ensemble methods such as XGBoost. This is because its structure is simple and clear, and it mirrors the shape of a real-world student performance prediction problem, even though the data itself is synthetic. The dataset is well suited to exploratory data analysis (EDA), where relationships between variables are shown through visualizations such as scatter plots, or a heatmap to inspect correlations. It can also be used as an example for evaluating the importance of choosing the right predictor variables, validating a model, and interpreting regression coefficients to understand how each factor affects student outcomes. Student Performance Index therefore works both as a practice dataset and as an effective teaching tool for reinforcing linear regression and other predictive analysis techniques in data science.
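To make the idea of multiple linear regression concrete, here is a minimal sketch that fits an intercept and two slopes by ordinary least squares. The data, the coefficient values, and all variable names are hypothetical and are not taken from the Kaggle dataset.

```python
import numpy as np

# Hypothetical toy data: two predictors and a target built from known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_intercept = 5.0
true_coefs = np.array([2.0, -1.0])
y_demo = true_intercept + X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated together with the slopes.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Ordinary least squares: find beta that minimises ||X_design @ beta - y_demo||^2.
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print("Estimated intercept:", beta[0])
print("Estimated coefficients:", beta[1:])
```

The estimates should land close to the values used to generate the data, which is the same relationship the notebook's `LinearRegression` model tries to recover from the student records.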
1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
2. Read the CSV Dataset
import pandas as pd
# Path to the CSV file on Google Drive; adjust it to match your own Drive layout
csv_file_path = '/content/drive/MyDrive/Student_Performance.csv'
try:
    df = pd.read_csv(csv_file_path)
    print("CSV file read successfully!")
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {csv_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")
3. Data Preprocessing
# Get the number of rows before dropping missing values
initial_rows = df.shape[0]
# Drop rows with any missing values
df.dropna(inplace=True)
# Get the number of rows after dropping missing values
rows_after_dropping = df.shape[0]
print(f"Number of rows before dropping missing values: {initial_rows}")
print(f"Number of rows after dropping missing values: {rows_after_dropping}")
# Display the first few rows of the modified DataFrame
display(df.head())
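Before dropping rows, it can help to see where the missing values actually are. The sketch below uses a small hypothetical frame (the values are made up; only the column names follow the notebook) to count missing entries per column and compare row counts.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the student data.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, np.nan, 5],
    "Previous Scores": [99, 82, 51, np.nan],
    "Performance Index": [91.0, 65.0, 45.0, 36.0],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace=True returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

On a dataset that is already clean, the per-column counts are all zero and the row count does not change.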
4. Feature Selection Using Correlation
import seaborn as sns
import matplotlib.pyplot as plt
# Drop non-numeric columns that are not suitable for correlation calculation
df_numeric = df.drop(['Extracurricular Activities'], axis=1)
# Calculate the correlation matrix
correlation_matrix = df_numeric.corr()
# Display the correlation of features with the target variable 'Performance Index'
print("Feature correlation with 'Performance Index':")
print(correlation_matrix['Performance Index'].sort_values(ascending=False))
# Visualize the correlation matrix using a heatmap (optional, but helpful for understanding)
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Features')
plt.show()
# Based on the correlation matrix, select the top 2 features most correlated with 'Performance Index'
# (excluding 'Performance Index' itself).
# Note: this ranks by signed correlation, so a strongly negative feature would not be picked.
target_correlation = correlation_matrix['Performance Index'].sort_values(ascending=False)
top_2_features = target_correlation.drop('Performance Index').head(2).index.tolist()
print(f"\nTop 2 features most correlated with 'Performance Index': {top_2_features}")
# Create a new DataFrame with the selected features and the target variable
df_selected_features = df_numeric[top_2_features + ['Performance Index']]
display(df_selected_features.head())
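The step above drops 'Extracurricular Activities' because it is non-numeric. An alternative, sketched here, is to encode it as 0/1 so that it can take part in the correlation ranking. This assumes the column holds the strings "Yes" and "No", which the notebook does not show; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame; the "Yes"/"No" coding is an assumption.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 7, 3],
    "Extracurricular Activities": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0, 61.0],
})

# Map the two categories to integers so the column becomes numeric.
encoded = toy.copy()
encoded["Extracurricular Activities"] = encoded["Extracurricular Activities"].map(
    {"Yes": 1, "No": 0}
)

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = encoded.corr()["Performance Index"].drop("Performance Index")
print(corr_with_target.sort_values(ascending=False))
```

Whether the encoded column is worth keeping depends on the correlation it shows on the real data.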
5. Selecting the Independent and Dependent Variables
# Split the selected columns into predictors and target:
# 'Previous Scores' and 'Hours Studied' as independent variables (X)
# 'Performance Index' as the dependent variable (y)
X = df_selected_features[['Previous Scores', 'Hours Studied']]
y = df_selected_features['Performance Index']
print("Independent variables (X):")
display(X.head())
print("\nDependent variable (y):")
display(y.head())
6. Splitting the Data into Training and Test Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
7. Model Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Define the columns to be scaled
feature_columns = X.columns
# Create a ColumnTransformer to apply StandardScaler to the feature columns
preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), feature_columns)
    ],
    remainder='passthrough'
)
print("Preprocessor created successfully:")
print(preprocessor)
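To show what `StandardScaler` does to each feature column, the sketch below scales a small hypothetical array and reproduces the result by hand: subtract the column mean and divide by the column's population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 resembles study hours, column 1 resembles scores.
X_toy = np.array([[1.0, 40.0],
                  [3.0, 60.0],
                  [5.0, 80.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Manual equivalent; np.std defaults to ddof=0, matching StandardScaler.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled)
print("Matches manual computation:", np.allclose(X_scaled, manual))
```

Inside a pipeline, the mean and standard deviation are learned from the training data only and then reused on the test data, which keeps test information out of the fit.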
8. Model Selection
from sklearn.linear_model import LinearRegression
# Instantiate a LinearRegression object
model = LinearRegression()
print("Linear Regression model instantiated successfully:")
print(model)
9. Building the Pipeline
from sklearn.pipeline import Pipeline
# Create a pipeline object named pipeline that sequentially applies the preprocessor and the model.
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])
# Print the created pipeline object to verify its structure.
print("Pipeline created successfully:")
print(pipeline)
10. Training the Pipeline
# Train the pipeline using the training data
pipeline.fit(X_train, y_train)
print("Pipeline trained successfully.")
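Once a pipeline like this is fitted, the regression coefficients can be read from its final step. The sketch below rebuilds the same pipeline structure on hypothetical synthetic data (the generating relationship and the `toy_` names are assumptions) and prints the coefficients. Because the features are standardized, each coefficient is the change in the target per one standard deviation of that feature.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data; only the column names follow the notebook.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "Previous Scores": rng.uniform(40, 100, size=300),
    "Hours Studied": rng.uniform(1, 9, size=300),
})
y_toy = (1.0 * X_toy["Previous Scores"] + 3.0 * X_toy["Hours Studied"] - 30
         + rng.normal(scale=2.0, size=300))

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("scaler", StandardScaler(), list(X_toy.columns))],
        remainder="passthrough",
    )),
    ("model", LinearRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# The fitted estimator is reachable by the step name given in the pipeline.
fitted = toy_pipeline.named_steps["model"]
for name, coef in zip(X_toy.columns, fitted.coef_):
    print(f"{name}: {coef:.3f} per one standard deviation")
print(f"Intercept: {fitted.intercept_:.3f}")
```

With standardized inputs, the intercept equals the mean of the training target, and the relative size of the coefficients gives a rough ranking of feature influence.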
11. Model Evaluation
from sklearn.metrics import mean_squared_error, r2_score
# Make predictions on the testing data
y_pred = pipeline.predict(X_test)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared score: {r2}")
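MSE is expressed in squared target units, which makes it hard to read on its own. As a complement, the sketch below computes RMSE and MAE, which are in the same units as the target, on a small set of hypothetical values; the `toy_` names are chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions.
toy_true = np.array([50.0, 60.0, 70.0, 80.0])
toy_pred = np.array([52.0, 58.0, 73.0, 79.0])

toy_mse = mean_squared_error(toy_true, toy_pred)
toy_rmse = np.sqrt(toy_mse)                        # back in target units
toy_mae = mean_absolute_error(toy_true, toy_pred)  # average absolute error
toy_r2 = r2_score(toy_true, toy_pred)

print(f"MSE:  {toy_mse:.3f}")
print(f"RMSE: {toy_rmse:.3f}")
print(f"MAE:  {toy_mae:.3f}")
print(f"R-squared: {toy_r2:.3f}")
```

The same calls can be applied to `y_test` and `y_pred` from the evaluation step.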
## Summary:
### Data Analysis Key Findings
* The data was split into training and test sets, with 80% for training and 20% for testing. With about 10,000 records, this corresponds to roughly 8,000 training rows and 2,000 test rows, each with 2 feature columns; the exact shapes are printed in step 6.
* Preprocessing steps were defined using a `ColumnTransformer` to apply `StandardScaler` to the feature columns.
* A `LinearRegression` model was chosen for the regression task.
* A `Pipeline` was created to sequentially apply the defined preprocessing steps and the `LinearRegression` model.
* The pipeline was successfully trained using the training data.
* The trained model was evaluated on the test data using Mean Squared Error (MSE) and the R-squared score, both printed in step 11. The values recorded for this run, an MSE of approximately 990,354,280,095.51 and an R-squared score of approximately 0.0289, are not plausible for this dataset: an MSE of that magnitude implies typical errors close to one million units, which cannot occur when the target is an index score.
### Insights or Next Steps
* Re-run the evaluation in step 11 and record the MSE and R-squared it prints before drawing conclusions about model quality.
* If the re-checked scores are still weak, consider adding the remaining features, trying other models, or tuning hyperparameters, and investigate potential issues with the data or the selected features.
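One way to follow up on these next steps is k-fold cross-validation, which gives a more stable performance estimate than a single train/test split. This is a minimal sketch on hypothetical synthetic data, not a result from the student dataset; the pipeline here uses `StandardScaler` directly rather than the notebook's `ColumnTransformer`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(500, 2))
y_cv = 3.0 * X_cv[:, 0] + 1.5 * X_cv[:, 1] + rng.normal(scale=0.5, size=500)

cv_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Five folds; the scaler is refitted on the training part of each fold.
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5, scoring="r2")
print("R-squared per fold:", np.round(cv_scores, 3))
print("Mean R-squared:", round(cv_scores.mean(), 3))
```

A large spread between folds would suggest that the score from a single split should not be trusted on its own.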