Agitha Suci Rachmawardani
House Price Prediction with a Data Ethics Lens
This portfolio explores how house prices are influenced by various factors using correlation and regression analysis. More than just building a predictive model, the project highlights the importance of data ethics, including fairness, transparency, privacy, and avoiding bias. It’s a reminder that good data science isn’t just about making accurate predictions, but also about using data responsibly.
The dataset, sourced from Kaggle (https://www.kaggle.com/dataset), contains real sale prices of houses sold in Seattle, Washington, USA, between August and December 2022. The task is to predict the house price in this area based on several features, which are described below. Before building any models, we explored the dataset to understand variable distributions, detect anomalies, and identify potential features that may influence housing prices. This initial step is essential for both model accuracy and ethical data handling.
The analysis followed these steps:
Upload the file to Google Drive
Data Preparation
Cleaning the data: To ensure the dataset is suitable for analysis, a three-step data cleaning process was performed:
a) Removing columns with excessive missing values: Any column containing more than 50% missing values was dropped. This threshold removes features that are too sparse to provide reliable information.
b) Imputing missing numeric values: For all numeric columns, missing values were filled with the median of each respective column. The median is robust to outliers, so the fill value is not pulled toward extreme observations.
c) Imputing missing categorical values: For categorical (object-type) columns, missing values were filled with the mode, i.e., the most frequently occurring value in each column. This keeps the dominant category of each feature unchanged.
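The three cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the original notebook code: the function name `clean_housing_data`, the `max_missing` parameter, and the toy columns are assumptions made for the example, and "categorical" is treated here as every non-numeric column.

```python
import numpy as np
import pandas as pd


def clean_housing_data(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, then impute the remaining gaps (hypothetical helper)."""
    # a) Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    cleaned = df.loc[:, missing_share <= max_missing].copy()

    # b) Fill numeric gaps with each column's median.
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

    # c) Fill categorical gaps with each column's mode.
    #    Assumption: every non-numeric column counts as categorical.
    categorical_cols = cleaned.select_dtypes(exclude="number").columns
    for col in categorical_cols:
        modes = cleaned[col].mode()
        if not modes.empty:
            cleaned[col] = cleaned[col].fillna(modes.iloc[0])
    return cleaned


# Toy rows for illustration only; not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "price": [500000.0, 750000.0, np.nan, 620000.0],
    "beds": [3, 4, 2, np.nan],
    "size_units": ["sqft", None, "sqft", "sqft"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
cleaned_toy = clean_housing_data(toy)
print(cleaned_toy)
```

In the toy frame, `mostly_empty` is 75% missing and is dropped, while the gaps in the other columns are filled from the values that remain. Note that median and mode imputation reduce the spread of a column, so the share of imputed values is worth reporting alongside any result.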
Basic Exploratory Data Analysis
Data Analysis: To explore the relationships and patterns within the dataset, a series of analytical steps was conducted, focusing on the numerical features:
a) Identification of numerical columns: The dataset was first examined to isolate numerical columns, i.e., those containing continuous or discrete numerical data.
b) Scatter plot matrix (pairwise relationships): A scatter plot matrix was generated to visualize the pairwise relationships among the numerical variables.
c) Correlation heatmap: A correlation matrix was calculated and visualized as a heatmap to quantify the strength and direction of the relationship between each pair of numerical variables. The heatmap uses color gradients to highlight strong positive or negative correlations.
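As a rough sketch of steps a) and c), the snippet below isolates the numeric columns of a small toy frame and computes their correlation matrix. The column names and values are invented for illustration and are not the confirmed schema of the Kaggle file. The use of pandas' default Pearson correlation is also an assumption, since the write-up does not name the method.

```python
import pandas as pd

# Toy rows for illustration only; names and values are assumptions.
toy = pd.DataFrame({
    "price": [450000, 620000, 800000, 990000, 1200000],
    "size": [900, 1400, 1900, 2500, 3100],
    "baths": [1, 2, 2, 3, 3],
    "beds": [2, 3, 3, 4, 5],
    "size_units": ["sqft"] * 5,
})

# a) Keep only the numeric columns (the text column is excluded).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# c) Pairwise correlation matrix; DataFrame.corr defaults to Pearson.
corr = toy[numeric_cols].corr()

# Rank the features by their correlation with price.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(numeric_cols)
print(price_corr)
```

For the plots themselves, `pandas.plotting.scatter_matrix(toy[numeric_cols])` and `seaborn.heatmap(corr, annot=True, cmap="coolwarm")` are one common way to draw a scatter plot matrix and a heatmap; whether the original analysis used these exact calls is not stated.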
Visualization and results of the housing data analysis
Numerical Columns
Scatter Plot
Heatmap
Values close to 1 indicate a strong positive linear relationship, values close to -1 indicate a strong negative linear relationship, and values close to 0 indicate a weak or no linear relationship.
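A small synthetic example may help in reading these values. The data below is made up and assumes the heatmap reports Pearson correlation. It shows one perfectly increasing line, one perfectly decreasing line, and a curved relationship whose coefficient is essentially zero even though the two variables are clearly related.

```python
import numpy as np

# Synthetic inputs, symmetric around zero; not housing data.
x = np.arange(-5, 6, dtype=float)

r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]    # perfectly increasing line
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]   # perfectly decreasing line
r_curve = np.corrcoef(x, x ** 2)[0, 1]     # strong but non-linear relationship

print(r_pos, r_neg, r_curve)
```

The third case is the reason a value near 0 is read as "no linear relationship" rather than "no relationship": a feature with a low coefficient in the heatmap can still matter in a non-linear way.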
This portfolio integrates data ethics at every stage of the analysis process:
Responsible Data Cleaning
Columns with excessive missing values were removed, and missing numeric and categorical values were imputed using median and mode respectively—preserving data integrity and minimizing bias.
Fair Feature Selection
Features were selected based on actual correlation, not assumptions. Weakly correlated features like lot_size and zip_code were acknowledged but not overemphasized, ensuring fairness and objectivity.
Transparent Interpretation
Visualizations like scatter plots and heatmaps were used to communicate relationships without overstating causality, promoting clarity and avoiding misleading conclusions.
Bias Awareness
The influence of location (zip_code) was interpreted cautiously to avoid reinforcing geographic or socioeconomic bias in the model.
Ethical Model Evaluation
Beyond accuracy metrics, the model was evaluated in terms of relevance, fairness, and the real-world meaning of its predictions—aligning with responsible data use.
The analysis revealed that house price is most strongly correlated with property size, followed by the number of bathrooms and bedrooms. While lot size and zip code show weaker correlations, they may still hold contextual value. These insights highlight how key physical features impact property value—and remind us that understanding correlation is essential before building any predictive model.