Predicting early-career salary using institutional characteristics and regression-based modeling
Which institutional characteristics drive early-career salary outcomes, and by how much? This project approaches a question relevant to HR analytics and education consulting: can publicly available institutional data predict graduate salary outcomes? Using features such as institution type, tuition cost, STEM concentration, and student composition, I built and compared regression models to identify the structural drivers of starting salary across 444 U.S. colleges.
This project analyzes how institutional characteristics influence early-career salary outcomes across U.S. colleges.
The goals were to:
- Identify key institutional drivers of salary outcomes
- Evaluate linear vs regularized vs tree-based models
- Assess predictive stability using train/test evaluation
Key findings:
- Tuition-related variables (in-state and out-of-state total cost) and STEM concentration emerged as the strongest predictors of early-career salary.
- Private institutions show higher average starting salaries than public institutions.
- Multivariate linear regression achieved the best performance (R² ≈ 0.80), ahead of both Ridge regression (comparable) and the decision tree (R² ≈ 0.59).
- Model validation: predicted $55,720 vs. actual $52,736 for Mount Holyoke College (~5.7% error).
| Model | Test R² | Test MSE | Notes |
|---|---|---|---|
| Multivariate Linear Regression | ~0.80 | ~13.2M | Final choice; best overall |
| Ridge Regression | Similar to OLS | Slightly more stable | Mitigates multicollinearity |
| Decision Tree Regression | ~0.59 | ~22.8M | Attempts to capture non-linear relationships |

Selection rationale: the linear model delivered the highest predictive accuracy together with strong interpretability and stable generalization. Ridge regularization brought negligible improvement, and the tree model tended to overfit.
- Merging the four CSV files yields a final dataset of 444 colleges × 10 variables.
Source: College Tuition, Diversity, and Pay dataset
- The dataset comprises four CSV files: `diversity_school`, `salary_potential`, `tuition_cost`, and `tuition_income`.
- https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay?select=diversity_school.csv
- Filter: only colleges with at least 2,000 enrolled students were kept.
- Engineered proportion variables (women, international students).
- Standardized numerical features.
- Encoded the categorical variable `type` (public/private).
- Train/test split: 80% / 20%
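The enrollment filter and the proportion variables described above can be sketched in a few lines. This is only an illustration on made-up rows: the raw count fields `women` and `foreign` and the helper `add_proportions` are hypothetical names, not columns or functions confirmed by the project notebook.

```python
# Toy rows standing in for the merged data. Values and the raw count fields
# "women" and "foreign" are illustrative assumptions, not the real columns.
schools = [
    {"name": "College A", "total_enrollment": 2500, "women": 1500, "foreign": 250},
    {"name": "College B", "total_enrollment": 1200, "women": 700, "foreign": 60},
    {"name": "College C", "total_enrollment": 8000, "women": 3600, "foreign": 400},
]

MIN_ENROLLMENT = 2000  # threshold stated in the text


def add_proportions(rows, min_enrollment=MIN_ENROLLMENT):
    """Keep schools at or above the threshold and add proportion columns."""
    kept = []
    for row in rows:
        total = row["total_enrollment"]
        if total < min_enrollment:
            continue  # drop small institutions
        kept.append({
            "name": row["name"],
            "total_enrollment": total,
            "women_proportion": row["women"] / total,
            "foreign_proportion": row["foreign"] / total,
        })
    return kept


prepared = add_proportions(schools)
for row in prepared:
    print(row["name"], row["women_proportion"], row["foreign_proportion"])
```

College B falls below the threshold and is dropped, so only two rows remain.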
After cleaning and merging four datasets:
- Final dataset: 444 institutions × 10 variables
- Target variable: early_career_pay
Variables used:
- `name`: The name of each college
- `total_enrollment`: Total enrollment at the school
- `women_proportion`: The proportion of women at the school
- `foreign_proportion`: The proportion of international students at the school
- `stem_percent`: The proportion of students majoring in STEM fields
- `net_cost`: Average cost of attendance after scholarships and financial aid
- `type`: Type of school (private, public)
- `in_state_total`: Average total cost (room & board + in-state tuition) for in-state residents, in USD
- `out_of_state_total`: Average total cost (room & board + out-of-state tuition) for out-of-state residents, in USD
- `early_career_pay`: Starting salary in USD
Before model development, a structured exploratory analysis was conducted to understand variable distributions, relationships, and potential multicollinearity.
- Pie chart of the average proportion of each demographic group

- This pie chart shows the average share of each demographic group across the institutions in the dataset.
- Boxplot comparing early-career pay across U.S. regions

- Among the four U.S. regions, the Northeast has the highest mean early-career pay and the South has the lowest.
- Institution type (public vs. private): private institutions exhibit higher average starting salaries.
- Relationship between the proportion of women in schools and early-career pay

- The proportion of women shows a moderate correlation with salary outcomes, though this likely reflects institutional characteristics rather than causal gender effects.
- Enrollment size shows weak correlation with salary outcomes.
- Tuition-related variables show strong positive correlation with early-career pay.

- Net cost (average cost of attendance after scholarships and financial aid) shows limited linear association with early-career pay.
- The proportion of international students shows limited linear association with early-career pay.
- STEM concentration demonstrates moderate-to-strong positive association.
Initial candidate predictors included:
- `total_enrollment`
- `women_proportion`
- `foreign_proportion`
- `stem_percent`
- `net_cost`
- `in_state_total`
- `out_of_state_total`
- `type`
Based on the correlation matrix, and to reduce overfitting and improve predictive power, we selected the 6 variables most strongly correlated with the target variable, `early_career_pay`.
- Correlation matrix with all 8 variables

`total_enrollment` and `net_cost` demonstrate weak correlation with salary outcomes, so both were left out of the final feature set.
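The screening step, ranking candidate predictors by how strongly they correlate with the target, might look like the sketch below. It runs on synthetic columns whose relationships are invented for illustration. Only the column names come from this project, and the notebook may have computed the matrix differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic columns: the relationships below are invented for illustration.
stem_percent = rng.uniform(0, 60, n)
total_enrollment = rng.uniform(2000, 40000, n)
early_career_pay = 40000 + 250 * stem_percent + rng.normal(0, 2000, n)

features = {"stem_percent": stem_percent, "total_enrollment": total_enrollment}

# Pearson correlation of each candidate predictor with the target.
correlations = {
    name: float(np.corrcoef(values, early_career_pay)[0, 1])
    for name, values in features.items()
}

# Rank by absolute correlation, strongest first.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
print(ranked)
```

Ranking by absolute value keeps strongly negative predictors in play. A weakly correlated column, such as the synthetic `total_enrollment` here, lands at the bottom of the list.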
Final feature set (predictors used in modeling):
- `women_proportion`
- `foreign_proportion`
- `stem_percent`
- `in_state_total`
- `out_of_state_total`
- `type`

Target variable:
- `early_career_pay`
The dataset was split into training (80%) and testing (20%) subsets.
All numerical features were standardized prior to modeling.
Categorical variable type was encoded using indicator variables.
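A minimal sketch of these three preparation steps, using NumPy on synthetic data. It makes one assumption the text does not state: the scaling statistics are fitted on the training rows only and then reused for the test rows, a common way to avoid leaking test information. The feature values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 10 schools, two numeric features and a type label.
n = 10
numeric = rng.normal(loc=[0.5, 30000.0], scale=[0.1, 8000.0], size=(n, 2))
school_type = np.array(["Private", "Public"] * 5)

# Indicator encoding: 1.0 for private, 0.0 for public.
is_private = (school_type == "Private").astype(float).reshape(-1, 1)

# Shuffle the row indices, then hold out the last 20% as the test set.
order = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = order[:n_train], order[n_train:]

# Fit the scaling statistics on the training rows only, then apply to both.
mean = numeric[train_idx].mean(axis=0)
std = numeric[train_idx].std(axis=0)
X_train = np.hstack([(numeric[train_idx] - mean) / std, is_private[train_idx]])
X_test = np.hstack([(numeric[test_idx] - mean) / std, is_private[test_idx]])

print(X_train.shape, X_test.shape)
```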

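Three regression-based models were implemented and compared: multivariate linear regression, Ridge regression, and a decision tree regression.
Multivariate linear regression
Purpose: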
- Establish interpretable benchmark model.
- Evaluate explanatory strength of institutional variables.
Results:
- Test MSE ≈ 13.2M
- Test R² ≈ 0.80
Interpretation: The linear model explains approximately 80% of variance in early-career salary, indicating strong structural relationships between institutional characteristics and salary outcomes.
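For readers who want to see how the two reported metrics are obtained, here is a hedged sketch of ordinary least squares with test MSE and R² on synthetic data. The coefficients, noise level, and the helper `add_intercept` are made up for illustration. The actual notebook may well use a library estimator instead of a raw least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic standardized features and a linear target with noise.
n, p = 200, 3
X = rng.normal(size=(n, p))
true_coef = np.array([3000.0, 1500.0, -800.0])  # invented for the example
y = 50000 + X @ true_coef + rng.normal(0, 1000, n)

# 80/20 split (rows are already in random order).
n_train = int(0.8 * n)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]


def add_intercept(features):
    """Prepend a column of ones so the model learns an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares via a least-squares solver.
beta, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

pred = add_intercept(X_test) @ beta
residual_ss = float(np.sum((y_test - pred) ** 2))
total_ss = float(np.sum((y_test - y_test.mean()) ** 2))
mse = residual_ss / len(y_test)
r2 = 1.0 - residual_ss / total_ss
print(f"Test MSE: {mse:.0f}, Test R^2: {r2:.3f}")
```

Because the target is in USD, the MSE is in squared dollars, which is why values in the millions are expected for salary data.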
Ridge regression
Purpose:
- Address potential multicollinearity among tuition variables.
- Improve model stability via L2 regularization.
Method:
- 30 alpha values logarithmically spaced from 10⁻⁵ to 10³.
- Model performance evaluated across regularization strengths.
Findings:
- Optimal performance observed at very small alpha values.
- Regularization did not significantly improve performance over OLS.
Interpretation: Multicollinearity exists but does not materially degrade predictive performance.
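The alpha sweep can be sketched with the closed-form ridge solution. The data below are synthetic, with two nearly collinear columns standing in for the tuition variables. Scoring each alpha on the held-out set is an assumption about how the comparison was run, and cross-validation on the training data would be a stricter alternative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two nearly collinear features mimic in-state and out-of-state cost.
n = 200
base = rng.normal(size=n)
X = np.column_stack([base, base + rng.normal(0, 0.1, n), rng.normal(size=n)])
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.5, n)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Center with training statistics so the intercept is not penalized.
x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
Xc, yc = X_train - x_mean, y_train - y_mean

alphas = np.logspace(-5, 3, 30)  # 30 values from 1e-5 to 1e3
test_mse = []
for alpha in alphas:
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    pred = (X_test - x_mean) @ coef + y_mean
    test_mse.append(float(np.mean((y_test - pred) ** 2)))

best_alpha = alphas[int(np.argmin(test_mse))]
print(f"best alpha: {best_alpha:.2e}, test MSE: {min(test_mse):.4f}")
```

As alpha approaches zero the solution approaches OLS, so a best alpha near the small end of the grid is another way of saying that regularization added little.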
Decision tree regression
Purpose:
- Capture potential non-linear relationships.
- Compare interpretability vs performance trade-off.
Model selection:
- Optimal depth: 4 (based on MSE and R²)

Results:
- Test MSE ≈ 22.8M
- Test R² ≈ 0.59
Interpretation: Although the tree model captures non-linear splits, it did not outperform linear approaches. This suggests that early career salary is primarily explained by relatively stable structural relationships rather than complex non-linear interactions.
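Depth selection for a regression tree can be illustrated with a small hand-written tree. This is a sketch on synthetic data, not the project's implementation, which presumably relies on a library estimator. The helpers `build_tree` and `predict_one` and the `min_leaf` setting are hypothetical.

```python
import numpy as np


def build_tree(X, y, depth, max_depth, min_leaf=5):
    """Grow a regression tree with greedy splits that minimize squared error."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return float(y.mean())  # leaf: predict the mean of its samples
    best = None  # (sse, feature index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            n_left = int(left.sum())
            if n_left < min_leaf or len(y) - n_left < min_leaf:
                continue
            sse_left = ((y[left] - y[left].mean()) ** 2).sum()
            sse_right = ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse_left + sse_right < best[0]:
                best = (sse_left + sse_right, j, t)
    if best is None:
        return float(y.mean())
    _, j, t = best
    left = X[:, j] <= t
    return (j, t,
            build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf))


def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node


rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0] > 0, 5.0, -5.0) + X[:, 1] + rng.normal(0, 0.5, 300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

# Test MSE for each candidate depth; the smallest value picks the depth.
results = {}
for max_depth in range(1, 9):
    tree = build_tree(X_train, y_train, 0, max_depth)
    pred = np.array([predict_one(tree, x) for x in X_test])
    results[max_depth] = float(np.mean((y_test - pred) ** 2))

best_depth = min(results, key=results.get)
print(results, best_depth)
```

Very shallow trees underfit, while very deep trees start fitting noise, so the error typically bottoms out at an intermediate depth.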
Predicted early-career salary for Mount Holyoke College:
- Total annual cost: $86,702
- International student proportion: 23%
- Women proportion: > 90%
- STEM percent: 40%
- Predicted salary: $55,720
- Actual salary: $52,736
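The roughly 5.7% error quoted for this check follows directly from the two salary figures. A short worked calculation, using only the numbers reported here:

```python
predicted = 55_720  # model prediction reported for Mount Holyoke, in USD
actual = 52_736     # reported actual early-career pay, in USD

absolute_error = predicted - actual
percent_error = 100 * absolute_error / actual

print(f"Absolute error: ${absolute_error:,}")  # Absolute error: $2,984
print(f"Percent error: {percent_error:.1f}%")  # Percent error: 5.7%
```

The model overestimates by about $3,000. That is slightly below the typical test error implied by an MSE of about 13.2M, whose square root is roughly $3,600.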
| Model | Test R² | Test MSE | Performance |
|---|---|---|---|
| Linear Regression | ~0.80 | ~13.2M | Best overall |
| Ridge Regression | Similar to OLS | Slightly more stable | Comparable |
| Regression Tree | ~0.59 | ~22.8M | Weaker |
- Highest predictive accuracy
- Strong interpretability
- Stable generalization
- Code Repository: https://github.com/yerimoh-23/MachineLearning-StartingSalaryPredictionModel/blob/main/final_code.ipynb
- The dataset combines multiple years, so time effects cannot be controlled for.
- Salary outcomes are limited to U.S.-based employment (overseas employment is excluded).
- Institutional averages mask within-school variability at the department and individual level.
- Potential omitted-variable bias (e.g., selectivity, major-level granularity).
- Institutional-level data reduces individual privacy risk.
- Salary prediction models may reinforce socioeconomic stratification.
- Care must be taken not to interpret correlation as causation.












