MLB Analytics with SQL
Published:
Problem
The Lahman Database (1871–2024) contains 150 years of MLB data across players, salaries, teams, universities, and post-season results. The goal: build a production-quality SQL analytics workflow to answer four business-focused questions using a clean, reproducible relational schema.
Approach
Four analytical pillars, each answered with modular SQL and visualised in Python:
- Talent Pipelines — which colleges produce the most MLB players, and how has that shifted by decade?
- Salary & Payroll Dynamics — team spending patterns, cumulative milestones, and decade-level comparisons.
- Player Career Analysis — career length, debut/retirement windows, age distributions, and team loyalty.
- Player Profiles — height/weight trends, cross-era comparisons, and physical attributes of standout players.
Built reusable analytical views to avoid repeated logic, and a Python notebook to turn SQL outputs into interpretable charts.
What it found
Low-payroll teams that consistently outperformed expectations, decade-level shifts in college talent pipelines, and clear physical attribute differences between Hall of Fame and non-HOF career trajectories.
SQL highlights
Window functions (RANK, NTILE, cumulative SUM), multi-step CTE pipelines, statistical SQL (COVAR_POP / VAR_POP for trend estimation), date manipulation, NULL-aware profiling, and reusable view architecture.
Stack
PostgreSQL · Python · pandas · matplotlib · seaborn · Git/GitHub
Links & Resources
- GitHub Repository: MLB Analytics SQL Project
