MLB Analytics with SQL
Published:
Problem
Historical sports datasets present a classic relational challenge: many tables, long time spans, inconsistent keys, and subtle relationships.
The Lahman Database (1871–2024) contains decades of MLB data covering players, salaries, teams, universities, post-season results, and biographical information.
The goal of this project is to build a professional, production-quality SQL analytics workflow to answer four major baseball questions around talent development, payroll behaviour, career trajectories, and physical performance profiles.
Approach
- Designed and implemented a fully reproducible PostgreSQL schema for the entire Lahman dataset, ensuring correct keys, constraints, and datatypes.
- Created a modular SQL environment with dedicated files for schema, views, analysis queries, advanced queries, and optimised solutions.
- Structured the project around four analytical pillars:
- Talent Pipelines: Which colleges contribute the most MLB players, and how does this shift by decade?
- Salary & Payroll Dynamics: Team spending patterns, cumulative payroll milestones, and decade-level comparisons.
- Player Career Analysis: Career length, debut/retirement windows, age distributions, and team loyalty.
- Player Profiles & Comparison: Height–weight trends, cross-era comparisons, and physical attributes of standout players.
- Built reusable analytical views to avoid repeated logic and keep the workflow clean and maintainable.
- Developed a Python notebook (
mlb_visual_analysis.ipynb) to visualise the most impactful insights from SQL queries, including:- Career length distributions
- Payroll growth and cumulative spending over time
- School-to-MLB talent funnels
- Physical attribute comparisons
- Organised extensive project documentation in
/docs/: schema design notes, project overview, and business-focused interpretations.
SQL Concepts Implemented
This project spans the full spectrum of practical SQL used in real analytics and data-engineering workflows:
- Relational design & data types
- Full schema with primary keys, composite keys, foreign keys, and constraints.
- Work across integers, numerics, booleans, text, and dates, including casting and normalisation.
- Date & time manipulation
- Deriving debut decades with
EXTRACT+FLOOR, building decade indices, and filtering by time windows (e.g., post-1950, post-1995). - Constructing birth dates and computing ages and career lengths using
MAKE_DATEandAGE.
- Deriving debut decades with
- Analytical & window functions
- Ranking and partitioning with
ROW_NUMBER,RANK, andNTILEfor salary percentiles and payroll terciles. - Running totals and cumulative metrics using
SUM(...) OVER (PARTITION BY ... ORDER BY ...)for cumulative team payroll. - Distributional statistics with
PERCENTILE_CONTfor median payrolls and career summaries.
- Ranking and partitioning with
- Data completeness & NULL-aware profiling
- Measuring coverage of key fields using COUNT vs COUNT(column) and CASE-based counts.
- Conditional / filtered aggregation
- Computing team-level compositions using CASE expressions and FILTER clauses inside aggregates.
- Statistical SQL
- Linear trend estimation using
COVAR_POP/VAR_POPto compute per-state slopes in player production over decades.
- Linear trend estimation using
- CTEs and recursive-style pipelines
- Multi-step CTE pipelines that layer transformations (e.g., from raw salaries → yearly totals → rank by payroll → join to postseason participation).
- Use of CTEs to encapsulate complex logic such as state-decade player counts, Hall of Fame flags, and primary franchise detection based on total games played.
- Reusable views & optimisation
- Canonical views for recurring transformations (debut decades, team-year payroll, average franchise payroll, postseason teams, LCS/WS teams, debut/final teams) that replace duplicated CTE boilerplate and ensure consistent logic across queries. :contentReference[oaicite:8]{index=8}
- Business-facing analytics
- Identifying top-spending teams, overperforming low-payroll clubs, long-career players, Hall of Fame vs non-HOF career patterns, and state-level growth in talent pipelines.
- Python integration
- Exporting results from SQL and building visualisations with
matplotlibandseabornto turn query outputs into interpretable charts.
- Exporting results from SQL and building visualisations with
Stack
- Database: PostgreSQL
- Languages: SQL, Python
- Libraries:
pandas,matplotlib,seaborn - Environment: VS Code, Jupyter Notebook, Git/GitHub
- Concepts: schema design, window functions, CTEs, NTILE-based bucketing, statistical SQL, views, data normalisation, analytical SQL, data visualisation
Results and Impact
- Produced a complete, production-quality SQL analytics system using real multi-table historical data (1871–2024), following sports-analytics pipelines.
- Demonstrated advanced SQL competency across the full analytical stack, including:
- Window functions (RANK, NTILE, LAG/LEAD, cumulative aggregates)
- Statistical SQL (COVAR_POP, VAR_POP trend estimation)
- Date manipulation and career-length calculations
- Multi-step CTE pipelines and recursive logic
- Reusable view architecture for query optimisation
- Delivered interpretable insights for baseball decision-making, such as:
- Decade-level shifts in college talent pipelines
- Team payroll growth and cumulative spending milestones
- Long-career vs short-career player characteristics
- Identification of low-payroll teams that consistently outperform expectations
- Cross-era comparisons of player physical attributes
- Integrated SQL with Python (
pandas,matplotlib,seaborn) to generate clear visualisations for the most impactful queries. - Established a clean, reproducible repository structure with separation of schema, views, analysis queries, advanced queries, documentation, and visual analytics — mirroring real-world data-engineering best practices.
Links & Resources
- GitHub Repository:
MLB Analytics SQL Project
