
Alteryx Weekly Challenge #18

Predicting Baseball Wins

https://community.alteryx.com/t5/forums/recentpostspage/user-id/19037

The use case:

The baseball season is over, and it's time to project next year's win totals.

The objective:  

Determine the top 10 variables that correlate with wins (excluding [Win_Pct] and [Games] from the correlation). Use those top 10 variables to predict the number of wins each team will have next season.

Limit the analysis to these teams: Boston Red Sox - BOS, Los Angeles Angels of Anaheim - LAA, Chicago Cubs - CHC, San Francisco Giants - SFG, Colorado Rockies - COL, and Texas Rangers - TEX.

Produce the projected final standings and the number of games each team finishes out of first place, assuming each team plays 162 games.
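
The standings step can be sketched in Python as follows. This is a minimal illustration, not the Alteryx workflow itself: the function name `standings` and the win totals in the example are placeholders, not model output. It applies the usual games-behind formula, which reduces to the difference in wins when every team plays the same 162 games.

```python
def standings(predicted_wins, games=162):
    """Rank teams by predicted wins and compute games behind the leader."""
    # Round each prediction to whole wins; losses follow from the fixed schedule.
    records = {team: (round(w), games - round(w)) for team, w in predicted_wins.items()}
    ordered = sorted(records.items(), key=lambda item: item[1][0], reverse=True)
    leader_wins, leader_losses = ordered[0][1]
    table = []
    for team, (wins, losses) in ordered:
        # Standard games-behind formula relative to the first-place team.
        games_behind = ((leader_wins - wins) + (losses - leader_losses)) / 2
        table.append((team, wins, losses, games_behind))
    return table


# Placeholder win totals for illustration only (hypothetical, not predictions).
example = {"BOS": 90.2, "LAA": 84.7, "CHC": 95.0}
for team, wins, losses, games_behind in standings(example):
    print(f"{team}  {wins}-{losses}  GB {games_behind:.1f}")
```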

Data Set #1

Workflow with Detail

 

Final Data Set Solution

Workflow #1  |  Spearman Rank Correlation Top 10 Results

The workflow above will produce the top 10 Spearman Rank Correlation values. 

 

The Spearman Correlation tool assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any other assumptions about the particular nature of the relationship between the variables.* Correlation values range from –1.00 (a perfect negative correlation) to +1.00 (a perfect positive correlation). Zero indicates no correlation at all.

https://help.alteryx.com/11.3/index.htm#Macro-SpearmanCorrCoeff.htm
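
One common way to compute a Spearman coefficient is to replace each variable with its ranks and then take the Pearson correlation of those ranks. The Python sketch below follows that approach and assumes tied values receive the average of their rank positions. The function names and sample numbers are illustrative and are not part of the Alteryx tool.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mx) ** 2 for a in rx))
    spread_y = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (spread_x * spread_y)


# Made-up numbers: y is the square of x, so the relationship is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman(x, y))
```

Because the second series rises whenever the first does, the rank correlation comes out at (approximately) 1 even though the relationship is not a straight line.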

Workflow #2  |  Pearson Correlation Top 10 Results

The Pearson Correlation tool uses the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) to measure the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.

 

It is widely used in the sciences as a measure of the strength of linear dependence between two variables.*

Correlation (often measured as a correlation coefficient, ρ) indicates the strength and direction of a linear relationship between two random variables. Correlation values range from –1.00 (a perfect negative correlation) to +1.00 (a perfect positive correlation). Zero indicates no correlation at all.

The Pearson coefficient is obtained by dividing the covariance of the two variables by the product of their standard deviations.*

https://help.alteryx.com/11.3/index.htm#PearsonCorrelation.htm
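
The covariance-over-standard-deviations definition, and the step of picking the most correlated variables, can be sketched in Python as below. Several things here are assumptions rather than details of the original workflow: the column names other than Win_Pct and Games are hypothetical, the numbers are made up, and variables are ranked by the absolute value of the coefficient so that strong negative correlations also count.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def std_dev(x):
    """Sample standard deviation (square root of the variance)."""
    return sqrt(covariance(x, x))


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))


def top_correlated(columns, target, exclude=(), n=10):
    """Return the n column names most correlated with the target, by |r|."""
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target and name not in exclude
    }
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:n]


# Made-up data with hypothetical column names, for illustration only.
columns = {
    "Wins": [70, 80, 90, 100],
    "Runs": [600, 650, 720, 800],
    "ERA": [4.8, 4.1, 4.3, 3.5],
    "Games": [162, 162, 162, 162],
    "Win_Pct": [0.432, 0.494, 0.556, 0.617],
}
print(top_correlated(columns, "Wins", exclude=("Win_Pct", "Games"), n=2))
```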

Workflow #3  |  Linear Regression Using Top 10 Variables

The Linear Regression tool constructs a linear function to create a model that predicts a target variable based on one or more predictor variables.

 

There are two main types of linear regression: non-regularized and regularized. Non-regularized linear regression produces linear models that minimize the sum of squared errors between the actual and predicted values of the training data target variable.

 

Regularized linear regression balances the same minimization of sum of squared errors with a penalty term on the size of the coefficients and tends to produce simpler models less prone to overfitting. See Linear Regression.

https://help.alteryx.com/11.3/index.htm#lm.htm
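
The difference between the two types can be sketched in Python by solving the least-squares normal equations with and without a penalty. This is an illustration under stated assumptions, not the Alteryx tool's implementation: a ridge (squared-coefficient) penalty is swapped in as one common form of regularization, the intercept is left unpenalized, and the function names and data are made up.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(rows, y, penalty=0.0):
    """Return [intercept, coef_1, ..., coef_p] minimizing squared error.

    penalty=0 is non-regularized least squares; penalty>0 adds
    penalty * sum(coef_j ** 2) to the objective (ridge regression).
    """
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    for i in range(1, p):  # start at 1 so the intercept is not shrunk
        xtx[i][i] += penalty
    return solve(xtx, xty)


def predict(coefs, row):
    """Apply fitted coefficients to one row of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up training data generated from y = 3 + 2*x1 + 1*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [7, 8, 13, 14, 19]
ols = fit_linear(rows, y)
ridge = fit_linear(rows, y, penalty=10.0)
print(ols, ridge, predict(ols, [6, 5]))
```

With the penalty applied, the fitted coefficients are pulled toward zero relative to the non-regularized fit, which is the trade-off the description above refers to.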
