Does Height Matter for Distance Running?

By Samuel Kellum


Table of contents:

  1. Introduction
  2. Data Extraction, Transform and Load
  3. Exploratory Data Analysis and Data Visualization
  4. Hypothesis Testing
  5. Conclusion and Further Study

1. Introduction

Height is very important in most American sports, such as basketball and most positions in football, where the top athletes are rarely shorter than 6 feet tall at the professional and NCAA Division I level. On the other hand, shorter athletes have an advantage in sports like gymnastics or equestrian.

People have always viewed distance running as a sport where height does not matter. The heights of world-class runners in the same event vary a lot. For example, Kenenisa Bekele, former world record holder in the 10,000m, stands at 5'3. On the other hand, Chris Solinsky, former American record holder in the 10,000m, is 6'1. These two athletes are among the best distance runners of all time and competed in the same event, but their heights differ by almost an entire foot!

As a Division I runner, I became interested in exploring the heights of athletes on Division I cross-country teams after a couple of my teammates noticed that the people on other teams seemed to be significantly taller than us. Many of the other teams were also better than my team. These observations left me with two questions:


  • Is there a relationship between the average height of a team's runners and team success?
  • How does my team (Tulane) compare to other D1 cross-country teams in terms of average height?

In this analysis I will attempt to answer the above questions.


    2. Data Extraction, Transform and Load

    For this analysis, I will be using Python 3 and a variety of Python libraries. The first code cell imports the necessary libraries.

    The first thing I need is a dataset that contains:

    1. Every Division I cross-country team in the NCAA
    2. At least one quantitative variable that represents the quality of each team

    I found an amazing website, LACCTiC.com, that contains exactly what I am looking for. This website standardizes cross-country race results based on a variety of factors, such as weather and course difficulty. The data on this website is extracted from tfrrs.org, which is a website that compiles all collegiate cross-country meet results.

    After navigating to the Division I section of the website, I found two tables, one containing information on each individual Division I runner, and the other on each Division I team.

    I initially wanted to perform a GET request (a request to retrieve data from a website) on the Division I section of the website and collect the data using the requests library. However, I eventually realized that the tables are generated dynamically with JavaScript, so I decided it would be better to extract the data from the original source (which I found using developer tools) in the form of a JSON object.

    The next code cell shows me using the requests library to extract the information from the original source and storing the JSON in the form of a list.

    Now, we need to flatten the nested list we created into a new list called data. To show what the data looks like, I will display the first three teams in the data list.
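As a minimal sketch of the flattening step, assume the response is a list of lists in which each inner list holds team records. The record contents below are hypothetical placeholders, not the actual fields returned by the source.

```python
from itertools import chain

# Hypothetical shape: a list of pages, each page a list of team records.
nested = [
    [{"team": "Team A"}, {"team": "Team B"}],
    [{"team": "Team C"}],
]

# chain.from_iterable walks each inner list in order and yields its items,
# giving one flat list with one element per team.
data = list(chain.from_iterable(nested))

# Display the first three teams.
print(data[:3])
```

A list comprehension (`[team for page in nested for team in page]`) would do the same job; `chain.from_iterable` just avoids the nested loop syntax.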

    Next, since we have a list in which each element represents one team, we need to iterate through the data and find each team's roster. We can use the googlesearch library to perform a Google search for each team and return the top result for the query we use. For example, inputting Northern Arizona men's cross country roster 2021-22 as a query will return the first Google search result for that query. We can do this for every team in our data.

    This code takes a few minutes to run because we are performing over 300 Google search queries and pausing for 1.5 seconds between each query to avoid an HTTP Too Many Requests error from Google's server.
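The throttled loop described above might look roughly like the sketch below. Because the googlesearch call signature differs between versions of that package, the search itself is passed in as `search_fn`, a hypothetical callable that takes a query string and returns the top URL. The function name `find_roster_urls` is also an assumption made for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def find_roster_urls(
    team_names: Iterable[str],
    search_fn: Callable[[str], Optional[str]],
    pause_seconds: float = 1.5,
) -> Dict[str, Optional[str]]:
    """Run one search per team, sleeping between queries to avoid rate limits."""
    urls: Dict[str, Optional[str]] = {}
    for name in team_names:
        query = f"{name} men's cross country roster 2021-22"
        urls[name] = search_fn(query)  # top result for this team, or None
        time.sleep(pause_seconds)      # pause before the next request
    return urls


# Usage with a stand-in search function (no network access, no pause).
def stub_search(query: str) -> Optional[str]:
    return "https://example.com/roster?q=" + query.replace(" ", "+")


print(find_roster_urls(["Northern Arizona"], stub_search, pause_seconds=0))
```

Keeping the search behind a callable also makes it easy to swap in a different search backend, or to test the loop without sending any requests.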

    Unfortunately, Google's search algorithm is not perfect. After running the above code cell, some searches return incorrect URLs, which we can easily fix manually, as shown below.

    Now, we can look through each roster and extract the heights and convert those heights to inches. To do so, we should create a couple of functions.

    The first function converts a height in feet and inches into inches. For example, it would convert 5'6 to 66, since 5 feet and 6 inches is equal to 66 inches.
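A minimal sketch of such a conversion function follows. The name `height_to_inches` and the accepted input formats (an apostrophe or hyphen between feet and inches, with an optional trailing inch mark) are assumptions about how rosters typically list heights.

```python
import re
from typing import Optional

# Feet, then ' or -, then optional inches: matches 5'6, 5'6", 6' 1" and 5-10.
_HEIGHT_RE = re.compile(r"^\s*(\d+)\s*['\-]\s*(\d{1,2})?")


def height_to_inches(text: str) -> Optional[int]:
    """Convert a feet-and-inches string to total inches, or None if unparseable."""
    match = _HEIGHT_RE.match(text)
    if match is None:
        return None
    feet = int(match.group(1))
    inches = int(match.group(2)) if match.group(2) else 0
    return feet * 12 + inches


print(height_to_inches("5'6"))  # 66
```

Returning `None` for unparseable strings (for example a blank cell or "N/A") lets later steps drop missing heights instead of crashing partway through a roster.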

    The next two functions use BeautifulSoup to find particular HTML elements. Of the 319 Division I cross-country teams, about 290 have websites built by Sidearm Sports. On every Sidearm Sports website the HTML classes are identical, so the heights (if they exist) are always located in a <span> with the class sidearm-roster-player-height. For the team websites that were not built by Sidearm Sports, the team roster is always contained in a <table>, so we can use the .read_html() function in pandas to automatically convert the table into a DataFrame.

    Therefore, we should create two separate functions based on whether the website was created by Sidearm Sports or not.
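To illustrate the Sidearm case, here is a dependency-free sketch that uses the standard library's `html.parser` rather than BeautifulSoup (with BeautifulSoup, the equivalent lookup would be `soup.find_all("span", class_="sidearm-roster-player-height")`). The class and function names are hypothetical, and the sample HTML is invented to show the shape of the markup.

```python
from html.parser import HTMLParser
from typing import List


class SidearmHeightParser(HTMLParser):
    """Collect the text of every <span class="sidearm-roster-player-height">."""

    TARGET_CLASS = "sidearm-roster-player-height"

    def __init__(self) -> None:
        super().__init__()
        self.heights: List[str] = []
        self._depth = 0          # > 0 while inside a target span
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1     # nested span inside a target span
        elif self.TARGET_CLASS in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.heights.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_sidearm_heights(html: str) -> List[str]:
    parser = SidearmHeightParser()
    parser.feed(html)
    parser.close()
    return parser.heights


sample = (
    '<li><span class="sidearm-roster-player-height">5\'10"</span></li>'
    '<li><span class="sidearm-roster-player-year">Jr.</span></li>'
)
print(extract_sidearm_heights(sample))
```

The raw strings returned here would then be passed through the height conversion function to get inches.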

    Now that we have created the functions, we can once again iterate through each team in the data. We perform a GET request on each URL. If the website was created by Sidearm Sports, we run the Sidearm function to extract the heights; otherwise, we run the function for websites that were not created by Sidearm Sports.

    Since we are performing 319 GET requests, this code takes a few minutes to run.

    Some of the URLs we extracted heights from contained data for the entire track and field team (which consists of the distance runners on the cross-country team in addition to sprinters, throwers, and jumpers). Since we only want the heights for distance runners, we can remove the heights of non-distance runners from the data.

    Also, one school, Liberty University, was a unique case where neither of the above functions worked. Its athletes' heights were contained within a <p> element with the class playerDetails.

    Now that we have collected all of the data, we can use the json library to convert it into a JSON object that pandas can read, and then load it into a DataFrame in which each row represents one team, as displayed below.

    After applying the .explode() function to the DataFrame, with the "heights" column as a parameter, each row will represent one athlete. We will use .dropna() to remove any teams without any athlete heights. Also, we can drop the columns we do not want to use in the analysis.


    Since a team needs at least five runners to count as a full team in cross country, we should only consider teams with at least five runners.
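The reshaping and filtering steps above can be sketched as follows. Apart from "heights", the column names are assumptions, and the team names and heights are made-up illustrative values, not the collected data.

```python
import pandas as pd

# Illustrative rows only: one row per team, with a list of heights in inches.
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "heights": [[70, 71, 69, 72, 68], [66, 67], []],
})

# explode() gives one row per athlete; a team with an empty list becomes a
# single NaN row, which dropna() then removes.
athletes = (
    teams.explode("heights")
    .dropna(subset=["heights"])
    .assign(heights=lambda df: df["heights"].astype(float))
)

# Keep only teams with at least five listed heights.
listed = athletes.groupby("team")["heights"].transform("count")
full_teams = athletes[listed >= 5]

print(full_teams)
```

Using `transform("count")` keeps the result aligned row by row with `athletes`, so the comparison can be used directly as a boolean mask.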

    3. Exploratory Data Analysis and Data Visualization

    The first visualization we should look at is a histogram of heights for all of the individuals. We can see that the data is approximately normally distributed, which is what we would expect.

    Also, we can generate some summary statistics of the data we collected.

    Now, to find the average height of each team with at least five heights listed on its roster, we can apply the groupby() function. We see that my team, Tulane, is the shortest on average out of the 96 teams with at least five heights listed on their rosters.
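A small sketch of the team averaging step, using invented teams and heights and a hypothetical column name `height_in`:

```python
import pandas as pd

# Illustrative data only: one row per athlete, heights in inches.
athletes = pd.DataFrame({
    "team": ["Team A"] * 5 + ["Team B"] * 5,
    "height_in": [70, 71, 69, 72, 68, 66, 67, 68, 69, 70],
})

# Mean height per team, sorted so the shortest team comes first.
avg_height = athletes.groupby("team")["height_in"].mean().sort_values()
shortest_team = avg_height.index[0]

print(avg_height)
print("Shortest team on average:", shortest_team)
```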

    Now, we should create scatterplots comparing the average height to the following performance measures of each team:

    In addition, we will compare the average heights to a series of random numbers, to see how randomly generated numbers compare to the scatterplots of the data.

    From this graph, we can see a slight negative linear relationship between average height and the performance metrics (as a team's average height increases, the team's performance measures get faster), whereas there is no relationship between average height and the randomly generated y-coordinates.


    4. Hypothesis Testing

    Now that we have a general idea of what the data looks like, we should test whether the relationships between height and performance are statistically significant. We can do that by using the scipy.stats library.

    Since the p-value is less than 0.05, we reject the null hypothesis that height and performance are independent variables. Since the correlation coefficients are between -0.25 and -0.3, we consider the correlations weak.

    In comparison, the p-value for the randomly generated numbers is greater than 0.05, so we fail to reject the null hypothesis; that scatterplot was included only as a point of comparison for the others.
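A test of this kind could be sketched as below, assuming a Pearson correlation computed with `scipy.stats.pearsonr` (the text above names only the scipy.stats library, so the specific function is an assumption). The numbers are synthetic and chosen to give a clear negative trend; they are not the collected data, and the real correlations were much weaker.

```python
from scipy.stats import pearsonr

# Synthetic illustration: average team height (inches) and a team time (seconds).
avg_height_in = [68.0, 68.5, 69.0, 69.5, 70.0, 70.5, 71.0, 71.5, 72.0, 72.5]
team_time_sec = [1560, 1548, 1555, 1540, 1544, 1530, 1536, 1522, 1528, 1515]

# pearsonr returns the correlation coefficient and a two-sided p-value for
# the null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(avg_height_in, team_time_sec)

alpha = 0.05
reject_null = p_value < alpha

print(f"r = {r:.3f}, p = {p_value:.4g}, reject null: {reject_null}")
```

A negative `r` here means taller teams have lower (faster) times, which is the direction of the relationship reported above.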


    5. Conclusion and Further Study

    In response to our initial questions:

  • Is there a relationship between the average height of a team's runners and team success?
  • There appears to be a weak negative correlation between average height and a team's performance, meaning that teams with taller runners tend to be faster on average.

  • How does my team (Tulane) compare to other D1 cross-country teams in terms of average height?
  • Out of the 96 teams that provided enough data, Tulane ranks as the shortest Division I cross-country team in the country!

    Since we found a weak correlation, an interesting topic for further study would be to compare each individual athlete's height to their performance, rather than the team's average height to the team's performance.

    Additionally, there are a few limitations in the data we were able to collect, such as the validity of the heights, since they are typically self-reported by the athletes and usually not verified.